Cracking the CAPTCHA Code: Essential Tips for Bot Developers and Data Scrapers

Alex Rodriguez
Are you looking for the best ways to bypass CAPTCHAs while scraping data from the web? Then this article has been written for you. You will learn how to deal with CAPTCHAs so they do not stop your scraping and botting projects.

CAPTCHA, the Completely Automated Public Turing test to tell Computers and Humans Apart, is a test for distinguishing real web users from automated users, also known as bots. It was introduced to protect web servers against bot attacks and spam, and it has been quite helpful in keeping the web safer. However, while CAPTCHAs are beneficial to website owners and administrators, they are bad for bot users and web scrapers.

If you are a web scraper or engage in any kind of botting, just know that CAPTCHAs were invented because of bots like yours. And if a website uses one for protection, you will need to bypass it in order to access the data you want. There are a few levels at which you can deal with them: you can avoid triggering CAPTCHAs at all, which is the most cost-effective option; you can bypass or solve them when they appear; and you can watch your activities after solving one so that another does not show up.



CAPTCHA — An Overview

With CAPTCHAs, a website puts a challenge to web users to see whether they can solve it. The assumption is that the challenge will be easy for humans to solve but difficult for automated users such as bots. The term was coined in 2003, even though the first tool matching this definition appeared in 1997. The earlier versions of CAPTCHAs were easier to solve, as they were simply sequences of letters and/or numbers in a distorted image.

Using Computer Vision (CV) technology, one can read what is written on such images and guess the content. CAPTCHAs have become complicated and complex nowadays, with some not even appearing at first; they just monitor your activities and spring up out of nowhere, in an unpredictable manner, if your activities become suspicious. Modern CAPTCHAs fall into three categories: text-based, image-based, and audio-based. Each of these has its own way of being solved and bypassed.


How to Bypass CAPTCHAs

Knowing how to bypass CAPTCHAs is very important to you as a web bot developer, especially if your target sites have a CAPTCHA service installed. In this section of the article, I will take a look at how to do that. There are basically two methods of bypassing CAPTCHAs: you either make use of a manual CAPTCHA solver or use an algorithm to solve them. Let's take a look at each of these methods below.

reCAPTCHA and the other types of complex CAPTCHAs have become the tools of choice for most websites. For some of these CAPTCHAs, there is no way you can solve them using algorithms or Computer Vision technology. What you do is delegate the solving of these CAPTCHAs to third-party services known as anti-captcha services. What you may not know is that all of these services actually hire manual labor to solve the CAPTCHAs. They employ cheap labor in poorer third-world countries to solve these CAPTCHAs and take a cut of the money they make. One of the most popular services that operate this way is 2Captcha. Others include DeathByCaptcha, Anti-Captcha, and Antcpt. Let's take a look at one of these in detail.

2Captcha — the Service for Solving CAPTCHAs Manually

The 2Captcha service offers CAPTCHA-solving services to web automation developers. It employs some of the best human workers, with over a 90% accuracy level, to help you solve CAPTCHAs. The CAPTCHAs it can help you solve include reCAPTCHA v2, v2 callback, v2 invisible, v3, and Enterprise. It can also be used to solve hCaptcha challenges. With this service, pricing is per 1,000 CAPTCHAs solved and varies depending on the type of CAPTCHA you want solved.

But generally, it can be said to be affordable. All you need to do to use this service is register an account, fund it, and integrate its API. They provide API clients for Python, PHP, Java, C#, Go, and Ruby.
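To give you an idea of the integration, here is a minimal sketch using 2Captcha's Python client; the API key, site key, and page URL are placeholders you would swap for your own values.

```python
# pip install 2captcha-python
from twocaptcha import TwoCaptcha

solver = TwoCaptcha('YOUR_2CAPTCHA_API_KEY')  # placeholder API key

# Send a reCAPTCHA v2 challenge to 2Captcha's human workers; the sitekey
# is the data-sitekey value embedded in the target page's HTML.
result = solver.recaptcha(
    sitekey='SITE_KEY_FROM_TARGET_PAGE',          # placeholder
    url='https://example.com/page-with-captcha',  # placeholder
)

# The returned token is what you submit in the g-recaptcha-response
# field of the form on the target page.
print(result['code'])
```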

If the CAPTCHAs you are dealing with are the simpler kind, you might not need a manual CAPTCHA solver at all. The cost of solving CAPTCHAs might look small on paper, but it scales up quickly if you have to deal with them a lot, especially on bigger projects. Why not just have a script that does it for you, without paying every time a CAPTCHA is solved?

There are a good number of CAPTCHA-solving scripts that make use of Machine Learning to guess the content of a CAPTCHA challenge. This only works for some of the early CAPTCHAs, which were not thoroughly evaluated when they were designed. Some just display text over an image; if the characters are not distorted enough, they can be guessed. If you have skills in Computer Vision programming, you can use them to recognize the characters in an image and then solve the CAPTCHA with the result.

This can be done in any programming language of your choice, provided it has a computer vision library. Python seems to be the most popular choice for this, though.
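As a rough illustration of the computer vision approach, here is a minimal sketch using the Tesseract OCR engine through the pytesseract library; it assumes a simple, lightly distorted text CAPTCHA saved as captcha.png and will fail against anything modern.

```python
# pip install pytesseract pillow  (the Tesseract engine itself must also be installed)
import pytesseract
from PIL import Image, ImageFilter

# Load the CAPTCHA image and clean it up before OCR: grayscale,
# threshold away the background, and smooth out pixel noise.
image = Image.open('captcha.png').convert('L')
image = image.point(lambda p: 255 if p > 140 else 0)
image = image.filter(ImageFilter.MedianFilter(3))

# Ask Tesseract to guess the remaining characters.
text = pytesseract.image_to_string(image).strip()
print(f'Guessed CAPTCHA text: {text}')
```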


How to Prevent CAPTCHAs from Appearing

If you ask me for a tip on bypassing CAPTCHAs, I will suggest you do not even allow them to appear in the first place, if possible. It is suspicious activity that makes them appear, and even when you solve one CAPTCHA, another will show up in no time. If you are a developer or a web scraper, it pays to avoid getting CAPTCHAs, as most of them will not be solvable with a script. You need services like 2Captcha to solve them, and that will cost you a lot on bigger projects. In this section of the article, I will show you the steps to avoid getting CAPTCHAs in the first place.

There are no two ways about it: you need proxies for web scraping. And you are better off making use of residential proxies. For some sites, you could use datacenter proxies, but even the best datacenter proxy network with rotating capability is risky and suspicious because of the origin of its IPs. With residential proxies, you are covered. This is because residential proxies use residential IPs, the same IP addresses assigned to real Internet users. In fact, most providers do not even own these IPs themselves.

They have a pool of devices owned by third parties and route requests through them, thereby using those devices' resources and IP addresses. This is the most authentic setup you can get, and you can be sure of your tracks being hidden. There are a good number of providers that offer these kinds of proxies, but not all of them are good, perform excellently, and are affordable. As a recommendation, I have used Bright Data, Smartproxy, and Soax, and these 3 have performed excellently well.
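In practice, a provider gives you a gateway endpoint plus credentials, and every request routed through it exits from a different residential IP. Here is a minimal sketch with Python Requests; the gateway host, port, and credentials are hypothetical stand-ins for whatever your provider issues.

```python
import requests

# Hypothetical rotating-gateway endpoint; substitute the host, port,
# username, and password from your residential proxy provider.
PROXY = 'http://USERNAME:PASSWORD@residential-gateway.example.com:7777'
proxies = {'http': PROXY, 'https': PROXY}

# Hitting an IP-echo service a few times shows the exit IP rotating.
for _ in range(3):
    response = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=30)
    print(response.json())
```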

There is a mistake I see beginners make any time I check the source code of their bots: it screams "I am a bot, ban me." Most do not set request headers for their bots. They make use of the default headers set by the HTTP request library they use. The problem with this approach is that any anti-spam system could detect it, regardless of whether you are using a proxy or not. Each HTTP library has a user agent string that tells a web server about it.

Take, for instance, Python Requests; its user agent string is "python-requests/x.y.z", which tells a web server that it is a bot using the Python Requests library, complete with its version number. You need to change this to the user agent string of a popular browser or an acceptable search engine crawler such as Googlebot. Check out this page for a list of popular user agents.

The user agent header is the most important one to send, but it is not the only one. There are others you need to send. The best way to know which ones is to visit your target website, use the developer tools in your browser to see the request headers your browser sends, and set those. I recommend you keep a bunch of user agents and rotate them to mix things up, as shown in the sketch below. One more thing you need to do is make sure the website sends the same version of a webpage to all of the user agents you set. If not, your code will break or misbehave, and you will find it difficult to identify the cause.
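Here is a minimal sketch of that rotation with Python Requests; the user agent strings are examples of real browser values you would refresh from a current list, and the extra headers mirror what a typical browser sends.

```python
import random
import requests

# A small pool of real-browser user agent strings; keep these current
# by copying them from your own browser or a published list.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/17.1 Safari/605.1.15',
]

headers = {
    'User-Agent': random.choice(USER_AGENTS),
    # Mirror the companion headers your real browser sends to the target.
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
}

response = requests.get('https://example.com', headers=headers, timeout=30)
print(response.status_code)
```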

One of the ethical recommendations for web scraping is to be nice, which means setting delays between requests to avoid overwhelming your target website. For most people, this advice only applies to small websites: unless you have a very large project, you cannot bring down a popular site (and if you can, you will not be reading this). In this article, I recommend you set delays not just for ethical reasons but also to appear more human.

Sending too many requests will make your target more suspicious of your activities, even with other evasion methods in use. Set delays between requests, and when you do, make sure the delay is random. There is no point in setting delays if the period between requests is predictable. Make it random so that you leave no footprint for suspicion.
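A randomized delay takes only a couple of lines. The sketch below uses Python's random.uniform between requests, with placeholder URLs and bounds you would tune to your target.

```python
import random
import time
import requests

urls = [f'https://example.com/page/{n}' for n in range(1, 6)]  # placeholder URLs

for url in urls:
    response = requests.get(url, timeout=30)
    print(url, response.status_code)
    # Sleep a random duration so the gap between requests is unpredictable.
    time.sleep(random.uniform(2.0, 7.0))
```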

The easiest web pages to scrape are those open to the public that do not require you to log in or maintain a session. If your target is a site that requires you to stay logged in, you will find it more difficult to deal with. To an amateur, logging in for every request is an option. But an experienced person knows this is a perfect reason for suspicion: humans do not log in 10 times in 2 minutes. Instead, humans log in once, and their browser sends cookies with every subsequent request. You need to set up a system where you capture the cookies when available and then keep reusing them.

For some websites, logging in is not even compulsory. Even so, capturing cookies and adding them to your future requests will keep the CAPTCHAs away. However, you need to be smart about this, because cookies are also used for user identification and, as such, could also be used to block you. You can create a bunch of cookie sets and user agents and then rotate them.
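With Python Requests, a Session object does the capturing and replaying for you: it stores whatever Set-Cookie headers come back and attaches them to later requests. The login endpoint and form fields below are hypothetical.

```python
import requests

session = requests.Session()

# Log in once; the session captures the cookies from the response.
# The endpoint and field names are placeholders for your target's form.
session.post('https://example.com/login',
             data={'username': 'me', 'password': 'secret'})

# Subsequent requests automatically carry the stored cookies, just like
# a browser would, instead of logging in again each time.
for page in range(1, 4):
    response = session.get(f'https://example.com/data?page={page}')
    print(page, response.status_code)
```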

One easy way to tell a bot from a human is the predictable nature of the bot. Humans don’t just access a page and leave. They scroll and even click on some UI elements. Some websites will put up a CAPTCHA challenge if you behave like a bot. And the only way to behave like a human is to make use of a browser automation tool. Some of the best options available now include Selenium, Puppeteer, and Playwright.

Using these tools, a real browser is used to access a page, JavaScript and all, and you can even use them to type details into a form before submission. This looks more natural and protects you against CAPTCHAs appearing. However, using Selenium or any of its alternatives can also be counterproductive on some websites. Some use browser fingerprinting and can easily detect bots with it. If you are dealing with a target like this, you need an antidetect browser with support for automation. Multilogin, Incogniton, and GoLogin are the best in this regard.
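As a simple illustration of human-like behavior, here is a sketch using Selenium (version 4+, which fetches its own driver); it loads a placeholder page and scrolls down in small, randomly timed steps instead of grabbing the HTML and leaving instantly.

```python
# pip install selenium
import random
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://example.com')  # placeholder target

# Scroll the page in small, randomly timed increments, the way a
# human skimming the content would.
for _ in range(5):
    driver.execute_script('window.scrollBy(0, arguments[0]);',
                          random.randint(300, 700))
    time.sleep(random.uniform(0.5, 2.0))

html = driver.page_source  # fully rendered page, JavaScript included
driver.quit()
```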

If you have tried all of the above and still have to deal with CAPTCHAs, it is time to use a web scraping API. With a web scraping API, you do not need to worry about proxies, CAPTCHAs, blocks, or even handling headless browsers. The web scraping API handles all of that for you. Some of the best web scraping APIs include ScraperAPI, ScrapingBee, and WebScrapingAPI. All you need to do is send an API request, and you get the content of a web page as the response, with no need to handle blocks or set up your own proxies.

If your target site is a popular one, you can even get a scraping API that returns a JSON response, which makes your work easier in terms of parsing. However, the one major problem you will face with a web scraping API is that pricing is based on the number of requests, and if your target is a difficult-to-access site, the configuration you need can consume a lot of credits, which could jack up the cost of the scraping project.
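Integration is typically a single GET request. The sketch below follows ScraperAPI's basic pattern of passing your key and the target URL as query parameters (check the provider's docs for the exact parameters); the API key and target page are placeholders.

```python
import requests

params = {
    'api_key': 'YOUR_API_KEY',                 # placeholder
    'url': 'https://example.com/product/123',  # placeholder target page
}

# The API fetches the target through its own proxies and anti-bot
# handling, then returns the page content as the response body.
response = requests.get('http://api.scraperapi.com/', params=params, timeout=60)
print(response.status_code)
print(response.text[:500])
```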


FAQs

Q. Is It Legal to Bypass CAPTCHAs?

CAPTCHAs are a way for websites to block bot traffic, and bypassing them might look illegal to some. The reality is that bypassing CAPTCHAs is not in itself illegal. It might interest you to know that even top businesses that protect their own websites from bots still use bots on other targets. Just make sure you do nothing illegal once you have bypassed a CAPTCHA, and you are fine.

Q. Is Rotating IPs Enough to Bypass CAPTCHAs?

To some, rotating IPs is the only trick needed to bypass anti-spam systems. It might interest you to know that such tactics are outdated. In the past, IP tracking was the way websites detected and blocked bots. Nowadays, they have become more sophisticated and use an array of techniques. Aside from your IP address, websites now use cookies, browser fingerprinting, and behavior analysis to detect and block bots. That is why you need to combine a good number of strategies to evade suspicion and detection successfully.

Q. Are CAPTCHAs Easy to Bypass?

CAPTCHAs are not easy to bypass on the modern web. They used to be, before the technology evolved into how complicated it is now. Currently, you would most likely need the service of human solvers to deal with them when they appear. That is why the best thing for you is to prevent them from occurring in the first place.


Conclusion

CAPTCHAs are here to stay as far as the web is concerned, and they are not going away anytime soon. Even though other technologies have been developed to combat spam, their cost-effectiveness, coupled with how well they work, has made CAPTCHAs the tool of choice for many website administrators. For this reason, you are better off learning how to bypass them, as you will have to deal with them now and in the future if you are a web bot developer.
