
How to Bypass Cloudflare Turnstile with Scrapy

Written by
Tamás Deák
Updated on
February 11, 2025

The era of simple web scraping is over. If you're not using advanced methods, you're already behind.

Scrapy is a great open-source web scraping framework. My three favorite things about it:

  • It's simple to write spiders for
  • The framework automatically parallelizes requests, which keeps scraping fast
  • It has wide community support

However, when it comes to data behind an anti-bot system like Cloudflare, DataDome, or Akamai, a simple spider struggles. In this comprehensive guide I'll show how you can integrate Scrapy with an anti-detect browser (which is also a real browser automation tool) to scrape data effectively from indeed.com, a site protected by Cloudflare Turnstile's anti-bot measures.

What is Cloudflare Turnstile? 

Cloudflare Turnstile is a verification tool designed to replace the frustrating experience of CAPTCHAs. Cloudflare Turnstile confirms web visitors are legitimate users and blocks unwanted bots without slowing down web experiences for real users.


When your browser shows the Cloudflare Turnstile widget, a POST request is sent in the background to Cloudflare endpoints with encrypted data in the payload. It is not rocket science to figure out that this payload contains the browser fingerprint. All the settings, browser properties, parameters, and behavior of your browser are used by Cloudflare to decide whether to raise a red flag indicating bot traffic.

More and more modern websites protect their precious data with this sophisticated anti-bot system, and most web-scraping tools have no reliable way to bypass it.

Bypass Cloudflare Turnstile with Headless Browsers

Kameleo is an anti-detect browser and web automation tool that provides a reliable way to bypass all types of anti-bot systems, including Cloudflare Turnstile. Kameleo provides an unlimited number of high-quality browser fingerprints and two custom-built browsers (Chroma and Junglefox) that perfectly mimic human behavior and real browsers while emulating any OS (Windows, macOS, Linux, Android, iOS) and browser (Chrome, Edge, Safari, Firefox). Thanks to its advanced masking technology, websites see it as a real web browser operated by a real user, even though it is controlled with automation tools like Selenium, Puppeteer, or Playwright. Unlike other headless browsers, where the presence of the WebDriver automation framework or other CDP leaks can expose automation and get your bots blocked, Kameleo's robust solution ensures your automated browsers remain undetectable.

However, a headless browser is not always what you are looking for. Since it has to render the target URL, it is harder to parallelize and the process is slower. This is why I'll show you how to integrate it with Scrapy for efficient execution. First, let's look at some performance comparisons.

Performance Measurement

In the following examples I used my Dell XPS 9640 notebook (CPU: Intel Core Ultra 7 155H, 3.8 GHz; RAM: 32 GB). My lab has a very reliable internet connection.

To benchmark the following examples I ran them 100 times each. I removed the 10 fastest and 10 slowest runs, and recorded the average of the remaining data.

Performance of Scrapy

In the first example, I use Scrapy to gather data from the target domain quotes.toscrape.com. The spider works through 10 pages of data in 2.6 seconds, which makes it a very effective technique.
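
For reference, the benchmark spider looked roughly like this minimal sketch (the selectors match quotes.toscrape.com's markup; your exact timings will of course depend on hardware and connection):

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/page/1/"]

    def parse(self, response):
        # Extract the text and author from every quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the "Next" link until all 10 pages have been visited;
        # Scrapy schedules these requests concurrently, which is where the speed comes from
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Save it as quotes_spider.py and run it with scrapy runspider quotes_spider.py -o quotes.json.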

Performance of Playwright

In the second example I use Playwright to scrape the same dataset from quotes.toscrape.com. The headless browser has to render each page, which makes the scraping slower: it takes about 6.4 seconds.
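
The Playwright version of the same benchmark can be sketched like this (a minimal, unoptimized script that renders each page before extracting the same fields):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    quotes = []
    url = "https://quotes.toscrape.com/page/1/"
    while url:
        # Each page is fully rendered before we read anything from it
        page.goto(url)
        for quote in page.query_selector_all("div.quote"):
            quotes.append({
                "text": quote.query_selector("span.text").inner_text(),
                "author": quote.query_selector("small.author").inner_text(),
            })
        # Walk the pagination sequentially via the "Next" link
        next_link = page.query_selector("li.next a")
        href = next_link.get_attribute("href") if next_link else None
        url = f"https://quotes.toscrape.com{href}" if href else None
    browser.close()
    print(f"Scraped {len(quotes)} quotes")
```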

Given these numbers, it is natural to prefer the spider over the headless browser: it is not just faster, it also uses fewer resources.

As mentioned before, headless browsers come in handy when you scrape data from Cloudflare-protected pages or websites with heavy JavaScript. When the data is protected by anti-bot systems that employ advanced browser fingerprinting techniques, the best you can do is use an anti-detect browser. Kameleo provides an undetectable web automation browser. It is not an open-source solution; however, the platform provides unlimited fresh fingerprints and keeps its custom-built browsers (Chroma and Junglefox) constantly updated, so you stay on top of the anti-bot game and bypass Cloudflare's anti-bot measures without wearing yourself out.

In the second part of this demo I'll show you how to get data from the Burger King review page on indeed.com, which is protected by Cloudflare Turnstile.

If you open the page for the first time, it will test your browser with the Cloudflare Turnstile.


If you try to simply scrape it with Scrapy, you won't be successful: the request returns an HTTP 403 error code.
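
You can reproduce the 403 with a bare-bones spider like the one below. The review-page URL is only an illustrative example, and HTTPERROR_ALLOWED_CODES is set so Scrapy hands the blocked response to the callback instead of silently dropping it:

```python
import scrapy


class NaiveIndeedSpider(scrapy.Spider):
    name = "naive_indeed"
    # Illustrative review-page URL; replace it with the exact page you target
    start_urls = ["https://www.indeed.com/cmp/Burger-King/reviews"]
    # Let the 403 response reach the callback instead of being filtered out
    custom_settings = {"HTTPERROR_ALLOWED_CODES": [403]}

    def parse(self, response):
        # Without a valid cf_clearance cookie this typically logs: Status: 403
        self.logger.info("Status: %s", response.status)
```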

Most headless browsers fail to bypass this protection layer. 

According to Pierluigi Vinciguerra from The Web Scraping Club, it is very hard to rely on open-source solutions such as Playwright and Cloudscraper. I couldn't make it work with most open-source tools, including Puppeteer Stealth and Playwright Stealth. When I did find a working alternative like Botasaurus, it stopped working by the time I tried to deploy my code, and Cloudflare blocked my scraper bot.

Kameleo is an anti-detect browser specialized for web scraping. It allows you to bypass CAPTCHA challenges, canvas fingerprinting, and Cloudflare-protected websites with ease. We are constantly testing our custom-built browsers (Chroma and Junglefox) against anti-bot systems. Updates are quickly deployed to ensure you don't need to maintain your code to keep a high success rate. 

Integrate Kameleo with Scrapy to Bypass Cloudflare Turnstile

  1. First, bypass the Cloudflare Turnstile with the Kameleo web automation browser.
  2. Once your browser has captured the cf_clearance cookie, export it. This cookie works as a "pass-through ticket": as long as your requests include it, the website won't stop you for further verification. The good news is that it is often valid for 6-12 months thanks to long expiration dates.
  3. Add the cookie to the spider's requests and they will now return the proper data. Use the cookie for as long as the session is valid. This ensures effective scraping behind Cloudflare's anti-bot protections.

Continue reading, as there are some additional tricks with the user-agent and headers.

Bypass Cloudflare Turnstile with Kameleo 

In the third example we launch Chroma (one of Kameleo's undetectable custom-built browsers) with a fresh fingerprint; it simply bypasses the Cloudflare Turnstile and loads the Burger King review page on indeed.com. Then I export the cf_clearance cookie, which is my "pass" for future Cloudflare challenges. I also print out the user-agent of the browser fingerprint I used, as it will be needed later.
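
The snippet below is a minimal sketch of this step using Playwright's sync API. The Kameleo connection endpoint, the target URL, and the fixed wait for Turnstile are placeholders and assumptions on my part; consult the Kameleo documentation for the exact way to start a Chroma profile and obtain its connection endpoint:

```python
from playwright.sync_api import sync_playwright

# Placeholders / assumptions: a Kameleo Chroma profile is already running locally
# and exposes a CDP-compatible endpoint that Playwright can attach to. Check the
# Kameleo docs for how to start the profile and where to read this endpoint from.
KAMELEO_ENDPOINT = "<kameleo-cdp-or-playwright-endpoint>"
REVIEW_URL = "https://www.indeed.com/cmp/Burger-King/reviews"  # example target

with sync_playwright() as p:
    # Attach to the already-running Kameleo browser instead of launching Chromium
    browser = p.chromium.connect_over_cdp(KAMELEO_ENDPOINT)
    context = browser.contexts[0]
    page = context.pages[0] if context.pages else context.new_page()

    page.goto(REVIEW_URL)
    # Naive fixed wait to give Turnstile time to clear; a more robust script
    # would poll for the cookie instead
    page.wait_for_timeout(10000)

    # Export the cf_clearance cookie and the user-agent for the Scrapy step
    cf_cookie = next((c for c in context.cookies() if c["name"] == "cf_clearance"), None)
    print("cf_clearance:", cf_cookie["value"] if cf_cookie else "not found")
    print("user-agent:", page.evaluate("navigator.userAgent"))
```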

Note that while you are testing, you may occasionally need to tick the checkbox manually. This is most likely to happen when you are making many requests from the same IP address; set up a proxy for Kameleo in this case.

Add cookies to Scrapy Requests to Bypass Cloudflare Turnstile 

In the last example I add the cf_clearance cookie to the Scrapy requests. 

Note that I also need to set up the same user-agent for Scrapy that I used with Kameleo when I was getting the cf_clearance cookie.
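
Putting it together, a hedged sketch of the final spider could look like this. The user-agent string, the cookie value, and the target URL are placeholders you copy from your own Kameleo session, and the parse callback is deliberately minimal:

```python
import scrapy


class IndeedReviewsSpider(scrapy.Spider):
    name = "indeed_reviews"
    # Placeholders: copy these values from your own Kameleo session
    start_urls = ["https://www.indeed.com/cmp/Burger-King/reviews"]
    cf_clearance = "<cf_clearance value exported from Kameleo>"
    custom_settings = {
        # Must match the fingerprint that solved the Turnstile challenge
        "USER_AGENT": "<user-agent printed by the Kameleo script>",
        # On some sites you may also need to mirror the accept-language header:
        # "DEFAULT_REQUEST_HEADERS": {"accept-language": "en-US,en;q=0.9"},
    }

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                cookies={"cf_clearance": self.cf_clearance},
                callback=self.parse,
            )

    def parse(self, response):
        # With a valid cookie this should now be 200 instead of 403
        self.logger.info("Status: %s", response.status)
        # The selector is only illustrative; inspect the live page for the real markup
        yield {"title": response.css("title::text").get()}
```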

When your requests carry the cf_clearance cookie, they are trusted. Even though Scrapy sends its requests through its own Python networking stack (Twisted + OpenSSL), which produces a TLS fingerprint different from common web browsers like Chrome, Firefox, or Edge, the requests succeed because Cloudflare now assumes Scrapy is the same browser that previously solved the challenge, so it no longer checks the HTTP headers and client hints as strictly.

Note that in some cases, on some websites, you also have to use the same accept-language header that you used in Kameleo.

Conclusion

By integrating Scrapy with our anti-detect browser, we’ve unlocked the perfect balance between stealth and speed in web scraping. Instead of relying on a fully headless browser, which is slow and resource-intensive, we’ve strategically combined the strengths of an anti-detect browser with Scrapy’s high-performance architecture.

  • Kameleo ensures that Cloudflare Turnstile (or any other anti-bot system) perceives our requests as legitimate by providing undetectable browser fingerprints and a valid cf_clearance cookie.
  • Once this cookie is obtained, Scrapy takes over, utilizing its fast, asynchronous architecture to scrape at scale, avoiding the need to render pages in a browser.
  • Cloudflare no longer applies strict fingerprinting, meaning that Scrapy’s inherently different TLS stack does not trigger additional verification.
  • Instead of handling every request inside a full-fledged browser, we only use Kameleo once per session, reducing overhead and allowing us to scrape data at speeds that would be impossible with a purely browser-based approach.

Navigating Browser Fingerprinting to Overcome Advanced Bot Detection Systems

Bot detection systems that can block your web scraping processes are evolving every day. It is essential to find a reliable solution like Kameleo that can perfectly mimic a non-automated, human-operated real browser, so you can bypass protection layers, invisible CAPTCHAs, and the Cloudflare Turnstile without the hassle of maintaining browser fingerprint integrity yourself. The choice is yours:

  • Utilize Kameleo for web-scraping with a high number of parallel automated browsers, or 
  • use it to bypass the most difficult anti-bot systems and then use the cf_clearance cookie in your spiders. 