Amazon’s ecosystem holds millions of product listings, each identifiable by a unique Amazon Standard Identification Number (ASIN) and enriched with a product name, price, images, rating, and thousands of customer reviews - all public data that small-business and data-science teams rely on to track market trends and drive online-shopping insights. Yet Amazon’s layered anti-bot protection - combining aggressive rate limits, IP bans, CAPTCHA challenges, and deep browser fingerprinting - renders naïve scripts ineffective. The most effective approach is a hybrid architecture: craft human-like HTTP requests with rotating user-agent strings and residential proxies, parse pages with Beautiful Soup, and employ Kameleo’s anti-detect browser for session priming before switching to the requests library for fast, large-scale extraction.
Below, we detail each phase - from crawling search results and category pages to automating at scale, exporting to CSV, and maintaining the pipeline with CI/CD and monitoring.
1. The Strategic Value of Amazon Product Data
1.1 Unmatched Depth of Public Data
Amazon’s digital storefronts mirror every facet of e-commerce: hundreds of millions of SKUs, each with an ASIN, merchant ID, and seller name, all surfaced in product-page HTML. Each page embeds a rich product description, interactive product images, a dynamic price, a cumulative rating, and user-generated customer reviews - a corpus that powers AI Product Matcher modules and price-and-sentiment dashboards.
1.2 Key Business Use Cases
- Dynamic Pricing: Monitor price shifts across competing merchant IDs in real time.
- Sentiment Analysis: Mine Amazon review pages for star counts and review text to gauge customer engagement and build word-frequency distributions.
- Inventory Forecasting: Track stock indicators and hidden JSON payloads on product detail pages to anticipate availability.
2. Core Scraping Techniques
2.1 Human-Like HTTP Requests with Python
Employ the requests library to send GET requests decorated with full HTTP request headers - rotating user-agent strings, Accept-Language, Referer, and Accept-Encoding - to mimic real browsers.
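A minimal sketch of such a request; the user-agent strings and the ASIN in the URL are placeholder examples:

import random
import requests

# Small pool of realistic desktop user-agent strings (example values).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]

headers = {
    "User-Agent": random.choice(USER_AGENTS),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://www.amazon.com/",
}

product_url = "https://www.amazon.com/dp/B0EXAMPLE"  # placeholder ASIN
resp = requests.get(product_url, headers=headers, timeout=30)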
2.2 Proxy Rotation with Residential Proxies
Rotate proxy servers using a pool of residential proxies, swapping on HTTP 429/503 responses or CAPTCHA triggers to spread traffic across real IPs and avoid bans:
proxies = {"http": "...", "https": "..."}
resp = requests.get(product_url, headers=headers, proxies=proxies)
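To rotate automatically, cycle through the pool and retry on throttling responses - a sketch with placeholder proxy endpoints (swap in your provider’s credentials):

import itertools
import requests

# Hypothetical residential proxy endpoints - placeholders only.
PROXY_POOL = itertools.cycle([
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
])

def fetch_with_rotation(url, headers, max_attempts=5):
    # Retry through the pool, swapping proxies on HTTP 429/503.
    for _ in range(max_attempts):
        proxy = next(PROXY_POOL)
        resp = requests.get(url, headers=headers,
                            proxies={"http": proxy, "https": proxy},
                            timeout=30)
        if resp.status_code not in (429, 503):
            return resp
    raise RuntimeError(f"All attempts throttled for {url}")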
2.3 Parsing with Beautiful Soup
Load response.text into Beautiful Soup, an efficient HTML parser, to extract core fields with the following selectors:
- #productTitle → product name
- .a-price .a-offscreen → product price
- #landingImage → product images
- #acrPopover → product rating
- #feature-bullets → product description
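A parsing sketch using those selectors on the resp fetched above; Amazon’s markup changes frequently, so guard every lookup against missing elements:

from bs4 import BeautifulSoup

soup = BeautifulSoup(resp.text, "html.parser")

def text_of(selector):
    # Return the element's stripped text, or None if the selector misses.
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node else None

record = {
    "name": text_of("#productTitle"),
    "price": text_of(".a-price .a-offscreen"),
    "rating": text_of("#acrPopover"),
    "description": text_of("#feature-bullets"),
}
image = soup.select_one("#landingImage")
record["image_url"] = image["src"] if image else None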
2.4 Harvesting Customer Reviews
Iterate through Amazon review pages by incrementing the pageNumber parameter, capturing reviewer IDs, star ratings, timestamps, and comment text for downstream analysis.
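A pagination sketch reusing fetch_with_rotation and headers from above; the review-page URL shape and the data-hook attributes reflect Amazon’s markup at the time of writing and may change, and B0EXAMPLE is a placeholder ASIN:

from bs4 import BeautifulSoup

reviews = []
for page in range(1, 11):
    url = f"https://www.amazon.com/product-reviews/B0EXAMPLE/?pageNumber={page}"
    resp = fetch_with_rotation(url, headers)
    soup = BeautifulSoup(resp.text, "html.parser")
    for node in soup.select('div[data-hook="review"]'):
        def field(hook):
            el = node.select_one(f'[data-hook="{hook}"]')
            return el.get_text(strip=True) if el else None
        reviews.append({
            "review_id": node.get("id"),
            "stars": field("review-star-rating"),
            "date": field("review-date"),
            "body": field("review-body"),
        })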
3. Hybrid Automation with Kameleo
3.1 Why Kameleo’s Anti-Detect Browser Matters
Scripted HTTP layers cannot spoof fingerprint vectors like canvas, WebGL, fonts, or TLS/JA3; Kameleo’s custom Chroma and Junglefox browser profiles randomize these signals per session, bypassing Amazon’s anti-bot defenses out of the box.
3.2 Programmatic Profile Control via Local API
Use Kameleo’s Local REST API (or SDKs for Python/JS/.NET) to create, configure, start, and stop hundreds of isolated browser profiles - each with its own IP address, device fingerprint, and locale - in seconds.
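A sketch using Kameleo’s official Python package (pip install kameleo.local-api-client); the method names follow the builder pattern in the SDK’s published examples and may differ between SDK versions, so treat this as an outline rather than a definitive implementation:

from kameleo.local_api_client import KameleoLocalApiClient
from kameleo.local_api_client.builder_for_create_profile import BuilderForCreateProfile

client = KameleoLocalApiClient()  # talks to Kameleo.CLI on localhost:5050 by default

# Pick a desktop Chrome fingerprint and build an isolated profile from it.
fingerprints = client.fingerprint.search_fingerprints(
    device_type="desktop", browser_product="chrome")
request = (BuilderForCreateProfile
           .for_fingerprint(fingerprints[0].id)
           .set_recommended_defaults()
           .build())
profile = client.profile.create_profile(request)

client.profile.start_profile(profile.id)  # launch the browser
# ... automate it with Playwright/Selenium, then:
client.profile.stop_profile(profile.id)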
3.3 Headful-to-Headless Workflow
1. Session Priming with a Standard Browser
- Launch a headful browser session using Playwright, Puppeteer, or Selenium.
- Navigate to your target Amazon URL (e.g., a product or search results page).
- Click the “Continue Shopping” (or consent) button to pass Amazon’s in-house anti-bot screen. This action triggers Amazon to issue a session “ticket” cookie (for example, csm-hit or rxc) and attach a valid browser fingerprint.
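A priming sketch with Playwright’s sync API; the URL is a placeholder, and the exact button text varies by marketplace and region:

from playwright.sync_api import sync_playwright

pw = sync_playwright().start()
browser = pw.chromium.launch(headless=False)  # headful priming session
context = browser.new_context()
page = context.new_page()
page.goto("https://www.amazon.com/dp/B0EXAMPLE")  # placeholder ASIN
# Dismiss the interstitial if Amazon serves one.
button = page.get_by_role("button", name="Continue shopping")
if button.count():
    button.click()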

2. Artifact Harvesting
Extract the artifacts you need for headless scraping:
- Cookies: All cookies from the priming session (especially the new “ticket” cookie).
- User-Agent: The exact navigator.userAgent string from the browser context.
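Continuing the same Playwright session, a sketch of the harvest:

# Harvest artifacts from the primed context.
cookies = context.cookies()                              # list of cookie dicts
cookie_jar = {c["name"]: c["value"] for c in cookies}
user_agent = page.evaluate("() => navigator.userAgent")  # exact UA string
browser.close()
pw.stop()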
3. Bulk Fetch with requests
Use the harvested cookies and User-Agent in your HTTP client for high-throughput scraping:
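A sketch reusing cookie_jar and user_agent from the priming step (the URL is a placeholder):

import requests

session = requests.Session()
session.headers["User-Agent"] = user_agent  # exact UA from the priming session
session.cookies.update(cookie_jar)          # includes the "ticket" cookie

resp = session.get("https://www.amazon.com/dp/B0EXAMPLE", timeout=30)
# Continue extracting product and review pages at scale.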
4. When to invoke Kameleo:
If the above Playwright-only flow encounters unexpected blocks or additional challenges, repeat the priming steps inside a Kameleo anti-detect browser:
- Create and start a Kameleo browser profile via the Local API.
- Attach Playwright/Selenium to that profile and click “Continue Shopping” (see the attachment sketch after this list).
- Harvest cookies and the User-Agent from the primed Kameleo profile.
- Switch to requests for bulk fetching, carrying the full fingerprint context.
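A sketch of the attachment step, reusing the profile created in the Local API example above; the ws://localhost:5050/puppeteer/{profileId} endpoint follows the pattern in Kameleo’s documentation for Chromium-based profiles, but verify it against the current docs:

from playwright.sync_api import sync_playwright

pw = sync_playwright().start()
# Attach to the already-running Kameleo profile over CDP.
browser = pw.chromium.connect_over_cdp(
    f"ws://localhost:5050/puppeteer/{profile.id}")
context = browser.contexts[0]
page = context.pages[0] if context.pages else context.new_page()
page.goto("https://www.amazon.com/dp/B0EXAMPLE")  # placeholder ASIN
# ... click "Continue Shopping", then harvest cookies/UA as before.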
Free Tier Available Now
Kameleo is now available to test at no cost. Create an account, configure your first browser profiles, and verify its fingerprint stability and anti-bot bypass in your own environment. Sign up and start your free tier today.
3.4 Automating Behind the Login Wall
If you need to scrape pages behind Amazon’s login wall - whether to switch between multiple seller accounts on different marketplaces or to pull review data from authenticated pages - Kameleo’s reusable browser profiles preserve your login state and browser context, making these workflows reliable and seamless.
With Kameleo, you treat your automation as true browser workflows rather than request spoofing, ensuring reliability and session continuity for complex, authenticated tasks.
For a general overview and practical demo on scraping content behind login walls - covering session management, authentication handling, and common anti-bot considerations - watch the webinar recording.
4. Scaling, Monitoring & Best Practices
4.1 Concurrency & CI/CD Hygiene
Shard workloads by search keyword, category page, or geographic locale; assign each shard its own Kameleo profile and proxy pool to prevent fingerprint collisions. Version CSS/XPath selectors in Git; failing unit tests (e.g., a missing #productTitle) automatically open pull requests to update parsers before scheduled runs.
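A hypothetical pytest-style regression check, run in CI against a saved sample page (fixtures/product_page.html is an assumed fixture path):

from bs4 import BeautifulSoup

SELECTORS = {
    "name": "#productTitle",
    "price": ".a-price .a-offscreen",
    "rating": "#acrPopover",
}

def test_selectors_still_match():
    with open("fixtures/product_page.html", encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    for field, selector in SELECTORS.items():
        assert soup.select_one(selector) is not None, f"{field} selector broke"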
4.2 Observability & Alerts
Log HTTP statuses, proxy health, rate-limit hits, and CAPTCHA-solver latency. Track element-hit rates - sudden drops signal layout changes requiring immediate parser updates.
4.3 Export & Analytics
Compile records into a pandas DataFrame and export to a CSV file for downstream data visualization and data-science workflows. Enrich with an AI Product Matcher to cluster similar SKUs and visualize market trends over time.
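A minimal export sketch, assuming records is the list of dicts accumulated during scraping:

import pandas as pd

df = pd.DataFrame(records)                     # one row per product or review
df.to_csv("amazon_products.csv", index=False)  # hypothetical output filename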
4.4 Ethical Scraping
Respect Amazon’s robots.txt crawl-delay directives and throttle requests to mimic typical online-shopping patterns. Scrape only public data - product overviews, descriptions, images, and customer reviews - and never collect PII. Provide opt-out and data-removal paths for brand-owner requests.
By blending precision HTTP requests, dynamic proxy rotation, robust Beautiful Soup parsing, and Kameleo’s anti-detect browser in hybrid workflows, you can build an enterprise-grade Amazon product scraper that reliably harvests product data, page insights, and customer reviews at scale - while staying under Amazon’s radar and compliant with ethical standards.