Incident Overview
On April 30, 2025, some users encountered intermittent elevated error rates (503 errors) during an unprecedented surge in demand. The service was fully restored in under 12 hours thanks to our team's rapid response. The incident was caused by temporarily operating without request rate limits, despite clear warnings from our co-founder and CTO. The missing limits, combined with significantly increased user load and other contributing factors, led to the disruption.
Root Cause & Next Steps
Dear Kameleo Community,
Today, I'm sharing how our team swiftly identified and resolved the service disruptions on April 30, 2025, when some of you experienced intermittent 503 errors. While the incident posed a challenge, it also highlighted our team’s resilience and commitment to delivering a robust platform.
Timeline of Events
Over the past year, our primary goal has been maintaining four nines (99.99%) uptime - a commitment we've proudly upheld while handling extreme workloads, such as individual customers initiating up to 700 browsers per minute, and simultaneously rolling out critical changes including multi-kernel support, a major API update, centralized cloud-based browser profile storage, and a new pricing model.
Throughout the past year, our primary focus was to support users scaling their web scraping operations significantly. This has been incredibly successful - some users sent requests at rates up to 40 times higher than the per-minute limits set by competing anti-detect browsers. Motivated by our success, I personally made the decision 4 months ago to temporarily remove API rate limits to better understand the full potential of our platform and push our technological boundaries.
My co-founder and CTO provided valuable insights into the potential challenges, and his guidance was truly instrumental. I’m grateful to him and the engineering team for embracing the ambitious goals we set together. During the 4-month period when we operated without API rate limits, our engineering team displayed outstanding capability, continuously monitoring, optimizing, and scaling our infrastructure - both horizontally and vertically. They successfully addressed significant technical challenges, including transitioning from a costly cloud-based metric tracking solution to an in-house Grafana Loki stack due to exponential growth in data volume.
One of the key lessons from these 4 months was that, in the long term, we do need rate limits in place. Our customers generate highly variable load patterns over time, and occasional client-side coding mistakes can easily flood our infrastructure with an overwhelming number of requests. The previously used machine-based rate limiting was inadequate to ensure fairness: it failed to serve both types of scaled users - those running many parallel profiles from a single server, and those running large-scale operations across many smaller servers. That's why we decided to move toward a per-subscription rate limiting model. This approach ensures fair allocation of capacity regardless of infrastructure setup. To support this, we've implemented the necessary changes in the Kameleo.CLI, shipped with the 4.0 release.
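For illustration only - this is not our production code - the sketch below shows the idea behind per-subscription rate limiting: a token bucket keyed by the subscription ID rather than by machine, so the quota follows the customer no matter how many servers they run. The class names and the 1200-per-minute default are illustrative assumptions.

```python
import time


class TokenBucket:
    """Minimal token bucket: `per_minute` tokens refill per minute, up to a full-minute burst."""

    def __init__(self, per_minute: int) -> None:
        self.rate = per_minute / 60.0      # tokens added per second
        self.capacity = float(per_minute)  # maximum burst size
        self.tokens = float(per_minute)    # start full
        self.updated_at = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to the time elapsed since the last check.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated_at) * self.rate)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per subscription: the quota is shared by every machine a customer
# runs, instead of being tied to a single server as machine-based limiting was.
_buckets: dict[str, TokenBucket] = {}


def is_request_allowed(subscription_id: str, per_minute_limit: int = 1200) -> bool:
    bucket = _buckets.setdefault(subscription_id, TokenBucket(per_minute_limit))
    return bucket.allow()
```

A gateway would call `is_request_allowed` once per incoming API call and respond with 429 when it returns False; the key design choice is simply that the bucket key is the subscription, not the machine.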
Detailed Incident Review
On April 30th, our team was wrapping up a successful sprint covering the recent 4.0 release and several infrastructure updates ahead of the Labor Day weekend. Just as we were preparing for some well-deserved rest, our monitoring system alerted us to rapidly increasing request failures. A few minutes later, the issue was also reported through technical support.
The issue was traced to our fingerprint generation component (also known as the Spoofing Server), a core system that typically delivers responses in under 80 milliseconds. During the incident, response times spiked dramatically, Redis memory usage surged unexpectedly, and containers restarted repeatedly. Thanks to our team's rapid response and expertise, we introduced rate limits to keep the service running. The team also migrated our Redis from self-hosted infrastructure to a managed Redis solution, which let us manage request throttling and system stability more effectively during peak load. However, the issue reoccurred a couple of hours later.

Initially, the team identified a memory leak introduced by a recent update of a third-party monitoring tool. While a fix was deployed quickly thanks to our rapid CI/CD processes, the issue returned once more, indicating that the leak was not the root cause. During further investigation, we discovered that a piece of retry logic introduced in Kameleo 4.0 - designed to smooth out traffic spikes once rate limits were in place - had not been properly tested due to a strict development roadmap. The incident emerged about a week after the 4.0 release, when a significant number of high-scale users had already migrated to the new version. While the retry logic was intended to buffer peak load by intelligently reissuing failed requests, it instead flooded our servers with repeated attempts, causing instability and producing nearly 10x the normal load. Only after identifying this issue were we able to isolate and address the true root cause of the incident.
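To make the failure mode concrete: a retry loop without backoff or an attempt cap re-sends failed requests immediately, which multiplies load exactly when the server is already struggling. The generic sketch below shows the pattern such retry logic usually aims for - bounded attempts, exponential backoff, and jitter. It is an illustration under our own assumptions (the retryable status codes, attempt cap, and delays are ours), not the actual Kameleo.CLI implementation.

```python
import random
import time

import requests  # assumed HTTP client; any client exposing status codes works


def call_with_backoff(url: str, max_attempts: int = 5) -> requests.Response:
    """Retry a request with capped exponential backoff and jitter."""
    for attempt in range(max_attempts):
        response = requests.get(url, timeout=10)
        # Only retry on throttling (429) or transient server errors (5xx).
        if response.status_code not in (429, 500, 502, 503, 504):
            return response
        # Exponential backoff (1s, 2s, 4s, ...) capped at 30s, plus random
        # jitter so many clients do not retry in lockstep.
        delay = min(30.0, float(2 ** attempt)) + random.uniform(0.0, 1.0)
        time.sleep(delay)
    # Give up after max_attempts and surface the last response to the caller.
    return response
```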
To prevent further disruption, we temporarily pulled the 4.0 release to stop additional users from updating. We also contacted users who were generating significant load and asked them to roll back to the previous version. Unfortunately, some of them were already away for the Labor Day holiday, and our communication didn't reach them in time. In those cases, we had to enforce the rollback from our side - but after the weekend, we reached a final resolution with each of them.
I'm extremely proud of our team's swift resolution and commitment under pressure, working late into the night to support our users and ensure the issue does not recur. Our CTO and I remained on full standby throughout the weekend, ready to respond immediately if the issue resurfaced. Fortunately, we fully resolved the problem within roughly 10 hours of the first occurrence, and the issue has not reappeared since.
Key Takeaways & Future Directions
We recognize the critical importance of implementing reasonable constraints to maintain our service quality. Effective immediately, we've introduced a high but sensible rate limit of 1200 requests per minute, aimed primarily at preventing abuse and misuse. The limit covers the SearchFingerprints, CreateProfile, and StartProfile API calls. This comfortably supports up to 400 browser initiations per minute - far exceeding typical user requirements while maintaining our market-leading capacity.
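As a rough, unofficial illustration of how a client can stay within this budget, the sketch below paces outgoing calls over a rolling 60-second window. The class name, and the assumption that each browser start costs three rate-limited calls (SearchFingerprints, CreateProfile, StartProfile), are ours for the example and are not part of the Kameleo SDK.

```python
import threading
import time
from collections import deque


class ClientSidePacer:
    """Keep the number of calls in any rolling 60-second window under a budget."""

    def __init__(self, per_minute_budget: int = 1200) -> None:
        self.budget = per_minute_budget
        self.timestamps: deque[float] = deque()
        self.lock = threading.Lock()

    def wait_for_slot(self) -> None:
        while True:
            with self.lock:
                now = time.monotonic()
                # Drop call timestamps older than the 60-second window.
                while self.timestamps and now - self.timestamps[0] > 60.0:
                    self.timestamps.popleft()
                if len(self.timestamps) < self.budget:
                    self.timestamps.append(now)
                    return
                # Wait until the oldest call ages out of the window.
                sleep_for = 60.0 - (now - self.timestamps[0])
            time.sleep(max(sleep_for, 0.05))


# With three rate-limited calls per browser start, a 1200 calls/minute budget
# works out to roughly 400 browser starts per minute.
pacer = ClientSidePacer(per_minute_budget=1200)
# Call pacer.wait_for_slot() before each rate-limited API call.
```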
To directly address the retry-logic flooding issue that emerged after version 4.0, we've released version 4.0.2. This update includes a more robust implementation of retry handling within the Kameleo.CLI, ensuring stability even under high load. We strongly recommend that all users upgrade to 4.0.2 to benefit from these improvements.
Additionally, we realize we could have done more to educate our users on efficient scaling strategies. To address this, we'll enhance our developer documentation with best practices - such as maintaining a browser pool rather than starting a new browser for every request, and reducing the number of SearchFingerprints calls, since each call returns 25 fingerprints that can be used to launch 25 different browsers - thereby lowering the total number of requests. Naturally, users discover these efficiencies over time, but clearer guidance from us will help avoid future overload scenarios.
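To make the fingerprint-reuse idea concrete, here is a minimal sketch. The helper functions are placeholders standing in for the real SearchFingerprints, CreateProfile, and StartProfile calls; their names and signatures are illustrative, not the actual client API.

```python
import itertools

# Placeholder stand-ins for the real SearchFingerprints, CreateProfile, and
# StartProfile calls; replace these with your actual Kameleo client calls.
_counter = itertools.count()


def search_fingerprints() -> list[dict]:
    # The real call returns a batch of fingerprints (25 per call).
    return [{"id": f"fp-{next(_counter)}"} for _ in range(25)]


def create_profile(fingerprint: dict) -> str:
    return f"profile-for-{fingerprint['id']}"


def start_profile(profile_id: str) -> None:
    pass


def start_browsers(count: int) -> list[str]:
    """Launch `count` browsers while reusing each fingerprint batch.

    One SearchFingerprints call can seed up to 25 profiles, so the number of
    fingerprint requests drops to roughly count / 25 instead of one per browser.
    """
    started: list[str] = []
    batch: list[dict] = []
    for _ in range(count):
        if not batch:
            batch = search_fingerprints()  # refill the local batch only when empty
        profile_id = create_profile(batch.pop())
        start_profile(profile_id)
        started.append(profile_id)
    return started
```

The same reuse principle applies to a browser pool: keep started browsers alive and hand them out per task instead of starting and stopping one for every request.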
Finally, we recognize the need for greater transparency in monitoring. We are committed to making our API's performance and uptime metrics publicly available soon. This initiative will clearly demonstrate the consistent operational excellence we have delivered over the past years and our dedication to maintaining this standard moving forward.
We appreciate your continued trust and support as we strive for excellence.
Warm regards,
Tamas Deak, CEO at Kameleo