
Post-Mortem: What Caused Our Service Disruption on April 30, 2025

Written by
Tamas Deak
CEO, co-founder
Updated on
May 1, 2025

Incident Summary

On April 30, 2025, users experienced elevated error rates (HTTP 503) due to unexpected performance degradation. The incident occurred in two episodes of roughly two hours each, both resolved swiftly. It happened because, at my request, we allowed our users to operate without request rate limits, despite clear warnings from our co-founder and CTO. This lack of limits, combined with significantly increased user load and other contributing factors, led to the incident.

Full Post-Mortem and Apology

Dear Kameleo Community,

Today, I want to personally address the unfortunate incident that occurred on April 30, 2025, when many of you experienced intermittent 503 errors. First and foremost, I take full responsibility for this incident and sincerely apologize for any disruption caused to your operations.

What Happened?

Over the past year, our primary goal has been maintaining four nines (99.99%) uptime - a commitment we've proudly upheld while handling extreme workloads, such as individual customers initiating up to 700 browsers per minute, and while rolling out critical changes including multi-kernel support, a major API update, centralized cloud-based browser profile storage, and a new pricing model.

At the same time, our primary focus was supporting users as they significantly scale their web scraping operations. This has been incredibly successful - some users sent requests at rates up to 40 times higher than the per-minute limits set by competing anti-detect browsers. Motivated by this success, I personally made the decision four months ago to temporarily remove API rate limits to better understand the full potential of our platform and push our technological boundaries.

My co-founder and CTO warned me of the inherent risks involved, and he was right. I want to personally apologize to him and the engineering team for the ambitious challenges I set forth. Despite these risks, our engineering team displayed outstanding capability, continuously monitoring, optimizing, and scaling our infrastructure - both horizontally and vertically. They solved significant technical challenges, including shifting from a costly cloud-based metric tracking solution to an in-house Grafana Loki stack due to exponential growth in data volume.

The Incident Details

On April 30th, our team was wrapping up a successful sprint involving our recent 4.0 release and several infrastructure updates ahead of the Labor Day weekend. Just as we prepared for some well-deserved rest, our monitoring system alerted us to rapidly increasing request failures. A few minutes later, reports also began arriving through technical support.

The issue was traced to our fingerprint generation component (a.k.a. the Spoofing Server), a core system that typically delivers responses in under 80 milliseconds. During the incident, response times spiked dramatically, Redis memory usage surged unexpectedly, and containers restarted repeatedly. Our engineering team introduced rate limits to restore smooth service. However, the issue recurred a couple of hours later. Within 30 minutes, the team identified the root cause - a memory leak introduced by a recent update to a third-party monitoring tool. Thanks to our rapid CI/CD processes, a fix was quickly deployed, restoring normal service.
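For context, stopgap limits of this kind are often built on a token-bucket scheme. The sketch below is purely illustrative - it is not our production code, and the rate and burst values are placeholder assumptions.

```python
import threading
import time


class TokenBucket:
    """Minimal token-bucket limiter: allows `rate` requests per second
    with bursts up to `capacity`. Illustrative only, not Kameleo's code."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate                  # tokens refilled per second
        self.capacity = capacity          # maximum burst size
        self.tokens = float(capacity)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True               # request may proceed
            return False                  # caller should be answered with 429/503


# Example: cap a single caller at 20 requests per second (1200 per minute).
limiter = TokenBucket(rate=20.0, capacity=40)
```

Per-caller buckets like this let a service shed excess load early instead of degrading for every user at once.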

I'm extremely proud of our team’s swift resolution, commitment under pressure, and dedication in working late into the night to support our users and ensure the issue does not recur. Our CTO and I remain personally available over the weekend to address any further issues promptly.

Lessons Learned & Moving Forward

We recognize the critical importance of implementing reasonable constraints to maintain our service quality. Effective immediately, we've introduced a high but sensible rate limit of 1200 requests per minute, aimed primarily at preventing abuse and misuse. The limit applies specifically to the SearchFingerprints, CreateProfile, and StartProfile API calls. It comfortably supports up to 400 browser initiations per minute, far exceeding typical user requirements and maintaining our market-leading capacity.
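Staying within this limit is straightforward with a little client-side pacing. The sketch below is only an illustration - the base URL, path handling, and backoff values are assumptions for the example, so adapt them to your own integration rather than treating this as an official client.

```python
import time

import requests

BASE_URL = "http://localhost:5050"     # assumed local API address; adjust to your setup
MIN_INTERVAL = 60.0 / 1200             # 1200 requests/minute => at least 50 ms between calls

_last_call = 0.0


def throttled_post(path, payload=None):
    """Space out SearchFingerprints/CreateProfile/StartProfile calls so their
    combined rate stays under the limit, and back off briefly if throttled."""
    global _last_call
    wait = MIN_INTERVAL - (time.monotonic() - _last_call)
    if wait > 0:
        time.sleep(wait)
    _last_call = time.monotonic()

    response = requests.post(f"{BASE_URL}{path}", json=payload, timeout=30)
    if response.status_code in (429, 503):
        time.sleep(2)                  # simple backoff; tune for your workload
        response = requests.post(f"{BASE_URL}{path}", json=payload, timeout=30)
    return response
```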

Additionally, we realize we could have better educated our users on efficient scaling strategies. To rectify this, we'll enhance our developer documentation to recommend best practices - such as maintaining a browser pool rather than initiating a new browser per request. Naturally, users discover these efficiencies over time, but clearer guidance from us will help avoid future overload scenarios.
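To illustrate the browser-pool pattern, here is a minimal sketch. The start_browser and stop_browser callables are placeholders for whatever profile creation, start, and stop calls your integration already makes - this shows the pattern, not a drop-in client.

```python
import queue


class BrowserPool:
    """Reuse a fixed set of started browsers instead of creating and
    starting a new one for every request."""

    def __init__(self, size, start_browser, stop_browser):
        self._stop_browser = stop_browser
        self._browsers = queue.Queue(maxsize=size)
        for _ in range(size):
            self._browsers.put(start_browser())   # started once, reused many times

    def acquire(self):
        return self._browsers.get()               # blocks until a browser is free

    def release(self, browser):
        self._browsers.put(browser)

    def close(self):
        while not self._browsers.empty():
            self._stop_browser(self._browsers.get())
```

A scraper then calls acquire() before each job and release() afterwards, so a handful of long-lived browsers serve many requests instead of repeatedly hitting CreateProfile and StartProfile.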

Finally, we recognize the need for greater transparency in monitoring. We are committed to making our API's performance and uptime figures publicly available soon. This initiative will clearly demonstrate our consistent operational excellence over the past years and our dedication to maintaining this standard moving forward.

Thank you for your understanding, support, and continued trust in Kameleo.

Warm regards,

Tamas Deak, CEO at Kameleo
