Hidden ClickHouse Lock Contention Nearly Derails Cloudflare Billing Pipeline

URGENT: Cloudflare's billing system narrowly avoids missing its daily deadline after an unexpected database bottleneck

Cloudflare's billing pipeline slowed to a crawl weeks after a major database migration, pushing critical jobs dangerously close to their hard daily deadline. The culprit was not one of the usual suspects, such as I/O or memory, but a previously unknown lock contention issue deep inside ClickHouse's query planning engine.

Source: blog.cloudflare.com

Engineers discovered the bottleneck only after all the standard diagnostic checks (rows scanned, parts read, memory pressure) came back normal. The hidden problem forced Cloudflare to write custom patches to restore performance.

Migration Reveals Invisible Bottleneck

The crisis began after Cloudflare redesigned one of its largest ClickHouse tables to add a column to the partitioning key. The change enabled per-tenant data retention, a feature hundreds of internal teams had requested. "The redesign went through multiple rounds of review, but no one anticipated this kind of internal contention in query planning," said a senior database engineer at Cloudflare who spoke on condition of anonymity.

Days after rollout, the jobs that produce most of Cloudflare's bills began running up against their daily deadline. "We saw no spikes in I/O or memory usage. Everything we normally check looked fine," the engineer added. The team eventually traced the slowdown to lock contention during query planning, the phase in which ClickHouse decides how to execute a query and which usually completes in microseconds.

Background: Petabyte-Scale Analytics at Cloudflare

Cloudflare stores over 100 petabytes of data across dozens of ClickHouse clusters. To simplify onboarding, the company built 'Ready-Analytics' in 2022, a system in which teams stream data into a single massive table instead of designing their own schemas. Each record uses a standard schema with fields such as namespace, indexID, and timestamp. The primary key is (namespace, indexID, timestamp), so rows are sorted by namespace first and each namespace's data is stored together.
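To make that sort order concrete, here is a minimal Python sketch. It is an illustration only, not Cloudflare's code: the Record type, the two helper functions, and the sample values are hypothetical, and only the three key fields come from the description above. It shows that ordering rows by (namespace, indexID, timestamp) puts each namespace's rows in one contiguous run, so a lookup for a single namespace only has to touch that run.

```python
import bisect
from typing import List, NamedTuple, Tuple


class Record(NamedTuple):
    # Field names mirror the schema described above; the type itself is hypothetical.
    namespace: str
    index_id: str
    timestamp: int  # assumed here to be seconds since the epoch


def sort_by_primary_key(records: List[Record]) -> List[Record]:
    """Order rows by (namespace, indexID, timestamp), as the primary key does."""
    return sorted(records, key=lambda r: (r.namespace, r.index_id, r.timestamp))


def namespace_range(sorted_records: List[Record], namespace: str) -> Tuple[int, int]:
    """Return the half-open [start, end) slice that holds one namespace's rows."""
    keys = [r.namespace for r in sorted_records]
    return bisect.bisect_left(keys, namespace), bisect.bisect_right(keys, namespace)


# Made-up sample rows, deliberately out of order.
rows = sort_by_primary_key([
    Record("dns", "zone-b", 1700000300),
    Record("waf", "rule-1", 1700000100),
    Record("dns", "zone-a", 1700000200),
    Record("dns", "zone-a", 1700000100),
])
start, end = namespace_range(rows, "dns")
dns_rows = rows[start:end]  # every "dns" row, already ordered by index and time
```

Because the namespace comes first in the key, a binary search over the sorted rows is enough to find a tenant's data without scanning other tenants' rows.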

By December 2024, Ready-Analytics had grown to more than 2 PiB of data and was ingesting millions of rows per second. However, it had a critical flaw: a single 31-day retention policy that did not suit every team. Some needed to keep data for years for legal reasons; others needed only days. Teams whose needs did not fit had to fall back on a more complex conventional setup.

The Solution: Per-Namespace Retention

Engineers designed a new partitioning scheme that allows an individual retention policy for each namespace. They changed the partitioning key to include a month field derived from the timestamp, so that expired data can be dropped one partition at a time rather than under a single table-wide policy. But the change inadvertently triggered lock contention in ClickHouse's query planning logic when many simultaneous queries targeted different partitions.
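The retention logic this enables can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not Cloudflare's implementation: it assumes each partition is identified by a (namespace, month) pair with the month encoded as a YYYYMM integer, and the function names, namespaces, and retention values are hypothetical. Only the 31-day default comes from the article.

```python
from datetime import date, timedelta
from typing import Dict, Iterable, List, Tuple

DEFAULT_RETENTION_DAYS = 31  # the former table-wide policy mentioned in the article

Partition = Tuple[str, int]  # (namespace, month as YYYYMM): an assumed key shape


def month_bucket(day: date) -> int:
    """Derive the month field from a timestamp's date, e.g. 2025-03-15 -> 202503."""
    return day.year * 100 + day.month


def first_day_after(yyyymm: int) -> date:
    """First calendar day that falls after the given month bucket."""
    year, month = divmod(yyyymm, 100)
    return date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)


def expired_partitions(
    partitions: Iterable[Partition],
    retention_days: Dict[str, int],
    today: date,
) -> List[Partition]:
    """Return partitions whose every row is older than its namespace's cutoff."""
    expired = []
    for namespace, yyyymm in partitions:
        days = retention_days.get(namespace, DEFAULT_RETENTION_DAYS)
        cutoff = today - timedelta(days=days)
        # A whole month can go only once the month has ended before the cutoff.
        if first_day_after(yyyymm) <= cutoff:
            expired.append((namespace, yyyymm))
    return expired


# Hypothetical namespaces: one short-lived, one kept for years, one on the default.
to_drop = expired_partitions(
    [("logs", 202502), ("logs", 202503), ("legal", 202401),
     ("other", 202501), ("other", 202502)],
    retention_days={"logs": 7, "legal": 3 * 365},
    today=date(2025, 3, 15),
)
```

The point of the design is that each entry in to_drop can be removed as a unit, and that two namespaces sharing the same table can expire on entirely different schedules.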

"The contention was subtle because it only appeared under the heavy concurrent load of our billing pipeline," explained another team member. Cloudflare's engineers wrote patches to reduce lock contention during query planning, restoring normal performance. The patches have since been contributed upstream to the ClickHouse open-source project.
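Why such contention only bites under concurrency can be shown with a back-of-the-envelope model in Python. This is a generic illustration of lock scope, not a description of ClickHouse internals or of Cloudflare's patches; the function name, the partition labels, and the 50-microsecond hold time are all assumptions. It compares one shared lock, which serializes every query, with one lock per partition, which serializes only queries that hit the same partition.

```python
from collections import Counter
from typing import Sequence


def planning_time_us(
    query_partitions: Sequence[str],
    hold_us: int,
    per_partition_locks: bool,
) -> int:
    """Estimate when the last of several simultaneous queries finishes planning.

    Each query targets one partition and holds its lock for hold_us microseconds.
    All queries are assumed to arrive at the same instant.
    """
    if not query_partitions:
        return 0
    if per_partition_locks:
        # Only queries on the same partition wait for one another.
        busiest = max(Counter(query_partitions).values())
        return busiest * hold_us
    # A single shared lock makes every query wait for all the others.
    return len(query_partitions) * hold_us


# Eight concurrent queries, each on a different (hypothetical) partition.
queries = [f"tenant-{i}" for i in range(8)]
shared = planning_time_us(queries, hold_us=50, per_partition_locks=False)
split = planning_time_us(queries, hold_us=50, per_partition_locks=True)
```

In this model the shared lock costs eight times as much as the per-partition locks, and the gap grows linearly with the number of concurrent queries. A single query sees no difference at all, which is consistent with a problem that stays invisible until a heavily concurrent workload arrives.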

What This Means

This incident highlights how large-scale database migrations can introduce problems that standard metrics do not reveal. Lock contention in query planning is rarely monitored but can become a critical bottleneck under high concurrency. Cloudflare's experience serves as a cautionary tale for other organizations running ClickHouse at scale, especially those considering multi-tenant partitioning changes.

The discovery and subsequent patches also demonstrate the value of deep ClickHouse expertise and the open-source model, enabling rapid fixes that benefit the wider community. Cloudflare has now updated its monitoring to include lock contention metrics in query planning stages.

For other ClickHouse users: when a migration leads to unexpected slowdowns despite everything looking normal, it may be time to investigate internal locking mechanisms often ignored by standard dashboards.
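As a rough idea of what a lock contention metric measures, the Python sketch below wraps a lock so that it counts how often callers had to wait and for how long. It is a generic illustration, not ClickHouse's or Cloudflare's instrumentation; the InstrumentedLock class and the contended_demo helper are hypothetical names.

```python
import threading
import time


class InstrumentedLock:
    """A mutex that records how often, and for how long, callers had to wait."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._stats = threading.Lock()  # protects the counters below
        self.acquisitions = 0
        self.contended = 0
        self.wait_seconds = 0.0

    def __enter__(self) -> "InstrumentedLock":
        start = time.perf_counter()
        # Try without blocking first; a failure means another thread holds the lock.
        was_contended = not self._lock.acquire(blocking=False)
        if was_contended:
            self._lock.acquire()
        waited = time.perf_counter() - start
        with self._stats:
            self.acquisitions += 1
            if was_contended:
                self.contended += 1
                self.wait_seconds += waited
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self._lock.release()
        return False  # never swallow exceptions


def contended_demo(hold_seconds: float = 0.2) -> InstrumentedLock:
    """Force one contended acquisition so the counters have something to show."""
    lock = InstrumentedLock()
    holding = threading.Event()

    def holder() -> None:
        with lock:
            holding.set()             # signal that the lock is now held
            time.sleep(hold_seconds)  # keep it held while the main thread tries

    worker = threading.Thread(target=holder)
    worker.start()
    holding.wait()
    with lock:  # must wait until the worker releases the lock
        pass
    worker.join()
    return lock


if __name__ == "__main__":
    stats = contended_demo()
    print(stats.acquisitions, stats.contended, round(stats.wait_seconds, 3))
```

Counters like these stay at zero for I/O, memory, and rows scanned, which is why a dashboard built only on those signals can look healthy while queries queue behind a lock.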
