Resolved
The load spikes are resolved and the system is operating normally.
Monitoring
We're still investigating the spikes in query patterns that have been impacting ClickHouse. We've adjusted retry behavior in our applications, which has alleviated some of the impact. We will continue monitoring until we have a clearer picture.
Investigating
We're still seeing intermittent spikes and query failures, and we are continuing to investigate the root cause. We're monitoring closely, and engineers from multiple teams are working together to stabilize performance.
Investigating
Problem: A sharp spike in query volume caused a surge in failed queries and high load on database hosts, leading to errors when loading dashboards and running queries.
Impact: Some customers in the US region could not load dashboards or run queries. Event ingestion lag also built up during the incident.
Cause: Still under investigation.
Steps to resolve: We restarted the affected service, which restored query and dashboard functionality. System metrics show recovery in progress, and we are continuing to monitor and investigate.
Investigating
We’re investigating reports of some queries and dashboards failing to load.
Event ingestion lag is also accumulating. Operators are currently investigating the root cause.
Investigating
We’re investigating reports of some queries and dashboards failing to load.