Resolved
All events have been processed and the cluster is operating normally.
Monitoring
We are still processing the ingestion queue and expect to be fully caught up in about 2 hours.
Monitoring
We have identified the root cause of the ingestion lag and cluster overload, and have resolved the issue.
We have now resumed ingestion and are working through the backlog of delayed events.
Identified
During routine maintenance, a shard entered a degraded performance state, causing us to fall behind on data ingestion. We are working to remedy the issue and will report back as soon as a fix is in place.
Identified
EU event ingestion experienced delays due to elevated part counts in ClickHouse. The high part count caused some insert rejections, leading to Kafka consumer lag in event processing. Replication queues have been restarted and merge backlogs are draining. Part counts are returning to normal.
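For context on this failure mode: active part counts can be watched directly via ClickHouse's system.parts table. The following is a minimal monitoring sketch, assuming the clickhouse-driver Python client and a locally reachable server (connection details are illustrative, not our actual setup):

    from clickhouse_driver import Client

    # Hypothetical connection details, for illustration only.
    client = Client(host="localhost")

    # Count active data parts per table; sustained high counts are
    # what lead ClickHouse to reject inserts with "too many parts".
    rows = client.execute(
        "SELECT database, table, count() AS active_parts "
        "FROM system.parts WHERE active "
        "GROUP BY database, table "
        "ORDER BY active_parts DESC LIMIT 10"
    )
    for database, table, parts in rows:
        print(f"{database}.{table}: {parts} active parts")

Once merges drain the backlog, the per-table counts reported by a query like this should trend back down, which is the signal referenced above.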
Investigating
We’ve identified processing delays in the event ingestion pipeline. Events may take longer than usual to appear in the product. Data is not lost but may not show in PostHog apps and queries until the processing delay is resolved.