Resolved
Event ingestion processing has fully caught up on the accumulated lag, and all systems are fully operational again.
Monitoring
Ingestion throughput has recovered and the system is currently working through the backlog.
Current throughput estimates suggest we will be fully caught up within the next 3-4 hours.
Operators will be monitoring the system as it fully recovers.
Investigating
The underlying compute for our queueing system is being upgraded to handle the increased event throughput the system is currently experiencing. We expect this process to take another two hours. Ingestion performance will remain degraded while the upgrade is completed.
No data has been lost; our systems are running 30-40 minutes behind.
Investigating
Latency for producing to and consuming from the ingestion event queue has increased, resulting in degraded throughput in the ingestion pipeline. Lag is accumulating. Operators are investigating the cause. There is no data loss.