Resolved
We have processed all events since the start of the incident and all systems are fully operational.
All services are now reading from both the old and new persons tables, meaning that all person data is available. We are in the process of backfilling all previous person data into the new table; this is housekeeping and should have no customer impact.
Expect a small maintenance window in the following week, likely around 5 minutes, as we consolidate the system to use only our newly migrated tables.
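For the curious, the dual-read arrangement described above can be sketched roughly as follows. This is a simplified illustration, not our actual implementation; the table names (persons_new, persons_old) and helper functions are hypothetical.

```python
# Sketch of a dual-read during a table migration: prefer the new table,
# fall back to the old one while the backfill is still in flight.
# All names here are illustrative, not real schema.

def fetch_person_new(person_id, store):
    return store["persons_new"].get(person_id)

def fetch_person_old(person_id, store):
    return store["persons_old"].get(person_id)

def get_person(person_id, store):
    # Recently written rows land in the new table; older rows may only
    # exist in the old table until the backfill completes, so reading
    # both keeps all person data visible throughout the migration.
    return fetch_person_new(person_id, store) or fetch_person_old(person_id, store)

store = {
    "persons_new": {"p2": {"id": "p2", "email": "new@example.com"}},
    "persons_old": {"p1": {"id": "p1", "email": "old@example.com"}},
}
```

Once the backfill is verified complete, the fallback read can be dropped, which is what the upcoming maintenance window covers.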
Monitoring
Events and persons ingestion delays have been resolved. Efforts are ongoing to complete full remediation for the core database issue.
Monitoring
A large majority of our partitions are fully caught up. Event processing delays are now impacting only a small fraction of overall ingested events.
We still have a few hot partitions to churn through before this is fully resolved, so please bear with us.
The team will continue monitoring the system until the final few hot partitions are fully processed.
Monitoring
Ingestion is healthy and we are processing the tail end of the accumulated lag.
Because the final lag is not evenly distributed across partitions, the system can't process it at peak consumption rates. This makes it hard to estimate when we'll be fully caught up across all partitions for all customers, but we hope to be there within the hour.
We will send another update if we haven't cleared the last of the hot partitions by then.
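To illustrate why skewed lag makes the estimate hard: each partition drains at a bounded rate, so the overall catch-up time is driven by the single worst partition, not the average. A toy calculation (the numbers are made up for illustration):

```python
# Illustrative only: when leftover lag is concentrated in a few "hot"
# partitions, catch-up finishes when the most-lagged partition finishes,
# regardless of how healthy the other partitions look.

def eta_minutes(partition_lag_events, drain_rate_per_min):
    # Each partition drains independently at a bounded per-partition rate,
    # so the estimate is the maximum per-partition drain time.
    return max(lag / drain_rate_per_min for lag in partition_lag_events)

lags = [0, 0, 1_000, 0, 120_000]  # most partitions caught up, one hot
print(eta_minutes(lags, drain_rate_per_min=2_000))  # → 60.0
```

Total remaining lag here is small relative to overall throughput, yet the estimate is still an hour because one partition holds almost all of it.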
Monitoring
The team is monitoring the system as the accumulated backlog is processed. Our current estimates put us at being fully caught up in 2-3 hours.
Monitoring
The remediation is 70% complete, and we expect to finish processing the event backlog within 3-6 hours. All warnings from the previous update regarding temporary data visibility gaps and degraded product behavior still stand while the remediation is in flight.
Monitoring
The remediation efforts are progressing as expected. All warnings from the previous update regarding temporary data visibility gaps and degraded product behavior still stand while the remediation is in flight. The remediation is 50% complete, and our current estimate is that the event backlog will be processed and visible in the product within 8 hours.
Identified
As we continue to recover from the delay in event and persons ingestion, the following services are affected in these ways:
Overall, our event ingestion will take between 1 and 8 hours to catch up (levels of delay vary across projects and persons). This means that queries to our analytics products may show incorrect results for the most recent time ranges until ingestion has caught up.
Feature flags continue to serve traffic as before, but evaluations based on recent Person information are affected by the same delays and may continue to evaluate incorrectly. As the data catches up, these evaluations will become correct.
Error tracking and Session replay ingestion are up to date, but filtering on Person or event information is affected by the same delays.
Generally, the web application may display incorrect data for individual Persons while we work on ensuring those queries use the new database tables. This will be resolved shortly, after which it will only be affected by the general ingestion delay.
CDP and Workflows destinations are processing ingested events normally and will continue to catch up at the same rate as our main ingestion pipeline.
We have some follow-up operations to clean up minor data discrepancies, but at this time we are confident that no data has been lost in the process. Once we have caught up, reporting tools will show accurate values for the period during which we were delayed.
Identified
We are continuing to process a significant volume of incoming events. We expect at this stage to fully process the backlog in around 12-15 hours.
We will continue posting status updates as we progress.
Identified
We have finished a significant portion of the data transfer, which has enabled us to resume processing incoming events. We expect at this stage to fully process the backlog in around 12-18 hours.
We will continue posting status updates as we progress.
Identified
We continue making progress as expected on migrating data to the new database tables. It is a long process and we are exploring ways to speed things up, but we expect ingestion delays to continue until at least the end of today.
We will continue to provide updates throughout the day.
Identified
We are making progress moving data to the new data storage. Our validation of data consistency and performance looks good, so the new storage is likely to become the permanent solution.
We still see degraded performance when accessing older data, so we are prioritizing moving data to the new storage in order to speed up ingestion and reach a stable state as soon as possible.
We will provide an update later in the day on the progress of this.
Identified
Remediation work is continuing on the Persons database. Ingestion processing delays will persist while this work is in flight. Some Feature Flags functionality will also be affected during this incident.
Identified
We have multiple engineers working on remediation as a top priority. Data is being written to our new data store, and we are carefully validating its integrity before we begin switching services over to it, which will start to resolve the core underlying issues.
We expect delays in event and persons ingestion to continue as we make these changes over the next 12-24 hours, as we prioritise data integrity and stability. Importantly, no data has been lost.
There will be a full public post-mortem once we are recovered.
Identified
We are still working on remediation for the database issue and lag is still increasing; some events will see delays of up to 13 hours. We have identified the root cause, but due to the volume of data involved, repairing the database will take some time.
Investigating
Our data processing infrastructure is lagging as we work to resolve an issue with our persons database. No data has been lost but the incident is ongoing and substantial lag should be expected between event submission and availability in the product during remediation.
Identified
We are experiencing issues with the database responsible for updating Persons from incoming events, which is causing large delays in processing. You may see delays ranging from 10 to 60 minutes depending on the customer. We are actively working on remedying the underlying issue, although we expect continued delays over the coming hours as we carefully fix the issues whilst maintaining data integrity.