The following is the incident report for the Periscope Data incident that occurred on March 8th, 2019. This issue left many of our customers unable to run new queries and delayed background refreshes in the Periscope Data application. We understand the impact this had on our customers and sincerely apologize. We have taken a number of steps to prevent this issue from occurring in the future.
ISSUE SUMMARY
From approximately 7:00 am to 4:30 pm PT on March 8th, 2019, the Periscope Data query service was degraded. Many new queries to Periscope did not complete. The root cause was an overloaded database in our backend query service, which caused query performance to degrade significantly.
TIMELINE
7:00 am: On-call engineer was paged due to a slow rate of query completion and began investigation.
7:15 am: Engineers restarted backend query service. Engineers deleted non-critical rows from database.
7:43 am: Customers were notified of incident via Status Page: "Periscope Data App Queries Not Running"
8:30 am: Engineers paused background query running jobs to reduce database load and allow foreground queries to run.
9:00 am: Engineers identified a problem database query that may have caused the slowdown.
9:30 am: Engineers deleted rows from the problem table. Database CPU utilization began to recover, but database I/O metrics were not yet healthy.
9:36 am: Query processing latency was below 1 second again. Most new queries were able to run.
10:00 am: Engineers continued to delete non-critical rows and ran vacuum manually on the problem database to improve internal database metrics.
10:47 am: Status Page was moved to Monitoring.
12:00 pm: Engineers performed more aggressive cleanup and vacuuming of the problem database.
3:30 pm: Database auto vacuuming began to catch up. Manual clean-up stopped.
4:30 pm: All queued query requests cleared. Database CPU and IOPS metrics were back to healthy levels. All queries were completing as normal, and background queries were resumed.
REMEDIATION
Our investigation showed that the underlying database CPU utilization had been climbing over time due to increased query load. It reached a dangerous level in the early hours of March 8th. Additionally, dead rows had been building up over time and passed the threshold beyond which the normal vacuum cleanup process could no longer keep up. Together, these two factors severely degraded performance in our backend query service. Query requests quickly built up, and the service was unable to keep up with new queries.
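To make the dead-row buildup described above concrete, the following is a minimal sketch of how such a condition could be detected. It assumes the database is PostgreSQL (the report does not name the engine), where per-table live and dead row counts are exposed in the pg_stat_user_tables view. The TableStats type, the function names, and the 20% threshold are hypothetical and chosen only for illustration; they are not taken from the incident.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Assumption: PostgreSQL. This query reads per-table live and dead row
# counts; its results would be mapped into TableStats objects below.
DEAD_ROW_QUERY = """
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
"""


@dataclass
class TableStats:
    """Hypothetical container for one row of the query above."""
    name: str
    live_rows: int
    dead_rows: int


def dead_row_ratio(stats: TableStats) -> float:
    """Fraction of a table's rows that are dead (0.0 for an empty table)."""
    total = stats.live_rows + stats.dead_rows
    return stats.dead_rows / total if total else 0.0


def tables_needing_vacuum(
    tables: Iterable[TableStats], max_ratio: float = 0.2
) -> List[str]:
    """Names of tables whose dead-row ratio exceeds max_ratio.

    The 0.2 default is an illustrative value, not a recommendation.
    """
    return [t.name for t in tables if dead_row_ratio(t) > max_ratio]
```

A periodic job could run the query, build the list of TableStats, and page or trigger a manual vacuum for any table the function returns.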
We are taking steps to ensure proper cleanup of non-essential database rows to prevent this from happening in the future. In addition, we are adding more monitoring and alerts on our databases’ CPU and I/O utilization, and updating our on-call runbook with cleanup and vacuuming procedures so that any similar issue can be mitigated quickly.
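As a rough illustration of the kind of utilization alert mentioned above, the sketch below fires only when a metric stays above a threshold for several consecutive samples, which avoids paging on a single short spike. The function name, the sample values, and the thresholds are hypothetical; the report does not describe how the actual alerts are configured.

```python
from typing import Sequence


def sustained_breach(
    samples: Sequence[float], threshold: float, min_consecutive: int
) -> bool:
    """Return True if the most recent min_consecutive samples all exceed threshold.

    samples is ordered oldest to newest, for example CPU utilization
    percentages collected once per minute.
    """
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    if len(samples) < min_consecutive:
        return False  # not enough data yet to call it sustained
    return all(s > threshold for s in samples[-min_consecutive:])


# Example with made-up numbers: alert when CPU has been above 80%
# for the last three samples.
cpu_samples = [50.0, 85.0, 90.0, 92.0]
should_alert = sustained_breach(cpu_samples, threshold=80.0, min_consecutive=3)
```

The same check could be applied to I/O metrics by passing a different sample series and threshold.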