On July 18 at 13:41 UTC (09:41 ET), a code deployment caused custom app runs to start failing (the Nextmv routing service was unaffected). Two services were being deployed via traffic shifting at the time. The service causing the failures alarmed at 13:49 UTC and was immediately rolled back, restoring application runs. The second service's deployment continued to completion, and full custom app functionality was restored.
At 15:55 UTC (11:55 ET), a customer contacted us reporting failures when retrieving run results: some requests returned a 400 status indicating that the run was not in a completed state. We began investigating the issue. At 18:56 UTC (14:56 ET), a second customer reported the same problem. The issue was escalated to a major event at 19:15 UTC and an incident was opened. We could see that a number of customers were experiencing a higher-than-normal rate of 400 errors, although other customers' requests were succeeding and none of our system alarms or service canaries indicated a problem. Customers typically poll for a success status and then retrieve the results once success is indicated. Our internal canaries, which were succeeding, delayed for one second before the first status poll; we were able to reproduce the error by removing that delay. An emergency patch was deployed at 20:35 UTC (16:35 ET) and the condition cleared. During the incident, 18% of customer run result requests received the 400 error because they were made too early.
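The poll-then-fetch pattern described above can be sketched as a small client-side loop. This is an illustrative sketch, not the actual Nextmv client or API: the `get_status` and `fetch_result` callables, the status strings, and the injectable `sleep`/`now` hooks are all assumptions made for the example. The initial delay mirrors what our internal canaries did (wait one second before the first status poll).

```python
import time

def wait_for_completion(get_status, fetch_result,
                        poll_interval_s=1.0, timeout_s=60.0,
                        sleep=time.sleep, now=time.monotonic):
    """Poll a run's status until it succeeds, then fetch its results.

    get_status and fetch_result are hypothetical callables standing in
    for the API calls; sleep/now are injectable for testing.
    """
    # Brief delay before the first poll: a freshly created run may not
    # yet report an accurate status.
    sleep(poll_interval_s)
    deadline = now() + timeout_s
    while True:
        status = get_status()
        if status == "succeeded":
            return fetch_result()
        if status == "failed":
            raise RuntimeError("run failed")
        if now() >= deadline:
            raise TimeoutError("run did not complete in time")
        sleep(poll_interval_s)
```

Injecting the clock keeps the loop testable without real waiting; in production code the defaults (`time.sleep`, `time.monotonic`) apply.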
18 July 2023
A code release introduced a bug that caused a run to report an incorrect status during a transient window at the start of the run; once the run actually began execution, the correct status was reported. This window was typically less than one second.
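Because the incorrect-status window was typically under a second, a client that treats the "run not in a completed state" 400 as transient and retries briefly would have been resilient to this defect. A minimal sketch, assuming a generic `fetch` callable that raises an exception carrying a hypothetical `status` attribute (the real error shape depends on the HTTP client in use):

```python
import time

def fetch_with_retry(fetch, retries=3, delay_s=0.5,
                     is_transient=lambda e: getattr(e, "status", None) == 400):
    """Call fetch(), retrying a few times on transient 400 responses.

    fetch and the status attribute on its exceptions are illustrative
    assumptions, not a real client library's interface.
    """
    for attempt in range(retries + 1):
        try:
            return fetch()
        except Exception as exc:
            # Give up on the last attempt or on non-transient errors.
            if attempt == retries or not is_transient(exc):
                raise
            time.sleep(delay_s)
```

This is a defensive client-side pattern, not a substitute for the server-side fix; the emergency patch removed the incorrect status at the source.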
The system recovered once the emergency patch was deployed. Additional action items to better detect this kind of condition are being investigated; this report will be updated as they are identified.