On July 18 at 13:41 UTC (09:41 ET), a code deployment caused custom app runs to start failing (the Nextmv routing service was unaffected). Two services were being deployed via traffic shifting at the time. The service causing the failures alarmed at 13:49 UTC and was immediately rolled back, restoring application runs. The second service's deployment continued to completion, and full custom app functionality was restored.
At 15:55 UTC (11:55 ET), a customer contacted us reporting failures when retrieving run results: some requests returned a 400 status indicating that the run was not in a completed state. We began investigating the issue. At 18:56 UTC (14:56 ET), a second customer reported the same problem. The issue was escalated to a major event at 19:15 UTC and an incident was opened. We could see that a number of customers were experiencing a higher-than-normal rate of 400 errors, although other customers' requests were succeeding and none of our system alarms or service canaries indicated a problem. Customers typically poll for a success status and then retrieve the results once success is indicated. Our internal canaries, which were succeeding, delayed for one second before the first status poll; we were able to reproduce the error by removing that delay. An emergency patch was deployed at 20:35 UTC (16:35 ET) and the condition cleared. During the incident, 18% of customer run result requests received the 400 error because they were made too early.
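The poll-then-fetch pattern described above can be sketched as a small client-side loop. This is an illustrative sketch, not the actual Nextmv client or API: the `get_status` and `fetch_result` callables, the status strings, and the injectable `sleep`/`now` hooks are all assumptions made for the example. The initial delay mirrors what our internal canaries did (wait one second before the first status poll).

```python
import time

def wait_for_completion(get_status, fetch_result,
                        poll_interval_s=1.0, timeout_s=60.0,
                        sleep=time.sleep, now=time.monotonic):
    """Poll a run's status until it succeeds, then fetch its results.

    get_status and fetch_result are hypothetical callables standing in
    for the API calls; sleep/now are injectable for testing.
    """
    # Brief delay before the first poll: a freshly created run may not
    # yet report an accurate status.
    sleep(poll_interval_s)
    deadline = now() + timeout_s
    while True:
        status = get_status()
        if status == "succeeded":
            return fetch_result()
        if status == "failed":
            raise RuntimeError("run failed")
        if now() >= deadline:
            raise TimeoutError("run did not complete in time")
        sleep(poll_interval_s)
```

Injecting the clock keeps the loop testable without real waiting; in production code the defaults (`time.sleep`, `time.monotonic`) apply.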
18 July 2023
A code release introduced a bug that caused a run to report an incorrect status during a transient window at the start of the run; once the run actually began execution, the correct status was reported. This window was typically less than one second.
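Because the incorrect-status window was typically under a second, a client that treats the "run not in a completed state" 400 as transient and retries briefly would have been resilient to this defect. A minimal sketch, assuming a generic `fetch` callable that raises an exception carrying a hypothetical `status` attribute (the real error shape depends on the HTTP client in use):

```python
import time

def fetch_with_retry(fetch, retries=3, delay_s=0.5,
                     is_transient=lambda e: getattr(e, "status", None) == 400):
    """Call fetch(), retrying a few times on transient 400 responses.

    fetch and the status attribute on its exceptions are illustrative
    assumptions, not a real client library's interface.
    """
    for attempt in range(retries + 1):
        try:
            return fetch()
        except Exception as exc:
            # Give up on the last attempt or on non-transient errors.
            if attempt == retries or not is_transient(exc):
                raise
            time.sleep(delay_s)
```

This is a defensive client-side pattern, not a substitute for the server-side fix; the emergency patch removed the incorrect status at the source.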
The system recovered once the emergency patch was deployed. Additional action items to better detect this kind of condition are being investigated; this report will be updated as they are identified.