At 2025-06-20T12:05:45.692Z, runs longer than 5 minutes on the lambda execution class began failing consistently due to an accidental change to the lambda runtime limit configuration. These runs did not produce any API errors; instead, any run that exceeded the new 5 minute limit simply failed.
We first became aware of the issue at 12:07 UTC on 2025-06-20, when execution timeouts exceeded the accepted alert threshold and the on-call developer was paged.
At 12:14 UTC, a developer identified a change in a recent deployment that had reduced the maximum runtime limit to 5 minutes. At 12:16 UTC, a fix was submitted to our deployment pipeline. Developers then took manual action while the fix made its way through the pipeline.
Beginning at 12:33 UTC, developers made several manual deployments to both our staging and production environments to address the issue. Each deployment was followed by a round of monitoring, testing, and further remediation steps.
The resolution deployment was made at 13:15:20 UTC, after which developers continued to test and monitor for errors. When none were found, the incident was declared resolved.
A code release introduced a bug in which the maximum runtime for our lambda execution class was set to 5 minutes.
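For illustration only, the sketch below shows the shape of the misconfiguration, assuming a simple constant-style runtime limit; the field name and original value are hypothetical, not our actual configuration.

```python
# Hypothetical sketch of the accidental change; names and values are
# illustrative, not our real configuration schema.

# Before the faulty release: runs on the lambda execution class could run
# well past 5 minutes (original value shown here is a placeholder).
LAMBDA_MAX_RUNTIME_SECONDS = 900

# After the faulty release: the limit was effectively reduced to 5 minutes,
# so any run exceeding 300 seconds was terminated as a timeout.
LAMBDA_MAX_RUNTIME_SECONDS = 300
```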
Runs longer than 5 minutes succeeded once the final deployment was made at 13:15 UTC. As part of this retrospective, we identified an existing integration test that covers errors of this nature. When we recently reorganized our execution logic, we failed to move that integration test downstream into the relevant pipeline. The long-running execution test did fail in the upstream pipeline, but the downstream pipeline had already deployed by that time. That integration test will be moved to the correct pipeline to prevent further errors of this nature.
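As a rough sketch of the kind of test involved, the example below exercises a run just over 5 minutes against the configured limit, so a release that lowers the limit fails the pipeline rather than production runs. The harness, names, and values are hypothetical stand-ins, not our actual test code.

```python
import dataclasses

# Hypothetical stand-ins for our execution-class configuration and run harness.
MAX_RUNTIME_SECONDS = 900  # intended limit; the faulty release set this to 300


@dataclasses.dataclass
class RunResult:
    status: str  # "succeeded" or "timed_out"


def run_on_lambda_execution_class(duration_seconds: int) -> RunResult:
    """Simulate a run: it times out if it exceeds the configured limit."""
    if duration_seconds > MAX_RUNTIME_SECONDS:
        return RunResult(status="timed_out")
    return RunResult(status="succeeded")


def test_long_running_execution_completes():
    """A run just over 5 minutes must succeed; this fails if the limit drops to 300s."""
    result = run_on_lambda_execution_class(duration_seconds=6 * 60)
    assert result.status == "succeeded"
```

Placing a test like this in the downstream pipeline ties the check to the configuration that is actually deployed, which is the gap the upstream-only test left open.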