Nextmv API Incident

Incident Report for Nextmv

Postmortem

Summary

At 2025-06-20T12:05:45.692Z, runs longer than 5 minutes on the lambda execution class began failing consistently because of an accidental change to the configuration of the lambda runtime limit. The API itself did not return errors; instead, any run that exceeded the new 5-minute limit failed.

The issue was first detected at 12:07 UTC on 2025-06-20, when an alert paged the on-call developer because execution timeouts on runs had exceeded the accepted threshold.

At 12:14 UTC, a developer identified a change in a recent deployment that had reduced the maximum runtime limit to 5 minutes. At 12:16 UTC, a fix was submitted to our deployment pipeline. While the fix made its way through the pipeline, developers proceeded to take manual action.

Beginning at 12:33 UTC, developers made several manual deployments to both our staging and production environments to address the issue. Each deployment was followed by a round of monitoring, testing, and continued resolution steps.

The resolution deployment was made at 13:15:20 UTC, after which developers continued to test and monitor for errors. When none were found, the incident was resolved.
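For readers who want the durations implied by the timeline above, they can be worked out as in the following minimal sketch. It uses only the timestamps stated in this report. The 12:07 page time is given to the minute, so the detection figure is approximate, and the variable names are illustrative.

```python
from datetime import datetime, timezone

# Timestamps taken from the summary above (all UTC).
failures_began = datetime(2025, 6, 20, 12, 5, 45, 692000, tzinfo=timezone.utc)
on_call_paged = datetime(2025, 6, 20, 12, 7, tzinfo=timezone.utc)  # minute precision only
resolution_deployed = datetime(2025, 6, 20, 13, 15, 20, tzinfo=timezone.utc)

# Subtracting two aware datetimes yields a timedelta.
time_to_detect = on_call_paged - failures_began
time_to_resolve = resolution_deployed - failures_began

print(f"time to detect:  {time_to_detect}")
print(f"time to resolve: {time_to_resolve}")
```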

Root Cause

A code release introduced a bug in which the maximum runtime for our lambda execution class was set to 5 minutes.

Resolution and Recovery

Runs longer than 5 minutes succeeded once the final deployment was made at 13:15 UTC. As part of this retrospective, we identified an existing integration test that covers errors of this nature. We recently reorganized our execution logic and failed to move that integration test downstream into the relevant pipeline. The long-running execution test did fail in the upstream pipeline, but by that time the downstream pipeline had already deployed. That integration test will be moved to the correct pipeline to prevent further errors of this nature.
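As an illustration only, a lightweight configuration check could complement the long-running integration test by flagging a reduced runtime limit before deployment. The sketch below is not the integration test described above and does not reflect our actual configuration format. The names EXPECTED_MIN_LIMITS and check_runtime_limits, and the 15-minute floor, are hypothetical.

```python
from datetime import timedelta

# Hypothetical floor per execution class. The real limits are not stated in
# this report; 15 minutes is an assumed value used only for illustration.
EXPECTED_MIN_LIMITS = {"lambda": timedelta(minutes=15)}


def check_runtime_limits(configured: dict[str, timedelta]) -> list[str]:
    """Return one message per execution class whose limit is missing or too low."""
    problems = []
    for execution_class, floor in EXPECTED_MIN_LIMITS.items():
        limit = configured.get(execution_class)
        if limit is None:
            problems.append(f"{execution_class}: no runtime limit configured")
        elif limit < floor:
            problems.append(
                f"{execution_class}: limit {limit} is below expected {floor}"
            )
    return problems


# A configuration like the one in this incident would be flagged.
print(check_runtime_limits({"lambda": timedelta(minutes=5)}))
```

A check of this kind runs in well under a second, so it could gate the same pipeline that deploys the change, while the slower long-running execution test remains the authoritative verification.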

Posted Jun 20, 2025 - 21:03 UTC

Resolved

This incident has been resolved.
Posted Jun 20, 2025 - 13:16 UTC

Update

A fix has been deployed and errors have resolved, but we are continuing to monitor.
Posted Jun 20, 2025 - 13:15 UTC

Monitoring

A fix has been deployed and we believe the issue is resolved but are continuing to monitor.
Posted Jun 20, 2025 - 13:02 UTC

Identified

Runtime limit on the lambda execution class is restricted. We are actively working on deploying a fix.
Posted Jun 20, 2025 - 12:33 UTC

Investigating

We are investigating reports of increased API errors.
Posted Jun 20, 2025 - 12:05 UTC
This incident affected: Nextmv API.