Nextmv API Incident

Incident Report for Nextmv

Postmortem

Summary

On June 12, 2023 from 18:26 to 18:40 UTC, we experienced degraded performance due to high call volume on our app run endpoints. The high load impacted app runs and also resulted in degraded performance of other endpoint calls. During this period customers experienced an elevated error rate for the following APIs:

Endpoint	Error Rate
GET /v0/run/{id}/status	4%
GET /v1/applications/{application_id}/runs/{run_id}	5.5%
POST /v1/applications/{application_id}/runs	3.1%
POST /v1/applications/{application_id}/runs/uploadurl	1%

Timeline

June 12, 2023 at 18:28 UTC - an engineer responded to an alarm indicating a high number of API failures
June 12, 2023 at 18:35 UTC - an engineer determined the root cause to be non-malicious high call volume
June 12, 2023 at 18:40 UTC - the call volume subsided and the errors cleared
June 12, 2023 at 18:45 UTC - services were fully operational

Root Cause

During the high call volume event, the API layer experienced intermittent failures because of handle exhaustion. Some internal calls failed because clients could not be created, which resulted in API failures. The failure also impacted the rate limiting layer, which exacerbated the overload condition. The handle exhaustion was caused by a failure to close service clients in some cases, which leaked handles. Once a handler instance's handles were exhausted, no additional clients could be opened for that instance. Once the load subsided, the system recovered without any further action.
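
The leak pattern described above can be illustrated with a minimal Python sketch. It is not the actual API code: `HandlePool`, `ServiceClient`, and both handler functions are hypothetical names, and the sketch assumes the unclosed clients were left behind on an early-exit error path, which the report does not specify.

```python
class HandlePool:
    """Hypothetical stand-in for a per-instance handle limit."""

    def __init__(self, limit):
        self.limit = limit
        self.open_handles = 0

    def acquire(self):
        if self.open_handles >= self.limit:
            raise RuntimeError("handles exhausted: cannot create client")
        self.open_handles += 1

    def release(self):
        self.open_handles -= 1


class ServiceClient:
    """Hypothetical internal client that holds one handle until closed."""

    def __init__(self, pool):
        pool.acquire()  # raises if the instance has no handles left
        self._pool = pool
        self._closed = False

    def call(self):
        return "ok"

    def close(self):
        if not self._closed:  # closing twice is harmless
            self._closed = True
            self._pool.release()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions


def leaky_handler(pool, fail=False):
    client = ServiceClient(pool)
    if fail:
        # Early exit skips close(), so the handle is never released.
        raise ValueError("downstream error")
    result = client.call()
    client.close()
    return result


def safe_handler(pool, fail=False):
    # The with-block runs close() on every exit path, including errors.
    with ServiceClient(pool) as client:
        if fail:
            raise ValueError("downstream error")
        return client.call()


def count_leaks(handler, requests=5, limit=3):
    """Send failing requests and return how many handles remain open."""
    pool = HandlePool(limit)
    for _ in range(requests):
        try:
            handler(pool, fail=True)
        except (ValueError, RuntimeError):
            pass
    return pool.open_handles
```

In this sketch, the leaky handler ends with every handle in the pool held open, so later requests fail at client creation, while the safe handler returns the pool to zero open handles.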

Resolution and Recovery

The system recovered once the customer causing the overload was notified and reduced their call volume. The following additional steps were taken to prevent a recurrence:

Moved rate limiting to the edge of the service. Previously, rate limiting was handled in the routing layer. Rejecting excess requests at the edge prevents resources from being created during an overload condition. (completed 6/16/2023)
Ensured all clients are either closed or cached for the lifetime of the handler instance. (completed 6/16/2023)
Added a metric to monitor open handles for the API layer. (completed 6/16/2023)
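
As an illustration of why rate limiting at the edge helps, the following Python sketch rejects over-limit requests before any client or handle is created, and tracks open handles as a gauge. It is only a sketch under stated assumptions: the token bucket algorithm, the `Edge` class, the 429 response code, and all parameter values are hypothetical choices, not details taken from the actual service.

```python
import time


class TokenBucket:
    """Simple token bucket; capacity and refill rate are illustrative."""

    def __init__(self, capacity, refill_per_sec, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self):
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, capped at capacity.
        self._tokens = min(self.capacity,
                           self._tokens + elapsed * self.refill_per_sec)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


class Edge:
    """Hypothetical edge layer that checks the limit before allocating."""

    def __init__(self, bucket):
        self.bucket = bucket
        self.open_handles = 0  # gauge a metrics system could scrape
        self.peak_handles = 0

    def handle(self, request):
        if not self.bucket.allow():
            return 429  # rejected at the edge: no handle was allocated
        self.open_handles += 1  # stands in for creating a service client
        self.peak_handles = max(self.peak_handles, self.open_handles)
        try:
            return 200  # stands in for doing the real work
        finally:
            self.open_handles -= 1  # stands in for closing the client
```

Because the limit check comes first, a burst of rejected requests costs no handles in this sketch, and the `open_handles` gauge gives an early signal if handles start to accumulate.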

Posted Jun 28, 2023 - 16:37 UTC

Resolved

We have resolved the issue.

Posted Jun 12, 2023 - 19:54 UTC

Investigating

We are investigating reports of increased API errors.

Posted Jun 12, 2023 - 18:30 UTC

This incident affected: Nextmv API.