On June 12, 2023 from 18:26 to 18:40 UTC, we experienced degraded performance due to high call volume on our app run endpoints. The high load impacted app runs and also resulted in degraded performance of other endpoint calls. During this period customers experienced an elevated error rate for the following APIs:
Endpoint | Error Rate |
---|---|
GET /v0/run/{id}/status | 4% |
GET /v1/applications/{application_id}/runs/{run_id} | 5.5% |
POST /v1/applications/{application_id}/runs | 3.1% |
POST /v1/applications/{application_id}/runs/uploadurl | 1% |
During the high call event the API layer experienced intermittent failures because of handle exhaustion. This prevented some internal calls to fail because clients could not be created, and resulted in API failures. The failure also impacted the rate limiting layer, which exacerbated the overload condition. Once the load subsided the system recovered without any further actions. The handles issue was caused by failure to close service clients in some cases, which leaked handles. Once handles exhausted, it prevented any additional clients from being opened for that handler instance.
The system recovered once the customer causing the overload was notified, and reduced their call volume. The following additional steps were taken to prevent a reoccurrence: