Nextmv API Incident
Incident Report for Nextmv
Postmortem

Summary

At 6:50 AM ET on May 16, we experienced degraded service due to slow responses on the v0/run endpoint. The issue was identified as a bug causing a panic in the OpenTelemetry (OTEL) collector container, a sidecar we use for metrics. The bug caused tasks to intermittently fail and be restarted, resulting in slow response times. Once the cause was determined, an engineer manually pinned the OTEL container version to the previous release (v0.28.0). After this update completed, the queued backlog cleared and run times returned to normal. The issue was resolved by 8:15 AM ET.

Timeline

  • May 15, 2023 at 17:34:00: v0.29.0 of the AWS OTEL Collector was tagged in GitHub, publishing a new OTEL container image to a public repository
  • May 16, 2023 at 3:38 AM ET: The first task was replaced in the worker cluster using the faulty OTEL container image
  • May 16, 2023 at 6:46 AM ET: All tasks in the cluster had been updated to use the faulty OTEL container image
  • May 16, 2023 at 6:50 AM ET: An engineer was paged due to canary timeouts in production
  • May 16, 2023 at 7:50 AM ET: The issue was identified: a panic in the OTEL collector was causing runs to retry, slowing execution
  • May 16, 2023 at 8:15 AM ET: The OTEL image was pinned to the previous version and the cluster tasks were rotated. The queued backlog began to clear and run times started to return to normal
  • May 16, 2023 at 8:30 AM ET: The backlog was cleared, and run times returned to normal. The incident was closed.

Root Cause

We use the AWS OTEL Collector, an open-source project led by AWS, to send telemetry data to AWS CloudWatch. Our cluster task definition referenced the latest version of the collector image. When v0.29.0 was released, it introduced a bug that panicked on certain string-valued DynamoDB attributes (see their bug fix here). Whenever our cluster replaced a running task, it picked up the latest version. Runs did not fail, but they took much longer, which degraded response times.
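
To make the failure mode concrete, the fragment below is a minimal sketch, not our actual configuration: in an ECS-style container definition, a floating latest tag lets every task replacement silently pull a new collector release, while an explicit tag keeps the sidecar stable. The container names and image paths shown are illustrative assumptions.

    # Minimal sketch (illustrative names and image paths, not production config).
    # With a floating "latest" tag, each task replacement can pull a new
    # collector release; with an explicit tag, the sidecar stays on a known version.

    floating_sidecar = {
        "name": "aws-otel-collector",
        "image": "public.ecr.aws/aws-observability/aws-otel-collector:latest",
        "essential": False,
    }

    pinned_sidecar = {
        "name": "aws-otel-collector",
        "image": "public.ecr.aws/aws-observability/aws-otel-collector:v0.28.0",
        "essential": False,
    }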

Resolution and Recovery

The OTEL collector was manually pinned to v0.28.0 in the task definition, and the tasks in the cluster were rotated. This cleared the issue and the service began to recover; the backlog cleared within 15 minutes and operations returned to normal. We subsequently pinned the collector to a specific version in our automation and pushed the change through our pipeline to prevent future unmanaged updates of the container from reaching production.
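
As a rough sketch of this kind of remediation, assuming an ECS-style setup (the cluster, service, task family, and application image names below are hypothetical), pinning the sidecar and rotating the running tasks with boto3 might look like this:

    import boto3

    # Hypothetical sketch: register a task definition revision with the OTEL
    # sidecar pinned to a known-good release, then force a new deployment so
    # every running task is replaced. Names and image paths are placeholders.
    ecs = boto3.client("ecs")

    revision_arn = ecs.register_task_definition(
        family="worker",                      # hypothetical task family
        requiresCompatibilities=["FARGATE"],  # assumed launch type
        networkMode="awsvpc",
        cpu="512",
        memory="1024",
        containerDefinitions=[
            {
                "name": "app",
                "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/worker:stable",  # placeholder
                "essential": True,
            },
            {
                "name": "aws-otel-collector",
                # Pinned to the previous release instead of "latest".
                "image": "public.ecr.aws/aws-observability/aws-otel-collector:v0.28.0",
                "essential": False,
            },
        ],
    )["taskDefinition"]["taskDefinitionArn"]

    # Point the service at the new revision and rotate the running tasks.
    ecs.update_service(
        cluster="prod-workers",   # hypothetical cluster name
        service="worker",         # hypothetical service name
        taskDefinition=revision_arn,
        forceNewDeployment=True,
    )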

Posted Jun 05, 2023 - 13:45 UTC

Resolved
We have resolved the issue.
Posted May 16, 2023 - 12:30 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted May 16, 2023 - 12:19 UTC
Identified
The issue has been identified and a fix is being implemented. This issue is only impacting runs on the v0/run endpoint.
Posted May 16, 2023 - 12:11 UTC
Investigating
We are investigating reports of delayed response times.
Posted May 16, 2023 - 11:55 UTC
This incident affected: Nextmv API.