Nextmv API Incident
Incident Report for Nextmv
Postmortem

Summary

At 6:50 AM ET on May 16, we experienced degraded service due to slow responses on the v0/run endpoint. The issue was identified as a bug causing a panic in the OpenTelemetry (OTEL) collector container, a sidecar we use for metrics. The bug caused tasks to intermittently fail and be restarted, resulting in slow response times. Once the cause was determined, an engineer manually pinned the OTEL container version to the previous release (v0.28.0). After this update completed, the queued backlog cleared and run times returned to normal. The issue was resolved by 8:15 AM ET.

Timeline

  • May 15, 2023 at 17:34:00: v0.29.0 of the AWS OTEL Collector was tagged in GitHub, publishing a new OTEL container image to a public repository
  • May 16, 2023 at 3:38 AM ET: The first task was replaced in the worker cluster using the faulty OTEL container image
  • May 16, 2023 at 6:46 AM ET: All tasks in the cluster had been updated to use the faulty OTEL container image
  • May 16, 2023 at 6:50 AM ET: An engineer was paged due to canary timeouts in production
  • May 16, 2023 at 7:50 AM ET: The issue was identified: a panic in the OTEL collector was causing runs to retry, slowing execution
  • May 16, 2023 at 8:15 AM ET: The OTEL image was pinned to the previous version and the cluster tasks were rotated. The queued backlog began to clear and run times started to return to normal
  • May 16, 2023 at 8:30 AM ET: The backlog was cleared, and run times returned to normal. The incident was closed.

Root Cause

We use the AWS OTEL Collector, an open-source project led by AWS, to send telemetry data to AWS CloudWatch. Our cluster task definition referenced the latest version of the collector image. When v0.29.0 was released, it introduced a bug that panicked on certain string-valued DynamoDB attributes (see their bug fix here). Whenever our cluster replaced a running task, it picked up the latest version. Runs did not fail, but they took much longer, which degraded response times.
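
To make the failure mode concrete, the fragment below is a minimal sketch, not our actual configuration: in an ECS-style container definition, a floating latest tag lets every task replacement silently pull a new collector release, while an explicit tag keeps the sidecar stable. The container names and image paths shown are illustrative assumptions.

    # Minimal sketch (illustrative names and image paths, not production config).
    # With a floating "latest" tag, each task replacement can pull a new
    # collector release; with an explicit tag, the sidecar stays on a known version.

    floating_sidecar = {
        "name": "aws-otel-collector",
        "image": "public.ecr.aws/aws-observability/aws-otel-collector:latest",
        "essential": False,
    }

    pinned_sidecar = {
        "name": "aws-otel-collector",
        "image": "public.ecr.aws/aws-observability/aws-otel-collector:v0.28.0",
        "essential": False,
    }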

Resolution and Recovery

The OTEL collector was manually pinned to v0.28.0 in the task definition, and the tasks in the cluster were rotated. This cleared the issue and the service began to recover; the backlog cleared within 15 minutes and operations returned to normal. We subsequently pinned the collector to a specific version in our automation and pushed the change through our pipeline to prevent future unmanaged updates of the container from reaching production.
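
As a rough sketch of this kind of remediation, assuming an ECS-style setup (the cluster, service, task family, and application image names below are hypothetical), pinning the sidecar and rotating the running tasks with boto3 might look like this:

    import boto3

    # Hypothetical sketch: register a task definition revision with the OTEL
    # sidecar pinned to a known-good release, then force a new deployment so
    # every running task is replaced. Names and image paths are placeholders.
    ecs = boto3.client("ecs")

    revision_arn = ecs.register_task_definition(
        family="worker",                      # hypothetical task family
        requiresCompatibilities=["FARGATE"],  # assumed launch type
        networkMode="awsvpc",
        cpu="512",
        memory="1024",
        containerDefinitions=[
            {
                "name": "app",
                "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/worker:stable",  # placeholder
                "essential": True,
            },
            {
                "name": "aws-otel-collector",
                # Pinned to the previous release instead of "latest".
                "image": "public.ecr.aws/aws-observability/aws-otel-collector:v0.28.0",
                "essential": False,
            },
        ],
    )["taskDefinition"]["taskDefinitionArn"]

    # Point the service at the new revision and rotate the running tasks.
    ecs.update_service(
        cluster="prod-workers",   # hypothetical cluster name
        service="worker",         # hypothetical service name
        taskDefinition=revision_arn,
        forceNewDeployment=True,
    )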

Posted Jun 05, 2023 - 13:45 UTC

Resolved
We have resolved the issue.
Posted May 16, 2023 - 12:30 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted May 16, 2023 - 12:19 UTC
Identified
The issue has been identified and a fix is being implemented. This issue is only impacting runs on the v0/run endpoint.
Posted May 16, 2023 - 12:11 UTC
Investigating
We are investigating reports of delayed response times.
Posted May 16, 2023 - 11:55 UTC
This incident affected: Nextmv API.