At 6:50 AM ET on May 16, we experienced degraded service due to slow responses on the v0/run endpoint. The issue was identified as a bug causing a panic in the OpenTelemetry (OTEL) collector container, which runs as a sidecar alongside each task and handles our metrics. The panics caused tasks to intermittently fail and restart, resulting in slow response times. Once this cause was determined, an engineer manually pinned the OTEL container to the previous release (v0.28.0). After this update completed, the queued backlog cleared and run times returned to normal. The issue was resolved by 8:15 AM ET.
We use the AWS OTEL Collector, an open source project led by AWS, to send telemetry data to AWS CloudWatch. Our cluster's task definition referenced the latest image tag rather than a fixed version, so whenever the cluster replaced a running task, the new task pulled whatever collector version had most recently been published. When v0.29.0 was released, it introduced a bug that panicked on certain string-valued DynamoDB attributes (see their bug fix here), and replacement tasks began picking it up. Runs didn't fail outright, but they took much longer, which degraded response times.
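For illustration, here is a minimal sketch of the kind of sidecar entry in an ECS task definition that allows this drift. The container name and surrounding structure are hypothetical (our real task definition has more fields); the image path is the public AWS OTEL Collector repository.

```python
# Hypothetical sketch of an OTEL sidecar entry in an ECS task definition.
# Referencing the floating "latest" tag means every task replacement can
# silently pull a newer collector release; this is the mechanism by which
# v0.29.0 reached production tasks.
otel_sidecar = {
    "name": "aws-otel-collector",
    # Unpinned: resolves to whatever version AWS most recently published.
    "image": "public.ecr.aws/aws-observability/aws-otel-collector:latest",
    "essential": True,
}
```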
To remediate, we manually pinned the OTEL collector to v0.28.0 in the task definition and rotated the tasks in the cluster. The service began to recover immediately; the backlog cleared in 15 minutes and operations returned to normal. We have since pinned the collector to a specific version in our automation and pushed that change through our pipeline, so unmanaged updates of the container can no longer reach production.
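A rough sketch of what that remediation looks like with boto3, not our actual deployment tooling: the task family, cluster, and service names below are hypothetical, and resource settings are omitted for brevity.

```python
# Sketch of pinning the sidecar and rotating tasks, assuming boto3 and
# hypothetical names for the task family, cluster, and service.
import boto3

ecs = boto3.client("ecs")

# Re-register the task definition with the collector pinned to the
# last known-good release instead of the floating "latest" tag.
response = ecs.register_task_definition(
    family="run-service",  # hypothetical task family
    containerDefinitions=[
        {
            "name": "aws-otel-collector",
            "image": "public.ecr.aws/aws-observability/aws-otel-collector:v0.28.0",
            "essential": True,
        },
        # ... application container definitions unchanged ...
    ],
)

# Point the service at the new revision; ECS rolls the tasks over,
# replacing any task still running the broken v0.29.0 collector.
ecs.update_service(
    cluster="run-cluster",    # hypothetical cluster name
    service="run-service",    # hypothetical service name
    taskDefinition=response["taskDefinition"]["taskDefinitionArn"],
)
```

Updating the service to the newly registered revision triggers a rolling deployment, which is what rotated the tasks and cleared the backlog.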