8-24-22 AWS Outage Causing Racking and Other Issues

Incident Report for SMRT Systems

Resolved

AWS has not yet announced that the issue is resolved. However, our queues are back to normal levels and everything should be functioning normally.

This is the result of reduced system load at this late hour and the partial recovery of AWS ECS, where AWS reports a Fargate task launch success rate of about 70%. If you experience any issues tomorrow morning, please reach out to support. Our dev team will be standing by and monitoring the queues even though we are back to normal.

Thanks for your patience,
SMRT Systems

Posted Aug 24, 2022 - 20:16 PDT

Update

New update from AWS.

7:45 PM PDT We have identified the cause of the decreased capacity and understand why the Fargate task launch success rate is only 70% at this point. Our remediation actions are making slower progress than expected, so we are working on additional actions to further reduce load on Fargate. The work started in the previous update is still progressing but we do not yet have a projected ETA for when it will complete or when we will see recovery. Customers can switch to using the EC2 with ECS and EKS as a mitigation since ECS with EC2 and EKS with EC2 are not impacted by this event.

Our queue backlog has shrunk to less than a fifth of its peak from a couple of hours ago. There's still a moderate backlog, but most of the queue has been processed. We should be back to normal within the next hour if this pace holds. Thank you for your patience.

Posted Aug 24, 2022 - 19:57 PDT

Update

New update from AWS.

6:49 PM PDT We have identified the cause of the decreased capacity and understand why the Fargate task launch success rate is only 70% at this point. We are working on multiple parallel actions to address the underlying issues and have identified one area in particular that should help us make faster progress towards recovery. We have started work on this and will have an indication on progress by 7:00 PM PDT. Once we have that progress data we will be able to provide an ETA for recovery. We are also making a change to the rate at which ECS launches tasks as part of ECS services to reduce load on Fargate and to speed up recovery. Customers with prepared and rehearsed plans for moving to a different region should exercise those if they are in a place to do so. Customers can also switch to using the EC2 with ECS and EKS as a mitigation since ECS with EC2 and EKS with EC2 are not impacted by this event.

We have seen a significant decline in the job count in our queues. That said, the queues are still backlogged, and we'll continue to provide updates until the issue is fully resolved.

Posted Aug 24, 2022 - 19:01 PDT

Update

There was a recent update from AWS.

Amazon Elastic Container Service - Increased Rates of Insufficient Capacity Errors
5:49 PM PDT We have identified the root cause for the increase in insufficient capacity error rates for launching new Fargate tasks and pods. Customers using ECS with Fargate and EKS with Fargate are impacted, starting at 1:15 PM PDT. We continue to work towards full resolution of the issue, however we are experiencing some delays with full recovery. We are working multiple, parallel paths to make additional capacity available. You will still see some task launches succeeding during this event. Running tasks and pods are not impacted. Customers using ECS with EC2 or EKS with EC2 are not impacted by this issue. We will provide another update in the next 30 minutes.

Here's the link to their status page (see the second issue, "Operational issue - Amazon Elastic Container Service (N. Virginia)"):
https://health.aws.amazon.com/health/status

Posted Aug 24, 2022 - 17:55 PDT

Monitoring

There's currently an AWS ECS outage causing SMRT queue delays. Users may experience delays with racking, Metal Progetti Assembly, and other areas of the system that rely on our queue system. You can continue to use the system, but be aware that all events that go through our queue system will be delayed.

AWS has identified the issue and is working to resolve it. Once they have resolved the issue, our queues will resume processing.

This means that racking events, Metal Progetti (MP) unloads and events, ready order texts, and other queued events will be processed after the issue is resolved.

Posted Aug 24, 2022 - 17:48 PDT

This incident affected: SMRT POS (POS, Reporting / KPI System).