At 09:52 UTC, we received the first customer report of HTTP 500 errors during Debian package uploads. Additional reports followed throughout the day, confirming that multiple package types and customers were affected. After an internal investigation, the issue was declared an incident at 18:19 UTC.
The underlying cause was found to be insufficient disk space on a subset of ECS tasks responsible for handling file uploads. These tasks had reached their disk capacity, preventing them from completing upload operations. Because the tasks were running at maximum disk usage, they also could not be automatically cycled out by ECS, allowing the problem to persist.
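To illustrate the failure condition described above, the following is a minimal sketch of a disk saturation check that an upload worker could run before accepting a file. It is not our production code: the function names, the default scratch path, and the 90% threshold are all assumptions chosen for illustration.

```python
import shutil
import tempfile


def disk_usage_fraction(path=None):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding `path`.

    Assumption: the scratch space lives under the system temp directory.
    """
    path = path or tempfile.gettempdir()
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def is_disk_saturated(path=None, threshold=0.9):
    """Report whether disk usage has reached `threshold` (hypothetical 90% default)."""
    return disk_usage_fraction(path) >= threshold


if __name__ == "__main__":
    print(f"scratch disk usage: {disk_usage_fraction():.1%}")
    print(f"saturated: {is_disk_saturated()}")
```

A check along these lines could feed a container health check or an alert, so that a task approaching its disk limit is surfaced before uploads start failing.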
During the investigation we observed an increase in device maximum capacity errors, and inspection of the ECS cluster revealed tasks of varying ages. This correlated with the accumulation of temporary files generated during package extraction and processing. Although temporary file cleanup mechanisms exist, they did not adequately prevent disk saturation.
As a short-term remediation, a deploy was triggered to refresh all ECS tasks, which replaced the existing tasks with new ones. Following this deployment, upload functionality stabilized and no further customer-impacting errors were observed. The issue was considered mitigated at 20:19 UTC.
Subsequent analysis confirmed that temporary file usage was a significant contributor to disk exhaustion. Additional improvements were implemented to ensure temporary files are consistently closed and cleaned up, regardless of whether errors occur during processing.
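The cleanup guarantee described above can be sketched as follows. This is an illustrative example rather than the actual change: `scratch_dir`, `process_upload`, and the `package.bin` file name are hypothetical, and the "processing" step is reduced to writing the payload to disk. The point is the `try`/`finally` structure, which removes the temporary directory whether or not processing raises.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def scratch_dir(prefix="upload-"):
    """Yield a temporary directory and always remove it on exit."""
    path = tempfile.mkdtemp(prefix=prefix)
    try:
        yield path
    finally:
        # Runs on success and on error, so extracted files never outlive the request.
        shutil.rmtree(path, ignore_errors=True)


def process_upload(data: bytes, fail: bool = False) -> int:
    """Hypothetical upload handler: stage the payload, then 'process' it."""
    with scratch_dir() as workdir:
        target = os.path.join(workdir, "package.bin")
        # The `with` block closes the file handle even if the write fails.
        with open(target, "wb") as fh:
            fh.write(data)
        if fail:
            raise ValueError("simulated processing error")
        return os.path.getsize(target)
```

The standard library's `tempfile.TemporaryDirectory` provides the same behavior out of the box; the explicit context manager is shown here only to make the cleanup path visible.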
To reduce the likelihood of similar incidents in the future, we are implementing the following long-term improvements: