Intermittent errors on package uploads

Incident Report for packagecloud.io

Postmortem

Summary

At 09:52 UTC, we received the first customer report of HTTP 500 errors during Debian package uploads. Additional reports followed throughout the day, confirming that multiple package types and customers were affected. After an internal investigation, the issue was declared an incident at 18:19 UTC.

The underlying cause was found to be insufficient disk space on a subset of ECS tasks responsible for handling file uploads. These tasks had reached their disk capacity, preventing them from completing upload operations. Because the tasks were running at maximum disk usage, they also could not be automatically cycled out by ECS, allowing the problem to persist.

During the investigation we observed an increase in device maximum capacity errors, and inspection of the ECS cluster revealed tasks of varying ages. This correlated with the accumulation of temporary files generated during package extraction and processing. Although temporary file cleanup mechanisms exist, they did not adequately prevent disk saturation.

As a short-term remediation, we triggered a deploy that replaced all existing ECS tasks with new ones. Following this deployment, upload functionality stabilized and no further customer-impacting errors were observed. The issue was considered mitigated at 20:19 UTC.

Subsequent analysis confirmed that temporary file usage was a significant contributor to disk exhaustion. Additional improvements were implemented to ensure temporary files are consistently closed and cleaned up, regardless of whether errors occur during processing.
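
As an illustration of the cleanup guarantee described above, the following is a minimal Python sketch, not our production code. The names scratch_file and process_upload are hypothetical, and the empty-upload failure is only a stand-in for an error during processing. The point it shows is that the temporary file is closed and removed in a finally block, so cleanup runs whether processing succeeds or raises.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def scratch_file(suffix=""):
    """Yield an open temporary file; always close and delete it afterwards.

    Hypothetical helper for illustration only.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    handle = os.fdopen(fd, "w+b")
    try:
        yield handle, path
    finally:
        # Runs on both success and failure, so the file never outlives the block.
        handle.close()
        try:
            os.remove(path)
        except FileNotFoundError:
            # Already removed by the caller; nothing left to clean up.
            pass


def process_upload(data):
    """Hypothetical processing step that may raise partway through."""
    with scratch_file(suffix=".pkg") as (handle, path):
        handle.write(data)
        handle.flush()
        if not data:
            # Stand-in for any error raised during extraction or processing.
            raise ValueError("empty upload")
        return path
```

Tying deletion to the scope of the block, rather than to a separate cleanup job, means a failed upload cannot leave files behind for a later sweep to find.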

Changes we’re making:

To reduce the likelihood of similar incidents in the future, we are implementing the following long-term improvements:

Strengthening disk usage observability: Enhanced monitoring now provides real-time insight into ECS task disk consumption, enabling earlier detection of abnormal usage.
Improving temporary file lifecycle management: Further audit and cleanup of temporary file handling across the codebase is underway to minimize unnecessary disk usage.
Refining ECS task lifecycle practices: We are reviewing our deployment and recycling strategy to ensure tasks do not remain active long enough to accumulate excessive temporary data.
Enhancing health checks: We are evaluating load balancer health check behavior to ensure that resource-degraded tasks (including those with insufficient disk) are identified and replaced more promptly.
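
To make the disk-aware health check idea concrete, here is a minimal Python sketch under stated assumptions. The function names, the 90% threshold, and the response strings are all hypothetical and do not describe our actual configuration. It reports a task as unhealthy once disk usage on a given path crosses the threshold, which would let a load balancer health check mark the task for replacement.

```python
import shutil


def disk_used_fraction(path="/"):
    """Return the fraction of disk space in use on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Treat an unreadable or zero-sized filesystem as full, to fail safe.
        return 1.0
    return usage.used / usage.total


def is_healthy(path="/", max_used_fraction=0.90):
    """Hypothetical check: healthy only while usage is below the threshold."""
    return disk_used_fraction(path) < max_used_fraction


def health_response(path="/", max_used_fraction=0.90):
    """Return an (HTTP status, body) pair suitable for a health endpoint."""
    if is_healthy(path, max_used_fraction):
        return 200, "ok"
    return 503, "disk usage too high"
```

A check like this is cheap enough to run on every health probe. The threshold would need to leave enough headroom for the largest expected upload.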

Posted Dec 04, 2025 - 20:39 PST

Resolved

On November 26th, 2025, from approximately 03:00 UTC until 20:19 UTC, our package upload service experienced intermittent failures affecting file uploads. Customers attempting to upload Maven, RPM, and Debian packages encountered HTTP 500 errors. Retrying occasionally resulted in a successful upload, but overall reliability was significantly reduced during this period.

Posted Nov 26, 2025 - 05:00 PST