Proton’s servers experienced an outage on Monday that interrupted service for users for about one hour. In line with our policy of full transparency and disclosure, we are publishing the results of our internal investigation into the incident.
The outage began around 7:30 AM CET and impacted all Proton services for approximately one hour. No user data was lost, and this was not the result of any attack or breach. We have identified and corrected the problems that led to the downtime. Moreover, our work over these last few days will inform a number of new tests and safeguards that will make Proton more resistant to problems, even under abnormal conditions.
We understand that even a few minutes of downtime is disruptive to our community. We will be taking steps to ensure our systems are even more resilient in the future. This article details some of those steps, as well as a timeline of the issues.
At 7:22 AM CET on Monday, we switched our API services into offline mode to perform a scheduled maintenance to upgrade some of our database hardware, which was successful. The intervention should have taken less than five minutes.
However, four minutes later, at 7:26 AM, we noticed spikes in system CPU and memory utilization across our API server instances, which almost instantly brought all applications to a halt.
As most of the server fleet became unresponsive due to too-high CPU and memory usage, we were not able to recover quickly. First we attempted to stop traffic from reaching the impacted servers so they could “cool down.” However, that was ultimately ineffective; even with no web traffic reaching the servers, we could not connect to them. The servers could only be recovered via a hard reboot triggered from out-of-band-management. For security reasons, this can only be done manually by a small number of engineers, which means it is not a fast process.
When the servers came back up, some of them also had improper configurations because some of them had gone down before the config changes we were pushing out had all been applied. This required an additional server-by-server check before we could bring the server fleet back online. In total, we needed approximately one hour to restore all services.
Each server has a local cache of certain critical configuration data. This data expires in a time period that varies between three and four minutes, and each server then pulls an updated copy. While cache expiration times are normally staggered, setting all services to offline mode had the unfortunate side effect of approximately synchronizing them, as part of this operation resets this cache.
This configuration cache normally should not be needed at all during offline mode. However, there was a bug in our application code where authenticated requests (those of logged-in users, the vast majority of requests) erroneously required this data, even when offline.
What this meant was that offline mode turned into a ticking time bomb. After about three minutes, on every server, the configuration cache expired, triggering requests to a database which at this point was still unreachable. This meant that requests that usually took milliseconds to handle, now required up to 10 seconds as they waited for a timeout. The spike in response times caused API requests to pile up, triggering a massive memory and CPU usage spike which rendered all affected servers unresponsive in a matter of seconds.
We do have some protections in place against “thundering herd” problems such as this, but for reasons which are still under investigation, they were not sufficiently effective to prevent the problem.
To recap, here is the sequence of events and errors:
- Each server manages its own cache of configuration data which expires at semi-regular intervals. This data should not be required in offline mode.
- Because of a bug, authenticated requests to the servers (the vast majority of requests) triggered dependence on this cached data.
- The cached data expired three or four minutes after offline mode was entered, which triggered many requests to an offline database.
- This flood of requests resulted in memory and CPU usage spikes on the API servers, which rendered them unresponsive.
Our first priority was to fix the two code issues that caused the immediate problem. There is no longer database dependence in offline mode for authenticated requests, and the configuration cache is being reworked to continue to use the “stale” data in the event of a refresh failure rather than clearing it entirely. This will fully prevent this particular issue from recurring in this manner.
We also continue our efforts to fully simulate the failure so that we can build in additional safeguards at different levels of our technology stack. This in-depth defense is essential for ensuring that one bug cannot create a fatal cascade of errors.
This outage has also revealed the need to be better when it comes to thoroughly stress testing our infrastructure and code in abnormal conditions, such as when various dependent services are offline. Offline mode tests, for instance, must include various types of requests, including authenticated ones, and should have gone on for longer to trigger the cache expiration issue. It also highlights the need for more regular testing. While offline mode has been tested in the past, between the last test and the incident on Feb. 1, Proton’s traffic has grown so dramatically that more tests should have been conducted even though none of the software involved has been changed.
We are also changing our standard operating procedures for maintenance operations to ensure that more engineers are on-call in case something unexpected happens. One of the reasons for the longer recovery time during this incident was the time required to call additional engineers to accelerate the recovery.
This was a highly unusual downtime and stressful both for our team and our users. We apologize for this inconvenience, and we’re grateful for your trust as we work to provide the highest levels of privacy, security, and reliability to the Proton community.
The Proton Team