Update about the service interruption on Feb. 1, 2021


Proton’s servers experienced an outage on Monday that interrupted service for users for about one hour. In line with our policy of full transparency and disclosure, we are publishing the results of our internal investigation into the incident.

The outage began around 7:30 AM CET and impacted all Proton services for approximately one hour. No user data was lost, and this was not the result of any attack or breach. We have identified and corrected the problems that led to the downtime. Moreover, our work over these last few days will inform a number of new tests and safeguards that will make Proton more resistant to problems, even under abnormal conditions.

We understand that even a few minutes of downtime is disruptive to our community. We will be taking steps to ensure our systems are even more resilient in the future. This article details some of those steps, as well as a timeline of the issues.

Timeline

At 7:22 AM CET on Monday, we switched our API services into offline mode to perform scheduled maintenance: an upgrade of some of our database hardware. The upgrade itself was successful, and the intervention should have taken less than five minutes.

However, four minutes later, at 7:26 AM, we noticed spikes in system CPU and memory utilization across our API server instances, which almost instantly brought all applications to a halt.

Graph of CPU usage spiking on Monday

As most of the server fleet became unresponsive due to excessive CPU and memory usage, we were not able to recover quickly. First, we attempted to stop traffic from reaching the impacted servers so they could “cool down.” However, that was ultimately ineffective; even with no web traffic reaching the servers, we could not connect to them. The servers could only be recovered via a hard reboot triggered through out-of-band management. For security reasons, this can only be done manually by a small number of engineers, which means it is not a fast process.

When the servers came back up, some of them had incorrect configurations because they had gone down before the config changes we were pushing out had been fully applied. This required an additional server-by-server check before we could bring the server fleet back online. In total, we needed approximately one hour to restore all services.

Causes

Each server has a local cache of certain critical configuration data. This data expires after a period that varies between three and four minutes, and each server then pulls an updated copy. While cache expiration times are normally staggered, setting all services to offline mode had the unfortunate side effect of approximately synchronizing them, because part of this operation resets the cache.
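
To illustrate why the reset mattered, here is a minimal sketch in Python. The 180 to 240 second range reflects the three-to-four-minute expiry described above, and the 300 second window reflects the planned five-minute upper bound for the maintenance. The function names, the fleet size of 100, and the uniform distribution are assumptions made for illustration; this is not Proton's actual code.

```python
import random

TTL_MIN_S = 180.0      # three minutes, per the description above
TTL_MAX_S = 240.0      # four minutes, per the description above
MAINTENANCE_S = 300.0  # planned upper bound for the maintenance window

def expiries_after_reset(num_servers, rng):
    """Expiry times, in seconds after t=0, when every cache is reset at t=0."""
    return [rng.uniform(TTL_MIN_S, TTL_MAX_S) for _ in range(num_servers)]

def expired_during_maintenance(expiries, maintenance_s=MAINTENANCE_S):
    """Count the caches whose expiry falls inside the maintenance window."""
    return sum(1 for t in expiries if t < maintenance_s)

rng = random.Random(0)                  # seeded only to make the sketch repeatable
expiries = expiries_after_reset(100, rng)  # hypothetical fleet of 100 servers
spread = max(expiries) - min(expiries)
count = expired_during_maintenance(expiries)
print(f"{count} of {len(expiries)} caches expire during maintenance, "
      f"all within a {spread:.0f}-second span")
```

Because every cache restarts its timer at the same moment, every expiry lands in the same one-minute span, and that span sits entirely inside a five-minute maintenance window.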

This configuration cache normally should not be needed at all during offline mode. However, there was a bug in our application code where authenticated requests (those of logged-in users, the vast majority of requests) erroneously required this data, even when offline.

What this meant was that offline mode turned into a ticking time bomb. After about three minutes, on every server, the configuration cache expired, triggering requests to a database which at this point was still unreachable. This meant that requests that usually took milliseconds to handle now required up to 10 seconds as they waited for a timeout. The spike in response times caused API requests to pile up, triggering a massive memory and CPU usage spike which rendered all affected servers unresponsive in a matter of seconds.
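
The pile-up can be approximated with Little's law, which says the average number of requests in flight equals the arrival rate multiplied by the time each request spends in the system. The sketch below is only a back-of-the-envelope illustration: the 10 second timeout comes from the paragraph above, while the arrival rate of 1,000 requests per second and the 10 millisecond baseline are hypothetical figures, not measurements from the incident.

```python
def in_flight_requests(arrival_rate_per_s, latency_s):
    """Little's law: average concurrency = arrival rate x time in system."""
    return arrival_rate_per_s * latency_s

ARRIVAL_RATE = 1000.0      # hypothetical requests per second
NORMAL_LATENCY_S = 0.010   # hypothetical "milliseconds" baseline
TIMEOUT_LATENCY_S = 10.0   # worst case from the post: waiting for a timeout

normal = in_flight_requests(ARRIVAL_RATE, NORMAL_LATENCY_S)
stalled = in_flight_requests(ARRIVAL_RATE, TIMEOUT_LATENCY_S)
print(f"in flight normally: about {normal:.0f}")
print(f"in flight while waiting on timeouts: about {stalled:.0f}")
```

Under these assumed numbers, the same traffic holds roughly a thousand times more requests open at once, each one consuming memory and CPU while it waits.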


We do have some protections in place against “thundering herd” problems such as this, but for reasons which are still under investigation, they were not sufficiently effective to prevent the problem.

To recap, here is the sequence of events and errors:

  1. Each server manages its own cache of configuration data which expires at semi-regular intervals. This data should not be required in offline mode.
  2. Because of a bug, authenticated requests to the servers (the vast majority of requests) triggered dependence on this cached data.
  3. The cached data expired three or four minutes after offline mode was entered, which triggered many requests to an offline database.
  4. This flood of requests resulted in memory and CPU usage spikes on the API servers, which rendered them unresponsive.

Fixes

Our first priority was to fix the two code issues that caused the immediate problem. Authenticated requests no longer depend on the database in offline mode, and the configuration cache is being reworked to continue using the “stale” data in the event of a refresh failure rather than clearing it entirely. This will fully prevent this particular issue from recurring in this manner.
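
As an illustration of the second fix, the following is a minimal sketch of a cache that keeps serving stale data when a refresh fails. The class name, the `loader` callable, and the `retry_s` back-off are all hypothetical choices made for this example; the actual rework may look quite different.

```python
import time

class StaleOnErrorCache:
    """Cache that falls back to the last good value if a refresh fails."""

    def __init__(self, loader, ttl_s, retry_s=5.0, clock=time.monotonic):
        self._loader = loader    # hypothetical callable that fetches fresh data
        self._ttl_s = ttl_s      # how long a fresh value is considered valid
        self._retry_s = retry_s  # wait this long before retrying after a failure
        self._clock = clock
        self._value = None
        self._loaded = False
        self._refresh_at = 0.0

    def get(self):
        now = self._clock()
        if self._loaded and now < self._refresh_at:
            return self._value
        try:
            value = self._loader()
        except Exception:
            if not self._loaded:
                raise  # nothing to fall back on yet
            # Keep the stale value and delay the next attempt, so a failing
            # backend is not hit on every single request.
            self._refresh_at = now + self._retry_s
            return self._value
        self._value = value
        self._loaded = True
        self._refresh_at = now + self._ttl_s
        return value
```

The short retry delay is a design choice in this sketch: without it, every request arriving after expiry would trigger its own failing refresh, which recreates the pile-up the stale fallback is meant to avoid.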

We are also continuing our efforts to fully simulate the failure so that we can build in additional safeguards at different levels of our technology stack. This defense in depth is essential for ensuring that one bug cannot create a fatal cascade of errors.

This outage has also revealed that we need to stress test our infrastructure and code more thoroughly under abnormal conditions, such as when various dependent services are offline. Offline mode tests, for instance, must include various types of requests, including authenticated ones, and should have run long enough to trigger the cache expiration issue. It also highlights the need for more regular testing. While offline mode had been tested in the past, Proton’s traffic grew so dramatically between the last test and the incident on Feb. 1 that more tests should have been conducted, even though none of the software involved had changed.
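
A test along these lines might look like the sketch below. It uses a fake database and a manually advanced clock so the test can run past the cache expiry without waiting, and it sends both authenticated and unauthenticated requests. Every name here (`FakeDatabase`, `ApiServer`, `run_offline_mode_test`) and the 503 status for offline mode are hypothetical stand-ins, not Proton's real components.

```python
class FakeDatabase:
    """Stand-in database that counts calls and can be switched off."""

    def __init__(self):
        self.online = True
        self.calls = 0

    def load_config(self):
        self.calls += 1
        if not self.online:
            raise ConnectionError("database unreachable")
        return {"example_setting": True}


class ApiServer:
    """Toy server with a config cache and an offline mode (fixed behavior)."""

    def __init__(self, db, config_ttl_s):
        self.db = db
        self.config_ttl_s = config_ttl_s
        self.offline = False
        self.now_s = 0.0  # fake clock, advanced by the test
        self._config = None
        self._expires_at = 0.0

    def _config_data(self):
        if self._config is None or self.now_s >= self._expires_at:
            self._config = self.db.load_config()
            self._expires_at = self.now_s + self.config_ttl_s
        return self._config

    def handle(self, authenticated):
        # `authenticated` is accepted so the test can exercise both request
        # types; in this fixed version neither path needs config when offline.
        if self.offline:
            return 503
        self._config_data()
        return 200


def run_offline_mode_test(duration_s=600.0, step_s=10.0):
    """Run offline well past the cache TTL and report any database access."""
    db = FakeDatabase()
    server = ApiServer(db, config_ttl_s=180.0)
    assert server.handle(authenticated=True) == 200  # warm the cache
    db.online = False
    server.offline = True
    calls_before = db.calls
    statuses = []
    elapsed = 0.0
    while elapsed < duration_s:
        server.now_s += step_s
        elapsed += step_s
        statuses.append(server.handle(authenticated=True))
        statuses.append(server.handle(authenticated=False))
    return statuses, db.calls - calls_before
```

The test passes only if the returned database call count is zero and every status is the offline response; a handler that still consulted the cache for authenticated requests would show extra database calls once the simulated clock passed 180 seconds.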

We are also changing our standard operating procedures for maintenance operations to ensure that more engineers are on call in case something unexpected happens. One of the reasons for the longer recovery time during this incident was the time required to call in additional engineers to accelerate the recovery.

This was a highly unusual downtime and stressful both for our team and our users. We apologize for this inconvenience, and we’re grateful for your trust as we work to provide the highest levels of privacy, security, and reliability to the Proton community.

Best regards,
The Proton Team

***

Feel free to share your feedback and questions with us via our official social media channels on Twitter and Reddit. Note that while blog comments also remain open, questions and feedback will not be responded to individually. Where relevant, we will incorporate the most frequently asked questions or comments into a blog update.



22 comments on “Update about the service interruption on Feb. 1, 2021”

  • Many thanks for your open response and especially for finding the issues and fixing them. This serves to make Proton products better and improve public trust in all that you do.

  • This explanation is great. Thank you. I also had service loss from AT&T earlier this week. No information whatsoever from them about what was going on. Fairly new user to Proton, but I continue to be more and more impressed.

  • Thank you for this detailed insight into the outage; this is what I call transparency at its finest. I’m glad I switched to your services and plan to stay for a long time.

  • I’ll echo the appreciative comments already provided. Such coordinated restorative work and informative explanations are rare. I’m glad I made the switch to Proton.

  • Thank you for the TRANSPARENCY that seems to be lost today. The most important and valuable parts of our lives in today’s crazy society have been kicked aside by those we think we can or should trust. Others want all our money and privacy, and Proton is only 1 of 2 companies I know of that exceed 5/5 stars for all services provided. When I am working again I will be upgrading.
    Thank you

  • Thanks for the explanations. I am considering an account. As I am unfamiliar with ProtonMail, I have some questions currently:

    Is ProtonMail a separate email account or does it route other emails?
    If it routes other emails, how many may be routed through one PMail account?
    If separate, how many separate accounts may be established for one entity?
    Are PMail accounts for charitable organizations also free?

  • Finally. A public relations RFO that’s not some vague “Sorry it won’t happen again.”

    You guys and your service are wonderful. Keep up the good work.

  • I think the honesty and transparency here is amazing. It is good you have been open, learned lessons from what happened and taken steps to minimise/stop this in the future.

  • Thank you for the TRANSPARENCY that seems lost today. Be careful with the cache technology. Some other companies are also struggling with this kind of configuration and integration. So keep on going with your great work.

  • The fact that ProtonMail said, hey, this is what was happening, something weird happened as a result, and you all said what it was, just being honest and open, makes me all that much happier that I switched in October of 2020. Things do happen; that is just inevitable. But at least I know that if ProtonMail is having issues, they will say something about it in a couple of hours, so I am not too worried about it or left in the dark.

  • A huge THANK YOU to ProtonMail for being such a trusted and transparent service!! I absolutely love the Proton stack of services, I’m always eagerly looking for new updates and features. Looking forward to buying more services and features when available!
