Incident report regarding the March 15th, 2018 technical issues

On the morning of March 15th, 2018, the ProtonMail API was accidentally taken offline and it took approximately one hour to restore services. We have thoroughly investigated the incident from both a policy and engineering perspective. Here is what happened, why it was not fixed sooner, and what we are doing to prevent this from ever happening again.

The incident began with a typo in an anti-spam subsystem configuration, which the on-shift engineer committed as part of our routine anti-spam operations. Integration tests were run but, due to several circumstances, did not detect the error. The broken patch was then deployed to our production servers on a rolling basis. Because the API was not returning the correct response, our load balancers dropped the production web servers one by one until none remained in service. As a result, ProtonMail users started receiving 503 errors.
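The failure mode described above can be illustrated with a small simulation. This is only a sketch, not ProtonMail's infrastructure: the `Backend` and `LoadBalancer` classes, the server names, and the health-check logic are all assumptions made for illustration. It shows how a rolling deployment of a broken patch empties a load-balanced pool one server at a time until only 503 responses are possible.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    name: str
    healthy: bool = True  # whether the API on this server answers correctly


@dataclass
class LoadBalancer:
    backends: list = field(default_factory=list)

    def run_health_checks(self) -> None:
        # Drop every backend whose health check fails.
        self.backends = [b for b in self.backends if b.healthy]

    def handle_request(self) -> int:
        # With no backend left in the pool, the only possible answer is 503.
        return 200 if self.backends else 503


def rolling_deploy_broken_patch(lb: LoadBalancer, servers: list) -> list:
    """Deploy a broken patch one server at a time.

    Returns the HTTP status a user would see after each deployment step.
    """
    statuses = []
    for server in servers:
        server.healthy = False   # the patched API stops answering correctly
        lb.run_health_checks()   # the load balancer drops the failing server
        statuses.append(lb.handle_request())
    return statuses


if __name__ == "__main__":
    servers = [Backend("web1"), Backend("web2"), Backend("web3")]  # hypothetical names
    lb = LoadBalancer(list(servers))
    print(rolling_deploy_broken_patch(lb, servers))
```

In this toy model the service keeps answering until the last server is patched, which is why a rolling deployment can hide a bad change until it is too late to stop.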

This, fundamentally, is what brought down ProtonMail. From an engineering perspective, the obvious conclusion is that integration tests performed were clearly insufficient to detect the kind of error introduced, and we have adjusted our deployment procedure accordingly.

The second part of the story is why it took an hour to restore services. The reasons boil down to a combination of bad timing and monitoring issues. The bad timing was that it happened during the shift change (from night to day) which delayed the response. We have instituted new policies to prevent these kinds of deployments too close to the shift change.
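A policy like this can be enforced mechanically with a deployment freeze window. The sketch below is an assumption about how such a guard might look, not a description of ProtonMail's tooling: the shift-change times, the 60-minute buffer, and the function name `deployment_allowed` are all hypothetical.

```python
from datetime import datetime, time, timedelta

# Hypothetical values: the real shift-change times and buffer are not given in this report.
SHIFT_CHANGES = [time(8, 0), time(20, 0)]
FREEZE = timedelta(minutes=60)


def deployment_allowed(now: datetime) -> bool:
    """Return False if `now` is within FREEZE of any shift change."""
    for change in SHIFT_CHANGES:
        # Look at the previous, current, and next day so that a freeze
        # window that spans midnight is still detected.
        for day_offset in (-1, 0, 1):
            change_at = datetime.combine(
                now.date() + timedelta(days=day_offset), change
            )
            if abs(now - change_at) < FREEZE:
                return False
    return True
```

A deployment script could call `deployment_allowed(datetime.now())` before starting a rollout and refuse to proceed when it returns `False`.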

The monitoring issues were also serious. No fewer than three monitoring systems should have immediately warned us that something was wrong, but none did. The first system monitors the availability of the site from an off-site location. Unfortunately, this system only monitored a subset of our domains, and those were not affected by the deployment. This system has now been updated to monitor all of our domains and subdomains.
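The coverage gap in the off-site monitor is the kind of problem a simple audit can catch. The following is a minimal sketch under stated assumptions: the `example.com` domain names are placeholders, and the helper names (`unmonitored_domains`, `check_all`, `failing`) are hypothetical. It compares the list of monitored domains against the full list and probes each domain over HTTPS.

```python
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for `url`, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 503 when the API is down
    except OSError:
        return 0  # DNS failure, refused connection, timeout, ...


def unmonitored_domains(all_domains, monitored) -> set:
    """Domains that exist but that the monitor never checks."""
    return set(all_domains) - set(monitored)


def check_all(domains, probe=http_status) -> dict:
    """Probe every domain and map it to the status code observed."""
    return {d: probe(f"https://{d}/") for d in domains}


def failing(results: dict) -> list:
    """Domains whose status is outside the 2xx/3xx range."""
    return sorted(d for d, status in results.items() if not 200 <= status < 400)
```

Running `unmonitored_domains` against an authoritative domain inventory on a schedule would flag any domain that is added later but never wired into the monitor.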

The second monitoring system monitors changes in event rates to detect spikes and dips that should be brought to our attention. We can call this system the rate monitor. As much of the main API system was offline, the rate monitor should have notified us that event rates had flatlined. It did not, however, because it had been taken out of production earlier in the day for maintenance and had failed to come back up automatically. The monitoring for the rate monitor server itself also did not function properly because it had been disabled in advance of the maintenance work.
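Two ideas from this paragraph lend themselves to a short sketch: detecting a flatlined event rate, and watching the watcher with a heartbeat. The thresholds and function names below are assumptions chosen for illustration, not details of the actual rate monitor.

```python
from statistics import mean
from typing import Optional


def rate_alert(history: list, current: float, tolerance: float = 0.5) -> Optional[str]:
    """Compare the current event rate to the recent average.

    Returns "spike" or "dip" when the rate deviates from the baseline by
    more than `tolerance` (a fraction of the baseline), otherwise None.
    """
    if not history:
        return None  # nothing to compare against yet
    baseline = mean(history)
    if baseline == 0:
        return None  # no traffic expected, so no meaningful ratio
    ratio = current / baseline
    if ratio > 1 + tolerance:
        return "spike"
    if ratio < 1 - tolerance:
        return "dip"
    return None


def monitor_is_stale(last_heartbeat: float, now: float, max_silence: float = 120.0) -> bool:
    """Heartbeat check for the monitor itself.

    If the rate monitor has not reported in for `max_silence` seconds,
    a separate system should raise an alert.
    """
    return now - last_heartbeat > max_silence
```

A flatlined API would show up as `rate_alert(history, 0)` returning `"dip"`, while `monitor_is_stale` covers the case where the rate monitor itself never came back after maintenance. The heartbeat check only helps if it runs somewhere that is not disabled together with the monitor.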

Finally, we have yet another system that monitors API error logs. However, the class of error that caused the downtime is logged to a specific error log which is normally empty, so we were not monitoring that at all. This has also now been corrected.
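A log that is normally empty is easy to monitor, because any new content at all is a signal. The sketch below assumes a plain file on disk and a hypothetical helper name, `new_log_bytes`; the real log location and alerting mechanism are not described in this report.

```python
import os


def new_log_bytes(path: str, last_size: int = 0) -> int:
    """Return how many bytes were added to the log since the last check.

    For a log that is normally empty, any value above zero deserves an alert.
    """
    try:
        size = os.path.getsize(path)
    except FileNotFoundError:
        return 0  # the log has not been created yet
    if size < last_size:
        # The file was rotated or truncated; treat everything in it as new.
        return size
    return size - last_size
```

A periodic job could remember the size from the previous run and alert whenever `new_log_bytes(path, last_size)` is greater than zero.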

In summary, too many things went wrong simultaneously. Had any one of our safeguards detected the problem, this incident could have been avoided or quickly fixed. The fact that they did not detect the problem, or did so too late, is our mistake. We accept that responsibility and have taken the policy and engineering steps necessary to prevent this kind of incident from happening again. We know how important reliability is to our users, and while some incidents are outside of our control, this one was not. We have learned from these mistakes and will use them to make ProtonMail even more reliable.

Best Regards,
The ProtonMail Team

About the Author

Bart Butler

Bart is the CTO of Proton Technologies AG and an expert in email encryption. Previously, Bart was a physicist at CERN working on the ATLAS experiment. He was also a postdoctoral researcher at Harvard and received his PhD in Physics from Stanford University.

11 comments on “Incident report regarding the March 15th, 2018 technical issues”

  • Thank you for the incident report regarding the March 15th, 2018 technical issues. I’m not very technical so I have no idea what the ProtonMail API does or does not do. So please forgive me if this is a stupid question: was any email lost because of this incident?

  • Great communication guys,

    A misconfiguration can always happen, and monitoring is always good until you realise that you are not monitoring the thing that failed.

    The important part of this is that you were able to identify the issues with the configuration and monitoring and get back up and running in an hour. Although this may be a long time for you guys, other companies far too often take much longer in the same situation.

    The fact that you now have been able to put better monitoring and process in place is what it’s all about, refine, refine, refine!

    Good Job Guys!

  • THIS is what we want! I don’t care if something is not working for 1 hour, but I DO care about transparency.
    This thing is to me yet another “white ball” for you, not a bad thing.

    All good things in life are NOT perfect. The most beautiful and shiny apple is what you should avoid.

  • The response was honest. As we say in Brazil, trust is earned over time; the honesty and transparency of owning up to a mistake are a sign of professionalism and care for your users. Good work.
