As some of our customers may be aware, our website suffered an outage yesterday between 11:00am and 6:30pm. During this time our website, and consequently NodePanel, was inaccessible. Customer servers continued to operate, and our staff were still able to manage them. Our new infrastructure separates the website interface from our backbone API, so when our web server has problems, our core logic keeps running and only the interface our customers interact with goes down.
Outage Details
We recently moved to a new provider that offers DDoS protection for our website. While our new provider offers great protection from attacks, it isn't built with any redundancy to protect us from hardware issues or network problems with an upstream provider. The server that hosts our website had a hardware failure, which knocked us offline for the aforementioned time.
Website Restoration Issues
We had a contingency plan in place to handle an event such as this; however, we overlooked two important things in our original plan:
First, while the plan lets us move the customer-facing website, it does not cover an outage of our billing service. Second, a series of live changes had not been pushed to our internal git repository, which prevented us from rebuilding the site on another server as a temporary solution. While we normally avoid making live changes, this week brought a number of minor bug fixes that needed to be deployed immediately.
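To illustrate how live edits like these could be caught early, here is a minimal sketch, assuming the site is served from a git checkout. The function names are hypothetical and this is not a description of our actual tooling. It relies on `git status --porcelain`, which prints one line per changed file, and it only detects uncommitted edits, not commits that exist locally but were never pushed.

```python
import subprocess
from pathlib import Path


def parse_porcelain(output: str) -> list[str]:
    """Return the paths listed in `git status --porcelain` output."""
    paths = []
    for line in output.splitlines():
        if len(line) > 3:
            # Each line is two status characters, a space, then the path.
            paths.append(line[3:])
    return paths


def uncommitted_changes(repo: Path) -> list[str]:
    """List files in a deployed checkout that differ from the last commit."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_porcelain(result.stdout)
```

A check like this could run on a schedule against the live web root and raise an alert whenever the returned list is not empty, so that hot fixes are committed and pushed soon after they are made.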
Future Prevention Mechanisms
Whilst the outage didn't affect our customers' servers directly, it prevented customers from fully managing their services and ultimately affects the trust they place in our brand. We are currently working to deploy redundant infrastructure to better guard against an outage if one server or provider has problems in future. Furthermore, a better update schedule and development routine may give us a better set of options should this issue arise again.
In Closing
We feel it's important for us to be open and transparent regarding these types of issues. We want our customers to celebrate the victories we share when things go well, but we also feel that we owe it to them to share our failures, as without this there is no trust in the customer-provider relationship. We appreciate our customers' patience and understanding in these matters. While this wasn't the ideal learning scenario, this level of openness is what you can expect from us in any future incidents of this kind, which we hope will be few.