As most of you are likely aware, CakeMail experienced significant downtime yesterday during the late afternoon and early evening. We pride ourselves on the reliability and availability of our service, and yesterday we did not live up to that standard.
The downtime was caused by routine maintenance on the cooling system at our data center, which raised the temperature in our server cage. As a result, our data storage switches overheated and failed. These switches provide access to our main data storage, and all of our redundant switches (and servers) in this location were affected, which made the CakeMail service unavailable to all users.
How We’re Fixing It (and Making Sure It Doesn’t Happen Again)
The cooling system in this data center has been adjusted to ensure that the switches will continue to perform at optimal levels. A new monitoring system has also been added that will alert both the data center and our operations team if temperatures rise above acceptable levels, so that we can intervene before a problem occurs.
We’re also working with Dell (the manufacturer) to determine whether the switches are operating under acceptable conditions and are optimized for performance. They will be replaced if necessary. In addition, we’ll be adding offsite data center redundancy to eliminate the risk of this problem occurring again.