I know I don’t get the opportunity to post very often on LinkedIn, but I wanted to take a moment to share something truly incredible that happened this week in the life of a Tier 2 Support Specialist at Miles IT.
Picture this: a company is migrating from Workspace to Microsoft 365, and it isn’t going as well as hoped. It’s Wednesday and, aside from the migration, everything is running smoothly for said company. All systems cruising along. Then, around 10:30am, we get a ping that a server is down at the datacenter… and another… and another… until we get a ping that all 400 servers in the company’s datacenter are now down.
I immediately jumped into action and dispatched an Emergency Onsite to the datacenter, fearing that some sort of hardware, Internet, or power failure was causing the downtime. Our wonderful Helpdesk Services team quickly got a technician en route for 11:30am.
Around this time, I went to add the technician to the datacenter security page. That’s when we discovered there had indeed been a power-related event at the datacenter, which took down the company’s entire cage.
We arrived onsite at the datacenter at 11:30am and found the power still out. We waited until it was finally restored at 12:30pm, then had to wait another 30 minutes to get access to the cage. Finally, we were in. Some servers came back after power was restored, but not all of them started successfully. We found that two Power Distribution Units (PDUs) in our server rack had failed. Our onsite technician was able to switch some things around and rebalance the load across the other two PDUs in the cage. At last, the entire Nutanix cluster had power and was booting up.
I am well regarded as “The Nutanix guy,” so I started working on restoring cluster health. All nodes came back online (thank goodness!), but after running a cluster health check, the Cassandra and Genesis services weren’t showing as available on any of the nodes.
From there, we quickly raised a P1 case with Nutanix. Now, if you haven’t had the pleasure of interacting with Nutanix support, you’re truly missing out. Every time I get the opportunity to work with the wonderful team at Nutanix, I’m reminded that great customer service is *not* dead. A technician joined the call within about 15 minutes of the ticket being created, quickly identified the issue, and helped us bring all the nodes back to “Healthy” status.
Now came the real work: getting all the virtual machines back up and running. This is where our real workhorse Joshua Whitlow comes in. He is our “Linux God” in every sense of the word. I truly couldn’t be more grateful for him as a friend and teammate, and for his willingness to help whenever incidents like these happen.
Josh and I tag-teamed getting all critical services back online. He worked his magic on the Linux servers, while I worked to restore the Windows servers, file servers, Dizzion Frame (think Citrix), and Cisco Identity Services Engine for the rest of the company’s infrastructure. I was able to get all the core Windows servers back up and verify their stability. At that point, the two of us had mission-critical services running again about 3 hours after power was restored, and 6 hours after the initial outage.
Josh and I worked tirelessly until 9:00pm that night to restore the last few lingering services. Truly an incredible feat, when you consider that it was just the two of us (plus the onsite technician) working on this since 12:30pm.
If someone had told me a few years ago that I would be administering a cluster of servers that over 1,500 users rely on, I would have called them crazy. I truly pat both of us on the back for this incredible feat, and am so thankful that Josh was my copilot in this endeavor.