This is meant to be a mix of my real life as a software engineer, applied to a fictional IT corporation: EvilCorp.
First things first. What is an SRE? SRE stands for "Site Reliability Engineer". SRE is a job position. Usually, an SRE is responsible for the reliability and availability of all of a company's IT services. An SRE is basically a software engineer who spends 50% of the time developing amazing software products and the other 50% on tasks such as developing automations, handling IT operations, being on call (typically one week per month), working on tickets, creating alarms and metrics, designing system solutions, etc.
As part of the SRE team at EvilCorp, I put pressure on myself every day to get the best possible performance out of our systems.
Now, I bet you're wondering, "What is EvilCorp?" EvilCorp is the most f*****g awesome (and evil) corporation in the whole world. Our CEO, Steve Moss, is an evil genius. Our target: every person in the world has to use our software products (apps, social networks, ...) and/or our devices (smartphones, laptops, ...).
"Why is EvilCorp so evil?" you might ask. Well, every time you share any kind of data inside our systems (posts, pictures, likes, geolocations, ...), our evil bots are tracking your activity, analyzing your behavior, and building evil models that describe you better than you know yourself. EvilCorp uses all of this data in the most evil way possible, of course. What did you think? Everything in this corporation is evil xD
This is one of those usual days at this amazing (and so deeply evil) corporation...
Monday.
I got up early this morning. After drinking an obscene amount of coffee, I started the day by checking my email. What the heck! I had a lot of emails in my inbox, far more than usual. Something had crashed last night. Many emails were from our alarm system, and others were from the IT Support Team. It seemed the incident had escalated rapidly.
Bob, my co-worker who was "on call" last week, would have all the relevant information about this. Definitely not the way to start a week. I set up an online meeting with Bob because I wanted to catch up with him as soon as possible. Our main analytics service had crashed around midnight; something in the network hadn't worked well. This is a bad thing. We need near real-time analytics all the time, and if that service fails, many other services can't do their job.
Bob shared his findings with me. It seems the root of the issue was related to the main on-premises DNS servers, which had been in a failed state for 2 hours. After my conversation with Bob, I understood the problem.
I needed a quick word with Mr. Brown, the head of both the IaaS and Networking departments, so we had a chat. Mr. Brown is a man who always seems to be in an emotional state somewhere between sad and angry. He told me they need to scale their DNS architecture as soon as possible. On my side, I needed to ensure this problem with the real-time analytics won't happen again if those centralized DNS servers fail.
After a short sync meeting with my team, we decided to put a DNS cache on our servers. This solution is cool: if the DNS servers go to hell again, we shouldn't be affected, at least for one or two hours, depending on the cache's TTL (time to live).
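For the curious, here is a minimal sketch of the idea in Python. It is not our real setup (in practice you would more likely run a caching resolver on each host), and every name in it, including `TtlDnsCache` and the example hostname, is made up for illustration. It only shows why a TTL cache buys you time: while an entry is still fresh, the upstream DNS servers are never contacted, so their outage goes unnoticed.

```python
import socket
import time


class TtlDnsCache:
    """Tiny in-process DNS cache (hypothetical): answers are reused until their TTL expires."""

    def __init__(self, ttl_seconds, resolver=socket.gethostbyname, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.resolver = resolver  # function: hostname -> IP string (the "upstream" DNS)
        self.clock = clock        # injectable so the cache can be tested without waiting
        self._entries = {}        # hostname -> (ip, expires_at)

    def resolve(self, hostname):
        now = self.clock()
        entry = self._entries.get(hostname)
        if entry is not None and now < entry[1]:
            # Fresh cache hit: upstream is not contacted, so an outage there does not matter.
            return entry[0]
        # Miss or expired entry: ask upstream. This raises if upstream is down.
        ip = self.resolver(hostname)
        self._entries[hostname] = (ip, now + self.ttl_seconds)
        return ip
```

With `cache = TtlDnsCache(ttl_seconds=3600)`, a call like `cache.resolve("analytics.evilcorp.example")` hits the real DNS once and then answers from memory for the next hour. Once the TTL runs out, a lookup during an outage fails again, which is exactly the "one or two hours" of protection mentioned above.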
After we tested the solution, it was automated and properly deployed by the end of the day in all our environments (development, user acceptance, and production). Of course, to measure the solution's effectiveness, we set up metrics and alarms to study its behavior.
EvilCorp will have its data for one more day... Another day, I will tell you what we do with that data :P
Wish me luck, this is my on-call week xD