When was the last time you needed to Google something and Google wasn’t there?
Odds are, you don’t remember that ever happening. Sure, there are times when you can’t reach Google because your internet connection is down. But Google’s primary online services, from its search engine to Gmail to Google Docs and more, are nearly always accessible. The company’s Google Apps suite, including Gmail and Docs, was available about 99.97 percent of the time in 2015, according to the company’s own numbers. The world pretty much takes this for granted, but it’s a remarkable reality. The billions who use Google hardly stop to consider how Google made something so impressive seem so mundane.
Google explains the feat in three words: Site Reliability Engineering. OK, they aren’t the best three words. But that’s the rather unsexy name Google gave to this seminal philosophy more than a decade ago. It’s a rather nuanced and expansive philosophy, but it really boils down to one central idea: Don’t get IT people who specialize in running Internet services to run your Internet services. Have software coders run them instead. If you do this, the thinking goes, the software coders will build tools that can help run the operation without the active involvement of real live people.
“The result of our approach,” writes Googler Ben Treynor Sloss in a new essay, “is that we end up with a team of people who will quickly become bored by performing tasks by hand and have the skill set necessary to write software to replace their previously manual work.”
For many in Silicon Valley, that may seem like a common idea. This kind of thing is now practiced across the tech world, from Amazon to Box.com. People call it DevOps (“development” plus “operations”), an effort to combine the ways of the software coder with the aims of the systems administrator. But the DevOps movement, embodied by tools like Chef and Puppet, evolved separately from and largely after the SRE philosophies that arose inside Google (and similar ideas that took hold at Amazon). It’s just that Google has kept largely quiet about this over the last decade, as it often did when the topic was the inner workings of its enormously efficient online operation.
But the company has entered a new period, one in which it’s more willing to discuss such things (mainly because it wants to promote the cloud services that allow outside businesses to run their own software atop its vast network of data centers and machines). Google has even gone so far as to write a book about Site Reliability Engineering.
The book is called, well, Site Reliability Engineering. It was just published by O’Reilly, and the essay from Sloss serves as the first chapter. If you’re into DevOps, it’s a must-read. And even if you’re not, the opening of the book (the preface, the introduction, and the first chapter) is a fascinating look at the attitudes that drive the world’s largest online empire.
For many in tech, and almost everyone outside of tech, system administration (or operations or whatever you want to call it) is an afterthought, one of the more boring aspects of computer technology. But Sloss, officially known as Google’s Vice President for 24/7 Operations, turns this notion upside down, arguing that site reliability is “the most fundamental feature of any product.” After all: “A system isn’t very useful if nobody can use it.”
Sloss is ground zero for the SRE movement. It began when Google hired him to run its operations, and it was he who coined the term. “SRE is what happens when you ask a software engineer to design an operations team,” he says. “I designed and managed the group the way I would want it to work if I worked as an SRE myself.”
For Todd Underwood, now an SRE director at Google, it’s only natural that the company would hire a coder like Sloss for the job. “When Google was in its infancy, there were so many software engineers who had a better sense of how things broke and a better sense of how engineering could be done well,” he tells WIRED. “But not one of them wanted to do any of that by hand.”
That’s a very Googly thing to say. But Adam Jacob, chief technology officer at Chef, pretty much agrees, explaining that this is the expected transition for an online operation that’s growing to such a large size. “It’s natural to have a conversation to combine software development and the practical pieces of operation, and to have no real divide between the two,” he says. “When you look at the problem holistically, you get better results.”
The shift is particularly interesting when you consider that dev and ops were traditionally opposing forces. The devs wanted to build new software, change it, and get the changes out to the public as fast as possible. But the ops folks wanted to ensure that nothing went wrong, and the best way to do that was to keep changes to a minimum. “These are incommensurate goals,” Underwood says. The trick is that, if you combine dev and ops, you can start to eliminate their competing aims.
Underwood calls it a “Hegelian thesis-antithesis synthesis.” He then acknowledges that when he says this, no one really buys it. “People just don’t read Hegel anymore,” he quips. But the description is spot on. And once this synthesis was in place, Google accelerated the process by adding all sorts of other Googly ideas to the mix.
The Error Budget
One big idea is that, in an effort to reduce the conflict between dev and ops, the company doesn’t strive for 100 percent uptime. The reality, Sloss writes, is that you don’t need an internet service to be 100 percent available. Users can’t really tell the difference between 100 percent and, say, 99.999 percent (their laptop or WiFi or electricity or ISP is down far more than 0.001 percent of the time). If you set a reasonable uptime goal below 100 percent, the shortfall you allow yourself becomes an “error budget,” and you have more room to make changes and roll out experiments.
“The use of an error budget resolves the structural conflict of incentives between development and SRE,” Sloss says. “An outage is no longer a ‘bad’ thing. It is an expected part of the process of innovation, and an occurrence that both development and SRE teams manage rather than fear.”
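To make the arithmetic behind an error budget concrete, here is a minimal sketch in Python. It is an illustration only, not Google's actual tooling or policy: the function names, the one-year measurement window, and the simple "ship only while budget remains" rule are all assumptions introduced here.

```python
# Hypothetical sketch: all names and the one-year window are assumptions,
# not anything taken from Google's SRE book or tooling.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def error_budget_minutes(availability_target: float,
                         period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime (in minutes) that a given availability target allows."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be a fraction between 0 and 1")
    return (1.0 - availability_target) * period_minutes


def budget_remaining(availability_target: float,
                     downtime_so_far_minutes: float,
                     period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return how many minutes of the budget are left (negative if overspent)."""
    budget = error_budget_minutes(availability_target, period_minutes)
    return budget - downtime_so_far_minutes


def can_ship_risky_change(availability_target: float,
                          downtime_so_far_minutes: float) -> bool:
    """Assumed policy: risky launches are allowed only while budget remains."""
    return budget_remaining(availability_target, downtime_so_far_minutes) > 0


if __name__ == "__main__":
    for target in (0.99999, 0.9997):
        minutes = error_budget_minutes(target)
        print(f"{target:.3%} target allows {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, a 99.999 percent target leaves roughly 5.26 minutes of downtime per year, while the 99.97 percent figure reported for Google Apps in 2015 corresponds to roughly 158 minutes, or about 2.6 hours.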
At the same time, the company put rules in place to ensure that SREs didn’t end up morphing into good old-fashioned sysadmins. Basically, it decreed that no SRE could spend more than 50 percent of his or her time on traditional operations as opposed to coding. If ops starts to take precedence over dev on a particular SRE team, Google shifts some of the ops load onto the team that typically just builds the software: the regular Google software engineers. “Consciously maintaining this balance between ops and development work allows us to ensure that SREs have the bandwidth to engage in creative, autonomous engineering,” Sloss writes, “while still retaining the wisdom gleaned from the operations side of running a service.”
Chef’s Jacob says that the ratio here, 50 percent, isn’t that important. But he likes the attitude. “This is just economics,” he says. “There’s always demand for people to do operational bullshit. There is an almost infinite amount of bullshit that people will ask an operational person to do. So the idea that you would put a cap on that is legit.”
Google even created strict guidelines for hiring its SREs. It hires about 50 to 60 percent through exactly the same process that applies to all other Google engineers, and the rest have about “85 to 99 percent” of the same skills, plus a “set of technical skills that is useful to SRE but is rare for most software engineers,” such as an intimate knowledge of the inside of the UNIX operating system or hardware networking protocols. This too aims to ensure that dev and ops maintain the proper balance.
The Moonshot That Keeps Google Online
In many ways, this was a new philosophy. But in their book, as they seek to describe the philosophy, the Google team uses a much older example. The spiritual forebear of the Google SREs is Margaret Hamilton, the MIT programmer who spent the ’60s building software for Apollo spacecraft that would one day land on the moon. As explained by Hamilton herself, who was interviewed for the book, part of the culture on the Apollo program “was to learn from everyone and everything, including from that which one would least expect.”
Hamilton was a coder. But she played an important role in operations. To show this, the book recounts the day Hamilton’s young daughter, Lauren, whom she often brought to the computer lab, happened to hit a button and feed an Apollo pre-launch program into a computer that was running a post-launch scenario.
This crashed the scenario, and Hamilton tried to add new error-checking code to the system that would automatically prevent this during a real flight. Her superiors rejected the idea, arguing that astronauts would never do such a thing, but on Apollo 8, the astronauts did exactly that. Luckily, Hamilton had added a workaround to the system documentation. And for subsequent missions, she added the error-checking code.
“If you come along and say ‘That’s going to break,’ it’s really not that useful. But if you say: ‘That’s going to break, and let me tell you how,’ you’ve done something amazing,” Underwood explains. “Here’s a person who saw that something was going to break and saw how it was going to break and devised a way to prevent it from breaking.”
That’s DevOps or, in Google parlance, Site Reliability Engineering. As three words, it doesn’t sound like much. But it’s an enormously powerful idea. It has already produced Google. But particularly philosophical SREs like Underwood have even bigger ambitions. They envision a world where operations shift even further towards code. “We long for the day,” Underwood says, “when nobody runs anything.”