Regarding AEM and SRE (Site Reliability Engineering) Culture
May 12, 2021
A few years back, someone (badly) explained Google’s operations team structure to me, telling me that Google called them “SREs” and that it meant developers were on-call, that nobody actually did operations work, and that their whole world was just developers on developers, and then developers all the way down.
As a multi-decade, traditional “ops” guy, I dismissed it as fanciful at the time, though I did start to notice companies all over the place calling their “ops” folk “SREs” regardless of whether that changed their job description. I gradually righted my misperception of what site reliability engineering is, but only just recently made time to read the O’Reilly Site Reliability Engineering book, thanks in part to the Audible version that I could listen to while on long bike rides. I wanted to share my thoughts on this as it relates to infrastructure for portly and natively cloud-unfriendly CMS implementations like Adobe Experience Manager.
What is Site Reliability Engineering?
SRE is a term (and an associated job role) for the engineers whose job it is to enable the business they’re working for by making a desired service run smoothly and reliably. This may also be what operations folks like myself thought we’d been doing all along (mom, can I be called an SRE too?), but there are a few key differences that define the SRE approach (mostly, in this case, shamelessly lifted from The Site Reliability Engineering Workbook):
- Operations is a Software Problem: The basic tenet of SRE is that doing operations well is treating it like a software problem. SRE should use software engineering approaches to solve problems in operations.
- Manage by Service Level Objectives (SLOs): SRE does not attempt to give everything 100% availability. Instead, the product team and the SRE team select an appropriate availability target for the service and its user base, and the service is managed to that SLO.
- Work to Minimize Toil: Operations work is classically rich in repetitive, required work. In SRE, the principle is that if a machine can perform a desired operation, then a machine often should. The distinction between SRE and non-SRE orgs is that in traditional ops the toil IS the job, and that’s what you’re paying a person to do, whereas in SRE any time you spend doing repetitive toil is time you’re taking away from your primary job of engineering services to be more reliable.
- Scale Services Without Scaling Employees: A side-effect of eliminating toil and increasing automation is that one ideally should be able to scale up the size of a service without always proportionally scaling up your staffing. A site should be able to handle vastly more traffic without needing simultaneously vastly more staff.
- Automate This Year’s Job Away: The real work in this area is determining what to automate, under what conditions, and how to automate it. SRE (as practiced in Google) has a hard limit of 50% on how much time a team member can spend on toil, as opposed to engineering that produces lasting value. By contrast, most “classically trained” operations folks in my line of work have been doing the same repetitive tasks, for 10 years in some cases.
- Share Ownership with Developers: Rigid boundaries between “application development” and “production” (sometimes called programmers and operators) are counterproductive. This is especially true if the segregation of responsibilities leads to power imbalances or discrepancies in esteem or pay. In SRE implementation
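To put a number on what “managing to an SLO” means, here is a minimal Python sketch of the error budget arithmetic: the downtime a service can accrue over a window and still meet its target. The 30-day window, the example SLO values and the function names are my own assumptions for illustration, not something taken from the book.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Return the downtime a service may accrue in `window` while still meeting `slo`."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return window * (1.0 - slo)


def budget_remaining(slo: float, window: timedelta, downtime: timedelta) -> timedelta:
    """Return the unspent error budget (negative if the SLO is already missed)."""
    return error_budget(slo, window) - downtime


if __name__ == "__main__":
    window = timedelta(days=30)  # assumed measurement window
    for slo in (0.99, 0.999, 0.9999):  # assumed example targets
        print(f"{slo:.2%} over 30 days allows {error_budget(slo, window)} of downtime")
```

The point of framing it this way is that the budget is something to spend: if a month’s outages have used only a fraction of it, there is room for riskier changes, and if it is gone, reliability work comes first.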
Challenges in Implementing SRE on an AEM Environment
Running an on-premise or self-hosted Adobe Experience Manager environment has some intrinsic factors that I’ve seen lead one away from an SRE-style culture and more toward one of traditional, old-school “Ops”. This means leaning more toward reactive operations, and less toward continually automating yourself out of a job and working to minimize “toil”.
AEM installations I’m working on at this very minute tend to be very manual, and run by more traditional operations methods, with the application, web server, load balancers all installed and configured manually, little-to-no configuration-as-code, etc. In this day and age this might seem shocking, but the reality is that there are several reasons why companies use such a dated and problematic operating model with such an important and expensive system which almost invariably costs millions to deploy and maintain.
It’s unfortunate, but the AEM licensing model is a big root cause, in my experience, for why companies haven’t gone to a more devops-ish approach to their AEM installations. A large majority of installations I’ve run have been 1-Author / 2-Publisher systems, mostly because that was the most that the company was willing to shell out for. If you’ve only got two Publishers (the application server in an AEM environment) and never will have more, there’s diminished benefit to automating things like instance provisioning, auto-scaling logic, and the like, as you’ll always have the same two persistent Publishers. If you always have the same two Publishers, you’ll always have the same two Apache servers fronting them, so Apache installation doesn’t “need” to be automated either. This, then, moves one in the direction of classic “Ops”, as you just have a static set of production servers that you care for, feed and know the names of, as opposed to a programmatic infrastructure that you describe as code, and expand, contract, deploy and fix with software.
The Servers are Managed By Corporate
Another big reason I come across for why SRE/DevOps culture “can’t work” on AEM is that the teams managing the AEM systems don’t even have root access on their own servers, as they’re managed by corporate IT. Obviously, it’s easy to understand why corporate may not want to give individual application teams root access to their servers, if corporate has to be the one that handles the monthly patching jobs, configures the antivirus, runs the hypervisors and the firewalls, etc.
But when you have an application team that cannot actually manage their whole stack, it severely curtails the amount of DevOps you can do. When you look at the classes of problem that you’d want to attack as an SRE, even if you know you’re limited to only two AEM publishers, you’d need to be able to have at minimum:
- Root access on any Linux system you run
- Sufficient access to any hypervisors (if you’re running in a corp datacenter) or cloud accounts to be able to provision/destroy VMs
- Sufficient API access to load balancers to be able to add and remove members from a pool, create and destroy pools and manage healthchecks
- Leeway to implement a configuration management system
- Leeway to implement monitoring & log aggregation tooling
Without this, you are forced to respond to every problem reactively. But with this, and even with the restrictive nature of AEM licensing, all manner of fabulous automation can be designed around AEM to solve routine problems, eliminate the toil of long, manual deployment processes, manual maintenance, or other neat tasks like upsizing VMs, reverting backups, or doing prod-to-lower-environment content migrations.
But without access, you’re essentially saying that you never want your teams to be able to adopt any SRE principles.
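As one example of the kind of automation that load balancer API access makes possible even with just two Publishers, here is a rough Python sketch of a rolling deployment: take one Publisher out of the pool, deploy to it, health-check it, and put it back before touching the next. Everything named here is hypothetical. `LoadBalancerPool` is an interface I made up (a real one would wrap whatever API your load balancer exposes), `InMemoryPool` is a stand-in so the sketch runs anywhere, and the hostnames, `deploy` and `is_healthy` callables are placeholders.

```python
from typing import Callable, Iterable, Protocol


class LoadBalancerPool(Protocol):
    """Hypothetical interface; a real one would wrap your load balancer's API."""

    def remove_member(self, host: str) -> None: ...
    def add_member(self, host: str) -> None: ...


class InMemoryPool:
    """Stand-in pool so the sketch runs without a real load balancer."""

    def __init__(self, members: Iterable[str]) -> None:
        self.members = set(members)

    def remove_member(self, host: str) -> None:
        self.members.discard(host)

    def add_member(self, host: str) -> None:
        self.members.add(host)


def rolling_deploy(
    pool: LoadBalancerPool,
    publishers: list[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> None:
    """Deploy to one publisher at a time so the others keep serving traffic."""
    for host in publishers:
        pool.remove_member(host)  # stop sending traffic to this host
        deploy(host)              # placeholder for the real deployment step
        if not is_healthy(host):
            # Leave the bad host out of the pool and stop before the next one.
            raise RuntimeError(f"{host} failed its health check after deploy")
        pool.add_member(host)     # healthy again: put it back in rotation


if __name__ == "__main__":
    publishers = ["publish1.example.com", "publish2.example.com"]  # placeholder names
    pool = InMemoryPool(publishers)
    rolling_deploy(
        pool,
        publishers,
        deploy=lambda host: print(f"deploying to {host}"),
        is_healthy=lambda host: True,
    )
    print(sorted(pool.members))
```

Stopping on the first failed health check is a deliberate choice in this sketch: with only two Publishers, a bad deploy should never be allowed to reach the second one.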
Monolithic Nature of AEM
Now, I wanted to mention the monolithic nature of AEM in this too, as I’ve seen it get thrown around a lot as a reason why AEM can’t be all DevOps’ish like other more container-friendly applications. It is indeed difficult to do things like auto-scaling with AEM, as one has to clone the entire state of the repository (content and code) to another machine, and that repository size is usually many hundreds of gigs, sometimes many terabytes. It is about as tractable as a wild hog when it comes to moving it around like that.
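For a rough sense of scale, here is a back-of-the-envelope Python sketch of how long it takes just to copy a repository to a new machine. The repository sizes and the sustained 100 MB/s throughput are illustrative assumptions, not measurements from any real AEM environment, and the function name is mine.

```python
def clone_time_hours(repo_gb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to copy `repo_gb` gigabytes at a sustained throughput in MB/s."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = (repo_gb * 1000) / throughput_mb_per_s  # 1 GB = 1000 MB (decimal units)
    return seconds / 3600


if __name__ == "__main__":
    # All figures below are illustrative assumptions, not measurements.
    for repo_gb in (300, 1000, 5000):
        hours = clone_time_hours(repo_gb, throughput_mb_per_s=100)
        print(f"{repo_gb} GB at 100 MB/s: about {hours:.1f} hours")
```

Even under generous assumptions, a copy measured in hours is far too slow to react to a traffic spike, which is why scale-out on demand is such a poor fit here.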
However, whilst that does present a barrier to containerization, it certainly isn’t a barrier to operating using SRE principles, and any company running AEM on-premise can work around the limitations of having to deal with its massive repository to automate all of the other things you can to solve problems of toil and repetitive work. The main goal here is to be able to have services that can scale up without having to always proportionally scale up your operations workforce as well.
To sort of sum up these challenges in implementing SRE in an AEM environment, and why having the leeway to automate is so key, I wanted to pull a quote from the Site Reliability Engineering book – where an engineer is talking about the constant battle folks have to justify whether getting a certain little something automated is worth it.
It’s worth it because it comes down to more than just raw time savings. Once you have encapsulated a manual operation into a piece of automation that can be run by anyone (or even by itself), you are no longer a slave to your system and its good days and bad days. This quote sums it up perfectly:
“If we are engineering processes and solutions that are not automatable, we continue having to staff humans to maintain the system. If we have to staff humans to do the work, we are feeding the machines with the blood, sweat, and tears of human beings. Think The Matrix with less special effects and more pissed off System Administrators.”
Joseph Bironas – Google SRE [ref]
I’ve got a lot more thoughts on this, but this is already a wall of text, so I’ll have to put it in another post. Or maybe a podcast – who wants to sit & talk AEM, DevOps and SRE with me for an hour or so?