Adobe Experience Manager (AEM) is a complex, high-end content and asset management platform that many of the largest brands in the world trust to run their massive marketing websites. As such, stakeholders in the enterprise are rightly concerned about AEM’s availability, and about the options for architecting AEM so that it will not only weather high-stress, high-traffic periods but also survive a serious physical or software disaster.

As the topics of “DR” and “clustering” come up in literally every single discussion I’ve had around architecting an AEM environment, I wanted to discuss these topics at some length in hopes that it might help you come to a decision about how you want to approach this with your company.

AEM Disaster Recovery – What Disasters Are You Planning For?

To make a disaster recovery plan, you need to decide what classes of “disaster” you are trying to plan for. In times past, “disaster” might have only meant “my server caught fire” or “my building permanently lost power” or something of this nature.

A rare image of an exploding AEM author server launching itself from a datacenter near you.

But with a complex asset management system like AEM, there are so many more things which can constitute a “disaster” and make your whole marketing engine grind to a standstill. Before laying out what your options are for making an AEM DR solution, it’s first important to understand the gory ways in which AEM can die, so that you can plan for this and be three steps ahead of it when it does happen. (Because eventually, in one of these ways, it likely will.)

The Gory Ways Your AEM Will Die

Your Tools for Mitigating Disaster on AEM

Much like engineers designing a combat system such as an aircraft carrier or a main battle tank, you will find that no single solution handles every disaster scenario. If you design a tank with armor that’s too heavy, you lose mobility and suddenly can’t cross bridges or move fast enough to evade the enemy. Armor designed to stop some types of armor-piercing rounds may be ineffective against others. Similarly, a complex “enterprise clustering solution” might sound good on paper and protect against some types of extremely rare disasters, yet be entirely ineffective at helping you through more common scenarios.

Publishers are HA, Authors Generally Not

Your first stable datum in making an AEM system is that you can scale out your publish tier horizontally basically as far as your pocketbook allows. As shown in my earlier post on AEM Architecture diagrams, the most common architecture for a production AEM environment consists of two publishers (each behind a single dispatcher server) and a single author server.

The AEM Author server, however, is a paradox: it is the single most valuable server in your marketing machine (running your whole asset library, all your content, all your touchpoints with your marketing staff, etc.), yet it is also a single point of failure. It is not possible to horizontally scale the AEM Author unless you go to a MongoDB-backed clustered author, a setup which brings with it a massive set of performance limitations, plus huge complexity and expense, including the need for a high-quality MongoDB DevOps engineering team on board.
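To make the contrast between the two tiers concrete, here is a minimal Python sketch that models the common layout described above and flags any tier with only one instance. The `Tier` class, the `single_points_of_failure` helper, and the instance counts are illustrative assumptions for this post, not part of any AEM tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """Hypothetical description of one tier in an AEM environment."""
    name: str
    instances: int  # number of interchangeable servers in this tier


def single_points_of_failure(tiers):
    """Return the names of tiers that have exactly one instance."""
    return [t.name for t in tiers if t.instances == 1]


# Assumed topology: two publishers, each behind its own dispatcher,
# and a single author server.
typical_production = [
    Tier("dispatcher", 2),
    Tier("publish", 2),
    Tier("author", 1),
]

print(single_points_of_failure(typical_production))  # ['author']
```

Adding more publishers only raises the count on the publish tier. The author stays at one instance in this model, which is why it keeps showing up as the weak spot.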

AEM Cold Standby Author Clustering – Mostly a “Security Blankie”

In CQ5 (aka AEM 5.x) there was a hot clustering option which, frankly speaking, only served to decrease the reliability of the author, and didn’t really offer increased security. With AEM 6.x, Adobe moved to a “Cold Standby” model for author clustering: authors log into and use the primary AEM Author, which actively syncs changes and data to a Cold Standby instance. That instance is up and running in a “standby” mode, only alive enough to receive synced changes, and authors cannot log into it.

AEM 6.4 with Cold Standby Author

Scenarios this will assist with:

Scenarios Cold Standby doesn’t help at all for:

“That security blanket soaks up all my fears and frustrations.”

Because all of the most likely outage scenarios are best handled by restoring from a backup or filesystem snapshot, 100% of the AEM Cold Standby authors that I have deployed over the last 3 years have gone unused. Every one of those environments has, at some point, had to be restored from a backup snapshot, so our backup and restore capabilities were certainly used. But not one of them has ever hit a scenario that called for transitioning authors over to the Cold Standby.

As such I’ve found that the Cold Standby setup is more of a “security blanket” for those who envision the rare scenario that it is actually designed for, and like the sound of the word “clustering”. Otherwise, most failure scenarios are better-handled by other means.
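Since restoring from a backup or snapshot is the recovery path that actually gets used, it helps to be clear about which snapshot you would restore. Below is a minimal Python sketch of that choice: pick the newest snapshot taken before the incident. The function name `pick_restore_point`, the nightly schedule, and the timestamps are all hypothetical and not tied to any particular backup tool.

```python
from datetime import datetime


def pick_restore_point(snapshot_times, incident_time):
    """Return the newest snapshot taken strictly before the incident, or None."""
    candidates = [t for t in snapshot_times if t < incident_time]
    return max(candidates) if candidates else None


# Hypothetical nightly snapshots and a hypothetical incident time.
snapshots = [
    datetime(2024, 1, 1, 2, 0),
    datetime(2024, 1, 2, 2, 0),
    datetime(2024, 1, 3, 2, 0),
]
incident = datetime(2024, 1, 2, 15, 30)

restore_point = pick_restore_point(snapshots, incident)
print(restore_point)             # 2024-01-02 02:00:00
print(incident - restore_point)  # 13:30:00
```

The gap between the restore point and the incident is roughly the amount of authoring work that may need to be redone after the restore, so snapshot frequency is worth deciding deliberately.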

Using your DevOps Toolset as your DR

If you don’t use the Cold Standby as your DR solution, then what? In today’s world of magical cloudiness, your best approach to mitigating disaster is to be able to use your DevOps toolbelt to rapidly and flexibly re-create a working installation around whatever disaster has just transpired.

Tools to think with:

Summary

AEM is a complex beast, and there are a ton of ways things can go wrong. But the plus is that if you’ve got a team who have seen all the ways it can break, there are a number of ways that the environment can be designed so that you can get through your career without having an I’m-going-to-be-fired-tomorrow-grade disaster that vaporizes your company’s multi-million-dollar marketing machine.

Please do reach out if you’ve got any questions or would like some help architecting your next AEM solution.