
Using Splunk for Log Search & Monitoring on AEM as a Cloud Service

Attempting to develop on and run a modern website without a log aggregator and without metrics and graphs is a clinic on infuriation and frustration. When did the such-and-such problem start? Was it before or after the deployment? Has it always been this way? Are all pages slow or just this one page? It’s throwing errors now – was it always throwing errors? Does this error correlate to anything?

These are the classes of questions that you get faced with when you only have access to a downloaded log, but are unable to search on it, graph it, and measure it across time and across multiple servers. When AEM as a Cloud Service first launched, the only mechanism to get at logs was downloading whole logs via Cloud Manager. Adobe later offered the ability to tail the logs of individual pods via a somewhat involved process using Adobe IO. But now, there’s a third and vastly better way to instrument your AEM Cloud Service infrastructure, and that’s with Splunk. And guess what, it can even be done for free, if you’re tight on cash or just trying to demo it out.

Here’s how to set it up:

What You’ll Need

In order to measure your AEM as a Cloud Service installation with Splunk, you’ll need:

A paid, licensed AEM as a Cloud Service environment: This will work on any Dev, Stage or Prod AEM as a Cloud Service environment, but it will not work on an Adobe Partner Sandbox environment. Because of the way the AEM Cloud Service sandboxes are set up, they share a lot of configuration, including the Splunk configuration. So, this only works with full, paid AEMaaCS environments.
A Splunk Free, Splunk Enterprise or Splunk Cloud environment that you have Admin access to.

How it Works

The log data from AEM as a Cloud Service is first ingested into an Adobe-owned Splunk instance, which Adobe support has access to in order to monitor and troubleshoot your environments. When properly set up, Adobe then sets up their Splunk instance to forward log information to your Splunk instance using the Splunk HTTP Event Collector or “HEC”.

Diagram of Splunk with AEM as a Cloud Service

Setting it Up

I’m going to give you the list of things to do to set it up from scratch. If you’re a big company, you likely already have a Splunk environment you’ll be using, so you can skip a bunch of steps. But even if you’re not doing a full enterprise setup, these instructions will work with the free version of Splunk, which gives you 500MB/day of logging. That is enough for many small AEM installations, dev installs and even some production Assets installations I’ve worked on.

Here’s what to do:

Install Splunk

Assuming you don’t already have a Splunk setup, you’ll first need to install Splunk. To get started, even a very small environment will be sufficient to get you off the ground. This demo was created with Splunk Free running on a single-CPU T-Series AWS VM with 8GB RAM, though any production environment will likely need significantly more resources.

Configure an Index for your AEMaaCS Data

In Splunk, go to Settings -> Indexes and create an index for your AEM Cloud Service data.


Procure an Externally-Valid SSL Cert for Splunk

Adobe requires an externally-valid SSL certificate for the Splunk HEC endpoint. A self-signed certificate will not work. Sometimes with server-to-server connections one can run SSL in a “relaxed” validation mode to allow for validation errors & self-signed certs to work, but not in this case.

For this demo, I created an SSL cert for free with Letsencrypt using Certbot on CentOS.
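Once the HEC endpoint is serving this certificate (SSL on the endpoint is configured in a later step), it can be worth confirming that a client with no relaxed settings accepts it. The following is a minimal sketch, not anything Adobe provides: the function names are made up for illustration, and it assumes the system CA bundle on the machine running it is a fair stand-in for what Adobe's forwarder trusts.

```python
import socket
import ssl


def strict_tls_context() -> ssl.SSLContext:
    """Return a context that rejects self-signed or mismatched certificates."""
    # create_default_context() loads the system CA bundle and turns on
    # both chain verification and hostname checking.
    return ssl.create_default_context()


def cert_is_publicly_trusted(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TLS handshake succeeds under strict validation."""
    context = strict_tls_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except ssl.SSLError:
        # Covers ssl.SSLCertVerificationError: self-signed, expired,
        # or issued for a different hostname.
        return False
```

Calling cert_is_publicly_trusted("splunk.myorganization.com") with your own hostname should return True for a Letsencrypt certificate served with its full chain. Connection failures (refused, timed out) are deliberately not caught, so they stay distinguishable from certificate problems.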

Configure the HTTP Event Collector

The next step is to configure the HTTP Event Collector on Splunk. Go to Settings -> Data Inputs and click “+ Add New” for HTTP Event Collector.

When configuring the endpoint:

Make sure to select the Splunk index that you created earlier so that the HEC endpoint can feed into that index.
Leave “Enable indexer acknowledgement” unchecked. If you enable indexer acknowledgement, it will end up throwing a “Data channel is missing” error from the source Splunk instance when you attempt to forward data. So – leave it un-checked.

Configure SSL on the HEC Endpoint

The next step is to configure SSL on the Splunk HEC endpoint. Note that Splunk Web (the Splunk UI you use to search & configure Splunk) and Splunk HEC have entirely separate HTTP and SSL configurations. So, turning on SSL in Splunk Web does not enable SSL for Splunk HEC. Adobe requires that HEC traffic be encrypted and prefers that it be on port 443. I didn’t push back on this too hard, so I don’t know whether they can set it up on an alternate port if you ask nicely enough.

You can first turn SSL on by going to “GLOBAL SETTINGS” in the HTTP Event Collector settings, and clicking “enable SSL” and entering in the port number.

However, you’ll then need to go into your Splunk configuration on disk in order to complete the SSL configuration.

Edit {splunk_install_dir}/etc/apps/splunk_httpinput/local/inputs.conf and ensure it has the desired port and SSL certificate configuration in there:

[http]
disabled = 0
port = 443
serverCert = /etc/letsencrypt/live/splunk.opsinventor.com/fullchain.pem
privKeyPath = /etc/letsencrypt/live/splunk.opsinventor.com/privkey.pem
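Before restarting, a quick sanity check of the stanza can catch a missing key. Here is a minimal sketch using Python's standard configparser; the function name and the list of required keys are my own choices for illustration, and it assumes the file contains only plain [stanza] sections with key = value lines (settings placed before any stanza header would make configparser raise an error).

```python
import configparser

# Keys the [http] stanza above relies on (an illustrative list, not a Splunk spec).
REQUIRED_KEYS = ("port", "serverCert", "privKeyPath")


def check_http_stanza(conf_text: str) -> list[str]:
    """Return a list of problems found in the [http] stanza (empty if none)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # preserve key case, e.g. serverCert
    parser.read_string(conf_text)

    if not parser.has_section("http"):
        return ["missing [http] stanza"]

    stanza = parser["http"]
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in stanza]
    # Treat anything other than 0/false as "disabled".
    if stanza.get("disabled", "0").strip().lower() not in ("0", "false"):
        problems.append("HEC input is disabled")
    return problems
```

Reading the file with open(path).read() and passing the text to check_http_stanza should return an empty list for the configuration shown above. It only checks that the keys exist, not that the certificate files are readable by the Splunk user.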

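Then, restart Splunk. You should then be able to test the Splunk HEC endpoint with the following curl command (the -k flag skips certificate validation, so remove it if you also want to confirm that your certificate validates the way Adobe requires):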

curl -k https://your-splunk-host.com:443/services/collector -H 'Authorization: Splunk 1e238ab6-1f9d-47d4-9b0g-81c2a47e389c' -d '
 {
    "sourcetype": "aemerror",
    "index": "aemaacs",
    "event": {
      "host": "172.27.60.76",
      "file_path": "/var/log/aem/error.log",
      "orig_time": "07.12.2020 10:07:09.895",
      "level": "INFO",
      "msg": "[FelixLogListener] Test SPLUNK",
      "pod_name": "cm-random-pod",
      "aem_program_id": "12345",
      "aem_tier": "author",
      "aem_env_type": "dev",
      "aem_env_id": "32132",
      "splunk_customer": "true"
    }
  }
'

In the curl command, replace the token “1e238ab6-1f9d-47d4-9b0g-81c2a47e389c” in the Authorization header with the “Token Value” you see in the HTTP Event Collector inputs config in Splunk (/en-US/manager/search/http-eventcollector).

As a response, you should see:

{"text": "Success", "code": 0}

This means your event collector is configured successfully.
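If you would rather script this test, the same request can be sketched in Python with only the standard library. This is an illustration under a few assumptions: the function names are hypothetical, the URL and token are whatever your own Splunk instance uses, and unlike the curl command with -k it validates the certificate, so it fails on a self-signed cert.

```python
import json
import ssl
import urllib.request


def build_hec_request(base_url: str, token: str, event: dict,
                      index: str, sourcetype: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the HEC /services/collector endpoint."""
    payload = {"sourcetype": sourcetype, "index": index, "event": event}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/services/collector",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Splunk {token}"},
        method="POST",
    )


def send_hec_event(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the request with full certificate validation and return the JSON reply."""
    context = ssl.create_default_context()  # strict: no equivalent of curl -k
    with urllib.request.urlopen(request, timeout=timeout, context=context) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, build_hec_request("https://your-splunk-host.com:443", token, {"level": "INFO", "msg": "Test SPLUNK"}, "aemaacs", "aemerror") followed by send_hec_event(...) should come back with the same Success reply shown above. Splitting build from send keeps the payload easy to inspect before anything goes over the wire.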

Open a Ticket with Adobe to have them Set Up HEC on Their Side

Once the above is done, you can open up a ticket with Adobe Support to have them begin forwarding logs to your Splunk instance.

Make sure to specifically include:

Splunk HEC endpoint address: (e.g. https://splunk.myorganization.com)
Splunk index: (the name of the index you created)
Splunk port: 443
Splunk HEC token: (a value like “1e238ab6-1f9d-47d4-9b0g-81c2a47e389c” that you would have gotten from /en-US/manager/search/http-eventcollector in your Splunk instance)
What environments you want ingested into Splunk

What Data You Should See (How you know it’s working)

In Splunk, you should then be able to do a simple search like:

index=aemaacs

And it should display results like this:

There should be (at least) 7 sourcetypes that come in to your results:

aemerror: AEM error logs from all instances
aemrequest: AEM request logs from authors & publishers (includes timing & response info)
aemaccess: AEM access logs from authors & publishers
httpdaccess: Apache Dispatcher access_log
aemqueryrecorder: AEM query debug logs
aemdispatcher: AEM Dispatcher logs (containing cache-hit data)
httpderror: Apache error_log
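One way to check the list above is to run a search such as index=aemaacs | stats count by sourcetype and compare the sourcetypes it returns against the seven expected ones. A small sketch of that comparison follows; the function name is hypothetical, and it assumes you paste or export the sourcetype names from Splunk yourself.

```python
# The seven sourcetypes listed above.
EXPECTED_SOURCETYPES = frozenset({
    "aemerror", "aemrequest", "aemaccess", "httpdaccess",
    "aemqueryrecorder", "aemdispatcher", "httpderror",
})


def missing_sourcetypes(seen) -> list[str]:
    """Return expected sourcetypes absent from `seen`, sorted alphabetically."""
    return sorted(EXPECTED_SOURCETYPES - set(seen))
```

An empty list means everything is arriving. A non-empty list is not necessarily a problem: a quiet environment may simply not have produced a given log type within the search window, so widen the time range before raising it with Adobe.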

A Sample Splunk Dashboard for AEM as a Cloud Service

To round out this demo, I created a sample dashboard for this AEM as a Cloud Service dev environment.

The top panels here show access info over time, average response time and cache-hit ratio. Since there is currently no way to view average response time over time for the cloud service (that is – until Adobe gives us access to New Relic or backend APM data) this is the only way I’ve found to get granular and find out, by page or resource, what is taking how long to process.

And above, you can get errors over time, plus a table of top 500/400-level errors, as well as a list of resources with the highest average response time.

I’ll be making a separate post shortly with the Splunk searches used to generate this dashboard – as the sourcetypes, fields and such should all be the same for any AEM as a Cloud Service implementation, meaning you should only have to swap in your own index name to get it to work.

To Repeat: This is Free, You Should Do It

All of this was created with no additional license charge with Adobe, and with free versions of Splunk. The only cost involved is for the cloud VM needed to host Splunk.

For additional info, the free version of Splunk has a few key limitations. The first is that Splunk Free does away with the login capability. So, anyone who can hit the front end of your Splunk instance can see all your data. This means that if you implement this, you’ll want to either (a) buy the real version of Splunk or (b) just lock down Splunk by IP or put Nginx/Apache in front of it with HTTP basic auth.

The second limitation is that Splunk Free is limited to 500MB/day of logging. For reference, the AEMaaCS Assets Dev environment that I am using for this demo is using, on average, about 22% of that license capacity.

So, if you were to want to get this up and running with a Free license (while you sort out how to get a purchase order through for full real-deal Splunk) you may just want to make a separate Splunk env for Dev & Stage, and another for Prod.
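The arithmetic behind that suggestion is simple enough to sketch. The 500MB limit and the 22% figure come from this post; the function names and any per-environment volumes other than my Dev number are hypothetical, so plug in what your own license usage report shows.

```python
FREE_DAILY_LIMIT_MB = 500  # Splunk Free daily indexing limit cited in this post


def daily_usage_mb(fraction_used: float, limit_mb: float = FREE_DAILY_LIMIT_MB) -> float:
    """Convert a fraction of the daily license (0.22 for 22%) into MB/day."""
    return fraction_used * limit_mb


def fits_in_free_tier(env_usages_mb, limit_mb: float = FREE_DAILY_LIMIT_MB) -> bool:
    """Return True if the combined daily volume stays within the limit."""
    return sum(env_usages_mb) <= limit_mb
```

At 22%, my Dev environment works out to roughly 110MB/day, so a Dev and a Stage of similar volume fit comfortably on one Free instance. Add a busier Prod (say a hypothetical 300MB/day) and fits_in_free_tier([110, 110, 300]) turns False, which is why a separate instance for Prod makes sense.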

I hope this helps!