Baron Schwartz giving an Epic Talk on Benchmarking, a photo by tadnkat on Flickr.
Baron Schwartz from Percona gave an amazing talk on Benchmarking. As someone who has always loved reading about benchmarks but has been pretty terrible at producing them, I found this talk fascinating, especially after my recent attempt to benchmark JBoss 4.2 against JBoss 5.1, which produced a bunch of inconclusive results.
Put simply, Baron Schwartz is a benchmarking GOD. Listen to what he says. Read his blog. This guy is benchmarking sanity personified.
Bullets from his talk:
- It’s important to establish the goals of a benchmark and the reasons for running it, and to report a legend, the distribution, response times, etc – not just throughput
- One needs a lot of info to think clearly about a benchmark
- Ideal benchmark report:
- Clear benchmark goals:
- Validating hardware config (disk / cpu / etc) – see if it matches expectations
- Compare two systems
- Checking for regressions
- Capacity planning (how will it perform at higher load than you have?)
- Reproduce bad behaviour to solve it
- With most systems you don’t want to push as far as max throughput, as at that point you’re beyond the threshold of “good behaviour”.
- Stress test to find bottlenecks
- Get specs:
- Get specs for CPU, disk, memory, network, including makes/models/etc.
- SSDs are EXTREMELY tricky to benchmark
- Versions of all software
- RAID controller / filesystem
- Disk queue scheduler:
- A lot of Linux defaults are geared towards desktop use. CFQ is the standard disk scheduler (desktop-oriented, so perf sucks on servers) instead of noop or others
- Generate some plots to summarize
- Better Aggregate Measurement:
- Average / Percentiles
- Observation duration
- 95th percentile = you can throw away the worst 1/20 of your day. That means you can throw away more than an hour of data per day (72 minutes), i.e. your system can be rock bottom performing for over an hour a day and still hit the target. Not so good for establishing an SLA or SLO (objective).
- Scatter graphs can be much more telling than a single point, as you can see if your performance is all over the map or if it returns a stable figure. E.g. SSDs have performance all over the map, and have very different performance characteristics when empty / full or at start/end of the benchmark.
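To make the point about averages and percentiles concrete, here is a minimal sketch in Python. The response times are made up for illustration (they are not figures from the talk), and `percentile` is a small hypothetical helper using the nearest-rank method.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical data: 95 fast requests (10 ms) and 5 stalls (2000 ms).
response_times_ms = [10] * 95 + [2000] * 5

mean_ms = statistics.mean(response_times_ms)
p50_ms = percentile(response_times_ms, 50)
p95_ms = percentile(response_times_ms, 95)
p99_ms = percentile(response_times_ms, 99)

# Minutes per day that fall outside a 95th-percentile target (1/20 of a day).
ignored_minutes_per_day = 24 * 60 * 5 / 100

print("mean:", mean_ms, "p50:", p50_ms, "p95:", p95_ms, "p99:", p99_ms)
print("minutes per day ignored by p95:", ignored_minutes_per_day)
```

With this made-up data the 95th percentile sits at the fast value and says nothing about the stalls, while the mean is pulled far away from what a typical request sees. Neither number alone describes the distribution, which is the argument for plotting all the points.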
- Two metrics: Throughput and Response time (tasks per time or time per task)
- They are not reciprocals
- Resource consumption is NOT a good measure of performance, e.g. CPU% / Load Avg / etc. These are indicators. They are not the goal.
- Be very careful with tools that report utilization. At 100% utilization many systems are not actually saturated.
- Try pt-diskstats from Percona
- What is a system’s actual capacity?
- Max throughput at max achievable concurrency while still delivering acceptable performance (response time).
- Most benchmarks reveal little
- If 1/20 of the work is serialized, you’ll never get more than a 20x speedup from going parallel (this is Amdahl’s law).
- Isolating bottlenecks or iteratively optimizing them is one way – but don’t optimize things that don’t matter. Don’t try to optimize little things.
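The 20x ceiling for a 1/20 serial fraction follows from Amdahl's law. A rough sketch, where `amdahl_speedup` is a hypothetical helper name and the worker counts are arbitrary:

```python
def amdahl_speedup(serial_fraction, workers):
    """Speedup predicted by Amdahl's law for a given serial fraction
    when the remaining work is spread evenly over `workers`."""
    parallel_fraction = 1 - serial_fraction
    return 1 / (serial_fraction + parallel_fraction / workers)


# 1/20 of the work is serialized; watch the speedup flatten out below 20x.
for workers in (1, 2, 8, 64, 1024, 1_000_000):
    print(workers, round(amdahl_speedup(1 / 20, workers), 2))
```

However many workers are added, the result approaches but never reaches 1 / (1/20) = 20.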
- Little’s law: concurrency = throughput * response time
- This holds regardless of queuing, arrival rate distribution, response time distribution, etc.
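Little's law is easy to check with a few numbers. The sketch below uses hypothetical figures (40 requests per second, 0.25 s average response time) and hypothetical function names. Rearranging the law also shows why throughput and response time are not reciprocals: they only are when concurrency is exactly 1.

```python
def concurrency(throughput_per_s, response_time_s):
    """Little's law: average number of requests in the system."""
    return throughput_per_s * response_time_s


def throughput(concurrency_level, response_time_s):
    """Little's law rearranged to solve for throughput."""
    return concurrency_level / response_time_s


# Hypothetical system: 40 requests/s, each taking 0.25 s on average.
in_flight = concurrency(40, 0.25)
print("requests in flight:", in_flight)

# Same response time, different concurrency, different throughput.
print("throughput at concurrency 1:", throughput(1, 0.25))
print("throughput at concurrency 10:", throughput(10, 0.25))
```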
- Utilization law:
- Utilization = service time * throughput
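As a small worked example of the utilization law, assuming a single resource that serves one request at a time: the 4 ms service time and 150 requests per second below are invented numbers, and the function names are hypothetical.

```python
def utilization(service_time_s, throughput_per_s):
    """Utilization law: fraction of time the resource is busy."""
    return service_time_s * throughput_per_s


def max_throughput(service_time_s):
    """Throughput at which a single resource reaches 100% utilization."""
    return 1 / service_time_s


# Hypothetical disk: 4 ms of service time per request, 150 requests/s.
busy_fraction = utilization(0.004, 150)
ceiling_per_s = max_throughput(0.004)

print("utilization:", round(busy_fraction, 3))
print("throughput ceiling per second:", round(ceiling_per_s, 1))
```

Note that this uses service time (time actually spent doing the work), not response time, which also includes time spent waiting in a queue.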