Measuring and Optimizing CPU Performance

Jul 10, 2020 | Author: Kit Merker

Software engineers are under constant pressure to make their desktop, laptop, and client applications run as efficiently as possible.

That’s easier said than done. Just ask Aaron Tyler, Principal Software Engineer at DocuSign. I spoke with Aaron recently about the relentless drive to improve infrastructure efficiency, and he shared some of his best practices for measuring and optimizing CPU consumption. In short, you can’t improve what you don’t measure. Below is an edited transcript of our conversation.

Kit Merker: Thanks for chatting today. What are the major concerns that developers need to know when measuring CPU? I’m sure there are important tradeoffs to make, but is it as simple as battery vs performance?

Aaron Tyler: Happy to be here. Let me start by saying, measuring CPU is hard. To illustrate, think about how low CPU usage, which you might assume is universally good, can actually mean something is wrong. For example, if your process is waiting for activities that the operating system is performing on your behalf (like disk IO, network IO, or paging), your app could be waiting when it could be working. This is not optimal.

High CPU issues, on the other hand, tend to be easier to fix: find what’s using the most CPU and make it use less. It might be easier said than done, but it’s straightforward. Optimizing your wait time is trickier: it shows up as low CPU usage, and fixing it requires rearranging dependencies so that they can be used in parallel.
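
As a rough illustration of rearranging independent waits so they overlap, here is a minimal Python sketch. The `fetch` function is a hypothetical stand-in for any disk or network call, and the wait is simulated with `time.sleep`; nothing here comes from the interview itself.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(name, delay=0.05):
    """Hypothetical stand-in for disk or network IO: the thread only waits."""
    time.sleep(delay)
    return name

def run_serial(names):
    """Perform each wait one after another and return (results, wall seconds)."""
    start = time.perf_counter()
    results = [fetch(n) for n in names]
    return results, time.perf_counter() - start

def run_parallel(names):
    """Overlap the independent waits in a thread pool; result order is preserved."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        results = list(pool.map(fetch, names))
    return results, time.perf_counter() - start

names = ["a", "b", "c", "d"]
_, serial_s = run_serial(names)
_, parallel_s = run_parallel(names)
print(f"serial: {serial_s:.3f}s, parallel: {parallel_s:.3f}s")
```

Both versions use very little CPU. The only thing that changes is how much of the waiting happens at the same time, which is why this kind of problem is hard to spot from a CPU graph alone.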

You have to consider what you’re optimizing for: wall clock time or overall efficiency? Generally speaking, you want to measure both resource consumption and wall clock time. But the biggest tradeoff is engineering time vs. whatever you’re trying to optimize for. We put a man on the moon with a 2 MHz CPU and 2K of RAM, so with enough engineering time, you can hit just about any kind of performance target.
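
One way to capture both numbers at once is sketched below in Python. The helper name `measure` and the two sample tasks are assumptions made for illustration; `time.perf_counter` reports wall clock time and `time.process_time` reports CPU time consumed by the current process.

```python
import time

def measure(func, *args, **kwargs):
    """Run func once and report wall clock time, CPU time, and their ratio."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args, **kwargs)
    cpu_s = time.process_time() - cpu_start
    wall_s = time.perf_counter() - wall_start
    utilization = cpu_s / wall_s if wall_s > 0 else 0.0
    return result, {"wall_s": wall_s, "cpu_s": cpu_s, "cpu_utilization": utilization}

def waiting_task():
    # Simulated IO wait: wall time passes, but almost no CPU is used.
    time.sleep(0.2)

def busy_task():
    # CPU-bound work: wall time and CPU time should be close to each other.
    return sum(i * i for i in range(2_000_000))

_, waiting = measure(waiting_task)
_, busy = measure(busy_task)
print("waiting:", waiting)
print("busy:   ", busy)
```

A low `cpu_utilization` on a slow operation suggests the time is going to waiting rather than computing, which points toward the dependency and IO work described above instead of hot-path tuning.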

The challenge is figuring out the correct tradeoff for the kind of product you’re building. If you want to utilize app development frameworks that reduce engineering time, you have to accept the tradeoffs they represent. This is how Microsoft Teams ends up using 22.7GB of RAM. They use Electron to create multiple processes under the covers, each loading a web browser to interpret JavaScript and render UI locally. This isn’t efficient in terms of path, but it is quick to develop.

Anything with a battery needs to consider battery life on top of wall clock time. Typically, this means that client applications can’t do things like spin-waits or polling that you might be able to get away with on a server application. And if you need highly efficient code in a heavily multithreaded application, you also need to consider data dependencies and CPU cache usage.

For more, see https://medium.com/@ricomariani/how-to-performance-lab-108829c1b9d4.

KM: What about different manufacturers? I imagine it’s similar to testing across browsers. How do different platforms like Apple vs. Microsoft or AMD vs. Intel play into it?

AT: Each ecosystem has its own tools and quirks, depending on what you’re looking for. Windows has a great structured logging format called Event Tracing for Windows (ETW) that most of the profiling tools are built on top of. The operating system and its components are heavily instrumented, and you can record nearly everything you want to see into a trace file to be analyzed off-box, including things like every context switch or individual DPCs/ISRs. POSIX-compatible systems (*nix OSes) have largely moved over to eBPF, which lets you insert small instruction streams that run when “interesting” events occur.

On the processor vendor side, assuming you’re using an x64 instruction set, you’re mostly looking at processor microbenchmarks or variations between firmware/microcode. It’s rare that this level of inspection is worth the tradeoff, but when you need maximum efficiency you need to start looking at icache/dcache and ops/cycle and figuring out how to reduce branching and data dependencies. Intel produces VTune for this, and AMD has μProf. Typically, you’re optimizing for Instructions per Cycle (IPC) here. Add in multithreading, and you need to start worrying about NUMA nodes and ideal cores. Not to mention the rabbit hole of instruction path optimization.

If you want to get started with this, I would suggest ETW and Windows Performance Analyzer (WPA) on Windows, Android Profiler for Android, and Chrome Profiler for web sites. Bruce Dawson has a good introduction to ETW on his blog. The MSDN blog has a solid post on event tracing that might be useful, too. Also, C# ETW events typically use TraceEvent.

KM: Software is built by people, so any complex project is going to have human/organizational problems. How do these affect CPU? Are there any common pitfalls you see developers running into when first trying to fix an efficiency issue?

AT: Computers would be easy if it weren’t for the people! The first thing is that you have to figure out what to prioritize, and how your organization’s boundaries show up in your codebase. I found this short Medium post from Rico Mariani helpful.

My advice is to measure and collect data before attempting to fix anything. I’m always surprised at what the critical path is on an application, and very often it’s not what you expect it to be. You might spend months getting the O(N log N) processing algorithm working just to find out that your JSON parsing time dwarfs everything else because you’re doing it on every request, and the disk IO cost kills you. Automate your measurements, change something, and re-measure. Rico Mariani has a bunch of good blog posts about this, and he’s been influential at both Microsoft and Facebook in developing their performance teams.
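
A small Python sketch of "measure first" using the standard library profiler follows. The `handle_request` function and its payload are hypothetical, built only to mimic a request handler that re-parses JSON on every call; the profile then shows where the time actually goes.

```python
import cProfile
import json
import pstats

# Hypothetical payload standing in for data that is re-read on every request.
CONFIG_TEXT = json.dumps({"items": list(range(2000))})

def handle_request(values):
    config = json.loads(CONFIG_TEXT)  # parsed again on every single request
    return sorted(values)[: len(config["items"])]

def profile_requests(n_requests=200):
    """Profile a batch of requests and return the collected statistics."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(n_requests):
        handle_request([3, 1, 2])
    profiler.disable()
    return pstats.Stats(profiler)

stats = profile_requests()
# Show the five entries with the largest cumulative time.
stats.sort_stats("cumulative").print_stats(5)
```

In a run like this, the JSON decoding functions would be expected to rank above the sort. The usual next step is to change one thing (for example, parsing once and caching the result) and run the same measurement again.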

KM: What kind of mechanisms do you recommend putting in place to ensure CPU usage doesn’t get worse over time?

AT: As part of your CI/CD pipeline, automate tests for common operations and measure both wall clock time and resource consumption during those tasks. You should fail these tests if either measurement moves “meaningfully,” and start an investigation. Think of this as a performance gate for check-in.
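
A performance gate of this kind could look roughly like the Python sketch below. The function names, the stored baseline, and the 20% tolerance are all assumptions chosen for illustration; what counts as "meaningful" movement depends on how noisy your test environment is.

```python
import statistics
import time

def median_wall_time(operation, runs=5):
    """Time an operation several times and return the median, to dampen noise."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def check_gate(measured_s, baseline_s, tolerance=0.20):
    """Return (passed, relative_change); fail when the slowdown exceeds tolerance."""
    change = (measured_s - baseline_s) / baseline_s
    return change <= tolerance, change

def sample_operation():
    # Hypothetical "common operation" under test.
    sorted(range(100_000), reverse=True)

baseline_s = 0.050  # assumed value recorded from an earlier known-good build
measured_s = median_wall_time(sample_operation)
passed, change = check_gate(measured_s, baseline_s)
print(f"measured {measured_s:.4f}s, change {change:+.1%}, passed={passed}")
```

The same comparison can be applied to CPU time or memory readings, so that a change which keeps wall clock time flat but doubles resource consumption still fails the gate.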

You’ll also want monitoring of your production application. This gets tricky because you don’t necessarily have tight control over the environment or hardware (especially for client apps) so the data is much noisier. You’ll want to record wall clock times from your applications, as well as enough metadata to pivot your data based on interesting factors like CPU, RAM, etc.
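
As a hedged sketch of what such a record might contain, the Python below attaches basic environment metadata to a wall clock measurement. The field names are assumptions, and installed RAM is left out because the Python standard library has no portable call for it; a real client would use a platform API or a third-party package.

```python
import json
import os
import platform
import time

def environment_metadata():
    """Collect coarse, non-identifying facts about the machine for later pivoting."""
    return {
        "os": platform.system(),
        "os_version": platform.release(),
        "machine": platform.machine(),
        "logical_cpus": os.cpu_count(),
    }

def timed_record(operation_name, operation):
    """Run an operation and return a telemetry record with its wall clock time."""
    start = time.perf_counter()
    operation()
    wall_s = time.perf_counter() - start
    record = {"operation": operation_name, "wall_s": wall_s}
    record.update(environment_metadata())
    return record

# "open_document" is a hypothetical operation name.
record = timed_record("open_document", lambda: sum(range(100_000)))
print(json.dumps(record))
```

Keeping the metadata coarse makes it easier to group noisy field data by hardware class without collecting anything that identifies a particular user.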

Beyond this, you may want to build an “escalation” pipeline so that when something unexpected starts occurring on a particular device, you can record a trace or take a process dump to investigate further. Solving performance problems “in the wild” is one of the hardest activities due to privacy and data security concerns, so it’s preferable to find things in your CI/CD or performance lab prior to release. Still, there are some things that only show up on customer hardware, so you’ll want a way to get data back on these even if it’s painful.

KM: How does server-side design affect the client when it comes to CPU? For example, can you design your API to be CPU-friendly?

AT: Absolutely. On the client side, the less work the client needs to do to process a request, the better. Especially for battery life, you want to minimize the number of requests that you’re making at the network layer so that the CPU can “get idle, stay idle.” P-state and C-state transitions can take up to a few hundred milliseconds, so the operating system needs to expect a long period of idle before it will trigger a C-state transition.

If you’re doing a while (!response()) { sleep(1); }, you’re never going to leave C0, and even though you’re not doing work the CPU can’t save any power. Even beyond that, keeping the radio powered on mobile devices because a net connection is actively being used costs even more power, and this can vary depending on how strong the signal is. Thus, if you can, make a single network request, use wait handles or other tech to yield until the response is handed back to you by the operating system, and then quickly process the response. You’ll get better performance, you’ll typically get a thread priority boost on wakeup, and it’ll be more battery efficient, too.
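
The contrast between polling and yielding can be sketched in Python as follows. `threading.Event` plays the role of a wait handle here, and `deliver_after` is a hypothetical helper that simulates a response arriving from another thread; the details of real wait handles differ by platform.

```python
import threading
import time

def wait_by_polling(is_ready, interval=0.001):
    """Anti-pattern: wake up repeatedly just to ask whether the response arrived."""
    wakeups = 0
    while not is_ready():
        time.sleep(interval)
        wakeups += 1
    return wakeups

def wait_by_event(event, timeout=None):
    """Block until another thread signals the event; this code never wakes to check."""
    return event.wait(timeout)

def deliver_after(delay_s, event):
    """Hypothetical helper: simulate a response arriving after delay_s seconds."""
    timer = threading.Timer(delay_s, event.set)
    timer.start()
    return timer

polled = threading.Event()
deliver_after(0.05, polled)
wakeups = wait_by_polling(polled.is_set)

signalled = threading.Event()
deliver_after(0.05, signalled)
arrived = wait_by_event(signalled, timeout=1.0)

print(f"polling woke up {wakeups} times; event wait returned {arrived}")
```

Both loops finish at about the same time, but the polling version wakes the thread many times for no useful work, and each of those wakeups is a chance to pull the CPU back out of an idle state.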

KM: Service Level Objectives (SLOs) are used to measure performance for cloud and server-side components. Can you define SLOs for CPU? Why or why not? What about error budgets?

AT: When you want to hit performance and resource targets, it really helps to set budgets or allocations for each team that will consume some of those resources. Usually, it helps to start with a high-level goal for the overall time or resources spent, and then ask each team lead what goal they can hit. In my experience, it helps to have the teams “own” these goals because it makes it easier to hold them accountable later. Remember, it’s not just the machines that need to be managed, but also the people who are trying to build something using the machines.

Once you have goals in place, you can figure out the dependencies between teams. Can the tasks run in parallel? Or do they have to run serially? How does the behavior change as we expand the test cases to other hardware or environments? What if we use a resource-intensive app on the same hardware? If everyone agrees on the resource utilization guidelines up front, it’s easier to set clear consequences (i.e., fix your part of the process) and to make better engineering decisions to deliver on the overall goals.
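
The serial-versus-parallel question has a simple arithmetic core, sketched below in Python. The team names, per-team budgets, and overall goal are made-up numbers for illustration: tasks that must run serially add up, while fully independent tasks are bounded by the slowest one.

```python
def total_time_ms(budgets_ms, parallel):
    """Serial tasks add up; fully parallel tasks take as long as the slowest one."""
    values = budgets_ms.values()
    return max(values) if parallel else sum(values)

# Hypothetical per-team budgets for one user-visible operation.
budgets_ms = {"auth": 120, "rendering": 200, "sync": 80}
GOAL_MS = 350  # assumed overall goal

serial_ms = total_time_ms(budgets_ms, parallel=False)
parallel_ms = total_time_ms(budgets_ms, parallel=True)

print(f"serial: {serial_ms} ms (goal met: {serial_ms <= GOAL_MS})")
print(f"parallel: {parallel_ms} ms (goal met: {parallel_ms <= GOAL_MS})")
```

Real systems sit somewhere between the two extremes, but even this rough model shows why dependency ordering between teams matters as much as each team’s individual budget.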

Image Credit: Tianyi Ma on Unsplash