Approach to Linux performance troubleshooting

This is merely a list of commands and pointers to resources, as I lack the expertise and diligence to write a comprehensive guide. At the same time, I am afraid I will forget some of the things I have learned on this topic if I don’t write them down. Performance analysis is a much-needed skill in the real world and also a favorite interview topic. Mastering the tools and techniques is no easy task, but every admin must know the basics.

The first step is to understand the problem clearly. Questions such as:

  1. What exactly is slow?
  2. How much do you get, and how much do you expect?
  3. Has it always been slow?

will aid in understanding the problem. If you get the problem wrong, you could be heading in the wrong direction.

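I start by looking at the load average using uptime, w or top. In Linux, the load average counts processes that are running or waiting for a CPU plus processes in uninterruptible sleep, which usually means waiting on disk IO, so it reflects both CPU and disk demand. A load average that stays above 1 after dividing by the number of CPUs is a bad sign. If the CPU is not saturated, the cause is usually disk IO. Look out for IO wait (wa) in top and other tools.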
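As a rough sketch of the "divide by the number of CPUs" step, the snippet below normalises a load average so it can be compared against 1. The function name load_per_cpu is made up for illustration; the live part assumes /proc/loadavg and nproc are available and is skipped otherwise.

```shell
#!/bin/sh
# Divide a load average by the CPU count so it can be compared against 1.0.
# Arguments: $1 = load average, $2 = number of CPUs.
# (load_per_cpu is an illustrative helper name, not a standard tool.)
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# On a live Linux system, feed it the 1-minute load average and the CPU count.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r one five fifteen rest < /proc/loadavg
    echo "1-min load per CPU: $(load_per_cpu "$one" "$(nproc)")"
fi
```

For example, a load average of 3 on a 4-CPU machine gives 0.75, which leaves some headroom, while 8 on the same machine gives 2.00.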

Next I fire up top and look at CPU usage: idle time (id), user time (us) and system time (sy). You can sort the processes by CPU usage (Shift+P) or memory usage (Shift+M). If it’s refreshing too quickly, you can pause the output with ^S and resume it with ^Q.

The output of free is often misunderstood. I have a cheat sheet here. Swap usage alone is not bad, but constant swapping is; you can monitor it with vmstat by watching the si (swap in) and so (swap out) columns. sar can show performance stats for most resources in the system.
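To make "constant swapping" a little more concrete, here is a minimal sketch that reads vmstat output and flags samples where pages were swapped in or out. The helper name swap_activity is hypothetical; it looks up the si and so columns by name instead of assuming their position.

```shell
#!/bin/sh
# Read `vmstat 1 N` output on stdin and report samples with swap activity.
# Finds the si/so columns by header name rather than assuming positions.
# (swap_activity is an illustrative helper name, not a standard tool.)
swap_activity() {
    awk '
        $1 == "r" {                       # column-name header line
            for (i = 1; i <= NF; i++) {
                if ($i == "si") si = i
                if ($i == "so") so = i
            }
            next
        }
        si && so && $1 ~ /^[0-9]+$/ {     # numeric sample line
            n++
            # The first sample is an average since boot, so skip it.
            if (n > 1 && ($si > 0 || $so > 0)) {
                printf "sample %d: si=%s so=%s\n", n, $si, $so
                hits++
            }
        }
        END { if (!hits) print "no swap activity" }
    '
}

# Live usage, if vmstat is installed: take six one-second samples.
# vmstat 1 6 | swap_activity
```

An occasional non-zero sample is usually harmless; it is non-zero values in sample after sample that point to memory pressure.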

Here’s a list of useful commands from Brendan Gregg’s Linux Performance Analysis in 60,000 Milliseconds.

dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
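Several of these tools (mpstat, pidstat, iostat and sar) usually come from the sysstat package and may not be installed by default. A small sketch to check which ones are missing before you need them; missing_tools is a made-up helper name.

```shell
#!/bin/sh
# Print the name of each given command that is not found in PATH.
# (missing_tools is an illustrative helper name, not a standard tool.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Check the tools used in the list above and report on stderr.
missing=$(missing_tools dmesg vmstat mpstat pidstat iostat free sar)
if [ -n "$missing" ]; then
    echo "not installed:" $missing >&2
fi
```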

Here is a video version. When it comes to system performance, Brendan Gregg is the name you should think of. I highly recommend you watch his tutorial from the O’Reilly Velocity Conference 2015, especially part 1, as part 2 is quite advanced.

Part 1

Part 2

I first intended to write a summary of the video but decided to do this short post as a pointer. I want to point out one thing though. Suppose that after looking at cpu, mem, disk and network, you don’t notice anything out of line. The next thing you can do is look at what the process is doing, assuming you know which application process is slow. You can run strace -p <pid of application> (example: strace -p $(pgrep myapp)), which will show you all system calls made by the process. In the example he used in the video, the process was merely waiting for something to happen. Therefore, it was a problem in the code and not with the system resources. Another tool you can use is lsof, to look at open files and sockets.
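A slightly more defensive version of that strace invocation might look like the sketch below. The function name trace_app and the process name myapp are placeholders; it assumes pgrep, strace and lsof are installed. Using pgrep -o picks only the oldest matching process, which avoids handing strace several PIDs at once.

```shell
#!/bin/sh
# Attach strace to the oldest process whose name exactly matches $1.
# (trace_app is an illustrative helper; "myapp" below is a placeholder name.)
trace_app() {
    name=$1
    pid=$(pgrep -o -x "$name") || {
        echo "no running process named $name" >&2
        return 1
    }
    # -f follows child processes and threads, -tt timestamps each call.
    strace -f -tt -p "$pid"
}

# Usage (typically needs root or the same user as the target process):
# trace_app myapp
# lsof -p "$(pgrep -o -x myapp)"    # open files and sockets of that process
```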

