
Monitor system and GPU performance with Performance Co-Pilot

Photo by David Fintz on Unsplash

I’ve used so many performance monitoring tools and systems over the years. When you need to know what’s happening right now, tools like btop and glances are great for quick overviews, and historical data is fairly easy to pick through with sysstat.

However, when you want a comprehensive view of system performance over time, especially with GPU metrics for machine learning workloads, Performance Co-Pilot (PCP) is an excellent choice. It has some handy integrations with Cockpit for web-based monitoring, but I prefer using the command line tools directly.

This post explains how to set up PCP on Fedora and enable some very basic GPU monitoring for both NVIDIA and AMD GPUs.

Installing Performance Co-Pilot #

Install the core packages and command line tools:

sudo dnf install pcp pcp-system-tools

Enable and start the PCP services:

sudo systemctl enable --now pmcd pmlogger
sudo systemctl status pmcd

These two services work together like a team:

  • pmcd (Performance Metrics Collection Daemon) gathers real-time metrics from various sources on your system.
  • pmlogger records these metrics to log files for historical analysis.

You can verify that the services are working as expected:

# Check available metrics
pminfo | head -20

# View current CPU utilization
pmval kernel.all.cpu.user

# Show a vmstat-style system summary (5 samples)
pmstat -s 5
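
pmlogger writes its archives under /var/log/pcp/pmlogger/ by default on Fedora, so you can confirm that historical data is actually being recorded by listing the archive files for your host:

# Confirm pmlogger is writing archive files for this host
ls -lh /var/log/pcp/pmlogger/$(hostname)/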

Adding GPU metrics collection #

I do a lot of LLM work locally and I’d like to keep track of my GPU usage over time. Fortunately, PCP supports popular GPUs through something called a PMDA (Performance Metrics Domain Agent). These are packaged in Fedora, but each one has to be registered with pmcd by running an Install script from its own directory.
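
If you’re curious which agents are available on your system, the shipped PMDAs live under /var/lib/pcp/pmdas/, and each directory carries its own Install script:

# Each PMDA directory contains its own Install (and Remove) script
ls /var/lib/pcp/pmdas/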

NVIDIA GPUs #

Unverified instructions: I only have an AMD GPU, but I pulled this NVIDIA information from various places on the internet. Please let me know if you find any issues and I’ll update the post!

For NVIDIA GPUs, ensure you have the NVIDIA drivers and nvidia-ml library:

# Check if nvidia-smi works
nvidia-smi

# Install the NVIDIA management library if needed
sudo dnf install nvidia-driver-cuda-libs

Now install the NVIDIA PMDA. On Fedora, the files under /var/lib/pcp/pmdas/nvidia should come from the pcp-pmda-nvidia-gpu package, so install that first if the directory doesn’t exist:

sudo dnf install pcp-pmda-nvidia-gpu
cd /var/lib/pcp/pmdas/nvidia
sudo ./Install

The installer will prompt you for configuration options. Accept the defaults unless you have specific requirements.

After installation, verify GPU metrics are available:

# List all NVIDIA metrics
pminfo nvidia

# Check GPU utilization
pmval nvidia.gpuutil

# Monitor GPU memory usage
pmval nvidia.fb.used

AMD GPUs #

For AMD GPUs, PCP provides the amdgpu PMDA that works with the ROCm stack:

# Ensure rocm-smi is installed and working
rocm-smi

# Install the AMD GPU PMDA package
sudo dnf install pcp-pmda-amdgpu

# Install the PMDA
cd /var/lib/pcp/pmdas/amdgpu
sudo ./Install

After installation, verify AMD GPU metrics:

# List all AMD GPU metrics
pminfo amdgpu

# Check GPU utilization
pmval amdgpu.gpu.load

# Monitor GPU memory usage
pmval amdgpu.memory.used

Querying performance data #

There are lots of handy tools for querying PCP data, depending on whether you need information about something happening right now or want to analyze historical trends.

Real-time monitoring with pmrep #

The pmrep tool provides formatted output that’s perfect for dashboards or scripts, and it’s great for situations where you need to see what’s happening right now. It’s much like iostat from the sysstat package or vmstat, but you get a lot more flexibility.

# System overview with 1-second updates
pmrep -t 1 kernel.all.load kernel.all.cpu.user mem.util.used

# GPU metrics for LLM monitoring (NVIDIA)
pmrep -t 1 nvidia.gpuutil nvidia.fb.used nvidia.temp

# GPU metrics for LLM monitoring (AMD)
pmrep -t 1 amdgpu.gpu.load amdgpu.memory.used amdgpu.gpu.temperature

# Custom format for specific metrics (NVIDIA)
pmrep -p -t 2 nvidia.power nvidia.clocks.sm nvidia.temp

# Custom format for specific metrics (AMD)
pmrep -p -t 2 amdgpu.gpu.load amdgpu.memory.used amdgpu.gpu.temperature

Historical analysis with pmlogsummary #

If you’re used to running sar commands from the sysstat package, you’ll find pmlogsummary very familiar. Again, you can do a lot more with pmlogsummary than with sar, but the basic concepts are similar.

# Summarize yesterday's GPU utilization (NVIDIA)
pmlogsummary -S @yesterday -T @today /var/log/pcp/pmlogger/$(hostname)/$(date -d yesterday +%Y%m%d) nvidia.gpuutil

# Summarize yesterday's GPU utilization (AMD)
pmlogsummary -S @yesterday -T @today /var/log/pcp/pmlogger/$(hostname)/$(date -d yesterday +%Y%m%d) amdgpu.gpu.load

# Find peak memory usage over the last hour
pmlogsummary -S -1hour /var/log/pcp/pmlogger/$(hostname)/$(date +%Y%m%d) mem.util.used
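
The same archives can be replayed with pmrep by pointing it at an archive with -a. Here’s a minimal sketch that replays today’s archive at one-minute resolution; swap in your GPU metrics as needed:

# Replay today's archive at 1-minute intervals (adjust metrics for your GPU)
pmrep -a /var/log/pcp/pmlogger/$(hostname)/$(date +%Y%m%d) -t 1min mem.util.used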

Monitoring LLM workloads #

You can create a custom pmrep configuration if you want to pull certain metrics frequently in a specific format. This is great for monitoring scripts or dashboards.

cat > ~/.pcp/pmrep/llm-monitor.conf << 'EOF'
[llm-metrics]
timestamp = %%H:%%M:%%S
kernel.all.cpu.user = CPU,,,,8
mem.util.used = Memory,,,,8
# For NVIDIA GPUs:
#nvidia.gpuutil = GPU,,,,6
#nvidia.fb.used = VRAM,MB,,,8
#nvidia.temp = Temp,C,,,6
#nvidia.power = Power,W,,,7
# For AMD GPUs (comment out NVIDIA lines and uncomment these):
#amdgpu.gpu.load = GPU,,,,6
#amdgpu.memory.used = VRAM,MB,,,8
#amdgpu.gpu.temperature = Temp,C,,,6
#amdgpu.gpu.average_power = Power,W,,,7
EOF

# Use the custom configuration
pmrep -c ~/.pcp/pmrep/llm-monitor.conf -t 1 :llm-metrics
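
If you’d rather log the data for later plotting, pmrep can emit CSV as well. This is a rough sketch, assuming the -o csv output goes to standard output so it can be redirected to a file:

# Capture the same metrics as CSV with 5-second samples (assumes -o csv prints to stdout)
pmrep -c ~/.pcp/pmrep/llm-monitor.conf -t 5 -o csv :llm-metrics > llm-metrics.csv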

Troubleshooting tips #

If GPU metrics aren’t showing up:

# Check if the PMDA is properly installed
pminfo -f pmcd.agent | grep -E "amdgpu|nvidia"

# Restart PMCD to reload PMDAs
sudo systemctl restart pmcd

# Check PMDA logs for errors
sudo journalctl -u pmcd -n 50

# Verify GPU drivers are working
rocm-smi  # for AMD
nvidia-smi  # for NVIDIA
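
If the agent shows up on disk but still isn’t registered with pmcd, removing and reinstalling it from its PMDA directory usually sorts things out (AMD shown here; use the nvidia directory for NVIDIA):

# Re-register a misbehaving PMDA (amdgpu shown; substitute nvidia as needed)
cd /var/lib/pcp/pmdas/amdgpu
sudo ./Remove
sudo ./Install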

Further reading #