Using perf_event_open in C for Fine-Grained Linux Hardware Counter Monitoring

Modern Linux systems offer powerful capabilities for performance analysis, with hardware performance counters (PMCs) being a cornerstone for deep system understanding. While tools like perf provide a user-friendly interface to these counters, direct programmatic access from C offers unparalleled flexibility for custom tooling, in-application monitoring, and advanced diagnostics. The perf_event_open system call is the gateway to this low-level access (see the perf_event_open(2) man page).

This article provides a practical guide for C developers to leverage perf_event_open for monitoring hardware performance counters. We’ll explore its core concepts, demonstrate its usage with code examples, and discuss essential considerations like permissions and error handling.

Core Concepts of `perf_event_open`

The perf_event_open syscall creates a file descriptor that represents a performance monitoring event. This event can be a hardware event (like CPU cycles or cache misses), a software event (like context switches or page faults), or another type such as a tracepoint, as detailed in the perf_event_open(2) man page.

The `perf_event_attr` Structure

Configuration of a performance event is done via the struct perf_event_attr. This structure is crucial and contains numerous fields to define the event’s behavior, as documented in the kernel header <linux/perf_event.h> and the perf_event_open(2) man page.

Key fields include:

type: Specifies the category of the event (e.g., PERF_TYPE_HARDWARE, PERF_TYPE_SOFTWARE, PERF_TYPE_RAW). These types are defined in <linux/perf_event.h>.
size: Must be set to sizeof(struct perf_event_attr) for forward/backward compatibility.
config: A type-specific value that further defines the event. For hardware events, this could be one of the PERF_COUNT_HW_* constants (e.g., PERF_COUNT_HW_INSTRUCTIONS). For raw events, it’s a CPU-specific code.
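sample_period / sample_freq: Used for event sampling. The two fields share storage. By default the value is a period, and the kernel generates an overflow notification every sample_period events. If the freq flag is set, the value is read as sample_freq instead, a target rate in samples per second, and the kernel adjusts the period to approximate it.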
sample_type: A bitmask specifying what data to include in a sample (e.g., IP, TID, time).
read_format: A bitmask defining the format of data read from the file descriptor, especially for grouped events.
disabled: If set to 1, the event is created in a disabled state and must be enabled via ioctl. This is common practice.
inherit: If set to 1, the event is inherited by child threads/processes.
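pinned: If set to 1, the kernel tries to keep the event on the PMU at all times. It applies only to hardware counters and only to group leaders. If a pinned event cannot be scheduled, it enters an error state in which read() returns 0 (end-of-file) until the event is subsequently enabled or disabled.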
exclude_kernel / exclude_hv: Flags to exclude events occurring in kernel or hypervisor space respectively.
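
To make the interplay of these fields concrete, here is a minimal sketch that fills a perf_event_attr for sampling CPU cycles, either with a fixed period or with a target frequency. The helper name init_sampling_attr is hypothetical and chosen only for illustration; the field names and constants come from <linux/perf_event.h>.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of any API): prepare an attr for
 * sampling CPU cycles in user space.
 *   use_freq != 0: 'rate' is a target in samples per second (freq = 1)
 *   use_freq == 0: 'rate' is the number of events between samples */
static void init_sampling_attr(struct perf_event_attr *attr,
                               int use_freq, uint64_t rate)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_HARDWARE;
    attr->config = PERF_COUNT_HW_CPU_CYCLES;
    /* Each sample records instruction pointer, pid/tid and a timestamp. */
    attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_TIME;
    attr->disabled = 1;       /* enable later via ioctl */
    attr->exclude_kernel = 1;
    attr->exclude_hv = 1;
    if (use_freq) {
        attr->freq = 1;           /* interpret the shared field as a frequency */
        attr->sample_freq = rate;
    } else {
        attr->sample_period = rate;
    }
}
```

Calling init_sampling_attr(&attr, 1, 1000) asks for roughly 1000 samples per second, while init_sampling_attr(&attr, 0, 100000) asks for one sample every 100000 cycles. Actually consuming the samples still requires the mmap ring buffer, which this sketch does not cover.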

Counting vs. Sampling

perf_event_open supports two primary modes of operation:

Counting Mode: The kernel aggregates event occurrences. The total count can be read directly from the event’s file descriptor. This mode is generally lower overhead and suitable for collecting cumulative metrics.
Sampling Mode (Profiling): When an event counter overflows (after sample_period events or at sample_freq frequency), the kernel generates a sample. This sample can include information like the instruction pointer, process ID, timestamp, etc., as specified by sample_type. Samples are typically read from a ring buffer shared between the kernel and userspace (set up via mmap). This mode is used for profiling to identify performance hotspots.

This article primarily focuses on the simpler counting mode for brevity, but the concepts of struct perf_event_attr apply to both.

Using `perf_event_open` in C

glibc provides no wrapper for perf_event_open, so you invoke it through the generic syscall() function, as described in the perf_event_open(2) man page.

1. Invoking the System Call

First, ensure you have the necessary includes:

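#include <stdlib.h>
#include <stdio.h>
#include <stddef.h>     // For offsetof (used when validating group reads)
#include <unistd.h>
#include <string.h>
#include <sys/types.h>  // For pid_t, ssize_t
#include <sys/ioctl.h>
#include <linux/perf_event.h>
#include <asm/unistd.h> // For __NR_perf_event_open
#include <errno.h>      // For errno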

// Helper function for the syscall
static long
perf_event_open_syscall(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    // __NR_perf_event_open is typically defined in <asm/unistd.h>
    return syscall(__NR_perf_event_open, hw_event, pid, cpu,
                   group_fd, flags);
}

This helper function perf_event_open_syscall makes calling the system call cleaner.

2. Monitoring a Single Hardware Event (e.g., Instructions)

Let’s create an event to count retired instructions for the current process.

// Function to monitor a single hardware event
void monitor_single_event() {
    struct perf_event_attr pe;
    long long count;
    int fd;

    // Initialize the perf_event_attr structure
    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    // Count retired instructions
    pe.config = PERF_COUNT_HW_INSTRUCTIONS; 
    pe.disabled = 1; // Start disabled
    pe.exclude_kernel = 1; // Exclude kernel-space events
    pe.exclude_hv = 1;     // Exclude hypervisor events

    // Create the event file descriptor.
    // pid = 0: current thread; cpu = -1: any CPU current thread runs on.
    // group_fd = -1: no group; flags = 0: no flags.
    fd = perf_event_open_syscall(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        fprintf(stderr, "Error opening event: %s (errno %d)\n",
                strerror(errno), errno);
        exit(EXIT_FAILURE);
    }

    // Enable the counter
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    // --- Critical code section to measure ---
    printf("Counting instructions for this printf and some work...\n");
    for (volatile int i = 0; i < 1000000; i++); // Example work
    // --- End of critical code section ---

    // Disable the counter
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

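    // Read the counter value (one 64-bit count with the default read_format).
    // Treat a short read as an error too, not only a return value of -1.
    ssize_t n = read(fd, &count, sizeof(count));
    if (n != (ssize_t)sizeof(count)) {
        fprintf(stderr, "Error reading counter: %s\n",
                n == -1 ? strerror(errno) : "short read");
        close(fd);
        exit(EXIT_FAILURE);
    }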

    printf("Instructions retired: %lld\n", count);

    close(fd);
}

In this example:

We initialize perf_event_attr for PERF_COUNT_HW_INSTRUCTIONS.
disabled = 1 ensures the counter doesn’t start until PERF_EVENT_IOC_ENABLE.
pid = 0 monitors the current thread. cpu = -1 means the counter is associated with the thread and follows it across CPUs.
ioctl calls are used to enable and disable the counter.
read retrieves the 64-bit count.
Always close(fd) when done to release resources.
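
The same pattern can be wrapped in a reusable helper that also checks the ioctl return values, which the example above skips for brevity. The following is a minimal sketch under stated assumptions: the names measure_user_instructions and busy_work are hypothetical, and SYS_perf_event_open from <sys/syscall.h> is used in place of __NR_perf_event_open.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Example workload; stands in for the code you actually want to measure. */
static void busy_work(void *arg)
{
    (void)arg;
    for (volatile int i = 0; i < 100000; i++)
        ;
}

/* Hypothetical helper: count user-space instructions retired while
 * fn(arg) runs on the calling thread.
 * Returns 0 on success, -1 on failure with errno set. */
static int measure_user_instructions(void (*fn)(void *), void *arg,
                                     uint64_t *count_out)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    /* pid = 0, cpu = -1: calling thread on any CPU; no group; no flags. */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fd == -1)
        return -1;

    int ok = ioctl(fd, PERF_EVENT_IOC_RESET, 0) == 0 &&
             ioctl(fd, PERF_EVENT_IOC_ENABLE, 0) == 0;
    if (ok) {
        fn(arg);
        ok = ioctl(fd, PERF_EVENT_IOC_DISABLE, 0) == 0;
    }
    if (ok) {
        ssize_t n = read(fd, count_out, sizeof(*count_out));
        if (n != (ssize_t)sizeof(*count_out)) {
            if (n >= 0)
                errno = EIO; /* short read: report as a generic I/O error */
            ok = 0;
        }
    }

    int saved_errno = errno; /* close() may overwrite errno */
    close(fd);
    errno = saved_errno;
    return ok ? 0 : -1;
}
```

A call such as measure_user_instructions(busy_work, NULL, &count) returns -1 on machines where the event cannot be opened (for example a virtual machine without a virtualized PMU, or a restrictive perf_event_paranoid setting), so callers should treat failure as a normal outcome rather than a fatal error.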

3. Monitoring an Event Group

Performance events can be grouped. A group is led by a “group leader” event. All events in a group are scheduled on the PMU as a single unit. This is vital when you need to correlate counts (e.g., calculating Instructions Per Cycle - IPC).

To read all counters in a group simultaneously, you must set PERF_FORMAT_GROUP in the read_format field of the leader.

// Structure to hold multiple counter values for a group read.
// Based on PERF_FORMAT_GROUP, PERF_FORMAT_ID,
// PERF_FORMAT_TOTAL_TIME_ENABLED, and PERF_FORMAT_TOTAL_TIME_RUNNING.
// The kernel writes every field as an unsigned 64-bit value (u64).
struct read_format_group {
    unsigned long long nr;          // Number of events in the group
    unsigned long long time_enabled;
    unsigned long long time_running;
    struct {
        unsigned long long value;   // Counter value
        unsigned long long id;      // Event ID (if PERF_FORMAT_ID)
    } values[];                     // Flexible array member for event data
};

// Function to monitor a group of hardware events
void monitor_event_group() {
    struct perf_event_attr pe_leader, pe_member;
    int fd_leader, fd_member;
    // Event IDs; PERF_EVENT_IOC_ID stores an unsigned 64-bit value
    unsigned long long instr_id = 0, cycle_id = 0;

    // Buffer for reading group data. For 2 events the kernel returns:
    // nr, time_enabled, time_running (3 * 8 bytes) plus
    // 2 * (value + id) (2 * 16 bytes), i.e. 56 bytes in total.
    // An array of 64-bit words keeps the buffer correctly aligned for the
    // struct cast below; 32 words (256 bytes) leaves room for a few more events.
    unsigned long long read_buf[32];
    struct read_format_group *rf = (struct read_format_group *)read_buf;
    ssize_t bytes_read;

    // --- Configure leader event (e.g., CPU Cycles) ---
    memset(&pe_leader, 0, sizeof(struct perf_event_attr));
    pe_leader.type = PERF_TYPE_HARDWARE;
    pe_leader.size = sizeof(struct perf_event_attr);
    pe_leader.config = PERF_COUNT_HW_CPU_CYCLES;
    pe_leader.disabled = 1;
    pe_leader.exclude_kernel = 1;
    pe_leader.exclude_hv = 1;
    // Request event ID and group read format
    pe_leader.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID |
                            PERF_FORMAT_TOTAL_TIME_ENABLED |
                            PERF_FORMAT_TOTAL_TIME_RUNNING;

    // Create the leader event (group_fd = -1)
    fd_leader = perf_event_open_syscall(&pe_leader, 0, -1, -1, 0);
    if (fd_leader == -1) {
        fprintf(stderr, "Error opening leader event: %s (errno %d)\n",
                strerror(errno), errno);
        exit(EXIT_FAILURE);
    }
    ioctl(fd_leader, PERF_EVENT_IOC_ID, &cycle_id);


    // --- Configure member event (e.g., Instructions) ---
    memset(&pe_member, 0, sizeof(struct perf_event_attr));
    pe_member.type = PERF_TYPE_HARDWARE;
    pe_member.size = sizeof(struct perf_event_attr);
    pe_member.config = PERF_COUNT_HW_INSTRUCTIONS;
    pe_member.disabled = 1; // Will be enabled with the group
    pe_member.exclude_kernel = 1;
    pe_member.exclude_hv = 1;
    // Only affects reads on fd_member itself; a read on the leader uses
    // the leader's read_format.
    pe_member.read_format = PERF_FORMAT_ID;

    // Create the member event, belonging to the group fd_leader
    fd_member = perf_event_open_syscall(&pe_member, 0, -1, fd_leader, 0);
    if (fd_member == -1) {
        fprintf(stderr, "Error opening member event: %s (errno %d)\n",
                strerror(errno), errno);
        close(fd_leader);
        exit(EXIT_FAILURE);
    }
    ioctl(fd_member, PERF_EVENT_IOC_ID, &instr_id);

    // Reset and enable all events in the group via the leader
    ioctl(fd_leader, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
    ioctl(fd_leader, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);

    // --- Critical code section to measure ---
    printf("Counting cycles and instructions for this section...\n");
    for (volatile long long i = 0; i < 2000000; i++); // Example work
    // --- End of critical code section ---

    // Disable all events in the group via the leader
    ioctl(fd_leader, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

    // Read all counters from the group leader
    bytes_read = read(fd_leader, read_buf, sizeof(read_buf));
    if (bytes_read == -1) {
        perror("Error reading group counters");
    } else if (bytes_read < (ssize_t)offsetof(struct read_format_group, values) ||
               (rf->nr > 0 && bytes_read < 
                (ssize_t)(offsetof(struct read_format_group, values) + 
                          rf->nr * sizeof(rf->values[0])))) {
        fprintf(stderr, "Short read from perf event group: %zd bytes\n", 
                bytes_read);
    } else {
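        printf("Group counters (%llu events read):\n",
               (unsigned long long)rf->nr);
        printf("  Time enabled: %llu ns\n",
               (unsigned long long)rf->time_enabled);
        printf("  Time running: %llu ns\n",
               (unsigned long long)rf->time_running);
        // Track "found" with flags instead of a -1 sentinel, since the
        // counters are unsigned 64-bit values.
        unsigned long long cycles_val = 0, instr_val = 0;
        int have_cycles = 0, have_instr = 0;

        for (unsigned long long i = 0; i < (unsigned long long)rf->nr; i++) {
            unsigned long long id = (unsigned long long)rf->values[i].id;
            unsigned long long value = (unsigned long long)rf->values[i].value;
            if (id == (unsigned long long)cycle_id) {
                cycles_val = value;
                have_cycles = 1;
                printf("  CPU Cycles: %llu (ID: %llu)\n", value, id);
            } else if (id == (unsigned long long)instr_id) {
                instr_val = value;
                have_instr = 1;
                printf("  Instructions: %llu (ID: %llu)\n", value, id);
            }
        }
        if (have_cycles && have_instr && cycles_val > 0) {
             double ipc = (double)instr_val / (double)cycles_val;
             printf("  Calculated IPC: %.2f\n", ipc);
        }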
    }

    close(fd_member);
    close(fd_leader);
}

Key aspects of group monitoring:

The leader event is created with group_fd = -1. Member events are created with group_fd set to the leader’s file descriptor.
pe_leader.read_format must include PERF_FORMAT_GROUP. PERF_FORMAT_ID is used to get unique IDs for each event, helping to match values in the output. PERF_FORMAT_TOTAL_TIME_ENABLED and PERF_FORMAT_TOTAL_TIME_RUNNING provide information about counter multiplexing.
ioctl operations like PERF_EVENT_IOC_ENABLE, PERF_EVENT_IOC_DISABLE, and PERF_EVENT_IOC_RESET can be applied to the entire group using the leader’s file descriptor and the PERF_IOC_FLAG_GROUP flag.
A single read() on the leader’s file descriptor retrieves data for all events in the group. The format is defined by struct read_format_group (or similar, depending on read_format flags). The number of bytes read should be checked.
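
The group read layout described above can also be parsed without casting the buffer to a struct. The sketch below looks up one counter by its ID in a buffer returned by read() on a leader opened with the same four read_format flags as in the example. The function name group_value_by_id is hypothetical; memcpy is used so that the buffer needs no particular alignment.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Look up one counter by ID in a group read buffer.
 * Assumed layout (all fields u64), matching PERF_FORMAT_GROUP |
 * PERF_FORMAT_ID | PERF_FORMAT_TOTAL_TIME_ENABLED |
 * PERF_FORMAT_TOTAL_TIME_RUNNING:
 *   nr, time_enabled, time_running, then nr pairs of {value, id}.
 * Returns 0 and stores the value on success; -1 if the buffer is
 * truncated or the ID is not present. */
static int group_value_by_id(const void *buf, size_t len,
                             uint64_t id, uint64_t *value_out)
{
    uint64_t header[3];
    if (len < sizeof(header))
        return -1;
    memcpy(header, buf, sizeof(header));

    uint64_t nr = header[0];
    const unsigned char *entries = (const unsigned char *)buf + sizeof(header);
    size_t remaining = len - sizeof(header);

    /* Reject buffers too short to hold all nr entries. */
    if (nr > remaining / (2 * sizeof(uint64_t)))
        return -1;

    for (uint64_t i = 0; i < nr; i++) {
        uint64_t pair[2]; /* pair[0] = value, pair[1] = id */
        memcpy(pair, entries + i * sizeof(pair), sizeof(pair));
        if (pair[1] == id) {
            *value_out = pair[0];
            return 0;
        }
    }
    return -1;
}
```

With the earlier example, this would be called as group_value_by_id(read_buf, (size_t)bytes_read, cycle_id, &cycles) after a successful read on fd_leader.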

Permissions and `perf_event_paranoid`

Access to performance counters is controlled for security reasons.

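/proc/sys/kernel/perf_event_paranoid: This sysctl setting determines what unprivileged users can access. The perf_event_open(2) man page details its levels: -1 allows almost everything; 0 and above disallows raw tracepoint access; 1 and above also disallows CPU-wide event access; 2 and above also disallows kernel profiling, leaving user-space measurement of your own processes. Mainline kernels have defaulted to 2 since Linux 4.6, and some distributions ship patched kernels with a higher, stricter level. The examples in this article set exclude_kernel = 1, which is what lets them run at level 2.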
Capabilities: Linux capabilities provide fine-grained privilege control.
- CAP_PERFMON (since Linux 5.8): The preferred capability for performance monitoring operations. It allows access even with restrictive perf_event_paranoid settings.
- CAP_SYS_ADMIN: A more general, powerful capability that also grants access. However, CAP_PERFMON is more targeted. (See capabilities(7) man page for more on capabilities).

If your program fails with EACCES or EPERM when opening an event, check /proc/sys/kernel/perf_event_paranoid and ensure your process has the necessary capabilities (e.g., run as root for testing, or use setcap to grant cap_perfmon to your executable: sudo setcap cap_perfmon+ep ./your_program).
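
A program can check the setting itself before attempting to open events. The following sketch reads and interprets /proc/sys/kernel/perf_event_paranoid; the function names are hypothetical, and the descriptions paraphrase the levels documented for mainline kernels. Values above 2 are treated as distribution-specific, which is an assumption of this sketch.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse the text of the sysctl file, e.g. "2\n". Returns 0 on success. */
static int parse_paranoid_level(const char *text, int *level_out)
{
    char *end;
    errno = 0;
    long v = strtol(text, &end, 10);
    if (errno != 0 || end == text)
        return -1;
    *level_out = (int)v;
    return 0;
}

/* Read the current level from procfs. Returns 0 on success, -1 otherwise. */
static int read_paranoid_level(int *level_out)
{
    char line[32];
    FILE *f = fopen("/proc/sys/kernel/perf_event_paranoid", "r");
    if (!f)
        return -1;
    char *got = fgets(line, sizeof(line), f);
    fclose(f);
    return got ? parse_paranoid_level(line, level_out) : -1;
}

/* Summarize what a process without CAP_PERFMON may do at a given level. */
static const char *describe_paranoid_level(int level)
{
    if (level <= -1)
        return "almost all events allowed";
    if (level == 0)
        return "raw tracepoint access disallowed";
    if (level == 1)
        return "additionally, CPU-wide event access disallowed";
    if (level == 2)
        return "additionally, kernel profiling disallowed";
    return "above the mainline range; distribution-specific, possibly stricter";
}
```

A typical use is to call read_paranoid_level(&level) at startup and log describe_paranoid_level(level) when perf_event_open fails with EACCES.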

Other Key `ioctl` Operations

Besides enable, disable, and reset, the perf_event_open(2) man page documents further ioctl requests. One of them is used by the group example above:

PERF_EVENT_IOC_ID: Retrieves the unique ID assigned by the kernel to an event. This is useful when reading grouped events with PERF_FORMAT_ID to identify which count belongs to which event.

Common Challenges and Pitfalls

Permissions: As discussed, perf_event_paranoid and missing capabilities are common hurdles.
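Incorrect perf_event_attr: An invalid size field fails with E2BIG (the kernel writes the size it expected back into the field), while invalid or unsupported type/config combinations typically fail with EINVAL or ENOENT.
Event Scheduling Limits: A CPU’s Performance Monitoring Unit (PMU) has a limited number of physical counters, and a group is only scheduled when all of its events fit at once. Depending on the PMU driver, perf_event_open may reject a member that can never fit (typically with EINVAL), or the calls may succeed but the group is never scheduled, in which case the counts stay at zero and time_running does not advance. The kernel multiplexes between separate events and groups, not inside a group.
Counter Multiplexing: If more events are monitored than available hardware counters, the kernel time-shares the counters. With time_enabled and time_running requested via read_format, the full-interval count can be estimated as raw_count * time_enabled / time_running. The result is an extrapolation, so precision drops as the ratio grows.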
Forgetting to close(fd): Event file descriptors are resources; always close them. Leaking FDs can lead to errors like EMFILE (too many open files).
Raw Event Codes (PERF_TYPE_RAW): These are CPU-specific and require consulting vendor manuals (Intel, AMD, ARM). Libraries like libpfm4 (see libpfm official site) can help translate symbolic event names to raw codes.
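
The multiplexing correction mentioned in this list is a simple ratio. Here is a minimal sketch of it; the function name scale_count is hypothetical, and the choice of floating-point arithmetic (to avoid overflow in the 64-bit product) is one option among several.

```c
#include <stdint.h>

/* Estimate the full-interval count of a multiplexed counter:
 *   estimate = raw * time_enabled / time_running
 * Returns 0 if the event never ran (time_running == 0), and the raw
 * count unchanged if the event ran for the whole enabled time. */
static uint64_t scale_count(uint64_t raw, uint64_t time_enabled,
                            uint64_t time_running)
{
    if (time_running == 0)
        return 0;
    if (time_running >= time_enabled)
        return raw;
    /* Floating point avoids overflow in raw * time_enabled;
     * adding 0.5 rounds to the nearest integer. */
    double scaled = (double)raw * (double)time_enabled / (double)time_running;
    return (uint64_t)(scaled + 0.5);
}
```

For example, a counter that reports 1000 events while running for only half of the enabled time yields an estimate of 2000. A result of 0 with time_running equal to 0 means the event was never scheduled, which is worth reporting separately from a genuine zero count.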

Debugging Tips

Check errno: After perf_event_open or other syscalls fail, errno provides valuable information (e.g., EINVAL, EACCES, EMFILE).
strace ./your_program: Shows the system calls being made and their return values, extremely useful for debugging parameters.
Start Simple: Begin with a single, well-known hardware event (e.g., PERF_COUNT_HW_INSTRUCTIONS) before attempting complex groups or sampling.
Compare with perf tool: If you can achieve the desired monitoring with perf stat or perf record (see perf wiki or man perf), it implies the events are supportable, and the issue likely lies in your C code’s configuration.
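
To make errno checking more actionable, a program can map the common failure codes of perf_event_open to short hints. The sketch below does this for codes described in the perf_event_open(2) man page; the function name perf_open_error_hint and the wording of the hints are illustrative only.

```c
#include <errno.h>

/* Map an errno value from a failed perf_event_open call to a short hint. */
static const char *perf_open_error_hint(int err)
{
    switch (err) {
    case EACCES:
    case EPERM:
        return "permission denied: check perf_event_paranoid and CAP_PERFMON";
    case E2BIG:
        return "attr.size not accepted: set it to sizeof(struct perf_event_attr)";
    case EINVAL:
        return "invalid attribute or argument combination";
    case ENOENT:
        return "event type or config not supported on this system";
    case ENODEV:
    case EOPNOTSUPP:
        return "required hardware feature not available";
    case EMFILE:
        return "per-process file descriptor limit reached";
    case ESRCH:
        return "target process does not exist";
    case EBUSY:
        return "another event has exclusive access to the PMU";
    default:
        return "unexpected error: see perf_event_open(2)";
    }
}
```

After a failed call, printing both strerror(errno) and perf_open_error_hint(errno) gives the raw error together with a likely cause.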

Advantages of Direct `perf_event_open` Usage

While tools like perf are convenient, direct C access offers:

Fine-Grained Control: Precise configuration of events and their attributes.
Custom Tool Development: Build specialized monitoring or profiling tools tailored to specific needs.
Application Integration: Embed performance monitoring directly within applications for adaptive behavior or detailed logging.
Reduced Overhead: For very specific measurements, direct access can sometimes be more lightweight than invoking a separate tool.

Alternatives to Direct Syscall Usage

perf command-line tool: Excellent for general-purpose profiling and tracing. Easier to use for common tasks.
PAPI (Performance Application Programming Interface): A portable library that abstracts hardware counters across different architectures.
Vendor-Specific Libraries/Tools: E.g., Intel VTune Profiler, AMD uProf. Often provide deep insights specific to their hardware.
libpfm4: A helper library to translate event names to the machine-specific encodings needed for perf_event_attr.config, especially for PERF_TYPE_RAW events. (Official site: http://perfmon2.sourceforge.net/)
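
For simple cases on x86, a raw config value can also be assembled by hand. The sketch below assumes the basic x86 encoding, with the event select code in bits 0-7 and the unit mask in bits 8-15. It ignores the other fields (edge, invert, counter mask, and AMD's extended event select bits), and the function name x86_raw_config is hypothetical.

```c
#include <stdint.h>

/* Build perf_event_attr.config for a basic x86 PERF_TYPE_RAW event.
 * Assumption: event select in bits 0-7, unit mask in bits 8-15. */
static uint64_t x86_raw_config(uint8_t event_select, uint8_t umask)
{
    return ((uint64_t)umask << 8) | (uint64_t)event_select;
}
```

As an example, Intel's architectural "LLC Misses" event uses event select 0x2E with unit mask 0x41, which gives a config of 0x412E. Anything beyond such basic events is better left to the vendor manuals or to libpfm4.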

Conclusion

The perf_event_open syscall is a powerful interface for performance analysis on Linux. By understanding its mechanics, struct perf_event_attr, and the nuances of permissions and event grouping, C developers gain fine-grained access to hardware performance counters. This enables customized monitoring tools, detailed investigation of application behavior, and a better understanding of system performance. The interface is complex, but its flexibility makes it a valuable tool for systems developers and performance engineers.
