BPF (eBPF)
Main Goals
Portability:
- the ability to write a BPF program that works with different kernel versions, where kernel data structures differ.
Solutions:
a. Embed your BPF C program as a string, and use BCC, which builds the BPF program on the fly using the kernel headers. Cons:
- LLVM is a big binary,
- resource-heavy,
- slow,
- needs kernel headers installed,
- compilation errors are only found at runtime.
b. CO-RE. Components:
- BTF type information
- Clang able to emit BTF relocations
- libbpf (user space BPF loader library): adjusts the compiled BPF code to the specific kernel version using BTF info
- kernel
BTF
A compact format that describes the type information of C programs.
It is exposed via /sys/kernel/btf/vmlinux and can be read via bpftool, getting all the kernel types as if we had the kernel headers:
$ bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
Your BPF program can then include a single vmlinux.h file. It is missing the #define macros though :-(, but those can be found in bpf_helpers.h.
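As a minimal sketch of what such a program looks like (the do_unlinkat hook and the program name are only illustrative):

#include "vmlinux.h"            /* all kernel types, generated by bpftool */
#include <bpf/bpf_helpers.h>    /* SEC(), helper prototypes, bpf_printk() */

char LICENSE[] SEC("license") = "GPL";

SEC("kprobe/do_unlinkat")
int trace_unlink(struct pt_regs *ctx)
{
    bpf_printk("do_unlinkat() was called");
    return 0;
}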
Clang
Clang supports generating BPF programs with BTF relocation information, so that the BPF loader can adjust the program at load time for the particular kernel it is running on -> field offset relocations.
BPF loader (libbpf)
The BPF program loader. It takes a BPF ELF object file, post-processes it, sets up various kernel objects and triggers BPF program loading and verification. It resolves and matches all types and fields.
Kernel
No specific kernel support is required: if BTF is exposed, the BPF program can be loaded like a traditional BPF program.
Type of BPF Programs & Probes
See /sys/kernel/debug/tracing/events/... and /sys/kernel/debug/tracing/available_events.
- tracepoints / raw tracepoints (includes any syscall enter / exit).
- kprobe / kretprobe: any kernel function.
- fentry / fexit: any kernel function (needs trampoline support for the architecture!).
- LSM: LSM hooks in "security.h".
- perf_events (used by the perf event BPF program type).
- fmod_ret: overrides the return code that a function or syscall hands back to user mode, e.g. to return a syscall error or a fake int result. Can only be hooked to a "security_"-prefixed function.
- bpf_iter: iterates over various kernel objects, e.g. the list of all current task structures.
- syscall: a program that can call syscalls.
From libbpf-bootstrap: the big distinction between fexit and kretprobe programs is that the fexit one has access to both the input arguments and the returned result, while a kretprobe can only access the result.
If you hook the exit path using a kretprobe, you may not be able to retrieve the function arguments: the registers holding the arguments at function entry may have been clobbered during the function's execution. However, this should not happen with fexit, according to the BPF_PROG documentation in bpf_tracing.h.
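As an illustration (a sketch only; do_unlinkat and the program name are arbitrary, and the usual vmlinux.h / bpf_helpers.h / bpf_tracing.h includes are assumed), an fexit program written with the BPF_PROG macro sees both the input arguments and the return value:

SEC("fexit/do_unlinkat")
int BPF_PROG(trace_unlink_exit, int dfd, struct filename *name, long ret)
{
    /* fexit: the original arguments (dfd, name) and the return value (ret)
     * are all available, unlike in a kretprobe */
    bpf_printk("do_unlinkat(dfd=%d) returned %ld", dfd, ret);
    return 0;
}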
Useful helper functions
- bpf_get_current_pid_tgid: returns the PID / TGID of the current thread (see the decoding sketch after this list).
- bpf_d_path: returns the full path for a given struct path object. NOTE: it is only allowed in a restricted set of functions, see btf_allowlist_d_path in the kernel code.
- bpf_send_signal: raises a signal on the current thread.
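The bpf_get_current_pid_tgid() return value packs both ids into one u64; a common decoding sketch:

/* inside a BPF program: */
__u64 id   = bpf_get_current_pid_tgid();
__u32 tgid = id >> 32;      /* "process" id, what user space usually calls PID */
__u32 pid  = (__u32)id;     /* thread id */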
Rootkit-oriented:
- bpf_probe_write_user: writes to the memory of a user space thread (subject to TOC-TOU limitations). It can corrupt user memory!
- bpf_override_return:
  - only for kernel functions defined with ALLOW_ERROR_INJECTION(...)
  - needs a kernel with CONFIG_BPF_KPROBE_OVERRIDE
  - used at entry: the syscall is completely skipped!
  - used at exit: alters the syscall return value, but the syscall is still executed.
bpf_probe_read_user() / bpf_probe_read_user_str() / bpf_probe_write_user(): read / write an argument of a syscall, or read a VM address of a user space program (see the sketch below). The address can also be passed via skel->bss->ptr.
bpf_copy_from_user(): sleepable version of bpf_probe_read_user(). Can only be used in sleepable BPF progs.
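A hedged sketch of reading a user-space syscall argument with bpf_probe_read_user_str() (the openat tracepoint is just an example; struct trace_event_raw_sys_enter comes from vmlinux.h, and the usual includes are assumed):

SEC("tracepoint/syscalls/sys_enter_openat")
int trace_openat(struct trace_event_raw_sys_enter *ctx)
{
    char path[256];
    /* args[1] of openat(2) is the user-space pointer to the pathname */
    const char *user_path = (const char *)ctx->args[1];

    if (bpf_probe_read_user_str(path, sizeof(path), user_path) > 0)
        bpf_printk("openat(%s)", path);
    return 0;
}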
List BPF stuff
List of BPF programs:
$ bpftool prog
List of BPF maps:
$ bpftool map
List of all BTF info, including functions:
$ bpftool btf dump file /sys/kernel/btf/vmlinux format raw
Can you hook to FUNCTION_NAME?
$ bpftool btf dump file /sys/kernel/btf/vmlinux | grep FUNCTION_NAME
Pinning BPF programs
Normally, a BPF program requires the program that loaded it to keep running in order to stay loaded in the kernel. Alternatively, a BPF program can be pinned to a file under the special filesystem bpffs.
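A hedged user-space sketch of pinning with libbpf (assuming #include <bpf/libbpf.h> and libbpf 1.0 error conventions; the path and the prog variable are arbitrary). bpftool can also pin programs from the command line.

/* prog has already been loaded (e.g. via a skeleton) */
struct bpf_link *link = bpf_program__attach(prog);
if (!link)
    return -1;

/* pin the link under bpffs so the attachment survives this process exiting */
if (bpf_link__pin(link, "/sys/fs/bpf/my_prog_link"))
    return -1;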
Interpreting verifier errors
See reg_type_str[] in the kernel file kernel/bpf/verifier.c:
[NOT_INIT] = "?",
[SCALAR_VALUE] = "inv",
[PTR_TO_CTX] = "ctx",
[CONST_PTR_TO_MAP] = "map_ptr",
[PTR_TO_MAP_VALUE] = "map_value",
[PTR_TO_MAP_VALUE_OR_NULL] = "map_value_or_null",
[PTR_TO_STACK] = "fp",
[PTR_TO_PACKET] = "pkt",
[PTR_TO_PACKET_META] = "pkt_meta",
[PTR_TO_PACKET_END] = "pkt_end",
[PTR_TO_FLOW_KEYS] = "flow_keys",
[PTR_TO_SOCKET] = "sock",
[PTR_TO_SOCKET_OR_NULL] = "sock_or_null",
[PTR_TO_SOCK_COMMON] = "sock_common",
[PTR_TO_SOCK_COMMON_OR_NULL] = "sock_common_or_null",
[PTR_TO_TCP_SOCK] = "tcp_sock",
[PTR_TO_TCP_SOCK_OR_NULL] = "tcp_sock_or_null",
[PTR_TO_TP_BUFFER] = "tp_buffer",
[PTR_TO_XDP_SOCK] = "xdp_sock",
[PTR_TO_BTF_ID] = "ptr_",
[PTR_TO_BTF_ID_OR_NULL] = "ptr_or_null_",
[PTR_TO_PERCPU_BTF_ID] = "percpu_ptr_",
[PTR_TO_MEM] = "mem",
[PTR_TO_MEM_OR_NULL] = "mem_or_null",
[PTR_TO_RDONLY_BUF] = "rdonly_buf",
[PTR_TO_RDONLY_BUF_OR_NULL] = "rdonly_buf_or_null",
[PTR_TO_RDWR_BUF] = "rdwr_buf",
[PTR_TO_RDWR_BUF_OR_NULL] = "rdwr_buf_or_null",
If loading BPF code results in this error, it means you are attempting to read a (relocatable) structure field that your kernel does not have (or, better said, cannot relocate):
180: (85) call unknown#195896080
invalid func unknown#195896080
BPF iterator
- Declare the program with SEC("iter/task") (see the sketch after this list). See the kernel code for all possible "iter/<hook>" iterator types.
- Write the output via the seq interface (seq write).
- Attach the iterator (iter attach).
- Read the seq fd to make the iterator run.
Alternatively, the iterator can be pinned to a bpffs file path:
- Read from the path to run the iterator.
- Delete the file path to unload the iterator program.
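A hedged sketch of a task iterator (patterned after the kernel selftests; BPF_SEQ_PRINTF comes from libbpf's bpf_tracing.h, the other types from vmlinux.h):

SEC("iter/task")
int dump_tasks(struct bpf_iter__task *ctx)
{
    struct seq_file *seq = ctx->meta->seq;
    struct task_struct *task = ctx->task;

    /* the program is also called once with task == NULL at the end of the walk */
    if (!task)
        return 0;

    BPF_SEQ_PRINTF(seq, "%d %s\n", task->pid, task->comm);
    return 0;
}

Reading the iterator's fd (or the pinned bpffs path) from user space then produces one line per task.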
BPF tail call
- Tail-called programs must be of the same type as the caller. You can either declare the tail-called BPF program with that type, or leave it with a generic type and use the bpf_program__set_xxx() helper functions to set the type before loading the program.
- They also need to match in terms of JIT compilation: either JIT-compiled or interpreted programs can call each other, but the two cannot be mixed.
- BPF tail calls are paired with a prog array map (BPF_MAP_TYPE_PROG_ARRAY) used as a jump table (sketch below). Each entry of the map is a program fd. UM needs to configure it before loading the program. A tail call jumps to the program stored at a given index of the map:
bpf_tail_call(ctx, &array_map_of_fds, __fd_number__);
- The context ctx passed to bpf_tail_call must really be a pointer to a context; it cannot be an arbitrary pointer to an argument the user wants to pass around.
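A hedged sketch of a prog array plus a tail call (the names and index 0 are arbitrary; the usual includes are assumed):

struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 4);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));   /* each value is a program fd */
} jmp_table SEC(".maps");

SEC("kprobe/do_unlinkat")
int dispatcher(struct pt_regs *ctx)
{
    /* jump to the program stored at index 0; on success this never returns */
    bpf_tail_call(ctx, &jmp_table, 0);
    /* reached only if the entry is empty or the tail call fails */
    return 0;
}

User space must store the fd of another program of the same type at index 0 before the jump can succeed.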
JIT
JIT compilers speed up execution of the BPF program significantly since they reduce the per instruction cost compared to the interpreter. Often instructions can be mapped 1:1 with native instructions of the underlying architecture.
- Dynamic enable / disable is controlled with /proc/sys/net/core/bpf_jit_enable.
- Permanently enabled with CONFIG_BPF_JIT_ALWAYS_ON.
Maps
Memory mapped
You can memory-map a BPF map of type array. Declare the map with BPF_F_MMAPABLE (sketch below).
A limitation is that the map cannot be read-only from UM, i.e. declared with BPF_F_RDONLY.
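A hedged sketch: declare the array map with BPF_F_MMAPABLE on the BPF side, then mmap() it from user space through the map fd (struct my_value, the map name and the skeleton variable skel are invented for illustration):

/* BPF side */
struct my_value { __u64 counter; };

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(map_flags, BPF_F_MMAPABLE);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, struct my_value);
} shared SEC(".maps");

/* user space side, e.g. with a libbpf skeleton */
struct my_value *v = mmap(NULL, sizeof(*v), PROT_READ | PROT_WRITE,
                          MAP_SHARED, bpf_map__fd(skel->maps.shared), 0);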
Map of maps
Limitations: you can only create a new inner map from UM.
BPF Links
A bpf_link is an abstraction used to:
- represent the attachment of a BPF program to a BPF hook point;
- encapsulate ownership of an attached BPF program by a process, or by a file path (via BPFFS). This allows the attachment to survive the exit of the user process. See: https://lore.kernel.org/bpf/[email protected]/
BPF spinlocks
- Used to serialize access/updates to a single map element (see the sketch after this list).
- Cannot be used in tracing programs because of preemption.
- Need one spinlock per map element.
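A hedged sketch (non-tracing programs only; the map and field names are invented):

struct val {
    struct bpf_spin_lock lock;   /* one lock per map element */
    __u64 counter;
};

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, struct val);
} counters SEC(".maps");

/* inside a (non-tracing) BPF program: */
__u32 key = 0;
struct val *v = bpf_map_lookup_elem(&counters, &key);
if (v) {
    bpf_spin_lock(&v->lock);
    v->counter++;
    bpf_spin_unlock(&v->lock);
}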
CO-RE
Thanks to CO-RE you can write BPF code that supports different structure layouts across kernel versions. But make sure to declare any alternative version of a default struct foobar (defined in vmlinux.h) as struct foobar___v1, struct foobar___v2, with exactly 3 underscores!
Quoting some libbpf code in the kernel:
* 1. For given local type, find corresponding candidate target types.
* Candidate type is a type with the same "essential" name, ignoring
* everything after last triple underscore (___). E.g., `sample`,
* `sample___flavor_one`, `sample___flavor_another_one`, are all candidates
* for each other. Names with triple underscore are referred to as
* "flavors" and are useful, among other things, to allow to
* specify/support incompatible variations of the same kernel struct, which
* might differ between different kernel versions and/or build
* configurations.
Use bpf_core_field_exists() to select which version of the struct to use when reading data.
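A hedged sketch of the flavor + bpf_core_field_exists() pattern (struct foobar, its fields old_field / new_field are invented for illustration; bpf_core_field_exists() and BPF_CORE_READ() come from bpf_core_read.h):

/* the default struct foobar comes from vmlinux.h; this is an alternative flavor */
struct foobar___v2 {
    int new_field;                        /* layout used by some kernel versions */
} __attribute__((preserve_access_index));

static __always_inline int read_foobar_field(struct foobar *fb)
{
    struct foobar___v2 *fb2 = (void *)fb;

    /* resolved to a constant at load time, the dead branch is removed */
    if (bpf_core_field_exists(fb2->new_field))
        return BPF_CORE_READ(fb2, new_field);
    return BPF_CORE_READ(fb, old_field);  /* field of the default definition */
}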
Nice helpers
bpf_get_stackid: saves the user space or kernel space call stack into a map.
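A hedged sketch pairing bpf_get_stackid() with a BPF_MAP_TYPE_STACK_TRACE map (the hook and sizes are arbitrary):

struct {
    __uint(type, BPF_MAP_TYPE_STACK_TRACE);
    __uint(max_entries, 1024);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, 127 * sizeof(__u64));  /* up to PERF_MAX_STACK_DEPTH frames */
} stacks SEC(".maps");

SEC("kprobe/do_unlinkat")
int capture_stack(struct pt_regs *ctx)
{
    /* kernel stack; pass BPF_F_USER_STACK to capture the user stack instead */
    long id = bpf_get_stackid(ctx, &stacks, 0);

    if (id >= 0)
        bpf_printk("stack id %ld", id);
    return 0;
}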
Direct memory read
Some program types (tracing and trampoline-based ones) allow direct memory reads of kernel pointers, without having to use the bpf_probe_* helpers.
However, you still need the compiler to emit CO-RE relocations, either by using __builtin_preserve_access_index, or by tagging a kernel struct definition with __attribute__((preserve_access_index)) - the two are equivalent.
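A hedged sketch of the attribute-tagging approach for direct reads (the hand-written struct is a tiny subset of the real task_struct; bpf_get_current_task_btf() returns a BTF-typed pointer that tracing programs may dereference directly; the usual bpf_helpers.h include is assumed):

/* minimal hand-written definition, used instead of including vmlinux.h */
struct task_struct {
    int pid;
    char comm[16];
} __attribute__((preserve_access_index));

SEC("fentry/do_unlinkat")
int direct_read(void *ctx)
{
    struct task_struct *t = bpf_get_current_task_btf();

    /* direct dereference, no bpf_probe_read_kernel(); offsets fixed up via CO-RE */
    bpf_printk("pid %d comm %s", t->pid, t->comm);
    return 0;
}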
Programs statistics
Write 1 to /proc/sys/kernel/bpf_stats_enabled.
That will make bpftool show additional info:
- run_cnt: the number of times the program was executed.
- run_time_ns: the cumulative program execution time in nanoseconds including the off-cpu time when the program was sleeping.
sudo bpftool prog show
--- CUT ---
3001: raw_tracepoint name my_tracepoint tag abcdef0123456789 gpl run_time_ns 41752 run_cnt 24
--- CUT ---
Stats in newer kernels:
- verified_insns: the number of instructions processed by the verifier - this is what libbpf currently dumps in its log on program load, when the log level is verbose enough.
- recursion_misses: how many times recursion was prevented (TBD).
BPF programs, preemption, CPU migration
See Are BPF programs preemptible?
- We are only guaranteed that the BPF program will run from start to finish on the same CPU (migrate_disable()).
- There is no guarantee that a BPF program is not preempted.
- That said, on non-RT systems, a BPF program can only be interrupted by IRQs, NMIs, and other higher-priority contexts.
- On an RT system, a BPF program can be interrupted (scheduled out) by another program of the same priority / same kind. NOTE: it cannot be interrupted by a different execution of the same program. Not clear why.
- The implication is that you cannot share a per-CPU array map among different programs, since one program can clobber data from another program.
- You need one map per program, or a dedicated entry in the map for each program type (again, not clear why the same program cannot interrupt itself ...).
BPF programs and RCU
BPF programs run with RCU read lock held.