author | Will Chandler <wchandler@gitlab.com> | 2023-01-05 22:34:19 +0300 |
---|---|---|
committer | Will Chandler <wchandler@gitlab.com> | 2023-02-02 18:07:02 +0300 |
commit | ee856a0c74eaf327cf9f33ad29f22f6ef2f25aea (patch) | |
tree | 677a54efb1c5044fe8a1a64eb06e7947dd7f083d /_support | |
parent | 4fb6bb8a078bc32e4a1f62d1045d0264445909ee (diff) |
benchmarking: Add profiling script
Understanding where Gitaly and Git spend their time, along with general
system health, is critical to useful benchmarking. Add a script
to the Gitaly node to run `perf` and a number of `libbpf-tools`
utilities while the node is under load.
Running this introduces a performance overhead of ~10%, mostly from
`perf`, which is run twice simultaneously: once to profile only Gitaly
using `--call-graph=fp`, which works well with Golang, and again for the
system as a whole using `--call-graph=dwarf`, which is more accurate for
Git and other C programs. The DWARF output is ~10x larger than the frame
pointer output, so flamegraphs built from it take proportionately
longer to generate, typically longer than the duration profiled.
The `libbpf-tools` utilities used are a bit of a grab bag, but quite
lightweight to run. These are BPF CO-RE utilities that run much more
lightly than `bcc`, which can be a resource hog. They focus primarily
on determining the amount of delay block I/O imposes, which may be useful
in determining how much of a penalty slower storage imposes on Gitaly.
Currently the only RPC being tested is `FindCommit`, which, being
read-only, hits the kernel page cache 100% of the time after the first
request.
- biolatency: Histogram of the latency of block I/O operations for each
attached disk.
https://github.com/iovisor/bcc/blob/master/tools/biolatency_example.txt
- biotop: List of processes performing the most block I/O.
https://github.com/iovisor/bcc/blob/master/tools/biotop_example.txt
- execsnoop: List of all processes forked by Gitaly and their
arguments.
https://github.com/iovisor/bcc/blob/master/tools/execsnoop_example.txt
- cpudist: Histogram of how long programs ran on-CPU once scheduled by
the kernel or, with the `--offcpu` flag, how long they slept.
https://github.com/iovisor/bcc/blob/master/tools/cpudist_example.txt
- cachestat: Statistics regarding kernel page cache hit rate.
https://github.com/iovisor/bcc/blob/master/tools/cachestat_example.txt
Note that the links above are to the `bcc` documentation for each tool
used. The arguments the `bcc` version takes may vary a bit from what
`libbpf-tools` allows, but they perform the same task.
Further work is needed for this to be fully usable, most notably tracking
CPU and memory utilization. This is difficult with polling tools like
Prometheus's `node-exporter`, as most of the system load is typically
from short-lived Git processes that may spawn and exit between polling
intervals.
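Because `execsnoop` traces every exec rather than polling, its capture
still records these short-lived processes, although it only shows what
was launched, not how much CPU or memory each process used. A minimal
sketch for summarizing that capture follows. `summarize_execs` is a
hypothetical helper, not part of this commit, and it assumes the
capture has one header line followed by one row per exec with the
command name in the first column.

```shell
#!/bin/sh
# Hypothetical helper, not part of this commit.
# Count how often each command was launched, most frequent first.
# Assumes one header line, then one row per exec with the command
# name in the first whitespace-separated column.
summarize_execs() {
    awk 'NR > 1 && NF > 0 { count[$1]++ }
         END { for (cmd in count) print count[cmd], cmd }' "$1" \
        | sort -k1,1nr -k2,2
}
```

It could be pointed at the file the script writes, for example
`summarize_execs "${out_dir}/gitaly-execs.txt"`.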
Diffstat (limited to '_support')
-rwxr-xr-x | _support/benchmarking/roles/gitaly/files/profile-gitaly.sh | 117 | ||||
-rw-r--r-- | _support/benchmarking/roles/gitaly/tasks/setup_profiling.yml | 8 |
2 files changed, 125 insertions, 0 deletions
diff --git a/_support/benchmarking/roles/gitaly/files/profile-gitaly.sh b/_support/benchmarking/roles/gitaly/files/profile-gitaly.sh
new file mode 100755
index 000000000..68a625d1d
--- /dev/null
+++ b/_support/benchmarking/roles/gitaly/files/profile-gitaly.sh
@@ -0,0 +1,117 @@
+#!/bin/sh
+#
+# profile-gitaly: Profile host with perf and libbpf-tools.
+#                 Must be run as root.
+#
+# Mandatory arguments:
+#   -d <DURATION_SECS> : Number of seconds to profile for
+#   -g <GIT_REPO>      : Name of Git repository being used
+#   -o <OUTPUT_DIR>    : Directory to write output to
+#   -r <RPC>           : Name of RPC being executed
+
+set -e
+
+usage() {
+    echo "Usage: $0 -d <DURATION_SECS> -o <OUTPUT_DIR> -r <RPC> \
+-g <GIT_REPOSITORY>"
+    exit 1
+}
+
+profile() {
+    # Profile Gitaly only
+    #   --no-inherit    - Don't profile child Git processes.
+    #   --call-graph=fp - Use framepointers for call stack, works well with Golang
+    #                     and ~10x smaller output than DWARF.
+    perf record --freq=99 --call-graph=fp --pid="$(pidof -s gitaly)" \
+        --no-inherit --output="${gitaly_perf_data}" -- sleep "${seconds}" &
+
+    # Profile whole system
+    #   --call-graph=dwarf - Use DWARF debug info for call stack, works well with
+    #                        C programs but is much larger than fp, causing
+    #                        flamegraph generation to be proportionately slower.
+    perf record --freq=99 --call-graph=dwarf --all-cpus \
+        --output="${all_perf_data}" -- sleep "${seconds}" &
+
+    # Capture arguments of all processes forked by Gitaly.
+    timeout "${seconds}" execsnoop \
+        --uid=1999 --quote > "${out_dir}/gitaly-execs.txt" &
+
+    # Histogram of duration programs were scheduled by the kernel.
+    cpudist "${seconds}" 1 > "${out_dir}/cpu-dist-on.txt" &
+
+    # Histogram of duration programs were slept by the kernel.
+    cpudist --offcpu "${seconds}" 1 > "${out_dir}/cpu-dist-off.txt" &
+
+    # Histogram of latency to block I/O, separated by disk.
+    # `git-repositories` will be mounted as `/dev/sdb`.
+    biolatency --disk "${seconds}" 1 > "${out_dir}/biolatency.txt" &
+
+    # Details of processes performing the most block I/O.
+    biotop --noclear --rows 100 "${seconds}" 1 > "${out_dir}/biotop.txt" &
+
+    # Capture kernel page cache hit rate.
+    cachestat "${seconds}" 1 > "${out_dir}/page-cachestat.txt" &
+
+    wait
+}
+
+generate_flamegraphs() {
+    gitaly_perf_svg="${out_dir}/gitaly-perf.svg"
+    perf script --header --input="${gitaly_perf_data}" \
+        | stackcollapse \
+        | flamegraph > "${gitaly_perf_svg}" &
+
+    all_perf_svg="${out_dir}/all-perf.svg"
+    perf script --header --input="${all_perf_data}" \
+        | stackcollapse \
+        | flamegraph > "${all_perf_svg}" &
+
+    wait
+}
+
+main() {
+    if [ "$(id -u)" -ne 0 ]; then
+        echo "$0 must be run as root" >&2
+        exit 1
+    fi
+
+    while getopts "hd:g:o:r:" arg; do
+        case "${arg}" in
+            d) seconds=${OPTARG} ;;
+            g) repo=${OPTARG} ;;
+            o) out_dir=${OPTARG} ;;
+            r) rpc=${OPTARG} ;;
+            h|*) usage ;;
+        esac
+    done
+
+    if [ "${seconds}" -le 0 ] \
+        || [ -z "${out_dir}" ] \
+        || [ -z "${rpc}" ] \
+        || [ -z "${repo}" ]; then
+        usage
+    fi
+
+    if ! pidof gitaly > /dev/null; then
+        echo "Gitaly is not running, aborting" >&2
+        exit 1
+    fi
+
+    # Ansible's minimal shell will may not include /usr/local/bin in $PATH
+    if ! printenv PATH | grep "/usr/local/bin" > /dev/null; then
+        export PATH="${PATH}:/usr/local/bin"
+    fi
+
+    perf_tmp_dir=$(mktemp -d "/tmp/gitaly-perf-${repo}-${rpc}.XXXXXX")
+    gitaly_perf_data="${perf_tmp_dir}/gitaly-perf.out"
+    all_perf_data="${perf_tmp_dir}/all-perf.out"
+
+    profile
+
+    generate_flamegraphs
+
+    chown -R git:git "${out_dir}"
+    rm -rf "${perf_tmp_dir}"
+}
+
+main "$@"
diff --git a/_support/benchmarking/roles/gitaly/tasks/setup_profiling.yml b/_support/benchmarking/roles/gitaly/tasks/setup_profiling.yml
index e218ba60c..b445c8eb1 100644
--- a/_support/benchmarking/roles/gitaly/tasks/setup_profiling.yml
+++ b/_support/benchmarking/roles/gitaly/tasks/setup_profiling.yml
@@ -79,3 +79,11 @@
     dest: /usr/local/bin/flamegraph
     mode: 0755
     remote_src: true
+
+- name: Install profile-gitaly.sh as profile-gitaly
+  copy:
+    src: profile-gitaly.sh
+    dest: /usr/local/bin/profile-gitaly
+    owner: root
+    group: root
+    mode: '0755'
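
The argument check in `main` relies on `[ "${seconds}" -le 0 ]`. When
`-d` is omitted or not a number, that test fails with an error status
rather than evaluating to true, so the `||` chain falls through to the
remaining checks and `usage` is only reached if another argument is
also missing. A stricter check might look like the sketch below.
`is_positive_int` is a hypothetical helper, not part of this commit.

```shell
#!/bin/sh
# Hypothetical helper, not part of this commit.
# Succeeds only when $1 is a non-empty string of digits greater than 0.
is_positive_int() {
    case "$1" in
        # Reject the empty string and anything containing a non-digit.
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -gt 0 ]
}
```

In `main`, the first condition could then become
`! is_positive_int "${seconds}"`, which also sends an unset or
non-numeric duration to `usage`.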