Introduction
Occasionally in OS land we’ll come across a bug that snuck its way
into a build and has been in the wild for a while before anyone
notices it. One example is a
recent bug
(originally discovered
by the community of CoreOS Container Linux) where the jumbo
packet MTU
size of 9001
was no longer getting set properly on EC2
instances.
So we have this bug, and we know things used to work. I fired up the
first and last releases of F28 Atomic Host. Both had the problem. I
then went all the way back to the
first release of F27 Atomic Host
and fired up an AMI from that release. On that release the MTU
looks
good at 9001
:
[root@ip-10-0-246-17 ~]# ip addr show eth0 | grep mtu
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc fq_codel state UP group default qlen 1000
I then went all the way to the latest F27 commit of 27.153
(ddaa675
)
by doing an rpm-ostree upgrade
. After upgrading I see the non-jumbo
MTU
value of 1500
:
[root@ip-10-0-246-17 ~]# ip addr show eth0 | grep mtu
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
Where was this regression introduced?
Bisecting
I could go through each commit in the entire history to find where the issue was introduced, but with a little logic we can arrive at the answer in fewer steps.
The idea here is that the regression was introduced somewhere and all commits before that point will not have the regression and all commits after that will have the regression. With a simple test we can easily check to see if a commit has the regression or not. The pseudo-code for the bisect logic looks something like this:
# grab info on every commit in history
# A -> B -> C -> D -> E -> F -> G
#
# user provided good/bad commits
# - good (default to first in history: A)
# - bad (default to current commit: G)
#
# run test script
# returns 0 for pass and 1 for failure
#
# known good is A, known bad is G
#
# start bisect:
# deploy D, test --> bad
# mark D, E, F bad
#
# deploy B, test --> good
# mark B good
#
# deploy C, test --> bad
#
# Failure introduced in B -> C
Luckily, since OSTree
is like “Git for your OS”, and we have a stream
of discrete changes, we can actually automate the bisect testing.
Automated Bisecting
For this particular MTU
regression we are lucky. We have a
simple pass/fail test that tells us if the problem exists or
not. If we script it out it looks something like:
[fedora@ip-10-0-246-17 ~]$ cat /usr/local/bin/test.sh
#!/bin/bash
sleep 20 # Wait some time for the settings to get applied
if ip addr show eth0 | grep "mtu 1500"; then
echo "Found mtu 1500. Test Fails"
exit 1
else
echo "Found mtu != 1500. Test Passes"
exit 0
fi
If we run the test we can see it fails:
[root@ip-10-0-246-17 ~]# chmod +x /usr/local/bin/test.sh
[root@ip-10-0-246-17 ~]# /usr/local/bin/test.sh
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
Found mtu 1500. Test Fails
Now we can grab rpm-ostree-bisect , a python program I wrote to help us do the bisect testing, and set it up.
[root@ip-10-0-246-17 ~]# curl -L https://raw.githubusercontent.com/ostreedev/ostree-releng-scripts/master/rpm-ostree-bisect > /usr/local/bin/rpm-ostree-bisect
[root@ip-10-0-246-17 ~]# chmod +x /usr/local/bin/rpm-ostree-bisect
[root@ip-10-0-246-17 ~]# rpm-ostree-bisect --testscript /usr/local/bin/test.sh
Using data file at: /var/lib/rpm-ostree-bisect.json
Created symlink /etc/systemd/system/multi-user.target.wants/rpm-ostree-bisect.service → /etc/systemd/system/rpm-ostree-bisect.service.
[root@ip-10-0-246-17 ~]#
[root@ip-10-0-246-17 ~]# reboot
We passed it the path to the executable test. The program
assumes that the current commit is a bad commit and it will
assume that the oldest commit is good. If you’d like you can
explicitly set the known good and known bad commits with
--good
and --bad
arguments.
A few more things happened here as well. Since we are going
to be deploying different versions of the OS and rebooting
we’ll need to store some state and leverage a startup script
to resume the bisection. rpm-ostree-bisect
created a systemd
service to aid in the bisection:
[fedora@ip-10-0-246-17 ~]$ systemctl cat rpm-ostree-bisect.service | tee
# /etc/systemd/system/rpm-ostree-bisect.service
[Unit]
Description=RPM-OSTree Bisect Testing
After=network.target
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/bin/rpm-ostree-bisect --resume
StandardOutput=journal+console
StandardError=journal+console
[Install]
WantedBy=multi-user.target
It also created a json data file with information about all the commits in the history of the OSTree:
[fedora@ip-10-0-246-17 ~]$ head -n 12 /var/lib/rpm-ostree-bisect.json
{
"commits_info": {
"ddaa675df9ea91b0233ef996691f7f7fdf4eb84fccdf21a49f46a84fa0b39355": {
"version": "27.153",
"heuristic": "TESTED",
"status": "BAD"
},
"892983cd174980ea35f54e571e828aa4365addf2a0f5acdcda770f7b4b87a3a5": {
"version": "27.152",
"heuristic": "ASSUMED",
"status": "UNKNOWN"
},
The data file keeps information about that status
of each commit
(BAD
, GOOD
, or UNKNOWN
) and the heuristic
(GIVEN
, TESTED
,
ASSUMED
) that was used to determine the status. TESTED
means that
commit was explicitly tested, while ASSUMED
means that the status
was assumed to be good or bad based on if an earlier or later commit
was good or bad.
Now we can reboot our system and the automated bisect testing will
happen on the system for us. We can monitor it from remote by using
watch
or an ssh
for loop. One example is:
$ watch -n 10 ssh fedora@52.91.195.75 journalctl -b0 -n 30 -u rpm-ostree-bisect
After waiting some time the bisect was finished. The last run of
the rpm-ostree-bisect
service output the following:
Starting RPM-OSTree Bisect Testing...
Using data file at: /var/lib/rpm-ostree-bisect.json
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
Found mtu 1500. Test Fails
Last known good commit:
21e115d93d04f206a874973c91185184c70ba2e7500d5b2619d8c849b6cad3c1 : 27.125
First known bad commit:
e59f9373145bae36f90571feff24719537827273891c822692d037b05d1ff470 : 27.126
libostree pull from 'fedora-atomic' for fedora/27/x86_64/atomic-host complete
security: GPG: commit http: TLS
non-delta: meta: 306 content: 0
transfer: secs: 28 size: 1.7 MB
ostree diff commit old: 21e115d93d04f206a874973c91185184c70ba2e7500d5b2619d8c849b6cad3c1
ostree diff commit new: e59f9373145bae36f90571feff24719537827273891c822692d037b05d1ff470
Upgraded:
cockpit-bridge 165-1.fc27 -> 166-1.fc27
cockpit-docker 165-1.fc27 -> 166-1.fc27
cockpit-networkmanager 165-1.fc27 -> 166-1.fc27
cockpit-ostree 165-1.fc27 -> 166-1.fc27
cockpit-system 165-1.fc27 -> 166-1.fc27
coreutils 8.27-20.fc27 -> 8.27-21.fc27
coreutils-common 8.27-20.fc27 -> 8.27-21.fc27
findutils 1:4.6.0-16.fc27 -> 1:4.6.0-19.fc27
kernel 4.15.17-300.fc27 -> 4.16.3-200.fc27
kernel-core 4.15.17-300.fc27 -> 4.16.3-200.fc27
kernel-modules 4.15.17-300.fc27 -> 4.16.3-200.fc27
libcgroup 0.41-13.fc27 -> 0.41-17.fc27
selinux-policy 3.13.1-283.30.fc27 -> 3.13.1-283.32.fc27
selinux-policy-targeted 3.13.1-283.30.fc27 -> 3.13.1-283.32.fc27
Started RPM-OSTree Bisect Testing.
So we can see clearly the regression was introduced in 27.126
and the diff
output shows that in that particular commit several things changed. The most
obvious change that could have caused this regression was the kernel
going
from 4.15.17
to 4.16.3
. This narrows down the window of changes to look at
when trying to track down the specific bug.
Note: Here you can access the final rpm-ostree-bisect.json and the systemd unit journal log from the run.
Fin
This automated bisecting can really help us find the causes for regressions by narrowing down the scope of changes we need to look at when performing an investigation.