RPM-OSTree Bisecting Helps Track Down Boot Timeout Issue

Introduction

Last time I talked about using rpm-ostree-bisect, a tool that I wrote to automatically bisect the history of an OSTree remote in order to find the exact commit when a problem was introduced. I recently put the tool to the test again.

The Problem

Recently a user reported an issue where their system was seeing timeouts on boot. They determined that if they removed the resume=/dev/mapper/fedora-swap argument from the kernel command line then the system would boot without timing out on the swap device (i.e. an extra 90 seconds added to boot time).

This was interesting because the behavior was actually introduced during the Fedora 28 cycle (i.e. it was not present in the first release of Fedora 28 Atomic Host, but showed up sometime over the life of Fedora 28 Atomic Host). I decided to dig in and do some investigation. Along the way I realized, admittedly later than I than I should have, that I could use rpm-ostree-bisect to help determine exactly where this problem was introduced.

Automated Bisecting

OK. First off I needed an easy scriptable test to determine if the problem exists in a particular commit. In this case it was pretty easy; I just needed to search for Timed out waiting for device in the journal for the current boot. My test.sh script ended up looking like:

[root@localhost ~]# cat /usr/local/bin/test.sh 
#!/bin/bash
if journalctl -b0 | grep 'Timed out waiting for device'; then
    echo "Found device timeout. Test Fails"
    exit 1
else
    echo "Did not find device timeout. Test Passes"
    exit 0
fi

I then grabbed the rpm-ostree-bisect program from GitHub and started the bisect. I happened to know at the time one commit that was good and one that was bad so I added that information to the rpm-ostree-bisect invocation:

[root@localhost ~]# curl -L https://raw.githubusercontent.com/ostreedev/ostree-releng-scripts/master/rpm-ostree-bisect > /usr/local/bin/rpm-ostree-bisect
[root@localhost ~]# chmod +x /usr/local/bin/rpm-ostree-bisect 
[root@localhost ~]# rpm-ostree-bisect --good 94a9d06eef34aa6774c056356d3d2e024e57a0013b6f8048dbae392a84a137ca --bad 8df48fa2e70ad1952153ae00edbba08ed18b53c3d4095a22985d1085f5203ac6 --testscript /usr/local/bin/test.sh
Using data file at: /var/lib/rpm-ostree-bisect.json
Created symlink /etc/systemd/system/multi-user.target.wants/rpm-ostree-bisect.service → /etc/systemd/system/rpm-ostree-bisect.service.
[root@localhost ~]# reboot 

I let it run and when it finished the log from the last run gave me all of the information I needed:

[root@localhost ~]# journalctl -b0 -o cat -u rpm-ostree-bisect.service 
Starting RPM-OSTree Bisect Testing...
Using data file at: /var/lib/rpm-ostree-bisect.json
Did not find device timeout. Test Passes
Removed /etc/systemd/system/multi-user.target.wants/rpm-ostree-bisect.service.
BISECT TEST RESULTS:
Last known good commit:
  5736e83 : 28.20180708.0 : 2018-07-08T20:03:31Z
First known bad commit:
  bc3aa17 : 28.20180711.0 : 2018-07-11T18:26:22Z
libostree pull from 'fedora-atomic' for fedora/28/x86_64/atomic-host complete
security: GPG: commit http: TLS
non-delta: meta: 270 content: 0
transfer: secs: 43 size: 2.7 MB
ostree diff commit old: 5736e832b1fd59208465458265136fbe2aa4ba89517d8bdcc91bc84724f40a8e
ostree diff commit new: bc3aa17a5ad6c04103563bd93c4c668996ef786ec04f989a3209a9887c8e982c
Upgraded:
  acl 2.2.52-20.fc28 -> 2.2.53-1.fc28
  attr 2.4.47-23.fc28 -> 2.4.48-1.fc28
  dracut 047-8.git20180305.fc28 -> 048-1.fc28
  dracut-config-generic 047-8.git20180305.fc28 -> 048-1.fc28
  dracut-network 047-8.git20180305.fc28 -> 048-1.fc28
  kernel 4.17.3-200.fc28 -> 4.17.4-200.fc28
  kernel-core 4.17.3-200.fc28 -> 4.17.4-200.fc28
  kernel-modules 4.17.3-200.fc28 -> 4.17.4-200.fc28
  libacl 2.2.52-20.fc28 -> 2.2.53-1.fc28
  libattr 2.4.47-23.fc28 -> 2.4.48-1.fc28
  openldap 2.4.46-1.fc28 -> 2.4.46-2.fc28
  podman 0.6.5-1.git9d97bd6.fc28 -> 0.7.1-1.git802d4f2.fc28
  python3-pytoml 0.1.16-1.fc28 -> 0.1.17-1.fc28
Removed:
  grubby-8.40-11.fc28.x86_64
Added:
  libkcapi-1.1.1-1.fc28.x86_64
  libkcapi-hmaccalc-1.1.1-1.fc28.x86_64
Started RPM-OSTree Bisect Testing.

So now we now the exact versions of software that we need to look at that introduced the problem. I had already suspected dracut and this confirmed my suspicions even more. Going from dracut-047 to dracut-048 introduced the problem.

NOTE: For anyone interested here is the final rpm-ostree-bisect.json

Fin

Having this information ultimately led me to discover that the behavior gets introduced specifically in the iscsi module of the dracut-network rpm. I opened an upstream bug to track the issue further with the dracut maintainers.