While investigating the multi-monitor issue yesterday, Muse died with some bad BTRFS message.
Multi-monitor support in Spice
Apparently, the problem I saw with multi-monitor reproduces with Fedora 25 guests, but not Fedora 24, according to Snir.
Christophe remarked that I was using Cinnamon and not Gnome, so I tried switching to Gnome. Things started going south for Muse when I did that.
Muse BTRFS crash
After selecting GNOME, I could not log in: I was immediately sent back to the login prompt. So I tried GNOME on Xorg, with the same result. I tried to return to Cinnamon, but even that failed.
Time to reboot, I thought… And there, I landed in the emergency shell.
journalctl showed a number of rather ominous messages (copied manually, there is no network, so pardon the typos):
BTRFS error (device sdb3): unable to find ref byte nr 100493688832 parent 0 root 258 owner 185347 offset 9630244864
Then a kernel stack trace, and
BTRFS: error (device sdb3) in __btrfs_free_extent:6958: errno=-2 No such entry
BTRFS: error (device sdb3) in btrfs_run_delayed_refs:2967: errno=-2 No such entry
BTRFS: error (device sdb3) in btrfs_replay_log:2506: errno=-2 No such entry (Failed to recover log tree)
BTRFS error (device sdb3): pending csums is 23719936
BTRFS error (device sdb3): cleaner transaction attach returned -30
mount: mount(2) failed: /sysroot: No such file or directory
sysroot.mount: Mount process exited, code=exited status=32
Failed to mount /sysroot.
BTRFS error (device sdb3): open_ctree failed
systemd output telling me the system was going to emergency mode. Time to try btrfsck /dev/sdb3. It looks happy.
After that, tried
reboot, but it complains that there is no
exit from the shell, but it prints
logout and then stays there. What is the correct procedure for exiting from an emergency shell? (Update: I discovered later that under systemd, you can use systemctl reboot. Duly noted.)
Forced a reboot manually. Same problem. Looked at btrfsck -h, and it looks like it’s not repairing by default. The help page says repair is considered dangerous. Well, my machine won’t boot, so I’ll go with dangerous.
In repair mode, btrfsck warns that it has to clear the log tree. To the extent that the log tree is the root cause of my problem, I guess it’s a good thing. Chances are I might lose a VM or two in the process. We’ll see.
After a long time, btrfsck tells me it can’t repair the disk. Trying again, same issue. Finally, tried btrfs rescue zero-log, and that seems to allow the system to boot further. At least, it seems to mount the disk. But it stays stuck with the Fedora F logo, does not really go to full multi-user mode.
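The sequence I went through can be summed up as a small sketch, from least to most invasive. It only prints the commands instead of running them, since the last two modify the disk. The function name btrfs_recovery_plan is made up for this note, and I write btrfs check where I typed btrfsck above (btrfsck is the older name for the same tool).

```shell
# Print, without running them, the recovery commands tried above,
# ordered from read-only to most invasive.
# btrfs_recovery_plan is a hypothetical helper name, not a btrfs tool.
btrfs_recovery_plan() {
    dev=$1
    echo "btrfs check $dev"            # read-only check, the default mode
    echo "btrfs check --repair $dev"   # repair mode, flagged as dangerous by the help
    echo "btrfs rescue zero-log $dev"  # throws away the log tree
}

btrfs_recovery_plan /dev/sdb3
```

Printing first and pasting by hand is a deliberate choice: in an emergency shell, I would rather read each command once more before letting it touch the disk.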
Booted again without the
rhgb options. I see rather nasty stack traces when mounting the BTRFS volume, and then it says something about mounting in read-only mode. Good luck starting a system if you can’t write to any file, I guess 😉
Interestingly, the only two things that fail to start in that mode are Network Manager and Hostname Service. But then systemd is ready to wait forever for these to start. So I guess I’m stuck.
Rebooted under Ubuntu and installed BTRFS tools. But my Ubuntu is 14.04 LTS, and the BTRFS tools seem a bit confused. I think I’m going to do more harm than good.
Booting into single user mode (Go to grub, type
e to edit, remove
single, type Control-X to resume booting). I see some interesting messages in the boot log, notably:
Device: /dev/sdb [SAT], 1 currently unreadable (pending) sectors
Device: /dev/sdb [SAT], 1 Offline uncorrectable sectors
This page seems to indicate that I should run smartctl -a /dev/sdb. The output indicates that the disk passes its self-test assessment. This is a Western Digital Caviar Green disk.
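The overall self-test verdict and the per-attribute counters are two different things, so it can help to pull out just the counters that match the two boot messages. Here is a minimal sketch, assuming the attribute table printed by smartctl -A names them Current_Pending_Sector and Offline_Uncorrectable and puts the raw value in the tenth column. That is the usual layout, but the names depend on the drive database, so treat both as assumptions. The function name bad_sector_counts is made up.

```shell
# Read `smartctl -A` output on stdin and print the name and raw value of
# the pending and offline-uncorrectable sector attributes.
# Assumes the usual 10-column attribute table; bad_sector_counts is a
# hypothetical helper name.
bad_sector_counts() {
    awk '$2 == "Current_Pending_Sector" || $2 == "Offline_Uncorrectable" { print $2, $10 }'
}
```

Usage would be something like smartctl -A /dev/sdb | bad_sector_counts, run as root. A non-zero raw value on either line is the same information as the boot-log messages, just straight from the disk.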
My guess is that this indicates that my disk is starting to wear out. Another one?
Starting to really dislike Western Digital desktop disks
All my recent disk failures have been with this WD crap, as much as I love their laptop drives, which seem to be highly reliable under rather more stressful conditions.
I have five WD Passport drives of different sizes, including a 3TB one. I have yet to see one fail, despite them being used for daily backups and stuff like that.
That being said, I should not be too hard on WD. I have three dead Seagate drives on my desk to prove it 😉 I guess the difference between desktop and laptop disks is that laptops don’t kill the power to the disk abruptly if the machine loses power, which happened a bit too frequently at home, thanks to a combination of cleaning lady incidents and having bought an electric car that pushed my circuit breaker over its limit.
Copying stuff out
Tried a few more btrfs rescue and btrfsck tests. But the next ones, btrfs rescue chunk-recover and btrfs rescue super-recover, make me a tad bit nervous. So before getting there, I want to back up my data.
ifup en0, but that did not work. The most difficult part was finding the name of the network device on Muse. Usually, I get that from ifconfig, but here, it’s too early. Finally, did a systemctl | grep -i net, that gave me the name of the device, and ifup enp4s0. Some day, I’ll have to figure out where this silly name comes from and what it means. I don’t have 4 network ports in this machine, do I?
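For what it’s worth, the name follows the systemd "predictable network interface names" scheme: en is for Ethernet, and p4s0 is a PCI location (bus 4, slot 0), not a port count. A rough sketch of a decoder follows. It only understands the simple p<bus>s<slot> and o<index> forms, and the function name decode_ifname is made up.

```shell
# Decode the simplest forms of systemd predictable interface names.
# decode_ifname is a hypothetical helper; the real scheme has more
# variants (PCI domain, function, USB ports, MAC-based names).
decode_ifname() {
    name=$1
    case $name in
        en*) kind=Ethernet ;;
        wl*) kind=WLAN ;;
        ww*) kind=WWAN ;;
        *)   echo "$name: not a predictable interface name"; return 1 ;;
    esac
    rest=${name#??}                     # drop the two-letter type prefix
    case $rest in
        p[0-9]*s[0-9]*)
            bus=${rest#p}; bus=${bus%%s*}           # digits between p and s
            slot=${rest#*s}; slot=${slot%%[!0-9]*}  # digits right after s
            echo "$name: $kind, PCI bus $bus, slot $slot"
            ;;
        o[0-9]*)
            echo "$name: $kind, on-board device index ${rest#o}"
            ;;
        *)
            echo "$name: $kind, location '$rest' not decoded by this sketch"
            ;;
    esac
}

decode_ifname enp4s0
```

On a Linux box, the interface names themselves can also be listed without any network tooling by looking at the entries under /sys/class/net, which would have saved me the systemctl | grep detour.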
Copying the source repositories went well. Copying the VM stuff was less successful. I had a number of BTRFS “csum failed” messages and (possibly resulting) Input/Output error on fedora25-qxl. This was the VM I was using for the multi-screen bug analysis. It’s not a major loss, since the team has reproduced the issue on other machines, and that VM contained nothing else of value. The VM that I cared about, fedora25-clone, was copied successfully.
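When copying from a disk that throws Input/Output errors, it is handy to keep going past the bad files and end up with a list of what failed. This is not the exact command I used, just a minimal sketch of the idea with plain cp. The function name copy_skip_errors and the three arguments are made up for illustration.

```shell
# Copy every regular file under $1 into $2, keeping relative paths.
# Files that cp cannot read (for example after a csum failure) are
# listed in $3 instead of stopping the whole copy. A failed cp may
# leave a partial file behind in the destination.
# copy_skip_errors is a hypothetical helper name. File names containing
# newlines are not handled.
copy_skip_errors() {
    src=$1 dst=$2 faillog=$3
    : > "$faillog"                      # start with an empty failure list
    (cd "$src" && find . -type f) | while IFS= read -r rel; do
        mkdir -p "$dst/$(dirname "$rel")"
        cp -p "$src/$rel" "$dst/$rel" 2>/dev/null ||
            printf '%s\n' "$rel" >> "$faillog"
    done
}
```

After a run, an empty failure list means everything came across. Otherwise the list says exactly which files (here, it would have been the fedora25-qxl image) need to be written off or restored from elsewhere.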