Losing my Muse…

While investigating the multi-monitor issue yesterday, Muse died with some bad BTRFS message.

Multi-monitor support in Spice

Apparently, the problem I saw with multi-monitor reproduces with Fedora 25 guests, but not Fedora 24, according to Snir.

Christophe remarked that I was using Cinnamon and not Gnome, so I tried switching to Gnome. Things started going south for Muse when I did that.

Muse BTRFS crash

Selecting GNOME, I could not log in: I was immediately sent back to the login prompt. So I tried GNOME on Xorg, same issue. I tried to return to Cinnamon, but even that failed.

Time to reboot, I thought… And there, I landed in the emergency shell. journalctl showed a number of rather ominous messages (copied manually, there is no network, so pardon the typos):

BTRFS error (device sdb3): unable to find ref byte nr 100493688832 parent 0 root 258 owner 185347 offset 9630244864

Then a kernel stack trace, and

BTRFS: error (device sdb3) in __btrfs_free_extent:6958: errno=-2 No such entry
BTRFS: error (device sdb3) in btrfs_run_delayed_refs:2967: errno=-2 No such entry
BTRFS: error (device sdb3) in btrfs_replay_log:2506: errno=-2 No such entry (Failed to recover log tree)
BTRFS error (device sdb3): pending csums is 23719936
BTRFS error (device sdb3): cleaner transaction attach returned -30
mount: mount(2) failed: /sysroot: No such file or directory
sysroot.mount: Mount process exited, code=exited status=32
Failed to mount /sysroot.
BTRFS error (device sdb3): open_ctree failed
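
For readers wondering what the numeric codes in these messages mean: kernel messages report errors as negated errno values. Here is a minimal sketch that decodes them with Python's standard library. The helper name `describe_errno` is made up for illustration, and the text printed is the C library's wording, which differs from the kernel's own strings (BTRFS prints "No such entry" for -2).

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for a kernel-style (negative) errno value."""
    number = abs(code)
    # errno.errorcode maps numbers to symbolic names such as 'ENOENT'.
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name}: {os.strerror(number)}"


# The two codes that appear in the messages above.
for code in (-2, -30):
    print(code, describe_errno(code))
```

On Linux, 2 is ENOENT and 30 is EROFS, so the "returned -30" line suggests the filesystem had already been forced read-only by the time the cleaner thread tried to attach to a transaction.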

Then some systemd output telling me the system was going to emergency mode. Time to try btrfsck /dev/sdb3. It looks happy.

After that, tried reboot, but it complains that there is no init. Tried exit from the shell, but it prints logout and then stays there. What is the correct procedure for exiting from an emergency shell? (Update: I discovered later that under systemd, you can use systemctl reboot. Duly noted)

Forced a reboot manually. Same problem. Looked at btrfsck -h, and it looks like it’s not repairing by default. The help page says repair is considered dangerous. Well, my machine won’t boot, so I’ll go with dangerous.

In repair mode, btrfsck warns that it has to clear the log tree. To the extent that the log tree is the root cause of my problem, I guess it’s a good thing. Chances are I might lose a VM or two in the process. We’ll see.

After a long time, btrfsck tells me it can’t repair the disk. Trying again, same issue. Finally, tried btrfs rescue zero-log, and that seems to allow the system to boot further. At least, it seems to mount the disk. But it stays stuck with the Fedora F logo, does not really go to full multi-user mode.

Booted again without the quiet and rhgb options. I see rather nasty stack traces when mounting the BTRFS volume, and then it says something about mounting in read-only mode. Good luck starting a system if you can’t write to any file, I guess 😉

Interestingly, the only two things that fail to start in that mode are Network Manager and Hostname Service. But then systemd is ready to wait forever for these to start. So I guess I’m stuck.

Rebooted under Ubuntu and installed BTRFS tools. But my Ubuntu is 14.04 LTS, and the BTRFS tools seem a bit confused. I think I’m going to do more harm than good.

Booting into single user mode (Go to grub, type e to edit, remove rhgb and quiet, add single, type Control-X to resume booting). I see some interesting messages in the boot log, notably:

Device: /dev/sdb [SAT], 1 currently unreadable (pending) sectors
Device: /dev/sdb [SAT], 1 Offline uncorrectable sectors

This page seems to indicate that I should run smartctl -a /dev/sdb. The output indicates that the disk passes its self-test assessment. This is a Western Digital Caviar Green disk.
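
A passing overall self-assessment does not contradict those two boot messages. As far as I know, they correspond to SMART attributes 197 (Current_Pending_Sector) and 198 (Offline_Uncorrectable), which show up in the attribute table of the smartctl -a output and can be non-zero while the health check still passes. As a small sketch, here is one way to pull the counts out of messages of the shape quoted above. The function name `bad_sector_counts` and the regular expression are my own, written only against those two lines, not against every message smartd can emit.

```python
import re

# Matches only the two message shapes quoted in this post.
SMARTD_SECTORS = re.compile(
    r"Device: (?P<device>\S+) \[\w+\], (?P<count>\d+) "
    r"(?P<kind>currently unreadable \(pending\)|offline uncorrectable) sectors",
    re.IGNORECASE,
)


def bad_sector_counts(log_text: str) -> dict:
    """Map (device, kind of bad sector) to the reported sector count."""
    counts = {}
    for match in SMARTD_SECTORS.finditer(log_text):
        key = (match["device"], match["kind"].lower())
        counts[key] = int(match["count"])
    return counts


boot_log = """\
Device: /dev/sdb [SAT], 1 currently unreadable (pending) sectors
Device: /dev/sdb [SAT], 1 Offline uncorrectable sectors
"""
print(bad_sector_counts(boot_log))
```

Tracking these counts over time is more informative than the pass/fail verdict: a number that keeps growing is a stronger hint that the disk is on its way out.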

My guess is that this indicates that my disk is starting to wear out. Another one?

Starting to really dislike Western Digital desktop disks

All my recent disk failures have been with this WD crap. And yet I love their laptop drives, which seem to be highly reliable under rather more stressful conditions.

I have 5 WD Passport of different sizes, including a 3TB one. I have yet to see one fail, despite them being used for daily backups and stuff like that.

That being said, I should not be too hard on WD. I have three dead Seagate drives on my desk to prove it 😉 I guess the difference between desktop and laptop disks is that laptops don’t kill the power to the disk abruptly if the machine loses power. And power losses happened a bit too frequently at home, thanks to a combination of cleaning lady incidents and having bought an electric car that pushed my circuit breaker over its limit.

Copying stuff out

Tried a few more btrfs rescue and btrfsck tests. But the next ones, btrfs rescue chunk-recover and btrfs rescue super-recover, make me a tad bit nervous. So before getting there, I want to back up my data.

I tried ifup en0, but that did not work. The most difficult part was finding the name of the network device on Muse. Usually, I get that from ifconfig, but here, it’s too early. Finally, did a systemctl | grep -i net, that gave me the name of the device, and ifup enp4s0. Some day, I’ll have to figure out where this silly name comes from and what it means. I don’t have 4 network ports in this machine, do I?
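
On where the name comes from: as far as I understand systemd's "predictable network interface names" scheme, enp4s0 encodes the hardware location rather than a port count: en for Ethernet, p4 for PCI bus 4, s0 for slot 0. So it does not mean there are four network ports. A minimal sketch of that decoding follows. The function `decode_pci_name` is hypothetical, and it only handles the simple PCI form, not the onboard, hotplug-slot, MAC-based or USB variants the scheme also defines.

```python
import re

# <type>p<bus>s<slot>[f<function>], e.g. enp4s0 or wlp3s0.
PCI_NAME = re.compile(
    r"^(?P<kind>en|wl|ww)p(?P<bus>\d+)s(?P<slot>\d+)(?:f(?P<function>\d+))?$"
)
KINDS = {"en": "Ethernet", "wl": "WLAN", "ww": "WWAN"}


def decode_pci_name(name: str):
    """Decode a PCI-style predictable interface name, or return None."""
    match = PCI_NAME.match(name)
    if match is None:
        return None  # not in the simple p<bus>s<slot> form
    return {
        "kind": KINDS[match["kind"]],
        "bus": int(match["bus"]),
        "slot": int(match["slot"]),
    }


print(decode_pci_name("enp4s0"))
print(decode_pci_name("en0"))
```

The second call returns None because en0 is not a name this scheme generates for a PCI device.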

Copying the source repositories went well. Copying the VM stuff was less successful. I had a number of BTRFS “csum failed” messages and (possibly resulting) Input/Output error on fedora25-qxl. This was the VM I was using for the multi-screen bug analysis. It’s not a major loss, since the team has reproduced the issue on other machines, and that VM contained nothing else of value. The VM that I cared about, fedora25-clone, was copied successfully.

 


Author: Christophe de Dinechin

I try to change the world, but that's work in progress. If you want to know me, google "Christophe de Dinechin". Keywords: concept programming, virtualization, OS design, programming languages, video games, 3D, modern physics. Some stuff I did that I'm proud of: the first "true" 3D game for the PC, HP's big iron virtualization, real-time test systems for car electronics, some of the best games for the HP48 calculator, a theory of physics that makes sense (at least to me).
