Returning to last week’s investigations regarding VMware virtual machines and recent Linux kernels.
Back from Munich
Back from old Munich. This is a beautiful city that I discovered for the first time (I only knew the airport before). Nice people too. And I come back with a Red Hat 🙂 and some other nice photos.
Back to trying to figure out the various issues with VMware and recent Linux kernels. I’ll probably need to check with my management whether it’s OK for me to spend so much time on this. But for now, I would really appreciate it if things just worked. I made some progress this weekend, and got my internal partition to boot.
Reminder of the current state
As of late last week, I had the following issues:
- Booting a regular VM image file with a recent kernel hangs. I filed a Bugzilla. I bisected the problem to a specific commit, but later realized that the same version could boot or fail to boot depending on “something else” that I have not identified yet. Specifically, a version as old as 4.9 can sometimes fail to boot, and more recent versions that I had marked as “good” also later failed to boot. Something else is at play there.
- Booting a physical partition in VMware proved a bit complicated. I added a physical disk with the relevant partitions, but that would not work.
Progress on the two fronts has been a bit slow, but steady.
Physical partition finally boots
After a lot of trial and error, I finally managed to get the physical partition to boot under the following conditions:
- Set the firmware to EFI by adding the following line to the virtual machine’s configuration (.vmx) file:
  firmware = "efi"
- Use scsi and not sata for the disk interface – so copying the macOS VM was not such a good idea after all:
  scsi0:0.present = "TRUE"
  scsi0:0.fileName = "PhysicalLinuxPartitions.vmdk"
  sata0:0.present = "FALSE"
- Wait until network boot fails. Then, and only then, will the VM attempt to boot from the hard disk. I’ve tried to force the boot order with the following, but it does not seem to help:
  bios.bootOrder = "CDROM,hdd,ethernet1"
  bios.hddOrder = "scsi0:0"
- Boot the recovery mode image. The other images (including some I built myself) fail to find the hard disk to boot from, and end up in the dracut emergency shell. I’m a bit puzzled by that, and I want to figure it out. But at least, that gives me a workable physical partition.
I am not sure yet what prevents graphical mode from booting under VMware. One thing I noticed by diff-ing a case that boots OK against a case that does not is that the following line only shows up when it works:
[ 0.000000] Hypervisor detected: VMware
There are more oddities, e.g. what looks like a bad number of CPUs in the case that does not work:
[ 0.000000] ------------[ cut here ]------------
[ 0.000000] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/apic/apic.c:2065 __generic_processor_info+0x28c/0x370
[ 0.000000] Only 63 processors supported.Processor 64/0x80 and the rest are ignored.
[ 0.000000] Modules linked in:
[ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.9.0-rc4+ #24
[ 0.000000] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014
[ 0.000000] ffffffff81e03cb0 ffffffff8132bb47 ffffffff81e03d00 0000000000000000
[ 0.000000] ffffffff81e03cf0 ffffffff81059c26 0000081100001000 0000000000000040
[ 0.000000] 0000000000000015 0000000000000000 0000000000000000 0000000000000080
[ 0.000000] Call Trace:
[ 0.000000]  dump_stack+0x4d/0x66
[ 0.000000]  __warn+0xc6/0xe0
[ 0.000000]  warn_slowpath_fmt+0x4a/0x50
[ 0.000000]  __generic_processor_info+0x28c/0x370
[ 0.000000]  acpi_register_lapic+0x32/0x80
[ 0.000000]  acpi_parse_lapic+0x46/0x4e
[ 0.000000]  acpi_parse_entries_array+0xf2/0x14d
[ 0.000000]  acpi_table_parse_entries_array+0xae/0xd0
[ 0.000000]  acpi_boot_init+0xdf/0x4a7
[ 0.000000]  ? acpi_parse_x2apic_nmi+0x46/0x46
[ 0.000000]  ? dmi_ignore_irq0_timer_override+0x2e/0x2e
[ 0.000000]  setup_arch+0xafa/0xc00
[ 0.000000]  ? printk+0x43/0x4b
[ 0.000000]  start_kernel+0x59/0x3c7
[ 0.000000]  x86_64_start_reservations+0x2a/0x2c
[ 0.000000]  x86_64_start_kernel+0x178/0x18b
[ 0.000000] ---[ end trace 01d8505d0b85a6ae ]---
Indeed, the default kernel configuration limits the number of CPUs to 64, which is good enough for most people, and should not be a problem under VMware. What this seems to tell me is that the ACPI description coming from VMware reports more than 64 CPUs, which is odd (I understand that you reserve a few spares for CPU hotplug, but not zillions). Maybe this is related to Linux not detecting it’s running within a hypervisor.
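That 64-CPU ceiling is the kernel’s CONFIG_NR_CPUS option. A minimal sketch to check what a given build was configured with – the tree location and the KCONFIG variable are assumptions of mine, not part of my actual setup:

```shell
# Check the CPU limit a kernel build was configured with.
# The tree location is hypothetical; point KCONFIG at your own .config.
KCONFIG="${KCONFIG:-$HOME/linux/.config}"
if [ -r "$KCONFIG" ]; then
    grep '^CONFIG_NR_CPUS=' "$KCONFIG"   # e.g. CONFIG_NR_CPUS=64
else
    echo "no kernel .config at $KCONFIG" >&2
fi
```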
With 4.9, where I had both “good” and “bad” boots, I noticed that a build after make distclean with the default config booted again. So I’m attempting a new bisect, rebuilding like this at every step, and we’ll see where it leads me, if anywhere.
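The rebuild-at-every-step routine can be sketched as a small shell helper; the bisect_build name is mine, and the boot check stays manual since pass/fail is observed in the VM:

```shell
# Helper for one bisect step: rebuild from a pristine tree every time,
# since leftover build state seemed to change whether a kernel booted.
# Exit code 125 is the conventional "git bisect" signal for a commit
# that cannot be tested (here: one that fails to build).
bisect_build() {
    make distclean &&
    make defconfig &&
    make -j"$(nproc)" bzImage || return 125
}
# Inside the kernel tree: run bisect_build, boot arch/x86/boot/bzImage
# in the VM by hand, then "git bisect good" or "git bisect bad".
```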
Something is badly affecting my VMware performance. It may happen when I have more than one VM running and both are busy, or it may be something I have not identified yet. Either way, it really makes bisecting a bit painful.
I’m going to try to set up the Shuttle host for comfortable remote and local use. It’s a nice machine, but if I want it connected over 1Gb ethernet to the file server, it has to sit in a spot that is not very convenient for use as a primary machine, for lack of screen real estate.
I opened the machine this morning to check about a possible RAM transfer from the tower PC. Unfortunately, at 16G, the shuttle is at its max, so I’ll have to leave some VMs on the tower, using Windows as a host (24G).
System-wide update while I’m reading mail and bisecting my kernel…
Built-in remote access with VNC
The built-in remote access I get with the default Gnome settings is not compatible with the macOS Screen Sharing application. According to this page, that’s because the encryption used by Screen Sharing is not supported.
I tried changing the settings with dconf-editor as explained on the page, but neither Screen Sharing nor Chicken of the VNC is happy with the result. The first one says that this version of the software is not supported by Screen Sharing. The second one complains about unknown authType 18. This is about as user-friendly as it can get 😉
Another option I then changed was “require authentication”. With that off, Screen Sharing no longer complains, but it spins forever, and I never actually see the screen, although the server states that the desktop is controlled by another machine.
OK, I finally found the combination that works with Screen Sharing: switch “require encryption” off, use [‘vnc’] as the allowed authentication method, and provide the VNC password to Screen Sharing. Weird, but it works. Time to stuff this machine chock-full of VMs.
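For reference, the same combination can presumably be set from a terminal, assuming the Gnome VNC server is Vino (the default at the time). “secret” is a placeholder password; Vino stores it base64-encoded:

```shell
# Vino stores the VNC password base64-encoded; "secret" is a placeholder.
VNC_PASS_B64="$(printf '%s' 'secret' | base64)"
# Apply the settings only if the Vino schema is actually installed,
# so this sketch is a no-op on machines without it.
if command -v gsettings >/dev/null 2>&1 &&
   gsettings list-schemas 2>/dev/null | grep -q '^org.gnome.Vino$'; then
    gsettings set org.gnome.Vino require-encryption false
    gsettings set org.gnome.Vino authentication-methods "['vnc']"
    gsettings set org.gnome.Vino vnc-password "$VNC_PASS_B64"
fi
```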
First Shuttle hang…
Ouch, the Shuttle PC did a hard hang while I was dragging a window around. That’s a bit annoying. That PC has always been a bit fussy, but I did not expect it to start acting up again so quickly. This matches my personal experience with PCs, which are often quite fast and inexpensive, but supremely unstable.
Linux is not at fault; it’s generally things like weak connectors or PCI cards shifting in their slots, and I’ve seen the same stability issues with Windows as well. It’s still pretty annoying. I should not have opened that box this morning 🙂
Trying to install NFS-backed VMs (again)
Since the Shuttle PC is on the same 1Gb loop as the disk server, I’m tempted to try hosting the VMs on remote storage again. Testing with Virtual Machine Manager.
Attempt 1: I use a manual mount for the VM disk file. It fails with an access permission error from Qemu, stating that it cannot modify the disk as user 107 (qemu). I added the correct permissions, but still no go. This may again be related to the SE Linux warnings I got when trying to mount a disk over NFS with Boxes – although I did run the SE Linux command that is supposed to allow it.
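If the SE Linux knob in question is the usual boolean for NFS-backed disk images, it would be virt_use_nfs; a sketch, which needs root and is guarded so it is a no-op on hosts without SE Linux tooling:

```shell
# Allow qemu processes to access NFS-backed disk images under SELinux.
# Both commands need root; guarded so the sketch does nothing elsewhere.
if command -v setsebool >/dev/null 2>&1; then
    setsebool -P virt_use_nfs 1 || echo "setsebool failed (need root?)" >&2
    getsebool virt_use_nfs      # expect: "virt_use_nfs --> on"
fi
```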
Attempt 2: I instead declare the same location as an NFS mount point using the storage manager. This works a bit better. However, this time, when starting the VM, I get:
Unable to complete install: ‘Failed to connect socket to ‘/var/run/libvirt/virtlogd-sock’: No such file or directory’
Two nested single quotes :-O
Running systemctl start virtlogd works, and allows me to go one step further, to a rather scary-looking error message.
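Since virtlogd is normally socket-activated, a sketch of starting it now and making it survive reboots – both commands need root, and the guard makes this a no-op without systemd:

```shell
# Start virtlogd immediately, and enable its socket unit so libvirt can
# always reach /var/run/libvirt/virtlogd-sock after a reboot.
if command -v systemctl >/dev/null 2>&1; then
    systemctl start virtlogd         || echo "start failed (need root?)" >&2
    systemctl enable virtlogd.socket || echo "enable failed (need root?)" >&2
fi
```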
Switching the virtual disk back from SCSI to the original IDE gives me a message complaining about “Permission denied” on the ISO file. So clearly, Virtual Machine Manager / Qemu and Boxes do not work the same way here. Boxes is subjected to SE Linux permissions; VMM/Qemu has some other set of rules. Another little recursive chmod on the server, and I finally get a VM that looks like it boots.
Temporary conclusion: Can’t add a SCSI disk over NFS for now.
But then, it’s really fast! Until it dies, that is. Gnome shell crashed in the middle of the installation. Can’t get through. I tried to report the bug, but Gnome shell crashed again before I even got a chance to fill in enough information to report it 😦
So far, I’ve not been very lucky. I’m still trying to find some VM configuration that actually works… VMware won’t boot and is quite slow, Boxes dies on me for no reason and does not seem very happy with NFS, Qemu + KVM dies during installation… OK, every time, I’m trying configurations that are just a little bit off-road, but still…