I woke up to a nasty little surprise this morning. A virtual machine that I run alongside several others, and which performs plenty of data transfers as part of its job, had died. Truth be told, I was performing a file transfer and firing a batch of network requests at that machine at the time, but when it stopped responding over the network and refused SSH connections, I laughed at how I had managed to tank a CentOS server with 16 GB of RAM on an NVMe disk. That takes effort…
Or does it? I found two online discussions which suggest that the Discard option in the VM’s disk settings could be the culprit when using Proxmox. Either way, here’s what I see when booting my VM (and when trying to unmount the failing filesystem):
[sda] tag#192 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[sda] tag#192 Sense Key : Aborted Command [current]
[sda] tag#192 Add. Sense: I/O process terminated
[sda] tag#192 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
blk_update_request: I/O error, dev sda, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
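Errors like these scroll past quickly, so it can help to save the console or dmesg output and filter it afterwards. Here is a minimal Python sketch that pulls the block-layer error lines out of saved log text. The function name `find_block_errors` and the regular expression are my own assumptions, modelled only on the single error line shown above, so other kernel versions may format the line differently.

```python
import re

# Modelled on the one line seen in this post, e.g.
# "blk_update_request: I/O error, dev sda, sector 0 op 0x1:(WRITE) ..."
# Other kernels may word this differently (assumption).
BLK_ERROR = re.compile(
    r"blk_update_request: I/O error, dev (?P<dev>\S+), "
    r"sector (?P<sector>\d+) op (?P<opcode>0x[0-9a-fA-F]+):\((?P<op>\w+)\)"
)


def find_block_errors(log_text):
    """Return one dict per block-layer I/O error found in the log text."""
    errors = []
    for line in log_text.splitlines():
        match = BLK_ERROR.search(line)
        if match:
            errors.append(
                {
                    "dev": match.group("dev"),
                    "sector": int(match.group("sector")),
                    "op": match.group("op"),
                }
            )
    return errors


sample = (
    "[sda] tag#192 Sense Key : Aborted Command [current]\n"
    "blk_update_request: I/O error, dev sda, sector 0 op 0x1:(WRITE) "
    "flags 0x800 phys_seg 0 prio class 0\n"
)

for error in find_block_errors(sample):
    print(error["dev"], error["op"], "error at sector", error["sector"])
```

Seeing that every failure is a WRITE on the same device supports the suspicion that writes, not reads, are what go wrong here.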
I first found this post, which had the following juicy snippet of information:
I can confirm this issue with latest version pve-qemu-kvm: 3.0.1-4 - when used with Ceph KRBD VMs suffer from IO errors and data loss.
They then go on to say:
Yeah, scsi with discard […] It is rather easily trigged, just generate lots of IO activity inside VM
And then they read my mind:
I am surprised that such a serious bug (freezing VMs, data loss!) hasn’t caused more complaints here. Especially considering this bug is persisting for months already.
The bug report filed in this thread is my second source, and they say this:
I can confirm that with ‘detect_zeroes=0’ I do not have this issue, if I don’t set detect_zeroes I’ll see ext4 errors within a day with discard=on.
That’s great to know, but the issue is that to test the two theories (one being disabling discard, and the other disabling detect_zeroes), I need to risk data loss. If my understanding is correct, it’s not reading the disk that fails with these “incorrect” options, but rather writes that cause some serious corruption. Admittedly, I’m not incredibly comfortable using Proxmox at this point in time, but I’ll first try to recover this data and then evaluate other options for a type 1 hypervisor.
I’ll add to this post as I work through potential solutions and the journey to get there.
I currently have this VM set to VirtIO SCSI Single as the only hard drive controller type, as this Reddit post and the Proxmox docs themselves class this controller as giving the best performance possible. However (and I knew this would come back to bite me), there’s usually a good reason why the “best” option isn’t the default (which is VirtIO SCSI), and I feel as though I’ve now paid the price for that.
Amazingly, changing the controller back to the default of VirtIO SCSI fixed the initial boot issue! I thought it would have irreversibly rendered the disk content unusable, but it works without any issues. That said, I have disabled Discard on that disk, so that could well be the issue. However, since it was cited as being the issue to begin with, I don’t think I’m going to toy around with it too much. I will, however, monitor what happens to the apparent disk usage now that I have Discard disabled. Theoretically, some underlying space (somewhere) won’t be reclaimed, but it should be okay for my purposes.
Unfortunately, even though the machine worked fine for a little while, it then ran into disk errors that looked much like the initial ones. They flashed past too quickly for me to read, so I’ll reboot and see what happens.
Interestingly, resetting the system now shows the same error. I’m no longer sure what I did to fix the problem in the first place. Running xfs_repair gave the following error, even though I did unmount the filesystem:
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before running xfs_repair. If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.
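The order of operations that this message asks for is easy to get wrong under pressure, so here is a small Python sketch that only builds the command sequence and does not execute anything. The function name `xfs_recovery_steps` and its `destroy_log` parameter are my own invention for illustration. The device and mount point are the ones used in this post.

```python
def xfs_recovery_steps(device, mountpoint, destroy_log=False):
    """Return the recovery commands, in order, as argument lists.

    Nothing is executed here; this only documents the sequence.
    """
    if destroy_log:
        # Last resort: -L discards the log, which may itself cause corruption.
        return [["xfs_repair", "-L", device]]
    return [
        ["mount", device, mountpoint],  # mounting replays the log
        ["umount", mountpoint],         # xfs_repair wants it unmounted
        ["xfs_repair", device],
    ]


for command in xfs_recovery_steps("/dev/cl/root", "/sysroot"):
    print(" ".join(command))
```

Keeping the `-L` path behind an explicit flag mirrors the warning in the message: it is only worth trying once a normal mount has failed to replay the log.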
Fair enough. Let’s try mounting it.
:/# mount /dev/cl/root /sysroot
[304.562797] XFS (dm-0): Mounting V5 Filesystem
[304.579537] XFS (dm-0): Starting recovery (logdev: internal)
[304.583047] XFS (dm-0): Ending recovery (logdev: internal)
…wait, that’s it? Did it work?
:/# ls /sysroot -l
total 16
lrwxrwxrwx 1 root root 7 May 11 2019 bin -> usr/bin
[...]
It does. Curious. I feel like letting the machine write to disk caused further failures, even with Discard off. I’ll try xfs_repair and then set 0 to see if that fixes anything.
:/# umount /sysroot
Unfortunately, this shows the same disk errors as above, and on top of that further errors like this one:
XFS (dm-0): xfs_do_force_shutdown(0x2) called from line 1272 of file fs/xfs/xfs_log.c [...]
XFS (dm-0): Unable to update superblock counters. Freespace may not be correct on next mount.
Yuck. This is getting quite messy. The more I play around here, the less I’m liking Proxmox. It’s a shame, because I have a fair amount of infrastructure set up here and didn’t mind the setup I have. Now I mind, just a little.
ls operations around the mounted sysroot keep giving random unprompted errors on the console. Me no like. I think I’m going to have to destroy the log and attempt a repair.
:/# xfs_repair /dev/cl/root -L
It gave the FAILED Result (and other such) errors as before, but it seemed to finish. Mounting it then showed my files (again), but that isn’t helping. I’ll try to add the detect_zeros flag to my VM’s disk.
scsi0: local-lvm:vm-101-disk-0,cache=writeback,size=750G,detect-zeros=0
Annoyingly, this option must be invalid, because the disk disappears from the VM’s hardware pane. You’re not helping me love you, Proxmox. It turns out I read an incorrect command from somewhere. What I actually needed was detect_zeroes=0. Now it shows up, but will it blend?
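A misspelt option name like this is easy to catch before touching the VM config. The Python sketch below parses a disk line of the shape shown in this post and flags option names it does not recognise. The helper names and the `KNOWN_OPTIONS` set are my own assumptions: the set lists only the options mentioned in this post and is nowhere near the full list that Proxmox accepts.

```python
# Option names taken from this post only; Proxmox accepts more (assumption).
KNOWN_OPTIONS = {"cache", "size", "discard", "detect_zeroes"}


def parse_disk_line(line):
    """Split 'scsi0: volume,key=value,...' into (bus, volume, options)."""
    bus, _, rest = line.partition(":")
    volume, *pairs = rest.strip().split(",")
    options = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        options[key] = value
    return bus.strip(), volume, options


def unknown_options(options):
    """Return option names that are not in the small known set."""
    return sorted(set(options) - KNOWN_OPTIONS)


def format_disk_line(bus, volume, options):
    """Rebuild the config line from its parts, keeping option order."""
    parts = [volume] + [f"{key}={value}" for key, value in options.items()]
    return f"{bus}: " + ",".join(parts)


bad = "scsi0: local-lvm:vm-101-disk-0,cache=writeback,size=750G,detect-zeros=0"
bus, volume, options = parse_disk_line(bad)
print("unrecognised:", unknown_options(options))

# Swap the misspelt name for the one that worked in this post.
options.pop("detect-zeros")
options["detect_zeroes"] = "0"
print(format_disk_line(bus, volume, options))
```

An allow-list this small will produce false alarms on legitimate options, so treat a hit as a prompt to check the documentation and not as proof of a typo.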
Nope. Same errors as before. I’ll try an xfs_repair and see if continuing to boot works.
Turns out that just sends the same errors my way. As a last-ditch effort, I’ll set the hard drive’s Cache option to Default (No cache) and see if that works after rebooting.
Okay, now it boots… was that because of caching or the xfs_repair? That’ll teach me to not do too much at once. Now that the machine has booted up, I’ll let the normal processes fire up and start their disk and network transfers and see if it fails at any point. For good measure, I’ll back things up so that I can migrate the machine if need be…
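The lesson about not changing too much at once can be made concrete. Here is a rough Python sketch that takes a baseline set of settings and produces test configurations that each differ from it in exactly one setting. The function name `single_change_variants` is hypothetical, and the values are human-readable labels for the settings touched in this post, not literal Proxmox config syntax. The "unset" label for detect_zeroes is my own placeholder.

```python
def single_change_variants(baseline, candidates):
    """Return (changed_key, settings) pairs, each one change from baseline."""
    variants = []
    for key, new_value in candidates.items():
        if baseline.get(key) == new_value:
            continue  # already at this value, nothing to test
        variant = dict(baseline)
        variant[key] = new_value
        variants.append((key, variant))
    return variants


# Labels as used in this post, not literal config values (assumption).
baseline = {
    "controller": "VirtIO SCSI Single",
    "discard": "on",
    "detect_zeroes": "unset",
    "cache": "writeback",
}
candidates = {
    "controller": "VirtIO SCSI",
    "discard": "off",
    "detect_zeroes": "0",
    "cache": "Default (No cache)",
}

for key, settings in single_change_variants(baseline, candidates):
    print(f"change only {key}: {settings}")
```

Testing each variant against a restored backup would show which single setting, if any, makes the errors go away.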
Enough is Enough
And just as I say that, the machine dies. This is clearly a more fundamental problem. Not knowing what causes this issue or how it affects my other containers and VMs, I can now safely draw a line under Proxmox. Sorry - it’s been fun, but I’m out.
Hopefully this article helped someone somehow, but I’ll be moving away from Proxmox. I’ll start an article after this one to detail my journey to another hypervisor.