In my previous post, I detailed an issue I had with data loss caused by the way Proxmox handles disks with specific configurations. The upshot of that post was that I no longer trust Proxmox to handle my VMs and containers, so this is my journey to find the next big bad hypervisor. Let’s go through what I’m looking for.
- Relatively lightweight. Proxmox is very good at this, but seeing as I have plenty (32 GB) of RAM on my host system, I don’t need anything crazy.
- Linux-based (and preferably RHEL-based). I’ve tried the Hyper-V on bare metal route, and it makes me feel, well, dirty. I could go into the technical reasons, but that would be a long post.
- Accessible without special configuration. What I found with Hyper-V was that the best way to do things was with a domain-connected Windows machine, which spoils the fun when you’re trying to manage VMs from your iPad or iPhone (which I often do with my home lab).
- Configurable via a GUI. This ties in with the above point somewhat, but deserves a special shoutout. Managing VMs via the CLI is not a problem, but host hypervisors are complex systems that are made far easier to manage by a GUI. A good one, at least.
- Popular and supported by the community. The way I fix and learn about 98% of everything software-related is via posts on Reddit and other such Google-fu.
- Backed by good and reputable developers. Think Red Hat behind CentOS, and so on.
- Flexible and user-oriented configuration. Things like resizing disks and changing RAM allocation should be easy.
With that, we set off down the yellow brick road.
VMWare ESXi
This is the big gun. Everyone in the virtualisation space should know about it, and I’ve previously used it for a virtualised setup, but I’m not quite sure how it will work with my current requirements and hardware, since things have changed a fair bit on my side since I last ran it.
- Not super lightweight: Even though this TechTarget article cites ESXi as being lightweight, it doesn’t offer any figures to concretely prove that, so I have to assume it can’t match the likes of Proxmox. That’s not a problem, though.
- It’s Unix-like under the hood: Strictly speaking, ESXi runs VMware’s own VMkernel rather than Linux, but it’s close enough for my purposes. You don’t see much of it, but it’s what’s inside that matters.
- Accessible without special configuration: You can access the GUI via an IP address, which is more than enough for my means. No domain = no pain.
- We have a GUI: I’m in heaven.
- Community support is good: Better yet, it sounds like there’s a lot more out there on ESXi than on Proxmox, which is going to make life so much easier.
- Backed like all hell: VMWare is no slouch, so that’s a big tick there.
- User-oriented configuration: As an example, this article demonstrates that it’s a piece of cake to change the RAM allocated to a VM, which again is enough for me.
What About the Others?
There are plenty of other good virtualisation host options out there, but I’m looking for something to get up and running with quickly, and that’s stable enough for me to trust that it’s not going to arbitrarily stop working when it reaches a certain load. ESXi gives me that, so I’m not going to turn this into a battle of the ages.
Beyond the above requirements, there are some other things that need to be taken into account here. Some are to do specifically with ESXi, and others with my existing setup and migration path.
VMWare licenses ESXi as a paid product with a free tier, which comes with some limitations:
- No official VMware support
- A maximum of 8 vCPUs per VM
- Cannot be managed with vCenter
- No access to the vStorage APIs
None of these are an issue for me, but would obviously be blocking factors for larger organisations with greater requirements.
Actually Downloading ESXi
Essentially every link I follow to download the latest version (7) of ESXi (which they market under the vSphere name) gives me the following error…
```
Content Not Available
Dear user, the web content you have requested is not available.
```
This Reddit post and the ensuing torrent of rage-filled comments demonstrate that VMWare is shockingly bad at maintaining their website and supporting downloads, but I have it on good authority that this doesn’t reflect their product quality. It does, however, mean that I have to get the OS somehow.
Migrating Away From Proxmox
Since I don’t anticipate that any part of moving from Proxmox to VMWare will be a lift-and-shift operation, I need to start backing up my data and announcing outages for my server’s other services as soon as possible. There will likely be a lot of learning and hiccups along the way, but that’s par for the course.
One of my Linux containers needs direct access to the CPU’s integrated graphics to provide its services. The advantage of containers is that they run directly atop the host kernel, meaning less overhead when dealing with multimedia and suchlike. As far as I know, ESXi/vSphere isn’t going to be able to provide an equivalent, so that service is going to have to bite the bullet and accept a little I/O inefficiency in exchange for data integrity and my peace of mind.
With That, We Begin
First, to back up my existing configuration. My Proxmox host and other VMs and containers shouldn’t be an issue to back up, since I know what configuration I want to keep and where it lies. From there, it’s simply a matter of using our good friend tar and copying it over SSH.
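In practice that boils down to something like the following sketch. The directory and host names are made up for illustration; substitute whatever config you actually want to keep.

```shell
# Stand-in for the config directory to back up (illustrative only).
mkdir -p /tmp/demo-config && echo "key=value" > /tmp/demo-config/app.conf

# Create a compressed archive of it.
tar -czf /tmp/config-backup.tar.gz -C /tmp demo-config

# Copying it to another machine is then a one-liner (host is an assumption):
#   scp /tmp/config-backup.tar.gz backup@192.168.1.10:/backups/

# List the archive contents to sanity-check it.
tar -tzf /tmp/config-backup.tar.gz
```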
The problematic machine from my previous post is another story. It works for about 60 seconds before dying due to disk access errors. However, I think these errors are caused or at least provoked by the processes that start with the machine and cause network and disk transfers. If I stop them soon enough and copy configuration as quickly as possible, it shouldn’t be an issue to get the needed files off in good time.
Another point of consideration is that I run a local DNS in a container, which will obviously not work once the host is reinstalled. However, if memory serves me correctly, most of the machines that relied on that host had 126.96.36.199 as a backup, so it shouldn’t be a problem in and of itself.
Hooking It Up
The machine currently runs in a headless setup (with no monitor or keyboard), so I need to set those up first. I also need to flash the vSphere ISO to a USB drive, which I’ll do at the same time.
It looks like the disk access issues extend beyond the VM that first showed them, and now the host is playing up. I receive the following error when traversing one of the directories to be backed up in another VM:
```
ls: reading directory [...]: Bad message
```
Unfortunately, it seems like I’m going to have a hard time deleting that directory, and I don’t particularly fancy doing so anyway. Since I need to take the read error in my stride, I will instruct tar to ignore errors with --ignore-failed-read and hope that works.
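A quick sketch of what that looks like with GNU tar, using made-up paths and an artificially unreadable file to stand in for the corrupted directory:

```shell
# Stand-in directory with one good file and one that cannot be read.
mkdir -p /tmp/flaky && echo ok > /tmp/flaky/good.txt
echo secret > /tmp/flaky/bad.txt && chmod 000 /tmp/flaky/bad.txt

# Without the flag, a file that fails to open makes GNU tar exit non-zero;
# with --ignore-failed-read, tar logs the error and carries on archiving.
tar --ignore-failed-read -czf /tmp/flaky.tar.gz -C /tmp flaky

chmod 644 /tmp/flaky/bad.txt   # restore permissions for later cleanup
```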
I have Proxmox containers which previously needed no SSH access (a plus from a security standpoint), but now that I need to copy the backups off those containers, I have to install and start SSH:
```shell
dnf install -y openssh-server
systemctl start sshd
```
Et voilà! I’m able to connect to the root user with the public key I had previously established and never used, and can now copy the config off the container (and the remaining containers).
Backing Up Docker
One of my virtual machines is dedicated as a Docker host, since running Docker inside containers on Proxmox is frowned upon. Assuming that doesn’t throw a million disk errors my way, I need to back up those containers (and their volumes) to the same SSH destination as the previous content. Thankfully, Docker provide a guide on how to do something along those lines. Then this happened:
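For the volumes, the general pattern (roughly following Docker’s guide) is to mount the volume read-only in a throwaway container and tar its contents out to the host. The volume name (app_data) and paths below are assumptions for illustration:

```shell
# Skip gracefully on machines without a working Docker daemon.
command -v docker >/dev/null 2>&1 || { echo "docker not available; skipping"; exit 0; }
docker info >/dev/null 2>&1 || { echo "docker daemon not running; skipping"; exit 0; }

mkdir -p /tmp/backups

# Mount the named volume read-only at /data and a host directory at /backup,
# then tar the volume contents into the host directory.
docker run --rm \
  -v app_data:/data:ro \
  -v /tmp/backups:/backup \
  busybox tar czf /backup/app_data.tar.gz -C /data .
```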
```
XFS (dm-0): Log I/O error -5
XFS (dm-0): Log I/O Error Detected. Shutting down filesystem.
XFS (dm-0): Please unmount the filesystem and rectify the problem.
```
Not again. I hope this is all down to Proxmox and not an actual disk failure, or this will be very sad. The host disk (of which there is only one) is a 1 TB Samsung 970 EVO NVMe drive that is about three weeks old, so I would be incredibly saddened to hear of its demise.
I suppose I should try repairing the XFS filesystem as in my previous post. As before, it complains about metadata changes which need to be replayed. I’m going to use -L to ignore those and destroy the log. Unlike the previous VM issues, this doesn’t complain during the repair process, so there may be hope yet. Now to continue booting…
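For anyone who hasn’t met this flag before, here’s a safe way to see it in action on a throwaway image file rather than a real device (the real invocation was against the VM’s disk, not an image):

```shell
# Skip gracefully if xfsprogs isn't installed.
command -v mkfs.xfs >/dev/null 2>&1 || { echo "xfsprogs not installed; skipping"; exit 0; }

truncate -s 512M /tmp/scratch.img   # sparse file; XFS needs a few hundred MB
mkfs.xfs -q /tmp/scratch.img

# -L zeroes (destroys) the metadata log, discarding any unreplayed
# transactions -- a last resort for when the log cannot be replayed.
xfs_repair -L /tmp/scratch.img
```

The important caveat, which xfs_repair itself warns about, is that -L can lose the most recent metadata changes, which is exactly why it refuses to run without it when a dirty log is present.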
Odd. Firing off a ^D sent the exit command, but it just hangs there and doesn’t do anything. ssh won’t connect, either. I’ll try resetting it again.
That gave the same disk error as before. I’ll try repairing the disk and then restarting from that same prompt without quitting.
Interestingly, repairing the disk that time didn’t find any issues to fix and finished successfully, but the system still boots back into an error state. I think I’m going to have to consider my data toast at this stage.
Backing Up Windows
I don’t expect this part to be fun at all. Perhaps it won’t be so bad, since the VM wasn’t running when all of the odd issues started happening, but if Windows can be a pain at the best of times, I don’t expect it to let up here.
I run one Windows Server VM for monitoring my surveillance cameras, and need to deactivate it to release the license key before decommissioning the host. It’s definitely slow to load, but looks like it’s doing something beyond wanting to scan the disk or throw a blue screen. CPU usage is a good sign (he says while eagerly awaiting some form of life). After finally getting past Applying Computer Settings for a good 10 minutes, it finally allowed me to log in. Again, very slowly.
While that loads, I’m going to perform a quick disk check on the Proxmox host itself. The NVMe passed SMART and other checks when I last looked, but I’m now suspicious of whether this problem will follow me into my next OS.
```
# smartctl -a /dev/nvme0n1
[...]
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        48 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    14,023,706 [7.18 TB]
Data Units Written:                 16,680,998 [8.54 TB]
Host Read Commands:                 58,828,073
Host Write Commands:                772,236,505
Controller Busy Time:               210
Power Cycles:                       11
Power On Hours:                     158
Unsafe Shutdowns:                   2
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               48 Celsius
Temperature Sensor 2:               47 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
```
Looks fine… but I’ll stay suspicious. For now, Windows is taking too long to boot, so I’m going to try one thing and give up if that doesn’t work promptly. I’ll disable the caching and discard options on the two disks that VM has, and see if that speeds things up. Interestingly, starting the VM gave this error initially, but then worked after trying again:
```
TASK ERROR: timeout waiting on systemd
```
And is now reporting running (io-error) as its status. I don’t know what’s going on, but it’s time to ditch this host OS.
Running ESXi on USB
The one thing I hadn’t taken into account until now is where the OS runs. While I was running Proxmox on the NVMe itself, ESXi can run on a USB flash drive as per this article (which makes so much more sense). I now need to shut the host down and hook up some peripherals, as well as my target USB drive.
Flashing the ISO to USB
Unlike an ordinary distribution, which simply lets you use a tool like Etcher or dd to create your installer USB, it seems as though VMWare have done something a little unorthodox (which seems typical of them, from what I hear). This page led me to using the script from this GitHub repo to create a bootable USB from my iMac.
I also read somewhere (that I now can’t find) that because the installer is loaded into RAM, you can install to the same device that you booted from. This looks to be true so far (even though the installation is taking longer than I expected), and makes sense given that the installer device would otherwise have to be hidden from the list of available targets.
ESXi installation progress
I don’t know if other people share this notion, but by contrast to Proxmox (which retrospectively feels like a high risk/low reward operation), I find some level of excitement and motivation in installing an operating system that has “backing” and community interest. I painfully admit that even Hyper-V Server has that same hype for me, but there is no way in high heaven that I’d consider it for this scenario.
This reminds me that I haven’t mentioned anything about the hardware that I’m running on. I’ll be sure to do that in a future post, since it’s a fairly simple but incredibly neat setup.
This is fun. Having taken somewhere in the order of an hour to get to 90%, the installer failed due to I/O errors. I don’t know what’s going on here, but it’s giving me a bad feeling. Just to make matters worse, the temporary monitor I’ve hooked up is so old and antiquated that it’s starting to press buttons by itself now.
Fortunately, I think the issue was around the disk I was using to install with (and to).
Previously: installing from a mediocre 16 GB USB drive to that same drive. Now: installing from a reasonable 8 GB USB drive to a fast 32 GB USB 3.0 drive.
No disk errors, and it installed in about five minutes. We live, we learn.
Setting Up VMWare
It takes about a minute to load vSphere, which is longer than Proxmox but not a huge issue given that one tends not to restart one’s hypervisor too often. And now we set ourselves a nice static IPv4 address so that we can reference it easily, which will be less of an issue once we get DNS up and running.
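I did this through the DCUI, but for reference, the same static-IP setup can (as far as I can tell from VMware’s documentation) be done from the ESXi shell with esxcli. The interface name and addresses below are assumptions:

```shell
# Only meaningful on an ESXi host; exit quietly anywhere else.
command -v esxcli >/dev/null 2>&1 || { echo "not an ESXi host; skipping"; exit 0; }

# Give the management interface (vmk0 by default) a static address:
esxcli network ip interface ipv4 set -i vmk0 -t static -I 192.168.1.50 -N 255.255.255.0

# Point the default route at the gateway:
esxcli network ip route ipv4 add --gateway 192.168.1.1 --network default
```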
VMWare ESXi dashboard
A Lot Happened at This Point
It was around here that everything started to hit the fan, and I started to fall under time pressure. Trying to create the VM that would serve the most-needed service previously offered by the Proxmox host’s children was my first priority, so I frantically went about that. This is the VM that needed access to multimedia functions (i.e. Intel QuickSync/graphics). I struggled forever trying to pass the Intel graphics PCI device through to the VM, which was quite easy to do through ESXi’s management GUI, but when the VM booted it would only reach an arbitrary point in console output before freezing.
The trick was that it was actually booting, but sending output to the PCI graphics device in place of the rendered screen shown through the management GUI. Setting svga.present to FALSE in the VM’s advanced options removed the VMWare graphics adapter, and multimedia applications have a much easier time picking the Intel graphics device when the VMWare one is gone. The only caveat that I’ve yet to figure out here is how to get the screen showing on VMWare’s console while still only having the Intel display device, but that’s definitely a challenge for another day.
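For anyone hunting down the same fix: as I understand it, the setting lives in the VM’s configuration, either via Edit Settings → VM Options → Advanced → Configuration Parameters in the GUI, or directly in the VM’s .vmx file as:

```
svga.present = "FALSE"
```

Note that .vmx values are quoted strings, so the quotes matter.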
SVGA present option
Performance-wise, I’m quite impressed. I was afraid of moving from CTs in Proxmox to VMs for fear of overhead noticeably impacting performance, but it’s negligible across the applications I’m running so far.
VMware ESXi performance stats
More another time. This has been enough for one day.