Nutanix CE: Cannot Start VM

I’m lucky enough to have a four node Nutanix Community Edition cluster in my lab. I stumbled across a four node Supermicro chassis that had been mislabeled on eBay and got it for a heck of a deal. I had been interested in getting hands on with Nutanix at the time and it worked out very well.

The cluster architecture is as follows:

  • Four dual-processor nodes
  • Two SSDs per node (oplog/hot tier)
  • Four HDDs per node (cold tier)
  • One USB key per node (boot)
  • One 10Gb and two 1Gb network interfaces

This was working very well for a couple of months, until all of a sudden I ran into an issue where I was not able to power on a newly created VM. It errored out with a gigantic stack trace; near the bottom of it I could see references to ovs/openvswitch and the following message:

libvirtError: Cannot create directory '/var/lib/libvirt/qemu/domain-1cb62c34-f37f-3a98-d3aa-7d45acc39b88': No space left on device

At this point I checked the rest of my lab – none of the VMs were complaining and health checks were green on every node, yet trying to start this VM still errored out. To see if it was tied to a specific node, I went to the affinity settings, pinned the VM to a node that wasn’t the Prism leader, and found it started just fine. I then deleted the virtual NIC and was able to start it on the original host, but if I re-added the virtual NIC it would again refuse to start.

Armed with this information, I had a reasonable suspicion that something was wrong with the actual AHV host. I connected to the AHV host that was the Prism leader and did a “df -h” to see if anything jumped out, and something surely did (line 6):

Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         36G     0   36G   0% /dev
tmpfs            36G  272K   36G   1% /dev/shm
tmpfs            36G  3.5G   32G  10% /run
tmpfs            36G     0   36G   0% /sys/fs/cgroup
/dev/sdc1       6.8G  6.8G     0 100% /
tmpfs           7.1G     0  7.1G   0% /run/user/0

Nothing good ever came from a completely filled root partition. Checking the rest of the nodes, I saw their root partitions were all rapidly approaching full as well. But why were they only 6.8GB? Even on the USB key there should be 32GB of capacity. Surely the installer didn’t just create an 8GB partition? Let’s investigate using “fdisk -l /dev/sdc” on another host in the cluster.

Disk /dev/sdc: 32.1 GB, 32080200192 bytes, 62656641 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0xc2ed4a55

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1   *        2048    14540799     7269376   83  Linux

Welp. It sure did. Now what do we do? Well, luckily someone had a similar issue and was gracious enough to post their procedure on the Nutanix CE forum. At a high level, the procedure is as follows:

  • Place host into maintenance mode
  • Shut down CVM on affected node
  • Disable autostart of CVM on affected node
  • Delete old partition on USB boot disk
  • Create new (larger) partition on USB boot disk
  • Reboot affected node
  • Check filesystem size again, verify the filesystem still needs to be expanded
  • Expand the filesystem
  • Verify the expanded filesystem size
  • Re-enable autostart for CVM
  • Start CVM on affected node
  • Verify CVM boots up and is healthy
  • Remove host from maintenance mode

Now this is scary at first, since we’re deleting a partition that holds data we definitely don’t want to lose. The key thing to keep in mind is that we’re only altering the partition table – the underlying filesystem isn’t being touched at all. It’s a little weird to wrap your head around, but I promise it actually makes sense.
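If you want to convince yourself before touching real hardware, here’s a minimal sketch on a throwaway loop device. Everything in it is hypothetical (the image path, the sizes, the /dev/loop0 name), and the point is simply that recreating a partition at the same start sector leaves the filesystem intact:

truncate -s 1G /tmp/demo.img
losetup -fP --show /tmp/demo.img                # prints the device, e.g. /dev/loop0
parted -s /dev/loop0 mklabel msdos mkpart primary 2048s 50%
partprobe /dev/loop0
mkfs.ext4 -q /dev/loop0p1
parted -s /dev/loop0 rm 1                       # the "scary" delete
parted -s /dev/loop0 mkpart primary 2048s 100%  # recreate: same start, bigger end
partprobe /dev/loop0
e2fsck -f /dev/loop0p1                          # comes back clean - data untouched
resize2fs /dev/loop0p1                          # grow the filesystem into the new space
losetup -d /dev/loop0

The one thing that matters is that the new partition starts at exactly the same sector as the old one. If the start sector moves, the filesystem superblock is no longer where the kernel expects it.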

Before We Begin

This procedure involves taking a node offline. Due to how Nutanix clusters operate, if this process takes longer than 30 minutes the affected node will be automatically detached from the metadata store because its data will have become stale. This should not have any effect on VMs running on other hosts in the cluster, assuming everything else is healthy. That being said, it’s critical to make sure the cluster is otherwise healthy before we start maintenance. If you go through this process and find that the node has been automatically detached from the metadata store, recovery is simple: once maintenance is complete, navigate to the Hardware page in Prism, click on the affected host and click “Enable Metadata Store.”
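A quick pre-flight sweep from any CVM might look like the sketch below. “cluster status” shows up again later in this post; “ncc health_checks run_all” is the standard NCC invocation, though the exact set of checks varies by release:

cluster status              # every CVM should report Up
ncc health_checks run_all   # expect PASS, or INFO/WARNs you can explain

It’s also worth eyeballing the Data Resiliency widget on the Prism home page before taking anything down.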

It is also important to make sure that the rest of your cluster has enough spare resources to handle the resource requirements of VMs that are currently running on the affected node you’re about to work on. I prefer to migrate VMs manually through Prism before beginning maintenance. It also gives you the opportunity to shut down anything non-essential that may be running instead of shuffling it around the cluster.

Last but not least: you’re doing this all at your own risk. This post pertains to the Community Edition version of Nutanix which means that it’s not meant for production use and has no official support, just the community forum. While I would expect this procedure to work just fine on a production Nutanix cluster, if you’re experiencing this issue on a production/retail Nutanix cluster then you should absolutely stop reading here and call support instead.

What went wrong?

It turns out that during installation of the cluster I used the USB image, as directed by the Nutanix CE installation guide. The problem is that this image doesn’t do any repartitioning of the USB key like the ISO installer does – it simply drops an 8GB partition on there and calls it good. If you used the USB installer method, this is almost certainly the case for your installation as well, and the longer your cluster stays in service, the more likely you are to run into this same issue.
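To see how widespread the problem is on your cluster before you start, you can check every host at once from a CVM. A minimal sketch – “hostssh” is the CVM helper that fans a command out to each AHV host, and note the boot device won’t necessarily be /dev/sdc on your hardware:

hostssh "df -h /"

Any host showing a ~7GB root filesystem at high utilization is affected.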

Let’s fix it.

Log on to an affected node via SSH. If you’ve got this problem across all of your nodes, I recommend leaving the current Prism leader for last to avoid migrating the roles multiple times.
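Not sure which node currently holds the Prism leader role? One commonly cited check, run from any CVM, is below – treat the port and endpoint as assumptions, since they can vary between releases:

curl -s http://localhost:2019/prism/leader && echo

The response names the leader CVM’s IP address.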

Run “acli” and then do a “host.list” to get the list of hosts in the cluster.

<acropolis> host.list
Hypervisor address  Host UUID                             Schedulable  Hypervisor Type  Hypervisor Name
192.168.1.10         11111111-1111-1111-1111-111111111111  True         kKvm             AHV
192.168.1.11         22222222-2222-2222-2222-222222222222  True         kKvm             AHV
192.168.1.12         33333333-3333-3333-3333-333333333333  True         kKvm             AHV
192.168.1.13         44444444-4444-4444-4444-444444444444  True         kKvm             AHV
<acropolis>

Now run “host.enter_maintenance_mode x.x.x.x” where x.x.x.x is the IP of the AHV host we are working on. This will take some time if there are VMs currently running on this host. You can speed this up by migrating VMs in Prism manually (or shutting them down) before entering maintenance mode.

Note: If your cluster doesn’t have enough spare resources to migrate the running VMs to different hosts, the operation will fail and you’ll receive an error like this:

Error entering maintenance mode: too much memory in use: No host has enough available memory. Maximum allowable VM size is approximately 6901 MB

If that happens, go back to Prism and shut some VMs down so you’ve got room to breathe, then come back and enable maintenance mode again.

Once “pending” changes to “complete,” run “host.list” again to verify that the node’s “Schedulable” property has changed from “True” to “False.” If so, exit out of acli to get back to a shell.

<acropolis> host.enter_maintenance_mode 192.168.1.12
EnterMaintenanceMode: pending
EnterMaintenanceMode: complete
<acropolis>
<acropolis> host.list
Hypervisor address  Host UUID                             Schedulable  Hypervisor Type  Hypervisor Name
192.168.1.10         11111111-1111-1111-1111-111111111111  True         kKvm             AHV
192.168.1.11         22222222-2222-2222-2222-222222222222  True         kKvm             AHV
192.168.1.12         33333333-3333-3333-3333-333333333333  False        kKvm             AHV
192.168.1.13         44444444-4444-4444-4444-444444444444  True         kKvm             AHV
<acropolis>
<acropolis> exit
[root@NTNX-333333333-A ~]#

Now run “virsh list” to ensure that we have nothing but the CVM running on this node.

[root@NTNX-333333333-A ~]# virsh list
 Id    Name                           State
----------------------------------------------------
 1     NTNX-333333333-A-CVM            running

[root@NTNX-333333333-A ~]#

Now we have to shut down that CVM – either SSH into it and use “sudo shutdown -h now”, or shut it down from Prism. I prefer using SSH when interacting with the platform directly. Once the CVM is shut down, verify that no VMs are running.
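If you’d rather not look up the CVM’s external IP, you can reach the local CVM from the AHV host itself over the internal link. A minimal sketch – 192.168.5.254 is the conventional internal CVM address on AHV, but treat it as an assumption and confirm it on your build:

ssh nutanix@192.168.5.254 "sudo shutdown -h now"   # run on the affected AHV host

Either way, once the CVM is down, the “virsh list” check below should come back empty.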

[root@NTNX-333333333-A ~]# virsh list
 Id    Name                           State
----------------------------------------------------

[root@NTNX-333333333-A ~]#

Because we’ll be rebooting during this process, we don’t want the CVM to spin back up and start communicating with the cluster before we’re ready. To ensure this doesn’t happen, let’s disable autostart on the CVM.

[root@NTNX-333333333-A ~]# virsh autostart NTNX-333333333-A-CVM --disable
Domain NTNX-333333333-A-CVM unmarked as autostarted

[root@NTNX-333333333-A ~]#

Perfect. Now we’re ready to get down to brass tacks. We’re going to be doing some seemingly redundant checking during this process, but that’s because I’m a firm believer in “measure twice, cut once.” To that end, let’s check the partitions again to make sure we’ve got the right target in mind.

[root@NTNX-333333333-A ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         32G     0   32G   0% /dev
tmpfs            32G  272K   32G   1% /dev/shm
tmpfs            32G  3.2G   29G  11% /run
tmpfs            32G     0   32G   0% /sys/fs/cgroup
/dev/sdc1       6.8G  5.7G  750M  89% /
tmpfs           6.3G     0  6.3G   0% /run/user/0
[root@NTNX-333333333-A ~]#

OK – /dev/sdc is our culprit, and the partition in question is /dev/sdc1. Time to rewrite the partition table. Launch fdisk against the disk (not the partition!) and use “p” to view the existing partition table.

[root@NTNX-333333333-A ~]# fdisk /dev/sdc
Welcome to fdisk (util-linux 2.23.2).

Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.


Command (m for help): p

Disk /dev/sdc: 32.1 GB, 32080200192 bytes, 62656641 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0xc2ed4a55

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1   *        2048    14540799     7269376   83  Linux

Command (m for help):

We have one partition, currently marked as bootable. We will shortly delete it, create a new partition that fills the disk, and mark that one as bootable (active). Before we do, note the current size: the partition is 7269376 blocks long, and this fdisk counts a block as 1024 bytes, which works out to 7099 MiB, or about 6.9 GiB – right in line with the 6.8G that df reports.
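If you want to sanity-check that conversion (and preview the size of the partition we’re about to create), a quick awk one-liner does it – nothing here is Nutanix-specific:

awk 'BEGIN { b = 7269376;  printf "%d MiB / %.1f GiB\n", b/1024, b/1024/1024 }'   # old: 7099 MiB / 6.9 GiB
awk 'BEGIN { b = 31327296; printf "%.1f GiB\n", b/1024/1024 }'                    # new: 29.9 GiB

Now, on to the delete/recreate.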

Command (m for help): d
Selected partition 1
Partition 1 is deleted

Command (m for help): n
Partition type:
   p   primary (0 primary, 0 extended, 4 free)
   e   extended
Select (default p):
Using default response p
Partition number (1-4, default 1):
First sector (2048-62656640, default 2048):
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-62656640, default 62656640):
Using default value 62656640
Partition 1 of type Linux and of size 29.9 GiB is set

Command (m for help):

Command (m for help): a
Selected partition 1

Command (m for help):
Command (m for help): p

Disk /dev/sdc: 32.1 GB, 32080200192 bytes, 62656641 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0xc2ed4a55

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1   *        2048    62656640    31327296+  83  Linux

Command (m for help):

We’ve now replaced the old partition with one that fills the disk. None of these changes have been written to disk yet – to do that, we need to issue the “w” command.

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.

WARNING: Re-reading the partition table failed with error 16: Device or resource busy.
The kernel still uses the old table. The new table will be used at
the next reboot or after you run partprobe(8) or kpartx(8)
Syncing disks.
[root@NTNX-333333333-A ~]#

The partition table has now been written to the disk. That warning is telling you that the disk is currently in use, so the new partition table won’t be used until the machine is rebooted. Let’s do that now.

[root@NTNX-333333333-A ~]# sudo reboot
PolicyKit daemon disconnected from the bus.
We are no longer a registered authentication agent.

When the host starts back up, we will check the filesystem usage again. We are mainly ensuring that the device identifier didn’t change before we extend the filesystem.

[root@NTNX-333333333-A ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         32G     0   32G   0% /dev
tmpfs            32G  268K   32G   1% /dev/shm
tmpfs            32G  9.0M   32G   1% /run
tmpfs            32G     0   32G   0% /sys/fs/cgroup
/dev/sdc1       6.8G  5.7G  757M  89% /
tmpfs           6.3G     0  6.3G   0% /run/user/0
[root@NTNX-333333333-A ~]#

Still has the same identifier, still shows the old size. Remember, we resized the partition but not the filesystem. Let’s extend the filesystem now.

[root@NTNX-333333333-A ~]# resize2fs /dev/sdc1
resize2fs 1.42.9 (28-Dec-2013)
Filesystem at /dev/sdc1 is mounted on /; on-line resizing required
old_desc_blocks = 1, new_desc_blocks = 4
The filesystem on /dev/sdc1 is now 7831824 blocks long.

[root@NTNX-333333333-A ~]#

That was easy. Let’s check usage now.

[root@NTNX-333333333-A ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         32G     0   32G   0% /dev
tmpfs            32G  268K   32G   1% /dev/shm
tmpfs            32G  9.0M   32G   1% /run
tmpfs            32G     0   32G   0% /sys/fs/cgroup
/dev/sdc1        30G  5.7G   23G  21% /
tmpfs           6.3G     0  6.3G   0% /run/user/0
[root@NTNX-333333333-A ~]#

Much better! Technically speaking you should be good to go right now, but out of an abundance of caution I like to reboot again here just to make sure there’s no weirdness and the node still boots cleanly from the new partition table.

[root@NTNX-333333333-A ~]# sudo reboot
PolicyKit daemon disconnected from the bus.
We are no longer a registered authentication agent.

When it comes back up, check usage one last time.

[root@NTNX-333333333-A ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         32G     0   32G   0% /dev
tmpfs            32G  268K   32G   1% /dev/shm
tmpfs            32G  9.0M   32G   1% /run
tmpfs            32G     0   32G   0% /sys/fs/cgroup
/dev/sdc1        30G  5.7G   23G  21% /
tmpfs           6.3G     0  6.3G   0% /run/user/0
[root@NTNX-333333333-A ~]#

Fantastic. Time to start bringing services back online. First we re-enable autostart for the CVM.

[root@NTNX-333333333-A ~]# virsh autostart NTNX-333333333-A-CVM
Domain NTNX-333333333-A-CVM marked as autostarted

[root@NTNX-333333333-A ~]#

Next we manually start the CVM.

[root@NTNX-333333333-A ~]# virsh start NTNX-333333333-A-CVM
Domain NTNX-333333333-A-CVM started

[root@NTNX-333333333-A ~]#

Wait a bit then use “virsh list” to verify the CVM is running.

[root@NTNX-333333333-A ~]# virsh list
 Id    Name                           State
----------------------------------------------------
 1     NTNX-333333333-A-CVM            running

[root@NTNX-333333333-A ~]#

The CVM will take a few minutes to come fully online. Usually it’s easiest to watch the Hardware page in Prism – when the CVM is offline you’ll just see the AHV IP address in the Host Name field with no other details, but when it’s back up you’ll see the actual hostname and the rest of the information populated. You can also use “cluster status” to verify a CVM is online: SSH into another CVM and run the following, looking for “CVM: x.x.x.x Up”.

nutanix@NTNX-333333333-A-CVM:192.168.1.22:~$ cluster status | grep 192.168.1.22
2019-05-19 13:01:00 INFO zookeeper_session.py:110 cluster is attempting to connect to Zookeeper
2019-05-19 13:01:01 INFO cluster:2484 Executing action status on SVMs 192.168.1.21,192.168.1.22,192.168.1.23,192.168.1.20
        CVM: 192.168.1.22 Up
2019-05-19 13:01:05 INFO cluster:2597 Success!
nutanix@NTNX-333333333-A-CVM:192.168.1.22:~$

If your host has been offline longer than 30 minutes and been automatically evicted from the metadata ring, go into the Hardware tab in Prism, select the affected host and click “Enable Metadata Store.”

Once the CVM is online, let’s take the host back out of maintenance mode.

nutanix@NTNX-444444444-A-CVM:192.168.1.23:~$ acli
<acropolis> host.list
Hypervisor address  Host UUID                             Schedulable  Hypervisor Type  Hypervisor Name
192.168.1.10         11111111-1111-1111-1111-111111111111  True         kKvm             AHV
192.168.1.11         22222222-2222-2222-2222-222222222222  True         kKvm             AHV
192.168.1.12         33333333-3333-3333-3333-333333333333  False        kKvm             AHV
192.168.1.13         44444444-4444-4444-4444-444444444444  True         kKvm             AHV
<acropolis>
<acropolis> host.exit_maintenance_mode 192.168.1.12
<acropolis> host.list
Hypervisor address  Host UUID                             Schedulable  Hypervisor Type  Hypervisor Name
192.168.1.10         11111111-1111-1111-1111-111111111111  True         kKvm             AHV
192.168.1.11         22222222-2222-2222-2222-222222222222  True         kKvm             AHV
192.168.1.12         33333333-3333-3333-3333-333333333333  True         kKvm             AHV
192.168.1.13         44444444-4444-4444-4444-444444444444  True         kKvm             AHV
<acropolis>

This node is now healthy. Monitor any tasks in Prism to make sure that the cluster health is green and data resiliency is available, then you can move on to the rest of the nodes in your cluster one at a time. Enjoy!
