Containers on a ZFS filesystem

madpenguin · 16 October 2023 16:35

What and Why

There are many ways to host data for containers. Earlier I covered containers using LVM, where each container sits on a separate thinly provisioned Logical Volume managed by LVM. In this instance I’m going to try to do the same, but with each container sitting on a volume on a ZFS pool.

Operationally this should be relatively transparent, but under the hood it does expose some more interesting options, specifically compression and encryption. Typically you would do this on a real machine; however, for ease of documentation I’m running inside a KVM based Virtual Machine.

To Start

I have a machine installed with Ubuntu Server 23.04. It has 200G of storage, partitioned with a 25G root partition and two empty partitions of 25G and 150G. The larger partition will be a storage pool for container instances, and the smaller will be a compressed and encrypted pool for private data. So, the following will all be done as the root user;

$ fdisk -l /dev/vda
Device         Start       End   Sectors  Size Type
/dev/vda1       2048      4095      2048    1M BIOS boot
/dev/vda2       4096  52432895  52428800   25G Linux filesystem
/dev/vda3   52432896 104861695  52428800   25G Linux filesystem
/dev/vda4  104861696 419428351 314566656  150G Linux filesystem

The first thing we need to do is install ZFS which includes the kernel modules, tools, and associated libraries.

$ apt install zfs-dkms

This should install ZFS and its associated dependencies, of which there will be quite a few. It needs to compile and generate kernel modules, so it may take a few minutes to complete.

Creating the ZFS Pools

Next we need to create the desired pools. The container pool is relatively straightforward, all we need is;

$ zpool create -m legacy default /dev/vda4
$ zpool list
NAME      SIZE  ALLOC   FREE  FRAG   CAP  DEDUP    HEALTH
default   149G   100K   149G  0%     0%   1.00x    ONLINE
$ zfs list
NAME      USED  AVAIL     REFER  MOUNTPOINT
default   118K   144G       24K  legacy

It gets a little more interesting as we add a pool with compression and encryption;

$ zpool create -O encryption=on -O keyformat=passphrase \
               -O keylocation=prompt -o compatibility=off \
               -o feature@encryption=enabled -m legacy \
               private /dev/vda3
#
# It should then prompt for your passphrase, this should be
# secure and at least 14 characters.
#
Enter new passphrase: ....
Re-enter new passphrase: ....

$ zfs list
NAME      USED  AVAIL     REFER  MOUNTPOINT
default   118K   144G       24K  legacy
private   198K  23.7G       98K  legacy
#
# Now turn on compression for the pool called "private"
#
$ zfs set compression=gzip private

And we should be done. The encryption state of private persists for as long as the machine is up: once the pool has been unlocked, it stays unlocked until it is either explicitly locked or the machine is shut down.
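A quick way to confirm the settings took effect is to read the properties back. What follows is only a sketch: check_private is a hypothetical helper of my own (not part of ZFS) that reads the tab-separated output of "zfs get -H -o property,value" and complains if encryption or compression is reported as off, or if the key is not loaded.

```shell
#!/bin/sh
# Hypothetical helper: reads "property<TAB>value" lines on stdin, as printed
# by `zfs get -H -o property,value ...`, and reports anything that looks wrong.
# Exits non-zero if at least one problem was found.
check_private() {
    awk -F '\t' '
        BEGIN { bad = 0 }
        ($1 == "encryption" || $1 == "compression") && $2 == "off" {
            print $1 " is off"; bad = 1
        }
        $1 == "keystatus" && $2 != "available" {
            print "key is " $2; bad = 1
        }
        END { exit bad }
    '
}

# Real usage needs ZFS installed, so it is left commented out here:
#   zfs get -H -o property,value encryption,keystatus,compression private | check_private
```

With encryption=on the encryption property reports the cipher in use rather than the word "on", which is why the helper only looks for "off".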

Testing ZFS compression

Just to make sure we’re actually getting some compression on the private pool, I’m going to try a simple test using real data. (Just copying a file full of zeros won’t show much, because once compression is enabled ZFS stores runs of zeros as holes rather than data. Note also that de-duplication is a separate property, dedup, which is off unless you explicitly enable it, hence the DEDUP ratio of 1.00x reported by zpool list above.)

$ tar cf archive.tar /usr
$ ls -lh
total 2.6G
-rw-r--r-- 1 root root 2.6G Oct 16 12:16 archive.tar
#
# If we create a temporary volume in "private"
#
$ zfs create -o mountpoint=/mnt/tmp private/tmp
#
# Then move our test archive onto it ..
#
$ mv archive.tar /mnt/tmp
$ zfs list
NAME          USED  AVAIL     REFER  MOUNTPOINT
default       118K   144G       24K  legacy
private       975M  22.8G       98K  legacy
private/tmp   975M  22.8G    =>975M  /mnt/tmp

So the archive is 2.6G when stored on the normal ext4 root filesystem, but when moved to the encrypted / compressed filesystem on the private ZFS pool it only consumes 975M, which seems pretty reasonable. Just to confirm how much of this is down to compression, I’ll move it over to the uncompressed default pool;

$ zfs create -o mountpoint=/mnt/tmp2 default/tmp
$ mv /mnt/tmp/archive.tar /mnt/tmp2
$ zfs list
NAME          USED  AVAIL     REFER  MOUNTPOINT
default      2.49G   142G       24K  legacy
default/tmp  2.49G   142G   =>2.49G  /mnt/tmp2
private       442K  23.7G       98K  legacy
private/tmp    98K  23.7G       98K  /mnt/tmp

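So although the archive is slightly smaller even without compression (2.49G vs 2.6G), the majority of the saving comes from the compression we applied to the private pool. (I’d put the small difference down to how the two filesystems account for space rather than de-duplication, since the dedup property is off unless explicitly enabled.) We could apply compression to default, however this would slow our containers down somewhat and in this instance I’m not too worried about storage space.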

Note compression is typically applied to either an entire pool, or a volume, whereas encryption is typically enabled for an entire pool.
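To put a number on the saving, the two zfs list figures above can be turned into a ratio. This is a minimal sketch using my own hypothetical helpers, compress_ratio and space_saved; the inputs are the 2.49G (uncompressed, on default) and 975M (compressed, on private) readings, both expressed in MiB.

```shell
#!/bin/sh
# Hypothetical helper: compression ratio from the uncompressed and compressed
# sizes, both given in the same unit.
compress_ratio() {
    awk -v logical="$1" -v stored="$2" 'BEGIN { printf "%.2f\n", logical / stored }'
}

# Hypothetical helper: percentage of space saved by compression.
space_saved() {
    awk -v logical="$1" -v stored="$2" 'BEGIN { printf "%.0f%%\n", (1 - stored / logical) * 100 }'
}

# 2.49G * 1024 = 2549.76 MiB uncompressed, 975 MiB compressed.
compress_ratio 2549.76 975   # prints 2.62
space_saved 2549.76 975      # prints 62%
```

ZFS also tracks this itself, so on a live system the compressratio property should give a comparable figure without any arithmetic;

$ zfs get compressratio private/tmp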

Accessing our encrypted data

If you now reboot your machine, when it comes back up you should see that it has auto-mounted /mnt/tmp2, which is a filesystem we created on the default pool, but not the /mnt/tmp we created on the private pool. (this is because the pool is locked by default)

To get access to the private pool we can do;

$ zfs load-key -a
Enter passphrase for 'private': ...
1 / 1 key(s) successfully loaded
$ zfs mount -a

Now if you take a look at df you should see it has unlocked the private pool and automatically mounted the /mnt/tmp volume.
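If you do this after every reboot, the two commands can be wrapped up so they only run when the key actually needs loading. The sketch below is an assumption on my part rather than anything ZFS provides: needs_unlock and unlock_pool are hypothetical names, and the only ZFS calls used are "zfs get", "zfs load-key" and "zfs mount".

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the given keystatus value means the
# key still has to be loaded.
needs_unlock() {
    [ "$1" = "unavailable" ]
}

# Hypothetical wrapper around load-key and mount; "$1" is the pool name.
unlock_pool() {
    status="$(zfs get -H -o value keystatus "$1")" || return 1
    if needs_unlock "$status"; then
        # Prompts for the passphrase, then mounts whatever became available.
        zfs load-key "$1" && zfs mount -a
    else
        echo "$1: nothing to unlock (keystatus: $status)"
    fi
}
```

Once the functions are loaded into a root shell, usage would be;

$ unlock_pool private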

Adding containers into the mix

Adding containers was covered in an earlier article, but to go over it again in a little less detail;

$ snap install lxd
$ lxd init
Would you like to use LXD clustering? (yes/no) no 
Do you want to configure a new storage pool? yes 
Name of the new storage pool: default
Name of the storage backend to use: zfs
Create a new ZFS pool? (yes/no): no
Name of the existing ZFS pool or dataset: default
Would you like to connect to a MAAS server? no
Would you like to create a new local network bridge? yes
What should the new bridge be called? lxdbr0 
What IPv4 address should be used? auto
What IPv6 address should be used? auto
Would you like the LXD server available over the network? yes
Address to bind LXD to (not including port): all
Port to bind LXD to: 8443
Would you like cached images to be updated automatically? yes
Would you like a YAML "lxd init" preseed to be printed? no
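The same answers can, in principle, be supplied non-interactively as a preseed. The following is a sketch of what I would expect that preseed to look like for the answers above; I have not run this exact file, the lxd_preseed function name is my own, and the profile section is an assumption about what the interactive init would have generated.

```shell
#!/bin/sh
# Hypothetical helper: prints a preseed intended to match the interactive
# answers above (existing ZFS pool "default", bridge lxdbr0, port 8443).
lxd_preseed() {
    cat <<'EOF'
config:
  core.https_address: '[::]:8443'
networks:
- name: lxdbr0
  type: bridge
  config:
    ipv4.address: auto
    ipv6.address: auto
storage_pools:
- name: default
  driver: zfs
  config:
    source: default
profiles:
- name: default
  devices:
    root:
      path: /
      pool: default
      type: disk
    eth0:
      name: eth0
      network: lxdbr0
      type: nic
EOF
}

# Applying it reconfigures LXD, so it is left commented out here:
#   lxd_preseed | lxd init --preseed
```

Answering yes to the final "lxd init" question prints the preseed LXD itself would use, which is the safer reference if the two differ.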

Now enable the LXD user interface;

$ snap set lxd ui.enable=true
$ systemctl reload snap.lxd.daemon

If you point your browser at the machine’s port 8443 (or from the machine itself; https://localhost:8443) and follow the instructions, you should be able to install the appropriate client certificates to get the GUI working.

Note I have found the process of installing client certificates in the browser for LXD to, on occasion, be problematic. If you end up with a strange text (JSON) response instead of a web page, you might like to try the following fix which has worked for me;

$ mkdir lxd-api-access-cert-key-files
$ cd lxd-api-access-cert-key-files
$ openssl genrsa -out lxd-webui.key 4096
$ openssl req -new -key lxd-webui.key -out lxd-webui.csr
$ openssl x509 -req -days 3650 -in lxd-webui.csr -signkey lxd-webui.key -out lxd-webui.crt
$ openssl pkcs12 -keypbe PBE-SHA1-3DES -certpbe PBE-SHA1-3DES -export -in lxd-webui.crt -inkey lxd-webui.key -out lxd-webui.pfx -name "LXD WebUI"
$ lxc config trust add lxd-webui.crt
# Now download the lxd-webui.pfx file to your local machine.
# Import the file to the browser.

If the issue still persists (note, this will destroy any containers you’ve created), try;

$ snap remove --purge lxd
$ snap install lxd
# At this point you will need to remove all the ZFS volumes
# from "default" because "init" will try to recreate them
$ lxd init

Creating a ZFS based container

So assuming we now have a working UI on https://localhost:8443, we should be seeing something like this;

So, if we click on create instance …

And follow the yellow brick road, we end up with a running container.

Then if you click on storage it should show you your (ZFS) storage pool with details of space used and space remaining;

So our first container has consumed a total of 659MB of storage, which is probably what you might expect for a basic server installation. On further inspection however, what it’s actually done is to create an immutable base image for the version of Linux you’ve selected, and then a second (copy-on-write) volume containing differences to the base image.

$ zfs list -r default/images
NAME                           USED  AVAIL     REFER  MOUNT
default/images                 617M   144G       24K  legacy
default/images/5c0f660608...   617M   144G      617M  legacy

$ zfs list -r default/containers
NAME                      USED  AVAIL     REFER  MOUNTPOINT
default/containers       21.3M   144G       24K  legacy
default/containers/zfs1  10.6M   144G      622M  legacy

So the base image is consuming 617M, but the container itself is only using 10.6M. The useful thing to note is that if we create a second container using the same version of Linux, it can use the same base image. So whereas the first container consumes 617M + 10.6M of space, the second (and subsequent) will only consume 10.6M each, which makes them incredibly space efficient, even before you start to look at de-duplication or compression. Just to prove the point, if I create a second instance;

Then go back and look at storage consumption in the default pool;

$ zfs list -r default/images
NAME                           USED  AVAIL     REFER  MOUNT
default/images                 617M   144G       24K  legacy
default/images/5c0f660608...   617M   144G      617M  legacy

$ zfs list -r default/containers
NAME                      USED  AVAIL     REFER  MOUNTPOINT
default/containers       21.3M   144G       24K  legacy
default/containers/zfs1  10.6M   144G      622M  legacy
default/containers/zfs2  10.6M   144G      622M  legacy
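To see how this scales, the figures above can be plugged into a rough estimate. This is a sketch only: shared_usage and full_copy_usage are hypothetical helpers of my own, and they assume every container keeps the same 10.6M delta, which in practice will grow as each container diverges from the base image.

```shell
#!/bin/sh
# Hypothetical helper: estimated usage (in M) for N containers sharing one
# base image. $1 = base image size, $2 = per-container delta, $3 = N.
shared_usage() {
    awk -v base="$1" -v delta="$2" -v n="$3" 'BEGIN { printf "%.1f\n", base + delta * n }'
}

# Hypothetical helper: the same N containers if each one were a full,
# independent copy of the image plus its delta.
full_copy_usage() {
    awk -v base="$1" -v delta="$2" -v n="$3" 'BEGIN { printf "%.1f\n", (base + delta) * n }'
}

# Two containers, using the 617M image and 10.6M delta from the listings.
shared_usage 617 10.6 2      # prints 638.2
full_copy_usage 617 10.6 2   # prints 1255.2
```

To check which image a container is actually built on, the origin property should name the snapshot it was cloned from;

$ zfs get origin default/containers/zfs1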

Summary

An alternative pool based storage system for LXD based containers.

One or more host managed storage pools
Access to ZFS options such as compression, encryption, RAID etc
De-duplication available as an option (off by default)
Lazy space allocation / re-allocation
Easy access to snapshots for backing up individual containers
Fully integrated into LXD’s infrastructure and UI