What's going on here?

One of the bottlenecks of today's computers is storage. While CPUs, buses and other components reach throughputs of several GiB/s, disks are really slow in comparison. HDDs give a few hundred MiB/s at most when performing sequential reads/writes and much less when doing random I/O operations. SSDs are much faster than HDDs, especially at random I/O, but they are also much more expensive and thus not so great for big amounts of data. As usual in today's world, the key word for a win-win solution is "hybrid" – in this case a combination of HDD and SSD (or just their technologies in a single piece of hardware), using a lot of HDD-based space together with a small SSD-based space as a cache that provides fast access to the (typically) most frequently used data. There are many hardware solutions that provide such hybrid disks, but they have the same drawbacks as hardware RAIDs – they are not at all flexible and are really good only for a particular use case. And just as software RAID is the answer to those limitations of hardware RAID, software comes into the hybrid-disk game (to win it, maybe?) with multiple approaches. The two most widely used and probably also most advanced are bcache and LVM cache (or dm-cache, as explained below). So what are these two and how do they differ? Let's focus on each separately and then compare them a bit.

bcache

What is it?

bcache, or Block (level) cache, is a software cache technology developed and maintained as part of the Linux kernel codebase which, as its name suggests, provides cache functionality on top of an arbitrary (pair of) block devices. As with any other cache technology, bcache needs some backing space (holding the data that should be cached), typically on a slow device, and some cache space, typically on a fast device. Since bcache is a block-level cache, both the backing space and the cache space can be arbitrary block devices – i.e. disks, partitions, iSCSI LUNs, MD RAID devices, etc.

Deployment options

The simplest solution is to take an HDD (let's say /dev/sda) together with an SSD (let's say /dev/sdb), create a bcache on top of them (as described below) and then partition the bcache device for the system. A bit more complicated solution is to create partitions on the HDD and SSD and create one or more bcache devices on top of the partitions that should be cached. Why should one even think about the more complicated solution? It provides much better flexibility. Creating bcache on top of the whole HDD and SSD devices gives us basically the same as a hybrid disk, except that we need two SATA ports and can choose from more HDD and SSD sizes. Creating bcache(s) on top of partitions, on the other hand, allows us e.g. to have some data (e.g. system data) directly on the SSD and some other data in a bcache (HDD+SSD), or even to have multiple bcache devices with different backing space and cache space sizes or even caching policies (see below for details).
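
For illustration, here is a minimal sketch of the simplest (whole-device) variant, assuming an empty /dev/sda and /dev/sdb – make-bcache can take the backing and cache devices in a single invocation, which also attaches them right away:

# make-bcache -B /dev/sda -C /dev/sdb

The partition-based variants described below use the same tool, just pointed at individual partitions.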

Setting up

So, let's say we have an HDD (/dev/sda) and an SSD (/dev/sdb) and we have some partitions created on them – let's say /dev/sda1 spanning the whole HDD (to be used for /mnt/data) and, on the SSD, /dev/sdb1 used for the system (/) plus /dev/sdb2 dedicated to the cache.
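
If the disks were still blank, such a layout could be created for example with parted (just a rough sketch – the 40 GiB system/cache split of the SSD is an arbitrary example):

# parted -s /dev/sda mklabel gpt mkpart data 1MiB 100%
# parted -s /dev/sdb mklabel gpt mkpart system 1MiB 40GiB mkpart cache 40GiB 100%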

First of all we need to install the tools that will allow us to create, configure and monitor the bcache. These are typically a part of a package called bcache-tools or similar. So on my Fedora 21 system, I need to run the following command to get it (# means it should be run as root):

# dnf install bcache-tools

Another tool we will need is the wipefs tool which is part of the util-linux package that should already be installed in the system.

With all the necessary tools available, we can now proceed to the bcache creation. But before we start creating something new we need to first wipe all old weird things from the block devices (in our case partitions) we want to use (WARNING: this removes all file system and other signatures from /dev/sda1 and /dev/sdb2 partitions):

# wipefs -a /dev/sda1
# wipefs -a /dev/sdb2

Cleaned up. Now, as is usual with basically all storage technologies, we need to write some metadata to the devices we want to use for our bcache so that the code providing the cache technology can identify such devices as bcache devices and so that it can store some configuration, status, etc. data there. Let's do it then:

# make-bcache -B /dev/sda1

This command writes the bcache metadata for the backing device (space) to the partition /dev/sda1 (which is on the HDD). Believe it or not, this is all we needed to create a bcache device. If udev is running and the appropriate udev rules are effective (if not, we have to do it manually [1]), we should now be able to see the /dev/bcache0 device node and the /dev/bcache/ directory (try listing it to see what's inside) in our file system hierarchy, which we could start using. Really? Is that everything that needs to be done? Well, it's not that easy. Remember that every cache technology needs a backing space and a cache space, and with the command above we have only defined the backing device (space). So we now of course have to define the cache device (space), again by writing some metadata into it:

# make-bcache -C /dev/sdb2

The result is that we now have the metadata written to both the backing device (space) and the cache device (space). However, these devices don't know about each other and the caching code (i.e. the kernel in case of bcache) has no idea about our intention to use /dev/sdb2 as a cache device for /dev/sda1. Remember that the first make-bcache run created the /dev/bcache0 device that was usable from the first moment? Well, it was usable as a bcache device, but without any cache device, which is not really useful. The last missing step is to attach the cache device to our bcache device bcache0 by writing the Set UUID from the make-bcache -C run to the appropriate file:

# echo C_Set_UUID_VALUE > /sys/block/bcache0/bcache/attach
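
If the Set UUID from the make-bcache -C run is no longer on the screen, it can be read back from the cache device's superblock – a small sketch, assuming the bcache-super-show tool from bcache-tools and our /dev/sdb2 cache partition:

# bcache-super-show /dev/sdb2 | grep cset.uuid
# echo UUID_PRINTED_ABOVE > /sys/block/bcache0/bcache/attach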

From now on we can enjoy the speed, noise and other improvements provided by the use of our cache. The /dev/bcache0 device is just a common block device and the easiest thing to do with it is to run e.g. mkfs.xfs on it, mount the file system to e.g. /mnt/data and copy some data to it. If we later want to detach the cache device from the bcache device, we just use the detach file instead of the attach file in the same directory under /sys.
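
For example (just a sketch – any file system would do and the mount point is arbitrary):

# mkfs.xfs /dev/bcache0
# mount /dev/bcache0 /mnt/data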

As I've mentioned at the beginning of this post, SW-based cache solutions provide more flexibility than HW solutions. One area of such flexibility is configuration, because it is quite easy to make a SW solution configurable and extensible compared to a HW solution. The configuration of our bcache can be controlled by reading and writing files under the /sys file system. The most useful and easiest example is changing the cache mode – the default is writethrough, which is the safest one, but which on the other hand doesn't save the backing device (HDD) from many random write operations. Another typical mode is writeback, which keeps the data in the cache (SSD) and only once in a while writes it back to the backing device. To change the mode we simply run the following command:

# echo writeback > /sys/block/bcache0/bcache/cache_mode

However, this change is only temporary and we have to do the same after every boot of the system if we want to always use the writeback mode (of course we can do this in a udev rule, systemd service, init script or whatever we prefer instead of doing it manually after each boot).
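
To check which mode is currently active, we can simply read the same file – the active mode is shown in square brackets among the available ones:

# cat /sys/block/bcache0/bcache/cache_mode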

Monitoring and maintenance

Even though it is usually possible to see (and even hear [2]) the difference once bcache is created and used instead of just the HDD, people are curious and always want to know something more. A typical question is: "How well is the new solution performing?" In case of a cache, the clearest performance metric is the ratio of read/write hits and misses. Of course, the more hits compared to misses the better. To find out more about the current state, status and stats of a bcache, another tool from the bcache-tools package can be used:

# bcache-status -s

In the output we should see quite a lot of interesting information and we can for example also check that the desired cache mode is being used. There are other configuration options and other stats that might be important for many users, but these are left to the kind reader for further exploration.
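
The raw numbers can also be read directly from sysfs – a small sketch (the exact set of files may differ between kernel versions):

# cat /sys/block/bcache0/bcache/stats_total/cache_hits
# cat /sys/block/bcache0/bcache/stats_total/cache_misses
# cat /sys/block/bcache0/bcache/stats_total/cache_hit_ratio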

LVM cache (dm-cache)

Why?

We have seen in the previous part of this post that bcache is quite a powerful and flexible solution for using HDD and SSD in a combination giving us great performance (of the SSD) and big capacity (of the HDD). So one may ask why we even bother with a description of some other solution. What could possibly be better with LVM cache (dm-cache) compared to bcache?

A little bit about terminology

First of all, let's start with a clarification of why I have so far referred to this technology as "LVM cache (dm-cache)". Some people know, some may not, that LVM (which stands for Logical Volume Management) is a user space technology for abstract volume management built on top of the Device Mapper functionality (in both user space and kernel). As a result, everything that can be done with LVM can be done by directly using the Device Mapper (even though it is typically incomparably more complex), and anything that LVM does needs to have the underlying (or low-level, if you prefer) support in the Device Mapper. The same applies to the caching technology, which is provided by the cache Device Mapper target and made "consumable" by the LVM cache abstraction.

Okay, okay, but why?

Now, let's get back to the big question from the first paragraph of this section. The answer is clear and simple to people who like LVM – LVM cache is to bcache what LVM is to plain partitions. For people who don't like, use or simply don't get LVM, an example of quite a big difference could be the best argument. The first step we did in order to set up our bcache was wiping all signatures from the block devices we wanted to use for both backing space and cache space. That means that any file systems that might have existed on those block devices would be removed, leaving the data unreadable and practically lost. With LVM cache it is possible to take an existing LV (Logical Volume) with an existing (even mounted) file system and convert it to a cached LV without any need to move the data to some temporary place and even without any downtime [3]. And the same applies if we, for example, later decide that we want to stripe the cache pool across two SSDs (RAID 0) to get more cache space and really nice performance, or on the other hand mirror the backing device to get better reliability (or both, of course). So we may easily start with some basic setup and improve it later as we get more HW available or different requirements. The LVM cache also provides better control and even more flexibility by allowing the user to manually define the data and metadata parts of the cache space with various different parameters (e.g. a mirrored metadata part on more reliable devices together with a striped data part for more space and better performance).

Setting up

Let's assume we have the same HW as in the case of bcache – an HDD and an SSD – but this time let's also assume that we already have LVM set up on the HDD (or even multiple HDDs, that makes no difference for the commands we are going to use) and that the SSD provides 20 GiB of space. Setting up LVM on top of HDD(s) would be a nice topic for another blog post, so let me know in the comments if you are interested in such a topic. Since we want to demonstrate one of the benefits of LVM cache over bcache, let's assume all the basic LVM setup work is done and we have an LV (Logical Volume) named DataLV with some file system and data on it, using the HDD for its physical extents [4], which is part of the data VG (Volume Group) (the backing space is called Origin in LVM's terminology). We will basically follow the steps described in the lvmcache(7) man page (another benefit over bcache from my point of view).
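
Just to make the starting point concrete, the assumed setup could have been created with something like the following (a hypothetical sketch only – the size and file system are arbitrary examples):

# pvcreate /dev/sda
# vgcreate data /dev/sda
# lvcreate -n DataLV -L500G data
# mkfs.xfs /dev/data/DataLV
# mount /dev/data/DataLV /mnt/data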

As the first step, we need to add the SSD (/dev/sdb) into the same volume group as where our LV holding the data (DataLV) is. To do that, we need to tell LVM that the /dev/sdb block device should become an LVM member device (we could use a partition on /dev/sdb if we wanted to combine partitions and LVM on our disks):

# pvcreate /dev/sdb

If that fails because of some old metadata (disk label, file system signature…) being left on the disk we could either use the wipefs tool (as in case of the bcache) or add the --force option to the pvcreate command.

Once LVM marks the /dev/sdb device as an LVM member device [5], we can add it to the data VG:

# vgextend data /dev/sdb

The data VG now sees the SSD as free space for allocation if we create more LVs in it or grow some existing ones. But we want to use it as a cache space, right? Well, LVM only knows PVs (Physical Volumes), VGs and LVs. However, LVs can be of various types (linear, striped, mirror, RAID, thin, thin pool,…) which can be changed online. So let's start with the creation of a good old LV with the size we want for our cache space and with its PEs (Physical Extents) being allocated on the SSD:

# lvcreate -n DataLVcache -L19.9G data /dev/sdb

I believe an attentive reader now asks why only 19.9 GiB when we have 20 GiB of space on the SSD. The reason is that we are going the "hard" (more controlled) way and we need some space for a separate metadata volume, which we can now create:

# lvcreate -n DataLVcacheMeta -L20M data /dev/sdb

with the size of 20 MiB because the LVM documentation (the man page) says it should be 1000 times smaller than the cache data LV, with a minimum size of 8 MiB. If we wanted the DataLVcache and/or DataLVcacheMeta to be more special (like mirrored), we could have created them as such right away. Or we can convert them later if we want to. But for now, let's just follow our simple (and probably most common) case. The next step is to "engage" the data cache LV and the metadata cache LV in a single LV called a cache pool. A cache pool is an LV that provides the cache space for the backing space, with the metadata being written and kept in it. And as such, it is created from the data cache LV, or more precisely, converted:

# lvconvert --type cache-pool --cachemode writethrough --poolmetadata data/DataLVcacheMeta data/DataLVcache

As you may see, we specify the cache mode on cache pool creation. The bad thing about this is that it cannot be changed later, but the good thing is that it is persistent. And honestly, other than playing with various technologies, how often does one need to change the cache mode? If it's really needed, the cache pool can simply be created again with a different cache mode.

It's been a long way here, I know, but we are almost at the end now, I promise. The only missing step is to finally make our DataLV cached. And as usual with LVM, it is a conversion:

# lvconvert --type cache --cachepool data/DataLVcache data/DataLV

And with that, we are done. We can now continue using the DataLV logical volume, but from now on as a cached volume using the cache space on the SSD.
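
To verify the result, we can list the LVs and check that DataLV now uses the cache pool (a sketch – the exact columns are just a suggestion):

# lvs -a -o name,size,pool_lv,attr data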

Unfortunately, there seems to be no nice tool shipped with the LVM that would give us all the cool stats just like bcache-status does for bcache. The only such tool I'm aware of is the lvcache tool written by Lars Kellogg-Stedman available from this git repository: https://github.com/larsks/lvcache. Hopefully this will change when the LVM cache starts to be more widely deployed and used.
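
In the meantime, the raw numbers (read/write hits and misses, dirty blocks, etc.) can be pulled out of the underlying Device Mapper cache target directly – a sketch, assuming the cached LV from above (the DM device name is the VG and LV names joined with a dash):

# dmsetup status data-DataLV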

Summary

I know it probably seemed really complicated and much harder to set up LVM cache than to set up bcache, but if we wanted to, we could have dropped the separate creation of the data and metadata cache LVs and done it in a single step, creating the cache pool right away [6]. I just wanted to demonstrate the extra control and possibilities LVM cache provides. Without that, the LVM cache setup would really be very similar to the bcache setup, but we would still have the big advantage of doing everything online, without any need to move data somewhere else and back.

I don't think that either of the two SW cache technologies presented in this blog post is better than the other. Just like I mentioned at the very beginning of the LVM cache description, LVM cache is to bcache what LVM is to partitions. So if somebody has some advanced knowledge and likes having things configured the exact complex way that they think is best for their use case, or if somebody needs to deploy a cache online without any downtime, then LVM cache is probably the better choice. On the other hand, if somebody just wants to make use of their SSD by setting up a SW cache on a fresh pair of SSD and HDD and they don't want to bother with all the LVM stuff and commands, then bcache is probably the better choice.

And as usual, having two independent and separate solutions for a single problem leads to many new and great ideas that are in the end shared, because what gets implemented in one of them usually sooner or later makes it to the other too, typically even improved somehow. Let's just hope that this will also apply to bcache and LVM cache and that both technologies get deployed widely enough to be massively supported, maintained and further developed.

  1. by running # echo /dev/sda1 > /sys/fs/bcache/register

  2. if the writeback mode is used, many writes to the backing device are spared and the rest are serialized as much as possible, which makes the HDD quite a lot less noisy because the R/W head doesn't move around randomly

  3. a typical approach to convert a block device into a "bcached" block device is to freeze the data on it, move/copy it somewhere else, set the bcache up and move the data back

  4. LVM's units of physical space allocation

  5. try running wipefs (without the -a option!) on it

  6. with lvcreate --type cache-pool -L20G -n DataLVcache data /dev/sdb