How device size affects disk performance in Linux

PUBLISHED ON FEB 9, 2011 — BLOG

While running some tests in a client’s environment, we noticed that reading from a partition of a multipath device was considerably slower than reading from its parent node:

```
[root@none]# dd if=mpath4 of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 8.92711 seconds, 120 MB/s

[root@none]# dd if=mpath4p1 of=/dev/null bs=1M count=1024 skip=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 17.5965 seconds, 61.0 MB/s
```

We asked the client support of a well-known GNU+Linux vendor, and they indicated that this behavior was “expected”, since this kind of partition is created by stacking a _dm-linear_ device on top of the original multipath node. I wasn’t satisfied with this answer, since AFAIK dm-linear only performs a simple translation of the original request by a specified offset (the beginning of the partition), so I decided to investigate a bit further on my own.
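To illustrate what I mean, the whole job of a dm-linear target is to shift each request by a fixed sector offset before handing it to the underlying device. The sketch below is my own userspace rendition of that arithmetic (`linear_remap`, `target_begin` and `dev_start` are illustrative names, not kernel identifiers):

```c
/* A userspace sketch (not kernel code) of the remapping a dm-linear
 * target performs: requests are simply shifted by a fixed offset into
 * the underlying device. Names are illustrative, not from the kernel. */
#include <stdio.h>

static unsigned long long linear_remap(unsigned long long sector,
                                       unsigned long long target_begin,
                                       unsigned long long dev_start)
{
	/* Sector on the underlying device = start of the mapped area
	 * plus the offset of the request within the stacked device. */
	return dev_start + (sector - target_begin);
}

int main(void)
{
	/* e.g. a partition that starts at sector 63 of its parent,
	 * as in the dmsetup table used later in this post */
	printf("sector 1000 of the partition -> sector %llu of the parent\n",
	       linear_remap(1000, 0, 63));
	return 0;
}
```

There is so little work here that the remapping itself cannot reasonably explain a 2x slowdown.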

The first thing I noticed was that changing the size of the dm-linear device affected the performance of the tests:

```
[root@none]# echo "0 1870000 linear 8:96 63" | dmsetup create test
[root@none]# dd if=/dev/mapper/test of=/dev/null bs=1M count=100 skip=600
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.906487 seconds, 116 MB/s

[root@none]# dmsetup remove test
[root@none]# echo "0 1870001 linear 8:96 63" | dmsetup create test
[root@none]# dd if=/dev/mapper/test of=/dev/null bs=1M count=100 skip=700
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 1.47716 seconds, 71.0 MB/s
```

This was something, but I still needed to find out how a simple change in the device size could impact performance this way. Playing around with _kgdb_ (what a nice tool!), I reached this piece of code in the Linux kernel (drivers/md/dm.c):

```c
static int __split_bio(struct mapped_device *md, struct bio *bio)
{
	struct clone_info ci;
	int error = 0;

	ci.map = dm_get_table(md);
	if (unlikely(!ci.map))
		return -EIO;

	ci.md = md;
	ci.bio = bio;
	ci.io = alloc_io(md);
	ci.io->error = 0;
	atomic_set(&ci.io->io_count, 1);
	ci.io->bio = bio;
	ci.io->md = md;
	ci.sector = bio->bi_sector;
	ci.sector_count = bio_sectors(bio);
	ci.idx = bio->bi_idx;

	start_io_acct(ci.io);
	while (ci.sector_count && !error)
		error = __clone_and_map(&ci);

	dec_pending(ci.io, error);
	dm_table_put(ci.map);

	return 0;
}
```

In the debugging session, I noticed that _ci.sector_count_ took the value ‘1’ for the device with the worst performance, while other devices with different sizes and better read speeds took values ranging from ‘2’ to ‘8’ (the latter being the case with the best performance). So, indeed, the size of a device affects how it is accessed, and this makes a noticeable difference in performance. But it still wasn’t clear to me where the root of this behavior was, so I decided to dig a bit deeper. That took me to this function (fs/block_dev.c):

```c
void bd_set_size(struct block_device *bdev, loff_t size)
{
	unsigned bsize = bdev_logical_block_size(bdev);

	bdev->bd_inode->i_size = size;
	while (bsize < PAGE_CACHE_SIZE) {
		if (size & bsize)
			break;
		bsize <<= 1;
	}
	bdev->bd_block_size = bsize;
	bdev->bd_inode->i_blkbits = blksize_bits(bsize);
}
```

This function searches for the greatest power of 2 that divides the device size, in the range from 512 (the sector size) to 4096 (the value of PAGE_CACHE_SIZE on x86), and sets it as the device’s internal block size. Further requests made directly to the device will be internally divided into chunks of this size, so devices whose size is a multiple of 4096 bytes will perform better than those whose size is only a multiple of 2048, 1024 or 512 (the worst case, which every device satisfies, since that is the size of a sector). In the experiment above, 1870000 sectors amount to 957,440,000 bytes, a multiple of 4096, while 1870001 sectors amount to 957,440,512 bytes, which is only a multiple of 512. This is especially important in scenarios where devices are accessed directly by the application, such as Oracle’s ASM configurations.
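To make the selection concrete, here is a small userspace program that replays the same loop for the two device sizes used in the dmsetup experiment above, assuming the 512-byte sector and 4096-byte PAGE_CACHE_SIZE of this x86 setup (`pick_block_size` is my own illustrative name, not a kernel function):

```c
/* A userspace replay of the bd_set_size() loop above, assuming a
 * 512-byte logical sector and PAGE_CACHE_SIZE == 4096 (x86).
 * pick_block_size() is an illustrative name, not a kernel function. */
#include <stdio.h>

static unsigned pick_block_size(unsigned long long size_bytes)
{
	unsigned bsize = 512;              /* logical sector size */

	while (bsize < 4096) {             /* PAGE_CACHE_SIZE on x86 */
		if (size_bytes & bsize)    /* bit set: size is not divisible by the next power of 2 */
			break;
		bsize <<= 1;
	}
	return bsize;
}

int main(void)
{
	/* the two dm-linear device sizes from the experiment above */
	unsigned long long sectors[] = { 1870000ULL, 1870001ULL };

	for (int i = 0; i < 2; i++) {
		unsigned long long bytes = sectors[i] * 512;
		printf("%llu sectors = %llu bytes -> block size %u\n",
		       sectors[i], bytes, pick_block_size(bytes));
	}
	return 0;
}
```

The 1870000-sector device gets a 4096-byte block size, the 1870001-sector one only 512 bytes. This also seems to line up with the _ci.sector_count_ values observed in kgdb: buffered reads on the device node are built from blocks of this size, and a bio carrying a single block spans block size / 512 sectors, i.e. 1, 2, 4 or 8.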

TL;DR: Linux chooses the internal block size that will be used to fulfill page requests by searching for the greatest power of 2 that divides the device size, in a range from 512 to 4096 bytes (on x86), so creating your partitions with a size that is a multiple of 4096 bytes (eight 512-byte sectors) will help you obtain better disk I/O performance.
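To check a device on your own system, the block size the kernel picked and the device size it was derived from can be queried with standard ioctls. Here is a minimal sketch, using the partition name from this post as the default argument (adjust as needed):

```c
/* A minimal sketch: query a block device's soft block size and its
 * size in bytes, to verify the behavior described above. The default
 * device path is just the example name used in this post. */
#include <fcntl.h>
#include <linux/fs.h>      /* BLKBSZGET, BLKGETSIZE64 */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "/dev/mapper/mpath4p1";
	unsigned long long bytes;
	int bsize;

	int fd = open(dev, O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ioctl(fd, BLKGETSIZE64, &bytes) < 0 ||
	    ioctl(fd, BLKBSZGET, &bsize) < 0) {
		perror("ioctl");
		close(fd);
		return 1;
	}
	printf("%s: %llu bytes, block size %d, %s\n",
	       dev, bytes, bsize,
	       (bytes % 4096) ? "NOT a multiple of 4096" : "multiple of 4096");
	close(fd);
	return 0;
}
```

The same numbers are also available from the command line via `blockdev --getbsz` and `blockdev --getsize64`.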