How device size affects disk performance in Linux (II)

I’ve just noticed that in a virtualized environment, where each request round trip is more expensive, the read-speed difference between a partition whose size is page-aligned (a multiple of 4096 bytes) and one whose size is not is even more noticeable:

Partition with a non-page-aligned size (in sectors: (41929649-63)+1 = 41929587):

none:~# fdisk -l -u /dev/sdb

Disk /dev/sdb: 21.4 GB, 21474836480 bytes
255 heads, 63 sectors/track, 2610 cylinders, total 41943040 sectors
Units = sectors of 1 x 512 = 512 bytes
Disk identifier: 0x93dbf2be

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               63    41929649    20964793+  83  Linux

none:~# dd if=/dev/sdb1 of=/dev/null bs=1M count=100 skip=0
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 3.80593 s, 27.6 MB/s

none:~# dd if=/dev/sdb1 of=/dev/null bs=1M count=100 skip=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 3.82304 s, 27.4 MB/s

none:~# dd if=/dev/sdb1 of=/dev/null bs=1M count=100 skip=200
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 3.83713 s, 27.3 MB/s

Partition with a page-aligned size (in sectors: (41929654-63)+1 = 41929592):

none:~# fdisk -l -u /dev/sdb

Disk /dev/sdb: 21.4 GB, 21474836480 bytes
255 heads, 63 sectors/track, 2610 cylinders, total 41943040 sectors
Units = sectors of 1 x 512 = 512 bytes
Disk identifier: 0x93dbf2be

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               63    41929654    20964796   83  Linux

none:~# dd if=/dev/sdb1 of=/dev/null bs=1M count=100 skip=0
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.618494 s, 170 MB/s

none:~# dd if=/dev/sdb1 of=/dev/null bs=1M count=100 skip=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.582423 s, 180 MB/s

none:~# dd if=/dev/sdb1 of=/dev/null bs=1M count=100 skip=200
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.589081 s, 178 MB/s

NOTE: In both tests the data is already cached in the host (so there is no real I/O to the disks, only traffic through the virtualization layer), but it is not cached in the guest.
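
If you want to check which internal block size the kernel picked for a device without timing dd, the BLKGETSIZE64 and BLKBSZGET ioctls expose the device size and that block size from user space. Below is a minimal test program of my own (not part of the original post’s tooling); on the two /dev/sdb1 layouts above it should report 512 and 4096 bytes respectively, for the reasons explained in the original post below.

/* blkinfo.c - print a block device's size and the internal block size
 * the kernel selected for it (bd_block_size, see the original post below).
 * Build: gcc -o blkinfo blkinfo.c    Usage: ./blkinfo /dev/sdb1
 */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>                   /* BLKGETSIZE64, BLKBSZGET */

int main(int argc, char **argv)
{
        uint64_t size = 0;
        int bsize = 0, fd;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <block device>\n", argv[0]);
                return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (ioctl(fd, BLKGETSIZE64, &size) < 0 ||   /* size in bytes */
            ioctl(fd, BLKBSZGET, &bsize) < 0) {     /* internal block size */
                perror("ioctl");
                close(fd);
                return 1;
        }
        printf("size:       %llu bytes (%llu sectors)\n",
               (unsigned long long)size, (unsigned long long)(size / 512));
        printf("block size: %d bytes%s\n", bsize,
               (size % 4096) ? "  (size is not a multiple of 4096)" : "");
        close(fd);
        return 0;
}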

How device size affects disk performance in Linux

While running some tests in a client’s environment, we noticed that reading from a partition of a multipath device was considerably slower than reading from its parent node:

[root@none]# dd if=mpath4 of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 8.92711 seconds, 120 MB/s

[root@none]# dd if=mpath4p1 of=/dev/null bs=1M count=1024 skip=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 17.5965 seconds, 61.0 MB/s

We asked the client support of a well-known GNU+Linux vendor, and they indicated that this behavior was “expected”, since this kind of partition is created by stacking a dm-linear device on top of the original multipath node. I wasn’t satisfied with this answer, since AFAIK dm-linear only does a simple transposition of the original request by a specified offset (the beginning of the partition), so I decided to investigate a bit further on my own.
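
For context, the remapping that a dm-linear target applies really is that simple: a table line of the form "0 <length> linear <device> <offset>" means that sector S of the mapped device is just sector offset+S of the underlying device. A paraphrased sketch of the idea (not the literal code from drivers/md/dm-linear.c):

/* Paraphrase of dm-linear's remapping: each request is forwarded to the
 * underlying device with a fixed offset added to its starting sector. */
static inline unsigned long long
linear_remap(unsigned long long requested_sector, unsigned long long offset)
{
        return offset + requested_sector;
}

Since this is nothing more than an addition per request, the stacking by itself did not look like a convincing explanation for a 2x slowdown.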

The first thing I noticed was that changing the size of the dm-linear device affected the performance of the tests:

[root@none]# echo "0 1870000 linear 8:96 63" | dmsetup create test
[root@none]# dd if=/dev/mapper/test of=/dev/null bs=1M count=100 skip=600
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.906487 seconds, 116 MB/s

[root@none]# dmsetup remove test
[root@none]# echo "0 1870001 linear 8:96 63" | dmsetup create test
[root@none]# dd if=/dev/mapper/test of=/dev/null bs=1M count=100 skip=700
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 1.47716 seconds, 71.0 MB/s

This was something, but I still needed to find out how a simple change in the device size could impact performance this way. Playing around with kgdb (what a nice tool!), I reached this piece of code in Linux (drivers/md/dm.c):

static int __split_bio(struct mapped_device *md, struct bio *bio)
{
        struct clone_info ci;
        int error = 0;

        ci.map = dm_get_table(md);
        if (unlikely(!ci.map))
                return -EIO;

        ci.md = md;
        ci.bio = bio;
        ci.io = alloc_io(md);
        ci.io->error = 0;
        atomic_set(&ci.io->io_count, 1);
        ci.io->bio = bio;
        ci.io->md = md;
        ci.sector = bio->bi_sector;
        ci.sector_count = bio_sectors(bio);
        ci.idx = bio->bi_idx;

        start_io_acct(ci.io);
        while (ci.sector_count && !error)
                error = __clone_and_map(&ci);

        dec_pending(ci.io, error);
        dm_table_put(ci.map);

        return 0;
}

In the debugging session, I noticed that ci.sector_count takes the value ‘1’ for the device with the worst performance, while other devices with different sizes and better read speeds could take values in a range from ‘2’ to ‘8’ (the latter being the case with the best performance). So, indeed, the size of a device affects how it is accessed, and this makes a noticeable difference in performance. But it still wasn’t clear to me where the root of this behavior was, so I decided to dig a bit deeper. That took me to this function (fs/block_dev.c):

void bd_set_size(struct block_device *bdev, loff_t size)
{
        unsigned bsize = bdev_logical_block_size(bdev);

        bdev->bd_inode->i_size = size;
        while (bsize < PAGE_CACHE_SIZE) {
                if (size & bsize)
                        break;
                bsize <<= 1;
        }
        bdev->bd_block_size = bsize;
        bdev->bd_inode->i_blkbits = blksize_bits(bsize);
}

This function searches for the greatest power of 2 that divides the device size, in the range from 512 (the sector size) to 4096 (the value of PAGE_CACHE_SIZE on x86), and sets it as the internal block size. Further direct requests to the device will be internally split into chunks of this size, so devices whose sizes are multiples of 4096 will perform better than those whose sizes are only multiples of 2048, 1024 or 512 (the worst case, which every device satisfies, since 512 is the sector size). This is especially important in scenarios where the application accesses the devices directly, such as Oracle’s ASM configurations.
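
To make the effect concrete, here is a small user-space sketch of my own that mirrors the loop in bd_set_size() above and applies it to the two dm-linear devices from the test earlier (1870000 and 1870001 sectors):

/* Mirror of the bd_set_size() loop: find the largest power of two between
 * 512 and PAGE_CACHE_SIZE (4096 on x86) that divides the device size; this
 * is the internal block size the kernel will use for the device. */
#include <stdio.h>
#include <stdint.h>

#define PAGE_CACHE_SIZE 4096U

static unsigned int internal_block_size(uint64_t size_in_bytes)
{
        unsigned int bsize = 512;               /* sector size, worst case */

        while (bsize < PAGE_CACHE_SIZE) {
                if (size_in_bytes & bsize)      /* size not divisible by 2*bsize */
                        break;
                bsize <<= 1;
        }
        return bsize;
}

int main(void)
{
        printf("1870000 sectors -> %u bytes\n",
               internal_block_size(1870000ULL * 512));  /* prints 4096 */
        printf("1870001 sectors -> %u bytes\n",
               internal_block_size(1870001ULL * 512));  /* prints 512 */
        return 0;
}

The 4096-byte versus 512-byte result is exactly the difference between the fast and the slow dmsetup device in the measurements above.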

TL;DR: Linux chooses the internal block size used to fulfill page requests by searching for the greatest power of 2 that divides the device size, in a range from 512 to 4096 bytes (on x86), so creating your partitions with a size that is a multiple of 4096 bytes will help you obtain better performance in disk I/O operations.
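
As a practical corollary, when scripting the creation of partitions or dm devices you only need to round the sector count down to a multiple of 8 (8 x 512 = 4096 bytes). A trivial, hypothetical helper for illustration:

/* Round a sector count down to a multiple of 8 sectors (4096 bytes) so that
 * bd_set_size() ends up choosing the 4 KiB internal block size.
 * Example: 1870001 -> 1870000. */
static inline unsigned long long page_aligned_sectors(unsigned long long sectors)
{
        return sectors & ~7ULL;
}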
