My server's /home/
directory is a btrfs RAID 5, spanning three drives (I did
a blog post about it [here](/?q=linux/Filesystems and Storage/Btrfs:RAID_Setup).
Everything worked fine, until I used rsync to sync my files from my laptop to
my server. At that point, the sync would go well for a little while and then
slow to a crawl. I couldn't cancel the sync with a ctrl+c. If I could get on my
server over ssh, I'd find that one of my cpus was pegged at 100%. Sometimes
though it got so bogged down I couldn't even get to the server at all. If I
were already on the server and I did a kill -9 on rsync, it'd go defunct.
I checked my logs after trying to umount /home/ and found...
Nov 03 12:01:18 zion kernel: device label home devid 1 transid 1173 /dev/sdb
Nov 03 12:01:19 zion kernel: btrfs: disk space caching is enabled
Nov 03 12:11:53 zion kernel: INFO: task umount:1668 blocked for more than 120 seconds.
Nov 03 12:11:53 zion kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 03 12:11:53 zion kernel: umount D ffff880037afbc60 0 1668 1653 0x00000000
Nov 03 12:11:53 zion kernel: ffff880037afbbd0 0000000000000086 0000000000014500 ffff880037afbfd8
Nov 03 12:11:53 zion kernel: ffff880037afbfd8 0000000000014500 ffff8800aa0caa30 0000000000000010
Nov 03 12:11:53 zion kernel: 000000000d6fffff ffff880037afbb98 ffffffff8113a911 ffff8800afedb728
Nov 03 12:11:53 zion kernel: Call Trace:
Nov 03 12:11:53 zion kernel: [<ffffffff8113a911>] ? free_pcppages_bulk+0x3b1/0x3f0
Nov 03 12:11:53 zion kernel: [<ffffffff81132700>] ? filemap_fdatawait+0x30/0x30
Nov 03 12:11:53 zion kernel: [<ffffffff814e1029>] schedule+0x29/0x70
Nov 03 12:11:53 zion kernel: [<ffffffff814e12cf>] io_schedule+0x8f/0xe0
Nov 03 12:11:53 zion kernel: [<ffffffff8113270e>] sleep_on_page+0xe/0x20
Nov 03 12:11:53 zion kernel: [<ffffffff814ddb5b>] __wait_on_bit_lock+0x5b/0xc0
Nov 03 12:11:53 zion kernel: [<ffffffff8113284a>] __lock_page+0x6a/0x70
Nov 03 12:11:53 zion kernel: [<ffffffff81084800>] ? wake_atomic_t_function+0x40/0x40
Nov 03 12:11:53 zion kernel: [<ffffffff81141fa3>] truncate_inode_pages_range+0x613/0x660
Nov 03 12:11:53 zion kernel: [<ffffffff81142005>] truncate_inode_pages+0x15/0x20
Nov 03 12:11:53 zion kernel: [<ffffffffa07df172>] btrfs_evict_inode+0x42/0x380 [btrfs]
Nov 03 12:11:53 zion kernel: [<ffffffff811b97b0>] evict+0xb0/0x1b0
Nov 03 12:11:53 zion kernel: [<ffffffff811b98e9>] dispose_list+0x39/0x50
Nov 03 12:11:53 zion kernel: [<ffffffff811ba56c>] evict_inodes+0x11c/0x130
Nov 03 12:11:53 zion kernel: [<ffffffff811a1cc8>] generic_shutdown_super+0x48/0xe0
Nov 03 12:11:53 zion kernel: [<ffffffff811a1f22>] kill_anon_super+0x12/0x20
Nov 03 12:11:53 zion kernel: [<ffffffffa07a8ee6>] btrfs_kill_super+0x16/0x90 [btrfs]
Nov 03 12:11:53 zion kernel: [<ffffffff811a22fd>] deactivate_locked_super+0x3d/0x60
Nov 03 12:11:53 zion kernel: [<ffffffff811a28e6>] deactivate_super+0x46/0x60
Nov 03 12:11:53 zion kernel: [<ffffffff811bdeaf>] mntput_no_expire+0xef/0x150
Nov 03 12:11:53 zion kernel: [<ffffffff811bf0b1>] SyS_umount+0x91/0x3b0
Nov 03 12:11:53 zion kernel: [<ffffffff814ea5dd>] system_call_fastpath+0x1a/0x1f
The only way to solve the problem was to perform a restart. After that, the problem would come back as soon as I started rsync again.
I hunted around for a while until I finally just searched for the name of the
pegged process, btrfs-endio-wri
, and cpu time. It turns out, the btrfs
folks have a page detailing a
list of current "gotchas" btrfs has. This issue was one of them. They describe
it as
Files with a lot of random writes can become heavily fragmented (10000+ extents) causing trashing on HDDs and excessive multi-second spikes of CPU load on systems with an SSD or large amount a RAM. ... Symptoms include btrfs-transacti and btrfs-endio-wri taking up a lot of CPU time (in spikes, possibly triggered by syncs). You can use filefrag to locate heavily fragmented files.
One of the best parts of rsync is that is syncs deltas instead of resyncing the entire file. What does that result in? Lots of little random writes. Sounds like a match to me.
To fix this, I defragged all of /home/ (with compression=lzo
of course
:)), and remounted using the autodefrag
option.
Now I can run rsync with no problems.
One last thing to note. Their gotchas page says that once they've worked out a few potential kinks with the autodefrag mount option, they'll make it the default, which should prevent this from being an issue in future versions.
Written on: 2015-06-28 14:45:35 -0600
Last edited: 2025-03-01 01:46:33 UTC