Btrfs:RAID 5 Rsync Freeze

My server’s /home/ directory is a btrfs RAID 5, spanning three drives (I did a blog post about it Btrfs:RAID_Setup[here]). Everything worked fine, until I used rsync to sync my files from my laptop to my server. At that point, the sync would go well for a little while and then slow to a crawl. I couldn’t cancel the sync with a ctrl+c. If I could get on my server over ssh, I’d find that one of my cpus was pegged at 100%. Sometimes though it got so bogged down I couldn’t even get to the server at all. If I were already on the server and I did a kill -9 on rsync, it’d go defunct.

I checked my logs after trying to umount /home/ and found…​

Nov 03 12:01:18 zion kernel: device label home devid 1 transid 1173 /dev/sdb
Nov 03 12:01:19 zion kernel: btrfs: disk space caching is enabled
Nov 03 12:11:53 zion kernel: INFO: task umount:1668 blocked for more than 120 seconds.
Nov 03 12:11:53 zion kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 03 12:11:53 zion kernel: umount D ffff880037afbc60 0 1668 1653 0x00000000
Nov 03 12:11:53 zion kernel: ffff880037afbbd0 0000000000000086 0000000000014500 ffff880037afbfd8
Nov 03 12:11:53 zion kernel: ffff880037afbfd8 0000000000014500 ffff8800aa0caa30 0000000000000010
Nov 03 12:11:53 zion kernel: 000000000d6fffff ffff880037afbb98 ffffffff8113a911 ffff8800afedb728
Nov 03 12:11:53 zion kernel: Call Trace:
Nov 03 12:11:53 zion kernel: [<ffffffff8113a911>] ?  free_pcppages_bulk+0x3b1/0x3f0
Nov 03 12:11:53 zion kernel: [<ffffffff81132700>] ? filemap_fdatawait+0x30/0x30
Nov 03 12:11:53 zion kernel: [<ffffffff814e1029>] schedule+0x29/0x70
Nov 03 12:11:53 zion kernel: [<ffffffff814e12cf>] io_schedule+0x8f/0xe0
Nov 03 12:11:53 zion kernel: [<ffffffff8113270e>] sleep_on_page+0xe/0x20
Nov 03 12:11:53 zion kernel: [<ffffffff814ddb5b>] __wait_on_bit_lock+0x5b/0xc0
Nov 03 12:11:53 zion kernel: [<ffffffff8113284a>] __lock_page+0x6a/0x70
Nov 03 12:11:53 zion kernel: [<ffffffff81084800>] ?  wake_atomic_t_function+0x40/0x40
Nov 03 12:11:53 zion kernel: [<ffffffff81141fa3>] truncate_inode_pages_range+0x613/0x660
Nov 03 12:11:53 zion kernel: [<ffffffff81142005>] truncate_inode_pages+0x15/0x20
Nov 03 12:11:53 zion kernel: [<ffffffffa07df172>] btrfs_evict_inode+0x42/0x380 [btrfs]
Nov 03 12:11:53 zion kernel: [<ffffffff811b97b0>] evict+0xb0/0x1b0
Nov 03 12:11:53 zion kernel: [<ffffffff811b98e9>] dispose_list+0x39/0x50
Nov 03 12:11:53 zion kernel: [<ffffffff811ba56c>] evict_inodes+0x11c/0x130
Nov 03 12:11:53 zion kernel: [<ffffffff811a1cc8>] generic_shutdown_super+0x48/0xe0
Nov 03 12:11:53 zion kernel: [<ffffffff811a1f22>] kill_anon_super+0x12/0x20
Nov 03 12:11:53 zion kernel: [<ffffffffa07a8ee6>] btrfs_kill_super+0x16/0x90 [btrfs]
Nov 03 12:11:53 zion kernel: [<ffffffff811a22fd>] deactivate_locked_super+0x3d/0x60
Nov 03 12:11:53 zion kernel: [<ffffffff811a28e6>] deactivate_super+0x46/0x60
Nov 03 12:11:53 zion kernel: [<ffffffff811bdeaf>] mntput_no_expire+0xef/0x150
Nov 03 12:11:53 zion kernel: [<ffffffff811bf0b1>] SyS_umount+0x91/0x3b0
Nov 03 12:11:53 zion kernel: [<ffffffff814ea5dd>] system_call_fastpath+0x1a/0x1f

The only way to solve the problem was to perform a restart. After that, the problem would come back as soon as I started rsync again.

The Solution

I hunted around for a while until I finally just searched for the name of the pegged process, btrfs-endio-wri, and cpu time. It turns out, the btrfs folks have a page detailing a list of current "gotchas" btrfs has. This issue was one of them. They describe it as <pre> Files with a lot of random writes can become heavily fragmented (10000+ extents) causing trashing on HDDs and excessive multi-second spikes of CPU load on systems with an SSD or large amount a RAM. …​ Symptoms include btrfs-transacti and btrfs-endio-wri taking up a lot of CPU time (in spikes, possibly triggered by syncs). You can use filefrag to locate heavily fragmented files. </pre>

One of the best parts of rsync is that is syncs deltas instead of resyncing the entire file. What does that result in? Lots of little random writes. Sounds like a match to me.

To fix this, I defragged all of /home/ (with compression=lzo of course :) ), and remounted using the autodefrag option.

Now I can run rsync with no problems.

One last thing to note. Their gotchas page says that once they’ve worked out a few potential kinks with the autodefrag mount option, they’ll make it the default, which should prevent this from being an issue in future versions.

Category:Linux Category:Btrfs Category:Storage Category:RAID