Tuning Linux
Background
The Linux kernel is not really tuned for large-scale IO or high-bandwidth long-haul network traffic by default. This page documents tunings applicable to the various services run by NDGF.
dCache disk pools
TCP/network tuning
NDGF dCache pools do single-stream TCP transfers for pool-to-pool (p2p) copies, so we tune for that case and hope that it will also be good enough for end-user transfers.
The bandwidth-delay product limits the transfer speed that can be achieved. The RTT over the LHC OPN between HPC2N and IJS is approximately 75 ms, so sustaining 800 MB/s transfers requires a 60 MB TCP window, which is larger than most Linux distribution defaults allow for TCP auto-tuning.
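The needed window size follows directly from the bandwidth-delay product:

window = bandwidth * RTT = 800 MB/s * 0.075 s = 60 MB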
Check your current tuning with: sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
You need to modify the net.ipv4.tcp_rmem and net.ipv4.tcp_wmem sysctls to allow Linux TCP autotuning to reach at minimum the needed TCP window size. Change only the rightmost value. Note that the value set for the read buffer typically needs to be 50% larger than the wanted TCP window size due to overhead.
Below is an example /etc/sysctl.d/60-tcptuning.conf
# Typical defaults, listed by sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem:
# net.ipv4.tcp_rmem = 4096 87380 6291456
# net.ipv4.tcp_wmem = 4096 16384 4194304
#
# Tuning for 64MiB tcp windows, with similar tcp_rmem overhead as default:
net.ipv4.tcp_rmem = 4096 87380 100663296
net.ipv4.tcp_wmem = 4096 16384 67108864
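The new values can be applied without a reboot (assuming the file name above):

$ sysctl -p /etc/sysctl.d/60-tcptuning.conf

Alternatively, sysctl --system reloads all configured sysctl files.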
Enable BBR congestion control
Switching to the BBR congestion control algorithm is preferred in order to better handle links with spurious periods of high packet loss. Such loss causes the default Linux congestion control to overreact and recover with a slow ramp-up from a full stop. BBR is able to detect spurious loss and react in a more appropriate manner.
BBR is available in Linux distributions with a sufficiently recent kernel, for example Ubuntu 18.04, RHEL 8, Debian 10, and more.
Below is an example /etc/sysctl.d/60-net-tcp-bbr.conf
# Enable the Bottleneck Bandwidth and RTT (BBR) congestion control algorithm.
#
# Use fq qdisc for best performance.
# From https://github.com/google/bbr/blob/master/Documentation/bbr-quick-start.md:
# Any qdisc will work, though "fq" performs better for highly-loaded servers.
# (Note that TCP-level pacing was added in v4.13-rc1 but did not work well for
# BBR until a fix was added in 4.20.)
net.core.default_qdisc=fq
#
# Enable BBR, despite the name also applies to IPv6
net.ipv4.tcp_congestion_control=bbr
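To verify that the settings took effect, query the sysctls; expected output, assuming the file above has been loaded:

$ sysctl net.ipv4.tcp_congestion_control
net.ipv4.tcp_congestion_control = bbr
$ sysctl net.core.default_qdisc
net.core.default_qdisc = fq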
vm.swappiness
It generally makes no sense to preemptively swap out (parts of) running applications (i.e. java/dCache) in order to gain a little more disk cache.
Add the following as /etc/sysctl.d/60-vm-swappiness.conf
# Tell the kernel to reduce the willingness to swap out applications when doing
# IO in order to increase the amount of file system cache.
# Default: 60
# Note that it needs to be set to minimum 1 with newer kernels to have the same
# behaviour as 0 has with older kernels.
vm.swappiness = 1
vm.min_free_kbytes
When the Linux kernel doesn't have enough free contiguous memory buffers for DMA etc., it will do small memory-management pauses that sometimes lead to full hangs. The clearest symptom of this is "failed 0-order allocation" messages (sometimes with other numbers than 0). The solution is to make vm.min_free_kbytes large enough; 0.5-1.0 seconds' worth of network traffic might be a reasonable starting point.
Older distributions/kernels have a default tuned for GigE class networking.
The suggested starting value for a 10GigE class network is 524288 kbytes (512 MiB), or higher for faster networks or more complex network/storage setups.
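As a sanity check of the suggested value against the rule of thumb above: 10GigE is roughly 1.25 GB/s, so

0.5 s * 1.25 GB/s = 625 MB

which is in line with the suggested 512 MiB.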
Check your current value with sysctl vm.min_free_kbytes. If it's smaller than the suggested value you need to increase it.
To increase it, add the following as /etc/sysctl.d/60-vm-minfree.conf
vm.min_free_kbytes = 524288
vm.dirty
Defaults are quite large on modern machines with lots of RAM, and the kernel waits far too long before starting writeout, which causes write storms and huge impact on reads. The total lack of IO pacing makes this behaviour have a bigger impact than it should.
The workaround is to reduce the vm.dirty settings, but care also needs to be taken that this doesn't cause files to be written in multiple fragments, as that reduces the efficiency of read-ahead. Aim for setting it as low as possible without causing a lot of fragmentation.
Proceed in the following way to find suitable tuning for this:
- Start out with vm.dirty_background_bytes approximating 0.5 s of IO, i.e. the same value as vm.min_free_kbytes, but note bytes vs kbytes!
- Set vm.dirty_bytes to 4*vm.dirty_background_bytes
- Verify that you're not getting overfragmented files (see the example after this list):
  - Copy/write a big file (multiple GB)
  - sync
  - Check file fragmentation with filefrag filename
  - If more than one extent is listed, increase vm.dirty* and try again
  - Copy/write two big files (multiple GB) simultaneously
  - sync
  - Check file fragmentation with filefrag filename1 filename2
  - If more than a couple (single-digit) extents are listed for each file, increase vm.dirty* and try again.
- Note that all sorts of factors can affect this, and that the ratio between dirty_background and dirty is picked from experience and might need to be different.
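A minimal sketch of the single-file check; the pool path and file name are hypothetical, and dd is just one way to produce a multi-GB file:

$ dd if=/dev/zero of=/pool/testfile bs=1M count=4096
$ sync
$ filefrag /pool/testfile
/pool/testfile: 1 extent found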
Add the tuning to /etc/sysctl.d/60-vm-dirty.conf:
# Start writeout of pending writes earlier.
# Limit size of pending writeouts.
#
# When to start writeout (default: 10% of system RAM)
# 512 MiB
vm.dirty_background_bytes = 536870912
#
# Limit of pending writes (default: 20% of system RAM)
# 2GiB
vm.dirty_bytes = 2147483648
disk scheduler
We recommend using the deadline disk scheduler. Either hardcode it by passing elevator=deadline to the kernel on boot, or create a nifty udev rule file; see the tape pool tuning below for an example.
NOTE: Recent multiqueue kernels already use the mq-deadline scheduler by default. Check what's available/used before trying to change it.
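The available schedulers can be read per device, with the active one shown in brackets; sda is a placeholder device name, and the exact list depends on the kernel (e.g. [mq-deadline] on multiqueue kernels):

$ cat /sys/block/sda/queue/scheduler
noop [deadline] cfq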
dCache tape pools
The goal: To be able to stream to/from a tape drive with decent speed despite other IO happening. Linux/XFS is really bad at this; in the long run ZFS on Linux is probably a better idea, as it has some sort of IO pacing.
NOTE: These tunings are in addition to the disk pool tunings.
If you find that writes are starving reads, consider lowering vm.dirty*_bytes more. vm.dirty_bytes should be significantly smaller than the write cache in your raid controller.
Note however that you'll want vm.dirty* to be at least a few times larger than the raid stripe size in order to be able to do full-stripe writes.
disk scheduler/readahead
We recommend using a udev rule to set the tuning attributes. This ensures that the tunings get reapplied if the device is recreated for some reason (device resets, changes, etc.).
Add the following as /etc/udev/rules.d/99-iotuning-largedev.rules (verified on Ubuntu):
# Note the ugly trick to only match large devices: glob on the size attribute ...
# dCache tape pool optimized tunings for large block devices:
# - Set scheduler to deadline
# - 64MB read-ahead gives somewhat decent streaming reads while writing.
# - Lowering nr_requests from default 128 improves read performance with a slight write penalty.
# - Tell the IO scheduler it's OK to starve writes more times in favor of reads
SUBSYSTEM=="block", ACTION=="add|change", ATTR{size}=="[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]*", ATTR{bdi/read_ahead_kb}="65536", ATTR{queue/scheduler}="deadline", ATTR{queue/nr_requests}="64", ATTR{queue/iosched/writes_starved}="10"
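After installing the rule file, the rules can be applied to existing devices without a reboot:

$ udevadm control --reload-rules
$ udevadm trigger --subsystem-match=block --action=change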
Write pool caveats
If you're still having problems keeping the tape drive streaming, investigate what the OS thinks the IO latencies are. The r_await output from iostat -dxm device-name 4 is a good starting point. Consider worst-case r_await * tape-drive speed to be the absolute minimum requirement for bdi/read_ahead_kb.
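A hypothetical example: with a worst-case r_await of 100 ms and a tape drive streaming at 300 MB/s,

0.1 s * 300 MB/s = 30 MB

i.e. bdi/read_ahead_kb should be at least about 30720.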
If you find that writes still starve reads, consider lowering queue/nr_requests even further. Note that changing nr_requests also affects read behaviour/performance. Increasing bdi/read_ahead_kb further might also help.
See also hardware specific issues (Smart Array controller queue depth for example).
Hardware specific issues
HP(E) Smart Array RAID controllers
Queue depth
Change the HP Smart Array controller queue depth from auto to 8 for more balanced read/write performance, i.e.:
hpssacli ctrl slot=X modify queuedepth=8
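The current setting can be checked with something like the following (slot number X is a placeholder as above; the exact output wording may differ between hpssacli versions):

$ hpssacli ctrl slot=X show detail | grep -i 'queue depth'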
This is suggested primarily for tape pools.
Write performance degradation due to overly large max_sectors_kb setting
Mainline Linux 3.18.22 merged https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=20d74bf29cfae86649bf1ec75038c79a9bc5010f, which triggers a write performance regression on HP(E) Smart Array controllers.
This change is included in Ubuntu Vivid 3.19.0 (and newer) kernels, and CentOS 7 3.10.0-327.36.3 (and newer) kernels.
While the controllers can handle big writes, the performance suffers in some conditions.
A workaround is to cap max_sectors_kb to the old value used before this change.
The following udev rule can be used for this:
#
# HP/HPE Smart Array controllers can't handle IOs much larger than 1 MiB with
# good performance, although they can handle 4 MiB without error. The driver
# advertises the stripe size as the maximum IO size, which can be substantially
# larger for bulk IO setups.
#
# Install this as
# /etc/udev/rules.d/90-smartarray-limitiosize.rules
# to limit max_sectors_kb to 512, the default before Linux 3.18.22.
#
# hpsa driver, Px1x and newer
SUBSYSTEM=="block", ACTION=="add|change", DRIVERS=="hpsa", ENV{DEVTYPE}=="disk", ATTR{queue/max_sectors_kb}="512"
#
# cciss driver, Px0x and older
SUBSYSTEM=="block", ACTION=="add|change", DRIVERS=="cciss", ENV{DEVTYPE}=="disk", ATTR{queue/max_sectors_kb}="512"
The complete Ubuntu bug report is available at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668557
Experiences
UIO
We tried myriads (*) of combinations of hardware and OS configurations to prevent read requests from being stalled by a continuous stream of writes. To date (20141128), the only really helpful trick was to use cgroups to cap the bandwidth of writes at the block level. (*) It felt like myriads at least ;)
Here is what you need to do:
1. Check if cgroups and io throttling are enabled in the kernel
$ grep CONFIG_BLK_CGROUP /boot/config-$(uname -r)
CONFIG_BLK_CGROUP=y
$ grep CONFIG_BLK_DEV_THROTTLING /boot/config-$(uname -r)
CONFIG_BLK_DEV_THROTTLING=y
2. Create a mount point for the cgroups interface and mount it
$ mkdir -p /cgroup/blkio
$ mount -t cgroup -o blkio none /cgroup/blkio
3. Enable non-root uids to write to /cgroup/blkio/blkio.throttle.write_bps_device (needed for integration with endit/tsmarchiver.pl)
$ chmod 666 /cgroup/blkio/blkio.throttle.write_bps_device
4. Determine major:minor of the device for which you want to throttle write bandwidth
$ stat -c "0x%t 0x%T" $(readlink -f /dev/mapper/cachevg-tapecachelv) | gawk --non-decimal-data '{ printf "%d:%d\n", $1, $2}'
253:6
5. Cap write bandwidth at, for example, 300 MiB/s
$ echo "253:6 314572800" > /cgroup/blkio/blkio.throttle.write_bps_device