Operations Tuning Linux

Background

The Linux kernel is not tuned for large-scale IO or high-bandwidth long-haul network traffic by default. This page documents tuning applicable to the various services run within NDGF.

dCache disk pools

TCP/network tuning

NDGF dCache pools do single-stream TCP transfers for pool-to-pool (p2p) copies, so we tune for that case and hope that it will also be good enough for end user transfers.

The bandwidth-delay product limits the transfer speed that can be achieved. The RTT over the LHC OPN between HPC2N and IJS is approximately 75 ms. To sustain 800 MB/s transfers, a 60 MB TCP window is needed (800 MB/s × 0.075 s = 60 MB), which is larger than most Linux distribution defaults allow for TCP auto tuning.

Check your current tuning with: sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem

You need to modify the net.ipv4.tcp_rmem and net.ipv4.tcp_wmem sysctls to allow Linux TCP autotuning to reach at least the needed TCP window size. Change only the rightmost (maximum) value. Note that the value set for the read buffer typically needs to be 50% larger than the wanted TCP window size due to bookkeeping overhead.

Below is an example /etc/sysctl.d/60-tcptuning.conf

# Typical defaults, listed by sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem:
#   net.ipv4.tcp_rmem = 4096        87380   6291456
#   net.ipv4.tcp_wmem = 4096        16384   4194304
#
# Tuning for 64MiB tcp windows, with similar tcp_rmem overhead as default:
net.ipv4.tcp_rmem = 4096        87380   100663296
net.ipv4.tcp_wmem = 4096        16384   67108864
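
These settings take effect for new connections once loaded. One way to apply them without rebooting (file name as in the example above; adjust to match yours):

 # Reload all sysctl configuration files, or load just the new one:
 sysctl --system
 sysctl -p /etc/sysctl.d/60-tcptuning.conf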

Enable BBR congestion control

Switching to the BBR congestion control algorithm is preferred in order to better handle links with spurious periods of high packet loss. Such loss causes the default Linux congestion control to overreact and recover with a slow ramp-up from a full stop. BBR is able to detect spurious loss and reacts in a more appropriate manner.

BBR is available in Linux distributions with a sufficiently recent kernel, for example Ubuntu 18.04, RHEL 8, Debian 10, and more.
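
You can check whether BBR is available on your kernel; on some distributions the tcp_bbr module needs to be loaded first:

 sysctl net.ipv4.tcp_available_congestion_control
 # If bbr is not listed, try loading the module:
 modprobe tcp_bbr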

Below is an example /etc/sysctl.d/60-net-tcp-bbr.conf

# Enable the Bottleneck Bandwidth and RTT (BBR) congestion control algorithm.
#
# Use fq qdisc for best performance.
# From https://github.com/google/bbr/blob/master/Documentation/bbr-quick-start.md:
#  Any qdisc will work, though "fq" performs better for highly-loaded servers.
#  (Note that TCP-level pacing was added in v4.13-rc1 but did not work well for
#  BBR until a fix was added in 4.20.)
net.core.default_qdisc=fq
#
# Enable BBR; despite the "ipv4" in the name this also applies to IPv6
net.ipv4.tcp_congestion_control=bbr
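
After applying, verify that BBR and fq are in effect. Note that net.core.default_qdisc only affects qdiscs created after the change, so existing interfaces may need a reboot (or a manual tc qdisc replace) to pick up fq:

 sysctl net.ipv4.tcp_congestion_control
 tc qdisc show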

vm.swappiness

It generally makes no sense to preemptively swap out (parts of) running applications (i.e. java/dCache) just to gain a little more disk cache.

Add the following as /etc/sysctl.d/60-vm-swappiness.conf

 # Tell the kernel to reduce the willingness to swap out applications when doing
 # IO in order to increase the amount of file system cache.
 # Default: 60
 # Note that it needs to be set to at least 1 with newer kernels to get the
 # same behaviour that 0 has with older kernels.
 vm.swappiness = 1

vm.min_free_kbytes

When the Linux kernel doesn't have enough free contiguous memory buffers for DMA etc., it makes small pauses for memory management that sometimes escalate into full hangs. The clearest symptom of this is "failed 0-order allocation" messages in the kernel log (sometimes with orders other than 0). The solution is to make vm.min_free_kbytes large enough; 0.5-1.0 seconds worth of network traffic might be a reasonable starting point.

Older distributions/kernels have a default tuned for GigE-class networking.

A suggested starting value for 10GigE-class networking is 524288 kbytes (512 MiB): 10 Gbit/s is roughly 1.25 GB/s, so 512 MiB corresponds to about 0.4 s of line-rate traffic. Use a higher value for faster networks or more complex network/storage setups.

Check your current value with sysctl vm.min_free_kbytes. If it's smaller than the suggested value you need to increase it.

To increase it, add the following as /etc/sysctl.d/60-vm-minfree.conf

vm.min_free_kbytes = 524288

vm.dirty

The defaults are quite large on modern machines with lots of RAM and wait far too long before starting writeout, which causes write storms and huge impacts on reads. The total lack of IO pacing makes this behaviour hurt more than it should.

The workaround is to reduce the vm.dirty settings, but care also needs to be taken that this doesn't cause files to be written in multiple fragments, as that reduces the efficiency of read-ahead. Aim for setting it as low as possible without causing a lot of fragmentation.

Proceed in the following way to find suitable tuning for this:

  • Start out with vm.dirty_background_bytes approximating 0.5 s of IO, i.e. the same value as vm.min_free_kbytes, but note bytes vs kbytes!
  • Set vm.dirty_bytes to 4*vm.dirty_background_bytes.
  • Verify that you're not getting overfragmented files (see the example session below):
    • Copy/write a big file (multiple GB), run sync, then check fragmentation with filefrag filename. If more than one extent is listed, increase vm.dirty* and try again.
    • Copy/write two big files (multiple GB) simultaneously, run sync, then check fragmentation with filefrag filename1 filename2. If more than a couple (single-digit) extents are listed for each file, increase vm.dirty* and try again.
  • Note that all sorts of factors can affect this, and that the 4:1 ratio between vm.dirty_bytes and vm.dirty_background_bytes is picked from experience and might need to be different.
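
A minimal example of the fragmentation check, assuming a hypothetical pool file system mounted at /pool (file name and size are only illustrative):

 # Write a 4 GiB test file, flush it, and inspect the extent layout
 dd if=/dev/zero of=/pool/fragtest bs=1M count=4096
 sync
 filefrag /pool/fragtest
 # Good result: "/pool/fragtest: 1 extent found"
 rm /pool/fragtest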

Add the tuning to /etc/sysctl.d/60-vm-dirty.conf:

 # Start writeout of pending writes earlier.
 # Limit size of pending writeouts.
 #
 # When to start writeout (default: 10% of system RAM)
 # 512 MiB
 vm.dirty_background_bytes = 536870912
 #
 # Limit of pending writes (default: 20% of system RAM)
 # 2GiB
 vm.dirty_bytes = 2147483648
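
While testing, the amount of outstanding dirty data can be watched via /proc/meminfo:

 # Dirty: data waiting for writeout; Writeback: data being written right now
 grep -E '^(Dirty|Writeback):' /proc/meminfo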

disk scheduler

We recommend using the deadline disk scheduler. Either hardcode it by passing elevator=deadline to the kernel on boot, or create a nifty udev rule file; see the tape pool tuning below for an example.

NOTE: Recent multiqueue kernels already use the mq-deadline scheduler by default. Check what's available/used before trying to change it.
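
The available and active schedulers can be listed through sysfs; the active one is shown in brackets (sda is a placeholder device name):

 cat /sys/block/sda/queue/scheduler
 # Example output on a multiqueue kernel:
 #   [mq-deadline] kyber bfq none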

dCache tape pools

The goal: to be able to stream to/from a tape drive at decent speed despite other IO happening. Linux/XFS is really bad at this; in the long run, ZFS on Linux is probably a better idea as it has some form of IO pacing.

NOTE: These tunings are in addition to the disk pool tunings.

If you find that writes are starving reads, consider lowering vm.dirty*_bytes further. vm.dirty_bytes should be significantly smaller than the write cache in your RAID controller.

Note, however, that you'll want vm.dirty* to be at least a few multiples of the RAID stripe size in order to be able to do full-stripe writes.

disk scheduler/readahead

We recommend using an udev rule to set the tuning attributes. This ensures that the tunings get reapplied if the device is recreated for some reason (device resets, changes, etc).

Add the following as /etc/udev/rules.d/99-iotuning-largedev.rules (verified on Ubuntu):

 # Note the ugly trick to only match large devices: glob on the size attribute
 # (10+ digits of 512-byte sectors, i.e. devices of roughly 512 GB and larger).
 # dCache tape pool optimized tunings for large block devices:
 # - Set scheduler to deadline
 # - 64MB read-ahead gives somewhat decent streaming reads while writing.
 # - Lowering nr_requests from default 128 improves read performance with a slight write penalty.
 # - Tell the IO scheduler it's OK to starve writes more times in favor of reads
 SUBSYSTEM=="block", ACTION=="add|change", ATTR{size}=="[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]*", ATTR{bdi/read_ahead_kb}="65536", ATTR{queue/scheduler}="deadline", ATTR{queue/nr_requests}="64", ATTR{queue/iosched/writes_starved}="10"
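
After installing the rule, reload udev and re-trigger the block devices so the tunings apply without a reboot (sda below is a placeholder):

 udevadm control --reload
 udevadm trigger --subsystem-match=block --action=change
 # Verify:
 cat /sys/block/sda/bdi/read_ahead_kb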

Write pool caveats

If you're still having problems keeping the tape drive streaming, investigate what the OS thinks the IO latencies are. The r_await column from iostat -dxm device-name 4 is a good starting point. Consider worst-case r_await × tape-drive speed to be the absolute minimum requirement for bdi/read_ahead_kb. As an example with made-up numbers: a worst-case r_await of 200 ms feeding a 300 MB/s drive gives 0.2 s × 300 MB/s = 60 MB, in line with the 64 MB read-ahead suggested above.

If you find that writes still starve reads, consider lowering queue/nr_requests even further. Note that changing nr_requests also affects read behaviour/performance. Increasing bdi/read_ahead_kb further might also help.

See also hardware specific issues (Smart Array controller queue depth for example).

Hardware specific issues

HP(E) Smart Array RAID controllers

Queue depth

Change the HP Smart Array controller queue depth from auto to 8 for more balanced read/write performance, e.g.:

 hpssacli ctrl slot=X modify queuedepth=8

This is suggested primarily for tape pools.
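
To verify the change afterwards, something like the following should display the queue depth (the grep pattern is an assumption about the output format):

 hpssacli ctrl slot=X show | grep -i 'queue depth'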

Write performance degradation due to overly large max_sectors_kb setting

Mainline Linux 3.18.22 merged https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=20d74bf29cfae86649bf1ec75038c79a9bc5010f which triggers a write performance regression on HP(E) Smart Array Controllers.

This change is included in Ubuntu Vivid 3.19.0 (and newer) kernels, and CentOS 7 3.10.0-327.36.3 (and newer) kernels.

While the controllers can handle big writes, the performance suffers in some conditions.

A workaround is to cap max_sectors_kb to the old value used before this change.

The following udev rule can be used for this:

 #
 # HP/HPE Smart Array controllers can't handle IOs much larger than 1 MiB with
 # good performance, although they can handle 4 MiB without error. The driver
 # advertises the stripe size as the maximum IO size, which can be substantially
 # larger for bulk IO setups.
 #
 # Install this as
 # /etc/udev/rules.d/90-smartarray-limitiosize.rules
 # to limit max_sectors_kb to 512, the default before Linux 3.18.22.
 #
 # hpsa driver, Px1x and newer
 SUBSYSTEM=="block", ACTION=="add|change", DRIVERS=="hpsa", ENV{DEVTYPE}=="disk", ATTR{queue/max_sectors_kb}="512"
 #
 # cciss driver, Px0x and older
 SUBSYSTEM=="block", ACTION=="add|change", DRIVERS=="cciss", ENV{DEVTYPE}=="disk", ATTR{queue/max_sectors_kb}="512"
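
After reloading udev (udevadm control --reload; udevadm trigger), the effective limit can be checked per device, e.g.:

 cat /sys/block/sda/queue/max_sectors_kb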


Complete Ubuntu bug report is available at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668557

Experiences

UIO

We tried myriads (*) of combinations of hardware and OS configurations to prevent read requests from being stalled by a continuous stream of writes. To date (20141128), the only really helpful trick has been to use cgroups to cap the bandwidth of writes at the block level. (*) It felt like myriads, at least ;)

Here is what you need to do:

1. Check if cgroups and io throttling are enabled in the kernel

$ grep CONFIG_BLK_CGROUP /boot/config-$(uname -r)
CONFIG_BLK_CGROUP=y
$ grep CONFIG_BLK_DEV_THROTTLING /boot/config-$(uname -r)
CONFIG_BLK_DEV_THROTTLING=y

2. Create a mount point for the cgroups interface and mount it

$ mkdir -p /cgroup/blkio
$ mount -t cgroup -o blkio none /cgroup/blkio
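
On many systemd-based distributions using cgroup v1, the blkio controller is already mounted under /sys/fs/cgroup/blkio; in that case you can skip the mount above and adjust the paths in the following steps accordingly:

$ ls /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device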

3. Enable non-root uids to write to /cgroup/blkio/blkio.throttle.write_bps_device (needed for integration with endit/tsmarchiver.pl)

$ chmod 666 /cgroup/blkio/blkio.throttle.write_bps_device

4. Determine major:minor of the device for which you want to throttle write bandwidth

$ stat -c "0x%t 0x%T" $(readlink -f /dev/mapper/cachevg-tapecachelv) | gawk --non-decimal-data '{ printf "%d:%d\n", $1, $2}'
253:6

5. Cap write bandwidth at, for example, 300 MiB/s

$ echo "253:6 314572800" > /cgroup/blkio/blkio.throttle.write_bps_device