DCache Pool Hardware
A dCache pool needs a filesystem for data storage, networking to access the data and some CPU and RAM for the dCache pool software.
For tuning see: Operations Tuning Linux
Introduction
This page is intended to help with procurement of dCache pool hardware. Other useful resources are the chat and the sysadmins mailing list, where you can get useful feedback before buying hardware.
Disk pools
- Server and storage pool size:
- These guidelines are valid for pools with 50 TB or more usable space.
- Provide one filesystem with all the storage on the node for dCache.
- More than 1000TB per server is not recommended today, due to the risk of too large a share of the data becoming unavailable if a single server breaks.
- Warranty/service/equivalent:
- Coverage should include physical hardware as well as related BIOS/Firmware/equivalent.
- Servers should be covered during the production period.
- The suggested lifetime for pools is 4-5 years, with the recommendation to always buy with at least a 5-year warranty/service.
- Replacement pools should be provided before this expires.
- Disk/LUN/raidset layout:
- A separate disk device for the OS
- Examples:
- Boot grade flash.
- Carving out a ~50G device from a pool raidset, since OS disk accesses have low performance impact.
- Storage redundancy with double-parity RAID (RAID6 or equivalent), preferably hardware based with a fairly large non-volatile write cache (for procurements, insert something that will give you the beefiest "standard" raid controller).
- We don't want parity raidsets wider than 16 data drives.
- Wider raidsets mean long rebuild times and hurt IOPS for some workloads.
- This means a max pool size of 16*16TB = 256TB with 16TB drives, using a single 18-drive double-parity (RAID6) raidset.
- If you have more disks than this, split them into multiple parity raidsets that get striped together.
- A server with 24 disks should end up as two 12-disk RAID6s in a RAID60.
- 3.5inch 7kRPM "Midline/Nearline" HDDs are usually good enough, i.e. HDDs with a workload rating of 500TB/year or more.
- SAS disks are strongly recommended
- SATA signaling through a SAS expander has a nasty failure mode where a broken SATA HDD can disable an entire SAS lane.
- Optimal SATA performance via a SAS expander requires load balancing to be correctly implemented, which isn't always the case...
- If you're not using a hardware RAID controller with write cache (e.g. software RAID, CEPH, etc.), consider adding a small redundant flash-based storage device for metadata and the filesystem journal. Discuss with NDGF operators.
- Storage read and write bandwidth:
- Minimum of 1250MB/s (10GigE-speed) per pool (to avoid a single incoming data stream overloading the IO subsystem).
- Minimum 125MB/s per 10 TB of usable storage. For example, a 200TB server should be able to handle at minimum 2500MB/s of read or write dCache file IO (see the sizing sketch at the end of this section).
- Specify this as something sales can't get wrong, for example naive big-block sequential performance.
- If there are shared components in the storage solution (for example SAN or CEPH based solutions), these bandwidth numbers should be treated as dedicated/guaranteed bandwidth (i.e. if all hosts sharing the components go all-out, they should still meet these bandwidth requirements; this can happen during checksumming of an entire site or when moving data between pools within a site).
- Network:
- 10G network minimum
- Aim for at least 1Gbit/s per 10TB of usable storage, require at least 1Gbit/s per 20TB of usable storage.
- This, together with the storage bandwidth, is essential for evacuating a broken pool node within a reasonable time.
- If you have multiple network switches, don't insert bottlenecks by having just one uplink from each switch.
- Ensure enough connectivity from computer to storage: Having large storage devices behind a low-bandwidth link/controller will limit the transfer rate.
- CPU:
- Minimum nominal clock 2GHz.
- The sum of nominal clock (cores * clock) should be at least 8 coreGHz.
- Estimate at least 0.7 coreGHz per 10 TB usable space.
- Higher turbo frequency is better given equally priced options.
- RAM:
- ECC (or better) memory required.
- Minimum 16GB RAM.
- Estimate at least 2GB per 10TB usable space.
- Memory should always be configured to use all available memory channels for optimal performance.
- More memory for cache doesn't hurt.
A common hardware platform that usually fits these requirements is a 2U machine with 12-20+ x 3.5in HDD slots.
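To make these rules of thumb easier to apply, here is a minimal sizing sketch in Python. The thresholds encode the guidelines above (storage bandwidth, network, CPU and RAM per usable TB, and the 16-data-drive raidset limit); the function names are illustrative only, not part of any dCache or NDGF tooling.

```python
# Minimal sizing sketch for a dCache disk pool, encoding the rules of thumb above.
# The function and variable names are illustrative, not any official tooling.

def disk_pool_requirements(usable_tb: float) -> dict:
    """Minimum requirements for a pool with usable_tb TB of usable space."""
    return {
        # Storage bandwidth: at least 1250 MB/s, and 125 MB/s per 10 TB usable.
        "storage_mb_per_s": max(1250, 125 * usable_tb / 10),
        # Network: aim for 1 Gbit/s per 10 TB, require 1 Gbit/s per 20 TB, 10G minimum.
        "network_gbit_aim": max(10, usable_tb / 10),
        "network_gbit_min": max(10, usable_tb / 20),
        # CPU: at least 8 coreGHz, and 0.7 coreGHz per 10 TB usable.
        "core_ghz": max(8, 0.7 * usable_tb / 10),
        # RAM: at least 16 GB, and 2 GB per 10 TB usable.
        "ram_gb": max(16, 2 * usable_tb / 10),
    }

def raid6_usable_tb(data_drives: int, drive_tb: float) -> float:
    """Usable capacity of one double-parity raidset (parity drives excluded)."""
    assert data_drives <= 16, "parity raidsets wider than 16 data drives are discouraged"
    return data_drives * drive_tb

# A 200TB server needs >= 2500 MB/s storage bandwidth, 14 coreGHz and 40 GB RAM.
print(disk_pool_requirements(200))
# Max single-raidset pool size with 16TB drives: 16 data drives + 2 parity = 256TB.
print(raid6_usable_tb(16, 16))
```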
Tape pools
At the moment (Run 3) the target speed for both reading and writing should be at least 1 GByte/s in each direction, with the expectation that read speeds will drop somewhat from that with sparse reading. Preferably this should be achievable at the same time, but if you cannot afford a sufficient number of tape drives for this, a total of 1.5 GByte/s read+write is probably fine, although it will likely build up backlogs (hopefully simultaneous reading and writing won't go on for very long). It is expected that the requirements will increase by at least a factor of two when HL-LHC comes online. See also DCache TSM interface for more information regarding connecting dCache to TSM.
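As a rough illustration, the sketch below (Python, illustrative function name) classifies a tape pool's throughput against these Run 3 targets.

```python
# Rough check against the Run 3 tape pool throughput targets described above.
# The function name is illustrative; the thresholds are the targets from this section.

def run3_target_status(read_gb_per_s: float, write_gb_per_s: float) -> str:
    if read_gb_per_s >= 1.0 and write_gb_per_s >= 1.0:
        return "meets the preferred target: 1 GB/s in each direction simultaneously"
    if read_gb_per_s + write_gb_per_s >= 1.5:
        return "meets the 1.5 GB/s combined fallback, expect occasional backlogs"
    return "below target"

print(run3_target_status(1.0, 1.0))  # preferred target
print(run3_target_status(0.8, 0.8))  # combined fallback (1.6 GB/s total)
```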
Background information
The main purpose of these recommendations is to make it easier for sites to figure out which hardware to procure for dCache tape pools, providing the required capability while still being cost effective.
One of the harder parts is to envision what will happen during the tape pool lifetime, and this is tightly coupled to the evolution of tape technology.
Short-term we can usually get hints on what the next-generation products will be capable of. For example, the latest product roadmap update states that there will be a speed increase of a factor of 2 (i.e. from 400-ish MB/s to at least 800-ish MB/s) when next-generation tape heads with twice the number of tracks are introduced sometime in the 2021 timeframe. This has since been revised to a factor of 1.5, i.e. 650-ish MB/s; we can only know for sure when the products are announced.
Long-term tape technology transfer rates can be estimated by looking at the INSIC tape technology roadmap, available at http://www.insic.org/insic-application-systems-and-technology-roadmap/ with the latest as of this writing being the 2019-2029 roadmaps. For the purpose of these recommendations the interesting figures are the Maximum total streaming drive data rate and the Minimum streaming drive data rate.
For optimum performance, the tape pool should be able to handle the Maximum total streaming drive data rate times the number of allocated tape drives, for the technology envisioned to be installed during the tape pool lifetime.
Under no circumstance shall the tape pool performance fall below the Minimum streaming drive data rate times the number of allocated tape drives for the installed technology, as this would require drives to constantly backhitch, causing high tape wear and very bad overall performance.
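As a rough illustration of this constraint, the sketch below (Python, illustrative names) compares a pool's bandwidth with the allocated drives; the per-drive rates in the example are assumptions to be replaced with the roadmap figures for the technology you actually expect to install.

```python
# Sketch of the streaming-rate constraint above. Replace the example per-drive
# rates with the INSIC roadmap figures for the technology you plan to install.

def tape_pool_bandwidth_status(pool_mb_per_s: float, drives: int,
                               min_stream_mb_per_s: float,
                               max_stream_mb_per_s: float) -> str:
    if pool_mb_per_s < drives * min_stream_mb_per_s:
        return "unacceptable: drives would constantly backhitch"
    if pool_mb_per_s < drives * max_stream_mb_per_s:
        return "acceptable, but drives won't always stream at full rate"
    return "optimal: all drives can stream at the maximum rate"

# Example with assumed figures: 3 drives at 400-800 MB/s each.
print(tape_pool_bandwidth_status(2400, 3, 400, 800))  # optimal
print(tape_pool_bandwidth_status(1000, 3, 400, 800))  # unacceptable
```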
Sizing guidelines
Focused on sites using IBM Spectrum Protect (TSM), but most points are valid everywhere. In addition, several aspects of the disk pool hardware discussion may apply to tape pools as well; please also read those and take them into account.
These sizing guidelines are based on discussions at the 2019 autumn All Hands Meeting in Bern, with the following assumptions:
- A tape pool needs to be able to handle three tape drives (this might vary between sites, depends on the allocation/pledge).
- Per-tapedrive bandwidth will be 800 MB/s or more within the tape pool lifetime.
- The tape carousel initiative will result in more data being read from pools.
- Solving this problem with HDD technology is more expensive than using SSDs.
With these assumptions in mind, this boils down to the following recommendations. Note that budget constraints might require compromises; discuss with NDGF operators in what areas to cut corners.
- Dedicated read and write pool machines:
- Data availability requirements mean we can't wait more than a couple of days for a read pool to be fixed. Reconfiguring a write pool into a read pool is fairly easy compared to setting up a new read pool from scratch, hooking it up to the tape library, etc.
- Note that you probably want these to be separate physical machines; housing everything in a single blade cabinet is not of much help if that cabinet breaks down due to firmware/hardware/foo.
- We really recommend this to be a dedicated physical self-contained machine. Tape needs streaming IO, and a virtualized setup has multiple issues:
- A tendency to add intermittent latency and other hard-to-debug performance issues.
- Shared infrastructure makes it hard to guarantee performance/bandwidth; it's hard enough on dedicated hardware!
- When licensing TSM using the PVU (per-core) model, it's the physical hardware that needs to be licensed. For virtualized hosts this means the entire virtualization cluster, unless you are doing sub-capacity reporting, which requires a special agreement.
- You Have Been Warned!
- Storage pool size:
- Min 10 TB, more is better.
- More is nice for caching use, but not strictly needed for tape staging.
- Note however that a larger size for the write pool means that you can manage a longer tape library outage.
- The ideal is to size this to handle at least a weekend's worth of data.
- The write pool should have good redundancy; RAID5 is deemed sufficient when using SSDs.
- Using HBAs and software RAID might be a feasible option here. RAID FastPath features usually require the RAID read/write cache to be disabled, so the only real difference is where the RAID logic happens.
- Bandwidth:
- Minimum 6250 MB/s storage bandwidth.
- This is based on 25GigE network connectivity: 3125 MB/s in each direction (to/from tape at the same time as to/from end users); see the sizing sketch at the end of this section.
- Beware of:
- Sales people reducing the number of storage devices without checking bandwidth requirements (for example using half as many devices with double the size, thus halving the bandwidth).
- NVMe device bandwidth vs slot bandwidth (a common example is x4 devices in x1 slots, which limits bandwidth).
- Congestion points in the network and in shared storage solutions!
- Note that the TSM server needs to be able to handle this:
- The full bandwidth doesn't have to be in place from day 1, but needs to be catered for when tape technology is upgraded since we want to use the performance benefits.
- Confer with your TSM admin on whether you can do it using standard TCP/IP networking via the TSM server, or whether you need to do LAN-free with direct connectivity (usually using FibreChannel) to the tape drives.
- Network:
- 25 Gigabit networking.
- Needed to be able to support multiple tape drives at 800 MB/s each.
- The full bandwidth doesn't have to be in place from day 1; running a 25G interface at 10G is OK as long as it's not a bottleneck.
- Storage:
- SSDs are the most cost-effective way of meeting the bandwidth requirements.
- Mixed-use endurance rating, i.e. 3 Drive Writes Per Day (DWPD) or better.
- The DWPD requirement might vary depending on size and number of storage units, the interesting metric from an NDGF point of view is the total TeraBytes Written (TBW) for the storage area.
- CPU:
- At least one coreGHz per Gbps of expected network connectivity, i.e. at least 25 coreGHz for 25 Gbps networking.
- It's usually better to have one socket rather than multiple sockets; the link between sockets is weaker than in-socket memory access.
- TSM is licensed per core when using PVU licensing, so it's good to keep the core-count down wrt license costs.
- Memory:
- Min 32 GB with the option to expand to 64 GB if the need arises.
- Optimize for memory bandwidth (highest speed supported by CPU, evenly distributed among all available memory channels, etc).
2U machines with enough storage slots are a common solution; you typically need 10-16 slots for the most cost-effective configuration that fulfils the requirements.
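The sketch below (Python, illustrative names and example inputs) ties the figures from this section together: storage bandwidth derived from 25GigE in both directions, a network check for the allocated tape drives, the coreGHz-per-Gbps rule, the weekend-sized write buffer, and a DWPD-to-TBW conversion. The example average ingest rate and the 5-year endurance horizon are assumptions, not site requirements.

```python
# Illustrative sizing helpers for a tape pool, encoding the figures above.
# Adjust the inputs (network speed, drive count, per-drive rate) to your allocation.

GBIT_TO_MB_PER_S = 125  # 1 Gbit/s is roughly 125 MB/s

def tape_pool_requirements(network_gbit: int = 25, drives: int = 3,
                           drive_mb_per_s: int = 800) -> dict:
    net_mb = network_gbit * GBIT_TO_MB_PER_S  # 25GigE -> 3125 MB/s per direction
    return {
        # Storage must serve tape and end-user traffic at the same time.
        "storage_mb_per_s": 2 * net_mb,                        # 6250 MB/s for 25GigE
        # The network must not be the bottleneck for the allocated drives.
        "network_ok_for_drives": net_mb >= drives * drive_mb_per_s,
        # CPU: at least one coreGHz per Gbit/s of network connectivity.
        "core_ghz": network_gbit,                              # 25 coreGHz for 25 Gbps
    }

def write_pool_size_tb(avg_ingest_mb_per_s: float, hours: float = 72) -> float:
    """Buffer needed to ride out a tape library outage of `hours` (e.g. a weekend)."""
    return avg_ingest_mb_per_s * 3600 * hours / 1e6            # MB -> TB

def tbw(dwpd: float, capacity_tb: float, years: float = 5) -> float:
    """Total TeraBytes Written implied by a DWPD rating over the device lifetime."""
    return dwpd * capacity_tb * 365 * years

print(tape_pool_requirements())   # 6250 MB/s storage, network OK for 3 drives, 25 coreGHz
print(write_pool_size_tb(100))    # ~26 TB for a 100 MB/s average ingest over 72 h
print(tbw(3, 1.92))               # ~10500 TBW for a 1.92 TB mixed-use SSD
```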
Recent procurements
See Tier1_Hardware_Procurements.
Proposed changes
Proposed changes go into the Discussion page. Please do so for every problematic bit you encounter when using the page. When you have a set of changes that you would like discussed for folding into the proper page, please send a note to maswan@ndgf.org and we'll schedule a chat meeting with all involved stakeholders.