E3DDS Deliverable 1 Requirements meeting
Time: Friday 2018-06-01 08:30 CET
Place: Kiruna Space Campus
Room: Mimasalen (upstairs from the reception, at the far end of the corridor to the left)
Useful links
- Project public wiki: https://wiki.neic.no/wiki/EISCAT_3D_Data_Solutions
- Project public homepage: https://neic.no/e3dds/
- Project plan: https://wiki.neic.no/wiki/EISCAT_3D_Data_Solutions#Documents
Projected Attendance
- Mattias Wadenstein
- John White
- János Nagy
- Dan Johan Jonsson
- Jussi Markkanen
- Ingemar Häggström
- Assar Westman
- Harri Hellgren
- Rikard Slapak
- Anders Tjulin
- Carl-Fredrik Enell
- Ari Lukkarinen (by Zoom)
Remote participation
- Ari Lukkarinen
Agenda
- Overview of current state of proposed system architecture (block diagrams) (Harri)
- Overview of the output of the first level beamformer (Harri)
- Including special cases
- Expected output of the second level beamformer (Assar?)
- Including special cases
- Bandwidths (John)
- What bandwidth is required from the start, and how does this affect hardware requirements?
- 30 MHz bursts during the first generation of hardware?
- Filter lengths (Assar)
- Ring buffer size (Ingemar)
- Ring buffer operation (Mattias?)
- event: data protection (lock?)
- event: raw data dump to disk
- Verification of sketches from previous meetings (Mattias)
- Make sure it is coherent with Harri's drawings
- What code exists? (Assar?)
- What shape is it in?
- Can we trust the benchmarks that have been run to accurately reflect needs of production code?
- What needs to be run on the site cluster, and how are the components connected? (Harri)
- ring buffer
- int->fp conversion
- FIRs
- summation
- radar controller
- lag profiling
- realtime analysis
- realtime visualisation
- anything else?
- What can be virtualised and what must run on bare metal (Mattias)
- Command and control (Harri)
- How does the second-level beamformer switch modes?
- How often, and with what latency, does a switchover need to happen?
- Filewriter nodes and site-local storage (Harri)
- WAN networking requirements (Carl-Fredrik)
- What needs to be shipped off-site? Where?
- Scheduler - site event list latency (Harri)
- Capabilities for storage that supports analysis (Mattias)
- Range from strict parallel POSIX to object store
- Validity of The Matlab Script (demo by Ingemar?)
- Have any assumptions been changed?
Minutes
Actual Attendance
- Mattias Wadenstein
- John White
- János Nagy
- Dan Johan Jonsson
- Jussi Markkanen
- Ingemar Häggström
- Assar Westman
- Harri Hellgren
- Rikard Slapak
- Anders Tjulin
- Carl-Fredrik Enell
- Ari Lukkarinen (by Zoom)
Overview of current state of proposed system architecture (block diagrams) (Harri)
- Presents the ring buffer and second-level beamformer architecture:
- RAM in each beamforming node forms the ring buffer;
- each subarray beam has one RAM segment
- event mark (lock)
- IDs -> operation is asynchronous with the radar controller
- For the second beamformer: floats vs. doubles?
- Float (single precision) is acceptable
- The output format of the first-level beamformer is still open.
- The expected input to the second beamformer is 16+16-bit integers (I and Q), plus index and timing information.
- See the whiteboard photos for a diagram.
- 100 beams is the limit of the second-level beamformer output.
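A minimal data-structure sketch of the per-beam ring buffer segment described above (the class name, block index and sizes are illustrative assumptions, not the actual design):
```python
# Sketch only: one RAM segment per subarray beam, holding raw 16+16-bit samples for
# two polarizations; block IDs/timestamps and event marks (locks) are kept separately.
import numpy as np

class BeamSegment:
    def __init__(self, n_samples):
        self.samples = np.zeros((n_samples, 2, 2), dtype=np.int16)  # sample x pol x (I, Q)
        self.head = 0          # next write position (circular)
        self.blocks = {}       # block ID -> (start, length, timestamp); kept out of the sample buffer
        self.locked = []       # regions protected by an event mark until dumped to disk

    def write_block(self, block, block_id, timestamp):
        """Write an (n, 2, 2) int16 block circularly (lock handling omitted in this sketch)."""
        n, size = len(block), len(self.samples)
        end = self.head + n
        if end <= size:
            self.samples[self.head:end] = block
        else:                  # wrap around to the start of the segment
            split = size - self.head
            self.samples[self.head:] = block[:split]
            self.samples[:end - size] = block[split:]
        self.blocks[block_id] = (self.head, n, timestamp)
        self.head = end % size

    def mark_event(self, start, end):
        """An event (e.g. a detection) locks a region so it is not overwritten before export."""
        self.locked.append((start, end))
```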
Bandwidths (John)
- What bandwidth is required from start and how does this affect hardware requirements?
- 30 MHz bursts during the first generation of hardware?
From the start the system needs to support 5 MHz operations, plus a 30 MHz data rate for at most one hour at a time. At Skibotn, 30 MHz is acceptable with 1 or 2 beams per FSRU, i.e. 30 MHz would be used only at Skibotn with its reduced beam count. Base the capacity calculation on 5 MHz continuous operation, with 30 MHz in bursts (fill the buffer, stop Tx and process). This impacts the ring buffer size.
The 30 MHz mode is mainly for astronomy and plasma line searches (including NEIALs with plasma lines). The system can switch to the high rate when NEIALs are detected, then run the FSRUs at 30 MHz for about an hour, overwriting the ring buffer and storing 10 s intervals for offline interferometry. Writing data out of the ring buffer to storage (disk) takes roughly 10x longer than writing in, which means data taking may need to be paused.
Quick data rate estimates (see also the detailed ring buffer calculation below and the sketch at the end of this section):
5 MHz, 10 beams per FSRU, 109 subarrays: (16+16) bits x 5 MHz x 10 beams x 2 polarizations x 109 = 0.35 Tbit/s
30 MHz, 1 beam per FSRU at Skibotn (119 subarrays): (16+16) bits x 30 MHz x 1 beam x 2 polarizations x 119 = 0.23 Tbit/s
30 MHz, 2 beams per FSRU at a remote site (109 subarrays): (16+16) bits x 30 MHz x 2 beams x 2 polarizations x 109 = 0.42 Tbit/s
The ring buffer size follows from (16+16) bits x 30 MSPS x 2 polarizations x 10 beams per subarray, times the number of subarrays and the buffer duration.
A ring buffer of 100-600 s at start-up; work with 100 s for now.
Roughly 5 to 24 TB.
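A small cross-check of these figures (a sketch only; it just multiplies out the sample format and subarray counts quoted above):
```python
# Rough input data rate per site, using the sample format from these notes
# (16-bit I + 16-bit Q per sample, 2 polarizations) and the quoted subarray counts.
def input_rate_tbit_s(sample_rate_hz, beams_per_fsru, subarrays,
                      bits_per_sample=16 + 16, polarizations=2):
    """Aggregate first-level beamformer output rate into the site cluster, in Tbit/s."""
    return bits_per_sample * sample_rate_hz * beams_per_fsru * polarizations * subarrays / 1e12

print(input_rate_tbit_s(5e6, 10, 109))   # ~0.35 Tbit/s: 5 MHz, receive site
print(input_rate_tbit_s(30e6, 1, 119))   # ~0.23 Tbit/s: 30 MHz, Skibotn, 1 beam/FSRU
print(input_rate_tbit_s(30e6, 2, 109))   # ~0.42 Tbit/s: 30 MHz, remote site, 2 beams/FSRU
```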
Filter lengths (Assar)
Still an open question: 20 to 60 or more taps, real-valued filters. 20 taps give the same timing resolution as the first-level beamformer.
Jussi and Assar will look into this.
The number of FSRU beams will be reconfigured as needed. A change in the number of beams requires a ring buffer flush.
Assar's 256-bit AVX software still needs to be tested. Assar will apply for a small SNIC project allocation at HPC2N; MW will help with this.
The code exists and can be adapted to Skylake AVX-512 and scaled up.
Once this is benchmarked we can give hardware recommendations based on the incremental steps in the number of CPUs needed. A factor of 8?
This will of course scale with the filter length required in each process.
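For illustration, a minimal sketch of the FIR step on one sample stream; the taps here are a placeholder window, not the real filter, and the production code is the C/AVX software mentioned above:
```python
# Illustrative only: applying a real-valued FIR (e.g. 20-60 taps) to one stream of
# complex 16+16-bit samples, as in the second-level beamformer filtering step.
import numpy as np

n_taps = 20
taps = np.hamming(n_taps)                      # placeholder real-valued coefficients
taps /= taps.sum()

raw = np.random.randint(-2**15, 2**15, size=(100000, 2), dtype=np.int16)  # I, Q columns
iq = raw[:, 0].astype(np.float32) + 1j * raw[:, 1].astype(np.float32)     # int -> fp conversion

filtered = np.convolve(iq, taps, mode="valid") # one FIR per subarray beam, per polarization
print(filtered.shape, filtered.dtype)
```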
Ring buffer size (Ingemar)
At Skibotn there are 119 subarrays (10 outliers)
At other receive sites there are 109 subarrays.
- Skibotn
At 5 MHz and 10 beams per FSRU:
(16+16) bits x 5 MHz x 10 beams x 2 polarizations x 119 subarrays = 0.38 Tbit/s
At 30 MHz and 2 beams per FSRU:
(16+16) bits x 30 MHz x 2 beams x 2 polarizations x 119 subarrays = 0.46 Tbit/s
At 30 MHz and 1 beam per FSRU:
(16+16) bits x 30 MHz x 1 beam x 2 polarizations x 119 subarrays = 0.23 Tbit/s
- Other receive sites (do they always need 10 beams from the subarrays? This assumption must be confirmed, as it is very important.)
At 5 MHz and 10 beams per FSRU:
(16+16) bits x 5 MHz x 10 beams x 2 polarizations x 109 subarrays = 0.35 Tbit/s
At 30 MHz and 10 beams per FSRU:
(16+16) bits x 30 MHz x 10 beams x 2 polarizations x 109 subarrays = 2.10 Tbit/s
The ring buffer should be sized for 100 s as a starting point; ring buffers of 600 s and 1200 s should also be included in the calculations in the document. Quick calculations show, for 5 MHz operation:
Skibotn: 4.75 TB (100 s), 28.5 TB (600 s), 57 TB (1200 s)
Other receive sites: 4.38 TB (100 s), 26.3 TB (600 s), 52.5 TB (1200 s)
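These sizes follow directly from the rates above; a one-line sizing sketch (decimal TB assumed):
```python
# Sketch: ring buffer size in TB for a given input rate and buffer duration.
# Uses the 5 MHz aggregate rates from these notes (0.38 Tbit/s Skibotn, 0.35 Tbit/s other sites).
def ring_buffer_tb(rate_tbit_s, seconds):
    return rate_tbit_s * seconds / 8           # Tbit -> TB (decimal)

for site, rate in [("Skibotn", 0.38), ("Other receive sites", 0.35)]:
    sizes = [ring_buffer_tb(rate, s) for s in (100, 600, 1200)]
    print(site, sizes)   # Skibotn: ~4.75, 28.5, 57 TB; other sites: ~4.4, ~26.3, ~52.5 TB
```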
Output from each second-level beamformer node at 5 MHz: (32+32) bits x 5 MSPS x 2 polarizations x 100 beams = 64 Gbit/s
Sizing a ring buffer for continuous 30 MHz operation does not make sense.
We will calculate the implications of the various ring buffer sizes on 30 MHz burst operation. The time to write out from the ring buffer to disk must also be considered.
Ring buffer operation
The ring buffer should hold only samples; headers and timestamps go to a separate table, preferably in cache.
- event: data protection (lock?)
The ring buffer server memory is populated evenly, and locks can be taken by both the writer and the second beamformer. Writing proceeds from top to bottom continuously.
- event: raw data dump to disk
Writing data out to disk is 4-8 times slower than the incoming data rate. A two-stage hybrid storage (fast SSD + slower disk) could speed up writing; this needs to be tested. Is the export to disk to be done by the file writer? The data is exported in "slices" across all ring buffer nodes.
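A rough illustration of what the 4-8x slower write-out means for burst operation (arithmetic only, using the buffer durations discussed above):
```python
# Dumping the whole ring buffer to disk takes roughly slowdown x buffer_duration,
# during which either data taking pauses or the buffer must not wrap over the locked region.
for buffer_s in (100, 600, 1200):
    for slowdown in (4, 8):
        print(f"{buffer_s} s buffer, {slowdown}x slower write-out -> ~{buffer_s * slowdown} s to dump")
```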
What code exists
- Second-level beamformer code, written in C. It will be put into CodeRefinery.
The FIR performance needs to be tested on current hardware. Application expert support is also needed for the FIR filter testing.
Depolarisation of the beams is done after the summation, forming 100 beams.
- Lag profiling code.
Lag profiling is calculated from the files output by the second-level beamformer.
Some lag profiling code also exists. RAM is not critical for lag profiling.
Question still to be answered:
What is the acceptable integration time for the lag profiles? Seconds? How many seconds?
Processes to run on site cluster
- Operation: ring buffer, second-level beamformer, file writers, radar controller.
- Realtime analysis: results sent out from the site
It probably needs to run on depolarised and lag-profiled data. This requires sorting, complex multiplication and summation, and can use FFT libraries (a sketch follows below). To develop the lag profiling, apply for application expert support at SNIC, CSC or Sigma (develop based on pseudocode).
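As an illustration of the lag profiling step (sorting, complex multiplication and summation), a minimal sketch of one common lagged-product formulation; the array shapes and names are assumptions for this example, not the project pseudocode:
```python
# Minimal lag-profiling sketch (illustrative, not the project code): lagged products
# along the sample (range) axis, integrated over pulses, for one beam.
import numpy as np

def lag_profiles(voltages, max_lag):
    """voltages: complex samples of shape (n_pulses, n_samples) for one beam,
    where the sample axis maps to range. Returns lagged products summed over
    pulses, shape (max_lag + 1, n_samples - max_lag)."""
    n_pulses, n_samples = voltages.shape
    n_gates = n_samples - max_lag
    acf = np.zeros((max_lag + 1, n_gates), dtype=np.complex128)
    for lag in range(max_lag + 1):
        # complex multiplication with the conjugate, then summation (integration) over pulses
        acf[lag] = (voltages[:, :n_gates] * np.conj(voltages[:, lag:lag + n_gates])).sum(axis=0)
    return acf

# Example with random data standing in for second-level beamformer output files
v = (np.random.randn(1000, 300) + 1j * np.random.randn(1000, 300)).astype(np.complex64)
print(lag_profiles(v, max_lag=10).shape)   # (11, 290)
```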
Computing requirements
If there is only 1 beam per subarray per output beam, then 10 output beams give 2 x 10 x 109 x 10 FIRs; this can be run on 4 nodes!
RAM: at 2 TB per node, >= 3 nodes are needed at the above sites; network throughput requires >= 5 nodes (count on 6 nodes plus spares).
Disk writer rate: >= 8 nodes (12 to be safe; equipped with many CPUs, these can also compute lag profiles).
Suggested configuration, e.g. 20 nodes: 8 RAM + FIR nodes (1U), 12 processing and file nodes (2U). No virtualisation / containers necessary.
Sum: ~0.5 MEUR per site.
Network fabric
- input: one backplane switch vs. splitting over "N" switches.
Backplane switch: 128 ports, 8U, 60 kUSD (+ fiber ports?)
Ethernet input if split: 4 x 60 kSEK for switches + ~120 x 8 kSEK for fiber ports.
- cluster: InfiniBand and Ethernet are equal in cost; InfiniBand is preferable since the summation step will be more reliable. An InfiniBand switch has of the order of 30 to 40 ports (36-port switches); the InfiniBand fabric can use copper cables.
Perhaps a 20 Gbit/s overlay in the national providers' networks is enough, rather than 100 Gbit/s. Need to find out the current situation from NORDUnet.
Procurement: for competitiveness, buy all three site clusters + some central archive in one bid.
- A new feature in dCache: event triggers.
These can run a chain of Apache Workflow events, e.g. every time a new file is stored.
Command and control
How to switch modes (i.e. second-level beamformer coefficients)? This should be data driven; use a database. What is the latency? A delay of 10 ms between data streams is the minimum acceptable. This is not a problem, as the Data Stream ID (DSID) is already waiting.
Have a distributed common DB for all sites. Should this be read-only at the sites or not?
(Mattias) Synchronous SQL is possible (PostgreSQL replication), as used in dCache for NT1 (large DBs with high update rates).
Ring buffer and beamformer control: regular restarts of the processes may be preferable.
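Purely as an illustration of the data-driven mode switch, a sketch keyed on the Data Stream ID; the table layout and the use of SQLite here are invented for the example (the meeting pointed towards a replicated PostgreSQL database instead):
```python
# Illustrative only: a data-driven mode switch keyed on a Data Stream ID (DSID).
# Table layout and names are invented for this sketch; the real system would use
# the replicated PostgreSQL database discussed above, not SQLite.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE bf_modes (dsid INTEGER PRIMARY KEY, description TEXT, coefficients BLOB)")
db.execute("INSERT INTO bf_modes VALUES (?, ?, ?)", (42, "5 MHz, 10 beams/FSRU", b"\x00" * 8))

def coefficients_for(dsid):
    """Second-level beamformer looks up its coefficient set for the incoming data stream."""
    row = db.execute("SELECT coefficients FROM bf_modes WHERE dsid = ?", (dsid,)).fetchone()
    return row[0] if row else None

print(len(coefficients_for(42)))   # the beamformer switches mode when the DSID changes
```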
WAN requirements
- Level 2 data is the normal archive product.
We must update the data rate calculations for the off-site data export, i.e. what will be the data rate to storage? 2 PB/y, or not? (See the sketch after this list.)
- Storage requirements
- Cold storage?
- Hot storage?
- Derived data: should this be recalculated rather than stored?
- Where should the first checksums of the data be calculated? On site.
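For reference, converting the nominal archive rate discussed above into an average WAN bandwidth (simple arithmetic; peaks and retransfers would come on top):
```python
# Quick conversion of the nominal archive rate to an average WAN bandwidth
# (illustrative arithmetic; 2 PB/y is the figure quoted in these notes).
PB = 1e15
seconds_per_year = 365.25 * 24 * 3600
avg_gbit_s = 2 * PB * 8 / seconds_per_year / 1e9
print(f"2 PB/y ~ {avg_gbit_s:.2f} Gbit/s average")   # ~0.51 Gbit/s sustained
```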
File storage
An average of 2 PB/y to be written to each Data Centre was "decided", i.e. this is our nominal rate.
Comment (AL): object storage is still an option at those rates.
There are also several file system options, up to parallel POSIX (Lustre etc.). Some file systems do not allow reopening closed files for modification (e.g. dCache, Hadoop FS).
- How to run analysis on top of the storage?
- Which storage types does DIRAC support?
- Warm storage, cold storage and cold storage readback?
- Possibly separate the two DC sites: one for realtime access and one for backup
- Different hardware and software too?
- Check the DC options once more: what is available where?
- Update operational profiling questions.
Next steps
- Deliverable 1 document: site cluster architecture etc
- John will start the document with as much information as possible from this and previous meetings.
- Harri will deliver the diagrams.
- All will caption these diagrams.
- Test runs at HPC2N:
- a file writer node available on Kebnekaise
- SNIC allocations needed
- Advanced user support from one of the countries
Whiteboard photos
Dinner
Restaurant Spis, 19:30 CET