# System

* **Processor:** Fujitsu A64FX FX1000
* **Base frequency:** ??
* **Number of sockets:** 1
* **Number of memory domains per socket:** 4
* **Number of cores per socket:** 48
* **Number of HWThreads per core:** 1
* **[MachineState](https://github.com/RRZE-HPC/MachineState) output:** NA

# Tool chain

```
+----------+-----------------------------+
| Compiler | fcc (FCC)                   |
|----------|-----------------------------|
| Version  | fcc (FCC) 4.4.0a 20210127   |
+----------+-----------------------------+
```

Optimizing flags: ```-Kfast -Kocl -Koptmsg=2 -Nlst=t -Kzfill -Kprefetch_line=6 -Kopenmp```

# Results

All results are in ```GB/s```.

Summary results:

```
+-----------------------------------------------+
| Single core   | 92.72 (SDaxpy)                |
| Memory domain | 230.03 (Sum with 8 cores)     |
| Socket        | 870.95 (Sum with 8 cores)     |
| Node          | 870.95 (Sum with 8 cores)     |
+-----------------------------------------------+
```

Results for scaling within a memory domain:

```
#nt    Init     Sum    Copy  Update   Triad   Daxpy  STriad  SDaxpy
  1   13.24   57.29   26.15   80.14   38.22   89.81   49.41   92.72
  2   26.51  117.00   52.39  132.17   76.15  159.29   99.31  180.45
  3   39.79  168.38   78.55  171.95  114.22  203.63  148.67  214.14
  4   53.11  203.77  104.60  198.39  152.39  213.95  196.94  215.06
  5   66.48  218.87  130.04  208.79  190.06  213.73  208.89  215.36
  6   80.00  226.30  155.77  213.38  207.44  213.69  212.09  214.96
  7   85.95  222.49  164.64  208.16  207.38  212.42  211.67  214.28
  8  107.11  230.03  202.29  214.65  211.58  212.13  212.99  214.17
  9  110.59  228.35  203.24  210.91  205.07  211.34  209.35  213.26
 10  133.32  228.69  205.94  212.66  210.84  211.85  212.49  213.68
 11  131.18  227.63  208.70  212.26  210.38  211.58  212.24  213.37
 12  133.24  227.19  208.84  209.98  210.47  211.67  211.93  213.24
```

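The saturation behaviour within one memory domain can be quantified with a short script. The following is a minimal sketch and not part of the benchmark itself: it copies the Sum and Triad columns from the table above, computes the parallel efficiency `bw(n) / (n * bw(1))`, and reports the smallest core count that reaches 95% of the peak bandwidth. The function names and the 95% threshold are assumptions chosen for illustration only.

```python
# Bandwidth in GB/s for 1..12 cores in one memory domain,
# copied from the Sum and Triad columns of the table above.
sum_bw = [57.29, 117.00, 168.38, 203.77, 218.87, 226.30,
          222.49, 230.03, 228.35, 228.69, 227.63, 227.19]
triad_bw = [38.22, 76.15, 114.22, 152.39, 190.06, 207.44,
            207.38, 211.58, 205.07, 210.84, 210.38, 210.47]


def parallel_efficiency(bw):
    """Return bw(n) / (n * bw(1)) for n = 1..len(bw)."""
    return [b / (n * bw[0]) for n, b in enumerate(bw, start=1)]


def saturation_point(bw, threshold=0.95):
    """Smallest core count reaching `threshold` of the peak bandwidth.

    The 95% default is an illustrative assumption, not a benchmark setting.
    """
    peak = max(bw)
    return next(n for n, b in enumerate(bw, start=1) if b >= threshold * peak)


for name, bw in (("Sum", sum_bw), ("Triad", triad_bw)):
    eff = parallel_efficiency(bw)
    print(f"{name}: peak {max(bw):.2f} GB/s, "
          f"saturates at {saturation_point(bw)} cores, "
          f"efficiency at 12 cores {eff[-1]:.2f}")
```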
Results for scaling across memory domains. Each table has one row per number of cores used per memory domain and one column per number of memory domains used (nm).

Init:

```
#nm       1       2       3       4
  1   13.24   26.49   39.73   52.96
  2   26.51   53.01   79.49  105.94
  3   39.79   79.53  110.35  158.88
  4   53.11  106.18  159.09  211.97
  5   66.48  132.81  199.04  265.06
  6   80.00  159.83  219.82  319.11
  7   85.95  171.35  256.64  341.38
  8  107.11  213.79  320.11  425.84
  9  110.59  220.76  330.20  437.92
 10  133.32  265.29  396.50  525.45
 11  131.18  259.50  383.31  519.40
 12  133.24  264.95  393.73  527.38
```

Sum:

```
#nm       1       2       3       4
  1   57.29  114.50  170.48  226.70
  2  117.00  231.74  344.43  453.07
  3  168.38  333.02  418.70  646.18
  4  203.77  400.74  592.29  779.26
  5  218.87  430.32  633.73  828.78
  6  226.30  443.33  627.61  865.72
  7  222.49  435.47  645.35  847.42
  8  230.03  451.53  665.44  870.95
  9  228.35  449.76  665.32  864.43
 10  228.69  448.83  661.70  864.71
 11  227.63  446.43  659.02  859.17
 12  227.19  445.79  655.94  864.56
```

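The Socket and Node entries of the summary correspond to the maximum of this Sum table, which is reached with 8 cores per memory domain on all 4 memory domains (32 cores in total). The following minimal sketch, with hypothetical function names, looks up that maximum and relates it to ideal scaling, that is, the number of domains times the single-domain bandwidth at the same core count. It is an illustration of how to read the table, not the way the benchmark produces its summary.

```python
# Sum bandwidth in GB/s, copied from the table above.
# Row index = cores per memory domain (1..12),
# column index = number of memory domains used (1..4).
SUM_BW = [
    [57.29, 114.50, 170.48, 226.70],
    [117.00, 231.74, 344.43, 453.07],
    [168.38, 333.02, 418.70, 646.18],
    [203.77, 400.74, 592.29, 779.26],
    [218.87, 430.32, 633.73, 828.78],
    [226.30, 443.33, 627.61, 865.72],
    [222.49, 435.47, 645.35, 847.42],
    [230.03, 451.53, 665.44, 870.95],
    [228.35, 449.76, 665.32, 864.43],
    [228.69, 448.83, 661.70, 864.71],
    [227.63, 446.43, 659.02, 859.17],
    [227.19, 445.79, 655.94, 864.56],
]


def best_configuration(table):
    """Return (bandwidth, cores_per_domain, domains) for the table maximum."""
    return max(
        (bw, cores, domains)
        for cores, row in enumerate(table, start=1)
        for domains, bw in enumerate(row, start=1)
    )


def domain_scaling_efficiency(table, cores, domains):
    """Bandwidth relative to `domains` times the single-domain bandwidth."""
    row = table[cores - 1]
    return row[domains - 1] / (domains * row[0])


bw, cores, domains = best_configuration(SUM_BW)
eff = domain_scaling_efficiency(SUM_BW, cores, domains)
print(f"max {bw:.2f} GB/s with {cores} cores/domain on {domains} domains "
      f"({cores * domains} cores total), scaling efficiency {eff:.3f}")
```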
Copy:

```
#nm       1       2       3       4
  1   26.15   52.40   78.60  104.78
  2   52.39  104.74  157.08  209.39
  3   78.55  157.00  215.78  313.31
  4  104.60  209.10  312.30  416.26
  5  130.04  259.82  389.30  518.41
  6  155.77  311.05  423.25  619.02
  7  164.64  329.27  492.46  655.47
  8  202.29  402.55  599.54  797.99
  9  203.24  403.31  601.76  796.74
 10  205.94  411.32  616.34  817.48
 11  208.70  415.97  622.32  827.97
 12  208.84  416.56  615.35  817.13
```

Update:

```
#nm       1       2       3       4
  1   80.14  160.16  240.20  319.72
  2  132.17  263.95  395.49  526.63
  3  171.95  343.19  432.03  683.28
  4  198.39  396.26  592.45  790.80
  5  208.79  416.82  623.73  829.48
  6  213.38  425.51  598.87  850.88
  7  208.16  414.01  623.04  827.46
  8  214.65  427.37  639.56  849.91
  9  210.91  422.03  633.41  837.44
 10  212.66  423.60  635.51  845.11
 11  212.26  422.42  632.77  840.07
 12  209.98  419.61  630.04  839.04
```

Triad:

```
#nm       1       2       3       4
  1   38.22   76.68  114.99  153.31
  2   76.15  152.17  228.30  304.08
  3  114.22  228.18  304.35  455.71
  4  152.39  304.20  455.11  606.28
  5  190.06  378.81  567.17  753.18
  6  207.44  412.90  592.12  820.72
  7  207.38  412.86  616.03  812.76
  8  211.58  420.05  613.30  813.03
  9  205.07  408.38  613.02  810.70
 10  210.84  418.52  626.45  831.34
 11  210.38  418.01  624.82  825.94
 12  210.47  417.65  621.49  824.72
```

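Most Triad entries stay close to ideal scaling across memory domains, but individual configurations fall short. The sketch below is one possible way to spot them: it flags every entry that is more than 10% below the ideal value, taken as the number of domains times the single-domain bandwidth at the same core count. The function name and the 10% tolerance are assumptions for illustration. With the data above, the only flagged entry is 3 cores per domain on 3 domains (304.35 GB/s against an ideal of 3 × 114.22 = 342.66 GB/s).

```python
# Triad bandwidth in GB/s, copied from the table above.
# Row index = cores per memory domain (1..12),
# column index = number of memory domains used (1..4).
TRIAD_BW = [
    [38.22, 76.68, 114.99, 153.31],
    [76.15, 152.17, 228.30, 304.08],
    [114.22, 228.18, 304.35, 455.71],
    [152.39, 304.20, 455.11, 606.28],
    [190.06, 378.81, 567.17, 753.18],
    [207.44, 412.90, 592.12, 820.72],
    [207.38, 412.86, 616.03, 812.76],
    [211.58, 420.05, 613.30, 813.03],
    [205.07, 408.38, 613.02, 810.70],
    [210.84, 418.52, 626.45, 831.34],
    [210.38, 418.01, 624.82, 825.94],
    [210.47, 417.65, 621.49, 824.72],
]


def flag_dips(table, tolerance=0.10):
    """Return (cores_per_domain, domains) pairs below ideal scaling.

    An entry is flagged when it is lower than (1 - tolerance) times
    `domains` times the single-domain value of the same row.
    The 10% default tolerance is an illustrative assumption.
    """
    flagged = []
    for cores, row in enumerate(table, start=1):
        for domains, bw in enumerate(row, start=1):
            ideal = domains * row[0]
            if bw < (1.0 - tolerance) * ideal:
                flagged.append((cores, domains))
    return flagged


print(flag_dips(TRIAD_BW))
```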
# Scaling

Memory bandwidth scaling within one memory domain:



The following plots illustrate the performance scaling across multiple memory domains for different numbers of cores per memory domain.

Memory bandwidth scaling across memory domains for Init:



Memory bandwidth scaling across memory domains for Sum:



Memory bandwidth scaling across memory domains for Copy:



Memory bandwidth scaling across memory domains for Triad:

