|
## Changes with Cascadelake SP/AP
|
When releasing the Intel Cascadelake SP/AP chips, Intel published two new events: IDI_MISC.WB_UPGRADE (description: "Counts number of cache lines that are allocated and written back to L3 with the intention that they are more likely to be reused shortly") and IDI_MISC.WB_DOWNGRADE (description: "Counts number of cache lines that are dropped and not written back to L3 as they are deemed to be less likely to be reused shortly"). The full list of Cascadelake SP/AP events is available [here](https://download.01.org/perfmon/CLX/cascadelakex_core_v1.04.json). One problem is that these events are already mentioned in the errata section of the [specification update document for Intel Cascadelake SP/AP](https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/2nd-gen-xeon-scalable-spec-update.pdf):
|
> CLX3. IDI_MISC Performance Monitoring Events May be Inaccurate<br>
> Problem: The IDI_MISC.WB_UPGRADE and IDI_MISC.WB_DOWNGRADE performance monitoring events (Event FEH; UMask 02H and 04H) counts cache lines evicted from the L2 cache. Due to this erratum, the per logical processor count may be incorrect when both logical processors on the same physical core are active. The aggregate count of both logical processors is not affected by this erratum.<br>
> Implication: IDI_MISC performance monitoring events may be inaccurate.<br>
> Workaround: None identified.<br>
> Status: No fix.<br>
|
|
|
|
|
|
The Intel Cascadelake SP microarchitecture is the follow-up to Intel Skylake SP. In fact, Intel published these events for Intel Skylake SP a long time ago already, but I haven't seen them in the wild: they are not listed in the Intel forum post, and other hardware performance monitoring software does not use them either. As an experiment, I added the two events to the Intel Skylake SP event file and ran some benchmarks using a single core to get accurate results, with the same configuration as in the above comparison between Intel Broadwell EP and Intel Skylake SP.
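For reference, this is roughly what the addition to LIKWID's Skylake SP event file (presumably `src/includes/perfmon_skylakeX_events.txt`) looks like. The event code FEH and the umasks 02H/04H are taken from the erratum quoted above; the `EVENT_`/`UMASK_` lines follow LIKWID's usual event file scheme:

```
# IDI_MISC events from the Cascadelake SP/AP event list (Event FEH, UMask 02H/04H),
# programmable on the general-purpose counters (PMC)
EVENT_IDI_MISC 0xFE PMC
UMASK_IDI_MISC_WB_UPGRADE 0x02
UMASK_IDI_MISC_WB_DOWNGRADE 0x04
```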
|
|
|
|
|
|
### `load` benchmark
|
<img src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3NEW_load.png" alt="Data volume per loop iteration of L3 and memory controller for the `load` benchmark on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz">
|
|
### `triad` benchmark
|
<img src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3NEW_triad.png" alt="Data volume per loop iteration of L3 and memory controller for the `triad` benchmark on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz">
|
|
|
|
|
|
The doubled data volume compared to Intel Broadwell EP is still there, but that is expected because the active copy to L3 is required to benefit from the L3 cache. The event `IDI_MISC_WB_UPGRADE` rises similarly to the `L2_TRANS_L2_WB` event once the working set reaches the full L2 size. At about 75% of the L3 size, the evictions to L3 with a re-use hint (`IDI_MISC_WB_UPGRADE`) decrease and the cache line drops (`IDI_MISC_WB_DOWNGRADE`) rise. Along with the drops, the reads from memory increase because, after dropping, the L2 has to re-read the cache lines from memory.
|
|
|
|
|
|
So for accurate measurements of the writeback path, you need to measure `IDI_MISC_WB_UPGRADE` and `IDI_MISC_WB_DOWNGRADE` instead of `L2_TRANS_L2_WB`.
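With the events in the event file, a quick cross-check is also possible through a custom event set, without defining any performance group. A sketch (core pinning, counter assignment and the `./a.out` binary are placeholders):

```
likwid-perfctr -C 0 -g IDI_MISC_WB_UPGRADE:PMC0,IDI_MISC_WB_DOWNGRADE:PMC1,L2_TRANS_L2_WB:PMC2 ./a.out
```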
|
|
|
|
|
|
|
|
|
|
So, if we leave out `L2_TRANS_L2_WB` from the `L3` performance group, we can include both `IDI_MISC_WB*` events and still have one counter register left. In this counter we can measure the L3 hits to characterize the load path. Unfortunately, the `MEM_LOAD_L3_*` events are also likely to be listed in the specification updates. This is not the case for [Intel Skylake SP](https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf), but it is for [Intel Skylake Desktop (SKL128)](https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/desktop-6th-gen-core-family-spec-update.pdf).
|
|
|
|
|
|
So, the new `L3` performance group looks like this:
|
|
|
|
```
SHORT L3 cache bandwidth in MBytes/s

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 L2_LINES_IN_ALL
PMC1 IDI_MISC_WB_UPGRADE
PMC2 IDI_MISC_WB_DOWNGRADE
PMC3 MEM_LOAD_RETIRED_L3_HIT
MBOX0C0 CAS_COUNT_RD
MBOX0C1 CAS_COUNT_WR
MBOX1C0 CAS_COUNT_RD
MBOX1C1 CAS_COUNT_WR
MBOX2C0 CAS_COUNT_RD
MBOX2C1 CAS_COUNT_WR
MBOX3C0 CAS_COUNT_RD
MBOX3C1 CAS_COUNT_WR
MBOX4C0 CAS_COUNT_RD
MBOX4C1 CAS_COUNT_WR
MBOX5C0 CAS_COUNT_RD
MBOX5C1 CAS_COUNT_WR

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
MEM->L2 load bandwidth [MBytes/s] 1.0E-06*(PMC0-PMC3)*64.0/time
MEM->L2 load data volume [GBytes] 1.0E-09*(PMC0-PMC3)*64.0
L3->L2 load bandwidth [MBytes/s] 1.0E-06*(PMC3)*64.0/time
L3->L2 load data volume [GBytes] 1.0E-09*(PMC3)*64.0
L2->L3 evict bandwidth [MBytes/s] 1.0E-06*PMC1*64.0/time
L2->L3 evict data volume [GBytes] 1.0E-09*PMC1*64.0
L2 dropped CLs data volume [GBytes] 1.0E-09*PMC2*64.0
L2<->L3|MEM bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1)*64.0/time
L2<->L3|MEM data volume [GBytes] 1.0E-09*(PMC0+PMC1)*64.0
Memory read bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0)*64.0/time
Memory read data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0)*64.0
Memory write bandwidth [MBytes/s] 1.0E-06*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1)*64.0/time
Memory write data volume [GBytes] 1.0E-09*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1)*64.0
Memory bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1)*64.0/time
Memory data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1)*64.0

LONG
Formulas:
MEM->L2 load bandwidth [MBytes/s] = 1.0E-06*(L2_LINES_IN_ALL-MEM_LOAD_RETIRED_L3_HIT)*64.0/time
MEM->L2 load data volume [GBytes] = 1.0E-09*(L2_LINES_IN_ALL-MEM_LOAD_RETIRED_L3_HIT)*64.0
L3->L2 load bandwidth [MBytes/s] = 1.0E-06*MEM_LOAD_RETIRED_L3_HIT*64.0/time
L3->L2 load data volume [GBytes] = 1.0E-09*MEM_LOAD_RETIRED_L3_HIT*64.0
L2->L3 evict bandwidth [MBytes/s] = 1.0E-06*IDI_MISC_WB_UPGRADE*64.0/time
L2->L3 evict data volume [GBytes] = 1.0E-09*IDI_MISC_WB_UPGRADE*64.0
L2 dropped CLs data volume [GBytes] = 1.0E-09*IDI_MISC_WB_DOWNGRADE*64.0
L2<->L3|MEM bandwidth [MBytes/s] = 1.0E-06*(L2_LINES_IN_ALL+IDI_MISC_WB_UPGRADE)*64/time
L2<->L3|MEM data volume [GBytes] = 1.0E-09*(L2_LINES_IN_ALL+IDI_MISC_WB_UPGRADE)*64
Memory read bandwidth [MBytes/s] = 1.0E-06*(SUM(CAS_COUNT_RD))*64.0/runtime
Memory read data volume [GBytes] = 1.0E-09*(SUM(CAS_COUNT_RD))*64.0
Memory write bandwidth [MBytes/s] = 1.0E-06*(SUM(CAS_COUNT_WR))*64.0/runtime
Memory write data volume [GBytes] = 1.0E-09*(SUM(CAS_COUNT_WR))*64.0
Memory bandwidth [MBytes/s] = 1.0E-06*(SUM(CAS_COUNT_RD)+SUM(CAS_COUNT_WR))*64.0/runtime
Memory data volume [GBytes] = 1.0E-09*(SUM(CAS_COUNT_RD)+SUM(CAS_COUNT_WR))*64.0
-
Profiling group to measure L2 - L3 - memory traffic. This group differs from previous CPU generations due to the
L3 victim cache. Since data can be loaded from L3 or memory, the memory controllers need to be measured as well.
```
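The group can be tried out without rebuilding LIKWID by saving it as `L3.txt` in the user group directory (presumably `$HOME/.likwid/groups/skylakeX/`, the location LIKWID scans for user-defined groups) and selecting it like any shipped group; a hypothetical run pinned to the first core of socket 0:

```
likwid-perfctr -C S0:0 -g L3 ./a.out
```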
|
|
|
|
|
|
|
|
(##) Commonly, all data should be loaded from memory directly into L2 unless the LLC prefetcher is active (as it is in this case). One might assume that all cache lines evicted to L3 with a re-use hint are also loaded again from L3, but that would mean the heuristic always makes the optimal decision.