|
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3MEM_triad.png" alt="Data volume per loop iteration of L3 and memory controller for the `triad` benchmark on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz">
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
## Implications on the use of the L3 performance group for Intel Skylake
|
|
|
|
The L3 performance group for Intel Skylake still uses the two events mentioned above. So keep in mind that `L2_LINES_IN_ALL` contains loads from both L3 and memory, and `L2_TRANS_L2_WB` contains write-backs to L3 (and memory).
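To make the caveat concrete, here is a small sketch (not LIKWID code; the counter values are invented) of how the L3 group's data volumes follow from the two events, with each event counting 64-byte cache lines:

```python
# Sketch: deriving the L3 group's data volumes from the two core events.
# Counter values below are made-up examples; each event counts 64 B lines.
CACHELINE = 64  # bytes per cache line

def l3_group_volumes(l2_lines_in_all, l2_trans_l2_wb):
    """Return (load_volume, evict_volume) in bytes.

    Caveat: L2_LINES_IN_ALL counts lines coming from L3 *and* memory,
    and L2_TRANS_L2_WB counts write-backs to L3 *and* memory, so both
    volumes overestimate the pure L2<->L3 traffic.
    """
    load_volume = l2_lines_in_all * CACHELINE
    evict_volume = l2_trans_l2_wb * CACHELINE
    return load_volume, evict_volume

load, evict = l3_group_volumes(1_000_000, 500_000)
print(load, evict)  # 64000000 32000000
```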
|
|
|
|
|
|
## What is the current state?
|
|
|
|
I posted the question in the [Intel developer forums](https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/761848), and it is not only me/LIKWID who has this problem. In the current state, the events for traffic into and out of L2 do not allow differentiating between the possible sources and destinations.
|
|
|
|
|
|
|
|
The memory traffic can be measured properly and with high accuracy, assuming 64 B for each read and write operation to memory. But the memory controllers are located in the Uncore part of the CPU, and thus the counts reflect the traffic to/from all cores of a socket (plus inter-socket traffic).
|
|
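The conversion from memory-controller counts to bandwidth can be sketched as follows. This is an illustration, not LIKWID code; it assumes uncore IMC events that count 64-byte read/write transactions (such as the CAS count events), and the numbers are invented. Because the IMC sits in the Uncore, the result is per socket, not per core:

```python
# Sketch: converting uncore memory-controller counts into bandwidth.
# Assumes each counted read/write transaction moves one 64 B cache line.
CACHELINE = 64  # bytes per memory read/write transaction

def socket_mem_bandwidth(reads, writes, runtime_s):
    """Return (read GB/s, write GB/s) for one whole socket."""
    rd_bw = reads * CACHELINE / runtime_s / 1e9
    wr_bw = writes * CACHELINE / runtime_s / 1e9
    return rd_bw, wr_bw

# Example: 10^9 read transactions in 1 s -> 64 GB/s read bandwidth
print(socket_mem_bandwidth(10**9, 0, 1.0))  # (64.0, 0.0)
```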
|
|
|
|
|
|
It is probably possible to use L3 events (part of the Uncore) to retrieve the counts for data flowing into L3, data being loaded by the L2, and the evictions to memory. But the Uncore is socket-specific and consequently does not allow attributing data consumption to a single core.
|
|
|
|
|
|
|
|
In a meeting with Intel, we got a list of events:
|
|
|
|
* MEM_INST_RETIRED.ALL_LOADS
|
|
|
|
* MEM_LOAD_MISC_RETIRED.UC
|
|
|
|
* MEM_LOAD_MISC_RETIRED.UNKNOWN_SOURCE (**)
|
|
|
|
|
|
|
|
|
|
All events marked with (**) are not published and consequently not usable by LIKWID. We tried the other events, but for some it was clear that they would not work. E.g., the `MEM_INST_RETIRED.ALL_*` events count the number of loads and stores that are issued, executed, and retired (completed) by the core, hence several units away from L2, L3, and memory. Moreover, there are cases where an instruction triggers data movement in the background (e.g. a read-for-ownership for stores where the destination cache line is not present in the L1 cache).
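A back-of-envelope sketch of why retired load/store instructions do not map 1:1 to cache-line traffic: assume a simple streaming kernel storing N doubles (8 B each), 64 B cache lines, and ordinary (non-streaming) stores, so every written line is first read via read-for-ownership. The kernel and numbers are illustrative, not a likwid-bench benchmark:

```python
# Sketch: retired store instructions vs. actual cache-line traffic for a
# streaming store kernel (write-allocate, no non-temporal stores).
def store_kernel_traffic(n_doubles):
    stores_retired = n_doubles           # what MEM_INST_RETIRED.ALL_STORES would see
    lines_written = n_doubles * 8 // 64  # 8 doubles per 64 B line
    rfo_lines_read = lines_written       # write-allocate: each line is read first
    return stores_retired, lines_written, rfo_lines_read

print(store_kernel_traffic(1024))  # (1024, 128, 128)
```

So 1024 store instructions correspond to only 128 written lines, plus 128 hidden read-for-ownership line transfers that an instruction-based event cannot see.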
|
|
|
|
|
|
|
|
I did the same measurements as above on the Skylake SP system. I left out the `MEM_INST_RETIRED.ALL_*` events and combined all `MEM_LOAD_L3_HIT_RETIRED.XSNP_*` to a single event `MEM_LOAD_L3_HIT_RETIRED.XSNP_ALL`.
|
|
|
|
|
|
### `load` benchmark
|
|
|
|
<img src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3TRI_load.png" alt="Data volume per loop iteration of L3 and memory controller for the `load` benchmark on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz">
|
|
|
|
### `triad` benchmark
|
|
|
|
<img src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3TRI_triad.png" alt="Data volume per loop iteration of L3 and memory controller for the `triad` benchmark on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz">
|
|
|
|
|
|
|
|
|
|
### Conclusion
|
|
|
|
These events do not seem to provide any insight.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## Changes with Cascadelake SP/AP
|
|
|
|
When releasing the Intel Cascadelake SP/AP chips, Intel published two new events: `IDI_MISC.WB_UPGRADE` (counts the number of cache lines that are allocated and written back to L3 with the intention that they are more likely to be reused shortly) and `IDI_MISC.WB_DOWNGRADE` (counts the number of cache lines that are dropped and not written back to L3, as they are deemed less likely to be reused shortly). The whole list of Cascadelake SP/AP events is available [here](https://download.01.org/perfmon/CLX/cascadelakex_core_v1.04.json). One of the problems is that these events are already mentioned in the errata section of the [specification update document for Intel Cascadelake SP/AP](https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/2nd-gen-xeon-scalable-spec-update.pdf):
|
|
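If the two events worked as documented, they would allow splitting L2 evictions into lines actually written back to L3 and lines silently dropped. A hypothetical sketch (invented counter values; keep the errata caveat above in mind before trusting such numbers):

```python
# Hypothetical sketch of what IDI_MISC.WB_UPGRADE / IDI_MISC.WB_DOWNGRADE
# would allow on Cascadelake SP/AP: separating L2 evictions that reach L3
# from lines that are dropped without a write-back. Each count is one
# 64 B cache line; the input values are invented.
CACHELINE = 64  # bytes per cache line

def split_l2_evictions(wb_upgrade, wb_downgrade):
    to_l3_bytes = wb_upgrade * CACHELINE   # lines written back to L3
    dropped_lines = wb_downgrade           # lines dropped, no L3 write-back
    total_evicted = wb_upgrade + wb_downgrade
    return to_l3_bytes, dropped_lines, total_evicted

print(split_l2_evictions(1000, 500))  # (64000, 500, 1500)
```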
|