So let's check full measurements to show the difference between BDX and SKX.
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3MEM_load.png" alt="Data volume per loop iteration of L3 and memory controller for the `load` benchmark on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz">
</p>
| BDX | SKX |
|-----------|-----------|
| As soon as the data sizes fill the L2 cache almost completely, the volume loaded from L3 (`L2_LINES_IN_ALL*64`) rises until it reaches 64 Byte per iteration. | When the L2 cache is almost full, the data is loaded from L3 (`L2_LINES_IN_ALL*64`) until we read 64 Byte per iteration, the same as for BDX. |
For completeness, here are the plots for the other benchmarks:
## Implications on the use of the L3 performance group for Intel Skylake
The L3 performance group for Intel Skylake still uses the two events mentioned above. So keep in mind that `L2_LINES_IN_ALL` contains loads from both L3 and memory, and `L2_TRANS_L2_WB` contains writebacks to both L3 and memory.
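Translated into metrics, the group simply multiplies both counts with the cache-line size, so each derived volume mixes L3 and memory traffic. A minimal C sketch (the function names are mine, not LIKWID's):

```c
#include <stdint.h>

#define CACHELINE 64  /* bytes per cache line */

/* L3 load data volume as the L3 group derives it: every line coming
   into L2 is counted by L2_LINES_IN_ALL, no matter whether it was
   sourced from L3 or from memory. */
uint64_t l3_load_bytes(uint64_t l2_lines_in_all)
{
    return l2_lines_in_all * CACHELINE;
}

/* L3 evict data volume: every line written back by L2 is counted by
   L2_TRANS_L2_WB, no matter whether it ends up in L3 or in memory. */
uint64_t l3_evict_bytes(uint64_t l2_trans_l2_wb)
{
    return l2_trans_l2_wb * CACHELINE;
}
```

So a program that streams everything from memory and one that streams everything from L3 produce identical "L3" metrics.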
## What was done to fix the problem?
I posted the question in the [Intel developer forums](https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/761848) and it's not only me/LIKWID having that problem. Others have tried different events as well (also in the LLC units, the CBOXes). At the current state, the events for traffic in and out of L2 do not allow differentiating the source resp. destination.
The memory traffic can be measured properly and with high accuracy, assuming 64 Byte for each read and write operation to memory. But the memory controllers are located in the Uncore part of the CPU, thus the counts reflect the traffic to/from all cores of a socket (plus intersocket traffic).
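Under this 64 Byte assumption, the volume computation is trivial: sum the DRAM CAS counts over all memory channels and scale by the line size. A sketch (the function and variable names are mine; on SKX the relevant Uncore events are the iMC `CAS_COUNT.RD`/`CAS_COUNT.WR` counts):

```c
#include <stdint.h>
#include <stddef.h>

#define CACHELINE 64  /* one CAS transfers a full cache line */

/* Socket-wide memory volume from the iMC CAS counters. Because the
   memory controllers sit in the Uncore, the result covers the traffic
   of ALL cores of the socket, not of a single core. The per-channel
   arrays hold read/write CAS counts (hypothetical names). */
uint64_t mem_volume_bytes(const uint64_t *cas_rd,
                          const uint64_t *cas_wr, size_t nchannels)
{
    uint64_t cas = 0;
    for (size_t ch = 0; ch < nchannels; ch++)
        cas += cas_rd[ch] + cas_wr[ch];
    return cas * CACHELINE;
}
```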
It is probably possible to use L3 events (part of the Uncore) to retrieve the counts for data flowing into L3, data being loaded by the L2 and the evictions to memory. But the Uncore is socket-specific and consequently does not allow the attribution of a single core's data consumption. Furthermore, there are quite a few LLC units (CBOXes) and each unit has to be programmed and read separately.
After a meeting with Intel, we got a list of events:
* MEM_INST_RETIRED.ALL_LOADS
* MEM_INST_RETIRED.ALL_STORES
* MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT
* MEM_LOAD_MISC_RETIRED.UC
* MEM_LOAD_MISC_RETIRED.UNKNOWN_SOURCE (**)
All events marked with (**) are not published and consequently not usable by LIKWID. We tried the other events, but for some it was clear that they wouldn't work. E.g., the `MEM_INST_RETIRED.ALL_*` events count the number of loads resp. stores that are issued, executed and retired (completed) by the core, hence some units away from L2, L3 and memory. Moreover, there are cases where an instruction triggers data movement in the background (e.g. read-for-ownership for stores where the destination cache line is not present in the L1 cache) which are not covered by these two events.
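The read-for-ownership case is easy to provoke: a kernel that only stores still reads every missing cache line first. A sketch (plain C; whether the compiler emits streaming stores, which would avoid the RFO, depends on compiler flags):

```c
/* A store-only kernel: instruction-wise it performs only stores, so
   MEM_INST_RETIRED.ALL_LOADS stays near zero. Nevertheless, for every
   cache line that is not already present in L1, the core must first
   load it via a read-for-ownership (RFO), so the cache hierarchy sees
   read traffic that no retired load instruction accounts for. */
void store_kernel(double *a, long n)
{
    for (long i = 0; i < n; i++)
        a[i] = 1.0;  /* regular (non-streaming) store -> RFO per line */
}
```

With regular stores, the measured read traffic of such a kernel roughly equals its write traffic, even though the retired-load count stays near zero.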
I did the same measurements as above on the Skylake SP system. I left out the `MEM_INST_RETIRED.ALL_*` events and combined all `MEM_LOAD_L3_HIT_RETIRED.XSNP_*` to a single event `MEM_LOAD_L3_HIT_RETIRED.XSNP_ALL`.
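Combining the variants needs no extra hardware support: the `MEM_LOAD_L3_HIT_RETIRED.XSNP_*` events share one event code and differ only in their umask bit, so `XSNP_ALL` is just the OR of the umasks. A sketch (the concrete umask values are my reading of Intel's published event lists, so treat them as an assumption to be verified there):

```c
/* Umask bits of the MEM_LOAD_L3_HIT_RETIRED event (0xD2) as given in
   Intel's event lists (assumption - verify against the event files): */
enum {
    XSNP_MISS = 0x01,  /* required cross-core snoop missed */
    XSNP_HIT  = 0x02,  /* snoop hit an unmodified line in another core */
    XSNP_HITM = 0x04,  /* snoop hit a modified line in another core */
    XSNP_NONE = 0x08   /* hit in L3, no snoop required */
};

/* The combined XSNP_ALL "event" ORs all four umasks. */
#define XSNP_ALL (XSNP_MISS | XSNP_HIT | XSNP_HITM | XSNP_NONE)
```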
### `triad` benchmark
<img src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3TRI_triad.png" alt="Data volume per loop iteration of L3 and memory controller for the `triad` benchmark on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz">
### Conclusion
These events don't provide any further insight. The counts rise for some benchmarks when the sizes fit in L3 or memory, but it is hard to find a relation between these events and the application model (data volume per iteration).
## Changes with Cascadelake SP/AP
When releasing the Intel Cascadelake SP/AP chips, Intel published two new events: `IDI_MISC.WB_UPGRADE` (Description: Counts number of cache lines that are allocated and written back to L3 with the intention that they are more likely to be reused shortly) and `IDI_MISC.WB_DOWNGRADE` (Description: Counts number of cache lines that are dropped and not written back to L3 as they are deemed to be less likely to be reused shortly). The whole list of Cascadelake SP/AP events is available [here](https://download.01.org/perfmon/CLX/cascadelakex_core_v1.04.json). One of the problems is that these events are already mentioned in the errata section of the [specification update document for Intel Cascadelake SP/AP](https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/2nd-gen-xeon-scalable-spec-update.pdf):
> CLX3. IDI_MISC Performance Monitoring Events May be Inaccurate<br>
> Problem: The IDI_MISC.WB_UPGRADE and IDI_MISC.WB_DOWNGRADE performance monitoring events (Event FEH; UMask 02H and 04H) counts cache lines evicted from the L2 cache. Due to this erratum, the per logical processor count may be incorrect when both logical processors on the same physical core are active. The aggregate count of both logical processors is not affected by this erratum.<br>
> Implication: IDI_MISC performance monitoring events may be inaccurate.<br>