With Intel Skylake SP (and its successor Cascadelake SP), Intel introduced a change in the cache hierarchy: the size of each level changed (larger L2, smaller L3) and the L3 now operates as a victim cache (non-inclusive). This also requires a different approach to measuring the traffic between L2, L3 and memory.

## Summary
On this page, we show how to improve the measurements of L2, L3 and memory traffic on Intel systems with an L3 victim cache. We test different hardware performance events in order to refine the current way of measuring the L2 <-> L3 data traffic. We identify two events that count the dropped cache lines and the cache lines moved from the L2 to the L3 cache. For LIKWID 5.0.0, the performance groups (event set + metrics + description) are extended with these events and therefore provide deeper insight into the cache behavior.

## What is a [victim cache](https://en.wikipedia.org/wiki/Victim_cache)?
On all architectures before Intel Skylake SP (SKX), like Intel Broadwell EP (BDX), the caches are (mostly?) inclusive. This means that all cache lines that are currently in L1 are contained in the L2 and L3 caches as well (the same holds for all lines in L2, which are also present in L3). With Intel Skylake SP, the L3 cache became a victim cache (non-inclusive), while L1 and L2 remain inclusive. A cache line evicted from L2 can either be moved to L3, written back to memory or dropped; the exact heuristics for this decision are not published by Intel.

## What is the difference for measurements?
For the CPU architectures before Intel Skylake SP, LIKWID uses two events for loaded (`L2_LINES_IN_ALL`, rf107, (--)) and evicted (`L2_TRANS_L2_WB`, r40f0, (++)) cache lines. This was sufficient to achieve a high accuracy because all data coming from memory and going to memory has to flow through L2 and L3. With Intel Skylake SP, the situation changed: the event `L2_LINES_IN_ALL` is now the sum of loads from L3 **and** memory (simply all cache lines coming into L2, independent of the source). The same is true for the `L2_TRANS_L2_WB` event. There is no differentiation between evicts to L3 and evicts to memory; the event simply counts all cache lines leaving the L2 cache. Instead of the `L2_TRANS_L2_WB` event, the Intel Skylake/Cascadelake SP architecture provides two other usable events: `L2_LINES_OUT_SILENT` (r01F2) and `L2_LINES_OUT_NON_SILENT` (r02F2).

(--, ++) Valid for both architectures, Intel Broadwell EP and Intel Skylake SP.
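
Such events can be measured with `likwid-perfctr` as custom event sets, without waiting for an updated performance group. A minimal sketch on core 0 (the counter assignment is arbitrary and `./bench` stands for any benchmark binary):

```
# Run 1: the events used on the older architectures
likwid-perfctr -C 0 -g L2_LINES_IN_ALL:PMC0,L2_TRANS_L2_WB:PMC1 ./bench

# Run 2: the two new eviction events on Intel Skylake/Cascadelake SP
likwid-perfctr -C 0 -g L2_LINES_OUT_SILENT:PMC0,L2_LINES_OUT_NON_SILENT:PMC1 ./bench
```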
| Event | BDX | SKX |
|-------|-----|-----|

In order to improve visibility, the remaining benchmarks are not directly shown:

* [stream](https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3_COMPARE_stream.png)
* [triad](https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3_COMPARE_triad.png)

These plots clearly show no real difference between the measurements of `L2_LINES_OUT_NON_SILENT` and `L2_TRANS_L2_WB` under these workloads. The `L2_LINES_OUT_SILENT` event stays at zero for all kernels. There might be workloads that produce different pictures. The measurements are performed in two application runs, first with the `L2_TRANS_L2_WB` event and second with all `L2_LINES_OUT_*` events.

So let's check the full measurements to show the difference between BDX and SKX.

| BDX | SKX |
|-----------|-----------|
| As soon as the data sizes fill the L2 cache almost completely, the volume loaded from L3 (`L2_LINES_IN_ALL`) rises until it reaches 64 Bytes per iteration. | When the L2 cache is almost full, the data is loaded from L3 (`L2_LINES_IN_ALL`) until we read 64 Bytes per iteration. Same as for BDX. |
| The benchmark does not evict any data, hence the `L2_TRANS_L2_WB` event stays zero. | `L2_TRANS_L2_WB` rises similarly to `L2_LINES_IN_ALL`. This is because the cache lines are evicted from L2, but they can be either evicted to L3, evicted to memory or dropped. |
| When the data sizes come closer to the L3 size (11 MByte as Cluster-on-Die is enabled), the data fetched from memory (`SUM(CAS_COUNT_RD)`) increases until it reaches 64 Bytes per iteration. | When the data sizes come closer to the L3 size (the full 28 MByte are usable although SNC is enabled), the data fetched from memory (`SUM(CAS_COUNT_RD)`) increases until it reaches 64 Bytes per iteration. |
| `SUM(CAS_COUNT_WR)` stays zero. | `SUM(CAS_COUNT_WR)` stays zero. |

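
All data volumes in these plots are derived the same way: every counted cache line, and every CAS command at the memory controller, is accounted with 64 Bytes and normalized to the number of loop iterations. Schematically (the metric names here are ours, not official LIKWID metrics):

```
L3 load volume       [Bytes] = 64 * L2_LINES_IN_ALL
L3 evict volume      [Bytes] = 64 * L2_TRANS_L2_WB
Memory read volume   [Bytes] = 64 * SUM(CAS_COUNT_RD)   # summed over all memory channels
Memory write volume  [Bytes] = 64 * SUM(CAS_COUNT_WR)
Volume per iteration [Bytes] = volume / number of loop iterations
```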
For completeness, here are the plots for the other benchmarks:

| Benchmark | BDX | SKX |
|-----------|-----|-----|
| stream | [link](https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/BDX_L3NEW_stream.png) | [link](https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3MEM_stream.png) |
| triad | [link](https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/BDX_L3NEW_triad.png) | [link](https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3MEM_triad.png) |

## Implications on the use of the L3 performance group for Intel Skylake

The L3 performance group for Intel Skylake (version 4.3.4) uses the two events mentioned above. So, keep in mind that `L2_LINES_IN_ALL` contains loads from L3 and memory, and `L2_TRANS_L2_WB` contains writebacks to L3 (and memory) as well as dropped cache lines. There are quite a few uncertainties here:

* Where is the data loaded from?
* Where is it evicted to?
* Have any cache lines been dropped?
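
To make the ambiguity explicit, this is roughly how the group derives its bandwidth metrics, sketched in the style of LIKWID's performance group files (not a verbatim copy of the shipped `L3` group):

```
# EVENTSET: PMC0 = L2_LINES_IN_ALL, PMC1 = L2_TRANS_L2_WB
L3 load bandwidth  [MBytes/s]   1.0E-06*PMC0*64.0/time   # mixes L3 hits and memory loads
L3 evict bandwidth [MBytes/s]   1.0E-06*PMC1*64.0/time   # mixes evicts to L3, evicts to memory and drops
```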
## What was done to fix the problem?

I posted the question in the [Intel developer forums](https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/761848), and it's not only me/LIKWID having this problem. Others have also tried different events (also in the LLC units, which LIKWID calls CBOXes). At the current state, the events for traffic in and out of L2 do not allow a differentiation of the source or destination.

The memory traffic can be measured properly and with high accuracy, assuming 64 Bytes for each read and write operation to memory. But the memory controllers are located in the Uncore part of the CPU, and thus the counts reflect the traffic to/from all cores of a socket (+ inter-socket traffic).
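
For reference, this is how the memory traffic is obtained at the integrated memory controllers (LIKWID units MBOX0-MBOX5 on SKX). A sketch for a single channel; the predefined `MEM` performance group programs the events on all channels at once. Whatever core the counts are read from, they always reflect the whole socket:

```
# CAS commands of memory controller channel 0, read from the first core of socket 0
likwid-perfctr -C S0:0 -g CAS_COUNT_RD:MBOX0C0,CAS_COUNT_WR:MBOX0C1 ./bench
```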
It is probably possible to use L3 events (part of the Uncore) to retrieve the counts for data flowing into L3, data being loaded by the L2 and the evictions to memory. But the Uncore is socket-specific and consequently does not allow attributing the data consumption to a single core. Furthermore, there are quite a few LLC units (BDX has 24 and SKX has 28), and each unit, probably with multiple counters per unit, has to be programmed and read.

After a meeting with Intel, we got a list of events:

* MEM_INST_RETIRED.ALL_LOADS (r81d0)
* MEM_LOAD_MISC_RETIRED.UC (r04d4)
* MEM_LOAD_MISC_RETIRED.UNKNOWN_SOURCE (**)

All events marked with (**) are not published and consequently not usable by LIKWID and other tools. We tried the other events, but for some it was clear that they wouldn't work. E.g., the `MEM_INST_RETIRED.ALL_*` events count the number of loads and stores, respectively, that are issued, executed and retired (completed) by the core, hence several units away from L2, L3 and memory. Moreover, there are cases where an instruction triggers data movement in the background (e.g. read-for-ownership for stores where the destination cache line is not present in the L1 cache) which is not covered by these two events.
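
The read-for-ownership (RFO) effect is easy to observe with a store-only kernel: although the code never reads, each store to a cache line that is not yet present first fetches that line, so the memory controllers see read traffic of the same order as the write traffic. A hedged sketch with `likwid-bench` (assuming its plain `store` kernel uses regular, not non-temporal, stores):

```
# 2 GB working set in memory domain S0, one thread; the MEM group reports
# the CAS_COUNT_RD/CAS_COUNT_WR sums of all memory channels
likwid-perfctr -C S0:0 -g MEM likwid-bench -t store -w S0:2GB:1
```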
I did the same measurements as above on the Skylake SP system. I left out the `MEM_INST_RETIRED.ALL_*` events and combined all `MEM_LOAD_L3_HIT_RETIRED.XSNP_*` umasks into a single event `MEM_LOAD_L3_HIT_RETIRED.XSNP_ALL`.

### `load` benchmark

<img src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3TRI_load.png" alt="Data volume per loop iteration of L3 and memory controller for the `load` benchmark on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz">

In order to improve visibility, the remaining benchmarks are not directly shown:

* [store](https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3TRI_store.png)
* [copy](https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3TRI_copy.png)
* [stream](https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3TRI_stream.png)
* [triad](https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3TRI_triad.png)

These events don't provide any further insight. The counts rise for some benchmarks when the sizes fit into L3 or memory, but it's hard to find a relation between these events and the application model (data volume per iteration).

In fact, Intel had already published suitable events for Intel Skylake SP a long time ago: `IDI_MISC_WB_UPGRADE` and `IDI_MISC_WB_DOWNGRADE`.
### `load` benchmark

<img src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3NEW_load.png" alt="Data volume per loop iteration of L3 and memory controller for the `load` benchmark on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz">

In order to improve visibility, the remaining benchmarks are not directly shown:

* [store](https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3NEW_store.png)
* [copy](https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3NEW_copy.png)
* [stream](https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3NEW_stream.png)
* [triad](https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3NEW_triad.png)

The event `IDI_MISC_WB_UPGRADE` rises similarly to the `L2_TRANS_L2_WB` event when the sizes reach the full L2 size. At about 75% of the L3 size, the evictions with reuse hint to L3 (`IDI_MISC_WB_UPGRADE`) decrease and the cache line drops (`IDI_MISC_WB_DOWNGRADE`) rise. Along with the drops, the reads from memory increase because, after dropping, the L2 has to re-read the cache lines from memory.

So, for accurate measurements of the writeback path, you need to measure `IDI_MISC_WB_UPGRADE` and `IDI_MISC_WB_DOWNGRADE` besides `L2_TRANS_L2_WB`.
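
Putting it together, a sketch of the refined writeback measurement and the derived data volumes (the counter assignment is again arbitrary):

```
likwid-perfctr -C 0 -g L2_TRANS_L2_WB:PMC0,IDI_MISC_WB_UPGRADE:PMC1,IDI_MISC_WB_DOWNGRADE:PMC2 ./bench

# derived volumes, 64 Bytes per cache line:
L2 total writeback volume [Bytes] = 64 * L2_TRANS_L2_WB        # everything leaving L2
L2 -> L3 volume           [Bytes] = 64 * IDI_MISC_WB_UPGRADE   # victims copied to L3 (reuse hint)
L2 dropped volume         [Bytes] = 64 * IDI_MISC_WB_DOWNGRADE # victims dropped
```

This split is what the extended performance groups in LIKWID 5.0.0 provide.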