|
## What is a victim cache
|
On all architectures before Intel Skylake SP, the caches are inclusive: every cache line currently in L1 is also present in the L2 and L3 caches (and every line in L2 is also present in L3). With Intel Skylake SP, the L3 cache became a victim cache (non-inclusive), while L1 and L2 remain inclusive.
|
|
|
|
|
|
If a core requests data from memory, it is loaded directly into L2 (and from there into L1), bypassing the L3 cache. If a cache line needs to be evicted from L2, its current state is checked and, based on some heuristics, the line is either dropped (sensible for clean cache lines), evicted to L3 (sensible for modified and shared cache lines), or even evicted directly to memory. The exact heuristics are not published by Intel.
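Since Intel does not publish the heuristics, the decision can only be sketched. Below is a toy model of a plausible eviction policy based on the MESI line states mentioned above; the rules are an illustrative guess, not Intel's actual logic:

```python
# Toy model of the L2 eviction decision on a victim-cache (Skylake SP
# style) design. The real heuristics are NOT published by Intel; the
# rules below are only an illustrative guess based on the line states.

def l2_evict_target(state):
    """Return where an evicted L2 line plausibly goes, by MESI state."""
    if state == "M":      # modified: data must survive, write back to L3
        return "L3"
    if state == "S":      # shared: likely reused by another core, keep in L3
        return "L3"
    if state == "E":      # clean and exclusive: can simply be dropped,
        return "dropped"  # because memory already holds a valid copy
    raise ValueError(f"unknown MESI state: {state}")

for s in "MSE":
    print(s, "->", l2_evict_target(s))
```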
|
|
|
|
|
|
## What is the difference for measurements?
|
|
|
|
|
|
For the CPU architectures before Intel Skylake SP, LIKWID uses two events for loaded (L2_LINES_IN_ALL, r07f1) and evicted (L2_TRANS_L2_WB, r40f0) cache lines. This was sufficient for high accuracy because all data coming from memory and going to memory had to flow through L2 and L3. With Intel Skylake SP the situation changed: the event L2_LINES_IN_ALL is the sum of loads from L3 **and** memory (simply all cache lines coming into L2, independent of the source). The same is true for the L2_TRANS_L2_WB event: there is no differentiation between evicts to L3 and evicts to memory, and there is also no event for counting dropped cache lines.
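The consequence of the mixed event can be made concrete with a small calculation. Deriving an "L3 load volume" from L2_LINES_IN_ALL attributes all lines that came straight from memory to L3 as well; the split between the two sources in this sketch is invented, since no per-source event exists:

```python
# L2_LINES_IN_ALL counts every line entering L2, whether it came from L3
# or directly from memory. An "L3 load volume" derived from it therefore
# overcounts whenever lines bypass L3. The split below is hypothetical,
# because no event reports it.
CACHELINE = 64                    # bytes per cache line

lines_from_l3 = 1_000_000         # assumed split (not measurable)
lines_from_mem = 4_000_000
l2_lines_in_all = lines_from_l3 + lines_from_mem  # what the event reports

derived_l3_load_volume = CACHELINE * l2_lines_in_all  # what LIKWID would show
true_l3_load_volume = CACHELINE * lines_from_l3       # actual L3 contribution

print(f"derived: {derived_l3_load_volume / 1e6:.0f} MB, "
      f"true: {true_l3_load_volume / 1e6:.0f} MB")
```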
|
|
|
|
|
|
<p align="center">
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/cache_layers_bdx.png" alt="Cache layers of Intel Broadwell EP processors">
|
|
</p>
|
|
|
|
|
|
## What is the current state?
|
|
I posted the question in the [Intel developer forums](https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/761848), and it is not only me/LIKWID having this problem. At the current state, the events for traffic into and out of L2 do not allow differentiating the source or destination, e.g. the L2_LINES_IN_ALL event counts all cache lines coming into L2 regardless of where they came from.
|
|
|
|
|
|
The memory traffic can be measured properly and with high accuracy, assuming 64 B for each read and write operation to memory. But the memory controllers are located in the Uncore part of the CPU, so the counts reflect the traffic to/from all cores of a socket (plus inter-socket traffic).
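The derivation from the memory-controller counts is straightforward. A sketch, assuming the per-channel CAS read/write counts have already been summed (the counts below are invented for illustration; as noted, they are socket-wide, not per-core):

```python
# Memory bandwidth from Uncore memory-controller counters, assuming
# 64 B moved per read/write transaction as stated in the text.
# The counts are made up; on real hardware they would come from the
# iMC CAS read/write events summed over all channels of a socket.
CACHELINE = 64

def mem_bandwidth(cas_rd, cas_wr, runtime_s):
    """Return (read, write, total) bandwidth in GB/s."""
    rd = CACHELINE * cas_rd / runtime_s / 1e9
    wr = CACHELINE * cas_wr / runtime_s / 1e9
    return rd, wr, rd + wr

rd, wr, tot = mem_bandwidth(cas_rd=150_000_000, cas_wr=75_000_000,
                            runtime_s=1.0)
print(f"read {rd:.1f} GB/s, write {wr:.1f} GB/s, total {tot:.1f} GB/s")
```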
|
|
|
|
|
|
It is probably possible to use L3 events (part of the Uncore) to retrieve the counts for data flowing into L3, data being loaded by the L2, and the evictions to memory. But the Uncore is socket-specific and consequently does not allow attributing the data consumption of a single core.
|
|
|
|
|
|
|
|
In a meeting with Intel, we got a list of events:
|
|
|
|
* MEM_INST_RETIRED.ALL_LOADS
* MEM_INST_RETIRED.ALL_STORES
* MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT
* MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM
* MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS
* MEM_LOAD_L3_HIT_RETIRED.XSNP_NONE
* MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM
* MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM
* MEM_LOAD_L3_MISS_RETIRED.REMOTE_L4 (*)
* MEM_LOAD_MISC_RETIRED.NON_DRAM (*)
* MEM_LOAD_MISC_RETIRED.UC
* MEM_LOAD_MISC_RETIRED.UNKNOWN_SOURCE (*)
|
|
|
|
|
|
|
|
All events marked with (*) are not published and consequently not usable by LIKWID. We tried the other events, but for some it was clear that they would not work. E.g., the MEM_INST_RETIRED.ALL_* events count the number of loads and stores, respectively, that are issued, executed, and retired (completed) by the core, hence some units away from L2, L3, and memory. Moreover, there are cases where an instruction triggers data movement in the background (e.g. read-for-ownership for stores where the destination cache line is not present in the L1 cache).
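The read-for-ownership effect is the main reason instruction counts and traffic diverge. A sketch of the expected memory traffic for a pure store stream: with write-allocate caches, every stored line is first read in, so the moved volume is roughly twice the stored volume, even though only store instructions retire:

```python
# Why retired load/store instruction counts don't map 1:1 to traffic:
# a store to a line not present in the cache triggers a hidden
# read-for-ownership (RFO), so each 64 B of stored data moves roughly
# 128 B through the memory hierarchy (line read in + line written back).
CACHELINE = 64

def store_stream_traffic(n_bytes, write_allocate=True):
    """Estimated bytes moved to/from memory for a pure store stream."""
    lines = n_bytes // CACHELINE
    writes = lines * CACHELINE
    reads = lines * CACHELINE if write_allocate else 0  # hidden RFO reads
    return reads + writes

print(store_stream_traffic(1024**2))         # ~2 MiB moved for 1 MiB stored
print(store_stream_traffic(1024**2, False))  # ~1 MiB with non-temporal stores
```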
|
|
|
|
|
|
## Implications on the use of the L3 performance group for Intel Skylake
|
|
The L3 performance group for Intel Skylake still uses the two events mentioned above. So, keep in mind that L2_LINES_IN_ALL contains loads from L3 and memory and L2_TRANS_L2_WB contains writebacks to L3 (and memory).
|
|
|
|
|
|
|
|
## Changes with Cascadelake SP/AP
|
|
|
|
When releasing the Intel Cascadelake SP/AP chips, Intel published two new events: IDI_MISC.WB_UPGRADE (counts cache lines that are allocated and written back to L3 with the intention that they are more likely to be reused shortly) and IDI_MISC.WB_DOWNGRADE (counts cache lines that are dropped and not written back to L3 as they are deemed less likely to be reused shortly). The whole list of Cascadelake SP/AP events is available [here](https://download.01.org/perfmon/CLX/cascadelakex_core_v1.04.json).
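These two events make it possible to split the L2 eviction stream into lines that actually reach L3 and lines that are dropped. A sketch with invented counts (whether the two events account for the complete eviction stream is an assumption, not something Intel documents):

```python
# Splitting the L2 eviction stream with the Cascadelake SP events:
#   IDI_MISC.WB_UPGRADE   -> lines actually written back to L3
#   IDI_MISC.WB_DOWNGRADE -> lines dropped, never reaching L3
# Counts are invented for illustration; that upgrade + downgrade covers
# the whole eviction stream is an assumption.
CACHELINE = 64

wb_upgrade = 3_000_000
wb_downgrade = 1_000_000

l3_evict_volume_mb = CACHELINE * wb_upgrade / 1e6   # real L2 -> L3 traffic
dropped_volume_mb = CACHELINE * wb_downgrade / 1e6  # never hits L3

print(f"to L3: {l3_evict_volume_mb:.0f} MB, "
      f"dropped: {dropped_volume_mb:.0f} MB")
```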
|
|
|
|
|
|
|
|
As an experiment, I added the two events to the Intel Skylake SP event file and ran some benchmarks.