| When the data sizes approach the L3 size (11 MByte because Cluster-on-Die is enabled), the data fetched from memory, `SUM(CAS_COUNT_RD)*64`, increases up to 64 Byte. | When the data sizes approach the L3 size (the full 28 MByte are usable although SNC is enabled), the data fetched from memory, `SUM(CAS_COUNT_RD)*64`, increases up to 64 Byte. |
| `SUM(CAS_COUNT_WR)*64` stays at zero. | `SUM(CAS_COUNT_WR)*64` stays at zero. |

The problem becomes visible here already. Since the L2 cache lines are not commonly contained in the (non-inclusive) L3, cache lines evicted from L2 have to be moved to the L3, which doubles (for the load benchmark) the measured writeback data volume. Moreover, because the event `L2_TRANS_L2_WB` counts everything that is written back out of the L2, it is unclear what actually happens to these cache lines: eviction to L3, eviction to memory, or dropping.
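
To make the expected behavior concrete, the following is a minimal sketch of a scalar double-precision load kernel of the kind the benchmarks use (the function and variable names are illustrative, not the actual benchmark code). Once the array no longer fits into the L3, every 64 Byte cache line has to be fetched from memory, while nothing is modified and therefore nothing has to be written back to memory.

```c
#include <stddef.h>

/* Minimal sketch of a scalar double-precision load kernel
 * (illustrative names, not the actual benchmark code). */
double load_kernel(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += a[i];            /* pure read: 8 Byte per element */
    }
    /* A 64 Byte cache line holds 8 doubles. If the array exceeds the
     * L3 size, every line has to come from memory, so the memory read
     * volume approaches 64 Byte per cache line, while no modified data
     * exists and SUM(CAS_COUNT_WR)*64 stays at zero. */
    return sum;
}
```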
### Comparison for store benchmark

| BDX | SKX |
|-----------|-----------|
| Expected behavior: 32B of data are loaded as soon as the vector size reaches the L2 size. These 32B are the read-for-ownership for the stores. | Expected behavior: 32B of data are loaded as soon as the vector size reaches the L2 size. These 32B are the read-for-ownership for the stores. |
| 32B are evicted from L2 to the L3 when the vector size is larger than L2. | 32B are evicted from L2 when the vector size is larger than L2, but up to about half of the L3 size there are no writebacks to memory and it is unclear whether the lines go to the L3, go to memory, or are dropped. Above half of the L3, the cache lines are either evicted from L2 to the L3 (which forwards some cache lines to memory) or evicted to memory directly. |
| The memory loads of 32B start shortly after the L3 size is reached, so data is streamed from memory through the L3 into L2. | Although the L3 size is not fully reached, some loads (RFO traffic) are already served by memory. On the other hand, it takes much larger sizes than the L3 (100-200 MB) until the full 32B are read from memory. |
| The memory stores of 32B behave similarly to the memory loads, which is visible in the 1:1 ratio of reads and writes. | The load and store memory traffic behave similarly. It also takes larger sizes than the L3 until the full 32B are written back to memory. |

The L3 performance group for Intel Skylake still uses the two events mentioned above. So keep in mind that `L2_LINES_IN_ALL` contains loads from both the L3 and memory, and `L2_TRANS_L2_WB` contains writebacks to the L3 (and memory).
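
For reference, the store-side counterpart as a minimal sketch (again with illustrative names, not the actual benchmark code). With the usual write-allocate behavior, every cache line that is stored to is first read into the cache hierarchy (the read-for-ownership mentioned above) and written back after modification, which is why the read and write traffic end up in a roughly 1:1 ratio for large arrays.

```c
#include <stddef.h>

/* Minimal sketch of a scalar double-precision store kernel
 * (illustrative names, not the actual benchmark code). */
void store_kernel(double *a, size_t n, double value)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = value;           /* pure write: 8 Byte per element */
    }
    /* With write-allocate caches, each 64 Byte line is read into the
     * hierarchy before it can be modified (read-for-ownership) and is
     * written back afterwards, so for arrays much larger than the L3
     * the read and write data volumes approach a 1:1 ratio. */
}
```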
## What is the current state?

I posted the question in the [Intel developer forums](https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/761848) and it is not only me/LIKWID having this problem. Others have also tried different events (also in the LLC units, the CBOXes). At the current state, the events for traffic into and out of the L2 do not allow differentiating the source resp. destination.
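
What can still be measured is the combined traffic between L2 and the rest of the hierarchy, for example with the L3 group and the LIKWID Marker API. The sketch below is a usage example assuming a recent LIKWID version (older versions provide the same macros through `likwid.h` instead of `likwid-marker.h`); the region name is arbitrary.

```c
/* Sketch: measuring the L3 group around a kernel with the LIKWID
 * Marker API. Compile with -DLIKWID_PERFMON and link with -llikwid,
 * then run e.g.: likwid-perfctr -C 0 -g L3 -m ./a.out */
#include <stddef.h>
#include <likwid-marker.h>

double measure_load(const double *a, size_t n)
{
    double sum = 0.0;
    LIKWID_MARKER_INIT;
    LIKWID_MARKER_START("load");
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    LIKWID_MARKER_STOP("load");
    LIKWID_MARKER_CLOSE;
    /* The reported data volumes are derived from L2_LINES_IN_ALL*64
     * (lines coming into L2) and L2_TRANS_L2_WB*64 (lines written
     * back out of L2) -- with the source/destination ambiguity
     * described above. */
    return sum;
}
```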
The memory traffic can be measured properly and with high accuracy, assuming 64 Byte for each read and write operation to memory. But the memory controllers are located in the Uncore part of the CPU, so the counts reflect the traffic to/from all cores of a socket (plus inter-socket traffic).
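
As a small sketch of that derivation (the array names and the channel loop are illustrative; they just stand for the raw counter values of the individual memory channels): the CAS counts of all channels of a socket are summed and multiplied by the 64 Byte transfer size, so the result is always a per-socket value.

```c
#include <stdint.h>

/* Sketch: deriving socket-wide memory data volumes from the uncore
 * CAS counters, assuming 64 Byte per transfer. cas_rd[i]/cas_wr[i]
 * stand for CAS_COUNT_RD/CAS_COUNT_WR of memory channel i
 * (illustrative representation of the raw counts). */
void memory_volumes(const uint64_t *cas_rd, const uint64_t *cas_wr,
                    int nchannels, double *read_bytes, double *write_bytes)
{
    uint64_t rd = 0, wr = 0;
    for (int i = 0; i < nchannels; i++) {
        rd += cas_rd[i];
        wr += cas_wr[i];
    }
    *read_bytes  = 64.0 * (double)rd;   /* SUM(CAS_COUNT_RD)*64 */
    *write_bytes = 64.0 * (double)wr;   /* SUM(CAS_COUNT_WR)*64 */
    /* These are per-socket values (all cores of the socket plus any
     * inter-socket traffic), not per-core values. */
}
```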