... | @@ -54,7 +54,7 @@ I mentioned that there are different events to measure the evict/writeback traff |
<img src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3_COMPARE_load.png" alt="Comparison of data volume per loop iteration using the events `L2_LINES_OUT_SILENT`, `L2_LINES_OUT_NON_SILENT` and `L2_TRANS_L2_WB` running the `load` benchmark on Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz"><br>
</p>

To keep this page readable, the remaining benchmarks are not shown inline, but here are the links:

* [store](https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3_COMPARE_store.png)
* [copy](https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3_COMPARE_copy.png)

... | @@ -149,3 +149,30 @@ When releasing the Intel Cascadelake SP/AP chips, Intel published two new events |
> Status:No fix.<br>

The Intel Cascadelake SP microarchitecture is the successor to Intel Skylake SP. As an experiment, I added the two events to the Intel Skylake SP event file and ran some benchmarks on a single core to get accurate results. The configuration is the same as in the comparison between Intel Broadwell EP and Intel Skylake SP above.

### `load` benchmark

<img src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3NEW_load.png" alt="Data volume per loop iteration of L3 and memory controller for the `load` benchmark on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz">

### `store` benchmark

<img src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3NEW_store.png" alt="Data volume per loop iteration of L3 and memory controller for the `store` benchmark on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz">

### `copy` benchmark

<img src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3NEW_copy.png" alt="Data volume per loop iteration of L3 and memory controller for the `copy` benchmark on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz">

### `stream` benchmark

<img src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3NEW_stream.png" alt="Data volume per loop iteration of L3 and memory controller for the `stream` benchmark on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz">

### `triad` benchmark

<img src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3NEW_triad.png" alt="Data volume per loop iteration of L3 and memory controller for the `triad` benchmark on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz">

The doubled data volume compared to Intel Broadwell EP is still there, but that is expected because the active copy to L3 is required to benefit from the L3 cache. The `IDI_MISC_WB_UPGRADE` event rises similarly to the `L2_TRANS_L2_WB` event once the data size reaches the full L2 size. At about 75% of the L3 size, the evictions to L3 due to the reuse hint (`IDI_MISC_WB_UPGRADE`) decrease and the cache line drops (`IDI_MISC_WB_DOWNGRADE`) rise. In step with the drops, the reads from memory increase because, after a drop, the L2 has to re-read the cache lines from memory.

So for accurate measurements of the writeback path, you need to measure `IDI_MISC_WB_UPGRADE` and `IDI_MISC_WB_DOWNGRADE` instead of `L2_TRANS_L2_WB`.

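
As a rough illustration of how raw counts of these two events could be turned into the per-iteration data volumes shown in the plots, here is a minimal sketch. It assumes that every counted event corresponds to one 64-byte cache line; the function name, the dictionary keys and the sample counts are hypothetical and not taken from the measurements above.

```python
CACHE_LINE_BYTES = 64  # assumed cache line size


def writeback_volumes(wb_upgrade, wb_downgrade, iterations):
    """Convert raw event counts into bytes per loop iteration.

    wb_upgrade:   count of IDI_MISC_WB_UPGRADE (lines evicted from L2 to L3)
    wb_downgrade: count of IDI_MISC_WB_DOWNGRADE (lines dropped from L2)
    iterations:   number of loop iterations of the benchmark kernel
    """
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    return {
        "evicted_to_l3": wb_upgrade * CACHE_LINE_BYTES / iterations,
        "dropped": wb_downgrade * CACHE_LINE_BYTES / iterations,
    }


# Hypothetical counts for illustration only, not measured values.
sample = writeback_volumes(wb_upgrade=750_000, wb_downgrade=250_000,
                           iterations=8_000_000)
print(sample)
```

Keeping the two volumes separate rather than summing them mirrors the distinction made above: one event counts lines that actually move to L3, the other counts lines that are dropped.
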
The results show:

1. The events work on Intel Skylake SP, although they were published only for Intel Cascadelake SP.
2. The L2 writeback path can be characterized, but there is no information about L3 writebacks.
3. There is no information about the load path (##).

(##) Normally, all data should be loaded from memory directly into L2 unless the LLC prefetcher is active (as in this case). One might assume that all cache lines evicted to L3 for reuse are also loaded again from L3, but that would mean that the heuristic always makes the optimal decision.