# Motivation

Most users of LIKWID who want to measure only a region of their code use LIKWID's MarkerAPI. The MarkerAPI is a set of macros/functions that can be embedded in the code and switched on and off at compile time. In contrast to other tools that support measuring regions of code, the MarkerAPI only specifies where to measure, not what to measure. The configuration is done from the outside, either through LIKWID's `likwid-perfctr` or by setting the appropriate environment variables.
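
For example, the usual workflow is to enable the macros at compile time and let `likwid-perfctr` do the configuration. This is only a minimal sketch: the file name `app.c`, the core list and the event group `MEM` are placeholders, and include/library paths depend on your installation:

```
# enable the MarkerAPI macros at compile time and link against the LIKWID library
# (add -I/-L paths if LIKWID is not installed in a default location)
gcc -fopenmp -DLIKWID_PERFMON app.c -o app -llikwid

# likwid-perfctr selects the events (-g), pins the threads (-C) and
# collects the MarkerAPI results (-m)
likwid-perfctr -C 0-3 -g MEM -m ./app
```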

Although the MarkerAPI consists of only a few calls, it is crucial where you put them, and the code may need some restructuring to make them work. This page gives some hints about the operations performed by the calls and discusses some tricky examples with explanations.

# Most common way of using the MarkerAPI in parallel applications

If you stick to the 4 basic calls and the region is traversed only once, it is quite simple to add the MarkerAPI. The example uses OpenMP:

```
LIKWID_MARKER_INIT;

#pragma omp parallel
{
    LIKWID_MARKER_START("do_op");
    #pragma omp for
    for (i = 0; i < N; i++)
        do_op(i);
    LIKWID_MARKER_STOP("do_op");
}

LIKWID_MARKER_CLOSE;
```

If you use only the 4 basic calls, these are the rules:

* Call `LIKWID_MARKER_INIT` only once in your application. The recommendation is to put it at the beginning of the `main` routine of your code. There is a guard inside that should prevent problems if it is called multiple times, but don't do it.
* Call `LIKWID_MARKER_CLOSE` only once in your application. The recommendation is to put it at the end of the `main` routine of your code. There is **NO** guard against calling it multiple times: each call overwrites the output file, so anything measured between two calls to `LIKWID_MARKER_CLOSE` skews the results. Don't do it.
* `LIKWID_MARKER_START(str)` and `LIKWID_MARKER_STOP(str)` should be called once per application thread. There is some logic inside to handle the case where not all application threads call them, but problems may still occur. A complete example that follows these rules is sketched below.
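
Putting the rules together, a complete minimal program could look like the following sketch. Assumptions: the macros come from `likwid-marker.h` (recent LIKWID versions; older versions provide them via `likwid.h`), the code is compiled with `-DLIKWID_PERFMON -fopenmp` and linked with `-llikwid`, and `do_op` and `N` are just placeholders:

```
#include <stdio.h>
#include <likwid-marker.h>   // provides the LIKWID_MARKER_* macros

#define N 1000000

static double a[N];

void do_op(int i) { a[i] = 2.0 * a[i] + 1.0; }

int main(void)
{
    LIKWID_MARKER_INIT;                 // once, at the beginning of main

    #pragma omp parallel
    {
        LIKWID_MARKER_START("do_op");   // once per thread
        #pragma omp for
        for (int i = 0; i < N; i++)
            do_op(i);
        LIKWID_MARKER_STOP("do_op");    // once per thread
    }

    LIKWID_MARKER_CLOSE;                // once, at the end of main
    return 0;
}
```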

# My code region is quite short

It is simple to put instrumentation calls into your application, but always remember that they have overhead. In most cases this overhead does not come from LIKWID itself but from the system calls required to access the hardware counters. Independent of the `ACCESSMODE` you selected in `config.mk`, system calls are executed, and the more events you measure, the larger the overhead becomes.
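
To get a feeling for this overhead on your own system, you can time an otherwise empty region, for example with `omp_get_wtime()`. This is only a rough sketch under the same assumptions as above, and it has to be run through `likwid-perfctr -m`, otherwise there is nothing to set up and measure:

```
#include <stdio.h>
#include <omp.h>
#include <likwid-marker.h>

int main(void)
{
    LIKWID_MARKER_INIT;
    #pragma omp parallel
    {
        double t0 = omp_get_wtime();
        LIKWID_MARKER_START("empty");   // the first START also registers the region,
        LIKWID_MARKER_STOP("empty");    // so this is rather an upper bound per call pair
        double t1 = omp_get_wtime();
        #pragma omp critical
        printf("thread %d: START/STOP pair took %.1f us\n",
               omp_get_thread_num(), (t1 - t0) * 1e6);
    }
    LIKWID_MARKER_CLOSE;
    return 0;
}
```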

Let's look at some code (excerpt from McCalpin's STREAM benchmark):

```
for (k=0; k<NTIMES; k++)
{
    // copy
    #pragma omp parallel for
    for (j=0; j<STREAM_ARRAY_SIZE; j++)
        c[j] = a[j];

    // scale
    #pragma omp parallel for
    for (j=0; j<STREAM_ARRAY_SIZE; j++)
        b[j] = scalar*c[j];
}
```

Even if `STREAM_ARRAY_SIZE` is large, a single execution of the `copy` or `scale` loop does not take long. Additionally, the OpenMP parallel region is not opened once; each loop forms a separate parallel region. If we put MarkerAPI calls there, it would look like this:

```
LIKWID_MARKER_INIT;

for (k=0; k<NTIMES; k++)
{
    // copy
    #pragma omp parallel
    {
        LIKWID_MARKER_START("copy");
        #pragma omp for
        for (j=0; j<STREAM_ARRAY_SIZE; j++)
            c[j] = a[j];
        LIKWID_MARKER_STOP("copy");
    }

    // scale
    #pragma omp parallel
    {
        LIKWID_MARKER_START("scale");
        #pragma omp for
        for (j=0; j<STREAM_ARRAY_SIZE; j++)
            b[j] = scalar*c[j];
        LIKWID_MARKER_STOP("scale");
    }
}

LIKWID_MARKER_CLOSE;
```

So, the MarkerAPI regions are traversed `NTIMES` times and we get results. But execute the original code and see how quickly it finishes, especially if you keep the defaults of `NTIMES=10` and `STREAM_ARRAY_SIZE=10000000` (10,000,000 doubles, roughly 76 MiB per array). That's only slightly larger than today's server-class CPU caches.

So, if `STREAM_ARRAY_SIZE` is reasonably large, you will get proper results. Of course, the MarkerAPI calls increase the total runtime, but the loop of interest is not affected much.

If `STREAM_ARRAY_SIZE` is small, the MarkerAPI calls consume a larger fraction of the loop's runtime. If you have timing routines (like STREAM does) and calculate time-based metrics (like bandwidths), the results might be wrong, because the measured time also contains the MarkerAPI calls if you place them inside the timed region. Disturbing the run as little as possible requires some code restructuring and, in the case of STREAM, causes the validation to fail. Here only the copy part:

```
LIKWID_MARKER_INIT;

#pragma omp parallel
{
    LIKWID_MARKER_START("copy");
    for (k=0; k<NTIMES; k++)
    {
        // copy
        #pragma omp for
        for (j=0; j<STREAM_ARRAY_SIZE; j++)
            c[j] = a[j];
    }
    LIKWID_MARKER_STOP("copy");
}

LIKWID_MARKER_CLOSE;
```

This way, the benchmark can execute the array copy without interference from the MarkerAPI. The result of this loop is different from the one before, but it performs the same operation. If you want reliable results, make sure the whole region runs for a reasonable amount of time (say, more than one second).