A better way is to integrate LIKWID into the two started processes:
|
|
```
mpiexec -np 2 likwid-perfctr -C E:N:40 -g ENERGY <mpi-program>
```
|
|
|
With this calling order, LIKWID is executed on the hosts that are given in the MPI hostfile and measures the energy consumption on 40 CPUs. If you use distinct hosts, this works, but what about the results? Each process prints its results when it is done, so the output could be interleaved and you first have to separate the outputs to get the measurement results. This can be avoided by setting an output file for each process:
|
|
|
```
mpiexec -np 2 likwid-perfctr -C E:N:40 -g ENERGY -o output_%h_%r.txt <mpi-program>
```
|
|
|
We added `-o output_%h_%r.txt` to the commandline. The `%h` is a variable that is filled by LIKWID with the hostname it is running on; `%r` is substituted by the MPI rank, a unique process identifier within the MPI execution. After execution you will have the two files `output_<host1>_0.txt` and `output_<host2>_1.txt`. But there is still a problem: if both processes run on the same host, both measure the same range of CPUs. If you want distinct CPUs per MPI process, you have to separate the calls:
|
|
|
```
mpiexec -np 1 likwid-perfctr -C S0:20 -g ENERGY -o output_%h_%r.txt <mpi-program> : -np 1 likwid-perfctr -C S1:20 -g ENERGY -o output_%h_%r.txt <mpi-program>
```
|
|
|
With this commandline, you start two processes: one measures 20 CPUs on CPU socket 0 (`S0`) and the other one 20 CPUs on CPU socket 1 (`S1`). This works fine but it is not very handy. [[likwid-mpirun|likwid-mpirun]] builds commandlines like this but provides a much simpler interface and prints the results combined for all processes. An example call would be:
|
|
|
```
likwid-mpirun -pin S0:20_S1:20 -g ENERGY <mpi-program>
```
|
|
|
Due to the `_` in the pin statement, likwid-mpirun knows that we need 2 MPI processes, each of which may run on 20 CPUs (e.g. when using threading). The page [[likwid-mpirun|likwid-mpirun]] lists some more options.
|
|
|
|
|
|
# Using the MarkerAPI in MPI applications
|
|
|
Using the MarkerAPI is a little more complex because the measuring is not performed by likwid-perfctr anymore but by the MarkerAPI calls in the application. likwid-perfctr is only used to export some configuration like the CPU list, the event set and the path of the intermediate result file.
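For orientation, here is a minimal sketch of what such an instrumented MPI program could look like. The region name `compute` is only an example; depending on the LIKWID version the macros come from `likwid.h` or `likwid-marker.h`, and the program has to be compiled with `-DLIKWID_PERFMON` and linked with `-llikwid` so that the macros expand to the actual MarkerAPI calls.

```
#include <mpi.h>
#include <likwid-marker.h>   /* older LIKWID versions provide the macros in likwid.h */

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    LIKWID_MARKER_INIT;               /* read the configuration exported by likwid-perfctr */

    LIKWID_MARKER_START("compute");   /* start counting for region "compute" on this CPU */
    /* ... the work that should be measured ... */
    LIKWID_MARKER_STOP("compute");    /* stop counting and accumulate the region results */

    LIKWID_MARKER_CLOSE;              /* write the intermediate result file of this process */

    MPI_Finalize();
    return 0;
}
```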
|
|
|
So, let's look again at the naive call:
|
|
|
```
likwid-perfctr -C E:N:40 -g ENERGY -m mpiexec -np 2 <mpi-program>
```
|
|
|
This may work in some cases but it will often fail. The explanation is simple: both MPI applications write their measurement results to the same intermediate result file.
|
|
|
Moving the likwid-perfctr call inside the mpiexec call avoids that, because likwid-perfctr is then called for each process and defines one intermediate output file per MPI program.
|
|
|
```
mpiexec -np 2 likwid-perfctr -C E:N:40 -g ENERGY -m <mpi-program>
```

Moreover we add a distinct output file for each MPI process. When spreading the processes over multiple hosts, the hostname can also be included in the file name:
|
|
```
mpiexec -np 2 likwid-perfctr -C E:N:40 -g ENERGY -m -o output_%h_%r.txt <mpi-program>
```
|
|
|
The `%h` is substituted by the hostname executing likwid-perfctr, so we can differentiate the output files afterwards. But there still may be problems when running on the same machine, because each MPI process is pinned to the same list of CPUs. In order to pin the MPI programs to distinct parts of a host, we have to separate the calls:
|
|
|
```
mpiexec -np 1 likwid-perfctr -C S0:20 -g ENERGY -m -o output_%h_%r.txt <mpi-program> : -np 1 likwid-perfctr -C S1:20 -g ENERGY -m -o output_%h_%r.txt <mpi-program>
```
|
So, we assume we want to run two processes on the same CPU socket and read the energy counters:
|
|
```
likwid-perfctr -c S0:20 -g ENERGY -m mpiexec -np 2 <mpi-program>
```
|
|
|
During initialization of the MarkerAPI inside our MPI program, each of the given CPUs is initialized and the socket locks are acquired. LIKWID sets up the socket locks because the energy counters as well as all Uncore counters are socket-specific, not core-specific. The first initialized CPU normally gets the lock, so in this case both MPI processes set the lock to the first CPU on socket 0 (commonly CPU 0). Now we pin our threads and execute the MarkerAPI calls. The start and stop calls executed by a thread measure only the CPU it is currently running on, so in our second process this is the second half of the socket and never the first CPU on the socket. Since only the first CPU is able to read the energy counters, the second process will never read them; this is only done by the first process running on the first half. Finally, the MarkerAPI writes its intermediate result file, and if the second process comes last, it will only write 0 for the energy because it has never read the counter.
|
|
|
A proper call for this purpose is:
|
|
|
```
mpiexec -np 1 likwid-perfctr -c S0:0-9 -g ENERGY -m <mpi-program> : -np 1 likwid-perfctr -c S0:10-19 -g ENERGY -m <mpi-program>
```

Or use [[likwid-mpirun|likwid-mpirun]]:
|
|
```
likwid-mpirun -pin S0:0-9_S0:10-19 -g ENERGY -m <mpi-program>
```
|
|
|
Although the option is named pin, you can also pin your application yourself. The processes and threads only measure on the selected CPUs; if you pin outside of this range, there won't be any results for that process/thread. If you need the selected CPUs inside the application (e.g. to do the pinning), you can use the environment variable `LIKWID_THREADS`.
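As an illustration, here is a minimal sketch of how an application could pin a thread using that variable. It assumes a Linux system (`sched_setaffinity`) and that `LIKWID_THREADS` holds a comma-separated list of CPU IDs such as `0,1,2,3`; the helper name `pin_to_likwid_cpu` is made up for this example.

```
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative helper: pin the calling thread to the tid-th CPU listed in
 * LIKWID_THREADS (assumed to be a comma-separated list of CPU IDs). */
static int pin_to_likwid_cpu(int tid)
{
    const char *env = getenv("LIKWID_THREADS");
    if (!env)
        return -1;

    char *list = strdup(env);
    char *saveptr = NULL;
    char *tok = strtok_r(list, ",", &saveptr);
    for (int i = 0; tok && i < tid; i++)
        tok = strtok_r(NULL, ",", &saveptr);

    int ret = -1;
    if (tok) {
        int cpu = atoi(tok);
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        ret = sched_setaffinity(0, sizeof(set), &set);  /* 0 = calling thread */
    }
    free(list);
    return ret;
}
```

Each process or thread would call such a helper with its own index so that it stays inside the CPU set handed to likwid-perfctr.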
|
|