  * **Where ?** On workstations, cluster nodes, laptops (well configured), inside terminals
  * **Who ?** For people who want to open the hood
  * **How ?** By applying some simple commands (essentially shell ones)
  
===== Session Goal =====
</code>
  
=== Exercise #1 : get this information on your host with ''cat /proc/cpuinfo'' and compare to the one above ===
  
  * How many lines of information ?
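A quick way to count those lines and the logical CPUs they describe — a minimal sketch assuming a Linux host exposing ''/proc/cpuinfo'':

```shell
# Total number of lines exposed by /proc/cpuinfo
wc -l < /proc/cpuinfo
# Number of logical CPUs: /proc/cpuinfo holds one "processor" stanza per logical core
grep -c '^processor' /proc/cpuinfo
```

Compare the first number with the line count of the sample output above; it grows with the number of logical cores.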
  
=== Exercise #2 : get the information on your host with the ''lscpu'' command ===
  
  * What new information appears in the output ?
{{ :formation:lstopo_035.png?400 |hwloc-ls}}
  
=== Exercise #3 : get a graphical representation of the hardware with the ''hwloc-ls'' command ===
  
  * Locate and identify the elements provided by the ''lscpu'' command
</code>
  
=== Exercise #4 : list the PCI peripherals with the ''lspci'' command ===
  
  * How many devices do you get ?
As when you drive a car, it is useful to get information about the running system while a process runs. The commands ''top'' and ''htop'' are made for that.
  
=== Exercise #5 : open ''htop'' and ''top'' in two terminals ===
  
  * What do you see first ?
</code>
  
=== Exercise #6 : exploration of ''/usr/bin/time'' on several Unix commands or your small programs ===
  
  
</code>
  
=== Exercise #7 : practice ''Rmmmms-$USER.r'' and investigate variability ===
  
  * Launch the previous command 10000, 1000 and 100 times, with sizes of 10, 100 and 1000 respectively
A program named ''PiMC-$USER.sh'', located in ''/tmp'' where ''$USER'' is your login, is created and ready to use.
  
=== Exercise #8 : launch the ''PiMC'' program with several numbers of iterations: from 100 to 1000000 ===
  
  * What is the typical precision of the result ?
  
=== Exercise #9 : launch the ''PiMC'' program prefixed by ''/usr/bin/time'' with several numbers of iterations: from 100 to 1000000 ===
  
  * Grep the ''Elapsed'' and ''Iterations'' lines and manually estimate the **ITOPS** (ITerative Operations Per Second) for this program implementation
  
One Solution:<code>
echo $(/usr/bin/time /tmp/PiMC-$USER.sh 100000 2>&1 | egrep '(Elapsed|Iterations)' | awk '{ print $NF }' | tr '\n' '/')1 | bc -l
</code>
  
32362.45954692556634304207
</code>

Example of code for previous results:<code>
for i in $(seq 10) ; do echo $(/usr/bin/time /tmp/PiMC-$USER.sh 100000 2>&1 | egrep '(Elapsed|Iterations)' | awk '{ print $NF }' | tr '\n' '/')1 | bc -l ; done</code>
  
From 1000 to 1000000, 1 time:
</code>
  
Example of code for previous results:<code>
for POWER in $(seq 3 1 6); do ITERATIONS=$((10**$POWER)) ; echo -ne $ITERATIONS'\t' ; echo $(/usr/bin/time /tmp/PiMC-$USER.sh $ITERATIONS 2>&1 | egrep '(Elapsed|Iterations)' | awk '{ print $NF }' | tr '\n' '/')1 | bc -l ; done</code>
==== Split the execution in equal parts ====
  
On the previous launch, the User time represents 99.6% of the Elapsed time. Internal system operations represent only 0.4%.
  
=== Exercise #10 : identification of the cost of splitting a process ===
  
  * Explore the values of the ''User'', ''System'' and ''Elapsed'' times for different numbers of iterations
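The splitting solutions on this page compute the per-job share ''EACHJOB'' with a test on the modulo; the same ceiling division can be sketched more compactly (the variable names follow the ones used elsewhere on this page):

```shell
# Split ITERATIONS into PR near-equal chunks: integer ceiling of ITERATIONS/PR
ITERATIONS=1000003 ; PR=8
EACHJOB=$(( (ITERATIONS + PR - 1) / PR ))
echo $EACHJOB   # 125001 : 8 jobs of 125001 iterations cover the 1000003 requested
```

Both forms give the smallest chunk size such that ''PR'' jobs process at least ''ITERATIONS'' iterations in total.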
In this example, we see that the User time represents 98.52% of the Elapsed time. The total Elapsed time is up to 10% greater than the unsplit one: splitting has a cost. The System time represents 0.4% of the Elapsed time.
  
=== Exercise #11 : identification of the cost of splitting a process ===
  
  * Explore the values of the ''User'', ''System'' and ''Elapsed'' times for different numbers of iterations
  * What can you conclude ?
  
=== Exercise #12 : merging results & improving metrology ===
  
  * Extend the program to extract the total number of //Inside// iterations
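One way to merge the per-job counts — a sketch with ''awk'', assuming each job prints a line such as ''Inside: 7853'' (the exact label in your log may differ):

```shell
# Sum the last field of every "Inside" line; fed here with two sample lines
# (in practice, pipe the real log of the split jobs instead of the here-document)
awk '/Inside/ { total += $NF } END { print total }' <<'EOF'
Inside: 7853
Inside: 7861
EOF
```

On the two sample lines this prints ''15714''; dividing the merged //Inside// total by the merged //Iterations// total gives a single Pi estimate from all jobs.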
In conclusion, splitting a huge job into small jobs has an Operating System cost. But distributing the jobs using the system can be very efficient to reduce the Elapsed time.
  
=== Exercise #13 : launch with ''-P'' set to the number of CPUs detected ===
  
  * Examine the ''Elapsed time'' : does it decrease or not ?
  * Examine the ''System time'' : does it increase or not ?
  
=== Exercise #14 : extend the program to improve statistics ===
  
  * Add an iterator to redo the program 10 times
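A possible shape for that iterator, as a sketch (the inner ''echo'' stands for your timed ''PiMC'' launch, and the log name is illustrative):

```shell
# Repeat the measurement 10 times and append every result to a single log file
LOG=/tmp/PiMC-stats.log
for RUN in $(seq 10) ; do
    echo "run $RUN"      # replace by: /usr/bin/time /tmp/PiMC-$USER.sh ... 2>&1
done > $LOG
wc -l < $LOG             # one line per run here; a real run logs several lines each
```

Accumulating all 10 runs in one file lets ''Rmmmms-$USER.r'' compute the statistics below in a single pass.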
Examples of statistics on estimators:
With the //magic// ''Rmmmms-$USER.r'' command, we can extract statistics on the different times:
  * for //Elapsed time// : ''cat /tmp/PiMC-${USER}_201706291231.log | grep Elapsed | awk '{ print $NF }' | /tmp/Rmmmms-$USER.r'':<code>1.96 2.02 1.985 1.987 0.01888562 0.009514167</code>
  * for //System time// : ''cat /tmp/PiMC-${USER}_201706291231.log | grep System | awk '{ print $NF }' | /tmp/Rmmmms-$USER.r'':<code>0.09 0.22 0.14 0.139 0.03665151 0.2617965</code>
  * for //User time// : ''cat /tmp/PiMC-${USER}_201706291231.log | grep User | awk '{ print $NF }' | /tmp/Rmmmms-$USER.r'':<code>59.12 59.81 59.375 59.436 0.2179297 0.003670394</code>
  
The previous results show that the variability, in this case, in
<note important>You can control the selection by watching, in another terminal, the ''htop'' activity of the cores</note>
  
=== Exercise #15 : launch the previous program on a slice of the machine ===
  
  * Identify and launch the program on only the first core
  * Identify and launch the program on the first half of the cores
  * Identify and launch the program on the second half of the cores
  * Identify and launch it on the first two cores
  * Identify and launch it on the first core of the first half and the first core of the second half
  * Why is there such a great difference between the elapsed times ?

Watch in a terminal with ''htop'' to check the right distribution of the tasks.
  
Solutions for a 32-core workstation:
  * On the first core: 0<code>
ITERATIONS=10000000 ; PR=$(($(lscpu | grep '^CPU(s):' | awk '{ print $NF }')/2)) ; EACHJOB=$([ $(($ITERATIONS % $PR)) == 0 ] && echo $(($ITERATIONS/$PR)) || echo $(($ITERATIONS/$PR+1))) ; seq $PR | /usr/bin/time hwloc-bind -p pu:0-0 xargs -I '{}' -P $PR /tmp/PiMC-$USER.sh $EACHJOB '{}' 2>&1 | grep -v timed | egrep '(Pi|Inside|Iterations|time)'
</code>
  * On the first half of cores: 0 to 15<code>
ITERATIONS=10000000 ; PR=$(($(lscpu | grep '^CPU(s):' | awk '{ print $NF }')/2)) ; EACHJOB=$([ $(($ITERATIONS % $PR)) == 0 ] && echo $(($ITERATIONS/$PR)) || echo $(($ITERATIONS/$PR+1))) ; seq $PR | /usr/bin/time hwloc-bind -p pu:0-15 xargs -I '{}' -P $PR /tmp/PiMC-$USER.sh $EACHJOB '{}' 2>&1 | grep -v timed | egrep '(Pi|Inside|Iterations|time)'
</code>
  * On the second half of cores: 16 to 31<code>
ITERATIONS=10000000 ; PR=$(($(lscpu | grep '^CPU(s):' | awk '{ print $NF }')*2)) ; EACHJOB=$([ $(($ITERATIONS % $PR)) == 0 ] && echo $(($ITERATIONS/$PR)) || echo $(($ITERATIONS/$PR+1))) ; seq $PR | /usr/bin/time hwloc-bind -p pu:16-31 xargs -I '{}' -P $PR /tmp/PiMC-$USER.sh $EACHJOB '{}' 2>&1 | grep -v timed | egrep '(Pi|Inside|Iterations|time)'
</code>
  * On the first two cores: 0 and 1<code>
ITERATIONS=10000000 ; PR=$(($(lscpu | grep '^CPU(s):' | awk '{ print $NF }')*2)) ; EACHJOB=$([ $(($ITERATIONS % $PR)) == 0 ] && echo $(($ITERATIONS/$PR)) || echo $(($ITERATIONS/$PR+1))) ; seq $PR | /usr/bin/time hwloc-bind -p pu:0-1 xargs -I '{}' -P $PR /tmp/PiMC-$USER.sh $EACHJOB '{}' 2>&1 | grep -v timed | egrep '(Pi|Inside|Iterations|time)'
</code>
  * On the first core of the first half and the first core of the second half: 0 and 8<code>
ITERATIONS=10000000 ; PR=$(($(lscpu | grep '^CPU(s):' | awk '{ print $NF }')*2)) ; EACHJOB=$([ $(($ITERATIONS % $PR)) == 0 ] && echo $(($ITERATIONS/$PR)) || echo $(($ITERATIONS/$PR+1))) ; seq $PR | /usr/bin/time hwloc-bind -p pu:0-0 pu:8-8 xargs -I '{}' -P $PR /tmp/PiMC-$USER.sh $EACHJOB '{}' 2>&1 | grep -v timed | egrep '(Pi|Inside|Iterations|time)'
</code>
  
=== Exercise #17 : from exploration to laws estimation ===
  
  * explore with the previous program from ''PR=1'' to ''PR=<2x CPU>'', 10 times for each
  * store the results in a file
  
Solution:
<code>
ITERATIONS=1000000 ;
</code>
  
=== Exercise #18 : plot & fit with the Amdahl and Mylq laws ===

  * plot the different values with your favorite plotter, focus on the median one !
  * fit with an Amdahl law ''T=s+p/N'' where ''N'' is ''PR''
  * fit with a Mylq law ''T=s+c*N+p/N''
  * which law matches best ?

Example of a bunch of gnuplot commands to do the job. Adapt them to your file and ''PR''...
<code>
Ta(x)=T1*(1-Pa+Pa/x)
fit [x=1:16] Ta(x) 'PiMC_1_64.dat' using 1:4 via T1,Pa
Tm(x)=Sm+Cm*x+Pm/x
fit [x=1:16] Tm(x) 'PiMC_1_64.dat' using 1:4 via Sm,Cm,Pm
set xlabel 'Parallel Rate'
set xrange [1:64]
set ylabel "Speedup Factor"
set title "PiMC : parallel execution with Bash for distributed iterations"
plot    'PiMC_1_64.dat' using ($1):(Tm(1)/$4) title 'Measurements' with points,\
        Tm(1)/Tm(x) title "Mylq Law" with lines,\
        Ta(1)/Ta(x) title "Amdahl Law" with lines
</code>
  
{{ :formation:pimc_1_64.png?600 |}}
==== Other sample codes (used for courses) ====
  
In the folder ''/scratch/AstroSim2017'', you will find the following executables:
  * ''PiXPU.py'' : Pi Monte Carlo Dart Dash in PyOpenCL
  * ''NBody.py'' : N-Body in PyOpenCL
  * ''xGEMM_DP_openblas'' : Matrix-Matrix multiplication with the multithreaded OpenBLAS library in double precision
  * ''xGEMM_SP_openblas'' : Matrix-Matrix multiplication with the multithreaded OpenBLAS library in single precision
  * ''xGEMM_DP_clblas'' : Matrix-Matrix multiplication with the OpenCL library in double precision
  * ''xGEMM_SP_clblas'' : Matrix-Matrix multiplication with the OpenCL library in single precision
  * ''xGEMM_DP_cublas'' : Matrix-Matrix multiplication with the CUDA library in double precision
  * ''xGEMM_SP_cublas'' : Matrix-Matrix multiplication with the CUDA library in single precision
  
=== Exercise #19 : select a parallelized program and explore scalability ===
  
  * launch one of the above codes with ''PR'' from ''1'' to 2 times the number of CPUs
  * draw the scalability curve
  * estimate the parameters with the Amdahl Law and the Mylq Law
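A possible shape for the sweep, as a sketch (the inner ''echo'' stands for the timed launch of the program you picked; ''getconf'' is used here as a portable way to count CPUs, instead of parsing ''lscpu''):

```shell
# Sweep the parallel rate PR from 1 to 2x the number of online CPUs, 10 runs each
CPUS=$(getconf _NPROCESSORS_ONLN)
for PR in $(seq 1 $((2*CPUS))) ; do
    for RUN in $(seq 10) ; do
        echo "PR=$PR run=$RUN"   # replace by your timed launch using -P $PR
    done
done
```

Redirect the whole loop into a data file, then feed the per-''PR'' medians to the gnuplot fits of Exercise #18.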
  
==== Your preferred software ====
  
=== Exercise #20 : explore the scalability of your own parallel program ===
  
  * launch your MPI code with ''PR'' from ''1'' to 2 times the number of CPUs
  * draw the scalability curve
  * estimate the parameters with the Amdahl Law and the Mylq Law
  
 --- //[[emmanuel.quemener@ens-lyon.fr|Emmanuel Quemener]] 2017/06/30 14:26//