formation:astrosim2017para4dummies [2017/06/29 18:57] equemene [An illustrative example: Pi Dart Dash]
formation:astrosim2017para4dummies [2017/07/07 09:25] (current version) equemene [5 W/2H : Why ? What ? Where ? When ? Who ? How much ? How ?]
  * **How much ?** Nothing, Blaise Pascal Center provides workstations & cluster nodes
  * **Where ?** On workstations, cluster nodes, laptops (well configured), inside terminals
  * **Who ?** For people who want to open the hood
  * **How ?** Applying some simple commands (essentially shell ones)
  
===== Session Goal =====
=== Prerequisite for hardware ===
    
  * If using CBP resources, nothing... Just login...
  * If NOT using CBP resources, a relatively recent machine with a multi-core CPU
  
=== Prerequisite for software ===
  
  * Open a graphical session on one workstation, several terminals and your favorite browser
  * If NOT using CBP resources, a well-configured GNU/Linux operating system
  
=== Prerequisite for humanware ===
  
  * An allergy to the command line will severely restrict the range of this practical session.
  * Some practice with shell scripts would be an asset, but you will improve it in this session!
  
===== Investigate Hardware =====
 
  * Input and Output Devices
  
The first property of hardware is its limited resources.
  
In POSIX systems, everything is a file, so you can retrieve information (or set configurations) with classical file commands inside a terminal. For example, ''cat /proc/cpuinfo'' returns information about the processor.
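For example, a quick way to count the logical CPUs straight from ''/proc/cpuinfo'' (a small sketch; ''nproc'' from coreutils reads the same kind of information):

```shell
# Each logical CPU has one "processor" entry in /proc/cpuinfo
grep -c '^processor' /proc/cpuinfo
# The coreutils helper reports the available processing units
nproc
```
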
</code>
  
=== Exercise #1: get this information on your host with ''cat /proc/cpuinfo'' and compare to the one above ===
  
  * How many lines of information ?
  
=== Exercise #2 : get the information on your host with the ''lscpu'' command ===
  
  * What new information appears in the output ?
  * How many CPUs ? Threads per core ? Cores per socket ? Sockets ?
  * How many cache levels ?
  * How many "flags" ? What do they represent ?
  
==== Exploration ====
 
{{ :formation:lstopo_035.png?400 |hwloc-ls}}
  
=== Exercise #3 : get a graphical representation of the hardware with the ''hwloc-ls'' command ===
  
  * Locate and identify the elements provided by the ''lscpu'' command
 
</code>
  
=== Exercise #4 : list the PCI peripherals with the ''lspci'' command ===
  
  * How many devices do you get ?
  * Can you identify the devices listed with the graphical representation ?
  * What keywords on the graphical representation define the VGA device ?
 
==== Exploring the dynamic system ====
  
As when you drive a car, it's useful to get information about the running system while it works. The commands ''top'' and ''htop'' do exactly that.
 
=== Exercise #5: open ''htop'' and ''top'' in two terminals ===
 
  * What do you see first ?
  * How much memory do you have ?
  * How much swap ?
  * How many tasks are launched ? How many threads ?
  
==== Tiny metrology with ''/usr/bin/time'' ====
  
<note important>Be careful, there is a difference between ''time'' included as a command in shells and ''time'' as a standalone program. To avoid difficulties, the program has to be requested as ''/usr/bin/time''!</note>
 
There is a difference between the ''time'' built-in shell command and the ''time'' standalone program.
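You can check which ''time'' you are invoking; a sketch for ''bash'' (the exact wording of the ''type'' output varies between shells, and the standalone program may need to be installed separately):

```shell
# Ask the shell what a bare "time" resolves to (a keyword in bash)
type time
# The standalone program, when installed, is a regular executable
ls -l /usr/bin/time 2>/dev/null || echo "standalone time not installed"
```
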
TIME Exit status: 0
</code>
 
=== Exercise #6 : exploration of ''/usr/bin/time'' on several Unix commands or your small programs ===
  
==== Statistics on the fly ! Pinnacle of statistics ====
</code>
  
To evaluate the variability of the MemCopy memory test of the ''mbw'' tool on 10 launches with a size of 1GB, the command is:
<code>
mbw -a -t 0 -n 10 1000
</code>
  
This is an example of output:
<code>
Long uses 8 bytes. Allocating 2*131072000 elements = 2097152000 bytes of memory.
</code>
  
This is an example of output:
<code>
5595.783 5673.179 5624.503 5625.749 21.81671 0.003878869
</code>
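''Rmmmms-$USER.r'' is a small helper provided on the machines. To see what it computes, here is a rough shell equivalent of its six estimators — MIN, MAX, AVG, MED, STDEV and Variability (standard deviation over average); the column order is an assumption read off the output above:

```shell
# Read one value per line on stdin; print min, max, mean, median,
# sample standard deviation and variability (stddev/mean)
rmmmms() {
    sort -n | awk '{ v[NR]=$1 ; s+=$1 ; ss+=$1*$1 }
    END {
        mean=s/NR
        med=(NR%2) ? v[(NR+1)/2] : (v[NR/2]+v[NR/2+1])/2
        sd=(NR>1) ? sqrt((ss-NR*mean*mean)/(NR-1)) : 0
        print v[1], v[NR], mean, med, sd, sd/mean
    }'
}
printf "1\n2\n3\n4\n" | rmmmms
# → 1 4 2.5 2.5 1.29099 0.516398
```
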
  
=== Exercise #7 : practice ''Rmmmms-$USER.r'' and investigate variability ===
 
  * Launch the previous command for 10, 100 and 1000 launches with sizes of 10, 100 and 1000 respectively
  * Have a look at the statistics estimators : what typical variability do you reach ?
 
This will be very useful to extract and provide statistics of times.
  
===== An illustrative example: Pi Dart Dash =====
  
==== Principle, inputs & outputs ====
  
The most common example of a Monte Carlo program: estimate the number Pi from the ratio of random points, uniformly distributed over a square, that fall inside the inscribed quarter circle. It needs:
  * Output: an integer as the number of points inside the quarter circle
  * Output (bis): an estimation of Pi (a very inefficient method, but the result is well known, so it is easily checked).
  * Output (ter): the total amount of iterations (just as a reminder)
  
The following implementation is a ''bash'' shell script. The ''RANDOM'' variable provides a random number between 0 and 32767, so the frontier is located at ''32767*32767''.
A program named ''PiMC-$USER.sh'', located in ''/tmp'' where ''$USER'' is your login, is created and ready to use.
  
=== Exercise #8: launch the ''PiMC'' program with several numbers of iterations: from 100 to 1000000 ===
 
  * What is the typical precision of the result ?
  
=== Exercise #9: launch the ''PiMC'' program prefixed by ''/usr/bin/time'' with several numbers of iterations: 100 to 1000000 ===
 
  * Grep the ''Elapsed'' and ''Iterations'' lines and manually estimate the **ITOPS** (ITerative Operations Per Second) for this program implementation
  * Improve the test to estimate the ITOPS //on the fly//: apply it to different amounts of iterations, several times
 
One solution:<code>
echo $(/usr/bin/time /tmp/PiMC-$USER.sh 100000 2>&1 | egrep '(Elapsed|Iterations)' | awk '{ print $NF }' | tr '\n' '/')1 | bc -l
</code>
 
For 100000 iterations, 10 times:
<code>
31250.00000000000000000000
31645.56962025316455696202
28248.58757062146892655367
30864.19753086419753086419
31847.13375796178343949044
32362.45954692556634304207
32467.53246753246753246753
31545.74132492113564668769
32573.28990228013029315960
32362.45954692556634304207
</code>
 
Example of code for the previous results:<code>
for i in $(seq 10) ; do echo $(/usr/bin/time /tmp/PiMC-$USER.sh 100000 2>&1 | egrep '(Elapsed|Iterations)' | awk '{ print $NF }' | tr '\n' '/')1 | bc -l ; done</code>
 
From 1000 to 1000000, 1 time:
<code>
1000 20000.00000000000000000000
10000 26315.78947368421052631578
100000 32154.34083601286173633440
1000000 31685.67807351077313054499
</code>
 
Example of code for the previous results:<code>
for POWER in $(seq 3 1 6); do ITERATIONS=$((10**$POWER)) ; echo -ne $ITERATIONS'\t' ; echo $(/usr/bin/time /tmp/PiMC-$USER.sh $ITERATIONS 2>&1 | egrep '(Elapsed|Iterations)' | awk '{ print $NF }' | tr '\n' '/')1 | bc -l ; done</code>
==== Split the execution in equal parts ====
  
The following command line divides the job to do (10000000 iterations) into ''PR'' equal jobs.
  
On the previous launch, the User time represents 99.6% of the Elapsed time, internal system operations only 0.4%.
 
=== Exercise #10 : identification of the cost of the splitting process ===
 
  * Explore the values of the ''User'', ''System'' and ''Elapsed'' times for different numbers of iterations
  * Estimate the ratio between ''User time'' and ''Elapsed time'' for the results
  * Estimate the ratio between ''System time'' and ''Elapsed time'' for the results
  * What can you conclude ?
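Such a ratio can also be computed on the fly. A sketch with GNU time's format option (''-f'' is a GNU extension; ''sleep 1'' merely stands in for a real job here, so expect a ratio near 0%):

```shell
# %U = User, %S = System, %e = Elapsed; compute User/Elapsed in percent
/usr/bin/time -f "%U %S %e" sleep 1 2>&1 | \
    awk '{ printf "User=%ss System=%ss Elapsed=%ss Ratio=%.1f%%\n", $1, $2, $3, 100*$1/$3 }'
```
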
  
Replace the ''PR'' set to ''1'' by the number of CPUs detected with the ''lscpu'' command.
In this example, we see that the User time represents 98.52% of the Elapsed time. The total Elapsed time is up to 10% greater than the unsplit one. So, splitting has a cost. The System time represents 0.4% of the Elapsed time.
  
=== Exercise #11 : identification of the cost of the splitting process ===
 
  * Explore the values of the ''User'', ''System'' and ''Elapsed'' times for different numbers of iterations
  * Estimate the ratio between ''User time'' and ''Elapsed time'' for the results
  * Estimate the ratio between ''System time'' and ''Elapsed time'' for the results
  * What can you conclude ?
 
=== Exercise #12 : merging results & improving metrology ===
 
  * Extend the program to extract the total //Inside// number of iterations
  * Set timers inside the command lines to estimate the total Elapsed time
 
Solution: the timers used are based on the ''date'' command:
<code>
ITERATIONS=1000000
START=$(date '+%s.%N')
PR=$(lscpu | grep '^CPU(s):' | awk '{ print $NF }')
EACHJOB=$([ $(($ITERATIONS % $PR)) == 0 ] && echo $(($ITERATIONS/$PR)) || echo $(($ITERATIONS/$PR+1)))
seq $PR | /usr/bin/time xargs -I '{}' /tmp/PiMC-$USER.sh $EACHJOB '{}' 2>&1 | grep ^Inside | awk '{ sum+=$2 } END { printf "Insides %i", sum }' ; echo
STOP=$(date '+%s.%N')
echo Total Elapsed time: $(echo $STOP-$START | bc -l)
</code>
 
==== After splitting, finally the parallelization ====
  
In this illustrative case, each job is independent of the others. They can be distributed over all the available computing resources. The ''xargs'' command line builder does it for you with ''-P <ConcurrentProcess>''.
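A minimal illustration of the ''-P'' effect, with ''sleep 1'' standing in for a real job: four one-second jobs restricted to two concurrent processes complete in about two seconds instead of four.

```shell
START=$(date +%s)
# 4 jobs of 1 second each, at most 2 running concurrently
seq 4 | xargs -I '{}' -P 2 sleep 1
STOP=$(date +%s)
echo "Elapsed: $((STOP-START))s"
```
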
In conclusion, splitting a huge job into small jobs has an Operating System cost. But distributing the jobs over the system can be very efficient to reduce the Elapsed time.
  
=== Exercise #13 : launch with ''-P'' set to the number of CPUs detected ===
 
  * Examine the ''Elapsed time'': does it decrease or not ?
  * Examine the ''User time'': does it increase or not ?
  * Examine the ''System time'': does it increase or not ?
 
=== Exercise #14 : extend the program to improve statistics ===
 
  * Add an iterator to redo the program 10 times
  * Store the ''time'' estimators inside an output file defined as: ''/tmp/PiMC-$USER_YYYYmmddHHMM.log''
  * Parse the output file and extract statistics on the 3 time estimators.
  * Estimate the speedup between ''PR=1'' and ''PR=<NumberOfCPU>''
  * Multiply the number of iterations by 10 and estimate the speedup
 
Solution:
<code>
ITERATIONS=1000000
</code>
  
Example of output file:
<code>
TIME User time (seconds): 59.81
</code>
  
Examples of statistics on the estimators, with the //magic// ''Rmmmms-$USER.r'' command, for the different times:
  * for //Elapsed time// : ''cat /tmp/PiMC-$USER_201706291231.log | grep Elapsed | awk '{ print $NF }' | /tmp/Rmmmms-$USER.r'':<code>1.96 2.02 1.985 1.987 0.01888562 0.009514167</code>
  * for //System time// : ''cat /tmp/PiMC-$USER_201706291231.log | grep System | awk '{ print $NF }' | /tmp/Rmmmms-$USER.r'':<code>0.09 0.22 0.14 0.139 0.03665151 0.2617965</code>
  * for //User time// : ''cat /tmp/PiMC-$USER_201706291231.log | grep User | awk '{ print $NF }' | /tmp/Rmmmms-$USER.r'':<code>59.12 59.81 59.375 59.436 0.2179297 0.003670394</code>
  
The previous results show that the variability, in this case, in
</code>
  
==== Select the execution cores ====
 
It's possible with the ''hwloc-bind'' command to select the cores on which you would like to execute your program. You just have to specify the physical units with the format //from//''-''//to//. For example, if you want to execute the parallelized application MyParallelApplication on a machine with 8 cores (defined from ''0'' to ''7'') only on the first two:<code>
hwloc-bind -p pu:0-1 ./MyParallelApplication
</code>
 
If you want to select only one atomic core, the last one for example:<code>
hwloc-bind -p pu:7-7 ./MyParallelApplication
</code>
 
If you want to select several non-adjacent cores, the first and the last ones for example:<code>
hwloc-bind -p pu:0-0 pu:7-7 ./MyParallelApplication
</code>
 
<note important>You can check the selection by watching the activity of the cores with ''htop'' in another terminal</note>
 
=== Exercise #15 : launch the previous program on a slice of the machine ===
 
  * Identify and launch the program on only the first core
  * Identify and launch the program on the first half of the cores
  * Identify and launch the program on the second half of the cores
  * Identify and launch on the first two cores
  * Identify and launch on the first core of the first half and the first core of the second half
  * Why is there such a great difference between the Elapsed times ?
 
Watch with ''htop'' in a terminal to check the right distribution of the tasks.
 
Solutions for a 32-core workstation:
  * On the first core: 0<code>
ITERATIONS=10000000 ; PR=$(($(lscpu | grep '^CPU(s):' | awk '{ print $NF }')/2)) ; EACHJOB=$([ $(($ITERATIONS % $PR)) == 0 ] && echo $(($ITERATIONS/$PR)) || echo $(($ITERATIONS/$PR+1))) ; seq $PR | /usr/bin/time hwloc-bind -p pu:0-0 xargs -I '{}' -P $PR /tmp/PiMC-$USER.sh $EACHJOB '{}' 2>&1 | grep -v timed | egrep '(Pi|Inside|Iterations|time)'
</code>
  * On the first half of the cores: 0 to 15<code>
ITERATIONS=10000000 ; PR=$(($(lscpu | grep '^CPU(s):' | awk '{ print $NF }')/2)) ; EACHJOB=$([ $(($ITERATIONS % $PR)) == 0 ] && echo $(($ITERATIONS/$PR)) || echo $(($ITERATIONS/$PR+1))) ; seq $PR | /usr/bin/time hwloc-bind -p pu:0-15 xargs -I '{}' -P $PR /tmp/PiMC-$USER.sh $EACHJOB '{}' 2>&1 | grep -v timed | egrep '(Pi|Inside|Iterations|time)'
</code>
  * On the second half of the cores: 16 to 31<code>
ITERATIONS=10000000 ; PR=$(($(lscpu | grep '^CPU(s):' | awk '{ print $NF }')*2)) ; EACHJOB=$([ $(($ITERATIONS % $PR)) == 0 ] && echo $(($ITERATIONS/$PR)) || echo $(($ITERATIONS/$PR+1))) ; seq $PR | /usr/bin/time hwloc-bind -p pu:16-31 xargs -I '{}' -P $PR /tmp/PiMC-$USER.sh $EACHJOB '{}' 2>&1 | grep -v timed | egrep '(Pi|Inside|Iterations|time)'
</code>
  * On the first two cores: 0 and 1<code>
ITERATIONS=10000000 ; PR=$(($(lscpu | grep '^CPU(s):' | awk '{ print $NF }')*2)) ; EACHJOB=$([ $(($ITERATIONS % $PR)) == 0 ] && echo $(($ITERATIONS/$PR)) || echo $(($ITERATIONS/$PR+1))) ; seq $PR | /usr/bin/time hwloc-bind -p pu:0-1 xargs -I '{}' -P $PR /tmp/PiMC-$USER.sh $EACHJOB '{}' 2>&1 | grep -v timed | egrep '(Pi|Inside|Iterations|time)'
</code>
  * On the first core of the first half and the first core of the second half: 0 and 8<code>
ITERATIONS=10000000 ; PR=$(($(lscpu | grep '^CPU(s):' | awk '{ print $NF }')*2)) ; EACHJOB=$([ $(($ITERATIONS % $PR)) == 0 ] && echo $(($ITERATIONS/$PR)) || echo $(($ITERATIONS/$PR+1))) ; seq $PR | /usr/bin/time hwloc-bind -p pu:0-0 pu:8-8 xargs -I '{}' -P $PR /tmp/PiMC-$USER.sh $EACHJOB '{}' 2>&1 | grep -v timed | egrep '(Pi|Inside|Iterations|time)'
</code>
  
=== Exercise #17 : from exploration to law estimation ===
 
  * Explore with the previous program from ''PR=1'' to ''PR=<2x CPU>'', 10 times for each value
  * Store the results in a file
 
Solution:
<code>
ITERATIONS=1000000 ;
REDO=10 ;
PR_START=1 ;
PR_STOP=$(($(lscpu | grep '^CPU(s):' | awk '{ print $NF }')*2)) ;
OUTPUT=/tmp/$(basename /tmp/PiMC-$USER.sh .sh)_${PR_START}_${PR_STOP}_$(date "+%Y%m%d%H%M").dat
seq $PR_START 1 $PR_STOP | while read PR ;
do
   echo -ne "$PR\t" ;
   EACHJOB=$([ $(($ITERATIONS % $PR)) == 0 ] && echo $(($ITERATIONS/$PR)) || echo $(($ITERATIONS/$PR+1))) ;
   seq $REDO | while read STEP ;
   do
      seq $PR | /usr/bin/time xargs -I '{}' -P $PR /tmp/PiMC-$USER.sh $EACHJOB '{}' 2>&1 | grep Elapsed | awk '{ print $NF }'
   done | /tmp/Rmmmms-$USER.r
done > $OUTPUT
echo Results in $OUTPUT file
</code>
  
As an example, on a 32HT-core workstation, we got:
<code>
# PR MIN MAX AVG MED STDEV Variability
1 29.94 35.16 30.56 30.99 1.54438 0.05053601
2 15.09 16.73 15.445 15.531 0.4647449 0.03009031
3 10.3 12.02 10.555 10.795 0.6131567 0.05809158
4 7.78 8.21 7.97 7.975 0.1269514 0.01592866
5 6.31 6.53 6.435 6.416 0.07366591 0.01144769
6 5.27 5.57 5.41 5.415 0.09778093 0.01807411
7 4.61 5.67 4.74 4.901 0.3989277 0.08416197
8 4.03 4.35 4.115 4.146 0.09800227 0.02381586
9 3.66 3.92 3.71 3.718 0.07420692 0.02000186
10 3.32 4.29 3.36 3.453 0.295524 0.08795358
11 3.01 4.45 3.08 3.229 0.4330114 0.1405881
12 2.77 4.29 2.86 3.019 0.4609519 0.161172
13 2.61 2.89 2.68 2.707 0.08602971 0.03210064
14 2.51 4.03 2.615 2.842 0.4982369 0.1905304
15 2.31 3.42 2.41 2.565 0.3422231 0.1420013
16 2.31 3.03 2.66 2.675 0.2382459 0.08956613
17 2.42 3.11 2.7 2.722 0.2395737 0.088731
18 2.42 2.8 2.67 2.627 0.1477272 0.05532855
19 2.52 2.72 2.605 2.615 0.06114645 0.02347273
20 2.43 2.91 2.54 2.579 0.136337 0.05367598
21 2.37 2.91 2.49 2.509 0.1540166 0.06185405
22 2.28 2.73 2.37 2.407 0.1271963 0.05366931
23 2.3 2.54 2.35 2.37 0.06879922 0.02927627
24 2.25 2.37 2.285 2.287 0.03368151 0.01474027
25 2.19 2.37 2.225 2.246 0.06022181 0.02706598
26 2.1 2.32 2.18 2.191 0.05606544 0.02571809
27 2.14 2.27 2.205 2.198 0.04516636 0.02048361
28 2.07 2.21 2.14 2.134 0.04273952 0.01997174
29 2.02 2.11 2.07 2.065 0.02758824 0.01332765
30 2 2.13 2.035 2.036 0.03806427 0.0187048
31 1.98 2.07 1.99 2.002 0.02820559 0.01417367
32 1.97 2.02 1.99 1.993 0.01766981 0.008879302
33 2.05 2.25 2.12 2.129 0.06402257 0.03019932
34 2.08 2.23 2.15 2.155 0.0457651 0.02128609
35 2.08 2.25 2.16 2.156 0.05853774 0.0271008
36 2.02 2.21 2.13 2.129 0.05782156 0.02714627
37 2.08 2.2 2.15 2.147 0.03560587 0.01656087
38 2.01 2.19 2.125 2.119 0.05384133 0.0253371
39 2.05 2.2 2.105 2.111 0.05108816 0.02426991
40 2.06 2.2 2.11 2.124 0.04526465 0.02145244
41 2.07 2.18 2.09 2.102 0.03425395 0.01638945
42 2.04 2.13 2.095 2.092 0.0265832 0.01268888
43 2.03 2.12 2.08 2.076 0.03025815 0.01454719
44 2.04 2.14 2.085 2.086 0.03204164 0.01536769
45 2.02 2.13 2.08 2.082 0.03392803 0.01631155
46 2.05 2.12 2.075 2.081 0.0218327 0.01052178
47 1.98 2.15 2.08 2.073 0.05250397 0.02524229
48 1.99 2.14 2.085 2.081 0.04557046 0.02185633
49 2.04 2.18 2.085 2.087 0.04321779 0.02072796
50 2.06 2.17 2.12 2.116 0.03657564 0.01725266
51 2.02 2.16 2.09 2.086 0.03864367 0.01848979
52 2.03 2.13 2.08 2.075 0.02915476 0.01401671
53 2.03 2.14 2.095 2.093 0.03465705 0.01654274
54 2 2.11 2.075 2.069 0.03212822 0.01548348
55 2.02 2.15 2.095 2.085 0.04062019 0.01938911
56 2.05 2.11 2.09 2.081 0.02078995 0.009947347
57 2.03 2.09 2.065 2.065 0.01840894 0.008914739
58 2.06 2.11 2.07 2.082 0.02250926 0.01087404
59 2.02 2.11 2.07 2.067 0.02451757 0.01184424
60 2.02 2.1 2.055 2.057 0.02406011 0.01170808
61 2.03 2.15 2.065 2.07 0.03333333 0.01614205
62 2.01 2.13 2.06 2.059 0.03842742 0.01865409
63 2.01 2.09 2.07 2.06 0.03018462 0.01458194
64 2.02 2.11 2.075 2.077 0.02945807 0.01419666
</code>
  
=== Exercise #18 : plot & fit with the Amdahl and Mylq laws ===
 
  * Plot the different values with your favorite plotter, focusing on the median one !
  * Fit with an Amdahl law ''T=s+p/N'' where ''N'' is ''PR''
  * Fit with a Mylq law ''T=s+c*N+p/N''
  * Which law matches best ?
 
Example of a gnuplot batch of commands to do the job. Adapt it to your file and ''PR''...
<code>
Ta(x)=T1*(1-Pa+Pa/x)
fit [x=1:16] Ta(x) 'PiMC_1_64.dat' using 1:4 via T1,Pa
Tm(x)=Sm+Cm*x+Pm/x
fit [x=1:16] Tm(x) 'PiMC_1_64.dat' using 1:4 via Sm,Cm,Pm
set xlabel 'Parallel Rate'
set xrange [1:64]
set ylabel "Speedup Factor"
set title "PiMC : parallel execution with Bash for distributed iterations"
plot 'PiMC_1_64.dat' using ($1):(Tm(1)/$4) title 'Measures' with points,\
	Tm(1)/Tm(x) title "Mylq Law" with lines,\
	Ta(1)/Ta(x) title "Amdahl Law" with lines
</code>
  
{{ :formation:pimc_1_64.png?600 |}}
==== Other sample codes (used for courses) ====
 
In the folder ''/scratch/AstroSim2017'', you will find the following executables:
  * ''PiXPU.py'' : Pi Monte Carlo Dart Dash in PyOpenCL
  * ''NBody.py'' : N-Body in PyOpenCL
  * ''xGEMM_DP_openblas'' : Matrix-Matrix multiplication with the multithreaded OpenBLAS library in double precision
  * ''xGEMM_SP_openblas'' : Matrix-Matrix multiplication with the multithreaded OpenBLAS library in single precision
  * ''xGEMM_DP_clblas'' : Matrix-Matrix multiplication for the OpenCL library in double precision
  * ''xGEMM_SP_clblas'' : Matrix-Matrix multiplication for the OpenCL library in single precision
  * ''xGEMM_DP_cublas'' : Matrix-Matrix multiplication for the CUDA library in double precision
  * ''xGEMM_SP_cublas'' : Matrix-Matrix multiplication for the CUDA library in single precision
  
=== Exercise #19 : select a parallelized program and explore its scalability ===
 
  * Launch one of the codes above with ''PR'' from ''1'' to 2 times the number of CPUs
  * Draw the scalability curve
  * Estimate the parameters with the Amdahl Law and the Mylq Law
  
==== Your preferred software ====
  
=== Exercise #20 : select a parallelized program and explore its scalability ===
 
  * Launch your MPI code with ''PR'' from ''1'' to 2 times the number of CPUs
  * Draw the scalability curve
  * Estimate the parameters with the Amdahl Law and the Mylq Law
 
 --- //[[emmanuel.quemener@ens-lyon.fr|Emmanuel Quemener]] 2017/06/30 14:26//