  * **How much ?** Nothing, Blaise Pascal Center provides workstations & cluster nodes
  * **Where ?** On workstations, cluster nodes, a (well configured) laptop, inside terminals
  * **Who ?** For people who want to open the hood
  * **How ?** By applying some simple commands (essentially shell ones)

===== Session Goal =====
=== Prerequisite for hardware ===

  * If using CBP resources, nothing... Just log in...
  * If NOT using CBP resources, a relatively recent machine with a multi-core CPU

=== Prerequisite for software ===

  * Open a graphical session on a workstation, several terminals and your favorite browser
  * If NOT using CBP resources, a well configured GNU/Linux operating system

=== Prerequisite for humanware ===

  * An allergy to the command line will severely restrict the range of this practical session.
  * Some practice of shell scripts would be an asset, but you will improve it in this session!
===== Investigate Hardware =====
  * Input and Output Devices

The first property of hardware is limited resources.

In POSIX systems, everything is a file, so you can retrieve information (or set configurations) with classical file commands inside a terminal. For example, ''cat /proc/cpuinfo'' returns information about the processor.
</code>
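Each logical CPU contributes one ''processor'' stanza to this file, so a quick sanity check is to count them; a minimal sketch, assuming a Linux host:

```shell
# One "processor" line per logical CPU in /proc/cpuinfo (Linux assumption)
grep -c '^processor' /proc/cpuinfo
```

''nproc'' or ''lscpu'' should report the same figure more directly.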
=== Exercise #1: get this information on your host with ''cat /proc/cpuinfo'' and compare to the one above ===

  * How many lines of information ?

=== Exercise #2: get the information on your host with the ''lscpu'' command ===

  * What new information appears in the output ?
  * How many CPUs ? Threads per core ? Cores per socket ? Sockets ?
  * How many cache levels ?
  * How many "flags" ? What do they represent ?
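The flags can also be counted non-interactively; a hedged sketch, assuming an x86 Linux host where the first ''flags'' line of ''/proc/cpuinfo'' lists one capability keyword per field (ARM kernels expose ''Features'' instead):

```shell
# "flags : fpu vme ..." -> drop the leading "flags" and ":" fields
# to count the capability keywords
grep -m1 '^flags' /proc/cpuinfo | awk '{ print NF-2 }'
```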
==== Exploration ====
{{ :formation:lstopo_035.png?400 |hwloc-ls}}
=== Exercise #3: get a graphical representation of the hardware with the ''hwloc-ls'' command ===

  * Locate and identify the elements provided by the ''lscpu'' command
</code>
=== Exercise #4: list the PCI peripherals with the ''lspci'' command ===

  * How many devices do you get ?
  * Can you identify the devices listed in the graphical representation ?
  * What keywords in the graphical representation define the VGA device ?
==== Exploring the dynamic system ====
As when you drive a car, it's useful to get information about the running system while it works. The commands ''top'' and ''htop'' provide such a live view.

=== Exercise #5: open ''htop'' and ''top'' in two terminals ===

  * What do you see first ?
  * How much memory do you have ?
  * How much swap ?
  * How many tasks are launched ? How many threads ?
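The memory and swap figures shown at the top of ''htop'' can also be read non-interactively; a sketch, assuming a Linux host with the usual ''procps'' tools:

```shell
# Total RAM and swap in megabytes, as also displayed by top/htop
free -m
# The raw values behind them live in /proc/meminfo
grep -E '^(MemTotal|SwapTotal)' /proc/meminfo
```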
==== Tiny metrology with ''/usr/bin/time'' ====
<note important>Be careful, there is a difference between ''time'' included as a command in shells and ''time'' as a standalone program. To avoid difficulties, the program ''time'' has to be requested as ''/usr/bin/time''!</note>

Here is the difference between the ''time'' built-in command and the ''time'' standalone program.
TIME Exit status: 0
</code>
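A quick way to see the difference is to ask the shell what ''time'' resolves to; a sketch, assuming ''bash'' is present:

```shell
# In bash, "time" is a reserved word handled by the shell itself
bash -c 'type -t time'     # prints: keyword
# ...whereas /usr/bin/time is (when installed) an ordinary executable on disk
command -v /usr/bin/time || echo '/usr/bin/time is not installed here'
```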
=== Exercise #6: exploration of ''/usr/bin/time'' on several Unix commands or your small programs ===
==== Statistics on the fly ! Pentacle of statistics ====
</code>
To evaluate the variability of the MemCopy memory test of the ''mbw'' tool over 10 launches with a size of 1GB, the command is:
<code>
mbw -a -t 0 -n 10 1000
</code>
Here is an example of output:
<code>
Long uses 8 bytes. Allocating 2*131072000 elements = 2097152000 bytes of memory.
</code>
Here is an example of output:
<code>
5595.783 5673.179 5624.503 5625.749 21.81671 0.003878869
</code>
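If R (and thus ''Rmmmms-$USER.r'') is not available, a rough ''awk'' substitute can compute the mean and a population standard deviation from one value per line; a sketch, not the session's script:

```shell
# Mean and population stdev of values read on stdin (one per line);
# fed here with the four mbw figures shown above
printf '5595.783\n5673.179\n5624.503\n5625.749\n' | \
awk '{ s+=$1 ; ss+=$1*$1 ; n++ } END { m=s/n ; printf "%f %f\n", m, sqrt(ss/n-m*m) }'
```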
=== Exercise #7: practice ''Rmmmms-$USER.r'' and investigate variability ===

  * Launch the previous command for 10000, 1000 and 100 launches with sizes of 10, 100 and 1000 respectively
  * Have a look at the statistics estimators: what typical variability do you reach ?

This will be very useful to extract and provide statistics on times.
===== An illustrative example: Pi Dart Dash =====

==== Principle, inputs & outputs ====

The most common example of a Monte Carlo program: estimate the number Pi from the fraction of uniformly distributed random points that fall inside a quarter circle. It needs:
  * Output: an integer, the number of points inside the quarter circle
  * Output (bis): an estimation of the number Pi (a very inefficient method, but the result is well known, so easily checked)
  * Output (ter): the total amount of iterations (just as a reminder)

The following implementation is a ''bash'' shell script. The ''RANDOM'' shell variable provides a random number between 0 and 32767, so the frontier is located at ''32767*32767''.
A program named ''PiMC-$USER.sh'', located in ''/tmp'' where ''$USER'' is your login, is created and ready to use.
=== Exercise #8: launch the ''PiMC'' program with several numbers of iterations: from 100 to 1000000 ===

  * What is the typical precision of the result ?

=== Exercise #9: launch the ''PiMC'' program prefixed by ''/usr/bin/time'' with several numbers of iterations: from 100 to 1000000 ===

  * Grep the ''Elapsed'' and ''Iterations'' lines and manually estimate the **ITOPS** (ITerative Operations Per Second) for this program implementation
  * Improve the test to estimate the ITOPS //on the fly//: apply it to different amounts of iterations, several times
One solution:<code>
echo $(/usr/bin/time /tmp/PiMC-$USER.sh 100000 2>&1 | egrep '(Elapsed|Iterations)' | awk '{ print $NF }' | tr '\n' '/')1 | bc -l
</code>

For 100000 iterations, 10 times:
<code>
31250.00000000000000000000
31645.56962025316455696202
28248.58757062146892655367
30864.19753086419753086419
31847.13375796178343949044
32362.45954692556634304207
32467.53246753246753246753
31545.74132492113564668769
32573.28990228013029315960
32362.45954692556634304207
</code>

Example of code for the previous results:<code>
for i in $(seq 10) ; do echo $(/usr/bin/time /tmp/PiMC-$USER.sh 100000 2>&1 | egrep '(Elapsed|Iterations)' | awk '{ print $NF }' | tr '\n' '/')1 | bc -l ; done</code>

From 1000 to 1000000, 1 time:
<code>
1000 20000.00000000000000000000
10000 26315.78947368421052631578
100000 32154.34083601286173633440
1000000 31685.67807351077313054499
</code>

Example of code for the previous results:<code>
for POWER in $(seq 3 1 6); do ITERATIONS=$((10**$POWER)) ; echo -ne $ITERATIONS'\t' ; echo $(/usr/bin/time /tmp/PiMC-$USER.sh $ITERATIONS 2>&1 | egrep '(Elapsed|Iterations)' | awk '{ print $NF }' | tr '\n' '/')1 | bc -l ; done</code>
==== Split the execution in equal parts ====
The following command line divides the job to do (10000000 iterations) into ''PR'' equal jobs.
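The per-job share is a ceiling division; the formula used throughout this page can be checked in isolation with toy values (''PR=3'' here is only an illustration):

```shell
# Ceiling division: each of the PR jobs receives ITERATIONS/PR rounded up,
# so that PR jobs always cover at least ITERATIONS iterations
ITERATIONS=10000000 ; PR=3
EACHJOB=$([ $(($ITERATIONS % $PR)) == 0 ] && echo $(($ITERATIONS/$PR)) || echo $(($ITERATIONS/$PR+1)))
echo $EACHJOB   # 3333334
```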
On the previous launch, the User time represents 99.6% of the Elapsed time; internal system operations only 0.4%.
=== Exercise #10: identification of the cost of the splitting process ===

  * Explore the values of the ''User'', ''System'' and ''Elapsed'' times for different numbers of iterations
  * Estimate the ratio between the ''User time'' and the ''Elapsed time''
  * Estimate the ratio between the ''System time'' and the ''Elapsed time''
  * What can you conclude ?
Replace the ''PR'' set to ''1'' by the number of CPUs detected with the ''lscpu'' command.
In this example, we see that the User time represents 98.52% of the Elapsed time. The total Elapsed time is up to 10% greater than the unsplit one: splitting has a cost. The System time represents 0.4% of the Elapsed time.
=== Exercise #11: identification of the cost of the splitting process ===

  * Explore the values of the ''User'', ''System'' and ''Elapsed'' times for different numbers of iterations
  * Estimate the ratio between the ''User time'' and the ''Elapsed time''
  * Estimate the ratio between the ''System time'' and the ''Elapsed time''
  * What can you conclude ?

=== Exercise #12: merging results & improving metrology ===

  * Append to the program to extract the total amount of //Inside// iterations
  * Set timers inside the command lines to estimate the total Elapsed time

Solution: the timers used are based on the ''date'' command
<code>
ITERATIONS=1000000
START=$(date '+%s.%N')
PR=$(lscpu | grep '^CPU(s):' | awk '{ print $NF }')
EACHJOB=$([ $(($ITERATIONS % $PR)) == 0 ] && echo $(($ITERATIONS/$PR)) || echo $(($ITERATIONS/$PR+1)))
seq $PR | /usr/bin/time xargs -I '{}' /tmp/PiMC-$USER.sh $EACHJOB '{}' 2>&1 | grep ^Inside | awk '{ sum+=$2 } END { printf "Insides %i", sum }' ; echo
STOP=$(date '+%s.%N')
echo Total Elapsed time: $(echo $STOP-$START | bc -l)
</code>
==== After splitting, finally the parallelization ====

In this illustrative case, each job is independent of the others, so they can be distributed over all the available computing resources. The ''xargs'' command line builder does it for you with ''-P <ConcurrentProcess>''.
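The effect is easy to see on a toy workload; a sketch where four one-second sleeps are dispatched to two concurrent workers, so the wall-clock time is about two seconds instead of four:

```shell
# Four one-second jobs dispatched on 2 concurrent processes with xargs -P
START=$(date +%s)
seq 4 | xargs -I '{}' -P 2 sh -c 'sleep 1'
STOP=$(date +%s)
echo "Elapsed: $((STOP-START)) s"
```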
In conclusion, splitting a huge job into small jobs has an Operating System cost. But distributing the jobs over the system can be very efficient to reduce the Elapsed time.
=== Exercise #13: launch with ''-P'' set to the number of CPUs detected ===

  * Examine the ''Elapsed time'': does it decrease or not ?
  * Examine the ''User time'': does it increase or not ?
  * Examine the ''System time'': does it increase or not ?

=== Exercise #14: append to the program to improve statistics ===

  * Add an iterator to redo the program 10 times
  * Store the ''time'' estimators inside an output file defined as: ''/tmp/PiMC-$USER_YYYYmmddHHMM.log''
  * Parse the output file and extract statistics on the 3 time estimators
  * Estimate the speedup between ''PR=1'' and ''PR=<NumberOfCPU>''
  * Multiply the number of iterations by 10 and estimate the speedup

Solution:
<code>
ITERATIONS=1000000
</code>
Example of output file:
<code>
TIME User time (seconds): 59.81
</code>
Examples of statistics on the estimators:

With the //magic// ''Rmmmms-$USER.r'' command, we can extract statistics on the different times:
  * for //Elapsed time//: ''cat /tmp/PiMC-$USER_201706291231.log | grep Elapsed | awk '{ print $NF }' | /tmp/Rmmmms-$USER.r'':<code>1.96 2.02 1.985 1.987 0.01888562 0.009514167</code>
  * for //System time//: ''cat /tmp/PiMC-$USER_201706291231.log | grep System | awk '{ print $NF }' | /tmp/Rmmmms-$USER.r'':<code>0.09 0.22 0.14 0.139 0.03665151 0.2617965</code>
  * for //User time//: ''cat /tmp/PiMC-$USER_201706291231.log | grep User | awk '{ print $NF }' | /tmp/Rmmmms-$USER.r'':<code>59.12 59.81 59.375 59.436 0.2179297 0.003670394</code>

The previous results show that, in this case, the variability is below 1% for the Elapsed and User times, but around 26% for the System time.
</code>
==== Select the execution cores ====

With the ''hwloc-bind'' command, it is possible to select the cores on which you would like to execute your program. You just have to specify the physical units with the format //from//''-''//to//. For example, to execute the parallelized application MyParallelApplication only on the first two cores of a machine with 8 cores (numbered from ''0'' to ''7''):<code>
hwloc-bind -p pu:0-1 ./MyParallelApplication
</code>
If you want to select only one atomic core, the last one for example:<code>
hwloc-bind -p pu:7-7 ./MyParallelApplication
</code>
If you want to select several non-adjacent cores, the first and the last ones for example:<code>
hwloc-bind -p pu:0-0 pu:7-7 ./MyParallelApplication
</code>

<note important>You can check the selection by watching the ''htop'' activity of the cores in another terminal</note>
=== Exercise #15: launch the previous program on a slice of the machine ===

  * Identify and launch the program on only the first core
  * Identify and launch the program on the first half of the cores
  * Identify and launch the program on the second half of the cores
  * Identify and launch it on the first two cores
  * Identify and launch it on the first core of the first half and the first core of the second half
  * Why is there such a great difference between the elapsed times ?

Watch ''htop'' in another terminal to check the right distribution of tasks.
Solutions for a 32-core workstation:
  * On the first core: 0 <code>
ITERATIONS=10000000 ; PR=$(($(lscpu | grep '^CPU(s):' | awk '{ print $NF }')/2)) ; EACHJOB=$([ $(($ITERATIONS % $PR)) == 0 ] && echo $(($ITERATIONS/$PR)) || echo $(($ITERATIONS/$PR+1))) ; seq $PR | /usr/bin/time hwloc-bind -p pu:0-0 xargs -I '{}' -P $PR /tmp/PiMC-$USER.sh $EACHJOB '{}' 2>&1 | grep -v timed | egrep '(Pi|Inside|Iterations|time)'
</code>
  * On the first half of cores: 0 to 15<code>
ITERATIONS=10000000 ; PR=$(($(lscpu | grep '^CPU(s):' | awk '{ print $NF }')/2)) ; EACHJOB=$([ $(($ITERATIONS % $PR)) == 0 ] && echo $(($ITERATIONS/$PR)) || echo $(($ITERATIONS/$PR+1))) ; seq $PR | /usr/bin/time hwloc-bind -p pu:0-15 xargs -I '{}' -P $PR /tmp/PiMC-$USER.sh $EACHJOB '{}' 2>&1 | grep -v timed | egrep '(Pi|Inside|Iterations|time)'
</code>
  * On the second half of cores: 16 to 31<code>
ITERATIONS=10000000 ; PR=$(($(lscpu | grep '^CPU(s):' | awk '{ print $NF }')*2)) ; EACHJOB=$([ $(($ITERATIONS % $PR)) == 0 ] && echo $(($ITERATIONS/$PR)) || echo $(($ITERATIONS/$PR+1))) ; seq $PR | /usr/bin/time hwloc-bind -p pu:16-31 xargs -I '{}' -P $PR /tmp/PiMC-$USER.sh $EACHJOB '{}' 2>&1 | grep -v timed | egrep '(Pi|Inside|Iterations|time)'
</code>
  * On the first two cores: 0 and 1<code>
ITERATIONS=10000000 ; PR=$(($(lscpu | grep '^CPU(s):' | awk '{ print $NF }')*2)) ; EACHJOB=$([ $(($ITERATIONS % $PR)) == 0 ] && echo $(($ITERATIONS/$PR)) || echo $(($ITERATIONS/$PR+1))) ; seq $PR | /usr/bin/time hwloc-bind -p pu:0-1 xargs -I '{}' -P $PR /tmp/PiMC-$USER.sh $EACHJOB '{}' 2>&1 | grep -v timed | egrep '(Pi|Inside|Iterations|time)'
</code>
  * On the first core of the first half and the first core of the second half: 0 and 8<code>
ITERATIONS=10000000 ; PR=$(($(lscpu | grep '^CPU(s):' | awk '{ print $NF }')*2)) ; EACHJOB=$([ $(($ITERATIONS % $PR)) == 0 ] && echo $(($ITERATIONS/$PR)) || echo $(($ITERATIONS/$PR+1))) ; seq $PR | /usr/bin/time hwloc-bind -p pu:0-0 pu:8-8 xargs -I '{}' -P $PR /tmp/PiMC-$USER.sh $EACHJOB '{}' 2>&1 | grep -v timed | egrep '(Pi|Inside|Iterations|time)'
</code>
=== Exercise #17: from exploration to law estimation ===

  * Explore with the previous program from ''PR=1'' to ''PR=<2x CPU>'', 10 times each
  * Store the results in a file

Solution:
<code>
ITERATIONS=1000000 ;
REDO=10 ;
PR_START=1 ;
PR_STOP=$(($(lscpu | grep '^CPU(s):' | awk '{ print $NF }')*2)) ;
OUTPUT=/tmp/$(basename /tmp/PiMC-$USER.sh .sh)_${PR_START}_${PR_STOP}_$(date "+%Y%m%d%H%M").dat
seq $PR_START 1 $PR_STOP | while read PR ;
do
    echo -ne "$PR\t" ;
    EACHJOB=$([ $(($ITERATIONS % $PR)) == 0 ] && echo $(($ITERATIONS/$PR)) || echo $(($ITERATIONS/$PR+1))) ;
    seq $REDO | while read STEP ;
    do
        seq $PR | /usr/bin/time xargs -I '{}' -P $PR /tmp/PiMC-$USER.sh $EACHJOB '{}' 2>&1 | grep Elapsed | awk '{ print $NF }'
    done | /tmp/Rmmmms-$USER.r
done > $OUTPUT
echo Results in $OUTPUT file
</code>
As an example, on a 32 HT-core workstation, we got:
<code>
# PR MIN MAX AVG MED STDEV Variability
1 29.94 35.16 30.56 30.99 1.54438 0.05053601
2 15.09 16.73 15.445 15.531 0.4647449 0.03009031
3 10.3 12.02 10.555 10.795 0.6131567 0.05809158
4 7.78 8.21 7.97 7.975 0.1269514 0.01592866
5 6.31 6.53 6.435 6.416 0.07366591 0.01144769
6 5.27 5.57 5.41 5.415 0.09778093 0.01807411
7 4.61 5.67 4.74 4.901 0.3989277 0.08416197
8 4.03 4.35 4.115 4.146 0.09800227 0.02381586
9 3.66 3.92 3.71 3.718 0.07420692 0.02000186
10 3.32 4.29 3.36 3.453 0.295524 0.08795358
11 3.01 4.45 3.08 3.229 0.4330114 0.1405881
12 2.77 4.29 2.86 3.019 0.4609519 0.161172
13 2.61 2.89 2.68 2.707 0.08602971 0.03210064
14 2.51 4.03 2.615 2.842 0.4982369 0.1905304
15 2.31 3.42 2.41 2.565 0.3422231 0.1420013
16 2.31 3.03 2.66 2.675 0.2382459 0.08956613
17 2.42 3.11 2.7 2.722 0.2395737 0.088731
18 2.42 2.8 2.67 2.627 0.1477272 0.05532855
19 2.52 2.72 2.605 2.615 0.06114645 0.02347273
20 2.43 2.91 2.54 2.579 0.136337 0.05367598
21 2.37 2.91 2.49 2.509 0.1540166 0.06185405
22 2.28 2.73 2.37 2.407 0.1271963 0.05366931
23 2.3 2.54 2.35 2.37 0.06879922 0.02927627
24 2.25 2.37 2.285 2.287 0.03368151 0.01474027
25 2.19 2.37 2.225 2.246 0.06022181 0.02706598
26 2.1 2.32 2.18 2.191 0.05606544 0.02571809
27 2.14 2.27 2.205 2.198 0.04516636 0.02048361
28 2.07 2.21 2.14 2.134 0.04273952 0.01997174
29 2.02 2.11 2.07 2.065 0.02758824 0.01332765
30 2 2.13 2.035 2.036 0.03806427 0.0187048
31 1.98 2.07 1.99 2.002 0.02820559 0.01417367
32 1.97 2.02 1.99 1.993 0.01766981 0.008879302
33 2.05 2.25 2.12 2.129 0.06402257 0.03019932
34 2.08 2.23 2.15 2.155 0.0457651 0.02128609
35 2.08 2.25 2.16 2.156 0.05853774 0.0271008
36 2.02 2.21 2.13 2.129 0.05782156 0.02714627
37 2.08 2.2 2.15 2.147 0.03560587 0.01656087
38 2.01 2.19 2.125 2.119 0.05384133 0.0253371
39 2.05 2.2 2.105 2.111 0.05108816 0.02426991
40 2.06 2.2 2.11 2.124 0.04526465 0.02145244
41 2.07 2.18 2.09 2.102 0.03425395 0.01638945
42 2.04 2.13 2.095 2.092 0.0265832 0.01268888
43 2.03 2.12 2.08 2.076 0.03025815 0.01454719
44 2.04 2.14 2.085 2.086 0.03204164 0.01536769
45 2.02 2.13 2.08 2.082 0.03392803 0.01631155
46 2.05 2.12 2.075 2.081 0.0218327 0.01052178
47 1.98 2.15 2.08 2.073 0.05250397 0.02524229
48 1.99 2.14 2.085 2.081 0.04557046 0.02185633
49 2.04 2.18 2.085 2.087 0.04321779 0.02072796
50 2.06 2.17 2.12 2.116 0.03657564 0.01725266
51 2.02 2.16 2.09 2.086 0.03864367 0.01848979
52 2.03 2.13 2.08 2.075 0.02915476 0.01401671
53 2.03 2.14 2.095 2.093 0.03465705 0.01654274
54 2 2.11 2.075 2.069 0.03212822 0.01548348
55 2.02 2.15 2.095 2.085 0.04062019 0.01938911
56 2.05 2.11 2.09 2.081 0.02078995 0.009947347
57 2.03 2.09 2.065 2.065 0.01840894 0.008914739
58 2.06 2.11 2.07 2.082 0.02250926 0.01087404
59 2.02 2.11 2.07 2.067 0.02451757 0.01184424
60 2.02 2.1 2.055 2.057 0.02406011 0.01170808
61 2.03 2.15 2.065 2.07 0.03333333 0.01614205
62 2.01 2.13 2.06 2.059 0.03842742 0.01865409
63 2.01 2.09 2.07 2.06 0.03018462 0.01458194
64 2.02 2.11 2.075 2.077 0.02945807 0.01419666
</code>
=== Exercise #18: plot & fit with the Amdahl and Mylq laws ===

  * Plot the different values with your favorite plotter; focus on the median ones !
  * Fit with an Amdahl law ''T=s+p/N'', where ''N'' is ''PR''
  * Fit with a Mylq law ''T=s+c*N+p/N''
  * Which law matches best ?
Examples of gnuplot commands to do the job. Adapt them to your file and ''PR''...
<code>
Ta(x)=T1*(1-Pa+Pa/x)
fit [x=1:16] Ta(x) 'PiMC_1_64.dat' using 1:4 via T1,Pa
Tm(x)=Sm+Cm*x+Pm/x
fit [x=1:16] Tm(x) 'PiMC_1_64.dat' using 1:4 via Sm,Cm,Pm
set xlabel 'Parallel Rate'
set xrange [1:64]
set ylabel "Speedup Factor"
set title "PiMC : parallel execution with Bash for distributed iterations"
plot 'PiMC_1_64.dat' using ($1):(Tm(1)/$4) title 'Measures' with points,\
Tm(1)/Tm(x) title "Mylq Law" with lines,\
Ta(1)/Ta(x) title "Amdahl Law" with lines
</code>
{{ :formation:pimc_1_64.png?600 |}}

==== Other sample codes (used for courses) ====

In the folder ''/scratch/AstroSim2017'', you will find the following executables:
  * ''PiXPU.py'': Pi Monte Carlo Dart Dash in PyOpenCL
  * ''NBody.py'': N-Body in PyOpenCL
  * ''xGEMM_DP_openblas'': Matrix-Matrix multiplication with the multithreaded OpenBLAS library, in double precision
  * ''xGEMM_SP_openblas'': Matrix-Matrix multiplication with the multithreaded OpenBLAS library, in single precision
  * ''xGEMM_DP_clblas'': Matrix-Matrix multiplication with the OpenCL BLAS library, in double precision
  * ''xGEMM_SP_clblas'': Matrix-Matrix multiplication with the OpenCL BLAS library, in single precision
  * ''xGEMM_DP_cublas'': Matrix-Matrix multiplication with the CUDA BLAS library, in double precision
  * ''xGEMM_SP_cublas'': Matrix-Matrix multiplication with the CUDA BLAS library, in single precision
=== Exercise #19: select a parallelized program and explore its scalability ===

  * Launch one of the above codes with ''PR'' from ''1'' to 2 times the number of CPUs
  * Draw the scalability curve
  * Estimate the parameters with the Amdahl and Mylq laws

==== Your preferred software ====

=== Exercise #20: take your preferred parallelized program and explore its scalability ===

  * Launch your MPI code with ''PR'' from ''1'' to 2 times the number of CPUs
  * Draw the scalability curve
  * Estimate the parameters with the Amdahl and Mylq laws
+ | --- //[[emmanuel.quemener@ens-lyon.fr|Emmanuel Quemener]] 2017/06/30 14:26// |