</code>

Differences between two revisions of the page ''formation:astrosim2017gpu4dummies'', by equemene: 2017/07/07 13:04 and 2017/07/10 18:52 (current version).
=== Exercise #6: launch ''xGEMM_<precision>_<implementation>'' with different sizes and iterations ===

  * Which of the CPU implementations is the most powerful?
  * Increase the matrix size to ''2000'', ''4000'', ''8000'' on the GPU and check the results
  * Move from single precision to double precision (SP to DP) and examine the elapsed time on CPU
  * Move from single precision to double precision (SP to DP) and examine the elapsed time on GPU
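The cost this exercise probes can be made concrete with a naive triple-loop matrix multiply in plain Python. This is a reference sketch only (the actual ''xGEMM'' codes use BLAS, CUDA or OpenCL kernels): it shows the O(n^3) work that makes each doubling of the matrix size eight times more expensive.

```python
def naive_gemm(A, B):
    """Naive triple-loop matrix multiply C = A * B.

    The O(n^3) reference algorithm that the xGEMM BLAS/GPU
    implementations accelerate; lists of lists of floats.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = A[i][k]          # hoisted for the inner loop
            for j in range(p):
                C[i][j] += aik * B[k][j]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(naive_gemm(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

Timing this pure-Python version against the provided implementations for growing sizes gives a feel for why the GPU results of the exercise scale so differently.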
===== Exploration with dummy codes =====

==== Pi Monte Carlo, a Compute Bound Example ====
The ''PiXPU.py'' code is an implementation of the Pi Monte Carlo "Dart Dash" for OpenCL and CUDA devices. It is useful to evaluate the compute power of *PU devices (CPU, and GPUs from Nvidia, AMD or Intel) through the different implementations.

It is available:
  * on workstations: ''/scratch/AstroSim2017/PiXPU.py''
  * on the website: [[http://www.cbp.ens-lyon.fr/emmanuel.quemener/documents/Astrosim2017/PiXPU.py|PiXPU.py]]

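The underlying "dart dash" method can be sketched in a few lines of plain Python. This is a simplified serial illustration, not the actual ''PiXPU.py'' kernel: draw random points in the unit square and count the fraction that lands inside the quarter disc.

```python
import math
import random

def pi_monte_carlo(iterations, seed=0):
    """Estimate Pi by throwing 'darts' at the unit square and
    counting how many land inside the quarter disc x^2 + y^2 <= 1."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(iterations):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter disc covers Pi/4 of the unit square.
    return 4.0 * inside / iterations

print(pi_monte_carlo(100000))  # close to 3.14159 for large iteration counts
```

Each dart is independent of the others, which is why the problem parallelizes so well on GPUs: the iterations can be split freely among threads.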
Copy ''PiXPU.py'' into your own folder to use it:<code>
mkdir -p /scratch/$USER
cd /scratch/$USER
cp /scratch/AstroSim2017/PiXPU.py /scratch/$USER
</code>

The documentation is available by calling ''/scratch/$USER/PiXPU.py -h'':<code>
PiXPU.py -o (Out of Core Metrology) -c (Print Curves) -d <DeviceId> -g <CUDA/OpenCL> -i <Iterations> -b <BlocksBegin> -e <BlocksEnd> -s <BlocksStep> -f <ThreadsFirst> -l <ThreadsLast> -t <ThreadssTep> -r <RedoToImproveStats> -m <SHR3/CONG/MWC/KISS> -v <INT32/INT64/FP32/FP64>

Informations about devices detected under OpenCL API:
Device #0 from The pocl project of type *PU : pthread-Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
Device #1 from NVIDIA Corporation of type *PU : GeForce GTX TITAN
Device #2 from Intel(R) Corporation of type *PU : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
Device #3 from Advanced Micro Devices, Inc. of type *PU : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz

Informations about devices detected under CUDA API:
Device #0 of type GPU : GeForce GTX TITAN
</code>

The ''-h'' option also detects the OpenCL and CUDA devices and assigns each one an ID, which must be used to select it. Launched without any options, the program prints its default settings and runs on device #0: <code>
Devices Identification : [0]
GpuStyle used : OpenCL
Iterations : 1000000
Number of Blocks on begin : 1
Number of Blocks on end : 1
Step on Blocks : 1
Number of Threads on begin : 1
Number of Threads on end : 1
Step on Threads : 1
Number of redo : 1
Metrology done out of XPU : False
Type of Marsaglia RNG used : MWC
Type of variable : FP32
Device #0 from The pocl project of type xPU : pthread-Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
Device #1 from NVIDIA Corporation of type xPU : GeForce GTX 1080 Ti
Device #2 from Intel(R) Corporation of type xPU : Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
Device #3 from Advanced Micro Devices, Inc. of type xPU : Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
('CPU/GPU selected: ', 'pthread-Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz')
Pi estimation 3.14192800
0.03 0.03 0.00 0.03 0.03 37357749 37357749 0 37357749 37357749
</code>

Two files are created by default, a NumPy ''.npz'' archive and a plain-text table readable by gnuplot:
  * ''Pi_FP32_MWC_xPU_OpenCL_1_1_1_1_01000000_Device0_InMetro_titan.npz''
  * ''Pi_FP32_MWC_xPU_OpenCL_1_1_1_1_01000000_Device0_InMetro_titan''

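The plain-text file is the one gnuplot reads: whitespace-separated columns per tested configuration, with the itops values in column 9 (as the ''using 1:9'' plot commands of the later exercises suggest). A small Python sketch to extract those columns; note that the column layout here is an assumption inferred from the gnuplot commands, not taken from the ''PiXPU.py'' sources:

```python
def read_pixpu_results(path):
    """Parse a PiXPU.py plain-text result file.

    Assumed layout (inferred from the gnuplot 'using 1:9' commands):
    whitespace-separated columns, column 1 = parallel rate,
    column 9 = itops. Lines with fewer than 9 columns are skipped.
    """
    results = []
    with open(path) as fh:
        for line in fh:
            cols = line.split()
            if len(cols) >= 9:
                results.append((float(cols[0]), float(cols[8])))
    return results
```

With such a helper, ''max(results, key=lambda r: r[1])'' gives the parallel rate with the best itops, which is exactly what the exercises below ask you to read off the plots.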
=== Exercise #7: explore ''PiXPU.py'' with several simple configurations for ''PR=1'' ===

  * Without any parameters (the defaults):
    * What is the selected device? How many itops (iterative operations per second) do you reach?
  * With only the device parameter, e.g. ''-d 1'' to select device ''#1'', for each of the available devices:
    * What are the ratios between the devices? Which one is the most powerful?
  * With the device selector, increasing the number of iterations and the number of redos:
    * What happens to the itops values? What is the typical variability of the results?

<code>/scratch/$USER/PiXPU.py</code>

<code>
/scratch/$USER/PiXPU.py -d 1
/scratch/$USER/PiXPU.py -d 2
/scratch/$USER/PiXPU.py -d 3
</code>

<code>
/scratch/$USER/PiXPU.py -d 0 -i 100000000 -r 10
/scratch/$USER/PiXPU.py -d 1 -i 100000000 -r 10
/scratch/$USER/PiXPU.py -d 2 -i 100000000 -r 10
/scratch/$USER/PiXPU.py -d 3 -i 100000000 -r 10
</code>

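To get a feel for the variability question, one can redo a small Monte Carlo estimation several times with different seeds and look at the spread of the estimates; this is the kind of run-to-run fluctuation that the ''-r'' option lets you average over. A pure-Python sketch, independent of ''PiXPU.py'' itself:

```python
import math
import random
import statistics

def pi_once(iterations, seed):
    """One Monte Carlo Pi estimate with its own RNG stream."""
    rng = random.Random(seed)
    inside = sum(1 for _ in range(iterations)
                 if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * inside / iterations

# Redo the same measurement 10 times, as '-r 10' does for the timings.
estimates = [pi_once(10000, seed) for seed in range(10)]
print(statistics.mean(estimates))          # close to Pi
print(max(estimates) - min(estimates))     # the run-to-run spread
```

The spread shrinks as the number of iterations grows, which is why the exercises combine large ''-i'' values with several redos.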
=== Exercise #8: explore ''PiXPU.py'' by increasing the Parallel Rate ''PR'' ===

  * With a ''PR'' from ''1'' to ''64'' set by ''-b'' and ''-e'', 1 billion iterations, 10 redos, on the default device:
    * How does the elapsed time decrease?
  * With the device selector, increasing the number of iterations and the number of redos:
    * What happens to the itops values? What is the typical variability of the results?

<code>./PiXPU.py -d 0 -b 1 -e 64 -i 1000000000 -r 10</code>

In this case, we define a gnuplot config file as follows. Adapt it to your files and configuration.
<code>
set xlabel 'Parallel Rate'
set ylabel 'Itops'
plot 'Pi_FP32_MWC_xPU_OpenCL_1_64_1_1_1000000000_Device0_InMetro_titan' using 1:9 title 'CPU with OpenCL'
</code>

{{ :formation:pimc_1_64_cpu.png?600 |}}

=== Exercise #9: explore ''PiXPU.py'' with large ''PR'' on GPU (mostly powers of 2) ===

  * Explore with ''PR'' from ''2048'' to ''32768'' with a step of 128
  * For which ''PR'' is the itops the highest on your device?

To explore the GPU device (device #1) on this platform, with parallel rates from 2048 to 32768, a step of 128 and 1000000000 iterations: <code>
./PiXPU.py -d 1 -b 2048 -e $((2048*16)) -s 128 -i 1000000000 -r 10
</code>

Output files are:
  * ''Pi_FP32_MWC_xPU_OpenCL_2048_32768_1_1_1000000000_Device1_InMetro_titan.npz''
  * ''Pi_FP32_MWC_xPU_OpenCL_2048_32768_1_1_1000000000_Device1_InMetro_titan''

In this case, you can define a gnuplot config file:
<code>
set xlabel 'Parallel Rate'
set ylabel 'Itops'
plot 'Pi_FP32_MWC_xPU_OpenCL_2048_32768_1_1_1000000000_Device1_InMetro_titan' using 1:9 title 'GTX 1080 Ti'
</code>

{{ :formation:pimc_2048_32768_gtx1080ti.png?600 |}}

=== Exercise #10: explore ''PiXPU.py'' around a large ''PR'' ===

<code>./PiXPU.py -d 1 -b $((2048-8)) -e $((2048+8)) -i 10000000000 -r 10</code>

Output files are:
  * ''Pi_FP32_MWC_xPU_OpenCL_2040_2056_1_1_10000000000_Device1_InMetro_titan''
  * ''Pi_FP32_MWC_xPU_OpenCL_2040_2056_1_1_10000000000_Device1_InMetro_titan.npz''

In this case, you can define a gnuplot config file:
<code>
set xlabel 'Parallel Rate'
set ylabel 'Itops'
plot 'Pi_FP32_MWC_xPU_OpenCL_2040_2056_1_1_10000000000_Device1_InMetro_titan' using 1:9 title 'GTX 1080 Ti'
</code>

{{ :formation:pimc_2040_2056_gtx1080ti.png?600 |}}
==== NBody, a simplistic simulator ====

The ''NBody.py'' code is an implementation of an N-body Keplerian system on OpenCL devices.

It is available:
  * on workstations: ''/scratch/AstroSim2017/NBody.py''
  * on the website: [[http://www.cbp.ens-lyon.fr/emmanuel.quemener/documents/Astrosim2017/NBody.py|NBody.py]]

Launch the code with ''N=2'' for ''1000'' iterations with a graphical output:
<code>
python NBody.py -n 2 -g -i 1000
</code>

{{ :formation:nbody_n2_gpu.png?600 |}}

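Behind this ''N=2'' run, the Keplerian case boils down to integrating Newtonian gravity around a central mass. A minimal plain-Python sketch (symplectic Euler, normalized units G = M = 1); this is only an illustration of the physics, not the ''NBody.py'' OpenCL kernel:

```python
import math

def two_body_orbit(steps=10000, dt=0.001):
    """Integrate a test particle on a circular Keplerian orbit
    (G = M = 1, r = 1, circular speed v = 1) with symplectic Euler."""
    x, y = 1.0, 0.0
    vx, vy = 0.0, 1.0
    for _ in range(steps):
        r3 = (x * x + y * y) ** 1.5
        ax, ay = -x / r3, -y / r3   # Newtonian acceleration toward the origin
        vx += ax * dt               # kick: update velocity first
        vy += ay * dt
        x += vx * dt                # drift: then update position
        y += vy * dt
    return x, y

x, y = two_body_orbit()
print(math.hypot(x, y))  # stays close to 1: the circular orbit is preserved
```

Updating the velocity before the position (a symplectic scheme) keeps the orbit stable over many periods, where a naive Euler step would spiral outward; this choice matters as much on the GPU as in this sketch.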
=== Exercise #11: explore ''NBody.py'' with different devices ===

=== Exercise #12: explore ''NBody.py'' with steps and iterations ===

=== Exercise #13: explore ''NBody.py'' with Double Precision ===
===== Exploration with production codes =====

==== PKDGRAV3 ====