</code>

Differences between two revisions of the page ''formation:astrosim2017gpu4dummies'', by equemene: 2017/07/07 13:04 and 2017/07/10 18:52 (current version).
=== Exercise #6: launch ''xGEMM_<precision>_<implementation>'' with different sizes and iterations ===

  * Which of the CPU implementations is the most powerful?
  * Increase the matrix size to ''2000'', ''4000'', ''8000'' on the GPU and check the results
  * Move from single precision to double precision (SP to DP) and examine the elapsed time on CPU
  * Move from single precision to double precision (SP to DP) and examine the elapsed time on GPU
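The cost this exercise probes can be made concrete with a naive triple-loop matrix multiply in plain Python. This is a reference sketch only (the actual ''xGEMM'' codes use BLAS, CUDA or OpenCL kernels): it shows the O(n^3) work that makes each doubling of the matrix size eight times more expensive.

```python
def naive_gemm(A, B):
    """Naive triple-loop matrix multiply C = A * B.

    The O(n^3) reference algorithm that the xGEMM BLAS/GPU
    implementations accelerate; lists of lists of floats.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = A[i][k]          # hoisted for the inner loop
            for j in range(p):
                C[i][j] += aik * B[k][j]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(naive_gemm(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

Timing this pure-Python version against the provided implementations for growing sizes gives a feel for why the GPU results of the exercise scale so differently.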
===== Exploration with dummy codes =====

==== Pi Monte Carlo, a Compute Bound Example ====
The ''PiXPU.py'' code is an implementation of the Pi Monte Carlo "Dart Dash" for OpenCL and CUDA devices. It is useful to evaluate the compute power of *PU devices (CPU, and GPUs from Nvidia, AMD or Intel) through the different implementations.

It is available:
  * on workstations: ''/scratch/AstroSim2017/PiXPU.py''
  * on the website: [[http://www.cbp.ens-lyon.fr/emmanuel.quemener/documents/Astrosim2017/PiXPU.py|PiXPU.py]]

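The underlying "dart dash" method can be sketched in a few lines of plain Python. This is a simplified serial illustration, not the actual ''PiXPU.py'' kernel: draw random points in the unit square and count the fraction that lands inside the quarter disc.

```python
import math
import random

def pi_monte_carlo(iterations, seed=0):
    """Estimate Pi by throwing 'darts' at the unit square and
    counting how many land inside the quarter disc x^2 + y^2 <= 1."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(iterations):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter disc covers Pi/4 of the unit square.
    return 4.0 * inside / iterations

print(pi_monte_carlo(100000))  # close to 3.14159 for large iteration counts
```

Each dart is independent of the others, which is why the problem parallelizes so well on GPUs: the iterations can be split freely among threads.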
Copy ''PiXPU.py'' into your own folder to use it:<code>
mkdir -p /scratch/$USER
cd /scratch/$USER
cp /scratch/AstroSim2017/PiXPU.py /scratch/$USER
</code>

The documentation is available by calling ''/scratch/$USER/PiXPU.py -h'':<code>
PiXPU.py -o (Out of Core Metrology) -c (Print Curves) -d <DeviceId> -g <CUDA/OpenCL> -i <Iterations> -b <BlocksBegin> -e <BlocksEnd> -s <BlocksStep> -f <ThreadsFirst> -l <ThreadsLast> -t <ThreadssTep> -r <RedoToImproveStats> -m <SHR3/CONG/MWC/KISS> -v <INT32/INT64/FP32/FP64>

Informations about devices detected under OpenCL API:
Device #0 from The pocl project of type *PU : pthread-Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
Device #1 from NVIDIA Corporation of type *PU : GeForce GTX TITAN
Device #2 from Intel(R) Corporation of type *PU : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
Device #3 from Advanced Micro Devices, Inc. of type *PU : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz

Informations about devices detected under CUDA API:
Device #0 of type GPU : GeForce GTX TITAN
</code>

The ''-h'' option also detects the OpenCL and CUDA devices and assigns each one an ID, which must be used to select it. Launched without any options, the program prints its default settings and runs on device #0: <code>
Devices Identification : [0]
GpuStyle used : OpenCL
Iterations : 1000000
Number of Blocks on begin : 1
Number of Blocks on end : 1
Step on Blocks : 1
Number of Threads on begin : 1
Number of Threads on end : 1
Step on Threads : 1
Number of redo : 1
Metrology done out of XPU : False
Type of Marsaglia RNG used : MWC
Type of variable : FP32
Device #0 from The pocl project of type xPU : pthread-Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
Device #1 from NVIDIA Corporation of type xPU : GeForce GTX 1080 Ti
Device #2 from Intel(R) Corporation of type xPU : Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
Device #3 from Advanced Micro Devices, Inc. of type xPU : Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
('CPU/GPU selected: ', 'pthread-Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz')
Pi estimation 3.14192800
0.03 0.03 0.00 0.03 0.03 37357749 37357749 0 37357749 37357749
</code>

Two files are created by default, a NumPy ''.npz'' archive and a plain-text table readable by gnuplot:
  * ''Pi_FP32_MWC_xPU_OpenCL_1_1_1_1_01000000_Device0_InMetro_titan.npz''
  * ''Pi_FP32_MWC_xPU_OpenCL_1_1_1_1_01000000_Device0_InMetro_titan''

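The plain-text file is the one gnuplot reads: whitespace-separated columns per tested configuration, with the itops values in column 9 (as the ''using 1:9'' plot commands of the later exercises suggest). A small Python sketch to extract those columns; note that the column layout here is an assumption inferred from the gnuplot commands, not taken from the ''PiXPU.py'' sources:

```python
def read_pixpu_results(path):
    """Parse a PiXPU.py plain-text result file.

    Assumed layout (inferred from the gnuplot 'using 1:9' commands):
    whitespace-separated columns, column 1 = parallel rate,
    column 9 = itops. Lines with fewer than 9 columns are skipped.
    """
    results = []
    with open(path) as fh:
        for line in fh:
            cols = line.split()
            if len(cols) >= 9:
                results.append((float(cols[0]), float(cols[8])))
    return results
```

With such a helper, ''max(results, key=lambda r: r[1])'' gives the parallel rate with the best itops, which is exactly what the exercises below ask you to read off the plots.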
=== Exercise #7: explore ''PiXPU.py'' with several simple configurations for ''PR=1'' ===

  * Without any parameters (the defaults):
    * What is the selected device? How many itops (iterative operations per second) do you reach?
  * With only the device parameter, e.g. ''-d 1'' to select device ''#1'', for each of the available devices:
    * What are the ratios between the devices? Which one is the most powerful?
  * With the device selector, increasing the number of iterations and the number of redos:
    * What happens to the itops values? What is the typical variability of the results?

<code>/scratch/$USER/PiXPU.py</code>

<code>
/scratch/$USER/PiXPU.py -d 1
/scratch/$USER/PiXPU.py -d 2
/scratch/$USER/PiXPU.py -d 3
</code>

<code>
/scratch/$USER/PiXPU.py -d 0 -i 100000000 -r 10
/scratch/$USER/PiXPU.py -d 1 -i 100000000 -r 10
/scratch/$USER/PiXPU.py -d 2 -i 100000000 -r 10
/scratch/$USER/PiXPU.py -d 3 -i 100000000 -r 10
</code>

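To get a feel for the variability question, one can redo a small Monte Carlo estimation several times with different seeds and look at the spread of the estimates; this is the kind of run-to-run fluctuation that the ''-r'' option lets you average over. A pure-Python sketch, independent of ''PiXPU.py'' itself:

```python
import math
import random
import statistics

def pi_once(iterations, seed):
    """One Monte Carlo Pi estimate with its own RNG stream."""
    rng = random.Random(seed)
    inside = sum(1 for _ in range(iterations)
                 if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * inside / iterations

# Redo the same measurement 10 times, as '-r 10' does for the timings.
estimates = [pi_once(10000, seed) for seed in range(10)]
print(statistics.mean(estimates))          # close to Pi
print(max(estimates) - min(estimates))     # the run-to-run spread
```

The spread shrinks as the number of iterations grows, which is why the exercises combine large ''-i'' values with several redos.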
=== Exercise #8: explore ''PiXPU.py'' by increasing the Parallel Rate ''PR'' ===

  * With a ''PR'' from ''1'' to ''64'' set by ''-b'' and ''-e'', 1 billion iterations, 10 redos, on the default device:
    * How does the elapsed time decrease?
  * With the device selector, increasing the number of iterations and the number of redos:
    * What happens to the itops values? What is the typical variability of the results?

<code>./PiXPU.py -d 0 -b 1 -e 64 -i 1000000000 -r 10</code>

In this case, we define a gnuplot config file as follows. Adapt it to your files and configuration.
<code>
set xlabel 'Parallel Rate'
set ylabel 'Itops'
plot 'Pi_FP32_MWC_xPU_OpenCL_1_64_1_1_1000000000_Device0_InMetro_titan' using 1:9 title 'CPU with OpenCL'
</code>

{{ :formation:pimc_1_64_cpu.png?600 |}}

=== Exercise #9: explore ''PiXPU.py'' with large ''PR'' on GPU (mostly powers of 2) ===

  * Explore with ''PR'' from ''2048'' to ''32768'' with a step of 128
  * For which ''PR'' is the itops the highest on your device?

To explore the GPU device (device #1) on this platform, with parallel rates from 2048 to 32768, a step of 128 and 1000000000 iterations: <code>
./PiXPU.py -d 1 -b 2048 -e $((2048*16)) -s 128 -i 1000000000 -r 10
</code>

Output files are:
  * ''Pi_FP32_MWC_xPU_OpenCL_2048_32768_1_1_1000000000_Device1_InMetro_titan.npz''
  * ''Pi_FP32_MWC_xPU_OpenCL_2048_32768_1_1_1000000000_Device1_InMetro_titan''

In this case, you can define a gnuplot config file:
<code>
set xlabel 'Parallel Rate'
set ylabel 'Itops'
plot 'Pi_FP32_MWC_xPU_OpenCL_2048_32768_1_1_1000000000_Device1_InMetro_titan' using 1:9 title 'GTX 1080 Ti'
</code>

{{ :formation:pimc_2048_32768_gtx1080ti.png?600 |}}

=== Exercise #10: explore ''PiXPU.py'' around a large ''PR'' ===

<code>./PiXPU.py -d 1 -b $((2048-8)) -e $((2048+8)) -i 10000000000 -r 10</code>

Output files are:
  * ''Pi_FP32_MWC_xPU_OpenCL_2040_2056_1_1_10000000000_Device1_InMetro_titan''
  * ''Pi_FP32_MWC_xPU_OpenCL_2040_2056_1_1_10000000000_Device1_InMetro_titan.npz''

In this case, you can define a gnuplot config file:
<code>
set xlabel 'Parallel Rate'
set ylabel 'Itops'
plot 'Pi_FP32_MWC_xPU_OpenCL_2040_2056_1_1_10000000000_Device1_InMetro_titan' using 1:9 title 'GTX 1080 Ti'
</code>

{{ :formation:pimc_2040_2056_gtx1080ti.png?600 |}}
==== NBody, a simplistic simulator ====

The ''NBody.py'' code is an implementation of an N-body Keplerian system on OpenCL devices.

It is available:
  * on workstations: ''/scratch/AstroSim2017/NBody.py''
  * on the website: [[http://www.cbp.ens-lyon.fr/emmanuel.quemener/documents/Astrosim2017/NBody.py|NBody.py]]

Launch the code with ''N=2'' for ''1000'' iterations with a graphical output:
<code>
python NBody.py -n 2 -g -i 1000
</code>

{{ :formation:nbody_n2_gpu.png?600 |}}

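Behind this ''N=2'' run, the Keplerian case boils down to integrating Newtonian gravity around a central mass. A minimal plain-Python sketch (symplectic Euler, normalized units G = M = 1); this is only an illustration of the physics, not the ''NBody.py'' OpenCL kernel:

```python
import math

def two_body_orbit(steps=10000, dt=0.001):
    """Integrate a test particle on a circular Keplerian orbit
    (G = M = 1, r = 1, circular speed v = 1) with symplectic Euler."""
    x, y = 1.0, 0.0
    vx, vy = 0.0, 1.0
    for _ in range(steps):
        r3 = (x * x + y * y) ** 1.5
        ax, ay = -x / r3, -y / r3   # Newtonian acceleration toward the origin
        vx += ax * dt               # kick: update velocity first
        vy += ay * dt
        x += vx * dt                # drift: then update position
        y += vy * dt
    return x, y

x, y = two_body_orbit()
print(math.hypot(x, y))  # stays close to 1: the circular orbit is preserved
```

Updating the velocity before the position (a symplectic scheme) keeps the orbit stable over many periods, where a naive Euler step would spiral outward; this choice matters as much on the GPU as in this sketch.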
=== Exercise #11: explore ''NBody.py'' with different devices ===

=== Exercise #12: explore ''NBody.py'' with steps and iterations ===

=== Exercise #13: explore ''NBody.py'' with Double Precision ===
===== Exploration with production codes =====

==== PKDGRAV3 ====