  * **p100alpha**, **p100beta** : virtual workstations with one dedicated Nvidia Tesla P100
  * **k40m** : virtual workstations with one dedicated Nvidia Tesla K40m

Have a look at the [[http://styx.cbp.ens-lyon.fr/ganglia/?r=hour&cs=&ce=&m=load_one&s=by+name&c=Workstations|monitoring website for workstations]] before connecting and launching your jobs! Huge requests may create a DoS!

=== Prerequisite for humanware ===

===== Exploration with the original one: xGEMM =====
==== From BLAS to xGEMM : implementations ====

In the lecture about GPUs, we presented the GPU as a great matrix multiplier. One of the most common linear algebra libraries is BLAS, formally the [[https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms|Basic Linear Algebra Subprograms]].

These subprograms can be considered a //standard//. Many implementations exist on all architectures. On GPU, Nvidia provides its version with [[http://docs.nvidia.com/cuda/cublas/index.html|cuBLAS]] and AMD releases its OpenCL implementation [[https://github.com/clMathLibraries/clBLAS|clBLAS]] as Open Source.

On CPU, Intel sells its optimized implementation in the [[https://software.intel.com/en-us/mkl|MKL libraries]], but an Open Source equivalent exists: [[http://www.openblas.net/|OpenBLAS]]. Several other implementations exist and are deployed on CBP machines: [[http://math-atlas.sourceforge.net/|ATLAS]] and [[https://www.gnu.org/software/gsl/|GSL]].

The implementation of matrix multiply in the BLAS libraries is ''xGEMM'', with ''x'' to be replaced by ''S'', ''D'', ''C'' or ''Z'', respectively for single precision (32 bits), double precision (64 bits), complex single precision and complex double precision.

==== Test examples ====

Inside ''/scratch/Astrosim2017/xGEMM'' are programs implementing xGEMM in single precision, ''xGEMM_SP_<version>'', or double precision, ''xGEMM_DP_<version>'', where ''<version>'' is:
  * ''fblas'' using the ATLAS libraries
  * ''openblas'' using the OpenBLAS libraries
  * ''gsl'' using the GSL libraries
  * ''cublas'' using the cuBLAS libraries with internal memory management
  * ''thunking'' using the cuBLAS libraries with external memory management
  * ''clblas'' using the clBLAS libraries (see the ''xGEMM_SP_clblas'' run below)

The source code and the ''Makefile'' used to compile these examples are available as a tarball:
  * on workstations: ''/scratch/AstroSim2017/xGEMM_EQ_170707.tgz''
  * on the website: [[http://www.cbp.ens-lyon.fr/emmanuel.quemener/documents/Astrosim2017/xGEMM_EQ_170707.tgz|xGEMM_EQ_170707.tgz]]

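To rebuild the binaries yourself, here is a minimal sketch, assuming the tarball unpacks into a single directory containing the ''Makefile'' (the ''xGEMM'' directory name below is an assumption, check with ''ls''):
<code>
# Unpack the sources in your own scratch folder
mkdir -p /scratch/$USER && cd /scratch/$USER
tar xzf /scratch/AstroSim2017/xGEMM_EQ_170707.tgz
# Enter the unpacked directory (name assumed) and build all variants
cd xGEMM && make
</code>
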
Calling a program with the ''-h'' option prints brief usage information. Input parameters are:
  * the size of the square matrix
  * the number of iterations

The output provides:
  * the mean elapsed time of each cycle
  * the estimated number of GFlops
  * the error, estimated from the difference between the traces of the matrix multiply results

Example runs of the several implementations:<code>
# ./xGEMM_SP_fblas 1000 10 1 0
Using FBLAS: 10 iterations for 1000x1000 matrix

Duration of each cycle : 0.2133281000 s
Number of GFlops : 18.741
Error 0.0000000000

# ./xGEMM_SP_gsl 1000 10 1 0
Using GSL: 10 iterations for 1000x1000 matrix

Duration of each cycle : 8.1447937000 s
Number of GFlops : 0.491
Error 0.0000000000

# ./xGEMM_SP_openblas 1000 1000 1 0
Using CBLAS: 1000 iterations for 1000x1000 matrix

Duration of each cycle : 0.0161011820 s
Number of GFlops : 248.305
Error 0.0000000000

# ./xGEMM_SP_cublas 1000 1000 1 0
Using CuBLAS: 1000 iterations for 1000x1000 matrix

Duration of memory allocation : 0.6675190000 s
Duration of memory free : 0.0004700000 s
Duration of each cycle : 0.0005507960 s
Number of GFlops : 7258.586
Error 0.0000000000

# ./xGEMM_SP_thunking 1000 1000 1 0
Using CuBLAS/Thunking: 1000 iterations for 1000x1000 matrix

Duration of each cycle : 0.0143951160 s
Number of GFlops : 277.733
Error 0.0000000000

# ./xGEMM_SP_clblas 1000 1000 1 0
Using CLBLAS: 1000 iterations for 1000x1000 matrix on (1,0)
Device (1,0): GeForce GTX 1080 Ti

Duration of memory allocation : 0.6057190000 s
Duration of memory free : 0.0049670000 s
Duration of each cycle : 0.0029998720 s
Number of GFlops : 1332.724
Error 0.0000000000
</code>

=== Exercise #6 : launch ''xGEMM_<precision>_<implementation>'' with different sizes and iterations ===

  * Which of the CPU implementations is the most powerful?
  * Increase the size of the matrix to ''2000'', ''4000'', ''8000'' on GPU and check the results (a sweep sketch follows this list)
  * Move from single precision to double precision (SP to DP) and examine the elapsed time on CPU
  * Move from single precision to double precision (SP to DP) and examine the elapsed time on GPU

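A minimal sweep sketch for these questions, assuming the binaries in ''/scratch/Astrosim2017/xGEMM'' keep the calling convention of the examples above (size, iterations, then the two trailing arguments left at ''1 0''):
<code>
cd /scratch/Astrosim2017/xGEMM
# GPU: grow the matrix size, 10 iterations each
for SIZE in 1000 2000 4000 8000; do
  ./xGEMM_SP_cublas $SIZE 10 1 0
done
# CPU and GPU: single versus double precision at a fixed size
for BIN in xGEMM_SP_openblas xGEMM_DP_openblas xGEMM_SP_cublas xGEMM_DP_cublas; do
  ./$BIN 2000 10 1 0
done
</code>
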
===== Exploration with dummy codes =====
==== Pi Monte Carlo, a Compute Bound Example ====

The ''PiXPU.py'' code is an implementation of a Pi Monte Carlo "dart dash" on OpenCL and CUDA devices. It's useful to evaluate the compute power of *PU devices: GPUs (Nvidia, AMD and Intel) and CPUs, the CPU being reachable through 3 different OpenCL implementations (pocl, Intel and AMD).

It's available:
  * as a file on workstations: ''/scratch/AstroSim2017/PiXPU.py''
  * on the website: [[http://www.cbp.ens-lyon.fr/emmanuel.quemener/documents/Astrosim2017/PiXPU.py|PiXPU.py]]

Copy ''PiXPU.py'' into your own folder to use it:<code>
mkdir /scratch/$USER
cd /scratch/$USER
cp /scratch/AstroSim2017/PiXPU.py /scratch/$USER
</code>

The documentation is available by calling ''/scratch/$USER/PiXPU.py -h'':<code>
PiXPU.py -o (Out of Core Metrology) -c (Print Curves) -d <DeviceId> -g <CUDA/OpenCL> -i <Iterations> -b <BlocksBegin> -e <BlocksEnd> -s <BlocksStep> -f <ThreadsFirst> -l <ThreadsLast> -t <ThreadssTep> -r <RedoToImproveStats> -m <SHR3/CONG/MWC/KISS> -v <INT32/INT64/FP32/FP64>

Informations about devices detected under OpenCL API:
Device #0 from The pocl project of type *PU : pthread-Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
Device #1 from NVIDIA Corporation of type *PU : GeForce GTX TITAN
Device #2 from Intel(R) Corporation of type *PU : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
Device #3 from Advanced Micro Devices, Inc. of type *PU : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz

Informations about devices detected under CUDA API:
Device #0 of type GPU : GeForce GTX TITAN
</code>

The ''-h'' option also detects the OpenCL and CUDA devices and assigns each one an ID, which must be used to select it. Below is the output of a run with all default parameters: <code>
Devices Identification : [0]
GpuStyle used : OpenCL
Iterations : 1000000
Number of Blocks on begin : 1
Number of Blocks on end : 1
Step on Blocks : 1
Number of Threads on begin : 1
Number of Threads on end : 1
Step on Threads : 1
Number of redo : 1
Metrology done out of XPU : False
Type of Marsaglia RNG used : MWC
Type of variable : FP32
Device #0 from The pocl project of type xPU : pthread-Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
Device #1 from NVIDIA Corporation of type xPU : GeForce GTX 1080 Ti
Device #2 from Intel(R) Corporation of type xPU : Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
Device #3 from Advanced Micro Devices, Inc. of type xPU : Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
('CPU/GPU selected: ', 'pthread-Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz')
Pi estimation 3.14192800
0.03 0.03 0.00 0.03 0.03 37357749 37357749 0 37357749 37357749
</code>

Two files are created by default:
  * ''Pi_FP32_MWC_xPU_OpenCL_1_1_1_1_01000000_Device0_InMetro_titan.npz''
  * ''Pi_FP32_MWC_xPU_OpenCL_1_1_1_1_01000000_Device0_InMetro_titan''

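The ''.npz'' file is a NumPy archive of the measurements; a minimal sketch to peek at it, assuming it was written with ''numpy.savez'':
<code>
# List the arrays stored in the result archive and dump the first one
python -c "import numpy
d = numpy.load('Pi_FP32_MWC_xPU_OpenCL_1_1_1_1_01000000_Device0_InMetro_titan.npz')
print(d.files)
print(d[d.files[0]])"
</code>
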
=== Exercise #7 : explore ''PiXPU.py'' with several simple configurations for ''PR=1'' ===

  * Without any parameters (the defaults):
    * what is the selected device? How many itops (iterative operations per second) do you reach?
  * With only the device parameter, as ''-d 1'' to select device ''#1'', for each of the available devices:
    * what are the ratios between the devices? Which one is the most powerful?
  * With the device selector and increasing numbers of iterations and redos:
    * what happens to the itops values? What is the typical variability of the results?

The corresponding commands, one block per question:

<code>/scratch/$USER/PiXPU.py</code>

<code>
/scratch/$USER/PiXPU.py -d 1
/scratch/$USER/PiXPU.py -d 2
/scratch/$USER/PiXPU.py -d 3
</code>

<code>
/scratch/$USER/PiXPU.py -d 0 -i 100000000 -r 10
/scratch/$USER/PiXPU.py -d 1 -i 100000000 -r 10
/scratch/$USER/PiXPU.py -d 2 -i 100000000 -r 10
/scratch/$USER/PiXPU.py -d 3 -i 100000000 -r 10
</code>

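The device IDs above follow the OpenCL numbering. According to the usage line printed by ''-h'', the ''-g'' switch selects the API, so the same GPU can also be driven through CUDA (note that its CUDA ID may differ from its OpenCL one):
<code>
# Same benchmark through the CUDA API (device #0 in the CUDA listing)
/scratch/$USER/PiXPU.py -g CUDA -d 0 -i 100000000 -r 10
</code>
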
=== Exercise #8 : explore ''PiXPU.py'' by increasing the Parallel Rate ''PR'' ===

  * With a ''PR'' from ''1'' to ''64'' set by ''-b'' and ''-e'', 1 billion iterations, 10 redos, on the default device:
    * how does the elapsed time decrease as ''PR'' grows?
  * With the device selector and increasing numbers of iterations and redos:
    * what happens to the itops values? What is the typical variability of the results?

<code>./PiXPU.py -d 0 -b 1 -e 64 -i 1000000000 -r 10</code>

In this case, we define a gnuplot config file as follows. Adapt it to your files and configuration.
<code>
set xlabel 'Parallel Rate'
set ylabel 'Itops'
plot 'Pi_FP32_MWC_xPU_OpenCL_1_64_1_1_1000000000_Device0_InMetro_titan' using 1:9 title 'CPU with OpenCL'
</code>

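To render the plot, save the configuration to a file (the name ''plot.gp'' below is just an example) and run it through gnuplot:
<code>
# -persist keeps the plot window open after gnuplot exits
gnuplot -persist plot.gp
</code>
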
{{ :formation:pimc_1_64_cpu.png?600 |}}

=== Exercise #9 : explore ''PiXPU.py'' with large ''PR'' on GPU (mostly powers of 2) ===

  * Explore with ''PR'' from ''2048'' to ''32768'' with a step of 128
  * For which ''PR'' is the itops value the highest on your device?

To explore on this platform the GPU device (device #1), with parallel rates from 2048 to 32768, a step of 128, 10000000000 iterations and 10 redos: <code>
./PiXPU.py -d 1 -b 2048 -e $((2048*16)) -s 128 -i 10000000000 -r 10
</code>

Output files are:
  * ''Pi_FP32_MWC_xPU_OpenCL_2048_32768_1_1_10000000000_Device1_InMetro_titan.npz''
  * ''Pi_FP32_MWC_xPU_OpenCL_2048_32768_1_1_10000000000_Device1_InMetro_titan''

In this case, you can define a gnuplot config file:
<code>
set xlabel 'Parallel Rate'
set ylabel 'Itops'
plot 'Pi_FP32_MWC_xPU_OpenCL_2048_32768_1_1_10000000000_Device1_InMetro_titan' using 1:9 title 'GTX 1080 Ti'
</code>

{{ :formation:pimc_2048_32768_gtx1080ti.png?600 |}}

=== Exercise #10 : explore ''PiXPU.py'' around a large ''PR'' ===

<code>./PiXPU.py -d 1 -b $((2048-8)) -e $((2048+8)) -i 10000000000 -r 10</code>

Output files are:
  * ''Pi_FP32_MWC_xPU_OpenCL_2040_2056_1_1_10000000000_Device1_InMetro_titan''
  * ''Pi_FP32_MWC_xPU_OpenCL_2040_2056_1_1_10000000000_Device1_InMetro_titan.npz''

In this case, you can define a gnuplot config file:
<code>
set xlabel 'Parallel Rate'
set ylabel 'Itops'
plot 'Pi_FP32_MWC_xPU_OpenCL_2040_2056_1_1_10000000000_Device1_InMetro_titan' using 1:9 title 'GTX 1080 Ti'
</code>

{{ :formation:pimc_2040_2056_gtx1080ti.png?600 |}}

==== NBody, a simplistic simulator ====

The ''NBody.py'' code is an implementation of an N-body Keplerian system on OpenCL devices.

It's available:
  * as a file on workstations: ''/scratch/AstroSim2017/NBody.py''
  * on the website: [[http://www.cbp.ens-lyon.fr/emmanuel.quemener/documents/Astrosim2017/NBody.py|NBody.py]]

Launch the code with ''N=2'' and ''1000'' iterations, with a graphical output:
<code>
python NBody.py -n 2 -g -i 1000
</code>

{{ :formation:nbody_n2_gpu.png?600 |}}

=== Exercise #11 : explore ''NBody.py'' with different devices ===
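
Only the ''-n'', ''-g'' and ''-i'' options are shown above; assuming ''NBody.py'' also exposes a ''-d'' device selector like ''PiXPU.py'' (check with ''python NBody.py -h''), a possible starting point:
<code>
# Hypothetical -d flag: run the same small case on each OpenCL device
python NBody.py -d 0 -n 2 -i 1000
python NBody.py -d 1 -n 2 -i 1000
</code>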

=== Exercise #12 : explore ''NBody.py'' with steps and iterations ===
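
A sketch using only the documented ''-n'' and ''-i'' options: grow the number of bodies, then the number of iterations, and compare the elapsed times:
<code>
# Sweep the number of bodies at a fixed iteration count
for N in 2 4 8 16 32; do
  python NBody.py -n $N -i 1000
done
# Then sweep the iteration count at a fixed number of bodies
for I in 1000 10000 100000; do
  python NBody.py -n 16 -i $I
done
</code>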

=== Exercise #13 : explore ''NBody.py'' with Double Precision ===
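
''PiXPU.py'' selects the variable type with ''-v FP32'' or ''-v FP64''; assuming ''NBody.py'' follows the same convention (to be verified with ''python NBody.py -h''), a sketch:
<code>
# Hypothetical -v flag, mirroring PiXPU.py: compare FP32 and FP64 runs
python NBody.py -n 16 -i 1000 -v FP32
python NBody.py -n 16 -i 1000 -v FP64
</code>
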
===== Exploration with production codes =====
==== PKDGRAV3 ====