Differences

Below are the differences between the two revisions of the page.

formation:astrosim2017gpu4dummies [2017/07/07 10:22]
equemene [Exploration with original one : xGEMM]
formation:astrosim2017gpu4dummies [2017/07/10 18:52] (current version)
equemene [NBody, a simplistic simulator]
Line 40: Line 40:
  * **p100alpha**, **p100beta**: virtual workstations, each with one dedicated Nvidia Tesla P100
  * **k40m**: virtual workstation with one dedicated Nvidia Tesla K40m
 +
 +Have a look at the [[http://styx.cbp.ens-lyon.fr/ganglia/?r=hour&cs=&ce=&m=load_one&s=by+name&c=Workstations|monitoring website for the workstations]] before connecting and launching your jobs! Huge requests may create a DoS!
  
 === Prerequisite for humanware ===
Line 266: Line 268:
  * ''thunking'' using cuBLAS libraries with external memory management
  
-The source code and ''Makefile'' used to compile these examples are available in the tarball ''/scratch/Astrosim2017/xGEMM_EQ_170707.tgz'' or ''http://www.cbp.ens-lyon.fr/emmanuel.quemener/documents/Astrosim2017/xGEMM_EQ_170707.tgz''
+The source code and the ''Makefile'' used to compile these examples are available in a tarball at:
+  * on workstations: ''/scratch/AstroSim2017/xGEMM_EQ_170707.tgz''
+  * on the website: [[http://www.cbp.ens-lyon.fr/emmanuel.quemener/documents/Astrosim2017/xGEMM_EQ_170707.tgz|xGEMM_EQ_170707.tgz]]
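 +
 +A possible way to fetch and build the examples on a workstation (a minimal sketch: the directory extracted from the tarball is a guess, check the listing first):
 +<code>
 +mkdir -p /scratch/$USER && cd /scratch/$USER
 +tar tzf /scratch/AstroSim2017/xGEMM_EQ_170707.tgz | head    # inspect the layout
 +tar xzf /scratch/AstroSim2017/xGEMM_EQ_170707.tgz
 +cd xGEMM_EQ_170707    # hypothetical directory name, adapt to the listing above
 +make
 +</code>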
  
 +Calling the program with the ''-h'' option prints brief usage information. The input parameters are:
 +  * the size of the square matrices
 +  * the number of iterations
  
 +The output provides:
 +  * the mean elapsed time of each cycle
 +  * the estimated number of GFlops
 +  * the error, estimated as the difference between the traces of the matrix-product results
  
-===== Exploration with dummie codes =====
-==== A GPU, a performant matrix multiplier ====
+Example runs of the several implementations: <code>
+# ./xGEMM_SP_fblas 1000 10 1 0
+Using FBLAS: 10 iterations for 1000x1000 matrix
+
+Duration of each cycle : 0.2133281000 s
+Number of GFlops : 18.741
+Error 0.0000000000
+
+# ./xGEMM_SP_gsl 1000 10 1 0
+Using GSL: 10 iterations for 1000x1000 matrix
+
+Duration of each cycle : 8.1447937000 s
+Number of GFlops : 0.491
+Error 0.0000000000
+
+# ./xGEMM_SP_openblas 1000 1000 1 0
+Using CBLAS: 1000 iterations for 1000x1000 matrix
+
+Duration of each cycle : 0.0161011820 s
+Number of GFlops : 248.305
+Error 0.0000000000
+
+# ./xGEMM_SP_cublas 1000 1000 1 0
+Using CuBLAS: 1000 iterations for 1000x1000 matrix
+
+Duration of memory allocation : 0.6675190000 s
+Duration of memory free : 0.0004700000 s
+Duration of each cycle : 0.0005507960 s
+Number of GFlops : 7258.586
+Error 0.0000000000
+
+# ./xGEMM_SP_thunking 1000 1000 1 0
+Using CuBLAS/Thunking: 1000 iterations for 1000x1000 matrix
+
+Duration of each cycle : 0.0143951160 s
+Number of GFlops : 277.733
+Error 0.0000000000
+
+# ./xGEMM_SP_clblas 1000 1000 1 0
+Using CLBLAS: 1000 iterations for 1000x1000 matrix on (1,0)
+Device (1,0): GeForce GTX 1080 Ti
+
+Duration of memory allocation : 0.6057190000 s
+Duration of memory free : 0.0049670000 s
+Duration of each cycle : 0.0029998720 s
+Number of GFlops : 1332.724
+Error 0.0000000000
+</code>
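 +
 +Note that the reported GFlops figure corresponds to about 4·N³ floating-point operations per cycle (an inference from the outputs above, not checked against the source code). A quick recomputation for the FBLAS run:
 +<code>
 +# 4*N^3 ops / (cycle time * 1e9): ~18.75 GFlops, matching the FBLAS output above within rounding
 +N=1000; T=0.2133281000
 +awk -v n=$N -v t=$T 'BEGIN {printf "%.3f GFlops\n", 4*n^3/(t*1e9)}'
 +</code>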
 + 
 +=== Exercise #6: launch ''xGEMM_<precision>_<implementation>'' with different sizes and iterations ===
 +
 +  * Which of the CPU implementations is the most powerful?
 +  * Increase the matrix size to ''2000'', ''4000'', ''8000'' on the GPU and check the results
 +  * Move from single precision to double precision (SP to DP) and examine the elapsed time on the CPU
 +  * Move from single precision to double precision (SP to DP) and examine the elapsed time on the GPU (possible calls for these runs are sketched below)
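 +
 +Possible calls for this exercise; the ''DP'' binary names are assumed from the ''xGEMM_<precision>_<implementation>'' naming pattern, so adapt them to the binaries actually shipped in the tarball:
 +<code>
 +./xGEMM_SP_openblas 2000 100 1 0
 +./xGEMM_DP_openblas 2000 100 1 0    # assumed DP counterpart of the SP binary
 +./xGEMM_SP_cublas 8000 10 1 0
 +./xGEMM_DP_cublas 8000 10 1 0
 +</code>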
 + 
 +===== Exploration with dummy codes =====
  
 ==== Pi Monte Carlo, a Compute Bound Example ====
  
 +The ''PiXPU.py'' code is an implementation of the PiMC "Pi Dart Dash" for OpenCL and CUDA devices. It's useful to evaluate the compute power of xPU devices (CPUs and GPUs from Nvidia, AMD and Intel), the CPU being reachable through three different OpenCL implementations.
 +
 +It's available:
 +  * on workstations: ''/scratch/AstroSim2017/PiXPU.py''
 +  * on the website: [[http://www.cbp.ens-lyon.fr/emmanuel.quemener/documents/Astrosim2017/PiXPU.py|PiXPU.py]]
 +
 +Copy ''PiXPU.py'' into your own folder to use it: <code>
 +mkdir /scratch/$USER
 +cd /scratch/$USER
 +cp /scratch/AstroSim2017/PiXPU.py /scratch/$USER
 +</code>
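 +
 +The commands below call the script directly; if your copy is not executable, set the execute bit or go through the Python interpreter (a convenience note, the original file may already be executable):
 +<code>
 +chmod +x /scratch/$USER/PiXPU.py
 +# or equivalently:
 +python /scratch/$USER/PiXPU.py -h
 +</code>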
 +
 +The documentation is available by calling ''/scratch/$USER/PiXPU.py -h'': <code>
 +PiXPU.py -o (Out of Core Metrology) -c (Print Curves) -d <DeviceId> -g <CUDA/OpenCL> -i <Iterations> -b <BlocksBegin> -e <BlocksEnd> -s <BlocksStep> -f <ThreadsFirst> -l <ThreadsLast> -t <ThreadssTep> -r <RedoToImproveStats> -m <SHR3/CONG/MWC/KISS> -v <INT32/INT64/FP32/FP64>
 +
 +Informations about devices detected under OpenCL API:
 +Device #0 from The pocl project of type *PU : pthread-Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
 +Device #1 from NVIDIA Corporation of type *PU : GeForce GTX TITAN
 +Device #2 from Intel(R) Corporation of type *PU : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
 +Device #3 from Advanced Micro Devices, Inc. of type *PU : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
 +
 +Informations about devices detected under CUDA API:
 +Device #0 of type GPU : GeForce GTX TITAN
 +</​code>​
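 +
 +For instance, to run on device #0 of the CUDA API instead of the default OpenCL one (the ''-g'', ''-d'' and ''-i'' flags come from the usage line above; the iteration count is just an example):
 +<code>
 +/scratch/$USER/PiXPU.py -g CUDA -d 0 -i 10000000
 +</code>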
 +
 +The ''-h'' option also detects the OpenCL and CUDA devices and assigns each one an ID, to be used to select a device on a specific call. A run with the default parameters prints: <code>
 +Devices Identification : [0]
 +GpuStyle used : OpenCL
 +Iterations : 1000000
 +Number of Blocks on begin : 1
 +Number of Blocks on end : 1
 +Step on Blocks : 1
 +Number of Threads on begin : 1
 +Number of Threads on end : 1
 +Step on Threads : 1
 +Number of redo : 1
 +Metrology done out of XPU : False
 +Type of Marsaglia RNG used : MWC
 +Type of variable : FP32
 +Device #0 from The pocl project of type xPU : pthread-Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
 +Device #1 from NVIDIA Corporation of type xPU : GeForce GTX 1080 Ti
 +Device #2 from Intel(R) Corporation of type xPU : Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
 +Device #3 from Advanced Micro Devices, Inc. of type xPU : Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
 +('CPU/GPU selected: ', 'pthread-Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz')
 +Pi estimation 3.14192800
 +0.03 0.03 0.00 0.03 0.03 37357749 37357749 0 37357749 37357749
 +</​code>​
 +
 +Two files are created by default:
 +  * ''Pi_FP32_MWC_xPU_OpenCL_1_1_1_1_01000000_Device0_InMetro_titan.npz''
 +  * ''Pi_FP32_MWC_xPU_OpenCL_1_1_1_1_01000000_Device0_InMetro_titan''
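 +
 +The ''.npz'' file appears to store the results as NumPy arrays, while the extensionless file is the plain-text table plotted with gnuplot further down this page. A minimal way to peek at the archive (assuming only that NumPy is installed):
 +<code>
 +python -c "import numpy; print(numpy.load('Pi_FP32_MWC_xPU_OpenCL_1_1_1_1_01000000_Device0_InMetro_titan.npz').files)"
 +</code>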
 +
 +=== Exercise #7: explore ''PiXPU.py'' with several simple configurations for ''PR=1'' ===
 +
 +  * Without any parameters (the defaults), as in the first command below:
 +    * What is the selected device? How many itops (iterative operations per second) do you reach?
 +  * With only the device parameter (''-d <id>''), for each of the available devices:
 +    * What are the ratios between the devices? Which one is the most powerful?
 +  * With the device selector plus an increased number of iterations and redos:
 +    * What happens to the itops values? What is the typical variability of the results?
 +
 +<​code>/​scratch/​$USER/​PiXPU.py</​code>​
 +
 +<​code>​
 +/​scratch/​$USER/​PiXPU.py -d 1
 +/​scratch/​$USER/​PiXPU.py -d 2
 +/​scratch/​$USER/​PiXPU.py -d 3
 +</​code>​
 +
 +<​code>​
 +/​scratch/​$USER/​PiXPU.py -d 0 -i 100000000 -r 10
 +/​scratch/​$USER/​PiXPU.py -d 1 -i 100000000 -r 10
 +/​scratch/​$USER/​PiXPU.py -d 2 -i 100000000 -r 10
 +/​scratch/​$USER/​PiXPU.py -d 3 -i 100000000 -r 10
 +</​code>​
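 +
 +The same scan can be compacted into a shell loop over the device IDs:
 +<code>
 +for d in 0 1 2 3; do /scratch/$USER/PiXPU.py -d $d -i 100000000 -r 10; done
 +</code>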
 +
 +=== Exercise #8: explore ''PiXPU.py'' by increasing the parallel rate ''PR'' ===
 +
 +  * With a ''PR'' from ''1'' to ''64'' set by ''-b'' and ''-e'', 1 billion iterations and 10 redos, on the default device (the command below):
 +    * How does the elapsed time evolve with the parallel rate?
 +  * With the device selector plus an increased number of iterations and redos:
 +    * What happens to the itops values? What is the typical variability of the results?
 +
 +<​code>​./​PiXPU.py -d 0 -b 1 -e 32 -i 1000000000 -r 10</​code>​
 +
 +In this case, we define a gnuplot config file as follows; adapt it to your own files and configuration.
 +<code>
 +set xlabel 'Parallel Rate'
 +set ylabel 'Itops'
 +plot 'Pi_FP32_MWC_xPU_OpenCL_1_64_1_1_1000000000_Device0_InMetro_titan' using 1:9 title 'CPU with OpenCL'
 +</code>
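 +
 +Save those lines to a file and feed it to gnuplot (''pimc_cpu.gp'' is a hypothetical name for the config file):
 +<code>
 +gnuplot -persist pimc_cpu.gp
 +</code>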
 +
 +{{ :formation:pimc_1_64_cpu.png?600 |}}
 +
 +=== Exercise #9: explore ''PiXPU.py'' with large ''PR'' on GPU (mostly powers of 2) ===
 +
 +  * Explore with ''PR'' from ''2048'' to ''32768'' with a step of 128
 +  * For which ''PR'' is the itops value the highest on your device?
 +
 +To explore the GPU device on this platform (device #1) with parallel rates from 2048 to 32768, a step of 128 and 10000000000 iterations: <code>
 +./PiXPU.py -d 1 -b 2048 -e $((2048*16)) -s 128 -i 10000000000 -r 10
 +</code>
 +
 +Output files are:
 +  * ''Pi_FP32_MWC_xPU_OpenCL_2048_32768_1_1_10000000000_Device1_InMetro_titan.npz''
 +  * ''Pi_FP32_MWC_xPU_OpenCL_2048_32768_1_1_10000000000_Device1_InMetro_titan''
 +
 +In this case, you can define a gnuplot config file:
 +<code>
 +set xlabel 'Parallel Rate'
 +set ylabel 'Itops'
 +plot 'Pi_FP32_MWC_xPU_OpenCL_2048_32768_1_1_10000000000_Device1_InMetro_titan' using 1:9 title 'GTX 1080 Ti'
 +</code>
 +
 +{{ :formation:pimc_2048_32768_gtx1080ti.png?600 |}}
 +
 +=== Exercise #10: explore ''PiXPU.py'' around a large ''PR'' ===
 +
 +<code>./PiXPU.py -d 1 -b $((2048-8)) -e $((2048+8)) -i 10000000000 -r 10</code>
 +
 +Output files are:
 +  * ''Pi_FP32_MWC_xPU_OpenCL_2040_2056_1_1_10000000000_Device1_InMetro_titan''
 +  * ''Pi_FP32_MWC_xPU_OpenCL_2040_2056_1_1_10000000000_Device1_InMetro_titan.npz''
 +
 +In this case, you can define a gnuplot config file:
 +<code>
 +set xlabel 'Parallel Rate'
 +set ylabel 'Itops'
 +plot 'Pi_FP32_MWC_xPU_OpenCL_2040_2056_1_1_10000000000_Device1_InMetro_titan' using 1:9 title 'GTX 1080 Ti'
 +</code>
 +
 +{{ :formation:pimc_2040_2056_gtx1080ti.png?600 |}}
 ==== NBody, a simplistic simulator ====
  
 +The ''NBody.py'' code is an implementation of an N-body Keplerian system for OpenCL devices.
 +
 +It's available:
 +  * on workstations: ''/scratch/AstroSim2017/NBody.py''
 +  * on the website: [[http://www.cbp.ens-lyon.fr/emmanuel.quemener/documents/Astrosim2017/NBody.py|NBody.py]]
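 +
 +As with ''PiXPU.py'', copy it into your own scratch folder first:
 +<code>
 +cp /scratch/AstroSim2017/NBody.py /scratch/$USER
 +cd /scratch/$USER
 +</code>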
 +
 +Launch the code with ''N=2'' bodies for ''1000'' iterations with graphical output:
 +<code>
 +python NBody.py -n 2 -g -i 1000
 +</code>
 +
 +{{ :formation:nbody_n2_gpu.png?600 |}}
 +
 +
 +=== Exercise #11: explore ''NBody.py'' with different devices ===
 +
 +=== Exercise #12: explore ''NBody.py'' with steps and iterations ===
 +
 +=== Exercise #13: explore ''NBody.py'' with double precision ===
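 +
 +Only the ''-n'', ''-g'' and ''-i'' options appear above; assuming ''NBody.py'' follows the same ''-h'' convention as ''PiXPU.py'', list the actual flags for device selection, time step and precision before starting these exercises. The second call below is a guess built only from the options already shown:
 +<code>
 +python NBody.py -h                # list the available options and detected devices
 +python NBody.py -n 16 -i 10000    # larger N and more iterations, without the GUI
 +</code>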
  
 ===== Exploration with production codes =====
  
 ==== PKDGRAV3 ====
 +
  
formation/astrosim2017gpu4dummies.1499415740.txt.gz · Last modified: 2017/07/07 10:22 by equemene