Practical work support for Astrosim 2017
5W/2H: the French mnemonic is CQQCOQP (How? What? Who? How much? Where? When? Why?)…
The goal is to get hands-on experience with the GPU components inside the machines and to compare their performance against classical CPUs, through both simplistic examples and production codes.
In order to provide a complete, functional environment, the Blaise Pascal Center supplies well-configured hardware, software, and operating systems. People who want to follow this practical session on their own laptop must have a real Unix operating system.
People who want to use a powerful GPU, GPGPU, or accelerator can connect to the following machines.
Hardware in computer science is defined by the von Neumann architecture:
GPUs are normally considered Input/Output devices. As peripherals installed in PC machines, they use an interconnection bus, PCI or PCI Express.
To get the list of PCI devices, use the lspci -nn
command. In this huge list appear some VGA or 3D devices: these are GPU or GPGPU devices.
Here is an output of lspci -nn | egrep '(VGA|3D)'
:
```
06:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] (rev ca)
82:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
```
All of the large workstations hold Nvidia boards.
In POSIX operating systems, everything is a file. Information about the Nvidia board and its discovery by the operating system at boot time can be obtained with a grep
in the dmesg
output.
The messages look cryptic, but they carry very important information:
```
[   19.545688] NVRM: The NVIDIA GPU 0000:82:00.0 (PCI ID: 10de:1b06)
               NVRM: NVIDIA Linux driver release. Please see 'Appendix
               NVRM: A - Supported NVIDIA GPU Products' in this release's
               NVRM: at www.nvidia.com.
[   19.545903] nvidia: probe of 0000:82:00.0 failed with error -1
[   19.546254] NVRM: The NVIDIA probe routine failed for 1 device(s).
[   19.546491] NVRM: None of the NVIDIA graphics adapters were initialized!
[   19.782970] nvidia-nvlink: Nvlink Core is being initialized, major device number 244
[   19.783084] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  375.66  Mon May 1 15:29:16 PDT 2017 (using threaded interrupts)
[   19.814046] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  375.66  Mon May 1 14:33:30 PDT 2017
[   20.264453] [drm] [nvidia-drm] [GPU ID 0x00008200] Loading driver
[   23.360807] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:80/0000:80:02.0/0000:82:00.1/sound/card2/input19
[   23.360885] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:80/0000:80:02.0/0000:82:00.1/sound/card2/input20
[   23.360996] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:80/0000:80:02.0/0000:82:00.1/sound/card2/input21
[   23.361065] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:80/0000:80:02.0/0000:82:00.1/sound/card2/input22
[   32.896510] [drm] [nvidia-drm] [GPU ID 0x00008200] Unloading driver
[   32.935658] nvidia-modeset: Unloading
[   32.967939] nvidia-nvlink: Unregistered the Nvlink Core, major device number 244
[   33.034671] nvidia-nvlink: Nvlink Core is being initialized, major device number 244
[   33.034724] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  375.66  Mon May 1 15:29:16 PDT 2017 (using threaded interrupts)
[   33.275804] nvidia-nvlink: Unregistered the Nvlink Core, major device number 244
[   33.993460] nvidia-nvlink: Nvlink Core is being initialized, major device number 244
[   33.993486] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  375.66  Mon May 1 15:29:16 PDT 2017 (using threaded interrupts)
[   35.110461] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  375.66  Mon May 1 14:33:30 PDT 2017
[   35.111628] nvidia-modeset: Allocated GPU:0 (GPU-ccc95482-6681-052e-eb30-20b138412b92) @ PCI:0000:82:00.0
[349272.210486] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 243
```
What is the input: HDA NVidia
device? Is it a graphical one?
The lsmod
command provides the list of loaded modules. Modules are small programs, each dedicated to the support of one function in the kernel, the engine of the operating system. The support of a device needs one or several modules.
An example of lsmod | grep nvidia
on a workstation:
```
nvidia_uvm            638976  0
nvidia_modeset        790528  2
nvidia              12312576  42 nvidia_modeset,nvidia_uvm
```
We see that 3 modules are loaded. The last column (empty for the first two lines) lists the dependencies between modules. Here, nvidia_modeset
and nvidia_uvm
depend on the nvidia
module.
The device also appears in /dev
, the root folder for devices.
A ls -l /dev/nvidia*
provides this kind of information:
```
crw-rw-rw- 1 root root 195,   0 Jun 30 18:17 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Jun 30 18:17 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 Jun 30 18:17 /dev/nvidia-modeset
crw-rw-rw- 1 root root 243,   0 Jul  4 19:17 /dev/nvidia-uvm
crw-rw-rw- 1 root root 243,   1 Jul  4 19:17 /dev/nvidia-uvm-tools
```
We can see that everybody can access the device. There is only one Nvidia device here, nvidia0
. On a machine with multiple Nvidia GPUs, we get nvidia0
, nvidia1
, etc.
How many /dev/nvidia<number>
devices do you get?
Nvidia provides information about its recognized devices via the nvidia-smi
command. This command can also be used to tune some settings inside the GPU.
An example of nvidia-smi
output:
```
Fri Jul  7 07:46:56 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 0000:82:00.0      On |                  N/A |
| 23%   31C    P8    10W / 250W |     35MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      4108    G   /usr/lib/xorg/Xorg                              32MiB |
+-----------------------------------------------------------------------------+
```
Lots of information is available in this output: the nvidia-smi
and driver versions, the id of each GPU, its temperature, power draw, memory usage, and the processes using it.
As we saw in the introduction on GPUs, programming them can be achieved in several ways. The first, for Nvidia devices, is to use the CUDA environment. The problem is that it is then impossible to reuse your program on another platform or to compare it directly with a CPU. OpenCL is a more agnostic way.
On the CBP workstations, all the existing OpenCL implementations are installed.
The clinfo
command provides information about OpenCL devices. Here is an example of a short output with clinfo -l
:
```
Platform #0: Clover
Platform #1: Portable Computing Language
 `-- Device #0: pthread-Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
Platform #2: NVIDIA CUDA
 `-- Device #0: GeForce GTX 1080 Ti
Platform #3: Intel(R) OpenCL
 `-- Device #0: Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
Platform #4: AMD Accelerated Parallel Processing
 `-- Device #0: Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
```
* #0 Clover is a GPU implementation, based on the open-source GNU/Linux drivers and provided by Mesa.
* #1 Portable Computing Language is a CPU implementation; not very efficient, but open source.
* #2 NVIDIA CUDA is a GPU implementation; the detected devices are listed below it.
* #3 Intel(R) OpenCL is a CPU implementation provided by Intel; very efficient, but its floating-point results are sometimes strange.
* #4 AMD Accelerated Parallel Processing is a CPU implementation provided by AMD; rather efficient, and the oldest one.
The clinfo
command without options provides lots of (too much…) information. You can restrict it, for example, to a few attributes such as Platform Name
, Device Name
, Max compute
, Max clock
.
On the example platform, the command clinfo | egrep '(Platform Name|Device Name|Max compute|Max clock)'
provides the output:
```
  Platform Name                       Clover
  Platform Name                       Portable Computing Language
  Platform Name                       NVIDIA CUDA
  Platform Name                       Intel(R) OpenCL
  Platform Name                       AMD Accelerated Parallel Processing
  Platform Name                       Clover
  Platform Name                       Portable Computing Language
  Device Name                         pthread-Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
  Max compute units                   32
  Max clock frequency                 2401MHz
  Platform Name                       NVIDIA CUDA
  Device Name                         GeForce GTX 1080 Ti
  Max compute units                   28
  Max clock frequency                 1582MHz
  Platform Name                       Intel(R) OpenCL
  Device Name                         Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
  Max compute units                   32
  Max clock frequency                 2400MHz
  Platform Name                       AMD Accelerated Parallel Processing
  Device Name                         Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
  Max compute units                   32
  Max clock frequency                 1200MHz
```
In the lecture about GPUs, we presented the GPU as a great matrix multiplier. One of the most common linear algebra libraries is BLAS, that is, the Basic Linear Algebra Subprograms.
These subprograms can be considered a standard, and lots of implementations exist on all architectures. On GPU, Nvidia provides its version with cuBLAS, and AMD releases its OpenCL implementation, clBLAS, as open source.
On CPU, Intel sells its optimized implementation in the MKL libraries, but an open-source equivalent exists, OpenBLAS. Several other implementations exist and are deployed on the CBP machines: ATLAS and GSL.
The implementation of matrix multiply in the BLAS libraries is xGEMM
, with x
to be replaced by S
, D
, C
or Z
, respectively for single precision (32 bits), double precision (64 bits), complex single precision, and complex double precision.
Inside /scratch/Astrosim2017/xGEMM
are programs implementing xGEMM in single precision, xGEMM_SP_<version>
, or double precision, xGEMM_DP_<version>
:
* fblas: using the ATLAS libraries
* openblas: using the OpenBLAS libraries
* gsl: using the GSL libraries
* cublas: using the cuBLAS libraries with internal memory management
* thunking: using the cuBLAS libraries with external memory management
The source code and the Makefile
used to compile these examples are available as a tarball at:
/scratch/AstroSim2017/xGEMM_EQ_170707.tgz
Calling a program with the -h
option prints brief information on how to launch it. The input parameters are:
The output provides: