  * [[developpement:activites:integration|Integration]]

====== Using GPUs: PyCUDA, PyOpenCL and PyFFT on Debian Squeeze ======

[[http://packages.python.org/pyfft/|PyFFT]] makes it trivially simple to call GPU FFT routines from your Python scripts, in place of the standard calls: it is certainly the easiest way to put your GPU to work on these operations.

The prerequisite is to install the graphics drivers and the [[http://www.cbp.ens-lyon.fr/emmanuel.quemener/dokuwiki/doku.php?id=cuda4squeeze|Nvidia]] or [[http://www.cbp.ens-lyon.fr/emmanuel.quemener/dokuwiki/doku.php?id=stream4squeeze|ATI]] environment appropriate to your hardware. You also need to have installed either [[http://mathema.tician.de/software/pycuda|PyCUDA]] or [[http://mathema.tician.de/software/pyopencl|PyOpenCL]], two //wrappers// created and maintained by Andreas Klöckner.

===== Installing PyCUDA =====

To install it in an Nvidia environment on Debian Squeeze, see [[http://www.cbp.ens-lyon.fr/emmanuel.quemener/dokuwiki/doku.php?id=cuda4squeeze|CUDA on Squeeze]] and, more specifically, [[http://www.cbp.ens-lyon.fr/emmanuel.quemener/dokuwiki/doku.php?id=cuda4squeeze#installation_de_pycuda|PyCUDA on Nvidia]].

===== Installing PyOpenCL =====

==== On Nvidia ====

To install it in an Nvidia environment on Debian Squeeze, see [[http://www.cbp.ens-lyon.fr/emmanuel.quemener/dokuwiki/doku.php?id=cuda4squeeze|CUDA on Squeeze]] and, more specifically, [[http://www.cbp.ens-lyon.fr/emmanuel.quemener/dokuwiki/doku.php?id=cuda4squeeze#installation_de_pyopencl|PyOpenCL on Nvidia]].

==== On ATI ====

To install it in an ATI environment on Debian Squeeze, see [[http://www.cbp.ens-lyon.fr/emmanuel.quemener/dokuwiki/doku.php?id=stream4squeeze|Stream SDK on Squeeze]] and, more specifically, [[http://www.cbp.ens-lyon.fr/emmanuel.quemener/dokuwiki/doku.php?id=stream4squeeze#installation_de_pyopencl|PyOpenCL on ATI]].

===== Installing PyFFT =====

All the following commands are to be run as ''root''.

==== Downloading the sources ====

<code>
cd /opt
git clone https://github.com/Manticore/pyfft
</code>

==== Building and installing the library ====

<code>
cd /opt
#tar xzf /root/pyfft-0.3.5.tar.gz
#cd pyfft-0.3.5
cd pyfft
export PYFFT=$PWD
python setup.py build
sudo python setup.py install
</code>

===== Running the integration tests =====

<code>
cd $PYFFT/test
python test_performance.py
</code>
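Before comparing the figures below, you can check that PyFFT imports correctly and see which OpenCL platforms and devices the test script will target. This is a minimal sketch, not part of the test suite, assuming PyOpenCL was installed as described above:

<code python>
# Check that pyfft imports correctly and list the OpenCL devices
# that test_performance.py will be able to target.
import pyopencl as cl
import pyfft

for platform in cl.get_platforms():
    print "Platform:", platform.name
    for device in platform.get_devices():
        print "  Device:", device.name
</code>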
==== On a Dell Precision 360 with an ATI Radeon HD4890 ====

Running performance tests, single precision, fast math...

  * cl, (16,), batch 262144: 12.6500844955 ms, 6.63126637846 GFLOPS
  * cl, (1024,), batch 4096: 11.2925052643 ms, 18.5711846124 GFLOPS
  * cl, (8192,), batch 512: 18.1638002396 ms, 15.0095110277 GFLOPS
  * cl, (16, 16), batch 16384: 20.2838897705 ms, 8.27120251087 GFLOPS
  * cl, (128, 128), batch 256: 15.6512022018 ms, 18.7590241448 GFLOPS
  * cl, (1024, 1024), batch 4: 20.5834150314 ms, 20.3771045456 GFLOPS
  * cl, (8, 8, 64), batch 1024: 13.6965036392 ms, 18.3739037808 GFLOPS
  * cl, (16, 16, 16), batch 1024: 28.2988071442 ms, 8.8928921533 GFLOPS
  * cl, (16, 16, 128), batch 128: 23.5196828842 ms, 13.3748742085 GFLOPS
  * cl, (32, 32, 128), batch 32: 23.3653783798 ms, 15.2582951667 GFLOPS
  * cl, (128, 128, 128), batch 2: 31.716299057 ms, 13.8856655125 GFLOPS

Running performance tests, single precision, accurate math...

  * cl, (16,), batch 262144: 12.9328966141 ms, 6.48625613451 GFLOPS
  * cl, (1024,), batch 4096: 19.5894002914 ms, 10.7055446762 GFLOPS
  * cl, (8192,), batch 512: 26.6664981842 ms, 10.2236805942 GFLOPS
  * cl, (16, 16), batch 16384: 20.7551002502 ms, 8.08341843581 GFLOPS
  * cl, (128, 128), batch 256: 19.7580814362 ms, 14.8598071604 GFLOPS
  * cl, (1024, 1024), batch 4: 40.1411056519 ms, 10.4489000288 GFLOPS
  * cl, (8, 8, 64), batch 1024: 14.3758058548 ms, 17.5056788149 GFLOPS
  * cl, (16, 16, 16), batch 1024: 29.2860984802 ms, 8.5930954637 GFLOPS
  * cl, (16, 16, 128), batch 128: 24.3818998337 ms, 12.9018986275 GFLOPS
  * cl, (32, 32, 128), batch 32: 24.8661994934 ms, 14.3373674813 GFLOPS
  * cl, (128, 128, 128), batch 2: 37.5190019608 ms, 11.7381032806 GFLOPS

==== On an Apple iMac with an ATI Radeon HD 4850M ====

Running performance tests, single precision, fast math...

  * cl, (16,), batch 262144: 22.1525907516 ms, 3.78673902933 GFLOPS
  * cl, (1024,), batch 4096: 19.5072889328 ms, 10.7506071563 GFLOPS
  * cl, (8192,), batch 512: 31.4644098282 ms, 8.66470280195 GFLOPS
  * cl, (16, 16), batch 16384: 35.2179050446 ms, 4.76383134624 GFLOPS
  * cl, (128, 128), batch 256: 27.0842075348 ms, 10.8403127403 GFLOPS
  * cl, (1024, 1024), batch 4: 36.1826896667 ms, 11.592018279 GFLOPS
  * cl, (8, 8, 64), batch 1024: 24.2835998535 ms, 10.3633003969 GFLOPS
  * cl, (16, 16, 16), batch 1024: 48.9276885986 ms, 5.14347289251 GFLOPS
  * cl, (16, 16, 128), batch 128: 40.2245998383 ms, 7.82040843824 GFLOPS
  * cl, (32, 32, 128), batch 32: 40.5035972595 ms, 8.80207843554 GFLOPS
  * cl, (128, 128, 128), batch 2: 47.1528053284 ms, 9.33988798616 GFLOPS

Running performance tests, single precision, accurate math...

  * cl, (16,), batch 262144: 22.700881958 ms, 3.69527845461 GFLOPS
  * cl, (1024,), batch 4096: 32.2904825211 ms, 6.49464435421 GFLOPS
  * cl, (8192,), batch 512: 1405.95219135 ms, 0.19391111709 GFLOPS
  * cl, (16, 16), batch 16384: 981.485915184 ms, 0.17093690027 GFLOPS
  * cl, (128, 128), batch 256: 307.555294037 ms, 0.954629251041 GFLOPS
  * cl, (1024, 1024), batch 4: 66.2887096405 ms, 6.32732787038 GFLOPS
  * cl, (8, 8, 64), batch 1024: 24.7946977615 ms, 10.1496796783 GFLOPS
  * cl, (16, 16, 16), batch 1024: 50.8825778961 ms, 4.9458626195 GFLOPS
  * cl, (16, 16, 128), batch 128: 42.073392868 ms, 7.47676330708 GFLOPS
  * cl, (32, 32, 128), batch 32: 43.0013895035 ms, 8.29079813738 GFLOPS
  * cl, (128, 128, 128), batch 2: 58.1202983856 ms, 7.57742014809 GFLOPS

==== On a Dell Latitude E6410 with an Nvidia GT218 (NVS 3100M) ====

Running performance tests, single precision, fast math...
  * cuda, (16,), batch 262144: 10.8457084656 ms, 7.73449519377 GFLOPS
  * cuda, (1024,), batch 4096: 15.1082717896 ms, 13.8808199191 GFLOPS
  * cuda, (8192,), batch 512: 23.5022171021 ms, 11.6001719674 GFLOPS
  * cuda, (16, 16), batch 16384: 16.9765533447 ms, 9.88258079206 GFLOPS
  * cuda, (128, 128), batch 256: 21.0505889893 ms, 13.9474140201 GFLOPS
  * cuda, (1024, 1024), batch 4: 35.6067626953 ms, 11.7795151328 GFLOPS
  * cuda, (8, 8, 64), batch 1024: 20.4859710693 ms, 12.2844184026 GFLOPS
  * cuda, (16, 16, 16), batch 1024: 23.0558822632 ms, 10.9151424841 GFLOPS
  * cuda, (16, 16, 128), batch 128: 21.3567718506 ms, 14.7294170767 GFLOPS
  * cuda, (32, 32, 128), batch 32: 23.2541030884 ms, 15.3313090015 GFLOPS
  * cuda, (128, 128, 128), batch 2: 33.5932098389 ms, 13.1098493449 GFLOPS
  * cl, (16,), batch 262144: 10.9880924225 ms, 7.63427142534 GFLOPS
  * cl, (1024,), batch 4096: 14.8217201233 ms, 14.149180949 GFLOPS
  * cl, (8192,), batch 512: 23.6032962799 ms, 11.5504951837 GFLOPS
  * cl, (16, 16), batch 16384: 17.1283960342 ms, 9.79497202567 GFLOPS
  * cl, (128, 128), batch 256: 20.6398010254 ms, 14.2250053496 GFLOPS
  * cl, (1024, 1024), batch 4: 34.9160909653 ms, 12.0125245526 GFLOPS
  * cl, (8, 8, 64), batch 1024: 20.4857826233 ms, 12.2845314054 GFLOPS
  * cl, (16, 16, 16), batch 1024: 23.1196880341 ms, 10.8850188475 GFLOPS
  * cl, (16, 16, 128), batch 128: 20.8853006363 ms, 15.0619234781 GFLOPS
  * cl, (32, 32, 128), batch 32: 22.8240013123 ms, 15.620216417 GFLOPS
  * cl, (128, 128, 128), batch 2: 33.2809925079 ms, 13.232836127 GFLOPS

Running performance tests, single precision, accurate math...

  * cuda, (16,), batch 262144: 16.0340286255 ms, 5.23175316443 GFLOPS
  * cuda, (1024,), batch 4096: 32.7980010986 ms, 6.39414577033 GFLOPS
  * cuda, (8192,), batch 512: 44.2999786377 ms, 6.15417362229 GFLOPS
  * cuda, (16, 16), batch 16384: 23.9424224854 ms, 7.00731766398 GFLOPS
  * cuda, (128, 128), batch 256: 37.4869567871 ms, 7.83209161702 GFLOPS
  * cuda, (1024, 1024), batch 4: 70.3075195312 ms, 5.96565492278 GFLOPS
  * cuda, (8, 8, 64), batch 1024: 24.4130752563 ms, 10.3083383538 GFLOPS
  * cuda, (16, 16, 16), batch 1024: 32.3656616211 ms, 7.77547027915 GFLOPS
  * cuda, (16, 16, 128), batch 128: 33.9374816895 ms, 9.26918511157 GFLOPS
  * cuda, (32, 32, 128), batch 32: 37.5233032227 ms, 9.50118484731 GFLOPS
  * cuda, (128, 128, 128), batch 2: 57.7268981934 ms, 7.62905913505 GFLOPS
  * cl, (16,), batch 262144: 29.4173002243 ms, 2.85159002901 GFLOPS
  * cl, (1024,), batch 4096: 81.5240859985 ms, 2.57243239751 GFLOPS
  * cl, (8192,), batch 512: 134.916901588 ms, 2.02072354753 GFLOPS
  * cl, (16, 16), batch 16384: 45.1912164688 ms, 3.71249488528 GFLOPS
  * cl, (128, 128), batch 256: 79.5057058334 ms, 3.69283282152 GFLOPS
  * cl, (1024, 1024), batch 4: 202.758383751 ms, 2.06862173707 GFLOPS
  * cl, (8, 8, 64), batch 1024: 35.9812021255 ms, 6.99415875884 GFLOPS
  * cl, (16, 16, 16), batch 1024: 61.35160923 ms, 4.10190120778 GFLOPS
  * cl, (16, 16, 128), batch 128: 70.4071044922 ms, 4.46791275211 GFLOPS
  * cl, (32, 32, 128), batch 32: 77.3554086685 ms, 4.60880300598 GFLOPS
  * cl, (128, 128, 128), batch 2: 121.149802208 ms, 3.63518480405 GFLOPS

==== On a Dell Precision 360 with an Nvidia GTX260 ====

Running performance tests, single precision, fast math...
  * cuda, (16,), batch 262144: 1.15618877411 ms, 72.5539651297 GFLOPS
  * cuda, (1024,), batch 4096: 1.53842878342 ms, 136.317782312 GFLOPS
  * cuda, (8192,), batch 512: 2.30843849182 ms, 118.101375006 GFLOPS
  * cuda, (16, 16), batch 16384: 1.87551994324 ms, 89.4536795543 GFLOPS
  * cuda, (128, 128), batch 256: 2.12306556702 ms, 138.291197673 GFLOPS
  * cuda, (1024, 1024), batch 4: 3.58481292725 ms, 117.002032885 GFLOPS
  * cuda, (8, 8, 64), batch 1024: 2.41319999695 ms, 104.284037924 GFLOPS
  * cuda, (16, 16, 16), batch 1024: 2.61318397522 ms, 96.3032998772 GFLOPS
  * cuda, (16, 16, 128), batch 128: 2.45464000702 ms, 128.154352207 GFLOPS
  * cuda, (32, 32, 128), batch 32: 2.46014080048 ms, 144.916843756 GFLOPS
  * cuda, (128, 128, 128), batch 2: 3.29523506165 ms, 133.648104539 GFLOPS
  * cl, (16,), batch 262144: 1.17750167847 ms, 71.2407307217 GFLOPS
  * cl, (1024,), batch 4096: 1.52101516724 ms, 137.878441003 GFLOPS
  * cl, (8192,), batch 512: 2.36279964447 ms, 115.384205613 GFLOPS
  * cl, (16, 16), batch 16384: 1.96299552917 ms, 85.4674182934 GFLOPS
  * cl, (128, 128), batch 256: 2.26821899414 ms, 129.441328531 GFLOPS
  * cl, (1024, 1024), batch 4: 3.52740287781 ms, 118.906292967 GFLOPS
  * cl, (8, 8, 64), batch 1024: 2.51619815826 ms, 100.015270726 GFLOPS
  * cl, (16, 16, 16), batch 1024: 2.69598960876 ms, 93.3454042931 GFLOPS
  * cl, (16, 16, 128), batch 128: 2.60119438171 ms, 120.9339841 GFLOPS
  * cl, (32, 32, 128), batch 32: 2.57868766785 ms, 138.254758203 GFLOPS
  * cl, (128, 128, 128), batch 2: 3.4695148468 ms, 126.934727028 GFLOPS

Running performance tests, single precision, accurate math...

  * cuda, (16,), batch 262144: 1.59408321381 ms, 52.6234008824 GFLOPS
  * cuda, (1024,), batch 4096: 3.1795999527 ms, 65.9564734935 GFLOPS
  * cuda, (8192,), batch 512: 4.20476150513 ms, 64.8383409303 GFLOPS
  * cuda, (16, 16), batch 16384: 2.29162559509 ms, 73.2109819157 GFLOPS
  * cuda, (128, 128), batch 256: 3.59081611633 ms, 81.7644987903 GFLOPS
  * cuda, (1024, 1024), batch 4: 6.88502731323 ms, 60.9192064052 GFLOPS
  * cuda, (8, 8, 64), batch 1024: 2.71439990997 ms, 92.7122930838 GFLOPS
  * cuda, (16, 16, 16), batch 1024: 2.97145595551 ms, 84.6918964199 GFLOPS
  * cuda, (16, 16, 128), batch 128: 3.00780487061 ms, 104.585507881 GFLOPS
  * cuda, (32, 32, 128), batch 32: 3.61416625977 ms, 98.6440064944 GFLOPS
  * cuda, (128, 128, 128), batch 2: 5.48771514893 ms, 80.2523287103 GFLOPS
  * cl, (16,), batch 262144: 2.75559425354 ms, 30.4421015148 GFLOPS
  * cl, (1024,), batch 4096: 5.65969944 ms, 37.0541231426 GFLOPS
  * cl, (8192,), batch 512: 9.34660434723 ms, 29.1688563966 GFLOPS
  * cl, (16, 16), batch 16384: 3.98399829865 ms, 42.1115039274 GFLOPS
  * cl, (128, 128), batch 256: 6.76999092102 ms, 43.3680463423 GFLOPS
  * cl, (1024, 1024), batch 4: 14.0426874161 ms, 29.8682429917 GFLOPS
  * cl, (8, 8, 64), batch 1024: 3.82699966431 ms, 65.7586260974 GFLOPS
  * cl, (16, 16, 16), batch 1024: 5.26819229126 ms, 47.7693725071 GFLOPS
  * cl, (16, 16, 128), batch 128: 5.97939491272 ms, 52.6094704551 GFLOPS
  * cl, (32, 32, 128), batch 32: 6.90059661865 ms, 51.6644950723 GFLOPS
  * cl, (128, 128, 128), batch 2: 10.1180076599 ms, 43.5265454231 GFLOPS

==== On a Dell Precision 390 with an Nvidia Tesla C1060 ====

Running performance tests, single precision, fast math...
  * cuda, (16,), batch 262144: 1.05226564407 ms, 79.7194895343 GFLOPS
  * cuda, (1024,), batch 4096: 1.33839044571 ms, 156.692092859 GFLOPS
  * cuda, (8192,), batch 512: 2.34901752472 ms, 116.061186062 GFLOPS
  * cuda, (16, 16), batch 16384: 1.85666236877 ms, 90.3622343091 GFLOPS
  * cuda, (128, 128), batch 256: 2.30894393921 ms, 127.158254046 GFLOPS
  * cuda, (1024, 1024), batch 4: 3.64037437439 ms, 115.216281861 GFLOPS
  * cuda, (8, 8, 64), batch 1024: 2.71556797028 ms, 92.6724142996 GFLOPS
  * cuda, (16, 16, 16), batch 1024: 2.69772796631 ms, 93.2852545338 GFLOPS
  * cuda, (16, 16, 128), batch 128: 2.55265598297 ms, 123.233527 GFLOPS
  * cuda, (32, 32, 128), batch 32: 2.79064006805 ms, 127.75414647 GFLOPS
  * cuda, (128, 128, 128), batch 2: 3.81527671814 ms, 115.431186919 GFLOPS
  * cl, (16,), batch 262144: 0.958919525146 ms, 87.4797913695 GFLOPS
  * cl, (1024,), batch 4096: 1.32009983063 ms, 158.863136813 GFLOPS
  * cl, (8192,), batch 512: 2.39679813385 ms, 113.747485093 GFLOPS
  * cl, (16, 16), batch 16384: 1.83990001678 ms, 91.1854766398 GFLOPS
  * cl, (128, 128), batch 256: 2.42879390717 ms, 120.883570703 GFLOPS
  * cl, (1024, 1024), batch 4: 3.57611179352 ms, 117.286713676 GFLOPS
  * cl, (8, 8, 64), batch 1024: 2.84540653229 ms, 88.4436853379 GFLOPS
  * cl, (16, 16, 16), batch 1024: 2.63938903809 ms, 95.3471566217 GFLOPS
  * cl, (16, 16, 128), batch 128: 2.67498493195 ms, 117.597970831 GFLOPS
  * cl, (32, 32, 128), batch 32: 2.92448997498 ms, 121.90701389 GFLOPS
  * cl, (128, 128, 128), batch 2: 3.96769046783 ms, 110.997045654 GFLOPS

Running performance tests, single precision, accurate math...

  * cuda, (16,), batch 262144: 1.42477121353 ms, 58.8768773564 GFLOPS
  * cuda, (1024,), batch 4096: 3.3375743866 ms, 62.8346145159 GFLOPS
  * cuda, (8192,), batch 512: 4.32820472717 ms, 62.9891091538 GFLOPS
  * cuda, (16, 16), batch 16384: 2.25389766693 ms, 74.4364584344 GFLOPS
  * cuda, (128, 128), batch 256: 3.7936416626 ms, 77.3929923046 GFLOPS
  * cuda, (1024, 1024), batch 4: 7.26947555542 ms, 57.6974771842 GFLOPS
  * cuda, (8, 8, 64), batch 1024: 2.80839996338 ms, 89.6091166791 GFLOPS
  * cuda, (16, 16, 16), batch 1024: 3.09284152985 ms, 81.3679710297 GFLOPS
  * cuda, (16, 16, 128), batch 128: 3.33400306702 ms, 94.3528826089 GFLOPS
  * cuda, (32, 32, 128), batch 32: 3.94346237183 ms, 90.406806604 GFLOPS
  * cuda, (128, 128, 128), batch 2: 5.91262397766 ms, 74.485020807 GFLOPS
  * cl, (16,), batch 262144: 2.60689258575 ms, 32.1785717058 GFLOPS
  * cl, (1024,), batch 4096: 5.85680007935 ms, 35.8071296884 GFLOPS
  * cl, (8192,), batch 512: 9.5780134201 ms, 28.4641238263 GFLOPS
  * cl, (16, 16), batch 16384: 3.86021137238 ms, 43.461910195 GFLOPS
  * cl, (128, 128), batch 256: 6.71989917755 ms, 43.6913221825 GFLOPS
  * cl, (1024, 1024), batch 4: 14.6693944931 ms, 28.5922094601 GFLOPS
  * cl, (8, 8, 64), batch 1024: 4.20069694519 ms, 59.9086873639 GFLOPS
  * cl, (16, 16, 16), batch 1024: 5.09431362152 ms, 49.3998325774 GFLOPS
  * cl, (16, 16, 128), batch 128: 5.94499111176 ms, 52.9139226765 GFLOPS
  * cl, (32, 32, 128), batch 32: 7.1622133255 ms, 49.7773277334 GFLOPS
  * cl, (128, 128, 128), batch 2: 10.4041099548 ms, 42.329610309 GFLOPS

==== On an HP Pavilion p6237fr with an ATI Radeon 5850 ====

Running performance tests, single precision, fast math...
  * cl, (16,), batch 262144: 1.91760063171 ms, 43.7453339411 GFLOPS
  * cl, (1024,), batch 4096: 2.51891613007 ms, 83.2561265128 GFLOPS
  * cl, (8192,), batch 512: 2.51779556274 ms, 108.281134511 GFLOPS
  * cl, (16, 16), batch 16384: 1.60949230194 ms, 104.239181385 GFLOPS
  * cl, (128, 128), batch 256: 3.12139987946 ms, 94.0607712368 GFLOPS
  * cl, (1024, 1024), batch 4: 4.48298454285 ms, 93.5605278116 GFLOPS
  * cl, (8, 8, 64), batch 1024: 2.3824930191 ms, 105.628112226 GFLOPS
  * cl, (16, 16, 16), batch 1024: 3.38749885559 ms, 74.2902804483 GFLOPS
  * cl, (16, 16, 128), batch 128: 3.13160419464 ms, 100.451008636 GFLOPS
  * cl, (32, 32, 128), batch 32: 3.49609851837 ms, 101.975341406 GFLOPS
  * cl, (128, 128, 128), batch 2: 4.69348430634 ms, 93.8326179989 GFLOPS

Running performance tests, single precision, accurate math...

  * cl, (16,), batch 262144: 3.46238613129 ms, 24.2278234776 GFLOPS
  * cl, (1024,), batch 4096: 6.31630420685 ms, 33.2022007066 GFLOPS
  * cl, (8192,), batch 512: 8.24019908905 ms, 33.0853365378 GFLOPS
  * cl, (16, 16), batch 16384: 5.12380599976 ms, 32.7436596952 GFLOPS
  * cl, (128, 128), batch 256: 7.48369693756 ms, 39.2321178222 GFLOPS
  * cl, (1024, 1024), batch 4: 12.8624916077 ms, 32.6087987299 GFLOPS
  * cl, (8, 8, 64), batch 1024: 4.75478172302 ms, 52.9274012267 GFLOPS
  * cl, (16, 16, 16), batch 1024: 6.84630870819 ms, 36.7582372818 GFLOPS
  * cl, (16, 16, 128), batch 128: 8.55000019073 ms, 36.7921395301 GFLOPS
  * cl, (32, 32, 128), batch 32: 8.44972133636 ms, 42.1926150944 GFLOPS
  * cl, (128, 128, 128), batch 2: 11.0075950623 ms, 40.0089136191 GFLOPS

==== On an HP z800 with an ATI Radeon 5770 ====

Running performance tests, single precision, fast math...

  * cl, (16,), batch 262144: 2.65729427338 ms, 31.5682312044 GFLOPS
  * cl, (1024,), batch 4096: 2.04300880432 ms, 102.650169474 GFLOPS
  * cl, (8192,), batch 512: 3.96590232849 ms, 68.74343779 GFLOPS
  * cl, (16, 16), batch 16384: 2.72769927979 ms, 61.5068388379 GFLOPS
  * cl, (128, 128), batch 256: 4.22441959381 ms, 69.5009748629 GFLOPS
  * cl, (1024, 1024), batch 4: 6.15720748901 ms, 68.1202315739 GFLOPS
  * cl, (8, 8, 64), batch 1024: 4.86221313477 ms, 51.7579614519 GFLOPS
  * cl, (16, 16, 16), batch 1024: 4.01949882507 ms, 62.6093577712 GFLOPS
  * cl, (16, 16, 128), batch 128: 4.14161682129 ms, 75.9541052622 GFLOPS
  * cl, (32, 32, 128), batch 32: 5.53939342499 ms, 64.3600865019 GFLOPS
  * cl, (128, 128, 128), batch 2: 8.05718898773 ms, 54.6594998169 GFLOPS

Running performance tests, single precision, accurate math...
  * cl, (16,), batch 262144: 4.7709941864 ms, 17.5825156608 GFLOPS
  * cl, (1024,), batch 4096: 8.02640914917 ms, 26.1281472328 GFLOPS
  * cl, (8192,), batch 512: 11.2160921097 ms, 24.307018642 GFLOPS
  * cl, (16, 16), batch 16384: 6.61840438843 ms, 25.3493365098 GFLOPS
  * cl, (128, 128), batch 256: 10.5237960815 ms, 27.8987997986 GFLOPS
  * cl, (1024, 1024), batch 4: 19.0495014191 ms, 22.01792009 GFLOPS
  * cl, (8, 8, 64), batch 1024: 6.22680187225 ms, 40.4153279907 GFLOPS
  * cl, (16, 16, 16), batch 1024: 10.3295087814 ms, 24.3630404238 GFLOPS
  * cl, (16, 16, 128), batch 128: 11.3425016403 ms, 27.7339876136 GFLOPS
  * cl, (32, 32, 128), batch 32: 12.5716209412 ms, 28.3587805955 GFLOPS
  * cl, (128, 128, 128), batch 2: 16.1673069 ms, 27.2402771051 GFLOPS

===== Usage example =====

Here is the example presented on the [[http://packages.python.org/pyfft/|PyFFT]] site:

<code python>
from pyfft.cuda import Plan
import numpy
import pycuda.driver as cuda
from pycuda.tools import make_default_context
import pycuda.gpuarray as gpuarray

# Initialise CUDA and create a context and a stream
cuda.init()
context = make_default_context()
stream = cuda.Stream()

# Build an FFT plan for 16x16 complex data
plan = Plan((16, 16), stream=stream)

# Send the data to the GPU
data = numpy.ones((16, 16), dtype=numpy.complex64)
gpu_data = gpuarray.to_gpu(data)

# Forward then inverse transform, in place
plan.execute(gpu_data)
plan.execute(gpu_data, inverse=True)
result = gpu_data.get()

# The round trip should give back the original data (True in an interactive session)
error = numpy.abs(numpy.sum(numpy.abs(data) - numpy.abs(result)) / data.size)
error < 1e-6

context.pop()
</code>
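The example above relies on PyCUDA and therefore on an Nvidia card. On ATI hardware, where only OpenCL is available, the same round trip can be written against ''pyfft.cl''. The following is a sketch adapted from the PyFFT documentation, assuming PyOpenCL is installed as described above; note that the OpenCL ''Plan'' takes a command queue instead of a CUDA stream, that ''execute()'' is given the underlying buffer ''gpu_data.data'', and that the three-argument ''to_device(context, queue, array)'' form matches the older PyOpenCL releases of the Squeeze era:

<code python>
from pyfft.cl import Plan
import numpy
import pyopencl as cl
import pyopencl.array as cl_array

# Create an OpenCL context and a command queue on an available device
ctx = cl.create_some_context(interactive=False)
queue = cl.CommandQueue(ctx)

# The OpenCL Plan takes a command queue where the CUDA one takes a stream
plan = Plan((16, 16), queue=queue)

# Send the data to the GPU
data = numpy.ones((16, 16), dtype=numpy.complex64)
gpu_data = cl_array.to_device(ctx, queue, data)

# Forward then inverse transform, in place, on the raw buffer
plan.execute(gpu_data.data)
plan.execute(gpu_data.data, inverse=True)
result = gpu_data.get()

# The round trip should give back the original data (True in an interactive session)
error = numpy.abs(numpy.sum(numpy.abs(data) - numpy.abs(result)) / data.size)
error < 1e-6
</code>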