Using GPUs: PyCUDA, PyOpenCL and PyFFT on Debian Squeeze

PyFFT makes calling GPU FFT routines from inside your Python scripts trivially simple, in place of the standard CPU calls: it is probably the easiest way to put your GPU to work on these operations.

The prerequisite is to install the graphics drivers and the appropriate Nvidia or ATI environment for your hardware. In addition, you need to have installed either PyCUDA or PyOpenCL, two wrappers created and maintained by Andreas Klöckner.

Installing PyCUDA

To install it in an Nvidia environment on Debian Squeeze, see here and, more specifically, PyCUDA on Nvidia.

Installing PyOpenCL

On Nvidia

To install it in an Nvidia environment on Debian Squeeze, see CUDA on Squeeze and, more specifically, PyOpenCL on Nvidia.

On ATI

To install it in an ATI environment on Debian Squeeze, see Stream SDK on Squeeze and, more specifically, PyOpenCL on ATI.

Installing PyFFT

All of the following commands are to be run as root.

Downloading the sources

cd /opt
git clone https://github.com/Manticore/pyfft

Building and installing the library

cd /opt/pyfft
export PYFFT=$PWD

python setup.py build
python setup.py install

Running the integration tests

cd $PYFFT/test
python test_performance.py

On a Dell Precision 360 with an ATI Radeon HD 4890

Running performance tests, single precision, fast math...
* cl, (16,), batch 262144: 12.6500844955 ms, 6.63126637846 GFLOPS
* cl, (1024,), batch 4096: 11.2925052643 ms, 18.5711846124 GFLOPS
* cl, (8192,), batch 512: 18.1638002396 ms, 15.0095110277 GFLOPS
* cl, (16, 16), batch 16384: 20.2838897705 ms, 8.27120251087 GFLOPS
* cl, (128, 128), batch 256: 15.6512022018 ms, 18.7590241448 GFLOPS
* cl, (1024, 1024), batch 4: 20.5834150314 ms, 20.3771045456 GFLOPS
* cl, (8, 8, 64), batch 1024: 13.6965036392 ms, 18.3739037808 GFLOPS
* cl, (16, 16, 16), batch 1024: 28.2988071442 ms, 8.8928921533 GFLOPS
* cl, (16, 16, 128), batch 128: 23.5196828842 ms, 13.3748742085 GFLOPS
* cl, (32, 32, 128), batch 32: 23.3653783798 ms, 15.2582951667 GFLOPS
* cl, (128, 128, 128), batch 2: 31.716299057 ms, 13.8856655125 GFLOPS
Running performance tests, single precision, accurate math...
* cl, (16,), batch 262144: 12.9328966141 ms, 6.48625613451 GFLOPS
* cl, (1024,), batch 4096: 19.5894002914 ms, 10.7055446762 GFLOPS
* cl, (8192,), batch 512: 26.6664981842 ms, 10.2236805942 GFLOPS
* cl, (16, 16), batch 16384: 20.7551002502 ms, 8.08341843581 GFLOPS
* cl, (128, 128), batch 256: 19.7580814362 ms, 14.8598071604 GFLOPS
* cl, (1024, 1024), batch 4: 40.1411056519 ms, 10.4489000288 GFLOPS
* cl, (8, 8, 64), batch 1024: 14.3758058548 ms, 17.5056788149 GFLOPS
* cl, (16, 16, 16), batch 1024: 29.2860984802 ms, 8.5930954637 GFLOPS
* cl, (16, 16, 128), batch 128: 24.3818998337 ms, 12.9018986275 GFLOPS
* cl, (32, 32, 128), batch 32: 24.8661994934 ms, 14.3373674813 GFLOPS
* cl, (128, 128, 128), batch 2: 37.5190019608 ms, 11.7381032806 GFLOPS

On an Apple iMac with an ATI Radeon HD 4850M

Running performance tests, single precision, fast math...
* cl, (16,), batch 262144: 22.1525907516 ms, 3.78673902933 GFLOPS
* cl, (1024,), batch 4096: 19.5072889328 ms, 10.7506071563 GFLOPS
* cl, (8192,), batch 512: 31.4644098282 ms, 8.66470280195 GFLOPS
* cl, (16, 16), batch 16384: 35.2179050446 ms, 4.76383134624 GFLOPS
* cl, (128, 128), batch 256: 27.0842075348 ms, 10.8403127403 GFLOPS
* cl, (1024, 1024), batch 4: 36.1826896667 ms, 11.592018279 GFLOPS
* cl, (8, 8, 64), batch 1024: 24.2835998535 ms, 10.3633003969 GFLOPS
* cl, (16, 16, 16), batch 1024: 48.9276885986 ms, 5.14347289251 GFLOPS
* cl, (16, 16, 128), batch 128: 40.2245998383 ms, 7.82040843824 GFLOPS
* cl, (32, 32, 128), batch 32: 40.5035972595 ms, 8.80207843554 GFLOPS
* cl, (128, 128, 128), batch 2: 47.1528053284 ms, 9.33988798616 GFLOPS
Running performance tests, single precision, accurate math...
* cl, (16,), batch 262144: 22.700881958 ms, 3.69527845461 GFLOPS
* cl, (1024,), batch 4096: 32.2904825211 ms, 6.49464435421 GFLOPS
* cl, (8192,), batch 512: 1405.95219135 ms, 0.19391111709 GFLOPS
* cl, (16, 16), batch 16384: 981.485915184 ms, 0.17093690027 GFLOPS
* cl, (128, 128), batch 256: 307.555294037 ms, 0.954629251041 GFLOPS
* cl, (1024, 1024), batch 4: 66.2887096405 ms, 6.32732787038 GFLOPS
* cl, (8, 8, 64), batch 1024: 24.7946977615 ms, 10.1496796783 GFLOPS
* cl, (16, 16, 16), batch 1024: 50.8825778961 ms, 4.9458626195 GFLOPS
* cl, (16, 16, 128), batch 128: 42.073392868 ms, 7.47676330708 GFLOPS
* cl, (32, 32, 128), batch 32: 43.0013895035 ms, 8.29079813738 GFLOPS
* cl, (128, 128, 128), batch 2: 58.1202983856 ms, 7.57742014809 GFLOPS

On a Dell Latitude E6410 with an Nvidia GT218 (NVS 3100M)

Running performance tests, single precision, fast math...
* cuda, (16,), batch 262144: 10.8457084656 ms, 7.73449519377 GFLOPS
* cuda, (1024,), batch 4096: 15.1082717896 ms, 13.8808199191 GFLOPS
* cuda, (8192,), batch 512: 23.5022171021 ms, 11.6001719674 GFLOPS
* cuda, (16, 16), batch 16384: 16.9765533447 ms, 9.88258079206 GFLOPS
* cuda, (128, 128), batch 256: 21.0505889893 ms, 13.9474140201 GFLOPS
* cuda, (1024, 1024), batch 4: 35.6067626953 ms, 11.7795151328 GFLOPS
* cuda, (8, 8, 64), batch 1024: 20.4859710693 ms, 12.2844184026 GFLOPS
* cuda, (16, 16, 16), batch 1024: 23.0558822632 ms, 10.9151424841 GFLOPS
* cuda, (16, 16, 128), batch 128: 21.3567718506 ms, 14.7294170767 GFLOPS
* cuda, (32, 32, 128), batch 32: 23.2541030884 ms, 15.3313090015 GFLOPS
* cuda, (128, 128, 128), batch 2: 33.5932098389 ms, 13.1098493449 GFLOPS
* cl, (16,), batch 262144: 10.9880924225 ms, 7.63427142534 GFLOPS
* cl, (1024,), batch 4096: 14.8217201233 ms, 14.149180949 GFLOPS
* cl, (8192,), batch 512: 23.6032962799 ms, 11.5504951837 GFLOPS
* cl, (16, 16), batch 16384: 17.1283960342 ms, 9.79497202567 GFLOPS
* cl, (128, 128), batch 256: 20.6398010254 ms, 14.2250053496 GFLOPS
* cl, (1024, 1024), batch 4: 34.9160909653 ms, 12.0125245526 GFLOPS
* cl, (8, 8, 64), batch 1024: 20.4857826233 ms, 12.2845314054 GFLOPS
* cl, (16, 16, 16), batch 1024: 23.1196880341 ms, 10.8850188475 GFLOPS
* cl, (16, 16, 128), batch 128: 20.8853006363 ms, 15.0619234781 GFLOPS
* cl, (32, 32, 128), batch 32: 22.8240013123 ms, 15.620216417 GFLOPS
* cl, (128, 128, 128), batch 2: 33.2809925079 ms, 13.232836127 GFLOPS
Running performance tests, single precision, accurate math...
* cuda, (16,), batch 262144: 16.0340286255 ms, 5.23175316443 GFLOPS
* cuda, (1024,), batch 4096: 32.7980010986 ms, 6.39414577033 GFLOPS
* cuda, (8192,), batch 512: 44.2999786377 ms, 6.15417362229 GFLOPS
* cuda, (16, 16), batch 16384: 23.9424224854 ms, 7.00731766398 GFLOPS
* cuda, (128, 128), batch 256: 37.4869567871 ms, 7.83209161702 GFLOPS
* cuda, (1024, 1024), batch 4: 70.3075195312 ms, 5.96565492278 GFLOPS
* cuda, (8, 8, 64), batch 1024: 24.4130752563 ms, 10.3083383538 GFLOPS
* cuda, (16, 16, 16), batch 1024: 32.3656616211 ms, 7.77547027915 GFLOPS
* cuda, (16, 16, 128), batch 128: 33.9374816895 ms, 9.26918511157 GFLOPS
* cuda, (32, 32, 128), batch 32: 37.5233032227 ms, 9.50118484731 GFLOPS
* cuda, (128, 128, 128), batch 2: 57.7268981934 ms, 7.62905913505 GFLOPS
* cl, (16,), batch 262144: 29.4173002243 ms, 2.85159002901 GFLOPS
* cl, (1024,), batch 4096: 81.5240859985 ms, 2.57243239751 GFLOPS
* cl, (8192,), batch 512: 134.916901588 ms, 2.02072354753 GFLOPS
* cl, (16, 16), batch 16384: 45.1912164688 ms, 3.71249488528 GFLOPS
* cl, (128, 128), batch 256: 79.5057058334 ms, 3.69283282152 GFLOPS
* cl, (1024, 1024), batch 4: 202.758383751 ms, 2.06862173707 GFLOPS
* cl, (8, 8, 64), batch 1024: 35.9812021255 ms, 6.99415875884 GFLOPS
* cl, (16, 16, 16), batch 1024: 61.35160923 ms, 4.10190120778 GFLOPS
* cl, (16, 16, 128), batch 128: 70.4071044922 ms, 4.46791275211 GFLOPS
* cl, (32, 32, 128), batch 32: 77.3554086685 ms, 4.60880300598 GFLOPS
* cl, (128, 128, 128), batch 2: 121.149802208 ms, 3.63518480405 GFLOPS

On a Dell Precision 360 with an Nvidia GTX 260

Running performance tests, single precision, fast math...
* cuda, (16,), batch 262144: 1.15618877411 ms, 72.5539651297 GFLOPS
* cuda, (1024,), batch 4096: 1.53842878342 ms, 136.317782312 GFLOPS
* cuda, (8192,), batch 512: 2.30843849182 ms, 118.101375006 GFLOPS
* cuda, (16, 16), batch 16384: 1.87551994324 ms, 89.4536795543 GFLOPS
* cuda, (128, 128), batch 256: 2.12306556702 ms, 138.291197673 GFLOPS
* cuda, (1024, 1024), batch 4: 3.58481292725 ms, 117.002032885 GFLOPS
* cuda, (8, 8, 64), batch 1024: 2.41319999695 ms, 104.284037924 GFLOPS
* cuda, (16, 16, 16), batch 1024: 2.61318397522 ms, 96.3032998772 GFLOPS
* cuda, (16, 16, 128), batch 128: 2.45464000702 ms, 128.154352207 GFLOPS
* cuda, (32, 32, 128), batch 32: 2.46014080048 ms, 144.916843756 GFLOPS
* cuda, (128, 128, 128), batch 2: 3.29523506165 ms, 133.648104539 GFLOPS
* cl, (16,), batch 262144: 1.17750167847 ms, 71.2407307217 GFLOPS
* cl, (1024,), batch 4096: 1.52101516724 ms, 137.878441003 GFLOPS
* cl, (8192,), batch 512: 2.36279964447 ms, 115.384205613 GFLOPS
* cl, (16, 16), batch 16384: 1.96299552917 ms, 85.4674182934 GFLOPS
* cl, (128, 128), batch 256: 2.26821899414 ms, 129.441328531 GFLOPS
* cl, (1024, 1024), batch 4: 3.52740287781 ms, 118.906292967 GFLOPS
* cl, (8, 8, 64), batch 1024: 2.51619815826 ms, 100.015270726 GFLOPS
* cl, (16, 16, 16), batch 1024: 2.69598960876 ms, 93.3454042931 GFLOPS
* cl, (16, 16, 128), batch 128: 2.60119438171 ms, 120.9339841 GFLOPS
* cl, (32, 32, 128), batch 32: 2.57868766785 ms, 138.254758203 GFLOPS
* cl, (128, 128, 128), batch 2: 3.4695148468 ms, 126.934727028 GFLOPS
Running performance tests, single precision, accurate math...
* cuda, (16,), batch 262144: 1.59408321381 ms, 52.6234008824 GFLOPS
* cuda, (1024,), batch 4096: 3.1795999527 ms, 65.9564734935 GFLOPS
* cuda, (8192,), batch 512: 4.20476150513 ms, 64.8383409303 GFLOPS
* cuda, (16, 16), batch 16384: 2.29162559509 ms, 73.2109819157 GFLOPS
* cuda, (128, 128), batch 256: 3.59081611633 ms, 81.7644987903 GFLOPS
* cuda, (1024, 1024), batch 4: 6.88502731323 ms, 60.9192064052 GFLOPS
* cuda, (8, 8, 64), batch 1024: 2.71439990997 ms, 92.7122930838 GFLOPS
* cuda, (16, 16, 16), batch 1024: 2.97145595551 ms, 84.6918964199 GFLOPS
* cuda, (16, 16, 128), batch 128: 3.00780487061 ms, 104.585507881 GFLOPS
* cuda, (32, 32, 128), batch 32: 3.61416625977 ms, 98.6440064944 GFLOPS
* cuda, (128, 128, 128), batch 2: 5.48771514893 ms, 80.2523287103 GFLOPS
* cl, (16,), batch 262144: 2.75559425354 ms, 30.4421015148 GFLOPS
* cl, (1024,), batch 4096: 5.65969944 ms, 37.0541231426 GFLOPS
* cl, (8192,), batch 512: 9.34660434723 ms, 29.1688563966 GFLOPS
* cl, (16, 16), batch 16384: 3.98399829865 ms, 42.1115039274 GFLOPS
* cl, (128, 128), batch 256: 6.76999092102 ms, 43.3680463423 GFLOPS
* cl, (1024, 1024), batch 4: 14.0426874161 ms, 29.8682429917 GFLOPS
* cl, (8, 8, 64), batch 1024: 3.82699966431 ms, 65.7586260974 GFLOPS
* cl, (16, 16, 16), batch 1024: 5.26819229126 ms, 47.7693725071 GFLOPS
* cl, (16, 16, 128), batch 128: 5.97939491272 ms, 52.6094704551 GFLOPS
* cl, (32, 32, 128), batch 32: 6.90059661865 ms, 51.6644950723 GFLOPS
* cl, (128, 128, 128), batch 2: 10.1180076599 ms, 43.5265454231 GFLOPS

On a Dell Precision 390 with an Nvidia Tesla C1060

Running performance tests, single precision, fast math...
* cuda, (16,), batch 262144: 1.05226564407 ms, 79.7194895343 GFLOPS
* cuda, (1024,), batch 4096: 1.33839044571 ms, 156.692092859 GFLOPS
* cuda, (8192,), batch 512: 2.34901752472 ms, 116.061186062 GFLOPS
* cuda, (16, 16), batch 16384: 1.85666236877 ms, 90.3622343091 GFLOPS
* cuda, (128, 128), batch 256: 2.30894393921 ms, 127.158254046 GFLOPS
* cuda, (1024, 1024), batch 4: 3.64037437439 ms, 115.216281861 GFLOPS
* cuda, (8, 8, 64), batch 1024: 2.71556797028 ms, 92.6724142996 GFLOPS
* cuda, (16, 16, 16), batch 1024: 2.69772796631 ms, 93.2852545338 GFLOPS
* cuda, (16, 16, 128), batch 128: 2.55265598297 ms, 123.233527 GFLOPS
* cuda, (32, 32, 128), batch 32: 2.79064006805 ms, 127.75414647 GFLOPS
* cuda, (128, 128, 128), batch 2: 3.81527671814 ms, 115.431186919 GFLOPS
* cl, (16,), batch 262144: 0.958919525146 ms, 87.4797913695 GFLOPS
* cl, (1024,), batch 4096: 1.32009983063 ms, 158.863136813 GFLOPS
* cl, (8192,), batch 512: 2.39679813385 ms, 113.747485093 GFLOPS
* cl, (16, 16), batch 16384: 1.83990001678 ms, 91.1854766398 GFLOPS
* cl, (128, 128), batch 256: 2.42879390717 ms, 120.883570703 GFLOPS
* cl, (1024, 1024), batch 4: 3.57611179352 ms, 117.286713676 GFLOPS
* cl, (8, 8, 64), batch 1024: 2.84540653229 ms, 88.4436853379 GFLOPS
* cl, (16, 16, 16), batch 1024: 2.63938903809 ms, 95.3471566217 GFLOPS
* cl, (16, 16, 128), batch 128: 2.67498493195 ms, 117.597970831 GFLOPS
* cl, (32, 32, 128), batch 32: 2.92448997498 ms, 121.90701389 GFLOPS
* cl, (128, 128, 128), batch 2: 3.96769046783 ms, 110.997045654 GFLOPS
Running performance tests, single precision, accurate math...
* cuda, (16,), batch 262144: 1.42477121353 ms, 58.8768773564 GFLOPS
* cuda, (1024,), batch 4096: 3.3375743866 ms, 62.8346145159 GFLOPS
* cuda, (8192,), batch 512: 4.32820472717 ms, 62.9891091538 GFLOPS
* cuda, (16, 16), batch 16384: 2.25389766693 ms, 74.4364584344 GFLOPS
* cuda, (128, 128), batch 256: 3.7936416626 ms, 77.3929923046 GFLOPS
* cuda, (1024, 1024), batch 4: 7.26947555542 ms, 57.6974771842 GFLOPS
* cuda, (8, 8, 64), batch 1024: 2.80839996338 ms, 89.6091166791 GFLOPS
* cuda, (16, 16, 16), batch 1024: 3.09284152985 ms, 81.3679710297 GFLOPS
* cuda, (16, 16, 128), batch 128: 3.33400306702 ms, 94.3528826089 GFLOPS
* cuda, (32, 32, 128), batch 32: 3.94346237183 ms, 90.406806604 GFLOPS
* cuda, (128, 128, 128), batch 2: 5.91262397766 ms, 74.485020807 GFLOPS
* cl, (16,), batch 262144: 2.60689258575 ms, 32.1785717058 GFLOPS
* cl, (1024,), batch 4096: 5.85680007935 ms, 35.8071296884 GFLOPS
* cl, (8192,), batch 512: 9.5780134201 ms, 28.4641238263 GFLOPS
* cl, (16, 16), batch 16384: 3.86021137238 ms, 43.461910195 GFLOPS
* cl, (128, 128), batch 256: 6.71989917755 ms, 43.6913221825 GFLOPS
* cl, (1024, 1024), batch 4: 14.6693944931 ms, 28.5922094601 GFLOPS
* cl, (8, 8, 64), batch 1024: 4.20069694519 ms, 59.9086873639 GFLOPS
* cl, (16, 16, 16), batch 1024: 5.09431362152 ms, 49.3998325774 GFLOPS
* cl, (16, 16, 128), batch 128: 5.94499111176 ms, 52.9139226765 GFLOPS
* cl, (32, 32, 128), batch 32: 7.1622133255 ms, 49.7773277334 GFLOPS
* cl, (128, 128, 128), batch 2: 10.4041099548 ms, 42.329610309 GFLOPS

On an HP Pavilion p6237fr with an ATI Radeon HD 5850

Running performance tests, single precision, fast math...
* cl, (16,), batch 262144: 1.91760063171 ms, 43.7453339411 GFLOPS
* cl, (1024,), batch 4096: 2.51891613007 ms, 83.2561265128 GFLOPS
* cl, (8192,), batch 512: 2.51779556274 ms, 108.281134511 GFLOPS
* cl, (16, 16), batch 16384: 1.60949230194 ms, 104.239181385 GFLOPS
* cl, (128, 128), batch 256: 3.12139987946 ms, 94.0607712368 GFLOPS
* cl, (1024, 1024), batch 4: 4.48298454285 ms, 93.5605278116 GFLOPS
* cl, (8, 8, 64), batch 1024: 2.3824930191 ms, 105.628112226 GFLOPS
* cl, (16, 16, 16), batch 1024: 3.38749885559 ms, 74.2902804483 GFLOPS
* cl, (16, 16, 128), batch 128: 3.13160419464 ms, 100.451008636 GFLOPS
* cl, (32, 32, 128), batch 32: 3.49609851837 ms, 101.975341406 GFLOPS
* cl, (128, 128, 128), batch 2: 4.69348430634 ms, 93.8326179989 GFLOPS
Running performance tests, single precision, accurate math...
* cl, (16,), batch 262144: 3.46238613129 ms, 24.2278234776 GFLOPS
* cl, (1024,), batch 4096: 6.31630420685 ms, 33.2022007066 GFLOPS
* cl, (8192,), batch 512: 8.24019908905 ms, 33.0853365378 GFLOPS
* cl, (16, 16), batch 16384: 5.12380599976 ms, 32.7436596952 GFLOPS
* cl, (128, 128), batch 256: 7.48369693756 ms, 39.2321178222 GFLOPS
* cl, (1024, 1024), batch 4: 12.8624916077 ms, 32.6087987299 GFLOPS
* cl, (8, 8, 64), batch 1024: 4.75478172302 ms, 52.9274012267 GFLOPS
* cl, (16, 16, 16), batch 1024: 6.84630870819 ms, 36.7582372818 GFLOPS
* cl, (16, 16, 128), batch 128: 8.55000019073 ms, 36.7921395301 GFLOPS
* cl, (32, 32, 128), batch 32: 8.44972133636 ms, 42.1926150944 GFLOPS
* cl, (128, 128, 128), batch 2: 11.0075950623 ms, 40.0089136191 GFLOPS

On an HP z800 with an ATI Radeon HD 5770

Running performance tests, single precision, fast math...
* cl, (16,), batch 262144: 2.65729427338 ms, 31.5682312044 GFLOPS
* cl, (1024,), batch 4096: 2.04300880432 ms, 102.650169474 GFLOPS
* cl, (8192,), batch 512: 3.96590232849 ms, 68.74343779 GFLOPS
* cl, (16, 16), batch 16384: 2.72769927979 ms, 61.5068388379 GFLOPS
* cl, (128, 128), batch 256: 4.22441959381 ms, 69.5009748629 GFLOPS
* cl, (1024, 1024), batch 4: 6.15720748901 ms, 68.1202315739 GFLOPS
* cl, (8, 8, 64), batch 1024: 4.86221313477 ms, 51.7579614519 GFLOPS
* cl, (16, 16, 16), batch 1024: 4.01949882507 ms, 62.6093577712 GFLOPS
* cl, (16, 16, 128), batch 128: 4.14161682129 ms, 75.9541052622 GFLOPS
* cl, (32, 32, 128), batch 32: 5.53939342499 ms, 64.3600865019 GFLOPS
* cl, (128, 128, 128), batch 2: 8.05718898773 ms, 54.6594998169 GFLOPS
Running performance tests, single precision, accurate math...
* cl, (16,), batch 262144: 4.7709941864 ms, 17.5825156608 GFLOPS
* cl, (1024,), batch 4096: 8.02640914917 ms, 26.1281472328 GFLOPS
* cl, (8192,), batch 512: 11.2160921097 ms, 24.307018642 GFLOPS
* cl, (16, 16), batch 16384: 6.61840438843 ms, 25.3493365098 GFLOPS
* cl, (128, 128), batch 256: 10.5237960815 ms, 27.8987997986 GFLOPS
* cl, (1024, 1024), batch 4: 19.0495014191 ms, 22.01792009 GFLOPS
* cl, (8, 8, 64), batch 1024: 6.22680187225 ms, 40.4153279907 GFLOPS
* cl, (16, 16, 16), batch 1024: 10.3295087814 ms, 24.3630404238 GFLOPS
* cl, (16, 16, 128), batch 128: 11.3425016403 ms, 27.7339876136 GFLOPS
* cl, (32, 32, 128), batch 32: 12.5716209412 ms, 28.3587805955 GFLOPS
* cl, (128, 128, 128), batch 2: 16.1673069 ms, 27.2402771051 GFLOPS
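The GFLOPS figures in the logs above appear to follow the conventional 5·N·log₂(N) operation-count estimate for a complex FFT, multiplied by the batch size. As a sketch (this formula is the standard convention for reporting FFT throughput, assumed here rather than taken from pyfft's internals), the conversion from the reported time to GFLOPS is:

```python
import math

def estimated_gflops(shape, batch, time_ms):
    """GFLOPS from the conventional 5*N*log2(N) complex-FFT operation count."""
    n = math.prod(shape)                      # total points per transform
    ops = 5.0 * n * math.log2(n) * batch      # estimated floating-point operations
    return ops / (time_ms * 1e-3) / 1e9       # ops per second, in units of 1e9

# Reproduces the (1024, 1024), batch 4 line of the first table:
print(round(estimated_gflops((1024, 1024), 4, 20.5834150314), 2))  # → 20.38
```

The estimate matches the logged values, which is a useful sanity check when comparing runs across machines.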

Usage example

Here is an example presented on the project's site:

from pyfft.cuda import Plan
import numpy
import pycuda.driver as cuda
from pycuda.tools import make_default_context
import pycuda.gpuarray as gpuarray

# Initialize CUDA, create a context and a stream
cuda.init()
context = make_default_context()
stream = cuda.Stream()

# Prepare an FFT plan for 16x16 complex64 arrays and upload the data
plan = Plan((16, 16), stream=stream)
data = numpy.ones((16, 16), dtype=numpy.complex64)
gpu_data = gpuarray.to_gpu(data)

# Forward then inverse transform: the round trip should reproduce the input
plan.execute(gpu_data)
plan.execute(gpu_data, inverse=True)
result = gpu_data.get()

# Mean round-trip error; expected to be below 1e-6
error = numpy.abs(numpy.sum(numpy.abs(data) - numpy.abs(result)) / data.size)
error < 1e-6
context.pop()
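For comparison, the same round-trip check can be reproduced on the CPU with numpy.fft alone (a minimal sketch with no GPU or pyfft dependency; `fft2`/`ifft2` stand in for the two `plan.execute` calls):

```python
import numpy

# CPU equivalent of the GPU example: forward FFT, inverse FFT,
# then the same mean-error criterion as above.
data = numpy.ones((16, 16), dtype=numpy.complex64)
result = numpy.fft.ifft2(numpy.fft.fft2(data))

error = numpy.abs(numpy.sum(numpy.abs(data) - numpy.abs(result)) / data.size)
print(error < 1e-6)  # → True
```

This gives an idea of the error magnitude the final check tolerates, independently of the GPU backend.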