Wednesday, December 22, 2010

Installing pyCUDA-0.94.2 on CentOS 5.5

Installing pyCUDA-0.94.2 can be quite a hassle on CentOS 5.5. As mentioned in a couple of my posts to the pyCUDA mailing list, I was using the correct parameters while configuring pyCUDA; the real problem is that CentOS 5.5 ships with gcc 4.1.2, which produces an error while building pyCUDA. Here are the steps I followed to install pyCUDA-0.94.2 on CentOS 5.5:

Step 1: Change your default gcc and g++ to the gcc44/g++44 versions. I used yum to install them.

$ cd /usr/bin
$ sudo rm /usr/bin/gcc
$ sudo ln -s /usr/bin/gcc44 /usr/bin/gcc
$ sudo rm /usr/bin/g++
$ sudo ln -s /usr/bin/g++44 /usr/bin/g++

Check your installation:

$ gcc -v
Using built-in specs.
Target: x86_64-redhat-linux6E
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-languages=c,c++,fortran --disable-libgcj --with-mpfr=/builddir/build/BUILD/gcc-4.4.0-20090514/obj-x86_64-redhat-linux6E/mpfr-install/ --with-ppl=/builddir/build/BUILD/gcc-4.4.0-20090514/obj-x86_64-redhat-linux6E/ppl-install --with-cloog=/builddir/build/BUILD/gcc-4.4.0-20090514/obj-x86_64-redhat-linux6E/cloog-install --with-tune=generic --with-arch_32=i586 --build=x86_64-redhat-linux6E
Thread model: posix
gcc version 4.4.0 20090514 (Red Hat 4.4.0-6) (GCC) 

$ g++ -v
Using built-in specs.
Target: x86_64-redhat-linux6E
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-languages=c,c++,fortran --disable-libgcj --with-mpfr=/builddir/build/BUILD/gcc-4.4.0-20090514/obj-x86_64-redhat-linux6E/mpfr-install/ --with-ppl=/builddir/build/BUILD/gcc-4.4.0-20090514/obj-x86_64-redhat-linux6E/ppl-install --with-cloog=/builddir/build/BUILD/gcc-4.4.0-20090514/obj-x86_64-redhat-linux6E/cloog-install --with-tune=generic --with-arch_32=i586 --build=x86_64-redhat-linux6E
Thread model: posix
gcc version 4.4.0 20090514 (Red Hat 4.4.0-6) (GCC)

Step 2: Unpack pycuda-0.94.2 into your working directory and install numpy, pytest, and pytools:

$ tar xzf pycuda-0.94.2.tar.gz
$ sudo easy_install numpy
install_dir /usr/local/lib/python2.6/site-packages/
Searching for numpy
Best match: numpy 1.5.1
Processing numpy-1.5.1-py2.6-linux-x86_64.egg
numpy 1.5.1 is already the active version in easy-install.pth
Installing f2py script to /usr/local/bin

Using /usr/local/lib/python2.6/site-packages/numpy-1.5.1-py2.6-linux-x86_64.egg
Processing dependencies for numpy
Finished processing dependencies for numpy

$ sudo easy_install pytest
install_dir /usr/local/lib/python2.6/site-packages/
Searching for pytest
Reading http://pypi.python.org/simple/pytest/
Reading http://pytest.org
Best match: pytest 2.0.0
Downloading http://pypi.python.org/packages/source/p/pytest/pytest-2.0.0.zip#md5=f07c521dfd5a540f3dfea1846e58dab7
Processing pytest-2.0.0.zip
Running pytest-2.0.0/setup.py -q bdist_egg --dist-dir /tmp/easy_install-5pyoXO/pytest-2.0.0/egg-dist-tmp-ssbYCg
Adding pytest 2.0.0 to easy-install.pth file
Installing py.test script to /usr/local/bin
Installing py.test-2.6 script to /usr/local/bin

Installed /usr/local/lib/python2.6/site-packages/pytest-2.0.0-py2.6.egg
Processing dependencies for pytest
Finished processing dependencies for pytest


$ sudo easy_install pytools
install_dir /usr/local/lib/python2.6/site-packages/
Searching for pytools
Best match: pytools 11
Processing pytools-11-py2.6.egg
pytools 11 is already the active version in easy-install.pth
Installing runalyzer-gather script to /usr/local/bin
Installing runalyzer script to /usr/local/bin
Installing logtool script to /usr/local/bin

Using /usr/local/lib/python2.6/site-packages/pytools-11-py2.6.egg
Processing dependencies for pytools
Finished processing dependencies for pytools
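
Before moving on, a quick sanity check that all three packages import cleanly (the version shown matches the numpy install above):

$ python -c "import numpy, pytest, pytools; print numpy.__version__"
1.5.1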

Step 3: Run configure.py to generate siteconf.py

$ python configure.py --cuda-root=/usr/local/cuda --cudadrv-lib-dir=/usr/lib64/nvidia --boost-inc-dir=/usr/include --boost-lib-dir=/usr/local/lib CXXFLAGS=-DBOOST_PYTHON_NO_PY_SIGNATURES
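
With the flags above, the generated siteconf.py should end up looking roughly like this (a sketch only; the exact set of variables depends on your pyCUDA version, and BOOST_PYTHON_LIBNAME in particular must match the Boost.Python library you built):

BOOST_INC_DIR = ['/usr/include']
BOOST_LIB_DIR = ['/usr/local/lib']
BOOST_PYTHON_LIBNAME = ['boost_python']   # adjust to your Boost.Python library name
BOOST_THREAD_LIBNAME = ['boost_thread']
CUDA_ROOT = '/usr/local/cuda'
CUDADRV_LIB_DIR = ['/usr/lib64/nvidia']
CUDADRV_LIBNAME = ['cuda']
CXXFLAGS = ['-DBOOST_PYTHON_NO_PY_SIGNATURES']
LDFLAGS = []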

Step 4: Install
$ sudo make install


Step 5: Test your installation
$ python test/test_driver.py 
=============================================================== test session starts ===============================================================
platform linux2 -- Python 2.6.6 -- pytest-2.0.0
collected 18 items 

test/test_driver.py ..................

================================================================ 18 passed in 4.58 seconds ================================================================


If you see the output above, pyCUDA should be working. Enjoy :)
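
As a final hands-on check beyond the test suite, here is a minimal sketch along the lines of the pyCUDA examples, doubling a small array on the GPU:

import numpy
import pycuda.autoinit            # creates a CUDA context on import
import pycuda.gpuarray as gpuarray

# Push a small random matrix to the GPU, double it there, and pull it back.
a = numpy.random.randn(4, 4).astype(numpy.float32)
a_gpu = gpuarray.to_gpu(a)
a_doubled = (2 * a_gpu).get()

print(a_doubled)   # should match 2*a computed on the CPU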


FYI: For installing pyCUDA on Ubuntu 10.10, refer to http://wiki.tiker.net/PyCuda/Installation/Linux/Ubuntu



Thursday, October 7, 2010

Installing Tesla C2050 and C1060 on Centos

I had been waiting to get an NVIDIA Fermi GPU to prototype my algorithms, but I first wanted an optimized implementation on the Tesla C1060 that I could then scale up to the Fermi. I finally got a C2050 and set about installing it in my machine. (Note: my host machine is a Lian-Li PC with an integrated NVIDIA nForce 980a/780a GPU with 8 cores.)

Installing the C2050 was easy. It has two power connectors (6-pin and 8-pin); in most cases, connecting the 8-pin connector is more than enough.



After installing the C2050, I had to reboot the machine and change a BIOS setting to disable display from external GPUs; the C2050 has a display output, and I don't intend to use it for now.

My deviceQuery output:
[vivekv@atstgpu release]$ more deviceQuery.txt
./deviceQuery Starting...
 CUDA Device Query (Runtime API) version (CUDART static linking)
There are 3 devices supporting CUDA
Device 0: "Tesla C2050"
 CUDA Driver Version:                           3.10
  CUDA Runtime Version:                          3.10
  CUDA Capability Major revision number:         2
  CUDA Capability Minor revision number:         0
  Total amount of global memory:                 3220897792 bytes
  Number of multiprocessors:                     14
  Number of cores:                               448
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Clock rate:                                    1.15 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   Yes
  Device has ECC support enabled:                No
Device 1: "Tesla C1060"
 CUDA Driver Version:                           3.10
  CUDA Runtime Version:                          3.10
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         3
  Total amount of global memory:                 4294770688 bytes
  Number of multiprocessors:                     30
  Number of cores:                               240
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.30 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   No
  Device has ECC support enabled:                No
Device 2: "nForce 980a/780a SLI"
 CUDA Driver Version:                           3.10
  CUDA Runtime Version:                          3.10
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         1
  Total amount of global memory:                 131399680 bytes
  Number of multiprocessors:                     1
  Number of cores:                               8
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.20 GHz
  Concurrent copy and execution:                 No
  Run time limit on kernels:                     Yes
  Integrated:                                    Yes
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   No
  Device has ECC support enabled:                No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.10, CUDA Runtime Version = 3.10, NumDevs = 3, Device = Tesla C2050, Device = Tesla C1060
PASSED
Press <Enter> to Quit...
-----------------------------------------------------------
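
For a programmatic equivalent of deviceQuery, pyCUDA's driver module can enumerate the same devices. A small sketch, assuming pyCUDA is installed (API names as in the pyCUDA driver documentation):

import pycuda.driver as drv

drv.init()   # initialize the CUDA driver API; no context is needed for queries
print("Found %d CUDA device(s)" % drv.Device.count())
for i in range(drv.Device.count()):
    dev = drv.Device(i)
    major, minor = dev.compute_capability()
    print("Device %d: %s (compute capability %d.%d, %d bytes of global memory)"
          % (i, dev.name(), major, minor, dev.total_memory()))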


I hope to post a performance comparison of my kernels on the Tesla C2050 soon.

Thursday, July 29, 2010

More HPC tutorials

I want to quickly share this blog, which has a lot of tutorials on MPI, OpenMP, CUDA, and graphics programming: http://supercomputingblog.com


I like the tutorials there and rate them higher than Dr. Dobb's "Supercomputing for the Masses" series.