Monday, April 6, 2015

Compiling MNIST on Mac OS X [cuDNN test]


$ make
g++ -c -o mnistCUDNN.o mnistCUDNN.cpp -I. -I/usr/local/cuda/include -I/usr/local/cuda/cudnn  -I/opt/local/include -I/usr/local/cuda/samples/7_CUDALibraries/common/UtilNPP
gcc -o mnistCUDNN mnistCUDNN.o -L/usr/local/cuda/lib -L/usr/local/cuda/cudnn -L/opt/local/lib -lnppi -lnppc -lfreeimage -lcudart -lcublas -lcudnn -lm -lstdc++


$ ./mnistCUDNN 
Loading image data/one_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
4.05186e-07 0.999404 2.21383e-07 1.20837e-08 0.000587085 5.06682e-08 2.80583e-06 1.47965e-06 3.56051e-06 2.46337e-07 
Loading image data/three_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
4.67739e-05 5.83973e-07 1.76501e-06 0.75859 1.06138e-11 0.24133 2.62157e-10 1.11104e-05 3.39113e-07 1.88164e-05 
Loading image data/five_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
3.22452e-10 8.69774e-10 3.73033e-12 3.2219e-07 2.67785e-11 0.999992 4.58862e-06 5.08385e-10 9.35238e-07 1.87656e-06 
Result of classification: 1 3 5
Test passed!
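
The ten softmax weights map to the digits 0-9, and the reported classification is simply the argmax of each row. A minimal Python sketch using the "three" row above:

import numpy as np

# Softmax outputs printed above for the "three" image; the index of the
# largest value is the predicted digit.
softmax = np.array([4.67739e-05, 5.83973e-07, 1.76501e-06, 0.75859,
                    1.06138e-11, 0.24133, 2.62157e-10, 1.11104e-05,
                    3.39113e-07, 1.88164e-05])
print(softmax.argmax())  # -> 3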

Installing Caffe on Mac OS X 10.10

Here are the instructions I followed for installing Caffe on Mac OS X 10.10.


Clone the repository from here:
git clone https://github.com/BVLC/caffe
cp Makefile.config.example Makefile.config


Inside Makefile.config, make the following changes:
PYTHON_LIB := $(ANACONDA_HOME)/lib
INCLUDE_DIRS := $(PYTHON_INCLUDE) /usr/local/include /opt/local/include /usr/local/cuda/cudnn
LIBRARY_DIRS := $(PYTHON_LIB) /usr/local/lib /usr/lib /opt/local/lib /usr/local/cuda/cudnn

Check the requirements.txt file:
Cython>=0.19.2
numpy>=1.7.1
scipy>=0.13.2
scikit-image>=0.9.3
matplotlib>=1.3.1
ipython>=1.1.0
h5py>=2.2.0
leveldb>=0.191
networkx>=1.8.1
nose>=1.3.0
pandas>=0.12.0
python-dateutil>=1.4,<2
protobuf>=2.5.0
python-gflags>=2.0
pyyaml>=3.10
Pillow>=2.3.0

If you are using Anaconda:
$ for req in $(cat requirements.txt); do conda install $req; done
Compile and build:
$ make all
$ make py
$ make test
$ make runtest

$ export CAFFE_HOME=${HOME}/caffe
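
To quickly sanity-check the pycaffe build, here is a minimal sketch (assuming the build above succeeded and CAFFE_HOME points at the clone, as exported above):

import os, sys

# Make the freshly built pycaffe importable (path follows CAFFE_HOME above).
sys.path.insert(0, os.path.join(os.environ['CAFFE_HOME'], 'python'))
import caffe                 # import fails here if pycaffe did not build
caffe.set_mode_gpu()         # or caffe.set_mode_cpu() on a machine without CUDA
print(caffe.__file__)        # confirms which caffe module was imported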

Thursday, August 15, 2013

LaTeX help: where to save .cls and .bst files locally on a Mac

Recently I was looking at options for where to save LaTeX .cls and .bst files locally on my Mac, and I came across this very good source from TeX StackExchange: How to have local package override default package.

Most of the time I modify the IEEEtran.bst file so that if I cite the same author twice, the author's name appears again:


% #0 turns off the "dashification" of repeated (i.e., identical to those
% of the previous entry) names. IEEE normally does this.
% #1 enables
FUNCTION {default.is.dash.repeated.names} { #0 }

and sometimes, if an article has more than 6 authors, I want that reference shortened to et al.:

% The maximum number of names that can be present beyond which an "et al."
% usage is forced. Be sure that num.names.shown.with.forced.et.al (below)
% is not greater than this value!
% Note: There are many instances of references in IEEE journals which have
% a very large number of authors as well as instances in which "et al." is
% used profusely.
FUNCTION {default.max.num.names.before.forced.et.al} { #3 }

So as per the instructions, I created this structure:
1. For .cls files:


/Users/'username'/Library/texmf/tex/latex/



2. For .bst files:


/Users/'username'/Library/texmf/bibtex/bst/


Happy LaTeXing!

Sunday, July 31, 2011

ERSA presentation comparing DSPs vs FPGAs vs GPUs

I recently presented my work at the 2011 International Conference on
Engineering of Reconfigurable Systems and Algorithms (ERSA 2011), held in Las Vegas, NV.


The three keynotes were: 
1. How Engineering Mathematics can Improve Software by Prof. David Lorge Parnas
2. The Nature of Cyber Security by Prof. Eugene H. Spafford
3. Changing Lives around the World: the Power of Technology by Dr. Sandeep Chatterjee


All the keynotes were impressive and very diverse: they ranged from applying engineering mathematics to a process, to cyber security as a science and method, and finally to mobile computing applications in developing countries. Though there are no slides to link for Dr. Chatterjee's talk, it included example scenarios where mobile computing was used for banking, construction, and more.


The ERSA sessions started with a bang: the tutorial session on evolvable computing by Prof. Jim Torresen was very interesting. The Evolvable and Bio-Inspired Hardware session by Dr. Eric Stahlberg was also interesting, converging on the question of which biological processes can inspire computing in general.


Finally, here's a link to my presentation on "Accelerating Real-time processing of the ATST Adaptive Optics System using Coarse-grained Parallel Hardware Architectures". My talk was on Wednesday, July 20, 2011, in the Gold Room.


I also chaired Session 13-PDPTA: Systems Software + OS + Programming Models + Architecture Issues +
Fault-Tolerant Systems & Tools on Thursday July 21, 2011 from 08:00am - 12:20pm (LOCATION: Ballroom 4) as part of the 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'11).


Next year's ERSA is expected to be bigger and better and promises to have a developers meet/session. Eagerly looking forward to next year's conference.

Monday, April 25, 2011

Some great reference books for CUDA and OpenCL

This is a list of some good reference books for CUDA and OpenCL:



1. CUDA by Example: An Introduction to General-Purpose GPU Programming, by Jason Sanders & Edward Kandrot (also available for Kindle).


2. Programming Massively Parallel Processors: A Hands-on Approach (Applications of GPU Computing Series), by David B. Kirk & Wen-mei W. Hwu (also available for Kindle).


3. CUDA Application Design and Development, by Rob Farber (Dr. Dobb's series author). An excellent book with some good examples and an overview of the CUDA+OpenGL interface.


4. Heterogeneous Computing with OpenCL, by Benedict Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry & Dana Schaa.




Wednesday, December 22, 2010

Installing pyCUDA-0.94.2 on CentOS 5.5

Installing pyCUDA-0.94.2 can be quite a hassle on the CentOS 5.5 Linux OS. As per a couple of my posts to the pyCUDA list, I did use the correct parameters while configuring pyCUDA. However, CentOS 5.5 ships with gcc-4.1.2, which gives an error while installing pyCUDA. So here are the steps I followed to install pyCUDA-0.94.2 on CentOS 5.5:

Step 1: Change your default gcc to gcc-4.4 and g++ to g++-4.4. I used yum to install them.

$ cd /usr/bin
$ sudo rm /usr/bin/gcc
$ sudo ln -s /usr/bin/gcc44 /usr/bin/gcc
$ sudo rm /usr/bin/g++
$ sudo ln -s /usr/bin/g++44 /usr/bin/g++

Check your installation:

$ gcc -v
Using built-in specs.
Target: x86_64-redhat-linux6E
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-languages=c,c++,fortran --disable-libgcj --with-mpfr=/builddir/build/BUILD/gcc-4.4.0-20090514/obj-x86_64-redhat-linux6E/mpfr-install/ --with-ppl=/builddir/build/BUILD/gcc-4.4.0-20090514/obj-x86_64-redhat-linux6E/ppl-install --with-cloog=/builddir/build/BUILD/gcc-4.4.0-20090514/obj-x86_64-redhat-linux6E/cloog-install --with-tune=generic --with-arch_32=i586 --build=x86_64-redhat-linux6E
Thread model: posix
gcc version 4.4.0 20090514 (Red Hat 4.4.0-6) (GCC) 

$ g++ -v
Using built-in specs.
Target: x86_64-redhat-linux6E
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-languages=c,c++,fortran --disable-libgcj --with-mpfr=/builddir/build/BUILD/gcc-4.4.0-20090514/obj-x86_64-redhat-linux6E/mpfr-install/ --with-ppl=/builddir/build/BUILD/gcc-4.4.0-20090514/obj-x86_64-redhat-linux6E/ppl-install --with-cloog=/builddir/build/BUILD/gcc-4.4.0-20090514/obj-x86_64-redhat-linux6E/cloog-install --with-tune=generic --with-arch_32=i586 --build=x86_64-redhat-linux6E
Thread model: posix
gcc version 4.4.0 20090514 (Red Hat 4.4.0-6) (GCC)

Step 2: Extract pycuda-0.94.2 into your directory and install numpy, pytest and pytools

$ tar xzf pycuda-0.94.2.tar.gz
$ sudo easy_install numpy
install_dir /usr/local/lib/python2.6/site-packages/
Searching for numpy
Best match: numpy 1.5.1
Processing numpy-1.5.1-py2.6-linux-x86_64.egg
numpy 1.5.1 is already the active version in easy-install.pth
Installing f2py script to /usr/local/bin

Using /usr/local/lib/python2.6/site-packages/numpy-1.5.1-py2.6-linux-x86_64.egg
Processing dependencies for numpy
Finished processing dependencies for numpy

$ sudo easy_install pytest
install_dir /usr/local/lib/python2.6/site-packages/
Searching for pytest
Reading http://pypi.python.org/simple/pytest/
Reading http://pytest.org
Best match: pytest 2.0.0
Downloading http://pypi.python.org/packages/source/p/pytest/pytest-2.0.0.zip#md5=f07c521dfd5a540f3dfea1846e58dab7
Processing pytest-2.0.0.zip
Running pytest-2.0.0/setup.py -q bdist_egg --dist-dir /tmp/easy_install-5pyoXO/pytest-2.0.0/egg-dist-tmp-ssbYCg
Adding pytest 2.0.0 to easy-install.pth file
Installing py.test script to /usr/local/bin
Installing py.test-2.6 script to /usr/local/bin

Installed /usr/local/lib/python2.6/site-packages/pytest-2.0.0-py2.6.egg
Processing dependencies for pytest
Finished processing dependencies for pytest


$ sudo easy_install pytools
install_dir /usr/local/lib/python2.6/site-packages/
Searching for pytools
Best match: pytools 11
Processing pytools-11-py2.6.egg
pytools 11 is already the active version in easy-install.pth
Installing runalyzer-gather script to /usr/local/bin
Installing runalyzer script to /usr/local/bin
Installing logtool script to /usr/local/bin

Using /usr/local/lib/python2.6/site-packages/pytools-11-py2.6.egg
Processing dependencies for pytools
Finished processing dependencies for pytools

Step 3: Configure siteconf.py


$ python configure.py --cuda-root=/usr/local/cuda --cudadrv-lib-dir=/usr/lib64/nvidia --boost-inc-dir=/usr/include --boost-lib-dir=/usr/local/lib CXXFLAGS=-DBOOST_PYTHON_NO_PY_SIGNATURES
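
Running configure.py writes these choices into siteconf.py. For reference, the generated file should look roughly like this (a sketch based on the flags above; defaults are shown for fields not set on the command line, and exact contents can vary between pyCUDA versions):

# siteconf.py (approximate) -- plain Python assignments read by setup.py
BOOST_INC_DIR = ['/usr/include']
BOOST_LIB_DIR = ['/usr/local/lib']
BOOST_PYTHON_LIBNAME = ['boost_python']
CUDA_ROOT = '/usr/local/cuda'
CUDADRV_LIB_DIR = ['/usr/lib64/nvidia']
CXXFLAGS = ['-DBOOST_PYTHON_NO_PY_SIGNATURES']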

Step 4: Install
$ sudo make install


Step 5: Test your installation
$ python test/test_driver.py 
=============================================================== test session starts ===============================================================
platform linux2 -- Python 2.6.6 -- pytest-2.0.0
collected 18 items 

test/test_driver.py ..................

============================================================ 18 passed in 4.58 seconds ========================================


If you see the message above, pyCUDA should be working now. Enjoy :)
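
Beyond the bundled tests, here is a minimal pyCUDA smoke-test sketch (following the standard pyCUDA usage pattern) that compiles and runs a trivial kernel:

import numpy as np
import pycuda.autoinit                     # creates a context on the first device
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# Compile a tiny kernel that doubles each element of an array in place.
mod = SourceModule("""
__global__ void double_array(float *a)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    a[idx] *= 2.0f;
}
""")

double_array = mod.get_function("double_array")
a = np.random.randn(256).astype(np.float32)
out = a.copy()
# InOut copies the array to the GPU and back around the kernel launch.
double_array(drv.InOut(out), block=(256, 1, 1), grid=(1, 1))
print(np.allclose(out, 2 * a))             # should print True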


FYI: For installing pyCuda on Ubuntu 10.10, refer to http://wiki.tiker.net/PyCuda/Installation/Linux/Ubuntu



Thursday, October 7, 2010

Installing Tesla C2050 and C1060 on Centos

I had been waiting to get an NVIDIA Fermi GPU to prototype my algorithms, but first I wanted an optimized implementation on the Tesla C1060 that I could then scale up to the Fermi. I got a C2050 and set about installing it on my machine. (Note: my host machine is a Lian-Li PC with an integrated NVIDIA nForce 980a/780a (8 cores).)

Installing the C2050 was easy; it has two power connections (6-pin and 8-pin). In most cases, connecting the 8-pin connector is more than enough.



After installing the C2050, I had to boot the machine and change the BIOS settings to disable display from external GPUs. The C2050 has a display output, and I don't intend to use it for now.

My deviceQuery output:
[vivekv@atstgpu release]$ more deviceQuery.txt
./deviceQuery Starting...
 CUDA Device Query (Runtime API) version (CUDART static linking)
There are 3 devices supporting CUDA
Device 0: "Tesla C2050"
 CUDA Driver Version:                           3.10
  CUDA Runtime Version:                          3.10
  CUDA Capability Major revision number:         2
  CUDA Capability Minor revision number:         0
  Total amount of global memory:                 3220897792 bytes
  Number of multiprocessors:                     14
  Number of cores:                               448
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Clock rate:                                    1.15 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   Yes
  Device has ECC support enabled:                No
Device 1: "Tesla C1060"
 CUDA Driver Version:                           3.10
  CUDA Runtime Version:                          3.10
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         3
  Total amount of global memory:                 4294770688 bytes
  Number of multiprocessors:                     30
  Number of cores:                               240
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.30 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   No
  Device has ECC support enabled:                No
Device 2: "nForce 980a/780a SLI"
 CUDA Driver Version:                           3.10
  CUDA Runtime Version:                          3.10
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         1
  Total amount of global memory:                 131399680 bytes
  Number of multiprocessors:                     1
  Number of cores:                               8
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.20 GHz
  Concurrent copy and execution:                 No
  Run time limit on kernels:                     Yes
  Integrated:                                    Yes
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   No
  Device has ECC support enabled:                No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.10, CUDA Runtime Version = 3.10, NumDevs = 3, Device = Tesla C2050, Device = Tesla C1060
PASSED
Press <Enter> to Quit...
-----------------------------------------------------------
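
The same three devices can also be enumerated programmatically. A minimal sketch using pyCUDA (installed as described in an earlier post; the driver API calls below are standard pyCUDA):

import pycuda.driver as drv

drv.init()                                 # initialize the CUDA driver API
for i in range(drv.Device.count()):
    dev = drv.Device(i)
    major, minor = dev.compute_capability()
    print("Device %d: %s, %d MB global memory, compute %d.%d"
          % (i, dev.name(), dev.total_memory() // (1024 * 1024), major, minor))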


I hope to post the performance comparison of my kernels using the Tesla C2050 sometime soon.