Showing posts with label Multi-GPU. Show all posts

Sunday, July 31, 2011

ERSA presentation comparing DSPs vs FPGAs vs GPUs

I recently presented my work at the 2011 International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA 2011), held in Las Vegas, NV.


The three keynotes were: 
1. How Engineering Mathematics can Improve Software by Prof. David Lorge Parnas
2. The Nature of Cyber Security by Prof. Eugene H. Spafford
3. Changing Lives around the World: the Power of Technology by Dr. Sandeep Chatterjee


All three keynotes were impressive and very diverse: the first applied engineering mathematics to the software process; the second treated cyber security as a science and method; and the third covered mobile computing applications in developing countries. Though there are no slides to link for Dr. Chatterjee's talk, it included example scenarios where mobile computing was used for banking, construction, and more.


The ERSA sessions started with a bang: the tutorial session on evolvable computing by Prof. Jim Torresen was very interesting. The Evolvable And Bio-Inspired Hardware session by Dr. Eric Stahlberg was also engaging, converging on the question of which biological processes can inspire computing in general.


Finally, here's a link to my presentation on "Accelerating Real-time Processing of the ATST Adaptive Optics System using Coarse-grained Parallel Hardware Architectures". My talk was on Wednesday, July 20, 2011, in the Gold Room.


I also chaired Session 13-PDPTA: Systems Software + OS + Programming Models + Architecture Issues + Fault-Tolerant Systems & Tools on Thursday, July 21, 2011, from 08:00am to 12:20pm (location: Ballroom 4), as part of the 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'11).


Next year's ERSA is expected to be bigger and better, and promises to have a developers' meet/session. I am eagerly looking forward to it.

Thursday, October 7, 2010

Installing Tesla C2050 and C1060 on CentOS

I had been waiting to get an NVIDIA Fermi GPU to prototype my algorithms, but I first wanted an optimized implementation on the Tesla C1060 that I could then scale up to the Fermi. I got a C2050 and set about installing it on my machine. (Note: my host machine is a Lian-Li PC with an integrated NVIDIA nForce 980a/780a GPU with 8 cores.)

Installing the C2050 was easy; it has two power connectors (one 6-pin and one 8-pin). In most cases, connecting the 8-pin connector is enough.



After installing the C2050, I had to reboot the machine and change a BIOS setting to disable display output from external GPUs. The C2050 has a display output, but I don't intend to use it for now.
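The deviceQuery output below comes from the CUDA SDK sample, but the same information can be pulled directly from the runtime API. A minimal sketch (assuming the CUDA toolkit is installed; compile with nvcc):

```cuda
// Minimal device enumeration, similar in spirit to the SDK's deviceQuery sample.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);                 // how many CUDA-capable devices
    printf("There are %d devices supporting CUDA\n", count);

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);      // fill in per-device properties
        printf("Device %d: \"%s\"\n", i, prop.name);
        printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("  Multiprocessors:    %d\n", prop.multiProcessorCount);
        printf("  Global memory:      %lu bytes\n",
               (unsigned long)prop.totalGlobalMem);
    }
    return 0;
}
```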

My deviceQuery output:
[vivekv@atstgpu release]$ more deviceQuery.txt
./deviceQuery Starting...
 CUDA Device Query (Runtime API) version (CUDART static linking)
There are 3 devices supporting CUDA
Device 0: "Tesla C2050"
 CUDA Driver Version:                           3.10
  CUDA Runtime Version:                          3.10
  CUDA Capability Major revision number:         2
  CUDA Capability Minor revision number:         0
  Total amount of global memory:                 3220897792 bytes
  Number of multiprocessors:                     14
  Number of cores:                               448
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Clock rate:                                    1.15 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   Yes
  Device has ECC support enabled:                No
Device 1: "Tesla C1060"
 CUDA Driver Version:                           3.10
  CUDA Runtime Version:                          3.10
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         3
  Total amount of global memory:                 4294770688 bytes
  Number of multiprocessors:                     30
  Number of cores:                               240
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.30 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   No
  Device has ECC support enabled:                No
Device 2: "nForce 980a/780a SLI"
 CUDA Driver Version:                           3.10
  CUDA Runtime Version:                          3.10
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         1
  Total amount of global memory:                 131399680 bytes
  Number of multiprocessors:                     1
  Number of cores:                               8
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.20 GHz
  Concurrent copy and execution:                 No
  Run time limit on kernels:                     Yes
  Integrated:                                    Yes
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   No
  Device has ECC support enabled:                No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.10, CUDA Runtime Version = 3.10, NumDevs = 3, Device = Tesla C2050, Device = Tesla C1060
PASSED
Press <Enter> to Quit...
-----------------------------------------------------------
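With three CUDA devices visible (including the integrated nForce GPU), kernels launched without an explicit device selection run on device 0 by default. One way to make sure work lands on a discrete Tesla rather than the on-board GPU is to pick the device with the highest compute capability; a hedged sketch, assuming the CUDA 3.x runtime API:

```cuda
// Sketch: select the non-integrated device with the highest compute
// capability (on this machine, the Tesla C2050) before launching kernels.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0, best = 0, bestCap = -1;
    cudaGetDeviceCount(&count);

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        if (prop.integrated) continue;          // skip the on-board nForce GPU
        int cap = prop.major * 10 + prop.minor; // e.g. 2.0 -> 20, 1.3 -> 13
        if (cap > bestCap) { bestCap = cap; best = i; }
    }

    cudaSetDevice(best);  // subsequent allocations and kernels use this device
    printf("Using device %d\n", best);
    return 0;
}
```

cudaSetDevice must be called per host thread, so a multi-GPU setup like this one would call it with a different index from each worker thread.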


I hope to post the performance comparison of my kernels using the Tesla C2050 sometime soon.