Errors while compiling with the GPU package

Here is the result

NVIDIA: could not open the device file /dev/nvidiactl (No such file or
directory).
Failed to initialize NVML: Unknown Error

Well, maybe the toolkit I installed is too new (4.0). My driver is
275.09.07, 64-bit. I really don't know if they are compatible.

Luis

OK, I added the flag -fno-rtti and succeeded with the compilation. I
tested the executable with this fix

fix gpuConf all gpu force 0 0 -1

and kspace_style pppm/gpu/single. Running on 6 processors, the following
message came:

Cuda driver error 100 in call at file 'geryon/nvd_device.h' in line 207.
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
Cuda driver error 100 in call at file 'geryon/nvd_device.h' in line 207.
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 4
Cuda driver error 100 in call at file 'geryon/nvd_device.h' in line 207.
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
Cuda driver error 100 in call at file 'geryon/nvd_device.h' in line 207.
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
Cuda driver error 100 in call at file 'geryon/nvd_device.h' in line 207.
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
Cuda driver error 100 in call at file 'geryon/nvd_device.h' in line 207.

Hi

That is a common problem. Sometimes the device entries are not created.
This should help (as root):

mknod /dev/nvidia0 c 195 0 ; chmod 666 /dev/nvidia0
mknod /dev/nvidiactl c 195 255 ; chmod 666 /dev/nvidiactl

If you have more than one GPU, you need more lines of the first kind with increasing numbers. E.g., for a second GPU you would run:

mknod /dev/nvidia1 c 195 1 ; chmod 666 /dev/nvidia1

You might need to add that to your bootup script.
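For reference, Christian's commands could be collected into a small boot-time snippet along these lines. This is only a sketch: the GPU count of 2 is an example, and the snippet is shown in dry-run form (echo) so you can inspect the commands before actually running them as root, e.g. from /etc/rc.local.

```shell
# Dry-run sketch of a boot-time device-node setup. Remove the 'echo'
# prefixes (and run as root) to actually create the nodes.
NUM_GPUS=2   # example value - adjust to the number of GPUs in your machine
for i in $(seq 0 $((NUM_GPUS - 1))); do
    echo mknod -m 666 /dev/nvidia$i c 195 $i
done
echo mknod -m 666 /dev/nvidiactl c 195 255
```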

Cheers
Christian

-------- Original Message --------

Here is the result

NVIDIA: could not open the device file /dev/nvidiactl (No such file or
directory).

well, there is your problem. due to the udev system,
the cuda devices get deleted at every reboot. you either
have to configure udev to create them for you with the proper
permissions, or recreate them yourself after each boot.

the most convenient way to create those devices is to run
nvidia-smi -a as root. just add it to /etc/rc.d/rc.local
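A sketch of that addition, using a stand-in file so nothing on the system is modified; on a real machine the target would be /etc/rc.d/rc.local, edited as root, and the /usr/bin/nvidia-smi path is an assumption (check with `which nvidia-smi`).

```shell
# Demonstration with a temporary stand-in for /etc/rc.d/rc.local.
RC_LOCAL=$(mktemp)                              # stand-in for /etc/rc.d/rc.local
echo '/usr/bin/nvidia-smi -a > /dev/null 2>&1' >> "$RC_LOCAL"
cat "$RC_LOCAL"                                 # show the line that was appended
rm -f "$RC_LOCAL"
```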

Failed to initialize NVML: Unknown Error

Well, maybe the toolkit I installed is too new (4.0). My driver is
275.09.07, 64-bit. I really don't know if they are compatible.

yes they are:

[[email protected] gpu]$ ./nvc_get_devices
Found 1 platform(s).
Using platform: NVIDIA Corporation NVIDIA CUDA
CUDA Driver Version: 4.0
CUDA Runtime Version: 4.0

Device 0: "GeForce GTX 560 Ti"
  Type of device: GPU
  Compute capability: 2.1
  Double precision support: Yes
  Total amount of global memory: 0.999207 GB
  Number of compute units/multiprocessors: 8
  Number of cores: 256
  Total amount of constant memory: 65536 bytes
  Total amount of local/shared memory per block: 49152 bytes
  Total number of registers available per block: 32768
  Warp size: 32
  Maximum number of threads per block: 1024
  Maximum group size (# of threads per block) 1024 x 1024 x 64
  Maximum item sizes (# threads for each dim) 65535 x 65535 x 65535
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Clock rate: 1.645 GHz
  Concurrent copy and execution: Yes
  Run time limit on kernels: Yes
  Integrated: No
  Support host page-locked memory mapping: Yes
  Compute mode: Default
  Concurrent kernel execution: Yes
  Device has ECC support enabled: No
[[email protected] gpu]$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 275.09.07 Wed Jun 8
14:16:46 PDT 2011
GCC version: gcc version 4.4.4 20100630 (Red Hat 4.4.4-10) (GCC)

Hi,

Now the nvidia-smi -a gives:

==============NVSMI LOG==============

Timestamp : Mon Jun 20 18:25:08 2011

Driver Version : 270.41.19

Attached GPUs : 1

GPU 0:3:0
    Product Name : GeForce GT 240
    Display Mode : N/A
    Persistence Mode : Disabled
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : N/A
    GPU UUID : N/A
    Inforom Version
        OEM Object : N/A
        ECC Object : N/A
        Power Management Object : N/A
    PCI
        Bus : 3
        Device : 0
        Domain : 0
        Device Id : CA310DE
        Bus Id : 0:3:0
    Fan Speed : 41 %
    Memory Usage
        Total : 1023 Mb
        Used : 40 Mb
        Free : 982 Mb
    Compute Mode : Default
    Utilization
        Gpu : N/A
        Memory : N/A
    Ecc Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
        Aggregate
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
    Temperature
        Gpu : 33 C
    Power Readings
        Power State : N/A
        Power Management : N/A
        Power Draw : N/A
        Power Limit : N/A
    Clocks
        Graphics : N/A
        SM : N/A
        Memory : N/A

While running with fix gpu 0 0 1 and pppm/gpu/single, I have not obtained
any performance improvement running on 1 or 2 CPU cores. I am running a
9216-atom system with the buck/coul/long potential.

What should be the speedup in this case?

Best regards,
Luis

While running with fix gpu 0 0 1 and pppm/gpu/single, I have not obtained
any performance improvement running on 1 or 2 CPU cores. I am running a
9216-atom system with the buck/coul/long potential.

What should be the speedup in this case?

hard to say if there would be any speedup. your GPU model is pretty
limited in terms of memory bandwidth and floating point performance
(clock rate and number of cores). all of that has significant impact on
the amount of acceleration possible. please note that this pair style is
only supported for acceleration with the USER-CUDA package, not
the GPU package. does the output show which GPU package is active?

cheers,
   axel.

I plan on using the pppm acceleration. Below is an extract of the output

I plan on using the pppm acceleration. Below is an extract of the output

that is a pretty pointless thing, if the majority of the time
is spent on computing the real-space interactions.

have a look at amdahl's law and be enlightened. :wink:
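to make that concrete (illustrative numbers only, not measurements from this run): if k-space were 10% of the total time and pppm/gpu made that part 5x faster, amdahl's law caps the overall gain at about 9%:

```shell
# Amdahl's law: speedup = 1 / ((1 - f) + f/s), where f is the fraction of
# the runtime being accelerated and s is the speedup of that fraction.
# f = 0.10 and s = 5 are illustrative values, not measurements.
awk 'BEGIN { f = 0.10; s = 5.0; printf "%.2fx overall\n", 1 / ((1 - f) + f / s) }'
```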

cheers,
   axel.

I don't think that PPPM with GPU acceleration should be slower than a CPU run. I am happy to look at the screen output for both the CPU and GPU runs that includes the GPU timings; however, there are some things you should think about before pursuing this:

1. pppm/gpu can provide significant speedups for cases where the k-space computational time represents a significant fraction of the total runtime. This will happen for certain simulations, mostly when using GPU acceleration for pair forces. If k-space is 10% of the runtime, however, the absolute best you can do is a 10% improvement in performance. k-space times for parallel jobs can be communication bound and gpu-acceleration will not help with this.

2. Since the pair time, run on the CPU, is 90% of the run-time, running the simulation on 2 cores should be faster even if for some reason the pppm/gpu time is a little slower - you are dividing >90% of the work between different CPU cores.

3. A new GPU is not going to help much if buck/coul/long is not available for GPU acceleration.

4. With only 9216 atoms, the speedup with a good GPU and a port for buck/coul/long versus a hex-core opteron will probably be between 2 and 3 times. I do believe LAMMPS can be improved for smaller problem sizes, however, I also think that this is unlikely to happen soon (at least with full feature compatibility).
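To illustrate point 2 with back-of-the-envelope numbers (assuming the 90%/10% pair/k-space split above and that the pair part scales perfectly to 2 cores, which is an idealization):

```shell
# If pair forces (90% of the runtime) split perfectly across 2 cores while
# k-space (10%) stays serial, the relative run-time drops from 1.0 to
# 0.9/2 + 0.1 = 0.55, i.e. roughly a 1.8x speedup.
awk 'BEGIN { pair = 0.90; kspace = 0.10; cores = 2;
             t = pair / cores + kspace;
             printf "%.2fx speedup on %d cores\n", 1 / t, cores }'
```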

- Mike

Luis Goncalves wrote:

Firstly, thank you all for the great help.

It appears to me that I am better off using the USER-CUDA package because
it will speed up the pair-interaction portion. pppm/gpu will then help me,
since the k-space timings will grow as a fraction of the total once the
pair part is accelerated. Besides, my pair potential is already
implemented in USER-CUDA.

My other concern is regarding multiple cores and gpu processing. If I
understood correctly, item 4 below means that

1 gpu + 1 cpu = (6 cpus) x 2 times faster

if I use gpu for pair interactions. Is that right?

Cheers,
Luis

This is a future option if you have a different GPU (compute capability
1.3 is required for user-cuda - yours is 1.2). Regarding your question,
I meant that the best time you could get on a 6-core opteron with GPU
would be 2-3x the best time you could get without. The "cuda" pkg is
entirely different code with different performance.

- Mike