threading question

Hi,

I use “mpirun -np 4 lmp_mpi -sf gpu -pk gpu 1”, which launches 4 processes on the CPU (seen in top), each at 100% utilization, and 4 processes on the GPU (seen in nvidia-smi). That basically means four processes are sharing the GPU.

I want to know if it is possible to have one process on the CPU with 400% CPU utilization instead.

Regards,
Mahmood

Hi,

> I use “mpirun -np 4 lmp_mpi -sf gpu -pk gpu 1”, which launches 4 processes on the CPU (seen in top), each at 100% utilization, and 4 processes on the GPU (seen in nvidia-smi). That basically means four processes are sharing the GPU.

yes, and this is a good thing. it increases GPU occupancy with most GPUs (you may need even more concurrent tasks for modern high-end GPUs) and parallelizes the parts of the calculation that are not GPU accelerated.

> I want to know if it is possible to have one process on the CPU with 400% CPU utilization instead.

no. besides, higher utilization doesn’t automatically result in better performance.

axel.
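For reference on the question above: a single process at ~400% CPU utilization is possible in principle through LAMMPS's OMP package, which drives several OpenMP threads from one MPI rank; whether it is faster is a separate question. A hedged sketch, assuming lmp_mpi was built with both the GPU and OMP packages (flag spellings per the LAMMPS command-line docs; in.script is a placeholder input file):

```shell
# Hedged sketch: one MPI rank, four OpenMP threads, instead of 4 ranks.
# Assumes a LAMMPS binary compiled with the GPU and OMP packages.
export OMP_NUM_THREADS=4
mpirun -np 1 lmp_mpi -sf hybrid gpu omp -pk gpu 1 -pk omp 4 -in in.script
```

Note that the single rank now handles all host-side work, so the parts of the timestep that are neither GPU-accelerated nor OpenMP-threaded no longer parallelize at all.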

> yes, and this is a good thing. it increases GPU occupancy with most GPUs (you may need even more concurrent tasks for modern high-end GPUs) and parallelizes the parts of the calculation that are not GPU accelerated.

But the case for the GPU is different from the CPU. Traditionally, if you have 4 cores and launch an MPI job with "-np 4", a function foo() will run on 4 cores (one process per core), with each core running foo() on different data.

Now, assume there are 4 MPI processes and each process is running on one core. Each process reaches bar(), which is a GPU kernel. When process 1 offloads bar() onto the GPU, the other processes have to wait.

So, I think using 4 MPI processes in the presence of a GPU is tricky. Any thoughts?

Regards,
Mahmood

> yes, and this is a good thing. it increases GPU occupancy with most GPUs (you may need even more concurrent tasks for modern high-end GPUs) and parallelizes the parts of the calculation that are not GPU accelerated.

> But the case for the GPU is different from the CPU. Traditionally, if you have 4 cores and launch an MPI job with "-np 4", a function foo() will run on 4 cores (one process per core), with each core running foo() on different data.
>
> Now, assume there are 4 MPI processes and each process is running on one core. Each process reaches bar(), which is a GPU kernel. When process 1 offloads bar() onto the GPU, the other processes have to wait.
>
> So, I think using 4 MPI processes in the presence of a GPU is tricky. Any thoughts?

a) i already gave you my thoughts
b) perhaps you need to look at some benchmarks and read some papers on the subject, e.g. the ones describing the GPU package in LAMMPS.

axel.
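On the concern above that ranks must wait while one of them runs a kernel: by default, kernels from separate processes time-share the GPU, but NVIDIA's Multi-Process Service (MPS) can let kernels from several MPI ranks execute concurrently on one device. A hedged sketch, assuming a CUDA toolkit that provides nvidia-cuda-mps-control and a placeholder input file in.script:

```shell
# Hedged sketch: run the 4-rank job under the CUDA MPS daemon so kernels
# from different MPI ranks can overlap on the single GPU.
nvidia-cuda-mps-control -d                    # start the MPS daemon
mpirun -np 4 lmp_mpi -sf gpu -pk gpu 1 -in in.script
echo quit | nvidia-cuda-mps-control           # stop the daemon afterwards
```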