GPU efficiency

Dear all,
I have built LAMMPS with the GPU option. However, when I submit the job, the GPU utilisation for my programme is only ~10%.
Could you please give me some suggestions on how to improve the efficiency?

With best regards,
Pritam

Well, the GPU usage will be highly dependent on your problem, so until you provide a description of that, there is not much we can do. Perhaps a lot of computation time is spent on bonds rather than pair potentials? In that case you will barely get any speedup with the gpu package.
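For reference, only styles with a /gpu variant are offloaded by the gpu package; a minimal sketch of what the relevant lines of an input like this effectively become under "-sf gpu" (assuming one GPU per node):

package gpu 1
pair_style lj/cut/coul/long/gpu 10.0
kspace_style pppm/gpu 1.0e-4

Bond, angle, and dihedral terms have no /gpu variants and always stay on the CPU, as do the fixes and the time integration.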

Thank you very much, Stefan, for the reply.
I have attached my input script below. Could you please have a look and give some suggestions?


newton off #the gpu package requires Newton's 3rd law to be off for pair interactions

variable T equal 298.0

variable V equal 1e-6

variable Fn equal -0.157674782e-4 # -0.157674782e-4 is 1 atm

variable K equal 1.0e-2

variable DT equal 1.0

timestep ${DT}

thermo 10000

units real

atom_style full

dimension 3

boundary p p f

#Non-bonded energy

pair_style lj/cut/coul/long 10.0

kspace_style pppm 1.0e-4

kspace_modify slab 3.0 #slab correction for the non-periodic z dimension (boundary p p f)

pair_modify shift yes mix geometric

#Bond stretching force constants

bond_style harmonic

#Bond angle bending force constants

angle_style harmonic

#Torsional Rotation

dihedral_style opls

improper_style none

read_data system_mica_96.data

group mol type 1 2 3 4 5 6 7 8 9 10

group vector type 10 3

group head type 10

group stage type 19

group top molecule <> 97 136

group bot molecule <> 137 176

dump 1 all xyz 100000 all.xyz

dump_modify 1 sort id element C C C C C C C H H N K Si Al Al O O O H Fe

neighbor 1.0 bin

neigh_modify every 5 delay 0 check yes

velocity mol create $T ${seed} mom yes rot yes dist gaussian

velocity top set 0 0 0 units box

velocity bot set 0 0 0 units box

velocity stage set 0 0 0 units box

fix 1 bot setforce 0 0 0

fix 2 top setforce 0 0 0

fix 3 stage setforce 0 0 0

minimize 1.0e-4 1.0e-4 10000 100000

fix NVT mol nvt temp $T $T 100.0

restart 500000 restart_before_steady_state.*.friction

run 10000

unfix 2

fix 4 top aveforce NULL NULL ${Fn} #apply the constant normal load Fn to the top surface

fix 5 top move linear $V 0 NULL units box #drag top in x at velocity V; z is left free so the load can act

run 1000000

unfix 1

unfix 3

variable Rx equal xcm(bot,x)

variable Vx equal vcm(bot,x)

variable MM equal mass(bot)

variable Sx equal xcm(stage,x)

variable Fx equal v_K*(v_Sx-v_Rx) #lateral spring force of stiffness K pulling bot toward the stage

fix 6 bot move linear NULL 0 0 units box

fix 7 stage move linear 0 0 0 units box

fix 8 bot aveforce v_Fx NULL NULL

unfix NVT

fix Langevin mol langevin $T $T 100 ${seed}

compute newT mol temp/partial 0 1 0

fix_modify Langevin temp newT

fix NVE mol nve

thermo_style custom step c_newT pe

#restart 500000 restart_before_steady_state.*.friction

#run 2000000

variable Ff equal fcm(bot,x)/count(bot)-v_Fx

variable topZ equal xcm(top,z)

#fix AvgForce all ave/time 100 1 100 v_F_string file force_friction.txt mode scalar ave one

fix MassCenter all ave/time 100 1 100 v_Fx v_Ff v_Rx v_Vx v_topZ file mass_center.txt mode scalar ave one

compute MSD head msd

fix MSD all ave/time 100 1 100 c_MSD[1] c_MSD[2] c_MSD[3] c_MSD[4] file mol_msd.txt mode scalar ave one

dump 2 vector custom 100 dump.lammpstrj id mol type xu yu zu

dump_modify 2 sort id format "%5d %5d %5d %10lf %10lf %10lf"

restart 1000000 restart_at_steady_state.*.friction

run 40000000

clear

With best regards,
Pritam


the LAMMPS manual has a complete section with advice on how to use
accelerators and what determines their efficient use:
http://lammps.sandia.gov/doc/Section_accelerate.html
you most certainly won't get much help by just barfing a convoluted
and cluttered input file in people's faces.

there is a lot of *crucial* information missing here:
- what GPU do you have?
- what kind of hardware are you running on (desktop, HPC cluster,
laptop, what CPU, shared or exclusive use)?
- how did you compile the GPU library?
- how did you run your job? how many MPI tasks per GPU?
- how large is your system (i.e. how many atoms)?
- what LAMMPS version do you use?
- how does your GPU machine stack up against the available benchmarks
with the provided benchmark inputs and the data from here:
http://lammps.sandia.gov/bench.html (e.g. with a quick run like the sketch below)
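a quick comparison run against those benchmarks might look like this (a sketch only; the paths and the MPI task count are assumptions):

cd lammps-30Jul16/bench
mpirun -np 4 ../src/lmp_mpi -sf gpu -pk gpu 1 -in in.lj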

axel.

Dear Axel,
Sorry, I should have mentioned all the details, not only the input script.
Here I have given all the details about the GPU run. It would be very helpful if you could comment.

In the CSC (Finland) GPU cluster, 38 nodes have two NVIDIA Tesla K40 GPUs each; these nodes host two six-core Intel Xeon E5-2620-v2 CPUs. Another 12 nodes have two NVIDIA Tesla K80 GPU accelerator cards each; those nodes are equipped with two 12-core Intel Xeon E5-2680 processors.

I have used the command
make -f Makefile.linux.double
where in Makefile.linux.double (taken from lammps-30Jul16/lib/gpu) the following three settings are used:

CUDA_HOME = /appl/opt/cuda/7.5
CUDA_ARCH = -arch=sm_35
CUDA_PRECISION = -D_SINGLE_DOUBLE
and in Makefile.lammps.standard I have made the necessary changes.

LAMMPS was compiled with the mpicxx compiler, using the modules openmpi/1.10.2, gcc/4.9.3, cuda/7.5, StdEnv, and git/1.9.2.

Before compiling LAMMPS, I installed the gpu and kspace packages.
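(For reference, the complete build sequence for this setup would have been roughly as follows; the directory names are assumptions based on the tarball name, and the molecule package is also needed for the bonded styles in the input:)

cd lammps-30Jul16/lib/gpu
make -f Makefile.linux.double
cd ../../src
make yes-gpu
make yes-kspace
make yes-molecule
make mpi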

srun --gres=gpu:1 lmp_mpi -sf gpu -in script.in

I have asked for one CPU core and one GPU.
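(Note: this launches a single MPI task. With the gpu package it is often faster to share one GPU among several MPI tasks; a hypothetical variant, with the SLURM task count assumed:)

srun -n 6 --gres=gpu:1 lmp_mpi -sf gpu -pk gpu 1 -in script.in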

The most recent version of LAMMPS (30 Jul 2016) is used.

With best regards,
Pritam


no you haven't.
axel.

One more piece of information: the number of atoms is 17377. The system has 96 liquid crystal molecules with 41 atoms each; the rest of the atoms belong to the mica surface.
With best regards,
Pritam