Dear all,
I have built LAMMPS with the GPU option. However, when I submit a job, the GPU utilisation of my run is only ~10%.
Could you please give me some suggestions on how to improve the efficiency?
With best regards,
Pritam
Well, the GPU usage will be highly dependent on your problem, so until you provide a description of that, there is not much we can do. Perhaps a lot of computation time is spent on bonds rather than pair potentials? In that case you will barely get any speedup with the gpu package.
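To put a number on that point, here is a minimal sketch of Amdahl's law in Python. The runtime fractions and the 10x kernel speedup are made-up illustrative values, not measurements from this system; the real split can be read from the timing breakdown at the end of a LAMMPS log.

```python
def overall_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: whole-run speedup when only part of the runtime is accelerated."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    # The part left on the CPU keeps its cost; the offloaded part shrinks.
    remaining = (1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup
    return 1.0 / remaining


if __name__ == "__main__":
    # Hypothetical fractions of runtime spent in the pair style (illustrative only).
    for fraction in (0.3, 0.6, 0.9):
        speedup = overall_speedup(fraction, kernel_speedup=10.0)
        print(f"{fraction:.0%} of runtime offloaded -> {speedup:.2f}x overall")
```

If only a small share of the time goes into the pair style, the overall gain stays close to 1x no matter how fast the GPU kernel is.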
Thank you very much Stefan for the reply.
Here I have attached my input script. Could you please have a look and give some suggestions?
newton off
variable T equal 298.0
variable V equal 1e-6
variable Fn equal -0.157674782e-4 # -0.157674782e-4 is 1 atm
variable K equal 1.0e-2
variable DT equal 1.0
timestep ${DT}
thermo 10000
units real
atom_style full
dimension 3
boundary p p f
#Non-bonded energy
pair_style lj/cut/coul/long 10.0
kspace_style pppm 1.0e-4
kspace_modify slab 3.0
pair_modify shift yes mix geometric
#Bond stretching force constants
bond_style harmonic
#Bond angle bending force constants
angle_style harmonic
#Torsional Rotation
dihedral_style opls
improper_style none
read_data system_mica_96.data
group mol type 1 2 3 4 5 6 7 8 9 10
group vector type 10 3
group head type 10
group stage type 19
group top molecule <> 97 136
group bot molecule <> 137 176
dump 1 all xyz 100000 all.xyz
dump_modify 1 sort id element C C C C C C C H H N K Si Al Al O O O H Fe
neighbor 1.0 bin
neigh_modify every 5 delay 0 check yes
velocity mol create $T {seed} mom yes rot yes dist gaussian
velocity top set 0 0 0 units box
velocity bot set 0 0 0 units box
velocity stage set 0 0 0 units box
fix 1 bot setforce 0 0 0
fix 2 top setforce 0 0 0
fix 3 stage setforce 0 0 0
minimize 1.0e-4 1.0e-4 10000 100000
fix NVT mol nvt temp $T $T 100.0
restart 500000 restart_before_steady_state.*.friction
run 10000
unfix 2
fix 4 top aveforce NULL NULL ${Fn}
fix 5 top move linear $V 0 NULL units box
run 1000000
unfix 1
unfix 3
variable Rx equal xcm(bot,x)
variable Vx equal vcm(bot,x)
variable MM equal mass(bot)
variable Sx equal xcm(stage,x)
variable Fx equal v_K*(v_Sx-v_Rx)
fix 6 bot move linear NULL 0 0 units box
fix 7 stage move linear 0 0 0 units box
fix 8 bot aveforce v_Fx NULL NULL
unfix NVT
fix Langevin mol langevin $T $T 100 {seed}
compute newT mol temp/partial 0 1 0
fix_modify Langevin temp newT
fix NVE mol nve
thermo_style custom step c_newT pe
#restart 500000 restart_before_steady_state.*.friction
#run 2000000
variable Ff equal fcm(bot,x)/count(bot)-v_Fx
variable topZ equal xcm(top,z)
#fix AvgForce all ave/time 100 1 100 v_F_string file force_friction.txt mode scalar ave one
fix MassCenter all ave/time 100 1 100 v_Fx v_Ff v_Rx v_Vx v_topZ file mass_center.txt mode scalar ave one
compute MSD head msd
fix MSD all ave/time 100 1 100 c_MSD[1] c_MSD[2] c_MSD[3] c_MSD[4] file mol_msd.txt mode scalar ave one
dump 2 vector custom 100 dump.lammpstrj id mol type xu yu zu
dump_modify 2 sort id format "%5d %5d %5d %10lf %10lf %10lf"
restart 1000000 restart_at_steady_state.*.friction
run 40000000
clear
With best regards,
Pritam
the LAMMPS manual has a complete section with advice on how to use
accelerators and what determines their efficient use.
http://lammps.sandia.gov/doc/Section_accelerate.html
you most certainly won't get much help by just barfing a convoluted
and cluttered input file in people's faces.
there is a lot of *crucial* information missing here:
- what GPU do you have?
- what kind of hardware are you running on (desktop, HPC cluster,
laptop, what CPU, shared or exclusive use)?
- how did you compile the GPU library?
- how did you run your job? how many MPI tasks per GPU?
- how large is your system (i.e. how many atoms)?
- what LAMMPS version do you use?
- how does your GPU machine stack up against the available benchmarks
with the provided benchmark inputs and the data from here:
http://lammps.sandia.gov/bench.html
axel.
Dear Axel,
Sorry, I should have given all the details and not only the input script.
Here are the details of the GPU run. Any comments would be very helpful.
In the CSC (Finland) GPU cluster, 38 nodes have two NVIDIA Tesla K40 GPUs each. Each of these compute nodes hosts two 6-core Intel Xeon E5-2620-v2 CPUs. Another 12 nodes have two NVIDIA Tesla K80 GPU accelerator cards each. Each of those nodes is equipped with two 12-core Intel Xeon processors (E5-2680).
I built the GPU library with the command
make -f Makefile.linux.double
where Makefile.linux.double (taken from lammps-30Jul16/lib/gpu) contains the following three settings:
CUDA_HOME = /appl/opt/cuda/7.5
CUDA_ARCH = -arch=sm_35
CUDA_PRECISION = -D_SINGLE_DOUBLE
and I made the necessary changes in Makefile.lammps.standard.
LAMMPS was compiled with the mpicxx compiler, using the modules openmpi/1.10.2 gcc/4.9.3 cuda/7.5 StdEnv git/1.9.2
Before compiling LAMMPS, I installed the gpu and kspace packages.
I ran the job with
srun --gres=gpu:1 lmp_mpi -sf gpu -in script.in
I asked for 1 CPU node and 1 GPU node.
The LAMMPS version used is a recent one (lammps-30Jul16).
With best regards,
Pritam
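Since the question of MPI tasks per GPU is still open, here is a small Python sketch that assembles a launch line with several MPI tasks sharing each GPU. It only builds the command string and does not run anything. The task and GPU counts are assumptions based on the K40 node description above (12 cores, 2 GPUs), build_launch_command is a hypothetical helper, and the scheduler options a given cluster needs may differ. The flags -sf gpu and -pk gpu N are standard LAMMPS command-line switches, and --ntasks and --gres are standard srun options.

```python
import shlex
from typing import List


def build_launch_command(mpi_tasks: int, gpus_per_node: int,
                         executable: str = "lmp_mpi",
                         input_file: str = "script.in") -> List[str]:
    """Assemble an srun command line for a single-node LAMMPS GPU-package run."""
    if gpus_per_node < 1 or mpi_tasks < gpus_per_node:
        raise ValueError("need at least one MPI task per GPU")
    if mpi_tasks % gpus_per_node != 0:
        # Assumption of this sketch: split the MPI tasks evenly across the GPUs.
        raise ValueError("mpi_tasks should be a multiple of gpus_per_node")
    return [
        "srun",
        f"--ntasks={mpi_tasks}",          # MPI tasks on the node
        f"--gres=gpu:{gpus_per_node}",    # GPUs requested from the scheduler
        executable,
        "-sf", "gpu",                     # use gpu-suffixed styles where available
        "-pk", "gpu", str(gpus_per_node), # tell the GPU package how many GPUs to use
        "-in", input_file,
    ]


if __name__ == "__main__":
    # Assumed layout: one K40 node with 12 cores and 2 GPUs, i.e. 6 tasks per GPU.
    command = build_launch_command(mpi_tasks=12, gpus_per_node=2)
    print(shlex.join(command))
    print("MPI tasks per GPU:", 12 // 2)
```

Whether more tasks per GPU actually helps depends on the system size and on how much of the runtime is spent in the accelerated styles, so it is worth timing a few short runs with different counts.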
no you haven't.
axel.
One more piece of information: the number of atoms is 17377. The system has 96 liquid crystal molecules with 41 atoms each, and the rest of the atoms belong to the mica surface.
With best regards,
Pritam
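For reference, the atom counts quoted above work out as follows. This is plain arithmetic on the numbers in the message, taking at face value that every atom outside the liquid crystal molecules belongs to the mica.

```python
TOTAL_ATOMS = 17377        # total atom count quoted in the thread
N_MOLECULES = 96           # liquid crystal molecules
ATOMS_PER_MOLECULE = 41    # atoms in each liquid crystal molecule

liquid_crystal_atoms = N_MOLECULES * ATOMS_PER_MOLECULE
# Assumption taken from the message: everything else is mica.
mica_atoms = TOTAL_ATOMS - liquid_crystal_atoms
mobile_fraction = liquid_crystal_atoms / TOTAL_ATOMS

if __name__ == "__main__":
    print("liquid crystal atoms:", liquid_crystal_atoms)
    print("mica atoms:", mica_atoms)
    print(f"liquid crystal share of all atoms: {mobile_fraction:.1%}")
```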