issue running lennard/mdf when using gpu acceleration for other pair_style commands

Dear All,

I am trying to run some simulations on gpu accelerated nodes (NVIDIA Tesla K20X with 6 GB GDDR5 memory) using the mdf class of potentials. The lammps version I am using is 10Mar17.

When I use the mdf pair_style (it does not matter whether it is the buck, lennard, or lj type), the simulation fails after outputting the energy at step 0, without any error messages in log.lammps. The last lines of my output file look like:

Rank 20 [Sun Nov 19 15:05:23 2017] [c0-1c1s1n0] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 20
srun: error: nid01988: task 20: Aborted
srun: Terminating job step 4576432.0
slurmstepd: error: *** STEP 4576432.0 ON nid01987 CANCELLED AT 2017-11-19T15:05:23 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
Initializing Device 0 on core 11…srun: error: nid01992: tasks 60-68,70-71: Killed
srun: error: nid01990: tasks 36-47: Killed
srun: error: nid01994: tasks 84-95: Killed
srun: error: nid01993: tasks 72-83: Killed
srun: error: nid01988: tasks 12-19,21-23: Killed
srun: error: nid01989: tasks 24-35: Killed
srun: error: nid01992: task 69: Killed
srun: error: nid01987: tasks 0-11: Killed
srun: error: nid01991: tasks 48-59: Killed

When I run the simulation without GPU acceleration, it runs without any issues.

I am not sure what the error could be. Does anyone have any suggestions?

Kind regards,

Riccardo

the output you posted is from your queuing system, except for the first line,
so it is not useful at all. consult with your local admin staff to learn
how to find the output that went to the screen.
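if your cluster uses Slurm, the screen output usually ends up in a file controlled by the batch script. a minimal sketch (the script fragment, binary name, and input file name below are made up, not your actual setup):

```shell
#!/bin/bash
# hypothetical Slurm batch script fragment: without --output/--error,
# Slurm writes the combined screen output to slurm-<jobid>.out; with
# them, the LAMMPS screen output (including any ERROR line printed
# before MPI_Abort) lands in the named files instead
#SBATCH --output=lammps-%j.out
#SBATCH --error=lammps-%j.err

srun lmp_gpu -sf gpu -in in.script
```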

also, trying to run mdf pair styles on the GPU is a pointless exercise,
since those styles are not GPU accelerated, as is clearly evident from the
LAMMPS manual.
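for reference, mixing accelerated and non-accelerated styles does work: the GPU-capable sub-styles of a hybrid/overlay carry the /gpu suffix, while the mdf styles stay plain and run on the CPU. a hypothetical input fragment (all style arguments, cutoffs, and tolerances here are made up, not taken from your input):

```
# hypothetical LAMMPS input fragment; cutoffs and tolerances are invented
package      gpu 1
pair_style   hybrid/overlay coul/long/gpu 12.0 lj/cut/gpu 12.0 &
             buck/gpu 12.0 lennard/mdf 10.0 12.0
kspace_style pppm/gpu 1.0e-4
```

this is also what the -sf gpu command-line switch does automatically: it appends /gpu only to styles that have a GPU variant.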

axel.

Dear Axel,

Thank you for the reply.

I was not trying to accelerate those styles (mdf) on the GPU, but rather the other ones present in my force field file (e.g. pppm, coul/long…).

What part of the output could help identify the problem?

Kind regards,

Riccardo

wherever the error messages are captured. when LAMMPS calls MPI_Abort(),
it does so only after printing an error message stating why it stopped.

axel.

Dear Axel,

But in this case it does not seem to print an error message.

These are my last lines in log.lammps:

Neighbor list info …
update every 1 steps, delay 10 steps, check yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 12
ghost atom cutoff = 12
binsize = 6, bins = 25 25 25
7 neighbor lists, perpetual/occasional/extra = 7 0 0
(1) pair coul/long/gpu, perpetual, skip from (6)
attributes: full, newton off
pair build: skip
stencil: none
bin: none
(2) pair lj/cut/gpu, perpetual, skip from (6)
attributes: full, newton off
pair build: skip
stencil: none
bin: none
(3) pair lj/cut/coul/long/gpu, perpetual, skip from (6)
attributes: full, newton off
pair build: skip
stencil: none
bin: none
(4) pair buck/gpu, perpetual, skip from (6)
attributes: full, newton off
pair build: skip
stencil: none
bin: none
(5) pair lennard/mdf, perpetual, skip from (7)
attributes: half, newton off
pair build: skip
stencil: none
bin: none
(6) neighbor class addition, perpetual
attributes: full, newton off
pair build: full/bin
stencil: full/bin/3d
bin: standard
(7) neighbor class addition, perpetual, half/full from (6)
attributes: half, newton off
pair build: halffull/newtoff
stencil: none
bin: none
WARNING: Inconsistent image flags (…/domain.cpp:785)
Memory usage per processor = 84.5436 Mbytes
Step Time PotEng Temp Press Volume Pxx Pyy Pzz Cella Cellb Cellc CellAlpha CellBeta CellGamma CPU
0 0 -45205.24 300 -297.94047 3167414.5 -130.57545 -340.03169 -423.21427 146.85936 146.85936 146.85936 90 90 90 0

and then the program just calls MPI_Abort(). There does not seem to be any indication of where the error is.

Kind regards,

Riccardo

unless you turned them off, batch systems usually capture the standard and error output of the submitted scripts. there must be an error message somewhere in those.
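once you have found the captured file, the abort reason is easy to spot, because LAMMPS prefixes it with "ERROR". an illustration with a made-up file and a made-up error message (this is not your actual output):

```shell
# simulate a captured batch-output file and locate the LAMMPS abort
# reason in it; the file name and the message are invented for the example
cat > slurm-demo.out <<'EOF'
LAMMPS (10 Mar 2017)
ERROR: Out of memory on GPU (hypothetical message)
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 20
EOF

# LAMMPS prints its abort reason on a line starting with "ERROR", so:
grep '^ERROR' slurm-demo.out
```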

but there are a few things in your log file that don’t make much sense to me.
have you been able to run any of the benchmark examples correctly on the GPUs?
what exactly are the command line and the input for your simulation?