I am simulating a droplet of NaCl solution on a gold surface, with roughly 9000 fluid atoms and 50000 gold atoms. A smaller version of the simulation with 1100 atoms ran without any out-of-memory issues, so I am trying to work out why the larger one fails. The Slurm error message:
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3158334.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: cmp277: task 2: Out Of Memory
From the log file, the maximum memory usage per MPI rank is:
Per MPI rank memory allocation (min/avg/max) = 11.2 | 11.91 | 14 Mbytes
which suggests the run is not memory-limited. I am using the stable version of LAMMPS (2 Aug 2023 - Update 2) and ran the simulation on 32 cores with 4 GB mem-per-cpu.
My initial thought is that the ave/chunk and ave/time fixes, in addition to the nvt and nve fixes, are causing this. But as I am still a beginner in LAMMPS and have yet to fully understand the C++ source code, I am not 100% sure.
Has anyone else faced an out-of-memory error in such a case? And if so, how does one go about debugging it?
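One rough way to test this suspicion is to estimate how many chunks the bin/3d grid creates, since per-chunk storage grows with the number of bins rather than with the number of atoms. The sketch below is a back-of-the-envelope calculation in Python, not something taken from LAMMPS. The box lengths and the number of values stored per bin are hypothetical placeholders (neither is given in this post), and it assumes the 0.3 bin width is in the same length units as the box.

```python
import math

def estimate_bins(box_lengths, bin_width):
    """Approximate number of 3D bins needed to cover a box.

    box_lengths and bin_width must be in the same length units.
    """
    return math.prod(math.ceil(length / bin_width) for length in box_lengths)

def estimate_mbytes(n_bins, doubles_per_bin):
    """Memory in megabytes for n_bins entries of doubles_per_bin 8-byte values."""
    return n_bins * doubles_per_bin * 8 / 1e6

# Hypothetical box of 100 x 100 x 100 length units (NOT the real box of this run).
box = (100.0, 100.0, 100.0)
n_bins = estimate_bins(box, 0.3)

# doubles_per_bin is an assumed placeholder; the true per-chunk storage used by
# the chunk computes and ave/chunk fixes is not known from this post.
for doubles_per_bin in (1, 4, 16):
    mbytes = estimate_mbytes(n_bins, doubles_per_bin)
    print(f"{n_bins} bins x {doubles_per_bin} doubles = {mbytes:.1f} MB per array")
```

If the bin count for the real box comes out in the tens of millions, several per-chunk arrays on each of the 32 ranks could plausibly approach the 4 GB per-CPU limit, which would point at the bin width rather than the atom count.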
Relevant lines of input script:
## This is where the simulation runs out-of-memory ##
print ""
print "PRODUCTION RUN"
print ""
# Time-averaged calculations
# chunk/atom helps get the space-averaged property values
compute cc1 FLUID chunk/atom bin/3d x lower 0.3 y lower 0.3 z lower 0.3
compute cc2 FLUID com/chunk cc1
# Use the computes defined earlier to get the time-averaged center of mass (to check for droplet drift) and the space-averaged number and mass density of FLUID atoms
fix at_com FLUID ave/time 1000 10 10000 c_cc2[*] file outputs/data/com.dat mode vector
fix ac_dn FLUID ave/chunk 1000 10 10000 cc1 density/number ave running file outputs/data/density_number_FLUID(chunk).dat overwrite
fix ac_dm FLUID ave/chunk 1000 10 10000 cc1 density/mass ave running file outputs/data/density_mass_FLUID(chunk).dat overwrite
# Record time-avg FLUID temperature, to analyze fluctuations
compute cc3 FLUID temp
fix at_fluid_temp FLUID ave/time 1000 10 10000 c_cc3 file outputs/data/temp_FLUID.dat
timestep 1
fix nvt1 WALL nvt temp ${Temp} ${Temp} $(100*dt)
fix nve1 FLUID nve
run ${N_prod}
write_data outputs/out.lammpsdata pair ij
write_restart outputs/out.restartlammps