fix balance rcb and kokkos gpu

Dear all,

I have an orthogonal simulation box (50 x 50 x 500 Angstroms) that has atoms at both ends along the z direction, while the middle volume is empty, as shown below. I am investigating the decomposition behavior of the target material by applying heat via the fix heat command, so the middle section of the box fills up during the simulation.

When I test my simulation on my desktop (i7-6700, 4 cores) I get 1.3 timesteps/second with "fix balance all balance ${balanceupt} 1.1 rcb". However, if I run the same script on the cluster, my job hangs when using 8 or more cores (LAMMPS 3 May 2020 version with OpenMPI 4.0).

I searched the mailing list for why a simulation hangs when using fix balance rcb and found the same questions regarding this issue, but could not find an answer.

Since I am using the ReaxFF potential, I decided to use the KOKKOS package with CUDA.
My question is: should I still use the fix balance command or something similar? When I test my script with the fix balance command, performance drops, since my script contains other fixes that KOKKOS does not support. I tested my script with all fixes and dumps removed on 4 GPUs and 10 cores and got 30 timesteps/second, which is fine. With the fixes and dumps I got 4 timesteps/second. By varying the number of GPUs from 1 to 4 I found that performance stays the same regardless of GPU count, so I conclude the CPU is the bottleneck. It may be because of load imbalance across the cores. Is there any way to increase performance using KOKKOS/CUDA for my setup?

A simplified script is attached below. I cannot attach the data file due to the size limit.

Best Regards
Garip

[snapshot of the simulation box attached as image]

units real
dimension 3
boundary p p f
atom_style charge

variable wobblex equal normal(0,5,1000)
variable wobbley equal normal(0,5,1000)
variable z equal ramp(0,-35)
variable numrun equal 100
variable num_dump equal 10
variable balanceupt equal 1
variable a_heat equal 20

read_data equilib.data

region heat cylinder z 25 25 8 45 150 side in move v_wobblex v_wobbley v_z units box

region tim block INF INF INF INF INF 8

group target region target

group tim dynamic all region tim every ${gupd}
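# note: the region 'target' (referenced by the group above) and the variable 'gupd'
# are defined in the full script and were omitted from this simplified version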

pair_style reax/c NULL checkqeq no
pair_coeff * * ffield.reax.ni B B N N N N Ni
compute reax all pair reax/c
compute ent1 all entropy/atom 0.25 5
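# assumed: the dump below references c_ke and c_pe, so the simplified script
# presumably also omitted these standard per-atom computes
compute ke all ke/atom
compute pe all pe/atom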

#neighbor 10.0 bin
#neigh_modify every 1 delay 10 check yes page 100000 cluster yes binsize 7
#comm_modify mode single cutoff 22.0 vel yes
#comm_style tiled
#fix balance all balance ${balanceupt} 1.1 rcb

velocity all create 300 23432 rot yes dist gaussian

fix zwalls all wall/reflect zlo EDGE zhi EDGE
fix 1 all nve
fix 7a target heat ${num_dump} v_a_heat region heat
fix 3 tim temp/berendsen 300 300 ${num_dump}

fix 2d all reax/c/species 1 10 ${num_dump} species.out element Bx By Nc Ny Nx Np Ni

dump 1 all custom ${num_dump} dump.atom* id type x y z vx vy vz fx fy fz q c_ke c_pe c_ent1

thermo 100

timestep 0.1
run ${numrun}

there really is no need to use tiled communication and rcb balancing for your system.

the first step in improving load balancing should always be to use the processors command.
by default the processor grid is created assuming homogeneous density, which is not the case here.
with few cores, using processors 1 1 * is probably best (in combination with load balancing), but when increasing the number of cores, you may want to test processors 1 2 * and processors 2 2 *
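for a 50x50x500 box that would look like this in the input (the grid counts here are starting points to benchmark, not tuned values):

processors 1 1 *    # keep x and y undivided; only split subdomains along the long z axis

this keeps every subdomain spanning the full (small) cross section, so the balancing problem reduces to one dimension.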

the next step would be to just use a couple of static rebalances with the balance command.
with your geometry, you would really only need to adjust the sizes of the subdomains in the z direction.
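a minimal sketch of such a static rebalance (the threshold and iteration count are placeholder values to experiment with):

balance 1.1 shift z 10 1.05    # shift the z splits up to 10 times, stopping early once the imbalance factor drops below 1.05

repeated between run commands, this can track the material slowly filling the empty region.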

the biggest problem i see with your simulation setup vs Kokkos/CUDA is that you are mixing KOKKOS fixes with non-KOKKOS fixes and thus requiring excessive data transfer between CPU and GPU.
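for the script above that means, as far as i can tell from the docs of that version (worth double-checking):

# have KOKKOS (/kk) variants and can stay on the device with -sf kk:
#   pair_style reax/c, fix nve, fix wall/reflect
# no /kk variant (each forces device<->host data transfers whenever it is invoked):
#   fix heat, fix temp/berendsen, fix reax/c/species, compute entropy/atom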

axel.


Can you post your mpirun command with all the args? The issue with using comm_style tiled is that it isn’t Kokkos-enabled yet, so it just copies all of the data to the host CPU to pack/unpack comm buffers. This, in addition to using non-Kokkos fixes/computes, can lead to significant overhead transferring the data back and forth between GPU and CPU. As Axel said, it would be better to use static load balancing or fix balance with the “shift” option, without “comm_style tiled”.
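A minimal sketch of that variant, with placeholder values for the frequency and thresholds:

fix bal all balance 1000 1.1 shift z 10 1.05    # every 1000 steps, re-shift the z splits if the imbalance exceeds 1.1

Unlike rcb, the shift style works with the default brick communication, so no comm_style tiled is required.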

Stan


Dear Axel and Stan,

Thank you for your replies to my question. After I had problems with fix balance rcb in CPU-only MPI mode, I decided to use the KOKKOS CUDA package, because I had seen the recommendation on the mailing list that using fewer MPI ranks per GPU is better. So I used KOKKOS/CUDA without "comm_style tiled". I used fix balance rcb and comm_style tiled together with the KOKKOS package only once; after I saw the results I deleted those commands from the script. I then ran several tests without fix balance and comm_style tiled before writing here. In those tests I saw that if I add the upper layer to the simulation box, performance drops significantly. I thought the higher number of atoms was the cause, so I used more GPUs to increase performance, but that gave the same result (tests number 10 and 13). So I thought it might be a load balance issue.

I have done some tests on the cluster with different settings after your recommendations. I wish I could have done more, but my test reservation has ended. I have attached a table of my test results below. If I delete the upper layer, without any balance or processors command, I get 17 to 18 timesteps/second with about 13000 atoms using 2 GPUs, even with the non-KOKKOS fixes. If I run the simulation without fixes and dumps, I get 22-23 timesteps/second. The balance command and the fix balance command both resulted in an error exit when using KOKKOS/CUDA. The LAMMPS info output and error message are added below.

Here is the full mpirun command

mpirun --mca btl ^tcp /mylammps/kokkos-gpu/lmp -k on g # -sf kk -pk kokkos newton on neigh full comm device -in input
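(Here # stands for the number of GPUs; for the 4-GPU tests above the command would presumably read:

mpirun --mca btl ^tcp /mylammps/kokkos-gpu/lmp -k on g 4 -sf kk -pk kokkos newton on neigh full comm device -in input )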

Best Regards
Garip

what(): Kokkos::TeamPolicy< Cuda > the team size is too large. Team size x vector length must be smaller than 1024.
Traceback functionality not available

[akya14:90248] *** Process received signal ***
[akya14:90248] Signal: Aborted (6)
[akya14:90248] Signal code: (-6)

[test result tables attached as images]

LAMMPS (3 Mar 2020)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:94)
using 1 OpenMP thread(s) per MPI task

Info-Info-Info-Info-Info-Info-Info-Info-Info-Info-Info
Printed on Fri Jun 12 13:39:03 2020

LAMMPS version: 3 Mar 2020 / 20200303
Git info: / /

OS information: Linux 3.10.0-514.6.1.el7.x86_64 on x86_64

sizeof(smallint): 32-bit
sizeof(imageint): 32-bit
sizeof(tagint): 32-bit
sizeof(bigint): 64-bit

Compiler: GNU C++ 7.0.1 20170326 (experimental) with OpenMP 4.5
C++ standard: C++11

Active compile time flags:

-DLAMMPS_GZIP
-DLAMMPS_PNG
-DLAMMPS_JPEG
-DLAMMPS_SMALLBIG

Installed packages:

KOKKOS MANYBODY MPIIO QEQ USER-MISC USER-OMP USER-REAXC

Info-Info-Info-Info-Info-Info-Info-Info-Info-Info-Info

Total wall time: 0:00:00
