I tried your input on my box with a single MPI process.
Without GPU acceleration, the host uses about 600 MB of RAM for 407720 atoms.
With the GPU package, you can expect a little more than twice that, because full neighbor lists are used (a full neighbor list is twice as big as a half list).
while your simulation is running indicates that 1478 MB are in use.
I did notice that your input script produces dangerous builds of the neighbor list. You might want to look at the documentation for neigh_modify.
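
As a rough sketch of what to try (the values below are my assumption of a conservative starting point, not something taken from your script), you can let LAMMPS check for reneighboring on every step, so that a rebuild is never postponed past the point where atoms have moved too far:

```
# Conservative neighbor list settings (assumed values, adjust for your system):
# delay 0   -> do not wait any steps after the last build before checking again
# every 1   -> consider rebuilding on every timestep
# check yes -> only rebuild when some atom has moved more than half the skin distance
neigh_modify delay 0 every 1 check yes
```

If the count of dangerous builds in the log output drops to 0 with these settings, you can relax delay or every again to reduce the cost of neighbor list builds.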