Slow performance with 2 concurrent jobs on a node

Hi everyone,

We are running 12-core LAMMPS jobs on a 24-core machine. When we run only a single 12-core job, it performs well with 100% CPU utilization most of the time. When we run 2x 12-core instances, using the entire machine, the CPU utilization drops to 50% for all cores.

I would expect some slowdown due to the use of shared resources, such as memory bandwidth, but a 50% slowdown makes me think that these instances somehow interfere with each other. I was wondering, have you ever had a similar experience? Do multiple LAMMPS instances use a shared resource (a generically named lock file for synchronization, perhaps) that would cause a slowdown?

Interestingly, this observation conflicts with slide 11 of this study: http://www.hpcadvisorycouncil.com/pdf/LAMMPS_Analysis_and_Profiling_Intel.pdf

Thank you very much in advance for your time!

-Mehmet

mehmet,

Hi everyone,
We are running 12-core lammps on a 24-core machine. When we run only a

exactly _what_ hardware do you have? are those 24 real cores?
or 12 cores with hyperthreading? a 2-way node with 12-core processors,
or a 4-way machine with 6-core processors?

single 12-core job, it performs well with 100% CPU utilization most of the
time. When we run 2x 12-core instances, using the entire machine, the CPU
utilization drops to 50% for *all* cores.

how do you determine cpu utilization?
are you using memory/processor affinity?

I would expect some slowdown due to use of shared resources, such as the
memory bandwidth, but 50% slowdown makes me think that these instances
somehow interfere with each other. I was wondering, have you ever had a
similar experience? Do multiple lammps instances use a shared resource (a
generic-name lock file for synchronization, perhaps) that would cause
slowdown?

it depends on _how_ you share the resource. you have to provide
more details about exactly what you were running.

Interestingly, this observation conflicts with slide 11 of this
study: http://www.hpcadvisorycouncil.com/pdf/LAMMPS_Analysis_and_Profiling_Intel.pdf

these benchmarks look a bit lopsided to me.
due to the way the kspace code scales
you will always be more efficient running
with fewer MPI tasks. however, they don't show
how the performance is with 12 cores/node,
but using only half the nodes.

this is already the second benchmark document from
that website that recommends "best practices" and
efficiency considerations that are orthogonal to what
i can measure on _very_ similar hardware.
you have to keep in mind: there are lies, damn lies,
and benchmarks. :wink:

cheers,
    axel.

Axel,

Thanks so much for your fast reply!

exactly what hardware do you have. are those 24 real cores?
or 12 cores with hyperthreading? a 2-way node with 12-core processors,
or a 4-way machine with 6-core processors?

These are 4-way 6-core (AMD Opteron 8431) machines.

how do you determine cpu utilization?
are you using memory/processor affinity?

I monitor the ‘top’ output. I am not using memory/processor affinity (but I am very interested in trying it). I know how to use affinity in my own C codes using the hwloc library, but have no idea if LAMMPS allows for different mappings. This is particularly important given that our machines are NUMA.

it depends on how you share the resource. you have to provide
more details about exactly what you were running.

We use PBS, which makes sure that the node is entirely allocated to the user. I compiled LAMMPS using PGI compilers and MVAPICH, which provided the best performance for us (as opposed to GNU and Intel). I am comparing two identical runs, so the input is not a factor. Any other details you would like to know?

these benchmarks look a bit lopsided to me.
due to the way the kspace code scales
you will always be more efficient running
with fewer MPI tasks. however, they don’t show
how the performance is with 12 cores/node,
but using only half the nodes.

Yes, that’s why I am very interested in others’ experiences on similar 24-core hardware…

this is already the second benchmark document from
that website that recommends “best practices” and
efficiency considerations that are orthogonal to what
i can measure on very similar hardware.
you have to keep in mind: there are lies, damn lies,
and benchmarks. :wink:

I hear you! :slight_smile:

Thanks a lot again!
-Mehmet

Axel,
Thanks so much for your fast reply!

exactly _what_ hardware do you have? are those 24 real cores?
or 12 cores with hyperthreading? a 2-way node with 12-core processors,
or a 4-way machine with 6-core processors?

These are 4-way 6-core (AMD Opteron 8431) machines.

ok. so you run entirely inside the machine, i.e.
not trying to run across multiple nodes?

your mention of MVAPICH hints at the latter.
in that case you may be overloading the communication
layer. i've seen similar behavior on 4-way opteron 61xx 12-core nodes.

how do you determine cpu utilization?
are you using memory/processor affinity?

I monitor the 'top' output. I am not using memory/processor affinity (but

if you use top, hit the '1' key to get the per-processor summary.
that should give you some insight into whether the lammps tasks are
actually equally scattered across the processors or if only half
of the processors are busy. the latter may be a hint that processor
affinity is already active. it would also be helpful to see whether
the processors that are not computing (i.e. not in "user" state)
are in "idle", "system", or "wait" state.
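a quick sketch of checking the same thing from the shell, assuming a Linux node (the process name `lmp` is a placeholder for whatever your LAMMPS binary is actually called):

```shell
# which core (the PSR column) each task last ran on, and how busy it is;
# "lmp" is a placeholder for the actual LAMMPS binary name
# (this prints nothing if no such job is running)
ps -eLo pid,psr,pcpu,state,comm | grep lmp || true

# the list of cores a process is allowed to run on (its affinity mask);
# /proc/self shows the current shell; substitute the PID of a LAMMPS task
grep Cpus_allowed_list /proc/self/status
```

if the `Cpus_allowed_list` of both jobs shows the same half of the cores, affinity is already active and the two jobs are stacked.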

very interested in trying it). I know how to use affinity for my own C codes
using hwloc library, but have no idea if lammps allows for different
mappings. This is particularly important given that our machines are NUMA.

this can be done with OpenMPI. probably the best way to get the proper
processor placement would be to use the LAMMPS internal partitioning
rather than running two lammps instances.
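a sketch of that suggestion, assuming Open MPI's core binding and the LAMMPS `-partition` switch (the binary name, input file, and the exact binding flag are assumptions; older Open MPI versions spell the flag `--bind-to-core`):

```shell
# one 24-task MPI job, split by LAMMPS into two 12-task partitions,
# with Open MPI binding each rank to its own core
mpirun -np 24 --bind-to core lmp -partition 2x12 -in in.lj
```

with `-partition`, each partition writes its own log (log.lammps.0, log.lammps.1, ...), so the two calculations stay separate while MPI handles the placement of all 24 ranks consistently.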

[...]

these benchmarks look a bit lopsided to me.
due to the way the kspace code scales
you will always be more efficient running
with fewer MPI tasks. however, they don't show
how the performance is with 12 cores/node,
but using only half the nodes.

Yes, that's why I am very interested in others' experiences on similar
24-core hardware...

are you running one of the standard lammps benchmarks?
if yes, which one?

i'd like to double check this on our 48-core hardware
with QDR infiniband.

cheers,
    axel.

Are you sure there are no OS/system processes running on any of the 24 cores and competing
with LAMMPS? I get poor performance on a dual quad-core Linux box if
I run on all 8 cores, b/c the OS also effectively needs a core.

Steve

Steve,

We make sure the OS (or any other stuff, like sys daemons) does not consume more than 3-4% on one core (this is a compute node with almost all services turned off). Also, the slowdown affects all cores in the same way. This very much looks like a conflict over a shared resource used by two concurrent LAMMPS instances… Is LAMMPS thread safe?

Thanks!
-Mehmet

Steve,
We make sure the OS (or any other stuff, like sys daemons) does not consume more
than 3-4% on one core (this is a compute node with almost all services
turned off). Also, the slowdown affects all cores in the same way. This very
much looks like a conflict over a shared resource used by two
concurrent LAMMPS instances... Is LAMMPS thread safe?

no, but what has thread safety to do with that?
different lammps instances are different processes.

axel.

just to put some numbers to this discussion.

this is running lammps with the in.lj benchmark setting x, y, and z to 4
and running on a machine with 4 Opteron 6174 (2.2 GHz) processors:

a single run on an otherwise idle machine w/o processor affinity:
log.1x-1:Loop time of 29.6834 on 12 procs for 100 steps with 2048000 atoms

a single run on an otherwise idle machine w/ processor affinity:
log.1x-2:Loop time of 26.9721 on 12 procs for 100 steps with 2048000 atoms

both runs at the same time. first w/o affinity, the second with.
log.2x-1:Loop time of 29.6888 on 12 procs for 100 steps with 2048000 atoms
log.2x-2:Loop time of 28.549 on 12 procs for 100 steps with 2048000 atoms

so i don't see any slowdown from having multiple calculations on the same
machine when no processor affinity is used.

however, when processor affinity is set up to tie both calculations to the
_same_ processors, there is a massive slowdown (not unexpectedly).
log.3x-1:Loop time of 116.62 on 12 procs for 100 steps with 2048000 atoms
log.3x-2:Loop time of 117.511 on 12 procs for 100 steps with 2048000 atoms
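for reference, a hedged sketch of how the disjoint vs. overlapping placements above could be set up with `taskset` (binary and input names are placeholders, and this assumes the MPI library does not override the inherited affinity mask; with Open MPI you would normally use its own binding flags instead):

```shell
# contended case: both 12-task jobs restricted to the same cores 0-11
taskset -c 0-11 mpirun -np 12 lmp -in in.lj -log log.3x-1 &
taskset -c 0-11 mpirun -np 12 lmp -in in.lj -log log.3x-2 &
wait

# disjoint case: each job gets its own 12 cores
taskset -c 0-11  mpirun -np 12 lmp -in in.lj -log log.2x-1 &
taskset -c 12-23 mpirun -np 12 lmp -in in.lj -log log.2x-2 &
wait
```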

i am using OpenMPI, not MVAPICH, though.

axel.

however, when processor affinity is set up to tie both calculations to the
_same_ processors, there is a massive slowdown (not unexpectedly).

This is a key point. More generally, you want to ensure the OS does
not attempt to migrate processes to different processors during a run.
There is typically some setting to prevent this. This can have a big
negative performance impact on clusters, where many jobs are running.
You want one LAMMPS/MPI process per physical core for the duration
of a run.

Steve

Thank you everyone for your answers! My plan now is to recompile LAMMPS with OpenMPI/PGI and use a rank file for affinity. I will keep the list updated :slight_smile:

-Mehmet

And… here’s some good news and information for other users potentially suffering from the same problem :slight_smile:

Recap: my problem was two concurrent 12-core LAMMPS jobs running 50% slower than a single instance on a 24-core machine.

Turns out MVAPICH was implementing core affinity by default, which always allocated the 12 processes of both instances to cores 0->11, regardless of their runtime load. Alternatively, MVAPICH can implement any given affinity using “VIADEV_CPU_MAPPING=12,13,14…”, which could potentially solve this problem. In my case I chose to completely turn off affinity, using “MV2_ENABLE_AFFINITY=0”, and leave the load balancing to the OS altogether, which does a good job at that. The reason is, I am submitting many LAMMPS jobs to a PBS queue, and it is possible that two jobs with the same affinity settings could land on the same node.
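in script form, the two MVAPICH2 options would look roughly like this (a sketch; the binary and input names are placeholders, and the exact variable names depend on the MVAPICH generation in use):

```shell
# option 1: turn MVAPICH2's built-in core binding off entirely and
# let the OS scheduler place the ranks
export MV2_ENABLE_AFFINITY=0
mpirun -np 12 lmp -in in.lj

# option 2: keep binding, but give each concurrent job an explicit,
# disjoint core list (a second job would use 12:13:...:23)
MV2_ENABLE_AFFINITY=1 MV2_CPU_MAPPING=0:1:2:3:4:5:6:7:8:9:10:11 \
    mpirun -np 12 lmp -in in.lj
```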

If you are using OpenMPI, you could do that using a rank file. Try hwloc (which is an OpenMPI sub-project) and the awesome ‘lstopo’ command to draw an illustration of your hardware, which shows you exactly how the physical cores are numbered. It is not always safe to assume that cores are numbered sequentially; I have seen “0,2,4,6” on the same socket before!!
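a minimal sketch of the rankfile approach (the hostname `localhost` and the core ids are placeholders; take the real numbering from `lstopo`, since cores on one socket may well be numbered 0,2,4,… rather than sequentially):

```shell
# generate an Open MPI rankfile that pins ranks 0-11 to cores 0-11
# of the local node; adjust the slot numbers to match lstopo output
for i in $(seq 0 11); do
    echo "rank $i=localhost slot=$i"
done > rankfile.0

cat rankfile.0
# then launch with:  mpirun -np 12 -rf rankfile.0 lmp -in in.lj
```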

Thanks everyone once again for your prompt and comprehensive input.

-Mehmet

leave the load balancing to the OS altogether, which does a good job at that.

Not always. My experience has been that if you turn off affinity, then the
OS may move LAMMPS/MPI processes around from processor to processor
during a run, which is not a good thing. If it happens a lot it can
have a very bad impact
on performance, especially if you are running on a large number of cores.

What you want is one LAMMPS (or MPI) process
per physical core, and for it never to change for the duration of a run.
I.e. you do not want the OS or batch scheduler to put more than one
process on a physical core.

Steve

leave the load balancing to the OS altogether, which does a good job at that.

Not always. My experience has been that if you turn off affinity, then the
OS may move LAMMPS/MPI processes around from processor to processor
during a run, which is not a good thing. If it happens a lot it can
have a very bad impact
on performance, especially if you are running on a large number of cores.

What you want is one LAMMPS (or MPI) process
per physical core, and for it never to change for the duration of a run.
I.e. you do not want the OS or batch scheduler to put more than one
process on a physical core.

absolutely, even more so, with the memory controllers integrated into
the CPUs nowadays, you also want to use the memory that is
physically attached to the CPU that your process is on (memory
affinity). particularly for MPI codes that communicate a lot, that
helps to contain memory bandwidth contention. even on an otherwise
empty machine, this can make 5-10% difference in performance.
on a loaded machine even more.
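a sketch of enforcing that memory affinity by hand with `numactl` (the node ids and the launched command are placeholders; check the actual topology with `numactl --hardware`):

```shell
# run a 6-task job on the cores of NUMA node 0, allocating only
# from the memory physically attached to that node
numactl --cpunodebind=0 --membind=0 mpirun -np 6 lmp -in in.lj
```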

the linux scheduler is optimized to give good interactive performance,
i.e. to respond quickly when waking up random processes while
lots of them linger around. that requires a lot of guessing
and heuristics about what will be needed next. for parallel applications,
this is not needed. we know _exactly_ which process is to be preferred.
any delay for that process (usually referred to as OS jitter) can cause
significant performance degradation when running across a large number
of processor cores. there is a reason why IBM's blue gene and cray's
xt series machines use lightweight kernels that eliminate or reduce
that distraction.

cheers,
    axel.