Are benchmarks available for comparing single-threaded LAMMPS performance on recent CPUs? (specifically Ryzen 3900X vs 3950X)

Dear all,

I am writing with a question about the single-threaded performance of more recent CPUs in LAMMPS.

My question is whether comparative (single-threaded) performance benchmarks are available for the AMD Ryzen 3900X vs the AMD Ryzen 3950X CPUs, or other recent consumer-grade CPUs. I have not been able to find such single-threaded benchmarks on the LAMMPS mailing list or in other sources. I did find a topic from 2017, still without replies, on the previous generation of Ryzen CPUs. I also found several LAMMPS Rhodopsin benchmark results, which I think reflect performance for a parallelizable load. These results, however, do not answer my question. Does someone on the mailing list have experience with the single-threaded performance of these CPUs in LAMMPS?

For context: I run simulations of a tethered polymer with a small number of coarse-grained particles (N < 100), but for rather long times (on the order of 1e11 timesteps). Therefore, to my knowledge, I cannot gain a performance advantage by parallelizing the workload on either the CPU or GPU(s). Hence, I would like to be able to compare the single-threaded performance of modern CPUs in LAMMPS.

I currently have a 3900X installed running at stock speeds, but I am considering upgrading to a 3950X to allow more simulations to run simultaneously. Furthermore, the 3950X is supposed to have a higher sustained boost clock. I cannot afford more professional hardware, and the hardware at the university is designed around parallelizable loads, so I would like to compare the respective performance of these consumer-grade CPUs before I spend money on an upgrade. I am aware that performance can vary based on the nature of a simulation, the compiler used, the LAMMPS version, and other hardware specifics, but I would expect an identical benchmark run on both CPUs to be at least indicative of the potential performance gain.

(As per the mailing list guidelines: I am using the LAMMPS-64bit-7Aug2019 prebuilt Windows binary, as well as Linux builds of the same version.)

Thank you in advance.

Kind regards,

Joost Bergen
Student
Department of Biomedical Engineering, Department of Applied Physics
Eindhoven University of Technology, Netherlands

“My question is whether comparative (single-threaded) performance benchmarks are available for the AMD Ryzen 3900X vs the AMD Ryzen 3950X CPUs, or other recent consumer-grade CPUs. […] Does someone on the mailing list have experience with the single-threaded performance of these CPUs in LAMMPS?”

What pair style and other force styles are you using? Do you need long-range electrostatics?
There may be ways to optimize this at the code level and through improved settings.

Also, you may want to check out the USER-INTEL package for vectorized execution, which, especially in mixed precision, can result in significant performance improvements on a single MPI rank.
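For example (a minimal sketch; the executable and input file names are placeholders, and the binary must have been built with the package enabled), the /intel style variants can be selected at run time with the suffix flags:

    # "-sf intel" switches styles to their /intel variants;
    # "-pk intel 0 mode mixed" requests zero coprocessors and mixed precision
    lmp -sf intel -pk intel 0 mode mixed -in in.polymer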

“For context: I run simulations of a tethered polymer with a small number of coarse-grained particles (N < 100), but for rather long times (on the order of 1e11 timesteps). Therefore, to my knowledge, I cannot gain a performance advantage by parallelizing the workload on either the CPU or GPU(s). Hence, I would like to be able to compare the single-threaded performance of modern CPUs in LAMMPS.”

Do you really need a continuous trajectory for your work?
Otherwise, you can create multiple decorrelated restarts (by repeatedly re-initializing velocities with different random seeds), run those decorrelated trajectories concurrently, and cut your wait time down massively while making your workload significantly more parallel.

For the theory behind this, you may want to look up papers by Art Voter (e.g. on parallel replica dynamics, PRD, which uses this kind of approach to speed up the search for rare events).
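As a rough sketch (file names, temperature, and damping values are placeholders), each replica could read the same equilibrated restart and differ only in the random seed passed on the command line:

    # in.replica (hypothetical): launch several decorrelated copies, e.g.
    #   lmp -in in.replica -var seed 12345 -log log.12345 &
    #   lmp -in in.replica -var seed 23456 -log log.23456 &
    read_restart  equil.restart                         # assumed equilibrated starting point
    velocity      all create 1.0 ${seed} dist gaussian  # fresh velocities per seed
    fix integ all nve
    fix therm all langevin 1.0 1.0 1.0 ${seed}          # fixes are not stored in restarts
    run 100000000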

Axel.

Hello Axel,

Thanks for your quick reply.

“What pair style and other force styles are you using? There may be ways to optimize this at the code level and through improved settings.”

I’m using lj/cut for the pair style. Bond and angle styles are both harmonic. The atom style is currently angle, although I will want to add a rigid body with angular momentum as a tethered particle, either via the fix rigid command or the body/sphere style (I haven’t yet properly read up on this). Furthermore, there’s a Langevin fix. I use a wall/lj93 potential for the surface (p p f box).
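For reference, the relevant part of my input looks roughly like this (coefficients and the seed are placeholders):

    units        lj
    boundary     p p f
    atom_style   angle
    pair_style   lj/cut 2.5
    bond_style   harmonic
    angle_style  harmonic
    fix surf  all wall/lj93 zlo EDGE 1.0 1.0 2.5   # placeholder epsilon/sigma/cutoff
    fix therm all langevin 1.0 1.0 1.0 12345       # placeholder T, damping, seed
    fix integ all nve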

These settings should be compatible with USER-INTEL styles.

“Do you need long-range electrostatics?”

I do not need long-range electrostatics (high salt is assumed for now).

“Also, you may want to check out the USER-INTEL package for vectorized execution, which, especially in mixed precision, can result in significant performance improvements on a single MPI rank.”

Thank you for that suggestion! I assumed that USER-INTEL was optimized only for certain Intel processors and would not result in significant performance gains for AMD CPUs. Is that the case?

From the docs: “Although any compiler can be used with the USER-INTEL package, currently, vectorization directives are disabled by default when not using Intel compilers due to lack of standard support and observations of decreased performance. The OpenMP standard now supports directives for vectorization and we plan to transition the code to this standard once it is available in most compilers. We expect this to allow improved performance and support with other compilers.”

Is it possible to enable vectorization support for compilers other than Intel’s? Do you happen to know whether Intel compilers still disadvantage performance on non-Intel processors? Meanwhile, I will try compiling with Intel’s compiler to see if there’s a performance gain.

“Do you really need a continuous trajectory for your work? Otherwise, you can create multiple decorrelated restarts (by repeatedly re-initializing velocities with different random seeds), run those decorrelated trajectories concurrently, and cut your wait time down massively while making your workload significantly more parallel. For the theory behind this, you may want to look up papers by Art Voter (e.g. on parallel replica dynamics, PRD, which uses this kind of approach to speed up the search for rare events).”

I am specifically interested in the dynamics of the tether. I think I will need continuous trajectories for this (at least as long as the relaxation time of the polymer), as the conformations are correlated. However, I am currently simulating for longer than this relaxation time, so I could indeed parallelize. Very useful suggestion; why didn’t I think of this before? :) This will come in especially handy when I need to find specific types of conformations. Thanks!

Regards,

Joost

Hello Axel,

[…]

“Also, you may want to check out the USER-INTEL package for vectorized execution, which, especially in mixed precision, can result in significant performance improvements on a single MPI rank.”

“Thank you for that suggestion! I assumed that USER-INTEL was optimized only for certain Intel processors and would not result in significant performance gains for AMD CPUs. Is that the case?”

From the docs: “Although any compiler can be used with the USER-INTEL package, currently, vectorization directives are disabled by default when not using Intel compilers due to lack of standard support and observations of decreased performance. The OpenMP standard now supports directives for vectorization and we plan to transition the code to this standard once it is available in most compilers. We expect this to allow improved performance and support with other compilers.”

This is not specific to any hardware but refers to the compiler vendor. If a CPU supports the generated (vector) instructions (SSE, AVX, etc.), then it is compatible with the styles in this package.
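For example, on Linux a quick way to list which SSE/AVX instruction sets a CPU reports (a minimal sketch) is:

    grep -m1 flags /proc/cpuinfo | tr ' ' '\n' | grep -E '^(sse|avx)' | sort -u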

“Is it possible to enable vectorization support for compilers other than Intel’s? Do you happen to know whether Intel compilers still disadvantage performance on non-Intel processors? Meanwhile, I will try compiling with Intel’s compiler to see if there’s a performance gain.”

There is some performance gain with other compilers, too. We regularly test compilation with GNU compilers, but it is not as effective as with (recent) Intel compilers.

It is mostly a question of compiler compatibility with standards and of the compiler’s implementation recognizing opportunities to vectorize. The Intel compilers are far superior at employing vectorization. The directives in the code help, but they are only part of the issue.

The story about Intel compilers disadvantaging other vendors is very old, and even when it was an issue, it only applied to “CPU dispatch”, i.e. when the compiler would create multiple versions for different levels of instruction sets and then decide at run time which version to dispatch. When compiling explicitly for a specific architecture and instruction set, the compiler would always generate code for that choice, and that code would run just as well on AMD CPUs, provided they supported the instructions. USER-INTEL code, especially in mixed or single precision, will run faster even when compiled with GNU compilers, but using the Intel compilers will fully unleash the optimizations in the code.
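As a minimal build sketch using the traditional make procedure (the makefile target refers to the options file bundled with LAMMPS; adjust it to your compiler/MPI setup):

    # enable the package and build with the Intel compilers
    cd lammps/src
    make yes-user-intel
    make intel_cpu_intelmpi   # uses src/MAKE/OPTIONS/Makefile.intel_cpu_intelmpi
    # then run with the suffix flags, e.g.:
    #   mpirun -np 1 ./lmp_intel_cpu_intelmpi -sf intel -pk intel 0 mode mixed -in in.polymer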

“Do you really need a continuous trajectory for your work? Otherwise, you can create multiple decorrelated restarts (by repeatedly re-initializing velocities with different random seeds), run those decorrelated trajectories concurrently, and cut your wait time down massively while making your workload significantly more parallel. For the theory behind this, you may want to look up papers by Art Voter (e.g. on parallel replica dynamics, PRD, which uses this kind of approach to speed up the search for rare events).”

“I am specifically interested in the dynamics of the tether. I think I will need continuous trajectories for this (at least as long as the relaxation time of the polymer), as the conformations are correlated. However, I am currently simulating for longer than this relaxation time, so I could indeed parallelize. Very useful suggestion; why didn’t I think of this before? :) This will come in especially handy when I need to find specific types of conformations. Thanks!”

Another option to look into is enhanced free-energy sampling methods. Your problem sounds like it could be reduced to a few collective variables that describe the different conformations. In that case, there is significant potential in using a biasing method to first map out the available phase space, and thus the free-energy landscape, with methods like metadynamics. This will allow you to determine the height and accessibility of the barriers to different conformations, and thus may give you a reasonable answer in a much shorter time. You won’t get specific dynamics from that, but it should be much easier to explore what is accessible to your system after such a study instead of just doing a brute-force run (or multiple of those).
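As a hypothetical sketch using the USER-COLVARS package (the file name, atom numbers, and hill parameters are made up), one could bias, say, the end-to-end distance of the tether:

    # in the LAMMPS input (requires a build with the USER-COLVARS package):
    fix meta all colvars polymer.colvars

    # polymer.colvars (hypothetical) could contain:
    colvar {
      name  end2end
      width 0.2                      # placeholder Gaussian width
      distance {
        group1 { atomNumbers 1 }     # first bead of the tether
        group2 { atomNumbers 100 }   # last bead
      }
    }
    metadynamics {
      colvars          end2end
      hillWeight       0.1           # placeholder hill height
      newHillFrequency 500           # placeholder deposition interval
    }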

…the bottom line is that using a smart simulation strategy can outperform any superior hardware almost all the time.

Axel.

Dear Dr. Kohlmeyer,

“[…]”

Excuse me, I meant no disrespect. I’m used to a culture in which people address each other very informally regardless of status; I should have stepped out of my bubble. Thank you for your elaborate reply.

“This is not specific to any hardware but refers to the compiler vendor. If a CPU supports the generated (vector) instructions (SSE, AVX, etc.), then it is compatible with the styles in this package. There is some performance gain with other compilers, too. We regularly test compilation with GNU compilers, but it is not as effective as with (recent) Intel compilers.

It is mostly a question of compiler compatibility with standards and of the compiler’s implementation recognizing opportunities to vectorize. The Intel compilers are far superior at employing vectorization. The directives in the code help, but they are only part of the issue.

The story about Intel compilers disadvantaging other vendors is very old, and even when it was an issue, it only applied to “CPU dispatch”, i.e. when the compiler would create multiple versions for different levels of instruction sets and then decide at run time which version to dispatch. When compiling explicitly for a specific architecture and instruction set, the compiler would always generate code for that choice, and that code would run just as well on AMD CPUs, provided they supported the instructions. USER-INTEL code, especially in mixed or single precision, will run faster even when compiled with GNU compilers, but using the Intel compilers will fully unleash the optimizations in the code.”

This addresses all my questions. The Intel compilers will then be my first approach if I can’t reduce the scale of my problems by smarter means.

“Another option to look into is enhanced free-energy sampling methods. Your problem sounds like it could be reduced to a few collective variables that describe the different conformations. In that case, there is significant potential in using a biasing method to first map out the available phase space, and thus the free-energy landscape, with methods like metadynamics. This will allow you to determine the height and accessibility of the barriers to different conformations, and thus may give you a reasonable answer in a much shorter time. You won’t get specific dynamics from that, but it should be much easier to explore what is accessible to your system after such a study instead of just doing a brute-force run (or multiple of those).”

I will read up on the literature on the methods you described. As you stated, it sounds like this could greatly ease the process of discovering plausible conformations, after which the dynamics could be simulated from those points onward.

“…the bottom line is that using a smart simulation strategy can outperform any superior hardware almost all the time.”

I will keep that in mind. It is tempting to get the most out of one’s hardware and software and optimize as much as possible, but I suppose spending too much time on that is indeed very inefficient in my case (and perhaps quite commonly in general). Thank you once again for making the effort to answer all my questions and for suggesting better solutions!

Kind regards,

Joost Bergen