GPU efficiency fluctuates

I followed the instructions from https://www.kryii.com/96.html, and things went well for around a month. Then suddenly, yesterday, my GPU utilization started fluctuating as shown, even though I didn't change anything.
At first I thought the NVIDIA driver or the CUDA package might have somehow failed, so I rebuilt the whole system and tested with the same in.test, but nothing changed.
Has anyone met the same problem?

Before that, the GPU occupancy was around 80%.

It’s likely that the problem originates from some issue in your system configuration that only you can track down, but here are some comments based on what I can see:

  1. I don’t suggest checking GPU usage with the task manager. The % it reports comes from the “Copy” panel by default, which doesn’t really reflect the usage of CUDA applications. Please try nvidia-smi (see the sketch after this list), or if you really love the task manager, switch to the “Cuda” panel in the GPU subpage.
  2. Even with the GPU package enabled, some of the calculations (e.g. kspace) are still performed on the CPU. So when you compare the “30%” and “80%” usage scenarios, please make sure that you’re using the same input script, command-line options, number of MPI processes, etc. (a minimal run command is also shown after this list).
  3. Not every simulation can utilize the GPU very well. This is especially true for small systems, which in many cases simply cannot saturate the GPU, resulting in low GPU utilization.
  4. You may also check the timing breakdown printed by LAMMPS at the end of every run. Make sure that output etc. are not taking too much of the time.
  5. It seems that you run LAMMPS in WSL2 on Windows. You may also try the native Windows build of LAMMPS (if it satisfies your needs) and compare the performance. Generally speaking I’d expect the native Windows version to perform worse (since it uses OpenCL instead of CUDA), but if that’s not the case, then there may be some issue in your system setup.
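For reference, here is a minimal sketch of the kind of commands I mean; the `lmp` executable name, the 4 MPI ranks, and the single GPU are assumptions, so adjust them to your build and machine:

```sh
# Watch overall GPU utilization once per second (look at the GPU-Util
# column, not the copy engine shown by the task manager).
watch -n 1 nvidia-smi

# Or log per-second utilization in a compact form.
nvidia-smi dmon -s u

# Run LAMMPS with the GPU package enabled explicitly, keeping the input
# script, package options, and number of MPI ranks identical between the
# runs you compare (executable name and rank count are assumptions).
mpirun -np 4 lmp -sf gpu -pk gpu 1 -in in.test
```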

It’s hard to see how you were previously using CUDA on an Intel Iris Xe GPU, which is (1) not an Nvidia GPU and (2) not likely to give a significant speedup anyway.

EDIT - just saw that GPU 2 is an Nvidia GPU being used at 30%. Is that the change you’re referring to? If so, it’s quite likely a configuration change caused by a Windows update – that’s the usual suspect when you didn’t change something (knowingly) but the system is now doing something different.

Thanks for your reply :wink:.
I tried checking the GPU with watch nvidia-smi; the GPU occupancy is lower than 20%.
I am pretty sure that Windows didn't update. I also thought it might be a hardware defect, so I tested the Nvidia GPU and it went well.
I am trying to rebuild it again and hope that will fix the problem.

Thanks for your reply. I checked the Windows update log; nothing was updated.
I am trying to build it again :exploding_head:

GPU occupancy is not automatically a good indicator for efficiency (same for CPU occupancy as shown by the top program). What matters primarily is how fast the actual calculation runs (with the same input under the same conditions).

With Windows the situation can be quite complex since there are many background processes that can use a significant amount of resources (e.g. a background virus scan) and you are not always able to control those. Since the CPU is needed to launch GPU kernels, a CPU-bound background task can affect GPU occupancy. Another factor is that nowadays most hardware has a NUMA architecture, so accessing memory or devices from a directly connected CPU will be faster than if the process is hosted by a CPU core on a different socket.
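If you suspect process placement plays a role, a rough sketch (on the Linux side) for pinning the run to one NUMA node could look like the following; note that numactl may not report meaningful topology inside WSL2, so treat this as something to test rather than a recommendation:

```sh
# Inspect the NUMA topology (number of nodes, which CPUs belong to each).
numactl --hardware

# Pin the run (CPUs and memory) to NUMA node 0; the node number and the
# rest of the command line are placeholders for illustration.
numactl --cpunodebind=0 --membind=0 mpirun -np 4 lmp -sf gpu -pk gpu 1 -in in.test
```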

Most certainly this is not something that LAMMPS has much of an impact on and details depend strongly on your specific hardware and software (including Windows applications outside of WSL).

That is not true. The difference has always been small and most significant when oversubscribing the GPU. However, some time ago, changes were added to the OpenCL code path in the GPU package so that it can automatically tune its settings to the specifics of the GPU hardware, which made using the GPU with OpenCL even more competitive. Code paths that are extremely tuned toward Nvidia’s CUDA (or AMD’s ROCm) are more present in the KOKKOS package. Most of the functionality of the GPU package is based on rather generic API calls (so that it is possible to hide the specifics of CUDA vs. OpenCL vs. HIP behind a set of C preprocessor macros and use the same “abstract” code throughout).

Thanks for all of your replies :kissing_heart:
I think I found the problem. It must be the input script. I tried another input, and it runs pretty fast with GPU occupancy around 80%. I am now determining which part of my script caused the problem.

I think someone hacked my system and changed my input script!!!
When did I change mydump to dump every step?!
I guess the drop in GPU utilization is because the CPU is writing output so frequently that the GPU has to wait until it finishes; since the output happens every single step, the GPU is almost always waiting, as if it had stopped.
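In case it helps someone, a sketch of the kind of change that caused this (the dump ID, style, and file name here are placeholders, not my actual script):

```
# Before: dumping every timestep makes the GPU sit idle while the CPU writes the file
dump            mydump all custom 1 dump.out id type x y z

# After: dump only every 1000 steps (or whatever the analysis really needs),
# so the GPU stays busy between outputs
dump            mydump all custom 1000 dump.out id type x y z
```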
Thanks again for all of your time :kissing_heart:
Hope it will help others someday.