CUDA driver error 4 due to a failing cuMemFreeHost call

Dear developers:

I’m a new developer working on a GPU accelerator library for a new algorithm, similar to the PPPM library. I’m utilizing the geryon library located in lammps/lib/gpu/geryon for development. However, I’ve encountered some issues when using the UCL_Vector class.

Problem Description

While the program is running, the forces and energy are calculated correctly. However, something goes wrong when the class I defined in the LAMMPS_AL namespace is destructed. The following errors occur:

in call at file '/dssg/home/acct-hpc/project/00_LAMMPS/lammps/lib/gpu/geryon/nvd_memory.h' in line 85.
Cuda driver error 4 in file '/dssg/home/acct-hpc/project/00_LAMMPS/lammps/lib/gpu/geryon/nvd_memory.h' in line 85.                             
Cuda driver error 4 in call at file '/dssg/home/acct-hpc/project/00_LAMMPS/lammps/lib/gpu/geryon/nvd_memory.h' in line 85.                     
Cuda driver error 4 in file '/dssg/home/acct-hpc/hpclqz/project/00_LAMMPS/lammps/lib/gpu/geryon/nvd_memory.h' in line 85.                             
Cuda driver error 4 in call at file '/dssg/home/acct-hpc/project/00_LAMMPS/lammps/lib/gpu/geryon/nvd_memory.h' in line 85.                     
Cuda driver error 4 in file '/dssg/home/acct-hpc/project/00_LAMMPS/lammps/lib/gpu/geryon/nvd_memory.h' in line 85. 

These errors occur because the UCL_H_Vec<devtype> _buffer is not cleared when the UCL_Vector is destroyed. Using cuda-gdb, I discovered that the _cols variable of the private UCL_H_Vec<devtype> _buffer member is not reset to 0, even though the UCL_H_Vec<hosttype> host and UCL_D_Vec<devtype> device members are cleared correctly.
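For reference, the stock clear() in lib/gpu/geryon/ucl_vector.h, as I read it, only clears the host and device members and leaves _buffer untouched:

inline void clear()
    { host.clear(); device.clear(); }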

Case in My Program

I have defined 4 UCL_Vector members in my class:

UCL_Vector<acctyp, acctyp> pxyz;
UCL_Vector<numtyp3, numtyp3> K;
UCL_Vector<numtyp2, numtyp2> Rho;
UCL_Vector<numtyp2, numtyp2> Rho_All;  

After I call their clear() functions:

pxyz.clear();
K.clear();
Rho.clear();
Rho_All.clear();

pxyz is cleared correctly; the _cols of its _buffer is 0, as shown below:

(cuda-gdb) print pxyz
$1 = {host = {<ucl_cudadr::UCL_BaseMat> = {
      _vptr.UCL_BaseMat = 0x33d1160 <vtable for ucl_cudadr::UCL_H_Vec<double>+16>, _cq = 0x0,
      _kind = UCL_VIEW}, _array = 0x14d2c7643a00, _end = 0x14d2c7643a18, _row_bytes = 24, _cols = 0},
  device = {<ucl_cudadr::UCL_BaseMat> = {
      _vptr.UCL_BaseMat = 0x33d1140 <vtable for ucl_cudadr::UCL_D_Vec<double>+16>, _cq = 0x0,
      _kind = UCL_VIEW}, _row_bytes = 24, _row_size = 0, _rows = 0, _cols = 0, _array = 22895518891008},
  _buffer = {<ucl_cudadr::UCL_BaseMat> = {
      _vptr.UCL_BaseMat = 0x33d1160 <vtable for ucl_cudadr::UCL_H_Vec<double>+16>, _cq = 0x0,
      _kind = UCL_VIEW}, _array = 0x0, _end = 0x0, _row_bytes = 0, _cols = 0}}

However, the other three UCL_Vector members do not behave correctly. Take K for example: after calling K.clear(), the contents of K are:

$4 = {host = {<ucl_cudadr::UCL_BaseMat> = {
      _vptr.UCL_BaseMat = 0x33d19c0 <vtable for ucl_cudadr::UCL_H_Vec<_lgpu_float3>+16>, _cq = 0x0,
      _kind = UCL_VIEW}, _array = 0x4f7ee50, _end = 0x4f805c0, _row_bytes = 6000, _cols = 0},
  device = {<ucl_cudadr::UCL_BaseMat> = {
      _vptr.UCL_BaseMat = 0x33d19a0 <vtable for ucl_cudadr::UCL_D_Vec<_lgpu_float3>+16>, _cq = 0x0,
      _kind = UCL_READ_WRITE}, _row_bytes = 6000, _row_size = 0, _rows = 0, _cols = 500,
    _array = 22895518870528}, _buffer = {<ucl_cudadr::UCL_BaseMat> = {
      _vptr.UCL_BaseMat = 0x33d19c0 <vtable for ucl_cudadr::UCL_H_Vec<_lgpu_float3>+16>, _cq = 0x0,
      _kind = UCL_READ_WRITE}, _array = 0x14d2c7640200, _end = 0x14d2c7641970, _row_bytes = 6000,
    _cols = 500}}

As a result, during the destruction of the K, Rho, and Rho_All UCL_Vector instances, the _host_free function in lib/gpu/geryon/nvd_memory.h is invoked, and the assertion assert(0==1) is triggered by CU_DESTRUCT_CALL(cuMemFreeHost(mat.begin())).
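My understanding of the destruct path, paraphrased rather than quoted (the actual code in nvd_memory.h may differ in details), is roughly the following: because _buffer still reports a nonzero _cols after clear(), its pinned host memory is only freed at destructor time, after the CUDA driver has already begun shutting down.

// Paraphrased sketch of the host-side free in nvd_memory.h; not the exact
// geryon code.
template <class mat_type>
inline void _host_free(mat_type &mat) {
  if (mat.cols() > 0)                               // _buffer still reports 500 here
    CU_DESTRUCT_CALL(cuMemFreeHost(mat.begin()));   // -> Cuda driver error 4, assert(0==1)
}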

My Solution

I went with the simplest solution: I added _buffer.clear() to the clear() function of UCL_Vector in lib/gpu/geryon/ucl_vector.h. After recompiling, all the problems are gone.

inline void clear()
    { host.clear(); _buffer.clear(); device.clear(); }

I know that solving it this way might seem naive, but I haven't been able to find the error in my own code, so I sincerely ask whether anyone has experience with this type of problem.

Sincere thanks to everyone.

Perhaps @ndtrung can comment on this.

Please give him a few days to take a closer look. If that does not lead anywhere, you may then consider submitting this as a bug report issue on the LAMMPS GitHub repo.

Thanks for your help. I’d be glad to wait for a few days.

Hi @MC_DA, I have a couple of questions:

  1. What is the exact command that you used to run LAMMPS?
  2. How do you allocate/resize the 3 vectors K, Rho and Rho_All differently from pxyz?
  3. Where are the clear() functions invoked in your class?

You mentioned your feature is similar to PPPM; I suppose you can base your implementation on how the UCL_Vector variables brick and vd_brick are allocated and deallocated in that class.

CUDA driver error 4 indicates that the CUDA driver is in the process of shutting down, which is expected at the end of the run. Adding _buffer.clear() has not been necessary so far, but it appears to fix the issue in your case. I am wondering if there are certain details in your implementation and run that require this extra step.
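For reference, you can confirm what driver error code 4 maps to with cuGetErrorName; a minimal standalone check (not LAMMPS code, compiled and linked with -lcuda) would be:

#include <cuda.h>
#include <cstdio>

int main() {
  // Ask the driver API for the symbolic name of error code 4.
  const char *name = nullptr;
  cuGetErrorName(static_cast<CUresult>(4), &name);
  printf("driver error 4 = %s\n", name ? name : "unknown");  // CUDA_ERROR_DEINITIALIZED
  return 0;
}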

Thanks for your reply!

My MPI launching command is:

mpirun -n 1 ../../build/lmp -sf gpu -i in.gpu

Here are my class definition and memory allocation code:

Definition:

 UCL_Vector<acctyp, acctyp> pxyz;
 UCL_Vector<numtyp3, numtyp3> K;
 UCL_Vector<numtyp2, numtyp2> Rho;
 UCL_Vector<numtyp2, numtyp2> Rho_All;
 UCL_Vector<int, int> error_flag;

Allocation:

success = success && (K.alloc(P,*ucl_device) == UCL_SUCCESS);
success = success && (Rho.alloc(P,*ucl_device) == UCL_SUCCESS);
success = success && (Rho_All.alloc(P,*ucl_device) == UCL_SUCCESS);
success = success && (pxyz.alloc(3,*ucl_device) == UCL_SUCCESS);
success = success && (error_flag.alloc(1,*ucl_device)==UCL_SUCCESS);

Here is my clear function:

template <class numtyp, class acctyp, class grdtyp, class grdtyp4>
void RbeT::clear(const double cpu_time) {
  if (!_allocated)
    return;
  _allocated=false;
  _precompute_done=false;

  pxyz.clear();
  K.clear();
  Rho.clear();
  Rho_All.clear();
  error_flag.clear();

  // Some irrelevant details have been omitted.
}

And here is where the clear function is invoked:

template <class numtyp, class acctyp, class grdtyp, class grdtyp4>
RbeT::~Rbe() {
  clear(0.0);
  delete ans;
  k_make_rho.clear();
  k_get_force.clear();
  k_get_energy.clear();
  if (rbe_program) delete rbe_program;
}

Here is how I use the UCL_Vector class variables in the compute function, which is the only place I use them:

template <class numtyp, class acctyp, class grdtyp, class grdtyp4>
void RbeT::compute(/*** parameters omitted ***/) {

  // other irrelevant details omitted

  /*---- initialize data ----*/

  // calculation of pxyz on the host omitted
  pxyz.update_device();

  // calculation of K on the host omitted
  K.update_device();

  /**********************************************/

  // launch of kernel 1, which writes `Rho` and reads `pxyz` and `K`, omitted

  Rho.update_host();

  // calculation of Rho_All on the host omitted

  Rho_All.update_device();

  /**********************************************/

  // launch of kernel 2, which only reads `Rho_All`, omitted

  // other irrelevant details omitted
}
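In essence, the pattern for each of these vectors is the simplified round trip below (illustration only; dev stands for the UCL_Device that I pass to alloc() in my init code):

UCL_Vector<double,double> v;
v.alloc(8, dev);                       // pinned host storage + device storage
double *h = v.host.begin();            // host-side pointer
for (int i = 0; i < 8; i++) h[i] = i;  // fill on the host
v.update_device();                     // copy host -> device
// ... a kernel reads/writes v.device ...
v.update_host();                       // copy device -> host
v.clear();                             // release both sides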

If you need more details, I will reply as soon as possible.

@MC_DA Thanks for the info. All looks reasonable to me. Nevertheless, I cannot reproduce the issue on my end by adding similar variables (K, Rho, Rho_All), and then allocating with some fixed value of P (say 10), updating on the host/device sides, and de-allocating them the way you are doing. The use patterns of UCL_Vector variables are everywhere in the lib/gpu classes. Therefore, I am not entirely convinced that there’s a bug, or an issue, with the current implementation of UCL_Vector::clear().

Your proposed solution (clearing _buffer) does no harm to performance from my point of view. Before deciding whether we could go with this change, could you please debug your code a bit more to see which variables trigger the issue, for example, by commenting out the use of these variables and then un-commenting them one by one? Using run 0, or commenting out the time integrator (so all the atoms are fixed in place), could help.
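For instance, something along these lines (just a debugging aid with toggle names that I made up, not a proposed change) would let you re-enable the vectors in compute() one at a time until the destructor errors reappear:

// Hypothetical toggles for bisecting; disable everything first, then flip
// them back to true one at a time.
const bool use_K = false;
const bool use_Rho = false;
const bool use_Rho_All = false;

if (use_K)       K.update_device();
if (use_Rho)     Rho.update_host();
if (use_Rho_All) Rho_All.update_device();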
