QMCPACK
DiracMatrixComputeCUDA< VALUE_FP > Class Template Reference

Class defining a compute and memory resource to compute matrix inversions and the log determinants of a batch of DiracMatrix objects. More...


Public Member Functions

 DiracMatrixComputeCUDA ()
 
 DiracMatrixComputeCUDA (const DiracMatrixComputeCUDA &other)
 
 ~DiracMatrixComputeCUDA ()
 
std::unique_ptr< Resource > makeClone () const override
 
template<typename TMAT >
void invert_transpose (compute::Queue< PlatformKind::CUDA > &queue, DualMatrix< TMAT > &a_mat, DualMatrix< TMAT > &inv_a_mat, DualVector< LogValue > &log_values)
 Given a_mat, returns the inverted matrix inv_a_mat and the log determinant of a_mat. More...
 
template<typename TMAT >
std::enable_if_t<!std::is_same< VALUE_FP, TMAT >::value > mw_invertTranspose (compute::Queue< PlatformKind::CUDA > &queue, const RefVector< const DualMatrix< TMAT >> &a_mats, const RefVector< DualMatrix< TMAT >> &inv_a_mats, DualVector< LogValue > &log_values)
 Mixed precision specialization. When TMAT is not full precision, we still need to do the inversion and log determinant at full precision. More...
 
template<typename TMAT >
std::enable_if_t< std::is_same< VALUE_FP, TMAT >::value > mw_invertTranspose (compute::Queue< PlatformKind::CUDA > &queue, const RefVector< const DualMatrix< TMAT >> &a_mats, const RefVector< DualMatrix< TMAT >> &inv_a_mats, DualVector< LogValue > &log_values)
 Batched inversion and calculation of log determinants. More...
 
Public Member Functions inherited from Resource
 Resource (const std::string &name)
 
virtual ~Resource ()=default
 
const std::string & getName () const
 

Private Types

using FullPrecReal = RealAlias< VALUE_FP >
 
using LogValue = std::complex< FullPrecReal >
 
template<typename T >
using DualMatrix = Matrix< T, PinnedDualAllocator< T > >
 
template<typename T >
using DualVector = Vector< T, PinnedDualAllocator< T > >
 

Private Member Functions

void mw_computeInvertAndLog (compute::Queue< PlatformKind::CUDA > &queue, const RefVector< const DualMatrix< VALUE_FP >> &a_mats, const RefVector< DualMatrix< VALUE_FP >> &inv_a_mats, const int n, DualVector< LogValue > &log_values)
 Calculates the actual inverse and log determinant on the accelerator. More...
 
void mw_computeInvertAndLog_stride (compute::Queue< PlatformKind::CUDA > &queue, DualVector< VALUE_FP > &psi_Ms, DualVector< VALUE_FP > &inv_Ms, const int n, const int lda, DualVector< LogValue > &log_values)
 Calculates the actual inverse and log determinant on the accelerator, with psiMs and invMs widened to full precision and copied into contiguous vectors. More...
 

Private Attributes

DualVector< VALUE_FP > psiM_fp_
 
DualVector< VALUE_FP > invM_fp_
 
DualVector< VALUE_FP > LU_diags_fp_
 
DualVector< int > pivots_
 
DualVector< int > infos_
 
DualVector< VALUE_FP * > psiM_invM_ptrs_
 Transfer buffer for device pointers to matrices. More...
 
VALUE_FP host_one {1.0}
 
VALUE_FP host_zero {0.0}
 
cublasHandle_t h_cublas_
 

Detailed Description

template<typename VALUE_FP>
class qmcplusplus::DiracMatrixComputeCUDA< VALUE_FP >

Class defining a compute and memory resource to compute matrix inversions and the log determinants of a batch of DiracMatrix objects.

Multiplicity is one per crowd, not one per UpdateEngine. It matches the multiplicity of the accelerator call and the batched resource requirement.

Template Parameters
    VALUE_FP  the datatype used in the actual computation of matrix inversion

There are no per-walker variables; resources specific to the per-crowd compute object are owned here. The compute object itself is the resource of the per-walker DiracDeterminantBatched. Resources used by this object but owned by the surrounding scope are passed as arguments.

All the public APIs are synchronous. The asynchronous queue argument gets synchronized before return. rocBLAS, indirectly used via hipBLAS, requires synchronizing the old stream before setting a new one. We don't need to actively synchronize the old stream because it gets synchronized right after each use.
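As a rough sketch of the intended call pattern (illustrative only: the queue construction, sizes, and matrix fill are hypothetical, and RefVector is assumed to be a std::vector of std::reference_wrapper as elsewhere in QMCPACK):

    #include <complex>
    #include <vector>
    using namespace qmcplusplus;

    using Value    = double;               // VALUE_FP
    using LogValue = std::complex<double>; // std::complex<FullPrecReal>
    template<typename T>
    using DualMatrix = Matrix<T, PinnedDualAllocator<T>>;
    template<typename T>
    using DualVector = Vector<T, PinnedDualAllocator<T>>;

    void crowd_invert_sketch()
    {
      const int n  = 4; // matrix rank (illustrative)
      const int nw = 8; // walkers in the crowd (illustrative)

      compute::Queue<PlatformKind::CUDA> queue;
      DiracMatrixComputeCUDA<Value> dmc; // one resource per crowd, not per walker

      std::vector<DualMatrix<Value>> a_mats(nw), inv_a_mats(nw);
      for (int iw = 0; iw < nw; ++iw)
      {
        a_mats[iw].resize(n, n);
        inv_a_mats[iw].resize(n, n);
        // ... fill a_mats[iw] on the host ...
      }
      DualVector<LogValue> log_values(nw);

      // Build the reference vectors over the owning storage.
      RefVector<const DualMatrix<Value>> a_refs(a_mats.begin(), a_mats.end());
      RefVector<DualMatrix<Value>> inv_refs(inv_a_mats.begin(), inv_a_mats.end());

      // Synchronous: the queue argument is synchronized before return.
      dmc.mw_invertTranspose(queue, a_refs, inv_refs, log_values);
      // log_values[iw] now holds the log determinant of a_mats[iw].
    }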

Definition at line 49 of file DiracMatrixComputeCUDA.hpp.

Member Typedef Documentation

◆ DualMatrix

using DualMatrix = Matrix<T, PinnedDualAllocator<T> >
private

Definition at line 55 of file DiracMatrixComputeCUDA.hpp.

◆ DualVector

using DualVector = Vector<T, PinnedDualAllocator<T> >
private

Definition at line 58 of file DiracMatrixComputeCUDA.hpp.

◆ FullPrecReal

using FullPrecReal = RealAlias<VALUE_FP>
private

Definition at line 51 of file DiracMatrixComputeCUDA.hpp.

◆ LogValue

using LogValue = std::complex<FullPrecReal>
private

Definition at line 52 of file DiracMatrixComputeCUDA.hpp.
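As a reading aid (the standard complex-logarithm convention, not specific to this class), for a matrix A the stored value satisfies

    \log\det A = \log\lvert\det A\rvert + i\,\arg(\det A), \qquad \det A = e^{\log\det A}

so the real part carries the magnitude and the imaginary part the phase of the determinant.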

Constructor & Destructor Documentation

◆ DiracMatrixComputeCUDA() [1/2]

DiracMatrixComputeCUDA ( )
inline

Definition at line 220 of file DiracMatrixComputeCUDA.hpp.

References cublasCreate, cublasErrorCheck, and DiracMatrixComputeCUDA< VALUE_FP >::h_cublas_.

220  : Resource("DiracMatrixComputeCUDA")
221  {
222  cublasErrorCheck(cublasCreate(&h_cublas_), "cublasCreate failed!");
223  }

◆ DiracMatrixComputeCUDA() [2/2]

DiracMatrixComputeCUDA ( const DiracMatrixComputeCUDA< VALUE_FP > &  other)
inline

Definition at line 225 of file DiracMatrixComputeCUDA.hpp.

References cublasCreate, cublasErrorCheck, and DiracMatrixComputeCUDA< VALUE_FP >::h_cublas_.

225  : Resource(other.getName())
226  {
227  cublasErrorCheck(cublasCreate(&h_cublas_), "cublasCreate failed!");
228  }

◆ ~DiracMatrixComputeCUDA()

~DiracMatrixComputeCUDA ( )
inline

Definition at line 230 of file DiracMatrixComputeCUDA.hpp.

References cublasDestroy, cublasErrorCheck, and DiracMatrixComputeCUDA< VALUE_FP >::h_cublas_.

230 { cublasErrorCheck(cublasDestroy(h_cublas_), "cublasDestroy failed!"); }

Member Function Documentation

◆ invert_transpose()

void invert_transpose ( compute::Queue< PlatformKind::CUDA > &  queue,
DualMatrix< TMAT > &  a_mat,
DualMatrix< TMAT > &  inv_a_mat,
DualVector< LogValue > &  log_values 
)
inline

Given a_mat, returns the inverted matrix inv_a_mat and the log determinant of a_mat.

Parameters
    [in]  a_mat       a matrix input
    [out] inv_a_mat   inverted matrix
    [out] log_values  log determinant is in log_values[0]

I consider this single call to be semi-deprecated, so the log determinant values vector is used to match the primary batched interface to the accelerated routines. There is no optimization (yet) for TMAT being the same type as TREAL.
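For illustration, a hedged single-matrix call (reusing the illustrative aliases from the detailed description; sizes and fill are hypothetical):

    DualMatrix<Value> a_mat(n, n), inv_a_mat(n, n);
    DualVector<LogValue> log_values(1); // the single-call interface still takes a vector
    // ... fill a_mat on the host ...
    dmc.invert_transpose(queue, a_mat, inv_a_mat, log_values);
    // The log determinant of a_mat is in log_values[0].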

Definition at line 244 of file DiracMatrixComputeCUDA.hpp.

References Matrix< T, Alloc >::assignUpperLeft(), Matrix< T, Alloc >::attachReference(), Matrix< T, Alloc >::cols(), cublasErrorCheck, cublasSetStream, qmcplusplus::cudaErrorCheck(), cudaMemcpyAsync, cudaMemcpyHostToDevice, cudaStream_t, Matrix< T, Alloc >::data(), Matrix< T, Alloc >::device_data(), DiracMatrixComputeCUDA< VALUE_FP >::h_cublas_, DiracMatrixComputeCUDA< VALUE_FP >::invM_fp_, qmcplusplus::lda, qmcplusplus::log_values(), DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog_stride(), qmcplusplus::n, DiracMatrixComputeCUDA< VALUE_FP >::psiM_fp_, qmcplusplus::queue, Matrix< T, Alloc >::rows(), Matrix< T, Alloc >::size(), and qmcplusplus::simd::transpose().

Referenced by qmcplusplus::TEST_CASE().

248  {
249  cudaStream_t h_stream = queue.getNative();
250  cublasErrorCheck(cublasSetStream(h_cublas_, h_stream), "cublasSetStream failed!");
251  const int n = a_mat.rows();
252  const int lda = a_mat.cols();
253  psiM_fp_.resize(n * lda);
254  invM_fp_.resize(n * lda);
255  std::fill(log_values.begin(), log_values.end(), LogValue{0.0, 0.0});
256  // making sure we know the log_values are zero'd on the device.
257  cudaErrorCheck(cudaMemcpyAsync(log_values.device_data(), log_values.data(), log_values.size() * sizeof(LogValue),
258  cudaMemcpyHostToDevice, h_stream),
259  "cudaMemcpyAsync failed copying DiracMatrixBatch::log_values to device");
260  simd::transpose(a_mat.data(), n, lda, psiM_fp_.data(), n, lda);
261  cudaErrorCheck(cudaMemcpyAsync(psiM_fp_.device_data(), psiM_fp_.data(), psiM_fp_.size() * sizeof(VALUE_FP),
262  cudaMemcpyHostToDevice, h_stream),
263  "cudaMemcpyAsync failed copying DiracMatrixBatch::psiM_fp to device");
264  mw_computeInvertAndLog_stride(queue, psiM_fp_, invM_fp_, n, lda, log_values);
265  DualMatrix<VALUE_FP> data_ref_matrix;
266 
267  data_ref_matrix.attachReference(invM_fp_.data(), n, n);
268 
269  // We can't use operator= with different lda, ldb which can happen so we use this assignment which is over the
270  // smaller of the two's dimensions
271  inv_a_mat.assignUpperLeft(data_ref_matrix);
272  cudaErrorCheck(cudaMemcpyAsync(inv_a_mat.device_data(), inv_a_mat.data(), inv_a_mat.size() * sizeof(TMAT),
273  cudaMemcpyHostToDevice, h_stream),
274  "cudaMemcpyAsync of inv_a_mat to device failed!");
275  }

◆ makeClone()

std::unique_ptr<Resource> makeClone ( ) const
inline override virtual

Implements Resource.

Definition at line 232 of file DiracMatrixComputeCUDA.hpp.

232 { return std::make_unique<DiracMatrixComputeCUDA>(*this); }
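In the Resource pattern each crowd acquires its own copy, and the copy constructor above gives the clone its own cuBLAS handle. A minimal sketch, with dmc and Value as in the earlier sketch:

    std::unique_ptr<Resource> res = dmc.makeClone();
    auto& dmc_clone = static_cast<DiracMatrixComputeCUDA<Value>&>(*res);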

◆ mw_computeInvertAndLog()

void mw_computeInvertAndLog ( compute::Queue< PlatformKind::CUDA > &  queue,
const RefVector< const DualMatrix< VALUE_FP >> &  a_mats,
const RefVector< DualMatrix< VALUE_FP >> &  inv_a_mats,
const int  n,
DualVector< LogValue > &  log_values 
)
inline private

Calculates the actual inv and log determinant on accelerator.

Parameters
    [in]     h_cublas    cublas handle; the h_stream handle is retrieved from it.
    [in,out] a_mats      dual A matrices; they will be transposed on the device side as a side effect.
    [out]    inv_a_mats  dual invM matrices
    [in]     n           matrices rank
    [out]    log_values  log determinant value for each matrix, batch_size = log_values.size()

On Volta, so far little seems to be gained by making the matrices contiguous.

List of operations:

  1. matrix-by-matrix. Copy a_mat to inv_a_mat on host, transfer inv_a_mat to device, transpose inv_a_mat to a_mat on device.
  2. batched. LU and invert
  3. matrix-by-matrix. Transfer inv_a_mat to host

Pros and cons:

  1. Todo:
     Try to do as in mw_computeInvertAndLog_stride: copy and transpose into psiM_fp_ and fuse the transfer.

Definition at line 105 of file DiracMatrixComputeCUDA.hpp.

References qmcplusplus::cuBLAS_LU::computeInverseAndDetLog_batched(), CUBLAS_OP_N, CUBLAS_OP_T, cublasErrorCheck, qmcplusplus::cudaErrorCheck(), cudaMemcpyAsync, cudaMemcpyDeviceToHost, cudaMemcpyHostToDevice, cudaStream_t, cudaStreamSynchronize, qmcplusplus::cuBLAS::geam(), DiracMatrixComputeCUDA< VALUE_FP >::h_cublas_, DiracMatrixComputeCUDA< VALUE_FP >::host_one, DiracMatrixComputeCUDA< VALUE_FP >::host_zero, DiracMatrixComputeCUDA< VALUE_FP >::infos_, qmcplusplus::lda, qmcplusplus::log_values(), DiracMatrixComputeCUDA< VALUE_FP >::LU_diags_fp_, qmcplusplus::n, DiracMatrixComputeCUDA< VALUE_FP >::pivots_, DiracMatrixComputeCUDA< VALUE_FP >::psiM_fp_, DiracMatrixComputeCUDA< VALUE_FP >::psiM_invM_ptrs_, qmcplusplus::queue, and qmcplusplus::simd::remapCopy().

Referenced by DiracMatrixComputeCUDA< VALUE_FP >::mw_invertTranspose().

110  {
111  const int nw = a_mats.size();
112  assert(a_mats.size() == inv_a_mats.size());
113 
114  psiM_invM_ptrs_.resize(nw * 2);
115  const int lda = a_mats[0].get().cols();
116  const int ldinv = inv_a_mats[0].get().cols();
117  cudaStream_t h_stream = queue.getNative();
118  psiM_fp_.resize(n * ldinv * nw);
119 
120  for (int iw = 0; iw < nw; ++iw)
121  {
122  psiM_invM_ptrs_[iw] = psiM_fp_.device_data() + iw * n * ldinv;
123  psiM_invM_ptrs_[iw + nw] = inv_a_mats[iw].get().device_data();
124  // Since inv_a_mat can have a different leading dimension from a_mat first we remap copy on the host
125  simd::remapCopy(n, n, a_mats[iw].get().data(), lda, inv_a_mats[iw].get().data(), ldinv);
126  // Then copy a_mat in inv_a_mats to the device
127  cudaErrorCheck(cudaMemcpyAsync(inv_a_mats[iw].get().device_data(), inv_a_mats[iw].get().data(),
128  inv_a_mats[iw].get().size() * sizeof(VALUE_FP), cudaMemcpyHostToDevice, h_stream),
129  "cudaMemcpyAsync failed copying DiracMatrixBatch::psiM to device");
130  // On the device Here we transpose to a_mat;
131  cublasErrorCheck(cuBLAS::geam(h_cublas_, CUBLAS_OP_T, CUBLAS_OP_N, n, n, &host_one,
132  inv_a_mats[iw].get().device_data(), ldinv, &host_zero,
133  a_mats[iw].get().device_data(), lda, psiM_invM_ptrs_[iw], ldinv),
134  "cuBLAS::geam failed.");
135  }
136  pivots_.resize(n * nw);
137  infos_.resize(nw);
138  LU_diags_fp_.resize(n * nw);
139  cudaErrorCheck(cudaMemcpyAsync(psiM_invM_ptrs_.device_data(), psiM_invM_ptrs_.data(),
140  psiM_invM_ptrs_.size() * sizeof(VALUE_FP*), cudaMemcpyHostToDevice, h_stream),
141  "cudaMemcpyAsync psiM_invM_ptrs_ failed!");
142  cuBLAS_LU::computeInverseAndDetLog_batched(h_cublas_, h_stream, n, ldinv, psiM_invM_ptrs_.device_data(),
143  psiM_invM_ptrs_.device_data() + nw, LU_diags_fp_.device_data(),
144  pivots_.device_data(), infos_.data(), infos_.device_data(),
145  log_values.device_data(), nw);
146  for (int iw = 0; iw < nw; ++iw)
147  {
148  cudaErrorCheck(cudaMemcpyAsync(inv_a_mats[iw].get().data(), inv_a_mats[iw].get().device_data(),
149  inv_a_mats[iw].get().size() * sizeof(VALUE_FP), cudaMemcpyDeviceToHost, h_stream),
150  "cudaMemcpyAsync failed copying DiracMatrixBatch::inv_psiM to host");
151  }
152  cudaErrorCheck(cudaMemcpyAsync(log_values.data(), log_values.device_data(), log_values.size() * sizeof(LogValue),
153  cudaMemcpyDeviceToHost, h_stream),
154  "cudaMemcpyAsync log_values failed!");
155  cudaErrorCheck(cudaStreamSynchronize(h_stream), "cudaStreamSynchronize failed!");
156  }

◆ mw_computeInvertAndLog_stride()

void mw_computeInvertAndLog_stride ( compute::Queue< PlatformKind::CUDA > &  queue,
DualVector< VALUE_FP > &  psi_Ms,
DualVector< VALUE_FP > &  inv_Ms,
const int  n,
const int  lda,
DualVector< LogValue > &  log_values 
)
inline private

Calculates the actual inverse and log determinant on the accelerator, with psiMs and invMs widened to full precision and copied into contiguous vectors.

Parameters
    [in]     h_cublas    cublas handle; the h_stream handle is retrieved from it.
    [in,out] psi_Ms      matrices flattened into a single pinned vector, returned with LU matrices.
    [out]    inv_Ms      matrices flattened into a single pinned vector.
    [in]     n           matrices rank
    [in]     lda         leading dimension of each matrix
    [out]    log_values  log determinant value for each matrix, batch_size = log_values.size()

List of operations:

  1. batched. Transfer psi_Ms to device
  2. batched. LU and invert
  3. batched. Transfer inv_Ms to host
    Todo:
    Remove 1 and 3. Handle transfer at upper level.

Definition at line 175 of file DiracMatrixComputeCUDA.hpp.

References qmcplusplus::cuBLAS_LU::computeInverseAndDetLog_batched(), qmcplusplus::cudaErrorCheck(), cudaMemcpyAsync, cudaMemcpyDeviceToHost, cudaMemcpyHostToDevice, cudaStream_t, cudaStreamSynchronize, Vector< T, Alloc >::data(), Vector< T, Alloc >::device_data(), DiracMatrixComputeCUDA< VALUE_FP >::h_cublas_, DiracMatrixComputeCUDA< VALUE_FP >::infos_, qmcplusplus::lda, qmcplusplus::log_values(), DiracMatrixComputeCUDA< VALUE_FP >::LU_diags_fp_, qmcplusplus::n, DiracMatrixComputeCUDA< VALUE_FP >::pivots_, DiracMatrixComputeCUDA< VALUE_FP >::psiM_invM_ptrs_, qmcplusplus::queue, and Vector< T, Alloc >::size().

Referenced by DiracMatrixComputeCUDA< VALUE_FP >::invert_transpose(), and DiracMatrixComputeCUDA< VALUE_FP >::mw_invertTranspose().

181  {
182  // This is probably dodgy
183  const int nw = log_values.size();
184  psiM_invM_ptrs_.resize(nw * 2);
185  for (int iw = 0; iw < nw; ++iw)
186  {
187  psiM_invM_ptrs_[iw] = psi_Ms.device_data() + iw * n * lda;
188  psiM_invM_ptrs_[iw + nw] = inv_Ms.device_data() + iw * n * lda;
189  }
190  pivots_.resize(n * nw);
191  infos_.resize(nw);
192  LU_diags_fp_.resize(n * nw);
193 
194  cudaStream_t h_stream = queue.getNative();
195  cudaErrorCheck(cudaMemcpyAsync(psi_Ms.device_data(), psi_Ms.data(), psi_Ms.size() * sizeof(VALUE_FP),
196  cudaMemcpyHostToDevice, h_stream),
197  "cudaMemcpyAsync failed copying DiracMatrixBatch::psiM_fp to device");
198  cudaErrorCheck(cudaMemcpyAsync(psiM_invM_ptrs_.device_data(), psiM_invM_ptrs_.data(),
199  psiM_invM_ptrs_.size() * sizeof(VALUE_FP*), cudaMemcpyHostToDevice, h_stream),
200  "cudaMemcpyAsync psiM_invM_ptrs_ failed!");
201  cuBLAS_LU::computeInverseAndDetLog_batched(h_cublas_, h_stream, n, lda, psiM_invM_ptrs_.device_data(),
202  psiM_invM_ptrs_.device_data() + nw, LU_diags_fp_.device_data(),
203  pivots_.device_data(), infos_.data(), infos_.device_data(),
204  log_values.device_data(), nw);
205 #if NDEBUG
206  // This is very useful to see whether the data after all kernels and cublas calls are run is wrong on the device or due to copy.
207  // cuBLAS_LU::peekinvM_batched(h_stream, psiM_mw_ptr, invM_mw_ptr, pivots_.device_data(), infos_.device_data(),
208  // log_values.device_data(), nw);
209 #endif
210  cudaErrorCheck(cudaMemcpyAsync(inv_Ms.data(), inv_Ms.device_data(), inv_Ms.size() * sizeof(VALUE_FP),
211  cudaMemcpyDeviceToHost, h_stream),
212  "cudaMemcpyAsync failed copying back DiracMatrixBatch::invM_fp from device");
213  cudaErrorCheck(cudaMemcpyAsync(log_values.data(), log_values.device_data(), log_values.size() * sizeof(LogValue),
214  cudaMemcpyDeviceToHost, h_stream),
215  "cudaMemcpyAsync log_values failed!");
216  cudaErrorCheck(cudaStreamSynchronize(h_stream), "cudaStreamSynchronize failed!");
217  }

◆ mw_invertTranspose() [1/2]

std::enable_if_t<!std::is_same<VALUE_FP, TMAT>::value> mw_invertTranspose ( compute::Queue< PlatformKind::CUDA > &  queue,
const RefVector< const DualMatrix< TMAT >> &  a_mats,
const RefVector< DualMatrix< TMAT >> &  inv_a_mats,
DualVector< LogValue > &  log_values 
)
inline

Mixed precision specialization. When TMAT is not full precision, we still need to do the inversion and log determinant at full precision.

This is not yet optimized to transpose on the GPU.

List of operations:

  1. matrix-by-matrix. Transpose a_mat to psiM_fp_ used on host
  2. batched. Call mw_computeInvertAndLog_stride, H2D, invert, D2H
  3. matrix-by-matrix. Copy invM_fp_ to inv_a_mat on host. Transfer inv_a_mat to device.

Pros and cons:

  1. The transfer is batched, but the transfer size doubles due to precision promotion.
  2. Todo:
     Copying invM_fp_ to inv_a_mat on the device is desired. Transfer of inv_a_mat to the host should be handled by the upper-level code.
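For instance (illustrative, reusing the earlier sketch's names), single-precision walker matrices with double-precision computation select this overload:

    // TMAT = float, VALUE_FP = double: inversion and log run in double.
    std::vector<DualMatrix<float>> a_mats_sp(nw), inv_a_mats_sp(nw);
    // ... resize and fill as in the earlier sketch ...
    RefVector<const DualMatrix<float>> a_refs_sp(a_mats_sp.begin(), a_mats_sp.end());
    RefVector<DualMatrix<float>> inv_refs_sp(inv_a_mats_sp.begin(), inv_a_mats_sp.end());
    dmc.mw_invertTranspose(queue, a_refs_sp, inv_refs_sp, log_values);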

Definition at line 291 of file DiracMatrixComputeCUDA.hpp.

References Matrix< T, Alloc >::attachReference(), cublasErrorCheck, cublasSetStream, qmcplusplus::cudaErrorCheck(), cudaMemcpyAsync, cudaMemcpyHostToDevice, cudaStream_t, DiracMatrixComputeCUDA< VALUE_FP >::h_cublas_, DiracMatrixComputeCUDA< VALUE_FP >::invM_fp_, qmcplusplus::lda, qmcplusplus::log_values(), DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog_stride(), qmcplusplus::n, DiracMatrixComputeCUDA< VALUE_FP >::psiM_fp_, qmcplusplus::queue, and qmcplusplus::simd::transpose().

Referenced by qmcplusplus::TEST_CASE().

296  {
297  cudaStream_t h_stream = queue.getNative();
298  cublasErrorCheck(cublasSetStream(h_cublas_, h_stream), "cublasSetStream failed!");
299  assert(log_values.size() == a_mats.size());
300  const int nw = a_mats.size();
301  const int n = a_mats[0].get().rows();
302  const int lda = a_mats[0].get().cols();
303  size_t nsqr = n * n;
304  psiM_fp_.resize(n * lda * nw);
305  invM_fp_.resize(n * lda * nw);
306  std::fill(log_values.begin(), log_values.end(), LogValue{0.0, 0.0});
307  // making sure we know the log_values are zero'd on the device.
308  cudaErrorCheck(cudaMemcpyAsync(log_values.device_data(), log_values.data(), log_values.size() * sizeof(LogValue),
309  cudaMemcpyHostToDevice, h_stream),
310  "cudaMemcpyAsync failed copying DiracMatrixBatch::log_values to device");
311  for (int iw = 0; iw < nw; ++iw)
312  simd::transpose(a_mats[iw].get().data(), n, a_mats[iw].get().cols(), psiM_fp_.data() + nsqr * iw, n, lda);
313  mw_computeInvertAndLog_stride(queue, psiM_fp_, invM_fp_, n, lda, log_values);
314  for (int iw = 0; iw < a_mats.size(); ++iw)
315  {
316  DualMatrix<VALUE_FP> data_ref_matrix;
317  data_ref_matrix.attachReference(invM_fp_.data() + nsqr * iw, n, lda);
318  // We can't use operator= with different lda, ldb which can happen so we use this assignment which is over the
319  // smaller of the two's dimensions
320  inv_a_mats[iw].get().assignUpperLeft(data_ref_matrix);
321  cudaErrorCheck(cudaMemcpyAsync(inv_a_mats[iw].get().device_data(), inv_a_mats[iw].get().data(),
322  inv_a_mats[iw].get().size() * sizeof(TMAT), cudaMemcpyHostToDevice, h_stream),
323  "cudaMemcpyAsync of inv_a_mat to device failed!");
324  }
325  }

◆ mw_invertTranspose() [2/2]

std::enable_if_t<std::is_same<VALUE_FP, TMAT>::value> mw_invertTranspose ( compute::Queue< PlatformKind::CUDA > &  queue,
const RefVector< const DualMatrix< TMAT >> &  a_mats,
const RefVector< DualMatrix< TMAT >> &  inv_a_mats,
DualVector< LogValue > &  log_values 
)
inline

Batched inversion and calculation of log determinants.

When TMAT is full precision we can use a_mat and inv_a_mat directly. A side effect of this call is that the device copy of a_mats afterwards contains the LU factorization matrices.

Definition at line 333 of file DiracMatrixComputeCUDA.hpp.

References cublasErrorCheck, cublasSetStream, cudaStream_t, DiracMatrixComputeCUDA< VALUE_FP >::h_cublas_, qmcplusplus::log_values(), DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog(), qmcplusplus::n, and qmcplusplus::queue.

338  {
339  cudaStream_t h_stream = queue.getNative();
340  cublasErrorCheck(cublasSetStream(h_cublas_, h_stream), "cublasSetStream failed!");
341  assert(log_values.size() == a_mats.size());
342  const int n = a_mats[0].get().rows();
343  mw_computeInvertAndLog(queue, a_mats, inv_a_mats, n, log_values);
344  }

Member Data Documentation

◆ h_cublas_

cublasHandle_t h_cublas_
private

◆ host_one

VALUE_FP host_one {1.0}
private

◆ host_zero

VALUE_FP host_zero {0.0}
private

◆ infos_

DualVector<int> infos_
private

◆ invM_fp_

DualVector<VALUE_FP> invM_fp_
private

◆ LU_diags_fp_

DualVector<VALUE_FP> LU_diags_fp_
private

◆ pivots_

DualVector<int> pivots_
private

◆ psiM_fp_

DualVector<VALUE_FP> psiM_fp_
private

◆ psiM_invM_ptrs_

DualVector<VALUE_FP*> psiM_invM_ptrs_
private

Transfer buffer for device pointers to matrices.

The element count is usually low and the transfer launch costs exceed the transfers themselves. For this reason, it is beneficial to fuse multiple lists of pointers. Right now this buffer packs nw psiM pointers followed by nw invM pointers. Use it only within a function scope and do not rely on its previous value.

Definition at line 77 of file DiracMatrixComputeCUDA.hpp.

Referenced by DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog(), and DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog_stride().
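The packing convention, restated from mw_computeInvertAndLog_stride above as a sketch:

    // Layout: [ psiM_0 ... psiM_{nw-1} | invM_0 ... invM_{nw-1} ]
    psiM_invM_ptrs_.resize(nw * 2);
    for (int iw = 0; iw < nw; ++iw)
    {
      psiM_invM_ptrs_[iw]      = psi_Ms.device_data() + iw * n * lda; // nw psiM device pointers first
      psiM_invM_ptrs_[iw + nw] = inv_Ms.device_data() + iw * n * lda; // then nw invM device pointers
    }
    // A single fused H2D copy then transfers both pointer lists at once.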


The documentation for this class was generated from the following file:

DiracMatrixComputeCUDA.hpp