QMCPACK
DiracMatrixComputeCUDA< VALUE_FP > Class Template Reference

class defining a compute and memory resource to compute matrix inversions and the log determinants of a batch of DiracMatrices. More...


Public Member Functions

 DiracMatrixComputeCUDA ()
 
 DiracMatrixComputeCUDA (const DiracMatrixComputeCUDA &other)
 
std::unique_ptr< Resource > makeClone () const override
 
template<typename TMAT >
void invert_transpose (compute::BLASHandle< PlatformKind::CUDA > &cuda_handles, DualMatrix< TMAT > &a_mat, DualMatrix< TMAT > &inv_a_mat, DualVector< LogValue > &log_values)
 Given a_mat, returns the inverse of a_mat and the log determinant of a_mat. More...
 
template<typename TMAT >
std::enable_if_t<!std::is_same< VALUE_FP, TMAT >::value > mw_invertTranspose (compute::BLASHandle< PlatformKind::CUDA > &cuda_handles, const RefVector< const DualMatrix< TMAT >> &a_mats, const RefVector< DualMatrix< TMAT >> &inv_a_mats, DualVector< LogValue > &log_values)
 Mixed-precision specialization: when TMAT is not full precision, the inversion and log determinant are still computed at full precision. More...
 
template<typename TMAT >
std::enable_if_t< std::is_same< VALUE_FP, TMAT >::value > mw_invertTranspose (compute::BLASHandle< PlatformKind::CUDA > &cuda_handles, const RefVector< const DualMatrix< TMAT >> &a_mats, const RefVector< DualMatrix< TMAT >> &inv_a_mats, DualVector< LogValue > &log_values)
 Batched inversion and calculation of log determinants. More...
 
- Public Member Functions inherited from Resource
 Resource (const std::string &name)
 
virtual ~Resource ()=default
 
const std::string & getName () const
 

Private Types

using FullPrecReal = RealAlias< VALUE_FP >
 
using LogValue = std::complex< FullPrecReal >
 
template<typename T >
using DualMatrix = Matrix< T, PinnedDualAllocator< T > >
 
template<typename T >
using DualVector = Vector< T, PinnedDualAllocator< T > >
 

Private Member Functions

void mw_computeInvertAndLog (compute::BLASHandle< PlatformKind::CUDA > &cuda_handles, const RefVector< const DualMatrix< VALUE_FP >> &a_mats, const RefVector< DualMatrix< VALUE_FP >> &inv_a_mats, const int n, DualVector< LogValue > &log_values)
 Calculates the actual inverse and log determinant on the accelerator. More...
 
void mw_computeInvertAndLog_stride (compute::BLASHandle< PlatformKind::CUDA > &cuda_handles, DualVector< VALUE_FP > &psi_Ms, DualVector< VALUE_FP > &inv_Ms, const int n, const int lda, DualVector< LogValue > &log_values)
 Calculates the actual inverse and log determinant on the accelerator with psiMs and invMs widened to full precision and copied into contiguous vectors. More...
 

Private Attributes

DualVector< VALUE_FP > psiM_fp_
 
DualVector< VALUE_FP > invM_fp_
 
DualVector< VALUE_FP > LU_diags_fp_
 
DualVector< int > pivots_
 
DualVector< int > infos_
 
DualVector< VALUE_FP * > psiM_invM_ptrs_
 Transfer buffer for device pointers to matrices. More...
 
VALUE_FP host_one {1.0}
 
VALUE_FP host_zero {0.0}
 

Detailed Description

template<typename VALUE_FP>
class qmcplusplus::DiracMatrixComputeCUDA< VALUE_FP >

class defining a compute and memory resource to compute matrix inversions and the log determinants of a batch of DiracMatrices.

Multiplicity is one per crowd, not one per UpdateEngine. This matches the multiplicity of the accelerator calls and the batched resource requirement.

Template Parameters
VALUE_FP    the datatype used in the actual computation of the matrix inversion

There are no per-walker variables; resources specific to the per-crowd compute object are owned here. The compute object itself is the resource for the per-walker DiracDeterminantBatched. Resources used by this object but owned by the surrounding scope are passed in as arguments.
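As a rough orientation, here is a minimal sketch of that ownership model (illustrative only: num_crowds and the wrapping function are hypothetical, the namespace qualification of Resource is assumed, and only the names documented on this page are taken from the actual API):

#include <memory>
#include <vector>

using qmcplusplus::DiracMatrixComputeCUDA;
using qmcplusplus::Resource;

// Sketch: one DiracMatrixComputeCUDA per crowd, cloned from a prototype.
// Per-walker DiracDeterminantBatched objects only borrow the crowd's resource
// during a batched call; they never own a copy of it.
std::vector<std::unique_ptr<Resource>> makeCrowdResources(int num_crowds)
{
  DiracMatrixComputeCUDA<double> prototype;                // VALUE_FP = double
  std::vector<std::unique_ptr<Resource>> crowd_resources;
  for (int crowd = 0; crowd < num_crowds; ++crowd)
    crowd_resources.emplace_back(prototype.makeClone());   // each crowd gets its own scratch buffers
  return crowd_resources;
}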

Definition at line 44 of file DiracMatrixComputeCUDA.hpp.

Member Typedef Documentation

◆ DualMatrix

using DualMatrix = Matrix<T, PinnedDualAllocator<T> >
private

Definition at line 50 of file DiracMatrixComputeCUDA.hpp.

◆ DualVector

using DualVector = Vector<T, PinnedDualAllocator<T> >
private

Definition at line 53 of file DiracMatrixComputeCUDA.hpp.

◆ FullPrecReal

using FullPrecReal = RealAlias<VALUE_FP>
private

Definition at line 46 of file DiracMatrixComputeCUDA.hpp.

◆ LogValue

using LogValue = std::complex<FullPrecReal>
private

Definition at line 47 of file DiracMatrixComputeCUDA.hpp.
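
Presumably this follows the usual convention for complex log determinants: the real part of a LogValue holds log|det(A)| and the imaginary part holds the phase arg(det(A)), so that det(A) = exp(LogValue).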

Constructor & Destructor Documentation

◆ DiracMatrixComputeCUDA() [1/2]

DiracMatrixComputeCUDA ( )
inline

Definition at line 214 of file DiracMatrixComputeCUDA.hpp.

214 : Resource("DiracMatrixComputeCUDA") {}

◆ DiracMatrixComputeCUDA() [2/2]

DiracMatrixComputeCUDA ( const DiracMatrixComputeCUDA< VALUE_FP > &  other)
inline

Definition at line 216 of file DiracMatrixComputeCUDA.hpp.

216 : Resource(other.getName()) {}

Member Function Documentation

◆ invert_transpose()

void invert_transpose ( compute::BLASHandle< PlatformKind::CUDA > &  cuda_handles,
DualMatrix< TMAT > &  a_mat,
DualMatrix< TMAT > &  inv_a_mat,
DualVector< LogValue > &  log_values 
)
inline

Given a_mat, returns the inverse of a_mat and the log determinant of a_mat.

Parameters
[in]      a_mat         a matrix input
[out]     inv_a_mat     inverted matrix
[out]     log_values    the log determinant is returned in log_values[0]

I consider this single-matrix call to be semi-deprecated, so the log determinant values vector is used to match the primary batched interface to the accelerated routines. There is no optimization (yet) for TMAT being the same type as VALUE_FP.
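
For orientation, a minimal usage sketch follows (assumptions: a valid compute::BLASHandle<PlatformKind::CUDA> named cuda_handles and a DiracMatrixComputeCUDA<double> named dm_cuda already exist, n is defined by the caller, the namespace qualifications are approximate, and the statements are shown as they would appear inside the calling function):

using Value  = double;                                            // VALUE_FP == TMAT, full-precision path
using Mat    = qmcplusplus::Matrix<Value, qmcplusplus::PinnedDualAllocator<Value>>;
using LogVec = qmcplusplus::Vector<std::complex<double>, qmcplusplus::PinnedDualAllocator<std::complex<double>>>;

Mat a_mat(n, n), inv_a_mat(n, n);      // fill a_mat on the host before the call
LogVec log_values(1);                  // single matrix, so one log determinant
dm_cuda.invert_transpose(cuda_handles, a_mat, inv_a_mat, log_values);
// Afterwards inv_a_mat holds the inverse on host and device, and log_values[0]
// holds the complex log determinant of a_mat.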

Definition at line 230 of file DiracMatrixComputeCUDA.hpp.

References Matrix< T, Alloc >::assignUpperLeft(), Matrix< T, Alloc >::attachReference(), Matrix< T, Alloc >::cols(), qmcplusplus::cudaErrorCheck(), cudaMemcpyAsync, cudaMemcpyHostToDevice, Matrix< T, Alloc >::data(), Matrix< T, Alloc >::device_data(), BLASHandle< PlatformKind::CUDA >::h_stream, DiracMatrixComputeCUDA< VALUE_FP >::invM_fp_, qmcplusplus::lda, qmcplusplus::log_values(), DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog_stride(), qmcplusplus::n, DiracMatrixComputeCUDA< VALUE_FP >::psiM_fp_, Matrix< T, Alloc >::rows(), Matrix< T, Alloc >::size(), and qmcplusplus::simd::transpose().

Referenced by qmcplusplus::TEST_CASE().

234  {
235  const int n = a_mat.rows();
236  const int lda = a_mat.cols();
237  psiM_fp_.resize(n * lda);
238  invM_fp_.resize(n * lda);
239  std::fill(log_values.begin(), log_values.end(), LogValue{0.0, 0.0});
240  // making sure we know the log_values are zero'd on the device.
241  cudaErrorCheck(cudaMemcpyAsync(log_values.device_data(), log_values.data(), log_values.size() * sizeof(LogValue),
242  cudaMemcpyHostToDevice, cuda_handles.h_stream),
243  "cudaMemcpyAsync failed copying DiracMatrixBatch::log_values to device");
244  simd::transpose(a_mat.data(), n, lda, psiM_fp_.data(), n, lda);
245  cudaErrorCheck(cudaMemcpyAsync(psiM_fp_.device_data(), psiM_fp_.data(), psiM_fp_.size() * sizeof(VALUE_FP),
246  cudaMemcpyHostToDevice, cuda_handles.h_stream),
247  "cudaMemcpyAsync failed copying DiracMatrixBatch::psiM_fp to device");
249  DualMatrix<VALUE_FP> data_ref_matrix;
250 
251  data_ref_matrix.attachReference(invM_fp_.data(), n, n);
252 
253  // We can't use operator= with different lda, ldb which can happen so we use this assignment which is over the
254  // smaller of the two's dimensions
255  inv_a_mat.assignUpperLeft(data_ref_matrix);
256  cudaErrorCheck(cudaMemcpyAsync(inv_a_mat.device_data(), inv_a_mat.data(), inv_a_mat.size() * sizeof(TMAT),
257  cudaMemcpyHostToDevice, cuda_handles.h_stream),
258  "cudaMemcpyAsync of inv_a_mat to device failed!");
259  }

◆ makeClone()

std::unique_ptr<Resource> makeClone ( ) const
inline override virtual

Implements Resource.

Definition at line 218 of file DiracMatrixComputeCUDA.hpp.

218 { return std::make_unique<DiracMatrixComputeCUDA>(*this); }

◆ mw_computeInvertAndLog()

void mw_computeInvertAndLog ( compute::BLASHandle< PlatformKind::CUDA > &  cuda_handles,
const RefVector< const DualMatrix< VALUE_FP >> &  a_mats,
const RefVector< DualMatrix< VALUE_FP >> &  inv_a_mats,
const int  n,
DualVector< LogValue > &  log_values 
)
inline private

Calculates the actual inv and log determinant on accelerator.

Parameters
[in]      h_cublas      cublas handle; the hstream handle is retrieved from it.
[in,out]  a_mats        dual A matrices; they will be transposed on the device side as a side effect.
[out]     inv_a_mats    dual invM matrices
[in]      n             matrix rank.
[out]     log_values    log determinant value for each matrix; batch_size = log_values.size()

On Volta, so far little seems to be gained by making the matrices contiguous.

List of operations:

  1. matrix-by-matrix. Copy a_mat to inv_a_mat on host, transfer inv_a_mat to device, transpose inv_a_mat to a_mat on device.
  2. batched. LU and invert
  3. matrix-by-matrix. Transfer inv_a_mat to host

Pros and cons:

  1. Todo:
    try doing this like mw_computeInvertAndLog_stride: copy and transpose into psiM_fp_ and fuse the transfer.
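
The device-side transpose in step 1 uses cuBLAS geam, which computes C = alpha*op(A) + beta*op(B); with alpha = 1, beta = 0, and op(A) = A^T it acts as an out-of-place transpose. A standalone illustration against the plain cuBLAS API (a sketch, not QMCPACK's cuBLAS::geam wrapper):

#include <cublas_v2.h>

// Transpose an n x n double matrix on the device: C = 1.0 * A^T + 0.0 * B.
// Passing C as B with beta == 0 and CUBLAS_OP_N is the documented in-place form of geam.
void transpose_on_device(cublasHandle_t handle, int n, const double* A, int lda, double* C, int ldc)
{
  const double one = 1.0, zero = 0.0;
  cublasDgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N, n, n,
              &one, A, lda,
              &zero, C, ldc,
              C, ldc);
}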

Definition at line 97 of file DiracMatrixComputeCUDA.hpp.

References qmcplusplus::cuBLAS_LU::computeInverseAndDetLog_batched(), CUBLAS_OP_N, CUBLAS_OP_T, cublasErrorCheck, cublasHandle_t, qmcplusplus::cudaErrorCheck(), cudaMemcpyAsync, cudaMemcpyDeviceToHost, cudaMemcpyHostToDevice, cudaStream_t, cudaStreamSynchronize, qmcplusplus::cuBLAS::geam(), BLASHandle< PlatformKind::CUDA >::h_cublas, BLASHandle< PlatformKind::CUDA >::h_stream, DiracMatrixComputeCUDA< VALUE_FP >::host_one, DiracMatrixComputeCUDA< VALUE_FP >::host_zero, qmcplusplus::hstream, DiracMatrixComputeCUDA< VALUE_FP >::infos_, qmcplusplus::lda, qmcplusplus::log_values(), DiracMatrixComputeCUDA< VALUE_FP >::LU_diags_fp_, qmcplusplus::n, DiracMatrixComputeCUDA< VALUE_FP >::pivots_, DiracMatrixComputeCUDA< VALUE_FP >::psiM_fp_, DiracMatrixComputeCUDA< VALUE_FP >::psiM_invM_ptrs_, and qmcplusplus::simd::remapCopy().

Referenced by DiracMatrixComputeCUDA< VALUE_FP >::mw_invertTranspose().

102  {
103  const int nw = a_mats.size();
104  assert(a_mats.size() == inv_a_mats.size());
105 
106  psiM_invM_ptrs_.resize(nw * 2);
107  const int lda = a_mats[0].get().cols();
108  const int ldinv = inv_a_mats[0].get().cols();
109  cudaStream_t hstream = cuda_handles.h_stream;
110  cublasHandle_t h_cublas = cuda_handles.h_cublas;
111  psiM_fp_.resize(n * ldinv * nw);
112 
113  for (int iw = 0; iw < nw; ++iw)
114  {
115  psiM_invM_ptrs_[iw] = psiM_fp_.device_data() + iw * n * ldinv;
116  psiM_invM_ptrs_[iw + nw] = inv_a_mats[iw].get().device_data();
117  // Since inv_a_mat can have a different leading dimension from a_mat first we remap copy on the host
118  simd::remapCopy(n, n, a_mats[iw].get().data(), lda, inv_a_mats[iw].get().data(), ldinv);
119  // Then copy a_mat in inv_a_mats to the device
120  cudaErrorCheck(cudaMemcpyAsync(inv_a_mats[iw].get().device_data(), inv_a_mats[iw].get().data(),
121  inv_a_mats[iw].get().size() * sizeof(VALUE_FP), cudaMemcpyHostToDevice, hstream),
122  "cudaMemcpyAsync failed copying DiracMatrixBatch::psiM to device");
123  // On the device here we transpose to a_mat;
124  cublasErrorCheck(cuBLAS::geam(h_cublas, CUBLAS_OP_T, CUBLAS_OP_N, n, n, &host_one,
125  inv_a_mats[iw].get().device_data(), ldinv, &host_zero,
126  a_mats[iw].get().device_data(), lda, psiM_invM_ptrs_[iw], ldinv),
127  "cuBLAS::geam failed.");
128  }
129  pivots_.resize(n * nw);
130  infos_.resize(nw);
131  LU_diags_fp_.resize(n * nw);
132  cudaErrorCheck(cudaMemcpyAsync(psiM_invM_ptrs_.device_data(), psiM_invM_ptrs_.data(),
133  psiM_invM_ptrs_.size() * sizeof(VALUE_FP*), cudaMemcpyHostToDevice, hstream),
134  "cudaMemcpyAsync psiM_invM_ptrs_ failed!");
135  cuBLAS_LU::computeInverseAndDetLog_batched(h_cublas, hstream, n, ldinv, psiM_invM_ptrs_.device_data(),
136  psiM_invM_ptrs_.device_data() + nw, LU_diags_fp_.device_data(),
137  pivots_.device_data(), infos_.data(), infos_.device_data(),
138  log_values.device_data(), nw);
139  for (int iw = 0; iw < nw; ++iw)
140  {
141  cudaErrorCheck(cudaMemcpyAsync(inv_a_mats[iw].get().data(), inv_a_mats[iw].get().device_data(),
142  inv_a_mats[iw].get().size() * sizeof(VALUE_FP), cudaMemcpyDeviceToHost, hstream),
143  "cudaMemcpyAsync failed copying DiracMatrixBatch::inv_psiM to host");
144  }
145  cudaErrorCheck(cudaMemcpyAsync(log_values.data(), log_values.device_data(), log_values.size() * sizeof(LogValue),
147  "cudaMemcpyAsync log_values failed!");
148  cudaErrorCheck(cudaStreamSynchronize(hstream), "cudaStreamSynchronize failed!");
149  }

◆ mw_computeInvertAndLog_stride()

void mw_computeInvertAndLog_stride ( compute::BLASHandle< PlatformKind::CUDA > &  cuda_handles,
DualVector< VALUE_FP > &  psi_Ms,
DualVector< VALUE_FP > &  inv_Ms,
const int  n,
const int  lda,
DualVector< LogValue > &  log_values 
)
inline private

Calculates the actual inverse and log determinant on the accelerator with psiMs and invMs widened to full precision and copied into contiguous vectors.

Parameters
[in]      h_cublas      cublas handle; the hstream handle is retrieved from it.
[in,out]  psi_Ms        matrices flattened into a single pinned vector, returned with the LU matrices.
[out]     inv_Ms        matrices flattened into a single pinned vector.
[in]      n             matrix rank.
[in]      lda           leading dimension of each matrix
[out]     log_values    log determinant value for each matrix; batch_size = log_values.size()

List of operations:

  1. batched. Transfer psi_Ms to device
  2. batched. LU and invert
  3. batched. Transfer inv_Ms to host
    Todo:
    Remove 1 and 3. Handle transfer at upper level.
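
The "stride" refers to the packed layout the caller must provide: matrix iw occupies the range [iw*n*lda, (iw+1)*n*lda) in both psi_Ms and inv_Ms, and the batched LU routine receives one device pointer per matrix. A plain C++ sketch of building that pointer list on the host (names are illustrative; it mirrors the layout of psiM_invM_ptrs_, with the first nw entries pointing at psiM blocks and the next nw at invM blocks):

#include <cstddef>
#include <vector>

// nw matrices of rank n with leading dimension lda, packed back to back.
std::vector<double*> makeBatchPointers(double* psi_Ms_dev, double* inv_Ms_dev, int nw, int n, int lda)
{
  std::vector<double*> ptrs(2 * static_cast<std::size_t>(nw));
  for (int iw = 0; iw < nw; ++iw)
  {
    const std::size_t offset = static_cast<std::size_t>(iw) * n * lda;
    ptrs[iw]      = psi_Ms_dev + offset;   // input/LU block for matrix iw
    ptrs[iw + nw] = inv_Ms_dev + offset;   // inverse block for matrix iw
  }
  return ptrs;
}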

Definition at line 168 of file DiracMatrixComputeCUDA.hpp.

References qmcplusplus::cuBLAS_LU::computeInverseAndDetLog_batched(), cublasHandle_t, qmcplusplus::cudaErrorCheck(), cudaMemcpyAsync, cudaMemcpyDeviceToHost, cudaMemcpyHostToDevice, cudaStream_t, cudaStreamSynchronize, Vector< T, Alloc >::data(), Vector< T, Alloc >::device_data(), BLASHandle< PlatformKind::CUDA >::h_cublas, BLASHandle< PlatformKind::CUDA >::h_stream, qmcplusplus::hstream, DiracMatrixComputeCUDA< VALUE_FP >::infos_, qmcplusplus::lda, qmcplusplus::log_values(), DiracMatrixComputeCUDA< VALUE_FP >::LU_diags_fp_, qmcplusplus::n, DiracMatrixComputeCUDA< VALUE_FP >::pivots_, DiracMatrixComputeCUDA< VALUE_FP >::psiM_invM_ptrs_, and Vector< T, Alloc >::size().

Referenced by DiracMatrixComputeCUDA< VALUE_FP >::invert_transpose(), and DiracMatrixComputeCUDA< VALUE_FP >::mw_invertTranspose().

174  {
175  // This is probably dodgy
176  const int nw = log_values.size();
177  psiM_invM_ptrs_.resize(nw * 2);
178  for (int iw = 0; iw < nw; ++iw)
179  {
180  psiM_invM_ptrs_[iw] = psi_Ms.device_data() + iw * n * lda;
181  psiM_invM_ptrs_[iw + nw] = inv_Ms.device_data() + iw * n * lda;
182  }
183  pivots_.resize(n * nw);
184  infos_.resize(nw);
185  LU_diags_fp_.resize(n * nw);
186 
187  cudaStream_t hstream = cuda_handles.h_stream;
188  cublasHandle_t h_cublas = cuda_handles.h_cublas;
189  cudaErrorCheck(cudaMemcpyAsync(psi_Ms.device_data(), psi_Ms.data(), psi_Ms.size() * sizeof(VALUE_FP),
191  "cudaMemcpyAsync failed copying DiracMatrixBatch::psiM_fp to device");
192  cudaErrorCheck(cudaMemcpyAsync(psiM_invM_ptrs_.device_data(), psiM_invM_ptrs_.data(),
193  psiM_invM_ptrs_.size() * sizeof(VALUE_FP*), cudaMemcpyHostToDevice, hstream),
194  "cudaMemcpyAsync psiM_invM_ptrs_ failed!");
195  cuBLAS_LU::computeInverseAndDetLog_batched(h_cublas, hstream, n, lda, psiM_invM_ptrs_.device_data(),
196  psiM_invM_ptrs_.device_data() + nw, LU_diags_fp_.device_data(),
197  pivots_.device_data(), infos_.data(), infos_.device_data(),
198  log_values.device_data(), nw);
199 #if NDEBUG
200  // This is very useful to see whether the data after all kernels and cublas calls are run is wrong on the device or due to copy.
201  // cuBLAS_LU::peekinvM_batched(hstream, psiM_mw_ptr, invM_mw_ptr, pivots_.device_data(), infos_.device_data(),
202  // log_values.device_data(), nw);
203 #endif
204  cudaErrorCheck(cudaMemcpyAsync(inv_Ms.data(), inv_Ms.device_data(), inv_Ms.size() * sizeof(VALUE_FP),
206  "cudaMemcpyAsync failed copying back DiracMatrixBatch::invM_fp from device");
207  cudaErrorCheck(cudaMemcpyAsync(log_values.data(), log_values.device_data(), log_values.size() * sizeof(LogValue),
209  "cudaMemcpyAsync log_values failed!");
210  cudaErrorCheck(cudaStreamSynchronize(hstream), "cudaStreamSynchronize failed!");
211  }

◆ mw_invertTranspose() [1/2]

std::enable_if_t<!std::is_same<VALUE_FP, TMAT>::value> mw_invertTranspose ( compute::BLASHandle< PlatformKind::CUDA > &  cuda_handles,
const RefVector< const DualMatrix< TMAT >> &  a_mats,
const RefVector< DualMatrix< TMAT >> &  inv_a_mats,
DualVector< LogValue > &  log_values 
)
inline

Mixed-precision specialization: when TMAT is not full precision, the inversion and log determinant are still computed at full precision.

This is not yet optimized to transpose on the GPU.

List of operations:

  1. matrix-by-matrix. Transpose a_mat to psiM_fp_ used on host
  2. batched. Call mw_computeInvertAndLog_stride, H2D, invert, D2H
  3. matrix-by-matrix. Copy invM_fp_ to inv_a_mat on host. Transfer inv_a_mat to device.

Pros and cons:

  1. The transfer is batched, but it is double the size due to precision promotion.
  2. Todo:
    Copying invM_fp_ to inv_a_mat on the device is desired. The transfer of inv_a_mat to the host should be handled by the upper-level code.
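
The precision promotion in step 1 happens on the host during the transpose: each TMAT element (e.g. float) is widened to VALUE_FP (e.g. double) while being written into psiM_fp_. A standalone sketch of such a widening transpose (plain C++; an illustration of the idea, not QMCPACK's simd::transpose itself):

#include <cstddef>

// Transpose A (m x n, leading dimension lda, single precision) into
// B (n x m, leading dimension ldb, double precision), widening element by element.
void transposeWiden(const float* A, std::size_t m, std::size_t lda,
                    double* B, std::size_t n, std::size_t ldb)
{
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < m; ++j)
      B[i * ldb + j] = static_cast<double>(A[j * lda + i]);
}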

Definition at line 275 of file DiracMatrixComputeCUDA.hpp.

References Matrix< T, Alloc >::attachReference(), qmcplusplus::cudaErrorCheck(), cudaMemcpyAsync, cudaMemcpyHostToDevice, BLASHandle< PlatformKind::CUDA >::h_stream, DiracMatrixComputeCUDA< VALUE_FP >::invM_fp_, qmcplusplus::lda, qmcplusplus::log_values(), DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog_stride(), qmcplusplus::n, DiracMatrixComputeCUDA< VALUE_FP >::psiM_fp_, and qmcplusplus::simd::transpose().

Referenced by qmcplusplus::TEST_CASE().

280  {
281  assert(log_values.size() == a_mats.size());
282  const int nw = a_mats.size();
283  const int n = a_mats[0].get().rows();
284  const int lda = a_mats[0].get().cols();
285  size_t nsqr = n * n;
286  psiM_fp_.resize(n * lda * nw);
287  invM_fp_.resize(n * lda * nw);
288  std::fill(log_values.begin(), log_values.end(), LogValue{0.0, 0.0});
289  // making sure we know the log_values are zero'd on the device.
290  cudaErrorCheck(cudaMemcpyAsync(log_values.device_data(), log_values.data(), log_values.size() * sizeof(LogValue),
291  cudaMemcpyHostToDevice, cuda_handles.h_stream),
292  "cudaMemcpyAsync failed copying DiracMatrixBatch::log_values to device");
293  for (int iw = 0; iw < nw; ++iw)
294  simd::transpose(a_mats[iw].get().data(), n, a_mats[iw].get().cols(), psiM_fp_.data() + nsqr * iw, n, lda);
295  mw_computeInvertAndLog_stride(cuda_handles, psiM_fp_, invM_fp_, n, lda, log_values);
296  for (int iw = 0; iw < a_mats.size(); ++iw)
297  {
298  DualMatrix<VALUE_FP> data_ref_matrix;
299  data_ref_matrix.attachReference(invM_fp_.data() + nsqr * iw, n, lda);
300  // We can't use operator= with different lda, ldb which can happen so we use this assignment which is over the
301  // smaller of the two's dimensions
302  inv_a_mats[iw].get().assignUpperLeft(data_ref_matrix);
303  cudaErrorCheck(cudaMemcpyAsync(inv_a_mats[iw].get().device_data(), inv_a_mats[iw].get().data(),
304  inv_a_mats[iw].get().size() * sizeof(TMAT), cudaMemcpyHostToDevice,
305  cuda_handles.h_stream),
306  "cudaMemcpyAsync of inv_a_mat to device failed!");
307  }
308  }

◆ mw_invertTranspose() [2/2]

std::enable_if_t<std::is_same<VALUE_FP, TMAT>::value> mw_invertTranspose ( compute::BLASHandle< PlatformKind::CUDA > &  cuda_handles,
const RefVector< const DualMatrix< TMAT >> &  a_mats,
const RefVector< DualMatrix< TMAT >> &  inv_a_mats,
DualVector< LogValue > &  log_values 
)
inline

Batched inversion and calculation of log determinants.

When TMAT is full precision we can use a_mat and inv_a_mat directly. A side effect is that after this call the device copy of a_mats contains the LU factorization matrices.
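
A minimal sketch of the batched call for this full-precision case (assumptions: cuda_handles, a DiracMatrixComputeCUDA<double> named dm_cuda, nw, and per-walker matrices walker_psiM / walker_invM already exist; RefVector is taken to be a vector of std::reference_wrapper; namespace qualifications are approximate; statements are shown as they would appear inside the calling function):

using Value  = double;                                            // VALUE_FP == TMAT selects this overload
using Mat    = qmcplusplus::Matrix<Value, qmcplusplus::PinnedDualAllocator<Value>>;
using LogVec = qmcplusplus::Vector<std::complex<double>, qmcplusplus::PinnedDualAllocator<std::complex<double>>>;

qmcplusplus::RefVector<const Mat> a_mats;      // one entry per walker in the crowd
qmcplusplus::RefVector<Mat>       inv_a_mats;
for (int iw = 0; iw < nw; ++iw)
{
  a_mats.push_back(walker_psiM[iw]);
  inv_a_mats.push_back(walker_invM[iw]);
}
LogVec log_values(nw);                         // log_values.size() sets the batch size
dm_cuda.mw_invertTranspose(cuda_handles, a_mats, inv_a_mats, log_values);
// Afterwards each inv_a_mats[iw] holds the inverse, and the device copy of
// a_mats[iw] holds that matrix's LU factorization (the documented side effect).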

Definition at line 316 of file DiracMatrixComputeCUDA.hpp.

References qmcplusplus::log_values(), DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog(), and qmcplusplus::n.

321  {
322  assert(log_values.size() == a_mats.size());
323  const int n = a_mats[0].get().rows();
324  mw_computeInvertAndLog(cuda_handles, a_mats, inv_a_mats, n, log_values);
325  }

Member Data Documentation

◆ host_one

VALUE_FP host_one {1.0}
private

◆ host_zero

VALUE_FP host_zero {0.0}
private

◆ infos_

◆ invM_fp_

◆ LU_diags_fp_

◆ pivots_

◆ psiM_fp_

◆ psiM_invM_ptrs_

DualVector<VALUE_FP*> psiM_invM_ptrs_
private

Transfer buffer for device pointers to matrices.

The element count is usually low and the transfer launch cost exceeds the transfer itself, so it is beneficial to fuse multiple lists of pointers. Right now this buffer packs nw psiM pointers followed by nw invM pointers. Use it only within a function scope and do not rely on its previous value.
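
Concretely, the fusion means a single host-to-device copy moves both pointer lists. A sketch of that one transfer as it appears inside the batched calls (statement shown in isolation; buffer and stream names as used in the listings above):

// One fused async copy: the buffer holds nw psiM pointers followed by nw invM
// pointers, so a single launch transfers both lists.
cudaErrorCheck(cudaMemcpyAsync(psiM_invM_ptrs_.device_data(), psiM_invM_ptrs_.data(),
                               psiM_invM_ptrs_.size() * sizeof(VALUE_FP*),
                               cudaMemcpyHostToDevice, hstream),
               "cudaMemcpyAsync psiM_invM_ptrs_ failed!");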

Definition at line 72 of file DiracMatrixComputeCUDA.hpp.

Referenced by DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog(), and DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog_stride().


The documentation for this class was generated from the following file: