QMCPACK
DiracMatrixComputeCUDA< VALUE_FP > Class Template Reference

Class defining a compute and memory resource to compute matrix inversions and the log determinants of a batch of DiracMatrix objects. More...


Public Member Functions

 DiracMatrixComputeCUDA ()
 
 DiracMatrixComputeCUDA (const DiracMatrixComputeCUDA &other)
 
 ~DiracMatrixComputeCUDA ()
 
std::unique_ptr< Resource > makeClone () const override
 
template<typename TMAT >
void invert_transpose (compute::Queue< PlatformKind::CUDA > &queue, DualMatrix< TMAT > &a_mat, DualMatrix< TMAT > &inv_a_mat, DualVector< LogValue > &log_values)
 Given a_mat, returns the inverted matrix inv_a_mat and the log determinant of a_mat. More...
 
template<typename TMAT >
std::enable_if_t<!std::is_same< VALUE_FP, TMAT >::value > mw_invertTranspose (compute::Queue< PlatformKind::CUDA > &queue, const RefVector< const DualMatrix< TMAT >> &a_mats, const RefVector< DualMatrix< TMAT >> &inv_a_mats, DualVector< LogValue > &log_values)
 Mixed precision specialization. When TMAT is not full precision, we still need to do the inversion and log determinant at full precision. More...
 
template<typename TMAT >
std::enable_if_t< std::is_same< VALUE_FP, TMAT >::value > mw_invertTranspose (compute::Queue< PlatformKind::CUDA > &queue, const RefVector< const DualMatrix< TMAT >> &a_mats, const RefVector< DualMatrix< TMAT >> &inv_a_mats, DualVector< LogValue > &log_values)
 Batched inversion and calculation of log determinants. More...
 
Public Member Functions inherited from Resource
 Resource (const std::string &name)
 
virtual ~Resource ()=default
 
const std::string & getName () const
 

Private Types

using FullPrecReal = RealAlias< VALUE_FP >
 
using LogValue = std::complex< FullPrecReal >
 
template<typename T >
using DualMatrix = Matrix< T, PinnedDualAllocator< T > >
 
template<typename T >
using DualVector = Vector< T, PinnedDualAllocator< T > >
 

Private Member Functions

void mw_computeInvertAndLog (compute::Queue< PlatformKind::CUDA > &queue, const RefVector< const DualMatrix< VALUE_FP >> &a_mats, const RefVector< DualMatrix< VALUE_FP >> &inv_a_mats, const int n, DualVector< LogValue > &log_values)
 Calculates the actual inverse and log determinant on the accelerator. More...
 
void mw_computeInvertAndLog_stride (compute::Queue< PlatformKind::CUDA > &queue, DualVector< VALUE_FP > &psi_Ms, DualVector< VALUE_FP > &inv_Ms, const int n, const int lda, DualVector< LogValue > &log_values)
 Calculates the actual inverse and log determinant on the accelerator, with psiMs and invMs widened to full precision and copied into contiguous vectors. More...
 

Private Attributes

DualVector< VALUE_FP > psiM_fp_
 
DualVector< VALUE_FP > invM_fp_
 
DualVector< VALUE_FP > LU_diags_fp_
 
DualVector< int > pivots_
 
DualVector< int > infos_
 
DualVector< VALUE_FP * > psiM_invM_ptrs_
 Transfer buffer for device pointers to matrices. More...
 
VALUE_FP host_one {1.0}
 
VALUE_FP host_zero {0.0}
 
cublasHandle_t h_cublas_
 

Detailed Description

template<typename VALUE_FP>
class qmcplusplus::DiracMatrixComputeCUDA< VALUE_FP >

Class defining a compute and memory resource to compute matrix inversions and the log determinants of a batch of DiracMatrix objects.

Multiplicity is one per crowd, not one per UpdateEngine. It matches the multiplicity of the accelerator call and the batched resource requirement.

Template Parameters
    VALUE_FP  the datatype used in the actual computation of matrix inversion

There are no per-walker variables; resources specific to the per-crowd compute object are owned here. The compute object itself is the resource of the per-walker DiracDeterminantBatched. Resources used by this object but owned by the surrounding scope are passed as arguments.

All the public APIs are synchronous. The asynchronous queue argument gets synchronized before return. rocBLAS, indirectly used via hipBLAS, requires synchronizing the old stream before setting a new one. We don't need to actively synchronize the old stream because it gets synchronized right after each use.
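As a rough sketch of the intended call pattern (illustrative only: the queue construction, sizes, and matrix fill are hypothetical, and RefVector is assumed to be a std::vector of std::reference_wrapper as elsewhere in QMCPACK):

    #include <complex>
    #include <vector>
    using namespace qmcplusplus;

    using Value    = double;               // VALUE_FP
    using LogValue = std::complex<double>; // std::complex<FullPrecReal>
    template<typename T>
    using DualMatrix = Matrix<T, PinnedDualAllocator<T>>;
    template<typename T>
    using DualVector = Vector<T, PinnedDualAllocator<T>>;

    void crowd_invert_sketch()
    {
      const int n  = 4; // matrix rank (illustrative)
      const int nw = 8; // walkers in the crowd (illustrative)

      compute::Queue<PlatformKind::CUDA> queue;
      DiracMatrixComputeCUDA<Value> dmc; // one resource per crowd, not per walker

      std::vector<DualMatrix<Value>> a_mats(nw), inv_a_mats(nw);
      for (int iw = 0; iw < nw; ++iw)
      {
        a_mats[iw].resize(n, n);
        inv_a_mats[iw].resize(n, n);
        // ... fill a_mats[iw] on the host ...
      }
      DualVector<LogValue> log_values(nw);

      // Build the reference vectors over the owning storage.
      RefVector<const DualMatrix<Value>> a_refs(a_mats.begin(), a_mats.end());
      RefVector<DualMatrix<Value>> inv_refs(inv_a_mats.begin(), inv_a_mats.end());

      // Synchronous: the queue argument is synchronized before return.
      dmc.mw_invertTranspose(queue, a_refs, inv_refs, log_values);
      // log_values[iw] now holds the log determinant of a_mats[iw].
    }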

Definition at line 49 of file DiracMatrixComputeCUDA.hpp.

Member Typedef Documentation

◆ DualMatrix

using DualMatrix = Matrix<T, PinnedDualAllocator<T> >
private

Definition at line 55 of file DiracMatrixComputeCUDA.hpp.

◆ DualVector

using DualVector = Vector<T, PinnedDualAllocator<T> >
private

Definition at line 58 of file DiracMatrixComputeCUDA.hpp.

◆ FullPrecReal

using FullPrecReal = RealAlias<VALUE_FP>
private

Definition at line 51 of file DiracMatrixComputeCUDA.hpp.

◆ LogValue

using LogValue = std::complex<FullPrecReal>
private

Definition at line 52 of file DiracMatrixComputeCUDA.hpp.
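As a reading aid (the standard complex-logarithm convention, not specific to this class), for a matrix A the stored value satisfies

    \log\det A = \log\lvert\det A\rvert + i\,\arg(\det A), \qquad \det A = e^{\log\det A}

so the real part carries the magnitude and the imaginary part the phase of the determinant.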

Constructor & Destructor Documentation

◆ DiracMatrixComputeCUDA() [1/2]

DiracMatrixComputeCUDA ( )
inline

Definition at line 220 of file DiracMatrixComputeCUDA.hpp.

References cublasCreate, cublasErrorCheck, and DiracMatrixComputeCUDA< VALUE_FP >::h_cublas_.

220  : Resource("DiracMatrixComputeCUDA")
221  {
222  cublasErrorCheck(cublasCreate(&h_cublas_), "cublasCreate failed!");
223  }

◆ DiracMatrixComputeCUDA() [2/2]

DiracMatrixComputeCUDA ( const DiracMatrixComputeCUDA< VALUE_FP > &  other)
inline

Definition at line 225 of file DiracMatrixComputeCUDA.hpp.

References cublasCreate, cublasErrorCheck, and DiracMatrixComputeCUDA< VALUE_FP >::h_cublas_.

225  : Resource(other.getName())
226  {
227  cublasErrorCheck(cublasCreate(&h_cublas_), "cublasCreate failed!");
228  }

◆ ~DiracMatrixComputeCUDA()

~DiracMatrixComputeCUDA ( )
inline

Definition at line 230 of file DiracMatrixComputeCUDA.hpp.

References cublasDestroy, cublasErrorCheck, and DiracMatrixComputeCUDA< VALUE_FP >::h_cublas_.

230 { cublasErrorCheck(cublasDestroy(h_cublas_), "cublasDestroy failed!"); }

Member Function Documentation

◆ invert_transpose()

void invert_transpose ( compute::Queue< PlatformKind::CUDA > &  queue,
DualMatrix< TMAT > &  a_mat,
DualMatrix< TMAT > &  inv_a_mat,
DualVector< LogValue > &  log_values 
)
inline

Given a_mat, returns the inverted matrix inv_a_mat and the log determinant of a_mat.

Parameters
    [in]  a_mat       a matrix input
    [out] inv_a_mat   inverted matrix
    [out] log_values  log determinant is in log_values[0]

I consider this single call to be semi-deprecated, so the log determinant values vector is used to match the primary batched interface to the accelerated routines. There is no optimization (yet) for TMAT being the same type as TREAL.
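For illustration, a hedged single-matrix call (reusing the illustrative aliases from the detailed description; sizes and fill are hypothetical):

    DualMatrix<Value> a_mat(n, n), inv_a_mat(n, n);
    DualVector<LogValue> log_values(1); // the single-call interface still takes a vector
    // ... fill a_mat on the host ...
    dmc.invert_transpose(queue, a_mat, inv_a_mat, log_values);
    // The log determinant of a_mat is in log_values[0].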

Definition at line 244 of file DiracMatrixComputeCUDA.hpp.

References Matrix< T, Alloc >::assignUpperLeft(), Matrix< T, Alloc >::attachReference(), Matrix< T, Alloc >::cols(), cublasErrorCheck, cublasSetStream, qmcplusplus::cudaErrorCheck(), cudaMemcpyAsync, cudaMemcpyHostToDevice, cudaStream_t, Matrix< T, Alloc >::data(), Matrix< T, Alloc >::device_data(), DiracMatrixComputeCUDA< VALUE_FP >::h_cublas_, DiracMatrixComputeCUDA< VALUE_FP >::invM_fp_, qmcplusplus::lda, qmcplusplus::log_values(), DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog_stride(), qmcplusplus::n, DiracMatrixComputeCUDA< VALUE_FP >::psiM_fp_, qmcplusplus::queue, Matrix< T, Alloc >::rows(), Matrix< T, Alloc >::size(), and qmcplusplus::simd::transpose().

Referenced by qmcplusplus::TEST_CASE().

248  {
249  cudaStream_t h_stream = queue.getNative();
250  cublasErrorCheck(cublasSetStream(h_cublas_, h_stream), "cublasSetStream failed!");
251  const int n = a_mat.rows();
252  const int lda = a_mat.cols();
253  psiM_fp_.resize(n * lda);
254  invM_fp_.resize(n * lda);
255  std::fill(log_values.begin(), log_values.end(), LogValue{0.0, 0.0});
256  // making sure we know the log_values are zero'd on the device.
257  cudaErrorCheck(cudaMemcpyAsync(log_values.device_data(), log_values.data(), log_values.size() * sizeof(LogValue),
258  cudaMemcpyHostToDevice, h_stream),
259  "cudaMemcpyAsync failed copying DiracMatrixBatch::log_values to device");
260  simd::transpose(a_mat.data(), n, lda, psiM_fp_.data(), n, lda);
261  cudaErrorCheck(cudaMemcpyAsync(psiM_fp_.device_data(), psiM_fp_.data(), psiM_fp_.size() * sizeof(VALUE_FP),
262  cudaMemcpyHostToDevice, h_stream),
263  "cudaMemcpyAsync failed copying DiracMatrixBatch::psiM_fp to device");
264  mw_computeInvertAndLog_stride(queue, psiM_fp_, invM_fp_, n, lda, log_values);
265  DualMatrix<VALUE_FP> data_ref_matrix;
266 
267  data_ref_matrix.attachReference(invM_fp_.data(), n, n);
268 
269  // We can't use operator= with different lda, ldb which can happen so we use this assignment which is over the
270  // smaller of the two's dimensions
271  inv_a_mat.assignUpperLeft(data_ref_matrix);
272  cudaErrorCheck(cudaMemcpyAsync(inv_a_mat.device_data(), inv_a_mat.data(), inv_a_mat.size() * sizeof(TMAT),
273  cudaMemcpyHostToDevice, h_stream),
274  "cudaMemcpyAsync of inv_a_mat to device failed!");
275  }

◆ makeClone()

std::unique_ptr<Resource> makeClone ( ) const
inline override virtual

Implements Resource.

Definition at line 232 of file DiracMatrixComputeCUDA.hpp.

232 { return std::make_unique<DiracMatrixComputeCUDA>(*this); }
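In the Resource pattern each crowd acquires its own copy, and the copy constructor above gives the clone its own cuBLAS handle. A minimal sketch, with dmc and Value as in the earlier sketch:

    std::unique_ptr<Resource> res = dmc.makeClone();
    auto& dmc_clone = static_cast<DiracMatrixComputeCUDA<Value>&>(*res);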

◆ mw_computeInvertAndLog()

void mw_computeInvertAndLog ( compute::Queue< PlatformKind::CUDA > &  queue,
const RefVector< const DualMatrix< VALUE_FP >> &  a_mats,
const RefVector< DualMatrix< VALUE_FP >> &  inv_a_mats,
const int  n,
DualVector< LogValue > &  log_values 
)
inline private

Calculates the actual inv and log determinant on accelerator.

Parameters
    [in]     h_cublas    cublas handle; the h_stream handle is retrieved from it.
    [in,out] a_mats      dual A matrices; they will be transposed on the device side as a side effect.
    [out]    inv_a_mats  dual invM matrices
    [in]     n           matrices rank
    [out]    log_values  log determinant value for each matrix, batch_size = log_values.size()

On Volta, so far little seems to be gained by making the matrices contiguous.

List of operations:

  1. matrix-by-matrix. Copy a_mat to inv_a_mat on host, transfer inv_a_mat to device, transpose inv_a_mat to a_mat on device.
  2. batched. LU and invert
  3. matrix-by-matrix. Transfer inv_a_mat to host

Pros and cons:

  1. Todo:
     Try to do as in mw_computeInvertAndLog_stride: copy and transpose into psiM_fp_ and fuse the transfer.

Definition at line 105 of file DiracMatrixComputeCUDA.hpp.

References qmcplusplus::cuBLAS_LU::computeInverseAndDetLog_batched(), CUBLAS_OP_N, CUBLAS_OP_T, cublasErrorCheck, qmcplusplus::cudaErrorCheck(), cudaMemcpyAsync, cudaMemcpyDeviceToHost, cudaMemcpyHostToDevice, cudaStream_t, cudaStreamSynchronize, qmcplusplus::cuBLAS::geam(), DiracMatrixComputeCUDA< VALUE_FP >::h_cublas_, DiracMatrixComputeCUDA< VALUE_FP >::host_one, DiracMatrixComputeCUDA< VALUE_FP >::host_zero, DiracMatrixComputeCUDA< VALUE_FP >::infos_, qmcplusplus::lda, qmcplusplus::log_values(), DiracMatrixComputeCUDA< VALUE_FP >::LU_diags_fp_, qmcplusplus::n, DiracMatrixComputeCUDA< VALUE_FP >::pivots_, DiracMatrixComputeCUDA< VALUE_FP >::psiM_fp_, DiracMatrixComputeCUDA< VALUE_FP >::psiM_invM_ptrs_, qmcplusplus::queue, and qmcplusplus::simd::remapCopy().

Referenced by DiracMatrixComputeCUDA< VALUE_FP >::mw_invertTranspose().

110  {
111  const int nw = a_mats.size();
112  assert(a_mats.size() == inv_a_mats.size());
113 
114  psiM_invM_ptrs_.resize(nw * 2);
115  const int lda = a_mats[0].get().cols();
116  const int ldinv = inv_a_mats[0].get().cols();
117  cudaStream_t h_stream = queue.getNative();
118  psiM_fp_.resize(n * ldinv * nw);
119 
120  for (int iw = 0; iw < nw; ++iw)
121  {
122  psiM_invM_ptrs_[iw] = psiM_fp_.device_data() + iw * n * ldinv;
123  psiM_invM_ptrs_[iw + nw] = inv_a_mats[iw].get().device_data();
124  // Since inv_a_mat can have a different leading dimension from a_mat first we remap copy on the host
125  simd::remapCopy(n, n, a_mats[iw].get().data(), lda, inv_a_mats[iw].get().data(), ldinv);
126  // Then copy a_mat in inv_a_mats to the device
127  cudaErrorCheck(cudaMemcpyAsync(inv_a_mats[iw].get().device_data(), inv_a_mats[iw].get().data(),
128  inv_a_mats[iw].get().size() * sizeof(VALUE_FP), cudaMemcpyHostToDevice, h_stream),
129  "cudaMemcpyAsync failed copying DiracMatrixBatch::psiM to device");
130  // On the device Here we transpose to a_mat;
131  cublasErrorCheck(cuBLAS::geam(h_cublas_, CUBLAS_OP_T, CUBLAS_OP_N, n, n, &host_one,
132  inv_a_mats[iw].get().device_data(), ldinv, &host_zero,
133  a_mats[iw].get().device_data(), lda, psiM_invM_ptrs_[iw], ldinv),
134  "cuBLAS::geam failed.");
135  }
136  pivots_.resize(n * nw);
137  infos_.resize(nw);
138  LU_diags_fp_.resize(n * nw);
139  cudaErrorCheck(cudaMemcpyAsync(psiM_invM_ptrs_.device_data(), psiM_invM_ptrs_.data(),
140  psiM_invM_ptrs_.size() * sizeof(VALUE_FP*), cudaMemcpyHostToDevice, h_stream),
141  "cudaMemcpyAsync psiM_invM_ptrs_ failed!");
142  cuBLAS_LU::computeInverseAndDetLog_batched(h_cublas_, h_stream, n, ldinv, psiM_invM_ptrs_.device_data(),
143  psiM_invM_ptrs_.device_data() + nw, LU_diags_fp_.device_data(),
144  pivots_.device_data(), infos_.data(), infos_.device_data(),
145  log_values.device_data(), nw);
146  for (int iw = 0; iw < nw; ++iw)
147  {
148  cudaErrorCheck(cudaMemcpyAsync(inv_a_mats[iw].get().data(), inv_a_mats[iw].get().device_data(),
149  inv_a_mats[iw].get().size() * sizeof(VALUE_FP), cudaMemcpyDeviceToHost, h_stream),
150  "cudaMemcpyAsync failed copying DiracMatrixBatch::inv_psiM to host");
151  }
152  cudaErrorCheck(cudaMemcpyAsync(log_values.data(), log_values.device_data(), log_values.size() * sizeof(LogValue),
153  cudaMemcpyDeviceToHost, h_stream),
154  "cudaMemcpyAsync log_values failed!");
155  cudaErrorCheck(cudaStreamSynchronize(h_stream), "cudaStreamSynchronize failed!");
156  }

◆ mw_computeInvertAndLog_stride()

void mw_computeInvertAndLog_stride ( compute::Queue< PlatformKind::CUDA > &  queue,
DualVector< VALUE_FP > &  psi_Ms,
DualVector< VALUE_FP > &  inv_Ms,
const int  n,
const int  lda,
DualVector< LogValue > &  log_values 
)
inline private

Calculates the actual inverse and log determinant on the accelerator, with psiMs and invMs widened to full precision and copied into contiguous vectors.

Parameters
    [in]     h_cublas    cublas handle; the h_stream handle is retrieved from it.
    [in,out] psi_Ms      matrices flattened into a single pinned vector, returned with LU matrices.
    [out]    inv_Ms      matrices flattened into a single pinned vector.
    [in]     n           matrices rank
    [in]     lda         leading dimension of each matrix
    [out]    log_values  log determinant value for each matrix, batch_size = log_values.size()

List of operations:

  1. batched. Transfer psi_Ms to device
  2. batched. LU and invert
  3. batched. Transfer inv_Ms to host
    Todo:
    Remove 1 and 3. Handle transfer at upper level.

Definition at line 175 of file DiracMatrixComputeCUDA.hpp.

References qmcplusplus::cuBLAS_LU::computeInverseAndDetLog_batched(), qmcplusplus::cudaErrorCheck(), cudaMemcpyAsync, cudaMemcpyDeviceToHost, cudaMemcpyHostToDevice, cudaStream_t, cudaStreamSynchronize, Vector< T, Alloc >::data(), Vector< T, Alloc >::device_data(), DiracMatrixComputeCUDA< VALUE_FP >::h_cublas_, DiracMatrixComputeCUDA< VALUE_FP >::infos_, qmcplusplus::lda, qmcplusplus::log_values(), DiracMatrixComputeCUDA< VALUE_FP >::LU_diags_fp_, qmcplusplus::n, DiracMatrixComputeCUDA< VALUE_FP >::pivots_, DiracMatrixComputeCUDA< VALUE_FP >::psiM_invM_ptrs_, qmcplusplus::queue, and Vector< T, Alloc >::size().

Referenced by DiracMatrixComputeCUDA< VALUE_FP >::invert_transpose(), and DiracMatrixComputeCUDA< VALUE_FP >::mw_invertTranspose().

181  {
182  // This is probably dodgy
183  const int nw = log_values.size();
184  psiM_invM_ptrs_.resize(nw * 2);
185  for (int iw = 0; iw < nw; ++iw)
186  {
187  psiM_invM_ptrs_[iw] = psi_Ms.device_data() + iw * n * lda;
188  psiM_invM_ptrs_[iw + nw] = inv_Ms.device_data() + iw * n * lda;
189  }
190  pivots_.resize(n * nw);
191  infos_.resize(nw);
192  LU_diags_fp_.resize(n * nw);
193 
194  cudaStream_t h_stream = queue.getNative();
195  cudaErrorCheck(cudaMemcpyAsync(psi_Ms.device_data(), psi_Ms.data(), psi_Ms.size() * sizeof(VALUE_FP),
196  cudaMemcpyHostToDevice, h_stream),
197  "cudaMemcpyAsync failed copying DiracMatrixBatch::psiM_fp to device");
198  cudaErrorCheck(cudaMemcpyAsync(psiM_invM_ptrs_.device_data(), psiM_invM_ptrs_.data(),
199  psiM_invM_ptrs_.size() * sizeof(VALUE_FP*), cudaMemcpyHostToDevice, h_stream),
200  "cudaMemcpyAsync psiM_invM_ptrs_ failed!");
201  cuBLAS_LU::computeInverseAndDetLog_batched(h_cublas_, h_stream, n, lda, psiM_invM_ptrs_.device_data(),
202  psiM_invM_ptrs_.device_data() + nw, LU_diags_fp_.device_data(),
203  pivots_.device_data(), infos_.data(), infos_.device_data(),
204  log_values.device_data(), nw);
205 #if NDEBUG
206  // This is very useful to see whether the data after all kernels and cublas calls are run is wrong on the device or due to copy.
207  // cuBLAS_LU::peekinvM_batched(h_stream, psiM_mw_ptr, invM_mw_ptr, pivots_.device_data(), infos_.device_data(),
208  // log_values.device_data(), nw);
209 #endif
210  cudaErrorCheck(cudaMemcpyAsync(inv_Ms.data(), inv_Ms.device_data(), inv_Ms.size() * sizeof(VALUE_FP),
211  cudaMemcpyDeviceToHost, h_stream),
212  "cudaMemcpyAsync failed copying back DiracMatrixBatch::invM_fp from device");
213  cudaErrorCheck(cudaMemcpyAsync(log_values.data(), log_values.device_data(), log_values.size() * sizeof(LogValue),
214  cudaMemcpyDeviceToHost, h_stream),
215  "cudaMemcpyAsync log_values failed!");
216  cudaErrorCheck(cudaStreamSynchronize(h_stream), "cudaStreamSynchronize failed!");
217  }

◆ mw_invertTranspose() [1/2]

std::enable_if_t<!std::is_same<VALUE_FP, TMAT>::value> mw_invertTranspose ( compute::Queue< PlatformKind::CUDA > &  queue,
const RefVector< const DualMatrix< TMAT >> &  a_mats,
const RefVector< DualMatrix< TMAT >> &  inv_a_mats,
DualVector< LogValue > &  log_values 
)
inline

Mixed precision specialization. When TMAT is not full precision, we still need to do the inversion and log determinant at full precision.

This is not yet optimized to transpose on the GPU.

List of operations:

  1. matrix-by-matrix. Transpose a_mat to psiM_fp_ used on host
  2. batched. Call mw_computeInvertAndLog_stride, H2D, invert, D2H
  3. matrix-by-matrix. Copy invM_fp_ to inv_a_mat on host. Transfer inv_a_mat to device.

Pros and cons:

  1. The transfer is batched, but the transfer size doubles due to precision promotion.
  2. Todo:
     Copying invM_fp_ to inv_a_mat on the device is desired. Transfer of inv_a_mat to the host should be handled by the upper-level code.
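For instance (illustrative, reusing the earlier sketch's names), single-precision walker matrices with double-precision computation select this overload:

    // TMAT = float, VALUE_FP = double: inversion and log run in double.
    std::vector<DualMatrix<float>> a_mats_sp(nw), inv_a_mats_sp(nw);
    // ... resize and fill as in the earlier sketch ...
    RefVector<const DualMatrix<float>> a_refs_sp(a_mats_sp.begin(), a_mats_sp.end());
    RefVector<DualMatrix<float>> inv_refs_sp(inv_a_mats_sp.begin(), inv_a_mats_sp.end());
    dmc.mw_invertTranspose(queue, a_refs_sp, inv_refs_sp, log_values);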

Definition at line 291 of file DiracMatrixComputeCUDA.hpp.

References Matrix< T, Alloc >::attachReference(), cublasErrorCheck, cublasSetStream, qmcplusplus::cudaErrorCheck(), cudaMemcpyAsync, cudaMemcpyHostToDevice, cudaStream_t, DiracMatrixComputeCUDA< VALUE_FP >::h_cublas_, DiracMatrixComputeCUDA< VALUE_FP >::invM_fp_, qmcplusplus::lda, qmcplusplus::log_values(), DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog_stride(), qmcplusplus::n, DiracMatrixComputeCUDA< VALUE_FP >::psiM_fp_, qmcplusplus::queue, and qmcplusplus::simd::transpose().

Referenced by qmcplusplus::TEST_CASE().

296  {
297  cudaStream_t h_stream = queue.getNative();
298  cublasErrorCheck(cublasSetStream(h_cublas_, h_stream), "cublasSetStream failed!");
299  assert(log_values.size() == a_mats.size());
300  const int nw = a_mats.size();
301  const int n = a_mats[0].get().rows();
302  const int lda = a_mats[0].get().cols();
303  size_t nsqr = n * n;
304  psiM_fp_.resize(n * lda * nw);
305  invM_fp_.resize(n * lda * nw);
306  std::fill(log_values.begin(), log_values.end(), LogValue{0.0, 0.0});
307  // making sure we know the log_values are zero'd on the device.
308  cudaErrorCheck(cudaMemcpyAsync(log_values.device_data(), log_values.data(), log_values.size() * sizeof(LogValue),
309  cudaMemcpyHostToDevice, h_stream),
310  "cudaMemcpyAsync failed copying DiracMatrixBatch::log_values to device");
311  for (int iw = 0; iw < nw; ++iw)
312  simd::transpose(a_mats[iw].get().data(), n, a_mats[iw].get().cols(), psiM_fp_.data() + nsqr * iw, n, lda);
313  mw_computeInvertAndLog_stride(queue, psiM_fp_, invM_fp_, n, lda, log_values);
314  for (int iw = 0; iw < a_mats.size(); ++iw)
315  {
316  DualMatrix<VALUE_FP> data_ref_matrix;
317  data_ref_matrix.attachReference(invM_fp_.data() + nsqr * iw, n, lda);
318  // We can't use operator= with different lda, ldb which can happen so we use this assignment which is over the
319  // smaller of the two's dimensions
320  inv_a_mats[iw].get().assignUpperLeft(data_ref_matrix);
321  cudaErrorCheck(cudaMemcpyAsync(inv_a_mats[iw].get().device_data(), inv_a_mats[iw].get().data(),
322  inv_a_mats[iw].get().size() * sizeof(TMAT), cudaMemcpyHostToDevice, h_stream),
323  "cudaMemcpyAsync of inv_a_mat to device failed!");
324  }
325  }

◆ mw_invertTranspose() [2/2]

std::enable_if_t<std::is_same<VALUE_FP, TMAT>::value> mw_invertTranspose ( compute::Queue< PlatformKind::CUDA > &  queue,
const RefVector< const DualMatrix< TMAT >> &  a_mats,
const RefVector< DualMatrix< TMAT >> &  inv_a_mats,
DualVector< LogValue > &  log_values 
)
inline

Batched inversion and calculation of log determinants.

When TMAT is full precision we can use a_mat and inv_a_mat directly. A side effect of this call is that the device copy of a_mats afterwards contains the LU factorization matrices.

Definition at line 333 of file DiracMatrixComputeCUDA.hpp.

References cublasErrorCheck, cublasSetStream, cudaStream_t, DiracMatrixComputeCUDA< VALUE_FP >::h_cublas_, qmcplusplus::log_values(), DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog(), qmcplusplus::n, and qmcplusplus::queue.

338  {
339  cudaStream_t h_stream = queue.getNative();
340  cublasErrorCheck(cublasSetStream(h_cublas_, h_stream), "cublasSetStream failed!");
341  assert(log_values.size() == a_mats.size());
342  const int n = a_mats[0].get().rows();
343  mw_computeInvertAndLog(queue, a_mats, inv_a_mats, n, log_values);
344  }

Member Data Documentation

◆ h_cublas_

cublasHandle_t h_cublas_
private

◆ host_one

VALUE_FP host_one {1.0}
private

◆ host_zero

VALUE_FP host_zero {0.0}
private

◆ infos_

DualVector<int> infos_
private

◆ invM_fp_

DualVector<VALUE_FP> invM_fp_
private

◆ LU_diags_fp_

DualVector<VALUE_FP> LU_diags_fp_
private

◆ pivots_

DualVector<int> pivots_
private

◆ psiM_fp_

DualVector<VALUE_FP> psiM_fp_
private

◆ psiM_invM_ptrs_

DualVector<VALUE_FP*> psiM_invM_ptrs_
private

Transfer buffer for device pointers to matrices.

The element count is usually low and the transfer launch costs exceed the transfers themselves. For this reason, it is beneficial to fuse multiple lists of pointers. Right now this buffer packs nw psiM pointers followed by nw invM pointers. Use it only within a function scope and do not rely on its previous value.

Definition at line 77 of file DiracMatrixComputeCUDA.hpp.

Referenced by DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog(), and DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog_stride().
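The packing convention, restated from mw_computeInvertAndLog_stride above as a sketch:

    // Layout: [ psiM_0 ... psiM_{nw-1} | invM_0 ... invM_{nw-1} ]
    psiM_invM_ptrs_.resize(nw * 2);
    for (int iw = 0; iw < nw; ++iw)
    {
      psiM_invM_ptrs_[iw]      = psi_Ms.device_data() + iw * n * lda; // nw psiM device pointers first
      psiM_invM_ptrs_[iw + nw] = inv_Ms.device_data() + iw * n * lda; // then nw invM device pointers
    }
    // A single fused H2D copy then transfers both pointer lists at once.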


The documentation for this class was generated from the following file:

DiracMatrixComputeCUDA.hpp