QMCPACK
class defining a compute and memory resource to compute matrix inversion and the log determinants of a batch of DiracMatrix objects. More...
Public Member Functions | |
DiracMatrixComputeCUDA () | |
DiracMatrixComputeCUDA (const DiracMatrixComputeCUDA &other) | |
~DiracMatrixComputeCUDA () | |
std::unique_ptr< Resource > | makeClone () const override |
template<typename TMAT > | |
void | invert_transpose (compute::Queue< PlatformKind::CUDA > &queue, DualMatrix< TMAT > &a_mat, DualMatrix< TMAT > &inv_a_mat, DualVector< LogValue > &log_values) |
Given a_mat, returns the inverse in inv_a_mat and the log determinant of a_mat. More... | |
template<typename TMAT > | |
std::enable_if_t<!std::is_same< VALUE_FP, TMAT >::value > | mw_invertTranspose (compute::Queue< PlatformKind::CUDA > &queue, const RefVector< const DualMatrix< TMAT >> &a_mats, const RefVector< DualMatrix< TMAT >> &inv_a_mats, DualVector< LogValue > &log_values) |
Mixed precision specialization When TMAT is not full precision we need to still do the inversion and log at full precision. More... | |
template<typename TMAT > | |
std::enable_if_t< std::is_same< VALUE_FP, TMAT >::value > | mw_invertTranspose (compute::Queue< PlatformKind::CUDA > &queue, const RefVector< const DualMatrix< TMAT >> &a_mats, const RefVector< DualMatrix< TMAT >> &inv_a_mats, DualVector< LogValue > &log_values) |
Batched inversion and calculation of log determinants. More... | |
Public Member Functions inherited from Resource | |
Resource (const std::string &name) | |
virtual | ~Resource ()=default |
const std::string & | getName () const |
Private Types | |
using | FullPrecReal = RealAlias< VALUE_FP > |
using | LogValue = std::complex< FullPrecReal > |
template<typename T > | |
using | DualMatrix = Matrix< T, PinnedDualAllocator< T > > |
template<typename T > | |
using | DualVector = Vector< T, PinnedDualAllocator< T > > |
Private Member Functions | |
void | mw_computeInvertAndLog (compute::Queue< PlatformKind::CUDA > &queue, const RefVector< const DualMatrix< VALUE_FP >> &a_mats, const RefVector< DualMatrix< VALUE_FP >> &inv_a_mats, const int n, DualVector< LogValue > &log_values) |
Calculates the actual inv and log determinant on accelerator. More... | |
void | mw_computeInvertAndLog_stride (compute::Queue< PlatformKind::CUDA > &queue, DualVector< VALUE_FP > &psi_Ms, DualVector< VALUE_FP > &inv_Ms, const int n, const int lda, DualVector< LogValue > &log_values) |
Calculates the actual inv and log determinant on accelerator with psiMs and invMs widened to full precision and copied into contiguous vectors. More... | |
Private Attributes | |
DualVector< VALUE_FP > | psiM_fp_ |
DualVector< VALUE_FP > | invM_fp_ |
DualVector< VALUE_FP > | LU_diags_fp_ |
DualVector< int > | pivots_ |
DualVector< int > | infos_ |
DualVector< VALUE_FP * > | psiM_invM_ptrs_ |
Transfer buffer for device pointers to matrices. More... | |
VALUE_FP | host_one {1.0} |
VALUE_FP | host_zero {0.0} |
cublasHandle_t | h_cublas_ |
class defining a compute and memory resource to compute matrix inversion and the log determinants of a batch of DiracMatrix objects.
Multiplicity is one per crowd, not one per UpdateEngine. This matches the multiplicity of the accelerator calls and the batched resource requirement.
VALUE_FP | the datatype used in the actual computation of matrix inversion |
There are no per-walker variables; resources specific to the per-crowd compute object are owned here. The compute object itself is the resource of the per-walker DiracDeterminantBatched. Resources used by this object but owned by the surrounding scope are passed as arguments.
All the public APIs are synchronous. The asynchronous queue argument gets synchronized before return. rocBLAS, indirectly used via hipBLAS, requires synchronizing the old stream before setting a new one. We don't need to actively synchronize the old stream because it gets synchronized right after each use.
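The per-crowd resource pattern described above can be illustrated with a minimal mock (names simplified; `DiracMatrixComputeMock` is hypothetical, only `Resource`'s `makeClone`/`getName` interface is taken from this page): a prototype resource is cloned once per crowd rather than constructed per walker.

```cpp
#include <memory>
#include <string>

// Minimal mock of the Resource base class documented below (inherited members
// Resource(name), getName(), and the pure-virtual makeClone() override).
struct Resource
{
  explicit Resource(const std::string& name) : name_(name) {}
  virtual ~Resource() = default;
  const std::string& getName() const { return name_; }
  virtual std::unique_ptr<Resource> makeClone() const = 0;
private:
  std::string name_;
};

// Hypothetical stand-in for DiracMatrixComputeCUDA: one instance per crowd,
// obtained by cloning a prototype via makeClone().
struct DiracMatrixComputeMock : Resource
{
  DiracMatrixComputeMock() : Resource("DiracMatrixComputeCUDA") {}
  std::unique_ptr<Resource> makeClone() const override
  {
    return std::make_unique<DiracMatrixComputeMock>(*this);
  }
};
```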
Definition at line 49 of file DiracMatrixComputeCUDA.hpp.
|
private |
Definition at line 55 of file DiracMatrixComputeCUDA.hpp.
|
private |
Definition at line 58 of file DiracMatrixComputeCUDA.hpp.
|
private |
Definition at line 51 of file DiracMatrixComputeCUDA.hpp.
|
private |
Definition at line 52 of file DiracMatrixComputeCUDA.hpp.
|
inline |
Definition at line 220 of file DiracMatrixComputeCUDA.hpp.
References cublasCreate, cublasErrorCheck, and DiracMatrixComputeCUDA< VALUE_FP >::h_cublas_.
|
inline |
Definition at line 225 of file DiracMatrixComputeCUDA.hpp.
References cublasCreate, cublasErrorCheck, and DiracMatrixComputeCUDA< VALUE_FP >::h_cublas_.
|
inline |
Definition at line 230 of file DiracMatrixComputeCUDA.hpp.
References cublasDestroy, cublasErrorCheck, and DiracMatrixComputeCUDA< VALUE_FP >::h_cublas_.
|
inline |
Given a_mat, returns the inverse in inv_a_mat and the log determinant of a_mat.
[in] | a_mat | a matrix input |
[out] | inv_a_mat | inverted matrix |
[out] | log_values | log determinant is in log_values[0] |
I consider this single call to be semi-deprecated, so the log determinant values vector is used to match the primary batched interface to the accelerated routines. There is no optimization (yet) for TMAT being the same type as TREAL.
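The log determinant returned in log_values[0] follows the usual LU-factorization identity. A host-side sketch (not the QMCPACK device code, which uses batched cuBLAS routines): after LU with partial pivoting, log|det A| accumulates from the diagonal of U, and the sign from pivot swaps and negative diagonal entries lands in the imaginary part of the complex log value as a multiple of pi.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

// LU-factorize a small row-major n x n matrix in place with partial pivoting,
// then read the log determinant off the diagonal. The complex return mirrors
// LogValue = std::complex<FullPrecReal>: real part log|det|, imaginary part
// the phase (0 or pi for a real matrix).
std::complex<double> lu_log_det(std::vector<double> a, int n)
{
  int sign = 1;
  for (int k = 0; k < n; ++k)
  {
    // partial pivoting: pick the largest-magnitude entry in column k
    int piv = k;
    for (int i = k + 1; i < n; ++i)
      if (std::abs(a[i * n + k]) > std::abs(a[piv * n + k]))
        piv = i;
    if (piv != k)
    {
      for (int j = 0; j < n; ++j)
        std::swap(a[k * n + j], a[piv * n + j]);
      sign = -sign; // each row swap flips the determinant sign
    }
    for (int i = k + 1; i < n; ++i)
    {
      a[i * n + k] /= a[k * n + k];
      for (int j = k + 1; j < n; ++j)
        a[i * n + j] -= a[i * n + k] * a[k * n + j];
    }
  }
  double log_abs = 0.0;
  for (int k = 0; k < n; ++k)
  {
    log_abs += std::log(std::abs(a[k * n + k]));
    if (a[k * n + k] < 0)
      sign = -sign;
  }
  const double pi = std::acos(-1.0);
  return {log_abs, sign < 0 ? pi : 0.0};
}
```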
Definition at line 244 of file DiracMatrixComputeCUDA.hpp.
References Matrix< T, Alloc >::assignUpperLeft(), Matrix< T, Alloc >::attachReference(), Matrix< T, Alloc >::cols(), cublasErrorCheck, cublasSetStream, qmcplusplus::cudaErrorCheck(), cudaMemcpyAsync, cudaMemcpyHostToDevice, cudaStream_t, Matrix< T, Alloc >::data(), Matrix< T, Alloc >::device_data(), DiracMatrixComputeCUDA< VALUE_FP >::h_cublas_, DiracMatrixComputeCUDA< VALUE_FP >::invM_fp_, qmcplusplus::lda, qmcplusplus::log_values(), DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog_stride(), qmcplusplus::n, DiracMatrixComputeCUDA< VALUE_FP >::psiM_fp_, qmcplusplus::queue, Matrix< T, Alloc >::rows(), Matrix< T, Alloc >::size(), and qmcplusplus::simd::transpose().
Referenced by qmcplusplus::TEST_CASE().
|
inlineoverridevirtual |
Implements Resource.
Definition at line 232 of file DiracMatrixComputeCUDA.hpp.
|
inlineprivate |
Calculates the actual inv and log determinant on accelerator.
[in] | h_cublas | cublas handle, h_stream handle is retrieved from it. |
[in,out] | a_mats | dual A matrices, they will be transposed on the device side as a side effect. |
[out] | inv_a_mats | dual invM matrices |
[in] | n | matrices rank. |
[out] | log_values | log determinant value for each matrix, batch_size = log_values.size() |
On Volta, so far little seems to be gained by making the matrices contiguous.
Definition at line 105 of file DiracMatrixComputeCUDA.hpp.
References qmcplusplus::cuBLAS_LU::computeInverseAndDetLog_batched(), CUBLAS_OP_N, CUBLAS_OP_T, cublasErrorCheck, qmcplusplus::cudaErrorCheck(), cudaMemcpyAsync, cudaMemcpyDeviceToHost, cudaMemcpyHostToDevice, cudaStream_t, cudaStreamSynchronize, qmcplusplus::cuBLAS::geam(), DiracMatrixComputeCUDA< VALUE_FP >::h_cublas_, DiracMatrixComputeCUDA< VALUE_FP >::host_one, DiracMatrixComputeCUDA< VALUE_FP >::host_zero, DiracMatrixComputeCUDA< VALUE_FP >::infos_, qmcplusplus::lda, qmcplusplus::log_values(), DiracMatrixComputeCUDA< VALUE_FP >::LU_diags_fp_, qmcplusplus::n, DiracMatrixComputeCUDA< VALUE_FP >::pivots_, DiracMatrixComputeCUDA< VALUE_FP >::psiM_fp_, DiracMatrixComputeCUDA< VALUE_FP >::psiM_invM_ptrs_, qmcplusplus::queue, and qmcplusplus::simd::remapCopy().
Referenced by DiracMatrixComputeCUDA< VALUE_FP >::mw_invertTranspose().
|
inlineprivate |
Calculates the actual inv and log determinant on accelerator with psiMs and invMs widened to full precision and copied into contiguous vectors.
[in] | h_cublas | cublas handle, h_stream handle is retrieved from it. |
[in,out] | psi_Ms | matrices flattened into a single pinned vector; returned holding the LU matrices. |
[out] | inv_Ms | matrices flattened into a single pinned vector. |
[in] | n | matrices rank. |
[in] | lda | leading dimension of each matrix |
[out] | log_values | log determinant value for each matrix, batch_size = log_values.size() |
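The flattening step can be sketched on the host (an illustrative stand-in for the packing done before the async copies, not the QMCPACK code; `pack_batch` is a hypothetical helper): each n x n matrix occupies an n*lda-element slot, so a single transfer moves the whole batch.

```cpp
#include <cstddef>
#include <vector>

// Pack a batch of row-major n x n matrices, each stored with leading
// dimension lda, back to back into one contiguous vector. One contiguous
// buffer means one cudaMemcpyAsync-style transfer for the whole batch
// instead of one per matrix.
std::vector<double> pack_batch(const std::vector<std::vector<double>>& mats,
                               int n, int lda)
{
  std::vector<double> flat(mats.size() * n * lda, 0.0);
  for (std::size_t iw = 0; iw < mats.size(); ++iw)
    for (int i = 0; i < n; ++i)
      for (int j = 0; j < n; ++j)
        flat[iw * n * lda + i * lda + j] = mats[iw][i * n + j];
  return flat;
}
```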
Definition at line 175 of file DiracMatrixComputeCUDA.hpp.
References qmcplusplus::cuBLAS_LU::computeInverseAndDetLog_batched(), qmcplusplus::cudaErrorCheck(), cudaMemcpyAsync, cudaMemcpyDeviceToHost, cudaMemcpyHostToDevice, cudaStream_t, cudaStreamSynchronize, Vector< T, Alloc >::data(), Vector< T, Alloc >::device_data(), DiracMatrixComputeCUDA< VALUE_FP >::h_cublas_, DiracMatrixComputeCUDA< VALUE_FP >::infos_, qmcplusplus::lda, qmcplusplus::log_values(), DiracMatrixComputeCUDA< VALUE_FP >::LU_diags_fp_, qmcplusplus::n, DiracMatrixComputeCUDA< VALUE_FP >::pivots_, DiracMatrixComputeCUDA< VALUE_FP >::psiM_invM_ptrs_, qmcplusplus::queue, and Vector< T, Alloc >::size().
Referenced by DiracMatrixComputeCUDA< VALUE_FP >::invert_transpose(), and DiracMatrixComputeCUDA< VALUE_FP >::mw_invertTranspose().
|
inline |
Mixed precision specialization When TMAT is not full precision we need to still do the inversion and log at full precision.
This is not yet optimized to transpose on the GPU
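Since this path is not yet optimized to transpose on the GPU, the host widens and transposes each TMAT matrix into the full-precision buffer before the device copy. A minimal sketch of that per-matrix step (illustrative only; the actual code uses simd::transpose, and `widen_transpose` is a hypothetical helper):

```cpp
#include <vector>

// Widen a reduced-precision (TMAT, e.g. float) row-major n x n matrix to full
// precision (VALUE_FP, e.g. double) while transposing it, so the device
// receives full-precision transposed input in one pass.
template<typename TMAT, typename VALUE_FP>
std::vector<VALUE_FP> widen_transpose(const std::vector<TMAT>& in, int n)
{
  std::vector<VALUE_FP> out(static_cast<std::size_t>(n) * n);
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
      out[j * n + i] = static_cast<VALUE_FP>(in[i * n + j]); // widen + transpose
  return out;
}
```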
Definition at line 291 of file DiracMatrixComputeCUDA.hpp.
References Matrix< T, Alloc >::attachReference(), cublasErrorCheck, cublasSetStream, qmcplusplus::cudaErrorCheck(), cudaMemcpyAsync, cudaMemcpyHostToDevice, cudaStream_t, DiracMatrixComputeCUDA< VALUE_FP >::h_cublas_, DiracMatrixComputeCUDA< VALUE_FP >::invM_fp_, qmcplusplus::lda, qmcplusplus::log_values(), DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog_stride(), qmcplusplus::n, DiracMatrixComputeCUDA< VALUE_FP >::psiM_fp_, qmcplusplus::queue, and qmcplusplus::simd::transpose().
Referenced by qmcplusplus::TEST_CASE().
|
inline |
Batched inversion and calculation of log determinants.
When TMAT is full precision we can use a_mats and inv_a_mats directly. A side effect is that after this call the device copy of a_mats contains the LU factorization matrices.
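The choice between this direct path and the mixed-precision one is made at compile time by the std::enable_if_t return types shown in the two mw_invertTranspose signatures above. A stripped-down sketch of that dispatch (simplified names; `uses_mixed_precision` is hypothetical):

```cpp
#include <type_traits>

using VALUE_FP = double; // full-precision computation type, as in the class

// Overload selected when TMAT differs from VALUE_FP: the mixed-precision path
// that widens inputs before inversion.
template<typename TMAT>
std::enable_if_t<!std::is_same<VALUE_FP, TMAT>::value, bool> uses_mixed_precision()
{
  return true;
}

// Overload selected when TMAT equals VALUE_FP: a_mats and inv_a_mats are used
// directly, no widening copies.
template<typename TMAT>
std::enable_if_t<std::is_same<VALUE_FP, TMAT>::value, bool> uses_mixed_precision()
{
  return false;
}
```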
Definition at line 333 of file DiracMatrixComputeCUDA.hpp.
References cublasErrorCheck, cublasSetStream, cudaStream_t, DiracMatrixComputeCUDA< VALUE_FP >::h_cublas_, qmcplusplus::log_values(), DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog(), qmcplusplus::n, and qmcplusplus::queue.
|
private |
Definition at line 84 of file DiracMatrixComputeCUDA.hpp.
Referenced by DiracMatrixComputeCUDA< VALUE_FP >::DiracMatrixComputeCUDA(), DiracMatrixComputeCUDA< VALUE_FP >::invert_transpose(), DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog(), DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog_stride(), DiracMatrixComputeCUDA< VALUE_FP >::mw_invertTranspose(), and DiracMatrixComputeCUDA< VALUE_FP >::~DiracMatrixComputeCUDA().
|
private |
Definition at line 80 of file DiracMatrixComputeCUDA.hpp.
Referenced by DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog().
|
private |
Definition at line 81 of file DiracMatrixComputeCUDA.hpp.
Referenced by DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog().
|
private |
Definition at line 67 of file DiracMatrixComputeCUDA.hpp.
Referenced by DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog(), and DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog_stride().
|
private |
Definition at line 62 of file DiracMatrixComputeCUDA.hpp.
Referenced by DiracMatrixComputeCUDA< VALUE_FP >::invert_transpose(), and DiracMatrixComputeCUDA< VALUE_FP >::mw_invertTranspose().
|
private |
Definition at line 65 of file DiracMatrixComputeCUDA.hpp.
Referenced by DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog(), and DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog_stride().
|
private |
Definition at line 66 of file DiracMatrixComputeCUDA.hpp.
Referenced by DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog(), and DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog_stride().
|
private |
Definition at line 61 of file DiracMatrixComputeCUDA.hpp.
Referenced by DiracMatrixComputeCUDA< VALUE_FP >::invert_transpose(), DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog(), and DiracMatrixComputeCUDA< VALUE_FP >::mw_invertTranspose().
|
private |
Transfer buffer for device pointers to matrices.
The element count is usually low and the transfer launch cost exceeds the transfer itself. For this reason it is beneficial to fuse multiple lists of pointers. Right now this buffer packs nw psiM pointers followed by nw invM pointers. Use it only within a function scope and do not rely on its previous value.
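The packing layout can be sketched as follows (a host-side illustration, not the QMCPACK code; `pack_ptr_lists` is a hypothetical helper): the nw psiM pointers come first, then the nw invM pointers, so a single small host-to-device copy delivers both lists to the batched cuBLAS call.

```cpp
#include <vector>

// Fuse two pointer lists into one buffer: [psiM_0..psiM_{nw-1},
// invM_0..invM_{nw-1}]. One fused transfer replaces two tiny ones whose
// launch overhead would dominate.
template<typename T>
std::vector<T*> pack_ptr_lists(const std::vector<T*>& psiM_ptrs,
                               const std::vector<T*>& invM_ptrs)
{
  std::vector<T*> fused;
  fused.reserve(psiM_ptrs.size() + invM_ptrs.size());
  fused.insert(fused.end(), psiM_ptrs.begin(), psiM_ptrs.end());
  fused.insert(fused.end(), invM_ptrs.begin(), invM_ptrs.end());
  return fused;
}
```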
Definition at line 77 of file DiracMatrixComputeCUDA.hpp.
Referenced by DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog(), and DiracMatrixComputeCUDA< VALUE_FP >::mw_computeInvertAndLog_stride().