* Add THHalfTensor support to cutorch.
A few of us have been using this extensively without problems. This
avoids synchronizations due to cudaFree calls, which makes it much easier
to write performant CUDA code.
Adds a caching allocator for CUDA pinned (page-locked) memory. This
avoids synchronization due to cudaFreeHost or cudaHostUnregister at the
expense of potentially higher host memory usage.
Correctness is preserved by recording CUDA events after each
cudaMemcpyAsync involving the pinned memory. Pinned memory
allocations are not reused until all events associated with them have
completed.
Adds a CUDA "sleep" kernel which spins for the given number of
iterations. This is useful for testing correct synchronization with
streams.
Previously, cutorch would initialize every CUDA device and enable P2P
access between all pairs. This slows down start-up, especially with 8
devices. Now, THCudaInit does not initialize any devices and P2P access
is enabled lazily. Setting the random number generator seed also does
not initialize the device until random numbers are actually used.
Lazily initialize CUDA devices
* Implemented cudaMemGetInfo for caching allocator
This implements the THC code so that we can expose streams as objects
instead of simply referring to them by indices. This is not exposed in
Lua yet.
Use a single, global THCCachingAllocator instance.
Previously, each Lua thread had its own THCCachingAllocator instance.
However, threads can share storages, which means a segment could be
allocated from one THCCachingAllocator and freed on another, which
breaks.
Fixes #539
Switching the device, setting the stream, and switching BLAS handles is
now thread-safe. Some other operations, like reserveStreams, are still
not thread-safe.
The allocator can be enabled by setting the environment variable
THC_CACHING_ALLOCATOR=1
reduce and BLAS work
Add FP16 support (CudaHalfStorage, CudaHalfTensor)
stream event fixes
maskedCopy implemented
generic Reduce kernels
* A new allocator that uses cudaMallocHost.
* cutorch.createCudaHostTensor(...) to create FloatTensor allocated with
CudaHostAllocator.
This reverts commit d88ac24c712e3a40d4aaf3ac2d043bd79ba4280e.
Revert "Auto device mode, plus allocation helper functions."
This reverts commit 47a2f6de252c2254234edfc1c6115229b5383bac.
This diff introduces an alternative way of writing multi-GPU cutorch
code. In this mode, the location of each tensor is specified, and the
appropriate GPU for each kernel is determined automatically based on the
location of its argument tensors. It's backwards-compatible and interoperable
with the old-style multi-GPU API.
and maskedFill operations (and tests).
Also adds generic Reduce and Apply kernels that can be reused.
Only need to reset the cuBLAS handle for the current device, because
only resources associated with the current device will be reset by
cudaDeviceReset.
Every THC function gets a THCState pointer as the first argument.
Some generic files that were previously included have been instantiated
because TH functions currently don't get a state parameter.
A device reset destroys the state of the RNG, so we have to re-initialize
it after each reset.