This diff introduces an alternative way of writing multi-GPU cutorch
code. In this mode, the location of each tensor is specified, and the
appropriate GPU for each kernel is determined automatically based on the
location of its argument tensors. It's backwards-compatible and interoperable
with the old-style multi-GPU API.
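
As a rough illustration of the idea (not the code in this diff), the
sketch below picks a device by inspecting the argument tensors.
SketchTensor and sketch_device_for_args are hypothetical names, and
returning -1 when the arguments sit on different GPUs is an assumption
made for the example; this description does not say how the real code
handles that case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a tensor that records where its data lives. */
typedef struct {
    int device; /* index of the GPU holding the tensor's storage */
} SketchTensor;

/*
 * Return the device shared by all argument tensors.
 * Returns -1 if the arguments live on different devices (an assumption
 * for this sketch, not documented behaviour).
 */
int sketch_device_for_args(const SketchTensor *const *args, size_t n)
{
    size_t i;
    int device;

    assert(args != NULL && n > 0);
    device = args[0]->device;
    for (i = 1; i < n; ++i) {
        if (args[i]->device != device) {
            return -1; /* arguments are spread across GPUs */
        }
    }
    return device;
}
```

A caller would run this check on a kernel's arguments before launch and
switch to the returned device, so no explicit device selection is needed
at the call site.
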
Every THC function now takes a THCState pointer as its first argument.
Some generic files that were previously included have instead been
instantiated explicitly, because TH functions currently don't take a
state parameter.
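
The calling convention can be pictured with the minimal sketch below.
SketchState and sketch_fill are hypothetical and only show the shape of
a function that receives its context as the first argument instead of
reading a global; the fields of the real THCState are not described
here, so the one used below is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical context object; the field is made up for this sketch. */
typedef struct {
    int calls; /* counts how many operations used this state */
} SketchState;

/* State comes first, followed by the operation's own arguments. */
void sketch_fill(SketchState *state, float *data, size_t n, float value)
{
    size_t i;

    assert(state != NULL && data != NULL);
    for (i = 0; i < n; ++i) {
        data[i] = value;
    }
    state->calls += 1; /* bookkeeping lives in the state, not in a global */
}
```

Passing the state explicitly means two independent contexts can coexist
in one process, since no function depends on hidden global data.
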