ROCM Support(1)
===============

NAME
----
amdgpu_plugin - A plugin extension to CRIU to support checkpoint/restore in
userspace for AMD GPUs.

CURRENT SUPPORT
---------------
* Single and Multi GPU systems (Gfx9)
* Checkpoint / Restore on a different system
* Checkpoint / Restore inside a Docker container
* PyTorch
* TensorFlow
* Using CRIU Image Streamer

DESCRIPTION
-----------
Though *criu* is a great tool for checkpointing and restoring running
applications, it has certain limitations; for example, it cannot handle
applications that have device files open. To support *ROCm* based workloads
with *criu*, criu's core functionality needs to be augmented with a plugin
based extension mechanism. *amdgpu_plugin* provides the necessary support to
criu to allow Checkpoint / Restore with ROCm.

Dependencies
~~~~~~~~~~~~
*amdkfd support*::
    In order to snapshot the *VRAM* and other *GPU* device states, an updated
    version of the amdkfd (amdgpu) driver is required. The kernel patches are
    currently under review.

*criu 3.16*::
    This work is rebased on the latest criu release available at the time of
    writing.

OPTIONS
-------
Optional parameters can be passed in as environment variables before
executing the criu command.

*KFD_FW_VER_CHECK*::
    Enable or disable the firmware version check.
    If enabled, the firmware version on the restored GPU needs to be greater
    than or equal to the firmware version on the checkpointed GPU.
    Default: Enabled
    E.g.:
    KFD_FW_VER_CHECK=0

*KFD_SDMA_FW_VER_CHECK*::
    Enable or disable the SDMA firmware version check.
    If enabled, the SDMA firmware version on the restored GPU needs to be
    greater than or equal to the SDMA firmware version on the checkpointed GPU.
    Default: Enabled
    E.g.:
    KFD_SDMA_FW_VER_CHECK=0

*KFD_CACHES_COUNT_CHECK*::
    Enable or disable the caches count check.
    If enabled, the caches count on the restored GPU needs to be greater than
    or equal to the caches count on the checkpointed GPU.
    Default: Enabled
    E.g.:
    KFD_CACHES_COUNT_CHECK=0

*KFD_NUM_GWS_CHECK*::
    Enable or disable the num_gws check.
    If enabled, the num_gws on the restored GPU needs to be greater than or
    equal to the num_gws on the checkpointed GPU.
    Default: Enabled
    E.g.:
    KFD_NUM_GWS_CHECK=0

*KFD_VRAM_SIZE_CHECK*::
    Enable or disable the VRAM size check.
    If enabled, the VRAM size on the restored GPU needs to be greater than or
    equal to the VRAM size on the checkpointed GPU.
    Default: Enabled
    E.g.:
    KFD_VRAM_SIZE_CHECK=0

*KFD_NUMA_CHECK*::
    Enable or disable the NUMA CPU region check.
    If enabled, the plugin will restore GPUs that belong to one CPU NUMA
    region to the same CPU NUMA region.
    Default: Enabled
    E.g.:
    KFD_NUMA_CHECK=1

*KFD_CAPABILITY_CHECK*::
    Enable or disable the capability check.
    If enabled, the capability on the restored GPU needs to be equal to the
    capability on the checkpointed GPU.
    Default: Enabled
    E.g.:
    KFD_CAPABILITY_CHECK=1

AUTHOR
------
The AMDKFD team.

COPYRIGHT
---------
Copyright \(C) 2020-2021, Advanced Micro Devices, Inc. (AMD)
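
EXAMPLE
-------
The options listed under OPTIONS are ordinary environment variables placed in
front of the criu command. As a minimal sketch, the helper below only prints
such a command line (a dry run) so that the resulting environment prefix can
be inspected before anything is restored. The function name `restore_cmd`,
the variable `IMAGES_DIR` and the directory `./checkpoint` are hypothetical
and not part of criu or amdgpu_plugin. Whether `--shell-job` is needed depends
on how the checkpointed process was started.

```shell
#!/bin/sh
# Hypothetical dry-run helper, not shipped with criu or amdgpu_plugin.
# IMAGES_DIR is an assumed placeholder for the checkpoint image directory.
IMAGES_DIR="${IMAGES_DIR:-./checkpoint}"

restore_cmd() {
    # Each argument names one plugin check to disable, e.g. KFD_FW_VER_CHECK.
    env_prefix=""
    for check in "$@"; do
        env_prefix="${env_prefix}${check}=0 "
    done
    # Print the command instead of running it.
    printf '%scriu restore --images-dir %s --shell-job\n' "$env_prefix" "$IMAGES_DIR"
}

# Show a restore command with the firmware and VRAM size checks disabled.
restore_cmd KFD_FW_VER_CHECK KFD_VRAM_SIZE_CHECK
```

Disabling a check relaxes a compatibility requirement between the
checkpointed and the restored GPU, so it is probably best to disable only the
checks that are known to be safe to skip for the target system.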