github.com/checkpoint-restore/criu.git

Age	Commit message	Author
2022-06-22	ci: Fix code indent	Radostin Stoyanov
	This patch contains auto-generated changes from `make indent`. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2022-06-13	amdgpu: Add gitignore	Radostin Stoyanov
	Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2022-05-17	amdgpu: Set PLUGINDIR to /usr/lib/criu	Radostin Stoyanov
	Building the criu packages for Ubuntu/Debian fails with: "mkdir: cannot create directory '/var/lib/criu': Permission denied". This patch sets PLUGINDIR to /usr/lib/criu. Fixes: #1877 Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2022-05-16	amdgpu/Makefile: Fix include path	Radostin Stoyanov
	When building packages for CRIU the source directory might have a name different than 'criu'. Fixes: #1877 Reported-by: @siris Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2022-04-29	Fix some codespell warnings	Kir Kolyshkin
	Brought to you by codespell -w (using codespell v2.1.0). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-04-29	criu/plugin: Add support for criu image streamer	David Yat Sin
	Modifications to support criu image streamer when using amdgpu_plugin. When running with criu image streamer, fseek/lseek is not available, so we store the file size in the first 8 bytes of the actual file. Signed-off-by: David Yat Sin <david.yatsin@amd.com>
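The size-prefix idea in this commit can be illustrated with a short sketch. It is not the plugin's code (the plugin is written in C); the little-endian byte order and the helper names are assumptions made for the example. The point is that a reader on a non-seekable stream learns the payload length from the first 8 bytes instead of calling fseek/lseek.

```python
import io
import struct

# Assumed layout: an 8-byte unsigned size, little-endian. The commit only
# states "first 8 bytes"; the byte order is an assumption of this sketch.
SIZE_HEADER = struct.Struct("<Q")


def write_with_size_header(stream, payload):
    """Write the payload length, then the payload itself."""
    stream.write(SIZE_HEADER.pack(len(payload)))
    stream.write(payload)


def read_exact(stream, count):
    """Read exactly `count` bytes from a stream that may return short reads."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError("stream ended before the expected number of bytes")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_with_size_header(stream):
    """Read the 8-byte size, then exactly that many payload bytes (no seeking)."""
    (size,) = SIZE_HEADER.unpack(read_exact(stream, SIZE_HEADER.size))
    return read_exact(stream, size)


if __name__ == "__main__":
    buf = io.BytesIO()
    write_with_size_header(buf, b"bo-contents")
    buf.seek(0)  # only to rewind the in-memory demo buffer
    print(read_with_size_header(buf))
```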
2022-04-29	criu/plugin: Store BO contents directly to file	David Yat Sin
	Store BO contents directly to file (one per GPU) instead of using protobuf. Bug fix: fixes an issue where we could not handle BOs bigger than 4GB because protobuf has an internal limit of 4GB for the Bytes structure. Performance improvements: this significantly reduces CR duration on multi-GPU systems as it allows reading and writing to disk in parallel. During checkpoint, instead of waiting for all the BO contents to be collected for the one protobuf file, we can now start writing BO contents to disk as soon as the first BO is read from VRAM. During restore, we can start writing BO contents to VRAM as soon as the first BO is read from disk. This also reduces the peak amount of system memory used, as we only need to keep one BO's contents in memory per GPU at a time instead of all the BO contents. Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-04-29	criu/plugin: Add whitepaper document	Felix Kuehling
	Adding whitepaper document Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-04-29	criu/plugin: Fix for FDs not allowed to mmap	David Yat Sin
	On newer kernels (> 5.13), the KFD and DRM drivers only allow mmap for the VMAs through the /dev/renderD* file descriptors that were used during the CRIU_RESTORE ioctl. During restore, after opening /dev/renderD*, amdgpu_plugin keeps the FDs open and returns a copy of the FDs to CRIU instead. The same FDs are then returned during the UPDATE_VMAMAP hooks so that CRIU can use them to call mmap. Duplicated FDs created using dup refer to the same struct file inside the kernel, so they are also allowed to mmap. To prevent the FDs opened inside amdgpu_plugin from conflicting with FDs used by the target restore application, we make sure that the lowest-numbered FD that amdgpu_plugin will use is greater than the highest-numbered FD that is used by the target application. Signed-off-by: David Yat Sin <david.yatsin@amd.com>
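The two properties this commit relies on (a duplicate refers to the same open file, and a duplicate can be placed above a chosen descriptor number) can be shown with a short sketch. This is an illustration only: it uses `fcntl` with `F_DUPFD`, which is one standard way to request "lowest free descriptor at or above N", and the commit message does not say which call the plugin uses. The helper name `dup_above` and the threshold 100 are hypothetical.

```python
import fcntl
import os


def dup_above(fd, min_fd):
    """Duplicate `fd` onto the lowest free descriptor number >= min_fd.

    The duplicate shares the same open file description as the original,
    so anything permitted through the original is permitted through the copy.
    """
    return fcntl.fcntl(fd, fcntl.F_DUPFD, min_fd)


if __name__ == "__main__":
    read_end, write_end = os.pipe()
    # Hypothetical threshold standing in for "highest FD used by the target".
    high_copy = dup_above(read_end, 100)
    os.write(write_end, b"x")
    # Data written to the pipe is readable through the high-numbered copy.
    print(high_copy >= 100, os.read(high_copy, 1))
    for fd in (high_copy, read_end, write_end):
        os.close(fd)
```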
2022-04-29	criu/plugin: Implement sDMA based buffer access	Rajneesh Bhardwaj
	AMD Radeon GPUs have special sDMA (system DMA engine) IPs that can be used to speed up read and write operations on VRAM and GTT memory. Depends on: * The kernel mode driver (kfd) creating the dmabuf objects for the kfd BOs in both checkpoint and restore operations. * The libdrm and libdrm_amdgpu libraries. Suggested-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-04-29	criu/plugin: Restore libhsakmt shared memory files	David Yat Sin
	Libhsakmt (thunk) uses a shared memory file in /dev/shm/hsakmt_shared_mem and its semaphore in /dev/shm/sem.hsakmt_semaphore. Add a check during checkpoint to see whether these two files exist. If they do, the plugin will try to restore them during restore. Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-04-29	criu/plugin: Read and write BO contents in parallel	David Yat Sin
	Implement multi-threaded code to read and write the contents of each GPU's VRAM BOs in parallel, in order to speed up the dumping process when using multiple GPUs. Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
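The per-GPU parallelism described here can be sketched as one worker thread per GPU, each writing that GPU's buffers to its own file and holding only one buffer at a time. This is a structural illustration, not the plugin's C implementation; the file naming scheme, the function names, and the in-memory byte strings standing in for BO contents are all assumptions of the example.

```python
import os
import tempfile
import threading


def dump_gpu(gpu_id, buffers, out_dir, errors):
    """Write every buffer of one GPU, in order, to that GPU's own file."""
    path = os.path.join(out_dir, f"gpu-{gpu_id}.bin")  # hypothetical file name
    try:
        with open(path, "wb") as out:
            for contents in buffers:  # one buffer in flight per GPU
                out.write(contents)
    except OSError as exc:
        errors.append((gpu_id, exc))


def dump_all(gpu_buffers, out_dir):
    """Start one worker per GPU, wait for all of them, and report failures."""
    errors = []
    workers = [
        threading.Thread(target=dump_gpu, args=(gpu_id, buffers, out_dir, errors))
        for gpu_id, buffers in gpu_buffers.items()
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    if errors:
        failed = sorted(gpu_id for gpu_id, _ in errors)
        raise RuntimeError(f"dump failed for GPUs: {failed}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        dump_all({0: [b"aa", b"bb"], 1: [b"cc"]}, tmp)
        print(sorted(os.listdir(tmp)))
```

Keeping a separate file per GPU means the workers never contend for a shared file offset, which is what makes the simple thread-per-GPU layout sufficient in this sketch.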
2022-04-29	criu/plugin: Add unit tests for GPU remapping	David Yat Sin
	Adding unit tests for GPU remapping code when checkpointing and restoring on different nodes with different topologies. Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-04-29	criu/plugin: Add parameters to override mapping	David Yat Sin
	Add optional parameters to override default behavior during restore. These parameters are passed in as environment variables before executing CRIU.
	List of parameters:
	KFD_FW_VER_CHECK - disable firmware version check
	KFD_SDMA_FW_VER_CHECK - disable SDMA firmware version check
	KFD_CACHES_COUNT_CHECK - disable caches count check
	KFD_NUM_GWS_CHECK - disable num_gws check
	KFD_VRAM_SIZE_CHECK - disable VRAM size check
	KFD_NUMA_CHECK - preserve NUMA regions
	KFD_CAPABILITY_CHECK - disable capability check
	Signed-off-by: David Yat Sin <david.yatsin@amd.com>
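A minimal sketch of reading such switches from the environment follows. The variable names come from the list above; everything else is an assumption made for illustration: that each check defaults to enabled, that "0" or "NO" disables it, that "1" or "YES" enables it, and that other values are rejected. The commit message does not specify the accepted values, and KFD_NUMA_CHECK in particular is described differently from the others, so treat the uniform handling here as a simplification.

```python
import os

CHECK_VARIABLES = (
    "KFD_FW_VER_CHECK",
    "KFD_SDMA_FW_VER_CHECK",
    "KFD_CACHES_COUNT_CHECK",
    "KFD_NUM_GWS_CHECK",
    "KFD_VRAM_SIZE_CHECK",
    "KFD_NUMA_CHECK",
    "KFD_CAPABILITY_CHECK",
)


def read_checks(env=None):
    """Return {variable name: enabled?} for every known check.

    Assumed semantics (not taken from the commit): unset means enabled,
    "0"/"NO" means disabled, "1"/"YES" means enabled.
    """
    env = os.environ if env is None else env
    checks = {}
    for name in CHECK_VARIABLES:
        raw = env.get(name)
        if raw is None:
            checks[name] = True
        elif raw == "0" or raw.upper() == "NO":
            checks[name] = False
        elif raw == "1" or raw.upper() == "YES":
            checks[name] = True
        else:
            raise ValueError(f"invalid value for {name}: {raw!r}")
    return checks


if __name__ == "__main__":
    # Passing a dict instead of os.environ keeps the demo self-contained.
    print(read_checks({"KFD_FW_VER_CHECK": "0"}))
```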
2022-04-29	criu/plugin: Remap GPUs on checkpoint restore	David Yat Sin
	The device topology on the restore node can be different from the topology on the checkpointed node. The GPUs on the restore node may have different gpu_ids or minor numbers, or some GPUs may have different properties than those on the checkpointed node. During restore, the CRIU plugin determines the target GPUs to avoid restore failures caused by trying to restore a process on a GPU that is different. Signed-off-by: David Yat Sin <david.yatsin@amd.com>
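One way to picture the remapping problem is as an assignment search: every checkpointed GPU needs a distinct GPU on the restore node whose compared properties match. The sketch below uses plain backtracking over dictionaries. It is a simplified illustration, not the plugin's algorithm, and the property names (`vram_size`, `fw_version`), the IDs, and the function name are hypothetical.

```python
def find_gpu_mapping(source, target, keys):
    """Map each source gpu_id to a distinct target gpu_id with equal properties.

    `source` and `target` are {gpu_id: {property: value}} dictionaries and
    `keys` lists the properties to compare. Returns a {source_id: target_id}
    dictionary, or None when no complete assignment exists.
    """
    source_ids = sorted(source)
    target_ids = sorted(target)

    def compatible(src_id, dst_id):
        return all(source[src_id].get(k) == target[dst_id].get(k) for k in keys)

    def solve(index, used, mapping):
        if index == len(source_ids):
            return dict(mapping)
        src_id = source_ids[index]
        for dst_id in target_ids:
            if dst_id in used or not compatible(src_id, dst_id):
                continue
            used.add(dst_id)
            mapping[src_id] = dst_id
            found = solve(index + 1, used, mapping)
            if found is not None:
                return found
            # Undo the choice and try the next candidate.
            used.discard(dst_id)
            del mapping[src_id]
        return None

    return solve(0, set(), {})


if __name__ == "__main__":
    checkpointed = {0x1000: {"vram_size": 16, "fw_version": 5},
                    0x2000: {"vram_size": 32, "fw_version": 5}}
    restore_node = {0xA: {"vram_size": 32, "fw_version": 5},
                    0xB: {"vram_size": 16, "fw_version": 5}}
    print(find_gpu_mapping(checkpointed, restore_node, ["vram_size", "fw_version"]))
```

Dropping a property from `keys` corresponds to relaxing one comparison, which makes more target GPUs acceptable candidates.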
2022-04-29	criu/plugin: Implement system topology parsing	David Yat Sin
	Parse the local system topology in /sys/class/kfd/kfd/topology/nodes/ and store the properties of each GPU in the CRIU image files. The GPU properties can then be used later during restore to make sure the process is restored on GPUs with similar properties. Signed-off-by: David Yat Sin <david.yatsin@amd.com>
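The parsing step can be sketched as follows, assuming each node directory under the topology root holds a `gpu_id` file with one integer and a `properties` file with one `name value` pair per line. The function names and the returned dictionary shape are inventions of this example, and the sketch takes the root directory as a parameter so it can be tried against a copy of the tree rather than live sysfs.

```python
import os


def parse_properties(text):
    """Parse 'name value' lines into a {name: int} dictionary.

    Lines that do not have exactly two fields, or whose value is not an
    integer, are skipped in this sketch.
    """
    properties = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue
        name, value = fields
        try:
            properties[name] = int(value)
        except ValueError:
            continue
    return properties


def read_topology(root):
    """Return {node number: {"gpu_id": int, "properties": dict}} for `root`."""
    nodes = {}
    for entry in os.listdir(root):
        if not entry.isdigit():
            continue  # node directories are named by their index
        node_dir = os.path.join(root, entry)
        with open(os.path.join(node_dir, "gpu_id")) as handle:
            gpu_id = int(handle.read().strip())
        with open(os.path.join(node_dir, "properties")) as handle:
            properties = parse_properties(handle.read())
        nodes[int(entry)] = {"gpu_id": gpu_id, "properties": properties}
    return nodes


if __name__ == "__main__":
    print(parse_properties("simd_count 256\nnum_gws 64\n"))
```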
2022-04-29	criu/plugin: Adding check for kernel IOCTL version	David Yat Sin
	Add a check for the minimum kernel IOCTL version before attempting to checkpoint. Signed-off-by: David Yat Sin <david.yatsin@amd.com>
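A version gate of this kind reduces to comparing a (major, minor) pair against a required minimum. The sketch below assumes a common convention, that the major version must match exactly while the minor version may be equal or newer; the commit message does not state the rule the plugin applies, and the version numbers in the demo are made up.

```python
def ioctl_version_ok(found, minimum):
    """Return True when `found` satisfies `minimum`.

    Both arguments are (major, minor) tuples. Assumed rule for this sketch:
    same major version, and a minor version at least as new as required.
    """
    found_major, found_minor = found
    min_major, min_minor = minimum
    return found_major == min_major and found_minor >= min_minor


if __name__ == "__main__":
    required = (1, 8)  # hypothetical minimum, for illustration only
    for kernel_version in ((1, 7), (1, 8), (1, 12), (2, 0)):
        print(kernel_version, ioctl_version_ok(kernel_version, required))
```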
2022-04-29	criu/plugin: Support AMD ROCm Checkpoint Restore with KFD	Rajneesh Bhardwaj
	To support Checkpoint Restore with AMD GPUs for ROCm workloads, introduce a new plugin to assist CRIU with the help of the AMD KFD kernel driver. This initial commit just provides the basic framework to build up further capabilities. Like CRIU, the amdgpu plugin also uses protobuf to serialize and save the amdkfd data, which is mostly VRAM contents with some metadata. We generate a data file "amdgpu-kfd-<id>.img" during the dump stage. On restore, this file is read and extracted to re-create the various types of buffer objects that belonged to the previously checkpointed process. Upon restore, the mmap page offset within a device file might change, so we use the new hook to update and adjust the mmap offsets for the newly created target process. This is needed for the sys_mmap call in the pie restorer phase. Support for queues and events is added in future patches of this series.
	With the current implementation (amdgpu_plugin), we support:
	- Only compute workloads (non-Gfx)
	- GPU visible inside a container
	- AMD GPU Gfx 9 Family
	- PyTorch benchmarks such as BERT Base
	The amdgpu plugin depends on libdrm and libdrm_amdgpu, which are typically installed with the libdrm-dev package. We build amdgpu_plugin only when the dependencies are met on the target system and when the user intends to install the amdgpu plugin, not by default with the criu build.
	Suggested-by: Felix Kuehling <felix.kuehling@amd.com> Co-authored-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
2022-04-29	criu/plugin: Initialize AMD KFD header	Rajneesh Bhardwaj
	kfd_ioctl.h contains the definitions for the APIs and the arguments required to call the ioctls, so simply copy the header as is for the amdgpu plugin. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-04-29	criu/plugin: Implement dummy amdgpu plugin hooks	Rajneesh Bhardwaj
	This is just a placeholder dummy plugin and will be replaced by a proper plugin that implements support for AMD GPU devices. It facilitates the initial pull request and triggers the CI build test for early code review of CRIU-specific changes. Future PRs will bring in more support for amdgpu_plugin to enable CRIU with AMD ROCm. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>