github.com/checkpoint-restore/criu.git
path: root/criu
Age | Commit message | Author
2022-11-12 | cgroup: Remove redundant code that handles zombie tasks (HEAD, criu-dev) | Bui Quang Minh
Zombie tasks are dumped in dump_zombies() so it is redundant to handle them in dump_one_task(). Deprecate cg_set in task_core_entry as this field must be per thread now. Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
2022-11-11 | kerndat: Mark memfd_create(MFD_HUGETLB) unavailable when ENOSYS is returned | Bui Quang Minh
Some users on Raspberry Pi report that the kerndat checking for memfd_create(MFD_HUGETLB) support returns ENOSYS even when memfd_create syscall is available. We currently treat this error as unexpected and return error. This commit marks the memfd_create(MFD_HUGETLB) as unavailable when ENOSYS is returned. Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
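The errno handling described above, combined with the EINVAL and ENOENT cases from the older kerndat commit further down this log, can be sketched as a small classifier. This is only an illustration: the enum and the function name are hypothetical and are not taken from CRIU's kerndat code.

```c
#include <errno.h>

/* Hypothetical result codes for this sketch; not CRIU's actual names. */
enum hugetlb_probe {
	HUGETLB_PROBE_ERROR = -1,  /* unexpected failure: the check itself fails */
	HUGETLB_PROBE_MISSING = 0, /* feature absent: record that and carry on */
	HUGETLB_PROBE_OK = 1,      /* memfd_create(MFD_HUGETLB) succeeded */
};

/*
 * Classify the outcome of memfd_create("", MFD_HUGETLB).
 * fd is the syscall's return value, err is errno read right after it.
 */
static enum hugetlb_probe classify_memfd_hugetlb(int fd, int err)
{
	if (fd >= 0)
		return HUGETLB_PROBE_OK;

	switch (err) {
	case EINVAL: /* reported as "not supported" by the existing check */
	case ENOENT: /* no proper hugetlbfs mount was found */
	case ENOSYS: /* reported by some Raspberry Pi users */
		return HUGETLB_PROBE_MISSING;
	default:
		return HUGETLB_PROBE_ERROR;
	}
}
```

A caller would pass the return value and errno of the real memfd_create() call and only treat HUGETLB_PROBE_ERROR as fatal.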
2022-11-02 | cgroup-v2: Restore threads in a process into correct threaded controllers | Bui Quang Minh
As threads in a process may be in different threaded controllers, we need to move those threads to the correct controllers. Because the threads of a process are restored at a later stage in restorer.c, we need to create a cgroupd service to help move those threads into the correct controllers when they are restored. We cannot use usernsd as the code in the restorer does not know the address of an outside function to pass to userns_call. However, this cgroupd service still reuses a lot of code from usernsd. The main logic is that restored threads receive the cg_set number they belong to before the restorer stage in case their cg_set differs from the main thread's. When these threads are restored, they send the cg_set number and their thread ids through a unix socket to cgroupd. cgroupd receives the cg_set number and thread ids and moves those threads into the correct controllers. Thread ids are sent through SCM_CREDENTIALS of the unix socket so they are translated into the correct thread ids on the receiving end. Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
2022-11-02 | cgroup-v2: Dump cgroup controllers of every threads in a process | Bui Quang Minh
Currently, we assume all threads in a process are in the same cgroup controllers. However, with threaded controllers, threads in a process may be in different controllers. So we need to dump the cgroup controllers of every thread in the process and fix up the procfs cgroup parsing to parse from self/task/<tid>/cgroup. Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
2022-11-02 | cgroup-v2: Checkpoint and restore some global properties | Bui Quang Minh
This commit supports checkpoint/restore of some new global properties in cgroup-v2:
  cgroup.subtree_control
  cgroup.max.descendants
  cgroup.max.depth
  cgroup.freeze
  cgroup.type
Only cgroup.subtree_control and cgroup.type need some more code to handle. The cgroup.subtree_control value needs to be set with a "+" or "-" prefix, and cgroup.type can only be written with the value "threaded" if we want to make this controller threaded. cgroup.type is a special property because it must be restored before any processes can move into this controller. Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
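The "+" prefix requirement for cgroup.subtree_control can be illustrated with a short helper. This is a minimal sketch that assumes the dumped value is a whitespace-separated list of controller names, as the file reads back from the kernel; the function name is hypothetical and is not CRIU's.

```c
#include <stdio.h>
#include <string.h>

/*
 * Turn a dumped "cpuset cpu memory" list into "+cpuset +cpu +memory",
 * the form that can be written back to cgroup.subtree_control.
 * Returns 0 on success, -1 if out is too small.
 */
static int format_subtree_control(const char *dumped, char *out, size_t out_len)
{
	size_t used = 0;

	if (out_len == 0)
		return -1;
	out[0] = '\0';

	while (*dumped) {
		size_t len;
		int n;

		dumped += strspn(dumped, " \n"); /* skip separators */
		len = strcspn(dumped, " \n");    /* length of one controller name */
		if (len == 0)
			break;

		n = snprintf(out + used, out_len - used, "%s+%.*s",
			     used ? " " : "", (int)len, dumped);
		if (n < 0 || (size_t)n >= out_len - used)
			return -1; /* output would be truncated */
		used += (size_t)n;
		dumped += len;
	}
	return 0;
}
```

For the input "cpu memory" the helper produces "+cpu +memory"; disabling controllers would use "-" in the same position.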
2022-11-02 | ipc_sysctl: Prioritize restoring IPC variables using non usernsd approach | Bui Quang Minh
Since commit https://github.com/torvalds/linux/commit/5563cabdde, a user with sufficient capabilities can open IPC sysctl files and write to them. Therefore, we no longer need the usernsd process in the outside user namespace to help with that. Furthermore, some later commits (https://github.com/torvalds/linux/commit/1f5c135ee5, https://github.com/torvalds/linux/commit/0889f44e28) bind the IPC namespace to the opened file descriptor of the IPC sysctl at open() time, so the changed value no longer depends on the IPC namespace at write() time. This breaks the current usernsd approach. So, we prioritize opening/writing IPC sysctl files directly in the context of the restored process, without usernsd help. This approach succeeds on newer kernels since the restored process has enough capabilities at this restore stage. On older kernels, the open() fails and we fall back to the usernsd approach. Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
2022-10-25 | cgroup: add a comment to restore_cgroup_prop about path argument requirements | Pavel Tikhomirov
In Virtuozzo we've faced an out-of-bounds access when calling this function on a short path string, which corrupted other memory and led to a segmentation fault. So it may be useful to have this comment in the code to avoid such misuse of this function in the future. Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-10-25 | non-root: Introduce unprivileged mode to kerndat | Younes Manton
This patch modifies how kerndat is handled in unprivileged mode. Initialization and functionality that can only be done as root is made separate from common code. The kerndat file's location is defined as $XDG_RUNTIME_DIR/criu.kdat in unprivileged mode. Since we expect that directory to be on tmpfs we maintain the same behavior as the root-mode kerndat which lives in /run. Co-authored-by: Adrian Reber <areber@redhat.com> Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
2022-10-25 | non-root: enable non-root checkpoint/restore | Younes Manton
This commit enables checkpointing and restoring of applications as non-root. First goal was to enable checkpoint and restore of the env00 and pthread00 test case. This uses the information from opts.unprivileged and opts.cap_eff to skip certain code paths which do not work as non-root. Co-authored-by: Adrian Reber <areber@redhat.com> Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
2022-10-25 | non-root: add functions to work with capabilities | Adrian Reber
This adds the function check_caps(), which checks if CRIU is running with at least CAP_CHECKPOINT_RESTORE. That is the minimum capability CRIU needs to do a minimal checkpoint and restore from it. In addition, helper functions are added to easily query for other capabilities for enhanced checkpoint/restore support. Co-authored-by: Younes Manton <ymanton@ca.ibm.com> Signed-off-by: Adrian Reber <areber@redhat.com> Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
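The kind of test that check_caps() and the helper functions perform can be sketched as a bit test on the effective capability set. The struct and function names below are hypothetical stand-ins, not CRIU's definitions; the capability number 40 is the kernel's value for CAP_CHECKPOINT_RESTORE.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value from the kernel's <linux/capability.h>; defined here for older headers. */
#ifndef CAP_CHECKPOINT_RESTORE
#define CAP_CHECKPOINT_RESTORE 40
#endif

/* Effective capability set as two 32-bit words, the layout capget() reports. */
struct cap_words {
	uint32_t w[2];
};

/* Test one capability bit in the effective set. */
static bool cap_is_set(const struct cap_words *eff, unsigned int cap)
{
	if (cap >= 64)
		return false;
	return (eff->w[cap / 32] >> (cap % 32)) & 1u;
}

/* Minimum requirement named in the commit message above. */
static bool can_checkpoint_restore(const struct cap_words *eff)
{
	return cap_is_set(eff, CAP_CHECKPOINT_RESTORE);
}
```

Other code paths could call cap_is_set() with a different capability number to decide whether an optional feature is enabled.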
2022-10-25 | non-root: add infrastructure to run as non-root | Adrian Reber
The idea behind the rootless CRIU code is that CRIU reads out its effective capabilities and stores them in the global opts structure. Different parts of CRIU can then, based on the existing capabilities, automatically enable or disable certain code paths. Currently at least CAP_CHECKPOINT_RESTORE is required. CRIU will not start without this capability. Signed-off-by: Adrian Reber <areber@redhat.com>
2022-09-15 | seize: do not overwrite exit code from failpath | Liu Hua
Signed-off-by: Liu Hua <weldonliu@tencent.com>
2022-08-31 | files-reg: skip failed mount lookup for shell-job's tty | Pavel Tikhomirov
When we restore a shell-job we inherit its ttys, so even if we don't have the right mount for a tty in the container on dump, on restore it should just be right. Otherwise, when dumping a second time via criu-ns we get:
    (00.005678) Error (criu/files-reg.c:1710): Can't lookup mount=29 for fd=0 path=/dev/pts/20
Fixes: #1893 Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-08-29 | mount: add definition for FSOPEN_CLOEXEC | Radostin Stoyanov
A recent change in glibc introduced `enum fsconfig_command` [1] and as a result the compilation of criu fails with the following errors:

    In file included from criu/pie/util.c:3:
    /usr/include/sys/mount.h:240:6: error: redeclaration of 'enum fsconfig_command'
      240 | enum fsconfig_command
          |      ^~~~~~~~~~~~~~~~
    In file included from /usr/include/sys/mount.h:32:
    criu/include/linux/mount.h:11:6: note: originally defined here
       11 | enum fsconfig_command {
          |      ^~~~~~~~~~~~~~~~
    /usr/include/sys/mount.h:242:3: error: redeclaration of enumerator 'FSCONFIG_SET_FLAG'
      242 |   FSCONFIG_SET_FLAG = 0, /* Set parameter, supplying no value */
          |   ^~~~~~~~~~~~~~~~~
    criu/include/linux/mount.h:12:9: note: previous definition of 'FSCONFIG_SET_FLAG' with type 'enum fsconfig_command'
       12 |         FSCONFIG_SET_FLAG = 0, /* Set parameter, supplying no value */
          |         ^~~~~~~~~~~~~~~~~
    /usr/include/sys/mount.h:244:3: error: redeclaration of enumerator 'FSCONFIG_SET_STRING'
      244 |   FSCONFIG_SET_STRING = 1, /* Set parameter, supplying a string value */
          |   ^~~~~~~~~~~~~~~~~~~
    criu/include/linux/mount.h:14:9: note: previous definition of 'FSCONFIG_SET_STRING' with type 'enum fsconfig_command'
       14 |         FSCONFIG_SET_STRING = 1, /* Set parameter, supplying a string value */
          |         ^~~~~~~~~~~~~~~~~~~
    /usr/include/sys/mount.h:246:3: error: redeclaration of enumerator 'FSCONFIG_SET_BINARY'
      246 |   FSCONFIG_SET_BINARY = 2, /* Set parameter, supplying a binary blob value */
          |   ^~~~~~~~~~~~~~~~~~~
    criu/include/linux/mount.h:16:9: note: previous definition of 'FSCONFIG_SET_BINARY' with type 'enum fsconfig_command'
       16 |         FSCONFIG_SET_BINARY = 2, /* Set parameter, supplying a binary blob value */
          |         ^~~~~~~~~~~~~~~~~~~
    /usr/include/sys/mount.h:248:3: error: redeclaration of enumerator 'FSCONFIG_SET_PATH'
      248 |   FSCONFIG_SET_PATH = 3, /* Set parameter, supplying an object by path */
          |   ^~~~~~~~~~~~~~~~~
    criu/include/linux/mount.h:18:9: note: previous definition of 'FSCONFIG_SET_PATH' with type 'enum fsconfig_command'
       18 |         FSCONFIG_SET_PATH = 3, /* Set parameter, supplying an object by path */
          |         ^~~~~~~~~~~~~~~~~
    /usr/include/sys/mount.h:250:3: error: redeclaration of enumerator 'FSCONFIG_SET_PATH_EMPTY'
      250 |   FSCONFIG_SET_PATH_EMPTY = 4, /* Set parameter, supplying an object by (empty) path */
          |   ^~~~~~~~~~~~~~~~~~~~~~~
    criu/include/linux/mount.h:20:9: note: previous definition of 'FSCONFIG_SET_PATH_EMPTY' with type 'enum fsconfig_command'
       20 |         FSCONFIG_SET_PATH_EMPTY = 4, /* Set parameter, supplying an object by (empty) path */
          |         ^~~~~~~~~~~~~~~~~~~~~~~
    /usr/include/sys/mount.h:252:3: error: redeclaration of enumerator 'FSCONFIG_SET_FD'
      252 |   FSCONFIG_SET_FD = 5, /* Set parameter, supplying an object by fd */
          |   ^~~~~~~~~~~~~~~
    criu/include/linux/mount.h:22:9: note: previous definition of 'FSCONFIG_SET_FD' with type 'enum fsconfig_command'
       22 |         FSCONFIG_SET_FD = 5, /* Set parameter, supplying an object by fd */
          |         ^~~~~~~~~~~~~~~
    /usr/include/sys/mount.h:254:3: error: redeclaration of enumerator 'FSCONFIG_CMD_CREATE'
      254 |   FSCONFIG_CMD_CREATE = 6, /* Invoke superblock creation */
          |   ^~~~~~~~~~~~~~~~~~~
    criu/include/linux/mount.h:24:9: note: previous definition of 'FSCONFIG_CMD_CREATE' with type 'enum fsconfig_command'
       24 |         FSCONFIG_CMD_CREATE = 6, /* Invoke superblock creation */
          |         ^~~~~~~~~~~~~~~~~~~
    /usr/include/sys/mount.h:256:3: error: redeclaration of enumerator 'FSCONFIG_CMD_RECONFIGURE'
      256 |   FSCONFIG_CMD_RECONFIGURE = 7, /* Invoke superblock reconfiguration */
          |   ^~~~~~~~~~~~~~~~~~~~~~~~
    criu/include/linux/mount.h:26:9: note: previous definition of 'FSCONFIG_CMD_RECONFIGURE' with type 'enum fsconfig_command'
       26 |         FSCONFIG_CMD_RECONFIGURE = 7, /* Invoke superblock reconfiguration */

This patch adds a definition for FSOPEN_CLOEXEC to solve this problem. In particular, sys/mount.h wraps `enum fsconfig_command` in an #ifndef check for FSOPEN_CLOEXEC.
[1] https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=7eae6a91e9b1670330c9f15730082c91c0b1d570
Reported-by: Younes Manton (@ymanton) Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2022-08-23 | criu: fail migration if data was sent to an in-flight socket | Michal Clapinski
Before this change, CRIU would just lose that data upon migration. So it's better to fail migration in this case. To reproduce the bug one can:
1. Create an AF_UNIX socket and call listen on it.
2. Create a second AF_UNIX socket and call connect to the first one.
3. Send the data to the second socket.
4. Migrate.
5. Call accept on the first socket and then read. There would be no data available.
It should even be possible to close the second socket before migration. This would cause accept to hang because CRIU totally misses a closed in-flight socket. Signed-off-by: Michal Clapinski <mclapinski@google.com>
2022-08-15 | breakpoint: enable breakpoints by default on amd64 and arm64 | fu.lin
Signed-off-by: fu.lin <fulin10@huawei.com> Signed-off-by: Andrei Vagin <avagin@gmail.com>
2022-08-15 | compel: clear a breakpoint right after it's been triggered | Andrei Vagin
Breakpoints are used to stop as close as possible to a target system call. First, we don't need it after this point. Second, PTRACE_CONT can't pass through a breakpoint on arm64. Signed-off-by: Andrei Vagin <avagin@gmail.com>
2022-08-15 | compel: set TRACESYSGOOD to distinguish breakpoints from syscalls | Andrei Vagin
When delivering system call traps, set bit 7 in the signal number (i.e., deliver SIGTRAP|0x80). This makes it easy for the tracer to distinguish normal traps from those caused by a system call. Signed-off-by: Andrei Vagin <avagin@gmail.com>
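On the tracer side, the distinction comes down to checking the stop signal in the waitpid() status. The following is a minimal sketch with hypothetical helper names, assuming the tracee was attached with the PTRACE_O_TRACESYSGOOD option; it is not compel's actual code.

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/wait.h>

/* True if a waitpid() status is a syscall stop under PTRACE_O_TRACESYSGOOD. */
static bool is_syscall_stop(int status)
{
	return WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80);
}

/* True for a SIGTRAP stop without bit 7 set; a breakpoint hit is one such case. */
static bool is_plain_trap(int status)
{
	return WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP;
}
```

Without the option, both kinds of stop report a bare SIGTRAP and the tracer has to tell them apart by other means.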
2022-08-08 | cr-restore: rseq: use glibc-specific way to unregister only as fallback | Alexander Mikhalitsyn
Let's use the dynamic approach to detect built-in *libc rseq in all cases, and the "old" static approach as a fallback path if the user's kernel lacks support for the ptrace_get_rseq_conf feature. Suggested-by: Florian Weimer <fweimer@redhat.com> Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
2022-08-08 | cr-restore: rseq: dynamically handle *libc with rseq | Alexander Mikhalitsyn
Before this patch we assumed that CRIU is compiled against the same glibc as it runs with. But as we see from real-world examples like #1935, that's not always true. The idea of this patch is to detect the rseq configuration of the main CRIU process and use it to unregister rseq for all further child processes. This is correct because we restore the pstree using clone*() syscalls and don't use exec*() (!) syscalls, so rseq gets inherited in the kernel and the rseq configuration remains the same for all child processes. This will prevent issues like this: https://github.com/checkpoint-restore/criu/issues/1935 Suggested-by: Florian Weimer <fweimer@redhat.com> Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
2022-08-08 | cr-check: optimize check for apparmor stacking | Pavel Tikhomirov
The result of check_aa_ns_dumping() is stored in kdat. Instead of doing the same check twice - once on kerndat_init(), and again in check_apparmor_stacking(), we can check the stored value. Suggested-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2022-08-08 | cr-check: fix check for apparmor stacking | Radostin Stoyanov
The feature check for AppArmor stacking was introduced in commit 8723e3f998d1ec5f125e6600436a96f7ff9c1631 ("check: add a feature test for apparmor_stacking"). However, on systems that don't support AppArmor, this check always fails. As a result, `criu check --all` shows the following message:
    Looks good but some kernel features are missing which, depending on your process tree, may cause dump or restore failure.
Reported-by: André Rösti (@andrej) Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2022-08-05 | criu: fix conflicting headers | Radostin Stoyanov
There are several changes in glibc 2.36 that make the sys/mount.h header incompatible with kernel headers: https://sourceware.org/glibc/wiki/Release/2.36#Usage_of_.3Clinux.2Fmount.h.3E_and_.3Csys.2Fmount.h.3E This patch removes conflicting includes of `<linux/mount.h>` and updates the content of `criu/include/linux/mount.h` to match `/usr/include/sys/mount.h`. In addition, the inline definitions of the sys_*() functions have been moved from "linux/mount.h" to "syscall.h" to avoid conflicts with `uapi/compel/plugins/std/syscall.h` and `<unistd.h>`. The include of `<linux/aio_abi.h>` has been replaced with a local include to avoid conflicts with `<sys/mount.h>`. Fixes: #1949 Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2022-08-04 | files-reg.c: modify the check of ghost_limit to support large sparse files | Liang-Chun Chen
files-reg.c checks whether the file size is larger than ghost_limit using st_size (in dump_ghost_remap). This cannot handle a large sparse ghost file, since its actual size on disk is not what st_size shows. Therefore, in this commit, I replace st_size with st_blocks, which reflects the actual size (1 block = 512B), so that criu can handle large sparse ghost files. Signed-off-by: Liang-Chun Chen <featherclc@gmail.com>
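The difference between the two size measures can be shown with a small comparison helper. This is a sketch only: the function name is hypothetical, and the real check lives in dump_ghost_remap().

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/stat.h>

/* st_blocks counts 512-byte units, independent of the filesystem block size. */
#define STAT_BLOCK_SIZE 512ULL

/* Compare the space actually allocated to the file against the limit. */
static bool ghost_exceeds_limit(const struct stat *st, uint64_t ghost_limit)
{
	uint64_t allocated = (uint64_t)st->st_blocks * STAT_BLOCK_SIZE;

	return allocated > ghost_limit;
}
```

A sparse file with an st_size of 1 GiB but only eight allocated blocks (4 KiB) passes a 1 MiB limit under this check, while a check on st_size would reject it.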
2022-07-26 | vdso-compat: Increase the reserved buffer for compat vdso | Bui Quang Minh
On Arch Linux with 5.18.3-zen1-1-zen kernel, the vdso's size is 3 pages which exceeds the current 2-page reserved buffer. This commit simply increases the reserved buffer size to 4 pages. Fixes: https://github.com/checkpoint-restore/criu/issues/1916 Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
2022-07-19 | rseq: fix headers conflict on Mariner GNU/Linux | Alexander Mikhalitsyn
1. For some reason, the Mariner distribution headers do not correctly define the __GLIBC_HAVE_KERNEL_RSEQ compile-time constant. It remains undefined, but in fact the header files provide the corresponding rseq type declarations, which leads to a conflict.
2. Another issue is that they use uint*_t types instead of __u* types as in the original rseq.h. This leads to compile-time issues like this:
    format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type 'uint64_t' {aka 'long unsigned int'}
and we can't even replace %llx with %PRIx64 because it will break compilation on other distros (like Fedora) with an analogous error:
    error: format ‘%lx’ expects argument of type ‘long unsigned int’, but argument 6 has type ‘__u64’ {aka ‘long long unsigned int’}
Let's use our own copy of struct rseq, fully equal to the kernel one. It's safe because this structure is a part of the Linux kernel ABI. Fixes #1934 Reported-by: Nikola Bojanic Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
2022-07-13 | config/files-reg: Add opt to skip file r/w/x check on restore | Younes Manton
A file's r/w/x changing between checkpoint and restore does not necessarily imply that something is wrong. For example, if a process opens a file having perms rw- for reading and we change the perms to r--, the process can be restored and will function as expected. Therefore, this patch adds an option --skip-file-rwx-check to disable this check on restore. File validation is unaffected and should still function as expected with respect to the content of files. Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
2022-07-02 | infect: add SIGTSTP support | Yuriy Vasiliev
Add SIGTSTP signal dump and restore. Add a corresponding field in the image and save it only if a task is in the stopped state. Restore the task state by sending the desired stop signal if it is present in the image. Fall back to SIGSTOP if it's absent. Signed-off-by: Yuriy Vasiliev <yuriy.vasiliev@openvz.org>
2022-06-22 | ci: Fix code indent | Radostin Stoyanov
This patch contains auto-generated changes from `make indent` Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2022-06-21 | config: fail on --track-mem option if dirty tracking is not available | Pavel Tikhomirov
Else we trigger BUG in task_reset_dirty_track():
    Error (criu/mem.c:45): BUG at criu/mem.c:45
The check in kerndat_get_dirty_track() does not work right. https://github.com/checkpoint-restore/criu/issues/1917 Reported-by: @mrc1119 Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-06-20 | util/mount-v2: fix resolve_mountpoint() to always return freeable pointer | Pavel Tikhomirov
Else we have a Segmentation fault in __move_mount_set_group() on xfree(source_mp) if resolve_mountpoint() returned statically allocated path. Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-06-14 | hugetlb: don't dump anonymous private hugetlb mapping using memfd approach | Bui Quang Minh
Currently, the content of an anonymous private hugetlb mapping is dumped in 2 different images: by the memfd approach and by the normal private mapping dump. In the memfd approach, we dump the content of the backing pseudo file (/anon_hugepage). This is incorrect and redundant: since the mapping is private, the content of the backing file may differ from the content of the mapping. With this commit, we remove the redundant memfd approach dump and only do the normal private mapping dump on anonymous hugetlb mappings. Run zdtm.py run -f h --keep-img always -t zdtm/static/maps09, then du -h in the dumped image directory:
    Before this commit: 13M test/dump/zdtm/static/maps09/55/1
    After this commit: 8.5M test/dump/zdtm/static/maps09/55/1
The reduction in size is approximately 4MB, which is the size of the anonymous private hugetlb mapping in the test. Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
2022-06-13 | mount-v2: workaround for multiple external bindmounts with no common root | Pavel Tikhomirov
Restoring a sharing group is a problem when we need to copy sharing between two mounts with non-intersecting roots, because the kernel does not allow it. We have a case, https://github.com/opencontainers/runc/pull/3442, where runc adds different devtmpfs file-bindmounts to a container and there is no fsroot mount in the container for this devtmpfs, so mount-v2 faces the above problem. Luckily, for external mounts which are in one sharing group and have non-intersecting roots, these mounts likely only have an external master with no sharing, so we can just copy sharing from the external source and make it a slave as a workaround. https://github.com/checkpoint-restore/criu/issues/1886 Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-06-13 | mount-v2: split out restore_one_sharing helper | Pavel Tikhomirov
This helper restores master_id and shared_id of first mount in the sharing group. It first copies sharing from either external source or internal parent sharing group and makes master_id from shared_id. Next it creates new shared_id when needed. All other mounts except first are just copied from the first one. Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-06-13 | sk-unix: make add_fake_unix_queuers earier and rework find_queuer_for | Pavel Tikhomirov
Before this patch, if we had a unixsk with incoming scm packets (with fds) and with the sender side fd closed, we got an error:
    Error (criu/sk-unix.c:1125): unix: Can't find sender for 0x1e
The first part of the problem is that unix_note_scm_rights() expects to see a "queuer" which would send scm packets to the unixsk, and there is none as the sender side is closed. The second part of the problem is that we already have the "fake" queuers feature, which creates a unix socket pair and leaves the other end open for queuing packets later. But the function add_fake_unix_queuers() is called after unix_note_scm_rights(), so there is no chance to find a queuer at the point of failure. The third part is that when we look for a queuer in find_queuer_for() we actually look for a socket for which we are a queuer and not for the socket which is a queuer for us, which is the opposite of what the name says. For cases where both ends are alive, both are queuers for each other so this was not important, but for our closed sender case it breaks. So let's move add_fake_unix_queuers() before unix_note_scm_rights() and make find_queuer_for() actually do what its name implies. This situation started to reproduce on Virtuozzo start/stop tests with the unixsk belonging to systemd. We suppose that this state, where the sender side fd is closed, happens rarely and only on systemd start/stop, so we don't see it in regular suspend/resume of long-living containers. Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-05-17 | amdgpu: Set PLUGINDIR to /usr/lib/criu | Radostin Stoyanov
Building the criu packages for Ubuntu/Debian fails with:
    mkdir: cannot create directory '/var/lib/criu': Permission denied
This patch updates PLUGINDIR with the value /usr/lib/criu. Fixes: #1877 Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2022-05-13 | page-xfer: refactoring analyze_iov and fill_userbuf | Andrei Vagin
* handle unexpected errors of process_vm_readv
* adjust riovs in analyze_iov
* call handle_faulty_iov only if process_vm_readv returns EFAULT.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
2022-05-13 | pre-dump: call vmsplice with SPLICE_F_GIFT | Andrei Vagin
In this case, vmsplice attaches pages without copying them. Signed-off-by: Andrei Vagin <avagin@gmail.com>
2022-05-13 | page-xfer: adjust a buffer to a pipe size | Andrei Vagin
Due to side effects of F_SETPIPE_SZ, the actual pipe size can be greater than PIPE_MAX_SIZE. Signed-off-by: Andrei Vagin <avagin@gmail.com>
2022-05-13 | page-xfer: use negative values for error codes | Andrei Vagin
Signed-off-by: Andrei Vagin <avagin@gmail.com>
2022-05-13 | page-pipe: fix limiting a pipe size | Andrei Vagin
But actually, 5a92f100b88e probably has to be reverted as a whole. PIPE_MAX_SIZE is the hard limit to avoid PAGE_ALLOC_COSTLY_ORDER allocations in the kernel. But F_SETPIPE_SZ rounds up a requested pipe size to a power-of-2 pages. It means that when we request PIPE_MAX_SIZE that isn't a power-of-2 number, we actually request a pipe size greater than PIPE_MAX_SIZE. Fixes: 5a92f100b88e ("page-pipe: Resize up to PIPE_MAX_SIZE") Signed-off-by: Andrei Vagin <avagin@gmail.com>
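The rounding described above can be made concrete with a little arithmetic. The helpers below are an illustrative sketch with hypothetical names: the first mirrors the rounding the commit message attributes to F_SETPIPE_SZ, and the second shows one possible way to pick a request that stays within a limit, which is not necessarily what the actual fix does.

```c
#include <stddef.h>

/*
 * Size a pipe ends up with when `requested` bytes are asked for via
 * F_SETPIPE_SZ: the page count is rounded up to a power of two.
 */
static size_t effective_pipe_size(size_t requested, size_t page_size)
{
	size_t pages = (requested + page_size - 1) / page_size;
	size_t pow2 = 1;

	while (pow2 < pages)
		pow2 <<= 1;
	return pow2 * page_size;
}

/* Largest power-of-two page count, in bytes, that still fits in `limit`. */
static size_t clamp_pipe_request(size_t limit, size_t page_size)
{
	size_t pages = limit / page_size;
	size_t pow2 = 1;

	if (pages == 0)
		return 0;
	while (pow2 * 2 <= pages)
		pow2 *= 2;
	return pow2 * page_size;
}
```

With 4096-byte pages, a request of three pages yields a four-page pipe, so a three-page limit is only respected if the request is clamped down to two pages first.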
2022-05-13 | mem: Skip pre-dumping on hugetlb mappings | Bui Quang Minh
As private hugetlb mappings are not pre-mapped, their content is restored in the restorer, which cannot use page_read->read_pages. As a result, we cannot recursively read the content of the pre-dumped image in the parent directory, and we use preadv to read the content from the last dumped image only. Therefore, restore may freeze when the content of the mapping is in a pre-dumped image in the parent directory. We need to skip pre-dumping on hugetlb mappings to resolve the issue. Suggested-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com> Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
2022-05-13 | cr-dump: do not report success to logs if post-dump script failed | Pavel Tikhomirov
It can be confusing to see an error from the post-dump action script and a non-zero return from criu while at the same time seeing "Dumping finished successfully" in the log. I believe it is logical to consider the post-dump action script a part of the "dump" process, so a failure in it means that the whole dump failed. Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-05-05 | kerndat: handle the case when hugetlb isn't supported | Alexander Mikhalitsyn
Currently we check memfd_hugetlb by doing memfd_create("", MFD_HUGETLB). If we see EINVAL we report that it's not supported, but we can also get ENOENT error in such case in hugetlb_file_setup() while trying to find proper hugetlbfs mount. Reference: https://github.com/torvalds/linux/blob/06fb4ecfeac/fs/hugetlbfs/inode.c#L1465 Fixes: 4245e6b02fa ("check: Add a check for using memfd with hugetlb") Reported-by: Mr. Jenkins (ppc64le) Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
2022-04-29 | sk-unix: rework bind_on_deleted() return codes | Andrey Zhadchenko
The bind_on_deleted() return code is only used for setting errno for pr_perror(). This is mostly useless since a lot of syscalls already set it. All non-syscall errors already have prints in case of failure. Fix bind_on_deleted() always returning 0 and simplify error juggling to returning -1 in case of errors. Fixes: #1771 Fixes: d0308e5ecc1c ("sk-unix: make criu respect existing files while restoring ghost unix socket fd") Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko@virtuozzo.com>
2022-04-29 | proc_parse: Fix parsing bpf map_extra | Radostin Stoyanov
The map_extra field has been introduced in Linux Kernel release 5.16 and does not exist in older kernel versions. The current parsing implementation fails when map_extra is missing. In particular, it tries to parse the `memlock` field as `map_extra` and fails but it does not exit with an error because map_extra is marked as "optional". It then tries to parse the `map_id` field as `memlock` and fails with an error because map_id is not optional: Error (criu/proc_parse.c:2161): parse_fdinfo_pid_s: error parsing [map_type:\t2] for 19: Success' To correctly handle this, we should try to parse again the next field when parsing of `map_extra` fails, without reading the next line from the bpfmap. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
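The retry idea, matching the next field against the same line when the optional field is absent, can be sketched as follows. The struct, the function name and the array-of-lines input are hypothetical simplifications, not the structure of the real parser in proc_parse.c.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical holder for the three fdinfo fields discussed above. */
struct bpfmap_tail {
	unsigned long long map_extra; /* optional, absent before Linux 5.16 */
	unsigned long memlock;
	unsigned int map_id;
};

/*
 * Parse "map_extra" (optional), "memlock" and "map_id" from consecutive
 * fdinfo lines. When map_extra does not match, the same line is tried
 * again as memlock instead of being skipped.
 */
static int parse_bpfmap_tail(const char *const *lines, size_t n,
			     struct bpfmap_tail *out)
{
	size_t i = 0;

	out->map_extra = 0;
	if (i < n && sscanf(lines[i], "map_extra: %llx", &out->map_extra) == 1)
		i++; /* consume the line only if the optional field matched */

	if (i >= n || sscanf(lines[i], "memlock: %lu", &out->memlock) != 1)
		return -1;
	i++;

	if (i >= n || sscanf(lines[i], "map_id: %u", &out->map_id) != 1)
		return -1;
	return 0;
}
```

The same function then accepts both the older layout, where memlock comes first, and the newer one, where map_extra precedes it.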
2022-04-29 | bpf: update deprecated API | Radostin Stoyanov
bpf_create_map_xattr() has been replaced with bpf_map_create(): https://github.com/libbpf/libbpf/commit/6cfb97c
DECLARE_LIBBPF_OPTS has been renamed to LIBBPF_OPTS: https://github.com/libbpf/libbpf/commit/ea6c242
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2022-04-29 | rseq: handle rseq/rseq_cs flags properly | Alexander Mikhalitsyn
Userspace may configure the rseq cs abort policy by setting RSEQ_CS_FLAG_NO_RESTART_ON_* flags. In ("cr-dump: fixup thread IP when inside rseq cs") we added support for the case when a process is caught by CRIU during rseq cs execution, by fixing up the IP to abort_ip. That's the common case, but there is a special flag called RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL; in this case we have to leave the process IP as it was before CRIU seized it. Unfortunately, that's not all that we need here. We must also preserve the (struct rseq)->rseq_cs field. You may ask, "Why do we need to preserve it by hand? CRIU dumps all process memory and restores it." That's true, but it's not so easy. The problem here is that the kernel clears this field when it realizes that the process has left the rseq cs. But during the dump/restore procedures we execute the parasite/restorer from the process context. It means that the process will leave the rseq cs in any case and (struct rseq)->rseq_cs will be cleared by the kernel. So we need to restore this field by hand at the *last* stage of restore, just before releasing the processes. Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
2022-04-29 | cr-dump: fixup thread IP when inside rseq cs | Alexander Mikhalitsyn
If we caught the process when it's inside an rseq critical section, we have to handle it properly. From the kernel's point of view, if the process is executing inside the rseq cs and gets a signal, the rseq critical section execution will be interrupted and, after the signal handler has run, execution proceeds to the rseq cs abort handler instead of continuing normal rseq cs execution (if RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL isn't set). When CRIU seizes a process, that's the same thing as getting a signal from the rseq point of view. So we need to fix up the instruction pointer to the rseq cs abort handler address. Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
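The fixup rule from this commit, together with the RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL exception from the follow-up commit listed above it, can be sketched as a pure function. The struct below is a local stand-in with hypothetical names, laid out like the kernel's struct rseq_cs; the flag value (1U << 1) is the one defined in the kernel's rseq uapi header.

```c
#include <stdint.h>

/* Same value as RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL in <linux/rseq.h>. */
#define SKETCH_RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL (1U << 1)

/* Local stand-in for the kernel's struct rseq_cs (hypothetical name). */
struct sketch_rseq_cs {
	uint32_t version;
	uint32_t flags;
	uint64_t start_ip;
	uint64_t post_commit_offset;
	uint64_t abort_ip;
};

/* Return the instruction pointer the seized thread should be dumped with. */
static uint64_t fixup_ip(uint64_t ip, const struct sketch_rseq_cs *cs)
{
	/* Unsigned subtraction also rejects ip < start_ip. */
	if (ip - cs->start_ip >= cs->post_commit_offset)
		return ip; /* not inside the critical section */
	if (cs->flags & SKETCH_RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL)
		return ip; /* userspace asked not to abort on signals */
	return cs->abort_ip;
}
```

The range test uses the same unsigned comparison trick the kernel applies when it decides whether an IP lies inside a critical section.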
2022-04-29 | compel: add helpers to get/set instruction pointer | Alexander Mikhalitsyn
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>