Age | Commit message | Author |
|
Zombie tasks are dumped in dump_zombies() so it is redundant to handle them
in dump_one_task().
Deprecate cg_set in task_core_entry as this field must be per thread now.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
|
|
Currently, we assume all threads in a process are in the same cgroup controllers.
However, with threaded controllers, threads in a process may be in different
controllers. So we need to dump the cgroup controllers of every thread in the
process and fix up the procfs cgroup parsing to parse from self/task/<tid>/cgroup.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
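A minimal sketch of per-thread cgroup parsing, assuming the usual
"hierarchy-ID:controller-list:path" line format of procfs cgroup files.
The names CgroupEntry, parse_cgroup_text and thread_cgroup_path are
hypothetical and do not come from the CRIU sources; the sample text is
made up for illustration.

```python
from typing import NamedTuple


class CgroupEntry(NamedTuple):
    hierarchy_id: int
    controllers: tuple
    path: str


def parse_cgroup_text(text):
    """Parse the contents of a procfs cgroup file into a list of entries."""
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        # The path may itself contain ':', so split at most twice.
        hier, ctrls, path = line.split(":", 2)
        controllers = tuple(c for c in ctrls.split(",") if c)
        entries.append(CgroupEntry(int(hier), controllers, path))
    return entries


def thread_cgroup_path(pid, tid):
    """Per-thread cgroup file, as opposed to the per-process /proc/<pid>/cgroup."""
    return f"/proc/{pid}/task/{tid}/cgroup"


# Illustrative input only: one v1 hierarchy line and one v2 (unified) line.
sample = "5:cpu,cpuacct:/workers/t1\n0::/app.slice\n"
parsed = parse_cgroup_text(sample)
```

Reading thread_cgroup_path(pid, tid) for every tid under /proc/<pid>/task
and parsing each file separately is what lets threads of one process end
up with different entries.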
|
|
This commit enables checkpointing and restoring of applications as
non-root.
First goal was to enable checkpoint and restore of the env00 and
pthread00 test case.
This uses the information from opts.unprivileged and opts.cap_eff to
skip certain code paths which do not work as non-root.
Co-authored-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
|
|
A file's r/w/x changing between checkpoint and restore does
not necessarily imply that something is wrong. For example,
if a process opens a file having perms rw- for reading and
we change the perms to r--, the process can be restored and
will function as expected.
Therefore, this patch adds an option
--skip-file-rwx-check
to disable this check on restore. File validation is unaffected
and should still function as expected with respect to the content
of files.
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
|
|
Add SIGTSTP signal dump and restore. Add a corresponding field
in the image, save it only if a task is in the stopped state.
Restore task state by sending desired stop signal if it is present
in the image. Fallback to SIGSTOP if it's absent.
Signed-off-by: Yuriy Vasiliev <yuriy.vasiliev@openvz.org>
|
|
Userspace may configure rseq cs abort policy by
setting RSEQ_CS_FLAG_NO_RESTART_ON_* flags.
In ("cr-dump: fixup thread IP when inside rseq cs") we added support for
the case when a process is caught by CRIU during rseq cs execution by
fixing up the IP to abort_ip. That's the common case, but there is a special
flag called RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL; in this case we have to leave
the process IP as it was before CRIU seized it. Unfortunately, that's not
all that we need here. We also must preserve the (struct rseq)->rseq_cs field.
You may ask: "why do we need to preserve it by hand? CRIU dumps
all process memory and restores it". That's true, but it is not so easy. The
problem here is that the kernel clears this field when it realizes that
the process has left the rseq cs. But during the dump/restore procedures we are
executing the parasite/restorer from the process context. It means that the
process will leave the rseq cs in any case and (struct rseq)->rseq_cs will be
cleared by the kernel. So we need to restore this field by hand at the *last*
stage of restore, just before releasing the processes.
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
|
|
Support basic rseq C/R scenario. Assume that:
- there are no processes with IP inside the rseq critical section (CS)
- kernel has ptrace(PTRACE_GET_RSEQ_CONFIGURATION) support
On dump:
1. use ptrace(PTRACE_GET_RSEQ_CONFIGURATION) to get
struct rseq pointer, rseq size and signature from the kernel.
2. save to the image
On restore:
1. get rseq ptr, size, signature from the image
2. register it back using rseq() from the restorer parasite
Fixes: #1696
Reported-by: Radostin Stoyanov <radostin@redhat.com>
Suggested-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
|
|
Brought to you by
codespell -w
(using codespell v2.1.0).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
|
|
I am not sure if this is going to bring any compatibility issues.
If yes, we need to remove this patch and add "useable" to the list of
ignored words instead.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
|
|
We plan to switch to the Mounts-v2 engine for restoring mounts by default;
this option allows switching back to the old engine. This patch only adds
the option, there is no engine behind it yet.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/503f9ad2c
Changes: allow --mntns-compat-mode option only on restore and only if
MOVE_MOUNT_SET_GROUP is supported (this also requires change in
unittest/mock.c), change id in rpc criu_opts.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
|
|
Starting with Linux Kernel release 5.16 the fdinfo proc entry contains
a map_extra field which breaks CRIU parsing of bpfmap entries.
This commit adds the map_extra as a possible field to CRIU. The value of
map_extra is not passed to the kernel on restore as it does not seem to
be evaluated in the code paths CRIU restore is using for BPF.
This fixes CRIU CI using Fedora with 5.16.
See Linux commit 9330986c03006ab1d33d243b7cfe598a7a3c1baa
"bpf: Add bloom filter map implementation"
Signed-off-by: Adrian Reber <areber@redhat.com>
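A rough sketch of fdinfo parsing that tolerates the extra field. The field
names are the ones the kernel prints for BPF maps; the function name, the
sample text and its values are made up for illustration, and this is not
the parser CRIU uses.

```python
# Fields this sketch understands; anything else in fdinfo is skipped.
BPFMAP_FIELDS = (
    "map_type", "key_size", "value_size", "max_entries",
    "map_flags", "map_extra", "memlock", "map_id", "frozen",
)


def parse_bpfmap_fdinfo(text):
    """Return the BPF map fields found in an fdinfo text as integers."""
    entry = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        name = name.strip()
        if not sep or name not in BPFMAP_FIELDS:
            continue
        # Base 0 accepts both decimal and 0x-prefixed values.
        entry[name] = int(value.strip(), 0)
    # Kernels before 5.16 do not print map_extra; treat it as zero there.
    entry.setdefault("map_extra", 0)
    return entry


# Illustrative fdinfo contents (values are invented).
SAMPLE_NEW = (
    "pos:\t0\n"
    "flags:\t02000002\n"
    "mnt_id:\t15\n"
    "map_type:\t30\n"
    "key_size:\t0\n"
    "value_size:\t4\n"
    "max_entries:\t100\n"
    "map_flags:\t0x0\n"
    "map_extra:\t0x3\n"
    "memlock:\t4096\n"
    "map_id:\t7\n"
    "frozen:\t0\n"
)
SAMPLE_OLD = "".join(
    line for line in SAMPLE_NEW.splitlines(keepends=True)
    if not line.startswith("map_extra")
)
new_entry = parse_bpfmap_fdinfo(SAMPLE_NEW)
old_entry = parse_bpfmap_fdinfo(SAMPLE_OLD)
```

Looking fields up by name instead of by position is what keeps such a
parser working when the kernel inserts a new line in the middle.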
|
|
Co-developed-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
|
|
Attach the System V shared memory segments to the address space via shmat() to
determine if they are backed by hugetlb and their page size. Use this
information for setting the correct flags on restore.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
|
|
In contrast to the CLI it is not possible to do a single pre-dump via
RPC and thus libcriu. In cr-service.c pre-dump always goes into a
pre-dump loop followed by a final dump. runc already works around this
to only do a single pre-dump by killing the CRIU process waiting for the
message for the final dump.
When trying to implement pre-dump in crun via libcriu, it is not as easy to
work around CRIU's pre-dump loop expectations as with runc, which talks
directly to CRIU via RPC.
We know that LXC/LXD also does single pre-dumps using the CLI and runc
also only does single pre-dumps by misusing the pre-dump loop interface.
With this commit it is possible to trigger a single pre-dump via RPC and
libcriu without misusing the interface provided via cr-service.c. So
this commit basically updates CRIU to the existing use cases.
The existing pre-dump loop still sounds like a very good idea, but so
far most tools have decided to implement the pre-dump loop themselves.
With this change we can implement pre-dump in crun to match what is
currently implemented in runc.
Signed-off-by: Adrian Reber <areber@redhat.com>
|
|
When one sets socket buffer sizes with setsockopt(SO_{SND,RCV}BUF*), the
kernel sets the corresponding SOCK_SNDBUF_LOCK or SOCK_RCVBUF_LOCK flags on
struct sock. It means that such a socket with an explicitly changed buffer
size can not be auto-adjusted by the kernel (e.g. if there is free memory the
kernel can auto-increase default socket buffers to improve performance).
(see tcp_fixup_rcvbuf() and tcp_sndbuf_expand())
CRIU always changes buf sizes on restore, which means that all
sockets receive lock flags on struct sock and become non-auto-adjusted
after migration. In some cases this can decrease the performance of network
connections quite a lot.
So let's c/r socket buf locks (SO_BUF_LOCKS), so that sockets for which
auto-adjustment is available do not lose it.
Reviewed-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
|
|
When the network is locked using a specific method like iptables
or nftables there is no need to require passing the same method
during restore.
We save the lock method during dump in the inventory image and
use that in restore.
This always overwrites the restore --network-lock option.
v2: store opts.network_lock_method directly to avoid dependency
on rpc.proto's 'enum criu_network_lock_method'.
v3: fall back to iptables if image is generated with an older
version of CRIU.
v4: remove --network-lock from netns_lock_* from restore
Signed-off-by: Zeyad Yasser <zeyady98@gmail.com>
|
|
v2: run make indent
Signed-off-by: Zeyad Yasser <zeyady98@gmail.com>
|
|
Support for apparmor namespaces and stacking is coming to Ubuntu kernels in
16.10, and should hopefully be upstreamed Soon (TM) :).
The basic idea is similar to how cgroups are done: we can restore the
apparmor namespace and profile blobs independently of the tasks, and then
at the end we can just set the task's label appropriately. This means the
code that moves tasks under a label stays the same, and the only new code
is the stuff that dumps and restores the policy blobs that are in the
namespace that were loaded by the container.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
|
|
When sigev_notify_thread_id is not set, get_pid will return a NULL
pointer and do_timer_create will return -EINVAL in the kernel. So criu
will fail to create the posix timer:
(09.806760) pie: 41301: Error (criu/pie/restorer.c:1998): Can't restore posix timers -22
(09.806824) pie: 41301: Error (criu/pie/restorer.c:2133): Restorer fail 41301
(09.891880) Error (criu/cr-restore.c:2596): Restoring FAILED.
Signed-off-by: Liu Chao <liuchao173@huawei.com>
|
|
This change is motivated by checkpointing and restoring container in
Pods.
When restoring a container into a new Pod the SELinux label of the
existing Pod needs to be used and not the SELinux label saved during
checkpointing.
The option --lsm-profile already enables changing of process SELinux
labels on restore. If, however, tmpfs mounts are checkpointed, they
will be mounted during restore with the same context as during
checkpointing. This can look like the following example:
context="system_u:object_r:container_file_t:s0:c82,c137"
On restore we want to change this context to match the mount label of
the Pod this container is restored into. Changing of the mount label
is now possible with the new option --mount-context:
criu restore --mount-context "system_u:object_r:container_file_t:s0:c204,c495"
This will lead to mount options being changed to
context="system_u:object_r:container_file_t:s0:c204,c495"
Now the restored container can access all the files in the container
again.
This has been tested in combination with runc and CRI-O.
Signed-off-by: Adrian Reber <areber@redhat.com>
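The rewrite of the mount options can be sketched as a plain string
substitution. This is an illustration only: replace_mount_context is a
hypothetical helper, the surrounding options (rw, size=) are invented, and
the actual CRIU code may handle the option differently.

```python
import re

# Match context="..." but not fscontext=, defcontext= or rootcontext=.
_CONTEXT_RE = re.compile(r'(?<![a-z])context="[^"]*"')


def replace_mount_context(options, new_context):
    """Return the mount option string with the context= value swapped."""
    replacement = f'context="{new_context}"'
    # A callable replacement avoids any backslash processing of the label.
    return _CONTEXT_RE.sub(lambda match: replacement, options)


old_options = (
    'rw,context="system_u:object_r:container_file_t:s0:c82,c137",size=65536k'
)
new_options = replace_mount_context(
    old_options, "system_u:object_r:container_file_t:s0:c204,c495"
)
```

The quoted value contains a comma (the MCS categories), so the options
cannot simply be split on ',' before the replacement.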
|
|
pidfd_store, which will be used for reliable pidfd-based pid reuse
detection for RPC clients, requires two recent syscalls (pidfd_open
and pidfd_getfd).
We allow checking if pidfd_store is supported using:
1. CLI: criu check --feature pidfd_store
2. RPC: CRIU_REQ_TYPE__FEATURE_CHECK and set pidfd_store to
true in the "features" field of the request
Signed-off-by: Zeyad Yasser <zeyady98@gmail.com>
|
|
pidfd_store_sk option will be used later to store tasks pidfds
between predumps to detect pid reuse reliably.
pidfd_store_sk should be an fd of a connectionless unix socket.
init_pidfd_store_sk() steals the socket from the RPC client using
pidfd_getfd, checks that it is a connectionless unix socket and
checks whether it has been initialized before (an uninitialized socket
is unnamed).
If it is not initialized, the socket is first bound to an abstract name
(combination of the real pid/fd to avoid overlap), then it is
connected to itself hence allowing us to store the pidfds in the
receive queue of the socket (this is similar to how fdstore_init()
works).
v2:
- avoid close(pidfd) overriding errno of SYS_pidfd_open in
init_pidfd_store_sk()
- close pidfd_store_sk because we might have leftover from
previous iterations
Signed-off-by: Zeyad Yasser <zeyady98@gmail.com>
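The self-connected socket trick can be tried out in a few lines. This is a
sketch under assumptions: it runs on Linux (abstract socket names), needs
Python 3.9+ for socket.send_fds, uses a pipe end as a stand-in for a pidfd,
and the helper names and the abstract name format are invented here.

```python
import os
import socket


def init_store_socket():
    """Create a datagram unix socket bound to an abstract name and
    connected to itself, so that sent messages land in its own queue."""
    sk = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    # Abstract names start with a NUL byte; pid and fd keep the name unique.
    name = b"\0store-sketch-%d-%d" % (os.getpid(), sk.fileno())
    sk.bind(name)
    sk.connect(name)
    return sk


def store_fd(sk, fd):
    """Queue a duplicate of fd in the socket's receive queue."""
    socket.send_fds(sk, [b"\0"], [fd])


def fetch_fd(sk):
    """Take one stored descriptor back out of the receive queue."""
    _data, fds, _flags, _addr = socket.recv_fds(sk, 1, 1)
    return fds[0]


store = init_store_socket()
read_end, write_end = os.pipe()
store_fd(store, read_end)
os.close(read_end)            # the queued copy keeps the pipe end alive
recovered = fetch_fd(store)
os.write(write_end, b"hi")
payload = os.read(recovered, 2)
for fd in (recovered, write_end):
    os.close(fd)
store.close()
```

The descriptor survives the close of the original because the in-flight
message holds its own reference to the open file.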
|
|
This changes the license of all files in the images/ directory from
GPLv2 to the Expat license (so-called MIT).
According to git the files have been authored by:
Abhishek Dubey
Adrian Reber
Alexander Mikhalitsyn
Alice Frosi
Andrei Vagin (Andrew Vagin, Andrey Vagin)
Cyrill Gorcunov
Dengguangxing
Dmitry Safonov
Guoyun Sun
Kirill Tkhai
Kir Kolyshkin
Laurent Dufour
Michael Holzheu
Michał Cłapiński
Mike Rapoport
Nicolas Viennot
Nikita Spiridonov
Pavel Emelianov (Pavel Emelyanov)
Pavel Tikhomirov
Radostin Stoyanov
rbruno@gsd.inesc-id.pt
Sebastian Pipping
Stanislav Kinsburskiy
Tycho Andersen
Valeriy Vdovin
The Expat license (so-called MIT) can be found here:
https://opensource.org/licenses/MIT
According to that link the correct SPDX short identifier is 'MIT'.
https://spdx.org/licenses/MIT.html
Signed-off-by: Adrian Reber <areber@redhat.com>
|
|
This commit adds a BPF map's name and ifindex to its protobuf image.
ifindex is the index of the network interface to which the BPF map is
attached and can be specified via a parameter while creating the BPF
map (BPF_MAP_CREATE). This commit also provides a default value of
false to the field 'frozen'.
Source files modified:
* images/bpfmap-file.proto
Signed-off-by: Abhishek Vijeev <abhishek.vijeev@gmail.com>
|
|
In this case, states of established tcp connections will not be dumped
and they will not be blocked. This will be useful in case of snapshots,
when we don't need to restore tcp connections.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
|
|
The SO_LINGER option controls how a TCP connection is closed.
The default behavior is to return immediately when close() is called,
and any unsent data is not guaranteed to be delivered. When SO_LINGER
is enabled, the close() call would block until all final data is
delivered to the remote end, for a specified time interval. When the
time interval is set to zero, the connection is aborted and any pending
data is immediately discarded upon close().
Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
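A small sketch of reading the linger state from one socket and applying it
to another, assuming Linux's struct linger layout of two ints. The names
dump_linger and restore_linger are hypothetical and only mirror the
dump/restore idea; they are not CRIU functions.

```python
import socket
import struct

# struct linger { int l_onoff; int l_linger; }
_LINGER = struct.Struct("ii")


def dump_linger(sk):
    """Return (enabled, seconds) as currently set on the socket."""
    raw = sk.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER, _LINGER.size)
    onoff, seconds = _LINGER.unpack(raw)
    return bool(onoff), seconds


def restore_linger(sk, enabled, seconds):
    """Apply a previously dumped linger state to the socket."""
    sk.setsockopt(
        socket.SOL_SOCKET, socket.SO_LINGER, _LINGER.pack(int(enabled), seconds)
    )


original = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
restored = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
restore_linger(original, True, 7)
saved_state = dump_linger(original)
restore_linger(restored, *saved_state)
restored_state = dump_linger(restored)
original.close()
restored.close()
```

Both members have to be saved: an enabled linger with a zero interval
(abort on close) behaves very differently from a disabled one.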
|
|
This patch enables checkpoint/restore of the SO_OOBINLINE socket option.
When the SO_OOBINLINE option is used, out-of-band data is placed in the
normal input queue as it is received. This permits it to be read using
read or recv without specifying the MSG_OOB flag.
Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
|
|
The field id 18 is used in Virtuozzo criu in multiple releases, so that
we can't change the id easily. So we can at least kindly ask not to use
this field in mainstream criu to decrease the pain of Virtuozzo criu
rebases.
Reference to related patch in Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/58e61a20c22c#images/sk-unix.proto
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
|
|
This commit adds protobuf definitions needed to checkpoint and
restore BPF map files along with the data they contain.
Source files added:
* bpfmap-file.proto - Stores the meta-data about BPF maps
* bpfmap-data.proto - Stores the data (key-value pairs) contained
in BPF maps
Source files modified:
* fdinfo.proto - Added BPF map as a new kind of file descriptor.
'message file_entry' can now hold information about BPF map file
descriptors
* Makefile - Now generates build artifacts for bpfmap-file.proto
and bpfmap-data.proto
Signed-off-by: Abhishek Vijeev <abhishek.vijeev@gmail.com>
|
|
This adds build-id, checksum, checksum-config and checksum-parameter fields
to RegFileEntry to store metadata used for file verification.
build_id: Holds the build-id if it could be obtained
checksum: Holds the checksum if it could be obtained
checksum_config: Holds the configuration of bytes for which checksum has
been calculated (The entire file, first N bytes or every Nth byte)
checksum_parameter: Specifies the value of 'N', if required, for the
configuration of bytes
Signed-off-by: Ajay Bharadwaj <ajayrbharadwaj@gmail.com>
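The three byte configurations can be illustrated as follows. This is a
sketch only: the configuration names are invented, CRC32 from zlib is an
assumed stand-in for the checksum algorithm, and "every Nth byte" is read
here as bytes at offsets 0, N, 2N, and so on.

```python
import zlib


def select_bytes(data, config, n=0):
    """Pick the bytes that take part in the checksum."""
    if config == "entire_file":
        return data
    if n <= 0:
        raise ValueError("this configuration needs a positive N")
    if config == "first_n":
        return data[:n]
    if config == "every_nth":
        return data[::n]
    raise ValueError(f"unknown configuration: {config}")


def file_checksum(data, config, n=0):
    """Checksum over the selected bytes (CRC32 is an assumption)."""
    return zlib.crc32(select_bytes(data, config, n))


content = b"abcdefgh"
whole = select_bytes(content, "entire_file")
head = select_bytes(content, "first_n", 3)
sampled = select_bytes(content, "every_nth", 3)
```

On restore the same configuration and N would be applied to the file found
on disk, and the result compared with the stored checksum.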
|
|
TODO: create correct magic
Signed-off-by: Adrian Reber <areber@redhat.com>
|
|
Signed-off-by: Guoyun Sun <sunguoyun@loongson.cn>
|
|
This adds the ability to stream images with criu-image-streamer
The workflow is the following:
1) criu-image-streamer is started, and starts listening on a UNIX
socket.
2) CRIU is started. img_streamer_init() is invoked, which connects to the
socket. During dump/restore operations, instead of using local disk to
open an image file, img_streamer_open() is called to provide a UNIX pipe
that is sent over the UNIX socket.
3) Once the operation is done, img_streamer_finish() is called, and the
UNIX socket is disconnected.
criu-image-streamer can be found at:
https://github.com/checkpoint-restore/criu-image-streamer
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
|
|
The time namespace allows for per-namespace offsets to the system
monotonic and boot-time clocks.
C/R of time namespaces is very straightforward. On dump, criu enters a
target time namespace and dumps the current clock values, then on restore,
criu creates a new namespace and restores the clock values.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
|
|
Per-object image is acceptable if we expect to have 1-3 objects
per-container. If we expect to have more objects, it is better to save
them all into one image. There are a number of reasons for this:
* We need fewer system calls to read all objects from one image.
* It is faster to save or move one image.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
|
|
To really open a symlink file and not the regular file below it, one needs
to do open with O_PATH|O_NOFOLLOW flags. Looks like systemd started to
open the /etc/localtime symlink this way sometimes, and before that nobody
actually used this and thus we never supported this in CRIU.
Error (criu/files-ext.c:96): Can't dump file 11 of that type [120777]
(unknown /etc/localtime)
Looks like it is quite easy to support, as c/r of a symlink file is almost
the same as c/r of a regular one. We only need to make fstatat not
follow links in check_path_remap.
Also we need to take into account support of ghost symlinks.
Signed-off-by: Alexander Mikhalitsyn (Virtuozzo) <alexander@mihalicyn.com>
Co-developed-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
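How such a descriptor comes about can be reproduced with a short sketch
(Linux only, since it relies on O_PATH). The helper name and the temporary
files are made up for illustration.

```python
import os
import stat
import tempfile


def open_symlink_itself(path):
    """Open the symlink, not the file it points to."""
    return os.open(path, os.O_PATH | os.O_NOFOLLOW)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "target")
    link = os.path.join(tmp, "link")
    with open(target, "w") as f:
        f.write("data")
    os.symlink(target, link)

    fd = open_symlink_itself(link)
    try:
        # fstat() on the O_PATH descriptor describes the link itself.
        link_mode = os.fstat(fd).st_mode
    finally:
        os.close(fd)

    # stat() follows the link by default; follow_symlinks=False does not,
    # which is the behaviour fstatat gets with AT_SYMLINK_NOFOLLOW.
    followed_mode = os.stat(link).st_mode
    not_followed_mode = os.stat(link, follow_symlinks=False).st_mode
```

The mode of the descriptor is a symlink type with 0777 permissions, which
is the 120777 (octal) seen in the error message above.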
|
|
The runc test cases are (sometimes) mounting a cgroup inside of the
container. For these tests to succeed, let CRIU know that cgroup2 exists
and how to restore such a mount.
This does not fix any specific cgroup2 settings, it just enables CRIU to
mount cgroup2 in the restored container.
Signed-off-by: Adrian Reber <areber@redhat.com>
|
|
See "man fcntl" for more information about seals.
memfd are the only files that can be sealed, currently. For this
reason, we dump the seal values in the MEMFD_INODE image.
Restoring seals must be done carefully as the seal F_SEAL_FUTURE_WRITE
prevents future write access. This means that any memory mapping with
write access must be restored before restoring the seals.
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
|
|
See "man memfd_create" for more information of what memfd is.
This adds support for memfd open files that are not memory mapped.
* We add a new kind of file: MEMFD.
* We add two image types MEMFD_FILE, and MEMFD_INODE.
MEMFD_FILE contains usual file information (e.g., position).
MEMFD_INODE contains the memfd name, and a shmid identifier
referring to the content.
* We reuse the shmem facilities for dumping memfd content as it
would be easier to support incremental checkpoints in the future.
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
|
|
To ensure consistency of the runtime environment, processes within a
container need to see the same start time values over suspend/resume
cycles. We introduce a new field in the core image structure to
store the start time of a dumped process. Later the same value would be
restored to a newly created task. In the future the feature is likely
to be pulled here, so we reserve the field id in the protobuf descriptor.
Signed-off-by: Valeriy Vdovin <valeriy.vdovin@virtuozzo.com>
|
|
TCP keepalive packets can be used to determine if a connection
is still valid. When the SO_KEEPALIVE option is set, TCP packets
are periodically sent to keep the connection alive.
This patch implements checkpoint/restore support for SO_KEEPALIVE,
TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT options.
Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
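The four options can be read and written with plain getsockopt/setsockopt
calls. The sketch below assumes Linux (where the TCP_KEEP* constants are
available); dump_keepalive and restore_keepalive are hypothetical names
used only for illustration.

```python
import socket

# (level, option) pairs in the order they are saved and restored.
KEEPALIVE_OPTS = (
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE),
    (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE),
    (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL),
    (socket.IPPROTO_TCP, socket.TCP_KEEPCNT),
)


def dump_keepalive(sk):
    """Read the keepalive settings of a TCP socket as a list of ints."""
    return [sk.getsockopt(level, opt) for level, opt in KEEPALIVE_OPTS]


def restore_keepalive(sk, values):
    """Apply previously dumped keepalive settings to a TCP socket."""
    for (level, opt), value in zip(KEEPALIVE_OPTS, values):
        sk.setsockopt(level, opt, value)


original = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
restore_keepalive(original, [1, 30, 5, 4])
saved = dump_keepalive(original)

restored = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
restore_keepalive(restored, saved)
after_restore = dump_keepalive(restored)
original.close()
restored.close()
```

The three TCP_KEEP* values are per-socket overrides, so they are worth
saving even when SO_KEEPALIVE itself is currently off.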
|
|
The /proc/sys/net/unix/max_dgram_qlen is a per-net variable and
we already noticed that systemd inside a container may change its value
(for example, it now sets it to 512 instead of the kernel's default
value of 10), thus we need to keep it in the image and restore it later.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Alexander Mikhalitsyn <alexander@mihalicyn.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
|
|
Conflict register for file "sk-opts.proto": READ is already defined in
file "rpc.proto". Please fix the conflict by adding package name on the
proto file, or use different name for the duplication. Note: enum
values appear as siblings of the enum type instead of children of it.
https://github.com/checkpoint-restore/criu/issues/815
Signed-off-by: Andrei Vagin <avagin@gmail.com>
|
|
Skip iov-generation for regions not having
PROT_READ, since process_vm_readv syscall
can't process them during "read" pre-dump.
Handle random order of "read" & "splice"
pre-dumps.
Signed-off-by: Abhishek Dubey <dubeyabhishek777@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
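The filtering step can be sketched as below. The VMA tuples and the
function name are invented for illustration, and merging adjacent readable
regions into one iov is an extra assumption of this sketch, not something
stated in the commit.

```python
import mmap


def generate_iovs(vmas):
    """Build (base, length) pairs for regions that can be read.

    vmas is an iterable of (start, end, prot) sorted by address.
    """
    iovs = []
    for start, end, prot in vmas:
        # Regions without PROT_READ cannot be read by process_vm_readv.
        if not prot & mmap.PROT_READ:
            continue
        if iovs and iovs[-1][0] + iovs[-1][1] == start:
            # Adjacent to the previous readable region: extend it.
            base, length = iovs[-1]
            iovs[-1] = (base, length + (end - start))
        else:
            iovs.append((start, end - start))
    return iovs


example_vmas = [
    (0x1000, 0x3000, mmap.PROT_READ | mmap.PROT_WRITE),
    (0x3000, 0x4000, mmap.PROT_READ),
    (0x4000, 0x5000, 0),                 # no access, skipped
    (0x5000, 0x6000, mmap.PROT_READ),
]
example_iovs = generate_iovs(example_vmas)
```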
|
|
Two modes of pre-dump algorithm:
1) splicing memory by parasite
--pre-dump-mode=splice (default)
2) using process_vm_readv syscall
--pre-dump-mode=read
Signed-off-by: Abhishek Dubey <dubeyabhishek777@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
|
|
Instead of creating cgroup yard in CRIU, now we can create it externally
and pass it to CRIU. Useful if somebody doesn't want to grant
CAP_SYS_ADMIN to CRIU.
Signed-off-by: Michał Cłapiński <mclapinski@google.com>
|
|
Signed-off-by: Andrei Vagin <avagin@gmail.com>
|
|
1. Checkpoint it via parasite.
2. Restore it after forking.
Signed-off-by: Michał Cłapiński <mclapinski@google.com>
Reviewed-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
|
|
Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
|
|
Shmem pages are written in the same set of images as regular
pages are, but stats for those are not collected. Fix this, but
keep the counts separate to have more info.
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Acked-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
|