From d98bcb9ca9d0368ffb1633a1487b3af66e43212c Mon Sep 17 00:00:00 2001
From: Jacob Vosmaer
Date: Fri, 5 Jun 2015 14:23:35 +0200
Subject: Try to explain Unicorn and unicorn-worker-killer

---
 doc/operations/unicorn.md | 86 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 86 insertions(+)
 create mode 100644 doc/operations/unicorn.md

diff --git a/doc/operations/unicorn.md b/doc/operations/unicorn.md
new file mode 100644
index 00000000000..31b432cd411
--- /dev/null
+++ b/doc/operations/unicorn.md
@@ -0,0 +1,86 @@

# Understanding Unicorn and unicorn-worker-killer

## Unicorn

GitLab uses [Unicorn](http://unicorn.bogomips.org/), a pre-forking Ruby web
server, to handle web requests (web browsers and Git HTTP clients). Unicorn is
a daemon written in Ruby and C that can load and run a Ruby on Rails
application; in our case the Rails application is GitLab Community Edition or
GitLab Enterprise Edition.

Unicorn has a multi-process architecture to make better use of available CPU
cores (processes can run on different cores) and to provide stronger fault
tolerance (most failures stay isolated in a single process and cannot take down
GitLab entirely). On startup, the Unicorn 'master' process loads a clean Ruby
environment with the GitLab application code, and then spawns 'workers' which
inherit this clean initial environment. The 'master' never handles any
requests; that is left to the workers. The operating system network stack
queues incoming requests and distributes them among the workers.

In a perfect world, the master would spawn its pool of workers once, and the
workers would handle incoming web requests one after another until the end of
time. In reality, worker processes can crash or time out: if the master notices
that a worker takes too long to handle a request, it terminates the worker
process with SIGKILL ('kill -9'). No matter how a worker process ends, the
master replaces it with a new 'clean' process. Unicorn is designed to be able
to replace 'crashed' workers without dropping user requests.

This is what a Unicorn worker timeout looks like in `unicorn_stderr.log`. The
master process has PID 56227 below.

```
[2015-06-05T10:58:08.660325 #56227] ERROR -- : worker=10 PID:53009 timeout (61s > 60s), killing
[2015-06-05T10:58:08.699360 #56227] ERROR -- : reaped #<Process::Status: pid 53009 SIGKILL (signal 9)> worker=10
[2015-06-05T10:58:08.708141 #62538] INFO -- : worker=10 spawned pid=62538
[2015-06-05T10:58:08.708824 #62538] INFO -- : worker=10 ready
```

### Tunables

The main tunables for Unicorn are the number of worker processes and the
request timeout after which the Unicorn master terminates a worker process.
See the [omnibus-gitlab Unicorn settings
documentation](https://gitlab.com/gitlab-org/omnibus-gitlab/blob/master/doc/settings/unicorn.md)
if you want to adjust these settings.
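To make this concrete, here is a minimal sketch of what these two tunables look
like in `/etc/gitlab/gitlab.rb` on an Omnibus GitLab installation. The key names
follow the omnibus-gitlab settings documentation linked above; the values shown
(3 workers, a 60-second timeout) are example numbers, not recommendations.

```ruby
# /etc/gitlab/gitlab.rb -- example values only, not recommendations.

# Number of Unicorn worker processes. More workers allow more concurrent
# requests, at the cost of more memory.
unicorn['worker_processes'] = 3

# Number of seconds a worker may spend on a single request before the
# master terminates it with SIGKILL and spawns a replacement.
unicorn['worker_timeout'] = 60
```

After editing `/etc/gitlab/gitlab.rb`, run `sudo gitlab-ctl reconfigure` for the
change to take effect. On installations from source, the equivalent Unicorn
configuration directives are `worker_processes` and `timeout` in
`config/unicorn.rb`.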
## unicorn-worker-killer

GitLab has memory leaks. These memory leaks manifest themselves in long-running
processes, such as Unicorn workers. (The Unicorn master process is not known to
leak memory, probably because it does not handle user requests.)

To make these memory leaks manageable, GitLab comes with the
[unicorn-worker-killer gem](https://github.com/kzk/unicorn-worker-killer). This
gem [monkey-patches](http://en.wikipedia.org/wiki/Monkey_patch) the Unicorn
workers to do a memory self-check after every 16 requests. If the memory of the
Unicorn worker exceeds a pre-set limit then the worker process exits. The
Unicorn master then automatically replaces the worker process.

This is a robust way to handle memory leaks: Unicorn is designed to handle
workers that 'crash', so no user requests will be dropped. The
unicorn-worker-killer gem is designed to only terminate a worker process _in
between requests_, so no user requests are affected.

This is what a Unicorn worker memory restart looks like in `unicorn_stderr.log`.
You can see that worker 4 (PID 125918) inspects itself and decides to exit.
The threshold memory value was 254802235 bytes, about 250 MB. With GitLab this
threshold is a random value between 200 and 250 MB. The master process (PID
117565) then reaps the worker process and spawns a new 'worker 4' with PID
127549.

```
[2015-06-05T12:07:41.828374 #125918] WARN -- : #<Unicorn::HttpServer:0x00000002734770>: worker (pid: 125918) exceeds memory limit (256413696 bytes > 254802235 bytes)
[2015-06-05T12:07:41.828472 #125918] WARN -- : Unicorn::WorkerKiller send SIGQUIT (pid: 125918) alive: 23 sec (trial 1)
[2015-06-05T12:07:42.025916 #117565] INFO -- : reaped #<Process::Status: pid 125918 exit 0> worker=4
[2015-06-05T12:07:42.034527 #127549] INFO -- : worker=4 spawned pid=127549
[2015-06-05T12:07:42.035217 #127549] INFO -- : worker=4 ready
```

One other thing that stands out in the log snippet above, taken from
GitLab.com, is that 'worker 4' was serving requests for only 23 seconds. This
is a normal value for our current GitLab.com setup and traffic.

The high frequency of Unicorn memory restarts on some GitLab sites can be a
source of confusion for administrators. Usually they are a [red
herring](http://en.wikipedia.org/wiki/Red_herring).
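As a closing reference, here is a rough sketch of how a gem like
unicorn-worker-killer is typically wired into a Unicorn application through
`config.ru`, following the usage shown in the gem's README. This is an
illustration rather than GitLab's exact configuration: the application constant
and the 200-250 MB limits below are stand-ins chosen to match the behaviour
described above.

```ruby
# config.ru -- illustrative sketch, not GitLab's actual configuration.
require 'unicorn/worker_killer'

# The gem picks a random limit per worker between these two values, so
# that all workers do not restart at the same moment.
min_memory = 200 * 1024**2 # 200 MB
max_memory = 250 * 1024**2 # 250 MB

# Check the worker's memory use every 16 requests; when the limit is
# exceeded, the worker receives SIGQUIT and exits between requests, and
# the Unicorn master spawns a fresh replacement.
use Unicorn::WorkerKiller::Oom, min_memory, max_memory, 16

run Gitlab::Application # placeholder for the Rails application
```

Seeing the limit spelled out like this makes the log snippet above easier to
read: the 'exceeds memory limit' warning is the periodic self-check firing, and
the 'reaped' and 'spawned' lines are the master doing its normal replacement
work.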