What's in the box?

Docker run

Let’s say we have some basic Dockerfile describing an image running Ubuntu 20.04.

FROM ubuntu:20.04

We can build the image and give it a name.

docker build -t my_image .

Finally we can execute commands in a container based on our image.

docker run my_image echo "hello, world!"

Which prints “hello, world!” to the screen. Nice 🙂

Inside the container

But what does it mean to “run in a container”? Let’s open a shell inside of one and have a look.
Note that the docker run command simply spawns a new process on my MacBook.

❯ ps
  ...
    6486 ttys001    0:00.05 docker run -it docker_image /bin/sh
  ...

But if we attach to a shell running inside the container process, we get a very different view of the world.

❯ docker run -it docker_image /bin/sh
# ls
bin   dev  home  media	opt   root  sbin  sys  usr
boot  etc  lib	 mnt	proc  run   srv   tmp  var
# ps
  PID TTY          TIME CMD
    1 pts/0    00:00:00 sh
    8 pts/0    00:00:00 ps
# hostname
d5437e71fda7
# whoami
root
# exit
❯

We have our own hostname, process list, filesystem, etc. In other words the sub-process is isolated from the host system that created it.
This idea of isolated, lightweight processes running on one or many host machines is what allowed us to move into the cloud. I rely on containers every time I ship my code to production.
It’s time to understand what allows containers to exist and be isolated from the host system.

Lonely containers

Let’s try to re-create the behavior of Docker containers ourselves! This C program uses clone to spawn a new sub-process. clone is similar to fork in that it also creates a new process, but unlike fork it gives us more control over which resources we share with the child process.

#define _GNU_SOURCE 
#include <sched.h>

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int container(void *args)
{
    printf("PID seen from container: %d\n", getpid());
    system("/bin/bash");
    return 0; /* becomes the exit status of the sub-process */
}

int main()
{
    pid_t p = clone(container, malloc(4096) + 4096, SIGCHLD, NULL);
    if (p == -1) {
        perror("clone");
        exit(1);
    }
    printf("PID seen from host system: %d\n", p);
    waitpid(p, NULL, 0);
    return 0;
}

We simply clone the current process and execute /bin/bash in the newly spawned sub-process. Additionally, we print the PID of the sub-process twice: once as returned by clone in the main process, and once as reported by getpid inside the sub-process. We can compile the program using gcc lonely_container.c -o lonely_container and run it:

❯ ./lonely_container
PID seen from host system: 11775
PID seen from container: 11775
$ echo Hello, world!
Hello, world!
$ exit
exit
❯
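A side note on the second argument to clone in the program above: on most architectures the stack grows downward, so clone expects a pointer to the top of the child's stack, which is why the code passes malloc(4096) + 4096. The helper below is a minimal sketch of the same idea with a roomier stack. The name alloc_child_stack and the 1 MiB size are my own assumptions for illustration, not something clone prescribes.

```c
#include <stdlib.h>

/* Assumed stack size for this sketch; 1 MiB is an arbitrary choice. */
#define STACK_SIZE (1024 * 1024)

/*
 * Allocates `size` bytes and returns a pointer to the END of the block,
 * which is what clone() wants when the stack grows downward.
 * The start of the block is stored in *base so it can be freed later.
 * Returns NULL if the allocation fails.
 */
char *alloc_child_stack(size_t size, char **base)
{
    *base = malloc(size);
    if (*base == NULL)
        return NULL;
    return *base + size;
}
```

In main, the returned pointer would take the place of malloc(4096) + 4096 in the call to clone, and base can be passed to free once waitpid has returned.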

PID independence

Note how the sub-process has the same PID from both points of view:

PID seen from host system: 11775
PID seen from container: 11775

Let’s change that. When we have a look at the manpage for clone we see that the signature is

int clone(int (*fn)(void *), void *stack, int flags, void *arg, ...);

and there is one flag, called CLONE_NEWPID which states

If CLONE_NEWPID is set, then create the process in a new PID namespace.

So, let’s call clone like this:

int flags = CLONE_NEWPID;
pid_t p = clone(container, malloc(4096) + 4096, SIGCHLD|flags, NULL);

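And we get (the program now has to run as root, e.g. via sudo, because creating a new PID namespace requires the CAP_SYS_ADMIN capability):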

PID seen from host system: 13342
PID seen from container: 1

Nice, we’re in our own namespace of processes. From the container’s perspective, there’s only one process: itself, with PID 1. But when we run ps inside the container, we still see many more processes.

    PID TTY          TIME CMD
   1038 tty2     00:03:22 Xorg
   1046 tty2     00:00:00 gnome-session-b
   2600 pts/0    00:00:00 zsh
   2610 pts/0    00:00:00 zsh
   2611 pts/0    00:00:00 zsh
   2613 pts/0    00:00:00 gitstatusd
   2972 pts/1    00:00:00 zsh
   3007 pts/1    00:00:00 zsh
   3008 pts/1    00:00:00 zsh
   3010 pts/1    00:00:00 gitstatusd
   7060 pts/2    00:00:00 zsh
   7068 pts/2    00:00:00 zsh
   7069 pts/2    00:00:00 zsh
   7071 pts/2    00:00:00 gitstatusd
  13114 pts/0    00:00:00 man
  13122 pts/0    00:00:00 less
  13617 pts/1    00:00:01 node
  13636 pts/1    00:00:00 npm exec hugo s
  13647 pts/1    00:00:00 hugo
  14022 pts/2    00:00:00 sudo
  14023 pts/2    00:00:00 lonely_containe
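Despite that ps output, we can double-check that the sub-process really lives in a different PID namespace. On Linux, /proc/self/ns/pid is a symbolic link whose target looks like pid:[4026531836], where the number identifies the namespace. The helpers below are a small sketch that reads it; the names pid_namespace_id and print_pid_namespace are made up for this example.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copies the target of /proc/self/ns/pid (e.g. "pid:[4026531836]")
 * into buf as a NUL-terminated string.
 * Returns 0 on success, -1 on error.
 */
int pid_namespace_id(char *buf, size_t len)
{
    if (len == 0)
        return -1;
    ssize_t n = readlink("/proc/self/ns/pid", buf, len - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0'; /* readlink does not terminate the string itself */
    return 0;
}

/* Prints the namespace id together with a label such as "host". */
void print_pid_namespace(const char *who)
{
    char buf[64];
    if (pid_namespace_id(buf, sizeof buf) == 0)
        printf("PID namespace seen from %s: %s\n", who, buf);
    else
        perror("readlink");
}
```

Calling print_pid_namespace("host system") in main and print_pid_namespace("container") in the container function should print two different numbers once CLONE_NEWPID is set, and the same number without it.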

File system independence

The problem with ps above is that ps reads the process list from /proc. Since our sub-process still shares the file system with the host process, we can see all the host-system processes inside the container.
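To make this concrete, here is a rough sketch of how a ps-like tool can discover processes: every directory in /proc whose name consists only of digits stands for one PID. This is a simplification and not the actual ps source; is_pid_name and list_proc_pids are hypothetical names.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>

/* Returns 1 if name is non-empty and all digits, i.e. looks like a PID. */
static int is_pid_name(const char *name)
{
    if (*name == '\0')
        return 0;
    for (; *name != '\0'; name++)
        if (!isdigit((unsigned char)*name))
            return 0;
    return 1;
}

/*
 * Prints every PID visible under /proc, one per line.
 * Returns the number of PIDs found, or -1 if /proc cannot be opened.
 */
int list_proc_pids(void)
{
    DIR *dir = opendir("/proc");
    if (dir == NULL) {
        perror("opendir");
        return -1;
    }

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (is_pid_name(entry->d_name)) {
            printf("%s\n", entry->d_name);
            count++;
        }
    }
    closedir(dir);
    return count;
}
```

Called from the container function at this stage, list_proc_pids still prints the host's PIDs, because the /proc it reads is the host's.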
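Let’s change that. First, we create a directory container_root next to lonely_container.c. This will be the root of our container's filesystem, /. Since we are going to chroot into it, it has to contain a minimal root filesystem: at least /bin/bash and ps with the libraries they need, plus an empty /proc directory to mount on. We now add a new flag, CLONE_NEWNS (the snippet also sets CLONE_NEWIPC, which puts the sub-process into its own IPC namespace):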

    int flags = CLONE_NEWPID|CLONE_NEWIPC|CLONE_NEWNS;
    pid_t p = clone(container, malloc(4096) + 4096, SIGCHLD|flags, NULL);

which states

If CLONE_NEWNS is set, the cloned child is started in a new mount namespace

We change the container function to chroot into the new container_root before executing the shell, and to mount /proc in our new root filesystem (mount needs an additional #include <sys/mount.h>):

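int container(void *args)
{
    printf("PID seen from container: %d\n", getpid());
    /* Make container_root the new "/" and move into it. */
    if (chroot("./container_root") != 0 || chdir("/") != 0) {
        perror("chroot/chdir");
        return 1;
    }
    /* Mount a fresh procfs that reflects our new PID namespace. */
    if (mount("proc", "/proc", "proc", 0, NULL) != 0) {
        perror("mount");
        return 1;
    }
    system("/bin/bash");
    return 0;
}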

When we compile and run it, we can see that we have our own filesystem and no longer see the host system processes:

❯ ./lonely_container
PID seen from host system: 11775
PID seen from container: 1
$ ps -a
    PID TTY          TIME CMD
      1 pts/2    00:00:00 bash
      7 pts/2    00:00:00 ps
$ exit
exit
❯

Further independence

To try this yourself, I suggest looking into the man page for clone and checking out what other flags you can add and what other resources you might isolate. One example could be the hostname via CLONE_NEWUTS.
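As a starting point for the hostname idea, here is a minimal sketch, assuming the program runs as root (CLONE_NEWUTS needs CAP_SYS_ADMIN, just like CLONE_NEWPID). The sub-process gets its own UTS namespace, sets a hostname there, and prints it. The names uts_child and run_in_new_uts, as well as the 1 MiB stack size, are invented for this example.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define UTS_STACK_SIZE (1024 * 1024) /* assumed size, arbitrary */

/* Runs in the sub-process: changes the hostname in its own UTS namespace. */
static int uts_child(void *arg)
{
    const char *name = arg;
    char buf[256];

    if (sethostname(name, strlen(name)) != 0) {
        perror("sethostname");
        return 1;
    }
    if (gethostname(buf, sizeof buf) != 0) {
        perror("gethostname");
        return 1;
    }
    printf("hostname seen from container: %s\n", buf);
    return 0;
}

/*
 * Spawns uts_child in a new UTS namespace and waits for it.
 * Returns the child's exit status, or -1 if clone fails
 * (for example with EPERM when not running as root).
 */
int run_in_new_uts(const char *name)
{
    char *stack = malloc(UTS_STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* The stack grows downward, so pass the top of the allocation. */
    pid_t p = clone(uts_child, stack + UTS_STACK_SIZE,
                    CLONE_NEWUTS | SIGCHLD, (void *)name);
    if (p == -1) {
        perror("clone");
        free(stack);
        return -1;
    }

    int status = 0;
    waitpid(p, &status, 0);
    free(stack);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

After run_in_new_uts("lonely-container") returns, running hostname on the host should still show the old name, since the change only applied inside the new namespace. In lonely_container.c the same effect could be had by adding CLONE_NEWUTS to flags and calling sethostname in the container function.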
The next step could be to limit how much CPU and memory the sub-process may use. To achieve this, we will have to look into a Linux feature called control groups, aka cgroups. Maybe in the next post!

so long
