major.io words of wisdom from a systems engineer

Build containers in GitLab CI with buildah


My team at Red Hat depends heavily on GitLab CI and we build containers often to run all kinds of tests. Fortunately, GitLab offers up CI to build containers and a container registry in every repository to hold the containers we build.

This is really handy because it keeps everything together in one place: your container build scripts, your container build infrastructure, and the registry that holds your containers. Better yet, you can put multiple types of containers underneath a single git repository if you need to build containers based on different Linux distributions.

Building with Docker in GitLab CI

By default, GitLab offers up a Docker builder that works just fine. The CI system clones your repository, builds your containers and pushes them wherever you want. There’s even a simple CI YAML file that does everything end-to-end for you.
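
If you’re curious what that end-to-end flow amounts to, it boils down to a handful of docker commands. Here’s a rough sketch using GitLab’s predefined CI variables (not the exact YAML that GitLab ships, just the equivalent shell):

# Log into the repository's built-in registry, then build and push (a sketch)
echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin "$CI_REGISTRY"
docker build -t "$CI_REGISTRY_IMAGE:latest" .
docker push "$CI_REGISTRY_IMAGE:latest"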

However, I have two issues with the Docker builder:

  • Larger images: The Docker image layering is handy, but the images end up being a bit larger, especially if you don’t do a little cleanup in each stage.

  • Additional service: It requires an additional service inside the CI runner for the dind (“Docker in Docker”) builder. This has caused some CI delays for me several times.

Building with buildah in GitLab CI

On my local workstation, I use podman and buildah all the time to build, run, and test containers. These tools are handy because I don’t need to remember to start the Docker daemon each time I want to mess with a container. I also don’t need sudo.

All of my containers are stored beneath my home directory. That’s good for keeping disk space in check, but it’s especially helpful on shared servers since each user has their own unique storage. My container pulls and builds won’t disrupt anyone else’s work on the server and their work won’t disrupt mine.
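
If you’re wondering where all of that per-user storage lives, it sits in a hidden directory in your home. A quick sketch (this path is the default for rootless podman and buildah):

# Rootless container storage lives under your home directory by default
ls ~/.local/share/containers/storage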

Finally, buildah offers some nice options out of the box. First, when you build a container with buildah bud, you end up with only three layers by default:

  1. Original OS layer (example: fedora:30)
  2. Everything you added on top of the OS layer
  3. Tiny bit of metadata

This is incredibly helpful if you use package managers like dnf, apt, and yum that download a bunch of metadata before installing packages. You would normally have to clear the metadata carefully for the package manager so that your container wouldn’t grow in size. Buildah takes care of that by squashing all the stuff you add into one layer.
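
For illustration, here’s the sort of cleanup you’d otherwise need to remember at the end of a RUN step (the packages here are just an example):

# Typical manual cleanup to keep dnf metadata out of an image layer
dnf -y install gcc make && dnf clean all && rm -rf /var/cache/dnf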

Of course, if you want to be more aggressive, buildah offers the --squash option which squashes the whole image down into one layer. This can be helpful if disk space is at a premium and you change the layers often.
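
Here’s a sketch of what that looks like, using the fedora30 image from this post as the example, along with a quick way to check how many layers you ended up with:

# Build with everything squashed down to a single layer
buildah bud --squash -f builds/fedora30 -t fedora30 .

# Inspect the layer history of the finished image
podman history fedora30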

Getting started

I have a repository called os-containers in GitLab that maintains fully updated containers for Fedora 29 and 30. The .gitlab-ci.yml file calls build.sh for two containers: fedora29 and fedora30. Open the build.sh file and follow along here:

# Use vfs with buildah. Docker offers overlayfs as a default, but buildah
# cannot stack overlayfs on top of another overlayfs filesystem.
export STORAGE_DRIVER=vfs

First off, we need to tell buildah to use the vfs storage driver. Docker uses overlayfs by default and stacking overlay filesystems will definitely lead to problems. Buildah won’t let you try it.

# Write all image metadata in the docker format, not the standard OCI format.
# Newer versions of docker can handle the OCI format, but older versions, like
# the one shipped with Fedora 30, cannot handle the format.
export BUILDAH_FORMAT=docker

By default, buildah uses the oci container format. This sometimes causes issues with older versions of Docker that don’t understand how to parse that type of metadata. By setting the format to docker, we’re using a format that almost all container runtimes can understand.

# Log into GitLab's container repository.
export REGISTRY_AUTH_FILE=${HOME}/auth.json
echo "$CI_REGISTRY_PASSWORD" | buildah login -u "$CI_REGISTRY_USER" --password-stdin $CI_REGISTRY

Here we set a path for the auth.json that contains the credentials for talking to the container repository. We also use buildah to authenticate to GitLab’s built-in container repository. GitLab automatically exports these variables for us (and hides them in the job output), so we can use them here.

buildah bud -f builds/${IMAGE_NAME} -t ${IMAGE_NAME} .

We’re now building the container and storing it temporarily as the bare image name, such as fedora30. This is roughly equivalent to docker build.

CONTAINER_ID=$(buildah from ${IMAGE_NAME})
buildah commit --squash $CONTAINER_ID $FQ_IMAGE_NAME

Now we create a working container from the image we just built using buildah from, and then commit that container with --squash to flatten everything down into a single layer. This keeps the image as small as possible.

The commit step also tags the resulting image using our fully qualified image name (in this case, it’s registry.gitlab.com/majorhayden/os-containers/fedora30:latest).

buildah push ${FQ_IMAGE_NAME}

This is the same as docker push. There’s not much special to see here.

Maintaining containers

GitLab allows you to take things to the next level with CI schedules. In my repository, there is a schedule to build my containers once a day to catch the latest updates. I use these containers a lot and they need to be up to date before I can run tests.

If the container build fails for some reason, GitLab will send me an email to let me know.


Inspecting OpenShift cgroups from inside the pod


My team at Red Hat builds a lot of kernels in OpenShift pods as part of our work with the Continuous Kernel Integration (CKI) project. We have lots of different pod sizes depending on the type of work we are doing and our GitLab runners spawn these pods based on the tags in our GitLab CI pipeline.

Compiling with make

When you compile a large software project, such as the Linux kernel, you can use multiple CPU cores to speed up the build. GNU’s make does this with the -j argument. Running make with -j10 means that you want to run 10 jobs while compiling. This would keep 10 CPU cores busy.

Setting the number too high causes more contention from the CPU and can reduce performance. Setting the number too low means that you are spending more time compiling than you would if you used all of your CPU cores.

Every once in a while, we adjusted our runners to use a different number of CPUs or a different amount of memory, and then we had to adjust our pipeline to reflect the new CPU count. This was time-consuming and error-prone.

Many people just use nproc to determine the CPU core count. It works well with make:

make -j$(nproc)

Problems with containers

The handy nproc doesn’t work well for OpenShift. If you start a pod on OpenShift and limit it to a single CPU core, nproc tells you something very wrong:

$ nproc
32

We applied the single CPU limit with OpenShift, so what’s the problem? The issue is how nproc looks for CPUs. Here’s a snippet of strace output:

sched_getaffinity(0, 128, [0, 1, 2, 3, 4, 5]) = 8
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x6), ...}) = 0
write(1, "6\n", 26
)                      = 2

The sched_getaffinity syscall returns the set of CPUs that the process is allowed to run on, and nproc simply counts them. OpenShift doesn’t prevent us from seeing the CPUs of the underlying system (the VM or bare metal host underneath our containers), but it uses cgroups to limit how much CPU time we can use.
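
You don’t need strace to see that mask, either. A quick sketch (the value in the comment is illustrative, not from a real pod):

# The CPU affinity mask that nproc consults via sched_getaffinity
grep Cpus_allowed_list /proc/self/status
# Cpus_allowed_list:   0-31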

Reading cgroups

Getting cgroup data is easy! Just change into the /sys/fs/cgroup/ directory and look around:

$ cd /sys/fs/cgroup/
$ ls -al cpu/
ls: cannot open directory 'cpu/': Permission denied

Ouch. OpenShift makes this a little more challenging. We’re not allowed to wander around in the land of cgroups without a map to exactly what we want.

My Fedora workstation shows a bunch of CPU cgroup settings:

$ ls -al /sys/fs/cgroup/cpu/
total 0
dr-xr-xr-x.  2 root root   0 Apr  5 01:40 .
drwxr-xr-x. 14 root root 360 Apr  5 01:40 ..
-rw-r--r--.  1 root root   0 Apr  5 13:08 cgroup.clone_children
-rw-r--r--.  1 root root   0 Apr  5 01:40 cgroup.procs
-r--r--r--.  1 root root   0 Apr  5 13:08 cgroup.sane_behavior
-r--r--r--.  1 root root   0 Apr  5 13:08 cpuacct.stat
-rw-r--r--.  1 root root   0 Apr  5 13:08 cpuacct.usage
-r--r--r--.  1 root root   0 Apr  5 13:08 cpuacct.usage_all
-r--r--r--.  1 root root   0 Apr  5 13:08 cpuacct.usage_percpu
-r--r--r--.  1 root root   0 Apr  5 13:08 cpuacct.usage_percpu_sys
-r--r--r--.  1 root root   0 Apr  5 13:08 cpuacct.usage_percpu_user
-r--r--r--.  1 root root   0 Apr  5 13:08 cpuacct.usage_sys
-r--r--r--.  1 root root   0 Apr  5 13:08 cpuacct.usage_user
-rw-r--r--.  1 root root   0 Apr  5 09:10 cpu.cfs_period_us
-rw-r--r--.  1 root root   0 Apr  5 13:08 cpu.cfs_quota_us
-rw-r--r--.  1 root root   0 Apr  5 09:10 cpu.shares
-r--r--r--.  1 root root   0 Apr  5 13:08 cpu.stat
-rw-r--r--.  1 root root   0 Apr  5 13:08 notify_on_release
-rw-r--r--.  1 root root   0 Apr  5 13:08 release_agent
-rw-r--r--.  1 root root   0 Apr  5 13:08 tasks

OpenShift uses the Completely Fair Scheduler (CFS) to limit CPU time. Here’s a quick excerpt from the kernel documentation:

Quota and period are managed within the cpu subsystem via cgroupfs.

cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
cpu.cfs_period_us: the length of a period (in microseconds)
cpu.stat: exports throttling statistics [explained further below]

The default values are:
    cpu.cfs_period_us=100ms
    cpu.cfs_quota=-1

A value of -1 for cpu.cfs_quota_us indicates that the group does not have any bandwidth restriction in place, such a group is described as an unconstrained bandwidth group. This represents the traditional work-conserving behavior for CFS.

Writing any (valid) positive value(s) will enact the specified bandwidth limit. The minimum quota allowed for the quota or period is 1ms. There is also an upper bound on the period length of 1s. Additional restrictions exist when bandwidth limits are used in a hierarchical fashion, these are explained in more detail below.

Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit and return the group to an unconstrained state once more.

Any updates to a group’s bandwidth specification will result in it becoming unthrottled if it is in a constrained state.

Let’s see if inspecting cpu.cfs_quota_us can help us:

$ cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
10000

Now we’re getting somewhere. But what does 10000 mean here? OpenShift operates on the concept of millicores of CPU time, or 1/1000 of a CPU. 500 millicores is half a CPU and 1000 millicores is a whole CPU.

The pod in this example is assigned 100 millicores. Now we know that we can take the output of /sys/fs/cgroup/cpu/cpu.cfs_quota_us, divide by 100, and get our millicores.
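
The divide-by-100 shortcut works as long as cpu.cfs_period_us is left at its default of 100,000 microseconds, so quota / period * 1000 collapses to quota / 100. A quick sanity check (a sketch):

# 10000 us quota / 100000 us period = 0.1 CPU = 100 millicores
expr $(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us) / 100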

We can make a script like this:

CFS_QUOTA=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
if [ $CFS_QUOTA -lt 100000 ]; then
  CPUS_AVAILABLE=1
else
  CPUS_AVAILABLE=$(expr ${CFS_QUOTA} / 100 / 1000)
fi
echo "Found ${CPUS_AVAILABLE} CPUS"
make -j${CPUS_AVAILABLE} ...

The script checks the value of the quota and divides by 100,000 to get the number of cores. If the quota is set to something less than 100,000, then a core count of 1 is assigned. (Pro tip: make does not like being told to compile with zero jobs.)

Reading memory limits

There are other limits you can read and inspect in a pod, including the available RAM. As we found with nproc, free is not very helpful:

# An OpenShift pod with 200MB RAM
$ free -m
              total        used        free      shared  buff/cache   available
Mem:          32008       12322         880          31       18805       19246
Swap:             0           0           0

But the cgroups tell the truth:

$ cat /sys/fs/cgroup/memory/memory.limit_in_bytes
209715200
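
Just like the CPU quota, that number can feed a build script. Here’s a small sketch (the variable names are mine):

# Convert the cgroup memory limit to megabytes: 209715200 / 1024 / 1024 = 200
MEM_LIMIT_BYTES=$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)
MEM_LIMIT_MB=$(expr ${MEM_LIMIT_BYTES} / 1024 / 1024)
echo "Pod memory limit: ${MEM_LIMIT_MB} MB"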

If you run Java applications in a container, like Jenkins (or Jenkins slaves), be sure to use the -XX:+UseCGroupMemoryLimitForHeap option. That will cause Java to look at the cgroups to determine its heap size.
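
On Java 8 that option is still experimental and has to be unlocked first; app.jar below is just a stand-in for your own application:

# Java 8u131 and later: size the heap from the cgroup memory limit
java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -jar app.jar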


Running Ansible in OpenShift with arbitrary UIDs


My work at Red Hat involves testing lots and lots of kernels from various sources and we use GitLab CE to manage many of our repositories and run our CI jobs. Those jobs run in thousands of OpenShift containers that we spawn every day.

OpenShift has some handy security features that we like. First, each container is mounted read-only with some writable temporary space (and any volumes that you mount). Also, OpenShift uses arbitrarily assigned user IDs for each container.

Constantly changing UIDs provide some good protection against container engine vulnerabilities, but they can be a pain if you have a script or application that depends on being able to resolve a UID or GID back to a real user or group account.

Ansible and UIDs

If you run an Ansible playbook within OpenShift, you will likely run into a problem during the fact gathering process:

$ ansible-playbook -i hosts playbook.yml

PLAY [all] *********************************************************************

TASK [Gathering Facts] *********************************************************
An exception occurred during task execution. To see the full traceback, use -vvv.
The error was: KeyError: 'getpwuid(): uid not found: 1000220000'
fatal: [localhost]: FAILED! => {"msg": "Unexpected failure during module execution.", "stdout": ""}
	to retry, use: --limit @/major-ansible-messaround/playbook.retry

PLAY RECAP *********************************************************************
localhost                  : ok=0    changed=0    unreachable=0    failed=1

This exception is telling us that getpwuid() was not able to find an entry in /etc/passwd for our UID (1000220000 in this container).
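
Ansible is not the only thing that trips over this; anything that maps a UID back to a name hits the same wall. Here’s a quick sketch of what that looks like inside the pod (using the UID from this example):

$ whoami
whoami: cannot find name for user ID 1000220000
$ getent passwd $(id -u); echo "exit code: $?"
exit code: 2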

One option would be to adjust the playbook so that we skip the fact gathering process:

- hosts: all
  gather_facts: no
  tasks:

    - name: Run tests
      command: ./run_tests.sh

However, this might not be helpful if you need facts to be gathered for your playbook to run. In that case, you need to make some adjustments to your container image first.

Updating the container

Nothing in the container image is writable within OpenShift, but we can change certain files to be writable by the root group, since every OpenShift user has an effective GID of 0.

When you build your container, add a line to your Dockerfile to allow the container user to have group write access to /etc/passwd and /etc/group:

# Make Ansible happy with arbitrary UID/GID in OpenShift.
RUN chmod g=u /etc/passwd /etc/group

Once your container has finished building, the permissions on both files should look like this:

$ ls -al /etc/passwd /etc/group
-rw-rw-r--. 1 root root 514 Mar 20 18:12 /etc/group
-rw-rw-r--. 1 root root 993 Mar 20 18:12 /etc/passwd

Make a user account

Now that we’ve made these files writable for our user in OpenShift, it’s time to change how we run our GitLab CI job. My job YAML currently looks like this:

ansible_test:
  image: docker.io/major/ansible:fedora29
  script:
    - ansible-playbook -i hosts playbook.yml

We can add two lines that allow us to make a temporary user and group account for our OpenShift user:

ansible_test:
  image: docker.io/major/ansible:fedora29
  script:
    - echo "tempuser:x:$(id -u):$(id -g):,,,:${HOME}:/bin/bash" >> /etc/passwd
    - echo "tempuser:x:$(id -G | cut -d' ' -f 2)" >> /etc/group
    - id
    - ansible-playbook -i hosts playbook.yml

Note that we want the second GID returned by id -G since the first one is 0. The id command helps us check our work when the container starts. When the CI job starts, we should see some better output:

$ echo "tempuser:x:$(id -u):$(id -g):,,,:${HOME}:/bin/bash" >> /etc/passwd
$ echo "tempuser:x:$(id -G | cut -d' ' -f 2)" >> /etc/group
$ id
uid=1000220000(tempuser) gid=0(root) groups=0(root),1000220000(tempuser)
$ ansible-playbook -i hosts playbook.yml

PLAY [all] *********************************************************************

TASK [Gathering Facts] *********************************************************
ok: [localhost]

TASK [Download kernel source] **************************************************
changed: [localhost]

PLAY RECAP *********************************************************************
localhost                  : ok=2    changed=1    unreachable=0    failed=0

Success!

Get a /56 from Spectrum using wide-dhcpv6

After writing my last post on my IPv6 woes with my Pixel 3, some readers asked how I’m handling IPv6 on my router lately. I wrote about this previously when Spectrum was Time Warner Cable and I was using Mikrotik network devices.

There is a good post from 2015 on the blog and it still works today.

I am still using that same setup today, but some readers found it difficult to find the post since Time Warner Cable was rebranded as Spectrum. Don’t worry – the old post still works. :)

Pixel 3 Wi-Fi drops constantly


We have two Google Pixel phones in our house: a Pixel 2 and a Pixel 3. Both of them drop off our home wireless network regularly. It causes lots of problems with various applications on the phones, especially casting video via Chromecast.

When I first noticed the drops, I was using a pair of wireless access points (APs) from EnGenius.

Also, here’s what I knew at the time:

  • Mac and Linux computers had no Wi-Fi issues at all
  • The signal level from both APs was strong
  • Disabling one AP made no improvement
  • Disabling one band (2.4 or 5GHz) on the APs made no improvement
  • Clearing the bluetooth/Wi-Fi data on the Pixel had no effect
  • Assigning a static IP address on the Pixel made no improvement
  • Using unencrypted SSIDs made no improvement

At this point, I felt strongly that the APs had nothing to do with it. I ordered a new NetGear Orbi mesh router and satellite anyway. The Pixels still dropped off the wireless network even with the new Orbi APs.

Reading logs

I started reading logs from every source I could find:

  • dhcpd logs from my router
  • syslogs from my APs (which forwarded into the router)
  • output from tcpdump on my router

Several things became apparent after reading the logs:

  • The Wi-Fi drops usually occurred every 30-60 seconds
  • The DHCP server received requests for a new IP address after every drop
  • None of the network traffic from the phones was being blocked at the router
  • The logs from the APs showed the phone disconnecting itself from the network; the APs were not forcing the phones off the network

All of the wireless and routing systems in my house seemed to point to a problem in the phones themselves. They were voluntarily dropping from the network without being bumped off by APs or the router.

Getting logs from the phone

It was time to get some logs from the phone itself. That would require connecting the phone via USB to a computer and enabling USB debugging on the phone.

First, I downloaded the Android SDK. The full studio release isn’t needed – scroll down and find the Command line tools only section. Unzip the download and find the tools/bin/sdkmanager executable. Run it like this:

# Fedora 29 systems may need to choose the older Java version for sdkmanager
# to run properly.
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.201.b09-2.fc29.x86_64/jre
# Install the android-28 platform tools
./sdkmanager "platform-tools" "platforms;android-28"

Now we need to enable USB debugging on the phone itself. Be sure to disable USB debugging when you are done! Follow these steps:

  1. Go into the phone’s settings and choose About Phone from the bottom of the list.
  2. Scroll to the bottom and tap the Build number section repeatedly until a message appears saying that you are now a developer.
  3. Go back one screen and tap System.
  4. Click Advanced to show the additional options and tap Developer Options.
  5. In the Debugging section, tap USB Debugging to enable USB debugging.

Connect the phone to your computer via USB and run:

sudo platform-tools/adb logcat

Your screen will fill with logs from your phone.

Nuggets in the log

I watched the logs and waited for the Wi-Fi to drop. As soon as it dropped, I saw some interesting log messages:

I wpa_supplicant: wlan0: CTRL-EVENT-AVOID-FREQ ranges=5785-5825
I chatty  : uid=1000(system) IpClient.wlan0 expire 3 lines
I chatty  : uid=1000 system_server expire 1 line
D CommandListener: Setting iface cfg
E cnss-daemon: wlan_service_update_sys_param: unable to open /proc/sys/net/ipv4/tcp_use_userconfig
I chatty  : uid=1000(system) android.fg expire 1 line
I wpa_supplicant: wlan0: CTRL-EVENT-DISCONNECTED bssid=88:dc:96:4a:b6:75 reason=3 locally_generated=1
I chatty  : uid=10025 com.google.android.gms.persistent expire 7 lines
V NativeCrypto: Read error: ssl=0x7b349e2d08: I/O error during system call, Software caused connection abort
V NativeCrypto: Write error: ssl=0x7b349e2d08: I/O error during system call, Broken pipe
V NativeCrypto: Write error: ssl=0x7b349e2d08: I/O error during system call, Broken pipe
V NativeCrypto: SSL shutdown failed: ssl=0x7b349e2d08: I/O error during system call, Success
D ConnectivityService: reportNetworkConnectivity(158, false) by 10025

The line with CTRL-EVENT-AVOID-FREQ isn’t relevant because it’s simply a hint to the wireless drivers to avoid certain frequencies not used in the USA. The CTRL-EVENT-DISCONNECTED shows where wpa_supplicant received the disconnection message. The last line with ConnectivityService was very interesting. Something in the phone believes there is a network connectivity issue. That could be why the Pixel is hopping off the wireless network.

From there, I decided to examine only the ConnectivityService logs:

sudo platform-tools/adb logcat 'ConnectivityService:* *:S'

This logcat line tells adb that I want all logs from all log levels about the ConnectivityService, but all of the other logs should be silenced. I started seeing some interesting details:

D ConnectivityService: NetworkAgentInfo [WIFI () - 148] validation failed
D ConnectivityService: Switching to new default network: NetworkAgentInfo{ ni{[type: MOBILE[LTE]...
D ConnectivityService: Sending DISCONNECTED broadcast for type 1 NetworkAgentInfo [WIFI () - 148] isDefaultNetwork=true
D ConnectivityService: Sending CONNECTED broadcast for type 0 NetworkAgentInfo [MOBILE (LTE) - 100] isDefaultNetwork=true
D ConnectivityService: handleNetworkUnvalidated NetworkAgentInfo [WIFI () - 148] ...

Wait, what is this “validation failed” message? The Pixel was making network connections successfully the entire time as shown by tcpdump. This is part of Android’s network connectivity checks for various networks.

The last few connections just before the disconnect were to connectivitycheck.gstatic.com (based on tcpdump logs) and that’s Google’s way of verifying that the wireless network is usable and that there are no captive portals. I connected to it from my desktop on IPv4 and IPv6 to verify:

$ curl -4 -i https://connectivitycheck.gstatic.com/generate_204
HTTP/2 204
date: Sun, 17 Mar 2019 15:00:30 GMT
alt-svc: quic=":443"; ma=2592000; v="46,44,43,39"
$ curl -6 -i https://connectivitycheck.gstatic.com/generate_204
HTTP/2 204
date: Sun, 17 Mar 2019 15:00:30 GMT
alt-svc: quic=":443"; ma=2592000; v="46,44,43,39"

Everything looked fine.

Heading to Google

After a bunch of searching on Google, I kept finding posts talking about disabling IPv6 to fix the Wi-Fi drop issues. I shrugged it off and kept searching. Finally, I decided to disable IPv6 and see if that helped.

I stopped radvd on the router, disabled Wi-Fi on the phone, and then re-enabled it. As I watched, the phone stayed on the wireless network for two minutes. Three minutes. Ten minutes. There were no drops.
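
If your router runs radvd under systemd, stopping it amounts to something like this (a sketch; service management differs between routers):

# Stop sending IPv6 router advertisements on the LAN and keep radvd off
sudo systemctl stop radvd
sudo systemctl disable radvd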

At this point, this is still an unsolved mystery for me. Disabling IPv6 is a terrible idea, but it keeps my phones online. I plan to put the phones on their own VLAN without IPv6 so I can still keep IPv6 addresses for my other computers, but this is not a good long term fix. If anyone has any input on why this helps and how I can get IPv6 re-enabled, please let me know!

Update 2019-03-18

Several readers wanted to see what was happening just before the Wi-Fi drop, so here’s a small snippet from tcpdump:

07:26:06.736900 IP6 2607:f8b0:4000:80d::2003.443 > phone.41310: Flags [F.], seq 3863, ack 511, win 114, options [nop,nop,TS val 1288800272 ecr 66501414], length 0
07:26:06.743101 IP6 2607:f8b0:4000:80d::2003.443 > phone.41312: Flags [F.], seq 3864, ack 511, win 114, options [nop,nop,TS val 1778536228 ecr 66501414], length 0
07:26:06.765444 IP6 phone.41312 > 2607:f8b0:4000:80d::2003.443: Flags [R], seq 4183481455, win 0, length 0
07:26:06.765454 IP6 phone.41310 > 2607:f8b0:4000:80d::2003.443: Flags [R], seq 3279990707, win 0, length 0
07:26:07.487180 IP6 2607:f8b0:4000:80d::2003.443 > phone.41316: Flags [F.], seq 3863, ack 511, win 114, options [nop,nop,TS val 4145292968 ecr 66501639], length 0
07:26:07.537080 IP6 phone.41316 > 2607:f8b0:4000:80d::2003.443: Flags [R], seq 4188442452, win 0, length 0

That IPv6 address is at a Google PoP in Dallas, TX:

$ host 2607:f8b0:4000:80d::2003
3.0.0.2.0.0.0.0.0.0.0.0.0.0.0.0.d.0.8.0.0.0.0.4.0.b.8.f.7.0.6.2.ip6.arpa domain name pointer dfw06s49-in-x03.1e100.net.

I haven’t been able to intercept the traffic via man-in-the-middle since Google’s certificate checks are very strict. However, checks from my own computer work without an issue:

$ curl -ki "https://[2607:f8b0:4000:80d::2003]/generate_204"
HTTP/2 204
date: Mon, 18 Mar 2019 12:35:18 GMT
alt-svc: quic=":443"; ma=2592000; v="46,44,43,39"