
Texas Linux Fest 2019 Recap

Las Colinas in Irving

Another Texas Linux Fest has come and gone! The 2019 Texas Linux Fest was held in Irving at the Irving Convention Center. It was a great venue surrounded by lots of shops and restaurants.

If you haven’t attended one of these events before, you really should! Attendees have varying levels of experience with Linux and the conference organizers (volunteers) work really hard to ensure everyone feels included.

The event usually falls on a Friday and Saturday. Fridays consist of longer, deeper dive talks on various topics – technical and non-technical. Saturdays are more of a typical conference format with a keynote in the morning and 45-minute talks through the day. Saturday nights have lightning talks as well as “Birds of a Feather” events for people with similar interests.

Highlights

Steve Ovens took us on a three hour journey on Friday to learn more about our self-worth. His talk, “You’re Worth More Than You Know, Matching your Skills to Employers”, covered a myriad of concepts such as discovering what really motivates you, understanding how to value yourself (and your skills), and how to work well with different personality types.

I’ve attended these types of talks before, and they sometimes end up a bit fluffy, without takeaways you can put to use right away. Steve’s talk was the opposite. He gave us concrete ways to change how we think about ourselves and to use that knowledge to advance at work. I learned a lot about salary negotiation strategies, both when getting hired and when pushing for a raise. Steve stopped often to answer questions, and it was clear that he genuinely cares about this topic.

Thomas Cameron kicked off Saturday with his “Linux State of the Union” talk. He talked a lot about his personal journey and how he has changed along the way. He noted quite a few changes to Linux (not the code, but the people around it) that many of us had not noticed. We learned more about how we can make the Linux community more diverse, inclusive, and welcoming. We also groaned through some problems from the good old days with jumpers on SCSI cards and the joys of winmodems.

Adam Miller threw us into the seat of a roller coaster with a whirlwind talk about all the ways you can automate (nearly) everything with Ansible.

Adam Miller Ansible talk

He covered everything from simple configuration management tasks to scaling up software deployments over thousands of nodes. Adam also explained the OCI image format as being “sweet sweet tarballs with a little bit of metadata” and the audience was rolling with laughter. Adam’s talks are always good and you’ll be energized all the way through.

José Miguel Parrella led a great lightning talk in the evening about how Microsoft uses Linux in plenty of places:

Debian at Microsoft slide

The audience was shocked by how much Debian is used at Microsoft, and it made it even clearer that Microsoft is making a serious shift toward open source. Many of us knew that already, but we didn’t know the extent of the work being done.

My talks

My first talk was about my team at Red Hat, the Continuous Kernel Integration team. I shared some of the challenges involved with doing CI for the kernel at scale and how difficult it is to increase test coverage of subsystems within the kernel. There were two kernel developers in the audience and they had some really good questions.

The discussion at the end was quite productive. The audience had plenty of questions about how different pieces of the system worked, and how well GitLab was working for us. We also talked a bit about how the kernel is developed and if there is room for improvement. One attendee hoped that some of the work we’re doing will change the kernel development process for the better. I hope so, too.

My second talk covered the topic of burnout. I have delivered plenty of talks about impostor syndrome in the past and I was eager to share more ideas around “soft” skills that become more important to technical career development over time.

The best part of these types of talks for me is the honesty that people bring when they share their thoughts after the talk. A few people from the audience shared their own personal experiences (some were very personal) and you could see people in the audience begin to understand how difficult burnout recovery can be. Small conferences like these create environments where people can talk honestly about difficult topics.

If you’re looking for the slides from these talks, you can view them in Google Slides (for the sake of the GIFs!).

Google Slides also allows you to download the slides as PDFs. Just choose File > Download as > PDF.

BoF: Ham Radio and OSS

The BoFs were fairly late in the day and everyone was looking tired. However, we had a great group assemble for the Ham Radio and OSS BoF. We had about 15-20 licensed hams and 5-6 people who were curious about the hobby.

We talked about radios, antennas, procedures, how to study, and the exams. The ham-curious folks who joined us looked a bit overwhelmed by the help they were getting, but they left the room with plenty of ideas on how to get started.

I also agreed to write a blog post about everything I’ve learned so far that has made the hobby easier for me and I hope to write that soon. There is so much information out there for studying and finding equipment that it can become really confusing for people new to the hobby.

Final thoughts

If you get the opportunity to attend a local Linux fest in your state, do it! The Texas one is always good and people joined us from Arkansas, Oklahoma, Louisiana, and Arizona. Some people came as far as Connecticut and the United Kingdom! These smaller events have a much higher signal to noise ratio and there is more real discussion rather than marketing from industry giants.

Thanks to everyone who put the Texas Linux Fest together this year!

Build containers in GitLab CI with buildah

cranes and skyscrapers

My team at Red Hat depends heavily on GitLab CI and we build containers often to run all kinds of tests. Fortunately, GitLab offers up CI to build containers and a container registry in every repository to hold the containers we build.

This is really handy because it keeps everything together in one place: your container build scripts, your container build infrastructure, and the registry that holds your containers. Better yet, you can put multiple types of containers underneath a single git repository if you need to build containers based on different Linux distributions.

Building with Docker in GitLab CI

By default, GitLab offers up a Docker builder that works just fine. The CI system clones your repository, builds your containers and pushes them wherever you want. There’s even a simple CI YAML file that does everything end-to-end for you.
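For reference, a minimal Docker-based CI file looks something like this. It is just a sketch based on GitLab’s documented docker-in-docker approach (the job name and image tags are placeholders; GitLab provides the CI_REGISTRY_* variables automatically):

build:
  image: docker:stable
  services:
    - docker:dind
  variables:
    # Point the docker client at the dind service and disable the TLS
    # setup that newer dind images enable by default.
    DOCKER_HOST: tcp://docker:2375
    DOCKER_TLS_CERTDIR: ""
  script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE" .
    - docker push "$CI_REGISTRY_IMAGE"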

However, I have two issues with the Docker builder:

  • Larger images: The Docker image layering is handy, but the images end up being a bit larger, especially if you don’t do a little cleanup in each stage.

  • Additional service: It requires an additional service inside the CI runner for the dind (“Docker in Docker”) builder. This has caused CI delays for me several times.

Building with buildah in GitLab CI

On my local workstation, I use podman and buildah all the time to build, run, and test containers. These tools are handy because I don’t need to remember to start the Docker daemon each time I want to mess with a container. I also don’t need sudo.

All of my containers are stored beneath my home directory. That’s good for keeping disk space in check, but it’s especially helpful on shared servers since each user has their own unique storage. My container pulls and builds won’t disrupt anyone else’s work on the server and their work won’t disrupt mine.

Finally, buildah offers some nice options out of the box. First, when you build a container with buildah bud, you end up with only three layers by default:

  1. Original OS layer (example: fedora:30)
  2. Everything you added on top of the OS layer
  3. Tiny bit of metadata

This is incredibly helpful if you use package managers like dnf, apt, and yum that download a bunch of metadata before installing packages. You would normally have to clear the metadata carefully for the package manager so that your container wouldn’t grow in size. Buildah takes care of that by squashing all the stuff you add into one layer.
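For example, with Docker’s per-step layering you would normally bolt a cleanup step onto every RUN line, something like this hypothetical Dockerfile fragment:

RUN dnf -y install gcc make \
    && dnf clean all

With buildah bud, everything you add ends up in a single layer, so a cleanup step that happens later in the build still shrinks the final image.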

Of course, if you want to be more aggressive, buildah offers the --squash option which squashes the whole image down into one layer. This can be helpful if disk space is at a premium and you change the layers often.
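As a rough sketch (the image name is a placeholder), the difference is a single flag:

# Default: base layer + everything you added + a little metadata
buildah bud -t myimage .

# Squash the whole image down into one layer
buildah bud --squash -t myimage .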

Getting started

I have a repository called os-containers in GitLab that maintains fully updated containers for Fedora 29 and 30. The .gitlab-ci.yml file calls build.sh for two containers: fedora29 and fedora30. Open the build.sh file and follow along here:

# Use vfs with buildah. Docker offers overlayfs as a default, but buildah
# cannot stack overlayfs on top of another overlayfs filesystem.
export STORAGE_DRIVER=vfs

First off, we need to tell buildah to use the vfs storage driver. Docker uses overlayfs by default and stacking overlay filesystems will definitely lead to problems. Buildah won’t let you try it.

# Write all image metadata in the docker format, not the standard OCI format.
# Newer versions of docker can handle the OCI format, but older versions, like
# the one shipped with Fedora 30, cannot handle the format.
export BUILDAH_FORMAT=docker

By default, buildah uses the oci container format. This sometimes causes issues with older versions of Docker that don’t understand how to parse that type of metadata. By setting the format to docker, we’re using a format that almost all container runtimes can understand.

# Log into GitLab's container repository.
export REGISTRY_AUTH_FILE=${HOME}/auth.json
echo "$CI_REGISTRY_PASSWORD" | buildah login -u "$CI_REGISTRY_USER" --password-stdin $CI_REGISTRY

Here we set a path for the auth.json that contains the credentials for talking to the container repository. We also use buildah to authenticate to GitLab’s built-in container repository. GitLab automatically exports these variables for us (and hides them in the job output), so we can use them here.

buildah bud -f builds/${IMAGE_NAME} -t ${IMAGE_NAME} .

We’re now building the container and storing it temporarily as the bare image name, such as fedora30. This is roughly equivalent to docker build.

CONTAINER_ID=$(buildah from ${IMAGE_NAME})
buildah commit --squash $CONTAINER_ID $FQ_IMAGE_NAME

Now we are making a reference to our container with buildah from and using that reference to squash that container down into a single layer. This keeps the container as small as possible.

The commit step also tags the resulting image with our fully qualified image name (in this case, it’s registry.gitlab.com/majorhayden/os-containers/fedora30:latest).

buildah push ${FQ_IMAGE_NAME}

This is the same as docker push. There’s not much special to see here.

Maintaining containers

GitLab allows you to take things to the next level with CI schedules. In my repository, there is a schedule to build my containers once a day to catch the latest updates. I use these containers a lot and they need to be up to date before I can run tests.
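The schedule itself is created in the GitLab UI under CI/CD > Schedules. If you want a job to run only during those scheduled pipelines, you can restrict it in the CI file; here is a hypothetical example (the job name and script are placeholders):

nightly_rebuild:
  script:
    - ./build.sh
  only:
    - schedules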

If the container build fails for some reason, GitLab will send me an email to let me know.

Photo Source

Inspecting OpenShift cgroups from inside the pod

Walking through a rock valley

My team at Red Hat builds a lot of kernels in OpenShift pods as part of our work with the Continuous Kernel Integration (CKI) project. We have lots of different pod sizes depending on the type of work we are doing and our GitLab runners spawn these pods based on the tags in our GitLab CI pipeline.

Compiling with make

When you compile a large software project, such as the Linux kernel, you can use multiple CPU cores to speed up the build. GNU’s make does this with the -j argument. Running make with -j10 means that you want to run 10 jobs while compiling. This would keep 10 CPU cores busy.

Setting the number too high causes more contention from the CPU and can reduce performance. Setting the number too low means that you are spending more time compiling than you would if you used all of your CPU cores.

Every once in a while, we adjusted our runners to use a different number of CPUs or a different amount of memory, and then we had to adjust our pipeline to reflect the new CPU count. This was time-consuming and error-prone.

Many people just use nproc to determine the CPU core count. It works well with make:

make -j$(nproc)

Problems with containers

The handy nproc doesn’t work well for OpenShift. If you start a pod on OpenShift and limit it to a single CPU core, nproc tells you something very wrong:

$ nproc
32

We applied the single CPU limit with OpenShift, so what’s the problem? The issue is how nproc looks for CPUs. Here’s a snippet of strace output:

sched_getaffinity(0, 128, [0, 1, 2, 3, 4, 5]) = 8
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x6), ...}) = 0
write(1, "6\n", 2)                      = 2

The sched_getaffinity syscall returns the set of CPUs that are allowed to run the process, and nproc simply counts them. OpenShift doesn’t prevent us from seeing the CPUs of the underlying system (the VM or bare metal host underneath our containers); instead, it uses cgroups to limit how much CPU time we can use.
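You can see the same thing without strace by looking at the process affinity mask directly (hypothetical output matching the 32-CPU host above):

$ grep Cpus_allowed_list /proc/self/status
Cpus_allowed_list:	0-31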

Reading cgroups

Getting cgroup data is easy! Just change into the /sys/fs/cgroup/ directory and look around:

$ cd /sys/fs/cgroup/
$ ls -al cpu/
ls: cannot open directory 'cpu/': Permission denied

Ouch. OpenShift makes this a little more challenging. We’re not allowed to wander around in the land of cgroups without a map to exactly what we want.

My Fedora workstation shows a bunch of CPU cgroup settings:

$ ls -al /sys/fs/cgroup/cpu/
total 0
dr-xr-xr-x.  2 root root   0 Apr  5 01:40 .
drwxr-xr-x. 14 root root 360 Apr  5 01:40 ..
-rw-r--r--.  1 root root   0 Apr  5 13:08 cgroup.clone_children
-rw-r--r--.  1 root root   0 Apr  5 01:40 cgroup.procs
-r--r--r--.  1 root root   0 Apr  5 13:08 cgroup.sane_behavior
-r--r--r--.  1 root root   0 Apr  5 13:08 cpuacct.stat
-rw-r--r--.  1 root root   0 Apr  5 13:08 cpuacct.usage
-r--r--r--.  1 root root   0 Apr  5 13:08 cpuacct.usage_all
-r--r--r--.  1 root root   0 Apr  5 13:08 cpuacct.usage_percpu
-r--r--r--.  1 root root   0 Apr  5 13:08 cpuacct.usage_percpu_sys
-r--r--r--.  1 root root   0 Apr  5 13:08 cpuacct.usage_percpu_user
-r--r--r--.  1 root root   0 Apr  5 13:08 cpuacct.usage_sys
-r--r--r--.  1 root root   0 Apr  5 13:08 cpuacct.usage_user
-rw-r--r--.  1 root root   0 Apr  5 09:10 cpu.cfs_period_us
-rw-r--r--.  1 root root   0 Apr  5 13:08 cpu.cfs_quota_us
-rw-r--r--.  1 root root   0 Apr  5 09:10 cpu.shares
-r--r--r--.  1 root root   0 Apr  5 13:08 cpu.stat
-rw-r--r--.  1 root root   0 Apr  5 13:08 notify_on_release
-rw-r--r--.  1 root root   0 Apr  5 13:08 release_agent
-rw-r--r--.  1 root root   0 Apr  5 13:08 tasks

OpenShift uses the Completely Fair Scheduler (CFS) to limit CPU time. Here’s a quick excerpt from the kernel documentation:

Quota and period are managed within the cpu subsystem via cgroupfs.

cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
cpu.cfs_period_us: the length of a period (in microseconds)
cpu.stat: exports throttling statistics [explained further below]

The default values are:

cpu.cfs_period_us=100ms
cpu.cfs_quota=-1

A value of -1 for cpu.cfs_quota_us indicates that the group does not have any bandwidth restriction in place, such a group is described as an unconstrained bandwidth group. This represents the traditional work-conserving behavior for CFS.

Writing any (valid) positive value(s) will enact the specified bandwidth limit. The minimum quota allowed for the quota or period is 1ms. There is also an upper bound on the period length of 1s. Additional restrictions exist when bandwidth limits are used in a hierarchical fashion, these are explained in more detail below.

Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit and return the group to an unconstrained state once more.

Any updates to a group’s bandwidth specification will result in it becoming unthrottled if it is in a constrained state.

Let’s see if inspecting cpu.cfs_quota_us can help us:

$ cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
10000

Now we’re getting somewhere. But what does 10000 mean here? OpenShift operates on the concept of millicores of CPU time, or 1/1000 of a CPU. 500 millicores is half a CPU and 1000 millicores is a whole CPU.

The pod in this example is assigned 100 millicores. Now we know that we can take the output of /sys/fs/cgroup/cpu/cpu.cfs_quota_us, divide by 100, and get our millicores.
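To double-check the math, assuming the default 100ms period from the kernel documentation above:

$ cat /sys/fs/cgroup/cpu/cpu.cfs_period_us
100000
$ cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
10000

That works out to 10000 / 100000 = 0.1 CPU, or 100 millicores.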

We can make a script like this:

# Read the CFS quota (microseconds of CPU time per period) for this cgroup.
CFS_QUOTA=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
# Anything below one full period (100,000 microseconds) gets a single core,
# since make refuses to run with zero jobs.
if [ "$CFS_QUOTA" -lt 100000 ]; then
  CPUS_AVAILABLE=1
else
  # Convert the quota into whole CPUs (100,000 microseconds per CPU).
  CPUS_AVAILABLE=$(expr ${CFS_QUOTA} / 100 / 1000)
fi
echo "Found ${CPUS_AVAILABLE} CPUS"
make -j${CPUS_AVAILABLE} ...

The script reads the quota and divides it by 100,000 to get the number of cores. If the quota is less than 100,000, a core count of 1 is assigned. (Pro tip: make does not like being told to compile with zero jobs.)

Reading memory limits

There are other limits you can read and inspect in a pod, including the available RAM. As we found with nproc, free is not very helpful:

# An OpenShift pod with 200MB RAM
$ free -m
              total        used        free      shared  buff/cache   available
Mem:          32008       12322         880          31       18805       19246
Swap:             0           0           0

But the cgroups tell the truth:

$ cat /sys/fs/cgroup/memory/memory.limit_in_bytes
209715200
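A quick sanity check (dividing by 1024 twice) confirms that this matches the 200MB pod limit:

$ expr $(cat /sys/fs/cgroup/memory/memory.limit_in_bytes) / 1024 / 1024
200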

If you run Java applications in a container, like Jenkins (or Jenkins slaves), be sure to use the -XX:+UseCGroupMemoryLimitForHeap option. That will cause Java to look at the cgroups to determine its heap size.
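On Java 8, that flag is experimental and has to be unlocked first. A hypothetical invocation (the jar name is a placeholder):

java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -jar app.jar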

Photo credit: Wikipedia

Running Ansible in OpenShift with arbitrary UIDs

Blacksmith anvil and hammer

My work at Red Hat involves testing lots and lots of kernels from various sources and we use GitLab CE to manage many of our repositories and run our CI jobs. Those jobs run in thousands of OpenShift containers that we spawn every day.

OpenShift has some handy security features that we like. First, each container is mounted read-only with some writable temporary space (and any volumes that you mount). Also, OpenShift uses arbitrarily assigned user IDs for each container.

Constantly changing UIDs provide some good protection against container engine vulnerabilities, but they can be a pain if you have a script or application that depends on being able to resolve a UID or GID back to a real user or group account.

Ansible and UIDs

If you run an Ansible playbook within OpenShift, you will likely run into a problem during the fact gathering process:

$ ansible-playbook -i hosts playbook.yml

PLAY [all] *********************************************************************

TASK [Gathering Facts] *********************************************************
An exception occurred during task execution. To see the full traceback, use -vvv.
The error was: KeyError: 'getpwuid(): uid not found: 1000220000'
fatal: [localhost]: FAILED! => {"msg": "Unexpected failure during module execution.", "stdout": ""}
	to retry, use: --limit @/major-ansible-messaround/playbook.retry

PLAY RECAP *********************************************************************
localhost                  : ok=0    changed=0    unreachable=0    failed=1

This exception is telling us that getpwuid() was not able to find an entry in /etc/passwd for our UID (1000220000 in this container).

One option would be to adjust the playbook so that we skip the fact gathering process:

- hosts: all
  gather_facts: no
  tasks:

    - name: Run tests
      command: ./run_tests.sh

However, this might not be helpful if you need facts to be gathered for your playbook to run. In that case, you need to make some adjustments to your container image first.

Updating the container

Nothing in the container image is writable within OpenShift, but we can make certain files group writable for the root group, since every OpenShift user runs with an effective GID of 0.

When you build your container, add a line to your Dockerfile to allow the container user to have group write access to /etc/passwd and /etc/group:

# Make Ansible happy with arbitrary UID/GID in OpenShift.
RUN chmod g=u /etc/passwd /etc/group

Once your container has finished building, the permissions on both files should look like this:

$ ls -al /etc/passwd /etc/group
-rw-rw-r--. 1 root root 514 Mar 20 18:12 /etc/group
-rw-rw-r--. 1 root root 993 Mar 20 18:12 /etc/passwd

Make a user account

Now that we’ve made these files writable for our user in OpenShift, it’s time to change how we run our GitLab CI job. My job YAML currently looks like this:

ansible_test:
  image: docker.io/major/ansible:fedora29
  script:
    - ansible-playbook -i hosts playbook.yml

We can add two lines that allow us to make a temporary user and group account for our OpenShift user:

ansible_test:
  image: docker.io/major/ansible:fedora29
  script:
    - echo "tempuser:x:$(id -u):$(id -g):,,,:${HOME}:/bin/bash" >> /etc/passwd
    - echo "tempuser:x:$(id -G | cut -d' ' -f 2)" >> /etc/group
    - id
    - ansible-playbook -i hosts playbook.yml

Note that we want the second GID returned by id since the first one is 0. The id command helps us check our work when the container starts. When the CI job starts, we should see some better output:

$ echo "tempuser:x:$(id -u):$(id -g):,,,:${HOME}:/bin/bash" >> /etc/passwd
$ echo "tempuser:x:$(id -G | cut -d' ' -f 2)" >> /etc/group
$ id
uid=1000220000(tempuser) gid=0(root) groups=0(root),1000220000(tempuser)
$ ansible-playbook -i hosts playbook.yml

PLAY [all] *********************************************************************

TASK [Gathering Facts] *********************************************************
ok: [localhost]

TASK [Download kernel source] **************************************************
changed: [localhost]

PLAY RECAP *********************************************************************
localhost                  : ok=2    changed=1    unreachable=0    failed=0

Success!

Get a /56 from Spectrum using wide-dhcpv6

After writing my last post on my IPv6 woes with my Pixel 3, some readers asked how I’m handling IPv6 on my router lately. I wrote about this previously when Spectrum was Time Warner Cable and I was using Mikrotik network devices.

There is a good post from 2015 on this blog, and the setup it describes still works today.

I am still using that same setup today, but some readers found it difficult to find the post since Time Warner Cable was renamed to Spectrum. Don’t worry – the old post still works. :)
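For anyone who doesn’t want to dig up the old post, here is a minimal dhcp6c.conf sketch for wide-dhcpv6. It assumes eth0 is the WAN interface facing Spectrum and eth1 is the LAN interface, so adjust the interface names and sla values for your own network:

# Ask the ISP for an address (ia-na) and a delegated prefix (ia-pd).
interface eth0 {
    send ia-na 0;
    send ia-pd 0;
};
id-assoc na 0 {
};
id-assoc pd 0 {
    # Carve /64 networks out of the delegated /56 (56 + 8 = 64) and put
    # the first one (sla-id 0) on the LAN interface.
    prefix-interface eth1 {
        sla-id 0;
        sla-len 8;
    };
};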