major.io words of wisdom from a systems engineer

Inspecting OpenShift cgroups from inside the pod

walking_through_rock_valley

My team at Red Hat builds a lot of kernels in OpenShift pods as part of our work with the Continuous Kernel Integration (CKI) project. We have lots of different pod sizes depending on the type of work we are doing and our GitLab runners spawn these pods based on the tags in our GitLab CI pipeline.

Compiling with make

When you compile a large software project, such as the Linux kernel, you can use multiple CPU cores to speed up the build. GNU’s make does this with the -j argument. Running make with -j10 means that you want to run 10 jobs while compiling. This would keep 10 CPU cores busy.

Setting the number too high causes more contention from the CPU and can reduce performance. Setting the number too low means that you are spending more time compiling than you would if you used all of your CPU cores.

Every once in a while, we adjusted our runners to use a different amount of CPUs or memory and then we had to adjust our pipeline to reflect the new CPU count. This was time consuming and error prone.

Many people just use nproc to determine the CPU core count. It works well with make:

make -j$(nproc)

Problems with containers

The handy nproc doesn’t work well for OpenShift. If you start a pod on OpenShift and limit it to a single CPU core, nproc tells you something very wrong:

$ nproc
32

We applied the single CPU limit with OpenShift, so what’s the problem? The issue is how nproc looks for CPUs. Here’s a snippet of strace output:

sched_getaffinity(0, 128, [0, 1, 2, 3, 4, 5]) = 8
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x6), ...}) = 0
write(1, "6\n", 26
)                      = 2

The sched_getaffinity syscall looks to see which CPUs are allowed to run the process and returns a count of those. OpenShift doesn’t prevent us from seeing the CPUs of the underlying system (the VM or bare metal host underneath our containers), but it uses cgroups to limit how much CPU time we can use.

Reading cgroups

Getting cgroup data is easy! Just change into the /sys/fs/cgroup/ directory and look around:

$ cd /sys/fs/cgroup/
$ ls -al cpu/
ls: cannot open directory 'cpu/': Permission denied

Ouch. OpenShift makes this a little more challenging. We’re not allowed to wander around in the land of cgroups without a map to exactly what we want.

My Fedora workstation shows a bunch of CPU cgroup settings:

$ ls -al /sys/fs/cgroup/cpu/
total 0
dr-xr-xr-x.  2 root root   0 Apr  5 01:40 .
drwxr-xr-x. 14 root root 360 Apr  5 01:40 ..
-rw-r--r--.  1 root root   0 Apr  5 13:08 cgroup.clone_children
-rw-r--r--.  1 root root   0 Apr  5 01:40 cgroup.procs
-r--r--r--.  1 root root   0 Apr  5 13:08 cgroup.sane_behavior
-r--r--r--.  1 root root   0 Apr  5 13:08 cpuacct.stat
-rw-r--r--.  1 root root   0 Apr  5 13:08 cpuacct.usage
-r--r--r--.  1 root root   0 Apr  5 13:08 cpuacct.usage_all
-r--r--r--.  1 root root   0 Apr  5 13:08 cpuacct.usage_percpu
-r--r--r--.  1 root root   0 Apr  5 13:08 cpuacct.usage_percpu_sys
-r--r--r--.  1 root root   0 Apr  5 13:08 cpuacct.usage_percpu_user
-r--r--r--.  1 root root   0 Apr  5 13:08 cpuacct.usage_sys
-r--r--r--.  1 root root   0 Apr  5 13:08 cpuacct.usage_user
-rw-r--r--.  1 root root   0 Apr  5 09:10 cpu.cfs_period_us
-rw-r--r--.  1 root root   0 Apr  5 13:08 cpu.cfs_quota_us
-rw-r--r--.  1 root root   0 Apr  5 09:10 cpu.shares
-r--r--r--.  1 root root   0 Apr  5 13:08 cpu.stat
-rw-r--r--.  1 root root   0 Apr  5 13:08 notify_on_release
-rw-r--r--.  1 root root   0 Apr  5 13:08 release_agent
-rw-r--r--.  1 root root   0 Apr  5 13:08 tasks

OpenShift uses the Completely Fair Scheduler (CFS) to limit CPU time. Here’s a quick excerpt from the kernel documentation:

Quota and period are managed within the cpu subsystem via cgroupfs.

cpu.cfs_quota_us: the total available run-time within a period (in microseconds) cpu.cfs_period_us: the length of a period (in microseconds) cpu.stat: exports throttling statistics [explained further below]

The default values are: cpu.cfs_period_us=100ms cpu.cfs_quota=-1

A value of -1 for cpu.cfs_quota_us indicates that the group does not have any bandwidth restriction in place, such a group is described as an unconstrained bandwidth group. This represents the traditional work-conserving behavior for CFS.

Writing any (valid) positive value(s) will enact the specified bandwidth limit. The minimum quota allowed for the quota or period is 1ms. There is also an upper bound on the period length of 1s. Additional restrictions exist when bandwidth limits are used in a hierarchical fashion, these are explained in more detail below.

Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit and return the group to an unconstrained state once more.

Any updates to a group’s bandwidth specification will result in it becoming unthrottled if it is in a constrained state.

Let’s see if inspecting cpu.cfs_quota_us can help us:

$ cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
10000

Now we’re getting somewhere. But what does 10000 mean here? OpenShift operates on the concept of millicores of CPU time, or 11000 of a CPU. 500 millicores is half a CPU and 1000 millicores is a whole CPU.

The pod in this example is assigned 100 millicores. Now we know that we can take the output of /sys/fs/cgroup/cpu/cpu.cfs_quota_us, divide by 100, and get our millicores.

We can make a script like this:

CFS_QUOTA=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
if [ $CFS_QUOTA -lt 100000 ]; then
  CPUS_AVAILABLE=1
else
  CPUS_AVAILABLE=$(expr ${CFS_QUOTA} / 100 / 1000)
fi
echo "Found ${CPUS_AVAILABLE} CPUS"
make -j${CPUS_AVAILABLE} ...

The script checks for the value of the quota and divides by 100,000 to get the number of cores. If the share is set to something less than 100,000, then a core count of 1 is assigned. (Pro tip: make does not like being told to compile with zero jobs.)

Reading memory limits

There are other limits you can read and inspect in a pod, including the available RAM. As we found with nproc, free is not very helpful:

# An OpenShift pod with 200MB RAM
$ free -m
              total        used        free      shared  buff/cache   available
Mem:          32008       12322         880          31       18805       19246
Swap:             0           0           0

But the cgroups tell the truth:

$ cat /sys/fs/cgroup/memory/memory.limit_in_bytes
209715200

If you run Java applications in a container, like Jenkins (or Jenkins slaves), be sure to use the -XX:+UseCGroupMemoryLimitForHeap option. That will cause Java to look at the cgroups to determine its heap size.

Photo credit: Wikipedia

Running Ansible in OpenShift with arbitrary UIDs

blacksmith_anvil_hammer

My work at Red Hat involves testing lots and lots of kernels from various sources and we use GitLab CE to manage many of our repositories and run our CI jobs. Those jobs run in thousands of OpenShift containers that we spawn every day.

OpenShift has some handy security features that we like. First, each container is mounted read-only with some writable temporary space (and any volumes that you mount). Also, OpenShift uses arbitrarily assigned user IDs for each container.

Constantly changing UIDs provide some good protection against container engine vulnerabilities, but they can be a pain if you have a script or application that depends on being able to resolve a UID or GID back to a real user or group account.

Ansible and UIDs

If you run an Ansible playbook within OpenShift, you will likely run into a problem during the fact gathering process:

$ ansible-playbook -i hosts playbook.yml

PLAY [all] *********************************************************************

TASK [Gathering Facts] *********************************************************
An exception occurred during task execution. To see the full traceback, use -vvv.
The error was: KeyError: 'getpwuid(): uid not found: 1000220000'
fatal: [localhost]: FAILED! => {"msg": "Unexpected failure during module execution.", "stdout": ""}
	to retry, use: --limit @/major-ansible-messaround/playbook.retry

PLAY RECAP *********************************************************************
localhost                  : ok=0    changed=0    unreachable=0    failed=1

This exception is telling us that getpwuid() was not able to find an entry in /etc/passwd for our UID (1000220000 in this container).

One option would be to adjust the playbook so that we skip the fact gathering process:

- hosts: all
  gather_facts: no
  tasks:

    - name: Run tests
      command: ./run_tests.sh

However, this might not be helpful if you need facts to be gathered for your playbook to run. In that case, you need to make some adjustments to your container image first.

Updating the container

Nothing in the container image is writable within OpenShift, but we can change certain files to be group writable for the root user since every OpenShift user has an effective GID of 0.

When you build your container, add a line to your Dockerfile to allow the container user to have group write access to /etc/passwd and /etc/group:

# Make Ansible happy with arbitrary UID/GID in OpenShift.
RUN chmod g=u /etc/passwd /etc/group

Once your container has finished building, the permissions on both files should look like this:

$ ls -al /etc/passwd /etc/group
-rw-rw-r--. 1 root root 514 Mar 20 18:12 /etc/group
-rw-rw-r--. 1 root root 993 Mar 20 18:12 /etc/passwd

Make a user account

Now that we’ve made these files writable for our user in OpenShift, it’s time to change how we run our GitLab CI job. My job YAML currently looks like this:

ansible_test:
  image: docker.io/major/ansible:fedora29
  script:
    - ansible-playbook -i hosts playbook.yml

We can add two lines that allow us to make a temporary user and group account for our OpenShift user:

ansible_test:
  image: docker.io/major/ansible:fedora29
  script:
    - echo "tempuser:x:$(id -u):$(id -g):,,,:${HOME}:/bin/bash" >> /etc/passwd
    - echo "tempuser:x:$(id -G | cut -d' ' -f 2)" >> /etc/group
    - id
    - ansible-playbook -i hosts playbook.yml

Note that we want the second GID returned by id since the first one is 0. The id command helps us check our work when the container starts. When the CI job starts, we should see some better output:

$ echo "tempuser:x:$(id -u):$(id -g):,,,:${HOME}:/bin/bash" >> /etc/passwd
$ echo "tempuser:x:$(id -G | cut -d' ' -f 2)" >> /etc/group
$ id
uid=1000220000(tempuser) gid=0(root) groups=0(root),1000220000(tempuser)
$ ansible-playbook -i hosts playbook.yml

PLAY [all] *********************************************************************

TASK [Gathering Facts] *********************************************************
ok: [localhost]

TASK [Download kernel source] **************************************************
changed: [localhost]

PLAY RECAP *********************************************************************
localhost                  : ok=2    changed=1    unreachable=0    failed=0

Success!

Get a /56 from Spectrum using wide-dhcpv6

After writing my last post on my IPv6 woes with my Pixel 3, some readers asked how I’m handling IPv6 on my router lately. I wrote about this previously when Spectrum was Time Warner Cable and I was using Mikrotik network devices.

There is a good post from 2015 on the blog and it still works today:

I am still using that same setup today, but some readers found it difficult to find the post since Time Warner Cable has renamed to Spectrum. Don’t worry – the old post still works. :)

Pixel 3 Wi-Fi drops constantly

pixel_phones

We have two Google Pixel phones in our house: a Pixel 2 and a Pixel 3. Both of them drop off our home wireless network regularly. It causes lots of problems with various applications on the phones, especially casting video via Chromecast.

At the time when I first noticed the drops, I was using a pair of wireless access points (APs) from Engenius:

Also, here’s what I knew at the time:

  • Mac and Linux computers had no Wi-Fi issues at all
  • The signal level from both APs was strong
  • Disabling one AP made no improvement
  • Disabling one band (2.4 or 5GHz) on the APs made no improvement
  • Clearing the bluetooth/Wi-Fi data on the Pixel had no effect
  • Assigning a static IP address on the Pixel made no improvement
  • Using unencrypted SSIDs made no improvement

At this point, I felt strongly that the APs had nothing to do with it. I ordered a new NetGear Orbi mesh router and satellite anyway. The Pixels still dropped off the wireless network even with the new Orbi APs.

Reading logs

I started reading logs from every source I could find:

  • dhcpd logs from my router
  • syslogs from my APs (which forwarded into the router)
  • output from tcpdump on my router

Several things became apparent after reading the logs:

  • The Wi-Fi drop occurred usually every 30-60 seconds
  • The DHCP server received requests for a new IP address after every drop
  • None of the network traffic from the phones was being blocked at the router
  • The logs from the APs showed the phone disconnecting itself from the network; the APs were not forcing the phones off the network

All of the wireless and routing systems in my house seemed to point to a problem in the phones themselves. They were voluntarily dropping from the network without being bumped off by APs or the router.

Getting logs from the phone

It was time to get some logs from the phone itself. That would require connecting the phone via USB to a computer and enabling USB debugging on the phone.

First, I downloaded the Android SDK. The full studio release isn’t needed – scroll down and find the Command line tools only section. Unzip the download and find the tools/bin/sdkmanager executable. Run it like this:

# Fedora 29 systems may need to choose the older Java version for sdkmanager
# to run properly.
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.201.b09-2.fc29.x86_64/jre
# Install the android-28 platform tools
./sdkmanager "platform-tools" "platforms;android-28"

Now we need to enable USB debugging on the phone itself. Be sure to disable USB debugging when you are done! Follow these steps:

  1. Go into the phone’s settings and choose About Phone from the bottom of the list.
  2. Scroll to the bottom and tap the Build number section repeatedly until a message appears saying that you are now a developer.
  3. Go back one screen and tap System.
  4. Click Advanced to show the additional options and tap Developer Options.
  5. In the Debugging section, tap USB Debugging to enable USB debugging.

Connect the phone to your computer via USB and run:

sudo platform-tools/adb logcat

Your screen will fill with logs from your phone.

Nuggets in the log

I watched the logs and waited for the Wi-Fi to drop. As soon as it dropped, I saw some interesting log messages:

I wpa_supplicant: wlan0: CTRL-EVENT-AVOID-FREQ ranges=5785-5825
I chatty  : uid=1000(system) IpClient.wlan0 expire 3 lines
I chatty  : uid=1000 system_server expire 1 line
D CommandListener: Setting iface cfg
E cnss-daemon: wlan_service_update_sys_param: unable to open /proc/sys/net/ipv4/tcp_use_userconfig
I chatty  : uid=1000(system) android.fg expire 1 line
I wpa_supplicant: wlan0: CTRL-EVENT-DISCONNECTED bssid=88:dc:96:4a:b6:75 reason=3 locally_generated=1
I chatty  : uid=10025 com.google.android.gms.persistent expire 7 lines
V NativeCrypto: Read error: ssl=0x7b349e2d08: I/O error during system call, Software caused connection abort
V NativeCrypto: Write error: ssl=0x7b349e2d08: I/O error during system call, Broken pipe
V NativeCrypto: Write error: ssl=0x7b349e2d08: I/O error during system call, Broken pipe
V NativeCrypto: SSL shutdown failed: ssl=0x7b349e2d08: I/O error during system call, Success
D ConnectivityService: reportNetworkConnectivity(158, false) by 10025

The line with CTRL-EVENT-AVOID-FREQ isn’t relevant because it’s simply a hint to the wireless drivers to avoid certain frequencies not used in the USA. The CTRL-EVENT-DISCONNECTED shows where wpa_supplicant received the disconnection message. The last line with ConnectivityService was very interesting. Something in the phone believes there is a network connectivity issue. That could be why the Pixel is hopping off the wireless network.

From there, I decided to examine only the ConnectivityService logs:

sudo platform-tools/adb logcat 'ConnectivityService:* *:S'

This logcat line tells adb that I want all logs from all log levels about the ConnectivityService, but all of the other logs should be silenced. I started seeing some interesting details:

D ConnectivityService: NetworkAgentInfo [WIFI () - 148] validation failed
D ConnectivityService: Switching to new default network: NetworkAgentInfo{ ni{[type: MOBILE[LTE]...
D ConnectivityService: Sending DISCONNECTED broadcast for type 1 NetworkAgentInfo [WIFI () - 148] isDefaultNetwork=true
D ConnectivityService: Sending CONNECTED broadcast for type 0 NetworkAgentInfo [MOBILE (LTE) - 100] isDefaultNetwork=true
D ConnectivityService: handleNetworkUnvalidated NetworkAgentInfo [WIFI () - 148] ...

Wait, what is this “validation failed” message? The Pixel was making network connections successfully the entire time as shown by tcpdump. This is part of Android’s [network connecivity checks] for various networks.

The last few connections just before the disconnect were to connectivitycheck.gstatic.com (based on tcpdump logs) and that’s Google’s way of verifying that the wireless network is usable and that there are no captive portals. I connected to it from my desktop on IPv4 and IPv6 to verify:

$ curl -4 -i https://connectivitycheck.gstatic.com/generate_204
HTTP/2 204
date: Sun, 17 Mar 2019 15:00:30 GMT
alt-svc: quic=":443"; ma=2592000; v="46,44,43,39"
$ curl -6 -i https://connectivitycheck.gstatic.com/generate_204
HTTP/2 204
date: Sun, 17 Mar 2019 15:00:30 GMT
alt-svc: quic=":443"; ma=2592000; v="46,44,43,39"

Everything looked fine.

Heading to Google

After a bunch of searching on Google, I kept finding posts talking about disabling IPv6 to fix the Wi-Fi drop issues. I shrugged it off and kept searching. Finally, I decided to disable IPv6 and see if that helped.

I stopped radvd on the router, disabled Wi-Fi on the phone, and then re-enabled it. As I watched, the phone stayed on the wireless network for two minutes. Three minutes. Ten minutes. There were no drops.

At this point, this is still an unsolved mystery for me. Disabling IPv6 is a terrible idea, but it keeps my phones online. I plan to put the phones on their own VLAN without IPv6 so I can still keep IPv6 addresses for my other computers, but this is not a good long term fix. If anyone has any input on why this helps and how I can get IPv6 re-enabled, please let me know!

Update 2019-03-18

Several readers wanted to see what was happening just before the Wi-Fi drop, so here’s a small snippet from tcpdump:

07:26:06.736900 IP6 2607:f8b0:4000:80d::2003.443 > phone.41310: Flags [F.], seq 3863, ack 511, win 114, options [nop,nop,TS val 1288800272 ecr 66501414], length 0
07:26:06.743101 IP6 2607:f8b0:4000:80d::2003.443 > phone.41312: Flags [F.], seq 3864, ack 511, win 114, options [nop,nop,TS val 1778536228 ecr 66501414], length 0
07:26:06.765444 IP6 phone.41312 > 2607:f8b0:4000:80d::2003.443: Flags [R], seq 4183481455, win 0, length 0
07:26:06.765454 IP6 phone.41310 > 2607:f8b0:4000:80d::2003.443: Flags [R], seq 3279990707, win 0, length 0
07:26:07.487180 IP6 2607:f8b0:4000:80d::2003.443 > phone.41316: Flags [F.], seq 3863, ack 511, win 114, options [nop,nop,TS val 4145292968 ecr 66501639], length 0
07:26:07.537080 IP6 phone.41316 > 2607:f8b0:4000:80d::2003.443: Flags [R], seq 4188442452, win 0, length 0

That IPv6 address is at a Google PoP in Dallas, TX:

$ host 2607:f8b0:4000:80d::2003
3.0.0.2.0.0.0.0.0.0.0.0.0.0.0.0.d.0.8.0.0.0.0.4.0.b.8.f.7.0.6.2.ip6.arpa domain name pointer dfw06s49-in-x03.1e100.net.

I haven’t been able to intercept the traffic via man-in-the-middle since Google’s certificate checks are very strict. However, checks from my own computer work without an issue:

$ curl -ki "https://[2607:f8b0:4000:80d::2003]/generate_204"
HTTP/2 204
date: Mon, 18 Mar 2019 12:35:18 GMT
alt-svc: quic=":443"; ma=2592000; v="46,44,43,39"

Stop audio pops on Intel HD Audio

headphones

I recently picked up a Dell Optiplex 7060 and I’m using it as my main workstation now. The Fedora installation was easy, but I noticed a variety of “pop” or clicking sounds when audio played, especially terminal bells.

If everything was quiet and I triggered a terminal bell, I would hear a loud pop just before the terminal bell sound. However, if I played music and then triggered a terminal bell, the pop was gone.

A quick Google search told me that the likely culprit was power saving settings on my Intel HD Audio chipset:

$ lspci | grep Audio
00:1f.3 Audio device: Intel Corporation Cannon Lake PCH cAVS (rev 10)

Fixing it

There’s a handy power saving tunable available at /sys/module/snd_hda_intel/parameters/power_save that can be usd to adjust the timeout or disable power savings entirely. In my case, the delay was set to one second.

$ cat /sys/module/snd_hda_intel/parameters/power_save
1

That would be good for a laptop use case, but my workstation is always plugged in. I disabled it by setting it to zero:

# echo 0 > /sys/module/snd_hda_intel/parameters/power_save
$ cat /sys/module/snd_hda_intel/parameters/power_save
0

And the pops are gone! My Klipsch speakers have a built in amplifier and it was likely the abrupt changes in current that was causing the popping noises.

This setting will last until you reboot. You can make it permanent by adding this text to /etc/modprobe.d/audio_disable_powersave.conf:

options snd_hda_intel power_save=0

If you’re a laptop user and you want power savings but fewer pops, you could increase the delay to a more acceptable number. For example, setting it to 60 would mean that the card will power down after 60 seconds of silence. Just remember that you’ll get a nice pop when the 60 seconds has passed and a new sound is played.

Learning more

Diving into the kernel code reveals the tunable in /sound/pci/hda/hda_intel.c:

static int power_save = CONFIG_SND_HDA_POWER_SAVE_DEFAULT;
module_param(power_save, xint, 0644);
MODULE_PARM_DESC(power_save, "Automatic power-saving timeout "
		 "(in second, 0 = disable).");

The default comes from a kernel config option: CONFIG_SND_HDA_POWER_SAVE_DEFAULT. Most kernel packages on most distributions provide access to the kernel config file that was used to build the kernel originally. It’s often found in /boot (named the same as the kernel version) or it might be available at /proc/config.gz.

For Fedora, the kernel config is provided in /boot whenever a new kernel is is installed. I inspected mine and found:

$ grep HDA_POWER_SAVE_DEFAULT /boot/config-4.20.13-200.fc29.x86_64
CONFIG_SND_HDA_POWER_SAVE_DEFAULT=1

The power_save setting is applied in /sound/pci/hda/hda_codec.c:

/**
 * snd_hda_set_power_save - reprogram autosuspend for the given delay
 * @bus: HD-audio bus
 * @delay: autosuspend delay in msec, 0 = off
 *
 * Synchronize the runtime PM autosuspend state from the power_save option.
 */
void snd_hda_set_power_save(struct hda_bus *bus, int delay)
{
	struct hda_codec *c;

	list_for_each_codec(c, bus)
		codec_set_power_save(c, delay);
}
EXPORT_SYMBOL_GPL(snd_hda_set_power_save);

We can look where codec_set_power_save is defined in the same file to learn more:

#ifdef CONFIG_PM
static void codec_set_power_save(struct hda_codec *codec, int delay)
{
	struct device *dev = hda_codec_dev(codec);

	if (delay == 0 && codec->auto_runtime_pm)
		delay = 3000;

	if (delay > 0) {
		pm_runtime_set_autosuspend_delay(dev, delay);
		pm_runtime_use_autosuspend(dev);
		pm_runtime_allow(dev);
		if (!pm_runtime_suspended(dev))
			pm_runtime_mark_last_busy(dev);
	} else {
		pm_runtime_dont_use_autosuspend(dev);
		pm_runtime_forbid(dev);
	}
}

This logic looks to see if CONFIG_PM is set to know if power management is desired at all. From there, it checks if we disabled power saving but there’s a discrete graphics card involved (codec->auto_runtime_pm). This check is important because the discrete graphics card cannot power down unless the HDA card suspends at the same time.

Next, there’s a check to see if the delay is greater than 0. This would be the case if CONFIG_SND_HDA_POWER_SAVE_DEFAULT was set to 1 (Fedora’s default). If so, the proper auto suspend delays are set.

If the delay is 0, then autosuspend is disabled and removed from power management entirely. This is the option I chose and it’s working great.

Photo source: Max Pixel