major.io words of wisdom from a systems engineer

Running Ansible in OpenShift with arbitrary UIDs


My work at Red Hat involves testing lots and lots of kernels from various sources and we use GitLab CE to manage many of our repositories and run our CI jobs. Those jobs run in thousands of OpenShift containers that we spawn every day.

OpenShift has some handy security features that we like. First, each container is mounted read-only with some writable temporary space (and any volumes that you mount). Also, OpenShift uses arbitrarily assigned user IDs for each container.

Constantly changing UIDs provide some good protection against container engine vulnerabilities, but they can be a pain if you have a script or application that depends on being able to resolve a UID or GID back to a real user or group account.

Ansible and UIDs

If you run an Ansible playbook within OpenShift, you will likely run into a problem during the fact gathering process:

$ ansible-playbook -i hosts playbook.yml

PLAY [all] *********************************************************************

TASK [Gathering Facts] *********************************************************
An exception occurred during task execution. To see the full traceback, use -vvv.
The error was: KeyError: 'getpwuid(): uid not found: 1000220000'
fatal: [localhost]: FAILED! => {"msg": "Unexpected failure during module execution.", "stdout": ""}
	to retry, use: --limit @/major-ansible-messaround/playbook.retry

PLAY RECAP *********************************************************************
localhost                  : ok=0    changed=0    unreachable=0    failed=1

This exception is telling us that getpwuid() was not able to find an entry in /etc/passwd for our UID (1000220000 in this container).
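You can confirm the problem before running Ansible by checking whether the current UID has a passwd entry. Here is a minimal sketch of that check. The function name uid_in_passwd is made up for illustration, and it reads the file directly rather than going through NSS the way getpwuid() does:

```shell
# uid_in_passwd UID [FILE]
# Succeeds if FILE (default /etc/passwd) has an entry whose third field is UID.
uid_in_passwd() {
    awk -F: -v uid="$1" '$3 == uid { found = 1 } END { exit !found }' "${2:-/etc/passwd}"
}

if uid_in_passwd "$(id -u)"; then
    echo "UID $(id -u) resolves to a passwd entry"
else
    echo "UID $(id -u) has no passwd entry; fact gathering will likely fail"
fi
```

Inside an OpenShift container with an arbitrary UID, I would expect the second message to print.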

One option would be to adjust the playbook so that we skip the fact gathering process:

- hosts: all
  gather_facts: no
  tasks:

    - name: Run tests
      command: ./run_tests.sh

However, this might not be helpful if you need facts to be gathered for your playbook to run. In that case, you need to make some adjustments to your container image first.

Updating the container

Nothing in the container image is writable within OpenShift, but we can make certain files writable by the root group, since every OpenShift user has an effective GID of 0.

When you build your container, add a line to your Dockerfile to allow the container user to have group write access to /etc/passwd and /etc/group:

# Make Ansible happy with arbitrary UID/GID in OpenShift.
RUN chmod g=u /etc/passwd /etc/group

Once your container has finished building, the permissions on both files should look like this:

$ ls -al /etc/passwd /etc/group
-rw-rw-r--. 1 root root 514 Mar 20 18:12 /etc/group
-rw-rw-r--. 1 root root 993 Mar 20 18:12 /etc/passwd
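If you would rather check this in a script than read ls output, something like the sketch below could work. It assumes GNU stat (for the -c option), and group_matches_owner is a hypothetical helper name:

```shell
# group_matches_owner FILE
# Succeeds when the group permission bits equal the owner bits,
# which is the state that "chmod g=u" produces.
group_matches_owner() {
    # %A prints the symbolic mode, for example -rw-rw-r--
    perms=$(stat -c '%A' "$1") || return 1
    [ "$(printf '%s' "$perms" | cut -c2-4)" = "$(printf '%s' "$perms" | cut -c5-7)" ]
}

for f in /etc/passwd /etc/group; do
    if group_matches_owner "$f"; then
        echo "$f: group permissions match owner"
    else
        echo "$f: group permissions differ from owner"
    fi
done
```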

Make a user account

Now that we’ve made these files writable for our user in OpenShift, it’s time to change how we run our GitLab CI job. My job YAML currently looks like this:

ansible_test:
  image: docker.io/major/ansible:fedora29
  script:
    - ansible-playbook -i hosts playbook.yml

We can add two lines that allow us to make a temporary user and group account for our OpenShift user:

ansible_test:
  image: docker.io/major/ansible:fedora29
  script:
    - echo "tempuser:x:$(id -u):$(id -g):,,,:${HOME}:/bin/bash" >> /etc/passwd
    - echo "tempuser:x:$(id -G | cut -d' ' -f 2)" >> /etc/group
    - id
    - ansible-playbook -i hosts playbook.yml

Note that we want the second GID returned by id since the first one is 0. The id command helps us check our work when the container starts. When the CI job starts, we should see some better output:

$ echo "tempuser:x:$(id -u):$(id -g):,,,:${HOME}:/bin/bash" >> /etc/passwd
$ echo "tempuser:x:$(id -G | cut -d' ' -f 2)" >> /etc/group
$ id
uid=1000220000(tempuser) gid=0(root) groups=0(root),1000220000(tempuser)
$ ansible-playbook -i hosts playbook.yml

PLAY [all] *********************************************************************

TASK [Gathering Facts] *********************************************************
ok: [localhost]

TASK [Download kernel source] **************************************************
changed: [localhost]

PLAY RECAP *********************************************************************
localhost                  : ok=2    changed=1    unreachable=0    failed=0

Success!
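If the same script might run in an image where the UID already resolves, you may want to skip the append in that case. This is a rough sketch of a guarded version. The ensure_passwd_entry name and its argument order are my own invention, not something from GitLab or OpenShift:

```shell
# ensure_passwd_entry NAME UID GID HOME [FILE]
# Appends a passwd line for UID to FILE (default /etc/passwd)
# unless an entry with that UID already exists.
ensure_passwd_entry() {
    file="${5:-/etc/passwd}"
    if awk -F: -v uid="$2" '$3 == uid { found = 1 } END { exit !found }' "$file"; then
        return 0
    fi
    echo "$1:x:$2:$3:,,,:$4:/bin/bash" >> "$file"
}

# Example call from a CI job script:
#   ensure_passwd_entry tempuser "$(id -u)" "$(id -g)" "$HOME"
```

Calling it twice leaves only one tempuser line in the file.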

Get a /56 from Spectrum using wide-dhcpv6

After writing my last post on my IPv6 woes with my Pixel 3, some readers asked how I’m handling IPv6 on my router lately. I wrote about this previously when Spectrum was Time Warner Cable and I was using Mikrotik network devices.

There is a good post from 2015 on the blog and it still works today.

I am still using that same setup today, but some readers found it difficult to find the post since Time Warner Cable has renamed to Spectrum. Don’t worry – the old post still works. :)

Pixel 3 Wi-Fi drops constantly


We have two Google Pixel phones in our house: a Pixel 2 and a Pixel 3. Both of them drop off our home wireless network regularly. It causes lots of problems with various applications on the phones, especially casting video via Chromecast.

When I first noticed the drops, I was using a pair of wireless access points (APs) from Engenius.

Also, here’s what I knew at the time:

  • Mac and Linux computers had no Wi-Fi issues at all
  • The signal level from both APs was strong
  • Disabling one AP made no improvement
  • Disabling one band (2.4 or 5GHz) on the APs made no improvement
  • Clearing the Bluetooth/Wi-Fi data on the Pixel had no effect
  • Assigning a static IP address on the Pixel made no improvement
  • Using unencrypted SSIDs made no improvement

At this point, I felt strongly that the APs had nothing to do with it. I ordered a new NetGear Orbi mesh router and satellite anyway. The Pixels still dropped off the wireless network even with the new Orbi APs.

Reading logs

I started reading logs from every source I could find:

  • dhcpd logs from my router
  • syslogs from my APs (which forwarded into the router)
  • output from tcpdump on my router

Several things became apparent after reading the logs:

  • The Wi-Fi drops usually occurred every 30-60 seconds
  • The DHCP server received requests for a new IP address after every drop
  • None of the network traffic from the phones was being blocked at the router
  • The logs from the APs showed the phone disconnecting itself from the network; the APs were not forcing the phones off the network

All of the wireless and routing systems in my house seemed to point to a problem in the phones themselves. They were voluntarily dropping from the network without being bumped off by APs or the router.

Getting logs from the phone

It was time to get some logs from the phone itself. That would require connecting the phone via USB to a computer and enabling USB debugging on the phone.

First, I downloaded the Android SDK. The full studio release isn’t needed – scroll down and find the Command line tools only section. Unzip the download and find the tools/bin/sdkmanager executable. Run it like this:

# Fedora 29 systems may need to choose the older Java version for sdkmanager
# to run properly.
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.201.b09-2.fc29.x86_64/jre
# Install the android-28 platform tools
./sdkmanager "platform-tools" "platforms;android-28"

Now we need to enable USB debugging on the phone itself. Be sure to disable USB debugging when you are done! Follow these steps:

  1. Go into the phone’s settings and choose About Phone from the bottom of the list.
  2. Scroll to the bottom and tap the Build number section repeatedly until a message appears saying that you are now a developer.
  3. Go back one screen and tap System.
  4. Tap Advanced to show the additional options and tap Developer Options.
  5. In the Debugging section, tap USB Debugging to enable USB debugging.

Connect the phone to your computer via USB and run:

sudo platform-tools/adb logcat

Your screen will fill with logs from your phone.

Nuggets in the log

I watched the logs and waited for the Wi-Fi to drop. As soon as it dropped, I saw some interesting log messages:

I wpa_supplicant: wlan0: CTRL-EVENT-AVOID-FREQ ranges=5785-5825
I chatty  : uid=1000(system) IpClient.wlan0 expire 3 lines
I chatty  : uid=1000 system_server expire 1 line
D CommandListener: Setting iface cfg
E cnss-daemon: wlan_service_update_sys_param: unable to open /proc/sys/net/ipv4/tcp_use_userconfig
I chatty  : uid=1000(system) android.fg expire 1 line
I wpa_supplicant: wlan0: CTRL-EVENT-DISCONNECTED bssid=88:dc:96:4a:b6:75 reason=3 locally_generated=1
I chatty  : uid=10025 com.google.android.gms.persistent expire 7 lines
V NativeCrypto: Read error: ssl=0x7b349e2d08: I/O error during system call, Software caused connection abort
V NativeCrypto: Write error: ssl=0x7b349e2d08: I/O error during system call, Broken pipe
V NativeCrypto: Write error: ssl=0x7b349e2d08: I/O error during system call, Broken pipe
V NativeCrypto: SSL shutdown failed: ssl=0x7b349e2d08: I/O error during system call, Success
D ConnectivityService: reportNetworkConnectivity(158, false) by 10025

The line with CTRL-EVENT-AVOID-FREQ isn’t relevant because it’s simply a hint to the wireless drivers to avoid certain frequencies not used in the USA. The CTRL-EVENT-DISCONNECTED shows where wpa_supplicant received the disconnection message. The last line with ConnectivityService was very interesting. Something in the phone believes there is a network connectivity issue. That could be why the Pixel is hopping off the wireless network.

From there, I decided to examine only the ConnectivityService logs:

sudo platform-tools/adb logcat 'ConnectivityService:* *:S'

This logcat line tells adb that I want all logs from all log levels about the ConnectivityService, but all of the other logs should be silenced. I started seeing some interesting details:

D ConnectivityService: NetworkAgentInfo [WIFI () - 148] validation failed
D ConnectivityService: Switching to new default network: NetworkAgentInfo{ ni{[type: MOBILE[LTE]...
D ConnectivityService: Sending DISCONNECTED broadcast for type 1 NetworkAgentInfo [WIFI () - 148] isDefaultNetwork=true
D ConnectivityService: Sending CONNECTED broadcast for type 0 NetworkAgentInfo [MOBILE (LTE) - 100] isDefaultNetwork=true
D ConnectivityService: handleNetworkUnvalidated NetworkAgentInfo [WIFI () - 148] ...

Wait, what is this “validation failed” message? The Pixel was making network connections successfully the entire time as shown by tcpdump. This is part of Android’s network connectivity checks for various networks.
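To see how often validation fails, you can dump the log to a file and count the matching lines. This is only a sketch. The logcat.txt file name and the count_validation_failures helper are assumptions for illustration, and -d is the logcat flag that dumps the current log and exits:

```shell
# count_validation_failures FILE
# Prints how many ConnectivityService lines in FILE report a failed validation.
count_validation_failures() {
    grep -c 'ConnectivityService: .*validation failed' "$1"
}

# Example:
#   sudo platform-tools/adb logcat -d > logcat.txt
#   count_validation_failures logcat.txt
```

Comparing that count against the number of Wi-Fi drops you observed is one way to check whether every drop follows a failed validation.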

The last few connections just before the disconnect were to connectivitycheck.gstatic.com (based on tcpdump logs) and that’s Google’s way of verifying that the wireless network is usable and that there are no captive portals. I connected to it from my desktop on IPv4 and IPv6 to verify:

$ curl -4 -i https://connectivitycheck.gstatic.com/generate_204
HTTP/2 204
date: Sun, 17 Mar 2019 15:00:30 GMT
alt-svc: quic=":443"; ma=2592000; v="46,44,43,39"
$ curl -6 -i https://connectivitycheck.gstatic.com/generate_204
HTTP/2 204
date: Sun, 17 Mar 2019 15:00:30 GMT
alt-svc: quic=":443"; ma=2592000; v="46,44,43,39"

Everything looked fine.

Heading to Google

After a bunch of searching on Google, I kept finding posts talking about disabling IPv6 to fix the Wi-Fi drop issues. I shrugged it off and kept searching. Finally, I decided to disable IPv6 and see if that helped.

I stopped radvd on the router, disabled Wi-Fi on the phone, and then re-enabled it. As I watched, the phone stayed on the wireless network for two minutes. Three minutes. Ten minutes. There were no drops.

At this point, this is still an unsolved mystery for me. Disabling IPv6 is a terrible idea, but it keeps my phones online. I plan to put the phones on their own VLAN without IPv6 so I can still keep IPv6 addresses for my other computers, but this is not a good long-term fix. If anyone has any input on why this helps and how I can get IPv6 re-enabled, please let me know!

Update 2019-03-18

Several readers wanted to see what was happening just before the Wi-Fi drop, so here’s a small snippet from tcpdump:

07:26:06.736900 IP6 2607:f8b0:4000:80d::2003.443 > phone.41310: Flags [F.], seq 3863, ack 511, win 114, options [nop,nop,TS val 1288800272 ecr 66501414], length 0
07:26:06.743101 IP6 2607:f8b0:4000:80d::2003.443 > phone.41312: Flags [F.], seq 3864, ack 511, win 114, options [nop,nop,TS val 1778536228 ecr 66501414], length 0
07:26:06.765444 IP6 phone.41312 > 2607:f8b0:4000:80d::2003.443: Flags [R], seq 4183481455, win 0, length 0
07:26:06.765454 IP6 phone.41310 > 2607:f8b0:4000:80d::2003.443: Flags [R], seq 3279990707, win 0, length 0
07:26:07.487180 IP6 2607:f8b0:4000:80d::2003.443 > phone.41316: Flags [F.], seq 3863, ack 511, win 114, options [nop,nop,TS val 4145292968 ecr 66501639], length 0
07:26:07.537080 IP6 phone.41316 > 2607:f8b0:4000:80d::2003.443: Flags [R], seq 4188442452, win 0, length 0

That IPv6 address is at a Google PoP in Dallas, TX:

$ host 2607:f8b0:4000:80d::2003
3.0.0.2.0.0.0.0.0.0.0.0.0.0.0.0.d.0.8.0.0.0.0.4.0.b.8.f.7.0.6.2.ip6.arpa domain name pointer dfw06s49-in-x03.1e100.net.

I haven’t been able to intercept the traffic via man-in-the-middle since Google’s certificate checks are very strict. However, checks from my own computer work without an issue:

$ curl -ki "https://[2607:f8b0:4000:80d::2003]/generate_204"
HTTP/2 204
date: Mon, 18 Mar 2019 12:35:18 GMT
alt-svc: quic=":443"; ma=2592000; v="46,44,43,39"

Stop audio pops on Intel HD Audio


I recently picked up a Dell Optiplex 7060 and I’m using it as my main workstation now. The Fedora installation was easy, but I noticed a variety of “pop” or clicking sounds when audio played, especially terminal bells.

If everything was quiet and I triggered a terminal bell, I would hear a loud pop just before the terminal bell sound. However, if I played music and then triggered a terminal bell, the pop was gone.

A quick Google search told me that the likely culprit was power saving settings on my Intel HD Audio chipset:

$ lspci | grep Audio
00:1f.3 Audio device: Intel Corporation Cannon Lake PCH cAVS (rev 10)

Fixing it

There’s a handy power saving tunable available at /sys/module/snd_hda_intel/parameters/power_save that can be used to adjust the timeout or disable power savings entirely. In my case, the delay was set to one second.

$ cat /sys/module/snd_hda_intel/parameters/power_save
1

That would be good for a laptop use case, but my workstation is always plugged in. I disabled it by setting it to zero:

# echo 0 > /sys/module/snd_hda_intel/parameters/power_save
$ cat /sys/module/snd_hda_intel/parameters/power_save
0

And the pops are gone! My Klipsch speakers have a built-in amplifier, and it was likely the abrupt changes in current that were causing the popping noises.

This setting will last until you reboot. You can make it permanent by adding this text to /etc/modprobe.d/audio_disable_powersave.conf:

options snd_hda_intel power_save=0

If you’re a laptop user and you want power savings but fewer pops, you could increase the delay to a more acceptable number. For example, setting it to 60 would mean that the card will power down after 60 seconds of silence. Just remember that you’ll get a nice pop when the 60 seconds has passed and a new sound is played.
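For a longer delay, the same two steps apply with a different number. Here is a small sketch that builds the modprobe.d line for a given timeout and rejects anything that is not a whole number. The power_save_option function name is hypothetical:

```shell
# power_save_option TIMEOUT
# Prints the modprobe.d line for a timeout in seconds (0 disables power saving).
power_save_option() {
    case "$1" in
        ''|*[!0-9]*)
            echo "timeout must be a non-negative integer" >&2
            return 1
            ;;
    esac
    echo "options snd_hda_intel power_save=$1"
}

# Example: apply 60 seconds now and keep it across reboots.
#   echo 60 | sudo tee /sys/module/snd_hda_intel/parameters/power_save
#   power_save_option 60 | sudo tee /etc/modprobe.d/audio_disable_powersave.conf
```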

Learning more

Diving into the kernel code reveals the tunable in sound/pci/hda/hda_intel.c:

static int power_save = CONFIG_SND_HDA_POWER_SAVE_DEFAULT;
module_param(power_save, xint, 0644);
MODULE_PARM_DESC(power_save, "Automatic power-saving timeout "
		 "(in second, 0 = disable).");

The default comes from a kernel config option: CONFIG_SND_HDA_POWER_SAVE_DEFAULT. Most kernel packages on most distributions provide access to the kernel config file that was used to build the kernel originally. It’s often found in /boot (named the same as the kernel version) or it might be available at /proc/config.gz.

For Fedora, the kernel config is provided in /boot whenever a new kernel is installed. I inspected mine and found:

$ grep HDA_POWER_SAVE_DEFAULT /boot/config-4.20.13-200.fc29.x86_64
CONFIG_SND_HDA_POWER_SAVE_DEFAULT=1
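If you check config options often, a small helper can cover both locations mentioned above. This is a sketch, not a standard tool. The kconfig_value name is made up, and it assumes the /boot file is named config-<kernel version>, as it is on Fedora:

```shell
# kconfig_value OPTION [FILE]
# Prints the value of OPTION from a kernel config file.
# FILE defaults to /boot/config-<running kernel>; files ending in .gz
# (for example /proc/config.gz) are decompressed on the fly.
kconfig_value() {
    file="${2:-/boot/config-$(uname -r)}"
    case "$file" in
        *.gz) gzip -dc "$file" ;;
        *)    cat "$file" ;;
    esac | sed -n "s/^$1=//p"
}

# Examples:
#   kconfig_value CONFIG_SND_HDA_POWER_SAVE_DEFAULT
#   kconfig_value CONFIG_SND_HDA_POWER_SAVE_DEFAULT /proc/config.gz
```

Options that are not set appear as comments in the config file, so the helper prints nothing for them.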

The power_save setting is applied in sound/pci/hda/hda_codec.c:

/**
 * snd_hda_set_power_save - reprogram autosuspend for the given delay
 * @bus: HD-audio bus
 * @delay: autosuspend delay in msec, 0 = off
 *
 * Synchronize the runtime PM autosuspend state from the power_save option.
 */
void snd_hda_set_power_save(struct hda_bus *bus, int delay)
{
	struct hda_codec *c;

	list_for_each_codec(c, bus)
		codec_set_power_save(c, delay);
}
EXPORT_SYMBOL_GPL(snd_hda_set_power_save);

We can look where codec_set_power_save is defined in the same file to learn more:

#ifdef CONFIG_PM
static void codec_set_power_save(struct hda_codec *codec, int delay)
{
	struct device *dev = hda_codec_dev(codec);

	if (delay == 0 && codec->auto_runtime_pm)
		delay = 3000;

	if (delay > 0) {
		pm_runtime_set_autosuspend_delay(dev, delay);
		pm_runtime_use_autosuspend(dev);
		pm_runtime_allow(dev);
		if (!pm_runtime_suspended(dev))
			pm_runtime_mark_last_busy(dev);
	} else {
		pm_runtime_dont_use_autosuspend(dev);
		pm_runtime_forbid(dev);
	}
}

The #ifdef CONFIG_PM guard means this function is only built when the kernel has power management support. Inside the function, the first check handles the case where we disabled power saving (delay == 0) but there’s a discrete graphics card involved (codec->auto_runtime_pm). In that case the delay is forced to 3000 so that autosuspend stays enabled. This check is important because the discrete graphics card cannot power down unless the HDA card suspends at the same time.

Next, there’s a check to see if the delay is greater than 0. This would be the case if CONFIG_SND_HDA_POWER_SAVE_DEFAULT was set to 1 (Fedora’s default). If so, the proper auto suspend delays are set.

If the delay is 0, then autosuspend is disabled and removed from power management entirely. This is the option I chose and it’s working great.
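The branches are easier to follow with a few concrete inputs. The sketch below is a shell model of the same decision, not kernel code. The power_save_action name and its output strings are made up, and the delay is in milliseconds as the doc comment on snd_hda_set_power_save says:

```shell
# power_save_action DELAY_MS AUTO_RUNTIME_PM
# Mirrors the branches of codec_set_power_save():
#   DELAY_MS         autosuspend delay in milliseconds (0 = off)
#   AUTO_RUNTIME_PM  1 if the codec sets auto_runtime_pm, otherwise 0
power_save_action() {
    delay="$1"
    # Power saving was disabled, but the codec still needs runtime PM.
    if [ "$delay" -eq 0 ] && [ "$2" -eq 1 ]; then
        delay=3000
    fi
    if [ "$delay" -gt 0 ]; then
        echo "autosuspend after ${delay} ms"
    else
        echo "autosuspend disabled"
    fi
}

power_save_action 0 0      # power saving off, no auto_runtime_pm
power_save_action 0 1      # power saving off, but auto_runtime_pm is set
power_save_action 1000 0   # a one second delay
```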

Photo source: Max Pixel

Automatic floating windows in i3


The i3 window manager is a fast window manager that helps you keep all of your applications in the right place. It automatically tiles windows and can manage those tiles across multiple virtual desktops.

However, there are certain applications that I really prefer in a floating window. Floating windows do not get tiled and they can easily be dragged around with your mouse. They’re the type of windows you expect to see on other non-tiling desktops such as GNOME or KDE.

Convert a window to floating temporarily

If you have an existing window that you prefer to float, select that window and press Mod + Shift + Space bar. The window will pop up in front of the tiled windows and you can easily move it with your mouse.

Depending on your configuration, you may be able to resize it by grabbing a corner of the window with your mouse. You can also assign a key combination for resizing in your i3 configuration file (usually ~/.config/i3/config):

# resize window (you can also use the mouse for that)
mode "resize" {
        bindsym Left resize shrink width 10 px or 10 ppt
        bindsym Down resize grow height 10 px or 10 ppt
        bindsym Up resize shrink height 10 px or 10 ppt
        bindsym Right resize grow width 10 px or 10 ppt
        bindsym Return mode "default"
        bindsym Escape mode "default"
        bindsym $mod+r mode "default"
}
bindsym $mod+r mode "resize"

With this configuration, simply press Mod + r and use the arrow keys to grow or shrink the window’s borders.

Always float certain windows

For those windows that you always want to be floating no matter what, i3 has a solution for that, too. Just tell i3 how to identify your windows and ensure floating enable appears in the i3 config:

for_window [window_role="About"] floating enable
for_window [class="vlc"] floating enable
for_window [title="Authy"] floating enable

In the example above, I have a few windows always set to be floating:

  • [window_role="About"] - Any of the “About” windows in various applications that are normally opened by Help -> About.
  • [class="vlc"] - The VLC media player can be a good one to float if you need to stuff it away in a corner.
  • [title="Authy"] - Authy’s Chrome extension looks downright silly as a tiled window.

Any time these windows are spawned, they will automatically appear as floating windows. You can always switch them back to tiled manually by pressing Mod + Shift + Space bar.

Identifying windows

Identifying windows in the way that i3 cares about can be challenging. Knowing when to use window_role or class for a window isn’t very intuitive. Fortunately, there’s a great script from an archived i3 FAQ thread that makes this easy:

Download this script to your system, make it executable (chmod +x i3-get-window-criteria), and run it. As soon as you do that, a plus (+) icon will replace your normal mouse cursor. Click on the window you care about and look for the output in your terminal where you ran the i3-get-window-criteria script.

On my system, clicking on a terminator terminal window gives me:

[class="Terminator" id=37748743 instance="terminator" title="major@indium:~"]

If I wanted to float all terminator windows, I could add this to my i3 configuration file:

for_window [class="Terminator"] floating enable
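If you only need the class and instance, xprop can provide them without the script. The sketch below is my own helper, not the script from the FAQ thread. It converts the WM_CLASS line that xprop prints into i3 criteria, where the first quoted string is the instance and the second is the class:

```shell
# wm_class_to_criteria LINE
# Turns an xprop WM_CLASS line into i3 criteria.
# Prints nothing if LINE is not in the expected format.
wm_class_to_criteria() {
    printf '%s\n' "$1" |
        sed -n 's/^WM_CLASS(STRING) = "\(.*\)", "\(.*\)"$/[class="\2" instance="\1"]/p'
}

# Example: run this, then click on the window you care about.
#   wm_class_to_criteria "$(xprop WM_CLASS)"
```

The class value is usually the one you want for a for_window rule, since it tends to stay the same across all windows of an application.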

Float in a specific workspace

Do you need a window to always float on a specific workspace? i3 can do that, too!

Let’s go back to the example with VLC. Let’s consider that we have a really nice 4K display where we always want to watch movies and that’s where workspace 2 lives. We can tell i3 to always float the VLC window on workspace 2 with this configuration:

set $ws1 "1: main"
set $ws2 "2: 4kdisplay"
for_window [class="vlc"] floating enable
for_window [class="vlc"] move to workspace $ws2

Restart i3 to pick up the new changes (usually Mod + Shift + R) and start VLC. It should appear on workspace 2 as a floating window!

Photo source