Introducing Rustica: SSH Certs via Yubikeys

Introduction

I had a problem with SSH authentication. I have my own infrastructure that I need to access remotely using SSH and when it was just a couple servers in my closet, a password was fine.

Then I added a third, and fourth server.

Now I really should be using keys, but then I need to update all the hosts, and where am I going to put the key to keep it secure? The best option would be to encrypt the key and keep it in ~/.ssh. But then I still need to enter a password, so the benefit didn’t feel like it was there (even though it is).

Add a NAS. Then an offsite NAS.

This is really getting out of hand.

Cloud infrastructure has joined the fight.

Status: Untenable.

Oh The Problems You’ll Face

So do I just bite the bullet and use SSH keys as above? I could use OpenPGP and a Yubikey, but that feels gross to me and adds a dependence on OpenPGP.

However, it turns out SSH now supports smart cards for authentication. You can load a key onto one, set some flags, and it just works! I experimented with this but quickly ran into showstoppers.

  1. I’m already using the authentication slot on my Yubikey for code signing, so I’d need a separate Yubikey. Even though the Yubikey has 20 (!) extra slots for keys, SSH doesn’t see them.

  2. Even if I weren’t, I might want to use them for mTLS certificates in the future meaning I’d have to erase my SSH keys, or just kick the key security can down the road to when I get there.

I wasn’t too broken up about this though because this solution didn’t solve my other biggest problem: centralized logging.

Maybe I’m just bad at log parsing but looking at the sshd logs gives me two problems:

  1. The log format is complex, and seems to change significantly for slightly different connections. For example, a bad password looks very different from a bad key. The full context also appears to be spread across multiple lines.

  2. Every host will need to report these logs, meaning every host needs telegraf. Any change to logging has to be rolled out everywhere or logging breaks. I know this is basically textbook “Use chef” but I’m not there, yet.

A New Challenger: SSH Certificates

SSH certificates are great, but they aren’t generally used in home deployments because certificate features like expiration, serial numbers, principals, and allowed extensions aren’t generally important to smaller deployments and just add complexity.

Also, while certificates don’t solve the key security issue, they do solve the logging issue if I use ultra-short-lived certificates: every connection needs a fresh certificate, so every connection shows up at the issuing server. Issuing a certificate for 2-5 seconds also makes the window for abuse very small. It is even smaller if I can issue the certificate to be valid only for a certain server.
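
To make the "valid for a few seconds" idea concrete, here is a minimal sketch of a validity window. SSH certificates carry a "valid after" and a "valid before" timestamp; the `Validity` struct and the `short_lived` and `is_valid_at` helpers below are hypothetical names used for illustration, not part of `sshcerts` or Rustica.

```rust
/// Validity window of a certificate, in seconds since the Unix epoch.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Validity {
    valid_after: u64,
    valid_before: u64,
}

/// Build a window that opens at `now` and closes `ttl_secs` later.
fn short_lived(now: u64, ttl_secs: u64) -> Validity {
    Validity {
        valid_after: now,
        // saturating_add avoids an overflow panic on absurd inputs.
        valid_before: now.saturating_add(ttl_secs),
    }
}

/// Assumed rule: usable when valid_after <= t < valid_before.
fn is_valid_at(v: Validity, t: u64) -> bool {
    v.valid_after <= t && t < v.valid_before
}
```

With a five second window issued at time 1000, the certificate would be accepted at 1004 and rejected at 1005. A window this short presumably also requires the client, CA, and host clocks to agree closely.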

Looking more into SSH certificates I now had the following issues:

  1. I need to write my own SSH certificate authority. There is decent tooling for Go and Python, but not Rust.

  2. It solves the logging problem, ish. A certificate has a list of authorized principals (users you can log in as) but not hosts. So I need a way to enforce that.

  3. Now I need to secure a CA key *and* the local user key. If the CA key is compromised, it’s even worse than stealing a user’s key because it can issue certificates for *any* user.

Even though it sounds like this is even worse than just using chef, these are programming issues, not systems issues, so I was willing to continue on.

First Up: SSH Certificates in Rust

In Rust, though, there isn’t much SSH certificate tooling. The closest was `sshkeys`, a library that supports parsing SSH certificates but doesn’t validate them. It also doesn’t create new certificates.

Since I was going to be making such core changes to the library, I forked and rewrote most of it. I kept a lot of the parsing code, but added a new private key module, certificate verification and signing, dozens of tests, an example that emulates `ssh-keygen -Lf`, among other things. The result was a new library I called `sshcerts`.

Now we can generate certificates.

Next: Key Security

Key security is one of the whole reasons I’m here so I won’t settle for anything less than keys being protected by some sort of secure enclave.

Now, the Secure Enclave sounds like a great idea. It’s such a great idea, in fact, that people have done it. SEKey (also written in Rust) does this and it definitely worked…at some point. It seems the required entitlements to access the Secure Enclave changed and broke the build for Catalina and above, so it didn’t work for me. If I’d spent more time on it, I probably could have made it work, but I stopped because it doesn’t have a feature I wanted even more: Attestation.

It’s 10pm. Do You Know Where Your Keys Are?

If you’re managing SSH at an enterprise, when someone registers a new key it’s useful to know where that key resides, as it helps determine what access it should have. A key generated in a `/tmp` folder should probably be less trusted than one in an HSM.

A Yubikey solves this with the Attestation (0xf9) slot. Using a certificate pair in this slot, you can generate an attestation proving a key was generated inside the Yubikey (it will not generate for imported keys). Every Yubikey also comes with a certificate signed by a Yubico Attestation Root CA meaning if you already deploy Yubikeys, this system is already available to you. As an added bonus, this also contains information about the Yubikey itself like the firmware version, form factor, and serial as well as information about the key’s usage policy: touch and pin requirements.

Now I can tie SSH keys to a Yubikey serial number, useful for cross referencing Yubikey serials (and now SSH keys) with deployment data.

Admittedly, this is not an issue I face with my personal infra as I’m the only one who uses it. But it was so fun to write I created a simple web service around showing this data and being able to toggle SSH permissions through it.

copyeditor.png

I’m An Agent of Security

I don’t know about you, but I can be lazy. If I have to remember to jump through security hoops every time I want to SSH into a server, I’m going to get frustrated/not do it. Plus those servers may report my failed login, setting off alarms.

I want this system to be as easy as possible. Really I want it to look like this:

Note: The animation above is a lot slower than in real life. Really it’s pretty instantaneous with the longest part being that you need to physically tap the Yubikey.


Then what I want to see in the backend is this:

grafana.png

The simplest way to do this, I think, is with an SSH agent. An SSH agent is software that holds keys on your behalf and provides signatures using those keys via a Unix socket. You can see this if you run `ssh-agent` at your command line:


SSH_AUTH_SOCK=/tmp/ssh-FcNcQs3GYGP8/agent.1883623; export SSH_AUTH_SOCK;
SSH_AGENT_PID=1883624; export SSH_AGENT_PID;
echo Agent pid 1883624;

This starts a new instance of `ssh-agent` and prints some shell exports telling SSH where to find the socket. This is why it’s generally run through `eval`, so these get added to your shell environment automatically.

SSH Agents support a number of different calls but we just need two to make this work: listing identities and signing data.

The process is pretty simple: when SSH opens a connection, we receive the `list-identities` request via our Unix socket. We take this opportunity to grab a new certificate from the server (which will only be valid for a couple of seconds) and return it to SSH. If that certificate is accepted, we will receive a `sign-data` request, which we sign with the Yubikey, and away we go.
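
As a rough illustration of what the agent sees on the socket, here is a sketch of decoding one request. In the SSH agent protocol each message is a 4-byte big-endian length followed by a one-byte type; type 11 asks for identities and type 13 asks for a signature. The `Request` enum and the function names are hypothetical, and this is not Rustica Agent's actual code.

```rust
// Message numbers from the SSH agent protocol.
const SSH_AGENT_FAILURE: u8 = 5;
const SSH_AGENTC_REQUEST_IDENTITIES: u8 = 11;
const SSH_AGENTC_SIGN_REQUEST: u8 = 13;

/// Hypothetical decoded request type for illustration.
#[derive(Debug, PartialEq)]
enum Request<'a> {
    ListIdentities,
    /// Payload still holds the key blob, the data to sign, and flags.
    Sign(&'a [u8]),
    Unsupported(u8),
}

/// Split one length-prefixed frame off the front of `buf`.
/// Returns the request and the number of bytes consumed,
/// or None if the frame is incomplete or has no type byte.
fn parse_request<'a>(buf: &'a [u8]) -> Option<(Request<'a>, usize)> {
    if buf.len() < 4 {
        return None;
    }
    let len = u32::from_be_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
    let end = 4usize.checked_add(len)?;
    let body = buf.get(4..end)?;
    let (&msg_type, payload) = body.split_first()?;
    let req = match msg_type {
        SSH_AGENTC_REQUEST_IDENTITIES => Request::ListIdentities,
        SSH_AGENTC_SIGN_REQUEST => Request::Sign(payload),
        other => Request::Unsupported(other),
    };
    Some((req, end))
}

/// Reply for anything this agent does not handle: a one-byte failure message.
fn failure_reply() -> Vec<u8> {
    let mut out = 1u32.to_be_bytes().to_vec();
    out.push(SSH_AGENT_FAILURE);
    out
}
```

An agent that only needs the two calls above could answer every `Unsupported` request with `failure_reply()` and spend its effort on the other two variants.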

This gives the client the ease of use (the first part) while also logging the usage of the key on the centralized server (the second part).

Then Finally: Securing The CA Keys

Securing the CA keys is the most important part because a compromise of them results in catastrophic system failure. Thus it was non-negotiable that these had to be hardware-backed keys. The only tools I had were Yubikeys (I really didn’t want to pay for a cloud HSM).

Now Yubico’s $500 YubiHSM2 supports SSH certificates and they even provide a tutorial for it. But I don’t have one, and since an SSH certificate is just a signature over the hash of some data, a standard Yubikey will work fine. A bit of tooling later using `sshcerts` and I had a daemon that generated new SSH certificates on demand using a Yubikey slot and could complete an SSH challenge.

How Is It All Put Together?

There are a lot of parts here so here is the final diagram showing how all the parts interact, along with an explanation at each step.

Untitled 2.png

Now this might be more complicated than you were expecting and there is a lot going on here. Here’s the breakdown step by step:

  1. The user initiates a connection to a remote SSH server

  2. SSH contacts the SSH Agent (in this case Rustica Agent) for what identities it should provide to the remote host. SSH does not provide us any information (as far as I can tell) about what host it’s connecting to, just that it wants a list of all our keys.

  3. Rustica Agent checks whether it already has a valid certificate. Assuming it does not, it contacts the Rustica server to request one, sending along the public key we want a certificate for.

  4. Rustica server returns a challenge to ask the agent to prove it has the private portion of the key.

  5. The challenge is signed using the Yubikey.

  6. The challenge, challenge signature, and requested certificate parameters are sent to the server. These contain the users we would like to receive a certificate for and how long we want the certificate to be valid, among other things.

  7. The server decides, based on its own permission model, whether or not to grant the request. The parameters from the user are only requests and do not have to be granted (a user might request a principal the server is not willing to give to that user, and the server may still return others). It might also check that the mTLS identity provided during the connection matches the owner of the public key provided. If the request is allowed to proceed, the server will generate, sign, and return to the client a brand new SSH certificate to be presented to the remote host.

  8. Rustica Agent provides this as the only identity to SSH

  9. SSH forwards this certificate to the remote host

  10. If the remote host accepts that CA, the principal is allowed, the certificate is not expired, and the `source-address` critical option is obeyed or not present, the remote server will then challenge the user to prove they possess the private key as well.

  11. This challenge is forwarded through SSH to Rustica Agent

  12. Rustica Agent signs this new challenge using the Yubikey a second time

  13. The challenge response is given back to SSH

  14. SSH forwards the challenge response to the remote server and the connection is established.

While that may sound like a lot, it only adds two network calls. Most of these steps happen with every SSH connection anyway, so adding another couple isn’t a big deal.
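
Two of the steps above are pure decision logic and can be sketched in a few lines: the server trimming a request down to what its policy allows (step 7), and the remote host deciding whether to accept the certificate (step 10). Every type and function name here is hypothetical, the source address check is simplified to an exact string match, and none of this is Rustica's or OpenSSH's actual code.

```rust
use std::collections::BTreeSet;

/// What the agent asks for (step 6). These are requests, not guarantees.
struct CertRequest<'a> {
    principals: Vec<&'a str>,
    ttl_secs: u64,
}

/// What the server's own permission model allows for this key.
struct Policy<'a> {
    allowed_principals: BTreeSet<&'a str>,
    max_ttl_secs: u64,
}

/// The parameters the server is actually willing to sign.
#[derive(Debug, PartialEq)]
struct Grant<'a> {
    principals: Vec<&'a str>,
    ttl_secs: u64,
}

/// Step 7: keep only permitted principals and cap the lifetime.
/// Returns None when nothing requested is allowed.
fn decide<'a>(req: &CertRequest<'a>, policy: &Policy<'a>) -> Option<Grant<'a>> {
    let principals: Vec<&'a str> = req
        .principals
        .iter()
        .copied()
        .filter(|p| policy.allowed_principals.contains(p))
        .collect();
    if principals.is_empty() {
        return None;
    }
    Some(Grant {
        principals,
        ttl_secs: req.ttl_secs.min(policy.max_ttl_secs),
    })
}

/// The certificate fields the remote host looks at in step 10.
struct PresentedCert<'a> {
    ca: &'a str,
    principals: Vec<&'a str>,
    valid_after: u64,
    valid_before: u64,
    /// Simplified source restriction; None means the option is absent.
    source_address: Option<&'a str>,
}

/// Step 10: every condition must hold before the host issues its challenge.
fn host_accepts(cert: &PresentedCert, trusted_ca: &str, user: &str, now: u64, client_ip: &str) -> bool {
    cert.ca == trusted_ca
        && cert.principals.iter().any(|p| *p == user)
        && cert.valid_after <= now
        && now < cert.valid_before
        && cert.source_address.map_or(true, |ip| ip == client_ip)
}
```

Under these assumptions, a request for `root` and `obelisk` with a one hour lifetime against a policy that allows only `obelisk` for five seconds would come back as a five second certificate for `obelisk` alone.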

Conclusion

I’ve rolled this out across my infrastructure and it’s working great. Having SSH usage logged centrally is great, and since my infrastructure is relatively small, setting up Slack alerts for logins was a breeze.

One thing I wish Yubikeys had was a signature counter that could be checked via the attestation certificate. Then I could build alarms for that number incrementing unexpectedly (both for the user and the server).

I will continue to improve, refactor, and add features to Rustica. Below is a list of topics that I just haven’t had time to cover but are already implemented. If you’re interested in hearing more about them, leave a comment, or better still message me on Keybase @obelisk.

But Wait There’s More!

I think this post is long enough but here are things that this system implements that I didn’t talk about:

  • Key registration: how users can remotely add new keys with attestation metadata

  • Host certificates: Rustica also supports host certificates and my servers rotate them every minute. Useful for alarms.

  • External Authorization: Rustica probably isn’t the central point of authentication for your deployment so you can have it defer to another service and have that external service tell Rustica what it should do.

  • Extensions: Allow or disallow certain features of SSH

  • Critical Options: How to have an SSH key that can only run a single command or only come from a specified IP address.

  • Host restrictions: How Rustica can allow a principal of root to login to one server but not another using a bash script baked into the certificate.

  • PassportControl: My macOS GUI tool for managing Rustica Agent.

  • Different levels of key security: Rustica Agent currently supports automatic signing keys or keys that require tap. This data is propagated to the backend via the attestation certificate.

  • Grafana Integration: Rustica logs to InfluxDB allowing you to easily build graphs and alerts on SSH usage.

Bypassing Disabled Location Services in macOS Mojave and Catalina

Introduction

When I was a core maintainer of osquery (back when it was under the control of Facebook), I was the only one fully focused on macOS things. We each had different areas we tended to operate in and macOS was mine. So when Mojave came out and we started getting reports that the wifi_survey table had started crashing it was up to me to investigate. This is the story of what I found.

User Intent and macOS

To understand why this is even a problem to start with, you need to understand Apple’s philosophy on security, privacy, and user intent (or my interpretation of it anyway). Apple products aspire to be simple and do what the user expects, even if that might be counterintuitive to developers. You see this with the photo picker on iOS, where you’re expected to request a single user-selected picture for use (like Snapchat), not the photo roll, unless you depend on that functionality (like Lightroom). Requesting a single picture requires no entitlements, while photo roll access does, along with a purpose string. This brings user data exposure down from all pictures to a single user-selected picture, a significant privacy win. This is Apple solving a problem many users didn’t even know they had. The Location Services issue is similar in that way.

Divergence of Understanding

macOS High Sierra had a privacy settings panel in System Preferences that allowed and denied apps permission for several things (ironically one of which was Facebook integrations) and I think was a good step forward for user privacy.
Location Services being toggled off entirely is where we start to see a developer’s understanding diverge from a typical user’s.

Developer: Disabling Location Services makes the CoreLocation Framework unavailable or restricted.

End User: Disabling Location Services makes apps unable to physically locate me.

Apple seems quite passionate about addressing these divergences and wants to remove as many as possible. What we’ll be talking about here is using a wireless scan to get pinpoint location accuracy as an unprivileged user, even when Location Services is disabled.

Getting accurate location from wireless scans is extremely easy, and Apple was one of the first companies (that I know of) to bring this to the public’s attention when they launched location abilities on the iPod Touch. These devices lacked GPS, and users were concerned when accurate pins started showing up in the Google Maps app. End users didn’t (and still don’t) understand the connection between wifi and its impact on their privacy, so Apple took a step towards fixing this in macOS Mojave.

The Mojave Fix

The osquery crashes in Mojave were caused by the wifi_survey table when it tried to generate data for a running query. Tracing the program with LLDB made the problem clear: the BSSID field was gone. The data returned from CoreWLAN is an NSDictionary, and osquery threw an exception and crashed upon accessing the now non-existent “bssid” key, which it had always used before. Adding an existence check before this access fixed osquery and it was patched that day.
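
The shape of that fix can be shown with a small analogy. The real table is Objective-C++ working on an NSDictionary; the sketch below uses a Rust `HashMap` as a stand-in, and `ScanResult` and `bssid_column` are hypothetical names, not osquery code.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one scan result.
type ScanResult = HashMap<String, String>;

/// Return the BSSID if the key is present, or an empty value if not.
fn bssid_column(network: &ScanResult) -> String {
    // Indexing with `network["bssid"]` would panic when the key is missing.
    // `get` returns an Option, so a missing key is handled explicitly.
    network.get("bssid").cloned().unwrap_or_default()
}
```

The point is only that a key which used to be guaranteed has to be treated as optional once the platform starts withholding it.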

But the BSSID field being gone is strange so we should try to discover why. It could be a mistake (this was Mojave Beta 1), it could be that it was moved to another API (splitting non essential functionality out for performance or space), or it could be a privacy change (BSSIDs can locate a user). There is no documentation at this point so we’ll need to look into Apple’s code to determine what happened.

The scanForNetworksWithName function in CoreWLAN seems like a reasonable place to start.

This falls through to a larger function, scanForNetworksWithChannels where a lot more happens and a clearer picture starts to form. First notice the XPC references indicating scanning likely happens in another process which returns the data to us.

Further down, blockBSSIDAccess, which takes no parameters and returns an integer, seems promising. When it returns anything other than 0x0, the BSSID keys are removed from the returned NSDictionary. Whatever it checks, it seems likely this was the cause of osquery’s issue. Disassembling the function shows it has a critical mistake: it operates on data it treats as trusted when it isn’t.

What that long line amounts to is if any condition is true, return false, meaning don’t block BSSIDs. So when does this return false?

  • When the bundle identifier is prefixed with com.apple. OR

  • When the process name is coreautomationd OR

  • When the process name is WirelessStress OR

  • When the process name is wifivelocityd OR

  • When Location Services are enabled.
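
The list above can be modelled as a single predicate. This is a sketch reconstructed from the bullet points, not a translation of the disassembly; the `Caller` struct and its field names are hypothetical.

```rust
/// Facts about the calling process. Everything except
/// `location_services_enabled` is chosen by whoever built and named the binary.
struct Caller<'a> {
    bundle_id: &'a str,
    process_name: &'a str,
    location_services_enabled: bool,
}

/// Model of the Mojave check: true means BSSIDs get stripped.
fn block_bssid_access(c: &Caller) -> bool {
    const EXEMPT_PROCESSES: [&str; 3] = ["coreautomationd", "WirelessStress", "wifivelocityd"];
    let exempt = c.bundle_id.starts_with("com.apple.")
        || EXEMPT_PROCESSES.iter().any(|p| *p == c.process_name)
        || c.location_services_enabled;
    // Any single exemption is enough to leave the BSSIDs in place.
    !exempt
}
```

With Location Services off, a caller reporting `com.company.something` is blocked, while the same caller reporting a `com.apple.` prefix or one of the three process names is not.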

Why Mojave Didn’t Fix The Problem

There are two separate problems here, one is a coding problem and one is a structural problem.

The Structural Problem

Stripping BSSIDs in this function will never work. This is our own process (the XPC call has already returned) and the data is already in our memory space. We could even make the XPC call explicitly and have BSSIDs returned to us directly, without using CoreWLAN. This data needs to be filtered during the XPC call, before it ever reaches our process.

The Coding Problem

We control all of these fields. If we name our process any of those three, or have a bundle identifier prefixed with com.apple., we would pass that check and have BSSIDs unstripped when the function returns.

The Proof of Concept

Objective-C has a feature called method swizzling, which allows you to easily swap out function implementations at run time with another function of the same prototype. To bypass this check entirely, all anyone has to do is swizzle the mainBundle function to return com.apple.something instead of com.company.something, and BSSIDs would be populated even if Location Services was disabled. There is a patched version of wifi_survey.mm that does this, which you can compile into osquery to try for yourself on a Mojave system.

Reporting to Apple

I reported this to Apple the second day the Mojave beta was available, June 6th 2018. To date, it has not been fixed in Mojave. Perhaps they are busy, perhaps this is not high priority for them but I do think a year is a bit excessive.
However, in June 2019, Apple released the first beta of 10.15 Catalina, and guess what? They fixed it!

Except they didn’t.

Why Catalina Didn’t Fix The Problem

Remember, there were two problems and either one would allow you to bypass this protection. They fixed the structural problem, leaving the coding problem mostly intact. If you look at CoreWLAN in Catalina you will not see any calls to blockBSSIDAccess in scanForNetworksWithChannels, yet BSSIDs are still blocked. This functionality now happens in locationServicesBlockBSSIDAccessForXPCConnection.

Except this doesn’t get called from anywhere in CoreWLAN, so we need to go looking elsewhere. Running `nm` over many Apple libraries and binaries, I found that the function is called by /usr/libexec/airportd. It’s unclear why they kept this function in CoreWLAN, however. It’s possible something I didn’t find also calls it, or will call it (in a future version of macOS Catalina), because I would expect this to have been moved to airportd.

But now we understand the flow and this call happens during the XPC call, never giving us the BSSIDs to steal (fixing the structural problem). Except here’s the relevant part of the disassembly:

So when does this return false? When:

  • You have the entitlement: com.apple.wifi.bypass-location-services OR

  • You have the entitlement: com.apple.wifi.priority.internal OR

  • Your last path component is WirelessStress OR

  • Location Services are enabled AND

    • Your last path component is SystemUIServer OR

  • locationServicesAuthorizedForPID returns true for your PID

So renaming your binary to WirelessStress bypasses this new check, allowing you to retrieve BSSIDs like you could in High Sierra, or Mojave with the previous exploit.
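
The Catalina conditions can be modelled the same way. The grouping below is an assumption: it reads the list as "Location Services enabled AND (SystemUIServer OR authorized for the PID)", which is one plausible reading of the indentation and not something confirmed against the disassembly. The `XpcCaller` struct and its fields are hypothetical.

```rust
/// What the checking side knows about the process on the other end of the XPC connection.
struct XpcCaller<'a> {
    entitlements: Vec<&'a str>,
    /// Last path component of the calling binary, which the caller picks by naming the file.
    last_path_component: &'a str,
    location_services_enabled: bool,
    /// Stand-in for the result of locationServicesAuthorizedForPID.
    authorized_for_pid: bool,
}

/// Model of the Catalina check: true means BSSIDs are withheld.
fn blocks_bssid_for_connection(c: &XpcCaller) -> bool {
    let entitled = c.entitlements.iter().any(|e| {
        *e == "com.apple.wifi.bypass-location-services"
            || *e == "com.apple.wifi.priority.internal"
    });
    let name_exempt = c.last_path_component == "WirelessStress";
    // Assumed grouping of the last three bullets.
    let location_ok = c.location_services_enabled
        && (c.last_path_component == "SystemUIServer" || c.authorized_for_pid);
    !(entitled || name_exempt || location_ok)
}
```

In this model an unentitled, unauthorized binary is blocked under its own name and let through once it is called `WirelessStress`, even with Location Services disabled.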

Conclusion

Apple fixed the harder of the two bugs in Catalina, the code is now structured correctly and only a single check is bad. I’m not sure how they knew to remove the package check but the last path component one was left in. Perhaps they thought that binaries cannot be named in collision with Apple binaries? More likely maybe they didn’t have time to give WirelessStress the correct entitlements before launch. If that’s the case, we should see this fixed before the end of the Catalina beta period so time will tell. Most likely I think they only fixed the exact thing I reported (without looking for any similar issues) and I never mentioned WirelessStress in my bug report.

The good news is that this is a very easy fix for Apple now, removing that one check should fully patch this vulnerability and then I’ll have to look for something else.

The Final Update

Astute readers can see above that there is still a bug present in the code: renaming your binary SystemUIServer will allow you to collect BSSIDs as long as location services are enabled. This is now fixed in Catalina 10.15.1 so I believe this bug is finally closed.