17 October 2019
DevSecOps — the art of embedding security into the software development lifecycle — is a common and largely underestimated threat vector for many organisations. Software developers can accidentally leak sensitive information, particularly secret keys for third party services, across code hosting platforms such as GitHub, GitLab and BitBucket. These secrets — including the data they were protecting — end up in the hands of bad actors which ultimately leads to significant data breaches. Much like we saw with the Capital One data breach earlier this year, the Canadian banking giant Scotiabank screw-up, and the Uber 2016 data breach.
And finding these secrets across GitHub is nothing new. There are many open-source tools available to help with this depending on which side of the fence you sit. On the adversary side, popular tools such as gitrob and truggleHog focus on digging in to commit history to find secret tokens from specific repositories, users or organisations. Recent research from North Carolina State University found that many of the secrets accidentally committed to GitHub are cleaned up within 24 hours, rendering said tools rather ineffective in practice.
On the defensive side, Amazon AWS labs have a tool called git-secrets that helps prevent committing secrets in the first place. And GitHub themselves are actively scanning for secrets through their token scanning project. Their objective is to identify secret tokens within committed code in real-time and notify the provider who will automatically revoke the token to prevent any abuse. As of 15 October 2019, GitHub have on-boarded 15 providers:
|Alibaba Cloud||Amazon Web Services (AWS)||Atlassian|
So in theory if you accidentally commit AWS secret keys to GitHub, Amazon will be notified and automatically revoke them. But how robust is this process? What if you could do the same but in an adversarial manner? Imagine being able to monitor the entirety of GitHub, GitLab and BitBucket to find any secrets accidentally committed in real time. Well, we're in luck. All three platforms provide a public 'real time firehose' events API (albeit with a few minutes delay in practice), that details various activity streams on the site, including code commits.
Pre-flight checks can be anything from ensuring the repository is below a particular size, has a certain number of stars, or isn't a fork. We then match the filename, path, extension, and the files contents against 120 signatures. How does the tool fair in practice, you ask? Over a period of 48 hours I was able to identify the following secrets:
|Secret Type||Count||Verified||Valid (%)|
|Username and Password in URI||1,351||440||32.5%|
|Google OAuth keys||231||174||75.3%|
|MailGun API keys||194||87||44.8%|
|Slack Webhook URLs||139||62||44.6%|
|-- redacted for brevity|
* indicates no verification took place
And a fuck-ton of sensitive client data. On average, I was finding and verifying secrets within 7 minutes of them being committed. And as you can see from the verified column, around 50% of them were valid meaning I could access the respective service using the captured credentials/keys. This suggests that either GitHub's token scanning isn't quick enough or their patterns aren't matching everything. I suspect it's a bit of both.
And note: I am purely validating secrets in an immutable way. No data was accessed.
To bring this threat to life I wrapped up shhgit in a web front-end. Watching the screen through secret after secret is quite mesmerising!
An unexpected finding...
What I wasn't expecting to find was valid package manager API keys, i.e., npm for Node.js; PyPi for Python; and NuGet for C#. The total number of downloads for these packages is in the millions. And the majority of these keys had publishing permissions. Meaning a bad actor could theoretically embed malicious code into the packages, reupload them without detection, and potentially infect millions of devices. These are just some redacted keys for NuGet packages:
Na na na na Na na na na GITMAN!
If you scroll back up to our table, you'll note that "Username and Password in URI" is by-far the most common found type of secret, i.e.: scheme://username:password@hostname:port. And the majority of these are databases, e.g., PostgreSQL, MongoDB, MySQL, etc. Because of the standardised URI format we can easily and automatically verify the credentials for popular schemes:
And if the connection is successful — and because we're a Good Samaritan — we can automatically raise an issue with the code maintainers on the platform the secret was found.
As I mentioned at the beginning of my post, leaking secrets across public code repositories is not a new threat. It's existed since the launch of GitHub and other services over 10 years ago. And from the recent data breaches the implications are clear: reputational damage and huge fines. But we — software developers, team managers, organisations — should be doing more:
- Ensure secrets don't end up in your code base in the first place. They should be a part of your environment. At a minimum, config files should be encrypted with a environment-based key. The Travis CI docs have a great guide on this.
- Use automated tools such as git-secrets to prevent secrets being committed.
- Provide training - and equally take the initiative to seek out training — on best practices and secure coding standards and guidelines.
- Make sure you are across your vendors who are developing code or apps for you. Ignorance isn't good enough. Ask the right questions.
And hopefully you won't be exclaiming ahh shhgit!.