How to find and delete idle GCP Projects
A constant source of pain in Google Cloud Platform (GCP), and everywhere else, is the amount of unmaintained resources: idle virtual machines, old buckets, IAM policies, DNS records and so on. They contribute to the attack surface, and the chance of a vulnerability increases with time.
Shutting off resources is such low-hanging fruit from a risk perspective that, as a security engineer, you should make it a daily habit.
After all, the most secure computer is the one that’s been turned off!
How to find the cruft
The bigger and more complex a cloud infrastructure becomes, the harder it gets to find unmaintained stuff.
Having an inventory system in place, as early as possible, would prevent so many headaches but even the most enlightened leadership will have a hard time justifying the investment.
Eventually the problem will outgrow security and spill into other areas such as cloud spending (gasp!): that’s when everyone will start talking about inventories, accountability and resource lifecycle.
Until then, how do we find things to kill?
Start with the Projects. The GCP model encourages segmenting the infrastructure’s logical areas into Projects, and a lot of audit facilities are aggregated at that level. (Obviously Projects will also silently introduce cost multipliers such as VPCs, but we will leave this for another rant.)
There are three sources one can query:
- Activity Logs
- Billing Reports
- IAM Policies
Use these three and you can build your own personal heuristic that will answer the question: can I kill this?
Events that change the state and configuration of cloud services are collected in the Admin Activity and System Event audit logs.
While both track configuration changes, only the Admin Activity log tracks manual changes driven by direct user action: creation of resources, changes to IAM policies, etc.
The retention is ~400 days, so I would check how frequent these log entries are to understand whether services have been configured recently.
Usually an active project implies an active administrator.
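As a rough sketch of that check, the Python below builds a Cloud Logging filter for a project's Admin Activity log and works out how many days have passed since the newest entry. The helper names are mine, and I'm assuming the timestamps are the UTC, `Z`-suffixed strings you get when exporting log entries as JSON (for example with `gcloud logging read`).

```python
from datetime import datetime, timezone
from typing import Iterable, Optional


def admin_activity_filter(project_id: str) -> str:
    """Build a Cloud Logging filter that selects only Admin Activity entries."""
    return f'logName="projects/{project_id}/logs/cloudaudit.googleapis.com%2Factivity"'


def parse_utc(timestamp: str) -> datetime:
    """Parse a UTC timestamp, ignoring fractional seconds (assumes a 'Z' suffix)."""
    return datetime.strptime(timestamp[:19], "%Y-%m-%dT%H:%M:%S").replace(
        tzinfo=timezone.utc
    )


def days_since_last_activity(
    timestamps: Iterable[str], now: datetime
) -> Optional[int]:
    """Return whole days since the newest entry, or None if there are no entries."""
    parsed = [parse_utc(ts) for ts in timestamps]
    if not parsed:
        return None  # nothing at all within the retention window
    return (now - max(parsed)).days
```

A `None` result is the strongest signal here: nobody has touched the configuration for as long as the log remembers.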
We can query the billing account(s) to get a per project cost/usage report.
If the cost graph is flat, that could be an indicator that the project is idling. In contrast, plotting an active project’s cost will result in a bumpy curve, as buckets fill up, logs are generated and resources are added and removed over time.
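One way to put a number on "flat versus bumpy" is the coefficient of variation of the daily costs: the standard deviation divided by the mean. This is only a sketch of my own heuristic; the function name and the 5% threshold are assumptions you should tune against projects you already know to be active or idle.

```python
from statistics import mean, pstdev
from typing import Sequence


def cost_is_flat(daily_costs: Sequence[float], max_cv: float = 0.05) -> bool:
    """Return True when daily costs barely vary around their mean."""
    if len(daily_costs) < 2:
        return False  # not enough data to call it either way
    average = mean(daily_costs)
    if average == 0:
        return True  # nothing is being billed at all
    # Coefficient of variation: spread of the costs relative to their mean.
    return pstdev(daily_costs) / abs(average) <= max_cv
```

For example, `cost_is_flat([10.0, 10.1, 9.9, 10.0])` is true, while a series like `[5, 20, 8, 40]` is not.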
It’s worth keeping in mind that we can also get usage reports for Compute Engine services. It’s mostly a data point about the lifecycle of resources rather than their usage - but can still contribute to our killer algorithm.
Nobody knows a project better than its owner, so we can’t go wrong if we ask politely. The problem is finding that person.
The solution is to scrape the IAM policy.
I’d start by searching role bindings for Owner, Editor and Viewer, as they are the basic roles in GCP.
If we are lucky we will get a Group or a User’s email.
If we get a Service Account (SA) we can investigate it. When an SA is bound to a basic role, 99% of the time it’s been created in another project. So we can recursively scrape that project’s IAM policy and keep going until we find a human.
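Here is a minimal sketch of that recursion. It works on policies shaped like the JSON printed by `gcloud projects get-iam-policy --format=json`. The `get_policy` callable is a placeholder you would have to supply yourself, and I'm assuming user-managed service accounts, whose emails look like `NAME@PROJECT_ID.iam.gserviceaccount.com`; default service accounts use other formats and are simply skipped here.

```python
from typing import Callable, Dict, Optional, Set

BASIC_ROLES = {"roles/owner", "roles/editor", "roles/viewer"}
SA_SUFFIX = ".iam.gserviceaccount.com"


def home_project(member: str) -> Optional[str]:
    """Return the project ID embedded in a user-managed service account member."""
    kind, _, email = member.partition(":")
    if kind != "serviceAccount" or not email.endswith(SA_SUFFIX):
        return None
    domain = email.split("@", 1)[-1]
    return domain[: -len(SA_SUFFIX)]


def find_humans(
    project_id: str,
    get_policy: Callable[[str], Dict],
    seen: Optional[Set[str]] = None,
) -> Set[str]:
    """Collect users and groups holding basic roles, following SAs to their home project."""
    seen = set() if seen is None else seen
    if project_id in seen:
        return set()  # already visited: avoid looping between projects
    seen.add(project_id)
    humans: Set[str] = set()
    for binding in get_policy(project_id).get("bindings", []):
        if binding.get("role") not in BASIC_ROLES:
            continue
        for member in binding.get("members", []):
            if member.startswith(("user:", "group:")):
                humans.add(member)
                continue
            origin = home_project(member)
            if origin is not None:
                humans |= find_humans(origin, get_policy, seen)
    return humans
```

In real life `get_policy` will sometimes fail because you lack permissions on the other project, which is itself a hint about who to ask next.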
Look ma, no influencing skills
There are two things I learned the hard way as a security artisan:
- Do not kill stuff without asking first
- Do not flood people with alerts
As such, my cruft-hunting algorithm is called can_I_MAYBE_kill_this() and works like this:
- I combine billing reports and admin activity logs to figure out if the project has been idle for a while. I want to rule out projects that are obviously active.
- I scrape the IAM policy and find potential owners.
- I send them an email asking who the technical contact for the project is, because I need to talk to them about a security situation. The combination of asking for someone accountable and mentioning security usually triggers a game of hot potato that ends with the project killed.
- If I get no answer, I nudge them that I will delete the project in X days. This is the part where I self-reflect on how I ended up threatening good people for a living, and I manually check the project again to find more evidence.
- I delete the project.
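The steps above can be sketched as a small decision function. This is an illustration rather than the script I actually run: the `ProjectEvidence` fields, the returned strings and the default thresholds (90 idle days, a 14-day grace period standing in for "X days") are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProjectEvidence:
    days_since_admin_activity: Optional[int]  # None: no entries within retention
    cost_is_flat: bool
    owners: List[str] = field(default_factory=list)
    owner_replied: bool = False
    days_since_warning: Optional[int] = None  # None: no deletion warning sent yet


def can_I_MAYBE_kill_this(
    ev: ProjectEvidence, idle_days: int = 90, grace_days: int = 14
) -> str:
    """Return the next step for a project; never deletes anything by itself."""
    recently_touched = (
        ev.days_since_admin_activity is not None
        and ev.days_since_admin_activity < idle_days
    )
    if recently_touched or not ev.cost_is_flat:
        return "leave alone"  # obviously active
    if not ev.owners:
        return "keep digging for an owner"
    if ev.owner_replied:
        return "talk to the owner"
    if ev.days_since_warning is None:
        return "email owners, then warn of deletion"
    if ev.days_since_warning < grace_days:
        return "wait"
    return "re-check manually, then delete"
```

Note that the function only ever suggests a next step: the final manual check and the deletion itself stay with a human.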
Note that deleting a project triggers a soft deletion: you have 30 days to change your mind before resources are actually decommissioned (although Cloud Storage resources get decommissioned faster, usually in about a week).
The takeaway is that finding and shutting down idle cloud resources is not straightforward and can’t be solved with a cron job.
Shut down other people’s resources at your own risk and peril: be nice, ask, nudge and implore them to take care of their things.
Keep track of every time you have to track down owners. Make accountability and resource lifecycle a chapter of your threat model and build a case to lobby for an inventory system.
If you have someone in charge of keeping track of spending, go and talk to them: change is never introduced in isolation, and there isn’t anything better than mixing cost savings and security to get budget attention.