Fast and Secure Cloud Delivery in Regulated Industries

I've been working on AWS infrastructure platform teams for more than five years now. A general problem I keep coming back to is how to balance security, freedom, and productivity while enabling teams to efficiently build and deploy in the cloud without having to know everything. This is an even bigger problem in regulated industries, where it isn't enough to be secure and reliable; we also have to prove it to someone else. So what do we do about it?

Before the word "guardrail" shows up, let me say what all of this is actually for, because it changes how you should read the rest. Everything that follows is in service of speed. The guardrails, the pipelines, the policy as code, none of it is the point. The point is to let teams move fast on top of a floor they can't fall through, so they don't have to look down. If you're a developer expecting an explanation of why you should be allowed to do less, that's not where this goes. You get to keep clicking around in dev, you stop having to know everything to ship safely, and you get a real path to change the rules you don't like. It's mostly about letting you do more, safely.

The two ways to fail

The promise of public cloud was that we can provision infrastructure and managed resources within minutes. Yet many organizations' first instinct was and is to hide every resource creation behind the ticket system of a highly specialized team, which results in weeks-long turnarounds. The reasoning is that if every resource creation follows a process that is reviewed and approved, the chances of a security misconfiguration are much lower.

This puts auditors and security professionals at ease, but it succeeds mostly at making the central cloud team the bottleneck for innovation. It gets worse because a Cloud Center of Excellence (CoE) is typically seen as a cost center, since it isn't directly connected to revenue-producing products. So it doesn't automatically get the budget to grow. Instead, it lags behind the complaints of other teams, perpetually trying to prove to management that it needs more headcount. In the end, it becomes a scaling problem. And when the official path is too hard, shadow IT becomes the quick solution, and in non-regulated industries, it might even get management's approval.

When the paved route takes people the long way around, they wear their own trail across the grass. That is what shadow IT is, a desire path. You can keep building fences and watch people climb them, or you can notice where the trail forms and pave it.

So what's the opposite of the ticket queue? We allow teams the freedom to make their own resources. But how do we make sure it's done in a secure and reliable way that satisfies auditors and security officers? Locking everything down and leaving everything open are just two different ways to fail.

This is where the platform and guardrails come in. Most of what teams want to build is fine. Creating a database on its own isn't necessarily insecure or problematic. However, if it's not encrypted or it is publicly accessible, then that's a big problem. So, within the platform, we need guardrails that disallow anything that's obviously unsafe and stop its creation in the first place. You don't inspect for the bad configuration after the fact; you make it impossible to build.

Pave the path

The compromise that AWS offered, and that many platform teams in regulated industries arrived at independently, was to provide packaged, opinionated products that provision a particular resource for a particular problem. AWS Service Catalog, which layers a set of permissions around packaged CloudFormation, solved the problem partially, but the CoE team would still be the bottleneck in producing those products.

In other companies, where teams had full autonomy over what to deploy, the development teams were instead handed infrastructure as code: templates, modules, building blocks, a golden path, or a paved road of least resistance, so the secure and reliable way comes easily. By making it easier for teams to create secure resources, we get fewer insecure ones.

The catch is that this only works while the paved road stays genuinely the easiest one. Most people end up with whatever they get by default, so the win comes from making the safe building block the default, not from mandating it.

There are a few issues in practice. The platform or enablement team needs to keep up with the other teams and keep providing them with building blocks. Some teams will refuse them and build their own. And naturally, it's never greenfield, so there are always existing teams that aren't using the secure blocks. Getting them across is its own program of work, the kind that only clears at scale when you run it as a deliberate migration rather than a memo: you prove the value on one willing team, then let that success pull the others along.

Then comes the question the CISO and auditors will ask: What stops a team from overriding the building blocks' secure defaults with something insecure as a quick fix? This is where the blocks have to be put in their proper place. They are the convenience layer, not the security boundary. The blocks make the secure path easy, but they are not what makes the insecure path impossible. That is the job of the next two pieces.

IAM is the door, policy as code is the cargo

Going back, the proposal was to disallow unsafe configurations. How do we achieve this in reality within AWS, so that any team can create resources but only with a secure configuration, and insecure ones are blocked?

AWS IAM conditions can help us there for some configurations, but not for most. The coverage is small, it's hard to test and roll out across a whole organization, and you run into the character limits on policies. More fundamentally, IAM conditions only allow or deny one action at a time, and they don't see the overall context or the desired state. The clearest case is when multiple actions are each safe on their own, but together they result in insecure infrastructure. Picture one call that creates an S3 bucket and a second call that attaches a bucket policy. Each action is perfectly normal, and IAM, looking at them one at a time, finds nothing wrong. It's only the resulting state, a bucket made world-readable by that policy, that is the problem, and that resulting state is exactly what IAM's request-context evaluation can't see. AWS IAM is great, but just not the right tool here.
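To make the bucket example concrete, here is a toy Python model of the difference between checking requests one at a time and checking the resulting state. It is a sketch, not how IAM is implemented: `Bucket`, `request_allowed`, and `is_world_readable` are hypothetical names, and the state check ignores policy conditions for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    """Toy stand-in for an S3 bucket: a name plus its bucket policy statements."""
    name: str
    policy_statements: list = field(default_factory=list)


# Hypothetical per-request rule: each API call is judged on its own.
ALLOWED_ACTIONS = {"s3:CreateBucket", "s3:PutBucketPolicy"}


def request_allowed(action: str) -> bool:
    """Per-action view: sees the call, not what the account looks like afterwards."""
    return action in ALLOWED_ACTIONS


def is_world_readable(bucket: Bucket) -> bool:
    """State view: inspects the resulting policy for an anonymous read grant."""
    for stmt in bucket.policy_statements:
        if stmt.get("Effect") != "Allow":
            continue
        if stmt.get("Principal") not in ("*", {"AWS": "*"}):
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if any(a in ("s3:GetObject", "s3:*", "*") for a in actions):
            return True
    return False
```

Both calls pass `request_allowed`, yet a bucket whose policy grants `s3:GetObject` to `*` fails `is_world_readable`. Only the second check looks at the thing we actually care about.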

This is where policy as code can help us greatly. The split I like is this: IAM controls the door, and policy as code controls the cargo. IAM is channel control; it decides who can mutate production, through which path, and under which role. Policy as code is content control; it decides what's actually inside the CloudFormation template, the CDK synth, or the Terraform plan before any of it becomes real. Most of the confusion around enforcement comes from asking the door to inspect the cargo.

Together with the enablement team's golden path building blocks, policy as code shifts security left and lets teams who deviate from the golden path still prove that they are secure. The fair question, whether you're a developer or an auditor, is what stops a team from simply not running the checks. The answer has two halves that have to click together.

The first half is where CloudFormation hooks shine. They run after a template is submitted and before CloudFormation performs any mutating action, and you deploy your policy-as-code evaluations as part of them, for example with CloudFormation Guard. If teams then use CloudFormation, or anything that builds on top of it, like CDK, we have achieved the ideal: teams deploy any allowed resource, but only with allowed configurations.
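Guard has its own rule language, so the following is only a Python stand-in for the kind of check such a hook would run against a template. It assumes the template is already parsed into a dict, does not resolve intrinsic functions such as `Ref`, and treats a missing property as a violation. The function name `find_violations` is hypothetical.

```python
def find_violations(template: dict) -> list:
    """Return human-readable violations for RDS instances in a parsed template."""
    violations = []
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") != "AWS::RDS::DBInstance":
            continue
        props = resource.get("Properties", {})
        # Assumption of this sketch: encryption must be switched on explicitly.
        if props.get("StorageEncrypted") is not True:
            violations.append(f"{logical_id}: StorageEncrypted must be true")
        # Assumption of this sketch: public access must be switched off explicitly.
        if props.get("PubliclyAccessible") is not False:
            violations.append(f"{logical_id}: PubliclyAccessible must be false")
    return violations
```

An empty list means the template clears the check; anything else is a reason to stop before a resource is created.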

Channel control: making the checks impossible to skip

The second half is the channel, because hooks only matter if teams can't deploy around them, and that is IAM's job again. Of course, CloudFormation isn't all there is. Terraform is also very popular, and since Terraform usually runs as a CLI rather than as a service, there is no hook equivalent baked into the engine. Tools like OPA, Conftest, or Sentinel can evaluate a Terraform plan, and a managed offering like HCP Terraform can enforce policy server-side, but for the teams that self-host the CLI, the only thing that binds the check is control of the pipeline that runs it.

So how do we enforce that teams deploy Terraform with a policy-as-code scanning step they can't disable? The only way forward is secure deployment pipelines built by the platform team. They can either come as a default in each team's environment, or be provided as packaged products, through something like Service Catalog, that teams provision automatically with their own configuration. It can be as simple as a CodeBuild job that accepts a Terraform plan or codebase, runs policy as code, and then deploys it. Conveniently, the CodeBuild StartBuild action has good support for conditions on many of the attributes you'd want to pass around. Something similar can be done in other CI/CD environments.
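As a rough illustration of the scan step in such a job, here is a Python sketch of a gate over the JSON that `terraform show -json` produces for a saved plan. Dedicated tools like OPA or Conftest would normally do this; the sketch only shows the shape of the decision. The function names are hypothetical, it checks just two attributes of `aws_db_instance`, and it ignores values that are only known after apply.

```python
import json


def plan_violations(plan: dict) -> list:
    """Inspect the planned end state of each resource change."""
    violations = []
    for rc in plan.get("resource_changes", []):
        after = rc.get("change", {}).get("after")
        # "after" is null for deletions, so there is nothing to check.
        if rc.get("type") != "aws_db_instance" or after is None:
            continue
        address = rc.get("address", "<unknown>")
        if after.get("storage_encrypted") is not True:
            violations.append(f"{address}: storage_encrypted must be true")
        if after.get("publicly_accessible") is True:
            violations.append(f"{address}: publicly_accessible must not be true")
    return violations


def gate(plan_json: str) -> int:
    """Return a process exit code: 0 lets the deploy step run, 1 stops it."""
    problems = plan_violations(json.loads(plan_json))
    for problem in problems:
        print(problem)
    return 1 if problems else 0
```

The pipeline, not the team's build script, would call `gate` on the plan and refuse to continue to the deploy step on a non-zero result.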

Basically, the idea is this: the development teams define the build steps, optionally starting from platform-provided templates and defaults, but the platform team defines the scan and deploy steps, and the platform doesn't allow deployments any other way. The development team can't modify the scan or deploy steps. Without channel control, policy as code is just advice. With it, policy as code becomes the checkpoint that every production change has to clear.

Figure: Platform delivery model. IAM controls the deployment channel while policy as code checks production changes before deployment. Development stays flexible, with controlled escape hatches and detective controls around production.

Escape hatches: off-road on purpose

This approach might suddenly bring back some PTSD for developers who've worked in environments that only allow IaC deployments through CI, where the feedback loop while building is too long. So let me be clear about the scope, because this is exactly the fear: these restrictions should be only for production environments, and by extension, production-like or pre-prod environments such as staging and acceptance, not for development environments.

The point is to let the two move at different speeds on purpose. In development and sandbox environments, developers should be able to click around in the AWS console and to build, deploy, and run their IaC from their laptops, so debugging and development keep short feedback loops and stay quick and easy. When the workload is ready to go to the next environment, that promotion can be fully automated.

Of course, not everything can be fully automated, nor is full automation always reasonable. Remember that our goal was security and reliability, so actions that can't harm either of those should always be allowed:

Read-only actions that don't allow reading sensitive data are a good start.
Non-read-only actions that don't affect the workload, for example, updating CloudWatch dashboards or running Athena queries, are also safe.
Some operational actions are fine too. An autoscaling group instance refresh doesn't necessarily mean downtime or a security issue if the underlying processes and AMI are done right. An SQS dead-letter-queue redrive is another.
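The examples above could be collected into an always-allowed IAM policy. Here is a sketch of one, built as a Python dict so it can be rendered to JSON. The action names are real IAM actions, but the selection, the `Sid`, and the wildcard `Resource` are assumptions made for illustration. A real policy would scope resources, and Athena queries in particular need further permissions to return results.

```python
import json


def safe_actions_policy() -> dict:
    """Build an illustrative policy for low-risk production actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AlwaysAllowedOperations",  # hypothetical statement id
                "Effect": "Allow",
                "Action": [
                    "cloudwatch:PutDashboard",         # update dashboards
                    "athena:StartQueryExecution",      # run Athena queries
                    "autoscaling:StartInstanceRefresh",  # ASG instance refresh
                    "sqs:StartMessageMoveTask",        # dead-letter-queue redrive
                ],
                "Resource": "*",  # assumption: scope this down in practice
            }
        ],
    }


def render(policy: dict) -> str:
    """Serialize the policy document to JSON text."""
    return json.dumps(policy, indent=2)
```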

Beyond that, the standard operating procedure should still have escape hatches, both for operations and for when things go wrong. The off-road path you design is safer than the one your engineers will improvise at 3 am when you pretend the road covers everything. In regulated industries, this can be a just-in-time privilege escalation with a four-eyes principle: someone fills in a form, gets a second person's approval, and is automatically granted an IAM role for a short time, while that access and what was done with it are logged. There are many solutions that implement this, and it's easy enough to build on your own.
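The core of that flow is small. Here is a minimal sketch in Python of the four-eyes rule and the time limit, with every name (`AccessRequest`, `Grant`, `approve`, `issue_grant`, `is_active`) and the one-hour cap being assumptions of the example. A real implementation would hand out temporary credentials for the role and write to a tamper-resistant log rather than a list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

MAX_DURATION = timedelta(hours=1)  # assumption: cap chosen for illustration


@dataclass
class AccessRequest:
    requester: str
    role_arn: str
    reason: str
    duration: timedelta
    approver: Optional[str] = None


@dataclass(frozen=True)
class Grant:
    requester: str
    approver: str
    role_arn: str
    expires_at: datetime


def approve(request: AccessRequest, approver: str) -> None:
    """Four-eyes rule: the second pair of eyes must belong to someone else."""
    if approver == request.requester:
        raise PermissionError("requester cannot approve their own request")
    request.approver = approver


def issue_grant(request: AccessRequest, now: datetime, audit_log: list) -> Grant:
    """Issue a time-boxed grant for an approved request and record it."""
    if request.approver is None:
        raise PermissionError("request has not been approved")
    if request.duration > MAX_DURATION:
        raise ValueError("requested duration exceeds the maximum")
    grant = Grant(request.requester, request.approver,
                  request.role_arn, now + request.duration)
    audit_log.append({"event": "grant_issued", "requester": grant.requester,
                      "approver": grant.approver, "role_arn": grant.role_arn,
                      "reason": request.reason, "expires_at": grant.expires_at})
    return grant


def is_active(grant: Grant, now: datetime) -> bool:
    """A grant stops being valid at its expiry time."""
    return now < grant.expires_at
```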

Repeated by hand means ready to automate

The next question is how we prevent misuse of this just-in-time access in production. Detective controls help here, but the thing I see teams and organizations miss is that any repeated manual action on production, whether through just-in-time escalation or not, reveals an operation that can be automated into a secure playbook. That's just the SRE habit of eliminating toil: automate it the second time you do it by hand.

Take upscaling an ECS service. ECS UpdateService can be used to increase the count of tasks, but it also allows decreasing them, which security officers or the organization may deem unsafe without a secondary approval. So we wrap it in an automated playbook that the CISO or the teams can review ahead of time, with logging, prechecks, and validation built in. The development teams should be able to deploy their own playbooks and automations, and the platform or enablement team can provide them too.
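A playbook like that can be very small. The sketch below wraps the ECS calls in a Python function that only ever scales up. It assumes an injected client with the same `describe_services` and `update_service` methods as the boto3 ECS client; the function name, the `max_count` ceiling, and the list-based audit log are hypothetical.

```python
def scale_up(ecs, cluster: str, service: str, new_count: int,
             audit_log: list, max_count: int = 20) -> tuple:
    """Increase an ECS service's desired count; refuse anything else."""
    described = ecs.describe_services(cluster=cluster, services=[service])
    services = described.get("services", [])
    if not services:
        raise LookupError(f"service {service!r} not found in cluster {cluster!r}")
    current = services[0]["desiredCount"]

    # Prechecks: this playbook never scales down and respects a ceiling.
    if new_count <= current:
        raise ValueError("this playbook only increases the task count")
    if new_count > max_count:
        raise ValueError("requested count exceeds the reviewed maximum")

    ecs.update_service(cluster=cluster, service=service, desiredCount=new_count)
    audit_log.append({"action": "scale_up", "cluster": cluster,
                      "service": service, "from": current, "to": new_count})
    return current, new_count
```

Because the role that runs the playbook holds the `UpdateService` permission and the engineer only holds permission to run the playbook, the decrease path is simply not reachable through it.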

Every layer has holes

At this point, it's worth saying why all of this defensive layering is even necessary. Before I came to platform teams, I worked as a software engineer in a small company. I understood OWASP, I knew my application, and I constantly thought about how hackers could misuse the API. Back then, I thought the idea of something like a WAF for SQL injection was ridiculous and only for the lazy, because I knew my app couldn't be vulnerable to SQL injection, and I knew a WAF couldn't prevent a complicated attack anyway. What I learned later is that in many industries, a developer making a promise that they "think" they know security and that their app is secure isn't good enough for the government, the regulators, and the consumers whose data is in their hands. We need to not only be secure, but also to show it.

On the other side, we need to think about the vulnerabilities of a system the same way Site Reliability Engineers think about the reliability of a system. We start by accepting that components will fail, so we make sure that one failure doesn't bring down the whole system. In security, we accept that there will be vulnerabilities we don't know about, so we put layers of defensive measures around what needs to be protected. The picture I keep in my head is the Swiss cheese model: every layer has holes, but stack enough of them and the holes don't line up.

Everything in this post has been one of those slices: preventive controls at deploy time. It's a good slice, but it's only one. Not everything goes through the pipeline; drift happens, and the request-context blind spots we saw with IAM have runtime equivalents. There are other slices, runtime and detective controls among them, but this post stops at deploy-time prevention. That is its deliberate edge.

The point was never control

So far, we've proposed infrastructure as code with policy-as-code enforcement, only a limited set of actions in production under normal circumstances, just-in-time audited privilege escalation, and automated playbooks. That sounds like a lot of work for a platform and enablement team, and it is. But notice that none of it was really about control.

That is why the rules cannot be treated as secret stone tablets. The policies and restrictions came from decisions, and those decisions had their own context. Naturally, they can cause friction and block innovation when that context changes. The organization needs a visible way to reevaluate them for new use cases, and depending on the risk, to make changing the restriction as easy as possible.

One policy I personally believe in is keeping code repositories inside the company readable by all relevant stakeholders. Platform code and policies should also be visible to development teams. That lets teams see not only the rule, but the reason the rule exists. It also lets them make pull requests to amend those policies in ways that don't violate their spirit, with reviews built into the process and automated deployments behind them. The process for introducing new AWS services, allowing new actions, or changing existing restrictions should be visible too, and development teams should take part in it.

That is the payoff: the rules are legible and amendable. Restriction without a path to change becomes just another ticket queue. But restriction with a clear path to change becomes a platform interface. It gives teams a way to move faster without pretending the risks aren't real.

The guardrails, the paved road, the locked-down pipelines, the escape hatches, and the visible policy process all buy the same thing: speed, on top of a floor that teams can't fall through, with security and reliability they can prove on demand.
