AWS and GCP Horror Stories. Bad Reasons Good Companies Died.

Oct 19, 2023

Nobody plans to leave AWS, but as they say, “sh*** happens.”

As engineers, when we write software, we’re taught to keep it nimble by never depending directly on external systems. When it comes to the cloud… *crickets*. Companies have died because they needed to move off AWS but couldn’t do it in a reasonable and cost-effective timeline.

It’s not rocket science why we do this. It’s simple: there are no alternatives. But maybe there are. I’ll explain why you should take cloud-nimble architecture seriously and then show you what I do to keep my projects cloud-nimble.

Cloud Service Optimization

The first reason to value the ability to switch clouds and cloud services is that it lets you pick the service that’s price- and performance-optimized for your use case.

When I first got into serverless, I thought GCP’s Cloud Functions service was better than AWS Lambda. I don’t recall why I thought that, but I built a consumer-facing API on GCP Cloud Functions.

Can you guess what happened?

It was a horrible mistake. My API had an insane latency problem. Cold starts added at least 2 seconds to a request. The AWS team has worked hard to build a service that can do things that GCP’s Cloud Functions simply can’t, specifically around cold starts and latency.

I had to move my infrastructure to a different service.

This happened again at Teamflow, a virtual office startup (think Slack meets Zoom). We built our API on GCP’s App Engine. A virtual office naturally has hyper-spiky workloads. At 7:59 AM PT, zero people are using your product. Three minutes later, at 8:02 AM PT, over 90% of your customers are online. Because App Engine couldn’t scale fast enough, those customers were banging the refresh button in frustration as the app failed to load.

I spent three weeks, with two other engineers, modifying our code and infrastructure to move it to GCP’s Cloud Run service, which was much easier to configure for our hyper-spiky scaling requirements.

Avoiding The Killswitch

AWS, GCP, Azure, and every other cloud provider reserve the right to ‘unalive’ your account and destroy your infrastructure anytime they want. If you’re thinking, “I’m not pot, porn, or gambling, so that would never happen to me,” you are wrong.

I recently spoke with a founder who told me a nightmare story. His team was using GCP’s Cloud Run, a container service, to host their API. They had a unique use case that required them to call back to their own API to kick off more work. It turns out GCP monitors for this type of behavior and flags it as crypto mining. One sunny day, their infrastructure was gone, and their account was locked. They spent the next week working ~18 hours a day to move to AWS.

Maybe you are building a pot, porn, gambling, crypto mining, political, or firearm-related application. I don’t judge. I’m just telling you to think ahead and build like your cloud provider might kick you off their platform tomorrow.

Unlocking Free Credits

If you’re a pre-seed or seed-stage startup, free cloud credits can be the hand of mercy holding fire back from consuming your runway. Cloud credits are easy enough to get, but most people build their infrastructure as though they’ve made a lifelong blood pact with their cloud provider, so the only credits they can ever use are the ones that single provider gave them.

Could you move from AWS to GCP when your credits are up?

If you’re a Y Combinator startup, you can get $150K in free credits on AWS and about $200K on GCP. If you’re building AI, Azure will give you $300K in free credits to run your models on their cloud.

The ability to move your infrastructure from one cloud to another could save you an immediate $200K. Why would you ever build in a way that locked you into a single cloud?

Maximizing Redundancy

If Silicon Valley Bank can fail, so can Amazon. It may just be an outage, but the only thing certain about life and companies is that everything dies. Your job is to keep yours alive as long as possible.

How much money would your company lose during a 12-hour outage on AWS?

True disaster recovery is the ability to move from AWS to GCP. True redundancy is to run segments of your platform on AWS and GCP in parallel at all times.
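To make the idea of running in parallel concrete, here is a minimal sketch of request-level failover between two clouds. It assumes the same endpoint is already deployed on both providers; `call_with_failover`, `fetch_from_aws`, and `fetch_from_gcp` are hypothetical names invented for this illustration, and the outage is simulated rather than real.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllProvidersFailed(Exception):
    """Raised when every configured provider raised an error."""


def call_with_failover(
    providers: Sequence[tuple[str, Callable[[], T]]],
) -> tuple[str, T]:
    """Try each (name, call) pair in order and return the first success."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call()
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for the same endpoint deployed on two clouds.
def fetch_from_aws() -> str:
    raise ConnectionError("simulated regional outage")


def fetch_from_gcp() -> str:
    return "ok"


served_by, result = call_with_failover(
    [("aws", fetch_from_aws), ("gcp", fetch_from_gcp)]
)
```

In this run the simulated AWS call fails and the GCP call answers. The pattern only works if the second deployment already exists and is kept current, which is the point of running both at all times.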

Cloud Cost Negotiation

The ultimate move in any negotiation is to walk away. If the other side knows you can’t, it means their hands are wrapped firmly around a very specific organ of yours — whatever that is for you. Cost negotiations with cloud providers are no different.

Suppose you can’t leave AWS because you’ve built $100M worth of infrastructure on their platform. Suppose you’ve tightly coupled that infrastructure to their APIs (like S3, Cognito, and SQS). You can’t walk away. You’ll have to accept whatever price they name.
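One way to loosen that coupling is to make application code depend on a small interface of your own rather than on a provider SDK. The sketch below is illustrative only: `ObjectStore`, `InMemoryStore`, and `save_invoice` are hypothetical names, and the in-memory adapter stands in for a real one that would wrap S3 or Google Cloud Storage behind the same two methods.

```python
from typing import Protocol


class ObjectStore(Protocol):
    """Hypothetical port: the only storage surface the application sees."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in adapter; a real one would wrap a cloud storage client."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def save_invoice(store: ObjectStore, invoice_id: str, pdf: bytes) -> str:
    """Application code: depends on the port, never on a provider SDK."""
    key = f"invoices/{invoice_id}.pdf"
    store.put(key, pdf)
    return key


store = InMemoryStore()
key = save_invoice(store, "inv-001", b"%PDF-demo")
```

With this shape, switching providers means writing one new adapter class; `save_invoice` and everything else that calls the port stays unchanged.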

The ability to negotiate costs might not seem like a very big deal. If you’re a small company, it’s not. If you really did have $100M of infrastructure, then getting AWS to give you a 3% discount is substantial.

A Very Bad No Good Day

We don’t know what we don’t know. There are millions of permutations left unseen by the human eye. You never know what will happen or why. Here’s a crazy, impossible scenario: Google wants to acquire your company, but your infrastructure runs on AWS.

Meet Matt. This is his story. He did the founder’s hustle and the grind to bring a company to life from nothing. Google approached and started acquisition talks. It was a dream come true until the conversation turned to the topic of their AWS infrastructure. Google, as you might expect, can’t acquire companies that run on AWS, and Matt’s team couldn’t move their platform to GCP in a time- and cost-effective way. In the end, Google walked away, and Matt lost the exit.

You never know what can happen. Having the ability to quickly and easily switch between cloud providers might save your company or just save your company a few million dollars.

© Exobase 2023