Startup Infrastructure: No Matter the Choice, You're Still Wrong.

Nov 7, 2023

I recently had someone respond to an article I wrote and say, among other things, that if your team realizes the cloud provider and service you chose aren’t the best fit and you need to switch… it’s your own fault.

…the “oh I’m not getting the performance I need from this provider” is a failure of your engineers to vet the services they are using for your use cases.

To be fair, I like to hold engineers to a high bar, so I don’t completely disagree. However, it’s not reasonable, especially in the context of an early-stage startup, to expect engineers to look into the future and predict through the next two years what will stay the same, what will change, and what those changes will be.

I wrote an article about senior engineers and what it means to be one. The point I make is that the most senior engineers understand one thing: it’s not possible to be absolutely right. Even if you look at all the factors, collect all the data, and understand every facet of the problem, things change. You can’t predict change. It’s all the more true with infrastructure.

Let’s assume for a moment that I’m wrong. Here are the factors, based on my personal experience of what has led to infrastructure changes in my career, that you would have to predict to be right about your infrastructure choices for more than one to two years in an early-stage startup environment:

  1. The Market

  2. The Product

  3. The Team

  4. The Users

  5. The Budget

The Market

Sometimes, as a startup, you have to react to the market. It’s possible that interest rates go up, conflict breaks out in the Middle East, or a regime collapses in South America and — somehow — due to that, you have to modify your product to support a new feature. It sounds crazy, but startups are crazy.

Something as simple as an industry leader acquiring an up-and-comer and then immediately sunsetting their service can drive your users to start screaming, “take my money” if you’ll only support the use case the sunsetting product solved for.

More likely, a competitor modifies their product to do what you’re doing with twice the performance and half the wait time for end-users. Your customers begin leaving your product for the competitor’s because the value of that time saved is too high to ignore.

All of this can lead to infrastructure changes. You might need to use a different cloud service, possibly on another cloud provider, that has lower concurrency restrictions, higher memory limits, lower cost, or better performance for your specific new use case.

The Product

In the search for that holy grail we call product-market fit, it’s not unusual for your product to turn out completely different from what it started as. If it did change, that’s a good thing. If you have the humility to listen to the market and change your product to fit the needs that exist, you’re ahead of the game.

You might start a dating site and end up with a video blogging platform, start an MMORPG and end up with an online instant messaging platform, or start a social-good network and end up with a daily deals app. All this can happen in a year or two.

In any one of those changes, your infrastructure will have to change as well. As a founder, if you believe your initial problem/solution idea will last forever and commit fully, from a technical perspective, to a specific infrastructure provider and service, then you might be just a little bit screwed when you need to make large changes. In this context, screwed means you have to spend excessive amounts of time and money. Some companies, like YouTube, Slack, and Groupon (the examples above), made it work. Do you have the time, budget, and energy to push your luck the way they did?

The Team

Sometimes, at an early-stage startup, you have one person, out of the massive three-person company, who is the sole infrastructure expert. Many teams refer to this as the bus factor, and good teams will seek to spread that expertise around when the bus factor is less than three. But at a startup, there’s little you can do. Now, what happens when the bus comes and takes your one infrastructure expert off to a seven-figure job at Google?

Hiring isn’t easy or cheap. You might find someone with the same infrastructure expertise as the member who was ‘hit’ by the bus. If you can’t, you might need to make changes when your new expert, who has different values and tastes than the previous one, insists things will have to change to continue.

On the other hand, what if a bus pulls up and drops off a new team member? The new team member has ten more years of experience than the team member who initially designed your infrastructure. The new expert says you can get 2x the performance at half the cost if you make significant changes.

Would you spend a week of dev time to cut your monthly cloud cost in half? Would you spend a day?
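One rough way to frame that question is as a payback period: how many months of savings it takes to repay the one-off cost of the migration. The sketch below is only an illustration. The function name and every figure in it (the daily cost of a developer, the monthly cloud bill, the savings fraction) are made-up assumptions, not numbers from any real team.

```python
def payback_months(dev_days, cost_per_dev_day, monthly_cloud_bill, savings_fraction):
    """Months of savings needed to repay a one-off migration effort."""
    migration_cost = dev_days * cost_per_dev_day
    monthly_savings = monthly_cloud_bill * savings_fraction
    if monthly_savings <= 0:
        # No savings means the migration never pays for itself.
        return float("inf")
    return migration_cost / monthly_savings


# Hypothetical inputs: $800 per dev day, $4,000 monthly bill, cost cut in half.
one_week = payback_months(dev_days=5, cost_per_dev_day=800,
                          monthly_cloud_bill=4000, savings_fraction=0.5)
one_day = payback_months(dev_days=1, cost_per_dev_day=800,
                         monthly_cloud_bill=4000, savings_fraction=0.5)

print(f"A week of dev time pays back in {one_week:.1f} months")
print(f"A day of dev time pays back in {one_day:.1f} months")
```

The model ignores opportunity cost, which at an early-stage startup is often the larger number, so treat the result as a lower bound on how long the change takes to pay off.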

The Users

Users: those beautiful, sweet, wonderful, annoying, painful, awful little things that keep us going; they can surprise you, and their behaviors can change. At an early-stage startup, where every $1 of revenue is worth a gallon of blood, sweat, and tears, when customers begin complaining about platform stability, you will do whatever you need to, no matter how drastic, to satisfy them.

I’ve done a lot of platform performance and site reliability work, and when it comes to users, what we’re concerned about is their behavior. As your product grows, their behavior can change. Many times, as you get more users, you start to see distinct behaviors that weren’t clear when you only had one — or none.

Teamflow is an example. It’s a Slack-meets-Zoom company that provides a virtual office product. Users can design a virtual office, sit in it, move through virtual space to a coworker’s office, knock on the door, and get an all-around real-world experience while working remotely.

When the product was just getting started, we chose to use GCP App Engine. At the time, it was the best choice: the most cost-effective and performant service available to us, given our specific expertise. It also allowed us to build rapidly because of its simplicity. As we grew and got more users, we noticed some incredibly spiky user behavior. Every morning, at 7:59 AM, there would be zero users on the platform. Two minutes later, at 8:01 AM, roughly 80–90% of our daily active users were online. App Engine couldn’t keep up with that kind of spiky scaling. For the first hour of every weekday, our platform was nearly unusable. We moved to Cloud Run. It wasn’t easy, but we didn’t have a choice.
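To see why a near-instant spike hurts, here is a toy model of a pool that adds instances at a fixed rate while demand jumps all at once. It is a simplified sketch and does not describe how App Engine or Cloud Run actually scale. The function name and all the numbers (users per instance, ramp rate, user count) are hypothetical.

```python
def minutes_until_capacity(target_users, users_per_instance,
                           start_instances, instances_added_per_minute):
    """Minutes until a pool that grows at a fixed rate can serve target_users."""
    if instances_added_per_minute <= 0:
        raise ValueError("instances_added_per_minute must be positive")
    # Ceiling division: instances needed to serve every user.
    needed = -(-target_users // users_per_instance)
    shortfall = max(0, needed - start_instances)
    # Ceiling division again: whole minutes needed to close the gap.
    return -(-shortfall // instances_added_per_minute)


# Hypothetical spike: 9,000 users arrive within two minutes,
# each instance handles 100 users, and the pool adds 2 instances per minute.
lag = minutes_until_capacity(target_users=9000, users_per_instance=100,
                             start_instances=1, instances_added_per_minute=2)
print(f"Capacity catches up after {lag} minutes")
```

Under these assumed numbers the pool needs 45 minutes to catch up, even though the users showed up in two. Whatever the real figures are, the gap between how fast demand arrives and how fast capacity ramps is what users experience as an unusable platform.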

The Budget

When the runway starts to run out, people start to ask very deep questions about the infrastructure. It’s typically the largest cost after salaries, so it gets the side eye first.

Even if the runway isn’t running out, nobody likes to leave money on the table. I talked to a founder yesterday who has $1M in free credits available across the three major cloud providers (AWS, GCP, and Azure), but his infrastructure is stuck on AWS where he can’t make use of the other credits. As a seed-stage startup, he can’t afford to take a week off to move his infrastructure.

Sometimes, in the beginning, we make sub-par choices because of budget constraints. When the budget increases because your startup is succeeding and growing, you’ll want to take that corner-cutting infrastructure choice you made and sharpen it up. Depending on how it was built, you might be screwed.

Again, screwed means it will cost you an exorbitant amount of time and money to make changes.

Conclusion

If you still think you can make infrastructure choices that are right in the present and will be right in the future… I can’t help you, good luck. If you see what I see, you’re realizing there’s no hope; you can’t predict varying factors moving unpredictably through time. It’s time to consider your ability and capacity to make infrastructure changes.

© Exobase 2023