What happened to OpenStack?

I get asked this question fairly often during interviews, like people think that OpenStack is dead. I make a squishy face before answering. Here’s a hint: OpenStack is not dead, but something happened to it.

OpenStack logo – 2019

The project is still quite healthy and is following the usual hype curve: OpenStack is finally mature enough that it’s less interesting to talk about, so it’s understandable that people think something happened.

The folks asking me this question are generally curious and well-intentioned, and I think my answer says a lot about my professional background, so even during an interview I indulge in a short ramble.

I know, I know … I should have written about it at the time, but I had other things going on and wasn’t making space in my life to write. So I’ve decided to revisit the topic today. Next time someone asks me, I’ll just point them here 😉

We all knew that, for the long-term success of the OpenStack ecosystem, we needed a hyper-scale OpenStack public cloud to succeed, and a healthy set of SaaS applications to be built upon the IaaS/PaaS layers that we had. We knew there needed to be more than one hyperscale implementation to avoid a gravity problem (in the early days, that was Rackspace).

Publicly, lots of folks were talking about this; common headlines compared OpenStack to AWS, and every VC and analyst wanted to know if it was ready for the Enterprise yet. I even sat on a panel to discuss this very question! (There’s a recording of this, but I won’t link it here because I don’t like being reminded of how I used to look.) But over time, this key component of our success never materialized. Why was that?

Back in 2014 I was working at HP Cloud. Money was pouring into OpenStack and we were all riding high on that. This was the year that OpenStack hype peaked, and I will never forget the Carnival at the Paris Summit … très magnifique! From the outside, however, it may not have been obvious that this money was coming from Telcos and Hardware Vendors seeking to influence the direction of OpenStack. Specifically, these companies wanted to create the technological basis for integrated private cloud solutions they could take to market quickly.

Privately, we could see that these market forces were opposed to actually achieving hyperscale. The Big Tent was a bold move to try and address some of these issues. We (that is, the Technical Committee) did what we could, but in the end we couldn’t fix the problem because we didn’t control the purse strings.

You see, almost every company contributing substantially to OpenStack at that time relied on Enterprise customers. That’s a nice way of saying: well-known brands that spend a lot of money on each other every year. Unlike the hyper-scalers, Enterprise customers generally do not have an interest in building their own servers or writing their own switch firmware. They buy these things (and the support contracts for them) from other Enterprise companies that make them, and those companies want to keep making money from these deals.

Meanwhile, distros and operators were struggling to make OpenStack installable so that they could roll out, and maintain, deployments to satisfy the market demand. And the Enterprise demand for new private cloud regions was incredibly high. A lot of internal pressure was being put on the OpenStack developer teams (at the distros, telcos, and hardware vendors) to create software that worked well at the scale these Enterprise customers needed: roughly thousands to tens of thousands of cores, with a small number of regions for geographic resiliency.

However, building a distributed system for a scale of 10,000 cores is fundamentally different from building for 10,000,000 cores. Besides just changing the code, scaling up by another three orders of magnitude requires fundamental changes in how a business operates. The margins would need to become much, much smaller. Gone would be the 3rd-party support contracts, the B2B deals, the value-added software – the grease in the sales teams’ wheels. The enterprise hardware & software giants were not willing to restructure their businesses that much, and the organic growth of companies like OVH was too slow (even though I think they are still on the right track).

Creating a viable, open source, hyperscale cloud software solution was against the best interest of the companies most heavily investing in OpenStack’s development.

In my usual fashion, I tried to draw this on a napkin one day. The original is long since lost, so I’ve redrawn it from memory below.

Influence was flowing from the wrong direction. Product management within the Telco, Hardware, and Distro space — that is, companies like AT&T, Cisco, Dell, and Red Hat — was pouring in money to pay developers to write code and steer the project, funding lavish parties and inflating salaries (in full disclosure, I certainly enjoyed the parties, and while I got paid well, I didn’t make out nearly as well as a lot of folks did). Meanwhile, Operators were falling further and further behind the release train and had little voice (they were busy operating the clouds, after all) … and the end-users of the cloud (app developers) had almost no voice at all.

Eventually, revenues would need to come from the growth of supported businesses on top of the cloud (i.e., app developers’ success), but we couldn’t get there because operators weren’t able to scale up enough, or maintain cross-cloud compatibility well enough, for a healthy market to flourish on top of the cloud. Everyone on the left-hand side of that diagram was too busy trying to differentiate to win deals against each other.

With this going on, the upstream technical team leads — who could see the problem — were lobbying inside our respective companies to change the priorities. We convinced some managers and executives to get onboard with tactical investments in code and infrastructure, but we couldn’t get enough backing for the larger changes that were needed.

As long as the financial influencers were focused on the business models of mid-sized customers (the MSP and Enterprise markets), they were not willing to invest in the massive strategic efforts needed to make OpenStack competitive in the hyperscale market. The vast majority of developers, whose project goals were dictated by PMs and executives within their employers, were thus busy driving features (designed to take advantage of “value-added integrations”, the grease of the Enterprise sales cycle) into the codebase, and no one could fix the core issues that had become apparent to all the deployers busily trying to scale up their clouds.

You see, the successful hyperscalers had already cut out the middlemen and their margins. Google, AWS, even Facebook all make their own servers, switches, storage, etc. They fork open source projects and then pay developers to maintain and improve private versions so that they can stay ahead (some projects are now trying to prevent this). These companies pioneered scale by shaving costs off their infrastructure, but those efficiencies have not made it back to the Enterprise market. They became the behemoths they are, in no small part, by building processes, teams, and hardware to break any reliance on 3rd parties for software, hardware, or support.

So, you see, creating a viable, open source, hyperscale cloud software solution was against the best interest of the companies most heavily investing in OpenStack’s development.

When you’re looking at other cloud products, think about the similar conflicts of interest that might be affecting your favorite spokespersons today… (I’m looking at you, Kubernetes.)

Those of us who joined OpenStack early and dreamed big, well, we had no chance of building those dreams within OpenStack. It became something less hype-worthy but perhaps even more useful: an extensible tool for mid-scale infrastructure automation, powering thousands of businesses around the world, including many non-profit and public-good institutions.

And – just to be clear – I’m very glad to have been able to help in my small way to build a tool that has been used so widely for good.

I am publishing this now in the hope that it can serve as a warning to everyone out there who is investing in Kubernetes. You’ll never be able to run it as effectively, at the same scale, as GKE does – unless you also invest in a holistic change to your organization.

Even then, if you’re using Kubernetes, you probably won’t succeed, because it isn’t in Google’s best interest to let anyone else actually compete with GKE.

And that’s probably OK. In fact, OpenStack + Kubernetes is probably just fine for whatever you’re building, and neither project is going anywhere any time soon.

If another company wants to stand a chance in the hyperscale space, it will need to look holistically at its entire business apparatus, and either repeat the Google/Facebook/Amazon model of privatizing open source and investing in its own server fabrication, or band together with other vertically-focused, open source businesses in order to compete.

And if you’re doing that, please drop me a line. I’d love to help.

Edge of the Clouds

xkcd comic #307: "that's no moon"
“That’s no Edge…”

Lately, I see a lot of folks on Twitter talking about the #Edge of #CloudComputing and arguing “That’s the Edge” and “That’s not the Edge!”…

My first thought was, “Wow, we sure love reusing words and then debating their meaning!”

And then I remembered our discussions at the first #OpenDev conference. The team behind the OSF put this mini-conference together to collectively answer the big question: “What is the Edge?”

The napkin drawings we did in the back of the room on the first day turned into an impromptu talk on the second day (it might have been recorded, but I’m not going to go look for it right now). Fueled by some strong Portland coffee this morning, I decided to write this post and rehash the discussion. The definition we arrived at is still the best I’ve heard in any forum since.

First, a few other definitions, just to make sure we’re on the same page:

  • Dynamic Workload: an application, service, or functional program (whether run on physical, virtual, or containerized infrastructure) that is managed by an orchestration system which dynamically responds to external (user- or device-driven) demands.
  • Cloud: an API-driven consumption model for abstract compute, storage, and networking resources (see the sketch after this list).
  • Connected Device: an internet-connected device whose function is to interact directly with a human, e.g.: cell phone, smart lightbulb, connected speaker, learning thermostat, self-driving car.
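To make the “Cloud” definition concrete, here’s a minimal sketch of API-driven consumption using Python and openstacksdk. The cloud name, image, flavor, and network names are all hypothetical; they’d come from your own clouds.yaml and your provider:

```python
# A minimal sketch of API-driven resource consumption.
# Assumes a cloud named "my-cloud" is defined in clouds.yaml, and that
# the image/flavor/network names below exist at your provider.
import openstack

conn = openstack.connect(cloud="my-cloud")

# Abstract compute: you ask an API for a server; nobody racks a box.
image = conn.compute.find_image("ubuntu-22.04")
flavor = conn.compute.find_flavor("m1.small")
network = conn.network.find_network("private")

server = conn.compute.create_server(
    name="demo",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
)
server = conn.compute.wait_for_server(server)
print(server.status)  # "ACTIVE" once the cloud has done its work
```

The point isn’t the specific client; it’s that compute, storage, and networking are consumed through an API, on demand, rather than procured as physical things.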

I think we can all agree that application workloads have been moving away from traditional colos and managed hosting providers — and into the cloud. I’m not here today to debate whether “cloud” means centralized (AWS, GCE, etc.) or on-prem / in a colo; the point is that application management has become more automated, workload-driven, and centralized.

However, there is now a growing pressure to move “towards the edge” — but what exactly is that?

In a connected world, the Edge (of the Cloud) is that Compute resource which is closest to the data producer and data consumer, where a Dynamic Workload can be run in order to meet user, device, or application demand.

restated from discussions at OpenDev 2017

Latency and bandwidth are key to understanding the move towards the Edge.

The increased bandwidth consumption comes, in part, from sensor networks and IoT devices which inherently generate more data. The latency requirement comes from the situational use of “smart devices” which need to respond more quickly to their environment than a webpage ever did. In short, the Edge is also the result of the increasing prevalence of Augmented Intelligence (AI) & Machine Learning (ML) applications.

Today, companies need faster processing of data streams and applications that are more tolerant of network hiccups; we will soon need the ability to deliver AI-driven responses to environmental changes while being completely disconnected from a traditional data center. Incidentally, this same situation is driving the creation of, and race towards, 5G. (Hint: it’s all connected!)

Allow me to offer a few practical examples.

Imagine a self-driving car, with all its video cameras and sensor networks that require massive AI-driven processing to “see” and “react” to changing traffic conditions. Now imagine the uplink glitches for 100ms while the car transfers between cell towers. Whereas your Facebook-browsing or video-streaming service (assuming you were a passenger in the car) wouldn’t be affected by a tenth-of-a-second latency spike, the autonomous vehicle could be unable to respond to a sudden change in road conditions (e.g., another vehicle swerving) and cause a crash! That’s obviously terrible and should be prevented! To address this, a lot of powerful computing resources need to be put into the car — or the car’s uplink needs to be both blazingly fast and come with a guaranteed 100% uptime. That car needs to become the edge.
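For a sense of scale, here’s the napkin math in Python (the speed and glitch duration are illustrative assumptions, not measurements from any real vehicle):

```python
# How far does the car travel while its uplink is glitching?
# All numbers here are illustrative assumptions.
speed_kmh = 100.0                    # highway speed
speed_mps = speed_kmh * 1000 / 3600  # ≈ 27.8 metres per second
glitch_s = 0.100                     # the 100 ms tower-handoff glitch

blind_distance_m = speed_mps * glitch_s
print(f"≈ {blind_distance_m:.1f} m travelled while 'blind'")  # ≈ 2.8 m
```

Almost three metres of travel with no working uplink is an eternity mid-swerve, which is the whole argument for putting the compute in the car.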

Here’s another example: imagine your excitement at having just installed 20 Hue lightbulbs. They’re all busy streaming sensor data into “the cloud”, and your house is well beautified. You turn on Netflix, but it starts having trouble delivering any HD content because of the increased network traffic from all those smart bulbs. Clearly, you don’t want that, and Philips knew this, so they designed Hue in such a way that you need to buy a Hub which your Hue bulbs connect to. The hub aggregates traffic between your devices and the Cloud. That hub is the edge.
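What a hub like that does is conceptually simple: it batches chatty per-device messages into one periodic upstream summary. Here’s a toy sketch (the reading format, the 60-second window, and the averaging are all invented for illustration; this is not how Hue actually works):

```python
# Toy edge-hub aggregator: many local readings in, one summary out.
from collections import defaultdict

WINDOW_TICKS = 60           # one upstream message per minute
buffer = defaultdict(list)  # bulb_id -> readings seen this window

def on_local_reading(bulb_id: str, lumens: float) -> None:
    """Called for every message a bulb sends over the local network."""
    buffer[bulb_id].append(lumens)

def flush_upstream() -> dict:
    """One aggregate to the cloud instead of thousands of raw readings."""
    summary = {bulb: sum(vals) / len(vals) for bulb, vals in buffer.items()}
    buffer.clear()
    return summary  # in real life: POST this to the vendor's API

# Simulate 20 bulbs reporting once per second for a minute.
for tick in range(WINDOW_TICKS):
    for bulb in range(20):
        on_local_reading(f"bulb-{bulb}", 800.0 + tick % 5)

print(flush_upstream())  # one message upstream instead of 1,200
```

The cloud still gets what it needs, at a fraction of the upstream bandwidth, which is exactly why your Netflix stream stops stuttering.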

These scenarios aren’t oddities; this is the direction that so many tech companies are going. If you hear the buzzwords IoT, Edge, ML, AI … it’s all related to the drive to deploy applications closer to sensors and consumers. To address this, we need abstract compute workloads that can run as close to the data producer & consumer as possible, thus reducing latency and increasing available bandwidth.

(diagram showing images of a data center, small colo, cell tower, and wifi router, along an axis labelled "distance in milliseconds" which decreases from 100ms to 1ms, with a collection of icons on the right edge representing a user and consumer devices)

Today, the Edge is often a rack of hardware installed in a manufacturing plant, connected back to the company’s central colo or up to their cloud services through a VPC. In some cases, it’s a device in your own home.

Cisco is already touting its AI-driven commercial routers, and it won’t be long before this technology reaches consumer devices (if it hasn’t already). This might look like a smart set-top box installed in your house which enables your cable provider to dynamically run microservices to optimize your viewing experience on their content, thus potentially circumventing net neutrality rules.

That doesn’t sound great, does it? Let’s get even darker… what if your ISP could run a microservice on your WiFi router to monitor your in-home device usage and better target advertising to you? … I think that’s a gross invasion of my privacy, but it’s not far off. We already have these companies listening in our homes (Alexa, stop spying on me!), and it stands to reason that telcos want in on that revenue.

But at least you can still run Open Source WiFi firmware (e.g., OpenWrt) 🙂

While the promise of smarter, faster, more ubiquitous, always-connected computing brings a lot of value for companies and users who aren’t concerned about privacy, I’m going to keep working on Open Source projects that keep the playing field level and empower independent users, too.