No Shame, Just Pain: How We Migrated Away From Kubernetes 1.16


This post is based on the talk "No Shame, Just Pain: How We Migrated Away From Kubernetes 1.16" presented by Jannis Relakis and Michael Seiwald-McCarty at KubeCon + CloudNativeCon EU 2026, Amsterdam.

While the cloud-native community was buzzing about WASM and AI-driven operations, we at Celonis were still running Kubernetes 1.16 on more than fifty clusters, across six flavors and three cloud providers. This is the story of how we dug ourselves out of that hole: a multi-year, zero-downtime migration that forced us to confront years of technical debt, align dozens of teams, and rethink how we manage clusters at scale.

This is not a "look how shiny our platform is" story. It's a confession from engineers who sat through KubeCon talks for years, drooling over features we couldn't use. There's no shame here. Just lessons, empathy, and a lot of pain turned into progress.

Recently, we came across a striking statistic: 70% of all large-scale migrations fail. And as the famous XKCD "Standards" comic illustrates — when you have 14 competing standards and decide to consolidate, you tend to end up with 15. We were in a remarkably similar situation. We had six Kubernetes flavors and wanted to consolidate them down to three — one per cloud provider. The fear was that we'd end up with nine. Spoiler: that didn't happen. This post explains how we avoided that outcome, shares the lessons we learned, and documents the mistakes we made so you can avoid them.

https://delivery-p141552-e1488202.adobeaemcloud.com/adobe/assets/urn:aaid:aem:43fc9506-6c2e-42c3-92a7-335818866577/original/as/Image_No_Shame_Just_Pain_Kubernetes_Flavor_Proliferation.png

How we got here

At Celonis, we organize our infrastructure around what we call environments — full installations of the Celonis platform, each consisting of multiple Kubernetes clusters. Some environments are shared across hundreds of customers; others are dedicated, private installations for a single enterprise. We run on AWS, Azure, and GCP depending on customer preference, across 22 cloud regions worldwide. As of February 2026, all of this added up to approximately 160 clusters.

We accumulated Kubernetes flavors gradually over six years, and each choice made sense at the time. We started with kops in 2017 for our first staging and production environments on AWS. A year later, we adopted Gardener when we needed multi-cloud support for Azure; its innovative "Kubernetes-in-Kubernetes" model with seed and shoot clusters was a real step forward. In 2021, we moved toward OpenShift for its enterprise security focus and, crucially, for our first professionally supported Kubernetes clusters. There was nothing wrong with any of these choices individually. The problem was that we never went back to consolidate. As the company grew rapidly, the platform team was consumed by market expansion and simply couldn't keep up with maintenance. Some clusters were never upgraded.

By the time we took stock, the landscape was painful. The cognitive load on the team was enormous. Making changes was complicated and slow, often requiring implementation across multiple flavors. The newest Kubernetes features were unavailable to us. There were significant security concerns. And most of our clusters were completely out of support. In 2023, we decided we had to start from scratch.

The target we set was clear: one flavor per cloud provider — EKS, AKS, and GKE — with high-level concepts kept uniform across all of them. Everything would be modern, secure, and easy to maintain. We chose managed Kubernetes control planes to stop running them ourselves, provisioned everything through fine-grained Terraform modules orchestrated with Terragrunt, and used ArgoCD ApplicationSets for cluster add-on management. Cluster upgrades became a first-class concern from day one — we've already completed seven upgrade cycles since inception, keeping all clusters on the same version. Later, we added Karpenter for improved node management and cost optimization.

Each environment runs what we internally call a cluster fleet, consisting of three clusters: a main cluster where the majority of platform workloads run, an ML/AI cluster dedicated to machine learning workloads, and a query engine cluster for our central query engine component. All three are connected through a shared networking layer within the same cloud provider region.

https://delivery-p141552-e1488202.adobeaemcloud.com/adobe/assets/urn:aaid:aem:a86fe11f-2321-464f-b9de-70123644c7b5/original/as/Image_No_Shame_Just_Pain_Cluster_Fleet_Architecture.png

The Wormhole

The most critical requirement for our migration was zero downtime. We wanted to migrate workloads independently of each other — no big-bang switchover, no maintenance windows. Any workload could live on either the old or new cluster for an arbitrary period, and it needed to communicate seamlessly with workloads on both sides.

Off-the-shelf service mesh solutions like Cilium were not feasible: our clusters were running ancient Kubernetes versions across too many flavors. We needed something minimalist, customizable, and compatible with our heterogeneous landscape. So we built our own cross-cluster service mesh: the Wormhole.

The Wormhole is a set of statically configured Envoy proxies that enable transparent, bidirectional HTTP traffic proxying between clusters. It leverages Envoy's dynamic forward proxying capability, resolving targets dynamically via DNS based on the Host header rather than requiring explicit upstream configuration. On each cluster, an egress proxy forwards outbound HTTP traffic to the paired cluster, while an ingress proxy receives inbound traffic from it. The key property is that this is entirely transparent to applications — when service A calls service X, it doesn't know or care whether X is running on the old or new cluster.

https://delivery-p141552-e1488202.adobeaemcloud.com/adobe/assets/urn:aaid:aem:b130c5f6-79a3-4fc5-95a1-3fb0e16bd228/original/as/Image_No_Shame_Just_Pain_Wormhole_Overview.png
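
The transparent routing described above can be modeled in a few lines. This is a toy sketch in Python, not Envoy configuration and not the Wormhole's actual implementation. The `backends` table, the `"local"` and `"wormhole"` markers, and the `trace_request` function are hypothetical names introduced purely for illustration.

```python
from typing import Dict, List

# Hypothetical model: for each cluster, where a Service name currently points.
# "local" means pods in that cluster, "wormhole" means the egress proxy.
Backends = Dict[str, Dict[str, str]]

PEER = {"old": "new", "new": "old"}


def trace_request(backends: Backends, start: str, host: str, max_hops: int = 4) -> List[str]:
    """Return the clusters a request visits until it reaches local pods."""
    path = [start]
    cluster = start
    for _ in range(max_hops):
        target = backends[cluster].get(host)
        if target == "local":
            return path
        if target != "wormhole":
            raise LookupError(f"{host} is not defined on cluster {cluster}")
        # Egress proxy on this side hands the request to the peer's ingress proxy.
        cluster = PEER[cluster]
        path.append(cluster)
    raise RuntimeError(f"{host} did not resolve within {max_hops} hops: {path}")


backends = {
    "old": {"service-a": "local", "service-x": "wormhole"},
    "new": {"service-a": "wormhole", "service-x": "local"},
}
print(trace_request(backends, "old", "service-x"))  # ['old', 'new']
print(trace_request(backends, "new", "service-a"))  # ['new', 'old']
```

The caller only ever supplies a service name. Which cluster ends up serving the request is decided entirely by the table, which is the property that keeps the migration invisible to applications.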

To see how this works in practice, consider a stateless application migration. Application A starts on the old cluster, serving traffic normally. We redeploy it on the new cluster, where it runs but doesn't yet receive traffic. Then we patch the Kubernetes Service selector on the old cluster to point at the Wormhole — a zero-downtime operation that reroutes traffic through the Wormhole to the pods on the new cluster. The old pods are now orphaned and can be safely scaled down. The reverse direction works identically: if Application A on the new cluster needs to talk to a service that hasn't been migrated yet, the Wormhole routes traffic back to the old cluster transparently.

https://delivery-p141552-e1488202.adobeaemcloud.com/adobe/assets/urn:aaid:aem:217185ab-644f-4b67-afc5-4b6a303cd423/original/as/Image_No_Shame_Just_Pain_Wormhole_Traffic_Flow.png
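
The selector patch at the heart of the cutover could look roughly like the following. This is a minimal sketch that only builds the patch document. It assumes the cutover is expressed as an RFC 6902 JSON Patch, and the label values (`wormhole-egress`, `application-a`) are hypothetical, not our real labels.

```python
import json
from typing import Dict, List


def selector_patch(labels: Dict[str, str]) -> List[dict]:
    """Build an RFC 6902 JSON Patch that replaces a Service's whole selector."""
    if not labels:
        raise ValueError("refusing to write an empty selector")
    # "replace" swaps the entire selector map, so no old label keys linger.
    return [{"op": "replace", "path": "/spec/selector", "value": dict(labels)}]


# Hypothetical labels, for illustration only.
WORMHOLE_LABELS = {"app": "wormhole-egress"}
ORIGINAL_LABELS = {"app": "application-a"}

cutover = json.dumps(selector_patch(WORMHOLE_LABELS))
rollback = json.dumps(selector_patch(ORIGINAL_LABELS))
print(cutover)
```

A document like this can be applied with `kubectl patch service <name> --type=json -p "$PATCH"`. Keeping the rollback patch next to the cutover patch makes reverting a single, symmetric operation.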

The Wormhole wasn't without risk. The most notable incident was a configuration mistake where both sides pointed at the Wormhole simultaneously, causing requests to ping-pong endlessly between clusters. This overwhelmed the proxies and interrupted connectivity mid-migration — precisely when some services were on the old cluster and some on the new. We learned to guard against this with aggressive validation checks.
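
A validation check of this kind can be very small. The sketch below is an illustration of the idea rather than our actual tooling. It assumes each cluster's routing state is available as a plain mapping from service name to `"local"` or `"wormhole"`, and both function names are hypothetical.

```python
from typing import Dict, List


def find_wormhole_loops(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """Return services that point at the Wormhole on both clusters at once."""
    on_both_sides = old.keys() & new.keys()
    return sorted(
        name
        for name in on_both_sides
        if old[name] == "wormhole" and new[name] == "wormhole"
    )


def assert_safe_to_apply(old: Dict[str, str], new: Dict[str, str]) -> None:
    """Fail before a change is applied if it would create a ping-pong route."""
    loops = find_wormhole_loops(old, new)
    if loops:
        raise ValueError(f"routing loop between clusters for: {', '.join(loops)}")


old_cluster = {"service-a": "wormhole", "service-x": "wormhole"}
new_cluster = {"service-a": "local", "service-x": "wormhole"}
print(find_wormhole_loops(old_cluster, new_cluster))  # ['service-x']
```

Running a check like this against the desired state, before anything is applied, turns the ping-pong failure from a mid-migration outage into a rejected change.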

From individual apps to entire environments

Not all applications could be migrated the same way. We classified every workload into one of three groups. About 90% were stateless and could follow the standard Wormhole procedure: scale up on the target, cut over traffic, scale down on the source. Roughly 1% were singletons — legacy applications that could only have one replica running at a time — which required the reverse order: scale down first, cut over, then scale up. This introduced a brief downtime window, but it was negligible when steps were executed in quick succession. The remaining 9% were stateful applications — including our query engine, ML workloads, and automation engine — each of which required bespoke, one-off migration procedures designed for their specific data and operational requirements.

https://delivery-p141552-e1488202.adobeaemcloud.com/adobe/assets/urn:aaid:aem:c1c23f26-111e-4427-9938-4d4660dbcaae/original/as/Image_No_Shame_Just_Pain_Workload_Classification.png
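
The difference between the first two groups is only the order of the same three steps, which a short sketch makes explicit. The enum and function names below are hypothetical. Stateful workloads deliberately have no generic answer here, since each needed its own procedure.

```python
from enum import Enum
from typing import List


class WorkloadClass(Enum):
    STATELESS = "stateless"
    SINGLETON = "singleton"
    STATEFUL = "stateful"


def migration_steps(kind: WorkloadClass) -> List[str]:
    """Return the ordered cutover steps for a workload class."""
    if kind is WorkloadClass.STATELESS:
        # Both copies may run at once, so traffic never lacks a backend.
        return ["scale up on target", "cut over traffic", "scale down on source"]
    if kind is WorkloadClass.SINGLETON:
        # Only one replica may exist, so the source must stop first.
        return ["scale down on source", "cut over traffic", "scale up on target"]
    raise NotImplementedError("stateful workloads need a bespoke procedure")


for kind in (WorkloadClass.STATELESS, WorkloadClass.SINGLETON):
    print(kind.value, "->", migration_steps(kind))
```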

Knowing how to migrate individual apps wasn't enough. We needed to orchestrate the migration of an entire environment with all its interdependencies. For this, we created what we internally called the Master Plan: a dependency diagram that mapped every step from provisioning to teardown. At the bottom was provisioning the new clusters. Above that came infrastructure prerequisites — setting up the Wormhole, establishing RabbitMQ federation, migrating egress IPs. Then the workload migrations themselves, executed in parallel where possible. The ingress cutover could only happen after all stateless and singleton migrations were complete. And at the very top: tearing down the old infrastructure. This dependency diagram gave us better scheduling, clear opportunities for parallelization, and visibility into bottlenecks and unnecessary cross-team handovers.

https://delivery-p141552-e1488202.adobeaemcloud.com/adobe/assets/urn:aaid:aem:9716b9e9-6212-463a-811a-2e65a71f098b/original/as/Image_No_Shame_Just_Pain_Migration_Master_Plan.png
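
A dependency diagram like the Master Plan maps naturally onto a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+) to derive which steps can run in parallel. The step names follow the description above, but the exact edges are assumptions: we assume every workload migration waits for all three prerequisites, and that teardown waits for both the ingress cutover and the stateful migrations.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each step maps to the set of steps it depends on (edges are assumptions).
PREREQS = {"set up Wormhole", "RabbitMQ federation", "migrate egress IPs"}
MASTER_PLAN: Dict[str, Set[str]] = {
    "provision new clusters": set(),
    "set up Wormhole": {"provision new clusters"},
    "RabbitMQ federation": {"provision new clusters"},
    "migrate egress IPs": {"provision new clusters"},
    "stateless migrations": PREREQS,
    "singleton migrations": PREREQS,
    "stateful migrations": PREREQS,
    "ingress cutover": {"stateless migrations", "singleton migrations"},
    "tear down old infrastructure": {"ingress cutover", "stateful migrations"},
}


def execution_waves(plan: Dict[str, Set[str]]) -> List[List[str]]:
    """Group steps into waves; steps within one wave can run in parallel."""
    sorter = TopologicalSorter(plan)
    sorter.prepare()  # raises graphlib.CycleError if the plan has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(execution_waves(MASTER_PLAN), start=1):
    print(number, wave)
```

With these edges the plan resolves into five waves, and the three workload migration groups land in the same wave. That is the kind of parallelization opportunity the diagram made visible.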

The math that forced automation

When we zoomed into the stateless service migration, the actual steps per application turned out to be more involved than the simplified "scale up, cut over, scale down" summary suggests. First, we had to duplicate the Kustomize overlay — we used Kustomize to describe app configuration, but needed to control old and new cluster configs independently. Then we adjusted the overlay to remove OpenShift-specific constructs like shared volumes and SecurityContextConstraints, and updated manifests for API compatibility across a jump from 1.16 to 1.32+. Only then came the actual scale up, traffic cut-over, and scale down. Each step produced one or more pull requests.
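
To give a feel for the API compatibility step, here is a minimal sketch of a check that flags manifests using API versions that were still served in 1.16 but have since been removed upstream. It is not our migration tooling. The table covers only a handful of well-known removals, manifests are assumed to be already parsed into dictionaries, and `find_removed_apis` is a hypothetical name.

```python
from typing import Dict, List, Tuple

# (apiVersion, kind) -> replacement apiVersion. A deliberately partial table.
REMOVED_APIS: Dict[Tuple[str, str], str] = {
    ("extensions/v1beta1", "Ingress"): "networking.k8s.io/v1",
    ("networking.k8s.io/v1beta1", "Ingress"): "networking.k8s.io/v1",
    ("batch/v1beta1", "CronJob"): "batch/v1",
    ("policy/v1beta1", "PodDisruptionBudget"): "policy/v1",
    ("autoscaling/v2beta1", "HorizontalPodAutoscaler"): "autoscaling/v2",
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): "autoscaling/v2",
}


def find_removed_apis(manifests: List[dict]) -> List[str]:
    """Report manifests whose apiVersion no longer exists on a modern cluster."""
    findings = []
    for manifest in manifests:
        api_version = manifest.get("apiVersion", "")
        kind = manifest.get("kind", "")
        replacement = REMOVED_APIS.get((api_version, kind))
        if replacement:
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            findings.append(f"{kind}/{name}: {api_version} -> {replacement}")
    return findings


manifests = [
    {"apiVersion": "extensions/v1beta1", "kind": "Ingress", "metadata": {"name": "web"}},
    {"apiVersion": "apps/v1", "kind": "Deployment", "metadata": {"name": "web"}},
]
print(find_removed_apis(manifests))
```

The sketch reports rather than rewrites on purpose. Bumping `apiVersion` alone is not always enough. An Ingress in `networking.k8s.io/v1`, for example, also requires `pathType` and a restructured backend, so each finding still needs a real manifest change.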

The math was sobering: roughly 10 PRs per app, times 100 apps per environment, times 40 environments — 40,000 pull requests in total. If we had done all of this manually, we would probably still be migrating. We would have caused a hundred-plus incidents. The results would have been wildly inconsistent, and we would have accumulated even more technical debt on top of the debt we were trying to pay down.

The solution was building dedicated migration tooling — a CLI that could automate migration operations for batches of apps, not one by one but 10 or even 100 at a time. It accessed clusters, fetched information from cloud provider APIs and ArgoCD, and performed all the migration actions: scaling up and down, traffic cut-over, overlay adjustments, decommissioning, and more. Critically, the output was always pull requests reviewed by a human — human-in-the-loop at every stage.
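
The core idea, batches in and reviewable pull requests out, can be sketched without any cluster access. The following is an illustration of the batching logic only, not the CLI itself. `PullRequestDraft`, `plan_pull_requests`, and the title format are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence, Tuple


@dataclass(frozen=True)
class PullRequestDraft:
    title: str
    apps: Tuple[str, ...]


def chunks(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def plan_pull_requests(
    action: str, environment: str, apps: Sequence[str], batch_size: int = 10
) -> List[PullRequestDraft]:
    """Turn one migration action over many apps into a list of PR drafts."""
    ordered = sorted(set(apps))  # stable order, so reruns give identical output
    batches = list(chunks(ordered, batch_size))
    return [
        PullRequestDraft(
            title=f"[{environment}] {action}: batch {index}/{len(batches)}",
            apps=tuple(batch),
        )
        for index, batch in enumerate(batches, start=1)
    ]


apps = [f"app-{n:03d}" for n in range(25)]
for draft in plan_pull_requests("scale-up", "env-a", apps):
    print(draft.title, len(draft.apps))
```

Because the function only produces drafts, nothing changes in a cluster until a human has reviewed and merged the resulting pull request. Sorting the input first means the same command yields the same batches regardless of the order in which apps were discovered.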

The investment paid off enormously. Every migration action became traceable through a chain of PRs. The same commands could be rerun with identical results. Deterministic output eliminated human error. Execution was orders of magnitude faster. And perhaps most importantly, instead of relying on a few specialists — which was itself a bottleneck — the CLI enabled any team member to execute migrations safely by following the tool's guidance rather than navigating complex runbooks. This democratization of the migration process was a huge relief.

Timeline, results, and the snowflakiness index

The project kicked off in early 2023 with the design of the target platform and migration strategies. By late 2023, we had our first EKS cluster provisioned and were focused on EKS migrations — everything manual, everything slow. In early 2024, we started with AKS, but velocity remained low. This was when we realized that manual approaches simply would not scale, and we invested heavily in automation. By 2025, the automation had matured and migration velocity increased dramatically. As of 2026, the final environments are approaching completion.

https://delivery-p141552-e1488202.adobeaemcloud.com/adobe/assets/urn:aaid:aem:804c2344-41f4-44d3-8f6f-53658af87a8a/original/as/Image_No_Shame_Just_Pain_Migration_Timeline.png

One factor that significantly slowed us down was what we internally called the snowflakiness index. Even within a single provider flavor — say, Gardener on AWS — environments differed from each other in meaningful ways. Some used Linkerd while others used Istio. Some were HIPAA-compliant with additional hardening. Some ran on GovCloud with distinct configurations. The list went on. Each new variation required adjustments to our tooling and migration procedures. As more environments were migrated, the snowflakiness index declined — reinforcing that this project was primarily a consolidation effort, not just a modernization one. After migration, we actively prevent drift by centrally managing the fleet through infrastructure-as-code with feature flags: same templates everywhere, only specific features toggled per environment.
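
The "same templates, toggled features" idea can be illustrated with a small sketch. The flag names below are hypothetical examples inspired by the variations mentioned above, not our real configuration, and in practice this logic lives in infrastructure-as-code rather than Python.

```python
from typing import Dict, Mapping

# Hypothetical flag set shared by every environment.
DEFAULT_FLAGS: Dict[str, bool] = {
    "hipaa_hardening": False,
    "govcloud": False,
    "karpenter": True,
}


def render_flags(overrides: Mapping[str, bool]) -> Dict[str, bool]:
    """Merge per-environment overrides into the shared defaults."""
    unknown = sorted(set(overrides) - set(DEFAULT_FLAGS))
    if unknown:
        # Anything outside the known flag set would be a new snowflake.
        raise KeyError(f"unknown feature flags: {unknown}")
    for name, value in overrides.items():
        if not isinstance(value, bool):
            raise TypeError(f"flag {name!r} must be a boolean")
    return {**DEFAULT_FLAGS, **overrides}


print(render_flags({"hipaa_hardening": True}))
```

Rejecting unknown keys is the important design choice here. An environment can differ from its siblings only along dimensions the shared template already knows about, so a new variation has to be added centrally before any single environment can use it.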

What we learned

After years of migrating 160+ clusters across three cloud providers, a few lessons stand out above the rest.

Don't change too many things at once. We consciously kept this as a lift-and-shift migration, not an application refactoring project. If an application used EFS and that wasn't ideal, we addressed it separately. Mixing migration with refactoring multiplies risk and makes rollback harder.

Beware of scope creep. It happened constantly: "You're already migrating — can't we also add this change?" We had to say no firmly and protect the project timeline. Every additional change multiplied risk and effort across all 40 environments.

Snowflakes will slow you down. The key to scaling a migration is repeatability, and snowflakes are the enemy of repeatability: each one breaks your proven procedures and sends you back to the drawing board. That is why, after migration, we prevent drift with centralized, templated infrastructure-as-code and feature flags.

Keep up with the changing environment. Even during a multi-year migration, the world doesn't stop and the rest of the company keeps moving. We introduced Karpenter and Cilium, some services were decommissioned (saving us migration work), and new ones were added. We had to continuously adapt our tooling and plans.

Iterate and improve. Every migration should make the next one better, feeding learnings back into our runbooks, automation, and procedures. Over time, velocity increased, procedures became smoother, and incidents decreased.

Minimize handovers. Many teams were involved, and domain experts designed specific migration steps. But we made those steps self-service so that other teams could execute them without waiting on potentially busy specialists. This eliminated a major throughput bottleneck.

Don't cause the same incident twice. When you touch every application in every cluster, things will go wrong. We focused relentlessly on improving reliability over time, and service level objectives were invaluable in tracking our progress and maintaining accountability.

Migrating away from Kubernetes 1.16 was one of the most challenging infrastructure projects we've undertaken at Celonis. It required confronting years of technical debt, aligning dozens of teams, and building custom tooling for a problem no off-the-shelf solution could address. But the results speak for themselves: a uniform, modern, and maintainable Kubernetes platform across three cloud providers — with the confidence and processes in place to ensure we never fall that far behind again.

If you're facing a similar migration, we hope our journey — mistakes and all — helps you navigate yours.
