Handling Breaking CRD Changes
How We Upgraded External Secrets Operator Safely
TL;DR
- Problem: Upgrading ESO to 0.16.2 caused Argo CD apps to fail to sync.
- Root cause: The new ESO no longer served externalsecrets/v1alpha1; most of our manifests were still using v1alpha1.
- Why rollback failed: Some objects were converted to the new storage version (v1) during upgrade; downgrading couldn't convert them back.
- Fix: Added a compatibility CRD version — a slightly modified copy of v1alpha1 with defaults aligned to v1beta1 — so ESO could accept the old manifests.
- Outcome: Dev clusters recovered and we rolled the compatibility change to other dev environments. We're now migrating manifests to v1 and preparing for ESO 1.0.0.
Some useful knowledge about Kubernetes CRDs
In Kubernetes, CRDs (Custom Resource Definitions) are objects inside the cluster that specify what user-defined or third-party resources should look like: the spec and status fields, the required properties, and the default values, all expressed through an OpenAPI schema.
The same CRD can have multiple versions; each version represents one of the possible ways to interact with the same resource.
When a CRD has multiple versions, there are two important flags to set on each of them: `served` and `storage`.
A `served: true` CRD version is one that can be used through the Kubernetes API (i.e. kubectl, Argo CD, etc.); any API call that explicitly requests this version gets a response in this version's syntax (e.g. `kubectl get externalsecrets.v1beta1.external-secrets.io`).
When no version is specified in the request, the served version is chosen by name according to the Kubernetes version priority rules.
A `storage: true` CRD version is the version used to persist the resource to etcd in the Kubernetes control plane; for this reason, exactly one CRD version can have this field set to `true`.
All other versions will be converted to the storage version through the conversion webhook.
When a new version is introduced, it is usually set as the new `storage: true` version, while the pre-existing one is set to `storage: false` and patched with the conversion webhook configuration, which handles conversion from the old to the new version during API interactions.
Resources that are still stored as the previous version keep being stored as such; no mass conversion is performed after the CRD update. However, any API interaction that updates such a resource causes it to be re-stored in the new version. Mass conversion can be triggered intentionally by patching all the resources involved with minimal changes, as shown in the sketch below.
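A minimal sketch of such a mass conversion for ExternalSecret objects, using a throwaway annotation to force a write (the annotation key is purely illustrative):

```bash
# Rewrite every ExternalSecret under the current storage version by making a
# harmless change; the annotation key below is an arbitrary, illustrative name.
kubectl get externalsecrets --all-namespaces --no-headers \
  -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name' |
while read -r ns name; do
  kubectl annotate externalsecret "$name" -n "$ns" \
    storage-version-migration=done --overwrite
done
```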
Unfortunately, the stored API version of each individual object is not exposed through any API; verifying it requires direct access to etcd, which is not available in managed Kubernetes offerings such as AWS EKS, Azure AKS, and GCP GKE. The only information available through the Kubernetes API is the list of versions currently used to store resources of that CRD, e.g. `status.storedVersions: ["v1beta1", "v1"]` tells us that both v1beta1 and v1 are currently used to store at least one resource each, but not which ones, how many, or in which namespaces.
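For instance, the served/storage flags of each version and the versions still recorded in `status.storedVersions` can be checked with something like this:

```bash
# Show each version of the ExternalSecret CRD with its served/storage flags.
kubectl get crd externalsecrets.external-secrets.io -o json |
  jq -r '.spec.versions[] | "\(.name)\tserved=\(.served)\tstorage=\(.storage)"'

# Show which versions are still used to store at least one object.
kubectl get crd externalsecrets.external-secrets.io \
  -o jsonpath='{.status.storedVersions}{"\n"}'
```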
When an older CRD version is to be removed, the usual flow is as follows:
1. Update the CRD introducing a new version, setting it to `storage: true` while setting the old one to `storage: false`
2. Have the old version handle conversion to the new one with a conversion webhook
3. Make sure the CRD `status.storedVersions` only reports the new version (see the sketch after this list)
4. Update all the manifests used for interacting with the Kubernetes API to the new version
5. Update the CRD again, toggling the older version to `served: false`; this disables the older version while keeping its spec saved in the cluster, in case a rollback is needed
6. Verify that nothing is disrupted
7. Update the CRD a third time, fully removing the older version from the CRD spec
8. Repeat step 6
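Step 3 is the least obvious one: `status.storedVersions` only shrinks after the remaining objects have been rewritten in the new version, and the status subresource then has to be patched explicitly. A minimal sketch, assuming every ExternalSecret has already been migrated to v1:

```bash
# Once no object is stored as the old version anymore, drop it from
# status.storedVersions (requires a kubectl recent enough for --subresource).
kubectl patch crd externalsecrets.external-secrets.io \
  --subresource=status --type=merge \
  -p '{"status":{"storedVersions":["v1"]}}'
```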
What follows is what happened to us when we failed to complete step 4 in time and couldn't update all of our manifests quickly enough to remediate.
Context
We needed to bring ESO forward by many minor versions (0.9.1 → 0.16.2) for bug fixes and features. We assumed we were using v1beta1 everywhere because it was available in 0.9.1; a small sample check seemed to confirm that. After upgrading to 0.16.2 in one dev environment and validating that secrets still synced, we declared the upgrade safe, until we rolled it out to all other dev environments and Argo CD began failing to apply many manifests. The failure: in the apps' sync status, Argo CD reported that there was no externalsecrets/v1alpha1 API in the target clusters.
Why rollback was not an option
During the upgrade, some ExternalSecret objects were already converted to v1 (the storage version used by the newer ESO), but conversion webhooks are designed for the upgrade path; neither the old nor the new webhook could convert from v1 back to v1beta1 (the older storage version) on downgrade. We considered manually recreating the objects after a rollback, but that would have been error-prone and time-consuming given the number of clusters and resources involved. So we started crafting a fix-forward strategy.
Investigating incompatibilities
We compared the v1alpha1 schema from 0.9.1 with the v1beta1 schema in 0.16.2 and cataloged the changes. Most of the changed fields and syntaxes were not used in our manifests; we found only one service still using the old `dataFrom` syntax and converted it. That gave us an opportunity: if the Kubernetes API could be made to accept v1alpha1 manifests (even if internally ESO used v1beta1/v1 logic), Argo CD would be able to apply resources, allowing us to postpone the manifest updates (still preferred in the long run to make them compatible with the native capabilities of the new ESO version, but impractical to do all at once given the number of applications involved).
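As an illustration, this kind of comparison can be sketched directly against a cluster where both versions are still present in the CRD (we actually diffed the CRD manifests shipped with the two releases):

```bash
# Diff the OpenAPI schemas of the v1alpha1 and v1beta1 versions of the CRD.
crd=externalsecrets.external-secrets.io
diff \
  <(kubectl get crd "$crd" -o json | jq '.spec.versions[] | select(.name == "v1alpha1").schema') \
  <(kubectl get crd "$crd" -o json | jq '.spec.versions[] | select(.name == "v1beta1").schema')
```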
The workaround we implemented
- We extracted the v1alpha1 CRD from 0.9.1 and compared it line-by-line with 0.16.2's CRD.
- Then we created a new served-but-not-storage CRD version: a slightly modified copy of the old v1alpha1 with default values aligned to the newer v1beta1 for the fields that overlap (a minimal sketch of this change follows the list).
- We tested that CRD bundle in a single dev cluster.
- We verified that Argo CD apps could once again create and sync ExternalSecret resources without being forced to convert them to the new schema immediately.
- At last, we fixed the one manifest still using the old `dataFrom` syntax and rolled the CRD compatibility change out to the other dev environments.
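A minimal sketch of what that compatibility version amounts to, assuming the old v1alpha1 OpenAPI schema has been extracted to a local JSON file (the filename and the jq pipeline are illustrative; in practice the change was baked into our CRD manifests):

```bash
# Append a served-but-not-storage v1alpha1 entry to the new ExternalSecret CRD,
# reusing the schema extracted from the 0.9.1 release (illustrative filename).
kubectl get crd externalsecrets.external-secrets.io -o json \
  | jq --slurpfile schema v1alpha1-openapi-schema.json \
      '.spec.versions += [{
          name: "v1alpha1",
          served: true,
          storage: false,
          schema: { openAPIV3Schema: $schema[0] }
        }]' \
  | kubectl replace -f -
```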
This approach kept the cluster API surface compatible with our current manifests without forcing an immediate rewrite of hundreds of application manifests.
Results
- Dev environments: recovered and stable, all ArgoCD apps are now back to healthy.
- Migration plan: regularly update resource manifests to the latest API version compatible with all environments. In this case, that means updating our ExternalSecret manifests to v1beta1 in the short term, and to v1 after the ESO update has been rolled out to production (a repo-wide search like the one sketched after this list helps track progress).
- Production: postpone the ESO update until all the ExternalSecret manifests have been updated to v1beta1, so we can avoid deploying the compatibility patch there.
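To track that migration, a simple repo-wide search is enough to flag manifests that still declare the old API versions (the checkout path below is a placeholder):

```bash
# List every manifest that still references an old external-secrets.io API version.
grep -rnE --include='*.yaml' --include='*.yml' \
  'apiVersion:[[:space:]]*external-secrets\.io/v1(alpha1|beta1)' \
  ~/work/gitops-repos
```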
Key takeaways
- Validate the entire upgrade path and affected services before rolling out broadly.
- Don't rely solely on a small sample of manifests; search org-wide for the API versions your apps actually reference. Argo CD can help with that, for example with the script below:
```bash
#!/bin/bash
# For every Argo CD application, list the API version, kind, and name of each
# resource it manages, to spot manifests that still reference old CRD versions.
APP_NAMES=$(argocd app list -o name)

echo "Version Kind Name"
echo "-----------------"
for APP in $APP_NAMES; do
  echo "--- Application: $APP ---"
  # Read the resources from the app's sync status and print one line per resource.
  argocd app get "$APP" -o json 2>/dev/null |
    jq -r '.status.resources[] | select(.kind != "Application") | "\(.version) \(.kind) \(.name)"' |
    sort -u
done
```
- Rollbacks can be harder than forward fixes when CRD storage versions change. Anticipate conversion pitfalls.
- A minimal compatibility CRD can be an effective bridge while you migrate manifests, but it should be temporary. Plan to migrate consumers to the newer API.
- Regularly update app manifests. APIs do not generally "move backwards"; keeping manifests current reduces upgrade risk.
Alternatives we considered (and why we rejected them)
- Reintroducing a v1alpha1 -> v1beta1 conversion webhook inside the new ESO: technically possible, but it would complicate future upgrades and give a false sense of safety. We preferred a short-term compatibility shim plus explicit manifest update.
- Bulk recreation of resources on rollback: too risky and error prone.
Recommended checklist for similar cases
- Search all repos and apps for API versions referenced in manifests.
- Confirm what CRD versions the target operator will serve and which is the storage version.
- Test upgrades in an isolated dev environment and validate Argo CD / GitOps syncs.
- If objects will be converted to a new storage version, assume rollback will be hard or impossible.
- If necessary, provide a temporary compatibility CRD version and schedule a migration plan to the new API.
When this is useful
- The new CRD version contains breaking changes, but affected features are either unused or their manifests can be safely upgraded in advance.
- Environments are managed by GitOps (for example, ArgoCD) with many manifests spread across repositories, so a global, immediate edit is impractical.
- A fast, low‑risk way to restore CI/CD or GitOps syncs is needed without touching every application.
- Multiple clusters or teams are upgrading at different speeds; a compatibility version limits the blast radius while teams work on it.
Damiano Fisicaro
Damiano has been working as a Platform Engineer at Celonis since 2024. He started his career as a Cloud Engineer back in 2021. His passion is to make things work, even better if in automation. He really likes understanding new technologies, figuring out how to use them in his day-to-day work, and is always happy to share what he learns with the team.
His hobbies are playing League of Legends, action-adventure video games, chess, dancing Lindy Hop, and playing the guitar.