Celonis Platform Infrastructure Engineering

https://delivery-p141552-e1488202.adobeaemcloud.com/adobe/assets/urn:aaid:aem:1f813b32-f2fc-4ceb-8ffc-b77b40832c19/as/large-260102_MSFT-AI-Tour-NYC_WebpageImages_Hero_1080x720px.jpg?preferwebp=true

The Celonis Platform Engineering team orchestrates a global multi-cloud ecosystem by treating infrastructure as a product. This strategy abstracts complexity to ensure reliability and security, empowering developers to ship faster.

Since its inception in 2011, Celonis has grown exponentially to 1350+ customers around the world. As the process intelligence software evolved, so did its underlying systems and their architecture. Building, scaling, and managing this ever-growing, dynamic SaaS environment is the main responsibility of a Platform Infrastructure (PI) Engineer at Celonis.

The role of a PI engineer is to provide secure, stable, and cost-effective infrastructure solutions that enable application teams to become self-sufficient in delivering their products, significantly reducing the lead time required to bring features to customers.

With 150+ active Kubernetes clusters hosted on multi-cloud environments (AWS, Azure, GCP) across 19+ geo-locations, the PI team streamlines the configuration of the underlying infrastructure stack. Each component constituting the system is treated as an individual product. The platform team handles all the underlying infrastructure complexity, fully abstracting it away from developers so they can plug in their code and deploy features efficiently.

https://delivery-p141552-e1488202.adobeaemcloud.com/adobe/assets/urn:aaid:aem:4dfacf1b-713c-4599-b3b7-9167a25f727a/original/as/Image_PI_Engineering_Blog_k8s_growth_external.jpg

Reliability

Reliability on our platform is a core feature, achieved by abstracting away instability, rigorously optimizing resource management, and enabling non-interruptive operations.

Our technical strategy focuses on continuous, impactful improvements that drive scalability. For instance, migrating from non-zone-aware cluster autoscalers to Karpenter allowed us to unify node group provisioning and scaling, eliminating the additional Auto Scaling group hop, improving resource elasticity and failure domain awareness, and making clusters significantly more expandable under peak demand.
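A zone-aware Karpenter setup of the kind described above can be sketched with a NodePool whose requirements pin provisioning to specific availability zones. This is a minimal illustration, not our actual configuration; the pool name, zones, and EC2NodeClass reference are placeholders.

```yaml
# Minimal Karpenter (v1 API) NodePool sketch; names and zones are placeholders.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        # Zone awareness: Karpenter places nodes per-zone instead of
        # routing through a single non-zone-aware Auto Scaling group.
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["eu-central-1a", "eu-central-1b", "eu-central-1c"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
  limits:
    cpu: "1000"
  disruption:
    # Consolidate underutilized nodes to keep the cluster elastic.
    consolidationPolicy: WhenEmptyOrUnderutilized
```

Because Karpenter provisions nodes directly against the cloud API, scaling decisions react to pending pods in seconds rather than waiting on an Auto Scaling group to converge.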

Stability is an end-to-end commitment, reaching from the instant demands of real-time traffic all the way through to flawless deployment orchestration.

At the runtime layer, we actively work to prevent performance degradation by focusing on low-level metrics. For instance, we ensure the Envoy proxy's dns_queries attempt-to-success ratio remains stable even under significant upstream load.

Our commitment to GitOps and Infrastructure as Code (IaC) has resulted in a vast ecosystem: millions of Terraform-managed resources spanning all three major cloud providers and over 15,000 ArgoCD Applications driving our Kubernetes deployments.
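Each of those ArgoCD Applications follows the same declarative shape: a Git source as the single source of truth and a cluster destination that ArgoCD continuously reconciles. The sketch below is illustrative only; the application name, repository URL, and paths are hypothetical.

```yaml
# Illustrative ArgoCD Application; repo URL, paths, and names are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/deployments.git
    targetRevision: main
    path: payments-service/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true     # remove resources deleted from Git
      selfHeal: true  # revert out-of-band drift back to the Git state
```

With `selfHeal` enabled, any manual drift in the cluster is automatically reverted to what Git declares, which is what makes Git the authoritative source of truth at this scale.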

But scale is only half the story. The other half is operational agility. As our platform evolves, we face complex operational requirements, such as migrating applications across namespace or cluster boundaries. We ensure that even significant structural changes, like moving the home of an application, never interrupt the user or corrupt the GitOps source of truth. We build for a world where the platform can change constantly, but reliability remains permanent.

Security

As a native feature, security is enforced at every level, from initial infrastructure provisioning to services running in live production.

To operate our Terraform configuration, we employ a security model centered on scoped, short-lived tokens bound to specific identities. This approach is a fundamental shift away from traditional, less secure methods: neither our Platform Engineers nor our automated tooling ever rely on static, long-lived credentials or store sensitive information as plain-text secrets to provision infrastructure.

Instead, a system like AWS Secrets Manager is responsible for storing secrets, while AWS IAM Roles or Azure/GCP Federated Identity grant temporary, minimal-privilege credentials only for the duration of the required operation. This drastically reduces the attack surface, limits the potential blast radius of a compromised token, and ensures that all infrastructure changes are tied back to an auditable identity.
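On AWS, one common way to realize this identity-bound, short-lived credential model is IAM Roles for Service Accounts (IRSA): a Kubernetes ServiceAccount is annotated with an IAM role, and pods using it receive temporary STS credentials instead of static keys. The sketch below assumes IRSA; the service account name, namespace, and role ARN are placeholders.

```yaml
# IRSA sketch: pods using this ServiceAccount obtain short-lived STS
# credentials for the annotated role; no static keys are ever stored.
# Name, namespace, and role ARN are placeholders.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: terraform-runner
  namespace: infra
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/terraform-runner
```

The credentials expire automatically, and every API call made with them is attributable to the role in CloudTrail, which is what keeps infrastructure changes auditable.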

Within the Kubernetes environment, Policy as Code (PaC) is paramount. We leverage Kyverno to validate, mutate, and generate resources. This allows us to enforce crucial guardrails, such as mandating resource requests while forbidding CPU limits on all deployments. Beyond validation, we also leverage Kyverno's generation capabilities to reduce operational toil: for example, we automatically provision image pull secrets for private AWS ECR authentication in every newly created Namespace.
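The requests-required, no-CPU-limits guardrail can be expressed as a Kyverno validation policy. This is a minimal sketch, not our production policy; the policy name and messages are illustrative.

```yaml
# Minimal Kyverno sketch of the guardrail described above; names are illustrative.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-requests-forbid-cpu-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-requests
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "CPU and memory requests are required for every container."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "?*"
                    memory: "?*"
    - name: forbid-cpu-limits
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "CPU limits are not allowed."
        pattern:
          spec:
            containers:
              - resources:
                  =(limits):
                    # X() negation anchor: the cpu field must not be present.
                    X(cpu): "null"
```

Matching on Pods (rather than Deployments) means the rules also catch workloads created by Jobs, StatefulSets, or raw Pod manifests.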

Granular network control is achieved via Network Policies, which strictly govern communication, such as only allowing ingress traffic to the gateway via the DNS proxies and VPNs, or restricting specific workload egress to predefined targets only.
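A default-deny-plus-allowlist policy of that kind can be sketched as follows. The selectors, labels, and CIDR ranges here are placeholders, not our real topology.

```yaml
# NetworkPolicy sketch; labels and CIDRs are placeholders.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-gateway-traffic
  namespace: gateway
spec:
  podSelector:
    matchLabels:
      app: gateway
  policyTypes: ["Ingress", "Egress"]
  ingress:
    # Only DNS proxies and the VPN range may reach the gateway.
    - from:
        - namespaceSelector:
            matchLabels:
              role: dns-proxy
        - ipBlock:
            cidr: 10.20.0.0/16  # VPN range (placeholder)
  egress:
    # Egress is restricted to predefined targets only.
    - to:
        - ipBlock:
            cidr: 10.30.0.0/24  # approved upstream (placeholder)
```

Because the selectors are empty-deny by default once a policy selects a pod, anything not explicitly listed in `ingress` or `egress` is dropped.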

We perform routine evaluations to maintain a proactive security posture. For example, while examining traffic to external dependencies, we review which flows should be routed via dedicated private endpoints and which should egress via NAT gateways, minimizing our attack surface.

https://delivery-p141552-e1488202.adobeaemcloud.com/adobe/assets/urn:aaid:aem:f152022a-db00-474a-8881-c6c0f325ecfa/original/as/Image_PI_Engineering_Blog_security_external.jpg

Governance and control

We ensure the platform operates with predictable consistency by leveraging automation and templating across all layers.

We maintain control over boundaries by providing versioned Terraform modules and maintained IAM policy templates, among other guardrails.

For our most complex infrastructure automation and orchestration needs, we leverage the power of the Kubernetes Control Plane. This involves building custom Kubernetes Operators to codify sophisticated operational knowledge and manage non-Kubernetes resources declaratively. For example, we implemented a controller that automates boilerplate code generation from a given template. The controller automates Git workflows such as committing changes to target branches and creating pull requests, bridging Kubernetes resource management with Git-based configuration-as-code practices.
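From a user's perspective, such an operator is driven by a custom resource that declares the desired scaffold and Git workflow; the controller reconciles it into commits and pull requests. The resource shape below is entirely hypothetical, invented for illustration: the API group, kind, and fields do not correspond to a real CRD.

```yaml
# Hypothetical custom resource; the API group, kind, and all fields are
# invented to illustrate the declarative operator pattern described above.
apiVersion: platform.example.internal/v1alpha1
kind: ServiceScaffold
metadata:
  name: payments-service
  namespace: platform-automation
spec:
  template: java-spring-service          # boilerplate template to render
  targetRepo: https://git.example.com/platform/deployments.git
  targetBranch: main
  createPullRequest: true                # controller opens a PR with the commit
```

The controller watches these resources, renders the template, pushes a branch, and opens the pull request, so the entire Git workflow stays declarative and auditable.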

While we actively build our own custom solutions, we also strategically evaluate and utilize community-driven projects such as Crossplane or Kube Resource Orchestrator (kro). This hybrid approach ensures we aren't constantly reinventing the wheel. Furthermore, if our Platform Engineers identify missing functionality in an off-the-shelf OSS solution, they are strongly encouraged to contribute the necessary features and fixes to the upstream open-source projects, reinforcing our commitment to the wider community.

For network edge operations like domain-level certificate issuance via Cloudflare, we enforce a standardized, codified provisioning process. This ensures every domain utilizes the same secure configuration and renewal logic, guaranteeing uniformity and integrity across all production assets.
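One common way to codify such certificate issuance in Kubernetes is cert-manager with a Cloudflare DNS-01 solver; the source does not name the exact tooling, so this is an assumed sketch, and the issuer name, email, domain, and secret names are placeholders.

```yaml
# Assumed cert-manager sketch for codified Cloudflare certificate issuance;
# names, email, and domain are placeholders.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-cloudflare
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform@example.com
    privateKeySecretRef:
      name: letsencrypt-account-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app-example-com
  namespace: gateway
spec:
  secretName: app-example-com-tls   # renewed automatically before expiry
  dnsNames: ["app.example.com"]
  issuerRef:
    name: letsencrypt-cloudflare
    kind: ClusterIssuer
```

Because every domain goes through the same Issuer, the validation and renewal logic is identical across all production assets, which is what guarantees the uniformity described above.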

https://delivery-p141552-e1488202.adobeaemcloud.com/adobe/assets/urn:aaid:aem:dc024cfb-e2e2-405c-8b5a-f1f69cb30880/original/as/Image_PI_Engineering_Blog_governance_external.jpg

Monitoring and Traceability

Maintaining clear visibility into the entire ecosystem is another integral part of the PI team's work. We achieve this by centralizing observability within Datadog, where we maintain a rich ecosystem of SLOs, customized dashboards, and monitors.

Our observability extends from standard Kubernetes metrics down to low-level network forensics. For instance, when latency spikes are detected, or when diagnosing abnormal packet drops or connectivity issues, we leverage Cilium's eBPF maps to identify the root cause.

We ensure full traceability with comprehensive logging using custom workload labels and APM, giving us a real-time map of component interactions that pinpoints consumption time, delays, and response codes, among other signals.

Our monitors use carefully tuned thresholds set to identify pre-failure indicators, allowing us to intervene and prevent issues before they impact service quality and stability in production.

Future-Proofing the Platform

Our PI team does not just maintain systems; it drives continuous value by making critical architectural decisions, designing platform solutions, and weighing trade-offs.

Our work involves tackling high-stakes challenges, such as zero-downtime migration of entire systems to a new platform or seamlessly changing core network interfaces without impacting production.

To ensure sustained excellence, we routinely challenge the status quo.

Our team thrives on open discussion and shared ownership, backed by the deep technical experience required to architect a truly modern, efficient, and reliable platform.