Throughout my career, I've always gravitated toward two things: technology infrastructure and cybersecurity. Being awarded CVE-2026-10533 was a pleasure and a privilege, as I got to contribute to the security community without going through the usual red-team lenses of memory corruption, obfuscated payloads, or social engineering. Instead, this CVE was discovered by asking a much simpler question:
What happens when a tenant consumes a shared cluster resource that nobody is measuring?
Let's break this question down a bit, to understand the exploit methodology. If you're already an expert in OpenShift administration, feel free to skip the next section.
Defining the Question
In this case, what is a tenant? If you're unfamiliar with OpenShift, Kubernetes, or cloud computing in general - a tenant is simply one "user" of the (usually shared) distributed system. When you use the cloud (e.g., you stand up an application inside AWS, GCP, etc.), you are a single tenant of that cloud. Almost always - except where you pay a premium for an exception! - you *share* the underlying compute and resources of the cloud with other users that you cannot see or interact with. Two people from two completely different countries and companies could each provision software on the cloud, and the cloud provider could, for any number of reasons, place their workloads right next to each other on the same physical server in the same physical datacenter. The biggest point to take from this is that, by design, tenants are isolated from one another.
So what do we mean by shared? While tenants are supposed to be completely isolated from one another, they can end up using shared resources, like the physical server mentioned above. If that one server fails, BOTH tenants are impacted. Maintaining shared cluster and cloud resources with priority and fairness to all tenants is pretty much the entire job description of cloud administrators and operators. There is a lot to manage, as you can probably imagine; think of it as akin to maintaining the roads that many people drive on, the sewage systems that many houses use, the electricity grid tied to many endpoints, etc. If a cluster admin inadvertently impacts a shared resource, that is a bad day at work for that admin. If a *tenant* can intentionally impact shared resources and step on other tenants, the activity crosses the line from accidental incident to malicious attack.
Why do I say nobody? Well, cluster admins can and should be watching their shared cluster resources that all tenants use. However, there are many things that tenants technically "share", including but not limited to:
- Underlying host CPU
- Underlying host memory
- Underlying host disk space
- Underlying host network
- Network bandwidth
- Cluster control plane API calls
- DNS nameservers
- L3 network
- L1 data cables
- Server rack PDUs
- Datacenter aisle
- Datacenter HVAC
- Datacenter electricity allocation
- Regional power grid
Enter Events
Events in Kubernetes are simply metadata objects that get created and logged whenever something interesting happens in a cluster, such as:
- A Pod is created
- A container image is pulled
- A container starts
- A scheduling decision occurs
- A readiness probe fails
- A Pod completes successfully
Each of these produces an Event object that logs the timestamp, a description of what happened, and references to the tenant and workload that generated the Event. Events are extremely useful for troubleshooting because they provide a timeline of what happened inside the cluster.
Events are not free. Every Event is stored in etcd, the distributed key-value database of Kubernetes that serves as the cluster's source of truth. etcd is arguably the most critical component in the cluster. It stores information about Pods, Deployments, Nodes, Services, Operators, Images, API objects, and countless other pieces of cluster state. If etcd becomes unhealthy, the rest of the cluster quickly follows. A Kubernetes cluster is ultimately a distributed system whose state is coordinated through etcd. Anything capable of overwhelming etcd therefore has the potential to impact the entire cluster.
Normal Guardrails
ResourceQuotas. Most enterprise Kubernetes and OpenShift clusters use ResourceQuotas to prevent a single tenant from consuming excessive resources. A typical tenant's provisioning space (called a Namespace) might be limited to:
- 10 Pods
- 1 CPU
- 10 GiB of memory
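As a rough sketch, the limits above map onto a ResourceQuota object along these lines. It is written as a Python dict rather than YAML so it can be inspected programmatically. The quota name "tenant-quota" and namespace "tenant-a" are hypothetical, and it assumes the CPU and memory limits are expressed as requests; the keys under spec.hard are the standard compute quota names.

```python
# Hypothetical tenant quota mirroring the limits listed above.
tenant_quota = {
    "apiVersion": "v1",
    "kind": "ResourceQuota",
    # Name and namespace are illustrative only.
    "metadata": {"name": "tenant-quota", "namespace": "tenant-a"},
    "spec": {
        "hard": {
            "pods": "10",
            "requests.cpu": "1",
            "requests.memory": "10Gi",
        }
    },
}


def tracked_resources(quota: dict) -> list[str]:
    """Return the resource names this quota actually measures."""
    return sorted(quota["spec"]["hard"])


print(tracked_resources(tenant_quota))
```

Anything that does not appear in that printed list is simply not counted by this particular quota.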
ResourceQuotas are only capable of measuring the resources they are configured to measure. Events are not one of these resources.
Events as a shared cluster resource, from a single tenant? Glad you asked, because that is what CVE-2026-10533 is all about.
Enter the Exploit
ServiceAccounts. But the simple, repeatable, no-login deployment pipeline trigger is just a click away.
By using the ServiceAccount that comes standard with every new OpenShift project/namespace (account name: "deployer") to create short-lived Pods within allowed CPU/memory ResourceQuotas, it is possible to generate ~130,000 Events in under three hours from a single namespace - enough to severely degrade API server performance on new OpenShift clusters, and completely topple clusters that have been through more than a couple of in-place major version upgrades.
Events and API server requests.
ResourceQuota found in enterprises of:
- Number of Pods: Each child Pod sleeps for a very short duration (< 150ms) on start. Additionally, with the restartPolicy: Never flag set, the Pod gets "forgotten" about by the API server. "Completed" Pods do not count toward the overall project Pod count, unless a very specific configuration setting is enabled (but most organizations would not want this enabled anyway, as it would block new CronJob Pods from starting).
- CPU request: each child Pod requests 100m of CPU, and is forgotten about upon Completion.
- Memory request: each child Pod requests 128Mi of memory, and is forgotten about upon Completion.
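A quick back-of-the-envelope check on the figures above, as a sketch. It assumes the example quota from earlier in the post (10 Pods, 1 CPU, 10 GiB of memory) and the per-Pod requests just listed; the constant and function names are made up for illustration.

```python
# Example quota and per-Pod requests, taken from the figures in this post.
QUOTA_PODS = 10
QUOTA_CPU_MILLICORES = 1000       # 1 CPU
QUOTA_MEMORY_MIB = 10 * 1024      # 10 GiB
POD_CPU_MILLICORES = 100          # 100m
POD_MEMORY_MIB = 128              # 128Mi


def max_concurrent_pods() -> int:
    """How many child Pods fit inside the quota at one instant."""
    return min(
        QUOTA_PODS,
        QUOTA_CPU_MILLICORES // POD_CPU_MILLICORES,
        QUOTA_MEMORY_MIB // POD_MEMORY_MIB,
    )


def average_event_rate(events: int, hours: float) -> float:
    """Average Events per second over the given window."""
    return events / (hours * 3600)


print(max_concurrent_pods())                     # 10
print(round(average_event_rate(130_000, 3), 1))  # 12.0
```

The quota only caps what exists at one moment, not how many objects are created over time. Since the window was under three hours, roughly 12 Events per second is a lower bound on the average rate.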
Exploit Code
Exploit Impact
Consideration for Long-Running/Aged Clusters
etcd for any kind of information.
Nodes, 20,000 Pods, maybe 25 operators, and 600 different Namespaces/tenants. That is a lot of objects. Old MachineConfigs and certs from cluster versions long past may still be lingering around. User and Group objects to handle all the developer logins, RBAC configs flying around, Node states switching from Healthy to NodeDiskPressure and back again, images being pulled, Pods completing runs, the list goes on. etcd is tracking all of this, and it only gets more bogged down the more you add to the cluster.
etcd capacity by 86% through exploiting CVE-2026-10533. Your cluster will crash.
Weathering the Storm
- Turn off the "DeploymentConfig" capability at cluster build time. With it off, the default "deployer" ServiceAccount is NOT created, so an attacker would need the ability both to deploy a Pod and to run a Pod that can itself deploy Pods (a DoS control Pod). The deployer account takes care of creating Pods for us; if it went away, the attacker would need more permissions. However, this additional privilege may not be difficult to acquire if they already have an account that can deploy a Pod: simply running the control Pod as that initial deploying account may work immediately.
- Setting ResourceQuota/v1/spec.scopes to NotTerminating and applying it to all namespaces in a cluster would prevent this exploit, and allow ResourceQuotas to track Completed Pods. There is a very large tradeoff, however: a normal CronJob object spawns "Job" Pods, which do their job and then Complete. If you only allow 10 Completed Pods per namespace, a tenant could run only 10 Jobs before having to clean up (or else the next Job would fail). This is most likely not something a Platform Engineering team would want to enable, as it would impose a lot of restrictions on developers and users of OpenShift.
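Independent of the two mitigations above, Event volume per namespace can at least be measured. The following is a minimal sketch, not a vetted detection rule. It assumes Event objects have already been fetched as parsed JSON (for example, the items list from kubectl get events --all-namespaces -o json), and the function name, sample data, and threshold are hypothetical.

```python
from collections import Counter


def noisy_namespaces(events: list[dict], threshold: int) -> dict[str, int]:
    """Return namespaces whose Event count is at or above the threshold."""
    counts = Counter(
        e.get("metadata", {}).get("namespace", "<none>") for e in events
    )
    return {ns: n for ns, n in counts.items() if n >= threshold}


# Tiny hand-made sample; real input would come from the cluster API.
sample = (
    [{"metadata": {"namespace": "tenant-a"}, "reason": "Created"}] * 5
    + [{"metadata": {"namespace": "tenant-b"}, "reason": "Pulled"}] * 2
)

print(noisy_namespaces(sample, threshold=4))  # {'tenant-a': 5}
```

A sensible threshold depends on cluster size and normal workload churn, so it would need tuning against a baseline before being used for alerting.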