
Wazuh on K8s: 7 Frameworks, Auto-Remediation, One Chart


Most organizations running Wazuh on Kubernetes are stitching together five separate tools to get compliance coverage that still has gaps. One tool scans CIS benchmarks. Another handles admission policies. A third runs vulnerability checks. Remediation is manual. Reports are spreadsheets. None of them share context, and when an auditor asks whether a single kubelet misconfiguration also fails your NIST, PCI, and HIPAA controls — nobody can answer without hours of cross-referencing.

I built a single Helm chart that replaces that entire stack. 167 checks across 7 compliance frameworks. Admission webhook enforcement. Automated remediation. Runtime threat detection mapped to MITRE ATT&CK. Prometheus metrics. Grafana dashboards. Audit-ready compliance reports. One helm install, full lifecycle coverage.

This post is a technical deep-dive into the architecture, the cross-framework mapping that makes it work, the runtime threat detection engine, and every enterprise feature under the hood.


The Problem With Current Approaches

Here’s what most teams are running today:

  1. A Wazuh or Falco DaemonSet for detection
  2. OPA/Gatekeeper or Kyverno for admission policies
  3. A separate CIS scanner (kube-bench) as a CronJob
  4. Manual remediation or Ansible playbooks triggered by humans
  5. Compliance reports generated in spreadsheets by hand

These tools don’t share context. A CIS check that fails on the kubelet doesn’t automatically map to the NIST 800-53 control it satisfies. The admission webhook doesn’t know what the SCA scanner found. The remediation is always manual.

The result: compliance drift, audit fatigue, and a false sense of security.


Architecture: One Chart, Full Lifecycle

The chart deploys a Wazuh agent DaemonSet across every node and wraps it in four enforcement layers:

                  ┌──────────────────────────┐
   Deploy ───────►│   Admission Webhook      │ ◄── PREVENT
                  │   Block before it runs   │
                  └────────────┬─────────────┘

                  ┌────────────▼─────────────┐
   Runtime ──────►│   Wazuh Agent DaemonSet  │ ◄── DETECT
                  │   SCA + FIM + Vuln + RT  │
                  └────────────┬─────────────┘

                  ┌────────────▼─────────────┐
   CronJob ──────►│   Auto-Remediation       │ ◄── FIX
                  │   File perms, kernel,    │
                  │   SSH, auditd, modules   │
                  └────────────┬─────────────┘

                  ┌────────────▼─────────────┐
   Scheduled ────►│   Compliance Reports     │ ◄── PROVE
                  │   JSON / HTML / CSV      │
                  │   S3 upload + email      │
                  └──────────────────────────┘

Prevent — A ValidatingWebhookConfiguration intercepts every pod, deployment, statefulset, daemonset, job, and cronjob at admission time. It blocks privileged containers, host namespace access, privilege escalation, :latest tags, missing required labels, and unauthorized registries. It does this before the workload ever touches a node.

Detect — Wazuh agents run SCA (Security Configuration Assessment) scans against seven policy files simultaneously. Each policy file is a set of checks written in Wazuh’s SCA YAML format, with compliance cross-references baked into every check.

Fix — A CronJob runs every 6 hours (configurable) and remediates findings automatically: file permissions, kernel sysctl parameters, SSH hardening, unused kernel modules, and auditd rules. It starts in dry-run mode by default — it logs what it would fix without touching anything.

Prove — A weekly CronJob generates compliance reports in JSON, HTML, and CSV. It can upload directly to S3 or email stakeholders. The HTML report is audit-ready with framework breakdowns and pass/fail summaries.
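
To make the reporting step concrete, here is a minimal sketch of turning raw scan results into JSON and CSV summaries. The input shape (tuples of framework, check ID, and pass/fail) and the function names are assumptions made for illustration; the chart's actual report generator is not shown in this post.

```python
import csv
import io
import json


def summarize(results):
    """Aggregate (framework, check_id, passed) tuples into per-framework counts."""
    summary = {}
    for framework, _check_id, passed in results:
        row = summary.setdefault(framework, {"passed": 0, "failed": 0})
        row["passed" if passed else "failed"] += 1
    return summary


def to_json(summary):
    """Render the summary as pretty-printed JSON with stable key order."""
    return json.dumps(summary, indent=2, sort_keys=True)


def to_csv(summary):
    """Render the summary as CSV, one row per framework, sorted by name."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["framework", "passed", "failed"])
    for framework in sorted(summary):
        counts = summary[framework]
        writer.writerow([framework, counts["passed"], counts["failed"]])
    return buf.getvalue()
```

Keeping aggregation separate from rendering means an HTML renderer could reuse the same summary without touching the counting logic.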


Seven Frameworks, One Scan

Here’s what a single SCA scan evaluates:

| Framework | Controls | What It Checks |
|---|---|---|
| CIS Kubernetes v1.8.0 | 31 (L1 + L2) | API server config, kubelet hardening, etcd security, RBAC, network policies |
| CIS Linux v2.0.0 | 36 (L1 + L2) | Filesystem, network params, SSH, logging, file permissions, password policy |
| NIST 800-53 Rev5 | 24 | AC, AU, CM, IA, SC, SI control families mapped to K8s and OS checks |
| PCI-DSS v4.0 | 20 | Network segmentation, encryption at rest/transit, access control, FIM, audit trails |
| HIPAA §164.312 | 16 | Access control, audit controls, integrity, authentication, transmission security |
| SOC2 Type II | 18 | CC6-CC8 trust criteria, availability, change management |
| Runtime Threats | 22 | MITRE ATT&CK mapped: cryptomining, container escape, reverse shells, persistence |

Total: 167 checks per scan cycle.

The critical insight is cross-framework mapping. Take kubelet anonymous authentication as an example:

- id: 50400
  title: "IA-2: kubelet anonymous auth disabled"
  compliance:
    - nist_800_53: ["IA-2"]
    - cis: ["4.2.1"]

This single check satisfies:

  • CIS Kubernetes 4.2.1 (Worker Node — Kubelet)
  • NIST 800-53 IA-2 (Identification and Authentication)
  • PCI-DSS 2.2.1 (Secure default configurations)
  • HIPAA 164.312(a)(2)(i) (Unique user identification)
  • SOC2 CC6.1 (Logical access security)

One finding, five frameworks addressed. That’s the kind of efficiency that auditors and compliance teams actually need, and that no single open-source tool provides out of the box.
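
The mapping idea can be sketched as a simple lookup: given the IDs of failed checks, list every framework control they touch. The control references below are copied from the list above; the dictionary layout, key names, and function name are illustrative assumptions, not the chart's schema.

```python
from collections import defaultdict

# Framework references for check 50400, taken from the list above.
# The structure and key names are hypothetical, chosen for illustration.
CHECK_MAPPINGS = {
    50400: {
        "cis_kubernetes": ["4.2.1"],
        "nist_800_53": ["IA-2"],
        "pci_dss": ["2.2.1"],
        "hipaa": ["164.312(a)(2)(i)"],
        "soc2": ["CC6.1"],
    },
}


def controls_affected(failed_check_ids, mappings=CHECK_MAPPINGS):
    """Group the controls touched by a set of failed checks, per framework."""
    affected = defaultdict(set)
    for check_id in failed_check_ids:
        # Unknown check IDs contribute nothing rather than raising.
        for framework, controls in mappings.get(check_id, {}).items():
            affected[framework].update(controls)
    return {fw: sorted(ctrls) for fw, ctrls in affected.items()}
```

Calling `controls_affected([50400])` returns one entry per framework, which is the question an auditor usually asks: which controls does this one failure put at risk?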


Runtime Threat Detection: MITRE ATT&CK Mapped

This is the part that goes beyond compliance into active threat hunting. The runtime policy file checks for indicators of compromise that map directly to MITRE ATT&CK techniques:

Cryptomining (T1496)

- id: 90100
  title: "Cryptominer process detection"
  condition: none
  rules:
    - "p:xmrig"
    - "p:minerd"
    - "p:cpuminer"
    - "p:ethminer"
    - "p:cgminer"
    - "p:nbminer"
    - "p:t-rex"
    - "p:gminer"

This doesn’t just check for xmrig. The excerpt above shows eight process names; the full check covers 11 known miners, stratum protocol connections on common mining pool ports (3333, 4444, 8333, 14444, 45700), and processes consuming more than 90% CPU as a behavioral indicator.
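
For readers unfamiliar with `condition: none`, the check passes only when none of its rules match. A rough Python equivalent of that logic, using the process names from the excerpt and the ports named in the text, might look like this. The function names are hypothetical and this is not how the Wazuh agent evaluates rules internally.

```python
# Process names from the excerpt above (the text says the full check covers 11).
MINER_NAMES = {
    "xmrig", "minerd", "cpuminer", "ethminer",
    "cgminer", "nbminer", "t-rex", "gminer",
}

# Mining pool ports named in the text.
MINING_POOL_PORTS = {3333, 4444, 8333, 14444, 45700}


def check_passes(process_names):
    """Mirror `condition: none`: pass only if no listed miner is running."""
    return not any(name in MINER_NAMES for name in process_names)


def suspicious_ports(remote_ports):
    """Return the remote ports that match a known mining pool port."""
    return sorted(set(remote_ports) & MINING_POOL_PORTS)
```

Name matching alone is easy to evade by renaming a binary, which is why the port and CPU signals described above matter as secondary indicators.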

Container Escape (T1611)

- id: 90201
  title: "Container escape — Host mount abuse"
  condition: none
  rules:
    - "c:mount -> r:docker.sock"
    - "c:mount -> r:containerd.sock"

Detects containers mounting the Docker socket or containerd socket (the most common container escape vector), nsenter usage, cgroup release_agent abuse (CVE-2022-0492 style), and running privileged containers.

Reverse Shells (T1059)

Detects shell processes with network socket redirections, ncat/nc with -e flags, and socat TCP connections. These are the exact patterns you’d see in a real post-exploitation scenario.
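
As a rough illustration of those three patterns, the sketch below matches them against process command lines with regular expressions. The regexes and the function name are my own assumptions for illustration; the policy file expresses its rules in Wazuh's SCA syntax, not Python.

```python
import re

# Illustrative patterns for the three behaviors described above.
REVERSE_SHELL_PATTERNS = [
    re.compile(r"/dev/tcp/"),                       # shell redirected to a TCP socket
    re.compile(r"\b(?:nc|ncat)\b.*\s-e\b"),         # nc/ncat executing a program
    re.compile(r"\bsocat\b.*\btcp", re.IGNORECASE),  # socat with a TCP endpoint
]


def looks_like_reverse_shell(cmdline):
    """Return True if the command line matches any reverse-shell pattern."""
    return any(p.search(cmdline) for p in REVERSE_SHELL_PATTERNS)
```

Patterns like these produce false positives on legitimate socat or netcat usage, so in practice they are better treated as triage signals than as verdicts.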

Credential Harvesting (T1552)

Checks for processes reading Kubernetes ServiceAccount tokens from /proc, connections to the cloud metadata endpoint (169.254.169.254), and SSH private key scanning across /home.

Persistence (T1053, T1543, T1554)

Detects recently created cron jobs, modified system binaries, and new systemd service files, each created or modified within the last 60 minutes, which is a strong indicator of an active intrusion.
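
The 60-minute window can be sketched as a walk over the watched directories that compares each file's modification time to a cutoff. This is a simplified stand-in that uses mtime as a proxy for "recently created or modified"; the function name and parameters are hypothetical.

```python
import os
import time


def recently_changed(paths, window_minutes=60, now=None):
    """Return files under `paths` whose mtime falls within the window."""
    now = time.time() if now is None else now
    cutoff = now - window_minutes * 60
    hits = []
    for root in paths:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                try:
                    # lstat avoids following symlinks out of the watched tree.
                    if os.lstat(full).st_mtime >= cutoff:
                        hits.append(full)
                except OSError:
                    continue  # file vanished or is unreadable; skip it
    return sorted(hits)
```

A call such as `recently_changed(["/etc/cron.d", "/etc/systemd/system"])` would cover two of the persistence locations mentioned above. Package upgrades also touch these paths, so hits are worth correlating with known maintenance windows.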


The Admission Webhook: Shift-Left Enforcement

Detection is reactive. The admission webhook is proactive — it prevents non-compliant workloads from ever running.

It’s deployed as a separate HA deployment (default 2 replicas with topology spread constraints) with its own ServiceAccount, RBAC, NetworkPolicy, PDB, and cert-manager TLS certificate.

The policy engine evaluates 13 rules, driven by a policy configuration like this:

{
  "blockPrivileged": true,
  "blockHostNetwork": true,
  "blockHostPID": true,
  "blockHostIPC": true,
  "requireRunAsNonRoot": true,
  "blockPrivilegeEscalation": true,
  "blockLatestTag": true,
  "requireImageDigest": false,
  "requiredLabels": ["app.kubernetes.io/name", "app.kubernetes.io/version"],
  "blockedImageRegistries": [],
  "allowedImageRegistries": []
}

The webhook itself is self-hardened:

  • Non-root (runs as UID 65534)
  • Read-only root filesystem
  • All capabilities dropped
  • Seccomp RuntimeDefault
  • NetworkPolicy restricting traffic to only the API server
  • Failure policy defaults to Ignore (fail-open) so a webhook outage doesn’t block deployments, switchable to Fail for strict environments

The exemption system is critical for production. The chart’s own namespace and service account are automatically exempted, along with kube-system, kube-public, and kube-node-lease. You can’t accidentally lock yourself out.
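
To show how a policy like the one above translates into admission decisions, here is a minimal sketch that evaluates a Pod manifest against a subset of those settings and honors the namespace exemptions. The function names and the violation messages are assumptions for illustration; the webhook's real implementation is not shown in this post.

```python
# Subset of the policy settings shown above.
POLICY = {
    "blockPrivileged": True,
    "blockHostNetwork": True,
    "blockLatestTag": True,
    "requiredLabels": ["app.kubernetes.io/name", "app.kubernetes.io/version"],
}

# System namespaces the text lists as automatically exempted.
EXEMPT_NAMESPACES = {"kube-system", "kube-public", "kube-node-lease"}


def uses_latest_tag(image):
    """True when the image has no digest and its tag is missing or 'latest'."""
    if "@" in image:                 # pinned by digest
        return False
    last = image.rsplit("/", 1)[-1]  # drop registry (and its port) and path
    if ":" not in last:
        return True                  # no tag resolves to :latest
    return last.rsplit(":", 1)[1] == "latest"


def violations(pod, policy=POLICY):
    """Return human-readable policy violations for a Pod manifest (a dict)."""
    meta = pod.get("metadata") or {}
    spec = pod.get("spec") or {}
    if meta.get("namespace") in EXEMPT_NAMESPACES:
        return []
    found = []
    if policy.get("blockHostNetwork") and spec.get("hostNetwork"):
        found.append("hostNetwork is not allowed")
    labels = meta.get("labels") or {}
    for label in policy.get("requiredLabels", []):
        if label not in labels:
            found.append(f"missing label {label}")
    for container in spec.get("containers", []):
        name = container.get("name", "?")
        security = container.get("securityContext") or {}
        if policy.get("blockPrivileged") and security.get("privileged"):
            found.append(f"container {name} is privileged")
        if policy.get("blockLatestTag") and uses_latest_tag(container.get("image", "")):
            found.append(f"container {name} uses a :latest image")
    return found
```

An empty list means the request would be admitted; anything else would become the denial message returned to the API server.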


Auto-Remediation: From Detection to Action

The remediation engine runs as a privileged CronJob with host filesystem access. Here’s what it fixes:

File Permissions — Sets /etc/passwd to 644, /etc/shadow to 640, /etc/group to 644, /etc/gshadow to 640. For Kubernetes nodes, it also enforces 600 permissions and root:root ownership on kube-apiserver.yaml, kube-controller-manager.yaml, kube-scheduler.yaml, and etcd.yaml.

Kernel Parameters — Applies sysctl hardening:

net.ipv4.conf.all.send_redirects=0
net.ipv4.conf.default.send_redirects=0
net.ipv4.conf.all.accept_source_route=0
net.ipv4.conf.default.accept_source_route=0
net.ipv4.conf.all.accept_redirects=0
net.ipv4.conf.default.accept_redirects=0
net.ipv4.tcp_syncookies=1
net.ipv6.conf.all.accept_ra=0
net.ipv6.conf.default.accept_ra=0

It persists changes to /etc/sysctl.conf so they survive reboots.
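
Persisting settings safely means the job has to be idempotent: running it twice must not leave duplicate or conflicting lines. A minimal sketch of that merge, assuming a plain `key=value` sysctl.conf and using three of the parameters above, could look like this. The function name is hypothetical, and the remediation job may well use a different approach.

```python
# Three of the hardening parameters listed above.
HARDENING = {
    "net.ipv4.conf.all.send_redirects": "0",
    "net.ipv4.tcp_syncookies": "1",
    "net.ipv6.conf.all.accept_ra": "0",
}


def merge_sysctl(conf_text, desired=HARDENING):
    """Return sysctl.conf text with each `desired` key set exactly once."""
    remaining = dict(desired)
    out = []
    for line in conf_text.splitlines():
        stripped = line.strip()
        is_setting = (
            stripped
            and not stripped.startswith(("#", ";"))
            and "=" in stripped
        )
        if is_setting:
            key = stripped.split("=", 1)[0].strip()
            if key in desired:
                if key in remaining:
                    # Rewrite the first occurrence in place with the desired value.
                    out.append(f"{key}={remaining.pop(key)}")
                continue  # drop later duplicates of a managed key
        out.append(line)
    # Append managed keys that were not present at all.
    out.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"
```

Feeding the output back into `merge_sysctl` returns it unchanged, which is the property you want from a job that runs every few hours.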

SSH Hardening — Ensures PermitRootLogin no, PermitEmptyPasswords no, MaxAuthTries 4, ClientAliveInterval 300, ClientAliveCountMax 3, LoginGraceTime 60.

Kernel Modules — Disables cramfs, squashfs, and udf by writing to /etc/modprobe.d/cis-hardening.conf.

Auditd Rules — Adds watch rules for /etc/passwd, /etc/shadow, /etc/group, and /etc/gshadow.

The dry-run mode is essential. On first deployment, it logs every change it would make without touching anything:

[DRY-RUN] Would execute: chmod 640 /host/etc/shadow
[DRY-RUN] Would execute: sysctl -w net.ipv4.conf.all.send_redirects=0
[DRY-RUN] Would execute: echo 'PermitRootLogin no' >> /host/etc/ssh/sshd_config

When you’re ready to go live, set autoRemediation.dryRun to false. With notifications enabled, the job sends a Slack notification with a count of changes made per node.
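
The dry-run pattern itself is simple: compute the change, then either log it or apply it. Here is a minimal sketch for the file-permission case. The `[DRY-RUN] Would execute:` prefix follows the log format above; the `Executed:` prefix and the function name are assumptions made for illustration.

```python
import os
import stat


def enforce_mode(path, mode, dry_run=True):
    """Bring `path` to `mode`; in dry-run, only report the intended change.

    Returns the log line, or None when the file already complies.
    """
    current = stat.S_IMODE(os.stat(path).st_mode)
    if current == mode:
        return None
    command = f"chmod {mode:o} {path}"
    if dry_run:
        return f"[DRY-RUN] Would execute: {command}"
    os.chmod(path, mode)
    # "Executed:" is a hypothetical prefix; the real job's wording is not shown.
    return f"Executed: {command}"
```

Defaulting `dry_run` to `True` mirrors the chart's default: a caller has to opt in explicitly before anything on the host changes.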


Observability: Prometheus + Grafana

Every agent pod runs a metrics sidecar exporting six metrics:

| Metric | Type | Description |
|---|---|---|
| wazuh_agent_up | gauge | Is the agent process running (0/1) |
| wazuh_sca_checks_passed | gauge | Number of SCA checks currently passing |
| wazuh_sca_checks_failed | gauge | Number of SCA checks currently failing |
| wazuh_fim_events_total | counter | Total file integrity change events |
| wazuh_vulnerabilities_detected | gauge | Current vulnerability count |
| wazuh_alerts_total | counter | Total alerts generated |

The PrometheusRule defines six alerts:

  • WazuhAgentDown — Agent offline for 5+ minutes (critical)
  • WazuhHighSCAFailureRate — >30% of checks failing (warning)
  • WazuhCriticalSCAFailures — >50% of checks failing (critical)
  • WazuhVulnerabilitiesDetected — >50 vulnerabilities on a node (warning)
  • WazuhFIMSpikeDetected — Unusual rate of file changes (warning)
  • WazuhAlertStorm — >50 alerts/sec indicating an active incident (critical)

The Grafana dashboard is auto-discovered via sidecar label and shows: agent status, compliance score gauge, SCA pass/fail per node, vulnerability trends, FIM event rate, and alert rate with threshold highlighting.
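
The two SCA alerts are driven by a failure rate derived from the passed and failed gauges. A small sketch of that arithmetic, using the thresholds from the alert list, is below. The function name is hypothetical, and in Prometheus both rules would fire above 50%, whereas this sketch returns only the more severe one.

```python
def sca_alert(passed, failed):
    """Map pass/fail counts to the most severe matching SCA alert, if any."""
    total = passed + failed
    if total == 0:
        return None  # no scan data yet; avoid dividing by zero
    failure_rate = failed / total
    if failure_rate > 0.50:
        return "WazuhCriticalSCAFailures"
    if failure_rate > 0.30:
        return "WazuhHighSCAFailureRate"
    return None
```

With 167 checks per scan, the warning threshold is crossed at 51 failures and the critical threshold at 84.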


Self-Hardening: The Chart Secures Itself

A security chart that isn’t itself hardened is a joke. This chart practices what it preaches:

  • NetworkPolicy — Agent pods can only reach the Wazuh manager, DNS, and the Kubernetes API. Webhook pods only accept traffic from the API server.
  • PodDisruptionBudget — Maintains 50% agent availability during rolling updates and node drains.
  • Seccomp — RuntimeDefault profile on all pods.
  • Secret management — Registration passwords are stored in Kubernetes Secrets with helm.sh/resource-policy: keep. Supports external secret references.
  • Config checksums — DaemonSet pods auto-restart when ConfigMaps change. No manual rollout needed.
  • cert-manager integration — Webhook TLS and optional agent-to-manager mTLS via cert-manager Certificates with ECDSA P-256 keys.
  • Manager HA — Supports multiple manager endpoints with automatic failover.
  • Values schema validation — JSON Schema catches misconfiguration before helm install runs.
  • Priority class — Agents run as system-node-critical so they’re the last thing evicted under resource pressure.
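
The config-checksum item in the list above relies on a common pattern: hash the configuration and put the hash in the pod template, so any config change alters the template and triggers a rollout. Helm charts usually do this in the template itself; the Python below only illustrates the idea. The annotation key `checksum/config` is a widespread convention and an assumption here, since the chart's actual key is not shown.

```python
import hashlib
import json


def config_checksum(configmap_data):
    """Stable SHA-256 over ConfigMap data, independent of key order."""
    canonical = json.dumps(configmap_data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def pod_annotations(configmap_data):
    """Build the pod-template annotation that changes whenever the config does."""
    # "checksum/config" is an assumed key name, used here for illustration.
    return {"checksum/config": config_checksum(configmap_data)}
```

Because the hash is computed over a canonical form, reordering keys does not cause a needless restart, while any change to a value does.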

Deploying It

Minimal deployment with CIS + NIST (enabled by default):

helm install wazuh-hardening ./wazuh-k8s-hardening \
  --namespace wazuh-system --create-namespace \
  --set manager.host=wazuh-manager.wazuh.svc.cluster.local \
  --set manager.registrationPassword=YOUR_PASSWORD

Full enterprise deployment:

global:
  clusterName: "prod-us-east-1"
  environment: "production"
  organization: "Your Org"

manager:
  host: "wazuh-manager.wazuh.svc.cluster.local"
  existingSecret: "wazuh-auth"
  failover:
    enabled: true
    hosts:
      - host: "wazuh-manager-2.wazuh.svc.cluster.local"

compliance:
  cisKubernetes:
    profile: "L2"
  cisLinux:
    profile: "L2"
  nist80053:
    enabled: true
  pciDss:
    enabled: true
  hipaa:
    enabled: true
  soc2:
    enabled: true

admissionWebhook:
  enabled: true
  failurePolicy: "Fail"

autoRemediation:
  enabled: true
  dryRun: false
  notifications:
    enabled: true
    slackWebhookUrl: "https://hooks.slack.com/services/..."

Why Wazuh

I chose Wazuh as the engine for this because it’s the only open-source platform that can run SCA, FIM, vulnerability detection, log collection, rootcheck, and active response from a single agent binary. Falco does runtime detection well but doesn’t do compliance scanning. kube-bench does CIS but nothing else. OPA does admission but doesn’t touch host-level hardening.

Wazuh’s SCA engine accepts custom YAML policy files with compliance cross-references baked into every check. That’s the capability that makes multi-framework mapping possible without building a custom engine from scratch. The agent is lightweight enough to run as a DaemonSet without starving your workloads, and the manager aggregates findings across every node into a single pane of glass.

This chart extends Wazuh’s capabilities into areas it doesn’t cover natively: admission-time enforcement, automated remediation, Prometheus-native observability, and scheduled compliance reporting. It’s what Wazuh should ship as a reference Kubernetes deployment.


Resources