CNF lifecycle, Helm charts, K8s Operators, CPU Manager, geo-redundancy patterns, zero-downtime upgrades, production hardening
1. What Is Cloud-Native 5GC — The Simple Version
Cloud-native 5GC means running NFs as containers on Kubernetes — not just ported to containers, but designed from the ground up to leverage K8s primitives: horizontal pod scaling, rolling upgrades, health probes, ConfigMap-driven configuration, and Operator-managed lifecycle. The difference from just “running in containers” is that cloud-native NFs are stateless where possible, externalise state to persistent volumes or databases, and fail gracefully without cascading.
In practice, cloud-native 5GC on Kubernetes is harder than running web applications on Kubernetes. The latency requirements are tighter, the NIC requirements are specialised, and “stateless” is a relative concept — an SMF pod in the middle of programming 50,000 PFCP sessions cannot just be killed and replaced without consequences.
| Standards Reference |
| GSMA NG.126 — Cloud Infrastructure Reference Model for 5GC |
| ETSI GS NFV-IFA 036 — Kubernetes-based virtualisation requirements |
| 3GPP TS 29.500 — Technical Realization of Service-Based Architecture (SBI on K8s) |
2. Architecture — Kubernetes 5GC Cluster Design
| Node Type | Role | Resource Profile | NF Workloads |
| Master nodes (3×) | K8s control plane: etcd, API server, scheduler, controller manager | 4–8 vCPU, 16–32 GB RAM | No NF workloads — control plane only. 3 nodes for etcd quorum. |
| Signalling workers | AMF, SMF, PCF, UDM, AUSF, NRF, CHF | 32–64 vCPU, 256–512 GB RAM, 25 GbE NIC | CPU Manager policy=static for AMF/SMF. Guaranteed QoS for all. |
| UPF workers | UPF only — dedicated, isolated | 64 vCPU, 256 GB RAM + hugepages, 2× 100 GbE SR-IOV | Taints/Tolerations to prevent non-UPF scheduling. Hugepages pre-allocated at node boot. |
| Storage workers (optional) | Ceph OSD or external NVMe for persistent volumes | High I/O NVMe, dedicated NICs | UDR/CHF database persistent volumes. Avoid co-locating with NF workers. |
| OAM workers | Prometheus, Grafana, logging (Loki/ELK), CI/CD agents | Standard compute | Monitoring must not compete with NF resources. Separate node pool. |
Table 1 — K8s cluster node role design for 5GC. Separating UPF workers from signalling workers is the single most important cluster topology decision.
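The UPF isolation in Table 1 depends on how Kubernetes matches taints against tolerations. A minimal Python sketch of that matching rule follows. The taint `workload=upf:NoSchedule` is a hypothetical example, not a key taken from any vendor chart, and the match logic is simplified from the real scheduler.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str  # "NoSchedule", "PreferNoSchedule" or "NoExecute"

@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key with operator "Exists" matches every taint
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches every effect

def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Simplified version of the Kubernetes taint/toleration match."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.operator == "Exists":
        return tol.key == "" or tol.key == taint.key
    return tol.key == taint.key and tol.value == taint.value

def schedulable(tolerations, taints) -> bool:
    """A pod can land on a node only if every hard taint is tolerated."""
    blocking = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in blocking)

# Hypothetical taint applied to the dedicated UPF workers.
UPF_NODE_TAINTS = [Taint("workload", "upf", "NoSchedule")]
UPF_POD_TOLERATIONS = [Toleration("workload", "Equal", "upf", "NoSchedule")]
AMF_POD_TOLERATIONS = []  # signalling pods carry no UPF toleration

print(schedulable(UPF_POD_TOLERATIONS, UPF_NODE_TAINTS))  # UPF pod is admitted
print(schedulable(AMF_POD_TOLERATIONS, UPF_NODE_TAINTS))  # AMF pod is kept off
```

A toleration only admits UPF pods to the tainted nodes. A nodeSelector or node affinity is still needed to stop them from landing on other node pools.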
3. How It Works — NF Lifecycle on Kubernetes
Helm Chart Deployment
5GC NFs are deployed via Helm charts provided by the NF vendor. The operator creates a values.yaml override file specifying: PLMN IDs (MCC/MNC), TAI list, DNN configuration, NRF endpoint, N2/N3 interface IPs, resource limits (CPU, memory, hugepages), replica counts, storage class, and TLS certificate references. The chart deploys all K8s objects: Deployment or StatefulSet, Services, ConfigMaps, Secrets, NetworkAttachmentDefinitions (Multus), and HorizontalPodAutoscaler.
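As an illustration of the override file described above, the Python sketch below builds a values structure and checks it before it is committed. Every key name, hostname, and size here is hypothetical; real key names come from the vendor chart. The checks are examples of what a pre-apply validation might cover, not a complete list.

```python
import json

# Hypothetical override structure; real key names come from the vendor chart.
smf_values = {
    "plmn": {"mcc": "001", "mnc": "01"},
    "dnnList": ["internet", "ims"],
    "nrf": {"endpoint": "https://nrf.5gc.svc:8443"},
    "replicaCount": 3,
    "resources": {
        "requests": {"cpu": "8", "memory": "16Gi"},
        "limits": {"cpu": "8", "memory": "16Gi"},
    },
    "tls": {"secretName": "smf-sbi-tls"},
}

def validate(values: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    mcc, mnc = values["plmn"]["mcc"], values["plmn"]["mnc"]
    if not (mcc.isdigit() and len(mcc) == 3):
        problems.append("MCC must be exactly 3 digits")
    if not (mnc.isdigit() and len(mnc) in (2, 3)):
        problems.append("MNC must be 2 or 3 digits")
    if values["replicaCount"] < 2:
        problems.append("replicaCount below 2 leaves no redundancy")
    res = values["resources"]
    if res["requests"] != res["limits"]:
        problems.append("requests != limits, pod will not get Guaranteed QoS")
    if not values["nrf"]["endpoint"].startswith("https://"):
        problems.append("NRF endpoint should use TLS")
    return problems

def render(values: dict) -> str:
    """Serialise as JSON. JSON is valid YAML, so the output can be saved as an override file."""
    return json.dumps(values, indent=2, sort_keys=True)

if __name__ == "__main__":
    issues = validate(smf_values)
    print(render(smf_values) if not issues else "\n".join(issues))
```

Running a check like this in CI on every pull request catches a malformed PLMN or a broken QoS class before the chart is ever applied.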
Pro tip: Version-control every values.yaml in Git, with PR review before any production apply. Running helm upgrade without version control is the fastest path to unrecoverable configuration drift. Every change should have a commit message and a rollback plan.
K8s Operators for 5GC NF Lifecycle
Kubernetes Operators extend the K8s API with Custom Resource Definitions (CRDs) that represent 5GC-specific concepts. The NF vendor ships an Operator that watches these CRDs and manages the NF lifecycle with telecom-aware logic:
| Operator Function | What Vanilla K8s Cannot Do | What the Operator Does |
| Zero-downtime SMF upgrade | K8s rolling update kills old pods before sessions drain | Operator: drain sessions from old pod, wait for zero active sessions, then terminate. New pod starts accepting before old terminates. |
| UPF graceful drain | K8s terminationGracePeriodSeconds is a blunt timer | Operator: send PFCP Association Release to UPF, wait for SMF to migrate sessions to standby UPF, then terminate pod. |
| AMF context preservation | K8s Deployment has no concept of UE context | Operator: write AMF context (UE registrations) to PersistentVolume before pod terminates. New pod reads context on start. |
| NRF registration health | K8s readiness probe cannot check NRF registration state | Operator: verifies NF is registered in NRF before marking pod Ready. Dead pod deregisters from NRF on termination. |
| Certificate rotation | cert-manager (a cluster add-on, not part of vanilla K8s) rotates certs but cannot flush NF token caches | Operator: coordinates cert rotation across all pods in the NF cluster, flushes OAuth2 token caches. |
Table 2 — K8s Operator functions for 5GC. Without Operators, vanilla K8s lifecycle management causes session drops during every upgrade.
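The session-drain rows in Table 2 come down to a decision the Operator repeats on every reconcile pass. The sketch below shows one possible shape of that decision in Python. `PodState`, `Action`, and the 900-second default are illustrative assumptions, not any vendor's Operator API.

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    WAIT = "wait"              # sessions still draining, keep the pod alive
    TERMINATE = "terminate"    # no sessions left, safe to delete the pod
    FORCE_TERMINATE = "force"  # drain timeout hit, accept the session loss

@dataclass
class PodState:
    name: str
    active_sessions: int
    draining_since_s: float    # timestamp (seconds) when the drain started

def reconcile(pod: PodState, now_s: float, drain_timeout_s: float = 900.0) -> Action:
    """One reconcile pass for a pod that is already marked as draining."""
    if pod.active_sessions == 0:
        return Action.TERMINATE
    if now_s - pod.draining_since_s >= drain_timeout_s:
        return Action.FORCE_TERMINATE
    return Action.WAIT

if __name__ == "__main__":
    pod = PodState("smf-0", active_sessions=1200, draining_since_s=0.0)
    print(reconcile(pod, now_s=60.0))    # still draining
    print(reconcile(pod, now_s=900.0))   # timeout reached
```

The contrast with terminationGracePeriodSeconds is that the pod is released as soon as the session count reaches zero, and the timer only acts as an upper bound.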
Zero-Downtime Upgrade Procedure
Step 1 — Scale up: add new pods running the new NF version alongside the old pods. The new-version pods register in NRF with reduced weight, so they accept only a small percentage of traffic for verification.
Step 2 — Drain old pods: AMF/SMF Operator sets old pods to weight=0 in NRF (stops new UEs registering to old pods). Existing sessions continue on old pods until natural termination.
Step 3 — UPF drain (for UPF upgrades): SMF Operator migrates active PDU sessions from old UPF to standby UPF via PFCP Session Establishment on standby + PFCP Session Deletion on old. Brief per-session interruption (~100ms) during migration.
Step 4 — Terminate old pods after drain timeout (configurable, typically 5–15 minutes). Verify KPIs stable on new version pods.
Step 5 — Scale back to normal replica count. Remove temporary extra pods.
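To see what "reduced weight" means for traffic in Steps 1 and 2, the Python sketch below computes the expected share of new requests per pod. It assumes consumers select producers in proportion to a single weight value. Real NF profiles expose priority and capacity attributes, and actual selection behaviour is implementation-specific. Pod names and weights are made up for the example.

```python
def traffic_share(weights: dict) -> dict:
    """Expected fraction of new traffic per pod under weight-proportional selection."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one pod must have a non-zero weight")
    return {pod: w / total for pod, w in weights.items()}

# Step 1: three old pods at full weight, one new pod at reduced weight.
step1 = traffic_share({
    "smf-old-0": 100, "smf-old-1": 100, "smf-old-2": 100,
    "smf-new-0": 10,
})

# Step 2: old pods set to weight 0, new pods take all new sessions.
step2 = traffic_share({
    "smf-old-0": 0, "smf-old-1": 0, "smf-old-2": 0,
    "smf-new-0": 100, "smf-new-1": 100, "smf-new-2": 100,
})

if __name__ == "__main__":
    print(f"canary share in step 1: {step1['smf-new-0']:.1%}")
    print(f"old pod share in step 2: {step2['smf-old-0']:.1%}")
```

With these numbers the canary pod receives 10 out of 310 weight units, roughly 3% of new traffic, which is enough to verify KPIs without exposing most subscribers to the new version.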
4. Key Parameters and Technical Terms
| Term | Definition | 5GC Significance |
| CPU Manager policy=static | K8s kubelet policy that gives exclusive CPU cores to containers in Guaranteed QoS pods that request whole (integer) CPUs. | Required for AMF/SMF/UPF. Without the static policy, the OS scheduler migrates threads between CPUs, causing cache misses. |
| topologyManagerPolicy=single-numa-node | K8s kubelet policy that ensures CPU and memory allocations are from the same NUMA node. | Prevents cross-NUMA memory access. Must be set before deploying any 5GC NF pods. |
| NetworkAttachmentDefinition | Multus CRD that defines a secondary network interface for a pod (e.g., SR-IOV VF for UPF N3). | UPF pod spec references NAD for each secondary interface. Without NAD: UPF only has the primary CNI interface. |
| PodDisruptionBudget (PDB) | K8s policy limiting how many pods can be unavailable during voluntary disruptions. | Set for each NF: minAvailable=N-1. Prevents simultaneous eviction of all AMF/SMF pods during node maintenance. |
| StatefulSet vs Deployment | StatefulSet: stable pod names, stable PersistentVolumeClaims. Deployment: interchangeable pods with generated names and no per-pod volume. | SMF and UDM use StatefulSet: stable pod names for clustering, stable volumes for session/subscriber state. |
| HPA (Horizontal Pod Autoscaler) | Automatically scales pod replicas based on CPU, memory, or custom metrics. | Useful for AMF/PCF/NRF, which can scale out statelessly. SMF autoscaling is complex due to session state affinity. |
| liveness / readiness probes | K8s health checks. Liveness: restart if failed. Readiness: stop traffic if failed. | Configure readiness probe to check NRF registration, not just HTTP endpoint. Unhealthy NF deregistered from NRF. |
| helm upgrade --atomic | Helm flag: if the upgrade fails, Helm automatically rolls back to the previous release. | Use for all production NF upgrades. Prevents partial upgrades from leaving the cluster in an inconsistent state. |
Table 3 — K8s parameters for 5GC. CPU Manager + NUMA topology manager + SR-IOV are the platform trio that determines NF performance.
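Whether a pod actually gets pinned cores under the static CPU Manager policy depends on its resource spec. The Python sketch below applies the two conditions: the pod must be in the Guaranteed QoS class, and the CPU quantity must be a whole number. It is a simplification: memory quantities are compared as strings, and the pod is treated as a plain list of container resource dicts rather than a full pod spec.

```python
def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity ('2', '1.5', '500m') to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)

def is_guaranteed(containers: list) -> bool:
    """Guaranteed QoS: every container sets cpu and memory limits, and requests equal limits."""
    for c in containers:
        req, lim = c.get("requests", {}), c.get("limits", {})
        if "cpu" not in lim or "memory" not in lim:
            return False
        # Requests that are omitted default to the limits.
        if parse_cpu(req.get("cpu", lim["cpu"])) != parse_cpu(lim["cpu"]):
            return False
        if req.get("memory", lim["memory"]) != lim["memory"]:
            return False
    return True

def gets_exclusive_cpus(containers: list) -> bool:
    """True only if every container in the pod would be pinned to exclusive cores."""
    if not is_guaranteed(containers):
        return False
    return all(parse_cpu(c["limits"]["cpu"]).is_integer() for c in containers)

if __name__ == "__main__":
    smf = [{"requests": {"cpu": "8", "memory": "16Gi"},
            "limits": {"cpu": "8", "memory": "16Gi"}}]
    sidecar_heavy = [{"limits": {"cpu": "2500m", "memory": "4Gi"}}]
    print(gets_exclusive_cpus(smf))            # whole CPUs, Guaranteed
    print(gets_exclusive_cpus(sidecar_heavy))  # Guaranteed but fractional CPU
```

A common surprise is the second case: a pod can be Guaranteed and still run in the shared CPU pool because it requests a fractional CPU count.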
5. Common Issues in the Field
| Field Note: SMF Pod Rolling Update Drops Sessions — No Operator Installed |
| Operator ran helm upgrade on SMF without vendor K8s Operator deployed. |
| K8s default rolling update: sends SIGTERM to the old pod and force-kills it once terminationGracePeriodSeconds (default 30 s) expires, regardless of active sessions. |
| Old pod terminated mid-session: 15,000 active PDU sessions lost their PFCP state. |
| UEs experienced session drops lasting 3–5 minutes during the re-establishment storm. |
| Fix: deploy vendor SMF Operator before any upgrade. Operator manages graceful session drain. |
| Lesson: never run Helm upgrade on stateful 5GC NFs without the vendor Operator managing the lifecycle. |
| Field Note: All AMF Pods Evicted Simultaneously — Missing PodDisruptionBudget |
| Node maintenance required draining 2 K8s worker nodes. AMF had 4 pods, all on the 2 nodes being drained. |
| Both nodes were drained in parallel. With no PDB, nothing limits evictions, so all 4 AMF pods were evicted at once. |
| Network: total AMF unavailability for 90 seconds while pods rescheduled on other nodes. |
| All active UE registrations: NAS signalling timeout. Mass re-registration storm on recovery. |
| Fix: set PodDisruptionBudget minAvailable=2 for AMF. Spread AMF pods across nodes with podAntiAffinity. |
| Always test node drain procedure in staging before production maintenance. |
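The effect of the PDB in this field note can be reproduced with a small simulation. The Python sketch below models the eviction check (an eviction is allowed only if the healthy count stays at or above minAvailable). It assumes evicted pods are not rescheduled while the drain is running. Node and pod names are illustrative.

```python
def drain_nodes(pods_by_node: dict, nodes_to_drain: list, min_available: int) -> dict:
    """Simulate evictions under a PDB with the given minAvailable.

    Simplification: evicted pods are not rescheduled during the drain,
    so the healthy count only goes down. min_available=0 models "no PDB".
    """
    healthy = sum(len(p) for p in pods_by_node.values())
    evicted, blocked = [], []
    for node in nodes_to_drain:
        for pod in pods_by_node.get(node, []):
            if healthy - 1 >= min_available:
                healthy -= 1
                evicted.append(pod)
            else:
                blocked.append(pod)  # eviction refused; the drain has to retry later
    return {"evicted": evicted, "blocked": blocked, "healthy": healthy}

# Layout from the field note: 4 AMF pods, all on the two nodes being drained.
amf = {"node-a": ["amf-0", "amf-1"], "node-b": ["amf-2", "amf-3"], "node-c": []}

no_pdb = drain_nodes(amf, ["node-a", "node-b"], min_available=0)
with_pdb = drain_nodes(amf, ["node-a", "node-b"], min_available=2)

if __name__ == "__main__":
    print("no PDB:  ", no_pdb)
    print("with PDB:", with_pdb)
```

With minAvailable=2 the drain stalls after two evictions until replacement pods become Ready elsewhere. The PDB alone does not fix the placement problem, which is why the fix also adds podAntiAffinity.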
6. Troubleshooting
| Symptom | Root Cause | Check | Fix |
| Session drops during NF upgrade | No graceful drain — vanilla K8s rolling update | K8s events during upgrade; NF Operator deployment status | Deploy vendor NF Operator; use Operator-managed upgrade procedure |
| Mass session drop during node maintenance | PodDisruptionBudget not set — all pods evicted simultaneously | PDB config: kubectl get pdb -n 5gc; AMF pod distribution across nodes | Set PDB minAvailable; add podAntiAffinity to spread pods across nodes |
| NRF shows NF as registered but it is down | Readiness probe not tied to NRF registration, and the failed pod never deregistered, so NRF still lists a stale NF profile | NF readiness probe config; NRF NF profile list | Configure readiness probe to check NRF registration endpoint |
| Certificate expiry causes SBI failures | cert-manager rotated cert but NF pods not reloaded | K8s Secret last-updated timestamp; NF pod TLS cert expiry | Use vendor Operator for cert rotation — it coordinates pod reload and token cache flush |
| K8s upgrade breaks NF pods | K8s upgrade removes a deprecated API version that the NF CRD or manifests still use | kubectl get events; helm status reports the release as failed | Validate NF vendor K8s compatibility matrix before cluster upgrade. Test in staging. |
Table 4 — Cloud-native 5GC troubleshooting. Most issues are lifecycle management failures, not NF software bugs.
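The NRF-related row above suggests a readiness check that combines local health with registration state. A minimal Python sketch of that decision follows. `REGISTERED` is one of the NFStatus values defined in 3GPP TS 29.510. The function signature, the 10-second heartbeat default, and the "one missed heartbeat" margin are assumptions made for illustration. How a probe obtains these inputs is vendor-specific.

```python
from typing import Optional

# NFStatus value for a registered NF profile (3GPP TS 29.510).
NRF_REGISTERED = "REGISTERED"

def is_ready(http_ok: bool,
             nrf_status: Optional[str],
             seconds_since_heartbeat: float,
             heartbeat_timer_s: float = 10.0) -> bool:
    """Readiness decision combining local health with NRF registration.

    nrf_status is None when the NF has no profile in the NRF.
    """
    if not http_ok:
        return False
    if nrf_status != NRF_REGISTERED:
        return False
    # Assumed margin: tolerate one missed heartbeat before treating
    # the registration as stale.
    return seconds_since_heartbeat <= 2 * heartbeat_timer_s

if __name__ == "__main__":
    print(is_ready(True, "REGISTERED", 4.0))   # healthy and registered
    print(is_ready(True, None, 0.0))           # HTTP fine, but invisible to the core
    print(is_ready(True, "SUSPENDED", 4.0))    # registered profile is suspended
```

The second case is the one a plain HTTP health check misses: the pod answers locally but no other NF can discover it.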
7. Summary — Key Takeaways
| Topic | Key Takeaway |
| Vendor K8s Operator | Mandatory for stateful NFs (SMF, UPF). Vanilla K8s rolling update = session drops. Deploy Operator before first production upgrade. |
| PodDisruptionBudget | Set for every NF. minAvailable=N-1. Without PDB: node maintenance or a cluster upgrade can evict all pods of one NF simultaneously. |
| CPU Manager + NUMA | policy=static + topologyManagerPolicy=single-numa-node. Configure at node level before deploying NF pods. Changing the policy afterwards requires draining the node and restarting the kubelet, which restarts the NF pods on it. |
| PodAntiAffinity | Spread NF replicas across physical hosts and availability zones. Single host failure must not take down all AMF or NRF replicas. |
| Zero-downtime upgrade | Scale up new version → drain old → verify KPIs → terminate old. This is a 15–30 minute procedure per NF, not a 30-second kubectl apply. |
| GitOps for config | values.yaml in Git with PR review. Every production change has a commit, a review, and a rollback point. Always use helm upgrade --atomic. |
| Readiness probe | Check NRF registration in readiness probe, not just HTTP /healthz. An NF not in NRF is invisible to the rest of the core. |
Table 5 — Post 08 summary. Cloud-native 5GC on K8s is production-viable with the right Operator, PDB, and topology configuration.
