Cloud-Native 5GC on Kubernetes

CNF lifecycle, Helm charts, K8s Operators, CPU Manager, geo-redundancy patterns, zero-downtime upgrades, production hardening

1. What Is Cloud-Native 5GC — The Simple Version

Cloud-native 5GC means running NFs as containers on Kubernetes — not just ported to containers, but designed from the ground up to leverage K8s primitives: horizontal pod scaling, rolling upgrades, health probes, ConfigMap-driven configuration, and Operator-managed lifecycle. The difference from just “running in containers” is that cloud-native NFs are stateless where possible, externalise state to persistent volumes or databases, and fail gracefully without cascading.

In practice, cloud-native 5GC on Kubernetes is harder than running web applications on Kubernetes. The latency requirements are tighter, the NIC requirements are specialised, and “stateless” is a relative concept — an SMF pod in the middle of programming 50,000 PFCP sessions cannot just be killed and replaced without consequences.

3GPP Reference
GSMA NG.126 — Cloud Infrastructure Reference Model for 5GC
ETSI GS NFV-IFA 036 — Kubernetes-based virtualisation requirements
3GPP TS 29.500 — Technical Realization of Service-Based Architecture (SBI on K8s)

2. Architecture — Kubernetes 5GC Cluster Design

| Node Type | Role | Resource Profile | NF Workloads |
|---|---|---|---|
| Master nodes (3×) | K8s control plane: etcd, API server, scheduler, controller manager | 4–8 vCPU, 16–32 GB RAM | No NF workloads, control plane only. 3 nodes for etcd quorum. |
| Signalling workers | AMF, SMF, PCF, UDM, AUSF, NRF, CHF | 32–64 vCPU, 256–512 GB RAM, 25 GbE NIC | CPU Manager policy=static for AMF/SMF. Guaranteed QoS for all. |
| UPF workers | UPF only (dedicated, isolated) | 64 vCPU, 256 GB RAM + hugepages, 2× 100 GbE SR-IOV | Taints/Tolerations to prevent non-UPF scheduling. Hugepages pre-allocated at node boot. |
| Storage workers (optional) | Ceph OSD or external NVMe for persistent volumes | High I/O NVMe, dedicated NICs | UDR/CHF database persistent volumes. Avoid co-locating with NF workers. |
| OAM workers | Prometheus, Grafana, logging (Loki/ELK), CI/CD agents | Standard compute | Monitoring must not compete with NF resources. Separate node pool. |

Table 1 — K8s cluster node role design for 5GC. Separating UPF workers from signalling workers is the single most important cluster topology decision.
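To make the taint and toleration rule for UPF workers concrete, here is a minimal Python sketch of the matching logic the scheduler applies. The taint key "workload" and value "upf" are hypothetical names chosen for illustration, and the functions are a simplified model, not the scheduler's actual code.

```python
def tolerates(toleration, taint):
    """Return True if one toleration matches one taint (simplified model)."""
    # An empty or missing effect on the toleration matches any taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # Exists with no key tolerates every taint; otherwise keys must match.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def schedulable(pod_tolerations, node_taints):
    """A pod can land on a node only if every hard taint is tolerated."""
    hard = [t for t in node_taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in pod_tolerations) for taint in hard)


# Hypothetical taint applied to every UPF worker node.
upf_node_taints = [{"key": "workload", "value": "upf", "effect": "NoSchedule"}]
upf_pod_tolerations = [
    {"key": "workload", "operator": "Equal", "value": "upf", "effect": "NoSchedule"}
]

print(schedulable(upf_pod_tolerations, upf_node_taints))  # UPF pod
print(schedulable([], upf_node_taints))                   # pod with no tolerations
```

A taint only keeps other pods off the UPF workers. Keeping UPF pods on those workers additionally needs a nodeSelector or node affinity rule in the UPF pod spec.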

3. How It Works — NF Lifecycle on Kubernetes

Helm Chart Deployment

5GC NFs are deployed via Helm charts provided by the NF vendor. The operator creates a values.yaml override file specifying: PLMN IDs (MCC/MNC), TAI list, DNN configuration, NRF endpoint, N2/N3 interface IPs, resource limits (CPU, memory, hugepages), replica counts, storage class, and TLS certificate references. The chart deploys all K8s objects: Deployment or StatefulSet, Services, ConfigMaps, Secrets, NetworkAttachmentDefinitions (Multus), and HorizontalPodAutoscaler.
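As a rough illustration of such an override file, the Python sketch below assembles the values and prints them as JSON. JSON is a subset of YAML, so a YAML parser can generally read the output as a values file. Every key name here (plmn, taiList, interfaces and so on), the NRF URL, and the IP addresses are hypothetical: each vendor chart defines its own schema, so treat this as a shape to adapt, not a vendor format.

```python
import json

# Keys this sketch treats as mandatory; a real chart defines its own schema.
REQUIRED_KEYS = ("plmn", "taiList", "dnns", "nrf", "interfaces",
                 "resources", "replicaCount")


def build_values(mcc, mnc, tacs, dnns, nrf_endpoint, n2_ip, n3_ip, replicas):
    """Assemble a hypothetical values override for one NF release."""
    return {
        "plmn": {"mcc": mcc, "mnc": mnc},
        "taiList": [{"mcc": mcc, "mnc": mnc, "tac": tac} for tac in tacs],
        "dnns": list(dnns),
        "nrf": {"endpoint": nrf_endpoint},
        "interfaces": {"n2": n2_ip, "n3": n3_ip},
        # Requests equal to limits, so the pod can qualify for Guaranteed QoS.
        "resources": {
            "requests": {"cpu": "8", "memory": "16Gi"},
            "limits": {"cpu": "8", "memory": "16Gi"},
        },
        "replicaCount": replicas,
    }


def missing_keys(values):
    """List required keys that are absent, for a pre-apply sanity check."""
    return [key for key in REQUIRED_KEYS if key not in values]


example_values = build_values(
    mcc="001", mnc="01", tacs=[1, 2], dnns=["internet", "ims"],
    nrf_endpoint="https://nrf.5gc.svc:8443",  # placeholder endpoint
    n2_ip="10.0.2.10", n3_ip="10.0.3.10", replicas=3,
)
print(json.dumps(example_values, indent=2))
```

A check like missing_keys can run in CI on every pull request, so an incomplete override is caught before it reaches a production apply.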

Pro tip: Version-control every values.yaml in Git, with PR review before any production apply. Running helm upgrade without version control is the fastest path to unrecoverable configuration drift. Every change should have a commit message and a rollback plan.

K8s Operators for 5GC NF Lifecycle

Kubernetes Operators extend the K8s API with Custom Resource Definitions (CRDs) that represent 5GC-specific concepts. The NF vendor ships an Operator that watches these CRDs and manages the NF lifecycle with telecom-aware logic:

| Operator Function | What Vanilla K8s Cannot Do | What the Operator Does |
|---|---|---|
| Zero-downtime SMF upgrade | K8s rolling update kills old pods before sessions drain | Drains sessions from the old pod, waits for zero active sessions, then terminates it. The new pod starts accepting traffic before the old one terminates. |
| UPF graceful drain | K8s terminationGracePeriodSeconds is a blunt timer | Sends PFCP Association Release to the UPF, waits for the SMF to migrate sessions to the standby UPF, then terminates the pod. |
| AMF context preservation | K8s Deployment has no concept of UE context | Writes AMF context (UE registrations) to a PersistentVolume before the pod terminates. The new pod reads the context on start. |
| NRF registration health | A plain HTTP readiness probe does not check NRF registration state | Verifies the NF is registered in the NRF before marking the pod Ready. A terminating pod is deregistered from the NRF. |
| Certificate rotation | K8s cert-manager rotates certs but cannot flush token caches | Coordinates cert rotation across all pods in the NF cluster and flushes OAuth2 token caches. |

Table 2 — K8s Operator functions for 5GC. Without Operators, vanilla K8s lifecycle management causes session drops during every upgrade.
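The session-aware drain in the first row can be sketched as a small decision function of the kind an Operator's reconcile loop might evaluate for each old pod. This is an assumption-laden illustration: the PodState fields, the action names, and the 900-second default timeout are hypothetical and do not come from any vendor Operator.

```python
from dataclasses import dataclass


@dataclass
class PodState:
    name: str
    active_sessions: int   # sessions still anchored on this pod
    draining_seconds: int  # time since the pod stopped taking new sessions


def drain_action(pod, replacement_ready, drain_timeout_s=900):
    """Decide the next step for an old pod during an upgrade (sketch)."""
    if not replacement_ready:
        # Never remove capacity before the new pod can accept traffic.
        return "wait_for_replacement"
    if pod.active_sessions == 0:
        return "terminate"
    if pod.draining_seconds >= drain_timeout_s:
        # Sessions remain, but the configured drain window has expired.
        return "terminate_after_timeout"
    return "keep_draining"


old_pod = PodState(name="smf-0", active_sessions=50000, draining_seconds=60)
print(drain_action(old_pod, replacement_ready=True))
```

The contrast with a plain rolling update is that the pod's remaining session count, not a fixed grace period, is what allows termination.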

Zero-Downtime Upgrade Procedure

Step 1: Scale up. Add new pods running the new NF version alongside the old pods. The new-version pods are registered in the NRF with reduced weight, so they accept a small percentage of traffic for verification.

Step 2: Drain old pods. The AMF/SMF Operator sets old pods to weight=0 in the NRF, which stops new UEs from registering to them. Existing sessions continue on the old pods until natural termination.

Step 3: UPF drain (for UPF upgrades). The SMF Operator migrates active PDU sessions from the old UPF to the standby UPF via PFCP Session Establishment on the standby followed by PFCP Session Deletion on the old UPF. Expect a brief per-session interruption (~100 ms) during migration.

Step 4: Terminate old pods after the drain timeout (configurable, typically 5–15 minutes). Verify that KPIs are stable on the new-version pods.

Step 5: Scale back to the normal replica count and remove the temporary extra pods.
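The ordering in the UPF drain step (establish on the standby first, delete on the old UPF second) can be shown with a toy Python model. Sets of session IDs stand in for the PFCP session tables on each UPF. This is only a sketch of the ordering, with no real PFCP messaging, and the function name is hypothetical.

```python
def migrate_sessions(old_upf, standby_upf):
    """Move every session from old_upf to standby_upf, one at a time.

    Both arguments are sets of session IDs and are modified in place.
    Returns the ordered log of (operation, session_id) tuples.
    """
    log = []
    for session_id in sorted(old_upf):    # sorted() copies, so mutation is safe
        standby_upf.add(session_id)       # stands in for Session Establishment
        log.append(("establish", session_id))
        old_upf.discard(session_id)       # stands in for Session Deletion
        log.append(("delete", session_id))
    return log


old_sessions = {"pdu-1", "pdu-2", "pdu-3"}
standby_sessions = set()
for operation, session_id in migrate_sessions(old_sessions, standby_sessions):
    print(operation, session_id)
```

Because each session is established on the standby before it is deleted from the old UPF, a session is never left without a forwarding state on at least one UPF in this model.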

4. Key Parameters and Technical Terms

| Term | Definition | 5GC Significance |
|---|---|---|
| CPU Manager policy=static | K8s kubelet policy that enables exclusive CPU allocation for containers in Guaranteed QoS pods that request whole (integer) CPUs. | Required for AMF/SMF/UPF. Without static policy: OS scheduler migrates threads between CPUs, causing cache misses. |
| topologyManagerPolicy=single-numa-node | K8s kubelet policy that ensures CPU and memory allocations are from the same NUMA node. | Prevents cross-NUMA memory access. Must be set before deploying any 5GC NF pods. |
| NetworkAttachmentDefinition | Multus CRD that defines a secondary network interface for a pod (e.g., SR-IOV VF for UPF N3). | UPF pod spec references a NAD for each secondary interface. Without a NAD: UPF only has the primary CNI interface. |
| PodDisruptionBudget (PDB) | K8s policy limiting how many pods can be unavailable during voluntary disruptions. | Set for each NF: minAvailable=N-1. Prevents simultaneous eviction of all AMF/SMF pods during node maintenance. |
| StatefulSet vs Deployment | StatefulSet: stable pod names, stable PersistentVolumeClaims. Deployment: ephemeral. | SMF and UDM: StatefulSet, for stable pod names for clustering and stable volumes for session/subscriber state. |
| HPA (Horizontal Pod Autoscaler) | Automatically scales pod replicas based on CPU, memory, or custom metrics. | Useful for AMF/PCF/NRF, which can scale statelessly. SMF autoscaling is complex due to session state affinity. |
| liveness / readiness probes | K8s health checks. Liveness: restart if failed. Readiness: stop traffic if failed. | Configure the readiness probe to check NRF registration, not just an HTTP endpoint. An unhealthy NF is deregistered from the NRF. |
| helm upgrade --atomic | Helm flag: if the upgrade fails, automatically roll back to the previous release. | Use for all production NF upgrades. Prevents partial upgrades leaving the cluster in an inconsistent state. |

Table 3 — K8s parameters for 5GC. CPU Manager + NUMA topology manager + SR-IOV are the platform trio that determines NF performance.
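Whether a container actually receives exclusive cores under the static policy depends on its resource settings. The Python sketch below checks a pod's containers against the two conditions: Guaranteed QoS (requests equal to limits for CPU and memory) and a whole-number CPU count. It is a simplified checker written for illustration; memory quantities are compared as plain strings, which is an assumption that breaks if equal amounts are written in different units.

```python
def parse_cpu(quantity):
    """Convert a Kubernetes CPU quantity ('8', '500m', 1.5) to millicores."""
    text = str(quantity)
    if text.endswith("m"):
        return int(text[:-1])
    return int(round(float(text) * 1000))


def is_guaranteed(containers):
    """True if every container has CPU and memory limits equal to requests."""
    for container in containers:
        resources = container.get("resources", {})
        limits = resources.get("limits", {})
        requests = resources.get("requests", {})
        if "cpu" not in limits or "memory" not in limits:
            return False
        # A missing request defaults to the limit.
        if parse_cpu(requests.get("cpu", limits["cpu"])) != parse_cpu(limits["cpu"]):
            return False
        if requests.get("memory", limits["memory"]) != limits["memory"]:
            return False
    return True


def gets_exclusive_cpus(container, pod_guaranteed):
    """True if the static CPU Manager policy would pin this container."""
    cpu = container.get("resources", {}).get("limits", {}).get("cpu")
    if not pod_guaranteed or cpu is None:
        return False
    millicores = parse_cpu(cpu)
    return millicores >= 1000 and millicores % 1000 == 0


amf = {"name": "amf", "resources": {"requests": {"cpu": "8", "memory": "16Gi"},
                                    "limits": {"cpu": "8", "memory": "16Gi"}}}
sidecar = {"name": "log-agent", "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}}}

guaranteed = is_guaranteed([amf, sidecar])
print(guaranteed, gets_exclusive_cpus(amf, guaranteed), gets_exclusive_cpus(sidecar, guaranteed))
```

In this example the pod is Guaranteed, the AMF container is pinned to 8 exclusive cores, and the fractional-CPU sidecar runs in the shared pool.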

5. Common Issues in the Field

Field Note: SMF Pod Rolling Update Drops Sessions (No Operator Installed)
The network operator's team ran helm upgrade on the SMF without the vendor K8s Operator deployed.
The default K8s rolling update is not session-aware: the old pod receives SIGTERM and is killed once terminationGracePeriodSeconds (30 seconds by default) expires, whether or not its sessions have drained.
The old pod was terminated mid-session: 15,000 active PDU sessions lost their PFCP state.
UEs experienced session drops lasting 3–5 minutes during the re-establishment storm.
Fix: deploy the vendor SMF Operator before any upgrade. The Operator manages graceful session drain.
Lesson: never run helm upgrade on stateful 5GC NFs without the vendor Operator managing the lifecycle.

Field Note: All AMF Pods Evicted Simultaneously (Missing PodDisruptionBudget)
Node maintenance required draining 2 K8s worker nodes. AMF had 4 pods, all on the 2 nodes being drained.
Both nodes were drained at the same time, and without a PDB nothing in K8s blocks concurrent evictions. All 4 AMF pods were evicted at once.
Network impact: total AMF unavailability for 90 seconds while the pods were rescheduled on other nodes.
All active UE registrations hit NAS signalling timeouts, followed by a mass re-registration storm on recovery.
Fix: set PodDisruptionBudget minAvailable=2 for AMF. Spread AMF pods across nodes with podAntiAffinity.
Lesson: always test the node drain procedure in staging before production maintenance.
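The second field note can be replayed in a few lines of Python. The sketch below builds a PodDisruptionBudget manifest as a dict (kubectl accepts JSON manifests) and simulates evictions one pod at a time, allowing each only if it leaves at least minAvailable pods. The pod and node names, the app label, and the 5gc namespace are assumptions for illustration, and the simulation ignores replacement pods becoming ready during the drain.

```python
import json


def pdb_manifest(name, namespace, app_label, min_available):
    """Build a policy/v1 PodDisruptionBudget as a plain dict."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


def simulate_drain(placement, drained_nodes, min_available=0):
    """Evict pods on drained nodes one by one, honouring min_available.

    placement maps pod name to node name. Returns (evicted, blocked).
    """
    available = len(placement)
    evicted, blocked = [], []
    for pod, node in sorted(placement.items()):
        if node not in drained_nodes:
            continue
        if available - 1 >= min_available:
            available -= 1
            evicted.append(pod)
        else:
            blocked.append(pod)   # eviction refused until capacity recovers
    return evicted, blocked


amf_placement = {"amf-0": "node-a", "amf-1": "node-a",
                 "amf-2": "node-b", "amf-3": "node-b"}

print(simulate_drain(amf_placement, {"node-a", "node-b"}))                   # no PDB
print(simulate_drain(amf_placement, {"node-a", "node-b"}, min_available=2))  # with PDB
print(json.dumps(pdb_manifest("amf-pdb", "5gc", "amf", 2), indent=2))
```

Without a budget all four pods are evicted. With minAvailable=2, two evictions are refused, so the drain stalls until rescheduled pods are back in service rather than taking the AMF down.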

6. Troubleshooting

| Symptom | Root Cause | Check | Fix |
|---|---|---|---|
| Session drops during NF upgrade | No graceful drain: vanilla K8s rolling update | K8s events during upgrade; NF Operator deployment status | Deploy vendor NF Operator; use Operator-managed upgrade procedure |
| Mass session drop during node maintenance | PodDisruptionBudget not set, so all pods were evicted simultaneously | PDB config: kubectl get pdb -n 5gc; AMF pod distribution across nodes | Set PDB minAvailable; add podAntiAffinity to spread pods across nodes |
| NRF shows NF as registered but it is down | Readiness probe not checking NRF registration: pod state and NRF state are out of sync | NF readiness probe config; NRF NF profile list | Configure readiness probe to check NRF registration endpoint |
| Certificate expiry causes SBI failures | cert-manager rotated cert but NF pods not reloaded | K8s Secret last-updated timestamp; NF pod TLS cert expiry | Use vendor Operator for cert rotation; it coordinates pod reload and token cache flush |
| K8s upgrade breaks NF pods | K8s API version change deprecates NF CRD or manifest | kubectl get events; helm status shows degraded | Validate NF vendor K8s compatibility matrix before cluster upgrade. Test in staging. |

Table 4 — Cloud-native 5GC troubleshooting. Most issues are lifecycle management failures, not NF software bugs.
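A readiness check that looks at NRF registration as well as local health could follow the shape sketched below in Python. How the NF obtains its own status from the NRF is deliberately left out: the nrf_status argument is assumed to be the status string the NRF currently holds for this NF instance (for example REGISTERED), or None if the NRF has no profile for it. The function and its return format are hypothetical.

```python
def readiness(http_ok, nrf_status):
    """Combine local health and NRF registration into one readiness verdict.

    Returns (ready, reason) so a probe handler can log why it failed.
    """
    if not http_ok:
        return False, "local health check failed"
    if nrf_status is None:
        return False, "no NF profile found in NRF"
    if nrf_status != "REGISTERED":
        return False, f"NRF status is {nrf_status}"
    return True, "ok"


for case in [(True, "REGISTERED"), (True, "SUSPENDED"), (True, None), (False, "REGISTERED")]:
    print(case, readiness(*case))
```

A probe endpoint built this way returns a failure whenever the pod is healthy locally but unknown or suspended in the NRF, which is the mismatch the troubleshooting table describes.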

7. Summary — Key Takeaways

| Topic | Key Takeaway |
|---|---|
| Vendor K8s Operator | Mandatory for stateful NFs (SMF, UPF). Vanilla K8s rolling update = session drops. Deploy Operator before first production upgrade. |
| PodDisruptionBudget | Set for every NF. minAvailable=N-1. Without PDB: node maintenance or cluster upgrade evicts all pods of one NF simultaneously. |
| CPU Manager + NUMA | policy=static + topologyManagerPolicy=single-numa-node. Configure at node level before deploying NF pods. Changing after deployment requires pod restarts. |
| PodAntiAffinity | Spread NF replicas across physical hosts and availability zones. Single host failure must not take down all AMF or NRF replicas. |
| Zero-downtime upgrade | Scale up new version → drain old → verify KPIs → terminate old. This is a 15-30 minute procedure per NF, not a 30-second kubectl apply. |
| GitOps for config | values.yaml in Git with PR review. Every production change has a commit, a review, and a rollback point. Always use helm upgrade --atomic. |
| Readiness probe | Check NRF registration in readiness probe, not just HTTP /healthz. An NF not in NRF is invisible to the rest of the core. |

Table 5 — Post 08 summary. Cloud-native 5GC on K8s is production-viable with the right Operator, PDB, and topology configuration.

Next: Post 09 — SBA & NF APIs Deep Dive
