CNF lifecycle, Helm charts, K8s Operators, CPU Manager, geo-redundancy patterns, zero-downtime upgrades, production hardening

1. What Is Cloud-Native 5GC — The Simple Version

Cloud-native 5GC means running NFs as containers on Kubernetes — not just ported to containers, but designed from the ground up to leverage K8s primitives: horizontal pod scaling, rolling upgrades, health probes, ConfigMap-driven configuration, and Operator-managed lifecycle. The difference from just “running in containers” is that cloud-native NFs are stateless where possible, externalise state to persistent volumes or databases, and fail gracefully without cascading.

In practice, cloud-native 5GC on Kubernetes is harder than running web applications on Kubernetes. The latency requirements are tighter, the NIC requirements are specialised, and “stateless” is a relative concept — an SMF pod in the middle of programming 50,000 PFCP sessions cannot just be killed and replaced without consequences.

Standards References

GSMA NG.126 — Cloud Infrastructure Reference Model for 5GC

ETSI GS NFV-IFA 036 — Kubernetes-based virtualisation requirements

3GPP TS 29.500 — Technical Realization of Service-Based Architecture (SBI on K8s)

2. Architecture — Kubernetes 5GC Cluster Design

Node Type	Role	Resource Profile	NF Workloads
Master nodes (3×)	K8s control plane: etcd, API server, scheduler, controller manager	4–8 vCPU, 16–32 GB RAM	No NF workloads — control plane only. 3 nodes for etcd quorum.
Signalling workers	AMF, SMF, PCF, UDM, AUSF, NRF, CHF	32–64 vCPU, 256–512 GB RAM, 25 GbE NIC	CPU Manager policy=static for AMF/SMF. Guaranteed QoS for all.
UPF workers	UPF only — dedicated, isolated	64 vCPU, 256 GB RAM + hugepages, 2× 100 GbE SR-IOV	Taints/Tolerations to prevent non-UPF scheduling. Hugepages pre-allocated at node boot.
Storage workers (optional)	Ceph OSD or external NVMe for persistent volumes	High I/O NVMe, dedicated NICs	UDR/CHF database persistent volumes. Avoid co-locating with NF workers.
OAM workers	Prometheus, Grafana, logging (Loki/ELK), CI/CD agents	Standard compute	Monitoring must not compete with NF resources. Separate node pool.

Table 1 — K8s cluster node role design for 5GC. Separating UPF workers from signalling workers is the single most important cluster topology decision.
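The UPF isolation in Table 1 rests on how the scheduler matches pod tolerations against node taints. Below is a minimal Python sketch of that matching rule, not the scheduler's actual code. The taint key `workload` and value `upf` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str  # "NoSchedule", "PreferNoSchedule" or "NoExecute"


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""  # an empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if this toleration matches this taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key == "":
        # An empty key is only meaningful with Exists and matches any key.
        return tol.operator == "Exists"
    if tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    return tol.value == taint.value


def can_schedule(tolerations, taints) -> bool:
    """A pod fits a node only if every hard taint is tolerated."""
    hard = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


# Hypothetical taint placed on every UPF worker node.
UPF_TAINT = Taint(key="workload", value="upf", effect="NoSchedule")

upf_pod = [Toleration(key="workload", operator="Equal", value="upf", effect="NoSchedule")]
amf_pod = []  # signalling NFs carry no UPF toleration

print(can_schedule(upf_pod, [UPF_TAINT]))  # UPF pod is admitted
print(can_schedule(amf_pod, [UPF_TAINT]))  # AMF pod is kept off the node
```

Note that a taint only keeps other workloads off the UPF nodes. It does not force UPF pods onto them, so the UPF pod spec typically also needs a node selector or node affinity.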

3. How It Works — NF Lifecycle on Kubernetes

Helm Chart Deployment

5GC NFs are deployed via Helm charts provided by the NF vendor. The operator creates a values.yaml override file specifying: PLMN IDs (MCC/MNC), TAI list, DNN configuration, NRF endpoint, N2/N3 interface IPs, resource limits (CPU, memory, hugepages), replica counts, storage class, and TLS certificate references. The chart deploys all K8s objects: Deployment or StatefulSet, Services, ConfigMaps, Secrets, NetworkAttachmentDefinitions (Multus), and HorizontalPodAutoscaler.

Pro tip: Version-control every values.yaml in Git with PR review before any production apply. Helm upgrade without version control is the fastest path to an unrecoverable configuration drift. Every change should have a commit message and a rollback plan.
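A PR review is easier to enforce when a small check runs against the override file first. The sketch below assumes the values.yaml has already been parsed into a Python dict. The key names (`plmn`, `nrfEndpoint`, `replicaCount`, `storageClass`) are hypothetical, since every vendor chart defines its own schema.

```python
import re

# Hypothetical key names: real vendor charts define their own schema.
REQUIRED_KEYS = ("plmn", "nrfEndpoint", "replicaCount", "storageClass")


def validate_values(values: dict) -> list:
    """Return a list of problems; an empty list means the override passes."""
    errors = [f"missing required key: {k}" for k in REQUIRED_KEYS if k not in values]

    plmn = values.get("plmn", {})
    # MCC is always 3 digits; MNC is 2 or 3 digits.
    if not re.fullmatch(r"\d{3}", str(plmn.get("mcc", ""))):
        errors.append("plmn.mcc must be exactly 3 digits")
    if not re.fullmatch(r"\d{2,3}", str(plmn.get("mnc", ""))):
        errors.append("plmn.mnc must be 2 or 3 digits")

    replicas = values.get("replicaCount")
    if not isinstance(replicas, int) or replicas < 1:
        errors.append("replicaCount must be a positive integer")
    return errors


good = {
    "plmn": {"mcc": "001", "mnc": "01"},
    "nrfEndpoint": "https://nrf.example.internal",  # placeholder address
    "replicaCount": 3,
    "storageClass": "fast-nvme",  # placeholder storage class name
}
# Same override, but with an MCC that lost its leading zeros and zero replicas.
bad = dict(good, plmn={"mcc": 1, "mnc": "01"}, replicaCount=0)

print(validate_values(good))
print(validate_values(bad))
```

The `bad` case mirrors a common mistake: an unquoted MCC such as 001 may be parsed by YAML as a number and lose its leading zeros, so it is safer to quote PLMN digits in the override file.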

K8s Operators for 5GC NF Lifecycle

Kubernetes Operators extend the K8s API with Custom Resource Definitions (CRDs) that represent 5GC-specific concepts. The NF vendor ships an Operator that watches these CRDs and manages the NF lifecycle with telecom-aware logic:

Operator Function	What Vanilla K8s Cannot Do	What the Operator Does
Zero-downtime SMF upgrade	K8s rolling update kills old pods before sessions drain	Operator: drain sessions from old pod, wait for zero active sessions, then terminate. New pod starts accepting before old terminates.
UPF graceful drain	K8s terminationGracePeriodSeconds is a blunt timer	Operator: send PFCP Association Release to UPF, wait for SMF to migrate sessions to standby UPF, then terminate pod.
AMF context preservation	K8s Deployment has no concept of UE context	Operator: write AMF context (UE registrations) to PersistentVolume before pod terminates. New pod reads context on start.
NRF registration health	A default HTTP/TCP readiness probe knows nothing about NRF registration state	Operator: verifies NF is registered in NRF before marking pod Ready. Deregisters the NF instance from NRF when its pod terminates or dies.
Certificate rotation	K8s cert-manager rotates certs but cannot flush token caches	Operator: coordinates cert rotation across all pods in the NF cluster, flushes OAuth2 token caches.

Table 2 — K8s Operator functions for 5GC. Without Operators, vanilla K8s lifecycle management causes session drops during every upgrade.

Zero-Downtime Upgrade Procedure

Step 1 — Scale up: add new pods running the new NF version alongside the old pods. The new-version pods are registered in NRF with reduced weight, so they accept only a small percentage of traffic for verification.

Step 2 — Drain old pods: AMF/SMF Operator sets old pods to weight=0 in NRF (stops new UEs registering to old pods). Existing sessions continue on old pods until natural termination.

Step 3 — UPF drain (for UPF upgrades): SMF Operator migrates active PDU sessions from old UPF to standby UPF via PFCP Session Establishment on standby + PFCP Session Deletion on old. Brief per-session interruption (~100ms) during migration.

Step 4 — Terminate old pods after drain timeout (configurable, typically 5–15 minutes). Verify KPIs stable on new version pods.

Step 5 — Scale back to normal replica count. Remove temporary extra pods.
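The core of Steps 2 and 4 is a drain loop: stop new traffic, wait for sessions to end, then terminate either at zero sessions or at the timeout. The Python sketch below illustrates that control flow only. The three callables are hypothetical hooks standing in for whatever the vendor Operator actually does, and `sleep` and `clock` are injectable so the loop can be tested without waiting.

```python
import time
from typing import Callable


def drain_and_terminate(
    set_weight: Callable[[int], None],    # hypothetical hook: update NRF weight
    active_sessions: Callable[[], int],   # hypothetical hook: sessions on old pod
    terminate: Callable[[], None],        # hypothetical hook: delete the old pod
    timeout_s: float = 900.0,             # 15 minutes, upper end of the typical range
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Drain an old pod; return "drained" or "timeout"."""
    set_weight(0)  # Step 2: no new UEs land on the old pod
    deadline = clock() + timeout_s
    while clock() < deadline:
        if active_sessions() == 0:
            terminate()  # all sessions ended naturally
            return "drained"
        sleep(poll_s)
    terminate()  # Step 4: drain timeout reached, remaining sessions are dropped
    return "timeout"
```

Returning a distinct "timeout" result matters in practice: it is the signal that some sessions were cut off and that KPIs should be checked before moving to the next pod.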

4. Key Parameters and Technical Terms

Term	Definition	5GC Significance
CPU Manager policy=static	K8s kubelet policy that enables exclusive CPU allocation for containers in Guaranteed QoS pods that request whole (integer) CPUs.	Required for AMF/SMF/UPF. Without static policy: OS scheduler migrates threads between CPUs, causing cache misses.
topologyManagerPolicy=single-numa-node	K8s kubelet policy that admits a pod only if its CPU and device allocations (and memory, when the Memory Manager is enabled) can all come from a single NUMA node.	Prevents cross-NUMA memory access. Must be set before deploying any 5GC NF pods.
NetworkAttachmentDefinition	Multus CRD that defines a secondary network interface for a pod (e.g., SR-IOV VF for UPF N3).	UPF pod spec references NAD for each secondary interface. Without NAD: UPF only has the primary CNI interface.
PodDisruptionBudget (PDB)	K8s policy limiting how many pods can be unavailable during voluntary disruptions.	Set for each NF: minAvailable=N-1. Prevents simultaneous eviction of all AMF/SMF pods during node maintenance.
StatefulSet vs Deployment	StatefulSet: stable pod names, stable PersistentVolumeClaims. Deployment: ephemeral.	SMF and UDM: StatefulSet — stable pod names for clustering, stable volumes for session/subscriber state.
HPA (Horizontal Pod Autoscaler)	Automatically scales pod replicas based on CPU, memory, or custom metrics.	Useful for AMF/PCF/NRF which can scale stateless. SMF autoscaling is complex due to session state affinity.
liveness / readiness probes	K8s health checks. Liveness: restart if failed. Readiness: stop traffic if failed.	Configure readiness probe to check NRF registration, not just HTTP endpoint. Unhealthy NF deregistered from NRF.
helm upgrade --atomic	Helm flag: if upgrade fails, automatically rollback to previous release.	Use for all production NF upgrades. Prevents partial upgrades leaving cluster in inconsistent state.

Table 3 — K8s parameters for 5GC. CPU Manager + NUMA topology manager + SR-IOV are the platform trio that determines NF performance.
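Setting policy=static on the kubelet is not enough by itself: a container only receives exclusive cores when its pod is in the Guaranteed QoS class and its CPU request is a whole number. The Python sketch below checks that condition for a single-container pod. It is a simplification, since it compares memory quantities as plain strings and ignores multi-container pods.

```python
def cpu_millis(quantity: str) -> int:
    """Convert a K8s CPU quantity ("4", "1.5", "2500m") to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def gets_exclusive_cpus(requests: dict, limits: dict) -> bool:
    """True if a single-container pod qualifies for exclusive CPUs."""
    # Guaranteed QoS needs both CPU and memory limits.
    if "cpu" not in limits or "memory" not in limits:
        return False
    # Requests that are left unset default to the limits.
    req_cpu = cpu_millis(requests.get("cpu", limits["cpu"]))
    lim_cpu = cpu_millis(limits["cpu"])
    req_mem = requests.get("memory", limits["memory"])
    if req_cpu != lim_cpu or req_mem != limits["memory"]:
        return False  # Burstable pod: runs in the shared CPU pool
    # Fractional CPU requests also stay in the shared pool.
    return lim_cpu >= 1000 and lim_cpu % 1000 == 0


print(gets_exclusive_cpus({"cpu": "4", "memory": "8Gi"}, {"cpu": "4000m", "memory": "8Gi"}))
print(gets_exclusive_cpus({}, {"cpu": "2500m", "memory": "8Gi"}))
```

The second call is the case that tends to surprise people: the pod is Guaranteed, yet a 2500m request leaves it in the shared pool with no pinning.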

5. Common Issues in the Field

Field Note: SMF Pod Rolling Update Drops Sessions — No Operator Installed

The network operator ran helm upgrade on the SMF without the vendor K8s Operator deployed.

K8s default rolling update: sends SIGTERM to the old pod and force-kills it once terminationGracePeriodSeconds (30 seconds by default) expires, regardless of active sessions.

Old pod terminated mid-session: 15,000 active PDU sessions lost their PFCP state.

UEs experienced session drops lasting 3-5 minutes during re-establishment storm.

Fix: deploy vendor SMF Operator before any upgrade. Operator manages graceful session drain.

Lesson: never run Helm upgrade on stateful 5GC NFs without the vendor Operator managing the lifecycle.

Field Note: All AMF Pods Evicted Simultaneously — Missing PodDisruptionBudget

Node maintenance required draining 2 K8s worker nodes. AMF had 4 pods, all on the 2 nodes being drained.

Both nodes were drained in parallel. With no PodDisruptionBudget in place, K8s accepted every eviction request, so all 4 AMF pods were evicted at once.

Network: total AMF unavailability for 90 seconds while pods rescheduled on other nodes.

All active UE registrations: NAS signalling timeout. Mass re-registration storm on recovery.

Fix: set PodDisruptionBudget minAvailable=2 for AMF. Spread AMF pods across nodes with podAntiAffinity.

Always test node drain procedure in staging before production maintenance.
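The effect of the PDB in this incident can be worked through with simple arithmetic. The Python sketch below models only the first wave of evictions and assumes all pods are healthy when the drain starts. Pod and node names are made up for the example.

```python
from typing import Optional


def first_pass_evictions(pod_nodes: dict, drain_nodes: set,
                         min_available: Optional[int] = None) -> list:
    """Pods evicted before any replacement pod has become Ready."""
    candidates = sorted(p for p, n in pod_nodes.items() if n in drain_nodes)
    if min_available is None:
        return candidates  # no PDB: every eviction request is accepted
    # With a PDB, evictions stop once healthy pods would drop below minAvailable.
    allowed = max(0, len(pod_nodes) - min_available)
    return candidates[:allowed]


# The field-note layout: 4 AMF pods packed onto the 2 nodes being drained.
amf = {"amf-0": "node-a", "amf-1": "node-a", "amf-2": "node-b", "amf-3": "node-b"}
drained = {"node-a", "node-b"}

print(first_pass_evictions(amf, drained))                   # no PDB: all 4 pods
print(first_pass_evictions(amf, drained, min_available=2))  # 2 pods at once
print(first_pass_evictions(amf, drained, min_available=3))  # N-1: 1 pod at a time
```

With minAvailable=2 the outage is avoided, but half the AMF capacity still disappears in one step. minAvailable=N-1 limits the drain to one pod at a time, at the cost of a slower maintenance window.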

6. Troubleshooting

Symptom	Root Cause	Check	Fix
Session drops during NF upgrade	No graceful drain — vanilla K8s rolling update	K8s events during upgrade; NF Operator deployment status	Deploy vendor NF Operator; use Operator-managed upgrade procedure
Mass session drop during node maintenance	PodDisruptionBudget not set — all pods evicted simultaneously	PDB config: kubectl get pdb -n 5gc; AMF pod distribution across nodes	Set PDB minAvailable; add podAntiAffinity to spread pods across nodes
NRF shows NF as registered but it is down	Readiness probe not checking NRF registration: pod reports Running while NRF is unaware the NF has failed	NF readiness probe config; NRF NF profile list	Configure readiness probe to check NRF registration endpoint
Certificate expiry causes SBI failures	cert-manager rotated cert but NF pods not reloaded	K8s Secret last-updated timestamp; NF pod TLS cert expiry	Use vendor Operator for cert rotation — it coordinates pod reload and token cache flush
K8s upgrade breaks NF pods	K8s API version change deprecates NF CRD or manifest	kubectl get events; helm status shows degraded	Validate NF vendor K8s compatibility matrix before cluster upgrade. Test in staging.

Table 4 — Cloud-native 5GC troubleshooting. Most issues are lifecycle management failures, not NF software bugs.
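The readiness fix in Table 4 comes down to one decision: report Ready only when the NF is both locally healthy and registered in NRF. The Python sketch below shows that decision as a pure function. How the NF learns its own NRF status is vendor-specific and assumed here to be passed in. The status strings follow the NFStatus values defined in 3GPP TS 29.510.

```python
from http import HTTPStatus
from typing import Optional


def readiness_status(local_healthy: bool, nrf_nf_status: Optional[str]) -> HTTPStatus:
    """Status code a readiness endpoint could return to the kubelet."""
    if not local_healthy:
        return HTTPStatus.SERVICE_UNAVAILABLE
    # None means the NF has no profile in NRF at all.
    # SUSPENDED and UNDISCOVERABLE profiles are not offered to consumers either.
    if nrf_nf_status != "REGISTERED":
        return HTTPStatus.SERVICE_UNAVAILABLE
    return HTTPStatus.OK


print(int(readiness_status(True, "REGISTERED")))
print(int(readiness_status(True, None)))
```

An HTTP readiness probe treats any status from 200 to 399 as success, so returning 503 is enough to take the pod out of Service endpoints without restarting it.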

7. Summary — Key Takeaways

Topic	Key Takeaway
Vendor K8s Operator	Mandatory for stateful NFs (SMF, UPF). Vanilla K8s rolling update = session drops. Deploy Operator before first production upgrade.
PodDisruptionBudget	Set for every NF. minAvailable=N-1. Without PDB: node maintenance or cluster upgrade evicts all pods of one NF simultaneously.
CPU Manager + NUMA	policy=static + topologyManagerPolicy=single-numa-node. Configure at node level before deploying NF pods. Changing after deployment requires pod restarts.
PodAntiAffinity	Spread NF replicas across physical hosts and availability zones. Single host failure must not take down all AMF or NRF replicas.
Zero-downtime upgrade	Scale up new version → drain old → verify KPIs → terminate old. This is a 15-30 minute procedure per NF, not a 30-second kubectl apply.
GitOps for config	values.yaml in Git with PR review. Every production change has a commit, a review, and a rollback point. Always use helm upgrade --atomic.
Readiness probe	Check NRF registration in readiness probe, not just HTTP /healthz. An NF not in NRF is invisible to the rest of the core.

Table 5 — Post 08 summary. Cloud-native 5GC on K8s is production-viable with the right Operator, PDB, and topology configuration.

Next: Post 09 — SBA & NF APIs Deep Dive

Cloud-Native 5GC on Kubernetes