NFV evolution, VNF vs CNF, OpenStack vs Kubernetes, MANO architecture, deployment models, vendor landscape — why the platform matters as much as the NF software

1. What Is COTS Virtualisation in 5GC — The Simple Version

COTS (Commercial Off-The-Shelf) virtualisation means running 5G Core NFs on standard x86 or ARM servers using software virtualisation — instead of purpose-built telecom hardware with proprietary ASICs. The motivation is straightforward: faster feature velocity, lower hardware cost, multi-vendor flexibility, and cloud-native operations. The reality is more nuanced: 5GC NFs running on COTS hardware have specific configuration requirements (CPU pinning, hugepages, NUMA topology) that are very different from enterprise application workloads.

Every major 5GC vendor — Ericsson, Nokia, Huawei, Samsung — now ships NFs as container-native software designed to run on Kubernetes. But the operators who have had the smoothest deployments are the ones who understood that “runs on Kubernetes” does not mean “configure it like a web app.”

Standards References

ETSI GS NFV-INF 001: NFV Infrastructure Requirements

ETSI GS NFV-IFA 014: Network Functions Virtualisation Management and Orchestration

GSMA NG.126: Cloud Infrastructure Reference Model for 5G Core

3GPP TS 23.501 Section 7.2: Network Function Services

2. Architecture — VNF vs CNF and the Platform Stack

VNF vs CNF — The Shift to Cloud-Native

Dimension	VNF (VM-based)	CNF (Container-based)
Deployment unit	Full VM — OS + NF software	Container — NF software only, shared kernel
Boot time	Minutes (full OS boot)	Seconds (container start)
Memory overhead	2–4 GB per VM for OS	~50–200 MB per container
Scaling	Clone full VM — slow, heavyweight	Pod scale-out in seconds
Lifecycle management	VNFM (ETSI NFV IFA)	Kubernetes Operator + Helm charts
NIC access	SR-IOV via PCI passthrough	SR-IOV via CNI plugins (Multus + SRIOV-CNI)
State management	State on VM local disk	State in PersistentVolumes or external database
GCC operator trend	Legacy — still operating existing VNF deployments	CNF-first from 2022 onwards for all new 5GC buildouts

Table 1 — VNF vs CNF. Industry direction is CNF-first for all new 5GC. VNF continues for operators with existing OpenStack investment and active NF contracts.

The Full Platform Stack

Layer	VM-based (VNF)	Container-based (CNF)	Notes
NF Software	VNF on VM	CNF on Pod	Same NF logic, different packaging
Orchestration	VNFM (per-vendor)	Kubernetes Operator + Helm	K8s Operator handles NF-specific lifecycle
Infrastructure Mgmt	VIM (OpenStack)	K8s + CNI	OpenStack still used as IaaS under some K8s deployments
MANO	NFVO + VNFM + VIM (ETSI NFV)	K8s + Helm + ArgoCD/Flux	NFVO role largely taken over by K8s plus GitOps tooling (Helm, ArgoCD/Flux)
Compute	COTS x86 servers	COTS x86 or ARM servers	Same hardware — virtualisation layer differs
Networking	SR-IOV + OVS-DPDK	Multus + SRIOV-CNI + OVS-DPDK	Both paths use SR-IOV for UPF data plane
Storage	Ceph RBD or NFS for VMs	Ceph RBD or local NVMe StorageClass	UDR and CHF: NVMe for low-latency DB I/O

Table 2 — Platform stack comparison. The hardware is the same. The management and lifecycle layer is fundamentally different.

3. How It Works — The Kubernetes 5GC Platform

In a CNF-based 5GC deployment on Kubernetes, here is how the pieces fit together:

The K8s cluster is structured with dedicated node types. Control-plane (master) nodes run the K8s control plane (etcd, API server, scheduler), typically 3 nodes for HA. Worker nodes are specialised. Signalling-plane workers host the AMF, SMF, PCF, UDM, AUSF, and NRF pods on standard compute with a 25 GbE management NIC. UPF workers are provisioned separately with 100 GbE SR-IOV NICs, pre-allocated hugepages, and pinned CPUs.

NF deployment uses Helm charts. The operator installs the vendor-provided Helm chart with a values.yaml override file specifying: PLMN IDs, TAI configurations, DNN definitions, NRF endpoint, N2/N3 interface IP addresses, resource limits, replica counts. The chart deploys all Kubernetes objects: Deployment or StatefulSet, Services (ClusterIP for SBI, LoadBalancer or NodePort for N2/N3), ConfigMaps, Secrets (TLS certificates), NetworkAttachmentDefinitions (Multus secondary NICs for UPF).
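The override behaviour described above can be sketched in a few lines. This is a minimal illustration of how Helm combines chart defaults with a values.yaml override (maps are merged key by key, scalars and lists are replaced), not Helm's actual implementation. All key names (`plmn`, `dnnList`, and so on) are hypothetical; real charts use vendor-specific keys.

```python
from copy import deepcopy


def merge_values(defaults: dict, overrides: dict) -> dict:
    """Deep-merge overrides into defaults, Helm-style.

    Nested maps merge key by key; scalars and lists in the override
    replace the default value wholesale.
    """
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical chart defaults; real key names are vendor-specific.
chart_defaults = {
    "replicaCount": 2,
    "plmn": {"mcc": "001", "mnc": "01"},
    "resources": {"limits": {"cpu": "4", "memory": "8Gi"}},
    "dnnList": ["internet"],
}

# Hypothetical operator override (the content of a site values.yaml).
site_overrides = {
    "replicaCount": 3,
    "plmn": {"mnc": "02"},
    "dnnList": ["internet", "ims"],
}

effective = merge_values(chart_defaults, site_overrides)
print(effective)
```

The point to notice is that a list in the override replaces the default list entirely, so a site file that sets a DNN list has to repeat every entry it still wants.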

K8s Operators manage NF-specific lifecycle events that vanilla K8s cannot handle: graceful SMF pod termination (drain active PDU sessions before killing the pod), UPF rolling upgrade (redirect GTP-U sessions to standby UPF, upgrade primary, redirect back), AMF session context preservation across restarts (write context to PersistentVolume before pod terminates).
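The drain-before-terminate idea can be modelled roughly as follows. This is a toy sketch, not any vendor's Operator logic: `FakeSessionStore` and `drain_before_terminate` are hypothetical names, and a real Operator would query the NF for its session count rather than decrement a counter.

```python
import time


class FakeSessionStore:
    """Stand-in for the NF's session table (hypothetical)."""

    def __init__(self, active: int, released_per_poll: int):
        self.active = active
        self.released_per_poll = released_per_poll

    def poll(self) -> int:
        # Each poll models sessions being released or moved to a peer instance.
        self.active = max(0, self.active - self.released_per_poll)
        return self.active


def drain_before_terminate(store, timeout_s: float, poll_interval_s: float = 0.0) -> bool:
    """Return True if all sessions drained before the deadline, else False."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if store.poll() == 0:
            return True  # safe to let the pod terminate
        time.sleep(poll_interval_s)
    return False  # deadline hit with sessions still active


print(drain_before_terminate(FakeSessionStore(1000, 400), timeout_s=5))
```

In a real deployment the deadline would need to fit inside the pod's termination grace period, otherwise the kubelet kills the container before the drain completes.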

4. Key Parameters and Technical Terms

Term	Definition	Why It Matters for 5GC
CPU Pinning	Binding vCPUs to specific physical CPUs. Prevents OS scheduler from migrating threads.	UPF and AMF worker threads must be pinned. Without pinning: cache misses and latency jitter. Configure via K8s CPU Manager policy=static.
Hugepages	2 MB or 1 GB memory pages (vs default 4 KB). Pre-allocated at boot.	DPDK (used by UPF packet processing) requires hugepages. Without them: 40–60% UPF throughput loss at line rate.
NUMA Topology	Non-Uniform Memory Access. Multi-socket servers have memory banks per socket. Cross-socket access adds 30–80 ns latency.	Set topologyManagerPolicy=single-numa-node in kubelet. UPF and AMF pods must have all CPUs and memory from the same NUMA node.
SR-IOV	Single Root I/O Virtualisation. One physical NIC presents as multiple Virtual Functions.	UPF requires SR-IOV for N3 and N6 line-rate forwarding. Without SR-IOV: all traffic through kernel network stack — cannot sustain 100 Gbps.
DPDK	Data Plane Development Kit. User-space packet processing library that bypasses kernel network stack.	Used by UPF for line-rate packet forwarding. Requires hugepages and CPU pinning to function efficiently.
Multus	Kubernetes CNI meta-plugin. Allows pods to have multiple network interfaces.	UPF pod needs: primary CNI interface for management + N4, plus SR-IOV VFs for N3 and N6. Multus attaches the SR-IOV VFs.
Guaranteed QoS (K8s)	Pod QoS class where CPU and memory requests = limits for every container. These pods are last to be evicted.	Set for ALL 5GC NF pods. SMF or UPF evicted under memory pressure = mass session drop.
PodDisruptionBudget	K8s policy specifying minimum available replicas during voluntary disruptions (upgrades, node drain).	Set PDB for each NF: minAvailable=N-1. Prevents all AMF pods from being drained simultaneously during cluster upgrade.
Helm Chart	Kubernetes application package: templates + default values. Operators override via values.yaml.	Vendor delivers NF as Helm chart. Operator customises via values.yaml. Version-controlled deployment.
StatefulSet vs Deployment	StatefulSet gives pods stable hostnames and persistent storage. Deployment does not.	SMF and UDM: StatefulSet (stable pod names needed for session state and DB clustering). AMF: Deployment acceptable if state in external store.

Table 3 — Platform key terms. CPU pinning, hugepages, and NUMA topology are the three configuration items that most commonly degrade UPF performance in initial deployments.
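As a rough illustration of where the CPU pinning and NUMA settings from Table 3 live, the sketch below builds a kubelet configuration fragment for a UPF worker node. The field names are those of the KubeletConfiguration API; the reserved CPU range `0-1` is an assumption and depends on the server layout.

```python
import json


def upf_kubelet_config(reserved_cpus: str) -> dict:
    """Build a KubeletConfiguration fragment for a UPF worker node."""
    return {
        "apiVersion": "kubelet.config.k8s.io/v1beta1",
        "kind": "KubeletConfiguration",
        # static: Guaranteed pods with whole-number CPU requests get exclusive cores
        "cpuManagerPolicy": "static",
        # cores kept back for the OS and kubelet; the static policy needs a reservation
        "reservedSystemCPUs": reserved_cpus,
        # reject pods whose resources cannot be aligned to a single NUMA node
        "topologyManagerPolicy": "single-numa-node",
    }


config = upf_kubelet_config("0-1")  # assumed reservation, adjust per server
print(json.dumps(config, indent=2))
```

Exclusive cores are only handed out to pods in the Guaranteed QoS class that request an integer number of CPUs, which is one more reason to set requests equal to limits on UPF pods.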

5. Common Issues in the Field

UPF Throughput 50% Below Expected — Hugepages Not Configured

Hugepages are a non-obvious requirement that is easy to miss in initial deployments. DPDK allocates large packet buffers and expects them to be in hugepage memory for TLB (Translation Lookaside Buffer) efficiency. Without hugepages, DPDK falls back to standard 4 KB pages and TLB misses under load dominate CPU time. The UPF CPU looks busy, but actual packet forwarding throughput is 40–60% below the server’s capability.
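A quick calculation shows why page size matters so much for TLB pressure. The sketch below counts how many page mappings are needed to cover a packet-buffer pool at each page size; the 32 GiB pool size is an assumption chosen for illustration.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def pages_needed(buffer_bytes: int, page_bytes: int) -> int:
    """Number of pages (and so address mappings) needed to cover a buffer."""
    return -(-buffer_bytes // page_bytes)  # ceiling division


buffer_pool = 32 * GIB  # assumed size of the packet-buffer pool
for label, size in (("4 KiB", 4 * KIB), ("2 MiB", 2 * MIB), ("1 GiB", GIB)):
    print(f"{label:>6} pages: {pages_needed(buffer_pool, size):>9,} mappings")
```

With 4 KiB pages the pool needs over eight million mappings, far more than any TLB can hold, whereas 1 GiB pages cover the same pool with 32.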

Field Note: UPF Capped at 45 Gbps — Hugepages Missing from K8s Node Config

New SA deployment. UPF server spec: dual 100 GbE NICs, 64 vCPUs. Expected throughput: ~90 Gbps.

Production load test: UPF capped at 45 Gbps. CPU utilisation appeared normal (70%).

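Investigation: no 1 GiB hugepages pre-allocated on the K8s node, so the node advertised no hugepages-1Gi resource. DPDK using 4 KB standard pages.

Fix: pre-allocate 32 x 1 GiB hugepages on the node at boot so that it advertises hugepages-1Gi: "32Gi"; request hugepages-1Gi in the UPF pod spec; restart UPF pods.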

Throughput jumped to 91 Gbps on same hardware with no other change.

UPF Pod Evicted During Memory Pressure — Wrong QoS Class

If UPF pods are deployed as Burstable QoS (requests < limits), the K8s kubelet can evict them during node memory pressure events. An evicted UPF pod drops all active GTP-U sessions immediately. The replacement pod starts fresh with no session state. Every UE whose session was on that UPF loses connectivity until their device re-establishes the PDU session.
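The QoS class is derived from the pod's resource fields, so it can be checked before anything is deployed. Below is a minimal sketch of the classification rule; it compares quantities as plain strings, which is a simplification (Kubernetes compares parsed quantities, so "64Gi" and "65536Mi" would count as equal there but not here).

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod from its containers' cpu/memory requests and limits.

    Each container is a dict such as {"requests": {...}, "limits": {...}}.
    """
    any_set = False
    guaranteed = True
    for c in containers:
        limits = c.get("limits", {})
        # A missing request defaults to the limit.
        requests = {**limits, **c.get("requests", {})}
        for resource in ("cpu", "memory"):
            if resource in limits or resource in requests:
                any_set = True
            if resource not in limits or requests[resource] != limits[resource]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


burstable_upf = [{"requests": {"cpu": "16", "memory": "16Gi"},
                  "limits": {"cpu": "16", "memory": "64Gi"}}]
guaranteed_upf = [{"requests": {"cpu": "16", "memory": "64Gi"},
                   "limits": {"cpu": "16", "memory": "64Gi"}}]

print(qos_class(burstable_upf))
print(qos_class(guaranteed_upf))
```

A check like this could run in CI against rendered manifests so that a Burstable 5GC pod never reaches production unnoticed.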

Field Note: 40,000 Sessions Dropped — UPF Pod Evicted During Memory Pressure

Operator ran UPF with memory requests=16Gi, limits=64Gi (Burstable QoS).

During a memory pressure event on the node, kubelet selected the UPF pod for eviction.

40,000 active PDU sessions dropped simultaneously. Session re-establishment took 3–5 minutes.

Fix: set UPF memory requests=limits=64Gi (Guaranteed QoS). Also: set a PodDisruptionBudget with maxUnavailable=0 for UPF (never voluntarily evict). Set nodeSelector to UPF-dedicated worker nodes so that no other pods compete.
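For reference, a PodDisruptionBudget that blocks all voluntary evictions might look like the object built below. The resource name `upf-pdb` and the label `app: upf` are assumptions; the label has to match whatever the vendor chart puts on the UPF pods.

```python
import json


def upf_pdb(name: str, app_label: str) -> dict:
    """Build a PodDisruptionBudget that allows zero voluntary disruptions."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            # 0 means a node drain cannot evict any matching pod
            "maxUnavailable": 0,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


pdb = upf_pdb("upf-pdb", "upf")  # hypothetical name and label
print(json.dumps(pdb, indent=2))
```

Two caveats apply. A PDB only governs voluntary disruptions such as node drains; it does not stop the kubelet from evicting a pod under memory pressure, which is why the Guaranteed QoS change is the primary fix. And with maxUnavailable=0, a node drain will block until someone moves the UPF deliberately.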

6. Troubleshooting

Symptom	Root Cause	Check	Fix
UPF throughput well below spec	Hugepages not configured; DPDK using 4 KB pages	UPF pod spec: hugepages-1Gi resource request; node: hugetlbfs mount	Pre-allocate 1 GiB hugepages on the K8s node so that it advertises hugepages-1Gi; set the matching resource request in the UPF pod
UPF/SMF pod evicted during peak hours	Pod QoS class is Burstable, so K8s evicts under memory pressure	K8s events: kubectl get events --field-selector reason=Evicted	Set requests=limits for all 5GC pods (Guaranteed QoS); dedicate worker nodes
NF latency spikes during busy hour	NUMA cross-socket memory access: vCPUs split across NUMA nodes	numactl --hardware on worker node; K8s topology manager policy	Set topologyManagerPolicy=single-numa-node; set CPU Manager policy=static
UPF N3 interface cannot sustain 10 Gbps+	SR-IOV not configured, so packets go through the kernel network stack	UPF pod: check NetworkAttachmentDefinition for SR-IOV VF; ethtool on N3 interface	Configure Multus + SRIOV-CNI; verify SR-IOV VF is attached to UPF pod
NF pod restart loses all sessions	SMF/UPF state not persisted — ephemeral pod storage	K8s pod spec: check volume mounts for session state persistence	Use StatefulSet with PersistentVolume for SMF session state; or external session DB

Table 4 — Platform troubleshooting. Most performance issues are K8s configuration errors, not NF software bugs.
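The NUMA check in Table 4 comes down to asking whether the CPUs a process is pinned to all belong to one node. The sketch below does that for CPU lists in the Linux cpulist syntax. The two-socket topology is assumed; on a real worker node the per-node lists can be read from /sys/devices/system/node/node*/cpulist.

```python
def parse_cpulist(text: str) -> set[int]:
    """Parse Linux cpulist syntax such as '0-3,8,10-11' into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def numa_nodes_used(pinned: str, topology: dict[int, str]) -> set[int]:
    """Return the NUMA nodes that own at least one of the pinned CPUs."""
    pinned_cpus = parse_cpulist(pinned)
    return {node for node, cpus in topology.items() if parse_cpulist(cpus) & pinned_cpus}


# Assumed two-socket layout with hyperthread siblings.
topology = {0: "0-15,32-47", 1: "16-31,48-63"}

print(numa_nodes_used("2-9", topology))    # CPUs on a single node: NUMA-local
print(numa_nodes_used("12-19", topology))  # CPUs on both nodes: cross-socket
```

If the result contains more than one node for a UPF or AMF worker process, the pod was not aligned by the topology manager and the latency spikes in the table are the likely consequence.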

7. Design Recommendations

Separate UPF worker nodes from signalling-plane worker nodes. UPF requires: hugepages pre-allocated, DPDK-enabled NICs, CPU pinned, NUMA-local. Sharing a node with AMF/SMF pods introduces resource contention that is extremely difficult to debug. Dedicated UPF nodes with nodeSelector and Taints/Tolerations prevent accidental co-scheduling.
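How nodeSelector and taints work together can be shown with a simplified model of the scheduling check. This is a sketch of the matching rules, not the scheduler's real code, and the label key `example.com/role` and taint key `dedicated` are hypothetical.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified version of the toleration-matches-taint rule."""
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # An Exists toleration with no key matches every taint key.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return toleration.get("key") == taint["key"] and toleration.get("value") == taint.get("value")


def can_schedule(pod: dict, node: dict) -> bool:
    """True if nodeSelector matches and every NoSchedule/NoExecute taint is tolerated."""
    labels = node.get("labels", {})
    if any(labels.get(k) != v for k, v in pod.get("nodeSelector", {}).items()):
        return False
    return all(
        any(tolerates(t, taint) for t in pod.get("tolerations", []))
        for taint in node.get("taints", [])
        if taint["effect"] in ("NoSchedule", "NoExecute")
    )


upf_node = {
    "labels": {"example.com/role": "upf"},
    "taints": [{"key": "dedicated", "value": "upf", "effect": "NoSchedule"}],
}
signalling_node = {"labels": {"example.com/role": "signalling"}, "taints": []}

upf_pod = {
    "nodeSelector": {"example.com/role": "upf"},
    "tolerations": [{"key": "dedicated", "operator": "Equal",
                     "value": "upf", "effect": "NoSchedule"}],
}
amf_pod = {}  # no selector, no tolerations

print(can_schedule(upf_pod, upf_node))
print(can_schedule(amf_pod, upf_node))
print(can_schedule(upf_pod, signalling_node))
```

The two mechanisms cover opposite directions: the taint keeps other pods off the UPF nodes, and the nodeSelector keeps the UPF pods off everything else. Using only one of them leaves a gap.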

Set K8s resource requests = limits for every 5GC NF pod from day one. Guaranteed QoS class prevents eviction under memory pressure. Size limits based on vendor specifications plus 20% headroom. Accepting Burstable QoS to save memory today guarantees an outage under load tomorrow.
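The sizing rule above is simple arithmetic, sketched here with integer maths to avoid rounding surprises. The vendor figures of 16 cores and 48 GiB are made up for the example; real numbers come from the vendor's dimensioning guide.

```python
def guaranteed_resources(vendor_cpu_cores: int, vendor_memory_gib: int,
                         headroom_pct: int = 20) -> dict:
    """Size limits as vendor figure plus headroom; set requests equal to limits."""
    factor = 100 + headroom_pct
    # Ceiling division keeps CPU a whole number of cores, as CPU pinning requires.
    cpu = -(-vendor_cpu_cores * factor // 100)
    memory = -(-vendor_memory_gib * factor // 100)
    sized = {"cpu": str(cpu), "memory": f"{memory}Gi"}
    return {"requests": dict(sized), "limits": dict(sized)}


resources = guaranteed_resources(16, 48)  # hypothetical vendor figures
print(resources)
```

Rounding up rather than down keeps the headroom at or above the target and keeps the CPU count an integer, which the static CPU Manager policy needs before it will assign exclusive cores.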

Version-control all Helm values.yaml files in Git. Every NF configuration change should go through a Git pull request review before applying to production. This is the single most effective change management practice for K8s-based 5GC deployments — it creates an audit trail and a rollback point for every configuration change.

8. Summary — Key Takeaways

Topic	Key Takeaway
VNF vs CNF	Industry has moved to CNF-first. Same NF software, container packaging. K8s Operator replaces VNFM for lifecycle management.
Hugepages	Mandatory for UPF DPDK packet processing. Pre-allocate 1 GiB hugepages on the K8s node (advertised as hugepages-1Gi) AND request them in the UPF pod spec. Missing = 40–60% throughput loss.
CPU pinning + NUMA	Set CPU Manager policy=static and topologyManagerPolicy=single-numa-node. UPF and AMF worker threads must not cross NUMA nodes.
SR-IOV	Required for UPF N3/N6 line-rate. Configured via Multus + SRIOV-CNI. Without SR-IOV: packet throughput limited by kernel network stack.
Guaranteed QoS	requests=limits for ALL 5GC pods. Burstable UPF pod can be evicted = mass session drop. Non-negotiable.
Dedicated UPF nodes	Use nodeSelector + Taints to prevent non-UPF pods on UPF worker nodes. Resource contention between UPF and signalling NFs is hard to debug.
GitOps for config	Version-control all Helm values.yaml. Every change has a PR, review, and rollback point.

Table 5 — Post 06 summary. COTS virtualisation works reliably when the platform is configured correctly. Most production failures are platform config issues.

Next: Post 07 — 5GC Hardware & Infrastructure
