COTS & Virtualisation

NFV evolution, VNF vs CNF, OpenStack vs Kubernetes, MANO architecture, deployment models, vendor landscape — why the platform matters as much as the NF software

1. What Is COTS Virtualisation in 5GC — The Simple Version

COTS (Commercial Off-The-Shelf) virtualisation means running 5G Core NFs on standard x86 or ARM servers using software virtualisation — instead of purpose-built telecom hardware with proprietary ASICs. The motivation is straightforward: faster feature velocity, lower hardware cost, multi-vendor flexibility, and cloud-native operations. The reality is more nuanced: 5GC NFs running on COTS hardware have specific configuration requirements (CPU pinning, hugepages, NUMA topology) that are very different from enterprise application workloads.

Every major 5GC vendor — Ericsson, Nokia, Huawei, Samsung — now ships NFs as container-native software designed to run on Kubernetes. But the operators who have had the smoothest deployments are the ones who understood that “runs on Kubernetes” does not mean “configure it like a web app.”

Standards References
ETSI GS NFV-INF 001: NFV Infrastructure Overview
ETSI GS NFV-IFA 014: NFV Management and Orchestration, Network Service Templates Specification
GSMA NG.126: Cloud Infrastructure Reference Model for 5G Core
3GPP TS 23.501 Section 7.2: Network Function Services

2. Architecture — VNF vs CNF and the Platform Stack

VNF vs CNF — The Shift to Cloud-Native

Dimension | VNF (VM-based) | CNF (Container-based)
Deployment unit | Full VM: OS + NF software | Container: NF software only, shared kernel
Boot time | Minutes (full OS boot) | Seconds (container start)
Memory overhead | 2–4 GB per VM for OS | ~50–200 MB per container
Scaling | Clone full VM (slow, heavyweight) | Pod scale-out in seconds
Lifecycle management | VNFM (ETSI NFV IFA) | Kubernetes Operator + Helm charts
NIC access | SR-IOV via PCI passthrough | SR-IOV via CNI plugins (Multus + SRIOV-CNI)
State management | State on VM local disk | State in PersistentVolumes or external database
GCC operator trend | Legacy: still operating existing VNF deployments | CNF-first from 2022 onwards for all new 5GC buildouts

Table 1 — VNF vs CNF. Industry direction is CNF-first for all new 5GC. VNF continues for operators with existing OpenStack investment and active NF contracts.

The Full Platform Stack

Layer | VM-based (VNF) | Container-based (CNF) | Notes
NF Software | VNF on VM | CNF on Pod | Same NF logic, different packaging
Orchestration | VNFM (per-vendor) | Kubernetes Operator + Helm | K8s Operator handles NF-specific lifecycle
Infrastructure Mgmt | VIM (OpenStack) | K8s + CNI | OpenStack still used as IaaS under some K8s deployments
MANO | NFVO + VNFM + VIM (ETSI NFV) | K8s + Helm + ArgoCD/Flux | NFVO concept replaced by K8s service mesh + GitOps
Compute | COTS x86 servers | COTS x86 or ARM servers | Same hardware; virtualisation layer differs
Networking | SR-IOV + OVS-DPDK | Multus + SRIOV-CNI + OVS-DPDK | Both paths use SR-IOV for UPF data plane
Storage | Ceph RBD or NFS for VMs | Ceph RBD or local NVMe StorageClass | UDR and CHF: NVMe for low-latency DB I/O

Table 2 — Platform stack comparison. The hardware is the same. The management and lifecycle layer is fundamentally different.

3. How It Works — The Kubernetes 5GC Platform

In a CNF-based 5GC deployment on Kubernetes, here is how the pieces fit together:

The K8s cluster is structured with dedicated node types. Control-plane (master) nodes run the K8s control plane (etcd, API server, scheduler), typically 3 nodes for HA. Worker nodes are specialised: signalling-plane workers host AMF, SMF, PCF, UDM, AUSF and NRF pods (standard compute, 25 GbE management NIC). UPF workers are separately tainted so that only UPF pods are scheduled on them, and are built with 100 GbE SR-IOV NICs, pre-allocated hugepages and pinned CPUs.

NF deployment uses Helm charts. The operator installs the vendor-provided Helm chart with a values.yaml override file specifying: PLMN IDs, TAI configurations, DNN definitions, NRF endpoint, N2/N3 interface IP addresses, resource limits, replica counts. The chart deploys all Kubernetes objects: Deployment or StatefulSet, Services (ClusterIP for SBI, LoadBalancer or NodePort for N2/N3), ConfigMaps, Secrets (TLS certificates), NetworkAttachmentDefinitions (Multus secondary NICs for UPF).
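To make the values.yaml override idea concrete, here is a minimal Python sketch of the layering rule Helm applies: nested maps merge key by key, while scalars and lists are replaced outright. The key names and the NRF endpoint are hypothetical and not taken from any vendor chart; a real chart defines its own schema.

```python
from copy import deepcopy


def merge_values(defaults: dict, overrides: dict) -> dict:
    """Recursively merge overrides into defaults.

    Nested maps are merged key by key; any other value (scalar or list)
    from overrides replaces the default. Inputs are not modified.
    """
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical chart defaults (illustrative key names only).
chart_defaults = {
    "replicaCount": 1,
    "plmn": {"mcc": "001", "mnc": "01"},
    "nrf": {"endpoint": "http://nrf.example.internal:8080"},
    "resources": {"limits": {"cpu": "4", "memory": "8Gi"}},
}

# Hypothetical site-specific override file contents.
site_overrides = {
    "replicaCount": 3,
    "plmn": {"mnc": "02"},
    "resources": {"limits": {"memory": "16Gi"}},
}

effective = merge_values(chart_defaults, site_overrides)
print(effective)
```

The practical point is that an override file only needs to carry the keys that differ per site; everything else keeps the chart default.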

K8s Operators manage NF-specific lifecycle events that vanilla K8s cannot handle: graceful SMF pod termination (drain active PDU sessions before killing the pod), UPF rolling upgrade (redirect GTP-U sessions to standby UPF, upgrade primary, redirect back), AMF session context preservation across restarts (write context to PersistentVolume before pod terminates).
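The drain-before-terminate behaviour can be sketched in a few lines of Python. This is only an illustration of the control flow, not how any vendor Operator is implemented: the function name, the migrate callback and the session IDs are all hypothetical, and a real Operator would work against the NF's own management API within the pod's termination grace period.

```python
import time


def drain_before_stop(sessions, migrate, grace_seconds, clock=time.monotonic):
    """Try to move every session elsewhere before the pod stops.

    Returns (moved, left_behind). A session is left behind when the
    migrate callback fails or the grace period has already run out.
    """
    deadline = clock() + grace_seconds
    moved, left_behind = [], []
    for session in sessions:
        if clock() < deadline and migrate(session):
            moved.append(session)
        else:
            left_behind.append(session)
    return moved, left_behind


# Hypothetical stand-in for "hand this PDU session to a peer SMF".
def migrate_to_peer(session_id):
    return not session_id.endswith("-stuck")


moved, left_behind = drain_before_stop(
    ["pdu-1", "pdu-2-stuck", "pdu-3"], migrate_to_peer, grace_seconds=30
)
print("moved:", moved)
print("left behind:", left_behind)
```

Anything in the left-behind list is a session that would be dropped if the pod were killed at that point, which is exactly the case vanilla K8s cannot see.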

4. Key Parameters and Technical Terms

Term | Definition | Why It Matters for 5GC
CPU Pinning | Binding vCPUs to specific physical CPUs. Prevents OS scheduler from migrating threads. | UPF and AMF worker threads must be pinned. Without pinning: cache misses and latency jitter. Configure via K8s CPU Manager policy=static.
Hugepages | 2 MB or 1 GB memory pages (vs default 4 KB). Pre-allocated at boot. | DPDK (used by UPF packet processing) requires hugepages. Without them: 40–60% UPF throughput loss at line rate.
NUMA Topology | Non-Uniform Memory Access. Multi-socket servers have memory banks per socket. Cross-socket access adds 30–80 ns latency. | Set topologyManagerPolicy=single-numa-node in kubelet. UPF and AMF pods must have all CPUs and memory from the same NUMA node.
SR-IOV | Single Root I/O Virtualisation. One physical NIC presents as multiple Virtual Functions. | UPF requires SR-IOV for N3 and N6 line-rate forwarding. Without SR-IOV: all traffic goes through the kernel network stack, which cannot sustain 100 Gbps.
DPDK | Data Plane Development Kit. User-space packet processing library that bypasses kernel network stack. | Used by UPF for line-rate packet forwarding. Requires hugepages and CPU pinning to function efficiently.
Multus | Kubernetes CNI meta-plugin. Allows pods to have multiple network interfaces. | UPF pod needs: primary CNI interface for management + N4, plus SR-IOV VFs for N3 and N6. Multus attaches the SR-IOV VFs.
Guaranteed QoS (K8s) | Pod QoS class where requests = limits for all containers. These pods are last to be evicted. | Set for ALL 5GC NF pods. SMF or UPF evicted under memory pressure = mass session drop.
PodDisruptionBudget | K8s policy specifying minimum available replicas during voluntary disruptions (upgrades, node drain). | Set PDB for each NF: minAvailable=N-1. Prevents all AMF pods from being drained simultaneously during cluster upgrade.
Helm Chart | Kubernetes application package: templates + default values. Operators override via values.yaml. | Vendor delivers NF as Helm chart. Operator customises via values.yaml. Version-controlled deployment.
StatefulSet vs Deployment | StatefulSet gives pods stable hostnames and persistent storage. Deployment does not. | SMF and UDM: StatefulSet (stable pod names needed for session state and DB clustering). AMF: Deployment acceptable if state in external store.

Table 3 — Platform key terms. CPU pinning, hugepages, and NUMA topology are the three configuration items that most commonly degrade UPF performance in initial deployments.
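Several of the terms in Table 3 end up in the same pod manifest, so it helps to see them together. The Python sketch below builds a UPF pod manifest as a plain dictionary and prints it as JSON. The annotation key, the hugepages-1Gi resource name, the HugePages emptyDir medium and the two kubelet settings are standard Kubernetes or Multus names; the image, node label, NetworkAttachmentDefinition names and the SR-IOV resource name are hypothetical placeholders that depend on the actual cluster and device-plugin configuration.

```python
import json

# Kubelet settings (KubeletConfiguration fields) that the pod below relies on.
KUBELET_SETTINGS = {
    "cpuManagerPolicy": "static",
    "topologyManagerPolicy": "single-numa-node",
}


def upf_pod_manifest(cpus: int, memory_gi: int, hugepages_gi: int) -> dict:
    """Build a Guaranteed-QoS UPF pod with hugepages and two SR-IOV networks."""
    resources = {
        "cpu": str(cpus),  # whole CPUs, so the static CPU Manager can pin them
        "memory": f"{memory_gi}Gi",
        "hugepages-1Gi": f"{hugepages_gi}Gi",
        "example.com/sriov_upf": "2",  # hypothetical SR-IOV device-plugin resource
    }
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": "upf-0",
            # Multus attaches these secondary networks (hypothetical NAD names).
            "annotations": {"k8s.v1.cni.cncf.io/networks": "upf-n3, upf-n6"},
        },
        "spec": {
            "nodeSelector": {"node-role.example.com/upf": "true"},  # hypothetical label
            "containers": [{
                "name": "upf",
                "image": "registry.example.com/upf:1.0",  # placeholder image
                # requests == limits for every resource gives Guaranteed QoS
                "resources": {"requests": dict(resources), "limits": dict(resources)},
                "volumeMounts": [{"name": "hugepages", "mountPath": "/dev/hugepages"}],
            }],
            "volumes": [{"name": "hugepages", "emptyDir": {"medium": "HugePages"}}],
        },
    }


manifest = upf_pod_manifest(cpus=16, memory_gi=64, hugepages_gi=32)
print(json.dumps(KUBELET_SETTINGS, indent=2))
print(json.dumps(manifest, indent=2))
```

In practice the vendor Helm chart generates this manifest; the sketch is mainly useful as a checklist of what to look for in the rendered output.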

5. Common Issues in the Field

UPF Throughput 50% Below Expected — Hugepages Not Configured

Hugepages are a non-obvious requirement that is easy to miss in initial deployments. DPDK allocates large packet buffers and expects them to be in hugepage memory for TLB (Translation Lookaside Buffer) efficiency. Without hugepages, DPDK falls back to standard 4 KB pages and TLB misses under load dominate CPU time. The UPF CPU looks busy, but actual packet forwarding throughput is 40–60% below the server’s capability.

Field Note: UPF Capped at 45 Gbps — Hugepages Missing from K8s Node Config
New SA deployment. UPF server spec: dual 100 GbE NICs, 64 vCPUs. Expected throughput: ~90 Gbps.
Production load test: UPF capped at 45 Gbps. CPU utilisation appeared normal (70%).
Investigation: hugepages-1Gi not configured in K8s node spec. DPDK using 4 KB standard pages.
Fix: add hugepages-1Gi: "32Gi" to K8s node spec; restart UPF pods.
Throughput jumped to 91 Gbps on same hardware with no other change.
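A quick back-of-envelope calculation shows why page size matters so much for the TLB. The sketch below counts how many pages are needed to map a 32 GiB buffer pool (the size used in the field note) at each page size. It says nothing about actual TLB capacity, which varies by CPU; it only shows how many distinct mappings the hardware has to juggle.

```python
KIB, MIB, GIB = 1024, 1024 ** 2, 1024 ** 3


def pages_needed(buffer_bytes: int, page_bytes: int) -> int:
    """Number of pages required to map buffer_bytes (ceiling division)."""
    return -(-buffer_bytes // page_bytes)


buffer_bytes = 32 * GIB
for label, page_size in (("4 KiB", 4 * KIB), ("2 MiB", 2 * MIB), ("1 GiB", GIB)):
    print(f"{label:>6} pages: {pages_needed(buffer_bytes, page_size):>9,}")
```

With 4 KiB pages the same memory needs roughly 8.4 million mappings; with 1 GiB pages it needs 32.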

UPF Pod Evicted During Memory Pressure — Wrong QoS Class

If UPF pods are deployed as Burstable QoS (requests < limits), the K8s kubelet can evict them during node memory pressure events. An evicted UPF pod drops all active GTP-U sessions immediately. The replacement pod starts fresh with no session state. Every UE whose session was on that UPF loses connectivity until their device re-establishes the PDU session.

Field Note: 40,000 Sessions Dropped — UPF Pod Evicted During Memory Pressure
Operator ran UPF with memory requests=16Gi, limits=64Gi (Burstable QoS).
During a memory pressure event on the node, kubelet selected the UPF pod for eviction.
40,000 active PDU sessions dropped simultaneously. Session re-establishment took 3-5 minutes.
Fix: set UPF memory requests=limits=64Gi (Guaranteed QoS). Also set a PodDisruptionBudget with maxUnavailable=0 for UPF so that it is never voluntarily evicted, and use nodeSelector to place UPF pods on UPF-dedicated worker nodes where no other pods compete.
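The QoS class is derived by Kubernetes from the resource fields, not set directly, so it is worth checking what a given spec actually yields. The following Python sketch mirrors the documented classification rules in simplified form. It compares quantities as plain strings (so "64Gi" and "65536Mi" would count as different), and the CPU figures in the examples are assumptions, since the field note only gives memory values.

```python
def qos_class(containers):
    """Classify a pod as Guaranteed, Burstable or BestEffort.

    containers: list of dicts, each with optional 'requests' and 'limits' maps.
    Simplification: quantities are compared as strings, not parsed.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # An unset request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


before_fix = [{"requests": {"cpu": "16", "memory": "16Gi"},
               "limits": {"cpu": "16", "memory": "64Gi"}}]
after_fix = [{"requests": {"cpu": "16", "memory": "64Gi"},
              "limits": {"cpu": "16", "memory": "64Gi"}}]

print("before:", qos_class(before_fix))
print("after: ", qos_class(after_fix))
```

Note that a single sidecar container without limits is enough to push the whole pod into Burstable, which is an easy thing to miss in a multi-container NF pod.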

6. Troubleshooting

Symptom | Root Cause | Check | Fix
UPF throughput well below spec | Hugepages not configured; DPDK using 4 KB pages | UPF pod spec: hugepages-1Gi resource request; node: hugetlbfs mount | Configure hugepages-1Gi on K8s node spec; set resource request in UPF pod
UPF/SMF pod evicted during peak hours | Pod QoS class is Burstable, so K8s evicts it under memory pressure | K8s events: kubectl get events --field-selector reason=Evicted | Set requests=limits for all 5GC pods (Guaranteed QoS); dedicate worker nodes
NF latency spikes during busy hour | NUMA cross-socket memory access: vCPUs split across NUMA nodes | numactl --hardware on worker node; K8s topology manager policy | Set topologyManagerPolicy=single-numa-node; set CPU Manager policy=static
UPF N3 interface cannot sustain 10 Gbps+ | SR-IOV not configured: packets going through kernel network stack | UPF pod: check NetworkAttachmentDefinition for SR-IOV VF; ethtool on N3 interface | Configure Multus + SRIOV-CNI; verify SR-IOV VF is attached to UPF pod
NF pod restart loses all sessions | SMF/UPF state not persisted: ephemeral pod storage | K8s pod spec: check volume mounts for session state persistence | Use StatefulSet with PersistentVolume for SMF session state; or external session DB

Table 4 — Platform troubleshooting. Most performance issues are K8s configuration errors, not NF software bugs.
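For the NUMA check in Table 4, the output of numactl --hardware can be compared against the CPUs a pod is actually running on. The Python sketch below parses the "node N cpus:" lines and reports which NUMA nodes a CPU set touches. The sample text is illustrative rather than captured from a real server, and the parser assumes the usual "node N cpus: ..." line format.

```python
import re


def parse_numa_cpus(numactl_output: str) -> dict[int, set[int]]:
    """Map NUMA node id -> set of CPU ids from 'numactl --hardware' text."""
    nodes = {}
    for line in numactl_output.splitlines():
        match = re.match(r"node (\d+) cpus:(.*)", line)
        if match:
            nodes[int(match.group(1))] = {int(cpu) for cpu in match.group(2).split()}
    return nodes


def nodes_spanned(cpu_ids, numa_cpus) -> set[int]:
    """Return the NUMA nodes that own at least one of the given CPUs."""
    wanted = set(cpu_ids)
    return {node for node, cpus in numa_cpus.items() if cpus & wanted}


# Illustrative sample, not real output from any specific server.
SAMPLE = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 8 9 10 11
node 0 size: 95000 MB
node 1 cpus: 4 5 6 7 12 13 14 15
node 1 size: 96000 MB
"""

topology = parse_numa_cpus(SAMPLE)
print("pinned to 0,1,8,9 ->", nodes_spanned([0, 1, 8, 9], topology))
print("pinned to 2,3,4   ->", nodes_spanned([2, 3, 4], topology))
```

A result with more than one node means the workload crosses sockets, which is the condition the single-numa-node policy is meant to rule out.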

7. Design Recommendations

Separate UPF worker nodes from signalling-plane worker nodes. UPF requires: hugepages pre-allocated, DPDK-enabled NICs, CPU pinned, NUMA-local. Sharing a node with AMF/SMF pods introduces resource contention that is extremely difficult to debug. Dedicated UPF nodes with nodeSelector and Taints/Tolerations prevent accidental co-scheduling.
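How nodeSelector and taints combine is easier to see as a predicate. The Python sketch below is a simplified model of the scheduler's checks, not the scheduler itself: it handles only the Equal and Exists toleration operators and treats NoSchedule and NoExecute taints as hard constraints. The label and taint keys are hypothetical.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified toleration match: effect, then key/value by operator."""
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # An empty key with Exists tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value") == taint.get("value"))


def can_schedule(pod: dict, node: dict) -> bool:
    """True if the node matches the pod's nodeSelector and the pod tolerates its taints."""
    labels_ok = all(node["labels"].get(key) == value
                    for key, value in pod.get("nodeSelector", {}).items())
    taints_ok = all(
        any(tolerates(t, taint) for t in pod.get("tolerations", []))
        for taint in node.get("taints", [])
        if taint["effect"] in ("NoSchedule", "NoExecute")
    )
    return labels_ok and taints_ok


# Hypothetical labels and taints for illustration.
upf_node = {"labels": {"role": "upf"},
            "taints": [{"key": "dedicated", "value": "upf", "effect": "NoSchedule"}]}
signalling_node = {"labels": {"role": "signalling"}, "taints": []}

upf_pod = {"nodeSelector": {"role": "upf"},
           "tolerations": [{"key": "dedicated", "operator": "Equal",
                            "value": "upf", "effect": "NoSchedule"}]}
amf_pod = {}  # no selector, no tolerations

print("AMF on UPF node:       ", can_schedule(amf_pod, upf_node))
print("UPF on UPF node:       ", can_schedule(upf_pod, upf_node))
print("UPF on signalling node:", can_schedule(upf_pod, signalling_node))
```

The two mechanisms do different jobs: the taint keeps other pods off the UPF nodes, and the nodeSelector keeps UPF pods off everything else. Using only one of them leaves a gap.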

Set K8s resource requests = limits for every 5GC NF pod from day one. Guaranteed QoS class makes NF pods the last candidates for eviction under memory pressure. Size limits based on vendor specifications plus 20% headroom. Accepting Burstable QoS to save memory today risks an outage under load tomorrow.
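The sizing rule is simple arithmetic, sketched below in Python. The vendor figures used in the example (16 CPUs, 64 GiB) are placeholders, and rounding up to a whole unit is an assumption of this sketch rather than something the rule prescribes.

```python
def with_headroom(vendor_spec: int, headroom_percent: int = 20) -> int:
    """vendor_spec plus headroom, rounded up to a whole unit (integer maths)."""
    return -(-vendor_spec * (100 + headroom_percent) // 100)


def guaranteed_resources(vendor_cpu: int, vendor_memory_gi: int) -> dict:
    """Identical requests and limits, sized at vendor spec plus 20% headroom."""
    sized = {
        "cpu": str(with_headroom(vendor_cpu)),
        "memory": f"{with_headroom(vendor_memory_gi)}Gi",
    }
    return {"requests": dict(sized), "limits": dict(sized)}


print(guaranteed_resources(vendor_cpu=16, vendor_memory_gi=64))
```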

Version-control all Helm values.yaml files in Git. Every NF configuration change should go through a Git pull request review before applying to production. This is the single most effective change management practice for K8s-based 5GC deployments — it creates an audit trail and a rollback point for every configuration change.
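One thing a reviewer needs from a pull request is a clear view of which effective values change. Git already diffs the YAML text; the Python sketch below shows the same idea at the level of parsed values, flattening nested maps into dotted paths and listing old and new values side by side. The keys are hypothetical, and the sketch assumes the values files have already been loaded into dictionaries.

```python
def flatten(values: dict, prefix: str = "") -> dict:
    """Flatten nested maps into {'a.b.c': value} form."""
    flat = {}
    for key, value in values.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_values(old: dict, new: dict) -> dict:
    """Return {path: (old_value, new_value)} for every path that differs."""
    a, b = flatten(old), flatten(new)
    return {path: (a.get(path), b.get(path))
            for path in sorted(a.keys() | b.keys())
            if a.get(path) != b.get(path)}


# Hypothetical values before and after a proposed change.
current = {"replicaCount": 2, "resources": {"limits": {"memory": "16Gi"}}}
proposed = {"replicaCount": 3, "resources": {"limits": {"memory": "16Gi", "cpu": "8"}}}

for path, (old, new) in diff_values(current, proposed).items():
    print(f"{path}: {old!r} -> {new!r}")
```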

8. Summary — Key Takeaways

Topic | Key Takeaway
VNF vs CNF | Industry has moved to CNF-first. Same NF software, container packaging. K8s Operator replaces VNFM for lifecycle management.
Hugepages | Mandatory for UPF DPDK packet processing. Configure hugepages-1Gi on K8s node spec AND in UPF pod resource request. Missing = 40–60% throughput loss.
CPU pinning + NUMA | Set CPU Manager policy=static and topologyManagerPolicy=single-numa-node. UPF and AMF worker threads must not cross NUMA nodes.
SR-IOV | Required for UPF N3/N6 line-rate. Configured via Multus + SRIOV-CNI. Without SR-IOV: packet throughput limited by kernel network stack.
Guaranteed QoS | requests=limits for ALL 5GC pods. Burstable UPF pod can be evicted = mass session drop. Non-negotiable.
Dedicated UPF nodes | Use nodeSelector + Taints to prevent non-UPF pods on UPF worker nodes. Resource contention between UPF and signalling NFs is hard to debug.
GitOps for config | Version-control all Helm values.yaml. Every change has a PR, review, and rollback point.

Table 5 — Post 06 summary. COTS virtualisation works reliably when the platform is configured correctly. Most production failures are platform config issues.

Next: Post 07 — 5GC Hardware & Infrastructure
