5GC Operations & Lifecycle Management

Day 0/1/2, zero-downtime upgrades, DR drills, configuration management, Helm GitOps, observability stack, NOC integration

1. What Is 5GC Operations — The Simple Version

5GC operations is everything that happens after the network goes live: keeping NFs healthy, performing software upgrades without dropping sessions, responding to faults within SLA windows, managing configuration changes safely, and proving to auditors that the network can recover from a disaster. This is where most operators underinvest compared to the deployment phase. The deployment gets a 6-month project team. Operations gets whoever is left.

Getting operations right from the start means: GitOps for all configuration changes, Grafana dashboards built before go-live (not after the first outage), quarterly DR drills with real session traffic, and zero-downtime upgrade procedures tested in staging before they run on production.

3GPP Reference
3GPP TS 28.500: Management concept, architecture and requirements for mobile networks that include virtualized network functions
GSMA NG.126: Cloud Infrastructure Reference Model (operations section)
3GPP TS 28.552: 5G performance measurements (operations reference)

2. Day 0 / Day 1 / Day 2 — The Operations Model

| Phase | Definition | Key Activities | Common Operator Mistake |
| --- | --- | --- | --- |
| Day 0 | Design and planning before deployment | Architecture decisions, hardware sizing, DC design, IP plan, S-NSSAI allocation, TAI mapping design, roaming partner SEPP exchange | Skipping TAI mapping design → SMF UPF selection gaps discovered post-launch |
| Day 1 | Initial deployment and configuration | NF software install, Helm chart deployment, NF registration in NRF, acceptance testing, KPI baseline | No acceptance test for PFCP Session Modification → asymmetric connectivity discovered from subscriber complaints |
| Day 2 | Ongoing operations | Software upgrades, configuration changes, SLA monitoring, incident response, capacity management, DR drills | No GitOps → config drift between NFs discovered only during incident investigation |

Table 1 — Day 0/1/2 operations model. Day 2 is where most operational gaps are — it gets the least design investment and causes the most production incidents.
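
The Day 0 mistake in Table 1 (skipping TAI mapping design) can be caught before launch with a plain set comparison. The following is a minimal sketch, assuming two hypothetical inputs: the list of TAIs the AMF serves and a per-UPF list of TAIs available to SMF UPF selection. Neither structure is a vendor format, and the TAI strings are illustrative.

```python
def find_tai_gaps(amf_served_tais, smf_upf_tai_map):
    """Return TAIs the AMF serves that no UPF in the SMF selection map covers."""
    covered = set()
    for tais in smf_upf_tai_map.values():
        covered.update(tais)
    return sorted(set(amf_served_tais) - covered)


# Hypothetical TAIs written as "MCC-MNC-TAC"
amf_tais = ["001-01-0001", "001-01-0002", "001-01-0003"]
upf_map = {
    "upf-a": ["001-01-0001"],
    "upf-b": ["001-01-0002"],
}

gaps = find_tai_gaps(amf_tais, upf_map)
print(gaps)  # any TAI listed here has no UPF selectable for it
```

A non-empty result at Day 0 is a design gap; the same result after launch is a subscriber-facing fault.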

3. Zero-Downtime NF Upgrade Procedure

Every 5GC software upgrade must be zero-downtime for subscribers. Here is the validated procedure for upgrading a stateful NF (SMF) on a Kubernetes cluster with the vendor Operator installed:

Step 1 — Pre-upgrade validation: verify current NF health (all pods Running, NRF registration Active, KPIs at baseline). Verify staging upgrade has completed successfully with same version. Verify rollback procedure has been tested.
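
The Step 1 checks lend themselves to an automated go/no-go gate. This is a sketch only: the `PreUpgradeState` fields are hypothetical names for values you would collect from Kubernetes, the NRF, and your KPI system, and the version strings are made up.

```python
from dataclasses import dataclass


@dataclass
class PreUpgradeState:
    pods_running: int
    pods_expected: int
    nrf_registered: bool
    kpis_at_baseline: bool
    staging_validated_version: str  # version that passed the staging upgrade
    target_version: str             # version about to be rolled out
    rollback_tested: bool


def pre_upgrade_blockers(state):
    """Return the reasons the upgrade must not start; an empty list means go."""
    blockers = []
    if state.pods_running != state.pods_expected:
        blockers.append(f"pods running {state.pods_running}/{state.pods_expected}")
    if not state.nrf_registered:
        blockers.append("NF not registered in NRF")
    if not state.kpis_at_baseline:
        blockers.append("KPIs not at baseline")
    if state.staging_validated_version != state.target_version:
        blockers.append("target version not validated in staging")
    if not state.rollback_tested:
        blockers.append("rollback procedure untested")
    return blockers


state = PreUpgradeState(3, 3, True, True, "2.4.1", "2.4.2", True)
print(pre_upgrade_blockers(state))
```

Returning a list of reasons rather than a single boolean gives the change record an explanation of why a window was aborted.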

Step 2 — Scale out: deploy new version pods alongside old version pods. New pods register in NRF with weight=1 (low traffic), old pods continue at weight=100. Both versions running simultaneously.

Step 3 — Traffic drain: set old pods to weight=0 in NRF so that no new PDU session establishments are routed to them. Existing sessions continue on old pods. Wait for the session count on old pods to drain toward zero (configurable timer: 10–30 minutes for SMF).
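
The drain wait in Step 3 is a poll-until-zero loop bounded by a timer. Below is a minimal sketch, assuming a hypothetical `get_session_count` callable that returns the session count on the old pods; how you obtain that number (vendor API, Prometheus query) is outside the sketch. The clock and sleep functions are injectable so the loop can be exercised without waiting.

```python
import time


def wait_for_drain(get_session_count, timeout_s, poll_s=10,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until old pods hold zero sessions or the drain timer expires.

    Returns (drained, remaining_sessions).
    """
    deadline = clock() + timeout_s
    remaining = get_session_count()
    while remaining > 0 and clock() < deadline:
        sleep(poll_s)
        remaining = get_session_count()
    return remaining == 0, remaining


# Simulated run: session counts observed on successive polls, with a fake clock
counts = iter([1200, 400, 35, 0])
fake_now = [0.0]


def fake_sleep(seconds):
    fake_now[0] += seconds


drained, left = wait_for_drain(lambda: next(counts), timeout_s=1800, poll_s=60,
                               sleep=fake_sleep, clock=lambda: fake_now[0])
print(drained, left)
```

If the timer expires with sessions still attached, the second return value tells you how many sessions Step 5 will cut when the old pods terminate.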

Step 4 — Verification: during drain, monitor KPIs on new version pods: Registration SR, PDU Session SR, N4 latency. If any KPI breaches its threshold (a success rate falling below it, or latency rising above it), immediately execute rollback (Helm rollback to previous release).
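
The Step 4 decision can be sketched as a threshold check that returns the breached KPIs and builds the rollback command. The metric names, the release and namespace names, and the N4 latency limit are assumptions for illustration; the two success-rate limits reuse the alert thresholds from the observability section. The command is only assembled here, not executed.

```python
def kpi_breaches(samples, thresholds):
    """Compare KPI samples to thresholds.

    thresholds maps name -> (kind, limit), where kind is "min" for success
    rates (breach when below) and "max" for latencies (breach when above).
    Missing samples count as breaches, since absent data is not a pass.
    """
    breaches = []
    for name, (kind, limit) in thresholds.items():
        value = samples.get(name)
        if value is None:
            breaches.append(f"{name}: no data")
        elif kind == "min" and value < limit:
            breaches.append(f"{name}: {value} < {limit}")
        elif kind == "max" and value > limit:
            breaches.append(f"{name}: {value} > {limit}")
    return breaches


def rollback_command(release, namespace):
    # With no revision argument, `helm rollback` targets the previous release.
    return ["helm", "rollback", release, "--namespace", namespace, "--wait"]


THRESHOLDS = {
    "registration_sr_pct": ("min", 99.5),
    "pdu_session_sr_pct": ("min", 97.0),
    "n4_latency_ms": ("max", 50.0),  # hypothetical limit, set per deployment
}

samples = {"registration_sr_pct": 99.8, "pdu_session_sr_pct": 96.2, "n4_latency_ms": 12.0}
breaches = kpi_breaches(samples, THRESHOLDS)
if breaches:
    print("breached:", breaches)
    print("run:", " ".join(rollback_command("smf", "5gc")))
```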

Step 5 — Termination: after drain timer expires (or session count on old pods reaches zero), terminate old pods. Scale back to normal replica count. Remove extra capacity pods.

Step 6 — Post-upgrade validation: verify all pods on new version, all registered in NRF, KPIs at baseline. Close change window.

Pro Tip
UPF upgrade has an additional step between Steps 3 and 4: SMF Operator migrates active GTP-U sessions from old UPF to standby UPF via PFCP Session Establishment on standby + deletion on old.
Each session migration causes a ~100ms interruption per session.
Schedule UPF upgrades during lowest-traffic window (02:00–04:00 local time). Consider VoNR call state before migrating.
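
A rough sizing check helps when planning the UPF migration window. The sketch below assumes a hypothetical sustained migration rate (sessions per second), which depends on your SMF and UPF and must be measured in staging; only the ~100ms per-session interruption and the 02:00–04:00 window come from the tip above.

```python
import math


def migration_plan(active_sessions, sessions_per_second,
                   window_s=2 * 3600, per_session_interruption_ms=100):
    """Estimate total migration time and whether it fits the change window."""
    duration_s = math.ceil(active_sessions / sessions_per_second)
    return {
        "duration_s": duration_s,
        "fits_window": duration_s <= window_s,
        # Each subscriber sees only its own interruption, not the total duration.
        "per_session_interruption_ms": per_session_interruption_ms,
    }


plan = migration_plan(active_sessions=200_000, sessions_per_second=500)
print(plan)
```

The total duration and the per-session interruption are separate numbers: the first sizes the change window, the second is what an individual subscriber (or VoNR call) experiences.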

4. Configuration Management — GitOps for 5GC

Every NF configuration change in production must go through Git. This is not optional for 5GC operations — without version-controlled configuration, you cannot answer “what changed?” when investigating an incident, and you cannot execute a reliable rollback.

| GitOps Component | Tool | How It Works for 5GC |
| --- | --- | --- |
| Source of truth | Git repository (GitLab/GitHub) | All Helm values.yaml and NF ConfigMaps stored in Git. Production branch protected: requires PR review. |
| Change process | Pull Request + peer review + CI validation | Every config change = PR. CI pipeline validates: syntax check, NSSAI/TAI consistency, inter-NF config cross-check. |
| Deployment | ArgoCD or Flux CD | Watches Git repository. Applies changes to K8s cluster automatically on merge to production branch. No manual kubectl apply in production. |
| Rollback | Git revert + ArgoCD sync | Incident: revert commit in Git → ArgoCD automatically applies previous config to all affected NFs. |
| Secrets management | Vault or K8s Secrets with git-crypt | TLS certificates and NF credentials never in plain Git. Referenced via Vault paths in Helm values. |

Table 2 — GitOps for 5GC configuration management. The single most effective operational practice for preventing configuration drift and enabling fast rollback.
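
Configuration drift is, at its core, a difference between the configuration declared in Git and the configuration actually running. ArgoCD and Flux perform this comparison for you; the sketch below only illustrates the idea on nested dictionaries. The keys and values are hypothetical, not a real NF schema.

```python
def config_drift(desired, live, prefix=""):
    """Return (path, desired_value, live_value) for every differing setting."""
    drift = []
    for key in sorted(set(desired) | set(live)):
        path = f"{prefix}.{key}" if prefix else str(key)
        if key not in live:
            drift.append((path, desired[key], None))
        elif key not in desired:
            drift.append((path, None, live[key]))
        elif isinstance(desired[key], dict) and isinstance(live[key], dict):
            drift.extend(config_drift(desired[key], live[key], path))
        elif desired[key] != live[key]:
            drift.append((path, desired[key], live[key]))
    return drift


# Hypothetical values: Git says one thing, the running NF says another
git_config = {"smf": {"pfcp": {"heartbeatIntervalSec": 60}, "snssai": ["1-000001"]}}
live_config = {"smf": {"pfcp": {"heartbeatIntervalSec": 30}, "snssai": ["1-000001"]}}

for path, want, have in config_drift(git_config, live_config):
    print(f"{path}: git={want} live={have}")
```

Any non-empty result in production means someone changed the cluster outside the PR process, which is exactly what the "what changed?" question needs to surface.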

5. Observability Stack

| Component | Tool | What It Monitors | Alert Examples |
| --- | --- | --- | --- |
| Metrics collection | Prometheus + vendor exporters | NF KPIs (Reg SR, PDU SR, N4 latency), K8s pod health, NIC throughput | Reg SR < 99.5%: page NOC immediately |
| Visualisation | Grafana dashboards | Per-NF service KPIs, per-slice KPIs, platform metrics, DC power/cooling | PFCP Mod Timeout Rate > 0.01%: investigate |
| Log aggregation | Loki or ELK (Elasticsearch/Kibana) | NF structured logs, K8s events, NGAP/PFCP error logs | PFCP_SESSION_MODIFICATION_TIMEOUT count > 0/hour |
| Tracing | Jaeger or Zipkin (if vendor supports) | HTTP/2 SBI request traces across NF chains | N11 call latency P95 > 200ms; trace shows where time is spent |
| Alerting | Alertmanager + PagerDuty/OpsGenie | Rule-based alerts from Prometheus; log-based alerts from Loki | Night-time critical: PDU SR < 97%; Day: Reg SR < 99.5% |
| Certificate monitoring | cert-manager + Grafana cert expiry panel | TLS certificate expiry dates for all NFs and SEPP | Alert 30 days before any certificate expiry |

Table 3 — 5GC observability stack. Build this before go-live. The first production incident response time is 10× faster with pre-built dashboards than without.
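
In production these thresholds live in Prometheus alerting rules routed through Alertmanager. The sketch below only shows the evaluation logic behind three of the Table 3 examples. It reads the "Night-time critical / Day" row as two time-of-day rules, and the night hours (22:00 to 06:00) are an assumption, not something the table specifies.

```python
def pfcp_mod_timeout_rate_pct(timeouts, requests):
    """PFCP Session Modification timeout rate as a percentage of requests."""
    return 0.0 if requests == 0 else 100.0 * timeouts / requests


def evaluate_alerts(hour, reg_sr_pct, pdu_sr_pct, pfcp_timeouts, pfcp_requests,
                    night_start=22, night_end=6):
    """Return the alerts that fire for one evaluation interval."""
    is_night = hour >= night_start or hour < night_end
    alerts = []
    if is_night and pdu_sr_pct < 97.0:
        alerts.append("critical: PDU SR < 97%")
    if not is_night and reg_sr_pct < 99.5:
        alerts.append("page NOC: Reg SR < 99.5%")
    if pfcp_mod_timeout_rate_pct(pfcp_timeouts, pfcp_requests) > 0.01:
        alerts.append("investigate: PFCP Mod Timeout Rate > 0.01%")
    return alerts


# 14:00, Reg SR dipped, 3 PFCP modification timeouts in 20,000 requests
print(evaluate_alerts(14, 99.2, 99.9, 3, 20_000))
# 03:00, same Reg SR, no PFCP timeouts
print(evaluate_alerts(3, 99.2, 98.5, 0, 20_000))
```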

6. DR Drills — Quarterly Test Procedure

A disaster recovery plan that has never been tested is not a plan — it is a hypothesis. For 5GC, the minimum quarterly DR test should cover:

DR Test 1 — Primary DC failure: cut all power/network to primary DC (in lab environment or using Kubernetes node drain to simulate). Validate: secondary DC takes over within RTO (typically 30–60s for active-active, 2–5 minutes for active-standby). Validate: no active PDU sessions dropped (for active-active). Validate: new registrations resume within 30 seconds.
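
Validating the RTO in DR Test 1 requires a measured outage, not an impression. One way to obtain it, sketched here under the assumption that a test client issues a registration or data-path probe at a fixed interval and records success or failure, is to compute the longest gap between the first failed probe and the next successful one. The probe data below is simulated.

```python
def longest_outage_s(probes):
    """Longest outage in seconds from a time-ordered list of (timestamp_s, ok).

    An outage runs from the first failed probe to the next successful one.
    Returns infinity if service never recovered within the probe data.
    Resolution is limited by the probe interval.
    """
    longest = 0.0
    outage_start = None
    for ts, ok in probes:
        if not ok and outage_start is None:
            outage_start = ts
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None
    if outage_start is not None:
        return float("inf")
    return longest


# Simulated drill: probes every 5 s, failing from t=10 s until recovery at t=55 s
probes = [(t, not (10 <= t < 55)) for t in range(0, 65, 5)]

outage = longest_outage_s(probes)
rto_s = 60  # upper end of the active-active range quoted above
print(outage, "PASS" if outage <= rto_s else "FAIL")
```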

DR Test 2 — UPF pod failure: kill the primary UPF pod serving a test PLMN (kubectl delete pod). Validate: SMF declares UPF unhealthy within PFCP heartbeat window (180s default, configurable). Validate: SMF re-establishes sessions on standby UPF. Validate: no unrecoverable session drops for test UEs.

DR Test 3 — NRF failure: kill all NRF pods simultaneously. Validate: consumer NFs can still call each other using cached NRF discovery results for the validityPeriod duration (300s minimum). Validate: NRF recovers and all NFs re-register with it within 120s of pod restart.
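
One subtlety in DR Test 3 is that a 300s validityPeriod does not give every consumer 300s of cover: each cached discovery result expires relative to when it was fetched, not when the NRF failed. The sketch below, using made-up timestamps and hypothetical cache entry names, flags the cached entries that would expire before the NRF is back.

```python
def at_risk_entries(cached_at_by_entry, validity_period_s,
                    outage_start_s, outage_duration_s):
    """Return cache entries that expire before the NRF outage ends."""
    outage_end = outage_start_s + outage_duration_s
    return sorted(
        entry
        for entry, cached_at in cached_at_by_entry.items()
        if cached_at + validity_period_s < outage_end
    )


# Hypothetical discovery cache: entry name -> time (s) the result was fetched
cache = {"amf->smf": 700, "smf->pcf": 900, "amf->ausf": 990}

# NRF fails at t=1000 s and takes 120 s to recover
print(at_risk_entries(cache, validity_period_s=300,
                      outage_start_s=1000, outage_duration_s=120))
```

During the drill, entries flagged this way are the NF pairs to watch for discovery failures before the NRF returns.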

DR Test 4 — SEPP failure: kill primary SEPP pod. Validate: N32 failover to secondary SEPP. Validate: roaming registrations continue without interruption. This is critical before any commercial roaming launch.

Field Note: First DR Drill — 8 Things That Did Not Work
A GCC operator ran its first quarterly DR drill 3 months after SA commercial launch:
(1) Primary DC failover took 4 minutes — etcd leader election was slow. Fix: tune etcd election timeout.
(2) UPF standby pod had stale PFCP session table — not all sessions migrated. Fix: SMF Operator migration procedure.
(3) NRF recovery: NFs could not re-register — NRF started before its etcd backend was ready. Fix: K8s initContainer dependency.
(4) AMF context not recovered — StatefulSet PersistentVolume not mounted. Fix: PV mount added.
(5) SEPP secondary had expired certificate — was never tested. Fix: cert-manager + certificate monitoring.
(6) Grafana dashboards lost all data — Prometheus retention was 1 day, DR drill reset it. Fix: remote_write to Thanos.
DR drills are how you find these gaps before they find you in production.
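
Item (5), the expired certificate on the secondary SEPP, is the easiest of these gaps to automate away. cert-manager and a Grafana expiry panel cover this in practice; the sketch below shows only the underlying 30-day check. It assumes you already have each certificate's notAfter date, and the names and dates are invented for the example.

```python
from datetime import datetime, timedelta, timezone


def expiring_certs(not_after_by_name, now, warn_days=30):
    """Return names of certificates expired or expiring within warn_days."""
    cutoff = now + timedelta(days=warn_days)
    return sorted(
        name for name, not_after in not_after_by_name.items()
        if not_after <= cutoff
    )


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
certs = {
    "sepp-primary": datetime(2025, 6, 1, tzinfo=timezone.utc),
    "sepp-secondary": datetime(2024, 12, 15, tzinfo=timezone.utc),  # already expired
    "amf": datetime(2025, 1, 20, tzinfo=timezone.utc),              # inside 30 days
}

print(expiring_certs(certs, now))
```

The check must cover standby components as well as active ones, since a standby certificate is only exercised when failover happens.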

7. Summary — Key Takeaways

| Topic | Key Takeaway |
| --- | --- |
| Day 2 investment | Operations gets less design investment than deployment and causes more incidents. Build observability before go-live. Run DR drill within first month. |
| Zero-downtime upgrade | Scale up new version → NRF weight drain → KPI verification → terminate old. Must have vendor Operator. Test in staging first. Always have rollback plan. |
| GitOps | All production config in Git. PR review for every change. ArgoCD for automated deployment. This is the single most effective change management practice. |
| Observability | Prometheus + Grafana + Loki minimum. Build per-slice dashboards. Alert thresholds: Reg SR < 99.5%, PFCP Mod Timeout > 0. Certificate expiry 30-day advance alert. |
| DR drills | Quarterly minimum. Test: DC failover, UPF pod failure, NRF recovery, SEPP failover. First drill always reveals multiple gaps. This is expected and valuable. |
| NOC integration | NOC must have 5GC Grafana access, alarm correlation playbooks, and escalation paths for each failure pattern. The war room playbook from Post 14 is the starting point. |

Table 4 — Post 19 summary. Operations is an ongoing practice, not a one-time setup. Build it with the same rigor as the network itself.

Next: Post 20 — 5GC Evolution: 5G-Advanced & Beyond
