Day 0/1/2, zero-downtime upgrades, DR drills, configuration management, Helm GitOps, observability stack, NOC integration

1. What Is 5GC Operations — The Simple Version

5GC operations is everything that happens after the network goes live: keeping NFs healthy, performing software upgrades without dropping sessions, responding to faults within SLA windows, managing configuration changes safely, and proving to auditors that the network can recover from a disaster. This is where most operators underinvest compared to the deployment phase. The deployment gets a 6-month project team. Operations gets whoever is left.

Getting operations right from the start means: GitOps for all configuration changes, Grafana dashboards built before go-live (not after the first outage), quarterly DR drills with real session traffic, and zero-downtime upgrade procedures tested in staging before they run on production.

3GPP Reference

3GPP TS 28.500 — Management concept, architecture and requirements for mobile networks that include virtualized network functions

GSMA NG.126 — Cloud Infrastructure Reference Model — operations section

3GPP TS 28.552 — 5G performance measurements (reference for operations KPIs)

2. Day 0 / Day 1 / Day 2 — The Operations Model

Phase	Definition	Key Activities	Common Operator Mistake
Day 0	Design and planning before deployment	Architecture decisions, hardware sizing, DC design, IP plan, S-NSSAI allocation, TAI mapping design, roaming partner SEPP exchange	Skipping TAI mapping design → SMF UPF selection gaps discovered post-launch
Day 1	Initial deployment and configuration	NF software install, Helm chart deployment, NF registration in NRF, acceptance testing, KPI baseline	No acceptance test for PFCP Session Modification → asymmetric connectivity discovered from subscriber complaints
Day 2	Ongoing operations	Software upgrades, configuration changes, SLA monitoring, incident response, capacity management, DR drills	No GitOps → config drift between NFs discovered only during incident investigation

Table 1 — Day 0/1/2 operations model. Day 2 is where most operational gaps are — it gets the least design investment and causes the most production incidents.

3. Zero-Downtime NF Upgrade Procedure

Every 5GC software upgrade must be zero-downtime for subscribers. Here is the validated procedure for upgrading a stateful NF (SMF) on a Kubernetes cluster with the vendor Operator installed:

Step 1 — Pre-upgrade validation: verify current NF health (all pods Running, NRF registration Active, KPIs at baseline). Verify that the staging upgrade to the same version completed successfully. Verify that the rollback procedure has been tested.

Step 2 — Scale out: deploy new version pods alongside old version pods. New pods register in NRF with weight=1 (low traffic), while old pods continue at weight=100. Both versions run simultaneously.

Step 3 — Traffic drain: set old pods to weight=0 in NRF so that no new PDU session requests are routed to them. Existing sessions continue on old pods. Wait for the session count on old pods to drain toward zero (configurable timer: 10–30 minutes for SMF).

Step 4 — Verification: during drain, monitor KPIs on new version pods: Registration SR, PDU Session SR, N4 latency. If any KPI degrades below threshold: immediately execute rollback (Helm rollback to previous release).

Step 5 — Termination: after drain timer expires (or session count on old pods reaches zero), terminate old pods. Scale back to normal replica count. Remove extra capacity pods.

Step 6 — Post-upgrade validation: verify all pods on new version, all registered in NRF, KPIs at baseline. Close change window.
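The decision logic in Steps 3 to 5 can be sketched as a small function that is evaluated on every monitoring sample during the drain. This is a minimal sketch, not a vendor procedure: the `DrainSample` fields and the N4 latency limit are hypothetical, and the Registration SR and PDU Session SR floors simply reuse the alert thresholds quoted elsewhere in this post.

```python
from dataclasses import dataclass

# Success-rate floors reuse the alert thresholds from this post.
KPI_MIN_PCT = {"registration_sr": 99.5, "pdu_session_sr": 97.0}
# Hypothetical limit; take the real one from your own KPI baseline.
N4_LATENCY_MAX_MS = 50.0


@dataclass
class DrainSample:
    elapsed_s: int           # seconds since old pods were set to weight=0
    old_pod_sessions: int    # sessions still anchored on old-version pods
    registration_sr: float   # percent, measured on new-version pods
    pdu_session_sr: float    # percent, measured on new-version pods
    n4_latency_ms: float     # measured on new-version pods


def next_action(sample: DrainSample, drain_timer_s: int = 1800) -> str:
    """Return 'rollback', 'terminate_old', or 'wait' for one sample."""
    # Step 4: any KPI outside its threshold triggers an immediate rollback.
    if sample.registration_sr < KPI_MIN_PCT["registration_sr"]:
        return "rollback"
    if sample.pdu_session_sr < KPI_MIN_PCT["pdu_session_sr"]:
        return "rollback"
    if sample.n4_latency_ms > N4_LATENCY_MAX_MS:
        return "rollback"
    # Step 5: old pods are terminated once they are empty
    # or the drain timer has expired.
    if sample.old_pod_sessions == 0 or sample.elapsed_s >= drain_timer_s:
        return "terminate_old"
    return "wait"
```

Checking the KPI gate before the drain condition is deliberate: a degraded new version should never be left as the only version running, even if the drain timer has already expired.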

Pro Tip

UPF upgrade has an additional step between Steps 3 and 4: SMF Operator migrates active GTP-U sessions from old UPF to standby UPF via PFCP Session Establishment on standby + deletion on old.

Each session migration causes a ~100ms interruption per session.

Schedule UPF upgrades during the lowest-traffic window (02:00–04:00 local time). Check VoNR call state before migrating.
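A quick back-of-the-envelope check shows whether a UPF migration fits in the maintenance window. The sketch below assumes that each session migration takes about as long as the ~100ms interruption it causes, and that the Operator migrates sessions in parallel batches. Both are assumptions for illustration, and the `parallel` parameter is hypothetical.

```python
import math


def migration_seconds(sessions: int, parallel: int,
                      per_session_ms: float = 100.0) -> float:
    """Estimated wall-clock time to migrate all sessions, in seconds."""
    if parallel < 1:
        raise ValueError("parallel must be at least 1")
    # Sessions are moved in batches of `parallel`; each batch is assumed
    # to take one per-session migration time.
    batches = math.ceil(sessions / parallel)
    return batches * per_session_ms / 1000.0


def fits_window(sessions: int, parallel: int, window_s: int = 7200) -> bool:
    """True if the estimate fits the window (default: 02:00 to 04:00)."""
    return migration_seconds(sessions, parallel) <= window_s
```

With 200,000 active sessions and 50 migrations in flight at once, the estimate is 4,000 batches of 100ms, or about 400 seconds. Migrating the same sessions one at a time would take 20,000 seconds and overrun the two-hour window.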

4. Configuration Management — GitOps for 5GC

Every NF configuration change in production must go through Git. This is not optional for 5GC operations — without version-controlled configuration, you cannot answer “what changed?” when investigating an incident, and you cannot execute a reliable rollback.

GitOps Component	Tool	How It Works for 5GC
Source of truth	Git repository (GitLab/GitHub)	All Helm values.yaml and NF ConfigMaps stored in Git. Production branch protected — requires PR review.
Change process	Pull Request + peer review + CI validation	Every config change = PR. CI pipeline validates: syntax check, NSSAI/TAI consistency, inter-NF config cross-check.
Deployment	ArgoCD or Flux CD	Watches Git repository. Applies changes to K8s cluster automatically on merge to production branch. No manual kubectl apply in production.
Rollback	Git revert + ArgoCD sync	Incident: revert commit in Git → ArgoCD automatically applies previous config to all affected NFs.
Secrets management	Vault or K8s Secrets with git-crypt	TLS certificates and NF credentials never in plain Git. Referenced via Vault paths in Helm values.

Table 2 — GitOps for 5GC configuration management. The single most effective operational practice for preventing configuration drift and enabling fast rollback.
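The NSSAI/TAI consistency check in the CI pipeline can be as simple as a set comparison across the parsed Helm values. The sketch below is illustrative only: the dictionary layout (`snssai`, `tai`, `upfs`) is a hypothetical schema, not any vendor's values.yaml format.

```python
from typing import Dict, List


def check_slice_tai_consistency(amf: Dict, smf: Dict) -> List[str]:
    """Return human-readable errors; an empty list means the PR may merge."""
    errors: List[str] = []

    # Every slice the AMF serves must also be configured on the SMF.
    for snssai in sorted(set(amf["snssai"]) - set(smf["snssai"])):
        errors.append(f"S-NSSAI {snssai} served by AMF but missing on SMF")

    # Every TAI the AMF serves must map to at least one UPF on the SMF,
    # otherwise UPF selection fails for UEs in that tracking area.
    upf_tais = set()
    for upf in smf["upfs"]:
        upf_tais.update(upf["tai"])
    for tai in sorted(set(amf["tai"]) - upf_tais):
        errors.append(f"TAI {tai} served by AMF but mapped to no UPF on SMF")

    return errors
```

Run as a CI step, a non-empty result fails the pipeline and blocks the merge. This catches the TAI mapping gap from Table 1 at review time rather than after launch.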

5. Observability Stack

Component	Tool	What It Monitors	Alert Examples
Metrics collection	Prometheus + vendor exporters	NF KPIs (Reg SR, PDU SR, N4 latency), K8s pod health, NIC throughput	Reg SR < 99.5%: page NOC immediately
Visualisation	Grafana dashboards	Per-NF service KPIs, per-slice KPIs, platform metrics, DC power/cooling	PFCP Mod Timeout Rate > 0.01%: investigate
Log aggregation	Loki or ELK (Elasticsearch/Kibana)	NF structured logs, K8s events, NGAP/PFCP error logs	PFCP_SESSION_MODIFICATION_TIMEOUT count > 0/hour
Tracing	Jaeger or Zipkin (if vendor supports)	HTTP/2 SBI request traces across NF chains	N11 call latency P95 > 200ms — trace shows where time is spent
Alerting	Alertmanager + PagerDuty/OpsGenie	Rule-based alerts from Prometheus; log-based alerts from Loki	Night-time critical: PDU SR < 97%; Day: Reg SR < 99.5%
Certificate monitoring	cert-manager + Grafana cert expiry panel	TLS certificate expiry dates for all NFs and SEPP	Alert 30 days before any certificate expiry

Table 3 — 5GC observability stack. Build this before go-live. The first production incident response time is 10× faster with pre-built dashboards than without.
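To make the alert thresholds in Table 3 concrete, the following sketch evaluates three of them from raw counters. In production these would be Prometheus alerting rules, so treat this as an illustration of the arithmetic only. The function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List


def pct(part: int, total: int, if_empty: float) -> float:
    """Percentage of part in total; if_empty is returned when total is 0."""
    return if_empty if total == 0 else 100.0 * part / total


def evaluate_alerts(reg_success: int, reg_attempts: int,
                    pfcp_mod_timeouts: int, pfcp_mod_requests: int,
                    cert_not_after: datetime, now: datetime) -> List[str]:
    alerts: List[str] = []
    # Reg SR < 99.5%: page NOC immediately (no traffic counts as healthy).
    if pct(reg_success, reg_attempts, 100.0) < 99.5:
        alerts.append("PAGE: Reg SR < 99.5%")
    # PFCP Mod Timeout Rate > 0.01%: investigate.
    if pct(pfcp_mod_timeouts, pfcp_mod_requests, 0.0) > 0.01:
        alerts.append("INVESTIGATE: PFCP Mod Timeout Rate > 0.01%")
    # Alert 30 days before any certificate expiry.
    if cert_not_after - now <= timedelta(days=30):
        alerts.append("WARN: certificate expires within 30 days")
    return alerts
```

Note the explicit handling of zero traffic: an idle test slice should not page the NOC with a 0% success rate.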

6. DR Drills — Quarterly Test Procedure

A disaster recovery plan that has never been tested is not a plan — it is a hypothesis. For 5GC, the minimum quarterly DR test should cover:

DR Test 1 — Primary DC failure: cut all power/network to the primary DC (in a lab environment, or simulated with a Kubernetes node drain). Validate: secondary DC takes over within RTO (typically 30–60s for active-active, 2–5 minutes for active-standby). Validate: no active PDU sessions dropped (for active-active). Validate: new registrations resume within 30 seconds.

DR Test 2 — UPF pod failure: kill the primary UPF pod serving a test PLMN (kubectl delete pod). Validate: SMF declares UPF unhealthy within PFCP heartbeat window (180s default, configurable). Validate: SMF re-establishes sessions on standby UPF. Validate: no unrecoverable session drops for test UEs.

DR Test 3 — NRF failure: kill all NRF pods simultaneously. Validate: consumer NFs can still call each other using cached NRF discovery results for validityPeriod duration (300s minimum). Validate: NRF recovers and re-registers all NFs within 120s of pod restart.

DR Test 4 — SEPP failure: kill primary SEPP pod. Validate: N32 failover to secondary SEPP. Validate: roaming registrations continue without interruption. This is critical before any commercial roaming launch.
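Recording each drill's measured timings against the targets above turns the drill into a repeatable pass/fail report. The sketch below uses the upper bounds quoted in DR Tests 1 to 3; the check names are hypothetical labels, and your own RTO targets should replace these numbers.

```python
from typing import Dict

# Upper limits in seconds, taken from the DR test descriptions above.
DR_LIMITS_S = {
    "dc_failover_active_active_s": 60,     # RTO 30-60s
    "dc_failover_active_standby_s": 300,   # RTO 2-5 minutes
    "new_registrations_resume_s": 30,
    "upf_failure_detection_s": 180,        # PFCP heartbeat window (default)
    "nrf_recovery_s": 120,
}


def evaluate_drill(measured: Dict[str, float]) -> Dict[str, str]:
    """Map each check to PASS, FAIL, or NOT_RUN for one drill."""
    results: Dict[str, str] = {}
    for check, limit in DR_LIMITS_S.items():
        if check not in measured:
            # A check that was skipped is reported, not silently passed.
            results[check] = "NOT_RUN"
        elif measured[check] <= limit:
            results[check] = "PASS"
        else:
            results[check] = "FAIL"
    return results
```

Reporting skipped checks as NOT_RUN matters for audits: a drill that only exercised DC failover should not read as a clean pass for NRF recovery.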

Field Note: First DR Drill — 6 Things That Did Not Work

A GCC operator ran its first quarterly DR drill 3 months after SA commercial launch:

(1) Primary DC failover took 4 minutes — etcd leader election was slow. Fix: tune etcd election timeout.

(2) UPF standby pod had stale PFCP session table — not all sessions migrated. Fix: SMF Operator migration procedure.

(3) NRF recovery: NFs could not re-register — NRF started before its etcd backend was ready. Fix: K8s initContainer dependency.

(4) AMF context not recovered — StatefulSet PersistentVolume not mounted. Fix: PV mount added.

(5) SEPP secondary had expired certificate — was never tested. Fix: cert-manager + certificate monitoring.

(6) Grafana dashboards lost all data — Prometheus retention was 1 day, DR drill reset it. Fix: remote_write to Thanos.

DR drills are how you find these gaps before they find you in production.

7. Summary — Key Takeaways

Topic	Key Takeaway
Day 2 investment	Operations gets less design investment than deployment and causes more incidents. Build observability before go-live. Run the first DR drill within the first month.
Zero-downtime upgrade	Scale out new version → NRF weight drain → KPI verification → terminate old. Requires the vendor Operator. Test in staging first. Always have a rollback plan.
GitOps	All production config in Git. PR review for every change. ArgoCD for automated deployment. This is the single most effective change management practice.
Observability	Prometheus + Grafana + Loki minimum. Build per-slice dashboards. Alert thresholds: Reg SR < 99.5%, PFCP Mod Timeout > 0. Certificate expiry 30-day advance alert.
DR drills	Quarterly minimum. Test: DC failover, UPF pod failure, NRF recovery, SEPP failover. First drill always reveals multiple gaps. This is expected and valuable.
NOC integration	NOC must have 5GC Grafana access, alarm correlation playbooks, and escalation paths for each failure pattern. The war room playbook from Post 14 is the starting point.

Table 4 — Post 19 summary. Operations is an ongoing practice, not a one-time setup. Build it with the same rigor as the network itself.

Next: Post 20 — 5GC Evolution: 5G-Advanced & Beyond

5GC Operations & Lifecycle Management