Day 0/1/2, zero-downtime upgrades, DR drills, configuration management, Helm GitOps, observability stack, NOC integration
1. What Is 5GC Operations — The Simple Version
5GC operations is everything that happens after the network goes live: keeping NFs healthy, performing software upgrades without dropping sessions, responding to faults within SLA windows, managing configuration changes safely, and proving to auditors that the network can recover from a disaster. This is where most operators underinvest compared to the deployment phase. The deployment gets a 6-month project team. Operations gets whoever is left.
Getting operations right from the start means: GitOps for all configuration changes, Grafana dashboards built before go-live (not after the first outage), quarterly DR drills with real session traffic, and zero-downtime upgrade procedures tested in staging before they run on production.
| 3GPP Reference |
| 3GPP TS 28.500: Management concept, architecture and requirements for mobile networks that include virtualized network functions |
| GSMA NG.126: Cloud Infrastructure Reference Model (operations section) |
| 3GPP TS 28.552: 5G performance measurements (reference for operations KPIs) |
2. Day 0 / Day 1 / Day 2 — The Operations Model
| Phase | Definition | Key Activities | Common Operator Mistake |
| Day 0 | Design and planning before deployment | Architecture decisions, hardware sizing, DC design, IP plan, S-NSSAI allocation, TAI mapping design, roaming partner SEPP exchange | Skipping TAI mapping design → SMF UPF selection gaps discovered post-launch |
| Day 1 | Initial deployment and configuration | NF software install, Helm chart deployment, NF registration in NRF, acceptance testing, KPI baseline | No acceptance test for PFCP Session Modification → asymmetric connectivity discovered from subscriber complaints |
| Day 2 | Ongoing operations | Software upgrades, configuration changes, SLA monitoring, incident response, capacity management, DR drills | No GitOps → config drift between NFs discovered only during incident investigation |
Table 1 — Day 0/1/2 operations model. Day 2 is where most operational gaps are — it gets the least design investment and causes the most production incidents.
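The TAI mapping gap named in the Day 0 row is something a planning script can catch before launch. The sketch below is a minimal illustration, not a vendor tool: the TAI string format, the UPF names, and the shape of the selection map are all assumptions made for the example.

```python
from typing import Dict, Set


def find_unmapped_tais(ran_tais: Set[str],
                       upf_tais_by_upf: Dict[str, Set[str]]) -> Set[str]:
    """Return TAIs present in the RAN plan that no UPF is mapped to serve."""
    covered: Set[str] = set()
    for served_tais in upf_tais_by_upf.values():
        covered |= served_tais
    return ran_tais - covered


# Hypothetical Day 0 data: TAIs written as "<PLMN>-<TAC>" on test PLMN 001-01.
ran_tais = {"00101-0001", "00101-0002", "00101-0003"}
upf_tais_by_upf = {
    "upf-a": {"00101-0001"},
    "upf-b": {"00101-0002"},
}

gaps = find_unmapped_tais(ran_tais, upf_tais_by_upf)
print(sorted(gaps))  # ['00101-0003']
```

Any TAI in the result is an area where the SMF would have no UPF to select, which is the gap that otherwise surfaces only after launch.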
3. Zero-Downtime NF Upgrade Procedure
Every 5GC software upgrade must be zero-downtime for subscribers. Here is the validated procedure for upgrading a stateful NF (SMF) on a Kubernetes cluster with the vendor Operator installed:
Step 1. Pre-upgrade validation: verify current NF health (all pods Running, NRF registration Active, KPIs at baseline). Verify that the upgrade to the same version has completed successfully in staging. Verify that the rollback procedure has been tested.
Step 2. Scale out: deploy new-version pods alongside the old-version pods. New pods register in NRF with weight=1 (low traffic); old pods continue at weight=100. Both versions run simultaneously.
Step 3. Traffic drain: set old pods to weight=0 in NRF so that no new PDU session requests are routed to them. Existing sessions continue on the old pods. Wait for the session count on the old pods to drain toward zero (configurable timer: 10–30 minutes for SMF).
Step 4. Verification: during the drain, monitor KPIs on the new-version pods: Registration SR, PDU Session SR, N4 latency. If any KPI breaches its threshold, execute the rollback immediately (Helm rollback to the previous release).
Step 5. Termination: after the drain timer expires (or the session count on the old pods reaches zero), terminate the old pods. Scale back to the normal replica count and remove the extra capacity pods.
Step 6. Post-upgrade validation: verify that all pods run the new version, all are registered in NRF, and KPIs are at baseline. Close the change window.
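The decision made repeatedly during Steps 3 to 5 (keep draining, roll back, or terminate the old pods) can be written as one small function. This is a sketch under stated assumptions: the class and function names are hypothetical, the Registration SR and PDU Session SR limits reuse the alert examples from Table 3, and the N4 latency limit is an invented placeholder. A real vendor Operator will have its own logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KpiSample:
    registration_sr: float  # percent
    pdu_session_sr: float   # percent
    n4_latency_ms: float


@dataclass(frozen=True)
class Limits:
    min_registration_sr: float = 99.5  # from the alert examples in Table 3
    min_pdu_session_sr: float = 97.0   # from the alert examples in Table 3
    max_n4_latency_ms: float = 50.0    # assumed value, tune per deployment


def next_action(kpi: KpiSample, old_pod_sessions: int,
                drain_elapsed_s: int, drain_timeout_s: int,
                limits: Limits = Limits()) -> str:
    """Decide what the upgrade controller should do on this polling cycle."""
    # Step 4: any KPI breach on the new version means immediate rollback.
    if (kpi.registration_sr < limits.min_registration_sr
            or kpi.pdu_session_sr < limits.min_pdu_session_sr
            or kpi.n4_latency_ms > limits.max_n4_latency_ms):
        return "rollback"
    # Step 5: old pods go away once drained or once the timer expires.
    if old_pod_sessions == 0 or drain_elapsed_s >= drain_timeout_s:
        return "terminate_old_pods"
    # Step 3: otherwise keep waiting for sessions to drain.
    return "keep_draining"


healthy = KpiSample(99.9, 99.2, 12.0)
degraded = KpiSample(98.7, 99.2, 12.0)
print(next_action(healthy, 1200, 300, 1800))   # keep_draining
print(next_action(healthy, 0, 900, 1800))      # terminate_old_pods
print(next_action(degraded, 1200, 300, 1800))  # rollback
```

Checking for rollback before termination is deliberate: a degraded new version should never be left as the only version running.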
| Pro Tip |
| UPF upgrade has an additional step between Steps 3 and 4: SMF Operator migrates active GTP-U sessions from old UPF to standby UPF via PFCP Session Establishment on standby + deletion on old. |
| Each session migration causes a ~100ms interruption per session. |
| Schedule UPF upgrades during lowest-traffic window (02:00–04:00 local time). Consider VoNR call state before migrating. |
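One way to honour the VoNR consideration is to split sessions into those that can be migrated immediately and those to defer until the call ends. The sketch below assumes a hypothetical `voice_call_active` flag per session. How that flag is derived (for example from an active voice QoS flow) is vendor-specific and not covered here.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Session(NamedTuple):
    session_id: str
    voice_call_active: bool  # hypothetical flag; derivation is vendor-specific


def plan_migration(sessions: Iterable[Session]) -> Tuple[List[str], List[str]]:
    """Split sessions into (migrate_now, deferred) by voice call state."""
    migrate_now: List[str] = []
    deferred: List[str] = []
    for s in sessions:
        (deferred if s.voice_call_active else migrate_now).append(s.session_id)
    return migrate_now, deferred


sessions = [
    Session("pdu-001", False),
    Session("pdu-002", True),   # in a VoNR call: a ~100ms gap would be audible
    Session("pdu-003", False),
]
now_ids, deferred_ids = plan_migration(sessions)
print(now_ids)       # ['pdu-001', 'pdu-003']
print(deferred_ids)  # ['pdu-002']
```

Deferred sessions would be re-evaluated on a later pass, bounded by the change window.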
4. Configuration Management — GitOps for 5GC
Every NF configuration change in production must go through Git. This is not optional for 5GC operations — without version-controlled configuration, you cannot answer “what changed?” when investigating an incident, and you cannot execute a reliable rollback.
| GitOps Component | Tool | How It Works for 5GC |
| Source of truth | Git repository (GitLab/GitHub) | All Helm values.yaml and NF ConfigMaps stored in Git. Production branch protected — requires PR review. |
| Change process | Pull Request + peer review + CI validation | Every config change = PR. CI pipeline validates: syntax check, NSSAI/TAI consistency, inter-NF config cross-check. |
| Deployment | ArgoCD or Flux CD | Watches Git repository. Applies changes to K8s cluster automatically on merge to production branch. No manual kubectl apply in production. |
| Rollback | Git revert + ArgoCD sync | Incident: revert commit in Git → ArgoCD automatically applies previous config to all affected NFs. |
| Secrets management | Vault or K8s Secrets with git-crypt | TLS certificates and NF credentials never in plain Git. Referenced via Vault paths in Helm values. |
Table 2 — GitOps for 5GC configuration management. The single most effective operational practice for preventing configuration drift and enabling fast rollback.
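The inter-NF cross-check in the CI step can be as simple as comparing slice lists across NF configurations. The sketch below assumes the S-NSSAI lists have already been parsed out of the Helm values into Python sets. The function name, the `(SST, SD)` tuple representation, and the instance names are illustrative, not taken from any vendor chart.

```python
from typing import Dict, List, Set, Tuple

Snssai = Tuple[int, str]  # (SST, SD), for example (1, "000001")


def check_slice_consistency(amf_slices: Set[Snssai],
                            smf_slices: Dict[str, Set[Snssai]]) -> List[str]:
    """Return human-readable errors; an empty list means the configs agree."""
    errors: List[str] = []
    served: Set[Snssai] = set().union(*smf_slices.values())
    # Slices the AMF admits but no SMF can serve: PDU sessions would fail.
    for sst, sd in sorted(amf_slices - served):
        errors.append(f"sst={sst} sd={sd}: allowed on AMF, served by no SMF")
    # Slices an SMF serves but the AMF never admits: likely config drift.
    for name, slices in sorted(smf_slices.items()):
        for sst, sd in sorted(slices - amf_slices):
            errors.append(f"sst={sst} sd={sd}: served by {name}, unknown to AMF")
    return errors


amf = {(1, "000001"), (1, "000002")}
smf = {"smf-0": {(1, "000001")}, "smf-1": {(1, "000001"), (2, "000009")}}
for line in check_slice_consistency(amf, smf):
    print(line)
```

A CI job would fail the pull request whenever the returned list is non-empty, so the mismatch is caught in review and not during an incident.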
5. Observability Stack
| Component | Tool | What It Monitors | Alert Examples |
| Metrics collection | Prometheus + vendor exporters | NF KPIs (Reg SR, PDU SR, N4 latency), K8s pod health, NIC throughput | Reg SR < 99.5%: page NOC immediately |
| Visualisation | Grafana dashboards | Per-NF service KPIs, per-slice KPIs, platform metrics, DC power/cooling | PFCP Mod Timeout Rate > 0.01%: investigate |
| Log aggregation | Loki or ELK (Elasticsearch/Kibana) | NF structured logs, K8s events, NGAP/PFCP error logs | PFCP_SESSION_MODIFICATION_TIMEOUT count > 0/hour |
| Tracing | Jaeger or Zipkin (if vendor supports) | HTTP/2 SBI request traces across NF chains | N11 call latency P95 > 200ms — trace shows where time is spent |
| Alerting | Alertmanager + PagerDuty/OpsGenie | Rule-based alerts from Prometheus; log-based alerts from Loki | Night-time critical: PDU SR < 97%; Day: Reg SR < 99.5% |
| Certificate monitoring | cert-manager + Grafana cert expiry panel | TLS certificate expiry dates for all NFs and SEPP | Alert 30 days before any certificate expiry |
Table 3 — 5GC observability stack. Build this before go-live. The first production incident response time is 10× faster with pre-built dashboards than without.
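In production these thresholds live in Prometheus alerting rules. The Python sketch below only illustrates the evaluation logic behind the alert examples in the table. The function names and the counter inputs are assumptions, and windows with zero attempts are skipped here, which a real setup would more likely treat as a separate "no traffic" alert.

```python
from typing import List, Optional


def success_rate(successes: int, attempts: int) -> Optional[float]:
    """Success rate in percent, or None when there were no attempts."""
    if attempts == 0:
        return None
    return 100.0 * successes / attempts


def evaluate_alerts(reg_ok: int, reg_attempts: int,
                    pdu_ok: int, pdu_attempts: int,
                    pfcp_mod_timeouts_last_hour: int) -> List[str]:
    """Apply the example thresholds from the table to one sample window."""
    alerts: List[str] = []
    reg_sr = success_rate(reg_ok, reg_attempts)
    if reg_sr is not None and reg_sr < 99.5:
        alerts.append("page_noc: Registration SR below 99.5%")
    pdu_sr = success_rate(pdu_ok, pdu_attempts)
    if pdu_sr is not None and pdu_sr < 97.0:
        alerts.append("critical: PDU Session SR below 97%")
    if pfcp_mod_timeouts_last_hour > 0:
        alerts.append("investigate: PFCP Session Modification timeouts seen")
    return alerts


# 9,940 of 10,000 registrations succeeded (99.4%); PDU sessions are healthy.
print(evaluate_alerts(9940, 10000, 9800, 10000, 0))
```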
6. DR Drills — Quarterly Test Procedure
A disaster recovery plan that has never been tested is not a plan — it is a hypothesis. For 5GC, the minimum quarterly DR test should cover:
DR Test 1 — Primary DC failure: cut all power/network to primary DC (in lab environment or using Kubernetes node drain to simulate). Validate: secondary DC takes over within RTO (typically 30–60s for active-active, 2–5 minutes for active-standby). Validate: no active PDU sessions dropped (for active-active). Validate: new registrations resume within 30 seconds.
DR Test 2 — UPF pod failure: kill the primary UPF pod serving a test PLMN (kubectl delete pod). Validate: SMF declares UPF unhealthy within PFCP heartbeat window (180s default, configurable). Validate: SMF re-establishes sessions on standby UPF. Validate: no unrecoverable session drops for test UEs.
DR Test 3 — NRF failure: kill all NRF pods simultaneously. Validate: consumer NFs can still call each other using cached NRF discovery results for validityPeriod duration (300s minimum). Validate: NRF recovers and re-registers all NFs within 120s of pod restart.
DR Test 4 — SEPP failure: kill primary SEPP pod. Validate: N32 failover to secondary SEPP. Validate: roaming registrations continue without interruption. This is critical before any commercial roaming launch.
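Recording each drill as measured timings against explicit targets makes quarter-to-quarter comparison straightforward. The sketch below takes its limits from the upper bounds quoted in the tests above. The check names and the result format are assumptions for illustration, and each operator should substitute its own RTO figures.

```python
from typing import Dict

# Upper limits in seconds, taken from the figures quoted in DR Tests 1 to 3.
TARGETS_S: Dict[str, float] = {
    "dc_failover_active_active": 60,
    "dc_failover_active_standby": 300,
    "registration_resume": 30,
    "upf_failure_detection": 180,
    "nrf_recovery": 120,
}


def evaluate_drill(measured_s: Dict[str, float]) -> Dict[str, str]:
    """Mark each check as pass, fail, or not_measured."""
    results: Dict[str, str] = {}
    for check, limit in TARGETS_S.items():
        if check not in measured_s:
            results[check] = "not_measured"
        elif measured_s[check] <= limit:
            results[check] = "pass"
        else:
            results[check] = "fail"
    return results


# Hypothetical drill: DC failover took 4 minutes, other checks were in range.
drill = {"dc_failover_active_active": 240, "registration_resume": 22,
         "nrf_recovery": 95}
for check, outcome in evaluate_drill(drill).items():
    print(f"{check}: {outcome}")
```

Reporting `not_measured` explicitly matters: a check that was skipped should not be mistaken for one that passed.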
| Field Note: First DR Drill (6 Things That Did Not Work) |
| A GCC operator ran its first quarterly DR drill 3 months after SA commercial launch: |
| (1) Primary DC failover took 4 minutes — etcd leader election was slow. Fix: tune etcd election timeout. |
| (2) UPF standby pod had stale PFCP session table — not all sessions migrated. Fix: SMF Operator migration procedure. |
| (3) NRF recovery: NFs could not re-register — NRF started before its etcd backend was ready. Fix: K8s initContainer dependency. |
| (4) AMF context not recovered — StatefulSet PersistentVolume not mounted. Fix: PV mount added. |
| (5) SEPP secondary had expired certificate — was never tested. Fix: cert-manager + certificate monitoring. |
| (6) Grafana dashboards lost all data — Prometheus retention was 1 day, DR drill reset it. Fix: remote_write to Thanos. |
| DR drills are how you find these gaps before they find you in production. |
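The expired certificate on the secondary SEPP is the kind of gap a scheduled check finds early. The sketch below assumes the certificate expiry dates have already been collected into a dictionary. The certificate names and dates are invented, and the 30-day window matches the alert example in Table 3.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def certs_needing_renewal(not_after_by_cert: Dict[str, datetime],
                          now: datetime, warn_days: int = 30) -> List[str]:
    """Return names of certificates that are expired or expire within warn_days."""
    cutoff = now + timedelta(days=warn_days)
    return sorted(name for name, not_after in not_after_by_cert.items()
                  if not_after <= cutoff)


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
inventory = {
    "sepp-primary": datetime(2025, 6, 1, tzinfo=timezone.utc),
    "sepp-secondary": datetime(2024, 12, 15, tzinfo=timezone.utc),  # already expired
    "amf-sbi": datetime(2025, 1, 20, tzinfo=timezone.utc),          # 19 days left
}
print(certs_needing_renewal(inventory, now))  # ['amf-sbi', 'sepp-secondary']
```

Standby components belong in the same inventory as active ones, since a standby certificate is only exercised when failover happens.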
7. Summary — Key Takeaways
| Topic | Key Takeaway |
| Day 2 investment | Operations gets less design investment than deployment and causes more incidents. Build observability before go-live. Run the first DR drill within the first month. |
| Zero-downtime upgrade | Scale out new version → NRF weight drain → KPI verification → terminate old. Requires the vendor Operator. Test in staging first. Always have a rollback plan. |
| GitOps | All production config in Git. PR review for every change. ArgoCD for automated deployment. This is the single most effective change management practice. |
| Observability | Prometheus + Grafana + Loki minimum. Build per-slice dashboards. Alert thresholds: Reg SR < 99.5%, PFCP Session Modification timeouts > 0 per hour. Alert 30 days before any certificate expiry. |
| DR drills | Quarterly minimum. Test: DC failover, UPF pod failure, NRF recovery, SEPP failover. First drill always reveals multiple gaps. This is expected and valuable. |
| NOC integration | NOC must have 5GC Grafana access, alarm correlation playbooks, and escalation paths for each failure pattern. The war room playbook from Post 14 is the starting point. |
Table 4 — Post 19 summary. Operations is an ongoing practice, not a one-time setup. Build it with the same rigor as the network itself.
Next: Post 20 — 5GC Evolution: 5G-Advanced & Beyond
