Packet capture on N2/N4/N11/N3, Grafana dashboard design, Huawei MAE alarms, NRF 503 storms, UPF path failure, site-dark war room walkthrough
1. The Troubleshooting Mindset
When a 5GC fault occurs, the instinct is to look at the NF that is alarming. That is usually the wrong place to start. 5GC failures are systemic: a failed NRF causes AMF symptoms, a failed N4 association causes SMF symptoms, and DSCP re-marking at an aggregation router causes VoNR symptoms. The correct approach is to follow the session flow backward from the symptom to the root cause, using the right tools at each layer.
This article is the war room playbook. It covers Wireshark and tcpdump filters for every 5GC interface, a Grafana dashboard layout that spots failures before they become outages, Huawei MAE alarm correlation maps, the top five failure patterns with diagnosis paths, and a full site-dark walkthrough from first alarm to root cause.
2. Packet Capture — Wireshark and tcpdump for 5GC
tcpdump One-Liners for Each Interface
Capture N2 (NGAP over SCTP port 38412): tcpdump -i eth0 -w n2_capture.pcap sctp port 38412
Capture N4 (PFCP over UDP port 8805): tcpdump -i eth0 -w n4_capture.pcap udp port 8805
Capture N3 (GTP-U over UDP port 2152): tcpdump -i eth0 -w n3_capture.pcap udp port 2152
Capture N3 for one specific UE by TEID: tcpdump -i eth0 -w ue_teid.pcap "udp port 2152 and udp[12:4] == 0x12345678"
Capture SBI HTTP/2 (TCP 443): tcpdump -i eth0 -w sbi_capture.pcap tcp port 443 and host 10.1.2.3
Capture all 5GC interfaces simultaneously (N2, N4, N3 and SBI): tcpdump -i eth0 -w all_5gc.pcap "sctp port 38412 or udp port 8805 or udp port 2152 or tcp port 443"
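The per-UE TEID filter above relies on header arithmetic: the GTP-U header starts right after the 8-byte UDP header, and the TEID sits 4 bytes into the GTP-U header (after flags, message type and length), which gives offset 12 from the start of the UDP header. A minimal Python sketch of that arithmetic follows. The function names are illustrative and not part of tcpdump or any 5GC tool.

```python
import struct

UDP_HEADER_LEN = 8      # source port, destination port, length, checksum
GTPU_TEID_OFFSET = 4    # flags (1 byte) + message type (1) + length (2)
GTPU_PORT = 2152


def teid_bpf_filter(teid):
    """Build a tcpdump capture filter that matches one GTP-U TEID."""
    if not 0 <= teid <= 0xFFFFFFFF:
        raise ValueError("TEID must fit in 32 bits")
    offset = UDP_HEADER_LEN + GTPU_TEID_OFFSET
    return f"udp port {GTPU_PORT} and udp[{offset}:4] == 0x{teid:08x}"


def parse_gtpu_teid(gtpu_header):
    """Read the TEID from the first 8 bytes of a GTP-U header."""
    _flags, _msg_type, _length, teid = struct.unpack("!BBHI", gtpu_header[:8])
    return teid
```

Calling teid_bpf_filter(0x12345678) returns the same expression used in the one-liner. Note that, according to the pcap-filter manual, udp[...] byte indexing applies to IPv4 only, so an N3 path running over IPv6 would need a different expression.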
Wireshark Display Filters for 5GC
| Interface | Protocol | Wireshark Display Filter | What You See |
| N2 (AMF–gNB) | NGAP over SCTP | ngap | All NGAP: InitialUEMessage, UEContextRelease, PDUSessionResourceSetup, Paging, Handover |
| N2 — Registration only | NGAP | ngap.procedureCode == 15 | InitialUEMessage carrying NAS Registration Request |
| N2 — Handover | NGAP | ngap.procedureCode == 12 or ngap.procedureCode == 13 | HandoverPreparation (HandoverRequired) / HandoverResourceAllocation (HandoverRequest) |
| N4 (SMF–UPF) | PFCP | pfcp | All PFCP: Session Estab/Mod/Del, Usage Reports, Heartbeat, Association |
| N4 — Failures only | PFCP | pfcp.cause != 1 | PFCP responses where Cause != Request Accepted (0x01). Any other cause value = failure. |
| N4 — Specific session | PFCP | pfcp.seid == 0x1234ABCD | All PFCP messages for one session — get SEID from SMF log |
| N11 (AMF–SMF) | HTTP/2 | http2 | All SBI HTTP/2 — enable TLS decryption with SSLKEYLOGFILE for plaintext |
| N11 — Errors only | HTTP/2 | http2.headers.status matches "^[45]" | 4xx and 5xx HTTP responses — NF errors. Any 5xx on N11 = session setup issue. |
| N3 (gNB–UPF) | GTP-U | gtp | All GTP-U: G-PDU data, Echo Request/Response (N3 path health check) |
| N3 — Specific TEID | GTP-U | gtp.teid == 0x12345678 | All traffic for one PDU session — TEID from SMF PFCP session log |
Table 1 — Wireshark display filters for 5GC interfaces. The N4 failures filter (pfcp.cause != 1) is the fastest way to spot PFCP problems without reading every packet.
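Once the N4 failures filter has isolated the non-accepted responses, it helps to group them by cause before reading individual packets. The sketch below assumes the cause values have already been exported from the capture as integers (for example with tshark field output). The cause names are a subset recalled from 3GPP TS 29.244 and should be verified against the release your SMF and UPF implement.

```python
from collections import Counter

REQUEST_ACCEPTED = 1

# Subset of PFCP Cause values (assumed from 3GPP TS 29.244; verify locally).
PFCP_CAUSES = {
    1: "Request accepted",
    64: "Request rejected",
    65: "Session context not found",
    72: "No established PFCP association",
    73: "Rule creation/modification failure",
    74: "PFCP entity in congestion",
    75: "No resources available",
    77: "System failure",
}


def summarise_failures(causes):
    """Count non-accepted causes by name, most frequent first."""
    failures = Counter(
        PFCP_CAUSES.get(cause, f"Unlisted cause {cause}")
        for cause in causes
        if cause != REQUEST_ACCEPTED
    )
    return failures.most_common()
```

A capture dominated by "PFCP entity in congestion" points toward UPF load, while "Session context not found" points toward lost session state, so the summary narrows which sessions to open first.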
3. Top 5 Failure Patterns and Diagnosis Paths
Pattern 1: Mass Registration Failure
Symptom: AMF registration success rate (SR) suddenly drops below 90%. Grafana shows a spike in the authentication failure rate. No gNB alarms.
Check 1 — Is AUSF reachable? kubectl get pods -n 5gc | grep ausf. If CrashLoopBackOff: AUSF is the root cause.
Check 2 — Is NRF overloaded? Grafana: NRF discovery request rate vs capacity. If rate > configured NRF max: NRF overload. Fix: enable AMF NRF caching, scale NRF.
Check 3 — Wireshark N2 capture: InitialUEMessage arriving at AMF? If yes but no Authentication Request sent back: AMF cannot reach AUSF on N12. Capture N12 HTTP/2 and look for connection refused or TLS failure.
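The three checks above can be written down as an ordered decision function, which is a convenient way to keep a war room from skipping ahead. This is a sketch only: the function name and its inputs are hypothetical, and the values would come from kubectl, Grafana and the N2 capture respectively.

```python
def diagnose_mass_registration_failure(
    ausf_pod_status,
    nrf_request_rate,
    nrf_max_rate,
    initial_ue_message_seen,
    auth_request_sent,
):
    """Apply the Pattern 1 checks in order and return the first conclusion."""
    # Check 1: AUSF pod state as reported by kubectl.
    if ausf_pod_status == "CrashLoopBackOff":
        return "AUSF is crash-looping: treat AUSF as the root cause"
    # Check 2: NRF discovery request rate against configured capacity.
    if nrf_request_rate > nrf_max_rate:
        return "NRF overload: enable AMF NRF caching and scale NRF"
    # Check 3: registrations reach the AMF but no Authentication Request returns.
    if initial_ue_message_seen and not auth_request_sent:
        return "AMF cannot reach AUSF on N12: capture N12 HTTP/2"
    return "Inconclusive: none of the three checks matched"
```

The order matters: a crash-looping AUSF would also produce the Check 3 signature, so the cheaper pod check runs first.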
Pattern 2: Asymmetric Connectivity — Uploads Work, Downloads Fail
Symptom: PDU session shows as active. UE uplink data flows normally. Downlink packets arrive at UPF N6 but UE receives nothing.
Check 1 — SMF logs: search for PFCP_SESSION_MODIFICATION_TIMEOUT for the affected session. If found: PFCP Session Modification (Step 8 of PDU session setup) failed. UPF is still in BUFFER mode for downlink.
Check 2 — N4 packet capture during session setup: does the PFCP Session Modification Request with Outer Header Creation (gNB TEID) appear? Does UPF respond? If no response: UPF N4 queue backed up.
Fix: increase UPF N4 processing threads. Increase SMF PFCP T1 timer from 3s to 8s with N1=3 retries.
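To see what the timer change buys, it helps to work out how long the SMF waits before it declares a timeout. The sketch below assumes that N1 counts retransmissions sent after the initial request and that each transmission waits a full T1 for a reply. Some implementations count the first transmission inside N1, so check your vendor's definition before relying on these numbers.

```python
def pfcp_give_up_time(t1_seconds, n1_retries):
    """Seconds before the sender gives up on a PFCP request.

    Assumption: one initial transmission plus n1_retries retransmissions,
    each followed by a wait of t1_seconds.
    """
    if t1_seconds <= 0 or n1_retries < 0:
        raise ValueError("T1 must be positive and N1 must not be negative")
    return t1_seconds * (1 + n1_retries)


before = pfcp_give_up_time(3, 3)   # T1 = 3s, N1 = 3
after = pfcp_give_up_time(8, 3)    # T1 = 8s, N1 = 3
```

Under these assumptions the window grows from 12 seconds to 32 seconds, which gives a backed-up UPF N4 queue more time to answer but also delays failure detection when the UPF is genuinely down.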
Pattern 3: NRF 503 Storm — Cascading SBI Failures
Symptom: after maintenance window or NF pod restarts, SBI error rate climbs across all NF pairs. Auth failure rate rises. PDU session setup failures. Grafana shows NRF HTTP 503 rate spike.
What happened: all NFs restarted at the same time, so they all attempted OAuth2 token refresh and NRF discovery at the same time. The NRF token endpoint was overwhelmed.
Check: NRF pod CPU and HTTP connection pool utilisation. If NRF CPU > 90% and 503 rate > 10%: NRF overload confirmed.
Immediate fix: none. Wait 2-3 minutes for the backlog of token requests to clear. Permanent fix: add startup jitter (0–60s random delay) to NF pod token refresh, and stagger NF pod restarts during maintenance (10-minute waves, not all at once).
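Both permanent fixes are simple to express in code. The following sketch shows a random startup delay in the 0 to 60 second range and a helper that splits a restart list into 10-minute waves. The function names and the wave size are assumptions for illustration, and a real NF would apply the delay inside its own startup path.

```python
import random


def startup_delay(max_jitter_s=60.0, rng=None):
    """Pick a random delay in [0, max_jitter_s] before the first token refresh."""
    source = rng if rng is not None else random
    return source.uniform(0.0, max_jitter_s)


def restart_waves(nf_names, wave_size, wave_interval_min=10):
    """Split NFs into restart waves; returns (start_minute, [nf, ...]) pairs."""
    if wave_size < 1:
        raise ValueError("wave_size must be at least 1")
    waves = []
    for index in range(0, len(nf_names), wave_size):
        start_minute = (index // wave_size) * wave_interval_min
        waves.append((start_minute, list(nf_names[index:index + wave_size])))
    return waves
```

For example, restart_waves(["amf", "smf", "ausf", "udm", "pcf"], 2) schedules two NFs at minute 0, two at minute 10 and the last at minute 20, so the NRF never sees every consumer reconnect at once.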
Pattern 4: UPF GTP-U Path Failure — gNB Cannot Reach UPF N3
Symptom: all PDU sessions on a specific gNB lose data plane connectivity. Sessions still show as active in SMF. No PDU session release alarms. N3 Echo Request from UPF to gNB times out.
Check 1 — tcpdump on UPF N3 interface: are GTP-U Echo Request packets going out? Are Echo Response packets coming back? No response: N3 transport path failure.
Check 2 — Traceroute from UPF to gNB N3 IP. If path asymmetric or failing at aggregation router: transport issue.
Check 3 — If Echo Response is arriving but data packets are not: check gNB TEID in UPF FAR. If UPF FAR has stale gNB TEID from before last handover: PFCP modification failure after handover.
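Check 3 is a comparison between two views of the same session: the gNB TEID the UPF holds in its FAR and the TEID the gNB is actually using after the handover. A minimal sketch of that comparison follows. The two dictionaries are hypothetical exports keyed by session identifier; how you obtain them depends on your UPF and gNB tooling.

```python
def find_stale_fars(upf_far_teids, gnb_current_teids):
    """Return session IDs whose UPF FAR holds a TEID the gNB no longer uses.

    upf_far_teids: {session_id: gNB TEID programmed in the UPF FAR}
    gnb_current_teids: {session_id: TEID the gNB currently expects}
    """
    stale = []
    for session_id, far_teid in upf_far_teids.items():
        current_teid = gnb_current_teids.get(session_id)
        # Sessions missing from the gNB view are skipped, not flagged.
        if current_teid is not None and current_teid != far_teid:
            stale.append(session_id)
    return sorted(stale)
```

Any session returned here is a candidate for a failed PFCP modification after handover, and its SEID is the one to filter on in the N4 capture.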
Pattern 5: Site Dark — Full Scenario Walkthrough
03:00. NOC ticket: site Muscat-AlKhuwair-02 showing zero active UEs. 400 users affected. No data, no voice.
Step 1 — Grafana: which NF alarms are active for this site? AMF alarm = core/signalling issue. No NF alarm + gNB alarm = transport or gNB issue. No alarm anywhere = monitoring failure.
Step 2 — Check gNB OAM: is the gNB operational? Does it show N2 connection status as Connected to AMF? If the gNB shows N2 Connected but users still fail, the gNB itself is unlikely to be the fault: continue toward the core.
Step 3 — If N2 shows Connected: capture N2 on AMF interface for this gNB. Are NGAP InitialUEMessage packets arriving? If yes: core is receiving registrations. If no: N2 transport failure between gNB and AMF.
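Step 4 — If NGAP is arriving: is the AMF sending an Authentication Request back? If no: the AMF cannot obtain an authentication vector from the AUSF. Check AUSF pod health and N12 connectivity. If yes: the AUSF is reachable and authentication is failing later in the exchange. Capture N12 and inspect the AUSF responses.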
Step 5 — If auth succeeds but no data: check N4. tcpdump on UPF N4 interface: PFCP Session Establishment requests from SMF arriving? UPF responding with Request Accepted? If no response: UPF is the problem.
Step 6 — If PDU sessions are establishing but no data: N3 path check. GTP-U Echo Request from UPF to gNB N3 IP. Response arriving? If not: N3 transport failure. Traceroute to confirm.
Step 7 — Correlate timeline: when did first alarm fire vs when did users start dropping? If a transport maintenance event preceded the failure by 5-10 minutes: likely transport configuration change caused the issue.
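The walkthrough above is an ordered isolation procedure, and encoding the order makes it harder to skip a step under pressure. The sketch below is one possible way to do that. The observation keys and conclusion strings are illustrative names chosen here, and Step 1 (the Grafana alarm overview) is assumed to have been done before the observations are collected.

```python
# (step number, observation key, conclusion if the observation is False)
SITE_DARK_CHECKS = [
    (2, "n2_connected", "gNB does not report N2 Connected: gNB or transport issue"),
    (3, "initial_ue_message_at_amf", "N2 transport failure between gNB and AMF"),
    (4, "auth_request_sent", "AMF cannot reach AUSF: check AUSF pod health"),
    (5, "pfcp_establishment_accepted", "UPF is the problem: check N4 and UPF health"),
    (6, "n3_echo_response", "N3 transport failure: confirm with traceroute"),
]


def first_failing_step(observations):
    """Walk the checks in order and stop at the first failed or missing one."""
    for step, key, conclusion in SITE_DARK_CHECKS:
        if key not in observations:
            return step, f"Not yet checked: {key}"
        if not observations[key]:
            return step, conclusion
    return 7, "All checks passed: correlate the alarm timeline with change events"
```

A missing observation is treated as a stop, not a pass, so the function refuses to blame the UPF until the N2 and authentication checks have actually been recorded.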
4. MAE Alarm Correlation
| Alarm Name (Huawei MAE) | What It Means | Immediate Check | Likely Fix |
| 5GC_AMF_REG_SR_LOW | Registration success rate below threshold | Check AUSF pod health; NRF discovery latency; AMF N12 error counter | Restart AUSF if crashed; enable NRF caching; scale AUSF if overloaded |
| 5GC_SMF_N4_ASSOC_LOST | PFCP Association between SMF and UPF lost | Check UPF pod health; N4 IP connectivity (ping UPF N4 IP from SMF); IPsec status | Restore N4 transport; restart UPF; SMF will re-establish association and re-program sessions |
| 5GC_NRF_DISC_LATENCY_HIGH | NRF discovery response time P95 > threshold | Check NRF pod CPU; NRF discovery request rate vs configured capacity | Scale NRF horizontally; enable discovery caching on consumer NFs |
| 5GC_UPF_TPUT_THRESHOLD | UPF throughput approaching N6 link capacity | Check N6 interface utilisation; UPF DPDK CPU workers | Add UPF instances; upgrade N6 link capacity |
| 5GC_SEPP_N32_FAILURE | N32 inter-PLMN SEPP link failure | Check SEPP TLS certificate validity; N32 IPX connectivity; partner SEPP reachability | Renew SEPP certificate; engage IPX provider; verify partner SEPP IP |
| 5GC_SMF_PFCP_MOD_TIMEOUT | PFCP Session Modification timeout rate exceeding threshold | UPF N4 processing thread utilisation; N4 transport latency | Increase UPF N4 threads; increase SMF PFCP T1 timer; check N4 path QoS |
Table 2 — 5GC alarm correlation (indicative — verify alarm IDs with vendor alarm dictionary for your software version).
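When several of these alarms fire together, the useful question is which one is upstream of the others. The sketch below filters a set of active alarms down to those with no active upstream alarm. The single dependency it encodes (NRF discovery latency can drive a low AMF registration success rate) is an assumption drawn from the checks in the table, and the map would need extending from your own vendor alarm dictionary.

```python
# Assumed upstream relationships: alarm -> alarms that can cause it.
ASSUMED_UPSTREAM = {
    "5GC_AMF_REG_SR_LOW": {"5GC_NRF_DISC_LATENCY_HIGH"},
}


def root_candidates(active_alarms):
    """Return active alarms that have no active upstream alarm."""
    active = set(active_alarms)
    return sorted(
        alarm
        for alarm in active
        if not (ASSUMED_UPSTREAM.get(alarm, set()) & active)
    )
```

With both 5GC_AMF_REG_SR_LOW and 5GC_NRF_DISC_LATENCY_HIGH active, only the NRF alarm is returned, which matches the rule of not starting with the NF that is alarming loudest.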
5. Summary — Key Takeaways
| Topic | Key Takeaway |
| Diagnosis approach | Follow the session flow. Symptom → which step fails → which interface/NF → packet capture confirms. Never jump to conclusions based on which NF is alarming. |
| Wireshark N4 filter | pfcp.cause != 1 is the fastest way to spot all PFCP failures in a capture. Any cause other than 1 (Request Accepted) = investigate that session. |
| Asymmetric connectivity | Uploads work, downloads fail: suspect a PFCP Session Modification timeout. Check SMF logs for PFCP_SESSION_MODIFICATION_TIMEOUT, then check UPF N4 thread pool utilisation. |
| NRF 503 storm | After maintenance: stagger NF restarts + add startup jitter to token refresh. NRF overload self-heals in 2-3 minutes but prevention is better. |
| Site dark walkthrough | 7-step isolation: Grafana → gNB OAM → N2 capture → N12 check → N4 check → N3 check → timeline correlation. Do not skip steps. |
| Grafana before Wireshark | Grafana spots the failure category in 30 seconds. Wireshark confirms root cause in 5 minutes. Use both — in that order. |
Table 3 — Post 14 summary. Troubleshooting 5GC is systematic layer-by-layer isolation, not random NF restarts.
Next: Post 15 — 5GC Security Architecture
