Packet capture on N2/N4/N11/N3, Grafana dashboard design, Huawei MAE alarms, NRF 503 storms, UPF path failure, site-dark war room walkthrough

1. The Troubleshooting Mindset

When a 5GC fault occurs, the instinct is to look at the NF that is alarming. This is almost always wrong. 5GC failures are systemic — a failed NRF causes AMF symptoms, a failed N4 causes SMF symptoms, a DSCP remark at an aggregation router causes VoNR symptoms. The correct approach is to follow the session flow backward from the symptom to the root cause, using the right tools at each layer.

This article is the war room playbook. It covers Wireshark and tcpdump filters for every 5GC interface, a Grafana dashboard layout that spots failures before they become outages, Huawei MAE alarm correlation maps, the top five failure patterns with diagnosis paths, and a full site-dark walkthrough from first alarm to root cause.

2. Packet Capture — Wireshark and tcpdump for 5GC

tcpdump One-Liners for Each Interface

Capture N2 (NGAP over SCTP port 38412): tcpdump -i eth0 -w n2_capture.pcap sctp port 38412

Capture N4 (PFCP over UDP port 8805): tcpdump -i eth0 -w n4_capture.pcap udp port 8805

Capture N3 (GTP-U over UDP port 2152): tcpdump -i eth0 -w n3_capture.pcap udp port 2152

Capture N3 for one specific UE by TEID: tcpdump -i eth0 "udp port 2152 and (udp[12:4] == 0x12345678)" -w ue_teid.pcap
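The offset in the TEID filter follows from the header layout: the UDP header is 8 bytes long and the TEID sits 4 bytes into the GTP-U header, so it starts at byte 12 counted from the start of the UDP header. A minimal sketch that builds the same filter for any TEID is shown here. The function name `teid_bpf_filter` is hypothetical, and the sketch assumes an IPv4 outer header, since the `udp[...]` index syntax in capture filters generally applies to IPv4 only.

```python
UDP_HEADER_LEN = 8     # bytes in the UDP header
GTPU_TEID_OFFSET = 4   # TEID starts 4 bytes into the GTP-U header
GTPU_PORT = 2152       # N3 GTP-U port used throughout this article


def teid_bpf_filter(teid: int) -> str:
    """Return a tcpdump capture filter matching one GTP-U TEID (hypothetical helper)."""
    if not 0 <= teid <= 0xFFFFFFFF:
        raise ValueError("TEID must fit in 32 bits")
    offset = UDP_HEADER_LEN + GTPU_TEID_OFFSET  # 12
    return f"udp port {GTPU_PORT} and (udp[{offset}:4] == 0x{teid:08x})"


if __name__ == "__main__":
    # Paste the printed string, in double quotes, after "tcpdump -i eth0".
    print(teid_bpf_filter(0x12345678))
```

Generating the filter from the TEID in the SMF log avoids hand-typing hex during an incident, which is where most filter typos come from.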

Capture SBI HTTP/2 (TCP 443): tcpdump -i eth0 -w sbi_capture.pcap tcp port 443 and host 10.1.2.3

Capture N2, N4, and N3 simultaneously (append "or tcp port 443" to the filter to include SBI): tcpdump -i eth0 "sctp port 38412 or udp port 8805 or udp port 2152" -w all_5gc.pcap

Wireshark Display Filters for 5GC

Interface	Protocol	Wireshark Display Filter	What You See
N2 (AMF–gNB)	NGAP over SCTP	ngap	All NGAP: InitialUEMessage, UEContextRelease, PDUSessionResourceSetup, Paging, Handover
N2 — Registration only	NGAP	ngap.procedureCode == 15	InitialUEMessage carrying NAS Registration Request
N2 — Handover	NGAP	ngap.procedureCode == 12 or ngap.procedureCode == 13	HandoverRequired (HandoverPreparation) / HandoverRequest (HandoverResourceAllocation)
N4 (SMF–UPF)	PFCP	pfcp	All PFCP: Session Estab/Mod/Del, Usage Reports, Heartbeat, Association
N4 — Failures only	PFCP	pfcp.cause != 1	PFCP responses where Cause is not Request Accepted (1). Any value other than 1 deserves a look at that session.
N4 — Specific session	PFCP	pfcp.seid == 0x1234ABCD	All PFCP messages for one session — get SEID from SMF log
N11 (AMF–SMF)	HTTP/2	http2	All SBI HTTP/2 — enable TLS decryption with SSLKEYLOGFILE for plaintext
N11 — Errors only	HTTP/2	http2.headers.status matches "^[45]"	4xx and 5xx HTTP responses — NF errors. Any 5xx on N11 = session setup issue.
N3 (gNB–UPF)	GTP-U	gtp	All GTP-U: G-PDU data, Echo Request/Response (N3 path health check)
N3 — Specific TEID	GTP-U	gtp.teid == 0x12345678	All traffic for one PDU session — TEID from SMF PFCP session log

Table 1 — Wireshark display filters for 5GC interfaces. The N4 failures filter (pfcp.cause != 1) is the fastest way to spot PFCP problems without reading every packet.
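Once the failure filter has narrowed the capture, it helps to see which causes dominate. The sketch below tallies cause values, for example from a list exported with tshark's `-T fields -e pfcp.cause`. The cause names are a small subset taken from the 3GPP TS 29.244 cause table as best recalled here, so verify them against the release your network runs. The function names are hypothetical.

```python
from collections import Counter

PFCP_CAUSE_ACCEPTED = 1

# Subset of rejection causes (assumed from TS 29.244; verify for your release).
KNOWN_REJECTIONS = {
    64: "Request rejected (reason not specified)",
    65: "Session context not found",
    72: "No established PFCP Association",
    74: "PFCP entity in congestion",
    75: "No resources available",
    77: "System failure",
}


def describe_cause(cause: int) -> str:
    """Map a PFCP cause value to a short label."""
    if cause == PFCP_CAUSE_ACCEPTED:
        return "Request accepted"
    return KNOWN_REJECTIONS.get(cause, f"Cause {cause} (look up in TS 29.244)")


def summarise_causes(causes):
    """Count how often each cause label appears in a capture export."""
    return Counter(describe_cause(c) for c in causes)


if __name__ == "__main__":
    for label, count in summarise_causes([1, 1, 65, 74, 74]).most_common():
        print(f"{count:4d}  {label}")
```

A capture dominated by congestion or resource causes points at UPF capacity, while "Session context not found" points at state loss, for example after a UPF restart.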

3. Top 5 Failure Patterns and Diagnosis Paths

Pattern 1: Mass Registration Failure

Symptom: AMF registration success rate (SR) suddenly drops below 90%. Grafana shows a spike in the authentication failure rate. No gNB alarms.

Check 1 — Is AUSF reachable? kubectl get pods -n 5gc | grep ausf. If CrashLoopBackOff: AUSF is the root cause.

Check 2 — Is NRF overloaded? Grafana: NRF discovery request rate vs capacity. If rate > configured NRF max: NRF overload. Fix: enable AMF NRF caching, scale NRF.

Check 3 — Wireshark N2 capture: InitialUEMessage arriving at AMF? If yes but no Authentication Request sent back: AMF cannot reach AUSF on N12. Capture N12 HTTP/2 and look for connection refused or TLS failure.
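The three checks above can be read as an ordered decision path. A minimal sketch follows, assuming the observations have already been collected by hand from kubectl, Grafana, and the N2 capture. The function and parameter names are hypothetical, not part of any vendor tool.

```python
def triage_mass_registration_failure(
    ausf_pod_status: str,
    nrf_request_rate: float,
    nrf_max_rate: float,
    initial_ue_message_seen: bool,
    auth_request_sent: bool,
) -> str:
    """Walk Checks 1 to 3 in order and return the first conclusion that applies."""
    # Check 1: AUSF pod health as reported by kubectl.
    if ausf_pod_status == "CrashLoopBackOff":
        return "AUSF is crash-looping: AUSF is the root cause"
    # Check 2: NRF discovery request rate against its configured maximum.
    if nrf_request_rate > nrf_max_rate:
        return "NRF overload: enable AMF NRF caching and scale NRF"
    # Check 3: registrations reach the AMF but no Authentication Request goes back.
    if initial_ue_message_seen and not auth_request_sent:
        return "AMF cannot reach AUSF on N12: capture N12 HTTP/2"
    return "No match: keep following the session flow"


if __name__ == "__main__":
    print(triage_mass_registration_failure("Running", 1200, 1000, True, False))
```

Encoding the order matters more than the code itself: the cheapest check (pod status) comes first, and the packet capture is only needed when the first two come back clean.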

Pattern 2: Asymmetric Connectivity — Uploads Work, Downloads Fail

Symptom: PDU session shows as active. UE uplink data flows normally. Downlink packets arrive at UPF N6 but UE receives nothing.

Check 1 — SMF logs: search for PFCP_SESSION_MODIFICATION_TIMEOUT for the affected session. If found: PFCP Session Modification (Step 8 of PDU session setup) failed. UPF is still in BUFFER mode for downlink.

Check 2 — N4 packet capture during session setup: does the PFCP Session Modification Request with Outer Header Creation (gNB TEID) appear? Does UPF respond? If no response: UPF N4 queue backed up.

Fix: increase UPF N4 processing threads. Increase SMF PFCP T1 timer from 3s to 8s with N1=3 retries.
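The timer change trades faster failure detection for more patience with a slow UPF. A rough calculation of the worst-case wait before the SMF gives up is sketched here, assuming N1 counts the total number of transmissions and T1 stays constant between attempts (no backoff). Vendor implementations may count retries differently, so treat the numbers as indicative.

```python
def worst_case_wait_seconds(t1_seconds: float, n1_attempts: int) -> float:
    """Time until a PFCP request is declared failed, under the stated assumptions."""
    return t1_seconds * n1_attempts


if __name__ == "__main__":
    before = worst_case_wait_seconds(3, 3)
    after = worst_case_wait_seconds(8, 3)
    print(f"T1=3s, N1=3: up to {before:.0f}s before timeout")
    print(f"T1=8s, N1=3: up to {after:.0f}s before timeout")
```

Under these assumptions the budget grows from 9 to 24 seconds, so check that procedures waiting on the SMF, such as the AMF's own N11 timers, can tolerate the longer wait before rolling the change out.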

Pattern 3: NRF 503 Storm — Cascading SBI Failures

Symptom: after maintenance window or NF pod restarts, SBI error rate climbs across all NF pairs. Auth failure rate rises. PDU session setup failures. Grafana shows NRF HTTP 503 rate spike.

What happened: all NFs restarted simultaneously, all attempt OAuth2 token refresh and NRF discovery simultaneously. NRF token endpoint overwhelmed.

Check: NRF pod CPU and HTTP connection pool utilisation. If NRF CPU > 90% and 503 rate > 10%: NRF overload confirmed.

Fix immediate: nothing — wait 2-3 minutes for token request backpressure to clear. Fix permanent: add startup jitter (0–60s random delay) to NF pod token refresh. Stagger NF pod restarts during maintenance (10-minute waves, not all at once).
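Both permanent fixes are simple to express. The sketch below shows a random startup delay in the 0 to 60 second range and a helper that splits a list of NFs into restart waves. It assumes the waves are spaced 10 minutes apart; the function names and the example NF list are hypothetical.

```python
import random


def startup_jitter_seconds(max_delay: float = 60.0, rng=None) -> float:
    """Pick a random delay to wait before the first token refresh."""
    source = rng if rng is not None else random
    return source.uniform(0.0, max_delay)


def restart_waves(nfs, wave_size: int, interval_minutes: int = 10):
    """Return (start_offset_minutes, [nf, ...]) tuples, one per restart wave."""
    if wave_size < 1:
        raise ValueError("wave_size must be at least 1")
    return [
        (index // wave_size * interval_minutes, nfs[index:index + wave_size])
        for index in range(0, len(nfs), wave_size)
    ]


if __name__ == "__main__":
    print(f"sleep {startup_jitter_seconds():.1f}s before requesting a token")
    for offset, wave in restart_waves(["amf", "smf", "ausf", "udm", "pcf"], wave_size=2):
        print(f"T+{offset:02d} min: restart {', '.join(wave)}")
```

In a pod, the jitter would be applied once at startup, before the first call to the NRF token endpoint. The wave plan belongs in the maintenance runbook, not in the NF itself.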

Pattern 4: UPF GTP-U Path Failure — gNB Cannot Reach UPF N3

Symptom: all PDU sessions on a specific gNB lose data plane connectivity. Sessions still show as active in SMF. No PDU session release alarms. N3 Echo Request from UPF to gNB times out.

Check 1 — tcpdump on UPF N3 interface: are GTP-U Echo Request packets going out? Are Echo Response packets coming back? No response: N3 transport path failure.

Check 2 — Traceroute from UPF to gNB N3 IP. If path asymmetric or failing at aggregation router: transport issue.

Check 3 — If Echo Response is arriving but data packets are not: check gNB TEID in UPF FAR. If UPF FAR has stale gNB TEID from before last handover: PFCP modification failure after handover.
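When the UPF's own echo counters are not accessible, the N3 path can be probed by hand. The sketch below builds a GTP-U Echo Request (version 1, message type 1, TEID 0, sequence number present) and checks for a matching Echo Response (message type 2). It assumes the probe runs from a host with an address on the N3 network and that the gNB answers to the probe's source port. The function names are hypothetical.

```python
import socket
import struct

GTPU_PORT = 2152
ECHO_REQUEST = 1
ECHO_RESPONSE = 2
FLAGS_V1_PT_S = 0x32  # version 1, protocol type GTP, sequence number flag set


def build_echo_request(seq: int) -> bytes:
    """Build a 12-byte GTP-U Echo Request with TEID 0."""
    # Fields: flags, type, length of bytes after the first 8, TEID,
    # sequence number, N-PDU number, next extension header type.
    return struct.pack("!BBHIHBB", FLAGS_V1_PT_S, ECHO_REQUEST, 4, 0, seq & 0xFFFF, 0, 0)


def is_echo_response(data: bytes, seq: int) -> bool:
    """True if data is a GTPv1-U Echo Response carrying the expected sequence number."""
    if len(data) < 12:
        return False
    flags, msg_type, _length, _teid, rx_seq = struct.unpack("!BBHIH", data[:10])
    return (flags >> 5) == 1 and msg_type == ECHO_RESPONSE and rx_seq == seq


def probe_n3(gnb_ip: str, seq: int = 1, timeout: float = 2.0) -> bool:
    """Send one Echo Request to the gNB N3 address and wait for the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_echo_request(seq), (gnb_ip, GTPU_PORT))
            data, _addr = sock.recvfrom(2048)
        except OSError:  # includes timeouts and unreachable networks
            return False
    return is_echo_response(data, seq)
```

A call such as `probe_n3("10.20.30.40")`, with a placeholder gNB address, returning False while the same gNB answers ping would suggest that UDP 2152 specifically is filtered along the path.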

Pattern 5: Site Dark — Full Scenario Walkthrough

03:00. NOC ticket: site Muscat-AlKhuwair-02 showing zero active UEs. 400 users affected. No data, no voice.

Step 1 — Grafana: which NF alarms are active for this site? AMF alarm = core/signalling issue. No NF alarm + gNB alarm = transport or gNB issue. No alarm anywhere = monitoring failure.

Step 2 — Check gNB OAM: is the gNB operational? Does it show N2 connection status as Connected to AMF? If gNB shows N2 connected but users fail: the problem is in the core, not the gNB.

Step 3 — If N2 shows Connected: capture N2 on AMF interface for this gNB. Are NGAP InitialUEMessage packets arriving? If yes: core is receiving registrations. If no: N2 transport failure between gNB and AMF.

Step 4 — If NGAP arriving: is AMF sending Authentication Request back? If yes and registrations still fail: authentication itself is failing at AUSF. Capture N12 to see the rejection. If no: AMF cannot reach AUSF. Check AUSF pod health.

Step 5 — If auth succeeds but no data: check N4. tcpdump on UPF N4 interface: PFCP Session Establishment requests from SMF arriving? UPF responding with Request Accepted? If no response: UPF is the problem.

Step 6 — If PDU sessions are establishing but no data: N3 path check. GTP-U Echo Request from UPF to gNB N3 IP. Response arriving? If not: N3 transport failure. Traceroute to confirm.

Step 7 — Correlate timeline: when did first alarm fire vs when did users start dropping? If a transport maintenance event preceded the failure by 5-10 minutes: likely transport configuration change caused the issue.
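Steps 2 to 7 form a strict chain, and the first broken link is the place to dig. A minimal sketch of that chain as a function is given here, assuming each observation has been gathered manually and recorded as a boolean. The class, field, and function names are hypothetical, and the conclusion for a gNB that is down or not N2-connected is an inference the walkthrough does not spell out.

```python
from dataclasses import dataclass


@dataclass
class SiteObservations:
    gnb_operational: bool              # Step 2: gNB OAM status
    n2_connected: bool                 # Step 2: N2 shows Connected to AMF
    initial_ue_message_seen: bool      # Step 3: NGAP InitialUEMessage at AMF
    auth_request_sent: bool            # Step 4: AMF sends Authentication Request
    auth_succeeds: bool                # Step 4: authentication completes
    pfcp_establishment_accepted: bool  # Step 5: UPF answers Request Accepted
    n3_echo_answered: bool             # Step 6: GTP-U Echo Response from gNB


def isolate_site_dark(obs: SiteObservations) -> str:
    """Return the first failing step in the site-dark isolation chain."""
    if not obs.gnb_operational or not obs.n2_connected:
        return "Step 2: gNB or N2 link down, investigate gNB and its transport"
    if not obs.initial_ue_message_seen:
        return "Step 3: N2 transport failure between gNB and AMF"
    if not obs.auth_request_sent:
        return "Step 4: AMF cannot reach AUSF, check AUSF pod health"
    if not obs.auth_succeeds:
        return "Step 4: auth failing at AUSF, capture N12"
    if not obs.pfcp_establishment_accepted:
        return "Step 5: UPF is the problem, inspect N4"
    if not obs.n3_echo_answered:
        return "Step 6: N3 transport failure, traceroute to confirm"
    return "Step 7: all links answer, correlate the alarm timeline"


if __name__ == "__main__":
    obs = SiteObservations(True, True, True, True, True, True, False)
    print(isolate_site_dark(obs))
```

Filling in the observations in order also gives the war room a shared record of what has been verified, which helps when the incident is handed over between shifts.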

4. MAE Alarm Correlation

Alarm Name (iMaster NCE)	What It Means	Immediate Check	Likely Fix
5GC_AMF_REG_SR_LOW	Registration success rate below threshold	Check AUSF pod health; NRF discovery latency; AMF N12 error counter	Restart AUSF if crashed; enable NRF caching; scale AUSF if overloaded
5GC_SMF_N4_ASSOC_LOST	PFCP Association between SMF and UPF lost	Check UPF pod health; N4 IP connectivity (ping UPF N4 IP from SMF); IPsec status	Restore N4 transport; restart UPF; SMF will re-establish association and re-program sessions
5GC_NRF_DISC_LATENCY_HIGH	NRF discovery response time P95 > threshold	Check NRF pod CPU; NRF discovery request rate vs configured capacity	Scale NRF horizontally; enable discovery caching on consumer NFs
5GC_UPF_TPUT_THRESHOLD	UPF throughput approaching N6 link capacity	Check N6 interface utilisation; UPF DPDK CPU workers	Add UPF instances; upgrade N6 link capacity
5GC_SEPP_N32_FAILURE	N32 inter-PLMN SEPP link failure	Check SEPP TLS certificate validity; N32 IPX connectivity; partner SEPP reachability	Renew SEPP certificate; engage IPX provider; verify partner SEPP IP
5GC_SMF_PFCP_MOD_TIMEOUT	PFCP Session Modification timeout rate exceeding threshold	UPF N4 processing thread utilisation; N4 transport latency	Increase UPF N4 threads; increase SMF PFCP T1 timer; check N4 path QoS

Table 2 — 5GC alarm correlation (indicative — verify alarm IDs with vendor alarm dictionary for your software version).
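The table can be kept as a lookup so that the first checks appear next to the alarm as soon as it fires. The sketch below uses the indicative alarm names from Table 2 with abbreviated check lists, and lists NRF alarms first when several are active, on the assumption that NRF trouble tends to surface as symptoms on other NFs. The function name is hypothetical.

```python
FIRST_CHECKS = {
    "5GC_AMF_REG_SR_LOW": ["AUSF pod health", "NRF discovery latency", "AMF N12 error counter"],
    "5GC_SMF_N4_ASSOC_LOST": ["UPF pod health", "N4 IP connectivity", "IPsec status"],
    "5GC_NRF_DISC_LATENCY_HIGH": ["NRF pod CPU", "NRF discovery request rate vs capacity"],
    "5GC_UPF_TPUT_THRESHOLD": ["N6 interface utilisation", "UPF DPDK CPU workers"],
    "5GC_SEPP_N32_FAILURE": ["SEPP TLS certificate validity", "N32 IPX connectivity"],
    "5GC_SMF_PFCP_MOD_TIMEOUT": ["UPF N4 thread utilisation", "N4 transport latency"],
}


def first_checks(active_alarms):
    """Return (alarm, checks) pairs, with NRF alarms ordered first."""
    # sorted() is stable, so alarms keep their arrival order within each group.
    ordered = sorted(active_alarms, key=lambda name: 0 if "NRF" in name else 1)
    fallback = ["consult the vendor alarm dictionary"]
    return [(name, FIRST_CHECKS.get(name, fallback)) for name in ordered]


if __name__ == "__main__":
    for alarm, checks in first_checks(["5GC_AMF_REG_SR_LOW", "5GC_NRF_DISC_LATENCY_HIGH"]):
        print(f"{alarm}: {'; '.join(checks)}")
```

The ordering rule is deliberately crude. A production version would more likely use alarm timestamps and the NF dependency graph.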

5. Summary — Key Takeaways

Topic	Key Takeaway
Diagnosis approach	Follow the session flow. Symptom → which step fails → which interface/NF → packet capture confirms. Never jump to conclusions based on which NF is alarming.
Wireshark N4 filter	pfcp.cause != 1 is the fastest way to spot all PFCP failures in a capture. Any cause other than 1 (Request Accepted) = investigate that session.
Asymmetric connectivity	Uploads work, downloads fail = PFCP Session Modification timeout. Check SMF logs for PFCP_SESSION_MODIFICATION_TIMEOUT. UPF N4 thread pool.
NRF 503 storm	After maintenance: stagger NF restarts + add startup jitter to token refresh. NRF overload self-heals in 2-3 minutes but prevention is better.
Site dark walkthrough	7-step isolation: Grafana → gNB OAM → N2 capture → N12 check → N4 check → N3 check → timeline correlation. Do not skip steps.
Grafana before Wireshark	Grafana spots the failure category in 30 seconds. Wireshark confirms root cause in 5 minutes. Use both — in that order.

Table 3 — Post 14 summary. Troubleshooting 5GC is systematic layer-by-layer isolation, not random NF restarts.

Next: Post 15 — 5GC Security Architecture

5GC Troubleshooting — Real Failures, Wireshark & Grafana