5GC Troubleshooting — Real Failures, Wireshark & Grafana

Packet capture on N2/N4/N11/N3, Grafana dashboard design, Huawei MAE alarms, NRF 503 storms, UPF path failure, site-dark war room walkthrough

1. The Troubleshooting Mindset

When a 5GC fault occurs, the instinct is to look at the NF that is alarming. This is almost always wrong. 5GC failures are systemic — a failed NRF causes AMF symptoms, a failed N4 causes SMF symptoms, a DSCP remark at an aggregation router causes VoNR symptoms. The correct approach is to follow the session flow backward from the symptom to the root cause, using the right tools at each layer.

This article is the war room playbook. It covers Wireshark and tcpdump filters for every 5GC interface, a Grafana dashboard layout that spots failures before they become outages, Huawei MAE alarm correlation maps, the top five failure patterns with diagnosis paths, and a full site-dark walkthrough from first alarm to root cause.

2. Packet Capture — Wireshark and tcpdump for 5GC

tcpdump One-Liners for Each Interface

Capture N2 (NGAP over SCTP port 38412): tcpdump -i eth0 -w n2_capture.pcap sctp port 38412

Capture N4 (PFCP over UDP port 8805): tcpdump -i eth0 -w n4_capture.pcap udp port 8805

Capture N3 (GTP-U over UDP port 2152): tcpdump -i eth0 -w n3_capture.pcap udp port 2152

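Capture N3 for one specific UE by TEID: tcpdump -i eth0 -w ue_teid.pcap 'udp port 2152 and udp[12:4] == 0x12345678' (offset 12 = 8-byte UDP header + 4 bytes into the GTP-U header, where the TEID starts; the udp[] accessor matches IPv4 transport only)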

Capture SBI HTTP/2 (TCP 443): tcpdump -i eth0 -w sbi_capture.pcap tcp port 443 and host 10.1.2.3

Capture N2, N4 and N3 simultaneously: tcpdump -i eth0 -w all_5gc.pcap 'sctp port 38412 or udp port 8805 or udp port 2152' (append "or tcp port 443" inside the quotes to include SBI)
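The per-UE filter is easy to mistype at 03:00, so it can help to generate it. The following is a minimal sketch, not a tool from this series: the function name gtpu_teid_filter is hypothetical, and it assumes IPv4 transport and a GTP-U header that starts right after the UDP header.

```python
def gtpu_teid_filter(teid: int, port: int = 2152) -> str:
    """Build a tcpdump/BPF filter matching GTP-U packets for one TEID (IPv4 only)."""
    if not 0 <= teid <= 0xFFFFFFFF:
        raise ValueError("TEID must fit in 32 bits")
    udp_header_len = 8       # source port, destination port, length, checksum
    teid_offset_in_gtpu = 4  # after flags (1), message type (1), length (2)
    offset = udp_header_len + teid_offset_in_gtpu
    return f"udp port {port} and udp[{offset}:4] == 0x{teid:08x}"


# Pass the result as a single quoted argument to tcpdump.
print(gtpu_teid_filter(0x12345678))
# prints: udp port 2152 and udp[12:4] == 0x12345678
```

The offset arithmetic is spelled out in the code so that the magic number 12 in the one-liner is traceable to the header layout.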

Wireshark Display Filters for 5GC

| Interface | Protocol | Wireshark Display Filter | What You See |
|---|---|---|---|
| N2 (AMF–gNB) | NGAP over SCTP | ngap | All NGAP: InitialUEMessage, UEContextRelease, PDUSessionResourceSetup, Paging, Handover |
| N2 — Registration only | NGAP | ngap.procedureCode == 15 | InitialUEMessage carrying NAS Registration Request |
| N2 — Handover | NGAP | ngap.procedureCode == 12 or ngap.procedureCode == 13 | HandoverRequired (HandoverPreparation, code 12) / HandoverRequest (HandoverResourceAllocation, code 13). Codes 0 and 1 are the S1AP values; in NGAP they are AMF Configuration Update and AMF Status Indication. |
| N4 (SMF–UPF) | PFCP | pfcp | All PFCP: Session Estab/Mod/Del, Usage Reports, Heartbeat, Association |
| N4 — Failures only | PFCP | pfcp.cause != 1 | PFCP responses where Cause is not Request accepted (1). Any value other than 1 needs investigation. |
| N4 — Specific session | PFCP | pfcp.seid == 0x1234abcd | All PFCP messages for one session. Get the SEID from the SMF log. |
| N11 (AMF–SMF) | HTTP/2 | http2 | All SBI HTTP/2. Enable TLS decryption with SSLKEYLOGFILE for plaintext. |
| N11 — Errors only | HTTP/2 | http2.headers.status >= 400 | 4xx and 5xx HTTP responses, i.e. NF errors. Any 5xx on N11 = session setup issue. |
| N3 (gNB–UPF) | GTP-U | gtp | All GTP-U: G-PDU data, Echo Request/Response (N3 path health check) |
| N3 — Specific TEID | GTP-U | gtp.teid == 0x12345678 | All traffic for one PDU session. Get the TEID from the SMF PFCP session log. |

Table 1 — Wireshark display filters for 5GC interfaces. The N4 failures filter (pfcp.cause != 1) is the fastest way to spot PFCP problems without reading every packet.
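Once a capture is filtered, the next question is usually which sessions are failing and with which cause. The sketch below assumes the PFCP session ID and cause have already been exported as two tab-separated columns (for example with tshark -T fields); the column layout and the function name failed_sessions are assumptions for illustration, not something the capture tooling mandates.

```python
from collections import Counter

REQUEST_ACCEPTED = 1  # PFCP Cause value for "Request accepted"


def failed_sessions(lines):
    """Count non-accepted PFCP causes per session from 'seid<TAB>cause' lines."""
    failures = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2 or not parts[1]:
            continue  # requests carry no Cause IE, so the column is empty
        seid, cause = parts[0], int(parts[1], 0)
        if cause != REQUEST_ACCEPTED:
            failures[(seid, cause)] += 1
    return failures


rows = [
    "0x1234abcd\t1",   # response, accepted
    "0x1234abcd\t",    # request, no cause
    "0x00000042\t64",  # response, rejected
    "0x00000042\t64",
]
for (seid, cause), count in failed_sessions(rows).items():
    print(f"session {seid}: cause {cause} seen {count} time(s)")
```

A session that repeats the same non-accepted cause is a better starting point than the first failure in the capture.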

3. Top 5 Failure Patterns and Diagnosis Paths

Pattern 1: Mass Registration Failure

Symptom: AMF Registration SR drops below 90% suddenly. Grafana shows auth failure rate spike. No gNB alarms.

Check 1 — Is AUSF reachable? kubectl get pods -n 5gc | grep ausf. If CrashLoopBackOff: AUSF is the root cause.

Check 2 — Is NRF overloaded? Grafana: NRF discovery request rate vs capacity. If rate > configured NRF max: NRF overload. Fix: enable AMF NRF caching, scale NRF.

Check 3 — Wireshark N2 capture: InitialUEMessage arriving at AMF? If yes but no Authentication Request sent back: AMF cannot reach AUSF on N12. Capture N12 HTTP/2 and look for connection refused or TLS failure.
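Check 1 can be scripted so that it flags pods that are not fully ready, not only those in CrashLoopBackOff. This is a rough sketch that parses the default text output of kubectl get pods; the pod names in SAMPLE and the helper name unhealthy_pods are made up for illustration.

```python
SAMPLE = """\
NAME                     READY   STATUS             RESTARTS   AGE
ausf-6c9d7f5b8-abcde     0/1     CrashLoopBackOff   12         34m
ausf-6c9d7f5b8-fghij     1/1     Running            0          2d
amf-5f7d9c6b4-klmno      1/1     Running            0          2d
"""


def unhealthy_pods(kubectl_output: str, name_contains: str = "ausf"):
    """Return (pod, status) pairs for matching pods that are not Running and fully ready."""
    bad = []
    for line in kubectl_output.splitlines():
        cols = line.split()
        if len(cols) < 3 or cols[0] == "NAME" or name_contains not in cols[0]:
            continue
        name, ready, status = cols[0], cols[1], cols[2]
        ready_now, _, wanted = ready.partition("/")
        if status != "Running" or ready_now != wanted:
            bad.append((name, status))
    return bad


print(unhealthy_pods(SAMPLE))
# prints: [('ausf-6c9d7f5b8-abcde', 'CrashLoopBackOff')]
```

An empty result means the AUSF pods look healthy from Kubernetes' point of view, which is the cue to move on to Check 2.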

Pattern 2: Asymmetric Connectivity — Uploads Work, Downloads Fail

Symptom: PDU session shows as active. UE uplink data flows normally. Downlink packets arrive at UPF N6 but UE receives nothing.

Check 1 — SMF logs: search for PFCP_SESSION_MODIFICATION_TIMEOUT for the affected session. If found: PFCP Session Modification (Step 8 of PDU session setup) failed. UPF is still in BUFFER mode for downlink.

Check 2 — N4 packet capture during session setup: does the PFCP Session Modification Request with Outer Header Creation (gNB TEID) appear? Does UPF respond? If no response: UPF N4 queue backed up.

Fix: increase UPF N4 processing threads. Increase SMF PFCP T1 timer from 3s to 8s with N1=3 retries.
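Before changing the timer it is worth working out what the change does to failure detection time. The arithmetic below is a sketch under two assumptions that should be checked against the vendor's implementation: T1 is fixed (no backoff), and N1 counts retransmissions, so the total number of attempts is 1 + N1.

```python
def pfcp_give_up_after(t1_seconds: float, n1_retries: int) -> float:
    """Worst-case seconds before a PFCP request is declared failed.

    Assumes a fixed T1 and that N1 counts retransmissions only,
    so the request is sent 1 + N1 times in total.
    """
    return t1_seconds * (1 + n1_retries)


print(pfcp_give_up_after(3, 3))  # prints: 12
print(pfcp_give_up_after(8, 3))  # prints: 32
```

Under these assumptions the longer timer gives a slow UPF more room to answer, but a genuinely dead UPF also takes longer to detect, so the timer change belongs together with the UPF thread increase rather than in place of it.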

Pattern 3: NRF 503 Storm — Cascading SBI Failures

Symptom: after maintenance window or NF pod restarts, SBI error rate climbs across all NF pairs. Auth failure rate rises. PDU session setup failures. Grafana shows NRF HTTP 503 rate spike.

What happened: all NFs restarted simultaneously, all attempt OAuth2 token refresh and NRF discovery simultaneously. NRF token endpoint overwhelmed.

Check: NRF pod CPU and HTTP connection pool utilisation. If NRF CPU > 90% and 503 rate > 10%: NRF overload confirmed.

Fix immediate: nothing — wait 2-3 minutes for token request backpressure to clear. Fix permanent: add startup jitter (0–60s random delay) to NF pod token refresh. Stagger NF pod restarts during maintenance (10-minute waves, not all at once).
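Both permanent fixes are small in code terms. The following is a minimal sketch of the idea only: the function names, the NF list, and the wave size are illustrative assumptions, and a real deployment would apply the delay inside each NF's startup path and the waves in the rollout tooling.

```python
import random


def startup_delay(max_jitter_s: float = 60.0) -> float:
    """Random delay (0 to max_jitter_s seconds) before the first token request."""
    return random.uniform(0.0, max_jitter_s)


def restart_waves(nfs, wave_size):
    """Split a list of NFs into restart waves of at most wave_size each."""
    return [nfs[i:i + wave_size] for i in range(0, len(nfs), wave_size)]


# In an NF startup path this would be: time.sleep(startup_delay())
print(f"delay before first token request: {startup_delay():.1f}s")

for number, wave in enumerate(restart_waves(["amf", "smf", "ausf", "udm", "pcf"], 2), 1):
    print(f"wave {number}: {wave}")
```

Uniform jitter spreads the token requests evenly over the window, so the peak load on the NRF drops roughly in proportion to the window length.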

Pattern 4: UPF GTP-U Path Failure — gNB Cannot Reach UPF N3

Symptom: all PDU sessions on a specific gNB lose data plane connectivity. Sessions still show as active in SMF. No PDU session release alarms. N3 Echo Request from UPF to gNB times out.

Check 1 — tcpdump on UPF N3 interface: are GTP-U Echo Request packets going out? Are Echo Response packets coming back? No response: N3 transport path failure.

Check 2 — Traceroute from UPF to gNB N3 IP. If path asymmetric or failing at aggregation router: transport issue.

Check 3 — If Echo Response is arriving but data packets are not: check gNB TEID in UPF FAR. If UPF FAR has stale gNB TEID from before last handover: PFCP modification failure after handover.
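When reading the tcpdump output in Check 1, it helps to know what an Echo Request looks like on the wire. The sketch below builds and recognises the 12-byte GTP-U header used by Echo messages; it only constructs and parses bytes, sending them over a UDP socket is left out, and the function names are hypothetical.

```python
import struct

GTPU_ECHO_REQUEST = 1
GTPU_ECHO_RESPONSE = 2


def build_echo_request(seq: int) -> bytes:
    """Build a GTP-U Echo Request with the sequence number field present."""
    flags = 0x32  # version 1, protocol type GTP, S flag set
    length = 4    # bytes after the first 8: sequence (2), N-PDU (1), next extension (1)
    teid = 0      # Echo messages use TEID 0
    return struct.pack("!BBHIHBB", flags, GTPU_ECHO_REQUEST, length, teid,
                       seq & 0xFFFF, 0, 0)


def is_echo_response(payload: bytes, seq: int) -> bool:
    """Check that a UDP payload is a GTPv1-U Echo Response matching seq."""
    if len(payload) < 12:
        return False
    flags, msg_type, _length, _teid, rx_seq = struct.unpack("!BBHIH", payload[:10])
    return (flags >> 5) == 1 and msg_type == GTPU_ECHO_RESPONSE and rx_seq == seq


print(build_echo_request(7).hex())
# prints: 320100040000000000070000
```

In a capture, the first two bytes 32 01 after the UDP header identify an Echo Request and 32 02 an Echo Response, which is enough to tell at a glance whether the path is answering.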

Pattern 5: Site Dark — Full Scenario Walkthrough

03:00. NOC ticket: site Muscat-AlKhuwair-02 showing zero active UEs. 400 users affected. No data, no voice.

Step 1 — Grafana: which NF alarms are active for this site? AMF alarm = core/signalling issue. No NF alarm + gNB alarm = transport or gNB issue. No alarm anywhere = monitoring failure.

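Step 2 — Check gNB OAM: is the gNB operational? Does it show N2 connection status as Connected to AMF? If N2 is down: the fault is at the gNB or on the N2 transport, so check the SCTP association and backhaul first. If gNB shows N2 connected but users fail: the problem is most likely in the core, not the gNB.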

Step 3 — If N2 shows Connected: capture N2 on AMF interface for this gNB. Are NGAP InitialUEMessage packets arriving? If yes: core is receiving registrations. If no: N2 transport failure between gNB and AMF.

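Step 4 — If NGAP arriving: is AMF sending Authentication Request back? If yes: the AMF reached AUSF and obtained a challenge, so authentication is failing later in the exchange (UE response or AUSF confirmation). Capture N12. If no: AMF cannot reach AUSF. Check AUSF pod health.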

Step 5 — If auth succeeds but no data: check N4. tcpdump on UPF N4 interface: PFCP Session Establishment requests from SMF arriving? UPF responding with Request Accepted? If no response: UPF is the problem.

Step 6 — If PDU sessions are establishing but no data: N3 path check. GTP-U Echo Request from UPF to gNB N3 IP. Response arriving? If not: N3 transport failure. Traceroute to confirm.

Step 7 — Correlate timeline: when did first alarm fire vs when did users start dropping? If a transport maintenance event preceded the failure by 5-10 minutes: likely transport configuration change caused the issue.
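The walkthrough is an ordered decision tree: each step is only meaningful once the previous one has passed. The sketch below encodes steps 2 to 6 that way; the SiteObservations fields and the isolate function are hypothetical names, and the booleans are filled in by hand from what the OAM screens and captures show.

```python
from dataclasses import dataclass


@dataclass
class SiteObservations:
    n2_connected: bool                 # step 2: gNB OAM shows N2 Connected
    initial_ue_message_seen: bool      # step 3: InitialUEMessage arrives at AMF
    auth_request_sent: bool            # step 4: AMF sends Authentication Request
    auth_succeeds: bool                # step 4: authentication completes
    pfcp_establishment_accepted: bool  # step 5: UPF answers Request Accepted
    n3_echo_answered: bool             # step 6: GTP-U Echo Response received


def isolate(obs: SiteObservations) -> str:
    """Return the first failing layer, checking in session-flow order."""
    if not obs.n2_connected:
        return "gNB or N2 transport: N2 not connected to AMF"
    if not obs.initial_ue_message_seen:
        return "N2 transport failure between gNB and AMF"
    if not obs.auth_request_sent:
        return "AMF cannot reach AUSF: check AUSF pod health"
    if not obs.auth_succeeds:
        return "Authentication failing: capture N12"
    if not obs.pfcp_establishment_accepted:
        return "N4/UPF: PFCP Session Establishment not accepted"
    if not obs.n3_echo_answered:
        return "N3 transport failure: traceroute UPF to gNB"
    return "Steps 2-6 pass: correlate the alarm timeline (step 7)"


print(isolate(SiteObservations(True, True, True, True, True, False)))
# prints: N3 transport failure: traceroute UPF to gNB
```

Returning at the first failing check mirrors the rule of not skipping steps: a later observation is never evaluated while an earlier one is still unexplained.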

4. MAE Alarm Correlation

| Alarm Name (iMaster NCE) | What It Means | Immediate Check | Likely Fix |
|---|---|---|---|
| 5GC_AMF_REG_SR_LOW | Registration success rate below threshold | Check AUSF pod health; NRF discovery latency; AMF N12 error counter | Restart AUSF if crashed; enable NRF caching; scale AUSF if overloaded |
| 5GC_SMF_N4_ASSOC_LOST | PFCP Association between SMF and UPF lost | Check UPF pod health; N4 IP connectivity (ping UPF N4 IP from SMF); IPsec status | Restore N4 transport; restart UPF; SMF will re-establish association and re-program sessions |
| 5GC_NRF_DISC_LATENCY_HIGH | NRF discovery response time P95 > threshold | Check NRF pod CPU; NRF discovery request rate vs configured capacity | Scale NRF horizontally; enable discovery caching on consumer NFs |
| 5GC_UPF_TPUT_THRESHOLD | UPF throughput approaching N6 link capacity | Check N6 interface utilisation; UPF DPDK CPU workers | Add UPF instances; upgrade N6 link capacity |
| 5GC_SEPP_N32_FAILURE | N32 inter-PLMN SEPP link failure | Check SEPP TLS certificate validity; N32 IPX connectivity; partner SEPP reachability | Renew SEPP certificate; engage IPX provider; verify partner SEPP IP |
| 5GC_SMF_PFCP_MOD_TIMEOUT | PFCP Session Modification timeout rate exceeding threshold | UPF N4 processing thread utilisation; N4 transport latency | Increase UPF N4 threads; increase SMF PFCP T1 timer; check N4 path QoS |

Table 2 — 5GC alarm correlation (indicative — verify alarm IDs with vendor alarm dictionary for your software version).

5. Summary — Key Takeaways

| Topic | Key Takeaway |
|---|---|
| Diagnosis approach | Follow the session flow. Symptom → which step fails → which interface/NF → packet capture confirms. Never jump to conclusions based on which NF is alarming. |
| Wireshark N4 filter | pfcp.cause != 1 is the fastest way to spot all PFCP failures in a capture. Any cause other than 1 (Request accepted) = investigate that session. |
| Asymmetric connectivity | Uploads work, downloads fail = PFCP Session Modification timeout. Check SMF logs for PFCP_SESSION_MODIFICATION_TIMEOUT, then the UPF N4 thread pool. |
| NRF 503 storm | After maintenance: stagger NF restarts + add startup jitter to token refresh. NRF overload self-heals in 2-3 minutes but prevention is better. |
| Site dark walkthrough | 7-step isolation: Grafana → gNB OAM → N2 capture → N12 check → N4 check → N3 check → timeline correlation. Do not skip steps. |
| Grafana before Wireshark | Grafana spots the failure category in 30 seconds. Wireshark confirms root cause in 5 minutes. Use both, in that order. |

Table 3 — Post 14 summary. Troubleshooting 5GC is systematic layer-by-layer isolation, not random NF restarts.

Next: Post 15 — 5GC Security Architecture
