A production-grade handbook for DevOps Engineers, SREs, Platform Engineers, and Kubernetes Engineers

1. Introduction

Network issues cause the majority of production outages. Whether it’s a microservice that can’t reach its database, a Kubernetes pod stuck in CrashLoopBackOff due to a DNS failure, or a load balancer silently dropping traffic — understanding how to navigate these problems quickly is what separates a junior engineer from a senior one.

What makes network troubleshooting uniquely challenging is the number of layers involved. A single failed HTTP request can originate from a misconfigured security group in AWS, a broken CoreDNS pod in Kubernetes, a firewall rule added by a teammate, a wrong route in an OS routing table, or a misconfigured application binding to the wrong interface. Without a systematic approach, you’re guessing.

Common real-world scenarios you’ll encounter:

  • Service unreachable — curl returns Connection refused or times out
  • Kubernetes pod communication failure — pods can’t reach each other or services
  • DNS failures — names don’t resolve, or resolve to wrong IPs
  • Load balancer misconfiguration — traffic reaches the LB but never hits backend pods
  • Firewall blocking traffic — connectivity works from some hosts but not others

The layered troubleshooting mindset is everything. Always start at the bottom (physical/IP connectivity) and work your way up (application). Jumping straight to application logs when the real problem is a firewall rule costs hours.

Application layer  → Is the app listening? Is it returning errors? 
Transport layer    → Is the port open? Are connections being established? 
Network layer      → Can packets route between hosts? 
Data link/physical → Is there connectivity at all?

2. How Networking Works: A DevOps Perspective

OSI vs TCP/IP — What Actually Matters

Forget memorizing 7 OSI layers for an exam. In production, you work with 4 practical layers:

Layer            Protocols               Your Tools
Application      HTTP, gRPC, DNS, TLS    curl, dig, openssl
Transport        TCP, UDP                ss, netstat, nc
Network          IP, ICMP                ping, traceroute, ip route
Link/Physical    Ethernet, ARP           ip link, arp, ethtool

What Happens When You Run curl https://example.com

Understanding this flow tells you exactly where to look when things break.

┌─────────────────────────────────────────────────────────────────┐ 
│  curl https://example.com                                        │ 
│                                                                  │ 
│  1. DNS Resolution                                               │ 
│     └─ Check /etc/hosts → /etc/nsswitch.conf → resolv.conf      │ 
│        → Query DNS server (e.g., 8.8.8.8)                       │ 
│        → Returns: 93.184.216.34                                  │ 
│                                                                  │ 
│  2. Routing Decision                                             │ 
│     └─ Check routing table: which interface to use?             │ 
│        → Selects eth0, gateway 192.168.1.1                      │ 
│                                                                  │ 
│  3. TCP Handshake (port 443)                                     │ 
│     └─ SYN   → server                                           │ 
│        SYN-ACK ← server                                         │ 
│        ACK   → server                                           │ 
│                                                                  │ 
│  4. TLS Handshake                                                │ 
│     └─ ClientHello → server                                     │ 
│        ServerHello + Certificate ← server                       │ 
│        Key exchange, session established                        │ 
│                                                                  │ 
│  5. HTTP Request                                                 │ 
│     └─ GET / HTTP/1.1                                           │ 
│        Host: example.com                                        │ 
│                                                                  │ 
│  6. Response                                                     │ 
│     └─ HTTP/1.1 200 OK                                          │ 
└─────────────────────────────────────────────────────────────────┘

Where failures occur at each step:

  • DNS → Could not resolve host — check /etc/resolv.conf, test with dig
  • Routing → Network unreachable — check ip route, gateway, VPC routing tables
  • TCP handshake → Connection refused (port closed) or timeout (firewall dropping)
  • TLS → SSL certificate verify failed, expired cert, wrong hostname
  • HTTP → 4xx/5xx errors, application-level problems
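
Each of those stages can be exercised on its own. A quick sweep like the following — a sketch using example.com and its documented IP as stand-ins for your endpoint — usually pinpoints the failing step before any deeper digging:

dig +short example.com                                   # 1. DNS
ip route get 93.184.216.34                               # 2. Routing decision for the returned IP
nc -zv -w 3 example.com 443                              # 3. TCP handshake
openssl s_client -connect example.com:443 -servername example.com </dev/null 2>/dev/null \
  | openssl x509 -noout -dates                           # 4. TLS: certificate validity window
curl -sv https://example.com -o /dev/null                # 5-6. HTTP request and response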

3. Systematic Troubleshooting Methodology

Gut-feel debugging is slow and inconsistent. A structured approach cuts your mean time to resolution (MTTR) dramatically.

The Six-Step Framework

Step 1: Define the problem precisely

Before touching a single tool, answer:

  • What is the exact error message?
  • What is the source (client IP/pod/service)?
  • What is the destination (IP/hostname/port/protocol)?
  • When did it start? What changed?
  • Is it affecting all requests or some?
# Gather basic context 
hostname && ip addr show 
date && uptime 
last reboot

Step 2: Verify basic connectivity (Layer 3)

# Can you reach the host at all? 
ping -c 4 <target-ip>

# What's the path?
traceroute <target-ip>

If ping fails to an IP: routing or firewall issue. If ping succeeds but everything else fails: higher-layer problem.

Step 3: Verify DNS resolution

dig <hostname> 
dig @8.8.8.8 <hostname>      # Bypass local resolver 
nslookup <hostname>

If the hostname doesn’t resolve, or resolves to the wrong IP, you’ve found your problem.

Step 4: Check routing

ip route get <target-ip>     # Shows which route will be used 
ip route show                 # Full routing table

Step 5: Check firewall and port accessibility

nc -zv <host> <port>          # Is the port reachable? 
telnet <host> <port>
ss -tulnp                     # Is the service listening locally? 
iptables -L -n -v             # Any rules blocking traffic?

Step 6: Check the application layer

curl -v http://<host>:<port>/health 
curl -vvv --resolve <hostname>:<port>:<ip> https://<hostname>/path 
journalctl -u <service> --since "10 min ago" 
kubectl logs <pod> --previous

Decision Tree

Is ping to target IP working? 
├── NO  → Routing/firewall issue 
│         ├── Check: ip route get <target-ip> 
│         ├── Check: iptables -L 
│         └── Check: Cloud security groups 
└── YES → Is DNS resolving correctly? 
          ├── NO  → DNS issue 
          │         ├── Check: /etc/resolv.conf 
          │         ├── Check: dig @<dns-server> <hostname> 
          │         └── K8s: check CoreDNS pods 
          └── YES → Is the port open? 
                    ├── NO  → Service not running or firewall blocking 
                    │         ├── Check: ss -tulnp 
                    │         └── Check: nc -zv host port 
                    └── YES → Application layer issue 
                              ├── Check: curl -v 
                              ├── Check: app logs 
                              └── Check: TLS certificate

4. Essential Linux Network Troubleshooting Tools

ping — Baseline Connectivity

What it does: Sends ICMP echo requests to verify Layer 3 reachability and measure round-trip time.

When to use: First check in any troubleshooting session.

ping -c 4 8.8.8.8             # 4 packets to Google DNS 
ping -c 4 google.com          # Tests both DNS and connectivity 
ping -I eth0 10.0.0.5         # Force specific interface 
ping -s 1400 10.0.0.5         # Test with larger packet (MTU issues)

Interpreting output:

PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data. 
64 bytes from 8.8.8.8: icmp_seq=1 ttl=118 time=12.3 ms 
64 bytes from 8.8.8.8: icmp_seq=2 ttl=118 time=11.8 ms

--- 8.8.8.8 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 11.8/12.0/12.3/0.2 ms

  • ttl varying between replies = the path is changing (route flapping or asymmetric routing)
  • High mdev (variance) = network instability
  • 100% packet loss to an IP that should be reachable = firewall dropping ICMP or host down

traceroute / tracepath — Path Analysis

What it does: Shows every network hop between you and the destination. Identifies where packets stop.

traceroute 8.8.8.8 
traceroute -T -p 443 api.example.com    # TCP-based traceroute (bypasses ICMP blocks) 
tracepath 8.8.8.8                       # Similar, no root required 
mtr --report 8.8.8.8                    # Real-time, combines ping+traceroute

Production use: If traceroute shows packets reaching hop 7 but never hop 8, the problem is between those two nodes — which might be a cloud router, firewall, or misconfigured VPN gateway.


ip — Interface and Route Management

What it does: The modern replacement for ifconfig and route. Manages interfaces, addresses, routes, and more.

ip addr show                  # All interfaces and IPs 
ip addr show eth0             # Specific interface 
ip route show                 # Routing table 
ip route get 10.0.1.5         # Which route would be used? 
ip link show                  # Interface state (UP/DOWN) 
ip neigh show                 # ARP table

Real-world example:

$ ip route get 10.96.0.1 
10.96.0.1 via 192.168.1.1 dev eth0 src 192.168.1.50 uid 0 
    cache

This tells you: traffic to 10.96.0.1 (Kubernetes ClusterIP) goes via gateway 192.168.1.1 on eth0. If you expect it to go through a Kubernetes CNI interface instead, something is misconfigured.


ss — Socket Statistics

What it does: Shows open sockets, listening ports, and established connections. Faster and more powerful than netstat.

ss -tulnp                     # TCP+UDP, listening only, with process names 
ss -tnp state established     # All established TCP connections 
ss -s                         # Summary statistics 
ss -tulnp | grep :443         # Who is listening on 443? 
ss -tnp dst 10.0.0.5          # Connections to specific destination

Interpreting output:

Netid  State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port Process 
tcp    LISTEN  0       128     0.0.0.0:80           0.0.0.0:*         users:(("nginx",pid=1234,fd=6)) 
tcp    ESTAB   0       0       10.0.0.10:54312      10.0.0.5:443      users:(("curl",pid=5678,fd=5))
  • Recv-Q nonzero on LISTEN: app is not accepting connections fast enough (backlog full)
  • Send-Q nonzero on ESTAB: network congestion, destination not consuming data
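
To check those queues on a specific listener (port 8080 below is just a placeholder), and to see whether the accept queue has ever overflowed:

ss -ltn 'sport = :8080'        # LISTEN sockets: Recv-Q = current accept-queue depth, Send-Q = configured backlog
nstat -az TcpExtListenOverflows TcpExtListenDrops    # nonzero counters = connections dropped because the backlog filled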

curl — HTTP/HTTPS Testing

What it does: The Swiss Army knife for testing HTTP endpoints.

curl -v http://service:8080/health            # Verbose output 
curl -I https://example.com                   # Headers only 
curl -w "\nTime: %{time_total}s\n" https://example.com   # Timing 
curl --connect-timeout 5 --max-time 10 http://slow-service/ 
curl -k https://self-signed.example.com       # Skip TLS verification 
curl --resolve api.example.com:443:10.0.0.5 https://api.example.com   # Override DNS 
curl -H "Host: myapp.example.com" http://10.0.0.5/   # Test ingress with custom Host header

The timing breakdown is gold for diagnosing slow requests:

curl -w "DNS: %{time_namelookup}s | Connect: %{time_connect}s | TLS: %{time_appconnect}s | Total: %{time_total}s\n" \ 
  -o /dev/null -s https://example.com
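
For repeated measurements it is handier to keep the format string in a small file and pass it with -w @file — a sketch, with the path chosen arbitrarily:

cat > /tmp/curl-format.txt <<'EOF'
    dns_lookup:   %{time_namelookup}s
    tcp_connect:  %{time_connect}s
    tls_done:     %{time_appconnect}s
    first_byte:   %{time_starttransfer}s
    total:        %{time_total}s
EOF

curl -w "@/tmp/curl-format.txt" -o /dev/null -s https://example.com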

dig — DNS Interrogation

What it does: Queries DNS servers directly. The primary tool for DNS troubleshooting.

dig google.com                              # A record 
dig google.com AAAA                         # IPv6 record 
dig google.com MX                           # Mail records 
dig @8.8.8.8 google.com                     # Query specific nameserver 
dig +short google.com                       # IP only 
dig +trace google.com                       # Full resolution chain 
dig -x 8.8.8.8                             # Reverse DNS 
dig @10.96.0.10 kubernetes.default.svc.cluster.local   # K8s CoreDNS

Reading dig output:

;; ANSWER SECTION: 
google.com.		299	IN	A	142.250.80.46 
;; Query time: 12 msec 
;; SERVER: 8.8.8.8#53
  • 299 = TTL in seconds (low TTL = DNS changes propagate quickly)
  • SERVER = which resolver actually answered
  • No ANSWER section = DNS record doesn’t exist or resolution failed

tcpdump — Packet Capture

What it does: Captures and analyzes raw network packets. The ultimate source of truth.

tcpdump -i eth0                             # All traffic on eth0 
tcpdump -i any port 443                     # All TLS traffic, any interface 
tcpdump host 10.0.0.5                       # Traffic to/from specific host 
tcpdump src 10.0.0.5 and dst port 8080      # Filtered 
tcpdump -w /tmp/capture.pcap -i eth0        # Save to file (analyze in Wireshark) 
tcpdump -i eth0 -nn -v port 53              # DNS queries, no hostname resolution 
tcpdump 'tcp[tcpflags] & (tcp-syn) != 0'   # SYN packets only

Reading a TCP handshake:

14:23:01 IP 10.0.0.10.54312 > 10.0.0.5.443: Flags [S], seq 1234567890 
14:23:01 IP 10.0.0.5.443 > 10.0.0.10.54312: Flags [S.], seq 9876543210, ack 1234567891 
14:23:01 IP 10.0.0.10.54312 > 10.0.0.5.443: Flags [.], ack 9876543211
  • [S] = SYN (initiating connection)
  • [S.] = SYN-ACK (server acknowledging)
  • [.] = ACK (connection established)
  • [R] = RST (connection refused/reset — port closed or firewall rejecting)
  • [F] = FIN (graceful close)

If you see SYN packets leaving but no SYN-ACK arriving, a firewall is dropping packets.


nc (netcat) — Swiss Army Knife for TCP/UDP

What it does: Opens raw TCP/UDP connections, useful for port testing and basic service simulation.

nc -zv 10.0.0.5 443             # Test if port is open (verbose) 
nc -zv 10.0.0.5 8080-8090       # Scan port range 
nc -l 8080                       # Listen on port 8080 (simple server) 
echo "GET / HTTP/1.0" | nc 10.0.0.5 80   # Raw HTTP request 
nc -u 10.0.0.5 514               # UDP test (syslog port)

nmap — Network Scanner

What it does: Scans ports and services across one or many hosts.

nmap -p 443,8080,8443 10.0.0.5         # Scan specific ports 
nmap -p- 10.0.0.5                       # All 65535 ports 
nmap -sV 10.0.0.5                       # Version detection 
nmap -sn 10.0.0.0/24                    # Host discovery (no port scan) 
nmap --script ssl-cert 10.0.0.5 -p 443  # Check TLS certificate

Note: Use with permission. In production environments, coordinate scans to avoid triggering security alerts.


5. Troubleshooting DNS Problems

DNS issues cause a disproportionate number of production incidents, and they’re often subtle — everything looks fine until a TTL expires or a pod restarts.

How DNS Resolution Works on Linux

Application → glibc resolver 
           → Check /etc/nsswitch.conf (order: files, dns) 
           → Check /etc/hosts (files) 
           → Query /etc/resolv.conf nameserver(s) 
              └── systemd-resolved (127.0.0.53) on modern systems 
                  └── Upstream DNS (8.8.8.8, or corporate DNS)
cat /etc/resolv.conf 
# nameserver 127.0.0.53        <- systemd-resolved stub 
# nameserver 10.96.0.10        <- K8s CoreDNS (inside pods) 
# search default.svc.cluster.local svc.cluster.local cluster.local 
# options ndots:5
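
One nuance worth knowing: dig talks straight to the nameservers in /etc/resolv.conf and ignores /etc/nsswitch.conf and /etc/hosts, while getent walks the same glibc path applications use — so the two can disagree. A quick comparison:

getent hosts example.com      # full NSS chain: nsswitch.conf -> /etc/hosts -> DNS (what most apps see)
getent ahosts example.com     # same path via getaddrinfo(), all address families
dig +short example.com        # DNS only, straight to the resolv.conf nameservers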

Common DNS Issues and How to Debug Them

Issue: DNS timeout

# How long does resolution take? 
time dig google.com

# Is the resolver responding?
dig @127.0.0.53 google.com
dig @8.8.8.8 google.com         # Bypass local resolver

# Check systemd-resolved status
systemd-resolve --status
resolvectl status

Issue: Wrong DNS server configured

# See what resolver is being used 
cat /etc/resolv.conf

# On systemd-resolved systems
resolvectl dns

# Override for a single query
dig @10.0.0.1 internal.service.corp

Issue: DNS works for public names but not internal

# Check search domains 
cat /etc/resolv.conf | grep search

# Manually test internal name with full FQDN
dig internal.service.corp.
dig internal.service.corp          # Note trailing dot matters

Kubernetes CoreDNS

Inside Kubernetes pods, DNS is handled by CoreDNS at the kube-dns ClusterIP (typically 10.96.0.10).

# Verify CoreDNS pods are running 
kubectl -n kube-system get pods -l k8s-app=kube-dns

# Test DNS from inside a pod
kubectl run -it --rm debug --image=busybox --restart=Never -- sh
# Inside pod:
nslookup kubernetes.default
nslookup myservice.mynamespace.svc.cluster.local
cat /etc/resolv.conf

# Check CoreDNS logs
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50

# Check CoreDNS ConfigMap
kubectl -n kube-system get configmap coredns -o yaml

The ndots:5 setting explained: In Kubernetes, any name with fewer than five dots is first tried against each search domain before being queried as an absolute name. A short name like myservice expands to myservice.default.svc.cluster.local, then myservice.svc.cluster.local, and so on. The same expansion hits external names too, generating several pointless lookups per request and sometimes DNS timeouts — consider using FQDNs with a trailing dot for external endpoints.
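
You can watch that expansion on the wire from a debug pod. A sketch — the pod name and the external hostname are placeholders, and the debug image is assumed to ship tcpdump (netshoot does):

kubectl exec -it <debug-pod> -- sh -c '
  tcpdump -i any -nn port 53 & CAP=$!
  sleep 1
  nslookup api.external-vendor.com      # short form: expect *.default.svc.cluster.local attempts first
  nslookup api.external-vendor.com.     # trailing dot: a single absolute query, no search expansion
  sleep 1; kill $CAP'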


6. Troubleshooting Connectivity Issues

Localhost Issues

# Is the service bound to the right interface? 
ss -tulnp | grep <port> 

# Service bound to 127.0.0.1 won't be reachable externally
# Service bound to 0.0.0.0 listens on all interfaces

If a service is bound to 127.0.0.1:8080 and you're trying to reach it from another host — that's your problem. Check the application configuration to bind to 0.0.0.0 or the specific external IP.

Server-to-Server Connectivity

# From source server, test destination 
ping <destination-ip> 
nc -zv <destination-ip> <port> 
curl -v http://<destination-ip>:<port>/health

# Check routing
ip route get <destination-ip>

# Example output
$ ip route get 10.0.1.50
10.0.1.50 via 10.0.0.1 dev eth0 src 10.0.0.10 uid 0

Reading the routing table:

$ ip route show 
default via 192.168.1.1 dev eth0 proto dhcp 
10.0.0.0/8 via 10.10.0.1 dev vpn0           # Internal traffic via VPN 
172.16.0.0/12 via 10.10.0.1 dev vpn0 
192.168.1.0/24 dev eth0 proto kernel scope link src 192.168.1.50

Traffic to 10.0.1.50 matches the /8 route and goes via the VPN. If that VPN tunnel is down, connection fails even though the host is physically reachable.

Container Networking Issues

In Docker/Kubernetes, containers have their own network namespace with separate interfaces and routes.

# Docker: inspect container network 
docker inspect <container> | grep -i network 
docker exec <container> ip addr 
docker exec <container> ip route 
docker exec <container> cat /etc/resolv.conf

# Check Docker bridge network
ip link show docker0
bridge link show


7. Troubleshooting Ports and Services

Is the Service Listening?

ss -tulnp                               # All listening sockets with process 
ss -tulnp | grep :8080                  # Specific port 
ss -tulnp | grep nginx                  # Specific process

# If ss not available (old systems)
netstat -tulnp
netstat -tulnp | grep LISTEN

Is the Port Reachable Remotely?

nc -zv 10.0.0.5 8080                    # Quick port test 
nc -zv -w 3 10.0.0.5 8080              # 3 second timeout

# Test from within Kubernetes pod
kubectl exec -it <pod> -- nc -zv <host> <port> 
kubectl exec -it <pod> -- wget -qO- http://<service>:<port>/health

Interpreting nc results:

Connection to 10.0.0.5 8080 port [tcp/http-alt] succeeded!   # Port open 
nc: connect to 10.0.0.5 port 8080 (tcp) failed: Connection refused  # Port closed 
# (hangs/timeout) = firewall dropping packets silently

The difference between “Connection refused” and a timeout is critical:

  • Refused = host is reachable, but nothing is listening on that port (or iptables REJECT)
  • Timeout = packets are being dropped (firewall DROP rule, routing issue, host unreachable)
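
To see the two behaviours side by side, a throwaway lab host (never a production node) makes the difference obvious — a sketch, with port 9999 chosen arbitrarily; run the iptables commands on the lab host and nc from a client:

# REJECT answers immediately (surfaces on the client as "Connection refused")
iptables -A INPUT -p tcp --dport 9999 -j REJECT
nc -zv -w 3 <lab-host> 9999
iptables -D INPUT -p tcp --dport 9999 -j REJECT

# DROP stays silent — the client hangs until its own timeout fires
iptables -A INPUT -p tcp --dport 9999 -j DROP
nc -zv -w 3 <lab-host> 9999
iptables -D INPUT -p tcp --dport 9999 -j DROP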

8. Firewall Troubleshooting

iptables

# View all rules with packet counts 
iptables -L -n -v

# View NAT table (important for K8s/Docker)
iptables -t nat -L -n -v

# View filter table explicitly
iptables -t filter -L INPUT -n -v --line-numbers

# Check if a specific port is blocked
iptables -L INPUT -n | grep DROP
iptables -L INPUT -n | grep REJECT

Understanding chains:

  • INPUT — traffic destined for this host
  • OUTPUT — traffic originating from this host
  • FORWARD — traffic passing through this host (relevant for routers, K8s nodes)
  • PREROUTING (nat table) — DNAT happens here (e.g., K8s service VIP → pod IP)
  • POSTROUTING (nat table) — SNAT/masquerade happens here

Real-world K8s iptables example:

When you access a Kubernetes ClusterIP service, iptables intercepts the traffic and rewrites the destination to a backend pod IP using DNAT rules created by kube-proxy:

# See K8s service rules 
iptables -t nat -L KUBE-SERVICES -n -v | grep 10.96.0.10 
iptables -t nat -L KUBE-SVC-<hash> -n -v

# Trace a packet through iptables (kernel module)
modprobe xt_LOG
iptables -t raw -I PREROUTING -p tcp --dport 8080 -j LOG --log-prefix "PKT: "
# Watch: dmesg | grep PKT
# Clean up: iptables -t raw -D PREROUTING -p tcp --dport 8080 -j LOG --log-prefix "PKT: "

nftables

nftables replaces iptables on modern distributions (RHEL 8+, Debian 10+):

nft list ruleset                         # All rules 
nft list table inet filter               # Filter table 
nft list chain inet filter input         # Input chain

ufw (Ubuntu Firewall)

ufw status verbose                       # Current rules and status 
ufw allow 8080/tcp                       # Allow port 
ufw deny from 10.0.0.5                   # Block source IP 
ufw logging on                           # Enable logging (/var/log/ufw.log)

9. Packet-Level Troubleshooting with tcpdump

tcpdump is your ground truth. When you can’t trust what the application says, packets don’t lie.

Confirming Traffic Reaches the Server

# On the server, capture incoming connections 
tcpdump -i any -nn port 8080

# On the server, filter for specific client
tcpdump -i any -nn src 10.0.0.10 and port 8080

# Capture and save for later analysis
tcpdump -i eth0 -w /tmp/debug.pcap port 443
# Transfer to your laptop and open in Wireshark

Diagnosing Connection Failures

Scenario: Client sends SYN, gets no response

# On client 
tcpdump -i eth0 host 10.0.0.5 and port 8080 
# See: SYN packets going out, nothing coming back 
# Conclusion: Firewall is dropping packets (DROP rule, security group)

# On server
tcpdump -i eth0 port 8080
# If SYN packets don't appear here: firewall before server
# If SYN packets appear but no SYN-ACK: server-side issue (app not listening, server firewall)

Scenario: Connection established but no data

tcpdump -i any -nn -A host 10.0.0.5 and port 8080 
# -A prints ASCII content 
# Look for HTTP request/response or lack thereof

Advanced tcpdump filters:

# Only SYN packets (new connections) 
tcpdump 'tcp[tcpflags] & tcp-syn != 0'

# Only RST packets (connection resets)
tcpdump 'tcp[tcpflags] & tcp-rst != 0'

# HTTP GET requests
tcpdump -A -s 0 'tcp dst port 80 and tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x47455420'

# Large packets (MTU debugging)
tcpdump 'ip[2:2] > 1400'

# DNS queries
tcpdump -i any -nn port 53


10. Kubernetes Network Troubleshooting

Kubernetes networking has multiple layers: pod networking, service networking, and ingress. Each can fail independently.

Pod Networking Fundamentals

Every pod gets its own IP (managed by CNI: Calico, Flannel, Cilium, etc.). Pods can communicate directly via IP across nodes — if CNI is working correctly.

# Get pod IPs 
kubectl get pods -o wide -n <namespace>

# Check which node a pod is on
kubectl get pod <pod-name> -o wide

# Test connectivity from inside a pod
kubectl exec -it <pod-name> -- ping <other-pod-ip>
kubectl exec -it <pod-name> -- nc -zv <other-pod-ip> <port> 
kubectl exec -it <pod-name> -- wget -qO- http://<other-pod-ip>:<port>/health

Service Networking

Kubernetes Services create virtual IPs (ClusterIP) and route traffic to matching pods via iptables/IPVS rules set up by kube-proxy.

# Inspect a service 
kubectl get svc <service> -o wide 
kubectl describe svc <service>

# Verify endpoints exist (if Endpoints is empty, no pods match the selector)
kubectl get endpoints <service>
kubectl describe endpoints <service>

Empty Endpoints is the #1 cause of “Service unreachable” in Kubernetes. This means no pods match the service selector. Check:

# What selector does the service use? 
kubectl get svc <service> -o jsonpath='{.spec.selector}'

# Do any pods match?
kubectl get pods -l app=myapp       # Replace with your selector labels
kubectl get pods --show-labels | grep <label-value>
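
A related trap: a pod that matches the selector but is not Ready is also excluded from the Endpoints object, so a failing readiness probe produces the same empty-endpoints symptom. Quick checks (the label and pod name are placeholders):

kubectl get pods -l app=myapp -o wide                      # READY column: 0/1 pods are excluded from Endpoints
kubectl describe pod <pod-name> | grep -A5 Conditions      # Ready=False usually points at the readiness probe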

Debugging with Ephemeral Containers

# Run a debug pod in same namespace 
kubectl run debug-pod --image=nicolaka/netshoot -it --rm --restart=Never -- bash

# Inside netshoot: dig, curl, tcpdump, iperf all available

# Attach to a running pod's network namespace (K8s 1.23+)
kubectl debug -it <pod-name> --image=nicolaka/netshoot --target=<container-name>

Ingress Troubleshooting

# Check ingress configuration 
kubectl get ingress -A 
kubectl describe ingress <name>

# Check ingress controller pods
kubectl -n ingress-nginx get pods
kubectl -n ingress-nginx logs <controller-pod> --tail=50

# Test with explicit Host header
curl -H "Host: myapp.example.com" http://<ingress-ip>/path

# Test TLS
openssl s_client -connect myapp.example.com:443 -servername myapp.example.com

CNI Issues

If pods on different nodes can’t communicate:

# Check CNI pods (Calico example) 
kubectl -n kube-system get pods -l k8s-app=calico-node

# Check node-level routes
ip route show | grep <pod-cidr>

# Verify CNI interface exists
ip link show cali*        # Calico
ip link show flannel*     # Flannel
ip link show cilium*      # Cilium

# Check for CNI config
ls /etc/cni/net.d/
cat /etc/cni/net.d/10-calico.conflist


11. Cloud Network Troubleshooting

AWS

Security Groups are the most common source of connectivity problems in AWS. They are stateful — allowing inbound traffic automatically allows the return traffic (NACLs, by contrast, are stateless).

# From the AWS CLI 
aws ec2 describe-security-groups --group-ids sg-xxxxxxxx 
aws ec2 describe-network-acls --filters Name=vpc-id,Values=vpc-xxxxxxxx

# Check effective security groups on an instance
aws ec2 describe-instances --instance-ids i-xxxxxxxx \
 --query 'Reservations[].Instances[].SecurityGroups'

Common AWS network gotchas:

  • Security Group allows port 8080, but the application is binding to 127.0.0.1 — packets arrive but are rejected by OS
  • NACLs are stateless — you need both inbound AND outbound rules (unlike Security Groups)
  • VPC Peering is not transitive — A peers with B, B peers with C ≠ A can reach C
  • Route tables — subnets need explicit routes to reach peered VPCs, VPN gateways, etc.
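
The route-table gotcha is quick to verify from the CLI as well — a sketch, reusing the placeholder VPC ID from above; look for a route covering the peer CIDR and confirm its state is active:

aws ec2 describe-route-tables --filters Name=vpc-id,Values=vpc-xxxxxxxx \
  --query 'RouteTables[].Routes[].[DestinationCidrBlock,GatewayId,State]' --output table
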
# Test from EC2 instance 
curl http://169.254.169.254/latest/meta-data/   # Instance metadata (verify IMDSv2) 
curl http://169.254.169.254/latest/meta-data/local-ipv4

# VPC Flow Logs — enable on suspect subnets, then query CloudWatch Logs
# Look for REJECT action on expected traffic

GCP

# Check firewall rules 
gcloud compute firewall-rules list 
gcloud compute firewall-rules describe <rule-name>

# Check routes
gcloud compute routes list

# VPC network details
gcloud compute networks describe <network-name>

GCP-specific gotchas: Firewall rules apply to the entire VPC network, not subnets. Target tags or service accounts control which VMs the rule applies to.
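
To see which rules can even apply to a given VM, check the VM's network tags and match them against rule target tags — a sketch with placeholder VM, zone, and tag names:

# Which network tags does the VM carry? (firewall rules match on these)
gcloud compute instances describe <vm-name> --zone <zone> --format='value(tags.items)'

# Rules whose target tags include one of them
gcloud compute firewall-rules list --filter='targetTags:(<tag>)'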

Azure

In Azure, Network Security Groups (NSGs) can be attached at both the subnet level and the NIC level — both are evaluated. A common mistake is configuring the NIC NSG but forgetting the subnet NSG, or vice versa.

az network nsg show -g <resource-group> -n <nsg-name> 
az network nsg rule list -g <resource-group> --nsg-name <nsg-name> 
az network nic show -g <resource-group> -n <nic-name>
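
Because both levels are evaluated, the effective rule set is what matters. The CLI can compute it for a NIC attached to a running VM — a sketch with the same placeholder names:

az network nic list-effective-nsg -g <resource-group> -n <nic-name>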

12. Real Production Incident Walkthrough

Scenario: “Payment Service Unreachable After Deployment”

Alert received: payment-service health check failing. 0% success rate for 5 minutes.

Step 1: Define the problem

kubectl get pods -n payments 
# NAME                              READY   STATUS    RESTARTS   AGE 
# payment-svc-7d9f8b6-xk2pq        0/1     Running   0          3m 
# payment-svc-7d9f8b6-mn8qt        0/1     Running   0          3m

Pods are running but not READY. Something is failing the readiness probe.

Step 2: Check events and logs

kubectl describe pod payment-svc-7d9f8b6-xk2pq -n payments 
# Events: 
# Warning  Unhealthy  2m  kubelet  Readiness probe failed: Get "http://10.0.1.45:8080/health": dial tcp 10.0.1.45:8080: connect: connection refused

kubectl logs payment-svc-7d9f8b6-xk2pq -n payments --tail=30
# Error: Cannot connect to database: dial tcp 10.96.45.12:5432: i/o timeout

App is running but can’t reach its database.

Step 3: DNS and service check

kubectl exec -it payment-svc-7d9f8b6-xk2pq -n payments -- sh 
# Inside pod: 
nslookup postgres-service.databases.svc.cluster.local 
# Server: 10.96.0.10 
# Non-authoritative answer: Name: postgres-service.databases.svc.cluster.local 
# Address: 10.96.45.12

DNS resolves correctly.

Step 4: Test connectivity

# Still inside pod 
nc -zv 10.96.45.12 5432 
# (hangs — timeout, not refused)

Port times out. Either the service has no endpoints, or a NetworkPolicy is blocking it.

Step 5: Check endpoints

kubectl get endpoints postgres-service -n databases 
# NAME               ENDPOINTS   AGE 
# postgres-service         45m

No endpoints! The service has no backing pods.

Step 6: Find the root cause

kubectl get pods -n databases 
# NAME                 READY   STATUS             RESTARTS   AGE 
# postgres-0           0/1     ImagePullBackOff   0          46m

The database pod failed to start due to ImagePullBackOff. During the deployment, someone updated the database image tag in the Helm values and pushed an image that doesn't exist in the registry.

Resolution:

# Fix the image tag 
helm upgrade postgres ./charts/postgres -n databases --set image.tag=15.3

# Verify pod comes up
kubectl get pods -n databases -w

# Verify endpoints populate
kubectl get endpoints postgres-service -n databases
# NAME               ENDPOINTS         AGE
# postgres-service   10.0.1.82:5432    2m

# Verify payment service recovers
kubectl get pods -n payments

Total resolution time: 11 minutes. The structured approach — checking events, logs, DNS, connectivity, endpoints in sequence — avoided hours of guessing.


13. Advanced Troubleshooting Techniques

conntrack — Connection Tracking

The Linux connection tracking table records all NAT’d connections. Useful for debugging K8s service routing and SNAT issues.

conntrack -L                             # List all tracked connections 
conntrack -L | grep 10.0.0.5            # Filter by IP 
conntrack -L | wc -l                    # Total tracked connections 
# If this is near nf_conntrack_max, you'll drop connections 
cat /proc/sys/net/netfilter/nf_conntrack_count 
cat /proc/sys/net/netfilter/nf_conntrack_max

# Watch new connections in real-time
conntrack -E -e NEW

High conntrack count is a real production issue. Under heavy load, it can exhaust the conntrack table, causing new connections to silently fail with no error.
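
If the count is approaching the limit, raising nf_conntrack_max buys headroom — a sketch; size it against available memory and persist it via sysctl.d rather than the one-off shown here:

cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max
sysctl -w net.netfilter.nf_conntrack_max=262144    # temporary, lost on reboot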

Network Namespaces

Containers and pods have isolated network namespaces. To troubleshoot at the packet level inside a container without installing tools in the container:

# Find the container PID 
docker inspect <container> | grep Pid 
# Or for K8s 
crictl inspect <container-id> | grep pid

# Enter the network namespace
nsenter -t <pid> -n -- ip addr show
nsenter -t <pid> -n -- ss -tulnp
nsenter -t <pid> -n -- tcpdump -i any port 8080

Advanced ss Filters

# Show only connections in TIME_WAIT (can indicate connection storm) 
ss -tn state time-wait | wc -l

# Show sockets by memory usage (find memory hog)
ss -tm

# Connections to a specific destination port
ss -tn dst :443

# Filter by source address
ss -tn src 10.0.0.5

strace for Socket Debugging

When you need to know exactly what syscalls an application makes:

strace -e trace=network -p <pid> 
strace -e connect,bind,sendto,recvfrom curl http://example.com

This shows every connect() call, which IP:port the app is trying to reach, and what errors it receives — invaluable when the app logs are ambiguous.


14. Automation and Monitoring

The best network troubleshooting is the one you don’t have to do because your monitoring caught the issue first.

Key Metrics to Monitor with Prometheus

# Key network metrics to alert on:

# Blackbox exporter — probe availability
probe_success{job="blackbox", instance="https://api.example.com"} == 0

# Node exporter — interface errors
rate(node_network_receive_errs_total[5m]) > 0
rate(node_network_transmit_errs_total[5m]) > 0

# DNS resolution failures (CoreDNS)
rate(coredns_dns_response_rcode_count_total{rcode="SERVFAIL"}[5m]) > 0.01

# Kubernetes endpoint availability
kube_endpoint_address_available{endpoint="my-service"} == 0

# TCP retransmits (sign of network congestion)
rate(node_netstat_Tcp_RetransSegs[5m]) > 10

Grafana Dashboards

Key dashboards to maintain:

  • Node Exporter Full (dashboard ID 1860) — network interface metrics per node
  • Kubernetes Networking — pod/service network traffic
  • CoreDNS — DNS query rates, SERVFAIL rates, response times
  • Blackbox Exporter — endpoint availability and probe duration

Proactive Alerting

# Prometheus alerting rule example 
groups: 
- name: network 
  rules: 
  - alert: ServiceEndpointDown 
    expr: kube_endpoint_address_available == 0 
    for: 1m 
    labels: 
      severity: critical 
    annotations: 
      summary: "Kubernetes service {{ $labels.endpoint }} has no available endpoints"

  - alert: DNSHighLatency
    expr: histogram_quantile(0.99, rate(coredns_dns_request_duration_seconds_bucket[5m])) > 0.5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "CoreDNS p99 latency > 500ms"

Continuous Connectivity Testing

Run synthetic monitoring probes from within your cluster:

# Deploy a simple network probe pod that tests connectivity continuously 
kubectl apply -f - <<EOF 
apiVersion: apps/v1 
kind: Deployment 
metadata: 
  name: network-probe 
spec: 
  replicas: 1 
  selector: 
    matchLabels: 
      app: network-probe 
  template: 
    metadata: 
      labels: 
        app: network-probe 
    spec: 
      containers: 
      - name: probe 
        image: nicolaka/netshoot 
        command: ["/bin/sh", "-c"] 
        args: 
        - while true; do 
            nc -zv postgres-service.databases 5432 && echo "DB OK" || echo "DB FAIL"; 
            sleep 10; 
          done 
EOF 
kubectl logs -f deployment/network-probe

15. Best Practices Checklist

Investigation practices:

  • Always start with ping before anything else — establish whether basic connectivity exists
  • Always check DNS separately from connectivity — they fail independently
  • Always run diagnostic commands from both ends (source and destination) when possible
  • Save tcpdump captures (-w file.pcap) before the issue clears itself
  • Document your debugging steps — you’ll face this issue again
  • Check “what changed” in your deployment pipeline before spending time on tools

Infrastructure practices:

  • Implement health checks and readiness probes on all Kubernetes workloads
  • Always set resource limits — a pod consuming all CPU can cause DNS timeouts that look like network issues
  • Use NetworkPolicies in Kubernetes but test them in audit mode first
  • Keep firewall rules documented and in version control (Terraform, Pulumi)
  • Enable VPC Flow Logs in cloud environments — they’re invaluable after the fact
  • Set up Blackbox Exporter probes for all critical service endpoints
  • Monitor CoreDNS health metrics actively

Security practices:

  • Default-deny NetworkPolicies in Kubernetes namespaces, then explicitly allow (see the sketch after this list)
  • Use security group/NACL changes in change management — they’re silent and cause immediate outages
  • Regularly audit firewall rules for stale entries
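
A minimal default-deny ingress policy looks like this — a sketch; the namespace is a placeholder, and you'll want matching allow policies in place before enforcing it anywhere that matters:

kubectl apply -n <namespace> -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}          # selects every pod in the namespace
  policyTypes:
  - Ingress                # no ingress rules listed, so all inbound traffic is denied
EOF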

Operational practices:

  • Maintain a network diagram — knowing expected topology cuts debug time in half
  • Keep netshoot or similar debug images available in your container registry
  • Create runbooks for known failure patterns (DNS failures, endpoint empty, etc.)
  • Add network-layer metrics to your SLOs — don’t just track application error rates

16. Conclusion

Network troubleshooting is a skill that compounds over time. The engineer who’s debugged a hundred incidents builds a mental model that shortcuts the diagnostic process — they know where to look first because they’ve seen the patterns.

The core mental model: Connectivity is a chain. Every link in that chain (DNS, routing, firewall, application) must work for the end result to work. Your job is to find the broken link, and the fastest way to do that is to test each link systematically rather than randomly.

Key principles to internalize:

  • Packets don’t lie. When in doubt, tcpdump at both ends.
  • DNS is nearly always involved. Test it early, test it explicitly.
  • “Connection timeout” and “Connection refused” mean different things — read the error carefully.
  • Empty Kubernetes endpoints cause more service outages than any other single issue.
  • The most recent change is usually the cause. Check your deployment history before spending 30 minutes with tools.

Master these tools first: ping, dig, ss, curl -v, nc, tcpdump. With these six, you can resolve 90% of production network issues. The rest — conntrack, nsenter, strace — are for the 10% of deep-dive investigations.

Network troubleshooting is not magic. It’s methodology, layered knowledge, and the right tools applied in the right order. Build that foundation, and production outages become problems to solve rather than fires to fight.


Written for DevOps Engineers, SREs, and Platform Engineers operating production Kubernetes and cloud infrastructure. All commands tested on Linux (Ubuntu 22.04, RHEL 8) and Kubernetes 1.28+.