Troubleshooting a network connectivity issue is quite challenging in a complex data center network carrying live production traffic.
Below are a few basic methods to identify a node in the traffic path that drops packets and causes connectivity issues. The procedure is generic and not tied to any specific vendor, since the exact troubleshooting options vary from vendor to vendor.
Let's consider the simple topology below for troubleshooting purposes:
R1(IP: 192.0.2.1) — R2 — R3 — R4(IP: 203.0.113.4)
Problem statement #1: 100% ping loss between R1 and R4, i.e., a ping from R1 to R4's interface IP fails completely.
- First, check whether R1 has a routing entry for R4's interface IP (advertised via any unicast routing protocol). If the route exists, check whether R1 has an ARP entry for the next hop toward R4; in this topology, the next hop is R2's interface IP on the link connected to R1. If the ARP entry is missing, the problem is with ARP resolution and we need to troubleshoot in that direction (using the 'debug arp' command and the extended ACL count option discussed below).
- If both the route and the ARP entry look good, check whether the entries seen in the show commands (the software copy) are programmed into R1's hardware CAM (content-addressable memory) and available to the CPU (IP kernel) so it can generate the ICMP ping traffic.
- If everything looks good on R1, check whether the ICMP ping packets are actually leaving R1's interface toward R2. One quick way is to check R1's output interface statistics counters, or to use the egress ACL count option on R1 (discussed later).
- Use the 'debug ip icmp' command on R1 and R4 to see whether ICMP packets are sent and received on both end nodes.
- If packets are going out of R1, we can mark R1 as clear and start looking at the intermediate nodes in the traffic path (R2 and R3) toward the final destination (R4).
- Traceroute can be used on R1 towards R4 to see where packets are getting dropped along the path. Note: traceroute only shows the unidirectional traffic path from R1 to R4. To check the unidirectional traffic path from R4 to R1, run traceroute on R4 towards R1.
- Once traceroute identifies a node that drops packets, repeat the procedures above on that specific node to check its forwarding entries in both software and hardware. Ingress and egress ACLs with appropriate permit statements and the count option help confirm whether packets with a specific SRC/DST IP pair are ingressing and egressing a given interface.
- PBR (policy-based routing) rules take precedence over routing table entries. Check whether any PBR rule is forwarding the traffic along some other path.
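The traceroute step above can be sketched as a small parser: given traceroute output, it reports the last hop that responded and the first hop that timed out, which usually brackets the node dropping packets. This is a minimal sketch assuming simplified Linux-style traceroute output; real output formats vary by platform and the hop addresses shown are hypothetical.

```python
import re

def last_responding_hop(traceroute_output: str):
    """Parse simplified traceroute output.

    Returns ((hop_number, ip) of the last responding hop, and the
    first hop number that answered only with '* * *', or None).
    """
    last_ip, first_silent = None, None
    for line in traceroute_output.strip().splitlines():
        m = re.match(r"\s*(\d+)\s+(.*)", line)
        if not m:
            continue  # skip the 'traceroute to ...' header line
        hop, rest = int(m.group(1)), m.group(2)
        ip = re.search(r"(\d{1,3}(?:\.\d{1,3}){3})", rest)
        if ip:
            last_ip = (hop, ip.group(1))
        elif first_silent is None:
            first_silent = hop  # this hop timed out ('* * *')
    return last_ip, first_silent

# Hypothetical output where the path goes silent after hop 2:
sample = """traceroute to target, 30 hops max
 1  10.0.12.2  0.412 ms  0.380 ms  0.401 ms
 2  10.0.23.3  0.801 ms  0.790 ms  0.812 ms
 3  * * *
"""
print(last_responding_hop(sample))  # ((2, '10.0.23.3'), 3)
```

In this example the drop sits on or just beyond the hop-2 router, so that node is where the per-node route/ARP/CAM checks above should continue.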
Problem statement #2: not 100% ping loss, but a few ping packets are dropped between R1 and R4. In this case, first make sure the L3 network is converged, i.e., there is no route flap that would cause transient packet loss, and that all intermediate routers have the appropriate forwarding entries in their show commands. To identify the problematic node, in addition to the methods above:
- Make sure the forwarding entries are programmed in hardware CAM on the source, the destination, and all intermediate nodes in the traffic path.
- Ping with a larger packet size (within the path MTU) and check whether the interface counters of all routers in the path count those packets from R1 to R4. Dell switches have interface counters for 'over 511-byte pkts' and 'over 1023-byte pkts', so these larger ICMP request packets would show up in those counters. The ICMP reply is the same size, so we should see similar statistics on the return path from R4 to R1. This method may not help with live production traffic, since all interface counters keep incrementing anyway; no conclusive decision can be made from interface counter values alone.
- Configure a test ACL with the count option to count the ICMP packets matching the SRC and DST IP addresses of the ping packets. A sample ACL would look like this:
Dell#show run acl
!
ip access-list extended test-R1R4
 seq 5 permit icmp host 192.0.2.1 host 203.0.113.4 count
 seq 10 permit icmp any any
 seq 15 permit ip any any   <<< permit to allow all other traffic
!
ip access-list extended test-R4R1
 seq 5 permit icmp host 203.0.113.4 host 192.0.2.1 count
 seq 10 permit icmp any any
 seq 15 permit ip any any
Dell#
Apply the 'test-R1R4' ACL on the interfaces in the traffic path from R1 to R4 to count the ICMP request packets, and similarly apply the 'test-R4R1' ACL on the interfaces in the return path from R4 to R1 to count the ICMP reply packets. Now initiate a ping from R1 to R4 with a specific count value ('Dell#ping 203.0.113.4 count 100') and check whether those ACLs count the ICMP packets. This should help identify the node that drops the ping packets.
Note: on Cisco devices, ACL match counters are enabled by default; there is no need to specify a 'count' option in the ACL as we do on Dell switches.
- If any port-channels are configured between the network nodes (routers), try eliminating port-channel members by shutting down each member interface one by one. This is intrusive in a production network, but it helps narrow down packet loss in corner cases where all members of a port-channel work fine except one that corrupts packets. All flows hashed to that single problematic member may get dropped, resulting in latency or packet retransmissions in the network.
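The ACL-counter method above boils down to comparing counter deltas along the path: after sending a known number of pings, the first node whose ACL match count increases by less than the number sent marks where the loss begins. A minimal sketch, assuming hypothetical counter readings (node names and values are illustrative, not from any real device):

```python
def find_drop_node(counts_before, counts_after, sent, path):
    """Compare ACL 'count' deltas along the path (ordered R1 -> R4).

    counts_before/counts_after map node name -> matched-packet count of
    the test ACL rule. Returns the first node whose delta falls short of
    the packets sent (loss is on that node or the link into it), or
    (None, sent) if every node counted all test packets.
    """
    for node in path:
        delta = counts_after[node] - counts_before[node]
        if delta < sent:
            return node, delta
    return None, sent

# Hypothetical readings around a 'ping ... count 100' test:
before = {"R2": 5000, "R3": 7100, "R4": 900}
after  = {"R2": 5100, "R3": 7160, "R4": 960}
print(find_drop_node(before, after, 100, ["R2", "R3", "R4"]))  # ('R3', 60)
```

Here R2 counted all 100 requests but R3 saw only 60, pointing at the R2-R3 link or R3 itself; the same comparison with the 'test-R4R1' ACL isolates loss on the return path.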
If none of the above methods are useful, we may need to configure port mirroring (SPAN) on all interfaces in the traffic path and deep dive into packet captures from all nodes to see which node drops packets. This is quite a time-consuming process.
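When comparing captures from every node, the usual trick is to track ICMP echo sequence numbers per node: the first adjacent pair where a sequence number seen upstream never appears downstream brackets the dropping node. A small sketch of that comparison, assuming the per-node sequence sets have already been extracted from the captures (the capture contents here are hypothetical):

```python
def missing_sequences(captures, path):
    """captures: node name -> set of ICMP echo sequence numbers seen in
    that node's SPAN capture. For each adjacent node pair along the path,
    report sequence numbers seen upstream but missing downstream."""
    gaps = {}
    for up, down in zip(path, path[1:]):
        lost = captures[up] - captures[down]
        if lost:
            gaps[(up, down)] = sorted(lost)
    return gaps

# Hypothetical capture summaries for 5 echo requests (seq 1-5):
caps = {
    "R1": {1, 2, 3, 4, 5},
    "R2": {1, 2, 3, 4, 5},
    "R3": {1, 2, 4, 5},   # seq 3 never made it past the R2-R3 segment
    "R4": {1, 2, 4, 5},
}
print(missing_sequences(caps, ["R1", "R2", "R3", "R4"]))
# {('R2', 'R3'): [3]}
```

The drop shows up between R2 and R3, so the deep dive can focus on that link and those two nodes instead of the whole path.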
With the above methods, I believe we can narrow things down to the point of isolating a problematic node in the network.
Hope this helps.