traceroute, ICMP, internetworking, the Wild Wild West?

This blog post is adapted from a Facebook comment in an IT support group, where someone asked why there were asterisks in their traceroute output and whether something was wrong with their uplink. I figured I would drop a little learning, if you all are willing to join me in the magical adventure that is using traceroute to observe/orient and, if necessary, ultimately troubleshoot IPv4 internetworks.

It's well known to experts (though frequently misunderstood by laypersons) that many carrier hops don't answer traceroute probes with ICMP replies, for various reasons. Mostly it is because the carrier is aggregating/multiplexing (muxing) so many links that it would be very resource-intensive to respond to everyone trying to traceroute through the egress of an entire metropolitan or geographical region if they all decided to romp on it at once.

ICMP is great! I enable it on all of my networks, whether LAN or WAN. However, I do not operate at anything like the scale of carriers, which is what would necessitate silently dropping ICMP probes. I find that there really is no security advantage in 'obscurity', as a whois or port scan will yield details about the owner/Point of Contact (POC) and whether or not a particular host is alive anyhow - unless it has a very restrictive Access Control List (ACL) which prevents that. I learned in my younger years, with a more aggressive security stance, that blocking all ICMP simply makes it way more difficult to figure out what is going wrong when trouble rears its ugly head. All of my cloud and on-premises stuff has allow icmp 0.0.0.0/0 specifically to aid in troubleshooting my own connectivity to the Provider Edge (PE), whether that is CenturyLink (Lumen), Comcast, AWS, T-Mobile, or any other carrier/provider.

The reasoning for denying ICMP that I described earlier is tied to what carrier lingo calls 'control plane policing' (CoPP): the router rate-limits or drops traffic that its own CPU would have to handle, such as generating ICMP replies to traceroute probes. On a link with that much scale/multiplexing/aggregation, if everyone traversing it were to traceroute at the same time, answering them all would ultimately tax the resources of or saturate the device:

Traceroute | Network | Lumen help
Traceroute is used by networking professionals to discover the path network traffic takes. Learn more about traceroute and how it can help you troubleshoot issues with your service.
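
To make the rate-limiting idea concrete, here is a minimal sketch in Python of a token bucket, one common way to police traffic. It is not how any particular vendor implements CoPP; the class name, the rate, and the burst size are all assumptions picked for illustration.

```python
class TokenBucket:
    """Toy policer: allows `rate` replies per second, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: int) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # time of the previous probe, in seconds

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the last probe, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True   # the router spends CPU on an ICMP reply
        return False      # the probe is ignored; traceroute shows '*'


def simulate(probes: int, interval: float, rate: float, burst: int) -> str:
    """Send `probes` probes spaced `interval` seconds apart; 'R' = reply, '*' = silence."""
    bucket = TokenBucket(rate, burst)
    return "".join("R" if bucket.allow(i * interval) else "*" for i in range(probes))


if __name__ == "__main__":
    # Hypothetical hop: answers 10 replies/s with a burst of 2.
    print("busy link :", simulate(probes=10, interval=0.01, rate=10.0, burst=2))
    print("quiet link:", simulate(probes=5, interval=1.0, rate=10.0, burst=2))
```

The same hop looks healthy to a lone, slow prober and looks like a wall of asterisks when probes arrive faster than the policer allows, even though forwarded traffic is untouched in both cases.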


What does traceroute actually tell us? It gives us forward path discovery and relies unabashedly upon the premise that the internet is a friendly and generous place (as it was when ICMP was envisioned), where everyone you greet with a hello and a smile returns a friendly greeting and a smile. I'm not sure if you've ever been to Brooklyn or LA, but that doesn't fly there... the same circumstances surround a carrier egress link! Specifically, one might have hundreds of gigabit circuits all aggregated to one peering point, and it would be silly to say hello to every person who gets on the subway in New York! Even if you had the capacity to do it, would it really serve a purpose? It is 'bad netiquette' to drop ICMP, but you don't have to follow those rules at the scale of Ma Bell or Tier 1/2 providers.

Understanding the traceroute output | Lumen help
Learn more about traceroute and how to read the output to help you analyze traffic flow.
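
The forward path discovery described above can be sketched as a small simulation: send probes with TTL 1, 2, 3 and so on, and note which device (if any) answers when the TTL runs out. This is a toy model in Python, not a real traceroute; the hop names and the `answers` flag are made up for illustration.

```python
from typing import List, NamedTuple, Optional


class Hop(NamedTuple):
    name: str
    answers: bool  # does this device bother to send an ICMP reply?


# Hypothetical path: two carrier devices in the middle stay silent.
EXAMPLE_PATH = [
    Hop("home-router", True),
    Hop("isp-edge", True),
    Hop("carrier-agg", False),
    Hop("carrier-core", False),
    Hop("destination", True),
]


def trace(path: List[Hop], max_ttl: int = 30) -> List[Optional[str]]:
    """Return what a traceroute-like tool would learn, one entry per TTL."""
    seen: List[Optional[str]] = []
    for ttl in range(1, min(max_ttl, len(path)) + 1):
        hop = path[ttl - 1]  # the probe's TTL reaches zero at this device
        seen.append(hop.name if hop.answers else None)  # None is shown as '*'
    return seen


if __name__ == "__main__":
    for ttl, name in enumerate(trace(EXAMPLE_PATH), start=1):
        print(f"{ttl:2d}  {name or '* * *'}")
```

Note that the probes with TTL 5 still reach the destination: the silent devices forwarded them just fine, they simply did not say hello back.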

For our purposes of learning about traceroute as a troubleshooting tool, as long as the traffic finds the next hop, a routing table decision, or the 'gateway of last resort', it doesn't really matter whether a device that we don't own or control tells us its name and where it is... the important part is the latency, which you can usually work out from the other hops even if you don't get that data from one of the layer 3 hops itself. Unless you get back an ICMP unreachable, everything is still going 'according to plan' and there is no reason to panic. Unless it's at the Disco!

UNEXPECTED BRENDON URIE!!!

Another important thing to note: routes may be asymmetrical, dynamic, load-balanced, or simply chosen by a router's own reachability metrics, BGP peers, or other parameters that, again, are completely under the carrier's control. I can't tell you what to do in your organization any more than you can tell me what to do in mine - but we both can send email to each other without any specific interventions by either of us.

The internet is dynamic, robust, scalable, self-healing, and sometimes - just catawampus - but as long as our traffic gets where it needs to go, the route that it takes beyond our own AS (Autonomous System - if we even possess one; most of the time we lose all control beyond the CE [Customer Edge]) is under our sphere of influence about as much as the subway or metro being on time is... which is a rhetorical way of saying it's not at all, in any way lol:

IP routing symmetry/asymmetry | Lumen help
Learn more about traceroute and important function of IP routing; routing isn’t always symmetrical and, in the case of the Internet, it’s often asymmetrical.

Now that we understand these concepts in a little more detail, let's look at an actual mock-up of the sorts of things that happen in carrier networks where ICMP does not matter as long as the packets flow and we aren't getting ICMP unreachable or timeouts to our destination address or RSTs/windowing at layer 4:

Timeouts and unreachables in traceroute | Lumen help
Learn more about traceroutes and how it handles timeouts and unreachables.
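
As a mock-up of that situation, here is a hedged Python sketch that reads traceroute-style output and decides whether there is anything to worry about. The sample trace is invented (it uses documentation-only address ranges), the line format is assumed to be the common "hop, address, three RTTs" layout, and `parse` and `verdict` are hypothetical helper names.

```python
from typing import List, Optional, Tuple

# Invented sample: hops 3 and 4 never answer, yet the destination is reached.
SAMPLE = """\
 1  192.0.2.1  1.2 ms  1.1 ms  1.3 ms
 2  198.51.100.1  9.8 ms  10.1 ms  9.9 ms
 3  * * *
 4  * * *
 5  203.0.113.10  22.4 ms  22.1 ms  22.9 ms
"""


def parse(output: str) -> List[Tuple[int, Optional[str]]]:
    """Return (hop number, responding address or None) for each hop line."""
    hops: List[Tuple[int, Optional[str]]] = []
    for line in output.splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # skip headers and blank lines
        # The first field that is not '*' is taken to be the responder's address.
        address = next((f for f in fields[1:] if f != "*"), None)
        hops.append((int(fields[0]), address))
    return hops


def verdict(hops: List[Tuple[int, Optional[str]]], destination: str) -> str:
    """Asterisks only matter if the trace never arrives at the destination."""
    silent = [number for number, address in hops if address is None]
    if hops and hops[-1][1] == destination:
        return f"reached {destination}; silent hops {silent} are harmless"
    return f"did not reach {destination}; investigate from the last responding hop"


if __name__ == "__main__":
    print(verdict(parse(SAMPLE), "203.0.113.10"))
```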

Even when there is no issue with connectivity to a node or endpoint and we have ICMP telling us who everybody is, there can still be latency or errors introduced by layer 1/2 issues or link saturation. When we are fortunate, traceroute will let us narrow the root cause down to at least which particular hop is causing these difficulties:

When traceroute is useful | Lumen help
Learn how traceroute can be helpful in diagnosing issues with your network traffic.
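
One way to hunt for the offending hop, sketched in Python under some assumptions: a latency increase only counts if it persists through every later hop, since a single slow hop followed by fast ones usually just means that router was slow to generate its reply. The function name, the 20 ms threshold, and the sample numbers are all illustrative.

```python
from typing import List, Optional


def first_persistent_jump(
    rtts_ms: List[Optional[float]], threshold_ms: float = 20.0
) -> Optional[int]:
    """Return the 1-based hop where latency rises and stays high, else None.

    `rtts_ms` holds one round-trip time per hop; None marks a silent hop.
    """
    known = [(i + 1, rtt) for i, rtt in enumerate(rtts_ms) if rtt is not None]
    for idx in range(1, len(known)):
        hop_number = known[idx][0]
        baseline = known[idx - 1][1]            # previous responding hop
        later = [rtt for _, rtt in known[idx:]]  # this hop and everything after
        if min(later) - baseline > threshold_ms:
            # If silent hops sit just before this one, the real culprit
            # could be any of them; this is only the first visible evidence.
            return hop_number
    return None


if __name__ == "__main__":
    spike_only = [1.0, 9.0, None, 85.0, 12.0, 13.0]   # hop 4 is slow to reply, nothing more
    real_jump = [1.0, 9.0, None, 60.0, 62.0, 61.0]    # latency stays high from hop 4 on
    print(first_persistent_jump(spike_only))
    print(first_persistent_jump(real_jump))
```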

There also are times where this information is not useful at all because of obfuscation/asymmetry or 'carrier magic' like MPLS (which I always enjoy calling layer 2.5 - I'll save that for a later TED talk lol) or even gasp "network gremlins" which actually can prevent us from determining which hop even is introducing the latency:

When traceroute isn’t useful | Network | Lumen help
Learn when traceroute can mislead you in diagnosing traffic issues.

Now you can get some really fun stuff when multipath is a factor: probes and their 'TTL exceeded' messages get sent separate ways, and neither side can see the whole story. That doesn't mean it is unexpected behavior - just repeat after me like a mantra - 'after my Customer Edge (CE) - I have zero control over the traffic!' and we will all get through this like sensible adults:

How traceroute works | Lumen help
As a foundation to using traceroute to diagnose issues with your traffic flow, learn more about how traceroute works.
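
The multipath effect can be illustrated with a small Python sketch of per-flow load balancing. Real routers use their own vendor-specific hash functions; SHA-256 here is only a stand-in, and the next-hop names and source port are invented. The one detail taken from real life is that classic UDP traceroute bumps the destination port (starting at 33434) for each probe, which makes every probe look like a different flow.

```python
import hashlib
from typing import List, Tuple

# (source IP, destination IP, protocol number, source port, destination port)
Flow = Tuple[str, str, int, int, int]


def pick_next_hop(flow: Flow, next_hops: List[str]) -> str:
    """Hash the flow's 5-tuple and use it to choose one of the equal-cost next hops."""
    digest = hashlib.sha256(repr(flow).encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


def probe_paths(src: str, dst: str, probes: int, next_hops: List[str]) -> List[str]:
    """Model UDP probes whose destination port increases by one each time."""
    base_port = 33434  # traditional starting port for UDP traceroute
    return [
        pick_next_hop((src, dst, 17, 40000, base_port + i), next_hops)
        for i in range(probes)
    ]


if __name__ == "__main__":
    hops = probe_paths("192.0.2.10", "203.0.113.10", 6, ["core-a", "core-b"])
    for i, hop in enumerate(hops):
        print(f"probe {i}: via {hop}")
```

A normal TCP session keeps the same 5-tuple and therefore sticks to one path, which is why a trace that appears to bounce between devices does not necessarily mean your actual traffic does.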

I hope my going into this kind of detail on a random internet post is helpful and informative for some folks, and that they read the linked content. Lumen, with all of their billions of dollars, probably conscripted some wonk like me in a lab for a couple of days to crank this stuff out, just so they could put it on the front page of their (as of today) 14bn valuation company lol.

Also kind of hoping they don't redesign their site and kill all this content, because after writing all of this I don't have the capacity to rip all of that.