In 1987, virtually every network admin could get an email address for the admin of pretty much any other network, worldwide, with a couple of phone calls or a whois lookup. That meant it was practical to troubleshoot other people's networks, which are often where these errors are seen. Nowadays? Forget about it. Hah. Wow. No way.
I literally did this less than 5 years ago using traceroute. I was having connectivity issues between my home network and my server cabinet in Hurricane Electric fmt2. mtr was showing packets being dropped on the AT&T side of the connection between AT&T and HE. I let HE know about this issue, they reached out to AT&T, and the issue was fixed within the hour.
I think this article is over the top. The basic point is valid, that you need to know a network well to be able to be sure you understand what traceroute is saying. But even with an unfamiliar network, traceroute can be helpful, and with experience it’s usually fairly clear when traceroute is broken for one of the reasons outlined in the article. And it’s one debugging tool among many that you need to bring to bear to diagnose a problem.
Anyway, here are a few points that occurred to me. I didn’t read the whole thing in detail so there are probably other points I could make, but I don’t think such a wild lengthy rant is worth the effort.
There are a few RFCs that list network tools and describe how they work. RFC 1147 (1990) covers traceroute as do several of its successors.
The RFC 1393 traceroute option has an appalling amplification factor. I bet it would have been disabled along with source routing and broadcast ping in the 1990s when smurf attacks became a problem.
The author doubts that network engineers or network vendors care about traceroute, but in my experience they do care. Earlier in the article is an example of traceroute not working well with MPLS, which does indeed happen, but the network vendors have implemented hacks that reach inside MPLS in order to decrement the encapsulated IP TTL and send ICMP time exceeded when it reaches zero, partly to make traceroute work. (See for example this from Cisco which is just the first link I found on the topic.)
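For what it's worth, on Cisco IOS this behaviour hinges on MPLS TTL propagation: when it's enabled (the default), the ingress LSR copies the IP TTL into the MPLS label TTL, so TTL expiry inside the LSP still generates ICMP Time Exceeded and the core hops show up in traceroute. A sketch of the relevant knobs (syntax from memory; check your platform's command reference):

```
! Default: copy the IP TTL into the MPLS label TTL at the ingress LSR,
! so TTL expiry inside the LSP still yields ICMP Time Exceeded
mpls ip propagate-ttl

! To hide the MPLS core from customer traceroutes instead:
no mpls ip propagate-ttl
```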
It’s been years since I’ve had a traceroute solve my problem. Some key insights, maybe - but even then it’s more of reading tea leaves. My network days are long gone and now I agree with the article more than ever, it’s only getting more complicated and specialized. I am not an administrator anymore and this is not my main subject.
I do agree with you that one could have framed it a little better.
I’m glad I read enough of this article to understand how traceroute was cleverly implemented - but the rest seems to be a game of semantics. Just because there is no RFC, and just because routers aren’t shipped with a “Traceroute(TM) Compatible” feature does not negate the many years and many cases this tool was able to somewhat reliably trace the route a packet took.
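The clever implementation amounts to one loop: send probes with TTL 1, 2, 3, and so on, and let each router that decrements the TTL to zero identify itself via ICMP Time Exceeded. A toy simulation of that loop (no real sockets; the router names are invented for illustration):

```python
# Toy simulation of the classic traceroute loop: no real network,
# just a made-up path of routers that decrement TTL like real ones do.

PATH = ["gw.local", "isp-edge", "isp-core", "peer-ix", "dest-host"]  # hypothetical

def send_probe(ttl):
    """Walk the simulated path, decrementing TTL at each hop.

    Returns ("time-exceeded", router) if the TTL hits zero in transit,
    or ("arrived", destination) if the probe reaches the final hop.
    """
    for i, router in enumerate(PATH):
        ttl -= 1
        if ttl == 0 and i < len(PATH) - 1:
            return ("time-exceeded", router)  # the router reports itself
    return ("arrived", PATH[-1])

def traceroute():
    hops = []
    for ttl in range(1, 64):       # increasing TTL, one probe per hop
        kind, who = send_probe(ttl)
        hops.append(who)
        if kind == "arrived":
            break
    return hops

print(traceroute())  # each hop revealed itself via a Time Exceeded reply
```

In the real tool, of course, any hop is free to stay silent, which is where the article's objections come in.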
This does seem a bit over the top. MPLS does not negate the utility of a traceroute, because traceroute never claimed to work below the IP level, and if you've actually worked in a service provider network (I just started at my third) it comes in pretty handy. You still have to use your brain, and possibly relationships, to navigate any network path. But seeing basic layer 3 pathing is a bread-and-butter skill in any multihomed network, and you don't need to be a genius to decipher useful information about which providers a particular traceroute is crossing, because those hops are guaranteed to be layer 3 adjacent.
By that measure it feels that half of internet, networking and computing isn’t real.
And to be fair, over my career I've noticed how many people don't know how even their core field of expertise works. Sometimes it's covered up with wrong assumptions, and that might be fine for their whole career.
Sometimes these people even wrote books on that topic.
That being said, I think it's really good to write about these things. At least for me it's sometimes hard to even detect that something is common knowledge, especially when one would think that the lack of that particular knowledge would not allow certain systems, projects, or companies to run as well as they do.
But on the other hand I am fairly often surprised by the bugs and outages companies have. Post-mortems can be a big eye opener about what is and isn't common knowledge, and good for learning what knowledge the industry at large, and oneself, is missing.
It probably has a lot to do with how and where people learned about a field.
A few weeks ago I stumbled onto an article about Traceroute entitled “Traceroute Isn’t Real” which was reasonably entertaining while also managing to be incorrect or misleading in many places.
Lack of response from a node means nothing. Even if all the nodes past a certain point aren’t responding, that also means nothing. If the nodes have high latency, that also means nothing. If they respond on some probes and not others, that also means nothing.
The above statements speak to the core limitations of classic traceroute. This is something I've tried to find ways to improve upon in my own traceroute-like tool (Trippy): how can we make the output from traceroute easier for users to interpret?
Some things which can help:
Show the user a per-hop status to indicate that the lack of response from any non-target hop is not an issue (I've also heard it argued that the packet loss % column should not be shown at all by default)
Show the user forward and backward loss heuristics to help them filter out loss that does not “carry forward”
Allow the user to manipulate (via Dublin/Paris for UDP) ECMP flows to remove a large class of false positives
The article does not discuss ECMP at all, nor does it touch on either the Paris or Dublin traceroute protocols, which should be table stakes for doing traceroute these days.
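The ECMP problem is easy to demonstrate: routers commonly pick among equal-cost paths by hashing the flow 5-tuple. Classic traceroute varies the destination port per probe, so successive probes can take different paths; Paris/Dublin-style traceroute keeps the hashed fields fixed. A toy model (the hash function, addresses, and path names are invented for illustration):

```python
# Toy ECMP model: a router hashes the flow 5-tuple to pick one of
# two equal-cost next hops. Values and next-hop names are invented.

NEXT_HOPS = ["path-a", "path-b"]

def ecmp_pick(src, dst, proto, sport, dport):
    # Real routers use a vendor-specific hash; Python's hash() stands in.
    return NEXT_HOPS[hash((src, dst, proto, sport, dport)) % len(NEXT_HOPS)]

# Classic traceroute: the destination port changes per probe, so the
# flow key changes and probes can be sprayed across both paths.
classic = {ecmp_pick("1.2.3.4", "5.6.7.8", "udp", 40000, 33434 + i)
           for i in range(10)}

# Paris-style: keep every hashed field fixed so all probes follow one
# path (probe sequencing is carried elsewhere, e.g. in the checksum).
paris = {ecmp_pick("1.2.3.4", "5.6.7.8", "udp", 40000, 33434)
         for _ in range(10)}

print(len(paris))    # always 1: a single consistent path
print(len(classic))  # often 2: probes split across both paths
```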
This “tool” works purely at layer 3, so it’s impossible for it to adapt to the sort of “layer 12-dimensional-chess” type shenanigan that MPLS does
Whilst not a perfect solution, it is common to use ICMP multi-part messages (extension objects) to work around this, and these are widely supported by tools.
A lot of people think traceroute sends pings - this is an option, but never the default behavior AFAIK.
Sending ICMP EchoRequest packets is the default behaviour for (at least) MTR, my own tool trippy, and tracert on Windows. However, I would typically recommend using UDP/Dublin instead.
It’s not a special kind of ICMP message, or a UDP or TCP packet that uses a defined port. You cannot “permit traceroute” in a firewall, because it has no standard characteristics.
Whilst this is true, by convention port 33434 is often left open for UDP/TCP specifically to facilitate traceroute, and it makes a useful default destination port.
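For reference, classic UDP traceroute starts at destination port 33434 and bumps the port once per probe, which is why traceroute-friendly firewalls tend to open a small range above 33434 rather than a single port. A sketch of the resulting port sequence (assuming the traditional per-probe increment and the usual 3 probes per hop):

```python
# Destination ports used by classic UDP traceroute: start at 33434
# and increment once per probe (assuming 3 probes per hop).
BASE_PORT = 33434
PROBES_PER_HOP = 3

def dest_port(hop, probe):
    """Port for the given 1-based hop and 0-based probe index."""
    return BASE_PORT + (hop - 1) * PROBES_PER_HOP + probe

print(dest_port(1, 0))  # 33434: first probe of the first hop
print(dest_port(2, 0))  # 33437: first probe of the second hop
```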
Serious question: what kind of tools in the toolkit do people have for debugging “thing seems to not be going from point A to point B in a network”? For me I have lost so much time in the past with consumer network stuff where it’s not clear at what level something is disappearing at
Assuming unix network stack and each device being under your control: tcpdump to listen on interfaces and netcat/ping to send test data. With this it’s straightforward to inspect various points along the route to determine if traffic is flowing or not.
I wouldn’t contest the premise of the article, but I did find a nice “edge case” where traceroute actually helps during my bachelor thesis: observing how the HAMnet (amateur radio wireless IP network spanning central Europe) adapts to node outages. I implemented my own little TTL-based traceroute that periodically traced paths between the ~20 nodes I had distributed to helpful HAMs by mail. IIRC I even managed to match some routing changes to weather events!
Of course, this only worked because the vast majority of nodes are MikroTik routers (per my SNMP survey), and they implement TTL-Exceeded cleanly. So this is more of an exception that proves the rule, as the article mentions.
One of my favorite fairly-obscure tools that just seem to not have made it to today’s world is Dan Kaminsky’s paratrace, sending tcp keepalives with adjusted TTLs, causing the routers most likely to be involved in your existing tcp connections to tell you they were involved. It’s a brilliant bit of punching through layers of abstraction.
The popular mtr tool can do this with the -T flag!

That's different: mtr -T uses TCP SYN (and doesn't complete the handshake), but paratrace uses TCP keepalives within an established TCP connection.

This is pretty cool, but doesn't address the point of the blog post, does it? Regardless of whether the packet was within your existing TCP stream or not, nobody's forcing a router to reply with an exceeded TTL, just like described in the blog post, right?
Correct. It only forces the machines that would answer along the path to give a slightly more precise answer.
Holy shit I had no idea how traceroute worked under the hood.
I’m afraid I have come away with more, not less, respect for the tool having read the article. Not sure that was the author’s intent :)
Great article!
Interesting follow-up to this article:
https://systemsapproach.org/2024/12/09/three-packets-walk-into-a-tunnel/
Would MTR get around most of the stuff in this? Just curious
No, it works fundamentally the same as other traceroute implementations and has broadly the same limitations.
Serious question: what tools do people keep in the toolkit for debugging "thing seems to not be going from point A to point B in a network"? I have lost so much time in the past with consumer network stuff where it's not clear at what level something is disappearing.
Assuming a Unix network stack and each device being under your control: tcpdump to listen on interfaces, and netcat/ping to send test data. With this it's straightforward to inspect various points along the route and determine whether traffic is flowing.
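The send-and-inspect half of that approach can be sketched in a few lines of Python over loopback (a stand-in for netcat running on two real hosts, with tcpdump watching the wire in between):

```python
import socket

# Stand-in for "netcat on two hosts": a UDP listener and a sender on
# loopback. On a real network you would run the two halves on the two
# endpoints and watch interfaces in between with tcpdump.
recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv.bind(("127.0.0.1", 0))      # let the kernel pick a free port
recv.settimeout(2)
port = recv.getsockname()[1]

send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send.sendto(b"probe", ("127.0.0.1", port))

data, addr = recv.recvfrom(1024)
print(data)                      # b'probe' if traffic made it end to end
recv.close()
send.close()
```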
Traceroute is totally fine for the first couple hops in a consumer network.