1. 8

  2. 2

    This person seems to have much more experience than I do with TB-scale transfers over WAN, but is this really the case?: “Not to belabor the point, but if you send a serial stream of TCP packets (FTP, rsync or almost any of the protocols mentioned in this doc), the rate at which you can send them, receive verification, and send another decreases as the ping time increases.”

    Plain TCP has a maximum window size of 64KB, so if the bandwidth-delay product is large (a fast connection over a WAN), you will hit per-TCP-connection throughput limits (which parallel TCP connections would mitigate), but that should be addressed by the window scale option: https://en.wikipedia.org/wiki/TCP_window_scale_option
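    To make the ceiling concrete, a quick back-of-the-envelope sketch (the window and RTT figures are illustrative, not from any real link):

    ```python
    # Max throughput of a single TCP connection is bounded by window / RTT:
    # the sender can have at most one window of data in flight per round trip.
    def max_throughput_mbps(window_bytes: float, rtt_ms: float) -> float:
        """Per-connection throughput ceiling in Mbit/s."""
        return window_bytes * 8 / (rtt_ms / 1000) / 1e6

    # A plain 64 KiB window over a 100 ms WAN path caps out around 5 Mbit/s,
    # no matter how fat the pipe is:
    print(max_throughput_mbps(64 * 1024, 100))   # ~5.2 Mbit/s

    # With window scaling, a ~12.5 MB window can fill a 1 Gbps, 100 ms path:
    print(max_throughput_mbps(12.5e6, 100))      # ~1000 Mbit/s
    ```

    This is why window scaling (or, failing that, parallel connections) matters for fast links with long RTTs.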

    Is the problem that in today’s internet this option works poorly? (I’ve seen large window problems on connections to a remote data centre which were going via some oldish NetScaler systems about 3-4 years ago - is it still a problem in today’s internet in general?)

    1. 4

      At least in my experience, WAN transfers were rarely problematic due to RTT[1] and the other problems covered by the article; instead, my team spent its time tackling temporal changes in network behaviour. These came in one of two main forms:

      • anything from 100% packet loss to 0.1% and then after a few minutes/hours it returned back to 0%
      • your 10/100/1000Mbps or whatever connection drops to less than 100kbps throughput between endpoints…and then later returns to normal

      Calling ‘support’ is simply not reliable or timely, as the problem mostly occurred at one of the multiple hops between your endpoints, with a transit provider you had no business relationship with.

      This was the hard problem for us when transferring data over the WAN. My team got quite good at solving it, which typically boiled down to route diversity and tooling to quickly separate the good routes from the bad ones. With this information you could effectively bypass the local routing policies and bounce off alternative systems you leased, so that you got the network characteristics you needed to operate; creating an A->C->B route, or even A->C->D->E->B.
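      A sketch of what the route-selection step might look like; the relay names and probe numbers below are made up for illustration, standing in for real loss/latency measurements:

      ```python
      # Pick the best path A->relay->B from probe results.
      def score(loss_pct: float, rtt_ms: float) -> float:
          """Lower is better: loss dominates heavily, latency breaks ties."""
          return loss_pct * 1000 + rtt_ms

      def best_route(probes: dict) -> str:
          """probes maps route name -> (loss_pct, rtt_ms); 'direct' is a candidate too."""
          return min(probes, key=lambda r: score(*probes[r]))

      probes = {
          "direct":  (5.0, 80.0),    # the troubled default path
          "vps-fra": (0.0, 120.0),   # bounce via a leased VPS
          "vps-ams": (0.1, 95.0),
      }
      print(best_route(probes))  # → vps-fra
      ```

      Note the weighting: during an "internet weather" event, a lossless but slower relay beats a fast direct path that is dropping packets.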

      When I left, we were working on automating this, and last I heard my team ended up putting together an IPIP (or GRE) mesh and wrapping the lot with babeld.

      [1] the problems solved by the tools in the article work around the unavailability of any system/network tuning, whether due to lack of skill, not knowing who to contact, or the contact’s unwillingness to soak time into your problem

      1. 2

        Yep, “internet weather” is a thing. It’s great that you were able to choose your own routes; most don’t have that option. Nice solution!

        I’ll say that RTTs do matter even once you’ve taken care of stability. Retransmits are more costly than longer latency, but latency makes every retransmit slower, and nets with higher latency generally also suffer from poor reliability, leading to retransmits (a vicious cycle).

        I’ve found that latency is usually a decent metric for the quality of a link spanning multiple hops, so it’s worth optimizing for.
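        The loss/latency interaction described above is captured by the classic Mathis approximation for TCP throughput, BW ≈ MSS / (RTT · √p). A rough sketch (the simplified form, dropping the ~1 constant factor; MSS, RTT and loss figures are illustrative):

        ```python
        import math

        def mathis_throughput_mbps(mss_bytes: float, rtt_ms: float, loss_rate: float) -> float:
            """Approximate loss-limited TCP throughput: MSS / (RTT * sqrt(p)), in Mbit/s."""
            bytes_per_sec = mss_bytes / ((rtt_ms / 1000) * math.sqrt(loss_rate))
            return bytes_per_sec * 8 / 1e6

        # The same 0.1% loss hurts far more on a 200 ms path than on a 20 ms one:
        print(round(mathis_throughput_mbps(1460, 20, 0.001), 1))   # → 18.5
        print(round(mathis_throughput_mbps(1460, 200, 0.001), 1))  # → 1.8
        ```

        Throughput falls linearly with RTT at a fixed loss rate, which is why latency remains a decent single-number proxy for link quality.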

        1. 4

          What situations are you thinking of where you cannot control the routing?

          Maybe you assumed we were doing something fancier, but our solution was just a dumb bunch of supplier-diverse ~$10/month VPS systems with 2TB/month bandwidth limits. :)

          On a technical level it only involved dropping an IPIP (or GRE) tunnel between your main nodes and your VPS ‘hops’, and, as and when needed, amending the local routing table on system A to instruct it to reach system B via the tunnel.
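          A minimal sketch of that setup with iproute2; every address, prefix and interface name here is a placeholder, not from the original poster’s network:

          ```shell
          # On system A: IPIP tunnel to the leased relay C (use "mode gre" for GRE).
          ip tunnel add tun-c mode ipip local 198.51.100.10 remote 203.0.113.20
          ip addr add 10.9.0.1/30 dev tun-c
          ip link set tun-c up

          # Steer traffic for B's network through the relay instead of the default route:
          ip route add 192.0.2.0/24 via 10.9.0.2 dev tun-c

          # Back out when the direct path recovers:
          # ip route del 192.0.2.0/24
          ```

          The relay C needs the mirror-image tunnel plus IP forwarding enabled; chaining the same trick relay-to-relay gives the A->C->D->E->B paths mentioned upthread.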

          Although it made the sysadmin in me quietly cringe, the whole thing was made easier through ad-hoc changes to the /etc/hosts file :)

          Cheap as chips and very effective.