May 31, 2007

It’s Still the Latency, Stupid…pt.1

One concept that continues to elude many IT managers is the impact of latency on network design. Eleven years ago, Stuart Cheshire wrote a detailed analysis of the difference between the bandwidth and latency of ISP links. Over a decade later, his writing is still relevant. Latency, not bandwidth, is often the key to network speed (or the lack thereof).

I was reminded of Cheshire’s article and its underlying principles recently while working on an international WAN design. What Cheshire noted was that light signals pass through fibre optics at roughly 66% of the speed of light, or about 200*10^6 m/s (200,000 km/s). Regardless of the equipment or protocols you use, your data cannot exceed that theoretical limit. This speed, together with distance, sets a floor on the delay between when a packet is sent and when it is received, also known as latency.

In the US, we tend to focus on bandwidth and carrier technology when ordering circuits, completely ignoring latency. For instance, when choosing between cable and DSL for your house, do you ever ask the carrier for its latency SLA? Maybe you should. Over my cable connection, a ping to www.google.com in Mountain View, CA (137 KM from my house) yields an average ping time (aka round-trip time, or RTT) of 73ms. The theoretical latency for this distance (round trip) is 1.37ms, meaning my cable connection is roughly 50 times worse than the theoretical limit. No surprise that Comcast’s marketing focuses on bandwidth and not latency.
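Cheshire's speed-of-light floor makes this easy to check yourself. Here is a minimal sketch (Python, using the article's 200,000 km/s figure for propagation in fibre) of the best-case RTT for a given distance:

```python
# Best-case round-trip time over fibre, where signals travel at
# roughly 66% of the speed of light (about 200,000 km/s).
FIBRE_SPEED_KM_PER_S = 200_000

def min_rtt_ms(distance_km: float) -> float:
    """Theoretical minimum round-trip time in milliseconds
    for a given one-way distance."""
    return 2 * distance_km / FIBRE_SPEED_KM_PER_S * 1000

print(min_rtt_ms(137))    # ~1.37 ms -- the house-to-Mountain-View example
print(min_rtt_ms(5_000))  # ~50 ms   -- roughly coast-to-coast in the US
```

Anything your real pings show above this floor is queueing, equipment, and routing overhead, not physics.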

Cable and DSL circuits in the US are generally not business class and do not carry any service level agreement (SLA) on latency or availability. Businesses that use these circuits for business-critical services do so at their peril. Business circuits such as Frame Relay and MPLS do generally include latency SLAs, but understanding the difference between the SLA and your actual experience can have a massive impact on the performance of your network. For instance, let’s say a carrier advertises a 55ms round-trip SLA in the US. That figure is the maximum latency between any two points of presence (POPs) on the carrier’s network.

The coast-to-coast distance in the US is roughly 5,000KM (10,000KM round trip), for a theoretical round-trip latency of 50ms, so a 55ms RTT SLA is pretty good. But that doesn’t mean packets on your network will only take 55ms to cross the country. When designing your WAN you must also account for the latency added by your network equipment and your servers, and for the distance between the carrier’s POP and your offices. As a result, a well-designed US WAN will still experience 75-80ms ping times. A poorly designed WAN can experience much worse.

Now consider creating an international WAN. In this case, you will typically receive multiple SLAs from the carrier for different parts of the network. For instance, when designing an MPLS connection between California and the UK, the SLAs would be approximately 55ms within the US, plus 95ms to cross the Atlantic Ocean, plus 21ms to connect within the UK. Add the latency of your own network and you get ping times of 175ms to 200ms.

At this point you are probably asking yourself, “So what? Two tenths of a second is no big deal.” The answer is the impact of latency on TCP windowing. The Transmission Control Protocol (TCP) has a flow-control mechanism that senses the latency and bandwidth between two hosts and determines the rate at which data is transferred. The TCP window is the amount of unacknowledged data a sender can transmit before waiting for a TCP ACK. As latency increases, the TCP window shrinks, meaning the sender sends less data before waiting for an ACK. This limits the amount of data that must be retransmitted if a packet is lost. Smaller windows mean more packets, and more packets mean more overhead, because each packet carries a 40-byte TCP/IP header whether the payload is 1 byte or 1,460 bytes.
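To see why the window matters so much, note that a sender can have at most one window of unacknowledged data in flight per round trip, so throughput can never exceed window size divided by RTT, no matter how fat the pipe. A rough sketch (Python; the 64KB window and RTT values are illustrative assumptions):

```python
def max_throughput_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on TCP throughput: at most one window per round trip."""
    return window_bytes * 8 / (rtt_ms / 1000) / 1_000_000

# A default 64KB window on a US WAN vs. an international link:
print(max_throughput_mbps(65_535, 80))   # ~6.5 Mb/s at 80ms RTT
print(max_throughput_mbps(65_535, 200))  # ~2.6 Mb/s at 200ms RTT
```

Note that the ceiling is set entirely by window and RTT; a 45Mb/s circuit hits the same ~2.6 Mb/s wall as a T1 at 200ms until the window is enlarged.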

The result is what I call the “Sandbag Problem.” Let’s say the two of us are trying to fill sandbags. My job is to scoop sand into a container and hand the full container to you (data). Your job is to empty the container into a sandbag and hand the empty container (ACK) back to me. Occasionally you drop the container so I have to fill it again (Retransmit). If we were standing next to each other, the time it takes for me to hand the container to you, have you empty it, and hand it back to me (latency) would be very small. Now imagine there is a 6′ wall between us, and I need to hand the container over to you.

The wall changes several aspects of our filling operation. First, the container must be smaller because I cannot lift the same weight over my head that I can lift at waist level. Second, the time to complete one cycle increases because it takes longer to lift the container 6′ than 3′. Third, you would drop more containers, so retransmissions would increase. As the wall gets taller, the problem gets worse. If the wall were 10′ tall, we would be throwing containers instead of lifting them, so they would need to be even smaller. The containers would be traveling 20′ instead of 12′, so the delay would increase by roughly 67%. And we would need to send many more containers to move the same amount of sand.

TCP works just like the sandbag problem. As distance increases, the TCP window shrinks, the time between transmission and acknowledgement increases, and the number of packets required to move the data grows. One reason for this is the effect of TCP congestion-avoidance algorithms on the window size. The result is that the effective “speed” of the link falls off sharply as distance increases, regardless of bandwidth. RFC 1323, TCP Extensions for High Performance, provides mechanisms to deal with part of this problem. One method is to tune the TCP window on your hosts based on the Bandwidth Delay Product (BDP), where BDP = bandwidth x delay. Example: a 2.048Mb/s E1 link between California and the UK would have a BDP of 2.048Mb/s x 200ms = 51,200 bytes. This is the ideal TCP window to keep the pipe full so that the sender never sits idle waiting for ACK packets. Most hosts default to a 64KB TCP window, so in this scenario no adjustment would be needed. But if the connection were a 45Mb/s DS3, the BDP would be almost 1,100KB, and TCP windows would need to be enlarged to use the available bandwidth at peak efficiency.
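The BDP arithmetic above can be written out as a quick calculation (a Python sketch using the E1 and DS3 figures from the example):

```python
def bdp_bytes(bandwidth_bps: float, rtt_ms: float) -> float:
    """Bandwidth-delay product: the TCP window needed to keep the pipe full."""
    return bandwidth_bps * (rtt_ms / 1000) / 8

print(bdp_bytes(2_048_000, 200))   # 51,200 bytes: fits a 64KB default window
print(bdp_bytes(45_000_000, 200))  # 1,125,000 bytes (~1,100KB): needs tuning
```

If the BDP exceeds the host's configured window, the link will sit partly idle no matter how much bandwidth you buy.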

For most network applications, anything over 100ms of latency is noticeable to your end users. Time-sensitive applications such as VOIP or video teleconferencing suffer the worst when delay is introduced. Added to this is the impact of jitter. Jitter is the variation in delay between packets, often caused when packets travel alternative paths to the destination and either arrive out of order or at varying intervals. Applications such as e-mail that are bursty and not time sensitive do not feel the impact of latency to the same degree. How much of a problem is this for you today? One way to measure latency on your network is to use your carrier’s looking glass tools. A list of major looking glasses may be found at: http://www.traceroute.org/#Looking%20Glass.
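If a looking glass is not handy, you can also estimate RTT from your own host. This sketch times a TCP three-way handshake, which approximates one round trip; the host name in the comment is a placeholder, and the result includes your host's stack overhead, not just propagation delay:

```python
import socket
import time

def tcp_rtt_ms(host: str, port: int = 80, timeout: float = 3.0) -> float:
    """Approximate RTT by timing a TCP connect (the three-way handshake)."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection established: roughly one round trip has elapsed
    return (time.perf_counter() - start) * 1000

# Example (hypothetical target): print(tcp_rtt_ms("www.example.com"))
```

This is handy when ICMP ping is filtered, since it measures the same path an application's TCP session would take.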

When designing for latency in a WAN it is important to first understand the applications on the network. Also important is the carrier technology you will be using. Frame Relay behaves differently from Ethernet, and your choices should take distance and intended applications into consideration. After the applications have been profiled, steps can be taken to mitigate the impact of network delay. In part 2 of this article, we will discuss methods of designing for latency mitigation.

Thanks for stopping by.
If you found this article useful, please leave a tip.

50 Comments »

  1. Pete said,

    June 1, 2007 @ 3:26 pm

    Good article. Thanks.

  3. Jenny said,

    June 1, 2007 @ 3:35 pm

    excellent article!

    jenny
    http://www.spaml.com

  4. Bob Dylan said,

    June 1, 2007 @ 4:08 pm

    but if you’re sending a bigger chunk (bandwidth) then my download finishes earlier, which is less time which is faster internet (bandwidth).

  5. It's Still the Latency, Stupid... « News Coctail said,

    June 1, 2007 @ 4:14 pm

    [...] Still the Latency, Stupid… Filed under: Uncategorized — recar @ 11:06 pm It’s Still the Latency, Stupid… If you think bandwidth is the only thing affecting your network speed, think again. As pipes get [...]

  6. Eric said,

    June 1, 2007 @ 5:00 pm

    You are in violation of your Google Ads agreement by encouraging people to click on your ad links. Don’t let Google catch you…

    “Thanks for Stopping By.
    If you found this article useful, please leave a tip by clicking on an ad. “

  7. anjesh said,

    June 1, 2007 @ 5:05 pm

    I never realized that way – high BW is not necessarily high speed. Sandbag problem explains it all. Thanks.

  8. Test said,

    June 1, 2007 @ 5:05 pm

    Not bad, you are at least displaying an understanding of the issues.

  9. Eric said,

    June 1, 2007 @ 7:33 pm

    You said that an average ping to http://www.google.com (137 km from your house) took 73ms. I live on the East Coast of the US and when I ping http://www.google.com, I get an average of 19ms. Also, if I ping http://www.bbc.co.uk I get an average of 44ms. Here’s a picture: http://img478.imageshack.us/img478/7005/pingsup6.jpg . I, too, have Comcast as my provider. Am I missing something?

  10. bill said,

    June 1, 2007 @ 8:27 pm

    Eric,

    You are actually demonstrating one point from my article, and a point from the upcoming part 2. You and I are seeing different latencies because of distance and the lack of SLAs on the Comcast connection. When I ping the same Google IP from my Comcast connection, I get ping times of 95ms. So, point #1 is that latency is a function of distance and it has an impact on customer experience. Point #2, from the next article, is that one method of fighting latency is to create a distributed architecture that moves your services closer to your end users. This is what Google has done. The Google server you are accessing is close to you on the East Coast. That is why you ping it at 19ms and I ping it at 95ms.

    Thanks for taking the time to read and comment.

    -Bill

  11. bill said,

    June 1, 2007 @ 8:39 pm

    Eric (other Eric),

    It is absolutely NOT a violation of the Google Terms and Conditions to encourage my readers to click on an ad and visit my sponsors. The whole purpose of ads is to encourage this type of behavior. More specifically, the Google T&Cs read as follows:

    “You shall not, and shall not authorize or encourage any third party to: (i) directly or indirectly generate queries, Referral Events, or impressions of or clicks on any Ad, Link, Search Result, or Referral Button through any automated, deceptive, fraudulent or other invalid means, including but not limited to through repeated manual clicks, the use of robots or other automated query tools and/or computer generated search requests, and/or the unauthorized use of other search engine optimization services and/or software”

    I am not encouraging anyone to create “repeated manual clicks.” I merely ask that those who are kind enough to visit my site and read my articles support the sponsoring ads that make this site possible. I know most people are busy and often forget that ads are placed on a site not to annoy readers but to help offset the time and expense providing useful and original content entails. I make a concerted effort to visit the ads on sites I find useful and hope others will do the same for me.

    I greatly appreciate all my readers and hope they find the experience beneficial. I also appreciate Google and its ability both to drive traffic to my site and to help me offset some of the costs. Ultimately it is up to the sponsors to make the results of your visiting their site worth the $.05 they pay me for the click.

    Thank you for visiting and please come back to read part 2 of this article.

    -Bill

  12. Gary said,

    June 1, 2007 @ 9:44 pm

    You’re going to get banned either way if you keep it up lol. Google doesn’t even want you to mention the ads, let alone tell people to click them even once.

    encourage any third party to: (i) directly or indirectly generate queries

    you just said directly, generate one click.

  13. Eric said,

    June 1, 2007 @ 9:55 pm

    Bill,

    Thank you for clearing that up :)

  14. Nic said,

    June 1, 2007 @ 10:33 pm

    Bill,

    While it is not a violation of the AdSense ToC, please take time to peruse this quotation from the Program Policies.

    “In order to ensure a good experience for users and advertisers, publishers may not request that users click the ads on their sites or rely on deceptive implementation methods to obtain clicks. Publishers participating in the AdSense program:

    * May not encourage users to click the Google ads by using phrases such as “click the ads,” “support us,” “visit these links,” or other similar language.”

    According to this, requesting users to make a click is against the policy.

    - Nic

  15. bill said,

    June 1, 2007 @ 10:35 pm

    Hey, if they ban me, they ban me. There’s always commission junction, ad brite, link exchange, etc. I’m lucky if I make enough off all of them to cover my hosting fees in any given month, so I’m not too concerned. I maintain that their T&Cs only prevent me from generating queries, clicks or referrals through “automated, deceptive, fraudulent or other invalid means…” Nothing automated, deceptive or fraudulent about asking people to support your sponsors. Not using robots, automated query tools, or computer generated search requests. I’m just one poor schmuck pimping his wares.

    If you really want to support me, buy one of the books I recommend off my amazon link, or use my Amazon search box to find and buy something else. I make more money off Amazon than anything else…still not enough to fund my own book habit though. :P

    -Bill

  16. Sam said,

    June 1, 2007 @ 11:39 pm

    Am i missing something? You say theoretical speed through fiber optics is 200*10^6 m/s. That’s 200*10^3 kilometers/s, or 200,000 km/sec. Then you make the completely reasonable assumption that the USA is 5,000 km across. Then you somehow say that that means that the theoretical latency is therefore 50ms.

    To find the theoretical latency, you then just take the distance (5,000km) and divide it by your given theoretical speed (200,000km/sec). The distances cancel, leaving you with seconds as your unit. Switch that to ms, and you have what you want–the latency.

    5,000km / 200,000km/s = .025s = 25ms

    25 ms seems like the correct theoretical latency to me, which means saying “a 55ms RTT SLA is pretty good” would actually not be true at all, given that it’s over twice the theoretical speed.

    Unless I’m missing something, the math stated in this article is incorrect, which really serves to damage its credibility.

    - Sam

  17. bill said,

    June 2, 2007 @ 12:20 am

    Nic,

    I re-read the Google policies and changed my pages to read “please visit our sponsors.” It’s walking a fine line, but I don’t specify which ads to click. Google and its advertisers should want people to click the ads. Hopefully, they won’t ban me but if they do, c’est la vie.

  18. bill said,

    June 2, 2007 @ 1:02 am

    Sam,

    Latency is measured in round-trip time. The packet has to go there, and the acknowledgement has to come back. Google Maps reports the distance from San Francisco to New York as 2,907 miles or 4,678 kilometers. Fiber doesn’t follow a straight path, so I rounded up to 5,000km. Round trip, it is 10,000km. Hence 50ms as the theoretical limit. Sprint is advertising a North American SLA on MPLS of 55ms between any 2 POPs. This means they have 10% overhead on the theoretical limit. Thanks for reading and commenting.

    -Bill

  19. 8Man said,

    June 2, 2007 @ 2:29 am

    I’m not impressed with my DSL ISP – a province-wide telco. They do what I call “Fedex-style” routing. Instead of mesh-routing or shortest-path routing, they route the traffic from every smaller town back to their central NOC and back out again. The valley I live in has 7 towns about 20 miles apart with a fiber point-to-point backbone connecting them.

    The Telco uses the point-to-point fiber for voice traffic, but a tracert from town A to adjacent town B shows it gets routed to the big city NOC 300 miles away and back, so instead of a 20 mi. direct route, it takes at least 600 mi. to get 20 mi. up the valley.

    Why do they do this? Maybe they can’t afford BGP-capable routers? They can’t afford enough knowledgeable techs to configure a mesh network? They have to do it this way (bring it all back to one central monitoring point) to comply with CALEA-type requirements?

    I thought part of the rationale for the Internet was to minimize single-point-of-failure situations. This ‘Fedex-style’ routing can’t be good for latency either..

    Any logical reason why Telco ISP does their network this way?

  20. Mark said,

    June 2, 2007 @ 3:22 am

    Nice write up!

    Living in Hawaii and working for an ISP, the concept of the “big long pipe” is an everyday concern for us. Best case latency for us to the mainland is usually around 60ms first hop with an average of 150 to 200ms to actual servers.

    With latency in this range a windows user can’t even benefit from an internet connection over 3mbs (give or take) without doing some tweaking.

    Anyway good intro write up to the problem. I love the analogy, I think I’ll steal it.

  21. pat said,

    June 2, 2007 @ 5:24 am

    What about the latency of gigabit Ethernet network cards? I worry a lot about latencies in my very fast NFS (Network File System) setup. Gigabit is enough bandwidth, but most disk reads are small and many occur at once. I have very low access times on SCSI disks and a huge (over 6GB) memory disk cache. Usually when you want to buy a network interface card, you don’t get a latency parameter in the product description. I’d like to know which cards are better for this job and how I should search for them.

  22. Ozh said,

    June 2, 2007 @ 10:11 am

    Off topic : the “If you found this article useful, please visit our sponsors.” right above your Adsense ads will some day get you in trouble :) (read: violate Adsense TOS)

  23. André said,

    June 2, 2007 @ 1:17 pm

    Thank you for a great article. It was very educational. You’re in my bookmarks!

  24. Mark said,

    June 2, 2007 @ 5:40 pm

    Hey Pat,

    Serialization delay, or serialization latency, is the amount of time it takes for a packet to be transmitted onto the physical medium. This delay is determined by the size of the packet and the rate of your physical interface.

    Serialization delay is only a concern on links below a T1 (for the most part). Anything faster is generally quick enough never to cause noticeable delay. On your gigabit Ethernet link the serialization delay is fixed: it takes only 0.012ms to transmit a 1500-byte packet. The reason you never hear anyone advertising a NIC’s serialization latency is that it is fixed.

    Also, the topic of this article is TCP windowing, and for all practical purposes TCP sessions run from NIC to NIC. Even though gigabit latency is better than most modern hard drives, it’s irrelevant to this discussion, which is about the impact of what’s called propagation delay on a TCP session. Unless you are dealing with a SAN and have to worry about large server-to-server data transfers over direct gigabit links, don’t worry about how your hard drives perform compared to your NIC.

  25. Roman Gaufman said,

    June 2, 2007 @ 6:12 pm

    I’m a fairly junior sysadmin/network admin, but this doesn’t really cover much or explain it correctly. The article makes it look like the further away the destination, the bigger the effect on bandwidth, when in reality that is rarely the case if the TCP window size is large enough.

    I wrote a little guide on network performance tuning that covers I think most reasons for low bandwidth or high latency – the guide can be found here: http://hackeron.dyndns.org/hackeron/trac.cgi/wiki/Linux%20Network%20Performance%20Tuning

  26. Mark said,

    June 2, 2007 @ 10:59 pm

    Roman, your link and his are saying the same thing. I wouldn’t be surprised if his next article is about tweaking a windows PC the same way yours discusses tweaking a linux box.

  27. bill said,

    June 2, 2007 @ 11:07 pm

    Roman,

    Thanks for the great link. This first article was discussing the limitations of TCP on “big long pipes” as Mark put it. The next article will focus on what to do about it. Although much of what I will cover is network and/or infrastructure solutions, I also plan to cover host tweaks in most major operating systems. I will be sure to link back to your page. Thanks for reading.

    -Bill

  28. Anon said,

    June 4, 2007 @ 11:27 am

    We get away with a much smaller pipe than we would otherwise need: we have an appliance that caches, at the block level, all data sent over the WAN, and only ever sends each block once. Most of the time, it just sends references to the other side.

    It’s kinda cool, because if an employee accesses a file on a server over the WAN, changes a bunch of stuff, and then emails it to someone else back at the other side, the appliance sends only references to the blocks that made up the unchanged parts of the file.

  29. fragglet said,

    June 4, 2007 @ 2:19 pm

    Hi,

    I’ve noticed some fundamental flaws in your understanding of how TCP works. I’ve put a full response on my blog:

    http://fragglet.livejournal.com/11924.html

  30. bill said,

    June 4, 2007 @ 3:09 pm

    Fragglet,

    Thanks for linking to my article. You are correct that TCP will, to use your analogy, add more trucks, up to a point. How many trucks TCP will support depends on the host OSes on each end, many of which still default to a 64KB TCP window. You are incorrect in assuming this only applies to 10Gb/s links; on a long-distance WAN, latency can have an impact even on T1s. You are also incorrect in assuming that the TCP window will not shrink due to congestion-control algorithms. Most high-latency connections will also experience an increase in packet loss, and when packets are lost, the congestion algorithm will decrease the congestion window.

    While my sandbag analogy is not perfect, it does describe a fairly complex concept in language that a non-technical person can understand. As distance increases, it takes longer for the packet to travel round trip (the wall) and in some cases the TCP window (the container) shrinks. I hope you will check back in for the 2nd part in the series. In that article I will discuss what to do about latency. This includes tweaking the host TCP stack to increase the “number of trucks” as well as using network accelerators.

    Thanks,

    -Bill

  31. fragglet said,

    June 4, 2007 @ 5:16 pm

    > Thanks for linking to my article. You are correct that TCP will, to use
    > your analogy, add more trucks, up to a point. How many trucks TCP will
    > support depends on the host OSs on each end

    This is incorrect. The congestion control algorithms run on the sending side, not the destination. It is the behaviour of the congestion control algorithms of the sending OS that determines the TCP window size.

    > Most high-latent connections will also experience an increase in packet
    > loss. When packets are lost, the congestion algorithm will decrease the
    > congestion window.

    This is the normal behaviour of the congestion control algorithms. Furthermore, you’re making the flawed assumption that latency causes packet loss, which is not true. Latency and packet loss are both symptoms of network congestion, caused by bandwidth being maxed out at a router. To understand why this is the case, you have to think about how routers work. Packets arrive at a router and are put into a queue. They get transferred over some form of link and retransmitted onto another network.

    In an ideal situation, the queue has at most one packet stored in it. If packets arrive faster than the bandwidth of the link between the networks (or the bandwidth of the networks themselves), the queue backs up, as packets are held, waiting for the next one to be retransmitted. It’s kind of like 30 people all trying to get onto a bus at once. You get latency because packets are being held in a queue.

    In the extreme situation, packets get lost because the queue is a limited size (you can’t keep queueing packets forever), so after a while any more incoming packets simply get dropped. There are other reasons for dropped packets, but they basically all involve your network hardware being broken. Network congestion due to lack of bandwidth is the main cause of packet loss. This is why it’s used by the congestion control algorithms as a signal to reduce the transmit rate.

    I seriously suggest you go and read Jacobson’s original paper on congestion avoidance [http://ee.lbl.gov/papers/congavoid.pdf], as it explains the problems of congestion avoidance from first principles and how the TCP Reno algorithms help solve these.

  32. bill said,

    June 4, 2007 @ 5:49 pm

    Fragglet,

    I appreciate your comments and your interest. 3 things:

    1) Most applications these days, but certainly not all, are bi-directional. Sometimes I’m the sender and sometimes I’m the receiver. That’s why I said the host on each side. Since the premise of this article is a WAN design, where sometimes clients are sending data and sometimes they are receiving it, I need to be aware of the limitations at both ends.

    2) I did not say that latency causes packet loss; I said there was a correlation between the two. I will drop more packets on my trans-atlantic MPLS circuit than on my point-to-point link between two locations in California, all other things being equal.

    3) Even if the window stays the same size, it still takes longer for a complete round trip transaction to occur over a highly latent connection. This is the whole reason RFC 1323 exists!

    Obviously your contention is that high-latency networks should not have a problem because TCP will magically deal with the issue. The simple fact is that this is not the case. That is why companies design around this problem with CDNs and network accelerators. I hope you’ll check out and link to part 2 tomorrow.

    Thanks,

    -Bill

  33. fragglet said,

    June 4, 2007 @ 6:22 pm

    > 1) Most applications these days, but certainly not all, are
    > bi-directional. Sometimes I’m the sender and sometimes I’m the receiver.
    > That’s why I said the host on each side. Since the premise of this article
    > is a WAN design, where sometimes clients are sending data and sometimes
    > they are receiving it, I need to be aware of the limitations at both ends.

    Although you’re right in that most network protocols are bi-directional (eg. HTTP), congestion control only takes effect when the congestion window is reached. In a typical download over HTTP, the client making the request will not reach the congestion window size. The congestion control algorithms on the client are therefore irrelevant. It’s the server’s algorithms that matter, because it’s the one sending lots of data and hitting the congestion window ceiling.

    > 2) I did not say that latency causes packet loss; I said there was a
    > correlation between the two. I will drop more packets on my trans-atlantic
    > MPLS circuit than on my point-to-point link between two locations in
    > California, all other things being equal.

    Correlation does not equal causation! As I explained, high latencies and lost packets are both symptoms of network congestion. What is the solution to network congestion? …. add more bandwidth!

    > 3) Even if the window stays the same size, it still takes longer for a
    > complete round trip transaction to occur over a highly latent connection.
    > This is the whole reason RFC 1323 exists!

    Actually, no. Read the introduction to RFC1323 that explains the reasons for its existence:

    The introduction of fiber optics is resulting in ever-higher
    transmission speeds, and the fastest paths are moving out of the
    domain for which TCP was originally engineered.

    > Obviously your contention is that high-latency networks should not have a
    > problem because TCP will magically deal with the issue. The simple fact is
    > that this is not the case. That is why companies design around this
    > problem with CDNs and network accelerators.

    No, this is not what I am saying. You are saying that latency causes network problems and that by improving latency you can improve your network. I assert that this is false. If you have latency problems, they are a symptom of network congestion. If your network is suffering from serious congestion, it probably needs more bandwidth.

  34. bill said,

    June 4, 2007 @ 7:40 pm

    Fragglet,

    >”You are saying that latency causes network problems and that by improving latency you can improve your network. I assert that this is false. If you have latency problems, they are a symptom of network congestion. If your network is suffering from serious congestion, it probably needs more bandwidth.”

    Wow. It is impressive how someone can miss the point so completely so many times. While network congestion will add to latency, latency is in and of itself a problem. In a network with zero congestion, latency will still be a problem. The problem is distance. More bandwidth cannot improve upon the speed of light. Sorry. This is the whole point of my article. Latency does cause issues unrelated to bandwidth or congestion. Those issues can be reduced with planning.

    Thanks for commenting.

    -Bill

  35. rs said,

    June 5, 2007 @ 11:17 am

    8man,

    you didn’t provide any details of which ISP you’re using, but yes, a naive CALEA implementation could have everything routed the way you observe though not likely. most of that data is collected local to the node as gathering it on an aggregate interface downstream is harder to do due to the data rates involved.

    it’s hard to speculate without seeing traceroutes to understand some of the topology involved. given what you said, it sounds like the majority of devices in their network have little more than static routes to the next router and have no real IGP awareness, or worse, have only one path through the network to other nodes.

    it could be that their implementation has all remote nodes as circuits back to their ‘central’ office like a traditional backhauled dial network. in general seeing all traffic go through one node like that is indicative of a lean network with no other paths. this is fairly common in smaller isps as they can not afford the infrastructure as yet to allow for multiple exits from each pop to their core(s). in some cases this backhauling results in said traffic going via a ‘scenic’ route. these cases can be financially and politically driven at times as well.

    finding out why it is this way would require getting to know your ISP’s network engineers and noc. while they may not be able to share all the details, you could gain a better understanding of some of it. in some cases it may be simple oversight and misconfiguration, as they’re human and make mistakes too.

  36. rs said,

    June 5, 2007 @ 11:30 am

    > Obviously your contention is that high-latency networks should not have a problem because TCP will magically deal with the issue. The simple fact is that this is not the case. That is why companies design around this problem with CDNs and network accelerators. I hope you’ll check out and link to part 2 tomorrow.

    CDNs came about because content networks were interested in solving a problem on their own that ISPs and NSPs should be solving but are not for financial and political reasons. the manner in which they did this was to eliminate intermediate networks altogether and introduce faux localization. this is neither here nor there on topic.

    a high latency network should experience no more issues than a low latency network. however, as more outstanding data will exist on a high latency network, the risk is bigger when something does become a problem.

    simply saying “add more bandwidth” is oversimplified as well. too many admins default to this when in reality you should first understand the cause of a problem instead of blaming the symptoms.

    latency isn’t a problem unto itself. neither is bandwidth. more often than not, your assumptions about what your network is actually doing are the problem.

  37. Mark said,

    June 6, 2007 @ 3:49 am

    You guys really aren’t getting it.

    Bill is talking about the impact caused to TCP by networks with high, fixed latency.

    Here is another example besides Bill’s. I work for an ISP in Hawaii. We use an OC48 to get to the mainland. It has a fixed propagation delay of around 60ms RTT because of the distance. On the other hand, I can set up a T1 line in my lab and get 5ms of latency off it.
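    Back of the envelope, the fixed part of that delay follows directly from the speed of light in fibre (~66% of c, or 2×10^8 m/s, per Bill’s article). The ~3,900 km California-to-Hawaii distance used below is an assumption for illustration; actual cable routing adds to it:

    ```python
    # Sketch: estimating the fixed propagation delay over fibre.
    # The 3,900 km distance is an assumed figure for illustration.

    FIBRE_SPEED_M_S = 2.0e8  # light in fibre: roughly 66% of c

    def propagation_rtt_ms(distance_km, route_factor=1.0):
        """Round-trip propagation delay in milliseconds.

        route_factor > 1.0 models cable paths longer than the
        great-circle distance.
        """
        one_way_s = (distance_km * 1000 * route_factor) / FIBRE_SPEED_M_S
        return 2 * one_way_s * 1000

    print(propagation_rtt_ms(3900))  # ~39 ms: the theoretical floor
    ```

    The observed ~60ms RTT sits above that ~39ms floor because of equipment delay and non-ideal cable routing, but no amount of bandwidth brings it below the floor.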

    It’s this fixed latency that can wreak havoc on a TCP session (as Bill explained), regardless of the size of the connection.

    For example, a Windows PC that uses the default receive buffers will not be able to take advantage of high-speed connections in the 7 Mb/s and up range if there is a lot of latency. The default RWIN value on a Windows-based PC peaks out around 2.6 Mb/s on a path with 200 ms RTT. Only by tweaking the registry can a Windows PC take advantage of RFC 1323 mechanisms. By doing some tweaking you can alleviate some of the performance issues that high-latency connections cause for TCP.
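    The ceiling Mark describes falls out of simple arithmetic: TCP can have at most one receive window of data in flight per round trip. A minimal sketch, assuming the classic pre-window-scaling default RWIN of 65,535 bytes:

    ```python
    # Sketch of the TCP window/latency throughput ceiling.
    # 65535 bytes is the largest window expressible without
    # RFC 1323 window scaling.

    def max_throughput_mbps(rwin_bytes, rtt_ms):
        """At most one window can be in flight per round trip."""
        return (rwin_bytes * 8) / (rtt_ms / 1000) / 1e6

    print(max_throughput_mbps(65535, 200))  # ~2.6 Mb/s at 200 ms RTT
    print(max_throughput_mbps(65535, 5))    # ~105 Mb/s at 5 ms RTT
    ```

    At 200 ms the same host that could fill a fast LAN link is capped below 3 Mb/s, no matter how big the pipe is.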

    On the other hand, if you have a Mac or Linux (supposedly Vista too) based PC, you probably won’t have to do any tweaking, as they generally have window scaling, SACK, and a higher RWIN value enabled by default.

    This subject is very near and dear to me as I am currently evaluating, in the lab, our 7 and 11 Mb DSL offering. One of the things I am testing is the results of website-based speed tests to a local server set up with a speed test application. To this directly connected server I see

  38. Mark said,

    June 6, 2007 @ 3:55 am

    |Continued from above|

    …..around 3.5 to 4 Mb/s download speeds on an 11 Mb DSL circuit compared to seeing exactly 11 Mb (+/- 300 kb) when the latency is

  39. Mark said,

    June 6, 2007 @ 3:56 am

    (stupid carrot sign)

    less than 1 ms. I’m not a rocket scientist, but it looks to me like latency does have an impact.

    If you guys still don’t get it, google “long fat networks”.

  40. Munich Unix » Blog Archive » It’s Still the Latency, Stupid… said,

    October 29, 2007 @ 7:21 am

    [...] read more | digg story [...]

  41. Josh Betz » It’s still Latency said,

    May 25, 2008 @ 10:19 am

    [...] edgeblog » It’s Still the Latency, Stupid…pt.1. Here’s a great article about how your problems with network speed may have more to do with latency than bandwidth. [...]

  42. Dirk said,

    July 27, 2009 @ 2:13 am

    Thanks for a well-written article. I am about to give a training course on application profiling (with a specific tool), and one of the topics is the impact of latency on application performance. I would like to refer students to this article.
    Bandwidth is becoming more available now in South Africa (still more expensive than in the US or Europe), but we too sit with the “distance problem”. (Hey, it’s a big country.) They often ask why applications in Cape Town are slower than in Durban with the same link speed.
    And the deeper into Africa you get, the more prevalent the latency issue becomes: fewer POPs and poor connectivity, so sometimes you have to use satellite….

    Kind Regards

    Dirk

  43. Billy Guthrie said,

    March 8, 2010 @ 8:07 pm

    > “You are saying that latency causes network problems and that by improving latency you can improve your network. I assert that this is false. If you have latency problems, they are a symptom of network congestion. If your network is suffering from serious congestion, it probably needs more bandwidth.”

    hahahaha! That is classic (needs more bandwidth.)

    Bill, great article!

  44. Sektormedia: Brain Ramblings » Blog Archive » Great Latency Discussion said,

    October 25, 2010 @ 7:41 pm

    [...] has a great article entitled It’s Still the Latency, Stupid…pt.1. A great thing for junior admins and other folks to read to get a good understanding of the Latency [...]

  45. Cloud Performance – Why Long Distance Relationships Don’t Work « Apparent Networks said,

    November 4, 2010 @ 6:48 am

    [...] has a dramatic impact on the maximum achievable performance you will see from the cloud. As this example details the latency on a network connection can be a bigger factor in the end to end performance, [...]

  46. KdV said,

    November 8, 2010 @ 8:17 am

    I like your articles about the latency problem Sir. They are clear and informative, thank you!

    Greetings from the Netherlands,
    Kees

  47. Brandon said,

    December 13, 2011 @ 2:20 pm

    I’m not sure that I agree with your conclusion in this article. As latency increases, window size should increase, not decrease. Congestion windows are based on packet loss, not latency. You have an advertised window, which is equal to the buffer space on the receiving host, and then a congestion window that is cut in half every time there is a lost segment. Latency doesn’t come into play for either of the windows, except that if you have a high-latency network, you should increase those windows.

    To use your sandbag analogy, if you are going over a wall, you add another person (or hop) because the distance increased. That now means that to go at the same speed, you need two containers (increased window size). The fact that latency increased won’t affect your actual speed as long as you appropriately manage your buffers and windows. That is, however, only the case when you are sending lots of data one way; if you are doing something real-time (ssh, gaming), then latency is king.
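    The window Brandon is describing is the bandwidth-delay product: the bytes that must be in flight to keep the pipe full. A rough sketch, reusing the 11 Mb/s DSL figure from Mark’s tests for illustration:

    ```python
    # Sketch: bandwidth-delay product, the window needed to fill a link.
    # Higher latency demands a *larger* window, not a smaller one.

    def required_window_bytes(bandwidth_mbps, rtt_ms):
        """Bytes in flight needed to keep the pipe full."""
        return int(bandwidth_mbps * 1e6 / 8 * (rtt_ms / 1000))

    print(required_window_bytes(11, 1))   # 11 Mb/s at 1 ms: 1,375 bytes
    print(required_window_bytes(11, 60))  # 11 Mb/s at 60 ms: 82,500 bytes
    ```

    Note that 82,500 bytes already exceeds the 65,535-byte maximum an unscaled TCP window can advertise, which is exactly why tuning (window scaling) matters on such a path, and why it only helps bulk transfer and not interactive traffic.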

  50. Herbert Valle said,

    April 6, 2013 @ 1:17 am

    Good points on the availability, bandwidth, and latency.
