Written by Jan Otte, Tuesday 11 December 2018
So, you heard that 6.2.0 has something called load balancing, right?
Let's have a look at it while we still keep baking the release.
As mentioned in our recent webinar (where we made a mistake, corrected by this article, see below), load balancing is a generic term. It describes a situation where there are multiple requests to do something and, at the same time, there may be multiple resources available to handle the requests.
The load balancing mechanism is actually used in two different situations:
Server load balancing happens when you have a group of servers able to serve the same kind of requests (e.g. a web-server farm). Because there are many clients consuming the HTTP service, you put a server load balancer in front of the servers. The requests come to the load balancer, whose only job is to balance them among the available servers based on their load. It does not serve the actual requests with content. While this is a straightforward task, there are still good and bad load balancers out there. Anyway, this is not today's topic, as our router is NOT a server load balancer.
Link load balancing happens when there are multiple outgoing links from a device providing Internet connectivity to a LAN. The device, being a link load balancer, must split the LAN connections between the available outgoing links. That's the beast to tame here - or at least to talk about from a safe distance :-)
Link load balancing is a much tougher discipline than server load balancing. Why? Because when doing server balancing, the load balancer knows the state of the endpoints (servers) perfectly, including the number of open connections to every server (note that each server is usually connected by precisely one line - nothing shared), and it may even incorporate a semi-automatic feedback mechanism directly from the servers.
When balancing links, the available information is much vaguer. The status of a link is known only from the local point of view - until you get a ping response or a ping timeout, you cannot say. And even when the ping from the nearest neighbor on the outgoing link returns, it says nothing about the rest of the path...
Wait, what path? There are usually multiple outgoing connections with multiple targets, so there can be many different paths open on one interface at once!
But we need to start from the start. Consider a trivial example with only two connections active:
First, let's say there are only these two (TCP) connections, each one using a different outgoing link. The first link is half saturated (50% of bandwidth used up), its queue is empty, the RTT rocks. The second link is far from saturated (10%), but its queue is piling up and the RTT is poor.
Note that the second connection being poor may affect your opinion of the link itself - while the link could be in very good shape! The problem here can be with the far connection endpoint, and that would show only after some time (some packets are sent out fast, but then it starts to get worse).
So, how does the router know which link should be used for a third connection request which just came in?
I'll let you think for a while...
What? So how the heck do you load balance the links?
A rule of thumb says that to decide which interface to use for balancing, one needs to know the shape of the link beforehand. Yet there is no other way of telling the shape of a link than actually using it.
If we combine the two sentences above, we get to the most important point: load balancing is about balancing a load. When the device is not loaded, we don't talk about load balancing. While this may seem trivial, it is an important thing to remember.
Another rule of thumb is that the basic unit for a link load balancer is a connection. I probably cannot stress it any better than to repeat it once more: a connection. Looking at the most used Internet traffic, a connection is a TCP connection. Looking at the second most used traffic, a UDP connection is actually one UDP packet, as UDP itself has no concept of a connection.
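For balancing purposes, a connection is commonly identified by its 5-tuple (protocol, source address/port, destination address/port). A minimal sketch of that idea, assuming nothing about the router's actual internals - the names and addresses here are purely illustrative:

```python
# Hypothetical flow key: every packet sharing the same 5-tuple belongs
# to the same connection, so the balancer must keep sending it over
# the same outgoing link.
from collections import namedtuple

Connection = namedtuple("Connection", "proto src_ip src_port dst_ip dst_port")

def flow_key(proto, src_ip, src_port, dst_ip, dst_port):
    """Build the identifier the balancer keys its decisions on."""
    return Connection(proto, src_ip, src_port, dst_ip, dst_port)

a = flow_key("tcp", "192.168.1.10", 51234, "93.184.216.34", 443)
b = flow_key("tcp", "192.168.1.10", 51234, "93.184.216.34", 443)
assert a == b   # same 5-tuple -> same connection -> same link
```

A new source port would produce a different key, and that new "connection" is what the balancer is free to place on a different link.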
If we put the above together, we get: a link load balancer balances a load by distributing connections over the available links - and it only makes sense when there is actually a load to distribute.
Returning to the example above, we see it is not a good example for a link load balancer - there are simply too few connections to balance.
Let's look at a better-fitted example - something to actually load balance.
Let's have a router with three outgoing links and 30 connections. While the number may seem a bit high, it is actually not that much. Almost every web page shown in a browser requires a number of connections to be open - some for DNS, another for the web server, and additional ones for pictures sourced from other sites, for advertisements, for statistics and analytics, and potentially others as well. Also, a number of requests are initiated from a PC without the user's knowledge - directly by the operating system or by applications running in the background.
We can classify TCP connections as short-lived and long-lived, but for the time being, let's just remember this difference exists.
In reality, there may be many new connections opened and many closed on a router every second. Therefore, the connection is a good unit to base the load balancing on - in the usual case.
Now, remembering the not-so-good example above - the connections may go to a fast, responsive endpoint or to a slow, unresponsive one. If there are many connections, it does not matter - statistically, the connections get split over the outgoing links equally. Note the word statistically.
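You can see the "statistically" at work with a tiny simulation - assuming, for the sake of illustration, that each new connection is simply assigned to a link uniformly at random (the real algorithm is smarter, see below):

```python
# With many connections, even a blind uniform pick spreads the count
# roughly evenly over the links.  Numbers here are invented.
import random

random.seed(1)                        # deterministic for the demo
links = ["eth1", "wlan0", "usb0"]
counts = {link: 0 for link in links}

for _ in range(30_000):               # 30 000 connections, not 3
    counts[random.choice(links)] += 1

for link, n in counts.items():
    print(link, n)                    # each count lands near 10 000
```

With only two or three connections, the same draw can easily put everything on one link - which is exactly why the trivial two-connection example above is a poor fit for load balancing.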
As you see, the world of requests is dynamic. There are connections coming and going. On the other side, the state of an interface is dynamic as well. The Internet connection speed and shape can change very quickly. All in all, we need a dynamic algorithm to perform in this dynamic environment.
If you were listening to our webinar, you may remember that I mentioned that interface weights change over time based on link status and congestion, and that the initial values are only a starting point. That was correct. What was not correct was the name of the algorithm that is used. We do not use Weighted Round Robin (that one does not change weights dynamically) but CUBIC. The algorithm's proper name is CUBIC TCP; see the links below the article.
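For the curious: the heart of CUBIC (as specified in RFC 8312) is its window growth function - after a congestion event, the window grows as a cubic function of the time elapsed since that event. The sketch below shows just that curve with the standard constants; how exactly the router folds CUBIC into its interface weights is not shown here:

```python
# CUBIC window growth (RFC 8312): W(t) = C * (t - K)^3 + W_max,
# where K is the time needed to climb back to W_max after a loss.
W_max = 100.0   # window size just before the last congestion event
C     = 0.4     # standard CUBIC scaling constant
beta  = 0.7     # standard multiplicative decrease factor

K = (W_max * (1 - beta) / C) ** (1 / 3)

def cubic_window(t):
    """Congestion window t seconds after the loss event."""
    return C * (t - K) ** 3 + W_max

print(round(cubic_window(0), 1))   # 70.0  -> starts at W_max * beta
print(round(cubic_window(K), 1))   # 100.0 -> back at W_max
```

The curve is flat near W_max (cautious probing around the previous operating point) and steep far from it - the kind of self-adjusting behavior that a fixed Weighted Round Robin simply does not have.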
But let's describe it from the beginning, as there may be some readers who were not present at our webinar.
Looking at the router configuration, we have a few possibilities for choosing how the WANs (outgoing paths) are handled. The typical scenario is to have one active WAN (and a number of backup ones), but that is not interesting to us at the moment. Another scenario is to have one primary WAN for outgoing traffic (a default WAN) and a secondary in listen-only mode. That is still not a case of load balancing. The third mode we support is link load balancing - the router will balance the connections between the available WANs.
When choosing the WANs, you may provide a weight for any WAN.
The weight for a WAN (interface) is an initial value for the algorithm. You typically assign a higher weight to a faster interface so that it is preferred for new connections. Actually, it is not a preference setting, it is a weight. Let's explain it a bit:
Given the latter example, we have 30 connections and three links. Let's say the initial weights were 10:5:1 and the interfaces were eth1:wlan0:usb0 (ethernet, wifi, cellular). All links are in good shape and the router has been running for some time already. Currently, there are 16 connections over ethernet, 8 over wifi and 6 over cellular.
Why? Because the math done by the algorithm says so. CUBIC TCP is not a simple algorithm and it evaluates several things.
Now a new connection comes in. The ethernet seems to be the best fitted at the moment (as evaluated by the algorithm), so the new connection is established via the ethernet.
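A heavily simplified sketch of a weight-aware pick - this is NOT the router's actual algorithm (which also folds in link status and congestion dynamics), just an illustration of how a weight differs from a fixed preference. The counts are illustrative:

```python
# Hypothetical rule: pick the link whose current connection count is
# smallest RELATIVE to its weight.  A weight of 10 vs 1 thus means
# "can carry ~10x the connections", not "always wins".
weights = {"eth1": 10, "wlan0": 5, "usb0": 1}
active  = {"eth1": 16, "wlan0": 8, "usb0": 6}   # current connections

def pick_link():
    # eth1: 16/10 = 1.6, wlan0: 8/5 = 1.6, usb0: 6/1 = 6.0
    return min(weights, key=lambda link: active[link] / weights[link])

chosen = pick_link()
active[chosen] += 1
print(chosen)   # eth1 - the least loaded relative to its weight
```

Note that usb0, despite carrying the fewest connections in absolute terms, is by far the most loaded relative to its weight - so a pure "preference" reading of the numbers would get this wrong.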
Happy with the example? Yes? No? Actually it does not matter.
Because in the meantime, 10 seconds went by, 40 new connections were initiated and 42 were closed, so it is no longer relevant.
So why are we talking about all of this?
I am trying to explain that in the usual scenario, the environment is so dynamic that it simply does not make sense to look at one connection. All that makes sense is to balance the load.
How does the router (or any algorithm) try to balance the load?
Simply put, by keeping all of the links congested/loaded at the same level. It really just looks at the status of the links and selects the least loaded/congested one for the new connection coming in. Nothing else makes sense.
Now we are getting close to another rule of thumb - link load balancing is for balancing the load of the links. While it may seem at first glance that there is nothing new in that sentence, you need to go deeper. I will help you: it may happen that you actually do not want to balance the link load.
Let's look at some specific scenario:
In this scenario, you do not want to balance the links. Why? It is about priorities. What is your priority? Using most of your available bandwidth, or keeping the SCADA connection at the best possible rate? I bet it is the latter, so the solution is simple - configure your equipment to fit your needs.
Remember that a link load balancer balances connections. It does not know which connection is more important and which is less important. It actually cannot know. If all of your connections are equal, link load balancing is good for you. If they are not equal, it is not a good mode for your use case.
Remember the note about long-lived and short-lived TCP connections? This is another important point of view. At the time a connection is established, there is no information regarding the longevity of a TCP connection. All connections are counted as equal. Now consider that in our cellular routers, the nature of the interfaces may vary a lot. While the cellular link may be fine for a long time, the situation can change - e.g. an outage of the nearest BTS may force the cellular chip to migrate to a BTS with a much worse signal, and the interface parameters may change.
In general, link load balancing handles this very well - for new connections. Over time, connections get closed and new ones are opened on other interfaces. The statistical model works perfectly. Not for long-lived TCP connections, though. If a long-lived TCP connection was established on cellular while the interface was in good shape and the conditions are now worse, the connection will starve. Note there is no way of migrating an existing connection to another link.
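The "no migration" point can be captured in a few lines - a hypothetical flow table, not the router's actual code: an established flow keeps the link it was born on, no matter what the balancer currently considers best.

```python
# Hypothetical flow table: 5-tuple -> outgoing link.  Only flows the
# table has never seen get the balancer's current "best" link; known
# flows are pinned to their original one.
flow_table = {}

def route(flow, best_link_now):
    return flow_table.setdefault(flow, best_link_now)

f = ("tcp", "10.0.0.5", 40000, "1.2.3.4", 443)
print(route(f, "usb0"))   # established while cellular was healthy
print(route(f, "eth1"))   # conditions changed - still usb0, pinned
```

Only when the flow dies and the application opens a fresh connection (a new 5-tuple) does the balancer get another say.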
Of course, there are some limits - once the connection dies, it probably gets re-established (by the application layer) and the router chooses another interface for it. Wait... does it? Well, you already know from the above - new connection, new consideration. If the cellular died completely in the meantime, another interface is chosen. If it is still alive, it still gets some share of the new connections.
Why? Because the status of connections is the only way the router can actually estimate the shape of the links. It must try all the available interfaces and periodically adapt the weights to the actual situation - the cellular (or any other interface) could recover, right?
While this article gets longer and longer, I still have at least my 2 cents to share with you before we get to the conclusion:
As for the conclusion, please remember: