When big is never big enough, one may ask, “What’s cooking?” Well, BGP in the DC is a subject that’s been on my radar for some time, so the purpose of this article is to make the WHYs and HOWs a bit more straightforward.
A look in the past
First of all, we should ask ourselves who Clos was. Charles Clos started his work at Bell Labs, mainly focusing on finding a way to switch telephone calls in a scalable and cost-effective way. In 1953, he published the paper “A Study of Non-Blocking Switching Networks”, where he described how to use equipment with multiple stages of interconnection to switch calls.
The crossbar switches (you may think of them as ordinary switches with a fixed number of ports) connected in a three-stage network (ingress, middle, egress) form the so-called Clos network.
This design saw a lot of use back in the 1950s, but once circuit integration reached the point where interconnections were no longer a problem, it fell out of interest, at least for some time. That is, until huge-scale data centers came to be needed (and I mean warehouse huge).
Let’s see what the abstracted components of a Clos network are:
You may not figure too much out from the above drawing, which would be the standard one. But if we take a look at how this gets applied in the data center to obtain the large scale interconnects, we get to the following design:
This is called a “folded Clos network” and works the same way as above, but it’s the usual representation of network equipment in a DC design. In terms of notation, leaves are often called Tier 2 devices, and the spine components Tier 1 devices. In practice, another layer is added to the design, consisting of the Tier 3 TOR (Top of Rack) switches, which aggregate the actual servers. As seen above, a set of directly connected Tier 2 and Tier 3 devices, along with their attached servers, is referred to as a “cluster”.
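The values of m and n (in the usual Clos notation, the number of middle-stage switches and the number of inputs per ingress switch) define the blocking characteristic of the network. Without going too far into details, to obtain a non-blocking Clos network, in the sense that every input can reach a free output as long as existing connections may be rearranged, it has been proven that m must be greater than or equal to n. The stricter condition, where no rearrangement is ever needed, is m >= 2n - 1. Over-subscription is usually enforced at Tier 3, mainly to allow capacity planning for different services and clients.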
The secret to scaling up the number of ports in a Clos network is adjusting two values: the width of the spine and the over-subscription ratio. The wider the spine, the more leaves the fabric can support. Likewise, the more over-subscription placed into the leaves, the larger the fabric can grow.
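To make the two knobs concrete, here is a rough Python sketch. It assumes a two-tier leaf-spine fabric in which every leaf has exactly one uplink to every spine switch, all ports run at the same speed, and “spine width” is read as the number of ports on each spine switch. The function and parameter names are purely illustrative.

```python
def fabric_size(spine_ports, leaf_ports, uplinks):
    """Return (max_leaves, server_ports, oversubscription_ratio).

    Assumptions (not from the article): one uplink from each leaf to each
    spine switch, identical port speeds, every spine port faces a leaf.
    """
    downlinks = leaf_ports - uplinks          # leaf ports left for servers
    if uplinks <= 0 or downlinks <= 0:
        raise ValueError("a leaf needs both uplinks and server-facing ports")
    max_leaves = spine_ports                  # each leaf uses one port per spine
    server_ports = max_leaves * downlinks
    ratio = downlinks / uplinks               # e.g. 3.0 means 3:1
    return max_leaves, server_ports, ratio


# 32-port spines, 48-port leaves: fewer uplinks means more server ports,
# at the price of a higher over-subscription ratio.
print(fabric_size(spine_ports=32, leaf_ports=48, uplinks=12))  # (32, 1152, 3.0)
print(fabric_size(spine_ports=32, leaf_ports=48, uplinks=4))   # (32, 1408, 11.0)
```

Under these assumptions, the spine port count bounds the number of leaves, while the split between uplinks and server-facing ports on each leaf sets the over-subscription ratio.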
Back to the future
Now the question is: how do I make this work?
So, do we go for an L2 or an L3 design?
There are some advantages to an L2-wide design, such as seamless server mobility, but that usually works only with a limited number of devices and requires additional protocols such as TRILL (Transparent Interconnection of Lots of Links). In any case, classical spanning tree would never work here. Why? Because it does not scale when it comes to “many” links! (It would probably take a lifetime to reach convergence, broadcast storms would kill you, and the ARP tables would be humongous.)
If not L2, then L3? IS-IS or OSPF would be nice, but they face some limitations: scalability (database size may become a blocking point in large-scale networks), traffic engineering (for example, shifting traffic around a spine for maintenance is pretty difficult to do with IS-IS/OSPF), the need for a hierarchical design (a contiguous backbone area in OSPF, an L2 backbone in IS-IS) and, last but not least, added complexity (BGP is still needed to connect to the edge of the DC, so an IGP would add some troubleshooting and design overhead).
We’re left with BGP then?
Yes, it seems so, but we still have to make a choice: iBGP or eBGP?
Let’s have a look at the following analysis for some reasoning on why we should choose eBGP and then we’ll discuss the possibility of implementing iBGP:
| BGP | OSPF |
|---|---|
| Based on TCP to maintain sessions | Hello protocol used to set up, maintain and bring down adjacencies |
| A network failure is resolved as soon as a new best path is found, which is very fast in symmetric designs such as Clos networks | Event propagation is area-wide |
| Supports recursive next hops, which means routing information can easily be injected into the system | Makes use of the Forwarding Address, but the implementation is often messy |
| The standard AS_PATH loop prevention mechanism works best in multi-tier environments, leaving long paths unused | Multi-topology support exists, but it is really complicated |
Yes, indeed. Let’s see an application of what we’ve discussed (Note: This is a personal example, things could be done and are done in many different ways, depending on the design requirements):
The addressing scheme needs to be standardized, using a /31 subnet for each point-to-point link. Scalability may become an issue as the number of leaves increases, but this is easily solved with an IP allocation script. For example, our design needs 8 (TORs) x 2 (links each) + 8 (T2s) x 4 (T1s) = 48 /31s.
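Such an allocation script could look roughly like the Python sketch below. The wiring is an assumption made for illustration: four clusters, each with two TORs and two T2 devices, every TOR linked to both T2s in its cluster and every T2 linked to all four T1s. The device names and the 10.255.0.0/24 pool are hypothetical too.

```python
import ipaddress


def build_links(clusters=4, tors_per_cluster=2, t2_per_cluster=2, t1_count=4):
    """List every point-to-point link as a (lower_device, upper_device) pair."""
    links = []
    for c in range(1, clusters + 1):
        t2s = [f"c{c}-t2-{i}" for i in range(1, t2_per_cluster + 1)]
        tors = [f"c{c}-tor-{i}" for i in range(1, tors_per_cluster + 1)]
        links += [(tor, t2) for tor in tors for t2 in t2s]
        links += [(t2, f"t1-{s}") for t2 in t2s for s in range(1, t1_count + 1)]
    return links


def allocate_p2p(links, pool="10.255.0.0/24"):
    """Give each link the next free /31 from the pool, plus its two addresses."""
    subnets = list(ipaddress.ip_network(pool).subnets(new_prefix=31))
    if len(links) > len(subnets):
        raise ValueError("pool too small for the number of links")
    return {link: (net, net[0], net[1]) for link, net in zip(links, subnets)}


links = build_links()
plan = allocate_p2p(links)
print(len(plan))  # 48, matching 8 x 2 + 8 x 4
for link in links[:3]:
    net, low, high = plan[link]
    print(link, net, low, high)
```

Keeping the link list and the address pool as inputs means the same script keeps working when leaves are added: only the pool has to be large enough.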
Private ASNs are limited in number
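If you look at the ASN allocation scheme above, there are intrinsic limitations to it: the 16-bit private range (64512 to 65534) holds only 1023 ASNs. This can be solved in one of two ways: either use 32-bit ASNs or reuse the same ASNs in each cluster. The second option also requires the “allowas-in” option on the Tier 3 devices, so that they accept routes carrying their own ASN in the AS_PATH.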
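The trade-off between the two approaches can be sketched in a few lines of Python. The scheme below is a hypothetical one, not necessarily the one in the drawing: a single ASN shared by all Tier 1 devices, one ASN per cluster for its Tier 2 devices, and one ASN per TOR, optionally reused across clusters. The private ranges are those reserved by RFC 6996.

```python
PRIVATE_16 = range(64512, 65535)            # 64512..65534
PRIVATE_32 = range(4200000000, 4294967295)  # 4200000000..4294967294


def asn_plan(clusters, tors_per_cluster, reuse_tor_asns, pool=PRIVATE_16):
    """Assign private ASNs under the hypothetical scheme described above."""
    tor_asns_needed = tors_per_cluster if reuse_tor_asns else clusters * tors_per_cluster
    needed = 1 + clusters + tor_asns_needed
    if needed > len(pool):
        raise ValueError(f"need {needed} ASNs, pool has only {len(pool)}")
    free = iter(pool)
    plan = {"t1": next(free)}
    for c in range(1, clusters + 1):
        plan[f"c{c}-t2"] = next(free)
    shared = [next(free) for _ in range(tors_per_cluster)] if reuse_tor_asns else None
    for c in range(1, clusters + 1):
        for i in range(tors_per_cluster):
            # Reused ASNs repeat in every cluster (this is what needs allowas-in).
            plan[f"c{c}-tor-{i + 1}"] = shared[i] if reuse_tor_asns else next(free)
    return plan


print(len(PRIVATE_16))                                   # 1023
print(len(set(asn_plan(4, 2, reuse_tor_asns=False).values())))  # 13
print(len(set(asn_plan(4, 2, reuse_tor_asns=True).values())))   # 7
```

With 100 clusters of 20 TORs each, unique TOR ASNs no longer fit in the 16-bit pool, whereas reusing them per cluster needs only 121 ASNs; passing `pool=PRIVATE_32` is the other way out.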
If we started advertising all point-to-point prefixes into BGP, there would be no actual gain, because eBGP rewrites the next hop each time the prefix is announced, so there is little need to know the p2p /31s. One thing that may need to be done is to advertise the loopback prefix of each device, mainly for management purposes. This can be done elegantly by aggregating the loopback address space at each level. Server subnets need to be announced into BGP, and no summarization is possible here, because it would lead to traffic black-holing when a link fails (a direct consequence of how Clos networks work).
To connect to the edge (WAN, other DCs) a dedicated cluster may be used. The cluster will need to hide DC specifics (private ASNs). Also, from the edge a default route may be injected into the DC for external reachability purposes.
How does ECMP work with Clos networks?
Effectively, every lower-tier device will use all of its directly attached upper-tier devices to load-share traffic destined to the same IP prefix. The number of ECMP paths between any two Tier 3 devices in a Clos topology equals the number of Tier 1 devices.
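As a rough illustration of the load-sharing step, the Python sketch below hashes a flow’s 5-tuple and uses the result to pick one of the equal-cost next hops, so that all packets of one flow follow the same path. Real switches use vendor-specific hardware hash functions; SHA-256, the device names and the addresses here are only stand-ins.

```python
import hashlib


def pick_next_hop(flow, next_hops):
    """Pick one equal-cost next hop for a flow.

    flow is a (src_ip, dst_ip, protocol, src_port, dst_port) tuple.
    The same flow always maps to the same next hop.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


upper_tier = ["t1-1", "t1-2", "t1-3", "t1-4"]   # hypothetical upper-tier devices
for src_port in range(49152, 49160):
    flow = ("10.1.1.10", "10.2.2.20", 6, src_port, 443)
    print(src_port, pick_next_hop(flow, upper_tier))
```

Hashing per flow rather than per packet avoids reordering within a TCP session, at the cost of uneven load when a few flows are much larger than the rest.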
There may also be cases where the same prefix is advertised from several TORs, to provide application load balancing. The prefix will appear to the other devices with the same AS_PATH length but different originating ASNs. BGP can multi-path over these paths using the “multi-path relax” feature.
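The effect of “multi-path relax” can be sketched as follows. This Python snippet is a deliberate simplification: it looks only at the AS_PATH and ignores every other BGP tie-breaker, and the next hops and ASNs are made up. Without the relax option, only paths whose AS_PATH is identical to the best one are eligible; with it, an equal AS_PATH length is enough.

```python
def multipath_set(paths, relax):
    """Return the paths eligible for ECMP.

    paths is a list of (next_hop, as_path) pairs, as_path being a tuple of ASNs.
    Simplified: only the AS_PATH is compared.
    """
    shortest = min(len(as_path) for _, as_path in paths)
    candidates = [p for p in paths if len(p[1]) == shortest]
    if relax:
        return candidates                     # same length is sufficient
    best_as_path = candidates[0][1]
    return [p for p in candidates if p[1] == best_as_path]


# The same prefix learned via two TORs with different (hypothetical) ASNs.
paths = [
    ("10.255.0.1", (64601, 65101)),
    ("10.255.0.3", (64602, 65102)),
]
print(len(multipath_set(paths, relax=False)))  # 1
print(len(multipath_set(paths, relax=True)))   # 2
```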
When a T2 device loses a path for a prefix, it usually still has a default route that points to a T1 device. It’s possible that, for a short period of time, the T1 device still points to the T2 device that lost the route, thus creating a transient micro-loop. To avoid this, discard routes may be configured on the T2 devices to ensure that traffic destined to the underlying T3 devices is not bounced back to the T1 devices via the default route.
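Why the discard route helps comes down to longest-prefix matching, which the Python sketch below imitates with a toy routing table. The prefixes are hypothetical: 10.1.0.0/16 stands for the address space of the cluster behind the T2 device, and 10.1.1.0/24 for one server subnet learned from a TOR.

```python
import ipaddress


def lookup(table, destination):
    """Longest-prefix match: return the action of the most specific route."""
    dst = ipaddress.ip_address(destination)
    matches = [(net, action) for net, action in table.items() if dst in net]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]


def net(prefix):
    return ipaddress.ip_network(prefix)


t2_table = {
    net("0.0.0.0/0"): "forward to T1",        # default route from the edge
    net("10.1.0.0/16"): "discard",            # covers this cluster's servers
    net("10.1.1.0/24"): "forward to TOR 1",   # specific route learned via BGP
}

before = lookup(t2_table, "10.1.1.5")
del t2_table[net("10.1.1.0/24")]              # the path to TOR 1 is lost
after = lookup(t2_table, "10.1.1.5")
print(before)  # forward to TOR 1
print(after)   # discard
```

Without the /16 discard entry, the second lookup would fall through to the default route and send the packet back up to a T1 device, which is exactly the micro-loop described above.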
What about iBGP?
Designing with iBGP is a bit different, since iBGP requires every switch to peer with every other device within the fabric. To mitigate that burden, we can use BGP route reflectors in the spine of the network.
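A quick back-of-the-envelope comparison shows how much the session count drops. The Python sketch below assumes that every leaf peers with every spine route reflector and that the route reflectors are fully meshed among themselves; the fabric size is made up.

```python
def full_mesh_sessions(devices):
    """iBGP sessions needed when every device peers with every other one."""
    return devices * (devices - 1) // 2


def route_reflector_sessions(spines, leaves):
    """Sessions with spines as route reflectors and leaves as their clients.

    Assumption: each leaf peers with every spine, and the spines keep
    a full mesh among themselves.
    """
    return spines * leaves + full_mesh_sessions(spines)


spines, leaves = 4, 32
print(full_mesh_sessions(spines + leaves))        # 630
print(route_reflector_sessions(spines, leaves))   # 134
```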
The problem with standard BGP route reflection is that it reflects only the best path for each prefix, so it doesn’t work well with ECMP. To enable full ECMP, we have to use the BGP ADD-PATH feature, which lets the route reflector advertise additional paths for the same prefix to its clients.