Layer 2 Datacenter Interconnect options
- 04 September, 2010 04:55
For more than 20 years we have been using Layer 3 connectivity powered by dynamic routing protocols to route traffic between data centers, but adoption of virtualization and geo-clustering technologies is forcing us to re-examine our data center interconnect (DCI) models.
Harnessing the power of virtualization allows organizations to view and treat their compute resources as a global resource pool unconstrained by individual data center boundaries. Resources can span multiple buildings, metro areas or theoretically even countries. This basically means you can increase collective compute power in a data center that needs it by "borrowing" the resources from a data center that has spare capacity at the moment. This task is achieved by moving virtual machines between the data centers.
All major virtualization vendors, such as VMware, Xen and Microsoft, support the concept of virtual machine live migration, where you can move live VMs from one physical host (server) to another without powering them down or breaking application connectivity (there is a short pause, but not long enough for TCP sessions to be torn down).
Now the question is, what happens to the network settings -- specifically IP address/Subnet Mask/Default Gateway -- of the VM when it moves from Data Center A to Data Center B?
The answer is they remain the same. Well, to be precise, they remain the same when live migration is performed. If, however, the VM is powered down in Data Center A, copied in the down state to Data Center B and then powered up, the server administrator will have to change the IP address on the operating system running inside the VM to match the settings required at the destination Data Center B.
This, however, is not a very elegant solution, because it will require all connections to be re-established, let alone possibly create an application mess due to IP address change, since we all know how developers like to use static IP addresses rather than DNS names. So for the sake of our discussion, we are keeping the same IP address while live VMs move between data centers.
On the network level, both source and destination data centers now need to accommodate the same IP subnets where VMs are located. Traditionally, having the same IP subnet appear in both data centers would be considered a misconfiguration or a really bad design. Consequently, it also means that Layer 2 domains (VLANs) need to be extended between these data centers, and this constitutes a major change in the way traditional data center interconnection has been done.
The other development that is forcing us to re-examine our data center interconnect models is geo-clustering, which involves use of existing application clustering technologies while positioning the servers, members of the cluster, in separate data centers. The biggest rationale behind doing this is to achieve very quick Disaster Recovery (DR). After all, it only takes cluster failover to resume the service out of the DR data center.
The failover usually means the standby cluster member, the one in the DR site, takes over the IP address of the previously active cluster member to allow TCP sessions survivability. At the network layer it again translates to the fact that both data centers hosting cluster members need to accommodate the same IP subnets and extend VLANs where those clustered servers are connected.
Some clustering solutions, for example Microsoft Windows Server 2008, optionally allow cluster members to be separated by Layer 3, meaning that they do not have to belong to the same IP subnet or reside in the same VLAN. These solutions, however, rely on Dynamic DNS updates and the DNS infrastructure to propagate the new name-to-IP address mapping across the network to reflect the IP address change of the application once cluster failover occurs. This introduces another layer of complexity and punches a hole in the concept of using geo-clustering for quick DR.
What we can do
Now that we understand why Data Center Interconnect is morphing, let's look at the technologies in the network designer's toolkit that can help solve the puzzle of IP subnet and VLAN extension across multiple data centers.
* Port Channel: Data centers can be interconnected using port channeling technology, which can run on top of dark fiber (due to distances, copper cabling is highly unlikely), xWDM wavelengths or Ethernet over MPLS (Layer 2) pseudo-wires. Port Channels can be either statically established or dynamically negotiated using LACP (IEEE 802.3ad) protocol.
* Multi-Chassis Port Channel: Multi-Chassis Port Channel is a special case where port channel member links are distributed across multiple switches. The added value is, of course, that the port channel can survive an entire switch failure. One popular implementation of this technology is from Cisco, using Virtual Switching System (VSS) on Catalyst 6500 series switches or a virtual Port Channel (vPC) on the Nexus 7000 and 5000 series switches. Nortel has a similar implementation called Split Multi-Link Trunking (SMLT). Multi-Chassis port channel can run over the same set of technologies as the traditional port channel and can also be either statically established or dynamically negotiated.
With both Port Channels and Multi-Chassis Port Channels, VLAN extension is achieved by forwarding (also referred to as trunking) the VLANs between data centers across the port channel. Without special configuration, Spanning Tree Protocol is extended by default, merging the STP domains of both data centers.
This is most often an unwanted outcome because issues in one data center, such as the infamous Layer 2 loops, can propagate across and impact the other data center. Methods of filtering Bridge Protocol Data Units (BPDUs) are usually employed to isolate STP, and thus fault domains, between the data centers. Media access control (MAC) reachability information is derived through flooding of unknown unicast traffic across the port-channeled links, which is a common but not particularly efficient way of learning MAC addresses in a transparent bridging environment. The use of Port Channels as DCI is relatively simple and intuitive. However, it suffers from poor scalability beyond two data centers and as such does not fit well in larger DCI deployments.
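To make the flood-and-learn behavior concrete, here is a minimal sketch of a transparent bridge in Python. The class, port names and MAC addresses are hypothetical and chosen only for illustration; real switches also age out entries and handle broadcast and multicast, which is omitted here.

```python
class LearningBridge:
    """Minimal transparent bridge: learn source MACs, flood unknown destinations."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}  # MAC address -> port it was last seen on

    def receive(self, in_port, src_mac, dst_mac):
        """Return the set of ports the frame is sent out of."""
        # Learn (or refresh) the port behind which the source MAC lives.
        self.mac_table[src_mac] = in_port
        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            # Unknown unicast: flood out of every port except the ingress port.
            return self.ports - {in_port}
        if out_port == in_port:
            # Destination sits on the ingress segment, so nothing is forwarded.
            return set()
        return {out_port}


bridge = LearningBridge(ports=["local-1", "local-2", "dci"])
first = bridge.receive("local-1", "aa:aa", "bb:bb")   # bb:bb unknown, flooded (DCI included)
reply = bridge.receive("dci", "bb:bb", "aa:aa")       # aa:aa already learned on local-1
second = bridge.receive("local-1", "aa:aa", "bb:bb")  # bb:bb now learned on dci
```

The first frame crosses the DCI link only because the destination is unknown; once the reply has been seen, traffic is sent to a single port.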
* Ethernet over MPLS: This is the oldest of the pseudo-wire technologies in which an MPLS backbone is used to establish a logical conduit, called a pseudo-wire, to tunnel Layer 2 -- in this case Ethernet -- frames across. EoMPLS is also sometimes referred to as a Layer 2 VPN.
Layer 2 Ethernet frames are picked up on one side, encapsulated, label switched across the MPLS backbone, decapsulated on the other side and then forwarded as native Layer 2 Ethernet frames to their destination. Frames can also include an 802.1Q trunking tag if you want to transport multiple VLANs across the same pseudo-wire.
EoMPLS pseudo-wires make both sides appear as if they were connected with a long-reach physical cable. If you're thinking "that sounds similar to Port Channels", you are right. Setting aside the MPLS backbone in between, EoMPLS pseudo-wires share similar characteristics with the Port Channels-based DCI solution we talked about earlier.
Just like with Port Channels, BPDUs are forwarded by default across the pseudo-wires merging STP domains, and BPDU filtering can be employed to prevent that. MAC learning is still done through flooding, so EoMPLS does not change that inefficient concept.
In fact those two technologies can even be layered on top of each other and, as we briefly mentioned before, at times Port Channels are built across the EoMPLS pseudo-wires to deliver DCI connectivity and VLAN extension.
If an MPLS backbone is too much for you to handle, EoMPLS can run on top of a regular IP backbone using GRE tunneling. Keep in mind that MPLS label exchange still occurs across the GRE tunnels, so by using EoMPLSoGRE we now have one more protocol layer to troubleshoot and account for, but the upside is that there is no MPLS backbone to maintain.
The use of GRE tunneling also has implications for the Maximum Transmission Unit (MTU) size that must be supported across the IP backbone, since GRE adds 24 bytes of overhead (20 bytes of outer IP header + 4 bytes of GRE header) to each packet in addition to the encapsulated MPLS label stack.
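The MTU arithmetic can be sketched as follows. The 20-byte and 4-byte GRE figures come from the text above; the label count, the presence of an 802.1Q tag and the 1500-byte customer MTU are assumptions for the sake of the example, and how a given platform counts headers toward its configured MTU varies, so treat the result as an estimate.

```python
# Per-packet header sizes in bytes.
OUTER_IPV4_HEADER = 20  # from the text: outer IP header added by GRE
GRE_HEADER = 4          # from the text: base GRE header
MPLS_LABEL = 4          # one MPLS label stack entry
ETHERNET_HEADER = 14    # assumption: customer Ethernet header is carried, FCS is not
DOT1Q_TAG = 4           # present when VLAN-tagged frames are transported

GRE_OVERHEAD = OUTER_IPV4_HEADER + GRE_HEADER


def required_core_mtu(customer_ip_mtu=1500, labels=2, tagged=True):
    """Estimate the outer IP packet size the backbone must carry for EoMPLSoGRE."""
    frame = customer_ip_mtu + ETHERNET_HEADER + (DOT1Q_TAG if tagged else 0)
    return frame + labels * MPLS_LABEL + GRE_OVERHEAD


print(GRE_OVERHEAD)         # 24
print(required_core_mtu())  # 1500 + 14 + 4 + 8 + 24 = 1550
```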
* VPLS: Virtual Private LAN Services extends EoMPLS by allowing multipoint connectivity, which is achieved through a set of pseudo-wires running between VPLS Provider Edge (PE) routers. Pseudo-wire endpoints can either be statically defined or automatically discovered using Multi-Protocol BGP (MP-BGP).
VPLS provides STP domain isolation by default, which is an improvement over EoMPLS and Port Channel DCI. However, achieving edge redundancy with VPLS is no easy task, and network designers need to be crafty to make sure that inter-data center loops are broken.
VPLS brings no good news about MAC address learning, which is still achieved by flooding unknown unicast traffic throughout the network across the pseudo-wires. Once properly tuned, however, VPLS provides quite an effective data center interconnect.
Just like EoMPLS, VPLS has a VPLSoGRE variant for non-MPLS environments, and just like EoMPLSoGRE, it adds 24 bytes of GRE overhead when compared with traditional VPLS, so interface MTU needs to be properly planned across the backbone.
One more interesting point: even when MP-BGP is used to automatically discover pseudo-wire endpoints, the GRE tunnels still need to be created manually, which undermines the advantages of using MP-BGP in a VPLSoGRE deployment.
Two proprietary approaches
All Layer 2 Data Center Interconnect technologies discussed so far are industry standards. Let us now look into two innovative proprietary technologies from Cisco.
* Advanced VPLS: A-VPLS is another variant of VPLS technology, but it introduces several new properties that make it stand out. First, it mitigates the difficulties of providing DCI Edge redundancy without utilizing any fancy scripting mechanisms. To achieve that, A-VPLS builds on top of Cisco's Virtual Switching System (VSS-1440) available in Cisco Catalyst 6500 switches.
Second, it utilizes port channel hashing techniques, which take Layer 2, Layer 3 and Layer 4 information into account to determine the outbound interface of the DCI Edge switch. This allows excellent traffic load-sharing between data centers over multiple DCI links.
Third, as packets traverse DCI links, A-VPLS introduces optional MPLS flow labels to further improve traffic load-balancing through the label switched core.
Fourth, it significantly simplifies user configuration, which intuitively resembles configuring a trunk interface on Cisco switches with the addition of a few MPLS-related commands. There is no longer a need for per-VLAN VFI configuration, which is a huge time saver and lowers the chance of human error.
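The hashing idea in the second point can be illustrated with a short sketch. This is not Cisco's hashing algorithm; it is a generic flow hash using CRC32, and the function name, link names and addresses are hypothetical. It only shows the principle that all packets of one flow map to the same link while different flows can spread across links.

```python
import zlib


def pick_link(links, src_mac, dst_mac, src_ip, dst_ip, src_port, dst_port):
    """Choose an outbound DCI link from Layer 2, 3 and 4 fields of a flow."""
    fields = (src_mac, dst_mac, src_ip, dst_ip, src_port, dst_port)
    key = "|".join(str(field) for field in fields).encode()
    # The same flow always hashes to the same index, so packet order is preserved.
    return links[zlib.crc32(key) % len(links)]


links = ["dci-1", "dci-2"]
flow = ("aa:aa", "bb:bb", "10.1.1.10", "10.1.1.20", 40000, 443)
chosen = pick_link(links, *flow)
```

Including Layer 4 ports in the key means that two sessions between the same pair of hosts can still take different links.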
As far as underlying network connectivity prerequisites, A-VPLS can run over diverse transports, with three clearly identifiable options: a) Layer 2 Core or no Core at all, b) Layer 3 Label Switched Core and c) Layer 3 non-Label Switched Core.
In case of Layer 2 Core, A-VPLS can run on top of technologies, such as EoMPLS and even VPLS (yes, A-VPLSoVPLS). If the network is simplified not to have a DCI Core at all, then Dark Fiber or xWDM wavelength can be used for back-to-back connectivity between A-VPLS PEs. The Layer 3 Label Switched Core option can make use of traditional MPLS VPN service interconnecting A-VPLS sites, in which case certain label exchange will happen between MPLS VPN PEs and A-VPLS PEs (for Loopback interfaces of A-VPLS PE routers).
Finally, a Layer 3 non-Label Switched Core makes use of GRE tunnels created between all participating A-VPLS PEs, while label switching occurs inside the GRE encapsulation. Mimicking VPLS behavior, Spanning Tree environments are automatically isolated between the data centers to limit Layer 2 fault domain and prevent one data center from being impacted by STP issues in another.
MAC address learning is again done through unknown unicast flooding. After all, A-VPLS is a VPLS variant and that behavior does not change. Flooded traffic does consume some DCI bandwidth, but MAC address tables are populated quickly, before this traffic causes any concern, so this is normally a non-issue. Even with its current hardware requirements of Catalyst 6500 switches (Sup720) and SIP-400 line cards (this will change with time), A-VPLS is an excellent choice for efficient Layer 2 DCI.
* Overlay Transport Virtualization: OTV is an emerging technology that is unlike any other data center interconnect solution we have discussed so far. You can call it an evolutionary protocol, which utilizes lessons learned from past DCI solutions and integrates them into the inherent protocol operation. It is transport agnostic and can be equally supported over an IP or MPLS backbone, which gives it great versatility.
Cisco calls the underlying concept of OTV traffic forwarding "MAC routing", since it behaves as if you are routing Ethernet frames over the DCI transport. OTV uses a control plane protocol to proactively propagate MAC address reachability before traffic is allowed to pass, which eliminates the dependency on flooding to either learn MAC addresses or forward unknown unicasts.
By keeping unnecessary traffic away from data center interconnect we achieve better scalability and prevent bandwidth "waste". Proactive MAC learning is one of the unique OTV differentiators.
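The contrast with flood-and-learn can be sketched in a few lines of Python. This is a conceptual model only, not the OTV protocol: the class and method names are hypothetical, and a direct method call stands in for the control plane advertisement.

```python
class OverlayEdge:
    """Conceptual edge device that learns remote MACs from advertisements only."""

    def __init__(self, site):
        self.site = site
        self.peers = []        # other edge devices reachable over the DCI
        self.remote_macs = {}  # MAC -> site that advertised it

    def learn_local(self, mac):
        # Advertise a locally learned MAC to every peer before any traffic flows.
        for peer in self.peers:
            peer.remote_macs[mac] = self.site

    def forward_over_dci(self, dst_mac):
        """Return the destination site, or None when the frame stays off the DCI."""
        # An unknown destination is not flooded across the interconnect.
        return self.remote_macs.get(dst_mac)


edge_a, edge_b = OverlayEdge("DC-A"), OverlayEdge("DC-B")
edge_a.peers, edge_b.peers = [edge_b], [edge_a]
edge_b.learn_local("bb:bb")
print(edge_a.forward_over_dci("bb:bb"))  # DC-B
print(edge_a.forward_over_dci("cc:cc"))  # None
```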
DCI Edge redundancy is achieved by having OTV Edge devices automatically elect an authoritative edge device (AED) on a per-VLAN basis, which allows traffic load-sharing while simplifying the deployment model. Only the AED is responsible for sending VLAN traffic in and out of the data center, which guarantees a loop-free topology across the DCI. Spanning Tree isolation between data centers is inherent within the protocol, which is really becoming standard best practice.
One additional OTV feature worth mentioning concerns multicast traffic replication, which is performed by backbone routers rather than by the OTV DCI edge devices in the data center where the multicast source resides (replication at the edge is known as head-end replication). Consequently, the load on those edge devices is significantly reduced. The trade-off is that backbone routers now need to be aware of certain client multicast routing information.
OTV currently requires Nexus 7000 switches at the edge and a multicast-enabled core (for the control plane protocol). The multicast requirement will be lifted in subsequent Nexus 7000 NX-OS software releases.
By now you must see that extending VLANs across data centers is easier said than done, but we are not done yet. Having a Layer 2 connection is only half of the story. We now need to provide a way to connect in and out of those VLANs, and for that we need to bring Layer 3 connectivity into the mix.
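A per-VLAN election of this kind might look like the sketch below. The modulo rule is an illustrative stand-in and not the election procedure OTV actually uses; the function and device names are hypothetical. What it demonstrates is the property described above: every VLAN has exactly one authoritative device, and VLANs are spread across the available edge devices.

```python
def elect_aed(edge_devices, vlan):
    """Pick exactly one authoritative edge device for a VLAN (illustrative rule)."""
    ordered = sorted(edge_devices)  # every device computes the same ordering
    return ordered[vlan % len(ordered)]


site_edges = ["edge-1", "edge-2"]
print(elect_aed(site_edges, 100))  # edge-1
print(elect_aed(site_edges, 101))  # edge-2

# If edge-1 fails, the surviving device becomes authoritative for every VLAN.
print(elect_aed(["edge-2"], 100))  # edge-2
```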
The easiest method is to keep all Layer 3 functionality in one data center and extend Layer 2 to the other. This setup simplifies handling routing in and out of the extended VLANs, however, when traffic enters an extended VLAN in the data center where Layer 3 functionality for that VLAN is implemented, it will be sent across the DCI link if the destination server is in the other data center.
This solution increases the load on the data center interconnect and the latency for traffic that needs to traverse it. If the server's default gateway is in the other data center (remember, in this scenario there is only one Layer 3 entity for that VLAN), traffic leaving the extended VLAN will also cross the DCI link, with the same potential latency and bandwidth concerns.
To provide Layer 3 redundancy for extended VLANs, each one of the data centers needs to have a Layer 3 component for those VLANs. As advantageous as it sounds, the biggest concern in this setup is traffic symmetry, which is required for stateful devices, such as firewalls and load-balancers, which are often part of the setup.
Without symmetry, traffic entering the extended VLAN in Data Center A and then trying to leave through Data Center B will be dropped unless session state information is synchronized between firewalls/load-balancers in those data centers.
If such synchronization does not exist, you will need to make sure that returning traffic is forwarded back to the data center through which the request originally came. This is most frequently achieved by using Source Network Address Translation (SNAT) techniques. As inefficient as it can be from a bandwidth and latency perspective, at least it works, unless of course using NAT breaks the application.
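The symmetry requirement can be modeled with a toy stateful firewall. The class and host names below are hypothetical, and sharing one Python set between two objects is only a stand-in for real session state synchronization.

```python
class StatefulFirewall:
    """Toy firewall that only passes replies belonging to a known session."""

    def __init__(self, name, sessions=None):
        self.name = name
        # Passing in a shared set models state synchronization between firewalls.
        self.sessions = sessions if sessions is not None else set()

    def accept_request(self, client, server):
        self.sessions.add((client, server))
        return True

    def accept_reply(self, server, client):
        return (client, server) in self.sessions


# Independent firewalls: the reply leaving through Data Center B is dropped.
fw_a, fw_b = StatefulFirewall("DC-A"), StatefulFirewall("DC-B")
fw_a.accept_request("client", "vm")
print(fw_a.accept_reply("vm", "client"))  # True
print(fw_b.accept_reply("vm", "client"))  # False

# Synchronized state: either firewall recognizes the session.
shared = set()
sync_a, sync_b = StatefulFirewall("DC-A", shared), StatefulFirewall("DC-B", shared)
sync_a.accept_request("client", "vm")
print(sync_b.accept_reply("vm", "client"))  # True
```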
Other methods include injecting /32 host routes or using Global Load-Balancing to direct the traffic to the actual data center where the server is located. There are also more daring initiatives from vendors, such as Cisco, to perform workflow integration between multiple components to collectively deliver the solution that ties together Layer 2 and Layer 3 Data Center Interconnect extensions. Also, stay tuned for new technologies, such as LISP (Locator/ID Separation Protocol), that tackle the issue.
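The /32 host route method mentioned above works because routers prefer the longest matching prefix. The sketch below uses Python's standard ipaddress module; the subnet, host address and data center names are made up for illustration.

```python
import ipaddress


def next_hop(routes, destination):
    """Return the next hop of the longest prefix that matches the destination."""
    dest = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routes.items() if dest in net]
    if not matches:
        return None
    # The most specific route (largest prefix length) wins.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


routes = {
    ipaddress.ip_network("10.1.1.0/24"): "DC-A",   # the extended subnet
    ipaddress.ip_network("10.1.1.25/32"): "DC-B",  # VM that migrated to Data Center B
}
print(next_hop(routes, "10.1.1.25"))  # DC-B
print(next_hop(routes, "10.1.1.26"))  # DC-A
```

The cost of this approach is one extra route per relocated host, which is worth keeping in mind when many VMs move.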
Data center interconnect is definitely morphing, but before you tear up what you have, you should thoroughly design the solution to be scalable and resilient by selecting the technology that addresses your connectivity requirements from all angles.