In the world of datacentres and large-scale enterprise networks there has always been a perceived trade-off between performance and resilience. Building in resilience is absolutely essential, of course, but it has historically affected service and application availability when brought into play. And - in spite of the best-planned and designed networks, quality components and management - problems do arise.
Add the virtual world to the physical one we've come to know and trust, and the stakes are raised again. The result is that vendors have been forced to redesign their systems to support the virtual environment, maintaining that level of resilience - or improving it - while also improving round-the-clock access to those services, applications and the data that lies beneath.
One such example is HP's Converged Infrastructure solution - incorporating servers, storage, networking and management - that is the focal point of this report. The datacentre is growing ever more critical to the enterprise, whether physical or virtual, in-house or outsourced. From a supplier's point of view, this means creating a complete system - a converged infrastructure - based on marrying truly compatible components with the best performance/feature set and with as little compromise as possible.
At the heart of HP's Converged Infrastructure (CI) system is what HP calls the Intelligent Resilient Framework - the key to the resilience contained within - which creates a resilient, fully redundant virtual switching fabric.
Intelligent Resilient Framework (IRF) is designed to combine the benefits of box-type devices (simple, standalone switches, for example) and chassis-based distributed devices, such as a blade switch. The argument is that box-type devices are cost-effective, but can be less reliable and less scalable, and are therefore unsuitable for critical business environments. In contrast, chassis-based devices are reliable and scalable enough for such environments, but are more expensive and considered more complex to deploy and manage. With IRF, then, HP is looking to merge the benefits of both approaches into one. IRF allows you to build an IRF domain, seen as one big, logical device (see illustration).
By interconnecting multiple devices through ports (regular or via dedicated stacking) it is possible to manage all the devices in the IRF domain by managing one single IP address (attached to the logical device), which provides the lower cost of a box-type device and the scalability and reliability of a chassis-type distributed device.
In a converged infrastructure environment, an IRF-based network extends the control plane across multiple active switches, enabling interconnected switches to be managed as a single common fabric with one IP address. The claim is that it increases network resilience, performance and availability, while simultaneously reducing operational complexity.
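To make the single-fabric idea concrete, the sketch below shows roughly how two switches might be bound into one IRF domain from the Comware CLI. This is an illustrative fragment only - exact command names, port numbers and syntax vary by Comware release and were not part of this test's documentation:

```
# Illustrative only - syntax varies by Comware release.
# On the second switch: give it a unique member ID (takes effect on reboot).
irf member 1 renumber 2

# On each member: bind physical 10GbE uplinks into the IRF port...
irf-port 1
 port group interface Ten-GigabitEthernet1/0/25

# ...then activate the IRF port configuration. Once the members merge,
# the whole domain is managed through one IP address as a single device.
irf-port-configuration active
```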
Another key element of the solution being tested is HP's Virtual Connect Flex-10, which comprises two components: 10Gbps Flex-10 server NICs and the HP VC Flex-10 10Gbps Ethernet module. Each Flex-10 10Gb server NIC contains four individual FlexNICs, so a dual-channel module provides eight LAN connections, with the bandwidth for each FlexNIC user-defined from 100Mbps to 10Gbps in 100Mbps increments. From a practicality perspective, VC Flex-10 reduces cabling and simplifies NIC creation, allocation and management.
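The arithmetic behind that partitioning is simple enough to sketch. The following is our own illustration, not HP tooling: it checks that a proposed split of one 10Gbps port into four FlexNICs respects the 100Mbps increments and does not oversubscribe the port (the function name and the assumption that the four allocations must fit within 10Gbps are ours):

```python
# Sketch (not HP software): validate a Flex-10 style bandwidth split,
# where one 10Gbps port is partitioned into four FlexNICs whose speeds
# are user-defined in 100Mbps increments.

PORT_CAPACITY_MBPS = 10_000
STEP_MBPS = 100

def validate_flexnic_split(allocations_mbps):
    """Return True if four FlexNIC allocations fit within one 10Gbps port."""
    if len(allocations_mbps) != 4:
        raise ValueError("a Flex-10 port carries exactly four FlexNICs")
    for bw in allocations_mbps:
        # Each FlexNIC: 100Mbps to 10Gbps, in 100Mbps steps.
        if not (STEP_MBPS <= bw <= PORT_CAPACITY_MBPS) or bw % STEP_MBPS:
            raise ValueError(f"{bw}Mbps is not a valid FlexNIC speed")
    return sum(allocations_mbps) <= PORT_CAPACITY_MBPS

# e.g. management, VMotion and two production networks on one port:
print(validate_flexnic_split([500, 2000, 4000, 3500]))  # True: totals 10Gbps
```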
A series of tests based around the resilience of HP's IRF was created, inducing a series of different failures to see how the solution coped with the problems and what this meant in terms of latency and lost packets. We also looked at the day-to-day management of the solution, including what happens when planned maintenance is required - in this case carrying out routine firmware upgrades involving switch reboots.
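The failover figures quoted throughout come from continuous pings run through the fabric during each induced failure. A minimal sketch of that measurement logic - our own simplification, not the actual test harness - treats the largest gap between consecutive ping replies, beyond the normal send interval, as the failover window:

```python
# Sketch of the measurement idea (not the actual test harness): run a
# continuous ping through the fabric, timestamp each reply, and report
# the largest inter-reply gap beyond the send interval as failover time.

def failover_time_ms(reply_times_ms, interval_ms=1.0):
    """Largest gap between consecutive replies, minus the ping interval."""
    gaps = [b - a for a, b in zip(reply_times_ms, reply_times_ms[1:])]
    worst = max(gaps, default=interval_ms)
    return max(worst - interval_ms, 0.0)

# Replies arrive every 1ms until a link is cut at t=5ms, resuming at t=9ms,
# so the fabric took roughly 3ms to fail over:
print(failover_time_ms([1, 2, 3, 4, 5, 9, 10, 11]))  # 3.0
```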
Our CI for the test was built around HP's A5820 Ethernet switches supporting IRF, with - at the back end - a combination of the aforementioned Flex-10 technology, standard HP A6120 blade switches and C3000/C7000 server enclosures.
The first test involved seeing what happened when we simulated a failed link between the A5820 switch and a VC Flex-10. In this test both switches are simultaneously active, thanks to the LACP Bridge Aggregation mechanism - a key benefit of IRF being its ability to maintain an active-active state. So, in the event of a link failure, the second link of the LACP Bridge Aggregation and the second switch supports the traffic while the broken connection is repaired.
Looking at the illustration as a guide, note that we experienced a 3ms failover time on this connection. However, between servers 9 and 10 we experienced no dropped packets whatsoever. Between server 11 and ESX4 we measured a failover time of only 1.3ms. As we brought the link back up we experienced just a minor failover time across all server-to-server links: just 1ms in total.
Reverting back to the original situation and testing all connections while the second module was shut down and restarted, we recorded an aggregate failover time across all links of just 1.2ms.
For the second test we checked what happened when we simulated an additional bridge aggregation failure - potentially a traffic killer. Testing with a 64-byte ping while this was happening, we recorded just 4ms failover time and, while in recovery mode, a further 36ms between server 9 and server 10 and 23.6ms between server 11 and server 9 - easily our most significant latencies recorded yet, but still both well below our target level of 50ms.
For test three we are having a really bad day at the office, simulating an additional failed link between the second A5820 switch and the second VC Flex-10, meaning we now have to repair a situation with three concurrent broken links. Does anyone remember the song "three wheels on my wagon"? In addition to our failures induced in tests 1 and 2, adding this third failure saw us record an additional 4ms of failover time and minimal recovery time latencies.
Already down to the bare bones of communication, we then simulated a classic scenario where one of our redundant switch pair fails or has to be rebooted (for unplanned maintenance, for example). We saw a total failover time of sub-6ms: 4ms on shutdown and under 2ms on the reboot. As if that wasn't bad enough, in this scenario we additionally simulated the second A5820 switch losing all its IRF links to the first A5820 - and thereby losing all connectivity - to prove that there are multiple tiers of redundancy in the solution. For our test case we cut off the IRF, with a 64-byte ping running, with a default configuration of Unit 1 as Master and Unit 2 as Slave.
All three IRF links were cut, meaning that both units were now in master status. At this point in the test we measured a failover time of just 4ms. We then re-merged the two units, with unit 2 rebooting and its configuration being pushed back to slave mode, so there were no conflicts and both units came back up. During this phase we recorded just a 1.4ms failover time.
We then tested how the IRF stack can accommodate other virtualisation technology such as VMotion - a VMware technology that enables virtual machines to be migrated live from server to server - through the IRF links directly, to gauge the effect on performance.
The aggregated bandwidth of the IRF links (here 30Gbps) provided the best network performance and lowest latency we have seen for VMotion events in the datacentre. The traffic peaks visible at the top left and right of the upper graphs above show the VMotion traffic. This test suggests the CI solution is optimised for virtual environments.
Overall we found the claimed resilience of HP's IRF technology to be justified. In every case the system recovered successfully from induced failures, most of which were very severe. We recorded latency and lost packets at every stage of recovery and found extremely low failover times - generally in the low milliseconds for complete recoveries, allowing for system elements to shut down, reboot and so on. To put this into perspective, it is not very long since failover times - bringing a redundant device up after a failure of the primary device - in this type of situation were measured in seconds, and sub-30-second recovery was seen as class-leading.
Our firmware upgrade test also ran successfully and recorded a very low overall switchover time: just 9ms for the complete upgrade of two switches (master/slave). This augurs well for day-to-day management of what is, in theory, a complex CI solution, making it a relatively straightforward administrative task.
Pros and cons of HP's Intelligent Resilient Framework

Pros:
- Proved extremely resilient during all tests.
- Very low latency observed during all tests.
- Allows for both high performance and redundancy in one solution.

Cons:
- Technology is specific to HP products and solutions.
This was first published in May 2011