Disclaimer: The methods, workarounds, ideas and scripts in this article are NOT supported by Velocloud. Use them at your own risk.
Background
When a Velocloud SD-WAN Edge (VCE) is deployed as hardware, or as a virtual edge on KVM/ESXi, High Availability (HA) is supported. However, in public clouds, including Alibaba Cloud, VCE HA is not possible, for two reasons:
- The VCEs in an HA pair discover each other by multicast, and multicast is not supported in public clouds.
- The HA interfaces are always automatically assigned the IP addresses 169.254.2.1 and 169.254.2.2. This does not work in a public cloud, because each VPC comes with its own address block and we cannot assign an IP address outside of that block.
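The second point can be illustrated with a quick check. This is only a sketch: the VPC address block `192.168.0.0/16` and the helper name `in_vpc` are hypothetical, chosen for illustration, and the two HA addresses are the ones mentioned above.

```python
import ipaddress

# Hypothetical VPC address block, for illustration only.
VPC_CIDR = ipaddress.ip_network("192.168.0.0/16")

# HA interface addresses that the VCE assigns automatically.
HA_ADDRESSES = ["169.254.2.1", "169.254.2.2"]


def in_vpc(addr, vpc=VPC_CIDR):
    """Return True if addr falls inside the VPC address block."""
    return ipaddress.ip_address(addr) in vpc


for addr in HA_ADDRESSES:
    # Both HA addresses fall outside the assumed VPC block.
    print(addr, "inside VPC block:", in_vpc(addr))
```

With any private VPC block the result is the same, since 169.254.0.0/16 is link-local address space.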
Alibaba Cloud High-Availability Virtual IP Address (HaVip)
Alibaba Cloud has a feature called HaVip. The details can be found here: https://www.alibabacloud.com/help/en/vpc/user-guide/highly-available-virtual-ip-address-havip
Some other vendors support HA with this HaVip feature, because their products allow the IP address of the HA interface to be configured and can run VRRP over unicast (not multicast).
Since the VCE cannot do HA communication over unicast, this article describes how to work around the limitation with a script that runs on the VCE itself.
The Idea
The idea of the workaround is simple. There are two VCEs, each working independently. Logically, however, one acts as the primary and the other as the secondary (let's call them the primary VCE and the secondary VCE from now on). On the LAN interface of both VCEs, a secondary IP address is configured, and that secondary IP address is the HaVip. On its own this would result in an IP address conflict, so the LAN interface of the secondary VCE is intentionally shut down.
A Python script runs on the secondary VCE and continuously pings the WAN interface of the primary VCE. If the ping succeeds (which means the primary VCE is up), nothing is done. If the ping fails (which means the primary VCE is down), the script brings up the LAN interface of the secondary VCE so that the secondary VCE can take over. The Python script is thus responsible for making sure that traffic from the VPC to the HaVip hits the primary VCE unless the primary VCE is down.
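The watchdog logic described above could be sketched roughly as follows. This is not the actual script: the IP address, the interface name, the failure threshold, and the use of the Linux `ping` and `ip link` commands are all assumptions made for illustration. In particular, the text above reacts to a failed ping; requiring several consecutive failures is an added assumption to avoid reacting to a single lost packet.

```python
import subprocess
import time

# All values below are hypothetical placeholders, not taken from a real deployment.
PRIMARY_WAN_IP = "203.0.113.10"  # WAN IP of the primary VCE (assumed)
LAN_INTERFACE = "eth1"           # LAN interface name on the secondary VCE (assumed)
FAIL_THRESHOLD = 3               # consecutive failed pings before takeover (assumed)
PING_INTERVAL_S = 2              # seconds between pings (assumed)


def primary_is_alive(ip):
    """Send one ICMP echo request; True if the ping command reports a reply."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def next_failure_count(failures, ping_ok):
    """Reset the counter on a successful ping, otherwise increment it."""
    return 0 if ping_ok else failures + 1


def should_take_over(failures, threshold=FAIL_THRESHOLD):
    """True once enough consecutive pings have failed."""
    return failures >= threshold


def bring_up_lan(interface):
    """Bring the LAN interface up (assumes a Linux-style `ip` command is available)."""
    subprocess.run(["ip", "link", "set", "dev", interface, "up"], check=True)


def watch(ip=PRIMARY_WAN_IP, interface=LAN_INTERFACE, interval=PING_INTERVAL_S):
    """Ping the primary until it is considered down, then enable the LAN interface."""
    failures = 0
    while True:
        failures = next_failure_count(failures, primary_is_alive(ip))
        if should_take_over(failures):
            bring_up_lan(interface)
            return
        time.sleep(interval)
```

In this sketch, calling `watch()` on the secondary VCE starts the loop. The decision logic is kept in small pure functions so it can be checked without sending any pings.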
For the remote site, we also need to ensure that traffic prefers the primary VCE. Since the LAN interface of the secondary VCE is intentionally shut down, the secondary VCE will not advertise the VPC routes as reachable (reachable is false). As a precaution, the primary VCE advertises the VPC routes with cost 0, while the secondary VCE advertises them with cost 10. Since the lower cost is preferred, traffic from the remote site will always prefer the primary VCE as long as the primary VCE is up and running.
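The preference described above can be modelled in a few lines. This is a simplified illustration of "reachable routes only, lowest cost wins", not the actual route selection logic of the product. The `Route` class, the `best_route` function and the next-hop labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Route:
    next_hop: str    # label of the advertising VCE (hypothetical)
    cost: int        # advertised cost; lower is preferred
    reachable: bool  # False while the advertising LAN interface is down


def best_route(routes):
    """Pick the reachable route with the lowest cost, or None if none is reachable."""
    candidates = [r for r in routes if r.reachable]
    return min(candidates, key=lambda r: r.cost, default=None)


# Normal operation: the secondary LAN interface is down, so its route is unreachable.
normal = [Route("primary", 0, True), Route("secondary", 10, False)]

# After failover: the primary is down and the secondary LAN interface is up.
failover = [Route("primary", 0, False), Route("secondary", 10, True)]

# Precaution case: both routes reachable, the lower cost still selects the primary.
both_up = [Route("primary", 0, True), Route("secondary", 10, True)]
```

Here `best_route(normal)` and `best_route(both_up)` select the primary, while `best_route(failover)` selects the secondary.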
Because the VCEs are not actually running VRRP and rely on the idea above to work with Alibaba Cloud HaVip, there are caveats: the setup is not officially supported, there will be some network interruption when the primary VCE comes back, and so on. However, the script in this workaround does not need to touch the VPC route table, so it should require minimal maintenance.