High availability using VpnCloud

VpnCloud can be used to provide high availability (HA) on the IP layer. That means that by using VpnCloud you can configure several server nodes to listen on the same IP address. Clients will automatically switch to a different server when a server node becomes unavailable.

Note that high availability is not the main use case of VpnCloud. Therefore VpnCloud lacks some functionality that other HA solutions have:

There is no load balancing functionality in VpnCloud
VpnCloud server nodes are not aware whether they are active or in standby mode
There is no support for HA on layers other than IP (e.g. HTTP proxy)

How it works

VpnCloud does not check whether claimed IP addresses are unique, so several nodes can claim the same address in the VPN. Normally that makes no sense, because nodes that share an IP address cannot communicate with each other. Other nodes, however, can reach the sharing nodes via this single IP address, although only one of them is reached at any given time.

When a node crashes, its peers will detect this and remove the node from the peer list. Together with the node, they will also remove the addresses claimed by that node, which switches that address to a different node if it is claimed multiple times. To detect node crashes quickly, the node timeout and keepalive settings can be tweaked.

Note that there is no restriction on the number of nodes sharing the same IP address. You are not limited to two nodes in your HA setup; three or more nodes work as well.
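The failover behaviour described above can be illustrated with a toy model. This is only a sketch and not VpnCloud's actual implementation: the class ClaimTable and its methods are hypothetical names, and the rule that the earliest known claimant wins is a simplifying assumption.

```python
class ClaimTable:
    """Toy model of how a peer maps claimed addresses to nodes (hypothetical)."""

    def __init__(self):
        # address -> node names claiming it, in the order they became known
        self.claims = {}

    def add_peer(self, node, address):
        # Uniqueness is deliberately not checked, mirroring the text above.
        self.claims.setdefault(address, []).append(node)

    def remove_peer(self, node):
        # Dropping a peer also drops every address claim made by that peer.
        for address in list(self.claims):
            holders = [n for n in self.claims[address] if n != node]
            if holders:
                self.claims[address] = holders
            else:
                del self.claims[address]

    def route(self, address):
        # Assumption of this model: the earliest known claimant is used.
        holders = self.claims.get(address)
        return holders[0] if holders else None


table = ClaimTable()
table.add_peer("server-a", "10.0.0.1")
table.add_peer("server-b", "10.0.0.1")
print(table.route("10.0.0.1"))  # server-a
table.remove_peer("server-a")   # server-a crashed and timed out
print(table.route("10.0.0.1"))  # server-b
```

Removing the crashed node is the only step needed for the switch: the address stays reachable as long as at least one claimant remains.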

Detailed solution

To share the same IP address with several server nodes, the network should run in TUN mode and all server nodes should claim that single IP address (--ip IP_ADDRESS). For example:

$> vpncloud --password 'mysecret' --ip 10.0.0.1 -c other_node:3210

The client simply claims a unique IP address in the address range as usual. For example:

$> vpncloud --password 'mysecret' --ip 10.0.0.100 -c node:3210

Note that each node should have the addresses of the other nodes configured as peers (using -c). This guarantees that any node can crash and, once restarted, will reconnect to all of its peers.
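With more than two server nodes, writing the peer lists by hand becomes error-prone. The following sketch generates one command line per server, each listing all other servers as peers. The function name server_commands and the host names are hypothetical; only the options shown in the commands above (--password, --ip, -c) are used, and port 3210 is taken from those examples.

```python
def server_commands(servers, shared_ip, password="mysecret", port=3210):
    """Build one vpncloud command line per server node (illustrative helper)."""
    commands = {}
    for name in servers:
        # Every server lists all *other* servers as peers via -c.
        peers = [f"-c {other}:{port}" for other in servers if other != name]
        parts = ["vpncloud", f"--password '{password}'", f"--ip {shared_ip}"]
        commands[name] = " ".join(parts + peers)
    return commands


# Hypothetical host names for a three-node setup.
for name, cmd in server_commands(["server-a", "server-b", "server-c"], "10.0.0.1").items():
    print(f"{name}: {cmd}")
```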

Keepalive and timeout

Node crashes are detected when a peer stops sending its regular peer list updates. The remaining nodes notice this and remove the faulty node from the network.

The default peer timeout of 300 seconds is too long for an HA setup. Reduce the timeout and the keepalive interval with the options --peer-timeout and --keepalive; the peer timeout has to be longer than the keepalive interval.

For example: --peer-timeout 5 --keepalive 2. This will detect node crashes within 5 seconds at most.
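If the timing values come from a deployment script, the constraint between the two options can be checked before vpncloud is started. This is a minimal sketch; the helper name ha_timing_args is hypothetical, and it only encodes the rule stated above, namely that the peer timeout must be longer than the keepalive interval.

```python
def ha_timing_args(peer_timeout, keepalive):
    """Return the vpncloud timing options, or raise ValueError if inconsistent."""
    if keepalive <= 0:
        raise ValueError("keepalive must be positive")
    if peer_timeout <= keepalive:
        raise ValueError("peer timeout must be longer than the keepalive interval")
    return ["--peer-timeout", str(peer_timeout), "--keepalive", str(keepalive)]


print(" ".join(ha_timing_args(5, 2)))  # --peer-timeout 5 --keepalive 2
```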

Active/passive server nodes

In general, clients always forward packets to the node that claims the most specific matching subnet. Since /32 is the most specific subnet possible, all server nodes have the same priority. In this case, clients use the server node that has been online the longest.

That means that each client uses only one server node at a time. When multiple server nodes start at the same time, the choice is effectively arbitrary. However, the server node used by a client stays the same unless that server node crashes.

If you want to define one server node as primary and the other as secondary, you can use the subnet prefix to do so. If one server node claims the IP with /32 and the other claims it with /31, the former node is primary and the latter node is secondary. However, note that:

if the primary node comes up after a crash, it will disrupt all current connections to the secondary node
using /31 actually claims a second IP address that should not be used for client nodes
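The selection rule and the /31 side effect can be reproduced with Python's standard ipaddress module. This is a sketch of the rule as described above, not VpnCloud's own code: the function pick_node is hypothetical, and the order of the list stands in for the order in which the nodes came online.

```python
import ipaddress


def pick_node(claims, destination):
    """Pick the node with the most specific matching subnet.

    claims: list of (node, subnet) tuples in the order the nodes came online.
    Ties are resolved in favour of the node that appears first.
    """
    dest = ipaddress.ip_address(destination)
    best = None
    for node, subnet in claims:
        net = ipaddress.ip_network(subnet, strict=False)
        if dest not in net:
            continue
        # Strictly greater, so an earlier node keeps the address on a tie.
        if best is None or net.prefixlen > best[1].prefixlen:
            best = (node, net)
    return best[0] if best else None


# server-b (secondary, /31) came online first, server-a (primary, /32) later.
claims = [("server-b", "10.0.0.1/31"), ("server-a", "10.0.0.1/32")]
print(pick_node(claims, "10.0.0.1"))  # server-a

# The /31 claim covers a second address besides 10.0.0.1:
secondary = ipaddress.ip_network("10.0.0.1/31", strict=False)
print([str(addr) for addr in secondary])  # ['10.0.0.0', '10.0.0.1']
```

In this model the primary wins as soon as it is known, regardless of how long the secondary has been online, which matches the first caveat above. The second address covered by the /31 claim is 10.0.0.0, so no client should be given that address.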

Full example

In this example we will set up two server nodes in HA mode (server-a and server-b), as well as one client node (client). The commands assume that the nodes can be reached by name (either add the IPs to /etc/hosts or replace the names with the IPs).

On both servers run a socat command in the background:

$server-a> socat -U tcp-listen:8000,reuseaddr,fork exec:hostname &

$server-b> socat -U tcp-listen:8000,reuseaddr,fork exec:hostname &

This command runs a small server listening on port 8000 that returns the hostname of the node.

Then run the vpncloud command that claims the IP address 10.0.0.1 on both server nodes:

$server-a> sudo vpncloud --daemon --password 'mysecret' --ip 10.0.0.1 -c server-b:3210 --peer-timeout 5 --keepalive 2

$server-b> sudo vpncloud --daemon --password 'mysecret' --ip 10.0.0.1 -c server-a:3210 --peer-timeout 5 --keepalive 2

On the client, run vpncloud claiming a different IP (10.0.0.100):

$client> sudo vpncloud --daemon --password 'mysecret' --ip 10.0.0.100 -c server-a:3210 -c server-b:3210 --peer-timeout 5 --keepalive 2

Then you can ping the shared IP 10.0.0.1. You can also check which server is currently active:

$client> ping -c 3 10.0.0.1
$client> socat tcp-connect:10.0.0.1:8000 -

Finally, you can check what happens when you crash the active node (you need kill -9 to prevent vpncloud from shutting down gracefully):

$server-?> sudo killall -9 vpncloud
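To watch the switch happen, the client can poll the hostname service from the example in a loop instead of running socat by hand. The following is a rough sketch that assumes the socat servers from above are listening on port 8000 behind the shared IP 10.0.0.1; the function names are hypothetical.

```python
import socket
import time


def read_reply(sock):
    """Read from the socket until the server closes the connection."""
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)
    return b"".join(chunks).decode(errors="replace").strip()


def query_active(host="10.0.0.1", port=8000, timeout=1.0):
    """Return the hostname reported via the shared IP, or None if unreachable."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            return read_reply(sock)
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return None


def watch(rounds=30, interval=1.0):
    """Print a timestamped line whenever the answering server changes."""
    last = object()  # sentinel that differs from every possible answer
    for _ in range(rounds):
        active = query_active()
        if active != last:
            print(time.strftime("%H:%M:%S"), active or "no answer")
            last = active
        time.sleep(interval)
```

Calling watch() on the client and then killing vpncloud on the active server should show the current server name, possibly a short period of "no answer", and then the name of the other server once the peer timeout has expired.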