For the last couple of years, we've had two ISPs on premises: one (XS4ALL) for basic Internet access via VDSL, and one for our (VoIP) phone service, provided by Ziggo. The Ziggo phone service includes free (and ultra-lite) Internet access through their cable modem. It's ultra-lite since it's only 256 kbps: more than enough for VoIP, but nowhere near enough for modern basic Internet access.
Having these two ISPs means that I should be able to provide some redundancy in case my primary DSL connection fails (for whatever reason), preferably with some kind of automated fail-over. Since neither ISP offers dynamic routing protocols (the Internet service is consumer-grade), I had to find a work-around.
The following figure shows the basic layout of my network. I won't go into details regarding the security policies or NAT configuration. That's all pretty basic (and Junos 101).
The following solution is based on a Juniper branch SRX running JUNOS 12.1X44-D20.3. The functionality is delivered by Junos Real-time Performance Monitoring (RPM) combined with ip-monitoring.
The idea is that we monitor (probe) a server in the XS4ALL network (194.109.6.66) over interface fe-0/0/7.20. When that server is no longer responding, we introduce a new default route to the Ziggo network.
The Code
First we create the probe that monitors the server:
set services rpm probe XS4ALL test testsvr target address 194.109.6.66
set services rpm probe XS4ALL test testsvr probe-count 10
set services rpm probe XS4ALL test testsvr probe-interval 5
set services rpm probe XS4ALL test testsvr test-interval 10
set services rpm probe XS4ALL test testsvr thresholds successive-loss 10
set services rpm probe XS4ALL test testsvr thresholds total-loss 5
set services rpm probe XS4ALL test testsvr destination-interface fe-0/0/7.20
set services rpm probe XS4ALL test testsvr next-hop 192.168.0.254
You can fiddle around with the parameters above to tune the probing; an example of more aggressive timers follows below.
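For instance, if you want the probe to detect an outage faster, you could shorten the timers along these lines. These values are just an illustration I haven't run on this box, so treat them as a starting point (and keep total-loss at or below probe-count):
set services rpm probe XS4ALL test testsvr probe-count 5
set services rpm probe XS4ALL test testsvr probe-interval 2
set services rpm probe XS4ALL test testsvr test-interval 5
set services rpm probe XS4ALL test testsvr thresholds successive-loss 5
set services rpm probe XS4ALL test testsvr thresholds total-loss 3
The trade-off is more aggressive probing and a higher chance of flapping when only a packet or two gets lost.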
After that, we create the part that monitors the result of the probe and acts upon it: when the server is no longer reachable, it introduces a more preferred default route into the routing table.
set services ip-monitoring policy Tracking-XS4ALL match rpm-probe XS4ALL
set services ip-monitoring policy Tracking-XS4ALL then preferred-route route 0.0.0.0/0 next-hop 84.105.14.1
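For reference, the same configuration in hierarchical form should look roughly like this (stanza ordering may differ slightly on your box):
services {
    rpm {
        probe XS4ALL {
            test testsvr {
                target address 194.109.6.66;
                probe-count 10;
                probe-interval 5;
                test-interval 10;
                thresholds {
                    successive-loss 10;
                    total-loss 5;
                }
                destination-interface fe-0/0/7.20;
                next-hop 192.168.0.254;
            }
        }
    }
    ip-monitoring {
        policy Tracking-XS4ALL {
            match {
                rpm-probe XS4ALL;
            }
            then {
                preferred-route {
                    route 0.0.0.0/0 {
                        next-hop 84.105.14.1;
                    }
                }
            }
        }
    }
}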
Checking the results
The command 'show services ip-monitoring status' displays the result of the probe (PASS or FAIL) and whether the new route is applied or not. In this case the backup route is not applied.
root@srx100> show services ip-monitoring status
Policy - Tracking-XS4ALL (Status: PASS)
RPM Probes:
Probe name Test Name Address Status
---------------------- --------------- ---------------- ---------
XS4ALL testsvr 194.109.6.66 PASS
Route-Action:
route-instance route next-hop state
----------------- ----------------- ---------------- -------------
inet.0 0.0.0.0/0 84.105.14.1 NOT-APPLIED
root@srx100> show route
inet.0: 22 destinations, 24 routes (22 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
0.0.0.0/0 *[Static/5] 02:14:26
> to 192.168.0.254 via fe-0/0/7.20
[Access-internal/12] 1d 19:03:07
> to 84.105.14.1 via fe-0/0/7.30
When I disable the DSL router, the RPM status becomes FAIL and the preferred route is injected. When the DSL is back up again, the preferred route is removed and the original default route is reinstated.
The following output is from when the DSL router was down:
root@srx100> show services ip-monitoring status
Policy - Tracking-XS4ALL (Status: FAIL)
RPM Probes:
Probe name Test Name Address Status
---------------------- --------------- ---------------- ---------
XS4ALL testsvr 194.109.6.66 FAIL
Route-Action:
route-instance route next-hop state
----------------- ----------------- ---------------- -------------
inet.0 0.0.0.0/0 84.105.14.1 APPLIED
root@srx100> show route
inet.0: 22 destinations, 25 routes (22 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
0.0.0.0/0 *[Static/1] 00:00:15, metric2 0
> to 84.105.14.1 via fe-0/0/7.30
[Static/5] 02:12:15
> to 192.168.0.254 via fe-0/0/7.20
[Access-internal/12] 1d 19:00:56
> to 84.105.14.1 via fe-0/0/7.30
root@srx100> show services rpm probe-results
Owner: XS4ALL, Test: testsvr
Target address: 194.109.6.66, Probe type: icmp-ping
Destination interface name: fe-0/0/7.20
Test size: 10 probes
Probe results:
Response received, Fri Aug 16 17:28:44 2013, No hardware timestamps
Rtt: 19722 usec
Results over current test:
Probes sent: 4, Probes received: 4, Loss percentage: 0
Measurement: Round trip time
Samples: 4, Minimum: 19722 usec, Maximum: 21331 usec, Average: 20195 usec, Peak to peak: 1609 usec, Stddev: 659 usec, Sum: 80781 usec
Results over last test:
Probes sent: 10, Probes received: 10, Loss percentage: 0
Test completed on Fri Aug 16 17:28:19 2013
Measurement: Round trip time
Samples: 10, Minimum: 19590 usec, Maximum: 20712 usec, Average: 20214 usec, Peak to peak: 1122 usec, Stddev: 396 usec, Sum: 202139 usec
Results over all tests:
Probes sent: 1514, Probes received: 1459, Loss percentage: 3
Measurement: Round trip time
Samples: 1459, Minimum: 19203 usec, Maximum: 70250 usec, Average: 20276 usec, Peak to peak: 51047 usec, Stddev: 1681 usec, Sum: 29582747 usec
The fail-over takes around a minute (I missed ~60 ICMP replies while pinging 8.8.8.8) with the parameters provided in the example.
root$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: icmp_seq=0 ttl=49 time=24.004 ms
64 bytes from 8.8.8.8: icmp_seq=1 ttl=49 time=22.948 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=49 time=23.095 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=49 time=23.163 ms
Request timeout for icmp_seq 4
Request timeout for icmp_seq 5
Request timeout for icmp_seq 6
[...]
Request timeout for icmp_seq 61
Request timeout for icmp_seq 62
Request timeout for icmp_seq 63
64 bytes from 8.8.8.8: icmp_seq=64 ttl=50 time=17.203 ms
64 bytes from 8.8.8.8: icmp_seq=65 ttl=50 time=519.751 ms
64 bytes from 8.8.8.8: icmp_seq=66 ttl=50 time=97.280 ms
This could be improved by tweaking the probe values: with ten probes at five-second intervals per test, plus a ten-second pause between tests, a detection time in the order of a minute is about what you would expect. Switching back goes a lot faster once the failed link is restored.
When I checked Splunk for messages regarding the fail-over, the only thing I could find were the failed-ping messages from the RPM probe. Note that these messages come from the control plane of the SRX, so make sure you configure your logging accordingly.
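If you want those control-plane messages to end up in Splunk (or any other external collector), a minimal syslog export could look like the line below; 192.0.2.10 is a placeholder for your collector's address, and 'any info' simply forwards everything at informational level and above, which includes the RPM failure messages:
set system syslog host 192.0.2.10 any info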
The Splunk graph gives a nice representation of the availability of the primary ISP, or at least of the targeted server (a load-balanced DNS cluster-thingy, IIRC). A green bar (in this case) means that the primary ISP is not available.
This fail-over mechanism only covers traffic initiated from within the internal network. Traffic originating from the Internet is not aware of the change in public IP address, and will therefore not be able to reach the internal network (if inbound access is configured at all). A dynamic DNS service might provide a solution there.
Conclusion: the fail-over is not very fast in this case (although it can be improved with different probe parameters), but it sure beats manual labor.