Simplified Hosted PBX - One Way Audio - 24/07/2019 16:40
Posted by Adam Howard on 25/07/19 - 13:58
Good afternoon. We would like to start by apologising for the impact on service for resellers and customers late yesterday afternoon. After discussions with our platform partner, we have a full report on the incident, which is as follows.
Third-party DNS Server failure
We use a third-party DNS service, Quad9, with Google as a back-up. Quad9 had a failure and Google rate limited the number of accesses which could be made. This caused a number of knock-on effects for any services which use domain names rather than IP addresses (which is most of our services).
We were not expecting to be rate limited on our back-up server; however, mitigation is fairly simple, as described in detail below. We therefore do not expect a repeat of this issue.
We use two nameservers: Quad9 DNS (126.96.36.199) and Google Public DNS (188.8.131.52). Unfortunately, 184.108.40.206 suffered some issues, which caused a cascade effect across the services that rely on domain name resolution.
We failed over to the secondary nameserver; however, we did not realise that Google enforces a QPS (Queries Per Second) policy, and so it stopped responding to us.
To combat this, we will introduce Cloudflare (220.127.116.11) as our 2nd server and have Google as our 3rd. Cloudflare do not enforce a QPS limit.
We will also add static hosts file entries for our REST requests. That way, if DNS fails for whatever reason, REST requests will not suffer, since our own hosts files will take over.
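As a rough illustration of the mitigation described above, the sketch below tries an ordered list of resolvers and uses a static host table so that pinned names never depend on DNS. It is not the platform's actual implementation: the function resolve_with_fallback, the stub resolvers, the hostnames and the addresses are all hypothetical, and checking the static table first (as a hosts file usually is) is an assumption.

```python
from typing import Callable, Dict, Sequence

# A resolver takes a hostname and returns an IP address string.
# It is expected to raise OSError (or a subclass) when it cannot answer.
Resolver = Callable[[str], str]


class ResolutionError(Exception):
    """Raised when neither the static table nor any resolver can answer."""


def resolve_with_fallback(
    hostname: str,
    resolvers: Sequence[Resolver],
    static_hosts: Dict[str, str],
) -> str:
    # Assumption: static entries win, so pinned names never touch DNS.
    if hostname in static_hosts:
        return static_hosts[hostname]

    # Otherwise try each nameserver in order (1st, 2nd, 3rd, ...).
    for resolver in resolvers:
        try:
            return resolver(hostname)
        except OSError:
            # Timeout, refusal or rate limiting: move on to the next one.
            continue

    raise ResolutionError(f"no resolver could answer for {hostname}")


# Hypothetical stand-ins for real nameservers, for illustration only.
def failing_resolver(hostname: str) -> str:
    raise TimeoutError("primary nameserver not responding")


def rate_limited_resolver(hostname: str) -> str:
    raise ConnectionError("query rate limit exceeded")


def working_resolver(hostname: str) -> str:
    return "192.0.2.10"  # address from the documentation range


if __name__ == "__main__":
    order = [failing_resolver, rate_limited_resolver, working_resolver]
    pinned = {"rest.example.test": "192.0.2.20"}
    print(resolve_with_fallback("api.example.test", order, pinned))
    print(resolve_with_fallback("rest.example.test", order[:2], pinned))
```

In this sketch a pinned REST hostname still resolves even when every resolver in the list fails, which is the behaviour the static entries are meant to provide.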