Who is behind the biggest crash of Facebook and WhatsApp in history?
Facebook mistakenly updates BGP and drags DNS with it
It seems that at around 11:30 a.m. EDT on Monday, the Facebook team made a significant change to its BGP configuration that took its DNS off the Internet along with it, causing the Facebook, WhatsApp, and Instagram domains to disappear completely.
The Internet is made up of thousands of autonomous systems, also known as AS. These autonomous systems use the BGP protocol to communicate with one another and exchange routes. When we connect to Facebook, the first step is to query the DNS servers, which return the public IP address we need to reach. The packets are then routed from the origin of the connection (us) to the destination through several intermediate routers, and every one of these routers must hold the routes needed to forward us toward the destination: the Facebook servers.
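To make the role of routes concrete, here is a minimal sketch of how a router without a default route picks a next hop by longest-prefix match, and what happens once a prefix is withdrawn. The prefixes are reserved documentation ranges and the peer names are hypothetical; none of this reflects Facebook's real routing data.

```python
import ipaddress


def lookup(routes, destination):
    """Return the next hop for the longest matching prefix, or None if no route exists."""
    addr = ipaddress.ip_address(destination)
    best = None  # (network, next_hop) of the most specific match so far
    for prefix, next_hop in routes.items():
        network = ipaddress.ip_network(prefix)
        if addr in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, next_hop)
    return best[1] if best else None


# Hypothetical routing table learned from neighbouring autonomous systems.
routes = {
    "198.51.100.0/24": "peer-A",
    "198.51.100.128/25": "peer-B",  # more specific, so it wins for .128-.255
    "203.0.113.0/24": "peer-C",
}

print(lookup(routes, "198.51.100.200"))  # peer-B
print(lookup(routes, "203.0.113.10"))    # peer-C

# The origin AS withdraws its prefix: the router no longer knows where to send packets.
del routes["203.0.113.0/24"]
print(lookup(routes, "203.0.113.10"))    # None
```

The servers behind the withdrawn prefix can be perfectly healthy; from the point of view of this router they have simply stopped existing.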
Although the servers themselves are still up and running without problems, Facebook relies on its own DNS servers to reach its services, and since that DNS is not working, no one can reach the destination. If we try to look up a Facebook domain from our own Internet connection, the DNS server will tell us that the Facebook domain, or any other related domain, cannot be found.
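The lookup failure described above can be sketched with Python's standard `socket.getaddrinfo`. The `resolve` helper and the two stand-in resolvers are hypothetical names introduced for illustration; the stand-ins let the example run without touching the network, and the exact error a real client sees depends on its resolver.

```python
import socket


def resolve(hostname, resolver=socket.getaddrinfo):
    """Return the sorted IP addresses for hostname, or None if the lookup fails."""
    try:
        results = resolver(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The resolver could not produce an answer for this name.
        return None
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in results})


def failing_resolver(host, port, **kwargs):
    """Stand-in that behaves like a resolver during the outage (assumption)."""
    raise socket.gaierror(socket.EAI_NONAME, "Name or service not known")


def working_resolver(host, port, **kwargs):
    """Stand-in that returns one made-up documentation address."""
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", ("192.0.2.1", 443))]


print(resolve("facebook.com", resolver=working_resolver))  # ['192.0.2.1']
print(resolve("facebook.com", resolver=failing_resolver))  # None

# Against the real network you would simply call: resolve("facebook.com")
```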
The failure that brought down the entire Facebook platform is a faulty BGP update, which also made it impossible to access the affected systems remotely to fix the problem. When a change is made in BGP, it is quickly propagated to all the other routers involved. Facebook staff have been in the data centers for hours, physically trying to solve the problem. However, the people with the knowledge and credentials needed to authenticate on the systems and make the changes are working remotely from their homes and, of course, they cannot reach Facebook's network to fix it.
What happened is similar to configuring the firewall of a remote server over SSH and blocking ourselves by mistake. In this case, once the BGP update was made and the new routes propagated rapidly, there was simply no longer a "path" to reach these machines. The changes cannot be rolled back remotely because connectivity has been lost.
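The lock-out can be modelled in a few lines. This is a toy sketch under simple assumptions: the management host lives inside one of the announced prefixes, and a remote change can only be sent while that host is reachable. The function names, prefixes, and address are all hypothetical.

```python
import ipaddress


def reachable(announced, address):
    """True if address falls inside at least one announced prefix."""
    addr = ipaddress.ip_address(address)
    return any(addr in ipaddress.ip_network(prefix) for prefix in announced)


def apply_remote_change(announced, new_announced, admin_target):
    """Replace the announced prefixes, but only if the admin host can still be reached."""
    if not reachable(announced, admin_target):
        raise ConnectionError("no route to management host")
    return set(new_announced)


GOOD_CONFIG = {"192.0.2.0/24", "198.51.100.0/24"}
MGMT_HOST = "192.0.2.10"  # hypothetical management address inside our own prefix

announced = set(GOOD_CONFIG)

# The faulty update withdraws every prefix, including the one we manage through.
announced = apply_remote_change(announced, set(), MGMT_HOST)

# Attempting the rollback remotely now fails: the path we need is gone.
try:
    announced = apply_remote_change(announced, GOOD_CONFIG, MGMT_HOST)
    rolled_back = True
except ConnectionError:
    rolled_back = False

print(rolled_back)  # False
```

In this model the only way out is to change `announced` without going through `apply_remote_change`, which corresponds to someone working on the equipment in person.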
Facebook uses its own DNS for absolutely everything: WhatsApp, VoIP calls, Facebook's internal email, and so on. Therefore, if the DNS goes down, the means of fixing it remotely goes down too. And because Facebook has very tight security to prevent attacks, and even to prevent its own employees from making critical changes, only a few people have the knowledge and the credentials needed to get in and fix it.
What if it really was an attack?
On the Internet, it is said that the Anonymous group has attacked Facebook. If an attack had seriously compromised the company's infrastructure, the most logical response would be to cut off all communications at once, which is precisely what Facebook has done by updating its BGP configuration to withdraw its routes from routers around the world. For a company the size of Facebook, whose engineers have years of experience and are among the best in the world in their field, it is quite strange that a BGP update would go so badly as to cut off all communication with the outside world. Unless it was done for a good reason: a major hack.
Other services have problems too
Other services such as Google and Telegram are also having some stability issues, and they may be the next to crash. Right now these services are not working entirely correctly: for example, they do not allow you to download or upload photos, and Google searches also return an error on some occasions. If you have a smartphone, it may well tell you that you are connected to a WiFi network without an Internet connection. This is because phones verify the Internet connection by contacting Google servers, and it seems that these are down or not working properly.
The reason for these problems is that people keep trying to reach Facebook's domains continuously and in an “avalanche”. The DNS servers cannot resolve these domains and become overloaded with requests, and this is why it sometimes seems that other services have crashed when they have not.
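A back-of-the-envelope sketch shows why retries weigh so heavily on DNS servers. All the numbers below are assumptions chosen for illustration, not measurements: a resolver that normally asks once per cached answer lifetime ends up asking every few seconds when there is no answer to cache.

```python
def queries_per_second(clients, seconds_between_queries):
    """Average query rate when each client asks once per interval."""
    return clients / seconds_between_queries


def amplification(normal_interval, retry_interval):
    """How many times more queries are sent when clients retry instead of caching."""
    return normal_interval / retry_interval


CLIENTS = 1_000_000    # hypothetical number of clients behind one resolver
CACHE_SECONDS = 300    # assumed lifetime of a cached answer in normal operation
RETRY_SECONDS = 5      # assumed retry interval while lookups keep failing

print(round(queries_per_second(CLIENTS, CACHE_SECONDS)))  # 3333
print(round(queries_per_second(CLIENTS, RETRY_SECONDS)))  # 200000
print(amplification(CACHE_SECONDS, RETRY_SECONDS))        # 60.0
```

Under these assumed figures the same population of clients generates sixty times its usual DNS load, which is enough to slow down lookups for unrelated services that share the same resolvers.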