Reliability

(see also Security and Ops)

Lower level mechanisms are generally more reliable

Lower level, widely used mechanisms are typically more reliable than higher level mechanisms. For instance, a cloud object storage service will probably be more reliable than a hosted authorization service, and sending a TCP packet is typically more reliable than calling a specialized gRPC endpoint at a hosted service. Why? Because one is widely used and the other is not.

The more we can standardize communication infrastructure, payloads, storage, etc., the easier it is to build reliable systems. For example, in the Simple IoT project, we standardize on points as the payload for communication and storage. If we can store, transfer, and synchronize points reliably, then the system generally works, and we only have to make modifications at the edges. However, if we need to store, transfer, and synchronize dozens of different payloads and mechanisms, and modify every part of the system every time we add a feature, then it is much harder to build a reliable system. This is why efforts like GraphQL and NATS are interesting -- they are standards for common problems, and because they are widely used they tend to be more reliable.
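
To make the idea of a single standardized payload concrete, here is a rough Go sketch of what a point-style structure might look like. The field names and types are illustrative for this example, not the exact Simple IoT definition.

```go
package main

import (
	"fmt"
	"time"
)

// Point is a sketch of a small, uniform payload that can represent a sensor
// reading, a configuration value, or a state change. Because every part of
// the system stores and transfers the same structure, the storage and sync
// machinery only has to be made reliable once.
// (Field names here are illustrative, not the exact Simple IoT definition.)
type Point struct {
	Time  time.Time // when the value was sampled or changed
	Type  string    // what the value means, e.g. "temperature"
	Key   string    // distinguishes multiple points of the same type
	Value float64   // numeric payload
	Text  string    // string payload for non-numeric data
}

func main() {
	// A temperature reading and a description update use the same structure,
	// so they flow through the same store/transfer/sync code paths.
	points := []Point{
		{Time: time.Now(), Type: "temperature", Value: 21.5},
		{Time: time.Now(), Type: "description", Text: "pump house sensor"},
	}
	for _, p := range points {
		fmt.Printf("%s type=%s value=%v text=%q\n",
			p.Time.Format(time.RFC3339), p.Type, p.Value, p.Text)
	}
}
```

Adding a feature then mostly means defining new point types at the edges of the system rather than new transport or storage code.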

Simplicity

The only path to reliability is simplicity. We generally can't fix problems by adding more "stuff". As time moves on, technologies designed for one problem domain are no longer adequate for new domains. We may try to solve this by adding additional layers, but this rarely works. Additional layers of abstraction are useful if they simplify our interactions with technology and the underlying technology is sound; however, if their purpose is to gloss over a fragile and complicated underlying system, the problem usually gets worse. Sometimes the best solution is to start over with something new that is better suited to the problem domain.

Essential and Accidental Complexity

Modern systems are complex because of the types of problems we are trying to solve. This is "essential" complexity -- we can't avoid it because it is part of the problem domain.

The type of complexity we must avoid is "accidental" (or "incidental") complexity -- complexity that comes from our tools, abstractions, and design choices rather than from the problem itself.

A system is only as good as its weakest link, and the problems often come from the areas we least expect. We can have replicated/redundant databases, load balancers in front of redundant web servers, etc., but one network problem can bring the whole thing down, which leads us to the fallacies of distributed computing:

  1. The network is reliable
  2. Latency is zero
  3. Bandwidth is infinite
  4. The network is secure
  5. Topology doesn't change
  6. There is one administrator
  7. Transport cost is zero
  8. The network is homogeneous

The Internet was designed to be redundant and reliable, and in many ways it accomplishes these goals very well. The problems most often come in at the edges of the network where there is no redundancy, or when we depend on a single company's WAN for security, etc.

Networks can also be compromised by denial-of-service attacks, which can drive latency up significantly. A service may fail if increased latency causes communication timeouts, or if we can't get the throughput we need.
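
One practical consequence is that every network call needs an explicit timeout, and operations that are safe to repeat need a bounded retry policy. The following Go sketch shows this pattern using only the standard library; the URL, timeout, and retry counts are illustrative.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// fetchWithRetry makes a GET request with a per-attempt timeout and a small
// number of retries with linear backoff. It assumes the request is
// idempotent, so retrying is safe.
func fetchWithRetry(url string, attempts int) (*http.Response, error) {
	// Never assume zero latency: bound every attempt with a timeout.
	client := &http.Client{Timeout: 5 * time.Second}

	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err == nil {
			return resp, nil
		}
		// Never assume the network is reliable: back off and try again.
		lastErr = err
		time.Sleep(time.Duration(i+1) * time.Second)
	}
	return nil, fmt.Errorf("all %d attempts failed: %w", attempts, lastErr)
}

func main() {
	resp, err := fetchWithRetry("https://example.com/health", 3)
	if err != nil {
		fmt.Println("service unreachable:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```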

Does using multiple services/computers increase reliability?

If we are building a system, it may be tempting to outsource everything we can (auth, storage, ingress, notifications, etc.) to 3rd party services with the assumption that not everything will go down at once if something breaks. There are good reasons to use services like Twilio, which provide a gateway to external systems like SMS and the telephone network that are impossible for small organizations to build on their own. But the reality is that if one of these services goes down, there is a good chance your system will not be usable anyway. Much of the time this "diversity" does not really buy us much; instead it introduces more risk and cost, because there are now many more network connections between services.

The same question applies to the decision between a microservices and a monolith architecture. To misquote a wise man:

Some people, when faced with a coupling problem, think 'I know, I’ll use microservices!'. They now have two problems.

Others have written extensively on this subject.

Some problems are inherently distributed:

  1. IoT systems where devices are physically separated by some distance
  2. Browsers and web servers
  3. Applications that reach large scale

If you are not Google scale, then perhaps you should put everything you can on one server, run backups, and be done with it. Technologies like Litestream make this very practical.
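
As a sketch of how little is involved, a minimal Litestream configuration that continuously replicates a local SQLite database to object storage looks roughly like this (the database path and bucket name are placeholders):

```yaml
# litestream.yml -- minimal sketch; the path and bucket are illustrative
dbs:
  - path: /var/lib/myapp/app.db
    replicas:
      - url: s3://my-backup-bucket/app.db
```

Restoring onto a replacement server is then essentially a `litestream restore` away.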

Examples

Database connectivity issues

A system was built using a database that was hosted by the company that produced the database. Multiple database nodes were used for redundancy to prevent data loss, downtime, etc. At one point, a service started losing its connection to the database, and the only way to recover was to restart the service. The db hosting company suspected network issues between the cloud providers that hosted the service and the database, and could not provide any help beyond that. They suggested moving the service to the same cloud/region as the db. This was not practical to do quickly for multiple reasons, so a watchdog was implemented that restarted the service when the db error count started to rise. After a week or so the situation resolved itself.
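
The watchdog itself can be very simple. The Go sketch below shows the general idea, assuming the service increments a shared error counter whenever a database call fails and that a restartService hook exists -- both names are hypothetical for this example.

```go
package main

import (
	"log"
	"sync/atomic"
	"time"
)

// dbErrors is incremented by the service whenever a database call fails.
var dbErrors int64

// restartService is a hypothetical hook that restarts the service; in
// practice this might just be os.Exit(1) with a process supervisor
// (systemd, Docker, etc.) configured to restart the process.
func restartService() {
	log.Println("db error threshold exceeded, restarting service")
}

// watchdog checks the error count periodically and restarts the service if
// too many db errors accumulated during the last interval.
func watchdog(threshold int64, interval time.Duration) {
	for range time.Tick(interval) {
		errs := atomic.SwapInt64(&dbErrors, 0) // read and reset the counter
		if errs > threshold {
			restartService()
		}
	}
}

func main() {
	go watchdog(10, time.Minute)

	// ... the rest of the service runs here, calling
	// atomic.AddInt64(&dbErrors, 1) whenever a database operation fails.
	select {}
}
```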

Network issues in a cloud provider

In another case, users could not log into a hosted IoT service. The root cause was a connectivity issue inside the cloud hosting company the service used. The operators were able to resolve the issue by resetting the affected microservices. The same service was interrupted a few days later when the cloud company made changes to their WAN that impacted network connectivity between clients on the Internet and the cloud service.

In both of these cases, the network was the culprit, and it was beyond the control of any of the operations people involved. The only way to reduce these risks is to move services closer together -- same cloud region, same data center, same subnet, same machine. While it may seem risky to put everything on one machine, at least you have control of that machine and can spin up another if needed.