What makes your system more reliable

Mar 27, 2022

When we think about reliability, we often jump straight to technical solutions and tools. You then improve your application and system with great solutions, but at a cost. There should be a balance between reliability and cost, where cost means not only hardware and implementation but also performance, operating cost, and so on. Your site reliability plan needs to cover every point and hold together logically. Imagine you spend x million dollars to scale out your computing system, say with a minimum of 10 and a maximum of 100 servers. Then you find out that your application generates tons of garbage logs and data that are never going to be used. It may have been enough to simply get rid of the useless load, leaving you with only 2 servers needed for resilience.

The descriptions below don’t cover day-to-day operational reliability (deployment, monitoring, etc.).

Optimize Load

Why do we care about load (transactions)? There are a lot of potential risks if we send data unnecessarily.

Prioritization: When your system has to handle tons of load, it needs to know the priority of each request. Otherwise, critical traffic may get buried behind worthless load, and your system may fail a critical transaction because of a meaningless status update.
Throttling: Heavy load limits the bandwidth available to users and slows the application down, so non-essential traffic should be rate-limited.
Cost: Whatever you send from your applications to servers, and from servers to data storage, you can treat as cost. Think hard about how much value the data you produce will bring, because you will eventually be charged for every transaction and every byte that passes through your application and system.
Security: Often, a security breach does not come from the data you focus on. The critical data that the whole company watches is rarely handled carelessly. Many breaches start somewhere you don’t consider important, such as a non-essential library, an application, or log data. So producing only essential, value-bringing data and transactions is paramount to keeping your service secure.
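
To make the prioritization point concrete, here is a minimal sketch of a bounded inbox that serves critical requests first and, when full, sheds the least important item instead of the critical one. The class name PriorityInbox, its methods, and the priority numbers are all hypothetical choices for illustration, not part of any particular framework.

```python
import heapq
import itertools


class PriorityInbox:
    """Bounded inbox: most critical request is served first, least critical is shed when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        # Entries are (priority, sequence, payload); a lower priority number is more critical.
        self._heap = []
        # The sequence number keeps first-in-first-out order within one priority level.
        self._seq = itertools.count()

    def offer(self, priority, payload):
        """Add a request. Returns the payload that was dropped, or None if nothing was dropped."""
        entry = (priority, next(self._seq), payload)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
            return None
        worst = max(self._heap)  # least critical (and newest) queued entry
        if entry >= worst:
            return payload  # incoming request is not more important than anything queued
        self._heap.remove(worst)
        heapq.heapify(self._heap)
        heapq.heappush(self._heap, entry)
        return worst[2]

    def take(self):
        """Remove and return the most critical queued payload."""
        return heapq.heappop(self._heap)[2]


inbox = PriorityInbox(capacity=2)
inbox.offer(5, "status update A")
inbox.offer(5, "status update B")
dropped = inbox.offer(0, "payment")  # the inbox is full, so a status update is shed
```

With this design the payment is handled first even though it arrived last, and the cost of overload falls on the status updates.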

Lower Latency

Do you think latency is related to reliability? My view is ‘Yes’.

I like playing games, boxing games for example. If the character throws a punch one second after I press the button, I will delete the game right away. The game may be perfectly reliable in the usual sense, with no downtime and no disconnections, but I would never call it a reliable game.

The key is to capitalize on low latency, as it is directly connected to users’ adoption and willingness to pay for the service. In most cases latency depends on network performance (e.g. internet, routing), but there are a lot of ways to improve the latency of your service.

Caching
Replication
CDN

These are techniques that lower the latency of your service.
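
As an illustration of the first technique, caching, here is a minimal sketch of an in-process cache with a time-to-live. The names TTLCache and get_or_load are hypothetical, and the loader stands in for whatever slow call (database, remote API) your service makes; a real deployment would more likely use a shared cache, which this sketch does not attempt to model.

```python
import time


class TTLCache:
    """Keeps loaded values for ttl_seconds so repeated reads skip the slow lookup."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the expiry logic can be tested without waiting
        self._store = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # fresh entry: no call to the slow backend
        value = loader(key)  # missing or expired: pay the latency once
        self._store[key] = (now + self.ttl, value)
        return value
```

The time-to-live is the trade-off knob: a longer value means fewer slow lookups but staler data.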

High Redundancy (“I will die alone” + “In fact, I have my twin”): more techniques

Circuit breakers: AWS Circuit Breaker pattern
Bulkheads: AWS Bulkhead pattern
Failure Handling: Retries, Timeouts, Jitter
Failovers: When your system is not able to serve requests, you should think about how to ‘fail’ → ‘over’. There are a lot of failover patterns: HA (active-active, active-passive) and regional DR.
Failbacks: Failover should be a temporary solution that keeps your business running as normal and avoids x days of downtime until your system has fully recovered. Therefore, once you have failed over, your system needs to prepare for failback by reversing the direction of data replication, from the target machine back to the source machine.
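
Two of the items above, retries with jitter and circuit breakers, are small enough to sketch in a few lines. This is a simplified illustration under my own assumptions, not the AWS reference implementation: the names call_with_retries, CircuitBreaker, and CircuitOpenError are hypothetical, and the thresholds and delays are arbitrary defaults. Timeouts are left to the operation being called.

```python
import random
import time


def call_with_retries(operation, attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(); after a failure, wait a random time under a growing cap and retry."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)  # exponential backoff
            sleep(rng() * cap)  # jitter, so many clients do not retry in lockstep


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the operation."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down has elapsed: let one trial call through (half-open).
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

The two fit together: retries absorb short glitches, while the breaker stops retries from hammering a dependency that is clearly down.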