Managing the Complexity
As highlighted above, many different types of systems run into similar constraints and therefore face similar problems. The exact mechanisms for tackling these issues depend on the implementation. However, it is worth discussing, in broad terms, the types of approaches that could be considered.
System uptime becomes more critical as the system grows, primarily because more customers depend on it. Customers also tend to start placing your systems on their own critical paths, so each of them depends on the systems more heavily.
To address this, we need to start by assuming that:
“failure” is not an “exception” but part of the “steady-state” of the system.
With that assumption in place, we can start to focus on keeping the system up during various types of disruptions, such as:
- capacity constraints: every individual component has some limit to its capacity, and if that limit is reached the system’s functioning will be disrupted. Any part of the system where an individual component’s capacity may be reached should therefore be treated as a candidate for partitioning into “shards.”
- node failure: irrespective of the number of layers of indirection, computing systems ultimately run on hosts that may fail and cause disruptions. They may fail at the bare-metal level, the virtual machine level, the container level, or simply at the process level. We need to be able to detect these failures and replace the lost capacity, while offloading work and ensuring that we do not lose any critical state information. Some parts of the system may be amenable to a stateless design and are therefore easier to manage during a failure. Other parts may need to manage their state explicitly in a distributed manner, relying on consensus algorithms.
- code changes: computing systems are ultimately composed of software applications that need both to run and to be changeable in order to keep meeting business needs. These changes need to be introduced without causing noticeable disruptions to the system. This can be achieved by leveraging the replicated nature of the system and replacing components through controlled rollouts, and other techniques can be brought in to manage these changes further.
- communication bottlenecks: as the system grows, the amount of traffic moving through it may start to challenge the original implementations. Synchronous request handling can cause stalls. Fine-grained messages can consume capacity due to frame overheads. Connection setup times may start to dominate processing time. Techniques and mechanisms such as actor-model processing, asynchronous message handling, message boxcarring, and channel multiplexing can be brought to bear on the problem to keep the system operating within tolerance as the load increases.
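As one illustration of the techniques named in the last bullet, message boxcarring can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a definitive implementation: the `Boxcar` class, its `send_batch` callback, and the threshold values are hypothetical names and numbers chosen for illustration. The idea is simply to buffer fine-grained messages and hand them to the transport as one batch, so that per-frame overhead is paid once per batch rather than once per message.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Boxcar:
    """Buffers small messages and hands them off together as one batch."""

    # Hypothetical transport callback: receives the whole batch in one call.
    send_batch: Callable[[List[bytes]], None]
    max_messages: int = 64    # flush once this many messages are buffered
    max_bytes: int = 16_384   # or once the buffered payload reaches this size
    _pending: List[bytes] = field(default_factory=list)
    _pending_bytes: int = 0

    def submit(self, message: bytes) -> None:
        """Queue a message, flushing if either threshold has been reached."""
        self._pending.append(message)
        self._pending_bytes += len(message)
        if (len(self._pending) >= self.max_messages
                or self._pending_bytes >= self.max_bytes):
            self.flush()

    def flush(self) -> None:
        """Send everything buffered so far as a single batch."""
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self._pending_bytes = 0
        self.send_batch(batch)


if __name__ == "__main__":
    sent: List[List[bytes]] = []
    boxcar = Boxcar(send_batch=sent.append, max_messages=3)
    for i in range(7):
        boxcar.submit(f"event-{i}".encode())
    boxcar.flush()  # push out the final partial batch
    print([len(batch) for batch in sent])
```

A production version would probably also flush on a timer, so that a lone message is not held indefinitely while waiting for the batch to fill. That bound on added latency is the main trade-off to tune against the throughput gained.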
Beyond directly tackling and mitigating the various types of system disruptions, planned or unplanned, there are many other aspects of large-scale distributed systems that may need to be considered over the lifespan of the system:
- data integrity and atomicity of operations
- process monitoring and tracing
- concurrent code paths and experimentation
- data migration between systems with overlapping concerns
- development life-cycles and approaches to testing
- correctness testing and system health validation
- complexity management and component tracking
- dependency management
- performance monitoring and tuning
- system security and isolation
- etc.
Ultimately, business success drives the scale of the supporting software systems, and the ongoing viability and growth of those systems will in turn help to drive the success of your business.