Large scale distributed systems

Posted in systems-engineering with tags distributed-systems large-scale - Wednesday, August 1, 2018

It doesn’t matter how you arrived at the point of needing a large-scale distributed system: getting there means that your business is succeeding in generating the levels of throughput that demand these types of systems.

The challenge is to ensure that these systems continue to deliver value to the business and operate smoothly on the critical path.

The key observation is that these types of systems tend to elicit the same range of problems. Until these underlying issues are solved, it is difficult to focus on the higher-level processing requirements, be that supporting large volumes of concurrent users, enabling “big data” subsystems to interact, or offloading optimisation problems to “machine learning” subsystems.

While there is no single underlying issue, the types of problems that often arise are:

ensuring high availability during component failures, as the probability of a failure increases with the number of moving parts managed by the business.
ensuring high availability during component updates, as the rate of change increases with the number of changeable parts managed by the business.
ensuring low latency in response times as the system scales, while simultaneously the user base matures and expects a tighter adherence to SLAs.
ensuring system security as the system becomes more spread out and centralised controls become a bottleneck.
ensuring ongoing correctness of system functioning as the unavoidable system complexity increases.

For all of these types of issues, we can provide concrete input, guidance and expertise to help manage the complexity and meet the overarching system requirements that support your business.

Managing the Complexity

As highlighted above, many different types of systems run into similar constraints, and therefore similar problems arise. The exact mechanisms for tackling these issues are implementation-dependent. However, it is worth discussing, in broad terms, the types of approaches that could be considered.

System uptime becomes more critical as the system grows, primarily because more customers depend on these systems. Additionally, the customers tend to start placing your systems on their critical paths, so they also depend on the systems more heavily.

To address this, we need to start by assuming that:

“failure” is not an “exception” but part of the “steady-state” of the system.

With that out of the way, we can start to focus on keeping the system up during various types of disruptions, such as:

capacity constraints: every individual component has some limit to its capacity, and if that limit is reached then the system’s functioning will be disrupted. Any part of the system where an individual component’s capacity may be reached should therefore be treated as a candidate for partitioning into “shards”.
node failure: irrespective of the number of layers of indirection, ultimately computing systems run on hosts that may fail and cause disruptions. They may fail at the bare metal level, the virtual machine level, the container level or simply at the process level. We need to be able to detect these failures and replace the lost capacity, while offloading work and ensuring that we do not lose any critical state information. Some parts of the system might be amenable to a stateless design and are therefore easier to manage when failures occur. Other parts may need to expressly manage their state in a distributed manner, relying on consensus algorithms.
code changes: again, ultimately compute systems are composed of software applications that both need to run and need to be changeable in order to continue to meet the business needs. These changes need to be introduced without causing noticeable disruptions to the system. This can be achieved by leveraging the replicated nature of the system and doing controlled rollouts to replace components. Other techniques can also be brought in to further manage these changes.
communication bottlenecks: as the system grows, the amount of traffic moving through the system may start to challenge original implementations. Synchronous request handling can cause stalls. Fine-grained messages can consume capacity due to frame overheads. Connection setup times may start to dominate processing time. Techniques and mechanisms such as actor-model processing, asynchronous message handling, message boxcarring and channel multiplexing can be brought to bear on the problem and keep the system operating within tolerance as the load increases.
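
To make one of these techniques concrete, the following is a minimal sketch of message boxcarring, the batching of small messages into shared frames mentioned in the last item above. It is an illustration under stated assumptions rather than a recommended implementation: the Boxcar class, the wire_bytes helper and the 64-byte FRAME_OVERHEAD_BYTES figure are all hypothetical and chosen only to show how a fixed per-frame cost is amortised.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical fixed cost paid once per frame (headers, framing, etc.).
FRAME_OVERHEAD_BYTES = 64


@dataclass
class Boxcar:
    """Buffers small messages and hands them to send_frame as one batch."""

    send_frame: Callable[[List[bytes]], None]
    max_messages: int = 100
    max_bytes: int = 16_384
    _buffer: List[bytes] = field(default_factory=list, init=False)
    _buffered_bytes: int = field(default=0, init=False)

    def submit(self, message: bytes) -> None:
        # Flush first if this message would push the batch over its byte budget.
        if self._buffer and self._buffered_bytes + len(message) > self.max_bytes:
            self.flush()
        self._buffer.append(message)
        self._buffered_bytes += len(message)
        # Flush as soon as the batch reaches its message-count limit.
        if len(self._buffer) >= self.max_messages:
            self.flush()

    def flush(self) -> None:
        # Send whatever is buffered as a single frame, then reset the buffer.
        if not self._buffer:
            return
        batch, self._buffer, self._buffered_bytes = self._buffer, [], 0
        self.send_frame(batch)


def wire_bytes(frames: List[List[bytes]]) -> int:
    """Total bytes on the wire: one overhead per frame plus all payloads."""
    return sum(FRAME_OVERHEAD_BYTES + sum(len(m) for m in f) for f in frames)


if __name__ == "__main__":
    messages = [b"x" * 20 for _ in range(250)]

    # Baseline: every message travels in its own frame.
    unbatched = [[m] for m in messages]

    # Boxcarred: messages share frames of up to 100 messages each.
    batched: List[List[bytes]] = []
    boxcar = Boxcar(send_frame=batched.append, max_messages=100)
    for m in messages:
        boxcar.submit(m)
    boxcar.flush()  # send the final, partially filled batch

    print("unbatched bytes:", wire_bytes(unbatched))
    print("boxcarred bytes:", wire_bytes(batched))
```

The trade-off is latency: a message may wait in the buffer until the batch fills. A production version would typically also flush on a timer so that this wait stays bounded, which the sketch leaves out for brevity.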

Beyond directly tackling and mitigating various types of system disruptions, be they planned or unplanned, there are many other aspects of large-scale distributed systems that may need to be considered over the lifespan of the system:

data integrity and atomicity of operations
process monitoring and tracing
concurrent code paths and experimentation
data migration between systems with overlapping concerns
development life-cycles and approaches to testing
correctness testing and system health validation
complexity management and component tracking
dependency management
performance monitoring and tuning
system security and isolation
etc.

Ultimately business success drives the scale of the supporting software systems, but the ongoing viability and growth of those systems will reciprocally help to drive the success of your business.