Production ready systems

Posted in systems-engineering with tags production operational-overhead

There is obvious value in having code reach production rapidly. However, it is important to make sure that these systems do not continually come back to haunt the software engineering team. When they do, the business is prevented from deriving the full intended value from the systems, and its ability to redirect engineering resources to new avenues and opportunities is hampered.

In short, how do you keep the operational load of your software engineering team to a minimum, so as to increase business value?

I believe that this can be achieved to a large degree by ensuring that we design and build “production ready” systems. Such systems would, at a minimum, meet the basic feature requirements of the business case. Additionally, production ready systems meet a broader range of requirements which enable them to operate continuously in production with minimal intervention from the software engineers and without other ongoing, heavy-handed operational support.

I work with software teams to ensure that architectural designs and implementations include a range of mechanisms that make the resulting system ready to run in production.

Some of the considerations are covered in more detail below, but the focus is always on building a more complete and self-sustaining system. This system therefore renders value to the business, while freeing up the limited software engineering resources to focus on the next business phase, or to help the business experiment with new opportunities.


Guidelines for production

The guiding question for members of the development team is:

“How do I ensure that the automated software system being built does not require manual intervention once it is running?”

Clearly part of the answer is grounded in testing and verification, so that we can ensure that the software is performing as expected and intended. However, another important part of the answer comes from understanding how to ensure that the system operation is bounded and closed. That is, will the system continue to run smoothly within the fixed resource allocation given to it, and do the mechanisms designed into the system provide sufficient closure to support its ongoing operation?

Rather than trying to answer these questions directly, I find that it is more effective to work a number of concrete mechanisms into the development process and system implementation, such as:

  • unit tests: while this is an industry standard by now, it is always important to be explicit about the need for automated testing of isolated code.

  • telemetry: explicitly introduce mechanisms for collecting and aggregating statistics pertaining to the performance characteristics of the system.

  • control surfaces: explicitly introduce mechanisms that make it possible to parameterise data flows within production code and manage the overlap of multiple code paths.

  • probers: explicitly perform repeated tests against the production deployment of the system and validate correctness and adherence to SLAs.

  • deployment descriptors: explicitly capture the structure and interdependencies of subsystems in a manner that encodes meaning at the level of the system architecture and design.

  • process tracing: explicitly capture abstractions of data flows across multiple subsystems so as to validate correctness and better understand the performance characteristics or failure states of the system.

  • operational closure: explicitly review system control mechanisms with the intent of ensuring that APIs will support the full life-cycle of the system as it runs in production, and not just the initial bootstrap.

  • system decomposition: explicitly review the concrete structuring of the system and create encapsulated subsystems.

  • capability models: explicitly decouple subsystem access from authentication mechanisms.

  • governors: explicitly capture the need for continuous automated monitoring and self-correction by having components that manage the system.
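To make two of these mechanisms more concrete, here is a minimal sketch of telemetry combined with a prober. It is an illustration only: the names Telemetry, probe, and sla_seconds are hypothetical and not taken from any particular system, and a real deployment would export the statistics to an aggregation service rather than hold them in memory.

```python
import time
from collections import defaultdict
from statistics import mean


class Telemetry:
    """Collects counters and timing samples in memory (hypothetical helper)."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def count(self, name, amount=1):
        self.counters[name] += amount

    def observe(self, name, seconds):
        self.timings[name].append(seconds)

    def summary(self, name):
        samples = self.timings[name]
        if not samples:
            return {"n": 0}
        return {"n": len(samples), "mean": mean(samples), "max": max(samples)}


def probe(check, telemetry, sla_seconds, clock=time.monotonic):
    """Run one probe against the system.

    `check` is any callable that exercises the production deployment and
    returns True on a correct response. The probe passes only if the check
    succeeds and completes within the assumed latency SLA.
    """
    start = clock()
    try:
        ok = bool(check())
    except Exception:
        ok = False  # an exception counts as a failed probe
    elapsed = clock() - start
    telemetry.observe("probe.latency", elapsed)
    within_sla = ok and elapsed <= sla_seconds
    telemetry.count("probe.pass" if within_sla else "probe.fail")
    return within_sla
```

In practice a scheduler would call probe repeatedly against the live deployment, and the pass and fail counters would feed alerting or a governor.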

These mechanisms help to ensure not only that good quality software reaches production, but also that the software systems will continue to operate within the necessary capacity constraints, and that automation can be leveraged to mitigate problems before they become failures.
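As a rough sketch of what operating within capacity constraints with automated mitigation might look like, consider a bounded work queue watched by a governor. The class names, the high and low water marks, and the policy of shedding the oldest work are all assumptions made for illustration; the right policy depends on the system.

```python
from collections import deque


class BoundedQueue:
    """FIFO queue with a hard capacity; rejects new work when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.rejected = 0

    def offer(self, item):
        if len(self.items) >= self.capacity:
            self.rejected += 1  # bounded: never grow past capacity
            return False
        self.items.append(item)
        return True


class Governor:
    """Periodically inspects the queue and sheds load before it saturates."""

    def __init__(self, queue, high_water, low_water):
        self.queue = queue
        self.high_water = high_water
        self.low_water = low_water
        self.shed = 0

    def tick(self):
        """Drop the oldest items down to low_water once high_water is reached.

        Returns the number of items dropped on this tick.
        """
        if len(self.queue.items) < self.high_water:
            return 0
        dropped = 0
        while len(self.queue.items) > self.low_water:
            self.queue.items.popleft()
            dropped += 1
        self.shed += dropped
        return dropped
```

The queue gives the hard bound, while the governor acts before the bound is hit, so that the system corrects itself rather than waiting for an engineer to intervene.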

Furthermore, any problems that do arise can be managed more effectively and dynamically, by deriving useful information from the system in order to resolve issues, and by making amendments with minimal impact.
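One way to make amendments with minimal impact is through a control surface that shifts traffic between overlapping code paths at runtime. The following is a minimal sketch under stated assumptions: ControlSurface, use_new_path, and the new_path_percent parameter are hypothetical names, and the values are held in memory rather than in a shared configuration store.

```python
import zlib


class ControlSurface:
    """Runtime-adjustable parameters (hypothetical; in-memory only)."""

    def __init__(self, **defaults):
        self._values = dict(defaults)

    def set(self, name, value):
        # Only parameters declared up front may be changed.
        if name not in self._values:
            raise KeyError(name)
        self._values[name] = value

    def get(self, name):
        return self._values[name]


def use_new_path(controls, request_id):
    """Deterministically route a percentage of requests to the new code path.

    The same request_id always falls in the same bucket, so raising the
    percentage only ever adds requests to the new path.
    """
    bucket = zlib.crc32(request_id.encode("utf-8")) % 100
    return bucket < controls.get("new_path_percent")
```

With this in place an operator can start at 0 percent, raise the value gradually while watching telemetry, and return to 0 immediately if the new path misbehaves, all without a redeployment.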

Thus, the value of these mechanisms is often not immediately apparent in terms of the business requirements, but they help to ensure that the systems are more resilient and reliable.

This is important when seen in the context of the business, where the value of the system to the business will only be realised if the system can run smoothly, and with minimal overhead, in production.

The reduced overhead can then be translated into future opportunity as software engineering resources are directed towards extending systems and building new systems.

Written by Stewart Gebbie