Guidelines for production
The guiding question for members of the development team is:
“How do I ensure that the automated software system being built does not require manual intervention once it is running?”
Clearly, part of the answer is grounded in testing and verification, so that we can ensure that the software is performing as expected and intended. Interestingly, however, another important part of the answer comes from understanding how to ensure that the system's operation is bounded and closed. That is, will the system continue to run smoothly within the fixed resource allocation given to it, and do the mechanisms designed into the system provide sufficient closure to support its ongoing operation?
Rather than trying to answer these questions directly, I find that it is more effective to work a number of concrete mechanisms into the development process and system implementation, such as:
- unit tests: while this is an industry standard by now, it is always important to be explicit about the need for automated testing of isolated code (sketch below).
- telemetry: explicitly introduce mechanisms for collecting and aggregating statistics pertaining to the performance characteristics of the system (sketch below).
- control surfaces: explicitly introduce mechanisms that make it possible to parameterise data flows within production code and manage the overlap of multiple code paths (sketch below).
- probers: explicitly perform repeated tests against the production deployment of the system and validate correctness and adherence to SLAs (sketch below).
- deployment descriptors: explicitly capture the structure and interdependencies of subsystems in a manner that encodes meaning at the level of the system architecture and design (sketch below).
- process tracing: explicitly capture abstractions of data flows across multiple subsystems so as to validate correctness and better understand the performance characteristics or failure states of the system (sketch below).
- operational closure: explicitly review system control mechanisms with the intent of ensuring that APIs will support the full life-cycle of the system as it runs in production, and not just the initial bootstrap.
- system decomposition: explicitly review the concrete structuring of the system and create encapsulated subsystems.
- capability models: explicitly decouple subsystem access from authentication mechanisms (sketch below).
- governors: explicitly capture the need for continuous automated monitoring and self-correction by having components that manage the system (sketch below).
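To make several of these mechanisms concrete, the sketches below show one possible shape each can take, in Python and using only the standard library. They are minimal illustrations under stated assumptions rather than definitive implementations. First, a unit test: the function under test, normalise_order_id, is a hypothetical piece of isolated production logic.

```python
import unittest

def normalise_order_id(raw: str) -> str:
    """Hypothetical piece of isolated production logic under test."""
    return raw.strip().upper()

class NormaliseOrderIdTest(unittest.TestCase):
    def test_strips_surrounding_whitespace(self):
        self.assertEqual(normalise_order_id("  ab-123 "), "AB-123")

    def test_upper_cases_letters(self):
        self.assertEqual(normalise_order_id("ab-123"), "AB-123")

if __name__ == "__main__":
    unittest.main()
```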
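For telemetry, a minimal sketch is an in-process aggregator that counters and latency samples are funnelled through and periodically exported. The Metrics class, its method names, and the metric names are illustrative assumptions, not a particular library's API.

```python
import threading
import time
from collections import defaultdict

class Metrics:
    """In-process aggregation of counters and latency samples."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counters = defaultdict(int)
        self._latencies_ms = defaultdict(list)

    def incr(self, name: str, by: int = 1) -> None:
        with self._lock:
            self._counters[name] += by

    def observe_ms(self, name: str, value_ms: float) -> None:
        with self._lock:
            self._latencies_ms[name].append(value_ms)

    def snapshot(self) -> dict:
        # Periodically exported to whatever aggregation backend is in use.
        with self._lock:
            return {
                "counters": dict(self._counters),
                "mean_latency_ms": {
                    name: sum(vals) / len(vals)
                    for name, vals in self._latencies_ms.items() if vals
                },
            }

METRICS = Metrics()

def handle_request() -> None:
    start = time.monotonic()
    METRICS.incr("requests_total")
    # ... the actual request handling would happen here ...
    METRICS.observe_ms("request_latency", (time.monotonic() - start) * 1000)
```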
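Control surfaces often take the form of runtime flags that parameterise which code path a request follows, so that an existing and a new path can overlap during a rollout. The flag store, the flag name, and the ranker stubs below are hypothetical.

```python
import hashlib

# Hypothetical flag store; in practice this would be updatable at runtime
# without a redeploy (e.g. from a configuration service).
FLAGS = {
    "use_new_ranker": {"enabled": True, "rollout_percent": 10},
}

def flag_enabled(name: str, request_id: str) -> bool:
    flag = FLAGS.get(name)
    if not flag or not flag["enabled"]:
        return False
    # Deterministic bucketing: the same request id always takes the same path,
    # which keeps the overlap of the two code paths predictable.
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return bucket < flag["rollout_percent"]

def legacy_ranker(results):   # existing code path (stub)
    return sorted(results)

def new_ranker(results):      # new code path being rolled out (stub)
    return sorted(results, reverse=True)

def rank(results, request_id: str):
    if flag_enabled("use_new_ranker", request_id):
        return new_ranker(results)
    return legacy_ranker(results)
```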
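A prober can be a loop that repeatedly exercises the production deployment and checks both correctness and latency against the SLA. The endpoint URL, thresholds, and alerting behaviour below are hypothetical.

```python
import time
import urllib.request

PROBE_URL = "https://example.com/healthz"  # hypothetical production endpoint
SLA_LATENCY_S = 0.5                        # hypothetical latency SLA
PROBE_INTERVAL_S = 60

def probe_once() -> bool:
    """Return True if the endpoint answered correctly and within the SLA."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(PROBE_URL, timeout=5) as response:
            ok = response.status == 200
    except OSError:
        ok = False
    return ok and (time.monotonic() - start) <= SLA_LATENCY_S

def run_prober() -> None:
    while True:
        if not probe_once():
            # In a real deployment this would raise an alert rather than print.
            print("probe failed: correctness or SLA check did not pass")
        time.sleep(PROBE_INTERVAL_S)
```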
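A deployment descriptor can be captured as plain data that names each subsystem, its dependencies, and its resource needs, and that can be validated mechanically. The subsystem names and fields below are hypothetical.

```python
# Hypothetical descriptor: each subsystem declares what it depends on and the
# resources it needs, so the architecture is captured as data rather than
# implied by deployment scripts.
DEPLOYMENT = {
    "api-gateway":   {"depends_on": ["order-service"], "replicas": 3},
    "order-service": {"depends_on": ["order-store"],   "replicas": 2},
    "order-store":   {"depends_on": [],                "replicas": 1},
}

def validate(descriptor: dict) -> list[str]:
    """Flag dependencies on subsystems that are not declared anywhere."""
    errors = []
    for name, spec in descriptor.items():
        for dep in spec["depends_on"]:
            if dep not in descriptor:
                errors.append(f"{name} depends on undeclared subsystem {dep}")
    return errors

assert validate(DEPLOYMENT) == []
```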
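For process tracing, a minimal sketch propagates a trace identifier through the steps of a flow, with each subsystem recording a span. The in-memory span list and the steps inside handle_order are illustrative assumptions; a real system would ship spans to a tracing backend.

```python
import time
import uuid
from contextlib import contextmanager

SPANS = []  # stand-in for a tracing backend

@contextmanager
def span(trace_id: str, name: str):
    """Record how long one step of the flow took, tagged with the trace id."""
    start = time.monotonic()
    try:
        yield
    finally:
        SPANS.append({
            "trace_id": trace_id,
            "name": name,
            "duration_ms": (time.monotonic() - start) * 1000,
        })

def handle_order(order) -> None:
    trace_id = uuid.uuid4().hex          # propagated across subsystem calls
    with span(trace_id, "validate"):
        pass                             # validation subsystem would run here
    with span(trace_id, "persist"):
        pass                             # storage subsystem would run here
    with span(trace_id, "notify"):
        pass                             # notification subsystem would run here
```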
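A capability model can be sketched as a token granting specific actions on a specific resource, checked at the subsystem boundary independently of how the caller authenticated. The Capability type and the OrderStore subsystem below are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Capability:
    """A token granting specific actions on a specific resource."""
    resource: str
    actions: frozenset = field(default_factory=frozenset)

def read_only(resource: str) -> Capability:
    return Capability(resource=resource, actions=frozenset({"read"}))

class OrderStore:
    """Subsystem boundary: access is checked against the capability presented,
    not against who authenticated further up the stack."""

    def fetch(self, cap: Capability, order_id: str) -> dict:
        if cap.resource != "orders" or "read" not in cap.actions:
            raise PermissionError("capability does not permit reading orders")
        return {"id": order_id}  # stubbed lookup

# A caller that authenticated elsewhere is handed only the capability it needs.
store = OrderStore()
store.fetch(read_only("orders"), "ab-123")
```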
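Finally, a governor is a control loop that observes the system and corrects it so that it stays within its allocated capacity. The queue-depth and worker-count hooks below are hypothetical stubs for the system being governed.

```python
import time

MIN_WORKERS, MAX_WORKERS = 2, 32
TARGET_QUEUE_DEPTH = 100

def read_queue_depth() -> int:
    # Hypothetical hook into the system's own telemetry; stubbed here.
    return 0

def set_worker_count(n: int) -> None:
    # Hypothetical hook that reconfigures the worker pool; stubbed here.
    pass

def governor_loop(poll_interval_s: float = 30.0) -> None:
    """Continuously observe load and correct the worker count so the system
    stays within the capacity it was allocated."""
    workers = MIN_WORKERS
    while True:
        depth = read_queue_depth()
        if depth > TARGET_QUEUE_DEPTH and workers < MAX_WORKERS:
            workers += 1        # backlog growing: scale up
        elif depth < TARGET_QUEUE_DEPTH // 2 and workers > MIN_WORKERS:
            workers -= 1        # load has fallen: reclaim resources
        set_worker_count(workers)
        time.sleep(poll_interval_s)
```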
These mechanisms help to ensure not only that good-quality software reaches production, but also that the software will continue to operate within the necessary capacity constraints, and that automation can be leveraged to mitigate problems before they become failures.
Furthermore, any problems that do arise can be managed more effectively and dynamically: useful information can be derived from the running system to resolve issues, and amendments can be made with minimal impact.
Thus, the value of these mechanisms is often not immediately apparent in terms of the business requirements, but they help to ensure that the systems are more resilient and reliable.
This is important when seen in the context of the business: the value of the system to the business will only be realised if the system can run smoothly, and with minimal overhead, in production.
The reduced overhead can then be translated into future opportunity, as software engineering resources are directed towards extending existing systems and building new ones.