John’s 9 Traits of Well-built Systems

Introduction

Everyone has their view of what a well-built system looks like. We have the Joel test, the 12-factor app and many more. This inspired me to reflect on what is important in a modern system, at least to me.

This list focuses on the traits a well-built system should have. The traits can be achieved using different technologies and techniques, which is intentional. I don’t want to dictate technologies, even if I suggest some.

If you just want the list without all the fuss, here are the 9 traits of well-built systems:

  • A well-built system is internally consistent
  • A well-built system can be fully developed and verified locally
  • A well-built system has a well-defined domain
  • A well-built system should alert developers when they might be causing issues with dependants
  • A well-built system communicates through strict schemas
  • A well-built system has a thorough suite of automated tests, reducing or eliminating the need for manual testing
  • A well-built system is deployed completely automatically or at the push of a button
  • A well-built system keeps its feedback loops short
  • A well-built system reduces or eliminates the need for production access

NOTE: In this post, I use the word “system”, which is only partially correct. A better phrasing would be “software module” or “codebase”. The term “system” flows better, so I decided to use it.

A well-built system is internally consistent

Internal consistency is important for readability and maintainability. It can be frustrating when different developers have applied their own interpretation of “best practices” within a single codebase. The result won't read well even if each of them has applied great practices. A single less-than-perfect approach applied consistently is often preferable to two competing best practices: when the codebase is consistent, we know what to expect, which makes it easier to read the code and make changes.

We achieve internal consistency by agreeing on how different patterns should work and what code style the codebase should use. We get internal consistency when we ensure that the same words mean the same thing across the codebase and that the overall functionalities behave similarly.

A well-built system does not separate new and old code. Old code must be continuously updated and maintained, or internal consistency goes out the window. Doing this takes effort and has to be prioritised by the team.

A well-built system can be fully developed and verified locally

A well-built system must be able to be built, run, and tested locally on developers' machines. Too often, systems rely on external services or test environments, which can cause problems down the line. This might seem convenient initially, but once infrastructure or environmental instabilities arise, troubleshooting and fixing issues can take a long time.

For example, I once worked on a project with little automated testing, where all manual testing had to be done against the test environment. The project used Flyway for database versioning, which is a common and valuable tool. However, relying on the test environment caused problems whenever developers wanted to test new features that required database changes. The database in the test environment would get upgraded to match one developer's branch build, and other developers could then no longer test their own changes because Flyway detected incompatibilities between their local branches and the test environment.

In short, relying on external systems or environments can lead to conflicts and prevent developers from being able to test and verify their changes.

To avoid these issues, a well-built system should be designed to be built, tested, and executed locally on developers' machines. This includes replicating database schemas and spinning up virtualised or containerised environments as needed. If the system relies on specific hardware, developers should have access to local versions of that hardware. This way, one developer's branch won't affect other developers' ability to test their own changes. Building, running, and testing locally is crucial for maintainability and is therefore an essential part of a well-built system.
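
As a minimal sketch of what "replicating the database schema locally" could look like, the snippet below applies a list of versioned migrations to a private in-memory SQLite database. The migration list, table names and function names are all hypothetical, and this only mimics the idea behind a tool like Flyway; it is not Flyway's API.

```python
import sqlite3

# Hypothetical versioned migrations, applied in order. This imitates the idea
# of a migration tool such as Flyway, but it is not Flyway's API.
MIGRATIONS = [
    (1, "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"),
    (2, "ALTER TABLE customer ADD COLUMN email TEXT"),
]


def migrate(conn: sqlite3.Connection) -> int:
    """Apply any migrations not yet recorded and return the schema version."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER PRIMARY KEY)"
    )
    current = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM schema_version"
    ).fetchone()[0]
    for version, statement in MIGRATIONS:
        if version > current:
            conn.execute(statement)
            conn.execute(
                "INSERT INTO schema_version (version) VALUES (?)", (version,)
            )
            current = version
    conn.commit()
    return current


def fresh_local_database() -> sqlite3.Connection:
    """Create a private in-memory database for one developer or one test run."""
    conn = sqlite3.connect(":memory:")
    migrate(conn)
    return conn
```

Because every call to `fresh_local_database()` returns its own database, a schema change on one branch cannot break anyone else's testing. A real project would more likely run its actual migration tool against a containerised database, but the principle is the same.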

A well-built system has a well-defined domain

A well-defined domain is crucial for the long-term success of a system. It helps to ensure that the system is focused and not trying to do too much. Without clear boundaries, it becomes harder to understand the system's capabilities and responsibilities, which leads to confusion and complexity. This can make it difficult for developers to work on the system, as they may be unsure what the system is supposed to do or how it fits into the bigger picture.

I’m sure we’ve all seen the “black-hole system”; I have, multiple times. Someone decided to use microservices but forgot to scope one of their core services. As such, that one service grew and grew. Shortcuts were taken to compensate for inadequacies in the overall solution, and one service started swallowing other services. The result is often worse than what came before.

The domain should also be defined somewhere. This could be through system documentation which outlines the system’s responsibilities, but it could also be documented through schemas such as OpenAPI, gRPC and so forth. The form of the documentation isn’t that important. What matters is that it is written down somewhere why the system exists, what problems it sets out to solve, and what the overall domain of that system is.

Having a clear domain also makes planning for changes and updates to the system easier. When the system's responsibilities are well-defined, it's easier to determine where changes should be made and how they will impact the system as a whole. This can help to reduce the risk of unintended consequences and make it easier to maintain the system over time. Overall, having a clear and well-defined domain is an important aspect of well-engineered systems.

A well-built system should alert developers when they might be causing issues with dependants

Other systems often rely on our systems. It could be a classic frontend-backend solution or service-to-service communication. The dependants might exist outside the organisation or use the code directly as a library. I’m sure many have experienced their system suddenly throwing exceptions without knowing why. After all, no changes have taken place recently, so what gives? The answer is that some dependency of the system has been changed in a way that causes issues for its dependants. These kinds of issues can take a long time to debug, and they’re often no fault of the dependants themselves. As such, we should make sure not to cause issues for dependants.

If we take the frontend-backend example, a backend developer should be alerted when they’re about to make changes that may cause issues with the frontend system. The current standard for achieving this is to use contract testing or have wide-integration/E2E tests.

API versioning is another important tool to reduce the risk of impacting dependants.

No matter the implementation or approach we use, the developer should be warned about potentially breaking code that depends on their system - and such a warning should come as early in the development process as possible.
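
To make the idea of contract testing concrete, here is a minimal hand-rolled sketch. It assumes the frontend publishes the fields it reads and the types it expects, and that the backend checks its own responses against that contract in its test suite. The contract, the field names and `build_user_response` are hypothetical; dedicated contract-testing tools do considerably more than this.

```python
# Hypothetical contract published by the frontend: the fields it reads from
# the backend's "get user" response, and the type it expects for each.
FRONTEND_USER_CONTRACT = {
    "id": int,
    "display_name": str,
    "email": str,
}


def contract_violations(response: dict, contract: dict) -> list[str]:
    """Return a human-readable list of ways the response breaks the contract."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field '{field}'")
        elif not isinstance(response[field], expected_type):
            actual = type(response[field]).__name__
            problems.append(
                f"field '{field}' is {actual}, expected {expected_type.__name__}"
            )
    return problems


def build_user_response(user_id: int) -> dict:
    """Stand-in for the backend handler under test (hypothetical)."""
    return {
        "id": user_id,
        "display_name": "Ada",
        "email": "ada@example.com",
        "locale": "en-GB",  # extra fields do not break this consumer
    }
```

Run as part of the backend's own tests, a check like `contract_violations(build_user_response(1), FRONTEND_USER_CONTRACT)` returns an empty list today and starts returning messages such as "missing field 'display_name'" the moment a backend developer renames that field, long before the frontend breaks in production.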

A well-built system communicates through strict schemas

Too many times, I've encountered a service or file output and had to ask myself questions like, "Can this ever be null?", "What is the valid range of this field?" and "What values are valid here?". Not having schemas in place causes confusion and ultimately leads to bugs - and I've seen plenty come from this alone.

A well-built system uses schemas in one form or another. For example, it could be OpenAPI for web services, Proto for gRPC, JSON schema for JSON files - even database schemas are a part of this. The point is that we should use something that clearly defines what values are valid and what recipients of the data can expect. Using schemas helps to establish a shared understanding of the data being exchanged between systems, which is crucial for maintaining the integrity and reliability of the systems. Without a clear definition of the data, it becomes harder to understand its meaning and context, which leads to confusion and to errors and bugs.

Not only should a well-built system use schemas, but those schemas should be strict. It is better to allow too little and loosen the schema as it becomes necessary than to have a schema that is too loose and causes uncertainty about the data.
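
As an illustration of what "strict" can mean in practice, the sketch below validates an incoming message by hand using only the Python standard library. The `Payment` message, its fields, the allowed currencies and the amount range are all made up for this example. In a real system the same rules would normally live in an OpenAPI, Proto or JSON schema definition instead.

```python
from dataclasses import dataclass

ALLOWED_CURRENCIES = {"EUR", "SEK", "USD"}  # hypothetical closed set of values


@dataclass(frozen=True)
class Payment:
    payment_id: str
    amount_cents: int
    currency: str


def parse_payment(raw: dict) -> Payment:
    """Validate an incoming message strictly; raise ValueError otherwise."""
    expected = {"payment_id", "amount_cents", "currency"}
    if set(raw) != expected:
        # Unknown and missing fields are both rejected rather than ignored.
        raise ValueError(
            f"unexpected or missing fields: {sorted(set(raw) ^ expected)}"
        )
    payment_id = raw["payment_id"]
    amount_cents = raw["amount_cents"]
    currency = raw["currency"]
    if not isinstance(payment_id, str) or not payment_id:
        raise ValueError("payment_id must be a non-empty string")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(amount_cents, bool) or not isinstance(amount_cents, int):
        raise ValueError("amount_cents must be an integer")
    if not 1 <= amount_cents <= 1_000_000:
        raise ValueError("amount_cents must be between 1 and 1000000")
    if not isinstance(currency, str) or currency not in ALLOWED_CURRENCIES:
        raise ValueError(f"currency must be one of {sorted(ALLOWED_CURRENCIES)}")
    return Payment(payment_id, amount_cents, currency)
```

With rules like these, the questions "Can this ever be null?" and "What values are valid here?" have one answer that both sender and receiver can read. If a new currency or a larger amount turns out to be needed, loosening the rule is a small and deliberate change.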

A well-built system has a thorough suite of automated tests, reducing or eliminating the need for manual testing

I’m the kind of developer that sees every bug as a test I forgot to write - and I believe a well-built system supports that mindset by making it easy to write tests and verify functionality. It should be rare for a developer to test anything manually, as it is far more valuable to express the same check as an automated and repeatable test.

There are times when writing an automated test is inconvenient or near impossible. When something is difficult to test and the problem it guards against is both rare and of low severity, we may have a good candidate for something to test manually. What is important is that we test manually as an exception, not as a rule.

If the only way to verify a system’s functionality is to test it manually, then it is not a well-built system.
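
The "every bug is a test I forgot to write" mindset can be sketched as a regression test that reproduces a bug report before the fix goes in. The function and the bug here are invented for illustration: imagine a report saying that splitting 100 cents three ways lost a cent.

```python
import unittest


def split_evenly(total_cents: int, parts: int) -> list[int]:
    """Split an amount into `parts` shares that always sum to the total."""
    base, remainder = divmod(total_cents, parts)
    # The first `remainder` shares get one extra cent so that nothing is lost.
    return [base + 1 if i < remainder else base for i in range(parts)]


class SplitEvenlyRegressionTest(unittest.TestCase):
    def test_remainder_cents_are_not_dropped(self):
        # Reproduces the hypothetical bug report: 100 cents split three ways
        # used to come back as 33 + 33 + 33, silently losing one cent.
        shares = split_evenly(100, 3)
        self.assertEqual(sum(shares), 100)
        self.assertEqual(shares, [34, 33, 33])
```

Once a test like this exists, nobody has to re-check the scenario by hand again, and the bug cannot quietly return.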

A well-built system is deployed completely automatically or at the push of a button

We’re well aware of CI/CD at this point; as such, I won’t go into too much detail. The point is that moving changes into production or as close to production as possible should be easy. The best approach would be that the system deploys itself when certain circumstances have been met, for example, when changes have been pushed to the main branch. In this scenario, developers do not issue a production deployment - it just happens.

A shorter path to production means more frequent releases, leading to smaller releases with fewer changes in them and a smaller chance of something serious going wrong.

There might be reasons why a rapid and hands-off approach to production deployment is not viable. These reasons might be laws in certain industries or physical limitations of the hardware on which the system runs. In those cases, we should automate things as far as we can take them.

Many other practices need to be in place to achieve this trait responsibly:

  • Extensive suite of automated tests.
  • Infrastructure that protects developers from poor releases.
  • Code reviews.

A well-built system keeps its feedback loops short

Having a short feedback loop is important for the efficiency and enjoyment of the development process. This includes both the time it takes to compile the code and the time it takes to run tests, especially unit and system tests. If compilation and testing take too long, developers may be less likely to run them as frequently, leading to a longer feedback loop and potentially slower development.

To ensure a short feedback loop, it is important to strive for as short a duration as possible for compilation and testing. While the specific duration may vary depending on the system and its requirements, a general guideline is to aim for a duration of 5 minutes or less, with an acceptable maximum of around 10 minutes. Of course, certain tests, such as wide-integration, end-to-end, and performance tests, may not fit within this guideline.

A short feedback loop is key to an efficient and enjoyable development process, and a well-built system strives to keep the loops as fast as possible.

A well-built system reduces or eliminates the need for production access

Privacy and security are more important than ever. Unfortunately, there are still many systems out there that do not prioritise privacy. As a result, developers may find themselves needing to access production systems to troubleshoot issues.

One common cause of this need for access is a lack of control over the data stored in the system's database. When the database schema is too loose, it can allow data that the system is not designed to handle, leading to issues that can be difficult to identify. This can result in developers having to log into the database and examine the actual values to track down the source of the error. In a well-built system, this should not be necessary. A good indication of this issue is when developers claim they need "realistic data" when testing.

To avoid these issues, a well-built system should have robust logging that makes it easy for developers to track down issues. Additionally, a well-built system should have full control over the data that is allowed into the system, to ensure that only valid and appropriate data is stored.
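
As a rough sketch of logging that lets developers diagnose a problem without opening the production database, the example below writes one JSON object per event and records identifiers and the reason for a rejection rather than customer data. The logger name, the `context` attribute and the field names are assumptions made for this example, not a prescribed format.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so logs can be searched by field."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "event": record.getMessage(),
            # `context` is a hypothetical attribute passed via `extra=`.
            **getattr(record, "context", {}),
        }
        return json.dumps(payload, sort_keys=True)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("orders")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def reject_order(logger: logging.Logger, order_id: str, field: str, reason: str) -> None:
    """Log enough to diagnose a rejected order without logging customer data."""
    context = {"order_id": order_id, "field": field, "reason": reason}
    logger.warning("order_rejected", extra={"context": context})


# Usage: capture the log output in memory and print the resulting JSON line.
stream = io.StringIO()
reject_order(build_logger(stream), "o-42", "quantity", "must be positive")
print(stream.getvalue())
```

A log line like this tells a developer which order was rejected, on which field and why, which is usually what they were hoping to find by logging into the database.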

Access to production systems should be treated as an exception, not the rule. Developers should only be granted access on an as-needed basis and should never be given write access. Access to production systems should be supervised to ensure that it is used responsibly. Therefore, it's important to prioritise privacy and control to build reliable and secure systems.

Conclusion

As developers, our goal is always to make the best systems we can given our limitations. Perfection will always be out of reach, but that doesn’t mean we shouldn’t put in effort to inch closer. By focusing on the traits we've discussed - like consistency in style, patterns, and architecture; being able to develop and verify things locally; having a clear domain; alerting developers about dependencies; and having a suite of automated tests - we can create systems that encourage change and rapid feedback loops, and that protect us from mistakes and malpractice.

These traits are all about supporting maintainability, testability, and readability on a system level while also promoting flow and joy for developers. Sure, the code and architecture in these systems might not always be perfect, but at least we can make changes and verify that they're working safely and reliably. And if we ever need to do a more extensive rewrite, it's easier in a system with these traits.

So, whether you agree with these traits or not, it's always worth considering what you believe makes for a well-built system. And if you have any thoughts, opinions, or reflections on the topic, feel free to reach out!
