The larger the product or project I work on, the more of my time is better spent thinking about:
- Knowing if something can be done, over how it is done.
- How we can change our mind, over making the right choice.
- How a system can answer questions, over operational efficiency.
To steal a phrase from the Agile Manifesto: while there is value in the items on the right, I value the items on the left more. What you're about to read does echo some of the agile principles, but day-to-day I can't keep 12 principles in my working memory. What I'm trying to codify here is a set of instincts.
Thinking in terms of types of risk, they can be reduced to:
- Feasibility
- Changeability
- Observability
There are lots of possible '-ilities'. If you've never tried it before, I highly recommend Risk Storming as a technique to uncover the ones that are most appropriate to your team's situation.
In my world of public-facing websites and APIs, there's very little that can't be solved with an API over a SQL database - it's dealing with everything else that's the problem. However, feasibility is the first thing I have to tick off. If I have no idea whether it can be done, it needs to be investigated.
Even small changes can need a sketch of a plan, and in proving that something can be done at all, you learn at least one way to get there.
One of my early mistakes as a tech lead was taking on a requirement to do some text-based matching. It was a search problem essentially, making sure the data was indexed in the right way and we could query it effectively. We spent weeks getting a nice UI together and making sure the API service we'd built was well monitored. We gave a nice demo and then we realised it wasn't going to get up to the standard of matching we actually needed. It was marginally better than what we had before, but not better enough to justify all the effort we spent getting to that point.
If we don't get it right first time (shock horror), that's OK: we can write new code again tomorrow.
That's the theory anyway. It's mostly true. It's easier for web developers than iOS app developers, and easier for app developers than people who write firmware for a specific device.
Changeability is where the craft of software engineering - the techniques and practices - can make a big difference. Test code is really a whole separate application that helps you to make the application you really want to make*.
*I'm likely paraphrasing GeePaw Hill here, but I can't find the reference!
That isn't to say your technical choices need to be dictated by changeability. Choosing "Database X" because it'll save you a ton of development time is still likely a sensible choice - but minimise and contain the effect of changing X on the rest of your codebase. A little investment in this area can save you a world of trouble upgrading to "Database Y" in the future.
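One way to contain the effect of "Database X" is to let the rest of the codebase depend on a small interface rather than on the database itself. The sketch below is only an illustration of that idea: `Customer`, `CustomerStore`, `InMemoryCustomerStore` and `change_email` are hypothetical names, and the in-memory class stands in for whatever adapter your real database would need.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class Customer:
    customer_id: str
    email: str


class CustomerStore(Protocol):
    """The only storage surface the rest of the codebase depends on (hypothetical)."""

    def save(self, customer: Customer) -> None: ...

    def find(self, customer_id: str) -> Optional[Customer]: ...


class InMemoryCustomerStore:
    """Stand-in for a "Database X" adapter.

    A "Database Y" adapter would implement the same two methods,
    so callers would not need to change.
    """

    def __init__(self) -> None:
        self._rows = {}  # customer_id -> Customer

    def save(self, customer: Customer) -> None:
        self._rows[customer.customer_id] = customer

    def find(self, customer_id: str) -> Optional[Customer]:
        return self._rows.get(customer_id)


def change_email(store: CustomerStore, customer_id: str, new_email: str) -> bool:
    """Business logic that knows about CustomerStore, never about Database X."""
    customer = store.find(customer_id)
    if customer is None:
        return False
    store.save(Customer(customer_id, new_email))
    return True
```

The same in-memory class doubles as a test fake, which is a nice side effect: the code that makes the database swappable also makes the business logic cheap to test.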
Observability has a formal definition but I don't find it particularly useful day-to-day. This isn't a new take, but I tend to think of it as being able to learn new things with existing code.
There are hundreds of companies out there that will sell you a tool or a managed service to 'improve your observability' - they're usually tools to help you to aggregate your logs, add metrics or do distributed tracing. Making those things easier to use is an obvious win for you as a developer, but a tool that makes it easier doesn't solve it for you.
It's a wider concern than just production monitoring. Knowing the CPU and memory usage of a Docker container is great but it doesn't help you make better product decisions. If you're chasing some KPI, how will you be able to tell from your running software that you've made a difference? Capturing the right data about the business process or the user journey, in a way that lets you ask new questions of it later, is just as important, if not more so.
Practically speaking, what can this look like? It's capturing events as well as state: when something happened and for which users, not just that it happened. It's making sure you and your analysts can query that data and build dashboards from it. It's a deliberate blurring of analytics and monitoring.
If I'm using more CPU, memory or disk space to do that, then so be it. Performance is important, but without the right data, how will you know your users care about performance?
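As a rough sketch of capturing events alongside state, the example below writes both in one transaction using Python's built-in `sqlite3` module. The `basket` and `event` tables, the `item_added` event name and the function names are all assumptions made up for illustration, not a prescribed schema.

```python
import sqlite3
from datetime import datetime, timezone


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with one state table and one event table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE basket (
            user_id    TEXT PRIMARY KEY,
            item_count INTEGER NOT NULL
        );
        CREATE TABLE event (
            occurred_at TEXT NOT NULL,
            user_id     TEXT NOT NULL,
            name        TEXT NOT NULL
        );
        """
    )
    return conn


def add_to_basket(conn, user_id, occurred_at=None):
    """Update the current state and record what happened, when, and for whom."""
    when = (occurred_at or datetime.now(timezone.utc)).isoformat()
    with conn:  # one transaction: state and event are written together or not at all
        conn.execute(
            "INSERT OR IGNORE INTO basket (user_id, item_count) VALUES (?, 0)",
            (user_id,),
        )
        conn.execute(
            "UPDATE basket SET item_count = item_count + 1 WHERE user_id = ?",
            (user_id,),
        )
        conn.execute(
            "INSERT INTO event (occurred_at, user_id, name) VALUES (?, ?, ?)",
            (when, user_id, "item_added"),
        )


def users_adding_items_per_day(conn):
    """A question the state table alone cannot answer: distinct active users per day."""
    rows = conn.execute(
        "SELECT substr(occurred_at, 1, 10) AS day, COUNT(DISTINCT user_id) "
        "FROM event WHERE name = 'item_added' GROUP BY day ORDER BY day"
    )
    return rows.fetchall()
```

The `basket` table only ever tells you how things are right now. The `event` table is what lets you come back months later with a question nobody thought of at the time, using code that already exists.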
The balancing act
These things don't exist in isolation; they interact with each other in weird ways.
If you observe a production issue:
- Is it actually feasible to fix in a holistic way?
- If you rush in and apply a "temporary" fix, what damage does that do to your changeability?
- Have you invested enough in changeability to make it easy to fix this type of issue?
If you're refactoring for changeability:
- Are you adding new abstractions that will make it more difficult to observe?
- In what ways might you be reducing feasibility or changeability for the future?
If you're deciding on the best way to deliver a user story:
- Does taking the easiest route now hinder how incremental you can be in the near future?
- Are you thinking too big? Are you compromising delivering something early because you're keeping too many options open?
Finding the balance
It's quite frankly overwhelming at times, but I have a great team around me and I'm learning more each day.
When all is unclear, the best way out I've found is to take a risk-based approach.
- What are the potential risks to feasibility, changeability and observability?
- How likely are they? What's the impact? Prioritise!
- What things can we do to control for those risks?
- Can we reduce those risks to an acceptable level? (Residual risk)
Measuring and prioritising risks
- Focus on the things that matter for a particular product, story or task - not your company as a whole.
- Don't discount the incredibly unlikely if it has a huge impact - given enough time, it may become inevitable. Think about the timeframe you consider acceptable to deal with this risk.
- For example, if I'm using some deprecated database features that might get removed, that risk might be acceptable for months, but not years.
- You haven't got to be exact, but relative ordering is a good starting point.
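To get that relative ordering, one common shorthand is to give each risk a rough likelihood and impact and sort by their product. The sketch below assumes a 1 to 5 scale for both, which is my choice for illustration rather than anything precise; `Risk` and `prioritise` are hypothetical names, and the numbers are guesses by design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    description: str
    likelihood: int  # 1 (rare) to 5 (almost certain), a rough guess
    impact: int      # 1 (minor) to 5 (severe), a rough guess

    @property
    def score(self) -> int:
        return self.likelihood * self.impact


def prioritise(risks):
    """Order risks highest score first.

    Ties are broken in favour of impact, so an unlikely but severe
    risk is not buried beneath a likely but minor one.
    """
    return sorted(risks, key=lambda r: (r.score, r.impact), reverse=True)


# Illustrative entries only; the scores are placeholders, not measurements.
risks = [
    Risk("Deprecated database feature gets removed", likelihood=2, impact=4),
    Risk("Matching quality is not good enough", likelihood=4, impact=4),
    Risk("No data to tell whether the KPI moved", likelihood=4, impact=2),
]

for risk in prioritise(risks):
    print(f"{risk.score:>2}  {risk.description}")
```

The scores themselves matter far less than the conversation they force. If two people disagree about whether something is a 2 or a 4, that disagreement is usually the useful part.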
The most scalable controls are often those that are people- and process-based. There's still a place for standardisation and guidelines to reduce friction and cognitive load, but I think culture is most important. Encouraging test-driven approaches will serve you better than a mandatory code coverage rule. Encouraging people to think about what they need to monitor up-front will serve you better than saying "You must implement the four golden signals with this metrics library".
What's definitely not scalable is personally getting involved in every decision or code review!
This has turned into a monster post, but I found it a lot of fun to clarify my thoughts. I think it'll be a great snapshot of my engineering approach to reflect on for years to come.
If you've found the balance, let me know. I'm @LewisGJ on Twitter, or you can email me at <my twitter name>@hey.com.