Kelly discusses how software can be upgraded and changed with the least amount of disruption to the business and ultimately customer experience.
Kelly writes as follows…
Software is a living entity and it is inevitable that it needs to change as businesses expand and mature. A solution that worked well six months ago may no longer be the best solution today, especially in a startup environment where things change at a rapid rate.
Every so often we need to change part of our infrastructure to meet new requirements from both our internal and external users.
It’s hard to gain users, easy to lose them. All the marketing and sales effort that goes into attracting and retaining users can easily be for nought if we deliver a bad experience. When we are upgrading components and making changes our users don’t want to see a maintenance page, an error page or, even worse, lose data.
This means it’s important to plan changes out, trial run them and then script them to be easily reproduced. Sometimes unforeseen things can happen when updating software, but that doesn’t mean we should not attempt to control the risk as much as we can.
Here are some principles that we use at carwow to manage the disruption caused by changes to our tech stack:
Don’t do it, unless you have to
This might seem counter-intuitive but it is important to consider.
Any change to infrastructure is risky no matter how well you plan and prepare.
If it is something that can be deferred until a time when the risk is lower, great – you have solved the problem and it's time to move on to the next one.
Limit the scope
Don’t change everything at once.
Modern tech stacks consist of many different components and the more components that change at once, the greater the risk of services going down, data being lost and users having a bad experience.
All of which can result in the loss of customers.
Script all the things
Every step / command / action must be written down. Don't rely on your memory to help you out – steps that are written down are much harder to skip or forget.
Go one step better and create a script that runs the changes – copying and pasting commands or clicking through a user interface is prone to human error, especially when there are more than a few steps involved.
Automating the process greatly reduces the risk of human error, as the steps are always run in the same way, in the same order.
Writing it all down has another advantage: the process is kept for future use, saving time and hassle. After all, there is no need to reinvent what you have already done.
We use bash scripts and rake tasks for smaller changes, while Ansible and/or Terraform are used to manage larger changes.
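The idea can be sketched in a few lines of Ruby (the step names and actions below are made up for illustration): record every step as data, run them in a fixed order, and stop at the first failure so nothing can be skipped or forgotten.

```ruby
# Illustrative sketch: each step is a name plus an action.
# The real actions would run migrations, restart services, etc.
STEPS = [
  ["enable maintenance mode",  -> { true }],
  ["run database migration",   -> { true }],
  ["disable maintenance mode", -> { true }]
]

def run_steps(steps)
  steps.each_with_index do |(name, action), i|
    puts "Step #{i + 1}: #{name}"
    # Halt immediately on failure so later steps never run against a broken state
    raise "Step failed: #{name}" unless action.call
  end
  puts "All #{steps.size} steps completed"
end

run_steps(STEPS)
```

Because the list of steps is just data, the same list serves as both the script and the written record of what was done.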
Split the change into stages
If your changes are applied in multiple steps, break them up. Verify that each step has been successful before continuing.
You don't want to get to the end of the change process before realising that an earlier step hasn't done what it was meant to do. Verification can take the form of performing counts, checking that a web request returns an expected response, or confirming that a service has started (and stayed) running.
Log every step, even if it is something simple, including output from API calls and results from database queries. The more information you have, the better prepared you are to diagnose any problems that occur along the way.
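A small verification helper captures both habits at once – check, and log the evidence. This is a hedged sketch, not production code; the check shown (comparing row counts) is a made-up example:

```ruby
require "logger"

LOG = Logger.new($stdout)

# Run a check, log its result, and refuse to continue if it failed.
def verify!(description)
  result = yield
  LOG.info("#{description}: #{result.inspect}")
  raise "Verification failed: #{description}" unless result
  result
end

# Example: confirm a data copy before moving to the next stage.
# In reality these counts would come from database queries.
old_count = 100
new_count = 100
verify!("row counts match") { old_count == new_count }
```

The same helper could wrap an HTTP health check or a service-status query – anything that yields true or false, with the result logged either way.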
Give yourself a way out
You need to think about what happens when you are halfway through a change process and realise it cannot be completed. This could be due to a step not working properly, the process taking too long for a maintenance window, or something totally unexpected happening.
Being able to reverse the changes means you can get your environment back to where you started (read: undo the damage). With some changes, however, this is not easy to do – database migrations that manipulate large datasets, for instance. So it is important to think about how a change can be reversed before it's too late.
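One way to build in a way out is to pair every step with an undo action, so a failure part-way through reverses only the steps that actually ran, in the opposite order they were applied. A minimal sketch (step names and actions are hypothetical):

```ruby
# Each step carries both how to apply it and how to undo it.
Step = Struct.new(:name, :apply, :undo)

def run_with_rollback(steps)
  done = []
  steps.each do |step|
    step.apply.call
    done << step
  end
  done.map(&:name)
rescue => e
  # Undo completed steps in reverse order, then re-raise so the failure is visible
  done.reverse_each { |s| s.undo.call }
  raise e
end
```

A migration that copies data into a new table, for example, would pair the copy with a step that drops the new table again – and deciding that undo step up front is exactly the thinking this principle asks for.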
Practice makes perfect
Rather than applying a change straight away, test it first.
Have a duplicate environment (cloud services make this easy) set up like your production environment, so that if something goes wrong it's not the end of the world – simply a case of trying again.
Run in parallel / phased changeover
Flicking a switch from one service to another might work, but it's less risky to have a phased changeover (also known as a soft launch).
If it is possible, run the old and new services side by side so you can switch between them and test that the new service / component works.
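A phased changeover often comes down to routing a configurable percentage of traffic to the new service. One sketch of how that could look (the function and percentages are illustrative, not a description of any particular tool):

```ruby
require "digest"

# Route a given percentage of users to the new service.
# Hashing the user id keeps the choice deterministic, so each user
# gets a consistent experience across requests.
def choose_service(user_id, new_service_percent)
  bucket = Digest::MD5.hexdigest(user_id.to_s).to_i(16) % 100
  bucket < new_service_percent ? :new : :old
end
```

Starting at a low percentage and ratcheting it up – 1%, 10%, 50%, 100% – gives you a chance to spot problems while most users are still on the old, known-good path, and turning the dial back to 0 is the escape hatch.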
After the change has happened, it’s important to reflect on the process and share your experience with team members. What worked? What didn’t work? How can the process be improved? After all, there’s no point in you or anyone else making the same mistakes again.