Improving predictability and delivery velocity in IAM
One of the most important ways my team has improved its predictability and delivery velocity is the art of cutting big tasks into small pieces and putting a limit on the amount of work in progress.
Limiting the amount of work in progress ensures that work items actually get finished. It reduces the time from starting work to finishing it, because developers have fewer context switches, which in turn improves predictability.
What this looks like in our team: focus on your development task. When you are done and it needs to be reviewed (a natural context-switching point), don't immediately pick up another development task, but review and help another developer first.
Making those tasks small also means you have to hold less context in your head. It makes code reviews easier (who hasn't dreaded a code review with 100+ changed files?). And when the change is deployed and something does turn out to be broken, you are going to find the cause of it a lot faster.
So yes, we try to get those tasks small and to the point. And the task is done when the code is running in production. This does not necessarily mean that all customers are hitting that new code though. It might well be behind a feature toggle. But at least it's deployed.
There are way more in-depth teachings in books like "The DevOps Handbook" and "Accelerate". And some fictionalized examples that often seem all too familiar in "The Phoenix Project" and "The Unicorn Project".
So, the above all sounds well and good… but it can't be that easy, now, can it? As it turns out, there are a few things that might get in the way of making nice small tasks and deploying to production as soon as possible. For instance:
Pipelines are so slow (or contain manual steps) that making tasks smaller will create a backlog of implemented work that is waiting to be deployed.
Your deployment process causes downtime or other types of customer impact (maybe it is just a few seconds or a minute). Doing more deployments because of the smaller tasks will cause more downtime.
Your businesspeople and/or product owner don't like you just releasing stuff to customers every day, or releasing only half of a feature for that matter.
First, we made sure that all manual steps were eliminated from the pipelines. We found that most manual steps were there to have a human do a quality check, and quality checks can generally be automated very well with all the different types of automated testing tools at our disposal.
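As a trivial illustration of such an automated check, here is what a unit-test-style quality gate could look like; the permission logic and the Jest-style test runner are just examples, not our actual code:

```typescript
// A trivial example of an automated quality gate, assuming a Jest-style
// test runner; the permission logic itself is made up for illustration.
function hasPermission(roles: string[], required: string): boolean {
  return roles.includes("admin") || roles.includes(required);
}

test("admins implicitly have every permission", () => {
  expect(hasPermission(["admin"], "user:delete")).toBe(true);
});

test("regular users need the explicit permission", () => {
  expect(hasPermission(["viewer"], "user:delete")).toBe(false);
});
```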
Now that everything is automated, we can get to the speeding-up part. In our case, we wanted to do as little as needed during each run of the pipeline. So, make sure you only build, test, and deploy the one thing that you changed. Why compile and update your binaries if you only changed your infrastructure as code? Why run DB schema update tasks when there hasn't been a schema update?
This works on a higher level as well. Don't deploy your API if only your website has changed. Making tasks smaller means only changing one deployable component at a time. So, let us not deploy any of those other components in the process.
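To give an idea of what that can look like, here is a minimal sketch of a change-detection step, assuming a Node-based pipeline script and a monorepo with hypothetical `api/`, `web/`, and `infra/` folders; the pipeline can use its output to skip stages for components that didn't change:

```typescript
// decide-deployments.ts - a minimal sketch of path-based change detection.
// Folder names and the way the result is consumed are assumptions.
import { execSync } from "node:child_process";

// The commit of the last successful deployment is assumed to be passed in
// by the pipeline; fall back to the previous commit for illustration.
const lastDeployed = process.argv[2] ?? "HEAD~1";
const changedFiles = execSync(`git diff --name-only ${lastDeployed} HEAD`)
  .toString()
  .trim()
  .split("\n")
  .filter(Boolean);

// Map top-level folders to deployable components.
const components = {
  api: changedFiles.some((file) => file.startsWith("api/")),
  web: changedFiles.some((file) => file.startsWith("web/")),
  infra: changedFiles.some((file) => file.startsWith("infra/")),
};

// Emit one flag per component; later pipeline stages only run when their flag is true.
for (const [name, changed] of Object.entries(components)) {
  console.log(`deploy_${name}=${changed}`);
}
```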
Make sure API updates are backward compatible, e.g., making a breaking change to an API's signature means creating a new version and keeping the old version available. If you don't do this, the front end calling those APIs will fail when the API is updated but the front end is not yet. Even if we hadn't split up the deployments of our individual components as much as possible, we would still have needed to do this: because front-end code runs in a browser, it doesn't really know you updated the server until you do a page refresh.
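As an illustration (not our actual code), keeping the old version available could look like this, assuming an Express-style API in TypeScript where a response field is being split up:

```typescript
import express from "express";

const app = express();

// v1 keeps the old response shape, so front ends that haven't been
// refreshed yet keep working after the API is deployed.
app.get("/api/v1/users/:id", (req, res) => {
  res.json({ id: req.params.id, name: "Jane Doe" });
});

// v2 introduces the breaking change (splitting the name field) as a new
// version instead of changing v1 in place.
app.get("/api/v2/users/:id", (req, res) => {
  res.json({ id: req.params.id, firstName: "Jane", lastName: "Doe" });
});

app.listen(3000);
```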
The same goes for database schemas: make those backward compatible as well. Don't remove fields from tables until none of your environments are using them any longer. In our projects, updating the database is a separate work item, and we make sure to change only the database and nothing else in that work item. This way we can be sure that existing code runs against it without issues.
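To make that concrete, a backward-compatible rename could be spread over several small work items in an expand/contract style; the column names and the `Db` helper below are hypothetical, not our actual schema:

```typescript
// A sketch of an expand/contract schema change: renaming users.full_name to
// users.display_name without breaking code that still reads the old column.
// The Db interface is a stand-in for whatever migration tool you use.
interface Db {
  query(sql: string): Promise<void>;
}

// Step 1 (expand): add the new column alongside the old one.
async function addDisplayNameColumn(db: Db): Promise<void> {
  await db.query("ALTER TABLE users ADD COLUMN display_name TEXT");
}

// Step 2 is an application deploy in a separate work item: write to both
// columns, read only the new one.

// Step 3 (backfill): copy existing data into the new column.
async function backfillDisplayName(db: Db): Promise<void> {
  await db.query(
    "UPDATE users SET display_name = full_name WHERE display_name IS NULL"
  );
}

// Step 4 (contract): only once no environment uses full_name any longer,
// drop the old column, again as its own work item.
async function dropFullNameColumn(db: Db): Promise<void> {
  await db.query("ALTER TABLE users DROP COLUMN full_name");
}
```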
Last but not least, downtime is often caused by tearing down the old instance of your running code and starting a new one. Things like loading assemblies and warming up caches take time. There are plenty of ways to make sure this goes smoothly. In our setups, we use Azure App Service with staging and production slots. We deploy to the staging slot, warm up the instance by calling particular endpoints, and then swap it with the production slot.
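To sketch the warm-up part (the host name and endpoint paths are made up, and the swap itself is triggered afterwards by the pipeline, for example through the Azure CLI):

```typescript
// warm-up.ts - a minimal sketch of warming up a staging slot before a swap.
// The staging host and endpoint paths are illustrative, not our real ones.
const stagingHost = "https://myapp-staging.azurewebsites.net";
const warmupPaths = ["/health", "/api/v1/users/warmup", "/api/v1/roles/warmup"];

async function warmUp(): Promise<void> {
  for (const path of warmupPaths) {
    const response = await fetch(`${stagingHost}${path}`);
    if (!response.ok) {
      throw new Error(`Warm-up failed for ${path}: HTTP ${response.status}`);
    }
    console.log(`Warmed up ${path}`);
  }
}

// If this script exits with a non-zero code, the pipeline skips the slot swap.
warmUp().catch((error) => {
  console.error(error);
  process.exit(1);
});
```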
Trunk-based development practices are an absolute must if you want to get your code into production quickly. In our team we use one long-living branch called `main`, from which we create small feature branches for each small piece of work we do. When the work is done, the code merges directly back into main. Pipelines run on main and on pull requests into main (where the pull-request pipeline runs only deploy to a pre-production environment).
More in-depth information about trunk-based development can be found here: https://trunkbaseddevelopment.com/
(our team uses what they call scaled trunk-based development)
The business often has some need to control when a new piece of functionality is made available to customers. They might want to save up some juicy features to be released during a trade show, for instance. It would be really inconvenient if you had to bunch up all that changed code and release it to a production environment for the first time at that moment.
The key here is to disconnect the release of a feature from the deployment of the code that implements it. The way to do this is with feature toggles, or feature flags if you prefer that term. There are plenty of great feature toggle systems that can help you with this. We rolled our own, with specific features that match the way we have segmented the different types of customers and users in our system. It allows us to enable a feature for a single user, or for all users of a single customer, to name a few options. This also means we can enable it in production for demo accounts without any of our customers hitting the new code yet.
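Our own toggle system is tied to that customer and user segmentation, but the core idea can be sketched roughly like this; the names, the in-memory store, and the segmentation fields are illustrative, not our actual implementation:

```typescript
// A rough sketch of a home-grown feature toggle check.
interface FeatureToggle {
  feature: string;
  enabledForAllUsers: boolean;
  enabledCustomerIds: string[]; // all users of these customers get the feature
  enabledUserIds: string[];     // individual users (e.g. demo accounts) get it
}

// In a real system these would live in a database or configuration store.
const toggles: FeatureToggle[] = [
  {
    feature: "new-permission-screen",
    enabledForAllUsers: false,
    enabledCustomerIds: ["customer-demo"],
    enabledUserIds: ["user-42"],
  },
];

function isFeatureEnabled(feature: string, userId: string, customerId: string): boolean {
  const toggle = toggles.find((t) => t.feature === feature);
  if (!toggle) return false; // unknown features stay off by default
  return (
    toggle.enabledForAllUsers ||
    toggle.enabledCustomerIds.includes(customerId) ||
    toggle.enabledUserIds.includes(userId)
  );
}

// The deployed code only becomes reachable once the toggle says so.
if (isFeatureEnabled("new-permission-screen", "user-42", "customer-demo")) {
  console.log("Render the new permission screen");
}
```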
There you have it. A small look into how my team splits their work up into small batches to improve quality and velocity. There are still plenty of things we can improve, of course. Maybe we can get our build times even shorter. Maybe we can improve our automated quality checks so much that we can reach the holy grail of directly pushing to the main branch someday.