Agile Approaches to Handling Bugs in Production | David Tzemach
Updated: Mar 3, 2022
In theory, Agile teams are expected to deliver a working increment of the software, ready to be deployed, at the end of each sprint. To support this, the increment must meet strict quality standards that ensure the removal of critical bugs that could lead to unexpected system behavior.
It takes time for a team to establish all the supporting frameworks and processes needed to meet these expectations. The truth is that most teams struggle to do so, and worse, their deliverables are released anyway and cause more damage than value.
A production bug leaves us with an angry customer who has just realized that his production environment contains critical bugs that cost him money or prevent him from providing his services. Beyond that, production bugs directly impact the team itself.
Think about a situation where a team has committed to a sprint. Mid-sprint, a new bug is discovered in production, which means the team must shift its work to handle it, even at the cost of losing its ability to deliver a working increment.
There are different ways of dealing with bugs in a production environment (internal or external) that come up once the sprint has already started. Below are a few common scenarios and how the team should handle them:
Scenario 1: Bugs with no significant impact on the system (internal production)
In this case, there is no real reason to interrupt the current sprint and affect the team’s velocity. Instead, the Product Owner can decide whether to fix the bug in a future sprint, defer it to a future version, or provide a quick fix without recording a bug in the reporting system.
Scenario 2: Bugs with a significant impact on the system (internal production)
Bugs with a significant impact on the system should be addressed even if that affects current commitments. Common approaches to handling this type of bug are:
Address it immediately – A bug is opened with high severity and is given a high priority by the Product Owner, so the team starts working on it as soon as possible.
Revert the system to a previous state – This is an excellent solution as it allows the team to keep working on their current commitments. A bug should be opened and addressed before the next deployment. However, to do so, the organization must have a virtualized production environment that supports rollbacks to a previous state with minimal downtime.
Provide a patch – Another common solution is for the team to implement a quick patch that provides a temporary solution, allowing the team to conduct a deeper investigation in the next sprint before the next deployment.
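The choice among these three approaches can be sketched as a small decision function. This is only an illustration, not a prescribed policy: the names `Mitigation` and `choose_mitigation`, the two boolean inputs, and the order of preference (rollback first, then patch, then an immediate fix) are assumptions made for the example. Only the rollback precondition comes from the description above.

```python
from enum import Enum


class Mitigation(Enum):
    """The three approaches listed above (labels are illustrative)."""
    FIX_IMMEDIATELY = "open a high-severity bug and start working on it now"
    ROLLBACK = "revert to a previous state and fix before the next deployment"
    PATCH = "ship a temporary patch and investigate in the next sprint"


def choose_mitigation(rollback_supported: bool, patch_feasible: bool) -> Mitigation:
    """Pick a response to a significant internal-production bug.

    The preference order below is an assumption for illustration only.
    """
    # A rollback keeps current commitments intact, but it needs an
    # environment that can return to a previous state with minimal downtime.
    if rollback_supported:
        return Mitigation.ROLLBACK
    # A temporary patch defers the deeper investigation to the next sprint.
    if patch_feasible:
        return Mitigation.PATCH
    # Otherwise the team interrupts its current work to fix the bug.
    return Mitigation.FIX_IMMEDIATELY
```

For example, `choose_mitigation(rollback_supported=False, patch_feasible=True)` returns `Mitigation.PATCH`. A real team would weigh more factors than two flags, so treat this as a way to make the trade-off explicit rather than as a rule.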
Scenario 3: Bugs with no significant impact on the system (customer)
Bugs with a low system impact should not interrupt the current development cycle. The simple solution is to record all these bugs and address them in a future sprint, based on the importance determined by the customer. For this to work, there must be good communication and full trust between the team and the customer, both of which reduce customer pressure for quick fixes that interrupt team commitments.
Scenario 4: Bugs with a significant negative impact on the system (customer)
This scenario is the most interesting and the most important, because it has the greatest impact on the business, the customer, and the development team. Here, the bug is severe enough that the customer cannot use a critical function of the system.
From both the business and customer perspectives, a bug with this impact found in a customer environment receives the highest priority, above any user story in the team backlog. In addition, these bugs usually pose the greatest challenges to developers, for several reasons:
The customer environment is not always available for the team to investigate.
Bug descriptions arriving from the field usually fail to provide the real reason for the failure or valid reproduction steps.
Due to the urgency of the fix for the customer, there is more pressure on the team to provide a quick solution.
So now that we understand the challenges, let’s focus on the main point and explain how these bugs can impact the team by reviewing the three levels of impact:
L1: Low impact on the team
Although the bug has a severe impact on the customer, the team can resolve this issue using only one or two team members who take ownership and investigate the problem and its cause. In this scenario, the solution is relatively quick and minimally affects the team’s velocity for the current sprint.
L2: Major impact on the team
There are times when the customer will call the support team and say something like, “I cannot use a specific functionality of the product” or “The system is not working for me as you promised.” This may indicate the existence of a bug that has a significant impact on critical areas of the product. In that case, it is necessary to mitigate the problem in the current sprint even if it goes against current commitments.
Due to the urgency and pressure related to these bugs, the Product Owner usually adds a new story with a high priority to the current sprint based on the scope of impact on the customer.
When a new story is added in the middle of the sprint, the Scrum Master (SM) and the Product Owner (PO) must work together to figure out how to add it with minimal impact on the current sprint work and on existing commitments.
Moreover, to ensure that the team does not become frustrated, it is also essential that the PO and SM explain the reasons and urgency that led them to make this change.
L3: Critical impact on the team
The last thing that we want to see in the customer environment is the presence of a critical bug blocking the customer from using the system. When this happens, it requires the team's full attention and, in the worst cases, even the termination of the current sprint.
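The three levels of team impact can be summarized as a small lookup table. The sketch below is illustrative only: the names `TeamImpact`, `SprintResponse`, and `sprint_response` are hypothetical, and the two boolean flags are one reading of the descriptions above, not a formal rule.

```python
from dataclasses import dataclass
from enum import Enum


class TeamImpact(Enum):
    LOW = 1       # L1: low impact on the team
    MAJOR = 2     # L2: major impact on the team
    CRITICAL = 3  # L3: critical impact on the team


@dataclass(frozen=True)
class SprintResponse:
    action: str
    interrupts_commitments: bool  # does current sprint work have to give way?
    may_terminate_sprint: bool    # is ending the sprint on the table?


# One entry per level, paraphrasing the text above.
RESPONSES = {
    TeamImpact.LOW: SprintResponse(
        "one or two team members take ownership and investigate",
        interrupts_commitments=False,
        may_terminate_sprint=False,
    ),
    TeamImpact.MAJOR: SprintResponse(
        "PO adds a high-priority story; PO and SM fit it into the sprint",
        interrupts_commitments=True,
        may_terminate_sprint=False,
    ),
    TeamImpact.CRITICAL: SprintResponse(
        "the whole team focuses on the bug",
        interrupts_commitments=True,
        may_terminate_sprint=True,
    ),
}


def sprint_response(level: TeamImpact) -> SprintResponse:
    """Return the illustrative response for a given team-impact level."""
    return RESPONSES[level]
```

Calling `sprint_response(TeamImpact.MAJOR).action` returns the L2 summary. Writing the levels down this way makes it easier for the PO and SM to explain to the team why a given bug justified changing the sprint.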