“Queues lie at the root of a large number of product development problems. They increase variability, risk, and cycle time. They decrease efficiency, quality, and motivation.”
“Large queues form when processes with variability are operated at high levels of capacity utilization. In reality, the misguided pursuit of efficiency creates enormous costs in the unmeasured invisible portion of the product development process, its queues.”
“Since high capacity utilization simultaneously raises efficiency and increases the cost of delay, we need to look at the combined impact of these two factors.”
“We have already pointed out that companies do not manage product development queues. What do they manage? Timelines...In contrast, when we emphasize flow, we focus on queues rather than timelines. Queues are a far better control variable than cycle time because, as you shall see, queues are leading indicators of future cycle-time problems. By controlling queue size, we automatically achieve control over timelines.”
– Don Reinertsen, 2009 – The Principles of Product Development Flow: Second Generation Lean Product Development
If you’re new to or rusty on CFDs, it would be helpful to review “Basics of Reading Cumulative Flow Diagrams” first. That earlier post covers basic definitions and the mechanics of reading WIP (work-in-progress), lead time, and average completion rate (throughput) from a CFD. It also includes links to external sources, one showing a basic example of how to create a CFD using MS-Excel, and another providing a quick overview of their use in the kanban method context. From that foundation, I’d like to shift here toward a visual analysis perspective, exploring a few contexts where a bottleneck in our workflow is more or less present and seeing how it might appear on a CFD.
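For readers who prefer code to spreadsheets, the counting behind a CFD can be sketched in a few lines of Python. This is my own illustrative sketch, not the Excel approach linked above; the input shape (a per-item map of state to the reporting interval at which the item entered it) is an assumption made for the example.

```python
STATES = ["Design", "Code", "Test", "Deploy", "Done"]

def cumulative_flow(items, num_intervals):
    """For each state, count at every reporting interval how many work
    items have entered that state so far -- these cumulative counts are
    the CFD's stacked chart lines."""
    cfd = {s: [0] * num_intervals for s in STATES}
    for entered in items:
        for state, t in entered.items():
            for i in range(t, num_intervals):
                cfd[state][i] += 1
    return cfd

# Two toy work items and the interval at which each entered each state.
items = [
    {"Design": 0, "Code": 1, "Test": 2, "Deploy": 4, "Done": 5},
    {"Design": 0, "Code": 2, "Test": 3, "Deploy": 5, "Done": 6},
]
cfd = cumulative_flow(items, 7)

# WIP in a state is the vertical gap between its line and the next
# state's line; e.g. WIP sitting in Test at each interval:
wip_test = [cfd["Test"][i] - cfd["Deploy"][i] for i in range(7)]
```

Plotting the five lists as stacked lines against the interval index reproduces the kind of chart discussed throughout this post.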
To be clear, I don’t think CFDs are the only way or “a best way” to spot a bottleneck in your workflow. If you’re adequately visualizing your workflow on a board and diligently measuring and managing work items as they flow through your process, you’re well on the way to being able to identify where any bottleneck might be occurring. (Note: see my Kanban Board Design Primer under Resources/Nuggets here). That said, don’t underestimate the value of also using the CFD for learning about what happened, is happening, or will happen in your workflow. It can be a very effective tool for creating an easily accessible visual historical record and a trending perspective of a workflow in your context.
What is a Bottleneck?
Let us get a shared understanding of the term “bottleneck” by first looking at these selected Wikipedia definitions:
- Metaphorically a bottleneck is a section of a route with a carrying capacity substantially below that characterizing other sections of the same route. This is often a narrow part of a road, perhaps also with a smaller number of lanes, or a reduction of the number of tracks of a railway line.
- In engineering, a bottleneck is a phenomenon by which the performance or capacity of an entire system is severely limited by a single component.
- A bottleneck in project management is one process in a chain of processes, such that its limited capacity reduces the capacity of the whole chain.
As we continue our discussion, a key point from these definitions to keep in mind is the “singular” characteristic. Additionally, I’ll assume the bottleneck is also “stationary” throughout the time reflected by the example CFDs that follow.
A Bottleneck Without WIP Limits
In the first example, the CFD is derived from a workflow that didn’t utilize WIP limits (see Figure 1, click image to enlarge). Note that for this post I’ve chosen to run the CFDs with the latest date intervals on the right and earlier date intervals scrolling off to the left; this is the more common convention, but again it’s a preference, not a requirement. The workflow states shown include Design, Code, Test, Deploy, and Done. In this first example, do you clearly see that work items are arriving into Design at a rate significantly greater than the rate at which they are exiting Deploy and entering the Done state?
What else stands out here? The widths of the Design and Deploy states of the CFD remain relatively consistent over the reporting intervals, and they’re also narrow relative to the other states. In contrast, the Code and Test states get wider over time. What does this tell us? Recall, no WIP limits are being used here, so in this context work items will back up upstream of a “bottleneck”, over time taking longer and longer to get through that point, and through the entire workflow end to end.
But which process state is the bottleneck? Clearly the Test state shows the point in the workflow where the greatest backing up is occurring, or more precisely appears to be the “single” component that is severely limiting the entire system. But what else do you see in this CFD?
Note the “flat-shelves” that occur on the CFD chart line between the Test and Deploy states (ex. look vertically straight up from Reporting Interval 57). What causes these to appear in this CFD? The “shelf” at Reporting Interval 57 shows a “purple” gap exists for the Deploy state, but there is no such gap (no purple) for the Deploy state at Reporting Intervals 81 and 83. Is there any WIP for the Deploy state near Reporting Intervals 81 and 83? Do you see any similar “shelves” on the CFD chart line between the Code and Test states (ex. Reporting Intervals 20 and 26)? Do shelves occur less frequently between the Code and Test states, and do any show no WIP (no “green” gap) existing in the Test state?
When a shelf occurs, no work items are moving from the upstream workflow process state to the adjacent downstream state. When WIP limits are in place, it’s possible that an upstream state could be “blocked” from moving a work item downstream even when all the work for that upstream state is completed for one or more work items. But for this example, since there are no WIP limits being used, nothing is blocking the Test state. When no work items are passed from Test to Deploy during a reporting interval, it is due to none having their testing “completed” during that reporting interval.
We also see the Deploy state is frequently “starved”, with no work items (no purple, no WIP) for several reporting intervals. When a process state frequently has no WIP, is this another characteristic to look for in a CFD when determining where a “significant” limiting factor might exist for the overall system?
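These two visual cues, flat shelves on a chart line and a vanishing (zero-WIP) band, can also be read mechanically from the CFD’s underlying counts. Here is a small illustrative Python sketch; the data is made up, not taken from Figure 1.

```python
def find_shelves(cumulative):
    """Intervals where a cumulative line is flat, i.e. no work items
    crossed into this state during the interval (a 'shelf')."""
    return [i for i in range(1, len(cumulative))
            if cumulative[i] == cumulative[i - 1]]

def find_starvation(entered, exited):
    """Intervals where a state holds no WIP: everything that has ever
    entered it has already exited (the colored band vanishes)."""
    return [i for i in range(len(entered)) if entered[i] == exited[i]]

# Toy cumulative counts: items entering Deploy (the Test/Deploy line)
# and items entering Done (exits from Deploy).
deploy_in = [0, 1, 1, 2, 3, 3, 4]
done_in   = [0, 0, 1, 2, 2, 3, 3]

shelves = find_shelves(deploy_in)              # Test passed nothing downstream
starved = find_starvation(deploy_in, done_in)  # Deploy sat with zero WIP
```

On real data, long or frequent runs of both signals around the same state are exactly the pattern discussed above.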
Note: It would’ve been time consuming to collect real world project data for each CFD example, and no simple task to manually generate data for a useful number of work items in each example as well. Instead, I utilized an MS-Excel based kanban simulation tool (developed by Mark Robinson, available at Excelville.com). Using Mark’s tool I was able to easily and quickly model the processing of 100 work items through a fixed number of workflow process states (Design, Code, Test, Deploy, and Done) for a number of different runs.
The desired example models were created by adjusting the WIP limits and the level-of-effort limits in each process state of the workflow. These settings, along with the summary statistics generated for individual runs, were captured and placed above each CFD example. Since the output for each run required additional processing to create the specific counts and reporting intervals needed for a CFD, I coded VBA routines in a companion MS-Excel workbook to automate, with a button click, all the additional processing of the raw data, the capture of the parameter settings, and the generation of the CFD.
By reviewing the settings for Team Size, WIP Limits, and the Min/Max “levels of effort” for the various workflow states, understanding that the tool simulates rolling dice to process work items, and knowing (I’m fairly confident) it uses FIFO with no capacity allocated across functions or carried over between reporting intervals, you should be able to get the necessary appreciation for how I used it to model the various examples.
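Since I can’t share the workbook mechanics inline, here is a rough Python analogue of the kind of simulation described: dice-roll capacity, FIFO processing, and optional per-state WIP limits. It is my own simplified sketch, not Mark Robinson’s tool; the effort values, function names, and any rules beyond those described above are assumptions.

```python
import random
from collections import deque

STATES = ["Design", "Code", "Test", "Deploy"]

def simulate(num_items, effort, wip_limits, seed=1):
    """Push num_items through the states. Each interval every state
    rolls a d6 for capacity and spends it FIFO on its queue; a finished
    item moves downstream only if the downstream WIP limit permits.
    effort maps state -> work units per item; wip_limits maps
    state -> max items (states absent from it are unlimited).
    Returns the number of intervals needed to finish everything."""
    rng = random.Random(seed)
    queues = {s: deque() for s in STATES}  # remaining effort per item
    backlog = num_items                    # items not yet started
    done = 0
    interval = 0
    while done < num_items:
        interval += 1
        # Work states downstream-first so completed moves free WIP slots.
        for idx in reversed(range(len(STATES))):
            state = STATES[idx]
            q = queues[state]
            capacity = rng.randint(1, 6)
            while capacity > 0 and q:
                spend = min(capacity, q[0])
                q[0] -= spend
                capacity -= spend
                if q[0] > 0:
                    break  # head item still needs work next interval
                # Head item finished this state; try to move it on.
                if idx + 1 == len(STATES):
                    q.popleft()
                    done += 1
                else:
                    nxt = STATES[idx + 1]
                    limit = wip_limits.get(nxt)
                    if limit is None or len(queues[nxt]) < limit:
                        q.popleft()
                        queues[nxt].append(effort[nxt])
                    else:
                        break  # blocked by the downstream WIP limit
        # Pull new arrivals into Design, respecting its WIP limit.
        limit = wip_limits.get("Design")
        while backlog > 0 and (limit is None or len(queues["Design"]) < limit):
            queues["Design"].append(effort["Design"])
            backlog -= 1
    return interval

effort = {"Design": 2, "Code": 3, "Test": 5, "Deploy": 1}
print(simulate(100, effort, {}))                                   # no WIP limits
print(simulate(100, effort, {"Design": 3, "Code": 3, "Test": 2}))  # pull system
```

Logging each state’s queue length per interval from a run like this yields exactly the kind of counts the CFDs in this post are drawn from.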
Using a WIP Limit to Protect and Buffer Your Bottleneck
Picking up from our first CFD example above, I added a single WIP limit of 2 (hint: based on the team size, the min/max settings for Test, and the simulated dice rolls) to the Test workflow state (see Figure 2, click image to enlarge). In essence, this modeled the accomplishment of two goals.
The first was to protect the Test state from being overrun. Why? As per the quotes at the top of this post, in a real software development context that ever-growing Test backlog causes problems. The second was to provide a buffer for the Test state against the inherent variability in work item size and arrivals, ensuring it always has work to pull from, since starving the “bottleneck” of a system can only lengthen the time it takes to get work items completed.
The resulting CFD shows a markedly reduced and uniform width for the Test state indicating the added WIP limit helped reduce the overall time a work item spends in this state. However, notice we still see shelves occurring between Test and Deploy, and where Deploy WIP goes to “zero” (starves). Why?
Applying a WIP limit and observing lead times decrease for the Test state does not mean the number of work items per reporting interval (the rate) that “complete” (exit) the Test state automatically increases. In a real-world software development context, this single WIP limit would likely help in a number of ways to complete work items in less overall time for the Test state. One obvious way is that the WIP limit “controls” when the clock starts on tracking time in the Test state; overall, work items spend less time physically “waiting” in this state before being “actively” worked on. Another possible real-world effect is less overhead (lost capacity) managing a large and mostly idle backlog. Yet another is less context switching (lost capacity) between now fewer “partially active” work items, which could contribute to fewer in-process mistakes (improving quality) and less rework (lost capacity). While the rate of work items completing Test might increase due to these real-world benefits, for other real-world reasons it may simply remain the same (or could even decline). Still, a large Test queue isn’t without costs either, and an appropriate WIP limit did help create predictability in the Test state lead times, which in many real-world contexts is a step toward balancing the overall system costs.
Note: The point above regarding the exit rate is a subject worth more discussion, but one I won’t dive into here. Still, understanding and being aware of this, is important in any real world context. It is also why we continue to see shelves between Test and Deploy in Figure 2. Once a work item reaches the Test state, the simulation tool simply processes the work items as it did before when there wasn’t any WIP limit controlling the entry. That is, its simple processing rules don’t model effects like increased capacity benefits from reduced context switching, or reduced overhead and potentially less rework from managing a much smaller backlog. Still, for the purposes of seeing examples of how a bottleneck might appear in a CFD, it is plenty useful.
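The relationship underneath this point is Little’s Law: average WIP = average throughput × average lead time. If the rate out of Test is capped by the bottleneck’s capacity, then lowering WIP with a limit must shorten lead time even when throughput does not budge. A quick arithmetic sketch (the numbers are illustrative, not read from the figures):

```python
# Little's Law: avg_wip = avg_throughput * avg_lead_time
throughput = 2.0   # items exiting Test per reporting interval (bottleneck-capped)

wip_before = 20.0  # Test holds ~20 items with no WIP limit
lead_before = wip_before / throughput  # intervals an item spends in Test

wip_after = 2.0    # with a WIP limit of 2
lead_after = wip_after / throughput

print(lead_before, lead_after)  # lead time drops 10 -> 1; throughput unchanged
```

This is why the Test band narrows in Figure 2 while the shelves between Test and Deploy persist: the queue shrank, the exit rate did not.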
The other “obvious” effect of this single added WIP limit is that the Code state now looks like it could be a “bottleneck” in our workflow. But is it really? There are now quite a number of noticeable shelves between the Code and Test states in the CFD, many more than we saw in Figure 1. Why? By adding the WIP limit to the Test state, work items in the Code state now have to “wait” more often before proceeding to Test. Notice also there are no shelves where the Test WIP goes to “zero” waiting for work items from the Code state. Do both of these observations suggest the Code state is not the bottleneck of this workflow?
Applying WIP Limits Upstream of the Bottleneck
In this next CFD example, WIP limits are applied upstream of the Test state, to the Code and Design states as well (see Figure 3, click image to enlarge). This models a “kanban pull system” from the Design through Test states of the workflow. What immediately stands out? First, just as we saw the single WIP limit earlier produce a predictable lead time for the Test state, we now see these additional WIP limits produce a very predictable lead time expectation for work items passing through the entire modeled workflow. Is this beneficial?
Recall, in the earlier examples, the overall WIP of the workflow (from entering Design to exiting Deploy) at any specific reporting interval increased in the subsequent reporting interval. We can infer the lead times for work items passing through the system were increasing too as we progressed through the reporting intervals. Adding the WIP limits upstream of the Test state, to the Design and Code states, produced a system where the overall WIP at any specific reporting interval, in particular after Reporting Interval 11, more or less stabilized, creating relative lead time predictability as we progressed through the reporting intervals. But where did the “visible backlog glut” go?
The work items in this simulated example simply enter the workflow at a slower rate, over a longer period of time, taking more reporting intervals to complete all the work items. But, again, this is only a model derived from a simple simulation tool. In a real-world software development context, there are likely benefits from reducing and managing overly large queues that can help increase the rate of completing (quality) work items, as I mentioned earlier. While the predictability created is possibly a significant business value on its own (ex. helping to define meaningful lead time SLAs or system throughput measures), it also creates a foundation for meaningful process improvement efforts, a baseline from which proposed process changes can be objectively measured and evaluated.
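The “overall WIP” referred to above is simply the vertical distance between the CFD’s top line (arrivals into Design) and its bottom line (arrivals into Done). An illustrative check for whether it is growing or stable, with made-up numbers rather than data from the figures:

```python
def overall_wip(design_in, done_in):
    """Total items in flight at each interval: the gap between the
    CFD's top (entered Design) and bottom (entered Done) lines."""
    return [a - d for a, d in zip(design_in, done_in)]

growing = overall_wip([5, 10, 15, 20], [1, 2, 3, 4])  # widens every interval
stable  = overall_wip([5, 8, 11, 14], [1, 4, 7, 10])  # holds steady
```

A widening gap, as in Figure 1, implies lengthening lead times; a steady gap, as in Figure 3, is the visual signature of a stabilized pull system.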
The other observation that stands out here is that “shelves” now appear frequently and on all the lines of the CFD chart. Why? Again, adding WIP limits causes work items in upstream states to “wait” more often before proceeding. Notice too, upstream of the Test state, there are no shelves where the WIP goes to “zero”; these (starving) shelves still only appear for the Deploy state. Look at Reporting Intervals 57, 59, and 61, where the CFD clearly shows the Design and Code states in a “holding pattern” and the Deploy state “starving” for work (WIP of zero), all waiting for the Test state to finish up work items. Clearly, the Test state has the characteristics of a bottleneck, right?
How Might Addressing the Bottleneck Change the CFD?
In this next CFD example, I modeled three successive improvement efforts on the Test state (see Figures 4A, 4B, and 4C, click each image to enlarge). Looking first at Figure 4A, what immediately stands out?
The first is that the shelves appear less frequently. Why? With the ability to process work items more quickly through the Test state, the Code state is now blocked less often from passing work items downstream as soon as coding is completed. In turn, the Design state is now blocked less often by the Code state being blocked.
Secondly, while there are still reporting intervals where the Deploy state is “starved”, these periods are now mostly shorter than in the earlier examples. Again, this is due to the improved performance of the Test state.
After two more runs that modeled further improvements to the Test state, the resulting CFDs respectively show even fewer shelves. Notice too the Test state band after the third improvement now becomes fairly narrow relative to the Code state. While there are still reporting intervals where the Deploy state is “starved”, they have become even less frequent now.
As a final observation, notice the “Average time to complete” metric goes from 12.4 down to 8.0 after the first modeled improvement effort to the Test state, then down to 7.1 after the second modeled improvement effort, then down to 6.3 after the third modeled improvement effort. Does this add even more support that the Test state was in fact the “bottleneck” for the modeled workflow?
Note: After reviewing these three example CFDs closely, I think Figure 4B (derived after the second modeled process improvement effort to the Test state) showed the “smoothest” overall workflow even though its “Average time to complete” metric is slightly higher than Figure 4C (derived after the third modeled process improvement to the Test state). Again, this is only a simulation, but this observation reminded me that focusing improvement on the bottleneck could “eventually shift” where it is in your workflow. After the third improvement, there are reporting intervals where the Test state WIP goes to “zero” and it appears to happen more frequently than for the Code state. Does this noticeable “turbulence” appearing in the CFD chart lines in Figure 4C relative to the “smoothness” in Figure 4B help indicate the third improvement effort on the Test state moved the system close to this shift? What might this “turbulence” suggest about the “bottleneck” in this workflow example?
How Might Addressing a Non-Bottleneck Change the CFD?
I couldn’t help but run one last CFD example to “validate” my expectation about what happens if I had modeled an improvement effort on the Code state instead of the Test state. That is, picking up from the example that created Figure 3 above, if I model an improvement to the Code state and leave the Test state untouched, what might be your expectation? The result of this last modeled example is the CFD below (see Figure 5, but DON’T click the image to enlarge just yet).
If the Test state is the bottleneck, would a modeled improvement to the Code state reduce the “Average time to complete” metric? Looking back at our earlier definitions of a bottleneck would suggest no, right? So, before you click to enlarge the Figure 5 image, one last time: are you on board with the Test state being the bottleneck?
Alright then, with your expectation now firmly in place, click on the image to enlarge and see for yourself what happened to the “Average time to complete” metric after I modeled an improvement to the Code state rather than the Test state.
Well, was your expectation met? Look at the chart lines now more closely. How different in terms of the chart line characteristics does the CFD in Figure 5 appear from the CFD in Figure 3? How different are the CFD chart line characteristics in Figure 5 from the CFD in Figure 4A or 4B?
Now, ask yourself, “where are you making improvements in your workflow today?” Where in your workflow are these improvements having a desired impact? Are you seeing the benefits from these efforts that you were expecting?
If the modeled examples helped you better understand how bottlenecks might appear in a CFD, it would be helpful (and appreciated) to hear from you and learn how they helped, or what is still puzzling.
The last thought I’ll close with is that a root cause analysis of any bottleneck should still be done carefully. In these examples, the Test state may be where the “bottleneck” appears in the CFD but an effective root cause analysis could lead you to something occurring elsewhere in the workflow. Maybe the Code state is producing very poor code. Could some of that “excess capacity” in the Code state go toward adding unit testing that would improve the quality and possibly see improvement in the Test state and the overall workflow lead time? Maybe the Design state is producing very hard to test solutions. Could using some of that “excess capacity” in the Design state to add some acceptance test driven development (ATDD) to the mix help reduce the testing effort further downstream in the Test state and improve the overall workflow lead time? Anything that directly or indirectly helps to reduce the efforts needed in the Test state in these examples would likely lead to an overall improvement to the workflow lead time, right?
The CFD tool, like other tools, really only helps you see an issue exists and hints where to begin an effective root cause analysis. It can help you get to better questions faster, but as I like to say often regarding any tool, “Thinking is still required.”
Additional helpful references:
David J. Anderson, see these two posts titled “Two Types of Bottlenecks” and “Detecting Bottlenecks in a Kanban System” (they are both at the same location, blog post, April, 2008, scroll down to see the second; Note: as of July 2013, these two posts are no longer available at the original web site).
David J. Anderson, Kanban, (2010); foundational book on implementing concepts of limiting work-in-progress in software development context.
Donald G. Reinertsen, The Principles of Product Development Flow: Second Generation Lean Product Development, (2009); indispensable resource for learning about “the science” of managing and improving workflows.