1. 9

Even though we would all like to have bug-free software, that is sadly hardly ever the case! I’m curious about how other people tackle the burden of a constant trickle of bug reports (and if you don’t have that, congratulations on your eminent software development practices).

How does your workplace deal with semi-urgent bugs and requests that are important and small enough to be completed within a day or two?

In larger organizations it seems quite common to have each team spend one day of the week on these types of issues, or to use a rotating schedule. What’s your experience with these or other methods, and their pros and cons? What do you believe works best?

  1. 4

    We do one day per week, affectionately called “bugday”. Bugs are tracked with GitHub issues, reported by staff, triaged by engineers and leads, then split up for bugday. When you are done with your split of bugs, you can return to project work, which we manage in a Notion board per Iteration (similar to Trello). If bugs are gnarly enough, they become iteration projects.

    1. 1

      That sounds like a healthy practice. Does it work well in reality?

      I’ve commonly seen the triage step skipped, which is unfortunate because it makes working with bug reports so much more fun. Don’t know how many hours (days) I’ve spent fixing some badly reported bug.

      1. 4

        We’ve been doing it for 8+ years. It started when the engineering team was 3-4 people, and that team is now 25-30 people. It’s one of our most beloved processes (even if we are occasionally grumpy about a gnarly bug), so I’d say it works very well. Sometimes it requires some upkeep, e.g. declaring bankruptcy on certain bugs or classes of tickets, or tweaking the triage process. But, in general, we’ve kept up with “one day per week for bugs” and “keep the bug list small” for years, and I can’t imagine doing it any other way.

        You’re right that it’s important to make time for the triage step. On our teams, what happened over time is that a team’s weekly checkpoint meeting would happen on Tuesday and bugday would be Wednesday, so each team would spend 15 minutes in their weekly meeting triaging/splitting/assigning bugs so that people could show up on Wednesday with clear assignments to whittle down the list.

        The other nice thing about bugday is that it creates a “pressure relief valve” with the support team, but with a reasonable-ish SLA beyond “hotfix”. That is, either something is a “hotfix”, or it’s a “bugday ticket”. If it’s a “hotfix”, there better be a damn good reason (like, a security problem or serious breakage). But, if it’s not in one of those clearly awful areas, it’s just a bugday ticket, which means it gets worked on after the ticket has had some time (a few work days, maybe even a week) to sit, be triaged and lightly investigated, and prioritized alongside other tickets. This avoids the must-fix-now, interrupt-laden, reactive culture you see on a lot of teams with widely-used products.
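
        The two-lane rule above can be sketched in a few lines. This is only an illustration, not anyone's actual tooling: the `Ticket` fields and the `route` function are hypothetical names, and the criteria are just the two examples given above (a security problem or serious breakage).

        ```python
        from dataclasses import dataclass
        from enum import Enum


        class Lane(Enum):
            HOTFIX = "hotfix"
            BUGDAY = "bugday"


        @dataclass
        class Ticket:
            # Hypothetical fields; a real tracker would use labels or similar.
            title: str
            is_security_issue: bool = False
            is_serious_breakage: bool = False


        def route(ticket: Ticket) -> Lane:
            """Only clearly awful tickets skip the queue; the rest wait for bugday."""
            if ticket.is_security_issue or ticket.is_serious_breakage:
                return Lane.HOTFIX
            return Lane.BUGDAY
        ```

        The point of keeping the rule this blunt is that there is no third lane: anything that cannot justify a hotfix is a bugday ticket by default.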

        Also, I’ll mention that as a product becomes very widely used, your ticket list starts to get dominated not by bugs, but more by “customer technical questions” (that can look like bugs on the surface). In this case, you really need to separate those two things, and have a technical support staff (perhaps even with light coding/debugging background) focus on the non-bugs, and also triage every ticket, and only escalate the “may-be-a-bug” or “definitely-a-bug” items for the core engineering team.

        1. 3

          Sorry if this sounds stupid, but what does your workload look like on the other four days?

          It seems to me that devoting one day per week for bugs is to either:

          1. “Protect” bug fixing time by setting a minimum (no less than one day per week); or
          2. “Limit” bug fixing time by setting a maximum (no more than one day per week.)

          I realize you’re saying this process works for you, but I don’t understand how. If a process artificially limits bug fixing to less than what’s necessary, the number of bugs will grow over time. If it artificially allocates more time than is necessary, the number of bugs (or bug severity) should fall until the day seems scarcely worthwhile. In either scenario, the amount of time to spend seems like it should be subject to a feedback loop, right?

          1. 2

            No, not at all – this isn’t a stupid question whatsoever!

            So, I think it’s both a “protect” and a “limit”.

            For “protect”, it’s basically saying, “We’re going to think about paying down the bug list every week for up to a whole workday per engineer, if necessary.” So it sets a weekly cadence, weekly reminder, weekly focus.

            For “limit”, it’s saying, “We’re not going to work on bugs in drips and drabs all week, because we want to devote 80%+ of our ‘good hours’ to development of long-term projects. We’re also not going to let low or mid-priority bugs derail the iterative development process and the projects we’ve already committed to – we’re not going to devolve into a reactive culture. Only ‘hotfix’ bugs can ‘skip the queue’. Otherwise, we calmly work on our committed projects in 4-week timeboxed iterations.”

            It’s also a timeboxing technique: is there some way to fix this bug within a single day, rather than making it a 3-5 day investigation and fix?

            The timeboxing part is perhaps the most useful tool, and also answers your other question. For bugs that can’t get fixed in a single day, we end bugday with a question: “Can this bug fix wait till next week’s bugday, where we will only get another full workday to take a crack at it?” Or, “Is this bug big, gnarly, or important enough that we should ‘promote it’ to a full-blown team/iteration project?” Sometimes, low-priority bugs reveal underlying issues that truly deserve to get fixed, but need to get scheduled into an iteration to get fixed properly. The bugday process prevents us from doing this “preemptively”; instead, we do it “just-in-time”.
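
            As a rough sketch of that end-of-bugday decision (the function name, the flags, and the `Outcome` values are all made up for illustration, and in practice this is a conversation rather than a function):

            ```python
            from enum import Enum


            class Outcome(Enum):
                DONE = "done"
                NEXT_BUGDAY = "next bugday"
                ITERATION_PROJECT = "iteration project"


            def end_of_bugday(fixed: bool, big_gnarly_or_important: bool) -> Outcome:
                """Decide what happens to a bug when the one-day timebox runs out."""
                if fixed:
                    return Outcome.DONE
                # Assumption: promotion is checked first; otherwise the bug
                # simply waits for another single-day attempt next week.
                if big_gnarly_or_important:
                    return Outcome.ITERATION_PROJECT
                return Outcome.NEXT_BUGDAY
            ```

            Because the promotion check only runs after a bug has already had its day, nothing becomes an iteration project before someone has actually taken a crack at it.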

    2. 2

      The best experience with this I’ve had was planning the sprint as if the one person on bug duty was on vacation. So one dedicated person per cycle who triages and fixes the bugs, of course calling others for help. But no interruptions to the team and still timely response. The person can take low-prio tickets from the backlog while not working on bugs, or assist someone else via pairing, or improve docs or tests, any “we should do that some time” task, really.

      Of course this depends on your situation: team size, how many bugs generally come in, whether the bug reporters (often coworkers) disturb the whole team or would play along with the process of simply filing it with a correct prio, etc. I see this more as a defense mechanism. If your circumstances are different, there’s no problem in doing it differently.

      Also obviously this is mostly for medium to low, or scoped high bugs, not for things that warrant hotfixes. But even then it helps to have one person available immediately, and they can always get help from the team.

      1. 1

        We’ve got a “fast lane” user story, which houses urgent bugs. Our PM is great and defers most of the work, since none of it is really urgent. Most bugs go through our refinement and planning sessions. Actual urgent issues are discussed in the team, planned/refined if needed and worked on, which results in other work being delayed. In our review that is made clear, and during the sprint stakeholders are kept up to date (by our PM).

        1. 1

          I’ll report what was common to two rather different organisations. Very different size, very different in many respects, but both called themselves “agile”.

          When something really urgent came up and I noticed (because I kept an eye on the stream of reports from the SaaS provider that reported our crashes), I might just fix it without waiting. Or if I was in heads-down mode working on something, I wouldn’t notice because in that case I wouldn’t keep an eye on that. If I fixed it, I’d mention that in the daily standup and someone else would review my fix in a hurry and we’d get it merged. If something less urgent came up I’d add it to my (plain text) file of notes for next week, and mention it in the weekly what-to-work-on meeting, and we’d usually have a fix deployed in less than two weeks.

          And the key is: Agile doesn’t mean “you may only act according to preapproved plan”. Agile means that the team trusts its members, and if someone’s deviations from what was agreed during standup/status meetings aren’t a net benefit, then the team discusses that afterwards and that team member adjusts. An agile team accepts some mistakes and learns, instead of trying to prevent wasted time by planning.

          I did receive negative feedback for some of my deviations, but also positive from some. And some of the negative feedback was of the form “well that didn’t need fixing in such a hurry, but I can see why it looked awfully urgent before the root cause was known”.

          1. 1

            Put the bugs at the top of the backlog, so they become the highest-priority items next sprint (aka next week).

            1. 1

              For the team (which consists of many crews), we have a SHIELD crew that protects the team. For the crew (usually fewer than 10 engineers), we may also have an OCE (on-call engineer) who deals with the bug to protect the crew.

              Otherwise, it’ll get put on the backlog and prioritized along with other stuff on the backlog for next sprint.