Analysts Have Technical Debt, And It Will Kill You

By Edgar Hassler
Category: Miscellaneous

The concept of technical debt is commonly understood by developers in organizations' tech departments, but technical debt also exists for analysts in analytics departments. As technical debt accumulates, the rate of errors increases. Few analysts concern themselves with this, relying instead on the hope that their results will be close to correct, oblivious to the dangers lurking just below their SQL queries and Excel spreadsheets.

Part of the reason developers take this more seriously than analysts is cultural. For analysts, it is sometimes sufficient to get an answer that is nearly correct. For developers there is often no such margin for error: consider what happens if a system charges a credit card number near the correct one, or ships an item with an identifier near the target item. Because of this, tech is often more rigorous about managing errors.

Even though an analyst may be comfortable with a small amount of error, it is rare to have any guarantee that the error actually stays small. For this reason I propose that analytics adopt some of tech's methods for managing errors. It is important to implement such methodology before the tangled morass of assumptions destroys your ability to bring understanding to your organization.

Technical Debt

Software developers are very sensitive to technical debt. This is in part because developers are the most likely to be pissed off at having to deal with it. I'm going to paraphrase Ward Cunningham, who coined the term and described technical debt in the following sense:

If we failed to make our program align with what we then understood to be the proper way to think about our [problem domain], then we were going to continue to stumble on that disagreement which is like paying interest on a loan.

— From agilealliance.org

Consider some code that controls how new users can register for a video streaming service. Now, imagine that management comes to us and says we need to have regular user accounts and business accounts. We could do one of the following:

  1. We could just copy and paste all of our user creation code, rename "user" to "business", and make small changes to represent how the logic for business users is different.
  2. We could take the time to identify the commonalities between the two cases and carefully refactor the code so that a user has a type, personal or business, with the genuinely different behavior attached to that type.

If we do (1) we can be done quickly. If we do (2) it will take longer but we'll be in a better position to handle future requests.

Let's say one such future request is to add guest users to accounts. In (1) we now have to add that code twice, once to the "user" code path and once to the "business" code path. But in (2) we only have to add it to users and it is automatically picked up by personal and business accounts alike. The extra time it takes to make the changes, to test those changes, and simply to handle the increasing cognitive load of keeping everything straight in (1) is the measure of our technical debt. (Note that in the present example this kind of copying grows tech debt exponentially: each copy-and-paste fork doubles the number of code paths every future change must touch.)
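
The two options can be sketched in a few lines of Python. All names here are hypothetical; this is only an illustration of option (2), where the shared registration logic lives in one place and only the genuinely different behavior is attached to the account type:

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    """Shared logic for all account types (option 2)."""
    email: str
    guests: list = field(default_factory=list)

    def welcome_message(self) -> str:
        return f"Welcome, {self.email}!"

    def add_guest(self, guest_email: str) -> None:
        # Added once here, guest support works for every account type.
        # Under option (1) this method would be written and tested twice.
        self.guests.append(guest_email)


class PersonalAccount(Account):
    pass


class BusinessAccount(Account):
    def welcome_message(self) -> str:
        # Only the behavior that actually differs is overridden.
        return f"Welcome, {self.email}! Your team portal is ready."


def register(email: str, business: bool = False) -> Account:
    """One registration path; the account type carries the differences."""
    cls = BusinessAccount if business else PersonalAccount
    return cls(email=email)
```

When the guest-user request arrives, `add_guest` is written once and both account types pick it up, which is exactly the interest payment option (1) would have forced us to make twice.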

Developers who have to deal with technical debt suffer it directly. Their work becomes more frustrating and less rewarding, so there's a real force driving them to pay off that debt or find a new job. But analysts are in an awkward situation where they may feel the pain of technical debt yet not be in a position to pay it down. For analysts, technical debt can come from the databases, APIs, and myriad other back-end systems, in addition to any direct tech debt in their own queries and analysis pipelines. When a developer does not understand "the proper way to think about [a problem domain]" faced by analysts, the analyst suffers the technical debt while being unable to address it.

Analytics versus Development

If you're familiar with Conway's Law, a reasonable corollary is that when development is sharply separated from analytics, there may exist no communication structure that efficiently relates the two. If developers and analysts don't communicate well, their systems will likely not do so either. Because development and analytics are both highly specialized, it's unlikely that a shared understanding arises by serendipity. This has a real cost to the business (in lost development time, in increased error rates, in actual lost data that can never be recovered, et cetera). Organizations with more integrated tech and analytics can avoid this: the two groups, when in proximity, learn from each other and talk through the challenges.

No matter how integrated tech and analytics are, there is a structural difference between the two that forces tech to address errors more rigorously. Tech loses more development time to handling technical debt, whereas analytics more likely has errors simply slip past it.

  1. When things go wrong often a developer will get exceptions, stack traces, and/or other error reporting to indicate something went wrong and how that thing went wrong.
  2. A developer often has test cases that they can lean on to determine if things are working right, and can use these tests to better understand the nature of the error.
  3. Some teams have a QA person helping ensure requirements are met and functionality is working as intended. They may also work to check that previous work is unaffected by new changes.

Analysts have none of this. Sometimes when things go wrong an analyst will get an impossible number, or a number that looks suspicious given some amount of domain knowledge. But other times they simply have the wrong value, and it's accepted. There's no safety net beyond this, and things can go very wrong very quickly.

Consider an example of clickstream analytics. Let's say we want to understand how our visitor behavior changed this past month from the same month last year.

  • First, what is a visitor? Is it any HTTP connection, including browsers that prefetch a page and bots? Is it only browsers that fully render the page and execute JavaScript? If a user has JavaScript disabled, do they not count as a visitor? Is there anything helping us discern internal users doing software QA from external customers? Do we distinguish organic traffic from paid ad visitors?
  • Let's say tech maintains a database of this traffic. Which choices did they make? Was it always that way or did it change? How would we know?
  • Let's say tech has a second database. Is the data there governed by the same decisions or were different decisions made? Is the recency of the data equivalent or is one database sufficiently behind the other enough to affect our analysis?
  • Are we able to look at individual records that occur in any mis-match or have privacy/availability concerns caused the data to be scrubbed of identifying characteristics? Is there any sensible way to determine which database is "correct" when results differ between the two?
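
Even the first bullet alone hides several incompatible answers. A tiny sketch makes the point concrete; the field names and the four sample hits below are entirely made up for illustration:

```python
# Three plausible definitions of "visitor" applied to the same raw hits.
# All field names and records are hypothetical.

hits = [
    {"ip": "10.0.0.1",    "user_agent": "Mozilla/5.0",  "executed_js": True,  "internal": False},
    {"ip": "10.0.0.2",    "user_agent": "Googlebot/2.1","executed_js": False, "internal": False},
    {"ip": "10.0.0.3",    "user_agent": "Mozilla/5.0",  "executed_js": False, "internal": False},
    {"ip": "192.168.1.5", "user_agent": "Mozilla/5.0",  "executed_js": True,  "internal": True},
]

def visitors_any_http(hits):
    # Definition A: every HTTP connection counts, bots and all.
    return len(hits)

def visitors_js_rendered(hits):
    # Definition B: only clients that executed JavaScript.
    return sum(1 for h in hits if h["executed_js"])

def visitors_external_non_bot(hits):
    # Definition C: exclude known bots and internal QA traffic.
    return sum(
        1 for h in hits
        if "bot" not in h["user_agent"].lower() and not h["internal"]
    )
```

On this sample, definition A counts 4 visitors while B and C each count 2. Worse, B and C agree numerically while counting different records (B includes the internal QA browser, C includes the non-JS browser), which is exactly the kind of coincidental agreement that makes "the numbers match" a false comfort.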

Analytics Technical Debt is Pernicious

I chose the word pernicious because you can't directly ignore the kinds of problems above; doing so is clearly indefensible. When leadership asks why \(x \neq y\) you can't just stand there like an idiot and say you don't know. No, instead, that ignorance gets cloaked behind a long chain of historical analyses, where the limitations of those analyses are lost over time and no one really understands fully what is going on.

Here's how I've seen it go down. If two data sources produce different statistics, someone is sent to investigate "why." Sometimes they determine that one source leaves out something another source counts at least once, this is accepted as good enough, and life moves on. It is rare that an analyst is given enough time (and that the data even allows them) to enumerate the differences and verify that the proposed reason is the only reason for the discrepancy.
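
When the data does allow record-level comparison, enumerating the differences is not much code. A minimal sketch, with made-up visitor IDs standing in for whatever join key the two sources share:

```python
# Compare two sources at the record level rather than accepting that
# aggregate totals are "close enough". The IDs are illustrative.

source_a = {"v1", "v2", "v3", "v5"}  # e.g., IDs seen by database A
source_b = {"v1", "v2", "v4", "v5"}  # e.g., IDs seen by database B

only_in_a = source_a - source_b  # records B never saw
only_in_b = source_b - source_a  # records A never saw

# If someone claims "B drops bot traffic" explains the gap, check that
# every record in only_in_a actually is bot traffic; otherwise the
# proposed reason is not the only reason for the difference.
```

The point is not the set arithmetic but the discipline: an accepted explanation should account for every record in `only_in_a` and `only_in_b`, not merely move the totals closer together.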

If the reason for the difference is complicated enough, people will accept that "someone understands it, so it's a solved problem." Later, the "knowledge" that it is solved spreads orally, and people become less and less able to explain the limitations of the analysis or the specific details of the findings. Thus we end up with an "explanation" but no way to know how well it explains anything. Fun times.

In the situation described above, the data is trusted, leadership congratulates itself on being a data-oriented organization, and it's all fiction. This fiction is believed even in the face of evidence, because the difference between what analytics predicts and what really happens can always be attributed to myriad background factors. Analytics can assign blame to random noise in much the same way medical doctors once assigned blame to bad humors. Yet such a defense cannot hold up against aggregate performance over time. Eventually this house of cards will fall.

Analytics can learn a lot from tech and the way it establishes guarantees on quality. I think greater emphasis needs to be put on the ability to test analytical quality, and then on maintaining a testing regimen that enforces it. Here are some areas on which you can focus:

  • Data Quality: Be able to show that the fidelity of the data is maintained as it moves from production systems to analytics systems.

  • Leverage Automated Testing: Automate sending tagged data through your system at various injection points so you can track its source, and ensure that this data shows up in subsequent analyses as expected.

  • Demand Testability: If it is technically infeasible to test for analytics quality then rewrite your system. Being able to ensure quality and accuracy is of paramount importance.

  • Record The History of Your Data: Hire data QA people, give them the knowledge to be able to review things for correctness, and make sure they have recorded the history of the data.

  • Expose Ambiguity: Interview analysts and determine how many different ways there are to answer questions.

    1. Ask lots of questions: Are there multiple ways to answer questions? Why are there multiple ways? Must these ways agree? Do they really agree? Do they often agree? Do they ever agree?
    2. Have analysts from different teams try to answer questions for the other team, and if the two approaches and results differ start a discussion to figure out why that happened. You want to work towards a system where there's one obvious way to answer questions.
  • Refuse to Compromise on Understanding Your Data: People don't have to know everything about the data, but they must know how to find that information quickly if the need arises. If you have an analyst interviewing great swaths of people to find out how a number came to be and how long it's been that way then you have real problems.

  • Analysts and Developers Need To Understand Each Other's Work: Cross-train developers and analysts on the areas where their domains interface. Give your developers a statistics refresher course. Help your analysts understand how a web browser works if they're dealing with clickstream data. Smash the barriers between them.
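
The tagged-injection idea from the automated-testing bullet can be sketched very simply. Everything below is hypothetical: the pipeline stands in for a real ETL job, and the tag scheme is invented for the example. The shape of the test is what matters: inject a recognizable synthetic record and assert that it appears downstream exactly as often as it should:

```python
# Sketch of tagged-data injection: push a synthetic record through the
# pipeline and assert it survives (exactly once) in the downstream output.
# A real version would write to and read from actual systems.

SYNTHETIC_TAG = "_synthetic_test_"

def pipeline(records):
    # Stand-in for the real ETL, which here dedupes repeated hits.
    seen = set()
    out = []
    for r in records:
        key = (r["visitor_id"], r["page"])
        if key in seen:
            continue
        seen.add(key)
        out.append(r)
    return out

def injection_check(production_records):
    probe = {"visitor_id": SYNTHETIC_TAG + "001", "page": "/landing"}
    # Inject the probe twice so the check also exercises deduplication.
    result = pipeline(production_records + [probe, probe])
    matches = [r for r in result
               if r["visitor_id"].startswith(SYNTHETIC_TAG)]
    assert len(matches) == 1, "probe record lost or duplicated in pipeline"
    return True
```

Run on a schedule, a check like this turns "we hope the counts are right" into a standing guarantee that known inputs produce known outputs, which is the safety net developers already enjoy from their test suites.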

Summing It Up

Analytics technical debt starts from the gap between the developer's understanding of analytics and the analyst's understanding of the data generation mechanism, and to varying degrees it covers the analyst's body of work. This debt elevates error rates, and since analysts often lack the robust testing frameworks we see in development, more errors slip through the cracks and more incorrect conclusions are reached. Above I have suggested borrowing some of the methods tech uses to mitigate errors, in the hopes that they can reduce errors due to analytics technical debt.

</Rant>