Can You Use Data Science to Wrangle Bad Tags?

Q & A with Mark Edmondson


"Garbage In, Garbage Out" goes the common saying, often used in the context of how subpar tagging can jeopardize even the most expensive of click stream solutions. We recently spoke with Mark Edmondson, a data insight developer and co-creator of dartistics.com on that very topic and how even some of the most advanced data science techniques are helpless when it comes to dealing with low quality data. 

 

 You have collaborated on putting together some really interesting resources on dartistics.com. What prompted you to begin this project? How do you see the evolution of the clickstream analytics industry over the next few years?

ME: Dartistics.com was originally formed as the course material for an R workshop run in Copenhagen with Tim Wilson. We wanted to encourage more use of data programming within digital analytics as a whole, and we had been talking about the roadblocks that were already preventing people from using, say, R or Python. A lot of the material is what we wish had been on the web when we ourselves were getting to grips with it. We believe data programming is going to become more and more essential in the clickstream analytics industry as analysts cope with powerful tools such as machine learning and statistical inference.

 

In your experience, at what point in an organization's maturity do customers begin to invest in more advanced, statistics-oriented aspects of clickstream analytics?

ME: Organisations are usually pretty far along in their digital analytics maturity before applying such techniques. They may already have the skills in other departments, such as CRM or BI, but in my experience merging those datasets and applying those techniques is still not common in those organisations.

 

 What are some of the questions customers try to answer by turning to products such as R? Is anomaly detection one of these questions?

ME: Forecasting is often the first request, although once delivered it often turns out not to be that useful without other data sources attached that allow actual decisions to be made from the forecast. Anomaly detection is a much more useful question since it prompts responses. Other questions include clustering customers into segments, predicting a customer’s next move, and content recommendation.
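
To make the anomaly detection use case concrete, here is a minimal sketch in R, the tooling discussed above, assuming a hypothetical data frame called daily holding dates and session counts exported from an analytics tool; it decomposes the weekly pattern and flags days whose residuals sit more than three standard deviations from zero.

    # Minimal anomaly detection sketch: assumes a hypothetical data frame
    # "daily" with a date column and a numeric sessions column.
    daily <- data.frame(
      date     = seq(as.Date("2017-01-01"), by = "day", length.out = 90),
      sessions = round(1000 + 200 * sin(1:90 / 7) + rnorm(90, sd = 50))
    )
    daily$sessions[60] <- 3000   # simulated tracking spike

    # Decompose the daily series (weekly seasonality) and inspect residuals
    ts_sessions <- ts(daily$sessions, frequency = 7)
    decomp      <- stl(ts_sessions, s.window = "periodic")
    remainder   <- decomp$time.series[, "remainder"]

    # Flag days whose residual is more than 3 standard deviations from zero
    daily[abs(remainder) > 3 * sd(remainder), ]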

 

 Can you recall any past experience where implementing statistical algorithms on top of clickstream data resulted in uncovering tagging issues? How did that impact your plan?

ME: Almost always we find data collection issues in the data used in models, which require either laborious data cleaning or a correction of the tagging. A model is only as good as the data you feed it, and even with perfect data you have to spend time shaping it so that models can use it. We often find issues such as PII in the data, broken iframe tracking inflating certain URLs, and so on.
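
As an illustration of the kind of check involved, the sketch below (in R, with a hypothetical data frame called pages holding a pagePath dimension) scans URLs for email addresses, one common way PII leaks into clickstream data.

    # Sketch of a PII scan over a URL dimension: assumes a hypothetical
    # data frame "pages" with a pagePath column exported from analytics.
    pages <- data.frame(
      pagePath = c("/checkout/thanks?email=jane.doe%40example.com",
                   "/products/shoes",
                   "/account?user=john@example.com"),
      stringsAsFactors = FALSE
    )

    # Flag rows whose URL contains an email address (plain or URL-encoded @)
    email_pattern <- "[A-Za-z0-9._%-]+(@|%40)[A-Za-z0-9.-]+\\.[A-Za-z]{2,}"
    pages$has_pii <- grepl(email_pattern, pages$pagePath, ignore.case = TRUE)
    pages[pages$has_pii, ]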


How often do you see cases where answering a data science question becomes harder because of tagging quality issues? Can you recall any recent examples of such cases?

ME: I would say around 75% of cases. The most recent example was a forecast that depended on URLs from an analytics system. Because the tracking was stripping parameters out of the URLs, and those parameters changed the content on the page, the predictions were incorrect. It meant we had to stop all work, wait for the fix to go live, then wait for new data to be gathered, putting us back at least a month.

 

  What pros and cons do you see in trying to uncover data quality issues in clickstream implementations at the tagging level vs. relying on algorithmic anomaly detection to discover tagging issues?

ME: Some data quality issues will only be found at the tagging level, some only via anomaly detection. For instance, the aforementioned iframe tracking inflating certain URLs was only found once the model was run and an unreachable URL was flagged as the most popular; without deep domain knowledge of the tracking setup it looked like legitimate data. Likewise, anomaly detection can flag when, say, event tracking stops working on a certain section of your website, even if the tracking tag is sending data fine but a misconfiguration in the analytics property (say, a filter) prevents the data from coming through correctly.
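
A rough sketch of that second case in R, assuming a hypothetical data frame called events with date, section and totalEvents columns: it compares each section's latest day against its own historical baseline and flags any section whose events have collapsed.

    # Sketch: flag site sections whose event counts have collapsed versus
    # their own baseline. Assumes a hypothetical data frame "events" with
    # date, section and totalEvents columns.
    set.seed(1)
    events <- expand.grid(
      date    = seq(as.Date("2017-06-01"), by = "day", length.out = 30),
      section = c("/blog", "/shop", "/support")
    )
    events$totalEvents <- rpois(nrow(events), lambda = 500)
    events$totalEvents[events$section == "/shop" &
                       events$date == max(events$date)] <- 0   # simulated breakage

    latest   <- max(events$date)
    baseline <- aggregate(totalEvents ~ section,
                          data = events[events$date < latest, ], FUN = median)
    today    <- events[events$date == latest, c("section", "totalEvents")]

    # Report sections running far below their usual volume
    check <- merge(baseline, today, by = "section",
                   suffixes = c("_baseline", "_today"))
    check[check$totalEvents_today < 0.2 * check$totalEvents_baseline, ]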

 

  Does the accuracy of the statistical models suffer because of issues with data collection? What are some of the interesting techniques that can account for such deficiencies?

ME: Absolutely. If you are lucky the model returns a nonsense result and you can trace it back to the data that created it, but the unknown effects of more subtle errors are very difficult to track down. There is always a validation phase for a model, so you should be able to catch such inaccuracies by reserving a sample of your raw data for that (say 30%) and testing against real-world results.
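
The holdout step he describes could look something like the sketch below in R, using an invented data frame called clicks and a deliberately simple linear model purely for illustration.

    # Sketch of reserving ~30% of the data for validation. The data frame
    # "clicks" and the linear model are hypothetical, purely for illustration.
    set.seed(42)
    clicks <- data.frame(sessions = rpois(200, lambda = 300))
    clicks$conversions <- round(clicks$sessions * 0.05 + rnorm(200, sd = 3))

    # Hold back 30% of rows for validation
    holdout_idx <- sample(nrow(clicks), size = round(0.3 * nrow(clicks)))
    train <- clicks[-holdout_idx, ]
    test  <- clicks[holdout_idx, ]

    # Fit on the training set, then check the error on the held-out sample
    model <- lm(conversions ~ sessions, data = train)
    preds <- predict(model, newdata = test)
    sqrt(mean((preds - test$conversions)^2))   # RMSE on the holdout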

 

What are some of the steps involved in trying to enable anomaly detection across an enterprise digital analytics footprint with a large number of metrics and dimensions?

ME: A lot of analytics data models are hierarchical, so we have success performing anomaly detection on the higher aggregations first, then drilling down when an anomaly is found to pinpoint the cause. There are also always key metrics that the business relies on, which can be set up on demand.
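
That top-down approach might look roughly like the following R sketch, using a hypothetical data frame called traffic with date, section and sessions columns: test the site-wide total first, and only drill into sections when the total looks anomalous.

    # Sketch of the top-down approach: check the site-wide total first and
    # only drill into sections when it looks anomalous. The data frame
    # "traffic" is hypothetical.
    set.seed(7)
    traffic <- expand.grid(
      date    = seq(as.Date("2017-06-01"), by = "day", length.out = 60),
      section = c("/blog", "/shop", "/support")
    )
    traffic$sessions <- rpois(nrow(traffic), lambda = 400)
    traffic$sessions[traffic$date == max(traffic$date) &
                     traffic$section == "/shop"] <- 5000   # simulated anomaly

    flag_anomaly <- function(x, k = 3) {
      # Flag the latest value if it sits more than k standard deviations
      # away from the mean of the earlier values
      history <- head(x, -1)
      abs(tail(x, 1) - mean(history)) > k * sd(history)
    }

    # Level 1: site-wide daily totals
    totals <- aggregate(sessions ~ date, data = traffic, FUN = sum)
    if (flag_anomaly(totals$sessions)) {
      # Level 2: drill down per section to locate the cause
      sapply(split(traffic, traffic$section),
             function(df) flag_anomaly(df$sessions[order(df$date)]))
    }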

 

  Do you see anomaly detection and tagging audits as competing or complementing each other?

ME: Complementing. At IIH Nordic we use the term “Data Governance”, which covers a range of techniques, from anomaly detection to tagging audits, workplace data rules and privacy consultation.

