Does Data Fusion work?

The power and limitations of data integration tools

Is it possible to create a true single-source dataset from two separate datasets? No! Can we, under favourable conditions and using similar datasets, get incredibly close? Yes! The reason Data Fusion has remained such a staple of the media research world has more to do with a growing appreciation of its challenges than with any great evolution in computing power or mathematical innovation.

Data Fusion algorithms are, in themselves, beautifully elegant. Their application, however, the actual merging of datasets, remains a deeply fraught affair.

Data Fusion is not a perfect technique - no predictive technique is. First, consider the differences it needs to overcome.

In reality, no two datasets are the same, and fusing them together will not make them the same. To avoid falling prey to unrealistic expectations, first appreciate the data you're working with. So before looking at why fusion works, it's easier to cover the reasons why it doesn't, which revolve around the challenges of working with disparate data.

Difference of definition

The concept underpinning Data Fusion is quite intuitive: similar individuals will share similar data points. Of course, they will not share identical data. However, one room full of loyal customers is likely to say almost exactly the same thing as another room full of loyal customers. Now scale this concept up into the hundreds and thousands. Data Fusion works at scale.

However, this relies on both datasets covering the same population of individuals. Matching 'your loyal subscribers' to 'all customers' or 'potential customers' will not predict future sales or churn. For the purposes of the analysis, these are different populations. They will give different answers.

Difference of source

Methodology determines the data: where the data comes from matters. This is not merely an issue of the composition, representation, and quality of the data, but also of the mode of data collection. Collecting the same data point through different modes produces different distributions, skews and biases.

Difference of time

Truth changes over time - and it is highly unlikely that the datasets you are working with will represent the same snapshot in time. In practical terms, fusing advertising data collected this year to this year's brand tracker is far superior to fusing this year's data to a brand tracker covering the past 5 years. Time frames matter.

Difference in structure

Data Fusion tools rely entirely on the availability of common variables that can explain the parts of the data which are not common. By the same token, no one set of variables will correlate well with all parts of a dataset. Your common variables may explain some parts of your data well, and other parts less so.

Even the best fusion design cannot overcome the many ways in which datasets differ. Together, the differences above set the ceiling which even a perfect Data Fusion will invariably hit. Data Fusion is very much an uphill battle from the start. Failing to take these differences into account means placing expectations on Data Fusion which not even longitudinal or single-source data could meet. Yet for most of its history, Data Fusion companies were measured against this insurmountable standard.

So, does it work?

Let's pivot now to the other side of the coin - why Data Fusion does work. In reality, finding good matches across datasets is not difficult. By way of example, any one person you meet has only a 1 in 365 chance of sharing your birthday, but meet two people and the chance is already about 1 in 183. If there are 25 customers in a room, there is a better-than-even chance (about 57%) that at least two of them share a birthday. A typical Data Fusion will have thousands of individuals to choose from. But the magic doesn't happen yet…
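The birthday arithmetic is easy to check. The short sketch below is an illustration only, assuming 365 equally likely birthdays and ignoring leap years; the function name is made up for this example.

```python
def shared_birthday_probability(group_size: int, days: int = 365) -> float:
    """Chance that at least two people in a group share a birthday.

    Assumes every day of the year is equally likely (a simplification).
    """
    p_all_distinct = 1.0
    for already_taken in range(group_size):
        # Each new person must avoid every birthday already in the room.
        p_all_distinct *= (days - already_taken) / days
    return 1.0 - p_all_distinct


for n in (2, 10, 23, 25, 50):
    print(f"{n:>2} people: {shared_birthday_probability(n):.1%}")
```

The even-odds point is passed at 23 people, and a room of 25 sits at roughly 57%.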

Indeed, a typical fusion is not attempting to match on measures which have 365 possibilities. In almost all commercial contexts, there are only a finite number of ways an individual can differ on any one metric. The matching task is more like 1 in 10, as with a geographical region, or 1 in 5 or fewer, as with age bands, gender, survey answers and, sometimes, attitudes. With each additional shared measure, two matched individuals look more alike, and so it becomes ever more likely that the rest of the data, not common to both datasets, will be very similar indeed.
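To put rough numbers on this, the sketch below estimates the chance that a given profile has at least one exact counterpart among the individuals available for matching. It is a back-of-envelope illustration, not a description of any particular fusion tool: it assumes every combination of categories is equally likely, which real data never quite is, and the measure sizes and pool size are hypothetical.

```python
from math import prod


def exact_match_probability(levels, pool_size):
    """Chance that at least one of `pool_size` individuals matches a
    given profile exactly on every common measure.

    `levels` lists the number of categories per measure. Assumes all
    category combinations are equally likely (a loud simplification).
    """
    cells = prod(levels)  # number of distinct profiles
    return 1.0 - (1.0 - 1.0 / cells) ** pool_size


# Hypothetical design: 10 regions x 5 age bands x 2 genders.
coarse = exact_match_probability([10, 5, 2], pool_size=1000)
# The same design plus two 5-point survey answers.
finer = exact_match_probability([10, 5, 2, 5, 5], pool_size=1000)

print(f"3 measures, 100 profiles:  {coarse:.3f}")
print(f"5 measures, 2500 profiles: {finer:.3f}")
```

With a handful of coarse measures and a pool of a thousand, an exact match is close to certain. As measures are added the number of distinct profiles multiplies, so exact matches thin out and the task becomes one of finding the closest available match rather than a perfect one.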

That said, the objective of fusion is not to get each individual correct - it most certainly will not be able to match everybody up on everything, nor do so in a way which perfectly reflects the data elsewhere. The objective of fusion is simply to be more similar than not, across as many measures as possible.

While each match might be off, as one adds more similarities and individuals, eventually the law of averages will kick in. It is here, in the overall shakeup, the aggregate view, where Data Fusion shines.
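A small simulation makes the gap between individual and aggregate accuracy visible. The sketch below is a toy under stated assumptions, not any vendor's algorithm: one common variable (a hypothetical age band), one behaviour to be fused (a hypothetical "buys the brand" flag whose rate depends on the band), and each recipient simply receives the answer of a random donor from the same band.

```python
import random


def simulate_fusion(n_donors=5000, n_recipients=5000, seed=0):
    """Return (individual accuracy, fused rate, true rate) for a toy fusion."""
    rng = random.Random(seed)
    # Hypothetical purchase rates for five age bands.
    rate_by_band = [0.1, 0.2, 0.3, 0.4, 0.5]

    def draw_person():
        band = rng.randrange(len(rate_by_band))
        buys = rng.random() < rate_by_band[band]
        return band, buys

    donors = [draw_person() for _ in range(n_donors)]
    recipients = [draw_person() for _ in range(n_recipients)]

    # Group donor answers by the common variable.
    donors_by_band = {
        b: [buys for band, buys in donors if band == b]
        for b in range(len(rate_by_band))
    }

    hits = fused_total = true_total = 0
    for band, true_buys in recipients:
        # Match on the common variable only, then borrow the donor's answer.
        fused_buys = rng.choice(donors_by_band[band])
        hits += fused_buys == true_buys
        fused_total += fused_buys
        true_total += true_buys

    return hits / n_recipients, fused_total / n_recipients, true_total / n_recipients


accuracy, fused_rate, true_rate = simulate_fusion()
print(f"individual matches correct: {accuracy:.1%}")
print(f"fused purchase rate:        {fused_rate:.1%}")
print(f"true purchase rate:         {true_rate:.1%}")
```

In this setup a large share of individual answers are wrong, yet the fused purchase rate lands very close to the true one. The recipients' true answers are only known here because the data is simulated; in a real fusion they are exactly what is missing.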

Commercial impact of Data Fusion

In the modern data economy, however, the goal is not to be perfect - it's to leverage the data points you have as effectively as possible. This puts the accuracy the analyst can achieve through fusion into context. This is not old-school data enrichment - the goal is not to know with certainty that Customer #123 will find a certain message or creative more appealing, or relevant enough to respond to an e-mail campaign and purchase. The goal is to know which messages or creatives customers like Customer #123 are more likely to engage with, and to do so in a way which makes sense and is neither creepy nor invasive. The goal is merely to know what is likely to resonate more. For many businesses, leveraging the added confidence which fusion brings means being able to shift a click-through rate from 0.5% to 0.8% and make millions in the process.
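The click-through figures are worth spelling out, because a shift of 0.3 percentage points sounds small. The sketch below assumes a hypothetical volume of one million impressions, a number chosen purely for illustration; it says nothing about revenue, which depends on each business.

```python
def ctr_uplift(baseline_ctr, improved_ctr, impressions):
    """Return (relative uplift, extra clicks) for a change in click-through rate."""
    relative_uplift = (improved_ctr - baseline_ctr) / baseline_ctr
    extra_clicks = (improved_ctr - baseline_ctr) * impressions
    return relative_uplift, extra_clicks


# 0.5% -> 0.8% are the figures from the text; the volume is an assumption.
relative, extra = ctr_uplift(0.005, 0.008, impressions=1_000_000)
print(f"relative uplift: {relative:.0%}")
print(f"extra clicks per million impressions: {extra:,.0f}")
```

Moving from 0.5% to 0.8% is a 60% relative increase, or about 3,000 additional clicks for every million impressions served.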

In this light, it's clear why Data Fusion, when applied well, is an extremely valuable tool. Data Fusion only needs to work a little bit to pay real dividends - and in practice, it does a lot better than that!

The history of Data Fusion has been slow and steady, rather than assured. Unlike many techniques, however, it has stood the test of time. It has done so, despite unreasonable expectations, because it works.
