Is it possible to create a truly single-source dataset from two separate, disparate datasets? No! Can we, under favourable conditions and using similar datasets, get incredibly close? Yes! The reason Data Fusion has remained such a staple of the media research world has more to do with a growing appreciation of its challenges than with any great evolution in computing power or mathematical innovation.
Data Fusion algorithms, in themselves, are beautifully elegant. Applying them to merge real datasets, however, remains a deeply fraught affair.
Data Fusion is not a perfect technique - no predictive technique is. First, consider the differences it needs to overcome.
In reality, no two datasets are the same, and fusing them together will not make them the same. To avoid falling prey to unrealistic expectations, first appreciate the data you're working with. Before looking at why fusion works, it is easier to cover the reasons why it sometimes doesn't, and these revolve around the challenges of working with disparate data.
Difference of definition
The concept underpinning Data Fusion is quite intuitive: similar individuals will share similar data points. Of course, they will not share identical data. However, one room full of loyal customers is likely to say almost exactly the same thing as another room full of loyal customers. Now scale this concept up into the hundreds and thousands. Data Fusion works at scale.
However, this relies on the datasets representing the same population of individuals. Matching 'your loyal subscribers' to 'all customers' or 'potential customers' will not predict future sales or churn. For the purposes of the analysis, these are different populations, and they will give different answers.
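The matching idea in this section can be sketched as a nearest-neighbour fusion: each respondent in one dataset (the recipient) borrows values from the most similar respondent in the other (the donor), judged on the variables both share. This is only a minimal illustration, not a production fusion algorithm. The function names, variable names, and toy data are all hypothetical, and a real design would at least standardise the common variables and limit how often a single donor is reused.

```python
from math import sqrt

def distance(a, b, keys):
    """Euclidean distance between two respondents on the common variables."""
    return sqrt(sum((a[k] - b[k]) ** 2 for k in keys))

def fuse(recipients, donors, common, donated):
    """Give each recipient the donated values of its nearest donor."""
    fused = []
    for r in recipients:
        nearest = min(donors, key=lambda d: distance(r, d, common))
        row = dict(r)  # copy so the recipient data is left untouched
        row.update({v: nearest[v] for v in donated})
        fused.append(row)
    return fused

# Hypothetical toy data: age and tenure (in years) are the common variables.
donors = [
    {"age": 25, "tenure": 1, "satisfaction": 6},
    {"age": 52, "tenure": 9, "satisfaction": 9},
]
recipients = [
    {"age": 27, "tenure": 2, "weekly_tv_hours": 8},
    {"age": 50, "tenure": 10, "weekly_tv_hours": 21},
]

fused = fuse(recipients, donors, common=["age", "tenure"], donated=["satisfaction"])
for row in fused:
    print(row)
```

Note that the sketch will happily match any recipient to some donor, even when the two datasets describe different populations. The nearest neighbour is only meaningful when both sides are drawn from the same kind of people.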
Difference of source
Methodology determines the data: where the data comes from matters. This is not merely an issue of the composition, representativeness, and quality of the data, but also of the mode of data collection. Collecting the same data point through different modes produces different distributions, skews, and biases.
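One rough way to see a mode effect before fusing is to compare how the same question is answered in each source. The sketch below uses total variation distance (half the sum of absolute differences in category shares) as one possible yardstick. The answer lists are invented for illustration, and the choice of metric is an assumption rather than an established standard for fusion work.

```python
from collections import Counter

def shares(answers):
    """Proportion of respondents giving each answer."""
    counts = Counter(answers)
    total = len(answers)
    return {category: n / total for category, n in counts.items()}

def total_variation(a, b):
    """Total variation distance between two answer distributions (0 = identical, 1 = disjoint)."""
    pa, pb = shares(a), shares(b)
    categories = set(pa) | set(pb)
    return 0.5 * sum(abs(pa.get(c, 0.0) - pb.get(c, 0.0)) for c in categories)

# Hypothetical answers to the same viewing-frequency question, collected two ways.
online = ["daily"] * 6 + ["weekly"] * 3 + ["never"] * 1
phone = ["daily"] * 3 + ["weekly"] * 4 + ["never"] * 3

gap = total_variation(online, phone)
print(f"Share of answers that differ between modes: {gap:.2f}")
```

A large gap on a variable you intend to use for matching is a warning sign: the two sources may be measuring that variable differently, not just sampling different people.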
Difference of time
Truth changes over time - and it is highly unlikely that the datasets you are working with will represent the same snapshot in time. In practical terms, fusing advertising data collected this year to this year's brand tracker is far superior to fusing this year's data to a brand tracker covering the past 5 years. Time frames matter.
Difference in structure
Data Fusion tools rely entirely on the availability of common variables that can explain the parts of the data which are not common. By the same token, no one set of variables will correlate well with all parts of a dataset. Your common variables may explain some parts of your data well, and other parts less so.
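A simple diagnostic for this is to check, within one dataset, how strongly each variable you hope to transfer is related to the common variables. The sketch below reports, for each target variable, its strongest absolute Pearson correlation with any common variable. The variable names and figures are hypothetical, the variables are assumed to be numeric with non-zero variance, and correlation is only one of several measures that could be used here.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (both must vary)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def explanatory_power(rows, common, targets):
    """For each target, the strongest absolute correlation with any common variable."""
    return {
        t: max(
            abs(pearson([r[c] for r in rows], [r[t] for r in rows]))
            for c in common
        )
        for t in targets
    }

# Hypothetical respondents: age is the only common variable.
rows = [
    {"age": 20, "streaming_hours": 10, "cereal_brand_score": 5},
    {"age": 30, "streaming_hours": 8, "cereal_brand_score": 2},
    {"age": 40, "streaming_hours": 6, "cereal_brand_score": 5},
    {"age": 50, "streaming_hours": 4, "cereal_brand_score": 2},
]

power = explanatory_power(rows, ["age"], ["streaming_hours", "cereal_brand_score"])
for target, strength in power.items():
    print(f"{target}: {strength:.2f}")
```

In this toy data, age tracks streaming hours closely but says much less about the cereal score, so a fusion built on age alone would be expected to carry the first variable across far more faithfully than the second.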
Even the best fusion design cannot overcome the many ways in which datasets differ. All the above differences combine to form the ceiling which even a perfect Data Fusion will invariably hit. Data Fusion is very much an uphill battle from the start. Failing to take these differences into account means placing expectations on Data Fusion which not even longitudinal or single-source data could meet. Yet for most of its history, Data Fusion companies were measured against this unattainable standard.