The objective of Data Fusion is to merge two datasets by matching individual cases from one dataset to the other. These datasets are widely referred to as the donor and recipient datasets. Instead of trying to generate or estimate each column/field/variable in the recipient dataset individually (which does not work!), real data from a donor individual is gifted to a recipient. If recipients receive data from donors who are more like them, then at an aggregate level the data ought to behave as if it belongs to the same individual. That is, while each merged case will be wrong, over hundreds and thousands of individuals the estimates begin to approximate single-source data. It's easy to validate data fusion using your own data, and it is recommended that you start here, by building a data fusion sandbox in which to test and experiment.
Much attention is paid to the common measures between the two datasets. These become the zip which binds the two. In the literature, they are often referred to as the X variables, where the first dataset comprises X and Y, and the second X and Z. If X explains Y in the first dataset, and X explains Z in the second, then it stands to reason that if X were joined across datasets, one could approximate the relationship between Y and Z.
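The sandbox idea and the X/Y/Z logic above can be sketched together in a few lines. This is a minimal illustration, not production fusion software: it assumes pandas and numpy, invents a toy single-source dataset where a common variable x drives both y and z, splits it into donor and recipient halves, and performs a simple one-nearest-neighbour hot-deck match on x. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy single-source sandbox: x is the common variable,
# and both y and z are largely explained by x.
n = 2000
x = rng.normal(size=n)
df = pd.DataFrame({
    "x": x,
    "y": x + rng.normal(scale=0.5, size=n),
    "z": -x + rng.normal(scale=0.5, size=n),
})

# Split into a donor dataset (X, Z) and a recipient dataset (X, Y).
donors = df.iloc[:1000][["x", "z"]].reset_index(drop=True)
recipients = df.iloc[1000:][["x", "y"]].reset_index(drop=True)

# Hot-deck fusion: each recipient is gifted z from the donor
# whose common variable x is closest (1-nearest neighbour).
dist = np.abs(recipients["x"].to_numpy()[:, None] - donors["x"].to_numpy()[None, :])
nearest = dist.argmin(axis=1)
fused = recipients.assign(z=donors["z"].to_numpy()[nearest])

# Each matched case is individually wrong, but the y-z relationship
# is approximately recovered at the aggregate level.
true_corr = df["y"].corr(df["z"])
fused_corr = fused["y"].corr(fused["z"])
print(round(true_corr, 2), round(fused_corr, 2))
```

Because you held back the true single-source data, you can compare the fused y-z correlation against the real one, which is exactly the validation exercise the sandbox is for.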
Step 1: Deciding which common variables to use
And so, the first step is to ensure that the abovementioned associations hold up to scrutiny. The analyst needs to be confident that the common variables (otherwise known as 'hooks') are meaningful to the objective of the analysis. It is not necessarily the case that including all the common variables in the design will lead to the best set of matches or the most accurate fused estimates. Including irrelevant common variables can introduce unwanted noise and pull the algorithm off in an unhelpful direction. Setting clear objectives as to how the fused dataset will be used, and how it will be analysed, is important. This is a critical consideration when the datasets are varied and broad in scope. That is, if one or more of the datasets traverses many subject areas, it will not be possible for the fused dataset to accurately approximate all the topics of interest within the same fusion design. Often the fusion is asked to be better at some topics in one dataset, at the expense of others. All this, very much, depends on your use case.
Often sociodemographic measures like gender, age, and geography are important common variables, but alone they are seldom enough to explain Y and Z. Product usage, brand engagement and attitudes to the sector typically perform better, and better still when combined with sociodemographic measures. While supervised fusions help greatly, beginning with a strong set of candidate common variables and clear objectives is clearly preferable, even if, in some contexts, a luxury.
Step 2: Deciding the role each dataset will play
It may occur to the analyst, at this stage, that as data is being gifted from one dataset to another, it makes no difference which dataset brings Y or Z. The result is a merged dataset containing both sources of information, right? In some more advanced fusion designs, this is absolutely the case. However, for most standard designs it can matter a great deal, because the ratio of donors to recipients determines the effective base size: the sample size which governs the confidence ranges around your estimates.
Data Fusion companies advise selecting the smaller dataset as the recipient dataset. This allows each recipient a greater selection of donors to choose from; the higher the ratio, the more choice. This is an important dynamic to consider before running your first project. If the analyst were to select donors from a smaller pool than recipients, each donor is more likely to be used multiple times, which greatly lowers the effective base size and widens the confidence ranges, undermining confidence in the final fused estimates. Not good.
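The cost of donor reuse can be made concrete with the Kish effective-sample-size formula applied to donor usage counts. This is a sketch under an assumption: commercial fusion software may compute effective base size differently, but the Kish formula captures the dynamic described above.

```python
import numpy as np

def effective_base(donor_usage_counts):
    """Kish effective sample size, given how often each donor was used."""
    w = np.asarray(donor_usage_counts, dtype=float)
    w = w[w > 0]
    return w.sum() ** 2 / (w ** 2).sum()

# 1000 recipients served by 1000 distinct donors: no reuse.
even = effective_base([1] * 1000)

# The same 1000 recipients served by only 100 donors, each used 10 times.
reused = effective_base([10] * 100)

print(even, reused)  # 1000.0 100.0
```

With heavy reuse, the fused file still has 1000 rows, but it carries only as much independent donor information as a sample of 100, which is why the confidence ranges widen.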
Also be cognisant that the populations from which each dataset is drawn may differ slightly. In the ideal situation, both datasets will be identical in composition, and therefore likely to have similar data distributions. When recipients have many donors to choose from, donors who are less likely to fit the recipient distribution can be left behind (ignored by the fused estimates). This isn't necessarily a good thing, but may prove beneficial in some 'operational style' research contexts.
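A quick composition check before fusing can flag where the two populations diverge. The sketch below assumes pandas and uses hypothetical age-band shares; it flags bands where the datasets differ by more than five points and computes a simple dissimilarity index (half the sum of absolute differences in shares).

```python
import pandas as pd

# Hypothetical age-band shares in each dataset.
donor_share = pd.Series({"18-34": 0.40, "35-54": 0.40, "55+": 0.20})
recipient_share = pd.Series({"18-34": 0.25, "35-54": 0.40, "55+": 0.35})

# Flag bands where composition differs by more than 5 points;
# donors in over-represented bands risk being left behind.
gap = donor_share - recipient_share
flagged = gap[gap.abs() > 0.05]

# Dissimilarity index: the share of donors that would need to move
# bands for the two compositions to match.
dissimilarity = gap.abs().sum() / 2
print(flagged, round(dissimilarity, 2))
```

A non-trivial dissimilarity is not automatically a blocker, but it tells you in advance which donors the fusion is likely to ignore.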
Step 3: Harmonise
It is not necessarily the case that all the values of the common variables will be the same. In most projects, the analyst will need to merge values across datasets to establish semantic equivalences for each common variable. For example, if one dataset contains actual age and the other age bands, the ages of individuals from the first dataset must be banded to match the second, so that the fusion can attempt to match individuals within age bands. The best data integration tools will have this capability built in.
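The age-banding example above can be sketched with pandas' `cut`. The band edges and labels here are hypothetical; in practice you would take them directly from the coarser dataset's codeframe.

```python
import pandas as pd

# Donor dataset holds exact age; recipient dataset only has bands.
donor_age = pd.Series([19, 23, 41, 57, 68])

# Band definitions taken from the recipient dataset's codeframe.
bins = [18, 35, 55, 100]          # [18, 35), [35, 55), [55, 100)
labels = ["18-34", "35-54", "55+"]

# Harmonise: recode exact ages into the recipient's bands.
donor_band = pd.cut(donor_age, bins=bins, labels=labels, right=False)
print(list(donor_band))  # ['18-34', '18-34', '35-54', '55+', '55+']
```

Once both datasets speak the same banded language, the matching algorithm can treat age as a genuine common variable.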