The objective of Data Fusion is to merge two datasets by matching individual cases from one dataset to the other. These datasets are widely referred to as the donor and recipient datasets. Instead of trying to generate or estimate each column/field/variable in the recipient dataset individually (which does not work!), real data from a donor individual is gifted to a recipient. If recipients receive data from donors who are more like them, then at an aggregate level the data ought to behave as if it belongs to the same individual. That is, while each merged case will be wrong, over hundreds and thousands of individuals the estimates begin to approximate single-source data. It's easy to validate data fusion using your own data, and it is recommended that you start here, by building a data fusion sandbox in which to test and experiment.
Much attention is paid to the common measures between the two datasets. These become the zip which binds the two. In the literature, they are often referred to as the X variables, where the first dataset comprises X and Y, and the second X and Z. If X explains Y in the first dataset, and X explains Z in the second, then it stands to reason that if X were joined across datasets, one could approximate the relationship between Y and Z.
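The sandbox idea and the X/Y/Z logic above can be sketched together in a few lines. This is a minimal illustration, not production fusion software: it assumes pandas and numpy, invents a toy single-source dataset where a common variable x drives both y and z, splits it into donor and recipient halves, and performs a simple one-nearest-neighbour hot-deck match on x. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy single-source sandbox: x is the common variable,
# and both y and z are largely explained by x.
n = 2000
x = rng.normal(size=n)
df = pd.DataFrame({
    "x": x,
    "y": x + rng.normal(scale=0.5, size=n),
    "z": -x + rng.normal(scale=0.5, size=n),
})

# Split into a donor dataset (X, Z) and a recipient dataset (X, Y).
donors = df.iloc[:1000][["x", "z"]].reset_index(drop=True)
recipients = df.iloc[1000:][["x", "y"]].reset_index(drop=True)

# Hot-deck fusion: each recipient is gifted z from the donor
# whose common variable x is closest (1-nearest neighbour).
dist = np.abs(recipients["x"].to_numpy()[:, None] - donors["x"].to_numpy()[None, :])
nearest = dist.argmin(axis=1)
fused = recipients.assign(z=donors["z"].to_numpy()[nearest])

# Each matched case is individually wrong, but the y-z relationship
# is approximately recovered at the aggregate level.
true_corr = df["y"].corr(df["z"])
fused_corr = fused["y"].corr(fused["z"])
print(round(true_corr, 2), round(fused_corr, 2))
```

Because you held back the true single-source data, you can compare the fused y-z correlation against the real one, which is exactly the validation exercise the sandbox is for.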
Step 1: Deciding which common variables to use
And so, the first step is to ensure that the abovementioned associations hold up to scrutiny. The analyst needs to be confident that the common variables (otherwise known as 'hooks') are meaningful to the objective of the analysis. It is not necessarily the case that including all the common variables in the design will lead to the best set of matches or the most accurate fused estimates. Including irrelevant common variables can introduce unwanted noise and pull the algorithm off in an unhelpful direction. Setting clear objectives as to how the fused dataset will be used, and how it will be analysed, is important. This is a critical consideration when the datasets are varied and broad in scope. That is, if one or more of the datasets traverses many subject areas, it will not be possible for the fused dataset to accurately approximate all the topics of interest within the same fusion design. Often the fusion is asked to be better at some topics in one dataset, at the expense of others. All this, very much, depends on your use case.
Often sociodemographic measures like gender, age, and geography are important common variables, but alone they are seldom enough to explain Y and Z. Product usage, brand engagement and attitudes to the sector typically perform better, and better still when combined with sociodemographic measures. While supervised fusions help greatly, beginning with a strong set of candidate common variables and clear objectives is clearly preferable, even if, in some contexts, a luxury.
Step 2: Deciding the role each dataset will play
It may occur to the analyst, at this stage, that as data is being gifted from one dataset to another, it makes no difference which dataset brings Y or Z. The result is a merged dataset containing both sources of information, right? In some more advanced fusion designs, this is absolutely the case. However, for most standard designs it can matter a great deal, because the ratio of donors to recipients determines the effective base size: the sample size which governs the confidence ranges around your estimates.
Data Fusion companies advise selecting the smaller dataset as the recipient dataset. This allows each recipient a greater selection of donors to choose from; the higher the ratio, the more choice. This is an important dynamic to consider before running your first project. If the analyst were to select donors from a smaller pool than recipients, each donor is more likely to be used multiple times, which greatly lowers the effective base size and widens the confidence ranges, undermining confidence in the final fused estimates. Not good.
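The cost of donor reuse can be made concrete with the Kish effective-sample-size formula applied to donor usage counts. This is a sketch under an assumption: commercial fusion software may compute effective base size differently, but the Kish formula captures the dynamic described above.

```python
import numpy as np

def effective_base(donor_usage_counts):
    """Kish effective sample size, given how often each donor was used."""
    w = np.asarray(donor_usage_counts, dtype=float)
    w = w[w > 0]
    return w.sum() ** 2 / (w ** 2).sum()

# 1000 recipients served by 1000 distinct donors: no reuse.
even = effective_base([1] * 1000)

# The same 1000 recipients served by only 100 donors, each used 10 times.
reused = effective_base([10] * 100)

print(even, reused)  # 1000.0 100.0
```

With heavy reuse, the fused file still has 1000 rows, but it carries only as much independent donor information as a sample of 100, which is why the confidence ranges widen.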
Also be cognisant that the populations from which each dataset is drawn may differ slightly. In the ideal situation, both datasets will be identical in composition, and therefore likely to have similar data distributions. When recipients have many donors to choose from, donors who are less likely to fit the recipient distribution can be left behind (ignored by the fused estimates). This isn't necessarily a good thing, but may prove beneficial in some 'operational style' research contexts.
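A quick composition check before fusing can flag where the two populations diverge. The sketch below assumes pandas and uses hypothetical age-band shares; it flags bands where the datasets differ by more than five points and computes a simple dissimilarity index (half the sum of absolute differences in shares).

```python
import pandas as pd

# Hypothetical age-band shares in each dataset.
donor_share = pd.Series({"18-34": 0.40, "35-54": 0.40, "55+": 0.20})
recipient_share = pd.Series({"18-34": 0.25, "35-54": 0.40, "55+": 0.35})

# Flag bands where composition differs by more than 5 points;
# donors in over-represented bands risk being left behind.
gap = donor_share - recipient_share
flagged = gap[gap.abs() > 0.05]

# Dissimilarity index: the share of donors that would need to move
# bands for the two compositions to match.
dissimilarity = gap.abs().sum() / 2
print(flagged, round(dissimilarity, 2))
```

A non-trivial dissimilarity is not automatically a blocker, but it tells you in advance which donors the fusion is likely to ignore.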
Step 3: Harmonise
It is not necessarily the case that all the values of the common variables will be the same. In most projects, the analyst will need to merge values across datasets to establish semantic equivalences for each common variable. For example, if one dataset contains actual age and the other age bands, the ages of individuals from the first dataset must be banded to match the second, so that the fusion can attempt to match individuals within age bands. The best data integration tools will have this capability built in.
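The age-banding example above can be sketched with pandas' `cut`. The band edges and labels here are hypothetical; in practice you would take them directly from the coarser dataset's codeframe.

```python
import pandas as pd

# Donor dataset holds exact age; recipient dataset only has bands.
donor_age = pd.Series([19, 23, 41, 57, 68])

# Band definitions taken from the recipient dataset's codeframe.
bins = [18, 35, 55, 100]          # [18, 35), [35, 55), [55, 100)
labels = ["18-34", "35-54", "55+"]

# Harmonise: recode exact ages into the recipient's bands.
donor_band = pd.cut(donor_age, bins=bins, labels=labels, right=False)
print(list(donor_band))  # ['18-34', '18-34', '35-54', '55+', '55+']
```

Once both datasets speak the same banded language, the matching algorithm can treat age as a genuine common variable.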