How to Set up a Constrained Data Fusion

With great power comes great responsibility

Constrained Data Fusion sits behind many of the research industry's top measurement products. Its ability to preserve the distributions of two datasets while merging them in a systematic way is a compelling proposition, guaranteeing the face validity of the fused dataset.

The aim of data integration tools like these is to ensure the best possible matches for all donors and recipients. Constrained techniques will typically include all the cases from both datasets, regardless of how similar the datasets are. These objectives are, however, not necessarily compatible.

Preserving the unweighted or weighted distributions of both datasets is not desirable when those distributions are very different from each other. Indeed, Constrained Data Fusion will often fail if the datasets are too dissimilar. Such failures demonstrate a hard truth: one of the prerequisites for Data Fusion is that the datasets must be like-for-like (or nearly so) to begin with.
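As a rough illustration of what "too dissimilar" might look like in practice, the sketch below compares the weighted distribution of one common variable across the two datasets. The record layout, the `weight` field name and the choice of total variation distance as the yardstick are all assumptions made for this example, not part of any particular fusion product.

```python
from collections import defaultdict


def weighted_shares(records, var, weight="weight"):
    """Return the weighted share of each category of `var`."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[var]] += rec[weight]
    grand_total = sum(totals.values())
    return {cat: w / grand_total for cat, w in totals.items()}


def distribution_gap(donors, recipients, var):
    """Total variation distance between the two weighted distributions.

    0 means identical distributions, 1 means no overlap at all.
    """
    d = weighted_shares(donors, var)
    r = weighted_shares(recipients, var)
    categories = set(d) | set(r)
    return 0.5 * sum(abs(d.get(c, 0.0) - r.get(c, 0.0)) for c in categories)


# Hypothetical toy data: donors are 50% North, recipients are 75% North.
donors = [
    {"region": "North", "weight": 1.0},
    {"region": "North", "weight": 1.0},
    {"region": "South", "weight": 2.0},
]
recipients = [
    {"region": "North", "weight": 3.0},
    {"region": "South", "weight": 1.0},
]

gap = distribution_gap(donors, recipients, "region")
print(f"region gap: {gap:.2f}")
```

What counts as an acceptable gap is a judgement call for the analyst; the point is simply to measure it on every common variable before attempting a constrained fusion.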

Ensuring that the datasets you wish to fuse are sampled the same way, with the data collected the same way, at the same time and under the same conditions, is clearly not realistic. While the datasets don't have to match identically, it is 'very strongly preferred' that they are drawn from the same population of individuals, even if the data collection methodologies differ. If this is not the case, an unconstrained approach will prove safer.

#TLDR

Fusion algorithms

Here we visualise the differences between the two Data Fusion algorithms.

Strategy #1

Use quota sampling to ensure one of the datasets matches the other on all the important measures. These often extend beyond basic demographics to include purchasing behaviour, product ownership and usage, and in some cases attitudinal tendencies.

Strategy #2

If this is not possible, researchers can use stratified random sampling to manufacture a sample that matches the other dataset. This ensures the fusion receives datasets that are perfectly matched on the stratification variables. This can sometimes be preferable to unconstrained fusion approaches.
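A minimal sketch of this strategy, assuming a larger pool of respondents and a set of per-stratum target counts taken from the other dataset. The function name, the `age` variable and the target counts are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(pool, strata_var, target_counts, seed=0):
    """Draw a simple random sample of the requested size within each stratum."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_stratum = defaultdict(list)
    for rec in pool:
        by_stratum[rec[strata_var]].append(rec)

    sample = []
    for stratum, n in target_counts.items():
        available = by_stratum.get(stratum, [])
        if len(available) < n:
            raise ValueError(
                f"stratum {stratum!r}: need {n} cases, only {len(available)} available"
            )
        # Sampling without replacement inside the stratum.
        sample.extend(rng.sample(available, n))
    return sample


# Hypothetical pool: six respondents aged 18-34 and four aged 35+.
pool = [{"id": i, "age": "18-34" if i < 6 else "35+"} for i in range(10)]

# Hypothetical targets: the other dataset is split evenly across the two groups.
target_counts = {"18-34": 2, "35+": 2}

sample = stratified_sample(pool, "age", target_counts)
sample_counts = dict(Counter(rec["age"] for rec in sample))
print(sample_counts)
```

The trade-off is visible in the toy data: the manufactured sample matches the target split, but six of the ten available respondents are discarded to achieve it.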

Strategy #3

The two datasets are weighted to a common set of population estimates, or one dataset is weighted to match the other. In all commercial research products, Data Fusion consultants would do this as a matter of course. Even though the unweighted distributions don't align perfectly, the weighted distributions do. This is, however, not a panacea: too much weighting will destroy data quality, and therefore the quality of the fused estimates.
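To make the cost of weighting concrete, the sketch below applies simple cell weighting to two toy datasets using the same population targets, then reports weighting efficiency (Kish's effective sample size divided by the actual sample size). The `gender` variable, the targets and both samples are invented for this example, and real weighting designs usually involve many more cells or rim weighting.

```python
from collections import Counter


def cell_weights(records, cell_var, population_targets):
    """Weight each record so its cell sums to the population target."""
    counts = Counter(rec[cell_var] for rec in records)
    return [population_targets[rec[cell_var]] / counts[rec[cell_var]] for rec in records]


def weighting_efficiency(weights):
    """Kish effective sample size as a share of the actual sample size."""
    n = len(weights)
    return (sum(weights) ** 2) / (n * sum(w * w for w in weights))


# Hypothetical common population estimates.
population_targets = {"F": 50.0, "M": 50.0}

# Dataset A is balanced; dataset B is heavily skewed towards women.
dataset_a = [{"gender": "F"}] * 5 + [{"gender": "M"}] * 5
dataset_b = [{"gender": "F"}] * 8 + [{"gender": "M"}] * 2

weights_a = cell_weights(dataset_a, "gender", population_targets)
weights_b = cell_weights(dataset_b, "gender", population_targets)

eff_a = weighting_efficiency(weights_a)
eff_b = weighting_efficiency(weights_b)
print(f"efficiency A: {eff_a:.2f}, efficiency B: {eff_b:.2f}")
```

Both datasets now share the same weighted gender distribution, but the balanced sample keeps an efficiency of 1.0 while the skewed one falls to 0.64. The more the weights have to work, the less information each dataset effectively carries into the fusion.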

Pro Tip!

Forcing individuals to match on specific variables, say gender and region, is not something a constrained algorithm takes kindly to if the underlying distributions differ. When forcing variables whose distributions are dissimilar, unconstrained fusion approaches will simply reuse more donors as required; a Constrained Data Fusion is, by definition, constrained against doing so. While a constrained solution is still possible, it can only comply with one side of the equation: it will continue to find optimal matches but will no longer respect the representation of that particular variable. If it is important that the fused result reflects a specific distribution on certain variables, ensure your weighting design includes all the variables on which you wish to force perfect matching. Each weighting cell should therefore have the same number of weighted individuals in both datasets.
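One way to check this before running the fusion is to compare weighted totals cell by cell across the forced variables. The sketch below is only an illustration: the field names, the tolerance and the toy records are assumptions.

```python
from collections import defaultdict


def weighted_cell_totals(records, cell_vars, weight="weight"):
    """Sum the weights within each combination of the forced variables."""
    totals = defaultdict(float)
    for rec in records:
        cell = tuple(rec[v] for v in cell_vars)
        totals[cell] += rec[weight]
    return dict(totals)


def mismatched_cells(donors, recipients, cell_vars, tol=1e-6):
    """Return cells whose weighted totals differ, as (donor, recipient) pairs."""
    d = weighted_cell_totals(donors, cell_vars)
    r = weighted_cell_totals(recipients, cell_vars)
    return {
        cell: (d.get(cell, 0.0), r.get(cell, 0.0))
        for cell in sorted(set(d) | set(r))
        if abs(d.get(cell, 0.0) - r.get(cell, 0.0)) > tol
    }


# Hypothetical weighted records, forcing on gender and region.
donors = [
    {"gender": "F", "region": "North", "weight": 2.0},
    {"gender": "F", "region": "North", "weight": 1.0},
    {"gender": "M", "region": "South", "weight": 3.0},
]
recipients = [
    {"gender": "F", "region": "North", "weight": 3.0},
    {"gender": "M", "region": "South", "weight": 2.0},
]

problems = mismatched_cells(donors, recipients, ["gender", "region"])
print(problems)
```

An empty result means every forced cell carries the same weighted total on both sides. Any cell that is listed needs to be addressed in the weighting design before those variables can safely be forced.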

Preparing your fused dataset for analysis

Where unweighted and weighted distributions differ across and within datasets, some constrained algorithms will use individuals more than once and split their weights. While this is a desired outcome, it is important for the analyst to always work with the weighted data only, as the unweighted donor base size, the unweighted recipient base size and the unweighted fused base size cease to have meaning. It is only when a weight is applied to the data that the distributions will align with the original distributions. When running a Constrained Data Fusion, use the new weighting variable provided rather than the original weights.
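The toy example below shows why the unweighted bases lose their meaning. It is a deliberately simplified sketch, not the algorithm of any commercial product: donors and recipients are sorted on a single hypothetical matching score and paired off in order, splitting weights whenever one side runs out before the other. It assumes both sides have already been weighted to the same total.

```python
def constrained_match(donors, recipients, tol=1e-9):
    """Pair donors and recipients in score order, splitting weights as needed.

    Each input is a list of (id, score, weight) tuples. Returns fused rows
    of (recipient_id, donor_id, fused_weight).
    """
    total_d = sum(w for _, _, w in donors)
    total_r = sum(w for _, _, w in recipients)
    if abs(total_d - total_r) > tol:
        raise ValueError("donor and recipient weights must sum to the same total")

    d = sorted(donors, key=lambda x: x[1])
    r = sorted(recipients, key=lambda x: x[1])
    i = j = 0
    d_left, r_left = d[0][2], r[0][2]
    fused = []
    while i < len(d) and j < len(r):
        w = min(d_left, r_left)  # the weight this pairing can absorb
        fused.append((r[j][0], d[i][0], w))
        d_left -= w
        r_left -= w
        if d_left <= tol:  # donor weight used up: move to the next donor
            i += 1
            d_left = d[i][2] if i < len(d) else 0.0
        if r_left <= tol:  # recipient weight used up: move to the next recipient
            j += 1
            r_left = r[j][2] if j < len(r) else 0.0
    return fused


# Hypothetical data: 3 donors and 2 recipients, both weighted to a total of 4.
donors = [("d1", 20, 2.0), ("d2", 40, 1.0), ("d3", 60, 1.0)]
recipients = [("r1", 25, 1.0), ("r2", 45, 3.0)]

fused = constrained_match(donors, recipients)
for row in fused:
    print(row)

donor_totals = {}
recipient_totals = {}
for rec_id, don_id, w in fused:
    donor_totals[don_id] = donor_totals.get(don_id, 0.0) + w
    recipient_totals[rec_id] = recipient_totals.get(rec_id, 0.0) + w
```

The fused file has four rows, which matches neither the three donors nor the two recipients, and donor `d1` appears twice with its weight split. Summing the fused weight by donor or by recipient, however, reproduces every original weight exactly, which is why the new fused weight is the only one the analyst should use.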