There is no single data integration technique that works best in all situations. Provided the goals of the fusion are artfully met, there are no out-and-out winning algorithms. (That's okay. You knew you were going to read that the moment you clicked in here!) Choosing the right approach for your data integration problem is tricky. The good news is that if you understand the associations in your data and bring a little critical thinking and experimentation, landing on the best possible solution can be a matter of minutes rather than days or weeks.

The choice depends on the nature of your datasets and what you intend to do with the results.

Data Fusion algorithms fall into two categories: unconstrained and constrained matching. Each method carries its own set of advantages and limitations. Let's review.

Glossary: When referring to the distribution of a dataset, we're making broad reference to the incidences and profiles in the data - that is, the spread of values on each variable. This could be the gender split, the variation in household income, purchase frequency and so on. Think of each variable as having its own histogram. In theory this also covers how the variables work in concert, across all variables - so try to imagine all these histograms being linked to each other (however you manage to picture that…). The strongest fusions happen when the distributions of the two datasets align. Depending on your situation, changing the shape of these histograms through data fusion can be a good or a bad thing (though most will say it's a bad thing).
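As a quick illustration, here is a minimal sketch of the first sanity check this implies - comparing the marginal distribution of one shared variable across the two datasets. The column name and the toy data are hypothetical:

    import pandas as pd

    # Hypothetical toy data: check that the 'gender' split (one histogram)
    # lines up across the two datasets before attempting a fusion.
    donors = pd.DataFrame({"gender": ["F", "F", "M", "M", "M"]})
    recipients = pd.DataFrame({"gender": ["F", "M", "M", "F"]})

    print(donors["gender"].value_counts(normalize=True))      # F 0.4, M 0.6
    print(recipients["gender"].value_counts(normalize=True))  # F 0.5, M 0.5

The same check, repeated over each shared variable (and ideally over cross-tabulations of them), tells you how well the histograms line up before any matching begins.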

Unconstrained Data Fusion

Unconstrained techniques are flexible. Their only goal is to seek out the best match for every recipient without imposing any direct controls (that is, without constraint). One key advantage of unconstrained matching is its adaptability to disparate datasets - it can land on good solutions even when the distributions of the two datasets don't line up well. This does not mean you can fuse wildly disparate datasets - the data still has to line up reasonably well.

This freedom comes at a cost, as unconstrained methods can introduce biases and inaccuracies if not carefully managed. Data fusion tools therefore allow you to introduce a penalty term into the fusion equations, which controls donor usage.
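To make that concrete, here is a minimal sketch of penalised nearest-neighbour matching. The Euclidean distance metric and the penalty constant are assumptions for illustration - real fusion tools expose their own distance functions and usage controls:

    import numpy as np

    def unconstrained_fuse(recipients, donors, penalty=0.5):
        """Match each recipient to its nearest donor, penalising donor overuse.

        recipients, donors - 2D arrays of the shared linking variables,
        one row per case. `penalty` is a hypothetical tuning constant:
        0 gives pure nearest-neighbour matching; larger values spread
        the matches across more donors.
        """
        usage = np.zeros(len(donors))                  # times each donor has been used
        matches = np.empty(len(recipients), dtype=int)
        for i, r in enumerate(recipients):
            dist = np.linalg.norm(donors - r, axis=1)        # similarity on linking vars
            matches[i] = np.argmin(dist + penalty * usage)   # penalised distance
            usage[matches[i]] += 1
        return matches  # index of the chosen donor for each recipient

Note that each recipient is handled greedily, one at a time - nothing constrains the overall shape of the result beyond the penalty.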

As an apt analogy, let's pretend our fusion is a dating app. We're matching individuals, right? If a dating app used unconstrained data fusion, its goal would be to ensure that when a match is made, that individual meets someone very compatible. The most compatible individuals get prioritised to meet each other. However, it's still a business, so left to its own devices the app would send the most desirable suitors on all the dates - the penalty is what allows slightly less popular suitors to go on some dates too. Not everyone on this app would necessarily be given a match. Let's be a bit facetious: this might be the kind of app whose users pay in the hope of long-term relationships and marriage.

The key things to remember here are:

  • Unconstrained fusion is concerned only with each recipient getting their best possible match.
  • The fused dataset contains each recipient case matched with their best donor.
  • The fusion tool will introduce penalties to control donor usage.
  • Unweighted or weighted analysis respects the distribution of the recipient dataset only.
  • Unconstrained fusion ignores the profiles of the donor dataset.

Unconstrained data fusion might be considered a selfish, individualistic approach to matching.


Visualisation

Here we visualise the difference between unconstrained and constrained data fusion.

Constrained Data Fusion

While a Constrained Data Fusion algorithm will still seek out similarities between donors and recipients, it keeps one eye on the similarities between all the other donors and recipients in the two datasets. The other eye is fixed on the distribution of both datasets. Constrained Data Fusion orchestrates the matching so that the matches which occur respect the distribution of both datasets. This is also why it is orders of magnitude more computationally intensive.
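One common way to realise this philosophy is to treat fusion as an optimal assignment problem: minimise the total distance across all pairs at once, rather than greedily per recipient. Below is a minimal sketch using SciPy's linear_sum_assignment solver, under the simplifying assumption of equal-sized, equally weighted datasets - production constrained fusions generalise this to unequal sizes and survey weights:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def constrained_fuse(recipients, donors):
        """One-to-one matching in which every donor is used exactly once.

        recipients, donors - 2D arrays of the shared linking variables,
        one row per case, with len(recipients) == len(donors). Real
        constrained fusions solve a weighted transportation problem.
        """
        # Distance between every recipient/donor pair - building and
        # solving over this full matrix at once is what makes the
        # constrained approach so much more computationally expensive.
        cost = np.linalg.norm(recipients[:, None, :] - donors[None, :, :], axis=2)
        rec_idx, don_idx = linear_sum_assignment(cost)  # minimise total distance
        return don_idx  # donor index for each recipient, in recipient order

Because the solver optimises the total cost, some individual pairs accept a slightly worse match so that every donor and every recipient appears in the result - exactly the 'sacrifice' described below.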

Returning to the dating app analogy, if the app used constrained Data Fusion, everyone on the app would be matched for a date - even the undesirable individuals! While the app will still try to make each couple as compatible as possible, it believes that for everyone to have fun on their date, sacrifices must be made by everyone, so that everyone is matched with someone 'they can get along with'. So to be doubly facetious: the objective of this app is to ensure all its users have fun on their dates - it doesn't really expect marriage proposals.

The key things to remember here are:

  • The fused dataset contains all donors and all recipients.
  • Weighted analysis respects the distributions of both the donor and recipient datasets.

It might be regarded as an altruistic, collective approach to matching.

Maintaining the representation of both datasets renders the fused dataset more in line with stakeholder expectations. Media researchers refer to this as preserving both currencies, which is why Constrained Data Fusion is used to maintain the face value of large JIC (Joint Industry Committee) surveys. Here, being able to match the original datasets note for note is extremely important. That said, it works because a lot of blood, sweat and tears goes into ensuring the distributions align as closely as feasible, through expert planning and research design.

Further, when the datasets are very similar - arising from the same target population and sampled in similar ways - the 'sacrifice' made as each donor and recipient forgoes their best possible match is greatly reduced.

When not to use Constrained Data Fusion?

Reading the above, it would be easy to conclude that Constrained Data Fusion is by far the superior technique. The ability to maintain the representation of both datasets is extremely appealing. But that appeal assumes the analyst actually wants to represent the donor distribution accurately. If, upon comparison, the donor dataset is skewed, biased, or simply distributed differently, maintaining those disparities might not be a good thing. Everyone would sacrifice good matches without anyone receiving a tolerable one - everyone loses, individually and collectively.

When the primary concern is to maintain the distributions of the recipient dataset only, unconstrained fusion will always be the best choice. Some analysts will still prefer constrained approaches because they are more systematic, leaving less room for subjectivity and potential upset. Maintaining the profile of both sides of the fusion maintains face validity: stakeholders unfamiliar with Data Fusion still see the numbers they expect to see.

The dividing line between Constrained and Unconstrained fusion ends here. In truth, the above describes two underlying philosophies rather than two specific algorithms - there are variations of each.

Those new to data fusion should start with Hot Decking, an unopinionated approach that relies more on the power of randomness and the law of averages than on formal, systematic techniques. It's very hard to mess up a Hot Deck - though, being the safer option, it will not land on the most optimal estimates, because it does not seek to use the data to best effect. If you understand the data a little better, a more opinionated Unconstrained Data Fusion will make significant gains. And under certain conditions, where distributions align, with a tad more planning and preparation, constrained approaches will win out.
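For a sense of just how simple Hot Decking is, here is a minimal sketch of a random hot deck within matching cells. The cell variables, column names and data layout are hypothetical, and we assume every cell found among the recipients contains at least one donor:

    import numpy as np
    import pandas as pd

    def hot_deck_fuse(recipients, donors, cells, fused_vars, seed=0):
        """Random hot deck: within each matching cell, each recipient
        inherits the fused variables of a randomly drawn donor.

        cells      - shared columns defining the cells, e.g. ["gender", "age_band"]
        fused_vars - donor-only columns to transfer onto the recipients
        Assumes every cell present in `recipients` also exists in `donors`.
        """
        rng = np.random.default_rng(seed)
        out = recipients.copy()
        for col in fused_vars:
            out[col] = None                       # columns to be filled by the fusion
        donor_groups = donors.groupby(cells)
        for key, recip_idx in recipients.groupby(cells).groups.items():
            pool = donor_groups.get_group(key)    # donors in the same cell
            draws = rng.integers(0, len(pool), size=len(recip_idx))
            out.loc[recip_idx, fused_vars] = pool.iloc[draws][fused_vars].to_numpy()
        return out

There is no optimisation here at all - within a cell, every donor is equally likely to be drawn - which is precisely why it is hard to mess up and equally why it leaves accuracy on the table.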


