No single data integration technique works best in every situation. As long as the goals of the fusion are met, there is no out-and-out winning algorithm. (That's okay. You knew you were going to read that the moment you clicked in here!) Choosing the right approach for your data integration problem is tricky. The good news is that if you understand the associations in your data and bring a little critical thinking and experimentation, landing on the best possible solution can be a matter of minutes rather than days or weeks.
The choice depends on the nature of your datasets and what you intend to do with the results.
Data Fusion algorithms fall into two categories: unconstrained and constrained matching. Each method carries its own set of advantages and limitations. Let's review.
Glossary: When we refer to the distribution of a dataset, we mean, broadly, the incidences and profiles in the data. That is the spread of values on each variable - the gender split, the variation in household income, purchase frequency and so on. Think of each variable as having its own histogram. In theory the term also covers how these variables work in concert, across all variables - so try to imagine all these histograms being linked to each other (although quite how you imagine that…). The strongest fusions happen when the distributions of the two datasets align. Depending on your situation, changing the shape of these histograms through data fusion can be a good or a bad thing (though most will say it's a bad thing).
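To make "distributions align" a little more concrete, here is a minimal sketch that compares one variable's histogram across two datasets. The function names, the toy gender lists and the choice of total variation distance as the gap measure are all assumptions made for illustration, not something a particular fusion tool prescribes.

```python
from collections import Counter


def shares(values):
    """Return the share of each category in a list of values."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def distribution_gap(recipient_values, donor_values):
    """Total variation distance between two categorical distributions.

    0 means the two histograms are identical, 1 means they do not overlap.
    """
    r = shares(recipient_values)
    d = shares(donor_values)
    categories = set(r) | set(d)
    return 0.5 * sum(abs(r.get(c, 0.0) - d.get(c, 0.0)) for c in categories)


# Hypothetical toy data: the gender split in each dataset.
recipient_gender = ["F", "F", "M", "M"]  # 50% F, 50% M
donor_gender = ["F", "F", "F", "M"]      # 75% F, 25% M

print(distribution_gap(recipient_gender, donor_gender))  # 0.25
```

Running a check like this on each shared variable gives a rough first read on how well the two datasets line up. It only looks at one variable at a time, so it says nothing about how the variables work in concert.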
Unconstrained Data Fusion
Unconstrained techniques are flexible. Their only goal is to seek out the best match for every recipient without imposing any direct controls (that is, without constraint). One key advantage of unconstrained matching is its adaptability to disparate datasets: it can land on good solutions even when the distributions of the two datasets don't line up well. This does not mean that you can fuse wildly different datasets - the data still has to line up reasonably well.
This freedom comes at a cost: unconstrained methods can introduce biases and inaccuracies if not carefully managed. To help with this, data fusion tools allow you to introduce a penalty term into the fusion equations, which controls donor usage (how often the same donor is matched to different recipients).
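The idea of a donor-usage penalty can be sketched in a few lines. This is a simplified illustration, not how any particular fusion tool works: the function name `unconstrained_match`, the Euclidean distance on the shared variables, the linear form of the penalty and the one-pass greedy order are all assumptions.

```python
def unconstrained_match(recipients, donors, penalty=0.0):
    """Match each recipient to the donor with the lowest penalised distance.

    recipients, donors: lists of numeric tuples on the shared variables.
    penalty: hypothetical cost added for every previous use of a donor.
    Returns a list of donor indices, one per recipient.
    """
    usage = [0] * len(donors)  # how many times each donor has been used so far
    matches = []
    for r in recipients:
        best_index, best_cost = None, float("inf")
        for j, d in enumerate(donors):
            # Euclidean distance on the shared variables (an assumed metric).
            distance = sum((a - b) ** 2 for a, b in zip(r, d)) ** 0.5
            # Donors that have already been used become less attractive.
            cost = distance + penalty * usage[j]
            if cost < best_cost:
                best_index, best_cost = j, cost
        usage[best_index] += 1
        matches.append(best_index)
    return matches


# Hypothetical toy data: one shared variable, three recipients, two donors.
recipients = [(1.0,), (1.1,), (1.2,)]
donors = [(1.0,), (2.0,)]

print(unconstrained_match(recipients, donors, penalty=0.0))  # [0, 0, 0]
print(unconstrained_match(recipients, donors, penalty=1.0))  # [0, 1, 0]
```

With no penalty, every recipient takes the closest donor and donor 1 is never used. With a penalty, the second recipient is pushed to the less similar donor, which spreads usage at the price of a weaker individual match. Note that in this sketch every recipient gets a match and the result depends on the order in which recipients are processed.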
As an analogy, let's pretend our fusion is a dating app. We're matching individuals, right? If a dating app used unconstrained data fusion, its goal would be to ensure that whenever a match is made, that individual meets someone very compatible. The most compatible individuals would get prioritized to meet each other. However, it's still a business, so the app will try not to send the most desirable suitors on all the dates - this allows slightly less popular suitors to go on some dates too. Not everyone on this app would necessarily be given a match. Let's be a bit facetious: this might be the kind of app that gets paid for long-term relationships and marriages.
The key things to remember here are:
- Unconstrained fusion is concerned with each recipient only getting their best possible match.
- The fused dataset contains each recipient case matched with their best donor.
- The fusion tool will introduce penalties to control donor usage.
- Analysis of the fused dataset, whether weighted or unweighted, respects the distribution of the recipient dataset only.
- Unconstrained fusion ignores the profiles of the donor dataset.
Unconstrained data fusion might be considered a selfish, individualistic approach to matching.