The 'Art of Data Fusion' is about balancing two objectives. The first is to achieve close matches between donors and recipients, and the second is not to destroy the data in the process!

Finding good matches is the easy bit, provided the datasets are similar - that is, they comprise similar folk, drawn from the same population. Some data integration approaches prioritise creating the largest number of very close matches and pay little heed to the remaining individuals who do not have good matches available to them. Other approaches spread their effort across the whole dataset, so that while few people have great matches, most have a 'close enough' match. Deciding how to prioritise these distances, and how to frame the distances between potential matches, is very much about linking analytical choices to commercial objectives.

Most analysts who are new to fusion intuitively grasp how this choice between selfish, individualistic prioritisation and altruistic, collective prioritisation might influence the underlying structure of the data. One could argue that those who can achieve very close matches across datasets should be allowed to make those matches. Equally, one might argue that they should accept slightly poorer matches so that others don't end up with extremely poor ones.

The problem Data Fusion faces is that, more often than not, a handful of donors serve as great matches for almost all of the recipient sample. These "average Joes" sit in the middle ground, so they appear to be good candidates for matching from the perspective of many recipients. The purpose of Data Fusion is to control donor usage in order to preserve the underlying data structure.

It is this leap away from merely finding good matches, towards controlling the rates at which donors are matched, which turns a fuzzy matching algorithm into Data Fusion.
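To make that distinction concrete, here is a minimal sketch of greedy nearest-neighbour matching with and without a limit on donor reuse. The data, the two linking variables and the cap of 12 uses per donor are invented for illustration; this is the general idea, not any particular fusion product's algorithm.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
donors = rng.normal(size=(50, 2))        # 50 donors described by 2 linking variables
recipients = rng.normal(size=(500, 2))   # 500 recipients on the same variables

def match(recipients, donors, max_uses=None):
    """Greedy nearest-neighbour matching, optionally capping donor reuse."""
    usage = Counter()
    matched = []
    for r in recipients:
        dist = np.linalg.norm(donors - r, axis=1)   # distance to every donor
        for d in np.argsort(dist):                  # try the closest donors first
            if max_uses is None or usage[d] < max_uses:
                usage[d] += 1
                matched.append(d)
                break
    return matched, usage

_, free_usage = match(recipients, donors)                 # "selfish": best match for everyone
_, capped_usage = match(recipients, donors, max_uses=12)  # donor usage controlled

print("uncapped: busiest donor used", max(free_usage.values()), "times")
print("capped:   busiest donor used", max(capped_usage.values()), "times")
```

Running the uncapped version typically shows a few centrally-placed donors doing a disproportionate share of the work; the capped version spreads that work out, at the cost of slightly longer distances.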

This is counterintuitive! These average Joes ARE better matches! And if only average Joes were used in a fusion, surely their data would also be average - and isn't that what we're normally trying to estimate? Doesn't this bake the right answer into our data, this self-selection of average Joes? Aren't we just selecting the donors most like the average individual anyway? Isn't that who we normally report on? This is not the paradox it seems.

Back to basics

By merely settling for the best matches for everyone, regardless of donor usage, we're violating the basic tenets of statistical estimation. We're never dealing with a census of information when we're fusing data - we're dealing with samples. When we're dealing with samples, we're asking ourselves how likely it is that what we see in our sample is indicative of the population of interest. We're not interested in what our sample says - we're interested in the population. We need statistical inference to get us there, and so we must respect the rules it plays by.

Through sampling, we can become more confident in our sample. Calculations based on the law of large numbers and the central limit theorem allow us to compute confidence intervals. When our sample is representative of our population of interest, we can be confident that the true answer lies between an upper and lower bound, say 9 times out of 10.
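As a toy illustration of that "9 times out of 10", here is a 90% confidence interval for a sample mean under the normal approximation. The sample itself is invented.

```python
import numpy as np

# Invented sample: 400 observations drawn around a true mean of 100.
sample = np.random.default_rng(1).normal(loc=100, scale=15, size=400)

mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))   # standard error of the mean
# 90% interval under the normal approximation (1.645 standard errors each side).
print(f"90% CI: {mean - 1.645 * se:.1f} to {mean + 1.645 * se:.1f}")
```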

The paragraph above is usually everyone's second lesson in basic statistics. The underlying calculation (loosely: variance divided by effective base size) also lights the way out of our conundrum. For what is the matching process but a sampling process? When we reuse donors, we're merely sampling them again, not actually increasing our sample - our effective base does not grow, and the representativeness of our sample suffers. The variance we achieve may be much lower than the real variance in the market, which violates an important underlying assumption: that the variance in our sample reflects the variance in the population. In the end, we need to preserve the variability in the data at all costs.
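One way to see the effect of donor reuse is to treat each donor's usage count as a weight and apply Kish's effective-sample-size formula, n_eff = (Σw)² / Σw². The usage patterns below are invented for illustration.

```python
import numpy as np

def effective_base(usage_counts):
    """Kish effective sample size: (sum of weights)^2 / sum of squared weights."""
    w = np.asarray(usage_counts, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

# 500 fused records built from a panel of 50 donors.
even_usage = np.full(50, 10)                   # every donor matched 10 times
skewed_usage = np.array([50] * 10 + [0] * 40)  # ten "average Joes" do all the work

print(effective_base(even_usage))    # 50.0 -> the whole donor base contributes
print(effective_base(skewed_usage))  # 10.0 -> the effective base has collapsed
```

Both scenarios produce 500 fused records, but in the skewed one only ten donors are really contributing, so the effective base (and the variability it carries) shrinks dramatically.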

And that's it in a nutshell (sorry, you might have to read that through again, if you, like me, fell asleep, dangerously hungover, in your first statistics class).


TL;DR

Fusion algorithms

Here we look at the difference described above between the two approaches to Data Fusion - the selfish versus the collective - and when to use which.

How does Constrained Data Fusion help?

This is why many analysts prefer constrained approaches. They feel better about the solution, as it's more systematic and less prone to subjectivity. Constrained Data Fusion maintains heterogeneity perfectly, by including every donor in the fused dataset in their correct representative proportions.

This, however, does not come for free.

Constrained Data Fusion achieves this through its rigidity and a strict, rules-based approach - this inflexibility is both its strength and its weakness. It must assume that the datasets are very similar and that every donor belongs in the fused dataset, regardless of how similar they are to any of the recipients. In other words, Constrained Data Fusion doesn't exactly resolve the problem; it only balances the objectives, and it does so by relying on a shaky assumption - one that, if not met, will directly skew the fused estimates.
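As a rough sketch of the constrained idea (not any vendor's solver): give every donor the same number of 'slots' and solve an assignment problem, so total distance is minimised while every donor is used equally. The data and sizes are illustrative, and it assumes SciPy is available.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(2)
donors = rng.normal(size=(50, 2))        # 50 donors, 2 linking variables
recipients = rng.normal(size=(500, 2))   # 500 recipients

reps = len(recipients) // len(donors)             # 10 slots per donor
slots = np.repeat(np.arange(len(donors)), reps)   # donor index behind each slot

# Distance from every recipient to every donor slot (a 500 x 500 cost matrix).
cost = np.linalg.norm(recipients[:, None, :] - donors[slots][None, :, :], axis=2)

rows, cols = linear_sum_assignment(cost)  # minimise total distance, one slot per recipient
assigned_donor = slots[cols]

counts = np.bincount(assigned_donor, minlength=len(donors))
print("donor usage, min/max:", counts.min(), counts.max())  # forced to be equal here
```

Every donor is used exactly the same number of times, which is precisely the rigidity described above: representation is guaranteed, but poorly-fitting donors are forced into the fused dataset all the same.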

And so, Data Fusion remains more artful than automatic: a trade-off between competing objectives, in which what you lose on the swings, you gain on the roundabouts. So, how confident do we need to be? How much would flatter data or skews influence decision making? Should closer matches be prioritised over the collective good? It is up to the analyst to decide how to strike this unwinnable balance - to share in the victory and the torment - always informed by how the fused data will be used.

