The 'Art of Data Fusion' is about balancing two objectives. The first is to achieve close matches between donors and recipients, and the second is not to destroy the data in the process!

Finding good matches is the easy bit, provided the datasets are similar - that is, they comprise similar folk, drawn from the same population. Some data integration approaches prioritise creating the highest number of very close matches and pay little heed to the remaining individuals who do not have good matches available to them. Other approaches spread their effort across the whole dataset, so that while few individuals have great matches, most have a 'close enough' match. Deciding how to prioritise these distances, and how to frame the distances between potential matches, is very much about linking analytical choices to commercial objectives.

Most analysts who are new to fusion intuitively grasp how this choice between **selfish, individualistic** prioritisation and **altruistic, collective** prioritisation might influence the underlying structure of the data. One could argue that those who can achieve very close matches across datasets should be allowed to make those matches. Equally, one might argue that they should accept slightly poorer matches so that others don't get extremely poor ones.
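As a toy sketch of the two mindsets (all names and numbers invented, matching on a single variable), compare letting every recipient grab their nearest donor against capping each donor at one use:

```python
from collections import Counter

# Hypothetical donors and recipients, matched on one variable (age).
donors = {"D1": 35, "D2": 40, "D3": 45, "D4": 70}
recipients = {"R1": 38, "R2": 41, "R3": 44, "R4": 60}

# "Selfish": every recipient independently takes their nearest donor,
# no matter how often that donor has already been used.
selfish = {r: min(donors, key=lambda d: abs(donors[d] - age))
           for r, age in recipients.items()}

# "Altruistic": work through donor/recipient pairs from closest to
# furthest, but never use a donor twice.
pairs = sorted((abs(da - ra), d, r)
               for d, da in donors.items() for r, ra in recipients.items())
capped, used = {}, set()
for dist, d, r in pairs:
    if d not in used and r not in capped:
        capped[r] = d
        used.add(d)

print(Counter(selfish.values()))  # D2 is used twice, D1 never
print(Counter(capped.values()))   # every donor used exactly once
```

Under the capped variant, R1 ends up with a slightly worse match (D1 at distance 3 rather than D2 at distance 2), but no donor is over-used - the trade the rest of this piece is about.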

The problem Data Fusion faces is that, more often than not, a handful of donors serve as great matches for almost all of the recipient sample. These "average Joes" sit in the middle ground, so they look like good candidates for matching from the perspective of many recipients. A central purpose of Data Fusion is to control donor usage, and so preserve the underlying data structure.
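A deliberately artificial sketch of the effect (invented incomes, one matching variable): a donor sat in the middle of the distribution is the nearest match for most recipients.

```python
from collections import Counter

# Invented example: three donors and a spread of recipients, matched on
# one variable (income, in thousands).
donors = {"low": 20, "average_joe": 50, "high": 80}
recipients = range(25, 76, 5)  # incomes 25k, 30k, ..., 75k

matches = [min(donors, key=lambda d: abs(donors[d] - r)) for r in recipients]
print(Counter(matches))  # average_joe soaks up over half the recipients
```

With unconstrained nearest-neighbour matching, "average_joe" here is used for 6 of the 11 recipients - exactly the concentration a fusion has to keep in check.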

#### It is this leap away from merely finding good matches, towards controlling the rates at which donors match, which turns a fuzzy matching algorithm into Data Fusion.

This is counterintuitive! These average Joes ARE better matches! And if only average Joes are used in a fusion, surely their data will also be average - and isn't the average what we're normally trying to estimate? Doesn't this bake the right answer into our data, this self-selection of average Joes? Aren't we just selecting those donors most like the average individual anyway? Isn't that who we normally report on? This is *not* the paradox it seems.

### Back to basics

By merely settling for the best matches for everyone, regardless of donor usage, we're violating the basic tenets of statistical estimation. We're never dealing with a census of information when we're fusing data - we're dealing with samples. And when we're dealing with samples, we're asking ourselves: *how likely is it that what we see in our sample is indicative of the population of interest?* We're not interested in what our sample says - we're interested in the population. We need statistical inference to get us there, and so must respect the rules it plays by.

Through sampling theory, we can quantify our confidence in our sample. Calculations based on the central limit theorem (the 'law of averages', loosely) allow us to compute confidence intervals. When our sample is representative of our population of interest, we can be confident that the true answer lies between an upper and a lower bound, say, 9 times out of 10.
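As a refresher, with made-up numbers (a hypothetical sample of 400 responses from a market whose true mean is 50), a 90% confidence interval for a sample mean looks like:

```python
import math
import random
import statistics

# Made-up sample: 400 draws from a hypothetical market, true mean 50.
random.seed(1)
sample = [random.gauss(50, 12) for _ in range(400)]

mean = statistics.fmean(sample)
se = statistics.stdev(sample) / math.sqrt(len(sample))  # standard error
z = 1.645  # normal quantile for a two-sided 90% interval
lower, upper = mean - z * se, mean + z * se
print(f"mean {mean:.1f}, 90% CI ({lower:.1f}, {upper:.1f})")
```

An interval built this way covers the true mean roughly 9 times out of 10 - but only while the sample stays representative.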

The above paragraph is usually everyone's second lesson in basic statistics. The underlying calculation (loosely: variance divided by effective base size) also lights the way out of our conundrum. For what is the matching process, but a sampling process? If we reuse donors, we're merely sampling them again, not actually increasing our sample - so our effective base does not grow. And when we reuse the same donors, we harm the representativeness of our sample: the variance we achieve may be much lower than the real variance in the market, which violates an important underlying statistical assumption - that the variance in our sample reflects the variance in the population. In the end, we must preserve the variability in the data at all costs.
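A sketch of that shrinkage with invented data, using the standard design-effect form of effective base size, n_eff = (Σw)² / Σw² over donor usage counts w: a fusion that leans on a small pool of "average" donors hands back 500 records, but both its effective base and its variance have collapsed.

```python
import random
import statistics
from collections import Counter

random.seed(42)
donors = [random.gauss(100, 20) for _ in range(500)]  # invented market values

# Balanced fusion: every donor used exactly once.
balanced = random.sample(donors, 500)

# Concentrated fusion: 500 recipients served by only the 25 most
# "average" donors, reused over and over.
centre = statistics.fmean(donors)
average_joes = sorted(donors, key=lambda x: abs(x - centre))[:25]
concentrated = [random.choice(average_joes) for _ in range(500)]

def effective_base(values):
    """(sum w)^2 / sum w^2 over donor usage counts w."""
    w = Counter(values).values()
    return sum(w) ** 2 / sum(x * x for x in w)

print(effective_base(balanced))      # 500.0 - every donor counts fully
print(effective_base(concentrated))  # at most 25 - most of the base is illusory
print(statistics.stdev(balanced) > statistics.stdev(concentrated))  # True
```

Both fused files contain 500 rows, yet the concentrated one behaves like a sample of about 25 - and its spread is a fraction of the market's real variance.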

And that's it in a nutshell (sorry, you might have to read that through again, if you, like me, fell asleep, dangerously hungover, in your first statistics class).