Broadly, the objective of Data Fusion Tools is to fill in "missing data". This could be either:

  1. genuinely missing data in the traditional sense, where some data for an individual is missing and fusion is used to replace the missing values. Here most of the data exists for all the individuals in the dataset - most have values for most of the variables. For one reason or another, some individuals are missing parts of the dataset, either systematically (by design) or unsystematically (through non-response, opting out, inconsistent data collection, etc.)
  2. data where the analyst is missing entire variables. Here no values are present for any of the individuals in the dataset, but we can still think of this as a missing data problem.

In both cases there are established methods for remedying this. Data Fusion aside, missing data can be handled in one of two other ways:

Prediction algorithms

These algorithms make use of individuals where all the data is present, building a scoring system or set of rules by which a missing variable can be explained, then predicted with a known confidence. When filling in a continuous or ordinal variable, a regression-based algorithm is used; when filling in a categorical variable, typically a nominal measurement, a classification algorithm is used. Where relevant proxy variables are present, they can be used to predict what the missing value ought to be. Data Fusion companies will spend a good deal of time understanding their client's data to get to grips with predictive common variables.
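As an illustration, here is a minimal sketch of that variable-by-variable approach in Python. The DataFrame, the common variables `age` and `income`, the targets `spend` (numeric) and `segment` (nominal), and the choice of random forests as the scoring mechanism are all illustrative assumptions, not any particular vendor's method.

```python
# Fill one variable at a time: regression for numeric targets, classification
# for nominal ones. Column names and models are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor


def fill_one_variable(df: pd.DataFrame, common_vars: list[str], target: str) -> pd.DataFrame:
    """Predict the missing values of a single variable from the common variables."""
    known = df[df[target].notna()]
    unknown = df[df[target].isna()]
    if unknown.empty:
        return df
    if pd.api.types.is_numeric_dtype(df[target]):
        model = RandomForestRegressor(random_state=0)   # continuous / ordinal target
    else:
        model = RandomForestClassifier(random_state=0)  # nominal target
    model.fit(known[common_vars], known[target])
    df.loc[unknown.index, target] = model.predict(unknown[common_vars])
    return df


# Each variable gets its own model, fitted and applied independently:
# df = fill_one_variable(df, ["age", "income"], "spend")
# df = fill_one_variable(df, ["age", "income"], "segment")
```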

These prediction algorithms are typically supplied with training data, and then asked to apply what they have learned to a holdout sample. In this way, the analyst can establish how well the algorithm performs on data it hasn't seen before, and guard against overfitting. (Overfitting is when the algorithm fails to generalise beyond the training data, despite appearing very proficient at predicting the training data itself.)
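The holdout check itself is straightforward. The sketch below uses synthetic, made-up data (the variables `age`, `income` and `segment` are placeholders) purely to show where a gap between training and holdout scores would reveal overfitting.

```python
# Train on one portion of the complete cases, score on a holdout the model
# has never seen. All data here is synthetic and the variables are placeholders.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
complete = pd.DataFrame({
    "age": rng.integers(18, 80, 1000),
    "income": rng.normal(30_000, 8_000, 1000),
})
complete["segment"] = np.where(complete["age"] > 45, "saver", "spender")

X_train, X_hold, y_train, y_hold = train_test_split(
    complete[["age", "income"]], complete["segment"], test_size=0.25, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

print("training accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("holdout accuracy: ", accuracy_score(y_hold, model.predict(X_hold)))
# A large gap between the two scores is the classic symptom of overfitting.
```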

The performance of the algorithm is often compared against what could be achieved at random. If I were to sell you an algorithm which is correct 60% of the time, you would probably want to know what you could achieve without it. For example, if 50% of the cases turn out to be loyal customers and 50% quickly defect to a competitor, the algorithm already has a 50% chance of a correct guess even if it were just flipping a coin - obviously, don't buy that algorithm! Similarly, if attempting to guess one of 6 customer segments, the algorithm has a 17% chance of getting the segment correct even if it were rolling dice.
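This chance baseline can be made concrete with a dummy model that ignores the inputs entirely; the snippet below simply mirrors the coin-flip and dice-roll figures from the text on synthetic data.

```python
# What accuracy does "guessing" achieve? Compare the real model against this.
import numpy as np
from sklearn.dummy import DummyClassifier

# Two equally common outcomes (stay vs defect): chance is 1/2 = 50%.
# Six equally common segments: chance is 1/6, roughly 17%.
print("two-class chance:  ", 1 / 2)
print("six-segment chance:", round(1 / 6, 2))

# A DummyClassifier makes the baseline explicit: it never looks at the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 2))            # throwaway features
y = rng.integers(0, 6, size=600)         # six roughly equally likely segments
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print("dummy baseline accuracy:", round(baseline.score(X, y), 2))  # close to 1/6
```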

Within a commercial context, therefore, the goal of the analyst is to leverage existing data to do better than this, securing better commercial outcomes than the business could achieve at random.

While these algorithms can be extremely predictive, the reason they can be is that they work with a single variable at a time. A scoring system or set of rules is developed and then optimised for the purpose of getting that single variable correct. (Remember that for later, or just watch the video.)


#TLDR

Why Fusion?

How fusion approaches data integration problems differently to achieve perfect consistency across imputed variables.

Learn how fusion differs from traditional imputation.

Multiple Imputation

When working with more than one variable, many analysts switch to multiple imputation techniques. These are the go-to option for the first scenario above, where some values are missing for some individuals.

These algorithms can work in several ways, ranging from the simple to the sophisticated and everything in between (a brief code sketch of each follows the list), by:

  1. filling in all the missing values with a measure of central tendency such as the mean, mode or median. All the missing values receive the same value on each variable. Efficient, but neither particularly clever nor precise.
  2. replacing the missing value with the value of an individual who is similar to, or 'near', the individual in multidimensional space. For each imputation the missing value may be taken from a randomly selected nearby individual, with a different individual drawn every time a new value is needed. This process is usually undertaken many times, and the average or most common value used as the replacement. (A notorious rule of statistics applies here: "Thou shalt never only impute once!"). Clever & efficient.
  3. iterating through each missing variable and repeating a classification or regression approach. Here that statistics rule is broken in favour of assuring the best possible accuracy. Clever, inefficient, & precise.
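As a rough illustration, the scikit-learn imputers below stand in for the three approaches above on a small synthetic frame. The real tools (and the repeated random draws of a full multiple-imputation workflow) may look different, so treat this as a sketch rather than a recipe.

```python
# Three flavours of imputation on a synthetic frame with ~10% of one column missing.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
df.loc[rng.random(200) < 0.1, "b"] = np.nan   # values missing at random

# 1. Central tendency: every gap in a column gets the same value. Fast, crude.
mean_filled = SimpleImputer(strategy="mean").fit_transform(df)

# 2. Nearest neighbours: borrow from similar rows in multidimensional space
#    (here an average of 5 neighbours rather than a single random donor).
knn_filled = KNNImputer(n_neighbors=5).fit_transform(df)

# 3. Iterative: regress each incomplete variable on the others, round after round.
iterative_filled = IterativeImputer(random_state=0).fit_transform(df)
```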

These algorithms largely require the data to be missing unsystematically, as described above, or in more technical parlance: Missing at Random (MAR).

So, what's the problem?

The above techniques are extremely performant when a small minority of individuals are missing values in a minority of variables. In these approaches, new data is estimated or generated on a variable-by-variable, individual-by-individual basis - the more data that has to be estimated, the more noise is added to the dataset.

More importantly, these approaches are optimal when working with a single variable but introduce too much error variance when working with two or more variables. Each new estimation introduces error, and each new imperfect guess is carried into the end analysis. A paradox seemingly emerges: while each prediction is as accurate as possible, each new variable is a separate estimation which introduces more error.

This is because the scoring system or set of rules required to achieve as accurate an estimate as possible on the first missing variable is different to the one required for the second variable, and so on.

As such, the relationships which exist between the first variable and the second variable in the first dataset are broken once they are separately estimated onto the second dataset: neither algorithm need consider the surrounding data. This means that the relationship between the first estimated variable and the original dataset is different to the relationship between the second estimated variable and the original dataset. The problem isn't so much that each estimation is imperfect; it's that they are estimated separately and the estimations are imperfect. In almost every data science project, variables must be used together to find a narrative or build models, and that relies on data which is consistent and internally coherent.
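A small self-contained illustration of this point, on synthetic data with made-up variables: two closely related variables are imputed separately from a single common variable (each model drawing its own independent noise), and the correlation between the imputed values comes out far weaker than the correlation observed in the donor rows.

```python
# Estimating two related variables separately breaks the relationship between them.
# All data here is synthetic; the set-up is purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 4000
common = rng.normal(size=n)                # variable present for everyone
x = common + rng.normal(size=n)            # first variable to be imputed
y = x + rng.normal(scale=0.5, size=n)      # second variable, closely tied to x

donor = slice(0, n // 2)                   # rows where x and y are observed
recipient = slice(n // 2, n)               # rows where x and y are missing


def impute_separately(target: np.ndarray) -> np.ndarray:
    """Stochastic regression imputation of one variable from the common variable."""
    model = LinearRegression().fit(common[donor].reshape(-1, 1), target[donor])
    resid_sd = np.std(target[donor] - model.predict(common[donor].reshape(-1, 1)))
    # Each variable gets its own model and its own independent noise draw.
    return model.predict(common[recipient].reshape(-1, 1)) + rng.normal(scale=resid_sd, size=n // 2)


x_hat = impute_separately(x)
y_hat = impute_separately(y)

print("corr(x, y) where observed:", round(np.corrcoef(x[donor], y[donor])[0, 1], 2))  # about 0.94
print("corr(x, y) where imputed: ", round(np.corrcoef(x_hat, y_hat)[0, 1], 2))        # about 0.47
```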

The Solution is Data Fusion

The solution is to make one set of rules which encompasses all variables, using the same common variables that are used for the individual predictions above. This is a single mechanism (one "fell swoop") to complete all the missing data at once, rather than working on a variable-by-variable basis. When Data Fusion matches individuals, it does so in a single analysis, and when individuals are matched, the entire record from one dataset is appended to the other dataset. By matching rows rather than columns, there is one "weak point", rather than a weak point for every variable being integrated. That is, error is only introduced once - according to the same set of rules. One error term.
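A minimal sketch of the row-matching idea (not any vendor's production algorithm): each recipient row is matched to its nearest donor on the common variables, and the donor's entire extra record is carried across in one step, so all the transferred variables come from the same real individual. The datasets and column names below are hypothetical.

```python
# Data fusion by nearest-neighbour row matching on common variables.
# Synthetic data; column names are placeholders.
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)
common_vars = ["age", "income"]

# Donor dataset: common variables plus the variables we want to transfer.
donors = pd.DataFrame({
    "age": rng.integers(18, 80, 500),
    "income": rng.normal(30_000, 8_000, 500),
    "tv_hours": rng.normal(3, 1, 500),
    "segment": rng.choice(["saver", "spender"], 500),
})

# Recipient dataset: only the common variables are observed.
recipients = pd.DataFrame({
    "age": rng.integers(18, 80, 300),
    "income": rng.normal(30_000, 8_000, 300),
})

# Standardise the common variables so each contributes comparably to the distance.
means, stds = donors[common_vars].mean(), donors[common_vars].std()
matcher = NearestNeighbors(n_neighbors=1).fit((donors[common_vars] - means) / stds)
_, idx = matcher.kneighbors((recipients[common_vars] - means) / stds)

# Append each matched donor's entire extra record to the recipient in one step,
# so tv_hours and segment always come from the same real individual.
extra = donors.drop(columns=common_vars).iloc[idx.ravel()].reset_index(drop=True)
fused = pd.concat([recipients.reset_index(drop=True), extra], axis=1)
print(fused.head())
```

This sketch is the unconstrained form of the idea; the constrained fusion mentioned below additionally limits how often each donor can be reused, which is how the representation of both datasets is maintained.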

The Result

All the relationships from both datasets are preserved. Any variable from one dataset can be tabulated by any variable in the other, based on the same conditions. While each variable will have a lower accuracy, the analysis benefits from consistency, and from the reliance on real rather than generated data. Moreover, while each individual match will be imperfect - and in a minority of cases completely wrong - it can be shown that estimates approach accuracy at an overall, aggregate level. In more complex Data Fusion tools, the representation of both datasets is perfectly maintained. The dataset behaves, in most (but not all) respects, as though it were a single-source dataset where the same individuals contribute all the data.

Although superior accuracy on each individual variable is lost, the move to representation and consistency pays off commercially. The narratives produced with the data, the estimates, and ultimately the decisions made all come with substantially lower risk, as they can now safely draw on all the datapoints working together.

To summarise: when a single variable is moved from one dataset to another, a single algorithm will outperform a fusion. When a small minority of cases, across a small minority of variables, are missing at random, multiple imputation approaches do not have the opportunity to harm the dataset. However, when a lot of data is missing and the analyst needs to preserve the data's representational "currency", then the consistency offered by constrained Data Fusion, and its use of real rather than fabricated data, makes it the appropriate approach.


#TLDR

Take Data Fusion for a spin

Experiment using your own data. Useful when you're just starting out and don't yet have access to datasets which share common variables.
