Broadly, the objective of Data Fusion Tools is to fill in "missing data". This can be either:
- genuinely missing data in the traditional sense, where some data for an individual is missing and fusion is used to replace the missing values. Here most of the data exists: most individuals have values for most of the variables. For one reason or another, some individuals are missing parts of the dataset, either systematically (by design) or unsystematically (through non-response, opting out, inconsistent data collection, etc.)
- data where the analyst is missing entire variables. Here no values are present for any of the individuals in the dataset. We could also think of this as missing data.
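The two situations can be told apart mechanically. The sketch below is only an illustration: the records, the variable names and the use of `None` to mark a missing value are all assumptions made for the example, not part of any particular tool.

```python
def missingness_by_variable(rows):
    """Label each variable as complete, partially missing or entirely missing."""
    variables = sorted({name for row in rows for name in row})
    report = {}
    for name in variables:
        n_missing = sum(1 for row in rows if row.get(name) is None)
        if n_missing == 0:
            report[name] = "complete"
        elif n_missing == len(rows):
            # No individual has a value: the analyst is missing the whole variable.
            report[name] = "entirely missing"
        else:
            # Some individuals have a value, others do not.
            report[name] = "partially missing"
    return report

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "brand_preference": None},
    {"age": 51, "income": None, "brand_preference": None},
    {"age": 27, "income": 38000, "brand_preference": None},
]

report = missingness_by_variable(rows)
print(report)
```

In this toy dataset `income` is an example of the first situation and `brand_preference` of the second.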
In both cases there are established methods for remedying this. Data Fusion aside, missing data can be handled in one of two other ways:
Prediction algorithms
These algorithms make use of individuals for whom all the data is present, building a scoring system or set of rules by which a missing variable can be explained, then predicted with a known confidence. When filling in a continuous or ordinal variable, a regression-based algorithm is used; when filling in a categorical variable, typically a nominal measurement, a classification algorithm is used. Where relevant proxy variables are present, they can be used to predict what the missing value ought to be. Data Fusion companies will spend a good deal of time understanding their client's data to get to grips with predictive common variables.
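As a rough illustration of the regression/classification split, the sketch below fits both kinds of predictor on the complete cases and uses them to fill one incomplete record. Everything here is an assumption made for the example: the records, the choice of `age` and `region` as proxy variables, and the deliberately simple stand-ins (a one-predictor least-squares line for the continuous variable, a majority vote per proxy value for the categorical one). Real prediction algorithms are usually far richer.

```python
from collections import Counter, defaultdict

def fit_line(xs, ys):
    """Ordinary least squares with one predictor: y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fit_majority(proxies, labels):
    """For each proxy value, remember the most common label seen with it."""
    votes = defaultdict(Counter)
    for proxy, label in zip(proxies, labels):
        votes[proxy][label] += 1
    return {proxy: counts.most_common(1)[0][0] for proxy, counts in votes.items()}

# Hypothetical records; None marks a missing value.
records = [
    {"age": 20, "spend": 100.0, "region": "north", "segment": "A"},
    {"age": 30, "spend": 150.0, "region": "north", "segment": "A"},
    {"age": 40, "spend": 200.0, "region": "south", "segment": "B"},
    {"age": 50, "spend": None, "region": "south", "segment": None},
]

# Learn only from individuals where all the data is present.
complete = [r for r in records if r["spend"] is not None and r["segment"] is not None]

# Continuous variable: regression on the proxy "age".
intercept, slope = fit_line([r["age"] for r in complete], [r["spend"] for r in complete])

# Categorical variable: classification from the proxy "region".
segment_by_region = fit_majority(
    [r["region"] for r in complete], [r["segment"] for r in complete]
)

for r in records:
    if r["spend"] is None:
        r["spend"] = intercept + slope * r["age"]
    if r["segment"] is None:
        r["segment"] = segment_by_region.get(r["region"])

print(records[3])
```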
These prediction algorithms are typically supplied with training data, and then asked to apply what they have learned to a holdout sample. In this way, the analyst can establish how well the algorithm performs on data that it hasn't seen before, and guard against overfitting. (Overfitting is a problem because the algorithm fails to generalize beyond the training data, despite appearing very proficient at predicting the training data.)
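A minimal sketch of the training/holdout idea follows. The data is synthetic, and both "algorithms" are hypothetical toys chosen to make the point: a memoriser that simply stores every training row (and so overfits), and a one-variable threshold rule. The 30% holdout fraction and the fixed seed are arbitrary choices.

```python
import random

def split_holdout(rows, holdout_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and set aside a fraction as the holdout sample."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

def accuracy(predict, rows):
    """Share of rows whose label is predicted correctly."""
    return sum(predict(features) == label for features, label in rows) / len(rows)

def fit_memoriser(train):
    """Overfitting toy: store every training row, guess 'loyal' for anything new."""
    lookup = {features: label for features, label in train}
    return lambda features: lookup.get(features, "loyal")

def fit_threshold(train):
    """Simple rule: pick the visit count that best separates the training labels."""
    def rule(threshold):
        return lambda features: "loyal" if features[1] >= threshold else "defect"
    best = max(range(11), key=lambda t: accuracy(rule(t), train))
    return rule(best)

# Synthetic data: features are (customer_id, visits); 5 or more visits means loyal.
rows = [((i, i % 10), "loyal" if i % 10 >= 5 else "defect") for i in range(100)]
train, holdout = split_holdout(rows)

memoriser = fit_memoriser(train)
threshold_rule = fit_threshold(train)

memoriser_train = accuracy(memoriser, train)
memoriser_holdout = accuracy(memoriser, holdout)
rule_train = accuracy(threshold_rule, train)
rule_holdout = accuracy(threshold_rule, holdout)

print("memoriser:", memoriser_train, memoriser_holdout)
print("threshold rule:", rule_train, rule_holdout)
```

The memoriser looks perfect on the training data because it has seen every one of those rows, but it has learned nothing it can carry over to the holdout sample. Only the holdout figures say anything about how either approach would behave on new individuals.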
The performance of the algorithm is often compared against what could be achieved at random. If I were to sell you an algorithm which is correct 60% of the time, you would probably want to know what you could achieve without it. For example, if 50% of the cases turn out to be loyal customers, and 50% of the cases turn out to quickly defect to a competitor, the algorithm already has a 50% chance of a correct guess, even if it were just flipping a coin - obviously, don't buy that algorithm! Similarly, if attempting to guess one of 6 customer segments, the algorithm has a roughly 17% (1 in 6) chance of getting the segment correct, even if it were rolling dice.
Within a commercial context, therefore, the goal of the analyst is to leverage existing data to do better than this, securing better commercial outcomes than the business could achieve at random.
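The arithmetic behind the coin-flip and dice comparisons can be written down in a few lines. The function names are made up for this sketch, and "lift" here simply means observed accuracy divided by chance accuracy, which is one common way of expressing the comparison rather than the only one.

```python
def chance_accuracy(n_classes):
    """Expected accuracy of guessing uniformly at random among n_classes options."""
    return 1.0 / n_classes

def lift_over_chance(observed_accuracy, n_classes):
    """How many times better than a uniform random guess the algorithm is."""
    return observed_accuracy / chance_accuracy(n_classes)

coin_flip = chance_accuracy(2)  # loyal vs. defect
dice_roll = chance_accuracy(6)  # one of 6 customer segments

# The 60%-correct algorithm from the example, judged against the coin flip.
lift = lift_over_chance(0.60, 2)

print(f"coin flip: {coin_flip:.0%}, dice roll: {dice_roll:.0%}, lift: {lift:.2f}")
```

A lift of 1.0 means the algorithm is no better than guessing; the further above 1.0 it sits on a holdout sample, the more it is adding.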
These algorithms can be extremely predictive, and the reason is that they work with a single variable at a time. A scoring system or set of rules is developed and then optimised for the purpose of getting a single variable correct. (Remember that for later, or just watch the video.)