Broadly, the objective of Data Fusion Tools is to fill in "missing data". It's really that simple. This can either be:
- genuinely missing data in the traditional sense, where some data for an individual is missing and fusion is used to replace the missing values. Here most of the data exists for the individuals in the dataset - most have values for most of the variables. For one reason or another, some individuals are missing parts of the dataset, either systematically (by design) or unsystematically (through non-response, opting out, inconsistent data collection, etc.)
- data where the analyst is missing entire variables. Here no values for that variable are present for any of the individuals in the dataset. We might also think of this as missing data.
In both cases there are established methods for remedying this. Let's put Data Fusion aside for a second. Missing data can be handled in one of two ways:
Prediction algorithms
These algorithms make use of individuals where all the data is present, building a scoring system or set of rules by which a missing variable can be explained, then predicted with a known confidence. In cases where the relevant proxy variables are present, they can be used to predict what the missing value ought to be. When filling in a continuous or ordinal variable, a regression-based algorithm is used. When filling in a categorical variable, typically a nominal measurement, a classification algorithm is used. Data Fusion companies, in this way, will spend a good deal of time understanding their client's data - getting to grips with which common variables are predictive. (This process is semi-automated within this platform.)
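To make that concrete, here is a minimal sketch of single-variable imputation. It assumes a hypothetical pandas DataFrame loaded from `survey.csv`, where 'age' and 'region_code' are fully observed, numeric proxy variables, 'income' is a continuous variable with gaps, and 'segment' is a categorical variable with gaps - all names are illustrative, not part of any particular platform.

```python
# A minimal sketch of single-variable imputation, assuming hypothetical columns:
# 'age' and 'region_code' are fully observed numeric proxies; 'income' (continuous)
# and 'segment' (categorical) have missing values.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

def impute_column(df, target, proxies, model):
    """Fit on rows where `target` is observed, then predict it where it is missing."""
    known = df[df[target].notna()]
    missing = df[df[target].isna()]
    if missing.empty:
        return df  # nothing to fill in
    model.fit(known[proxies], known[target])
    df.loc[missing.index, target] = model.predict(missing[proxies])
    return df

df = pd.read_csv("survey.csv")        # hypothetical input file
proxies = ["age", "region_code"]      # assumed to be numeric and fully observed

# Continuous/ordinal variable -> regression-based algorithm
df = impute_column(df, "income", proxies, RandomForestRegressor())

# Categorical (nominal) variable -> classification algorithm
df = impute_column(df, "segment", proxies, RandomForestClassifier())
```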
These prediction algorithms are typically trained on one portion of the data, and then asked to apply what they have learned to a holdout sample. In this way, the analyst can establish how well the algorithm performs on data it hasn't seen before, and guard against overfitting. (Overfitting is when the algorithm fails to generalize beyond the training data, despite appearing very proficient at predicting that specific training data.)
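A minimal sketch of that train/holdout check, reusing the hypothetical DataFrame and proxy columns from the previous snippet and evaluating on rows where 'segment' is actually observed:

```python
# Train on one portion of the complete cases, score on a holdout the model has never seen.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

complete = df[df["segment"].notna()]
X_train, X_test, y_train, y_test = train_test_split(
    complete[proxies], complete["segment"], test_size=0.3, random_state=42
)

model = RandomForestClassifier().fit(X_train, y_train)

# A large gap between these two scores is the classic sign of overfitting:
# proficient on data the model has seen, far weaker on data it hasn't.
print("training accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("holdout accuracy: ", accuracy_score(y_test, model.predict(X_test)))
```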
The performance of the algorithm is often compared against what could be achieved at random. For example, if I were to sell you an algorithm which is correct 60% of the time, you should first ask what you could achieve if you didn't buy the algorithm. That is, if 50% of the cases turn out to be loyal customers, and 50% of the cases turn out to quickly defect to a competitor, the algorithm already has a 50% chance of a correct guess even if it were just flipping a coin - in this case, you would not want to buy my algorithm! Similarly, if attempting to guess one of 6 customer segments, the algorithm has a 17% chance of getting the segment correct, even if it were rolling dice. The trouble I'm alluding to here is that it is very easy to fool stakeholders into thinking you are adding value through data science - algorithms that just add noise and complexity, fooling both the marketers (and the analyst) through nothing more than randomness. See Nassim Taleb's first book. Oh my hat - was that over 20 years ago?!
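A minimal sketch of that "could I do as well at random?" check, assuming the holdout split and fitted model from the previous snippet. DummyClassifier ignores the inputs entirely and just guesses, so it plays the role of the coin flip or dice roll:

```python
# Compare the fitted model against guessing baselines on the same holdout.
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

coin_flip = DummyClassifier(strategy="uniform", random_state=42)  # uniform random guess
majority  = DummyClassifier(strategy="most_frequent")             # always the modal class

for name, baseline in [("random guess", coin_flip), ("majority class", majority)]:
    baseline.fit(X_train, y_train)
    print(name, "accuracy:", accuracy_score(y_test, baseline.predict(X_test)))

# The model only adds value if its holdout accuracy clearly beats both baselines.
print("model accuracy:", accuracy_score(y_test, model.predict(X_test)))
```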
When generating any form of synthetic data, it is simply too easy to be fooled by randomness.
Within a commercial context, therefore, the goal of the analyst is to leverage existing data to do better than this: to secure better commercial outcomes than the business could achieve at random, and to do so consistently, in a way that is also 'face' valid.
While these algorithms can be extremely predictive, the reason they can be is that they are imputing a single variable at a time. A scoring system or set of rules is developed and then optimised for the purpose of getting that single variable correct, as sketched below. (Remember that for later, or just watch this video.)
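A minimal sketch of what "one variable at a time" looks like in practice, again assuming the hypothetical DataFrame and proxies from the earlier snippets; the extra column names ('loyalty_tier', 'channel_preference') are purely illustrative:

```python
# Each missing variable gets its own independently optimised model,
# tuned only to get that one variable right.
from sklearn.ensemble import RandomForestClassifier

for target in ["segment", "loyalty_tier", "channel_preference"]:
    known = df[df[target].notna()]
    missing = df[df[target].isna()]
    if missing.empty:
        continue
    model = RandomForestClassifier().fit(known[proxies], known[target])
    df.loc[missing.index, target] = model.predict(missing[proxies])
    # Nothing here constrains the imputed variables to be consistent with
    # one another - each prediction is optimised in isolation.
```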