Handling missing values is commonplace when working with customer data and market research surveys. When more than a negligible amount of data is missing, the way the analyst handles missing values can determine the outcome of the analysis - nerve-racking stuff for even the most experienced analysts. Without Data Fusion, however, the options are limited and come with drawbacks.

The analyst may choose one or more of the following:

  1. Remove individuals who have any missing information - if the missing information is even slightly systematic (not random, but determined by something relevant and meaningful), the analysis will be based on unrepresentative data, and therefore likely to miss its mark, or to bias results against a certain subgroup within the data. This can have dire and untold commercial ramifications.
  2. Remove individuals who have a lot of missing data, keep those who have relatively fewer missing data and impute their missing values only. Again, not good for representation if systematically missing, but not as bad as the first option.
  3. Replace all missing values by filling with a measure of central tendency, such as the mean, median or mode.
  4. Use multiple imputation, whether predictive or based on k-nearest neighbors (replacing a value by borrowing it from an individual in the dataset deemed to be similar to the individual missing the value).
  5. If possible, leave the value missing - however, this precludes most types of advanced analysis, and if the data are not Missing at Random (MAR), it renders results highly questionable.
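For concreteness, a few of the options above can be sketched in pandas on a toy dataset (the data and variable names here are purely illustrative, not from any real survey):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [34, 51, np.nan, 28, 45],
    "income": [40_000, np.nan, 52_000, np.nan, 61_000],
})

# Option 1: listwise deletion - drop any individual with a missing value.
complete_cases = df.dropna()

# Option 3: fill each column with a measure of central tendency (here the mean).
mean_filled = df.fillna(df.mean(numeric_only=True))

# Option 4 (hot-deck flavour): borrow each missing value from a randomly
# chosen individual who does have the value present.
rng = np.random.default_rng(0)
hot_deck = df.copy()
for col in hot_deck.columns:
    donors = hot_deck[col].dropna().to_numpy()
    mask = hot_deck[col].isna()
    hot_deck.loc[mask, col] = rng.choice(donors, size=mask.sum())
```

Note how option 1 discards whole rows, while options 3 and 4 fill every gap - each per variable, with no guarantee of consistency across variables.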

All of the above options may be fair if only a minority of values are missing across a small number of variables. Replacing values does introduce some noise, but with some care on the part of the analyst, the effect will be negligible. Again, it is often an important statistical requirement of imputation that the missing data not be systematic within the data - that is, missing values must be Missing at Random. This is a very serious assumption, not to be ignored. Beyond this, there is a further problem with imputation.

A point, however, soon arrives when either (a) there is not enough data to remedy the problem, or (b) because imputation is conducted separately for each variable, the data become ever more inconsistent. For individuals missing a tranche of data, each imputed value will have an ever-weaker relationship with the other imputed values. Without a common thread to bind the missing values, relationships across large swathes of data break down, creating havoc for the analyst, who can no longer tell the difference between truth and the necessary fictions introduced during data preparation.
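This breakdown is easy to demonstrate. Below is a minimal sketch on simulated data, with simple per-column mean imputation standing in for any independently-applied method: two strongly related variables lose much of their association once each is imputed without reference to the other.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=500)
# y is almost perfectly determined by x, so their correlation is near 1.
df = pd.DataFrame({"x": x, "y": 2 * x + rng.normal(scale=0.1, size=500)})

# Knock out roughly 40% of each column independently, then impute each
# variable separately with its own mean - no common thread between them.
holey = df.mask(rng.random(df.shape) < 0.4)
imputed = holey.fillna(holey.mean())

true_corr = df["x"].corr(df["y"])
imputed_corr = imputed["x"].corr(imputed["y"])  # noticeably weaker
```

The imputed values sit at each variable's centre regardless of what the other variable says, so every filled cell dilutes the true relationship.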


# TL;DR

Why Fusion?

How Data Fusion approaches the problem differently to achieve perfect consistency across imputed variables.

How to use Data Fusion to solve the problem

A Data Fusion avoids this chaos by ensuring that when missing values are replaced:

  • they originate from the same source, and are therefore perfectly consistent.
  • they are added to the data in a way which preserves the dataset's representativeness.

Typically, a Data Fusion requires donor data, comprising individuals who will donate their data to individuals in a recipient dataset. At the end of a typical fusion analysis, all the donors' data will have been gifted to the recipients. When remedying missing data, however, the analyst does not need a separate donor dataset to provide additional data. The donor and recipient dataset can be one and the same.

When setting up the analysis, upload the same dataset to both the donor and recipient slots. The only difference is that the donors will comprise the individuals who have all their data in place (as would be the dataset created in option 1 above). Not much aligning will be needed, and of course, there are many common variables to choose from. This design is known as a row-wise Data Fusion - and an unconstrained Data Fusion with large penalty weights is the preferred approach. If missing values are deemed too systematic, applying forced variables may also help correct for this.

Once complete, the matching IDs provide the analyst with a donor case from which to borrow missing data for each recipient. This can be done either programmatically or with a combination of IF and VLOOKUP (or XLOOKUP) functions within your spreadsheet. That is, if a value is missing, take the donor's value; otherwise, keep the recipient's original value.
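Programmatically, the same IF-plus-lookup step might look like the pandas sketch below. The table and column names, and the matched-ID table the fusion is assumed to have produced, are hypothetical illustrations:

```python
import numpy as np
import pandas as pd

# Hypothetical recipient rows with gaps, complete donor rows, and the
# matched-ID table assumed to come out of the fusion analysis.
recipients = pd.DataFrame(
    {"id": [1, 2, 3], "brand_pref": ["A", np.nan, "C"], "spend": [120.0, 80.0, np.nan]}
)
donors = pd.DataFrame(
    {"id": [10, 11, 12], "brand_pref": ["B", "A", "C"], "spend": [90.0, 110.0, 150.0]}
)
matches = pd.DataFrame({"recipient_id": [1, 2, 3], "donor_id": [11, 10, 12]})

# The XLOOKUP equivalent: pull each matched donor's row alongside its recipient.
lookup = matches.merge(donors, left_on="donor_id", right_on="id")
donated = lookup.set_index("recipient_id")[["brand_pref", "spend"]].rename_axis("id")

# The IF equivalent: keep the recipient's value where present, otherwise take
# the donor's - pandas combine_first does exactly this.
filled = recipients.set_index("id")[["brand_pref", "spend"]].combine_first(donated)
```

Because every borrowed value for a given recipient comes from the same single donor, the filled-in values are mutually consistent by construction.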

The analyst will quickly appreciate the very tangible impact of perfectly consistent replacement values. Rather than adding noise to each variable, and noise to the associations across variables, the data are cleaner, and insights easier to come by.

Ready to impress?

How does it feel to be minutes away from seeing the results of your first data fusion project?

No fee. No card.
