Data Fusion (otherwise known as statistical matching) is the merging of datasets using information common to both datasets.
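
To make the idea concrete, here is a minimal sketch of one simple form of statistical matching: each record in a recipient dataset borrows a value from the most similar record in a donor dataset. The datasets, the variable names (age_band, region, brand) and the helper functions are all hypothetical, and this is an illustration only, not a description of how Freed or any other tool performs a fusion.

```python
# Hypothetical toy data: "age_band" and "region" are common to both datasets;
# "brand" exists only in the donor dataset.
toy_donors = [
    {"age_band": "18-34", "region": "north", "brand": "A"},
    {"age_band": "35-54", "region": "south", "brand": "B"},
    {"age_band": "55+", "region": "south", "brand": "C"},
]
toy_recipients = [
    {"age_band": "18-34", "region": "north"},
    {"age_band": "55+", "region": "south"},
]
COMMON = ["age_band", "region"]


def mismatches(a, b):
    """Count the common variables on which two records disagree."""
    return sum(a[v] != b[v] for v in COMMON)


def fuse(recipients, donors, donated):
    """Copy the `donated` variable from the closest donor onto each recipient."""
    fused = []
    for r in recipients:
        # min() keeps the first donor when several are equally close.
        best = min(donors, key=lambda d: mismatches(r, d))
        fused.append({**r, donated: best[donated]})
    return fused


fused_rows = fuse(toy_recipients, toy_donors, "brand")
for row in fused_rows:
    print(row)
```

Each recipient ends up with a brand value it never reported, inferred purely from the variables it shares with the donors. That is why the two prerequisites below matter so much.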

There are two important pre-requisites:

  • Shared Variables: Data Fusion requires that the information common to both datasets explains the data that is not common.
  • Shared Populations: Data Fusion requires that the populations being fused are the same.

Let's consider each requirement in detail, and how to best prepare so that these things are in place from the get-go.

Shared Variables

When planning a fusion, it's important to know whether the common variables you have are any good, and which ones to use, to ensure the best possible outcome. Running explanatory models (something you can do with Freed on the free plan) will give you a sense of how well the common variables you have predict the uncommon ones. Some may lead to high classification rates; some may be irrelevant. You can also use this information to decide which parts of the uncommon data the business can sensibly use, and which parts it should not (and which you should effectively delete).
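
As a rough illustration of what such a check can look like, the sketch below scores each common variable by how often it classifies an uncommon variable correctly, compared with simply guessing the most frequent answer. The data, the variable names and the two helper functions are hypothetical, and the scoring rule (predict the most frequent value within each category) is a deliberately simple stand-in for a proper explanatory model. It is also measured in-sample, so it flatters the result.

```python
from collections import Counter, defaultdict

# Hypothetical donor records: "age_band" and "region" are common variables,
# "uses_mouthwash" is the uncommon variable we hope to fuse.
sample = [
    {"age_band": "18-34", "region": "north", "uses_mouthwash": "yes"},
    {"age_band": "18-34", "region": "south", "uses_mouthwash": "yes"},
    {"age_band": "18-34", "region": "north", "uses_mouthwash": "yes"},
    {"age_band": "55+", "region": "south", "uses_mouthwash": "no"},
    {"age_band": "55+", "region": "north", "uses_mouthwash": "no"},
    {"age_band": "55+", "region": "south", "uses_mouthwash": "yes"},
]


def classification_rate(rows, predictor, target):
    """Share of rows classified correctly when each predictor category
    predicts its own most frequent target value (in-sample, so optimistic)."""
    by_category = defaultdict(Counter)
    for row in rows:
        by_category[row[predictor]][row[target]] += 1
    correct = sum(max(counts.values()) for counts in by_category.values())
    return correct / len(rows)


def baseline_rate(rows, target):
    """Share classified correctly by always guessing the most common value."""
    counts = Counter(row[target] for row in rows)
    return max(counts.values()) / len(rows)


baseline = baseline_rate(sample, "uses_mouthwash")
rates = {
    v: classification_rate(sample, v, "uses_mouthwash")
    for v in ("age_band", "region")
}

print(f"baseline: {baseline:.2f}")
for name, rate in rates.items():
    print(f"{name}: {rate:.2f}")
```

In this toy data, age_band lifts the classification rate above the baseline while region does not, so region would add nothing as a matching variable for this particular measure.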

Shared Populations

Notice that this requirement is not shared individuals, but shared populations. That is, the datasets are drawn from, or comprise, individuals who meet the same definition. This is an underlying assumption of Data Fusion, though depending on your use case it is not as hard and fast as the previous requirement. Here, you get to make a judgement as to how different the populations can be, and interpret your results accordingly.

Say, for example, you have two datasets, one containing very progressive voters and the second containing very conservative voters. You may have a lot of information common to both, which is very relevant to the data which is not common. However, you would not expect that, once fused, the progressive opinions are indicative of conservative ones. The results won't make any sense. However, if the purpose of the fusion is to understand their opinions of, and likelihood to buy, different brands of toothpaste, one would not expect political leaning to have any bearing on oral hygiene. That is, both datasets might be drawn from the same population, "people who brush their teeth." Taking this example one step further, if the first dataset contains only exclusive users of Brand A, and the second only exclusive users of Brand B, then once again the assumption of shared populations cannot be made, and the fusion, for the purposes of understanding toothpaste purchasing, would fail.

In many commercial situations, organisations fuse information about their customers onto those who are not their customers, to work out what products non-customer groups might prefer and to hone their development and messaging. This is a tricky one, as it relies on the assumption that customers and non-customers are drawn from the same population. In some cases, this might hold. Often, though, one fears there is something fundamentally different about customers as opposed to non-customers, in which case the fusion will fail. However, if the difference between customers and non-customers is not related to need, but rather to brand awareness or knowledge, then fusing these datasets might make sense.

This requirement speaks to shared distributions, and is linked with the first. Sticking with the toothpaste example, let's assume that 80% of customers in the first dataset pair their daily brushing with mouthwash, whereas only 40% of those in the second dataset do. This is probably because they reflect different market segments or markets. It means that when fusing the first onto the second, we would, at best, see every mouthwash user in the second dataset matched with a mouthwash user (which would be good). But we would also see most non-mouthwash users matched with a mouthwash user (which would be bad). This issue would not necessarily be solved by fusing the second onto the first. Put more technically, the differences in the distributions of individuals determine the highest possible match rates. In summary, drawing from the same population for the purposes of the analysis is the best way to minimise differences in distributions across all your measures, and so to ensure the strongest possible matching and better fused estimates.
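
The arithmetic behind the mouthwash example can be sketched as follows. The sketch assumes a constrained match, in which every donor is used equally often and both datasets carry equal weight; that assumption is ours, as are the function and variable names. Under it, the share of recipients who can be paired with a donor in the same category is capped by the overlap between the two distributions.

```python
def max_match_rate(donor_shares, recipient_shares):
    """Upper bound on the share of recipients paired with a donor in the
    same category, assuming every donor is used equally often."""
    categories = set(donor_shares) | set(recipient_shares)
    return sum(
        min(donor_shares.get(c, 0.0), recipient_shares.get(c, 0.0))
        for c in categories
    )


# Shares taken from the example in the text.
first = {"mouthwash": 0.8, "no_mouthwash": 0.2}   # donor dataset
second = {"mouthwash": 0.4, "no_mouthwash": 0.6}  # recipient dataset

bound = max_match_rate(first, second)
reverse_bound = max_match_rate(second, first)

# Non-mouthwash recipients left over once the non-mouthwash donors run out,
# as a share of all non-mouthwash recipients.
like_for_like = min(first["no_mouthwash"], second["no_mouthwash"])
forced_share = (second["no_mouthwash"] - like_for_like) / second["no_mouthwash"]

print(f"best possible match rate: {bound:.0%}")
print(f"same bound in reverse:    {reverse_bound:.0%}")
print(f"non-mouthwash recipients given a mouthwash donor: {forced_share:.0%}")
```

With these shares, at most 60% of pairings can be like-for-like, and about two thirds of the non-mouthwash recipients end up with a mouthwash donor. The bound is symmetric, which is one way of seeing why swapping donor and recipient does not by itself fix the problem.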


#TLDR

Building Data Products with Fusion

Blueprints insight agencies can follow to strengthen their credentials and earn repeat revenue.

The strategic approach

The age of AI has unreasonably, and in many cases erroneously, raised expectations as to what is possible with data. The rules of data fusion are the same for AI: it still boils down to meaningful relationships in the data, and to using the information appropriately, taking great care not to overreach or read into the data where the assumptions of fusion do not allow for it.

Any expertise in data strategy comes to the fore here. While Data Fusion is a powerful marketing tool, successful examples, just like those of market segmentation, require careful planning and the management of stakeholder expectations as to what is possible.

While much is said about data strategy, planning for a fusion comes down to some very basic ingredients. Firstly, knowing what information is lacking within an organisation, and secondly, collecting new data with an eye on integrating it. That is, collecting data so that the above assumptions are met as closely as possible.

It asks marketing analysts to weigh in heavily on what data should be collected and how it is best integrated using tools such as data fusion. Knowing how far to push such tools is infinitely more valuable than knowing how to execute the technique. The latter is, as Freed demonstrates, increasingly democratised; the former, due to ever-growing complexity and nuance, is becoming somewhat rare and elite.

Ready to impress?

How does it feel to be minutes away from seeing the results of your first data fusion project?

No fee. No card.
