For analysts new to data integration, the promise of Data Fusion may sound, on the face of it, too good to be true. Fancy techniques usually disappoint. The difference here is that this one has been used commercially for 30+ years, so you need not brace yourself for a fall. Nevertheless, before launching headfirst into a real project, you'll need to build confidence in using the technique so you know how to plan your project and, if you're shrewd, evaluate how well it works on your own data.

A new data fusion project can mean changing what data is collected, or investing in collecting new data, whether via a new survey or by buying fresh data from a vendor. That's expensive. So many analysts are caught wanting to trial data fusion on their own data, but not yet having a second dataset to fuse with. In many situations it's too risky a proposition because of the investment outlay. Besides, it's professional to be skeptical, and intelligent to fear messing up complicated things.

So, if you would like to evaluate data integration tools without the investment of collecting new data, or you would like to set yourself up to collect new data as efficiently as possible, then working with the data you already have is the smartest option.

Here are the steps you would take to test data fusion using only a single dataset.


#TLDR

Building a Sandbox

More advice about experimenting with your own data. Useful when you're just starting out and don't yet have access to datasets that share common variables.

Step 1 - Find a dataset you are already very familiar with

You're going to need quite a sizable dataset: something like your company's brand tracker or a very large ad hoc survey; a segmentation project would probably be best. Dataset size matters here, as you'll need to create two files from it. If you don't have access to such a survey but your organisation has a rich CRM database, exporting data from it and then deleting all personally identifiable information will also suffice.

Step 2 - Split the dataset into two

For ease of use, remove variables you won't be looking at; you'd like to keep at least 40 columns/variables of data. Then create two datasets from the single dataset. Anything above 1,000 cases per file will work; anything above 2,000 cases is ideal. Note: you can't just take the top half of the file and the bottom half, as this will introduce unnecessary bias and ruin your experiment.

So, if you're using data science software, randomly sample from your file without replacement (there's a short pandas sketch after the two options below). If you don't have such software, use a spreadsheet and either:

Sort your data by key variables. Create a column repeating the numbers 1 to k, where k is the sampling interval (total cases ÷ sample size). E.g., if you have 10,000 cases and want to take out 2,000, k = 10,000/2,000 = 5, so add numbers repeating from 1 to 5. Keep only the rows numbered 2 for your first file, and only the rows numbered 4 for your second file. For ease, set up a data filter drop-down, apply your filter, and copy and paste each sample out.

OR

Create a randomly generated number between 0 and 1 using a function in your spreadsheet package; in Excel this is RAND(). Copy and paste it as a value so that it doesn't keep regenerating. Set up a data filter drop-down. If you have 10,000 cases and want to take out two samples of 2,000, select rows where the value is lower than 0.2 (2,000/10,000) for your first file, and higher than 0.8 (1 − 2,000/10,000) for your second.
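If you do have a scripting environment, the same split is a few lines of pandas. A minimal sketch, assuming a CSV export named survey.csv and 2,000 cases per file (both are placeholders for your own data):

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # your full single-source dataset

# Sample 2,000 cases for the first file, then 2,000 *different* cases for
# the second, so no respondent appears in both (sampling without replacement).
file_a = df.sample(n=2000, random_state=42)
file_b = df.drop(file_a.index).sample(n=2000, random_state=42)

file_a.to_csv("file_a.csv", index=False)
file_b.to_csv("file_b.csv", index=False)
```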

If you don't have the luxury of a large and complex dataset, you can skip splitting the file in two and instead fuse the file back onto itself, preserving the size of the file you're working with. (This is a very easy task for a data fusion, but you'll still get a feel for the technique.)
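In pandas terms (a sketch using the same hypothetical file name as above), that simply means duplicating the frame before you delete the two sets of uncommon variables in Step 4:

```python
import pandas as pd

df = pd.read_csv("survey.csv")

# Both "halves" are the full file; every respondent appears in both,
# which makes the subsequent match artificially easy.
file_a, file_b = df.copy(), df.copy()
```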

Step 3 - Decide which will be common variables

These will be the variables you intend to collect in the future from your new dataset, duplicating information you already have. In other words, these are the common variables upon which you will match the new dataset with the data you already have.

For your Data Fusion experiment to work, these common variables must bear some relationship to variables in both datasets (or to the measures you will look to collect in the future). For example, if planning to integrate your CRM with a survey, these must be fields a survey respondent can provide a meaningful answer to. Try not to rely on demographics alone, as demographics may not be enough to explain customer behaviour in your market; look for summary measures. At this stage you can include all the potential candidates: the aim of this experiment is also to learn which common variables will be most useful to you.
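If you'd like a rough screen of your candidates before fusing, one option (sketched below with hypothetical numeric column names) is to correlate each candidate common variable with the measures you plan to carry across; candidates with near-zero correlations will add little matching power:

```python
import pandas as pd

df = pd.read_csv("survey.csv")

candidates = ["age", "visits_per_month", "satisfaction_score"]  # hypothetical
targets = ["brand_preference_score", "annual_spend"]            # hypothetical

# Absolute pairwise correlations between candidates and targets
# (numeric columns only; categorical variables need e.g. Cramer's V instead).
corr = df[candidates + targets].corr().loc[candidates, targets]
print(corr.abs().round(2))
```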

Step 4 - Create two fake datasets

Divide all the remaining information into two meaningful sets of measures (again, both of which must bear some relationship to your common variables!). Delete the first set from the first data file and the second set from the second data file. Another reminder: if you're potentially linking with a CRM, now is the opportunity to replace your customer identifier with a sequential number, since account numbers are personally identifiable. Keep a record of the mapping between customer identifiers and sequential numbers if you'd like to swap them back at a later stage. You should now have two datasets, each comprising a unique identifier (which is not an account number), the same set of common variables, and its own set of uncommon variables. No respondent from one file appears in the other. For ease, you want to end up with two files of no more than 30-40 columns each.
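Here is a sketch of this step, continuing from the earlier split; the column names are hypothetical stand-ins for your own:

```python
import pandas as pd

file_a = pd.read_csv("file_a.csv")
file_b = pd.read_csv("file_b.csv")

# Hypothetical column names: substitute your own.
common  = ["age", "visits_per_month", "satisfaction_score"]  # kept in both files
set_one = ["brand_preference_score", "ad_recall"]            # file A only
set_two = ["annual_spend", "channel_mix"]                    # file B only

# Swap the identifying account number for a sequential id, keeping the
# mapping somewhere safe in case you want to reverse it later.
for df, name in ((file_a, "a"), (file_b, "b")):
    df["row_id"] = range(1, len(df) + 1)
    df[["account_no", "row_id"]].to_csv(f"id_mapping_{name}.csv", index=False)

file_a[["row_id"] + common + set_one].to_csv("fake_a.csv", index=False)
file_b[["row_id"] + common + set_two].to_csv("fake_b.csv", index=False)
```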

Step 5 - Fuse the two datasets

When doing so, you may wish to experiment with different combinations of common variables. Make a note of which ones you use, as, again, these will be important to include in future data collection.
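The platform handles the matching for you, and its algorithm is its own. Purely for intuition, here is a toy version of statistical matching (a nearest-neighbour hot-deck), using the hypothetical files and column names from the sketches above:

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors  # scikit-learn, assumed installed

fake_a = pd.read_csv("fake_a.csv")  # recipient file
fake_b = pd.read_csv("fake_b.csv")  # donor file

common = ["age", "visits_per_month", "satisfaction_score"]  # numeric, hypothetical

# Standardise so no single common variable dominates the distance metric.
mu, sigma = fake_b[common].mean(), fake_b[common].std()
nn = NearestNeighbors(n_neighbors=1).fit((fake_b[common] - mu) / sigma)
_, idx = nn.kneighbors((fake_a[common] - mu) / sigma)

# Donate each nearest neighbour's uncommon variables to the recipient record.
donors = fake_b.iloc[idx.ravel()][["annual_spend", "channel_mix"]]
fused = pd.concat([fake_a.reset_index(drop=True),
                   donors.reset_index(drop=True)], axis=1)
fused.to_csv("fused.csv", index=False)
```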

Step 6 - Evaluate the matching

The platform will produce a quality grade for each run you make; you can expect high grades here if you have chosen good common variables, and if you've sampled correctly, you can also expect strong matches. If you're on a paid plan, you can take the opportunity to export your fused dataset and cross-tabulate the uncommon variables. Failing that, the fused counts on each common variable, as they would appear in your fused dataset, will also be displayed.
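Because both files came from a single source, you can go one step further than a quality grade: recover each recipient's true values of the donated variables and compare them with the fused values. A sketch, continuing with the hypothetical files above:

```python
import pandas as pd

survey = pd.read_csv("survey.csv")      # the original single source
ids = pd.read_csv("id_mapping_a.csv")   # row_id -> account_no for file A
fused = pd.read_csv("fused.csv")

# Recover each recipient's *true* value of a donated variable
# ("annual_spend" is hypothetical) and compare it with the fused value.
truth = ids.merge(survey[["account_no", "annual_spend"]], on="account_no")
check = fused.merge(truth, on="row_id", suffixes=("_fused", "_true"))

print("mean absolute error:",
      (check["annual_spend_fused"] - check["annual_spend_true"]).abs().mean())
print(check[["annual_spend_fused", "annual_spend_true"]].describe())
```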

Step 7 - Contextualise the results

NB: The results you see represent the very best outcome you can expect using the common variables available to you. While having good common variables across the two datasets is the most important contributor to the success of your future fusion project, sampling your two datasets in this experiment from a single source ticks the second box a fusion needs: a shared population. If you've sampled well, with large enough samples, the distributions of your two files will be very similar. Things are unlikely to be this perfect going forward, even if you keep the definitions of your two datasets identical, as data collection methods, sampling frames and perhaps even the mode of collection may differ. Data distributions may also shift over time. Together, these differences set a ceiling that data fusion cannot overcome and determine how well your fusion will work.

Now that you're familiar with the matching process, repeat Step 2, this time sampling disproportionately. Skew the sample by a realistic amount (don't be hacky about it; you really want to control your experiment). This begins to simulate a real-world fusion project, where the datasets don't line up perfectly.
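One way to introduce a controlled skew, sketched with the same hypothetical columns, is to weight the second sample towards (say) older respondents:

```python
import pandas as pd

df = pd.read_csv("survey.csv")

# Give respondents aged 55+ twice the chance of selection in file B:
# a realistic, controlled skew ("age" is a hypothetical column name).
weights = df["age"].ge(55).map({True: 2.0, False: 1.0})

file_a = df.sample(n=2000, random_state=1)
remaining = df.drop(file_a.index)
file_b = remaining.sample(n=2000, weights=weights.loc[remaining.index],
                          random_state=1)
```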

By experimenting with your single-source data in this way, you can study the effect of introducing disparate populations, as well as how effective your common variables are at matching them.



