Data Fusion tools merge datasets using common variables. They are widely used to build more rounded customer views, typically by joining survey data with other surveys, CRMs and digital data sources. With a little extra work, the same technique can be used within a single survey, to integrate parts of the questionnaire which not all respondents completed. This might be to reduce questionnaire length and/or enhance representation - lowering fieldwork costs while improving estimates. In its simplest guise, once the core of your survey is complete, respondents are routed randomly into one of two questionnaire parts - and it is strongly suggested you experiment with that two-part design on a real project before attempting a larger multi-category survey like the one described below, because the data processing gets very complex, very quickly. This should not be your first attempt at a live fusion project; if you don't have a real project with only two categories, it is advised you practise on multi-category data using an approach similar to the one described here.

Data Fusion is a better alternative to the least fill quota approach: once in field, it is either impossible or too costly to control quota representation, particularly across multi-category surveys, where least fill quotas destroy representation and therefore the accuracy of the results.

While this approach takes more effort, the combined win of lower net cost and increased accuracy substantially outweighs the additional data processing involved.

The great news is that working within the same survey, rather than across surveys, makes Data Fusion relatively light work: fusion relies on a set of relevant common variables and similar data distributions, both of which arise naturally here from the single sample.


# TL;DR


Why least fill quotas might not work as well as you think. Use fusion instead to achieve representation at a lower cost per interview.

Let's take the example of a typical multi-category survey. A travel booking website would like to capture diagnostic information across a range of holiday destinations (or types) to optimise their messaging. They may wish to conduct a deep diagnostic survey within each holiday category. These categories might be local destinations, package beach holidays, adventure holidays, cruises, and luxury holidays. The questionnaire would be too long to collect data about all these categories from a single respondent, so a limit must be placed so that each respondent answers about only two types of holiday. This means that not only is that respondent unrepresented in the remaining categories (if indeed they are applicable), but their data for those categories is missing entirely, making comparison across categories quite challenging.

Instead of a least fill quota across all categories, we set a minimum base size on cruises and luxury holidays to ensure we achieve a reasonably fair base size. We do not limit this to be a least fill. Once these respondents have completed this holiday type, they can be randomly assigned to another qualifying category. In this way we ensure a good spread of higher-spending holidaymakers across the other categories - rather than assigning them based on least fill. Under a least fill approach, higher-spending customers would not be asked about a package holiday even if they qualify - and this skews the remaining profile of package holidays.

Under a least fill approach, estimates of the value of each type of holiday would be heavily skewed as a result, because the customers who can afford luxury holidays and cruises also account for a disproportionate share of spending on cheaper holiday types. It is this effect which skews the data, and disproportionately so, as the top 20% of spenders are likely to be responsible for 80% of revenue! This would not be good for you if your client/stakeholders already know what the real spend levels are! So, if we are not including a boost of low-incidence categories and later weighting them back to their real market incidence, we really do want to keep disproportional answering to a minimum.

Step 1 - Setting up your screener.

Keeping with the holiday example: in your screener section, establish which categories each respondent qualifies for. Punch this, as normal, as a binary multi-punch question, e.g., S1_1, S1_2, S1_3. Follow up with variables which determine category engagement. This is where you can ask about those variables which broadly define the experience, including recency, frequency, spend, destination and overall satisfaction. Alongside the demographics section, these will later serve as important matching variables. Once the deep dive category sections are randomly assigned, track the assignment with another multi-punch dummy question, D1_1, D1_2, D1_3, D1_4. Note that there is now a disparity between which deep dive categories a respondent could answer and which they actually answered.
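The random assignment step above can be sketched in a few lines. This is a minimal illustration, not production routing logic; the function name and the S1/D1 dictionary shapes are hypothetical, chosen to mirror the punch naming in the text.

```python
import random

def assign_deep_dives(qualifies, max_sections=2, seed=None):
    """Randomly pick up to two deep dive sections from the categories
    a respondent qualified for (the S1 multi-punch), and return the
    D1-style dummy punches tracking what they actually answer."""
    rng = random.Random(seed)
    qualifying = [cat for cat, q in qualifies.items() if q == 1]
    chosen = rng.sample(qualifying, min(max_sections, len(qualifying)))
    # D1 mirrors S1: 1 = assigned this deep dive, 0 = not assigned
    return {cat: int(cat in chosen) for cat in qualifies}

# A respondent qualifying for three of the four categories
s1 = {"S1_1": 1, "S1_2": 1, "S1_3": 0, "S1_4": 1}
d1 = assign_deep_dives(s1, seed=42)
```

In a real survey platform this routing is done by the scripting tool; the point is simply that D1 can never exceed S1, which is what the donor/recipient arithmetic later relies on.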

Respondents who spend the most on holidays, and go on more types of holiday, are now represented randomly. We want to avoid any effect which would push respondents into one category over another, and prefer a quota design that spreads different types of holidaymaker across the categories. We will then rely on Data Fusion to bring in the answers that are "missing".

Step 2 - Analyzing the common variables.

Once out of field, investigate the relationships between the core screener variables and the answers in the deep dive category sections. Appreciate that different factors may be more important to understanding differences in some categories than in others. In our example, when looking at package and adventure holidays we might discover that recency and short haul destinations aren't really related to the category diagnostics. Likewise, the level of spending might explain much of the variation in luxury holidays, while actual destination does not matter. We put irrelevant factors aside, so that our common variables are meaningful and won't introduce unnecessary noise to each fusion.
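One simple way to screen candidate common variables, sketched below under the assumption that the screener measures and a single diagnostic score sit in one table. The column names (`recency`, `spend`, `diag_score`, etc.) are hypothetical, and absolute correlation is only one possible relevance measure; you may prefer regression or CHAID-style analysis in practice.

```python
import pandas as pd

# Hypothetical screener variables for one category's diagnostics
screener_vars = ["recency", "frequency", "spend", "satisfaction"]

def rank_common_variables(df, diagnostic="diag_score", candidates=screener_vars):
    """Rank candidate common variables by absolute correlation with a
    category's diagnostic measure; weak candidates can be set aside."""
    corr = df[candidates].corrwith(df[diagnostic]).abs()
    return corr.sort_values(ascending=False)

# Tiny illustrative dataset: spend tracks the diagnostic perfectly
df = pd.DataFrame({
    "recency":      [5, 3, 4, 1, 2],
    "frequency":    [1, 1, 2, 2, 3],
    "spend":        [1, 2, 3, 4, 5],
    "satisfaction": [2, 1, 3, 5, 4],
    "diag_score":   [2, 4, 6, 8, 10],
})
ranked = rank_common_variables(df)
```

Whatever measure you use, run it per category: a variable that earns its place in the luxury fusion may add only noise to the package holiday fusion.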

Step 3 - Non-response weighting

Weight the data according to the screener section or to known industry or business data, to ensure customers are represented in their correct proportions. One could also weight each category individually (interlocking the weights with the overall weights). This ensures that regardless of which categories they answered, overall, the sample represents the market. In our example, we might need to correct for age, level of spending and frequency, while controlling for the categories of holiday each respondent qualified for. At this point, it is clear why this approach is far healthier for your overall estimates: it provides a lot more scope for you to expand your insights presentation to include overlapping cohorts, and allows you to explore the relationship of one category with others.
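A minimal post-stratification sketch for a single variable, assuming you hold known market targets for it. Real non-response weighting across several variables would typically use raking (iterative proportional fitting); the `age_band` values and target shares here are invented for illustration.

```python
import pandas as pd

def post_stratify(df, var, targets):
    """Weight respondents so the weighted distribution of `var`
    matches known market proportions given in `targets`."""
    observed = df[var].value_counts(normalize=True)
    out = df.copy()
    out["weight"] = out[var].map(lambda v: targets[v] / observed[v])
    return out

# Hypothetical sample skewed towards younger respondents
df = pd.DataFrame({"age_band": ["18-34"] * 3 + ["35+"]})
weighted = post_stratify(df, "age_band", {"18-34": 0.5, "35+": 0.5})
share = (weighted.loc[weighted["age_band"] == "18-34", "weight"].sum()
         / weighted["weight"].sum())
```

The same idea extends to the interlocked category weights mentioned above: apply the correction within each qualifying-category cell rather than to the file as a whole.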

At this stage, we would also be hopeful that restricting each respondent to two holiday types, rather than allowing all to answer about all four (if applicable), didn't skew the number of respondents who answered each question too much (although some skew is unavoidable). The most important thing is that we achieve fair representation across all categories. The only issue we should really be left with is that we are missing their answers in the categories they would have answered, had they been given the opportunity to do so.

For this we will use Data Fusion to borrow the missing answers from like-minded travellers: those with similar enjoyment, recency, frequency and similar spend levels.

Step 4 - Tag fusion cohorts

Export your data to a spreadsheet package, like Microsoft Excel or Google Sheets. Subtract the categories answered from the categories qualified for (in this example, S1_1 minus D1_1). When this new variable equals 1, it indicates that for this category, the respondent is a recipient in a fusion. Now add the categories answered and the categories qualified for (in this example, S1_1 plus D1_1). When this variable equals 2, it indicates that for this category, the respondent is a donor in that fusion. Repeat for all codes.
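The same tagging arithmetic works identically outside a spreadsheet. A small pandas sketch for category 1, using the S1/D1 punch names from the screener step (the respondent rows are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "S1_1": [1, 1, 0, 1],   # qualified for category 1
    "D1_1": [1, 0, 0, 0],   # answered the category-1 deep dive
})

# Qualified minus answered: a value of 1 marks a fusion recipient
df["recip_1"] = df["S1_1"] - df["D1_1"]
# Qualified plus answered: a value of 2 marks a fusion donor
df["donor_1"] = df["S1_1"] + df["D1_1"]
```

Respondents scoring 0 on both flags simply don't qualify for the category and take no part in that fusion.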

Step 5 - Export and Clean

Filter and save, creating a file of donors and a file of recipients for each category. In this example, we'll have 8 files: 4 donor files and 4 recipient files. At this stage, you may also take the opportunity to remove any odd respondents from the donor files. That is, you may keep them as legitimate recipients within their own category, without cross-polluting other categories. (Normal rules and ethics for cleaning market research data still apply.)
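The filter-and-save step, sketched with pandas. This assumes the `donor_k`/`recip_k` flags from the previous step; each returned frame can then be written out with `to_csv` to give the 8 files.

```python
import pandas as pd

def split_fusion_files(df, n_categories=4):
    """Split the tagged survey file into one donor frame and one
    recipient frame per category, keyed by hypothetical flag names."""
    files = {}
    for k in range(1, n_categories + 1):
        files[f"donors_{k}"] = df[df[f"donor_{k}"] == 2]
        files[f"recipients_{k}"] = df[df[f"recip_{k}"] == 1]
    return files

# Tiny illustration for a single category
df = pd.DataFrame({"id": [1, 2, 3], "donor_1": [2, 1, 0], "recip_1": [0, 1, 0]})
files = split_fusion_files(df, n_categories=1)
```

Cleaning odd respondents out of a donor frame before saving is a one-line extra filter at this point.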

Step 6 - Design and execute the fusion.

Select key diagnostic measures as meaningful target variables, so that each fusion matches donors and recipients on those common variables which matter most to that category. You need to set S1 as a forced variable to control your most commercially important overlapping categories.
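To make the mechanics concrete, here is a minimal unconstrained hot-deck fusion: each recipient borrows the deep-dive answers of the nearest donor on the common variables. This is a bare sketch, not a substitute for a dedicated fusion application; column names are hypothetical, the distance is plain Euclidean (real tools would standardise variables and weight them), and a forced variable like S1 would be handled by running the match within donors sharing the same S1 profile.

```python
import numpy as np
import pandas as pd

def hot_deck_fuse(recipients, donors, match_vars, donate_vars):
    """Match each recipient to its nearest donor on the common
    variables and copy across the donated deep-dive answers."""
    R = recipients[match_vars].to_numpy(dtype=float)
    D = donors[match_vars].to_numpy(dtype=float)
    # Distance from every recipient to every donor, via broadcasting
    dist = np.linalg.norm(R[:, None, :] - D[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)
    fused = recipients.copy()
    fused[donate_vars] = donors.iloc[nearest][donate_vars].to_numpy()
    fused["donor_id"] = donors["id"].to_numpy()[nearest]  # audit trail
    return fused

donors = pd.DataFrame({"id": [101, 102], "spend": [1.0, 10.0],
                       "diag": ["a", "b"]})
recipients = pd.DataFrame({"id": [1, 2], "spend": [2.0, 9.0]})
fused = hot_deck_fuse(recipients, donors, ["spend"], ["diag"])
```

Keeping the `donor_id` column is what makes Step 7's matching-ID file possible.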

Step 7 - Assemble your final dataset.

When using an unconstrained data fusion, the recipient weight in each category is used for analysis (if you did not opt for interlocking weights earlier).

The other reason multi-category surveys grow in complexity is that instead of a single fusion, the analyst must now carefully and painstakingly assemble the final data set from a series of fusions. In some fusions, respondent 123 may be matched with 456 and in other categories, may take their information from 789. In our example, we need to merge the file 4 times, and ensure all the columns and rows line up.

While the application will give you back 4 files for merging, the best approach is to create a file containing all the matching IDs, so that each respondent in the survey has a row containing their own ID and, if they received data from another respondent for a category, the matching donor ID for that category.

Assemble your data, starting with the common data and weight variable(s). If you are working in data analysis software, use join, merge, and stack functionality as appropriate to your software. If working in a spreadsheet package like Excel, combining an IF function (take either from their own answers, or if not present, fetch the answers from the fused donor) with a VLOOKUP or XLOOKUP will also do the trick. Whatever you do, work slowly and systematically (and as you do so, think warm thoughts about how your job is safe from Generative AI).
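The IF-plus-lookup pattern above translates directly to a pandas merge-and-fill: take the respondent's own answer where present, otherwise pull the fused donor answer. A per-category sketch, with hypothetical `id` and answer column names:

```python
import pandas as pd

def assemble_category(base, fused, answer_cols):
    """Left-join a category's fused answers onto the base file, then
    keep own answers where present and fill gaps from the donor."""
    out = base.merge(fused[["id"] + answer_cols], on="id",
                     how="left", suffixes=("", "_fused"))
    for col in answer_cols:
        out[col] = out[col].fillna(out[col + "_fused"])
        out = out.drop(columns=col + "_fused")
    return out

# Respondent 1 answered q1 themselves; respondent 2 receives a fused value
base = pd.DataFrame({"id": [1, 2], "q1": [5.0, float("nan")]})
fused = pd.DataFrame({"id": [2], "q1": [7.0]})
out = assemble_category(base, fused, ["q1"])
```

Run it once per category, in our example four times, and the rows and columns stay aligned by `id` rather than by hand.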

Step 8 - Checking

Run frequencies and crosstabulations to ensure your assembled data matches the incidence you expect.
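A quick weighted-incidence check of the kind described, as a sketch (flag and weight column names are hypothetical):

```python
import pandas as pd

def check_incidence(df, category_flags, weight="weight"):
    """Weighted incidence of each category flag in the assembled
    file, for comparison against known market proportions."""
    total = df[weight].sum()
    return {c: (df[weight] * df[c]).sum() / total for c in category_flags}

# Three respondents: two qualify for category 1, with weights 1, 1, 2
df = pd.DataFrame({"S1_1": [1, 1, 0], "weight": [1.0, 1.0, 2.0]})
incidence = check_incidence(df, ["S1_1"])
```

Compare these figures, and the equivalent crosstabulations by demographics, against your targets from Step 3 before releasing the data.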

In these steps you have cut your fieldwork cost while retaining representation, so that each core questionnaire section now matches the size and representation required within its category - and at a fraction of the cost. Your multi-category survey now also lines up with reality.

Ready to impress?

How does it feel to be minutes away from seeing the results of your first data fusion project?
