Regression to the mean is a phenomenon where extreme values in a dataset tend to move towards the mean upon subsequent measurement. In other words, if an initial measurement is very high or very low, the next measurement is likely to be closer to the average. This happens because many naturally occurring phenomena roughly follow a bell curve, with most data points sitting close to the average, so very high or very low measures are unusual. For example, when you walk through your office building, you are far less likely to find very tall or very short people - most people are of average height and average intelligence, and while everyone is unique, on average we're less remarkable than we outwardly appear. We all try not to let on, but ultimately we know this is true, and mother nature all but guarantees it.
Within the context of data fusion and data integration, regression to the mean takes on a slightly different twist. When we're fusing data, we're artificially supercharging Regression to the Mean, as we repeatedly sample imprecisely from donors. It is the summed effect of the compromise we make when trying to match individuals as similarly as possible across many measures at the same time.
In an ideal world, we would expect to see perfect matching on all the common variables we have, and for all those common variables to fully explain all non-common variables. In practice, this cannot be the case.
Take me to where the candy is.
To illustrate Regression to the Mean, let's pretend we are fusing a dataset we collected from one suburb with another. Let us imagine a fictitious variable in the second dataset - the donor dataset. It's a measure of the "Amount of Halloween candy collected." We would expect the amount of candy individuals might collect to be dictated by age, with the youngest in our sample collecting lots of candy, teenagers getting less and adults getting none. If matches were forced on three age bands, we would expect zero regression to the mean. That is, children from one sample could be expected to collect the same as children in the other. Adults from one sample will get the same as adults in the other. None. Forcing matches mitigates Regression to the Mean.
However, when this forced variable is removed, we would expect some matches to occur across age bands. This less than perfect matching on age happens so that we might match more closely on other variables of interest, say household size and gender. We'd certainly see some younger children matched with teenagers, and some adults matched with teenagers. Adults would be likely to pick up some candy in the fused dataset, teenagers more, and younger children, who initially collected 90% of the candy, might now pick up only 70% of it. Because some younger children were selected to donate their information to teenagers, who didn't originally receive the same amount of candy, we literally see some of the children's candy going to teenagers. Overall, our dataset has regressed towards the average amount of candy collected (across children, teenagers and adults).
Said another way: through imperfect sampling, we see our measure gradually approaching the mean.
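The candy example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the candy averages and the cross-band match shares below are invented numbers, the three age bands are assumed to be the same size, and `fused_means` is a hypothetical helper rather than part of any fusion tool.

```python
# Hypothetical average candy collected per donor age band (invented numbers).
donor_mean_candy = {"child": 90.0, "teen": 10.0, "adult": 0.0}

# Share of each recipient band's matches that come from each donor band.
# Forced design: every recipient is matched within their own age band.
forced = {
    "child": {"child": 1.0, "teen": 0.0, "adult": 0.0},
    "teen": {"child": 0.0, "teen": 1.0, "adult": 0.0},
    "adult": {"child": 0.0, "teen": 0.0, "adult": 1.0},
}
# Unforced design: some matches leak across neighbouring bands (assumed rates).
unforced = {
    "child": {"child": 0.8, "teen": 0.2, "adult": 0.0},
    "teen": {"child": 0.2, "teen": 0.6, "adult": 0.2},
    "adult": {"child": 0.0, "teen": 0.2, "adult": 0.8},
}


def fused_means(match_shares, donor_means):
    """Expected fused candy per recipient band, given donor-band match shares."""
    return {
        band: sum(share * donor_means[donor_band]
                  for donor_band, share in shares.items())
        for band, shares in match_shares.items()
    }


forced_result = fused_means(forced, donor_mean_candy)
unforced_result = fused_means(unforced, donor_mean_candy)

for band in donor_mean_candy:
    print(band, round(forced_result[band], 1), round(unforced_result[band], 1))
```

With forcing, each band keeps its original average. Without it, the children's average falls, the teenagers' and adults' averages rise, and the total amount of candy stays the same: it has simply been spread more evenly across the bands.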
The price we pay
In this light, Regression to the Mean is very much the price we pay for integrated data - valuable data comes at a cost not even the best integration tools can overcome. The good news is that we can easily study how our analysis performs by observing "folding". This means looking at the index of a measure across groups, where 100 represents the overall average. When regression to the mean occurs, each group's index moves closer to 100. That is, the data becomes flatter, and less distinctive/marked across groups. This is unavoidable: Data Fusion gives us flatter data. To be more transparent and signal caution where it's due, data fusion companies will report on Regression to the Mean in detail. We don't report Regression to the Mean on our DIY application, as it is a bit of a 'red herring' for those starting out, particularly if matching rates and donor usage fail to make the grade. Within most use cases, the analyst should focus on the latter.
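As a rough sketch of what studying the index looks like, the snippet below indexes each group's mean against the overall mean (100 = average), before and after fusion. The figures are invented for illustration, and `index_by_group` is a hypothetical helper, not a function from any particular fusion product.

```python
def index_by_group(group_means, group_sizes):
    """Index each group's mean against the size-weighted overall mean (100 = average)."""
    total_n = sum(group_sizes.values())
    overall = sum(group_means[g] * group_sizes[g] for g in group_means) / total_n
    return {g: 100.0 * group_means[g] / overall for g in group_means}


# Invented figures: mean candy collected per age band.
sizes = {"child": 100, "teen": 100, "adult": 100}
donor_means = {"child": 90.0, "teen": 10.0, "adult": 0.0}  # as measured among donors
fused_means = {"child": 74.0, "teen": 24.0, "adult": 2.0}  # after fusion onto recipients

donor_index = index_by_group(donor_means, sizes)
fused_index = index_by_group(fused_means, sizes)

for group in sizes:
    print(group, round(donor_index[group]), round(fused_index[group]))
```

In this toy case every fused index sits closer to 100 than its donor counterpart, which is the flattening described above. How much movement is tolerable is a judgment call for the analyst.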
Controlling Regression to the Mean through forced matching might prompt us to force matches at every opportunity. In practice, asking our data fusion tool to force matches on many common variables is not desirable, because each newly forced variable reduces the number of donors available for matching. As we tighten up the design, we sample from an increasingly smaller pool, and very quickly we reach the point where there may not even be a suitable recipient or donor which matches all contingencies! More importantly, forcing precludes close matching on other common variables. In short, for the purposes of maintaining the structure, effective base size, and heterogeneity of the data, it is much better to accept some Regression to the Mean - matching similarly on many common variables. We also do not want to be matching perfectly on some common variables, because it is likely to result in very poor matching on the rest. This would ruin the fusion.
So much so that experienced analysts will avoid forcing variables like the plague, if at all possible. As a rule, particularly if you are new to data fusion, avoid forcing matches when you don't have to.
Data Fusion companies will only force variables where Regression to the Mean becomes intolerable. These are always the most important common variables. However, promoting a variable from high importance (where the fusion will attempt to match on it as a priority) to a forced match is a grave decision which comes with widespread consequences.
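How quickly forcing shrinks the donor pool can be sketched with a toy donor file. Everything here is an assumption made for illustration: the variables, their categories, the even spread of 10 donors per combination, and the hypothetical `eligible_donors` helper.

```python
from itertools import product

# Hypothetical common variables and their categories.
categories = {
    "gender": ["f", "m"],
    "region": ["north", "south", "east", "west", "central"],
    "age_band": ["child", "teen", "adult"],
    "household_size": [1, 2, 3, 4],
}
names = list(categories)

# Toy donor file: 10 donors in every combination of categories, evenly spread.
donors = [
    dict(zip(names, combo))
    for combo in product(*categories.values())
    for _ in range(10)
]


def eligible_donors(recipient, donors, forced_vars):
    """Donors that agree with the recipient on every forced variable."""
    return [d for d in donors if all(d[v] == recipient[v] for v in forced_vars)]


recipient = {"gender": "f", "region": "north", "age_band": "teen", "household_size": 3}

# Force one more variable at each step and record the remaining pool size.
pool_sizes = []
for k in range(len(names) + 1):
    forced_vars = names[:k]
    pool_sizes.append(len(eligible_donors(recipient, donors, forced_vars)))
    print(forced_vars, pool_sizes[-1])
```

Even with this perfectly even donor file, four forced variables cut the pool from 1,200 donors to 10. Real donor files are unevenly spread, so some combinations will have no donors at all.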
Traditional Data Fusion designs
For almost all of Data Fusion's history, the safe/traditional bet was to limit forcing to gender and region. This combination is still used in many media fusions, because matching one gender to the other, or matching one region to the other, would create strange datasets, where men are more likely to read women's fashion magazines, and folk in Scotland watch ITV London. The result is not credible if these controls are not in place, even if the overall fusion is weakened by them. Similarly, when integrating customer data, other variables are likely to be far more important than basic demographics and might require forcing.
As hinted above, it need not be an all or nothing approach. Leverage supervised approaches, so that more important common variables which might exacerbate Regression to the Mean are prioritised, rather than forced. The key is to experiment with different designs and combinations - each offering an opportunity to study the indices. Studying Regression to the Mean literally tells us a lot about how well our fusion is working. Plainly said, it is a byproduct - and the very concept behind how data fusion works - so it's also a thing we stomach so that we can offer additional insight.
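The difference between prioritising and forcing a variable can be sketched with a simple weighted nearest-donor match. This is a generic illustration, not the method used by any particular fusion tool: the weights are set by hand here (in a supervised approach they would be derived from how well each common variable explains the measures being fused), and `weighted_distance` and `best_donor` are hypothetical helpers.

```python
def weighted_distance(recipient, donor, weights):
    """Sum of the weights of the common variables on which the pair disagree."""
    return sum(w for var, w in weights.items() if recipient[var] != donor[var])


def best_donor(recipient, donors, weights, forced=()):
    """Closest donor by weighted distance among those agreeing on forced variables."""
    pool = [d for d in donors if all(d[v] == recipient[v] for v in forced)]
    if not pool:
        return None  # forcing has left no eligible donor
    return min(pool, key=lambda d: weighted_distance(recipient, d, weights))


donors = [
    {"id": "d1", "gender": "f", "age_band": "child", "household_size": 4},
    {"id": "d2", "gender": "m", "age_band": "teen", "household_size": 2},
    {"id": "d3", "gender": "f", "age_band": "adult", "household_size": 4},
]
recipient = {"gender": "f", "age_band": "teen", "household_size": 4}

equal = {"gender": 1.0, "age_band": 1.0, "household_size": 1.0}
age_priority = {"gender": 1.0, "age_band": 3.0, "household_size": 1.0}

print(best_donor(recipient, donors, equal)["id"])         # age treated like any other variable
print(best_donor(recipient, donors, age_priority)["id"])  # age prioritised, not forced
print(best_donor(recipient, donors, age_priority, forced=("gender", "age_band")))
```

With equal weights the teenage recipient is matched to a child; giving age a higher weight pulls the match back to a teenager at the cost of gender and household size; forcing both gender and age leaves no donor at all. Trying several such designs and comparing the resulting indices is the kind of experimentation suggested above.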