In fact, we have already seen the back-door adjustment when we discussed Simpson’s Paradox. There, we calculated the weighted average of the heart attack rate with the drug in men and in women to get the overall effect of the drug in the general population, adjusted for the confounder, gender. We can write this adjustment as a formula:
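
With Z discrete and fully observed (an assumption of this sketch), the standard form of the back-door adjustment is:

```latex
P(Y \mid do(X = x)) = \sum_{z} P(Y \mid X = x, Z = z)\, P(Z = z)
```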

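Here, Z is the confounder (gender), X is the cause, and Y is the effect. The notation do(x) denotes an intervention that sets X to the value x. In the causal diagram, this amounts to removing all arrows pointing into X. In other words, the intervention cuts X off from its confounders.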
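
As a small illustration of the weighted average described above, here is a minimal sketch in Python. The stratum shares and heart attack rates are hypothetical numbers chosen for the example, and `back_door_adjust` is an illustrative helper, not part of any library.

```python
def back_door_adjust(p_y_given_x_z, p_z):
    """Return sum_z P(Y=1 | X=x, Z=z) * P(Z=z) for one fixed value of X."""
    return sum(p_y_given_x_z[z] * p_z[z] for z in p_z)


# Hypothetical share of each stratum in the general population, P(Z=z).
p_z = {"women": 0.5, "men": 0.5}

# Hypothetical heart attack rates within each stratum, P(Y=1 | X=x, Z=z).
rate_with_drug = {"women": 0.075, "men": 0.40}
rate_without_drug = {"women": 0.05, "men": 0.30}

# Weight each stratum's rate by how common the stratum is in the population.
adjusted_with_drug = back_door_adjust(rate_with_drug, p_z)
adjusted_without_drug = back_door_adjust(rate_without_drug, p_z)

print(f"P(heart attack | do(drug))    = {adjusted_with_drug:.4f}")
print(f"P(heart attack | do(no drug)) = {adjusted_without_drug:.4f}")
```

The key point is that the weights come from the population distribution of Z, not from the distribution of Z among those who happened to take the drug.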

However, what if we cannot observe or measure the confounders? Then we cannot use the back-door adjustment formula, since we do not know Z. In this case, we may want to look for a mediator between the cause and the effect that is not directly affected by the confounders. If we can find such a mediator, we can use the front-door adjustment formula. Let me give you an example:

There is one back-door path, smoking <– smoking gene –> cancer. But we cannot observe or measure smoking genes, since we do not even know whether such genes exist. So we cannot use the back-door adjustment formula. Can we calculate the effect of smoking on cancer by looking at the path smoking –> tar –> cancer instead? Maybe we can calculate the effect of smoking on tar and the effect of tar on cancer separately.

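To calculate the effect of smoking on tar, there is one back-door path, which is smoking <– smoking gene –> cancer <– tar. However, it is already blocked by the collider, cancer, as long as we do not condition on cancer. Estimating the effect of smoking on tar is therefore straightforward: no adjustment is needed, and P(tar | do(smoking)) = P(tar | smoking).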

To calculate the effect of tar on cancer, there is also one back-door path, which is tar <– smoking <– smoking gene –> cancer. To block this back-door path, we need to adjust for smoking since we cannot observe smoking genes. Using our back-door adjustment formula, we get:
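
Written out with X for smoking, M for tar, and Y for cancer (the symbols are assumed here to match the ones used in the rest of the text), the back-door adjustment for the effect of tar on cancer takes the form:

```latex
P(Y \mid do(M = m)) = \sum_{x} P(Y \mid M = m, X = x)\, P(X = x)
```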

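To get the overall effect of smoking on cancer, we chain the two effects and sum over the values of the mediator M: the probability that doing X results in M times the probability that doing M results in Y, plus the probability that doing X results in not M times the probability that doing not M results in Y.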
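
In symbols, with the sum running over the values m of the mediator (for a binary mediator, M and not M), this chaining step can be sketched as:

```latex
P(Y \mid do(X = x)) = \sum_{m} P(M = m \mid do(X = x))\, P(Y \mid do(M = m))
```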

Now, we have our front-door adjustment formula.
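
In its standard form, obtained by using P(M | do(X)) = P(M | X) for the effect of smoking on tar and the back-door adjustment over smoking for the effect of tar on cancer, the formula reads as follows. The primed x′ is just a summation index over the values of X, introduced here to keep it apart from the intervened value x.

```latex
P(Y \mid do(X = x)) = \sum_{m} P(M = m \mid X = x) \sum_{x'} P(Y \mid M = m, X = x')\, P(X = x')
```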

The beauty of these formulae is that they express interventional probabilities, i.e. those involving do(X), purely in terms of observational probabilities such as P(M|X), P(Y|M, X), and P(X). This allows us to estimate causal effects from observational data.
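
To make this concrete, here is a minimal sketch in Python that checks the front-door adjustment on a toy model of the smoking example. All probabilities are made-up numbers, and the function names are illustrative. The hidden gene U is used only to generate the observed joint distribution and the ground truth; `front_door` itself sees only smoking (X), tar (M), and cancer (Y).

```python
from itertools import product

# Hypothetical toy model, all variables binary (numbers are made up):
# U = hidden smoking gene, X = smoking, M = tar, Y = cancer.
P_U1 = 0.3                                  # P(U=1)
P_X1_GIVEN_U = {0: 0.2, 1: 0.8}             # P(X=1 | U=u)
P_M1_GIVEN_X = {0: 0.05, 1: 0.9}            # P(M=1 | X=x)
P_Y1_GIVEN_MU = {(0, 0): 0.05, (0, 1): 0.3,
                 (1, 0): 0.25, (1, 1): 0.6}  # P(Y=1 | M=m, U=u)


def bern(p1, value):
    """Probability that a binary variable with P(1)=p1 takes `value`."""
    return p1 if value == 1 else 1.0 - p1


def observed_joint():
    """P(X, M, Y) with the hidden U summed out: what the data can show us."""
    joint = {}
    for x, m, y in product((0, 1), repeat=3):
        joint[(x, m, y)] = sum(
            bern(P_U1, u) * bern(P_X1_GIVEN_U[u], x)
            * bern(P_M1_GIVEN_X[x], m) * bern(P_Y1_GIVEN_MU[(m, u)], y)
            for u in (0, 1)
        )
    return joint


def front_door(joint, x):
    """P(Y=1 | do(X=x)) from the observed joint only (front-door formula)."""
    p_x = {v: sum(joint[(v, m, y)] for m in (0, 1) for y in (0, 1))
           for v in (0, 1)}
    total = 0.0
    for m in (0, 1):
        p_m_given_x = (joint[(x, m, 0)] + joint[(x, m, 1)]) / p_x[x]
        # Back-door adjustment over smoking for the effect of tar on cancer.
        inner = sum(
            joint[(x2, m, 1)] / (joint[(x2, m, 0)] + joint[(x2, m, 1)]) * p_x[x2]
            for x2 in (0, 1)
        )
        total += p_m_given_x * inner
    return total


def naive_conditional(joint, x):
    """P(Y=1 | X=x): plain conditioning, still confounded by U."""
    p_x = sum(joint[(x, m, y)] for m in (0, 1) for y in (0, 1))
    return sum(joint[(x, m, 1)] for m in (0, 1)) / p_x


def true_effect(x):
    """Ground-truth P(Y=1 | do(X=x)), computed with access to the hidden U."""
    return sum(
        bern(P_U1, u) * bern(P_M1_GIVEN_X[x], m) * P_Y1_GIVEN_MU[(m, u)]
        for u in (0, 1) for m in (0, 1)
    )


joint = observed_joint()
for x in (0, 1):
    print(f"x={x}: front-door={front_door(joint, x):.4f}, "
          f"truth={true_effect(x):.4f}, naive={naive_conditional(joint, x):.4f}")
```

In this toy model the front-door estimate agrees with the ground truth even though U never enters `front_door`, while plain conditioning on smoking does not. The agreement depends on the assumed diagram, in particular on there being no arrow from the gene to tar.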

Of course, we started with our assumptions about the causal diagram, and from the diagram we can say whether it is possible to adjust for the confounders to get the causal effect. You may challenge the assumptions and say that there may be a link from smoking genes to tar. Then the front-door adjustment will not work, since there is another back-door path from smoking to tar, smoking <– smoking gene –> tar, and there is no way to adjust for it because the gene is unobserved. As with any scientific discovery, we start with some assumptions and try to falsify them with observations. The causal diagram makes those assumptions explicit.