Does attending college increase wages?
Most data show that
\[E(\text{Wage} | \text{College}) - E(\text{Wage} | \text{No college})>0\]
“Correlation is not causation”
Perhaps people who go to college are just more motivated:
\[\begin{split} E(\text{Wage} | \text{College}, \text{Motivated}) \\ - E(\text{Wage} | \text{No college}, \text{Motivated}) \le 0 \end{split} \]
Wages and education are observable
Motivation is not
How can we control for unobservable motivation?
We need to know the joint distribution of wages, education, and motivation
Suppose we have a random variable \(X\)
With data, we can estimate \(P(X=x)\)
Now suppose that \(X\) is related to a binary latent variable \(Z\)
The observed probability distribution depends on some unobserved components:
\[ \begin{split} P(X=x) = P(Z=0) P(X=x|Z=0) \\ + P(Z=1) P(X=x|Z=1) \end{split} \]
Can we use data to estimate the latent components of this?
No.
All we see is \(P(X)\)
We have one equation:
\[ \begin{split} P(X=x) = P(Z=0) P(X=x|Z=0) \\ + P(Z=1) P(X=x|Z=1) \end{split} \]
and three unknowns: \(P(Z=0)\), \(P(X=x|Z=0)\), and \(P(X=x|Z=1)\)
The following would all be observationally equivalent:
\(Z\) is never zero: \(P(Z=0)=0\) and \[P(X=x)=P(X=x|Z=1)\]
\(Z\) is always zero: \(P(Z=0)=1\) and \[P(X=x)=P(X=x|Z=0)\]
\(X\) and \(Z\) are unrelated: \(P(Z=0) \in (0,1)\) and \[P(X=x|Z=1) = P(X=x|Z=0)\]
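A minimal R sketch of this non-identifiability (all numbers and object names are purely illustrative): two quite different latent parameterizations imply exactly the same observed \(P(X=x)\)

```r
# X takes 3 values; Z is binary. Two parameterizations, same observed P(X).
# (a) X and Z are unrelated
pz0_a   <- 0.4
px_z0_a <- c(0.2, 0.3, 0.5)
px_z1_a <- c(0.2, 0.3, 0.5)
# (b) a genuine mixture: X and Z are related
pz0_b   <- 0.5
px_z0_b <- c(0.1, 0.3, 0.6)
px_z1_b <- c(0.3, 0.3, 0.4)

# Observed distribution: P(X=x) = P(Z=0) P(X=x|Z=0) + P(Z=1) P(X=x|Z=1)
px_a <- pz0_a * px_z0_a + (1 - pz0_a) * px_z1_a
px_b <- pz0_b * px_z0_b + (1 - pz0_b) * px_z1_b
px_a  # 0.2 0.3 0.5
px_b  # 0.2 0.3 0.5 -- identical, so data on X alone cannot tell them apart
```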
Now suppose we observe \(K\) rvs: \(X_1, X_2, \dots, X_K\)
And that our \(K\) rvs are independent conditional on \(Z\):
\[ \begin{split} P(X_1=x_1, \dots, X_K=x_K | Z=z) \\ = P(X_1=x_1 | Z=z) \cdots P(X_K=x_K | Z=z) \\ = \prod_{k=1}^K P(X_k=x_k | Z=z) \end{split} \]
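As a concrete illustration (a sketch with made-up numbers, not from any real data), the observed joint distribution is built from the latent components like this:

```r
# Joint distribution of K = 3 binary variables that are independent
# conditional on a binary Z (all numbers illustrative)
p_z  <- c(0.6, 0.4)                      # P(Z = 0), P(Z = 1)
px_z <- list(                            # px_z[[k]][x, z+1] = P(X_k = x | Z = z)
  rbind(c(0.8, 0.3), c(0.2, 0.7)),       # X_1
  rbind(c(0.5, 0.1), c(0.5, 0.9)),       # X_2
  rbind(c(0.9, 0.4), c(0.1, 0.6))        # X_3
)

# P(X_1=x1, X_2=x2, X_3=x3) = sum_z P(Z=z) * prod_k P(X_k=x_k | Z=z)
joint <- array(0, dim = c(2, 2, 2))
for (x1 in 1:2) for (x2 in 1:2) for (x3 in 1:2)
  joint[x1, x2, x3] <- sum(p_z * px_z[[1]][x1, ] * px_z[[2]][x2, ] * px_z[[3]][x3, ])

sum(joint)  # 1, as it should be
```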
Can we use data on \(K\) observed variables to estimate the latent components \[P(X_k=x_k | Z=z)\] for every value of \(k\) and \(z\)?
This question puzzled researchers for a long time
These are known as latent-class models
Researchers have been estimating them for decades (we’ll see how later)
E.g., a sociologist might assume that answers to a set of survey questions about opinions on gay marriage, abortion, gun control, etc. are all explained by an unobserved attitude or belief
Researchers knew that their models gave them seemingly sensible estimates
In simulations, estimates were usually close to the true latent distributions (which were determined by the researchers)
But nobody knew when, or how, the latent probability distributions could be recovered from the observed ones
Recent research has shown when, and how, the latent components of these models can be recovered from the distributions of the observed variables
In a seminal paper, Hall and Zhou (2003, Annals of Statistics) prove the following:
Theorem. If \(K \ge 3\) and \(P(X_j=x_j, X_k = x_k) \ne P(X_j = x_j) P(X_k=x_k)\) for all \(j \ne k\), then \(P(X_k=x_k|Z=z)\) and \(P(Z=z)\) are uniquely determined for all \(k \in \{1, \dots, K\}\) and \(z \in \{0, 1\}\) (up to permutations of the labels).
To see the idea behind their argument, recall that
\[ \begin{split}P(X_1=x_1,\dots,X_K=x_K) \\ = \sum_{z} P(Z=z) \prod_k P(X_k=x_k | Z=z) \end{split} \]
For fixed \(x_1, \dots, x_K\), there are \(2K+1\) unknowns: \(P(X_k=x_k|Z=0)\) and \(P(X_k=x_k|Z=1)\) for each of the \(K\) variables, plus \(P(Z=0)\)
If we sum this over \(X_1\), say, we get
\[ \begin{split} \sum_{x_1} P(X_1, \dots, X_K) = P(X_2, \dots, X_K) \\ = \sum_z P(Z=z) \prod_{k \ge 2} P(X_k=x_k | Z=z) \end{split} \]
By summing over different subsets of the variables, we can obtain the joint distribution of every non-empty subset of the observed variables: \(2^K-1\) distributions in total
When \(K \ge 3\), the number of equations is at least as large as the number of unknowns, \(2K+1\)
If \(K=3\), e.g.,
We can obtain 7 distributions: \(P(X_1, X_2, X_3)\), \(P(X_1, X_2)\), \(P(X_1, X_3)\), \(P(X_2, X_3)\), \(P(X_1)\), \(P(X_2)\), \(P(X_3)\)
But we have 7 unknowns: \(P(Z=0)\) and 2 latent distributions \(P(X_k|Z=z)\) for each of 3 variables
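Continuing the sketch above (it reuses the `joint` array built there), the seven observed distributions for \(K=3\) are just sums over the joint:

```r
# From the K = 3 joint we can read off all 2^3 - 1 = 7 observed distributions
# by summing out subsets of the variables
p123 <- joint                         # P(X1, X2, X3)
p12  <- apply(joint, c(1, 2), sum)    # P(X1, X2)
p13  <- apply(joint, c(1, 3), sum)    # P(X1, X3)
p23  <- apply(joint, c(2, 3), sum)    # P(X2, X3)
p1   <- apply(joint, 1, sum)          # P(X1)
p2   <- apply(joint, 2, sum)          # P(X2)
p3   <- apply(joint, 3, sum)          # P(X3)
# With binary X_k and fixed x_1, x_2, x_3, these give 7 equations for the
# 7 unknowns: P(Z=0) and P(X_k = x_k | Z = z) for k = 1, 2, 3 and z = 0, 1
```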
Allman, Matias, and Rhodes (2009, Annals of Statistics) use a powerful theorem by Kruskal (1977, Linear Algebra and its Applications) to extend this to latent variables that take more than two values:
\[ \begin{split} P(X_1=x_1,\dots,X_K=x_K) \\ = \sum_{z=1}^{r} P(Z=z) \prod_{k=1}^K P(X_k=x_k|Z=z) \end{split} \]
Work to establish identifiability in more general cases is ongoing
E.g., the requirement that the observed variables are independent conditional on \(Z\) might be too strong
Kasahara and Shimotsu (2009, Econometrica) extend these results to allow the observed variables to be related through a Markov structure, conditional on the latent variable
\[ \begin{split} P(X_1=x_1,\dots,X_K=x_K | Z=z) \\ = P(X_1=x_1|Z=z) \prod_{k=2}^K P(X_k=x_k | Z=z, X_{k-1}=x_{k-1}) \end{split} \]
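A sketch of this Markov-in-\(X\) structure with made-up numbers (not Kasahara and Shimotsu's notation or estimator):

```r
# Conditional on Z, the X's follow a first-order Markov chain
p_z    <- c(0.5, 0.5)                  # P(Z = 0), P(Z = 1)
p_init <- rbind(c(0.7, 0.3),           # P(X_1 = x | Z = z): rows x, cols z
                c(0.3, 0.7))
# Transition probabilities P(X_k = x' | Z = z, X_{k-1} = x), one matrix per z
trans <- list(matrix(c(0.9, 0.1, 0.2, 0.8), 2, byrow = TRUE),   # z = 0
              matrix(c(0.5, 0.5, 0.4, 0.6), 2, byrow = TRUE))   # z = 1

# P(X_1=x1, X_2=x2, X_3=x3 | Z=z) = P(X_1|Z) P(X_2|Z, X_1) P(X_3|Z, X_2)
joint_given_z <- function(x, z)
  p_init[x[1], z + 1] * trans[[z + 1]][x[1], x[2]] * trans[[z + 1]][x[2], x[3]]

# The observed joint mixes over Z as before
p_obs <- function(x) sum(p_z * c(joint_given_z(x, 0), joint_given_z(x, 1)))
p_obs(c(1, 1, 2))
```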
So far, we have focused on identifiability from knowledge of the population distributions of the observed variables
How can latent-variable models be estimated from sample data?
The most common method is via the Expectation-Maximization algorithm, due to Dempster, Laird, and Rubin (1977, Journal of the Royal Statistical Society, Series B)
Suppose that we have observations on \(X_1, \dots, X_K\) for \(N\) individuals
Let \(q_i\) be the probability that the unobserved variable \(Z_i=0\) for observation \(i\), conditional on that observation’s realizations of \(X_1,\dots,X_K\)
Using Bayes’ rule, we can show that \(q_i\) is a function of the latent distributions:
\[ \begin{aligned} q_i &= \frac{P(X_{1i}=x_{1i},\dots,X_{Ki}=x_{Ki},Z_i=0)} {P(X_{1i}=x_{1i},\dots,X_{Ki}=x_{Ki})} \\ &= \frac{P(Z_i=0) \prod_k P(X_{ki}=x_{ki} | Z_i=0)} {\sum_z P(Z_i=z) \prod_k P(X_{ki}=x_{ki} | Z_i=z)} \end{aligned} \]
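A minimal R sketch of this Bayes-rule calculation (names and data layout are illustrative: `x` is an \(N \times K\) matrix of categorical observations coded \(1, 2, \dots\), and `px_z[[k]]` stores \(P(X_k = x \mid Z = z)\) in row \(x\), column \(z+1\)):

```r
# q_i = P(Z_i = 0 | X_1i, ..., X_Ki) for each observation, given current
# guesses of P(Z = 0) and of the conditional distributions P(X_k | Z)
posterior_q <- function(x, pz0, px_z) {
  N <- nrow(x); K <- ncol(x)
  num <- rep(pz0, N)           # P(Z = 0) * prod_k P(X_ki | Z = 0)
  alt <- rep(1 - pz0, N)       # P(Z = 1) * prod_k P(X_ki | Z = 1)
  for (k in 1:K) {
    num <- num * px_z[[k]][x[, k], 1]
    alt <- alt * px_z[[k]][x[, k], 2]
  }
  num / (num + alt)            # Bayes' rule: divide by P(X_1i, ..., X_Ki)
}
```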
The EM algorithm iterates between two steps
In the E(xpectation) step, we use a guess of \(q_i\) to compute the expected complete-data log likelihood (the log likelihood we would have if the \(Z_i\) were observed, with each observation's two possible classes weighted by \(q_i\) and \(1-q_i\)): \[ \begin{split} L = \sum_{i=1}^N \bigg[ q_i \Big( \log P(Z_i=0) + \sum_k \log P(X_{ki}=x_{ki} | Z_i=0) \Big) \\ + (1-q_i) \Big( \log P(Z_i=1) + \sum_k \log P(X_{ki}=x_{ki} | Z_i=1) \Big) \bigg] \end{split} \]
In the M(aximization) step, we choose the values of \[P(X_{k}=x_k | Z=z)\] and \(P(Z=0)\) that maximize \(L\)
We use these values to form a new guess of \(q_i\)
We iterate between these steps until our estimates converge
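Putting the pieces together, here is a bare-bones EM sketch for the two-class model, reusing `posterior_q()` from above (illustrative only: no log-likelihood tracking or multiple random starts, which a real implementation would need):

```r
em_latent_class <- function(x, n_values, max_iter = 500, tol = 1e-8) {
  K <- ncol(x)
  # Random starting values for P(Z = 0) and for each P(X_k = x | Z = z)
  pz0  <- runif(1, 0.3, 0.7)
  px_z <- lapply(1:K, function(k) {
    m <- matrix(runif(2 * n_values), n_values, 2)
    sweep(m, 2, colSums(m), "/")          # each class's column sums to 1
  })
  for (it in 1:max_iter) {
    q <- posterior_q(x, pz0, px_z)        # E step: q_i = P(Z_i = 0 | data)
    pz0_new <- mean(q)                    # M step: class probability
    px_z_new <- lapply(1:K, function(k) {
      m <- sapply(1:n_values, function(v) # weighted frequency of each value of X_k
        c(sum(q * (x[, k] == v)), sum((1 - q) * (x[, k] == v))))
      t(m / rowSums(m))                   # normalize within each class
    })
    done <- abs(pz0_new - pz0) < tol
    pz0 <- pz0_new; px_z <- px_z_new
    if (done) break
  }
  list(pz0 = pz0, px_z = px_z)
}
```

Applied to data simulated from a known latent-class model, an implementation along these lines produces estimates like the ones shown below.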
Example: EM estimates from simulated data with \(K=4\) observed variables, each taking 4 values, and a latent \(Z \in \{1, 2\}\). True distribution of \(Z\):
> pz
[1] 0.7 0.3
Estimates (note that the two class labels are switched relative to the truth; as the theorem says, identification is only up to a permutation of the labels):
> pi
[1] 0.3159422 0.6840578
True conditional distributions \(P(X_k | Z=1)\) (columns are the variables \(k=1,\dots,4\); rows are the values of \(X_k\)):
> px1
px11 px21 px31 px41
[1,] 0.1 0.50 0.3 0.10
[2,] 0.2 0.25 0.2 0.80
[3,] 0.3 0.20 0.1 0.05
[4,] 0.4 0.05 0.4 0.05
Estimates:
> prob2
[,1] [,2] [,3] [,4]
[1,] 0.0724729 0.51293214 0.3236297 0.08210373
[2,] 0.1801815 0.23980845 0.1835646 0.81299439
[3,] 0.3366125 0.20966098 0.1037317 0.05529881
[4,] 0.4107331 0.03759843 0.3890739 0.04960307
True conditional distributions \(P(X_k | Z=2)\):
> px2
px12 px22 px32 px42
[1,] 0.2 0.3 0.5 0.4
[2,] 0.2 0.4 0.2 0.2
[3,] 0.2 0.2 0.2 0.2
[4,] 0.4 0.1 0.1 0.2
Estimates:
> prob1
[,1] [,2] [,3] [,4]
[1,] 0.2981149 0.2843557 0.51889884 0.4533979
[2,] 0.1905647 0.3836419 0.18404021 0.1604538
[3,] 0.1505027 0.1975006 0.22444383 0.1519697
[4,] 0.3608177 0.1345019 0.07261711 0.2341785