Does attending college increase wages?
Most data show that
\[E(\text{Wage} | \text{College}) - E(\text{Wage} | \text{No college})>0\]
“Correlation is not causation”
Perhaps people who go to college are just more motivated:
\[\begin{split} E(\text{Wage} | \text{College}, \text{Motivated}) \\ - E(\text{Wage} | \text{No college}, \text{Motivated}) \le 0 \end{split} \]
Wages and education are observable
Motivation is not
How can we control for unobservable motivation?
We need to know the joint distribution of wages, education, and motivation
Suppose we have a random variable \(X\)
With data, we can estimate \(P(X=x)\)
Now suppose that \(X\) is related to a binary latent variable \(Z\)
The observed probability distribution depends on some unobserved components:
\[ \begin{split} P(X=x) = P(Z=0) P(X=x|Z=0) \\ + P(Z=1) P(X=x|Z=1) \end{split} \]
Can we use data to estimate the latent components of this?
No.
All we see is \(P(X)\)
We have one equation:
\[ \begin{split} P(X=x) = P(Z=0) P(X=x|Z=0) \\ + P(Z=1) P(X=x|Z=1) \end{split} \]
and three unknowns: \(P(Z=0)\), \(P(X=x|Z=0)\), and \(P(X=x|Z=1)\)
The following would all be observationally equivalent:
\(Z\) is never zero: \(P(Z=0)=0\) and \[P(X=x)=P(X=x|Z=1)\]
\(Z\) is always zero: \(P(Z=0)=1\) and \[P(X=x)=P(X=x|Z=0)\]
\(X\) and \(Z\) are unrelated: \(P(Z=0) \in (0,1)\) and \[P(X=x|Z=1) = P(X=x|Z=0)\]
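A minimal R sketch of this non-identifiability (all numbers and object names are purely illustrative): two quite different latent parameterizations imply exactly the same observed \(P(X=x)\)

```r
# X takes 3 values; Z is binary. Two parameterizations, same observed P(X).
# (a) X and Z are unrelated
pz0_a   <- 0.4
px_z0_a <- c(0.2, 0.3, 0.5)
px_z1_a <- c(0.2, 0.3, 0.5)
# (b) a genuine mixture: X and Z are related
pz0_b   <- 0.5
px_z0_b <- c(0.1, 0.3, 0.6)
px_z1_b <- c(0.3, 0.3, 0.4)

# Observed distribution: P(X=x) = P(Z=0) P(X=x|Z=0) + P(Z=1) P(X=x|Z=1)
px_a <- pz0_a * px_z0_a + (1 - pz0_a) * px_z1_a
px_b <- pz0_b * px_z0_b + (1 - pz0_b) * px_z1_b
px_a  # 0.2 0.3 0.5
px_b  # 0.2 0.3 0.5 -- identical, so data on X alone cannot tell them apart
```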
Now suppose we observe \(K\) rvs: \(X_1, X_2, \dots, X_K\)
And that our \(K\) rvs are independent conditional on \(Z\):
\[ \begin{split} P(X_1=x_1, \dots, X_K=x_K | Z=z) \\ = P(X_1=x_1 | Z=z) \cdots P(X_K=x_K | Z=z) \\ = \prod_{k=1}^K P(X_k=x_k | Z=z) \end{split} \]
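As a concrete illustration (a sketch with made-up numbers, not from any real data), the observed joint distribution is built from the latent components like this:

```r
# Joint distribution of K = 3 binary variables that are independent
# conditional on a binary Z (all numbers illustrative)
p_z  <- c(0.6, 0.4)                      # P(Z = 0), P(Z = 1)
px_z <- list(                            # px_z[[k]][x, z+1] = P(X_k = x | Z = z)
  rbind(c(0.8, 0.3), c(0.2, 0.7)),       # X_1
  rbind(c(0.5, 0.1), c(0.5, 0.9)),       # X_2
  rbind(c(0.9, 0.4), c(0.1, 0.6))        # X_3
)

# P(X_1=x1, X_2=x2, X_3=x3) = sum_z P(Z=z) * prod_k P(X_k=x_k | Z=z)
joint <- array(0, dim = c(2, 2, 2))
for (x1 in 1:2) for (x2 in 1:2) for (x3 in 1:2)
  joint[x1, x2, x3] <- sum(p_z * px_z[[1]][x1, ] * px_z[[2]][x2, ] * px_z[[3]][x3, ])

sum(joint)  # 1, as it should be
```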
Can we use data on \(K\) observed variables to estimate the latent components \[P(X_k=x_k | Z=z)\] for every value of \(k\) and \(z\)?
This question puzzled researchers for a long time
These are known as latent-class models
Researchers have been estimating them for decades (we’ll see how later)
E.g., a sociologist might assume that answers to a set of survey questions about opinions on gay marriage, abortion, gun control, etc. are all explained by an unobserved attitude or belief
Researchers knew that their models gave them seemingly sensible estimates
In simulations, estimates were usually close to the true latent distributions (which were determined by the researchers)
But nobody knew when, or how, the latent probability distributions could be recovered from the observed ones
Recent research has shown when, and how, the latent components of these models can be recovered from the distributions of the observed variables
In a seminal paper, Hall and Zhou (2003, Annals of Statistics) prove the following:
Theorem. If \(K \ge 3\) and \(P(X_j=x_j, X_k = x_k) \ne P(X_j = x_j) P(X_k=x_k)\) for all \(j \ne k\), then \(P(X_k=x_k|Z=z)\) and \(P(Z=z)\) are uniquely determined for all \(k \in \{1, \dots, K\}\) and \(z \in \{0, 1\}\) (up to permutations of the labels).
To see the idea behind their argument, recall that
\[ \begin{split}P(X_1=x_1,\dots,X_K=x_K) \\ = \sum_{z} P(Z=z) \prod_k P(X_k=x_k | Z=z) \end{split} \]
For fixed \(x_1, \dots, x_K\), there are \(2K+1\) unknowns: \(P(X_k=x_k|Z=0)\) and \(P(X_k=x_k|Z=1)\) for each of the \(K\) variables, plus \(P(Z=0)\)
If we sum this over \(X_1\), say, we get
\[ \begin{split} \sum_{x_1} P(X_1, \dots, X_K) = P(X_2, \dots, X_K) \\ = \sum_z P(Z=z) \prod_{k \ge 2} P(X_k=x_k | Z=z) \end{split} \]
By summing over different subsets of the variables, we can obtain the joint distribution of every non-empty subset of the observed variables: \(2^K-1\) distributions in total
When \(K \ge 3\), the number of equations is at least as large as the number of unknowns, \(2K+1\)
If \(K=3\), e.g.,
We can obtain 7 distributions: \(P(X_1, X_2, X_3)\), \(P(X_1, X_2)\), \(P(X_1, X_3)\), \(P(X_2, X_3)\), \(P(X_1)\), \(P(X_2)\), \(P(X_3)\)
But we have 7 unknowns: \(P(Z=0)\) and 2 latent distributions \(P(X_k|Z=z)\) for each of 3 variables
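Continuing the sketch above (it reuses the `joint` array built there), the seven observed distributions for \(K=3\) are just sums over the joint:

```r
# From the K = 3 joint we can read off all 2^3 - 1 = 7 observed distributions
# by summing out subsets of the variables
p123 <- joint                         # P(X1, X2, X3)
p12  <- apply(joint, c(1, 2), sum)    # P(X1, X2)
p13  <- apply(joint, c(1, 3), sum)    # P(X1, X3)
p23  <- apply(joint, c(2, 3), sum)    # P(X2, X3)
p1   <- apply(joint, 1, sum)          # P(X1)
p2   <- apply(joint, 2, sum)          # P(X2)
p3   <- apply(joint, 3, sum)          # P(X3)
# With binary X_k and fixed x_1, x_2, x_3, these give 7 equations for the
# 7 unknowns: P(Z=0) and P(X_k = x_k | Z = z) for k = 1, 2, 3 and z = 0, 1
```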
Allman, Matias, and Rhodes (2009, Annals of Statistics) use a powerful theorem by Kruskal (1977, Linear Algebra and its Applications) to extend this to latent variables that take more than two values:
\[ \begin{split} P(X_1=x_1,\dots,X_K=x_K) \\ = \sum_{z=1}^{r} P(Z=z) \prod_{k=1}^K P(X_k=x_k|Z=z) \end{split} \]
Work to establish identifiability in more general cases is ongoing
E.g., the requirement that the observed variables are independent conditional on \(Z\) might be too strong
Kasahara and Shimotsu (2009, Econometrica) extend these results to allow the observed variables to be related through a Markov structure, conditional on the latent variable
\[ \begin{split} P(X_1=x_1,\dots,X_K=x_K | Z=z) \\ = P(X_1=x_1|Z=z) \prod_{k=2}^K P(X_k=x_k | Z=z, X_{k-1}=x_{k-1}) \end{split} \]
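A sketch of this Markov-in-\(X\) structure with made-up numbers (not Kasahara and Shimotsu's notation or estimator):

```r
# Conditional on Z, the X's follow a first-order Markov chain
p_z    <- c(0.5, 0.5)                  # P(Z = 0), P(Z = 1)
p_init <- rbind(c(0.7, 0.3),           # P(X_1 = x | Z = z): rows x, cols z
                c(0.3, 0.7))
# Transition probabilities P(X_k = x' | Z = z, X_{k-1} = x), one matrix per z
trans <- list(matrix(c(0.9, 0.1, 0.2, 0.8), 2, byrow = TRUE),   # z = 0
              matrix(c(0.5, 0.5, 0.4, 0.6), 2, byrow = TRUE))   # z = 1

# P(X_1=x1, X_2=x2, X_3=x3 | Z=z) = P(X_1|Z) P(X_2|Z, X_1) P(X_3|Z, X_2)
joint_given_z <- function(x, z)
  p_init[x[1], z + 1] * trans[[z + 1]][x[1], x[2]] * trans[[z + 1]][x[2], x[3]]

# The observed joint mixes over Z as before
p_obs <- function(x) sum(p_z * c(joint_given_z(x, 0), joint_given_z(x, 1)))
p_obs(c(1, 1, 2))
```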
So far, we have focused on identifiability from knowledge of the population distributions of the observed variables
How can latent-variable models be estimated from sample data?
The most common method is via the Expectation-Maximization algorithm, due to Dempster, Laird, and Rubin (1977, Journal of the Royal Statistical Society, Series B)
Suppose that we have observations on \(X_1, \dots, X_K\) for \(N\) individuals
Let \(q_i\) be the probability that the unobserved variable \(Z_i=0\) for observation \(i\), conditional on that observation’s realizations of \(X_1,\dots,X_K\)
Using Bayes’ rule, we can show that \(q_i\) is a function of the latent distributions:
\[ \begin{aligned} q_i &= \frac{P(X_{1i}=x_{1i},\dots,X_{Ki}=x_{Ki},Z_i=0)} {P(X_{1i}=x_{1i},\dots,X_{Ki}=x_{Ki})} \\ &= \frac{P(Z_i=0) \prod_k P(X_{ki}=x_{ki} | Z_i=0)} {\sum_z P(Z_i=z) \prod_k P(X_{ki}=x_{ki} | Z_i=z)} \end{aligned} \]
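A minimal R sketch of this Bayes-rule calculation (names and data layout are illustrative: `x` is an \(N \times K\) matrix of categorical observations coded \(1, 2, \dots\), and `px_z[[k]]` stores \(P(X_k = x \mid Z = z)\) in row \(x\), column \(z+1\)):

```r
# q_i = P(Z_i = 0 | X_1i, ..., X_Ki) for each observation, given current
# guesses of P(Z = 0) and of the conditional distributions P(X_k | Z)
posterior_q <- function(x, pz0, px_z) {
  N <- nrow(x); K <- ncol(x)
  num <- rep(pz0, N)           # P(Z = 0) * prod_k P(X_ki | Z = 0)
  alt <- rep(1 - pz0, N)       # P(Z = 1) * prod_k P(X_ki | Z = 1)
  for (k in 1:K) {
    num <- num * px_z[[k]][x[, k], 1]
    alt <- alt * px_z[[k]][x[, k], 2]
  }
  num / (num + alt)            # Bayes' rule: divide by P(X_1i, ..., X_Ki)
}
```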
The EM algorithm iterates between two steps
In the E(xpectation) step, we use a guess of \(q_i\) to compute the expected complete-data log likelihood (the log likelihood we would have if the \(Z_i\) were observed, with each observation's two possible classes weighted by \(q_i\) and \(1-q_i\)): \[ \begin{split} L = \sum_{i=1}^N \bigg[ q_i \Big( \log P(Z_i=0) + \sum_k \log P(X_{ki}=x_{ki} | Z_i=0) \Big) \\ + (1-q_i) \Big( \log P(Z_i=1) + \sum_k \log P(X_{ki}=x_{ki} | Z_i=1) \Big) \bigg] \end{split} \]
In the M(aximization) step, we choose the values of \[P(X_{k}=x_k | Z=z)\] and \(P(Z=0)\) that maximize \(L\)
We use these values to form a new guess of \(q_i\)
We iterate between these steps until our estimates converge
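Putting the pieces together, here is a bare-bones EM sketch for the two-class model, reusing `posterior_q()` from above (illustrative only: no log-likelihood tracking or multiple random starts, which a real implementation would need):

```r
em_latent_class <- function(x, n_values, max_iter = 500, tol = 1e-8) {
  K <- ncol(x)
  # Random starting values for P(Z = 0) and for each P(X_k = x | Z = z)
  pz0  <- runif(1, 0.3, 0.7)
  px_z <- lapply(1:K, function(k) {
    m <- matrix(runif(2 * n_values), n_values, 2)
    sweep(m, 2, colSums(m), "/")          # each class's column sums to 1
  })
  for (it in 1:max_iter) {
    q <- posterior_q(x, pz0, px_z)        # E step: q_i = P(Z_i = 0 | data)
    pz0_new <- mean(q)                    # M step: class probability
    px_z_new <- lapply(1:K, function(k) {
      m <- sapply(1:n_values, function(v) # weighted frequency of each value of X_k
        c(sum(q * (x[, k] == v)), sum((1 - q) * (x[, k] == v))))
      t(m / rowSums(m))                   # normalize within each class
    })
    done <- abs(pz0_new - pz0) < tol
    pz0 <- pz0_new; px_z <- px_z_new
    if (done) break
  }
  list(pz0 = pz0, px_z = px_z)
}
```

Applied to data simulated from a known latent-class model, an implementation along these lines produces estimates like the ones shown below.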
Example: EM estimates from simulated data with \(K=4\) observed variables, each taking 4 values, and a latent \(Z \in \{1, 2\}\). True distribution of \(Z\):
> pz
[1] 0.7 0.3
Estimates (note that the two class labels are switched relative to the truth; as the theorem says, identification is only up to a permutation of the labels):
> pi
[1] 0.3159422 0.6840578
True conditional distributions \(P(X_k | Z=1)\) (columns are the variables \(k=1,\dots,4\); rows are the values of \(X_k\)):
> px1
px11 px21 px31 px41
[1,] 0.1 0.50 0.3 0.10
[2,] 0.2 0.25 0.2 0.80
[3,] 0.3 0.20 0.1 0.05
[4,] 0.4 0.05 0.4 0.05
Estimates:
> prob2
[,1] [,2] [,3] [,4]
[1,] 0.0724729 0.51293214 0.3236297 0.08210373
[2,] 0.1801815 0.23980845 0.1835646 0.81299439
[3,] 0.3366125 0.20966098 0.1037317 0.05529881
[4,] 0.4107331 0.03759843 0.3890739 0.04960307
True conditional distributions \(P(X_k | Z=2)\):
> px2
px12 px22 px32 px42
[1,] 0.2 0.3 0.5 0.4
[2,] 0.2 0.4 0.2 0.2
[3,] 0.2 0.2 0.2 0.2
[4,] 0.4 0.1 0.1 0.2
Estimates:
> prob1
[,1] [,2] [,3] [,4]
[1,] 0.2981149 0.2843557 0.51889884 0.4533979
[2,] 0.1905647 0.3836419 0.18404021 0.1604538
[3,] 0.1505027 0.1975006 0.22444383 0.1519697
[4,] 0.3608177 0.1345019 0.07261711 0.2341785