Discrete Markov Chains


Lecture 1

4.1 Introduction

  • state: finite/countable values, {0,1,2, …}
  • time: discrete, {0,1,2, …}
  • transition prob (time homogeneous), \(P=\{P_{ij}\}, P_{ij} = \Pr(X_n=j|X_{n-1}=i)\).
  • Random walk on \(\{0,\pm1,\pm2,\ldots\}\), stepping \(-1\) with prob \(1-p\) and \(+1\) with prob \(p\) (Ex 4.5)
    • starting at \(X_0=0\), \(X_n=\sum_{i=1}^n Z_i\), where \(Z_i=2B_i-1\) with \(B_i\sim\mathrm{Bernoulli}(p)\) iid, so \((X_n+n)/2\sim\mathrm{Binomial}(n,p)\)
    • \(P_{ii}^{2n+1}=0\), \(P_{00}^{2n}=\binom{2n}{n}p^n(1-p)^n=\frac{(2n)!}{n!\,n!}(p(1-p))^n\)
    • If \(P_{00}=P_{NN}=1\): random walk with absorbing state 0/N.
  • Given a MC \(X_n\) with transition prob \(P_{ij}\), let \(A\) be a set of states.
    • define \(N=\min\{n: X_n\in A\}\)
    • define \(W_n=X_n\) if n<N and \(W_n=A\) if \(n\geq N\)
    • \(\{W_n\}\) is also a MC with state \(\{i\notin A, A\}\) with \(A\) being an absorbing state.
    • transition prob \(Q_{ij}\) of \(W_n\)
      • \(Q_{ij}=P_{ij}\), if \(i,j\notin A\)
      • \(Q_{iA}=\sum_{j\in A} P_{ij}\), if \(i\notin A\)
      • \(Q_{AA}=1\)
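
A minimal R sketch of this construction (a hypothetical helper, assuming states are labeled 1..m and A is given as a vector of state indices):

absorbQ = function(P, A){
  keep = setdiff(1:nrow(P), A)
  m = length(keep)
  Q = matrix(0, m+1, m+1)
  Q[1:m, 1:m] = P[keep, keep, drop=FALSE]        ## Q_ij = P_ij for i,j not in A
  Q[1:m, m+1] = rowSums(P[keep, A, drop=FALSE])  ## Q_iA = sum_{j in A} P_ij
  Q[m+1, m+1] = 1                                ## A is absorbing: Q_AA = 1
  Q
}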

4.2 Chapman-Kolmogorov Equations

  • n-step transition prob, \(P_{ij}^n = \Pr(X_{n+k}=j|X_{k}=i)\)
  • \(P_{ij}^{n+m} = \sum_{k=0}^{\infty} P_{ik}^nP_{kj}^m, \forall n,m\geq 0\).
    • \(P_{ij}^{n+m}=\sum_{k=0}^{\infty}\Pr(X_{n+m}=j,X_n=k|X_0=i)=\sum_k\Pr(X_{n+m}=j|X_n=k)\Pr(X_n=k|X_0=i)=\sum_kP_{kj}^mP_{ik}^n\)
  • hence \(\textbf{P}^{(n+m)}=\textbf{P}^{(n)}.\textbf{P}^{(m)}\) (matrix multiplication)
    • here \(\textbf{P}^{(n)}\) is the matrix of \(n\)-step transition probabilities, \(P_{ij}^n\).
  • Marginal dist of \(X_n\): mix the initial probs with the \(n\)-step transition probs.
    • \(\alpha_i=\Pr(X_0=i), i\geq 0, \sum_i \alpha_i = 1\).
    • \(\Pr(X_n=j) = \sum_i\alpha_iP_{ij}^n\).
  • Given a MC with transition prob \(P_{ij}\). Let \(A\) be a set of states.
    1. Study \(\Pr(X_k\in A \textrm{ for some } k=1,\ldots,m|X_0=i)\), where \(i\notin A\)
      • probability that the MC ever enters any of the states in A by time m
      • equal to \(\Pr(W_m=A|W_0=i)=Q_{i,A}^m\)
    2. \(\Pr(X_m=j,X_k\notin A, k=1,\ldots,m-1|X_0=i)\), where \(i,j\notin A\)
      • probability that the MC, starting in state i, enters state j at time m without ever entering any of the states in A
      • equal to \(\Pr(W_m=j|W_0=i)=Q_{ij}^m\)
    3. \(\Pr(X_m=j,X_k\notin A, k=1,\ldots,m-1|X_0=i)\), where \(i\notin A, j\in A\)
      • condition on time \(m-1\), equal to
        • \(\sum_{r\notin A}\Pr(X_m=j,X_{m-1}=r,X_k\notin A, k=1,\ldots,m-2|X_0=i)\)
        • i.e., \(\sum_{r\notin A}P_{rj}Q_{ir}^{m-1}\)
    4. \(\Pr(X_m=j,X_k\notin A, k=1,\ldots,m-1|X_0=i)\), where \(i\in A\)
      • condition on time \(1\), equal to
        • \(\sum_{r\notin A}\Pr(X_m=j,X_{1}=r,X_k\notin A, k=1,\ldots,m-1|X_0=i)\)
        • i.e., \(\sum_{r\notin A}P_{ir}\Pr(X_m=j,X_k\notin A, k=2,\ldots,m-1|X_1=r)\)
      • if \(j\notin A\), equal to \(\sum_{r\notin A}P_{ir}Q_{rj}^{m-1}\)
      • if \(j\in A\), equal to \(\sum_{r\notin A}P_{ir}(\sum_{s\notin A}P_{sj}Q_{rs}^{m-2})\)
    5. \(\Pr(X_m=j|X_0=i,X_k\notin A,k=1,\ldots,m)\), where \(i,j\notin A\)
      • conditional probability of \(X_m\) given that the chain starts in state \(i\) and has not entered any state in A by time m
      • equal to \(Q_{ij}^m/(\sum_{r\notin A}Q_{ir}^m)\)
  • Ex: random walk whose steps are \(+1\) (turn right) with prob \(p\) and \(-1\) (turn left) with prob \(1-p\).
    • Let \(N\) denote the number of steps until there is a run of three consecutive right turns
    • Study the dist of \(N\): \(\Pr(N\leq 8)\) and \(\Pr(N=8)\)
    • Define a new MC \(Y_n\) (current length of the run of consecutive right turns) with states 0,1,2,3 where
      • state \(i<3\) means that we are currently on a run of \(i\) consecutive right turns,
      • state 3 means that a run of three consecutive right turns has already occurred.
      • its transition prob matrix \(P\)
        • \(\Pr(Y_1|Y_0=0)\): \(1-p, p, 0, 0\)
        • \(\Pr(Y_1|Y_0=1)\): \(1-p, 0, p, 0\)
          • we currently are on a run of size 1, then the next state will be 0 if turning left next, or 2 if turning right
        • \(\Pr(Y_1|Y_0=2)\): \(1-p, 0, 0, p\)
        • \(\Pr(Y_1|Y_0=3)\): \(0, 0, 0, 1\)
    • \(\Pr(N\leq 8)=\Pr(Y_8=3)=P_{03}^8\)
    • \(\Pr(N=8)=\Pr(N\leq 8)-\Pr(N\leq 7)=P_{03}^8-P_{03}^7\)
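
As a check, a minimal R sketch (with p = 0.5 chosen arbitrarily) that builds the 4x4 matrix above and computes \(\Pr(N\leq 8)\) and \(\Pr(N=8)\) via matrix powers (Chapman-Kolmogorov):

p = 0.5
P = matrix(c(1-p, p, 0, 0,
             1-p, 0, p, 0,
             1-p, 0, 0, p,
               0, 0, 0, 1), 4, 4, byrow=TRUE)  ## states 0,1,2,3 -> rows 1..4
matpow = function(P, n){ Q = diag(nrow(P)); for(k in 1:n) Q = Q %*% P; Q }
matpow(P, 8)[1, 4]                       ## Pr(N <= 8) = P_{03}^8
matpow(P, 8)[1, 4] - matpow(P, 7)[1, 4]  ## Pr(N = 8)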

Lecture 2

4.3 Classification of States

  • j accessible from i, \(P_{ij}^n>0\) for some n.
  • i communicates with j, i<->j : each of i and j is accessible from the other.
    • State i communicates with state i, for all \(i\geq 0\).
    • If i communicates with j, then j communicates with i.
    • If i communicates with j, and j communicates with k, then i communicates with k (Chapman–Kolmogorov equations). \(P_{ik}^{(n+m)} = \sum_{r=0}^{\infty} P_{ir}^nP_{rk}^m \geq P_{ij}^nP_{jk}^m > 0\).
  • i and j are in the same class if i<->j.
    • any two classes of states are either identical or disjoint
    • concept of communication divides the state space up into a number of separate classes
  • Markov chain is irreducible if there is only one class, i.e., all states communicate with each other.
  • For any state i, define \(f_i\) as the probability that, starting in state i, the process will ever reenter state i.
    • i is recurrent if \(f_i = 1\).
      • with probability 1, the process will eventually reenter state i. However, by the definition of a Markov chain, it follows that the process will be starting over again when it reenters state i and, therefore, state i will eventually be visited again.
      • i.e., starting in state i, the process will reenter state i again and again and again, in fact, infinitely often.
    • i is transient if \(f_i < 1\).
      • each time the process enters state i there will be a positive probability, namely, \(1-f_i\), that it will never again enter that state
      • starting in state i, the probability that the process will be in state i for exactly n time periods equals \(f_i^{n-1}(1-f_i)\), n>=1.
      • if state i is transient then, starting in state i, the number of time periods that the process will be in state i has a geometric distribution with finite mean \(1/(1-f_i)\).
    • i is recurrent iff, starting in state i, the expected number of time periods that the process is in state i is infinite.
    • Number of periods that the process is in state i
      • equals \(\sum_{n=0}^{\infty} I(X_n=i)\), where \(I\) is the indicator function
      • \(E[\sum_{n=0}^{\infty} I(X_n=i)|X_0=i] = \sum_{n=0}^{\infty}\Pr(X_n=i|X_0=i) = \sum_{n=0}^{\infty}P_{ii}^n\).
    • Proposition 4.1 State i is recurrent iff \(\sum_{n=0}^{\infty}P_{ii}^n=\infty\), and transient iff \(\sum_{n=0}^{\infty}P_{ii}^n<\infty\)
      • a transient state will only be visited a finite number of times
      • in a finite-state Markov chain, not all states can be transient
        • suppose the states are 0, 1, … , M and suppose that they are all transient. Then after a finite amount of time (say, after time \(T_0\)) state 0 will never be visited, and after a time (say, \(T_1\)) state 1 will never be visited, and after a time (say, \(T_2\)) state 2 will never be visited, and so on. Thus, after a finite time \(T = \max(T_0,T_1,...,T_M)\) no states will be visited.
        • But as the process must be in some state after time T we arrive at a contradiction, which shows that at least one of the states must be recurrent.
    • Corollary 4.2 If state i is recurrent, and i <-> j, then state j is recurrent.
      • Note that, since i<->j, there exist integers k,m st \(P_{ij}^k>0, P_{ji}^m>0\). Now, for any integer n, \(P_{jj}^{m+n+k}\geq P_{ji}^mP_{ii}^nP_{ij}^k\). Hence by summing over n, \(\sum_{n=1}^{\infty} P_{jj}^{m+n+k}\geq P_{ji}^mP_{ij}^k\sum_{n=1}^{\infty}P_{ii}^n = \infty\). So j is recurrent by Proposition 4.1.
      • transience is a class property, if i is transient and i<->j, j must be transient.
      • all states of a finite irreducible Markov chain are recurrent
  • Ex 4.18 random walk
    • on each transition the process either moves one step to the right (with probability p) or one step to the left (with probability 1-p)
    • Since all states clearly communicate, it follows from Corollary 4.2 that they are either all transient or all recurrent.
    • Check \(\sum_{n=1}^{\infty} P_{00}^n\) is finite or infinite.
      • Impossible to return to 0 after an odd number of steps \(\to P_{00}^{2n-1}=0\), n=1,2,…
      • We would go back to 0 after 2n steps iff we went left n of these and right n of these. Because each step goes to left with prob 1-p and right with prob p, the desired probability is thus the binomial probability,
        • \(P_{00}^{2n}=\binom{2n}{n}p^n(1-p)^n=\frac{(2n)!}{n!\,n!}(p(1-p))^n\), n=1,2,3,…
        • Stirling approx: \(n!\sim n^{n+1/2}e^{-n}\sqrt{2\pi}\), hence \(P_{00}^{2n}\sim (4p(1-p))^n/\sqrt{n\pi}\).
        • CLT normal approx: \(Bin(2n,0.5) \approx N(n,n/2)\), hence \(\binom{2n}{n} = \frac{(2n)!}{n!\,n!} \sim 4^n/\sqrt{n\pi}\).
        • for positive sequences, \(a_n \sim b_n\) means \(\lim_{n\to\infty}a_n/b_n = 1\); then \(\sum_n a_n<\infty \iff \sum_n b_n<\infty\)
        • \(4p(1-p)\leq 1\) and equality holds iff p=0.5. Hence \(\sum_nP_{00}^n=\infty\) iff p=0.5.
    • the chain is recurrent when \(p=0.5\) and transient if \(p\neq 0.5\) (see the numeric check after this list).
  • Ex 4.18-1 Two-dim symmetric random walk is also recurrent
    • at each transition, either take one step to the left, right, up, or down, each having probability 1/4.
      • That is, the state is the pair of integers (i,j) and the transition probs are \(P_{(i,j),(i+1,j)} = P_{(i,j),(i-1,j)} = P_{(i,j),(i,j+1)} = P_{(i,j),(i,j-1)} = 1/4\).
      • The chain obviously is irreducible, and we just need to check state z=(0,0).
        • Consider \(P_{zz}^{2n}\): after 2n steps, the chain will be back in its original location iff, for some i in (0,1,…,n), the 2n steps consist of i steps to the left, i to the right, n-i up, and n-i down. Since each step is one of these four types with probability 1/4, the desired probability is a multinomial probability,
          • \(P_{zz}^{2n} = \sum_{i=0}^n \binom{2n}{i,\,i,\,n-i,\,n-i} (1/4)^{2n} = (1/4)^{2n}\binom{2n}{n}\sum_{i=0}^n\binom{n}{i}^2 = (1/4)^{2n}\binom{2n}{n}^2\)
          • note \(\binom{2n}{n}\sim 4^n/\sqrt{n\pi}\)
          • hence \(P_{zz}^{2n}\sim 1/(n\pi)\), so \(\sum_n P_{zz}^{2n} = \infty\) and z is recurrent.
      • all states are recurrent.
  • The symmetric random walks in one and two dimensions are both recurrent; all higher-dimensional symmetric random walks turn out to be transient. (For instance, the three-dimensional symmetric random walk is at each transition equally likely to move in any of six ways: left, right, up, down, in, or out.)
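
A quick numerical look at the one-dimensional dichotomy (a sketch; the number of terms and the value p = 0.6 are arbitrary). The terms are computed on the log scale to avoid overflow in the binomial coefficient:

p00 = function(p, n) exp(lchoose(2*n, n) + n*log(p*(1-p)))  ## exact P_00^{2n}
stir = function(p, n) (4*p*(1-p))^n / sqrt(n*pi)            ## Stirling approximation
n = 1:2000
sum(p00(0.5, n))                         ## still growing like sqrt(n): divergent, recurrent
sum(p00(0.6, n))                         ## already converged: transient
abs(p00(0.5, 2000)/stir(0.5, 2000) - 1)  ## close to 0: the ratio tends to 1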

Finite mixture model and HMM

  • Consider a two-state chain \(X_n\in\{0,1\}\): given \(X_n=0\), \(Y_n\sim f_0(y)\); given \(X_n=1\), \(Y_n\sim f_1(y)\).
  • Assume a transition matrix for \(X_n\): \(\Pr(X_n=1|X_{n-1}=0)=p_1, \Pr(X_n=1|X_{n-1}=1)=p_2\).
  • given the states \(X_n\), the \(Y_n\) are conditionally independent of each other
  • When \(p_1=p_2\), i.e., \(X_n\) is independent of \(X_{n-1}\), we obtain the classic two-class finite mixture model: \(Y_n\) iid \(\sim p_1f_1(y) + (1-p_1)f_0(y)\)
  • when \(p_1\neq p_2\), \(\{X_n\}\) is a dependent MC and we obtain an HMM for the (dependent) \(Y_n\). Its likelihood can be computed efficiently using the forward/backward algorithm (Lecture 5); the simulation sketch below contrasts the two cases.
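
A minimal simulation sketch (all parameter values arbitrary, with \(f_0 = N(0,1)\) and \(f_1 = N(2,1)\)): with \(p_1=p_2\) the \(Y_n\) are iid mixture draws, while \(p_1\neq p_2\) induces serial dependence, visible in the lag-1 autocorrelation:

set.seed(1)
n = 1e4
simY = function(p1, p2){
  x = numeric(n)
  x[1] = rbinom(1, 1, 0.5)
  for(k in 2:n) x[k] = rbinom(1, 1, ifelse(x[k-1]==0, p1, p2))
  rnorm(n, mean = 2*x)  ## f_0 = N(0,1), f_1 = N(2,1)
}
acf(simY(0.3, 0.3), plot=FALSE)$acf[2]  ## ~0: iid draws from the mixture
acf(simY(0.1, 0.9), plot=FALSE)$acf[2]  ## clearly positive: HMM dependence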

Lecture 3


4.4 Limiting Probabilities

  • Positive recurrent
    • Prob of MC ever entering state \(j\) starting in state \(i\): \(f_{ij}=\Pr(X_n=j, \exists n>0|X_0=i)\)
      • if \(i\) is recurrent and i<->j, then \(f_{ij}=1\)
      • pick n with \(P_{ij}^n>0\) (i<->j); each visit to \(i\) gives an independent chance \(P_{ij}^n\) of reaching \(j\) within n steps, and \(i\) is revisited infinitely often, so the probability of never entering \(j\) is \(\lim_{k\to\infty}(1-P_{ij}^n)^k=0\).
    • For recurrent \(j\), define \(N_j=\min\{n>0; X_n=j\}\): number of transitions until the MC enters \(j\)
      • \(m_j=E[N_j|X_0=j]\)
      • \(j\) is positive recurrent if \(m_j<\infty\) and null recurrent if \(m_j=\infty\)
    • Define \(\pi_j\): the long-run proportion of time that the MC is in state \(j\).
      • \(\pi_j=1/m_j\) (irreducible and recurrent MC)
        • so positive recurrence <==> \(\pi_j>0\)
      • Let \(T_1\) be the number of transitions until MC enters \(j\) starting from \(i\)
      • Let \(T_n\) (n>1) be the additional number of transitions until re-entering \(j\) since previous entering \(j\)
      • \(\pi_j = \lim_n \frac{n}{\sum_{k=1}^nT_k}\): after \(\sum_{k=1}^n T_k\) transitions the chain has visited \(j\) exactly \(n\) times
      • Note \(T_n\) (n>1) are iid copies of \(N_j\) (started from \(j\)), hence \(E[T_n]=m_j\) and, by the strong law of large numbers, \(\pi_j=1/m_j\)
    • positive recurrence is a class property
      • if \(i\) is positive recurrent, and i<->j, then \(j\) is also positive recurrent
      • \(\pi_j \geq \pi_iP_{ij}^n>0\) for any \(n\) with \(P_{ij}^n>0\), where \(\pi_j\) is the long-run proportion of time the MC is in state \(j\)
        • RHS: long run proportion of MC in \(i\) and will be in \(j\) after n transitions
        • it is also long run proportion of MC in \(j\) and was in \(j\) n transitions ago
    • \(\pi_iP_{ij}\): long run proportion of transitions that MC go from i to j.
      • \(\pi_i\): long run proportion of MC in \(i\)
      • hence \(\sum_i\pi_iP_{ij}=\pi_j\)
      • in general for an irreducible and positive recurrent MC, the long run proportions \(\pi_i\) are the unique solutions of \(\pi_j=\sum_i\pi_iP_{ij}\) with \(\sum_i\pi_i=1\)
        • when there are no solutions, the MC is either transient or null recurrent with all \(\pi_i=0\).
    • in a finite-state Markov chain all recurrent states are positive recurrent
  • Ex 4.23 (Hardy-Weinberg law, page 208)
    • start with an initial population of three gene pairs (genotypes) AA, aa, Aa, with proportions \(p_0,q_0,r_0\), where \(p_0+q_0+r_0=1\)
    • random mating and random selection of gene variant (allele, a or A)
      • equivalent to randomly selecting an allele from the general pop
    • For the next generation, the proportions of the genotypes AA, Aa, aa are \((p_0+r_0/2)^2, 2(p_0+r_0/2)(q_0+r_0/2), (q_0+r_0/2)^2\)
      • note that proportion of allele A is \((p_0+r_0/2)^2+(p_0+r_0/2)(q_0+r_0/2)=p_0+r_0/2\)
      • so the proportion of alleles are never changed from generation to generation: \(\Pr(A)=p_0+r_0/2, \Pr(a)=q_0+r_0/2\)
    • therefore the pop reaches the stable genotype dist after just one generation.
    • After stable dist, define \(X_n\) as the genotype of a descendant in the \(n\)-th generation
      • transition Matrix: \(\Pr(X_{n+1}=AA|X_n=AA)=\Pr(A)\) just the proportion of allele A in the general pop
        • randomly selecting a parent and then randomly selecting one of his allele is equivalent to random selection of allele
      • limiting dist: just the stable dist
  • Period
    • State i is said to have period d if \(P_{ii}^n = 0\) whenever n is not divisible by d, and d is the largest integer with this property.
    • A state with period d=1 is said to be aperiodic
    • periodicity is a class property: state i has period d, and i<->j, then state j also has period d.
      • e.g., the random walk on (0,1,…,M) with \(P_{01}=1, P_{i,i+1}=P_{i,i-1}=0.5, P_{M,M-1}=1\): every state has period 2.
  • Ergodic: positive recurrent, aperiodic states
  • Theorem 4.1
    • For an irreducible ergodic Markov chain \(\lim_{n\to\infty} P_{ij}^n\) exists and is independent of i. Furthermore, letting \(\pi_j=\lim_{n\to\infty} P_{ij}^n\), \(j\geq 0\), then \(\pi_j\) is the unique nonnegative solution of \(\pi_j=\sum_{i=0}^{\infty} \pi_iP_{ij}\), \(j\geq 0\), \(\sum_{j=0}^{\infty} \pi_j = 1\).
    • \(\pi_j = \lim_n \Pr(X_n=j)\).
    • \(\pi_j\) is also the long-run proportion of time that the process will be in state j.
    • The long run proportions \(\pi_j\) are often called stationary probabilities.
    • If the Markov chain is irreducible, then there will be a solution to \(\pi_j=\sum_{i=0}^{\infty} \pi_iP_{ij}\), \(j\geq 0\), \(\sum_{j=0}^{\infty} \pi_j = 1\), iff the Markov chain is positive recurrent. If a solution exists then it will be unique, and \(\pi_j\) will equal the long-run proportion of time that the Markov chain is in state j. If the chain is aperiodic, then \(\pi_j\) is also the limiting probability that the chain is in state j. (A small numerical check follows this list.)
  • Ex 4.25 (page 222-225)
  • Ex 4.26 Mean Pattern Times in Markov Chain Generated Data (Page 225-228)
  • Proposition 4.3
    • Let \(\{X_n, n\geq 1\}\) be an irreducible Markov chain with stationary probabilities \(\pi_j\), \(j\geq 0\), and let r be a bounded function on the state space. Then, with probability 1, \(\lim_{N\to\infty} \frac{1}{N}\sum_{n=1}^N r(X_n) = \sum_{j=0}^{\infty} r(j)\pi_j\).
    • Proof: let \(a_j(N)\) be the amount of time the Markov chain spends in state j during time periods 1, …, N; then \(\sum_{n=1}^N r(X_n) = \sum_{j=0}^{\infty} a_j(N)r(j)\). Note \(a_j(N)/N \to \pi_j\) as \(N\to\infty\).
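
As a small numerical check, the stationary probabilities of a finite irreducible chain solve \(\pi=\pi P\) with \(\sum_j\pi_j=1\); a minimal R sketch, using the two-state transition matrix that also appears in the HMM simulation later in these notes:

P = matrix(c(0.8, 0.2,
             0.7, 0.3), 2, 2, byrow=TRUE)
## pi is the left eigenvector of P for eigenvalue 1, normalized to sum to 1
e = eigen(t(P))
pv = Re(e$vectors[, which.min(abs(e$values - 1))])
pv = pv/sum(pv)  ## stationary probabilities pi
pv               ## (7/9, 2/9)
pv %*% P - pv    ## ~0: stationarity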

Lecture 4


4.8 Time Reversible Markov Chains

  • Consider a stationary ergodic Markov chain (i.e., each \(X_n\) already follows the stationary dist) with transition probabilities \(P_{ij}\) and stationary probabilities \(\pi_i\).
    • sequence of states \(\{X_n,X_{n-1},X_{n-2},...\}\) is a Markov Chain
      • need to verify \(\Pr(X_m=j|X_{m+1}=i,X_{m+2},\cdots)=\Pr(X_m=j|X_{m+1}=i)\)
        • Due to the symmetric property of independence: Pr(A|B,C)=Pr(A|C) <==> Pr(B|A,C)=Pr(B|C).
        • And the Markov property of forward chain.
      • Transition prob \(Q_{ij} = \Pr(X_m=j|X_{m+1}=i) = \pi_jP_{ji}/\pi_i\).
    • time reversible if \(Q_{ij}=P_{ij}\)
      • <==> \(\pi_iP_{ij}=\pi_jP_{ji}, \forall i,j\).
      • Equal prob of going from i to j and going from j to i
  • For any positive \(x_i\) satisfying \(x_iP_{ij}=x_jP_{ji}, \sum_i x_i=1\)
    • \(\sum_i x_iP_{ij}=\sum_i x_jP_{ji} = x_j, \forall j\)
    • \(x_i\) is the stationary dist (unique solution) and the chain is time reversible.
  • Ex 4.35 random walk in a finite number of states {0,1,…,M} (page 250-252)
    • only make transitions from a state to one of its two nearest neighbors
      • \(P_{01}=\alpha_0=1-P_{00}, P_{MM}=\alpha_M=1-P_{M,M-1}\)
      • \(P_{i,i+1}=\alpha_i=1-P_{i,i-1}\), i=1,…,M-1
    • this MC is time reversible
      • between any two transitions from i to i + 1 there must be one from i + 1 to i (and conversely) since the only way to reenter i from a higher state is via state i + 1
      • it follows that the rate of transitions from i to i + 1 equals the rate from i + 1 to i, and so the process is time reversible
    • Using reversible property to derive the stationary dist
      • \(\pi_i\alpha_i=\pi_{i+1}(1-\alpha_{i+1})\), i.e., \(\pi_{i+1}=\pi_i\alpha_i/(1-\alpha_{i+1})\), solved recursively with \(\sum_i\pi_i=1\) (see the sketch after this list).
  • An ergodic Markov chain for which \(P_{ij}=0\) whenever \(P_{ji}=0\) is time reversible iff starting in state i, any path back to i has the same probability as the reversed path.
  • Consider an irreducible Markov chain with transition probabilities \(P_{ij}\). If we can find positive numbers \(\pi_i,i\geq 0\), summing to one, and a transition probability matrix \(Q = [Q_{ij}]\) such that \(\pi_iP_{ij} = \pi_jQ_{ji}\), then the \(Q_{ij}\) are the transition probabilities of the reversed chain and the \(\pi_i\) are the stationary probabilities both for the original and reversed chain.
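
A minimal R sketch of the detailed-balance recursion for Ex 4.35 (the \(\alpha_i\) values are arbitrary; R index i corresponds to state i-1):

M = 4
alph = c(0.9, 0.6, 0.5, 0.4, 0.3)  ## alpha_0, ..., alpha_M
pv = numeric(M+1); pv[1] = 1
for(i in 1:M) pv[i+1] = pv[i]*alph[i]/(1-alph[i+1])  ## pi_{i+1} = pi_i alpha_i/(1-alpha_{i+1})
pv = pv/sum(pv)
## check against pi = pi P
P = matrix(0, M+1, M+1)
P[1, 1] = 1-alph[1]; P[1, 2] = alph[1]
for(i in 2:M){ P[i, i+1] = alph[i]; P[i, i-1] = 1-alph[i] }
P[M+1, M] = 1-alph[M+1]; P[M+1, M+1] = alph[M+1]
max(abs(pv %*% P - pv))  ## ~0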

4.9 Markov Chain Monte Carlo Methods

  • Metropolis-Hastings, simulate from \(b_j/B\) without knowing the normalizing constant B.
    • target density: \(p(j)=b_j/B\), j=1,2,…
    • generate the successive states \(X_1,X_2,\ldots\) of a vector-valued Markov chain (the draws are not independent) whose stationary probabilities are \(b_j/B\).
    • Use time reversible Markov Chains
      • \(Q=(q_{ij})\) be any specified irreducible Markov transition probability matrix on the integers
      • Construct the Markov chain \(\{X_n\}\) as follows: for \(X_n=i\), (1) generate a random variable Y according to the transition prob \(\Pr(Y=j)=q_{ij}\); (2) if Y=j, set \(X_{n+1}=j\) with probability \(\alpha_{ij}\), and \(X_{n+1}=i\) with probability \(1-\alpha_{ij}\).
      • \(\alpha_{ij}\) is the acceptance prob.
      • the newly constructed \(\{X_n\}\) has transition prob \(P_{ij}\) given by
        • \(P_{ij}=q_{ij}\alpha_{ij}\), for \(j\neq i\)
        • \(P_{ii}=q_{ii}+\sum_{k\neq i} q_{ik}(1-\alpha_{ik})\)
      • \(\{X_n\}\) will be time reversible and have stationary probabilities \(\pi_j\) if
        • \(\pi_iP_{ij}=\pi_jP_{ji}\), for \(j\neq i\)
        • <==> \(\pi_iq_{ij}\alpha_{ij}=\pi_jq_{ji}\alpha_{ji}\)
        • for \(\pi_j=b_j/B\), this holds with \(\alpha_{ij}=\min(1, \frac{\pi_jq_{ji}}{\pi_iq_{ij}})=\min(1, \frac{b_jq_{ji}}{b_iq_{ij}})\)
      • value of B is not needed to define the Markov chain; the values \(b_j\) suffice (a runnable sketch follows this list).
      • It is almost always the case that \(\pi_j\) will not only be stationary probabilities but will also be limiting probabilities.
        • Indeed, a sufficient condition is that \(P_{ii}>0\) for some i.
  • Gibbs sampling, useful for high-dim random variable simulation
    • Let \(X=(X_1,\ldots,X_n)\) be a discrete random vector with prob mass function \(p(x) = Cg(x)\), where g(x) is known but C is not, i.e., p is specified only up to a multiplicative constant. We want to simulate from p(x) without knowing C.
    • For any i and values \(x_j (j\neq i)\), generate a random variable X having the prob mass function,\(\Pr(X=x)=\Pr(X_i=x|X_j=x_j,j\neq i)\). i.e., each time only update one of the coordinates.
    • It is a Metropolis-Hastings algorithm with chain transition prob constructed as
      • Given the present state is x, a coordinate that is equally likely to be any of 1,…,n is chosen
      • If coordinate i is chosen, then a random variable X with prob mass function \(\Pr(X=x)=\Pr(X_i=x|X_j=x_j,j\neq i)\) is generated. If X=x, then the state \(y=(x_1,\ldots,x_{i-1},x,x_{i+1},\ldots,x_n)\) is considered as the candidate next state.
      • With x and y given, the transition prob can be written as \(q(x,y)=\frac{1}{n}\Pr(X_i=x|X_j=x_j,j\neq i)=\frac{p(y)}{n\Pr(X_j=x_j,j\neq i)}\)
      • With limiting mass prob p(x), the acceptance prob is \(\alpha(x,y)=\min\left(1,\frac{p(y)q(y,x)}{p(x)q(x,y)}\right) = 1\).
      • when utilizing the Gibbs sampler, the candidate state is always accepted as the next state of the chain.
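
A minimal Metropolis-Hastings sketch in R (the target \(b_j=j^2e^{-j}\) on j = 1, 2, … and the symmetric +/-1 random-walk proposal are arbitrary choices); the normalizing constant B is never computed:

set.seed(7)
b = function(j) ifelse(j < 1, 0, j^2*exp(-j))  ## unnormalized target b_j
n = 1e5; x = numeric(n); x[1] = 1
for(k in 2:n){
  y = x[k-1] + sample(c(-1, 1), 1)             ## symmetric proposal, q_ij = q_ji
  a = min(1, b(y)/b(x[k-1]))                   ## acceptance prob: the q's cancel
  x[k] = if(runif(1) < a) y else x[k-1]
}
round(rbind(mcmc = table(factor(x, levels = 1:8))/n,
            exact = b(1:8)/sum(b(1:100))), 3)  ## empirical vs target probabilities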

Application to GLMM

  • GLMM model setup
    • conditional dist: \(\Pr(Y|\theta,u)\) typically GLM
    • random effects dist: \(\Pr(u)\) typically normal dist with variance parameters
  • MLE: \(\max_{\theta}\Pr(Y|\theta), \Pr(Y|\theta)=\int_u\Pr(Y|\theta,u)\Pr(u)du\)
    • marginal prob is typically very hard to compute.
  • EM maximization
    • \(\max_{\theta} E[\log[\Pr(Y|\theta,u)\Pr(u)]\,|\,Y]\)
    • the conditional expectation is hard to compute since \(\Pr(u|Y,\theta)\) involves the intractable \(\Pr(Y|\theta)\)
    • approximate by MCMC simulation: \(u_b\sim \Pr(u|Y,\theta)\propto \Pr(u)\Pr(Y|\theta,u)\)
      • Metropolis-Hastings or Gibbs
    • \(E[\log[\Pr(Y|\theta,u)\Pr(u)]\,|\,Y]\approx \frac{1}{B}\sum_{b=1}^B [\log\Pr(u_b) +\log\Pr(Y|\theta,u_b)]\)
      • can now be easily maximized using standard GLM
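
A rough MCEM sketch for a random-intercept logistic GLMM (the model, sampler step size, and iteration counts are all illustrative assumptions, not a tuned implementation):

set.seed(3)
m = 20; ni = 10; id = rep(1:m, each = ni)        ## m clusters, ni obs each
y = rbinom(m*ni, 1, plogis(0.5 + rnorm(m)[id]))  ## true theta = 0.5, Var(u) = 1
logpost = function(u, th, s2)  ## log Pr(u | Y, theta, s2) up to a constant
  sum(dbinom(y, 1, plogis(th + u[id]), log=TRUE)) + sum(dnorm(u, 0, sqrt(s2), log=TRUE))
th = 0; s2 = 1; u = rnorm(m)
for(it in 1:20){
  ## E-step: crude random-walk Metropolis draws of u given (Y, theta, s2)
  B = 50; U = matrix(0, B, m)
  for(bb in 1:B){
    up = u + rnorm(m, 0, 0.3)
    if(log(runif(1)) < logpost(up, th, s2) - logpost(u, th, s2)) u = up
    U[bb, ] = u
  }
  ## M-step: maximize the Monte Carlo average of the complete-data loglik
  s2 = mean(U^2)                                     ## MLE of the random-effect variance
  off = as.vector(apply(U, 1, function(ub) ub[id]))  ## stack offsets over the B draws
  th = coef(glm(rep(y, B) ~ 1, family = binomial, offset = off))[1]  ## standard GLM fit
}
c(theta = unname(th), s2 = s2)  ## rough estimates of (0.5, 1)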

Lecture 5


4.11 Hidden Markov Chains

  • observed sequence of signals \(S_1\), \(S_2\), …
  • unobserved sequence of underlying Markov chain states \(X_1\), \(X_2\), . . .
  • signal emission prob: \(\Pr(S_n=s|X_n=j)=P_e(s|j)\)
  • MC transition prob \(\Pr(X_{k+1}=j|X_k=i)=P_t(j|i)\), and initial state prob \(\Pr(X_1=i)=P_0(i)\) (typical inference of interest)

Inference of MC state

  • Define \(S^k=(S_1,...,S_k)\), \(X^k=(X_1,...,X_k)\)
    • observed data likelihood: \(\Pr(S^k)= \sum_{X^k} \Pr(S^k,X^k)\).
  • Forward search, define \(F_k(j) = \Pr(S^k,X_k=j)\)
    • \(F_n(j) = P_e(S_n|j)\sum_i F_{n-1}(i)P_t(j|i)\).
    • \(F_1(i) = P_0(i)P_e(S_1|i)\)
    • \(\Pr(S^n) = \sum_i F_n(i)\) (likelihood); \(\Pr(X_n=j|S^n) = F_n(j)/\Pr(S^n)\)
  • Backward search, define \(B_k(i) = \Pr(S_{k+1},\ldots,S_n|X_k=i)\)
    • \(B_k(i) = \sum_j P_e(S_{k+1}|j)B_{k+1}(j)P_t(j|i)\)
    • \(B_{n-1}(i) = \sum_j P_t(j|i)P_e(S_n|j)\).
    • \(\Pr(S^n) = \sum_i P_e(S_1|i)B_1(i)P_0(i)\)
  • Inference of state, \(\Pr(X_k|S^n), \Pr(X_k,X_{k+1}|S^n)\) (evaluation)
    • \(\Pr(S^n,X_k=j)= F_k(j)B_k(j)\); \(\Pr(S^n) = \sum_j F_k(j)B_k(j)\)
    • \(\Pr(X_k=j|S^n) = F_k(j)B_k(j)/\sum_i F_k(i)B_k(i)\)
    • \(\Pr(S^n,X_k=i,X_{k+1}=j) = F_k(i)P_t(j|i)P_e(S_{k+1}|j)B_{k+1}(j)\)

Baum-Welch (EM) algorithm for MLE of HMM (learning)

  • complete data: \(S^n\) (observed), \(X^n=(X_1,\ldots,X_n)\) (missing)
  • observed data likelihood
    • \(\Pr(S^n)= \sum_{X^n} \Pr(S^n,X^n)\).
  • complete data likelihood
    • \(\Pr(S^n,X^n)=P_0(X_1)P_e(S_n|X_n)\prod_{k=1}^{n-1}P_e(S_k|X_k)P_t(X_{k+1}|X_k)\)
    • If \(X^n\) were known, the MLEs are the sample proportions
      • \(P_t(j|i) = \sum_{k=1}^{n-1}I(X_{k+1}=j,X_k=i)/\sum_{k=1}^{n-1}I(X_k=i)\)
      • \(P_e(s|i) = \sum_{k=1}^n I(S_k=s,X_k=i)/\sum_{k=1}^nI(X_k=i)\)
  • Expectation of the complete data loglikelihood
    • complete data log likelihood can be written as
      • \(\sum_i\{I(X_1=i)\log P_0(i) + I(X_n=i)\log P_e(S_n|i)+\sum_{k=1}^{n-1}(I(X_k=i)\log P_e(S_k|i) + \sum_j I(X_k=i,X_{k+1}=j)\log P_t(j|i)) \}\)
    • Its conditional expectation (given \(S^n\)) is
      • \(Q=\sum_i\{\Pr(X_1=i|S^n)\log P_0(i)+\Pr(X_n=i|S^n)\log P_e(S_n|i)+\sum_{k=1}^{n-1}(\Pr(X_k=i|S^n)\log P_e(S_k|i)+\sum_j\Pr(X_k=i,X_{k+1}=j|S^n)\log P_t(j|i))\}\)
  • \(\arg\max Q\) -> M-step (MLE) update
    • \(P_t(j|i)=\sum_{k=1}^{n-1}\Pr(X_k=i,X_{k+1}=j|S^n)/\sum_{k=1}^{n-1}\Pr(X_k=i|S^n)\)
    • \(P_e(s|i)=\sum_{k=1}^n I(S_k=s)\Pr(X_k=i|S^n)/\sum_{k=1}^n\Pr(X_k=i|S^n)\)
  • The EM algorithm above extends readily to HMMs with more complicated \(P_e/P_t\)
    • the Maximization part will be replaced by corresponding MLE.
    • see the following Binomial example

Viterbi algorithm: most likely sequence of states given a prescribed sequence of signals (decoding)

  • Find \(X^n\) maximizing \(\Pr(X^n,S^n)\)
  • \(V_k(j)= \max_{i_1,...,i_{k-1}} \Pr(X^{k-1} = (i_1,...,i_{k-1}),X_k=j,S^k)\)
  • target: \(\max_j V_n(j)\)
  • \(V_k(j)=P_e(S_k|j)\max_i P_t(j|i)V_{k-1}(i)\)
    • \(V_k(j)=\max_{i,i_1,\ldots,i_{k-2}}\Pr(X^{k-2}=(i_1,\ldots,i_{k-2}),X_{k-1}=i,S^{k-1},X_k=j,S_k)\)
    • \(=\max_{i,i_1,\ldots,i_{k-2}}\Pr(X^{k-2}=(i_1,\ldots,i_{k-2}),X_{k-1}=i,S^{k-1})\Pr(X_k=j,S_k|X_{k-1}=i)\)
    • \(=\max_{i}V_{k-1}(i)P_t(j|i)P_e(S_k|j)\)
  • Note \(V_1(j)=\Pr(X_1=j,S_1) = P_0(j)P_e(S_1|j)\)
  • Let \(j_n=\arg\max_j V_n(j)\): the final state of a maximizing state sequence
    • \(V_n(j_n)=\max_{i_1,\ldots,i_{n-1}}\Pr(X^n=(i_1,\ldots,i_{n-1},j_n),S^n)\)
    • \(=P_e(S_n|j_n)\max_i P_t(j_n|i)V_{n-1}(i)\)
    • so \(j_{n-1}=\arg\max_i P_t(j_n|i)V_{n-1}(i)\) maximizes the previous prob
    • continue backwards from \(V_{n-1}(j_{n-1})\) to recover the whole sequence …

A numerical example of HMM computation

## simulation set up
## MC
     1     2
1  0.8   0.2 
2  0.7   0.3
## Signal: S | X=1 ~ Binomial(2, p=0.3)
##         S | X=2 ~ Binomial(2, p=0.6)

###
set.seed(17)
pes = c(0.3, 0.6)  ## emission probs: S | X=i ~ Binomial(2, pes[i])
pts = c(0.2, 0.3)  ## transition probs: Pr(X_{k+1}=2 | X_k=i) = pts[i]
p0 = 0.25          ## initial prob: Pr(X_1 = 2)

M = 200; N = 10  ## M = chain length, N = number of simulated sequences
x = s = matrix(0, N,M)
for(i in 1:N){
  x[i,1] = rbinom(1,1,p0)+1
  s[i,1] = rbinom(1, 2, pes[x[i,1]])
  for(k in 2:M){
    x[i,k] = rbinom(1, 1,pts[x[i,k-1]])+1
    s[i,k] = rbinom(1, 2,pes[x[i,k]])
  }
}

### forward prob
fwd = function(s,p0,pes,pts){
   M = length(s)
   fpr = matrix(0, 2,M)
   fpr[1,1] = (1-p0)*dbinom(s[1], 2,pes[1])
   fpr[2,1] = p0*dbinom(s[1], 2,pes[2])
   for(k in 2:M){
     fpr[1,k] = dbinom(s[k],2,pes[1])*(fpr[1,k-1]*(1-pts[1])+ fpr[2,k-1]*(1-pts[2]))
     fpr[2,k] = dbinom(s[k],2,pes[2])*(fpr[1,k-1]*pts[1]+ fpr[2,k-1]*pts[2])
   }
   return(list(fpr=fpr))
}
### backward prob
bwd = function(s,p0,pes,pts){
   M = length(s)
   bpr = matrix(1, 2,M)
   for(k in (M-1):1){
     bpr[1,k] = dbinom(s[k+1],2,pes[1])*bpr[1,k+1]*(1-pts[1]) + dbinom(s[k+1],2,pes[2])*bpr[2,k+1]*pts[1]
     bpr[2,k] = dbinom(s[k+1],2,pes[1])*bpr[1,k+1]*(1-pts[2]) + dbinom(s[k+1],2,pes[2])*bpr[2,k+1]*pts[2]
   }
   return(list(bpr=bpr))
}
## cond prob
cprs = function(s, p0,pes,pts){
  M = length(s)
  bpr = bwd(s, p0,pes,pts)$bpr
  fpr = fwd(s, p0,pes,pts)$fpr
  prsn = sum(bpr[,1]*fpr[,1])
  cpr1 = bpr*fpr/prsn
  cpr2 = array(0, dim=c(2,2,M-1))
  for(k in 1:(M-1)){
    cpr2[1,1, k] = fpr[1,k]*(1-pts[1])*dbinom(s[k+1],2,pes[1])*bpr[1,k+1]
    cpr2[1,2, k] = fpr[1,k]*pts[1]*dbinom(s[k+1],2,pes[2])*bpr[2,k+1]
    cpr2[2,1, k] = fpr[2,k]*(1-pts[2])*dbinom(s[k+1],2,pes[1])*bpr[1,k+1]
    cpr2[2,2, k] = fpr[2,k]*pts[2]*dbinom(s[k+1],2,pes[2])*bpr[2,k+1]
  }
  return(list(cpr1=cpr1,cpr2=cpr2/prsn))
}
## EM MLE
bwem = function(s, p0,pes,pts, itMax=1e2,eps=1e-3){
  M = length(s)  ## sequence length (the original relied on the global M)
  it = 1; pdiff = 1
  while((it<itMax)&(pdiff>eps)){
    ### E-step
    bpr = bwd(s, p0,pes,pts)$bpr
    fpr = fwd(s, p0,pes,pts)$fpr
    prsn = sum(bpr[,1]*fpr[,1])
    pr1 = cprs(s,p0,pes,pts)
    cpr1 = pr1$cpr1; cpr2 = pr1$cpr2 
    ### M-step: binomial MLE
    pts1 = pts
    pts1[1] = sum(cpr2[1,2,])/sum(cpr1[1,-M])
    pts1[2] = sum(cpr2[2,2,])/sum(cpr1[2,-M])
    pes1 = pes
    pes1[1] = sum(cpr1[1,]*s)/sum(cpr1[1,])/2
    pes1[2] = sum(cpr1[2,]*s)/sum(cpr1[2,])/2
    p01 = cpr1[2,1]  ## update Pr(X_1 = 2 | S)
    ### update
    pdiff = mean(abs(pts-pts1))/3+mean(abs(pes-pes1))/3+abs(p0-p01)/3
    pts=pts1; pes=pes1; p0=p01
    it = it+1  ## advance the iteration counter (missing in the original loop)
  }
  return(list(pts=pts,pes=pes,p0=p0)) 
}

## viterbi
vitb = function(s, p0, pes,pts){
  M = length(s)
  vj = matrix(0, 2,M)
  vj[1,1] = (1-p0)*dbinom(s[1],2,pes[1])
  vj[2,1] = p0*dbinom(s[1],2,pes[2])
  for(k in 2:M){
    a1 = dbinom(s[k],2,pes[1])
    a2 = dbinom(s[k],2,pes[2])
    vj[1,k] = a1*max((1-pts[1])*vj[1,k-1], (1-pts[2])*vj[2,k-1])
    vj[2,k] = a2*max(pts[1]*vj[1,k-1], pts[2]*vj[2,k-1])
  }
  return(list(vj=vj))
}

##
ks = 1
table(s[ks,])
bwem(s[ks,], p0,pes,pts)
cpr1 = cprs(s[ks,], p0,pes,pts)
plot(1:dim(s)[2], x[ks,], type='b')
points(1:dim(s)[2], apply(cpr1$cpr1, 2, which.max), pch=2, col=2)

## viterbi decoding
vj = vitb(s[ks,], p0,pes,pts)$vj
M = length(s[ks,])
vsq = rep(0, M)
vsq[M] = which.max(vj[,M])

for(k in M:2){
   jk = vsq[k]
   if(jk==1){
     tmp = c((1-pts[1])*vj[1,k-1], (1-pts[2])*vj[2,k-1])
     vsq[k-1] = which.max(tmp)
    } else{
     tmp = c(pts[1]*vj[1,k-1], pts[2]*vj[2,k-1])
     vsq[k-1] = which.max(tmp)
    }
}
points(1:dim(s)[2], vsq, pch=3, col=3)

table(x[ks,], apply(cpr1$cpr1, 2, which.max))
table(x[ks,], vsq)

4.5 Some Applications

  • Ex 4.5.1 The Gambler’s Ruin Problem (page 230)
  • Ex 4.5.3 random walk (page 237-241)

Lecture 6


4.6 Mean Time Spent in Transient States

  • Consider a finite-state Markov chain and let T = {1, 2, …, t} denote the set of transient states.
  • Let \(P_T=(P_{ij}, i\leq t,j\leq t)\), some of its row sums are less than 1.
  • For transient state i and j,
    • \(s_{ij}\) denote the expected number of time periods that the Markov chain is in state j, given that it starts in state i.
    • for recurrent state k, \(s_{kj} = 0\)
      • impossible to go from a recurrent to a transient state (otherwise k<->j and j would be recurrent)
      • \(f_k=1\): starting from \(k\), we will return to \(k\) with prob 1.
    • \(s_{ij}=\delta_{i,j} + \sum_k P_{ik}s_{kj} = \delta_{ij} + \sum_{k=1}^t P_{ik}s_{kj}\)
    • Let \(S=(s_{ij}, i\leq t, j\leq t)\). => \(S=I+P_TS\) => \(S=(I-P_T)^{-1}\).
  • For \(i\in T, j\in T\), \(f_{ij}=\) Pr(MC ever enters j given that it starts at i).
    • compute \(s_{ij}\) = E[time in j | start at i] by conditioning on whether \(j\) is ever entered
    • \(s_{ij}\) = E[time in j | start at i, ever enter j]\(f_{ij}\) + E[time in j | start at i, never enter j]\((1-f_{ij})\)
      • = \((\delta_{ij}+s_{jj})f_{ij}+\delta_{ij}(1-f_{ij})\)
      • \(s_{jj}\) is the expected additional time in \(j\) after \(j\) is first entered
    • so \(f_{ij}=(s_{ij}-\delta_{ij})/s_{jj}\)
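
A minimal R sketch for the gambler's ruin chain on {0,…,4} with win probability 0.4 (values illustrative; transient states T = {1,2,3}):

p = 0.4; t = 3
PT = matrix(0, t, t)  ## P_T over the transient states 1,2,3
for(i in 1:t){
  if(i < t) PT[i, i+1] = p    ## win: i -> i+1
  if(i > 1) PT[i, i-1] = 1-p  ## lose: i -> i-1
}
S = solve(diag(t) - PT)  ## S = (I - P_T)^{-1}; s_ij = expected periods in j from i
f = (S - diag(t))/matrix(diag(S), t, t, byrow=TRUE)  ## f_ij = (s_ij - delta_ij)/s_jj
S; f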

4.7 Branching Processes

  • each individual produces j new offspring with probability \(P_j\), \(j\geq 0\) (each \(P_j<1\)), independently of the numbers produced by other individuals.
  • number of individuals initially present, denoted by \(X_0\), is called the size of the zeroth generation
  • \(X_n\) denotes the size of the nth generation. \(\{X_n, n=0,1,\ldots\}\) is a Markov chain having as its state space the set of nonnegative integers
    • 0 is a recurrent state, since \(P_{00}=1\).
    • if \(P_0> 0\)
      • all other states are transient
        • \(P_{i0} = P_0^i\): starting with i individuals, there is a positive probability (\(\geq P_0^i\)) that no later generation will ever consist of i individuals.
        • so the prob of reentering i is smaller than 1, i.e., i is transient.
      • population will either die out or its size will converge to infinity
        • any finite set of transient states {1,2,…,n} will be visited only finitely often.
  • For number of offspring of a single individual, denote mean/variance: \(\mu=\sum_{j=0}^{\infty}jP_j\), \(\sigma^2 = \sum_{j=0}^{\infty}(j-\mu)^2P_j\). Given \(X_0=1\):
    • \(E(X_n)=\mu^n\)
      • Note \(X_n=\sum_{i=1}^{X_{n-1}} Z_i\), where \(Z_i\) represents the number of offspring of the ith individual of the (n-1)st generation
      • \(E(X_n)=E(E(X_n|X_{n-1}))=\mu E(X_{n-1})\), and \(E(X_1)=\mu\)
    • \(Var(X_n)=\sigma^2(\mu^{n-1}+\mu^n+...+\mu^{2n-2})\)
      • \(Var(X_n)= E[Var(X_n|X_{n-1})] + Var(E[X_n|X_{n-1}])\)
      • \(E[X_n|X_{n-1}]= X_{n-1}\mu, Var(X_n|X_{n-1}) = X_{n-1}\sigma^2\)
      • \(Var(X_n)=E(X_{n-1}\sigma^2)+ Var(X_{n-1}\mu) = \mu^{n-1}\sigma^2+\mu^2Var(X_{n-1}) = \sigma^2\mu^{n-1}+\mu^2(\sigma^2\mu^{n-2}+\mu^2Var(X_{n-2}))\)
      • Note \(Var(X_0)=0\).
      • \(\mu=1\) => \(Var(X_n)=n\sigma^2\); \(\mu\neq 1\) => \(Var(X_n) = \sigma^2\mu^{n-1}(1-\mu^n)/(1-\mu)\)
  • Considering \(\pi_0 = \lim_n \Pr(X_n=0|X_0=1)\)
    • \(\mu\leq 1 \Rightarrow \pi_0=1\); \(\mu>1 \Rightarrow \pi_0<1\) is the smallest positive root of the equation below.
      • \(\pi_0 = \sum_{j=0}^{\infty}\pi_0^jP_j\)
        • \(\pi_0 = \Pr(\mbox{population dies out})=\sum_j \Pr(\mbox{population dies out}|X_1 = j)P_j\)
        • \(\Pr(\mbox{population dies out}|X_1=j)= \pi_0^j\) (independence among individuals)
      • \(\mu^n=E[X_n]=\sum_j j\Pr(X_n=j)\geq \sum_{j\geq 1} 1\cdot\Pr(X_n=j) = \Pr(X_n\geq 1)\)
        • \(\mu<1 \Rightarrow \mu^n\to 0 \Rightarrow \Pr(X_n\geq 1)\to 0 \Rightarrow \Pr(X_n=0)\to 1\).
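
A minimal R sketch (the offspring distribution is arbitrary): for \(P_0=0.25, P_1=0.25, P_2=0.5\) we have \(\mu=1.25>1\), and iterating \(\pi_0 \leftarrow \sum_j \pi_0^j P_j\) from 0 converges to the smallest positive root (here 0.5):

Pj = c(0.25, 0.25, 0.5)           ## P_0, P_1, P_2
sum((0:2)*Pj)                     ## mean offspring mu = 1.25 > 1
g = function(x) sum(Pj * x^(0:2)) ## generating function sum_j x^j P_j
pi0 = 0
for(k in 1:200) pi0 = g(pi0)      ## fixed-point iteration starting from 0
pi0                               ## -> 0.5, the smallest positive root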