Discrete Markov Chain
4.1 Introduction
- state: finite/countable values, {0,1,2, ...}
- time: discrete, {0,1,2, ...}
- transition prob (time homogeneous), P={P_{ij}}, P_{ij} = Pr(X_n=j|X_{n-1}=i).
- Ex 4.5 random walk, {0,+/-1,+/-2,...}, (1-p,0,p)
- starting at X0=0, Xn=\sum_{i=1}^n Z_i, where Z_i = 2B_i - 1 with B_i iid Bernoulli(p) —> (Xn+n)/2 ~ Binomial(n,p)
- P_{ii}^{2n+1}=0, P_{00}^{2n}=\binom{2n}{n}p^n(1-p)^n=(2n)!/(n!n!)(p(1-p))^n
- Ex 4.6 gambling model, random walk with absorbing state 0/N, P_{00}=P_{NN}=1.
4.2 Chapman-Kolmogorov Equations
- n-step transition prob, P_{ij}^n = Pr(X_{n+k}=j|X_{k}=i)
- P_{ij}^{n+m} = \sum_{k=0}^{\infty} P_{ik}^n P_{kj}^m, \forall n,m>=0. (P196 formal proof)
- P^{(n+m)} = P^{(n)} P^{(m)} (matrix multiplication)
- Ex 4.10 urn with 2 balls (red/blue)
- Marginal prob: initial prob x conditional transition prob P.
- \alpha_i=Pr{X0=i}, i>=0, \sum_i \alpha_i = 1.
- Pr(Xn=j) = \sum_i\alpha_iP_{ij}^n.
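- A minimal R sketch of these two facts (hypothetical two-state chain and initial dist; the same transition matrix reappears in the numerical HMM example at the end):
## n-step transition probs are matrix powers; marginals are alpha %*% P^n
P = matrix(c(0.8, 0.2,
             0.7, 0.3), 2, 2, byrow=TRUE)   # P[i,j] = Pr(X_n=j | X_{n-1}=i)
alpha = c(0.5, 0.5)                         # hypothetical initial dist Pr(X0=i)
matpow = function(P, n) Reduce(`%*%`, replicate(n, P, simplify=FALSE))
Pn = matpow(P, 5)                           # 5-step transition matrix P^(5)
alpha %*% Pn                                # Pr(X5=j) = sum_i alpha_i P_ij^5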
4.3 Classification of States
- j accessible from i, P_{ij}^n>0 for some n.
- i communicates with j, i<->j : i(j) accessible from j(i).
- State i communicates with state i, all i>=0.
- If i communicates with j, then j communicates with i.
- If i communicates with j, and j communicates with k, then i communicates with k (Chapman–Kolmogorov equations).
P_{ik}^{n+m} = \sum_{r=0}^{\infty} P_{ir}^nP_{rk}^m >= P_{ij}^nP_{jk}^m > 0.
- i and j are in the same class if i<->j.
- any two classes of states are either identical or disjoint
- concept of communication divides the state space up into a number of separate classes
- Markov chain is irreducible if there is only one class, i.e., all states communicate with each other.
- For any state i, define fi as the probability that, starting in state i, the process will ever reenter state i.
- i is recurrent if fi = 1.
- with probability 1, the process eventually reenters state i; by the Markov property, the process then starts over from i, so state i will be visited again.
- i.e., starting in state i, the process will reenter state i again and again and again, in fact, infinitely often.
- i is transient if fi < 1.
- each time the process enters state i there will be a positive probability, namely, 1-fi, that it will never again enter that state
- starting in state i, the probability that the process will be in state i for exactly n time periods equals fi^{n-1}(1-fi), n>=1.
- if state i is transient then, starting in state i, the number of time periods that the process will be in state i has a geometric distribution with finite mean 1/(1-fi).
- i is recurrent iff, starting in state i, the expected number of time periods that the process is in state i is infinite.
- Number of periods that the process is in state i = \sum_{n=0}^{\infty} I(Xn=i)
- E(\sum_{n=0}^{\infty} I(Xn=i)|X0=i) = \sum_{n=0}^{\infty} P{Xn =i|X0 =i} = \sum_{n=0}^{\infty}P_{ii}^n.
- Proposition 4.1 State i is recurrent iff \sum_{n=0}^{\infty}P_{ii}^n=\infty, and transient iff \sum_{n=0}^{\infty}P_{ii}^n<\infty
- a transient state will only be visited a finite number of times
- in a finite-state Markov chain, not all states can be transient
- suppose the states are 0, 1, . . . , M and suppose that they are all transient. Then after a finite amount of time (say, after time T0) state 0 will never be visited, and after a time (say, T1) state 1 will never be visited, and after a time (say, T2) state 2 will never be visited, and so on. Thus, after a finite time T = max{T0,T1,...,TM} no states will be visited.
- But as the process must be in some state after time T we arrive at a contradiction, which shows that at least one of the states must be recurrent.
- Corollary 4.2 If state i is recurrent, and i <-> j, then state j is recurrent.
- Note that, since i<->j, there exist integers k,m such that P_{ij}^k>0, P_{ji}^m>0. Now, for any integer n, P_{jj}^{m+n+k} >= P_{ji}^mP_{ii}^nP_{ij}^k. Hence, summing over n, \sum_{n=1}^{\infty} P_{jj}^{m+n+k} >= P_{ji}^mP_{ij}^k\sum_{n=1}^{\infty}P_{ii}^n = \infty. So j is recurrent by Proposition 4.1.
- transience is a class property, if i is transient and i<->j, j must be transient.
- all states of a finite irreducible Markov chain are recurrent
- Ex 4.18 random walk (page 208)
- on each transition the process either moves one step to the right (with probability p) or one step to the left (with probability 1-p)
- Since all states clearly communicate, it follows from Corollary 4.2 that they are either all transient or all recurrent.
- Check \sum_{n=1}^{\infty} P_{00}^n is finite or infinite.
- A return to 0 is impossible after an odd number of steps —> P_{00}^{2n-1}=0, n=1,2,...
- The walk returns to 0 after 2n steps iff n of those steps go left and n go right. Since each step goes left with prob 1-p and right with prob p, the desired probability is the binomial probability
- P_{00}^{2n} = \binom{2n}{n}p^n(1-p)^n = (2n)!/(n!n!)(p(1-p))^n, n=1,2,3,...
- Stirling approx: n! ~ n^{n+1/2}e^{-n}\sqrt{2\pi}, hence P_{00}^{2n} ~ (4p(1-p))^n/sqrt{n\pi}.
- CLT normal approx: Bin(2n,0.5) ~ N(n,n/2), hence \binom{2n}{n} = (2n)!/(n!n!) ~ 4^n/\sqrt{n\pi}
- for positive sequences, a_n ~ b_n means lim_{n->\infty} a_n/b_n = 1; then \sum_n a_n<\infty <==> \sum_n b_n<\infty
- 4p(1-p)<=1 and equality holds iff p=0.5. Hence \sum_nP_{00}^n=\infty iff p=0.5.
- the chain is recurrent when p=0.5 and transient if p\neq 0.5.
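- A quick numerical illustration in R of Proposition 4.1 for this walk (partial sums only suggest, not prove, the limit):
## partial sums of P_00^{2n}: divergent for p = 0.5, convergent otherwise
p00 = function(n, p) exp(lchoose(2*n, n) + n*log(p*(1-p)))  # logs avoid overflow
sum(p00(1:5000, 0.5))   # ~ sum 1/sqrt(n*pi): keeps growing with the upper limit
sum(p00(1:5000, 0.6))   # 4p(1-p) = 0.96 < 1: essentially converged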
- Ex 4.18-1 (page 209-210) The two-dimensional symmetric random walk is also recurrent
- at each transition, the process takes one step to the left, right, up, or down, each with probability 1/4
- That is, the state is a pair of integers (i,j) and the transition probs are given by
P_{(i,j),(i+1,j)} = P_{(i,j),(i-1,j)} = P_{(i,j),(i,j+1)} = P_{(i,j),(i,j-1)} = 1/4.
- The chain obviously is irreducible, and we just need to check state z=(0,0).
- Consider P_{zz}^{2n}: after 2n steps, the chain will be back in its original location if for some i (0,1,...,n), the 2n steps consist of i steps to the left, i to the right, n-i up, and n-i down. Since each step will be either of these four types with probability 1/4, —> the desired probability is a multinomial probability,
- P_{zz}^{2n} = \sum_{i=0}^n \binom{2n}{i,i,n-i,n-i} (1/4)^{2n} = (1/4)^{2n} \binom{2n}{n}^2, using \binom{2n}{i,i,n-i,n-i} = \binom{2n}{n}\binom{n}{i}^2 and \sum_{i=0}^n \binom{n}{i}^2 = \binom{2n}{n}
- note \binom{2n}{n} ~ 4^n/\sqrt{n\pi}
- P_{zz}^{2n} ~ 1/(n\pi) —> \sum_n P_{zz}^{2n} = \infty —> z recurrent.
- all states are recurrent.
- The symmetric random walks in one and two dimensions are both recurrent, but all higher-dimensional symmetric random walks turn out to be transient. (For instance, the three-dimensional symmetric random walk is at each transition equally likely to move in any of six ways: left, right, up, down, in, or out.)
Finite mixture model and HMM
- Consider two state Xn={0,1}, given Xn=0, Yn~f0(y); and Xn=1, Yn~f1(y).
- Assume a transition matrix for Xn: Pr(Xn=1|Xn-1=0)=p1, Pr(Xn=1|Xn-1=1)=p2.
- Yn|Xn are independent of each other
- When p1=p2, i.e., Xn is independent of X_{n-1}, we obtain the classic two-class finite mixture model for iid Yn ~ p1*f1(y) + (1-p1)*f0(y)
- when p1\neq p2, {Xn} is a dependent MC and we obtain an HMM for the (dependent) Yn. Its likelihood can be computed efficiently using the forward/backward recursions (Section 4.11; see the numerical example at the end).
4.4 Limiting Probabilities
- Period
- State i is said to have period d if P_{ii}^n = 0 whenever n is not divisible by d, and d is the largest integer with this property.
- A state with period d=1 is said to be aperiodic
- periodicity is a class property: state i has period d, and i<->j, then state j also has period d.
- e.g., random walk on {0,1,...,M} with P_{01}=1, P_{i,i+1}=P_{i,i-1}=0.5, P_{M,M-1}=1: every state has period 2.
- Positive recurrent
- recurrent i is positive recurrent if, starting in i, the expected time until the process returns to state i is finite
- positive recurrence is a class property
- in a finite-state Markov chain all recurrent states are positive recurrent
- Ergodic : positive recurrent, aperiodic states
- Theorem 4.1
- For an irreducible ergodic Markov chain lim_{n->\infty} P_{ij}^n exists and is
independent of i. Furthermore, letting \pi_j=lim_{n->\infty} P_{ij}^n, j>=0,
then \pi_j is the unique nonnegative solution of \pi_j=\sum_{i=0}^{\infty} \pi_iP_{ij}, j>=0, \sum_{j=0}^{\infty} \pi_j = 1.
- \pi_j = \lim_n Pr(Xn=j).
- \pi_j is also the long-run proportion of time that the process will be in state j.
- The long run proportions \pi_j are often called stationary probabilities.
- If the Markov chain is irreducible, then there will be a solution to \pi_j=\sum_{i=0}^{\infty} \pi_iP_{ij}, j>=0, \sum_{j=0}^{\infty} \pi_j = 1, iff the Markov chain is positive recurrent.
If a solution exists then it will be unique, and \pi_j will equal the long-run proportion of time that the Markov chain is in state j.
If the chain is aperiodic, then \pi_j is also the limiting probability that the chain is in state j.
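- For a finite chain the stationary equations are a small linear system; a minimal R sketch (hypothetical two-state P, same as above), replacing one redundant balance equation with the normalization constraint:
P = matrix(c(0.8, 0.2,
             0.7, 0.3), 2, 2, byrow=TRUE)
stationary = function(P){
  m = nrow(P)
  A = t(diag(m) - P)       # pi = pi P  <=>  t(I - P) %*% pi = 0
  A[m,] = 1                # replace one equation by sum_i pi_i = 1
  solve(A, c(rep(0, m-1), 1))
}
stationary(P)              # for this P: pi = (7/9, 2/9)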
- Ex 4.23 (Hardy-Weinberg law, page 217-219)
- Ex 4.25 (page 222-225)
- Ex 4.26 Mean Pattern Times in Markov Chain Generated Data (Page 225-228)
- Proposition 4.3
- Let {Xn, n>=1} be an irreducible Markov chain with stationary probabilities \pi_j , j>=0,
and let r be a bounded function on the state space. Then, with probability 1, lim_{N->\infty} \sum_{n=1}^N r(Xn)/N = \sum_{j=0}^{\infty} r(j)\pi_j.
- Proof: let a_j(N) be the amount of time the Markov chain spends in state j during time periods 1, ..., N; then
\sum_{n=1}^N r(Xn) = \sum_{j=0}^{\infty} a_j(N)r(j), and a_j(N)/N -> \pi_j as N->\infty.
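- A quick R check of Proposition 4.3 with r(j)=j, reusing the hypothetical two-state P and its stationary dist (7/9, 2/9) from the sketch above:
set.seed(3)
N = 1e5
X = numeric(N); X[1] = 1
for(n in 2:N) X[n] = sample(1:2, 1, prob=P[X[n-1],])  # simulate the chain
mean(X)                   # long-run average of r(X_n) with r(j) = j
sum((1:2)*c(7,2)/9)       # sum_j r(j) pi_j = 11/9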
4.5 Some Applications
- Ex 4.5.1 The Gambler’s Ruin Problem (page 230)
- Ex 4.5.3 random walk (page 237-241)
4.6 Mean Time Spent in Transient States
- Consider a finite-state Markov chain and let T = {1, 2, ..., t} denote the set of transient states.
- Let P_T=(P_{ij}, i<=t, j<=t); some of its row sums are less than 1 (the missing mass goes to recurrent states).
- For transient state i and j,
- s_{ij} denote the expected number of time periods that the Markov chain is in state j, given that it starts in state i.
- for recurrent state k, s_{kj} = 0 for transient j (once in a recurrent state the chain never visits a transient state), so P_{ik}s_{kj} = 0
- s_{ij}=\delta_{i,j} + \sum_k P_{ik}s_{kj} = \delta_{ij} + \sum_{k=1}^t P_{ik}s_{kj}
- Let S=(s_{ij}, i<=t,j<=t). —> S=I+P_TS —> S=(I-P_T)^{-1}.
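- A minimal R sketch with a hypothetical gambler's-ruin chain (N=4, win prob p=0.4), whose transient states are {1,2,3}:
p = 0.4
PT = matrix(0, 3, 3)                  # transitions among transient states 1,2,3
PT[1,2] = p; PT[2,3] = p              # win: move up with prob p
PT[2,1] = 1-p; PT[3,2] = 1-p          # lose: move down with prob 1-p
S = solve(diag(3) - PT)               # S[i,j] = s_ij, expected periods in j from i
S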
4.7 Branching Processes
- each individual produces j new offspring with probability P_j(<1), j>=0, independently of the numbers produced by other individuals.
- number of individuals initially present, denoted by X_0, is called the size of the zeroth generation
- Xn denotes the size of the nth generation. {Xn, n=0,1,...} is a Markov chain having as its state space the set of nonnegative integers
- 0 is a recurrent state, since P_{00}=1.
- P_{i0} = P_0^i : starting with i individuals, there is a positive probability (>= P_0^i) that no later generation will ever consist of i individuals —> if P_0 > 0, all states other than 0 are transient
- any finite set of transient states {1,2,...,n} will be visited only finitely often. —> if P_0>0, population will either die out or its size will converge to infinity.
- For number of offspring of a single individual, denote mean/variance: \mu=\sum_{j=0}^{\infty}jP_j, \sigma^2 = \sum_{j=0}^{\infty}(j-\mu)^2P_j. Given X0=1 :
- E(Xn)=\mu^n
- Note Xn=\sum_{i=1}^{X_{n-1}} Z_i, where Z_i represents the number of offspring of the ith individual of the (n-1)st generation
- E(Xn)=E(E(Xn|X_{n-1})) = \mu E(X_{n-1}), and E(X1)=\mu
- Var(Xn) = \sigma^2(\mu^{n-1}+\mu^n +···+\mu^{2n-2})
- Var(Xn) = E[Var(Xn|X_{n-1})] + Var(E[Xn|X_{n-1}])
- E[Xn|X_{n-1}] = X_{n-1}\mu, Var(Xn|X_{n-1}) = X_{n-1}\sigma^2
- Var(Xn) = E(X_{n-1}\sigma^2) + Var(X_{n-1}\mu) = \mu^{n-1}\sigma^2+\mu^2Var(X_{n-1}) = \sigma^2\mu^{n-1} +\mu^2(\sigma^2\mu^{n-2}+ \mu^2Var(X_{n-2}))
- Note Var(X0)=0.
- \mu=1 —> Var(Xn)=n\sigma^2; \mu\neq 1 —> Var(Xn) = \sigma^2\mu^{n-1}(1-\mu^n)/(1-\mu)
- Consider \pi_0 = lim_n Pr{Xn=0|X0=1}, the probability that the population eventually dies out
- \mu<=1 —> \pi_0=1 (assuming P_1<1); \mu>1 —> \pi_0<1 is the smallest positive root.
- \pi_0 = \sum_{j=0}^{\infty}\pi_0^jP_j
- \pi_0 = Pr{population dies out}=\sum_j Pr{population dies out|X1 = j}P_j
- P{population dies out|X1 = j} = \pi_0^j (independence among individuals)
- \mu^n = E[Xn]=\sum_j jPr{Xn=j}>= \sum_{j>=1} 1.Pr(Xn=j) = Pr(Xn>=1)
- \mu<1 —> \mu^n -> 0 —> Pr(Xn>=1) -> 0 —> Pr(Xn=0) -> 1.
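- A short R sketch: \pi_0 as the smallest positive root of the pgf fixed-point equation, by iteration from 0 (hypothetical offspring pmf on {0,1,2}):
Pj = c(0.25, 0.25, 0.5)                         # P_0, P_1, P_2; mu = 1.25 > 1
phi = function(u) sum(Pj*u^(0:(length(Pj)-1)))  # offspring pgf: sum_j u^j P_j
pi0 = 0
for(it in 1:200) pi0 = phi(pi0)                 # iterates increase to smallest root
pi0                                             # root of 0.5u^2-0.75u+0.25=0 -> 0.5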
4.8 Time Reversible Markov Chains
- Consider a stationary ergodic Markov chain (i.e., safe to assume any Xn is in stationary dist) with transition probabilities Pij and stationary probabilities \pi_i.
- sequence of states {Xn,X_{n-1},X_{n-2},...} is a Markov Chain
- need to verify Pr{Xm=j|X_{m+1}=i,X_{m+2},...}=Pr{Xm=j|X_{m+1}=i}
- By the symmetry of conditional independence: Pr(A|B,C)=Pr(A|C) <==> Pr(B|A,C)=Pr(B|C).
- And the Markov property of forward chain.
- Transition prob Q_{ij} = Pr(Xm=j|X_{m+1}=i) = \pi_jP_{ji}/\pi_i.
- time reversible if Q_{ij}=P_{ij}
- <==> \pi_iP_{ij}=\pi_jP_{ji}, \forall i,j.
- i.e., the rate of transitions from i to j equals the rate from j to i
- For any positive x_i satisfying x_iP_{ij}=x_jP_{ji}, \sum_i x_i=1:
- \sum_i x_iP_{ij}=\sum_i x_jP_{ji} = x_j, \forall j
- the stationary dist \pi_i is the unique such solution —> x_i=\pi_i —> the chain is time reversible.
- Ex 4.35 random walk in a finite number of states {0,1,...,M} (page 250-252)
- only make transitions from a state to one of its two nearest neighbors
- P_{01}=\alpha_0=1-P_{00}, P_{MM}=\alpha_M=1-P_{M,M-1}
- P_{i,i+1}=\alpha_i=1-P_{i,i-1}, i=1,...,M-1
- between any two transitions from i to i + 1 there must be one from i + 1 to i (and conversely) since the only way to reenter i from a higher state is via state i + 1
- it follows that the rate of transitions from i to i + 1 equals the rate from i + 1 to i, and so the process is time reversible
- Using the reversibility equations to derive the stationary dist (see the sketch below)
- \pi_i\alpha_i=\pi_{i+1}(1-\alpha_{i+1}), i=0,...,M-1.
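- A small R sketch of this recursion (hypothetical \alpha_i, M=4):
M = 4
alpha = c(0.7, 0.6, 0.5, 0.4, 0.3)     # alpha_0, ..., alpha_M (hypothetical)
pst = numeric(M+1); pst[1] = 1         # unnormalized pi_0 = 1
for(i in 1:M) pst[i+1] = pst[i]*alpha[i]/(1-alpha[i+1])  # detailed balance
pst = pst/sum(pst)                     # normalize to sum to 1
all.equal(pst[1:M]*alpha[1:M], pst[2:(M+1)]*(1-alpha[2:(M+1)]))  # check balance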
- Theorem 4.2
- An ergodic Markov chain for which Pij=0 whenever Pji=0 is time reversible iff starting in state i, any path back to i has the same probability as the reversed path.
- Proposition 4.6
- Consider an irreducible Markov chain with transition probabilities Pij. If we can find positive numbers \pi_i, i>=0, summing to one, and a transition probability matrix Q = [Qij] such that \pi_iPij = \pi_jQji,
then the Qij are the transition probabilities of the reversed chain and the \pi_i are the stationary probabilities both for the original and reversed chain.
4.9 Markov Chain Monte Carlo Methods
- Metropolis-Hastings: simulate from b_j/B without knowing the normalizing constant B
- target pmf: \pi(j) = b_j/B, j=1,2,...
- generate the successive states of a (vector-valued) Markov chain X1,X2,... whose stationary probabilities are b_j/B; the draws are dependent, not independent
- Use time reversible Markov Chains
- Let Q=(q_{ij}) be any specified irreducible Markov transition probability matrix on the integers
- Construct a Markov chain {Xn} as follows: for Xn=i, (1) generate a random variable Y according to the transition prob P{Y=j}=q_{ij}; (2) if Y=j, set X_{n+1}=j with probability \alpha_{ij}, and X_{n+1}=i with probability 1-\alpha_{ij}.
- \alpha_{ij} is the acceptance prob.
- the newly constructed {Xn} has transition prob P_{ij} given by
- P_{ij}=q_{ij}\alpha_{ij}, for j\neq i
- P_{ii} = q_{ii} + \sum_{k\neq i} q_{ik}(1-\alpha_{ik})
- {Xn} will be time reversible and have stationary probabilities \pi_j if
- \pi_iP_{ij}=\pi_jP_{ji}, for j\neq i
- <==> \pi_iq_{ij}\alpha_{ij}=\pi_jq_{ji}\alpha_{ji}
- for \pi_j=b_j/B, <==> \alpha_{ij} = min(1, \pi_jq_{ji}/(\pi_iq_{ij})) = min(1, b_jq_{ji}/(b_iq_{ij}))
- value of B is not needed to define the Markov chain, just the values bj suffice.
- It is almost always the case that \pi_j will not only be stationary probabilities but will also be limiting probabilities.
- Indeed, a sufficient condition is that P_{ii}>0 for some i.
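- A minimal Metropolis-Hastings sketch in R (hypothetical unnormalized weights b_j; ±1 random-walk proposal on {1,...,K}, reflecting at the ends, so q is asymmetric only at the boundary):
set.seed(1)
b = c(1, 4, 9, 16, 25)                 # unnormalized target b_j; B never used
K = length(b); n = 1e5
x = numeric(n); x[1] = 1
for(m in 2:n){
  i = x[m-1]
  j = if(i==1) 2 else if(i==K) K-1 else i + sample(c(-1,1), 1)  # propose Y=j
  qij = ifelse(i==1 | i==K, 1, 0.5)    # q_ij: a boundary proposes its neighbor w.p. 1
  qji = ifelse(j==1 | j==K, 1, 0.5)
  x[m] = if(runif(1) < min(1, b[j]*qji/(b[i]*qij))) j else i    # accept w.p. alpha_ij
}
table(x)/n                             # approximates b/sum(b) = b/55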
- Gibbs sampling, useful for high-dim random variable simulation
- Let X=(X1,...,Xn) be a discrete random vector with prob mass function p(x) = Cg(x), where g(x) is known but C is not, i.e., p is specified only up to a multiplicative constant. We want to simulate from p(x) without knowing C.
- For any i and values xj (j\neq i), generate a random variable X having the prob mass function Pr{X=x}=Pr{Xi=x|Xj=xj, j\neq i}, i.e., update one coordinate at a time.
- It is a Metropolis-Hastings algorithm with chain transition prob constructed as
- Given the present state is x, a coordinate that is equally likely to be any of 1,...,n is chosen
- If coordinate i is chosen, then a random variable X with prob mass function Pr{X=x}=Pr{Xi=x|Xj=xj,j\neq i} is generated. If X=x, then the state y=(x1,...x_{i-1},x,x_{i+1},...,xn) is considered as the candidate next state.
- With x and y given, the transition prob can be written as q(x,y) = (1/n) Pr{Xi=x|Xj=xj, j\neq i} = p(y)/(n Pr(Xj=xj, j\neq i))
- With limiting mass prob p(x), the acceptance prob is \alpha(x,y) = min(1, p(y)q(y,x)/(p(x)q(x,y))) = 1, since p(y)q(y,x) = p(x)q(x,y) = p(x)p(y)/(n Pr(Xj=xj, j\neq i)).
- when utilizing the Gibbs sampler, the candidate state is always accepted as the next state of the chain.
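- A minimal Gibbs sketch in R (hypothetical unnormalized pmf g on {1,...,K}^2); each step redraws one coordinate from its full conditional, which needs g only up to the constant C:
set.seed(2)
K = 4
g = function(x1, x2) x1 + 2*x2         # unnormalized pmf, C unknown
n = 1e5
X = matrix(1, n, 2)
for(m in 2:n){
  X[m,] = X[m-1,]
  i = sample(1:2, 1)                   # pick a coordinate uniformly
  w = if(i==1) g(1:K, X[m,2]) else g(X[m,1], 1:K)  # full-conditional weights
  X[m,i] = sample(1:K, 1, prob=w)      # sample() normalizes w internally
}
## compare empirical and exact marginals of X1
tot = sum(outer(1:K, 1:K, g))
rbind(empirical = as.numeric(table(X[,1]))/n,
      exact = sapply(1:K, function(a) sum(g(a, 1:K)))/tot)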
4.10 Markov Decision Processes
4.11 Hidden Markov Chains
- observed sequence of signals S1, S2, ...
- unobserved sequence of underlying Markov chain states X1, X2, . . .
- signal emission prob: Pr(Sn=s|Xn=j)=P_e(s|j)
- MC transition prob Pr(X_{k+1}=j|X_k=i)=P_t(j|i), and initial state prob Pr(X1=i)=P_0(i) (these are the typical parameters to be inferred)
Inference of MC state and MLE
- Define S^k=(S_1,...,S_k), X^k=(X_1,...,X_k)
- Forward search, define Fk(j) = Pr(S^k,Xk=j)
- Fn(j) = P_e(Sn|j)\sum_i F_{n-1}(i)P_t(j|i).
- F1(i) = P_0(i)P_e(S1|i)
- Pr(S^n) = \sum_i Fn(i) (likelihood); Pr(Xn=j|S^n) = Fn(j)/Pr(S^n)
- Backward search, define Bk(i) = Pr(S_{k+1},...,Sn|Xk=i), with Bn(i)=1
- Bk(i) = \sum_j P_e(S_{k+1}|j)B_{k+1}(j)P_t(j|i)
- B_{n-1}(i) = \sum_j P_t(j|i)P_e(Sn|j).
- Pr(S^n) = \sum_i P_e(S1|i)B1(i)P_0(i)
- Inference of state, Pr(Xk|S^n), Pr(Xk,X_{k+1}|S^n) (evaluation)
- Pr(S^n,Xk=j) = Fk(j)Bk(j); Pr(S^n) = \sum_j Fk(j)Bk(j)
- Pr(Xk=j|S^n) = Fk(j)Bk(j)/\sum_i Fk(i)Bk(i)
- Pr(S^n,Xk=i,X_{k+1}=j) = Fk(i)P_t(j|i)P_e(S_{k+1}|j)B_{k+1}(j)
- Baum-Welch (EM) algorithm for MLE of HMM (learning)
- complete data: S^n (observed), X^n=(X1,...,Xn) (missing)
- observed data likelihood
- Pr(S^n) = \sum_{X^n} Pr(S^n,X^n).
- complete data likelihood
- Pr(S^n,X^n)=P_0(X1)P_e(Sn|Xn)\prod_{k=1}^{n-1}P_e(Sk|Xk)P_t(X_{k+1}|Xk)
- If X^n known, MLE is the sample proportions
- P_t(j|i) = \sum_{k=1}^{n-1}I(X_{k+1}=j,X_k=i)/\sum_{k=1}^{n-1}I(Xk=i)
- P_e(s|i) = \sum_{k=1}^n I(Sk=s,Xk=i)/\sum_{k=1}^nI(Xk=i)
- Expectation of the complete data loglikelihood
- complete data log likelihood can be written as
- \sum_i { I(X1=i)\log P_0(i) + I(Xn=i)\log P_e(Sn|i) + \sum_{k=1}^{n-1} ( I(Xk=i)\log P_e(Sk|i) + \sum_j I(Xk=i,X_{k+1}=j)\log P_t(j|i) ) }
- Its conditional expectation (on S^n) is
- Q = \sum_i { Pr(X1=i|S^n)\log P_0(i) + Pr(Xn=i|S^n)\log P_e(Sn|i) + \sum_{k=1}^{n-1} ( Pr(Xk=i|S^n)\log P_e(Sk|i) + \sum_j Pr(Xk=i,X_{k+1}=j|S^n)\log P_t(j|i) ) }
- \argmax Q -> MLE update
- P_t(j|i) = \sum_{k=1}^{n-1} Pr(X_k=i,X_{k+1}=j|S^n)/\sum_{k=1}^{n-1} Pr(Xk=i|S^n)
- P_e(s|i) = \sum_{k=1}^n I(Sk=s)\Pr(Xk=i|S^n)/\sum_{k=1}^n Pr(Xk=i|S^n)
- The EM algorithm above extends readily to HMMs with more complicated P_e/P_t
- the maximization step is replaced by the corresponding MLE
- see the following Binomial example
Viterbi algorithm: most likely sequence of states given a prescribed sequence of signals (decoding)
- Find X^n maximizing Pr(X^n,S^n)
- Vk(j) = max_{i_1,...,i_{k-1}} Pr(X^{k-1} = (i_1,...,i_{k-1}), Xk=j, S^k)
- target : \max_j Vn(j)
- Vk(j) = P_e(Sk|j)\max_i P_t(j|i)V_{k-1}(i)
- V1(j) = Pr(X1=j,S1) = P_0(j)P_e(S1|j)
A numerical example of HMM computation
## Simulation setup
## MC transition matrix (states 1, 2):
##       to 1  to 2
##   1    0.8   0.2
##   2    0.7   0.3
## Signal given state: s | x=i ~ Binomial(2, p_i)
##   state 1: p = 0.3;  state 2: p = 0.6
set.seed(17)
pes = c(0.3, 0.6)   # emission Binomial(2, p): success probs for states 1, 2
pts = c(0.2, 0.3)   # Pr(next state = 2 | current state = 1, 2)
p0 = 0.25           # Pr(initial state = 2)
M = 200; N = 10     # chain length; number of independent chains
x = s = matrix(0, N,M)   # x: hidden states (1/2); s: observed signals (0/1/2)
for(i in 1:N){
  x[i,1] = rbinom(1,1,p0)+1               # initial state: 2 w.p. p0
  s[i,1] = rbinom(1, 2, pes[x[i,1]])      # signal ~ Binomial(2, pes[state])
  for(k in 2:M){
    x[i,k] = rbinom(1, 1,pts[x[i,k-1]])+1 # next state: 2 w.p. pts[current]
    s[i,k] = rbinom(1, 2,pes[x[i,k]])
  }
}
### forward prob: fpr[j,k] = F_k(j) = Pr(S^k, X_k = j)
fwd = function(s,p0,pes,pts){
  M = length(s)
  fpr = matrix(0, 2,M)
  fpr[1,1] = (1-p0)*dbinom(s[1], 2,pes[1])   # F_1(j) = P_0(j) P_e(s_1|j)
  fpr[2,1] = p0*dbinom(s[1], 2,pes[2])
  for(k in 2:M){
    ## F_k(j) = P_e(s_k|j) * sum_i F_{k-1}(i) P_t(j|i)
    fpr[1,k] = dbinom(s[k],2,pes[1])*(fpr[1,k-1]*(1-pts[1])+ fpr[2,k-1]*(1-pts[2]))
    fpr[2,k] = dbinom(s[k],2,pes[2])*(fpr[1,k-1]*pts[1]+ fpr[2,k-1]*pts[2])
  }
  return(list(fpr=fpr))
}
### backward prob: bpr[i,k] = B_k(i) = Pr(s_{k+1},...,s_M | X_k = i); B_M = 1
bwd = function(s,p0,pes,pts){
  M = length(s)
  bpr = matrix(1, 2,M)
  for(k in (M-1):1){
    ## B_k(i) = sum_j P_e(s_{k+1}|j) B_{k+1}(j) P_t(j|i)
    bpr[1,k] = dbinom(s[k+1],2,pes[1])*bpr[1,k+1]*(1-pts[1]) + dbinom(s[k+1],2,pes[2])*bpr[2,k+1]*pts[1]
    bpr[2,k] = dbinom(s[k+1],2,pes[1])*bpr[1,k+1]*(1-pts[2]) + dbinom(s[k+1],2,pes[2])*bpr[2,k+1]*pts[2]
  }
  return(list(bpr=bpr))
}
## cond prob: cpr1[j,k] = Pr(X_k=j | s); cpr2[i,j,k] = Pr(X_k=i, X_{k+1}=j | s)
cprs = function(s, p0,pes,pts){
  M = length(s)
  bpr = bwd(s, p0,pes,pts)$bpr
  fpr = fwd(s, p0,pes,pts)$fpr
  prsn = sum(bpr[,1]*fpr[,1])          # likelihood Pr(S^M = s)
  cpr1 = bpr*fpr/prsn
  cpr2 = array(0, dim=c(2,2,M-1))
  for(k in 1:(M-1)){
    ## Pr(s, X_k=i, X_{k+1}=j) = F_k(i) P_t(j|i) P_e(s_{k+1}|j) B_{k+1}(j)
    cpr2[1,1, k] = fpr[1,k]*(1-pts[1])*dbinom(s[k+1],2,pes[1])*bpr[1,k+1]
    cpr2[1,2, k] = fpr[1,k]*pts[1]*dbinom(s[k+1],2,pes[2])*bpr[2,k+1]
    cpr2[2,1, k] = fpr[2,k]*(1-pts[2])*dbinom(s[k+1],2,pes[1])*bpr[1,k+1]
    cpr2[2,2, k] = fpr[2,k]*pts[2]*dbinom(s[k+1],2,pes[2])*bpr[2,k+1]
  }
  return(list(cpr1=cpr1,cpr2=cpr2/prsn))
}
## EM (Baum-Welch) MLE of (pts, pes, p0) from one observed sequence s
bwem = function(s, p0,pes,pts, itMax=1e2,eps=1e-3){
  M = length(s)
  it = 1; pdiff = 1
  while((it<itMax)&(pdiff>eps)){
    ### E-step: conditional state/transition probs given s
    pr1 = cprs(s,p0,pes,pts)
    cpr1 = pr1$cpr1; cpr2 = pr1$cpr2
    ### M-step: weighted sample proportions / binomial MLE
    pts1 = pts
    pts1[1] = sum(cpr2[1,2,])/sum(cpr1[1,-M])  # E#{1->2} / E#{visits to 1}
    pts1[2] = sum(cpr2[2,2,])/sum(cpr1[2,-M])
    pes1 = pes
    pes1[1] = sum(cpr1[1,]*s)/sum(cpr1[1,])/2  # weighted Binomial(2,p) MLE
    pes1[2] = sum(cpr1[2,]*s)/sum(cpr1[2,])/2
    p01 = cpr1[2,1]                            # Pr(X1 = 2 | s)
    ### update and check convergence
    pdiff = (mean(abs(pts-pts1)) + mean(abs(pes-pes1)) + abs(p0-p01))/3
    pts=pts1; pes=pes1; p0=p01
    it = it + 1                                # was missing: count iterations
  }
  return(list(pts=pts,pes=pes,p0=p0))
}
## viterbi: vj[j,k] = V_k(j) = max over paths of Pr(X^{k-1}, X_k = j, S^k)
vitb = function(s, p0, pes,pts){
  M = length(s)
  vj = matrix(0, 2,M)
  vj[1,1] = (1-p0)*dbinom(s[1],2,pes[1])   # V_1(j) = P_0(j) P_e(s_1|j)
  vj[2,1] = p0*dbinom(s[1],2,pes[2])
  for(k in 2:M){
    a1 = dbinom(s[k],2,pes[1])
    a2 = dbinom(s[k],2,pes[2])
    ## V_k(j) = P_e(s_k|j) max_i P_t(j|i) V_{k-1}(i)
    vj[1,k] = a1*max((1-pts[1])*vj[1,k-1], (1-pts[2])*vj[2,k-1])
    vj[2,k] = a2*max(pts[1]*vj[1,k-1], pts[2]*vj[2,k-1])
  }
  return(list(vj=vj))
}
## run on one simulated sequence
ks = 1
table(s[ks,])                            # signal counts
bwem(s[ks,], p0,pes,pts)                 # EM estimates from this sequence
cpr1 = cprs(s[ks,], p0,pes,pts)
plot(1:dim(s)[2], x[ks,], type='b')      # true hidden states
points(1:dim(s)[2], apply(cpr1$cpr1, 2, which.max), pch=2, col=2)  # posterior decoding
## viterbi decoding: backtrack the maximizing path, recomputing the
## argmax at each step (instead of storing pointers in vitb)
vj = vitb(s[ks,], p0,pes,pts)$vj
M = length(s[ks,])
vsq = rep(0, M)
vsq[M] = which.max(vj[,M])               # most likely final state
for(k in M:2){
  jk = vsq[k]
  if(jk==1){
    tmp = c((1-pts[1])*vj[1,k-1], (1-pts[2])*vj[2,k-1])
  } else{
    tmp = c(pts[1]*vj[1,k-1], pts[2]*vj[2,k-1])
  }
  vsq[k-1] = which.max(tmp)              # argmax_i P_t(jk|i) V_{k-1}(i)
}
points(1:dim(s)[2], vsq, pch=3, col=3)   # add viterbi path to the plot
table(x[ks,], apply(cpr1$cpr1, 2, which.max))  # truth vs posterior decoding
table(x[ks,], vsq)                             # truth vs viterbi decoding