Exercise 4: Hidden Markov models

Introduction

This exercise uses three Python libraries:

  • pandas — for loading and exploring tabular data.
  • PyTorch — for tensor operations (matrix multiplication, element-wise products). A short tutorial can be found in the Appendix.
  • matplotlib — for plotting results.

If any of these are not installed, run pip install pandas torch matplotlib in your terminal.

In Exercises 1 and 2, we approximated posteriors by sampling. Here, we compute them exactly. In particular, we will derive the classical forward algorithm by parsing the pre-conditioned HMM string diagram from left to right. The algorithm emerges directly from the kernel composition rules, demonstrating how the diagrammatic language can guide us to the right algorithm.

The Hidden Markov Model

Recall the structure of a hidden Markov model (HMM) from the course notes:

A hidden Markov model with initial distribution \(i : I \kernel X\), transition kernel \(k : X \kernel X\), and observation kernel \(o : X \kernel Y\).

The model is built from three kernels: an initial distribution \(i : I \kernel X\) specifying the probability of each starting state, a transition kernel \(k : X \kernel X\) describing how the hidden state evolves, and an observation kernel \(o : X \kernel Y\) producing a noisy observation from the current state. The sum-product form of the joint distribution represented by the diagram is: \[ p(x_t, y_0, \ldots, y_t) = \sum_{x_0, \ldots, x_{t-1}} i(x_0) \left( \prod_{s=0}^{t} o(y_s \mid x_s) \right) \left( \prod_{s=0}^{t-1} k(x_{s+1} \mid x_s) \right) . \]
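To make the sum-product form concrete, here is a brute-force evaluation of \(p(x_t, y_0, \ldots, y_t)\) by explicit enumeration over all hidden paths. This is a sketch with small made-up parameters (the tensors i, K, O below are illustrative, not the exercise's), and it is exponentially slow in \(t\), which is exactly the cost the forward algorithm derived later avoids.

```python
import itertools
import torch

# Toy 2-state, 2-observation HMM (illustrative values, not the exercise parameters)
i = torch.tensor([0.6, 0.4])
K = torch.tensor([[0.7, 0.3],
                  [0.2, 0.8]])
O = torch.tensor([[0.9, 0.1],
                  [0.4, 0.6]])

def joint_brute_force(x_t, ys):
    """Evaluate p(x_t, y_0, ..., y_t) by summing over all paths x_0, ..., x_{t-1}."""
    t = len(ys) - 1
    total = 0.0
    for path in itertools.product(range(2), repeat=t):
        xs = list(path) + [x_t]
        p = i[xs[0]]
        for s in range(t + 1):
            p = p * O[xs[s], ys[s]]          # observation factors o(y_s | x_s)
        for s in range(t):
            p = p * K[xs[s], xs[s + 1]]      # transition factors k(x_{s+1} | x_s)
        total += p.item()
    return total

ys = [0, 1, 0]
# Marginalizing over x_t as well recovers p(y_0, ..., y_t):
p_y = sum(joint_brute_force(x, ys) for x in range(2))
print(p_y)
```

The forward algorithm will compute the same quantity with cost linear in the sequence length instead of exponential.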

Note: Matrix layout conventions

We will store the transition kernel as a matrix \(K\) of shape (n_states, n_states) with entry \(K[x, x'] = k(x' \mid x)\), and the observation kernel as a matrix \(O\) of shape (n_states, n_obs) with entry \(O[x, y] = o(y \mid x)\). Each row is a probability distribution and sums to 1. This convention is chosen so that sequential composition corresponds to matrix multiplication: if \(f : A \kernel B\) and \(g : B \kernel C\) are two kernels with matrices \(F\) and \(G\), then the matrix of \(f \then g\) is the ordinary matrix product \(FG\). In Python, this means we can write f @ g using the @ operator.
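A quick illustration of this convention (the kernel entries below are made up): sequential composition is just @, and the rows of the result still sum to 1, so composing kernels gives a kernel.

```python
import torch

F = torch.tensor([[0.8, 0.2],          # f : A -> B, with |A| = 2, |B| = 2
                  [0.1, 0.9]])
G = torch.tensor([[0.5, 0.3, 0.2],     # g : B -> C, with |B| = 2, |C| = 3
                  [0.1, 0.1, 0.8]])

# Sequential composition f ; g is the ordinary matrix product F @ G
FG = F @ G
print(FG)
print(FG.sum(dim=1))  # each row still sums to 1
```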

Choose your application

The tasks below are stated in general terms. Choose one of the following two applications to work with throughout the exercise. Both involve a 2-state hidden Markov model — the only difference is what the states and observations represent. In a real application, the model parameters would typically be estimated from labeled data, as we practiced in the previous exercise sheet. Here, the parameters are given directly.

Application A: CpG island detection in DNA

In the human genome, the dinucleotide CG (called “CpG”) is relatively rare — most cytosines in a CG context are chemically modified (methylated) and tend to mutate over evolutionary time. However, certain regions called CpG islands resist this modification and retain a high frequency of C and G nucleotides. CpG islands are biologically important: they are found near roughly 70% of human gene promoters and play a key role in gene regulation.

Detecting CpG islands from raw DNA sequence is a classic application of HMMs in bioinformatics (see e.g. Durbin et al. (1998)). The hidden state indicates whether we are currently inside or outside a CpG island, and the observations are the nucleotides (A, C, G, T) at each position.

  • Hidden state \(X = \{0, 1\}\): \(\;0\) = non-island, \(\;1\) = island.
  • Observation \(Y = \{0, 1, 2, 3\}\): \(\;0\) = A, \(\;1\) = C, \(\;2\) = G, \(\;3\) = T.

Model parameters. The transition matrix is “sticky” — the chain tends to stay in each state for long runs, consistent with the fact that CpG islands and non-island regions span hundreds of nucleotides. The observation distributions differ in C/G content: CpG islands are enriched in C and G, while non-island regions favor A and T.

#                   non-island  island
init = tensor([       0.67,     0.33  ])

#             non-island  island
K = tensor([[ 0.99,       0.01  ],    # from non-island
            [ 0.02,       0.98  ]])    # from island

#             A      C      G      T
O = tensor([[ 0.30,  0.20,  0.20,  0.30 ],    # non-island
            [ 0.15,  0.35,  0.35,  0.15 ]])    # island

Test data (generated from the above HMM so we can verify our inference):

  • cpg_test.csv — 500 nucleotide positions without labels (for inference)
  • cpg_test_labels.csv — true hidden states for the test sequence (for validation)

Application B: Volatility regime detection in financial markets

Stock market returns exhibit volatility clustering: calm periods with small daily moves alternate with turbulent periods of large swings. This phenomenon is well-documented and has been modeled using regime-switching models since Hamilton (1989). The hidden state represents the current volatility regime, and the observations are discretized daily returns.

  • Hidden state \(X = \{0, 1\}\): \(\;0\) = low volatility, \(\;1\) = high volatility.
  • Observation \(Y = \{0, 1, 2, 3, 4\}\) represents daily returns, where \(\;0\) = large drop, \(\;1\) = small drop, \(\;2\) = flat, \(\;3\) = small gain, \(\;4\) = large gain. The bin boundaries are \((-\infty, -1.5\%,\, -0.5\%,\, 0.5\%,\, 1.5\%,\, +\infty)\).
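The exercise data comes pre-binned, but if you had raw percentage returns, the discretization could be sketched with torch.bucketize and the boundaries above (the returns tensor here is made up for illustration):

```python
import torch

# Interior bin boundaries in percent: (-inf, -1.5, -0.5, 0.5, 1.5, +inf)
boundaries = torch.tensor([-1.5, -0.5, 0.5, 1.5])

# Hypothetical raw daily returns in percent
returns = torch.tensor([-2.3, -0.8, 0.1, 0.7, 3.0])

# torch.bucketize returns the bin index: 0 = large drop, ..., 4 = large gain
bins = torch.bucketize(returns, boundaries)
print(bins)  # tensor([0, 1, 2, 3, 4])
```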

Model parameters. The low-volatility regime concentrates probability on “flat” and “small gain,” while the high-volatility regime spreads mass more evenly across all bins, including the tails. The transition matrix is sticky: calm periods last ~50 days on average, turbulent periods ~20 days.

#                   low_vol  high_vol
init = tensor([      0.71,    0.29   ])

#             low_vol  high_vol
K = tensor([[ 0.98,    0.02   ],    # from low_vol
            [ 0.05,    0.95   ]])    # from high_vol

#             large_drop  small_drop  flat   small_gain  large_gain
O = tensor([[ 0.02,       0.13,       0.55,  0.23,       0.07 ],  # low_vol
            [ 0.15,       0.20,       0.20,  0.20,       0.25 ]]) # high_vol

Test data (generated from the above HMM so we can verify our inference):

  • vol_test.csv — 500 daily return observations without labels (for inference)
  • vol_test_labels.csv — true volatility regimes for the test sequence (for validation)

Tasks

Set up the model

Task 1. Enter the model parameters for your chosen application as PyTorch tensors and load the test observation sequence. You will need:

  • The initial distribution init as a 1-D tensor of shape (n_states,).
  • The transition matrix K as a 2-D tensor of shape (n_states, n_states).
  • The observation matrix O as a 2-D tensor of shape (n_states, n_obs).
  • The test observations as a 1-D tensor of integer codes.
  • The true hidden states (for later validation).
import torch
import pandas as pd

# State and observation names
cpg_state_names = ["non-island", "island"]
cpg_obs_names = ["A", "C", "G", "T"]
cpg_state_to_idx = {s: i for i, s in enumerate(cpg_state_names)}
cpg_obs_to_idx = {o: i for i, o in enumerate(cpg_obs_names)}
cpg_n_states = len(cpg_state_names)
cpg_n_obs = len(cpg_obs_names)

# Model parameters (given)
cpg_init = torch.tensor([0.67, 0.33])

cpg_K = torch.tensor([
    [0.99, 0.01],
    [0.02, 0.98]
])

cpg_O = torch.tensor([
    [0.30, 0.20, 0.20, 0.30],
    [0.15, 0.35, 0.35, 0.15]
])

print("Initial distribution:", cpg_init)
print("Transition matrix K:")
print(cpg_K)
print("Observation matrix O:")
print(cpg_O)
Initial distribution: tensor([0.6700, 0.3300])
Transition matrix K:
tensor([[0.9900, 0.0100],
        [0.0200, 0.9800]])
Observation matrix O:
tensor([[0.3000, 0.2000, 0.2000, 0.3000],
        [0.1500, 0.3500, 0.3500, 0.1500]])

Load the test data:

cpg_test_df = pd.read_csv("../data/exercise-4/cpg_test.csv")
cpg_test_obs = torch.tensor(cpg_test_df["nucleotide"].map(cpg_obs_to_idx).tolist())

cpg_labels_df = pd.read_csv("../data/exercise-4/cpg_test_labels.csv")
cpg_true_states = cpg_labels_df["state"].map(cpg_state_to_idx).tolist()

print(f"Test sequence length: {len(cpg_test_obs)}")
print(f"First 10 observations: {cpg_test_obs[:10].tolist()}")
Test sequence length: 500
First 10 observations: [0, 3, 0, 3, 1, 3, 1, 1, 1, 3]
import torch
import pandas as pd

# State and observation names
vol_state_names = ["low_vol", "high_vol"]
vol_obs_names = ["large_drop", "small_drop", "flat", "small_gain", "large_gain"]
vol_state_to_idx = {s: i for i, s in enumerate(vol_state_names)}
vol_obs_to_idx = {o: i for i, o in enumerate(vol_obs_names)}
vol_n_states = len(vol_state_names)
vol_n_obs = len(vol_obs_names)

# Model parameters (given)
vol_init = torch.tensor([0.71, 0.29])

vol_K = torch.tensor([
    [0.98, 0.02],
    [0.05, 0.95]
])

vol_O = torch.tensor([
    [0.02, 0.13, 0.55, 0.23, 0.07],
    [0.15, 0.20, 0.20, 0.20, 0.25]
])

print("Initial distribution:", vol_init)
print("Transition matrix K:")
print(vol_K)
print("Observation matrix O:")
print(vol_O)
Initial distribution: tensor([0.7100, 0.2900])
Transition matrix K:
tensor([[0.9800, 0.0200],
        [0.0500, 0.9500]])
Observation matrix O:
tensor([[0.0200, 0.1300, 0.5500, 0.2300, 0.0700],
        [0.1500, 0.2000, 0.2000, 0.2000, 0.2500]])

Load the test data:

vol_test_df = pd.read_csv("../data/exercise-4/vol_test.csv")
vol_test_obs = torch.tensor(vol_test_df["return_bin"].map(vol_obs_to_idx).tolist())

vol_labels_df = pd.read_csv("../data/exercise-4/vol_test_labels.csv")
vol_true_states = vol_labels_df["regime"].map(vol_state_to_idx).tolist()

print(f"Test sequence length: {len(vol_test_obs)}")
print(f"First 10 observations: {vol_test_obs[:10].tolist()}")
Test sequence length: 500
First 10 observations: [1, 4, 2, 2, 2, 2, 2, 2, 3, 1]

The filtering problem

Now that we have a fully specified HMM, we can use it to infer hidden states from new observations. Given observations \(\bar{y}_0, \ldots, \bar{y}_{T-1}\), the filtering distribution is the posterior over the current hidden state given everything observed so far: \[ p(x_t \mid \bar{y}_0, \ldots, \bar{y}_t) \quad \text{for } t = 0, \ldots, T-1. \]

Our goal is to compute this exactly. Following the procedure from the chapter on Conditioning, we remove observed wires, pre-condition affected boxes on the observations, and normalize the resulting diagram. Pre-conditioning \(o\) on the output \(Y = \bar{y}_s\) gives the weight function \(w_s : X \kernel I\) defined by \[ w_s(x) := o_{\mid Y=\bar{y}_s}(x) = o(\bar{y}_s \mid x). \]
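In matrix terms, pre-conditioning simply slices a column of \(O\). For instance, with the CpG observation matrix given earlier:

```python
import torch

O = torch.tensor([[0.30, 0.20, 0.20, 0.30],   # CpG observation matrix from Task 1
                  [0.15, 0.35, 0.35, 0.15]])

# Pre-conditioning o on an observed value y_bar picks out column y_bar of O:
y_bar = 2                    # observed nucleotide G
w = O[:, y_bar]              # w(x) = o(y_bar | x), one entry per hidden state
print(w)                     # tensor([0.2000, 0.3500])
```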

The pre-conditioned diagram represents the unnormalized joint \(p(x_t, \bar{y}_0, \ldots, \bar{y}_t)\):

The pre-conditioned HMM. Each observation kernel \(o\) is pre-conditioned on the observed value \(\bar y_s\).

Therefore, to compute the filtering distribution, it is sufficient to evaluate the pre-conditioned diagram using the kernel composition rules and normalize the result.

Task 2. Convince yourself that evaluating this diagram yields exactly the unnormalized filtering distribution \(p(x_t, \bar{y}_0, \ldots, \bar{y}_t)\).

We simply substitute \(\bar{y}_0, \ldots, \bar{y}_t\) into the sum-product form of the joint distribution:

\[p(x_t, \bar y_0, \ldots, \bar y_t) = \sum_{x_0, \ldots, x_{t-1}} i(x_0) \prod_{s=0}^{t} o(\bar y_s \mid x_s) \prod_{s=0}^{t-1} k(x_{s+1} \mid x_s).\]

Applying the definition of the observation weight \(w_s(x) := o(\bar{y}_s \mid x)\), we can rewrite this as: \[ p(x_t, \bar y_0, \ldots, \bar y_t) = \sum_{x_0, \ldots, x_{t-1}} i(x_0) \prod_{s=0}^{t} w_s(x_s) \prod_{s=0}^{t-1} k(x_{s+1} \mid x_s). \]

This is exactly the sum-product form of the proposed diagram.

Deriving the forward algorithm from the string diagram

We now derive the key algorithm for filtering HMMs. The idea is to evaluate the pre-conditioned diagram from left to right using the kernel composition rules.

At each step, we maintain a belief vector \(\alpha_t\) with entries: \[ \alpha_t(x) := p(x_t = x,\, \bar{y}_0, \ldots, \bar{y}_t). \]

The filtering distribution is obtained by normalizing: \[ p(x_t \mid \bar{y}_0, \ldots, \bar{y}_t) = \frac{\alpha_t(x)}{\sum_{x'} \alpha_t(x')}. \]

Task 3a. Derive the forward recursion — express \(\alpha_{t+1}\) in terms of \(\alpha_{t}\) — by interpreting it as a left-to-right composition in the HMM string diagram. The starting point is the belief vector \(\alpha_t\), viewed as a kernel \(I \kernel X\) entering the pre-conditioned diagram from the left.

Identify the two composition operations that connect \(\alpha_t\) to the next time step in the pre-conditioned diagram. The first involves the transition kernel \(k\), and the second involves the observation weight \(w_{t+1}\). What type of composition (sequential or copy) does each one correspond to?

We read the recursion directly from the pre-conditioned diagram by identifying the two operations that advance from time \(t\) to time \(t+1\).

Step 1: Sequential composition with \(k\) (predict). The belief vector \(\alpha_t\) is connected to the transition kernel \(k\) by an internal wire over which we sum. This is sequential composition \((\alpha_t \then k)\), which propagates the belief forward through the transition:

In components: \((\alpha_t \then k)(x') = \sum_{x} k(x' \mid x)\, \alpha_t(x)\).

Step 2: Copy composition with \(w_{t+1}\) (update). The state wire at time \(t{+}1\) is copied: one copy feeds into the observation weight \(w_{t+1}\), and the other continues as the output. This is the copy composition of \((\alpha_t \then k)\) with \(w_{t+1}\):

Copy composition (denoted \(\odot\)) multiplies element-wise, giving: \[ ((\alpha_t \then k) \odot w_{t+1})(x') = (\alpha_t \then k)(x') \cdot w_{t+1}(x'). \]

We are now back in the situation we started with, but one time step to the right. The resulting vector is exactly \(\alpha_{t+1}\). The forward recursion is therefore: \[ \boxed{\alpha_{t+1} = (\alpha_t \then k) \odot w_{t+1}.} \]

For the base case, the same logic applies at \(t = 0\): the initial distribution \(i\) is copy-composed with \(w_0\), giving \(\alpha_0 = i \odot w_0\).

Task 3b. Express the forward recursion for \(\alpha_{t+1}\) in terms of matrix and element-wise multiplication. Use the convention that \(K\) is the transition matrix with entries \(K[x, x'] = k(x' \mid x)\) and \(O\) is the observation matrix with entries \(O[x, y] = o(y \mid x)\).

We translate each composition operation into a tensor operation, treating \(\alpha_t\) as a row vector.

  1. Sequential composition \(\alpha_t \then k\) is a vector–matrix product. Expanding the definition: \[ (\alpha_t \then k)(x') = \sum_{x} k(x' \mid x)\, \alpha_t(x) = \sum_{x} \alpha_t(x)\, K[x, x']. \] This is exactly the row-vector–matrix product \(\alpha_t K\).

  2. Copy composition with \(w_{t+1}\) is element-wise multiplication. The observation weight vector \(w_{t+1}\) has entries \(w_{t+1}(x') = o(\bar{y}_{t+1} \mid x') = O[x', \bar{y}_{t+1}]\). Copy composition multiplies entrywise: \[ \alpha_{t+1}(x') = (\alpha_t \then k)(x') \cdot w_{t+1}(x') \] This is the Hadamard product (element-wise multiplication) of the vectors \(\alpha_t K\) and \(w_{t+1}\). We write \(\odot\) for the Hadamard product: \((u \odot v)(x) := u(x) \cdot v(x)\).

Putting these together, the forward recursion in matrix form is: \[ \boxed{\alpha_{t+1} = (\alpha_t K) \odot w_{t+1},} \] where \(w_{t+1} = O[\,\cdot\,, \bar{y}_{t+1}]\) is column \(\bar{y}_{t+1}\) of \(O\), read as a row vector. The base case is \(\alpha_0 = i \odot w_0\).
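A single step of this recursion, written out numerically with the CpG parameters from Task 1 (the observed values, C followed by G, are chosen just for illustration):

```python
import torch

init = torch.tensor([0.67, 0.33])
K = torch.tensor([[0.99, 0.01],
                  [0.02, 0.98]])
O = torch.tensor([[0.30, 0.20, 0.20, 0.30],
                  [0.15, 0.35, 0.35, 0.15]])

y0, y1 = 1, 2  # suppose we observe C, then G

alpha0 = init * O[:, y0]            # base case: i elementwise-times w_0
alpha1 = (alpha0 @ K) * O[:, y1]    # recursion: (alpha_0 K) elementwise-times w_1
print(alpha0, alpha1)
print(alpha1 / alpha1.sum())        # filtering distribution p(x_1 | y_0, y_1)
```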

Implement the forward algorithm

Task 4. Implement the forward algorithm in PyTorch. Your function should take the initial distribution, transition matrix, observation matrix, and a sequence of observations, and return the filtering distribution \(p(x_t \mid y_0, \ldots, y_t)\) at each time step.

Note: Numerical stability

The unnormalized \(\alpha_t\) values shrink exponentially with \(t\) (each step multiplies by probabilities \(< 1\)). For long sequences, they will underflow to zero. However, since the filtering distribution is obtained by normalizing \(\alpha_t\), we are free to rescale \(\alpha_t\) by any positive constant without affecting the result — only the direction of the vector matters, not its magnitude. In particular, we can normalize \(\alpha_t\) at every step and use the normalized version as input to the next recursion. The normalized vector is exactly the filtering distribution \(p(x_t \mid y_0, \ldots, y_t)\), which is what we want to compute anyway.
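A minimal sketch of why the rescaling matters (the kernel and weight vector below are made up): without normalization the belief vector underflows to zero in float32, while the per-step normalized version remains a well-defined distribution.

```python
import torch

alpha = torch.tensor([0.5, 0.5])
K = torch.tensor([[0.9, 0.1],
                  [0.1, 0.9]])
w = torch.tensor([0.3, 0.2])   # a fixed observation weight, for illustration

# Without rescaling: each step multiplies the magnitude by probabilities < 1
unnormalized = alpha.clone()
for _ in range(500):
    unnormalized = (unnormalized @ K) * w
print(unnormalized)            # underflows to tensor([0., 0.]) in float32

# With rescaling at every step: only the direction of the vector is kept
normalized = alpha.clone()
for _ in range(500):
    normalized = (normalized @ K) * w
    normalized = normalized / normalized.sum()
print(normalized)              # a proper distribution at every step
```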

Write a function forward(init, K, O, observations) that returns a tensor of shape (T, n_states) containing the filtering distribution at each time step.

The key PyTorch operations are:

  • * — element-wise multiplication of tensors of the same shape.
  • @ — matrix multiplication (or vector–matrix product when the left operand is 1-D).
  • tensor.sum() — sum of all elements.
  1. Initialization. Create a tensor filtering of shape (T, n_states) to store the results. Set T = len(observations).

  2. Base case. Compute the observation weight vector w_0 = O[:, observations[0]] (column observations[0] of the observation matrix) and form alpha = init * w_0 (element-wise multiplication). Normalize: alpha = alpha / alpha.sum(). Store filtering[0] = alpha.

  3. Recursion. For each t in 1, ..., T-1:

    1. Predict: compute alpha = alpha @ K (row vector times transition matrix). Here @ is Python’s matrix multiplication operator.
    2. Update: compute alpha = alpha * O[:, observations[t]] (element-wise multiplication with the observation weight vector).
    3. Normalize: compute alpha = alpha / alpha.sum().
    4. Store filtering[t] = alpha.
  4. Return filtering.

def forward(init, K, O, observations):
    """Run the forward algorithm.

    Args:
        init: initial distribution, shape (n_states,)
        K:    transition matrix, shape (n_states, n_states), K[x, x'] = k(x'|x)
        O:    observation matrix, shape (n_states, n_obs), O[x, y] = o(y|x)
        observations: list/tensor of observation indices, length T

    Returns:
        filtering: tensor of shape (T, n_states),
                   filtering[t] = p(x_t | y_0, ..., y_t)
    """
    T = len(observations)
    n_states = init.shape[0]
    filtering = torch.zeros(T, n_states)

    # Base case: alpha_0 = i * w_0
    alpha = init * O[:, observations[0]]
    alpha = alpha / alpha.sum()  # normalize
    filtering[0] = alpha

    for t in range(1, T):
        # Predict: propagate through transition (row vector times matrix)
        alpha = alpha @ K

        # Update: weight by observation likelihood
        alpha = alpha * O[:, observations[t]]

        # Normalize
        alpha = alpha / alpha.sum()
        filtering[t] = alpha

    return filtering
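A quick sanity check of the implementation (the function body is repeated here so the snippet runs on its own): every row of the output should be a probability distribution, and a long run of C/G observations, which the island state emits more often, should push the belief toward the island state. The short observation sequences are made up for the check.

```python
import torch

def forward(init, K, O, observations):
    # Same forward algorithm as above, repeated for self-containment.
    T = len(observations)
    filtering = torch.zeros(T, init.shape[0])
    alpha = init * O[:, observations[0]]
    alpha = alpha / alpha.sum()
    filtering[0] = alpha
    for t in range(1, T):
        alpha = (alpha @ K) * O[:, observations[t]]
        alpha = alpha / alpha.sum()
        filtering[t] = alpha
    return filtering

# CpG parameters from Task 1
init = torch.tensor([0.67, 0.33])
K = torch.tensor([[0.99, 0.01], [0.02, 0.98]])
O = torch.tensor([[0.30, 0.20, 0.20, 0.30], [0.15, 0.35, 0.35, 0.15]])

# Check 1: every row of the output is a probability distribution
filt = forward(init, K, O, [1, 2, 1, 0, 3])
print(filt.sum(dim=1))

# Check 2: twenty C/G observations in a row favor the island state
filt_cg = forward(init, K, O, [1, 2] * 10)
print(filt_cg[-1])
```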

Run inference on the test sequence

Task 5. Run the forward algorithm on the test observation sequence using the model parameters from Task 1. For each time step \(t\), extract the posterior probability of being in state 1.

Plot this posterior probability over time. On the same plot, overlay the true hidden states from the labels file (as a step function or shaded regions). Discuss:

  • Does the model track the hidden state accurately?
  • Where does it do well? Where does it struggle?
  • What happens at the boundaries when the hidden state switches?
  1. Run the forward algorithm. Call forward(init, K, O, test_obs) with the model parameters and test observations from Task 1. This returns a tensor of shape (T, n_states).

  2. Extract the posterior for state 1. Index the filtering result as filtering[:, 1] to get a 1-D tensor of length \(T\).

  3. Plot. Use matplotlib to create a figure. Two useful functions are ax.fill_between (to shade the regions where the true state is 1) and ax.plot (to draw the posterior curve on top).

Run the forward algorithm (using the model parameters and test data from Task 1):

cpg_filtering = forward(cpg_init, cpg_K, cpg_O, cpg_test_obs)
cpg_p_state1 = cpg_filtering[:, 1]

Plot the filtering distribution against the true hidden states:

Code
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 3.5))
T = len(cpg_p_state1)
t_axis = range(T)

ax.fill_between(
    t_axis,
    0,
    1,
    where=[s == 1 for s in cpg_true_states],
    alpha=0.25,
    color="orange",
    label="True island",
)
ax.plot(
    t_axis,
    cpg_p_state1,
    linewidth=0.8,
    color="steelblue",
    label="$p(\\mathrm{island} \\mid y_{0:t})$",
)

ax.set_xlabel("Time step $t$")
ax.set_ylabel("Posterior probability")
ax.set_ylim(-0.05, 1.05)
ax.legend(loc="upper left")
ax.set_title("Forward algorithm: filtering distribution over time")
plt.tight_layout()
plt.show()
Figure 1: CpG application — posterior probability of being in a CpG island (blue) overlaid with the true state (orange shading).

The model typically tracks the hidden state well once it has accumulated enough evidence — the posterior probability rises (or falls) decisively within the interior of each segment. At boundaries, there is a brief transition period where the model is uncertain.

Run the forward algorithm (using the model parameters and test data from Task 1):

vol_filtering = forward(vol_init, vol_K, vol_O, vol_test_obs)
vol_p_state1 = vol_filtering[:, 1]

Plot the filtering distribution against the true hidden states:

Code
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 3.5))
T = len(vol_p_state1)
t_axis = range(T)

ax.fill_between(
    t_axis,
    0,
    1,
    where=[s == 1 for s in vol_true_states],
    alpha=0.25,
    color="orange",
    label="True high_vol",
)
ax.plot(
    t_axis,
    vol_p_state1,
    linewidth=0.8,
    color="steelblue",
    label="$p(\\mathrm{high\\_vol} \\mid y_{0:t})$",
)

ax.set_xlabel("Time step $t$")
ax.set_ylabel("Posterior probability")
ax.set_ylim(-0.05, 1.05)
ax.legend(loc="upper left")
ax.set_title("Forward algorithm: filtering distribution over time")
plt.tight_layout()
plt.show()
Figure 2: Volatility application — posterior probability of the high-volatility regime (blue) overlaid with the true regime (orange shading).

The model typically tracks the hidden state well once it has accumulated enough evidence. The length of the transition at boundaries depends on how different the observation distributions are in the two regimes.

Evaluate inference quality

The forward algorithm returns a full probability distribution over states at each time step, not just a single guess. We consider two ways to evaluate its quality — one that reduces the posterior to a point estimate, and one that evaluates the posterior directly.

Task 6. Compute the following two metrics and interpret the results.

  1. MAP accuracy. The simplest summary of the posterior is the maximum a posteriori (MAP) estimate: at each time step, predict the state \(\hat x_t\) with the highest posterior probability. The accuracy is the fraction of time steps where this prediction matches the true state \(x_t^*\): \[ \text{Accuracy} = \frac{\#\{t : \hat{x}_t = x_t^*\}}{T}, \qquad \hat{x}_t := \operatorname*{argmax}_{x}\, p(x_t = x \mid \bar{y}_{0:t}). \] This is easy to interpret but discards the uncertainty information in the posterior. A prediction with 51% confidence counts the same as one with 99%.

  2. Log-loss. To evaluate the full posterior, we use the log-loss (also called cross-entropy loss): \[ \mathcal{L} = -\frac{1}{T} \sum_{t=0}^{T-1} \log p(x_t^* \mid \bar{y}_{0:t}), \] where \(x_t^*\) is the true state at time \(t\). Its main value is in comparing models: a lower log-loss means one model’s posteriors are closer to the truth than another’s.
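To calibrate the log-loss numbers, it helps to know what an uninformative predictor scores: always outputting the uniform distribution over two states gives \(-\log(1/2) = \log 2 \approx 0.693\) nats per step. Any useful filtering posterior should come in below this. A small sketch (the true_probs values are made up):

```python
import math
import torch

# Baseline: a predictor that always outputs the uniform distribution over 2 states
uniform_log_loss = -math.log(0.5)
print(f"{uniform_log_loss:.4f} nats")   # 0.6931

# Log-loss for some hypothetical posterior probabilities of the true states
true_probs = torch.tensor([0.9, 0.8, 0.6, 0.95])
log_loss = -torch.log(true_probs).mean().item()
print(f"{log_loss:.4f} nats")
```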

  1. MAP predictions. Use tensor.argmax(dim=1) on the filtering tensor to get the most likely state at each time step. This returns a 1-D tensor of predicted state indices.

  2. Accuracy. Compare the predictions to the true states element-wise: (predicted == true_states) gives a boolean tensor. Convert to float with .float(), take the .mean(), and extract the Python number with .item().

  3. Log-loss. You need the predicted probability assigned to the true state at each time step. The expression filtering[range(T), true_states] picks out filtering[t, true_states[t]] for every t in one vectorized operation. Once you have this 1-D tensor of probabilities, apply torch.log(), negate, and take the .mean().

cpg_true_tensor = torch.tensor(cpg_true_states)

# (a) MAP accuracy
cpg_predicted = cpg_filtering.argmax(dim=1)
cpg_accuracy = (cpg_predicted == cpg_true_tensor).float().mean().item()
print(f"MAP accuracy: {100*cpg_accuracy:.1f}%")

# (b) Log-loss
cpg_true_probs = cpg_filtering[range(len(cpg_true_states)), cpg_true_states]
cpg_log_loss = -torch.log(cpg_true_probs).mean().item()
print(f"Log-loss:     {cpg_log_loss:.4f} nats")
MAP accuracy: 78.0%
Log-loss:     0.5061 nats
vol_true_tensor = torch.tensor(vol_true_states)

# (a) MAP accuracy
vol_predicted = vol_filtering.argmax(dim=1)
vol_accuracy = (vol_predicted == vol_true_tensor).float().mean().item()
print(f"MAP accuracy: {100*vol_accuracy:.1f}%")

# (b) Log-loss
vol_true_probs = vol_filtering[range(len(vol_true_states)), vol_true_states]
vol_log_loss = -torch.log(vol_true_probs).mean().item()
print(f"Log-loss:     {vol_log_loss:.4f} nats")
MAP accuracy: 82.4%
Log-loss:     0.4018 nats