The Viterbi algorithm is an interesting example of a dynamic programming algorithm that lies between memoized Fibonacci and Levenshtein distance in complexity. Like any other dynamic programming algorithm it is characterized by having an asymptotic complexity way lower than the naïve approach thanks to being able to divide the task at hand into subproblems that tend to be repeated and are cached.
In this case, the context of the Viterbi algorithm is a Markov chain that models some system as a directed graph of states with some transition probabilities. Reimplementing the example at the relevant wikipedia page in Scala, let’s say that a patient can be healthy or afflicted by a fever:
However, we are not in a hospital and we cannot know for sure if the person is really healthy or not, but we can observe the symptoms the person experiments day by day:
The Viterbi algorithm can be used only when we made some simplifying assumptions. First, the link between the health status and the symptoms can be modelled as a simple discrete probability distribution. Second, the health status only depends on the status of the previous day (no more memory) by another simple discrete probability. This is known as a hidden Markov chain and it is amenable to beautiful^{1} charts:
Note how states are hidden because of the indirection layer that the symptoms represent.
We can represent the same setup not so beautifully by means of some DSLish modelling in Scala:
Now we have all the pieces in place to really explain what the Viterbi algorithm does: given a sequence of symptoms, it searches for the most likely sequence of hidden states. For instance:
We might solve this problem by brute force by just running all the possible sequences of states (2^{3}, s^{n} in general) to know their probabilities, then finding the probability of each sequence to produce the actual observations. Too bad we are talking about exponential complexity making it impractical except for the smallest cases.
The Viterbi search runs in just O(ns^{2}) where n is the length of the sequence and s is the number of possible hidden states, a polynomial bargain!
The trick, as in other dynamic programming algorithms is avoiding rework while doing the search. The central fact making it possible is the Markov memoryless property: state transitions depend just on the previous state. This means that we can decompose the problem into:
 Computing the most probable paths of length n1 ending in each one of the possible states.
 Finding out what is the most likely extension of a path getting to each one of the possible states.
If we memoize this recursive function we get that attractive polynomial execution time.
An interesting point about the memoryless property of the Markov chain is that new evidence can change our interpretation of the previous states even if each state depends just on the previous state. Let’s illustrate it with a chain with different, more extreme probabilities:
In this scenario, having low fever for a day is not evidence strong enough to think that the patient is ill:
However, if there is high fever the third day our interpretation of what is most likely changes:
The implementation of the algorithm is surprisingly brief if you have a monadic discrete probability class to support it. Check the code, or better, implement it yourself to solidify your understanding of it.

At least it is easier to grasp than loads of tables with numbers. ↩