An intuitive understanding of the Huffman tree and Huffman coding, and a proof that Huffman coding is the optimal variable-length coding.
Course Content#
Prerequisite Knowledge#
- What is coding
- Where did you first hear about coding? ASCII code
- 'a' = 97 in decimal = 0110 0001 in binary
- '0' = 48 in decimal = 0011 0000 in binary
- 8-bit binary code, one byte
- Note: In computers, all information is stored in binary
- The purpose of coding: mapping from human characters to computer characters
- Example: Impact of coding length
- There is a piece of information in the computer: "aa00", its ASCII code is: 01100001 01100001 00110000 00110000
- At this time, the information is transmitted from one computer to another
- In the network, the smallest unit of transmission is a bit, so "aa00" needs to be transmitted as 32 bits
- Assuming the computer's network speed is 32 bit/s, the transmission takes 1 s
- 👇
- Now consider a special scenario: only four characters ever need to be transmitted: a, b, 0, 1
- A new coding method can be designed [Pirate Class Coding - Fixed-length coding]
- Distinguish them with 2-bit binary code - a: 00, b: 01, 0: 10, 1: 11
- At this time, the representation of "aa00" only requires 8 bits
- Therefore, with the network speed unchanged, the transmission now takes only 0.25 s [see the sketch after this list]
- [PS] Zoom's encoding and decoding are more refined than Tencent Meeting's, so under the same network speed Zoom gives a better experience.
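A minimal sketch of the fixed-length scheme above, assuming the 2-bit table a→00, b→01, 0→10, 1→11 from the example (variable names are only illustrative):

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    // 2-bit fixed-length codes for the special four-character scenario
    std::unordered_map<char, std::string> code = {
        {'a', "00"}, {'b', "01"}, {'0', "10"}, {'1', "11"}};
    std::string msg = "aa00", bits;
    for (char c : msg) bits += code[c];
    // 8 bits here, versus 32 bits for the 8-bit ASCII encoding
    std::cout << bits << " (" << bits.size() << " bits)" << std::endl;
    return 0;
}
```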
Fixed-length and Variable-length Coding#
- Fixed-length coding - the coding length is the same for each character
- ASCII code and Pirate Class coding in specific scenarios are both fixed-length coding
- [PS] UTF-8 is variable-length coding; UTF-32 is fixed-length coding (UTF-16 is mostly 2 bytes per character but is also variable-length because of surrogate pairs) [self-study]
- Variable-length coding - the coding length is different for each character
- An optimal variable-length coding is never worse than fixed-length coding
- Fixed-length coding is a special case of variable-length coding
- 【Extended Understanding】
- A variable-length coding may turn out to be fixed-length
- Variable-length coding needs to be optimized for specific scenarios [the occurrence probability of transmitted characters needs to be pre-counted]
- Coding can be seen as a protocol that is predetermined and cannot be dynamically changed
- Coding and decoding need to be consistent
Application Scenarios of Variable-length Coding#
- Specific scenario
- Only four characters: ab01
- The probability of each character appearing is different - a: 0.8, b: 0.05, 0: 0.1, 1: 0.05
- Average coding length
- avg(l) = ∑ li × pi
- li: the coding length of the i-th character
- pi: the occurrence probability of the i-th character
- Actually, it is the expected value of the coding length
- Example
- Assuming the average coding length is 1.16, then to transmit 100 characters, 116 bits need to be transmitted [estimated]
- The average coding length of Pirate Class coding [see above]: avg(l) = 2 × ∑ pi = 2
- The average coding length of fixed-length coding is fixed
- New · Pirate Class Coding
- a: 1
- b: 01 [Note: it cannot start with 1, because any code starting with 1 would conflict with a during decoding, ❌ prefix overlap]
- 0: 000
- 1: 001
- At this time, the average coding length is: 1 * 0.8 + 2 * 0.05 + 3 * 0.1 + 3 * 0.05 = 1.35
- So, to transmit 100 characters, only about 135 bits need to be transmitted [estimated; see the sketch after this list]
- [PS] The character with a higher probability corresponds to a shorter coding length
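A small sketch that evaluates avg(l) = ∑ li × pi for the New Pirate Class Coding above (the probabilities and code lengths are taken from the example; everything else is only illustrative):

```cpp
#include <cstdio>

int main() {
    // New Pirate Class Coding: a = 1, b = 01, 0 = 000, 1 = 001
    double p[] = {0.8, 0.05, 0.1, 0.05};  // probabilities of a, b, 0, 1
    int    l[] = {1, 2, 3, 3};            // corresponding code lengths
    double avg = 0;
    for (int i = 0; i < 4; ++i) avg += p[i] * l[i];
    std::printf("avg(l) = %.2f\n", avg);  // prints 1.35
    return 0;
}
```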
Huffman Coding#
- First, calculate the probability of each character
- Build a Huffman tree with n characters
- Each character falls on a leaf node [no prefix overlap]
- Read off each character's code along the path from the root, labeling left edges 0 and right edges 1
- Example of building a tree [a worked trace follows after this list]
- Mnemonic for building the tree: repeatedly take the two nodes with the smallest probabilities and merge them into a subtree [the new parent node's probability is the sum of the two]
- ⭐ Because all characters fall on leaf nodes, no code is a prefix of another code
- No conflicts
- At this time, the average coding length is 1 * 0.8 + 3 * 0.05 + 2 * 0.1 + 3 * 0.05 = 1.3
- Conclusion: Huffman coding is the optimal variable-length coding [proof is given below]
- [PS]
- Pay more attention to the length, not the left and right of 01
- If you want to pay attention, consider which one has a higher transmission cost between 0 and 1
- The Huffman tree is not unique, and the order can be arbitrary when the probabilities are equal
- Huffman coding can also degenerate into fixed-length coding [similarly, for example, coding four characters with equal probability (0.25)]
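A worked trace of the construction for the probabilities above (one possible merge order and bit assignment; ties and the 0/1 orientation can differ, only the lengths matter):

```latex
\begin{aligned}
&\text{merge } b\,(0.05) \text{ and } 1\,(0.05) &&\rightarrow\ x\,(0.10)\\
&\text{merge } x\,(0.10) \text{ and } 0\,(0.10) &&\rightarrow\ y\,(0.20)\\
&\text{merge } y\,(0.20) \text{ and } a\,(0.80) &&\rightarrow\ \text{root}\,(1.00)\\
&\text{codes (left 0, right 1): } a = 1,\quad 0 = 00,\quad b = 010,\quad 1 = 011\\
&\operatorname{avg}(l) = 1\times0.8 + 2\times0.1 + 3\times0.05 + 3\times0.05 = 1.3
\end{aligned}
```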
Proof of Huffman Optimality#
[In variable-length coding, optimality means the minimum average coding length]
Steps: ① Convert the representation of the average coding length; ② Solve the optimal solution of the formula
Thinking Transformation#
- In the coding tree, if a node is assigned to a character [the red node in the figure], then all the leaf nodes below it must be pruned
- Otherwise, conflicts will occur
- [Intuitively, and in fact] the nodes chosen by Huffman coding together cover all the leaf nodes
- Assume that there are four characters, and their coding results are as follows [red nodes]
- It can be seen that a node on the l-th level of a tree of height H [levels numbered from 0] covers 2 ^ (H - l) leaf nodes
- The l in the l-th level actually represents the coding length
- The higher the node, the more leaf nodes it covers, and the lower the node, the opposite
- 👇
- There is a mapping between the coding length li [left] and the number of leaf nodes covered by the corresponding node [right], as follows:
- li → 2 ^ (H - li)
Implicit Constraint#
- The leaves covered by all the codeword nodes cannot exceed the 2 ^ H leaves of the full tree; dividing both sides of that inequality by 2 ^ H gives the implicit constraint ∑ 2 ^ (-li) <= 1 [written out below]
- [PS] For a Huffman tree this inequality in fact holds with equality, but let's not rush to that
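Written out, with li denoting the coding length of the i-th character and H the height of the full tree (this is the counting argument the figure illustrates):

```latex
\sum_{i=1}^{n} 2^{\,H - l_i} \;\le\; 2^{H}
\qquad\xrightarrow{\ \div\, 2^{H}\ }\qquad
\sum_{i=1}^{n} 2^{-l_i} \;\le\; 1
```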
Transformation of Proof Target#
- Goal: minimize the average coding length Σ pi * li; substitute li = -li'
- The constraint comes from the above tree structure [i.e., implicit constraint]
- Let Ⅰi = 2 ^ li' [i.e. Ⅰi = 2 ^ (-li)]
- That is, transform the target and the constraint once more
- 【Transformation Result】
- Target: Minimize -(p1·log₂Ⅰ1 + p2·log₂Ⅰ2 + p3·log₂Ⅰ3 + ... + pn·log₂Ⅰn)
- Constraint: Ⅰ1 + Ⅰ2 + Ⅰ3 + ... + Ⅰn <= 1 [both are written out below]
- [PS]
- It looks a bit like entropy? In fact the proof essentially argues from the perspective of entropy, and it also leads to the concept of cross-entropy
- What is the purpose of the transformation?
- Derive the relationship between the equality of the constraint condition and the minimization of the target
- There is also a limiting condition: the sum of probabilities is 1
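The transformed problem, writing I_i for the Ⅰi above (so I_i = 2^(li') = 2^(-li) and li = -log₂ I_i):

```latex
\min_{I_1,\dots,I_n}\ -\sum_{i=1}^{n} p_i \log_2 I_i
\qquad\text{s.t.}\qquad
\sum_{i=1}^{n} I_i \le 1,
\qquad
\sum_{i=1}^{n} p_i = 1
```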
Further Narrowing of the Constraint Condition#
- Proof: When the target function reaches the minimum, the constraint condition must be = 1
- Proof by contradiction: suppose that at the minimum the constraint is strict, i.e. ∑Ⅰi < 1
- Let Ⅰx' = 1 - ∑Ⅰi, which is nonnegative [by the constraint condition]
- Ⅰx' can be added onto any single term Ⅰi
- As long as Ⅰx' is not 0, adding it makes that Ⅰi larger and the target smaller, so the target was not at its minimum; it can still be made smaller
- So the constraint condition must be = 1
Find the Extreme Value by Setting the Partial Derivative to 0#
- Reduce one unknown: Ⅰn = 1 - (Ⅰ1 + Ⅰ2 + ... + Ⅰn-1)
- When does the target reach the minimum: take the partial derivative
- Setting each partial derivative to 0 gives pi / Ⅰi = pn / Ⅰn for every i
- These equations hold when the target is at its minimum
- Let p1/Ⅰ1 = p2/Ⅰ2 = ... = pn/Ⅰn = k; combining the constraint ∑Ⅰi = 1 with ∑pi = 1 gives k = 1
- ⭐ So pi = Ⅰi, i.e. the optimal coding length is li = -log₂ pi [the calculation is written out below]
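The calculation sketched in this list, again writing I_i for Ⅰi: substitute I_n = 1 - (I_1 + ... + I_{n-1}) into the target f = -∑ pi·log₂ I_i and set each partial derivative to zero:

```latex
\frac{\partial f}{\partial I_i}
 = -\frac{p_i}{I_i \ln 2} + \frac{p_n}{I_n \ln 2} = 0
 \;\Longrightarrow\;
 \frac{p_i}{I_i} = \frac{p_n}{I_n} = k
 \;\Longrightarrow\;
 k = \frac{\sum_i p_i}{\sum_i I_i} = \frac{1}{1} = 1
 \;\Longrightarrow\;
 p_i = I_i = 2^{-l_i}
```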
❗ Thinking#
The relationship between average coding length, entropy, and cross-entropy
- pi = Ⅰi
- Ⅰi is related to the coding length li, that is, the probability and length are closely related
- The larger the probability, the shorter the coding length
- Substituting pi = Ⅰi back into the average coding length gives ∑ pi × (-log₂ pi), which is exactly the formula for entropy [see the formulas after this list]
- In fact, entropy is the minimum average number of bits required to represent the information in a system
- Similarly, it can be understood as the average coding length [⭐ The ideas are connected]
- The longer the average coding length → the more characters in the system → the more states of the system → the more chaotic the system
- A larger entropy indicates a more chaotic coding system
- 【Cross-Entropy】
- Treat {pi} and {Ⅰi} as two probability distributions
- Cross-entropy describes how close two probability distributions are to each other
- Since Ⅰi = 2 ^ (-li), the cross-entropy of {pi} against {Ⅰi} is exactly the average coding length ∑ pi × li of that code
- The proof is not very rigorous
- Because the Huffman process is actually discrete, but this proof process is continuous
- For example, pi = Ⅰi implies li = -log₂ pi, which may be a decimal, while an actual coding length must be an integer
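The formulas being referred to, writing q_i for the distribution implied by the code (q_i = 2^(-li), the Ⅰi above):

```latex
\text{entropy (minimum average coding length): }\;
H(p) = -\sum_i p_i \log_2 p_i
\qquad
\text{cross-entropy: }\;
H(p, q) = -\sum_i p_i \log_2 q_i = \sum_i p_i\, l_i \;\ge\; H(p)
```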
Code Demonstration#
Huffman Tree#
- The structure is based on a tree
- 【Key Operations】
- Merge nodes to establish relationships between nodes [conversion between arrays and nodes]
- Find the two smallest probabilities [can be optimized using a priority queue]
- Recursively extract the coding [only leaf nodes correspond to character codes]
- When reading data, it is recommended to store the characters in a character array to reduce mistakes [a sketch of these operations follows below]
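A minimal sketch of the key operations listed above, using std::priority_queue to find the two smallest probabilities; the structure and names are only illustrative, not the course's reference code:

```cpp
#include <iostream>
#include <map>
#include <queue>
#include <string>
#include <vector>

// A tree node: leaves carry a character, internal nodes only a probability.
struct Node {
    double p;
    char ch;  // meaningful only for leaves
    Node *left = nullptr, *right = nullptr;
    bool leaf() const { return !left && !right; }
};

// Build the Huffman tree: repeatedly take the two nodes with the smallest
// probabilities and merge them into a new node carrying their sum.
Node* build(const std::vector<std::pair<char, double>>& freq) {
    auto cmp = [](Node* a, Node* b) { return a->p > b->p; };
    std::priority_queue<Node*, std::vector<Node*>, decltype(cmp)> pq(cmp);
    for (const auto& [c, p] : freq) pq.push(new Node{p, c});
    while (pq.size() > 1) {
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(new Node{a->p + b->p, 0, a, b});  // merged internal node
    }
    return pq.top();
}

// Recursively extract the codes: left edge = 0, right edge = 1;
// only leaf nodes correspond to character codes.
void extract(Node* t, const std::string& code, std::map<char, std::string>& out) {
    if (t->leaf()) { out[t->ch] = code; return; }
    extract(t->left, code + "0", out);
    extract(t->right, code + "1", out);
}

int main() {
    std::vector<std::pair<char, double>> freq = {
        {'a', 0.8}, {'b', 0.05}, {'0', 0.1}, {'1', 0.05}};
    std::map<char, std::string> codes;
    extract(build(freq), "", codes);
    for (const auto& [c, s] : codes)
        std::cout << c << ": " << s << std::endl;  // 'a' gets the shortest code
    return 0;
}
```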
Additional Knowledge#
- In computers, the smallest addressable unit of storage is a byte, and the smallest unit of data representation is a bit
Points for Consideration#
- Other ways to prove the optimality of Huffman coding?
Tips#
- Mathematical knowledge determines the ceiling of how far you can go in the computer field