An intuitive understanding of the Huffman tree and Huffman coding, and a proof that Huffman coding is the optimal variable-length coding.
Course Content#
Prerequisite Knowledge#
- What is coding
- Where did you first hear about coding? ASCII code
- 'a' = 97 in decimal = 0110 0001 in binary
- '0' = 48 in decimal = 0011 0000 in binary
- 8-bit binary code, one byte
- Note: In computers, all information is stored in binary
- The purpose of coding: mapping from human characters to computer characters
- Example: Impact of coding length
- There is a piece of information in the computer: "aa00", its ASCII code is: 01100001 01100001 00110000 00110000
- At this time, the information is transmitted from one computer to another
- In the network, the smallest unit of transmission is a bit, so "aa00" needs to be transmitted as 32 bits
- Assuming the computer's network speed is 32 bit/s, the transmission takes 1 s
- 👇
- Now consider a special scenario: only four characters ever need to be transmitted: a, b, 0, 1
- A new coding method can be designed [Pirate Class Coding - Fixed-length coding]
- Distinguish them with 2-bit binary code - a: 00, b: 01, 0: 10, 1: 11
- At this time, the representation of "aa00" only requires 8 bits
- Therefore, with the network speed unchanged, the transmission now takes only 0.25 s [see the sketch after this list]
- [PS] Zoom's encoding and decoding are more refined than Tencent Meeting's, so under the same network speed Zoom gives a better experience.
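A minimal sketch of the fixed-length scheme above, assuming the 2-bit table a→00, b→01, 0→10, 1→11 from the example (variable names are only illustrative):

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    // 2-bit fixed-length codes for the special four-character scenario
    std::unordered_map<char, std::string> code = {
        {'a', "00"}, {'b', "01"}, {'0', "10"}, {'1', "11"}};
    std::string msg = "aa00", bits;
    for (char c : msg) bits += code[c];
    // 8 bits here, versus 32 bits for the 8-bit ASCII encoding
    std::cout << bits << " (" << bits.size() << " bits)" << std::endl;
    return 0;
}
```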
Fixed-length and Variable-length Coding#
- Fixed-length coding - the coding length is the same for each character
- ASCII code and Pirate Class coding in specific scenarios are both fixed-length coding
- [PS] UTF-8 is variable-length coding; UTF-32 is fixed-length coding (UTF-16 is mostly 2 bytes per character but is also variable-length because of surrogate pairs) [self-study]
- Variable-length coding - the coding length is different for each character
- An optimal variable-length coding is never worse than fixed-length coding
- Fixed-length coding is a special case of variable-length coding
- 【Extended Understanding】
- A variable-length coding may turn out to be fixed-length
- Variable-length coding needs to be optimized for specific scenarios [the occurrence probability of transmitted characters needs to be pre-counted]
- Coding can be seen as a protocol that is predetermined and cannot be dynamically changed
- Coding and decoding need to be consistent
Application Scenarios of Variable-length Coding#
- Specific scenario
- Only four characters: ab01
- The probability of each character appearing is different - a: 0.8, b: 0.05, 0: 0.1, 1: 0.05
- Average coding length
- avg(l) = ∑ li × pi
- li: the coding length of the i-th character
- pi: the occurrence probability of the i-th character
- Actually, it is the expected value of the coding length
- Example
- Assuming the average coding length is 1.16, then to transmit 100 characters, 116 bits need to be transmitted [estimated]
- The average coding length of Pirate Class coding [see above]: avg(l) = 2 × ∑ pi = 2
- The average coding length of fixed-length coding is fixed
- New · Pirate Class Coding
- a: 1
- b: 01 [Note: it cannot start with 1, because any code starting with 1 would conflict with a during decoding, ❌ prefix overlap]
- 0: 000
- 1: 001
- At this time, the average coding length is: 1 * 0.8 + 2 * 0.05 + 3 * 0.1 + 3 * 0.05 = 1.35
- So, to transmit 100 characters, only about 135 bits need to be transmitted [estimated; see the sketch after this list]
- [PS] The character with a higher probability corresponds to a shorter coding length
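A small sketch that evaluates avg(l) = ∑ li × pi for the New Pirate Class Coding above (the probabilities and code lengths are taken from the example; everything else is only illustrative):

```cpp
#include <cstdio>

int main() {
    // New Pirate Class Coding: a = 1, b = 01, 0 = 000, 1 = 001
    double p[] = {0.8, 0.05, 0.1, 0.05};  // probabilities of a, b, 0, 1
    int    l[] = {1, 2, 3, 3};            // corresponding code lengths
    double avg = 0;
    for (int i = 0; i < 4; ++i) avg += p[i] * l[i];
    std::printf("avg(l) = %.2f\n", avg);  // prints 1.35
    return 0;
}
```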
Huffman Coding#
- First, calculate the probability of each character
- Build a Huffman tree with n characters
- Each character falls on a leaf node [no prefix overlap]
- Read off each character's code along the path from the root, labeling left edges 0 and right edges 1
- Example of building a tree [a worked trace follows after this list]
- Mnemonic for building the tree: repeatedly take the two nodes with the smallest probabilities and merge them into a subtree [the new parent node's probability is the sum of the two]
- ⭐ Because all characters fall on leaf nodes, no code is a prefix of another code
- No conflicts
- At this time, the average coding length is 1 * 0.8 + 3 * 0.05 + 2 * 0.1 + 3 * 0.05 = 1.3
- Conclusion: Huffman coding is the optimal variable-length coding [proof is given below]
- [PS]
- Pay more attention to the length, not the left and right of 01
- If you want to pay attention, consider which one has a higher transmission cost between 0 and 1
- The Huffman tree is not unique, and the order can be arbitrary when the probabilities are equal
- Huffman coding can also degenerate into fixed-length coding [similarly, for example, coding four characters with equal probability (0.25)]
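A worked trace of the construction for the probabilities above (one possible merge order and bit assignment; ties and the 0/1 orientation can differ, only the lengths matter):

```latex
\begin{aligned}
&\text{merge } b\,(0.05) \text{ and } 1\,(0.05) &&\rightarrow\ x\,(0.10)\\
&\text{merge } x\,(0.10) \text{ and } 0\,(0.10) &&\rightarrow\ y\,(0.20)\\
&\text{merge } y\,(0.20) \text{ and } a\,(0.80) &&\rightarrow\ \text{root}\,(1.00)\\
&\text{codes (left 0, right 1): } a = 1,\quad 0 = 00,\quad b = 010,\quad 1 = 011\\
&\operatorname{avg}(l) = 1\times0.8 + 2\times0.1 + 3\times0.05 + 3\times0.05 = 1.3
\end{aligned}
```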
Proof of Huffman Optimality#
[In variable-length coding, optimality means the minimum average coding length]
Steps: ① Convert the representation of the average coding length; ② Solve the optimal solution of the formula
Thinking Transformation#
- In the coding tree, if a node is assigned to a character [the red node in the figure], then all the leaf nodes below it must be pruned
- Otherwise, conflicts will occur
- [Intuitively, and in fact] the nodes chosen by Huffman coding together cover all the leaf nodes
- Assume that there are four characters, and their coding results are as follows [red nodes]
- It can be seen that a node on the l-th level of a tree of height H [levels numbered from 0] covers 2 ^ (H - l) leaf nodes
- The l in the l-th level actually represents the coding length
- The higher the node, the more leaf nodes it covers, and the lower the node, the opposite
- 👇
- There is a mapping between the coding length li [left] and the number of leaf nodes covered by the corresponding node [right], as follows:
- li → 2 ^ (H - li)
Implicit Constraint#
- The leaves covered by all the codeword nodes cannot exceed the 2 ^ H leaves of the full tree; dividing both sides of that inequality by 2 ^ H gives the implicit constraint ∑ 2 ^ (-li) <= 1 [written out below]
- [PS] For a Huffman tree this inequality in fact holds with equality, but let's not rush to that
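Written out, with li denoting the coding length of the i-th character and H the height of the full tree (this is the counting argument the figure illustrates):

```latex
\sum_{i=1}^{n} 2^{\,H - l_i} \;\le\; 2^{H}
\qquad\xrightarrow{\ \div\, 2^{H}\ }\qquad
\sum_{i=1}^{n} 2^{-l_i} \;\le\; 1
```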
Transformation of Proof Target#
- Goal: minimize the average coding length Σ pi * li; substitute li = -li'
- The constraint comes from the above tree structure [i.e., implicit constraint]
- Let Ⅰi = 2 ^ li' [i.e. Ⅰi = 2 ^ (-li)]
- That is, transform the target and the constraint once more
- 【Transformation Result】
- Target: Minimize -(p1·log₂Ⅰ1 + p2·log₂Ⅰ2 + p3·log₂Ⅰ3 + ... + pn·log₂Ⅰn)
- Constraint: Ⅰ1 + Ⅰ2 + Ⅰ3 + ... + Ⅰn <= 1 [both are written out below]
- [PS]
- It looks a bit like entropy? In fact the proof essentially argues from the perspective of entropy, and it also leads to the concept of cross-entropy
- What is the purpose of the transformation?
- Derive the relationship between the equality of the constraint condition and the minimization of the target
- There is also a limiting condition: the sum of probabilities is 1
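The transformed problem, writing I_i for the Ⅰi above (so I_i = 2^(li') = 2^(-li) and li = -log₂ I_i):

```latex
\min_{I_1,\dots,I_n}\ -\sum_{i=1}^{n} p_i \log_2 I_i
\qquad\text{s.t.}\qquad
\sum_{i=1}^{n} I_i \le 1,
\qquad
\sum_{i=1}^{n} p_i = 1
```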
Further Narrowing of the Constraint Condition#
- Proof: When the target function reaches the minimum, the constraint condition must be = 1
- Proof by contradiction: suppose that at the minimum the constraint is strict, i.e. ∑Ⅰi < 1
- Let Ⅰx' = 1 - ∑Ⅰi, which is nonnegative [by the constraint condition]
- Ⅰx' can be added onto any single term Ⅰi
- As long as Ⅰx' is not 0, adding it makes that Ⅰi larger and the target smaller, so the target was not at its minimum; it can still be made smaller
- So the constraint condition must be = 1
Find the Extreme Value by Setting the Partial Derivative to 0#
- Reduce one unknown: Ⅰn = 1 - (Ⅰ1 + Ⅰ2 + ... + Ⅰn-1)
- When does the target reach the minimum: take the partial derivative
- Setting each partial derivative to 0 gives pi / Ⅰi = pn / Ⅰn for every i
- These equations hold when the target is at its minimum
- Let p1/Ⅰ1 = p2/Ⅰ2 = ... = pn/Ⅰn = k; combining the constraint ∑Ⅰi = 1 with ∑pi = 1 gives k = 1
- ⭐ So pi = Ⅰi, i.e. the optimal coding length is li = -log₂ pi [the calculation is written out below]
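The calculation sketched in this list, again writing I_i for Ⅰi: substitute I_n = 1 - (I_1 + ... + I_{n-1}) into the target f = -∑ pi·log₂ I_i and set each partial derivative to zero:

```latex
\frac{\partial f}{\partial I_i}
 = -\frac{p_i}{I_i \ln 2} + \frac{p_n}{I_n \ln 2} = 0
 \;\Longrightarrow\;
 \frac{p_i}{I_i} = \frac{p_n}{I_n} = k
 \;\Longrightarrow\;
 k = \frac{\sum_i p_i}{\sum_i I_i} = \frac{1}{1} = 1
 \;\Longrightarrow\;
 p_i = I_i = 2^{-l_i}
```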
❗ Thinking#
The relationship between average coding length, entropy, and cross-entropy
- pi = Ⅰi
- Ⅰi is related to the coding length li, that is, the probability and length are closely related
- The larger the probability, the shorter the coding length
- Substituting pi = Ⅰi back into the average coding length gives ∑ pi × (-log₂ pi), which is exactly the formula for entropy [see the formulas after this list]
- In fact, entropy is the minimum average number of bits required to represent the information in a system
- Similarly, it can be understood as the average coding length [⭐ The ideas are connected]
- The longer the average coding length → the more characters in the system → the more states of the system → the more chaotic the system
- A larger entropy indicates a more chaotic coding system
- 【Cross-Entropy】
- Treat {pi} and {Ⅰi} as two probability distributions
- Cross-entropy describes how close two probability distributions are to each other
- Since Ⅰi = 2 ^ (-li), the cross-entropy of {pi} against {Ⅰi} is exactly the average coding length ∑ pi × li of that code
- The proof is not very rigorous
- Because the Huffman process is actually discrete, but this proof process is continuous
- For example, pi = Ⅰi implies li = -log₂ pi, which may be a decimal, while an actual coding length must be an integer
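The formulas being referred to, writing q_i for the distribution implied by the code (q_i = 2^(-li), the Ⅰi above):

```latex
\text{entropy (minimum average coding length): }\;
H(p) = -\sum_i p_i \log_2 p_i
\qquad
\text{cross-entropy: }\;
H(p, q) = -\sum_i p_i \log_2 q_i = \sum_i p_i\, l_i \;\ge\; H(p)
```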
Code Demonstration#
Huffman Tree#
- The structure is based on a tree
- 【Key Operations】
- Merge nodes to establish relationships between nodes [conversion between arrays and nodes]
- Find the two smallest probabilities [can be optimized using a priority queue]
- Recursively extract the coding [only leaf nodes correspond to character codes]
- When reading data, it is recommended to store the characters in a character array to reduce mistakes [a sketch of these operations follows below]
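A minimal sketch of the key operations listed above, using std::priority_queue to find the two smallest probabilities; the structure and names are only illustrative, not the course's reference code:

```cpp
#include <iostream>
#include <map>
#include <queue>
#include <string>
#include <vector>

// A tree node: leaves carry a character, internal nodes only a probability.
struct Node {
    double p;
    char ch;  // meaningful only for leaves
    Node *left = nullptr, *right = nullptr;
    bool leaf() const { return !left && !right; }
};

// Build the Huffman tree: repeatedly take the two nodes with the smallest
// probabilities and merge them into a new node carrying their sum.
Node* build(const std::vector<std::pair<char, double>>& freq) {
    auto cmp = [](Node* a, Node* b) { return a->p > b->p; };
    std::priority_queue<Node*, std::vector<Node*>, decltype(cmp)> pq(cmp);
    for (const auto& [c, p] : freq) pq.push(new Node{p, c});
    while (pq.size() > 1) {
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(new Node{a->p + b->p, 0, a, b});  // merged internal node
    }
    return pq.top();
}

// Recursively extract the codes: left edge = 0, right edge = 1;
// only leaf nodes correspond to character codes.
void extract(Node* t, const std::string& code, std::map<char, std::string>& out) {
    if (t->leaf()) { out[t->ch] = code; return; }
    extract(t->left, code + "0", out);
    extract(t->right, code + "1", out);
}

int main() {
    std::vector<std::pair<char, double>> freq = {
        {'a', 0.8}, {'b', 0.05}, {'0', 0.1}, {'1', 0.05}};
    std::map<char, std::string> codes;
    extract(build(freq), "", codes);
    for (const auto& [c, s] : codes)
        std::cout << c << ": " << s << std::endl;  // 'a' gets the shortest code
    return 0;
}
```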
Additional Knowledge#
- In computers, the smallest addressable unit of storage is a byte, and the smallest unit of data representation is a bit
Points for Consideration#
- Other ways to prove the optimality of Huffman coding?
Tips#
- Mathematical knowledge determines the ceiling of how far you can go in the computer field