Bo2SS

3 Huffman Tree and Huffman Coding

Understanding Huffman trees and Huffman coding intuitively, and proving that Huffman coding is the optimal variable-length coding.

Course Content#

Prerequisite Knowledge#

  • What is coding
    • Where did you first hear about coding? ASCII code
      • 'a' = (97)10 = (0110 0001)2
      • '0' = (48)10 = (0011 0000)2
      • 8-bit binary code, one byte
    • Note: In computers, all information is stored in binary
    • The purpose of coding: a mapping from human-readable characters to the computer's binary representation
  • Example: Impact of coding length
    • There is a piece of information in the computer: "aa00", its ASCII code is: 01100001 01100001 00110000 00110000
    • Now this information is transmitted from one computer to another
      • On the network, the smallest unit of transmission is the bit, so "aa00" must be transmitted as 32 bits
    • Assuming a network speed of 32 bit/s, the transmission takes: 1 s
    • 👇
    • Convert to a special scenario: only four characters ever need to be transmitted: a, b, 0, 1
    • A new coding method can be designed [Pirate Class Coding - fixed-length coding]
      • Distinguish them with 2-bit binary codes - a: 00, b: 01, 0: 10, 1: 11
    • Now "aa00" can be represented in only 8 bits
    • Therefore, with the network speed unchanged, the transmission takes only: 0.25 s
    • [PS] Compared with Tencent Meeting, Zoom encodes and decodes more efficiently, which is why Zoom feels smoother at the same network speed
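A minimal sketch of the comparison above, assuming the 2-bit Pirate Class codebook; the helper `encode` is illustrative, not from the course:

```python
# Compare ASCII (8 bits per character) with the 2-bit fixed-length code
# designed for the restricted alphabet {a, b, 0, 1}.
codebook = {'a': '00', 'b': '01', '0': '10', '1': '11'}

def encode(message, book):
    """Concatenate the code of each character into one bit string."""
    return ''.join(book[ch] for ch in message)

message = "aa00"
ascii_bits = 8 * len(message)                 # 32 bits
fixed2_bits = len(encode(message, codebook))  # 8 bits

speed = 32  # bits per second
print(ascii_bits / speed)   # 1.0 s
print(fixed2_bits / speed)  # 0.25 s
```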

Fixed-length and Variable-length Coding#

  • Fixed-length coding - the coding length is the same for each character
    • ASCII and the scenario-specific Pirate Class coding above are both fixed-length codes
    • [PS] UTF-8 - variable-length coding; UTF-16 - fixed-length coding [self-study]
  • Variable-length coding - the coding length may differ from character to character
  • Variable-length coding is never worse than fixed-length coding
    • Fixed-length coding is a special case of variable-length coding
  • 【Extended Understanding】
    • A variable-length code can degenerate into a fixed-length code
    • Variable-length coding must be optimized for the specific scenario [the occurrence probabilities of the transmitted characters must be counted in advance]
    • A coding scheme can be seen as a protocol that is agreed upon in advance and cannot be changed dynamically
    • Encoding and decoding must use the same scheme

Application Scenarios of Variable-length Coding#

  • Specific scenario
    • Only four characters: a, b, 0, 1
    • Each character appears with a different probability - a: 0.8, b: 0.05, 0: 0.1, 1: 0.05
  • Average coding length
    • avg(l) = ∑ li × pi
    • li: the coding length of the i-th character
    • pi: the occurrence probability of the i-th character
    • Actually, it is the expected value of the coding length
    • Example
      • Assuming the average coding length is 1.16, then to transmit 100 characters, 116 bits need to be transmitted [estimated]
      • The average coding length of Pirate Class coding [see above]: avg(l) = 2 × ∑ pi = 2
        • The average coding length of fixed-length coding is fixed
  • New · Pirate Class Coding
    • a: 1
    • b: 01 [Note: b's code cannot start with 1, because it would conflict with a's code during decoding, ❌ prefix overlap]
    • 0: 000
    • 1: 001
    • At this time, the average coding length is: 1 * 0.8 + 2 * 0.05 + 3 * 0.1 + 3 * 0.05 = 1.35
    • So, to transmit 100 characters, only 135 bits need to be transmitted [estimated]
  • [PS] A character with a higher probability should get a shorter code, as checked in the sketch below
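A quick check of the two averages above; `avg_len` is an illustrative helper, not from the original demo:

```python
# Expected coding length: avg(l) = sum over i of l_i * p_i
probs = {'a': 0.8, 'b': 0.05, '0': 0.1, '1': 0.05}
fixed = {'a': '00', 'b': '01', '0': '10', '1': '11'}      # Pirate Class coding
variable = {'a': '1', 'b': '01', '0': '000', '1': '001'}  # New Pirate Class coding

def avg_len(code, probs):
    """Average (expected) coding length of a codebook under given probabilities."""
    return sum(len(code[ch]) * p for ch, p in probs.items())

print(avg_len(fixed, probs))     # ≈ 2.0
print(avg_len(variable, probs))  # ≈ 1.35
```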

Huffman Coding#

  1. First, calculate the probability of each character
  2. Build a Huffman tree with n characters
  3. Each character falls on a leaf node [no prefix overlap]
  4. Read off each code from the root: left edge = 0, right edge = 1
  • Example of building a tree
    • [Image: step-by-step construction of the Huffman tree for a: 0.8, b: 0.05, 0: 0.1, 1: 0.05]
    • Mnemonic for building the tree: repeatedly take the two nodes with the smallest probabilities and merge them into a subtree [generating a new node whose probability is their sum; see the worked trace at the end of this section]
    • ⭐ Because all characters fall on leaf nodes, no code is a prefix of another code
      • No conflicts
    • At this time, the average coding length is 1 * 0.8 + 3 * 0.05 + 2 * 0.1 + 3 * 0.05 = 1.3
  • Conclusion: Huffman coding is the optimal variable-length coding [proof is given below]
  • [PS]
    • Pay attention to the code lengths, not to which side gets 0 or 1
      • If you do want to care, consider whether 0 or 1 is more expensive to transmit
    • The Huffman tree is not unique, and the order can be arbitrary when the probabilities are equal
    • Huffman coding can also degenerate into fixed-length coding [e.g., four characters with equal probability 0.25 each get a 2-bit code]
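As referenced above, a worked trace of the construction for a: 0.8, b: 0.05, 0: 0.1, 1: 0.05. Because of the ties at 0.05 and 0.1, other merge orders (and hence other codes) are possible, but the lengths match the 1.3 average:

```latex
\underbrace{0.05}_{b} + \underbrace{0.05}_{1} \to 0.1,
\qquad 0.1 + \underbrace{0.1}_{0} \to 0.2,
\qquad 0.2 + \underbrace{0.8}_{a} \to 1.0

\Rightarrow\; l_a = 1,\; l_0 = 2,\; l_b = l_1 = 3,
\qquad \mathrm{avg}(l) = 0.8 + 0.2 + 0.15 + 0.15 = 1.3
```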

Proof of Huffman Optimality#

[In variable-length coding, optimality means the minimum average coding length]

Steps: ① Convert the representation of the average coding length; ② Solve the optimal solution of the formula

Thinking Transformation#

  • [Image: a full binary tree with a red node marking a codeword; the four leaves below it are pruned]
  • In the Huffman coding process, if the red node corresponds to a character, then the four leaf nodes below it must be pruned
  • Otherwise, its code would be a prefix of other codes and conflicts would occur
  • [Intuitively, this is also a fact] Huffman coding covers all leaf nodes
  • Assume that there are four characters, and their coding results are as follows [red nodes]
    • [Image: four red nodes at various levels of a full binary tree, each covering some leaves]
    • The number of leaf nodes covered by a node on level l of a tree of height H [levels counted from 0] is 2 ^ (H - l)
      • The level l of a red node is exactly the coding length of its character
      • The higher the node, the more leaf nodes it covers; the lower the node, the fewer
  • 👇
  • There is a mapping between the coding length li [left] and the number of leaf nodes covered by the corresponding node [right], as follows:
    • li → 2 ^ (H - li)
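A quick numeric check, using the code lengths 1, 2, 3, 3 from the earlier example and H = 3:

```latex
2^{H-l_a} + 2^{H-l_0} + 2^{H-l_b} + 2^{H-l_1}
  = 2^{2} + 2^{1} + 2^{0} + 2^{0}
  = 4 + 2 + 1 + 1 = 8 = 2^{H}
```

Here the covered leaves exactly exhaust all 2^H leaves, which previews the equality discussed next.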

Implicit Constraint#

  • The tree structure imposes the constraint: ∑ 2 ^ (H - li) <= 2 ^ H [the covered leaves cannot exceed the 2 ^ H leaves that exist]
  • Dividing both sides by 2 ^ H gives the second equation: ∑ 2 ^ (-li) <= 1
  • [PS] For a Huffman tree the second relation must hold with equality, but let's not rush to that

Transformation of Proof Target#

  • Goal: minimize the average coding length ∑ pi × li; substitute li = -li'
  • The constraint comes from the tree structure above [i.e., the implicit constraint]: ∑ 2 ^ (-li) <= 1
  • Let Ii = 2 ^ li' = 2 ^ (-li), so li = -log2(Ii)
  • That is, transform the target and the constraint once more
  • 【Transformation Result】
    • Target: minimize -(p1·log2(I1) + p2·log2(I2) + p3·log2(I3) + ... + pn·log2(In))
    • Constraint: I1 + I2 + I3 + ... + In <= 1
  • [PS]
    • It looks a bit like entropy? Indeed, this is essentially a proof from the perspective of entropy, and it also leads to the concept of cross-entropy
    • What is the purpose of the transformation?
      • Derive the relationship between the equality of the constraint condition and the minimization of the target
      • There is also a limiting condition: the sum of probabilities is 1
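Putting the pieces together, the optimization problem being solved is:

```latex
\min_{I_1,\dots,I_n} \; -\sum_{i=1}^{n} p_i \log_2 I_i
\quad \text{s.t.} \quad \sum_{i=1}^{n} I_i \le 1,
\qquad I_i = 2^{-l_i}, \quad \sum_{i=1}^{n} p_i = 1
```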

Further Narrowing of the Constraint Condition#

  • Claim: when the target function reaches its minimum, the constraint must hold with equality, ∑ Ii = 1
  • Proof by contradiction: suppose instead that ∑ Ii < 1
    • Let Ix' = 1 - ∑ Ii, which is positive under this assumption
    • Ix' can be added to any single term Ii without violating the constraint
    • Since log is increasing, enlarging Ii strictly decreases -pi·log2(Ii), so the target gets smaller
    • Hence, as long as Ix' is not 0, the target is not yet at its minimum
    • So at the minimum the constraint must be = 1

Find the Extreme Value by Setting the Partial Derivative to 0#

  • Use the equality constraint to eliminate one unknown: In = 1 - (I1 + I2 + ... + In-1)
  • When does the target reach its minimum? Take the partial derivatives
    • Setting the partial derivative with respect to each Ii to 0 gives pi/Ii = pn/In
    • These equations hold when the target is at its minimum
    • Let p1/I1 = p2/I2 = ... = pn/In = k; combining the constraint ∑ Ii = 1 with ∑ pi = 1 gives k = 1
    • ⭐ So pi = Ii
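Written out, the step above (using the notation of this section; f is the target after eliminating In):

```latex
f(I_1,\dots,I_{n-1})
  = -\sum_{i=1}^{n-1} p_i \log_2 I_i
    - p_n \log_2\!\Big(1 - \sum_{i=1}^{n-1} I_i\Big)

\frac{\partial f}{\partial I_i}
  = \frac{1}{\ln 2}\left(-\frac{p_i}{I_i} + \frac{p_n}{I_n}\right) = 0
  \;\Rightarrow\; \frac{p_i}{I_i} = \frac{p_n}{I_n} = k

1 = \sum_i p_i = k \sum_i I_i = k \cdot 1
  \;\Rightarrow\; k = 1
  \;\Rightarrow\; p_i = I_i
```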

❗ Thinking#

The relationship between average coding length, entropy, and cross-entropy

  • pi = Ii
    • Ii is related to the coding length li; that is, probability and length are closely tied
      • The larger the probability, the shorter the coding length
    • Substituting pi = Ii back yields the formula for entropy
      • In fact, entropy is the minimum average number of bits required to represent the information in a system
      • Similarly, it can be understood as the average coding length [⭐ The ideas are connected]
      • The longer the average coding length → the more characters in the system → the more states of the system → the more chaotic the system
      • A larger entropy indicates a more chaotic coding system
    • 【Cross-Entropy】
      • Treat the pi and the Ii as two probability distributions
      • Cross-entropy describes how similar two probability distributions are
  • The proof is not fully rigorous
    • The Huffman process is discrete, but this proof treats it as continuous
    • For example, with pi = Ii, the coding length li computed from Ii may be fractional, while an actual coding length must be an integer
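The substitution mentioned above, written out with the standard definitions of entropy and cross-entropy:

```latex
p_i = I_i = 2^{-l_i} \;\Rightarrow\; l_i = -\log_2 p_i

\mathrm{avg}(l) = \sum_i p_i\, l_i = -\sum_i p_i \log_2 p_i = H(p)
  \quad \text{(entropy: the minimum average coding length)}

H(p, q) = -\sum_i p_i \log_2 q_i \;\ge\; H(p)
  \quad \text{(cross-entropy: the average length when coding } p
  \text{ with a code } q_i = 2^{-l_i}\text{)}
```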

Code Demonstration#

Huffman Tree#

  • [Images: source code of the Huffman tree demo]
  • The implementation is built on a tree structure
  • 【Key Operations】
    • Merge nodes to establish relationships between nodes [conversion between arrays and nodes]
    • Find the two smallest probabilities [can be optimized using a priority queue]
    • Recursively extract the coding [only leaf nodes correspond to character codes]
    • When reading input, it is recommended to store characters in a character array to reduce the chance of errors
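Since the original code screenshots are not reproduced here, below is a minimal Python sketch of the key operations above: merging nodes, picking the two smallest probabilities with the priority-queue optimization, and recursively extracting the codes. All names are my own; the course demo may differ.

```python
import heapq
from itertools import count

class Node:
    """A Huffman tree node; leaves carry a character, internal nodes do not."""
    def __init__(self, prob, ch=None, left=None, right=None):
        self.prob, self.ch, self.left, self.right = prob, ch, left, right

def build_tree(probs):
    """Repeatedly merge the two least-probable nodes into a new parent node."""
    tick = count()  # tie-breaker, since Node objects are not comparable
    heap = [(p, next(tick), Node(p, ch)) for ch, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)  # the two smallest probabilities
        p2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tick),
                              Node(p1 + p2, left=a, right=b)))
    return heap[0][2]

def extract_codes(node, prefix="", codes=None):
    """Recursively read codes: left = 0, right = 1; only leaves get codes."""
    if codes is None:
        codes = {}
    if node.ch is not None:              # leaf node: a character
        codes[node.ch] = prefix or "0"   # single-symbol edge case
    else:                                # internal node: recurse
        extract_codes(node.left, prefix + "0", codes)
        extract_codes(node.right, prefix + "1", codes)
    return codes

if __name__ == "__main__":
    probs = {'a': 0.8, 'b': 0.05, '0': 0.1, '1': 0.05}
    codes = extract_codes(build_tree(probs))
    print(codes)  # one valid result: {'0': '00', 'b': '010', '1': '011', 'a': '1'}
    print(sum(len(codes[c]) * p for c, p in probs.items()))  # ≈ 1.3
```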

Additional Knowledge#

  • In computers, the smallest addressable unit of storage is the byte, and the smallest unit of data representation is the bit

Points for Consideration#

Tips#

  • Mathematical knowledge determines the ceiling of how far you can go in computer science
