Huffman Coding 102: Code It Up!

Implementation in Go

Anything we learn in Computer Science is incomplete without an implementation in whatever language is trending at the moment (seriously, don't worry about the language, just focus on the concepts). That being said, let's build a compression/decompression tool using everything we learnt about Huffman Coding in Part 1 of this article!

I'd strongly suggest that you first try to implement this yourself. If you get stuck, you can follow along with this article or take a look at the source code on my GitHub.

Step 0

Our goal in this step is to build a simple application that:

  1. Takes in the path of a .txt file as input

  2. Reads the text and determines the frequency of each character in it

Here's the code that achieves these two steps:


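package main

import (
    "bufio"
    "flag"
    "fmt"
    "io"
    "os"
)

func main() {
    filePath := flag.String("path", "", "path to file to be compressed")
    flag.Parse()

    if *filePath == "" {
        panic("File path is required")
    }

    file, err := os.Open(*filePath)
    if err != nil {
        panic(err)
    }
    defer CloseFile(file)

    // build a map that stores the frequency of each character in the file
    freqMap := buildFreqMap(file)
    // print the number of distinct characters so that freqMap is used;
    // Go refuses to compile a variable that is declared but never used
    fmt.Printf("found %d distinct characters\n", len(freqMap))
}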

func buildFreqMap(file *os.File) map[rune]int {
    // this will store a character->frequency mapping
    freqMap := make(map[rune]int)
    reader := bufio.NewReader(file)
    for {
        char, _, err := reader.ReadRune()
        if err != nil && err != io.EOF {
            panic(err)
        }
        if err == io.EOF {
            break
        }
        freqMap[char]++
    }
    return freqMap
}
// utility function to close a file in Go
func CloseFile(file *os.File) {
    err := file.Close()
    if err != nil {
        panic(err)
    }
}

Step 1

In this step, our goal is to build the Huffman tree that we read so much about in the first part of this article. We will be using the frequency map that we built in the previous step.

I would strongly suggest that you try building the Huffman tree yourself before looking at the solution below.

Let's take a look at the huff package, which defines our Huffman tree structure:

package huff

type BaseNode interface {
    IsLeaf() bool
    Weight() int
}
type LeafNode struct {
    weight  int
    element rune
}
// implementing the BaseNode interface for the LeafNode struct
func (node LeafNode) IsLeaf() bool {
    return true
}
func (node LeafNode) Weight() int {
    return node.weight
}
func (node LeafNode) Value() rune {
    return node.element
}
// constructor for the LeafNode struct
func NewHuffLeafNode(el rune, w int) *LeafNode {
    return &LeafNode{element: el, weight: w}
}

type InternalNode struct {
    weight      int
    left, right BaseNode
    LeftEdge    int
    RightEdge   int
}
// constructor for the InternalNode struct
func NewHuffInternalNode(l, r BaseNode, w int) *InternalNode {
    return &InternalNode{left: l, right: r, weight: w, LeftEdge: 0, RightEdge: 1}
}
// implementing the BaseNode interface for the InternalNode struct
func (node InternalNode) IsLeaf() bool {
    return false
}
func (node InternalNode) Weight() int {
    return node.weight
}
func (node InternalNode) Left() BaseNode {
    return node.left
}
func (node InternalNode) Right() BaseNode {
    return node.right
}

Now that we have the various Nodes defined, we can start defining the Huffman Tree.

type Tree struct {
    root BaseNode
}

func NewHuffTreeFromLeaf(r BaseNode) *Tree {
    return &Tree{root: r}
}
func NewHuffTreeFromNodes(l, r BaseNode, wt int) *Tree {
    return &Tree{root: NewHuffInternalNode(l, r, wt)}
}
func (tree *Tree) Root() BaseNode {
    return tree.root
}
func (tree *Tree) Weight() int {
    return tree.root.Weight()
}

This structure allows us to create a tree with leaf nodes representing characters and internal nodes for branching.

Next, we need to define the methods that will help build our Huffman Tree from individual nodes.

// HuffmanHeap is a min-heap of Tree pointers. The code below relies on
// the standard container/heap package, so add it to the file's imports.
type HuffmanHeap []*Tree

// By implementing these methods, the HuffmanHeap type satisfies 
// the heap.Interface, allowing it to be used with Go's 
// heap package. This enables efficient selection of the two 
// lowest-weight trees at each step of the Huffman tree 
// construction process.
func (h *HuffmanHeap) Len() int           { return len(*h) }
func (h *HuffmanHeap) Less(i, j int) bool { return (*h)[i].Weight() < (*h)[j].Weight() }
func (h *HuffmanHeap) Swap(i, j int)      { (*h)[i], (*h)[j] = (*h)[j], (*h)[i] }

// implement the push and pop functions for the min-heap
func (h *HuffmanHeap) Push(x interface{}) {
    *h = append(*h, x.(*Tree))
}

func (h *HuffmanHeap) Pop() interface{} {
    old := *h
    n := len(old)
    x := old[n-1]
    *h = old[0 : n-1]
    return x
}

// BuildHuffmanTree constructs a Huffman tree from a map of 
// character frequencies
func BuildHuffmanTree(freqsMap map[rune]int) *Tree {
    // Create a min-heap
    h := &HuffmanHeap{}
    heap.Init(h)

    // Create a leaf node for each character and add it to the heap
    for ch, freq := range freqsMap {
        huffNode := NewHuffLeafNode(ch, freq)
        tree := NewHuffTreeFromLeaf(huffNode)
        heap.Push(h, tree)
    }

    // While there is more than one tree in the heap
    for h.Len() > 1 {
        // Remove the two trees with the lowest weight
        tree1 := heap.Pop(h).(*Tree)
        tree2 := heap.Pop(h).(*Tree)

        // Create a new internal node with these two nodes as children
        combinedWeight := tree1.Weight() + tree2.Weight()
        newTree := NewHuffTreeFromNodes(tree1.Root(), tree2.Root(), combinedWeight)

        // Add the new tree back to the heap
        heap.Push(h, newTree)
    }

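    // An empty input produces no trees at all, so fail with a clear message
    // instead of letting heap.Pop index into an empty slice
    if h.Len() == 0 {
        panic("cannot build a Huffman tree from an empty frequency map")
    }

    // The last remaining tree is the Huffman tree
    return heap.Pop(h).(*Tree)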
}
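
A handy property of this merge loop: every merge adds one bit to the code of each character sitting underneath the new internal node, so the sum of all merged weights equals the total number of bits in the encoded text. The sketch below uses that to sanity-check the greedy loop without building any nodes. The names weightHeap and encodedBits are made up for this illustration and are not part of the huff package.

```go
package main

import (
	"container/heap"
	"fmt"
)

// weightHeap is a hypothetical min-heap of plain weights, used here
// instead of a heap of trees to keep the example short.
type weightHeap []int

func (h weightHeap) Len() int           { return len(h) }
func (h weightHeap) Less(i, j int) bool { return h[i] < h[j] }
func (h weightHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }

func (h *weightHeap) Push(x interface{}) { *h = append(*h, x.(int)) }

func (h *weightHeap) Pop() interface{} {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// encodedBits runs the same greedy loop as BuildHuffmanTree, but only
// tracks weights. It returns the number of bits the Huffman-encoded
// text would occupy (header and padding not included).
func encodedBits(freqs map[rune]int) int {
	h := &weightHeap{}
	for _, f := range freqs {
		heap.Push(h, f)
	}
	total := 0
	for h.Len() > 1 {
		// take the two lowest weights and merge them
		a := heap.Pop(h).(int)
		b := heap.Pop(h).(int)
		total += a + b
		heap.Push(h, a+b)
	}
	return total
}

func main() {
	// frequencies of the characters in "abracadabra"
	freqs := map[rune]int{'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1}
	// merges: 1+1=2, 2+2=4, 2+4=6, 5+6=11, so this prints 23
	fmt.Println(encodedBits(freqs))
}
```

Ties between equal weights can be broken in different ways, so the shape of the tree may vary from run to run, but the total number of encoded bits stays the same.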

Now that we have the Huffman Tree, we can proceed to building the prefix table.

Step 2

The prefix table will map characters to their Huffman codes and will be used during the encoding process.

func BuildPrefixTable(root huff.BaseNode) map[rune]string {
    prefixTable := make(map[rune]string)
    buildPrefixTableHelper(root, "", prefixTable)
    return prefixTable
}

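func buildPrefixTableHelper(node huff.BaseNode, currentPrefix string, prefixTable map[rune]string) {
    switch n := node.(type) {
    case *huff.LeafNode:
        // a file with a single distinct character yields a tree that is just
        // one leaf; give it a one-bit code so it is not encoded as nothing
        if currentPrefix == "" {
            currentPrefix = "0"
        }
        prefixTable[n.Value()] = currentPrefix
    case *huff.InternalNode:
        // going left appends LeftEdge (0), going right appends RightEdge (1)
        buildPrefixTableHelper(n.Left(), currentPrefix+strconv.Itoa(n.LeftEdge), prefixTable)
        buildPrefixTableHelper(n.Right(), currentPrefix+strconv.Itoa(n.RightEdge), prefixTable)
    }
}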

This recursive function traverses the Huffman tree, building the code for each character.
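
To see the traversal in action, here is a small self-contained sketch on a hand-built tree. It uses a simplified, hypothetical node struct rather than the huff package, and it hard-codes 0 for left edges and 1 for right edges.

```go
package main

import "fmt"

// node is a simplified stand-in for the huff node types: a leaf has no
// children and carries a character, an internal node has two children.
type node struct {
	char        rune
	left, right *node
}

// buildCodes walks the tree and records the path to every leaf.
func buildCodes(n *node, prefix string, table map[rune]string) {
	if n == nil {
		return
	}
	if n.left == nil && n.right == nil {
		if prefix == "" {
			// the tree is a single leaf: it still needs a one-bit code
			prefix = "0"
		}
		table[n.char] = prefix
		return
	}
	buildCodes(n.left, prefix+"0", table)
	buildCodes(n.right, prefix+"1", table)
}

func main() {
	// root
	// ├─0─ 'a'
	// └─1─ internal
	//      ├─0─ 'b'
	//      └─1─ 'c'
	root := &node{
		left:  &node{char: 'a'},
		right: &node{left: &node{char: 'b'}, right: &node{char: 'c'}},
	}
	table := make(map[rune]string)
	buildCodes(root, "", table)
	// prints "0 10 11"
	fmt.Println(table['a'], table['b'], table['c'])
}
```

Notice that no code is a prefix of another: characters only live at the leaves, so the path to one character can never continue on to a second one.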

Step 3

In this step, we construct a header section for our output file. The decoder will use this header to recover the original file from the compressed one. Writing the prefix table we generated in the previous step to the header is enough, since it contains all the information we need to decode the data.

Let's modify our main function to carry out this step.

func main() {
    // declare a flag variable to accept file name as input
    filePath := flag.String("path", "", "path to file to be compressed")
    outputPath := flag.String("output", "", "output file path")
    flag.Parse()

    if *filePath == "" || *outputPath == "" {
        panic("Both -path and -output are required")
    }

    file, err := os.Open(*filePath)
    if err != nil {
        panic(err)
    }
    defer CloseFile(file)

    // declare a map to store the frequency of each character in the file
    freqMap := buildFreqMap(file)

    // build a huffman tree with the freqMap
    huffTree := huff.BuildHuffmanTree(freqMap)
    // build the prefix table from the huffman tree
    prefixTable := BuildPrefixTable(huffTree.Root())
    // create (or truncate) the file at the output path provided;
    // os.Create returns it already open for writing, so there is no
    // need to close it and reopen it in append mode
    outputFile, err := os.Create(*outputPath)
    if err != nil {
        panic(err)
    }
    defer CloseFile(outputFile)

    writePrefixTableToOutputFile(outputFile, prefixTable)
}

Let's define our writePrefixTableToOutputFile function.

func writePrefixTableToOutputFile(outputFile *os.File, prefixTable map[rune]string) {
    // Create a writer
    writer := bufio.NewWriter(outputFile)

    for char, prefix := range prefixTable {
        // Convert the character to its Unicode code point
        codePoint := int(char)

        // Write the code point, prefix, and a delimiter
        _, err := fmt.Fprintf(writer, "%d\t%s\n", codePoint, prefix)
        if err != nil {
            panic(err)
        }
    }
    _, err := writer.WriteString("***HEADER*END***\n")
    if err != nil {
        panic(err)
    }

    // Flush the writer to ensure all buffered operations have been applied to the underlying writer
    err = writer.Flush()
    if err != nil {
        panic(err)
    }
}

We first convert each character to its Unicode code point, written as a decimal integer. This avoids the headache of finding a delimiting character that does not occur in the original text file. Each row of the prefix table is written on its own line as unicode_int<tab>huffman_code.

We mark the end of the table as ***HEADER*END***, so that we can know when the header ends and the compressed data begins.

Step 4

We finally arrive where we wanted to be: the encoding process. In this step, we use the prefix table to encode each character in the original text file as its Huffman code, and write the compressed data to the output file. Each prefix is stored as a string of '0' and '1' characters, so to actually save space we need to turn those characters into individual bits and pack them into bytes. This will get a bit clearer when we look at the code.

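func main() {
    // ... earlier code

    // buildFreqMap already read the file to the end, so rewind it
    // before making the second pass
    if _, err := file.Seek(0, io.SeekStart); err != nil {
        panic(err)
    }
    reader := bufio.NewReader(file)
    writer := bufio.NewWriter(outputFile)
    // bitBuffer accumulates the bits of the byte currently being built
    var bitBuffer uint8
    // bitCount stores the number of bits currently in the bitBuffer
    var bitCount uint8

    for {
        // read each character from the input file
        char, _, err := reader.ReadRune()
        if err != nil {
            if err == io.EOF {
                break
            }
            panic(err)
        }
        // get the huffman code (prefix string) for the read character
        // using the prefix table
        bitString := prefixTable[char]
        for _, bit := range bitString {
            // append each bit in the bitString to the bitBuffer
            bitBuffer = (bitBuffer << 1) | uint8(bit-'0')
            // increment the bitCount for each insertion
            bitCount++
            // once the bitCount equals 8, we have a complete byte
            // that we can write to our output file
            if bitCount == 8 {
                err := writer.WriteByte(bitBuffer)
                if err != nil {
                    panic(err)
                }
                // reset the bitBuffer and bitCount variables
                bitBuffer = 0
                bitCount = 0
            }
        }
    }

    // the last data byte may be only partially filled: left-align its
    // bits and pad the rest of the byte with zeros
    validBits := uint8(8)
    if bitCount > 0 {
        validBits = bitCount
        if err := writer.WriteByte(bitBuffer << (8 - bitCount)); err != nil {
            panic(err)
        }
    }
    // trailer byte: the number of valid bits in the last data byte,
    // so that the decoder can ignore the padding
    if err := writer.WriteByte(validBits); err != nil {
        panic(err)
    }
    // flush the buffered bytes to the output file
    if err := writer.Flush(); err != nil {
        panic(err)
    }
}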

This code reads each character, looks up its Huffman code, and writes the code bits to the output file, packing them into bytes. Once you run this code, you will find an output file at the path you provided. For a reasonably large text file, it should be noticeably smaller than the original input; for a very small file, the header can outweigh the savings.
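
The bit packing is the part that tends to trip people up, so here it is in isolation. This is a minimal sketch with a hypothetical packBits helper that takes a string of '0' and '1' characters and returns the packed bytes, plus the number of meaningful bits in the last byte. It assumes leftover bits are left-aligned and padded with zeros.

```go
package main

import "fmt"

// packBits packs a string of '0'/'1' characters into bytes, most
// significant bit first. The second return value is the number of
// valid bits in the last byte (8 when the bits fill it exactly).
func packBits(bits string) ([]byte, int) {
	var out []byte
	var buf uint8
	count := 0
	for _, b := range bits {
		// shift the buffer left and drop the new bit into the lowest position
		buf = buf<<1 | uint8(b-'0')
		count++
		if count == 8 {
			out = append(out, buf)
			buf, count = 0, 0
		}
	}
	if count == 0 {
		return out, 8
	}
	// left-align the leftover bits; the low bits are zero padding
	out = append(out, buf<<uint(8-count))
	return out, count
}

func main() {
	// the codes 0, 10, 11, 0, 10, 11 written back to back
	packed, valid := packBits("0101101011")
	// prints "[01011010 11000000] 2"
	fmt.Printf("%08b %d\n", packed, valid)
}
```

Ten bits need two bytes, and only the first two bits of the second byte carry data. Without recording that count somewhere, a decoder could mistake the six padding zeros for real codes.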

Step 5

In this step, we will begin our decoding process. In order to decode the compressed file, we need the prefix table that we stored in its header. So, let's read the header and construct the prefix table from it.

func DecodeFile(filePath string) {
    // open the compressed file and create a reader for it
    file, err := os.Open(filePath)
    if err != nil {
        panic(err)
    }
    defer CloseFile(file)
    reader := bufio.NewReader(file)

    // initialize an empty map as the prefix table
    prefixTable := make(map[string]rune)
    for {
        // read each line from the file, using the \n character as the
        // delimiter
        line, err := reader.ReadString('\n')
        if err != nil {
            panic(err)
        }
        line = strings.TrimSpace(line)
        // check if the line equals ***HEADER*END***
        // if it does, break out of the loop, our prefix table is ready
        if line == "***HEADER*END***" {
            break
        }
        // split the read line using the \t delimiter
        // this is what separates the unicode int and their huffman
        // codes
        parts := strings.Split(line, "\t")
        if len(parts) != 2 {
            continue
        }
        // get the unicode and its prefix from the line
        codePoint, err := strconv.Atoi(parts[0])
        if err != nil {
            panic(err)
        }
        char := rune(codePoint)
        prefix := parts[1]
        // make an entry in the prefix table
        // note that this is the reverse mapping of how we originally
        // constructed our prefix table
        prefixTable[prefix] = char
    }
    // at the end of this loop, our prefix table will be ready
}

Step 6

Now that we have our prefix table ready, we can read each byte of the compressed file and decode it using the prefix table. Let's take a look at the modified DecodeFile function.

func DecodeFile(filePath string) {
    // ...
    // ... -> code to construct prefix table

    // Read everything after the header: the packed data bytes, followed
    // by one trailer byte that holds the number of valid bits in the
    // last data byte
    data, err := io.ReadAll(reader)
    if err != nil {
        panic(err)
    }
    if len(data) == 0 {
        panic("compressed file has no data after the header")
    }
    validBits := int(data[len(data)-1])
    payload := data[:len(data)-1]

    var decodedData strings.Builder
    var currentPrefix strings.Builder

    for idx, b := range payload {
        // every byte carries 8 bits, except the last one, which may end
        // in zero padding that must be ignored
        bitsInByte := 8
        if idx == len(payload)-1 {
            bitsInByte = validBits
        }

        // Process the bits from most significant to least significant
        for i := 0; i < bitsInByte; i++ {
            bit := (b >> uint(7-i)) & 1
            currentPrefix.WriteByte('0' + bit)
            // check if currentPrefix exists in the prefixTable
            if char, ok := prefixTable[currentPrefix.String()]; ok {
                // if it does, write the corresponding character to
                // the decodedData
                decodedData.WriteRune(char)
                currentPrefix.Reset()
            }
        }
    }

    // Create a new file for writing the decoded text
    decodedFilePath := filepath.Join(filepath.Dir(filePath), "decoded_"+filepath.Base(filePath))
    decodedFile, err := os.Create(decodedFilePath)
    if err != nil {
        panic(err)
    }
    defer CloseFile(decodedFile)

    // Write the decoded text to the file
    _, err = decodedFile.WriteString(decodedData.String())
    if err != nil {
        panic(err)
    }

    fmt.Printf("Decoded text has been written to: %s\n", decodedFilePath)
}

That's it! Our DecodeFile function is ready. Now when you call it at the end of your main function with the path of the compressed file, it creates a file named decoded_ followed by the compressed file's name (for example, decoded_out.txt) in the same directory. Check its contents and you'll find that it's identical to the original input file you used.
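
If you want to convince yourself that the bit-level decoding works before involving any files, here is a small sketch. decodeBits is a hypothetical helper that assumes the convention mentioned in the code comments: the number of valid bits in the last data byte is known, and any remaining bits in that byte are padding.

```go
package main

import (
	"fmt"
	"strings"
)

// decodeBits walks the payload bit by bit, most significant bit first,
// and emits a character every time the collected bits match a code.
// Only the first validBits bits of the last byte are considered.
func decodeBits(payload []byte, validBits int, table map[string]rune) string {
	var out, prefix strings.Builder
	for idx, b := range payload {
		bitsInByte := 8
		if idx == len(payload)-1 {
			bitsInByte = validBits
		}
		for i := 0; i < bitsInByte; i++ {
			bit := (b >> uint(7-i)) & 1
			prefix.WriteByte('0' + bit)
			if ch, ok := table[prefix.String()]; ok {
				out.WriteRune(ch)
				prefix.Reset()
			}
		}
	}
	return out.String()
}

func main() {
	table := map[string]rune{"0": 'a', "10": 'b', "11": 'c'}
	// 01011010 11000000, with only 2 valid bits in the second byte
	payload := []byte{0x5A, 0xC0}
	// prints "abcabc"
	fmt.Println(decodeBits(payload, 2, table))
}
```

If the padding were not skipped, the six trailing zeros would each match the code for 'a' and the output would gain six extra characters.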

You have successfully achieved compression and decompression using Huffman Coding!

Efficiency and Complexity

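Huffman coding produces an optimal prefix-free code: no codeword is a prefix of another, and no other code that assigns one codeword per character gives a shorter encoding for the same frequencies. The time complexity for building the Huffman tree is O(n log n), where n is the number of unique characters. Encoding and decoding each make a single pass over the data, so their running time grows linearly with N, the total number of characters in the input (treating the length of a single code as bounded).

The space complexity is O(n) for the tree and the prefix table. The compressed file size depends on the data. The encoded data itself is usually smaller than the original text, but our output file also carries the header, so a very small file, or one whose characters are close to uniformly distributed, can end up larger than the original.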
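
To put rough numbers on this, the sketch below compares the size of the original text with the size of the encoded data for the string "abracadabra". The sizes helper and the code table are assumptions made up for this example (the table is one valid Huffman code for these frequencies, and other tie-breaks give different codes with the same total).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sizes returns the size of the original UTF-8 text and of the
// Huffman-encoded data, both in bits. The header is not counted.
func sizes(freq map[rune]int, codes map[rune]string) (originalBits, payloadBits int) {
	for ch, n := range freq {
		originalBits += n * utf8.RuneLen(ch) * 8
		payloadBits += n * len(codes[ch])
	}
	return originalBits, payloadBits
}

func main() {
	// character frequencies of "abracadabra"
	freq := map[rune]int{'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1}
	// one possible Huffman code for those frequencies
	codes := map[rune]string{'a': "0", 'b': "100", 'r': "101", 'c': "110", 'd': "111"}

	original, payload := sizes(freq, codes)
	// prints "original: 88 bits, payload: 23 bits"
	fmt.Printf("original: %d bits, payload: %d bits\n", original, payload)
}
```

The encoded data fits in 3 bytes instead of 11, but the text header for five characters already takes several dozen bytes, so for an input this small the output file is larger than the input. The savings only show up once the text is long enough to amortize the header.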

Further Reading and References

This project was inspired by this Coding Challenge. For a better understanding of Huffman Trees, you can go through this article. For the complete solution to this project, see my GitHub repository.

Thank you for reading so far, and make sure to subscribe to my blog newsletter to never miss out on fun projects such as these!