Implementation in Go
Anything we learn about in Computer Science, is incomplete without an implementation in whatever language is trending at that moment (seriously, don't worry about the language, just focus on the concepts). That being said, let's build a compression/de-compression tool using everything we learnt about Huffman Coding in Part 1 of this article!
I'd strongly suggest you to first try to implement this yourself. If you get stuck somewhere you can follow along with this article, or take a look at the source code on my GitHub.
Step 0
Our goal in this step is to build a simple application that:
Takes in the path of a
.txt
file as inputRead the text and determine the frequency of each character occurring within the text
Here's the code that achieves these two steps
func main(){
filePath := flag.String("path", "", "path to file to be compressed")
flag.Parse()
if *filePath == "" {
panic("File path is required")
}
file, err := os.Open(*filePath)
if err != nil {
panic(err)
}
defer CloseFile(file)
// declare a map to store the frequency of each character in the file
freqMap := buildFreqMap(file)
}
func buildFreqMap(file *os.File) map[rune]int {
// this will store a character->frequency mapping
freqMap := make(map[rune]int)
reader := bufio.NewReader(file)
for {
char, _, err := reader.ReadRune()
if err != nil && err != io.EOF {
panic(err)
}
if err == io.EOF {
break
}
freqMap[char]++
}
return freqMap
}
// utility function to close a file in Go
func CloseFile(file *os.File) {
err := file.Close()
if err != nil {
panic(err)
}
}
Step 1
In this step, our goal is to build the Huffman tree that we read so much about in the first part of this article. We will be using the frequency map table that we built in the previous step.
I would strongly suggest you to try building the Huffman tree yourself before looking at the solution below.
Let's take a look at the huff
package, which defines our Huffman tree structure:
package huff
type BaseNode interface {
IsLeaf() bool
Weight() int
}
type LeafNode struct {
weight int
element rune
}
// implementing the BaseNode interface for the LeafNode struct
func (node LeafNode) IsLeaf() bool {
return true
}
func (node LeafNode) Weight() int {
return node.weight
}
func (node LeafNode) Value() rune {
return node.element
}
// constructor for the LeafNode struct
func NewHuffLeafNode(el rune, w int) *LeafNode {
return &LeafNode{element: el, weight: w}
}
type InternalNode struct {
weight int
left, right BaseNode
LeftEdge int
RightEdge int
}
// constructor for the LeafNode struct
func NewHuffInternalNode(l, r BaseNode, w int) *InternalNode {
return &InternalNode{left: l, right: r, weight: w, LeftEdge: 0, RightEdge: 1}
}
// implementing the BaseNode interface for the InternalNode struct
func (node InternalNode) IsLeaf() bool {
return false
}
func (node InternalNode) Weight() int {
return node.weight
}
func (node InternalNode) Left() BaseNode {
return node.left
}
func (node InternalNode) Right() BaseNode {
return node.right
}
Now that we have the various Nodes defined, we can start defining the Huffman Tree.
type Tree struct {
root BaseNode
}
func NewHuffTreeFromLeaf(r BaseNode) *Tree {
return &Tree{root: r}
}
func NewHuffTreeFromNodes(l, r BaseNode, wt int) *Tree {
return &Tree{root: NewHuffInternalNode(l, r, wt)}
}
func (tree *Tree) Root() BaseNode {
return tree.root
}
func (tree *Tree) Weight() int {
return tree.root.Weight()
}
This structure allows us to create a tree with leaf nodes representing characters and internal nodes for branching.
Next, we need to define the methods that will help build our Huffman Tree from individual nodes.
// HuffmanHeap is a min-heap of HuffTree pointers
type HuffmanHeap []*Tree
// By implementing these methods, the HuffmanHeap type satisfies
// the heap.Interface, allowing it to be used with Go's
// heap package. This enables efficient selection of the two
// lowest-weight trees at each step of the Huffman tree
// construction process.
func (h *HuffmanHeap) Len() int { return len(*h) }
func (h *HuffmanHeap) Less(i, j int) bool { return (*h)[i].Weight() < (*h)[j].Weight() }
func (h *HuffmanHeap) Swap(i, j int) { (*h)[i], (*h)[j] = (*h)[j], (*h)[i] }
// implement the push and pop functions for the min-heap
func (h *HuffmanHeap) Push(x interface{}) {
*h = append(*h, x.(*Tree))
}
func (h *HuffmanHeap) Pop() interface{} {
old := *h
n := len(old)
x := old[n-1]
*h = old[0 : n-1]
return x
}
// BuildHuffmanTree constructs a Huffman tree from a map of
// character frequencies
func BuildHuffmanTree(freqsMap map[rune]int) *Tree {
// Create a min-heap
h := &HuffmanHeap{}
heap.Init(h)
// Create a leaf node for each character and add it to the heap
for ch, freq := range freqsMap {
huffNode := NewHuffLeafNode(ch, freq)
tree := NewHuffTreeFromLeaf(huffNode)
heap.Push(h, tree)
}
// While there is more than one tree in the heap
for h.Len() > 1 {
// Remove the two trees with the lowest weight
tree1 := heap.Pop(h).(*Tree)
tree2 := heap.Pop(h).(*Tree)
// Create a new internal node with these two nodes as children
combinedWeight := tree1.Weight() + tree2.Weight()
newTree := NewHuffTreeFromNodes(tree1.Root(), tree2.Root(), combinedWeight)
// Add the new tree back to the heap
heap.Push(h, newTree)
}
// The last remaining tree is the Huffman tree
return heap.Pop(h).(*Tree)
}
Now that we have the Huffman Tree, we can proceed to building the prefix table.
Step 2
The prefix table will map characters to their Huffman codes and will be used during the encoding process.
func BuildPrefixTable(root huff.BaseNode) map[rune]string {
prefixTable := make(map[rune]string)
buildPrefixTableHelper(root, "", prefixTable)
return prefixTable
}
func buildPrefixTableHelper(node huff.BaseNode, currentPrefix string, prefixTable map[rune]string) {
switch n := node.(type) {
case *huff.LeafNode:
prefixTable[n.Value()] = currentPrefix
case *huff.InternalNode:
buildPrefixTableHelper(n.Left(), currentPrefix+strconv.Itoa(n.LeftEdge), prefixTable)
buildPrefixTableHelper(n.Right(), currentPrefix+strconv.Itoa(n.RightEdge), prefixTable)
}
}
This recursive function traverses the Huffman tree, building the code for each character.
Step 3
In this step, we construct a header section for our output file. This header section will be used when we want to obtain the original file from our compressed one, that is, during the decoding process. So, writing the prefix table we generated in the previous step, to the header section should be enough. It has all the information we need to obtain our original data.
Let's modify our main function to carry out this step.
func main() {
// declare a flag variable to accept file name as input
filePath := flag.String("path", "", "path to file to be compressed")
outputPath := flag.String("output", "", "output file path")
flag.Parse()
if *filePath == "" || *outputPath == "" {
panic("File path is required")
}
file, err := os.Open(*filePath)
if err != nil {
panic(err)
}
defer CloseFile(file)
// declare a map to store the frequency of each character in the file
freqMap := buildFreqMap(file)
// build a huffman tree with the freqMap
huffTree := huff.BuildHuffmanTree(freqMap)
// build the prefix table from the huffman tree
prefixTable := BuildPrefixTable(huffTree.Root())
// create a new file at the output path provided
outputFile, err := os.Create(*outputPath)
if err != nil {
panic(err)
}
CloseFile(outputFile)
// Open the file in append mode
outputFile, err = os.OpenFile(*outputPath, os.O_APPEND|os.O_WRONLY, 0644)
if err != nil {
panic(err)
}
defer CloseFile(outputFile)
writePrefixTableToOutputFile(outputFile, prefixTable)
}
Let's define our writePrefixTableToOutputFile
function.
func writePrefixTableToOutputFile(outputFile *os.File, prefixTable map[rune]string) {
// Create a writer
writer := bufio.NewWriter(outputFile)
for char, prefix := range prefixTable {
// Convert the character to its Unicode code point
codePoint := int(char)
// Write the code point, prefix, and a delimiter
_, err := fmt.Fprintf(writer, "%d\t%s\n", codePoint, prefix)
if err != nil {
panic(err)
}
}
_, err := writer.WriteString("***HEADER*END***\n")
if err != nil {
panic(err)
}
// Flush the writer to ensure all buffered operations have been applied to the underlying writer
err = writer.Flush()
if err != nil {
panic(err)
}
}
We first convert each character to its Unicode integer representation. This prevents the headache of finding a delimiting character that does not occur in the original text file. Each row of the prefix table is written as unicode_int<space>huffman_code
.
We mark the end of the table as ***HEADER*END***
, so that we can know when the header ends and the compressed data begins.
Step 4
We finally arrive where we wanted to: the encoding process. In this step, we use the prefix table to encode each character in the original text file to its huffman code, and write the compressed data to the output file. We will need to translate the prefixes into bit strings, and then pack them into bytes to achieve the compression. This will get a bit clearer when we look at the code.
func main(){
//... earlier code
reader := bufio.NewReader(file)
writer := bufio.NewWriter(outputFile)
// this butBuffer will store the prefix string
var bitBuffer uint8
// this will store the count of bits in the bitBuffer
var bitCount uint8
for {
// read each character from the input file
char, _, err := reader.ReadRune()
if err != nil {
if err == io.EOF {
break
}
panic(err)
}
// get the huffman code(prefix string) for the read character
// using the prefix table
bitString := prefixTable[char]
for _, bit := range bitString {
// append each bit in the bitString to the bitBuffer
bitBuffer = (bitBuffer << 1) | uint8(bit-'0')
// increment the bitCount for each insertion
bitCount++
// once the bitCount equals 8, we have a complete byte
// that we can write to our output file
if bitCount == 8 {
err := writer.WriteByte(bitBuffer)
if err != nil {
panic(err)
}
// reset the bitBuffer and bitCount variables
bitBuffer = 0
bitCount = 0
}
}
}
}
This code reads each character, looks up its Huffman code, and writes the code bits to the output file, packing them into bytes. Once you run this code, you will find an output file at the path you provided. If you check its size, it should be much less than the original input size!
Step 5
In this step, we will begin our decoding process. In order to decode the compressed file, we need the prefix table that we stored in its header. So, let's read the header and construct the prefix table from it.
func DecodeFile(filePath string) {
// open the output file and initiate a reader for it
file, err := os.Open(filePath)
if err != nil {
panic(err)
}
defer CloseFile(file)
reader := bufio.NewReader(file)
// initialize an empty map as the prefix table
prefixTable := make(map[string]rune)
for {
// read each line from the file, using the \n character as the
// delimiter
line, err := reader.ReadString('\n')
if err != nil {
panic(err)
}
line = strings.TrimSpace(line)
// check if the line equals ***HEADER*END***
// if it does, break out of the loop, our prefix table is ready
if line == "***HEADER*END***" {
break
}
// split the read line using the \t delimiter
// this is what separates the unicode int and their huffman
// codes
parts := strings.Split(line, "\t")
if len(parts) != 2 {
continue
}
// get the unicode and its prefix from the line
codePoint, err := strconv.Atoi(parts[0])
if err != nil {
panic(err)
}
char := rune(codePoint)
prefix := parts[1]
// make an entry in the prefix table
// note that this is the reverse mapping of how we originally
// constructed our prefix table
prefixTable[prefix] = char
}
// at the end of this loop, our prefix table will be ready
}
Step 6
Now that we have our prefix table ready, we can read each byte of the compressed file and decode it using the prefix table. Let's take a look at the modified DecodeFile
function.
func DecodeFile(filePath string) {
// ...
// ... -> code to construct prefix table
var decodedData strings.Builder
var currentPrefix strings.Builder
// Read the compressed data
for {
b, err := reader.ReadByte()
if err != nil {
if err == io.EOF {
break
}
panic(err)
}
// Process each bit in the byte
for i := 7; i >= 0; i-- {
bit := (b >> uint(i)) & 1
currentPrefix.WriteByte('0' + bit)
// check if currentPrefix exists in the prefixTable
if char, ok := prefixTable[currentPrefix.String()]; ok {
// if it does, write the corresponding character to
// the decodedData
decodedData.WriteRune(char)
currentPrefix.Reset()
}
}
}
// The last byte contains the number of valid bits in the previous byte
validBits, err := reader.ReadByte()
if err != nil && err != io.EOF {
panic(err)
}
// Create a new file for writing the decoded text
decodedFilePath := filepath.Join(filepath.Dir(filePath), "decoded_"+filepath.Base(filePath))
decodedFile, err := os.Create(decodedFilePath)
if err != nil {
panic(err)
}
defer CloseFile(decodedFile)
// Write the decoded text to the file
_, err = decodedFile.WriteString(decodedData.String()[:len(decodedData.String())-int(8-validBits)])
if err != nil {
panic(err)
}
fmt.Printf("Decoded text has been written to: %s\n", decodedFilePath)
}
That's it! Our DecodeFile
function is ready. Now when you call it at the end of your main
function, you will find a decoded_out.txt
file being created at the path you provide. Check its contents and you'll find that its identical to the original input file you used.
You have successfully achieved compression and decompression using Huffman Coding!
Efficiency and Complexity
Huffman coding provides optimal prefix-free coding, meaning no codeword is a prefix of another. The time complexity for building the Huffman tree is O(n log n) where n is the number of unique characters. Encoding and decoding both have a time complexity of O(N) where N is the total number of characters in the input.
The space complexity is O(n) for the tree and prefix table, and the compressed file size depends on the data but is guaranteed to be no larger than the original file (in the worst case of uniformly distributed characters).
Further Reading and References
This project was inspired by this Coding Challenge. For a better understanding of Huffman Trees, you can go through this article. To refer to the complete solution of this project, you can refer to my GitHub repository.
Thank you for reading so far, and make sure to subscribe to my blog newsletter to never miss out on fun projects such as these!