UTF-8 | TechShoulders

UTF-8 is a variable-width character encoding that can represent every character in the Unicode standard. Created by Ken Thompson and Rob Pike, it became the dominant character encoding on the web, enabling true internationalization.

Origins

In 1992, Thompson and Pike designed UTF-8 during dinner at a New Jersey diner. They sought an encoding that could represent all Unicode characters while remaining backward-compatible with ASCII and never producing a zero byte except when encoding the NUL character itself.

Design Elegance

UTF-8’s brilliance lies in its properties:

ASCII compatibility: ASCII text is valid UTF-8
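Self-synchronizing: A decoder can find character boundaries from any byte position, because continuation bytes (bit pattern 10xxxxxx) never look like the first byte of a character
No null bytes: A zero byte appears only when encoding U+0000 itself, so text is safe in C strings and file paths
Prefix code: No character's byte sequence is a prefix of another's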
Compact for ASCII: English text uses only one byte per character

How It Works

UTF-8 uses variable-length encoding:

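ASCII characters (U+0000 to U+007F): 1 byte
Accented Latin, Greek, Cyrillic, Hebrew, Arabic (U+0080 to U+07FF): 2 bytes
Most other scripts, including Chinese, Japanese, and Korean (U+0800 to U+FFFF): 3 bytes
Supplementary-plane characters, such as emoji and rare historic scripts (U+10000 to U+10FFFF): 4 bytes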
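
To make the byte lengths concrete, here is a hedged sketch of an encoder written by hand. The function name utf8_encode is an assumption for illustration; real programs should use the language's built-in codec, which the sketch is checked against at the end.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative, not optimized)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:                       # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare against Python's built-in codec for one sample per length class.
for ch in ["A", "é", "Ж", "日", "😀"]:
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```

The first byte announces the sequence length through its leading bits, and every following byte starts with 10, which is what lets a decoder tell the two apart.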

Impact

UTF-8 transformed computing:

Dominant encoding on the web (over 98% of websites)
Default encoding on Linux, macOS, and most other modern Unix-like operating systems
Enabled true global text processing
Made internationalization practical for software
Solved decades of character encoding chaos

Key Contributors

Rob Pike