Character Encoding Comparison
Understanding the differences between ASCII, Extended ASCII, and Unicode
Overview
ASCII
- 7-bit encoding (0-127)
- 128 unique characters
- English-centric character set
- First published in 1963
- Supported by virtually all computing systems
Extended ASCII
- 8-bit encoding (0-255)
- 256 unique characters
- Multiple standards (CP437, ISO-8859)
- Added symbols and non-English characters
- Not standardized across systems
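To illustrate why the lack of a single 8-bit standard matters, here is a minimal Python sketch that decodes the same byte under three code pages. The helper name `decode_under` and the choice of byte `0x80` are assumptions made for this example; the codec names are those shipped with Python's standard library.

```python
def decode_under(raw: bytes, codec_names=("cp437", "latin-1", "cp1252")) -> dict:
    """Decode the same bytes under several 8-bit code pages (hypothetical helper)."""
    return {name: raw.decode(name) for name in codec_names}


# Byte 0x80 lies outside the 7-bit ASCII range, so its meaning
# depends entirely on which code page the reader assumes.
results = decode_under(bytes([0x80]))
for name, char in results.items():
    print(f"{name:8} -> {char!r} (U+{ord(char):04X})")
```

Each code page maps the byte to a different character, which is why text written on one system could display incorrectly on another.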
Unicode
- Code points stored as UTF-8 or UTF-16 (variable-width) or UTF-32 (fixed-width)
- Over 149,000 characters as of Unicode 15.1
- Aims to support all writing systems
- First published in 1991
- UTF-8 is the dominant encoding
Encoding Evolution
1963
ASCII
The original 7-bit encoding standard that defined 128 characters for electronic communication.
1981
Extended ASCII
8-bit extensions that added 128 more characters to support additional languages and symbols.
1991
Unicode
A comprehensive encoding standard that aims to include every character from all human languages.
Key Milestones
ASCII Development
- 1963: First version of ASCII published
- 1967: ASCII revised to include lowercase letters
- 1968: US adopts ASCII as a standard (ANSI X3.4)
- 1972: ASCII becomes international standard (ISO 646)
Extended ASCII Variants
- 1981: IBM PC introduces Code Page 437
- 1987: ISO 8859 series introduced for European languages
- 1992: Windows introduces CP1252 (Windows-1252)
- 1990s: Multiple region-specific code pages developed
Unicode Evolution
- 1991: Unicode 1.0 released with 7,161 characters
- 1993: Unicode 1.1 aligned with ISO/IEC 10646-1:1993
- 2007: UTF-8 overtakes legacy encodings as the most common encoding on the web
- 2010: Emoji characters added to Unicode 6.0
- 2023: Unicode 15.1 with over 149,000 characters
UTF-8 Adoption
- 1992: UTF-8 designed by Ken Thompson and Rob Pike (presented at USENIX in January 1993)
- 2003: UTF-8 becomes an official internet standard (RFC 3629)
- 2007: UTF-8 surpasses legacy encodings on the web
- 2022: Over 97% of websites use UTF-8 encoding
Detailed Comparison
Feature | ASCII | Extended ASCII | Unicode |
---|---|---|---|
Bit Depth | 7-bit | 8-bit | Variable (8 to 32 bits) |
Character Range | 0-127 | 0-255 | 0-1,114,111 |
Language Support | English only | Depends on code page (Western European, Cyrillic, Greek, and others) | Nearly all writing systems in use |
Control Characters | 0-31, 127 | 0-31, 127; also 128-159 in ISO-8859 | U+0000-U+001F, U+007F-U+009F |
Common Implementations | ANSI X3.4 (US-ASCII) | CP437, ISO-8859, Windows-1252 | UTF-8, UTF-16, UTF-32 |
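The character ranges in the table can be turned into a small check. The sketch below is only illustrative: the function name `narrowest_charset` and its labels are assumptions for this example. It relies on the fact that the first 256 Unicode code points coincide with ISO-8859-1, so the middle label applies to that code page specifically, not to every Extended ASCII variant.

```python
MAX_CODE_POINT = 0x10FFFF  # 1,114,111, the upper bound of the Unicode code space


def narrowest_charset(text: str) -> str:
    """Name the smallest of the three ranges that can hold every character in text."""
    highest = max(map(ord, text), default=0)
    if highest <= 127:
        return "ASCII"
    if highest <= 255:
        return "ISO-8859-1"  # one 8-bit code page; others assign 128-255 differently
    return "Unicode"


for sample in ("hello", "café", "€5"):
    print(f"{sample!r}: {narrowest_charset(sample)}")
```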
UTF Encoding Comparisons
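Encoding | Bytes Per Character | Advantages | Disadvantages |
---|---|---|---|
UTF-8 | 1-4 bytes | ASCII-compatible; no byte-order issues; compact for Latin-script text | 3 bytes for most CJK characters; no constant-time indexing by character |
UTF-16 | 2 or 4 bytes | Compact for most CJK text; native string format in Windows, Java, and JavaScript | Not ASCII-compatible; byte order matters; surrogate pairs needed above U+FFFF |
UTF-32 | 4 bytes | Fixed width; direct indexing by code point | 4 bytes for every character; byte order matters; not ASCII-compatible |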
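The byte counts per character can be checked directly in Python. This is a minimal sketch; the helper name `encoded_sizes` and the sample characters are assumptions chosen for illustration. The explicit little-endian codecs are used so that no byte order mark is added to the counts.

```python
def encoded_sizes(char: str) -> dict:
    """Return the number of bytes needed to encode char in each UTF form."""
    # The "-le" variants fix the byte order, so Python does not prepend a BOM.
    return {enc: len(char.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


# One sample from each UTF-8 length class: ASCII, Latin-1 Supplement,
# the rest of the Basic Multilingual Plane, and a supplementary plane.
for char in ("A", "é", "€", "😀"):
    print(f"U+{ord(char):04X} {char}: {encoded_sizes(char)}")
```

UTF-8 grows from one to four bytes across the samples, UTF-16 switches from two to four bytes only for the emoji (a surrogate pair), and UTF-32 stays at four bytes throughout.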
Common Encoding Issues
Mojibake
When text is decoded using the wrong character encoding, producing garbled characters. For example, a UTF-8 "é" read as Windows-1252 appears as "Ã©".
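Mojibake is easy to reproduce. The sketch below assumes the common case of UTF-8 bytes being read as Windows-1252; the function names `garble` and `repair` are hypothetical and exist only for this example.

```python
def garble(text: str, actual: str = "utf-8", assumed: str = "cp1252") -> str:
    """Encode with the real encoding, then decode with the wrong one."""
    return text.encode(actual).decode(assumed)


def repair(garbled: str, actual: str = "utf-8", assumed: str = "cp1252") -> str:
    """Undo garble() by reversing the mistaken decode step."""
    # This only works if the wrong decode lost no information; Python's
    # cp1252 codec leaves a few bytes undefined and would raise on those.
    return garbled.encode(assumed).decode(actual)


broken = garble("café")
print(broken)          # the two UTF-8 bytes of "é" show up as two characters
print(repair(broken))  # back to the original text
```

Reversing the mistake is only possible when the wrong encoding is known, which is why guessing encodings after the fact is unreliable.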
BOM (Byte Order Mark)
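A special invisible character (U+FEFF) at the beginning of a text file that indicates the encoding and, for UTF-16 and UTF-32, the byte order (endianness). In UTF-8 it is optional and serves only as an encoding signature, since UTF-8 has no byte-order variants.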
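As a rough sketch of how a BOM can be used, the following Python snippet inspects the first bytes of some data and reports which UTF form the BOM indicates. The function name `sniff_bom` is an assumption for this example; the BOM constants come from the standard `codecs` module. Absence of a BOM says nothing definite about the encoding, so the function returns `None` in that case.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


with_bom = "hi".encode("utf-8-sig")   # the utf-8-sig codec prepends EF BB BF
print(sniff_bom(with_bom))            # utf-8
print(repr(with_bom.decode("utf-8")))      # BOM survives as '\ufeff'
print(repr(with_bom.decode("utf-8-sig")))  # BOM stripped
```

Decoding with plain `utf-8` keeps the BOM as a leading U+FEFF character, which is a frequent source of invisible-character bugs; `utf-8-sig` removes it.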