    Character Encoding Comparison

    Understanding the differences between ASCII, Extended ASCII, and Unicode

    Overview

    ASCII

    • 7-bit encoding (0-127)
    • 128 unique characters
    • English-centric character set
    • First published in 1963
    • Supported by virtually all modern computing systems
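
    The 7-bit limit can be checked directly: every ASCII byte has its high bit clear. The following is a minimal sketch; the helper name is_ascii is hypothetical and chosen only for illustration.

```python
def is_ascii(data: bytes) -> bool:
    """Return True if every byte fits in 7 bits (value 0-127)."""
    # 0x80 is the high bit; ASCII never sets it.
    return all((byte & 0x80) == 0 for byte in data)


encoded = "Hello".encode("ascii")
print(list(encoded))                      # [72, 101, 108, 108, 111]
print(is_ascii(encoded))                  # True
print(is_ascii("café".encode("utf-8")))   # False

# Characters outside the 128-character set cannot be encoded as ASCII.
try:
    "café".encode("ascii")
except UnicodeEncodeError as exc:
    print("not ASCII:", exc.reason)
```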

    Extended ASCII

    • 8-bit encoding (0-255)
    • 256 unique characters
    • Multiple standards (CP437, ISO-8859)
    • Added symbols and non-English characters
    • Not standardized across systems
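
    Because the upper 128 values were assigned differently by each code page, one byte can stand for different characters depending on the system. The sketch below decodes the single byte 0x80 under three code pages using Python's built-in codecs; the helper name decode_under is hypothetical.

```python
def decode_under(raw: bytes, codecs_to_try):
    """Decode the same bytes under several code pages and return the results."""
    return {name: raw.decode(name) for name in codecs_to_try}


raw = bytes([0x80])
results = decode_under(raw, ("cp437", "latin-1", "cp1252"))

for name, ch in results.items():
    # Show the resulting Unicode code point rather than the glyph itself,
    # since 0x80 is an invisible control character in ISO 8859-1 (latin-1).
    print(f"{name:8} -> U+{ord(ch):04X}")
```

    Under CP437 the byte is "Ç", under Windows-1252 it is the euro sign, and under ISO 8859-1 it is a control character.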

    Unicode

    • Variable-width encodings (UTF-8, UTF-16) and fixed-width UTF-32
    • Over 149,000 characters (as of Unicode 15.1)
    • Aims to support all writing systems
    • First published in 1991
    • UTF-8 is the dominant encoding

    Encoding Evolution

    1963

    ASCII

    The original 7-bit encoding standard that defined 128 characters for electronic communication.

    1981

    Extended ASCII

    8-bit extensions that added 128 more characters to support additional languages and symbols.

    1991

    Unicode

    A comprehensive encoding standard that aims to include every character from all human languages.

    Key Milestones

    ASCII Development

    • 1963: First version of ASCII published
    • 1967: ASCII revised to include lowercase letters
    • 1968: US adopts ASCII as a standard (ANSI X3.4)
    • 1972: ASCII becomes international standard (ISO 646)

    Extended ASCII Variants

    • 1981: IBM PC introduces Code Page 437
    • 1987: ISO 8859 series introduced for European languages
    • 1992: Windows introduces CP1252 (Windows-1252)
    • 1990s: Multiple region-specific code pages developed

    Unicode Evolution

    • 1991: Unicode 1.0 released with 7,161 characters
    • 1996: Unicode and ISO 10646 synchronized
    • 2008: UTF-8 established as the most common encoding on the web
    • 2010: Emoji characters added to Unicode 6.0
    • 2023: Unicode 15.1 with over 149,000 characters

    UTF-8 Adoption

    • 1992: UTF-8 designed by Ken Thompson and Rob Pike (presented publicly in 1993)
    • 2003: UTF-8 becomes an official internet standard (RFC 3629)
    • 2007: UTF-8 surpasses legacy encodings on the web
    • 2022: Over 97% of websites use UTF-8 encoding

    Detailed Comparison

    Feature                 ASCII                     Extended ASCII                 Unicode
    Bit Depth               7-bit                     8-bit                          Variable (8 to 32 bits)
    Character Range         0-127                     0-255                          0-1,114,111
    Language Support        English only              Limited Western European       All world languages
    Control Characters      0-31, 127                 0-31, 127-159                  Various ranges
    Common Implementations  Universally standardized  CP437, ISO-8859, Windows-1252  UTF-8, UTF-16, UTF-32
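
    The three character ranges in the table can be illustrated with a short sketch. The constant and function names are hypothetical. Note one assumption: the 8-bit test uses the Unicode code point value, which matches ISO 8859-1 only; other code pages assign the values 128-255 differently.

```python
ASCII_MAX = 0x7F         # 127
EIGHT_BIT_MAX = 0xFF     # 255
UNICODE_MAX = 0x10FFFF   # 1,114,111


def smallest_range(ch: str) -> str:
    """Name the smallest range from the table that contains this code point."""
    code_point = ord(ch)
    if code_point <= ASCII_MAX:
        return "ASCII"
    if code_point <= EIGHT_BIT_MAX:
        return "8-bit"
    return "Unicode"


for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X} -> {smallest_range(ch)}")

# Python's chr() accepts values up to the top of the Unicode range and no further.
print(len(chr(UNICODE_MAX)))
try:
    chr(UNICODE_MAX + 1)
except ValueError:
    print("0x110000 is outside the Unicode code space")
```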

    UTF Encoding Comparisons

    Encoding: UTF-8
    Bytes Per Character: 1-4 bytes
    Advantages
    • Backward compatible with ASCII
    • Efficient for English text
    • Most popular on the web
    Disadvantages
    • Variable width makes indexing complex
    • Most Asian scripts need 3 bytes per character (2 in UTF-16)

    Encoding: UTF-16
    Bytes Per Character: 2 or 4 bytes
    Advantages
    • Used by JavaScript, Java, Windows
    • Efficient for most languages
    Disadvantages
    • Not backward compatible with ASCII
    • Surrogate pairs complicate processing

    Encoding: UTF-32
    Bytes Per Character: 4 bytes
    Advantages
    • Fixed width for easy indexing
    • Simple processing logic
    Disadvantages
    • Very inefficient storage
    • Rarely used in practice
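
    The size trade-offs above can be measured with Python's built-in codecs. This is a small sketch; the helper name encoded_sizes is hypothetical, and the little-endian codec variants are used so that no byte order mark is added to the counts.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes the text occupies in each UTF encoding."""
    codecs_to_try = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in codecs_to_try}


# One character each from: ASCII, Latin-1 range, CJK, and beyond the 16-bit range.
for ch in ("A", "\u00e9", "\u4e2d", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# A code point above U+FFFF is stored in UTF-16 as a surrogate pair.
print("\U0001F600".encode("utf-16-be").hex())   # d83dde00
```

    UTF-8 grows from 1 to 4 bytes across these samples, UTF-16 stays at 2 bytes until the last one, and UTF-32 is always 4.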

    Common Encoding Issues

    Mojibake

    Garbled text that appears when bytes are decoded with a different character encoding than the one used to encode them.
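
    Mojibake is easy to reproduce. The sketch below encodes a word as UTF-8 and then decodes it as Windows-1252, a common mismatch; the function names are hypothetical. Reversing the two steps recovers the text only when no bytes were lost along the way, so the repair should be read as an illustration rather than a general fix.

```python
def garble(text: str) -> str:
    """Encode as UTF-8, then wrongly decode as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(garbled: str) -> str:
    """Undo garble() by reversing the two steps."""
    return garbled.encode("cp1252").decode("utf-8")


original = "caf\u00e9"
broken = garble(original)
print(broken)            # cafÃ©
print(repair(broken))    # café
```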

    BOM (Byte Order Mark)

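    The invisible code point U+FEFF placed at the very start of a text file. In UTF-16 and UTF-32 it indicates the byte order (endianness); in UTF-8, where byte order does not apply, it appears as the bytes EF BB BF and serves only as an encoding signature.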
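
    A BOM can be detected by comparing the first bytes of a file against the constants in Python's codecs module. The following is a minimal sketch; the function name detect_bom is hypothetical, and it assumes that data without a BOM is simply reported as unknown (None). The UTF-32 marks are checked first because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM.

```python
import codecs

# Ordered so that longer marks are tested before their shorter prefixes.
BOM_TABLE = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def detect_bom(data: bytes):
    """Return a codec name that consumes the BOM, or None if no BOM is present."""
    for mark, codec_name in BOM_TABLE:
        if data.startswith(mark):
            return codec_name
    return None


with_bom = codecs.BOM_UTF8 + "hi".encode("utf-8")
print(detect_bom(with_bom))                    # utf-8-sig
print(repr(with_bom.decode("utf-8")))          # '\ufeffhi' (BOM kept as a character)
print(repr(with_bom.decode("utf-8-sig")))      # 'hi' (BOM stripped)
print(detect_bom(b"hi"))                       # None
```

    The plain "utf-8" codec leaves the BOM in the decoded string, which is a common source of stray invisible characters; "utf-8-sig" removes it.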