UGS-1: Unicode Glyph-Variant Steganography, Version 1.1

A deterministic, low-bandwidth Unicode homoglyph profile for hiding framed byte payloads in visible text

written by Ted Tschopp

1. Status and Security Notice

UGS-1 is a deterministic, low-bandwidth steganographic encoding for hiding a byte payload inside visible text by replacing selected Latin characters with visually similar Unicode characters.

UGS-1 is not encryption, authentication, or tamper protection. The CRC in this specification only detects accidental corruption. Anyone who understands the format can alter the payload and recompute the CRC.

If secrecy or tamper resistance is required, applications should encrypt and authenticate the payload before placing it in a UGS-1 frame, for example with an AEAD scheme.

UGS-1 data is fragile. Editors, messaging systems, source-control tools, spellcheckers, security gateways, LLM rewrites, transliteration tools, ASCII sanitizers, and confusable-character detectors may modify or remove the channel.

Unicode’s own security guidance discusses confusable characters and mechanisms for detecting them, so UGS-1 should be treated as a technique with security implications.

Reference: https://www.unicode.org/reports/tr39/

2. Goal

UGS-1 embeds a hidden byte payload inside visible text.

Conceptual example:

Latin a      U+0061  encodes bit 0
Cyrillic а   U+0430  encodes bit 1

Both glyphs may look identical in many fonts, but they are different Unicode code points.

UGS-1 profile H1 encodes one bit per eligible visible character.
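
The distinction between the two code points can be checked directly. The following Python sketch is illustrative only; the helper name code_point_label is hypothetical and not part of UGS-1.

```python
# Illustrative sketch: the two glyphs compare as different strings
# because they are different code points.
latin_a = "a"          # U+0061, the H1 zero variant (bit 0)
cyrillic_a = "\u0430"  # U+0430, the H1 one variant (bit 1)


def code_point_label(ch: str) -> str:
    """Format one character as a U+XXXX label (hypothetical helper)."""
    return f"U+{ord(ch):04X}"


glyphs_equal = latin_a == cyrillic_a
labels = (code_point_label(latin_a), code_point_label(cyrillic_a))
print(glyphs_equal, labels)
```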

3. Non-Goals

UGS-1 does not provide:

confidentiality
authentication
integrity against deliberate tampering
plausible deniability
robust survival through arbitrary text processing
high bandwidth

UGS-1 is intended to be simple, deterministic, testable, and easy to detect or remove when needed.

4. Terminology

Cover text
The visible text before embedding.

Carrier text
The visible text after embedding.

Payload
The hidden message as raw bytes.

Text payload
A payload interpreted as UTF-8 text.

Eligible character
A code point that appears in the UGS-1 H1 homoglyph table.

Zero variant
The ordinary Latin code point used to encode bit 0.

One variant
The visually similar Unicode code point used to encode bit 1.

Frame
The structured binary data embedded into the carrier text.

Variant reset
The process of replacing UGS-1 one variants with their corresponding zero variants before embedding. This is UGS-specific and must not be confused with Unicode normalization.

5. Required Unicode Handling

Implementations MUST operate on Unicode scalar values or code points, not encoded bytes.

Implementations MUST NOT iterate through UTF-16 strings in a way that incorrectly splits surrogate pairs. Profile H1 only uses BMP code points, but code-point handling keeps implementations portable.

Implementations MUST NOT normalize, case-fold, transliterate, ASCII-clean, confusable-map, or sanitize the carrier text after embedding.

Unicode normalization forms such as NFC, NFD, NFKC, and NFKD do not generally fold Cyrillic homoglyphs into Latin letters. However, other security and text-processing systems may do so. The larger practical risks are transliteration, ASCII filtering, confusable-skeleton mapping, spellcheck, OCR, LLM rewriting, and security sanitization.
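
The contrast between normalization and cruder filtering can be sketched in Python with the standard unicodedata module. The function names below are hypothetical, and ascii_filter is only a stand-in for the kind of sanitizer a real transport might apply.

```python
import unicodedata

# One variants from the H1 table (Section 7.1), written as escapes.
H1_ONE_VARIANTS = (
    "\u0410\u0430\u0412\u0421\u0441\u0415\u0435\u041d\u0406\u0456\u0408"
    "\u0458\u041a\u041c\u041e\u043e\u0420\u0440\u0422\u0425\u0445\u0443"
)


def normalization_survival(text: str) -> dict[str, bool]:
    """Report, per normalization form, whether the text is left unchanged."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return {form: unicodedata.normalize(form, text) == text for form in forms}


def ascii_filter(text: str) -> str:
    """A crude ASCII-only sanitizer; it silently drops non-ASCII code points."""
    return text.encode("ascii", errors="ignore").decode("ascii")


survival = normalization_survival(H1_ONE_VARIANTS)
filtered = ascii_filter("p\u0430yp\u0430l")  # Cyrillic a in both positions
print(survival, filtered)
```

In this sketch the four normalization forms leave the H1 one variants untouched, while the ASCII filter removes them and damages the visible text as well.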

Implementations SHOULD test the intended transport channel before use.

Reference: https://www.unicode.org/reports/tr15/

6. Structured Text Warning

Applications SHOULD avoid embedding into syntax-sensitive regions, including:

URLs
email addresses
domain names
source code
identifiers
HTML tags and attributes
Markdown syntax
JSON keys
YAML keys
CSV headers
shell commands
database queries
filenames
cryptographic material

For HTML, XML, Markdown, or rich-text formats, applications SHOULD embed only in visible prose text nodes, not in markup or syntax.

Replacing Latin letters with homoglyphs can preserve visual appearance while changing program behavior.

7. Profile H1: Binary Homoglyph Encoding

UGS-1 H1 encodes one bit per eligible character.

bit 0 -> use the Latin zero variant
bit 1 -> use the homoglyph one variant

7.1 Mandatory Homoglyph Table

All UGS-1 H1 implementations MUST support this exact table.

Bit 0 Latin	Code point	Bit 1 homoglyph	Code point	Bit-1 character name
`A`	U+0041	`А`	U+0410	Cyrillic Capital Letter A
`a`	U+0061	`а`	U+0430	Cyrillic Small Letter A
`B`	U+0042	`В`	U+0412	Cyrillic Capital Letter Ve
`C`	U+0043	`С`	U+0421	Cyrillic Capital Letter Es
`c`	U+0063	`с`	U+0441	Cyrillic Small Letter Es
`E`	U+0045	`Е`	U+0415	Cyrillic Capital Letter Ie
`e`	U+0065	`е`	U+0435	Cyrillic Small Letter Ie
`H`	U+0048	`Н`	U+041D	Cyrillic Capital Letter En
`I`	U+0049	`І`	U+0406	Cyrillic Capital Letter Byelorussian-Ukrainian I
`i`	U+0069	`і`	U+0456	Cyrillic Small Letter Byelorussian-Ukrainian I
`J`	U+004A	`Ј`	U+0408	Cyrillic Capital Letter Je
`j`	U+006A	`ј`	U+0458	Cyrillic Small Letter Je
`K`	U+004B	`К`	U+041A	Cyrillic Capital Letter Ka
`M`	U+004D	`М`	U+041C	Cyrillic Capital Letter Em
`O`	U+004F	`О`	U+041E	Cyrillic Capital Letter O
`o`	U+006F	`о`	U+043E	Cyrillic Small Letter O
`P`	U+0050	`Р`	U+0420	Cyrillic Capital Letter Er
`p`	U+0070	`р`	U+0440	Cyrillic Small Letter Er
`T`	U+0054	`Т`	U+0422	Cyrillic Capital Letter Te
`X`	U+0058	`Х`	U+0425	Cyrillic Capital Letter Ha
`x`	U+0078	`х`	U+0445	Cyrillic Small Letter Ha
`y`	U+0079	`у`	U+0443	Cyrillic Small Letter U

This table intentionally avoids many tempting lookalikes that are more font-dependent.
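
One possible in-memory form of the table is a pair of lookup dictionaries. This is a sketch, not a required data structure; the names H1_ZERO, H1_ONE, ZERO_TO_ONE, ONE_TO_ZERO, and h1_bit are assumptions made for illustration.

```python
from typing import Optional

# Zero variants (Latin) and one variants (Cyrillic), in table order.
# The Cyrillic side is written as escapes so the source stays unambiguous.
H1_ZERO = "AaBCcEeHIiJjKMOoPpTXxy"
H1_ONE = (
    "\u0410\u0430\u0412\u0421\u0441\u0415\u0435\u041d\u0406\u0456\u0408"
    "\u0458\u041a\u041c\u041e\u043e\u0420\u0440\u0422\u0425\u0445\u0443"
)
assert len(H1_ZERO) == len(H1_ONE) == 22

ZERO_TO_ONE = dict(zip(H1_ZERO, H1_ONE))
ONE_TO_ZERO = dict(zip(H1_ONE, H1_ZERO))


def h1_bit(ch: str) -> Optional[int]:
    """Return 0 or 1 for a character in the H1 table, or None otherwise."""
    if ch in ZERO_TO_ONE:
        return 0
    if ch in ONE_TO_ZERO:
        return 1
    return None
```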

8. Frame Format

The payload is wrapped in a binary frame.

All multi-bit fields are encoded most-significant bit first.

All multi-byte integers are encoded big-endian.

8.1 Layout

MAGIC          16 bits
VERSION         4 bits
FLAGS           4 bits
LENGTH         24 bits
PAYLOAD      8 * LENGTH bits
CRC16          16 bits

Total overhead:

64 bits

Required capacity:

required_bits = 64 + 8 * payload_length_in_bytes

8.2 MAGIC

Fixed value:

0xAD53

Binary:

10101101 01010011

8.3 VERSION

For this specification:

0x1

Encoded as 4 bits:

0001

8.4 FLAGS

For this specification:

0x0

Encoded as 4 bits:

0000

All nonzero flag values are reserved.

A decoder MUST reject frames with unknown nonzero flags unless explicitly configured to support an extension.

8.5 LENGTH

LENGTH is a 24-bit unsigned integer.

It gives the number of payload bytes.

Maximum representable payload size:

16,777,215 bytes

Practical payload size is usually much smaller because the cover text must contain enough eligible characters.

Decoders SHOULD support a configurable maximum accepted payload size.

8.6 PAYLOAD

The payload is raw bytes.

For ordinary text messages, the payload MUST be UTF-8 encoded text without a byte-order mark.

8.7 CRC16

CRC-16/CCITT-FALSE is computed over:

VERSION_FLAGS || LENGTH || PAYLOAD

Where:

VERSION_FLAGS = one byte: high nibble VERSION, low nibble FLAGS

For UGS-1 v1 with flags 0:

VERSION_FLAGS = 0x10

CRC parameters:

Name:       CRC-16/CCITT-FALSE
Width:      16
Polynomial: 0x1021
Initial:    0xFFFF
RefIn:      false
RefOut:     false
XorOut:     0x0000
Output:     big-endian

Normative pseudocode:

function crc16_ccitt_false(bytes):
    crc = 0xFFFF

    for b in bytes:
        crc = crc XOR (b << 8)

        repeat 8 times:
            if (crc AND 0x8000) != 0:
                crc = ((crc << 1) XOR 0x1021) AND 0xFFFF
            else:
                crc = (crc << 1) AND 0xFFFF

    return crc
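
The pseudocode above translates almost line for line into Python. The following is a non-normative sketch; the pseudocode remains the reference.

```python
def crc16_ccitt_false(data: bytes) -> int:
    """CRC-16/CCITT-FALSE: poly 0x1021, init 0xFFFF, no reflection, no final XOR."""
    crc = 0xFFFF
    for b in data:
        crc ^= b << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


# CRC input for the "OK" example: VERSION_FLAGS || LENGTH || PAYLOAD.
ok_crc = crc16_ccitt_false(bytes.fromhex("100000024F4B"))
print(f"{ok_crc:04X}")
```

The widely published check input for this CRC variant, the ASCII string 123456789, is a convenient additional self-test.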

9. Payload Preparation

For a text message:

payload_bytes = UTF8_ENCODE(hidden_message)

Then construct:

version_flags = 0x10
length         = 24-bit big-endian byte length
crc_input      = version_flags || length || payload_bytes
crc            = CRC16_CCITT_FALSE(crc_input)
frame_bytes    = 0xAD 0x53 || version_flags || length || payload_bytes || crc
frame_bits     = frame_bytes as bits, MSB first

Example hidden message:

OK

UTF-8 payload bytes:

4F 4B

Frame bytes:

AD 53 10 00 00 02 4F 4B 9E A9

Frame bits:

10101101 01010011 00010000 00000000 00000000
00000010 01001111 01001011 10011110 10101001
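
The construction above can be sketched in Python as follows. The function names build_frame and to_bits are hypothetical; the constants come from Section 8.

```python
def crc16_ccitt_false(data: bytes) -> int:
    """CRC-16/CCITT-FALSE as specified in Section 8.7."""
    crc = 0xFFFF
    for b in data:
        crc ^= b << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021 if crc & 0x8000 else crc << 1) & 0xFFFF
    return crc


def build_frame(payload: bytes) -> bytes:
    """Return MAGIC || VERSION_FLAGS || LENGTH || PAYLOAD || CRC16."""
    if len(payload) > 0xFFFFFF:
        raise ValueError("payload does not fit in a 24-bit LENGTH field")
    version_flags = bytes([0x10])            # VERSION 1, FLAGS 0
    length = len(payload).to_bytes(3, "big")  # 24-bit big-endian
    crc_input = version_flags + length + payload
    crc = crc16_ccitt_false(crc_input).to_bytes(2, "big")
    return b"\xad\x53" + crc_input + crc


def to_bits(data: bytes) -> list[int]:
    """Expand bytes into a list of bits, most-significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


frame = build_frame("OK".encode("utf-8"))
print(frame.hex(" ").upper())
```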

10. Encoder Algorithm

Input:

cover_text
payload_bytes

Output:

carrier_text

Algorithm:

1. Build the UGS-1 frame.

2. Convert frame bytes to frame bits, MSB first.

3. Reset UGS-1 variants in the cover text:
   For every code point:
       if it is a one variant from the H1 table:
           replace it with its corresponding zero variant
       else:
           leave it unchanged

4. Count capacity:
       capacity_bits = number of code points in the reset cover text
                       that are zero variants in the H1 table

5. If capacity_bits < frame_bits.length:
       fail with INSUFFICIENT_CAPACITY

6. Initialize bit_index = 0.

7. For each code point cp in the reset cover text:

       if cp is not a zero variant in the H1 table:
           output cp
           continue

       if bit_index >= frame_bits.length:
           output cp
           continue

       bit = frame_bits[bit_index]

       if bit == 0:
           output cp

       if bit == 1:
           output corresponding one variant

       bit_index = bit_index + 1

8. Return the output text.

The encoder MUST NOT alter non-eligible characters except when resetting recognized one variants to their corresponding zero variants.

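The encoder MUST reset variants before embedding, as in step 3 and in the implementation checklist, so that accidental pre-existing homoglyphs do not become hidden bits. Without the reset, a one variant already present in the cover text would pass through unchanged, and the decoder would read it as a bit 1 and misalign the frame.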
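
A minimal Python sketch of the encoder steps follows. It is one possible reading of the algorithm, not a reference implementation; the names encode_bytes, frame_bits, and InsufficientCapacity are assumptions.

```python
H1_ZERO = "AaBCcEeHIiJjKMOoPpTXxy"
H1_ONE = (
    "\u0410\u0430\u0412\u0421\u0441\u0415\u0435\u041d\u0406\u0456\u0408"
    "\u0458\u041a\u041c\u041e\u043e\u0420\u0440\u0422\u0425\u0445\u0443"
)
ZERO_TO_ONE = dict(zip(H1_ZERO, H1_ONE))
ONE_TO_ZERO = dict(zip(H1_ONE, H1_ZERO))


class InsufficientCapacity(ValueError):
    """Hypothetical exception standing in for INSUFFICIENT_CAPACITY."""


def crc16_ccitt_false(data: bytes) -> int:
    crc = 0xFFFF
    for b in data:
        crc ^= b << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021 if crc & 0x8000 else crc << 1) & 0xFFFF
    return crc


def frame_bits(payload: bytes) -> list[int]:
    """Steps 1 and 2: build the frame and expand it to bits, MSB first."""
    body = bytes([0x10]) + len(payload).to_bytes(3, "big") + payload
    frame = b"\xad\x53" + body + crc16_ccitt_false(body).to_bytes(2, "big")
    return [(byte >> shift) & 1 for byte in frame for shift in range(7, -1, -1)]


def encode_bytes(cover_text: str, payload: bytes) -> str:
    bits = frame_bits(payload)
    # Step 3: variant reset (Python iterates str by code point).
    reset = "".join(ONE_TO_ZERO.get(ch, ch) for ch in cover_text)
    # Steps 4 and 5: capacity check.
    capacity_bits = sum(1 for ch in reset if ch in ZERO_TO_ONE)
    if capacity_bits < len(bits):
        raise InsufficientCapacity(f"need {len(bits)} bits, have {capacity_bits}")
    # Steps 6 and 7: embed one bit per eligible character, then pass through.
    out = []
    bit_index = 0
    for ch in reset:
        if ch in ZERO_TO_ONE and bit_index < len(bits):
            out.append(ZERO_TO_ONE[ch] if bits[bit_index] else ch)
            bit_index += 1
        else:
            out.append(ch)
    return "".join(out)


carrier = encode_bytes("a" * 80, b"OK")
```

Because of the reset in step 3, encoding a cover text that already contains one variants gives the same carrier as encoding its all-Latin form.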

11. Decoder Algorithm

Input:

carrier_text

Output:

payload_bytes

11.1 Bit Extraction

1. Initialize bits = empty list.

2. For each code point cp in carrier_text:

       if cp is a zero variant in the H1 table:
           append bit 0 to bits

       else if cp is a one variant in the H1 table:
           append bit 1 to bits

       else:
           ignore cp

11.2 Strict Frame Decode

Strict mode assumes the frame starts at the first extracted bit.

1. If bits.length < 64:
       fail with NO_FRAME

2. Read first 16 bits as MAGIC.
       If MAGIC != 0xAD53:
           fail with BAD_MAGIC

3. Read next 4 bits as VERSION.
       If VERSION != 1:
           fail with UNSUPPORTED_VERSION

4. Read next 4 bits as FLAGS.
       If FLAGS != 0:
           fail with UNSUPPORTED_FLAGS

5. Read next 24 bits as LENGTH.

6. required_bits = 64 + 8 * LENGTH

7. If required_bits > bits.length:
       fail with INCOMPLETE_FRAME

8. If LENGTH exceeds the implementation's configured maximum payload size:
       fail with PAYLOAD_TOO_LARGE

9. Read PAYLOAD as LENGTH bytes.

10. Read CRC16.

11. Recompute CRC16 over:
       VERSION_FLAGS || LENGTH || PAYLOAD

12. If computed CRC16 != stored CRC16:
       fail with BAD_CRC

13. Ignore extracted bits after the parsed frame.

14. Return PAYLOAD.

A stricter application MAY reject trailing one bits after the parsed frame, but that behavior is not required for UGS-1 compatibility.

If the caller expects text, decode PAYLOAD as UTF-8. If UTF-8 decoding fails, return INVALID_UTF8.
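
Bit extraction and strict decoding can be sketched in Python as below. The exception class UgsError, the helper read_uint, and the default limit of 4096 bytes are assumptions of this sketch; the error names are those used in the steps above.

```python
H1_ZERO = frozenset("AaBCcEeHIiJjKMOoPpTXxy")
H1_ONE = frozenset(
    "\u0410\u0430\u0412\u0421\u0441\u0415\u0435\u041d\u0406\u0456\u0408"
    "\u0458\u041a\u041c\u041e\u043e\u0420\u0440\u0422\u0425\u0445\u0443"
)


class UgsError(Exception):
    """Hypothetical exception carrying one of the UGS-1 error names."""

    def __init__(self, code: str):
        super().__init__(code)
        self.code = code


def crc16_ccitt_false(data: bytes) -> int:
    crc = 0xFFFF
    for b in data:
        crc ^= b << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021 if crc & 0x8000 else crc << 1) & 0xFFFF
    return crc


def extract_bits(carrier_text: str) -> list[int]:
    """Section 11.1: zero variant -> 0, one variant -> 1, anything else ignored."""
    bits = []
    for ch in carrier_text:
        if ch in H1_ZERO:
            bits.append(0)
        elif ch in H1_ONE:
            bits.append(1)
    return bits


def read_uint(bits: list[int], start: int, width: int) -> int:
    """Read an unsigned integer of the given bit width, MSB first."""
    value = 0
    for bit in bits[start:start + width]:
        value = (value << 1) | bit
    return value


def decode_strict(carrier_text: str, max_payload_bytes: int = 4096) -> bytes:
    bits = extract_bits(carrier_text)
    if len(bits) < 64:
        raise UgsError("NO_FRAME")
    if read_uint(bits, 0, 16) != 0xAD53:
        raise UgsError("BAD_MAGIC")
    if read_uint(bits, 16, 4) != 1:
        raise UgsError("UNSUPPORTED_VERSION")
    if read_uint(bits, 20, 4) != 0:
        raise UgsError("UNSUPPORTED_FLAGS")
    length = read_uint(bits, 24, 24)
    if 64 + 8 * length > len(bits):
        raise UgsError("INCOMPLETE_FRAME")
    if length > max_payload_bytes:
        raise UgsError("PAYLOAD_TOO_LARGE")
    payload = bytes(read_uint(bits, 48 + 8 * i, 8) for i in range(length))
    stored_crc = read_uint(bits, 48 + 8 * length, 16)
    crc_input = bytes([0x10]) + length.to_bytes(3, "big") + payload
    if crc16_ccitt_false(crc_input) != stored_crc:
        raise UgsError("BAD_CRC")
    return payload  # bits after the parsed frame are ignored


def decode_text(carrier_text: str) -> str:
    try:
        return decode_strict(carrier_text).decode("utf-8")
    except UnicodeDecodeError:
        raise UgsError("INVALID_UTF8") from None
```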

12. Optional Scan Mode

Strict mode is deterministic, but it fails if visible text appears before the embedded frame and that preceding text contains eligible characters.

Scan mode searches for a valid frame at any extracted-bit offset.

1. Extract all bits from the carrier text.

2. For offset from 0 to bits.length - 64:

       if bits[offset : offset + 16] != MAGIC:
           continue

       Try to parse a frame beginning at offset.

       Before allocating payload storage:
           read LENGTH
           compute required_bits
           ensure required_bits <= bits.length - offset
           ensure LENGTH <= configured maximum payload size

       If version, flags, length, and CRC are valid:
           return the decoded payload and the bit offset

3. If no valid frame is found:
       fail with NO_VALID_FRAME

Because a candidate frame must match the 16-bit magic value, the version and flags fields, and the 16-bit CRC, accidental false positives are unlikely in ordinary text, but they are not impossible.

Applications that need stronger false-positive resistance should use a larger checksum or authenticated payload wrapper.
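
Scan mode can be sketched in Python over an already extracted bit list. The function name scan_for_frame, the use of LookupError for NO_VALID_FRAME, and the 4096-byte default limit are assumptions of this sketch.

```python
def crc16_ccitt_false(data: bytes) -> int:
    crc = 0xFFFF
    for b in data:
        crc ^= b << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021 if crc & 0x8000 else crc << 1) & 0xFFFF
    return crc


def read_uint(bits: list[int], start: int, width: int) -> int:
    """Read an unsigned integer of the given bit width, MSB first."""
    value = 0
    for bit in bits[start:start + width]:
        value = (value << 1) | bit
    return value


def scan_for_frame(bits: list[int], max_payload_bytes: int = 4096) -> tuple[bytes, int]:
    """Return (payload, bit offset) of the first valid frame in the bit list."""
    for offset in range(len(bits) - 64 + 1):
        if read_uint(bits, offset, 16) != 0xAD53:
            continue
        if read_uint(bits, offset + 16, 8) != 0x10:  # VERSION 1, FLAGS 0
            continue
        # Validate LENGTH before building any payload buffer.
        length = read_uint(bits, offset + 24, 24)
        if length > max_payload_bytes:
            continue
        if 64 + 8 * length > len(bits) - offset:
            continue
        start = offset + 48
        payload = bytes(read_uint(bits, start + 8 * i, 8) for i in range(length))
        stored_crc = read_uint(bits, start + 8 * length, 16)
        crc_input = bytes([0x10]) + length.to_bytes(3, "big") + payload
        if crc16_ccitt_false(crc_input) == stored_crc:
            return payload, offset
    raise LookupError("NO_VALID_FRAME")
```

A candidate that fails any check is skipped rather than treated as fatal, so a stray magic pattern earlier in the text does not hide a valid frame later on.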

13. Capacity Rules

For profile H1:

1 eligible character = 1 hidden bit

Required capacity:

64 + 8 * payload_length

Examples:

Payload	Payload bytes	Required eligible characters
empty payload	0	64
`OK`	2	80
`hello`	5	104
`meet at 9`	9	136
100-byte message	100	864

A practical encoder SHOULD report:

capacity_bits
required_bits
unused_capacity_bits
max_payload_bytes

Where:

if capacity_bits < 64:
    max_payload_bytes = 0
else:
    max_payload_bytes = floor((capacity_bits - 64) / 8)
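
The capacity report can be sketched in Python as follows. The function name capacity_report is hypothetical, and clamping unused_capacity_bits at zero when the cover text is too short is an assumption, since the rules above do not define that case.

```python
# Every H1 table character counts, because variant reset turns each
# one variant into an eligible zero variant before embedding.
H1_ELIGIBLE = frozenset(
    "AaBCcEeHIiJjKMOoPpTXxy"
    "\u0410\u0430\u0412\u0421\u0441\u0415\u0435\u041d\u0406\u0456\u0408"
    "\u0458\u041a\u041c\u041e\u043e\u0420\u0440\u0422\u0425\u0445\u0443"
)
FRAME_OVERHEAD_BITS = 64


def capacity_report(cover_text: str, payload_length: int) -> dict[str, int]:
    capacity_bits = sum(1 for ch in cover_text if ch in H1_ELIGIBLE)
    required_bits = FRAME_OVERHEAD_BITS + 8 * payload_length
    if capacity_bits < FRAME_OVERHEAD_BITS:
        max_payload_bytes = 0
    else:
        max_payload_bytes = (capacity_bits - FRAME_OVERHEAD_BITS) // 8
    return {
        "capacity_bits": capacity_bits,
        "required_bits": required_bits,
        # Assumption: report 0 rather than a negative number when too short.
        "unused_capacity_bits": max(capacity_bits - required_bits, 0),
        "max_payload_bytes": max_payload_bytes,
    }


report = capacity_report("a" * 80, payload_length=2)
print(report)
```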

14. Worked Test Vector

14.1 Hidden Message

OK

14.2 Payload Bytes

4F 4B

14.3 Full Frame Bytes

AD 53 10 00 00 02 4F 4B 9E A9

14.4 Full Frame Bits

10101101 01010011 00010000 00000000 00000000
00000010 01001111 01001011 10011110 10101001

14.5 Minimal Cover Text

Use 80 Latin a characters:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

Since a is an eligible character, each a can encode one bit.

14.6 Encoded Carrier as Escapes

For this test vector:

bit 0 -> a        U+0061
bit 1 -> \u0430   Cyrillic small а

Encoded carrier, shown as escape groups:

\u0430a\u0430a\u0430\u0430a\u0430
a\u0430a\u0430aa\u0430\u0430
aaa\u0430aaaa
aaaaaaaa
aaaaaaaa
aaaaaa\u0430a
a\u0430aa\u0430\u0430\u0430\u0430
a\u0430aa\u0430a\u0430\u0430
\u0430aa\u0430\u0430\u0430\u0430a
\u0430a\u0430a\u0430aa\u0430

A correct decoder should recover:

OK
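
The escape groups above can be checked mechanically without any framing logic. The following Python sketch simply maps each code point to a bit and regroups the bits into bytes; it relies on this vector's cover text containing only the letter a.

```python
# The ten escape groups from Section 14.6, one frame byte per group.
groups = [
    "\u0430a\u0430a\u0430\u0430a\u0430",
    "a\u0430a\u0430aa\u0430\u0430",
    "aaa\u0430aaaa",
    "aaaaaaaa",
    "aaaaaaaa",
    "aaaaaa\u0430a",
    "a\u0430aa\u0430\u0430\u0430\u0430",
    "a\u0430aa\u0430a\u0430\u0430",
    "\u0430aa\u0430\u0430\u0430\u0430a",
    "\u0430a\u0430a\u0430aa\u0430",
]
carrier = "".join(groups)

# Cyrillic a is bit 1, Latin a is bit 0.
bit_string = "".join("1" if ch == "\u0430" else "0" for ch in carrier)
frame = int(bit_string, 2).to_bytes(len(bit_string) // 8, "big")

# Skip MAGIC (2 bytes), VERSION_FLAGS (1), LENGTH (3); drop CRC16 (2).
payload = frame[6:-2]
print(frame.hex(" ").upper(), payload)
```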

15. Error Codes

Implementations SHOULD expose these distinct errors:

INSUFFICIENT_CAPACITY
NO_FRAME
BAD_MAGIC
UNSUPPORTED_VERSION
UNSUPPORTED_FLAGS
INCOMPLETE_FRAME
PAYLOAD_TOO_LARGE
BAD_CRC
INVALID_UTF8
NO_VALID_FRAME

For privacy or simplicity, user-facing applications MAY collapse these into a generic result such as:

no valid hidden message found
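
One way to keep the distinct codes internally while collapsing them for users is sketched below in Python. The enum name UgsErrorCode and the function user_facing_message are hypothetical, and keeping the encoder-side INSUFFICIENT_CAPACITY error separate from the generic decoder message is an assumption of this sketch.

```python
from enum import Enum


class UgsErrorCode(Enum):
    INSUFFICIENT_CAPACITY = "INSUFFICIENT_CAPACITY"
    NO_FRAME = "NO_FRAME"
    BAD_MAGIC = "BAD_MAGIC"
    UNSUPPORTED_VERSION = "UNSUPPORTED_VERSION"
    UNSUPPORTED_FLAGS = "UNSUPPORTED_FLAGS"
    INCOMPLETE_FRAME = "INCOMPLETE_FRAME"
    PAYLOAD_TOO_LARGE = "PAYLOAD_TOO_LARGE"
    BAD_CRC = "BAD_CRC"
    INVALID_UTF8 = "INVALID_UTF8"
    NO_VALID_FRAME = "NO_VALID_FRAME"


GENERIC_MESSAGE = "no valid hidden message found"


def user_facing_message(code: UgsErrorCode) -> str:
    """Collapse decoder errors into one message; keep the encoder error distinct."""
    if code is UgsErrorCode.INSUFFICIENT_CAPACITY:
        return "cover text has too few eligible characters"
    return GENERIC_MESSAGE
```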

16. Recommended API Shape

Implementations SHOULD expose byte-oriented primitives first:

encode_bytes(cover_text, payload_bytes) -> carrier_text

decode_bytes(carrier_text, mode = "strict" | "scan") -> {
    payload_bytes,
    offset_bits optional
}

capacity(cover_text) -> {
    capacity_bits,
    max_payload_bytes
}

UTF-8 convenience wrappers MAY be provided:

encode_text(cover_text, hidden_text) -> carrier_text

decode_text(carrier_text, mode = "strict" | "scan") -> hidden_text

The text wrappers MUST encode and decode payloads as UTF-8 without a byte-order mark.

17. Optional Extension: Zero-Width Profile Z1

Z1 is not part of mandatory UGS-1 H1 compatibility.

It may be useful when cover text lacks enough homoglyph-capable characters.

Unicode includes zero-width format characters such as:

U+200B ZERO WIDTH SPACE
U+200C ZERO WIDTH NON-JOINER
U+200D ZERO WIDTH JOINER

A simple Z1 profile could be:

bit 0 -> U+200B ZERO WIDTH SPACE
bit 1 -> U+200C ZERO WIDTH NON-JOINER

A Z1 encoder would insert one zero-width character between visible code points.

However, Z1 is often more fragile than H1 because messaging systems, editors, and security tools commonly strip invisible format characters.

Variation selectors are another possible channel, but they are also fragile and can interact with rendering behavior.

For a dependable first implementation, use H1.
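
For comparison only, the simple Z1 idea described above might look like the following Python sketch. Everything here is an assumption: the function names, carrying raw bits without a frame, filling gaps from the start of the text, treating every code point as visible, and assuming the cover text contains no zero-width characters of its own.

```python
ZW_ZERO = "\u200b"  # ZERO WIDTH SPACE, bit 0
ZW_ONE = "\u200c"   # ZERO WIDTH NON-JOINER, bit 1


def z1_embed(cover_text: str, bits: list[int]) -> str:
    """Insert one zero-width character per bit into the gaps between code points."""
    gaps = max(len(cover_text) - 1, 0)
    if len(bits) > gaps:
        raise ValueError("not enough gaps between code points")
    out = []
    for index, ch in enumerate(cover_text):
        out.append(ch)
        if index < len(bits):
            out.append(ZW_ONE if bits[index] else ZW_ZERO)
    return "".join(out)


def z1_extract(text: str) -> list[int]:
    """Collect the bits carried by the two zero-width characters, in order."""
    return [1 if ch == ZW_ONE else 0 for ch in text if ch in (ZW_ZERO, ZW_ONE)]


def z1_strip(text: str) -> str:
    """Remove the Z1 channel, leaving the visible text."""
    return "".join(ch for ch in text if ch not in (ZW_ZERO, ZW_ONE))


carrier = z1_embed("hello world", [1, 0, 1, 1])
```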

References:

https://www.unicode.org/charts/nameslist/n_2000.html
https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-23/
https://www.unicode.org/reports/tr37/

18. Detector and Sanitizer

A companion detector can reveal or remove UGS-1 data.

18.1 Reveal Mode

For every code point:

if code point is a one variant:
    display it as [1:U+XXXX]
else if code point is a zero variant:
    display it as [0:U+XXXX]
else:
    display normally

Reveal mode is useful for debugging, auditing, and teaching.
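
A minimal Python sketch of reveal mode, assuming the H1 table from Section 7.1; the function name reveal is hypothetical.

```python
H1_ZERO = frozenset("AaBCcEeHIiJjKMOoPpTXxy")
H1_ONE = frozenset(
    "\u0410\u0430\u0412\u0421\u0441\u0415\u0435\u041d\u0406\u0456\u0408"
    "\u0458\u041a\u041c\u041e\u043e\u0420\u0440\u0422\u0425\u0445\u0443"
)


def reveal(text: str) -> str:
    """Annotate every H1 table character with its bit value and code point."""
    parts = []
    for ch in text:
        if ch in H1_ONE:
            parts.append(f"[1:U+{ord(ch):04X}]")
        elif ch in H1_ZERO:
            parts.append(f"[0:U+{ord(ch):04X}]")
        else:
            parts.append(ch)
    return "".join(parts)


annotated = reveal("p\u0430y!")  # Latin p, Cyrillic a, Latin y
print(annotated)
```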

18.2 Strip Mode

For every code point:

if code point is a one variant:
    replace it with the corresponding zero variant
else:
    keep it unchanged

Strip mode removes the H1 hidden channel while preserving the intended Latin visible text.
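
Strip mode maps naturally onto a translation table. The following Python sketch uses str.maketrans and str.translate; the function name strip_h1 is hypothetical.

```python
H1_ZERO = "AaBCcEeHIiJjKMOoPpTXxy"
H1_ONE = (
    "\u0410\u0430\u0412\u0421\u0441\u0415\u0435\u041d\u0406\u0456\u0408"
    "\u0458\u041a\u041c\u041e\u043e\u0420\u0440\u0422\u0425\u0445\u0443"
)

# Map each one variant to its zero variant; all other code points pass through.
STRIP_TABLE = str.maketrans(H1_ONE, H1_ZERO)


def strip_h1(text: str) -> str:
    """Replace every H1 one variant with its Latin zero variant."""
    return text.translate(STRIP_TABLE)


cleaned = strip_h1("p\u0430yp\u0430l")
print(cleaned)
```

Note that a sketch like this rewrites every listed Cyrillic code point, including those in genuine Cyrillic words, so a defensive deployment would probably restrict it to text expected to be Latin.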

This is useful defensively because mixed-script confusables are a known Unicode security concern.

Reference: https://www.unicode.org/reports/tr39/

19. Implementation Checklist

A conforming UGS-1 H1 implementation MUST:

[ ] Use the mandatory H1 homoglyph table exactly.
[ ] Operate on Unicode code points.
[ ] Provide byte payload support.
[ ] Encode text payloads as UTF-8 without BOM when using text wrappers.
[ ] Build the exact frame format.
[ ] Use MAGIC = 0xAD53.
[ ] Use VERSION = 1.
[ ] Use FLAGS = 0.
[ ] Use 24-bit big-endian payload length.
[ ] Use CRC-16/CCITT-FALSE.
[ ] Encode bits MSB-first.
[ ] Encode one bit per eligible character.
[ ] Reset recognized one variants before embedding.
[ ] Reject bad magic.
[ ] Reject unsupported version.
[ ] Reject unsupported flags.
[ ] Reject incomplete frames.
[ ] Reject oversized payloads according to implementation limits.
[ ] Reject bad CRC.
[ ] Avoid post-encoding normalization, transliteration, or sanitization.

A robust implementation SHOULD also:

[ ] Provide scan mode.
[ ] Provide reveal mode.
[ ] Provide strip mode.
[ ] Report capacity before encoding.
[ ] Avoid syntax-sensitive regions in structured text.
[ ] Support a configurable maximum decode payload size.
[ ] Clearly state that CRC16 is not authentication.

20. Prior Art and Related Work

20.1 Scope of This Section

UGS-1 does not claim novelty over the broad idea of hiding information in text by using Unicode look-alikes, invisible characters, or whitespace variants. Prior work exists in Unicode security, IDN homograph attacks, text watermarking, Unicode homoglyph substitution, Unicode space-character hiding, and zero-width-character steganography. The purpose of UGS-1 is narrower: it defines a small, deterministic, interoperable profile with a fixed homoglyph table, a byte-oriented frame, a length field, versioning, CRC validation, strict decoding, scan decoding, and explicit capacity/error behavior.

This section is a non-exhaustive prior-art and related-work note, not a legal patent-clearance opinion.

20.2 Unicode Visual Confusability and Homograph Security

Unicode visual confusability is a well-documented security issue. Unicode Technical Report #36 describes visual spoofing as relying on different Unicode strings whose appearance in common fonts and screen sizes is close enough that users can mistake one for another. It also notes that Internationalized Domain Names increase opportunities for visual spoofing because the usable character repertoire is much larger than ASCII. (Unicode)

Unicode Technical Standard #39 defines mechanisms for detecting visually confusable strings. It defines skeleton-based confusable detection, classifies confusables as single-script, mixed-script, or whole-script, and gives the canonical example of paypal versus pаypаl, where the second string contains U+0430 CYRILLIC SMALL LETTER A. (Unicode)

The IDN homograph attack literature predates UGS-1 by decades. Gabrilovich and Gontmakher’s 2002 “The Homograph Attack” framed character resemblance as a security problem rather than merely a typographic nuisance, and discussed non-Latin substitutions that could make domain names appear authentic while resolving differently. Later measurement work on IDN homographs describes Unicode homoglyphs as visually similar characters that can be abused to create spoofing domains, and notes that Gabrilovich and Gontmakher demonstrated an IDN homograph using Russian letters in 2002. (arXiv)

UGS-1 intentionally uses the same underlying property—different code points with similar glyphs—but for a framed steganographic/watermarking channel rather than for identifiers or domain-name spoofing. Implementations SHOULD NOT use UGS-1 inside security-sensitive identifiers such as domains, usernames, package names, certificates, executable filenames, payment addresses, or login prompts.

20.3 Homoglyph-Based Text Watermarking and Steganography

Academic work has already proposed Unicode homoglyph substitution for text watermarking. Rizzo, Bertini, and Montesi’s 2016 paper, “Content-preserving Text Watermarking through Unicode Homoglyph Substitution,” proposes a text watermarking technique based on substituting homoglyph characters for Latin symbols, using alternative Unicode symbols to preserve visual indistinguishability and text length. (ResearchGate)

Follow-on work on fine-grain text watermarking uses homoglyph substitution for Latin symbols and whitespaces to protect small portions of text, while preserving visual indistinguishability and length. That work explicitly treats text watermarking as difficult because text has low embedding capacity and limited safe transformation space. (Manara)

Hosmani, Bhat, and Chandrasekaran’s “Dual Stage Text Steganography Using Unicode Homoglyphs” is another direct related work. Its abstract describes a text steganography approach combining space manipulation, linguistic translation, and Unicode homoglyphs, implemented in Python. (ResearchGate)

There are also practical implementations. The stegtext project describes hiding text inside text by replacing ASCII characters with visually similar Unicode homoglyphs. Its documentation gives the same basic bit model used by UGS-1: leaving an eligible character unchanged can represent 0, while swapping it with a visually identical homoglyph can represent 1; it also notes that characters with multiple homoglyphs can encode more than one bit. (GitHub)

UGS-1 differs from these systems by deliberately choosing a conservative one-bit-per-character profile, a fixed mandatory table, a binary frame, a magic value, version and flags fields, a 24-bit payload length, CRC-16 validation, and specified decoder error behavior.

20.4 Unicode Whitespace, Zero-Width, and Invisible-Character Channels

Unicode text hiding is not limited to homoglyph substitution. UniSpaCh, published in 2012, uses Unicode space characters inserted into inter-word, inter-sentence, end-of-line, and paragraph spacing to encode external information in Microsoft Word documents, while aiming to improve imperceptibility and embedding efficiency. (ScienceDirect)

More recent whitespace-replacement watermarking work similarly hides byte-encoded sequences in unformatted text by substituting conventional whitespace with visually similar Unicode spaces. The 2025 HICSS paper by Hellmeier, Qarawlus, Norkowski, and Howar reports a proof-of-concept multiplatform implementation and evaluates robustness, capacity, and visibility. (ResearchGate)

Zero-width-character steganography is another established related family. The 330k Unicode steganography tool exposes text and binary encode/decode APIs and uses characters such as U+200C, U+200D, U+202C, and U+FEFF, while noting that U+200B may be deleted by Gmail in browser-sent mail. (330k) StegCloak is a JavaScript steganography tool that hides secrets inside text using special invisible Unicode characters, with optional compression, AES-256-CTR encryption, and HMAC integrity. (GitHub)

UGS-1’s primary H1 profile is therefore best understood as a homoglyph profile, not a general Unicode steganography invention. The optional Z1 profile belongs to the older zero-width/invisible-character family and should be treated as more fragile in copy/paste, email, messaging, search indexing, normalization, and security-filtering pipelines.

20.5 Normalization, Sanitization, and Defensive Detection

Unicode normalization is relevant to UGS-1 robustness, although, as noted in Section 5, the standard normalization forms do not generally fold the H1 Cyrillic code points into Latin letters. Unicode Standard Annex #15 defines normalization forms so equivalent strings can have unique binary representations; any process that normalizes, folds, transliterates, maps confusables, removes default-ignorable characters, or enforces a restricted character repertoire can alter or destroy a Unicode steganographic channel. (Unicode)

Defensive systems can detect or neutralize UGS-1-like channels by using Unicode confusable data, skeleton mappings, mixed-script detection, or explicit canonicalization of known homoglyph pairs. Unicode Technical Standard #39 provides the standard basis for confusable detection and includes mappings such as Cyrillic small es to Latin c. (Unicode)

A UGS-1 implementation SHOULD therefore include reveal and strip tooling. Reveal tooling helps auditors see code-point substitutions; strip tooling replaces UGS-1 one-variants with their Latin zero-variants, preserving visible text while removing the hidden channel.

20.6 UGS-1 Positioning

UGS-1 should be positioned as an interoperability profile for a known class of Unicode text steganography and watermarking techniques. It standardizes one conservative way to frame, embed, decode, validate, and strip a hidden byte payload using Unicode homoglyph variants.

UGS-1 is not encryption. It does not provide confidentiality unless the payload is encrypted before embedding. It is also not robust against determined sanitization, paraphrasing, transliteration, OCR, Unicode confusable canonicalization, or manual replacement of suspicious non-Latin code points.

Appropriate uses include controlled watermarking experiments, puzzles, protocol demonstrations, document-provenance research, educational security examples, and defensive testing. Inappropriate uses include spoofing identifiers, phishing, bypassing safety filters, hiding executable instructions, concealing policy-violating content, or impersonating trusted parties.

20.7 Informative References

[UTR36] Unicode Consortium, Unicode Technical Report #36: Unicode Security Considerations. Discusses visual spoofing, confusable strings, and IDN-related security risks. (Unicode)

[UTS39] Unicode Consortium, Unicode Technical Standard #39: Unicode Security Mechanisms. Defines confusable detection, skeleton mappings, mixed-script confusables, and related data files. (Unicode)

[UAX15] Unicode Consortium, Unicode Standard Annex #15: Unicode Normalization Forms. Defines normalization forms and explains how equivalent Unicode strings can be represented consistently. (Unicode)

[Gabrilovich2002] Evgeniy Gabrilovich and Alex Gontmakher, The Homograph Attack, Communications of the ACM, 2002. Early security treatment of visually confusable characters in domain-name spoofing.

[Rizzo2016] Stefano Giovanni Rizzo, Flavio Bertini, and Danilo Montesi, Content-preserving Text Watermarking through Unicode Homoglyph Substitution, IDEAS 2016. Proposes watermarking through Unicode homoglyph substitution for Latin symbols. (ResearchGate)

[Rizzo2019] Rizzo et al., Fine-grain watermarking for intellectual property protection, EURASIP Journal on Information Security, 2019. Extends homoglyph/whitespace substitution for fine-grained document watermarking. (Manara)

[Hosmani2015] Sachin Hosmani, H. G. Rama Bhat, and K. Chandrasekaran, Dual Stage Text Steganography Using Unicode Homoglyphs, 2015. Combines space manipulation, linguistic translation, and Unicode homoglyphs. (ResearchGate)

[UniSpaCh2012] Lip Yee Por, Koksheik Wong, and Kok Onn Chee, UniSpaCh: A text-based data hiding method using Unicode space characters, Journal of Systems and Software, 2012. Uses Unicode space characters for text-based data hiding. (ScienceDirect)

[Hellmeier2025] Hellmeier, Qarawlus, Norkowski, and Howar, A Hidden Digital Text Watermarking Method Using Unicode Whitespace Replacement, HICSS 2025. Replaces conventional whitespace with visually similar Unicode spaces to hide byte-encoded data. (ResearchGate)

[stegtext] btimby/stegtext, practical Unicode homoglyph steganography implementation using ASCII-to-homoglyph character swaps. (GitHub)

[StegCloak] KuroLabs/stegcloak, practical zero-width Unicode steganography implementation with compression, encryption, and integrity options. (GitHub)

[330kUnicodeStego] 330k Unicode Steganography with Zero-Width Characters, browser-based zero-width text/binary encoder and decoder. (330k)