1. Status and Security Notice
UGS-1 is a deterministic, low-bandwidth steganographic encoding for hiding a byte payload inside visible text by replacing selected Latin characters with visually similar Unicode characters.
UGS-1 is not encryption, authentication, or tamper protection. The CRC in this specification only detects accidental corruption. Anyone who understands the format can alter the payload and recompute the CRC.
If secrecy or tamper resistance is required, applications should encrypt and authenticate the payload before placing it in a UGS-1 frame, for example with an AEAD scheme.
UGS-1 data is fragile. Editors, messaging systems, source-control tools, spellcheckers, security gateways, LLM rewrites, transliteration tools, ASCII sanitizers, and confusable-character detectors may modify or remove the channel.
Unicode’s own security guidance discusses confusable characters and mechanisms for detecting them, so UGS-1 should be treated as a technique with security implications.
Reference: https://www.unicode.org/reports/tr39/
2. Goal
UGS-1 embeds a hidden byte payload inside visible text.
Conceptual example:
Latin a U+0061 encodes bit 0
Cyrillic а U+0430 encodes bit 1
Both glyphs may appear visually similar:
a
But they are different Unicode code points.
UGS-1 profile H1 encodes one bit per eligible visible character.
3. Non-Goals
UGS-1 does not provide:
confidentiality
authentication
integrity against deliberate tampering
plausible deniability
robust survival through arbitrary text processing
high bandwidth
UGS-1 is intended to be simple, deterministic, testable, and easy to detect or remove when needed.
4. Terminology
Cover text
The visible text before embedding.
Carrier text
The visible text after embedding.
Payload
The hidden message as raw bytes.
Text payload
A payload interpreted as UTF-8 text.
Eligible character
A code point that appears in the UGS-1 H1 homoglyph table.
Zero variant
The ordinary Latin code point used to encode bit 0.
One variant
The visually similar Unicode code point used to encode bit 1.
Frame
The structured binary data embedded into the carrier text.
Variant reset
The process of replacing UGS-1 one variants with their corresponding zero variants before embedding. This is UGS-specific and must not be confused with Unicode normalization.
5. Required Unicode Handling
Implementations MUST operate on Unicode scalar values or code points, not encoded bytes.
Implementations MUST NOT iterate through UTF-16 strings in a way that incorrectly splits surrogate pairs. Profile H1 only uses BMP code points, but code-point handling keeps implementations portable.
Implementations MUST NOT normalize, case-fold, transliterate, ASCII-clean, confusable-map, or sanitize the carrier text after embedding.
Unicode normalization forms such as NFC, NFD, NFKC, and NFKD do not generally fold Cyrillic homoglyphs into Latin letters. However, other security and text-processing systems may do so. The larger practical risks are transliteration, ASCII filtering, confusable-skeleton mapping, spellcheck, OCR, LLM rewriting, and security sanitization.
Implementations SHOULD test the intended transport channel before use.
Reference: https://www.unicode.org/reports/tr15/
6. Structured Text Warning
Applications SHOULD avoid embedding into syntax-sensitive regions, including:
URLs
email addresses
domain names
source code
identifiers
HTML tags and attributes
Markdown syntax
JSON keys
YAML keys
CSV headers
shell commands
database queries
filenames
cryptographic material
For HTML, XML, Markdown, or rich-text formats, applications SHOULD embed only in visible prose text nodes, not in markup or syntax.
Replacing Latin letters with homoglyphs can preserve visual appearance while changing program behavior.
7. Profile H1: Binary Homoglyph Encoding
UGS-1 H1 encodes one bit per eligible character.
bit 0 -> use the Latin zero variant
bit 1 -> use the homoglyph one variant
7.1 Mandatory Homoglyph Table
All UGS-1 H1 implementations MUST support this exact table.
| Bit 0 Latin | Code point | Bit 1 homoglyph | Code point | Bit-1 character name |
|---|---|---|---|---|
A |
U+0041 | А |
U+0410 | Cyrillic Capital Letter A |
a |
U+0061 | а |
U+0430 | Cyrillic Small Letter A |
B |
U+0042 | В |
U+0412 | Cyrillic Capital Letter Ve |
C |
U+0043 | С |
U+0421 | Cyrillic Capital Letter Es |
c |
U+0063 | с |
U+0441 | Cyrillic Small Letter Es |
E |
U+0045 | Е |
U+0415 | Cyrillic Capital Letter Ie |
e |
U+0065 | е |
U+0435 | Cyrillic Small Letter Ie |
H |
U+0048 | Н |
U+041D | Cyrillic Capital Letter En |
I |
U+0049 | І |
U+0406 | Cyrillic Capital Letter Byelorussian-Ukrainian I |
i |
U+0069 | і |
U+0456 | Cyrillic Small Letter Byelorussian-Ukrainian I |
J |
U+004A | Ј |
U+0408 | Cyrillic Capital Letter Je |
j |
U+006A | ј |
U+0458 | Cyrillic Small Letter Je |
K |
U+004B | К |
U+041A | Cyrillic Capital Letter Ka |
M |
U+004D | М |
U+041C | Cyrillic Capital Letter Em |
O |
U+004F | О |
U+041E | Cyrillic Capital Letter O |
o |
U+006F | о |
U+043E | Cyrillic Small Letter O |
P |
U+0050 | Р |
U+0420 | Cyrillic Capital Letter Er |
p |
U+0070 | р |
U+0440 | Cyrillic Small Letter Er |
T |
U+0054 | Т |
U+0422 | Cyrillic Capital Letter Te |
X |
U+0058 | Х |
U+0425 | Cyrillic Capital Letter Ha |
x |
U+0078 | х |
U+0445 | Cyrillic Small Letter Ha |
y |
U+0079 | у |
U+0443 | Cyrillic Small Letter U |
This table intentionally avoids many tempting lookalikes that are more font-dependent.
8. Frame Format
The payload is wrapped in a binary frame.
All multi-bit fields are encoded most-significant bit first.
All multi-byte integers are encoded big-endian.
8.1 Layout
MAGIC 16 bits
VERSION 4 bits
FLAGS 4 bits
LENGTH 24 bits
PAYLOAD 8 * LENGTH bits
CRC16 16 bits
Total overhead:
64 bits
Required capacity:
required_bits = 64 + 8 * payload_length_in_bytes
8.2 MAGIC
Fixed value:
0xAD53
Binary:
10101101 01010011
8.3 VERSION
For this specification:
0x1
Encoded as 4 bits:
0001
8.4 FLAGS
For this specification:
0x0
Encoded as 4 bits:
0000
All nonzero flag values are reserved.
A decoder MUST reject frames with unknown nonzero flags unless explicitly configured to support an extension.
8.5 LENGTH
LENGTH is a 24-bit unsigned integer.
It gives the number of payload bytes.
Maximum representable payload size:
16,777,215 bytes
Practical payload size is usually much smaller because the cover text must contain enough eligible characters.
Decoders SHOULD support a configurable maximum accepted payload size.
8.6 PAYLOAD
The payload is raw bytes.
For ordinary text messages, the payload MUST be UTF-8 encoded text without a byte-order mark.
8.7 CRC16
CRC-16/CCITT-FALSE is computed over:
VERSION_FLAGS || LENGTH || PAYLOAD
Where:
VERSION_FLAGS = one byte: high nibble VERSION, low nibble FLAGS
For UGS-1 v1 with flags 0:
VERSION_FLAGS = 0x10
CRC parameters:
Name: CRC-16/CCITT-FALSE
Width: 16
Polynomial: 0x1021
Initial: 0xFFFF
RefIn: false
RefOut: false
XorOut: 0x0000
Output: big-endian
Normative pseudocode:
function crc16_ccitt_false(bytes):
crc = 0xFFFF
for b in bytes:
crc = crc XOR (b << 8)
repeat 8 times:
if (crc AND 0x8000) != 0:
crc = ((crc << 1) XOR 0x1021) AND 0xFFFF
else:
crc = (crc << 1) AND 0xFFFF
return crc
9. Payload Preparation
For a text message:
payload_bytes = UTF8_ENCODE(hidden_message)
Then construct:
version_flags = 0x10
length = 24-bit big-endian byte length
crc_input = version_flags || length || payload_bytes
crc = CRC16_CCITT_FALSE(crc_input)
frame_bytes = 0xAD 0x53 || version_flags || length || payload_bytes || crc
frame_bits = frame_bytes as bits, MSB first
Example hidden message:
OK
UTF-8 payload bytes:
4F 4B
Frame bytes:
AD 53 10 00 00 02 4F 4B 9E A9
Frame bits:
10101101 01010011 00010000 00000000 00000000
00000010 01001111 01001011 10011110 10101001
10. Encoder Algorithm
Input:
cover_text
payload_bytes
Output:
carrier_text
Algorithm:
1. Build the UGS-1 frame.
2. Convert frame bytes to frame bits, MSB first.
3. Reset UGS-1 variants in the cover text:
For every code point:
if it is a one variant from the H1 table:
replace it with its corresponding zero variant
else:
leave it unchanged
4. Count capacity:
capacity_bits = number of code points in the reset cover text
that are zero variants in the H1 table
5. If capacity_bits < frame_bits.length:
fail with INSUFFICIENT_CAPACITY
6. Initialize bit_index = 0.
7. For each code point cp in the reset cover text:
if cp is not a zero variant in the H1 table:
output cp
continue
if bit_index >= frame_bits.length:
output cp
continue
bit = frame_bits[bit_index]
if bit == 0:
output cp
if bit == 1:
output corresponding one variant
bit_index = bit_index + 1
8. Return the output text.
The encoder MUST NOT alter non-eligible characters except when resetting recognized one variants to their corresponding zero variants.
The encoder SHOULD reset variants before embedding so that accidental pre-existing homoglyphs do not become hidden bits.
11. Decoder Algorithm
Input:
carrier_text
Output:
payload_bytes
11.1 Bit Extraction
1. Initialize bits = empty list.
2. For each code point cp in carrier_text:
if cp is a zero variant in the H1 table:
append bit 0 to bits
else if cp is a one variant in the H1 table:
append bit 1 to bits
else:
ignore cp
11.2 Strict Frame Decode
Strict mode assumes the frame starts at the first extracted bit.
1. If bits.length < 64:
fail with NO_FRAME
2. Read first 16 bits as MAGIC.
If MAGIC != 0xAD53:
fail with BAD_MAGIC
3. Read next 4 bits as VERSION.
If VERSION != 1:
fail with UNSUPPORTED_VERSION
4. Read next 4 bits as FLAGS.
If FLAGS != 0:
fail with UNSUPPORTED_FLAGS
5. Read next 24 bits as LENGTH.
6. required_bits = 64 + 8 * LENGTH
7. If required_bits > bits.length:
fail with INCOMPLETE_FRAME
8. If LENGTH exceeds the implementation's configured maximum payload size:
fail with PAYLOAD_TOO_LARGE
9. Read PAYLOAD as LENGTH bytes.
10. Read CRC16.
11. Recompute CRC16 over:
VERSION_FLAGS || LENGTH || PAYLOAD
12. If computed CRC16 != stored CRC16:
fail with BAD_CRC
13. Ignore extracted bits after the parsed frame.
14. Return PAYLOAD.
A stricter application MAY reject trailing one bits after the parsed frame, but that behavior is not required for UGS-1 compatibility.
If the caller expects text, decode PAYLOAD as UTF-8. If UTF-8 decoding fails, return INVALID_UTF8.
12. Optional Scan Mode
Strict mode is deterministic, but it fails if visible text appears before the embedded frame and that preceding text contains eligible characters.
Scan mode searches for a valid frame at any extracted-bit offset.
1. Extract all bits from the carrier text.
2. For offset from 0 to bits.length - 64:
if bits[offset : offset + 16] != MAGIC:
continue
Try to parse a frame beginning at offset.
Before allocating payload storage:
read LENGTH
compute required_bits
ensure required_bits <= bits.length - offset
ensure LENGTH <= configured maximum payload size
If version, flags, length, and CRC are valid:
return the decoded payload and the bit offset
3. If no valid frame is found:
fail with NO_VALID_FRAME
Because the frame uses both a 16-bit magic value and a 16-bit CRC, accidental false positives are unlikely in ordinary text. They are not impossible.
Applications that need stronger false-positive resistance should use a larger checksum or authenticated payload wrapper.
13. Capacity Rules
For profile H1:
1 eligible character = 1 hidden bit
Required capacity:
64 + 8 * payload_length
Examples:
| Payload | Payload bytes | Required eligible characters |
|---|---|---|
| empty payload | 0 | 64 |
OK |
2 | 80 |
hello |
5 | 104 |
meet at 9 |
9 | 136 |
| 100-byte message | 100 | 864 |
A practical encoder SHOULD report:
capacity_bits
required_bits
unused_capacity_bits
max_payload_bytes
Where:
if capacity_bits < 64:
max_payload_bytes = 0
else:
max_payload_bytes = floor((capacity_bits - 64) / 8)
14. Worked Test Vector
14.1 Hidden Message
OK
14.2 Payload Bytes
4F 4B
14.3 Full Frame Bytes
AD 53 10 00 00 02 4F 4B 9E A9
14.4 Full Frame Bits
10101101 01010011 00010000 00000000 00000000
00000010 01001111 01001011 10011110 10101001
14.5 Minimal Cover Text
Use 80 Latin a characters:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Since a is an eligible character, each a can encode one bit.
14.6 Encoded Carrier as Escapes
For this test vector:
bit 0 -> a U+0061
bit 1 -> \u0430 Cyrillic small а
Encoded carrier, shown as escape groups:
\u0430a\u0430a\u0430\u0430a\u0430
a\u0430a\u0430aa\u0430\u0430
aaa\u0430aaaa
aaaaaaaa
aaaaaaaa
aaaaaa\u0430a
a\u0430aa\u0430\u0430\u0430\u0430
a\u0430aa\u0430a\u0430\u0430
\u0430aa\u0430\u0430\u0430\u0430a
\u0430a\u0430a\u0430aa\u0430
A correct decoder should recover:
OK
15. Error Codes
Implementations SHOULD expose these distinct errors:
INSUFFICIENT_CAPACITY
NO_FRAME
BAD_MAGIC
UNSUPPORTED_VERSION
UNSUPPORTED_FLAGS
INCOMPLETE_FRAME
PAYLOAD_TOO_LARGE
BAD_CRC
INVALID_UTF8
NO_VALID_FRAME
For privacy or simplicity, user-facing applications MAY collapse these into a generic result such as:
no valid hidden message found
16. Recommended API Shape
Implementations SHOULD expose byte-oriented primitives first:
encode_bytes(cover_text, payload_bytes) -> carrier_text
decode_bytes(carrier_text, mode = "strict" | "scan") -> {
payload_bytes,
offset_bits optional
}
capacity(cover_text) -> {
capacity_bits,
max_payload_bytes
}
UTF-8 convenience wrappers MAY be provided:
encode_text(cover_text, hidden_text) -> carrier_text
decode_text(carrier_text, mode = "strict" | "scan") -> hidden_text
The text wrappers MUST encode and decode payloads as UTF-8 without a byte-order mark.
17. Optional Extension: Zero-Width Profile Z1
Z1 is not part of mandatory UGS-1 H1 compatibility.
It may be useful when cover text lacks enough homoglyph-capable characters.
Unicode includes zero-width format characters such as:
U+200B ZERO WIDTH SPACE
U+200C ZERO WIDTH NON-JOINER
U+200D ZERO WIDTH JOINER
A simple Z1 profile could be:
bit 0 -> U+200B ZERO WIDTH SPACE
bit 1 -> U+200C ZERO WIDTH NON-JOINER
A Z1 encoder would insert one zero-width character between visible code points.
However, Z1 is often more fragile than H1 because messaging systems, editors, and security tools commonly strip invisible format characters.
Variation selectors are another possible channel, but they are also fragile and can interact with rendering behavior.
For a dependable first implementation, use H1.
References:
https://www.unicode.org/charts/nameslist/n_2000.html
https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-23/
https://www.unicode.org/reports/tr37/
18. Detector and Sanitizer
A companion detector can reveal or remove UGS-1 data.
18.1 Reveal Mode
For every code point:
if code point is a one variant:
display it as [1:U+XXXX]
else if code point is a zero variant:
display it as [0:U+XXXX]
else:
display normally
Reveal mode is useful for debugging, auditing, and teaching.
18.2 Strip Mode
For every code point:
if code point is a one variant:
replace it with the corresponding zero variant
else:
keep it unchanged
Strip mode removes the H1 hidden channel while preserving the intended Latin visible text.
This is useful defensively because mixed-script confusables are a known Unicode security concern.
Reference: https://www.unicode.org/reports/tr39/
19. Implementation Checklist
A conforming UGS-1 H1 implementation MUST:
[ ] Use the mandatory H1 homoglyph table exactly.
[ ] Operate on Unicode code points.
[ ] Provide byte payload support.
[ ] Encode text payloads as UTF-8 without BOM when using text wrappers.
[ ] Build the exact frame format.
[ ] Use MAGIC = 0xAD53.
[ ] Use VERSION = 1.
[ ] Use FLAGS = 0.
[ ] Use 24-bit big-endian payload length.
[ ] Use CRC-16/CCITT-FALSE.
[ ] Encode bits MSB-first.
[ ] Encode one bit per eligible character.
[ ] Reset recognized one variants before embedding.
[ ] Reject bad magic.
[ ] Reject unsupported version.
[ ] Reject unsupported flags.
[ ] Reject incomplete frames.
[ ] Reject oversized payloads according to implementation limits.
[ ] Reject bad CRC.
[ ] Avoid post-encoding normalization, transliteration, or sanitization.
A robust implementation SHOULD also:
[ ] Provide scan mode.
[ ] Provide reveal mode.
[ ] Provide strip mode.
[ ] Report capacity before encoding.
[ ] Avoid syntax-sensitive regions in structured text.
[ ] Support a configurable maximum decode payload size.
[ ] Clearly state that CRC16 is not authentication.
20. Prior Art and Related Work — Informative
20.1 Scope of This Section
UGS-1 does not claim novelty over the broad idea of hiding information in text by using Unicode look-alikes, invisible characters, or whitespace variants. Prior work exists in Unicode security, IDN homograph attacks, text watermarking, Unicode homoglyph substitution, Unicode space-character hiding, and zero-width-character steganography. The purpose of UGS-1 is narrower: it defines a small, deterministic, interoperable profile with a fixed homoglyph table, a byte-oriented frame, a length field, versioning, CRC validation, strict decoding, scan decoding, and explicit capacity/error behavior.
This section is a non-exhaustive prior-art and related-work note, not a legal patent-clearance opinion.
20.2 Unicode Visual Confusability and Homograph Security
Unicode visual confusability is a well-documented security issue. Unicode Technical Report #36 describes visual spoofing as relying on different Unicode strings whose appearance in common fonts and screen sizes is close enough that users can mistake one for another. It also notes that Internationalized Domain Names increase opportunities for visual spoofing because the usable character repertoire is much larger than ASCII. (Unicode)
Unicode Technical Standard #39 defines mechanisms for detecting visually confusable strings. It defines skeleton-based confusable detection, classifies confusables as single-script, mixed-script, or whole-script, and gives the canonical example of paypal versus pаypаl, where the second string contains U+0430 CYRILLIC SMALL LETTER A. (Unicode)
The IDN homograph attack literature predates UGS-1 by decades. Gabrilovich and Gontmakher’s 2002 “The Homograph Attack” framed character resemblance as a security problem rather than merely a typographic nuisance, and discussed non-Latin substitutions that could make domain names appear authentic while resolving differently. Later measurement work on IDN homographs describes Unicode homoglyphs as visually similar characters that can be abused to create spoofing domains, and notes that Gabrilovich and Gontmakher demonstrated an IDN homograph using Russian letters in 2002. (arXiv)
UGS-1 intentionally uses the same underlying property—different code points with similar glyphs—but for a framed steganographic/watermarking channel rather than for identifiers or domain-name spoofing. Implementations SHOULD NOT use UGS-1 inside security-sensitive identifiers such as domains, usernames, package names, certificates, executable filenames, payment addresses, or login prompts.
20.3 Homoglyph-Based Text Watermarking and Steganography
Academic work has already proposed Unicode homoglyph substitution for text watermarking. Rizzo, Bertini, and Montesi’s 2016 paper, “Content-preserving Text Watermarking through Unicode Homoglyph Substitution,” proposes a text watermarking technique based on substituting homoglyph characters for Latin symbols, using alternative Unicode symbols to preserve visual indistinguishability and text length. (ResearchGate)
Follow-on work on fine-grain text watermarking uses homoglyph substitution for Latin symbols and whitespaces to protect small portions of text, while preserving visual indistinguishability and length. That work explicitly treats text watermarking as difficult because text has low embedding capacity and limited safe transformation space. (Manara)
Hosmani, Bhat, and Chandrasekaran’s “Dual Stage Text Steganography Using Unicode Homoglyphs” is another direct related work. Its abstract describes a text steganography approach combining space manipulation, linguistic translation, and Unicode homoglyphs, implemented in Python. (ResearchGate)
There are also practical implementations. The stegtext project describes hiding text inside text by replacing ASCII characters with visually similar Unicode homoglyphs. Its documentation gives the same basic bit model used by UGS-1: leaving an eligible character unchanged can represent 0, while swapping it with a visually identical homoglyph can represent 1; it also notes that characters with multiple homoglyphs can encode more than one bit. (GitHub)
UGS-1 differs from these systems by deliberately choosing a conservative one-bit-per-character profile, a fixed mandatory table, a binary frame, a magic value, version and flags fields, a 24-bit payload length, CRC-16 validation, and specified decoder error behavior.
20.4 Unicode Whitespace, Zero-Width, and Invisible-Character Channels
Unicode text hiding is not limited to homoglyph substitution. UniSpaCh, published in 2012, uses Unicode space characters inserted into inter-word, inter-sentence, end-of-line, and paragraph spacing to encode external information in Microsoft Word documents, while aiming to improve imperceptibility and embedding efficiency. (ScienceDirect)
More recent whitespace-replacement watermarking work similarly hides byte-encoded sequences in unformatted text by substituting conventional whitespace with visually similar Unicode spaces. The 2025 HICSS paper by Hellmeier, Qarawlus, Norkowski, and Howar reports a proof-of-concept multiplatform implementation and evaluates robustness, capacity, and visibility. (ResearchGate)
Zero-width-character steganography is another established related family. The 330k Unicode steganography tool exposes text and binary encode/decode APIs and uses characters such as U+200C, U+200D, U+202C, and U+FEFF, while noting that U+200B may be deleted by Gmail in browser-sent mail. (330k) StegCloak is a JavaScript steganography tool that hides secrets inside text using special invisible Unicode characters, with optional compression, AES-256-CTR encryption, and HMAC integrity. (GitHub)
UGS-1’s primary H1 profile is therefore best understood as a homoglyph profile, not a general Unicode steganography invention. The optional Z1 profile belongs to the older zero-width/invisible-character family and should be treated as more fragile in copy/paste, email, messaging, search indexing, normalization, and security-filtering pipelines.
20.5 Normalization, Sanitization, and Defensive Detection
Unicode normalization is directly relevant to UGS-1 robustness. Unicode Standard Annex #15 defines normalization forms so equivalent strings can have unique binary representations; any process that normalizes, folds, transliterates, maps confusables, removes default-ignorable characters, or enforces a restricted character repertoire can alter or destroy a Unicode steganographic channel. (Unicode)
Defensive systems can detect or neutralize UGS-1-like channels by using Unicode confusable data, skeleton mappings, mixed-script detection, or explicit canonicalization of known homoglyph pairs. Unicode Technical Standard #39 provides the standard basis for confusable detection and includes mappings such as Cyrillic small es to Latin c. (Unicode)
A UGS-1 implementation SHOULD therefore include reveal and strip tooling. Reveal tooling helps auditors see code-point substitutions; strip tooling replaces UGS-1 one-variants with their Latin zero-variants, preserving visible text while removing the hidden channel.
20.6 UGS-1 Positioning
UGS-1 should be positioned as an interoperability profile for a known class of Unicode text steganography and watermarking techniques. It standardizes one conservative way to frame, embed, decode, validate, and strip a hidden byte payload using Unicode homoglyph variants.
UGS-1 is not encryption. It does not provide confidentiality unless the payload is encrypted before embedding. It is also not robust against determined sanitization, paraphrasing, transliteration, OCR, Unicode confusable canonicalization, or manual replacement of suspicious non-Latin code points.
Appropriate uses include controlled watermarking experiments, puzzles, protocol demonstrations, document-provenance research, educational security examples, and defensive testing. Inappropriate uses include spoofing identifiers, phishing, bypassing safety filters, hiding executable instructions, concealing policy-violating content, or impersonating trusted parties.
20.7 Informative References
[UTR36] Unicode Consortium, Unicode Technical Report #36: Unicode Security Considerations. Discusses visual spoofing, confusable strings, and IDN-related security risks. (Unicode)
[UTS39] Unicode Consortium, Unicode Technical Standard #39: Unicode Security Mechanisms. Defines confusable detection, skeleton mappings, mixed-script confusables, and related data files. (Unicode)
[UAX15] Unicode Consortium, Unicode Standard Annex #15: Unicode Normalization Forms. Defines normalization forms and explains how equivalent Unicode strings can be represented consistently. (Unicode)
[Gabrilovich2002] Evgeniy Gabrilovich and Alex Gontmakher, The Homograph Attack, Communications of the ACM, 2002. Early security treatment of visually confusable characters in domain-name spoofing.
[Rizzo2016] Stefano Giovanni Rizzo, Flavio Bertini, and Danilo Montesi, Content-preserving Text Watermarking through Unicode Homoglyph Substitution, IDEAS 2016. Proposes watermarking through Unicode homoglyph substitution for Latin symbols. (ResearchGate)
[Rizzo2019] Rizzo et al., Fine-grain watermarking for intellectual property protection, EURASIP Journal on Information Security, 2019. Extends homoglyph/whitespace substitution for fine-grained document watermarking. (Manara)
[Hosmani2015] Sachin Hosmani, H. G. Rama Bhat, and K. Chandrasekaran, Dual Stage Text Steganography Using Unicode Homoglyphs, 2015. Combines space manipulation, linguistic translation, and Unicode homoglyphs. (ResearchGate)
[UniSpaCh2012] Lip Yee Por, Koksheik Wong, and Kok Onn Chee, UniSpaCh: A text-based data hiding method using Unicode space characters, Journal of Systems and Software, 2012. Uses Unicode space characters for text-based data hiding. (ScienceDirect)
[Hellmeier2025] Hellmeier, Qarawlus, Norkowski, and Howar, A Hidden Digital Text Watermarking Method Using Unicode Whitespace Replacement, HICSS 2025. Replaces conventional whitespace with visually similar Unicode spaces to hide byte-encoded data. (ResearchGate)
[stegtext] btimby/stegtext, practical Unicode homoglyph steganography implementation using ASCII-to-homoglyph character swaps. (GitHub)
[StegCloak] KuroLabs/stegcloak, practical zero-width Unicode steganography implementation with compression, encryption, and integrity options. (GitHub)
[330kUnicodeStego] 330k Unicode Steganography with Zero-Width Characters, browser-based zero-width text/binary encoder and decoder. (330k)