DNS labels must be normalized and encoded properly to prevent security issues. Unicode introduces complexity with visually similar characters from different scripts.
DNS Label Normalization:
- Case folding: DNS labels are case-insensitive; Example.COM = example.com
- Lowercase conversion: Standard practice to store labels in lowercase
- Length limits: 1-63 octets per label, and at most 255 octets for the full name (RFC 1035)
- Character restrictions: Letters, digits, hyphens (a-z, 0-9, -) for ASCII labels; a hostname label may not start or end with a hyphen
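The rules above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation; the function names `normalize_label` and `is_valid_ascii_label` are hypothetical, and the check for leading and trailing hyphens follows the usual hostname convention.

```python
import re

# Letters, digits, hyphens; 1-63 characters; no leading or trailing hyphen.
_LDH_LABEL = re.compile(r"(?!-)[a-z0-9-]{1,63}(?<!-)")

def normalize_label(label: str) -> str:
    """Lowercase a label for storage and comparison."""
    return label.lower()

def is_valid_ascii_label(label: str) -> bool:
    """Check the length and character rules for a single ASCII label."""
    # Reject non-ASCII input first: some non-ASCII characters
    # (for example U+212A KELVIN SIGN) lowercase to plain ASCII letters.
    if not label.isascii():
        return False
    return _LDH_LABEL.fullmatch(normalize_label(label)) is not None
```

For example, `is_valid_ascii_label("Example")` is true, while `"-bad"`, an empty string, and a 64-character label are all rejected.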
Internationalized Domain Names (IDN):
- RFC 3490 (IDNA2003): Original IDN specification
- RFC 5890-5894 (IDNA2008): Updated IDN specification that replaces IDNA2003; RFC 5891 defines the protocol
- Punycode encoding: ASCII-compatible encoding of Unicode labels (RFC 3492)
- xn-- prefix: Identifies Punycode-encoded labels
- Example: "münchen" to "xn--mnchen-3ya" (the ASCII letters are kept as "mnchen"; the "3ya" after the last hyphen encodes the ü and its position)
Punycode Encoding Process:
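- Copy the ASCII characters of the label to the output in their original order, followed by a "-" delimiter (if there are any)
- Encode each non-ASCII character's code point and insertion position as variable-length integers written in base 36 (a-z, 0-9)
- Add the "xn--" prefix (defined by IDNA, not by Punycode itself)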
- Example: "日本" (Japan) to "xn--wgv71a"
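Python's standard library ships a `punycode` codec, so the examples in this section can be reproduced directly. The sketch below is only an illustration: the helper names `to_a_label` and `to_u_label` are hypothetical, and the code does no validation or normalization of its own, which a real IDNA implementation would need.

```python
def to_a_label(label: str) -> str:
    """Encode one Unicode label as an xn-- label; ASCII labels pass through."""
    if label.isascii():
        return label
    # The codec produces the Punycode body only; the xn-- prefix is added here.
    return "xn--" + label.encode("punycode").decode("ascii")

def to_u_label(label: str) -> str:
    """Decode an xn-- label back to Unicode; other labels pass through."""
    if label.lower().startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label

print(to_a_label("münchen"))  # xn--mnchen-3ya
print(to_a_label("日本"))      # xn--wgv71a
```

Decoding reverses the process: `to_u_label("xn--mnchen-3ya")` returns "münchen".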
Unicode Script Detection:
- Script property: Every Unicode code point has a Script property value; characters shared across scripts, such as digits and the hyphen, are assigned to the "Common" script
- Common scripts:
  - Latin: a-z, A-Z (basic letters at Unicode 0041-005A and 0061-007A)
  - Cyrillic: а-я (Unicode 0400-04FF); several letters look like Latin letters but are different code points
  - Greek: α-ω (Unicode 0370-03FF)
  - Chinese: 中文 (Unicode 4E00-9FFF CJK Unified Ideographs)
  - Arabic: ا-ي (Unicode 0600-06FF)
- Mixed scripts: A label that mixes characters from multiple scripts (security risk)
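Python's standard library does not expose the Unicode Script property, so the sketch below approximates it from the first word of each character's Unicode name. That shortcut is an assumption made for illustration and covers only the scripts listed in the table; a production tool would more likely use real Script data. The names `SCRIPT_BY_NAME_PREFIX`, `char_script`, and `label_scripts` are hypothetical.

```python
import unicodedata

# First word of the Unicode character name -> script label (heuristic).
SCRIPT_BY_NAME_PREFIX = {
    "LATIN": "Latin",
    "CYRILLIC": "Cyrillic",
    "GREEK": "Greek",
    "ARABIC": "Arabic",
    "CJK": "Han",
    "HIRAGANA": "Hiragana",
    "KATAKANA": "Katakana",
}

def char_script(ch: str) -> str:
    """Approximate the script of one character; unknowns count as Common."""
    name = unicodedata.name(ch, "")
    prefix = name.split(" ")[0]
    return SCRIPT_BY_NAME_PREFIX.get(prefix, "Common")

def label_scripts(label: str) -> set:
    """Return the set of scripts in a label, ignoring digits and hyphens."""
    return {char_script(ch) for ch in label} - {"Common"}
```

With this heuristic, `label_scripts("paypal")` gives `{"Latin"}`, while the same word with a Cyrillic 'а' gives `{"Latin", "Cyrillic"}`, which is the mixed-script signal described above.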
Homoglyph Attacks (IDN Homograph Attacks):
Homoglyphs are characters from different scripts that look identical or very similar but have different Unicode code points.
- Classic example:
  - Latin 'a' (Unicode 0061) vs. Cyrillic 'а' (Unicode 0430) - visually identical
  - "paypal.com" (Latin) vs. "pаypal.com" (Cyrillic а) - looks identical in browsers
- Common homoglyph pairs:
  - Latin 'o' (Unicode 006F) vs. Cyrillic 'о' (Unicode 043E)
  - Latin 'e' (Unicode 0065) vs. Cyrillic 'е' (Unicode 0435)
  - Latin 'p' (Unicode 0070) vs. Cyrillic 'р' (Unicode 0440)
  - Latin 'c' (Unicode 0063) vs. Cyrillic 'с' (Unicode 0441)
  - Latin 'x' (Unicode 0078) vs. Cyrillic 'х' (Unicode 0445)
- Attack scenario:
  - Attacker registers "аpple.com" using Cyrillic 'а'
  - Victim sees "аpple.com" and thinks it's "apple.com"
  - Victim enters credentials on phishing site
  - Actual domain is "xn--pple-43d.com" (Punycode)
Mixed Script Detection:
- Mixing characters from multiple scripts in one label is suspicious
- Example: "microsоft.com" (Latin + Cyrillic 'о')
- Most legitimate domains use a single script (Japanese, which combines Kanji, Hiragana, and Katakana, is a common exception)
- Browsers typically fall back to displaying Punycode for mixed-script IDN labels
Browser Protections:
- Chrome/Firefox: Display Punycode (xn--...) instead of Unicode for suspicious domains
- Mixed script blocking: Show Punycode if multiple scripts mixed
- Top-level domain (TLD) restrictions: Some TLDs only allow specific scripts
- Confusability checks: Registries may block homoglyph domains
Example Homoglyph Attack:
- Target: paypal.com (all Latin)
- Attack domain: pаypal.com (Cyrillic 'а' at position 2)
- Punycode: xn--pypal-4ve.com
- Visual appearance: Identical in many fonts
- Detection: This tool identifies Cyrillic script, warns of homoglyph risk
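The example above can be checked by decoding the Punycode label and listing every non-ASCII character with its position and Unicode name. This is a sketch of the general idea, not the tool's actual code; the function name `non_ascii_report` is hypothetical.

```python
import unicodedata

def non_ascii_report(a_label: str):
    """Decode an xn-- label and describe each non-ASCII character in it.

    Returns (decoded_label, findings), where each finding is a tuple of
    (1-based position, code point, Unicode character name).
    """
    if not a_label.lower().startswith("xn--"):
        return a_label, []
    decoded = a_label[4:].encode("ascii").decode("punycode")
    findings = [
        (pos, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for pos, ch in enumerate(decoded, start=1)
        if ord(ch) > 0x7F
    ]
    return decoded, findings

decoded, findings = non_ascii_report("xn--pypal-4ve")
print(findings)  # [(2, 'U+0430', 'CYRILLIC SMALL LETTER A')]
```

The single finding at position 2 is the Cyrillic 'а' that makes the label differ from the all-Latin "paypal".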
Real-World Attack (2017):
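In 2017 a security researcher demonstrated a homograph attack by registering "xn--80ak6aa92e.com", which displayed as "apple.com" in affected browsers. Every character in the label was Cyrillic, so checks that only look for mixed scripts did not flag it. The registration was a proof of concept intended to expose the weakness in IDN handling.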
Normalization Forms (Unicode):
- NFC (Canonical Composition): Preferred form for IDN
- NFD (Canonical Decomposition): Decomposes accented characters
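- Example: "é" can be represented as:
  - NFC: Unicode 00E9 (single character)
  - NFD: Unicode 0065 + 0301 (e + combining acute accent)
- IDNA2008 requires labels to be in NFC before Punycode encoding (IDNA2003 instead applied Nameprep, which uses NFKC); without normalization, the two forms of "é" would produce different Punycode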
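The two representations of "é" can be compared with Python's `unicodedata` module. The snippet is a small illustration of normalization only; the helper name `nfc_label` is hypothetical, and a full IDNA implementation performs more checks than this.

```python
import unicodedata

composed = "\u00e9"     # NFC form: single code point U+00E9
decomposed = "e\u0301"  # NFD form: 'e' followed by combining acute accent

# The strings render the same but are not equal code point sequences.
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True

def nfc_label(label: str) -> str:
    """Normalize a label to NFC before any Punycode encoding."""
    return unicodedata.normalize("NFC", label)
```

Normalizing first means both spellings of a label such as "café" map to the same encoded form.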
Security Best Practices:
- Registry policies: Implement confusability checks before registration
- Browser warnings: Display Punycode for suspicious mixed-script domains
- User education: Train users to check address bar for Punycode (xn--...)
- Certificate validation: Check certificate CN/SAN against expected domain
- Brand protection: Proactively register homoglyph variants of your domain
TLD-Specific Restrictions:
- .com/.net/.org: Allow most scripts but monitor for abuse
- .de (Germany): Restricts IDNs to a defined set of Latin-based characters (for example ä, ö, ü, ß)
- .jp (Japan): Allows Japanese scripts (Hiragana, Katakana, Kanji)
- .ru (Russia): ASCII labels only; Cyrillic names are registered under the separate .рф TLD
- Many TLDs use script-based restrictions to prevent homograph attacks
Detection Strategies:
- Script analysis: Detect mixed scripts (Latin + Cyrillic)
- Confusability checking: Compare visual similarity to known brands
- Punycode inspection: Decode and analyze Unicode characters
- Allowlists: Permit only expected domains in enterprise environments
- Reputation systems: Flag newly registered homoglyph domains
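One way to sketch confusability checking is to map each look-alike character to the Latin letter it imitates and compare the result against a list of protected names. The mapping below contains only the Cyrillic/Latin pairs listed earlier in this document, and `KNOWN_BRANDS` is a hypothetical watch list; Unicode's confusables data (UTS #39) is far larger, so treat this as a reduced illustration of the idea.

```python
# Cyrillic look-alikes mapped to the Latin letters they imitate.
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic а
    "\u043e": "o",  # Cyrillic о
    "\u0435": "e",  # Cyrillic е
    "\u0440": "p",  # Cyrillic р
    "\u0441": "c",  # Cyrillic с
    "\u0445": "x",  # Cyrillic х
}

KNOWN_BRANDS = {"paypal", "apple", "microsoft"}  # hypothetical watch list

def skeleton(label: str) -> str:
    """Replace known look-alike characters with their Latin counterparts."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in label.lower())

def impersonated_brand(label: str):
    """Return the brand a label imitates, or None if there is no match."""
    skel = skeleton(label)
    if skel in KNOWN_BRANDS and label.lower() != skel:
        return skel
    return None
```

Here `impersonated_brand("pаypal")` (with a Cyrillic 'а') returns "paypal", while the genuine all-Latin "paypal" returns `None` because it is the brand itself.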
Common Typosquatting Patterns:
- Homoglyphs: Visual substitution (Cyrillic 'а' for Latin 'a')
- Typos: Omitted letters or adjacent keys (gogle.com or goohle.com instead of google.com)
- Bit-flipping: Single bit change in an ASCII character (google.com to gnogle.com; 'o' is 0x6F and 'n' is 0x6E)
- Hyphenation: Adding/removing hyphens (pay-pal.com)
- TLD swapping: Different TLD (.co instead of .com)
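The bit-flipping pattern is mechanical enough to enumerate. The sketch below generates every label that differs from the input by one flipped bit and still consists of valid hostname characters. It assumes a lowercase ASCII label as input, and the function name `bitflip_variants` is hypothetical.

```python
import string

ALLOWED = set(string.ascii_lowercase + string.digits + "-")

def bitflip_variants(label: str) -> set:
    """Return labels that differ from `label` by a single flipped bit."""
    variants = set()
    for i, ch in enumerate(label):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            # Uppercase results are skipped: DNS is case-insensitive,
            # so they would not name a different domain.
            if flipped not in ALLOWED:
                continue
            candidate = label[:i] + flipped + label[i + 1:]
            # Labels may not start or end with a hyphen.
            if candidate.startswith("-") or candidate.endswith("-"):
                continue
            variants.add(candidate)
    return variants
```

For "google", the result includes "gnogle" ('o' 0x6F to 'n' 0x6E) and "coogle" ('g' 0x67 to 'c' 0x63). A defender could feed such a list into registration monitoring.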
Tool Output Interpretation:
- Normalized: Lowercase version of the label
- Scripts: Unicode scripts detected in label
- "Homoglyphs: Possible": Contains characters with visual similarity risks
- "IDN: Yes": Label contains non-ASCII Unicode characters
- Warnings: Security issues detected (mixed scripts, confusable characters)
When to Be Suspicious:
- Mixed scripts in well-known brand names
- Punycode (xn--...) in unexpected contexts
- Domains that look identical to known brands but decode differently
- Newly registered domains with homoglyphs of popular sites
Legitimate IDN Use Cases:
- Local language domains: Cyrillic names under .рф, Chinese names such as 中国.cn (China)
- Internationalized place and brand names: münchen.de (Munich)
- Local businesses serving non-English audiences
- Government sites in local languages