libu8ident

unicode security guidelines for identifiers

View the Project on GitHub rurban/libu8ident

libu8ident - Check unicode security guidelines for identifiers

without adding the full Unicode database.

This library does the unicode identifier security checks needed for all the compilers, interpreters, filesystems and login systems, which do support for whatever reason unicode identifiers (“names”), and wish to avoid the various unicode security loopholes, like bidi- or homoglyph attacks, mixed scripts, confusables, and support normalized storage. For the UTF-8 encoding only, wchar_t users are rare.

Supporting the various Unicode security profiles for identifiers can be small, performant and easy. Still everybody (roughly 98 out of 100) are vulnerable to unicode identifier attacks. Since I implemented proper unicode security in cperl in 2016, the 2nd language after Java which did so, I publish this little library so that others can follow.

Remember, the meaning of identifiers is to be identifiable. A user should not confuse one identifier with another. Only a program, IDE or library can properly check unicode identifiers, humans certainly not. Leaving such checks to a linter is not recommended, and they did’t even exist until now.

Motivation

with


    const [ ENV_PROD, ENV_DEV ] = [ 'PRODUCTION', 'DEVELOPMENT'];
    /* … */
    const environment = 'PRODUCTION';
    /* … */
    function isUserAdmin(user) {
        if(environmentǃ=ENV_PROD){
            // bypass authZ checks in DEV
            return true;
        }

        /* … */
        return false;
    }

where environmentǃ is an identifier, because the ǃ is the U+1C3 “LATIN LETTER ALVEOLAR CLICK”, a Technical, Lo identifier, completely flipping the logic. A safe TR31 ID set recommended by TR39 would have forbidden that. Such ID confusables with operators are U+1C0 ǀ, U+1C1 ǁ, U+1C3 ǃ. TR31 XID’s have a lot of insecure confusables, such as the Halfwidth and Fullwidth Forms, U+FF00..U+FFEF, the Arabic Presentation Forms-A: U+FB50–U+FDFF and Arabic Presentation Forms-B: U+FE70–U+FEFF. TR39 recommends only a subset.

There’s now even a Unicode taskforce, because of the https://trojansource.codes CVE’s, even when they were about bidi overrides, not identifiers.

Valid characters

Each identifer must start with a ID_Start and continue with a ID_Continue matcher. See TR31. As perl regex:

 / (?[ ( \p{Word} & \p{XID_Start} ) + [_] ])
   (?[ ( \p{Word} & \p{XID_Continue} ) ]) * /x

The tokenizer has variable options for the start and cont matchers. The default is to use none. This is normally done by a parser, to the library you just pass the len or null-terminated identifier. Optionally we can check for proper XID_Start/XID_Continue properties also. But this may not use the optimization -DDISABLE_CHECK_XID.

With the optional U8ID_TR31 options we check for valid UTF-8 encoding, valid ID_Start/Continue properties and allowed script of each character.

Some parsers also need to check for allowed medial characters, which are not allowed at the very end of an identifier. Esp. for unrestrictive mixed-script security profiles or insecure xid ranges. All the UCD ID_Start and XID_Start properties incorrectly list them there, and not in X?ID_Continue btw.

u8idlint has its own tokenizer, which can be configured with the –xid options: ASCII, SAFEC26, ALLOWED, C23, ID, XID, C11 and ALLUTF8. (sorted from most secure to most insecure). ASCII ignores all utf8 word boundaries. SAFEC26 is the optimized proposal for the C26 charset. ALLOWED, allows only TR39 IdentifierStatus Allowed characters. C23 is from the upcoming C23 standard with NFC and a maximal sequence length. ID allows all letters, plus numbers, punctuation and marks, including all exotic scripts. XID is ID plus some special exceptions to avoid the NFKC quirks, because NFKC has a lot of confusable mappings and no roundtrips. C11 uses the C11 standard insecure DefId unicode ranges. ALLUTF8 allows all unicode characters as letters > 127, as in php, D, nim or crystal.

Normalization

All utf8 identifiers and literals are parsed and stored as normalized NFC variants (unlike Python 3), which prevents from various TR31, TR36 and TR39 unicode confusable and spoofing security problems with identifiers. See http://www.unicode.org/reports/tr31/, http://www.unicode.org/reports/tr36/ and http://www.unicode.org/reports/tr39 Optionally we also support the NFKC, NFKD and NFD methods.

For example with NFKC the following two characters would be equal: ℌ => H, Ⅸ => IX, ℯ => e, ℨ => Z (and not 3), ⒛ => 20, µ (Math) => μ (Greek) … Under NFC only glyphs looking the same, but varying the underlying combining marks would be equal: Such as Café == Café.

Mixed Scripts

Many mixed scripts combinations in unicode identifiers for a certain context (such as a document or directory) are forbidden, but they can be declared via a special API. The ‘Common’, ‘Latin’ and ‘Inherited’ scripts are always allowed and don’t need to be declared. With some Common and Inherited the SCX Extended scripts needs to be checked. Also the following combinations are allowed without any declaration: Latin + Hangul + Han (:Korean), Latin + Katakana + Hiregana (::Japanese) and Latin + Han + Bopomofo (:Hanb). And some more recommended and aspirational scripts, which are not excluded, except Cyrillic and Greek.

The first allowed undeclared unicode script for an identifier is the only allowed one. This qualifies as single-script. More scripts lead to parsers errors. See http://www.unicode.org/reports/tr36/#Mixed_Script_Spoofing and http://www.unicode.org/reports/tr31/.

I.e. you may still declare those scripts as valid, but they are not automatically allowed, similar to the need to declare mixed scripts.

General Security Profiles

Certain combinations of mixed scripts are defined with a user-defined identifier security profile, the Restriction Levels 1-6. https://www.unicode.org/reports/tr39/#Restriction_Level_Detection

1. ASCII-Only

All characters in the string are in the ASCII range. You don’t need this library for that use-case. (Maybe still allow that for conformance testing. This is the recommended profile, don’t fall into the unicode identifier trap. E.g. zig was right, rejecting those proposals.)

2. Single Script

3. Highly Restrictive

4. Moderately Restrictive

5. Minimally Restrictive

6. Unrestricted

c26_4. SAFEC26

c11_6. C11STD

Recommended is Level 4, the Moderately Restrictive level or its improved C26_4 variant. It is always easier to widen restrictions than narrow them.

Non-spacing Combining marks

Confusables

An alternative API is to check only against the list of TR39 security/confusables.txt. This is comparable to the Minimally Restrictive security profile. The list of confusables is manually maintained and consists of pairs of codepoints which are visually confusable with other codepoints. It is described in TR39 Section 4, with the “skeleton” algorithm, and implemented via the API enum u8id_errors u8ident_check_confusables(const char *buf, const int len).

It uses a NFD lookup and three hash lookups per identifier, thus it is very slow. NFD is relatively cheap compared to NFC, mandatory since C23 and C++23, but much more expensive than the mixed script approach which uses only a single range-lookup in most cases.

Also the default confusables list is extremely buggy. It needs at least 7 manual exceptions for the ASCII range, 12 exceptions for Greek, and I didn’t check any others scripts. python and clang-tidy were very unsuccessful with this approach, compared to java, rust and cperl with the mixed-script approach.

The confusables API needs to be enabled with the --enable-confus resp. cmake -DHAVE_CONFUS=ON options.

configure options

By default the confusables API is disabled.

When you know beforehand which normalization or profile you will need, and your parsers knows about allowed identifier codepoints, define that via ./configure --with-norm=NFC --with-profile=4 --with-tr31=NONE, resp. cmake -DU8ID_NORM=NFC -DU8ID_PROFILE=4 -DU8ID_TR31=NONE -DBUILD_SHARED_LIBS=OFF. This skips a lot of unused code and branches. The generic shared library has all the code for all normalizations, all tr39 profiles, all tr31 xid checks and branches at run-time.

e.g codesizes for u8idnorm.o with -Os

amd64-gcc:   NFKC 217K, NFC+FCC 182K, NFD 113K, NFD 78K, FCD 52K
amd64-clang: NFKC 218K, NFC+FCC 183K, NFD 114K, NFD 78K, FCD 52K

default: 365K with -g on amd64-gcc

For -DU8ID_PROFILE_SAFEC26 see above. c26_4 is also called SAFEC26, previously SAFEC23, c11_6 is the std insecure C11 profile.

With confus enabled, the confusable API is added. With croaring the confus API is about twice as fast, and needs half the size.

See the likewise cmake options:

API

#include "u8ident.h"

u8id_options is the sum of the following bits:

enum u8id_norm: TR15

U8ID_NFC  = 0  // the default, shortest canonical composed normalization
U8ID_NFD  = 1  // the longer, decomposed normalization
U8ID_NFKC = 2  // the compatibility normalization
U8ID_NFKD = 3  // the longer compatibility decomposed normalization
U8ID_FCD  = 4, // the faster variants
U8ID_FCC  = 5, //

enum u8id_profile: TR39

U8ID_PROFILE_1 = 1      // ASCII only
U8ID_PROFILE_2 = 2      // Single Script only
U8ID_PROFILE_3 = 3      // Highly Restrictive
U8ID_PROFILE_4 = 4      // Moderately Restrictive
U8ID_PROFILE_5 = 5      // Minimally Restrictive
U8ID_PROFILE_6 = 6      // Unrestricted
U8ID_PROFILE_C11_6 = 7, // "C11STD"
U8ID_PROFILE_C26_4 = 8, // 4 + Greek with only Allowed ID's ("SAFEC26")

enum u8id_options: TR31

U8ID_TR31_XID = 64,     // without NFKC quirks, labelled stable
U8ID_TR31_ID = 65,      // The usual tr31 variants
U8ID_TR31_ALLOWED = 66, // The UCD IdentifierStatus.txt (default)
U8ID_TR31_SAFEC26 = 67, // safer XID's, without Limited_Use and Excluded Scripts
U8ID_TR31_C23 = 68,     // XID with NFC from the C23 standard
U8ID_TR31_C11 = 69,     // stable insecure AltId ranges from C11 Annex D
U8ID_TR31_ALLUTF8 = 70, // allow all > 128, e.g. D, php, nim, crystal
U8ID_TR31_ASCII = 71,   // only ASCII letters (as e.g. zig, j. older compilers)
// room for more tr31 profiles

U8ID_FOLDCASE = 128,         // optional for case-insensitive idents. case-folded
                             // when normalized
U8ID_WARN_CONFUSABLE = 256,  // requires -DHAVE_CONFUS
U8ID_ERROR_CONFUSABLE = 512, // requires -DHAVE_CONFUS

int u8ident_init (enum u8id_profile, enum u8id_norm, unsigned options)

Initialize the library with a bitmask of options, which define the performed checks. Recommended is (U8ID_PROFILE_DEFAULT, U8ID_NORM_DEFAULT, U8ID_TR31_SAFEC26).

int u8ident_set_maxlength (int maxlen)

of an identifier. Default: 1024. Beware that such longs identiers are not really identifiable anymore, and keep them under 80 or even less. Some filesystems do allow now 32K identifiers, which is a glaring security hole, waiting to be exploited.

u8id_ctx_t u8ident_new_ctx ()

Generates a new identifier document/context/directory, which initializes a new list of seen scripts. Contexts are optional. By default all checks are done in the same context 0. With compilers and interpreters a context is a source file, with filesystems a directory, with usernames you may choose if you need to support different languages at once. I cannot think of any such usage, so better avoid contexts with usernames to avoid mixups.

int u8ident_set_ctx (u8id_ctx_t ctx)

Changes to the context generated with u8ident_new_ctx.

int u8ident_add_script_name (const char *name) int u8ident_add_script (uint8_t script)

Adds the script to the context, if it’s known or declared beforehand. Such as use utf8 "Greek"; in cperl, or a #pragma unicode Braille in some C to add some Excluded scripts.

int u8ident_free_ctx (u8id_ctx_t ctx)

Deletes the context generated with u8ident_new_ctx. This is optional, all remaining contexts are deleted by u8ident_free().

int u8ident_free ()

End this library, cleaning up all internal structures.

uint8_t u8ident_get_script (const uint32_t cp)

Lookup the script property for a codepoint.

const char* u8ident_script_name (const int scr)

Lookup the long script name for the internal script byte/index.

bool u8ident_is_confusable (const uint32_t cp)

Lookup if the codepoint is a confusable. Only with --enable-confus / -DHAVE_CONFUS. With --with-croaring / -DHAVE_CROARING this is twice as fast, and needs half the size.

enum u8id_errors u8ident_check (const u8* string, char** outnorm)

enum u8id_errors u8ident_check_buf (const char* buf, int len, char** outnorm)

Two variants to check if this identifier is valid. u8ident_check_buf avoids allocating a fresh string from the parsed input. outnorm is set to a fresh normalized string if valid.

Return values (enum u8id_errors):

Note that we explicitly allow the Latin confusables: 0 1 I ` | i.e. U+30, U+31, U+49, U+60, U+7C

enum u8id_errors u8ident_check_confusables(const char *buf, const int len)

A different, but much less reliable check strategy via confusables.txt only, described in TR 39, Section 4, the skeleton algorithm. Each identifier is stored in two dynamic hash tables, and for each confusable match, normalized to NFC, the first wins. Only with --enable-confus / -DHAVE_CONFUS.

char * u8ident_normalize (const char* buf, int len)

Returns a freshly allocated normalized string, with the options defined at u8ident_init.

uint32_t u8ident_failed_char (const int ctx)

Returns the failing codepoint, which failed in the last check.

const char* u8ident_failed_script_name (const int ctx)

Returns the constant script name, which failed in the last check.

const char* u8ident_existing_scripts (int ctx)

Returns a fresh string of the list of the seen scripts in this context whenever a mixed script error occurs. Needed for the error message “Invalid script %s, already have %s”, where the 2nd %s is returned by this function. The returned string needs to be freed by the user.

Usage:

if (u8id_check("wrongᴧᴫ") == U8ID_ERR_SCRIPTS) {
  const char *errstr = u8ident_existing_scripts(ctx);
  fprintf(stdout, "Invalid script %s for U+%X, already have %s.\n",
     u8ident_failed_script_name(ctx),
     u8ident_failed_char(ctx), errstr);
  free(errstr);
}

u8idlint

Included is the sample program u8idlint which parses program source files for possible unicode identifier violations. See man u8idlint.

Packaging

Recommended is to use the static lib or sources with your known normalization, profile and xid options. E.g. via cmake or submodule integration. Because the size and run-time varies wildly between all the possible and needed options, and this is in the hot path of a parser.

A shared library needs to provide all the options at run-time, i.e. empty configure options.

Build dependencies: ronn (dnf install rubygem-ronn-ng)

Optional dependencies:

Maintainer dependencies: wget, perl, xxd. Needed every year when the UCD changes.

Internals

Each context keeps a list of all seen unique scripts, which are represented as a byte. 4-8 bytes are kept in a word, so you rarely need to allocate room for the scripts. All mixed-script profiles have either a max size of 4 allowed scripts, or ignore mixed-scripts.

The normalization tables need to be updated every year with the new Unicode database via make regen-norm (using my perl-Unicode-Normalize variant). This is generated from the current UCD via the current perl, every April/May. Stored are small 3-way tables for the canonical decomposition (NFD) of each utf-8 character. With U8ID_NFC also the canonical composition (NFC) tables and the reorder logic. With U8ID_NFKC even the larger compatibility composition tables and the reorder logic.

The NFC strings are always shorter, but need a 2nd table set (2x memory) and 3x longer lookup times. If compiled with --with-norm=NFD the library is smaller and faster, but the resulting identifiers maybe a bit longer. The NFD table is very sparse, only 3 of 17 initial planes are needed. Some rare overlong entries (296 codepoints with > 5 byte of UTF-8) are searched in an extra list to keep static memory usage low, contrary to most other unicode libs.

We don’t support the normalization variants FCC nor FCD yet.

The character to script lookup is done with a sorted list of ranges, for less space. This is also generated from the current UCD. The internally used script indices are arbitrarily created via mkscripts.pl from the current UCD, sorted into Recommended Scripts (sorted by codepoints), Not Recommended Scripts, i.e. Excluded Scripts (sorted alphabetically) and Limited Use Scripts (sorted by codepoint).

With CRoaring some boolean bitset queries can be optimized. So far only the confusables codepoints lookups are used. The others, allowed_croar.h and the nf*_croar.h headers are too slow. See perf. These optimizations on boolean ranges are work in progress, if I can find faster lookups than binary search. Also for special configurations, such as a new c11 profile a single header and optimized lookup method should be implemented, combining the script, xid and decomposition in it. This would replace the ~6 lookups per codepoint.

LICENSE

Copyright (c) 2021,2022,2024 Reini Urban. All rights reserved.

This software is dual-licensed under either the Apache-2.0 license or the GPL-2.0 or later. See the LICENSE file.

TODO

AUTHOR

Reini Urban rurban@cpan.org 2021-2022