Simple Script Detection

This page provides a simple illustration of how a GUI can visually indicate boundaries between different scripts, to help avoid spoofing. The code is rough and is meant only for illustration.

The boundaries are determined with the following pseudo-code:

lastScript = COMMON;

for i = 0..n
  script = getScript(source[i]);

  // Certain characters assume the script of certain adjacent characters

  if (script == COMMON) script = lastScript;
  else if (lastScript == COMMON) lastScript = script;

  if (script == HAN_N && (lastScript == HAN_T || lastScript == HAN_S)) script = lastScript;
  else if (lastScript == HAN_N && (script == HAN_T || script == HAN_S)) lastScript = script;

  // After the fixes, check to see if there is a boundary

  if (lastScript != script) {
    showBoundary(); // show boundary with color difference, lines, or other device
  }
  lastScript = script; // remember for next time
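
The pseudo-code above can be sketched in runnable form roughly as follows. This is only an illustration under stated assumptions: the names find_boundaries and get_script are hypothetical, the get_script stub covers just ASCII digits, ASCII Latin letters, and letters in the Greek block, and the function returns the indices where a boundary would be shown instead of calling showBoundary().

```python
# Script labels used by the sketch (plain strings, chosen for illustration).
COMMON, DIGIT, LATIN, GREEK = "COMMON", "DIGIT", "LATIN", "GREEK"
HAN_N, HAN_T, HAN_S = "HAN_N", "HAN_T", "HAN_S"


def get_script(ch):
    """Hypothetical stub: a real version would consult the UCD Script value."""
    if "0" <= ch <= "9":
        return DIGIT
    if ch.isascii() and ch.isalpha():
        return LATIN
    if "\u0370" <= ch <= "\u03ff" and ch.isalpha():
        return GREEK
    return COMMON


def find_boundaries(source, get_script=get_script):
    """Return the indices at which a script boundary would be displayed."""
    boundaries = []
    last_script = COMMON
    for i, ch in enumerate(source):
        script = get_script(ch)

        # COMMON characters take on the script of their neighbor.
        if script == COMMON:
            script = last_script
        elif last_script == COMMON:
            last_script = script

        # Neutral Han takes on the Traditional/Simplified value next to it.
        if script == HAN_N and last_script in (HAN_T, HAN_S):
            script = last_script
        elif last_script == HAN_N and script in (HAN_T, HAN_S):
            last_script = script

        # After the fixes, record a boundary if the script changed.
        if last_script != script:
            boundaries.append(i)
        last_script = script  # remember for next time
    return boundaries


# A Greek alpha inside an otherwise Latin word yields boundaries on both sides.
print(find_boundaries("p\u03b1ypal"))  # [1, 2]
```

Returning indices rather than drawing directly keeps the detection separate from whatever device (color, lines, and so on) the GUI uses to show the boundary.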

The getScript() call can use the Script value from the Unicode Character Database (see UTR #24: Script Names), with a few additional modifications:

  1. It returns a DIGIT value for numbers (since some digits may look like letters).
  2. It collapses COMMON and INHERITED, since they don't need to be distinguished.
  3. It distinguishes three kinds of HAN characters: Traditional (HAN_T), Simplified (HAN_S), and neutral (HAN_N).
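
The three modifications listed above might be layered on top of a UCD lookup along these lines. This is a sketch, not the demo's actual code: Python's standard library does not expose the Script property, so ucd_script is a hypothetical caller-supplied function returning names such as "Latin" or "Han"; han_kind is a hypothetical table marking Han characters as HAN_T or HAN_S; and treating General_Category Nd as "numbers" is an assumption.

```python
import unicodedata


def make_get_script(ucd_script, han_kind):
    """Build a getScript-style function from two hypothetical inputs.

    ucd_script: callable mapping a character to its UCD Script value name.
    han_kind:   dict mapping a Han character to "HAN_T" or "HAN_S";
                characters missing from it are treated as neutral.
    """
    def get_script(ch):
        # 1. Digits get their own value (assumption: decimal digits only).
        if unicodedata.category(ch) == "Nd":
            return "DIGIT"
        script = ucd_script(ch)
        # 2. COMMON and INHERITED are collapsed into one value.
        if script in ("Common", "Inherited"):
            return "COMMON"
        # 3. Han is split into Traditional, Simplified, and neutral.
        if script == "Han":
            return han_kind.get(ch, "HAN_N")
        return script.upper()
    return get_script


# Toy data, for illustration only.
toy_scripts = {"a": "Latin", " ": "Common", "\u0301": "Inherited",
               "\u842c": "Han", "\u4e07": "Han", "\u4eba": "Han"}
toy_han = {"\u842c": "HAN_T", "\u4e07": "HAN_S"}
get_script = make_get_script(lambda ch: toy_scripts.get(ch, "Common"), toy_han)

print(get_script("5"), get_script("\u0301"), get_script("\u4eba"))
# DIGIT COMMON HAN_N
```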

One could certainly refine this to call out more characters that are visually confusable. For example, many CJK Radicals are identical in appearance to CJK Ideographs.

Note: In this demo, the script values are stubbed out and are present only for simple ASCII Latin and Greek. The HAN fields are generated with a rough pass through the Unihan fields in the UCD: any character with a kSimplifiedVariant is counted as Traditional; any character with a kTraditionalVariant is counted as Simplified; and all others are counted as neutral.
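
The rough Unihan pass described in the note could look something like the sketch below. It assumes tab-separated input lines of the form "U+XXXX, field name, value", as used by the Unihan data files; the function name classify_han is hypothetical. The note does not say what happens when a character carries both fields, so the sketch assumes kSimplifiedVariant wins (the character counts as Traditional).

```python
def classify_han(unihan_lines):
    """Map characters to "HAN_T" or "HAN_S"; anything absent is neutral."""
    kinds = {}
    for line in unihan_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # not a "code point, field, value" record
        code, field = parts[0], parts[1]
        ch = chr(int(code[2:], 16))  # "U+842C" -> the character itself
        if field == "kSimplifiedVariant":
            # Has a simplified counterpart, so count it as Traditional.
            kinds[ch] = "HAN_T"
        elif field == "kTraditionalVariant":
            # Has a traditional counterpart, so count it as Simplified,
            # unless it was already marked Traditional (assumed tie-break).
            kinds.setdefault(ch, "HAN_S")
    return kinds


# Sample lines in the assumed format, for illustration only.
sample = [
    "# comment line",
    "U+842C\tkSimplifiedVariant\tU+4E07",
    "U+4E07\tkTraditionalVariant\tU+842C",
]
kinds = classify_han(sample)
print(kinds.get("\u842c"), kinds.get("\u4e07"), kinds.get("\u4eba", "HAN_N"))
# HAN_T HAN_S HAN_N
```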