github.com/nodejs/node.git
author Timothy Gu <timothygu99@gmail.com> 2017-06-26 10:19:03 +0300
committer James M Snell <jasnell@gmail.com> 2017-06-29 07:50:55 +0300
commit f4b5b704447821dda56069a601595322344248ab
tree 47eb0f762c5ee824774d81a3ff3fa937b22ee27e /src/node_i18n.cc
parent 01aeb388000b250ce82036b2a6b8e4a676bc5b5d
src: revise character width calculation

- Categorize all nonspacing marks (Mn) and enclosing marks (Me) as 0-width.
- Categorize all spacing marks (Mc) as non-0-width.
- Treat soft hyphens (a format character Cf) as non-0-width.
- Do not treat all unassigned code points as 0-width; instead, let ICU select the default for that character per UAX #11.
- Avoid getting the General_Category of a character multiple times as it is an intensive operation.

Refs: http://unicode.org/reports/tr11/
PR-URL: https://github.com/nodejs/node/pull/13918
Reviewed-By: James M Snell <jasnell@gmail.com>
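
The rules in the commit message amount to a single bitmask test on a character's General_Category, with one carve-out for the soft hyphen. The following is a minimal sketch of that idea and not the code from this commit: `Category`, `Mask`, `ToyCategory`, and `IsZeroWidth` are hypothetical names, and `ToyCategory` is a hand-written lookup for a few code points standing in for ICU's `u_charType()`. The emoji modifier check from the real function is left out.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the General_Category values the commit cares
// about. The real code uses ICU's UCharCategory and U_GC_*_MASK constants.
enum Category : int {
  kOther = 0,
  kNonspacingMark,  // Mn
  kEnclosingMark,   // Me
  kSpacingMark,     // Mc
  kControl,         // Cc
  kFormat,          // Cf
};

// One bit per category, so several categories can be tested at once.
constexpr uint32_t Mask(Category c) {
  return uint32_t{1} << c;
}

// Toy lookup for a handful of code points (assumption: illustration only).
// A real implementation would query the Unicode database once per character.
Category ToyCategory(char32_t cp) {
  switch (cp) {
    case 0x0007: return kControl;         // BEL
    case 0x00AD: return kFormat;          // SOFT HYPHEN
    case 0x0301: return kNonspacingMark;  // COMBINING ACUTE ACCENT
    case 0x0903: return kSpacingMark;     // DEVANAGARI SIGN VISARGA
    case 0x200B: return kFormat;          // ZERO WIDTH SPACE
    case 0x20DD: return kEnclosingMark;   // COMBINING ENCLOSING CIRCLE
    default:     return kOther;
  }
}

// Returns true when the code point should occupy no terminal columns.
bool IsZeroWidth(char32_t cp) {
  constexpr uint32_t zero_width_mask =
      Mask(kControl) | Mask(kFormat) | Mask(kEnclosingMark) |
      Mask(kNonspacingMark);
  // SOFT HYPHEN is Cf but is treated as non-zero-width.
  if (cp == 0x00AD) return false;
  // The category is looked up once and compared against all four bits.
  return (Mask(ToyCategory(cp)) & zero_width_mask) != 0;
}
```

In this sketch `IsZeroWidth(0x0301)` is true (Mn) while `IsZeroWidth(0x0903)` is false (Mc), which mirrors the first two bullets above. Looking the category up once and masking it is what lets the check avoid repeated property queries.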
Diffstat (limited to 'src/node_i18n.cc')
-rw-r--r-- src/node_i18n.cc 27
1 files changed, 23 insertions, 4 deletions
diff --git a/src/node_i18n.cc b/src/node_i18n.cc
index 44d94d62558..3b337449495 100644
--- a/src/node_i18n.cc
+++ b/src/node_i18n.cc
@@ -601,14 +601,33 @@ static void ToASCII(const FunctionCallbackInfo<Value>& args) {
// newer wide characters. wcwidth, on the other hand, uses a fixed
// algorithm that does not take things like emoji into proper
// consideration.
+//
+// TODO(TimothyGu): Investigate Cc (C0/C1 control codes). Both VTE (used by
+// GNOME Terminal) and Konsole don't consider them to be zero-width (see refs
+// below), and when printed in VTE it is Narrow. However GNOME Terminal doesn't
+// allow it to be input. Linux's PTY terminal prints control characters as
+// Narrow rhombi.
+//
+// TODO(TimothyGu): Investigate Hangul jamo characters. Medial vowels and final
+// consonants are 0-width when combined with initial consonants; otherwise they
+// are technically Wide. But many terminals (including Konsole and
+// VTE/GLib-based) implement all medials and finals as 0-width.
+//
+// Refs: https://eev.ee/blog/2015/09/12/dark-corners-of-unicode/#combining-characters-and-character-width
+// Refs: https://github.com/GNOME/glib/blob/79e4d4c6be/glib/guniprop.c#L388-L420
+// Refs: https://github.com/KDE/konsole/blob/8c6a5d13c0/src/konsole_wcwidth.cpp#L101-L223
static int GetColumnWidth(UChar32 codepoint,
bool ambiguous_as_full_width = false) {
- if (!u_isdefined(codepoint) ||
- u_iscntrl(codepoint) ||
- u_getCombiningClass(codepoint) > 0 ||
- u_hasBinaryProperty(codepoint, UCHAR_EMOJI_MODIFIER)) {
+ const auto zero_width_mask = U_GC_CC_MASK | // C0/C1 control code
+ U_GC_CF_MASK | // Format control character
+ U_GC_ME_MASK | // Enclosing mark
+ U_GC_MN_MASK; // Nonspacing mark
+ if (codepoint != 0x00AD && // SOFT HYPHEN is Cf but not zero-width
+ ((U_MASK(u_charType(codepoint)) & zero_width_mask) ||
+ u_hasBinaryProperty(codepoint, UCHAR_EMOJI_MODIFIER))) {
return 0;
}
+
// UCHAR_EAST_ASIAN_WIDTH is the Unicode property that identifies a
// codepoint as being full width, wide, ambiguous, neutral, narrow,
// or halfwidth.