locale modifier @cjkwide

Message ID 328fd4a6-1637-e722-bc4d-d3c4fc265d57@towo.net
State New

Commit Message

Thomas Wolff Feb. 26, 2018, 9:42 p.m.
I wrote yesterday:
> It had been discussed how to reflect ambiguous character widths in
> cygwin locales, with the result of an implicit wide property assumed
> for the CJK locales, and an overriding @cjknarrow modifier:
> https://sourceware.org/ml/cygwin/2009-06/msg00240.html
> https://sourceware.org/ml/cygwin/2009-06/msg00521.html
> https://sourceware.org/ml/cygwin/2009-06/msg00616.html
>
> Now I’m getting occasional complaints about mintty support for wide
> display of certain symbol characters, particularly as used for some
> fancy “Powerline” add-on, and it seems that other terminals apply
> “ambiguous wide mode” (e.g. xterm -cjk_width) in order to enable
> Powerline.
> While mintty has an option Charwidth=ambig-wide meanwhile, using this
> option clearly has the drawback that it makes character width handling
> inconsistent with the locale model as used by wcwidth.
> Actually for mintty, the desired behaviour can be achieved in a
> locale-consistent way by selecting one of the CJK locales for LC_CTYPE;
> that’s not what most people would expect, however, and if they do it
> the easy way, using LANG or LC_ALL, they are baffled by also getting
> their message language obscured.
> So I would prefer the option to use ambiguous wide mode in combination
> with non-CJK locales in a locale-compatible way.


So I suggest revisiting the proposal of another generic modifier, also
for symmetry: @cjkwide, applicable to non-CJK locales.
Patch attached.
Thomas
From 12b87350eb70c83cd654eec37dae3773bf58d231 Mon Sep 17 00:00:00 2001
From: Thomas Wolff <towo@towo.net>
Date: Sun, 25 Feb 2018 16:27:33 +0100
Subject: [PATCH] locale modifier @cjkwide

---
 newlib/libc/locale/locale.c | 39 +++++++++++++++++++++++----------------
 1 file changed, 23 insertions(+), 16 deletions(-)
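
For illustration, the effect the proposal aims at can be observed from an application by asking wcwidth() for the width of an East Asian Ambiguous character under different LC_CTYPE settings. The following is a minimal sketch and not part of the patch: the helper name width_in_locale, the sentinel value -2, and the choice of U+00B1 (PLUS-MINUS SIGN) as a sample ambiguous-width character are assumptions made for this example, and the widths mentioned below would only be expected on a Cygwin/newlib runtime that includes the change.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* request the wcwidth() declaration */
#endif
#include <locale.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper for this example: switch LC_CTYPE to locale_name
   and return wcwidth(wc) under that locale.  Returns -2 if the C library
   rejects the locale name (for instance because it does not know the
   modifier).  Note that this changes the process-wide locale. */
static int width_in_locale(const char *locale_name, wchar_t wc)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -2;
    return wcwidth(wc);
}

/* Sample East Asian Ambiguous character: U+00B1 PLUS-MINUS SIGN. */
static const wchar_t sample_ambiguous_char = (wchar_t) 0x00B1;
```

Calling width_in_locale("en_US.UTF-8@cjkwide", sample_ambiguous_char) would be expected to yield 2 with the proposed modifier, and 1 for plain "en_US.UTF-8". C libraries that do not implement the modifier may reject the locale name, which the helper reports as -2.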

Comments

Corinna Vinschen Feb. 27, 2018, 2:58 p.m. | #1
Hi Thomas,

On Feb 26 22:42, Thomas Wolff wrote:
> So I suggest revisiting the proposal of another generic modifier, also for
> symmetry: @cjkwide, applicable to non-CJK locales.
> Patch attached.
> Thomas


Just one point:

> From 12b87350eb70c83cd654eec37dae3773bf58d231 Mon Sep 17 00:00:00 2001
> From: Thomas Wolff <towo@towo.net>
> Date: Sun, 25 Feb 2018 16:27:33 +0100
> Subject: [PATCH] locale modifier @cjkwide


It would be most helpful to get a v2 patch with a commit message
describing why adding cjkwide makes sense, for later reference.
The subject "locale modifier @cjkwide" is rather terse.


Thanks,
Corinna

-- 
Corinna Vinschen
Cygwin Maintainer
Red Hat
Thomas Wolff Feb. 27, 2018, 10:58 p.m. | #2
On 27.02.2018 at 15:58, Corinna Vinschen wrote:
> Just one point:
>> ...
>> Subject: [PATCH] locale modifier @cjkwide
> It would be most helpful to get a v2 patch with a commit message
> describing why adding cjkwide makes sense, for later reference.
> The subject "locale modifier @cjkwide" is rather terse.

New patch attached. I'll also provide a patch for the Cygwin user guide, 
to cygwin-patches.
Thomas
From f97028789cb8e18fd97a65fb8f5b08f25856bb94 Mon Sep 17 00:00:00 2001
From: Thomas Wolff <towo@towo.net>
Date: Tue, 27 Feb 2018 23:47:21 +0100
Subject: [PATCH] Locale modifier @cjkwide makes Unicode "ambiguous width"
 characters wide. So ambiguous width characters can be enforced to have width
 2 even in non-CJK locales. This gives e.g. users of "Powerline symbols" the
 opportunity to adjust their width to the desired behaviour (and the behaviour
 apparently expected by some tools) without having to set a CJK locale and
 without losing consistence of terminal character width with wcwidth/wcswidth
 locale width.

---
 newlib/libc/locale/locale.c | 39 +++++++++++++++++++++++----------------
 1 file changed, 23 insertions(+), 16 deletions(-)

diff --git a/newlib/libc/locale/locale.c b/newlib/libc/locale/locale.c
index baa5451..e654c5c 100644
--- a/newlib/libc/locale/locale.c
+++ b/newlib/libc/locale/locale.c
@@ -74,15 +74,16 @@ Cygwin additionally supports locales from the file
 (<<"">> is also accepted; if given, the settings are read from the
 corresponding LC_* environment variables and $LANG according to POSIX rules.)
 
-This implementation also supports the modifier <<"cjknarrow">>, which
-affects how the functions <<wcwidth>> and <<wcswidth>> handle characters
-from the "CJK Ambiguous Width" category of characters described at
-http://www.unicode.org/reports/tr11/#Ambiguous. These characters have a width
-of 1 for singlebyte charsets and a width of 2 for multibyte charsets
-other than UTF-8. For UTF-8, their width depends on the language specifier:
+This implementation also supports the modifiers <<"cjknarrow">> and
+<<"cjkwide">>, which affect how the functions <<wcwidth>> and <<wcswidth>>
+handle characters from the "CJK Ambiguous Width" category of characters
+described at http://www.unicode.org/reports/tr11/#Ambiguous.
+These characters have a width of 1 for singlebyte charsets and a width of 2
+for multibyte charsets other than UTF-8.
+For UTF-8, their width depends on the language specifier:
 it is 2 for <<"zh">> (Chinese), <<"ja">> (Japanese), and <<"ko">> (Korean),
-and 1 for everything else. Specifying <<"cjknarrow">> forces a width of 1,
-independent of charset and language.
+and 1 for everything else. Specifying <<"cjknarrow">> or <<"cjkwide">>
+forces a width of 1 or 2, respectively, independent of charset and language.
 
 If you use <<NULL>> as the <[locale]> argument, <<setlocale>> returns a
 pointer to the string representing the current locale.  The acceptable
@@ -480,6 +481,7 @@ __loadlocale (struct __locale_t *loc, int category, const char *new_locale)
   wctomb_p l_wctomb;
   mbtowc_p l_mbtowc;
   int cjknarrow = 0;
+  int cjkwide = 0;
 
   /* Avoid doing everything twice if nothing has changed.
 
@@ -593,11 +595,13 @@ restart:
   if (c && c[0] == '@')
     {
       /* Modifier */
-      /* Only one modifier is recognized right now.  "cjknarrow" is used
-         to modify the behaviour of wcwidth() for East Asian languages.
+      /* Modifiers "cjknarrow" or "cjkwide" are recognized to modify the 
+         behaviour of wcwidth() and wcswidth() for East Asian languages.
          For details see the comment at the end of this function. */
       if (!strcmp (c + 1, "cjknarrow"))
 	cjknarrow = 1;
+      else if (!strcmp (c + 1, "cjkwide"))
+	cjkwide = 1;
     }
   /* We only support this subset of charsets. */
   switch (charset[0])
@@ -894,12 +898,15 @@ restart:
          single-byte charsets, and double width for multi-byte charsets
          other than UTF-8. For UTF-8, use double width for the East Asian
          languages ("ja", "ko", "zh"), and single width for everything else.
-         Single width can also be forced with the "@cjknarrow" modifier. */
-      loc->cjk_lang = !cjknarrow && mbc_max > 1
-		      && (charset[0] != 'U'
-			  || strncmp (locale, "ja", 2) == 0
-			  || strncmp (locale, "ko", 2) == 0
-			  || strncmp (locale, "zh", 2) == 0);
+         Single width can also be forced with the "@cjknarrow" modifier.
+         Double width can also be forced with the "@cjkwide" modifier.
+       */
+      loc->cjk_lang = cjkwide ||
+		      (!cjknarrow && mbc_max > 1
+		       && (charset[0] != 'U'
+			   || strncmp (locale, "ja", 2) == 0
+			   || strncmp (locale, "ko", 2) == 0
+			   || strncmp (locale, "zh", 2) == 0));
 #ifdef __HAVE_LOCALE_INFO__
       ret = __ctype_load_locale (loc, locale, (void *) l_wctomb, charset,
 				 mbc_max);
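
The patched expression can be read as a small decision procedure. The following stand-alone sketch restates it outside of newlib. The function name ambiguous_is_wide and its parameters are hypothetical, and the charset test simply mirrors the charset[0] != 'U' check from the diff, which treats any charset name starting with 'U' as UTF-8.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-alone helper (not part of newlib): decides whether
   East Asian Ambiguous characters are treated as double width.
   lang     - language part of the locale name, e.g. "ja_JP"
   charset  - charset name, e.g. "UTF-8" or "SJIS"
   mbc_max  - maximum number of bytes per character in that charset
   modifier - text after '@' in the locale name, or NULL if absent */
static bool ambiguous_is_wide(const char *lang, const char *charset,
                              int mbc_max, const char *modifier)
{
    /* "@cjkwide" forces double width, independent of charset and language. */
    if (modifier != NULL && strcmp(modifier, "cjkwide") == 0)
        return true;
    /* "@cjknarrow" forces single width. */
    if (modifier != NULL && strcmp(modifier, "cjknarrow") == 0)
        return false;
    /* Single-byte charsets use single width. */
    if (mbc_max <= 1)
        return false;
    /* Multibyte charsets other than UTF-8 use double width. */
    if (charset[0] != 'U')
        return true;
    /* UTF-8: double width only for the East Asian languages. */
    return strncmp(lang, "ja", 2) == 0
        || strncmp(lang, "ko", 2) == 0
        || strncmp(lang, "zh", 2) == 0;
}
```

Under these assumptions, ambiguous_is_wide("en_US", "UTF-8", 6, "cjkwide") is true while the same call with a NULL modifier is false, which is the non-CJK case the modifier is meant to cover.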
Corinna Vinschen March 1, 2018, 5:21 p.m. | #3
On Feb 27 23:58, Thomas Wolff wrote:
> On 27.02.2018 at 15:58, Corinna Vinschen wrote:
> > It would be most helpful to get a v2 patch with a commit message
> > describing why adding cjkwide makes sense, for later reference.
> > The subject "locale modifier @cjkwide" is rather terse.
> New patch attached. I'll also provide a patch for the Cygwin user guide, to
> cygwin-patches.


Thanks, but the commit message is formatted incorrectly.  It should be a
single, short first line, followed by an empty line, followed by the more
detailed commit message.  Otherwise, as you can see, the entire message
becomes the commit title.

See `man git-commit', chapter "DISCUSSION".


Thanks,
Corinna

Thomas Wolff March 2, 2018, 7:26 p.m. | #4
On 01.03.2018 at 18:21, Corinna Vinschen wrote:
> On Feb 27 23:58, Thomas Wolff wrote:
>> On 27.02.2018 at 15:58, Corinna Vinschen wrote:
>>> It would be most helpful to get a v2 patch with a commit message
>>> describing why adding cjkwide makes sense, for later reference.
>>> The subject "locale modifier @cjkwide" is rather terse.
>> New patch attached. I'll also provide a patch for the Cygwin user guide, to
>> cygwin-patches.
> Thanks, but the commit message is formatted incorrectly.  It should be a
> single, short first line, followed by an empty line, followed by the more
> detailed commit message.  Otherwise, as you can see, the entire message
> becomes the commit title.

Update attached, hope it's OK this time.
Thomas
From 3979072d80a2b4cc079aa719776d9e338fc62fd3 Mon Sep 17 00:00:00 2001
From: Thomas Wolff <towo@towo.net>
Date: Fri, 2 Mar 2018 20:21:09 +0100
Subject: [PATCH] Locale modifier @cjkwide to adjust ambiguous-width in non-CJK locales

Locale modifier @cjkwide makes Unicode "ambiguous width" characters wide.
So ambiguous width characters can be enforced to have width 2 even in
non-CJK locales. This gives e.g. users of "Powerline symbols" the opportunity
to adjust their width to the desired behaviour (and the behaviour apparently
expected by some tools) without having to set a CJK locale and without losing
consistence of terminal character width with wcwidth/wcswidth locale width.
---
 newlib/libc/locale/locale.c | 39 +++++++++++++++++++++++----------------
 1 file changed, 23 insertions(+), 16 deletions(-)

Corinna Vinschen March 5, 2018, 4:40 p.m. | #5
On Mar  2 20:26, Thomas Wolff wrote:
> On 01.03.2018 at 18:21, Corinna Vinschen wrote:
> > On Feb 27 23:58, Thomas Wolff wrote:
> > > On 27.02.2018 at 15:58, Corinna Vinschen wrote:
> > > > It would be most helpful to get a v2 patch with a commit message
> > > > describing why adding cjkwide makes sense, for later reference.
> > > > The subject "locale modifier @cjkwide" is rather terse.
> > > New patch attached. I'll also provide a patch for the Cygwin user guide, to
> > > cygwin-patches.
> > Thanks, but the commit message is formatted incorrectly.  It should be a
> > single, short first line, followed by an empty line, followed by the more
> > detailed commit message.  Otherwise, as you can see, the entire message
> > becomes the commit title.
> Update attached, hope it's OK this time.
> Thomas


Pushed.


Thanks,
Corinna

