[v4,2/4] Update UTF-8 charmap processing.

Message ID 20210428130033.3196848-3-carlos@redhat.com
State New
Series
  • Add new C.UTF-8 locale (Bug 17318)

Commit Message

Florian Weimer via Libc-alpha April 28, 2021, 1 p.m.
The UTF-8 character map processing is updated to use the new wider
ellipsis support. On top of this, compliance with Unicode noncharacter
handling is improved by adding noncharacters to the UTF-8 character
map, so that they are processed and transformed correctly when only
the character map is considered. All gaps in the UTF-8 character map,
excluding surrogates, are filled with blocks of unassigned characters.
The UTF-8 character map now includes all Unicode scalar values.

Tested by regenerating the locale data from the Unicode data and
running the testsuite.

Tested on x86_64 and i686 without regression.
---
 localedata/unicode-gen/utf8_gen.py | 133 +++++++++++++++++++----------
 1 file changed, 86 insertions(+), 47 deletions(-)

-- 
2.26.3
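
For illustration only (this sketch is neither part of the patch nor of
the thread below), the gap-filling approach described in the commit
message can be shown in a few lines of Python. ucs_symbol here is a
local stand-in for unicode_utils.ucs_symbol, and the output layout
mirrors the charmap lines used by the patch.

    # Illustrative sketch of the gap-filling idea; not the utf8_gen.py code.
    def ucs_symbol(cp):
        '''Local stand-in for unicode_utils.ucs_symbol: <UXXXX> or <UXXXXXXXX>.'''
        return '<U{:04X}>'.format(cp) if cp <= 0xFFFF else '<U{:08X}>'.format(cp)

    def to_hex(cp):
        '''UTF-8 bytes of a code point as /xNN escapes, surrogates included.'''
        return ''.join('/x{:02x}'.format(b)
                       for b in chr(cp).encode('UTF-8', 'surrogatepass'))

    def gap_lines(prev_cp, next_cp):
        '''Charmap lines filling the unassigned gap between two code points.'''
        if next_cp - prev_cp == 2:      # exactly one code point is missing
            cp = prev_cp + 1
            return ['{:<11s} {:<12s} {:s}'.format(ucs_symbol(cp), to_hex(cp),
                                                  '<Unassigned>')]
        if next_cp - prev_cp > 2:       # a longer run: one ellipsis range
            return ['{:s}..{:s} {:<12s} {:s}'.format(ucs_symbol(prev_cp + 1),
                                                     ucs_symbol(next_cp - 1),
                                                     to_hex(prev_cp + 1),
                                                     '<Unassigned>')]
        return []                       # adjacent code points: nothing to fill

    # U+0377 and U+037A are assigned, so the two code points between them
    # are covered by one ellipsis line:
    # <U0378>..<U0379> /xcd/xb8     <Unassigned>
    print('\n'.join(gap_lines(0x0377, 0x037A)))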

Comments

Florian Weimer via Libc-alpha April 29, 2021, 2:07 p.m. | #1
* Carlos O'Donell:

>  def convert_to_hex(code_point):
>      '''Converts a code point to a hexadecimal UTF-8 representation
> -    like /x**/x**/x**.'''
> -    # Getting UTF8 of Unicode characters.
> -    # In Python3, .encode('UTF-8') does not work for
> -    # surrogates. Therefore, we use this conversion table
> -    surrogates = {
> -        0xD800: '/xed/xa0/x80',
> -        0xDB7F: '/xed/xad/xbf',
> -        0xDB80: '/xed/xae/x80',
> -        0xDBFF: '/xed/xaf/xbf',
> -        0xDC00: '/xed/xb0/x80',
> -        0xDFFF: '/xed/xbf/xbf',
> -    }
> -    if code_point in surrogates:
> -        return surrogates[code_point]
> -    return ''.join([
> -        '/x{:02x}'.format(c) for c in chr(code_point).encode('UTF-8')
> -    ])
> +    ready for use in a locale character map specification e.g.
> +    /xc2/xaf for MACRON.
> +
> +    '''
> +    cp_locale = ''
> +    cp_bytes = chr(code_point).encode('UTF-8', 'surrogatepass')
> +    for byte in cp_bytes:
> +       cp_locale += ''.join('/x{:02x}'.format(byte))
> +    return cp_locale

I think you should keep the list comprehension.  That ''.join() is
unnecessary.

Thanks,
Florian
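
For reference, a quick standalone illustration of the review point
above (not from the thread): ''.join() applied to an already-formatted
string returns it unchanged, so the loop can collapse into a single
list comprehension.

    # Standalone check; not part of the patch or the thread.
    s = '/x{:02x}'.format(0xc2)
    assert ''.join(s) == s          # joining a str's characters is a no-op

    # The loop-and-concatenate form and the single list comprehension give
    # the same result for U+00AF MACRON.
    loop_form = ''
    for byte in chr(0x00AF).encode('UTF-8', 'surrogatepass'):
        loop_form += ''.join('/x{:02x}'.format(byte))
    comprehension = ''.join(['/x{:02x}'.format(c)
                             for c in chr(0x00AF).encode('UTF-8', 'surrogatepass')])
    assert loop_form == comprehension == '/xc2/xaf'
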
Florian Weimer via Libc-alpha April 29, 2021, 9:02 p.m. | #2
On 4/29/21 10:07 AM, Florian Weimer wrote:
> * Carlos O'Donell:
>
>>  def convert_to_hex(code_point):
>>      '''Converts a code point to a hexadecimal UTF-8 representation
>> -    like /x**/x**/x**.'''
>> -    # Getting UTF8 of Unicode characters.
>> -    # In Python3, .encode('UTF-8') does not work for
>> -    # surrogates. Therefore, we use this conversion table
>> -    surrogates = {
>> -        0xD800: '/xed/xa0/x80',
>> -        0xDB7F: '/xed/xad/xbf',
>> -        0xDB80: '/xed/xae/x80',
>> -        0xDBFF: '/xed/xaf/xbf',
>> -        0xDC00: '/xed/xb0/x80',
>> -        0xDFFF: '/xed/xbf/xbf',
>> -    }
>> -    if code_point in surrogates:
>> -        return surrogates[code_point]
>> -    return ''.join([
>> -        '/x{:02x}'.format(c) for c in chr(code_point).encode('UTF-8')
>> -    ])
>> +    ready for use in a locale character map specification e.g.
>> +    /xc2/xaf for MACRON.
>> +
>> +    '''
>> +    cp_locale = ''
>> +    cp_bytes = chr(code_point).encode('UTF-8', 'surrogatepass')
>> +    for byte in cp_bytes:
>> +       cp_locale += ''.join('/x{:02x}'.format(byte))
>> +    return cp_locale
>
> I think you should keep the list comprehension.  That ''.join() is
> unnecessary.

Like this?

    return ''.join(['/x{:02x}'.format(c) \
        for c in chr(code_point).encode('UTF-8', 'surrogatepass')])

(tested works fine and produces the same results)

-- 
Cheers,
Carlos.
Florian Weimer via Libc-alpha April 30, 2021, 4:18 a.m. | #3
* Carlos O'Donell:

> On 4/29/21 10:07 AM, Florian Weimer wrote:
>> * Carlos O'Donell:
>>
>>>  def convert_to_hex(code_point):
>>>      '''Converts a code point to a hexadecimal UTF-8 representation
>>> -    like /x**/x**/x**.'''
>>> -    # Getting UTF8 of Unicode characters.
>>> -    # In Python3, .encode('UTF-8') does not work for
>>> -    # surrogates. Therefore, we use this conversion table
>>> -    surrogates = {
>>> -        0xD800: '/xed/xa0/x80',
>>> -        0xDB7F: '/xed/xad/xbf',
>>> -        0xDB80: '/xed/xae/x80',
>>> -        0xDBFF: '/xed/xaf/xbf',
>>> -        0xDC00: '/xed/xb0/x80',
>>> -        0xDFFF: '/xed/xbf/xbf',
>>> -    }
>>> -    if code_point in surrogates:
>>> -        return surrogates[code_point]
>>> -    return ''.join([
>>> -        '/x{:02x}'.format(c) for c in chr(code_point).encode('UTF-8')
>>> -    ])
>>> +    ready for use in a locale character map specification e.g.
>>> +    /xc2/xaf for MACRON.
>>> +
>>> +    '''
>>> +    cp_locale = ''
>>> +    cp_bytes = chr(code_point).encode('UTF-8', 'surrogatepass')
>>> +    for byte in cp_bytes:
>>> +       cp_locale += ''.join('/x{:02x}'.format(byte))
>>> +    return cp_locale
>>
>> I think you should keep the list comprehension.  That ''.join() is
>> unnecessary.
>
> Like this?
>
>     return ''.join(['/x{:02x}'.format(c) \
>         for c in chr(code_point).encode('UTF-8', 'surrogatepass')])
>
> (tested works fine and produces the same results)

Yes, exactly.  Thanks.  The patch should be fine with this.

Florian
Florian Weimer via Libc-alpha May 2, 2021, 7:18 p.m. | #4
On 4/30/21 12:18 AM, Florian Weimer wrote:
> * Carlos O'Donell:
>
>> On 4/29/21 10:07 AM, Florian Weimer wrote:
>>> * Carlos O'Donell:
>>>
>>>>  def convert_to_hex(code_point):
>>>>      '''Converts a code point to a hexadecimal UTF-8 representation
>>>> -    like /x**/x**/x**.'''
>>>> -    # Getting UTF8 of Unicode characters.
>>>> -    # In Python3, .encode('UTF-8') does not work for
>>>> -    # surrogates. Therefore, we use this conversion table
>>>> -    surrogates = {
>>>> -        0xD800: '/xed/xa0/x80',
>>>> -        0xDB7F: '/xed/xad/xbf',
>>>> -        0xDB80: '/xed/xae/x80',
>>>> -        0xDBFF: '/xed/xaf/xbf',
>>>> -        0xDC00: '/xed/xb0/x80',
>>>> -        0xDFFF: '/xed/xbf/xbf',
>>>> -    }
>>>> -    if code_point in surrogates:
>>>> -        return surrogates[code_point]
>>>> -    return ''.join([
>>>> -        '/x{:02x}'.format(c) for c in chr(code_point).encode('UTF-8')
>>>> -    ])
>>>> +    ready for use in a locale character map specification e.g.
>>>> +    /xc2/xaf for MACRON.
>>>> +
>>>> +    '''
>>>> +    cp_locale = ''
>>>> +    cp_bytes = chr(code_point).encode('UTF-8', 'surrogatepass')
>>>> +    for byte in cp_bytes:
>>>> +       cp_locale += ''.join('/x{:02x}'.format(byte))
>>>> +    return cp_locale
>>>
>>> I think you should keep the list comprehension.  That ''.join() is
>>> unnecessary.
>>
>> Like this?
>>
>>     return ''.join(['/x{:02x}'.format(c) \
>>         for c in chr(code_point).encode('UTF-8', 'surrogatepass')])
>>
>> (tested works fine and produces the same results)
>
> Yes, exactly.  Thanks.  The patch should be fine with this.

Fixed. This will be part of the v5 repost.

-- 
Cheers,
Carlos.
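
As background on the 'surrogatepass' change settled above, a short
standalone check (illustrative only, not part of the patch): the
default UTF-8 encoder rejects lone surrogates, while the
'surrogatepass' error handler emits the byte sequences the removed
lookup table used to hard-code.

    # Standalone check of the 'surrogatepass' behaviour; not part of the patch.
    try:
        chr(0xD800).encode('UTF-8')     # default handler rejects lone surrogates
    except UnicodeEncodeError as err:
        print('rejected:', err.reason)  # "surrogates not allowed"

    # With 'surrogatepass' the encoder produces the same bytes the old
    # hard-coded table listed for U+D800, i.e. /xed/xa0/x80.
    print(''.join(['/x{:02x}'.format(c)
                   for c in chr(0xD800).encode('UTF-8', 'surrogatepass')]))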

Patch

diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
index 899840923a..56a680bc06 100755
--- a/localedata/unicode-gen/utf8_gen.py
+++ b/localedata/unicode-gen/utf8_gen.py
@@ -81,25 +81,46 @@  def process_range(start, end, outfile, name):
     # 3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
     # 4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
     #
-    # The glibc UTF-8 file splits ranges like these into shorter
+    # The old glibc UTF-8 file splits ranges like these into shorter
     # ranges of 64 code points each:
     #
     # <U3400>..<U343F>     /xe3/x90/x80         <CJK Ideograph Extension A>
     # …
     # <U4D80>..<U4DB5>     /xe4/xb6/x80         <CJK Ideograph Extension A>
-    for i in range(int(start, 16), int(end, 16), 64 ):
-        if i > (int(end, 16)-64):
-            outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
-                    unicode_utils.ucs_symbol(i),
-                    unicode_utils.ucs_symbol(int(end,16)),
-                    convert_to_hex(i),
-                    name))
-            break
-        outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
-                unicode_utils.ucs_symbol(i),
-                unicode_utils.ucs_symbol(i+63),
-                convert_to_hex(i),
-                name))
+    #
+    # We do not split the ranges like this. It is not required. The
+    # ellipsis processing in ld-collate.c can handle any sized ranges.
+    outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
+                  unicode_utils.ucs_symbol(int (start, 16)),
+                  unicode_utils.ucs_symbol(int (end, 16)),
+                  convert_to_hex (int (start, 16)),
+                  name))
+
+def process_gap (start, end, outfile):
+    '''This function processes a gap and fills it if needed.  The value
+       of start is the last value output, and the value of end is the
+       next value which may be output.  Therefore if there is a gap
+       between the two then it is filled with an ellipsis or a single
+       symbol.
+
+    '''
+    # If start and end are more than 1 away then we have a gap, and
+    # that needs filling to provide proper code-point collation
+    # support.
+    cp_prev = int(start, 16)
+    cp_next = int(end, 16)
+
+    # Special case of just one symbol missing?
+    if cp_next - 1 == cp_prev + 1:
+        outfile.write('{:<11s} {:<12s} {:s}\n'.format(
+                      unicode_utils.ucs_symbol(cp_prev + 1),
+                      convert_to_hex(cp_prev + 1),
+                      '<Unassigned>'))
+    elif cp_next > cp_prev + 1:
+        # More than one symbol, so use an ellipsis.
+        process_range ('{:x}'.format(cp_prev + 1),
+                       '{:x}'.format(cp_next - 1),
+                       outfile, '<Unassigned>')
 
 def process_charmap(flines, outfile):
     '''This function takes an array which contains *all* lines of
@@ -129,63 +150,81 @@  def process_charmap(flines, outfile):
     %<UDB7F>     /xed/xad/xbf <Non Private Use High Surrogate, Last>
     <U0010FFC0>..<U0010FFFD>     /xf4/x8f/xbf/x80 <Plane 16 Private Use>
 
+    The old glibc UTF-8 charmap left the surrogates commented out.
+    Surrogates are not Unicode scalar values, and are ill-formed code
+    sequences. We continue to comment them out in the character map to
+    ensure no locale accidentally uses these values. The use of
+    surrogate symbols will be treated as if they were UNDEFINED. The
+    converters will handle them as ill-formed code sequences and either
+    raise an error or transform them to REPLACEMENT CHARACTER.
     '''
     fields_start = []
+    fields_end = []
     for line in flines:
         fields = line.split(";")
-         # Some characters have “<control>” as their name. We try to
-         # use the “Unicode 1.0 Name” (10th field in
-         # UnicodeData.txt) for them.
-         #
-         # The Characters U+0080, U+0081, U+0084 and U+0099 have
-         # “<control>” as their name but do not even have aa
-         # ”Unicode 1.0 Name”. We could write code to take their
-         # alternate names from NameAliases.txt.
+        # Some characters have "<control>" as their name. We try to
+        # use the "Unicode 1.0 Name" (10th field in
+        # UnicodeData.txt) for them.
+        #
+        # The Characters U+0080, U+0081, U+0084 and U+0099 have
+        # "<control>" as their name but do not even have a
+        # "Unicode 1.0 Name". We could write code to take their
+        # alternate names from NameAliases.txt.
         if fields[1] == "<control>" and fields[10]:
             fields[1] = fields[10]
         # Handling code point ranges like:
         #
         # 3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
         # 4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
-        if fields[1].endswith(', First>') and not 'Surrogate,' in fields[1]:
+        if fields[1].endswith(', First>'):
             fields_start = fields
             continue
-        if fields[1].endswith(', Last>') and not 'Surrogate,' in fields[1]:
+        if fields[1].endswith(', Last>'):
+            # 1. Process the gap.
+            # First process the gap between the last entry and the
+            # newly started range.
+            process_gap (fields_end[0], fields_start[0], outfile)
+            # 2. Exclude surrogate ranges.
+            # Comment out the surrogates in the UTF-8 file.
+            # One could of course skip them completely but
+            # the original UTF-8 file in glibc had them as
+            # comments, so we keep these comment lines.
+            if 'Surrogate,' in fields[1]:
+                outfile.write('%')
+            # 3. Process the range.
             process_range(fields_start[0], fields[0],
                           outfile, fields[1][:-7]+'>')
             fields_start = []
+            fields_end = fields
             continue
         fields_start = []
-        if 'Surrogate,' in fields[1]:
-            # Comment out the surrogates in the UTF-8 file.
-            # One could of course skip them completely but
-            # the original UTF-8 file in glibc had them as
-            # comments, so we keep these comment lines.
-            outfile.write('%')
+
+        if len (fields_end) > 0:
+            process_gap (fields_end[0], fields[0], outfile)
+
         outfile.write('{:<11s} {:<12s} {:s}\n'.format(
                 unicode_utils.ucs_symbol(int(fields[0], 16)),
                 convert_to_hex(int(fields[0], 16)),
                 fields[1]))
 
+        fields_end = fields
+    # We may need to output a final set of symbols if we are not yet at
+    # U+10FFFF, so check that last gap.  We use U+110000 as the
+    # hypothetical next entry.  In practice UTF-8 ends at U+10FFFD and
+    # so indeed we have 2 missing symbols at the end.
+    process_gap (fields_end[0], '110000', outfile)
+
 def convert_to_hex(code_point):
     '''Converts a code point to a hexadecimal UTF-8 representation
-    like /x**/x**/x**.'''
-    # Getting UTF8 of Unicode characters.
-    # In Python3, .encode('UTF-8') does not work for
-    # surrogates. Therefore, we use this conversion table
-    surrogates = {
-        0xD800: '/xed/xa0/x80',
-        0xDB7F: '/xed/xad/xbf',
-        0xDB80: '/xed/xae/x80',
-        0xDBFF: '/xed/xaf/xbf',
-        0xDC00: '/xed/xb0/x80',
-        0xDFFF: '/xed/xbf/xbf',
-    }
-    if code_point in surrogates:
-        return surrogates[code_point]
-    return ''.join([
-        '/x{:02x}'.format(c) for c in chr(code_point).encode('UTF-8')
-    ])
+    ready for use in a locale character map specification e.g.
+    /xc2/xaf for MACRON.
+
+    '''
+    cp_locale = ''
+    cp_bytes = chr(code_point).encode('UTF-8', 'surrogatepass')
+    for byte in cp_bytes:
+       cp_locale += ''.join('/x{:02x}'.format(byte))
+    return cp_locale
 
 def write_header_charmap(outfile):
     '''Write the header on top of the CHARMAP section to the output file'''