[4/4] gdb/python: handle non utf-8 characters when source highlighting

Message ID 825abc2257c992be90af28973c54f98e7cf4371f.1641565040.git.aburgess@redhat.com
State New
Headers show
Series
  • Source highlight non utf-8 characters using Python
Related show

Commit Message

Simon Marchi via Gdb-patches Jan. 7, 2022, 2:23 p.m.
This commit adds support for source files that contain non utf-8
characters when performing source styling using the Python pygments
package.  This does not change the behaviour of GDB when the GNU
Source Highlight library is used.

For the following problem description, assume that either GDB is built
without GNU Source Highlight support, of that this has been disabled
using 'maintenance set gnu-source-highlight enabled off'.

The initial problem reported was that a source file containing non
utf-8 characters would cause GDB to print a Python exception, and then
display the source without styling, e.g.:

  Python Exception <class 'UnicodeDecodeError'>: 'utf-8' codec can't decode byte 0xc0 in position 142: invalid start byte
  /* Source code here, without styling...  */

Further, as the user steps through different source files, each time
the problematic source file was evicted from the source cache, and
then later reloaded, the exception would be printed again.

Finally, this problem is only present when using Python 3, this issue
is not present for Python 2.

What makes this especially frustrating is that GDB can clearly print
the source file contents, they're right there...  If we disable
styling completely, or make use of the GNU Source Highlight library,
then everything is fine.  So why is there an error when we try to
apply styling using Python?

The problem is the use of PyString_FromString (which is an alias for
PyUnicode_FromString in Python 3), this function converts a C string
into a either a Unicode object (Py3) or a str object (Py2).  For
Python 2 there is no unicode encoding performed during this function
call, but for Python 3 the input is assumed to be a uft-8 encoding
string for the purpose of the conversion.  And here of course, is the
problem, if the source file contains non utf-8 characters, then it
should not be treated as utf-8, but that's what we do, and that's why
we get an error.

My first thought when looking at this was to spot when the
PyString_FromString call failed with a UnicodeDecodeError and silently
ignore the error.  This would mean that GDB would print the source
without styling, but would also avoid the annoying exception message.

However, I also make use of `pygmentize`, a command line wrapper
around the Python pygments module, which I use to apply syntax
highlighting in the output of `less`.  And this command line wrapper
is quite happy to syntax highlight my source file that contains non
utf-8 characters, so it feels like the problem should be solvable.

It turns out that inside the pygments module there is already support
for guessing the encoding of the incoming file content, if the
incoming content is not already a Unicode string.  This is what
happens for Python 2 where the incoming content is of `str` type.

We could try and make GDB smarter when it comes to converting C
strings into Python Unicode objects; this would probably require us to
just try a couple of different encoding schemes rather than just
giving up after utf-8.

However, I figure, why bother?  The pygments module already does this
for us, and the colorize API is not part of the documented external
API of GDB.  So, why not just change the colorize API, instead of the
content being a Unicode string (for Python 3), lets just make the
content be a bytes object.  The pygments module can then take
responsibility for guessing the encoding.

So, currently, the colorize API receives a utf-8 unicode object, and
returns a host charset unicode object.  I propose that the colorize
API receive a bytes object, and return a bytes object.

This works perfectly, and I now see styled output for my source file
containing non utf-8 characters.

Well, the above is a little bit of a lie.  The actual code I've
written allows the colorize API to return either a bytes object, or a
host charset unicode object.  This just makes things a little more
flexible in the Python code.

Then I added a DejaGNU test, and .... my test failed.

I wasn't seeing styled source lines when the test was run through
DejaGNU, but I could when I ran the same steps at the console.

The test I was using is the exact one that is included with this
commit, you'll notice I did remember to set the TERM environment
variable to 'ansi' as is done in gdb.base/style.exp, which should be
all that's needed to see the styled output (I thought).  The problem
then, must be something in GDB.

It turns out that its the exact same problem that I solved above, but
this time, on the way out of the colorize handler, rather than on the
way in.  Only, interestingly, the problem now effects both Python 2
and Python 3.

The output from pygments is always a Unicode object for both Python 2
and 3.  Once we return from Python code back to gdbpy_colorize in
python.c, we do the following steps:

  if (!gdbpy_is_string (result.get ()))
    return {};

  gdbpy_ref<> unic = python_string_to_unicode (result.get ());
  if (unic == nullpt) return {};

  gdbpy_ref<> host_str (PyUnicode_AsEncodedString (unic.get (),
						   host_charset (),
						   nullptr));
  if (host_str == nullpt) return {};

For both Python 2 and 3 the gdbpy_is_string check passes.

For both Python 2 and 3 the python_string_to_unicode call just returns
its input, as the string will already by a Unicode object.

Now we know we have a Unicode object we convert the Unicode object
into a bytes object using the host_charset.

After that (not shown above) we convert the bytes object into a C++
std::string, and return this as the result.

The problem is that the host_charset is based on the capabilities of
the output terminal, and, when the tests are run through DejaGNU, with
the current environment, the host_charset becomes 'ANSI_X3.4-1968',
which ends up trying to convert the string using the 'ascii' codec.
This codec only supports 7-bit characters, and so we once again get a
Python exception.

Now I can force the host_charset value to change, if I set LC_ALL to
'en_US.utf-8' (like we currently set TERM) then now the host_charset
will be utf-8, and the test passes.  Maybe that's OK, but I notice
that the gdb.base/style.exp test doesn't need to mess with LC_ALL, and
it manages to style other parts of GDB's output just fine.  So, I want
that please.

The problem then is how GDB tries to convert the Unicode object into a
bytes object.  When I run interactively in a terminal the host_charset
is utf-8, and that's great, we convert the output just right for
display in my terminal.  But in the test situation we need to do
better.

So, I wonder if we should consider using something other than nullptr
for the last argument to PyUnicode_AsEncodedString.  I propose that we
make use of 'backslashreplace'.  This will replace any character that
is not encodable in host_charset() with an \xNN escape sequence.

This seems fine for the test case I have - the non utf-8 character is
within a C string in the source file, and replacing it with an escape
sequence seems OK.  This might be more questionable if the non
encodable character were in say, an identifier within the source code,
but at this point I think (a) we'll probably be hitting other problems
with the Python API and its bytes/unicode conversion, and (b) the user
really needs to get their host_charset() set correctly anyway.
---
 gdb/python/python.c                           | 56 ++++++++++++----
 gdb/testsuite/gdb.python/py-source-styling.c  | 29 +++++++++
 .../gdb.python/py-source-styling.exp          | 64 +++++++++++++++++++
 3 files changed, 135 insertions(+), 14 deletions(-)
 create mode 100644 gdb/testsuite/gdb.python/py-source-styling.c
 create mode 100644 gdb/testsuite/gdb.python/py-source-styling.exp

-- 
2.25.4

Comments

Simon Marchi via Gdb-patches Jan. 10, 2022, 3:27 a.m. | #1
You haven't linked it, but I found this page that you might have read:

  https://pygments.org/docs/unicode/

On 2022-01-07 09:23, Andrew Burgess via Gdb-patches wrote:
> This commit adds support for source files that contain non utf-8

> characters when performing source styling using the Python pygments

> package.  This does not change the behaviour of GDB when the GNU

> Source Highlight library is used.

> 

> For the following problem description, assume that either GDB is built

> without GNU Source Highlight support, of that this has been disabled

> using 'maintenance set gnu-source-highlight enabled off'.

> 

> The initial problem reported was that a source file containing non

> utf-8 characters would cause GDB to print a Python exception, and then

> display the source without styling, e.g.:

> 

>   Python Exception <class 'UnicodeDecodeError'>: 'utf-8' codec can't decode byte 0xc0 in position 142: invalid start byte

>   /* Source code here, without styling...  */

> 

> Further, as the user steps through different source files, each time

> the problematic source file was evicted from the source cache, and

> then later reloaded, the exception would be printed again.

> 

> Finally, this problem is only present when using Python 3, this issue

> is not present for Python 2.

> 

> What makes this especially frustrating is that GDB can clearly print

> the source file contents, they're right there...  If we disable

> styling completely, or make use of the GNU Source Highlight library,

> then everything is fine.  So why is there an error when we try to

> apply styling using Python?

> 

> The problem is the use of PyString_FromString (which is an alias for

> PyUnicode_FromString in Python 3), this function converts a C string

> into a either a Unicode object (Py3) or a str object (Py2).  For

> Python 2 there is no unicode encoding performed during this function


"there is not unicode encoding", I think it would make more sense to say
"there is not utf-8 decoding".

> call, but for Python 3 the input is assumed to be a uft-8 encoding


uft -> utf.

> string for the purpose of the conversion.  And here of course, is the

> problem, if the source file contains non utf-8 characters, then it

> should not be treated as utf-8, but that's what we do, and that's why

> we get an error.

> 

> My first thought when looking at this was to spot when the

> PyString_FromString call failed with a UnicodeDecodeError and silently

> ignore the error.  This would mean that GDB would print the source

> without styling, but would also avoid the annoying exception message.

> 

> However, I also make use of `pygmentize`, a command line wrapper

> around the Python pygments module, which I use to apply syntax

> highlighting in the output of `less`.  And this command line wrapper

> is quite happy to syntax highlight my source file that contains non

> utf-8 characters, so it feels like the problem should be solvable.

> 

> It turns out that inside the pygments module there is already support

> for guessing the encoding of the incoming file content, if the

> incoming content is not already a Unicode string.  This is what

> happens for Python 2 where the incoming content is of `str` type.

> 

> We could try and make GDB smarter when it comes to converting C

> strings into Python Unicode objects; this would probably require us to

> just try a couple of different encoding schemes rather than just

> giving up after utf-8.

> 

> However, I figure, why bother?  The pygments module already does this

> for us, and the colorize API is not part of the documented external

> API of GDB.  So, why not just change the colorize API, instead of the

> content being a Unicode string (for Python 3), lets just make the

> content be a bytes object.  The pygments module can then take

> responsibility for guessing the encoding.

> 

> So, currently, the colorize API receives a utf-8 unicode object, and

> returns a host charset unicode object.  I propose that the colorize

> API receive a bytes object, and return a bytes object.

> 

> This works perfectly, and I now see styled output for my source file

> containing non utf-8 characters.

> 

> Well, the above is a little bit of a lie.  The actual code I've

> written allows the colorize API to return either a bytes object, or a

> host charset unicode object.  This just makes things a little more

> flexible in the Python code.

> 

> Then I added a DejaGNU test, and .... my test failed.

> 

> I wasn't seeing styled source lines when the test was run through

> DejaGNU, but I could when I ran the same steps at the console.

> 

> The test I was using is the exact one that is included with this

> commit, you'll notice I did remember to set the TERM environment

> variable to 'ansi' as is done in gdb.base/style.exp, which should be

> all that's needed to see the styled output (I thought).  The problem

> then, must be something in GDB.

> 

> It turns out that its the exact same problem that I solved above, but

> this time, on the way out of the colorize handler, rather than on the

> way in.  Only, interestingly, the problem now effects both Python 2

> and Python 3.

> 

> The output from pygments is always a Unicode object for both Python 2

> and 3.  Once we return from Python code back to gdbpy_colorize in

> python.c, we do the following steps:

> 

>   if (!gdbpy_is_string (result.get ()))

>     return {};

> 

>   gdbpy_ref<> unic = python_string_to_unicode (result.get ());

>   if (unic == nullpt) return {};

> 

>   gdbpy_ref<> host_str (PyUnicode_AsEncodedString (unic.get (),

> 						   host_charset (),

> 						   nullptr));

>   if (host_str == nullpt) return {};

> 

> For both Python 2 and 3 the gdbpy_is_string check passes.

> 

> For both Python 2 and 3 the python_string_to_unicode call just returns

> its input, as the string will already by a Unicode object.

> 

> Now we know we have a Unicode object we convert the Unicode object

> into a bytes object using the host_charset.

> 

> After that (not shown above) we convert the bytes object into a C++

> std::string, and return this as the result.

> 

> The problem is that the host_charset is based on the capabilities of

> the output terminal, and, when the tests are run through DejaGNU, with

> the current environment, the host_charset becomes 'ANSI_X3.4-1968',

> which ends up trying to convert the string using the 'ascii' codec.

> This codec only supports 7-bit characters, and so we once again get a

> Python exception.


Huh, that's not ideal.  It would seem better to me if the testsuite ran
with host-charset == utf-8.  That would be more representative of the
reality than testing with a random charset.

> Now I can force the host_charset value to change, if I set LC_ALL to

> 'en_US.utf-8' (like we currently set TERM) then now the host_charset

> will be utf-8, and the test passes.  Maybe that's OK, but I notice

> that the gdb.base/style.exp test doesn't need to mess with LC_ALL, and

> it manages to style other parts of GDB's output just fine.  So, I want

> that please.

> 

> The problem then is how GDB tries to convert the Unicode object into a

> bytes object.  When I run interactively in a terminal the host_charset

> is utf-8, and that's great, we convert the output just right for

> display in my terminal.  But in the test situation we need to do

> better.

> 

> So, I wonder if we should consider using something other than nullptr

> for the last argument to PyUnicode_AsEncodedString.  I propose that we

> make use of 'backslashreplace'.  This will replace any character that

> is not encodable in host_charset() with an \xNN escape sequence.

> 

> This seems fine for the test case I have - the non utf-8 character is

> within a C string in the source file, and replacing it with an escape

> sequence seems OK.  This might be more questionable if the non

> encodable character were in say, an identifier within the source code,

> but at this point I think (a) we'll probably be hitting other problems

> with the Python API and its bytes/unicode conversion, and (b) the user

> really needs to get their host_charset() set correctly anyway.


I must admit I did not follow everything (it's getting late here).  But
passing a "bytes" to pygments and receiving a coloured "bytes"
(hopefully encoded the same way as the original "bytes", only with ANSI
sequences added) sounds good.  Then the coloured string goes through the
GDB machine just as the uncoloured string would have gone, had Pygments
not been available.  It seems like GDB never really needs to decode a
source string, it just sends it to the terminal and the terminal deals
with it?  Eexcept maybe in the TUI?

I don't understand when the Python colorize function returns a unicode
string now, though.  If you always send a bytes, don't you always get
back a bytes?

Simon
Simon Marchi via Gdb-patches Jan. 10, 2022, 10:41 a.m. | #2
* Simon Marchi <simon.marchi@polymtl.ca> [2022-01-09 22:27:10 -0500]:

> You haven't linked it, but I found this page that you might have read:

> 

>   https://pygments.org/docs/unicode/

> 

> On 2022-01-07 09:23, Andrew Burgess via Gdb-patches wrote:

> > This commit adds support for source files that contain non utf-8

> > characters when performing source styling using the Python pygments

> > package.  This does not change the behaviour of GDB when the GNU

> > Source Highlight library is used.

> > 

> > For the following problem description, assume that either GDB is built

> > without GNU Source Highlight support, of that this has been disabled

> > using 'maintenance set gnu-source-highlight enabled off'.

> > 

> > The initial problem reported was that a source file containing non

> > utf-8 characters would cause GDB to print a Python exception, and then

> > display the source without styling, e.g.:

> > 

> >   Python Exception <class 'UnicodeDecodeError'>: 'utf-8' codec can't decode byte 0xc0 in position 142: invalid start byte

> >   /* Source code here, without styling...  */

> > 

> > Further, as the user steps through different source files, each time

> > the problematic source file was evicted from the source cache, and

> > then later reloaded, the exception would be printed again.

> > 

> > Finally, this problem is only present when using Python 3, this issue

> > is not present for Python 2.

> > 

> > What makes this especially frustrating is that GDB can clearly print

> > the source file contents, they're right there...  If we disable

> > styling completely, or make use of the GNU Source Highlight library,

> > then everything is fine.  So why is there an error when we try to

> > apply styling using Python?

> > 

> > The problem is the use of PyString_FromString (which is an alias for

> > PyUnicode_FromString in Python 3), this function converts a C string

> > into a either a Unicode object (Py3) or a str object (Py2).  For

> > Python 2 there is no unicode encoding performed during this function

> 

> "there is not unicode encoding", I think it would make more sense to say

> "there is not utf-8 decoding".

> 

> > call, but for Python 3 the input is assumed to be a uft-8 encoding

> 

> uft -> utf.

> 

> > string for the purpose of the conversion.  And here of course, is the

> > problem, if the source file contains non utf-8 characters, then it

> > should not be treated as utf-8, but that's what we do, and that's why

> > we get an error.

> > 

> > My first thought when looking at this was to spot when the

> > PyString_FromString call failed with a UnicodeDecodeError and silently

> > ignore the error.  This would mean that GDB would print the source

> > without styling, but would also avoid the annoying exception message.

> > 

> > However, I also make use of `pygmentize`, a command line wrapper

> > around the Python pygments module, which I use to apply syntax

> > highlighting in the output of `less`.  And this command line wrapper

> > is quite happy to syntax highlight my source file that contains non

> > utf-8 characters, so it feels like the problem should be solvable.

> > 

> > It turns out that inside the pygments module there is already support

> > for guessing the encoding of the incoming file content, if the

> > incoming content is not already a Unicode string.  This is what

> > happens for Python 2 where the incoming content is of `str` type.

> > 

> > We could try and make GDB smarter when it comes to converting C

> > strings into Python Unicode objects; this would probably require us to

> > just try a couple of different encoding schemes rather than just

> > giving up after utf-8.

> > 

> > However, I figure, why bother?  The pygments module already does this

> > for us, and the colorize API is not part of the documented external

> > API of GDB.  So, why not just change the colorize API, instead of the

> > content being a Unicode string (for Python 3), lets just make the

> > content be a bytes object.  The pygments module can then take

> > responsibility for guessing the encoding.

> > 

> > So, currently, the colorize API receives a utf-8 unicode object, and

> > returns a host charset unicode object.  I propose that the colorize

> > API receive a bytes object, and return a bytes object.

> > 

> > This works perfectly, and I now see styled output for my source file

> > containing non utf-8 characters.

> > 

> > Well, the above is a little bit of a lie.  The actual code I've

> > written allows the colorize API to return either a bytes object, or a

> > host charset unicode object.  This just makes things a little more

> > flexible in the Python code.

> > 

> > Then I added a DejaGNU test, and .... my test failed.

> > 

> > I wasn't seeing styled source lines when the test was run through

> > DejaGNU, but I could when I ran the same steps at the console.

> > 

> > The test I was using is the exact one that is included with this

> > commit, you'll notice I did remember to set the TERM environment

> > variable to 'ansi' as is done in gdb.base/style.exp, which should be

> > all that's needed to see the styled output (I thought).  The problem

> > then, must be something in GDB.

> > 

> > It turns out that its the exact same problem that I solved above, but

> > this time, on the way out of the colorize handler, rather than on the

> > way in.  Only, interestingly, the problem now effects both Python 2

> > and Python 3.

> > 

> > The output from pygments is always a Unicode object for both Python 2

> > and 3.  Once we return from Python code back to gdbpy_colorize in

> > python.c, we do the following steps:

> > 

> >   if (!gdbpy_is_string (result.get ()))

> >     return {};

> > 

> >   gdbpy_ref<> unic = python_string_to_unicode (result.get ());

> >   if (unic == nullpt) return {};

> > 

> >   gdbpy_ref<> host_str (PyUnicode_AsEncodedString (unic.get (),

> > 						   host_charset (),

> > 						   nullptr));

> >   if (host_str == nullpt) return {};

> > 

> > For both Python 2 and 3 the gdbpy_is_string check passes.

> > 

> > For both Python 2 and 3 the python_string_to_unicode call just returns

> > its input, as the string will already by a Unicode object.

> > 

> > Now we know we have a Unicode object we convert the Unicode object

> > into a bytes object using the host_charset.

> > 

> > After that (not shown above) we convert the bytes object into a C++

> > std::string, and return this as the result.

> > 

> > The problem is that the host_charset is based on the capabilities of

> > the output terminal, and, when the tests are run through DejaGNU, with

> > the current environment, the host_charset becomes 'ANSI_X3.4-1968',

> > which ends up trying to convert the string using the 'ascii' codec.

> > This codec only supports 7-bit characters, and so we once again get a

> > Python exception.

> 

> Huh, that's not ideal.  It would seem better to me if the testsuite ran

> with host-charset == utf-8.  That would be more representative of the

> reality than testing with a random charset.

> 

> > Now I can force the host_charset value to change, if I set LC_ALL to

> > 'en_US.utf-8' (like we currently set TERM) then now the host_charset

> > will be utf-8, and the test passes.  Maybe that's OK, but I notice

> > that the gdb.base/style.exp test doesn't need to mess with LC_ALL, and

> > it manages to style other parts of GDB's output just fine.  So, I want

> > that please.

> > 

> > The problem then is how GDB tries to convert the Unicode object into a

> > bytes object.  When I run interactively in a terminal the host_charset

> > is utf-8, and that's great, we convert the output just right for

> > display in my terminal.  But in the test situation we need to do

> > better.

> > 

> > So, I wonder if we should consider using something other than nullptr

> > for the last argument to PyUnicode_AsEncodedString.  I propose that we

> > make use of 'backslashreplace'.  This will replace any character that

> > is not encodable in host_charset() with an \xNN escape sequence.

> > 

> > This seems fine for the test case I have - the non utf-8 character is

> > within a C string in the source file, and replacing it with an escape

> > sequence seems OK.  This might be more questionable if the non

> > encodable character were in say, an identifier within the source code,

> > but at this point I think (a) we'll probably be hitting other problems

> > with the Python API and its bytes/unicode conversion, and (b) the user

> > really needs to get their host_charset() set correctly anyway.

> 

> I must admit I did not follow everything (it's getting late here).  But

> passing a "bytes" to pygments and receiving a coloured "bytes"

> (hopefully encoded the same way as the original "bytes", only with ANSI

> sequences added) sounds good.  Then the coloured string goes through the

> GDB machine just as the uncoloured string would have gone, had Pygments

> not been available.  It seems like GDB never really needs to decode a

> source string, it just sends it to the terminal and the terminal deals

> with it?  Eexcept maybe in the TUI?

> 

> I don't understand when the Python colorize function returns a unicode

> string now, though.  If you always send a bytes, don't you always get

> back a bytes?


Unfortunately it's not as simple as bytes in bytes out.  See:

  https://pygments.org/docs/unicode/?highlight=encoding

In summary, Pygments uses unicode internally, but has some logic for
guessing the encoding of the incoming bytes.  This logic is better (I
claim) than GDB's hard-coded use UTF-8.  The link above outlines how
the guess is done in more detail.

Pygments always returns a unicode object, which is one of the reasons
I have GDB handle both bytes and unicode being returned from the
colorize API.  We could always make the API for restricted, and insist
on a bytes object being returned, this would just require us to
convert the output of Pygments to bytes before returning to GDB.

Hope that makes things a little clearer...

Thanks,
Andrew
Simon Marchi via Gdb-patches Jan. 10, 2022, 3:32 p.m. | #3
> Unfortunately it's not as simple as bytes in bytes out.  See:

> 

>   https://pygments.org/docs/unicode/?highlight=encoding

> 

> In summary, Pygments uses unicode internally, but has some logic for

> guessing the encoding of the incoming bytes.  This logic is better (I

> claim) than GDB's hard-coded use UTF-8.  The link above outlines how

> the guess is done in more detail.

> 

> Pygments always returns a unicode object, which is one of the reasons

> I have GDB handle both bytes and unicode being returned from the

> colorize API.  We could always make the API for restricted, and insist

> on a bytes object being returned, this would just require us to

> convert the output of Pygments to bytes before returning to GDB.


Ok, so when does "colorize" returns bytes?

Simon
Simon Marchi via Gdb-patches Jan. 11, 2022, 1:10 p.m. | #4
* Simon Marchi via Gdb-patches <gdb-patches@sourceware.org> [2022-01-10 10:32:02 -0500]:

> > Unfortunately it's not as simple as bytes in bytes out.  See:

> > 

> >   https://pygments.org/docs/unicode/?highlight=encoding

> > 

> > In summary, Pygments uses unicode internally, but has some logic for

> > guessing the encoding of the incoming bytes.  This logic is better (I

> > claim) than GDB's hard-coded use UTF-8.  The link above outlines how

> > the guess is done in more detail.

> > 

> > Pygments always returns a unicode object, which is one of the reasons

> > I have GDB handle both bytes and unicode being returned from the

> > colorize API.  We could always make the API for restricted, and insist

> > on a bytes object being returned, this would just require us to

> > convert the output of Pygments to bytes before returning to GDB.

> 

> Ok, so when does "colorize" returns bytes?


  (1) Python 2 (for now), and
  (2) Never, unless a user overrides gdb.colorize.

Thanks,
Andrew
Tom Tromey Jan. 11, 2022, 7:24 p.m. | #5
>>>>> "Andrew" == Andrew Burgess via Gdb-patches <gdb-patches@sourceware.org> writes:


Andrew> We could try and make GDB smarter when it comes to converting C
Andrew> strings into Python Unicode objects; this would probably require us to
Andrew> just try a couple of different encoding schemes rather than just
Andrew> giving up after utf-8.

Perhaps it should be using the host charset here.

Anyway, FWIW, I think this patch looks reasonable.

Tom
Simon Marchi via Gdb-patches Jan. 11, 2022, 7:42 p.m. | #6
On 1/11/22 20:24, Tom Tromey wrote:
>>>>>> "Andrew" == Andrew Burgess via Gdb-patches <gdb-patches@sourceware.org> writes:

> Andrew> We could try and make GDB smarter when it comes to converting C

> Andrew> strings into Python Unicode objects; this would probably require us to

> Andrew> just try a couple of different encoding schemes rather than just

> Andrew> giving up after utf-8.

>

> Perhaps it should be using the host charset here.

>

> Anyway, FWIW, I think this patch looks reasonable.

>

I did not follow all the discussion, but did you consider using 
surrogate escapes 
(https://docs.python.org/3/library/codecs.html#error-handlers) ?

I used that in RabbitCVS with quite good results.

Just my 2 cents,

Patrick

Patch

diff --git a/gdb/python/python.c b/gdb/python/python.c
index e05b99c0bec..76871c86eca 100644
--- a/gdb/python/python.c
+++ b/gdb/python/python.c
@@ -1147,7 +1147,25 @@  gdbpy_colorize (const std::string &filename, const std::string &contents)
       gdbpy_print_stack ();
       return {};
     }
-  gdbpy_ref<> contents_arg (PyString_FromString (contents.c_str ()));
+
+  /* The pygments library, which is what we currently use for applying
+     styling, is happy to take input as a bytes object, and to figure out
+     the encoding for itself.  This removes the need for us to figure out
+     (guess?) at how the content is encoded, which is probably a good
+     thing.
+
+     In an ideal world we should be able to use the host_charset here, but
+     the host_charset is most likely to represent the capabilities of the
+     users terminal, rather than the encoding of a particular source file.
+
+     Another possibility would be to use Python's ability to escape
+     unknown characters when converting the bytes into unicode.  However,
+     by passing the raw file bytes to Python we can differ this choice.
+     This gives us (or potentially a user) the option of taking an approach
+     that works for them, rather than hard coding a single fixed choice
+     into GDB.  */
+  gdbpy_ref<> contents_arg (PyBytes_FromStringAndSize (contents.c_str (),
+						       contents.size ()));
   if (contents_arg == nullptr)
     {
       gdbpy_print_stack ();
@@ -1164,25 +1182,35 @@  gdbpy_colorize (const std::string &filename, const std::string &contents)
       return {};
     }
 
-  if (!gdbpy_is_string (result.get ()))
-    return {};
 
-  gdbpy_ref<> unic = python_string_to_unicode (result.get ());
-  if (unic == nullptr)
+  if (PyBytes_Check (result.get ()))
     {
-      gdbpy_print_stack ();
-      return {};
+      /* The Python code has handled us a stream of bytes, so there's no
+	 need to convert this to unicode and back - the only outcome of
+	 which is that we might run into encoding errors if the bytes don't
+	 represent a string in host_charset.  */
+      return std::string (PyBytes_AsString (result.get ()));
     }
-  gdbpy_ref<> host_str (PyUnicode_AsEncodedString (unic.get (),
-						   host_charset (),
-						   nullptr));
-  if (host_str == nullptr)
+  else if (PyUnicode_Check (result.get ()))
     {
-      gdbpy_print_stack ();
-      return {};
+      /* Convert to bytes assuming the host_charset.  Any characters that
+	 can't be converted are replaced with \xNN escape sequences.  We
+	 see this case crop up in the gdb.python/py-source-styling.exp test
+	 when writing the output to the log file.  The host_charset is
+	 ASCII, and the unicode string from Pygments can't be encoded.  */
+      gdbpy_ref<> bytes (PyUnicode_AsEncodedString (result.get (),
+						    host_charset (),
+						    "backslashreplace"));
+      if (bytes == nullptr)
+	{
+	  gdbpy_print_stack ();
+	  return {};
+	}
+
+      return std::string (PyBytes_AsString (bytes.get ()));
     }
 
-  return std::string (PyBytes_AsString (host_str.get ()));
+  return {};
 }
 
 
diff --git a/gdb/testsuite/gdb.python/py-source-styling.c b/gdb/testsuite/gdb.python/py-source-styling.c
new file mode 100644
index 00000000000..6a1a9d59a8c
--- /dev/null
+++ b/gdb/testsuite/gdb.python/py-source-styling.c
@@ -0,0 +1,29 @@ 
+/* This testcase is part of GDB, the GNU debugger.
+
+   Copyright 2022 Free Software Foundation, Inc.
+
+   This program is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
+
+int
+main ()
+{
+  int some_variable = 1234;
+
+  /* The following line contains a character that is non-utf-8.  This is a
+     critical part of the test as Python 3 can't convert this into a string
+     using its default mechanism.  */
+  char c = '�';		/* List this line.  */
+
+  return 0;
+}
diff --git a/gdb/testsuite/gdb.python/py-source-styling.exp b/gdb/testsuite/gdb.python/py-source-styling.exp
new file mode 100644
index 00000000000..68bbc9f9758
--- /dev/null
+++ b/gdb/testsuite/gdb.python/py-source-styling.exp
@@ -0,0 +1,64 @@ 
+# Copyright (C) 2022 Free Software Foundation, Inc.
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+# This file is part of the GDB testsuite.  It checks for memory leaks
+# associated with allocating and deallocation gdb.Inferior objects.
+
+load_lib gdb-python.exp
+
+standard_testfile
+
+save_vars { env(TERM) } {
+    # We need an ANSI-capable terminal to get the output, additionally
+    # we need to set LC_ALL so GDB knows the terminal is UTF-8
+    # capable, otherwise we'll get a UnicodeEncodeError trying to
+    # encode the output.
+    setenv TERM ansi
+
+    if { [prepare_for_testing "failed to prepare" ${testfile} ${srcfile}] } {
+	return -1
+    }
+
+    if { [skip_python_tests] } { continue }
+
+    if { ![gdb_py_module_available "pygments"] } {
+	unsupported "pygments module not available"
+	return -1
+    }
+
+    if ![runto_main] {
+	return
+    }
+
+    gdb_test_no_output "maint set gnu-source-highlight enabled off"
+
+    gdb_test "maint flush source-cache" "Source cache flushed\\."
+
+    set seen_style_escape false
+    set line_number [gdb_get_line_number "List this line."]
+    gdb_test_multiple "list ${line_number}" "" {
+	-re "Python Exception.*" {
+	    fail $gdb_test_name
+	}
+	-re "\033" {
+	    set seen_style_escape true
+	    exp_continue
+	}
+	-re "$gdb_prompt $" {
+	    gdb_assert { $seen_style_escape }
+	    pass $gdb_test_name
+	}
+    }
+}