svfwscanf: Simplify _sungetwc_r to eliminate apparent buffer overflow

Message ID 874kbnevjm.fsf@keithp.com
State New
Headers show
Series
  • svfwscanf: Simplify _sungetwc_r to eliminate apparent buffer overflow
Related show

Commit Message

Keith Packard Aug. 17, 2021, 7:11 p.m.
svfwscanf replaces getwc and ungetwc_r. The comments in the code talk
about avoiding file operations, but they also need to bypass the
mbtowc calls as svfwscanf operates on wchar_t, not multibyte data,
which is a more important reason here; they would not work correctly
otherwise.

The ungetwc replacement has code which uses the 3 byte FILE _ubuf
field, but if wchar_t is 32-bits, this field is not large enough to
hold even one wchar_t value. Building in this mode generates warnings
about array overflow:

	In file included from ../../newlib/libc/stdio/svfiwscanf.c:35:
	../../newlib/libc/stdio/vfwscanf.c: In function '_sungetwc_r.isra':
	../../newlib/libc/stdio/vfwscanf.c:316:12: warning: array subscript 4294967295 is above array bounds of 'unsigned char[3]' [-Warray-bounds]
	  316 |   fp->_p = &fp->_ubuf[sizeof (fp->_ubuf) - sizeof (wchar_t)];
	      |            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
	In file included from ../../newlib/libc/stdio/stdio.h:46,
			 from ../../newlib/libc/stdio/vfwscanf.c:82,
			 from ../../newlib/libc/stdio/svfiwscanf.c:35:
	../../newlib/libc/include/sys/reent.h:216:17: note: while referencing '_ubuf'
	  216 |   unsigned char _ubuf[3]; /* guarantee an ungetc() buffer */
	      |                 ^~~~~

However, the vfwscanf code *never* ungets data before the start of the
scanning operation, and *always* ungets data which matches the input
at that point, so the code always hits the block which backs up over
the input data and never hits the block which uses the _ubuf field.

In addition, the svfwscanf code will always start with the unget
buffer empty, so the ungetwc replacement never needs to support an
unget buffer at all.

Simplify the code by removing support for everything other than
backing up over the input data, leaving the check to make sure it
doesn't get underflowed in case the vfscanf code has a bug in it.

Signed-off-by: Keith Packard <keithp@keithp.com>
-- 
-keith

Comments

Corinna Vinschen Aug. 18, 2021, 8:59 a.m. | #1
Hi Keith,

On Aug 17 12:11, Keith Packard wrote:
> svfwscanf replaces getwc and ungetwc_r. The comments in the code talk

> about avoiding file operations, but they also need to bypass the

> mbtowc calls as svfwscanf operates on wchar_t, not multibyte data,

> which is a more important reason here; they would not work correctly

> otherwise.

> 

> The ungetwc replacement has code which uses the 3 byte FILE _ubuf

> field, but if wchar_t is 32-bits, this field is not large enough to

> hold even one wchar_t value. Building in this mode generates warnings

> about array overflow:

> 

> 	In file included from ../../newlib/libc/stdio/svfiwscanf.c:35:

> 	../../newlib/libc/stdio/vfwscanf.c: In function '_sungetwc_r.isra':

> 	../../newlib/libc/stdio/vfwscanf.c:316:12: warning: array subscript 4294967295 is above array bounds of 'unsigned char[3]' [-Warray-bounds]

> 	  316 |   fp->_p = &fp->_ubuf[sizeof (fp->_ubuf) - sizeof (wchar_t)];

> 	      |            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

> 	In file included from ../../newlib/libc/stdio/stdio.h:46,

> 			 from ../../newlib/libc/stdio/vfwscanf.c:82,

> 			 from ../../newlib/libc/stdio/svfiwscanf.c:35:

> 	../../newlib/libc/include/sys/reent.h:216:17: note: while referencing '_ubuf'

> 	  216 |   unsigned char _ubuf[3]; /* guarantee an ungetc() buffer */

> 	      |                 ^~~~~

> 

> However, the vfwscanf code *never* ungets data before the start of the

> scanning operation, and *always* ungets data which matches the input

> at that point, so the code always hits the block which backs up over

> the input data and never hits the block which uses the _ubuf field.


LGTM.  Under the unlikely assumption that wscanf gets extended in future
and has to ungetc a char different from the input char, how do we catch
that?  Do we need a hint, somehow, somewhere?


Corinna
Keith Packard Aug. 18, 2021, 4:30 p.m. | #2
Corinna Vinschen <vinschen@redhat.com> writes:

> LGTM.  Under the unlikely assumption that wscanf gets extended in future

> and has to ungetc a char different from the input char, how do we catch

> that?  Do we need a hint, somehow, somewhere?


I can't imagine a case where that wouldn't be a bug; the only reason I
know that ungetc takes the old char is because stdio might not still
have it in the buffer, and that can't happen for string sources.

However, I also don't know how we'd catch this bug at compile time, and
having the string source work differently in this case than the file
source means testing using string sources might not uncover a bug that
would appear with file sources.

A reasonable alternative would be to have the wchar_t version call the
char version multiple times, instead of just doing it inline; that would
avoid any semantic difference, which seems like a feature. I'll post a
patch that does it this way shortly and see which you prefer.

-- 
-keith
Keith Packard Aug. 18, 2021, 4:50 p.m. | #3
Keith Packard <keithp@keithp.com> writes:

> A reasonable alternative would be to have the wchar_t version call the

> char version multiple times, instead of just doing it inline; that would

> avoid any semantic difference, which seems like a feature. I'll post a

> patch that does it this way shortly and see which you prefer.


Never mind -- the wchar_t function cannot simply call the char version
as that would require converting to multi-byte so that EOF is handled
correctly, *and* would end up breaking the getchar function which
assumes that the unget buffer has wchar_t values, not multibyte values.

After looking at the code further and discovering numerous other bugs in
this path, I think we should just use the patch as submitted; I cannot
imagine a case where POSIX would change the semantics of scanf to ever
re-write the input with different characters.

other bugs:

 1. Assumes wchar_t is 2 bytes by setting the bytes remaining (_r) to 2.

 2. Computes the pointer value by subtracting the wchar_t size from the
    _ubuf size, resulting in an unaligned pointer. Then uses this
    pointer directly in both getwc and ungetwc, which will cause a bus
    error on processors unable to access unaligned data.

We could also simplify the matching code path in vfscanf.c at some
point.

-- 
-keith
Corinna Vinschen Aug. 19, 2021, 10:10 a.m. | #4
On Aug 18 09:50, Keith Packard wrote:
> Keith Packard <keithp@keithp.com> writes:

> 

> > A reasonable alternative would be to have the wchar_t version call the

> > char version multiple times, instead of just doing it inline; that would

> > avoid any semantic difference, which seems like a feature. I'll post a

> > patch that does it this way shortly and see which you prefer.

> 

> Never mind -- the wchar_t function cannot simply call the char version

> as that would require converting to multi-byte so that EOF is handled

> correctly, *and* would end up breaking the getchar function which

> assumes that the unget buffer has wchar_t values, not multibyte values.

> 

> After looking at the code further and discovering numerous other bugs in

> this path, I think we should just use the patch as submitted; I cannot

> imagine a case where POSIX would change the semantics of scanf to ever

> re-write the input with different characters.


Ok, pushed.

> other bugs:

> 

>  1. Assumes wchar_t is 2 bytes by setting the bytes remaining (_r) to 2.


That was an oversight.  A few lines above I had replaced that with
sizeof(wchar_t) but apparently forgot this line.  D'oh.

>  2. Computes the pointer value by subtracting the wchar_t size from the

>     _ubuf size, resulting in an unaligned pointer. Then uses this

>     pointer directly in both getwc and ungetwc, which will cause a bus

>     error on processors unable to access unaligned data.


Given all chars are sizeof(wchar_t), how's the buffer ever going to
become unaligned?


Corinna
Keith Packard Aug. 19, 2021, 3:17 p.m. | #5
Corinna Vinschen <vinschen@redhat.com> writes:

> Given all chars are sizeof(wchar_t), how's the buffer ever going to

> become unaligned?


It places the wchar_t at the end of _ubuf, which is 3 bytes long, making
it unaligned:

  fp->_p = &fp->_ubuf[sizeof (fp->_ubuf) - sizeof (wchar_t)];

-- 
-keith
Corinna Vinschen Aug. 19, 2021, 3:38 p.m. | #6
[Please don't CC me, I'm reading the ML all the time.  Thanks]

On Aug 19 08:17, Keith Packard wrote:
> Corinna Vinschen <vinschen@redhat.com> writes:

> 

> > Given all chars are sizeof(wchar_t), how's the buffer ever going to

> > become unaligned?

> 

> It places the wchar_t at the end of _ubuf, which is 3 bytes long, making

> it unaligned:

> 

>   fp->_p = &fp->_ubuf[sizeof (fp->_ubuf) - sizeof (wchar_t)];


Oops, right.


Thanks,
Corinna

Patch

From 2bbbf24fab03472b73296356b7dfb13d616302e9 Mon Sep 17 00:00:00 2001
From: Keith Packard <keithp@keithp.com>
Date: Tue, 17 Aug 2021 11:51:52 -0700
Subject: [PATCH] svfwscanf: Simplify _sungetwc_r to eliminate apparent buffer
 overflow

svfwscanf replaces getwc and ungetwc_r. The comments in the code talk
about avoiding file operations, but they also need to bypass the
mbtowc calls as svfwscanf operates on wchar_t, not multibyte data,
which is a more important reason here; they would not work correctly
otherwise.

The ungetwc replacement has code which uses the 3 byte FILE _ubuf
field, but if wchar_t is 32-bits, this field is not large enough to
hold even one wchar_t value. Building in this mode generates warnings
about array overflow:

	In file included from ../../newlib/libc/stdio/svfiwscanf.c:35:
	../../newlib/libc/stdio/vfwscanf.c: In function '_sungetwc_r.isra':
	../../newlib/libc/stdio/vfwscanf.c:316:12: warning: array subscript 4294967295 is above array bounds of 'unsigned char[3]' [-Warray-bounds]
	  316 |   fp->_p = &fp->_ubuf[sizeof (fp->_ubuf) - sizeof (wchar_t)];
	      |            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
	In file included from ../../newlib/libc/stdio/stdio.h:46,
			 from ../../newlib/libc/stdio/vfwscanf.c:82,
			 from ../../newlib/libc/stdio/svfiwscanf.c:35:
	../../newlib/libc/include/sys/reent.h:216:17: note: while referencing '_ubuf'
	  216 |   unsigned char _ubuf[3]; /* guarantee an ungetc() buffer */
	      |                 ^~~~~

However, the vfwscanf code *never* ungets data before the start of the
scanning operation, and *always* ungets data which matches the input
at that point, so the code always hits the block which backs up over
the input data and never hits the block which uses the _ubuf field.

In addition, the svfwscanf code will always start with the unget
buffer empty, so the ungetwc replacement never needs to support an
unget buffer at all.

Simplify the code by removing support for everything other than
backing up over the input data, leaving the check to make sure it
doesn't get underflowed in case the vfscanf code has a bug in it.

Signed-off-by: Keith Packard <keithp@keithp.com>
---
 newlib/libc/stdio/vfwscanf.c | 40 +++---------------------------------
 1 file changed, 3 insertions(+), 37 deletions(-)

diff --git a/newlib/libc/stdio/vfwscanf.c b/newlib/libc/stdio/vfwscanf.c
index f00d41a09..e9e00dfec 100644
--- a/newlib/libc/stdio/vfwscanf.c
+++ b/newlib/libc/stdio/vfwscanf.c
@@ -272,49 +272,15 @@  _sungetwc_r (struct _reent *data,
   /* After ungetc, we won't be at eof anymore */
   fp->_flags &= ~__SEOF;
 
-  /*
-   * If we are in the middle of ungetwc'ing, just continue.
-   * This may require expanding the current ungetc buffer.
+  /* All ungetwc usage in scanf un-gets the current character, so
+   * just back up over the string if we aren't at the start
    */
-
-  if (HASUB (fp))
-    {
-      if (fp->_r >= fp->_ub._size && __submore (data, fp))
-        {
-          return EOF;
-        }
-      fp->_p -= sizeof (wchar_t);
-      *fp->_p = (wchar_t) wc;
-      fp->_r += sizeof (wchar_t);
-      return wc;
-    }
-
-  /*
-   * If we can handle this by simply backing up, do so,
-   * but never replace the original character.
-   * (This makes swscanf() work when scanning `const' data.)
-   */
-
-  if (fp->_bf._base != NULL && fp->_p > fp->_bf._base
-      && ((wchar_t *)fp->_p)[-1] == wc)
+  if (fp->_bf._base != NULL && fp->_p > fp->_bf._base)
     {
       fp->_p -= sizeof (wchar_t);
       fp->_r += sizeof (wchar_t);
-      return wc;
     }
 
-  /*
-   * Create an ungetc buffer.
-   * Initially, we will use the `reserve' buffer.
-   */
-
-  fp->_ur = fp->_r;
-  fp->_up = fp->_p;
-  fp->_ub._base = fp->_ubuf;
-  fp->_ub._size = sizeof (fp->_ubuf);
-  fp->_p = &fp->_ubuf[sizeof (fp->_ubuf) - sizeof (wchar_t)];
-  *(wchar_t *) fp->_p = wc;
-  fp->_r = 2;
   return wc;
 }
 
-- 
2.32.0