[v2,4/4] libio: Fix race access to _IO_FLAGS2_NEED_LOCK

Message ID 20210510170426.2533768-4-adhemerval.zanella@linaro.org
State New
Headers show
Series
  • [v2,1/4] linux: Move flockfile/_IO_flockfile into libc
Related show

Commit Message

Mark Brown via Libc-alpha May 10, 2021, 5:04 p.m.
The flag is set on:

  1. the stream creation (for memstream and wmemstream).
  2. by pthread_create when it detects the process become multi-thread.
  3. by flockfile.

The 1. and 2. cases do not generate a race condition since for 1. the
stream reference is not accessible by other threads and for 2. the
process is still single-threaded.

However, for 3. the functions feof, ferror, fputc, getc, getchar,
ungetc, and putc access it without locking the stream to check whether
to use the optimized _unlocked variants.  So to avoid a racy condition,
both the check and set are done using atomic operations.

Checked on x86_64-linux-gnu.
---
 libio/feof.c             |  2 +-
 libio/ferror.c           |  2 +-
 libio/fputc.c            |  2 +-
 libio/getc.c             |  2 +-
 libio/getchar.c          |  2 +-
 libio/ioungetc.c         |  2 +-
 libio/libio.h            | 31 +++++++++++++++++++++++++++++--
 libio/putc.c             |  2 +-
 stdio-common/flockfile.c |  2 +-
 9 files changed, 37 insertions(+), 10 deletions(-)

-- 
2.30.2

Comments

Mark Brown via Libc-alpha May 10, 2021, 5:46 p.m. | #1
* Adhemerval Zanella:

> The flag is set on:

>

>   1. the stream creation (for memstream and wmemstream).

>   2. by pthread_create when it detects the process become multi-thread.

>   3. by flockfile.

>

> The 1. and 2. cases do not generate a race condition since for 1. the

> stream reference is not accessible by other threads and for 2. the

> process is still single-threaded.

>

> However, for 3. the functions feof, ferror, fputc, getc, getchar,

> ungetc, and putc access it without locking the stream to check whether

> to use the optimized _unlocked variants.  So to avoid a racy condition,

> both the check and set are done using atomic operations.


Sorry, I don't think this works, purely for performance reasons.

Here's why:

> diff --git a/libio/feof.c b/libio/feof.c

> index 8275321788..823b40637a 100644

> --- a/libio/feof.c

> +++ b/libio/feof.c

> @@ -32,7 +32,7 @@ _IO_feof (FILE *fp)

>  {

>    int result;

>    CHECK_FILE (fp, EOF);

> -  if (!_IO_need_lock (fp))

> +  if (!io_need_lock (fp))

>      return _IO_feof_unlocked (fp);

>    _IO_flockfile (fp);

>    result = _IO_feof_unlocked (fp);


io_need_lock is used on the fast path here.

> diff --git a/libio/libio.h b/libio/libio.h

> index 511b39457f..6b569511ec 100644

> --- a/libio/libio.h

> +++ b/libio/libio.h


> +/* The _IO_FLAGS2_NEED_LOCK flag is set on:

> +

> +   1. the stream creation (for memstream and wmemstream).

> +   2. by pthread_create when it detects the process become multi-thread.

> +   3. by flockfile.

> +

> +   The 1. and 2. cases do not generate a race condition since for 1.

> +   the stream reference is not accessible by other threads and for 2.

> +   the process is still single-threaded.

> +

> +   However, for 3. the functions feof, ferror, fputc, getc, getchar, ungetc,

> +   and putc access it without locking the stream to check whether to use the

> +   optimized _unlocked variants.  So to avoid a racy condition, both the

> +   check and set are done using atomic operations.  */

> +

> +static inline bool

> +io_need_lock (FILE *fp)

> +{

> +  return atomic_load_acquire (&fp->_flags2) & _IO_FLAGS2_NEED_LOCK;

> +}


And this means we have an acquire load on the fast path.  And I think
that's … not good on the architecture for which the optimization was
written.

> diff --git a/stdio-common/flockfile.c b/stdio-common/flockfile.c

> index a66e0a731e..3b42fe891d 100644

> --- a/stdio-common/flockfile.c

> +++ b/stdio-common/flockfile.c

> @@ -22,7 +22,7 @@

>  void

>  __flockfile (FILE *stream)

>  {

> -  stream->_flags2 |= _IO_FLAGS2_NEED_LOCK;

> +  io_set_need_lock (stream);

>    _IO_lock_lock (*stream->_lock);

>  }

>  weak_alias (__flockfile, flockfile);


My impression is that _IO_FLAGS2_NEED_LOCK is set here for consistency,
not for concurrency control (i.e., so that other internal locking
operations produce the desired result).

As you pointed out, pthread_create automatically sets
_IO_FLAGS2_NEED_LOCK in those cases where it is not set by default.  So
the _flags2 update only matters in the single-threaded case.  But the
lock does not matter, so we can do this instead:

  void
  __flockfile (FILE *stream)
  {
    _IO_lock_lock (*stream->_lock);
    stream->_flags2 |= _IO_FLAGS2_NEED_LOCK;
  }

Since the _flags2 update happens after the lock operation, the data race
is gone.

Ideally this should have a comment, but I'm not sure what to write
there.

With more careful review, maybe we could eliminate the _flags2 update
altogether.  What we cannot do is make the flockfile operation itself
conditional on _IO_FLAGS2_NEED_LOCK because of observable impact
vis-a-vis funlockfile.  If we want to optimize flockfile, I think we'd
have to change _IO_lock_lock and add an atomics-free fast path for
SINGLE_THREAD_P (that still performs the lock update).  With that, the
entire _IO_FLAGS2_NEED_LOCK business could be removed again, I think.

Thanks,
Florian
Mark Brown via Libc-alpha May 10, 2021, 7:11 p.m. | #2
On 10/05/2021 14:46, Florian Weimer wrote:
> * Adhemerval Zanella:

> 

>> The flag is set on:

>>

>>   1. the stream creation (for memstream and wmemstream).

>>   2. by pthread_create when it detects the process become multi-thread.

>>   3. by flockfile.

>>

>> The 1. and 2. cases do not generate a race condition since for 1. the

>> stream reference is not accessible by other threads and for 2. the

>> process is still single-threaded.

>>

>> However, for 3. the functions feof, ferror, fputc, getc, getchar,

>> ungetc, and putc access it without locking the stream to check whether

>> to use the optimized _unlocked variants.  So to avoid a racy condition,

>> both the check and set are done using atomic operations.

> 

> Sorry, I don't think this works, purely for performance reasons.

> 

> Here's why:

> 

>> diff --git a/libio/feof.c b/libio/feof.c

>> index 8275321788..823b40637a 100644

>> --- a/libio/feof.c

>> +++ b/libio/feof.c

>> @@ -32,7 +32,7 @@ _IO_feof (FILE *fp)

>>  {

>>    int result;

>>    CHECK_FILE (fp, EOF);

>> -  if (!_IO_need_lock (fp))

>> +  if (!io_need_lock (fp))

>>      return _IO_feof_unlocked (fp);

>>    _IO_flockfile (fp);

>>    result = _IO_feof_unlocked (fp);

> 

> io_need_lock is used on the fast path here.

> 

>> diff --git a/libio/libio.h b/libio/libio.h

>> index 511b39457f..6b569511ec 100644

>> --- a/libio/libio.h

>> +++ b/libio/libio.h

> 

>> +/* The _IO_FLAGS2_NEED_LOCK flag is set on:

>> +

>> +   1. the stream creation (for memstream and wmemstream).

>> +   2. by pthread_create when it detects the process become multi-thread.

>> +   3. by flockfile.

>> +

>> +   The 1. and 2. cases do not generate a race condition since for 1.

>> +   the stream reference is not accessible by other threads and for 2.

>> +   the process is still single-threaded.

>> +

>> +   However, for 3. the functions feof, ferror, fputc, getc, getchar, ungetc,

>> +   and putc access it without locking the stream to check whether to use the

>> +   optimized _unlocked variants.  So to avoid a racy condition, both the

>> +   check and set are done using atomic operations.  */

>> +

>> +static inline bool

>> +io_need_lock (FILE *fp)

>> +{

>> +  return atomic_load_acquire (&fp->_flags2) & _IO_FLAGS2_NEED_LOCK;

>> +}

> 

> And this means we have an acquire load on the fast path.  And I think

> that's … not good on the architecture for which the optimization was

> written.

> 

>> diff --git a/stdio-common/flockfile.c b/stdio-common/flockfile.c

>> index a66e0a731e..3b42fe891d 100644

>> --- a/stdio-common/flockfile.c

>> +++ b/stdio-common/flockfile.c

>> @@ -22,7 +22,7 @@

>>  void

>>  __flockfile (FILE *stream)

>>  {

>> -  stream->_flags2 |= _IO_FLAGS2_NEED_LOCK;

>> +  io_set_need_lock (stream);

>>    _IO_lock_lock (*stream->_lock);

>>  }

>>  weak_alias (__flockfile, flockfile);

> 

> My impression is that _IO_FLAGS2_NEED_LOCK is set here for consistency,

> not for concurrency control (i.e., so that other internal locking

> operations produce the desired result).


Indeed thinking again, the _IO_FLAGS2_NEED_LOCK is similar to
__libc_single_threaded: once the process goes multithread 
_IO_FLAGS2_NEED_LOCK is set and all stream function should take
the lock.  And currently there is no support to reset to default
(which is also tricky because open_memstream should always have it
set).

> 

> As you pointed out, pthread_create automatically sets

> _IO_FLAGS2_NEED_LOCK in those cases where it is not set by default.  So

> the _flags2 update only matters in the single-threaded case.  But the

> lock does not matter, so we can do this instead:

> 

>   void

>   __flockfile (FILE *stream)

>   {

>     _IO_lock_lock (*stream->_lock);

>     stream->_flags2 |= _IO_FLAGS2_NEED_LOCK;

>   }

> 

> Since the _flags2 update happens after the lock operation, the data race

> is gone.

> 

> Ideally this should have a comment, but I'm not sure what to write

> there.


I see that my first assumption of moving the _IO_FLAGS2_NEED_LOCK after
the lock should be suffice.  The data race is not really related to 
_IO_FLAGS2_NEED_LOCK set in fact, but rather the small window where 
the _flags2 read and update with _IO_FLAGS2_NEED_LOCK might lost some
bits update.

Maybe the comment:

/* The _IO_FLAGS2_NEED_LOCK is required here to enable the non-atomic
   fast path on function that call _IO_need_lock.  However it requires
   to be set with stream locked to avoid a potential data race.  */

> 

> With more careful review, maybe we could eliminate the _flags2 update

> altogether.  What we cannot do is make the flockfile operation itself

> conditional on _IO_FLAGS2_NEED_LOCK because of observable impact

> vis-a-vis funlockfile.  If we want to optimize flockfile, I think we'd

> have to change _IO_lock_lock and add an atomics-free fast path for

> SINGLE_THREAD_P (that still performs the lock update).  With that, the

> entire _IO_FLAGS2_NEED_LOCK business could be removed again, I think.


I am not sure we could eliminate since we need a way to inform the
function that call _IO_need_lock that they can elide _IO_flockfile.
The single-thread optimization won't really work: um multithread 
process the idea exactly to make a set of stream operation to avoid
taking the lock.

And _IO_FILE::_flag does not see to have space (high-order word is
_IO_MAGIC and all flags seems to be used).
Mark Brown via Libc-alpha May 10, 2021, 7:18 p.m. | #3
* Adhemerval Zanella:

> I see that my first assumption of moving the _IO_FLAGS2_NEED_LOCK after

> the lock should be suffice.  The data race is not really related to 

> _IO_FLAGS2_NEED_LOCK set in fact, but rather the small window where 

> the _flags2 read and update with _IO_FLAGS2_NEED_LOCK might lost some

> bits update.


Right.

> Maybe the comment:

>

> /* The _IO_FLAGS2_NEED_LOCK is required here to enable the non-atomic

>    fast path on function that call _IO_need_lock.  However it requires

>    to be set with stream locked to avoid a potential data race.  */


s/fast path/path/ I suppose, and “it necessary to set it with the stream
locked”.

>> With more careful review, maybe we could eliminate the _flags2 update

>> altogether.  What we cannot do is make the flockfile operation itself

>> conditional on _IO_FLAGS2_NEED_LOCK because of observable impact

>> vis-a-vis funlockfile.  If we want to optimize flockfile, I think we'd

>> have to change _IO_lock_lock and add an atomics-free fast path for

>> SINGLE_THREAD_P (that still performs the lock update).  With that, the

>> entire _IO_FLAGS2_NEED_LOCK business could be removed again, I think.

>

> I am not sure we could eliminate since we need a way to inform the

> function that call _IO_need_lock that they can elide _IO_flockfile.


_IO_need_lock would go away (along with _IO_FLAGS2_NEED_LOCK) if we had
the SINGLE_THREAD_P optimization inside _IO_lock_lock and
_IO_lock_unlock.

> The single-thread optimization won't really work: um multithread 

> process the idea exactly to make a set of stream operation to avoid

> taking the lock.


Why not?  The single-thread optimization would still avoid the atomic
compare-and-set, which should be a significant win on AArch64 and POWER.
There is more work to be done for the single-threaded case compared to
_IO_FLAGS2_NEED_LOCK, but all that is without atomics, so I think those
should be reasonably fast.

> And _IO_FILE::_flag does not see to have space (high-order word is

> _IO_MAGIC and all flags seems to be used).


This general optimzation pattern for locks relies on SINGLE_THREAD_P
only.

Thanks,
Florian
Mark Brown via Libc-alpha May 10, 2021, 7:25 p.m. | #4
* Florian Weimer:

> * Adhemerval Zanella:

>

>> I see that my first assumption of moving the _IO_FLAGS2_NEED_LOCK after

>> the lock should be suffice.  The data race is not really related to 

>> _IO_FLAGS2_NEED_LOCK set in fact, but rather the small window where 

>> the _flags2 read and update with _IO_FLAGS2_NEED_LOCK might lost some

>> bits update.

>

> Right.


Oh, and this a user-visible bug, so I filed:

  Race condition in _IO_FLAGS2_NEED_LOCK optimization
  <https://sourceware.org/bugzilla/show_bug.cgi?id=27842>

Please reference this bug in your commit message.

Thanks,
Florian
Mark Brown via Libc-alpha May 10, 2021, 7:25 p.m. | #5
On 10/05/2021 16:18, Florian Weimer wrote:
> * Adhemerval Zanella:

> 

>> I see that my first assumption of moving the _IO_FLAGS2_NEED_LOCK after

>> the lock should be suffice.  The data race is not really related to 

>> _IO_FLAGS2_NEED_LOCK set in fact, but rather the small window where 

>> the _flags2 read and update with _IO_FLAGS2_NEED_LOCK might lost some

>> bits update.

> 

> Right.

> 

>> Maybe the comment:

>>

>> /* The _IO_FLAGS2_NEED_LOCK is required here to enable the non-atomic

>>    fast path on function that call _IO_need_lock.  However it requires

>>    to be set with stream locked to avoid a potential data race.  */

> 

> s/fast path/path/ I suppose, and “it necessary to set it with the stream

> locked”.



/* The _IO_FLAGS2_NEED_LOCK is required here to enable the non-atomic
   path on function that call _IO_need_lock.  However it is necessary
   to set it with the stream locked  to avoid a potential data race.  */

> 

>>> With more careful review, maybe we could eliminate the _flags2 update

>>> altogether.  What we cannot do is make the flockfile operation itself

>>> conditional on _IO_FLAGS2_NEED_LOCK because of observable impact

>>> vis-a-vis funlockfile.  If we want to optimize flockfile, I think we'd

>>> have to change _IO_lock_lock and add an atomics-free fast path for

>>> SINGLE_THREAD_P (that still performs the lock update).  With that, the

>>> entire _IO_FLAGS2_NEED_LOCK business could be removed again, I think.

>>

>> I am not sure we could eliminate since we need a way to inform the

>> function that call _IO_need_lock that they can elide _IO_flockfile.

> 

> _IO_need_lock would go away (along with _IO_FLAGS2_NEED_LOCK) if we had

> the SINGLE_THREAD_P optimization inside _IO_lock_lock and

> _IO_lock_unlock.

> 

>> The single-thread optimization won't really work: um multithread 

>> process the idea exactly to make a set of stream operation to avoid

>> taking the lock.

> 

> Why not?  The single-thread optimization would still avoid the atomic

> compare-and-set, which should be a significant win on AArch64 and POWER.

> There is more work to be done for the single-threaded case compared to

> _IO_FLAGS2_NEED_LOCK, but all that is without atomics, so I think those

> should be reasonably fast.


Because it would solve only for single-thread case, for multithread
case the single-thread optimization won't give the function that
call _IO_need_lock the correct information to elide the lock.

The seqeuence:

  flockfile (arq);

  feof (arq);
  ferror (arq);
  ...

  funlockfile (arq);

Should only incur is atomic operation on flockfile and funlockfile
in both single and multi-thread case.  BY tying the feof taking the
lock to single-thread optimization it incur in extra atomic operations
on multi-thread case.
Mark Brown via Libc-alpha May 10, 2021, 7:33 p.m. | #6
* Adhemerval Zanella:

> On 10/05/2021 16:18, Florian Weimer wrote:

>> * Adhemerval Zanella:

>> 

>>> I see that my first assumption of moving the _IO_FLAGS2_NEED_LOCK after

>>> the lock should be suffice.  The data race is not really related to 

>>> _IO_FLAGS2_NEED_LOCK set in fact, but rather the small window where 

>>> the _flags2 read and update with _IO_FLAGS2_NEED_LOCK might lost some

>>> bits update.

>> 

>> Right.

>> 

>>> Maybe the comment:

>>>

>>> /* The _IO_FLAGS2_NEED_LOCK is required here to enable the non-atomic

>>>    fast path on function that call _IO_need_lock.  However it requires

>>>    to be set with stream locked to avoid a potential data race.  */

>> 

>> s/fast path/path/ I suppose, and “it necessary to set it with the stream

>> locked”.

>

>

> /* The _IO_FLAGS2_NEED_LOCK is required here to enable the non-atomic

>    path on function that call _IO_need_lock.  However it is necessary

>    to set it with the stream locked  to avoid a potential data race.  */


Meh, I'm confused.  Aren't we disabling the non-atomic path?

>> Why not?  The single-thread optimization would still avoid the atomic

>> compare-and-set, which should be a significant win on AArch64 and POWER.

>> There is more work to be done for the single-threaded case compared to

>> _IO_FLAGS2_NEED_LOCK, but all that is without atomics, so I think those

>> should be reasonably fast.

>

> Because it would solve only for single-thread case, for multithread

> case the single-thread optimization won't give the function that

> call _IO_need_lock the correct information to elide the lock.

>

> The seqeuence:

>

>   flockfile (arq);

>

>   feof (arq);

>   ferror (arq);

>   ...

>

>   funlockfile (arq);

>

> Should only incur is atomic operation on flockfile and funlockfile

> in both single and multi-thread case.  BY tying the feof taking the

> lock to single-thread optimization it incur in extra atomic operations

> on multi-thread case.


There are two relevant locking optimizations: flockfile acquires the
lock for the current thread.  At that point, the ownership check for the
recursive lock succeeds.  That check doesn't use an atomic load, so it's
fast.

The other optimization is generally enabled via __fsetlocking (arq,
FSETLOCKING_BYCALLER), which enables the _IO_USER_LOCK flag that
disables locking in feof and ferror as well (but not for flockfile and
funlockfile).

The latter is perhaps not necessary because the application could just
call flockfile.

Thanks,
Florian
Mark Brown via Libc-alpha May 10, 2021, 9:42 p.m. | #7
On 10/05/2021 16:33, Florian Weimer wrote:
> * Adhemerval Zanella:

> 

>> On 10/05/2021 16:18, Florian Weimer wrote:

>>> * Adhemerval Zanella:

>>>

>>>> I see that my first assumption of moving the _IO_FLAGS2_NEED_LOCK after

>>>> the lock should be suffice.  The data race is not really related to 

>>>> _IO_FLAGS2_NEED_LOCK set in fact, but rather the small window where 

>>>> the _flags2 read and update with _IO_FLAGS2_NEED_LOCK might lost some

>>>> bits update.

>>>

>>> Right.

>>>

>>>> Maybe the comment:

>>>>

>>>> /* The _IO_FLAGS2_NEED_LOCK is required here to enable the non-atomic

>>>>    fast path on function that call _IO_need_lock.  However it requires

>>>>    to be set with stream locked to avoid a potential data race.  */

>>>

>>> s/fast path/path/ I suppose, and “it necessary to set it with the stream

>>> locked”.

>>

>>

>> /* The _IO_FLAGS2_NEED_LOCK is required here to enable the non-atomic

>>    path on function that call _IO_need_lock.  However it is necessary

>>    to set it with the stream locked  to avoid a potential data race.  */

> 

> Meh, I'm confused.  Aren't we disabling the non-atomic path?


With this discussion it seems to me that the _IO_FLAGS2_NEED_LOCK set on
flockfile is indeed superfluous.  On the code:

  { 
    flockfile (arq);

    write (STDERR_FILENO, "(1)\n", sizeof ("(1)\n"));
    feof (arq);

    funlockfile (arq);
  }

  { 
    write (STDERR_FILENO, "(2)\n", sizeof ("(2)\n"));
    feof (arq);
  }

The second feof will always take the lock because _IO_FLAGS2_NEED_LOCK is
sticky, even though this is fine to use the fast non-atomic path (I am
assuming single-thread here).

> 

>>> Why not?  The single-thread optimization would still avoid the atomic

>>> compare-and-set, which should be a significant win on AArch64 and POWER.

>>> There is more work to be done for the single-threaded case compared to

>>> _IO_FLAGS2_NEED_LOCK, but all that is without atomics, so I think those

>>> should be reasonably fast.

>>

>> Because it would solve only for single-thread case, for multithread

>> case the single-thread optimization won't give the function that

>> call _IO_need_lock the correct information to elide the lock.

>>

>> The seqeuence:

>>

>>   flockfile (arq);

>>

>>   feof (arq);

>>   ferror (arq);

>>   ...

>>

>>   funlockfile (arq);

>>

>> Should only incur is atomic operation on flockfile and funlockfile

>> in both single and multi-thread case.  BY tying the feof taking the

>> lock to single-thread optimization it incur in extra atomic operations

>> on multi-thread case.

> 

> There are two relevant locking optimizations: flockfile acquires the

> lock for the current thread.  At that point, the ownership check for the

> recursive lock succeeds.  That check doesn't use an atomic load, so it's

> fast.

> 

> The other optimization is generally enabled via __fsetlocking (arq,

> FSETLOCKING_BYCALLER), which enables the _IO_USER_LOCK flag that

> disables locking in feof and ferror as well (but not for flockfile and

> funlockfile).

> 

> The latter is perhaps not necessary because the application could just

> call flockfile.


This is indeed messy, ideally I think we would like something like:

  int 
  _IO_feof (FILE *fp)
  {
    __flockfile (fp);
    int result = _IO_feof_unlocked ();
    __funlockfile (fp);
    return result;
  }

  void
  __flockfile (FILE *stream)
  {
    _IO_lock_lock (*stream->lock);
  }

  static inline void
  _IO_lock_lock (_IO_lock_t *lock)
  {
    void *self = THREAD_SELF;
    if (lock->owner != self)
      {
        lll_lock (name->lock, LLL_PRIVATE);
        lock->owner = self;
      }
    ++lock->cnt;
  }

The __flockfile should provide single-thread optimization through the
_IO_lock_lock (and the recursive lock).  However we can't really use
__flockfile internally because of __fsetlocking (FSETLOCKING_BYCALLER),
so we need a wrapper to check for _IO_USER_LOCK (which is provided by
the _IO_need_lock wrapper).

The _IO_need_lock and _unlocked call really seems unnecessary, at least
on Linux that already provides a recursive lock with owner.
Mark Brown via Libc-alpha May 10, 2021, 9:46 p.m. | #8
On 10/05/2021 18:42, Adhemerval Zanella wrote:
> 

> 

> On 10/05/2021 16:33, Florian Weimer wrote:

>> * Adhemerval Zanella:

>>

>>> On 10/05/2021 16:18, Florian Weimer wrote:

>>>> * Adhemerval Zanella:

>>>>

>>>>> I see that my first assumption of moving the _IO_FLAGS2_NEED_LOCK after

>>>>> the lock should be suffice.  The data race is not really related to 

>>>>> _IO_FLAGS2_NEED_LOCK set in fact, but rather the small window where 

>>>>> the _flags2 read and update with _IO_FLAGS2_NEED_LOCK might lost some

>>>>> bits update.

>>>>

>>>> Right.

>>>>

>>>>> Maybe the comment:

>>>>>

>>>>> /* The _IO_FLAGS2_NEED_LOCK is required here to enable the non-atomic

>>>>>    fast path on function that call _IO_need_lock.  However it requires

>>>>>    to be set with stream locked to avoid a potential data race.  */

>>>>

>>>> s/fast path/path/ I suppose, and “it necessary to set it with the stream

>>>> locked”.

>>>

>>>

>>> /* The _IO_FLAGS2_NEED_LOCK is required here to enable the non-atomic

>>>    path on function that call _IO_need_lock.  However it is necessary

>>>    to set it with the stream locked  to avoid a potential data race.  */

>>

>> Meh, I'm confused.  Aren't we disabling the non-atomic path?

> 

> With this discussion it seems to me that the _IO_FLAGS2_NEED_LOCK set on

> flockfile is indeed superfluous.  On the code:

> 

>   { 

>     flockfile (arq);

> 

>     write (STDERR_FILENO, "(1)\n", sizeof ("(1)\n"));

>     feof (arq);

> 

>     funlockfile (arq);

>   }

> 

>   { 

>     write (STDERR_FILENO, "(2)\n", sizeof ("(2)\n"));

>     feof (arq);

>   }

> 

> The second feof will always take the lock because _IO_FLAGS2_NEED_LOCK is

> sticky, even though this is fine to use the fast non-atomic path (I am

> assuming single-thread here).

> 

>>

>>>> Why not?  The single-thread optimization would still avoid the atomic

>>>> compare-and-set, which should be a significant win on AArch64 and POWER.

>>>> There is more work to be done for the single-threaded case compared to

>>>> _IO_FLAGS2_NEED_LOCK, but all that is without atomics, so I think those

>>>> should be reasonably fast.

>>>

>>> Because it would solve only for single-thread case, for multithread

>>> case the single-thread optimization won't give the function that

>>> call _IO_need_lock the correct information to elide the lock.

>>>

>>> The seqeuence:

>>>

>>>   flockfile (arq);

>>>

>>>   feof (arq);

>>>   ferror (arq);

>>>   ...

>>>

>>>   funlockfile (arq);

>>>

>>> Should only incur is atomic operation on flockfile and funlockfile

>>> in both single and multi-thread case.  BY tying the feof taking the

>>> lock to single-thread optimization it incur in extra atomic operations

>>> on multi-thread case.

>>

>> There are two relevant locking optimizations: flockfile acquires the

>> lock for the current thread.  At that point, the ownership check for the

>> recursive lock succeeds.  That check doesn't use an atomic load, so it's

>> fast.

>>

>> The other optimization is generally enabled via __fsetlocking (arq,

>> FSETLOCKING_BYCALLER), which enables the _IO_USER_LOCK flag that

>> disables locking in feof and ferror as well (but not for flockfile and

>> funlockfile).

>>

>> The latter is perhaps not necessary because the application could just

>> call flockfile.

> 

> This is indeed messy, ideally I think we would like something like:

> 

>   int 

>   _IO_feof (FILE *fp)

>   {

>     __flockfile (fp);

>     int result = _IO_feof_unlocked ();

>     __funlockfile (fp);

>     return result;

>   }

> 

>   void

>   __flockfile (FILE *stream)

>   {

>     _IO_lock_lock (*stream->lock);

>   }

> 

>   static inline void

>   _IO_lock_lock (_IO_lock_t *lock)

>   {

>     void *self = THREAD_SELF;

>     if (lock->owner != self)

>       {

>         lll_lock (name->lock, LLL_PRIVATE);

>         lock->owner = self;

>       }

>     ++lock->cnt;

>   }

> 

> The __flockfile should provide single-thread optimization through the

> _IO_lock_lock (and the recursive lock).  However we can't really use

> __flockfile internally because of __fsetlocking (FSETLOCKING_BYCALLER),

> so we need a wrapper to check for _IO_USER_LOCK (which is provided by

> the _IO_need_lock wrapper).

> 

> The _IO_need_lock and _unlocked call really seems unnecessary, at least

> on Linux that already provides a recursive lock with owner.

> 


Sigh... we still need for the open_memstream case (de895ddcd7fc45c).
Maybe we make open_memstream use a invalid owner value (either NULL
or -1) to force the _IO_lock_lock always call lll_lock
Mark Brown via Libc-alpha May 11, 2021, 10:06 a.m. | #9
* Adhemerval Zanella:

> The __flockfile should provide single-thread optimization through the

> _IO_lock_lock (and the recursive lock).  However we can't really use

> __flockfile internally because of __fsetlocking (FSETLOCKING_BYCALLER),

> so we need a wrapper to check for _IO_USER_LOCK (which is provided by

> the _IO_need_lock wrapper).


Right.  So how we move forward with this?  Remove _IO_FLAGS2_NEED_LOCK
and do the optimization in _IO_lock_lock and _IO_lock_unlock?

Thanks,
Florian
Mark Brown via Libc-alpha May 11, 2021, 12:29 p.m. | #10
On 11/05/2021 07:06, Florian Weimer wrote:
> * Adhemerval Zanella:

> 

>> The __flockfile should provide single-thread optimization through the

>> _IO_lock_lock (and the recursive lock).  However we can't really use

>> __flockfile internally because of __fsetlocking (FSETLOCKING_BYCALLER),

>> so we need a wrapper to check for _IO_USER_LOCK (which is provided by

>> the _IO_need_lock wrapper).

> 

> Right.  So how we move forward with this?  Remove _IO_FLAGS2_NEED_LOCK

> and do the optimization in _IO_lock_lock and _IO_lock_unlock?


I am checking if we can remove it, I think we can use the locker owner
itself to force open_memstream to avoid this optimization and thus we
can remove the _IO_FLAGS2_NEED_LOCK altogether.

Patch

diff --git a/libio/feof.c b/libio/feof.c
index 8275321788..823b40637a 100644
--- a/libio/feof.c
+++ b/libio/feof.c
@@ -32,7 +32,7 @@  _IO_feof (FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
-  if (!_IO_need_lock (fp))
+  if (!io_need_lock (fp))
     return _IO_feof_unlocked (fp);
   _IO_flockfile (fp);
   result = _IO_feof_unlocked (fp);
diff --git a/libio/ferror.c b/libio/ferror.c
index 588f2eb50b..13f1a15ecc 100644
--- a/libio/ferror.c
+++ b/libio/ferror.c
@@ -32,7 +32,7 @@  _IO_ferror (FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
-  if (!_IO_need_lock (fp))
+  if (!io_need_lock (fp))
     return _IO_ferror_unlocked (fp);
   _IO_flockfile (fp);
   result = _IO_ferror_unlocked (fp);
diff --git a/libio/fputc.c b/libio/fputc.c
index c3a1fa47db..b3925c34e8 100644
--- a/libio/fputc.c
+++ b/libio/fputc.c
@@ -32,7 +32,7 @@  fputc (int c, FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
-  if (!_IO_need_lock (fp))
+  if (!io_need_lock (fp))
     return _IO_putc_unlocked (c, fp);
   _IO_acquire_lock (fp);
   result = _IO_putc_unlocked (c, fp);
diff --git a/libio/getc.c b/libio/getc.c
index e3e658bf79..df2e40dd51 100644
--- a/libio/getc.c
+++ b/libio/getc.c
@@ -34,7 +34,7 @@  _IO_getc (FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
-  if (!_IO_need_lock (fp))
+  if (!io_need_lock (fp))
     return _IO_getc_unlocked (fp);
   _IO_acquire_lock (fp);
   result = _IO_getc_unlocked (fp);
diff --git a/libio/getchar.c b/libio/getchar.c
index 0d6225853f..0607cd40dd 100644
--- a/libio/getchar.c
+++ b/libio/getchar.c
@@ -33,7 +33,7 @@  int
 getchar (void)
 {
   int result;
-  if (!_IO_need_lock (stdin))
+  if (!io_need_lock (stdin))
     return _IO_getc_unlocked (stdin);
   _IO_acquire_lock (stdin);
   result = _IO_getc_unlocked (stdin);
diff --git a/libio/ioungetc.c b/libio/ioungetc.c
index dcf1c4ca9e..dcc99aa5a3 100644
--- a/libio/ioungetc.c
+++ b/libio/ioungetc.c
@@ -33,7 +33,7 @@  _IO_ungetc (int c, FILE *fp)
   CHECK_FILE (fp, EOF);
   if (c == EOF)
     return EOF;
-  if (!_IO_need_lock (fp))
+  if (!io_need_lock (fp))
     return _IO_sputbackc (fp, (unsigned char) c);
   _IO_acquire_lock (fp);
   result = _IO_sputbackc (fp, (unsigned char) c);
diff --git a/libio/libio.h b/libio/libio.h
index 511b39457f..6b569511ec 100644
--- a/libio/libio.h
+++ b/libio/libio.h
@@ -47,6 +47,8 @@ 
 #include <bits/types/__mbstate_t.h>
 #include <bits/types/wint_t.h>
 #include <gconv.h>
+#include <atomic.h>
+#include <stdbool.h>
 
 typedef struct
 {
@@ -210,8 +212,33 @@  extern int _IO_ftrylockfile (FILE *) __THROW;
 #define _IO_cleanup_region_end(_Doit) /**/
 #endif
 
-#define _IO_need_lock(_fp) \
-  (((_fp)->_flags2 & _IO_FLAGS2_NEED_LOCK) != 0)
+
+/* The _IO_FLAGS2_NEED_LOCK flag is set on:
+
+   1. the stream creation (for memstream and wmemstream).
+   2. by pthread_create when it detects the process become multi-thread.
+   3. by flockfile.
+
+   The 1. and 2. cases do not generate a race condition since for 1.
+   the stream reference is not accessible by other threads and for 2.
+   the process is still single-threaded.
+
+   However, for 3. the functions feof, ferror, fputc, getc, getchar, ungetc,
+   and putc access it without locking the stream to check whether to use the
+   optimized _unlocked variants.  So to avoid a racy condition, both the
+   check and set are done using atomic operations.  */
+
+static inline bool
+io_need_lock (FILE *fp)
+{
+  return atomic_load_acquire (&fp->_flags2) & _IO_FLAGS2_NEED_LOCK;
+}
+
+static inline void
+io_set_need_lock (FILE *fp)
+{
+  atomic_fetch_or_release (&fp->_flags2, _IO_FLAGS2_NEED_LOCK);
+}
 
 extern int _IO_vfscanf (FILE * __restrict, const char * __restrict,
 			__gnuc_va_list, int *__restrict);
diff --git a/libio/putc.c b/libio/putc.c
index 0b0dd40736..1f1da21239 100644
--- a/libio/putc.c
+++ b/libio/putc.c
@@ -25,7 +25,7 @@  _IO_putc (int c, FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
-  if (!_IO_need_lock (fp))
+  if (!io_need_lock (fp))
     return _IO_putc_unlocked (c, fp);
   _IO_acquire_lock (fp);
   result = _IO_putc_unlocked (c, fp);
diff --git a/stdio-common/flockfile.c b/stdio-common/flockfile.c
index a66e0a731e..3b42fe891d 100644
--- a/stdio-common/flockfile.c
+++ b/stdio-common/flockfile.c
@@ -22,7 +22,7 @@ 
 void
 __flockfile (FILE *stream)
 {
-  stream->_flags2 |= _IO_FLAGS2_NEED_LOCK;
+  io_set_need_lock (stream);
   _IO_lock_lock (*stream->_lock);
 }
 weak_alias (__flockfile, flockfile);