[libstdc++] Refactor/cleanup of atomic wait implementation

Message ID 20210223215722.140761-1-rodgert@appliantology.com
State Superseded

Commit Message

Thomas Rodgers Feb. 23, 2021, 9:57 p.m.
From: Thomas Rodgers <rodgert@twrodgers.com>


* This revises the previous version to fix std::__condvar::wait_until() usage.

This is a substantial rewrite of the atomic wait/notify (and timed wait
counterparts) implementation.

The previous __platform_wait looped on EINTR; however, this behavior is
not required by the standard. A new _GLIBCXX_HAVE_PLATFORM_WAIT macro
now controls whether wait/notify are implemented using a platform
specific primitive or with a platform agnostic mutex/condvar. This
patch only supplies a definition for Linux futexes. A future update
could add support for __ulock_wait/wake on Darwin, for instance.

The members of __waiters were lifted to a new base class. The members
are now arranged such that overall sizeof(__waiters_base) fits in two
cache lines (on platforms with at least 64 byte cache lines). The
definition will also use destructive_interference_size for this if it
is available.
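
As an illustration of the layout described above, here is a minimal
sketch. The names waiter_align and waiters_base_sketch are hypothetical
and the real __waiters_base carries more members; the sketch only shows
how two counters can each be given their own cache line, falling back
to 64 bytes when the library constant is not available.

```cpp
#include <cstddef>
#include <new>

// Hypothetical constant: use the library value when it exists,
// otherwise assume 64-byte cache lines.
#ifdef __cpp_lib_hardware_interference_size
inline constexpr std::size_t waiter_align
  = std::hardware_destructive_interference_size;
#else
inline constexpr std::size_t waiter_align = 64;
#endif

// Hypothetical stand-in for the waiter state: each counter starts on
// its own cache line, so the whole object spans two lines.
struct alignas(waiter_align) waiters_base_sketch
{
  alignas(waiter_align) int wait_count = 0;
  alignas(waiter_align) int version = 0;
};
```

With this arrangement a thread bumping the waiter count should not
invalidate the cache line that holds the version counter.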

The __waiters type is now specific to untimed waits. Timed waits have a
corresponding __timed_waiters type. Much of the code has been moved from
the previous __atomic_wait() free function to the __waiter_base template
and a __waiter derived type is provided to implement the un-timed wait
operations. A similar change has been made to the timed wait
implementation.

The __atomic_spin code has been extended to take a spin policy which is
invoked after the initial busy wait loop. The default policy is to
return from the spin. The timed wait code adds a timed backoff spinning
policy. The code from <thread> which implements this_thread::sleep_for,
sleep_until has been moved to a new <bits/std_thread_sleep.h> header
which allows the thread sleep code to be consumed without pulling in the
whole of <thread>.
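
The shape of a spin function that takes a policy might look roughly
like the following sketch. The names default_spin_policy and spin_until
are hypothetical, the iteration count is taken from the constants in
the patch, and the busy-wait phase is simplified to a plain yield; this
is not the code from the patch.

```cpp
#include <thread>

// Hypothetical default policy: returning false ends the spin as soon
// as the initial busy-wait phase is over.
struct default_spin_policy
{
  bool operator()() const noexcept { return false; }
};

// Hypothetical spin helper: busy-wait for a fixed number of rounds,
// then let the policy decide whether to keep retrying.
template<typename Pred, typename Spin = default_spin_policy>
bool spin_until(Pred& pred, Spin spin = Spin{}) noexcept
{
  constexpr int spin_count = 16;  // total busy-wait iterations
  for (int i = 0; i < spin_count; ++i)
    {
      if (pred())
        return true;
      std::this_thread::yield();
    }
  // The policy is consulted only after the busy-wait phase.
  while (spin())
    {
      if (pred())
        return true;
    }
  return false;
}
```

A timed policy would return true from operator() (after sleeping or
yielding) for as long as the deadline has not passed.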

The entry points into the wait/notify code have been restructured to
support either:
   * Testing the current value of the atomic stored at the given address
     and waiting on a notification.
   * Applying a predicate to determine if the wait was satisfied.
The entry points were renamed to make it clear that the wait and wake
operations operate on addresses. The first variant takes the expected
value and a function which returns the current value that should be used
in comparison operations; these operations are named with a _v suffix
(for 'value'). All atomic<_Tp> wait/notify operations use the first
variant. Barriers, latches and semaphores use the predicate variant.
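
The difference between the two variants can be sketched as follows.
The names wait_address_v and wait_address are hypothetical, and the
sketch polls with yield() instead of blocking on a futex or condvar, so
it only illustrates the calling convention and where the comparison
lives.

```cpp
#include <cstring>
#include <thread>

// Value variant (hypothetical): keep waiting while the value produced
// by vfn() is still bytewise equal to old. The comparison is done
// here, in one place, not in each caller's lambda.
template<typename T, typename ValFn>
void wait_address_v(T old, ValFn vfn)
{
  for (T cur = vfn(); std::memcmp(&cur, &old, sizeof(T)) == 0; cur = vfn())
    std::this_thread::yield();
}

// Predicate variant (hypothetical): keep waiting until pred() reports
// that the wait is satisfied.
template<typename Pred>
void wait_address(Pred pred)
{
  while (!pred())
    std::this_thread::yield();
}
```

In the value variant the caller supplies only a load, which is why the
atomic<_Tp>::wait lambdas in the patch shrink to a bare load or test.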

This change also centralizes what it means to compare values for the
purposes of atomic<T>::wait rather than scattering it across individual
predicates.

This change also centralizes the repetitive code which adjusts for
different user supplied clocks (this should be moved elsewhere
and all such adjustments should use a common implementation).

libstdc++-v3/ChangeLog:
	* include/Makefile.am: Add new <bits/std_thread_sleep.h> header.
	* include/Makefile.in: Regenerate.
	* include/bits/atomic_base.h: Adjust all calls
	to __atomic_wait/__atomic_notify for new call signatures.
	* include/bits/atomic_wait.h: Extensive rewrite.
	* include/bits/atomic_timed_wait.h: Likewise.
	* include/bits/semaphore_base.h: Adjust all calls
	to __atomic_wait/__atomic_notify for new call signatures.
	* include/bits/std_thread_sleep.h: New file.
	* include/std/atomic: Likewise.
	* include/std/barrier: Likewise.
	* include/std/latch: Likewise.
	* testsuite/29_atomics/atomic/wait_notify/bool.cc: Simplify
	test.
	* testsuite/29_atomics/atomic/wait_notify/generic.cc: Likewise.
	* testsuite/29_atomics/atomic/wait_notify/pointers.cc: Likewise.
	* testsuite/29_atomics/atomic_flag/wait_notify.cc: Likewise.
	* testsuite/29_atomics/atomic_float/wait_notify.cc: Likewise.
	* testsuite/29_atomics/atomic_integral/wait_notify.cc: Likewise.
	* testsuite/29_atomics/atomic_ref/wait_notify.cc: Likewise.
---
 libstdc++-v3/include/Makefile.am              |   1 +
 libstdc++-v3/include/Makefile.in              |   1 +
 libstdc++-v3/include/bits/atomic_base.h       |  36 +-
 libstdc++-v3/include/bits/atomic_timed_wait.h | 410 +++++++++++-------
 libstdc++-v3/include/bits/atomic_wait.h       | 400 +++++++++++------
 libstdc++-v3/include/bits/semaphore_base.h    |  73 +---
 libstdc++-v3/include/bits/std_thread_sleep.h  | 119 +++++
 libstdc++-v3/include/std/atomic               |  15 +-
 libstdc++-v3/include/std/barrier              |   4 +-
 libstdc++-v3/include/std/latch                |   4 +-
 libstdc++-v3/include/std/thread               |  68 +--
 .../29_atomics/atomic/wait_notify/bool.cc     |  37 +-
 .../29_atomics/atomic/wait_notify/generic.cc  |  19 +-
 .../29_atomics/atomic/wait_notify/pointers.cc |  36 +-
 .../29_atomics/atomic_flag/wait_notify/1.cc   |  37 +-
 .../29_atomics/atomic_float/wait_notify.cc    |  26 +-
 .../29_atomics/atomic_integral/wait_notify.cc |  73 ++--
 .../29_atomics/atomic_ref/wait_notify.cc      |  74 +---
 18 files changed, 802 insertions(+), 631 deletions(-)
 create mode 100644 libstdc++-v3/include/bits/std_thread_sleep.h

-- 
2.29.2

Comments

Jeff Law via Gcc-patches March 3, 2021, 3:14 p.m. | #1
On 23/02/21 13:57 -0800, Thomas Rodgers wrote:
>diff --git a/libstdc++-v3/include/bits/atomic_wait.h b/libstdc++-v3/include/bits/atomic_wait.h
>index 1a0f0943ebd..fa83ef6c231 100644
>--- a/libstdc++-v3/include/bits/atomic_wait.h
>+++ b/libstdc++-v3/include/bits/atomic_wait.h
>@@ -39,17 +39,16 @@
> #include <ext/numeric_traits.h>
>
> #ifdef _GLIBCXX_HAVE_LINUX_FUTEX
>+#define _GLIBCXX_HAVE_PLATFORM_WAIT 1


This is defined here (to 1) and then ...

> # include <cerrno>
> # include <climits>
> # include <unistd.h>
> # include <syscall.h>
> # include <bits/functexcept.h>
>-// TODO get this from Autoconf
>-# define _GLIBCXX_HAVE_LINUX_FUTEX_PRIVATE 1
>-#else
>-# include <bits/std_mutex.h>  // std::mutex, std::__condvar
> #endif
>
>+# include <bits/std_mutex.h>  // std::mutex, std::__condvar
>+
> #define __cpp_lib_atomic_wait 201907L
>
> namespace std _GLIBCXX_VISIBILITY(default)
>@@ -57,20 +56,27 @@ namespace std _GLIBCXX_VISIBILITY(default)
> _GLIBCXX_BEGIN_NAMESPACE_VERSION
>   namespace __detail
>   {
>+#ifdef _GLIBCXX_HAVE_LINUX_FUTEX
>     using __platform_wait_t = int;
>+#else
>+    using __platform_wait_t = uint64_t;
>+#endif
>+  } // namespace __detail
>
>-    constexpr auto __atomic_spin_count_1 = 16;
>-    constexpr auto __atomic_spin_count_2 = 12;
>-
>-    template<typename _Tp>
>-      inline constexpr bool __platform_wait_uses_type
>-#ifdef _GLIBCXX_HAVE_LINUX_FUTEX
>-	= is_same_v<remove_cv_t<_Tp>, __platform_wait_t>;
>+  template<typename _Tp>
>+    inline constexpr bool __platform_wait_uses_type
>+#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT
>+      = is_same_v<remove_cv_t<_Tp>, __detail::__platform_wait_t>
>+	|| ((sizeof(_Tp) == sizeof(__detail::__platform_wait_t))
>+	    && (alignof(_Tp*) == alignof(__detail::__platform_wait_t)));
> #else
>-	= false;
>+      = false;
> #endif
>
>+  namespace __detail
>+  {
> #ifdef _GLIBCXX_HAVE_LINUX_FUTEX
>+#define _GLIBCXX_HAVE_PLATFORM_WAIT


Redefined here (to empty), after it's already been tested.

Presumably this redefinition shouldn't be here.

Also the HAVE_PLATFORM_TIMED_WAIT macro is defined to empty. I think
they should both be defined to 1 (or both empty, but not
inconsistently).
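
To illustrate why the two styles of definition are not interchangeable,
here is a small sketch with hypothetical macro names. A macro defined
to 1 works with both #ifdef and #if, while one defined to empty only
works with #ifdef; writing "#if HYPOTHETICAL_HAVE_TIMED_WAIT" would be
a preprocessing error because the expression is empty.

```cpp
// Hypothetical macros standing in for the two definitions discussed.
#define HYPOTHETICAL_HAVE_WAIT 1
#define HYPOTHETICAL_HAVE_TIMED_WAIT

#ifdef HYPOTHETICAL_HAVE_WAIT
constexpr bool wait_ifdef = true;   // #ifdef only asks "is it defined?"
#else
constexpr bool wait_ifdef = false;
#endif

#if HYPOTHETICAL_HAVE_WAIT
constexpr bool wait_if = true;      // #if needs a value to evaluate
#else
constexpr bool wait_if = false;
#endif

#ifdef HYPOTHETICAL_HAVE_TIMED_WAIT
constexpr bool timed_ifdef = true;  // fine: defined, even though empty
#else
constexpr bool timed_ifdef = false;
#endif
```

Defining both macros to 1 keeps either style of test usable later.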

I'm still going through the rest of the patch.
Jeff Law via Gcc-patches March 3, 2021, 5:31 p.m. | #2
On 23/02/21 13:57 -0800, Thomas Rodgers wrote:

Some of this diff is very confusing, where the context being shown as
removed is actually a completely different function. Please try
--diff-algorithm=histogram for the next version of this patch. It
might make it easier to read.

>+    struct __timed_backoff_spin_policy
>+    {
>+      __wait_clock_t::time_point _M_deadline;
>+      __wait_clock_t::time_point _M_t0;
>+
>+      template<typename _Clock, typename _Dur>
>+	__timed_backoff_spin_policy(chrono::time_point<_Clock, _Dur>
>+				      __deadline = _Clock::time_point::max(),
>+				    chrono::time_point<_Clock, _Dur>
>+				      __t0 = _Clock::now()) noexcept
>+	  : _M_deadline(__to_wait_clock(__deadline))
>+	  , _M_t0(__to_wait_clock(__t0))


If this policy object is constructed with a time_point using the
steady_clock then it will still call __to_wait_clock to convert it to
the steady_clock, making multiple unnecessary (and expensive) calls to
steady_clock::now().

I think you either need to overload the constructor or overload
__to_wait_clock.
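
One way the overload could look is sketched below. The names wait_clock
and to_wait_clock are hypothetical stand-ins for __wait_clock_t and
__to_wait_clock, and the use of chrono::ceil for the conversion is an
assumption of the sketch, not something taken from the patch.

```cpp
#include <chrono>

using wait_clock = std::chrono::steady_clock;

// Generic case: translate a time point on some other clock by sampling
// both clocks once and carrying the remaining delta across.
template<typename Clock, typename Dur>
wait_clock::time_point
to_wait_clock(const std::chrono::time_point<Clock, Dur>& atime) noexcept
{
  const auto c_entry = Clock::now();
  const auto s_entry = wait_clock::now();
  const auto delta = atime - c_entry;
  return s_entry + std::chrono::ceil<wait_clock::duration>(delta);
}

// More specialized overload: the time point is already on the wait
// clock, so no call to now() is needed at all.
template<typename Dur>
wait_clock::time_point
to_wait_clock(const std::chrono::time_point<wait_clock, Dur>& atime) noexcept
{
  return std::chrono::ceil<wait_clock::duration>(atime);
}
```

Partial ordering selects the second overload for steady_clock time
points, so callers do not have to change.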

>+	{ }
>+
>+      bool
>+      operator()() noexcept


This can be const.

>       {
>-	static_assert(sizeof(__timed_waiters) == sizeof(__waiters));
>-	return static_cast<__timed_waiters&>(__waiters::_S_for(__t));
>+	using namespace literals::chrono_literals;
>+	auto __now = __wait_clock_t::now();
>+	if (_M_deadline <= __now)
>+	  return false;
>+
>+	auto __elapsed = __now - _M_t0;
>+	if (__elapsed > 128ms)
>+	  {
>+	    this_thread::sleep_for(64ms);
>+	  }
>+	else if (__elapsed > 64us)
>+	  {
>+	    this_thread::sleep_for(__elapsed / 2);
>+	  }
>+	else if (__elapsed > 4us)
>+	  {
>+	    __thread_yield();
>+	  }
>+	else
>+	  return false;
>       }
>     };
>-  } // namespace __detail





>+      template<typename _Tp, typename _ValFn,
>+	       typename _Rep, typename _Period>
>+	bool
>+	_M_do_wait_for_v(_Tp __old, _ValFn __vfn,
>+			 const chrono::duration<_Rep, _Period>&
>+							      __rtime) noexcept
>+	{
>+	  __platform_wait_t __val;
>+	  if (_M_do_spin_v(__old, move(__vfn), __val))


This should be std::move (there's another case of this in the patch
too).

>+	    return true;
>+
>+	  if (!__rtime.count())
>+	    return false; // no rtime supplied, and spin did not acquire
>+
>+	  using __dur = chrono::steady_clock::duration;
>+	  auto __reltime = chrono::duration_cast<__dur>(__rtime);
>+	  if (__reltime < __rtime)
>+	    ++__reltime;


This is C++20 code so it can use chrono::ceil here instead of
duration_cast, then you don't need the increment.
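
A small sketch of the two forms side by side. The names target_dur,
round_up_manual and round_up_ceil are hypothetical, and milliseconds
stands in for steady_clock::duration so that the rounding is visible
with ordinary inputs.

```cpp
#include <chrono>

// Hypothetical target type standing in for steady_clock::duration.
using target_dur = std::chrono::milliseconds;

// The pattern used in the patch: truncate, then add one tick if the
// truncation lost time.
template<typename Rep, typename Period>
target_dur round_up_manual(const std::chrono::duration<Rep, Period>& rtime)
{
  auto reltime = std::chrono::duration_cast<target_dur>(rtime);
  if (reltime < rtime)
    ++reltime;
  return reltime;
}

// The suggested form: chrono::ceil rounds up in a single step.
template<typename Rep, typename Period>
target_dur round_up_ceil(const std::chrono::duration<Rep, Period>& rtime)
{
  return std::chrono::ceil<target_dur>(rtime);
}
```

Both functions should agree for every input; ceil just states the
intent directly.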

>+	  return _M_w._M_do_wait_until(_M_addr, __val,
>+				       chrono::steady_clock::now() + __reltime);
> 	}
>-      while (!__pred() && __atime < _Clock::now());
>-      __w._M_leave_wait();
>
>-      // if timed out, return false
>-      return (_Clock::now() < __atime);
>+      template<typename _Pred,
>+	       typename _Rep, typename _Period>
>+	bool
>+	_M_do_wait_for(_Pred __pred,
>+		       const chrono::duration<_Rep, _Period>& __rtime) noexcept
>+	{
>+	  __platform_wait_t __val;
>+	  if (_M_do_spin(__pred, __val))
>+	    return true;
>+
>+	  if (!__rtime.count())
>+	    return false; // no rtime supplied, and spin did not acquire
>+
>+	  using __dur = chrono::steady_clock::duration;
>+	  auto __reltime = chrono::duration_cast<__dur>(__rtime);
>+	  if (__reltime < __rtime)
>+	    ++__reltime;


chrono::ceil here too.

>+  template<typename _Tp>
>+    inline constexpr bool __platform_wait_uses_type
>+#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT
>+      = is_same_v<remove_cv_t<_Tp>, __detail::__platform_wait_t>


This is_same check seems redundant, as the following will be true
anyway.

>+	|| ((sizeof(_Tp) == sizeof(__detail::__platform_wait_t))
>+	    && (alignof(_Tp*) == alignof(__detail::__platform_wait_t)));


This should be alignof(_Tp) not alignof(_Tp*) shouldn't it?

And alignof(_Tp) > alignof(__platform_wait_t) is OK too, so >= not ==.

We need the is_scalar check from Thiago's patch. We don't want to try
and use a futex for something like:

struct S { short s; char c; /* padding */ };
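
Putting those three points together, the trait might look something
like the sketch below. The names platform_wait_t and
platform_wait_uses_type are hypothetical stand-ins for the patch's
internals, and the exact set of conditions is an assumption based on
the comments above, not the committed code.

```cpp
#include <type_traits>

using platform_wait_t = int;  // the futex word type used on Linux

// Hypothetical trait: only scalars of the same size as the futex word,
// with at least its alignment, may be waited on directly.
template<typename T>
inline constexpr bool platform_wait_uses_type
  = std::is_scalar_v<T>
    && sizeof(T) == sizeof(platform_wait_t)
    && alignof(T) >= alignof(platform_wait_t);

// A class type with padding: rejected by the is_scalar check whatever
// its size happens to be.
struct S { short s; char c; /* padding */ };
```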


> #else
>-	= false;
>+      = false;
> #endif
>
>+  namespace __detail
>+  {
> #ifdef _GLIBCXX_HAVE_LINUX_FUTEX
>+#define _GLIBCXX_HAVE_PLATFORM_WAIT


Redefinition, as I pointed out in my earlier mail.




>+    struct __default_spin_policy
>+    {
>+      bool
>+      operator()() noexcept


This can be const.

>+      { return false; }
>+    };
>+
>+    template<typename _Pred,
>+	     typename _Spin = __default_spin_policy>
>+      bool
>+      __atomic_spin(_Pred& __pred, _Spin __spin = _Spin{ }) noexcept
>       {
>-	__platform_wait_t __res;
>-	__atomic_load(&_M_ver, &__res, __ATOMIC_ACQUIRE);
>-	__atomic_fetch_add(&_M_wait, 1, __ATOMIC_ACQ_REL);
>-	return __res;
>+	for (auto __i = 0; __i < __detail::__atomic_spin_count_1; ++__i)
>+	  {
>+	    if (__pred())
>+	      return true;
>+
>+	    if (__i < __detail::__atomic_spin_count_2)
>+	      __detail::__thread_relax();
>+	    else
>+	      __detail::__thread_yield();
>+	  }


I keep wondering (and not bothering to check) whether having two loops
(for counts of 12 and then 4) would make more sense than this branch
in each loop. It doesn't matter though.
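
For reference, the two-loop shape would be something like the sketch
below. The names spin_count_1, spin_count_2, relax and spin_two_phase
are hypothetical, and relax() is only a compiler-level fence standing
in for __thread_relax(), which would normally issue a CPU pause hint.

```cpp
#include <atomic>
#include <thread>

constexpr int spin_count_1 = 16;  // total iterations, as in the patch
constexpr int spin_count_2 = 12;  // iterations that only relax

// Hypothetical stand-in for __thread_relax().
inline void relax() noexcept
{ std::atomic_signal_fence(std::memory_order_seq_cst); }

template<typename Pred>
bool spin_two_phase(Pred& pred) noexcept
{
  // Phase 1: 12 rounds with a cheap relax between checks.
  for (int i = 0; i < spin_count_2; ++i)
    {
      if (pred())
        return true;
      relax();
    }
  // Phase 2: the remaining 4 rounds yield the time slice instead.
  for (int i = spin_count_2; i < spin_count_1; ++i)
    {
      if (pred())
        return true;
      std::this_thread::yield();
    }
  return false;
}
```

The predicate is still called 16 times in total; only the branch inside
the loop body goes away.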

>+	while (__spin())
>+	  {
>+	    if (__pred())
>+	      return true;
>+	  }
>+
>+	return false;
>       }
>
>-      void
>-      _M_leave_wait() noexcept
>+    template<typename _Tp>
>+      bool __atomic_compare(const _Tp& __a, const _Tp& __b)
>       {
>-	__atomic_fetch_sub(&_M_wait, 1, __ATOMIC_ACQ_REL);
>+	// TODO make this do the correct padding bit ignoring comparison
>+	return __builtin_memcmp(&__a, &__b, sizeof(_Tp)) != 0;
>       }
>
>-      void
>-      _M_do_wait(__platform_wait_t __version) noexcept
>-      {
>-#ifdef _GLIBCXX_HAVE_LINUX_FUTEX
>-	__platform_wait(&_M_ver, __version);
>+#ifdef __cpp_lib_hardware_interference_size
>+    struct alignas(hardware_destructive_interference_size)
> #else
>-	__platform_wait_t __cur = 0;
>-	while (__cur <= __version)
>-	  {
>-	    __waiters::__lock_t __l(_M_mtx);
>-	    _M_cv.wait(_M_mtx);
>-	    __platform_wait_t __last = __cur;
>-	    __atomic_load(&_M_ver, &__cur, __ATOMIC_ACQUIRE);
>-	    if (__cur < __last)
>-	      break; // break the loop if version overflows
>-	  }
>+    struct alignas(64)
>+#endif
>+    __waiters_base
>+    {
>+      __platform_wait_t _M_wait = 0;
>+#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT
>+      mutex _M_mtx;
> #endif
>-      }
>+
>+#ifdef __cpp_lib_hardware_interference_size
>+      alignas(hardware_destructive_interference_size)
>+#else
>+      alignas(64)
>+#endif


Please do this #ifdef dance once and define a constant that can be
used in both places, instead of repeating the #ifdef.

e.g.

     struct __waiters_base
     {
#ifdef __cpp_lib_hardware_interference_size
       static constexpr auto _S_align = hardware_destructive_interference_size;
#else
       static constexpr auto _S_align = 64;
#endif

       alignas(_S_align) __platform_wait_t _M_wait = 0;
#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT
       mutex _M_mtx;
#endif

       alignas(_S_align) __platform_wait_t _M_ver = 0;
#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT
       __condvar _M_cond;
#endif

       __waiters_base() = default;

       // ...

>+      __platform_wait_t _M_ver = 0;
>+
>+#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT
>+      __condvar _M_cv;
>+
>+      __waiters_base() noexcept = default;


Should this be outside the #ifdef block?

I think the noexcept is redundant, but harmless.

>+#endif
>+
>+      void
>+      _M_enter_wait() noexcept
>+      { __atomic_fetch_add(&_M_wait, 1, __ATOMIC_ACQ_REL); }
>+
>+      void
>+      _M_leave_wait() noexcept
>+      { __atomic_fetch_sub(&_M_wait, 1, __ATOMIC_ACQ_REL); }
>
>       bool
>       _M_waiting() const noexcept
>       {
> 	__platform_wait_t __res;
> 	__atomic_load(&_M_wait, &__res, __ATOMIC_ACQUIRE);
>-	return __res;
>+	return __res > 0;
>       }
>
>       void
>-      _M_notify(bool __all) noexcept
>+      _M_notify(const __platform_wait_t* __addr, bool __all) noexcept
>       {
>-	__atomic_fetch_add(&_M_ver, 1, __ATOMIC_ACQ_REL);
>+	if (!_M_waiting())
>+	  return;
>+
> #ifdef _GLIBCXX_HAVE_LINUX_FUTEX


Should this check HAVE_PLATFORM_WAIT instead?

>-	__platform_notify(&_M_ver, __all);
>+	__platform_notify(__addr, __all);
> #else
> 	if (__all)
> 	  _M_cv.notify_all();



>+    struct __waiter : __waiter_base<__waiters>
>     {
>-      using namespace __detail;
>-      if (std::__atomic_spin(__pred))
>-	return;
>+      template<typename _Tp>
>+	__waiter(const _Tp* __addr, bool __waiting = true) noexcept


Make this constructor explicit please.

>+	  : __waiter_base(__addr, __waiting)
>+	{ }
>


>diff --git a/libstdc++-v3/include/bits/semaphore_base.h b/libstdc++-v3/include/bits/semaphore_base.h
>index b65717e64d7..95d5414ff80 100644
>--- a/libstdc++-v3/include/bits/semaphore_base.h
>+++ b/libstdc++-v3/include/bits/semaphore_base.h
>@@ -181,40 +181,32 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>       __atomic_semaphore(const __atomic_semaphore&) = delete;
>       __atomic_semaphore& operator=(const __atomic_semaphore&) = delete;
>
>+      static _GLIBCXX_ALWAYS_INLINE bool
>+      _S_do_try_acquire(_Tp* __counter) noexcept
>+      {
>+	auto __old = __atomic_impl::load(__counter, memory_order::acquire);
>+
>+	if (__old == 0)
>+	  return false;
>+
>+	return __atomic_impl::compare_exchange_strong(__counter,
>+						      __old, __old - 1,
>+						      memory_order::acquire,
>+						      memory_order::release);
>+      }


If we keep calling this in a loop it means that we reload the value
every time using atomic_load, despite the compare_exchange telling us
that value. Can't we reuse that value returned from the CAS?

If the caller provides it by reference:

       static _GLIBCXX_ALWAYS_INLINE bool
       _S_do_try_acquire(_Tp* __counter, _Tp& __old) noexcept
       {
	if (__old == 0)
	  return false;
	return __atomic_impl::compare_exchange_strong(__counter,
						      __old, __old - 1,
						      memory_order::acquire,
						      memory_order::release);
       }


>+
>       _GLIBCXX_ALWAYS_INLINE void
>       _M_acquire() noexcept
>       {
>-	auto const __pred = [this]
>-	  {
>-	    auto __old = __atomic_impl::load(&this->_M_counter,
>-			    memory_order::acquire);
>-	    if (__old == 0)
>-	      return false;
>-	    return __atomic_impl::compare_exchange_strong(&this->_M_counter,
>-		      __old, __old - 1,
>-		      memory_order::acquire,
>-		      memory_order::release);
>-	  };
>-	auto __old = __atomic_impl::load(&_M_counter, memory_order_relaxed);
>-	std::__atomic_wait(&_M_counter, __old, __pred);
>+	auto const __pred = [this] { return _S_do_try_acquire(&this->_M_counter); };


Then the predicate can maintain that state:

	auto __old = __atomic_impl::load(&_M_counter, memory_order::acquire);
	auto const __pred = [this, __old] () mutable {
	  return _S_do_try_acquire(&this->_M_counter, __old);
	};

Or is reloading it every time needed, because we do a
yield/relax/spin after the CAS and so the value it returns might be
stale before the next CAS?
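
A standalone sketch of the idea, using std::atomic<int> directly. The
name try_acquire is hypothetical. The sketch also uses relaxed as the
failure order, since the standard does not allow release as the
failure order of a compare-exchange; that choice is an assumption of
the sketch and not something settled in this thread.

```cpp
#include <atomic>

// Hypothetical helper: try to decrement counter, using the caller's
// cached view of its value in old.
inline bool try_acquire(std::atomic<int>& counter, int& old) noexcept
{
  if (old == 0)
    return false;
  // On failure, compare_exchange_strong writes the value it observed
  // into old, so the next attempt starts from fresh data without a
  // separate load.
  return counter.compare_exchange_strong(old, old - 1,
                                         std::memory_order_acquire,
                                         std::memory_order_relaxed);
}
```

Note that once old reads as zero this function never refreshes it, so
a caller looping on it would still need to reload the counter after
each wait; that may be part of the answer to the question above.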

>+	std::__atomic_wait_address(&_M_counter, __pred);

Patch

diff --git a/libstdc++-v3/include/Makefile.am b/libstdc++-v3/include/Makefile.am
index f24a5489e8e..d651e040cf5 100644
--- a/libstdc++-v3/include/Makefile.am
+++ b/libstdc++-v3/include/Makefile.am
@@ -195,6 +195,7 @@  bits_headers = \
 	${bits_srcdir}/std_function.h \
 	${bits_srcdir}/std_mutex.h \
 	${bits_srcdir}/std_thread.h \
+	${bits_srcdir}/std_thread_sleep.h \
 	${bits_srcdir}/stl_algo.h \
 	${bits_srcdir}/stl_algobase.h \
 	${bits_srcdir}/stl_bvector.h \
diff --git a/libstdc++-v3/include/bits/atomic_base.h b/libstdc++-v3/include/bits/atomic_base.h
index 2dc00676054..2e46691c59a 100644
--- a/libstdc++-v3/include/bits/atomic_base.h
+++ b/libstdc++-v3/include/bits/atomic_base.h
@@ -235,22 +235,21 @@  _GLIBCXX_BEGIN_NAMESPACE_VERSION
     wait(bool __old,
 	memory_order __m = memory_order_seq_cst) const noexcept
     {
-      std::__atomic_wait(&_M_i, static_cast<__atomic_flag_data_type>(__old),
-			 [__m, this, __old]()
-			 { return this->test(__m) != __old; });
+      std::__atomic_wait_address_v(&_M_i, static_cast<__atomic_flag_data_type>(__old),
+			 [__m, this] { return this->test(__m); });
     }
 
     // TODO add const volatile overload
 
     _GLIBCXX_ALWAYS_INLINE void
     notify_one() const noexcept
-    { std::__atomic_notify(&_M_i, false); }
+    { std::__atomic_notify_address(&_M_i, false); }
 
     // TODO add const volatile overload
 
     _GLIBCXX_ALWAYS_INLINE void
     notify_all() const noexcept
-    { std::__atomic_notify(&_M_i, true); }
+    { std::__atomic_notify_address(&_M_i, true); }
 
     // TODO add const volatile overload
 #endif // __cpp_lib_atomic_wait
@@ -609,22 +608,21 @@  _GLIBCXX_BEGIN_NAMESPACE_VERSION
       wait(__int_type __old,
 	  memory_order __m = memory_order_seq_cst) const noexcept
       {
-	std::__atomic_wait(&_M_i, __old,
-			   [__m, this, __old]
-			   { return this->load(__m) != __old; });
+	std::__atomic_wait_address_v(&_M_i, __old,
+			   [__m, this] { return this->load(__m); });
       }
 
       // TODO add const volatile overload
 
       _GLIBCXX_ALWAYS_INLINE void
       notify_one() const noexcept
-      { std::__atomic_notify(&_M_i, false); }
+      { std::__atomic_notify_address(&_M_i, false); }
 
       // TODO add const volatile overload
 
       _GLIBCXX_ALWAYS_INLINE void
       notify_all() const noexcept
-      { std::__atomic_notify(&_M_i, true); }
+      { std::__atomic_notify_address(&_M_i, true); }
 
       // TODO add const volatile overload
 #endif // __cpp_lib_atomic_wait
@@ -903,22 +901,22 @@  _GLIBCXX_BEGIN_NAMESPACE_VERSION
       wait(__pointer_type __old,
 	   memory_order __m = memory_order_seq_cst) noexcept
       {
-	std::__atomic_wait(&_M_p, __old,
-		      [__m, this, __old]()
-		      { return this->load(__m) != __old; });
+	std::__atomic_wait_address_v(&_M_p, __old,
+				     [__m, this]
+				     { return this->load(__m); });
       }
 
       // TODO add const volatile overload
 
       _GLIBCXX_ALWAYS_INLINE void
       notify_one() const noexcept
-      { std::__atomic_notify(&_M_p, false); }
+      { std::__atomic_notify_address(&_M_p, false); }
 
       // TODO add const volatile overload
 
       _GLIBCXX_ALWAYS_INLINE void
       notify_all() const noexcept
-      { std::__atomic_notify(&_M_p, true); }
+      { std::__atomic_notify_address(&_M_p, true); }
 
       // TODO add const volatile overload
 #endif // __cpp_lib_atomic_wait
@@ -1017,8 +1015,8 @@  _GLIBCXX_BEGIN_NAMESPACE_VERSION
       wait(const _Tp* __ptr, _Val<_Tp> __old,
 	   memory_order __m = memory_order_seq_cst) noexcept
       {
-	std::__atomic_wait(__ptr, __old,
-	    [=]() { return load(__ptr, __m) == __old; });
+	std::__atomic_wait_address_v(__ptr, __old,
+	    [__ptr, __m]() { return load(__ptr, __m); });
       }
 
       // TODO add const volatile overload
@@ -1026,14 +1024,14 @@  _GLIBCXX_BEGIN_NAMESPACE_VERSION
     template<typename _Tp>
       _GLIBCXX_ALWAYS_INLINE void
       notify_one(const _Tp* __ptr) noexcept
-      { std::__atomic_notify(__ptr, false); }
+      { std::__atomic_notify_address(__ptr, false); }
 
       // TODO add const volatile overload
 
     template<typename _Tp>
       _GLIBCXX_ALWAYS_INLINE void
       notify_all(const _Tp* __ptr) noexcept
-      { std::__atomic_notify(__ptr, true); }
+      { std::__atomic_notify_address(__ptr, true); }
 
       // TODO add const volatile overload
 #endif // __cpp_lib_atomic_wait
diff --git a/libstdc++-v3/include/bits/atomic_timed_wait.h b/libstdc++-v3/include/bits/atomic_timed_wait.h
index a0c5ef4374e..3f8c2904798 100644
--- a/libstdc++-v3/include/bits/atomic_timed_wait.h
+++ b/libstdc++-v3/include/bits/atomic_timed_wait.h
@@ -36,6 +36,7 @@ 
 
 #if __cpp_lib_atomic_wait
 #include <bits/functional_hash.h>
+#include <bits/std_thread_sleep.h>
 
 #include <chrono>
 
@@ -48,19 +49,28 @@  namespace std _GLIBCXX_VISIBILITY(default)
 {
 _GLIBCXX_BEGIN_NAMESPACE_VERSION
 
-  enum class __atomic_wait_status { no_timeout, timeout };
-
   namespace __detail
   {
-#ifdef _GLIBCXX_HAVE_LINUX_FUTEX
-    using __platform_wait_clock_t = chrono::steady_clock;
+    using __wait_clock_t = chrono::steady_clock;
+
+    template<typename _Clock, typename _Dur>
+      __wait_clock_t::time_point
+      __to_wait_clock(const chrono::time_point<_Clock, _Dur>& __atime) noexcept
+      {
+	const typename _Clock::time_point __c_entry = _Clock::now();
+	const __wait_clock_t::time_point __s_entry = __wait_clock_t::now();
+	const auto __delta = __atime - __c_entry;
+	return __s_entry + __delta;
+      }
 
-    template<typename _Duration>
-      __atomic_wait_status
+#ifdef _GLIBCXX_HAVE_LINUX_FUTEX
+#define _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT
+    // returns true if wait ended before timeout
+    template<typename _Dur>
+      bool
       __platform_wait_until_impl(__platform_wait_t* __addr,
-				 __platform_wait_t __val,
-				 const chrono::time_point<
-					  __platform_wait_clock_t, _Duration>&
+				 __platform_wait_t __old,
+				 const chrono::time_point<__wait_clock_t, _Dur>&
 				      __atime) noexcept
       {
 	auto __s = chrono::time_point_cast<chrono::seconds>(__atime);
@@ -75,52 +85,55 @@  _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	auto __e = syscall (SYS_futex, __addr,
 			    static_cast<int>(__futex_wait_flags::
 						__wait_bitset_private),
-			    __val, &__rt, nullptr,
+			    __old, &__rt, nullptr,
 			    static_cast<int>(__futex_wait_flags::
 						__bitset_match_any));
-	if (__e && !(errno == EINTR || errno == EAGAIN || errno == ETIMEDOUT))
-	    std::terminate();
-	return (__platform_wait_clock_t::now() < __atime)
-	       ? __atomic_wait_status::no_timeout
-	       : __atomic_wait_status::timeout;
+
+	if (__e)
+	  {
+	    if ((errno != ETIMEDOUT) && (errno != EINTR)
+		&& (errno != EAGAIN))
+	      __throw_system_error(errno);
+	    return true;
+	  }
+	return false;
       }
 
-    template<typename _Clock, typename _Duration>
-      __atomic_wait_status
-      __platform_wait_until(__platform_wait_t* __addr, __platform_wait_t __val,
-			    const chrono::time_point<_Clock, _Duration>&
-				__atime)
+    // returns true if wait ended before timeout
+    template<typename _Clock, typename _Dur>
+      bool
+      __platform_wait_until(__platform_wait_t* __addr, __platform_wait_t __old,
+			    const chrono::time_point<_Clock, _Dur>& __atime)
       {
-	if constexpr (is_same_v<__platform_wait_clock_t, _Clock>)
+	if constexpr (is_same_v<__wait_clock_t, _Clock>)
 	  {
-	    return __detail::__platform_wait_until_impl(__addr, __val, __atime);
+	    return __platform_wait_until_impl(__addr, __old, __atime);
 	  }
 	else
 	  {
-	    const typename _Clock::time_point __c_entry = _Clock::now();
-	    const __platform_wait_clock_t::time_point __s_entry =
-		    __platform_wait_clock_t::now();
-	    const auto __delta = __atime - __c_entry;
-	    const auto __s_atime = __s_entry + __delta;
-	    if (__detail::__platform_wait_until_impl(__addr, __val, __s_atime)
-		  == __atomic_wait_status::no_timeout)
-	      return __atomic_wait_status::no_timeout;
-
-	    // We got a timeout when measured against __clock_t but
-	    // we need to check against the caller-supplied clock
-	    // to tell whether we should return a timeout.
-	    if (_Clock::now() < __atime)
-	      return __atomic_wait_status::no_timeout;
-	    return __atomic_wait_status::timeout;
+	    if (!__platform_wait_until_impl(__addr, __old,
+					    __to_wait_clock(__atime)))
+	      {
+		// We got a timeout when measured against __clock_t but
+		// we need to check against the caller-supplied clock
+		// to tell whether we should return a timeout.
+		if (_Clock::now() < __atime)
+		  return true;
+	      }
+	    return false;
 	  }
       }
-#else // ! FUTEX
-
-#ifdef _GLIBCXX_USE_PTHREAD_COND_CLOCKWAIT
-    template<typename _Duration>
-      __atomic_wait_status
+#else
+// define _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT and implement __platform_wait_until()
+// if there is a more efficient primitive supported by the platform
+// (e.g. __ulock_wait())which is better than pthread_cond_clockwait
+#endif // ! PLATFORM_TIMED_WAIT
+
+    // returns true if wait ended before timeout
+    template<typename _Dur>
+      bool
       __cond_wait_until_impl(__condvar& __cv, mutex& __mx,
-	  const chrono::time_point<chrono::steady_clock, _Duration>& __atime)
+	  const chrono::time_point<chrono::steady_clock, _Dur>& __atime)
       {
 	auto __s = chrono::time_point_cast<chrono::seconds>(__atime);
 	auto __ns = chrono::duration_cast<chrono::nanoseconds>(__atime - __s);
@@ -131,40 +144,20 @@  _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	    static_cast<long>(__ns.count())
 	  };
 
+#ifdef _GLIBCXX_USE_PTHREAD_COND_CLOCKWAIT
 	__cv.wait_until(__mx, CLOCK_MONOTONIC, __ts);
-
-	return (chrono::steady_clock::now() < __atime)
-	       ? __atomic_wait_status::no_timeout
-	       : __atomic_wait_status::timeout;
-      }
-#endif
-
-    template<typename _Duration>
-      __atomic_wait_status
-      __cond_wait_until_impl(__condvar& __cv, mutex& __mx,
-	  const chrono::time_point<chrono::system_clock, _Duration>& __atime)
-      {
-	auto __s = chrono::time_point_cast<chrono::seconds>(__atime);
-	auto __ns = chrono::duration_cast<chrono::nanoseconds>(__atime - __s);
-
-	__gthread_time_t __ts =
-	{
-	  static_cast<std::time_t>(__s.time_since_epoch().count()),
-	  static_cast<long>(__ns.count())
-	};
-
+	return chrono::steady_clock::now() < __atime;
+#else
 	__cv.wait_until(__mx, __ts);
-
-	return (chrono::system_clock::now() < __atime)
-	       ? __atomic_wait_status::no_timeout
-	       : __atomic_wait_status::timeout;
+	return chrono::system_clock::now() < __atime;
+#endif // ! _GLIBCXX_USE_PTHREAD_COND_CLOCKWAIT
       }
 
-    // return true if timeout
-    template<typename _Clock, typename _Duration>
-      __atomic_wait_status
+    // returns true if wait ended before timeout
+    template<typename _Clock, typename _Dur>
+      bool
       __cond_wait_until(__condvar& __cv, mutex& __mx,
-	  const chrono::time_point<_Clock, _Duration>& __atime)
+	  const chrono::time_point<_Clock, _Dur>& __atime)
       {
 #ifndef _GLIBCXX_USE_PTHREAD_COND_CLOCKWAIT
 	using __clock_t = chrono::system_clock;
@@ -178,118 +171,229 @@  _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	  return __detail::__cond_wait_until_impl(__cv, __mx, __atime);
 	else
 	  {
-	    const typename _Clock::time_point __c_entry = _Clock::now();
-	    const __clock_t::time_point __s_entry = __clock_t::now();
-	    const auto __delta = __atime - __c_entry;
-	    const auto __s_atime = __s_entry + __delta;
-	    if (__detail::__cond_wait_until_impl(__cv, __mx, __s_atime)
-		== __atomic_wait_status::no_timeout)
-	      return __atomic_wait_status::no_timeout;
-	    // We got a timeout when measured against __clock_t but
-	    // we need to check against the caller-supplied clock
-	    // to tell whether we should return a timeout.
-	    if (_Clock::now() < __atime)
-	      return __atomic_wait_status::no_timeout;
-	    return __atomic_wait_status::timeout;
+	    if (!__cond_wait_until_impl(__cv, __mx,
+					__to_wait_clock(__atime)))
+	      {
+		// We got a timeout when measured against __clock_t but
+		// we need to check against the caller-supplied clock
+		// to tell whether we should return a timeout.
+		if (_Clock::now() >= __atime)
+		  return false;
+	      }
+	    return true;
 	  }
       }
-#endif // FUTEX
 
-    struct __timed_waiters : __waiters
+    struct __timed_waiters : __waiters_base
     {
-      template<typename _Clock, typename _Duration>
-	__atomic_wait_status
-	_M_do_wait_until(__platform_wait_t __version,
-			 const chrono::time_point<_Clock, _Duration>& __atime)
+      // returns true if wait ended before timeout
+      template<typename _Clock, typename _Dur>
+	bool
+	_M_do_wait_until(__platform_wait_t* __addr, __platform_wait_t __old,
+			 const chrono::time_point<_Clock, _Dur>& __atime)
 	{
-#ifdef _GLIBCXX_HAVE_LINUX_FUTEX
-	  return __detail::__platform_wait_until(&_M_ver, __version, __atime);
+#ifdef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT
+	  return __platform_wait_until(__addr, __old, __atime);
 #else
-	  __platform_wait_t __cur = 0;
-	  __waiters::__lock_t __l(_M_mtx);
-	  while (__cur <= __version)
+	  __platform_wait_t __val;
+	  __atomic_load(__addr, &__val, __ATOMIC_RELAXED);
+	  if (__val == __old)
 	    {
-	      if (__detail::__cond_wait_until(_M_cv, _M_mtx, __atime)
-		    == __atomic_wait_status::timeout)
-		return __atomic_wait_status::timeout;
-
-	      __platform_wait_t __last = __cur;
-	      __atomic_load(&_M_ver, &__cur, __ATOMIC_ACQUIRE);
-	      if (__cur < __last)
-		break; // break the loop if version overflows
+	      lock_guard<mutex> __l(_M_mtx);
+	      return __cond_wait_until(_M_cv, _M_mtx, __atime);
 	    }
-	  return __atomic_wait_status::no_timeout;
-#endif
+#endif // _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT
 	}
+    };
 
-      static __timed_waiters&
-      _S_timed_for(void* __t)
+    struct __timed_backoff_spin_policy
+    {
+      __wait_clock_t::time_point _M_deadline;
+      __wait_clock_t::time_point _M_t0;
+
+      template<typename _Clock, typename _Dur>
+	__timed_backoff_spin_policy(chrono::time_point<_Clock, _Dur>
+				      __deadline = _Clock::time_point::max(),
+				    chrono::time_point<_Clock, _Dur>
+				      __t0 = _Clock::now()) noexcept
+	  : _M_deadline(__to_wait_clock(__deadline))
+	  , _M_t0(__to_wait_clock(__t0))
+	{ }
+
+      bool
+      operator()() noexcept
       {
-	static_assert(sizeof(__timed_waiters) == sizeof(__waiters));
-	return static_cast<__timed_waiters&>(__waiters::_S_for(__t));
+	using namespace literals::chrono_literals;
+	auto __now = __wait_clock_t::now();
+	if (_M_deadline <= __now)
+	  return false;
+	auto __elapsed = __now - _M_t0;
+	if (__elapsed > 128ms)
+	  {
+	    this_thread::sleep_for(64ms);
+	  }
+	else if (__elapsed > 64us)
+	  {
+	    this_thread::sleep_for(__elapsed / 2);
+	  }
+	else if (__elapsed > 4us)
+	  {
+	    __thread_yield();
+	  }
+	else
+	  return false;
+	return true;
       }
     };
-  } // namespace __detail
 
-  template<typename _Tp, typename _Pred,
-	   typename _Clock, typename _Duration>
-    bool
-    __atomic_wait_until(const _Tp* __addr, _Tp __old, _Pred __pred,
-			const chrono::time_point<_Clock, _Duration>&
-			    __atime) noexcept
+    struct __timed_waiter : __waiter_base<__timed_waiters>
     {
-      using namespace __detail;
-
-      if (std::__atomic_spin(__pred))
-	return true;
+      template<typename _Tp>
+	__timed_waiter(const _Tp* __addr, bool __waiting = true) noexcept
+	  : __waiter_base(__addr, __waiting)
+	{ }
+
+      // returns true if wait ended before timeout
+      template<typename _Tp, typename _ValFn,
+	       typename _Clock, typename _Dur>
+	bool
+	_M_do_wait_until_v(_Tp __old, _ValFn __vfn,
+			   const chrono::time_point<_Clock, _Dur>&
+							      __atime) noexcept
+	{
+	  __platform_wait_t __val;
+	  if (_M_do_spin_v(__old, move(__vfn), __val,
+			   __timed_backoff_spin_policy(__atime)))
+	    return true;
+	  return _M_w._M_do_wait_until(_M_addr, __val, __atime);
+	}
 
-      auto& __w = __timed_waiters::_S_timed_for((void*)__addr);
-      auto __version = __w._M_enter_wait();
-      do
+      // returns true if wait ended before timeout
+      template<typename _Pred,
+	       typename _Clock, typename _Dur>
+	bool
+	_M_do_wait_until(_Pred __pred, __platform_wait_t __val,
+			const chrono::time_point<_Clock, _Dur>&
+							    __atime) noexcept
 	{
-	  __atomic_wait_status __res;
-#ifdef _GLIBCXX_HAVE_LINUX_FUTEX
-	  if constexpr (__platform_wait_uses_type<_Tp>)
-	    {
-	      __res = __detail::__platform_wait_until((__platform_wait_t*)(void*) __addr,
-						      __old, __atime);
-	    }
-	  else
-#endif
+	  for (auto __now = _Clock::now(); __now < __atime;
+		__now = _Clock::now())
 	    {
-	      __res = __w._M_do_wait_until(__version, __atime);
+	      if (_M_w._M_do_wait_until(_M_addr, __val, __atime) && __pred())
+		return true;
+
+	      if (_M_do_spin(__pred, __val,
+			     __timed_backoff_spin_policy(__atime, __now)))
+		return true;
 	    }
-	  if (__res == __atomic_wait_status::timeout)
-	    return false;
+	  return false;
+	}
+
+      // returns true if wait ended before timeout
+      template<typename _Pred,
+	       typename _Clock, typename _Dur>
+	bool
+	_M_do_wait_until(_Pred __pred,
+			const chrono::time_point<_Clock, _Dur>&
+							      __atime) noexcept
+	{
+	  __platform_wait_t __val;
+	  if (_M_do_spin(__pred, __val,
+			  __timed_backoff_spin_policy(__atime)))
+	    return true;
+	  return _M_do_wait_until(__pred, __val, __atime);
+	}
+
+      template<typename _Tp, typename _ValFn,
+	       typename _Rep, typename _Period>
+	bool
+	_M_do_wait_for_v(_Tp __old, _ValFn __vfn,
+			 const chrono::duration<_Rep, _Period>&
+							      __rtime) noexcept
+	{
+	  __platform_wait_t __val;
+	  if (_M_do_spin_v(__old, move(__vfn), __val))
+	    return true;
+
+	  if (!__rtime.count())
+	    return false; // no rtime supplied, and spin did not acquire
+
+	  using __dur = chrono::steady_clock::duration;
+	  auto __reltime = chrono::duration_cast<__dur>(__rtime);
+	  if (__reltime < __rtime)
+	    ++__reltime;
+
+	  return _M_w._M_do_wait_until(_M_addr, __val,
+				       chrono::steady_clock::now() + __reltime);
 	}
-      while (!__pred() && __atime < _Clock::now());
-      __w._M_leave_wait();
 
-      // if timed out, return false
-      return (_Clock::now() < __atime);
+      template<typename _Pred,
+	       typename _Rep, typename _Period>
+	bool
+	_M_do_wait_for(_Pred __pred,
+		       const chrono::duration<_Rep, _Period>& __rtime) noexcept
+	{
+	  __platform_wait_t __val;
+	  if (_M_do_spin(__pred, __val))
+	    return true;
+
+	  if (!__rtime.count())
+	    return false; // no rtime supplied, and spin did not acquire
+
+	  using __dur = chrono::steady_clock::duration;
+	  auto __reltime = chrono::duration_cast<__dur>(__rtime);
+	  if (__reltime < __rtime)
+	    ++__reltime;
+
+	  return _M_do_wait_until(__pred, __val,
+				  chrono::steady_clock::now() + __reltime);
+	}
+    };
+  } // namespace __detail
+
+  // returns true if wait ended before timeout
+  template<typename _Tp, typename _ValFn,
+	   typename _Clock, typename _Dur>
+    bool
+    __atomic_wait_address_until_v(const _Tp* __addr, _Tp&& __old, _ValFn&& __vfn,
+			const chrono::time_point<_Clock, _Dur>&
+			    __atime) noexcept
+    {
+      __detail::__timed_waiter __w{__addr};
+      return __w._M_do_wait_until_v(__old, __vfn, __atime);
     }
 
   template<typename _Tp, typename _Pred,
+	   typename _Clock, typename _Dur>
+    bool
+    __atomic_wait_address_until(const _Tp* __addr, _Pred __pred,
+				const chrono::time_point<_Clock, _Dur>&
+							      __atime) noexcept
+    {
+      __detail::__timed_waiter __w{__addr};
+      return __w._M_do_wait_until(__pred, __atime);
+    }
+
+  template<typename _Tp, typename _ValFn,
 	   typename _Rep, typename _Period>
     bool
-    __atomic_wait_for(const _Tp* __addr, _Tp __old, _Pred __pred,
+    __atomic_wait_address_for_v(const _Tp* __addr, _Tp&& __old, _ValFn&& __vfn,
 		      const chrono::duration<_Rep, _Period>& __rtime) noexcept
     {
-      using namespace __detail;
-
-      if (std::__atomic_spin(__pred))
-	return true;
 
-      if (!__rtime.count())
-	return false; // no rtime supplied, and spin did not acquire
+      __detail::__timed_waiter __w{__addr};
+      return __w._M_do_wait_for_v(__old, __vfn, __rtime);
+    }
 
-      using __dur = chrono::steady_clock::duration;
-      auto __reltime = chrono::duration_cast<__dur>(__rtime);
-      if (__reltime < __rtime)
-	++__reltime;
+  template<typename _Tp, typename _Pred,
+	   typename _Rep, typename _Period>
+    bool
+    __atomic_wait_address_for(const _Tp* __addr, _Pred __pred,
+		      const chrono::duration<_Rep, _Period>& __rtime) noexcept
+    {
 
-      return __atomic_wait_until(__addr, __old, std::move(__pred),
-				 chrono::steady_clock::now() + __reltime);
+      __detail::__timed_waiter __w{__addr};
+      return __w._M_do_wait_for(__pred, __rtime);
     }
 _GLIBCXX_END_NAMESPACE_VERSION
 } // namespace std
diff --git a/libstdc++-v3/include/bits/atomic_wait.h b/libstdc++-v3/include/bits/atomic_wait.h
index 1a0f0943ebd..fa83ef6c231 100644
--- a/libstdc++-v3/include/bits/atomic_wait.h
+++ b/libstdc++-v3/include/bits/atomic_wait.h
@@ -39,17 +39,16 @@ 
 #include <ext/numeric_traits.h>
 
 #ifdef _GLIBCXX_HAVE_LINUX_FUTEX
+#define _GLIBCXX_HAVE_PLATFORM_WAIT 1
 # include <cerrno>
 # include <climits>
 # include <unistd.h>
 # include <syscall.h>
 # include <bits/functexcept.h>
-// TODO get this from Autoconf
-# define _GLIBCXX_HAVE_LINUX_FUTEX_PRIVATE 1
-#else
-# include <bits/std_mutex.h>  // std::mutex, std::__condvar
 #endif
 
+# include <bits/std_mutex.h>  // std::mutex, std::__condvar
+
 #define __cpp_lib_atomic_wait 201907L
 
 namespace std _GLIBCXX_VISIBILITY(default)
@@ -57,20 +56,27 @@  namespace std _GLIBCXX_VISIBILITY(default)
 _GLIBCXX_BEGIN_NAMESPACE_VERSION
   namespace __detail
   {
+#ifdef _GLIBCXX_HAVE_LINUX_FUTEX
     using __platform_wait_t = int;
+#else
+    using __platform_wait_t = uint64_t;
+#endif
+  } // namespace __detail
 
-    constexpr auto __atomic_spin_count_1 = 16;
-    constexpr auto __atomic_spin_count_2 = 12;
-
-    template<typename _Tp>
-      inline constexpr bool __platform_wait_uses_type
-#ifdef _GLIBCXX_HAVE_LINUX_FUTEX
-	= is_same_v<remove_cv_t<_Tp>, __platform_wait_t>;
+  template<typename _Tp>
+    inline constexpr bool __platform_wait_uses_type
+#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT
+      = is_same_v<remove_cv_t<_Tp>, __detail::__platform_wait_t>
+	|| ((sizeof(_Tp) == sizeof(__detail::__platform_wait_t))
+	    && (alignof(_Tp*) == alignof(__detail::__platform_wait_t)));
 #else
-	= false;
+      = false;
 #endif
 
+  namespace __detail
+  {
 #ifdef _GLIBCXX_HAVE_LINUX_FUTEX
+#define _GLIBCXX_HAVE_PLATFORM_WAIT 1
     enum class __futex_wait_flags : int
     {
 #ifdef _GLIBCXX_HAVE_LINUX_FUTEX_PRIVATE
@@ -93,16 +99,13 @@  _GLIBCXX_BEGIN_NAMESPACE_VERSION
       void
       __platform_wait(const _Tp* __addr, __platform_wait_t __val) noexcept
       {
-	for(;;)
-	  {
-	    auto __e = syscall (SYS_futex, static_cast<const void*>(__addr),
-				  static_cast<int>(__futex_wait_flags::__wait_private),
-				    __val, nullptr);
-	    if (!__e || errno == EAGAIN)
-	      break;
-	    else if (errno != EINTR)
-	      __throw_system_error(__e);
-	  }
+	auto __e = syscall (SYS_futex, static_cast<const void*>(__addr),
+			    static_cast<int>(__futex_wait_flags::__wait_private),
+			    __val, nullptr);
+	if (!__e || errno == EAGAIN)
+	  return;
+	if (errno != EINTR)
+	  __throw_system_error(errno);
       }
 
     template<typename _Tp>
@@ -110,72 +113,125 @@  _GLIBCXX_BEGIN_NAMESPACE_VERSION
       __platform_notify(const _Tp* __addr, bool __all) noexcept
       {
 	syscall (SYS_futex, static_cast<const void*>(__addr),
-		  static_cast<int>(__futex_wait_flags::__wake_private),
-		    __all ? INT_MAX : 1);
+		 static_cast<int>(__futex_wait_flags::__wake_private),
+		 __all ? INT_MAX : 1);
       }
+#else
+// define _GLIBCXX_HAVE_PLATFORM_WAIT and implement __platform_wait()
+// and __platform_notify() if there is a more efficient primitive supported
+// by the platform (e.g. __ulock_wait()/__ulock_wake()) which is better than
+// a mutex/condvar based wait
 #endif
 
-    struct __waiters
+    inline void
+    __thread_yield() noexcept
     {
-      alignas(64) __platform_wait_t _M_ver = 0;
-      alignas(64) __platform_wait_t _M_wait = 0;
-
-#ifndef _GLIBCXX_HAVE_LINUX_FUTEX
-      using __lock_t = lock_guard<mutex>;
-      mutex _M_mtx;
-      __condvar _M_cv;
+#if defined _GLIBCXX_HAS_GTHREADS && defined _GLIBCXX_USE_SCHED_YIELD
+      __gthread_yield();
+#endif
+    }
 
-      __waiters() noexcept = default;
+    inline void
+    __thread_relax() noexcept
+    {
+#if defined __i386__ || defined __x86_64__
+      __builtin_ia32_pause();
+#else
+      __thread_yield();
 #endif
+    }
 
-      __platform_wait_t
-      _M_enter_wait() noexcept
+    constexpr auto __atomic_spin_count_1 = 16;
+    constexpr auto __atomic_spin_count_2 = 12;
+
+    struct __default_spin_policy
+    {
+      bool
+      operator()() noexcept
+      { return false; }
+    };
+
+    template<typename _Pred,
+	     typename _Spin = __default_spin_policy>
+      bool
+      __atomic_spin(_Pred& __pred, _Spin __spin = _Spin{ }) noexcept
       {
-	__platform_wait_t __res;
-	__atomic_load(&_M_ver, &__res, __ATOMIC_ACQUIRE);
-	__atomic_fetch_add(&_M_wait, 1, __ATOMIC_ACQ_REL);
-	return __res;
+	for (auto __i = 0; __i < __detail::__atomic_spin_count_1; ++__i)
+	  {
+	    if (__pred())
+	      return true;
+
+	    if (__i < __detail::__atomic_spin_count_2)
+	      __detail::__thread_relax();
+	    else
+	      __detail::__thread_yield();
+	  }
+
+	while (__spin())
+	  {
+	    if (__pred())
+	      return true;
+	  }
+
+	return false;
       }
 
-      void
-      _M_leave_wait() noexcept
+    template<typename _Tp>
+      bool __atomic_compare(const _Tp& __a, const _Tp& __b)
       {
-	__atomic_fetch_sub(&_M_wait, 1, __ATOMIC_ACQ_REL);
+	// TODO make this do the correct padding bit ignoring comparison
+	return __builtin_memcmp(&__a, &__b, sizeof(_Tp)) != 0;
       }
 
-      void
-      _M_do_wait(__platform_wait_t __version) noexcept
-      {
-#ifdef _GLIBCXX_HAVE_LINUX_FUTEX
-	__platform_wait(&_M_ver, __version);
+#ifdef __cpp_lib_hardware_interference_size
+    struct alignas(hardware_destructive_interference_size)
 #else
-	__platform_wait_t __cur = 0;
-	while (__cur <= __version)
-	  {
-	    __waiters::__lock_t __l(_M_mtx);
-	    _M_cv.wait(_M_mtx);
-	    __platform_wait_t __last = __cur;
-	    __atomic_load(&_M_ver, &__cur, __ATOMIC_ACQUIRE);
-	    if (__cur < __last)
-	      break; // break the loop if version overflows
-	  }
+    struct alignas(64)
+#endif
+    __waiters_base
+    {
+      __platform_wait_t _M_wait = 0;
+#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT
+      mutex _M_mtx;
 #endif
-      }
+
+#ifdef __cpp_lib_hardware_interference_size
+      alignas(hardware_destructive_interference_size)
+#else
+      alignas(64)
+#endif
+      __platform_wait_t _M_ver = 0;
+
+#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT
+      __condvar _M_cv;
+
+      __waiters_base() noexcept = default;
+#endif
+
+      void
+      _M_enter_wait() noexcept
+      { __atomic_fetch_add(&_M_wait, 1, __ATOMIC_ACQ_REL); }
+
+      void
+      _M_leave_wait() noexcept
+      { __atomic_fetch_sub(&_M_wait, 1, __ATOMIC_ACQ_REL); }
 
       bool
       _M_waiting() const noexcept
       {
 	__platform_wait_t __res;
 	__atomic_load(&_M_wait, &__res, __ATOMIC_ACQUIRE);
-	return __res;
+	return __res > 0;
       }
 
       void
-      _M_notify(bool __all) noexcept
+      _M_notify(const __platform_wait_t* __addr, bool __all) noexcept
       {
-	__atomic_fetch_add(&_M_ver, 1, __ATOMIC_ACQ_REL);
+	if (!_M_waiting())
+	  return;
+
 #ifdef _GLIBCXX_HAVE_LINUX_FUTEX
-	__platform_notify(&_M_ver, __all);
+	__platform_notify(__addr, __all);
 #else
 	if (__all)
 	  _M_cv.notify_all();
@@ -184,114 +240,172 @@  _GLIBCXX_BEGIN_NAMESPACE_VERSION
 #endif
       }
 
-      static __waiters&
-      _S_for(const void* __t)
+      static __waiters_base&
+      _S_for(const void* __addr)
       {
-	const unsigned char __mask = 0xf;
-	static __waiters __w[__mask + 1];
-
-	auto __key = _Hash_impl::hash(__t) & __mask;
+	constexpr auto __mask = 0xf;
+	static __waiters_base __w[__mask + 1];
+	auto __key = _Hash_impl::hash(__addr) & __mask;
 	return __w[__key];
       }
     };
 
-    struct __waiter
+    struct __waiters : __waiters_base
     {
-      __waiters& _M_w;
-      __platform_wait_t _M_version;
-
-      template<typename _Tp>
-	__waiter(const _Tp* __addr) noexcept
-	  : _M_w(__waiters::_S_for(static_cast<const void*>(__addr)))
-	  , _M_version(_M_w._M_enter_wait())
-	{ }
+      void
+      _M_do_wait(__platform_wait_t* __addr, __platform_wait_t __old) noexcept
+      {
+#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT
+	__platform_wait(__addr, __old);
+#else
+	__platform_wait_t __val;
+	__atomic_load(__addr, &__val, __ATOMIC_RELAXED);
+	if (__val == __old)
+	  {
+	    lock_guard<mutex> __l(_M_mtx);
+	    _M_cv.wait(_M_mtx);
+	  }
+#endif // _GLIBCXX_HAVE_PLATFORM_WAIT
+      }
+    };
 
-      ~__waiter()
-      { _M_w._M_leave_wait(); }
+    template<typename _Tp>
+      struct __waiter_base
+      {
+	using __waiter_type = _Tp;
 
-      void _M_do_wait() noexcept
-      { _M_w._M_do_wait(_M_version); }
-    };
+	__waiter_type& _M_w;
+	__platform_wait_t* _M_addr;
+	bool _M_waiting;
 
-    inline void
-    __thread_relax() noexcept
-    {
-#if defined __i386__ || defined __x86_64__
-      __builtin_ia32_pause();
-#elif defined _GLIBCXX_USE_SCHED_YIELD
-      __gthread_yield();
-#endif
-    }
+	template<typename _Up>
+	  static __platform_wait_t*
+	  _S_wait_addr(const _Up* __a, __platform_wait_t* __b)
+	  {
+	    if constexpr (__platform_wait_uses_type<_Up>)
+	      return reinterpret_cast<__platform_wait_t*>(const_cast<_Up*>(__a));
+	    else
+	      return __b;
+	  }
 
-    inline void
-    __thread_yield() noexcept
-    {
-#if defined _GLIBCXX_USE_SCHED_YIELD
-     __gthread_yield();
-#endif
-    }
+	template<typename _Up>
+	  static __waiter_type&
+	  _S_for(const _Up* __addr)
+	  {
+	    static_assert(sizeof(__waiter_type) == sizeof(__waiters_base));
+	    auto& __res = __waiters_base::_S_for(static_cast<const void*>(__addr));
+	    return reinterpret_cast<__waiter_type&>(__res);
+	  }
 
-  } // namespace __detail
+	template<typename _Up>
+	  __waiter_base(const _Up* __addr, bool __waiting) noexcept
+	    : _M_w(_S_for(__addr))
+	    , _M_addr(_S_wait_addr(__addr, &_M_w._M_ver))
+	    , _M_waiting(__waiting)
+	  { if (_M_waiting) _M_w._M_enter_wait(); }
 
-  template<typename _Pred>
-    bool
-    __atomic_spin(_Pred& __pred) noexcept
-    {
-      for (auto __i = 0; __i < __detail::__atomic_spin_count_1; ++__i)
+	~__waiter_base()
 	{
-	  if (__pred())
-	    return true;
+	  if (_M_waiting)
+	    _M_w._M_leave_wait();
+	}
 
-	  if (__i < __detail::__atomic_spin_count_2)
-	    __detail::__thread_relax();
-	  else
-	    __detail::__thread_yield();
+	void
+	_M_notify(bool __all)
+	{
+	  if (_M_addr == &_M_w._M_ver)
+	    __atomic_fetch_add(_M_addr, 1, __ATOMIC_ACQ_REL);
+	  _M_w._M_notify(_M_addr, __all);
 	}
-      return false;
-    }
 
-  template<typename _Tp, typename _Pred>
-    void
-    __atomic_wait(const _Tp* __addr, _Tp __old, _Pred __pred) noexcept
+	template<typename _Up, typename _ValFn,
+		 typename _Spin = __default_spin_policy>
+	  bool
+	  _M_do_spin_v(const _Up& __old, _ValFn __vfn,
+		       __platform_wait_t& __val,
+		       _Spin __spin = _Spin{ })
+	  {
+	    auto const __pred = [=]
+	      { return __atomic_compare(__old, __vfn()); };
+
+	    if constexpr (__platform_wait_uses_type<_Up>)
+	      {
+		__builtin_memcpy(&__val, &__old, sizeof(__val));
+	      }
+	    else
+	      {
+		__atomic_load(_M_addr, &__val, __ATOMIC_RELAXED);
+	      }
+	    return __atomic_spin(__pred, __spin);
+	  }
+
+	template<typename _Pred,
+		 typename _Spin = __default_spin_policy>
+	  bool
+	  _M_do_spin(_Pred __pred, __platform_wait_t& __val,
+	             _Spin __spin = _Spin{ })
+	  {
+	    __atomic_load(_M_addr, &__val, __ATOMIC_RELAXED);
+	    return __atomic_spin(__pred, __spin);
+	  }
+      };
+
+    struct __waiter : __waiter_base<__waiters>
     {
-      using namespace __detail;
-      if (std::__atomic_spin(__pred))
-	return;
+      template<typename _Tp>
+	__waiter(const _Tp* __addr, bool __waiting = true) noexcept
+	  : __waiter_base(__addr, __waiting)
+	{ }
 
-      __waiter __w(__addr);
-      while (!__pred())
+      template<typename _Tp, typename _ValFn>
+	void
+	_M_do_wait_v(_Tp __old, _ValFn __vfn)
 	{
-	  if constexpr (__platform_wait_uses_type<_Tp>)
-	    {
-	      __platform_wait(__addr, __old);
-	    }
-	  else
+	  __platform_wait_t __val;
+	  if (_M_do_spin_v(__old, __vfn, __val))
+	    return;
+	  _M_w._M_do_wait(_M_addr, __val);
+	}
+
+      template<typename _Pred>
+	void
+	_M_do_wait(_Pred __pred)
+	{
+	  do
 	    {
-	      // TODO support timed backoff when this can be moved into the lib
-	      __w._M_do_wait();
+	      __platform_wait_t __val;
+	      if (_M_do_spin(__pred, __val))
+		return;
+	      _M_w._M_do_wait(_M_addr, __val);
 	    }
+	  while (!__pred());
 	}
+    };
+  } // namespace __detail
+
+  template<typename _Tp, typename _ValFn>
+    void
+    __atomic_wait_address_v(const _Tp* __addr, _Tp __old,
+			    _ValFn __vfn) noexcept
+    {
+      __detail::__waiter __w(__addr);
+      __w._M_do_wait_v(__old, __vfn);
     }
 
+  template<typename _Tp, typename _Pred>
+    void
+    __atomic_wait_address(const _Tp* __addr, _Pred __pred) noexcept
+    {
+      __detail::__waiter __w(__addr);
+      __w._M_do_wait(__pred);
+    }
+
   template<typename _Tp>
     void
-    __atomic_notify(const _Tp* __addr, bool __all) noexcept
+    __atomic_notify_address(const _Tp* __addr, bool __all) noexcept
     {
-      using namespace __detail;
-      auto& __w = __waiters::_S_for((void*)__addr);
-      if (!__w._M_waiting())
-	return;
-
-#ifdef _GLIBCXX_HAVE_LINUX_FUTEX
-      if constexpr (__platform_wait_uses_type<_Tp>)
-	{
-	  __platform_notify((__platform_wait_t*)(void*) __addr, __all);
-	}
-      else
-#endif
-	{
-	  __w._M_notify(__all);
-	}
+      __detail::__waiter __w(__addr, false);
+      __w._M_notify(__all);
     }
 _GLIBCXX_END_NAMESPACE_VERSION
 } // namespace std
diff --git a/libstdc++-v3/include/bits/semaphore_base.h b/libstdc++-v3/include/bits/semaphore_base.h
index b65717e64d7..95d5414ff80 100644
--- a/libstdc++-v3/include/bits/semaphore_base.h
+++ b/libstdc++-v3/include/bits/semaphore_base.h
@@ -181,40 +181,32 @@  _GLIBCXX_BEGIN_NAMESPACE_VERSION
       __atomic_semaphore(const __atomic_semaphore&) = delete;
       __atomic_semaphore& operator=(const __atomic_semaphore&) = delete;
 
+      static _GLIBCXX_ALWAYS_INLINE bool
+      _S_do_try_acquire(_Tp* __counter) noexcept
+      {
+	auto __old = __atomic_impl::load(__counter, memory_order::acquire);
+
+	if (__old == 0)
+	  return false;
+
+	return __atomic_impl::compare_exchange_strong(__counter,
+						      __old, __old - 1,
+						      memory_order::acquire,
+						      memory_order::release);
+      }
+
       _GLIBCXX_ALWAYS_INLINE void
       _M_acquire() noexcept
       {
-	auto const __pred = [this]
-	  {
-	    auto __old = __atomic_impl::load(&this->_M_counter,
-			    memory_order::acquire);
-	    if (__old == 0)
-	      return false;
-	    return __atomic_impl::compare_exchange_strong(&this->_M_counter,
-		      __old, __old - 1,
-		      memory_order::acquire,
-		      memory_order::release);
-	  };
-	auto __old = __atomic_impl::load(&_M_counter, memory_order_relaxed);
-	std::__atomic_wait(&_M_counter, __old, __pred);
+	auto const __pred = [this] { return _S_do_try_acquire(&this->_M_counter); };
+	std::__atomic_wait_address(&_M_counter, __pred);
       }
 
       bool
       _M_try_acquire() noexcept
       {
-	auto __old = __atomic_impl::load(&_M_counter, memory_order::acquire);
-	auto const __pred = [this, __old]
-	  {
-	    if (__old == 0)
-	      return false;
-
-	    auto __prev = __old;
-	    return __atomic_impl::compare_exchange_weak(&this->_M_counter,
-		      __prev, __prev - 1,
-		      memory_order::acquire,
-		      memory_order::release);
-	  };
-	return std::__atomic_spin(__pred);
+	auto const __pred = [this] { return _S_do_try_acquire(&this->_M_counter); };
+	return std::__detail::__atomic_spin(__pred);
       }
 
       template<typename _Clock, typename _Duration>
@@ -222,20 +214,10 @@  _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	_M_try_acquire_until(const chrono::time_point<_Clock,
 			     _Duration>& __atime) noexcept
 	{
-	  auto const __pred = [this]
-	    {
-	      auto __old = __atomic_impl::load(&this->_M_counter,
-			      memory_order::acquire);
-	      if (__old == 0)
-		return false;
-	      return __atomic_impl::compare_exchange_strong(&this->_M_counter,
-			      __old, __old - 1,
-			      memory_order::acquire,
-			      memory_order::release);
-	    };
+	  auto const __pred = [this] { return _S_do_try_acquire(&this->_M_counter); };
 
 	  auto __old = __atomic_impl::load(&_M_counter, memory_order_relaxed);
-	  return __atomic_wait_until(&_M_counter, __old, __pred, __atime);
+	  return __atomic_wait_address_until(&_M_counter, __pred, __atime);
 	}
 
       template<typename _Rep, typename _Period>
@@ -243,20 +225,9 @@  _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	_M_try_acquire_for(const chrono::duration<_Rep, _Period>& __rtime)
 	  noexcept
 	{
-	  auto const __pred = [this]
-	    {
-	      auto __old = __atomic_impl::load(&this->_M_counter,
-			      memory_order::acquire);
-	      if (__old == 0)
-		return false;
-	      return  __atomic_impl::compare_exchange_strong(&this->_M_counter,
-			      __old, __old - 1,
-			      memory_order::acquire,
-			      memory_order::release);
-	    };
+	  auto const __pred = [this] { return _S_do_try_acquire(&this->_M_counter); };
 
-	  auto __old = __atomic_impl::load(&_M_counter, memory_order_relaxed);
-	  return __atomic_wait_for(&_M_counter, __old, __pred, __rtime);
+	  return __atomic_wait_address_for(&_M_counter, __pred, __rtime);
 	}
 
       _GLIBCXX_ALWAYS_INLINE void
diff --git a/libstdc++-v3/include/bits/std_thread_sleep.h b/libstdc++-v3/include/bits/std_thread_sleep.h
new file mode 100644
index 00000000000..545bff2aea3
--- /dev/null
+++ b/libstdc++-v3/include/bits/std_thread_sleep.h
@@ -0,0 +1,119 @@ 
+// std::this_thread::sleep_for/until declarations -*- C++ -*-
+
+// Copyright (C) 2008-2021 Free Software Foundation, Inc.
+//
+// This file is part of the GNU ISO C++ Library.  This library is free
+// software; you can redistribute it and/or modify it under the
+// terms of the GNU General Public License as published by the
+// Free Software Foundation; either version 3, or (at your option)
+// any later version.
+
+// This library is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+// GNU General Public License for more details.
+
+// Under Section 7 of GPL version 3, you are granted additional
+// permissions described in the GCC Runtime Library Exception, version
+// 3.1, as published by the Free Software Foundation.
+
+// You should have received a copy of the GNU General Public License and
+// a copy of the GCC Runtime Library Exception along with this program;
+// see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+// <http://www.gnu.org/licenses/>.
+
+/** @file bits/std_thread_sleep.h
+ *  This is an internal header file, included by other library headers.
+ *  Do not attempt to use it directly. @headername{thread}
+ */
+
+#ifndef _GLIBCXX_THREAD_SLEEP_H
+#define _GLIBCXX_THREAD_SLEEP_H 1
+
+#pragma GCC system_header
+
+#if __cplusplus >= 201103L
+#include <bits/c++config.h>
+
+#include <chrono> // std::chrono::*
+
+#ifdef _GLIBCXX_USE_NANOSLEEP
+# include <cerrno>  // errno, EINTR
+# include <time.h>  // nanosleep
+#endif
+
+namespace std _GLIBCXX_VISIBILITY(default)
+{
+_GLIBCXX_BEGIN_NAMESPACE_VERSION
+
+  /** @addtogroup threads
+   *  @{
+   */
+
+  /** @namespace std::this_thread
+   *  @brief ISO C++ 2011 namespace for interacting with the current thread
+   *
+   *  C++11 30.3.2 [thread.thread.this] Namespace this_thread.
+   */
+  namespace this_thread
+  {
+#ifndef _GLIBCXX_NO_SLEEP
+
+#ifndef _GLIBCXX_USE_NANOSLEEP
+    void
+    __sleep_for(chrono::seconds, chrono::nanoseconds);
+#endif
+
+    /// this_thread::sleep_for
+    template<typename _Rep, typename _Period>
+      inline void
+      sleep_for(const chrono::duration<_Rep, _Period>& __rtime)
+      {
+	if (__rtime <= __rtime.zero())
+	  return;
+	auto __s = chrono::duration_cast<chrono::seconds>(__rtime);
+	auto __ns = chrono::duration_cast<chrono::nanoseconds>(__rtime - __s);
+#ifdef _GLIBCXX_USE_NANOSLEEP
+	struct ::timespec __ts =
+	  {
+	    static_cast<std::time_t>(__s.count()),
+	    static_cast<long>(__ns.count())
+	  };
+	while (::nanosleep(&__ts, &__ts) == -1 && errno == EINTR)
+	  { }
+#else
+	__sleep_for(__s, __ns);
+#endif
+      }
+
+    /// this_thread::sleep_until
+    template<typename _Clock, typename _Duration>
+      inline void
+      sleep_until(const chrono::time_point<_Clock, _Duration>& __atime)
+      {
+#if __cplusplus > 201703L
+	static_assert(chrono::is_clock_v<_Clock>);
+#endif
+	auto __now = _Clock::now();
+	if (_Clock::is_steady)
+	  {
+	    if (__now < __atime)
+	      sleep_for(__atime - __now);
+	    return;
+	  }
+	while (__now < __atime)
+	  {
+	    sleep_for(__atime - __now);
+	    __now = _Clock::now();
+	  }
+      }
+#endif // ! NO_SLEEP
+  } // namespace this_thread
+
+  /// @}
+
+_GLIBCXX_END_NAMESPACE_VERSION
+} // namespace
+#endif // C++11
+
+#endif // _GLIBCXX_THREAD_SLEEP_H
diff --git a/libstdc++-v3/include/std/atomic b/libstdc++-v3/include/std/atomic
index de5591d8e14..a56da8a9683 100644
--- a/libstdc++-v3/include/std/atomic
+++ b/libstdc++-v3/include/std/atomic
@@ -384,26 +384,19 @@  _GLIBCXX_BEGIN_NAMESPACE_VERSION
     void
     wait(_Tp __old, memory_order __m = memory_order_seq_cst) const noexcept
     {
-      std::__atomic_wait(&_M_i, __old,
-			 [__m, this, __old]
-			 {
-			   const auto __v = this->load(__m);
-			   // TODO make this ignore padding bits when we
-			   // can do that
-			   return __builtin_memcmp(&__old, &__v,
-						    sizeof(_Tp)) != 0;
-			 });
+      std::__atomic_wait_address_v(&_M_i, __old,
+			 [__m, this] { return this->load(__m); });
     }
 
     // TODO add const volatile overload
 
     void
     notify_one() const noexcept
-    { std::__atomic_notify(&_M_i, false); }
+    { std::__atomic_notify_address(&_M_i, false); }
 
     void
     notify_all() const noexcept
-    { std::__atomic_notify(&_M_i, true); }
+    { std::__atomic_notify_address(&_M_i, true); }
 #endif // __cpp_lib_atomic_wait 
 
     };
diff --git a/libstdc++-v3/include/std/barrier b/libstdc++-v3/include/std/barrier
index e09212dfcb9..dfb1fb476d1 100644
--- a/libstdc++-v3/include/std/barrier
+++ b/libstdc++-v3/include/std/barrier
@@ -185,11 +185,11 @@  It looks different from literature pseudocode for two main reasons:
       wait(arrival_token&& __old_phase) const
       {
 	__atomic_phase_const_ref_t __phase(_M_phase);
-	auto const __test_fn = [=, this]
+	auto const __test_fn = [=]
 	  {
 	    return __phase.load(memory_order_acquire) != __old_phase;
 	  };
-	std::__atomic_wait(&_M_phase, __old_phase, __test_fn);
+	std::__atomic_wait_address(&_M_phase, __test_fn);
       }
 
       void
diff --git a/libstdc++-v3/include/std/latch b/libstdc++-v3/include/std/latch
index ef8c301e5e9..0b2d3c4f51c 100644
--- a/libstdc++-v3/include/std/latch
+++ b/libstdc++-v3/include/std/latch
@@ -73,8 +73,8 @@  _GLIBCXX_BEGIN_NAMESPACE_VERSION
     _GLIBCXX_ALWAYS_INLINE void
     wait() const noexcept
     {
-      auto const __old = __atomic_impl::load(&_M_a, memory_order::acquire);
-      std::__atomic_wait(&_M_a, __old, [this] { return this->try_wait(); });
+      auto const __pred = [this] { return this->try_wait(); };
+      std::__atomic_wait_address(&_M_a, __pred);
     }
 
     _GLIBCXX_ALWAYS_INLINE void
diff --git a/libstdc++-v3/include/std/thread b/libstdc++-v3/include/std/thread
index ad383395ee9..63c0f38a83c 100644
--- a/libstdc++-v3/include/std/thread
+++ b/libstdc++-v3/include/std/thread
@@ -35,19 +35,13 @@ 
 # include <bits/c++0x_warning.h>
 #else
 
-#include <chrono> // std::chrono::*
-
 #if __cplusplus > 201703L
 # include <compare>	// std::strong_ordering
 # include <stop_token>	// std::stop_source, std::stop_token, std::nostopstate
 #endif
 
 #include <bits/std_thread.h> // std::thread, get_id, yield
-
-#ifdef _GLIBCXX_USE_NANOSLEEP
-# include <cerrno>  // errno, EINTR
-# include <time.h>  // nanosleep
-#endif
+#include <bits/std_thread_sleep.h> // std::this_thread::sleep_for, sleep_until
 
 namespace std _GLIBCXX_VISIBILITY(default)
 {
@@ -103,66 +97,6 @@  _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	return __out << __id._M_thread;
     }
 
-  /** @namespace std::this_thread
-   *  @brief ISO C++ 2011 namespace for interacting with the current thread
-   *
-   *  C++11 30.3.2 [thread.thread.this] Namespace this_thread.
-   */
-  namespace this_thread
-  {
-#ifndef _GLIBCXX_NO_SLEEP
-
-#ifndef _GLIBCXX_USE_NANOSLEEP
-    void
-    __sleep_for(chrono::seconds, chrono::nanoseconds);
-#endif
-
-    /// this_thread::sleep_for
-    template<typename _Rep, typename _Period>
-      inline void
-      sleep_for(const chrono::duration<_Rep, _Period>& __rtime)
-      {
-	if (__rtime <= __rtime.zero())
-	  return;
-	auto __s = chrono::duration_cast<chrono::seconds>(__rtime);
-	auto __ns = chrono::duration_cast<chrono::nanoseconds>(__rtime - __s);
-#ifdef _GLIBCXX_USE_NANOSLEEP
-	struct ::timespec __ts =
-	  {
-	    static_cast<std::time_t>(__s.count()),
-	    static_cast<long>(__ns.count())
-	  };
-	while (::nanosleep(&__ts, &__ts) == -1 && errno == EINTR)
-	  { }
-#else
-	__sleep_for(__s, __ns);
-#endif
-      }
-
-    /// this_thread::sleep_until
-    template<typename _Clock, typename _Duration>
-      inline void
-      sleep_until(const chrono::time_point<_Clock, _Duration>& __atime)
-      {
-#if __cplusplus > 201703L
-	static_assert(chrono::is_clock_v<_Clock>);
-#endif
-	auto __now = _Clock::now();
-	if (_Clock::is_steady)
-	  {
-	    if (__now < __atime)
-	      sleep_for(__atime - __now);
-	    return;
-	  }
-	while (__now < __atime)
-	  {
-	    sleep_for(__atime - __now);
-	    __now = _Clock::now();
-	  }
-      }
-  } // namespace this_thread
-#endif // ! NO_SLEEP
-
 #ifdef __cpp_lib_jthread
 
   /// A thread that can be requested to stop and automatically joined.
diff --git a/libstdc++-v3/testsuite/29_atomics/atomic/wait_notify/bool.cc b/libstdc++-v3/testsuite/29_atomics/atomic/wait_notify/bool.cc
index 0550f17c69d..26a7dfbfcec 100644
--- a/libstdc++-v3/testsuite/29_atomics/atomic/wait_notify/bool.cc
+++ b/libstdc++-v3/testsuite/29_atomics/atomic/wait_notify/bool.cc
@@ -22,42 +22,21 @@ 
 
 #include <atomic>
 #include <thread>
-#include <mutex>
-#include <condition_variable>
-#include <type_traits>
-#include <chrono>
 
 #include <testsuite_hooks.h>
 
 int
 main ()
 {
-  using namespace std::literals::chrono_literals;
-
-  std::mutex m;
-  std::condition_variable cv;
-  std::unique_lock<std::mutex> l(m);
-
-  std::atomic<bool> a(false);
-  std::atomic<bool> b(false);
+  std::atomic<bool> a{ true };
+  VERIFY( a.load() );
+  a.wait(false);
   std::thread t([&]
-		{
-		  {
-		    // This ensures we block until cv.wait(l) starts.
-		    std::lock_guard<std::mutex> ll(m);
-		  }
-		  cv.notify_one();
-		  a.wait(false);
-		  if (a.load())
-		    {
-		      b.store(true);
-		    }
-		});
-  cv.wait(l);
-  std::this_thread::sleep_for(100ms);
-  a.store(true);
-  a.notify_one();
+    {
+      a.store(false);
+      a.notify_one();
+    });
+  a.wait(true);
   t.join();
-  VERIFY( b.load() );
   return 0;
 }
diff --git a/libstdc++-v3/testsuite/29_atomics/atomic/wait_notify/generic.cc b/libstdc++-v3/testsuite/29_atomics/atomic/wait_notify/generic.cc
index 9ab1b071c96..0f1b9cd69d2 100644
--- a/libstdc++-v3/testsuite/29_atomics/atomic/wait_notify/generic.cc
+++ b/libstdc++-v3/testsuite/29_atomics/atomic/wait_notify/generic.cc
@@ -20,12 +20,27 @@ 
 // with this library; see the file COPYING3.  If not see
 // <http://www.gnu.org/licenses/>.
 
-#include "atomic/wait_notify_util.h"
+#include <atomic>
+#include <thread>
+
+#include <testsuite_hooks.h>
 
 int
 main ()
 {
   struct S{ int i; };
-  check<S> check_s{S{0},S{42}};
+  S aa{ 0 };
+  S bb{ 42 };
+
+  std::atomic<S> a{ aa };
+  VERIFY( a.load().i == aa.i );
+  a.wait(bb);
+  std::thread t([&]
+    {
+      a.store(bb);
+      a.notify_one();
+    });
+  a.wait(aa);
+  t.join();
   return 0;
 }
diff --git a/libstdc++-v3/testsuite/29_atomics/atomic/wait_notify/pointers.cc b/libstdc++-v3/testsuite/29_atomics/atomic/wait_notify/pointers.cc
index cc63694f596..17365a17228 100644
--- a/libstdc++-v3/testsuite/29_atomics/atomic/wait_notify/pointers.cc
+++ b/libstdc++-v3/testsuite/29_atomics/atomic/wait_notify/pointers.cc
@@ -22,42 +22,24 @@ 
 
 #include <atomic>
 #include <thread>
-#include <mutex>
-#include <condition_variable>
-#include <type_traits>
-#include <chrono>
 
 #include <testsuite_hooks.h>
 
 int
 main ()
 {
-  using namespace std::literals::chrono_literals;
-
-  std::mutex m;
-  std::condition_variable cv;
-  std::unique_lock<std::mutex> l(m);
-
   long aa;
   long bb;
-
-  std::atomic<long*> a(nullptr);
+  std::atomic<long*> a(&aa);
+  VERIFY( a.load() == &aa );
+  a.wait(&bb);
   std::thread t([&]
-		{
-		  {
-		    // This ensures we block until cv.wait(l) starts.
-		    std::lock_guard<std::mutex> ll(m);
-		  }
-		  cv.notify_one();
-		  a.wait(nullptr);
-		  if (a.load() == &aa)
-		    a.store(&bb);
-		});
-  cv.wait(l);
-  std::this_thread::sleep_for(100ms);
-  a.store(&aa);
-  a.notify_one();
+    {
+      a.store(&bb);
+      a.notify_one();
+    });
+  a.wait(&aa);
   t.join();
-  VERIFY( a.load() == &bb);
+
   return 0;
 }
diff --git a/libstdc++-v3/testsuite/29_atomics/atomic_flag/wait_notify/1.cc b/libstdc++-v3/testsuite/29_atomics/atomic_flag/wait_notify/1.cc
index 45b68c5bbb8..9d12889ed59 100644
--- a/libstdc++-v3/testsuite/29_atomics/atomic_flag/wait_notify/1.cc
+++ b/libstdc++-v3/testsuite/29_atomics/atomic_flag/wait_notify/1.cc
@@ -21,10 +21,6 @@ 
 // <http://www.gnu.org/licenses/>.
 
 #include <atomic>
-#include <chrono>
-#include <condition_variable>
-#include <concepts>
-#include <mutex>
 #include <thread>
 
 #include <testsuite_hooks.h>
@@ -32,34 +28,15 @@ 
 int
 main()
 {
-  using namespace std::literals::chrono_literals;
-
-  std::mutex m;
-  std::condition_variable cv;
-  std::unique_lock<std::mutex> l(m);
-
   std::atomic_flag a;
-  std::atomic_flag b;
+  VERIFY( !a.test() );
+  a.wait(true);
   std::thread t([&]
-		{
-		  {
-		    // This ensures we block until cv.wait(l) starts.
-		    std::lock_guard<std::mutex> ll(m);
-		  }
-		  cv.notify_one();
-		  a.wait(false);
-		  b.test_and_set();
-		  b.notify_one();
-		});
-
-  cv.wait(l);
-  std::this_thread::sleep_for(100ms);
-  a.test_and_set();
-  a.notify_one();
-  b.wait(false);
+    {
+      a.test_and_set();
+      a.notify_one();
+    });
+  a.wait(false);
   t.join();
-
-  VERIFY( a.test() );
-  VERIFY( b.test() );
   return 0;
 }
diff --git a/libstdc++-v3/testsuite/29_atomics/atomic_float/wait_notify.cc b/libstdc++-v3/testsuite/29_atomics/atomic_float/wait_notify.cc
index d8ec5fbe24e..01768da290b 100644
--- a/libstdc++-v3/testsuite/29_atomics/atomic_float/wait_notify.cc
+++ b/libstdc++-v3/testsuite/29_atomics/atomic_float/wait_notify.cc
@@ -21,12 +21,32 @@ 
 // with this library; see the file COPYING3.  If not see
 // <http://www.gnu.org/licenses/>.
 
-#include "atomic/wait_notify_util.h"
+
+#include <atomic>
+#include <thread>
+
+#include <testsuite_hooks.h>
+
+template<typename Tp>
+  void
+  check()
+  {
+    std::atomic<Tp> a{ 1.0 };
+    VERIFY( a.load() != 0.0 );
+    a.wait( 0.0 );
+    std::thread t([&]
+      {
+        a.store(0.0);
+        a.notify_one();
+      });
+    a.wait(1.0);
+    t.join();
+  }
 
 int
 main ()
 {
-  check<float> f;
-  check<double> d;
+  check<float>();
+  check<double>();
   return 0;
 }
diff --git a/libstdc++-v3/testsuite/29_atomics/atomic_integral/wait_notify.cc b/libstdc++-v3/testsuite/29_atomics/atomic_integral/wait_notify.cc
index 19c1ec4bc12..d12b091c635 100644
--- a/libstdc++-v3/testsuite/29_atomics/atomic_integral/wait_notify.cc
+++ b/libstdc++-v3/testsuite/29_atomics/atomic_integral/wait_notify.cc
@@ -21,46 +21,57 @@ 
 // with this library; see the file COPYING3.  If not see
 // <http://www.gnu.org/licenses/>.
 
-#include "atomic/wait_notify_util.h"
 
-void
-test01()
-{
-  struct S{ int i; };
-  std::atomic<S> s;
+#include <atomic>
+#include <thread>
 
-  s.wait(S{42});
-}
+#include <testsuite_hooks.h>
+
+template<typename Tp>
+  void
+  check()
+  {
+    std::atomic<Tp> a{ Tp(1) };
+    VERIFY( a.load() != Tp(0) );
+    a.wait( Tp(0) );
+    std::thread t([&]
+      {
+        a.store(Tp(0));
+        a.notify_one();
+      });
+    a.wait(Tp(1));
+    t.join();
+  }
 
 int
 main ()
 {
   // check<bool> bb;
-  check<char> ch;
-  check<signed char> sch;
-  check<unsigned char> uch;
-  check<short> s;
-  check<unsigned short> us;
-  check<int> i;
-  check<unsigned int> ui;
-  check<long> l;
-  check<unsigned long> ul;
-  check<long long> ll;
-  check<unsigned long long> ull;
+  check<char>();
+  check<signed char>();
+  check<unsigned char>();
+  check<short>();
+  check<unsigned short>();
+  check<int>();
+  check<unsigned int>();
+  check<long>();
+  check<unsigned long>();
+  check<long long>();
+  check<unsigned long long>();
 
-  check<wchar_t> wch;
-  check<char8_t> ch8;
-  check<char16_t> ch16;
-  check<char32_t> ch32;
+  check<wchar_t>();
+  check<char8_t>();
+  check<char16_t>();
+  check<char32_t>();
 
-  check<int8_t> i8;
-  check<int16_t> i16;
-  check<int32_t> i32;
-  check<int64_t> i64;
+  check<int8_t>();
+  check<int16_t>();
+  check<int32_t>();
+  check<int64_t>();
 
-  check<uint8_t> u8;
-  check<uint16_t> u16;
-  check<uint32_t> u32;
-  check<uint64_t> u64;
+  check<uint8_t>();
+  check<uint16_t>();
+  check<uint32_t>();
+  check<uint64_t>();
   return 0;
 }
diff --git a/libstdc++-v3/testsuite/29_atomics/atomic_ref/wait_notify.cc b/libstdc++-v3/testsuite/29_atomics/atomic_ref/wait_notify.cc
index a6740857172..2fd31304222 100644
--- a/libstdc++-v3/testsuite/29_atomics/atomic_ref/wait_notify.cc
+++ b/libstdc++-v3/testsuite/29_atomics/atomic_ref/wait_notify.cc
@@ -23,73 +23,25 @@ 
 
 #include <atomic>
 #include <thread>
-#include <mutex>
-#include <condition_variable>
-#include <chrono>
-#include <type_traits>
 
 #include <testsuite_hooks.h>
 
-template<typename Tp>
-Tp check_wait_notify(Tp val1, Tp val2)
+int
+main ()
 {
-  using namespace std::literals::chrono_literals;
+  struct S{ int i; };
+  S aa{ 0 };
+  S bb{ 42 };
 
-  std::mutex m;
-  std::condition_variable cv;
-  std::unique_lock<std::mutex> l(m);
-
-  Tp aa = val1;
-  std::atomic_ref<Tp> a(aa);
+  std::atomic_ref<S> a{ aa };
+  VERIFY( a.load().i == aa.i );
+  a.wait(bb);
   std::thread t([&]
-		{
-		  {
-		    // This ensures we block until cv.wait(l) starts.
-		    std::lock_guard<std::mutex> ll(m);
-		  }
-		  cv.notify_one();
-		  a.wait(val1);
-		  if (a.load() != val2)
-		    a = val1;
-		});
-  cv.wait(l);
-  std::this_thread::sleep_for(100ms);
-  a.store(val2);
-  a.notify_one();
+    {
+      a.store(bb);
+      a.notify_one();
+    });
+  a.wait(aa);
   t.join();
-  return a.load();
-}
-
-template<typename Tp,
-	 bool = std::is_integral_v<Tp>
-	 || std::is_floating_point_v<Tp>>
-struct check;
-
-template<typename Tp>
-struct check<Tp, true>
-{
-  check()
-  {
-    Tp a = 0;
-    Tp b = 42;
-    VERIFY(check_wait_notify(a, b) == b);
-  }
-};
-
-template<typename Tp>
-struct check<Tp, false>
-{
-  check(Tp b)
-  {
-    Tp a;
-    VERIFY(check_wait_notify(a, b) == b);
-  }
-};
-
-int
-main ()
-{
-  check<long>();
-  check<double>();
   return 0;
 }