[RFC,0/4] Add a new mutex type PTHREAD_MUTEX_QUEUESPINNER_NP

Message ID 1530519116-13103-1-git-send-email-kemi.wang@intel.com

kemi July 2, 2018, 8:11 a.m.
The pthread adaptive mutex is designed based on the assumption that, in most
cases, the lock will be released after a few CPU cycles. This holds especially
when the application code is well-behaved, with a short critical section that is
highly contended. In that case, spinning on the lock for a while instead of
immediately calling into the kernel to block can improve system performance.
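The spin-then-block idea can be sketched as follows. This is an illustration
only, not glibc's implementation: the bounded CAS loop stands in for the
adaptive spin, and the futex_wait() fallback a real mutex would take on
failure is elided.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Spin up to max_spins attempts; return true if the lock was taken while
   spinning (no syscall needed).  A real adaptive mutex would fall back to
   blocking in the kernel via futex_wait() instead of returning false.  */
static bool
spin_then_give_up (atomic_int *lock, int max_spins)
{
  for (int i = 0; i < max_spins; i++)
    {
      int expected = 0;
      if (atomic_compare_exchange_weak_explicit
            (lock, &expected, 1,
             memory_order_acquire, memory_order_relaxed))
        return true;   /* acquired during the spin phase */
    }
  return false;        /* caller would block via futex_wait here */
}

static void
spin_unlock (atomic_int *lock)
{
  atomic_store_explicit (lock, 0, memory_order_release);
}
```

When the critical section is short, the lock is usually acquired inside the
loop and the syscall cost is avoided entirely.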

But the current adaptive mutex has two main problems. The first is fairness:
multiple spinners contend for the lock simultaneously, and there is no
guarantee that a spinner will ever acquire the lock, no matter how long it has
been waiting. The other is heavy cache line bouncing. Since the cache line
containing the lock is shared among all of the spinners, each spinner tries to
acquire the lock via a cmpxchg instruction whenever the lock is released, which
constantly floods the system with "read-for-ownership" requests. As a result,
there is a lot of cache line bouncing on a large system with many CPUs.

To address the problems mentioned above, this series proposes queuing mutex
spinners with an MCS lock, following the mutex implementation in the kernel,
and introduces a new mutex type, PTHREAD_MUTEX_QUEUESPINNER_NP. Compared to
the adaptive mutex (with "read only while spinning"), test results on a
2-socket Intel Skylake platform show a significant performance improvement
(see the first patch for details).

Although queuing spinners with an MCS lock can improve the performance of the
adaptive mutex when multiple threads contend for a lock, people probably want
to know how the queue spinner mutex performs compared to other widely known
lock disciplines, namely the pthread spin lock and the pthread mutex. The
details of the test results are in the first patch; briefly, the conclusions
are summarized below:
a) With little lock contention, the spin lock performs best, the queue spinner
mutex performs similarly to the adaptive mutex, and both perform a little
better than the pthread mutex.
b) Under severe lock contention with a large number of CPUs and a small
critical section (less than 1000 ns), most lock acquisitions happen via
spinning, and the queue spinner mutex performs much better than the spin lock
and the adaptive mutex. This is because the overhead of heavy cache line
bouncing plays a big role in lock performance.
c) As the size of the critical section increases, the performance advantage of
the queue spinner mutex gradually shrinks. This is because the overhead of
cache line bouncing is no longer the bottleneck of lock performance; instead,
the overhead of futex_wait and futex_wake dominates. When the critical section
grows to 1 ms, even the latency of the futex syscall becomes negligible
compared to the total time of lock acquisition.

As we can see above, the queue spinner mutex performs well across these kinds
of workloads, but there is a potential risk in using this type of mutex: lock
ownership may be handed to the next spinner in the queue while that spinner is
not running (its CPU has been scheduled to run another task). The other
spinners then have to keep waiting in the queue, which can collapse lock
performance. To emulate this case, we ran two identical processes
simultaneously; each process has 28 threads, each of which sets its CPU
affinity to an individual CPU according to its thread id. Thus, CPUs [0~27]
are each subscribed by two threads. In the worst case (s=1000ns, t=6000ns),
lock performance is reduced by 58.1% (2205245->924263).
Therefore, the queue spinner mutex should be used with care, by applications
that pursue fairness and performance without oversubscribing CPU resources,
e.g. containers in a public cloud infrastructure.
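An application opting in to the proposed type might look like the sketch
below. This is hypothetical: PTHREAD_MUTEX_QUEUESPINNER_NP exists only in
this patch series, and the sketch assumes it would be detectable as a macro;
the fallback to the existing PTHREAD_MUTEX_ADAPTIVE_NP type keeps the program
working on a glibc without the patch.

```c
#define _GNU_SOURCE             /* for the _NP mutex types */
#include <pthread.h>

/* Initialize a mutex with the proposed queue-spinner type if this glibc
   defines it, otherwise fall back to the standard adaptive type.  Returns
   0 on success, an errno value otherwise.  */
static int
init_contended_mutex (pthread_mutex_t *m)
{
  pthread_mutexattr_t attr;
  int ret = pthread_mutexattr_init (&attr);
  if (ret != 0)
    return ret;
#ifdef PTHREAD_MUTEX_QUEUESPINNER_NP
  /* Proposed in this series; not in any released glibc.  */
  ret = pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_QUEUESPINNER_NP);
#else
  ret = pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
#endif
  if (ret == 0)
    ret = pthread_mutex_init (m, &attr);
  pthread_mutexattr_destroy (&attr);
  return ret;
}
```

Programs requesting a nonstandard type this way must be prepared for the
request to fail or for the constant to be absent at compile time.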

Kemi Wang (4):
  Mutex: Queue spinners to reduce cache line bouncing and ensure
    fairness
  Mutex: add unit tests for new type
  BENCHMARK: add a benchmark for testing new type of mutex
  Manual: Add manual for pthread mutex

 benchtests/Makefile                          |  4 +-
 benchtests/bench-mutex-adaptive-thread.c     |  8 +++-
 benchtests/bench-mutex-queuespinner-thread.c | 21 +++++++++
 manual/Makefile                              |  2 +-
 manual/mutex.texi                            | 68 ++++++++++++++++++++++++++++
 nptl/Makefile                                |  8 ++--
 nptl/allocatestack.c                         |  2 +-
 nptl/descr.h                                 | 26 +++++------
 nptl/mcs_lock.c                              | 68 ++++++++++++++++++++++++++++
 nptl/mcs_lock.h                              | 21 +++++++++
 nptl/nptl-init.c                             |  2 +-
 nptl/pthreadP.h                              |  2 +-
 nptl/pthread_mutex_init.c                    |  3 +-
 nptl/pthread_mutex_lock.c                    | 35 +++++++++++++-
 nptl/pthread_mutex_timedlock.c               | 35 ++++++++++++--
 nptl/pthread_mutex_trylock.c                 |  5 +-
 nptl/pthread_mutex_unlock.c                  |  7 ++-
 nptl/pthread_mutexattr_settype.c             |  2 +-
 nptl/tst-initializers1.c                     | 11 +++--
 nptl/tst-mutex5b.c                           |  2 +
 nptl/tst-mutex7b.c                           |  2 +
 sysdeps/nptl/bits/thread-shared-types.h      | 21 +++++++--
 sysdeps/nptl/pthread.h                       | 15 ++++--
 sysdeps/unix/sysv/linux/hppa/pthread.h       |  4 ++
 24 files changed, 324 insertions(+), 50 deletions(-)
 create mode 100644 benchtests/bench-mutex-queuespinner-thread.c
 create mode 100644 manual/mutex.texi
 create mode 100644 nptl/mcs_lock.c
 create mode 100644 nptl/mcs_lock.h
 create mode 100644 nptl/tst-mutex5b.c
 create mode 100644 nptl/tst-mutex7b.c

-- 
2.7.4

Comments

Carlos O'Donell July 5, 2018, 8:12 p.m. | #1
On 07/02/2018 04:11 AM, Kemi Wang wrote:
> The pthread adaptive mutex is designed based-on the truth that the lock
> probably would be released after a few of CPU cycles in most cases.
> Especially for the case: the applications code is well-behaved with a short
> critical section that is highly contended. Thus, spinning on the lock for
> a while instead of calling into the kernel to block immediately can help
> to improve the system performance.


This patch set is going to take a while to review.

We are currently freezing the ABI/API for glibc 2.28 release on August 1st.

Are you OK if we delay the inclusion of this feature to 2.29 once master
reopens?

-- 
Cheers,
Carlos.
kemi July 6, 2018, 1:32 a.m. | #2
On 07/06/2018 04:12, Carlos O'Donell wrote:
> On 07/02/2018 04:11 AM, Kemi Wang wrote:
>> The pthread adaptive mutex is designed based-on the truth that the lock
>> probably would be released after a few of CPU cycles in most cases.
>> Especially for the case: the applications code is well-behaved with a short
>> critical section that is highly contended. Thus, spinning on the lock for
>> a while instead of calling into the kernel to block immediately can help
>> to improve the system performance.
>
> This patch set is going to take a while to review.

Yes. Thanks for taking the time to review!

> We are currently freezing the ABI/API for glibc 2.28 release on August 1st.

I know. This patch series introduces a new mutex type and would not break the
existing ABI/API.

> Are you OK if we delay the inclusion of this feature to 2.29 once master
> reopens?

I hope we can pick up this feature before the 2.28 release if you are
comfortable with that.

The new mutex type would benefit users who have severe lock contention in
their applications (actually, some of our customers have complained to us
about this), and we can keep improving this feature after the 2.28 release.
kemi July 6, 2018, 7:42 a.m. | #3
On 07/06/2018 09:32, kemi wrote:
> On 07/06/2018 04:12, Carlos O'Donell wrote:
>> On 07/02/2018 04:11 AM, Kemi Wang wrote:
>>> The pthread adaptive mutex is designed based-on the truth that the lock
>>> probably would be released after a few of CPU cycles in most cases.
>>> Especially for the case: the applications code is well-behaved with a short
>>> critical section that is highly contended. Thus, spinning on the lock for
>>> a while instead of calling into the kernel to block immediately can help
>>> to improve the system performance.
>>
>> This patch set is going to take a while to review.

BTW, I can split the 1/4 patch into several sub-patches if you think that
would make it easier to review :)

> Yes. Thanks for taking your time to review!
>
>> We are currently freezing the ABI/API for glibc 2.28 release on August 1st.
>
> I know. This patch series introduces a new mutex type and would not break the
> existed ABI/API.
>
>> Are you OK if we delay the inclusion of this feature to 2.29 once master
>> reopens?
>
> I hope we can pick this feature before 2.28 release if you are comfortable.
>
> The new mutex type would benefit some guys who have severe lock contention in
> their applications (Actually, some of our customers have complained to us). And
> we can keep improving this feature after 2.28 release later.
Carlos O'Donell July 6, 2018, 12:42 p.m. | #4
On 07/05/2018 09:32 PM, kemi wrote:
> On 07/06/2018 04:12, Carlos O'Donell wrote:
>> On 07/02/2018 04:11 AM, Kemi Wang wrote:
>>> The pthread adaptive mutex is designed based-on the truth that the lock
>>> probably would be released after a few of CPU cycles in most cases.
>>> Especially for the case: the applications code is well-behaved with a short
>>> critical section that is highly contended. Thus, spinning on the lock for
>>> a while instead of calling into the kernel to block immediately can help
>>> to improve the system performance.
>>
>> This patch set is going to take a while to review.
>
> Yes. Thanks for taking your time to review!
>
>> We are currently freezing the ABI/API for glibc 2.28 release on August 1st.
>
> I know. This patch series introduces a new mutex type and would not break the
> existed ABI/API.

You don't break the ABI, but you do change it with the addition of the constant
PTHREAD_MUTEX_QUEUESPINNER_NP in the public pthread.h headers.

In theory this makes it OK to backport to the various distributions, since the
constant addition is not tied to any versioned symbol and is simply a flag that
you pass into the existing implementation. Your program has to be able to
handle failure if you request this type and don't get it.

>> Are you OK if we delay the inclusion of this feature to 2.29 once master
>> reopens?
>
> I hope we can pick this feature before 2.28 release if you are comfortable.

It's very unlikely to happen. I want to set that expectation up front.

We have only 3 weeks to get through the other issues that have been waiting
longer in the queue than this, namely Intel CET, C11 threads, etc., everything
you see here:

https://sourceware.org/glibc/wiki/Release/2.28#Planning

> The new mutex type would benefit some guys who have severe lock contention in
> their applications (Actually, some of our customers have complained to us). And
> we can keep improving this feature after 2.28 release later.

Given that distributions could backport this (no new symbols, just a new flag),
is there really any rush in getting it reviewed?
-- 
Cheers,
Carlos.
kemi July 6, 2018, 3:56 p.m. | #5
> It's very unlikely to happen. I want to set that expectation up front.

All right.

> We have only 3 weeks to get through the other issues that have been waiting
> longer in the queue than this, namely Intel CET, C11 threads, etc., everything
> you see here:
>
> https://sourceware.org/glibc/wiki/Release/2.28#Planning

OK. Could you take a look at another patch series ("read only while spinning"),
which is simpler and has been updated to V7? Thanks in advance.
https://sourceware.org/ml/libc-alpha/2018-07/msg00164.html
https://sourceware.org/ml/libc-alpha/2018-07/msg00163.html

> Given that distributions could backport this (no new symbols, just a new flag),
> is there really any rush in getting it reviewed?

To be honest, yes, please.
And I will respond to your feedback ASAP.
Carlos O'Donell July 6, 2018, 4:17 p.m. | #6
On 07/06/2018 11:56 AM, Wang, Kemi wrote:
>> Given that distributions could backport this (no new symbols, just
>> a new flag), is there really any rush in getting it reviewed?
>
> To be honest, yes, please. And I will respond to your feedback ASAP.
 
This sounds mysterious. Why are you in a rush?

-- 
Cheers,
Carlos.