[v2,0/5] Add a new mutex type PTHREAD_MUTEX_QUEUESPINNER_NP

Message ID 1531464752-18830-1-git-send-email-kemi.wang@intel.com
Headers show
Series
  • Add a new mutex type PTHREAD_MUTEX_QUEUESPINNER_NP
Related show

Message

kemi July 13, 2018, 6:52 a.m.
The pthread adaptive mutex is designed based-on the truth that the lock
probably would be released after a few of CPU cycles in most cases.
Especially for the case: the applications code is well-behaved with a short
critical section that is highly contended. Thus, spinning on the lock for
a while instead of calling into the kernel to block immediately can help
to improve the system performance.

But there are two main problems in current adaptive mutex. The first one is
fairness, multiple spinners contend for the lock simultaneously and there
is no any guarantee for a spinner to acquire the lock no matter how long it
has been waiting for. The other is the heavy cache line bouncing. Since the
cache line including the lock is shared among all of the spinners, when
lock is released, each spinner will try to acquire the lock via cmpxchg
instruction which constantly floods the system via "read-for-ownership"
requests. As a result, there will be a lot of cache line bouncing in a
large system with a lots of CPUs.

To address the problems mentioned above, the idea for queuing mutex
spinners with MCS lock[1] referred to the implementation of mutex[2] in
kernel is proposed and a new mutex type PTHREAD_MUTEX_QUEUESPINNER_NP is
introduced. Compare to adaptive mutex (read only while spinning), the test
result on Intel 2 sockets Skylake platform has showns significant
performance improvement (See the first patch for details).

Though, queuing spinner with mcs lock can help to improve the performance
of adaptive mutex when multiple threads contending for a lock, people
probably want to know how queue spinner mutex performs when compare to
other lock discipline widely knowns as pthread spin lock and pthread mutex.
We can see the details of the test result in the first patch. Simply, some
conclusion is summarized as below:
a) In case of little lock contention, spin lock performs best, queue
spinner mutex performs similar to adaptive mutex, and both perform a
little better than pthread mutex.
b) In the case of severe lock contention with large number of CPUs when
protecting a small critical section (less than 1000ns). Most of lock
acquisition is got via spinning. Queue spinner mutex.
performs much better than spin lock and adaptive mutex. This is because the
overhead of heavy cache line bouncing plays a big role on lock performance.
c) With the increase of the size of a critical section, the advantage of
queue spinner mutex on performance in reduced gradually. This is because
the overhead of cache line bouncing will not become the bottleneck of lock
performance, instead, the overhead of futex_wait and futex_wake
plays a big role. When the critical section is increased to 1ms, even the
latency of futex syscall would be ignorable compared to the total time of
lock acquisition.

As we can see above, queue spinner mutex performs well in kinds of workload,
but there would be a potential risk to use this type of mutex. When the
lock holder is transformed to the next spinner in the queue, but it is not
running (the CPU is scheduled to run other task). Thus, the other spinners
have to wait in the queue, this would probably collapse the lock performance.
To emulate this case, we run two same processes simultaneously, the
process has 28 threads each of which sets CPU affinity to an individual CPU
according to the thread id. Thus, CPU [0~27] are subscribed by two threads.
In the worst case (s=1000ns, t=6000ns), the lock performance is reduced by
58.1% (2205245->924263).
Therefore, queue spinner mutex would be carefully used for applications to
pursue fairness and performance without oversubscribing CPU resource. E.g.
Containers in public cloud infrastructure.

To overcome the disadvantage of lock performance collapse in the legacy MCS
lock when CPUs are oversubscribed, we proposed an enhanced MCS lock
algorithm in the second patch which allows a spinner to spin on mutex lock
without having to be waken up by its previous spinner. In such case, this
new mutex type would be widely and safely used in kinds of usage scenarios.

ChangeLog:
   V1->V2
   a) Propose an enhanced MCS lock algrithm to avoid potential lock
   performance collapse.
   b) Add a link of the paper which introduces original MCS lock.
   c) Add the commit id for mutex implementation with MCS lock in kernel.

[1] Mellor-Crummey, John M., and Michael L. Scott. "Algorithms for scalable
synchronization on shared-memory multiprocessors." ACM Transactions on
Computer Systems (TOCS) 9.1 (1991): 21-65.
https://www.cos.ufrj.br/~ines/courses/progpar01/papers/p21-mellor-
crummey.pdf
[2] Commit id for mutex with MCS lock in kernel:
2bd2c92cf07cc4a373bf316c75b78ac465fefd35

Kemi Wang (5):
  Mutex: Queue spinners to reduce cache line bouncing and ensure
    fairness
  MCS lock: Enhance legacy MCS lock algorithm
  Mutex: add unit tests for new type
  BENCHMARK: add a benchmark for testing new type of mutex
  Manual: Add manual for pthread mutex

 benchtests/Makefile                          |  4 +-
 benchtests/bench-mutex-adaptive-thread.c     |  8 ++-
 benchtests/bench-mutex-queuespinner-thread.c | 21 +++++++
 manual/Makefile                              |  2 +-
 manual/mutex.texi                            | 68 +++++++++++++++++++++
 nptl/Makefile                                |  8 +--
 nptl/allocatestack.c                         |  2 +-
 nptl/descr.h                                 | 26 ++++----
 nptl/mcs_lock.c                              | 88 ++++++++++++++++++++++++++++
 nptl/mcs_lock.h                              | 21 +++++++
 nptl/nptl-init.c                             |  2 +-
 nptl/pthreadP.h                              |  2 +-
 nptl/pthread_mutex_init.c                    |  3 +-
 nptl/pthread_mutex_lock.c                    | 40 ++++++++++++-
 nptl/pthread_mutex_timedlock.c               | 35 ++++++++++-
 nptl/pthread_mutex_trylock.c                 |  5 +-
 nptl/pthread_mutex_unlock.c                  |  7 ++-
 nptl/pthread_mutexattr_settype.c             |  2 +-
 nptl/tst-initializers1.c                     | 11 ++--
 nptl/tst-mutex5b.c                           |  2 +
 nptl/tst-mutex7b.c                           |  2 +
 sysdeps/nptl/bits/thread-shared-types.h      | 22 +++++--
 sysdeps/nptl/pthread.h                       | 15 +++--
 sysdeps/unix/sysv/linux/hppa/pthread.h       |  4 ++
 24 files changed, 350 insertions(+), 50 deletions(-)
 create mode 100644 benchtests/bench-mutex-queuespinner-thread.c
 create mode 100644 manual/mutex.texi
 create mode 100644 nptl/mcs_lock.c
 create mode 100644 nptl/mcs_lock.h
 create mode 100644 nptl/tst-mutex5b.c
 create mode 100644 nptl/tst-mutex7b.c

-- 
2.7.4

Comments

kemi Aug. 2, 2018, 12:34 a.m. | #1
Hi, Gentle maintainers
  Could we get this patch series reviewed next? Thanks

https://sourceware.org/ml/libc-alpha/2018-07/msg00354.html
https://sourceware.org/ml/libc-alpha/2018-07/msg00357.html
https://sourceware.org/ml/libc-alpha/2018-07/msg00355.html
https://sourceware.org/ml/libc-alpha/2018-07/msg00358.html
https://sourceware.org/ml/libc-alpha/2018-07/msg00356.html
https://sourceware.org/ml/libc-alpha/2018-07/msg00359.html

On 2018年07月13日 14:52, Kemi Wang wrote:
> The pthread adaptive mutex is designed based-on the truth that the lock

> probably would be released after a few of CPU cycles in most cases.

> Especially for the case: the applications code is well-behaved with a short

> critical section that is highly contended. Thus, spinning on the lock for

> a while instead of calling into the kernel to block immediately can help

> to improve the system performance.

> 

> But there are two main problems in current adaptive mutex. The first one is

> fairness, multiple spinners contend for the lock simultaneously and there

> is no any guarantee for a spinner to acquire the lock no matter how long it

> has been waiting for. The other is the heavy cache line bouncing. Since the

> cache line including the lock is shared among all of the spinners, when

> lock is released, each spinner will try to acquire the lock via cmpxchg

> instruction which constantly floods the system via "read-for-ownership"

> requests. As a result, there will be a lot of cache line bouncing in a

> large system with a lots of CPUs.

> 

> To address the problems mentioned above, the idea for queuing mutex

> spinners with MCS lock[1] referred to the implementation of mutex[2] in

> kernel is proposed and a new mutex type PTHREAD_MUTEX_QUEUESPINNER_NP is

> introduced. Compare to adaptive mutex (read only while spinning), the test

> result on Intel 2 sockets Skylake platform has showns significant

> performance improvement (See the first patch for details).

> 

> Though, queuing spinner with mcs lock can help to improve the performance

> of adaptive mutex when multiple threads contending for a lock, people

> probably want to know how queue spinner mutex performs when compare to

> other lock discipline widely knowns as pthread spin lock and pthread mutex.

> We can see the details of the test result in the first patch. Simply, some

> conclusion is summarized as below:

> a) In case of little lock contention, spin lock performs best, queue

> spinner mutex performs similar to adaptive mutex, and both perform a

> little better than pthread mutex.

> b) In the case of severe lock contention with large number of CPUs when

> protecting a small critical section (less than 1000ns). Most of lock

> acquisition is got via spinning. Queue spinner mutex.

> performs much better than spin lock and adaptive mutex. This is because the

> overhead of heavy cache line bouncing plays a big role on lock performance.

> c) With the increase of the size of a critical section, the advantage of

> queue spinner mutex on performance in reduced gradually. This is because

> the overhead of cache line bouncing will not become the bottleneck of lock

> performance, instead, the overhead of futex_wait and futex_wake

> plays a big role. When the critical section is increased to 1ms, even the

> latency of futex syscall would be ignorable compared to the total time of

> lock acquisition.

> 

> As we can see above, queue spinner mutex performs well in kinds of workload,

> but there would be a potential risk to use this type of mutex. When the

> lock holder is transformed to the next spinner in the queue, but it is not

> running (the CPU is scheduled to run other task). Thus, the other spinners

> have to wait in the queue, this would probably collapse the lock performance.

> To emulate this case, we run two same processes simultaneously, the

> process has 28 threads each of which sets CPU affinity to an individual CPU

> according to the thread id. Thus, CPU [0~27] are subscribed by two threads.

> In the worst case (s=1000ns, t=6000ns), the lock performance is reduced by

> 58.1% (2205245->924263).

> Therefore, queue spinner mutex would be carefully used for applications to

> pursue fairness and performance without oversubscribing CPU resource. E.g.

> Containers in public cloud infrastructure.

> 

> To overcome the disadvantage of lock performance collapse in the legacy MCS

> lock when CPUs are oversubscribed, we proposed an enhanced MCS lock

> algorithm in the second patch which allows a spinner to spin on mutex lock

> without having to be waken up by its previous spinner. In such case, this

> new mutex type would be widely and safely used in kinds of usage scenarios.

> 

> ChangeLog:

>    V1->V2

>    a) Propose an enhanced MCS lock algrithm to avoid potential lock

>    performance collapse.

>    b) Add a link of the paper which introduces original MCS lock.

>    c) Add the commit id for mutex implementation with MCS lock in kernel.

> 

> [1] Mellor-Crummey, John M., and Michael L. Scott. "Algorithms for scalable

> synchronization on shared-memory multiprocessors." ACM Transactions on

> Computer Systems (TOCS) 9.1 (1991): 21-65.

> https://www.cos.ufrj.br/~ines/courses/progpar01/papers/p21-mellor-

> crummey.pdf

> [2] Commit id for mutex with MCS lock in kernel:

> 2bd2c92cf07cc4a373bf316c75b78ac465fefd35

> 

> Kemi Wang (5):

>   Mutex: Queue spinners to reduce cache line bouncing and ensure

>     fairness

>   MCS lock: Enhance legacy MCS lock algorithm

>   Mutex: add unit tests for new type

>   BENCHMARK: add a benchmark for testing new type of mutex

>   Manual: Add manual for pthread mutex

> 

>  benchtests/Makefile                          |  4 +-

>  benchtests/bench-mutex-adaptive-thread.c     |  8 ++-

>  benchtests/bench-mutex-queuespinner-thread.c | 21 +++++++

>  manual/Makefile                              |  2 +-

>  manual/mutex.texi                            | 68 +++++++++++++++++++++

>  nptl/Makefile                                |  8 +--

>  nptl/allocatestack.c                         |  2 +-

>  nptl/descr.h                                 | 26 ++++----

>  nptl/mcs_lock.c                              | 88 ++++++++++++++++++++++++++++

>  nptl/mcs_lock.h                              | 21 +++++++

>  nptl/nptl-init.c                             |  2 +-

>  nptl/pthreadP.h                              |  2 +-

>  nptl/pthread_mutex_init.c                    |  3 +-

>  nptl/pthread_mutex_lock.c                    | 40 ++++++++++++-

>  nptl/pthread_mutex_timedlock.c               | 35 ++++++++++-

>  nptl/pthread_mutex_trylock.c                 |  5 +-

>  nptl/pthread_mutex_unlock.c                  |  7 ++-

>  nptl/pthread_mutexattr_settype.c             |  2 +-

>  nptl/tst-initializers1.c                     | 11 ++--

>  nptl/tst-mutex5b.c                           |  2 +

>  nptl/tst-mutex7b.c                           |  2 +

>  sysdeps/nptl/bits/thread-shared-types.h      | 22 +++++--

>  sysdeps/nptl/pthread.h                       | 15 +++--

>  sysdeps/unix/sysv/linux/hppa/pthread.h       |  4 ++

>  24 files changed, 350 insertions(+), 50 deletions(-)

>  create mode 100644 benchtests/bench-mutex-queuespinner-thread.c

>  create mode 100644 manual/mutex.texi

>  create mode 100644 nptl/mcs_lock.c

>  create mode 100644 nptl/mcs_lock.h

>  create mode 100644 nptl/tst-mutex5b.c

>  create mode 100644 nptl/tst-mutex7b.c

>