Linux lockdep

Table of Contents

简介

lockdep用来分析内核中的持锁顺序,使用情况等问题,本文简单的分析一下其实现的基本原理以及使用方法.
内核文档可参考 https://docs.kernel.org/locking/lockdep-design.html

可以检测到的问题

问题汇总

BUG: Invalid wait context

这个用来反馈内核中持锁上下文的问题,可以检测到如下类型的问题, 详情可以参考提交 https://lwn.net/Articles/579849/

raw_spin_lock(&foo);
spin_lock(&bar);
spin_unlock(&bar);
raw_spin_unlock(&foo);

下面是一段实际发现问题的log

[   15.090335][    T1]
[   15.090359][    T1] =============================
[   15.090365][    T1] [ BUG: Invalid wait context ]
[   15.090373][    T1] 5.10.59-rt52-00983-g186a6841c682-dirty #3 Not tainted
[   15.090386][    T1] -----------------------------
[   15.090392][    T1] swapper/0/1 is trying to lock:
[   15.090402][    T1] 70ff00018507c188 (&gc->bgpio_lock){....}-{3:3}, at: _raw_spin_lock_irqsave+0x1c/0x28
[   15.090470][    T1] other info that might help us debug this:
[   15.090477][    T1] context-{5:5}
[   15.090485][    T1] 3 locks held by swapper/0/1:
[   15.090497][    T1]  #0: c2ff0001816de1a0 (&dev->mutex){....}-{4:4}, at: __device_driver_lock+0x98/0x104
[   15.090553][    T1]  #1: ffff90001485b4b8 (irq_domain_mutex){+.+.}-{4:4}, at: irq_domain_associate+0xbc/0x6d4
[   15.090606][    T1]  #2: 4bff000185d7a8e0 (lock_class){....}-{2:2}, at: _raw_spin_lock_irqsave+0x1c/0x28
[   15.090654][    T1] stack backtrace:
[   15.090661][    T1] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 5.10.59-rt52-00983-g186a6841c682-dirty #3
[   15.090682][    T1] Hardware name: Horizon Robotics Journey 5 DVB (DT)
[   15.090692][    T1] Call trace:
[   15.090698][    T1]  dump_backtrace+0x0/0x708
[   15.090717][    T1]  show_stack+0x24/0x30
[   15.090733][    T1]  dump_stack+0x1b4/0x2bc
[   15.090752][    T1]  __lock_acquire+0x2800/0xb760
[   15.090774][    T1]  lock_acquire+0x248/0x85c
[   15.090793][    T1]  __raw_spin_lock_irqsave+0xd4/0x270
[   15.090811][    T1]  _raw_spin_lock_irqsave+0x1c/0x28
[   15.090828][    T1]  dwapb_irq_ack+0xb4/0x300
[   15.090846][    T1]  __irq_do_set_handler+0x494/0xb2c
[   15.090864][    T1]  __irq_set_handler+0x74/0x114
[   15.090881][    T1]  irq_set_chip_and_handler_name+0x44/0x58
[   15.090900][    T1]  gpiochip_irq_map+0x210/0x644
[   15.090917][    T1]  irq_domain_associate+0x168/0x6d4
[   15.090935][    T1]  irq_create_mapping_affinity+0x15c/0x354
[   15.090955][    T1]  gpiochip_to_irq+0xb4/0x390
[   15.090970][    T1]  gpiod_to_irq+0xf8/0x1e4
[   15.090987][    T1]  gpio_keys_probe+0x9a0/0x24d0
[   15.091008][    T1]  platform_drv_probe+0x138/0x214
[   15.091026][    T1]  really_probe+0x2e0/0x15d4
[   15.091045][    T1]  driver_probe_device+0x100/0x418
[   15.091064][    T1]  device_driver_attach+0x138/0x234
[   15.091084][    T1]  __driver_attach+0x184/0x358
[   15.091103][    T1]  bus_for_each_dev+0x100/0x1e0
[   15.091122][    T1]  driver_attach+0x5c/0x98
[   15.091141][    T1]  bus_add_driver+0x33c/0x9d8
[   15.091159][    T1]  driver_register+0x1e4/0x6b0
[   15.091179][    T1]  __platform_driver_register+0x118/0x214
[   15.091196][    T1]  gpio_keys_init+0x28/0x34
[   15.091218][    T1]  do_one_initcall+0x8c/0x2d8
[   15.091236][    T1]  do_initcall_level+0x144/0x230
[   15.091257][    T1]  do_initcalls+0xb0/0x170
[   15.091275][    T1]  do_basic_setup+0xd0/0xe4
[   15.091293][    T1]  kernel_init_freeable+0x180/0x3e4
[   15.091311][    T1]  kernel_init+0x2c/0x398
[   15.091330][    T1]  ret_from_fork+0x10/0x30

WARNING: bad contention detected!

这个错误表示lockdep内部逻辑出错,一般不应该报出来

WARNING: possible circular locking dependency detected

见下面的分析,通过依赖路径分析来检测
ABBA, ABBCCA 死锁

WARNING: possible recursive locking detected

AA死锁,重入
这个通过遍历已获取到的锁列表,判断当前的锁是否已经被获取到.读锁除外.

WARNING: bad unlock balance detected

lock和unlock不成对调用

  • unlock时进行检测发现当前锁没有被持有时报错.
  • 初始化时检测是否持有锁,持有则报错.

WARNING: suspicious RCU usage

在rcu代码中通过RCU_LOCKDEP_WARN宏来报出来的warning

WARNING: lock held when returning to user space!

在离开内核空间时,依旧持有内核空间中的锁, 通过在离开内核空间之前过滤进程已持有锁的列表来判断.
kernel source for call lockdep_sys_exit();

WARNING: held lock freed!

通过debug_check_no_locks_freed报错,内核中在很多锁,对象初始化的地方加入该检查,通过检查free/初始化的区域中包含已经持有的锁来判断.
kernel source for lock freed check in mm/slub.c

WARNING: Nested lock was not taken

检测持有nest锁时,是否已经持有锁

WARNING: possible irq lock inversion dependency detected

检测由于irq重入而造成可能的死锁
https://bugzilla.redhat.com/show_bug.cgi?id=1304242

[  108.223030]        CPU0                    CPU1
[  108.223030]        ----                    ----
[  108.223030]   lock(&(&list->lock)->rlock#2);
[  108.223030]                                local_irq_disable();
[  108.223030]                                lock(policy_rwlock);
[  108.223030]                                lock(&(&list->lock)->rlock#2);
[  108.223030]   <Interrupt>
[  108.223030]     lock(policy_rwlock);

WARNING: inconsistent lock state

锁状态不匹配,i.e. 同一把锁在hardirq, softirq, 进程上下文中使用到,但是_bh, _irq的后缀使用错误,导致系统有死锁
如下面的patch,以及修复bug

[  313.402947]        CPU0
[  313.402948]        ----
[  313.402949]   lock(&(&wsm.lock)->rlock);
[  313.402951]   <Interrupt>
[  313.402952]     lock(&(&wsm.lock)->rlock);
[  313.402954]
*** DEADLOCK ***

https://lore.kernel.org/all/20190517231626.85614-1-dennis@kernel.org/

WARNING: %s-safe -> %s-unsafe lock order detected

锁上下文安全性检查
WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2299913.html

实现原理

将锁抽象成lock class,并且在其中记录各种事件

  1. 通过定义static的 lock_class_key来记录每个key

    #ifdef CONFIG_DEBUG_SPINLOCK
    extern void __raw_spin_lock_init(raw_spinlock_t *lock, const char *name,
                    struct lock_class_key *key, short inner);
    
    # define raw_spin_lock_init(lock)                                       \
        do {                                                                \
            static struct lock_class_key __key;                     \
                                        \
            __raw_spin_lock_init((lock), #lock, &__key, LD_WAIT_SPIN); \
        } while (0)
    
    #else
    

    在锁初始化时,将具体的key和class联系在一起 (默认情况下代码中同一个初始化位址具有相同的class)

  2. 部分的同类型锁对象在代码的不同的位址初始化,可以自定义key class来设置同样的class.
    lockdep_set_class* API可以用来实现这个功能
    kernel source irq_desc.lock keyclass
  3. 在相同代码位址初始化的key class也可以具备不同的subclass
    kernel source for tty_struct->termios_rwsem
    https://github.com/torvalds/linux/commit/d2b6f44779d3be22d32a5697bd30b59367fd2b33
    不同的subclass也会被lockdep当作是不同的锁,subclass可以支持动态进行添加,而keyclass只能是编译的时候就确定好。无法动态根据示例化对性的个数做扩展
  4. 通过以上的方法,lockdep将内核中的锁按照class的类别进行检测
    lockdep认为同class的锁在逻辑上是同一种锁,具有相同的处理方法. 每个class可能包含很多个不同的锁对象,lockdep通过记录这些锁使用时进出的上下文信息,以及进出进程已经持有的锁的上下文信息进行比较来发现问题

锁状态

在内核中,各种锁共用9类状态

===  ===================================================
'.'  acquired while irqs disabled and not in irq context
'-'  acquired in irq context
'+'  acquired with irqs enabled
'?'  acquired in irq context with irqs enabled.
===  ===================================================

The bits are illustrated with an example::
                     \----> hardirq
                     | \---> hardirq readlock
                     || \--> softirq
                     ||| \-> softirq readlock
                     ||||
(&sio_locks[i].lock){-.-.}, at: [<c02867fd>] mutex_lock+0x21/0x24
                     ||||
                     ||| \-> softirq disabled and not in softirq context
                     || \--> acquired in softirq context
                     | \---> hardirq disabled and not in hardirq context
                     \----> acquired in hardirq context

上面的文档已经解决的很清楚了
从内核的使用中,由于不同上下文间具有固定的抢占规则,可以锁添加上softirq/hardirq-safe/unsafe的标签
根据上面的9种状态,lockdep通过记录锁在获取时的上下文,以及中断开启状态,来机率锁的状态,每把锁可以在不同safe/unsafe的标签中转化

softirq/hardirq的safe状态可以用来判断持锁是否会有风险,入下图所示,红色箭头所表示的锁依赖顺序是有死锁风险的

在当前的task struct中记录相关的加锁信息

在记录时通过lock_class来进行,所以即使没有发生死锁,系统也可以上报具体的错误信息

RSLS Call Trace

下面是raw spinlock持锁时的lockdep相关的操作逻辑

- __raw_spin_lock_irqsave
  - spin_acquire
   - if !try_lock success
    - lock_contended
     - lock
   - lock_acquired

Mutex lock

下面是mutex_lock持锁时的lockdep相关的操作逻辑

- __mutex_lock
 - __mutex_lock_common
  - mutex_acquire_nest
   - if try_lock success
    - lock_acquired
    - return
   - else
    - lock_contended
    - wait until lock acquired
  - lock_acquired

从上面的两个例子可以看出,对于不同的锁,lockdep处理逻辑都是类似的,主要用到3个api函数 lock_acquire, lock_contended, lock_acquired.

在进行锁操作时,对比之前已经持有的锁信息,通过规则判断出具体的异常

上下文判断

这个很简单,通过判断每个锁所处于的上下文来进行判断
下面是linux对于上下文类型的定义, 代码提交如下 https://lwn.net/Articles/579849/
lockdep_wait_type

enum lockdep_wait_type {
    LD_WAIT_INV = 0,    /* not checked, catch all */

    LD_WAIT_FREE,               /* wait free, rcu etc.. */
    LD_WAIT_SPIN,               /* spin loops, raw_spinlock_t etc.. */

#ifdef CONFIG_PROVE_RAW_LOCK_NESTING
    LD_WAIT_CONFIG,             /* preemptible in PREEMPT_RT, spinlock_t etc.. */
#else
    LD_WAIT_CONFIG = LD_WAIT_SPIN,
#endif
    LD_WAIT_SLEEP,              /* sleeping locks, mutex_t etc.. */

    LD_WAIT_MAX,                /* must be last */
};

系统在运行时会检查所有已经获取到的锁,并且以此为依据检查持锁的上下文是否正确,代码可以参考 check_wait_context

  • wait_type_outer
    可以持有此锁的上下文类型, wait_type >= wait_type_outer既可认为是正确的持持锁顺序
    LD_WAIT_SLEEP 表示可以在进程上下文中持有该锁
    LD_WAIT_SPIN 表示可以在进程上下文,以及原子上下文中持有该锁
    LD_WAIT_INV 表示outer和inner保持一致 (大部份锁都是这样的设置)
  • wait_type_inner
    持有该锁之后进入的上下文类型
    eg:
    Mutex lock的wait_type_inner为LD_WAIT_SLEEP, 表示持有该锁之后系统可以睡眠。
    raw_spinlock的wait_type_inner为LD_WAIT_SPIN, 表示持有该锁之后系统就会进入不可抢占的原子上下文

    MUTEX LOCK wait_type_inner

    #ifdef CONFIG_DEBUG_LOCK_ALLOC
    # define __DEP_MAP_MUTEX_INITIALIZER(lockname)          \
        , .dep_map = {                                      \
            .name = #lockname,                      \
            .wait_type_inner = LD_WAIT_SLEEP,       \
        }
    #else
    # define __DEP_MAP_MUTEX_INITIALIZER(lockname)
    #endif
    

系统根据着个类型,就可以跟踪,记录目前系统所持有的锁的上下文状态,并且于即将持有的锁inner waittype做比较,检测出相应的问题

锁依赖

此部份是lockdep设计的核心,死锁检测的方法,以及定理将从此处给出

  • lock type

    因为不同性质的锁阻塞的条件不同,所以内核根据阻塞产生条件,给内核中的锁分为了以下的几大类.lockdep将锁分为以下几种类型

    writers
    exclusive lockers, like spin_lock() or write_lock()
    此种锁是内核编程中最常用的类型,完全独占的锁,同一时间只会有一个owner,最容易分析
    non-recursive readers
    shared lockers, like down_read()
    此种锁是普通的读锁,因为读锁可以同时被读端锁持有,因此允许多线程同时持有,有相互依赖的持有关系也不一定会死锁
    recursive readers

    recursive shared lockers, like rcu_read_lock()
    这种锁是更不易死锁的类型,由于支持重入,具有更佳宽松个的死锁条件

    并且为了方便文章编写,做出缩写如下

    abbr type
    W/E exclusive lockers
    r non-recursive readers
    R recursive readers
    S all readers (non-recursive + recursive), as both are shared lockers.
    N writers and non-recursive readers, as both are not recursive.
  • non-recursive readers的死锁情形

    为了方便理解,通过下面的例子,可以看出重入属性对死锁的产生的影响
    这个例子展示了non-recursive readers被writer的waiter阻塞导致出现deadlock的情形
    虽然read_lock可以同时被多个cpu/单个cpu持有,但是如果有线程在等待write锁, 系统依旧会因此而阻塞

    TASK A: TASK B:

    read_lock(X);
                        write_lock(X);
    read_lock_2(X);

    简而言之就是锁X在第二次获取读锁时,会因为另一个task在等待获取写锁而导致死
    锁。这是因为写锁会block之后尝试获取读锁的操作,如果没有这个block操作,系统
    可能会因为一致有进程不断获取读锁,而导致,写入的进程无法拿到锁,造成饿死.

    具体可以参考 read_lock will wait for write waiter

    void queued_read_lock_slowpath(struct qrwlock *lock)
    {
        /*
         * Readers come here when they cannot get the lock without waiting
         */
        if (unlikely(in_interrupt())) {
            /*
             * Readers in interrupt context will get the lock immediately
             * if the writer is just waiting (not holding the lock yet),
             * so spin with ACQUIRE semantics until the lock is available
             * without waiting in the queue.
             */
            atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
            return;
        }
        atomic_sub(_QR_BIAS, &lock->cnts);
    
        /*
         * Put the reader into the wait queue
         */
        arch_spin_lock(&lock->wait_lock);
        atomic_add(_QR_BIAS, &lock->cnts);
    
        /*
         * The ACQUIRE semantics of the following spinning code ensure
         * that accesses can't leak upwards out of our subsequent critical
         * section in the case that the lock is currently held for write.
         */
        atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
    
        /*
         * Signal the next one in queue to become queue head
         */
        arch_spin_unlock(&lock->wait_lock);
    }
    EXPORT_SYMBOL(queued_read_lock_slowpath);
    

    如上面的内核实现,读锁会block在arch_spin_lock(&lock->wait_lock);或者 atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));,在此处等待写锁离开临界区,单此时,读写被Task A拿到,导致死锁.

  • 根据锁类型,来观察死锁的条件关系

    针对这三种锁类型,可以有如下几种的阻塞关系矩阵. Y表示行标会阻塞列标, N表示不会产生阻塞的问题

    上面列出了不同类型的锁在相互依赖时是否会产生block,N表示完全不会形成阻塞等待
    下面简单列举两个不会产生阻塞的路径

    • R won't block r. 递归的读锁不会阻塞非递归的读锁
    • R won't block R. 递归的读锁不会相互block
  • 依赖分析

    通过上面的表格,在检测死锁时,共需要检测9种情况

    考虑如下的依赖情况
    L1 -> L2
    我们需要关心在获取L1的情况下,再去获取L2系统会不会被block. 换句话说就是有没有可能存在L3,L3依赖于L1,而L2依赖于L3.
    这样我们只需要检测一下获取L1之前获取过的锁列表, 以及获取L2之后获取过的锁列表,再综合锁的类型进行考虑就可以检测出来死锁。

    综合上面的阻塞关系图,具有如下规律

    根据这个规律,可以方便的将关系简化为4种,这样,就可以大大降低复杂度.
    lockdep会动态生成一张依赖图,系统将对应的lockclass添加到这张图中,通过对应的算法照出相应的依赖关系.
    依赖类型
    针对不同的锁类型,可以抽象出以下四中依赖关系, 按照上面的阻塞关系图。

    • -(ER)->
      exclusive writer to recursive reader dependency
    • -(EN)->
      exclusive writer to non-recursive locker dependency
    • -(SR)->
      shared reader to recursive reader dependency
    • -(SN)->
      shared reader to non-recursive locker dependency
  • strong path定义

    根据阻塞图,如果一条依赖路径上全部都是可阻塞的类型,则定义为strong path
    简单来说就是不经过-(R)-> -(S)->路径的定义为strong path. 因为经过这种路径的都有非block的锁关系,lockdep认为不会造成死锁.
    下面是内核文档中对strong path定义的原文

    A "path" is a series of conjunct dependency edges in the graph. And we define a
    "strong" path, which indicates the strong dependency throughout each dependency
    in the path, as the path that doesn't have two conjunct edges (dependencies) as
    -(xR)-> and -(Sx)->. In other words, a "strong" path is a path from a lock
    walking to another through the lock dependencies, and if X -> Y -> Z is in the
    path (where X, Y, Z are locks), and the walk from X to Y is through a -(SR)-> or
    -(ER)-> dependency, the walk from Y to Z must not be through a -(SN)-> or
    -(SR)-> dependency.

  • 死锁关系证明
    • 定理1
      如果一条路径上有闭环的strong path,那么这个就会构成死锁.
    • 定理2
      如果一条路径上没有闭环的strong path,那么一定不会构成死锁.

      在这两个定理的前提下,可以很容易得出,闭环的strong path是死锁的充分必要条件. 上面两条定理的证明在内核文档中有详细的描述。在此不再赘述.

      所以在lockdep的检测中,只需要找出持锁路径上是否有闭环的strong path就可以了.

  • 代码实现

    source to call check_noncircular
    通过代码注释可以看到,内核通过广度优先的遍历方式,寻找是否会形成依赖环路, 在找到对应的环路之后,再经过对闭环strong path的判断来验证是否存在死锁.
    检测是否存在闭环的strong path

    /*
     * We are about to add B -> A into the dependency graph, and in __bfs() a
     * strong dependency path A -> .. -> B is found: hlock_class equals
     * entry->class.
     *
     * We will have a deadlock case (conflict) if A -> .. -> B -> A is a strong
     * dependency cycle, that means:
     *
     * Either
     *
     *     a) B -> A is -(E*)->
     *
     * or
     *
     *     b) A -> .. -> B is -(*N)-> (i.e. A -> .. -(*N)-> B)
     *
     * as then we don't have -(*R)-> -(S*)-> in the cycle.
     */
    static inline bool hlock_conflict(struct lock_list *entry, void *data)
    {
        struct held_lock *hlock = (struct held_lock *)data;
    
        return hlock_class(hlock) == entry->class && /* Found A -> .. -> B */
            (hlock->read == 0 || /* B -> A is -(E*)-> */
                !entry->only_xr); /* A -> .. -> B is -(*N)-> */
    }
    

    如上面的注释,判断几个东西

    hlock_class(hlock) == entry->class
    A -> B 这个锁和即将获取的锁形成了依赖关系
    以前有依赖关系 A -> B, 再尝试通过B -> A的顺序拿锁是会造成这个条件成立
    hlock->read == 0
    B -(E*)-> A 这个情况满足闭环strong path的条件
    !entry->only_xr
    A -> .. -> B is -(*N)-> (i.e. A -> .. -(*N)-> B)
    如果这几个条件成立,则认为满足了闭环strong path的充要条件.

Debug

/proc/lockdep 节点可以查看每个lock class的详细debug信息, 下面列出了示例信息
代码请参考 source for /proc/lockdep

00000000d17fd3f7 OPS:     137 FD:  308 BD:    9 ++++: &tty->termios_rwsem
  -> [00000000b06224ce] &lock->wait_lock
  -> [00000000d4898666] &tty->write_wait
  -> [000000007b4e289b] &ldata->output_lock
  -> [000000006ca1fe00] (console_sem).lock
  -> [00000000ee9c07cf] console_lock
  -> [000000002bdc3481] &port->mutex
  -> [00000000bc14037e] &rq->lock
  -> [000000004119195b] &tty->read_wait
  -> [000000004246edf4] &port_lock_key
  -> [00000000b287322b] &mm->mmap_lock
  -> [0000000007fc188b] &tty->throttle_mutex
  -> [000000003fde6c48] &p->pi_lock
00000000d17fd3f7
lock的地址,%p显示,这里由于打印的保护机制,被显示为指针的hash值
OPS
137 尝试获取锁的次数, __lock_acquire时加1
FD
308 在获取该锁之前获取过的锁的个数
BD
9 在获取该锁之后获取过的锁个数
++
锁的状态,见上文中的解释
&tty->termios_rwsem
锁的名字

refs

Contact me via :)
虚怀乃若谷,水深则流缓。