【Severity】 Normal
【Feature】
【Reproduction type】 Reproducible intermittently under specific conditions
【Root cause analysis】 </p><p><br></p><p>The root cause is that CIFS calls lock_page during writepages and calls unlock_page only after the completion path driven by cifsd has cleared the PG_writeback flag. If a reconnect occurs before PG_writeback is cleared, the completion path enters cifs_writev_requeue(), which in the baseline code calls lock_page on the same page, resulting in a deadlock.</p><p><br></p><div class="ql-code-block-container" spellcheck="false"><div class="ql-code-block" data-language="plain">Process 1 (dd) Process 2 (cifsd) Process 3 (cifsiod worker)</div><div class="ql-code-block" data-language="plain">cifs_writepages</div><div class="ql-code-block" data-language="plain"> lock_page - [1]</div><div class="ql-code-block" data-language="plain"> wait_on_page_writeback - [2] waits for the writeback flag to be cleared; blocked by [4]</div><div class="ql-code-block" data-language="plain"> wait_on_page_bit</div><div class="ql-code-block" data-language="plain"> cifs_demultiplex_thread</div><div class="ql-code-block" data-language="plain"> cifs_read_from_socket</div><div class="ql-code-block" data-language="plain"> cifs_readv_from_socket</div><div class="ql-code-block" data-language="plain"> - if another process triggers a reconnect at this point</div><div class="ql-code-block" data-language="plain"> cifs_reconnect</div><div class="ql-code-block" data-language="plain"> - mid->mid_state is updated to MID_RETRY_NEEDED</div><div class="ql-code-block" data-language="plain"> smb2_writev_callback mid_entry->callback()</div><div class="ql-code-block" data-language="plain"> - this mid_state in turn sets wdata->result = -EAGAIN</div><div class="ql-code-block" data-language="plain"> wdata->result = -EAGAIN</div><div class="ql-code-block" data-language="plain"> queue_work(cifsiod_wq, &wdata->work);</div><div class="ql-code-block" data-language="plain"> cifs_writev_complete - worker function</div><div class="ql-code-block" data-language="plain"> - the condition is met</div><div class="ql-code-block" data-language="plain"> else if (..&& wdata->result == -EAGAIN)</div><div class="ql-code-block" data-language="plain"> cifs_writev_requeue</div><div class="ql-code-block" data-language="plain"> lock_page - [3] blocked by [1]</div><div class="ql-code-block" data-language="plain"> end_page_writeback - [4] never executes; blocked by [3]</div><div class="ql-code-block" data-language="plain"> unlock_page</div></div><p><br></p><p><br></p><p>As a result, cifs_writepages and the cifsiod worker wait on each other in a cycle: [2] waits for [4], [4] cannot run until [3] completes, and [3] waits for the page lock taken at [1]. This is a deadlock.</p><p><br></p><p> The mainline refactoring patch d08089f649a0 (“cifs: Change the I/O paths to use an iterator rather than a page list”) unlocks the page while waiting for writeback to complete, which avoids the potential deadlock caused by lock ordering during reconnect.</p><p><br></p><p> Because of the extensive mainline refactoring, that patch cannot be backported directly. This patch therefore borrows only part of the mainline approach to fix the deadlock.</p><p><br></p><p>