BUG analysis: large numbers of D-state processes stuck in shrink_inactive_list, leading to…

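In one project, dozens to more than a hundred processes would occasionally get stuck in D state (uninterruptible sleep), causing jank, freezes, SWT and similar problems. Over time we submitted three fixes, and the issue was still not fully resolved.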
Mountain after mountain, river after river: seemingly no way through
LOG:
[149459.897408] [3:2065:watchdog] Binder:1042_16 D 0 9917 635 0x00000008
[149459.897427] [3:2065:watchdog] Call trace:
[149459.897435] [3:2065:watchdog] [] __switch_to+0xb4/0xc0
[149459.897452] [3:2065:watchdog] [] __schedule+0x7f0/0xad0
[149459.897468] [3:2065:watchdog] [] schedule+0x70/0x90
[149459.897485] [3:2065:watchdog] [] schedule_timeout+0x548/0x668
[149459.897502] [3:2065:watchdog] [] msleep+0x28/0x38
[149459.897517] [3:2065:watchdog] [] shrink_inactive_list+0x118/0x998
[149459.897534] [3:2065:watchdog] [] shrink_node_memcg+0xa18/0x1100
[149459.897552] [3:2065:watchdog] [] shrink_node+0x108/0x2f8
[149459.897568] [3:2065:watchdog] [] do_try_to_free_pages+0x178/0x380
[149459.897586] [3:2065:watchdog] [] try_to_free_pages+0x370/0x4d8
[149459.897605] [3:2065:watchdog] [] __alloc_pages_nodemask+0x868/0x1380
[149459.897623] [3:2065:watchdog] [] __do_page_cache_readahead+0xbc/0x358
[149459.897640] [3:2065:watchdog] [] filemap_fault+0x11c/0x600
[149459.897647] [3:2065:watchdog] [] ext4_filemap_fault+0x30/0x50
[149459.897664] [3:2065:watchdog] [] handle_pte_fault+0xb38/0xfa8
[149459.897681] [3:2065:watchdog] [] handle_mm_fault+0x1d0/0x328
[149459.897699] [3:2065:watchdog] [] do_page_fault+0x2a0/0x3e0
[149459.897716] [3:2065:watchdog] [] do_translation_fault+0x44/0xa8
[149459.897732] [3:2065:watchdog] [] do_mem_abort+0x4c/0xd0
[149459.897750] [3:2065:watchdog] [] el0_da+0x20/0x24
[149459.897767] [3:2065:watchdog] Binder:1042_19 D 0 11188 635 0x00000008
[149459.897786] [3:2065:watchdog] Call trace:
[149459.897797] [3:2065:watchdog] [] __switch_to+0xb4/0xc0
[149459.897804] [3:2065:watchdog] [] __schedule+0x7f0/0xad0
[149459.897820] [3:2065:watchdog] [] schedule+0x70/0x90
[149459.897835] [3:2065:watchdog] [] schedule_timeout+0x548/0x668
[149459.897853] [3:2065:watchdog] [] msleep+0x28/0x38
[149459.897868] [3:2065:watchdog] [] shrink_inactive_list+0x118/0x998
[149459.897887] [3:2065:watchdog] [] shrink_node_memcg+0xa18/0x1100
[149459.897904] [3:2065:watchdog] [] shrink_node+0x108/0x2f8
[149459.897922] [3:2065:watchdog] [] do_try_to_free_pages+0x178/0x380
[149459.897940] [3:2065:watchdog] [] try_to_free_pages+0x370/0x4d8
[149459.897957] [3:2065:watchdog] [] __alloc_pages_nodemask+0x868/0x1380
[149459.897977] [3:2065:watchdog] [] __do_page_cache_readahead+0xbc/0x358
[149459.897996] [3:2065:watchdog] [] filemap_fault+0x11c/0x600
[149459.898013] [3:2065:watchdog] [] ext4_filemap_fault+0x30/0x50
[149459.898031] [3:2065:watchdog] [] handle_pte_fault+0xb38/0xfa8
[149459.898048] [3:2065:watchdog] [] handle_mm_fault+0x1d0/0x328
[149459.898065] [3:2065:watchdog] [] do_page_fault+0x2a0/0x3e0
[149459.898083] [3:2065:watchdog] [] do_translation_fault+0x44/0xa8
[149459.898100] [3:2065:watchdog] [] do_el0_ia_bp_hardening+0xc0/0x158
[149459.898118] [3:2065:watchdog] [] el0_ia+0x1c/0x20
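Symptom: a large number of processes enter through the page-fault path and call into memory reclaim. In the call traces above they are asleep in msleep, called from shrink_inactive_list, which leaves each of these processes in D state.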
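To put a number on "a large number of processes", a watchdog dump like the one above can be scanned for tasks whose trace contains a given frame. The following is only a minimal userspace sketch, not part of the original analysis: the helper names (line_contains, line_has_frame, count_tasks_with_frame) are hypothetical, and it assumes every task's trace begins with a "Call trace:" line and that frames are printed as "name+0xoff/0xsize", as in the log above.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the first `len` bytes of `line` contain `needle`. */
static int line_contains(const char *line, size_t len, const char *needle)
{
    size_t nlen = strlen(needle);

    for (size_t i = 0; i + nlen <= len; i++)
        if (memcmp(line + i, needle, nlen) == 0)
            return 1;
    return 0;
}

/* Returns 1 if the line holds a stack frame for exactly `func`:
 * the text "<func>+0x" not preceded by an identifier character,
 * so "schedule" does not match "__schedule+0x...". */
static int line_has_frame(const char *line, size_t len, const char *func)
{
    size_t flen = strlen(func);

    for (size_t i = 0; i + flen + 3 <= len; i++) {
        if (memcmp(line + i, func, flen) != 0)
            continue;
        if (memcmp(line + i + flen, "+0x", 3) != 0)
            continue;
        if (i > 0) {
            unsigned char prev = (unsigned char)line[i - 1];
            if (prev == '_' || isalnum(prev))
                continue;
        }
        return 1;
    }
    return 0;
}

/* Counts the tasks in a dump whose call trace contains `func`.
 * Assumption: a "Call trace:" line marks the start of each task's trace. */
int count_tasks_with_frame(const char *log, const char *func)
{
    int count = 0;
    int in_trace = 0;  /* a "Call trace:" line has been seen */
    int matched = 0;   /* the current trace was already counted */
    const char *p = log;

    while (*p) {
        const char *eol = strchr(p, '\n');
        size_t len = eol ? (size_t)(eol - p) : strlen(p);

        if (line_contains(p, len, "Call trace:")) {
            in_trace = 1;
            matched = 0;
        } else if (in_trace && !matched && line_has_frame(p, len, func)) {
            matched = 1;
            count++;
        }
        p += len;
        if (*p == '\n')
            p++;
    }
    return count;
}
```

Run over the two traces shown above, count_tasks_with_frame(log, "shrink_inactive_list") would count both Binder threads. On a full dump the same call gives the number of tasks parked in that function.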
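void msleep(unsigned int msecs)
{
	/* +1 so that the sleep lasts at least msecs, even if the current jiffy is partly elapsed */
	unsigned long timeout = msecs_to_jiffies(msecs) + 1;

	/* schedule_timeout returns the remaining jiffies; keep sleeping until none remain */
	while (timeout)
		timeout = schedule_timeout_uninterruptible(timeout);
}

signed long __sched schedule_timeout_uninterruptible(signed long timeout)
{
	/* TASK_UNINTERRUPTIBLE is the D state seen in the log */
	__set_current_state(TASK_UNINTERRUPTIBLE);
	return schedule_timeout(timeout);
}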
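To make the timeout arithmetic in msleep concrete, here is a small userspace model of how many jiffies one call asks to sleep for. It is a sketch under stated assumptions, not kernel code: the function names are hypothetical, the tick rate is passed in as hz instead of using the kernel's compile-time HZ, hz is assumed to be at most 1000, and the conversion is simplified to rounding up (the kernel's msecs_to_jiffies also clamps very large values, which is ignored here).

```c
/* Userspace model of msecs_to_jiffies(): milliseconds to ticks, rounded up.
 * Assumption: hz is the tick rate (HZ in the kernel) and hz <= 1000. */
unsigned long model_msecs_to_jiffies(unsigned int msecs, unsigned int hz)
{
    return (unsigned long)(((unsigned long long)msecs * hz + 999ULL) / 1000ULL);
}

/* Timeout that msleep() starts with: the converted value plus one tick. */
unsigned long model_msleep_jiffies(unsigned int msecs, unsigned int hz)
{
    return model_msecs_to_jiffies(msecs, hz) + 1;
}

/* Converts a tick count back to milliseconds (truncating). */
unsigned long model_jiffies_to_msecs(unsigned long jiffies, unsigned int hz)
{
    return (unsigned long)((unsigned long long)jiffies * 1000ULL / hz);
}
```

As an illustrative case (the 100 ms and HZ=250 values are chosen for the example, not taken from the log), model_msleep_jiffies(100, 250) gives 26 ticks, or 104 ms. For that whole time the task stays in TASK_UNINTERRUPTIBLE, so a signal cannot cut the sleep short.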