Revision 8e7f82deb0c0386a03b62e30082574347f8b57d5 authored by Filipe Manana on 12 September 2023, 10:45:39 UTC, committed by David Sterba on 14 September 2023, 21:24:42 UTC
When opening a directory (opendir(3)) or rewinding it (rewinddir(3)), we are not holding the directory's inode locked, and this can result in later attempting to add two entries to the directory with the same index number, resulting in a transaction abort, with -EEXIST (-17), when inserting the second delayed dir index. This results in a trace like the following: Sep 11 22:34:59 myhostname kernel: BTRFS error (device dm-3): err add delayed dir index item(name: cockroach-stderr.log) into the insertion tree of the delayed node(root id: 5, inode id: 4539217, errno: -17) Sep 11 22:34:59 myhostname kernel: ------------[ cut here ]------------ Sep 11 22:34:59 myhostname kernel: kernel BUG at fs/btrfs/delayed-inode.c:1504! Sep 11 22:34:59 myhostname kernel: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI Sep 11 22:34:59 myhostname kernel: CPU: 0 PID: 7159 Comm: cockroach Not tainted 6.4.15-200.fc38.x86_64 #1 Sep 11 22:34:59 myhostname kernel: Hardware name: ASUS ESC500 G3/P9D WS, BIOS 2402 06/27/2018 Sep 11 22:34:59 myhostname kernel: RIP: 0010:btrfs_insert_delayed_dir_index+0x1da/0x260 Sep 11 22:34:59 myhostname kernel: Code: eb dd 48 (...) Sep 11 22:34:59 myhostname kernel: RSP: 0000:ffffa9980e0fbb28 EFLAGS: 00010282 Sep 11 22:34:59 myhostname kernel: RAX: 0000000000000000 RBX: ffff8b10b8f4a3c0 RCX: 0000000000000000 Sep 11 22:34:59 myhostname kernel: RDX: 0000000000000000 RSI: ffff8b177ec21540 RDI: ffff8b177ec21540 Sep 11 22:34:59 myhostname kernel: RBP: ffff8b110cf80888 R08: 0000000000000000 R09: ffffa9980e0fb938 Sep 11 22:34:59 myhostname kernel: R10: 0000000000000003 R11: ffffffff86146508 R12: 0000000000000014 Sep 11 22:34:59 myhostname kernel: R13: ffff8b1131ae5b40 R14: ffff8b10b8f4a418 R15: 00000000ffffffef Sep 11 22:34:59 myhostname kernel: FS: 00007fb14a7fe6c0(0000) GS:ffff8b177ec00000(0000) knlGS:0000000000000000 Sep 11 22:34:59 myhostname kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Sep 11 22:34:59 myhostname kernel: CR2: 000000c00143d000 CR3: 00000001b3b4e002 CR4: 00000000001706f0 Sep 11 22:34:59 myhostname kernel: Call Trace: Sep 11 22:34:59 myhostname kernel: <TASK> Sep 11 22:34:59 myhostname kernel: ? die+0x36/0x90 Sep 11 22:34:59 myhostname kernel: ? do_trap+0xda/0x100 Sep 11 22:34:59 myhostname kernel: ? btrfs_insert_delayed_dir_index+0x1da/0x260 Sep 11 22:34:59 myhostname kernel: ? do_error_trap+0x6a/0x90 Sep 11 22:34:59 myhostname kernel: ? btrfs_insert_delayed_dir_index+0x1da/0x260 Sep 11 22:34:59 myhostname kernel: ? exc_invalid_op+0x50/0x70 Sep 11 22:34:59 myhostname kernel: ? btrfs_insert_delayed_dir_index+0x1da/0x260 Sep 11 22:34:59 myhostname kernel: ? asm_exc_invalid_op+0x1a/0x20 Sep 11 22:34:59 myhostname kernel: ? btrfs_insert_delayed_dir_index+0x1da/0x260 Sep 11 22:34:59 myhostname kernel: ? btrfs_insert_delayed_dir_index+0x1da/0x260 Sep 11 22:34:59 myhostname kernel: btrfs_insert_dir_item+0x200/0x280 Sep 11 22:34:59 myhostname kernel: btrfs_add_link+0xab/0x4f0 Sep 11 22:34:59 myhostname kernel: ? ktime_get_real_ts64+0x47/0xe0 Sep 11 22:34:59 myhostname kernel: btrfs_create_new_inode+0x7cd/0xa80 Sep 11 22:34:59 myhostname kernel: btrfs_symlink+0x190/0x4d0 Sep 11 22:34:59 myhostname kernel: ? schedule+0x5e/0xd0 Sep 11 22:34:59 myhostname kernel: ? __d_lookup+0x7e/0xc0 Sep 11 22:34:59 myhostname kernel: vfs_symlink+0x148/0x1e0 Sep 11 22:34:59 myhostname kernel: do_symlinkat+0x130/0x140 Sep 11 22:34:59 myhostname kernel: __x64_sys_symlinkat+0x3d/0x50 Sep 11 22:34:59 myhostname kernel: do_syscall_64+0x5d/0x90 Sep 11 22:34:59 myhostname kernel: ? syscall_exit_to_user_mode+0x2b/0x40 Sep 11 22:34:59 myhostname kernel: ? do_syscall_64+0x6c/0x90 Sep 11 22:34:59 myhostname kernel: entry_SYSCALL_64_after_hwframe+0x72/0xdc The race leading to the problem happens like this: 1) Directory inode X is loaded into memory, its ->index_cnt field is initialized to (u64)-1 (at btrfs_alloc_inode()); 2) Task A is adding a new file to directory X, holding its vfs inode lock, and calls btrfs_set_inode_index() to get an index number for the entry. Because the inode's index_cnt field is set to (u64)-1 it calls btrfs_inode_delayed_dir_index_count() which fails because no dir index entries were added yet to the delayed inode and then it calls btrfs_set_inode_index_count(). This functions finds the last dir index key and then sets index_cnt to that index value + 1. It found that the last index key has an offset of 100. However before it assigns a value of 101 to index_cnt... 3) Task B calls opendir(3), ending up at btrfs_opendir(), where the VFS lock for inode X is not taken, so it calls btrfs_get_dir_last_index() and sees index_cnt still with a value of (u64)-1. Because of that it calls btrfs_inode_delayed_dir_index_count() which fails since no dir index entries were added to the delayed inode yet, and then it also calls btrfs_set_inode_index_count(). This also finds that the last index key has an offset of 100, and before it assigns the value 101 to the index_cnt field of inode X... 4) Task A assigns a value of 101 to index_cnt. And then the code flow goes to btrfs_set_inode_index() where it increments index_cnt from 101 to 102. Task A then creates a delayed dir index entry with a sequence number of 101 and adds it to the delayed inode; 5) Task B assigns 101 to the index_cnt field of inode X; 6) At some later point when someone tries to add a new entry to the directory, btrfs_set_inode_index() will return 101 again and shortly after an attempt to add another delayed dir index key with index number 101 will fail with -EEXIST resulting in a transaction abort. Fix this by locking the inode at btrfs_get_dir_last_index(), which is only only used when opening a directory or attempting to lseek on it. Reported-by: ken <ken@bllue.org> Link: https://lore.kernel.org/linux-btrfs/CAE6xmH+Lp=Q=E61bU+v9eWX8gYfLvu6jLYxjxjFpo3zHVPR0EQ@mail.gmail.com/ Reported-by: syzbot+d13490c82ad5353c779d@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/00000000000036e1290603e097e0@google.com/ Fixes: 9b378f6ad48c ("btrfs: fix infinite directory reads") CC: stable@vger.kernel.org # 6.5+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
1 parent e60aa5d
Kconfig.kgdb
# SPDX-License-Identifier: GPL-2.0-only
config HAVE_ARCH_KGDB
bool
# set if architecture has the its kgdb_arch_handle_qxfer_pkt
# function to enable gdb stub to address XML packet sent from GDB.
config HAVE_ARCH_KGDB_QXFER_PKT
bool
menuconfig KGDB
bool "KGDB: kernel debugger"
depends on HAVE_ARCH_KGDB
depends on DEBUG_KERNEL
help
If you say Y here, it will be possible to remotely debug the
kernel using gdb. It is recommended but not required, that
you also turn on the kernel config option
CONFIG_FRAME_POINTER to aid in producing more reliable stack
backtraces in the external debugger. Documentation of
kernel debugger is available at http://kgdb.sourceforge.net
as well as in Documentation/dev-tools/kgdb.rst. If
unsure, say N.
if KGDB
config KGDB_HONOUR_BLOCKLIST
bool "KGDB: use kprobe blocklist to prohibit unsafe breakpoints"
depends on HAVE_KPROBES
depends on MODULES
select KPROBES
default y
help
If set to Y the debug core will use the kprobe blocklist to
identify symbols where it is unsafe to set breakpoints.
In particular this disallows instrumentation of functions
called during debug trap handling and thus makes it very
difficult to inadvertently provoke recursive trap handling.
If unsure, say Y.
config KGDB_SERIAL_CONSOLE
tristate "KGDB: use kgdb over the serial console"
select CONSOLE_POLL
select MAGIC_SYSRQ
depends on TTY && HW_CONSOLE
default y
help
Share a serial console with kgdb. Sysrq-g must be used
to break in initially.
config KGDB_TESTS
bool "KGDB: internal test suite"
default n
help
This is a kgdb I/O module specifically designed to test
kgdb's internal functions. This kgdb I/O module is
intended to for the development of new kgdb stubs
as well as regression testing the kgdb internals.
See the drivers/misc/kgdbts.c for the details about
the tests. The most basic of this I/O module is to boot
a kernel boot arguments "kgdbwait kgdbts=V1F100"
config KGDB_TESTS_ON_BOOT
bool "KGDB: Run tests on boot"
depends on KGDB_TESTS
default n
help
Run the kgdb tests on boot up automatically without the need
to pass in a kernel parameter
config KGDB_TESTS_BOOT_STRING
string "KGDB: which internal kgdb tests to run"
depends on KGDB_TESTS_ON_BOOT
default "V1F100"
help
This is the command string to send the kgdb test suite on
boot. See the drivers/misc/kgdbts.c for detailed
information about other strings you could use beyond the
default of V1F100.
config KGDB_LOW_LEVEL_TRAP
bool "KGDB: Allow debugging with traps in notifiers"
depends on X86 || MIPS
default n
help
This will add an extra call back to kgdb for the breakpoint
exception handler which will allow kgdb to step through a
notify handler.
config KGDB_KDB
bool "KGDB_KDB: include kdb frontend for kgdb"
default n
help
KDB frontend for kernel
config KDB_DEFAULT_ENABLE
hex "KDB: Select kdb command functions to be enabled by default"
depends on KGDB_KDB
default 0x1
help
Specifiers which kdb commands are enabled by default. This may
be set to 1 or 0 to enable all commands or disable almost all
commands.
Alternatively the following bitmask applies:
0x0002 - allow arbitrary reads from memory and symbol lookup
0x0004 - allow arbitrary writes to memory
0x0008 - allow current register state to be inspected
0x0010 - allow current register state to be modified
0x0020 - allow passive inspection (backtrace, process list, lsmod)
0x0040 - allow flow control management (breakpoint, single step)
0x0080 - enable signalling of processes
0x0100 - allow machine to be rebooted
The config option merely sets the default at boot time. Both
issuing 'echo X > /sys/module/kdb/parameters/cmd_enable' or
setting with kdb.cmd_enable=X kernel command line option will
override the default settings.
config KDB_KEYBOARD
bool "KGDB_KDB: keyboard as input device"
depends on VT && KGDB_KDB && !PARISC
default n
help
KDB can use a PS/2 type keyboard for an input device
config KDB_CONTINUE_CATASTROPHIC
int "KDB: continue after catastrophic errors"
depends on KGDB_KDB
default "0"
help
This integer controls the behaviour of kdb when the kernel gets a
catastrophic error, i.e. for a panic or oops.
When KDB is active and a catastrophic error occurs, nothing extra
will happen until you type 'go'.
CONFIG_KDB_CONTINUE_CATASTROPHIC == 0 (default). The first time
you type 'go', you will be warned by kdb. The secend time you type
'go', KDB tries to continue. No guarantees that the
kernel is still usable in this situation.
CONFIG_KDB_CONTINUE_CATASTROPHIC == 1. KDB tries to continue.
No guarantees that the kernel is still usable in this situation.
CONFIG_KDB_CONTINUE_CATASTROPHIC == 2. KDB forces a reboot.
If you are not sure, say 0.
config ARCH_HAS_EARLY_DEBUG
bool
default n
help
If an architecture can definitely handle entering the debugger
when early_param's are parsed then it select this config.
Otherwise, if "kgdbwait" is passed on the kernel command line it
won't actually be processed until dbg_late_init() just after the
call to kgdb_arch_late() is made.
NOTE: Even if this isn't selected by an architecture we will
still try to register kgdb to handle breakpoints and crashes
when early_param's are parsed, we just won't act on the
"kgdbwait" parameter until dbg_late_init(). If you get a
crash and try to drop into kgdb somewhere between these two
places you might or might not end up being able to use kgdb
depending on exactly how far along the architecture has initted.
endif # KGDB
![swh spinner](/static/img/swh-spinner.gif)
Computing file changes ...