Welcome to mirror list, hosted at ThFree Co, Russian Federation.

github.com/torvalds/linux.git - Unnamed repository; edit this file 'description' to name the repository.
summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-02-02md: fix NULL pointer deref with nowait but no mddev->queueSong Liu
Leon reported NULL pointer deref with nowait support: [ 15.123761] device-mapper: raid: Loading target version 1.15.1 [ 15.124185] device-mapper: raid: Ignoring chunk size parameter for RAID 1 [ 15.124192] device-mapper: raid: Choosing default region size of 4MiB [ 15.129524] BUG: kernel NULL pointer dereference, address: 0000000000000060 [ 15.129530] #PF: supervisor write access in kernel mode [ 15.129533] #PF: error_code(0x0002) - not-present page [ 15.129535] PGD 0 P4D 0 [ 15.129538] Oops: 0002 [#1] PREEMPT SMP NOPTI [ 15.129541] CPU: 5 PID: 494 Comm: ldmtool Not tainted 5.17.0-rc2-1-mainline #1 9fe89d43dfcb215d2731e6f8851740520778615e [ 15.129546] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X570 AORUS ELITE, BIOS F36e 10/14/2021 [ 15.129549] RIP: 0010:blk_queue_flag_set+0x7/0x20 [ 15.129555] Code: 00 00 00 0f 1f 44 00 00 48 8b 35 e4 e0 04 02 48 8d 57 28 bf 40 01 \ 00 00 e9 16 c1 be ff 66 0f 1f 44 00 00 0f 1f 44 00 00 89 ff <f0> 48 0f ab 7e 60 \ 31 f6 89 f7 c3 66 66 2e 0f 1f 84 00 00 00 00 00 [ 15.129559] RSP: 0018:ffff966b81987a88 EFLAGS: 00010202 [ 15.129562] RAX: ffff8b11c363a0d0 RBX: ffff8b11e294b070 RCX: 0000000000000000 [ 15.129564] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000001d [ 15.129566] RBP: ffff8b11e294b058 R08: 0000000000000000 R09: 0000000000000000 [ 15.129568] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b11e294b070 [ 15.129570] R13: 0000000000000000 R14: ffff8b11e294b000 R15: 0000000000000001 [ 15.129572] FS: 00007fa96e826780(0000) GS:ffff8b18deb40000(0000) knlGS:0000000000000000 [ 15.129575] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 15.129577] CR2: 0000000000000060 CR3: 000000010b8ce000 CR4: 00000000003506e0 [ 15.129580] Call Trace: [ 15.129582] <TASK> [ 15.129584] md_run+0x67c/0xc70 [md_mod 1e470c1b6bcf1114198109f42682f5a2740e9531] [ 15.129597] raid_ctr+0x134a/0x28ea [dm_raid 6a645dd7519e72834bd7e98c23497eeade14cd63] [ 15.129604] ? dm_split_args+0x63/0x150 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e] [ 15.129615] dm_table_add_target+0x188/0x380 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e] [ 15.129625] table_load+0x13b/0x370 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e] [ 15.129635] ? dev_suspend+0x2d0/0x2d0 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e] [ 15.129644] ctl_ioctl+0x1bd/0x460 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e] [ 15.129655] dm_ctl_ioctl+0xa/0x20 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e] [ 15.129663] __x64_sys_ioctl+0x8e/0xd0 [ 15.129667] do_syscall_64+0x5c/0x90 [ 15.129672] ? syscall_exit_to_user_mode+0x23/0x50 [ 15.129675] ? do_syscall_64+0x69/0x90 [ 15.129677] ? do_syscall_64+0x69/0x90 [ 15.129679] ? syscall_exit_to_user_mode+0x23/0x50 [ 15.129682] ? do_syscall_64+0x69/0x90 [ 15.129684] ? do_syscall_64+0x69/0x90 [ 15.129686] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 15.129689] RIP: 0033:0x7fa96ecd559b [ 15.129692] Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c \ c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff \ ff 73 01 c3 48 8b 0d a5 a8 0c 00 f7 d8 64 89 01 48 [ 15.129696] RSP: 002b:00007ffcaf85c258 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 [ 15.129699] RAX: ffffffffffffffda RBX: 00007fa96f1b48f0 RCX: 00007fa96ecd559b [ 15.129701] RDX: 00007fa97017e610 RSI: 00000000c138fd09 RDI: 0000000000000003 [ 15.129702] RBP: 00007fa96ebab583 R08: 00007fa97017c9e0 R09: 00007ffcaf85bf27 [ 15.129704] R10: 0000000000000001 R11: 0000000000000206 R12: 00007fa97017e610 [ 15.129706] R13: 00007fa97017e640 R14: 00007fa97017e6c0 R15: 00007fa97017e530 [ 15.129709] </TASK> This is caused by missing mddev->queue check for setting QUEUE_FLAG_NOWAIT Fix this by moving the QUEUE_FLAG_NOWAIT logic to under mddev->queue check. Fixes: f51d46d0e7cb ("md: add support for REQ_NOWAIT") Reported-by: Leon Möller <jkhsjdhjs@totally.rip> Tested-by: Leon Möller <jkhsjdhjs@totally.rip> Cc: Vishal Verma <vverma@digitalocean.com> Signed-off-by: Song Liu <song@kernel.org>
2022-01-12Merge tag 'for-5.17/drivers-2022-01-11' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block driver updates from Jens Axboe: - mtip32xx pci cleanups (Bjorn) - mtip32xx conversion to generic power management (Vaibhav) - rsxx pci powermanagement cleanups (Bjorn) - Remove the rsxx driver. This hardware never saw much adoption, and it's been end of lifed for a while. (Christoph) - MD pull request from Song: - REQ_NOWAIT support (Vishal Verma) - raid6 benchmark optimization (Dirk Müller) - Fix for acct bioset (Xiao Ni) - Clean up max_queued_requests (Mariusz Tkaczyk) - PREEMPT_RT optimization (Davidlohr Bueso) - Use default_groups in kobj_type (Greg Kroah-Hartman) - Use attribute groups in pktcdvd and rnbd (Greg) - NVMe pull request from Christoph: - increment request genctr on completion (Keith Busch, Geliang Tang) - add a 'iopolicy' module parameter (Hannes Reinecke) - print out valid arguments when reading from /dev/nvme-fabrics (Hannes Reinecke) - Use struct_group() in drbd (Kees) - null_blk fixes (Ming) - Get rid of congestion logic in pktcdvd (Neil) - Floppy ejection hang fix (Tasos) - Floppy max user request size fix (Xiongwei) - Loop locking fix (Tetsuo) * tag 'for-5.17/drivers-2022-01-11' of git://git.kernel.dk/linux-block: (32 commits) md: use default_groups in kobj_type md: Move alloc/free acct bioset in to personality lib/raid6: Use strict priority ranking for pq gen() benchmarking lib/raid6: skip benchmark of non-chosen xor_syndrome functions md: fix spelling of "its" md: raid456 add nowait support md: raid10 add nowait support md: raid1 add nowait support md: add support for REQ_NOWAIT md: drop queue limitation for RAID1 and RAID10 md/raid5: play nice with PREEMPT_RT block/rnbd-clt-sysfs: use default_groups in kobj_type pktcdvd: convert to use attribute groups block: null_blk: only set set->nr_maps as 3 if active poll_queues is > 0 nvme: add 'iopolicy' module parameter nvme: drop unused variable ctrl in nvme_setup_cmd nvme: increment request genctr on completion nvme-fabrics: print out valid arguments when reading from /dev/nvme-fabrics block: remove the rsxx driver rsxx: Drop PCI legacy power management ...
2022-01-12Merge tag 'for-5.17/block-2022-01-11' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: - Unify where the struct request handling code is located in the blk-mq code (Christoph) - Header cleanups (Christoph) - Clean up the io_context handling code (Christoph, me) - Get rid of ->rq_disk in struct request (Christoph) - Error handling fix for add_disk() (Christoph) - request allocation cleanusp (Christoph) - Documentation updates (Eric, Matthew) - Remove trivial crypto unregister helper (Eric) - Reduce shared tag overhead (John) - Reduce poll_stats memory overhead (me) - Known indirect function call for dio (me) - Use atomic references for struct request (me) - Support request list issue for block and NVMe (me) - Improve queue dispatch pinning (Ming) - Improve the direct list issue code (Keith) - BFQ improvements (Jan) - Direct completion helper and use it in mmc block (Sebastian) - Use raw spinlock for the blktrace code (Wander) - fsync error handling fix (Ye) - Various fixes and cleanups (Lukas, Randy, Yang, Tetsuo, Ming, me) * tag 'for-5.17/block-2022-01-11' of git://git.kernel.dk/linux-block: (132 commits) MAINTAINERS: add entries for block layer documentation docs: block: remove queue-sysfs.rst docs: sysfs-block: document virt_boundary_mask docs: sysfs-block: document stable_writes docs: sysfs-block: fill in missing documentation from queue-sysfs.rst docs: sysfs-block: add contact for nomerges docs: sysfs-block: sort alphabetically docs: sysfs-block: move to stable directory block: don't protect submit_bio_checks by q_usage_counter block: fix old-style declaration nvme-pci: fix queue_rqs list splitting block: introduce rq_list_move block: introduce rq_list_for_each_safe macro block: move rq_list macros to blk-mq.h block: drop needless assignment in set_task_ioprio() block: remove unnecessary trailing '\' bio.h: fix kernel-doc warnings block: check minor range in device_add_disk() block: use "unsigned long" for blk_validate_block_size(). block: fix error unwinding in device_add_disk ...
2022-01-06md: use default_groups in kobj_typeGreg Kroah-Hartman
There are currently 2 ways to create a set of sysfs files for a kobj_type, through the default_attrs field, and the default_groups field. Move the md rdev sysfs code to use default_groups field which has been the preferred way since commit aa30f47cf666 ("kobject: Add support for default attribute groups to kobj_type") so that we can soon get rid of the obsolete default_attrs field. Cc: Song Liu <song@kernel.org> Cc: linux-raid@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Song Liu <song@kernel.org>
2022-01-06md: Move alloc/free acct bioset in to personalityXiao Ni
bioset acct is only needed for raid0 and raid5. Therefore, md_run only allocates it for raid0 and raid5. However, this does not cover personality takeover, which may cause uninitialized bioset. For example, the following repro steps: mdadm -CR /dev/md0 -l1 -n2 /dev/loop0 /dev/loop1 mdadm --wait /dev/md0 mkfs.xfs /dev/md0 mdadm /dev/md0 --grow -l5 mount /dev/md0 /mnt causes panic like: [ 225.933939] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 225.934903] #PF: supervisor instruction fetch in kernel mode [ 225.935639] #PF: error_code(0x0010) - not-present page [ 225.936361] PGD 0 P4D 0 [ 225.936677] Oops: 0010 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN PTI [ 225.937525] CPU: 27 PID: 1133 Comm: mount Not tainted 5.16.0-rc3+ #706 [ 225.938416] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.module_el8.4.0+547+a85d02ba 04/01/2014 [ 225.939922] RIP: 0010:0x0 [ 225.940289] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6. [ 225.941196] RSP: 0018:ffff88815897eff0 EFLAGS: 00010246 [ 225.941897] RAX: 0000000000000000 RBX: 0000000000092800 RCX: ffffffff81370a39 [ 225.942813] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000092800 [ 225.943772] RBP: 1ffff1102b12fe04 R08: fffffbfff0b43c01 R09: fffffbfff0b43c01 [ 225.944807] R10: ffffffff85a1e007 R11: fffffbfff0b43c00 R12: ffff88810eaaaf58 [ 225.945757] R13: 0000000000000000 R14: ffff88810eaaafb8 R15: ffff88815897f040 [ 225.946709] FS: 00007ff3f2505080(0000) GS:ffff888fb5e00000(0000) knlGS:0000000000000000 [ 225.947814] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 225.948556] CR2: ffffffffffffffd6 CR3: 000000015aa5a006 CR4: 0000000000370ee0 [ 225.949537] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 225.950455] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 225.951414] Call Trace: [ 225.951787] <TASK> [ 225.952120] mempool_alloc+0xe5/0x250 [ 225.952625] ? mempool_resize+0x370/0x370 [ 225.953187] ? rcu_read_lock_sched_held+0xa1/0xd0 [ 225.953862] ? rcu_read_lock_bh_held+0xb0/0xb0 [ 225.954464] ? sched_clock_cpu+0x15/0x120 [ 225.955019] ? find_held_lock+0xac/0xd0 [ 225.955564] bio_alloc_bioset+0x1ed/0x2a0 [ 225.956080] ? lock_downgrade+0x3a0/0x3a0 [ 225.956644] ? bvec_alloc+0xc0/0xc0 [ 225.957135] bio_clone_fast+0x19/0x80 [ 225.957651] raid5_make_request+0x1370/0x1b70 [ 225.958286] ? sched_clock_cpu+0x15/0x120 [ 225.958797] ? __lock_acquire+0x8b2/0x3510 [ 225.959339] ? raid5_get_active_stripe+0xce0/0xce0 [ 225.959986] ? lock_is_held_type+0xd8/0x130 [ 225.960528] ? rcu_read_lock_sched_held+0xa1/0xd0 [ 225.961135] ? rcu_read_lock_bh_held+0xb0/0xb0 [ 225.961703] ? sched_clock_cpu+0x15/0x120 [ 225.962232] ? lock_release+0x27a/0x6c0 [ 225.962746] ? do_wait_intr_irq+0x130/0x130 [ 225.963302] ? lock_downgrade+0x3a0/0x3a0 [ 225.963815] ? lock_release+0x6c0/0x6c0 [ 225.964348] md_handle_request+0x342/0x530 [ 225.964888] ? set_in_sync+0x170/0x170 [ 225.965397] ? blk_queue_split+0x133/0x150 [ 225.965988] ? __blk_queue_split+0x8b0/0x8b0 [ 225.966524] ? submit_bio_checks+0x3b2/0x9d0 [ 225.967069] md_submit_bio+0x127/0x1c0 [...] Fix this by moving alloc/free of acct bioset to pers->run and pers->free. While we are on this, properly handle md_integrity_register() error in raid0_run(). Fixes: daee2024715d (md: check level before create and exit io_acct_set) Cc: stable@vger.kernel.org Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org>
2022-01-06md: fix spelling of "its"Randy Dunlap
Use the possessive "its" instead of the contraction "it's" in printed messages. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Song Liu <song@kernel.org> Cc: linux-raid@vger.kernel.org Signed-off-by: Song Liu <song@kernel.org>
2022-01-06md: add support for REQ_NOWAITVishal Verma
commit 021a24460dc2 ("block: add QUEUE_FLAG_NOWAIT") added support for checking whether a given bdev supports handling of REQ_NOWAIT or not. Since then commit 6abc49468eea ("dm: add support for REQ_NOWAIT and enable it for linear target") added support for REQ_NOWAIT for dm. This uses a similar approach to incorporate REQ_NOWAIT for md based bios. This patch was tested using t/io_uring tool within FIO. A nvme drive was partitioned into 2 partitions and a simple raid 0 configuration /dev/md0 was created. md0 : active raid0 nvme4n1p1[1] nvme4n1p2[0] 937423872 blocks super 1.2 512k chunks Before patch: $ ./t/io_uring /dev/md0 -p 0 -a 0 -d 1 -r 100 Running top while the above runs: $ ps -eL | grep $(pidof io_uring) 38396 38396 pts/2 00:00:00 io_uring 38396 38397 pts/2 00:00:15 io_uring 38396 38398 pts/2 00:00:13 iou-wrk-38397 We can see iou-wrk-38397 io worker thread created which gets created when io_uring sees that the underlying device (/dev/md0 in this case) doesn't support nowait. After patch: $ ./t/io_uring /dev/md0 -p 0 -a 0 -d 1 -r 100 Running top while the above runs: $ ps -eL | grep $(pidof io_uring) 38341 38341 pts/2 00:10:22 io_uring 38341 38342 pts/2 00:10:37 io_uring After running this patch, we don't see any io worker thread being created which indicated that io_uring saw that the underlying device does support nowait. This is the exact behaviour noticed on a dm device which also supports nowait. For all the other raid personalities except raid0, we would need to train pieces which involves make_request fn in order for them to correctly handle REQ_NOWAIT. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Vishal Verma <vverma@digitalocean.com> Signed-off-by: Song Liu <song@kernel.org>
2021-12-10md: fix double free of mddev->private in autorun_array()zhangyue
In driver/md/md.c, if the function autorun_array() is called, the problem of double free may occur. In function autorun_array(), when the function do_md_run() returns an error, the function do_md_stop() will be called. The function do_md_run() called function md_run(), but in function md_run(), the pointer mddev->private may be freed. The function do_md_stop() called the function __md_stop(), but in function __md_stop(), the pointer mddev->private also will be freed without judging null. At this time, the pointer mddev->private will be double free, so it needs to be judged null or not. Signed-off-by: zhangyue <zhangyue1@kylinos.cn> Signed-off-by: Song Liu <songliubraving@fb.com>
2021-12-10md: fix update super 1.0 on rdev size changeMarkus Hochholdinger
The superblock of version 1.0 doesn't get moved to the new position on a device size change. This leads to a rdev without a superblock on a known position, the raid can't be re-assembled. The line was removed by mistake and is re-added by this patch. Fixes: d9c0fa509eaf ("md: fix max sectors calculation for super 1.0") Cc: stable@vger.kernel.org Signed-off-by: Markus Hochholdinger <markus@hochholdinger.net> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2021-11-29block: remove GENHD_FL_EXT_DEVTChristoph Hellwig
All modern drivers can support extra partitions using the extended dev_t. In fact except for the ioctl method drivers never even see partitions in normal operation. So remove the GENHD_FL_EXT_DEVT and allow extra partitions for all block devices that do support partitions, and require those that do not support partitions to explicit disallow them using GENHD_FL_NO_PART. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211122130625.1136848-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-01Merge tag 'for-5.16/bdev-size-2021-10-29' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull bdev size cleanups from Jens Axboe: "Clean up the bdev size handling with new bdev_nr_bytes() helper" * tag 'for-5.16/bdev-size-2021-10-29' of git://git.kernel.dk/linux-block: (34 commits) partitions/ibm: use bdev_nr_sectors instead of open coding it partitions/efi: use bdev_nr_bytes instead of open coding it block/ioctl: use bdev_nr_sectors and bdev_nr_bytes block: cache inode size in bdev udf: use sb_bdev_nr_blocks reiserfs: use sb_bdev_nr_blocks ntfs: use sb_bdev_nr_blocks jfs: use sb_bdev_nr_blocks ext4: use sb_bdev_nr_blocks block: add a sb_bdev_nr_blocks helper block: use bdev_nr_bytes instead of open coding it in blkdev_fallocate squashfs: use bdev_nr_bytes instead of open coding it reiserfs: use bdev_nr_bytes instead of open coding it pstore/blk: use bdev_nr_bytes instead of open coding it ntfs3: use bdev_nr_bytes instead of open coding it nilfs2: use bdev_nr_bytes instead of open coding it nfs/blocklayout: use bdev_nr_bytes instead of open coding it jfs: use bdev_nr_bytes instead of open coding it hfsplus: use bdev_nr_sectors instead of open coding it hfs: use bdev_nr_sectors instead of open coding it ...
2021-10-18md: update superblock after changing rdev flags in state_storeXiao Ni
When the in memory flag is changed, we need to persist the change in the rdev superblock flags. This is needed for "writemostly" and "failfast". Reviewed-by: Li Feng <fengli@smartx.com> Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18md: remove unused argument from md_new_eventGuoqing Jiang
Actually, mddev is not used by md_new_event. Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18md: properly unwind when failing to add the kobject in md_allocChristoph Hellwig
Add proper error handling to delete the gendisk when failing to add the md kobject and clean up the error unwinding in general. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18md: extend disks_mutex coverageChristoph Hellwig
disks_mutex is intended to serialize md_alloc. Extended it to also cover the kobject_uevent call and getting the sysfs dirent to help reducing error handling complexity. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18md: add the bitmap group to the default groups for the md kobjectChristoph Hellwig
Replace the deprecated default_attrs with the default_groups mechanism, and add the always visible bitmap group to the groups created add kobject_add time. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18md: add error handling support for add_disk()Luis Chamberlain
We never checked for errors on add_disk() as this function returned void. Now that this is fixed, use the shiny new error handling. We just do the unwinding of what was not done before, and are sure to unlock prior to bailing. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18md: use bdev_nr_sectors instead of open coding itChristoph Hellwig
Use the proper helper to read the block device size. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20211018101130.1838532-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block: switch polling to be bio basedChristoph Hellwig
Replace the blk_poll interface that requires the caller to keep a queue and cookie from the submissions with polling based on the bio. Polling for the bio itself leads to a few advantages: - the cookie construction can made entirely private in blk-mq.c - the caller does not need to remember the request_queue and cookie separately and thus sidesteps their lifetime issues - keeping the device and the cookie inside the bio allows to trivially support polling BIOs remapping by stacking drivers - a lot of code to propagate the cookie back up the submission path can be removed entirely. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block: move integrity handling out of <linux/blkdev.h>Christoph Hellwig
Split the integrity/metadata handling definitions out into a new header. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210920123328.1399408-17-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block: drop unused includes in <linux/genhd.h>Christoph Hellwig
Drop various include not actually used in genhd.h itself, and move the remaning includes closer together. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210920123328.1399408-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-22md: fix a lock order reversal in md_allocChristoph Hellwig
Commit b0140891a8cea3 ("md: Fix race when creating a new md device.") not only moved assigning mddev->gendisk before calling add_disk, which fixes the races described in the commit log, but also added a mddev->open_mutex critical section over add_disk and creation of the md kobj. Adding a kobject after add_disk is racy vs deleting the gendisk right after adding it, but md already prevents against that by holding a mddev->active reference. On the other hand taking this lock added a lock order reversal with what is not disk->open_mutex (used to be bdev->bd_mutex when the commit was added) for partition devices, which need that lock for the internal open for the partition scan, and a recent commit also takes it for non-partitioned devices, leading to further lockdep splatter. Fixes: b0140891a8ce ("md: Fix race when creating a new md device.") Fixes: d62633873590 ("block: support delayed holder registration") Reported-by: syzbot+fadc0aaf497e6a493b9f@syzkaller.appspotmail.com Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: syzbot+fadc0aaf497e6a493b9f@syzkaller.appspotmail.com Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Song Liu <songliubraving@fb.com>
2021-06-30Merge tag 'for-5.14/drivers-2021-06-29' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block driver updates from Jens Axboe: "Pretty calm round, mostly just NVMe and a bit of MD: - NVMe updates (via Christoph) - improve the APST configuration algorithm (Alexey Bogoslavsky) - look for StorageD3Enable on companion ACPI device (Mario Limonciello) - allow selecting the network interface for TCP connections (Martin Belanger) - misc cleanups (Amit Engel, Chaitanya Kulkarni, Colin Ian King, Christoph) - move the ACPI StorageD3 code to drivers/acpi/ and add quirks for certain AMD CPUs (Mario Limonciello) - zoned device support for nvmet (Chaitanya Kulkarni) - fix the rules for changing the serial number in nvmet (Noam Gottlieb) - various small fixes and cleanups (Dan Carpenter, JK Kim, Chaitanya Kulkarni, Hannes Reinecke, Wesley Sheng, Geert Uytterhoeven, Daniel Wagner) - MD updates (Via Song) - iostats rewrite (Guoqing Jiang) - raid5 lock contention optimization (Gal Ofri) - Fall through warning fix (Gustavo) - Misc fixes (Gustavo, Jiapeng)" * tag 'for-5.14/drivers-2021-06-29' of git://git.kernel.dk/linux-block: (78 commits) nvmet: use NVMET_MAX_NAMESPACES to set nn value loop: Fix missing discard support when using LOOP_CONFIGURE nvme.h: add missing nvme_lba_range_type endianness annotations nvme: remove zeroout memset call for struct nvme-pci: remove zeroout memset call for struct nvmet: remove zeroout memset call for struct nvmet: add ZBD over ZNS backend support nvmet: add Command Set Identifier support nvmet: add nvmet_req_bio put helper for backends nvmet: add req cns error complete helper block: export blk_next_bio() nvmet: remove local variable nvmet: use nvme status value directly nvmet: use u32 type for the local variable nsid nvmet: use u32 for nvmet_subsys max_nsid nvmet: use req->cmd directly in file-ns fast path nvmet: use req->cmd directly in bdev-ns fast path nvmet: make ver stable once connection established nvmet: allow mn change if subsys not discovered nvmet: make sn stable once connection was established ...
2021-06-15md: add comments in md_integrity_registerGuoqing Jiang
Given it is not obvious for the error handling, let's try to add some comments here to make it clear. Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
2021-06-15md: check level before create and exit io_acct_setGuoqing Jiang
The bio_set (io_acct_set) is used by personalities to clone bio and trace the timestamp of bio. Some personalities such as raid1/10 don't need the bio_set, so add check to not create it unconditionally. Also update the comment for md_account_bio to make it more clear. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
2021-06-15md: Constify attribute_group structsRikard Falkeborn
The attribute_group structs are never modified, they're only passed to sysfs_create_group() and sysfs_remove_group(). Make them const to allow the compiler to put them in read-only memory. Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com> Signed-off-by: Song Liu <song@kernel.org>
2021-06-15md: add io accounting for raid0 and raid5Guoqing Jiang
We introduce a new bioset (io_acct_set) for raid0 and raid5 since they don't own clone infrastructure to accounting io. And the bioset is added to mddev instead of to raid0 and raid5 layer, because with this way, we can put common functions to md.h and reuse them in raid0 and raid5. Also struct md_io_acct is added accordingly which includes io start_time, the origin bio and cloned bio. Then we can call bio_{start,end}_io_acct to get related io status. Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
2021-06-15md: revert io stats accountingGuoqing Jiang
The commit 41d2d848e5c0 ("md: improve io stats accounting") could cause double fault problem per the report [1], and also it is not correct to change ->bi_end_io if md don't own it, so let's revert it. And io stats accounting will be replemented in later commits. [1]. https://lore.kernel.org/linux-raid/3bf04253-3fad-434a-63a7-20214e38cf26@gmail.com/T/#t Fixes: 41d2d848e5c0 ("md: improve io stats accounting") Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
2021-06-01md: convert to blk_alloc_disk/blk_cleanup_diskChristoph Hellwig
Convert the md driver to use the blk_alloc_disk and blk_cleanup_disk helpers to simplify gendisk and request_queue allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/20210521055116.1053587-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-23md-cluster: fix use-after-free issue when removing rdevHeming Zhao
md_kick_rdev_from_array will remove rdev, so we should use rdev_for_each_safe to search list. How to trigger: env: Two nodes on kvm-qemu x86_64 VMs (2C2G with 2 iscsi luns). ``` node2=192.168.0.3 for i in {1..20}; do echo ==== $i `date` ====; mdadm -Ss && ssh ${node2} "mdadm -Ss" wipefs -a /dev/sda /dev/sdb mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l 1 /dev/sda \ /dev/sdb --assume-clean ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb" mdadm --wait /dev/md0 ssh ${node2} "mdadm --wait /dev/md0" mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda sleep 1 done ``` Crash stack: ``` stack segment: 0000 [#1] SMP ... ... RIP: 0010:md_check_recovery+0x1e8/0x570 [md_mod] ... ... RSP: 0018:ffffb149807a7d68 EFLAGS: 00010207 RAX: 0000000000000000 RBX: ffff9d494c180800 RCX: ffff9d490fc01e50 RDX: fffff047c0ed8308 RSI: 0000000000000246 RDI: 0000000000000246 RBP: 6b6b6b6b6b6b6b6b R08: ffff9d490fc01e40 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000 R13: ffff9d494c180818 R14: ffff9d493399ef38 R15: ffff9d4933a1d800 FS: 0000000000000000(0000) GS:ffff9d494f700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe68cab9010 CR3: 000000004c6be001 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: raid1d+0x5c/0xd40 [raid1] ? finish_task_switch+0x75/0x2a0 ? lock_timer_base+0x67/0x80 ? try_to_del_timer_sync+0x4d/0x80 ? del_timer_sync+0x41/0x50 ? schedule_timeout+0x254/0x2d0 ? md_start_sync+0xe0/0xe0 [md_mod] ? md_thread+0x127/0x160 [md_mod] md_thread+0x127/0x160 [md_mod] ? wait_woken+0x80/0x80 kthread+0x10d/0x130 ? kthread_park+0xa0/0xa0 ret_from_fork+0x1f/0x40 ``` Fixes: dbb64f8635f5d ("md-cluster: Fix adding of new disk with new reload code") Fixes: 659b254fa7392 ("md-cluster: remove a disk asynchronously from cluster environment") Cc: stable@vger.kernel.org Reviewed-by: Gang He <ghe@suse.com> Signed-off-by: Heming Zhao <heming.zhao@suse.com> Signed-off-by: Song Liu <song@kernel.org>
2021-04-15md: do not return existing mddevs from mddev_find_or_allocChristoph Hellwig
Instead of returning an existing mddev, just for it to be discarded later directly return -EEXIST. Rename the function to mddev_alloc now that it doesn't find an existing mddev. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
2021-04-15md: refactor mddev_find_or_allocChristoph Hellwig
Allocate the new mddev first speculatively, which greatly simplifies the code flow. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
2021-04-15md: factor out a mddev_alloc_unit helper from mddev_findChristoph Hellwig
Split out a self contained helper to find a free minor for the md "unit" number. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
2021-04-08md: split mddev_findChristoph Hellwig
Split mddev_find into a simple mddev_find that just finds an existing mddev by the unit number, and a more complicated mddev_find that deals with find or allocating a mddev. This turns out to fix this bug reported by Zhao Heming. ----------------------------- snip ------------------------------ commit d3374825ce57 ("md: make devices disappear when they are no longer needed.") introduced protection between mddev creating & removing. The md_open shouldn't create mddev when all_mddevs list doesn't contain mddev. With currently code logic, there will be very easy to trigger soft lockup in non-preempt env. *** env *** kvm-qemu VM 2C1G with 2 iscsi luns kernel should be non-preempt *** script *** about trigger 1 time with 10 tests `1 node1="15sp3-mdcluster1" 2 node2="15sp3-mdcluster2" 3 4 mdadm -Ss 5 ssh ${node2} "mdadm -Ss" 6 wipefs -a /dev/sda /dev/sdb 7 mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \ /dev/sdb --assume-clean 8 9 for i in {1..100}; do 10 echo ==== $i ====; 11 12 echo "test ...." 13 ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb" 14 sleep 1 15 16 echo "clean ....." 17 ssh ${node2} "mdadm -Ss" 18 done ` I use mdcluster env to trigger soft lockup, but it isn't mdcluster speical bug. To stop md array in mdcluster env will do more jobs than non-cluster array, which will leave enough time/gap to allow kernel to run md_open. *** stack *** `ID: 2831 TASK: ffff8dd7223b5040 CPU: 0 COMMAND: "mdadm" #0 [ffffa15d00a13b90] __schedule at ffffffffb8f1935f #1 [ffffa15d00a13ba8] exact_lock at ffffffffb8a4a66d #2 [ffffa15d00a13bb0] kobj_lookup at ffffffffb8c62fe3 #3 [ffffa15d00a13c28] __blkdev_get at ffffffffb89273b9 #4 [ffffa15d00a13c98] blkdev_get at ffffffffb8927964 #5 [ffffa15d00a13cb0] do_dentry_open at ffffffffb88dc4b4 #6 [ffffa15d00a13ce0] path_openat at ffffffffb88f0ccc #7 [ffffa15d00a13db8] do_filp_open at ffffffffb88f32bb #8 [ffffa15d00a13ee0] do_sys_open at ffffffffb88ddc7d #9 [ffffa15d00a13f38] do_syscall_64 at ffffffffb86053cb ffffffffb900008c or: [ 884.226509] mddev_put+0x1c/0xe0 [md_mod] [ 884.226515] md_open+0x3c/0xe0 [md_mod] [ 884.226518] __blkdev_get+0x30d/0x710 [ 884.226520] ? bd_acquire+0xd0/0xd0 [ 884.226522] blkdev_get+0x14/0x30 [ 884.226524] do_dentry_open+0x204/0x3a0 [ 884.226531] path_openat+0x2fc/0x1520 [ 884.226534] ? seq_printf+0x4e/0x70 [ 884.226536] do_filp_open+0x9b/0x110 [ 884.226542] ? md_release+0x20/0x20 [md_mod] [ 884.226543] ? seq_read+0x1d8/0x3e0 [ 884.226545] ? kmem_cache_alloc+0x18a/0x270 [ 884.226547] ? do_sys_open+0x1bd/0x260 [ 884.226548] do_sys_open+0x1bd/0x260 [ 884.226551] do_syscall_64+0x5b/0x1e0 [ 884.226554] entry_SYSCALL_64_after_hwframe+0x44/0xa9 ` *** rootcause *** "mdadm -A" (or other array assemble commands) will start a daemon "mdadm --monitor" by default. When "mdadm -Ss" is running, the stop action will wakeup "mdadm --monitor". The "--monitor" daemon will immediately get info from /proc/mdstat. This time mddev in kernel still exist, so /proc/mdstat still show md device, which makes "mdadm --monitor" to open /dev/md0. The previously "mdadm -Ss" is removing action, the "mdadm --monitor" open action will trigger md_open which is creating action. Racing is happening. `<thread 1>: "mdadm -Ss" md_release mddev_put deletes mddev from all_mddevs queue_work for mddev_delayed_delete at this time, "/dev/md0" is still available for opening <thread 2>: "mdadm --monitor ..." md_open + mddev_find can't find mddev of /dev/md0, and create a new mddev and | return. + trigger "if (mddev->gendisk != bdev->bd_disk)" and return -ERESTARTSYS. ` In non-preempt kernel, <thread 2> is occupying on current CPU. and mddev_delayed_delete which was created in <thread 1> also can't be schedule. In preempt kernel, it can also trigger above racing. But kernel doesn't allow one thread running on a CPU all the time. after <thread 2> running some time, the later "mdadm -A" (refer above script line 13) will call md_alloc to alloc a new gendisk for mddev. it will break md_open statement "if (mddev->gendisk != bdev->bd_disk)" and return 0 to caller, the soft lockup is broken. ------------------------------ snip ------------------------------ Cc: stable@vger.kernel.org Fixes: d3374825ce57 ("md: make devices disappear when they are no longer needed.") Reported-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
2021-04-08md: factor out a mddev_find_locked helper from mddev_findChristoph Hellwig
Factor out a self-contained helper to just lookup a mddev by the dev_t "unit". Cc: stable@vger.kernel.org Reviewed-by: Heming Zhao <heming.zhao@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
2021-04-08md: md_open returns -EBUSY when entering racing areaZhao Heming
commit d3374825ce57 ("md: make devices disappear when they are no longer needed.") introduced protection between mddev creating & removing. The md_open shouldn't create mddev when all_mddevs list doesn't contain mddev. With currently code logic, there will be very easy to trigger soft lockup in non-preempt env. This patch changes md_open returning from -ERESTARTSYS to -EBUSY, which will break the infinitely retry when md_open enter racing area. This patch is partly fix soft lockup issue, full fix needs mddev_find is split into two functions: mddev_find & mddev_find_or_alloc. And md_open should call new mddev_find (it only does searching job). For more detail, please refer with Christoph's "split mddev_find" patch in later commits. *** env *** kvm-qemu VM 2C1G with 2 iscsi luns kernel should be non-preempt *** script *** about trigger every time with below script ``` 1 node1="mdcluster1" 2 node2="mdcluster2" 3 4 mdadm -Ss 5 ssh ${node2} "mdadm -Ss" 6 wipefs -a /dev/sda /dev/sdb 7 mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \ /dev/sdb --assume-clean 8 9 for i in {1..10}; do 10 echo ==== $i ====; 11 12 echo "test ...." 13 ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb" 14 sleep 1 15 16 echo "clean ....." 17 ssh ${node2} "mdadm -Ss" 18 done ``` I use mdcluster env to trigger soft lockup, but it isn't mdcluster speical bug. To stop md array in mdcluster env will do more jobs than non-cluster array, which will leave enough time/gap to allow kernel to run md_open. *** stack *** ``` [ 884.226509] mddev_put+0x1c/0xe0 [md_mod] [ 884.226515] md_open+0x3c/0xe0 [md_mod] [ 884.226518] __blkdev_get+0x30d/0x710 [ 884.226520] ? bd_acquire+0xd0/0xd0 [ 884.226522] blkdev_get+0x14/0x30 [ 884.226524] do_dentry_open+0x204/0x3a0 [ 884.226531] path_openat+0x2fc/0x1520 [ 884.226534] ? seq_printf+0x4e/0x70 [ 884.226536] do_filp_open+0x9b/0x110 [ 884.226542] ? md_release+0x20/0x20 [md_mod] [ 884.226543] ? seq_read+0x1d8/0x3e0 [ 884.226545] ? kmem_cache_alloc+0x18a/0x270 [ 884.226547] ? do_sys_open+0x1bd/0x260 [ 884.226548] do_sys_open+0x1bd/0x260 [ 884.226551] do_syscall_64+0x5b/0x1e0 [ 884.226554] entry_SYSCALL_64_after_hwframe+0x44/0xa9 ``` *** rootcause *** "mdadm -A" (or other array assemble commands) will start a daemon "mdadm --monitor" by default. When "mdadm -Ss" is running, the stop action will wakeup "mdadm --monitor". The "--monitor" daemon will immediately get info from /proc/mdstat. This time mddev in kernel still exist, so /proc/mdstat still show md device, which makes "mdadm --monitor" to open /dev/md0. The previously "mdadm -Ss" is removing action, the "mdadm --monitor" open action will trigger md_open which is creating action. Racing is happening. ``` <thread 1>: "mdadm -Ss" md_release mddev_put deletes mddev from all_mddevs queue_work for mddev_delayed_delete at this time, "/dev/md0" is still available for opening <thread 2>: "mdadm --monitor ..." md_open + mddev_find can't find mddev of /dev/md0, and create a new mddev and | return. + trigger "if (mddev->gendisk != bdev->bd_disk)" and return -ERESTARTSYS. ``` In non-preempt kernel, <thread 2> is occupying on current CPU. and mddev_delayed_delete which was created in <thread 1> also can't be schedule. In preempt kernel, it can also trigger above racing. But kernel doesn't allow one thread running on a CPU all the time. after <thread 2> running some time, the later "mdadm -A" (refer above script line 13) will call md_alloc to alloc a new gendisk for mddev. it will break md_open statement "if (mddev->gendisk != bdev->bd_disk)" and return 0 to caller, the soft lockup is broken. Cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <song@kernel.org>
2021-03-25md: Fix missing unused status line of /proc/mdstatJan Glauber
Reading /proc/mdstat with a read buffer size that would not fit the unused status line in the first read will skip this line from the output. So 'dd if=/proc/mdstat bs=64 2>/dev/null' will not print something like: unused devices: <none> Don't return NULL immediately in start() for v=2 but call show() once to print the status line also for multiple reads. Cc: stable@vger.kernel.org Fixes: 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and interface") Signed-off-by: Jan Glauber <jglauber@digitalocean.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2021-03-25md: add md_submit_discard_bio() for submitting discard bioXiao Ni
Move these logic from raid0.c to md.c, so that we can also use it in raid10.c. Reviewed-by: Coly Li <colyli@suse.de> Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Tested-by: Adrian Huang <ahuang12@lenovo.com> Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2021-02-01md: use rdev_read_only in restart_arrayChristoph Hellwig
Make the read-only check in restart_array identical to the other two read-only checks. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01md: check for NULL ->meta_bdev before calling bdev_read_onlyChristoph Hellwig
->meta_bdev is optional and not set for most arrays. Add a rdev_read_only helper that calls bdev_read_only for both devices in a safe way. Fixes: 6f0d9689b670 ("block: remove the NULL bdev check in bdev_read_only") Reported-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-27md: remove md_bio_alloc_syncChristoph Hellwig
md_bio_alloc_sync is never called with a NULL mddev, and ->sync_set is initialized in md_run, so it always must be initialized as well. Just open code the remaining call to bio_alloc_bioset. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-27md: simplify sync_page_ioChristoph Hellwig
Use an on-stack bio and biovec for the single page synchronous I/O. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-27md: remove bio_alloc_mddevChristoph Hellwig
bio_alloc_mddev is never called with a NULL mddev, and ->bio_set is initialized in md_run, so it always must be initialized as well. Just open code the remaining call to bio_alloc_bioset. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block: use ->bi_bdev for bio based I/O accountingChristoph Hellwig
Rework the I/O accounting for bio based drivers to use ->bi_bdev. This means all drivers can now simply use bio_start_io_acct to start accounting, and it will take partitions into account automatically. To end I/O account either bio_end_io_acct can be used if the driver never remaps I/O to a different device, or bio_end_io_acct_remapped if the driver did remap the I/O. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block: store a block_device pointer in struct bioChristoph Hellwig
Replace the gendisk pointer in struct bio with a pointer to the newly improved struct block device. From that the gendisk can be trivially accessed with an extra indirection, but it also allows to directly look up all information related to partition remapping. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-20md: Set prev_flush_start and flush_bio in an atomic wayXiao Ni
One customer reports a crash problem which causes by flush request. It triggers a warning before crash. /* new request after previous flush is completed */ if (ktime_after(req_start, mddev->prev_flush_start)) { WARN_ON(mddev->flush_bio); mddev->flush_bio = bio; bio = NULL; } The WARN_ON is triggered. We use spin lock to protect prev_flush_start and flush_bio in md_flush_request. But there is no lock protection in md_submit_flush_data. It can set flush_bio to NULL first because of compiler reordering write instructions. For example, flush bio1 sets flush bio to NULL first in md_submit_flush_data. An interrupt or vmware causing an extended stall happen between updating flush_bio and prev_flush_start. Because flush_bio is NULL, flush bio2 can get the lock and submit to underlayer disks. Then flush bio1 updates prev_flush_start after the interrupt or extended stall. Then flush bio3 enters in md_flush_request. The start time req_start is behind prev_flush_start. The flush_bio is not NULL(flush bio2 hasn't finished). So it can trigger the WARN_ON now. Then it calls INIT_WORK again. INIT_WORK() will re-initialize the list pointers in the work_struct, which then can result in a corrupted work list and the work_struct queued a second time. With the work list corrupted, it can lead in invalid work items being used and cause a crash in process_one_work. We need to make sure only one flush bio can be handled at one same time. So add spin lock in md_submit_flush_data to protect prev_flush_start and flush_bio in an atomic way. Reviewed-by: David Jeffery <djeffery@redhat.com> Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-12-17Merge tag 'for-5.11/drivers-2020-12-14' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block driver updates from Jens Axboe: "Nothing major in here: - NVMe pull request from Christoph: - nvmet passthrough improvements (Chaitanya Kulkarni) - fcloop error injection support (James Smart) - read-only support for zoned namespaces without Zone Append (Javier González) - improve some error message (Minwoo Im) - reject I/O to offline fabrics namespaces (Victor Gladkov) - PCI queue allocation cleanups (Niklas Schnelle) - remove an unused allocation in nvmet (Amit Engel) - a Kconfig spelling fix (Colin Ian King) - nvme_req_qid simplication (Baolin Wang) - MD pull request from Song: - Fix race condition in md_ioctl() (Dae R. Jeong) - Initialize read_slot properly for raid10 (Kevin Vigor) - Code cleanup (Pankaj Gupta) - md-cluster resync/reshape fix (Zhao Heming) - Move null_blk into its own directory (Damien Le Moal) - null_blk zone and discard improvements (Damien Le Moal) - bcache race fix (Dongsheng Yang) - Set of rnbd fixes/improvements (Gioh Kim, Guoqing Jiang, Jack Wang, Lutz Pogrell, Md Haris Iqbal) - lightnvm NULL pointer deref fix (tangzhenhao) - sr in_interrupt() removal (Sebastian Andrzej Siewior) - FC endpoint security support for s390/dasd (Jan Höppner, Sebastian Ott, Vineeth Vijayan). From the s390 arch guys, arch bits included as it made it easier for them to funnel the feature through the block driver tree. - Follow up fixes (Colin Ian King)" * tag 'for-5.11/drivers-2020-12-14' of git://git.kernel.dk/linux-block: (64 commits) block: drop dead assignments in loop_init() sr: Remove in_interrupt() usage in sr_init_command(). sr: Switch the sector size back to 2048 if sr_read_sector() changed it. cdrom: Reset sector_size back it is not 2048. drivers/lightnvm: fix a null-ptr-deref bug in pblk-core.c null_blk: Move driver into its own directory null_blk: Allow controlling max_hw_sectors limit null_blk: discard zones on reset null_blk: cleanup discard handling null_blk: Improve implicit zone close null_blk: improve zone locking block: Align max_hw_sectors to logical blocksize null_blk: Fail zone append to conventional zones null_blk: Fix zone size initialization bcache: fix race between setting bdev state to none and new write request direct to backing block/rnbd: fix a null pointer dereference on dev->blk_symlink_name block/rnbd-clt: Dynamically alloc buffer for pathname & blk_symlink_name block/rnbd: call kobject_put in the failure path Documentation/ABI/rnbd-srv: add document for force_close block/rnbd-srv: close a mapped device from server side. ...
2020-12-16Merge tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: "Another series of killing more code than what is being added, again thanks to Christoph's relentless cleanups and tech debt tackling. This contains: - blk-iocost improvements (Baolin Wang) - part0 iostat fix (Jeffle Xu) - Disable iopoll for split bios (Jeffle Xu) - block tracepoint cleanups (Christoph Hellwig) - Merging of struct block_device and hd_struct (Christoph Hellwig) - Rework/cleanup of how block device sizes are updated (Christoph Hellwig) - Simplification of gendisk lookup and removal of block device aliasing (Christoph Hellwig) - Block device ioctl cleanups (Christoph Hellwig) - Removal of bdget()/blkdev_get() as exported API (Christoph Hellwig) - Disk change rework, avoid ->revalidate_disk() (Christoph Hellwig) - sbitmap improvements (Pavel Begunkov) - Hybrid polling fix (Pavel Begunkov) - bvec iteration improvements (Pavel Begunkov) - Zone revalidation fixes (Damien Le Moal) - blk-throttle limit fix (Yu Kuai) - Various little fixes" * tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block: (126 commits) blk-mq: fix msec comment from micro to milli seconds blk-mq: update arg in comment of blk_mq_map_queue blk-mq: add helper allocating tagset->tags Revert "block: Fix a lockdep complaint triggered by request queue flushing" nvme-loop: use blk_mq_hctx_set_fq_lock_class to set loop's lock class blk-mq: add new API of blk_mq_hctx_set_fq_lock_class block: disable iopoll for split bio block: Improve blk_revalidate_disk_zones() checks sbitmap: simplify wrap check sbitmap: replace CAS with atomic and sbitmap: remove swap_lock sbitmap: optimise sbitmap_deferred_clear() blk-mq: skip hybrid polling if iopoll doesn't spin blk-iocost: Factor out the base vrate change into a separate function blk-iocost: Factor out the active iocgs' state check into a separate function blk-iocost: Move the usage ratio calculation to the correct place blk-iocost: Remove unnecessary advance declaration blk-iocost: Fix some typos in comments blktrace: fix up a kerneldoc comment block: remove the request_queue to argument request based tracepoints ...
2020-12-10Revert "md: add md_submit_discard_bio() for submitting discard bio"Song Liu
This reverts commit 2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0. Matthew Ruffell reported data corruption in raid10 due to the changes in discard handling [1]. Revert these changes before we find a proper fix. [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/ Cc: Matthew Ruffell <matthew.ruffell@canonical.com> Cc: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-12-04block: remove the request_queue argument to the block_bio_remap tracepointChristoph Hellwig
The request_queue can trivially be derived from the bio. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>