D: Author of lil (Linux Interrupt Latency benchmark)
D: Fixed the shm swap deallocation at swapoff time (try_to_unuse message)
D: VM hacker
+D: NUMA task placement
D: Various other kernel hacks
S: Imola 40026
S: Italy
--- /dev/null
+Overriding ACPI tables via initrd
+=================================
+
+1) Introduction (What is this about)
+2) What is this for
+3) How does it work
+4) References (Where to retrieve userspace tools)
+
+1) What is this about
+---------------------
+
+If the ACPI_INITRD_TABLE_OVERRIDE compile option is enabled, it is possible to
+override nearly any ACPI table provided by the BIOS with an instrumented,
+modified one.
+
+For a full list of ACPI tables that can be overridden, take a look at
+the table_sigs[] definition in drivers/acpi/osl.c.
+All ACPI tables iasl (Intel's ACPI compiler and disassembler) knows should
+be overridable, except:
+ - ACPI_SIG_RSDP (has a signature of 6 bytes)
+ - ACPI_SIG_FACS (does not have an ordinary ACPI table header)
+Both could get implemented as well.
+
+
+2) What is this for
+-------------------
+
+Please keep in mind that this is a debug option.
+ACPI tables should not get overridden for production use.
+If BIOS ACPI tables are overridden, the kernel will get tainted with the
+TAINT_OVERRIDDEN_ACPI_TABLE flag.
+Complain to your platform/BIOS vendor if you find a bug which is so severe
+that a workaround is not accepted in the Linux kernel.
+
+Still, it can and should be enabled in any kernel, because:
+ - There is no functional change with non-instrumented initrds
+ - It provides a powerful feature to easily debug and test ACPI BIOS table
+ compatibility with the Linux kernel.
+
+
+3) How does it work
+-------------------
+
+# Extract the machine's ACPI tables:
+cd /tmp
+acpidump >acpidump
+acpixtract -a acpidump
+# Disassemble, modify and recompile them:
+iasl -d *.dat
+# For example add this statement into a _PRT (PCI Routing Table) function
+# of the DSDT:
+Store("HELLO WORLD", debug)
+iasl -sa dsdt.dsl
+# Add the raw ACPI tables to an uncompressed cpio archive.
+# They must be put into a /kernel/firmware/acpi directory inside the
+# cpio archive.
+# The uncompressed cpio archive must be the first.
+# Other, typically compressed cpio archives, must be
+# concatenated on top of the uncompressed one.
+mkdir -p kernel/firmware/acpi
+cp dsdt.aml kernel/firmware/acpi
+# A maximum of: #define ACPI_OVERRIDE_TABLES 10
+# tables are currently allowed (see osl.c):
+iasl -sa facp.dsl
+iasl -sa ssdt1.dsl
+cp facp.aml kernel/firmware/acpi
+cp ssdt1.aml kernel/firmware/acpi
+# Create the uncompressed cpio archive and concatenate the original initrd
+# on top:
+find kernel | cpio -H newc --create > /boot/instrumented_initrd
+cat /boot/initrd >>/boot/instrumented_initrd
+# reboot with increased acpi debug level, e.g. boot params:
+acpi.debug_level=0x2 acpi.debug_layer=0xFFFFFFFF
+# and check your syslog:
+[ 1.268089] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
+[ 1.272091] [ACPI Debug] String [0x0B] "HELLO WORLD"
+
+iasl is able to disassemble and recompile many different ACPI tables,
+including static ones.
+
+
+4) Where to retrieve userspace tools
+------------------------------------
+
+iasl and acpixtract are part of Intel's ACPICA project:
+http://acpica.org/
+and should be packaged by distributions (for example in the acpica package
+on SUSE).
+
+acpidump can be found in Len Brown's pmtools:
+ftp://kernel.org/pub/linux/kernel/people/lenb/acpi/utils/pmtools/acpidump
+This tool is also part of the acpica package on SUSE.
+Alternatively, the ACPI tables in use can be retrieved via sysfs in recent kernels:
+/sys/firmware/acpi/tables
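Before packing a recompiled table into the cpio archive, it can help to run the same sanity checks the kernel applies to each file: the file must be at least as large as an ACPI header, the length field must equal the file size, and all bytes must sum to zero. The following is a minimal userspace sketch of those checks, not the kernel code itself. The names check_table(), table_length() and ACPI_HDR_SIZE are made up for this illustration, the 36-byte header size and field offsets follow the standard ACPI table header layout, and the signature check against the known-table list is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Size of the standard ACPI table header (assumed layout, 36 bytes). */
#define ACPI_HDR_SIZE 36

/* Sum all bytes modulo 256; a valid ACPI table sums to 0. */
static uint8_t table_checksum(const uint8_t *buf, size_t len)
{
	uint8_t sum = 0;

	while (len--)
		sum = (uint8_t)(sum + *buf++);
	return sum;
}

/* Read the little-endian 32-bit length field at byte offset 4. */
static uint32_t table_length(const uint8_t *buf)
{
	return (uint32_t)buf[4] | ((uint32_t)buf[5] << 8) |
	       ((uint32_t)buf[6] << 16) | ((uint32_t)buf[7] << 24);
}

/*
 * Hypothetical helper: return 0 if the buffer passes the size, length
 * and checksum checks, or a negative code naming the first failed check.
 */
static int check_table(const uint8_t *buf, size_t size)
{
	assert(buf != NULL);

	if (size < ACPI_HDR_SIZE)
		return -1;	/* smaller than an ACPI header */
	if (table_length(buf) != size)
		return -2;	/* file length does not match table length */
	if (table_checksum(buf, size) != 0)
		return -3;	/* bad table checksum */
	return 0;
}
```

A caller would read a file such as dsdt.aml into memory and pass the buffer and its size to check_table(); a non-zero result means the kernel would most likely ignore that file.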
--- /dev/null
+
+
+Effective NUMA scheduling problem statement, described formally:
+
+ * minimize interconnect traffic
+
+For each task 't_i' we have memory, this memory can be spread over multiple
+physical nodes, let us denote this as: 'p_i,k', the memory task 't_i' has on
+node 'k' in [pages].
+
+If a task shares memory with another task let us denote this as:
+'s_i,k', the memory shared between tasks including 't_i' residing on node
+'k'.
+
+Let 'M' be the distribution that governs all 'p' and 's', ie. the page placement.
+
+Similarly, let's define 'fp_i,k' and 'fs_i,k' resp. as the (average) usage
+frequency over those memory regions [1/s] such that the product gives an
+(average) bandwidth 'bp' and 'bs' in [pages/s].
+
+(note: multiple tasks sharing memory naturally avoid duplicate accounting
+ because each task will have its own access frequency 'fs')
+
+(pjt: I think this frequency is more numerically consistent if you explicitly
+ restrict p/s above to be the working-set. (It also makes explicit the
+ requirement for <C0,M0> to change about a change in the working set.)
+
+ Doing this does have the nice property that it lets you use your frequency
+ measurement as a weak-ordering for the benefit a task would receive when
+ we can't fit everything.
+
+ e.g. task1 has working set 10mb, f=90%
+ task2 has working set 90mb, f=10%
+
+ Both are using 9mb/s of bandwidth, but we'd expect a much larger benefit
+ from task1 being on the right node than task2. )
+
+Let 'C' map every task 't_i' to a cpu 'c_i' and its corresponding node 'n_i':
+
+ C: t_i -> {c_i, n_i}
+
+This gives us the total interconnect traffic between nodes 'k' and 'l',
+'T_k,l', as:
+
+ T_k,l = \Sum_i (bp_i,l + bs_i,l) + \Sum_j (bp_j,k + bs_j,k) where n_i == k, n_j == l
+
+And our goal is to obtain C0 and M0 such that:
+
+ T_k,l(C0, M0) =< T_k,l(C, M) for all C, M where k != l
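As an illustration of the 'T_k,l' term, here is a minimal C sketch that evaluates it for a fixed placement. The struct layout, the NR_NODES constant and the function name are assumptions made for this example only; the bandwidth values would have to come from measurement.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES 2	/* hypothetical node count for this sketch */

struct task_bw {
	int node;		/* n_i: node the task runs on */
	double bp[NR_NODES];	/* bp_i,k: private bandwidth per node [pages/s] */
	double bs[NR_NODES];	/* bs_i,k: shared bandwidth per node [pages/s] */
};

/*
 * T_k,l: bandwidth of tasks running on node k into memory on node l,
 * plus bandwidth of tasks running on node l into memory on node k.
 */
static double interconnect_traffic(const struct task_bw *t, size_t nr,
				   int k, int l)
{
	double sum = 0.0;
	size_t i;

	assert(k != l);
	assert(k >= 0 && k < NR_NODES && l >= 0 && l < NR_NODES);

	for (i = 0; i < nr; i++) {
		if (t[i].node == k)
			sum += t[i].bp[l] + t[i].bs[l];
		else if (t[i].node == l)
			sum += t[i].bp[k] + t[i].bs[k];
	}
	return sum;
}
```

The term is symmetric in k and l, so interconnect_traffic(t, nr, k, l) and interconnect_traffic(t, nr, l, k) return the same value.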
+
+(note: we could introduce 'nc(k,l)' as the cost function of accessing memory
+ on node 'l' from node 'k', this would be useful for bigger NUMA systems
+
+ pjt: I agree nice to have, but intuition suggests diminishing returns on more
+ usual systems given factors like Haswell's enormous 35mb l3
+ cache and QPI being able to do a direct fetch.)
+
+(note: do we need a limit on the total memory per node?)
+
+
+ * fairness
+
+For each task 't_i' we have a weight 'w_i' (related to nice), and each cpu
+'c_n' has a compute capacity 'P_n', again, using our map 'C' we can formulate a
+load 'L_n':
+
+ L_n = 1/P_n * \Sum_i w_i for all c_i = n
+
+using that we can formulate a load difference between CPUs
+
+ L_n,m = | L_n - L_m |
+
+Which allows us to state the fairness goal like:
+
+ L_n,m(C0) =< L_n,m(C) for all C, n != m
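The load terms can be sketched in C as follows. This is only an illustration of the two formulas above; struct task_w and both function names are hypothetical, and weights and capacities are plain doubles rather than whatever fixed-point form a scheduler would use.

```c
#include <assert.h>
#include <stddef.h>

struct task_w {
	int cpu;	/* c_i: cpu the task is mapped to */
	double weight;	/* w_i: weight (related to nice) */
};

/* L_n = 1/P_n * sum of w_i over all tasks with c_i == n */
static double cpu_load(const struct task_w *t, size_t nr, int cpu,
		       double capacity)
{
	double sum = 0.0;
	size_t i;

	assert(capacity > 0.0);

	for (i = 0; i < nr; i++)
		if (t[i].cpu == cpu)
			sum += t[i].weight;
	return sum / capacity;
}

/* L_n,m = | L_n - L_m | */
static double load_diff(double ln, double lm)
{
	return ln > lm ? ln - lm : lm - ln;
}
```

With two tasks of weight 1024 on cpu 0, one task of weight 512 on cpu 1 and a capacity of 1024 for both cpus, this gives L_0 = 2.0, L_1 = 0.5 and L_0,1 = 1.5.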
+
+(pjt: It can also be usefully stated that, having converged at C0:
+
+ | L_n(C0) - L_m(C0) | <= 4/3 * | G_n( U(t_i, t_j) ) - G_m( U(t_i, t_j) ) |
+
+ Where G_n,m is the greedy partition of tasks between L_n and L_m. This is
+ the "worst" partition we should accept; but having it gives us a useful
+ bound on how much we can reasonably adjust L_n/L_m at a Pareto point to
+ favor T_n,m. )
+
+Together they give us the complete multi-objective optimization problem:
+
+ min_C,M [ L_n,m(C), T_k,l(C,M) ]
+
+
+
+Notes:
+
+ - the memory bandwidth problem is very much an inter-process problem, in
+ particular there is no such concept as a process in the above problem.
+
+ - the naive solution would completely prefer fairness over interconnect
+ traffic, the more complicated solution could pick another Pareto point using
+ an aggregate objective function such that we balance the loss of work
+ efficiency against the gain of running, we'd want to more or less suggest
+ there to be a fixed bound on the error from the Pareto line for any
+ such solution.
+
+References:
+
+ http://en.wikipedia.org/wiki/Mathematical_optimization
+ http://en.wikipedia.org/wiki/Multi-objective_optimization
+
+
+* warning, significant hand-waving ahead, improvements welcome *
+
+
+Partial solutions / approximations:
+
+ 1) have task node placement be a pure preference from the 'fairness' pov.
+
+This means we always prefer fairness over interconnect bandwidth. This reduces
+the problem to:
+
+ min_C,M [ T_k,l(C,M) ]
+
+ 2a) migrate memory towards 'n_i' (the task's node).
+
+This creates memory movement such that 'p_i,k for k != n_i' becomes 0 --
+provided 'n_i' stays stable enough and there's sufficient memory (looks like
+we might need memory limits for this).
+
+This does however not provide us with any 's_i' (shared) information. It does
+however remove 'M' since it defines memory placement in terms of task
+placement.
+
+XXX properties of this M vs a potential optimal
+
+ 2b) migrate memory towards 'n_i' using 2 samples.
+
+This separates pages into those that will migrate and those that will not due
+to the two samples not matching. We could consider the first to be of 'p_i'
+(private) and the second to be of 's_i' (shared).
+
+This interpretation can be motivated by the previously observed property that
+'p_i,k for k != n_i' should become 0 under sufficient memory, leaving only
+'s_i' (shared). (here we lose the need for memory limits again, since it
+becomes indistinguishable from shared).
+
+XXX include the statistical babble on double sampling somewhere near
+
+This reduces the problem further; we lose 'M' as per 2a, it further reduces
+the 'T_k,l' (interconnect traffic) term to only include shared (since per the
+above all private will be local):
+
+ T_k,l = \Sum_i bs_i,l for every n_i = k, l != k
+
+[ more or less matches the state of sched/numa and describes its remaining
+ problems and assumptions. It should work well for tasks without significant
+ shared memory usage between tasks. ]
+
+Possible future directions:
+
+Motivated by the form of 'T_k,l', try and obtain each term of the sum, so we
+can evaluate it;
+
+ 3a) add per-task per node counters
+
+At fault time, count the number of pages the task faults on for each node.
+This should give an approximation of 'p_i' for the local node and 's_i,k' for
+all remote nodes.
+
+While these numbers provide pages per scan, and so have the unit [pages/s], they
+don't count repeat accesses and thus aren't actually representative of our
+bandwidth numbers.
+
+ 3b) additional frequency term
+
+Additionally (or instead if it turns out we don't need the raw 'p' and 's'
+numbers) we can approximate the repeat accesses by using the time since marking
+the pages as indication of the access frequency.
+
+Let 'I' be the interval of marking pages and 'e' the elapsed time since the
+last marking, then we could estimate the number of accesses 'a' as 'a = I / e'.
+If we then increment the node counters using 'a' instead of 1 we might get
+a better estimate of bandwidth terms.
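A minimal sketch of the 3b idea in C: weight each fault by the estimated access count 'a = I / e' instead of 1. The counter layout and function names are hypothetical. Two choices here are assumptions of this sketch rather than part of the text above: an elapsed time of 0 is treated as one time unit, and the estimate is clamped to at least 1 because the fault itself is one access.

```c
#include <assert.h>

#define NR_NODES 4	/* hypothetical node count for this sketch */

struct numa_counters {
	unsigned long faults[NR_NODES];	/* per-task, per-node counters */
};

/* a = I / e, using integer division and the clamping described above. */
static unsigned long estimate_accesses(unsigned long interval,
				       unsigned long elapsed)
{
	unsigned long a;

	assert(interval > 0);

	if (elapsed == 0)
		elapsed = 1;	/* assumption: avoid dividing by zero */
	a = interval / elapsed;
	return a ? a : 1;	/* assumption: a fault is at least one access */
}

/* Account one fault on 'node', weighted by the estimated access count. */
static void account_fault(struct numa_counters *c, int node,
			  unsigned long interval, unsigned long elapsed)
{
	assert(node >= 0 && node < NR_NODES);
	c->faults[node] += estimate_accesses(interval, elapsed);
}
```

A page touched shortly after being marked (small 'e') contributes a large increment, while a page touched late in the interval contributes about 1.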
+
+ 3c) additional averaging; can be applied on top of either a/b.
+
+[ Rik argues that decaying averages on 3a might be sufficient for bandwidth since
+ the decaying avg includes the old accesses and therefore has a measure of repeat
+ accesses.
+
+ Rik also argued that the sample frequency is too low to get accurate access
+ frequency measurements, I'm not entirely convinced, even at low sample
+ frequencies the avg elapsed time 'e' over multiple samples should still
+ give us a fair approximation of the avg access frequency 'a'.
+
+ So doing both b&c has a fair chance of working and allowing us to distinguish
+ between important and less important memory accesses.
+
+ Experimentation has shown no benefit from the added frequency term so far. ]
+
+This will give us 'bp_i' and 'bs_i,k' so that we can approximately compute
+'T_k,l'. Our optimization problem now reads:
+
+ min_C [ \Sum_i bs_i,l for every n_i = k, l != k ]
+
+And includes only shared terms, this makes sense since all task private memory
+will become local as per 2.
+
+This suggests that if there is significant shared memory, we should try and
+move towards it.
+
+ 4) move towards where 'most' memory is
+
+The simplest significance test is comparing the biggest shared 's_i,k' against
+the private 'p_i'. If we have more shared than private, move towards it.
+
+This effectively makes us move towards where most of our memory is and forms a
+feed-back loop with 2. We migrate memory towards us and we migrate towards
+where 'most' memory is.
+
+(Note: even if there were two tasks fully thrashing the same shared memory, it
+ is very rare for there to be a 50/50 split in memory, lacking a perfect
+ split, the smaller will move towards the larger. In case of the perfect
+ split, we'll tie-break towards the lower node number.)
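The significance test of step 4 can be sketched as below. This is an illustration under stated assumptions, not the sched/numa implementation: the function name and NR_NODES are hypothetical, 'private_pages' stands for 'p_i' and 'shared[k]' for 's_i,k'. Applying the lower-node tie-break to equal per-node shared counts is an interpretation made for this sketch.

```c
#include <assert.h>

#define NR_NODES 4	/* hypothetical node count for this sketch */

/*
 * Return the node holding the biggest shared 's_i,k' if it exceeds the
 * private 'p_i', otherwise stay on the current node.
 */
static int preferred_node(int cur_node, unsigned long private_pages,
			  const unsigned long shared[NR_NODES])
{
	int k, best = 0;

	assert(cur_node >= 0 && cur_node < NR_NODES);

	for (k = 1; k < NR_NODES; k++)
		if (shared[k] > shared[best])	/* strict: ties keep the lower node */
			best = k;

	if (shared[best] > private_pages)
		return best;	/* more shared than private: move towards it */
	return cur_node;
}
```

Because the comparison inside the loop is strict, two nodes with the same amount of shared memory resolve to the lower node number.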
+
+ 5) 'throttle' 4's node placement
+
+Since per 2b our 's_i,k' and 'p_i' require at least two scans to 'stabilize'
+and show representative numbers, we should limit node-migration to not be
+faster than this.
+
+ n) poke holes in previous that require more stuff and describe it.
W: http://www.native-instruments.com
F: sound/usb/caiaq/
+NATIVE LINUX KVM TOOL
+M: Pekka Enberg <penberg@kernel.org>
+M: Sasha Levin <levinsasha928@gmail.com>
+M: Asias He <asias.hejun@gmail.com>
+L: kvm@vger.kernel.org
+S: Maintained
+F: tools/kvm/
+
NCP FILESYSTEM
M: Petr Vandrovec <petr@vandrovec.name>
S: Odd Fixes
extern void paging_init(void);
+#define pmd_pgprot(x) __pgprot(pmd_val(x) & ~_PAGE_CHG_MASK)
+
/*
* Conversion functions: convert a page and protection to a page entry,
* and a page entry and page directory to the page they refer to.
*pmdp = entry;
}
+static inline pgprot_t pmd_pgprot(pmd_t pmd)
+{
+ pgprot_t prot = PAGE_RW;
+
+ if (pmd_val(pmd) & _SEGMENT_ENTRY_RO) {
+ if (pmd_val(pmd) & _SEGMENT_ENTRY_INV)
+ prot = PAGE_NONE;
+ else
+ prot = PAGE_RO;
+ }
+ return prot;
+}
+
static inline unsigned long massage_pgprot_pmd(pgprot_t pgprot)
{
unsigned long pgprot_pmd = 0;
config NUMA
bool "Non Uniform Memory Access (NUMA) Support"
depends on MMU && SYS_SUPPORTS_NUMA && EXPERIMENTAL
+ select EMBEDDED_NUMA
default n
help
Some SH systems have many various memories scattered around
If in doubt, say "Y".
+config KVMTOOL_TEST_ENABLE
+ bool "Enable options to create a bootable tools/kvm/ kernel"
+ select NET
+ select NETDEVICES
+ select PCI
+ select BLOCK
+ select BLK_DEV
+ select NETWORK_FILESYSTEMS
+ select INET
+ select EXPERIMENTAL
+ select SERIAL_8250
+ select SERIAL_8250_CONSOLE
+ select IP_PNP
+ select IP_PNP_DHCP
+ select BINFMT_ELF
+ select PCI_MSI
+ select HAVE_ARCH_KGDB
+ select DEBUG_KERNEL
+ select KGDB
+ select KGDB_SERIAL_CONSOLE
+ select VIRTUALIZATION
+ select VIRTIO
+ select VIRTIO_RING
+ select VIRTIO_PCI
+ select VIRTIO_BLK
+ select VIRTIO_CONSOLE
+ select VIRTIO_NET
+ select 9P_FS
+ select NET_9P
+ select NET_9P_VIRTIO
+
menuconfig PARAVIRT_GUEST
bool "Paravirtualized guest support"
---help---
}
#define pte_pgprot(x) __pgprot(pte_flags(x) & PTE_FLAGS_MASK)
+#define pmd_pgprot(x) __pgprot(pmd_val(x) & ~_HPAGE_CHG_MASK)
#define canon_pgprot(p) __pgprot(massage_pgprot(p))
return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
}
+#define __HAVE_ARCH_PTE_ACCESSIBLE
+static inline int pte_accessible(pte_t a)
+{
+ return pte_flags(a) & _PAGE_PRESENT;
+}
+
static inline int pte_hidden(pte_t pte)
{
return pte_flags(pte) & _PAGE_HIDDEN;
} while (unlikely (val != old));
return old & 0x1;
}
+
+void __init arch_reserve_mem_area(acpi_physical_address addr, size_t size)
+{
+ e820_add_region(addr, size, E820_ACPI);
+ update_e820();
+}
reserve_initrd();
+#if defined(CONFIG_ACPI) && defined(CONFIG_BLK_DEV_INITRD)
+ acpi_initrd_override((void *)initrd_start, initrd_end - initrd_start);
+#endif
+
reserve_crashkernel();
vsmp_init();
pte_t entry, int dirty)
{
int changed = !pte_same(*ptep, entry);
+ /*
+ * If the page used to be inaccessible (_PAGE_PROTNONE), or
+ * this call upgrades the access permissions on the same page,
+ * it is safe to skip the remote TLB flush.
+ */
+ bool flush_remote = false;
+ if (!pte_accessible(*ptep))
+ flush_remote = false;
+ else if (pte_pfn(*ptep) != pte_pfn(entry) ||
+ (pte_write(*ptep) && !pte_write(entry)) ||
+ (pte_exec(*ptep) && !pte_exec(entry)))
+ flush_remote = true;
if (changed && dirty) {
*ptep = entry;
pte_update_defer(vma->vm_mm, address, ptep);
- flush_tlb_page(vma, address);
+ if (flush_remote)
+ flush_tlb_page(vma, address);
+ else
+ __flush_tlb_one(address);
}
return changed;
bool
default ACPI_CUSTOM_DSDT_FILE != ""
+config ACPI_INITRD_TABLE_OVERRIDE
+ bool "ACPI tables can be passed via uncompressed cpio in initrd"
+ default n
+ help
+ This option provides functionality to override arbitrary ACPI tables
+ via initrd. No functional change if no ACPI tables are passed via
+ initrd, therefore it's safe to say Y.
+ See Documentation/acpi/initrd_table_override.txt for details
+
config ACPI_BLACKLIST_YEAR
int "Disable ACPI for systems before Jan 1st this year" if X86_32
default 0
return AE_OK;
}
+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+#include <linux/earlycpio.h>
+#include <linux/memblock.h>
+
+static u64 acpi_tables_addr;
+static int all_tables_size;
+
+/* Copied from acpica/tbutils.c:acpi_tb_checksum() */
+u8 __init acpi_table_checksum(u8 *buffer, u32 length)
+{
+ u8 sum = 0;
+ u8 *end = buffer + length;
+
+ while (buffer < end)
+ sum = (u8) (sum + *(buffer++));
+ return sum;
+}
+
+/* All but ACPI_SIG_RSDP and ACPI_SIG_FACS: */
+static const char * const table_sigs[] = {
+ ACPI_SIG_BERT, ACPI_SIG_CPEP, ACPI_SIG_ECDT, ACPI_SIG_EINJ,
+ ACPI_SIG_ERST, ACPI_SIG_HEST, ACPI_SIG_MADT, ACPI_SIG_MSCT,
+ ACPI_SIG_SBST, ACPI_SIG_SLIT, ACPI_SIG_SRAT, ACPI_SIG_ASF,
+ ACPI_SIG_BOOT, ACPI_SIG_DBGP, ACPI_SIG_DMAR, ACPI_SIG_HPET,
+ ACPI_SIG_IBFT, ACPI_SIG_IVRS, ACPI_SIG_MCFG, ACPI_SIG_MCHI,
+ ACPI_SIG_SLIC, ACPI_SIG_SPCR, ACPI_SIG_SPMI, ACPI_SIG_TCPA,
+ ACPI_SIG_UEFI, ACPI_SIG_WAET, ACPI_SIG_WDAT, ACPI_SIG_WDDT,
+ ACPI_SIG_WDRT, ACPI_SIG_DSDT, ACPI_SIG_FADT, ACPI_SIG_PSDT,
+ ACPI_SIG_RSDT, ACPI_SIG_XSDT, ACPI_SIG_SSDT, NULL };
+
+/* Non-fatal errors: Affected tables/files are ignored */
+#define INVALID_TABLE(x, path, name) \
+ { pr_err("ACPI OVERRIDE: " x " [%s%s]\n", path, name); continue; }
+
+#define ACPI_HEADER_SIZE sizeof(struct acpi_table_header)
+
+/* Must not exceed 10 without modifying the code below */
+#define ACPI_OVERRIDE_TABLES 10
+
+void __init acpi_initrd_override(void *data, size_t size)
+{
+ int sig, no, table_nr = 0, total_offset = 0;
+ long offset = 0;
+ struct acpi_table_header *table;
+ char cpio_path[32] = "kernel/firmware/acpi/";
+ struct cpio_data file;
+ struct cpio_data early_initrd_files[ACPI_OVERRIDE_TABLES];
+ char *p;
+
+ if (data == NULL || size == 0)
+ return;
+
+ for (no = 0; no < ACPI_OVERRIDE_TABLES; no++) {
+ file = find_cpio_data(cpio_path, data, size, &offset);
+ if (!file.data)
+ break;
+
+ data += offset;
+ size -= offset;
+
+ if (file.size < sizeof(struct acpi_table_header))
+ INVALID_TABLE("Table smaller than ACPI header",
+ cpio_path, file.name);
+
+ table = file.data;
+
+ for (sig = 0; table_sigs[sig]; sig++)
+ if (!memcmp(table->signature, table_sigs[sig], 4))
+ break;
+
+ if (!table_sigs[sig])
+ INVALID_TABLE("Unknown signature",
+ cpio_path, file.name);
+ if (file.size != table->length)
+ INVALID_TABLE("File length does not match table length",
+ cpio_path, file.name);
+ if (acpi_table_checksum(file.data, table->length))
+ INVALID_TABLE("Bad table checksum",
+ cpio_path, file.name);
+
+ pr_info("%4.4s ACPI table found in initrd [%s%s][0x%x]\n",
+ table->signature, cpio_path, file.name, table->length);
+
+ all_tables_size += table->length;
+ early_initrd_files[table_nr].data = file.data;
+ early_initrd_files[table_nr].size = file.size;
+ table_nr++;
+ }
+ if (table_nr == 0)
+ return;
+
+ acpi_tables_addr =
+ memblock_find_in_range(0, max_low_pfn_mapped << PAGE_SHIFT,
+ all_tables_size, PAGE_SIZE);
+ if (!acpi_tables_addr) {
+ WARN_ON(1);
+ return;
+ }
+ /*
+ * Only calling e820_add_region does not work and the
+ * tables are invalid (memory got used) later.
+ * memblock_reserve works as expected and the tables won't get modified.
+ * But it's not enough on X86 because ioremap will
+ * complain later (used by acpi_os_map_memory) that the pages
+ * that should get mapped are not marked "reserved".
+ * Calling both memblock_reserve and e820_add_region (via
+ * arch_reserve_mem_area) works fine.
+ */
+ memblock_reserve(acpi_tables_addr, all_tables_size);
+ arch_reserve_mem_area(acpi_tables_addr, all_tables_size);
+
+ p = early_ioremap(acpi_tables_addr, all_tables_size);
+
+ for (no = 0; no < table_nr; no++) {
+ memcpy(p + total_offset, early_initrd_files[no].data,
+ early_initrd_files[no].size);
+ total_offset += early_initrd_files[no].size;
+ }
+ early_iounmap(p, all_tables_size);
+}
+#endif /* CONFIG_ACPI_INITRD_TABLE_OVERRIDE */
+
+static void acpi_table_taint(struct acpi_table_header *table)
+{
+ pr_warn(PREFIX
+ "Override [%4.4s-%8.8s], this is unsafe: tainting kernel\n",
+ table->signature, table->oem_table_id);
+ add_taint(TAINT_OVERRIDDEN_ACPI_TABLE);
+}
+
+
acpi_status
acpi_os_table_override(struct acpi_table_header * existing_table,
struct acpi_table_header ** new_table)
if (strncmp(existing_table->signature, "DSDT", 4) == 0)
*new_table = (struct acpi_table_header *)AmlCode;
#endif
- if (*new_table != NULL) {
- printk(KERN_WARNING PREFIX "Override [%4.4s-%8.8s], "
- "this is unsafe: tainting kernel\n",
- existing_table->signature,
- existing_table->oem_table_id);
- add_taint(TAINT_OVERRIDDEN_ACPI_TABLE);
- }
+ if (*new_table != NULL)
+ acpi_table_taint(existing_table);
return AE_OK;
}
acpi_status
acpi_os_physical_table_override(struct acpi_table_header *existing_table,
- acpi_physical_address * new_address,
- u32 *new_table_length)
+ acpi_physical_address *address,
+ u32 *table_length)
{
- return AE_SUPPORT;
-}
+#ifndef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+ *table_length = 0;
+ *address = 0;
+ return AE_OK;
+#else
+ int table_offset = 0;
+ struct acpi_table_header *table;
+
+ *table_length = 0;
+ *address = 0;
+
+ if (!acpi_tables_addr)
+ return AE_OK;
+
+ do {
+ if (table_offset + ACPI_HEADER_SIZE > all_tables_size) {
+ WARN_ON(1);
+ return AE_OK;
+ }
+ table = acpi_os_map_memory(acpi_tables_addr + table_offset,
+ ACPI_HEADER_SIZE);
+
+ if (table_offset + table->length > all_tables_size) {
+ acpi_os_unmap_memory(table, ACPI_HEADER_SIZE);
+ WARN_ON(1);
+ return AE_OK;
+ }
+
+ table_offset += table->length;
+
+ if (memcmp(existing_table->signature, table->signature, 4)) {
+ acpi_os_unmap_memory(table,
+ ACPI_HEADER_SIZE);
+ continue;
+ }
+
+ /* Only override tables with matching OEM table ID */
+ if (memcmp(table->oem_table_id, existing_table->oem_table_id,
+ ACPI_OEM_TABLE_ID_SIZE)) {
+ acpi_os_unmap_memory(table,
+ ACPI_HEADER_SIZE);
+ continue;
+ }
+
+ table_offset -= table->length;
+ *table_length = table->length;
+ acpi_os_unmap_memory(table, ACPI_HEADER_SIZE);
+ *address = acpi_tables_addr + table_offset;
+ break;
+ } while (table_offset + ACPI_HEADER_SIZE < all_tables_size);
+
+ if (*address != 0)
+ acpi_table_taint(existing_table);
+ return AE_OK;
+#endif
+}
static irqreturn_t acpi_irq(int irq, void *dev_id)
{
node_page_state(dev->id, NUMA_HIT),
node_page_state(dev->id, NUMA_MISS),
node_page_state(dev->id, NUMA_FOREIGN),
- node_page_state(dev->id, NUMA_INTERLEAVE_HIT),
+ 0UL,
node_page_state(dev->id, NUMA_LOCAL),
node_page_state(dev->id, NUMA_OTHER));
}
static u64 get_idle_time(int cpu)
{
- u64 idle, idle_time = get_cpu_idle_time_us(cpu, NULL);
+ u64 idle, idle_time = -1ULL;
+
+ if (cpu_online(cpu))
+ idle_time = get_cpu_idle_time_us(cpu, NULL);
if (idle_time == -1ULL)
- /* !NO_HZ so we can rely on cpustat.idle */
+ /* !NO_HZ or cpu offline so we can rely on cpustat.idle */
idle = kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE];
else
idle = usecs_to_cputime64(idle_time);
static u64 get_iowait_time(int cpu)
{
- u64 iowait, iowait_time = get_cpu_iowait_time_us(cpu, NULL);
+ u64 iowait, iowait_time = -1ULL;
+
+ if (cpu_online(cpu))
+ iowait_time = get_cpu_iowait_time_us(cpu, NULL);
if (iowait_time == -1ULL)
- /* !NO_HZ so we can rely on cpustat.iowait */
+ /* !NO_HZ or cpu offline so we can rely on cpustat.iowait */
iowait = kcpustat_cpu(cpu).cpustat[CPUTIME_IOWAIT];
else
iowait = usecs_to_cputime64(iowait_time);
#define move_pte(pte, prot, old_addr, new_addr) (pte)
#endif
+#ifndef __HAVE_ARCH_PTE_ACCESSIBLE
+#define pte_accessible(pte) pte_present(pte)
+#endif
+
#ifndef flush_tlb_fix_spurious_fault
#define flush_tlb_fix_spurious_fault(vma, address) flush_tlb_page(vma, address)
#endif
typedef int (*acpi_table_entry_handler) (struct acpi_subtable_header *header, const unsigned long end);
+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+void acpi_initrd_override(void *data, size_t size);
+#else
+static inline void acpi_initrd_override(void *data, size_t size)
+{
+}
+#endif
+
char * __acpi_map_table (unsigned long phys_addr, unsigned long size);
void __acpi_unmap_table(char *map, unsigned long size);
int early_acpi_boot_init(void);
acpi_status acpi_os_prepare_sleep(u8 sleep_state,
u32 pm1a_control, u32 pm1b_control);
+#ifdef CONFIG_X86
+void arch_reserve_mem_area(acpi_physical_address addr, size_t size);
+#else
+static inline void arch_reserve_mem_area(acpi_physical_address addr,
+ size_t size)
+{
+}
+#endif /* CONFIG_X86 */
#else
#define acpi_os_set_prepare_sleep(func, pm1a_ctrl, pm1b_ctrl) do { } while (0)
#endif
--- /dev/null
+#ifndef _LINUX_EARLYCPIO_H
+#define _LINUX_EARLYCPIO_H
+
+#include <linux/types.h>
+
+#define MAX_CPIO_FILE_NAME 18
+
+struct cpio_data {
+ void *data;
+ size_t size;
+ char name[MAX_CPIO_FILE_NAME];
+};
+
+struct cpio_data find_cpio_data(const char *path, void *data, size_t len,
+ long *offset);
+
+#endif /* _LINUX_EARLYCPIO_H */
}
return page;
}
+
+extern bool pmd_prot_none(struct vm_area_struct *vma, pmd_t pmd);
+
+extern void do_huge_pmd_prot_none(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ unsigned int flags, pmd_t orig_pmd);
+
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
{
return 0;
}
+
+static inline bool pmd_prot_none(struct vm_area_struct *vma, pmd_t pmd)
+{
+ return false;
+}
+
+static inline void do_huge_pmd_prot_none(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ unsigned int flags, pmd_t orig_pmd)
+{
+}
+
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif /* _LINUX_HUGE_MM_H */
#define INIT_TASK_COMM "swapper"
+#ifdef CONFIG_SCHED_NUMA
+# define INIT_TASK_NUMA(tsk) \
+ .node = -1,
+#else
+# define INIT_TASK_NUMA(tsk)
+#endif
+
/*
* INIT_TASK is used to set up the first task table, touch at
* your own risk!. Base=0, limit=0x1fffff (=2MB)
INIT_TRACE_RECURSION \
INIT_TASK_RCU_PREEMPT(tsk) \
INIT_CPUSET_SEQ \
+ INIT_TASK_NUMA(tsk) \
}
#ifndef _LINUX_MEMPOLICY_H
#define _LINUX_MEMPOLICY_H 1
-
#include <linux/mmzone.h>
#include <linux/slab.h>
#include <linux/rbtree.h>
return 1;
}
-#else
+extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);
+
+extern void lazy_migrate_process(struct mm_struct *mm);
+
+#else /* CONFIG_NUMA */
struct mempolicy {};
return 0;
}
+static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
+ unsigned long address)
+{
+ return -1; /* no node preference */
+}
+
#endif /* CONFIG_NUMA */
#endif
extern void migrate_page_copy(struct page *newpage, struct page *page);
extern int migrate_huge_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page);
+extern int migrate_misplaced_page(struct page *page, int node);
#else
static inline void putback_lru_pages(struct list_head *l) {}
#define migrate_page NULL
#define fail_migrate_page NULL
+static inline
+int migrate_misplaced_page(struct page *page, int node)
+{
+ return -EAGAIN; /* can't migrate now */
+}
#endif /* CONFIG_MIGRATION */
+
#endif /* _LINUX_MIGRATE_H */
* on most operations but not ->writepage as the potential stall time
* is too significant
* MIGRATE_SYNC will block when migrating pages
+ * MIGRATE_FAULT called from the fault path to migrate-on-fault for mempolicy
+ * this path has an extra reference count
*/
enum migrate_mode {
MIGRATE_ASYNC,
MIGRATE_SYNC_LIGHT,
MIGRATE_SYNC,
+ MIGRATE_FAULT,
};
#endif /* MIGRATE_MODE_H_INCLUDED */
#define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */
#define FAULT_FLAG_TRIED 0x40 /* second try */
+/*
+ * Some architectures (such as x86) may need to preserve certain pgprot
+ * bits, without complicating generic pgprot code.
+ *
+ * Most architectures don't care:
+ */
+#ifndef pgprot_modify
+static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
+{
+ return newprot;
+}
+#endif
+
/*
* vm_fault is filled by the the pagefault handler and passed to the vma's
* ->fault function. The vma's ->fault is responsible for returning a bitmask
* sets it, so none of the operations on it need to be atomic.
*/
-
-/*
- * page->flags layout:
- *
- * There are three possibilities for how page->flags get
- * laid out. The first is for the normal case, without
- * sparsemem. The second is for sparsemem when there is
- * plenty of space for node and section. The last is when
- * we have run out of space and have to fall back to an
- * alternate (slower) way of determining the node.
- *
- * No sparsemem or sparsemem vmemmap: | NODE | ZONE | ... | FLAGS |
- * classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
- * classic sparse no space for node: | SECTION | ZONE | ... | FLAGS |
- */
-#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
-#define SECTIONS_WIDTH SECTIONS_SHIFT
-#else
-#define SECTIONS_WIDTH 0
-#endif
-
-#define ZONES_WIDTH ZONES_SHIFT
-
-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
-#define NODES_WIDTH NODES_SHIFT
-#else
-#ifdef CONFIG_SPARSEMEM_VMEMMAP
-#error "Vmemmap: No space for nodes field in page flags"
-#endif
-#define NODES_WIDTH 0
-#endif
-
-/* Page flags: | [SECTION] | [NODE] | ZONE | ... | FLAGS | */
+/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NID] | ... | FLAGS | */
#define SECTIONS_PGOFF ((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
#define NODES_PGOFF (SECTIONS_PGOFF - NODES_WIDTH)
#define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH)
-
-/*
- * We are going to use the flags for the page to node mapping if its in
- * there. This includes the case where there is no node, so it is implicit.
- */
-#if !(NODES_WIDTH > 0 || NODES_SHIFT == 0)
-#define NODE_NOT_IN_PAGE_FLAGS
-#endif
+#define LAST_NID_PGOFF (ZONES_PGOFF - LAST_NID_WIDTH)
/*
* Define the bit shifts to access each section. For non-existent
#define SECTIONS_PGSHIFT (SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
#define NODES_PGSHIFT (NODES_PGOFF * (NODES_WIDTH != 0))
#define ZONES_PGSHIFT (ZONES_PGOFF * (ZONES_WIDTH != 0))
+#define LAST_NID_PGSHIFT (LAST_NID_PGOFF * (LAST_NID_WIDTH != 0))
/* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
#ifdef NODE_NOT_IN_PAGE_FLAGS
#define ZONES_MASK ((1UL << ZONES_WIDTH) - 1)
#define NODES_MASK ((1UL << NODES_WIDTH) - 1)
#define SECTIONS_MASK ((1UL << SECTIONS_WIDTH) - 1)
+#define LAST_NID_MASK ((1UL << LAST_NID_WIDTH) - 1)
#define ZONEID_MASK ((1UL << ZONEID_SHIFT) - 1)
static inline enum zone_type page_zonenum(const struct page *page)
}
#endif
+#ifdef CONFIG_SCHED_NUMA
+#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
+static inline int page_xchg_last_nid(struct page *page, int nid)
+{
+ return xchg(&page->_last_nid, nid);
+}
+
+static inline int page_last_nid(struct page *page)
+{
+ return page->_last_nid;
+}
+#else
+static inline int page_xchg_last_nid(struct page *page, int nid)
+{
+ unsigned long old_flags, flags;
+ int last_nid;
+
+ do {
+ old_flags = flags = page->flags;
+ last_nid = (flags >> LAST_NID_PGSHIFT) & LAST_NID_MASK;
+
+ flags &= ~(LAST_NID_MASK << LAST_NID_PGSHIFT);
+ flags |= (nid & LAST_NID_MASK) << LAST_NID_PGSHIFT;
+ } while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));
+
+ return last_nid;
+}
+
+static inline int page_last_nid(struct page *page)
+{
+ return (page->flags >> LAST_NID_PGSHIFT) & LAST_NID_MASK;
+}
+#endif /* LAST_NID_NOT_IN_PAGE_FLAGS */
+#else /* CONFIG_SCHED_NUMA */
+static inline int page_xchg_last_nid(struct page *page, int nid)
+{
+ return page_to_nid(page);
+}
+
+static inline int page_last_nid(struct page *page)
+{
+ return page_to_nid(page);
+}
+#endif /* CONFIG_SCHED_NUMA */
+
static inline struct zone *page_zone(const struct page *page)
{
return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
extern unsigned long do_mremap(unsigned long addr,
unsigned long old_len, unsigned long new_len,
unsigned long flags, unsigned long new_addr);
+extern void change_protection(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, pgprot_t newprot,
+ int dirty_accountable);
extern int mprotect_fixup(struct vm_area_struct *vma,
struct vm_area_struct **pprev, unsigned long start,
unsigned long end, unsigned long newflags);
}
#endif
+static inline pgprot_t vma_prot_none(struct vm_area_struct *vma)
+{
+ /*
+ * obtain PROT_NONE by removing READ|WRITE|EXEC privs
+ */
+ vm_flags_t vmflags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
+ return pgprot_modify(vma->vm_page_prot, vm_get_page_prot(vmflags));
+}
+
struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
unsigned long pfn, unsigned long size, pgprot_t);
#include <linux/cpumask.h>
#include <linux/page-debug-flags.h>
#include <linux/uprobes.h>
+#include <linux/page-flags-layout.h>
#include <asm/page.h>
#include <asm/mmu.h>
*/
void *shadow;
#endif
+
+#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
+ int _last_nid;
+#endif
}
/*
* The struct page can be forced to be double word aligned so that atomic ops
#endif
#ifdef CONFIG_CPUMASK_OFFSTACK
struct cpumask cpumask_allocation;
+#endif
+#ifdef CONFIG_SCHED_NUMA
+ unsigned long numa_next_scan;
+ int numa_scan_seq;
#endif
struct uprobes_state uprobes_state;
};
#include <linux/seqlock.h>
#include <linux/nodemask.h>
#include <linux/pageblock-flags.h>
-#include <generated/bounds.h>
+#include <linux/page-flags-layout.h>
#include <linux/atomic.h>
#include <asm/page.h>
NUMA_HIT, /* allocated in intended node */
NUMA_MISS, /* allocated in non intended node */
NUMA_FOREIGN, /* was intended here, hit elsewhere */
- NUMA_INTERLEAVE_HIT, /* interleaver preferred this zone */
NUMA_LOCAL, /* allocation from local node */
NUMA_OTHER, /* allocation from other node */
#endif
* match the requested limits. See gfp_zone() in include/linux/gfp.h
*/
-#if MAX_NR_ZONES < 2
-#define ZONES_SHIFT 0
-#elif MAX_NR_ZONES <= 2
-#define ZONES_SHIFT 1
-#elif MAX_NR_ZONES <= 4
-#define ZONES_SHIFT 2
-#else
-#error ZONES_SHIFT -- too many zones configured adjust calculation
-#endif
-
struct zone {
/* Fields commonly accessed by the page allocator */
* PA_SECTION_SHIFT physical address to/from section number
* PFN_SECTION_SHIFT pfn to/from section number
*/
-#define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
-
#define PA_SECTION_SHIFT (SECTION_SIZE_BITS)
#define PFN_SECTION_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)
--- /dev/null
+#ifndef _LINUX_PAGE_FLAGS_LAYOUT
+#define _LINUX_PAGE_FLAGS_LAYOUT
+
+#include <linux/numa.h>
+#include <generated/bounds.h>
+
+#if MAX_NR_ZONES < 2
+#define ZONES_SHIFT 0
+#elif MAX_NR_ZONES <= 2
+#define ZONES_SHIFT 1
+#elif MAX_NR_ZONES <= 4
+#define ZONES_SHIFT 2
+#else
+#error ZONES_SHIFT -- too many zones configured adjust calculation
+#endif
+
+#ifdef CONFIG_SPARSEMEM
+#include <asm/sparsemem.h>
+
+/*
+ * SECTIONS_SHIFT #bits space required to store a section #
+ */
+#define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
+#endif
+
+/*
+ * page->flags layout:
+ *
+ * There are five possibilities for how page->flags get laid out. The first
+ * two are for the normal case, without sparsemem (or with sparsemem vmemmap).
+ * The third and fourth are for classic sparsemem when there is plenty of
+ * space for node and section. The last is when we have run out of space and
+ * have to fall back to an alternate (slower) way of determining the node.
+ *
+ * No sparsemem or sparsemem vmemmap: | NODE | ZONE | ... | FLAGS |
+ * " plus space for last_nid:| NODE | ZONE | LAST_NID | ... | FLAGS |
+ * classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
+ * " plus space for last_nid:| SECTION | NODE | ZONE | LAST_NID | ... | FLAGS |
+ * classic sparse no space for node: | SECTION | ZONE | ... | FLAGS |
+ */
+#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
+
+#define SECTIONS_WIDTH SECTIONS_SHIFT
+#else
+#define SECTIONS_WIDTH 0
+#endif
+
+#define ZONES_WIDTH ZONES_SHIFT
+
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define NODES_WIDTH NODES_SHIFT
+#else
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+#error "Vmemmap: No space for nodes field in page flags"
+#endif
+#define NODES_WIDTH 0
+#endif
+
+#ifdef CONFIG_SCHED_NUMA
+#define LAST_NID_SHIFT NODES_SHIFT
+#else
+#define LAST_NID_SHIFT 0
+#endif
+
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define LAST_NID_WIDTH LAST_NID_SHIFT
+#else
+#define LAST_NID_WIDTH 0
+#endif
+
+/*
+ * We are going to use the flags for the page to node mapping if it's in
+ * there. This includes the case where there is no node, so it is implicit.
+ */
+#if !(NODES_WIDTH > 0 || NODES_SHIFT == 0)
+#define NODE_NOT_IN_PAGE_FLAGS
+#endif
+
+#if defined(CONFIG_SCHED_NUMA) && LAST_NID_WIDTH == 0
+#define LAST_NID_NOT_IN_PAGE_FLAGS
+#endif
+
+#endif /* _LINUX_PAGE_FLAGS_LAYOUT */
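
The LAST_NID field described in the layout comment above is read and updated in place inside the flags word. The following is a minimal userspace sketch of that packing and of the compare-and-swap update loop, not the kernel code itself. It assumes C11 atomics in place of the kernel's cmpxchg(). The names fake_page, fake_xchg_last_nid and fake_last_nid are hypothetical, and the width and shift values are made up for illustration.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical field geometry, chosen only for this sketch. */
#define LAST_NID_WIDTH   6
#define LAST_NID_PGSHIFT 20
#define LAST_NID_MASK    ((1UL << LAST_NID_WIDTH) - 1)

/* Stand-in for struct page; only the flags word matters here. */
struct fake_page {
	_Atomic unsigned long flags;
};

/* Atomically replace the LAST_NID field and return its previous value. */
static int fake_xchg_last_nid(struct fake_page *page, int nid)
{
	unsigned long old_flags, flags;
	int last_nid;

	/* Sketch-only sanity check; the patch masks the value instead. */
	assert(nid >= 0 && (unsigned long)nid <= LAST_NID_MASK);

	old_flags = atomic_load(&page->flags);
	do {
		last_nid = (int)((old_flags >> LAST_NID_PGSHIFT) & LAST_NID_MASK);

		/* Clear the old field, then merge in the new node id. */
		flags = old_flags & ~(LAST_NID_MASK << LAST_NID_PGSHIFT);
		flags |= ((unsigned long)nid & LAST_NID_MASK) << LAST_NID_PGSHIFT;
		/* On failure old_flags is reloaded with the current value. */
	} while (!atomic_compare_exchange_weak(&page->flags, &old_flags, flags));

	return last_nid;
}

/* Read the LAST_NID field without modifying the flags word. */
static int fake_last_nid(struct fake_page *page)
{
	return (int)((atomic_load(&page->flags) >> LAST_NID_PGSHIFT) & LAST_NID_MASK);
}
```

The retry loop is what keeps concurrent updates to the other bits of the word safe: a store is only committed if nothing else changed the word since it was read.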
#define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */
#define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */
#define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */
+#define SD_NUMA 0x4000 /* cross-node balancing */
extern int __weak arch_sd_sibiling_asym_packing(void);
short il_next;
short pref_node_fork;
#endif
+#ifdef CONFIG_SCHED_NUMA
+ int node; /* task home node */
+ int numa_scan_seq;
+ int numa_migrate_seq;
+ unsigned int numa_task_period;
+ u64 node_stamp; /* migration stamp */
+ unsigned long numa_contrib;
+ unsigned long *numa_faults;
+ struct callback_head numa_work;
+#endif /* CONFIG_SCHED_NUMA */
+
struct rcu_head rcu;
/*
/* Future-safe accessor for struct task_struct's cpus_allowed. */
#define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
+#ifdef CONFIG_SCHED_NUMA
+static inline int tsk_home_node(struct task_struct *p)
+{
+ return p->node;
+}
+
+extern void task_numa_fault(int node, int pages);
+#else
+static inline int tsk_home_node(struct task_struct *p)
+{
+ return -1;
+}
+
+static inline void task_numa_fault(int node, int pages)
+{
+}
+#endif /* CONFIG_SCHED_NUMA */
+
/*
* Priority of a process goes from 0..MAX_PRIO-1, valid RT
* priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
};
extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
+extern unsigned int sysctl_sched_numa_task_period_min;
+extern unsigned int sysctl_sched_numa_task_period_max;
+extern unsigned int sysctl_sched_numa_settle_count;
+
#ifdef CONFIG_SCHED_DEBUG
extern unsigned int sysctl_sched_migration_cost;
extern unsigned int sysctl_sched_nr_migrate;
int sched_proc_update_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length,
loff_t *ppos);
-#endif
-#ifdef CONFIG_SCHED_DEBUG
+
static inline unsigned int get_sysctl_timer_migration(void)
{
return sysctl_timer_migration;
}
-#else
+#else /* CONFIG_SCHED_DEBUG */
static inline unsigned int get_sysctl_timer_migration(void)
{
return 1;
}
-#endif
+#endif /* CONFIG_SCHED_DEBUG */
extern unsigned int sysctl_sched_rt_period;
extern int sysctl_sched_rt_runtime;
const struct sched_param *);
extern int sched_setscheduler_nocheck(struct task_struct *, int,
const struct sched_param *);
+extern void sched_setnode(struct task_struct *p, int node);
extern struct task_struct *idle_task(int cpu);
/**
* is_idle_task - is the specified task an idle task?
MPOL_PREFERRED,
MPOL_BIND,
MPOL_INTERLEAVE,
+ MPOL_LOCAL,
+ MPOL_NOOP, /* retain existing policy for range */
MPOL_MAX, /* always last member of enum */
};
/* Flags for mbind */
#define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */
-#define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform to mapping */
-#define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to mapping */
-#define MPOL_MF_INTERNAL (1<<3) /* Internal flags start here */
+#define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform
+ to policy */
+#define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to policy */
+#define MPOL_MF_LAZY (1<<3) /* Modifies '_MOVE: lazy migrate on fault */
+#define MPOL_MF_INTERNAL (1<<4) /* Internal flags start here */
+
+#define MPOL_MF_VALID (MPOL_MF_STRICT | \
+ MPOL_MF_MOVE | \
+ MPOL_MF_MOVE_ALL | \
+ MPOL_MF_LAZY)
/*
* Internal flags that share the struct mempolicy flags word with
#define MPOL_F_SHARED (1 << 0) /* identify shared policies */
#define MPOL_F_LOCAL (1 << 1) /* preferred local allocation */
#define MPOL_F_REBINDING (1 << 2) /* identify policies in rebinding */
+#define MPOL_F_MOF (1 << 3) /* this policy wants migrate on fault */
+#define MPOL_F_HOME (1 << 4) /* this is the home-node policy */
#endif /* _UAPI_LINUX_MEMPOLICY_H */
config HAVE_UNSTABLE_SCHED_CLOCK
bool
+#
+# For architectures that (ab)use NUMA to represent different memory regions
+# all cpu-local but of different latencies, such as SuperH.
+#
+config EMBEDDED_NUMA
+ bool
+
+config SCHED_NUMA
+ bool "Memory placement aware NUMA scheduler"
+ default n
+ depends on SMP && NUMA && MIGRATION && !EMBEDDED_NUMA
+ help
+ This option adds support for automatic NUMA aware memory/task placement.
+
menuconfig CGROUPS
boolean "Control Group support"
depends on EVENTFD
#ifdef CONFIG_PREEMPT_NOTIFIERS
INIT_HLIST_HEAD(&p->preempt_notifiers);
#endif
+
+#ifdef CONFIG_SCHED_NUMA
+ if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
+ p->mm->numa_next_scan = jiffies;
+ p->mm->numa_scan_seq = 0;
+ }
+
+ p->node = -1;
+ p->node_stamp = 0ULL;
+ p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
+ p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
+ p->numa_faults = NULL;
+ p->numa_task_period = sysctl_sched_numa_task_period_min;
+ p->numa_work.next = &p->numa_work;
+#endif /* CONFIG_SCHED_NUMA */
}
/*
if (mm)
mmdrop(mm);
if (unlikely(prev_state == TASK_DEAD)) {
+ task_numa_free(prev);
/*
* Remove function-return probe instances associated with this
* task and put them back on the free list.
DEFINE_PER_CPU(struct sched_domain *, sd_llc);
DEFINE_PER_CPU(int, sd_llc_id);
-static void update_top_cache_domain(int cpu)
+DEFINE_PER_CPU(struct sched_domain *, sd_node);
+
+static void update_domain_cache(int cpu)
{
struct sched_domain *sd;
int id = cpu;
rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
per_cpu(sd_llc_id, cpu) = id;
+
+ for_each_domain(cpu, sd) {
+ if (cpumask_equal(sched_domain_span(sd),
+ cpumask_of_node(cpu_to_node(cpu))))
+ goto got_node;
+ }
+ sd = NULL;
+got_node:
+ rcu_assign_pointer(per_cpu(sd_node, cpu), sd);
}
/*
rcu_assign_pointer(rq->sd, sd);
destroy_sched_domains(tmp, cpu);
- update_top_cache_domain(cpu);
+ update_domain_cache(cpu);
}
/* cpus with isolated domains */
static struct sched_domain_topology_level *sched_domain_topology = default_topology;
+#ifdef CONFIG_SCHED_NUMA
+
+/*
+ * Requeues a task ensuring it's on the right load-balance list so
+ * that it might get migrated to its new home.
+ *
+ * Since home-node is pure preference there's no hard migrate to force
+ * us anywhere, this also allows us to call this from atomic context if
+ * required.
+ */
+void sched_setnode(struct task_struct *p, int node)
+{
+ unsigned long flags;
+ int on_rq, running;
+ struct rq *rq;
+
+ rq = task_rq_lock(p, &flags);
+ on_rq = p->on_rq;
+ running = task_current(rq, p);
+
+ if (on_rq)
+ dequeue_task(rq, p, 0);
+ if (running)
+ p->sched_class->put_prev_task(rq, p);
+
+ p->node = node;
+
+ if (running)
+ p->sched_class->set_curr_task(rq);
+ if (on_rq)
+ enqueue_task(rq, p, 0);
+ task_rq_unlock(rq, p, &flags);
+}
+
+#endif /* CONFIG_SCHED_NUMA */
+
#ifdef CONFIG_NUMA
static int sched_domains_numa_levels;
| 0*SD_SHARE_PKG_RESOURCES
| 1*SD_SERIALIZE
| 0*SD_PREFER_SIBLING
+ | 1*SD_NUMA
| sd_local_flags(level)
,
.last_balance = jiffies,
rq->avg_idle = 2*sysctl_sched_migration_cost;
INIT_LIST_HEAD(&rq->cfs_tasks);
+#ifdef CONFIG_SCHED_NUMA
+ INIT_LIST_HEAD(&rq->offnode_tasks);
+ rq->onnode_running = 0;
+ rq->offnode_running = 0;
+ rq->offnode_weight = 0;
+#endif
rq_attach_root(rq, &def_root_domain);
#ifdef CONFIG_NO_HZ
SEQ_printf(m, "%15Ld %15Ld %15Ld.%06ld %15Ld.%06ld %15Ld.%06ld",
0LL, 0LL, 0LL, 0L, 0LL, 0L, 0LL, 0L);
#endif
+#ifdef CONFIG_SCHED_NUMA
+ SEQ_printf(m, " %d/%d", p->node, cpu_to_node(task_cpu(p)));
+#endif
#ifdef CONFIG_CGROUP_SCHED
SEQ_printf(m, " %s", task_group_path(task_group(p)));
#endif
*
* Adaptive scheduling granularity, math enhancements by Peter Zijlstra
* Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * NUMA placement, statistics and algorithm by Andrea Arcangeli,
+ * CFS balancing changes by Peter Zijlstra. Copyright (C) 2012 Red Hat, Inc.
*/
#include <linux/latencytop.h>
#include <linux/slab.h>
#include <linux/profile.h>
#include <linux/interrupt.h>
+#include <linux/random.h>
+#include <linux/mempolicy.h>
+#include <linux/task_work.h>
#include <trace/events/sched.h>
se->exec_start = rq_of(cfs_rq)->clock_task;
}
+/**************************************************
+ * Scheduling class numa methods.
+ *
+ * The purpose of the NUMA bits is to maintain compute (task) and data
+ * (memory) locality. We try and achieve this by making tasks stick to
+ * a particular node (their home node) but if fairness mandates they run
+ * elsewhere for long enough, we let the memory follow them.
+ *
+ * Tasks start out with their home-node unset (-1); this effectively means
+ * they act !NUMA until we've established the task is busy enough to bother
+ * with placement.
+ *
+ * We keep a home-node per task and use periodic fault scans to try and
+ * establish a task<->page relation. This assumes the task<->page relation is a
+ * compute<->data relation; this is false for things like virt. and n:m
+ * threading solutions but it's the best we can do given the information we
+ * have.
+ */
+
+static unsigned long task_h_load(struct task_struct *p);
+
+#ifdef CONFIG_SCHED_NUMA
+static struct list_head *account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+ struct list_head *tasks = &rq->cfs_tasks;
+
+ if (tsk_home_node(p) != cpu_to_node(task_cpu(p))) {
+ p->numa_contrib = task_h_load(p);
+ rq->offnode_weight += p->numa_contrib;
+ rq->offnode_running++;
+ tasks = &rq->offnode_tasks;
+ } else
+ rq->onnode_running++;
+
+ return tasks;
+}
+
+static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+ if (tsk_home_node(p) != cpu_to_node(task_cpu(p))) {
+ rq->offnode_weight -= p->numa_contrib;
+ rq->offnode_running--;
+ } else
+ rq->onnode_running--;
+}
+
+/*
+ * numa task sample period in ms: 5s
+ */
+unsigned int sysctl_sched_numa_task_period_min = 5000;
+unsigned int sysctl_sched_numa_task_period_max = 5000*16;
+
+/*
+ * Wait for the 2-sample stuff to settle before migrating again
+ */
+unsigned int sysctl_sched_numa_settle_count = 2;
+
+static void task_numa_placement(struct task_struct *p)
+{
+ unsigned long faults, max_faults = 0;
+ int node, max_node = -1;
+ int seq = ACCESS_ONCE(p->mm->numa_scan_seq);
+
+ if (p->numa_scan_seq == seq)
+ return;
+
+ p->numa_scan_seq = seq;
+
+ for (node = 0; node < nr_node_ids; node++) {
+ faults = p->numa_faults[node];
+
+ if (faults > max_faults) {
+ max_faults = faults;
+ max_node = node;
+ }
+
+ p->numa_faults[node] /= 2;
+ }
+
+ if (max_node == -1)
+ return;
+
+ if (p->node != max_node) {
+ p->numa_task_period = sysctl_sched_numa_task_period_min;
+ if (sched_feat(NUMA_SETTLE) &&
+ (seq - p->numa_migrate_seq) <= (int)sysctl_sched_numa_settle_count)
+ return;
+ p->numa_migrate_seq = seq;
+ sched_setnode(p, max_node);
+ } else {
+ p->numa_task_period = min(sysctl_sched_numa_task_period_max,
+ p->numa_task_period * 2);
+ }
+}
+
+/*
+ * Got a PROT_NONE fault for a page on @node.
+ */
+void task_numa_fault(int node, int pages)
+{
+ struct task_struct *p = current;
+
+ if (unlikely(!p->numa_faults)) {
+ int size = sizeof(unsigned long) * nr_node_ids;
+
+ p->numa_faults = kzalloc(size, GFP_KERNEL);
+ if (!p->numa_faults)
+ return;
+ }
+
+ task_numa_placement(p);
+
+ p->numa_faults[node] += pages;
+}
+
+/*
+ * The expensive part of numa migration is done from task_work context.
+ * Triggered from task_tick_numa().
+ */
+void task_numa_work(struct callback_head *work)
+{
+ unsigned long migrate, next_scan, now = jiffies;
+ struct task_struct *p = current;
+ struct mm_struct *mm = p->mm;
+
+ WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
+
+ work->next = work; /* protect against double add */
+ /*
+ * Who cares about NUMA placement when they're dying.
+ *
+ * NOTE: make sure not to dereference p->mm before this check,
+ * exit_task_work() happens _after_ exit_mm() so we could be called
+ * without p->mm even though we still had it when we enqueued this
+ * work.
+ */
+ if (p->flags & PF_EXITING)
+ return;
+
+ /*
+ * Enforce maximal scan/migration frequency..
+ */
+ migrate = mm->numa_next_scan;
+ if (time_before(now, migrate))
+ return;
+
+ next_scan = now + 2*msecs_to_jiffies(sysctl_sched_numa_task_period_min);
+ if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
+ return;
+
+ ACCESS_ONCE(mm->numa_scan_seq)++;
+ lazy_migrate_process(mm);
+}
+
+/*
+ * Drive the periodic memory faults..
+ */
+void task_tick_numa(struct rq *rq, struct task_struct *curr)
+{
+ struct callback_head *work = &curr->numa_work;
+ u64 period, now;
+
+ /*
+ * We don't care about NUMA placement if we don't have memory.
+ */
+ if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
+ return;
+
+ /*
+ * Using runtime rather than walltime has the dual advantage that
+ * we (mostly) drive the selection from busy threads and that the
+ * task needs to have done some actual work before we bother with
+ * NUMA placement.
+ */
+ now = curr->se.sum_exec_runtime;
+ period = (u64)curr->numa_task_period * NSEC_PER_MSEC;
+
+ if (now - curr->node_stamp > period) {
+ curr->node_stamp = now;
+
+ if (!time_before(jiffies, curr->mm->numa_next_scan)) {
+ init_task_work(work, task_numa_work); /* TODO: move this into sched_fork() */
+ task_work_add(curr, work, true);
+ }
+ }
+}
+#else
+static struct list_head *account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+ return NULL;
+}
+
+static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+}
+
+static void task_tick_numa(struct rq *rq, struct task_struct *curr)
+{
+}
+#endif /* CONFIG_SCHED_NUMA */
+
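
The node selection inside task_numa_placement() above can be read in isolation as "pick the node with the most recorded faults, then halve every counter so old samples decay". The following is a minimal userspace sketch of just that step, under the assumption that the per-node counters are a plain array. The helper name pick_node_and_decay is hypothetical and does not appear in the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper for this sketch only: return the index of the node
 * with the highest fault count (-1 if every counter is zero) and halve
 * each counter so that older samples carry less weight next time.
 */
static int pick_node_and_decay(unsigned long *faults, size_t nr_nodes)
{
	unsigned long max_faults = 0;
	int max_node = -1;
	size_t node;

	assert(faults != NULL);

	for (node = 0; node < nr_nodes; node++) {
		/* Strict comparison: on a tie the lowest-numbered node wins. */
		if (faults[node] > max_faults) {
			max_faults = faults[node];
			max_node = (int)node;
		}

		/* Exponential decay of the sample history. */
		faults[node] /= 2;
	}

	return max_node;
}
```

In the patch the result is then compared with the current home node. A change resets the scan period to its minimum and is subject to the settle count, while an unchanged result doubles the period up to the configured maximum.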
/**************************************************
* Scheduling class queueing methods:
*/
if (!parent_entity(se))
update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
#ifdef CONFIG_SMP
- if (entity_is_task(se))
- list_add(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
-#endif
+ if (entity_is_task(se)) {
+ struct rq *rq = rq_of(cfs_rq);
+ struct task_struct *p = task_of(se);
+ struct list_head *tasks = &rq->cfs_tasks;
+
+ if (tsk_home_node(p) != -1)
+ tasks = account_numa_enqueue(rq, p);
+
+ list_add(&se->group_node, tasks);
+ }
+#endif /* CONFIG_SMP */
cfs_rq->nr_running++;
}
update_load_sub(&cfs_rq->load, se->load.weight);
if (!parent_entity(se))
update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
- if (entity_is_task(se))
+ if (entity_is_task(se)) {
+ struct task_struct *p = task_of(se);
+
list_del_init(&se->group_node);
+
+ if (tsk_home_node(p) != -1)
+ account_numa_dequeue(rq_of(cfs_rq), p);
+ }
cfs_rq->nr_running--;
}
return target;
}
+#ifdef CONFIG_SCHED_NUMA
+static inline bool pick_numa_rand(int n)
+{
+ return !(get_random_int() % n);
+}
+
+/*
+ * Pick a random eligible CPU in the target node, hopefully faster
+ * than doing a least-loaded scan.
+ */
+static int numa_select_node_cpu(struct task_struct *p, int node)
+{
+ int weight = cpumask_weight(cpumask_of_node(node));
+ int i, cpu = -1;
+
+ for_each_cpu_and(i, cpumask_of_node(node), tsk_cpus_allowed(p)) {
+ if (cpu < 0 || pick_numa_rand(weight))
+ cpu = i;
+ }
+
+ return cpu;
+}
+#else
+static int numa_select_node_cpu(struct task_struct *p, int node)
+{
+ return -1;
+}
+#endif /* CONFIG_SCHED_NUMA */
+
/*
* sched_balance_self: balance the current task (running on cpu) in domains
* that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
int new_cpu = cpu;
int want_affine = 0;
int sync = wake_flags & WF_SYNC;
+ int node = tsk_home_node(p);
if (p->nr_cpus_allowed == 1)
return prev_cpu;
}
rcu_read_lock();
+ if (sched_feat_numa(NUMA_TTWU_BIAS) && node != -1) {
+ /*
+ * For fork,exec find the idlest cpu in the home-node.
+ */
+ if (sd_flag & (SD_BALANCE_FORK|SD_BALANCE_EXEC)) {
+ int node_cpu = numa_select_node_cpu(p, node);
+ if (node_cpu < 0)
+ goto find_sd;
+
+ new_cpu = cpu = node_cpu;
+ sd = per_cpu(sd_node, cpu);
+ goto pick_idlest;
+ }
+
+ /*
+ * For wake, pretend we were running in the home-node.
+ */
+ if (cpu_to_node(prev_cpu) != node) {
+ int node_cpu = numa_select_node_cpu(p, node);
+ if (node_cpu < 0)
+ goto find_sd;
+
+ if (sched_feat_numa(NUMA_TTWU_TO))
+ cpu = node_cpu;
+ else
+ prev_cpu = node_cpu;
+ }
+ }
+
+find_sd:
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
continue;
goto unlock;
}
+pick_idlest:
while (sd) {
int load_idx = sd->forkexec_idx;
struct sched_group *group;
* Batch and idle tasks do not preempt non-idle tasks (their preemption
* is driven by the tick):
*/
- if (unlikely(p->policy != SCHED_NORMAL))
+ if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
return;
find_matching_se(&se, &pse);
unsigned int flags;
+ struct list_head *tasks;
+
unsigned int loop;
unsigned int loop_break;
unsigned int loop_max;
+
+ struct rq * (*find_busiest_queue)(struct lb_env *,
+ struct sched_group *);
};
/*
check_preempt_curr(env->dst_rq, p, 0);
}
+static int task_numa_hot(struct task_struct *p, struct lb_env *env)
+{
+ int from_dist, to_dist;
+ int node = tsk_home_node(p);
+
+ if (!sched_feat_numa(NUMA_HOT) || node == -1)
+ return 0; /* no node preference */
+
+ from_dist = node_distance(cpu_to_node(env->src_cpu), node);
+ to_dist = node_distance(cpu_to_node(env->dst_cpu), node);
+
+ if (to_dist < from_dist)
+ return 0; /* getting closer is ok */
+
+ return 1; /* stick to where we are */
+}
+
/*
* Is this task likely cache-hot:
*/
static int
-task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
+task_hot(struct task_struct *p, struct lb_env *env)
{
s64 delta;
if (sysctl_sched_migration_cost == 0)
return 0;
- delta = now - p->se.exec_start;
+ delta = env->src_rq->clock_task - p->se.exec_start;
return delta < (s64)sysctl_sched_migration_cost;
}
* 2) too many balance attempts have failed.
*/
- tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
+ tsk_cache_hot = task_hot(p, env);
+ if (env->idle == CPU_NOT_IDLE)
+ tsk_cache_hot |= task_numa_hot(p, env);
if (!tsk_cache_hot ||
env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
#ifdef CONFIG_SCHEDSTATS
*
* Called with both runqueues locked.
*/
-static int move_one_task(struct lb_env *env)
+static int __move_one_task(struct lb_env *env)
{
struct task_struct *p, *n;
- list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
+ list_for_each_entry_safe(p, n, env->tasks, se.group_node) {
if (throttled_lb_pair(task_group(p), env->src_rq->cpu, env->dst_cpu))
continue;
return 0;
}
-static unsigned long task_h_load(struct task_struct *p);
+static int move_one_task(struct lb_env *env)
+{
+ if (sched_feat_numa(NUMA_PULL)) {
+ env->tasks = offnode_tasks(env->src_rq);
+ if (__move_one_task(env))
+ return 1;
+ }
+
+ env->tasks = &env->src_rq->cfs_tasks;
+ if (__move_one_task(env))
+ return 1;
+
+ return 0;
+}
static const unsigned int sched_nr_migrate_break = 32;
*/
static int move_tasks(struct lb_env *env)
{
- struct list_head *tasks = &env->src_rq->cfs_tasks;
struct task_struct *p;
unsigned long load;
int pulled = 0;
if (env->imbalance <= 0)
return 0;
- while (!list_empty(tasks)) {
- p = list_first_entry(tasks, struct task_struct, se.group_node);
+again:
+ while (!list_empty(env->tasks)) {
+ p = list_first_entry(env->tasks, struct task_struct, se.group_node);
env->loop++;
/* We've more or less seen every task there is, call it quits */
if (env->loop > env->loop_break) {
env->loop_break += sched_nr_migrate_break;
env->flags |= LBF_NEED_BREAK;
- break;
+ goto out;
}
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
* the critical section.
*/
if (env->idle == CPU_NEWLY_IDLE)
- break;
+ goto out;
#endif
/*
* weighted load.
*/
if (env->imbalance <= 0)
- break;
+ goto out;
continue;
next:
- list_move_tail(&p->se.group_node, tasks);
+ list_move_tail(&p->se.group_node, env->tasks);
}
+ if (env->tasks == offnode_tasks(env->src_rq)) {
+ env->tasks = &env->src_rq->cfs_tasks;
+ env->loop = 0;
+ goto again;
+ }
+
+out:
/*
* Right now, this is one of only two places move_task() is called,
* so we can safely collect move_task() stats here rather than
unsigned int busiest_group_weight;
int group_imb; /* Is there imbalance in this sd */
+#ifdef CONFIG_SCHED_NUMA
+ struct sched_group *numa_group; /* group which has offnode_tasks */
+ unsigned long numa_group_weight;
+ unsigned long numa_group_running;
+
+ unsigned long this_offnode_running;
+ unsigned long this_onnode_running;
+#endif
};
/*
unsigned long group_weight;
int group_imb; /* Is there an imbalance in the group ? */
int group_has_capacity; /* Is there extra capacity in the group? */
+#ifdef CONFIG_SCHED_NUMA
+ unsigned long numa_offnode_weight;
+ unsigned long numa_offnode_running;
+ unsigned long numa_onnode_running;
+#endif
};
/**
return load_idx;
}
+#ifdef CONFIG_SCHED_NUMA
+static inline void update_sg_numa_stats(struct sg_lb_stats *sgs, struct rq *rq)
+{
+ sgs->numa_offnode_weight += rq->offnode_weight;
+ sgs->numa_offnode_running += rq->offnode_running;
+ sgs->numa_onnode_running += rq->onnode_running;
+}
+
+/*
+ * Since the offnode lists are indiscriminate (they contain tasks for all other
+ * nodes) it is impossible to say if there's any task on there that wants to
+ * move towards the pulling cpu. Therefore select a random offnode list to pull
+ * from such that eventually we'll try them all.
+ *
+ * Select a random group that has offnode tasks as sds->numa_group
+ */
+static inline void update_sd_numa_stats(struct sched_domain *sd,
+ struct sched_group *group, struct sd_lb_stats *sds,
+ int local_group, struct sg_lb_stats *sgs)
+{
+ if (!(sd->flags & SD_NUMA))
+ return;
+
+ if (local_group) {
+ sds->this_offnode_running = sgs->numa_offnode_running;
+ sds->this_onnode_running = sgs->numa_onnode_running;
+ return;
+ }
+
+ if (!sgs->numa_offnode_running)
+ return;
+
+ if (!sds->numa_group || pick_numa_rand(sd->span_weight / group->group_weight)) {
+ sds->numa_group = group;
+ sds->numa_group_weight = sgs->numa_offnode_weight;
+ sds->numa_group_running = sgs->numa_offnode_running;
+ }
+}
+
+/*
+ * Pick a random queue from the group that has offnode tasks.
+ */
+static struct rq *find_busiest_numa_queue(struct lb_env *env,
+ struct sched_group *group)
+{
+ struct rq *busiest = NULL, *rq;
+ int cpu;
+
+ for_each_cpu_and(cpu, sched_group_cpus(group), env->cpus) {
+ rq = cpu_rq(cpu);
+ if (!rq->offnode_running)
+ continue;
+ if (!busiest || pick_numa_rand(group->group_weight))
+ busiest = rq;
+ }
+
+ return busiest;
+}
+
+/*
+ * Called in case of no other imbalance, if there is a queue running offnode
+ * tasks we'll say we're imbalanced anyway to nudge these tasks towards their
+ * proper node.
+ */
+static inline int check_numa_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
+{
+ if (!sched_feat(NUMA_PULL_BIAS))
+ return 0;
+
+ if (!sds->numa_group)
+ return 0;
+
+ /*
+ * Only pull an offnode task home if we've got offnode or !numa tasks to trade for it.
+ */
+ if (!sds->this_offnode_running &&
+ !(sds->this_nr_running - sds->this_onnode_running - sds->this_offnode_running))
+ return 0;
+
+ env->imbalance = sds->numa_group_weight / sds->numa_group_running;
+ sds->busiest = sds->numa_group;
+ env->find_busiest_queue = find_busiest_numa_queue;
+ return 1;
+}
+
+static inline bool need_active_numa_balance(struct lb_env *env)
+{
+ return env->find_busiest_queue == find_busiest_numa_queue &&
+ env->src_rq->offnode_running == 1 &&
+ env->src_rq->nr_running == 1;
+}
+
+#else /* CONFIG_SCHED_NUMA */
+
+static inline void update_sg_numa_stats(struct sg_lb_stats *sgs, struct rq *rq)
+{
+}
+
+static inline void update_sd_numa_stats(struct sched_domain *sd,
+ struct sched_group *group, struct sd_lb_stats *sds,
+ int local_group, struct sg_lb_stats *sgs)
+{
+}
+
+static inline int check_numa_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
+{
+ return 0;
+}
+
+static inline bool need_active_numa_balance(struct lb_env *env)
+{
+ return false;
+}
+#endif /* CONFIG_SCHED_NUMA */
+
unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
{
return SCHED_POWER_SCALE;
sgs->sum_weighted_load += weighted_cpuload(i);
if (idle_cpu(i))
sgs->idle_cpus++;
+
+ update_sg_numa_stats(sgs, rq);
}
/*
sds->group_imb = sgs.group_imb;
}
+ update_sd_numa_stats(env->sd, sg, sds, local_group, &sgs);
+
sg = sg->next;
} while (sg != env->sd->groups);
}
/* There is no busy sibling group to pull tasks from */
if (!sds.busiest || sds.busiest_nr_running == 0)
- goto out_balanced;
+ goto ret;
sds.avg_load = (SCHED_POWER_SCALE * sds.total_load) / sds.total_pwr;
* don't try and pull any tasks.
*/
if (sds.this_load >= sds.max_load)
- goto out_balanced;
+ goto ret;
/*
* Don't pull any tasks if this group is already above the domain
* average load.
*/
if (sds.this_load >= sds.avg_load)
- goto out_balanced;
+ goto ret;
if (env->idle == CPU_IDLE) {
/*
return sds.busiest;
out_balanced:
+ if (check_numa_busiest_group(env, &sds))
+ return sds.busiest;
+
ret:
env->imbalance = 0;
return NULL;
return 1;
}
+ if (need_active_numa_balance(env))
+ return 1;
+
return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
}
struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);
struct lb_env env = {
- .sd = sd,
- .dst_cpu = this_cpu,
- .dst_rq = this_rq,
- .dst_grpmask = sched_group_cpus(sd->groups),
- .idle = idle,
- .loop_break = sched_nr_migrate_break,
- .cpus = cpus,
+ .sd = sd,
+ .dst_cpu = this_cpu,
+ .dst_rq = this_rq,
+ .dst_grpmask = sched_group_cpus(sd->groups),
+ .idle = idle,
+ .loop_break = sched_nr_migrate_break,
+ .cpus = cpus,
+ .find_busiest_queue = find_busiest_queue,
};
cpumask_copy(cpus, cpu_active_mask);
goto out_balanced;
}
- busiest = find_busiest_queue(&env, group);
+ busiest = env.find_busiest_queue(&env, group);
if (!busiest) {
schedstat_inc(sd, lb_nobusyq[idle]);
goto out_balanced;
}
+ env.src_rq = busiest;
+ env.src_cpu = busiest->cpu;
BUG_ON(busiest == env.dst_rq);
env.src_cpu = busiest->cpu;
env.src_rq = busiest;
env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
+ if (sched_feat_numa(NUMA_PULL))
+ env.tasks = offnode_tasks(busiest);
+ else
+ env.tasks = &busiest->cfs_tasks;
update_h_load(env.src_cpu);
more_balance:
cfs_rq = cfs_rq_of(se);
entity_tick(cfs_rq, se, queued);
}
+
+ if (sched_feat_numa(NUMA))
+ task_tick_numa(rq, curr);
}
/*
*/
SCHED_FEAT(CACHE_HOT_BUDDY, true)
+/*
+ * Allow wakeup-time preemption of the current task:
+ */
+SCHED_FEAT(WAKEUP_PREEMPTION, true)
+
/*
* Use arch dependent cpu power functions
*/
SCHED_FEAT(FORCE_SD_OVERLAP, false)
SCHED_FEAT(RT_RUNTIME_SHARE, true)
SCHED_FEAT(LB_MIN, false)
+
+#ifdef CONFIG_SCHED_NUMA
+SCHED_FEAT(NUMA, true)
+SCHED_FEAT(NUMA_HOT, true)
+SCHED_FEAT(NUMA_TTWU_BIAS, false)
+SCHED_FEAT(NUMA_TTWU_TO, false)
+SCHED_FEAT(NUMA_PULL, true)
+SCHED_FEAT(NUMA_PULL_BIAS, true)
+SCHED_FEAT(NUMA_SETTLE, true)
+#endif
+
#include <linux/mutex.h>
#include <linux/spinlock.h>
#include <linux/stop_machine.h>
+#include <linux/slab.h>
#include "cpupri.h"
struct list_head cfs_tasks;
+#ifdef CONFIG_SCHED_NUMA
+ unsigned long onnode_running;
+ unsigned long offnode_running;
+ unsigned long offnode_weight;
+ struct list_head offnode_tasks;
+#endif
+
u64 rt_avg;
u64 age_stamp;
u64 idle_stamp;
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define raw_rq() (&__raw_get_cpu_var(runqueues))
+#ifdef CONFIG_SCHED_NUMA
+static inline struct list_head *offnode_tasks(struct rq *rq)
+{
+ return &rq->offnode_tasks;
+}
+
+static inline void task_numa_free(struct task_struct *p)
+{
+ kfree(p->numa_faults);
+}
+#else /* CONFIG_SCHED_NUMA */
+static inline struct list_head *offnode_tasks(struct rq *rq)
+{
+ return NULL;
+}
+
+static inline void task_numa_free(struct task_struct *p)
+{
+}
+#endif /* CONFIG_SCHED_NUMA */
+
#ifdef CONFIG_SMP
#define rcu_dereference_check_sched_domain(p) \
DECLARE_PER_CPU(struct sched_domain *, sd_llc);
DECLARE_PER_CPU(int, sd_llc_id);
+DECLARE_PER_CPU(struct sched_domain *, sd_node);
extern int group_balance_cpu(struct sched_group *sg);
#define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
#endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */
+#ifdef CONFIG_SCHED_NUMA
+#define sched_feat_numa(x) sched_feat(x)
+#else
+#define sched_feat_numa(x) (0)
+#endif
+
static inline u64 global_rt_period(void)
{
return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
.extra1 = &min_wakeup_granularity_ns,
.extra2 = &max_wakeup_granularity_ns,
},
+#ifdef CONFIG_SMP
{
.procname = "sched_tunable_scaling",
.data = &sysctl_sched_tunable_scaling,
.extra1 = &zero,
.extra2 = &one,
},
-#endif
+#endif /* CONFIG_SMP */
+#ifdef CONFIG_SCHED_NUMA
+ {
+ .procname = "sched_numa_task_period_min_ms",
+ .data = &sysctl_sched_numa_task_period_min,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "sched_numa_task_period_max_ms",
+ .data = &sysctl_sched_numa_task_period_max,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "sched_numa_settle_count",
+ .data = &sysctl_sched_numa_settle_count,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+#endif /* CONFIG_SCHED_NUMA */
+#endif /* CONFIG_SCHED_DEBUG */
{
.procname = "sched_rt_period_us",
.data = &sysctl_sched_rt_period,
idr.o int_sqrt.o extable.o \
sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
- is_single_threaded.o plist.o decompress.o
+ is_single_threaded.o plist.o decompress.o earlycpio.o
lib-$(CONFIG_MMU) += ioremap.o
lib-$(CONFIG_SMP) += cpumask.o
--- /dev/null
+/* ----------------------------------------------------------------------- *
+ *
+ * Copyright 2012 Intel Corporation; author H. Peter Anvin
+ *
+ * This file is part of the Linux kernel, and is made available
+ * under the terms of the GNU General Public License version 2, as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * ----------------------------------------------------------------------- */
+
+/*
+ * earlycpio.c
+ *
+ * Find a specific cpio member; must precede any compressed content.
+ * This is used to locate data items in the initramfs used by the
+ * kernel itself during early boot (before the main initramfs is
+ * decompressed.) It is the responsibility of the initramfs creator
+ * to ensure that these items are uncompressed at the head of the
+ * blob. Depending on the boot loader or package tool that may be a
+ * separate file or part of the same file.
+ */
+
+#include <linux/earlycpio.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+
+enum cpio_fields {
+ C_MAGIC,
+ C_INO,
+ C_MODE,
+ C_UID,
+ C_GID,
+ C_NLINK,
+ C_MTIME,
+ C_FILESIZE,
+ C_MAJ,
+ C_MIN,
+ C_RMAJ,
+ C_RMIN,
+ C_NAMESIZE,
+ C_CHKSUM,
+ C_NFIELDS
+};
+
+/**
+ * cpio_data find_cpio_data - Search for files in an uncompressed cpio
+ * @path: The directory to search for, including a slash at the end
+ * @data: Pointer to the cpio archive or a header inside
+ * @len: Remaining length of the cpio based on data pointer
+ * @offset: When a matching file is found, this is the offset from the
+ * beginning of @data to the header following the matched file.
+ * It can be used to iterate through the cpio to find all files
+ * inside of a directory path
+ *
+ * @return: struct cpio_data containing the address, length and
+ * filename (with the directory path cut off) of the found file.
+ * If you search for a filename and not for files in a directory,
+ * pass the absolute path of the filename in the cpio and make sure
+ * the match returned an empty filename string.
+ */
+
+struct cpio_data __cpuinit find_cpio_data(const char *path, void *data,
+ size_t len, long *offset)
+{
+ const size_t cpio_header_len = 8*C_NFIELDS - 2;
+ struct cpio_data cd = { NULL, 0, "" };
+ const char *p, *dptr, *nptr;
+ unsigned int ch[C_NFIELDS], *chp, v;
+ unsigned char c, x;
+ size_t mypathsize = strlen(path);
+ int i, j;
+
+ p = data;
+
+ while (len > cpio_header_len) {
+ if (!*p) {
+ /* All cpio headers need to be 4-byte aligned */
+ p += 4;
+ len -= 4;
+ continue;
+ }
+
+ j = 6; /* The magic field is only 6 characters */
+ chp = ch;
+ for (i = C_NFIELDS; i; i--) {
+ v = 0;
+ while (j--) {
+ v <<= 4;
+ c = *p++;
+
+ x = c - '0';
+ if (x < 10) {
+ v += x;
+ continue;
+ }
+
+ x = (c | 0x20) - 'a';
+ if (x < 6) {
+ v += x + 10;
+ continue;
+ }
+
+ goto quit; /* Invalid hexadecimal */
+ }
+ *chp++ = v;
+ j = 8; /* All other fields are 8 characters */
+ }
+
+ if ((ch[C_MAGIC] - 0x070701) > 1)
+ goto quit; /* Invalid magic */
+
+ len -= cpio_header_len;
+
+ dptr = PTR_ALIGN(p + ch[C_NAMESIZE], 4);
+ nptr = PTR_ALIGN(dptr + ch[C_FILESIZE], 4);
+
+ if (nptr > p + len || dptr < p || nptr < dptr)
+ goto quit; /* Buffer overrun */
+
+ if ((ch[C_MODE] & 0170000) == 0100000 &&
+ ch[C_NAMESIZE] >= mypathsize &&
+ !memcmp(p, path, mypathsize)) {
+ *offset = (long)nptr - (long)data;
+ if (ch[C_NAMESIZE] - mypathsize >= MAX_CPIO_FILE_NAME) {
+ pr_warn(
+ "File %s exceeding MAX_CPIO_FILE_NAME [%d]\n",
+ p, MAX_CPIO_FILE_NAME);
+ }
+ strlcpy(cd.name, p + mypathsize, MAX_CPIO_FILE_NAME);
+
+ cd.data = (void *)dptr;
+ cd.size = ch[C_FILESIZE];
+ return cd; /* Found it! */
+ }
+ len -= (nptr - p);
+ p = nptr;
+ }
+
+quit:
+ return cd;
+}
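
The header decoding in find_cpio_data() above can be illustrated outside the kernel. The following is a minimal userspace sketch, not part of the patch: `parse_hex_field()` and `cpio_magic_ok()` are hypothetical helper names that mirror the inner hex loop and the unsigned range check on the magic field.

```c
/*
 * Hypothetical helper (not in the patch): decode `ndigits` ASCII hex
 * characters starting at `p` into *out, the way the inner loop of
 * find_cpio_data() does. Returns 0 on success, -1 on an invalid digit.
 */
static int parse_hex_field(const char *p, int ndigits, unsigned int *out)
{
	unsigned int v = 0;
	unsigned char c, x;

	while (ndigits--) {
		v <<= 4;
		c = (unsigned char)*p++;

		x = c - '0';		/* wraps to a value >= 10 for non-digits */
		if (x < 10) {
			v += x;
			continue;
		}

		x = (c | 0x20) - 'a';	/* fold to lower case, then test a..f */
		if (x < 6) {
			v += x + 10;
			continue;
		}

		return -1;		/* invalid hexadecimal */
	}

	*out = v;
	return 0;
}

/*
 * Hypothetical helper: accept the two "newc" magics 070701 and 070702
 * with a single unsigned comparison. Values below 0x070701 wrap around
 * to a large number and are rejected as well.
 */
static int cpio_magic_ok(unsigned int magic)
{
	return (magic - 0x070701u) <= 1;
}
```

The magic field is 6 characters wide and every other field is 8, which is why the kernel loop starts with `j = 6` and switches to `j = 8` after the first field.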
#include <linux/freezer.h>
#include <linux/mman.h>
#include <linux/pagemap.h>
+#include <linux/migrate.h>
#include <asm/tlb.h>
#include <asm/pgalloc.h>
#include "internal.h"
return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}
+bool pmd_prot_none(struct vm_area_struct *vma, pmd_t pmd)
+{
+ /*
+ * See pte_prot_none().
+ */
+ if (pmd_same(pmd, pmd_modify(pmd, vma->vm_page_prot)))
+ return false;
+
+ return pmd_same(pmd, pmd_modify(pmd, vma_prot_none(vma)));
+}
+
+void do_huge_pmd_prot_none(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ unsigned int flags, pmd_t entry)
+{
+ unsigned long haddr = address & HPAGE_PMD_MASK;
+ struct page *new_page = NULL;
+ struct page *page = NULL;
+ int node, lru;
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, entry)))
+ goto unlock;
+
+ if (unlikely(pmd_trans_splitting(entry))) {
+ spin_unlock(&mm->page_table_lock);
+ wait_split_huge_page(vma->anon_vma, pmd);
+ return;
+ }
+
+ page = pmd_page(entry);
+ if (page) {
+ VM_BUG_ON(!PageCompound(page) || !PageHead(page));
+
+ get_page(page);
+ node = mpol_misplaced(page, vma, haddr);
+ if (node != -1)
+ goto migrate;
+ }
+
+fixup:
+ /* change back to regular protection */
+ entry = pmd_modify(entry, vma->vm_page_prot);
+ set_pmd_at(mm, haddr, pmd, entry);
+ update_mmu_cache_pmd(vma, address, entry);
+
+unlock:
+ spin_unlock(&mm->page_table_lock);
+ if (page) {
+ task_numa_fault(page_to_nid(page), HPAGE_PMD_NR);
+ put_page(page);
+ }
+ return;
+
+migrate:
+ WARN_ON(!(((unsigned long)page->mapping & PAGE_MAPPING_ANON)));
+ WARN_ON((((unsigned long)page->mapping & PAGE_MAPPING_KSM)));
+ BUG_ON(PageSwapCache(page));
+
+ spin_unlock(&mm->page_table_lock);
+
+ lock_page(page);
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, entry))) {
+ spin_unlock(&mm->page_table_lock);
+ unlock_page(page);
+ put_page(page);
+ return;
+ }
+ spin_unlock(&mm->page_table_lock);
+
+ new_page = alloc_pages_node(node,
+ (GFP_TRANSHUGE | GFP_THISNODE) & ~__GFP_WAIT,
+ HPAGE_PMD_ORDER);
+
+ if (!new_page)
+ goto alloc_fail;
+
+ WARN_ON(PageLRU(new_page));
+
+ lru = PageLRU(page);
+
+ if (lru && isolate_lru_page(page)) /* does an implicit get_page() */
+ goto alloc_fail;
+
+ if (!trylock_page(new_page))
+ BUG();
+
+ /* anon mapping, we can simply copy page->mapping to the new page: */
+ new_page->mapping = page->mapping;
+ new_page->index = page->index;
+
+ migrate_page_copy(new_page, page);
+
+ WARN_ON(PageLRU(new_page));
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, entry))) {
+ spin_unlock(&mm->page_table_lock);
+ if (lru)
+ putback_lru_page(page);
+
+ unlock_page(new_page);
+ ClearPageActive(new_page); /* Set by migrate_page_copy() */
+ new_page->mapping = NULL;
+ put_page(new_page); /* Free it */
+
+ unlock_page(page);
+ put_page(page); /* Drop the local reference */
+
+ return;
+ }
+
+ entry = mk_pmd(new_page, vma->vm_page_prot);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ entry = pmd_mkhuge(entry);
+
+ page_add_new_anon_rmap(new_page, vma, haddr);
+
+ set_pmd_at(mm, haddr, pmd, entry);
+ update_mmu_cache_pmd(vma, address, entry);
+ page_remove_rmap(page);
+ spin_unlock(&mm->page_table_lock);
+
+ put_page(page); /* Drop the rmap reference */
+
+ task_numa_fault(node, HPAGE_PMD_NR);
+
+ if (lru)
+ put_page(page); /* drop the LRU isolation reference */
+
+ unlock_page(new_page);
+ unlock_page(page);
+ put_page(page); /* Drop the local reference */
+
+ return;
+
+alloc_fail:
+ if (new_page)
+ put_page(new_page);
+
+ unlock_page(page);
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, entry))) {
+ put_page(page);
+ page = NULL;
+ goto unlock;
+ }
+ goto fixup;
+}
+
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
struct vm_area_struct *vma)
page_tail->mapping = page->mapping;
page_tail->index = page->index + i;
+ page_xchg_last_nid(page_tail, page_last_nid(page));
BUG_ON(!PageAnon(page_tail));
BUG_ON(!PageUptodate(page_tail));
int ret = 0, i;
pgtable_t pgtable;
unsigned long haddr;
+ pgprot_t prot;
spin_lock(&mm->page_table_lock);
pmd = page_check_address_pmd(page, mm, address,
PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
- if (pmd) {
- pgtable = pgtable_trans_huge_withdraw(mm);
- pmd_populate(mm, &_pmd, pgtable);
-
- haddr = address;
- for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
- pte_t *pte, entry;
- BUG_ON(PageCompound(page+i));
- entry = mk_pte(page + i, vma->vm_page_prot);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- if (!pmd_write(*pmd))
- entry = pte_wrprotect(entry);
- else
- BUG_ON(page_mapcount(page) != 1);
- if (!pmd_young(*pmd))
- entry = pte_mkold(entry);
- pte = pte_offset_map(&_pmd, haddr);
- BUG_ON(!pte_none(*pte));
- set_pte_at(mm, haddr, pte, entry);
- pte_unmap(pte);
- }
+ if (!pmd)
+ goto unlock;
- smp_wmb(); /* make pte visible before pmd */
- /*
- * Up to this point the pmd is present and huge and
- * userland has the whole access to the hugepage
- * during the split (which happens in place). If we
- * overwrite the pmd with the not-huge version
- * pointing to the pte here (which of course we could
- * if all CPUs were bug free), userland could trigger
- * a small page size TLB miss on the small sized TLB
- * while the hugepage TLB entry is still established
- * in the huge TLB. Some CPU doesn't like that. See
- * http://support.amd.com/us/Processor_TechDocs/41322.pdf,
- * Erratum 383 on page 93. Intel should be safe but is
- * also warns that it's only safe if the permission
- * and cache attributes of the two entries loaded in
- * the two TLB is identical (which should be the case
- * here). But it is generally safer to never allow
- * small and huge TLB entries for the same virtual
- * address to be loaded simultaneously. So instead of
- * doing "pmd_populate(); flush_tlb_range();" we first
- * mark the current pmd notpresent (atomically because
- * here the pmd_trans_huge and pmd_trans_splitting
- * must remain set at all times on the pmd until the
- * split is complete for this pmd), then we flush the
- * SMP TLB and finally we write the non-huge version
- * of the pmd entry with pmd_populate.
- */
- pmdp_invalidate(vma, address, pmd);
- pmd_populate(mm, pmd, pgtable);
- ret = 1;
+ prot = pmd_pgprot(*pmd);
+ pgtable = pgtable_trans_huge_withdraw(mm);
+ pmd_populate(mm, &_pmd, pgtable);
+
+ for (i = 0, haddr = address; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+ pte_t *pte, entry;
+
+ BUG_ON(PageCompound(page+i));
+ entry = mk_pte(page + i, prot);
+ entry = pte_mkdirty(entry);
+ if (!pmd_young(*pmd))
+ entry = pte_mkold(entry);
+ pte = pte_offset_map(&_pmd, haddr);
+ BUG_ON(!pte_none(*pte));
+ set_pte_at(mm, haddr, pte, entry);
+ pte_unmap(pte);
}
+
+ smp_wmb(); /* make ptes visible before pmd, see __pte_alloc */
+ /*
+ * Up to this point the pmd is present and huge.
+ *
+ * If we overwrite the pmd with the not-huge version, we could trigger
+ * a small page size TLB miss on the small sized TLB while the hugepage
+ * TLB entry is still established in the huge TLB.
+ *
+ * Some CPUs don't like that. See
+ * http://support.amd.com/us/Processor_TechDocs/41322.pdf, Erratum 383
+ * on page 93.
+ *
+ * Thus it is generally safer to never allow small and huge TLB entries
+ * for overlapping virtual addresses to be loaded. So we first mark the
+ * current pmd not present, then we flush the TLB and finally we write
+ * the non-huge version of the pmd entry with pmd_populate.
+ *
+ * The above needs to be done under the ptl because pmd_trans_huge and
+ * pmd_trans_splitting must remain set on the pmd until the split is
+ * complete. The ptl also protects against concurrent faults due to
+ * making the pmd not-present.
+ */
+ set_pmd_at(mm, address, pmd, pmd_mknotpresent(*pmd));
+ flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+ pmd_populate(mm, pmd, pgtable);
+ ret = 1;
+
+unlock:
spin_unlock(&mm->page_table_lock);
return ret;
{
struct page *hpage = NULL;
unsigned int progress = 0, pass_through_head = 0;
- unsigned int pages = khugepaged_pages_to_scan;
bool wait = true;
-
- barrier(); /* write khugepaged_pages_to_scan to local stack */
+ unsigned int pages = ACCESS_ONCE(khugepaged_pages_to_scan);
while (progress < pages) {
if (!khugepaged_prealloc_page(&hpage, &wait))
* (Gerhard.Wichert@pdb.siemens.de)
*
* Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
+ *
+ * 2012 - NUMA placement page faults (Andrea Arcangeli, Peter Zijlstra)
*/
#include <linux/kernel_stat.h>
#include <linux/swapops.h>
#include <linux/elf.h>
#include <linux/gfp.h>
+#include <linux/migrate.h>
#include <asm/io.h>
#include <asm/pgalloc.h>
#include "internal.h"
+#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
+#warning Unfortunate NUMA config, growing page-frame for last_nid.
+#endif
+
#ifndef CONFIG_NEED_MULTIPLE_NODES
/* use the per-pgdat data instead for discontigmem - mbligh */
unsigned long max_mapnr;
return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
}
+static bool pte_prot_none(struct vm_area_struct *vma, pte_t pte)
+{
+ /*
+ * If we have the normal vma->vm_page_prot protections we're not a
+ * 'special' PROT_NONE page.
+ *
+ * This means we cannot get 'special' PROT_NONE faults from genuine
+ * PROT_NONE maps, nor from PROT_WRITE file maps that do dirty
+ * tracking.
+ *
+ * Neither case is really interesting for our current use though so we
+ * don't care.
+ */
+ if (pte_same(pte, pte_modify(pte, vma->vm_page_prot)))
+ return false;
+
+ return pte_same(pte, pte_modify(pte, vma_prot_none(vma)));
+}
+
+static int do_prot_none(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pte_t *ptep, pmd_t *pmd,
+ unsigned int flags, pte_t entry)
+{
+ struct page *page = NULL;
+ int node, page_nid = -1;
+ spinlock_t *ptl;
+
+ ptl = pte_lockptr(mm, pmd);
+ spin_lock(ptl);
+ if (unlikely(!pte_same(*ptep, entry)))
+ goto unlock;
+
+ page = vm_normal_page(vma, address, entry);
+ if (page) {
+ get_page(page);
+ page_nid = page_to_nid(page);
+ node = mpol_misplaced(page, vma, address);
+ if (node != -1)
+ goto migrate;
+ }
+
+fixup:
+ flush_cache_page(vma, address, pte_pfn(entry));
+
+ ptep_modify_prot_start(mm, address, ptep);
+ entry = pte_modify(entry, vma->vm_page_prot);
+ ptep_modify_prot_commit(mm, address, ptep, entry);
+
+ update_mmu_cache(vma, address, ptep);
+
+unlock:
+ pte_unmap_unlock(ptep, ptl);
+out:
+ if (page) {
+ task_numa_fault(page_nid, 1);
+ put_page(page);
+ }
+
+ return 0;
+
+migrate:
+ pte_unmap_unlock(ptep, ptl);
+
+ if (!migrate_misplaced_page(page, node)) {
+ page_nid = node;
+ goto out;
+ }
+
+ ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!pte_same(*ptep, entry)) {
+ put_page(page);
+ page = NULL;
+ goto unlock;
+ }
+
+ goto fixup;
+}
+
/*
* These routines also need to handle stuff like marking pages dirty
* and/or accessed for architectures that don't do it in hardware (most
pte, pmd, flags, entry);
}
+ if (pte_prot_none(vma, entry))
+ return do_prot_none(mm, vma, address, pte, pmd, flags, entry);
+
ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
if (unlikely(!pte_same(*pte, entry)))
pmd, flags);
} else {
pmd_t orig_pmd = *pmd;
- int ret;
+ int ret = 0;
barrier();
- if (pmd_trans_huge(orig_pmd)) {
- if (flags & FAULT_FLAG_WRITE &&
- !pmd_write(orig_pmd) &&
- !pmd_trans_splitting(orig_pmd)) {
+ if (pmd_trans_huge(orig_pmd) && !pmd_trans_splitting(orig_pmd)) {
+ if (pmd_prot_none(vma, orig_pmd)) {
+ do_huge_pmd_prot_none(mm, vma, address, pmd,
+ flags, orig_pmd);
+ }
+
+ if ((flags & FAULT_FLAG_WRITE) && !pmd_write(orig_pmd)) {
ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
orig_pmd);
/*
*/
if (unlikely(ret & VM_FAULT_OOM))
goto retry;
- return ret;
}
- return 0;
+
+ return ret;
}
}
+
/*
* Use __pte_alloc instead of pte_alloc_map, because we can't
* run pte_offset_map on the pmd, if an huge pmd could
.flags = MPOL_F_LOCAL,
};
+static struct mempolicy preferred_node_policy[MAX_NUMNODES];
+
+static struct mempolicy *get_task_policy(struct task_struct *p)
+{
+ struct mempolicy *pol = p->mempolicy;
+ int node;
+
+ if (!pol) {
+ node = tsk_home_node(p);
+ if (node != -1)
+ pol = &preferred_node_policy[node];
+ }
+
+ return pol;
+}
+
static const struct mempolicy_operations {
int (*create)(struct mempolicy *pol, const nodemask_t *nodes);
/*
pr_debug("setting mode %d flags %d nodes[0] %lx\n",
mode, flags, nodes ? nodes_addr(*nodes)[0] : -1);
- if (mode == MPOL_DEFAULT) {
+ if (mode == MPOL_DEFAULT || mode == MPOL_NOOP) {
if (nodes && !nodes_empty(*nodes))
return ERR_PTR(-EINVAL);
- return NULL; /* simply delete any existing policy */
+ return NULL;
}
VM_BUG_ON(!nodes);
(flags & MPOL_F_RELATIVE_NODES)))
return ERR_PTR(-EINVAL);
}
+ } else if (mode == MPOL_LOCAL) {
+ if (!nodes_empty(*nodes))
+ return ERR_PTR(-EINVAL);
+ mode = MPOL_PREFERRED;
} else if (nodes_empty(*nodes))
return ERR_PTR(-EINVAL);
policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
return 0;
}
+static void
+change_prot_none(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+{
+ change_protection(vma, start, end, vma_prot_none(vma), 0);
+}
+
/*
* Check if all pages in a range are on a set of nodes.
* If pagelist != NULL then isolate pages from the LRU and
return ERR_PTR(-EFAULT);
prev = NULL;
for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
+ unsigned long endvma = vma->vm_end;
+
+ if (endvma > end)
+ endvma = end;
+ if (vma->vm_start > start)
+ start = vma->vm_start;
+
if (!(flags & MPOL_MF_DISCONTIG_OK)) {
if (!vma->vm_next && vma->vm_end < end)
return ERR_PTR(-EFAULT);
if (prev && prev->vm_end < vma->vm_start)
return ERR_PTR(-EFAULT);
}
- if (!is_vm_hugetlb_page(vma) &&
- ((flags & MPOL_MF_STRICT) ||
+
+ if (is_vm_hugetlb_page(vma))
+ goto next;
+
+ if (flags & MPOL_MF_LAZY) {
+ change_prot_none(vma, start, endvma);
+ goto next;
+ }
+
+ if ((flags & MPOL_MF_STRICT) ||
((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
- vma_migratable(vma)))) {
- unsigned long endvma = vma->vm_end;
+ vma_migratable(vma))) {
- if (endvma > end)
- endvma = end;
- if (vma->vm_start > start)
- start = vma->vm_start;
err = check_pgd_range(vma, start, endvma, nodes,
flags, private);
if (err) {
break;
}
}
+next:
prev = vma;
}
return first;
int err;
LIST_HEAD(pagelist);
- if (flags & ~(unsigned long)(MPOL_MF_STRICT |
- MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+ if (flags & ~(unsigned long)MPOL_MF_VALID)
return -EINVAL;
if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
return -EPERM;
if (start & ~PAGE_MASK)
return -EINVAL;
- if (mode == MPOL_DEFAULT)
+ if (mode == MPOL_DEFAULT || mode == MPOL_NOOP)
flags &= ~MPOL_MF_STRICT;
len = (len + PAGE_SIZE - 1) & PAGE_MASK;
if (IS_ERR(new))
return PTR_ERR(new);
+ if (flags & MPOL_MF_LAZY)
+ new->flags |= MPOL_F_MOF;
+
/*
* If we are using the default policy then operation
* on discontinuous address spaces is okay after all
vma = check_range(mm, start, end, nmask,
flags | MPOL_MF_INVERT, &pagelist);
- err = PTR_ERR(vma);
- if (!IS_ERR(vma)) {
- int nr_failed = 0;
-
+ err = PTR_ERR(vma); /* maybe ... */
+ if (!IS_ERR(vma) && mode != MPOL_NOOP)
err = mbind_range(mm, start, end, new);
+ if (!err) {
+ int nr_failed = 0;
+
if (!list_empty(&pagelist)) {
+ WARN_ON_ONCE(flags & MPOL_MF_LAZY);
nr_failed = migrate_pages(&pagelist, new_vma_page,
- (unsigned long)vma,
- false, MIGRATE_SYNC);
+ (unsigned long)vma,
+ false, MIGRATE_SYNC);
if (nr_failed)
putback_lru_pages(&pagelist);
}
- if (!err && nr_failed && (flags & MPOL_MF_STRICT))
+ if (nr_failed && (flags & MPOL_MF_STRICT))
err = -EIO;
} else
putback_lru_pages(&pagelist);
return err;
}
+static void lazy_migrate_vma(struct vm_area_struct *vma)
+{
+ if (!vma_migratable(vma))
+ return;
+
+ change_prot_none(vma, vma->vm_start, vma->vm_end);
+}
+
+void lazy_migrate_process(struct mm_struct *mm)
+{
+ struct vm_area_struct *vma;
+
+ down_read(&mm->mmap_sem);
+ for (vma = mm->mmap; vma; vma = vma->vm_next)
+ lazy_migrate_vma(vma);
+ up_read(&mm->mmap_sem);
+}
+
/*
* User space interface with variable sized bitmaps for nodelists.
*/
struct mempolicy *get_vma_policy(struct task_struct *task,
struct vm_area_struct *vma, unsigned long addr)
{
- struct mempolicy *pol = task->mempolicy;
+ struct mempolicy *pol = get_task_policy(task);
if (vma) {
if (vma->vm_ops && vma->vm_ops->get_policy) {
return NULL;
}
+/* Do dynamic interleaving for a process */
+static unsigned interleave_nodes(struct mempolicy *policy)
+{
+ unsigned nid, next;
+ struct task_struct *me = current;
+
+ nid = me->il_next;
+ next = next_node(nid, policy->v.nodes);
+ if (next >= MAX_NUMNODES)
+ next = first_node(policy->v.nodes);
+ if (next < MAX_NUMNODES)
+ me->il_next = next;
+ return nid;
+}
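
The round-robin walk in interleave_nodes() above can be modelled in userspace. This is a sketch under stated assumptions: the nodemask is replaced by a plain `unsigned` bitmask, `MAX_NODES` stands in for MAX_NUMNODES, and all `*_model` names are hypothetical.

```c
#define MAX_NODES 8	/* hypothetical small stand-in for MAX_NUMNODES */

/* Next set bit strictly after nid, or MAX_NODES if there is none. */
static unsigned next_node_model(unsigned nid, unsigned mask)
{
	unsigned n;

	for (n = nid + 1; n < MAX_NODES; n++)
		if (mask & (1u << n))
			return n;
	return MAX_NODES;
}

/* Lowest set bit, or MAX_NODES if the mask is empty. */
static unsigned first_node_model(unsigned mask)
{
	unsigned n;

	for (n = 0; n < MAX_NODES; n++)
		if (mask & (1u << n))
			return n;
	return MAX_NODES;
}

/*
 * Same shape as interleave_nodes(): return the node stored in *il_next
 * and advance *il_next to the following allowed node, wrapping around.
 */
static unsigned interleave_model(unsigned *il_next, unsigned mask)
{
	unsigned nid = *il_next;
	unsigned next = next_node_model(nid, mask);

	if (next >= MAX_NODES)
		next = first_node_model(mask);
	if (next < MAX_NODES)
		*il_next = next;
	return nid;
}
```

With nodes 0, 1 and 3 allowed (mask 0xb) and `il_next` starting at 0, successive calls hand out 0, 1, 3 and then wrap back to 0.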
+
/* Return a zonelist indicated by gfp for node representing a mempolicy */
static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
int nd)
{
switch (policy->mode) {
+ case MPOL_INTERLEAVE:
+ nd = interleave_nodes(policy);
+ break;
case MPOL_PREFERRED:
if (!(policy->flags & MPOL_F_LOCAL))
nd = policy->v.preferred_node;
return node_zonelist(nd, gfp);
}
-/* Do dynamic interleaving for a process */
-static unsigned interleave_nodes(struct mempolicy *policy)
-{
- unsigned nid, next;
- struct task_struct *me = current;
-
- nid = me->il_next;
- next = next_node(nid, policy->v.nodes);
- if (next >= MAX_NUMNODES)
- next = first_node(policy->v.nodes);
- if (next < MAX_NUMNODES)
- me->il_next = next;
- return nid;
-}
-
/*
* Depending on the memory policy provide a node from which to allocate the
* next slab entry.
return ret;
}
-/* Allocate a page in interleaved policy.
- Own path because it needs to do special accounting. */
-static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
- unsigned nid)
-{
- struct zonelist *zl;
- struct page *page;
-
- zl = node_zonelist(nid, gfp);
- page = __alloc_pages(gfp, order, zl);
- if (page && page_zone(page) == zonelist_zone(&zl->_zonerefs[0]))
- inc_zone_page_state(page, NUMA_INTERLEAVE_HIT);
- return page;
-}
-
/**
* alloc_pages_vma - Allocate a page for a VMA.
*
pol = get_vma_policy(current, vma, addr);
cpuset_mems_cookie = get_mems_allowed();
- if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
- unsigned nid;
-
- nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
- mpol_cond_put(pol);
- page = alloc_page_interleave(gfp, order, nid);
- if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
- goto retry_cpuset;
-
- return page;
- }
zl = policy_zonelist(gfp, pol, node);
if (unlikely(mpol_needs_cond_ref(pol))) {
/*
*/
struct page *alloc_pages_current(gfp_t gfp, unsigned order)
{
- struct mempolicy *pol = current->mempolicy;
+ struct mempolicy *pol = get_task_policy(current);
struct page *page;
unsigned int cpuset_mems_cookie;
* No reference counting needed for current->mempolicy
* nor system default_policy
*/
- if (pol->mode == MPOL_INTERLEAVE)
- page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
- else
- page = __alloc_pages_nodemask(gfp, order,
- policy_zonelist(gfp, pol, numa_node_id()),
- policy_nodemask(gfp, pol));
+ page = __alloc_pages_nodemask(gfp, order,
+ policy_zonelist(gfp, pol, numa_node_id()),
+ policy_nodemask(gfp, pol));
if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
goto retry_cpuset;
kmem_cache_free(sn_cache, n);
}
+/**
+ * mpol_misplaced - check whether current page node is valid in policy
+ *
+ * @page - page to be checked
+ * @vma - vm area where page mapped
+ * @addr - virtual address where page mapped
+ *
+ * Look up current policy node id for vma,addr and "compare to" page's
+ * node id.
+ *
+ * Returns:
+ * -1 - not misplaced, page is in the right node
+ * node - node id where the page should be
+ *
+ * Policy determination "mimics" alloc_page_vma().
+ * Called from fault path where we know the vma and faulting address.
+ */
+int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ struct mempolicy *pol;
+ struct zone *zone;
+ int curnid = page_to_nid(page);
+ unsigned long pgoff;
+ int polnid = -1;
+ int ret = -1;
+
+ BUG_ON(!vma);
+
+ pol = get_vma_policy(current, vma, addr);
+ if (!(pol->flags & MPOL_F_MOF))
+ goto out;
+
+ switch (pol->mode) {
+ case MPOL_INTERLEAVE:
+ BUG_ON(addr >= vma->vm_end);
+ BUG_ON(addr < vma->vm_start);
+
+ pgoff = vma->vm_pgoff;
+ pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
+ polnid = offset_il_node(pol, vma, pgoff);
+ break;
+
+ case MPOL_PREFERRED:
+ if (pol->flags & MPOL_F_LOCAL)
+ polnid = numa_node_id();
+ else
+ polnid = pol->v.preferred_node;
+ break;
+
+ case MPOL_BIND:
+ /*
+ * allows binding to multiple nodes.
+ * use current page if in policy nodemask,
+ * else select nearest allowed node, if any.
+ * If no allowed nodes, use current [!misplaced].
+ */
+ if (node_isset(curnid, pol->v.nodes))
+ goto out;
+ (void)first_zones_zonelist(
+ node_zonelist(numa_node_id(), GFP_HIGHUSER),
+ gfp_zone(GFP_HIGHUSER),
+ &pol->v.nodes, &zone);
+ polnid = zone->node;
+ break;
+
+ default:
+ BUG();
+ }
+
+ /*
+ * Multi-stage node selection is used in conjunction with a periodic
+ * migration fault to build a temporal task<->page relation. By
+ * using a two-stage filter we remove short/unlikely relations.
+ *
+ * Using P(p) ~ n_p / n_t as per frequentist probability, we can
+ * equate a task's usage of a particular page (n_p) per total usage
+ * of this page (n_t) (in a given time-span) to a probability.
+ *
+ * Our periodic faults will then sample this probability, and the
+ * chance of getting the same result twice in a row, given these
+ * samples are fully independent, is then given by P(p)^2, provided
+ * our sample period is sufficiently short compared to the usage
+ * pattern.
+ *
+ * This quadratic squishes small probabilities, making it less likely
+ * we act on an unlikely task<->page relation.
+ */
+ if (pol->flags & MPOL_F_HOME) {
+ int last_nid;
+
+ /*
+ * Migrate towards the current node, depends on
+ * task_numa_placement() details.
+ */
+ polnid = numa_node_id();
+ last_nid = page_xchg_last_nid(page, polnid);
+ if (last_nid != polnid)
+ goto out;
+ }
+
+ if (curnid != polnid)
+ ret = polnid;
+out:
+ mpol_cond_put(pol);
+
+ return ret;
+}
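
The two-stage filter used for MPOL_F_HOME in mpol_misplaced() above can be modelled in a few lines. This is a userspace sketch only: `struct page_model`, `xchg_last_nid_model()` and `passes_two_stage_filter()` are hypothetical names, and the real page_xchg_last_nid() operates on struct page rather than on a plain int.

```c
/* Userspace stand-in for the per-page last_nid state. */
struct page_model {
	int last_nid;	/* node of the previous sampled fault, -1 if none */
};

/* Store the new node id and return the previous one. */
static int xchg_last_nid_model(struct page_model *page, int nid)
{
	int old = page->last_nid;

	page->last_nid = nid;
	return old;
}

/*
 * Returns 1 only when the previous sampled fault on this page came
 * from the same node as the current one, i.e. the same result was
 * seen twice in a row. A single stray access never passes.
 */
static int passes_two_stage_filter(struct page_model *page, int this_nid)
{
	return xchg_last_nid_model(page, this_nid) == this_nid;
}
```

If a task accounts for a fraction P of the accesses to a page, roughly P^2 of its sampled faults pass the filter: P = 0.9 gives about 0.81, while P = 0.1 gives about 0.01, which is the "squishing" of small probabilities the comment describes.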
+
static void sp_delete(struct shared_policy *sp, struct sp_node *n)
{
pr_debug("deleting %lx-l%lx\n", n->start, n->end);
sizeof(struct sp_node),
0, SLAB_PANIC, NULL);
+ for_each_node(nid) {
+ preferred_node_policy[nid] = (struct mempolicy) {
+ .refcnt = ATOMIC_INIT(1),
+ .mode = MPOL_PREFERRED,
+ .flags = MPOL_F_MOF | MPOL_F_HOME,
+ .v = { .preferred_node = nid, },
+ };
+ }
+
/*
* Set interleaving policy for system init. Interleaving is only
* enabled across suitably sized nodes (default is >= 16MB), or
* "local" is pseudo-policy: MPOL_PREFERRED with MPOL_F_LOCAL flag
* Used only for mpol_parse_str() and mpol_to_str()
*/
-#define MPOL_LOCAL MPOL_MAX
static const char * const policy_modes[] =
{
[MPOL_DEFAULT] = "default",
[MPOL_PREFERRED] = "prefer",
[MPOL_BIND] = "bind",
[MPOL_INTERLEAVE] = "interleave",
- [MPOL_LOCAL] = "local"
+ [MPOL_LOCAL] = "local",
+ [MPOL_NOOP] = "noop", /* should not actually be used */
};
if (flags)
*flags++ = '\0'; /* terminate mode string */
- for (mode = 0; mode <= MPOL_LOCAL; mode++) {
+ for (mode = 0; mode < MPOL_MAX; mode++) {
if (!strcmp(str, policy_modes[mode])) {
break;
}
}
- if (mode > MPOL_LOCAL)
+ if (mode >= MPOL_MAX || mode == MPOL_NOOP)
goto out;
switch (mode) {
struct buffer_head *bh = head;
/* Simple case, sync compaction */
- if (mode != MIGRATE_ASYNC) {
+ if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT) {
do {
get_bh(bh);
lock_buffer(bh);
struct page *newpage, struct page *page,
struct buffer_head *head, enum migrate_mode mode)
{
- int expected_count;
+ int expected_count = 0;
void **pslot;
+ if (mode == MIGRATE_FAULT) {
+ /*
+ * MIGRATE_FAULT has an extra reference on the page and
+ * otherwise acts like ASYNC, no point in delaying the
+ * fault, we'll try again next time.
+ */
+ expected_count++;
+ }
+
if (!mapping) {
/* Anonymous page without mapping */
- if (page_count(page) != 1)
+ expected_count += 1;
+ if (page_count(page) != expected_count)
return -EAGAIN;
return 0;
}
pslot = radix_tree_lookup_slot(&mapping->page_tree,
page_index(page));
- expected_count = 2 + page_has_private(page);
+ expected_count += 2 + page_has_private(page);
if (page_count(page) != expected_count ||
radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) {
spin_unlock_irq(&mapping->tree_lock);
* the mapping back due to an elevated page count, we would have to
* block waiting on other references to be dropped.
*/
- if (mode == MIGRATE_ASYNC && head &&
+ if ((mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT) && head &&
!buffer_migrate_lock_buffers(head, mode)) {
page_unfreeze_refs(page, expected_count);
spin_unlock_irq(&mapping->tree_lock);
*/
void migrate_page_copy(struct page *newpage, struct page *page)
{
- if (PageHuge(page))
+ if (PageHuge(page) || PageTransHuge(page))
copy_huge_page(newpage, page);
else
copy_highpage(newpage, page);
* with an IRQ-safe spinlock held. In the sync case, the buffers
* need to be locked now
*/
- if (mode != MIGRATE_ASYNC)
+ if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT)
BUG_ON(!buffer_migrate_lock_buffers(head, mode));
ClearPagePrivate(page);
struct anon_vma *anon_vma = NULL;
if (!trylock_page(page)) {
- if (!force || mode == MIGRATE_ASYNC)
+ if (!force || mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT)
goto out;
/*
}
return err;
}
-#endif
+
+/*
+ * Attempt to migrate a misplaced page to the specified destination
+ * node.
+ */
+int migrate_misplaced_page(struct page *page, int node)
+{
+ struct address_space *mapping = page_mapping(page);
+ int page_lru = page_is_file_cache(page);
+ struct page *newpage;
+ int ret = -EAGAIN;
+ gfp_t gfp = GFP_HIGHUSER_MOVABLE;
+
+ /*
+ * Don't migrate pages that are mapped in multiple processes.
+ */
+ if (page_mapcount(page) != 1)
+ goto out;
+
+ /*
+ * Never wait for allocations just to migrate on fault, but don't dip
+ * into reserves. And, only accept pages from the specified node. No
+ * sense migrating to a different "misplaced" page!
+ */
+ if (mapping)
+ gfp = mapping_gfp_mask(mapping);
+ gfp &= ~__GFP_WAIT;
+ gfp |= __GFP_NOMEMALLOC | GFP_THISNODE;
+
+ newpage = alloc_pages_node(node, gfp, 0);
+ if (!newpage) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ if (isolate_lru_page(page)) {
+ ret = -EBUSY;
+ goto put_new;
+ }
+
+ inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+ ret = __unmap_and_move(page, newpage, 0, 0, MIGRATE_FAULT);
+ /*
+ * A page that has been migrated has all references removed and will be
+ * freed. A page that has not been migrated will have kept its
+ * references and be restored.
+ */
+ dec_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+ putback_lru_page(page);
+put_new:
+ /*
+ * Move the new page to the LRU. If migration was not successful
+ * then this will free the page.
+ */
+ putback_lru_page(newpage);
+out:
+ return ret;
+}
+
+#endif /* CONFIG_NUMA */
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
-#ifndef pgprot_modify
-static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
-{
- return newprot;
-}
-#endif
-
static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable)
} while (pud++, addr = next, addr != end);
}
-static void change_protection(struct vm_area_struct *vma,
+static void change_protection_range(struct vm_area_struct *vma,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable)
{
flush_tlb_range(vma, start, end);
}
+void change_protection(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, pgprot_t newprot,
+ int dirty_accountable)
+{
+ struct mm_struct *mm = vma->vm_mm;
+
+ mmu_notifier_invalidate_range_start(mm, start, end);
+ if (is_vm_hugetlb_page(vma))
+ hugetlb_change_protection(vma, start, end, newprot);
+ else
+ change_protection_range(vma, start, end, newprot, dirty_accountable);
+ mmu_notifier_invalidate_range_end(mm, start, end);
+}
+
int
mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
unsigned long start, unsigned long end, unsigned long newflags)
dirty_accountable = 1;
}
- mmu_notifier_invalidate_range_start(mm, start, end);
- if (is_vm_hugetlb_page(vma))
- hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
- else
- change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
- mmu_notifier_invalidate_range_end(mm, start, end);
+ change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
+
vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
vm_stat_account(mm, newflags, vma->vm_file, nrpages);
perf_event_mmap(vma);
{
pte_t pte;
pte = ptep_get_and_clear((vma)->vm_mm, address, ptep);
- flush_tlb_page(vma, address);
+ if (pte_accessible(pte))
+ flush_tlb_page(vma, address);
return pte;
}
#endif
"numa_hit",
"numa_miss",
"numa_foreign",
- "numa_interleave",
"numa_local",
"numa_other",
#endif
# These targets are used from top-level makefile
PHONY += oldconfig xconfig gconfig menuconfig config silentoldconfig update-po-config \
- localmodconfig localyesconfig
+ localmodconfig localyesconfig kvmconfig
ifdef KBUILD_KCONFIG
Kconfig := $(KBUILD_KCONFIG)
$(Q)mkdir -p include/generated
$< --$@ $(Kconfig)
+kvmconfig:
+ $(Q)$(CONFIG_SHELL) $(srctree)/scripts/config -e KVMTOOL_TEST_ENABLE
+ $(Q)yes "" | $(MAKE) oldconfig > /dev/null
+ @echo 'Kernel configuration modified to run as KVM guest.'
+
localyesconfig localmodconfig: $(obj)/streamline_config.pl $(obj)/conf
$(Q)mkdir -p include/generated
$(Q)perl $< --$@ $(srctree) $(Kconfig) > .tmp.config
--- /dev/null
+/lkvm
+/vm
+*.o
+*.d
+.cscope
+tags
+include/common-cmds.h
+tests/boot/boot_test.iso
+tests/boot/rootfs/
+guest/init
+guest/init_stage2
+KVMTOOLS-VERSION-FILE
--- /dev/null
+Most of the infrastructure that 'perf' uses here has been reused
+from the Git project, as of version:
+
+ 66996ec: Sync with 1.6.2.4
+
+Here is an (incomplete!) list of main contributors to those files
+in util/* and elsewhere:
+
+ Alex Riesen
+ Christian Couder
+ Dmitry Potapov
+ Jeff King
+ Johannes Schindelin
+ Johannes Sixt
+ Junio C Hamano
+ Linus Torvalds
+ Matthias Kestenholz
+ Michal Ostrowski
+ Miklos Vajna
+ Petr Baudis
+ Pierre Habouzit
+ René Scharfe
+ Samuel Tardieu
+ Shawn O. Pearce
+ Steffen Prohaska
+ Steve Haslam
+
+Thanks guys!
+
+The full history of the files can be found in the upstream Git commits.
--- /dev/null
+This document explains how to debug a guest's kernel using KGDB.
+
+1. Run the guest:
+ 'lkvm run -k [vmlinuz] -p "kgdboc=ttyS1 kgdbwait" --tty 1'
+
+And see which PTY got assigned to ttyS1 (you'll see:
+' Info: Assigned terminal 1 to pty /dev/pts/X').
+
+2. Run GDB on the host:
+ 'gdb [vmlinuz]'
+
+3. Connect to the guest (from within GDB):
+ 'target remote /dev/pts/X'
+
+4. Start debugging! (enter 'continue' to continue boot).
--- /dev/null
+lkvm-balloon(1)
+================
+
+NAME
+----
+lkvm-balloon - Inflate or deflate the virtio balloon
+
+SYNOPSIS
+--------
+[verse]
+'lkvm balloon [command] [size] [instance]'
+
+DESCRIPTION
+-----------
+The command inflates or deflates the virtio balloon located in the
+specified instance.
+For a list of running instances see 'lkvm list'.
+
+Command can be either 'inflate' or 'deflate'. Inflate increases the
+size of the balloon, thus decreasing the amount of virtual RAM available
+for the guest. Deflation returns previously inflated memory back to the
+guest.
+
+Size is specified in MB.
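As a rough illustration of the inflate/deflate semantics described above, the request can be modeled as one signed amount: positive to inflate, negative to deflate. That matches how kvm_cmd_balloon() in this series packs the value before sending KVM_IPC_BALLOON. The helper names `balloon_request_amount` and `guest_ram_after` are hypothetical and not part of lkvm, and the clamping behaviour is an assumption made for the sketch, not something the tool is known to do.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of lkvm: fold the inflate/deflate options
 * into one signed amount. Positive inflates, negative deflates, and 0 means
 * nothing was requested. Inflate wins if both are given, mirroring the
 * if/else-if order in kvm_cmd_balloon().
 */
int balloon_request_amount(uint64_t inflate, uint64_t deflate)
{
	if (inflate)
		return (int)inflate;
	if (deflate)
		return -(int)deflate;
	return 0;
}

/*
 * Hypothetical model of the guest-visible effect. The balloon is clamped so
 * it never shrinks below zero or grows past the guest's total RAM (an
 * assumption of this sketch).
 */
uint64_t guest_ram_after(uint64_t total, uint64_t ballooned, int amount)
{
	int64_t balloon;

	assert(ballooned <= total);

	balloon = (int64_t)ballooned + amount;
	if (balloon < 0)
		balloon = 0;
	if ((uint64_t)balloon > total)
		balloon = (int64_t)total;

	return total - (uint64_t)balloon;
}
```

Under this model, inflating a 512 MB guest by 64 leaves it 448 MB, and a later deflate by 32 brings it back up to 480 MB.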
--- /dev/null
+lkvm-debug(1)
+================
+
+NAME
+----
+lkvm-debug - Print debug information from a running instance
+
+SYNOPSIS
+--------
+[verse]
+'lkvm debug [instance]'
+
+DESCRIPTION
+-----------
+The command prints debug information from a running instance.
+For a list of running instances see 'lkvm list'.
--- /dev/null
+lkvm-list(1)
+================
+
+NAME
+----
+lkvm-list - Print a list of running instances on the host.
+
+SYNOPSIS
+--------
+[verse]
+'lkvm list'
+
+DESCRIPTION
+-----------
+This command prints a list of running instances on the host which
+belong to the user running 'lkvm list'.
--- /dev/null
+lkvm-pause(1)
+================
+
+NAME
+----
+lkvm-pause - Pause the virtual machine
+
+SYNOPSIS
+--------
+[verse]
+'lkvm pause [instance]'
+
+DESCRIPTION
+-----------
+The command pauses a virtual machine.
+For a list of running instances see 'lkvm list'.
--- /dev/null
+lkvm-resume(1)
+================
+
+NAME
+----
+lkvm-resume - Resume the virtual machine
+
+SYNOPSIS
+--------
+[verse]
+'lkvm resume [instance]'
+
+DESCRIPTION
+-----------
+The command resumes a virtual machine.
+For a list of running instances see 'lkvm list'.
--- /dev/null
+lkvm-run(1)
+================
+
+NAME
+----
+lkvm-run - Start the virtual machine
+
+SYNOPSIS
+--------
+[verse]
+'lkvm run' [-k <kernel image> | --kernel <kernel image>]
+
+DESCRIPTION
+-----------
+The command starts a virtual machine.
+
+OPTIONS
+-------
+-m::
+--mem=::
+ Virtual machine memory size in MiB.
+
+-p::
+--params::
+ Additional kernel command line arguments.
+
+-r::
+--initrd=::
+ Initial RAM disk image.
+
+-k::
+--kernel=::
+ The virtual machine kernel.
+
+--dev=::
+ KVM device file.
+
+-i::
+--image=::
+ A disk image file.
+
+-s::
+--single-step::
+ Enable single stepping.
+
+-g::
+--ioport-debug::
+ Enable ioport debugging.
+
+-c::
+--enable-virtio-console::
+ Enable the virtual IO console.
+
+--cpus::
+ The number of virtual CPUs to run.
+
+--debug::
+ Enable debug messages.
+
+SEE ALSO
+--------
+linkkvm:
--- /dev/null
+lkvm-sandbox(1)
+================
+
+NAME
+----
+lkvm-sandbox - Run a command in a sandboxed guest
+
+SYNOPSIS
+--------
+[verse]
+'lkvm sandbox ['lkvm run' arguments] -- [sandboxed command]'
+
+DESCRIPTION
+-----------
+The sandboxed command will run in a guest as part of its init
+command.
--- /dev/null
+lkvm-setup(1)
+================
+
+NAME
+----
+lkvm-setup - Set up a new virtual machine
+
+SYNOPSIS
+--------
+[verse]
+'lkvm setup <name>'
+
+DESCRIPTION
+-----------
+The command sets up a virtual machine.
--- /dev/null
+lkvm-stat(1)
+================
+
+NAME
+----
+lkvm-stat - Print statistics about a running instance
+
+SYNOPSIS
+--------
+[verse]
+'lkvm stat [command] [-n instance] [-p instance pid] [--all]'
+
+DESCRIPTION
+-----------
+The command prints statistics about a running instance.
+For a list of running instances see 'lkvm list'.
+
+Commands:
+ --memory, -m Display memory statistics
--- /dev/null
+lkvm-stop(1)
+================
+
+NAME
+----
+lkvm-stop - Stop a running instance
+
+SYNOPSIS
+--------
+[verse]
+'lkvm stop [instance]'
+
+DESCRIPTION
+-----------
+The command stops a running instance.
+For a list of running instances see 'lkvm list'.
--- /dev/null
+lkvm-version(1)
+================
+
+NAME
+----
+lkvm-version - Print the version of the kernel tree kvm tools
+was built on.
+
+SYNOPSIS
+--------
+[verse]
+'lkvm version'
+
+DESCRIPTION
+-----------
+The command prints the version of the kernel that was used to build
+kvm tools.
+
+Note that the version is not the version of the kernel which is currently
+running on the host, but is the version of the kernel tree from which kvm
+tools was built.
--- /dev/null
+General
+--------
+
+virtio-console, as the name implies, is a console over the virtio transport.
+Here is a simple head-to-head comparison of virtio-console and the regular
+8250 console:
+
+8250 serial console:
+
+ - Requires CONFIG_SERIAL_8250=y and CONFIG_SERIAL_8250_CONSOLE=y kernel configs,
+which are enabled almost everywhere.
+ - Doesn't require guest-side changes.
+ - Compatible with older guests.
+
+virtio-console:
+
+ - Requires CONFIG_VIRTIO_CONSOLE=y (along with all other virtio dependencies),
+which got enabled only in recent kernels (but not all of them).
+ - Much faster.
+ - Consumes less processing resources.
+ - Requires guest-side changes.
+
+Enabling virtio-console
+------------------------
+
+First, make sure the guest kernel is built with CONFIG_VIRTIO_CONSOLE=y. Once
+this is done, the following has to be done inside the guest image:
+
+ - Add the following line to /etc/inittab:
+ 'hvc0:2345:respawn:/sbin/agetty -L 9600 hvc0'
+ - Add 'hvc0' to /etc/securetty (so you can actually log in)
+ - Start the guest with '--console virtio'
+
+Common errors
+--------------
+
+Q: I don't see anything on the screen!
+A: Make sure CONFIG_VIRTIO_CONSOLE=y is enabled in the *guest* kernel. Also
+make sure you've updated /etc/inittab.
+
+Q: It won't accept my username/password, but I enter them correctly!
+A: You didn't add 'hvc0' to /etc/securetty
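The second common error above comes down to whether 'hvc0' is listed in the guest's /etc/securetty. A minimal sketch of such a check follows, assuming one terminal name per line; `tty_listed` is a hypothetical helper written for this note, not something kvmtool ships, and it does not try to handle comments or extra whitespace.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if `tty` appears as a whole line in the
 * file at `path`, 0 if it does not, and -1 if the file cannot be opened.
 */
int tty_listed(const char *path, const char *tty)
{
	char line[256];
	int found = 0;
	FILE *f;

	assert(path != NULL && tty != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		/* Strip the trailing newline before comparing. */
		line[strcspn(line, "\r\n")] = '\0';
		if (strcmp(line, tty) == 0) {
			found = 1;
			break;
		}
	}

	fclose(f);
	return found;
}
```

Inside a mounted guest image this could be called as tty_listed("/etc/securetty", "hvc0"); a return value of 0 would point at the login failure described in the last question.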
--- /dev/null
+#
+# Define WERROR=0 to disable -Werror.
+#
+
+ifeq ($(strip $(V)),)
+ E = @echo
+ Q = @
+else
+ E = @\#
+ Q =
+endif
+ifneq ($(I), )
+ KINCL_PATH=$(I)
+else
+ KINCL_PATH=../..
+endif
+export E Q KINCL_PATH
+
+include config/utilities.mak
+include config/feature-tests.mak
+
+CC := $(CROSS_COMPILE)$(CC)
+LD := $(CROSS_COMPILE)$(LD)
+
+FIND := find
+CSCOPE := cscope
+TAGS := ctags
+INSTALL := install
+
+prefix = $(HOME)
+bindir_relative = bin
+bindir = $(prefix)/$(bindir_relative)
+
+DESTDIR_SQ = $(subst ','\'',$(DESTDIR))
+bindir_SQ = $(subst ','\'',$(bindir))
+
+PROGRAM := lkvm
+PROGRAM_ALIAS := vm
+
+GUEST_INIT := guest/init
+
+OBJS += builtin-balloon.o
+OBJS += builtin-debug.o
+OBJS += builtin-help.o
+OBJS += builtin-list.o
+OBJS += builtin-stat.o
+OBJS += builtin-pause.o
+OBJS += builtin-resume.o
+OBJS += builtin-run.o
+OBJS += builtin-setup.o
+OBJS += builtin-stop.o
+OBJS += builtin-version.o
+OBJS += disk/core.o
+OBJS += framebuffer.o
+OBJS += guest_compat.o
+OBJS += hw/rtc.o
+OBJS += hw/serial.o
+OBJS += ioport.o
+OBJS += kvm-cpu.o
+OBJS += kvm.o
+OBJS += main.o
+OBJS += mmio.o
+OBJS += pci.o
+OBJS += term.o
+OBJS += virtio/blk.o
+OBJS += virtio/scsi.o
+OBJS += virtio/console.o
+OBJS += virtio/core.o
+OBJS += virtio/net.o
+OBJS += virtio/rng.o
+OBJS += virtio/balloon.o
+OBJS += virtio/pci.o
+OBJS += disk/blk.o
+OBJS += disk/qcow.o
+OBJS += disk/raw.o
+OBJS += ioeventfd.o
+OBJS += net/uip/core.o
+OBJS += net/uip/arp.o
+OBJS += net/uip/icmp.o
+OBJS += net/uip/ipv4.o
+OBJS += net/uip/tcp.o
+OBJS += net/uip/udp.o
+OBJS += net/uip/buf.o
+OBJS += net/uip/csum.o
+OBJS += net/uip/dhcp.o
+OBJS += kvm-cmd.o
+OBJS += util/init.o
+OBJS += util/rbtree.o
+OBJS += util/threadpool.o
+OBJS += util/parse-options.o
+OBJS += util/rbtree-interval.o
+OBJS += util/strbuf.o
+OBJS += util/read-write.o
+OBJS += util/util.o
+OBJS += virtio/9p.o
+OBJS += virtio/9p-pdu.o
+OBJS += hw/vesa.o
+OBJS += hw/pci-shmem.o
+OBJS += kvm-ipc.o
+OBJS += builtin-sandbox.o
+OBJS += virtio/mmio.o
+
+# Translate uname -m into ARCH string
+ARCH ?= $(shell uname -m | sed -e s/i.86/i386/ -e s/ppc.*/powerpc/)
+
+ifeq ($(ARCH),i386)
+ ARCH := x86
+ DEFINES += -DCONFIG_X86_32
+endif
+ifeq ($(ARCH),x86_64)
+ ARCH := x86
+ DEFINES += -DCONFIG_X86_64
+endif
+
+LIBFDT_SRC = fdt.o fdt_ro.o fdt_wip.o fdt_sw.o fdt_rw.o fdt_strerror.o
+LIBFDT_OBJS = $(patsubst %,../../scripts/dtc/libfdt/%,$(LIBFDT_SRC))
+
+### Arch-specific stuff
+
+#x86
+ifeq ($(ARCH),x86)
+ DEFINES += -DCONFIG_X86
+ OBJS += x86/boot.o
+ OBJS += x86/cpuid.o
+ OBJS += x86/interrupt.o
+ OBJS += x86/ioport.o
+ OBJS += x86/irq.o
+ OBJS += x86/kvm.o
+ OBJS += x86/kvm-cpu.o
+ OBJS += x86/mptable.o
+ OBJS += hw/i8042.o
+# Exclude BIOS object files from header dependencies.
+ OTHEROBJS += x86/bios.o
+ OTHEROBJS += x86/bios/bios-rom.o
+ ARCH_INCLUDE := x86/include
+endif
+# POWER/ppc: Actually only support ppc64 currently.
+ifeq ($(ARCH), powerpc)
+ DEFINES += -DCONFIG_PPC
+ OBJS += powerpc/boot.o
+ OBJS += powerpc/ioport.o
+ OBJS += powerpc/irq.o
+ OBJS += powerpc/kvm.o
+ OBJS += powerpc/cpu_info.o
+ OBJS += powerpc/kvm-cpu.o
+ OBJS += powerpc/spapr_hcall.o
+ OBJS += powerpc/spapr_rtas.o
+ OBJS += powerpc/spapr_hvcons.o
+ OBJS += powerpc/spapr_pci.o
+ OBJS += powerpc/xics.o
+# We use libfdt, but it's sometimes not packaged 64bit. It's small too,
+# so just build it in:
+ CFLAGS += -I../../scripts/dtc/libfdt
+ OTHEROBJS += $(LIBFDT_OBJS)
+ ARCH_INCLUDE := powerpc/include
+ CFLAGS += -m64
+endif
+
+###
+
+ifeq (,$(ARCH_INCLUDE))
+ UNSUPP_ERR = @echo "This architecture is not supported in kvmtool." && exit 1
+else
+ UNSUPP_ERR =
+endif
+
+###
+
+# Detect optional features.
+# On a given system, some libs may link statically, some may not; so, check
+# both and only build those that link!
+
+FLAGS_BFD := $(CFLAGS) -lbfd
+ifeq ($(call try-cc,$(SOURCE_BFD),$(FLAGS_BFD)),y)
+ CFLAGS_DYNOPT += -DCONFIG_HAS_BFD
+ OBJS_DYNOPT += symbol.o
+ LIBS_DYNOPT += -lbfd
+endif
+ifeq ($(call try-cc,$(SOURCE_BFD),$(FLAGS_BFD) -static),y)
+ CFLAGS_STATOPT += -DCONFIG_HAS_BFD
+ OBJS_STATOPT += symbol.o
+ LIBS_STATOPT += -lbfd
+endif
+
+FLAGS_VNCSERVER := $(CFLAGS) -lvncserver
+ifeq ($(call try-cc,$(SOURCE_VNCSERVER),$(FLAGS_VNCSERVER)),y)
+ OBJS_DYNOPT += ui/vnc.o
+ CFLAGS_DYNOPT += -DCONFIG_HAS_VNCSERVER
+ LIBS_DYNOPT += -lvncserver
+endif
+ifeq ($(call try-cc,$(SOURCE_VNCSERVER),$(FLAGS_VNCSERVER) -static),y)
+ OBJS_STATOPT += ui/vnc.o
+ CFLAGS_STATOPT += -DCONFIG_HAS_VNCSERVER
+ LIBS_STATOPT += -lvncserver
+endif
+
+FLAGS_SDL := $(CFLAGS) -lSDL
+ifeq ($(call try-cc,$(SOURCE_SDL),$(FLAGS_SDL)),y)
+ OBJS_DYNOPT += ui/sdl.o
+ CFLAGS_DYNOPT += -DCONFIG_HAS_SDL
+ LIBS_DYNOPT += -lSDL
+endif
+ifeq ($(call try-cc,$(SOURCE_SDL),$(FLAGS_SDL) -static), y)
+ OBJS_STATOPT += ui/sdl.o
+ CFLAGS_STATOPT += -DCONFIG_HAS_SDL
+ LIBS_STATOPT += -lSDL
+endif
+
+FLAGS_ZLIB := $(CFLAGS) -lz
+ifeq ($(call try-cc,$(SOURCE_ZLIB),$(FLAGS_ZLIB)),y)
+ CFLAGS_DYNOPT += -DCONFIG_HAS_ZLIB
+ LIBS_DYNOPT += -lz
+endif
+ifeq ($(call try-cc,$(SOURCE_ZLIB),$(FLAGS_ZLIB) -static),y)
+ CFLAGS_STATOPT += -DCONFIG_HAS_ZLIB
+ LIBS_STATOPT += -lz
+endif
+
+FLAGS_AIO := $(CFLAGS) -laio
+ifeq ($(call try-cc,$(SOURCE_AIO),$(FLAGS_AIO)),y)
+ CFLAGS_DYNOPT += -DCONFIG_HAS_AIO
+ LIBS_DYNOPT += -laio
+endif
+ifeq ($(call try-cc,$(SOURCE_AIO),$(FLAGS_AIO) -static),y)
+ CFLAGS_STATOPT += -DCONFIG_HAS_AIO
+ LIBS_STATOPT += -laio
+endif
+
+ifneq ($(call try-build,$(SOURCE_STATIC),-static,),y)
+$(error No static libc found. Please install glibc-static package.)
+endif
+###
+
+LIBS += -lrt
+LIBS += -lpthread
+LIBS += -lutil
+
+
+DEPS := $(patsubst %.o,%.d,$(OBJS))
+
+DEFINES += -D_FILE_OFFSET_BITS=64
+DEFINES += -D_GNU_SOURCE
+DEFINES += -DKVMTOOLS_VERSION='"$(KVMTOOLS_VERSION)"'
+DEFINES += -DBUILD_ARCH='"$(ARCH)"'
+
+KVM_INCLUDE := include
+CFLAGS += $(CPPFLAGS) $(DEFINES) -I$(KVM_INCLUDE) -I$(ARCH_INCLUDE) -I$(KINCL_PATH)/include -I$(KINCL_PATH)/arch/$(ARCH)/include/ -O2 -fno-strict-aliasing -g -flto
+
+WARNINGS += -Wall
+WARNINGS += -Wcast-align
+WARNINGS += -Wformat=2
+WARNINGS += -Winit-self
+WARNINGS += -Wmissing-declarations
+WARNINGS += -Wmissing-prototypes
+WARNINGS += -Wnested-externs
+WARNINGS += -Wno-system-headers
+WARNINGS += -Wold-style-definition
+WARNINGS += -Wredundant-decls
+WARNINGS += -Wsign-compare
+WARNINGS += -Wstrict-prototypes
+WARNINGS += -Wundef
+WARNINGS += -Wvolatile-register-var
+WARNINGS += -Wwrite-strings
+
+CFLAGS += $(WARNINGS)
+
+# Some targets may use 'external' sources that don't build totally cleanly.
+CFLAGS_EASYGOING := $(CFLAGS)
+
+ifneq ($(WERROR),0)
+ CFLAGS += -Werror
+endif
+
+all: arch_support_check $(PROGRAM) $(PROGRAM_ALIAS) $(GUEST_INIT)
+
+arch_support_check:
+ $(UNSUPP_ERR)
+
+KVMTOOLS-VERSION-FILE:
+ @$(SHELL_PATH) util/KVMTOOLS-VERSION-GEN $(OUTPUT)
+-include $(OUTPUT)KVMTOOLS-VERSION-FILE
+
+# When building -static all objects are built with appropriate flags, which
+# may differ between static & dynamic .o. The objects are separated into
+# .o and .static.o. See the %.o: %.c rules below.
+#
+# $(OTHEROBJS) are things that do not get substituted like this.
+#
+STATIC_OBJS = $(patsubst %.o,%.static.o,$(OBJS) $(OBJS_STATOPT))
+GUEST_OBJS = guest/guest_init.o
+
+$(PROGRAM)-static: $(DEPS) $(STATIC_OBJS) $(OTHEROBJS) $(GUEST_INIT)
+ $(E) " LINK " $@
+ $(Q) $(CC) -static $(CFLAGS) $(STATIC_OBJS) $(OTHEROBJS) $(GUEST_OBJS) $(LIBS) $(LIBS_STATOPT) -o $@
+
+$(PROGRAM): $(DEPS) $(OBJS) $(OBJS_DYNOPT) $(OTHEROBJS) $(GUEST_INIT)
+ $(E) " LINK " $@
+ $(Q) $(CC) $(CFLAGS) $(OBJS) $(OBJS_DYNOPT) $(OTHEROBJS) $(GUEST_OBJS) $(LIBS) $(LIBS_DYNOPT) -o $@
+
+$(PROGRAM_ALIAS): $(PROGRAM)
+ $(E) " LN " $@
+ $(Q) ln -f $(PROGRAM) $@
+
+$(GUEST_INIT): guest/init.c
+ $(E) " LINK " $@
+ $(Q) $(CC) -static guest/init.c -o $@
+ $(Q) $(LD) -r -b binary -o guest/guest_init.o $(GUEST_INIT)
+
+$(DEPS):
+
+util/rbtree.d: ../../lib/rbtree.c
+ $(Q) $(CC) -M -MT util/rbtree.o $(CFLAGS) $< -o $@
+
+%.d: %.c
+ $(Q) $(CC) -M -MT $(patsubst %.d,%.o,$@) $(CFLAGS) $< -o $@
+
+# The header file common-cmds.h is needed for compilation of builtin-help.c.
+builtin-help.d: $(KVM_INCLUDE)/common-cmds.h
+
+$(OBJS):
+
+# This rule relaxes the -Werror on libfdt, since for now it still has
+# a bunch of warnings. :(
+../../scripts/dtc/libfdt/%.o: ../../scripts/dtc/libfdt/%.c
+ifeq ($(C),1)
+ $(E) " CHECK " $@
+ $(Q) $(CHECK) -c $(CFLAGS_EASYGOING) $< -o $@
+endif
+ $(E) " CC " $@
+ $(Q) $(CC) -c $(CFLAGS_EASYGOING) $< -o $@
+
+util/rbtree.static.o util/rbtree.o: ../../lib/rbtree.c
+ifeq ($(C),1)
+ $(E) " CHECK " $@
+ $(Q) $(CHECK) -c $(CFLAGS) $< -o $@
+endif
+ $(E) " CC " $@
+ $(Q) $(CC) -c $(CFLAGS) $< -o $@
+
+%.static.o: %.c
+ifeq ($(C),1)
+ $(E) " CHECK " $@
+ $(Q) $(CHECK) -c $(CFLAGS) $(CFLAGS_STATOPT) $< -o $@
+endif
+ $(E) " CC " $@
+ $(Q) $(CC) -c $(CFLAGS) $(CFLAGS_STATOPT) $< -o $@
+
+%.o: %.c
+ifeq ($(C),1)
+ $(E) " CHECK " $@
+ $(Q) $(CHECK) -c $(CFLAGS) $(CFLAGS_DYNOPT) $< -o $@
+endif
+ $(E) " CC " $@
+ $(Q) $(CC) -c $(CFLAGS) $(CFLAGS_DYNOPT) $< -o $@
+
+
+$(KVM_INCLUDE)/common-cmds.h: util/generate-cmdlist.sh command-list.txt
+
+$(KVM_INCLUDE)/common-cmds.h: $(wildcard Documentation/kvm-*.txt)
+ $(E) " GEN " $@
+ $(Q) util/generate-cmdlist.sh > $@+ && mv $@+ $@
+
+#
+# BIOS assembly weirdness
+#
+BIOS_CFLAGS += -m32
+BIOS_CFLAGS += -march=i386
+BIOS_CFLAGS += -mregparm=3
+
+BIOS_CFLAGS += -fno-stack-protector
+BIOS_CFLAGS += -I../../arch/$(ARCH)
+
+x86/bios.o: x86/bios/bios.bin x86/bios/bios-rom.h
+
+x86/bios/bios.bin.elf: x86/bios/entry.S x86/bios/e820.c x86/bios/int10.c x86/bios/int15.c x86/bios/rom.ld.S
+ $(E) " CC x86/bios/memcpy.o"
+ $(Q) $(CC) -include code16gcc.h $(CFLAGS) $(BIOS_CFLAGS) -c -s x86/bios/memcpy.c -o x86/bios/memcpy.o
+ $(E) " CC x86/bios/e820.o"
+ $(Q) $(CC) -include code16gcc.h $(CFLAGS) $(BIOS_CFLAGS) -c -s x86/bios/e820.c -o x86/bios/e820.o
+ $(E) " CC x86/bios/int10.o"
+ $(Q) $(CC) -include code16gcc.h $(CFLAGS) $(BIOS_CFLAGS) -c -s x86/bios/int10.c -o x86/bios/int10.o
+ $(E) " CC x86/bios/int15.o"
+ $(Q) $(CC) -include code16gcc.h $(CFLAGS) $(BIOS_CFLAGS) -c -s x86/bios/int15.c -o x86/bios/int15.o
+ $(E) " CC x86/bios/entry.o"
+ $(Q) $(CC) $(CFLAGS) $(BIOS_CFLAGS) -c -s x86/bios/entry.S -o x86/bios/entry.o
+ $(E) " LD " $@
+ $(Q) $(LD) -T x86/bios/rom.ld.S -o x86/bios/bios.bin.elf x86/bios/memcpy.o x86/bios/entry.o x86/bios/e820.o x86/bios/int10.o x86/bios/int15.o
+
+x86/bios/bios.bin: x86/bios/bios.bin.elf
+ $(E) " OBJCOPY " $@
+ $(Q) objcopy -O binary -j .text x86/bios/bios.bin.elf x86/bios/bios.bin
+
+x86/bios/bios-rom.o: x86/bios/bios-rom.S x86/bios/bios.bin x86/bios/bios-rom.h
+ $(E) " CC " $@
+ $(Q) $(CC) -c $(CFLAGS) x86/bios/bios-rom.S -o x86/bios/bios-rom.o
+
+x86/bios/bios-rom.h: x86/bios/bios.bin.elf
+ $(E) " NM " $@
+ $(Q) cd x86/bios && sh gen-offsets.sh > bios-rom.h && cd ..
+
+check: all
+ $(MAKE) -C tests
+ ./$(PROGRAM) run tests/pit/tick.bin
+ ./$(PROGRAM) run -d tests/boot/boot_test.iso -p "init=init"
+.PHONY: check
+
+install: all
+ $(E) " INSTALL"
+ $(Q) $(INSTALL) -d -m 755 '$(DESTDIR_SQ)$(bindir_SQ)'
+ $(Q) $(INSTALL) $(PROGRAM) '$(DESTDIR_SQ)$(bindir_SQ)'
+.PHONY: install
+
+clean:
+ $(E) " CLEAN"
+ $(Q) rm -f x86/bios/*.bin
+ $(Q) rm -f x86/bios/*.elf
+ $(Q) rm -f x86/bios/*.o
+ $(Q) rm -f x86/bios/bios-rom.h
+ $(Q) rm -f tests/boot/boot_test.iso
+ $(Q) rm -rf tests/boot/rootfs/
+ $(Q) rm -f $(DEPS) $(OBJS) $(OTHEROBJS) $(OBJS_DYNOPT) $(STATIC_OBJS) $(PROGRAM) $(PROGRAM_ALIAS) $(PROGRAM)-static $(GUEST_INIT) $(GUEST_OBJS)
+ $(Q) rm -f cscope.*
+ $(Q) rm -f tags
+ $(Q) rm -f TAGS
+ $(Q) rm -f $(KVM_INCLUDE)/common-cmds.h
+ $(Q) rm -f KVMTOOLS-VERSION-FILE
+.PHONY: clean
+
+KVM_DEV ?= /dev/kvm
+
+$(KVM_DEV):
+ $(E) " MKNOD " $@
+ $(Q) mknod $@ char 10 232
+
+devices: $(KVM_DEV)
+.PHONY: devices
+
+TAGS:
+ $(E) " GEN" $@
+ $(Q) $(RM) -f TAGS
+ $(Q) $(FIND) . -name '*.[hcS]' -print | xargs etags -a
+.PHONY: TAGS
+
+tags:
+ $(E) " GEN" $@
+ $(Q) $(RM) -f tags
+ $(Q) $(FIND) . -name '*.[hcS]' -print | xargs ctags -a
+.PHONY: tags
+
+cscope:
+ $(E) " GEN" $@
+ $(Q) $(FIND) . -name '*.[hcS]' -print > cscope.files
+ $(Q) $(CSCOPE) -bkqu
+.PHONY: cscope
+
+# Deps
+-include $(DEPS)
--- /dev/null
+Native Linux KVM tool
+=====================
+The goal of this tool is to provide a clean, from-scratch, lightweight
+KVM host tool implementation that can boot Linux guest images (just a
+hobby, won't be big and professional like QEMU) with no BIOS
+dependencies and with only the minimal amount of legacy device
+emulation.
+
+It's great as a learning tool if you want to get your feet wet in
+virtualization land: it's only 5 KLOC of clean C code that can already
+boot a guest Linux image.
+
+Right now it can boot a Linux image and provide you output via a serial
+console, over the host terminal, i.e. you can use it to boot a guest
+Linux image in a terminal or over ssh and log into the guest without
+much guest or host side setup work needed.
+
+1. To try out the tool, clone the git repository:
+
+ git clone git://github.com/penberg/linux-kvm.git
+
+or alternatively, if you already have a kernel source tree:
+
+ git remote add kvm-tool git://github.com/penberg/linux-kvm.git
+ git remote update
+ git checkout -b kvm-tool/master kvm-tool
+
+2. Compile the tool:
+
+ cd tools/kvm && make
+
+3. Download a raw userspace image:
+
+ wget http://wiki.qemu.org/download/linux-0.2.img.bz2 && bunzip2 linux-0.2.img.bz2
+
+4. The guest kernel has to be built with the following configuration:
+
+ - For the default console output:
+ CONFIG_SERIAL_8250=y
+ CONFIG_SERIAL_8250_CONSOLE=y
+
+ - For running 32bit images on 64bit hosts:
+ CONFIG_IA32_EMULATION=y
+
+ - Proper FS options according to image FS (e.g. CONFIG_EXT2_FS, CONFIG_EXT4_FS).
+
+ - For all virtio devices listed below:
+ CONFIG_VIRTIO=y
+ CONFIG_VIRTIO_RING=y
+ CONFIG_VIRTIO_PCI=y
+
+ - For virtio-blk devices (--disk, -d):
+ CONFIG_VIRTIO_BLK=y
+
+ - For virtio-net devices ([--network, -n] virtio):
+ CONFIG_VIRTIO_NET=y
+
+ - For virtio-9p devices (--virtio-9p):
+ CONFIG_NET_9P=y
+ CONFIG_NET_9P_VIRTIO=y
+ CONFIG_9P_FS=y
+
+ - For virtio-balloon device (--balloon):
+ CONFIG_VIRTIO_BALLOON=y
+
+ - For virtio-console device (--console virtio):
+ CONFIG_VIRTIO_CONSOLE=y
+
+ - For virtio-rng device (--rng):
+ CONFIG_HW_RANDOM_VIRTIO=y
+
+ - For vesa device (--sdl or --vnc):
+ CONFIG_FB_VESA=y
+
+
+5. And finally, launch the hypervisor:
+
+ ./lkvm run --disk linux-0.2.img \
+ --kernel ../../arch/x86/boot/bzImage
+or
+
+ sudo ./lkvm run --disk linux-0.2.img \
+ --kernel ../../arch/x86/boot/bzImage \
+ --network virtio
+
+The tool has been written by Pekka Enberg, Cyrill Gorcunov, Asias He,
+Sasha Levin and Prasad Joshi. Special thanks to Avi Kivity for his help
+on KVM internals and Ingo Molnar for all-around support and encouragement!
+
+See the following thread for the original discussion and the motivation for
+this project:
+
+http://thread.gmane.org/gmane.linux.kernel/962051/focus=962620
+
+Build dependencies
+=====================
+For deb based systems:
+32-bit:
+sudo apt-get install build-essential
+64-bit:
+sudo apt-get install build-essential libc6-dev-i386
+
+For rpm based systems:
+32-bit:
+yum install glibc-devel
+64-bit:
+yum install glibc-devel glibc-static
+
+On 64-bit Arch Linux make sure the multilib repository is enabled in your
+/etc/pacman.conf and run
+pacman -Sy lib32-glibc
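Step 4 of the README above lists guest kernel options that must be built in (=y). One way to sanity-check a guest .config against such a list is sketched below. The functions `config_builtin` and `config_missing` are hypothetical helpers for illustration only; they take the config file contents as a string and treat anything other than an exact "OPTION=y" line (including "=m" and "is not set") as missing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if the kernel config text contains the
 * exact line "<opt>=y", 0 otherwise.
 */
int config_builtin(const char *config, const char *opt)
{
	size_t len;
	const char *p = config;

	assert(config != NULL && opt != NULL);
	len = strlen(opt);

	while (p && *p) {
		const char *eol = strchr(p, '\n');
		size_t n = eol ? (size_t)(eol - p) : strlen(p);

		/* The line must be exactly "<opt>=y", so prefixes do not match. */
		if (n == len + 2 && strncmp(p, opt, len) == 0 &&
		    p[len] == '=' && p[len + 1] == 'y')
			return 1;

		p = eol ? eol + 1 : NULL;
	}
	return 0;
}

/* Hypothetical helper: counts how many required options are not "=y". */
size_t config_missing(const char *config, const char *const required[],
		      size_t count)
{
	size_t i, missing = 0;

	for (i = 0; i < count; i++)
		if (!config_builtin(config, required[i]))
			missing++;
	return missing;
}
```

A caller would read the guest's .config into a buffer and pass, for example, the three core virtio options from step 4 as the required list; a non-zero result means at least one of them still needs to be enabled.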
--- /dev/null
+#include <stdio.h>
+#include <string.h>
+#include <signal.h>
+
+#include <kvm/util.h>
+#include <kvm/kvm-cmd.h>
+#include <kvm/builtin-balloon.h>
+#include <kvm/parse-options.h>
+#include <kvm/kvm.h>
+#include <kvm/kvm-ipc.h>
+
+static const char *instance_name;
+static u64 inflate;
+static u64 deflate;
+
+static const char * const balloon_usage[] = {
+ "lkvm balloon [-n name] [-p pid] [-i amount] [-d amount]",
+ NULL
+};
+
+static const struct option balloon_options[] = {
+ OPT_GROUP("Instance options:"),
+ OPT_STRING('n', "name", &instance_name, "name", "Instance name"),
+ OPT_GROUP("Balloon options:"),
+ OPT_U64('i', "inflate", &inflate, "Amount to inflate"),
+ OPT_U64('d', "deflate", &deflate, "Amount to deflate"),
+ OPT_END(),
+};
+
+void kvm_balloon_help(void)
+{
+ usage_with_options(balloon_usage, balloon_options);
+}
+
+static void parse_balloon_options(int argc, const char **argv)
+{
+ while (argc != 0) {
+ argc = parse_options(argc, argv, balloon_options, balloon_usage,
+ PARSE_OPT_STOP_AT_NON_OPTION);
+ if (argc != 0)
+ kvm_balloon_help();
+ }
+}
+
+int kvm_cmd_balloon(int argc, const char **argv, const char *prefix)
+{
+ int instance;
+ int r;
+ int amount;
+
+ parse_balloon_options(argc, argv);
+
+ if (inflate == 0 && deflate == 0)
+ kvm_balloon_help();
+
+ if (instance_name == NULL)
+ kvm_balloon_help();
+
+ instance = kvm__get_sock_by_instance(instance_name);
+
+ if (instance <= 0)
+ die("Failed locating instance");
+
+ if (inflate)
+ amount = inflate;
+ else if (deflate)
+ amount = -deflate;
+ else
+ kvm_balloon_help();
+
+ r = kvm_ipc__send_msg(instance, KVM_IPC_BALLOON,
+ sizeof(amount), (u8 *)&amount);
+
+ close(instance);
+
+ if (r < 0)
+ return -1;
+
+ return 0;
+}
--- /dev/null
+#include <kvm/util.h>
+#include <kvm/kvm-cmd.h>
+#include <kvm/builtin-debug.h>
+#include <kvm/kvm.h>
+#include <kvm/parse-options.h>
+#include <kvm/kvm-ipc.h>
+#include <kvm/read-write.h>
+
+#include <stdio.h>
+#include <string.h>
+#include <signal.h>
+
+#define BUFFER_SIZE 100
+
+static bool all;
+static int nmi = -1;
+static bool dump;
+static const char *instance_name;
+static const char *sysrq;
+
+static const char * const debug_usage[] = {
+ "lkvm debug [--all] [-n name] [-d] [-m vcpu]",
+ NULL
+};
+
+static const struct option debug_options[] = {
+ OPT_GROUP("General options:"),
+ OPT_BOOLEAN('d', "dump", &dump, "Generate a debug dump from guest"),
+ OPT_INTEGER('m', "nmi", &nmi, "Generate NMI on VCPU"),
+ OPT_STRING('s', "sysrq", &sysrq, "sysrq", "Inject a sysrq"),
+ OPT_GROUP("Instance options:"),
+ OPT_BOOLEAN('a', "all", &all, "Debug all instances"),
+ OPT_STRING('n', "name", &instance_name, "name", "Instance name"),
+ OPT_END()
+};
+
+static void parse_debug_options(int argc, const char **argv)
+{
+ while (argc != 0) {
+ argc = parse_options(argc, argv, debug_options, debug_usage,
+ PARSE_OPT_STOP_AT_NON_OPTION);
+ if (argc != 0)
+ kvm_debug_help();
+ }
+}
+
+void kvm_debug_help(void)
+{
+ usage_with_options(debug_usage, debug_options);
+}
+
+static int do_debug(const char *name, int sock)
+{
+ char buff[BUFFER_SIZE];
+ struct debug_cmd_params cmd = {.dbg_type = 0};
+ int r;
+
+ if (dump)
+ cmd.dbg_type |= KVM_DEBUG_CMD_TYPE_DUMP;
+
+ if (nmi != -1) {
+ cmd.dbg_type |= KVM_DEBUG_CMD_TYPE_NMI;
+ cmd.cpu = nmi;
+ }
+
+ if (sysrq) {
+ cmd.dbg_type |= KVM_DEBUG_CMD_TYPE_SYSRQ;
+ cmd.sysrq = sysrq[0];
+ }
+
+ r = kvm_ipc__send_msg(sock, KVM_IPC_DEBUG, sizeof(cmd), (u8 *)&cmd);
+ if (r < 0)
+ return r;
+
+ if (!dump)
+ return 0;
+
+ do {
+ r = xread(sock, buff, BUFFER_SIZE);
+ if (r < 0)
+ return 0;
+ printf("%.*s", r, buff);
+ } while (r > 0);
+
+ return 0;
+}
+
+int kvm_cmd_debug(int argc, const char **argv, const char *prefix)
+{
+ int instance;
+ int r;
+
+ parse_debug_options(argc, argv);
+
+ if (all)
+ return kvm__enumerate_instances(do_debug);
+
+ if (instance_name == NULL)
+ kvm_debug_help();
+
+ instance = kvm__get_sock_by_instance(instance_name);
+
+ if (instance <= 0)
+ die("Failed locating instance");
+
+ r = do_debug(instance_name, instance);
+
+ close(instance);
+
+ return r;
+}
--- /dev/null
+#include <stdio.h>
+#include <string.h>
+
+/* user defined headers */
+#include <common-cmds.h>
+
+#include <kvm/util.h>
+#include <kvm/kvm-cmd.h>
+#include <kvm/builtin-help.h>
+#include <kvm/kvm.h>
+
+
+const char kvm_usage_string[] =
+ "lkvm COMMAND [ARGS]";
+
+const char kvm_more_info_string[] =
+ "See 'lkvm help COMMAND' for more information on a specific command.";
+
+
+static void list_common_cmds_help(void)
+{
+ unsigned int i, longest = 0;
+
+ for (i = 0; i < ARRAY_SIZE(common_cmds); i++) {
+ if (longest < strlen(common_cmds[i].name))
+ longest = strlen(common_cmds[i].name);
+ }
+
+ puts(" The most commonly used lkvm commands are:");
+ for (i = 0; i < ARRAY_SIZE(common_cmds); i++) {
+ printf(" %-*s ", longest, common_cmds[i].name);
+ puts(common_cmds[i].help);
+ }
+}
+
+static void kvm_help(void)
+{
+ printf("\n To start a simple non-privileged shell run '%s run'\n\n"
+ "usage: %s\n\n", KVM_BINARY_NAME, kvm_usage_string);
+ list_common_cmds_help();
+ printf("\n %s\n\n", kvm_more_info_string);
+}
+
+
+static void help_cmd(const char *cmd)
+{
+ struct cmd_struct *p;
+ p = kvm_get_command(kvm_commands, cmd);
+ if (!p)
+ kvm_help();
+ else if (p->help)
+ p->help();
+}
+
+int kvm_cmd_help(int argc, const char **argv, const char *prefix)
+{
+ if (!argv || !*argv) {
+ kvm_help();
+ return 0;
+ }
+ help_cmd(argv[0]);
+ return 0;
+}
--- /dev/null
+#include <kvm/util.h>
+#include <kvm/kvm-cmd.h>
+#include <kvm/builtin-list.h>
+#include <kvm/kvm.h>
+#include <kvm/parse-options.h>
+#include <kvm/kvm-ipc.h>
+
+#include <dirent.h>
+#include <stdio.h>
+#include <string.h>
+#include <signal.h>
+#include <fcntl.h>
+
+static bool run;
+static bool rootfs;
+
+static const char * const list_usage[] = {
+ "lkvm list",
+ NULL
+};
+
+static const struct option list_options[] = {
+ OPT_GROUP("General options:"),
+ OPT_BOOLEAN('i', "run", &run, "List running instances"),
+ OPT_BOOLEAN('r', "rootfs", &rootfs, "List rootfs instances"),
+ OPT_END()
+};
+
+#define KVM_INSTANCE_RUNNING "running"
+#define KVM_INSTANCE_PAUSED "paused"
+#define KVM_INSTANCE_SHUTOFF "shut off"
+
+void kvm_list_help(void)
+{
+ usage_with_options(list_usage, list_options);
+}
+
+static pid_t get_pid(int sock)
+{
+ pid_t pid;
+ int r;
+
+ r = kvm_ipc__send(sock, KVM_IPC_PID);
+ if (r < 0)
+ return r;
+
+ r = read(sock, &pid, sizeof(pid));
+ if (r < 0)
+ return r;
+
+ return pid;
+}
+
+int get_vmstate(int sock)
+{
+ int vmstate;
+ int r;
+
+ r = kvm_ipc__send(sock, KVM_IPC_VMSTATE);
+ if (r < 0)
+ return r;
+
+ r = read(sock, &vmstate, sizeof(vmstate));
+ if (r < 0)
+ return r;
+
+ return vmstate;
+}
+
+static int print_guest(const char *name, int sock)
+{
+ pid_t pid;
+ int vmstate;
+
+ pid = get_pid(sock);
+ vmstate = get_vmstate(sock);
+
+ if ((int)pid < 0 || vmstate < 0)
+ return -1;
+
+ if (vmstate == KVM_VMSTATE_PAUSED)
+ printf("%5d %-20s %s\n", pid, name, KVM_INSTANCE_PAUSED);
+ else
+ printf("%5d %-20s %s\n", pid, name, KVM_INSTANCE_RUNNING);
+
+ return 0;
+}
+
+static int kvm_list_running_instances(void)
+{
+ return kvm__enumerate_instances(print_guest);
+}
+
+static int kvm_list_rootfs(void)
+{
+ DIR *dir;
+ struct dirent *dirent;
+
+ dir = opendir(kvm__get_dir());
+ if (dir == NULL)
+ return -1;
+
+ while ((dirent = readdir(dir))) {
+ if (dirent->d_type == DT_DIR &&
+ strcmp(dirent->d_name, ".") &&
+ strcmp(dirent->d_name, ".."))
+ printf("%5s %-20s %s\n", "", dirent->d_name, KVM_INSTANCE_SHUTOFF);
+ }
+
+ closedir(dir);
+
+ return 0;
+}
+
+static void parse_setup_options(int argc, const char **argv)
+{
+ while (argc != 0) {
+ argc = parse_options(argc, argv, list_options, list_usage,
+ PARSE_OPT_STOP_AT_NON_OPTION);
+ if (argc != 0)
+ kvm_list_help();
+ }
+}
+
+int kvm_cmd_list(int argc, const char **argv, const char *prefix)
+{
+ int r;
+
+ parse_setup_options(argc, argv);
+
+ if (!run && !rootfs)
+ run = rootfs = true;
+
+ printf("%6s %-20s %s\n", "PID", "NAME", "STATE");
+ printf("------------------------------------\n");
+
+ if (run) {
+ r = kvm_list_running_instances();
+ if (r < 0)
+ perror("Error listing instances");
+ }
+
+ if (rootfs) {
+ r = kvm_list_rootfs();
+ if (r < 0)
+ perror("Error listing rootfs");
+ }
+
+ return 0;
+}
--- /dev/null
+#include <kvm/util.h>
+#include <kvm/kvm-cmd.h>
+#include <kvm/builtin-pause.h>
+#include <kvm/builtin-list.h>
+#include <kvm/kvm.h>
+#include <kvm/parse-options.h>
+#include <kvm/kvm-ipc.h>
+
+#include <stdio.h>
+#include <string.h>
+#include <signal.h>
+
+static bool all;
+static const char *instance_name;
+
+static const char * const pause_usage[] = {
+ "lkvm pause [--all] [-n name]",
+ NULL
+};
+
+static const struct option pause_options[] = {
+ OPT_GROUP("General options:"),
+ OPT_BOOLEAN('a', "all", &all, "Pause all instances"),
+ OPT_STRING('n', "name", &instance_name, "name", "Instance name"),
+ OPT_END()
+};
+
+static void parse_pause_options(int argc, const char **argv)
+{
+ while (argc != 0) {
+ argc = parse_options(argc, argv, pause_options, pause_usage,
+ PARSE_OPT_STOP_AT_NON_OPTION);
+ if (argc != 0)
+ kvm_pause_help();
+ }
+}
+
+void kvm_pause_help(void)
+{
+ usage_with_options(pause_usage, pause_options);
+}
+
+static int do_pause(const char *name, int sock)
+{
+ int r;
+ int vmstate;
+
+ vmstate = get_vmstate(sock);
+ if (vmstate < 0)
+ return vmstate;
+ if (vmstate == KVM_VMSTATE_PAUSED) {
+ printf("Guest %s is already paused.\n", name);
+ return 0;
+ }
+
+ r = kvm_ipc__send(sock, KVM_IPC_PAUSE);
+ if (r)
+ return r;
+
+ printf("Guest %s paused\n", name);
+
+ return 0;
+}
+
+int kvm_cmd_pause(int argc, const char **argv, const char *prefix)
+{
+ int instance;
+ int r;
+
+ parse_pause_options(argc, argv);
+
+ if (all)
+ return kvm__enumerate_instances(do_pause);
+
+ if (instance_name == NULL)
+ kvm_pause_help();
+
+ instance = kvm__get_sock_by_instance(instance_name);
+
+ if (instance <= 0)
+ die("Failed locating instance");
+
+ r = do_pause(instance_name, instance);
+
+ close(instance);
+
+ return r;
+}
--- /dev/null
+#include <kvm/util.h>
+#include <kvm/kvm-cmd.h>
+#include <kvm/builtin-resume.h>
+#include <kvm/builtin-list.h>
+#include <kvm/kvm.h>
+#include <kvm/parse-options.h>
+#include <kvm/kvm-ipc.h>
+
+#include <stdio.h>
+#include <string.h>
+#include <signal.h>
+
+static bool all;
+static const char *instance_name;
+
+static const char * const resume_usage[] = {
+ "lkvm resume [--all] [-n name]",
+ NULL
+};
+
+static const struct option resume_options[] = {
+ OPT_GROUP("General options:"),
+ OPT_BOOLEAN('a', "all", &all, "Resume all instances"),
+ OPT_STRING('n', "name", &instance_name, "name", "Instance name"),
+ OPT_END()
+};
+
+static void parse_resume_options(int argc, const char **argv)
+{
+ while (argc != 0) {
+ argc = parse_options(argc, argv, resume_options, resume_usage,
+ PARSE_OPT_STOP_AT_NON_OPTION);
+ if (argc != 0)
+ kvm_resume_help();
+ }
+}
+
+void kvm_resume_help(void)
+{
+ usage_with_options(resume_usage, resume_options);
+}
+
+static int do_resume(const char *name, int sock)
+{
+ int r;
+ int vmstate;
+
+ vmstate = get_vmstate(sock);
+ if (vmstate < 0)
+ return vmstate;
+ if (vmstate == KVM_VMSTATE_RUNNING) {
+ printf("Guest %s is still running.\n", name);
+ return 0;
+ }
+
+ r = kvm_ipc__send(sock, KVM_IPC_RESUME);
+ if (r)
+ return r;
+
+ printf("Guest %s resumed\n", name);
+
+ return 0;
+}
+
+int kvm_cmd_resume(int argc, const char **argv, const char *prefix)
+{
+ int instance;
+ int r;
+
+ parse_resume_options(argc, argv);
+
+ if (all)
+ return kvm__enumerate_instances(do_resume);
+
+ if (instance_name == NULL)
+ kvm_resume_help();
+
+ instance = kvm__get_sock_by_instance(instance_name);
+
+ if (instance <= 0)
+ die("Failed locating instance");
+
+ r = do_resume(instance_name, instance);
+
+ close(instance);
+
+ return r;
+}
--- /dev/null
+#include "kvm/builtin-run.h"
+
+#include "kvm/builtin-setup.h"
+#include "kvm/virtio-balloon.h"
+#include "kvm/virtio-console.h"
+#include "kvm/parse-options.h"
+#include "kvm/8250-serial.h"
+#include "kvm/framebuffer.h"
+#include "kvm/disk-image.h"
+#include "kvm/threadpool.h"
+#include "kvm/virtio-scsi.h"
+#include "kvm/virtio-blk.h"
+#include "kvm/virtio-net.h"
+#include "kvm/virtio-rng.h"
+#include "kvm/ioeventfd.h"
+#include "kvm/virtio-9p.h"
+#include "kvm/barrier.h"
+#include "kvm/kvm-cpu.h"
+#include "kvm/ioport.h"
+#include "kvm/symbol.h"
+#include "kvm/i8042.h"
+#include "kvm/mutex.h"
+#include "kvm/term.h"
+#include "kvm/util.h"
+#include "kvm/strbuf.h"
+#include "kvm/vesa.h"
+#include "kvm/irq.h"
+#include "kvm/kvm.h"
+#include "kvm/pci.h"
+#include "kvm/rtc.h"
+#include "kvm/sdl.h"
+#include "kvm/vnc.h"
+#include "kvm/guest_compat.h"
+#include "kvm/pci-shmem.h"
+#include "kvm/kvm-ipc.h"
+#include "kvm/builtin-debug.h"
+
+#include <linux/types.h>
+#include <linux/err.h>
+
+#include <sys/utsname.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <termios.h>
+#include <signal.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <ctype.h>
+#include <stdio.h>
+
+#define MB_SHIFT (20)
+#define KB_SHIFT (10)
+#define GB_SHIFT (30)
+
+__thread struct kvm_cpu *current_kvm_cpu;
+
+static int kvm_run_wrapper;
+
+bool do_debug_print = false;
+
+extern char _binary_guest_init_start;
+extern char _binary_guest_init_size;
+
+static const char * const run_usage[] = {
+ "lkvm run [<options>] [<kernel image>]",
+ NULL
+};
+
+enum {
+ KVM_RUN_DEFAULT,
+ KVM_RUN_SANDBOX,
+};
+
+static int img_name_parser(const struct option *opt, const char *arg, int unset)
+{
+ char path[PATH_MAX];
+ struct stat st;
+
+ snprintf(path, PATH_MAX, "%s%s", kvm__get_dir(), arg);
+
+ if ((stat(arg, &st) == 0 && S_ISDIR(st.st_mode)) ||
+ (stat(path, &st) == 0 && S_ISDIR(st.st_mode)))
+ return virtio_9p_img_name_parser(opt, arg, unset);
+ return disk_img_name_parser(opt, arg, unset);
+}
+
+void kvm_run_set_wrapper_sandbox(void)
+{
+ kvm_run_wrapper = KVM_RUN_SANDBOX;
+}
+
+#define BUILD_OPTIONS(name, cfg, kvm) \
+ struct option name[] = { \
+ OPT_GROUP("Basic options:"), \
+ OPT_STRING('\0', "name", &(cfg)->guest_name, "guest name", \
+ "A name for the guest"), \
+ OPT_INTEGER('c', "cpus", &(cfg)->nrcpus, "Number of CPUs"), \
+ OPT_U64('m', "mem", &(cfg)->ram_size, "Virtual machine memory" \
+ " size in MiB."), \
+ OPT_CALLBACK('\0', "shmem", NULL, \
+ "[pci:]<addr>:<size>[:handle=<handle>][:create]", \
+ "Share host shmem with guest via pci device", \
+ shmem_parser, NULL), \
+ OPT_CALLBACK('d', "disk", kvm, "image or rootfs_dir", "Disk " \
+ "image or rootfs directory", img_name_parser, \
+ kvm), \
+ OPT_BOOLEAN('\0', "balloon", &(cfg)->balloon, "Enable virtio" \
+ " balloon"), \
+ OPT_BOOLEAN('\0', "vnc", &(cfg)->vnc, "Enable VNC framebuffer"),\
+ OPT_BOOLEAN('\0', "sdl", &(cfg)->sdl, "Enable SDL framebuffer"),\
+ OPT_BOOLEAN('\0', "rng", &(cfg)->virtio_rng, "Enable virtio" \
+ " Random Number Generator"), \
+ OPT_CALLBACK('\0', "9p", NULL, "dir_to_share,tag_name", \
+ "Enable virtio 9p to share files between host and" \
+ " guest", virtio_9p_rootdir_parser, kvm), \
+ OPT_STRING('\0', "console", &(cfg)->console, "serial, virtio or"\
+ " hv", "Console to use"), \
+ OPT_STRING('\0', "dev", &(cfg)->dev, "device_file", \
+ "KVM device file"), \
+ OPT_CALLBACK('\0', "tty", NULL, "tty id", \
+ "Remap guest TTY into a pty on the host", \
+ tty_parser, NULL), \
+ OPT_STRING('\0', "sandbox", &(cfg)->sandbox, "script", \
+ "Run this script when booting into custom" \
+ " rootfs"), \
+ OPT_STRING('\0', "hugetlbfs", &(cfg)->hugetlbfs_path, "path", \
+ "Hugetlbfs path"), \
+ \
+ OPT_GROUP("Kernel options:"), \
+ OPT_STRING('k', "kernel", &(cfg)->kernel_filename, "kernel", \
+ "Kernel to boot in virtual machine"), \
+ OPT_STRING('i', "initrd", &(cfg)->initrd_filename, "initrd", \
+ "Initial RAM disk image"), \
+ OPT_STRING('p', "params", &(cfg)->kernel_cmdline, "params", \
+ "Kernel command line arguments"), \
+ OPT_STRING('f', "firmware", &(cfg)->firmware_filename, "firmware",\
+ "Firmware image to boot in virtual machine"), \
+ \
+ OPT_GROUP("Networking options:"), \
+ OPT_CALLBACK_DEFAULT('n', "network", NULL, "network params", \
+ "Create a new guest NIC", \
+ netdev_parser, NULL, kvm), \
+ OPT_BOOLEAN('\0', "no-dhcp", &(cfg)->no_dhcp, "Disable kernel" \
+ " DHCP in rootfs mode"), \
+ \
+ OPT_GROUP("BIOS options:"), \
+ OPT_INTEGER('\0', "vidmode", &(cfg)->vidmode, \
+ "Video mode"), \
+ \
+ OPT_GROUP("Debug options:"), \
+ OPT_BOOLEAN('\0', "debug", &do_debug_print, \
+ "Enable debug messages"), \
+ OPT_BOOLEAN('\0', "debug-single-step", &(cfg)->single_step, \
+ "Enable single stepping"), \
+ OPT_BOOLEAN('\0', "debug-ioport", &(cfg)->ioport_debug, \
+ "Enable ioport debugging"), \
+ OPT_BOOLEAN('\0', "debug-mmio", &(cfg)->mmio_debug, \
+ "Enable MMIO debugging"), \
+ OPT_INTEGER('\0', "debug-iodelay", &(cfg)->debug_iodelay, \
+ "Delay IO by the given number of milliseconds"), \
+ OPT_END() \
+ };
+
+static void handle_sigalrm(int sig, siginfo_t *si, void *uc)
+{
+ struct kvm *kvm = si->si_value.sival_ptr;
+
+ kvm__arch_periodic_poll(kvm);
+}
+
+static void *kvm_cpu_thread(void *arg)
+{
+ current_kvm_cpu = arg;
+
+ if (kvm_cpu__start(current_kvm_cpu))
+ goto panic_kvm;
+
+ return (void *) (intptr_t) 0;
+
+panic_kvm:
+ fprintf(stderr, "KVM exit reason: %u (\"%s\")\n",
+ current_kvm_cpu->kvm_run->exit_reason,
+ kvm_exit_reasons[current_kvm_cpu->kvm_run->exit_reason]);
+ if (current_kvm_cpu->kvm_run->exit_reason == KVM_EXIT_UNKNOWN)
+ fprintf(stderr, "KVM exit code: 0x%llx\n",
+ (unsigned long long)current_kvm_cpu->kvm_run->hw.hardware_exit_reason);
+
+ kvm_cpu__set_debug_fd(STDOUT_FILENO);
+ kvm_cpu__show_registers(current_kvm_cpu);
+ kvm_cpu__show_code(current_kvm_cpu);
+ kvm_cpu__show_page_tables(current_kvm_cpu);
+
+ return (void *) (intptr_t) 1;
+}
+
+static char kernel[PATH_MAX];
+
+static const char *host_kernels[] = {
+ "/boot/vmlinuz",
+ "/boot/bzImage",
+ NULL
+};
+
+static const char *default_kernels[] = {
+ "./bzImage",
+ "arch/" BUILD_ARCH "/boot/bzImage",
+ "../../arch/" BUILD_ARCH "/boot/bzImage",
+ NULL
+};
+
+static const char *default_vmlinux[] = {
+ "vmlinux",
+ "../../../vmlinux",
+ "../../vmlinux",
+ NULL
+};
+
+static void kernel_usage_with_options(void)
+{
+ const char **k;
+ struct utsname uts;
+
+ fprintf(stderr, "Fatal: could not find default kernel image in:\n");
+ k = &default_kernels[0];
+ while (*k) {
+ fprintf(stderr, "\t%s\n", *k);
+ k++;
+ }
+
+ if (uname(&uts) < 0)
+ return;
+
+ k = &host_kernels[0];
+ while (*k) {
+ if (snprintf(kernel, PATH_MAX, "%s-%s", *k, uts.release) < 0)
+ return;
+ fprintf(stderr, "\t%s\n", kernel);
+ k++;
+ }
+ fprintf(stderr, "\nPlease see '%s run --help' for more options.\n\n",
+ KVM_BINARY_NAME);
+}
+
+static u64 host_ram_size(void)
+{
+ long page_size;
+ long nr_pages;
+
+ nr_pages = sysconf(_SC_PHYS_PAGES);
+ if (nr_pages < 0) {
+ pr_warning("sysconf(_SC_PHYS_PAGES) failed");
+ return 0;
+ }
+
+ page_size = sysconf(_SC_PAGE_SIZE);
+ if (page_size < 0) {
+ pr_warning("sysconf(_SC_PAGE_SIZE) failed");
+ return 0;
+ }
+
+ return ((u64)nr_pages * page_size) >> MB_SHIFT;
+}
+
+/*
+ * If the user didn't specify how much memory to allocate for the guest,
+ * default to 64 MiB per VCPU plus 192 MiB, capped at RAM_SIZE_RATIO of the
+ * host RAM to avoid filling the whole host RAM.
+ */
+#define RAM_SIZE_RATIO 0.8
+
+static u64 get_ram_size(int nr_cpus)
+{
+ u64 available;
+ u64 ram_size;
+
+ ram_size = 64 * (nr_cpus + 3);
+
+ available = host_ram_size() * RAM_SIZE_RATIO;
+ if (!available)
+ available = MIN_RAM_SIZE_MB;
+
+ if (ram_size > available)
+ ram_size = available;
+
+ return ram_size;
+}
+
+static const char *find_kernel(void)
+{
+ const char **k;
+ struct stat st;
+ struct utsname uts;
+
+ k = &default_kernels[0];
+ while (*k) {
+ if (stat(*k, &st) < 0 || !S_ISREG(st.st_mode)) {
+ k++;
+ continue;
+ }
+ strncpy(kernel, *k, PATH_MAX);
+ return kernel;
+ }
+
+ if (uname(&uts) < 0)
+ return NULL;
+
+ k = &host_kernels[0];
+ while (*k) {
+ if (snprintf(kernel, PATH_MAX, "%s-%s", *k, uts.release) < 0)
+ return NULL;
+
+ if (stat(kernel, &st) < 0 || !S_ISREG(st.st_mode)) {
+ k++;
+ continue;
+ }
+ return kernel;
+
+ }
+ return NULL;
+}
+
+static const char *find_vmlinux(void)
+{
+ const char **vmlinux;
+
+ vmlinux = &default_vmlinux[0];
+ while (*vmlinux) {
+ struct stat st;
+
+ if (stat(*vmlinux, &st) < 0 || !S_ISREG(st.st_mode)) {
+ vmlinux++;
+ continue;
+ }
+ return *vmlinux;
+ }
+ return NULL;
+}
+
+void kvm_run_help(void)
+{
+ struct kvm *kvm = NULL;
+
+ BUILD_OPTIONS(options, &kvm->cfg, kvm);
+ usage_with_options(run_usage, options);
+}
+
+static int kvm_setup_guest_init(struct kvm *kvm)
+{
+ const char *rootfs = kvm->cfg.custom_rootfs_name;
+ char tmp[PATH_MAX];
+ size_t size;
+ int fd, ret;
+ char *data;
+
+ /* Setup /virt/init */
+ size = (size_t)&_binary_guest_init_size;
+ data = (char *)&_binary_guest_init_start;
+ snprintf(tmp, PATH_MAX, "%s%s/virt/init", kvm__get_dir(), rootfs);
+ remove(tmp);
+ fd = open(tmp, O_CREAT | O_WRONLY, 0755);
+ if (fd < 0)
+ die("Failed to set up %s", tmp);
+ ret = xwrite(fd, data, size);
+ if (ret < 0)
+ die("Failed to set up %s", tmp);
+ close(fd);
+
+ return 0;
+}
+
+static int kvm_run_set_sandbox(struct kvm *kvm)
+{
+ const char *guestfs_name = kvm->cfg.custom_rootfs_name;
+ char path[PATH_MAX], script[PATH_MAX], *tmp;
+
+ snprintf(path, PATH_MAX, "%s%s/virt/sandbox.sh", kvm__get_dir(), guestfs_name);
+
+ remove(path);
+
+ if (kvm->cfg.sandbox == NULL)
+ return 0;
+
+ tmp = realpath(kvm->cfg.sandbox, NULL);
+ if (tmp == NULL)
+ return -errno;
+
+ snprintf(script, PATH_MAX, "/host%s", tmp);
+ free(tmp);
+
+ return symlink(script, path);
+}
+
+/*
+ * Write arg to fd quoted for the shell: every run of characters without a
+ * single quote is wrapped in '...', and each literal single quote is written
+ * as "'". For example, it's is written as 'it'"'"'s'.
+ */
+static void kvm_write_sandbox_cmd_exactly(int fd, const char *arg)
+{
+ const char *single_quote;
+
+ if (!*arg) { /* zero length string */
+ if (write(fd, "''", 2) <= 0)
+ die("Failed writing sandbox script");
+ return;
+ }
+
+ while (*arg) {
+ single_quote = strchrnul(arg, '\'');
+
+ /* write non-single-quote string as #('string') */
+ if (arg != single_quote) {
+ if (write(fd, "'", 1) <= 0 ||
+ write(fd, arg, single_quote - arg) <= 0 ||
+ write(fd, "'", 1) <= 0)
+ die("Failed writing sandbox script");
+ }
+
+ /* write single quote as #("'") */
+ if (*single_quote) {
+ if (write(fd, "\"'\"", 3) <= 0)
+ die("Failed writing sandbox script");
+ } else
+ break;
+
+ arg = single_quote + 1;
+ }
+}
+
+static void resolve_program(const char *src, char *dst, size_t len)
+{
+ struct stat st;
+ int err;
+
+ err = stat(src, &st);
+
+ if (!err && S_ISREG(st.st_mode)) {
+ char resolved_path[PATH_MAX];
+
+ if (!realpath(src, resolved_path))
+ die("Unable to resolve program %s: %s\n", src, strerror(errno));
+
+ snprintf(dst, len, "/host%s", resolved_path);
+ } else
+ snprintf(dst, len, "%s", src);
+}
+
+static void kvm_run_write_sandbox_cmd(struct kvm *kvm, const char **argv, int argc)
+{
+ const char script_hdr[] = "#! /bin/bash\n\n";
+ char program[PATH_MAX];
+ int fd;
+
+ remove(kvm->cfg.sandbox);
+
+ fd = open(kvm->cfg.sandbox, O_RDWR | O_CREAT, 0777);
+ if (fd < 0)
+ die("Failed creating sandbox script");
+
+ if (write(fd, script_hdr, sizeof(script_hdr) - 1) <= 0)
+ die("Failed writing sandbox script");
+
+ resolve_program(argv[0], program, PATH_MAX);
+ kvm_write_sandbox_cmd_exactly(fd, program);
+
+ argv++;
+ argc--;
+
+ while (argc) {
+ if (write(fd, " ", 1) <= 0)
+ die("Failed writing sandbox script");
+
+ kvm_write_sandbox_cmd_exactly(fd, argv[0]);
+ argv++;
+ argc--;
+ }
+ if (write(fd, "\n", 1) <= 0)
+ die("Failed writing sandbox script");
+
+ close(fd);
+}
+
+static struct kvm *kvm_cmd_run_init(int argc, const char **argv)
+{
+ static char real_cmdline[2048], default_name[20];
+ unsigned int nr_online_cpus;
+ struct sigaction sa;
+ struct kvm *kvm = kvm__new();
+
+ if (IS_ERR(kvm))
+ return kvm;
+
+ sa.sa_flags = SA_SIGINFO;
+ sa.sa_sigaction = handle_sigalrm;
+ sigemptyset(&sa.sa_mask);
+ sigaction(SIGALRM, &sa, NULL);
+
+ nr_online_cpus = sysconf(_SC_NPROCESSORS_ONLN);
+ kvm->cfg.custom_rootfs_name = "default";
+
+ while (argc != 0) {
+ BUILD_OPTIONS(options, &kvm->cfg, kvm);
+ argc = parse_options(argc, argv, options, run_usage,
+ PARSE_OPT_STOP_AT_NON_OPTION |
+ PARSE_OPT_KEEP_DASHDASH);
+ if (argc != 0) {
+ /* Custom options, should have been handled elsewhere */
+ if (strcmp(argv[0], "--") == 0) {
+ if (kvm_run_wrapper == KVM_RUN_SANDBOX) {
+ kvm->cfg.sandbox = DEFAULT_SANDBOX_FILENAME;
+ kvm_run_write_sandbox_cmd(kvm, argv+1, argc-1);
+ break;
+ }
+ }
+
+ if ((kvm_run_wrapper == KVM_RUN_DEFAULT && kvm->cfg.kernel_filename) ||
+ (kvm_run_wrapper == KVM_RUN_SANDBOX && kvm->cfg.sandbox)) {
+ fprintf(stderr, "Cannot handle parameter: "
+ "%s\n", argv[0]);
+ usage_with_options(run_usage, options);
+ free(kvm);
+ return ERR_PTR(-EINVAL);
+ }
+ if (kvm_run_wrapper == KVM_RUN_SANDBOX) {
+ /*
+ * first unhandled parameter is treated as
+ * sandbox command
+ */
+ kvm->cfg.sandbox = DEFAULT_SANDBOX_FILENAME;
+ kvm_run_write_sandbox_cmd(kvm, argv, argc);
+ } else {
+ /*
+ * first unhandled parameter is treated as a kernel
+ * image
+ */
+ kvm->cfg.kernel_filename = argv[0];
+ }
+ argv++;
+ argc--;
+ }
+
+ }
+
+ kvm->nr_disks = kvm->cfg.image_count;
+
+ if (!kvm->cfg.kernel_filename)
+ kvm->cfg.kernel_filename = find_kernel();
+
+ if (!kvm->cfg.kernel_filename) {
+ kernel_usage_with_options();
+ return ERR_PTR(-EINVAL);
+ }
+
+ kvm->cfg.vmlinux_filename = find_vmlinux();
+ kvm->vmlinux = kvm->cfg.vmlinux_filename;
+
+ if (kvm->cfg.nrcpus == 0)
+ kvm->cfg.nrcpus = nr_online_cpus;
+
+ if (!kvm->cfg.ram_size)
+ kvm->cfg.ram_size = get_ram_size(kvm->cfg.nrcpus);
+
+ if (kvm->cfg.ram_size < MIN_RAM_SIZE_MB)
+ die("Not enough memory specified: %lluMB (min %lluMB)", kvm->cfg.ram_size, MIN_RAM_SIZE_MB);
+
+ if (kvm->cfg.ram_size > host_ram_size())
+ pr_warning("Guest memory size %lluMB exceeds host physical RAM size %lluMB", kvm->cfg.ram_size, host_ram_size());
+
+ kvm->cfg.ram_size <<= MB_SHIFT;
+
+ if (!kvm->cfg.dev)
+ kvm->cfg.dev = DEFAULT_KVM_DEV;
+
+ if (!kvm->cfg.console)
+ kvm->cfg.console = DEFAULT_CONSOLE;
+
+ if (!strncmp(kvm->cfg.console, "virtio", 6))
+ kvm->cfg.active_console = CONSOLE_VIRTIO;
+ else if (!strncmp(kvm->cfg.console, "serial", 6))
+ kvm->cfg.active_console = CONSOLE_8250;
+ else if (!strncmp(kvm->cfg.console, "hv", 2))
+ kvm->cfg.active_console = CONSOLE_HV;
+ else
+ pr_warning("No console!");
+
+ if (!kvm->cfg.host_ip)
+ kvm->cfg.host_ip = DEFAULT_HOST_ADDR;
+
+ if (!kvm->cfg.guest_ip)
+ kvm->cfg.guest_ip = DEFAULT_GUEST_ADDR;
+
+ if (!kvm->cfg.guest_mac)
+ kvm->cfg.guest_mac = DEFAULT_GUEST_MAC;
+
+ if (!kvm->cfg.host_mac)
+ kvm->cfg.host_mac = DEFAULT_HOST_MAC;
+
+ if (!kvm->cfg.script)
+ kvm->cfg.script = DEFAULT_SCRIPT;
+
+ if (!kvm->cfg.vnc && !kvm->cfg.sdl)
+ kvm->cfg.vidmode = -1;
+
+ if (!kvm->cfg.network)
+ kvm->cfg.network = DEFAULT_NETWORK;
+
+ memset(real_cmdline, 0, sizeof(real_cmdline));
+ kvm__arch_set_cmdline(real_cmdline, kvm->cfg.vnc || kvm->cfg.sdl);
+
+ if (strlen(real_cmdline) > 0)
+ strcat(real_cmdline, " ");
+
+ if (kvm->cfg.kernel_cmdline)
+ strlcat(real_cmdline, kvm->cfg.kernel_cmdline, sizeof(real_cmdline));
+
+ if (!kvm->cfg.guest_name) {
+ if (kvm->cfg.custom_rootfs) {
+ kvm->cfg.guest_name = kvm->cfg.custom_rootfs_name;
+ } else {
+ sprintf(default_name, "guest-%u", getpid());
+ kvm->cfg.guest_name = default_name;
+ }
+ }
+
+ if (!kvm->cfg.using_rootfs && !kvm->cfg.disk_image[0].filename && !kvm->cfg.initrd_filename) {
+ char tmp[PATH_MAX];
+
+ kvm_setup_create_new(kvm->cfg.custom_rootfs_name);
+ kvm_setup_resolv(kvm->cfg.custom_rootfs_name);
+
+ snprintf(tmp, PATH_MAX, "%s%s", kvm__get_dir(), "default");
+ if (virtio_9p__register(kvm, tmp, "/dev/root") < 0)
+ die("Unable to initialize virtio 9p");
+ if (virtio_9p__register(kvm, "/", "hostfs") < 0)
+ die("Unable to initialize virtio 9p");
+ kvm->cfg.using_rootfs = kvm->cfg.custom_rootfs = 1;
+ }
+
+ if (kvm->cfg.using_rootfs) {
+ strlcat(real_cmdline, " root=/dev/root rw rootflags=rw,trans=virtio,version=9p2000.L rootfstype=9p", sizeof(real_cmdline));
+ if (kvm->cfg.custom_rootfs) {
+ kvm_run_set_sandbox(kvm);
+
+ strlcat(real_cmdline, " init=/virt/init", sizeof(real_cmdline));
+
+ if (!kvm->cfg.no_dhcp)
+ strlcat(real_cmdline, " ip=dhcp", sizeof(real_cmdline));
+ if (kvm_setup_guest_init(kvm))
+ die("Failed to setup init for guest.");
+ }
+ } else if (!strstr(real_cmdline, "root=")) {
+ strlcat(real_cmdline, " root=/dev/vda rw ", sizeof(real_cmdline));
+ }
+
+ kvm->cfg.real_cmdline = real_cmdline;
+
+ printf(" # %s run -k %s -m %llu -c %d --name %s\n", KVM_BINARY_NAME,
+ kvm->cfg.kernel_filename, kvm->cfg.ram_size / 1024 / 1024, kvm->cfg.nrcpus, kvm->cfg.guest_name);
+
+ init_list__init(kvm);
+
+ return kvm;
+}
+
+static int kvm_cmd_run_work(struct kvm *kvm)
+{
+ int i;
+ void *ret = NULL;
+
+ for (i = 0; i < kvm->nrcpus; i++) {
+ if (pthread_create(&kvm->cpus[i]->thread, NULL, kvm_cpu_thread, kvm->cpus[i]) != 0)
+ die("unable to create KVM VCPU thread");
+ }
+
+ /* Only VCPU #0 is going to exit by itself when shutting down */
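+ if (pthread_join(kvm->cpus[0]->thread, &ret) != 0)
+ die("unable to join KVM VCPU #0 thread");
+
+ /* Propagate the VCPU thread's exit status (0 on clean shutdown) */
+ return (int)(intptr_t)ret;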
+}
+
+static void kvm_cmd_run_exit(struct kvm *kvm, int guest_ret)
+{
+ compat__print_all_messages();
+
+ init_list__exit(kvm);
+
+ if (guest_ret == 0)
+ printf("\n # KVM session ended normally.\n");
+}
+
+int kvm_cmd_run(int argc, const char **argv, const char *prefix)
+{
+ int ret = -EFAULT;
+ struct kvm *kvm;
+
+ kvm = kvm_cmd_run_init(argc, argv);
+ if (IS_ERR(kvm))
+ return PTR_ERR(kvm);
+
+ ret = kvm_cmd_run_work(kvm);
+ kvm_cmd_run_exit(kvm, ret);
+
+ return ret;
+}
--- /dev/null
+#include "kvm/builtin-sandbox.h"
+#include "kvm/builtin-run.h"
+
+int kvm_cmd_sandbox(int argc, const char **argv, const char *prefix)
+{
+ kvm_run_set_wrapper_sandbox();
+
+ return kvm_cmd_run(argc, argv, prefix);
+}
--- /dev/null
+#include <kvm/util.h>
+#include <kvm/kvm-cmd.h>
+#include <kvm/builtin-setup.h>
+#include <kvm/kvm.h>
+#include <kvm/parse-options.h>
+#include <kvm/read-write.h>
+
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <limits.h>
+#include <signal.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <sys/mman.h>
+#include <fcntl.h>
+#include <errno.h>
+
+extern char _binary_guest_init_start;
+extern char _binary_guest_init_size;
+
+static const char *instance_name;
+
+static const char * const setup_usage[] = {
+ "lkvm setup [name]",
+ NULL
+};
+
+static const struct option setup_options[] = {
+ OPT_END()
+};
+
+static void parse_setup_options(int argc, const char **argv)
+{
+ while (argc != 0) {
+ argc = parse_options(argc, argv, setup_options, setup_usage,
+ PARSE_OPT_STOP_AT_NON_OPTION);
+ if (argc == 0)
+ break;
+ if (instance_name)
+ kvm_setup_help();
+ instance_name = argv[0];
+ argv++;
+ argc--;
+ }
+}
+
+void kvm_setup_help(void)
+{
+ printf("\n%s setup creates a new rootfs under %s.\n"
+ "This can be used later by the '-d' parameter of '%s run'.\n",
+ KVM_BINARY_NAME, kvm__get_dir(), KVM_BINARY_NAME);
+ usage_with_options(setup_usage, setup_options);
+}
+
+static int copy_file(const char *from, const char *to)
+{
+ int in_fd, out_fd;
+ void *src, *dst;
+ struct stat st;
+ int err = -1;
+
+ in_fd = open(from, O_RDONLY);
+ if (in_fd < 0)
+ return err;
+
+ if (fstat(in_fd, &st) < 0)
+ goto error_close_in;
+
+ out_fd = open(to, O_RDWR | O_CREAT | O_TRUNC, st.st_mode & (S_IRWXU|S_IRWXG|S_IRWXO));
+ if (out_fd < 0)
+ goto error_close_in;
+
+ src = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, in_fd, 0);
+ if (src == MAP_FAILED)
+ goto error_close_out;
+
+ if (ftruncate(out_fd, st.st_size) < 0)
+ goto error_munmap_src;
+
+ dst = mmap(NULL, st.st_size, PROT_READ|PROT_WRITE, MAP_SHARED, out_fd, 0);
+ if (dst == MAP_FAILED)
+ goto error_munmap_src;
+
+ memcpy(dst, src, st.st_size);
+
+ if (fsync(out_fd) < 0)
+ goto error_munmap_dst;
+
+ err = 0;
+
+error_munmap_dst:
+ munmap(dst, st.st_size);
+error_munmap_src:
+ munmap(src, st.st_size);
+error_close_out:
+ close(out_fd);
+error_close_in:
+ close(in_fd);
+
+ return err;
+}
+
+static const char *guestfs_dirs[] = {
+ "/dev",
+ "/etc",
+ "/home",
+ "/host",
+ "/proc",
+ "/root",
+ "/sys",
+ "/tmp",
+ "/var",
+ "/var/lib",
+ "/virt",
+ "/virt/home",
+};
+
+static const char *guestfs_symlinks[] = {
+ "/bin",
+ "/lib",
+ "/lib64",
+ "/sbin",
+ "/usr",
+ "/etc/ld.so.conf",
+};
+
+static int copy_init(const char *guestfs_name)
+{
+ char path[PATH_MAX];
+ size_t size;
+ int fd, ret;
+ char *data;
+
+ size = (size_t)&_binary_guest_init_size;
+ data = (char *)&_binary_guest_init_start;
+ snprintf(path, PATH_MAX, "%s%s/virt/init", kvm__get_dir(), guestfs_name);
+ remove(path);
+ fd = open(path, O_CREAT | O_WRONLY, 0755);
+ if (fd < 0)
+ die("Failed to set up %s", path);
+ ret = xwrite(fd, data, size);
+ if (ret < 0)
+ die("Failed to set up %s", path);
+ close(fd);
+
+ return 0;
+}
+
+static int copy_passwd(const char *guestfs_name)
+{
+ char path[PATH_MAX];
+ FILE *file;
+ int ret;
+
+ snprintf(path, PATH_MAX, "%s%s/etc/passwd", kvm__get_dir(), guestfs_name);
+
+ file = fopen(path, "w");
+ if (!file)
+ return -1;
+
+ ret = fprintf(file, "root:x:0:0:root:/root:/bin/sh\n");
+
+ fclose(file);
+
+ return ret < 0 ? ret : 0;
+}
+
+static int make_guestfs_symlink(const char *guestfs_name, const char *path)
+{
+ char target[PATH_MAX];
+ char name[PATH_MAX];
+
+ snprintf(name, PATH_MAX, "%s%s%s", kvm__get_dir(), guestfs_name, path);
+
+ snprintf(target, PATH_MAX, "/host%s", path);
+
+ return symlink(target, name);
+}
+
+static int make_dir(const char *dir)
+{
+ char name[PATH_MAX];
+
+ snprintf(name, PATH_MAX, "%s%s", kvm__get_dir(), dir);
+
+ return mkdir(name, 0777);
+}
+
+static void make_guestfs_dir(const char *guestfs_name, const char *dir)
+{
+ char name[PATH_MAX];
+
+ snprintf(name, PATH_MAX, "%s%s", guestfs_name, dir);
+
+ make_dir(name);
+}
+
+void kvm_setup_resolv(const char *guestfs_name)
+{
+ char path[PATH_MAX];
+
+ snprintf(path, PATH_MAX, "%s%s/etc/resolv.conf", kvm__get_dir(), guestfs_name);
+
+ copy_file("/etc/resolv.conf", path);
+}
+
+static int do_setup(const char *guestfs_name)
+{
+ unsigned int i;
+ int ret;
+
+ ret = make_dir(guestfs_name);
+ if (ret < 0)
+ return ret;
+
+ for (i = 0; i < ARRAY_SIZE(guestfs_dirs); i++)
+ make_guestfs_dir(guestfs_name, guestfs_dirs[i]);
+
+ for (i = 0; i < ARRAY_SIZE(guestfs_symlinks); i++) {
+ make_guestfs_symlink(guestfs_name, guestfs_symlinks[i]);
+ }
+
+ ret = copy_init(guestfs_name);
+ if (ret < 0)
+ return ret;
+
+ return copy_passwd(guestfs_name);
+}
+
+int kvm_setup_create_new(const char *guestfs_name)
+{
+ return do_setup(guestfs_name);
+}
+
+int kvm_cmd_setup(int argc, const char **argv, const char *prefix)
+{
+ int r;
+
+ parse_setup_options(argc, argv);
+
+ if (instance_name == NULL)
+ kvm_setup_help();
+
+ r = do_setup(instance_name);
+ if (r == 0)
+ printf("A new rootfs '%s' has been created in '%s%s'.\n\n"
+ "You can now start it by running the following command:\n\n"
+ " %s run -d %s\n",
+ instance_name, kvm__get_dir(), instance_name,
+ KVM_BINARY_NAME, instance_name);
+ else
+ printf("Unable to create rootfs in %s%s: %s\n",
+ kvm__get_dir(), instance_name, strerror(errno));
+
+ return r;
+}
--- /dev/null
+#include <kvm/util.h>
+#include <kvm/kvm-cmd.h>
+#include <kvm/builtin-stat.h>
+#include <kvm/kvm.h>
+#include <kvm/parse-options.h>
+#include <kvm/kvm-ipc.h>
+
+#include <sys/select.h>
+#include <stdio.h>
+#include <string.h>
+#include <signal.h>
+
+#include <linux/virtio_balloon.h>
+
+static bool mem;
+static bool all;
+static const char *instance_name;
+
+static const char * const stat_usage[] = {
+ "lkvm stat [command] [--all] [-n name]",
+ NULL
+};
+
+static const struct option stat_options[] = {
+ OPT_GROUP("Command options:"),
+ OPT_BOOLEAN('m', "memory", &mem, "Display memory statistics"),
+ OPT_GROUP("Instance options:"),
+ OPT_BOOLEAN('a', "all", &all, "All instances"),
+ OPT_STRING('n', "name", &instance_name, "name", "Instance name"),
+ OPT_END()
+};
+
+static void parse_stat_options(int argc, const char **argv)
+{
+ while (argc != 0) {
+ argc = parse_options(argc, argv, stat_options, stat_usage,
+ PARSE_OPT_STOP_AT_NON_OPTION);
+ if (argc != 0)
+ kvm_stat_help();
+ }
+}
+
+void kvm_stat_help(void)
+{
+ usage_with_options(stat_usage, stat_options);
+}
+
+static int do_memstat(const char *name, int sock)
+{
+ struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
+ fd_set fdset;
+ struct timeval t = { .tv_sec = 1 };
+ int r;
+ u8 i;
+
+ FD_ZERO(&fdset);
+ FD_SET(sock, &fdset);
+ r = kvm_ipc__send(sock, KVM_IPC_STAT);
+ if (r < 0)
+ return r;
+
+ r = select(sock + 1, &fdset, NULL, NULL, &t);
+ if (r < 0) {
+ pr_err("Could not retrieve mem stats from %s", name);
+ return r;
+ }
+ r = read(sock, &stats, sizeof(stats));
+ if (r < 0)
+ return r;
+
+ printf("\n\n\t*** Guest memory statistics ***\n\n");
+ for (i = 0; i < VIRTIO_BALLOON_S_NR; i++) {
+ switch (stats[i].tag) {
+ case VIRTIO_BALLOON_S_SWAP_IN:
+ printf("The amount of memory that has been swapped in (in bytes):");
+ break;
+ case VIRTIO_BALLOON_S_SWAP_OUT:
+ printf("The amount of memory that has been swapped out to disk (in bytes):");
+ break;
+ case VIRTIO_BALLOON_S_MAJFLT:
+ printf("The number of major page faults that have occurred:");
+ break;
+ case VIRTIO_BALLOON_S_MINFLT:
+ printf("The number of minor page faults that have occurred:");
+ break;
+ case VIRTIO_BALLOON_S_MEMFREE:
+ printf("The amount of memory not being used for any purpose (in bytes):");
+ break;
+ case VIRTIO_BALLOON_S_MEMTOT:
+ printf("The total amount of memory available (in bytes):");
+ break;
+ }
+ printf("%llu\n", stats[i].val);
+ }
+ printf("\n");
+
+ return 0;
+}
+
+int kvm_cmd_stat(int argc, const char **argv, const char *prefix)
+{
+ int instance;
+ int r = 0;
+
+ parse_stat_options(argc, argv);
+
+ if (!mem)
+ usage_with_options(stat_usage, stat_options);
+
+ if (mem && all)
+ return kvm__enumerate_instances(do_memstat);
+
+ if (instance_name == NULL)
+ kvm_stat_help();
+
+ instance = kvm__get_sock_by_instance(instance_name);
+
+ if (instance <= 0)
+ die("Failed locating instance");
+
+ if (mem)
+ r = do_memstat(instance_name, instance);
+
+ close(instance);
+
+ return r;
+}
--- /dev/null
+#include <kvm/util.h>
+#include <kvm/kvm-cmd.h>
+#include <kvm/builtin-stop.h>
+#include <kvm/kvm.h>
+#include <kvm/parse-options.h>
+#include <kvm/kvm-ipc.h>
+
+#include <stdio.h>
+#include <string.h>
+#include <signal.h>
+
+static bool all;
+static const char *instance_name;
+
+static const char * const stop_usage[] = {
+ "lkvm stop [--all] [-n name]",
+ NULL
+};
+
+static const struct option stop_options[] = {
+ OPT_GROUP("General options:"),
+ OPT_BOOLEAN('a', "all", &all, "Stop all instances"),
+ OPT_STRING('n', "name", &instance_name, "name", "Instance name"),
+ OPT_END()
+};
+
+static void parse_stop_options(int argc, const char **argv)
+{
+ while (argc != 0) {
+ argc = parse_options(argc, argv, stop_options, stop_usage,
+ PARSE_OPT_STOP_AT_NON_OPTION);
+ if (argc != 0)
+ kvm_stop_help();
+ }
+}
+
+void kvm_stop_help(void)
+{
+ usage_with_options(stop_usage, stop_options);
+}
+
+static int do_stop(const char *name, int sock)
+{
+ return kvm_ipc__send(sock, KVM_IPC_STOP);
+}
+
+int kvm_cmd_stop(int argc, const char **argv, const char *prefix)
+{
+ int instance;
+ int r;
+
+ parse_stop_options(argc, argv);
+
+ if (all)
+ return kvm__enumerate_instances(do_stop);
+
+ if (instance_name == NULL)
+ kvm_stop_help();
+
+ instance = kvm__get_sock_by_instance(instance_name);
+
+ if (instance <= 0)
+ die("Failed locating instance");
+
+ r = do_stop(instance_name, instance);
+
+ close(instance);
+
+ return r;
+}
--- /dev/null
+#include <kvm/util.h>
+#include <kvm/kvm-cmd.h>
+#include <kvm/builtin-version.h>
+#include <kvm/kvm.h>
+
+#include <stdio.h>
+#include <string.h>
+#include <signal.h>
+
+int kvm_cmd_version(int argc, const char **argv, const char *prefix)
+{
+ printf("kvm tool %s\n", KVMTOOLS_VERSION);
+
+ return 0;
+}
--- /dev/null
+/*
+ * code16gcc.h
+ *
+ * This file is -include'd when compiling 16-bit C code.
+ * Note: this asm() needs to be emitted before gcc emits any code.
+ * Depending on gcc version, this requires -fno-unit-at-a-time or
+ * -fno-toplevel-reorder.
+ *
+ * Hopefully gcc will eventually have a real -m16 option so we can
+ * drop this hack long term.
+ */
+
+#ifndef __ASSEMBLY__
+asm(".code16gcc");
+#endif
--- /dev/null
+#
+# List of known lkvm commands.
+# command name category [deprecated] [common]
+#
+lkvm-run mainporcelain common
+lkvm-setup mainporcelain common
+lkvm-pause common
+lkvm-resume common
+lkvm-version common
+lkvm-list common
+lkvm-debug common
+lkvm-balloon common
+lkvm-stop common
+lkvm-stat common
+lkvm-sandbox common
--- /dev/null
+define SOURCE_HELLO
+#include <stdio.h>
+int main(void)
+{
+ return puts(\"hi\");
+}
+endef
+
+ifndef NO_DWARF
+define SOURCE_DWARF
+#include <dwarf.h>
+#include <elfutils/libdw.h>
+#include <elfutils/version.h>
+#ifndef _ELFUTILS_PREREQ
+#error
+#endif
+
+int main(void)
+{
+ Dwarf *dbg = dwarf_begin(0, DWARF_C_READ);
+ return (long)dbg;
+}
+endef
+endif
+
+define SOURCE_LIBELF
+#include <libelf.h>
+
+int main(void)
+{
+ Elf *elf = elf_begin(0, ELF_C_READ, 0);
+ return (long)elf;
+}
+endef
+
+define SOURCE_GLIBC
+#include <gnu/libc-version.h>
+
+int main(void)
+{
+ const char *version = gnu_get_libc_version();
+ return (long)version;
+}
+endef
+
+define SOURCE_ELF_MMAP
+#include <libelf.h>
+int main(void)
+{
+ Elf *elf = elf_begin(0, ELF_C_READ_MMAP, 0);
+ return (long)elf;
+}
+endef
+
+ifndef NO_NEWT
+define SOURCE_NEWT
+#include <newt.h>
+
+int main(void)
+{
+ newtInit();
+ newtCls();
+ return newtFinished();
+}
+endef
+endif
+
+ifndef NO_LIBPERL
+define SOURCE_PERL_EMBED
+#include <EXTERN.h>
+#include <perl.h>
+
+int main(void)
+{
+perl_alloc();
+return 0;
+}
+endef
+endif
+
+ifndef NO_LIBPYTHON
+define SOURCE_PYTHON_VERSION
+#include <Python.h>
+#if PY_VERSION_HEX >= 0x03000000
+ #error
+#endif
+int main(void){}
+endef
+define SOURCE_PYTHON_EMBED
+#include <Python.h>
+int main(void)
+{
+ Py_Initialize();
+ return 0;
+}
+endef
+endif
+
+define SOURCE_BFD
+#include <bfd.h>
+
+int main(void)
+{
+ bfd_demangle(0, 0, 0);
+ return 0;
+}
+endef
+
+define SOURCE_CPLUS_DEMANGLE
+extern char *cplus_demangle(const char *, int);
+
+int main(void)
+{
+ cplus_demangle(0, 0);
+ return 0;
+}
+endef
+
+define SOURCE_STRLCPY
+#include <stdlib.h>
+extern size_t strlcpy(char *dest, const char *src, size_t size);
+
+int main(void)
+{
+ strlcpy(NULL, NULL, 0);
+ return 0;
+}
+endef
+
+define SOURCE_VNCSERVER
+#include <rfb/rfb.h>
+
+int main(void)
+{
+ rfbIsActive((void *)0);
+ return 0;
+}
+endef
+
+define SOURCE_SDL
+#include <SDL/SDL.h>
+
+int main(void)
+{
+ SDL_Init(SDL_INIT_VIDEO);
+ return 0;
+}
+endef
+
+define SOURCE_ZLIB
+#include <zlib.h>
+
+int main(void)
+{
+ inflateInit2(NULL, 0);
+ return 0;
+}
+endef
+
+define SOURCE_AIO
+#include <libaio.h>
+
+int main(void)
+{
+ io_setup(0, NULL);
+ return 0;
+}
+endef
+
+define SOURCE_STATIC
+#include <stdlib.h>
+
+int main(void)
+{
+ return 0;
+}
+endef
--- /dev/null
+# This allows us to work with the newline character:
+define newline
+
+
+endef
+newline := $(newline)
+
+# nl-escape
+#
+# Usage: escape = $(call nl-escape[,escape])
+#
+# This is used as the common way to specify
+# what should replace a newline when escaping
+# newlines; the default is a bizarre string.
+#
+nl-escape = $(or $(1),m822df3020w6a44id34bt574ctac44eb9f4n)
+
+# escape-nl
+#
+# Usage: escaped-text = $(call escape-nl,text[,escape])
+#
+# GNU make's $(shell ...) function converts to a
+# single space each newline character in the output
+# produced during the expansion; this may not be
+# desirable.
+#
+# The only solution is to change each newline into
+# something that won't be converted, so that the
+# information can be recovered later with
+# $(call unescape-nl...)
+#
+escape-nl = $(subst $(newline),$(call nl-escape,$(2)),$(1))
+
+# unescape-nl
+#
+# Usage: text = $(call unescape-nl,escaped-text[,escape])
+#
+# See escape-nl.
+#
+unescape-nl = $(subst $(call nl-escape,$(2)),$(newline),$(1))
+
+# shell-escape-nl
+#
+# Usage: $(shell some-command | $(call shell-escape-nl[,escape]))
+#
+# Use this to escape newlines from within a shell call;
+# the default escape is a bizarre string.
+#
+# NOTE: The escape is used directly as a string constant
+# in an `awk' program that is delimited by shell
+# single-quotes, so be wary of the characters
+# that are chosen.
+#
+define shell-escape-nl
+awk 'NR==1 {t=$$0} NR>1 {t=t "$(nl-escape)" $$0} END {printf t}'
+endef
+
+# shell-unescape-nl
+#
+# Usage: $(shell some-command | $(call shell-unescape-nl[,escape]))
+#
+# Use this to unescape newlines from within a shell call;
+# the default escape is a bizarre string.
+#
+# NOTE: The escape is used directly as an extended regular
+# expression constant in an `awk' program that is
+# delimited by shell single-quotes, so be wary
+# of the characters that are chosen.
+#
+# (The bash shell has a bug where `{gsub(...),...}' is
+# misinterpreted as a brace expansion; this can be
+# overcome by putting a space between `{' and `gsub').
+#
+define shell-unescape-nl
+awk 'NR==1 {t=$$0} NR>1 {t=t "\n" $$0} END { gsub(/$(nl-escape)/,"\n",t); printf t }'
+endef
+
+# escape-for-shell-sq
+#
+# Usage: embeddable-text = $(call escape-for-shell-sq,text)
+#
+# This function produces text that is suitable for
+# embedding in a shell string that is delimited by
+# single-quotes.
+#
+escape-for-shell-sq = $(subst ','\'',$(1))
+
+# shell-sq
+#
+# Usage: single-quoted-and-escaped-text = $(call shell-sq,text)
+#
+shell-sq = '$(escape-for-shell-sq)'
+
+# shell-wordify
+#
+# Usage: wordified-text = $(call shell-wordify,text)
+#
+# For instance:
+#
+# |define text
+# |hello
+# |world
+# |endef
+# |
+# |target:
+# | echo $(call shell-wordify,$(text))
+#
+# At least GNU make gets confused by expanding a newline
+# within the context of a command line of a makefile rule
+# (this is in contrast to a `$(shell ...)' function call,
+# which can handle it just fine).
+#
+# This function avoids the problem by producing a string
+# that works as a shell word, regardless of whether or
+# not it contains a newline.
+#
+# If the text to be wordified contains a newline, then
+# an intricate shell command substitution is constructed
+# to render the text as a single line; when the shell
+# processes the resulting escaped text, it transforms
+# it into the original unescaped text.
+#
+# If the text does not contain a newline, then this function
+# produces the same results as the `$(shell-sq)' function.
+#
+shell-wordify = $(if $(findstring $(newline),$(1)),$(_sw-esc-nl),$(shell-sq))
+define _sw-esc-nl
+"$$(echo $(call escape-nl,$(shell-sq),$(2)) | $(call shell-unescape-nl,$(2)))"
+endef
+
+# is-absolute
+#
+# Usage: bool-value = $(call is-absolute,path)
+#
+is-absolute = $(shell echo $(shell-sq) | grep ^/ -q && echo y)
+
+# lookup
+#
+# Usage: absolute-executable-path-or-empty = $(call lookup,path)
+#
+# (It's necessary to use `sh -c' because GNU make messes up by
+# trying too hard and getting things wrong).
+#
+lookup = $(call unescape-nl,$(shell sh -c $(_l-sh)))
+_l-sh = $(call shell-sq,command -v $(shell-sq) | $(call shell-escape-nl,))
+
+# is-executable
+#
+# Usage: bool-value = $(call is-executable,path)
+#
+# (It's necessary to use `sh -c' because GNU make messes up by
+# trying too hard and getting things wrong).
+#
+is-executable = $(call _is-executable-helper,$(shell-sq))
+_is-executable-helper = $(shell sh -c $(_is-executable-sh))
+_is-executable-sh = $(call shell-sq,test -f $(1) -a -x $(1) && echo y)
+
+# get-executable
+#
+# Usage: absolute-executable-path-or-empty = $(call get-executable,path)
+#
+# The goal is to get an absolute path for an executable;
+# the `command -v' is defined by POSIX, but it's not
+# necessarily very portable, so it's only used if
+# relative path resolution is requested, as determined
+# by the absence of a leading `/'.
+#
+get-executable = $(if $(1),$(if $(is-absolute),$(_ge-abspath),$(lookup)))
+_ge-abspath = $(if $(is-executable),$(1))
+
+# get-executable-or-default
+#
+# Usage: absolute-executable-path-or-empty = $(call get-executable-or-default,variable,default)
+#
+define get-executable-or-default
+$(if $($(1)),$(call _ge_attempt,$($(1)),$(1)),$(call _ge_attempt,$(2)))
+endef
+_ge_attempt = $(or $(get-executable),$(_gea_warn),$(call _gea_err,$(2)))
+_gea_warn = $(warning The path '$(1)' is not executable.)
+_gea_err = $(if $(1),$(error Please set '$(1)' appropriately))
+
+# try-cc
+# Usage: option = $(call try-cc, source-to-build, cc-options)
+try-cc = $(shell sh -c \
+ 'TMP="$(OUTPUT)$(TMPOUT).$$$$"; \
+ echo "$(1)" | \
+ $(CC) -x c - $(2) -o "$$TMP" > /dev/null 2>&1 && echo y; \
+ rm -f "$$TMP"')
+
+# try-build
+# Usage: option = $(call try-build, source-to-build, cc-options, link-options)
+try-build = $(shell sh -c \
+ 'TMP="$(OUTPUT)$(TMPOUT).$$$$"; \
+ echo "$(1)" | \
+ $(CC) -x c - $(2) $(3) -o "$$TMP" > /dev/null 2>&1 && echo y; \
+ rm -f "$$TMP"')
--- /dev/null
+#include "kvm/disk-image.h"
+
+#include <linux/err.h>
+#include <mntent.h>
+
+/*
+ * raw image and blk dev are similar, so reuse raw image ops.
+ */
+static struct disk_image_operations blk_dev_ops = {
+ .read = raw_image__read,
+ .write = raw_image__write,
+};
+
+static bool is_mounted(struct stat *st)
+{
+ struct stat st_buf;
+ struct mntent *mnt;
+ FILE *f;
+
+ f = setmntent("/proc/mounts", "r");
+ if (!f)
+ return false;
+
+ while ((mnt = getmntent(f)) != NULL) {
+ if (stat(mnt->mnt_fsname, &st_buf) == 0 &&
+ S_ISBLK(st_buf.st_mode) && st->st_rdev == st_buf.st_rdev) {
+ fclose(f);
+ return true;
+ }
+ }
+
+ fclose(f);
+ return false;
+}
+
+struct disk_image *blkdev__probe(const char *filename, int flags, struct stat *st)
+{
+ struct disk_image *disk;
+ int fd, r;
+ u64 size;
+
+ if (!S_ISBLK(st->st_mode))
+ return ERR_PTR(-EINVAL);
+
+ if (is_mounted(st)) {
+ pr_err("Block device %s is already mounted! Unmount before use.",
+ filename);
+ return ERR_PTR(-EINVAL);
+ }
+
+ /*
+ * Be careful! We are opening host block device!
+ * The caller's flags decide the access mode; prefer read-only so that the
+ * user's data on disk is not damaged.
+ */
+ fd = open(filename, flags);
+ if (fd < 0)
+ return ERR_PTR(-errno);
+
+ if (ioctl(fd, BLKGETSIZE64, &size) < 0) {
+ r = -errno;
+ close(fd);
+ return ERR_PTR(r);
+ }
+
+ /*
+ * FIXME: This will not work on a 32-bit host because a large disk
+ * cannot be mmap()ed: there is not enough virtual address space.
+ * It does work on a 64-bit host.
+ */
+ disk = disk_image__new(fd, size, &blk_dev_ops, DISK_IMAGE_REGULAR);
+#ifdef CONFIG_HAS_AIO
+ if (!IS_ERR_OR_NULL(disk))
+ disk->async = 1;
+#endif
+ return disk;
+}
--- /dev/null
+#include "kvm/disk-image.h"
+#include "kvm/qcow.h"
+#include "kvm/virtio-blk.h"
+#include "kvm/kvm.h"
+
+#include <linux/err.h>
+#include <sys/eventfd.h>
+#include <sys/poll.h>
+
+#define AIO_MAX 256
+
+int debug_iodelay;
+
+static int disk_image__close(struct disk_image *disk);
+
+int disk_img_name_parser(const struct option *opt, const char *arg, int unset)
+{
+ const char *cur;
+ char *sep;
+ struct kvm *kvm = opt->ptr;
+
+ if (kvm->cfg.image_count >= MAX_DISK_IMAGES)
+ die("Currently only 4 images are supported");
+
+ kvm->cfg.disk_image[kvm->cfg.image_count].filename = arg;
+ cur = arg;
+
+ if (strncmp(arg, "scsi:", 5) == 0) {
+ sep = strstr(arg, ":");
+ if (sep)
+ kvm->cfg.disk_image[kvm->cfg.image_count].wwpn = sep + 1;
+ sep = strstr(sep + 1, ":");
+ if (sep) {
+ *sep = 0;
+ kvm->cfg.disk_image[kvm->cfg.image_count].tpgt = sep + 1;
+ cur = sep + 1;
+ }
+ }
+
+ do {
+ sep = strstr(cur, ",");
+ if (sep) {
+ if (strncmp(sep + 1, "ro", 2) == 0)
+ kvm->cfg.disk_image[kvm->cfg.image_count].readonly = true;
+ else if (strncmp(sep + 1, "direct", 6) == 0)
+ kvm->cfg.disk_image[kvm->cfg.image_count].direct = true;
+ *sep = 0;
+ cur = sep + 1;
+ }
+ } while (sep);
+
+ kvm->cfg.image_count++;
+
+ return 0;
+}
+
+#ifdef CONFIG_HAS_AIO
+static void *disk_image__thread(void *param)
+{
+ struct disk_image *disk = param;
+ struct io_event event[AIO_MAX];
+ struct timespec notime = {0};
+ int nr, i;
+ u64 dummy;
+
+ while (read(disk->evt, &dummy, sizeof(dummy)) > 0) {
+ nr = io_getevents(disk->ctx, 1, ARRAY_SIZE(event), event, &notime);
+ for (i = 0; i < nr; i++)
+ disk->disk_req_cb(event[i].data, event[i].res);
+ }
+
+ return NULL;
+}
+#endif
+
+struct disk_image *disk_image__new(int fd, u64 size,
+ struct disk_image_operations *ops,
+ int use_mmap)
+{
+ struct disk_image *disk;
+ int r;
+
+ disk = malloc(sizeof *disk);
+ if (!disk)
+ return ERR_PTR(-ENOMEM);
+
+ *disk = (struct disk_image) {
+ .fd = fd,
+ .size = size,
+ .ops = ops,
+ };
+
+ if (use_mmap == DISK_IMAGE_MMAP) {
+ /*
+ * Writes to the disk image are discarded: the mapping is MAP_PRIVATE
+ */
+ disk->priv = mmap(NULL, size, PROT_RW, MAP_PRIVATE | MAP_NORESERVE, fd, 0);
+ if (disk->priv == MAP_FAILED) {
+ r = -errno;
+ free(disk);
+ return ERR_PTR(r);
+ }
+ }
+
+#ifdef CONFIG_HAS_AIO
+ if (disk) {
+ pthread_t thread;
+
+ disk->evt = eventfd(0, 0);
+ io_setup(AIO_MAX, &disk->ctx);
+ r = pthread_create(&thread, NULL, disk_image__thread, disk);
+ if (r) {
+ r = -r;
+ free(disk);
+ return ERR_PTR(r);
+ }
+ }
+#endif
+ return disk;
+}
+
+static struct disk_image *disk_image__open(const char *filename, bool readonly, bool direct)
+{
+ struct disk_image *disk;
+ struct stat st;
+ int fd, flags;
+
+ if (readonly)
+ flags = O_RDONLY;
+ else
+ flags = O_RDWR;
+ if (direct)
+ flags |= O_DIRECT;
+
+ if (stat(filename, &st) < 0)
+ return ERR_PTR(-errno);
+
+ /* blk device ?*/
+ disk = blkdev__probe(filename, flags, &st);
+ if (!IS_ERR_OR_NULL(disk))
+ return disk;
+
+ fd = open(filename, flags);
+ if (fd < 0)
+ return ERR_PTR(-errno);
+
+ /* qcow image ?*/
+ disk = qcow_probe(fd, true);
+ if (!IS_ERR_OR_NULL(disk)) {
+ pr_warning("Forcing read-only support for QCOW");
+ return disk;
+ }
+
+ /* raw image ?*/
+ disk = raw_image__probe(fd, &st, readonly);
+ if (!IS_ERR_OR_NULL(disk))
+ return disk;
+
+ if (close(fd) < 0)
+ pr_warning("close() failed");
+
+ return ERR_PTR(-ENOSYS);
+}
+
+static struct disk_image **disk_image__open_all(struct kvm *kvm)
+{
+ struct disk_image **disks;
+ const char *filename;
+ const char *wwpn;
+ const char *tpgt;
+ bool readonly;
+ bool direct;
+ void *err;
+ int i;
+ struct disk_image_params *params = (struct disk_image_params *)&kvm->cfg.disk_image;
+ int count = kvm->cfg.image_count;
+
+ if (!count)
+ return ERR_PTR(-EINVAL);
+ if (count > MAX_DISK_IMAGES)
+ return ERR_PTR(-ENOSPC);
+
+ disks = calloc(count, sizeof(*disks));
+ if (!disks)
+ return ERR_PTR(-ENOMEM);
+
+ for (i = 0; i < count; i++) {
+ filename = params[i].filename;
+ readonly = params[i].readonly;
+ direct = params[i].direct;
+ wwpn = params[i].wwpn;
+ tpgt = params[i].tpgt;
+
+ if (wwpn) {
+ disks[i] = malloc(sizeof(struct disk_image));
+ if (!disks[i])
+ return ERR_PTR(-ENOMEM);
+ disks[i]->wwpn = wwpn;
+ disks[i]->tpgt = tpgt;
+ continue;
+ }
+
+ if (!filename)
+ continue;
+
+ disks[i] = disk_image__open(filename, readonly, direct);
+ if (IS_ERR_OR_NULL(disks[i])) {
+ pr_err("Loading disk image '%s' failed", filename);
+ err = disks[i];
+ goto error;
+ }
+ disks[i]->debug_iodelay = kvm->cfg.debug_iodelay;
+ }
+
+ return disks;
+error:
+ for (i = 0; i < count; i++)
+ if (!IS_ERR_OR_NULL(disks[i]))
+ disk_image__close(disks[i]);
+
+ free(disks);
+ return err;
+}
+
+int disk_image__flush(struct disk_image *disk)
+{
+ if (disk->ops->flush)
+ return disk->ops->flush(disk);
+
+ return fsync(disk->fd);
+}
+
+static int disk_image__close(struct disk_image *disk)
+{
+ /* If there was no disk image then there's nothing to do: */
+ if (!disk)
+ return 0;
+
+ if (disk->ops->close)
+ return disk->ops->close(disk);
+
+ if (close(disk->fd) < 0)
+ pr_warning("close() failed");
+
+ free(disk);
+
+ return 0;
+}
+
+static int disk_image__close_all(struct disk_image **disks, int count)
+{
+ while (count)
+ disk_image__close(disks[--count]);
+
+ free(disks);
+
+ return 0;
+}
+
+/*
+ * Fill iov with disk data, starting from sector 'sector'.
+ * Return amount of bytes read.
+ */
+ssize_t disk_image__read(struct disk_image *disk, u64 sector,
+ const struct iovec *iov, int iovcount, void *param)
+{
+ ssize_t total = 0;
+
+ if (debug_iodelay)
+ msleep(debug_iodelay);
+
+ if (disk->ops->read) {
+ total = disk->ops->read(disk, sector, iov, iovcount, param);
+ if (total < 0) {
+ pr_info("disk_image__read error: total=%ld\n", (long)total);
+ return total;
+ }
+ }
+
+ if (!disk->async && disk->disk_req_cb)
+ disk->disk_req_cb(param, total);
+
+ return total;
+}
+
+/*
+ * Write iov to disk, starting from sector 'sector'.
+ * Return amount of bytes written.
+ */
+ssize_t disk_image__write(struct disk_image *disk, u64 sector,
+ const struct iovec *iov, int iovcount, void *param)
+{
+ ssize_t total = 0;
+
+ if (debug_iodelay)
+ msleep(debug_iodelay);
+
+ if (disk->ops->write) {
+ /*
+ * Try writev based operation first
+ */
+
+ total = disk->ops->write(disk, sector, iov, iovcount, param);
+ if (total < 0) {
+ pr_info("disk_image__write error: total=%ld\n", (long)total);
+ return total;
+ }
+ } else {
+ /* Do nothing */
+ }
+
+ if (!disk->async && disk->disk_req_cb)
+ disk->disk_req_cb(param, total);
+
+ return total;
+}
+
+ssize_t disk_image__get_serial(struct disk_image *disk, void *buffer, ssize_t *len)
+{
+ struct stat st;
+ int r;
+
+ r = fstat(disk->fd, &st);
+ if (r)
+ return r;
+
+ *len = snprintf(buffer, *len, "%llu%llu%llu",
+ (u64)st.st_dev, (u64)st.st_rdev, (u64)st.st_ino);
+ return *len;
+}
+
+void disk_image__set_callback(struct disk_image *disk,
+ void (*disk_req_cb)(void *param, long len))
+{
+ disk->disk_req_cb = disk_req_cb;
+}
+
+int disk_image__init(struct kvm *kvm)
+{
+ if (kvm->cfg.image_count) {
+ kvm->disks = disk_image__open_all(kvm);
+ if (IS_ERR(kvm->disks))
+ return PTR_ERR(kvm->disks);
+ }
+
+ return 0;
+}
+dev_base_init(disk_image__init);
+
+int disk_image__exit(struct kvm *kvm)
+{
+ return disk_image__close_all(kvm->disks, kvm->nr_disks);
+}
+dev_base_exit(disk_image__exit);
--- /dev/null
+#include "kvm/qcow.h"
+
+#include "kvm/disk-image.h"
+#include "kvm/read-write.h"
+#include "kvm/mutex.h"
+#include "kvm/util.h"
+
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <errno.h>
+#ifdef CONFIG_HAS_ZLIB
+#include <zlib.h>
+#endif
+
+#include <linux/err.h>
+#include <linux/byteorder.h>
+#include <linux/kernel.h>
+#include <linux/types.h>
+
+static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append);
+static int qcow_write_refcount_table(struct qcow *q);
+static u64 qcow_alloc_clusters(struct qcow *q, u64 size, int update_ref);
+static void qcow_free_clusters(struct qcow *q, u64 clust_start, u64 size);
+
+static inline int qcow_pwrite_sync(int fd,
+ void *buf, size_t count, off_t offset)
+{
+ if (pwrite_in_full(fd, buf, count, offset) < 0)
+ return -1;
+
+ return fdatasync(fd);
+}
+
+static int l2_table_insert(struct rb_root *root, struct qcow_l2_table *new)
+{
+ struct rb_node **link = &(root->rb_node), *parent = NULL;
+ u64 offset = new->offset;
+
+ /* search the tree */
+ while (*link) {
+ struct qcow_l2_table *t;
+
+ t = rb_entry(*link, struct qcow_l2_table, node);
+ if (!t)
+ goto error;
+
+ parent = *link;
+
+ if (t->offset > offset)
+ link = &(*link)->rb_left;
+ else if (t->offset < offset)
+ link = &(*link)->rb_right;
+ else
+ goto out;
+ }
+
+ /* add new node */
+ rb_link_node(&new->node, parent, link);
+ rb_insert_color(&new->node, root);
+out:
+ return 0;
+error:
+ return -1;
+}
+
+static struct qcow_l2_table *l2_table_lookup(struct rb_root *root, u64 offset)
+{
+ struct rb_node *link = root->rb_node;
+
+ while (link) {
+ struct qcow_l2_table *t;
+
+ t = rb_entry(link, struct qcow_l2_table, node);
+ if (!t)
+ goto out;
+
+ if (t->offset > offset)
+ link = link->rb_left;
+ else if (t->offset < offset)
+ link = link->rb_right;
+ else
+ return t;
+ }
+out:
+ return NULL;
+}
+
+static void l1_table_free_cache(struct qcow_l1_table *l1t)
+{
+ struct rb_root *r = &l1t->root;
+ struct list_head *pos, *n;
+ struct qcow_l2_table *t;
+
+ list_for_each_safe(pos, n, &l1t->lru_list) {
+ /* Remove cache table from the list and RB tree */
+ list_del(pos);
+ t = list_entry(pos, struct qcow_l2_table, list);
+ rb_erase(&t->node, r);
+
+ /* Free the cached node */
+ free(t);
+ }
+}
+
+static int qcow_l2_cache_write(struct qcow *q, struct qcow_l2_table *c)
+{
+ struct qcow_header *header = q->header;
+ u64 size;
+
+ if (!c->dirty)
+ return 0;
+
+ size = 1 << header->l2_bits;
+
+ if (qcow_pwrite_sync(q->fd, c->table,
+ size * sizeof(u64), c->offset) < 0)
+ return -1;
+
+ c->dirty = 0;
+
+ return 0;
+}
+
+static int cache_table(struct qcow *q, struct qcow_l2_table *c)
+{
+ struct qcow_l1_table *l1t = &q->table;
+ struct rb_root *r = &l1t->root;
+ struct qcow_l2_table *lru;
+
+ if (l1t->nr_cached == MAX_CACHE_NODES) {
+ /*
+ * The node at the head of the list is least recently used
+ * node. Remove it from the list and replace it with the new node.
+ */
+ lru = list_first_entry(&l1t->lru_list, struct qcow_l2_table, list);
+
+ /* Remove the node from the cache */
+ rb_erase(&lru->node, r);
+ list_del_init(&lru->list);
+ l1t->nr_cached--;
+
+ /* Free the LRUed node */
+ free(lru);
+ }
+
+ /* Add new node in RB Tree: Helps in searching faster */
+ if (l2_table_insert(r, c) < 0)
+ goto error;
+
+ /* Add in LRU replacement list */
+ list_add_tail(&c->list, &l1t->lru_list);
+ l1t->nr_cached++;
+
+ return 0;
+error:
+ return -1;
+}
+
+static struct qcow_l2_table *l2_table_search(struct qcow *q, u64 offset)
+{
+ struct qcow_l1_table *l1t = &q->table;
+ struct qcow_l2_table *l2t;
+
+ l2t = l2_table_lookup(&l1t->root, offset);
+ if (!l2t)
+ return NULL;
+
+ /* Update the LRU state, by moving the searched node to list tail */
+ list_move_tail(&l2t->list, &l1t->lru_list);
+
+ return l2t;
+}
+
+/* Allocates a new node for caching L2 table */
+static struct qcow_l2_table *new_cache_table(struct qcow *q, u64 offset)
+{
+ struct qcow_header *header = q->header;
+ struct qcow_l2_table *c;
+ u64 l2t_sz;
+ u64 size;
+
+ l2t_sz = 1 << header->l2_bits;
+ size = sizeof(*c) + l2t_sz * sizeof(u64);
+ c = calloc(1, size);
+ if (!c)
+ goto out;
+
+ c->offset = offset;
+ RB_CLEAR_NODE(&c->node);
+ INIT_LIST_HEAD(&c->list);
+out:
+ return c;
+}
+
+static inline u64 get_l1_index(struct qcow *q, u64 offset)
+{
+ struct qcow_header *header = q->header;
+
+ return offset >> (header->l2_bits + header->cluster_bits);
+}
+
+static inline u64 get_l2_index(struct qcow *q, u64 offset)
+{
+ struct qcow_header *header = q->header;
+
+ return (offset >> (header->cluster_bits)) & ((1 << header->l2_bits)-1);
+}
+
+static inline u64 get_cluster_offset(struct qcow *q, u64 offset)
+{
+ struct qcow_header *header = q->header;
+
+ return offset & ((1 << header->cluster_bits)-1);
+}
+
+static struct qcow_l2_table *qcow_read_l2_table(struct qcow *q, u64 offset)
+{
+ struct qcow_header *header = q->header;
+ struct qcow_l2_table *l2t;
+ u64 size;
+
+ size = 1 << header->l2_bits;
+
+ /* search an entry for offset in cache */
+ l2t = l2_table_search(q, offset);
+ if (l2t)
+ return l2t;
+
+ /* allocate new node for caching l2 table */
+ l2t = new_cache_table(q, offset);
+ if (!l2t)
+ goto error;
+
+ /* table not cached: read from the disk */
+ if (pread_in_full(q->fd, l2t->table, size * sizeof(u64), offset) < 0)
+ goto error;
+
+ /* cache the table */
+ if (cache_table(q, l2t) < 0)
+ goto error;
+
+ return l2t;
+error:
+ free(l2t);
+ return NULL;
+}
+
+static int qcow_decompress_buffer(u8 *out_buf, int out_buf_size,
+ const u8 *buf, int buf_size)
+{
+#ifdef CONFIG_HAS_ZLIB
+ z_stream strm1, *strm = &strm1;
+ int ret, out_len;
+
+ memset(strm, 0, sizeof(*strm));
+
+ strm->next_in = (u8 *)buf;
+ strm->avail_in = buf_size;
+ strm->next_out = out_buf;
+ strm->avail_out = out_buf_size;
+
+ ret = inflateInit2(strm, -12);
+ if (ret != Z_OK)
+ return -1;
+
+ ret = inflate(strm, Z_FINISH);
+ out_len = strm->next_out - out_buf;
+ if ((ret != Z_STREAM_END && ret != Z_BUF_ERROR) ||
+ out_len != out_buf_size) {
+ inflateEnd(strm);
+ return -1;
+ }
+
+ inflateEnd(strm);
+ return 0;
+#else
+ return -1;
+#endif
+}
+
+static ssize_t qcow1_read_cluster(struct qcow *q, u64 offset,
+ void *dst, u32 dst_len)
+{
+ struct qcow_header *header = q->header;
+ struct qcow_l1_table *l1t = &q->table;
+ struct qcow_l2_table *l2t;
+ u64 clust_offset;
+ u64 clust_start;
+ u64 l2t_offset;
+ size_t length;
+ u64 l2t_size;
+ u64 l1_idx;
+ u64 l2_idx;
+ int coffset;
+ int csize;
+
+ l1_idx = get_l1_index(q, offset);
+ if (l1_idx >= l1t->table_size)
+ return -1;
+
+ clust_offset = get_cluster_offset(q, offset);
+ if (clust_offset >= q->cluster_size)
+ return -1;
+
+ length = q->cluster_size - clust_offset;
+ if (length > dst_len)
+ length = dst_len;
+
+ mutex_lock(&q->mutex);
+
+ l2t_offset = be64_to_cpu(l1t->l1_table[l1_idx]);
+ if (!l2t_offset)
+ goto zero_cluster;
+
+ l2t_size = 1 << header->l2_bits;
+
+ /* read and cache level 2 table */
+ l2t = qcow_read_l2_table(q, l2t_offset);
+ if (!l2t)
+ goto out_error;
+
+ l2_idx = get_l2_index(q, offset);
+ if (l2_idx >= l2t_size)
+ goto out_error;
+
+ clust_start = be64_to_cpu(l2t->table[l2_idx]);
+ if (clust_start & QCOW1_OFLAG_COMPRESSED) {
+ coffset = clust_start & q->cluster_offset_mask;
+ csize = clust_start >> (63 - q->header->cluster_bits);
+ csize &= (q->cluster_size - 1);
+
+ if (pread_in_full(q->fd, q->cluster_data, csize,
+ coffset) < 0)
+ goto out_error;
+
+ if (qcow_decompress_buffer(q->cluster_cache, q->cluster_size,
+ q->cluster_data, csize) < 0)
+ goto out_error;
+
+ memcpy(dst, q->cluster_cache + clust_offset, length);
+ mutex_unlock(&q->mutex);
+ } else {
+ if (!clust_start)
+ goto zero_cluster;
+
+ mutex_unlock(&q->mutex);
+
+ if (pread_in_full(q->fd, dst, length,
+ clust_start + clust_offset) < 0)
+ return -1;
+ }
+
+ return length;
+
+zero_cluster:
+ mutex_unlock(&q->mutex);
+ memset(dst, 0, length);
+ return length;
+
+out_error:
+ mutex_unlock(&q->mutex);
+ return -1;
+}
+
+static ssize_t qcow2_read_cluster(struct qcow *q, u64 offset,
+ void *dst, u32 dst_len)
+{
+ struct qcow_header *header = q->header;
+ struct qcow_l1_table *l1t = &q->table;
+ struct qcow_l2_table *l2t;
+ u64 clust_offset;
+ u64 clust_start;
+ u64 l2t_offset;
+ size_t length;
+ u64 l2t_size;
+ u64 l1_idx;
+ u64 l2_idx;
+ int coffset;
+ int sector_offset;
+ int nb_csectors;
+ int csize;
+
+ l1_idx = get_l1_index(q, offset);
+ if (l1_idx >= l1t->table_size)
+ return -1;
+
+ clust_offset = get_cluster_offset(q, offset);
+ if (clust_offset >= q->cluster_size)
+ return -1;
+
+ length = q->cluster_size - clust_offset;
+ if (length > dst_len)
+ length = dst_len;
+
+ mutex_lock(&q->mutex);
+
+ l2t_offset = be64_to_cpu(l1t->l1_table[l1_idx]);
+
+ l2t_offset &= ~QCOW2_OFLAG_COPIED;
+ if (!l2t_offset)
+ goto zero_cluster;
+
+ l2t_size = 1 << header->l2_bits;
+
+ /* read and cache level 2 table */
+ l2t = qcow_read_l2_table(q, l2t_offset);
+ if (!l2t)
+ goto out_error;
+
+ l2_idx = get_l2_index(q, offset);
+ if (l2_idx >= l2t_size)
+ goto out_error;
+
+ clust_start = be64_to_cpu(l2t->table[l2_idx]);
+ if (clust_start & QCOW2_OFLAG_COMPRESSED) {
+ coffset = clust_start & q->cluster_offset_mask;
+ nb_csectors = ((clust_start >> q->csize_shift)
+ & q->csize_mask) + 1;
+ sector_offset = coffset & (SECTOR_SIZE - 1);
+ csize = nb_csectors * SECTOR_SIZE - sector_offset;
+
+ if (pread_in_full(q->fd, q->cluster_data,
+ nb_csectors * SECTOR_SIZE,
+ coffset & ~(SECTOR_SIZE - 1)) < 0) {
+ goto out_error;
+ }
+
+ if (qcow_decompress_buffer(q->cluster_cache, q->cluster_size,
+ q->cluster_data + sector_offset,
+ csize) < 0) {
+ goto out_error;
+ }
+
+ memcpy(dst, q->cluster_cache + clust_offset, length);
+ mutex_unlock(&q->mutex);
+ } else {
+ clust_start &= QCOW2_OFFSET_MASK;
+ if (!clust_start)
+ goto zero_cluster;
+
+ mutex_unlock(&q->mutex);
+
+ if (pread_in_full(q->fd, dst, length,
+ clust_start + clust_offset) < 0)
+ return -1;
+ }
+
+ return length;
+
+zero_cluster:
+ mutex_unlock(&q->mutex);
+ memset(dst, 0, length);
+ return length;
+
+out_error:
+ mutex_unlock(&q->mutex);
+ return -1;
+}
+
+static ssize_t qcow_read_sector_single(struct disk_image *disk, u64 sector,
+ void *dst, u32 dst_len)
+{
+ struct qcow *q = disk->priv;
+ struct qcow_header *header = q->header;
+ u32 nr_read;
+ u64 offset;
+ char *buf;
+ ssize_t nr;
+
+ buf = dst;
+ nr_read = 0;
+
+ while (nr_read < dst_len) {
+ offset = sector << SECTOR_SHIFT;
+ if (offset >= header->size)
+ return -1;
+
+ if (q->version == QCOW1_VERSION)
+ nr = qcow1_read_cluster(q, offset, buf,
+ dst_len - nr_read);
+ else
+ nr = qcow2_read_cluster(q, offset, buf,
+ dst_len - nr_read);
+
+ if (nr <= 0)
+ return -1;
+
+ nr_read += nr;
+ buf += nr;
+ sector += (nr >> SECTOR_SHIFT);
+ }
+
+ return dst_len;
+}
+
+static ssize_t qcow_read_sector(struct disk_image *disk, u64 sector,
+ const struct iovec *iov, int iovcount, void *param)
+{
+ ssize_t nr, total = 0;
+
+ while (iovcount--) {
+ nr = qcow_read_sector_single(disk, sector, iov->iov_base, iov->iov_len);
+ if (nr != (ssize_t)iov->iov_len) {
+ pr_info("qcow_read_sector error: nr=%ld iov_len=%ld\n", (long)nr, (long)iov->iov_len);
+ return -1;
+ }
+
+ sector += iov->iov_len >> SECTOR_SHIFT;
+ total += nr;
+ iov++;
+ }
+
+ return total;
+}
+
+static void refcount_table_free_cache(struct qcow_refcount_table *rft)
+{
+ struct rb_root *r = &rft->root;
+ struct list_head *pos, *n;
+ struct qcow_refcount_block *t;
+
+ list_for_each_safe(pos, n, &rft->lru_list) {
+ list_del(pos);
+ t = list_entry(pos, struct qcow_refcount_block, list);
+ rb_erase(&t->node, r);
+
+ free(t);
+ }
+}
+
+static int refcount_block_insert(struct rb_root *root, struct qcow_refcount_block *new)
+{
+ struct rb_node **link = &(root->rb_node), *parent = NULL;
+ u64 offset = new->offset;
+
+ /* search the tree */
+ while (*link) {
+ struct qcow_refcount_block *t;
+
+ t = rb_entry(*link, struct qcow_refcount_block, node);
+ if (!t)
+ goto error;
+
+ parent = *link;
+
+ if (t->offset > offset)
+ link = &(*link)->rb_left;
+ else if (t->offset < offset)
+ link = &(*link)->rb_right;
+ else
+ goto out;
+ }
+
+ /* add new node */
+ rb_link_node(&new->node, parent, link);
+ rb_insert_color(&new->node, root);
+out:
+ return 0;
+error:
+ return -1;
+}
+
+static int write_refcount_block(struct qcow *q, struct qcow_refcount_block *rfb)
+{
+ if (!rfb->dirty)
+ return 0;
+
+ if (qcow_pwrite_sync(q->fd, rfb->entries,
+ rfb->size * sizeof(u16), rfb->offset) < 0)
+ return -1;
+
+ rfb->dirty = 0;
+
+ return 0;
+}
+
+static int cache_refcount_block(struct qcow *q, struct qcow_refcount_block *c)
+{
+ struct qcow_refcount_table *rft = &q->refcount_table;
+ struct rb_root *r = &rft->root;
+ struct qcow_refcount_block *lru;
+
+ if (rft->nr_cached == MAX_CACHE_NODES) {
+ lru = list_first_entry(&rft->lru_list, struct qcow_refcount_block, list);
+
+ rb_erase(&lru->node, r);
+ list_del_init(&lru->list);
+ rft->nr_cached--;
+
+ free(lru);
+ }
+
+ if (refcount_block_insert(r, c) < 0)
+ goto error;
+
+ list_add_tail(&c->list, &rft->lru_list);
+ rft->nr_cached++;
+
+ return 0;
+error:
+ return -1;
+}
+
+static struct qcow_refcount_block *new_refcount_block(struct qcow *q, u64 rfb_offset)
+{
+ struct qcow_refcount_block *rfb;
+
+ rfb = malloc(sizeof *rfb + q->cluster_size);
+ if (!rfb)
+ return NULL;
+
+ rfb->offset = rfb_offset;
+ rfb->size = q->cluster_size / sizeof(u16);
+ RB_CLEAR_NODE(&rfb->node);
+ INIT_LIST_HEAD(&rfb->list);
+
+ return rfb;
+}
+
+static struct qcow_refcount_block *refcount_block_lookup(struct rb_root *root, u64 offset)
+{
+ struct rb_node *link = root->rb_node;
+
+ while (link) {
+ struct qcow_refcount_block *t;
+
+ t = rb_entry(link, struct qcow_refcount_block, node);
+ if (!t)
+ goto out;
+
+ if (t->offset > offset)
+ link = link->rb_left;
+ else if (t->offset < offset)
+ link = link->rb_right;
+ else
+ return t;
+ }
+out:
+ return NULL;
+}
+
+static struct qcow_refcount_block *refcount_block_search(struct qcow *q, u64 offset)
+{
+ struct qcow_refcount_table *rft = &q->refcount_table;
+ struct qcow_refcount_block *rfb;
+
+ rfb = refcount_block_lookup(&rft->root, offset);
+ if (!rfb)
+ return NULL;
+
+ /* Update the LRU state, by moving the searched node to list tail */
+ list_move_tail(&rfb->list, &rft->lru_list);
+
+ return rfb;
+}
+
+static struct qcow_refcount_block *qcow_grow_refcount_block(struct qcow *q,
+ u64 clust_idx)
+{
+ struct qcow_header *header = q->header;
+ struct qcow_refcount_table *rft = &q->refcount_table;
+ struct qcow_refcount_block *rfb;
+ u64 new_block_offset;
+ u64 rft_idx;
+
+ rft_idx = clust_idx >> (header->cluster_bits -
+ QCOW_REFCOUNT_BLOCK_SHIFT);
+
+ if (rft_idx >= rft->rf_size) {
+ pr_warning("Growing the refcount table is not supported");
+ return NULL;
+ }
+
+ new_block_offset = qcow_alloc_clusters(q, q->cluster_size, 0);
+ if (new_block_offset == (u64)-1)
+ return NULL;
+
+ rfb = new_refcount_block(q, new_block_offset);
+ if (!rfb)
+ return NULL;
+
+ memset(rfb->entries, 0x00, q->cluster_size);
+ rfb->dirty = 1;
+
+ /* write refcount block */
+ if (write_refcount_block(q, rfb) < 0)
+ goto free_rfb;
+
+ if (cache_refcount_block(q, rfb) < 0)
+ goto free_rfb;
+
+ rft->rf_table[rft_idx] = cpu_to_be64(new_block_offset);
+ if (update_cluster_refcount(q, new_block_offset >>
+ header->cluster_bits, 1) < 0)
+ goto recover_rft;
+
+ if (qcow_write_refcount_table(q) < 0)
+ goto recover_rft;
+
+ return rfb;
+
+recover_rft:
+ rft->rf_table[rft_idx] = 0;
+free_rfb:
+ free(rfb);
+ return NULL;
+}
+
+static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64 clust_idx)
+{
+ struct qcow_header *header = q->header;
+ struct qcow_refcount_table *rft = &q->refcount_table;
+ struct qcow_refcount_block *rfb;
+ u64 rfb_offset;
+ u64 rft_idx;
+
+ rft_idx = clust_idx >> (header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT);
+ if (rft_idx >= rft->rf_size)
+ return ERR_PTR(-ENOSPC);
+
+ rfb_offset = be64_to_cpu(rft->rf_table[rft_idx]);
+ if (!rfb_offset)
+ return ERR_PTR(-ENOSPC);
+
+ rfb = refcount_block_search(q, rfb_offset);
+ if (rfb)
+ return rfb;
+
+ rfb = new_refcount_block(q, rfb_offset);
+ if (!rfb)
+ return NULL;
+
+ if (pread_in_full(q->fd, rfb->entries, rfb->size * sizeof(u16), rfb_offset) < 0)
+ goto error_free_rfb;
+
+ if (cache_refcount_block(q, rfb) < 0)
+ goto error_free_rfb;
+
+ return rfb;
+
+error_free_rfb:
+ free(rfb);
+
+ return NULL;
+}
+
+static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx)
+{
+ struct qcow_refcount_block *rfb = NULL;
+ struct qcow_header *header = q->header;
+ u64 rfb_idx;
+
+ rfb = qcow_read_refcount_block(q, clust_idx);
+ if (PTR_ERR(rfb) == -ENOSPC)
+ return 0;
+ else if (IS_ERR_OR_NULL(rfb)) {
+ pr_warning("Error while reading refcount table");
+ return -1;
+ }
+
+ rfb_idx = clust_idx & (((1ULL <<
+ (header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
+
+ if (rfb_idx >= rfb->size) {
+ pr_warning("L1: refcount block index out of bounds");
+ return -1;
+ }
+
+ return be16_to_cpu(rfb->entries[rfb_idx]);
+}
+
+static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append)
+{
+ struct qcow_refcount_block *rfb = NULL;
+ struct qcow_header *header = q->header;
+ u16 refcount;
+ u64 rfb_idx;
+
+ rfb = qcow_read_refcount_block(q, clust_idx);
+ if (PTR_ERR(rfb) == -ENOSPC) {
+ rfb = qcow_grow_refcount_block(q, clust_idx);
+ if (!rfb) {
+ pr_warning("error while growing refcount table");
+ return -1;
+ }
+ } else if (IS_ERR_OR_NULL(rfb)) {
+ pr_warning("error while reading refcount table");
+ return -1;
+ }
+
+ rfb_idx = clust_idx & (((1ULL <<
+ (header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
+ if (rfb_idx >= rfb->size) {
+ pr_warning("refcount block index out of bounds");
+ return -1;
+ }
+
+ refcount = be16_to_cpu(rfb->entries[rfb_idx]) + append;
+ rfb->entries[rfb_idx] = cpu_to_be16(refcount);
+ rfb->dirty = 1;
+
+ /* write refcount block */
+ if (write_refcount_block(q, rfb) < 0) {
+ pr_warning("error while writing refcount block");
+ return -1;
+ }
+
+ /* update free_clust_idx since refcount becomes zero */
+ if (!refcount && clust_idx < q->free_clust_idx)
+ q->free_clust_idx = clust_idx;
+
+ return 0;
+}
+
+static void qcow_free_clusters(struct qcow *q, u64 clust_start, u64 size)
+{
+ struct qcow_header *header = q->header;
+ u64 start, end, offset;
+
+ start = clust_start & ~(q->cluster_size - 1);
+ end = (clust_start + size - 1) & ~(q->cluster_size - 1);
+ for (offset = start; offset <= end; offset += q->cluster_size)
+ update_cluster_refcount(q, offset >> header->cluster_bits, -1);
+}
+
+/*
+ * Allocate enough clusters to hold 'size' bytes. Find a position with
+ * a run of free clusters that can satisfy the size. free_clust_idx is
+ * initialized to zero and records the last position searched.
+ */
+static u64 qcow_alloc_clusters(struct qcow *q, u64 size, int update_ref)
+{
+ struct qcow_header *header = q->header;
+ u16 clust_refcount;
+ u32 clust_idx = 0, i;
+ u64 clust_num;
+
+ clust_num = (size + (q->cluster_size - 1)) >> header->cluster_bits;
+
+again:
+ for (i = 0; i < clust_num; i++) {
+ clust_idx = q->free_clust_idx++;
+ clust_refcount = qcow_get_refcount(q, clust_idx);
+ if (clust_refcount == (u16)-1)
+ return -1;
+ else if (clust_refcount > 0)
+ goto again;
+ }
+
+ clust_idx++;
+
+ if (update_ref)
+ for (i = 0; i < clust_num; i++)
+ if (update_cluster_refcount(q,
+ clust_idx - clust_num + i, 1))
+ return -1;
+
+ return (clust_idx - clust_num) << header->cluster_bits;
+}
+
+static int qcow_write_l1_table(struct qcow *q)
+{
+ struct qcow_l1_table *l1t = &q->table;
+ struct qcow_header *header = q->header;
+
+ if (qcow_pwrite_sync(q->fd, l1t->l1_table,
+ l1t->table_size * sizeof(u64),
+ header->l1_table_offset) < 0)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Get the l2 table. If the table has already been copied
+ * (QCOW2_OFLAG_COPIED is set), read it directly. Otherwise allocate a
+ * new cluster and, if an old table exists, copy it to the new cluster.
+ */
+static int get_cluster_table(struct qcow *q, u64 offset,
+ struct qcow_l2_table **result_l2t, u64 *result_l2_index)
+{
+ struct qcow_header *header = q->header;
+ struct qcow_l1_table *l1t = &q->table;
+ struct qcow_l2_table *l2t;
+ u64 l1t_idx;
+ u64 l2t_offset;
+ u64 l2t_idx;
+ u64 l2t_size;
+ u64 l2t_new_offset;
+
+ l2t_size = 1 << header->l2_bits;
+
+ l1t_idx = get_l1_index(q, offset);
+ if (l1t_idx >= l1t->table_size)
+ return -1;
+
+ l2t_idx = get_l2_index(q, offset);
+ if (l2t_idx >= l2t_size)
+ return -1;
+
+ l2t_offset = be64_to_cpu(l1t->l1_table[l1t_idx]);
+ if (l2t_offset & QCOW2_OFLAG_COPIED) {
+ l2t_offset &= ~QCOW2_OFLAG_COPIED;
+ l2t = qcow_read_l2_table(q, l2t_offset);
+ if (!l2t)
+ goto error;
+ } else {
+ l2t_new_offset = qcow_alloc_clusters(q,
+ l2t_size*sizeof(u64), 1);
+
+ if (l2t_new_offset == (u64)-1)
+ goto error;
+
+ l2t = new_cache_table(q, l2t_new_offset);
+ if (!l2t)
+ goto free_cluster;
+
+ if (l2t_offset) {
+ l2t = qcow_read_l2_table(q, l2t_offset);
+ if (!l2t)
+ goto free_cache;
+ } else
+ memset(l2t->table, 0x00, l2t_size * sizeof(u64));
+
+ /* write l2 table */
+ l2t->dirty = 1;
+ if (qcow_l2_cache_write(q, l2t) < 0)
+ goto free_cache;
+
+ /* cache l2 table */
+ if (cache_table(q, l2t))
+ goto free_cache;
+
+ /* update the l1 table */
+ l1t->l1_table[l1t_idx] = cpu_to_be64(l2t_new_offset
+ | QCOW2_OFLAG_COPIED);
+ if (qcow_write_l1_table(q)) {
+ pr_warning("Update l1 table error");
+ goto free_cache;
+ }
+
+ /* free old cluster */
+ qcow_free_clusters(q, l2t_offset, q->cluster_size);
+ }
+
+ *result_l2t = l2t;
+ *result_l2_index = l2t_idx;
+
+ return 0;
+
+free_cache:
+ free(l2t);
+
+free_cluster:
+ qcow_free_clusters(q, l2t_new_offset, q->cluster_size);
+
+error:
+ return -1;
+}
+
+/*
+ * If the cluster has been copied, write data directly. If not,
+ * read the original data and write it to the new cluster with
+ * modification.
+ */
+static ssize_t qcow_write_cluster(struct qcow *q, u64 offset,
+ void *buf, u32 src_len)
+{
+ struct qcow_l2_table *l2t;
+ u64 clust_new_start;
+ u64 clust_start;
+ u64 clust_flags;
+ u64 clust_off;
+ u64 l2t_idx;
+ u64 len;
+
+ l2t = NULL;
+
+ clust_off = get_cluster_offset(q, offset);
+ if (clust_off >= q->cluster_size)
+ return -1;
+
+ len = q->cluster_size - clust_off;
+ if (len > src_len)
+ len = src_len;
+
+ mutex_lock(&q->mutex);
+
+ if (get_cluster_table(q, offset, &l2t, &l2t_idx)) {
+ pr_warning("Get l2 table error");
+ goto error;
+ }
+
+ clust_start = be64_to_cpu(l2t->table[l2t_idx]);
+ clust_flags = clust_start & QCOW2_OFLAGS_MASK;
+
+ clust_start &= QCOW2_OFFSET_MASK;
+ if (!(clust_flags & QCOW2_OFLAG_COPIED)) {
+ clust_new_start = qcow_alloc_clusters(q, q->cluster_size, 1);
+ if (clust_new_start == (u64)-1) {
+ pr_warning("Cluster alloc error");
+ goto error;
+ }
+
+ offset &= ~(q->cluster_size - 1);
+
+ /* if clust_start is not zero, read the original data */
+ if (clust_start) {
+ mutex_unlock(&q->mutex);
+ if (qcow2_read_cluster(q, offset, q->copy_buff,
+ q->cluster_size) < 0) {
+ pr_warning("Read copy cluster error");
+ qcow_free_clusters(q, clust_new_start,
+ q->cluster_size);
+ return -1;
+ }
+ mutex_lock(&q->mutex);
+ } else
+ memset(q->copy_buff, 0x00, q->cluster_size);
+
+ memcpy(q->copy_buff + clust_off, buf, len);
+
+ /* Write actual data */
+ if (pwrite_in_full(q->fd, q->copy_buff, q->cluster_size,
+ clust_new_start) < 0)
+ goto free_cluster;
+
+ /* update l2 table */
+ l2t->table[l2t_idx] = cpu_to_be64(clust_new_start
+ | QCOW2_OFLAG_COPIED);
+ l2t->dirty = 1;
+
+ if (qcow_l2_cache_write(q, l2t))
+ goto free_cluster;
+
+ /* free old cluster */
+ if (clust_flags & QCOW2_OFLAG_COMPRESSED) {
+ int size;
+ size = ((clust_start >> q->csize_shift) &
+ q->csize_mask) + 1;
+ size *= 512;
+ clust_start &= q->cluster_offset_mask;
+ clust_start &= ~511;
+
+ qcow_free_clusters(q, clust_start, size);
+ } else if (clust_start)
+ qcow_free_clusters(q, clust_start, q->cluster_size);
+
+ } else {
+ /* Write actual data */
+ if (pwrite_in_full(q->fd, buf, len,
+ clust_start + clust_off) < 0)
+ goto error;
+ }
+ mutex_unlock(&q->mutex);
+ return len;
+
+free_cluster:
+ qcow_free_clusters(q, clust_new_start, q->cluster_size);
+
+error:
+ mutex_unlock(&q->mutex);
+ return -1;
+}
+
+static ssize_t qcow_write_sector_single(struct disk_image *disk, u64 sector, void *src, u32 src_len)
+{
+ struct qcow *q = disk->priv;
+ struct qcow_header *header = q->header;
+ u32 nr_written;
+ char *buf;
+ u64 offset;
+ ssize_t nr;
+
+ buf = src;
+ nr_written = 0;
+ offset = sector << SECTOR_SHIFT;
+
+ while (nr_written < src_len) {
+ if (offset >= header->size)
+ return -1;
+
+ nr = qcow_write_cluster(q, offset, buf, src_len - nr_written);
+ if (nr < 0)
+ return -1;
+
+ nr_written += nr;
+ buf += nr;
+ offset += nr;
+ }
+
+ return nr_written;
+}
+
+static ssize_t qcow_write_sector(struct disk_image *disk, u64 sector,
+ const struct iovec *iov, int iovcount, void *param)
+{
+ ssize_t nr, total = 0;
+
+ while (iovcount--) {
+ nr = qcow_write_sector_single(disk, sector, iov->iov_base, iov->iov_len);
+ if (nr != (ssize_t)iov->iov_len) {
+ pr_info("qcow_write_sector error: nr=%ld iov_len=%ld\n", (long)nr, (long)iov->iov_len);
+ return -1;
+ }
+
+ sector += iov->iov_len >> SECTOR_SHIFT;
+ iov++;
+ total += nr;
+ }
+
+ return total;
+}
+
+static int qcow_disk_flush(struct disk_image *disk)
+{
+ struct qcow *q = disk->priv;
+ struct qcow_refcount_table *rft;
+ struct list_head *pos, *n;
+ struct qcow_l1_table *l1t;
+
+ l1t = &q->table;
+ rft = &q->refcount_table;
+
+ mutex_lock(&q->mutex);
+
+ list_for_each_safe(pos, n, &rft->lru_list) {
+ struct qcow_refcount_block *c = list_entry(pos, struct qcow_refcount_block, list);
+
+ if (write_refcount_block(q, c) < 0)
+ goto error_unlock;
+ }
+
+ list_for_each_safe(pos, n, &l1t->lru_list) {
+ struct qcow_l2_table *c = list_entry(pos, struct qcow_l2_table, list);
+
+ if (qcow_l2_cache_write(q, c) < 0)
+ goto error_unlock;
+ }
+
+ if (qcow_write_l1_table(q) < 0)
+ goto error_unlock;
+
+ mutex_unlock(&q->mutex);
+
+ return fsync(disk->fd);
+
+error_unlock:
+ mutex_unlock(&q->mutex);
+ return -1;
+}
+
+static int qcow_disk_close(struct disk_image *disk)
+{
+ struct qcow *q;
+
+ if (!disk)
+ return 0;
+
+ q = disk->priv;
+
+ refcount_table_free_cache(&q->refcount_table);
+ l1_table_free_cache(&q->table);
+ free(q->copy_buff);
+ free(q->cluster_data);
+ free(q->cluster_cache);
+ free(q->refcount_table.rf_table);
+ free(q->table.l1_table);
+ free(q->header);
+ free(q);
+
+ return 0;
+}
+
+static struct disk_image_operations qcow_disk_readonly_ops = {
+ .read = qcow_read_sector,
+ .close = qcow_disk_close,
+};
+
+static struct disk_image_operations qcow_disk_ops = {
+ .read = qcow_read_sector,
+ .write = qcow_write_sector,
+ .flush = qcow_disk_flush,
+ .close = qcow_disk_close,
+};
+
+static int qcow_read_refcount_table(struct qcow *q)
+{
+ struct qcow_header *header = q->header;
+ struct qcow_refcount_table *rft = &q->refcount_table;
+
+ rft->rf_size = (header->refcount_table_size * q->cluster_size)
+ / sizeof(u64);
+
+ rft->rf_table = calloc(rft->rf_size, sizeof(u64));
+ if (!rft->rf_table)
+ return -1;
+
+ rft->root = RB_ROOT;
+ INIT_LIST_HEAD(&rft->lru_list);
+
+ return pread_in_full(q->fd, rft->rf_table, sizeof(u64) * rft->rf_size, header->refcount_table_offset);
+}
+
+static int qcow_write_refcount_table(struct qcow *q)
+{
+ struct qcow_header *header = q->header;
+ struct qcow_refcount_table *rft = &q->refcount_table;
+
+ return qcow_pwrite_sync(q->fd, rft->rf_table,
+ rft->rf_size * sizeof(u64), header->refcount_table_offset);
+}
+
+static int qcow_read_l1_table(struct qcow *q)
+{
+ struct qcow_header *header = q->header;
+ struct qcow_l1_table *table = &q->table;
+
+ table->table_size = header->l1_size;
+
+ table->l1_table = calloc(table->table_size, sizeof(u64));
+ if (!table->l1_table)
+ return -1;
+
+ return pread_in_full(q->fd, table->l1_table, sizeof(u64) * table->table_size, header->l1_table_offset);
+}
+
+static void *qcow2_read_header(int fd)
+{
+ struct qcow2_header_disk f_header;
+ struct qcow_header *header;
+
+ header = malloc(sizeof(struct qcow_header));
+ if (!header)
+ return NULL;
+
+ if (pread_in_full(fd, &f_header, sizeof(struct qcow2_header_disk), 0) < 0) {
+ free(header);
+ return NULL;
+ }
+
+ be32_to_cpus(&f_header.magic);
+ be32_to_cpus(&f_header.version);
+ be64_to_cpus(&f_header.backing_file_offset);
+ be32_to_cpus(&f_header.backing_file_size);
+ be32_to_cpus(&f_header.cluster_bits);
+ be64_to_cpus(&f_header.size);
+ be32_to_cpus(&f_header.crypt_method);
+ be32_to_cpus(&f_header.l1_size);
+ be64_to_cpus(&f_header.l1_table_offset);
+ be64_to_cpus(&f_header.refcount_table_offset);
+ be32_to_cpus(&f_header.refcount_table_clusters);
+ be32_to_cpus(&f_header.nb_snapshots);
+ be64_to_cpus(&f_header.snapshots_offset);
+
+ *header = (struct qcow_header) {
+ .size = f_header.size,
+ .l1_table_offset = f_header.l1_table_offset,
+ .l1_size = f_header.l1_size,
+ .cluster_bits = f_header.cluster_bits,
+ .l2_bits = f_header.cluster_bits - 3,
+ .refcount_table_offset = f_header.refcount_table_offset,
+ .refcount_table_size = f_header.refcount_table_clusters,
+ };
+
+ return header;
+}
+
+static struct disk_image *qcow2_probe(int fd, bool readonly)
+{
+ struct disk_image *disk_image;
+ struct qcow_l1_table *l1t;
+ struct qcow_header *h;
+ struct qcow *q;
+
+ q = calloc(1, sizeof(struct qcow));
+ if (!q)
+ return NULL;
+
+ mutex_init(&q->mutex);
+ q->fd = fd;
+
+ l1t = &q->table;
+
+ l1t->root = RB_ROOT;
+ INIT_LIST_HEAD(&l1t->lru_list);
+
+ h = q->header = qcow2_read_header(fd);
+ if (!h)
+ goto free_qcow;
+
+ q->version = QCOW2_VERSION;
+ q->csize_shift = (62 - (q->header->cluster_bits - 8));
+ q->csize_mask = (1 << (q->header->cluster_bits - 8)) - 1;
+ q->cluster_offset_mask = (1LL << q->csize_shift) - 1;
+ q->cluster_size = 1 << q->header->cluster_bits;
+
+ q->copy_buff = malloc(q->cluster_size);
+ if (!q->copy_buff) {
+ pr_warning("copy buff malloc error");
+ goto free_header;
+ }
+
+ q->cluster_data = malloc(q->cluster_size);
+ if (!q->cluster_data) {
+ pr_warning("cluster data malloc error");
+ goto free_copy_buff;
+ }
+
+ q->cluster_cache = malloc(q->cluster_size);
+ if (!q->cluster_cache) {
+ pr_warning("cluster cache malloc error");
+ goto free_cluster_data;
+ }
+
+ if (qcow_read_l1_table(q) < 0)
+ goto free_cluster_cache;
+
+ if (qcow_read_refcount_table(q) < 0)
+ goto free_l1_table;
+
+ /*
+ * Do not use mmap; use read/write instead.
+ */
+ if (readonly)
+ disk_image = disk_image__new(fd, h->size, &qcow_disk_readonly_ops, DISK_IMAGE_REGULAR);
+ else
+ disk_image = disk_image__new(fd, h->size, &qcow_disk_ops, DISK_IMAGE_REGULAR);
+
+ if (IS_ERR_OR_NULL(disk_image))
+ goto free_refcount_table;
+
+ disk_image->async = 0;
+ disk_image->priv = q;
+
+ return disk_image;
+
+free_refcount_table:
+ if (q->refcount_table.rf_table)
+ free(q->refcount_table.rf_table);
+free_l1_table:
+ if (q->table.l1_table)
+ free(q->table.l1_table);
+free_cluster_cache:
+ if (q->cluster_cache)
+ free(q->cluster_cache);
+free_cluster_data:
+ if (q->cluster_data)
+ free(q->cluster_data);
+free_copy_buff:
+ if (q->copy_buff)
+ free(q->copy_buff);
+free_header:
+ if (q->header)
+ free(q->header);
+free_qcow:
+ if (q)
+ free(q);
+
+ return NULL;
+}
+
+static bool qcow2_check_image(int fd)
+{
+ struct qcow2_header_disk f_header;
+
+ if (pread_in_full(fd, &f_header, sizeof(struct qcow2_header_disk), 0) < 0)
+ return false;
+
+ be32_to_cpus(&f_header.magic);
+ be32_to_cpus(&f_header.version);
+
+ if (f_header.magic != QCOW_MAGIC)
+ return false;
+
+ if (f_header.version != QCOW2_VERSION)
+ return false;
+
+ return true;
+}
+
+static void *qcow1_read_header(int fd)
+{
+ struct qcow1_header_disk f_header;
+ struct qcow_header *header;
+
+ header = malloc(sizeof(struct qcow_header));
+ if (!header)
+ return NULL;
+
+ if (pread_in_full(fd, &f_header, sizeof(struct qcow1_header_disk), 0) < 0) {
+ free(header);
+ return NULL;
+ }
+
+ be32_to_cpus(&f_header.magic);
+ be32_to_cpus(&f_header.version);
+ be64_to_cpus(&f_header.backing_file_offset);
+ be32_to_cpus(&f_header.backing_file_size);
+ be32_to_cpus(&f_header.mtime);
+ be64_to_cpus(&f_header.size);
+ be32_to_cpus(&f_header.crypt_method);
+ be64_to_cpus(&f_header.l1_table_offset);
+
+ *header = (struct qcow_header) {
+ .size = f_header.size,
+ .l1_table_offset = f_header.l1_table_offset,
+ .l1_size = f_header.size / ((1 << f_header.l2_bits) * (1 << f_header.cluster_bits)),
+ .cluster_bits = f_header.cluster_bits,
+ .l2_bits = f_header.l2_bits,
+ };
+
+ return header;
+}
+
+static struct disk_image *qcow1_probe(int fd, bool readonly)
+{
+ struct disk_image *disk_image;
+ struct qcow_l1_table *l1t;
+ struct qcow_header *h;
+ struct qcow *q;
+
+ q = calloc(1, sizeof(struct qcow));
+ if (!q)
+ return NULL;
+
+ mutex_init(&q->mutex);
+ q->fd = fd;
+
+ l1t = &q->table;
+
+ l1t->root = RB_ROOT;
+ INIT_LIST_HEAD(&l1t->lru_list);
+
+ h = q->header = qcow1_read_header(fd);
+ if (!h)
+ goto free_qcow;
+
+ q->version = QCOW1_VERSION;
+ q->cluster_size = 1 << q->header->cluster_bits;
+ q->cluster_offset_mask = (1LL << (63 - q->header->cluster_bits)) - 1;
+ q->free_clust_idx = 0;
+
+ q->cluster_data = malloc(q->cluster_size);
+ if (!q->cluster_data) {
+ pr_warning("cluster data malloc error");
+ goto free_header;
+ }
+
+ q->cluster_cache = malloc(q->cluster_size);
+ if (!q->cluster_cache) {
+ pr_warning("cluster cache malloc error");
+ goto free_cluster_data;
+ }
+
+ if (qcow_read_l1_table(q) < 0)
+ goto free_cluster_cache;
+
+ /*
+ * Do not use mmap; use read/write instead.
+ */
+ if (readonly)
+ disk_image = disk_image__new(fd, h->size, &qcow_disk_readonly_ops, DISK_IMAGE_REGULAR);
+ else
+ disk_image = disk_image__new(fd, h->size, &qcow_disk_ops, DISK_IMAGE_REGULAR);
+
+ if (IS_ERR_OR_NULL(disk_image))
+ goto free_l1_table;
+
+ disk_image->async = 1;
+ disk_image->priv = q;
+
+ return disk_image;
+
+free_l1_table:
+ if (q->table.l1_table)
+ free(q->table.l1_table);
+free_cluster_cache:
+ if (q->cluster_cache)
+ free(q->cluster_cache);
+free_cluster_data:
+ if (q->cluster_data)
+ free(q->cluster_data);
+free_header:
+ if (q->header)
+ free(q->header);
+free_qcow:
+ if (q)
+ free(q);
+
+ return NULL;
+}
+
+static bool qcow1_check_image(int fd)
+{
+ struct qcow1_header_disk f_header;
+
+ if (pread_in_full(fd, &f_header, sizeof(struct qcow1_header_disk), 0) < 0)
+ return false;
+
+ be32_to_cpus(&f_header.magic);
+ be32_to_cpus(&f_header.version);
+
+ if (f_header.magic != QCOW_MAGIC)
+ return false;
+
+ if (f_header.version != QCOW1_VERSION)
+ return false;
+
+ return true;
+}
+
+struct disk_image *qcow_probe(int fd, bool readonly)
+{
+ if (qcow1_check_image(fd))
+ return qcow1_probe(fd, readonly);
+
+ if (qcow2_check_image(fd))
+ return qcow2_probe(fd, readonly);
+
+ return NULL;
+}
--- /dev/null
+#include "kvm/disk-image.h"
+
+#include <linux/err.h>
+
+#ifdef CONFIG_HAS_AIO
+#include <libaio.h>
+#endif
+
+ssize_t raw_image__read(struct disk_image *disk, u64 sector, const struct iovec *iov,
+ int iovcount, void *param)
+{
+ u64 offset = sector << SECTOR_SHIFT;
+
+#ifdef CONFIG_HAS_AIO
+ struct iocb iocb;
+
+ return aio_preadv(disk->ctx, &iocb, disk->fd, iov, iovcount, offset,
+ disk->evt, param);
+#else
+ return preadv_in_full(disk->fd, iov, iovcount, offset);
+#endif
+}
+
+ssize_t raw_image__write(struct disk_image *disk, u64 sector, const struct iovec *iov,
+ int iovcount, void *param)
+{
+ u64 offset = sector << SECTOR_SHIFT;
+
+#ifdef CONFIG_HAS_AIO
+ struct iocb iocb;
+
+ return aio_pwritev(disk->ctx, &iocb, disk->fd, iov, iovcount, offset,
+ disk->evt, param);
+#else
+ return pwritev_in_full(disk->fd, iov, iovcount, offset);
+#endif
+}
+
+ssize_t raw_image__read_mmap(struct disk_image *disk, u64 sector, const struct iovec *iov,
+ int iovcount, void *param)
+{
+ u64 offset = sector << SECTOR_SHIFT;
+ ssize_t total = 0;
+
+ while (iovcount--) {
+ memcpy(iov->iov_base, disk->priv + offset, iov->iov_len);
+
+ sector += iov->iov_len >> SECTOR_SHIFT;
+ offset += iov->iov_len;
+ total += iov->iov_len;
+ iov++;
+ }
+
+ return total;
+}
+
+ssize_t raw_image__write_mmap(struct disk_image *disk, u64 sector, const struct iovec *iov,
+ int iovcount, void *param)
+{
+ u64 offset = sector << SECTOR_SHIFT;
+ ssize_t total = 0;
+
+ while (iovcount--) {
+ memcpy(disk->priv + offset, iov->iov_base, iov->iov_len);
+
+ sector += iov->iov_len >> SECTOR_SHIFT;
+ offset += iov->iov_len;
+ total += iov->iov_len;
+ iov++;
+ }
+
+ return total;
+}
+
+int raw_image__close(struct disk_image *disk)
+{
+ int ret = 0;
+
+ if (disk->priv != MAP_FAILED)
+ ret = munmap(disk->priv, disk->size);
+
+ close(disk->evt);
+
+#ifdef CONFIG_HAS_AIO
+ io_destroy(disk->ctx);
+#endif
+
+ return ret;
+}
+
+/*
+ * multiple buffer based disk image operations
+ */
+static struct disk_image_operations raw_image_regular_ops = {
+ .read = raw_image__read,
+ .write = raw_image__write,
+};
+
+struct disk_image_operations ro_ops = {
+ .read = raw_image__read_mmap,
+ .write = raw_image__write_mmap,
+ .close = raw_image__close,
+};
+
+struct disk_image_operations ro_ops_nowrite = {
+ .read = raw_image__read,
+};
+
+struct disk_image *raw_image__probe(int fd, struct stat *st, bool readonly)
+{
+ struct disk_image *disk;
+
+ if (readonly) {
+ /*
+ * Use mmap's MAP_PRIVATE to implement non-persistent write
+ * FIXME: This does not work on 32-bit host.
+ */
+ disk = disk_image__new(fd, st->st_size, &ro_ops, DISK_IMAGE_MMAP);
+ if (IS_ERR_OR_NULL(disk)) {
+ disk = disk_image__new(fd, st->st_size, &ro_ops_nowrite, DISK_IMAGE_REGULAR);
+#ifdef CONFIG_HAS_AIO
+ if (!IS_ERR_OR_NULL(disk))
+ disk->async = 1;
+#endif
+ }
+
+ return disk;
+ } else {
+ /*
+ * Use read/write instead of mmap
+ */
+ disk = disk_image__new(fd, st->st_size, &raw_image_regular_ops, DISK_IMAGE_REGULAR);
+#ifdef CONFIG_HAS_AIO
+ if (!IS_ERR_OR_NULL(disk))
+ disk->async = 1;
+#endif
+ return disk;
+ }
+}
--- /dev/null
+#include "kvm/framebuffer.h"
+#include "kvm/kvm.h"
+
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <stdlib.h>
+#include <sys/mman.h>
+#include <errno.h>
+
+static LIST_HEAD(framebuffers);
+
+struct framebuffer *fb__register(struct framebuffer *fb)
+{
+ INIT_LIST_HEAD(&fb->node);
+ list_add(&fb->node, &framebuffers);
+
+ return fb;
+}
+
+int fb__attach(struct framebuffer *fb, struct fb_target_operations *ops)
+{
+ if (fb->nr_targets >= FB_MAX_TARGETS)
+ return -ENOSPC;
+
+ fb->targets[fb->nr_targets++] = ops;
+
+ return 0;
+}
+
+static int start_targets(struct framebuffer *fb)
+{
+ unsigned long i;
+
+ for (i = 0; i < fb->nr_targets; i++) {
+ struct fb_target_operations *ops = fb->targets[i];
+ int err = 0;
+
+ if (ops->start)
+ err = ops->start(fb);
+
+ if (err)
+ return err;
+ }
+
+ return 0;
+}
+
+int fb__init(struct kvm *kvm)
+{
+ struct framebuffer *fb;
+
+ list_for_each_entry(fb, &framebuffers, node) {
+ int err;
+
+ err = start_targets(fb);
+ if (err)
+ return err;
+ }
+
+ return 0;
+}
+firmware_init(fb__init);
+
+int fb__exit(struct kvm *kvm)
+{
+ struct framebuffer *fb;
+
+ list_for_each_entry(fb, &framebuffers, node) {
+ u32 i;
+
+ for (i = 0; i < fb->nr_targets; i++)
+ if (fb->targets[i]->stop)
+ fb->targets[i]->stop(fb);
+
+ munmap(fb->mem, fb->mem_size);
+ }
+
+ return 0;
+}
+firmware_exit(fb__exit);
--- /dev/null
+/*
+ * This is a simple init for shared rootfs guests. This part should be limited
+ * to doing mounts and running stage 2 of the init process.
+ */
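+#include <sys/types.h>
+#include <sys/ioctl.h>
+#include <sys/mount.h>
+#include <sys/reboot.h>
+#include <sys/stat.h>
+#include <sys/wait.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <errno.h>
+#include <linux/reboot.h>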
+
+static int run_process(char *filename)
+{
+ char *new_argv[] = { filename, NULL };
+ char *new_env[] = { "TERM=linux", "DISPLAY=192.168.33.1:0",
+ "HOME=/virt/home", NULL };
+
+ return execve(filename, new_argv, new_env);
+}
+
+static int run_process_sandbox(char *filename)
+{
+ char *new_argv[] = { filename, "/virt/sandbox.sh", NULL };
+ char *new_env[] = { "TERM=linux", "HOME=/virt/home", NULL };
+
+ return execve(filename, new_argv, new_env);
+}
+
+static void do_mounts(void)
+{
+ mount("hostfs", "/host", "9p", MS_RDONLY, "trans=virtio,version=9p2000.L");
+ mount("", "/sys", "sysfs", 0, NULL);
+ mount("proc", "/proc", "proc", 0, NULL);
+ mount("devtmpfs", "/dev", "devtmpfs", 0, NULL);
+ mkdir("/dev/pts", 0755);
+ mount("devpts", "/dev/pts", "devpts", 0, NULL);
+}
+
+int main(int argc, char *argv[])
+{
+ pid_t child;
+ int status;
+
+ puts("Mounting...");
+
+ do_mounts();
+
+ /* become session leader */
+ setsid();
+
+ /* set controlling terminal */
+ ioctl(0, TIOCSCTTY, 1);
+
+ child = fork();
+ if (child < 0) {
+ printf("Fatal: fork() failed with %d\n", child);
+ return 0;
+ } else if (child == 0) {
+ if (access("/virt/sandbox.sh", R_OK) == 0)
+ run_process_sandbox("/bin/sh");
+ else
+ run_process("/bin/sh");
+ } else {
+ waitpid(child, &status, 0);
+ }
+
+ reboot(LINUX_REBOOT_CMD_RESTART);
+
+ printf("Init failed: %s\n", strerror(errno));
+
+ return 0;
+}
--- /dev/null
+#include "kvm/guest_compat.h"
+
+#include "kvm/mutex.h"
+
+#include <linux/kernel.h>
+#include <linux/list.h>
+
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+struct compat_message {
+ int id;
+ char *title;
+ char *desc;
+
+ struct list_head list;
+};
+
+static int id;
+static DEFINE_MUTEX(compat_mtx);
+static LIST_HEAD(messages);
+
+static void compat__free(struct compat_message *msg)
+{
+ free(msg->title);
+ free(msg->desc);
+ free(msg);
+}
+
+int compat__add_message(const char *title, const char *desc)
+{
+ struct compat_message *msg;
+ int msg_id;
+
+ msg = malloc(sizeof(*msg));
+ if (msg == NULL)
+ goto cleanup;
+
+ msg->title = strdup(title);
+ msg->desc = strdup(desc);
+
+ if (msg->title == NULL || msg->desc == NULL)
+ goto cleanup;
+
+ mutex_lock(&compat_mtx);
+
+ msg->id = msg_id = id++;
+ list_add_tail(&msg->list, &messages);
+
+ mutex_unlock(&compat_mtx);
+
+ return msg_id;
+
+cleanup:
+ if (msg)
+ compat__free(msg);
+
+ return -ENOMEM;
+}
+
+int compat__remove_message(int id)
+{
+ struct compat_message *pos, *n;
+
+ mutex_lock(&compat_mtx);
+
+ list_for_each_entry_safe(pos, n, &messages, list) {
+ if (pos->id == id) {
+ list_del(&pos->list);
+ compat__free(pos);
+
+ mutex_unlock(&compat_mtx);
+
+ return 0;
+ }
+ }
+
+ mutex_unlock(&compat_mtx);
+
+ return -ENOENT;
+}
+
+int compat__print_all_messages(void)
+{
+ mutex_lock(&compat_mtx);
+
+ while (!list_empty(&messages)) {
+ struct compat_message *msg;
+
+ msg = list_first_entry(&messages, struct compat_message, list);
+
+ printf("\n # KVM compatibility warning.\n\t%s\n\t%s\n",
+ msg->title, msg->desc);
+
+ list_del(&msg->list);
+ compat__free(msg);
+ }
+
+ mutex_unlock(&compat_mtx);
+
+ return 0;
+}
--- /dev/null
+#include "kvm/read-write.h"
+#include "kvm/ioport.h"
+#include "kvm/mutex.h"
+#include "kvm/util.h"
+#include "kvm/term.h"
+#include "kvm/kvm.h"
+#include "kvm/i8042.h"
+#include "kvm/kvm-cpu.h"
+
+#include <stdint.h>
+
+/*
+ * IRQs
+ */
+#define KBD_IRQ 1
+#define AUX_IRQ 12
+
+/*
+ * Registers
+ */
+#define I8042_DATA_REG 0x60
+#define I8042_COMMAND_REG 0x64
+
+/*
+ * Commands
+ */
+#define I8042_CMD_CTL_RCTR 0x20
+#define I8042_CMD_CTL_WCTR 0x60
+#define I8042_CMD_AUX_LOOP 0xD3
+#define I8042_CMD_AUX_SEND 0xD4
+#define I8042_CMD_AUX_TEST 0xA9
+#define I8042_CMD_AUX_DISABLE 0xA7
+#define I8042_CMD_AUX_ENABLE 0xA8
+#define I8042_CMD_SYSTEM_RESET 0xFE
+
+#define RESPONSE_ACK 0xFA
+
+#define MODE_DISABLE_AUX 0x20
+
+#define AUX_ENABLE_REPORTING 0x20
+#define AUX_SCALING_FLAG 0x10
+#define AUX_DEFAULT_RESOLUTION 0x2
+#define AUX_DEFAULT_SAMPLE 100
+
+/*
+ * Status register bits
+ */
+#define I8042_STR_AUXDATA 0x20
+#define I8042_STR_KEYLOCK 0x10
+#define I8042_STR_CMDDAT 0x08
+#define I8042_STR_MUXERR 0x04
+#define I8042_STR_OBF 0x01
+
+#define KBD_MODE_KBD_INT 0x01
+#define KBD_MODE_SYS 0x02
+
+#define QUEUE_SIZE 128
+
+/*
+ * This represents the current state of the PS/2 keyboard system,
+ * including the AUX device (the mouse)
+ */
+struct kbd_state {
+ struct kvm *kvm;
+
+ char kq[QUEUE_SIZE]; /* Keyboard queue */
+ int kread, kwrite; /* Indexes into the queue */
+ int kcount; /* number of elements in queue */
+
+ char mq[QUEUE_SIZE]; /* Mouse queue */
+ int mread, mwrite; /* Indexes into the queue */
+ int mcount; /* number of elements in queue */
+
+ u8 mstatus; /* Mouse status byte */
+ u8 mres; /* Current mouse resolution */
+ u8 msample; /* Current mouse samples/second */
+
+ u8 mode; /* i8042 mode register */
+ u8 status; /* i8042 status register */
+ /*
+ * Some commands (on port 0x64) have arguments;
+ * we store the command here while we wait for the argument
+ */
+ u32 write_cmd;
+};
+
+static struct kbd_state state;
+
+/*
+ * If there are packets to be read, set the appropriate IRQs high
+ */
+static void kbd_update_irq(void)
+{
+ u8 klevel = 0;
+ u8 mlevel = 0;
+
+ /* First, clear the kbd and aux output buffer full bits */
+ state.status &= ~(I8042_STR_OBF | I8042_STR_AUXDATA);
+
+ if (state.kcount > 0) {
+ state.status |= I8042_STR_OBF;
+ klevel = 1;
+ }
+
+ /* Keyboard has higher priority than mouse */
+ if (klevel == 0 && state.mcount != 0) {
+ state.status |= I8042_STR_OBF | I8042_STR_AUXDATA;
+ mlevel = 1;
+ }
+
+ kvm__irq_line(state.kvm, KBD_IRQ, klevel);
+ kvm__irq_line(state.kvm, AUX_IRQ, mlevel);
+}
+
+/*
+ * Add a byte to the mouse queue, then set IRQs
+ */
+void mouse_queue(u8 c)
+{
+ if (state.mcount >= QUEUE_SIZE)
+ return;
+
+ state.mq[state.mwrite++ % QUEUE_SIZE] = c;
+
+ state.mcount++;
+ kbd_update_irq();
+}
+
+/*
+ * Add a byte to the keyboard queue, then set IRQs
+ */
+void kbd_queue(u8 c)
+{
+ if (state.kcount >= QUEUE_SIZE)
+ return;
+
+ state.kq[state.kwrite++ % QUEUE_SIZE] = c;
+
+ state.kcount++;
+ kbd_update_irq();
+}
+
+static void kbd_write_command(struct kvm *kvm, u8 val)
+{
+ switch (val) {
+ case I8042_CMD_CTL_RCTR:
+ kbd_queue(state.mode);
+ break;
+ case I8042_CMD_CTL_WCTR:
+ case I8042_CMD_AUX_SEND:
+ case I8042_CMD_AUX_LOOP:
+ state.write_cmd = val;
+ break;
+ case I8042_CMD_AUX_TEST:
+ /* 0 means we're a normal PS/2 mouse */
+ mouse_queue(0);
+ break;
+ case I8042_CMD_AUX_DISABLE:
+ state.mode |= MODE_DISABLE_AUX;
+ break;
+ case I8042_CMD_AUX_ENABLE:
+ state.mode &= ~MODE_DISABLE_AUX;
+ break;
+ case I8042_CMD_SYSTEM_RESET:
+ kvm_cpu__reboot(kvm);
+ break;
+ default:
+ break;
+ }
+}
+
+/*
+ * Called when the OS reads from port 0x60 (PS/2 data)
+ */
+static u32 kbd_read_data(void)
+{
+ u32 ret;
+ int i;
+
+ if (state.kcount != 0) {
+ /* Keyboard data gets read first */
+ ret = state.kq[state.kread++ % QUEUE_SIZE];
+ state.kcount--;
+ kvm__irq_line(state.kvm, KBD_IRQ, 0);
+ kbd_update_irq();
+ } else if (state.mcount > 0) {
+ /* Followed by the mouse */
+ ret = state.mq[state.mread++ % QUEUE_SIZE];
+ state.mcount--;
+ kvm__irq_line(state.kvm, AUX_IRQ, 0);
+ kbd_update_irq();
+ } else {
+ /* Both queues are empty: re-read the last keyboard byte */
+ i = (state.kread + QUEUE_SIZE - 1) % QUEUE_SIZE;
+ ret = state.kq[i];
+ }
+ return ret;
+}
+
+/*
+ * Called when the OS reads from port 0x64, the command/status port
+ */
+static u32 kbd_read_status(void)
+{
+ return (u32)state.status;
+}
+
+/*
+ * Called when the OS writes to port 0x60 (data port)
+ * Things written here are generally arguments to commands previously
+ * written to port 0x64 and stored in state.write_cmd
+ */
+static void kbd_write_data(u32 val)
+{
+ switch (state.write_cmd) {
+ case I8042_CMD_CTL_WCTR:
+ state.mode = val;
+ kbd_update_irq();
+ break;
+ case I8042_CMD_AUX_LOOP:
+ mouse_queue(val);
+ mouse_queue(RESPONSE_ACK);
+ break;
+ case I8042_CMD_AUX_SEND:
+ /* The OS wants to send a command to the mouse */
+ mouse_queue(RESPONSE_ACK);
+ switch (val) {
+ case 0xe6:
+ /* set scaling = 1:1 */
+ state.mstatus &= ~AUX_SCALING_FLAG;
+ break;
+ case 0xe8:
+ /* set resolution */
+ state.mres = val;
+ break;
+ case 0xe9:
+ /* Report mouse status/config */
+ mouse_queue(state.mstatus);
+ mouse_queue(state.mres);
+ mouse_queue(state.msample);
+ break;
+ case 0xf2:
+ /* send ID */
+ mouse_queue(0); /* normal mouse */
+ break;
+ case 0xf3:
+ /* set sample rate */
+ state.msample = val;
+ break;
+ case 0xf4:
+ /* enable reporting */
+ state.mstatus |= AUX_ENABLE_REPORTING;
+ break;
+ case 0xf5:
+ state.mstatus &= ~AUX_ENABLE_REPORTING;
+ break;
+ case 0xf6:
+ /* set defaults, just fall through to reset */
+ case 0xff:
+ /* reset */
+ state.mstatus = 0x0;
+ state.mres = AUX_DEFAULT_RESOLUTION;
+ state.msample = AUX_DEFAULT_SAMPLE;
+ break;
+ default:
+ break;
+ }
+ break;
+ case 0:
+ /* Just send the ID */
+ kbd_queue(RESPONSE_ACK);
+ kbd_queue(0xab);
+ kbd_queue(0x41);
+ kbd_update_irq();
+ break;
+ default:
+ /* Yeah whatever */
+ break;
+ }
+ state.write_cmd = 0;
+}
+
+static void kbd_reset(void)
+{
+ state = (struct kbd_state) {
+ .status = I8042_STR_MUXERR | I8042_STR_CMDDAT | I8042_STR_KEYLOCK, /* 0x1c */
+ .mode = KBD_MODE_KBD_INT | KBD_MODE_SYS, /* 0x3 */
+ .mres = AUX_DEFAULT_RESOLUTION,
+ .msample = AUX_DEFAULT_SAMPLE,
+ };
+}
+
+/*
+ * Called when the OS reads from one of the keyboard's ports (0x60 or 0x64)
+ */
+static bool kbd_in(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size)
+{
+ switch (port) {
+ case I8042_COMMAND_REG: {
+ u8 value = kbd_read_status();
+ ioport__write8(data, value);
+ break;
+ }
+ case I8042_DATA_REG: {
+ u32 value = kbd_read_data();
+ ioport__write32(data, value);
+ break;
+ }
+ default:
+ return false;
+ }
+
+ return true;
+}
+
+static bool kbd_out(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size)
+{
+ switch (port) {
+ case I8042_COMMAND_REG: {
+ u8 value = ioport__read8(data);
+ kbd_write_command(kvm, value);
+ break;
+ }
+ case I8042_DATA_REG: {
+ u32 value = ioport__read32(data);
+ kbd_write_data(value);
+ break;
+ }
+ default:
+ return false;
+ }
+
+ return true;
+}
+
+static struct ioport_operations kbd_ops = {
+ .io_in = kbd_in,
+ .io_out = kbd_out,
+};
+
+int kbd__init(struct kvm *kvm)
+{
+#ifndef CONFIG_X86
+ return 0;
+#endif
+
+ kbd_reset();
+ state.kvm = kvm;
+ ioport__register(kvm, I8042_DATA_REG, &kbd_ops, 2, NULL);
+ ioport__register(kvm, I8042_COMMAND_REG, &kbd_ops, 2, NULL);
+
+ return 0;
+}
+dev_init(kbd__init);
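
The keyboard and mouse queues in the i8042 emulation above are fixed-size ring buffers driven by free-running read/write counters that are reduced modulo QUEUE_SIZE on use. The following is a minimal standalone sketch of that technique, not the code from this patch: `struct ring`, `ring_put`, `ring_get` and `ring_last` are hypothetical names introduced for illustration, and `ring_last` shows one possible way to re-read the most recently consumed byte without indexing past the array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_SIZE 128

/* Hypothetical stand-in for the kq/kread/kwrite/kcount fields. */
struct ring {
	uint8_t  buf[QUEUE_SIZE];
	unsigned rd, wr;	/* free-running; reduced modulo QUEUE_SIZE on use */
	unsigned count;		/* number of bytes currently queued */
};

/* Append one byte; returns false (dropping the byte) when the ring is full. */
static bool ring_put(struct ring *r, uint8_t c)
{
	if (r->count >= QUEUE_SIZE)
		return false;
	r->buf[r->wr++ % QUEUE_SIZE] = c;
	r->count++;
	return true;
}

/* Remove and return the oldest byte; the ring must not be empty. */
static uint8_t ring_get(struct ring *r)
{
	assert(r->count > 0);
	r->count--;
	return r->buf[r->rd++ % QUEUE_SIZE];
}

/* Re-read the most recently consumed byte, staying inside the array. */
static uint8_t ring_last(const struct ring *r)
{
	return r->buf[(r->rd + QUEUE_SIZE - 1) % QUEUE_SIZE];
}
```

Because QUEUE_SIZE is a power of two, the unsigned counters may wrap around without disturbing the modulo arithmetic; a non-power-of-two size would need explicit wrapping instead.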
--- /dev/null
+#include "kvm/pci-shmem.h"
+#include "kvm/virtio-pci-dev.h"
+#include "kvm/irq.h"
+#include "kvm/kvm.h"
+#include "kvm/pci.h"
+#include "kvm/util.h"
+#include "kvm/ioport.h"
+#include "kvm/ioeventfd.h"
+
+#include <linux/kvm.h>
+#include <linux/byteorder.h>
+#include <sys/ioctl.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+
+#define MB_SHIFT (20)
+#define KB_SHIFT (10)
+#define GB_SHIFT (30)
+
+static struct pci_device_header pci_shmem_pci_device = {
+ .vendor_id = cpu_to_le16(PCI_VENDOR_ID_REDHAT_QUMRANET),
+ .device_id = cpu_to_le16(0x1110),
+ .header_type = PCI_HEADER_TYPE_NORMAL,
+ .class[2] = 0xFF, /* misc pci device */
+ .status = cpu_to_le16(PCI_STATUS_CAP_LIST),
+ .capabilities = (void *)&pci_shmem_pci_device.msix - (void *)&pci_shmem_pci_device,
+ .msix.cap = PCI_CAP_ID_MSIX,
+ .msix.ctrl = cpu_to_le16(1),
+ .msix.table_offset = cpu_to_le32(1), /* Use BAR 1 */
+ .msix.pba_offset = cpu_to_le32(0x1001), /* Use BAR 1 */
+};
+
+/* registers for the Inter-VM shared memory device */
+enum ivshmem_registers {
+ INTRMASK = 0,
+ INTRSTATUS = 4,
+ IVPOSITION = 8,
+ DOORBELL = 12,
+};
+
+static struct shmem_info *shmem_region;
+static u16 ivshmem_registers;
+static int local_fd;
+static u32 local_id;
+static u64 msix_block;
+static u64 msix_pba;
+static struct msix_table msix_table[2];
+
+int pci_shmem__register_mem(struct shmem_info *si)
+{
+ if (shmem_region == NULL) {
+ shmem_region = si;
+ } else {
+ pr_warning("only single shmem currently avail. ignoring.\n");
+ free(si);
+ }
+ return 0;
+}
+
+static bool shmem_pci__io_in(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size)
+{
+ u16 offset = port - ivshmem_registers;
+
+ switch (offset) {
+ case INTRMASK:
+ break;
+ case INTRSTATUS:
+ break;
+ case IVPOSITION:
+ ioport__write32(data, local_id);
+ break;
+ case DOORBELL:
+ break;
+ };
+
+ return true;
+}
+
+static bool shmem_pci__io_out(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size)
+{
+ u16 offset = port - ivshmem_registers;
+
+ switch (offset) {
+ case INTRMASK:
+ break;
+ case INTRSTATUS:
+ break;
+ case IVPOSITION:
+ break;
+ case DOORBELL:
+ break;
+ };
+
+ return true;
+}
+
+static struct ioport_operations shmem_pci__io_ops = {
+ .io_in = shmem_pci__io_in,
+ .io_out = shmem_pci__io_out,
+};
+
+static void callback_mmio_msix(u64 addr, u8 *data, u32 len, u8 is_write, void *ptr)
+{
+ void *mem;
+
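+ u64 offset = addr - msix_block;
+
+ if (offset < 0x1000) {
+ mem = (void *)&msix_table + offset;
+ } else {
+ /* The PBA starts at offset 0x1000 within the MSI-X BAR */
+ mem = (void *)&msix_pba + (offset - 0x1000);
+ }
+
+ if (is_write)
+ memcpy(mem, data, len);
+ else
+ memcpy(data, mem, len);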
+}
+
+/*
+ * Return an irqfd which can be used by other guests to signal this guest
+ * whenever they need to poke it
+ */
+int pci_shmem__get_local_irqfd(struct kvm *kvm)
+{
+ int fd, gsi, r;
+ struct kvm_irqfd irqfd;
+
+ if (local_fd == 0) {
+ fd = eventfd(0, 0);
+ if (fd < 0)
+ return fd;
+
+ if (pci_shmem_pci_device.msix.ctrl & cpu_to_le16(PCI_MSIX_FLAGS_ENABLE)) {
+ gsi = irq__add_msix_route(kvm, &msix_table[0].msg);
+ } else {
+ gsi = pci_shmem_pci_device.irq_line;
+ }
+
+ irqfd = (struct kvm_irqfd) {
+ .fd = fd,
+ .gsi = gsi,
+ };
+
+ r = ioctl(kvm->vm_fd, KVM_IRQFD, &irqfd);
+ if (r < 0) {
+ /* Don't leak the eventfd if it could not be attached */
+ close(fd);
+ return r;
+ }
+
+ local_fd = fd;
+ }
+
+ return local_fd;
+}
+
+/*
+ * Connect a new client to ivshmem by adding the appropriate datamatch
+ * to the DOORBELL
+ */
+int pci_shmem__add_client(struct kvm *kvm, u32 id, int fd)
+{
+ struct kvm_ioeventfd ioevent;
+
+ ioevent = (struct kvm_ioeventfd) {
+ .addr = ivshmem_registers + DOORBELL,
+ .len = sizeof(u32),
+ .datamatch = id,
+ .fd = fd,
+ .flags = KVM_IOEVENTFD_FLAG_PIO | KVM_IOEVENTFD_FLAG_DATAMATCH,
+ };
+
+ return ioctl(kvm->vm_fd, KVM_IOEVENTFD, &ioevent);
+}
+
+/*
+ * Remove a client connected to ivshmem by removing the appropriate datamatch
+ * from the DOORBELL
+ */
+int pci_shmem__remove_client(struct kvm *kvm, u32 id)
+{
+ struct kvm_ioeventfd ioevent;
+
+ ioevent = (struct kvm_ioeventfd) {
+ .addr = ivshmem_registers + DOORBELL,
+ .len = sizeof(u32),
+ .datamatch = id,
+ .flags = KVM_IOEVENTFD_FLAG_PIO
+ | KVM_IOEVENTFD_FLAG_DATAMATCH
+ | KVM_IOEVENTFD_FLAG_DEASSIGN,
+ };
+
+ return ioctl(kvm->vm_fd, KVM_IOEVENTFD, &ioevent);
+}
+
+static void *setup_shmem(const char *key, size_t len, int creating)
+{
+ int fd;
+ int rtn;
+ void *mem;
+ int flag = O_RDWR;
+
+ if (creating)
+ flag |= O_CREAT;
+
+ fd = shm_open(key, flag, S_IRUSR | S_IWUSR);
+ if (fd < 0) {
+ pr_warning("Failed to open shared memory file %s\n", key);
+ return NULL;
+ }
+
+ if (creating) {
+ rtn = ftruncate(fd, (off_t) len);
+ if (rtn < 0)
+ pr_warning("Can't ftruncate(fd,%zu)\n", len);
+ }
+ mem = mmap(NULL, len,
+ PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE, fd, 0);
+ if (mem == MAP_FAILED) {
+ pr_warning("Failed to mmap shared memory file");
+ mem = NULL;
+ }
+ close(fd);
+
+ return mem;
+}
+
+int shmem_parser(const struct option *opt, const char *arg, int unset)
+{
+ const u64 default_size = SHMEM_DEFAULT_SIZE;
+ const u64 default_phys_addr = SHMEM_DEFAULT_ADDR;
+ const char *default_handle = SHMEM_DEFAULT_HANDLE;
+ struct shmem_info *si = malloc(sizeof(struct shmem_info));
+ u64 phys_addr;
+ u64 size;
+ char *handle = NULL;
+ int create = 0;
+ const char *p = arg;
+ char *next;
+ int base = 10;
+ int verbose = 0;
+
+ const int skip_pci = strlen("pci:");
+ if (verbose)
+ pr_info("shmem_parser(%p,%s,%d)", opt, arg, unset);
+ /* parse out optional addr family */
+ if (strcasestr(p, "pci:")) {
+ p += skip_pci;
+ } else if (strcasestr(p, "mem:")) {
+ die("I can't add to E820 map yet.\n");
+ }
+ /* parse out physical addr */
+ base = 10;
+ if (strcasestr(p, "0x"))
+ base = 16;
+ phys_addr = strtoll(p, &next, base);
+ if (next == p && phys_addr == 0) {
+ pr_info("shmem: no physical addr specified, using default.");
+ phys_addr = default_phys_addr;
+ }
+ if (*next != ':' && *next != '\0')
+ die("shmem: unexpected chars after phys addr.\n");
+ if (*next == '\0')
+ p = next;
+ else
+ p = next + 1;
+ /* parse out size */
+ base = 10;
+ if (strcasestr(p, "0x"))
+ base = 16;
+ size = strtoll(p, &next, base);
+ if (next == p && size == 0) {
+ pr_info("shmem: no size specified, using default.");
+ size = default_size;
+ }
+ /* look for [KMGkmg][Bb]* uses base 2. */
+ int skip_B = 0;
+ if (strspn(next, "KMGkmg")) { /* might have a prefix */
+ if (*(next + 1) == 'B' || *(next + 1) == 'b')
+ skip_B = 1;
+ switch (*next) {
+ case 'K':
+ case 'k':
+ size = size << KB_SHIFT;
+ break;
+ case 'M':
+ case 'm':
+ size = size << MB_SHIFT;
+ break;
+ case 'G':
+ case 'g':
+ size = size << GB_SHIFT;
+ break;
+ default:
+ die("shmem: bug in detecting size prefix.");
+ break;
+ }
+ next += 1 + skip_B;
+ }
+ if (*next != ':' && *next != '\0') {
+ die("shmem: unexpected chars after phys size. <%c><%c>\n",
+ *next, *p);
+ }
+ if (*next == '\0')
+ p = next;
+ else
+ p = next + 1;
+ /* parse out optional shmem handle */
+ const int skip_handle = strlen("handle=");
+ next = strcasestr(p, "handle=");
+ if (*p && next) {
+ if (p != next)
+ die("unexpected chars before handle\n");
+ p += skip_handle;
+ next = strchrnul(p, ':');
+ if (next - p) {
+ handle = malloc(next - p + 1);
+ strncpy(handle, p, next - p);
+ handle[next - p] = '\0'; /* just in case. */
+ }
+ if (*next == '\0')
+ p = next;
+ else
+ p = next + 1;
+ }
+ /* parse optional create flag to see if we should create shm seg. */
+ if (*p && strcasestr(p, "create")) {
+ create = 1;
+ p += strlen("create");
+ }
+ if (*p != '\0')
+ die("shmem: unexpected trailing chars\n");
+ if (handle == NULL) {
+ handle = malloc(strlen(default_handle) + 1);
+ strcpy(handle, default_handle);
+ }
+ if (verbose) {
+ pr_info("shmem: phys_addr = %llx", phys_addr);
+ pr_info("shmem: size = %llx", size);
+ pr_info("shmem: handle = %s", handle);
+ pr_info("shmem: create = %d", create);
+ }
+
+ si->phys_addr = phys_addr;
+ si->size = size;
+ si->handle = handle;
+ si->create = create;
+ pci_shmem__register_mem(si); /* ownership of si, etc. passed on. */
+ return 0;
+}
+
+int pci_shmem__init(struct kvm *kvm)
+{
+ u8 dev, line, pin;
+ char *mem;
+ int r;
+
+ if (shmem_region == 0)
+ return 0;
+
+ /* Register good old INTx */
+ r = irq__register_device(PCI_DEVICE_ID_PCI_SHMEM, &dev, &pin, &line);
+ if (r < 0)
+ return r;
+
+ pci_shmem_pci_device.irq_pin = pin;
+ pci_shmem_pci_device.irq_line = line;
+
+ /* Register the I/O port range for the ivshmem registers */
+ r = ioport__register(kvm, IOPORT_EMPTY, &shmem_pci__io_ops, IOPORT_SIZE, NULL);
+ if (r < 0)
+ return r;
+ ivshmem_registers = (u16)r;
+
+ /* Register MMIO space for MSI-X */
+ msix_block = pci_get_io_space_block(0x1010);
+ kvm__register_mmio(kvm, msix_block, 0x1010, false, callback_mmio_msix, NULL);
+
+ /*
+ * This registers 3 BARs:
+ *
+ * 0 - ivshmem registers
+ * 1 - MSI-X MMIO space
+ * 2 - Shared memory block
+ */
+ pci_shmem_pci_device.bar[0] = cpu_to_le32(ivshmem_registers | PCI_BASE_ADDRESS_SPACE_IO);
+ pci_shmem_pci_device.bar_size[0] = shmem_region->size;
+ pci_shmem_pci_device.bar[1] = cpu_to_le32(msix_block | PCI_BASE_ADDRESS_SPACE_MEMORY);
+ pci_shmem_pci_device.bar_size[1] = 0x1010;
+ pci_shmem_pci_device.bar[2] = cpu_to_le32(shmem_region->phys_addr | PCI_BASE_ADDRESS_SPACE_MEMORY);
+ pci_shmem_pci_device.bar_size[2] = shmem_region->size;
+
+ pci__register(&pci_shmem_pci_device, dev);
+
+ /* Open shared memory and plug it into the guest */
+ mem = setup_shmem(shmem_region->handle, shmem_region->size,
+ shmem_region->create);
+ if (mem == NULL)
+ return -EINVAL;
+
+ kvm__register_mem(kvm, shmem_region->phys_addr, shmem_region->size,
+ mem);
+ return 0;
+}
+dev_init(pci_shmem__init);
+
+int pci_shmem__exit(struct kvm *kvm)
+{
+ return 0;
+}
+dev_exit(pci_shmem__exit);
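
The size field handled by shmem_parser() above accepts an optional K/M/G suffix (optionally followed by B) and scales the number by a power of two. A minimal sketch of that step in isolation follows. The helper name `parse_size` is hypothetical, and unlike the patch it lets strtoull() pick the base (base 0), which is an assumption made here to keep the example short; it also means a leading 0 is read as octal.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse "<number>[K|M|G][B]" into a byte count.
 * Returns 0 on success and points *end at the first unparsed character,
 * or returns -1 if the string does not start with a number.
 */
static int parse_size(const char *s, uint64_t *out, const char **end)
{
	char *next;
	uint64_t v;
	int has_suffix = 1;

	assert(s && out && end);

	v = strtoull(s, &next, 0);	/* base 0: decimal, 0x hex or 0 octal */
	if (next == s)
		return -1;

	switch (*next) {
	case 'K': case 'k': v <<= 10; break;
	case 'M': case 'm': v <<= 20; break;
	case 'G': case 'g': v <<= 30; break;
	default: has_suffix = 0; break;
	}

	if (has_suffix) {
		next++;			/* skip K, M or G */
		if (*next == 'B' || *next == 'b')
			next++;		/* skip the optional B */
	}

	*out = v;
	*end = next;
	return 0;
}
```

For an argument such as `8MB:handle=x`, this yields 8 << 20 bytes and leaves `*end` pointing at the `:` separator, so the caller can carry on with the next field.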
--- /dev/null
+#include "kvm/rtc.h"
+
+#include "kvm/ioport.h"
+#include "kvm/kvm.h"
+
+#include <time.h>
+
+/*
+ * MC146818 RTC registers
+ */
+#define RTC_SECONDS 0x00
+#define RTC_SECONDS_ALARM 0x01
+#define RTC_MINUTES 0x02
+#define RTC_MINUTES_ALARM 0x03
+#define RTC_HOURS 0x04
+#define RTC_HOURS_ALARM 0x05
+#define RTC_DAY_OF_WEEK 0x06
+#define RTC_DAY_OF_MONTH 0x07
+#define RTC_MONTH 0x08
+#define RTC_YEAR 0x09
+
+#define RTC_REG_A 0x0A
+#define RTC_REG_B 0x0B
+#define RTC_REG_C 0x0C
+#define RTC_REG_D 0x0D
+
+struct rtc_device {
+ u8 cmos_idx;
+ u8 cmos_data[128];
+};
+
+static struct rtc_device rtc;
+
+static inline unsigned char bin2bcd(unsigned val)
+{
+ return ((val / 10) << 4) + val % 10;
+}
+
+static bool cmos_ram_data_in(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size)
+{
+ struct tm *tm;
+ time_t ti;
+ int year;
+
+ time(&ti);
+
+ tm = gmtime(&ti);
+
+ switch (rtc.cmos_idx) {
+ case RTC_SECONDS:
+ ioport__write8(data, bin2bcd(tm->tm_sec));
+ break;
+ case RTC_MINUTES:
+ ioport__write8(data, bin2bcd(tm->tm_min));
+ break;
+ case RTC_HOURS:
+ ioport__write8(data, bin2bcd(tm->tm_hour));
+ break;
+ case RTC_DAY_OF_WEEK:
+ ioport__write8(data, bin2bcd(tm->tm_wday + 1));
+ break;
+ case RTC_DAY_OF_MONTH:
+ ioport__write8(data, bin2bcd(tm->tm_mday));
+ break;
+ case RTC_MONTH:
+ ioport__write8(data, bin2bcd(tm->tm_mon + 1));
+ break;
+ case RTC_YEAR:
+ /*
+ * The tm_year field filled in by gmtime() counts years since
+ * 1900. The CMOS standard is years since 2000. If tm_year is
+ * below 100, use it as is; otherwise subtract 100. This is
+ * the best fit with the twisted CMOS logic.
+ */
+ year = tm->tm_year;
+ if (year > 99)
+ year -= 100;
+ ioport__write8(data, bin2bcd(year));
+ break;
+ default:
+ ioport__write8(data, rtc.cmos_data[rtc.cmos_idx]);
+ break;
+ }
+
+ return true;
+}
+
+static bool cmos_ram_data_out(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size)
+{
+ switch (rtc.cmos_idx) {
+ case RTC_REG_C:
+ case RTC_REG_D:
+ /* Read-only */
+ break;
+ default:
+ rtc.cmos_data[rtc.cmos_idx] = ioport__read8(data);
+ break;
+ }
+
+ return true;
+}
+
+static struct ioport_operations cmos_ram_data_ioport_ops = {
+ .io_out = cmos_ram_data_out,
+ .io_in = cmos_ram_data_in,
+};
+
+static bool cmos_ram_index_out(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size)
+{
+ u8 value = ioport__read8(data);
+
+ kvm->nmi_disabled = value & (1UL << 7);
+ rtc.cmos_idx = value & ~(1UL << 7);
+
+ return true;
+}
+
+static struct ioport_operations cmos_ram_index_ioport_ops = {
+ .io_out = cmos_ram_index_out,
+};
+
+int rtc__init(struct kvm *kvm)
+{
+ int r = 0;
+
+ /* PORT 0070-007F - CMOS RAM/RTC (REAL TIME CLOCK) */
+ r = ioport__register(kvm, 0x0070, &cmos_ram_index_ioport_ops, 1, NULL);
+ if (r < 0)
+ return r;
+
+ r = ioport__register(kvm, 0x0071, &cmos_ram_data_ioport_ops, 1, NULL);
+ if (r < 0) {
+ ioport__unregister(kvm, 0x0070);
+ return r;
+ }
+
+ return r;
+}
+dev_init(rtc__init);
+
+int rtc__exit(struct kvm *kvm)
+{
+ /* PORT 0070-007F - CMOS RAM/RTC (REAL TIME CLOCK) */
+ ioport__unregister(kvm, 0x0070);
+ ioport__unregister(kvm, 0x0071);
+
+ return 0;
+}
+dev_exit(rtc__exit);
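
The RTC emulation above returns every time field in binary-coded decimal and folds tm_year into a two-digit year. A small standalone sketch of both steps follows; `cmos_year` is a hypothetical helper name used only for illustration, and the sketch assumes dates between 1900 and 2099 so that the result always fits in two BCD digits.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a value in the range 0..99 into two BCD nibbles, e.g. 59 -> 0x59. */
static uint8_t bin2bcd(unsigned val)
{
	assert(val < 100);
	return (uint8_t)(((val / 10) << 4) | (val % 10));
}

/* Hypothetical helper: map tm_year (years since 1900) to the CMOS year. */
static uint8_t cmos_year(int tm_year)
{
	if (tm_year > 99)
		tm_year -= 100;	/* 2000..2099 become 00..99 */
	return bin2bcd((unsigned)tm_year);
}
```

For example, tm_year == 113 (the year 2013) is reported to the guest as the byte 0x13.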
--- /dev/null
+#include "kvm/8250-serial.h"
+
+#include "kvm/read-write.h"
+#include "kvm/ioport.h"
+#include "kvm/mutex.h"
+#include "kvm/util.h"
+#include "kvm/term.h"
+#include "kvm/kvm.h"
+
+#include <linux/types.h>
+#include <linux/serial_reg.h>
+
+#include <pthread.h>
+
+/*
+ * This fakes a U6_16550A. The fifo len needs to be 64 as the kernel
+ * expects that for autodetection.
+ */
+#define FIFO_LEN 64
+#define FIFO_MASK (FIFO_LEN - 1)
+
+#define UART_IIR_TYPE_BITS 0xc0
+
+struct serial8250_device {
+ pthread_mutex_t mutex;
+ u8 id;
+
+ u16 iobase;
+ u8 irq;
+ u8 irq_state;
+ int txcnt;
+ int rxcnt;
+ int rxdone;
+ char txbuf[FIFO_LEN];
+ char rxbuf[FIFO_LEN];
+
+ u8 dll;
+ u8 dlm;
+ u8 iir;
+ u8 ier;
+ u8 fcr;
+ u8 lcr;
+ u8 mcr;
+ u8 lsr;
+ u8 msr;
+ u8 scr;
+};
+
+#define SERIAL_REGS_SETTING \
+ .iir = UART_IIR_NO_INT, \
+ .lsr = UART_LSR_TEMT | UART_LSR_THRE, \
+ .msr = UART_MSR_DCD | UART_MSR_DSR | UART_MSR_CTS, \
+ .mcr = UART_MCR_OUT2,
+
+static struct serial8250_device devices[] = {
+ /* ttyS0 */
+ [0] = {
+ .mutex = PTHREAD_MUTEX_INITIALIZER,
+
+ .id = 0,
+ .iobase = 0x3f8,
+ .irq = 4,
+
+ SERIAL_REGS_SETTING
+ },
+ /* ttyS1 */
+ [1] = {
+ .mutex = PTHREAD_MUTEX_INITIALIZER,
+
+ .id = 1,
+ .iobase = 0x2f8,
+ .irq = 3,
+
+ SERIAL_REGS_SETTING
+ },
+ /* ttyS2 */
+ [2] = {
+ .mutex = PTHREAD_MUTEX_INITIALIZER,
+
+ .id = 2,
+ .iobase = 0x3e8,
+ .irq = 4,
+
+ SERIAL_REGS_SETTING
+ },
+ /* ttyS3 */
+ [3] = {
+ .mutex = PTHREAD_MUTEX_INITIALIZER,
+
+ .id = 3,
+ .iobase = 0x2e8,
+ .irq = 3,
+
+ SERIAL_REGS_SETTING
+ },
+};
+
+static void serial8250_flush_tx(struct kvm *kvm, struct serial8250_device *dev)
+{
+ dev->lsr |= UART_LSR_TEMT | UART_LSR_THRE;
+
+ if (dev->txcnt) {
+ if (kvm->cfg.active_console == CONSOLE_8250)
+ term_putc(dev->txbuf, dev->txcnt, dev->id);
+ dev->txcnt = 0;
+ }
+}
+
+static void serial8250_update_irq(struct kvm *kvm, struct serial8250_device *dev)
+{
+ u8 iir = 0;
+
+ /* Handle clear rx (the clear bits live in the FIFO control register) */
+ if (dev->fcr & UART_FCR_CLEAR_RCVR) {
+ dev->fcr &= ~UART_FCR_CLEAR_RCVR;
+ dev->rxcnt = dev->rxdone = 0;
+ dev->lsr &= ~UART_LSR_DR;
+ }
+
+ /* Handle clear tx */
+ if (dev->fcr & UART_FCR_CLEAR_XMIT) {
+ dev->fcr &= ~UART_FCR_CLEAR_XMIT;
+ dev->txcnt = 0;
+ dev->lsr |= UART_LSR_TEMT | UART_LSR_THRE;
+ }
+
+ /* Data ready and rcv interrupt enabled ? */
+ if ((dev->ier & UART_IER_RDI) && (dev->lsr & UART_LSR_DR))
+ iir |= UART_IIR_RDI;
+
+ /* Transmitter empty and interrupt enabled ? */
+ if ((dev->ier & UART_IER_THRI) && (dev->lsr & UART_LSR_TEMT))
+ iir |= UART_IIR_THRI;
+
+ /* Now update the irq line, if necessary */
+ if (!iir) {
+ dev->iir = UART_IIR_NO_INT;
+ if (dev->irq_state)
+ kvm__irq_line(kvm, dev->irq, 0);
+ } else {
+ dev->iir = iir;
+ if (!dev->irq_state)
+ kvm__irq_line(kvm, dev->irq, 1);
+ }
+ dev->irq_state = iir;
+
+ /*
+ * If the kernel disabled the tx interrupt, we know that there
+ * is nothing more to transmit, so we can reset our tx logic
+ * here.
+ */
+ if (!(dev->ier & UART_IER_THRI))
+ serial8250_flush_tx(kvm, dev);
+}
+
+#define SYSRQ_PENDING_NONE 0
+
+static int sysrq_pending;
+
+static void serial8250__sysrq(struct kvm *kvm, struct serial8250_device *dev)
+{
+ dev->lsr |= UART_LSR_DR | UART_LSR_BI;
+ dev->rxbuf[dev->rxcnt++] = sysrq_pending;
+ sysrq_pending = SYSRQ_PENDING_NONE;
+}
+
+static void serial8250__receive(struct kvm *kvm, struct serial8250_device *dev,
+ bool handle_sysrq)
+{
+ int c;
+
+ /*
+ * If the guest transmitted a full fifo, we clear the
+ * TEMT/THRE bits to let the kernel escape from the 8250
+ * interrupt handler. We come here only once a ms, so that
+ * should give the kernel the desired pause. That also flushes
+ * the tx fifo to the terminal.
+ */
+ serial8250_flush_tx(kvm, dev);
+
+ if (dev->mcr & UART_MCR_LOOP)
+ return;
+
+ if ((dev->lsr & UART_LSR_DR) || dev->rxcnt)
+ return;
+
+ if (handle_sysrq && sysrq_pending) {
+ serial8250__sysrq(kvm, dev);
+ return;
+ }
+
+ if (kvm->cfg.active_console != CONSOLE_8250)
+ return;
+
+ while (term_readable(dev->id) &&
+ dev->rxcnt < FIFO_LEN) {
+
+ c = term_getc(kvm, dev->id);
+
+ if (c < 0)
+ break;
+ dev->rxbuf[dev->rxcnt++] = c;
+ dev->lsr |= UART_LSR_DR;
+ }
+}
+
+void serial8250__update_consoles(struct kvm *kvm)
+{
+ unsigned int i;
+
+ for (i = 0; i < ARRAY_SIZE(devices); i++) {
+ struct serial8250_device *dev = &devices[i];
+
+ mutex_lock(&dev->mutex);
+
+ /* Restrict sysrq injection to the first port */
+ serial8250__receive(kvm, dev, i == 0);
+
+ serial8250_update_irq(kvm, dev);
+
+ mutex_unlock(&dev->mutex);
+ }
+}
+
+void serial8250__inject_sysrq(struct kvm *kvm, char sysrq)
+{
+ sysrq_pending = sysrq;
+}
+
+static struct serial8250_device *find_device(u16 port)
+{
+ unsigned int i;
+
+ for (i = 0; i < ARRAY_SIZE(devices); i++) {
+ struct serial8250_device *dev = &devices[i];
+
+ if (dev->iobase == (port & ~0x7))
+ return dev;
+ }
+ return NULL;
+}
+
+static bool serial8250_out(struct ioport *ioport, struct kvm *kvm, u16 port,
+ void *data, int size)
+{
+ struct serial8250_device *dev;
+ u16 offset;
+ bool ret = true;
+ char *addr = data;
+
+ dev = find_device(port);
+ if (!dev)
+ return false;
+
+ mutex_lock(&dev->mutex);
+
+ offset = port - dev->iobase;
+
+ switch (offset) {
+ case UART_TX:
+ if (dev->lcr & UART_LCR_DLAB) {
+ dev->dll = ioport__read8(data);
+ break;
+ }
+
+ /* Loopback mode */
+ if (dev->mcr & UART_MCR_LOOP) {
+ if (dev->rxcnt < FIFO_LEN) {
+ dev->rxbuf[dev->rxcnt++] = *addr;
+ dev->lsr |= UART_LSR_DR;
+ }
+ break;
+ }
+
+ if (dev->txcnt < FIFO_LEN) {
+ dev->txbuf[dev->txcnt++] = *addr;
+ dev->lsr &= ~UART_LSR_TEMT;
+ if (dev->txcnt == FIFO_LEN / 2)
+ dev->lsr &= ~UART_LSR_THRE;
+ } else {
+ /* Should never happen */
+ dev->lsr &= ~(UART_LSR_TEMT | UART_LSR_THRE);
+ }
+ break;
+ case UART_IER:
+ if (!(dev->lcr & UART_LCR_DLAB))
+ dev->ier = ioport__read8(data) & 0x0f;
+ else
+ dev->dlm = ioport__read8(data);
+ break;
+ case UART_FCR:
+ dev->fcr = ioport__read8(data);
+ break;
+ case UART_LCR:
+ dev->lcr = ioport__read8(data);
+ break;
+ case UART_MCR:
+ dev->mcr = ioport__read8(data);
+ break;
+ case UART_LSR:
+ /* Factory test */
+ break;
+ case UART_MSR:
+ /* Not used */
+ break;
+ case UART_SCR:
+ dev->scr = ioport__read8(data);
+ break;
+ default:
+ ret = false;
+ break;
+ }
+
+ serial8250_update_irq(kvm, dev);
+
+ mutex_unlock(&dev->mutex);
+
+ return ret;
+}
+
+static void serial8250_rx(struct serial8250_device *dev, void *data)
+{
+ if (dev->rxdone == dev->rxcnt)
+ return;
+
+ /* Break issued ? */
+ if (dev->lsr & UART_LSR_BI) {
+ dev->lsr &= ~UART_LSR_BI;
+ ioport__write8(data, 0);
+ return;
+ }
+
+ ioport__write8(data, dev->rxbuf[dev->rxdone++]);
+ if (dev->rxcnt == dev->rxdone) {
+ dev->lsr &= ~UART_LSR_DR;
+ dev->rxcnt = dev->rxdone = 0;
+ }
+}
+
+static bool serial8250_in(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size)
+{
+ struct serial8250_device *dev;
+ u16 offset;
+ bool ret = true;
+
+ dev = find_device(port);
+ if (!dev)
+ return false;
+
+ mutex_lock(&dev->mutex);
+
+ offset = port - dev->iobase;
+
+ switch (offset) {
+ case UART_RX:
+ if (dev->lcr & UART_LCR_DLAB)
+ ioport__write8(data, dev->dll);
+ else
+ serial8250_rx(dev, data);
+ break;
+ case UART_IER:
+ if (dev->lcr & UART_LCR_DLAB)
+ ioport__write8(data, dev->dlm);
+ else
+ ioport__write8(data, dev->ier);
+ break;
+ case UART_IIR:
+ ioport__write8(data, dev->iir | UART_IIR_TYPE_BITS);
+ break;
+ case UART_LCR:
+ ioport__write8(data, dev->lcr);
+ break;
+ case UART_MCR:
+ ioport__write8(data, dev->mcr);
+ break;
+ case UART_LSR:
+ ioport__write8(data, dev->lsr);
+ break;
+ case UART_MSR:
+ ioport__write8(data, dev->msr);
+ break;
+ case UART_SCR:
+ ioport__write8(data, dev->scr);
+ break;
+ default:
+ ret = false;
+ break;
+ }
+
+ serial8250_update_irq(kvm, dev);
+
+ mutex_unlock(&dev->mutex);
+
+ return ret;
+}
+
+static struct ioport_operations serial8250_ops = {
+ .io_in = serial8250_in,
+ .io_out = serial8250_out,
+};
+
+static int serial8250__device_init(struct kvm *kvm, struct serial8250_device *dev)
+{
+ int r;
+
+ r = ioport__register(kvm, dev->iobase, &serial8250_ops, 8, NULL);
+ kvm__irq_line(kvm, dev->irq, 0);
+
+ return r;
+}
+
+int serial8250__init(struct kvm *kvm)
+{
+ unsigned int i, j;
+ int r = 0;
+
+ for (i = 0; i < ARRAY_SIZE(devices); i++) {
+ struct serial8250_device *dev = &devices[i];
+
+ r = serial8250__device_init(kvm, dev);
+ if (r < 0)
+ goto cleanup;
+ }
+
+ return r;
+cleanup:
+ for (j = 0; j < i; j++) {
+ struct serial8250_device *dev = &devices[j];
+
+ ioport__unregister(kvm, dev->iobase);
+ }
+
+ return r;
+}
+dev_init(serial8250__init);
+
+int serial8250__exit(struct kvm *kvm)
+{
+ unsigned int i;
+ int r;
+
+ for (i = 0; i < ARRAY_SIZE(devices); i++) {
+ struct serial8250_device *dev = &devices[i];
+
+ r = ioport__unregister(kvm, dev->iobase);
+ if (r < 0)
+ return r;
+ }
+
+ return 0;
+}
+dev_exit(serial8250__exit);
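
Each emulated UART above occupies eight consecutive I/O ports, so find_device() masks off the low three bits of the port to recover the base address, and the register offset is the remainder. A minimal standalone sketch of that lookup follows; `uart_bases`, `uart_index` and `uart_offset` are hypothetical names, and the base addresses are the four taken from the devices[] table in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Base addresses of ttyS0..ttyS3, as listed in devices[]. */
static const uint16_t uart_bases[] = { 0x3f8, 0x2f8, 0x3e8, 0x2e8 };

/* Return the index of the UART that owns `port`, or -1 if none does. */
static int uart_index(uint16_t port)
{
	size_t i;

	for (i = 0; i < sizeof(uart_bases) / sizeof(uart_bases[0]); i++) {
		if (uart_bases[i] == (port & ~0x7))
			return (int)i;
	}
	return -1;
}

/* Register offset (0..7) of `port` within its UART. */
static uint16_t uart_offset(uint16_t port)
{
	return port & 0x7;
}
```

A guest access to port 0x3fd therefore resolves to ttyS0 at offset 5, which is the line status register.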
--- /dev/null
+#include "kvm/vesa.h"
+
+#include "kvm/virtio-pci-dev.h"
+#include "kvm/framebuffer.h"
+#include "kvm/kvm-cpu.h"
+#include "kvm/ioport.h"
+#include "kvm/util.h"
+#include "kvm/irq.h"
+#include "kvm/kvm.h"
+#include "kvm/pci.h"
+
+#include <linux/byteorder.h>
+#include <sys/mman.h>
+#include <linux/err.h>
+#include <sys/types.h>
+#include <sys/ioctl.h>
+#include <inttypes.h>
+#include <unistd.h>
+
+static bool vesa_pci_io_in(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size)
+{
+ return true;
+}
+
+static bool vesa_pci_io_out(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size)
+{
+ return true;
+}
+
+static struct ioport_operations vesa_io_ops = {
+ .io_in = vesa_pci_io_in,
+ .io_out = vesa_pci_io_out,
+};
+
+static struct pci_device_header vesa_pci_device = {
+ .vendor_id = cpu_to_le16(PCI_VENDOR_ID_REDHAT_QUMRANET),
+ .device_id = cpu_to_le16(PCI_DEVICE_ID_VESA),
+ .header_type = PCI_HEADER_TYPE_NORMAL,
+ .revision_id = 0,
+ .class[2] = 0x03,
+ .subsys_vendor_id = cpu_to_le16(PCI_SUBSYSTEM_VENDOR_ID_REDHAT_QUMRANET),
+ .subsys_id = cpu_to_le16(PCI_SUBSYSTEM_ID_VESA),
+ .bar[1] = cpu_to_le32(VESA_MEM_ADDR | PCI_BASE_ADDRESS_SPACE_MEMORY),
+ .bar_size[1] = VESA_MEM_SIZE,
+};
+
+static struct framebuffer vesafb;
+
+struct framebuffer *vesa__init(struct kvm *kvm)
+{
+ u16 vesa_base_addr;
+ u8 dev, line, pin;
+ char *mem;
+ int r;
+
+ if (!kvm->cfg.vnc && !kvm->cfg.sdl)
+ return NULL;
+
+ r = irq__register_device(PCI_DEVICE_ID_VESA, &dev, &pin, &line);
+ if (r < 0)
+ return ERR_PTR(r);
+
+ r = ioport__register(kvm, IOPORT_EMPTY, &vesa_io_ops, IOPORT_SIZE, NULL);
+ if (r < 0)
+ return ERR_PTR(r);
+
+ vesa_pci_device.irq_pin = pin;
+ vesa_pci_device.irq_line = line;
+ vesa_base_addr = (u16)r;
+ vesa_pci_device.bar[0] = cpu_to_le32(vesa_base_addr | PCI_BASE_ADDRESS_SPACE_IO);
+ pci__register(&vesa_pci_device, dev);
+
+ mem = mmap(NULL, VESA_MEM_SIZE, PROT_RW, MAP_ANON_NORESERVE, -1, 0);
+ if (mem == MAP_FAILED)
+ return ERR_PTR(-errno);
+
+ kvm__register_mem(kvm, VESA_MEM_ADDR, VESA_MEM_SIZE, mem);
+
+ vesafb = (struct framebuffer) {
+ .width = VESA_WIDTH,
+ .height = VESA_HEIGHT,
+ .depth = VESA_BPP,
+ .mem = mem,
+ .mem_addr = VESA_MEM_ADDR,
+ .mem_size = VESA_MEM_SIZE,
+ .kvm = kvm,
+ };
+ return fb__register(&vesafb);
+}
--- /dev/null
+#ifndef _KVM_ASM_HWEIGHT_H_
+#define _KVM_ASM_HWEIGHT_H_
+
+#include <linux/types.h>
+unsigned int hweight32(unsigned int w);
+unsigned long hweight64(__u64 w);
+
+#endif /* _KVM_ASM_HWEIGHT_H_ */
--- /dev/null
+#ifndef KVM_BIOS_MEMCPY_H
+#define KVM_BIOS_MEMCPY_H
+
+#include <linux/types.h>
+#include <stddef.h>
+
+void memcpy16(u16 dst_seg, void *dst, u16 src_seg, const void *src, size_t len);
+
+#endif /* KVM_BIOS_MEMCPY_H */
--- /dev/null
+#ifndef KVM__8250_SERIAL_H
+#define KVM__8250_SERIAL_H
+
+struct kvm;
+
+int serial8250__init(struct kvm *kvm);
+int serial8250__exit(struct kvm *kvm);
+void serial8250__update_consoles(struct kvm *kvm);
+void serial8250__inject_sysrq(struct kvm *kvm, char sysrq);
+
+#endif /* KVM__8250_SERIAL_H */
--- /dev/null
+#ifndef KVM_APIC_H_
+#define KVM_APIC_H_
+
+#include <asm/apicdef.h>
+
+/*
+ * APIC, IOAPIC stuff
+ */
+#define APIC_BASE_ADDR_STEP 0x00400000
+#define IOAPIC_BASE_ADDR_STEP 0x00100000
+
+#define APIC_ADDR(apic) (APIC_DEFAULT_PHYS_BASE + apic * APIC_BASE_ADDR_STEP)
+#define IOAPIC_ADDR(ioapic) (IO_APIC_DEFAULT_PHYS_BASE + ioapic * IOAPIC_BASE_ADDR_STEP)
+
+#define KVM_APIC_VERSION 0x14 /* xAPIC */
+
+#endif /* KVM_APIC_H_ */
--- /dev/null
+#ifndef KVM__BRLOCK_H
+#define KVM__BRLOCK_H
+
+#include "kvm/kvm.h"
+#include "kvm/barrier.h"
+
+/*
+ * brlock is a lock which is very cheap for reads, but very expensive
+ * for writes.
+ * This lock will be used when updates are very rare and reads are common.
+ * This lock is currently implemented by stopping the guest while
+ * performing the updates. We assume that the only threads which read from
+ * the locked data are VCPU threads, and the only writer isn't a VCPU thread.
+ */
+
+#ifndef barrier
+#define barrier() __asm__ __volatile__("": : :"memory")
+#endif
+
+#ifdef KVM_BRLOCK_DEBUG
+
+#include "kvm/rwsem.h"
+
+DECLARE_RWSEM(brlock_sem);
+
+#define br_read_lock(kvm) down_read(&brlock_sem)
+#define br_read_unlock(kvm) up_read(&brlock_sem)
+
+#define br_write_lock(kvm) down_write(&brlock_sem)
+#define br_write_unlock(kvm) up_write(&brlock_sem)
+
+#else
+
+#define br_read_lock(kvm) barrier()
+#define br_read_unlock(kvm) barrier()
+
+#define br_write_lock(kvm) kvm__pause(kvm)
+#define br_write_unlock(kvm) kvm__continue(kvm)
+#endif
+
+#endif
--- /dev/null
+#ifndef KVM__BALLOON_H
+#define KVM__BALLOON_H
+
+#include <kvm/util.h>
+
+int kvm_cmd_balloon(int argc, const char **argv, const char *prefix);
+void kvm_balloon_help(void) NORETURN;
+
+#endif
--- /dev/null
+#ifndef KVM__DEBUG_H
+#define KVM__DEBUG_H
+
+#include <kvm/util.h>
+#include <linux/types.h>
+
+#define KVM_DEBUG_CMD_TYPE_DUMP (1 << 0)
+#define KVM_DEBUG_CMD_TYPE_NMI (1 << 1)
+#define KVM_DEBUG_CMD_TYPE_SYSRQ (1 << 2)
+
+struct debug_cmd_params {
+ u32 dbg_type;
+ u32 cpu;
+ char sysrq;
+};
+
+int kvm_cmd_debug(int argc, const char **argv, const char *prefix);
+void kvm_debug_help(void) NORETURN;
+
+#endif
--- /dev/null
+#ifndef __KVM_HELP_H__
+#define __KVM_HELP_H__
+
+int kvm_cmd_help(int argc, const char **argv, const char *prefix);
+
+#endif
--- /dev/null
+#ifndef KVM__LIST_H
+#define KVM__LIST_H
+
+#include <kvm/util.h>
+
+int kvm_cmd_list(int argc, const char **argv, const char *prefix);
+void kvm_list_help(void) NORETURN;
+int get_vmstate(int sock);
+
+#endif
--- /dev/null
+#ifndef KVM__PAUSE_H
+#define KVM__PAUSE_H
+
+#include <kvm/util.h>
+
+int kvm_cmd_pause(int argc, const char **argv, const char *prefix);
+void kvm_pause_help(void) NORETURN;
+
+#endif
--- /dev/null
+#ifndef KVM__RESUME_H
+#define KVM__RESUME_H
+
+#include <kvm/util.h>
+
+int kvm_cmd_resume(int argc, const char **argv, const char *prefix);
+void kvm_resume_help(void) NORETURN;
+
+#endif
--- /dev/null
+#ifndef __KVM_RUN_H__
+#define __KVM_RUN_H__
+
+#include <kvm/util.h>
+
+int kvm_cmd_run(int argc, const char **argv, const char *prefix);
+void kvm_run_help(void) NORETURN;
+
+void kvm_run_set_wrapper_sandbox(void);
+
+#endif
--- /dev/null
+#ifndef KVM__SANDBOX_H
+#define KVM__SANDBOX_H
+
+int kvm_cmd_sandbox(int argc, const char **argv, const char *prefix);
+
+#endif
--- /dev/null
+#ifndef KVM__SETUP_H
+#define KVM__SETUP_H
+
+#include <kvm/util.h>
+
+int kvm_cmd_setup(int argc, const char **argv, const char *prefix);
+void kvm_setup_help(void) NORETURN;
+int kvm_setup_create_new(const char *guestfs_name);
+void kvm_setup_resolv(const char *guestfs_name);
+
+#endif
--- /dev/null
+#ifndef KVM__STAT_H
+#define KVM__STAT_H
+
+#include <kvm/util.h>
+
+int kvm_cmd_stat(int argc, const char **argv, const char *prefix);
+void kvm_stat_help(void) NORETURN;
+
+#endif
--- /dev/null
+#ifndef KVM__STOP_H
+#define KVM__STOP_H
+
+#include <kvm/util.h>
+
+int kvm_cmd_stop(int argc, const char **argv, const char *prefix);
+void kvm_stop_help(void) NORETURN;
+
+#endif
--- /dev/null
+#ifndef KVM__VERSION_H
+#define KVM__VERSION_H
+
+int kvm_cmd_version(int argc, const char **argv, const char *prefix);
+
+#endif
--- /dev/null
+#ifndef KVM_COMPILER_H_
+#define KVM_COMPILER_H_
+
+#ifndef __compiletime_error
+# define __compiletime_error(message)
+#endif
+
+#define notrace __attribute__((no_instrument_function))
+
+#endif /* KVM_COMPILER_H_ */
--- /dev/null
+#ifndef KVM__DISK_IMAGE_H
+#define KVM__DISK_IMAGE_H
+
+#include "kvm/read-write.h"
+#include "kvm/util.h"
+#include "kvm/parse-options.h"
+
+#include <linux/types.h>
+#include <linux/fs.h> /* for BLKGETSIZE64 */
+#include <sys/ioctl.h>
+#include <sys/types.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <stdbool.h>
+#include <sys/uio.h>
+#include <stddef.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <fcntl.h>
+
+#define SECTOR_SHIFT 9
+#define SECTOR_SIZE (1UL << SECTOR_SHIFT)
+
+enum {
+ DISK_IMAGE_REGULAR,
+ DISK_IMAGE_MMAP,
+};
+
+#define MAX_DISK_IMAGES 4
+
+struct disk_image;
+
+struct disk_image_operations {
+ ssize_t (*read)(struct disk_image *disk, u64 sector, const struct iovec *iov,
+ int iovcount, void *param);
+ ssize_t (*write)(struct disk_image *disk, u64 sector, const struct iovec *iov,
+ int iovcount, void *param);
+ int (*flush)(struct disk_image *disk);
+ int (*close)(struct disk_image *disk);
+};
+
+struct disk_image_params {
+ const char *filename;
+ /*
+ * wwpn == World Wide Port Number
+ * tpgt == Target Portal Group Tag
+ */
+ const char *wwpn;
+ const char *tpgt;
+ bool readonly;
+ bool direct;
+};
+
+struct disk_image {
+ int fd;
+ u64 size;
+ struct disk_image_operations *ops;
+ void *priv;
+ void *disk_req_cb_param;
+ void (*disk_req_cb)(void *param, long len);
+ bool async;
+ int evt;
+#ifdef CONFIG_HAS_AIO
+ io_context_t ctx;
+#endif
+ const char *wwpn;
+ const char *tpgt;
+ int debug_iodelay;
+};
+
+int disk_img_name_parser(const struct option *opt, const char *arg, int unset);
+int disk_image__init(struct kvm *kvm);
+int disk_image__exit(struct kvm *kvm);
+struct disk_image *disk_image__new(int fd, u64 size, struct disk_image_operations *ops, int mmap);
+int disk_image__flush(struct disk_image *disk);
+ssize_t disk_image__read(struct disk_image *disk, u64 sector, const struct iovec *iov,
+ int iovcount, void *param);
+ssize_t disk_image__write(struct disk_image *disk, u64 sector, const struct iovec *iov,
+ int iovcount, void *param);
+ssize_t disk_image__get_serial(struct disk_image *disk, void *buffer, ssize_t *len);
+
+struct disk_image *raw_image__probe(int fd, struct stat *st, bool readonly);
+struct disk_image *blkdev__probe(const char *filename, int flags, struct stat *st);
+
+ssize_t raw_image__read(struct disk_image *disk, u64 sector,
+ const struct iovec *iov, int iovcount, void *param);
+ssize_t raw_image__write(struct disk_image *disk, u64 sector,
+ const struct iovec *iov, int iovcount, void *param);
+ssize_t raw_image__read_mmap(struct disk_image *disk, u64 sector,
+ const struct iovec *iov, int iovcount, void *param);
+ssize_t raw_image__write_mmap(struct disk_image *disk, u64 sector,
+ const struct iovec *iov, int iovcount, void *param);
+int raw_image__close(struct disk_image *disk);
+void disk_image__set_callback(struct disk_image *disk, void (*disk_req_cb)(void *param, long len));
+#endif /* KVM__DISK_IMAGE_H */
--- /dev/null
+#ifndef KVM_E820_H
+#define KVM_E820_H
+
+#include <linux/types.h>
+#include <kvm/bios.h>
+
+#define SMAP 0x534d4150 /* ASCII "SMAP" */
+
+struct biosregs;
+
+extern bioscall void e820_query_map(struct biosregs *regs);
+
+#endif /* KVM_E820_H */
--- /dev/null
+#ifndef KVM__FRAMEBUFFER_H
+#define KVM__FRAMEBUFFER_H
+
+#include <linux/types.h>
+#include <linux/list.h>
+
+struct framebuffer;
+
+struct fb_target_operations {
+ int (*start)(struct framebuffer *fb);
+ int (*stop)(struct framebuffer *fb);
+};
+
+#define FB_MAX_TARGETS 2
+
+struct framebuffer {
+ struct list_head node;
+
+ u32 width;
+ u32 height;
+ u8 depth;
+ char *mem;
+ u64 mem_addr;
+ u64 mem_size;
+ struct kvm *kvm;
+
+ unsigned long nr_targets;
+ struct fb_target_operations *targets[FB_MAX_TARGETS];
+};
+
+struct framebuffer *fb__register(struct framebuffer *fb);
+int fb__attach(struct framebuffer *fb, struct fb_target_operations *ops);
+int fb__init(struct kvm *kvm);
+int fb__exit(struct kvm *kvm);
+
+#endif /* KVM__FRAMEBUFFER_H */
--- /dev/null
+#ifndef KVM__GUEST_COMPAT_H
+#define KVM__GUEST_COMPAT_H
+
+int compat__print_all_messages(void);
+int compat__remove_message(int id);
+int compat__add_message(const char *title, const char *description);
+
+
+#endif
\ No newline at end of file
--- /dev/null
+#ifndef KVM__PCKBD_H
+#define KVM__PCKBD_H
+
+#include <linux/types.h>
+
+struct kvm;
+
+void mouse_queue(u8 c);
+void kbd_queue(u8 c);
+int kbd__init(struct kvm *kvm);
+
+#endif
--- /dev/null
+#ifndef KVM__IOEVENTFD_H
+#define KVM__IOEVENTFD_H
+
+#include <linux/types.h>
+#include <linux/list.h>
+#include <sys/eventfd.h>
+#include "kvm/util.h"
+
+struct kvm;
+
+struct ioevent {
+ u64 io_addr;
+ u8 io_len;
+ void (*fn)(struct kvm *kvm, void *ptr);
+ struct kvm *fn_kvm;
+ void *fn_ptr;
+ int fd;
+ u64 datamatch;
+
+ struct list_head list;
+};
+
+int ioeventfd__init(struct kvm *kvm);
+int ioeventfd__exit(struct kvm *kvm);
+int ioeventfd__add_event(struct ioevent *ioevent, bool is_pio, bool poll_in_userspace);
+int ioeventfd__del_event(u64 addr, u64 datamatch);
+
+#endif
--- /dev/null
+#ifndef KVM__IOPORT_H
+#define KVM__IOPORT_H
+
+#include "kvm/rbtree-interval.h"
+
+#include <stdbool.h>
+#include <limits.h>
+#include <asm/types.h>
+#include <linux/types.h>
+#include <linux/byteorder.h>
+
+/* Some ports we reserve for our own use */
+#define IOPORT_DBG 0xe0
+#define IOPORT_START 0x6200
+#define IOPORT_SIZE 0x400
+
+#define IOPORT_EMPTY USHRT_MAX
+
+struct kvm;
+
+struct ioport {
+ struct rb_int_node node;
+ struct ioport_operations *ops;
+ void *priv;
+};
+
+struct ioport_operations {
+ bool (*io_in)(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size);
+ bool (*io_out)(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size);
+};
+
+void ioport__setup_arch(struct kvm *kvm);
+
+int ioport__register(struct kvm *kvm, u16 port, struct ioport_operations *ops,
+ int count, void *param);
+int ioport__unregister(struct kvm *kvm, u16 port);
+int ioport__init(struct kvm *kvm);
+int ioport__exit(struct kvm *kvm);
+
+static inline u8 ioport__read8(u8 *data)
+{
+ return *data;
+}
+/* On BE platforms, PCI I/O is byteswapped, i.e. LE, so swap back. */
+static inline u16 ioport__read16(u16 *data)
+{
+ return le16_to_cpu(*data);
+}
+
+static inline u32 ioport__read32(u32 *data)
+{
+ return le32_to_cpu(*data);
+}
+
+static inline void ioport__write8(u8 *data, u8 value)
+{
+ *data = value;
+}
+
+static inline void ioport__write16(u16 *data, u16 value)
+{
+ *data = cpu_to_le16(value);
+}
+
+static inline void ioport__write32(u32 *data, u32 value)
+{
+ *data = cpu_to_le32(value);
+}
+
+#endif /* KVM__IOPORT_H */
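Reviewer note: the 16- and 32-bit accessors above always store port data in little-endian byte order, whatever the host endianness. The sketch below illustrates that behaviour in a standalone program. It is not the header's code: `ioport_read16()`, `ioport_write16()` and `ioport_demo()` are hypothetical stand-ins, and glibc's `<endian.h>` conversions are assumed in place of `le16_to_cpu()` and `cpu_to_le16()`.

```c
#define _DEFAULT_SOURCE		/* expose le16toh()/htole16() in glibc's <endian.h> */
#include <assert.h>
#include <endian.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for ioport__read16(): interpret the buffer as little-endian. */
static inline uint16_t ioport_read16(const uint16_t *data)
{
	return le16toh(*data);
}

/* Stand-in for ioport__write16(): store the value as little-endian. */
static inline void ioport_write16(uint16_t *data, uint16_t value)
{
	*data = htole16(value);
}

/* Write a value, inspect the raw bytes, then read it back. */
static void ioport_demo(void)
{
	uint16_t buf;
	uint8_t bytes[2];

	ioport_write16(&buf, 0x1234);
	memcpy(bytes, &buf, sizeof(bytes));

	/* Low byte first in memory, on both LE and BE hosts. */
	assert(bytes[0] == 0x34 && bytes[1] == 0x12);
	assert(ioport_read16(&buf) == 0x1234);
}
```

Calling `ioport_demo()` from a test program should pass on both little- and big-endian hosts, which is the property the byteswapping comment in the header relies on.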
--- /dev/null
+#ifndef KVM__IRQ_H
+#define KVM__IRQ_H
+
+#include <linux/types.h>
+#include <linux/rbtree.h>
+#include <linux/list.h>
+#include <linux/kvm.h>
+
+#include "kvm/msi.h"
+
+struct kvm;
+
+struct irq_line {
+ u8 line;
+ struct list_head node;
+};
+
+struct pci_dev {
+ struct rb_node node;
+ u32 id;
+ u8 pin;
+ struct list_head lines;
+};
+
+int irq__register_device(u32 dev, u8 *num, u8 *pin, u8 *line);
+
+struct rb_node *irq__get_pci_tree(void);
+
+int irq__init(struct kvm *kvm);
+int irq__exit(struct kvm *kvm);
+int irq__add_msix_route(struct kvm *kvm, struct msi_msg *msg);
+
+#endif
--- /dev/null
+#ifndef __KVM_CMD_H__
+#define __KVM_CMD_H__
+
+struct cmd_struct {
+ const char *cmd;
+ int (*fn)(int, const char **, const char *);
+ void (*help)(void);
+ int option;
+};
+
+extern struct cmd_struct kvm_commands[];
+struct cmd_struct *kvm_get_command(struct cmd_struct *command,
+ const char *cmd);
+
+int handle_command(struct cmd_struct *command, int argc, const char **argv);
+
+#endif
--- /dev/null
+#ifndef KVM_CONFIG_H_
+#define KVM_CONFIG_H_
+
+#include "kvm/disk-image.h"
+
+#define DEFAULT_KVM_DEV "/dev/kvm"
+#define DEFAULT_CONSOLE "serial"
+#define DEFAULT_NETWORK "user"
+#define DEFAULT_HOST_ADDR "192.168.33.1"
+#define DEFAULT_GUEST_ADDR "192.168.33.15"
+#define DEFAULT_GUEST_MAC "02:15:15:15:15:15"
+#define DEFAULT_HOST_MAC "02:01:01:01:01:01"
+#define DEFAULT_SCRIPT "none"
+#define DEFAULT_SANDBOX_FILENAME "guest/sandbox.sh"
+
+#define MIN_RAM_SIZE_MB (64ULL)
+#define MIN_RAM_SIZE_BYTE (MIN_RAM_SIZE_MB << MB_SHIFT)
+
+struct kvm_config {
+ struct disk_image_params disk_image[MAX_DISK_IMAGES];
+ u64 ram_size;
+ u8 image_count;
+ u8 num_net_devices;
+ bool virtio_rng;
+ int active_console;
+ int debug_iodelay;
+ int nrcpus;
+ int vidmode;
+ const char *kernel_cmdline;
+ const char *kernel_filename;
+ const char *vmlinux_filename;
+ const char *initrd_filename;
+ const char *firmware_filename;
+ const char *console;
+ const char *dev;
+ const char *network;
+ const char *host_ip;
+ const char *guest_ip;
+ const char *guest_mac;
+ const char *host_mac;
+ const char *script;
+ const char *guest_name;
+ const char *sandbox;
+ const char *hugetlbfs_path;
+ const char *custom_rootfs_name;
+ const char *real_cmdline;
+ struct virtio_net_params *net_params;
+ bool single_step;
+ bool vnc;
+ bool sdl;
+ bool balloon;
+ bool using_rootfs;
+ bool custom_rootfs;
+ bool no_net;
+ bool no_dhcp;
+ bool ioport_debug;
+ bool mmio_debug;
+};
+
+#endif
--- /dev/null
+#ifndef KVM__KVM_CPU_H
+#define KVM__KVM_CPU_H
+
+#include "kvm/kvm-cpu-arch.h"
+#include <stdbool.h>
+
+int kvm_cpu__init(struct kvm *kvm);
+int kvm_cpu__exit(struct kvm *kvm);
+struct kvm_cpu *kvm_cpu__arch_init(struct kvm *kvm, unsigned long cpu_id);
+void kvm_cpu__delete(struct kvm_cpu *vcpu);
+void kvm_cpu__reset_vcpu(struct kvm_cpu *vcpu);
+void kvm_cpu__setup_cpuid(struct kvm_cpu *vcpu);
+void kvm_cpu__enable_singlestep(struct kvm_cpu *vcpu);
+void kvm_cpu__run(struct kvm_cpu *vcpu);
+void kvm_cpu__reboot(struct kvm *kvm);
+int kvm_cpu__start(struct kvm_cpu *cpu);
+bool kvm_cpu__handle_exit(struct kvm_cpu *vcpu);
+
+int kvm_cpu__get_debug_fd(void);
+void kvm_cpu__set_debug_fd(int fd);
+void kvm_cpu__show_code(struct kvm_cpu *vcpu);
+void kvm_cpu__show_registers(struct kvm_cpu *vcpu);
+void kvm_cpu__show_page_tables(struct kvm_cpu *vcpu);
+void kvm_cpu__arch_nmi(struct kvm_cpu *cpu);
+
+#endif /* KVM__KVM_CPU_H */
--- /dev/null
+#ifndef KVM__IPC_H_
+#define KVM__IPC_H_
+
+#include <linux/types.h>
+#include "kvm/kvm.h"
+
+enum {
+ KVM_IPC_BALLOON = 1,
+ KVM_IPC_DEBUG = 2,
+ KVM_IPC_STAT = 3,
+ KVM_IPC_PAUSE = 4,
+ KVM_IPC_RESUME = 5,
+ KVM_IPC_STOP = 6,
+ KVM_IPC_PID = 7,
+ KVM_IPC_VMSTATE = 8,
+};
+
+int kvm_ipc__register_handler(u32 type, void (*cb)(struct kvm *kvm,
+ int fd, u32 type, u32 len, u8 *msg));
+int kvm_ipc__init(struct kvm *kvm);
+int kvm_ipc__exit(struct kvm *kvm);
+
+int kvm_ipc__send(int fd, u32 type);
+int kvm_ipc__send_msg(int fd, u32 type, u32 len, u8 *msg);
+
+#endif
--- /dev/null
+#ifndef KVM__KVM_H
+#define KVM__KVM_H
+
+#include "kvm/kvm-arch.h"
+#include "kvm/kvm-config.h"
+#include "kvm/util-init.h"
+
+#include <stdbool.h>
+#include <linux/types.h>
+#include <time.h>
+#include <signal.h>
+
+#define SIGKVMEXIT (SIGRTMIN + 0)
+#define SIGKVMPAUSE (SIGRTMIN + 1)
+
+#define KVM_PID_FILE_PATH "/.lkvm/"
+#define HOME_DIR getenv("HOME")
+#define KVM_BINARY_NAME "lkvm"
+
+#define PAGE_SIZE (sysconf(_SC_PAGE_SIZE))
+
+#define DEFINE_KVM_EXT(ext) \
+ .name = #ext, \
+ .code = ext
+
+enum {
+ KVM_VMSTATE_RUNNING,
+ KVM_VMSTATE_PAUSED,
+};
+
+struct kvm_ext {
+ const char *name;
+ int code;
+};
+
+struct kvm {
+ struct kvm_arch arch;
+ struct kvm_config cfg;
+ int sys_fd; /* For system ioctls(), i.e. /dev/kvm */
+ int vm_fd; /* For VM ioctls() */
+ timer_t timerid; /* Posix timer for interrupts */
+
+ int nrcpus; /* Number of cpus to run */
+ struct kvm_cpu **cpus;
+
+ u32 mem_slots; /* for KVM_SET_USER_MEMORY_REGION */
+ u64 ram_size;
+ void *ram_start;
+ u64 ram_pagesize;
+
+ bool nmi_disabled;
+
+ const char *vmlinux;
+ struct disk_image **disks;
+ int nr_disks;
+
+ int vm_state;
+};
+
+void kvm__set_dir(const char *fmt, ...);
+const char *kvm__get_dir(void);
+
+int kvm__init(struct kvm *kvm);
+struct kvm *kvm__new(void);
+int kvm__recommended_cpus(struct kvm *kvm);
+int kvm__max_cpus(struct kvm *kvm);
+void kvm__init_ram(struct kvm *kvm);
+int kvm__exit(struct kvm *kvm);
+bool kvm__load_firmware(struct kvm *kvm, const char *firmware_filename);
+bool kvm__load_kernel(struct kvm *kvm, const char *kernel_filename,
+ const char *initrd_filename, const char *kernel_cmdline, u16 vidmode);
+int kvm_timer__init(struct kvm *kvm);
+int kvm_timer__exit(struct kvm *kvm);
+void kvm__irq_line(struct kvm *kvm, int irq, int level);
+void kvm__irq_trigger(struct kvm *kvm, int irq);
+bool kvm__emulate_io(struct kvm *kvm, u16 port, void *data, int direction, int size, u32 count);
+bool kvm__emulate_mmio(struct kvm *kvm, u64 phys_addr, u8 *data, u32 len, u8 is_write);
+int kvm__register_mem(struct kvm *kvm, u64 guest_phys, u64 size, void *userspace_addr);
+int kvm__register_mmio(struct kvm *kvm, u64 phys_addr, u64 phys_addr_len, bool coalesce,
+ void (*mmio_fn)(u64 addr, u8 *data, u32 len, u8 is_write, void *ptr),
+ void *ptr);
+bool kvm__deregister_mmio(struct kvm *kvm, u64 phys_addr);
+void kvm__pause(struct kvm *kvm);
+void kvm__continue(struct kvm *kvm);
+void kvm__notify_paused(void);
+int kvm__get_sock_by_instance(const char *name);
+int kvm__enumerate_instances(int (*callback)(const char *name, int pid));
+void kvm__remove_socket(const char *name);
+
+void kvm__arch_set_cmdline(char *cmdline, bool video);
+void kvm__arch_init(struct kvm *kvm, const char *hugetlbfs_path, u64 ram_size);
+void kvm__arch_delete_ram(struct kvm *kvm);
+int kvm__arch_setup_firmware(struct kvm *kvm);
+int kvm__arch_free_firmware(struct kvm *kvm);
+bool kvm__arch_cpu_supports_vm(void);
+void kvm__arch_periodic_poll(struct kvm *kvm);
+
+int load_flat_binary(struct kvm *kvm, int fd_kernel, int fd_initrd, const char *kernel_cmdline);
+bool load_bzimage(struct kvm *kvm, int fd_kernel, int fd_initrd, const char *kernel_cmdline, u16 vidmode);
+
+/*
+ * Debugging
+ */
+void kvm__dump_mem(struct kvm *kvm, unsigned long addr, unsigned long size);
+
+extern const char *kvm_exit_reasons[];
+
+static inline bool host_ptr_in_ram(struct kvm *kvm, void *p)
+{
+ return kvm->ram_start <= p && p < (kvm->ram_start + kvm->ram_size);
+}
+
+static inline void *guest_flat_to_host(struct kvm *kvm, unsigned long offset)
+{
+ return kvm->ram_start + offset;
+}
+
+bool kvm__supports_extension(struct kvm *kvm, unsigned int extension);
+
+#endif /* KVM__KVM_H */
--- /dev/null
+#ifndef LKVM_MSI_H
+#define LKVM_MSI_H
+
+struct msi_msg {
+ u32 address_lo; /* low 32 bits of msi message address */
+ u32 address_hi; /* high 32 bits of msi message address */
+ u32 data; /* 16 bits of msi message data */
+};
+
+#endif /* LKVM_MSI_H */
--- /dev/null
+#ifndef KVM__MUTEX_H
+#define KVM__MUTEX_H
+
+#include <pthread.h>
+
+#include "kvm/util.h"
+
+/*
+ * Kernel-alike mutex API - to make it easier for kernel developers
+ * to write user-space code! :-)
+ */
+
+#define DEFINE_MUTEX(mutex) pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER
+
+static inline void mutex_init(pthread_mutex_t *mutex)
+{
+ if (pthread_mutex_init(mutex, NULL) != 0)
+ die("unexpected pthread_mutex_init() failure!");
+}
+
+static inline void mutex_lock(pthread_mutex_t *mutex)
+{
+ if (pthread_mutex_lock(mutex) != 0)
+ die("unexpected pthread_mutex_lock() failure!");
+}
+
+static inline void mutex_unlock(pthread_mutex_t *mutex)
+{
+ if (pthread_mutex_unlock(mutex) != 0)
+ die("unexpected pthread_mutex_unlock() failure!");
+}
+
+#endif /* KVM__MUTEX_H */
--- /dev/null
+#ifndef __PARSE_OPTIONS_H__
+#define __PARSE_OPTIONS_H__
+
+#include <inttypes.h>
+#include <kvm/util.h>
+
+enum parse_opt_type {
+ /* special types */
+ OPTION_END,
+ OPTION_ARGUMENT,
+ OPTION_GROUP,
+ /* options with no arguments */
+ OPTION_BIT,
+ OPTION_BOOLEAN,
+ OPTION_INCR,
+ OPTION_SET_UINT,
+ OPTION_SET_PTR,
+ /* options with arguments (usually) */
+ OPTION_STRING,
+ OPTION_INTEGER,
+ OPTION_LONG,
+ OPTION_CALLBACK,
+ OPTION_U64,
+ OPTION_UINTEGER,
+};
+
+enum parse_opt_flags {
+ PARSE_OPT_KEEP_DASHDASH = 1,
+ PARSE_OPT_STOP_AT_NON_OPTION = 2,
+ PARSE_OPT_KEEP_ARGV0 = 4,
+ PARSE_OPT_KEEP_UNKNOWN = 8,
+ PARSE_OPT_NO_INTERNAL_HELP = 16,
+};
+
+enum parse_opt_option_flags {
+ PARSE_OPT_OPTARG = 1,
+ PARSE_OPT_NOARG = 2,
+ PARSE_OPT_NONEG = 4,
+ PARSE_OPT_HIDDEN = 8,
+ PARSE_OPT_LASTARG_DEFAULT = 16,
+};
+
+struct option;
+typedef int parse_opt_cb(const struct option *, const char *arg, int unset);
+/*
+ * `type`::
+ * holds the type of the option, you must have an OPTION_END last in your
+ * array.
+ *
+ * `short_name`::
+ * the character to use as a short option name, '\0' if none.
+ *
+ * `long_name`::
+ * the long option name, without the leading dashes, NULL if none.
+ *
+ * `value`::
+ * stores pointers to the values to be filled.
+ *
+ * `argh`::
+ * token to explain the kind of argument this option wants. Keep it
+ * homogeneous across the repository.
+ *
+ * `help`::
+ * the short help associated to what the option does.
+ * Must never be NULL (except for OPTION_END).
+ * OPTION_GROUP uses this pointer to store the group header.
+ *
+ * `flags`::
+ * mask of parse_opt_option_flags.
+ * PARSE_OPT_OPTARG: says that the argument is optional (not for BOOLEANs)
+ * PARSE_OPT_NOARG: says that this option takes no argument, for CALLBACKs
+ * PARSE_OPT_NONEG: says that this option cannot be negated
+ * PARSE_OPT_HIDDEN: this option is skipped in the default usage, shown in
+ * the long one.
+ *
+ * `callback`::
+ * pointer to the callback to use for OPTION_CALLBACK.
+ *
+ * `defval`::
+ * default value to fill (*->value) with for PARSE_OPT_OPTARG.
+ * OPTION_{BIT,SET_UINT,SET_PTR} store the {mask,integer,pointer} to put in
+ * the value when met.
+ * CALLBACKS can use it like they want.
+ */
+struct option {
+ enum parse_opt_type type;
+ int short_name;
+ const char *long_name;
+ void *value;
+ const char *argh;
+ const char *help;
+ void *ptr;
+
+ int flags;
+ parse_opt_cb *callback;
+ intptr_t defval;
+
+#define BUILD_BUG_ON_ZERO(e) (sizeof(struct { int:-!!(e); }))
+#define check_vtype(v, type) \
+ (BUILD_BUG_ON_ZERO(!__builtin_types_compatible_p(typeof(v), type)) + v)
+
+#define OPT_INTEGER(s, l, v, h) \
+{ \
+ .type = OPTION_INTEGER, \
+ .short_name = (s), \
+ .long_name = (l), \
+ .value = check_vtype(v, int *), \
+ .help = (h) \
+}
+
+#define OPT_U64(s, l, v, h) \
+{ \
+ .type = OPTION_U64, \
+ .short_name = (s), \
+ .long_name = (l), \
+ .value = check_vtype(v, u64 *), \
+ .help = (h) \
+}
+
+#define OPT_STRING(s, l, v, a, h) \
+{ \
+ .type = OPTION_STRING, \
+ .short_name = (s), \
+ .long_name = (l), \
+ .value = check_vtype(v, const char **), \
+ .argh = (a), \
+ .help = (h) \
+}
+
+#define OPT_BOOLEAN(s, l, v, h) \
+{ \
+ .type = OPTION_BOOLEAN, \
+ .short_name = (s), \
+ .long_name = (l), \
+ .value = check_vtype(v, bool *), \
+ .help = (h) \
+}
+
+#define OPT_INCR(s, l, v, h) \
+{ \
+ .type = OPTION_INCR, \
+ .short_name = (s), \
+ .long_name = (l), \
+ .value = check_vtype(v, int *), \
+ .help = (h) \
+}
+
+#define OPT_GROUP(h) \
+{ \
+ .type = OPTION_GROUP, \
+ .help = (h) \
+}
+
+#define OPT_CALLBACK(s, l, v, a, h, f, p) \
+{ \
+ .type = OPTION_CALLBACK, \
+ .short_name = (s), \
+ .long_name = (l), \
+ .value = (v), \
+ .argh = (a), \
+ .help = (h), \
+ .callback = (f), \
+ .ptr = (p), \
+}
+
+#define OPT_CALLBACK_NOOPT(s, l, v, a, h, f, p) \
+{ \
+ .type = OPTION_CALLBACK, \
+ .short_name = (s), \
+ .long_name = (l), \
+ .value = (v), \
+ .argh = (a), \
+ .help = (h), \
+ .callback = (f), \
+ .flags = PARSE_OPT_NOARG, \
+ .ptr = (p), \
+}
+
+#define OPT_CALLBACK_DEFAULT(s, l, v, a, h, f, d, p) \
+{ \
+ .type = OPTION_CALLBACK, \
+ .short_name = (s), \
+ .long_name = (l), \
+ .value = (v), \
+ .argh = (a), \
+ .help = (h), \
+ .callback = (f), \
+ .defval = (intptr_t)(d), \
+ .flags = PARSE_OPT_LASTARG_DEFAULT, \
+ .ptr = (p) \
+}
+
+#define OPT_END() { .type = OPTION_END }
+
+enum {
+ PARSE_OPT_HELP = -1,
+ PARSE_OPT_DONE,
+ PARSE_OPT_UNKNOWN,
+};
+
+/*
+ * It's okay for the caller to consume argv/argc in the usual way.
+ * Other fields of that structure are private to parse-options and should not
+ * be modified in any way.
+ **/
+struct parse_opt_ctx_t {
+ const char **argv;
+ const char **out;
+ int argc, cpidx;
+ const char *opt;
+ int flags;
+};
+
+/* global functions */
+void usage_with_options(const char * const *usagestr,
+ const struct option *opts) NORETURN;
+int parse_options(int argc, const char **argv, const struct option *options,
+ const char * const usagestr[], int flags);
+#endif
--- /dev/null
+#ifndef KVM__PCI_SHMEM_H
+#define KVM__PCI_SHMEM_H
+
+#include <linux/types.h>
+#include <linux/list.h>
+
+#include "kvm/parse-options.h"
+
+#define SHMEM_DEFAULT_SIZE (16 << MB_SHIFT)
+#define SHMEM_DEFAULT_ADDR (0xc8000000)
+#define SHMEM_DEFAULT_HANDLE "/kvm_shmem"
+
+struct kvm;
+struct shmem_info;
+
+struct shmem_info {
+ u64 phys_addr;
+ u64 size;
+ char *handle;
+ int create;
+};
+
+int pci_shmem__init(struct kvm *kvm);
+int pci_shmem__exit(struct kvm *kvm);
+int pci_shmem__register_mem(struct shmem_info *si);
+int shmem_parser(const struct option *opt, const char *arg, int unset);
+
+int pci_shmem__get_local_irqfd(struct kvm *kvm);
+int pci_shmem__add_client(struct kvm *kvm, u32 id, int fd);
+int pci_shmem__remove_client(struct kvm *kvm, u32 id);
+
+#endif
--- /dev/null
+#ifndef KVM__PCI_H
+#define KVM__PCI_H
+
+#include <linux/types.h>
+#include <linux/kvm.h>
+#include <linux/pci_regs.h>
+#include <endian.h>
+
+#include "kvm/kvm.h"
+#include "kvm/msi.h"
+
+#define PCI_MAX_DEVICES 256
+/*
+ * PCI Configuration Mechanism #1 I/O ports. See Section 3.7.4.1
+ * ("Configuration Mechanism #1") of the PCI Local Bus Specification 2.1 for
+ * details.
+ */
+#define PCI_CONFIG_ADDRESS 0xcf8
+#define PCI_CONFIG_DATA 0xcfc
+#define PCI_CONFIG_BUS_FORWARD 0xcfa
+#define PCI_IO_SIZE 0x100
+
+union pci_config_address {
+ struct {
+#if __BYTE_ORDER == __LITTLE_ENDIAN
+ unsigned reg_offset : 2; /* 1 .. 0 */
+ unsigned register_number : 6; /* 7 .. 2 */
+ unsigned function_number : 3; /* 10 .. 8 */
+ unsigned device_number : 5; /* 15 .. 11 */
+ unsigned bus_number : 8; /* 23 .. 16 */
+ unsigned reserved : 7; /* 30 .. 24 */
+ unsigned enable_bit : 1; /* 31 */
+#else
+ unsigned enable_bit : 1; /* 31 */
+ unsigned reserved : 7; /* 30 .. 24 */
+ unsigned bus_number : 8; /* 23 .. 16 */
+ unsigned device_number : 5; /* 15 .. 11 */
+ unsigned function_number : 3; /* 10 .. 8 */
+ unsigned register_number : 6; /* 7 .. 2 */
+ unsigned reg_offset : 2; /* 1 .. 0 */
+#endif
+ };
+ u32 w;
+};
+
+struct msix_table {
+ struct msi_msg msg;
+ u32 ctrl;
+};
+
+struct msix_cap {
+ u8 cap;
+ u8 next;
+ u16 ctrl;
+ u32 table_offset;
+ u32 pba_offset;
+};
+
+struct pci_device_header {
+ u16 vendor_id;
+ u16 device_id;
+ u16 command;
+ u16 status;
+ u8 revision_id;
+ u8 class[3];
+ u8 cacheline_size;
+ u8 latency_timer;
+ u8 header_type;
+ u8 bist;
+ u32 bar[6];
+ u32 card_bus;
+ u16 subsys_vendor_id;
+ u16 subsys_id;
+ u32 exp_rom_bar;
+ u8 capabilities;
+ u8 reserved1[3];
+ u32 reserved2;
+ u8 irq_line;
+ u8 irq_pin;
+ u8 min_gnt;
+ u8 max_lat;
+ struct msix_cap msix;
+ u8 empty[136]; /* Rest of PCI config space */
+ u32 bar_size[6];
+} __attribute__((packed));
+
+int pci__init(struct kvm *kvm);
+int pci__exit(struct kvm *kvm);
+int pci__register(struct pci_device_header *dev, u8 dev_num);
+struct pci_device_header *pci__find_dev(u8 dev_num);
+u32 pci_get_io_space_block(u32 size);
+void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data, int size);
+void pci__config_rd(struct kvm *kvm, union pci_config_address addr, void *data, int size);
+
+#endif /* KVM__PCI_H */
--- /dev/null
+#ifndef KVM__QCOW_H
+#define KVM__QCOW_H
+
+#include "kvm/mutex.h"
+
+#include <linux/types.h>
+#include <stdbool.h>
+#include <linux/rbtree.h>
+#include <linux/list.h>
+
+#define QCOW_MAGIC (('Q' << 24) | ('F' << 16) | ('I' << 8) | 0xfb)
+
+#define QCOW1_VERSION 1
+#define QCOW2_VERSION 2
+
+#define QCOW1_OFLAG_COMPRESSED (1ULL << 63)
+
+#define QCOW2_OFLAG_COPIED (1ULL << 63)
+#define QCOW2_OFLAG_COMPRESSED (1ULL << 62)
+
+#define QCOW2_OFLAGS_MASK (QCOW2_OFLAG_COPIED|QCOW2_OFLAG_COMPRESSED)
+
+#define QCOW2_OFFSET_MASK (~QCOW2_OFLAGS_MASK)
+
+#define MAX_CACHE_NODES 32
+
+struct qcow_l2_table {
+ u64 offset;
+ struct rb_node node;
+ struct list_head list;
+ u8 dirty;
+ u64 table[];
+};
+
+struct qcow_l1_table {
+ u32 table_size;
+ u64 *l1_table;
+
+ /* Level2 caching data structures */
+ struct rb_root root;
+ struct list_head lru_list;
+ int nr_cached;
+};
+
+#define QCOW_REFCOUNT_BLOCK_SHIFT 1
+
+struct qcow_refcount_block {
+ u64 offset;
+ struct rb_node node;
+ struct list_head list;
+ u64 size;
+ u8 dirty;
+ u16 entries[];
+};
+
+struct qcow_refcount_table {
+ u32 rf_size;
+ u64 *rf_table;
+
+ /* Refcount block caching data structures */
+ struct rb_root root;
+ struct list_head lru_list;
+ int nr_cached;
+};
+
+struct qcow_header {
+ u64 size; /* in bytes */
+ u64 l1_table_offset;
+ u32 l1_size;
+ u8 cluster_bits;
+ u8 l2_bits;
+ u64 refcount_table_offset;
+ u32 refcount_table_size;
+};
+
+struct qcow {
+ pthread_mutex_t mutex;
+ struct qcow_header *header;
+ struct qcow_l1_table table;
+ struct qcow_refcount_table refcount_table;
+ int fd;
+ int csize_shift;
+ int csize_mask;
+ u32 version;
+ u64 cluster_size;
+ u64 cluster_offset_mask;
+ u64 free_clust_idx;
+ void *cluster_cache;
+ void *cluster_data;
+ void *copy_buff;
+};
+
+struct qcow1_header_disk {
+ u32 magic;
+ u32 version;
+
+ u64 backing_file_offset;
+ u32 backing_file_size;
+ u32 mtime;
+
+ u64 size; /* in bytes */
+
+ u8 cluster_bits;
+ u8 l2_bits;
+ u32 crypt_method;
+
+ u64 l1_table_offset;
+};
+
+struct qcow2_header_disk {
+ u32 magic;
+ u32 version;
+
+ u64 backing_file_offset;
+ u32 backing_file_size;
+
+ u32 cluster_bits;
+ u64 size; /* in bytes */
+ u32 crypt_method;
+
+ u32 l1_size;
+ u64 l1_table_offset;
+
+ u64 refcount_table_offset;
+ u32 refcount_table_clusters;
+
+ u32 nb_snapshots;
+ u64 snapshots_offset;
+};
+
+struct disk_image *qcow_probe(int fd, bool readonly);
+
+#endif /* KVM__QCOW_H */
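Reviewer note: `QCOW_MAGIC` packs the bytes 'Q', 'F', 'I', 0xfb into one 32-bit value. A minimal sketch of how a probe routine might compare it against the first four bytes of an image follows, assuming the on-disk header is big-endian as the QCOW format specifies. `be32_from_bytes()`, `looks_like_qcow()` and `qcow_magic_demo()` are hypothetical names, not functions from this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define QCOW_MAGIC (('Q' << 24) | ('F' << 16) | ('I' << 8) | 0xfb)

/* Hypothetical helper: decode a big-endian 32-bit value from raw bytes. */
static uint32_t be32_from_bytes(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical helper: true if the buffer starts with the QCOW magic. */
static bool looks_like_qcow(const uint8_t *hdr)
{
	return be32_from_bytes(hdr) == (uint32_t)QCOW_MAGIC;
}

/* Check one matching and one non-matching four-byte header prefix. */
static void qcow_magic_demo(void)
{
	const uint8_t qcow[4] = { 'Q', 'F', 'I', 0xfb };
	const uint8_t raw[4]  = { 0, 0, 0, 0 };

	assert(looks_like_qcow(qcow));
	assert(!looks_like_qcow(raw));
}
```

Decoding byte by byte avoids any dependence on host endianness, so the same check works unchanged on little- and big-endian hosts.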
--- /dev/null
+#ifndef KVM__INTERVAL_RBTREE_H
+#define KVM__INTERVAL_RBTREE_H
+
+#include <linux/rbtree.h>
+#include <linux/types.h>
+
+#define RB_INT_INIT(l, h) (struct rb_int_node){.low = l, .high = h}
+#define rb_int(n) rb_entry(n, struct rb_int_node, node)
+
+struct rb_int_node {
+ struct rb_node node;
+ u64 low;
+ u64 high;
+
+ /* max_high will store the highest high of its two children. */
+ u64 max_high;
+};
+
+/* Return the rb_int_node interval in which 'point' is located. */
+struct rb_int_node *rb_int_search_single(struct rb_root *root, u64 point);
+
+/* Return the rb_int_node in which the range low:high is located. */
+struct rb_int_node *rb_int_search_range(struct rb_root *root, u64 low, u64 high);
+
+int rb_int_insert(struct rb_root *root, struct rb_int_node *data);
+void rb_int_erase(struct rb_root *root, struct rb_int_node *node);
+
+#endif
--- /dev/null
+#ifndef KVM_READ_WRITE_H
+#define KVM_READ_WRITE_H
+
+#include <sys/types.h>
+#include <sys/uio.h>
+#include <unistd.h>
+
+#ifdef CONFIG_HAS_AIO
+#include <libaio.h>
+#endif
+
+ssize_t xread(int fd, void *buf, size_t count);
+ssize_t xwrite(int fd, const void *buf, size_t count);
+
+ssize_t read_in_full(int fd, void *buf, size_t count);
+ssize_t write_in_full(int fd, const void *buf, size_t count);
+
+ssize_t xpread(int fd, void *buf, size_t count, off_t offset);
+ssize_t xpwrite(int fd, const void *buf, size_t count, off_t offset);
+
+ssize_t pread_in_full(int fd, void *buf, size_t count, off_t offset);
+ssize_t pwrite_in_full(int fd, const void *buf, size_t count, off_t offset);
+
+ssize_t xreadv(int fd, const struct iovec *iov, int iovcnt);
+ssize_t xwritev(int fd, const struct iovec *iov, int iovcnt);
+
+ssize_t readv_in_full(int fd, const struct iovec *iov, int iovcnt);
+ssize_t writev_in_full(int fd, const struct iovec *iov, int iovcnt);
+
+ssize_t xpreadv(int fd, const struct iovec *iov, int iovcnt, off_t offset);
+ssize_t xpwritev(int fd, const struct iovec *iov, int iovcnt, off_t offset);
+
+ssize_t preadv_in_full(int fd, const struct iovec *iov, int iovcnt, off_t offset);
+ssize_t pwritev_in_full(int fd, const struct iovec *iov, int iovcnt, off_t offset);
+
+#ifdef CONFIG_HAS_AIO
+int aio_preadv(io_context_t ctx, struct iocb *iocb, int fd, const struct iovec *iov, int iovcnt,
+ off_t offset, int ev, void *param);
+int aio_pwritev(io_context_t ctx, struct iocb *iocb, int fd, const struct iovec *iov, int iovcnt,
+ off_t offset, int ev, void *param);
+#endif
+
+#endif /* KVM_READ_WRITE_H */
--- /dev/null
+#ifndef KVM__RTC_H
+#define KVM__RTC_H
+
+struct kvm;
+
+int rtc__init(struct kvm *kvm);
+int rtc__exit(struct kvm *kvm);
+
+#endif /* KVM__RTC_H */
--- /dev/null
+#ifndef KVM__RWSEM_H
+#define KVM__RWSEM_H
+
+#include <pthread.h>
+
+#include "kvm/util.h"
+
+/*
+ * Kernel-alike rwsem API - to make it easier for kernel developers
+ * to write user-space code! :-)
+ */
+
+#define DECLARE_RWSEM(sem) pthread_rwlock_t sem = PTHREAD_RWLOCK_INITIALIZER
+
+static inline void down_read(pthread_rwlock_t *rwsem)
+{
+ if (pthread_rwlock_rdlock(rwsem) != 0)
+ die("unexpected pthread_rwlock_rdlock() failure!");
+}
+
+static inline void down_write(pthread_rwlock_t *rwsem)
+{
+ if (pthread_rwlock_wrlock(rwsem) != 0)
+ die("unexpected pthread_rwlock_wrlock() failure!");
+}
+
+static inline void up_read(pthread_rwlock_t *rwsem)
+{
+ if (pthread_rwlock_unlock(rwsem) != 0)
+ die("unexpected pthread_rwlock_unlock() failure!");
+}
+
+static inline void up_write(pthread_rwlock_t *rwsem)
+{
+ if (pthread_rwlock_unlock(rwsem) != 0)
+ die("unexpected pthread_rwlock_unlock() failure!");
+}
+
+#endif /* KVM__RWSEM_H */
--- /dev/null
+#ifndef KVM__SDL_H
+#define KVM__SDL_H
+
+#include "kvm/util.h"
+
+struct framebuffer;
+
+#ifdef CONFIG_HAS_SDL
+int sdl__init(struct kvm *kvm);
+int sdl__exit(struct kvm *kvm);
+#else
+static inline int sdl__init(struct kvm *kvm)
+{
+ if (kvm->cfg.sdl)
+ die("SDL support not compiled in. (install the SDL-dev[el] package)");
+
+ return 0;
+}
+static inline int sdl__exit(struct kvm *kvm)
+{
+ if (kvm->cfg.sdl)
+ die("SDL support not compiled in. (install the SDL-dev[el] package)");
+
+ return 0;
+}
+#endif
+
+#endif /* KVM__SDL_H */
--- /dev/null
+#ifndef KVM_SEGMENT_H
+#define KVM_SEGMENT_H
+
+#include <linux/types.h>
+
+static inline u32 segment_to_flat(u16 selector, u16 offset)
+{
+ return ((u32)selector << 4) + (u32) offset;
+}
+
+static inline u16 flat_to_seg16(u32 address)
+{
+ return address >> 4;
+}
+
+static inline u16 flat_to_off16(u32 address, u32 segment)
+{
+ return address - (segment << 4);
+}
+
+#endif /* KVM_SEGMENT_H */
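
Review note on kvm/segment.h (not part of the patch): the three helpers implement real-mode address arithmetic, where a flat address is selector * 16 + offset. The stand-alone sketch below restates them with <stdint.h> types instead of the tree's u16/u32 typedefs (an assumption made so that it compiles outside the tree) and checks a round trip. The segment_demo() function is hypothetical and exists only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the helpers in kvm/segment.h, using <stdint.h> types. */
static inline uint32_t segment_to_flat(uint16_t selector, uint16_t offset)
{
	return ((uint32_t)selector << 4) + (uint32_t)offset;
}

static inline uint16_t flat_to_seg16(uint32_t address)
{
	return address >> 4;
}

static inline uint16_t flat_to_off16(uint32_t address, uint32_t segment)
{
	return address - (segment << 4);
}

/* Hypothetical self-check, not a function from the patch. */
static inline void segment_demo(void)
{
	/* The x86 reset vector f000:fff0 corresponds to flat address 0xffff0. */
	uint32_t flat = segment_to_flat(0xf000, 0xfff0);

	assert(flat == 0xffff0);
	/* Splitting against the same segment recovers the original offset. */
	assert(flat_to_off16(flat, 0xf000) == 0xfff0);
	/* flat_to_seg16() picks the segment that leaves an offset below 16. */
	assert(flat_to_seg16(0x12345) == 0x1234);
	assert(flat_to_off16(0x12345, 0x1234) == 0x5);
}
```

Note that flat_to_seg16() truncates to 16 bits, so it is only meaningful for addresses below 1 MiB.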
--- /dev/null
+#ifndef __STRBUF_H__
+#define __STRBUF_H__
+
+#include <sys/types.h>
+#include <string.h>
+
+int prefixcmp(const char *str, const char *prefix);
+
+extern size_t strlcat(char *dest, const char *src, size_t count);
+extern size_t strlcpy(char *dest, const char *src, size_t size);
+
+/* some inline functions */
+
+static inline const char *skip_prefix(const char *str, const char *prefix)
+{
+ size_t len = strlen(prefix);
+ return strncmp(str, prefix, len) ? NULL : str + len;
+}
+
+#endif
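
Review note on skip_prefix() (not part of the patch): the helper returns a pointer just past the prefix when it matches and NULL otherwise, which makes it convenient for option parsing. A minimal sketch of its behaviour follows; the option strings and skip_prefix_demo() are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same body as the inline helper in the header above. */
static inline const char *skip_prefix(const char *str, const char *prefix)
{
	size_t len = strlen(prefix);
	return strncmp(str, prefix, len) ? NULL : str + len;
}

/* Hypothetical self-check with invented option strings. */
static inline void skip_prefix_demo(void)
{
	const char *rest = skip_prefix("--name=guest0", "--name=");

	/* On a match, the result points at the text after the prefix. */
	assert(rest != NULL && strcmp(rest, "guest0") == 0);
	/* On a mismatch, the result is NULL. */
	assert(skip_prefix("--cpus=2", "--name=") == NULL);
	/* An empty prefix always matches and returns the string unchanged. */
	assert(strcmp(skip_prefix("abc", ""), "abc") == 0);
}
```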
--- /dev/null
+#ifndef KVM__SYMBOL_H
+#define KVM__SYMBOL_H
+
+#include <stddef.h>
+#include <string.h>
+
+struct kvm;
+
+#define SYMBOL_DEFAULT_UNKNOWN "<unknown>"
+
+#ifdef CONFIG_HAS_BFD
+
+int symbol_init(struct kvm *kvm);
+int symbol_exit(struct kvm *kvm);
+char *symbol_lookup(struct kvm *kvm, unsigned long addr, char *sym, size_t size);
+
+#else
+
+static inline int symbol_init(struct kvm *kvm) { return 0; }
+static inline char *symbol_lookup(struct kvm *kvm, unsigned long addr, char *sym, size_t size)
+{
+ char *s = strncpy(sym, SYMBOL_DEFAULT_UNKNOWN, size);
+ sym[size - 1] = '\0';
+ return s;
+}
+static inline int symbol_exit(struct kvm *kvm) { return 0; }
+
+#endif
+
+#endif /* KVM__SYMBOL_H */
--- /dev/null
+#ifndef KVM__TERM_H
+#define KVM__TERM_H
+
+#include "kvm/kvm.h"
+
+#include <sys/uio.h>
+#include <stdbool.h>
+
+#define CONSOLE_8250 1
+#define CONSOLE_VIRTIO 2
+#define CONSOLE_HV 3
+
+int term_putc_iov(struct iovec *iov, int iovcnt, int term);
+int term_getc_iov(struct kvm *kvm, struct iovec *iov, int iovcnt, int term);
+int term_putc(char *addr, int cnt, int term);
+int term_getc(struct kvm *kvm, int term);
+
+bool term_readable(int term);
+void term_set_tty(int term);
+int term_init(struct kvm *kvm);
+int term_exit(struct kvm *kvm);
+int tty_parser(const struct option *opt, const char *arg, int unset);
+
+#endif /* KVM__TERM_H */
--- /dev/null
+#ifndef KVM__THREADPOOL_H
+#define KVM__THREADPOOL_H
+
+#include "kvm/mutex.h"
+
+#include <linux/list.h>
+
+struct kvm;
+
+typedef void (*kvm_thread_callback_fn_t)(struct kvm *kvm, void *data);
+
+struct thread_pool__job {
+ kvm_thread_callback_fn_t callback;
+ struct kvm *kvm;
+ void *data;
+
+ int signalcount;
+ pthread_mutex_t mutex;
+
+ struct list_head queue;
+};
+
+static inline void thread_pool__init_job(struct thread_pool__job *job, struct kvm *kvm, kvm_thread_callback_fn_t callback, void *data)
+{
+ *job = (struct thread_pool__job) {
+ .kvm = kvm,
+ .callback = callback,
+ .data = data,
+ .mutex = PTHREAD_MUTEX_INITIALIZER,
+ };
+}
+
+int thread_pool__init(struct kvm *kvm);
+int thread_pool__exit(struct kvm *kvm);
+
+void thread_pool__do_job(struct thread_pool__job *job);
+
+#endif
--- /dev/null
+#ifndef KVM_TYPES_H
+#define KVM_TYPES_H
+
+/* FIXME: include/linux/if_tun.h and include/linux/if_ether.h complain */
+#define __be16 u16
+
+#endif /* KVM_TYPES_H */
--- /dev/null
+#ifndef KVM__UIP_H
+#define KVM__UIP_H
+
+#include "linux/types.h"
+#include "kvm/mutex.h"
+
+#include <netinet/in.h>
+#include <sys/uio.h>
+
+#define UIP_BUF_STATUS_FREE 0
+#define UIP_BUF_STATUS_INUSE 1
+#define UIP_BUF_STATUS_USED 2
+
+#define UIP_ETH_P_IP 0x0800
+#define UIP_ETH_P_ARP 0x0806
+
+#define UIP_IP_VER_4 0x40
+#define UIP_IP_HDR_LEN 0x05
+#define UIP_IP_TTL 0x40
+#define UIP_IP_P_UDP 0x11
+#define UIP_IP_P_TCP 0x06
+#define UIP_IP_P_ICMP 0x01
+
+#define UIP_TCP_HDR_LEN 0x50
+#define UIP_TCP_WIN_SIZE 14600
+#define UIP_TCP_FLAG_FIN 1
+#define UIP_TCP_FLAG_SYN 2
+#define UIP_TCP_FLAG_RST 4
+#define UIP_TCP_FLAG_PSH 8
+#define UIP_TCP_FLAG_ACK 16
+#define UIP_TCP_FLAG_URG 32
+
+#define UIP_BOOTP_VENDOR_SPECIFIC_LEN 64
+#define UIP_BOOTP_MAX_PAYLOAD_LEN 300
+#define UIP_DHCP_VENDOR_SPECIFIC_LEN 312
+#define UIP_DHCP_PORT_SERVER 67
+#define UIP_DHCP_PORT_CLIENT 68
+#define UIP_DHCP_MACPAD_LEN 10
+#define UIP_DHCP_HOSTNAME_LEN 64
+#define UIP_DHCP_FILENAME_LEN 128
+#define UIP_DHCP_MAGIC_COOKIE 0x63825363
+#define UIP_DHCP_MAGIC_COOKIE_LEN 4
+#define UIP_DHCP_LEASE_TIME 0x00003840
+#define UIP_DHCP_MAX_PAYLOAD_LEN (UIP_BOOTP_MAX_PAYLOAD_LEN - UIP_BOOTP_VENDOR_SPECIFIC_LEN + UIP_DHCP_VENDOR_SPECIFIC_LEN)
+#define UIP_DHCP_OPTION_LEN (UIP_DHCP_VENDOR_SPECIFIC_LEN - UIP_DHCP_MAGIC_COOKIE_LEN)
+#define UIP_DHCP_DISCOVER 1
+#define UIP_DHCP_OFFER 2
+#define UIP_DHCP_REQUEST 3
+#define UIP_DHCP_ACK 5
+#define UIP_DHCP_MAX_DNS_SERVER_NR 3
+#define UIP_DHCP_MAX_DOMAIN_NAME_LEN 256
+#define UIP_DHCP_TAG_MSG_TYPE 53
+#define UIP_DHCP_TAG_MSG_TYPE_LEN 1
+#define UIP_DHCP_TAG_SERVER_ID 54
+#define UIP_DHCP_TAG_SERVER_ID_LEN 4
+#define UIP_DHCP_TAG_LEASE_TIME 51
+#define UIP_DHCP_TAG_LEASE_TIME_LEN 4
+#define UIP_DHCP_TAG_SUBMASK 1
+#define UIP_DHCP_TAG_SUBMASK_LEN 4
+#define UIP_DHCP_TAG_ROUTER 3
+#define UIP_DHCP_TAG_ROUTER_LEN 4
+#define UIP_DHCP_TAG_ROOT 17
+#define UIP_DHCP_TAG_ROOT_LEN 4
+#define UIP_DHCP_TAG_DNS_SERVER 6
+#define UIP_DHCP_TAG_DNS_SERVER_LEN 4
+#define UIP_DHCP_TAG_DOMAIN_NAME 15
+#define UIP_DHCP_TAG_END 255
+
+/*
+ * IP packet maximum len == 64 KBytes
+ * IP header == 20 Bytes
+ * TCP header == 20 Bytes
+ * UDP header == 8 Bytes
+ */
+#define UIP_MAX_TCP_PAYLOAD (64*1024 - 20 - 20 - 1)
+#define UIP_MAX_UDP_PAYLOAD (64*1024 - 20 - 8 - 1)
+
+struct uip_eth_addr {
+ u8 addr[6];
+};
+
+struct uip_eth {
+ struct uip_eth_addr dst;
+ struct uip_eth_addr src;
+ u16 type;
+} __attribute__((packed));
+
+struct uip_arp {
+ struct uip_eth eth;
+ u16 hwtype;
+ u16 proto;
+ u8 hwlen;
+ u8 protolen;
+ u16 op;
+ struct uip_eth_addr smac;
+ u32 sip;
+ struct uip_eth_addr dmac;
+ u32 dip;
+} __attribute__((packed));
+
+struct uip_ip {
+ struct uip_eth eth;
+ u8 vhl;
+ u8 tos;
+ /*
+ * len = IP hdr + IP payload
+ */
+ u16 len;
+ u16 id;
+ u16 flgfrag;
+ u8 ttl;
+ u8 proto;
+ u16 csum;
+ u32 sip;
+ u32 dip;
+} __attribute__((packed));
+
+struct uip_icmp {
+ struct uip_ip ip;
+ u8 type;
+ u8 code;
+ u16 csum;
+ u16 id;
+ u16 seq;
+} __attribute__((packed));
+
+struct uip_udp {
+ /*
+ * FIXME: IP Options (IP hdr len > 20 bytes) are not supported
+ */
+ struct uip_ip ip;
+ u16 sport;
+ u16 dport;
+ /*
+ * len = UDP hdr + UDP payload
+ */
+ u16 len;
+ u16 csum;
+ u8 payload[0];
+} __attribute__((packed));
+
+struct uip_tcp {
+ /*
+ * FIXME: IP Options (IP hdr len > 20 bytes) are not supported
+ */
+ struct uip_ip ip;
+ u16 sport;
+ u16 dport;
+ u32 seq;
+ u32 ack;
+ u8 off;
+ u8 flg;
+ u16 win;
+ u16 csum;
+ u16 urgent;
+} __attribute__((packed));
+
+struct uip_pseudo_hdr {
+ u32 sip;
+ u32 dip;
+ u8 zero;
+ u8 proto;
+ u16 len;
+} __attribute__((packed));
+
+struct uip_dhcp {
+ struct uip_udp udp;
+ u8 msg_type;
+ u8 hardware_type;
+ u8 hardware_len;
+ u8 hops;
+ u32 id;
+ u16 time;
+ u16 flg;
+ u32 client_ip;
+ u32 your_ip;
+ u32 server_ip;
+ u32 agent_ip;
+ struct uip_eth_addr client_mac;
+ u8 pad[UIP_DHCP_MACPAD_LEN];
+ u8 server_hostname[UIP_DHCP_HOSTNAME_LEN];
+ u8 boot_filename[UIP_DHCP_FILENAME_LEN];
+ u32 magic_cookie;
+ u8 option[UIP_DHCP_OPTION_LEN];
+} __attribute__((packed));
+
+struct uip_info {
+ struct list_head udp_socket_head;
+ struct list_head tcp_socket_head;
+ pthread_mutex_t udp_socket_lock;
+ pthread_mutex_t tcp_socket_lock;
+ struct uip_eth_addr guest_mac;
+ struct uip_eth_addr host_mac;
+ pthread_cond_t buf_free_cond;
+ pthread_cond_t buf_used_cond;
+ struct list_head buf_head;
+ pthread_mutex_t buf_lock;
+ pthread_t udp_thread;
+ int udp_epollfd;
+ int buf_free_nr;
+ int buf_used_nr;
+ u32 guest_ip;
+ u32 guest_netmask;
+ u32 host_ip;
+ u32 dns_ip[UIP_DHCP_MAX_DNS_SERVER_NR];
+ char *domain_name;
+ u32 buf_nr;
+};
+
+struct uip_buf {
+ struct list_head list;
+ struct uip_info *info;
+ int vnet_len;
+ int eth_len;
+ int status;
+ char *vnet;
+ char *eth;
+ int id;
+};
+
+struct uip_udp_socket {
+ struct sockaddr_in addr;
+ struct list_head list;
+ pthread_mutex_t *lock;
+ u32 dport, sport;
+ u32 dip, sip;
+ int fd;
+};
+
+struct uip_tcp_socket {
+ struct sockaddr_in addr;
+ struct list_head list;
+ struct uip_info *info;
+ pthread_cond_t cond;
+ pthread_mutex_t *lock;
+ pthread_t thread;
+ u32 dport, sport;
+ u32 guest_acked;
+ u16 window_size;
+ /*
+ * Initial Sequence Number
+ */
+ u32 isn_server;
+ u32 isn_guest;
+ u32 ack_server;
+ u32 seq_server;
+ int write_done;
+ int read_done;
+ u32 dip, sip;
+ u8 *payload;
+ int fd;
+};
+
+struct uip_tx_arg {
+ struct virtio_net_hdr *vnet;
+ struct uip_info *info;
+ struct uip_eth *eth;
+ int vnet_len;
+ int eth_len;
+};
+
+static inline u16 uip_ip_hdrlen(struct uip_ip *ip)
+{
+ return (ip->vhl & 0x0f) * 4;
+}
+
+static inline u16 uip_ip_len(struct uip_ip *ip)
+{
+ return ntohs(ip->len);
+}
+
+static inline u16 uip_udp_hdrlen(struct uip_udp *udp)
+{
+ return 8;
+}
+
+static inline u16 uip_udp_len(struct uip_udp *udp)
+{
+ return ntohs(udp->len);
+}
+
+static inline u16 uip_tcp_hdrlen(struct uip_tcp *tcp)
+{
+ return (tcp->off >> 4) * 4;
+}
+
+static inline u16 uip_tcp_len(struct uip_tcp *tcp)
+{
+ struct uip_ip *ip;
+
+ ip = &tcp->ip;
+
+ return uip_ip_len(ip) - uip_ip_hdrlen(ip);
+}
+
+static inline u16 uip_tcp_payloadlen(struct uip_tcp *tcp)
+{
+ return uip_tcp_len(tcp) - uip_tcp_hdrlen(tcp);
+}
+
+static inline u8 *uip_tcp_payload(struct uip_tcp *tcp)
+{
+ return (u8 *)&tcp->sport + uip_tcp_hdrlen(tcp);
+}
+
+static inline bool uip_tcp_is_syn(struct uip_tcp *tcp)
+{
+ return (tcp->flg & UIP_TCP_FLAG_SYN) != 0;
+}
+
+static inline bool uip_tcp_is_fin(struct uip_tcp *tcp)
+{
+ return (tcp->flg & UIP_TCP_FLAG_FIN) != 0;
+}
+
+static inline u32 uip_tcp_isn(struct uip_tcp *tcp)
+{
+ return ntohl(tcp->seq);
+}
+
+static inline u32 uip_tcp_isn_alloc(void)
+{
+ /*
+ * FIXME: should increase every 4ms
+ */
+ return 10000000;
+}
+
+static inline u16 uip_eth_hdrlen(struct uip_eth *eth)
+{
+ return sizeof(*eth);
+}
+
+int uip_tx(struct iovec *iov, u16 out, struct uip_info *info);
+int uip_rx(struct iovec *iov, u16 in, struct uip_info *info);
+int uip_init(struct uip_info *info);
+
+int uip_tx_do_ipv4_udp_dhcp(struct uip_tx_arg *arg);
+int uip_tx_do_ipv4_icmp(struct uip_tx_arg *arg);
+int uip_tx_do_ipv4_tcp(struct uip_tx_arg *arg);
+int uip_tx_do_ipv4_udp(struct uip_tx_arg *arg);
+int uip_tx_do_ipv4(struct uip_tx_arg *arg);
+int uip_tx_do_arp(struct uip_tx_arg *arg);
+
+u16 uip_csum_icmp(struct uip_icmp *icmp);
+u16 uip_csum_udp(struct uip_udp *udp);
+u16 uip_csum_tcp(struct uip_tcp *tcp);
+u16 uip_csum_ip(struct uip_ip *ip);
+
+struct uip_buf *uip_buf_set_used(struct uip_info *info, struct uip_buf *buf);
+struct uip_buf *uip_buf_set_free(struct uip_info *info, struct uip_buf *buf);
+struct uip_buf *uip_buf_get_used(struct uip_info *info);
+struct uip_buf *uip_buf_get_free(struct uip_info *info);
+struct uip_buf *uip_buf_clone(struct uip_tx_arg *arg);
+
+int uip_udp_make_pkg(struct uip_info *info, struct uip_udp_socket *sk, struct uip_buf *buf, u8 *payload, int payload_len);
+bool uip_udp_is_dhcp(struct uip_udp *udp);
+
+int uip_dhcp_get_dns(struct uip_info *info);
+#endif /* KVM__UIP_H */
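
Review note on the uip header-length helpers (not part of the patch): uip_ip_hdrlen() reads the low nibble of vhl and uip_tcp_hdrlen() reads the high nibble of off, both counted in 32-bit words. The sketch below repeats that arithmetic on plain bytes rather than on the packed structs (a simplification assumed here) and shows how the UIP_* constants decode to 20-byte headers. The names ip_hdrlen(), tcp_hdrlen() and hdrlen_demo() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Constants copied from kvm/uip.h. */
#define UIP_IP_VER_4		0x40
#define UIP_IP_HDR_LEN		0x05
#define UIP_TCP_HDR_LEN		0x50

/* Low nibble of the version/header-length byte, in 32-bit words. */
static inline uint16_t ip_hdrlen(uint8_t vhl)
{
	return (vhl & 0x0f) * 4;
}

/* High nibble of the TCP data-offset byte, in 32-bit words. */
static inline uint16_t tcp_hdrlen(uint8_t off)
{
	return (off >> 4) * 4;
}

/* Hypothetical self-check. */
static inline void hdrlen_demo(void)
{
	uint16_t ip_total = 1500;	/* example IP total length, host order */

	/* Version 4 with a 5-word header is the familiar 0x45 byte. */
	assert(ip_hdrlen(UIP_IP_VER_4 | UIP_IP_HDR_LEN) == 20);
	assert(tcp_hdrlen(UIP_TCP_HDR_LEN) == 20);
	/* A nibble can encode at most 15 words, that is 60 bytes. */
	assert(ip_hdrlen(0x4f) == 60);
	/* TCP payload = IP total length - IP header - TCP header. */
	assert(ip_total - ip_hdrlen(0x45) - tcp_hdrlen(0x50) == 1460);
}
```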
--- /dev/null
+#ifndef KVM__UTIL_INIT_H
+#define KVM__UTIL_INIT_H
+
+struct kvm;
+
+struct init_item {
+ struct hlist_node n;
+ const char *fn_name;
+ int (*init)(struct kvm *);
+};
+
+int init_list__init(struct kvm *kvm);
+int init_list__exit(struct kvm *kvm);
+
+int init_list_add(struct init_item *t, int (*init)(struct kvm *),
+ int priority, const char *name);
+int exit_list_add(struct init_item *t, int (*init)(struct kvm *),
+ int priority, const char *name);
+
+#define __init_list_add(cb, l) \
+static void __attribute__ ((constructor)) __init__##cb(void) \
+{ \
+ static char name[] = #cb; \
+ static struct init_item t; \
+ init_list_add(&t, cb, l, name); \
+}
+
+#define __exit_list_add(cb, l) \
+static void __attribute__ ((constructor)) __exit__##cb(void) \
+{ \
+ static char name[] = #cb; \
+ static struct init_item t; \
+ exit_list_add(&t, cb, l, name); \
+}
+
+#define core_init(cb) __init_list_add(cb, 0)
+#define base_init(cb) __init_list_add(cb, 2)
+#define dev_base_init(cb) __init_list_add(cb, 4)
+#define dev_init(cb) __init_list_add(cb, 5)
+#define virtio_dev_init(cb) __init_list_add(cb, 6)
+#define firmware_init(cb) __init_list_add(cb, 7)
+#define late_init(cb) __init_list_add(cb, 9)
+
+#define core_exit(cb) __exit_list_add(cb, 0)
+#define base_exit(cb) __exit_list_add(cb, 2)
+#define dev_base_exit(cb) __exit_list_add(cb, 4)
+#define dev_exit(cb) __exit_list_add(cb, 5)
+#define virtio_dev_exit(cb) __exit_list_add(cb, 6)
+#define firmware_exit(cb) __exit_list_add(cb, 7)
+#define late_exit(cb) __exit_list_add(cb, 9)
+#endif
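
Review note on the init-list macros (not part of the patch): core_init() and friends expand to a constructor function that registers a callback with a priority before main() runs. The implementation of init_list_add() is not visible in this part of the patch, so the sketch below is only an assumption about how such a list can work. It uses a fixed table instead of hlists, drops the struct kvm argument, and every name in it is hypothetical. It relies on the GCC/Clang constructor attribute, as the header does.

```c
#include <assert.h>
#include <stddef.h>

#define PRIORITY_LEVELS	10	/* assumption: priorities 0..9, as in the macros */
#define MAX_PER_LEVEL	8	/* hypothetical limit for this sketch */

typedef int (*init_fn)(void);

static init_fn init_table[PRIORITY_LEVELS][MAX_PER_LEVEL];
static size_t init_count[PRIORITY_LEVELS];

/* Record a callback under its priority; returns -1 if it does not fit. */
static int register_init(init_fn fn, int priority)
{
	if (priority < 0 || priority >= PRIORITY_LEVELS ||
	    init_count[priority] == MAX_PER_LEVEL)
		return -1;
	init_table[priority][init_count[priority]++] = fn;
	return 0;
}

/* Expands to a constructor that registers cb before main() starts. */
#define sketch_init(cb, prio)						\
static void __attribute__((constructor)) sketch_ctor_##cb(void)	\
{									\
	register_init(cb, prio);					\
}

static int order[4];
static int order_len;

static int late_cb(void) { order[order_len++] = 9; return 0; }
static int core_cb(void) { order[order_len++] = 0; return 0; }

/* Registered late-first on purpose; execution order follows priority. */
sketch_init(late_cb, 9)
sketch_init(core_cb, 0)

/* Run callbacks from priority 0 upwards, stopping at the first failure. */
static int run_all_inits(void)
{
	for (int p = 0; p < PRIORITY_LEVELS; p++) {
		for (size_t i = 0; i < init_count[p]; i++) {
			int r = init_table[p][i]();

			if (r < 0)
				return r;
		}
	}
	return 0;
}

/* Hypothetical self-check; call once after program start. */
static inline void init_order_demo(void)
{
	assert(run_all_inits() == 0);
	assert(order_len == 2);
	assert(order[0] == 0 && order[1] == 9);
}
```

The point of the design is that each subsystem file can declare its own init hook next to the code it initialises, with ordering expressed only through the priority number.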
--- /dev/null
+#include <linux/stringify.h>
+
+#ifndef KVM__UTIL_H
+#define KVM__UTIL_H
+
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+
+/*
+ * Some bits are stolen from perf tool :)
+ */
+
+#include <unistd.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <stdlib.h>
+#include <stdarg.h>
+#include <string.h>
+#include <stdbool.h>
+#include <signal.h>
+#include <errno.h>
+#include <limits.h>
+#include <sys/param.h>
+#include <sys/types.h>
+#include <linux/types.h>
+
+#ifdef __GNUC__
+#define NORETURN __attribute__((__noreturn__))
+#else
+#define NORETURN
+#ifndef __attribute__
+#define __attribute__(x)
+#endif
+#endif
+
+extern bool do_debug_print;
+
+#define PROT_RW (PROT_READ|PROT_WRITE)
+#define MAP_ANON_NORESERVE (MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE)
+
+extern void die(const char *err, ...) NORETURN __attribute__((format (printf, 1, 2)));
+extern void die_perror(const char *s) NORETURN;
+extern int pr_err(const char *err, ...) __attribute__((format (printf, 1, 2)));
+extern void pr_warning(const char *err, ...) __attribute__((format (printf, 1, 2)));
+extern void pr_info(const char *err, ...) __attribute__((format (printf, 1, 2)));
+extern void set_die_routine(void (*routine)(const char *err, va_list params) NORETURN);
+
+#define pr_debug(fmt, ...) \
+ do { \
+ if (do_debug_print) \
+ pr_info("(%s) %s:%d: " fmt, __FILE__, \
+ __func__, __LINE__, ##__VA_ARGS__); \
+ } while (0)
+
+
+#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))
+
+#ifndef BUG_ON_HANDLER
+# define BUG_ON_HANDLER(condition) \
+ do { \
+ if ((condition)) { \
+ pr_err("BUG at %s:%d", __FILE__, __LINE__); \
+ raise(SIGABRT); \
+ } \
+ } while (0)
+#endif
+
+#define BUG_ON(condition) BUG_ON_HANDLER((condition))
+
+#define DIE_IF(cnd) \
+do { \
+ if (cnd) \
+ die(" at (" __FILE__ ":" __stringify(__LINE__) "): " \
+ __stringify(cnd) "\n"); \
+} while (0)
+
+#define WARN_ON(condition) ({ \
+ int __ret_warn_on = !!(condition); \
+ if (__ret_warn_on) \
+ pr_warning("(%s) %s:%d: failed condition: %s", \
+ __FILE__, __func__, __LINE__, \
+ __stringify(condition)); \
+ __ret_warn_on; \
+})
+
+#define MSECS_TO_USECS(s) ((s) * 1000)
+
+/* Millisecond sleep */
+static inline void msleep(unsigned int msecs)
+{
+ usleep(MSECS_TO_USECS(msecs));
+}
+
+struct kvm;
+void *mmap_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size);
+void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *hugetlbfs_path, u64 size);
+
+#endif /* KVM__UTIL_H */
--- /dev/null
+#ifndef KVM__VESA_H
+#define KVM__VESA_H
+
+#include <linux/types.h>
+
+#define VESA_WIDTH 640
+#define VESA_HEIGHT 480
+
+#define VESA_MEM_ADDR 0xd0000000
+#define VESA_MEM_SIZE (4*VESA_WIDTH*VESA_HEIGHT)
+#define VESA_BPP 32
+
+struct kvm;
+struct biosregs;
+
+struct framebuffer *vesa__init(struct kvm *self);
+
+#endif
--- /dev/null
+#ifndef KVM__VIRTIO_9P_H
+#define KVM__VIRTIO_9P_H
+#include "kvm/virtio.h"
+#include "kvm/pci.h"
+#include "kvm/threadpool.h"
+#include "kvm/parse-options.h"
+
+#include <sys/types.h>
+#include <dirent.h>
+#include <linux/list.h>
+#include <linux/rbtree.h>
+
+#define NUM_VIRT_QUEUES 1
+#define VIRTQUEUE_NUM 128
+#define VIRTIO_9P_DEFAULT_TAG "kvm_9p"
+#define VIRTIO_9P_HDR_LEN (sizeof(u32)+sizeof(u8)+sizeof(u16))
+#define VIRTIO_9P_VERSION_DOTL "9P2000.L"
+#define MAX_TAG_LEN 32
+
+struct p9_msg {
+ u32 size;
+ u8 cmd;
+ u16 tag;
+ u8 msg[0];
+} __attribute__((packed));
+
+struct p9_fid {
+ u32 fid;
+ u32 uid;
+ char abs_path[PATH_MAX];
+ char *path;
+ DIR *dir;
+ int fd;
+ struct rb_node node;
+};
+
+struct p9_dev_job {
+ struct virt_queue *vq;
+ struct p9_dev *p9dev;
+ struct thread_pool__job job_id;
+};
+
+struct p9_dev {
+ struct list_head list;
+ struct virtio_device vdev;
+ struct rb_root fids;
+
+ struct virtio_9p_config *config;
+ u32 features;
+
+ /* virtio queue */
+ struct virt_queue vqs[NUM_VIRT_QUEUES];
+ struct p9_dev_job jobs[NUM_VIRT_QUEUES];
+ char root_dir[PATH_MAX];
+};
+
+struct p9_pdu {
+ u32 queue_head;
+ size_t read_offset;
+ size_t write_offset;
+ u16 out_iov_cnt;
+ u16 in_iov_cnt;
+ struct iovec in_iov[VIRTQUEUE_NUM];
+ struct iovec out_iov[VIRTQUEUE_NUM];
+};
+
+struct kvm;
+
+int virtio_9p_rootdir_parser(const struct option *opt, const char *arg, int unset);
+int virtio_9p_img_name_parser(const struct option *opt, const char *arg, int unset);
+int virtio_9p__register(struct kvm *kvm, const char *root, const char *tag_name);
+int virtio_9p__init(struct kvm *kvm);
+int virtio_p9_pdu_readf(struct p9_pdu *pdu, const char *fmt, ...);
+int virtio_p9_pdu_writef(struct p9_pdu *pdu, const char *fmt, ...);
+
+#endif
--- /dev/null
+#ifndef KVM__BLN_VIRTIO_H
+#define KVM__BLN_VIRTIO_H
+
+struct kvm;
+
+int virtio_bln__init(struct kvm *kvm);
+int virtio_bln__exit(struct kvm *kvm);
+
+#endif /* KVM__BLN_VIRTIO_H */
--- /dev/null
+#ifndef KVM__BLK_VIRTIO_H
+#define KVM__BLK_VIRTIO_H
+
+#include "kvm/disk-image.h"
+
+struct kvm;
+
+int virtio_blk__init(struct kvm *kvm);
+int virtio_blk__exit(struct kvm *kvm);
+void virtio_blk_complete(void *param, long len);
+
+#endif /* KVM__BLK_VIRTIO_H */
--- /dev/null
+#ifndef KVM__CONSOLE_VIRTIO_H
+#define KVM__CONSOLE_VIRTIO_H
+
+struct kvm;
+
+int virtio_console__init(struct kvm *kvm);
+void virtio_console__inject_interrupt(struct kvm *kvm);
+int virtio_console__exit(struct kvm *kvm);
+
+#endif /* KVM__CONSOLE_VIRTIO_H */
--- /dev/null
+#ifndef KVM__VIRTIO_MMIO_H
+#define KVM__VIRTIO_MMIO_H
+
+#include <linux/types.h>
+#include <linux/virtio_mmio.h>
+
+#define VIRTIO_MMIO_MAX_VQ 3
+#define VIRTIO_MMIO_MAX_CONFIG 1
+#define VIRTIO_MMIO_IO_SIZE 0x200
+
+struct kvm;
+
+struct virtio_mmio_ioevent_param {
+ struct virtio_device *vdev;
+ u32 vq;
+};
+
+struct virtio_mmio_hdr {
+ char magic[4];
+ u32 version;
+ u32 device_id;
+ u32 vendor_id;
+ u32 host_features;
+ u32 host_features_sel;
+ u32 reserved_1[2];
+ u32 guest_features;
+ u32 guest_features_sel;
+ u32 guest_page_size;
+ u32 reserved_2;
+ u32 queue_sel;
+ u32 queue_num_max;
+ u32 queue_num;
+ u32 queue_align;
+ u32 queue_pfn;
+ u32 reserved_3[3];
+ u32 queue_notify;
+ u32 reserved_4[3];
+ u32 interrupt_state;
+ u32 interrupt_ack;
+ u32 reserved_5[2];
+ u32 status;
+} __attribute__((packed));
+
+struct virtio_mmio {
+ u32 addr;
+ void *dev;
+ struct kvm *kvm;
+ u8 irq;
+ struct virtio_mmio_hdr hdr;
+ struct virtio_mmio_ioevent_param ioeventfds[VIRTIO_MMIO_MAX_VQ];
+};
+
+int virtio_mmio_signal_vq(struct kvm *kvm, struct virtio_device *vdev, u32 vq);
+int virtio_mmio_signal_config(struct kvm *kvm, struct virtio_device *vdev);
+int virtio_mmio_exit(struct kvm *kvm, struct virtio_device *vdev);
+int virtio_mmio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
+ int device_id, int subsys_id, int class);
+#endif
--- /dev/null
+#ifndef KVM__VIRTIO_NET_H
+#define KVM__VIRTIO_NET_H
+
+#include "kvm/parse-options.h"
+
+struct kvm;
+
+struct virtio_net_params {
+ const char *guest_ip;
+ const char *host_ip;
+ const char *script;
+ const char *trans;
+ char guest_mac[6];
+ char host_mac[6];
+ struct kvm *kvm;
+ int mode;
+ int vhost;
+ int fd;
+};
+
+int virtio_net__init(struct kvm *kvm);
+int virtio_net__exit(struct kvm *kvm);
+int netdev_parser(const struct option *opt, const char *arg, int unset);
+
+enum {
+ NET_MODE_USER,
+ NET_MODE_TAP
+};
+
+#endif /* KVM__VIRTIO_NET_H */
--- /dev/null
+#ifndef VIRTIO_PCI_DEV_H_
+#define VIRTIO_PCI_DEV_H_
+
+#include <linux/virtio_ids.h>
+
+/*
+ * Virtio PCI device constants and resources
+ * they do use (such as irqs and pins).
+ */
+
+#define PCI_DEVICE_ID_VIRTIO_NET 0x1000
+#define PCI_DEVICE_ID_VIRTIO_BLK 0x1001
+#define PCI_DEVICE_ID_VIRTIO_CONSOLE 0x1003
+#define PCI_DEVICE_ID_VIRTIO_RNG 0x1004
+#define PCI_DEVICE_ID_VIRTIO_BLN 0x1005
+#define PCI_DEVICE_ID_VIRTIO_SCSI 0x1008
+#define PCI_DEVICE_ID_VIRTIO_9P 0x1009
+#define PCI_DEVICE_ID_VESA 0x2000
+#define PCI_DEVICE_ID_PCI_SHMEM 0x0001
+
+#define PCI_VENDOR_ID_REDHAT_QUMRANET 0x1af4
+#define PCI_VENDOR_ID_PCI_SHMEM 0x0001
+#define PCI_SUBSYSTEM_VENDOR_ID_REDHAT_QUMRANET 0x1af4
+
+#define PCI_SUBSYSTEM_ID_VESA 0x0004
+#define PCI_SUBSYSTEM_ID_PCI_SHMEM 0x0001
+
+#define PCI_CLASS_BLK 0x018000
+#define PCI_CLASS_NET 0x020000
+#define PCI_CLASS_CONSOLE 0x078000
+/*
+ * 0xFF Device does not fit in any defined classes
+ */
+#define PCI_CLASS_RNG 0xff0000
+#define PCI_CLASS_BLN 0xff0000
+#define PCI_CLASS_9P 0xff0000
+
+#endif /* VIRTIO_PCI_DEV_H_ */
--- /dev/null
+#ifndef KVM__VIRTIO_PCI_H
+#define KVM__VIRTIO_PCI_H
+
+#include "kvm/pci.h"
+
+#include <linux/types.h>
+
+#define VIRTIO_PCI_MAX_VQ 3
+#define VIRTIO_PCI_MAX_CONFIG 1
+
+struct kvm;
+
+struct virtio_pci_ioevent_param {
+ struct virtio_device *vdev;
+ u32 vq;
+};
+
+#define VIRTIO_PCI_F_SIGNAL_MSI (1 << 0)
+
+struct virtio_pci {
+ struct pci_device_header pci_hdr;
+ void *dev;
+
+ u16 base_addr;
+ u8 status;
+ u8 isr;
+ u32 features;
+
+ /* MSI-X */
+ u16 config_vector;
+ u32 config_gsi;
+ u32 vq_vector[VIRTIO_PCI_MAX_VQ];
+ u32 gsis[VIRTIO_PCI_MAX_VQ];
+ u32 msix_io_block;
+ u64 msix_pba;
+ struct msix_table msix_table[VIRTIO_PCI_MAX_VQ + VIRTIO_PCI_MAX_CONFIG];
+
+ /* virtio queue */
+ u16 queue_selector;
+ struct virtio_pci_ioevent_param ioeventfds[VIRTIO_PCI_MAX_VQ];
+};
+
+int virtio_pci__signal_vq(struct kvm *kvm, struct virtio_device *vdev, u32 vq);
+int virtio_pci__signal_config(struct kvm *kvm, struct virtio_device *vdev);
+int virtio_pci__exit(struct kvm *kvm, struct virtio_device *vdev);
+int virtio_pci__init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
+ int device_id, int subsys_id, int class);
+
+#endif
--- /dev/null
+#ifndef KVM__RNG_VIRTIO_H
+#define KVM__RNG_VIRTIO_H
+
+struct kvm;
+
+int virtio_rng__init(struct kvm *kvm);
+int virtio_rng__exit(struct kvm *kvm);
+
+#endif /* KVM__RNG_VIRTIO_H */
--- /dev/null
+#ifndef KVM__SCSI_VIRTIO_H
+#define KVM__SCSI_VIRTIO_H
+
+#include "kvm/disk-image.h"
+
+struct kvm;
+
+int virtio_scsi_init(struct kvm *kvm);
+int virtio_scsi_exit(struct kvm *kvm);
+
+/*----------------------------------------------------*/
+/* TODO: Remove this when tcm_vhost goes upstream */
+#define TRANSPORT_IQN_LEN 224
+#define VHOST_SCSI_ABI_VERSION 0
+struct vhost_scsi_target {
+ int abi_version;
+ unsigned char vhost_wwpn[TRANSPORT_IQN_LEN];
+ unsigned short vhost_tpgt;
+};
+/* VHOST_SCSI specific defines */
+#define VHOST_SCSI_SET_ENDPOINT _IOW(VHOST_VIRTIO, 0x40, struct vhost_scsi_target)
+#define VHOST_SCSI_CLEAR_ENDPOINT _IOW(VHOST_VIRTIO, 0x41, struct vhost_scsi_target)
+#define VHOST_SCSI_GET_ABI_VERSION _IOW(VHOST_VIRTIO, 0x42, struct vhost_scsi_target)
+/*----------------------------------------------------*/
+
+#endif /* KVM__SCSI_VIRTIO_H */
--- /dev/null
+#ifndef KVM__VIRTIO_H
+#define KVM__VIRTIO_H
+
+#include <linux/virtio_ring.h>
+#include <linux/virtio_pci.h>
+
+#include <linux/types.h>
+#include <sys/uio.h>
+
+#include "kvm/kvm.h"
+
+#define VIRTIO_IRQ_LOW 0
+#define VIRTIO_IRQ_HIGH 1
+
+#define VIRTIO_PCI_O_CONFIG 0
+#define VIRTIO_PCI_O_MSIX 1
+
+struct virt_queue {
+ struct vring vring;
+ u32 pfn;
+ /* The last_avail_idx field is an index to ->ring of struct vring_avail.
+ It's where we assume the next request index is at. */
+ u16 last_avail_idx;
+ u16 last_used_signalled;
+};
+
+static inline u16 virt_queue__pop(struct virt_queue *queue)
+{
+ return queue->vring.avail->ring[queue->last_avail_idx++ % queue->vring.num];
+}
+
+static inline struct vring_desc *virt_queue__get_desc(struct virt_queue *queue, u16 desc_ndx)
+{
+ return &queue->vring.desc[desc_ndx];
+}
+
+static inline bool virt_queue__available(struct virt_queue *vq)
+{
+ if (!vq->vring.avail)
+ return false;
+
+ vring_avail_event(&vq->vring) = vq->last_avail_idx;
+ return vq->vring.avail->idx != vq->last_avail_idx;
+}
+
+/*
+ * Warning: on 32-bit hosts, shifting pfn left may cause a truncation of pfn values
+ * higher than 4GB - thus, pointing to the wrong area in guest virtual memory space
+ * and breaking the virt queue which owns this pfn.
+ */
+static inline void *guest_pfn_to_host(struct kvm *kvm, u32 pfn)
+{
+ return guest_flat_to_host(kvm, (unsigned long)pfn << VIRTIO_PCI_QUEUE_ADDR_SHIFT);
+}
+
+
+struct vring_used_elem *virt_queue__set_used_elem(struct virt_queue *queue, u32 head, u32 len);
+
+bool virtio_queue__should_signal(struct virt_queue *vq);
+u16 virt_queue__get_iov(struct virt_queue *vq, struct iovec iov[],
+ u16 *out, u16 *in, struct kvm *kvm);
+u16 virt_queue__get_head_iov(struct virt_queue *vq, struct iovec iov[],
+ u16 *out, u16 *in, u16 head, struct kvm *kvm);
+u16 virt_queue__get_inout_iov(struct kvm *kvm, struct virt_queue *queue,
+ struct iovec in_iov[], struct iovec out_iov[],
+ u16 *in, u16 *out);
+int virtio__get_dev_specific_field(int offset, bool msix, u32 *config_off);
+
+enum virtio_trans {
+ VIRTIO_PCI,
+ VIRTIO_MMIO,
+};
+
+struct virtio_device {
+ bool use_vhost;
+ void *virtio;
+ struct virtio_ops *ops;
+};
+
+struct virtio_ops {
+ u8 *(*get_config)(struct kvm *kvm, void *dev);
+ u32 (*get_host_features)(struct kvm *kvm, void *dev);
+ void (*set_guest_features)(struct kvm *kvm, void *dev, u32 features);
+ int (*init_vq)(struct kvm *kvm, void *dev, u32 vq, u32 pfn);
+ int (*notify_vq)(struct kvm *kvm, void *dev, u32 vq);
+ int (*get_pfn_vq)(struct kvm *kvm, void *dev, u32 vq);
+ int (*get_size_vq)(struct kvm *kvm, void *dev, u32 vq);
+ int (*set_size_vq)(struct kvm *kvm, void *dev, u32 vq, int size);
+ void (*notify_vq_gsi)(struct kvm *kvm, void *dev, u32 vq, u32 gsi);
+ void (*notify_vq_eventfd)(struct kvm *kvm, void *dev, u32 vq, u32 efd);
+ int (*signal_vq)(struct kvm *kvm, struct virtio_device *vdev, u32 queueid);
+ int (*signal_config)(struct kvm *kvm, struct virtio_device *vdev);
+ int (*init)(struct kvm *kvm, void *dev, struct virtio_device *vdev,
+ int device_id, int subsys_id, int class);
+ int (*exit)(struct kvm *kvm, struct virtio_device *vdev);
+};
+
+int virtio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
+ struct virtio_ops *ops, enum virtio_trans trans,
+ int device_id, int subsys_id, int class);
+int virtio_compat_add_message(const char *device, const char *config);
+#endif /* KVM__VIRTIO_H */
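
Review note on virt_queue__pop() and virt_queue__available() (not part of the patch): last_avail_idx is a free-running 16-bit counter, and only the ring lookup is reduced modulo the queue size. The sketch below models that with a small fake queue in place of struct vring. The struct, its field names and the queue size of 4 are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define RING_NUM 4	/* hypothetical queue size */

struct fake_queue {
	uint16_t ring[RING_NUM];	/* stands in for vring.avail->ring */
	uint16_t avail_idx;		/* stands in for vring.avail->idx */
	uint16_t last_avail_idx;	/* next entry the device will consume */
};

/* Work is pending while the two free-running counters differ. */
static inline int fake_queue_available(const struct fake_queue *q)
{
	return q->avail_idx != q->last_avail_idx;
}

/* Return the next descriptor head and advance the consumer index. */
static inline uint16_t fake_queue_pop(struct fake_queue *q)
{
	return q->ring[q->last_avail_idx++ % RING_NUM];
}

/* Hypothetical self-check. */
static inline void fake_queue_demo(void)
{
	struct fake_queue q = {
		.ring = { 10, 11, 12, 13 },
		.avail_idx = 6,
		.last_avail_idx = 3,
	};

	/* Three entries are pending; the lookup wraps from slot 3 to slot 0. */
	assert(fake_queue_pop(&q) == 13);
	assert(fake_queue_pop(&q) == 10);
	assert(fake_queue_pop(&q) == 11);
	assert(!fake_queue_available(&q));

	/* The counter itself wraps at 65535 without losing track. */
	q.last_avail_idx = 65535;
	q.avail_idx = 0;
	assert(fake_queue_available(&q));
	assert(fake_queue_pop(&q) == 13);	/* 65535 % 4 == 3 */
	assert(q.last_avail_idx == 0);
	assert(!fake_queue_available(&q));
}
```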
--- /dev/null
+#ifndef KVM__VNC_H
+#define KVM__VNC_H
+
+#include "kvm/kvm.h"
+
+struct framebuffer;
+
+#ifdef CONFIG_HAS_VNCSERVER
+int vnc__init(struct kvm *kvm);
+int vnc__exit(struct kvm *kvm);
+#else
+static inline int vnc__init(struct kvm *kvm)
+{
+ return 0;
+}
+static inline int vnc__exit(struct kvm *kvm)
+{
+ return 0;
+}
+#endif
+
+#endif /* KVM__VNC_H */
--- /dev/null
+#ifndef _KVM_LINUX_BITOPS_H_
+#define _KVM_LINUX_BITOPS_H_
+
+#include <linux/kernel.h>
+#include <linux/compiler.h>
+#include <asm/hweight.h>
+
+#define BITS_PER_LONG __WORDSIZE
+#define BITS_PER_BYTE 8
+#define BITS_TO_LONGS(nr) DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
+
+static inline void set_bit(int nr, unsigned long *addr)
+{
+ addr[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
+}
+
+static inline void clear_bit(int nr, unsigned long *addr)
+{
+ addr[nr / BITS_PER_LONG] &= ~(1UL << (nr % BITS_PER_LONG));
+}
+
+static __always_inline int test_bit(unsigned int nr, const unsigned long *addr)
+{
+ return ((1UL << (nr % BITS_PER_LONG)) &
+ (((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0;
+}
+
+static inline unsigned long hweight_long(unsigned long w)
+{
+ return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
+}
+
+#endif
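
Review note on the bitops helpers (not part of the patch): set_bit(), clear_bit() and test_bit() treat an array of unsigned long as one long bitmap, using nr / BITS_PER_LONG to pick the word and nr % BITS_PER_LONG to pick the bit. The sketch below repeats them stand-alone. It derives BITS_PER_LONG from <limits.h> rather than from __WORDSIZE, which is an assumption made for portability, and bitmap_demo() is hypothetical.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG		(CHAR_BIT * sizeof(unsigned long))
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))
#define BITS_TO_LONGS(nr)	DIV_ROUND_UP(nr, BITS_PER_LONG)

static inline void set_bit(int nr, unsigned long *addr)
{
	addr[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

static inline void clear_bit(int nr, unsigned long *addr)
{
	addr[nr / BITS_PER_LONG] &= ~(1UL << (nr % BITS_PER_LONG));
}

static inline int test_bit(unsigned int nr, const unsigned long *addr)
{
	return ((1UL << (nr % BITS_PER_LONG)) & addr[nr / BITS_PER_LONG]) != 0;
}

/* Hypothetical self-check on a 100-bit map. */
static inline void bitmap_demo(void)
{
	unsigned long map[BITS_TO_LONGS(100)] = { 0 };

	set_bit(3, map);
	set_bit(70, map);	/* lands beyond the first word */
	assert(test_bit(3, map));
	assert(test_bit(70, map));
	assert(!test_bit(4, map));

	clear_bit(70, map);
	assert(!test_bit(70, map));
	assert(test_bit(3, map));	/* other bits are untouched */
}
```

These helpers are not atomic, unlike their in-kernel namesakes, so concurrent users need their own locking.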
--- /dev/null
+#ifndef __BYTE_ORDER_H__
+#define __BYTE_ORDER_H__
+
+#include <asm/byteorder.h>
+#include <linux/byteorder/generic.h>
+
+#endif
--- /dev/null
+#ifndef _PERF_LINUX_COMPILER_H_
+#define _PERF_LINUX_COMPILER_H_
+
+#ifndef __always_inline
+#define __always_inline inline
+#endif
+#define __user
+
+#ifndef __attribute_const__
+#define __attribute_const__
+#endif
+
+#define __used __attribute__((__unused__))
+#define __packed __attribute__((packed))
+#define __iomem
+#define __force
+#define __must_check
+#define unlikely
+
+#endif
--- /dev/null
+
+#ifndef KVM__LINUX_KERNEL_H_
+#define KVM__LINUX_KERNEL_H_
+
+#define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
+
+#define ALIGN(x,a) __ALIGN_MASK(x,(typeof(x))(a)-1)
+#define __ALIGN_MASK(x,mask) (((x)+(mask))&~(mask))
+
+#ifndef offsetof
+#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
+#endif
+
+#ifndef container_of
+/**
+ * container_of - cast a member of a structure out to the containing structure
+ * @ptr: the pointer to the member.
+ * @type: the type of the container struct this is embedded in.
+ * @member: the name of the member within the struct.
+ *
+ */
+#define container_of(ptr, type, member) ({ \
+ const typeof(((type *)0)->member) * __mptr = (ptr); \
+ (type *)((char *)__mptr - offsetof(type, member)); })
+#endif
+
+#define min(x, y) ({ \
+ typeof(x) _min1 = (x); \
+ typeof(y) _min2 = (y); \
+ (void) (&_min1 == &_min2); \
+ _min1 < _min2 ? _min1 : _min2; })
+
+#define max(x, y) ({ \
+ typeof(x) _max1 = (x); \
+ typeof(y) _max2 = (y); \
+ (void) (&_max1 == &_max2); \
+ _max1 > _max2 ? _max1 : _max2; })
+
+#endif
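
Review note on the kernel.h helpers (not part of the patch): container_of() recovers the enclosing struct from a pointer to one of its members, and ALIGN() rounds up to a power-of-two boundary. The sketch below restates the macros stand-alone and exercises them. It spells typeof as __typeof__ so that it also builds in strict ISO modes (an assumption about the compiler flags), it still needs GNU statement expressions, and the struct names and kernel_helpers_demo() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))
#define ALIGN_MASK(x, mask)	(((x) + (mask)) & ~(mask))
#define ALIGN(x, a)		ALIGN_MASK(x, (__typeof__(x))(a) - 1)

#define container_of(ptr, type, member) ({				\
	const __typeof__(((type *)0)->member) *__mptr = (ptr);		\
	(type *)((char *)__mptr - offsetof(type, member)); })

/* Hypothetical types: a job that embeds a list link. */
struct link {
	struct link *next;
};

struct job {
	int id;
	struct link node;
};

/* Hypothetical self-check. */
static inline void kernel_helpers_demo(void)
{
	struct job j = { .id = 42 };
	struct link *l = &j.node;
	struct job *owner = container_of(l, struct job, node);

	/* Walking back from the embedded member yields the enclosing struct. */
	assert(owner == &j);
	assert(owner->id == 42);

	/* 10 items in groups of 4 need 3 groups. */
	assert(DIV_ROUND_UP(10, 4) == 3);

	/* ALIGN() rounds up and leaves aligned values unchanged. */
	assert(ALIGN(4097UL, 4096) == 8192UL);
	assert(ALIGN(4096UL, 4096) == 4096UL);
}
```

ALIGN() only gives correct results when the alignment is a power of two, since it relies on masking the low bits.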
--- /dev/null
+#ifndef KVM__LINUX_MODULE_H
+#define KVM__LINUX_MODULE_H
+
+#define EXPORT_SYMBOL(name)
+
+#endif
--- /dev/null
+#ifndef KVM__LINUX_PREFETCH_H
+#define KVM__LINUX_PREFETCH_H
+
+static inline void prefetch(void *a __attribute__((unused))) { }
+
+#endif
--- /dev/null
+#ifndef _LINUX_STDDEF_H
+#define _LINUX_STDDEF_H
+
+#include <linux/compiler.h>
+
+#undef NULL
+#define NULL ((void *)0)
+
+#undef offsetof
+#ifdef __compiler_offsetof
+#define offsetof(TYPE,MEMBER) __compiler_offsetof(TYPE,MEMBER)
+#else
+#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
+#endif
+
+#endif
--- /dev/null
+#ifndef LINUX_TYPES_H
+#define LINUX_TYPES_H
+
+#include <kvm/compiler.h>
+#define __SANE_USERSPACE_TYPES__ /* For PPC64, to get LL64 types */
+#include <asm/types.h>
+
+typedef __u64 u64;
+typedef __s64 s64;
+
+typedef __u32 u32;
+typedef __s32 s32;
+
+typedef __u16 u16;
+typedef __s16 s16;
+
+typedef __u8 u8;
+typedef __s8 s8;
+
+#ifdef __CHECKER__
+#define __bitwise__ __attribute__((bitwise))
+#else
+#define __bitwise__
+#endif
+#ifdef __CHECK_ENDIAN__
+#define __bitwise __bitwise__
+#else
+#define __bitwise
+#endif
+
+
+typedef __u16 __bitwise __le16;
+typedef __u16 __bitwise __be16;
+typedef __u32 __bitwise __le32;
+typedef __u32 __bitwise __be32;
+typedef __u64 __bitwise __le64;
+typedef __u64 __bitwise __be64;
+
+struct list_head {
+ struct list_head *next, *prev;
+};
+
+struct hlist_head {
+ struct hlist_node *first;
+};
+
+struct hlist_node {
+ struct hlist_node *next, **pprev;
+};
+
+#endif /* LINUX_TYPES_H */
--- /dev/null
+#include <sys/epoll.h>
+#include <sys/ioctl.h>
+#include <pthread.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <signal.h>
+
+#include <linux/kernel.h>
+#include <linux/kvm.h>
+#include <linux/types.h>
+
+#include "kvm/ioeventfd.h"
+#include "kvm/kvm.h"
+#include "kvm/util.h"
+
+#define IOEVENTFD_MAX_EVENTS 20
+
+static struct epoll_event events[IOEVENTFD_MAX_EVENTS];
+static int epoll_fd, epoll_stop_fd;
+static LIST_HEAD(used_ioevents);
+static bool ioeventfd_avail;
+
+static void *ioeventfd__thread(void *param)
+{
+ u64 tmp = 1;
+
+ for (;;) {
+ int nfds, i;
+
+ nfds = epoll_wait(epoll_fd, events, IOEVENTFD_MAX_EVENTS, -1);
+ for (i = 0; i < nfds; i++) {
+ struct ioevent *ioevent;
+
+ if (events[i].data.fd == epoll_stop_fd)
+ goto done;
+
+ ioevent = events[i].data.ptr;
+
+ if (read(ioevent->fd, &tmp, sizeof(tmp)) < 0)
+ die("Failed reading event");
+
+ ioevent->fn(ioevent->fn_kvm, ioevent->fn_ptr);
+ }
+ }
+
+done:
+ tmp = write(epoll_stop_fd, &tmp, sizeof(tmp));
+
+ return NULL;
+}
+
+static int ioeventfd__start(void)
+{
+ pthread_t thread;
+
+ if (!ioeventfd_avail)
+ return -ENOSYS;
+
+ return pthread_create(&thread, NULL, ioeventfd__thread, NULL);
+}
+
+int ioeventfd__init(struct kvm *kvm)
+{
+ struct epoll_event epoll_event = {.events = EPOLLIN};
+ int r;
+
+ ioeventfd_avail = kvm__supports_extension(kvm, KVM_CAP_IOEVENTFD);
+ if (!ioeventfd_avail)
+ return 1; /* Not fatal, but let caller determine no-go. */
+
+ epoll_fd = epoll_create(IOEVENTFD_MAX_EVENTS);
+ if (epoll_fd < 0)
+ return -errno;
+
+ epoll_stop_fd = eventfd(0, 0);
+ epoll_event.data.fd = epoll_stop_fd;
+
+ r = epoll_ctl(epoll_fd, EPOLL_CTL_ADD, epoll_stop_fd, &epoll_event);
+ if (r < 0)
+ goto cleanup;
+
+ r = ioeventfd__start();
+ if (r < 0)
+ goto cleanup;
+
+ r = 0;
+
+ return r;
+
+cleanup:
+ close(epoll_stop_fd);
+ close(epoll_fd);
+
+ return r;
+}
+base_init(ioeventfd__init);
+
+int ioeventfd__exit(struct kvm *kvm)
+{
+ u64 tmp = 1;
+ int r;
+
+ if (!ioeventfd_avail)
+ return 0;
+
+ r = write(epoll_stop_fd, &tmp, sizeof(tmp));
+ if (r < 0)
+ return r;
+
+ r = read(epoll_stop_fd, &tmp, sizeof(tmp));
+ if (r < 0)
+ return r;
+
+ close(epoll_fd);
+ close(epoll_stop_fd);
+
+ return 0;
+}
+base_exit(ioeventfd__exit);
+
+int ioeventfd__add_event(struct ioevent *ioevent, bool is_pio, bool poll_in_userspace)
+{
+ struct kvm_ioeventfd kvm_ioevent;
+ struct epoll_event epoll_event;
+ struct ioevent *new_ioevent;
+ int event, r;
+
+ if (!ioeventfd_avail)
+ return -ENOSYS;
+
+ new_ioevent = malloc(sizeof(*new_ioevent));
+ if (new_ioevent == NULL)
+ return -ENOMEM;
+
+ *new_ioevent = *ioevent;
+ event = new_ioevent->fd;
+
+ kvm_ioevent = (struct kvm_ioeventfd) {
+ .addr = ioevent->io_addr,
+ .len = ioevent->io_len,
+ .datamatch = ioevent->datamatch,
+ .fd = event,
+ .flags = KVM_IOEVENTFD_FLAG_DATAMATCH,
+ };
+
+ if (is_pio)
+ kvm_ioevent.flags |= KVM_IOEVENTFD_FLAG_PIO;
+
+ r = ioctl(ioevent->fn_kvm->vm_fd, KVM_IOEVENTFD, &kvm_ioevent);
+ if (r) {
+ r = -errno;
+ goto cleanup;
+ }
+
+ if (!poll_in_userspace)
+ return 0;
+
+ epoll_event = (struct epoll_event) {
+ .events = EPOLLIN,
+ .data.ptr = new_ioevent,
+ };
+
+ r = epoll_ctl(epoll_fd, EPOLL_CTL_ADD, event, &epoll_event);
+ if (r) {
+ r = -errno;
+ goto cleanup;
+ }
+
+ list_add_tail(&new_ioevent->list, &used_ioevents);
+
+ return 0;
+
+cleanup:
+ free(new_ioevent);
+ return r;
+}
+
+int ioeventfd__del_event(u64 addr, u64 datamatch)
+{
+ struct kvm_ioeventfd kvm_ioevent;
+ struct ioevent *ioevent;
+ u8 found = 0;
+
+ if (!ioeventfd_avail)
+ return -ENOSYS;
+
+ list_for_each_entry(ioevent, &used_ioevents, list) {
+ if (ioevent->io_addr == addr) {
+ found = 1;
+ break;
+ }
+ }
+
+ if (found == 0 || ioevent == NULL)
+ return -ENOENT;
+
+ kvm_ioevent = (struct kvm_ioeventfd) {
+ .addr = ioevent->io_addr,
+ .len = ioevent->io_len,
+ .datamatch = ioevent->datamatch,
+ .flags = KVM_IOEVENTFD_FLAG_PIO
+ | KVM_IOEVENTFD_FLAG_DEASSIGN
+ | KVM_IOEVENTFD_FLAG_DATAMATCH,
+ };
+
+ ioctl(ioevent->fn_kvm->vm_fd, KVM_IOEVENTFD, &kvm_ioevent);
+
+ epoll_ctl(epoll_fd, EPOLL_CTL_DEL, ioevent->fd, NULL);
+
+ list_del(&ioevent->list);
+
+ close(ioevent->fd);
+ free(ioevent);
+
+ return 0;
+}
--- /dev/null
+#include "kvm/ioport.h"
+
+#include "kvm/kvm.h"
+#include "kvm/util.h"
+#include "kvm/brlock.h"
+#include "kvm/rbtree-interval.h"
+#include "kvm/mutex.h"
+
+#include <linux/kvm.h> /* for KVM_EXIT_* */
+#include <linux/types.h>
+
+#include <stdbool.h>
+#include <limits.h>
+#include <stdlib.h>
+#include <stdio.h>
+
+#define ioport_node(n) rb_entry(n, struct ioport, node)
+
+DEFINE_MUTEX(ioport_mutex);
+
+static u16 free_io_port_idx; /* protected by ioport_mutex */
+
+static struct rb_root ioport_tree = RB_ROOT;
+
+static u16 ioport__find_free_port(void)
+{
+ u16 free_port;
+
+ mutex_lock(&ioport_mutex);
+ free_port = IOPORT_START + free_io_port_idx * IOPORT_SIZE;
+ free_io_port_idx++;
+ mutex_unlock(&ioport_mutex);
+
+ return free_port;
+}
+
+static struct ioport *ioport_search(struct rb_root *root, u64 addr)
+{
+ struct rb_int_node *node;
+
+ node = rb_int_search_single(root, addr);
+ if (node == NULL)
+ return NULL;
+
+ return ioport_node(node);
+}
+
+static int ioport_insert(struct rb_root *root, struct ioport *data)
+{
+ return rb_int_insert(root, &data->node);
+}
+
+static void ioport_remove(struct rb_root *root, struct ioport *data)
+{
+ rb_int_erase(root, &data->node);
+}
+
+int ioport__register(struct kvm *kvm, u16 port, struct ioport_operations *ops, int count, void *param)
+{
+ struct ioport *entry;
+ int r;
+
+ br_write_lock(kvm);
+ if (port == IOPORT_EMPTY)
+ port = ioport__find_free_port();
+
+ entry = ioport_search(&ioport_tree, port);
+ if (entry) {
+ pr_warning("ioport re-registered: %x", port);
+ rb_int_erase(&ioport_tree, &entry->node);
+ }
+
+ entry = malloc(sizeof(*entry));
+ if (entry == NULL) {
+ br_write_unlock(kvm);
+ return -ENOMEM;
+ }
+
+ *entry = (struct ioport) {
+ .node = RB_INT_INIT(port, port + count),
+ .ops = ops,
+ .priv = param,
+ };
+
+ r = ioport_insert(&ioport_tree, entry);
+ if (r < 0) {
+ free(entry);
+ br_write_unlock(kvm);
+ return r;
+ }
+ br_write_unlock(kvm);
+
+ return port;
+}
+
+int ioport__unregister(struct kvm *kvm, u16 port)
+{
+ struct ioport *entry;
+ int r;
+
+ br_write_lock(kvm);
+
+ r = -ENOENT;
+ entry = ioport_search(&ioport_tree, port);
+ if (!entry)
+ goto done;
+
+ ioport_remove(&ioport_tree, entry);
+
+ free(entry);
+
+ r = 0;
+
+done:
+ br_write_unlock(kvm);
+
+ return r;
+}
+
+static void ioport__unregister_all(void)
+{
+ struct ioport *entry;
+ struct rb_node *rb;
+ struct rb_int_node *rb_node;
+
+ rb = rb_first(&ioport_tree);
+ while (rb) {
+ rb_node = rb_int(rb);
+ entry = ioport_node(rb_node);
+ ioport_remove(&ioport_tree, entry);
+ free(entry);
+ rb = rb_first(&ioport_tree);
+ }
+}
+
+static const char *to_direction(int direction)
+{
+ if (direction == KVM_EXIT_IO_IN)
+ return "IN";
+ else
+ return "OUT";
+}
+
+static void ioport_error(u16 port, void *data, int direction, int size, u32 count)
+{
+ fprintf(stderr, "IO error: %s port=%x, size=%d, count=%u\n", to_direction(direction), port, size, count);
+}
+
+bool kvm__emulate_io(struct kvm *kvm, u16 port, void *data, int direction, int size, u32 count)
+{
+ struct ioport_operations *ops;
+ bool ret = false;
+ struct ioport *entry;
+ void *ptr = data;
+
+ br_read_lock();
+ entry = ioport_search(&ioport_tree, port);
+ if (!entry)
+ goto error;
+
+ ops = entry->ops;
+
+ while (count--) {
+ if (direction == KVM_EXIT_IO_IN && ops->io_in)
+ ret = ops->io_in(entry, kvm, port, ptr, size);
+ else if (ops->io_out)
+ ret = ops->io_out(entry, kvm, port, ptr, size);
+
+ ptr += size;
+ }
+
+ if (ret) {
+ br_read_unlock();
+ return true;
+ }
+
+error:
+ br_read_unlock();
+
+ if (kvm->cfg.ioport_debug)
+ ioport_error(port, data, direction, size, count);
+
+ return !kvm->cfg.ioport_debug;
+}
+
+int ioport__init(struct kvm *kvm)
+{
+ ioport__setup_arch(kvm);
+
+ return 0;
+}
+dev_base_init(ioport__init);
+
+int ioport__exit(struct kvm *kvm)
+{
+ ioport__unregister_all();
+ return 0;
+}
+dev_base_exit(ioport__exit);
--- /dev/null
+#include <stdio.h>
+#include <string.h>
+#include <errno.h>
+
+/* user defined header files */
+#include "kvm/builtin-debug.h"
+#include "kvm/builtin-pause.h"
+#include "kvm/builtin-resume.h"
+#include "kvm/builtin-balloon.h"
+#include "kvm/builtin-list.h"
+#include "kvm/builtin-version.h"
+#include "kvm/builtin-setup.h"
+#include "kvm/builtin-stop.h"
+#include "kvm/builtin-stat.h"
+#include "kvm/builtin-help.h"
+#include "kvm/builtin-sandbox.h"
+#include "kvm/kvm-cmd.h"
+#include "kvm/builtin-run.h"
+#include "kvm/util.h"
+
+struct cmd_struct kvm_commands[] = {
+ { "pause", kvm_cmd_pause, kvm_pause_help, 0 },
+ { "resume", kvm_cmd_resume, kvm_resume_help, 0 },
+ { "debug", kvm_cmd_debug, kvm_debug_help, 0 },
+ { "balloon", kvm_cmd_balloon, kvm_balloon_help, 0 },
+ { "list", kvm_cmd_list, kvm_list_help, 0 },
+ { "version", kvm_cmd_version, NULL, 0 },
+ { "--version", kvm_cmd_version, NULL, 0 },
+ { "stop", kvm_cmd_stop, kvm_stop_help, 0 },
+ { "stat", kvm_cmd_stat, kvm_stat_help, 0 },
+ { "help", kvm_cmd_help, NULL, 0 },
+ { "setup", kvm_cmd_setup, kvm_setup_help, 0 },
+ { "run", kvm_cmd_run, kvm_run_help, 0 },
+ { "sandbox", kvm_cmd_sandbox, kvm_run_help, 0 },
+ { NULL, NULL, NULL, 0 },
+};
+
+/*
+ * kvm_get_command: Searches an array of commands for the given command name
+ * and returns a pointer to the matching cmd_struct.
+ *
+ * Input parameters:
+ * command: Array of possible commands. The last entry in the array must have
+ * a NULL cmd field.
+ * cmd: The command name to search for in the array
+ *
+ * Return Value:
+ * NULL: If cmd does not match any command in the array
+ * p: Pointer to the cmd_struct of the matching command
+ */
+struct cmd_struct *kvm_get_command(struct cmd_struct *command,
+ const char *cmd)
+{
+ struct cmd_struct *p = command;
+
+ while (p->cmd) {
+ if (!strcmp(p->cmd, cmd))
+ return p;
+ p++;
+ }
+ return NULL;
+}
+
+int handle_command(struct cmd_struct *command, int argc, const char **argv)
+{
+ struct cmd_struct *p;
+ const char *prefix = NULL;
+ int ret = 0;
+
+ if (!argv || !*argv) {
+ p = kvm_get_command(command, "help");
+ BUG_ON(!p);
+ return p->fn(argc, argv, prefix);
+ }
+
+ p = kvm_get_command(command, argv[0]);
+ if (!p) {
+ p = kvm_get_command(command, "help");
+ BUG_ON(!p);
+ p->fn(0, NULL, prefix);
+ return EINVAL;
+ }
+
+ ret = p->fn(argc - 1, &argv[1], prefix);
+ if (ret < 0) {
+ if (errno == EPERM)
+ die("Permission error - are you root?");
+ }
+
+ return ret;
+}
--- /dev/null
+#include "kvm/kvm-cpu.h"
+
+#include "kvm/symbol.h"
+#include "kvm/util.h"
+#include "kvm/kvm.h"
+
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <signal.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <stdio.h>
+
+extern __thread struct kvm_cpu *current_kvm_cpu;
+
+void kvm_cpu__enable_singlestep(struct kvm_cpu *vcpu)
+{
+ struct kvm_guest_debug debug = {
+ .control = KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_SINGLESTEP,
+ };
+
+ if (ioctl(vcpu->vcpu_fd, KVM_SET_GUEST_DEBUG, &debug) < 0)
+ pr_warning("KVM_SET_GUEST_DEBUG failed");
+}
+
+void kvm_cpu__run(struct kvm_cpu *vcpu)
+{
+ int err;
+
+ if (!vcpu->is_running)
+ return;
+
+ err = ioctl(vcpu->vcpu_fd, KVM_RUN, 0);
+ if (err < 0 && (errno != EINTR && errno != EAGAIN))
+ die_perror("KVM_RUN failed");
+}
+
+static void kvm_cpu_signal_handler(int signum)
+{
+ if (signum == SIGKVMEXIT) {
+ if (current_kvm_cpu && current_kvm_cpu->is_running) {
+ current_kvm_cpu->is_running = false;
+ kvm__continue(current_kvm_cpu->kvm);
+ }
+ } else if (signum == SIGKVMPAUSE) {
+ current_kvm_cpu->paused = 1;
+ }
+}
+
+static void kvm_cpu__handle_coalesced_mmio(struct kvm_cpu *cpu)
+{
+ if (cpu->ring) {
+ while (cpu->ring->first != cpu->ring->last) {
+ struct kvm_coalesced_mmio *m;
+ m = &cpu->ring->coalesced_mmio[cpu->ring->first];
+ kvm_cpu__emulate_mmio(cpu->kvm,
+ m->phys_addr,
+ m->data,
+ m->len,
+ 1);
+ cpu->ring->first = (cpu->ring->first + 1) % KVM_COALESCED_MMIO_MAX;
+ }
+ }
+}
+
+void kvm_cpu__reboot(struct kvm *kvm)
+{
+ int i;
+
+ /* The kvm->cpus array contains a null pointer in the last location */
+ for (i = 0; ; i++) {
+ if (kvm->cpus[i])
+ pthread_kill(kvm->cpus[i]->thread, SIGKVMEXIT);
+ else
+ break;
+ }
+}
+
+int kvm_cpu__start(struct kvm_cpu *cpu)
+{
+ sigset_t sigset;
+
+ sigemptyset(&sigset);
+ sigaddset(&sigset, SIGALRM);
+
+ pthread_sigmask(SIG_BLOCK, &sigset, NULL);
+
+ signal(SIGKVMEXIT, kvm_cpu_signal_handler);
+ signal(SIGKVMPAUSE, kvm_cpu_signal_handler);
+
+ kvm_cpu__reset_vcpu(cpu);
+
+ if (cpu->kvm->cfg.single_step)
+ kvm_cpu__enable_singlestep(cpu);
+
+ while (cpu->is_running) {
+ if (cpu->paused) {
+ kvm__notify_paused();
+ cpu->paused = 0;
+ }
+
+ if (cpu->needs_nmi) {
+ kvm_cpu__arch_nmi(cpu);
+ cpu->needs_nmi = 0;
+ }
+
+ kvm_cpu__run(cpu);
+
+ switch (cpu->kvm_run->exit_reason) {
+ case KVM_EXIT_UNKNOWN:
+ break;
+ case KVM_EXIT_DEBUG:
+ kvm_cpu__show_registers(cpu);
+ kvm_cpu__show_code(cpu);
+ break;
+ case KVM_EXIT_IO: {
+ bool ret;
+
+ ret = kvm_cpu__emulate_io(cpu->kvm,
+ cpu->kvm_run->io.port,
+ (u8 *)cpu->kvm_run +
+ cpu->kvm_run->io.data_offset,
+ cpu->kvm_run->io.direction,
+ cpu->kvm_run->io.size,
+ cpu->kvm_run->io.count);
+
+ if (!ret)
+ goto panic_kvm;
+ break;
+ }
+ case KVM_EXIT_MMIO: {
+ bool ret;
+
+ /*
+ * If we had MMIO exit, coalesced ring should be processed
+ * *before* processing the exit itself
+ */
+ kvm_cpu__handle_coalesced_mmio(cpu);
+
+ ret = kvm_cpu__emulate_mmio(cpu->kvm,
+ cpu->kvm_run->mmio.phys_addr,
+ cpu->kvm_run->mmio.data,
+ cpu->kvm_run->mmio.len,
+ cpu->kvm_run->mmio.is_write);
+
+ if (!ret)
+ goto panic_kvm;
+ break;
+ }
+ case KVM_EXIT_INTR:
+ if (cpu->is_running)
+ break;
+ goto exit_kvm;
+ case KVM_EXIT_SHUTDOWN:
+ goto exit_kvm;
+ default: {
+ bool ret;
+
+ ret = kvm_cpu__handle_exit(cpu);
+ if (!ret)
+ goto panic_kvm;
+ break;
+ }
+ }
+ kvm_cpu__handle_coalesced_mmio(cpu);
+ }
+
+exit_kvm:
+ return 0;
+
+panic_kvm:
+ return 1;
+}
+
+int kvm_cpu__init(struct kvm *kvm)
+{
+ int max_cpus, recommended_cpus, i;
+
+ max_cpus = kvm__max_cpus(kvm);
+ recommended_cpus = kvm__recommended_cpus(kvm);
+
+ if (kvm->cfg.nrcpus > max_cpus) {
+ printf(" # Limit the number of CPUs to %d\n", max_cpus);
+ kvm->cfg.nrcpus = max_cpus;
+ } else if (kvm->cfg.nrcpus > recommended_cpus) {
+ printf(" # Warning: The maximum recommended number of VCPUs"
+ " is %d\n", recommended_cpus);
+ }
+
+ kvm->nrcpus = kvm->cfg.nrcpus;
+
+ /* Alloc one pointer too many, so array ends up 0-terminated */
+ kvm->cpus = calloc(kvm->nrcpus + 1, sizeof(void *));
+ if (!kvm->cpus) {
+ pr_warning("Couldn't allocate array for %d CPUs", kvm->nrcpus);
+ return -ENOMEM;
+ }
+
+ for (i = 0; i < kvm->nrcpus; i++) {
+ kvm->cpus[i] = kvm_cpu__arch_init(kvm, i);
+ if (!kvm->cpus[i]) {
+ pr_warning("unable to initialize KVM VCPU");
+ goto fail_alloc;
+ }
+ }
+
+ return 0;
+
+fail_alloc:
+ for (i = 0; i < kvm->nrcpus; i++)
+ free(kvm->cpus[i]);
+ return -ENOMEM;
+}
+base_init(kvm_cpu__init);
+
+int kvm_cpu__exit(struct kvm *kvm)
+{
+ int i, r = 0;
+ void *ret = NULL;
+
+ kvm_cpu__delete(kvm->cpus[0]);
+ kvm->cpus[0] = NULL;
+
+ for (i = 1; i < kvm->nrcpus; i++) {
+ if (kvm->cpus[i]->is_running) {
+ pthread_kill(kvm->cpus[i]->thread, SIGKVMEXIT);
+ if (pthread_join(kvm->cpus[i]->thread, &ret) != 0)
+ die("pthread_join");
+ kvm_cpu__delete(kvm->cpus[i]);
+ }
+ if (ret == NULL)
+ r = 0;
+ }
+
+ free(kvm->cpus);
+
+ kvm->nrcpus = 0;
+
+ return r;
+}
+late_exit(kvm_cpu__exit);
--- /dev/null
+#include <sys/epoll.h>
+#include <sys/un.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/eventfd.h>
+#include <dirent.h>
+
+#include "kvm/kvm-ipc.h"
+#include "kvm/rwsem.h"
+#include "kvm/read-write.h"
+#include "kvm/util.h"
+#include "kvm/kvm.h"
+#include "kvm/builtin-debug.h"
+#include "kvm/strbuf.h"
+#include "kvm/kvm-cpu.h"
+#include "kvm/8250-serial.h"
+
+struct kvm_ipc_head {
+ u32 type;
+ u32 len;
+};
+
+#define KVM_IPC_MAX_MSGS 16
+
+#define KVM_SOCK_SUFFIX ".sock"
+#define KVM_SOCK_SUFFIX_LEN ((ssize_t)sizeof(KVM_SOCK_SUFFIX) - 1)
+
+extern __thread struct kvm_cpu *current_kvm_cpu;
+static void (*msgs[KVM_IPC_MAX_MSGS])(struct kvm *kvm, int fd, u32 type, u32 len, u8 *msg);
+static DECLARE_RWSEM(msgs_rwlock);
+static int epoll_fd, server_fd, stop_fd;
+static pthread_t thread;
+
+static int kvm__create_socket(struct kvm *kvm)
+{
+ char full_name[PATH_MAX];
+ int s;
+ struct sockaddr_un local;
+ int len, r;
+
+ /* This is usually 108 bytes long */
+ BUILD_BUG_ON(sizeof(local.sun_path) < 32);
+
+ snprintf(full_name, sizeof(full_name), "%s/%s%s",
+ kvm__get_dir(), kvm->cfg.guest_name, KVM_SOCK_SUFFIX);
+ if (access(full_name, F_OK) == 0) {
+ pr_err("Socket file %s already exists", full_name);
+ return -EEXIST;
+ }
+
+ s = socket(AF_UNIX, SOCK_STREAM, 0);
+ if (s < 0)
+ return s;
+ local.sun_family = AF_UNIX;
+ strlcpy(local.sun_path, full_name, sizeof(local.sun_path));
+ len = strlen(local.sun_path) + sizeof(local.sun_family);
+ r = bind(s, (struct sockaddr *)&local, len);
+ if (r < 0)
+ goto fail;
+
+ r = listen(s, 5);
+ if (r < 0)
+ goto fail;
+
+ return s;
+
+fail:
+ close(s);
+ return r;
+}
+
+void kvm__remove_socket(const char *name)
+{
+ char full_name[PATH_MAX];
+
+ snprintf(full_name, sizeof(full_name), "%s/%s%s",
+ kvm__get_dir(), name, KVM_SOCK_SUFFIX);
+ unlink(full_name);
+}
+
+int kvm__get_sock_by_instance(const char *name)
+{
+ int s, len, r;
+ char sock_file[PATH_MAX];
+ struct sockaddr_un local;
+
+ snprintf(sock_file, sizeof(sock_file), "%s/%s%s",
+ kvm__get_dir(), name, KVM_SOCK_SUFFIX);
+ s = socket(AF_UNIX, SOCK_STREAM, 0);
+ if (s < 0)
+ return s;
+
+ local.sun_family = AF_UNIX;
+ strlcpy(local.sun_path, sock_file, sizeof(local.sun_path));
+ len = strlen(local.sun_path) + sizeof(local.sun_family);
+
+ r = connect(s, (struct sockaddr *)&local, len);
+ if (r < 0) {
+ /* A refused connection usually means a stale ("ghost") socket file */
+ if (errno == ECONNREFUSED)
+ pr_err("\"%s\" could be a ghost socket file, please remove it",
+ sock_file);
+ close(s);
+ return r;
+ }
+
+ return s;
+}
+
+int kvm__enumerate_instances(int (*callback)(const char *name, int fd))
+{
+ int sock;
+ DIR *dir;
+ struct dirent entry, *result;
+ int ret = 0;
+
+ dir = opendir(kvm__get_dir());
+ if (!dir)
+ return -errno;
+
+ for (;;) {
+ readdir_r(dir, &entry, &result);
+ if (result == NULL)
+ break;
+ if (entry.d_type == DT_SOCK) {
+ ssize_t name_len = strlen(entry.d_name);
+ char *p;
+
+ if (name_len <= KVM_SOCK_SUFFIX_LEN)
+ continue;
+
+ p = &entry.d_name[name_len - KVM_SOCK_SUFFIX_LEN];
+ if (memcmp(KVM_SOCK_SUFFIX, p, KVM_SOCK_SUFFIX_LEN))
+ continue;
+
+ *p = 0;
+ sock = kvm__get_sock_by_instance(entry.d_name);
+ if (sock < 0)
+ continue;
+ ret = callback(entry.d_name, sock);
+ close(sock);
+ if (ret < 0)
+ break;
+ }
+ }
+
+ closedir(dir);
+
+ return ret;
+}
+
+int kvm_ipc__register_handler(u32 type, void (*cb)(struct kvm *kvm, int fd, u32 type, u32 len, u8 *msg))
+{
+ if (type >= KVM_IPC_MAX_MSGS)
+ return -ENOSPC;
+
+ down_write(&msgs_rwlock);
+ msgs[type] = cb;
+ up_write(&msgs_rwlock);
+
+ return 0;
+}
+
+int kvm_ipc__send(int fd, u32 type)
+{
+ struct kvm_ipc_head head = {.type = type, .len = 0,};
+
+ if (write_in_full(fd, &head, sizeof(head)) < 0)
+ return -1;
+
+ return 0;
+}
+
+int kvm_ipc__send_msg(int fd, u32 type, u32 len, u8 *msg)
+{
+ struct kvm_ipc_head head = {.type = type, .len = len,};
+
+ if (write_in_full(fd, &head, sizeof(head)) < 0)
+ return -1;
+
+ if (write_in_full(fd, msg, len) < 0)
+ return -1;
+
+ return 0;
+}
+
+static int kvm_ipc__handle(struct kvm *kvm, int fd, u32 type, u32 len, u8 *data)
+{
+ void (*cb)(struct kvm *kvm, int fd, u32 type, u32 len, u8 *msg);
+
+ if (type >= KVM_IPC_MAX_MSGS)
+ return -ENOSPC;
+
+ down_read(&msgs_rwlock);
+ cb = msgs[type];
+ up_read(&msgs_rwlock);
+
+ if (cb == NULL) {
+ pr_warning("No device handles type %u\n", type);
+ return -ENODEV;
+ }
+
+ cb(kvm, fd, type, len, data);
+
+ return 0;
+}
+
+static int kvm_ipc__new_conn(int fd)
+{
+ int client;
+ struct epoll_event ev;
+
+ client = accept(fd, NULL, NULL);
+ if (client < 0)
+ return -1;
+
+ ev.events = EPOLLIN | EPOLLRDHUP;
+ ev.data.fd = client;
+ if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, client, &ev) < 0) {
+ close(client);
+ return -1;
+ }
+
+ return client;
+}
+
+static void kvm_ipc__close_conn(int fd)
+{
+ epoll_ctl(epoll_fd, EPOLL_CTL_DEL, fd, NULL);
+ close(fd);
+}
+
+static int kvm_ipc__receive(struct kvm *kvm, int fd)
+{
+ struct kvm_ipc_head head;
+ u8 *msg = NULL;
+ u32 n;
+
+ n = read(fd, &head, sizeof(head));
+ if (n != sizeof(head))
+ goto done;
+
+ msg = malloc(head.len);
+ if (msg == NULL)
+ goto done;
+
+ n = read_in_full(fd, msg, head.len);
+ if (n != head.len)
+ goto done;
+
+ kvm_ipc__handle(kvm, fd, head.type, head.len, msg);
+
+ free(msg);
+ return 0;
+
+done:
+ free(msg);
+ return -1;
+}
+
+static void *kvm_ipc__thread(void *param)
+{
+ struct epoll_event event;
+ struct kvm *kvm = param;
+
+ for (;;) {
+ int nfds;
+
+ nfds = epoll_wait(epoll_fd, &event, 1, -1);
+ if (nfds > 0) {
+ int fd = event.data.fd;
+
+ if (fd == stop_fd && event.events & EPOLLIN) {
+ break;
+ } else if (fd == server_fd) {
+ int client, r;
+
+ client = kvm_ipc__new_conn(fd);
+ /*
+ * Handle multiple IPC cmd at a time
+ */
+ do {
+ r = kvm_ipc__receive(kvm, client);
+ } while (r == 0);
+
+ } else if (event.events & (EPOLLERR | EPOLLRDHUP | EPOLLHUP)) {
+ kvm_ipc__close_conn(fd);
+ } else {
+ kvm_ipc__receive(kvm, fd);
+ }
+ }
+ }
+
+ return NULL;
+}
+
+static void kvm__pid(struct kvm *kvm, int fd, u32 type, u32 len, u8 *msg)
+{
+ pid_t pid = getpid();
+ int r = 0;
+
+ if (type == KVM_IPC_PID)
+ r = write(fd, &pid, sizeof(pid));
+
+ if (r < 0)
+ pr_warning("Failed sending PID");
+}
+
+static void handle_stop(struct kvm *kvm, int fd, u32 type, u32 len, u8 *msg)
+{
+ if (WARN_ON(type != KVM_IPC_STOP || len))
+ return;
+
+ kvm_cpu__reboot(kvm);
+}
+
+/* Pause/resume the guest using SIGUSR2 */
+static int is_paused;
+
+static void handle_pause(struct kvm *kvm, int fd, u32 type, u32 len, u8 *msg)
+{
+ if (WARN_ON(len))
+ return;
+
+ if (type == KVM_IPC_RESUME && is_paused) {
+ kvm->vm_state = KVM_VMSTATE_RUNNING;
+ kvm__continue(kvm);
+ } else if (type == KVM_IPC_PAUSE && !is_paused) {
+ kvm->vm_state = KVM_VMSTATE_PAUSED;
+ ioctl(kvm->vm_fd, KVM_KVMCLOCK_CTRL);
+ kvm__pause(kvm);
+ } else {
+ return;
+ }
+
+ is_paused = !is_paused;
+}
+
+static void handle_vmstate(struct kvm *kvm, int fd, u32 type, u32 len, u8 *msg)
+{
+ int r = 0;
+
+ if (type == KVM_IPC_VMSTATE)
+ r = write(fd, &kvm->vm_state, sizeof(kvm->vm_state));
+
+ if (r < 0)
+ pr_warning("Failed sending VMSTATE");
+}
+
+/*
+ * Serialize debug printout so that the output of multiple vcpus does not
+ * get mixed up:
+ */
+static int printout_done;
+
+static void handle_sigusr1(int sig)
+{
+ struct kvm_cpu *cpu = current_kvm_cpu;
+ int fd = kvm_cpu__get_debug_fd();
+
+ if (!cpu || cpu->needs_nmi)
+ return;
+
+ dprintf(fd, "\n #\n # vCPU #%ld's dump:\n #\n", cpu->cpu_id);
+ kvm_cpu__show_registers(cpu);
+ kvm_cpu__show_code(cpu);
+ kvm_cpu__show_page_tables(cpu);
+ fflush(stdout);
+ printout_done = 1;
+}
+
+static void handle_debug(struct kvm *kvm, int fd, u32 type, u32 len, u8 *msg)
+{
+ int i;
+ struct debug_cmd_params *params;
+ u32 dbg_type;
+ u32 vcpu;
+
+ if (WARN_ON(type != KVM_IPC_DEBUG || len != sizeof(*params)))
+ return;
+
+ params = (void *)msg;
+ dbg_type = params->dbg_type;
+ vcpu = params->cpu;
+
+ if (dbg_type & KVM_DEBUG_CMD_TYPE_SYSRQ)
+ serial8250__inject_sysrq(kvm, params->sysrq);
+
+ if (dbg_type & KVM_DEBUG_CMD_TYPE_NMI) {
+ if ((int)vcpu >= kvm->nrcpus)
+ return;
+
+ kvm->cpus[vcpu]->needs_nmi = 1;
+ pthread_kill(kvm->cpus[vcpu]->thread, SIGUSR1);
+ }
+
+ if (!(dbg_type & KVM_DEBUG_CMD_TYPE_DUMP))
+ return;
+
+ for (i = 0; i < kvm->nrcpus; i++) {
+ struct kvm_cpu *cpu = kvm->cpus[i];
+
+ if (!cpu)
+ continue;
+
+ printout_done = 0;
+
+ kvm_cpu__set_debug_fd(fd);
+ pthread_kill(cpu->thread, SIGUSR1);
+ /*
+ * Wait for the vCPU to dump state before signalling
+ * the next thread. Since this is debug code it does
+ * not matter that we are burning CPU time a bit:
+ */
+ while (!printout_done)
+ sleep(0);
+ }
+
+ close(fd);
+
+ serial8250__inject_sysrq(kvm, 'p');
+}
+
+int kvm_ipc__init(struct kvm *kvm)
+{
+ int ret;
+ int sock = kvm__create_socket(kvm);
+ struct epoll_event ev = {0};
+
+ if (sock < 0)
+ return sock;
+
+ server_fd = sock;
+
+ epoll_fd = epoll_create(KVM_IPC_MAX_MSGS);
+ if (epoll_fd < 0) {
+ ret = epoll_fd;
+ goto err;
+ }
+
+ ev.events = EPOLLIN | EPOLLET;
+ ev.data.fd = sock;
+ if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, sock, &ev) < 0) {
+ pr_err("Failed adding socket to epoll");
+ ret = -EFAULT;
+ goto err_epoll;
+ }
+
+ stop_fd = eventfd(0, 0);
+ if (stop_fd < 0) {
+ ret = stop_fd;
+ goto err_epoll;
+ }
+
+ ev.events = EPOLLIN | EPOLLET;
+ ev.data.fd = stop_fd;
+ if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, stop_fd, &ev) < 0) {
+ pr_err("Failed adding stop event to epoll");
+ ret = -EFAULT;
+ goto err_stop;
+ }
+
+ if (pthread_create(&thread, NULL, kvm_ipc__thread, kvm) != 0) {
+ pr_err("Failed starting IPC thread");
+ ret = -EFAULT;
+ goto err_stop;
+ }
+
+ kvm_ipc__register_handler(KVM_IPC_PID, kvm__pid);
+ kvm_ipc__register_handler(KVM_IPC_DEBUG, handle_debug);
+ kvm_ipc__register_handler(KVM_IPC_PAUSE, handle_pause);
+ kvm_ipc__register_handler(KVM_IPC_RESUME, handle_pause);
+ kvm_ipc__register_handler(KVM_IPC_STOP, handle_stop);
+ kvm_ipc__register_handler(KVM_IPC_VMSTATE, handle_vmstate);
+ signal(SIGUSR1, handle_sigusr1);
+
+ return 0;
+
+err_stop:
+ close(stop_fd);
+err_epoll:
+ close(epoll_fd);
+err:
+ return ret;
+}
+base_init(kvm_ipc__init);
+
+int kvm_ipc__exit(struct kvm *kvm)
+{
+ u64 val = 1;
+ int ret;
+
+ ret = write(stop_fd, &val, sizeof(val));
+ if (ret < 0)
+ return ret;
+
+ close(server_fd);
+ close(epoll_fd);
+
+ kvm__remove_socket(kvm->cfg.guest_name);
+
+ return ret;
+}
+base_exit(kvm_ipc__exit);
--- /dev/null
+#include "kvm/kvm.h"
+#include "kvm/read-write.h"
+#include "kvm/util.h"
+#include "kvm/strbuf.h"
+#include "kvm/mutex.h"
+#include "kvm/kvm-cpu.h"
+#include "kvm/kvm-ipc.h"
+
+#include <linux/kvm.h>
+#include <linux/err.h>
+
+#include <sys/un.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <stdbool.h>
+#include <limits.h>
+#include <signal.h>
+#include <stdarg.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <fcntl.h>
+#include <time.h>
+#include <sys/eventfd.h>
+#include <asm/unistd.h>
+#include <dirent.h>
+
+#define DEFINE_KVM_EXIT_REASON(reason) [reason] = #reason
+
+const char *kvm_exit_reasons[] = {
+ DEFINE_KVM_EXIT_REASON(KVM_EXIT_UNKNOWN),
+ DEFINE_KVM_EXIT_REASON(KVM_EXIT_EXCEPTION),
+ DEFINE_KVM_EXIT_REASON(KVM_EXIT_IO),
+ DEFINE_KVM_EXIT_REASON(KVM_EXIT_HYPERCALL),
+ DEFINE_KVM_EXIT_REASON(KVM_EXIT_DEBUG),
+ DEFINE_KVM_EXIT_REASON(KVM_EXIT_HLT),
+ DEFINE_KVM_EXIT_REASON(KVM_EXIT_MMIO),
+ DEFINE_KVM_EXIT_REASON(KVM_EXIT_IRQ_WINDOW_OPEN),
+ DEFINE_KVM_EXIT_REASON(KVM_EXIT_SHUTDOWN),
+ DEFINE_KVM_EXIT_REASON(KVM_EXIT_FAIL_ENTRY),
+ DEFINE_KVM_EXIT_REASON(KVM_EXIT_INTR),
+ DEFINE_KVM_EXIT_REASON(KVM_EXIT_SET_TPR),
+ DEFINE_KVM_EXIT_REASON(KVM_EXIT_TPR_ACCESS),
+ DEFINE_KVM_EXIT_REASON(KVM_EXIT_S390_SIEIC),
+ DEFINE_KVM_EXIT_REASON(KVM_EXIT_S390_RESET),
+ DEFINE_KVM_EXIT_REASON(KVM_EXIT_DCR),
+ DEFINE_KVM_EXIT_REASON(KVM_EXIT_NMI),
+ DEFINE_KVM_EXIT_REASON(KVM_EXIT_INTERNAL_ERROR),
+#ifdef CONFIG_PPC64
+ DEFINE_KVM_EXIT_REASON(KVM_EXIT_PAPR_HCALL),
+#endif
+};
+
+static int pause_event;
+static DEFINE_MUTEX(pause_lock);
+extern struct kvm_ext kvm_req_ext[];
+
+static char kvm_dir[PATH_MAX];
+
+static int set_dir(const char *fmt, va_list args)
+{
+ char tmp[PATH_MAX];
+
+ vsnprintf(tmp, sizeof(tmp), fmt, args);
+
+ mkdir(tmp, 0777);
+
+ if (!realpath(tmp, kvm_dir))
+ return -errno;
+
+ strcat(kvm_dir, "/");
+
+ return 0;
+}
+
+void kvm__set_dir(const char *fmt, ...)
+{
+ va_list args;
+
+ va_start(args, fmt);
+ set_dir(fmt, args);
+ va_end(args);
+}
+
+const char *kvm__get_dir(void)
+{
+ return kvm_dir;
+}
+
+bool kvm__supports_extension(struct kvm *kvm, unsigned int extension)
+{
+ int ret;
+
+ ret = ioctl(kvm->sys_fd, KVM_CHECK_EXTENSION, extension);
+ if (ret < 0)
+ return false;
+
+ return ret;
+}
+
+static int kvm__check_extensions(struct kvm *kvm)
+{
+ int i;
+
+ for (i = 0; ; i++) {
+ if (!kvm_req_ext[i].name)
+ break;
+ if (!kvm__supports_extension(kvm, kvm_req_ext[i].code)) {
+ pr_err("Unsupported KVM extension detected: %s",
+ kvm_req_ext[i].name);
+ return -ENOSYS;
+ }
+ }
+
+ return 0;
+}
+
+struct kvm *kvm__new(void)
+{
+ struct kvm *kvm = calloc(1, sizeof(*kvm));
+ if (!kvm)
+ return ERR_PTR(-ENOMEM);
+
+ kvm->sys_fd = -1;
+ kvm->vm_fd = -1;
+
+ return kvm;
+}
+
+int kvm__exit(struct kvm *kvm)
+{
+ kvm__arch_delete_ram(kvm);
+ free(kvm);
+
+ return 0;
+}
+core_exit(kvm__exit);
+
+/*
+ * Note: KVM_SET_USER_MEMORY_REGION assumes that we don't pass overlapping
+ * memory regions to it. Therefore, be careful if you use this function for
+ * registering memory regions for emulating hardware.
+ */
+int kvm__register_mem(struct kvm *kvm, u64 guest_phys, u64 size, void *userspace_addr)
+{
+ struct kvm_userspace_memory_region mem;
+ int ret;
+
+ mem = (struct kvm_userspace_memory_region) {
+ .slot = kvm->mem_slots++,
+ .guest_phys_addr = guest_phys,
+ .memory_size = size,
+ .userspace_addr = (unsigned long)userspace_addr,
+ };
+
+ ret = ioctl(kvm->vm_fd, KVM_SET_USER_MEMORY_REGION, &mem);
+ if (ret < 0)
+ return -errno;
+
+ return 0;
+}
+
+int kvm__recommended_cpus(struct kvm *kvm)
+{
+ int ret;
+
+ ret = ioctl(kvm->sys_fd, KVM_CHECK_EXTENSION, KVM_CAP_NR_VCPUS);
+ if (ret <= 0)
+ /*
+ * api.txt states that if KVM_CAP_NR_VCPUS does not exist,
+ * assume 4.
+ */
+ return 4;
+
+ return ret;
+}
+
+/*
+ * The following hack should be removed once 'x86: Raise the hard
+ * VCPU count limit' makes its way into the mainline.
+ */
+#ifndef KVM_CAP_MAX_VCPUS
+#define KVM_CAP_MAX_VCPUS 66
+#endif
+
+int kvm__max_cpus(struct kvm *kvm)
+{
+ int ret;
+
+ ret = ioctl(kvm->sys_fd, KVM_CHECK_EXTENSION, KVM_CAP_MAX_VCPUS);
+ if (ret <= 0)
+ ret = kvm__recommended_cpus(kvm);
+
+ return ret;
+}
+
+int kvm__init(struct kvm *kvm)
+{
+ int ret;
+
+ if (!kvm__arch_cpu_supports_vm()) {
+ pr_err("Your CPU does not support hardware virtualization");
+ ret = -ENOSYS;
+ goto err;
+ }
+
+ kvm->sys_fd = open(kvm->cfg.dev, O_RDWR);
+ if (kvm->sys_fd < 0) {
+ if (errno == ENOENT)
+ pr_err("'%s' not found. Please make sure your kernel has CONFIG_KVM "
+ "enabled and that the KVM modules are loaded.", kvm->cfg.dev);
+ else if (errno == ENODEV)
+ pr_err("'%s' KVM driver not available.\n # (If the KVM "
+ "module is loaded then 'dmesg' may offer further clues "
+ "about the failure.)", kvm->cfg.dev);
+ else
+ pr_err("Could not open %s: %s", kvm->cfg.dev, strerror(errno));
+
+ ret = -errno;
+ goto err_free;
+ }
+
+ ret = ioctl(kvm->sys_fd, KVM_GET_API_VERSION, 0);
+ if (ret != KVM_API_VERSION) {
+ pr_err("KVM_API_VERSION ioctl");
+ ret = -errno;
+ goto err_sys_fd;
+ }
+
+ kvm->vm_fd = ioctl(kvm->sys_fd, KVM_CREATE_VM, 0);
+ if (kvm->vm_fd < 0) {
+ ret = kvm->vm_fd;
+ goto err_sys_fd;
+ }
+
+ if (kvm__check_extensions(kvm)) {
+ pr_err("A required KVM extension is not supported by the OS");
+ ret = -ENOSYS;
+ goto err_vm_fd;
+ }
+
+ kvm__arch_init(kvm, kvm->cfg.hugetlbfs_path, kvm->cfg.ram_size);
+
+ kvm__init_ram(kvm);
+
+ if (!kvm->cfg.firmware_filename) {
+ if (!kvm__load_kernel(kvm, kvm->cfg.kernel_filename,
+ kvm->cfg.initrd_filename, kvm->cfg.real_cmdline, kvm->cfg.vidmode))
+ die("unable to load kernel %s", kvm->cfg.kernel_filename);
+ }
+
+ if (kvm->cfg.firmware_filename) {
+ if (!kvm__load_firmware(kvm, kvm->cfg.firmware_filename))
+ die("unable to load firmware image %s: %s", kvm->cfg.firmware_filename, strerror(errno));
+ } else {
+ ret = kvm__arch_setup_firmware(kvm);
+ if (ret < 0)
+ die("kvm__arch_setup_firmware() failed with error %d\n", ret);
+ }
+
+ return 0;
+
+err_vm_fd:
+ close(kvm->vm_fd);
+err_sys_fd:
+ close(kvm->sys_fd);
+err_free:
+ free(kvm);
+err:
+ return ret;
+}
+core_init(kvm__init);
+
+/* RFC 1952 */
+#define GZIP_ID1 0x1f
+#define GZIP_ID2 0x8b
+#define CPIO_MAGIC "0707"
+/* initrd may be gzipped, or a plain cpio */
+static bool initrd_check(int fd)
+{
+ unsigned char id[4];
+
+ if (read_in_full(fd, id, ARRAY_SIZE(id)) < 0)
+ return false;
+
+ if (lseek(fd, 0, SEEK_SET) < 0)
+ die_perror("lseek");
+
+ return (id[0] == GZIP_ID1 && id[1] == GZIP_ID2) ||
+ !memcmp(id, CPIO_MAGIC, 4);
+}
+
+bool kvm__load_kernel(struct kvm *kvm, const char *kernel_filename,
+ const char *initrd_filename, const char *kernel_cmdline, u16 vidmode)
+{
+ bool ret;
+ int fd_kernel = -1, fd_initrd = -1;
+
+ fd_kernel = open(kernel_filename, O_RDONLY);
+ if (fd_kernel < 0)
+ die("Unable to open kernel %s", kernel_filename);
+
+ if (initrd_filename) {
+ fd_initrd = open(initrd_filename, O_RDONLY);
+ if (fd_initrd < 0)
+ die("Unable to open initrd %s", initrd_filename);
+
+ if (!initrd_check(fd_initrd))
+ die("%s is not an initrd", initrd_filename);
+ }
+
+ ret = load_bzimage(kvm, fd_kernel, fd_initrd, kernel_cmdline, vidmode);
+
+ if (ret)
+ goto found_kernel;
+
+ pr_warning("%s is not a bzImage. Trying to load it as a flat binary...", kernel_filename);
+
+ ret = load_flat_binary(kvm, fd_kernel, fd_initrd, kernel_cmdline);
+
+ if (ret)
+ goto found_kernel;
+
+ if (initrd_filename)
+ close(fd_initrd);
+ close(fd_kernel);
+
+ die("%s is not a valid bzImage or flat binary", kernel_filename);
+
+found_kernel:
+ if (initrd_filename)
+ close(fd_initrd);
+ close(fd_kernel);
+
+ return ret;
+}
+
+#define TIMER_INTERVAL_NS 1000000 /* 1 msec */
+
+/*
+ * This function sets up a timer that's used to inject interrupts from the
+ * userspace hypervisor into the guest at periodic intervals. Please note
+ * that the clock interrupt, for example, is not handled here.
+ */
+int kvm_timer__init(struct kvm *kvm)
+{
+ struct itimerspec its;
+ struct sigevent sev;
+ int r;
+
+ memset(&sev, 0, sizeof(struct sigevent));
+ sev.sigev_value.sival_int = 0;
+ sev.sigev_notify = SIGEV_THREAD_ID;
+ sev.sigev_signo = SIGALRM;
+ sev.sigev_value.sival_ptr = kvm;
+ sev._sigev_un._tid = syscall(__NR_gettid);
+
+ r = timer_create(CLOCK_REALTIME, &sev, &kvm->timerid);
+ if (r < 0)
+ return r;
+
+ its.it_value.tv_sec = TIMER_INTERVAL_NS / 1000000000;
+ its.it_value.tv_nsec = TIMER_INTERVAL_NS % 1000000000;
+ its.it_interval.tv_sec = its.it_value.tv_sec;
+ its.it_interval.tv_nsec = its.it_value.tv_nsec;
+
+ r = timer_settime(kvm->timerid, 0, &its, NULL);
+ if (r < 0) {
+ timer_delete(kvm->timerid);
+ return r;
+ }
+
+ return 0;
+}
+firmware_init(kvm_timer__init);
+
+int kvm_timer__exit(struct kvm *kvm)
+{
+ if (kvm->timerid)
+ if (timer_delete(kvm->timerid) < 0)
+ die("timer_delete()");
+
+ kvm->timerid = 0;
+
+ return 0;
+}
+firmware_exit(kvm_timer__exit);
+
+void kvm__dump_mem(struct kvm *kvm, unsigned long addr, unsigned long size)
+{
+ unsigned char *p;
+ unsigned long n;
+
+ size &= ~7; /* round down to a multiple of 8 */
+ if (!size)
+ return;
+
+ p = guest_flat_to_host(kvm, addr);
+
+ for (n = 0; n < size; n += 8) {
+ if (!host_ptr_in_ram(kvm, p + n))
+ break;
+
+ printf(" 0x%08lx: %02x %02x %02x %02x %02x %02x %02x %02x\n",
+ addr + n, p[n + 0], p[n + 1], p[n + 2], p[n + 3],
+ p[n + 4], p[n + 5], p[n + 6], p[n + 7]);
+ }
+}
+
+void kvm__pause(struct kvm *kvm)
+{
+ int i, paused_vcpus = 0;
+
+ /* Check if the guest is running */
+ if (!kvm->cpus[0] || kvm->cpus[0]->thread == 0)
+ return;
+
+ mutex_lock(&pause_lock);
+
+ pause_event = eventfd(0, 0);
+ if (pause_event < 0)
+ die("Failed creating pause notification event");
+ for (i = 0; i < kvm->nrcpus; i++)
+ pthread_kill(kvm->cpus[i]->thread, SIGKVMPAUSE);
+
+ while (paused_vcpus < kvm->nrcpus) {
+ u64 cur_read;
+
+ if (read(pause_event, &cur_read, sizeof(cur_read)) < 0)
+ die("Failed reading pause event");
+ paused_vcpus += cur_read;
+ }
+ close(pause_event);
+}
+
+void kvm__continue(struct kvm *kvm)
+{
+ /* Check if the guest is running */
+ if (!kvm->cpus[0] || kvm->cpus[0]->thread == 0)
+ return;
+
+ mutex_unlock(&pause_lock);
+}
+
+void kvm__notify_paused(void)
+{
+ u64 p = 1;
+
+ if (write(pause_event, &p, sizeof(p)) < 0)
+ die("Failed notifying of paused VCPU.");
+
+ mutex_lock(&pause_lock);
+ mutex_unlock(&pause_lock);
+}
--- /dev/null
+#include "kvm/kvm.h"
+
+#include <stdlib.h>
+#include <stdio.h>
+
+/* user defined header files */
+#include <kvm/kvm-cmd.h>
+
+static int handle_kvm_command(int argc, char **argv)
+{
+ return handle_command(kvm_commands, argc, (const char **) &argv[0]);
+}
+
+int main(int argc, char *argv[])
+{
+ kvm__set_dir("%s/%s", HOME_DIR, KVM_PID_FILE_PATH);
+
+ return handle_kvm_command(argc - 1, &argv[1]);
+}
--- /dev/null
+#include "kvm/kvm.h"
+#include "kvm/rbtree-interval.h"
+#include "kvm/brlock.h"
+
+#include <stdio.h>
+#include <stdlib.h>
+
+#include <sys/ioctl.h>
+#include <linux/kvm.h>
+#include <linux/types.h>
+#include <linux/rbtree.h>
+#include <linux/err.h>
+#include <errno.h>
+
+#define mmio_node(n) rb_entry(n, struct mmio_mapping, node)
+
+struct mmio_mapping {
+ struct rb_int_node node;
+ void (*mmio_fn)(u64 addr, u8 *data, u32 len, u8 is_write, void *ptr);
+ void *ptr;
+};
+
+static struct rb_root mmio_tree = RB_ROOT;
+
+static struct mmio_mapping *mmio_search(struct rb_root *root, u64 addr, u64 len)
+{
+ struct rb_int_node *node;
+
+ node = rb_int_search_range(root, addr, addr + len);
+ if (node == NULL)
+ return NULL;
+
+ return mmio_node(node);
+}
+
+/* Find lowest match, Check for overlap */
+static struct mmio_mapping *mmio_search_single(struct rb_root *root, u64 addr)
+{
+ struct rb_int_node *node;
+
+ node = rb_int_search_single(root, addr);
+ if (node == NULL)
+ return NULL;
+
+ return mmio_node(node);
+}
+
+static int mmio_insert(struct rb_root *root, struct mmio_mapping *data)
+{
+ return rb_int_insert(root, &data->node);
+}
+
+static const char *to_direction(u8 is_write)
+{
+ if (is_write)
+ return "write";
+
+ return "read";
+}
+
+int kvm__register_mmio(struct kvm *kvm, u64 phys_addr, u64 phys_addr_len, bool coalesce,
+ void (*mmio_fn)(u64 addr, u8 *data, u32 len, u8 is_write, void *ptr),
+ void *ptr)
+{
+ struct mmio_mapping *mmio;
+ struct kvm_coalesced_mmio_zone zone;
+ int ret;
+
+ mmio = malloc(sizeof(*mmio));
+ if (mmio == NULL)
+ return -ENOMEM;
+
+ *mmio = (struct mmio_mapping) {
+ .node = RB_INT_INIT(phys_addr, phys_addr + phys_addr_len),
+ .mmio_fn = mmio_fn,
+ .ptr = ptr,
+ };
+
+ if (coalesce) {
+ zone = (struct kvm_coalesced_mmio_zone) {
+ .addr = phys_addr,
+ .size = phys_addr_len,
+ };
+ ret = ioctl(kvm->vm_fd, KVM_REGISTER_COALESCED_MMIO, &zone);
+ if (ret < 0) {
+ free(mmio);
+ return -errno;
+ }
+ }
+ br_write_lock(kvm);
+ ret = mmio_insert(&mmio_tree, mmio);
+ br_write_unlock(kvm);
+
+ return ret;
+}
+
+bool kvm__deregister_mmio(struct kvm *kvm, u64 phys_addr)
+{
+ struct mmio_mapping *mmio;
+ struct kvm_coalesced_mmio_zone zone;
+
+ br_write_lock(kvm);
+ mmio = mmio_search_single(&mmio_tree, phys_addr);
+ if (mmio == NULL) {
+ br_write_unlock(kvm);
+ return false;
+ }
+
+ zone = (struct kvm_coalesced_mmio_zone) {
+ .addr = phys_addr,
+ .size = 1,
+ };
+ ioctl(kvm->vm_fd, KVM_UNREGISTER_COALESCED_MMIO, &zone);
+
+ rb_int_erase(&mmio_tree, &mmio->node);
+ br_write_unlock(kvm);
+
+ free(mmio);
+ return true;
+}
+
+bool kvm__emulate_mmio(struct kvm *kvm, u64 phys_addr, u8 *data, u32 len, u8 is_write)
+{
+ struct mmio_mapping *mmio;
+
+ br_read_lock();
+ mmio = mmio_search(&mmio_tree, phys_addr, len);
+
+ if (mmio)
+ mmio->mmio_fn(phys_addr, data, len, is_write, mmio->ptr);
+ else {
+ if (kvm->cfg.mmio_debug)
+ fprintf(stderr, "Warning: Ignoring MMIO %s at %016llx (length %u)\n",
+ to_direction(is_write), phys_addr, len);
+ }
+ br_read_unlock();
+
+ return true;
+}
--- /dev/null
+#include "kvm/uip.h"
+
+int uip_tx_do_arp(struct uip_tx_arg *arg)
+{
+ struct uip_arp *arp, *arp2;
+ struct uip_info *info;
+ struct uip_buf *buf;
+
+ info = arg->info;
+ buf = uip_buf_clone(arg);
+
+ arp = (struct uip_arp *)(arg->eth);
+ arp2 = (struct uip_arp *)(buf->eth);
+
+ /*
+ * ARP reply opcode: 2
+ */
+ arp2->op = htons(0x2);
+ arp2->dmac = arp->smac;
+ arp2->dip = arp->sip;
+
+ if (arp->dip == htonl(info->host_ip)) {
+ arp2->smac = info->host_mac;
+ arp2->sip = htonl(info->host_ip);
+
+ uip_buf_set_used(info, buf);
+ }
+
+ return 0;
+}
--- /dev/null
+#include "kvm/uip.h"
+
+#include <linux/kernel.h>
+#include <linux/list.h>
+
+struct uip_buf *uip_buf_get_used(struct uip_info *info)
+{
+ struct uip_buf *buf;
+ bool found = false;
+
+ mutex_lock(&info->buf_lock);
+
+ while (!(info->buf_used_nr > 0))
+ pthread_cond_wait(&info->buf_used_cond, &info->buf_lock);
+
+ list_for_each_entry(buf, &info->buf_head, list) {
+ if (buf->status == UIP_BUF_STATUS_USED) {
+ /*
+ * Set status to INUSE immediately to prevent
+ * someone from using this buf until we free it
+ */
+ buf->status = UIP_BUF_STATUS_INUSE;
+ info->buf_used_nr--;
+ found = true;
+ break;
+ }
+ }
+
+ mutex_unlock(&info->buf_lock);
+
+ return found ? buf : NULL;
+}
+
+struct uip_buf *uip_buf_get_free(struct uip_info *info)
+{
+ struct uip_buf *buf;
+ bool found = false;
+
+ mutex_lock(&info->buf_lock);
+
+ while (!(info->buf_free_nr > 0))
+ pthread_cond_wait(&info->buf_free_cond, &info->buf_lock);
+
+ list_for_each_entry(buf, &info->buf_head, list) {
+ if (buf->status == UIP_BUF_STATUS_FREE) {
+ /*
+ * Set status to INUSE immediately to prevent
+ * someone from using this buf until we free it
+ */
+ buf->status = UIP_BUF_STATUS_INUSE;
+ info->buf_free_nr--;
+ found = true;
+ break;
+ }
+ }
+
+ mutex_unlock(&info->buf_lock);
+
+ return found ? buf : NULL;
+}
+
+struct uip_buf *uip_buf_set_used(struct uip_info *info, struct uip_buf *buf)
+{
+ mutex_lock(&info->buf_lock);
+
+ buf->status = UIP_BUF_STATUS_USED;
+ info->buf_used_nr++;
+ pthread_cond_signal(&info->buf_used_cond);
+
+ mutex_unlock(&info->buf_lock);
+
+ return buf;
+}
+
+struct uip_buf *uip_buf_set_free(struct uip_info *info, struct uip_buf *buf)
+{
+ mutex_lock(&info->buf_lock);
+
+ buf->status = UIP_BUF_STATUS_FREE;
+ info->buf_free_nr++;
+ pthread_cond_signal(&info->buf_free_cond);
+
+ mutex_unlock(&info->buf_lock);
+
+ return buf;
+}
+
+struct uip_buf *uip_buf_clone(struct uip_tx_arg *arg)
+{
+ struct uip_buf *buf;
+ struct uip_eth *eth2;
+ struct uip_info *info;
+
+ info = arg->info;
+
+ /*
+ * Get buffer from device to guest
+ */
+ buf = uip_buf_get_free(info);
+
+ /*
+ * Clone buffer
+ */
+ memcpy(buf->vnet, arg->vnet, arg->vnet_len);
+ memcpy(buf->eth, arg->eth, arg->eth_len);
+ buf->vnet_len = arg->vnet_len;
+ buf->eth_len = arg->eth_len;
+
+ eth2 = (struct uip_eth *)buf->eth;
+ eth2->src = info->host_mac;
+ eth2->dst = arg->eth->src;
+
+ return buf;
+}
--- /dev/null
+#include "kvm/mutex.h"
+#include "kvm/uip.h"
+
+#include <linux/virtio_net.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+
+int uip_tx(struct iovec *iov, u16 out, struct uip_info *info)
+{
+ struct virtio_net_hdr *vnet;
+ struct uip_tx_arg arg;
+ int eth_len, vnet_len;
+ struct uip_eth *eth;
+ u8 *buf = NULL;
+ u16 proto;
+ int i;
+
+ /*
+ * Buffer from guest to device
+ */
+ vnet_len = iov[0].iov_len;
+ vnet = iov[0].iov_base;
+
+ eth_len = iov[1].iov_len;
+ eth = iov[1].iov_base;
+
+ /*
+ * If the ethernet frame spans more than one iov entry,
+ * copy the iov buffers into one linear buffer.
+ */
+ if (out > 2) {
+ eth_len = 0;
+ for (i = 1; i < out; i++)
+ eth_len += iov[i].iov_len;
+
+ buf = malloc(eth_len);
+ if (!buf)
+ return -1;
+
+ eth = (struct uip_eth *)buf;
+ for (i = 1; i < out; i++) {
+ memcpy(buf, iov[i].iov_base, iov[i].iov_len);
+ buf += iov[i].iov_len;
+ }
+ }
+
+ memset(&arg, 0, sizeof(arg));
+
+ arg.vnet_len = vnet_len;
+ arg.eth_len = eth_len;
+ arg.info = info;
+ arg.vnet = vnet;
+ arg.eth = eth;
+
+ /*
+ * Check packet type
+ */
+ proto = ntohs(eth->type);
+
+ switch (proto) {
+ case UIP_ETH_P_ARP:
+ uip_tx_do_arp(&arg);
+ break;
+ case UIP_ETH_P_IP:
+ uip_tx_do_ipv4(&arg);
+ break;
+ default:
+ break;
+ }
+
+ if (out > 2 && buf)
+ free(eth);
+
+ return vnet_len + eth_len;
+}
+
+int uip_rx(struct iovec *iov, u16 in, struct uip_info *info)
+{
+ struct virtio_net_hdr *vnet;
+ struct uip_eth *eth;
+ struct uip_buf *buf;
+ int vnet_len;
+ int eth_len;
+ char *p;
+ int len;
+ int cnt;
+ int i;
+
+ /*
+ * Sleep until there is a buffer for guest
+ */
+ buf = uip_buf_get_used(info);
+
+ /*
+ * Fill device to guest buffer, vnet hdr first
+ */
+ vnet_len = iov[0].iov_len;
+ vnet = iov[0].iov_base;
+ if (buf->vnet_len > vnet_len) {
+ len = -1;
+ goto out;
+ }
+ memcpy(vnet, buf->vnet, buf->vnet_len);
+
+ /*
+ * Then, the real eth data
+ * Note: Be sure buf->eth_len is not bigger than the buffer len that guest provides
+ */
+ cnt = buf->eth_len;
+ p = buf->eth;
+ for (i = 1; i < in; i++) {
+ eth_len = iov[i].iov_len;
+ eth = iov[i].iov_base;
+ if (cnt > eth_len) {
+ memcpy(eth, p, eth_len);
+ cnt -= eth_len;
+ p += eth_len;
+ } else {
+ memcpy(eth, p, cnt);
+ cnt -= cnt;
+ break;
+ }
+ }
+
+ if (cnt) {
+ pr_warning("uip_rx error");
+ len = -1;
+ goto out;
+ }
+
+ len = buf->vnet_len + buf->eth_len;
+
+out:
+ uip_buf_set_free(info, buf);
+ return len;
+}
+
+int uip_init(struct uip_info *info)
+{
+ struct list_head *udp_socket_head;
+ struct list_head *tcp_socket_head;
+ struct list_head *buf_head;
+ struct uip_buf *buf;
+ int buf_nr;
+ int i;
+
+ udp_socket_head = &info->udp_socket_head;
+ tcp_socket_head = &info->tcp_socket_head;
+ buf_head = &info->buf_head;
+ buf_nr = info->buf_nr;
+
+ INIT_LIST_HEAD(udp_socket_head);
+ INIT_LIST_HEAD(tcp_socket_head);
+ INIT_LIST_HEAD(buf_head);
+
+ pthread_mutex_init(&info->udp_socket_lock, NULL);
+ pthread_mutex_init(&info->tcp_socket_lock, NULL);
+ pthread_mutex_init(&info->buf_lock, NULL);
+
+ pthread_cond_init(&info->buf_used_cond, NULL);
+ pthread_cond_init(&info->buf_free_cond, NULL);
+
+
+ for (i = 0; i < buf_nr; i++) {
+ buf = malloc(sizeof(*buf));
+ memset(buf, 0, sizeof(*buf));
+
+ buf->status = UIP_BUF_STATUS_FREE;
+ buf->info = info;
+ buf->id = i;
+ list_add_tail(&buf->list, buf_head);
+ }
+
+ list_for_each_entry(buf, buf_head, list) {
+ buf->vnet = malloc(sizeof(struct virtio_net_hdr));
+ buf->vnet_len = sizeof(struct virtio_net_hdr);
+ buf->eth = malloc(1024*64 + sizeof(struct uip_pseudo_hdr));
+ buf->eth_len = 1024*64 + sizeof(struct uip_pseudo_hdr);
+
+ memset(buf->vnet, 0, buf->vnet_len);
+ memset(buf->eth, 0, buf->eth_len);
+ }
+
+ info->buf_free_nr = buf_nr;
+ info->buf_used_nr = 0;
+
+ uip_dhcp_get_dns(info);
+
+ return 0;
+}
--- /dev/null
+#include "kvm/uip.h"
+
+static u16 uip_csum(u16 csum, u8 *addr, u16 count)
+{
+ long sum = csum;
+
+ while (count > 1) {
+ sum += *(u16 *)addr;
+ addr += 2;
+ count -= 2;
+ }
+
+ if (count > 0)
+ sum += *(unsigned char *)addr;
+
+ while (sum>>16)
+ sum = (sum & 0xffff) + (sum >> 16);
+
+ return ~sum;
+}
+
+u16 uip_csum_ip(struct uip_ip *ip)
+{
+ return uip_csum(0, &ip->vhl, uip_ip_hdrlen(ip));
+}
+
+u16 uip_csum_icmp(struct uip_icmp *icmp)
+{
+ struct uip_ip *ip;
+
+ ip = &icmp->ip;
+ return icmp->csum = uip_csum(0, &icmp->type, ntohs(ip->len) - uip_ip_hdrlen(ip) - 8); /* icmp header len = 8 */
+}
+
+u16 uip_csum_udp(struct uip_udp *udp)
+{
+ struct uip_pseudo_hdr hdr;
+ struct uip_ip *ip;
+ int udp_len;
+ u8 *pad;
+
+ ip = &udp->ip;
+
+ hdr.sip = ip->sip;
+ hdr.dip = ip->dip;
+ hdr.zero = 0;
+ hdr.proto = ip->proto;
+ hdr.len = udp->len;
+
+ udp_len = uip_udp_len(udp);
+
+ if (udp_len % 2) {
+ pad = (u8 *)&udp->sport + udp_len;
+ *pad = 0;
+ memcpy((u8 *)&udp->sport + udp_len + 1, &hdr, sizeof(hdr));
+ return uip_csum(0, (u8 *)&udp->sport, udp_len + 1 + sizeof(hdr));
+ } else {
+ memcpy((u8 *)&udp->sport + udp_len, &hdr, sizeof(hdr));
+ return uip_csum(0, (u8 *)&udp->sport, udp_len + sizeof(hdr));
+ }
+
+}
+
+u16 uip_csum_tcp(struct uip_tcp *tcp)
+{
+ struct uip_pseudo_hdr hdr;
+ struct uip_ip *ip;
+ u16 tcp_len;
+ u8 *pad;
+
+ ip = &tcp->ip;
+ tcp_len = ntohs(ip->len) - uip_ip_hdrlen(ip);
+
+ hdr.sip = ip->sip;
+ hdr.dip = ip->dip;
+ hdr.zero = 0;
+ hdr.proto = ip->proto;
+ hdr.len = htons(tcp_len);
+
+ if (tcp_len > UIP_MAX_TCP_PAYLOAD + 20)
+ pr_warning("tcp_len(%d) is too large", tcp_len);
+
+ if (tcp_len % 2) {
+ pad = (u8 *)&tcp->sport + tcp_len;
+ *pad = 0;
+ memcpy((u8 *)&tcp->sport + tcp_len + 1, &hdr, sizeof(hdr));
+ return uip_csum(0, (u8 *)&tcp->sport, tcp_len + 1 + sizeof(hdr));
+ } else {
+ memcpy((u8 *)&tcp->sport + tcp_len, &hdr, sizeof(hdr));
+ return uip_csum(0, (u8 *)&tcp->sport, tcp_len + sizeof(hdr));
+ }
+}
--- /dev/null
+#include "kvm/uip.h"
+
+#include <arpa/inet.h>
+
+#define EMPTY_ADDR "0.0.0.0"
+
+static inline bool uip_dhcp_is_discovery(struct uip_dhcp *dhcp)
+{
+ return (dhcp->option[2] == UIP_DHCP_DISCOVER &&
+ dhcp->option[1] == UIP_DHCP_TAG_MSG_TYPE_LEN &&
+ dhcp->option[0] == UIP_DHCP_TAG_MSG_TYPE);
+}
+
+static inline bool uip_dhcp_is_request(struct uip_dhcp *dhcp)
+{
+ return (dhcp->option[2] == UIP_DHCP_REQUEST &&
+ dhcp->option[1] == UIP_DHCP_TAG_MSG_TYPE_LEN &&
+ dhcp->option[0] == UIP_DHCP_TAG_MSG_TYPE);
+}
+
+bool uip_udp_is_dhcp(struct uip_udp *udp)
+{
+ struct uip_dhcp *dhcp;
+
+ if (ntohs(udp->sport) != UIP_DHCP_PORT_CLIENT ||
+ ntohs(udp->dport) != UIP_DHCP_PORT_SERVER)
+ return false;
+
+ dhcp = (struct uip_dhcp *)udp;
+
+ if (ntohl(dhcp->magic_cookie) != UIP_DHCP_MAGIC_COOKIE)
+ return false;
+
+ return true;
+}
+
+int uip_dhcp_get_dns(struct uip_info *info)
+{
+ char key[256], val[256];
+ struct in_addr addr;
+ int ret = -1;
+ int n = 0;
+ FILE *fp;
+ u32 ip;
+
+ fp = fopen("/etc/resolv.conf", "r");
+ if (!fp)
+ return ret;
+
+ while (!feof(fp)) {
+ if (fscanf(fp, "%s %s\n", key, val) != 2)
+ continue;
+ if (strncmp("domain", key, 6) == 0)
+ info->domain_name = strndup(val, UIP_DHCP_MAX_DOMAIN_NAME_LEN);
+ else if (strncmp("nameserver", key, 10) == 0) {
+ if (!inet_aton(val, &addr))
+ continue;
+ ip = ntohl(addr.s_addr);
+ if (n < UIP_DHCP_MAX_DNS_SERVER_NR)
+ info->dns_ip[n++] = ip;
+ ret = 0;
+ }
+ }
+
+ fclose(fp);
+ return ret;
+}
+
+static int uip_dhcp_fill_option_name_and_server(struct uip_info *info, u8 *opt, int i)
+{
+ u8 domain_name_len;
+ u32 *addr;
+ int n;
+
+ if (info->domain_name) {
+ domain_name_len = strlen(info->domain_name);
+ opt[i++] = UIP_DHCP_TAG_DOMAIN_NAME;
+ opt[i++] = domain_name_len;
+ memcpy(&opt[i], info->domain_name, domain_name_len);
+ i += domain_name_len;
+ }
+
+ for (n = 0; n < UIP_DHCP_MAX_DNS_SERVER_NR; n++) {
+ if (info->dns_ip[n] == 0)
+ continue;
+ opt[i++] = UIP_DHCP_TAG_DNS_SERVER;
+ opt[i++] = UIP_DHCP_TAG_DNS_SERVER_LEN;
+ addr = (u32 *)&opt[i];
+ *addr = htonl(info->dns_ip[n]);
+ i += UIP_DHCP_TAG_DNS_SERVER_LEN;
+ }
+
+ return i;
+}
+static int uip_dhcp_fill_option(struct uip_info *info, struct uip_dhcp *dhcp, int reply_msg_type)
+{
+ int i = 0;
+ u32 *addr;
+ u8 *opt;
+
+ opt = dhcp->option;
+
+ opt[i++] = UIP_DHCP_TAG_MSG_TYPE;
+ opt[i++] = UIP_DHCP_TAG_MSG_TYPE_LEN;
+ opt[i++] = reply_msg_type;
+
+ opt[i++] = UIP_DHCP_TAG_SERVER_ID;
+ opt[i++] = UIP_DHCP_TAG_SERVER_ID_LEN;
+ addr = (u32 *)&opt[i];
+ *addr = htonl(info->host_ip);
+ i += UIP_DHCP_TAG_SERVER_ID_LEN;
+
+ opt[i++] = UIP_DHCP_TAG_LEASE_TIME;
+ opt[i++] = UIP_DHCP_TAG_LEASE_TIME_LEN;
+ addr = (u32 *)&opt[i];
+ *addr = htonl(UIP_DHCP_LEASE_TIME);
+ i += UIP_DHCP_TAG_LEASE_TIME_LEN;
+
+ opt[i++] = UIP_DHCP_TAG_SUBMASK;
+ opt[i++] = UIP_DHCP_TAG_SUBMASK_LEN;
+ addr = (u32 *)&opt[i];
+ *addr = htonl(info->guest_netmask);
+ i += UIP_DHCP_TAG_SUBMASK_LEN;
+
+ opt[i++] = UIP_DHCP_TAG_ROUTER;
+ opt[i++] = UIP_DHCP_TAG_ROUTER_LEN;
+ addr = (u32 *)&opt[i];
+ *addr = htonl(info->host_ip);
+ i += UIP_DHCP_TAG_ROUTER_LEN;
+
+ opt[i++] = UIP_DHCP_TAG_ROOT;
+ opt[i++] = strlen(EMPTY_ADDR);
+ addr = (u32 *)&opt[i];
+ strncpy((void *) addr, EMPTY_ADDR, strlen(EMPTY_ADDR));
+ i += strlen(EMPTY_ADDR);
+
+ i = uip_dhcp_fill_option_name_and_server(info, opt, i);
+
+ opt[i++] = UIP_DHCP_TAG_END;
+
+ return 0;
+}
+
+static int uip_dhcp_make_pkg(struct uip_info *info, struct uip_udp_socket *sk, struct uip_buf *buf, u8 reply_msg_type)
+{
+ struct uip_dhcp *dhcp;
+
+ dhcp = (struct uip_dhcp *)buf->eth;
+
+ dhcp->msg_type = 2;
+ dhcp->client_ip = 0;
+ dhcp->your_ip = htonl(info->guest_ip);
+ dhcp->server_ip = htonl(info->host_ip);
+ dhcp->agent_ip = 0;
+
+ uip_dhcp_fill_option(info, dhcp, reply_msg_type);
+
+ sk->sip = htonl(info->guest_ip);
+ sk->dip = htonl(info->host_ip);
+ sk->sport = htons(UIP_DHCP_PORT_CLIENT);
+ sk->dport = htons(UIP_DHCP_PORT_SERVER);
+
+ return 0;
+}
+
+int uip_tx_do_ipv4_udp_dhcp(struct uip_tx_arg *arg)
+{
+ struct uip_udp_socket sk;
+ struct uip_dhcp *dhcp;
+ struct uip_info *info;
+ struct uip_buf *buf;
+ u8 reply_msg_type;
+
+ dhcp = (struct uip_dhcp *)arg->eth;
+
+ if (uip_dhcp_is_discovery(dhcp))
+ reply_msg_type = UIP_DHCP_OFFER;
+ else if (uip_dhcp_is_request(dhcp))
+ reply_msg_type = UIP_DHCP_ACK;
+ else
+ return -1;
+
+ buf = uip_buf_clone(arg);
+ info = arg->info;
+
+ /*
+ * Cook DHCP pkg
+ */
+ uip_dhcp_make_pkg(info, &sk, buf, reply_msg_type);
+
+ /*
+ * Cook UDP pkg
+ */
+ uip_udp_make_pkg(info, &sk, buf, NULL, UIP_DHCP_MAX_PAYLOAD_LEN);
+
+ /*
+ * Send data received from socket to guest
+ */
+ uip_buf_set_used(info, buf);
+
+ return 0;
+}
--- /dev/null
+#include "kvm/uip.h"
+
+int uip_tx_do_ipv4_icmp(struct uip_tx_arg *arg)
+{
+ struct uip_ip *ip, *ip2;
+ struct uip_icmp *icmp2;
+ struct uip_buf *buf;
+
+ buf = uip_buf_clone(arg);
+
+ icmp2 = (struct uip_icmp *)(buf->eth);
+ ip2 = (struct uip_ip *)(buf->eth);
+ ip = (struct uip_ip *)(arg->eth);
+
+ ip2->sip = ip->dip;
+ ip2->dip = ip->sip;
+ ip2->csum = 0;
+ /*
+ * ICMP reply: 0
+ */
+ icmp2->type = 0;
+ icmp2->csum = 0;
+ ip2->csum = uip_csum_ip(ip2);
+ icmp2->csum = uip_csum_icmp(icmp2);
+
+ uip_buf_set_used(arg->info, buf);
+
+ return 0;
+}
--- /dev/null
+#include "kvm/uip.h"
+
+int uip_tx_do_ipv4(struct uip_tx_arg *arg)
+{
+ struct uip_ip *ip;
+
+ ip = (struct uip_ip *)(arg->eth);
+
+ if (uip_ip_hdrlen(ip) != 20) {
+ pr_warning("IP header length is not 20 bytes");
+ return -1;
+ }
+
+ switch (ip->proto) {
+ case UIP_IP_P_ICMP:
+ uip_tx_do_ipv4_icmp(arg);
+ break;
+ case UIP_IP_P_TCP:
+ uip_tx_do_ipv4_tcp(arg);
+ break;
+ case UIP_IP_P_UDP:
+ uip_tx_do_ipv4_udp(arg);
+ break;
+ default:
+ break;
+ }
+
+ return 0;
+}
--- /dev/null
+#include "kvm/uip.h"
+
+#include <linux/virtio_net.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <arpa/inet.h>
+
+static int uip_tcp_socket_close(struct uip_tcp_socket *sk, int how)
+{
+ shutdown(sk->fd, how);
+
+ if (sk->write_done && sk->read_done) {
+ shutdown(sk->fd, SHUT_RDWR);
+ close(sk->fd);
+
+ mutex_lock(sk->lock);
+ list_del(&sk->list);
+ mutex_unlock(sk->lock);
+
+ free(sk);
+ }
+
+ return 0;
+}
+
+static struct uip_tcp_socket *uip_tcp_socket_find(struct uip_tx_arg *arg, u32 sip, u32 dip, u16 sport, u16 dport)
+{
+ struct list_head *sk_head;
+ pthread_mutex_t *sk_lock;
+ struct uip_tcp_socket *sk;
+
+ sk_head = &arg->info->tcp_socket_head;
+ sk_lock = &arg->info->tcp_socket_lock;
+
+ mutex_lock(sk_lock);
+ list_for_each_entry(sk, sk_head, list) {
+ if (sk->sip == sip && sk->dip == dip && sk->sport == sport && sk->dport == dport) {
+ mutex_unlock(sk_lock);
+ return sk;
+ }
+ }
+ mutex_unlock(sk_lock);
+
+ return NULL;
+}
+
+static struct uip_tcp_socket *uip_tcp_socket_alloc(struct uip_tx_arg *arg, u32 sip, u32 dip, u16 sport, u16 dport)
+{
+ struct list_head *sk_head;
+ struct uip_tcp_socket *sk;
+ pthread_mutex_t *sk_lock;
+ struct uip_tcp *tcp;
+ struct uip_ip *ip;
+ int ret;
+
+ tcp = (struct uip_tcp *)arg->eth;
+ ip = (struct uip_ip *)arg->eth;
+
+ sk_head = &arg->info->tcp_socket_head;
+ sk_lock = &arg->info->tcp_socket_lock;
+
+ sk = malloc(sizeof(*sk));
+ memset(sk, 0, sizeof(*sk));
+
+ sk->lock = sk_lock;
+ sk->info = arg->info;
+
+ sk->fd = socket(AF_INET, SOCK_STREAM, 0);
+ sk->addr.sin_family = AF_INET;
+ sk->addr.sin_port = dport;
+ sk->addr.sin_addr.s_addr = dip;
+
+ pthread_cond_init(&sk->cond, NULL);
+
+ if (ntohl(dip) == arg->info->host_ip)
+ sk->addr.sin_addr.s_addr = inet_addr("127.0.0.1");
+
+ ret = connect(sk->fd, (struct sockaddr *)&sk->addr, sizeof(sk->addr));
+ if (ret) {
+ free(sk);
+ return NULL;
+ }
+
+ sk->sip = ip->sip;
+ sk->dip = ip->dip;
+ sk->sport = tcp->sport;
+ sk->dport = tcp->dport;
+
+ mutex_lock(sk_lock);
+ list_add_tail(&sk->list, sk_head);
+ mutex_unlock(sk_lock);
+
+ return sk;
+}
+
+static int uip_tcp_payload_send(struct uip_tcp_socket *sk, u8 flag, u16 payload_len)
+{
+ struct uip_info *info;
+ struct uip_eth *eth2;
+ struct uip_tcp *tcp2;
+ struct uip_buf *buf;
+ struct uip_ip *ip2;
+
+ info = sk->info;
+
+ /*
+ * Get free buffer to send data to guest
+ */
+ buf = uip_buf_get_free(info);
+
+ /*
+ * Cook an ethernet frame
+ */
+ tcp2 = (struct uip_tcp *)buf->eth;
+ eth2 = (struct uip_eth *)buf->eth;
+ ip2 = (struct uip_ip *)buf->eth;
+
+ eth2->src = info->host_mac;
+ eth2->dst = info->guest_mac;
+ eth2->type = htons(UIP_ETH_P_IP);
+
+ ip2->vhl = UIP_IP_VER_4 | UIP_IP_HDR_LEN;
+ ip2->tos = 0;
+ ip2->id = 0;
+ ip2->flgfrag = 0;
+ ip2->ttl = UIP_IP_TTL;
+ ip2->proto = UIP_IP_P_TCP;
+ ip2->csum = 0;
+ ip2->sip = sk->dip;
+ ip2->dip = sk->sip;
+
+ tcp2->sport = sk->dport;
+ tcp2->dport = sk->sport;
+ tcp2->seq = htonl(sk->seq_server);
+ tcp2->ack = htonl(sk->ack_server);
+ /*
+ * Disable TCP options, tcp hdr len equals 20 bytes
+ */
+ tcp2->off = UIP_TCP_HDR_LEN;
+ tcp2->flg = flag;
+ tcp2->win = htons(UIP_TCP_WIN_SIZE);
+ tcp2->csum = 0;
+ tcp2->urgent = 0;
+
+ if (payload_len > 0)
+ memcpy(uip_tcp_payload(tcp2), sk->payload, payload_len);
+
+ ip2->len = htons(uip_tcp_hdrlen(tcp2) + payload_len + uip_ip_hdrlen(ip2));
+ ip2->csum = uip_csum_ip(ip2);
+ tcp2->csum = uip_csum_tcp(tcp2);
+
+ /*
+ * virtio_net_hdr
+ */
+ buf->vnet_len = sizeof(struct virtio_net_hdr);
+ memset(buf->vnet, 0, buf->vnet_len);
+
+ buf->eth_len = ntohs(ip2->len) + uip_eth_hdrlen(&ip2->eth);
+
+ /*
+ * Increase server seq
+ */
+ sk->seq_server += payload_len;
+
+ /*
+ * Send data received from socket to guest
+ */
+ uip_buf_set_used(info, buf);
+
+ return 0;
+}
+
+static void *uip_tcp_socket_thread(void *p)
+{
+ struct uip_tcp_socket *sk;
+ int len, left, ret;
+ u8 *payload, *pos;
+
+ sk = p;
+
+ payload = malloc(UIP_MAX_TCP_PAYLOAD);
+ if (!payload)
+ goto out;
+
+ while (1) {
+ pos = payload;
+
+ ret = read(sk->fd, payload, UIP_MAX_TCP_PAYLOAD);
+
+ if (ret <= 0 || ret > UIP_MAX_TCP_PAYLOAD)
+ goto out;
+
+ left = ret;
+
+ while (left > 0) {
+ mutex_lock(sk->lock);
+ while ((len = sk->guest_acked + sk->window_size - sk->seq_server) <= 0)
+ pthread_cond_wait(&sk->cond, sk->lock);
+ mutex_unlock(sk->lock);
+
+ sk->payload = pos;
+ if (len > left)
+ len = left;
+ if (len > UIP_MAX_TCP_PAYLOAD)
+ len = UIP_MAX_TCP_PAYLOAD;
+ left -= len;
+ pos += len;
+
+ uip_tcp_payload_send(sk, UIP_TCP_FLAG_ACK, len);
+ }
+ }
+
+out:
+ /*
+ * Close server to guest TCP connection
+ */
+ uip_tcp_socket_close(sk, SHUT_RD);
+
+ uip_tcp_payload_send(sk, UIP_TCP_FLAG_FIN | UIP_TCP_FLAG_ACK, 0);
+ sk->seq_server += 1;
+
+ sk->read_done = 1;
+
+ free(payload);
+ pthread_exit(NULL);
+
+ return NULL;
+}
+
+static int uip_tcp_socket_receive(struct uip_tcp_socket *sk)
+{
+ if (sk->thread == 0)
+ return pthread_create(&sk->thread, NULL, uip_tcp_socket_thread, (void *)sk);
+
+ return 0;
+}
+
+static int uip_tcp_socket_send(struct uip_tcp_socket *sk, struct uip_tcp *tcp)
+{
+ int len;
+ int ret;
+ u8 *payload;
+
+ if (sk->write_done)
+ return 0;
+
+ payload = uip_tcp_payload(tcp);
+ len = uip_tcp_payloadlen(tcp);
+
+ ret = write(sk->fd, payload, len);
+ if (ret != len)
+ pr_warning("tcp send error");
+
+ return ret;
+}
+
+int uip_tx_do_ipv4_tcp(struct uip_tx_arg *arg)
+{
+ struct uip_tcp_socket *sk;
+ struct uip_tcp *tcp;
+ struct uip_ip *ip;
+ int ret;
+
+ tcp = (struct uip_tcp *)arg->eth;
+ ip = (struct uip_ip *)arg->eth;
+
+ /*
+ * Guest is trying to start a TCP session, let's fake SYN-ACK to guest
+ */
+ if (uip_tcp_is_syn(tcp)) {
+ sk = uip_tcp_socket_alloc(arg, ip->sip, ip->dip, tcp->sport, tcp->dport);
+ if (!sk)
+ return -1;
+
+ sk->window_size = ntohs(tcp->win);
+
+ /*
+ * Setup ISN number
+ */
+ sk->isn_guest = uip_tcp_isn(tcp);
+ sk->isn_server = uip_tcp_isn_alloc();
+
+ sk->seq_server = sk->isn_server;
+ sk->ack_server = sk->isn_guest + 1;
+ uip_tcp_payload_send(sk, UIP_TCP_FLAG_SYN | UIP_TCP_FLAG_ACK, 0);
+ sk->seq_server += 1;
+
+ /*
+ * Start receive thread for data from remote to guest
+ */
+ uip_tcp_socket_receive(sk);
+
+ goto out;
+ }
+
+ /*
+ * Find socket we have allocated
+ */
+ sk = uip_tcp_socket_find(arg, ip->sip, ip->dip, tcp->sport, tcp->dport);
+ if (!sk)
+ return -1;
+
+ mutex_lock(sk->lock);
+ sk->window_size = ntohs(tcp->win);
+ sk->guest_acked = ntohl(tcp->ack);
+ pthread_cond_signal(&sk->cond);
+ mutex_unlock(sk->lock);
+
+ if (uip_tcp_is_fin(tcp)) {
+ if (sk->write_done)
+ goto out;
+
+ sk->write_done = 1;
+ sk->ack_server += 1;
+ uip_tcp_payload_send(sk, UIP_TCP_FLAG_ACK, 0);
+
+ /*
+ * Close guest to server TCP connection
+ */
+ uip_tcp_socket_close(sk, SHUT_WR);
+
+ goto out;
+ }
+
+ /*
+ * Ignore guest to server frames with zero tcp payload
+ */
+ if (uip_tcp_payloadlen(tcp) == 0)
+ goto out;
+
+ /*
+ * Send out TCP data to remote host
+ */
+ ret = uip_tcp_socket_send(sk, tcp);
+ if (ret < 0)
+ return -1;
+ /*
+ * Send ACK to guest immediately
+ */
+ sk->ack_server += ret;
+ uip_tcp_payload_send(sk, UIP_TCP_FLAG_ACK, 0);
+
+out:
+ return 0;
+}
--- /dev/null
+#include "kvm/uip.h"
+
+#include <linux/virtio_net.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <sys/socket.h>
+#include <sys/epoll.h>
+#include <fcntl.h>
+
+#define UIP_UDP_MAX_EVENTS 1000
+
+static struct uip_udp_socket *uip_udp_socket_find(struct uip_tx_arg *arg, u32 sip, u32 dip, u16 sport, u16 dport)
+{
+ struct list_head *sk_head;
+ struct uip_udp_socket *sk;
+ pthread_mutex_t *sk_lock;
+ struct epoll_event ev;
+ int flags;
+ int ret;
+
+ sk_head = &arg->info->udp_socket_head;
+ sk_lock = &arg->info->udp_socket_lock;
+
+ /*
+ * Find existing sk
+ */
+ mutex_lock(sk_lock);
+ list_for_each_entry(sk, sk_head, list) {
+ if (sk->sip == sip && sk->dip == dip && sk->sport == sport && sk->dport == dport) {
+ mutex_unlock(sk_lock);
+ return sk;
+ }
+ }
+ mutex_unlock(sk_lock);
+
+ /*
+ * Allocate new one
+ */
+ sk = malloc(sizeof(*sk));
+ memset(sk, 0, sizeof(*sk));
+
+ sk->lock = sk_lock;
+
+ sk->fd = socket(AF_INET, SOCK_DGRAM, 0);
+ if (sk->fd < 0)
+ goto out;
+
+ /*
+ * Set non-blocking
+ */
+ flags = fcntl(sk->fd, F_GETFL, 0);
+ flags |= O_NONBLOCK;
+ fcntl(sk->fd, F_SETFL, flags);
+
+ /*
+ * Add sk->fd to epoll_wait
+ */
+ ev.events = EPOLLIN;
+ ev.data.fd = sk->fd;
+ ev.data.ptr = sk;
+ if (arg->info->udp_epollfd <= 0)
+ arg->info->udp_epollfd = epoll_create(UIP_UDP_MAX_EVENTS);
+ ret = epoll_ctl(arg->info->udp_epollfd, EPOLL_CTL_ADD, sk->fd, &ev);
+ if (ret == -1)
+ pr_warning("epoll_ctl error");
+
+ sk->addr.sin_family = AF_INET;
+ sk->addr.sin_addr.s_addr = dip;
+ sk->addr.sin_port = dport;
+
+ sk->sip = sip;
+ sk->dip = dip;
+ sk->sport = sport;
+ sk->dport = dport;
+
+ mutex_lock(sk_lock);
+ list_add_tail(&sk->list, sk_head);
+ mutex_unlock(sk_lock);
+
+ return sk;
+
+out:
+ free(sk);
+ return NULL;
+}
+
+static int uip_udp_socket_send(struct uip_udp_socket *sk, struct uip_udp *udp)
+{
+ int len;
+ int ret;
+
+ len = ntohs(udp->len) - uip_udp_hdrlen(udp);
+
+ ret = sendto(sk->fd, udp->payload, len, 0, (struct sockaddr *)&sk->addr, sizeof(sk->addr));
+ if (ret != len)
+ return -1;
+
+ return 0;
+}
+
+int uip_udp_make_pkg(struct uip_info *info, struct uip_udp_socket *sk, struct uip_buf *buf, u8* payload, int payload_len)
+{
+ struct uip_eth *eth2;
+ struct uip_udp *udp2;
+ struct uip_ip *ip2;
+
+ /*
+ * Cook an ethernet frame
+ */
+ udp2 = (struct uip_udp *)(buf->eth);
+ eth2 = (struct uip_eth *)buf->eth;
+ ip2 = (struct uip_ip *)(buf->eth);
+
+ eth2->src = info->host_mac;
+ eth2->dst = info->guest_mac;
+ eth2->type = htons(UIP_ETH_P_IP);
+
+ ip2->vhl = UIP_IP_VER_4 | UIP_IP_HDR_LEN;
+ ip2->tos = 0;
+ ip2->id = 0;
+ ip2->flgfrag = 0;
+ ip2->ttl = UIP_IP_TTL;
+ ip2->proto = UIP_IP_P_UDP;
+ ip2->csum = 0;
+
+ ip2->sip = sk->dip;
+ ip2->dip = sk->sip;
+ udp2->sport = sk->dport;
+ udp2->dport = sk->sport;
+
+ udp2->len = htons(payload_len + uip_udp_hdrlen(udp2));
+ udp2->csum = 0;
+
+ if (payload)
+ memcpy(udp2->payload, payload, payload_len);
+
+ /* Add the lengths in host byte order, then convert the sum */
+ ip2->len = htons(ntohs(udp2->len) + uip_ip_hdrlen(ip2));
+ ip2->csum = uip_csum_ip(ip2);
+ udp2->csum = uip_csum_udp(udp2);
+
+ /*
+ * virtio_net_hdr
+ */
+ buf->vnet_len = sizeof(struct virtio_net_hdr);
+ memset(buf->vnet, 0, buf->vnet_len);
+
+ buf->eth_len = ntohs(ip2->len) + uip_eth_hdrlen(&ip2->eth);
+
+ return 0;
+}
+
+static void *uip_udp_socket_thread(void *p)
+{
+ struct epoll_event events[UIP_UDP_MAX_EVENTS];
+ struct uip_udp_socket *sk;
+ struct uip_info *info;
+ struct uip_buf *buf;
+ int payload_len;
+ u8 *payload;
+ int nfds;
+ int i;
+
+ info = p;
+
+ do {
+ payload = malloc(UIP_MAX_UDP_PAYLOAD);
+ } while (!payload);
+
+ while (1) {
+ nfds = epoll_wait(info->udp_epollfd, events, UIP_UDP_MAX_EVENTS, -1);
+
+ if (nfds == -1)
+ continue;
+
+ for (i = 0; i < nfds; i++) {
+
+ sk = events[i].data.ptr;
+ payload_len = recvfrom(sk->fd, payload, UIP_MAX_UDP_PAYLOAD, 0, NULL, NULL);
+ if (payload_len < 0)
+ continue;
+
+ /*
+ * Get free buffer to send data to guest
+ */
+ buf = uip_buf_get_free(info);
+
+ uip_udp_make_pkg(info, sk, buf, payload, payload_len);
+
+ /*
+ * Send data received from socket to guest
+ */
+ uip_buf_set_used(info, buf);
+ }
+ }
+
+ free(payload);
+ pthread_exit(NULL);
+ return NULL;
+}
+
+int uip_tx_do_ipv4_udp(struct uip_tx_arg *arg)
+{
+ struct uip_udp_socket *sk;
+ struct uip_info *info;
+ struct uip_udp *udp;
+ struct uip_ip *ip;
+ int ret;
+
+ udp = (struct uip_udp *)(arg->eth);
+ ip = (struct uip_ip *)(arg->eth);
+ info = arg->info;
+
+ if (uip_udp_is_dhcp(udp)) {
+ uip_tx_do_ipv4_udp_dhcp(arg);
+ return 0;
+ }
+
+ /*
+ * Find socket we have allocated before, otherwise allocate one
+ */
+ sk = uip_udp_socket_find(arg, ip->sip, ip->dip, udp->sport, udp->dport);
+ if (!sk)
+ return -1;
+
+ /*
+ * Send out UDP data to remote host
+ */
+ ret = uip_udp_socket_send(sk, udp);
+ if (ret)
+ return -1;
+
+ if (!info->udp_thread)
+ pthread_create(&info->udp_thread, NULL, uip_udp_socket_thread, (void *)info);
+
+ return 0;
+}
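
uip_udp_make_pkg() above keeps the UDP and IP length fields in network byte order. The following is a minimal sketch, not part of the patch, of why such lengths are best summed in host order and converted once. The helper names (ip_total_len_be, ip_total_len_naive_be, ip_len_demo) are hypothetical and exist only for illustration.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: total IPv4 length, returned in network byte order. */
static uint16_t ip_total_len_be(uint16_t udp_len_be, uint16_t ip_hdr_len)
{
	/* Convert to host order, add, then convert the sum back. */
	return htons((uint16_t)(ntohs(udp_len_be) + ip_hdr_len));
}

/* The pattern to avoid: adding two values that are already byte-swapped. */
static uint16_t ip_total_len_naive_be(uint16_t udp_len_be, uint16_t ip_hdr_len)
{
	return (uint16_t)(udp_len_be + htons(ip_hdr_len));
}

static void ip_len_demo(void)
{
	uint16_t udp_len_be = htons(248);	/* 0x00f8 in host order */

	/* 248 + 20 = 268 regardless of host endianness. */
	assert(ntohs(ip_total_len_be(udp_len_be, 20)) == 268);

	/*
	 * On a little-endian host the naive sum loses the carry out of the
	 * low byte and decodes to 12; on a big-endian host it agrees.
	 */
	(void)ip_total_len_naive_be(udp_len_be, 20);
}
```

The naive form only goes wrong when the low bytes of the two lengths add up to more than 0xff, so small test payloads tend to hide the problem.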
--- /dev/null
+#include "kvm/pci.h"
+#include "kvm/ioport.h"
+#include "kvm/util.h"
+#include "kvm/kvm.h"
+
+#include <linux/err.h>
+#include <assert.h>
+
+#define PCI_BAR_OFFSET(b) (offsetof(struct pci_device_header, bar[b]))
+
+static struct pci_device_header *pci_devices[PCI_MAX_DEVICES];
+
+static union pci_config_address pci_config_address;
+
+/* This is within our PCI gap - in an unused area.
+ * Note that this is a PCI *bus address* and is used to assign BARs etc.!
+ * (That's why it can still be 32bit even with 64bit guests -- 64bit
+ * PCI isn't currently supported.)
+ */
+static u32 io_space_blocks = KVM_PCI_MMIO_AREA;
+
+u32 pci_get_io_space_block(u32 size)
+{
+ u32 block = io_space_blocks;
+ io_space_blocks += size;
+
+ return block;
+}
+
+static void *pci_config_address_ptr(u16 port)
+{
+ unsigned long offset;
+ void *base;
+
+ offset = port - PCI_CONFIG_ADDRESS;
+ base = &pci_config_address;
+
+ return base + offset;
+}
+
+static bool pci_config_address_out(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size)
+{
+ void *p = pci_config_address_ptr(port);
+
+ memcpy(p, data, size);
+
+ return true;
+}
+
+static bool pci_config_address_in(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size)
+{
+ void *p = pci_config_address_ptr(port);
+
+ memcpy(data, p, size);
+
+ return true;
+}
+
+static struct ioport_operations pci_config_address_ops = {
+ .io_in = pci_config_address_in,
+ .io_out = pci_config_address_out,
+};
+
+static bool pci_device_exists(u8 bus_number, u8 device_number, u8 function_number)
+{
+ struct pci_device_header *dev;
+
+ if (pci_config_address.bus_number != bus_number)
+ return false;
+
+ if (pci_config_address.function_number != function_number)
+ return false;
+
+ if (device_number >= PCI_MAX_DEVICES)
+ return false;
+
+ dev = pci_devices[device_number];
+
+ return dev != NULL;
+}
+
+static bool pci_config_data_out(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size)
+{
+ /*
+ * If someone accesses PCI configuration space offsets that are not
+ * aligned to 4 bytes, the low bits of the I/O port number carry the
+ * byte offset within the register.
+ */
+ pci_config_address.reg_offset = port - PCI_CONFIG_DATA;
+
+ pci__config_wr(kvm, pci_config_address, data, size);
+
+ return true;
+}
+
+static bool pci_config_data_in(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size)
+{
+ /*
+ * If someone accesses PCI configuration space offsets that are not
+ * aligned to 4 bytes, the low bits of the I/O port number carry the
+ * byte offset within the register.
+ */
+ pci_config_address.reg_offset = port - PCI_CONFIG_DATA;
+
+ pci__config_rd(kvm, pci_config_address, data, size);
+
+ return true;
+}
+
+static struct ioport_operations pci_config_data_ops = {
+ .io_in = pci_config_data_in,
+ .io_out = pci_config_data_out,
+};
+
+void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data, int size)
+{
+ u8 dev_num;
+
+ dev_num = addr.device_number;
+
+ if (pci_device_exists(0, dev_num, 0)) {
+ unsigned long offset;
+
+ offset = addr.w & 0xff;
+ if (offset < sizeof(struct pci_device_header)) {
+ void *p = pci_devices[dev_num];
+ u8 bar = (offset - PCI_BAR_OFFSET(0)) / (sizeof(u32));
+ u32 sz = PCI_IO_SIZE;
+
+ if (bar < 6 && pci_devices[dev_num]->bar_size[bar])
+ sz = pci_devices[dev_num]->bar_size[bar];
+
+ /*
+ * If the kernel masks the BAR it would expect to find the
+ * size of the BAR there next time it reads from it.
+ * When the kernel got the size it would write the address
+ * back.
+ */
+ if (*(u32 *)(p + offset)) {
+ /* See if kernel tries to mask one of the BARs */
+ if ((offset >= PCI_BAR_OFFSET(0)) &&
+ (offset <= PCI_BAR_OFFSET(6)) &&
+ (ioport__read32(data) == 0xFFFFFFFF))
+ memcpy(p + offset, &sz, sizeof(sz));
+ else
+ memcpy(p + offset, data, size);
+ }
+ }
+ }
+}
+
+void pci__config_rd(struct kvm *kvm, union pci_config_address addr, void *data, int size)
+{
+ u8 dev_num;
+
+ dev_num = addr.device_number;
+
+ if (pci_device_exists(0, dev_num, 0)) {
+ unsigned long offset;
+
+ offset = addr.w & 0xff;
+ if (offset < sizeof(struct pci_device_header)) {
+ void *p = pci_devices[dev_num];
+
+ memcpy(data, p + offset, size);
+ } else {
+ memset(data, 0x00, size);
+ }
+ } else {
+ memset(data, 0xff, size);
+ }
+}
+
+int pci__register(struct pci_device_header *dev, u8 dev_num)
+{
+ if (dev_num >= PCI_MAX_DEVICES)
+ return -ENOSPC;
+
+ pci_devices[dev_num] = dev;
+
+ return 0;
+}
+
+struct pci_device_header *pci__find_dev(u8 dev_num)
+{
+ if (dev_num >= PCI_MAX_DEVICES)
+ return ERR_PTR(-EOVERFLOW);
+
+ return pci_devices[dev_num];
+}
+
+int pci__init(struct kvm *kvm)
+{
+ int r;
+
+ r = ioport__register(kvm, PCI_CONFIG_DATA + 0, &pci_config_data_ops, 4, NULL);
+ if (r < 0)
+ return r;
+
+ r = ioport__register(kvm, PCI_CONFIG_ADDRESS + 0, &pci_config_address_ops, 4, NULL);
+ if (r < 0) {
+ ioport__unregister(kvm, PCI_CONFIG_DATA);
+ return r;
+ }
+
+ return 0;
+}
+base_init(pci__init);
+
+int pci__exit(struct kvm *kvm)
+{
+ ioport__unregister(kvm, PCI_CONFIG_DATA);
+ ioport__unregister(kvm, PCI_CONFIG_ADDRESS);
+
+ return 0;
+}
+base_exit(pci__exit);
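
The BAR handling in pci__config_wr() above can be hard to follow inside the config-space plumbing. Below is a simplified sketch of the same handshake, assuming a toy device with six 32-bit BARs: writing 0xFFFFFFFF to a populated BAR makes the next read return its size, any other write stores the new address, and writes to unpopulated (zero) BARs are dropped. All names (toy_pci_dev, toy_bar_write, toy_bar_read, TOY_DEFAULT_SIZE) are hypothetical and are not kvmtool APIs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_NUM_BARS		6
#define TOY_DEFAULT_SIZE	0x100u	/* assumed stand-in for PCI_IO_SIZE */

/* Hypothetical, much reduced stand-in for struct pci_device_header. */
struct toy_pci_dev {
	uint32_t bar[TOY_NUM_BARS];
	uint32_t bar_size[TOY_NUM_BARS];
};

static void toy_bar_write(struct toy_pci_dev *dev, unsigned int idx, uint32_t value)
{
	uint32_t sz;

	/* Like the patch, ignore writes to registers that currently read as zero. */
	if (idx >= TOY_NUM_BARS || dev->bar[idx] == 0)
		return;

	sz = dev->bar_size[idx] ? dev->bar_size[idx] : TOY_DEFAULT_SIZE;

	if (value == 0xFFFFFFFFu)
		dev->bar[idx] = sz;	/* sizing probe: expose the size */
	else
		dev->bar[idx] = value;	/* guest writes the address back */
}

static uint32_t toy_bar_read(const struct toy_pci_dev *dev, unsigned int idx)
{
	return idx < TOY_NUM_BARS ? dev->bar[idx] : 0xFFFFFFFFu;
}

static void toy_bar_demo(void)
{
	struct toy_pci_dev dev;

	memset(&dev, 0, sizeof(dev));
	dev.bar[0] = 0x1000000;
	dev.bar_size[0] = 0x200;

	toy_bar_write(&dev, 0, 0xFFFFFFFFu);
	assert(toy_bar_read(&dev, 0) == 0x200);
}
```

Note that this mirrors what the patch stores (the raw size) rather than the size mask that real PCI hardware returns for a sizing probe.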
--- /dev/null
+#include "kvm/kvm.h"
+
+#include <stdbool.h>
+
+bool kvm__load_firmware(struct kvm *kvm, const char *firmware_filename)
+{
+ return false;
+}
--- /dev/null
+/*
+ * PPC CPU identification
+ *
+ * This is a very simple "host CPU info" struct to get us going.
+ * For the little host information we need, I don't want to grub about
+ * parsing stuff in /proc/device-tree so just match host PVR to differentiate
+ * PPC970 and POWER7 (which is all that's currently supported).
+ *
+ * QEMU does something similar but this is MUCH simpler!
+ *
+ * Copyright 2012 Matt Evans <matt@ozlabs.org>, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#include <kvm/kvm.h>
+#include <sys/ioctl.h>
+
+#include "cpu_info.h"
+#include "kvm/util.h"
+
+/* POWER7 */
+
+static struct cpu_info cpu_power7_info = {
+ .name = "POWER7",
+ .tb_freq = 512000000,
+ .d_bsize = 128,
+ .i_bsize = 128,
+ .flags = CPUINFO_FLAG_DFP | CPUINFO_FLAG_VSX | CPUINFO_FLAG_VMX,
+ .mmu_info = {
+ .flags = KVM_PPC_PAGE_SIZES_REAL | KVM_PPC_1T_SEGMENTS,
+ .slb_size = 32,
+ },
+};
+
+/* PPC970/G5 */
+
+static struct cpu_info cpu_970_info = {
+ .name = "G5",
+ .tb_freq = 33333333,
+ .d_bsize = 128,
+ .i_bsize = 128,
+ .flags = CPUINFO_FLAG_VMX,
+};
+
+/* This is a default catchall for 'no match' on PVR: */
+static struct cpu_info cpu_dummy_info = { .name = "unknown" };
+
+static struct pvr_info host_pvr_info[] = {
+ { 0xffffffff, 0x0f000003, &cpu_power7_info },
+ { 0xffff0000, 0x003f0000, &cpu_power7_info },
+ { 0xffff0000, 0x004a0000, &cpu_power7_info },
+ { 0xffff0000, 0x00390000, &cpu_970_info },
+ { 0xffff0000, 0x003c0000, &cpu_970_info },
+ { 0xffff0000, 0x00440000, &cpu_970_info },
+ { 0xffff0000, 0x00450000, &cpu_970_info },
+};
+
+/* If we can't query the kernel for supported page sizes assume 4K and 16M */
+static struct kvm_ppc_one_seg_page_size fallback_sps[] = {
+ [0] = {
+ .page_shift = 12,
+ .slb_enc = 0,
+ .enc = {
+ [0] = {
+ .page_shift = 12,
+ .pte_enc = 0,
+ },
+ },
+ },
+ [1] = {
+ .page_shift = 24,
+ .slb_enc = 0x100,
+ .enc = {
+ [0] = {
+ .page_shift = 24,
+ .pte_enc = 0,
+ },
+ },
+ },
+};
+
+
+static void setup_mmu_info(struct kvm *kvm, struct cpu_info *cpu_info)
+{
+ static struct kvm_ppc_smmu_info *mmu_info;
+ struct kvm_ppc_one_seg_page_size *sps;
+ int i, j, k, valid;
+
+ if (!kvm__supports_extension(kvm, KVM_CAP_PPC_GET_SMMU_INFO)) {
+ memcpy(&cpu_info->mmu_info.sps, fallback_sps, sizeof(fallback_sps));
+ } else if (ioctl(kvm->vm_fd, KVM_PPC_GET_SMMU_INFO, &cpu_info->mmu_info) < 0) {
+ die_perror("KVM_PPC_GET_SMMU_INFO failed");
+ }
+
+ mmu_info = &cpu_info->mmu_info;
+
+ if (!(mmu_info->flags & KVM_PPC_PAGE_SIZES_REAL))
+ /* Guest pages are not restricted by the backing page size */
+ return;
+
+ /* Filter based on backing page size */
+
+ for (i = 0; i < KVM_PPC_PAGE_SIZES_MAX_SZ; i++) {
+ sps = &mmu_info->sps[i];
+
+ if (!sps->page_shift)
+ break;
+
+ if (kvm->ram_pagesize < (1ul << sps->page_shift)) {
+ /* Mark the whole segment size invalid */
+ sps->page_shift = 0;
+ continue;
+ }
+
+ /* Check each page size for the segment */
+ for (j = 0, valid = 0; j < KVM_PPC_PAGE_SIZES_MAX_SZ; j++) {
+ if (!sps->enc[j].page_shift)
+ break;
+
+ if (kvm->ram_pagesize < (1ul << sps->enc[j].page_shift))
+ sps->enc[j].page_shift = 0;
+ else
+ valid++;
+ }
+
+ if (!valid) {
+ /* Mark the whole segment size invalid */
+ sps->page_shift = 0;
+ continue;
+ }
+
+ /* Mark any trailing entries invalid if we broke out early */
+ for (k = j; k < KVM_PPC_PAGE_SIZES_MAX_SZ; k++)
+ sps->enc[k].page_shift = 0;
+
+ /* Collapse holes */
+ for (j = 0; j < KVM_PPC_PAGE_SIZES_MAX_SZ; j++) {
+ if (sps->enc[j].page_shift)
+ continue;
+
+ for (k = j + 1; k < KVM_PPC_PAGE_SIZES_MAX_SZ; k++) {
+ if (sps->enc[k].page_shift) {
+ sps->enc[j] = sps->enc[k];
+ sps->enc[k].page_shift = 0;
+ break;
+ }
+ }
+ }
+ }
+
+ /* Mark any trailing entries invalid if we broke out early */
+ for (j = i; j < KVM_PPC_PAGE_SIZES_MAX_SZ; j++)
+ mmu_info->sps[j].page_shift = 0;
+
+ /* Collapse holes */
+ for (i = 0; i < KVM_PPC_PAGE_SIZES_MAX_SZ; i++) {
+ if (mmu_info->sps[i].page_shift)
+ continue;
+
+ for (j = i + 1; j < KVM_PPC_PAGE_SIZES_MAX_SZ; j++) {
+ if (mmu_info->sps[j].page_shift) {
+ mmu_info->sps[i] = mmu_info->sps[j];
+ mmu_info->sps[j].page_shift = 0;
+ break;
+ }
+ }
+ }
+}
+
+struct cpu_info *find_cpu_info(struct kvm *kvm)
+{
+ struct cpu_info *info;
+ unsigned int i;
+ u32 pvr = kvm->arch.pvr;
+
+ for (info = NULL, i = 0; i < ARRAY_SIZE(host_pvr_info); i++) {
+ if ((pvr & host_pvr_info[i].pvr_mask) == host_pvr_info[i].pvr) {
+ info = host_pvr_info[i].cpu_info;
+ break;
+ }
+ }
+
+ /* Didn't find anything? Rut-ro. */
+ if (!info) {
+ pr_warning("Host CPU unsupported by kvmtool\n");
+ info = &cpu_dummy_info;
+ }
+
+ setup_mmu_info(kvm, info);
+
+ return info;
+}
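
find_cpu_info() above walks a mask/value table and takes the first entry whose masked PVR matches. A stand-alone sketch of that lookup follows; the table is a shortened copy of host_pvr_info, and the names toy_pvr_entry, toy_pvr_table and toy_pvr_name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reduced form of struct pvr_info: a name instead of a cpu_info. */
struct toy_pvr_entry {
	uint32_t mask;
	uint32_t pvr;
	const char *name;
};

static const struct toy_pvr_entry toy_pvr_table[] = {
	{ 0xffffffff, 0x0f000003, "POWER7" },	/* exact match on a full PVR */
	{ 0xffff0000, 0x003f0000, "POWER7" },	/* match on the upper 16 bits */
	{ 0xffff0000, 0x00390000, "G5" },
};

/* Return the name of the first matching entry, or "unknown". */
static const char *toy_pvr_name(uint32_t pvr)
{
	size_t i;

	for (i = 0; i < sizeof(toy_pvr_table) / sizeof(toy_pvr_table[0]); i++) {
		if ((pvr & toy_pvr_table[i].mask) == toy_pvr_table[i].pvr)
			return toy_pvr_table[i].name;
	}

	return "unknown";
}

static void toy_pvr_demo(void)
{
	/* The low 16 bits (revision) are masked off for most entries. */
	assert(strcmp(toy_pvr_name(0x003f0201), "POWER7") == 0);
}
```

Because the first match wins, more specific entries (full 32-bit masks) need to come before broader ones if they could overlap.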
--- /dev/null
+/*
+ * PPC CPU identification
+ *
+ * Copyright 2012 Matt Evans <matt@ozlabs.org>, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#ifndef CPU_INFO_H
+#define CPU_INFO_H
+
+#include <kvm/kvm.h>
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/kvm.h>
+
+struct cpu_info {
+ const char *name;
+ u32 tb_freq; /* timebase frequency */
+ u32 d_bsize; /* d-cache block size */
+ u32 i_bsize; /* i-cache block size */
+ u32 flags;
+ struct kvm_ppc_smmu_info mmu_info;
+};
+
+struct pvr_info {
+ u32 pvr_mask;
+ u32 pvr;
+ struct cpu_info *cpu_info;
+};
+
+/* Misc capabilities/CPU properties */
+#define CPUINFO_FLAG_DFP 0x00000001
+#define CPUINFO_FLAG_VMX 0x00000002
+#define CPUINFO_FLAG_VSX 0x00000004
+
+struct cpu_info *find_cpu_info(struct kvm *kvm);
+
+#endif
--- /dev/null
+#ifndef _KVM_BARRIER_H_
+#define _KVM_BARRIER_H_
+
+#include <asm/barrier.h>
+
+#endif /* _KVM_BARRIER_H_ */
--- /dev/null
+/*
+ * PPC64 architecture-specific definitions
+ *
+ * Copyright 2011 Matt Evans <matt@ozlabs.org>, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#ifndef KVM__KVM_ARCH_H
+#define KVM__KVM_ARCH_H
+
+#include <stdbool.h>
+#include <linux/types.h>
+#include <time.h>
+
+/*
+ * MMIO lives after RAM, but it'd be nice if it didn't constantly move.
+ * Choose a suitably high address, e.g. 63T... This limits RAM size.
+ */
+#define PPC_MMIO_START 0x3F0000000000UL
+#define PPC_MMIO_SIZE 0x010000000000UL
+
+#define KERNEL_LOAD_ADDR 0x0000000000000000
+#define KERNEL_START_ADDR 0x0000000000000000
+#define KERNEL_SECONDARY_START_ADDR 0x0000000000000060
+#define INITRD_LOAD_ADDR 0x0000000002800000
+
+#define FDT_MAX_SIZE 0x10000
+#define RTAS_MAX_SIZE 0x10000
+
+#define TIMEBASE_FREQ 512000000ULL
+
+#define KVM_MMIO_START PPC_MMIO_START
+
+/*
+ * This is the address that pci_get_io_space_block() starts allocating
+ * from. Note that this is a PCI bus address.
+ */
+#define KVM_PCI_MMIO_AREA 0x1000000
+#define KVM_VIRTIO_MMIO_AREA 0x2000000
+
+struct spapr_phb;
+
+struct kvm_arch {
+ u64 sdr1;
+ u32 pvr;
+ unsigned long rtas_gra;
+ unsigned long rtas_size;
+ unsigned long fdt_gra;
+ unsigned long initrd_gra;
+ unsigned long initrd_size;
+ struct icp_state *icp;
+ struct spapr_phb *phb;
+};
+
+/* Helper for the various bits of code that generate FDT nodes */
+#define _FDT(exp) \
+ do { \
+ int ret = (exp); \
+ if (ret < 0) { \
+ die("Error creating device tree: %s: %s\n", \
+ #exp, fdt_strerror(ret)); \
+ } \
+ } while (0)
+
+#endif /* KVM__KVM_ARCH_H */
--- /dev/null
+/*
+ * PPC64 cpu-specific definitions
+ *
+ * Copyright 2011 Matt Evans <matt@ozlabs.org>, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#ifndef KVM__KVM_CPU_ARCH_H
+#define KVM__KVM_CPU_ARCH_H
+
+/* Architecture-specific kvm_cpu definitions. */
+
+#include <linux/kvm.h> /* for struct kvm_regs */
+#include <stdbool.h>
+#include <pthread.h>
+
+#define MSR_SF (1UL<<63)
+#define MSR_HV (1UL<<60)
+#define MSR_VEC (1UL<<25)
+#define MSR_VSX (1UL<<23)
+#define MSR_POW (1UL<<18)
+#define MSR_EE (1UL<<15)
+#define MSR_PR (1UL<<14)
+#define MSR_FP (1UL<<13)
+#define MSR_ME (1UL<<12)
+#define MSR_FE0 (1UL<<11)
+#define MSR_SE (1UL<<10)
+#define MSR_BE (1UL<<9)
+#define MSR_FE1 (1UL<<8)
+#define MSR_IR (1UL<<5)
+#define MSR_DR (1UL<<4)
+#define MSR_PMM (1UL<<2)
+#define MSR_RI (1UL<<1)
+#define MSR_LE (1UL<<0)
+
+#define POWER7_EXT_IRQ 0
+
+struct kvm;
+
+struct kvm_cpu {
+ pthread_t thread; /* VCPU thread */
+
+ unsigned long cpu_id;
+
+ struct kvm *kvm; /* parent KVM */
+ int vcpu_fd; /* For VCPU ioctls() */
+ struct kvm_run *kvm_run;
+
+ struct kvm_regs regs;
+ struct kvm_sregs sregs;
+ struct kvm_fpu fpu;
+
+ u8 is_running;
+ u8 paused;
+ u8 needs_nmi;
+ /*
+ * Although PPC KVM doesn't yet support coalesced MMIO, generic code
+ * needs this in our kvm_cpu:
+ */
+ struct kvm_coalesced_mmio_ring *ring;
+};
+
+void kvm_cpu__irq(struct kvm_cpu *vcpu, int pin, int level);
+
+/* This is never actually called on PPC. */
+static inline bool kvm_cpu__emulate_io(struct kvm *kvm, u16 port, void *data, int direction, int size, u32 count)
+{
+ return false;
+}
+
+bool kvm_cpu__emulate_mmio(struct kvm *kvm, u64 phys_addr, u8 *data, u32 len, u8 is_write);
+
+#endif /* KVM__KVM_CPU_ARCH_H */
--- /dev/null
+/*
+ * PPC64 ioport platform setup. There isn't any! :-)
+ *
+ * Copyright 2011 Matt Evans <matt@ozlabs.org>, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#include "kvm/ioport.h"
+
+#include <stdlib.h>
+
+void ioport__setup_arch(struct kvm *kvm)
+{
+ /* PPC has no legacy ioports to set up */
+}
--- /dev/null
+/*
+ * PPC64 IRQ routines
+ *
+ * Copyright 2011 Matt Evans <matt@ozlabs.org>, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#include "kvm/irq.h"
+#include "kvm/kvm.h"
+#include "kvm/util.h"
+
+#include <linux/types.h>
+#include <linux/rbtree.h>
+#include <linux/list.h>
+#include <linux/kvm.h>
+#include <sys/ioctl.h>
+
+#include <stddef.h>
+#include <stdlib.h>
+
+#include "kvm/pci.h"
+
+#include "xics.h"
+#include "spapr_pci.h"
+
+/*
+ * FIXME: The code in this file assumes an SPAPR guest, using XICS. Make
+ * generic & cope with multiple PPC platform types.
+ */
+
+static int pci_devs = 0;
+
+int irq__register_device(u32 dev, u8 *num, u8 *pin, u8 *line)
+{
+ if (pci_devs >= PCI_MAX_DEVICES)
+ die("Hit PCI device limit!\n");
+
+ *num = pci_devs++;
+
+ *pin = 1;
+ /*
+ * Have I said how nasty I find this? Line should be dontcare... PHB
+ * should determine which CPU/XICS IRQ to fire.
+ */
+ *line = xics_alloc_irqnum();
+ return 0;
+}
+
+int irq__add_msix_route(struct kvm *kvm, struct msi_msg *msg)
+{
+ die(__FUNCTION__);
+ return 0;
+}
--- /dev/null
+/*
+ * PPC64 processor support
+ *
+ * Copyright 2011 Matt Evans <matt@ozlabs.org>, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#include "kvm/kvm-cpu.h"
+
+#include "kvm/symbol.h"
+#include "kvm/util.h"
+#include "kvm/kvm.h"
+
+#include "spapr.h"
+#include "spapr_pci.h"
+#include "xics.h"
+
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <signal.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <stdio.h>
+#include <assert.h>
+
+static int debug_fd;
+
+void kvm_cpu__set_debug_fd(int fd)
+{
+ debug_fd = fd;
+}
+
+int kvm_cpu__get_debug_fd(void)
+{
+ return debug_fd;
+}
+
+static struct kvm_cpu *kvm_cpu__new(struct kvm *kvm)
+{
+ struct kvm_cpu *vcpu;
+
+ vcpu = calloc(1, sizeof *vcpu);
+ if (!vcpu)
+ return NULL;
+
+ vcpu->kvm = kvm;
+
+ return vcpu;
+}
+
+void kvm_cpu__delete(struct kvm_cpu *vcpu)
+{
+ free(vcpu);
+}
+
+struct kvm_cpu *kvm_cpu__arch_init(struct kvm *kvm, unsigned long cpu_id)
+{
+ struct kvm_cpu *vcpu;
+ int mmap_size;
+ struct kvm_enable_cap papr_cap = { .cap = KVM_CAP_PPC_PAPR };
+
+ vcpu = kvm_cpu__new(kvm);
+ if (!vcpu)
+ return NULL;
+
+ vcpu->cpu_id = cpu_id;
+
+ vcpu->vcpu_fd = ioctl(vcpu->kvm->vm_fd, KVM_CREATE_VCPU, cpu_id);
+ if (vcpu->vcpu_fd < 0)
+ die_perror("KVM_CREATE_VCPU ioctl");
+
+ mmap_size = ioctl(vcpu->kvm->sys_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
+ if (mmap_size < 0)
+ die_perror("KVM_GET_VCPU_MMAP_SIZE ioctl");
+
+ vcpu->kvm_run = mmap(NULL, mmap_size, PROT_RW, MAP_SHARED, vcpu->vcpu_fd, 0);
+ if (vcpu->kvm_run == MAP_FAILED)
+ die("unable to mmap vcpu fd");
+
+ if (ioctl(vcpu->vcpu_fd, KVM_ENABLE_CAP, &papr_cap) < 0)
+ die("unable to enable PAPR capability");
+
+ /*
+ * We start all CPUs, directing non-primary threads into the kernel's
+ * secondary start point. When we come to support SLOF, we will start
+ * only one and SLOF will RTAS call us to ask for others to be
+ * started. (FIXME: make more generic & interface with whichever
+ * firmware a platform may be using.)
+ */
+ vcpu->is_running = true;
+
+ return vcpu;
+}
+
+static void kvm_cpu__setup_fpu(struct kvm_cpu *vcpu)
+{
+ /* Don't have to do anything, there's no expected FPU state. */
+}
+
+static void kvm_cpu__setup_regs(struct kvm_cpu *vcpu)
+{
+ /*
+ * FIXME: This assumes PPC64 and Linux guest. It doesn't use the
+ * OpenFirmware entry method, but instead the "embedded" entry which
+ * passes the FDT address directly.
+ */
+ struct kvm_regs *r = &vcpu->regs;
+
+ if (vcpu->cpu_id == 0) {
+ r->pc = KERNEL_START_ADDR;
+ r->gpr[3] = vcpu->kvm->arch.fdt_gra;
+ r->gpr[5] = 0;
+ } else {
+ r->pc = KERNEL_SECONDARY_START_ADDR;
+ r->gpr[3] = vcpu->cpu_id;
+ }
+ r->msr = 0x8000000000001000UL; /* 64bit, non-HV, ME */
+
+ if (ioctl(vcpu->vcpu_fd, KVM_SET_REGS, &vcpu->regs) < 0)
+ die_perror("KVM_SET_REGS failed");
+}
+
+static void kvm_cpu__setup_sregs(struct kvm_cpu *vcpu)
+{
+ /*
+ * Some sregs setup to initialise SDR1/PVR/HIOR on PPC64 SPAPR
+ * platforms using PR KVM. (Technically, this is all ignored on
+ * SPAPR HV KVM.) Different setup is required for non-PV non-SPAPR
+ * platforms! (FIXME.)
+ */
+ struct kvm_sregs sregs;
+ struct kvm_one_reg reg = {};
+ u64 value;
+
+ if (ioctl(vcpu->vcpu_fd, KVM_GET_SREGS, &sregs) < 0)
+ die("KVM_GET_SREGS failed");
+
+ sregs.u.s.sdr1 = vcpu->kvm->arch.sdr1;
+ sregs.pvr = vcpu->kvm->arch.pvr;
+
+ if (ioctl(vcpu->vcpu_fd, KVM_SET_SREGS, &sregs) < 0)
+ die("KVM_SET_SREGS failed");
+
+ reg.id = KVM_REG_PPC_HIOR;
+ value = 0;
+ reg.addr = (u64)&value;
+ if (ioctl(vcpu->vcpu_fd, KVM_SET_ONE_REG, &reg) < 0)
+ die("KVM_SET_ONE_REG failed");
+}
+
+/**
+ * kvm_cpu__reset_vcpu - reset virtual CPU to a known state
+ */
+void kvm_cpu__reset_vcpu(struct kvm_cpu *vcpu)
+{
+ kvm_cpu__setup_regs(vcpu);
+ kvm_cpu__setup_sregs(vcpu);
+ kvm_cpu__setup_fpu(vcpu);
+}
+
+/* kvm_cpu__irq - set KVM's IRQ flag on this vcpu */
+void kvm_cpu__irq(struct kvm_cpu *vcpu, int pin, int level)
+{
+ unsigned int virq = level ? KVM_INTERRUPT_SET_LEVEL : KVM_INTERRUPT_UNSET;
+
+ /* FIXME: POWER-specific */
+ if (pin != POWER7_EXT_IRQ)
+ return;
+ if (ioctl(vcpu->vcpu_fd, KVM_INTERRUPT, &virq) < 0)
+ pr_warning("Could not KVM_INTERRUPT.");
+}
+
+void kvm_cpu__arch_nmi(struct kvm_cpu *cpu)
+{
+}
+
+bool kvm_cpu__handle_exit(struct kvm_cpu *vcpu)
+{
+ bool ret = true;
+ struct kvm_run *run = vcpu->kvm_run;
+ switch(run->exit_reason) {
+ case KVM_EXIT_PAPR_HCALL:
+ run->papr_hcall.ret = spapr_hypercall(vcpu, run->papr_hcall.nr,
+ (target_ulong*)run->papr_hcall.args);
+ break;
+ default:
+ ret = false;
+ }
+ return ret;
+}
+
+bool kvm_cpu__emulate_mmio(struct kvm *kvm, u64 phys_addr, u8 *data, u32 len, u8 is_write)
+{
+ /*
+ * FIXME: This function will need to be split in order to support
+ * various PowerPC platforms/PHB types, etc. It currently assumes SPAPR
+ * PPC64 guest.
+ */
+ bool ret = false;
+
+ if ((phys_addr >= SPAPR_PCI_WIN_START) &&
+ (phys_addr < SPAPR_PCI_WIN_END)) {
+ ret = spapr_phb_mmio(kvm, phys_addr, data, len, is_write);
+ } else {
+ pr_warning("MMIO %s unknown address %llx (size %d)!\n",
+ is_write ? "write to" : "read from",
+ phys_addr, len);
+ }
+ return ret;
+}
+
+#define CONDSTR_BIT(m, b) (((m) & MSR_##b) ? #b" " : "")
+
+void kvm_cpu__show_registers(struct kvm_cpu *vcpu)
+{
+ struct kvm_regs regs;
+ struct kvm_sregs sregs;
+ int r;
+
+ if (ioctl(vcpu->vcpu_fd, KVM_GET_REGS, &regs) < 0)
+ die("KVM_GET_REGS failed");
+ if (ioctl(vcpu->vcpu_fd, KVM_GET_SREGS, &sregs) < 0)
+ die("KVM_GET_SREGS failed");
+
+ dprintf(debug_fd, "\n Registers:\n");
+ dprintf(debug_fd, " NIP: %016llx MSR: %016llx "
+ "( %s%s%s%s%s%s%s%s%s%s%s%s)\n",
+ regs.pc, regs.msr,
+ CONDSTR_BIT(regs.msr, SF),
+ CONDSTR_BIT(regs.msr, HV), /* ! */
+ CONDSTR_BIT(regs.msr, VEC),
+ CONDSTR_BIT(regs.msr, VSX),
+ CONDSTR_BIT(regs.msr, EE),
+ CONDSTR_BIT(regs.msr, PR),
+ CONDSTR_BIT(regs.msr, FP),
+ CONDSTR_BIT(regs.msr, ME),
+ CONDSTR_BIT(regs.msr, IR),
+ CONDSTR_BIT(regs.msr, DR),
+ CONDSTR_BIT(regs.msr, RI),
+ CONDSTR_BIT(regs.msr, LE));
+ dprintf(debug_fd, " CTR: %016llx LR: %016llx CR: %08llx\n",
+ regs.ctr, regs.lr, regs.cr);
+ dprintf(debug_fd, " SRR0: %016llx SRR1: %016llx XER: %016llx\n",
+ regs.srr0, regs.srr1, regs.xer);
+ dprintf(debug_fd, " SPRG0: %016llx SPRG1: %016llx\n",
+ regs.sprg0, regs.sprg1);
+ dprintf(debug_fd, " SPRG2: %016llx SPRG3: %016llx\n",
+ regs.sprg2, regs.sprg3);
+ dprintf(debug_fd, " SPRG4: %016llx SPRG5: %016llx\n",
+ regs.sprg4, regs.sprg5);
+ dprintf(debug_fd, " SPRG6: %016llx SPRG7: %016llx\n",
+ regs.sprg6, regs.sprg7);
+ dprintf(debug_fd, " GPRs:\n ");
+ for (r = 0; r < 32; r++) {
+ dprintf(debug_fd, "%016llx ", regs.gpr[r]);
+ if ((r & 3) == 3)
+ dprintf(debug_fd, "\n ");
+ }
+ dprintf(debug_fd, "\n");
+
+ /* FIXME: Assumes SLB-based (book3s) guest */
+ for (r = 0; r < 32; r++) {
+ dprintf(debug_fd, " SLB%02d %016llx %016llx\n", r,
+ sregs.u.s.ppc64.slb[r].slbe,
+ sregs.u.s.ppc64.slb[r].slbv);
+ }
+ dprintf(debug_fd, "----------\n");
+}
+
+void kvm_cpu__show_code(struct kvm_cpu *vcpu)
+{
+ if (ioctl(vcpu->vcpu_fd, KVM_GET_REGS, &vcpu->regs) < 0)
+ die("KVM_GET_REGS failed");
+
+ /* FIXME: Dump/disassemble some code...! */
+
+ dprintf(debug_fd, "\n Stack:\n");
+ dprintf(debug_fd, " ------\n");
+ /* Only works in real mode: */
+ kvm__dump_mem(vcpu->kvm, vcpu->regs.gpr[1], 32);
+}
+
+void kvm_cpu__show_page_tables(struct kvm_cpu *vcpu)
+{
+ /* Does nothing yet */
+}
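
kvm_cpu__setup_regs() above loads the literal 0x8000000000001000 into the MSR and comments it as "64bit, non-HV, ME". The sketch below, which is not part of the patch, rebuilds that value from named bits and prints the set flags in the spirit of CONDSTR_BIT. The TOY_ prefixed names are hypothetical, and 1ULL is used here so the shifts stay 64-bit on any host.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define TOY_MSR_SF	(1ULL << 63)	/* 64-bit mode */
#define TOY_MSR_HV	(1ULL << 60)	/* hypervisor state */
#define TOY_MSR_ME	(1ULL << 12)	/* machine check enable */

/* Initial guest MSR as described by the comment: 64bit, non-HV, ME. */
static uint64_t toy_initial_msr(void)
{
	return TOY_MSR_SF | TOY_MSR_ME;
}

/* Write the names of the set bits into out; returns snprintf's result. */
static int toy_msr_describe(uint64_t msr, char *out, size_t len)
{
	return snprintf(out, len, "%s%s%s",
			(msr & TOY_MSR_SF) ? "SF " : "",
			(msr & TOY_MSR_HV) ? "HV " : "",
			(msr & TOY_MSR_ME) ? "ME " : "");
}

static void toy_msr_demo(void)
{
	char buf[32];

	assert(toy_initial_msr() == 0x8000000000001000ULL);
	toy_msr_describe(toy_initial_msr(), buf, sizeof(buf));
}
```

Spelling the value out from named bits makes it easier to see that HV is deliberately left clear for the guest.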
--- /dev/null
+/*
+ * PPC64 (SPAPR) platform support
+ *
+ * Copyright 2011 Matt Evans <matt@ozlabs.org>, IBM Corporation.
+ *
+ * Portions of FDT setup borrowed from QEMU, copyright 2010 David Gibson, IBM
+ * Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#include "kvm/kvm.h"
+#include "kvm/util.h"
+#include "libfdt.h"
+#include "cpu_info.h"
+
+#include "spapr.h"
+#include "spapr_hvcons.h"
+#include "spapr_pci.h"
+
+#include <linux/kvm.h>
+
+#include <sys/types.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <fcntl.h>
+#include <asm/unistd.h>
+#include <errno.h>
+
+#include <linux/byteorder.h>
+
+#define HPT_ORDER 24
+
+#define HUGETLBFS_PATH "/var/lib/hugetlbfs/global/pagesize-16MB/"
+
+#define PHANDLE_XICP 0x00001111
+
+static char kern_cmdline[2048];
+
+struct kvm_ext kvm_req_ext[] = {
+ { DEFINE_KVM_EXT(KVM_CAP_PPC_UNSET_IRQ) },
+ { DEFINE_KVM_EXT(KVM_CAP_PPC_IRQ_LEVEL) },
+ { 0, 0 }
+};
+
+static uint32_t mfpvr(void)
+{
+ uint32_t r;
+ asm volatile ("mfpvr %0" : "=r"(r));
+ return r;
+}
+
+bool kvm__arch_cpu_supports_vm(void)
+{
+ return true;
+}
+
+void kvm__init_ram(struct kvm *kvm)
+{
+ u64 phys_start, phys_size;
+ void *host_mem;
+
+ phys_start = 0;
+ phys_size = kvm->ram_size;
+ host_mem = kvm->ram_start;
+
+ /*
+ * We put MMIO at PPC_MMIO_START, high up. Make sure that this doesn't
+ * crash into the end of RAM -- on PPC64 at least, this is so high
+ * (63TB!) that this is unlikely.
+ */
+ if (phys_size >= PPC_MMIO_START)
+ die("Too much memory (%lld, what a nice problem): "
+ "overlaps MMIO!\n",
+ phys_size);
+
+ kvm__register_mem(kvm, phys_start, phys_size, host_mem);
+}
+
+void kvm__arch_set_cmdline(char *cmdline, bool video)
+{
+ /* We don't need anything unusual in here. */
+}
+
+/* Architecture-specific KVM init */
+void kvm__arch_init(struct kvm *kvm, const char *hugetlbfs_path, u64 ram_size)
+{
+ int cap_ppc_rma;
+ unsigned long hpt;
+
+ kvm->ram_size = ram_size;
+
+ /* Map "default" hugetlbfs path to the standard 16M mount point */
+ if (hugetlbfs_path && !strcmp(hugetlbfs_path, "default"))
+ hugetlbfs_path = HUGETLBFS_PATH;
+
+ kvm->ram_start = mmap_anon_or_hugetlbfs(kvm, hugetlbfs_path, kvm->ram_size);
+
+ if (kvm->ram_start == MAP_FAILED)
+ die("Couldn't map %lld bytes for RAM (%d)\n",
+ kvm->ram_size, errno);
+
+ /* FDT goes at top of memory, RTAS just below */
+ kvm->arch.fdt_gra = kvm->ram_size - FDT_MAX_SIZE;
+ /* FIXME: Not all PPC systems have RTAS */
+ kvm->arch.rtas_gra = kvm->arch.fdt_gra - RTAS_MAX_SIZE;
+ madvise(kvm->ram_start, kvm->ram_size, MADV_MERGEABLE);
+
+ /* FIXME: SPAPR-PR specific; allocate a guest HPT. */
+ if (posix_memalign((void **)&hpt, (1<<HPT_ORDER), (1<<HPT_ORDER)))
+ die("Can't allocate %d bytes for HPT\n", (1<<HPT_ORDER));
+
+ kvm->arch.sdr1 = ((hpt + 0x3ffffULL) & ~0x3ffffULL) | (HPT_ORDER-18);
+
+ kvm->arch.pvr = mfpvr();
+
+ /* FIXME: This is book3s-specific */
+ cap_ppc_rma = ioctl(kvm->sys_fd, KVM_CHECK_EXTENSION, KVM_CAP_PPC_RMA);
+ if (cap_ppc_rma == 2)
+ die("Need contiguous RMA allocation on this hardware, "
+ "which is not yet supported.");
+
+ /* Do these before FDT setup, IRQ setup, etc. */
+ /* FIXME: SPAPR-specific */
+ hypercall_init();
+ register_core_rtas();
+ /* Now that hypercalls are initialised, register a couple for the console: */
+ spapr_hvcons_init();
+ spapr_create_phb(kvm, "pci", SPAPR_PCI_BUID,
+ SPAPR_PCI_MEM_WIN_ADDR,
+ SPAPR_PCI_MEM_WIN_SIZE,
+ SPAPR_PCI_IO_WIN_ADDR,
+ SPAPR_PCI_IO_WIN_SIZE);
+}
+
+void kvm__arch_delete_ram(struct kvm *kvm)
+{
+ munmap(kvm->ram_start, kvm->ram_size);
+}
+
+void kvm__irq_trigger(struct kvm *kvm, int irq)
+{
+ kvm__irq_line(kvm, irq, 1);
+ kvm__irq_line(kvm, irq, 0);
+}
+
+void kvm__arch_periodic_poll(struct kvm *kvm)
+{
+ /* FIXME: Should register callbacks to platform-specific polls */
+ spapr_hvcons_poll(kvm);
+}
+
+int load_flat_binary(struct kvm *kvm, int fd_kernel, int fd_initrd, const char *kernel_cmdline)
+{
+ void *p;
+ void *k_start;
+ void *i_start;
+ int nr;
+
+ if (lseek(fd_kernel, 0, SEEK_SET) < 0)
+ die_perror("lseek");
+
+ p = k_start = guest_flat_to_host(kvm, KERNEL_LOAD_ADDR);
+
+ while ((nr = read(fd_kernel, p, 65536)) > 0)
+ p += nr;
+
+ pr_info("Loaded kernel to 0x%x (%ld bytes)", KERNEL_LOAD_ADDR, p-k_start);
+
+ if (fd_initrd != -1) {
+ if (lseek(fd_initrd, 0, SEEK_SET) < 0)
+ die_perror("lseek");
+
+ if (p-k_start > INITRD_LOAD_ADDR)
+ die("Kernel overlaps initrd!");
+
+ /* Load the initrd at the fixed INITRD_LOAD_ADDR, above the kernel image. */
+ i_start = p = guest_flat_to_host(kvm, INITRD_LOAD_ADDR);
+
+ while (((nr = read(fd_initrd, p, 65536)) > 0) &&
+ p < (kvm->ram_start + kvm->ram_size))
+ p += nr;
+
+ if (p >= (kvm->ram_start + kvm->ram_size))
+ die("initrd too big to contain in guest RAM.\n");
+
+ pr_info("Loaded initrd to 0x%x (%ld bytes)",
+ INITRD_LOAD_ADDR, p-i_start);
+ kvm->arch.initrd_gra = INITRD_LOAD_ADDR;
+ kvm->arch.initrd_size = p-i_start;
+ } else {
+ kvm->arch.initrd_size = 0;
+ }
+ strncpy(kern_cmdline, kernel_cmdline, 2048);
+ kern_cmdline[2047] = '\0';
+
+ return true;
+}
+
+bool load_bzimage(struct kvm *kvm, int fd_kernel,
+ int fd_initrd, const char *kernel_cmdline, u16 vidmode)
+{
+ /* We don't support bzImages. */
+ return false;
+}
+
+struct fdt_prop {
+ void *value;
+ int size;
+};
+
+static void generate_segment_page_sizes(struct kvm_ppc_smmu_info *info, struct fdt_prop *prop)
+{
+ struct kvm_ppc_one_seg_page_size *sps;
+ int i, j, size;
+ u32 *p;
+
+ for (size = 0, i = 0; i < KVM_PPC_PAGE_SIZES_MAX_SZ; i++) {
+ sps = &info->sps[i];
+
+ if (sps->page_shift == 0)
+ break;
+
+ /* page shift, slb enc & count */
+ size += 3;
+
+ for (j = 0; j < KVM_PPC_PAGE_SIZES_MAX_SZ; j++) {
+ if (info->sps[i].enc[j].page_shift == 0)
+ break;
+
+ /* page shift & pte enc */
+ size += 2;
+ }
+ }
+
+ if (!size) {
+ prop->value = NULL;
+ prop->size = 0;
+ return;
+ }
+
+ /* Convert size to bytes */
+ prop->size = size * sizeof(u32);
+
+ prop->value = malloc(prop->size);
+ if (!prop->value)
+ die_perror("malloc failed");
+
+ p = (u32 *)prop->value;
+ for (i = 0; i < KVM_PPC_PAGE_SIZES_MAX_SZ; i++) {
+ sps = &info->sps[i];
+
+ if (sps->page_shift == 0)
+ break;
+
+ *p++ = sps->page_shift;
+ *p++ = sps->slb_enc;
+
+ for (j = 0; j < KVM_PPC_PAGE_SIZES_MAX_SZ; j++)
+ if (!info->sps[i].enc[j].page_shift)
+ break;
+
+ *p++ = j; /* count of enc */
+
+ for (j = 0; j < KVM_PPC_PAGE_SIZES_MAX_SZ; j++) {
+ if (!info->sps[i].enc[j].page_shift)
+ break;
+
+ *p++ = info->sps[i].enc[j].page_shift;
+ *p++ = info->sps[i].enc[j].pte_enc;
+ }
+ }
+}
+
+#define SMT_THREADS 4
+
+/*
+ * Set up the FDT for the kernel: This function is currently fairly SPAPR-heavy,
+ * and whilst most PPC targets will require CPU/memory nodes, others like RTAS
+ * should eventually be added separately.
+ */
+static int setup_fdt(struct kvm *kvm)
+{
+ uint64_t mem_reg_property[] = { 0, cpu_to_be64(kvm->ram_size) };
+ int smp_cpus = kvm->nrcpus;
+ uint32_t int_server_ranges_prop[] = {0, cpu_to_be32(smp_cpus)};
+ char hypertas_prop_kvm[] = "hcall-pft\0hcall-term\0"
+ "hcall-dabr\0hcall-interrupt\0hcall-tce\0hcall-vio\0"
+ "hcall-splpar\0hcall-bulk";
+ int i, j;
+ char cpu_name[30];
+ u8 staging_fdt[FDT_MAX_SIZE];
+ struct cpu_info *cpu_info = find_cpu_info(kvm);
+ struct fdt_prop segment_page_sizes;
+ u32 segment_sizes_1T[] = {0x1c, 0x28, 0xffffffff, 0xffffffff};
+
+ /* Generate an appropriate DT at kvm->arch.fdt_gra */
+ void *fdt_dest = guest_flat_to_host(kvm, kvm->arch.fdt_gra);
+ void *fdt = staging_fdt;
+
+ _FDT(fdt_create(fdt, FDT_MAX_SIZE));
+ _FDT(fdt_finish_reservemap(fdt));
+
+ _FDT(fdt_begin_node(fdt, ""));
+
+ _FDT(fdt_property_string(fdt, "device_type", "chrp"));
+ _FDT(fdt_property_string(fdt, "model", "IBM pSeries (kvmtool)"));
+ _FDT(fdt_property_cell(fdt, "#address-cells", 0x2));
+ _FDT(fdt_property_cell(fdt, "#size-cells", 0x2));
+
+ /* RTAS */
+ _FDT(fdt_begin_node(fdt, "rtas"));
+ /* The kernel uses this property to decide that it is running as an LPAR. */
+ _FDT(fdt_property(fdt, "ibm,hypertas-functions", hypertas_prop_kvm,
+ sizeof(hypertas_prop_kvm)));
+ _FDT(fdt_property_cell(fdt, "linux,rtas-base", kvm->arch.rtas_gra));
+ _FDT(fdt_property_cell(fdt, "linux,rtas-entry", kvm->arch.rtas_gra));
+ _FDT(fdt_property_cell(fdt, "rtas-size", kvm->arch.rtas_size));
+ /* Now add properties for all RTAS tokens: */
+ if (spapr_rtas_fdt_setup(kvm, fdt))
+ die("Couldn't create RTAS FDT properties\n");
+
+ _FDT(fdt_end_node(fdt));
+
+ /* /chosen */
+ _FDT(fdt_begin_node(fdt, "chosen"));
+ /* cmdline */
+ _FDT(fdt_property_string(fdt, "bootargs", kern_cmdline));
+ /* Initrd */
+ if (kvm->arch.initrd_size != 0) {
+ uint32_t ird_st_prop = cpu_to_be32(kvm->arch.initrd_gra);
+ uint32_t ird_end_prop = cpu_to_be32(kvm->arch.initrd_gra +
+ kvm->arch.initrd_size);
+ _FDT(fdt_property(fdt, "linux,initrd-start",
+ &ird_st_prop, sizeof(ird_st_prop)));
+ _FDT(fdt_property(fdt, "linux,initrd-end",
+ &ird_end_prop, sizeof(ird_end_prop)));
+ }
+
+ /*
+ * stdout-path: This is assuming we're using the HV console. Also, the
+ * address is hardwired until we do a VIO bus.
+ */
+ _FDT(fdt_property_string(fdt, "linux,stdout-path",
+ "/vdevice/vty@30000000"));
+ _FDT(fdt_end_node(fdt));
+
+ /*
+ * Memory: We don't alloc. a separate RMA yet. If we ever need to
+ * (CAP_PPC_RMA == 2) then have one memory node for 0->RMAsize, and
+ * another RMAsize->endOfMem.
+ */
+ _FDT(fdt_begin_node(fdt, "memory@0"));
+ _FDT(fdt_property_string(fdt, "device_type", "memory"));
+ _FDT(fdt_property(fdt, "reg", mem_reg_property,
+ sizeof(mem_reg_property)));
+ _FDT(fdt_end_node(fdt));
+
+ generate_segment_page_sizes(&cpu_info->mmu_info, &segment_page_sizes);
+
+ /* CPUs */
+ _FDT(fdt_begin_node(fdt, "cpus"));
+ _FDT(fdt_property_cell(fdt, "#address-cells", 0x1));
+ _FDT(fdt_property_cell(fdt, "#size-cells", 0x0));
+
+ for (i = 0; i < smp_cpus; i += SMT_THREADS) {
+ int32_t pft_size_prop[] = { 0, HPT_ORDER };
+ uint32_t servers_prop[SMT_THREADS];
+ uint32_t gservers_prop[SMT_THREADS * 2];
+ int threads = (smp_cpus - i) >= SMT_THREADS ? SMT_THREADS :
+ smp_cpus - i;
+
+ sprintf(cpu_name, "PowerPC,%s@%d", cpu_info->name, i);
+ _FDT(fdt_begin_node(fdt, cpu_name));
+ sprintf(cpu_name, "PowerPC,%s", cpu_info->name);
+ _FDT(fdt_property_string(fdt, "name", cpu_name));
+ _FDT(fdt_property_string(fdt, "device_type", "cpu"));
+
+ _FDT(fdt_property_cell(fdt, "reg", i));
+ _FDT(fdt_property_cell(fdt, "cpu-version", kvm->arch.pvr));
+
+ _FDT(fdt_property_cell(fdt, "dcache-block-size", cpu_info->d_bsize));
+ _FDT(fdt_property_cell(fdt, "icache-block-size", cpu_info->i_bsize));
+
+ _FDT(fdt_property_cell(fdt, "timebase-frequency", cpu_info->tb_freq));
+ /* Lies, but safeish lies! */
+ _FDT(fdt_property_cell(fdt, "clock-frequency", 0xddbab200));
+
+ if (cpu_info->mmu_info.slb_size)
+ _FDT(fdt_property_cell(fdt, "ibm,slb-size", cpu_info->mmu_info.slb_size));
+
+ /*
+ * HPT size is hardwired; KVM currently fixes it at 16MB but the
+ * moment that changes we'll need to read it out of the kernel.
+ */
+ _FDT(fdt_property(fdt, "ibm,pft-size", pft_size_prop,
+ sizeof(pft_size_prop)));
+
+ _FDT(fdt_property_string(fdt, "status", "okay"));
+ _FDT(fdt_property(fdt, "64-bit", NULL, 0));
+ /* A server for each thread in this core */
+ for (j = 0; j < SMT_THREADS; j++) {
+ servers_prop[j] = cpu_to_be32(i+j);
+ /*
+ * Hack borrowed from QEMU, direct the group queues back
+ * to cpu 0:
+ */
+ gservers_prop[j*2] = cpu_to_be32(i+j);
+ gservers_prop[j*2 + 1] = 0;
+ }
+ _FDT(fdt_property(fdt, "ibm,ppc-interrupt-server#s",
+ servers_prop, threads * sizeof(uint32_t)));
+ _FDT(fdt_property(fdt, "ibm,ppc-interrupt-gserver#s",
+ gservers_prop,
+ threads * 2 * sizeof(uint32_t)));
+
+ if (segment_page_sizes.value)
+ _FDT(fdt_property(fdt, "ibm,segment-page-sizes",
+ segment_page_sizes.value,
+ segment_page_sizes.size));
+
+ if (cpu_info->mmu_info.flags & KVM_PPC_1T_SEGMENTS)
+ _FDT(fdt_property(fdt, "ibm,processor-segment-sizes",
+ segment_sizes_1T, sizeof(segment_sizes_1T)));
+
+ /* VMX / VSX / DFP options: */
+ if (cpu_info->flags & CPUINFO_FLAG_VMX)
+ _FDT(fdt_property_cell(fdt, "ibm,vmx",
+ (cpu_info->flags &
+ CPUINFO_FLAG_VSX) ? 2 : 1));
+ if (cpu_info->flags & CPUINFO_FLAG_DFP)
+ _FDT(fdt_property_cell(fdt, "ibm,dfp", 0x1));
+ _FDT(fdt_end_node(fdt));
+ }
+ _FDT(fdt_end_node(fdt));
+
+ /* IRQ controller */
+ _FDT(fdt_begin_node(fdt, "interrupt-controller@0"));
+
+ _FDT(fdt_property_string(fdt, "device_type",
+ "PowerPC-External-Interrupt-Presentation"));
+ _FDT(fdt_property_string(fdt, "compatible", "IBM,ppc-xicp"));
+ _FDT(fdt_property_cell(fdt, "reg", 0));
+ _FDT(fdt_property(fdt, "interrupt-controller", NULL, 0));
+ _FDT(fdt_property(fdt, "ibm,interrupt-server-ranges",
+ int_server_ranges_prop,
+ sizeof(int_server_ranges_prop)));
+ _FDT(fdt_property_cell(fdt, "#interrupt-cells", 2));
+ _FDT(fdt_property_cell(fdt, "linux,phandle", PHANDLE_XICP));
+ _FDT(fdt_property_cell(fdt, "phandle", PHANDLE_XICP));
+ _FDT(fdt_end_node(fdt));
+
+ /*
+ * VIO: See comment in linux,stdout-path; we don't yet represent a VIO
+ * bus/address allocation so addresses are hardwired here.
+ */
+ _FDT(fdt_begin_node(fdt, "vdevice"));
+ _FDT(fdt_property_cell(fdt, "#address-cells", 0x1));
+ _FDT(fdt_property_cell(fdt, "#size-cells", 0x0));
+ _FDT(fdt_property_string(fdt, "device_type", "vdevice"));
+ _FDT(fdt_property_string(fdt, "compatible", "IBM,vdevice"));
+ _FDT(fdt_begin_node(fdt, "vty@30000000"));
+ _FDT(fdt_property_string(fdt, "name", "vty"));
+ _FDT(fdt_property_string(fdt, "device_type", "serial"));
+ _FDT(fdt_property_string(fdt, "compatible", "hvterm1"));
+ _FDT(fdt_property_cell(fdt, "reg", 0x30000000));
+ _FDT(fdt_end_node(fdt));
+ _FDT(fdt_end_node(fdt));
+
+ /* Finalise: */
+ _FDT(fdt_end_node(fdt)); /* Root node */
+ _FDT(fdt_finish(fdt));
+
+ _FDT(fdt_open_into(fdt, fdt_dest, FDT_MAX_SIZE));
+
+ /* PCI */
+ if (spapr_populate_pci_devices(kvm, PHANDLE_XICP, fdt_dest))
+ die("Fail populating PCI device nodes");
+
+ _FDT(fdt_add_mem_rsv(fdt_dest, kvm->arch.rtas_gra, kvm->arch.rtas_size));
+ _FDT(fdt_pack(fdt_dest));
+
+ free(segment_page_sizes.value);
+
+ return 0;
+}
+firmware_init(setup_fdt);
+
+/**
+ * kvm__arch_setup_firmware
+ */
+int kvm__arch_setup_firmware(struct kvm *kvm)
+{
+ /*
+ * Set up the RTAS stub, which is just a single hypercall:
+ * 0: 7c 64 1b 78 mr r4,r3
+ * 4: 3c 60 00 00 lis r3,0
+ * 8: 60 63 f0 00 ori r3,r3,61440
+ * c: 44 00 00 22 sc 1
+ * 10: 4e 80 00 20 blr
+ */
+ uint32_t *rtas = guest_flat_to_host(kvm, kvm->arch.rtas_gra);
+
+ rtas[0] = 0x7c641b78;
+ rtas[1] = 0x3c600000;
+ rtas[2] = 0x6063f000;
+ rtas[3] = 0x44000022;
+ rtas[4] = 0x4e800020;
+ kvm->arch.rtas_size = 20;
+
+ pr_info("Set up %ld bytes of RTAS at 0x%lx\n",
+ kvm->arch.rtas_size, kvm->arch.rtas_gra);
+
+ /* Load SLOF */
+
+ return 0;
+}
+
+int kvm__arch_free_firmware(struct kvm *kvm)
+{
+ return 0;
+}
--- /dev/null
+/*
+ * SPAPR definitions and declarations
+ *
+ * Borrowed heavily from QEMU's spapr.h,
+ * Copyright (c) 2010 David Gibson, IBM Corporation.
+ *
+ * Modifications by Matt Evans <matt@ozlabs.org>, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#if !defined(__HW_SPAPR_H__)
+#define __HW_SPAPR_H__
+
+#include <inttypes.h>
+
+/* We need some of the H_ hcall defs, but they're __KERNEL__ only. */
+#define __KERNEL__
+#include <asm/hvcall.h>
+#undef __KERNEL__
+
+#include "kvm/kvm.h"
+#include "kvm/kvm-cpu.h"
+
+typedef unsigned long target_ulong;
+typedef uintptr_t target_phys_addr_t;
+
+/*
+ * The hcalls above are standardized in PAPR and implemented by pHyp
+ * as well.
+ *
+ * We also need some hcalls which are specific to qemu / KVM-on-POWER.
+ * So far we just need one for H_RTAS, but in future we'll need more
+ * for extensions like virtio. We put those into the 0xf000-0xfffc
+ * range which is reserved by PAPR for "platform-specific" hcalls.
+ */
+#define KVMPPC_HCALL_BASE 0xf000
+#define KVMPPC_H_RTAS (KVMPPC_HCALL_BASE + 0x0)
+#define KVMPPC_HCALL_MAX KVMPPC_H_RTAS
+
+#define DEBUG_SPAPR_HCALLS
+
+#ifdef DEBUG_SPAPR_HCALLS
+#define hcall_dprintf(fmt, ...) \
+ do { fprintf(stderr, fmt, ## __VA_ARGS__); } while (0)
+#else
+#define hcall_dprintf(fmt, ...) \
+ do { } while (0)
+#endif
+
+typedef target_ulong (*spapr_hcall_fn)(struct kvm_cpu *vcpu,
+ target_ulong opcode,
+ target_ulong *args);
+
+void hypercall_init(void);
+void register_core_rtas(void);
+
+void spapr_register_hypercall(target_ulong opcode, spapr_hcall_fn fn);
+target_ulong spapr_hypercall(struct kvm_cpu *vcpu, target_ulong opcode,
+ target_ulong *args);
+
+int spapr_rtas_fdt_setup(struct kvm *kvm, void *fdt);
+
+static inline uint32_t rtas_ld(struct kvm *kvm, target_ulong phys, int n)
+{
+ return *((uint32_t *)guest_flat_to_host(kvm, phys + 4*n));
+}
+
+static inline void rtas_st(struct kvm *kvm, target_ulong phys, int n, uint32_t val)
+{
+ *((uint32_t *)guest_flat_to_host(kvm, phys + 4*n)) = val;
+}
+
+typedef void (*spapr_rtas_fn)(struct kvm_cpu *vcpu, uint32_t token,
+ uint32_t nargs, target_ulong args,
+ uint32_t nret, target_ulong rets);
+void spapr_rtas_register(const char *name, spapr_rtas_fn fn);
+target_ulong spapr_rtas_call(struct kvm_cpu *vcpu,
+ uint32_t token, uint32_t nargs, target_ulong args,
+ uint32_t nret, target_ulong rets);
+
+#define SPAPR_PCI_BUID 0x800000020000001ULL
+#define SPAPR_PCI_MEM_WIN_ADDR (KVM_MMIO_START + 0xA0000000)
+#define SPAPR_PCI_MEM_WIN_SIZE 0x20000000
+#define SPAPR_PCI_IO_WIN_ADDR (SPAPR_PCI_MEM_WIN_ADDR + SPAPR_PCI_MEM_WIN_SIZE)
+#define SPAPR_PCI_IO_WIN_SIZE 0x2000000
+
+#define SPAPR_PCI_WIN_START SPAPR_PCI_MEM_WIN_ADDR
+#define SPAPR_PCI_WIN_END (SPAPR_PCI_IO_WIN_ADDR + SPAPR_PCI_IO_WIN_SIZE)
+
+#endif /* !defined (__HW_SPAPR_H__) */
--- /dev/null
+/*
+ * SPAPR hypercalls
+ *
+ * Borrowed heavily from QEMU's spapr_hcall.c,
+ * Copyright (c) 2010 David Gibson, IBM Corporation.
+ *
+ * Copyright (c) 2011 Matt Evans <matt@ozlabs.org>, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#include "spapr.h"
+#include "kvm/util.h"
+#include "kvm/kvm.h"
+#include "kvm/kvm-cpu.h"
+
+#include <stdio.h>
+#include <assert.h>
+
+static spapr_hcall_fn papr_hypercall_table[(MAX_HCALL_OPCODE / 4) + 1];
+static spapr_hcall_fn kvmppc_hypercall_table[KVMPPC_HCALL_MAX -
+ KVMPPC_HCALL_BASE + 1];
+
+static target_ulong h_set_dabr(struct kvm_cpu *vcpu, target_ulong opcode, target_ulong *args)
+{
+ /* FIXME: Implement this for -PR. (-HV does this in kernel.) */
+ return H_HARDWARE;
+}
+
+static target_ulong h_rtas(struct kvm_cpu *vcpu, target_ulong opcode, target_ulong *args)
+{
+ target_ulong rtas_r3 = args[0];
+ /*
+ * Pointer read from phys mem; these ptrs cannot be MMIO (!) so just
+ * reference guest RAM directly.
+ */
+ uint32_t token, nargs, nret;
+
+ token = rtas_ld(vcpu->kvm, rtas_r3, 0);
+ nargs = rtas_ld(vcpu->kvm, rtas_r3, 1);
+ nret = rtas_ld(vcpu->kvm, rtas_r3, 2);
+
+ return spapr_rtas_call(vcpu, token, nargs, rtas_r3 + 12,
+ nret, rtas_r3 + 12 + 4*nargs);
+}
+
+static target_ulong h_logical_load(struct kvm_cpu *vcpu, target_ulong opcode, target_ulong *args)
+{
+ /* SLOF will require these, though kernel doesn't. */
+ die(__PRETTY_FUNCTION__);
+ return H_PARAMETER;
+}
+
+static target_ulong h_logical_store(struct kvm_cpu *vcpu, target_ulong opcode, target_ulong *args)
+{
+ /* SLOF will require these, though kernel doesn't. */
+ die(__PRETTY_FUNCTION__);
+ return H_PARAMETER;
+}
+
+static target_ulong h_logical_icbi(struct kvm_cpu *vcpu, target_ulong opcode, target_ulong *args)
+{
+ /* KVM will trap this in the kernel. Die if it misses. */
+ die(__PRETTY_FUNCTION__);
+ return H_SUCCESS;
+}
+
+static target_ulong h_logical_dcbf(struct kvm_cpu *vcpu, target_ulong opcode, target_ulong *args)
+{
+ /* KVM will trap this in the kernel. Die if it misses. */
+ die(__PRETTY_FUNCTION__);
+ return H_SUCCESS;
+}
+
+void spapr_register_hypercall(target_ulong opcode, spapr_hcall_fn fn)
+{
+ spapr_hcall_fn *slot;
+
+ if (opcode <= MAX_HCALL_OPCODE) {
+ assert((opcode & 0x3) == 0);
+
+ slot = &papr_hypercall_table[opcode / 4];
+ } else {
+ assert((opcode >= KVMPPC_HCALL_BASE) &&
+ (opcode <= KVMPPC_HCALL_MAX));
+
+ slot = &kvmppc_hypercall_table[opcode - KVMPPC_HCALL_BASE];
+ }
+
+ assert(!(*slot) || (fn == *slot));
+ *slot = fn;
+}
+
+target_ulong spapr_hypercall(struct kvm_cpu *vcpu, target_ulong opcode,
+ target_ulong *args)
+{
+ if ((opcode <= MAX_HCALL_OPCODE)
+ && ((opcode & 0x3) == 0)) {
+ spapr_hcall_fn fn = papr_hypercall_table[opcode / 4];
+
+ if (fn) {
+ return fn(vcpu, opcode, args);
+ }
+ } else if ((opcode >= KVMPPC_HCALL_BASE) &&
+ (opcode <= KVMPPC_HCALL_MAX)) {
+ spapr_hcall_fn fn = kvmppc_hypercall_table[opcode -
+ KVMPPC_HCALL_BASE];
+
+ if (fn) {
+ return fn(vcpu, opcode, args);
+ }
+ }
+
+ hcall_dprintf("Unimplemented hcall 0x%lx\n", opcode);
+ return H_FUNCTION;
+}
+
+void hypercall_init(void)
+{
+ /* hcall-dabr */
+ spapr_register_hypercall(H_SET_DABR, h_set_dabr);
+
+ spapr_register_hypercall(H_LOGICAL_CI_LOAD, h_logical_load);
+ spapr_register_hypercall(H_LOGICAL_CI_STORE, h_logical_store);
+ spapr_register_hypercall(H_LOGICAL_CACHE_LOAD, h_logical_load);
+ spapr_register_hypercall(H_LOGICAL_CACHE_STORE, h_logical_store);
+ spapr_register_hypercall(H_LOGICAL_ICBI, h_logical_icbi);
+ spapr_register_hypercall(H_LOGICAL_DCBF, h_logical_dcbf);
+
+ /* KVM-PPC specific hcalls */
+ spapr_register_hypercall(KVMPPC_H_RTAS, h_rtas);
+}
--- /dev/null
+/*
+ * SPAPR HV console
+ *
+ * Borrowed lightly from QEMU's spapr_vty.c, Copyright (c) 2010 David Gibson,
+ * IBM Corporation.
+ *
+ * Copyright (c) 2011 Matt Evans <matt@ozlabs.org>, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#include "kvm/term.h"
+#include "kvm/kvm.h"
+#include "kvm/kvm-cpu.h"
+#include "kvm/util.h"
+#include "spapr.h"
+#include "spapr_hvcons.h"
+
+#include <stdio.h>
+#include <sys/uio.h>
+#include <errno.h>
+
+#include <linux/byteorder.h>
+
+union hv_chario {
+ struct {
+ uint64_t char0_7;
+ uint64_t char8_15;
+ } a;
+ uint8_t buf[16];
+};
+
+static unsigned long h_put_term_char(struct kvm_cpu *vcpu, unsigned long opcode, unsigned long *args)
+{
+ /* To do: Read register from args[0], and check it. */
+ unsigned long len = args[1];
+ union hv_chario data;
+ struct iovec iov;
+
+ if (len > 16) {
+ return H_PARAMETER;
+ }
+ data.a.char0_7 = cpu_to_be64(args[2]);
+ data.a.char8_15 = cpu_to_be64(args[3]);
+
+ iov.iov_base = data.buf;
+ iov.iov_len = len;
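+ /*
+ * Discard the output unless the HV console is active. Looping with a
+ * zero-byte "write" would never terminate for len > 0.
+ */
+ if (vcpu->kvm->cfg.active_console != CONSOLE_HV)
+ return H_SUCCESS;
+
+ do {
+ int ret = term_putc_iov(&iov, 1, 0);
+
+ if (ret < 0) {
+ die("term_putc_iov error %d!\n", errno);
+ }
+ iov.iov_base += ret;
+ iov.iov_len -= ret;
+ } while (iov.iov_len > 0);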
+
+ return H_SUCCESS;
+}
+
+
+static unsigned long h_get_term_char(struct kvm_cpu *vcpu, unsigned long opcode, unsigned long *args)
+{
+ /* To do: Read register from args[0], and check it. */
+ unsigned long *len = args + 0;
+ unsigned long *char0_7 = args + 1;
+ unsigned long *char8_15 = args + 2;
+ union hv_chario data;
+ struct iovec iov;
+
+ if (vcpu->kvm->cfg.active_console != CONSOLE_HV)
+ return H_SUCCESS;
+
+ if (term_readable(0)) {
+ iov.iov_base = data.buf;
+ iov.iov_len = 16;
+
+ *len = term_getc_iov(vcpu->kvm, &iov, 1, 0);
+ *char0_7 = be64_to_cpu(data.a.char0_7);
+ *char8_15 = be64_to_cpu(data.a.char8_15);
+ } else {
+ *len = 0;
+ }
+
+ return H_SUCCESS;
+}
+
+void spapr_hvcons_poll(struct kvm *kvm)
+{
+ if (term_readable(0)) {
+ /*
+ * We can inject an IRQ to guest here if we want. The guest
+ * will happily poll, though, so not required.
+ */
+ }
+}
+
+void spapr_hvcons_init(void)
+{
+ spapr_register_hypercall(H_PUT_TERM_CHAR, h_put_term_char);
+ spapr_register_hypercall(H_GET_TERM_CHAR, h_get_term_char);
+}
--- /dev/null
+/*
+ * SPAPR HV console
+ *
+ * Copyright (c) 2011 Matt Evans <matt@ozlabs.org>, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#ifndef spapr_hvcons_H
+#define spapr_hvcons_H
+
+#include "kvm/kvm.h"
+
+void spapr_hvcons_init(void);
+void spapr_hvcons_poll(struct kvm *kvm);
+
+#endif
--- /dev/null
+/*
+ * SPAPR PHB emulation, RTAS interface to PCI config space, device tree nodes
+ * for enumerated devices.
+ *
+ * Borrowed heavily from QEMU's spapr_pci.c,
+ * Copyright (c) 2011 Alexey Kardashevskiy, IBM Corporation.
+ * Copyright (c) 2011 David Gibson, IBM Corporation.
+ *
+ * Modifications copyright 2011 Matt Evans <matt@ozlabs.org>, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#include "spapr.h"
+#include "spapr_pci.h"
+#include "kvm/util.h"
+#include "kvm/pci.h"
+#include "libfdt.h"
+
+#include <linux/pci_regs.h>
+#include <linux/byteorder.h>
+
+
+/* #define DEBUG_PHB yes */
+#ifdef DEBUG_PHB
+#define phb_dprintf(fmt, ...) \
+ do { fprintf(stderr, fmt, ## __VA_ARGS__); } while (0)
+#else
+#define phb_dprintf(fmt, ...) \
+ do { } while (0)
+#endif
+
+static const uint32_t bars[] = {
+ PCI_BASE_ADDRESS_0, PCI_BASE_ADDRESS_1,
+ PCI_BASE_ADDRESS_2, PCI_BASE_ADDRESS_3,
+ PCI_BASE_ADDRESS_4, PCI_BASE_ADDRESS_5
+ /*, PCI_ROM_ADDRESS*/
+};
+
+#define PCI_NUM_REGIONS 7
+
+/* Macros to operate with address in OF binding to PCI */
+#define b_x(x, p, l) (((x) & ((1<<(l))-1)) << (p))
+#define b_n(x) b_x((x), 31, 1) /* 0 if relocatable */
+#define b_p(x) b_x((x), 30, 1) /* 1 if prefetchable */
+#define b_t(x) b_x((x), 29, 1) /* 1 if the address is aliased */
+#define b_ss(x) b_x((x), 24, 2) /* the space code */
+#define b_bbbbbbbb(x) b_x((x), 16, 8) /* bus number */
+#define b_ddddd(x) b_x((x), 11, 5) /* device number */
+#define b_fff(x) b_x((x), 8, 3) /* function number */
+#define b_rrrrrrrr(x) b_x((x), 0, 8) /* register number */
+
+#define SS_M64 3
+#define SS_M32 2
+#define SS_IO 1
+#define SS_CONFIG 0
+
+
+static struct spapr_phb phb;
+
+
+static void rtas_ibm_read_pci_config(struct kvm_cpu *vcpu,
+ uint32_t token, uint32_t nargs,
+ target_ulong args,
+ uint32_t nret, target_ulong rets)
+{
+ uint32_t val = 0;
+ uint64_t buid = ((uint64_t)rtas_ld(vcpu->kvm, args, 1) << 32) | rtas_ld(vcpu->kvm, args, 2);
+ union pci_config_address addr = { .w = rtas_ld(vcpu->kvm, args, 0) };
+ struct pci_device_header *dev = pci__find_dev(addr.device_number);
+ uint32_t size = rtas_ld(vcpu->kvm, args, 3);
+
+ if (buid != phb.buid || !dev || (size > 4)) {
+ phb_dprintf("- cfgRd buid 0x%lx cfg addr 0x%x size %d not found\n",
+ buid, addr.w, size);
+
+ rtas_st(vcpu->kvm, rets, 0, -1);
+ return;
+ }
+ pci__config_rd(vcpu->kvm, addr, &val, size);
+ /* It appears this wants a byteswapped result... */
+ switch (size) {
+ case 4:
+ val = le32_to_cpu(val);
+ break;
+ case 2:
+ val = le16_to_cpu(val>>16);
+ break;
+ case 1:
+ val = val >> 24;
+ break;
+ }
+ phb_dprintf("- cfgRd buid 0x%lx addr 0x%x (/%d): b%d,d%d,f%d,r0x%x, val 0x%x\n",
+ buid, addr.w, size, addr.bus_number, addr.device_number, addr.function_number,
+ addr.register_number, val);
+
+ rtas_st(vcpu->kvm, rets, 0, 0);
+ rtas_st(vcpu->kvm, rets, 1, val);
+}
+
+static void rtas_read_pci_config(struct kvm_cpu *vcpu,
+ uint32_t token, uint32_t nargs,
+ target_ulong args,
+ uint32_t nret, target_ulong rets)
+{
+ uint32_t val;
+ union pci_config_address addr = { .w = rtas_ld(vcpu->kvm, args, 0) };
+ struct pci_device_header *dev = pci__find_dev(addr.device_number);
+ uint32_t size = rtas_ld(vcpu->kvm, args, 1);
+
+ if (!dev || (size > 4)) {
+ rtas_st(vcpu->kvm, rets, 0, -1);
+ return;
+ }
+ pci__config_rd(vcpu->kvm, addr, &val, size);
+ switch (size) {
+ case 4:
+ val = le32_to_cpu(val);
+ break;
+ case 2:
+ val = le16_to_cpu(val>>16); /* We're yuck-endian. */
+ break;
+ case 1:
+ val = val >> 24;
+ break;
+ }
+ phb_dprintf("- cfgRd addr 0x%x size %d, val 0x%x\n", addr.w, size, val);
+ rtas_st(vcpu->kvm, rets, 0, 0);
+ rtas_st(vcpu->kvm, rets, 1, val);
+}
+
+static void rtas_ibm_write_pci_config(struct kvm_cpu *vcpu,
+ uint32_t token, uint32_t nargs,
+ target_ulong args,
+ uint32_t nret, target_ulong rets)
+{
+ uint64_t buid = ((uint64_t)rtas_ld(vcpu->kvm, args, 1) << 32) | rtas_ld(vcpu->kvm, args, 2);
+ union pci_config_address addr = { .w = rtas_ld(vcpu->kvm, args, 0) };
+ struct pci_device_header *dev = pci__find_dev(addr.device_number);
+ uint32_t size = rtas_ld(vcpu->kvm, args, 3);
+ uint32_t val = rtas_ld(vcpu->kvm, args, 4);
+
+ if (buid != phb.buid || !dev || (size > 4)) {
+ phb_dprintf("- cfgWr buid 0x%lx cfg addr 0x%x/%d error (val 0x%x)\n",
+ buid, addr.w, size, val);
+
+ rtas_st(vcpu->kvm, rets, 0, -1);
+ return;
+ }
+ phb_dprintf("- cfgWr buid 0x%lx addr 0x%x (/%d): b%d,d%d,f%d,r0x%x, val 0x%x\n",
+ buid, addr.w, size, addr.bus_number, addr.device_number, addr.function_number,
+ addr.register_number, val);
+ switch (size) {
+ case 4:
+ val = le32_to_cpu(val);
+ break;
+ case 2:
+ val = le16_to_cpu(val) << 16;
+ break;
+ case 1:
+ val = val >> 24;
+ break;
+ }
+ pci__config_wr(vcpu->kvm, addr, &val, size);
+ rtas_st(vcpu->kvm, rets, 0, 0);
+}
+
+static void rtas_write_pci_config(struct kvm_cpu *vcpu,
+ uint32_t token, uint32_t nargs,
+ target_ulong args,
+ uint32_t nret, target_ulong rets)
+{
+ union pci_config_address addr = { .w = rtas_ld(vcpu->kvm, args, 0) };
+ struct pci_device_header *dev = pci__find_dev(addr.device_number);
+ uint32_t size = rtas_ld(vcpu->kvm, args, 1);
+ uint32_t val = rtas_ld(vcpu->kvm, args, 2);
+
+ if (!dev || (size > 4)) {
+ rtas_st(vcpu->kvm, rets, 0, -1);
+ return;
+ }
+
+ phb_dprintf("- cfgWr addr 0x%x (/%d): b%d,d%d,f%d,r0x%x, val 0x%x\n",
+ addr.w, size, addr.bus_number, addr.device_number, addr.function_number,
+ addr.register_number, val);
+ switch (size) {
+ case 4:
+ val = le32_to_cpu(val);
+ break;
+ case 2:
+ val = le16_to_cpu(val) << 16;
+ break;
+ case 1:
+ val = val >> 24;
+ break;
+ }
+ pci__config_wr(vcpu->kvm, addr, &val, size);
+ rtas_st(vcpu->kvm, rets, 0, 0);
+}
+
+void spapr_create_phb(struct kvm *kvm,
+ const char *busname, uint64_t buid,
+ uint64_t mem_win_addr, uint64_t mem_win_size,
+ uint64_t io_win_addr, uint64_t io_win_size)
+{
+ /*
+ * Since kvmtool doesn't really have any concept of buses etc.,
+ * there's nothing to register here. Just register RTAS.
+ */
+ spapr_rtas_register("read-pci-config", rtas_read_pci_config);
+ spapr_rtas_register("write-pci-config", rtas_write_pci_config);
+ spapr_rtas_register("ibm,read-pci-config", rtas_ibm_read_pci_config);
+ spapr_rtas_register("ibm,write-pci-config", rtas_ibm_write_pci_config);
+
+ phb.buid = buid;
+ phb.mem_addr = mem_win_addr;
+ phb.mem_size = mem_win_size;
+ phb.io_addr = io_win_addr;
+ phb.io_size = io_win_size;
+
+ kvm->arch.phb = &phb;
+}
+
+static uint32_t bar_to_ss(unsigned long bar)
+{
+ if ((bar & PCI_BASE_ADDRESS_SPACE) ==
+ PCI_BASE_ADDRESS_SPACE_IO)
+ return SS_IO;
+ else if (bar & PCI_BASE_ADDRESS_MEM_TYPE_64)
+ return SS_M64;
+ else
+ return SS_M32;
+}
+
+static unsigned long bar_to_addr(unsigned long bar)
+{
+ if ((bar & PCI_BASE_ADDRESS_SPACE) ==
+ PCI_BASE_ADDRESS_SPACE_IO)
+ return bar & PCI_BASE_ADDRESS_IO_MASK;
+ else
+ return bar & PCI_BASE_ADDRESS_MEM_MASK;
+}
+
+int spapr_populate_pci_devices(struct kvm *kvm,
+ uint32_t xics_phandle,
+ void *fdt)
+{
+ int bus_off, node_off = 0, devid, fn, i, n, devices;
+ char nodename[256];
+ struct {
+ uint32_t hi;
+ uint64_t addr;
+ uint64_t size;
+ } __attribute__((packed)) reg[PCI_NUM_REGIONS + 1],
+ assigned_addresses[PCI_NUM_REGIONS];
+ uint32_t bus_range[] = { cpu_to_be32(0), cpu_to_be32(0xff) };
+ struct {
+ uint32_t hi;
+ uint64_t child;
+ uint64_t parent;
+ uint64_t size;
+ } __attribute__((packed)) ranges[] = {
+ {
+ cpu_to_be32(b_ss(1)), cpu_to_be64(0),
+ cpu_to_be64(phb.io_addr),
+ cpu_to_be64(phb.io_size),
+ },
+ {
+ cpu_to_be32(b_ss(2)), cpu_to_be64(0),
+ cpu_to_be64(phb.mem_addr),
+ cpu_to_be64(phb.mem_size),
+ },
+ };
+ uint64_t bus_reg[] = { cpu_to_be64(phb.buid), 0 };
+ uint32_t interrupt_map_mask[] = {
+ cpu_to_be32(b_ddddd(-1)|b_fff(-1)), 0x0, 0x0, 0x0};
+ uint32_t interrupt_map[SPAPR_PCI_NUM_LSI][7];
+
+ /* Start populating the FDT */
+ sprintf(nodename, "pci@%" PRIx64, phb.buid);
+ bus_off = fdt_add_subnode(fdt, 0, nodename);
+ if (bus_off < 0) {
+ die("error making bus subnode, %s\n", fdt_strerror(bus_off));
+ return bus_off;
+ }
+
+ /* Write PHB properties */
+ _FDT(fdt_setprop_string(fdt, bus_off, "device_type", "pci"));
+ _FDT(fdt_setprop_string(fdt, bus_off, "compatible", "IBM,Logical_PHB"));
+ _FDT(fdt_setprop_cell(fdt, bus_off, "#address-cells", 0x3));
+ _FDT(fdt_setprop_cell(fdt, bus_off, "#size-cells", 0x2));
+ _FDT(fdt_setprop_cell(fdt, bus_off, "#interrupt-cells", 0x1));
+ _FDT(fdt_setprop(fdt, bus_off, "used-by-rtas", NULL, 0));
+ _FDT(fdt_setprop(fdt, bus_off, "bus-range", &bus_range, sizeof(bus_range)));
+ _FDT(fdt_setprop(fdt, bus_off, "ranges", &ranges, sizeof(ranges)));
+ _FDT(fdt_setprop(fdt, bus_off, "reg", &bus_reg, sizeof(bus_reg)));
+ _FDT(fdt_setprop(fdt, bus_off, "interrupt-map-mask",
+ &interrupt_map_mask, sizeof(interrupt_map_mask)));
+
+ /* Populate PCI devices and allocate IRQs */
+ devices = 0;
+
+ for (devid = 0; devid < PCI_MAX_DEVICES; devid++) {
+ uint32_t *irqmap = interrupt_map[devices];
+ struct pci_device_header *hdr = pci__find_dev(devid);
+
+ if (!hdr)
+ continue;
+
+ fn = 0; /* kvmtool doesn't yet do multifunction devices */
+
+ sprintf(nodename, "pci@%u,%u", devid, fn);
+
+ /* Allocate interrupt from the map */
+ if (devid >= SPAPR_PCI_NUM_LSI) {
+ die("Unexpected behaviour in spapr_populate_pci_devices, "
+ "wrong devid %u\n", devid);
+ }
+ irqmap[0] = cpu_to_be32(b_ddddd(devid)|b_fff(fn));
+ irqmap[1] = 0;
+ irqmap[2] = 0;
+ irqmap[3] = 0;
+ irqmap[4] = cpu_to_be32(xics_phandle);
+ /*
+ * This is nasty; the PCI devs are set up such that their own
+ * header's irq_line indicates the direct XICS IRQ number to
+ * use. There REALLY needs to be a hierarchical system in place
+ * to 'raise' an IRQ on the bridge which indexes/looks up which
+ * XICS IRQ to fire.
+ */
+ irqmap[5] = cpu_to_be32(hdr->irq_line);
+ irqmap[6] = cpu_to_be32(0x8);
+
+ /* Add node to FDT */
+ node_off = fdt_add_subnode(fdt, bus_off, nodename);
+ if (node_off < 0) {
+ die("error making node subnode, %s\n", fdt_strerror(node_off));
+ return node_off;
+ }
+
+ _FDT(fdt_setprop_cell(fdt, node_off, "vendor-id",
+ le16_to_cpu(hdr->vendor_id)));
+ _FDT(fdt_setprop_cell(fdt, node_off, "device-id",
+ le16_to_cpu(hdr->device_id)));
+ _FDT(fdt_setprop_cell(fdt, node_off, "revision-id",
+ hdr->revision_id));
+ _FDT(fdt_setprop_cell(fdt, node_off, "class-code",
+ hdr->class[0] | (hdr->class[1] << 8) | (hdr->class[2] << 16)));
+ _FDT(fdt_setprop_cell(fdt, node_off, "subsystem-id",
+ le16_to_cpu(hdr->subsys_id)));
+ _FDT(fdt_setprop_cell(fdt, node_off, "subsystem-vendor-id",
+ le16_to_cpu(hdr->subsys_vendor_id)));
+
+ /* Config space region comes first */
+ reg[0].hi = cpu_to_be32(
+ b_n(0) |
+ b_p(0) |
+ b_t(0) |
+ b_ss(SS_CONFIG) |
+ b_bbbbbbbb(0) |
+ b_ddddd(devid) |
+ b_fff(fn));
+ reg[0].addr = 0;
+ reg[0].size = 0;
+
+ n = 0;
+ /* Six BARs, no ROM supported, addresses are 32bit */
+ for (i = 0; i < 6; ++i) {
+ if (0 == hdr->bar[i]) {
+ continue;
+ }
+
+ reg[n+1].hi = cpu_to_be32(
+ b_n(0) |
+ b_p(0) |
+ b_t(0) |
+ b_ss(bar_to_ss(le32_to_cpu(hdr->bar[i]))) |
+ b_bbbbbbbb(0) |
+ b_ddddd(devid) |
+ b_fff(fn) |
+ b_rrrrrrrr(bars[i]));
+ reg[n+1].addr = 0;
+ reg[n+1].size = cpu_to_be64(hdr->bar_size[i]);
+
+ assigned_addresses[n].hi = cpu_to_be32(
+ b_n(1) |
+ b_p(0) |
+ b_t(0) |
+ b_ss(bar_to_ss(le32_to_cpu(hdr->bar[i]))) |
+ b_bbbbbbbb(0) |
+ b_ddddd(devid) |
+ b_fff(fn) |
+ b_rrrrrrrr(bars[i]));
+
+ /*
+ * Writing zeroes to assigned_addresses causes the guest kernel to
+ * reassign BARs
+ */
+ assigned_addresses[n].addr = cpu_to_be64(bar_to_addr(le32_to_cpu(hdr->bar[i])));
+ assigned_addresses[n].size = reg[n+1].size;
+
+ ++n;
+ }
+ _FDT(fdt_setprop(fdt, node_off, "reg", reg, sizeof(reg[0])*(n+1)));
+ _FDT(fdt_setprop(fdt, node_off, "assigned-addresses",
+ assigned_addresses,
+ sizeof(assigned_addresses[0])*(n)));
+ _FDT(fdt_setprop_cell(fdt, node_off, "interrupts",
+ hdr->irq_pin));
+
+ /* We don't set ibm,dma-window property as we don't have an IOMMU. */
+
+ ++devices;
+ }
+
+ /* Write interrupt map */
+ _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
+ devices * sizeof(interrupt_map[0])));
+
+ return 0;
+}
--- /dev/null
+/*
+ * SPAPR PHB definitions
+ *
+ * Modifications by Matt Evans <matt@ozlabs.org>, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#ifndef SPAPR_PCI_H
+#define SPAPR_PCI_H
+
+#include "kvm/kvm.h"
+#include "spapr.h"
+#include <inttypes.h>
+
+/* With XICS, we can easily accommodate 1 IRQ per PCI device. */
+
+#define SPAPR_PCI_NUM_LSI 256
+
+struct spapr_phb {
+ uint64_t buid;
+ uint64_t mem_addr;
+ uint64_t mem_size;
+ uint64_t io_addr;
+ uint64_t io_size;
+};
+
+void spapr_create_phb(struct kvm *kvm,
+ const char *busname, uint64_t buid,
+ uint64_t mem_win_addr, uint64_t mem_win_size,
+ uint64_t io_win_addr, uint64_t io_win_size);
+
+int spapr_populate_pci_devices(struct kvm *kvm,
+ uint32_t xics_phandle,
+ void *fdt);
+
+static inline bool spapr_phb_mmio(struct kvm *kvm, u64 phys_addr, u8 *data, u32 len, u8 is_write)
+{
+ if ((phys_addr >= SPAPR_PCI_IO_WIN_ADDR) &&
+ (phys_addr < SPAPR_PCI_IO_WIN_ADDR +
+ SPAPR_PCI_IO_WIN_SIZE)) {
+ return kvm__emulate_io(kvm, phys_addr - SPAPR_PCI_IO_WIN_ADDR,
+ data, is_write ? KVM_EXIT_IO_OUT :
+ KVM_EXIT_IO_IN,
+ len, 1);
+ } else if ((phys_addr >= SPAPR_PCI_MEM_WIN_ADDR) &&
+ (phys_addr < SPAPR_PCI_MEM_WIN_ADDR +
+ SPAPR_PCI_MEM_WIN_SIZE)) {
+ return kvm__emulate_mmio(kvm, phys_addr - SPAPR_PCI_MEM_WIN_ADDR,
+ data, len, is_write);
+ }
+ return false;
+}
+
+#endif
--- /dev/null
+/*
+ * SPAPR base RTAS calls
+ *
+ * Borrowed heavily from QEMU's spapr_rtas.c
+ * Copyright (c) 2010-2011 David Gibson, IBM Corporation.
+ *
+ * Modifications copyright 2011 Matt Evans <matt@ozlabs.org>, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#include "kvm/kvm.h"
+#include "kvm/kvm-cpu.h"
+#include "kvm/util.h"
+#include "kvm/term.h"
+#include "libfdt.h"
+
+#include "spapr.h"
+
+#include <stdio.h>
+#include <assert.h>
+
+#define TOKEN_BASE 0x2000
+#define TOKEN_MAX 0x100
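+
+/*
+ * Worked example, as a reading of the code below rather than a statement of
+ * the PAPR spec: a call stored in rtas_table[i] is advertised to the guest
+ * as token TOKEN_BASE + i. The first handler registered therefore gets
+ * 0x2000, the second 0x2001, and the last usable slot (i = TOKEN_MAX - 1)
+ * gets 0x20ff. spapr_rtas_call() reverses the mapping by subtracting
+ * TOKEN_BASE.
+ */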
+
+#define RTAS_CONSOLE
+
+static struct rtas_call {
+ const char *name;
+ spapr_rtas_fn fn;
+} rtas_table[TOKEN_MAX];
+
+struct rtas_call *rtas_next = rtas_table;
+
+
+static void rtas_display_character(struct kvm_cpu *vcpu,
+ uint32_t token, uint32_t nargs,
+ target_ulong args,
+ uint32_t nret, target_ulong rets)
+{
+ char c = rtas_ld(vcpu->kvm, args, 0);
+ term_putc(&c, 1, 0);
+ rtas_st(vcpu->kvm, rets, 0, 0);
+}
+
+#ifdef RTAS_CONSOLE
+static void rtas_put_term_char(struct kvm_cpu *vcpu,
+ uint32_t token, uint32_t nargs,
+ target_ulong args,
+ uint32_t nret, target_ulong rets)
+{
+ char c = rtas_ld(vcpu->kvm, args, 0);
+
+ if (vcpu->kvm->cfg.active_console == CONSOLE_HV)
+ term_putc(&c, 1, 0);
+
+ rtas_st(vcpu->kvm, rets, 0, 0);
+}
+
+static void rtas_get_term_char(struct kvm_cpu *vcpu,
+ uint32_t token, uint32_t nargs,
+ target_ulong args,
+ uint32_t nret, target_ulong rets)
+{
+ int c;
+
+ if (vcpu->kvm->cfg.active_console == CONSOLE_HV && term_readable(0) &&
+ (c = term_getc(vcpu->kvm, 0)) >= 0) {
+ rtas_st(vcpu->kvm, rets, 0, 0);
+ rtas_st(vcpu->kvm, rets, 1, c);
+ } else {
+ rtas_st(vcpu->kvm, rets, 0, -2);
+ }
+}
+#endif
+
+static void rtas_get_time_of_day(struct kvm_cpu *vcpu,
+ uint32_t token, uint32_t nargs,
+ target_ulong args,
+ uint32_t nret, target_ulong rets)
+{
+ struct tm tm;
+ time_t tnow;
+
+ if (nret != 8) {
+ rtas_st(vcpu->kvm, rets, 0, -3);
+ return;
+ }
+
+ tnow = time(NULL);
+ /* Guest time is currently not offset in any way. */
+ gmtime_r(&tnow, &tm);
+
+ rtas_st(vcpu->kvm, rets, 0, 0); /* Success */
+ rtas_st(vcpu->kvm, rets, 1, tm.tm_year + 1900);
+ rtas_st(vcpu->kvm, rets, 2, tm.tm_mon + 1);
+ rtas_st(vcpu->kvm, rets, 3, tm.tm_mday);
+ rtas_st(vcpu->kvm, rets, 4, tm.tm_hour);
+ rtas_st(vcpu->kvm, rets, 5, tm.tm_min);
+ rtas_st(vcpu->kvm, rets, 6, tm.tm_sec);
+ rtas_st(vcpu->kvm, rets, 7, 0);
+}
+
+static void rtas_set_time_of_day(struct kvm_cpu *vcpu,
+ uint32_t token, uint32_t nargs,
+ target_ulong args,
+ uint32_t nret, target_ulong rets)
+{
+ pr_warning("%s called; TOD set ignored.\n", __FUNCTION__);
+}
+
+static void rtas_power_off(struct kvm_cpu *vcpu,
+ uint32_t token, uint32_t nargs, target_ulong args,
+ uint32_t nret, target_ulong rets)
+{
+ if (nargs != 2 || nret != 1) {
+ rtas_st(vcpu->kvm, rets, 0, -3);
+ return;
+ }
+ kvm_cpu__reboot(vcpu->kvm);
+}
+
+static void rtas_query_cpu_stopped_state(struct kvm_cpu *vcpu,
+ uint32_t token, uint32_t nargs,
+ target_ulong args,
+ uint32_t nret, target_ulong rets)
+{
+ if (nargs != 1 || nret != 2) {
+ rtas_st(vcpu->kvm, rets, 0, -3);
+ return;
+ }
+
+ /*
+ * Can read id = rtas_ld(vcpu->kvm, args, 0), but
+ * we currently start all CPUs. So just return true.
+ */
+ rtas_st(vcpu->kvm, rets, 0, 0);
+ rtas_st(vcpu->kvm, rets, 1, 2);
+}
+
+static void rtas_start_cpu(struct kvm_cpu *vcpu,
+ uint32_t token, uint32_t nargs,
+ target_ulong args,
+ uint32_t nret, target_ulong rets)
+{
+ die(__FUNCTION__);
+}
+
+target_ulong spapr_rtas_call(struct kvm_cpu *vcpu,
+ uint32_t token, uint32_t nargs, target_ulong args,
+ uint32_t nret, target_ulong rets)
+{
+ if ((token >= TOKEN_BASE)
+ && ((token - TOKEN_BASE) < TOKEN_MAX)) {
+ struct rtas_call *call = rtas_table + (token - TOKEN_BASE);
+
+ if (call->fn) {
+ call->fn(vcpu, token, nargs, args, nret, rets);
+ return H_SUCCESS;
+ }
+ }
+
+ /*
+ * HACK: Some Linux early debug code uses RTAS display-character,
+ * but assumes the token value is 0xa (which it is on some real
+ * machines) without looking it up in the device tree. This
+ * special case makes this work
+ */
+ if (token == 0xa) {
+ rtas_display_character(vcpu, 0xa, nargs, args, nret, rets);
+ return H_SUCCESS;
+ }
+
+ hcall_dprintf("Unknown RTAS token 0x%x\n", token);
+ rtas_st(vcpu->kvm, rets, 0, -3);
+ return H_PARAMETER;
+}
+
+void spapr_rtas_register(const char *name, spapr_rtas_fn fn)
+{
+ assert(rtas_next < (rtas_table + TOKEN_MAX));
+
+ rtas_next->name = name;
+ rtas_next->fn = fn;
+
+ rtas_next++;
+}
+
+/*
+ * This is called from the context of an open /rtas node, in order to add
+ * properties for the rtas call tokens.
+ */
+int spapr_rtas_fdt_setup(struct kvm *kvm, void *fdt)
+{
+ int ret;
+ int i;
+
+ for (i = 0; i < TOKEN_MAX; i++) {
+ struct rtas_call *call = &rtas_table[i];
+
+ if (!call->fn) {
+ continue;
+ }
+
+ ret = fdt_property_cell(fdt, call->name, i + TOKEN_BASE);
+
+ if (ret < 0) {
+ pr_warning("Couldn't add rtas token for %s: %s\n",
+ call->name, fdt_strerror(ret));
+ return ret;
+ }
+
+ }
+ return 0;
+}
+
+void register_core_rtas(void)
+{
+ spapr_rtas_register("display-character", rtas_display_character);
+ spapr_rtas_register("get-time-of-day", rtas_get_time_of_day);
+ spapr_rtas_register("set-time-of-day", rtas_set_time_of_day);
+ spapr_rtas_register("power-off", rtas_power_off);
+ spapr_rtas_register("query-cpu-stopped-state",
+ rtas_query_cpu_stopped_state);
+ spapr_rtas_register("start-cpu", rtas_start_cpu);
+#ifdef RTAS_CONSOLE
+ /* These are unused: We do console I/O via hcalls, not rtas. */
+ spapr_rtas_register("put-term-char", rtas_put_term_char);
+ spapr_rtas_register("get-term-char", rtas_get_term_char);
+#endif
+}
--- /dev/null
+/*
+ * PAPR Virtualized Interrupt System, aka ICS/ICP aka xics
+ *
+ * Borrowed heavily from QEMU's xics.c,
+ * Copyright (c) 2010,2011 David Gibson, IBM Corporation.
+ *
+ * Modifications copyright 2011 Matt Evans <matt@ozlabs.org>, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#include "spapr.h"
+#include "xics.h"
+#include "kvm/util.h"
+
+#include <stdio.h>
+#include <malloc.h>
+
+#define XICS_NUM_IRQS 1024
+
+
+/* #define DEBUG_XICS yes */
+#ifdef DEBUG_XICS
+#define xics_dprintf(fmt, ...) \
+ do { fprintf(stderr, fmt, ## __VA_ARGS__); } while (0)
+#else
+#define xics_dprintf(fmt, ...) \
+ do { } while (0)
+#endif
+
+/*
+ * ICP: Presentation layer
+ */
+
+struct icp_server_state {
+ uint32_t xirr;
+ uint8_t pending_priority;
+ uint8_t mfrr;
+ struct kvm_cpu *cpu;
+};
+
+#define XICS_IRQ_OFFSET 16
+#define XISR_MASK 0x00ffffff
+#define CPPR_MASK 0xff000000
+
+#define XISR(ss) (((ss)->xirr) & XISR_MASK)
+#define CPPR(ss) (((ss)->xirr) >> 24)
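+
+/*
+ * Worked example of the XIRR layout implied by the masks above: the top
+ * 8 bits hold the CPPR and the low 24 bits hold the XISR. For
+ * xirr = 0x05000012, CPPR(ss) yields 0x05 and XISR(ss) yields 0x000012.
+ * Numerically lower priorities are more favoured in this file; a source
+ * priority of 0xff is treated as masked.
+ */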
+
+struct ics_state;
+
+struct icp_state {
+ unsigned long nr_servers;
+ struct icp_server_state *ss;
+ struct ics_state *ics;
+};
+
+static void ics_reject(struct ics_state *ics, int nr);
+static void ics_resend(struct ics_state *ics);
+static void ics_eoi(struct ics_state *ics, int nr);
+
+static inline void cpu_irq_raise(struct kvm_cpu *vcpu)
+{
+ xics_dprintf("INT1[%p]\n", vcpu);
+ kvm_cpu__irq(vcpu, POWER7_EXT_IRQ, 1);
+}
+
+static inline void cpu_irq_lower(struct kvm_cpu *vcpu)
+{
+ xics_dprintf("INT0[%p]\n", vcpu);
+ kvm_cpu__irq(vcpu, POWER7_EXT_IRQ, 0);
+}
+
+static void icp_check_ipi(struct icp_state *icp, int server)
+{
+ struct icp_server_state *ss = icp->ss + server;
+
+ if (XISR(ss) && (ss->pending_priority <= ss->mfrr)) {
+ return;
+ }
+
+ if (XISR(ss)) {
+ ics_reject(icp->ics, XISR(ss));
+ }
+
+ ss->xirr = (ss->xirr & ~XISR_MASK) | XICS_IPI;
+ ss->pending_priority = ss->mfrr;
+ cpu_irq_raise(ss->cpu);
+}
+
+static void icp_resend(struct icp_state *icp, int server)
+{
+ struct icp_server_state *ss = icp->ss + server;
+
+ if (ss->mfrr < CPPR(ss)) {
+ icp_check_ipi(icp, server);
+ }
+ ics_resend(icp->ics);
+}
+
+static void icp_set_cppr(struct icp_state *icp, int server, uint8_t cppr)
+{
+ struct icp_server_state *ss = icp->ss + server;
+ uint8_t old_cppr;
+ uint32_t old_xisr;
+
+ old_cppr = CPPR(ss);
+ ss->xirr = (ss->xirr & ~CPPR_MASK) | (cppr << 24);
+
+ if (cppr < old_cppr) {
+ if (XISR(ss) && (cppr <= ss->pending_priority)) {
+ old_xisr = XISR(ss);
+ ss->xirr &= ~XISR_MASK; /* Clear XISR */
+ cpu_irq_lower(ss->cpu);
+ ics_reject(icp->ics, old_xisr);
+ }
+ } else {
+ if (!XISR(ss)) {
+ icp_resend(icp, server);
+ }
+ }
+}
+
+static void icp_set_mfrr(struct icp_state *icp, int nr, uint8_t mfrr)
+{
+ struct icp_server_state *ss = icp->ss + nr;
+
+ ss->mfrr = mfrr;
+ if (mfrr < CPPR(ss)) {
+ icp_check_ipi(icp, nr);
+ }
+}
+
+static uint32_t icp_accept(struct icp_server_state *ss)
+{
+ uint32_t xirr;
+
+ cpu_irq_lower(ss->cpu);
+ xirr = ss->xirr;
+ ss->xirr = ss->pending_priority << 24;
+ return xirr;
+}
+
+static void icp_eoi(struct icp_state *icp, int server, uint32_t xirr)
+{
+ struct icp_server_state *ss = icp->ss + server;
+
+ /* Send EOI -> ICS */
+ ics_eoi(icp->ics, xirr & XISR_MASK);
+ ss->xirr = (ss->xirr & ~CPPR_MASK) | (xirr & CPPR_MASK);
+ if (!XISR(ss)) {
+ icp_resend(icp, server);
+ }
+}
+
+static void icp_irq(struct icp_state *icp, int server, int nr, uint8_t priority)
+{
+ struct icp_server_state *ss = icp->ss + server;
+ xics_dprintf("icp_irq(nr %d, server %d, prio 0x%x)\n", nr, server, priority);
+ if ((priority >= CPPR(ss))
+ || (XISR(ss) && (ss->pending_priority <= priority))) {
+ xics_dprintf("reject %d, CPPR 0x%x, XISR 0x%x, pprio 0x%x, prio 0x%x\n",
+ nr, CPPR(ss), XISR(ss), ss->pending_priority, priority);
+ ics_reject(icp->ics, nr);
+ } else {
+ if (XISR(ss)) {
+ xics_dprintf("reject %d, CPPR 0x%x, XISR 0x%x, pprio 0x%x, prio 0x%x\n",
+ nr, CPPR(ss), XISR(ss), ss->pending_priority, priority);
+ ics_reject(icp->ics, XISR(ss));
+ }
+ ss->xirr = (ss->xirr & ~XISR_MASK) | (nr & XISR_MASK);
+ ss->pending_priority = priority;
+ cpu_irq_raise(ss->cpu);
+ }
+}
+
+/*
+ * ICS: Source layer
+ */
+
+struct ics_irq_state {
+ int server;
+ uint8_t priority;
+ uint8_t saved_priority;
+ unsigned int rejected:1;
+ unsigned int masked_pending:1;
+};
+
+struct ics_state {
+ unsigned int nr_irqs;
+ unsigned int offset;
+ struct ics_irq_state *irqs;
+ struct icp_state *icp;
+};
+
+static int ics_valid_irq(struct ics_state *ics, uint32_t nr)
+{
+ return (nr >= ics->offset)
+ && (nr < (ics->offset + ics->nr_irqs));
+}
+
+static void ics_set_irq_msi(struct ics_state *ics, int srcno, int val)
+{
+ struct ics_irq_state *irq = ics->irqs + srcno;
+
+ if (val) {
+ if (irq->priority == 0xff) {
+ xics_dprintf(" irq pri ff, masked pending\n");
+ irq->masked_pending = 1;
+ } else {
+ icp_irq(ics->icp, irq->server, srcno + ics->offset, irq->priority);
+ }
+ }
+}
+
+static void ics_reject_msi(struct ics_state *ics, int nr)
+{
+ struct ics_irq_state *irq = ics->irqs + nr - ics->offset;
+
+ irq->rejected = 1;
+}
+
+static void ics_resend_msi(struct ics_state *ics)
+{
+ unsigned int i;
+
+ for (i = 0; i < ics->nr_irqs; i++) {
+ struct ics_irq_state *irq = ics->irqs + i;
+
+ /* FIXME: filter by server#? */
+ if (irq->rejected) {
+ irq->rejected = 0;
+ if (irq->priority != 0xff) {
+ icp_irq(ics->icp, irq->server, i + ics->offset, irq->priority);
+ }
+ }
+ }
+}
+
+static void ics_write_xive_msi(struct ics_state *ics, int nr, int server,
+ uint8_t priority)
+{
+ struct ics_irq_state *irq = ics->irqs + nr - ics->offset;
+
+ irq->server = server;
+ irq->priority = priority;
+ xics_dprintf("ics_write_xive_msi(nr %d, server %d, pri 0x%x)\n", nr, server, priority);
+
+ if (!irq->masked_pending || (priority == 0xff)) {
+ return;
+ }
+
+ irq->masked_pending = 0;
+ icp_irq(ics->icp, server, nr, priority);
+}
+
+static void ics_reject(struct ics_state *ics, int nr)
+{
+ ics_reject_msi(ics, nr);
+}
+
+static void ics_resend(struct ics_state *ics)
+{
+ ics_resend_msi(ics);
+}
+
+static void ics_eoi(struct ics_state *ics, int nr)
+{
+}
+
+/*
+ * Exported functions
+ */
+
+static int allocated_irqnum = XICS_IRQ_OFFSET;
+
+/*
+ * xics_alloc_irqnum(): This is hacky. The problem boils down to the PCI device
+ * code, which calls kvm__irq_line(.. pcidev->pci_hdr.irq_line ..) at will.
+ * Each PCI device's IRQ line is allocated by irq__register_device(), which
+ * allocates both an IRQ and a PCI device number.
+ *
+ * In future I'd like to at least mimic some kind of 'upstream IRQ controller'
+ * whereby PCI devices let their PHB know when they want to raise an IRQ, and
+ * that percolates up.
+ *
+ * For now, allocate a real XICS IRQ number and (via irq__register_device) push
+ * that into the config space. That field is only 8 bits wide, though!
+ */
+int xics_alloc_irqnum(void)
+{
+ int irq = allocated_irqnum++;
+
+ if (irq > 255)
+ die("Huge numbers of IRQs aren't supported with the daft kvmtool IRQ system.");
+
+ return irq;
+}
+
+static target_ulong h_cppr(struct kvm_cpu *vcpu,
+ target_ulong opcode, target_ulong *args)
+{
+ target_ulong cppr = args[0];
+
+ xics_dprintf("h_cppr(%lx)\n", cppr);
+ icp_set_cppr(vcpu->kvm->arch.icp, vcpu->cpu_id, cppr);
+ return H_SUCCESS;
+}
+
+static target_ulong h_ipi(struct kvm_cpu *vcpu,
+ target_ulong opcode, target_ulong *args)
+{
+ target_ulong server = args[0];
+ target_ulong mfrr = args[1];
+
+ xics_dprintf("h_ipi(%lx, %lx)\n", server, mfrr);
+ if (server >= vcpu->kvm->arch.icp->nr_servers) {
+ return H_PARAMETER;
+ }
+
+ icp_set_mfrr(vcpu->kvm->arch.icp, server, mfrr);
+ return H_SUCCESS;
+}
+
+static target_ulong h_xirr(struct kvm_cpu *vcpu,
+ target_ulong opcode, target_ulong *args)
+{
+ uint32_t xirr = icp_accept(vcpu->kvm->arch.icp->ss + vcpu->cpu_id);
+
+ xics_dprintf("h_xirr() = %x\n", xirr);
+ args[0] = xirr;
+ return H_SUCCESS;
+}
+
+static target_ulong h_eoi(struct kvm_cpu *vcpu,
+ target_ulong opcode, target_ulong *args)
+{
+ target_ulong xirr = args[0];
+
+ xics_dprintf("h_eoi(%lx)\n", xirr);
+ icp_eoi(vcpu->kvm->arch.icp, vcpu->cpu_id, xirr);
+ return H_SUCCESS;
+}
+
+static void rtas_set_xive(struct kvm_cpu *vcpu, uint32_t token,
+ uint32_t nargs, target_ulong args,
+ uint32_t nret, target_ulong rets)
+{
+ struct ics_state *ics = vcpu->kvm->arch.icp->ics;
+ uint32_t nr, server, priority;
+
+ if ((nargs != 3) || (nret != 1)) {
+ rtas_st(vcpu->kvm, rets, 0, -3);
+ return;
+ }
+
+ nr = rtas_ld(vcpu->kvm, args, 0);
+ server = rtas_ld(vcpu->kvm, args, 1);
+ priority = rtas_ld(vcpu->kvm, args, 2);
+
+ xics_dprintf("rtas_set_xive(%x,%x,%x)\n", nr, server, priority);
+ if (!ics_valid_irq(ics, nr) || (server >= ics->icp->nr_servers)
+ || (priority > 0xff)) {
+ rtas_st(vcpu->kvm, rets, 0, -3);
+ return;
+ }
+
+ ics_write_xive_msi(ics, nr, server, priority);
+
+ rtas_st(vcpu->kvm, rets, 0, 0); /* Success */
+}
+
+static void rtas_get_xive(struct kvm_cpu *vcpu, uint32_t token,
+ uint32_t nargs, target_ulong args,
+ uint32_t nret, target_ulong rets)
+{
+ struct ics_state *ics = vcpu->kvm->arch.icp->ics;
+ uint32_t nr;
+
+ if ((nargs != 1) || (nret != 3)) {
+ rtas_st(vcpu->kvm, rets, 0, -3);
+ return;
+ }
+
+ nr = rtas_ld(vcpu->kvm, args, 0);
+
+ if (!ics_valid_irq(ics, nr)) {
+ rtas_st(vcpu->kvm, rets, 0, -3);
+ return;
+ }
+
+ rtas_st(vcpu->kvm, rets, 0, 0); /* Success */
+ rtas_st(vcpu->kvm, rets, 1, ics->irqs[nr - ics->offset].server);
+ rtas_st(vcpu->kvm, rets, 2, ics->irqs[nr - ics->offset].priority);
+}
+
+static void rtas_int_off(struct kvm_cpu *vcpu, uint32_t token,
+ uint32_t nargs, target_ulong args,
+ uint32_t nret, target_ulong rets)
+{
+ struct ics_state *ics = vcpu->kvm->arch.icp->ics;
+ uint32_t nr;
+
+ if ((nargs != 1) || (nret != 1)) {
+ rtas_st(vcpu->kvm, rets, 0, -3);
+ return;
+ }
+
+ nr = rtas_ld(vcpu->kvm, args, 0);
+
+ if (!ics_valid_irq(ics, nr)) {
+ rtas_st(vcpu->kvm, rets, 0, -3);
+ return;
+ }
+
+ /* ME: QEMU wrote xive_msi here, in #if 0. Deleted. */
+
+ rtas_st(vcpu->kvm, rets, 0, 0); /* Success */
+}
+
+static void rtas_int_on(struct kvm_cpu *vcpu, uint32_t token,
+ uint32_t nargs, target_ulong args,
+ uint32_t nret, target_ulong rets)
+{
+ struct ics_state *ics = vcpu->kvm->arch.icp->ics;
+ uint32_t nr;
+
+ if ((nargs != 1) || (nret != 1)) {
+ rtas_st(vcpu->kvm, rets, 0, -3);
+ return;
+ }
+
+ nr = rtas_ld(vcpu->kvm, args, 0);
+
+ if (!ics_valid_irq(ics, nr)) {
+ rtas_st(vcpu->kvm, rets, 0, -3);
+ return;
+ }
+
+ /* ME: QEMU wrote xive_msi here, in #if 0. Deleted. */
+
+ rtas_st(vcpu->kvm, rets, 0, 0); /* Success */
+}
+
+static int xics_init(struct kvm *kvm)
+{
+ int max_server_num;
+ unsigned int i;
+ struct icp_state *icp;
+ struct ics_state *ics;
+ int j;
+
+ max_server_num = kvm->nrcpus;
+
+ icp = malloc(sizeof(*icp));
+ icp->nr_servers = max_server_num + 1;
+ icp->ss = malloc(icp->nr_servers * sizeof(struct icp_server_state));
+
+ for (i = 0; i < icp->nr_servers; i++) {
+ icp->ss[i].xirr = 0;
+ icp->ss[i].pending_priority = 0;
+ icp->ss[i].cpu = 0;
+ icp->ss[i].mfrr = 0xff;
+ }
+
+ /*
+ * icp->ss[env->cpu_index].cpu is set by CPUs calling in to
+ * xics_cpu_register().
+ */
+
+ ics = malloc(sizeof(*ics));
+ ics->nr_irqs = XICS_NUM_IRQS;
+ ics->offset = XICS_IRQ_OFFSET;
+ ics->irqs = malloc(ics->nr_irqs * sizeof(struct ics_irq_state));
+
+ icp->ics = ics;
+ ics->icp = icp;
+
+ for (i = 0; i < ics->nr_irqs; i++) {
+ ics->irqs[i].server = 0;
+ ics->irqs[i].priority = 0xff;
+ ics->irqs[i].saved_priority = 0xff;
+ ics->irqs[i].rejected = 0;
+ ics->irqs[i].masked_pending = 0;
+ }
+
+ spapr_register_hypercall(H_CPPR, h_cppr);
+ spapr_register_hypercall(H_IPI, h_ipi);
+ spapr_register_hypercall(H_XIRR, h_xirr);
+ spapr_register_hypercall(H_EOI, h_eoi);
+
+ spapr_rtas_register("ibm,set-xive", rtas_set_xive);
+ spapr_rtas_register("ibm,get-xive", rtas_get_xive);
+ spapr_rtas_register("ibm,int-off", rtas_int_off);
+ spapr_rtas_register("ibm,int-on", rtas_int_on);
+
+ for (j = 0; j < kvm->nrcpus; j++) {
+ struct kvm_cpu *vcpu = kvm->cpus[j];
+
+ if (vcpu->cpu_id >= icp->nr_servers)
+ die("Invalid server number for cpuid %ld\n", vcpu->cpu_id);
+
+ icp->ss[vcpu->cpu_id].cpu = vcpu;
+ }
+
+ kvm->arch.icp = icp;
+
+ return 0;
+}
+base_init(xics_init);
+
+
+void kvm__irq_line(struct kvm *kvm, int irq, int level)
+{
+ /*
+ * Route event to ICS, which routes to ICP, which eventually does a
+ * kvm_cpu__irq(vcpu, POWER7_EXT_IRQ, 1)
+ */
+ xics_dprintf("Raising IRQ %d -> %d\n", irq, level);
+ ics_set_irq_msi(kvm->arch.icp->ics, irq - kvm->arch.icp->ics->offset, level);
+}
--- /dev/null
+/*
+ * PAPR Virtualized Interrupt System, aka ICS/ICP aka xics
+ *
+ * Copyright 2011 Matt Evans <matt@ozlabs.org>, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#ifndef XICS_H
+#define XICS_H
+
+#define XICS_IPI 0x2
+
+int xics_alloc_irqnum(void);
+
+#endif
--- /dev/null
+#include "kvm/symbol.h"
+
+#include "kvm/kvm.h"
+
+#include <linux/err.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stdio.h>
+#include <bfd.h>
+
+static bfd *abfd;
+
+int symbol_init(struct kvm *kvm)
+{
+ int ret = 0;
+
+ if (!kvm->vmlinux)
+ return 0;
+
+ bfd_init();
+
+ abfd = bfd_openr(kvm->vmlinux, NULL);
+ if (abfd == NULL) {
+ bfd_error_type err = bfd_get_error();
+
+ switch (err) {
+ case bfd_error_no_memory:
+ ret = -ENOMEM;
+ break;
+ case bfd_error_invalid_target:
+ ret = -EINVAL;
+ break;
+ default:
+ ret = -EFAULT;
+ break;
+ }
+ }
+
+ return ret;
+}
+late_init(symbol_init);
+
+static asymbol *lookup(asymbol **symbols, int nr_symbols, const char *symbol_name)
+{
+ int i, ret;
+
+ ret = -ENOENT;
+
+ for (i = 0; i < nr_symbols; i++) {
+ asymbol *symbol = symbols[i];
+
+ if (!strcmp(bfd_asymbol_name(symbol), symbol_name))
+ return symbol;
+ }
+
+ return ERR_PTR(ret);
+}
+
+char *symbol_lookup(struct kvm *kvm, unsigned long addr, char *sym, size_t size)
+{
+ const char *filename;
+ bfd_vma sym_offset;
+ bfd_vma sym_start;
+ asection *section;
+ unsigned int line;
+ const char *func;
+ long symtab_size;
+ asymbol *symbol;
+ asymbol **syms;
+ int nr_syms, ret;
+
+ ret = -ENOENT;
+ if (!abfd)
+ goto not_found;
+
+ if (!bfd_check_format(abfd, bfd_object))
+ goto not_found;
+
+ symtab_size = bfd_get_symtab_upper_bound(abfd);
+ if (symtab_size <= 0)
+ goto not_found;
+
+ ret = -ENOMEM;
+ syms = malloc(symtab_size);
+ if (!syms)
+ goto not_found;
+
+ nr_syms = bfd_canonicalize_symtab(abfd, syms);
+
+ ret = -ENOENT;
+ section = bfd_get_section_by_name(abfd, ".debug_aranges");
+ if (!section)
+ goto free_syms;
+
+ if (!bfd_find_nearest_line(abfd, section, NULL, addr, &filename, &func, &line))
+ goto free_syms;
+
+ if (!func)
+ goto free_syms;
+
+ symbol = lookup(syms, nr_syms, func);
+ if (IS_ERR(symbol))
+ goto free_syms;
+
+ sym_start = bfd_asymbol_value(symbol);
+
+ sym_offset = addr - sym_start;
+
+ snprintf(sym, size, "%s+%llx (%s:%i)", func, (long long) sym_offset, filename, line);
+
+ sym[size - 1] = '\0';
+
+ free(syms);
+
+ return sym;
+
+free_syms:
+ /* Release the symbol table on every failure path after malloc(). */
+ free(syms);
+not_found:
+ return ERR_PTR(ret);
+}
+
+int symbol_exit(struct kvm *kvm)
+{
+ bfd_boolean ret = TRUE;
+
+ if (abfd)
+ ret = bfd_close(abfd);
+
+ if (ret == TRUE)
+ return 0;
+
+ return -EFAULT;
+}
+late_exit(symbol_exit);
--- /dev/null
+#include <poll.h>
+#include <stdbool.h>
+#include <termios.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/uio.h>
+#include <signal.h>
+#include <pty.h>
+#include <utmp.h>
+
+#include "kvm/read-write.h"
+#include "kvm/term.h"
+#include "kvm/util.h"
+#include "kvm/kvm.h"
+#include "kvm/kvm-cpu.h"
+
+#define TERM_FD_IN 0
+#define TERM_FD_OUT 1
+
+static struct termios orig_term;
+
+int term_escape_char = 0x01; /* ctrl-a is used for escape */
+bool term_got_escape = false;
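+
+/*
+ * Escape handling as implemented by term_getc() below:
+ *   ctrl-a x       calls kvm_cpu__reboot()
+ *   ctrl-a ctrl-a  delivers a single literal ctrl-a (0x01) to the guest
+ * The first ctrl-a of a sequence is swallowed (term_getc() returns -1).
+ */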
+
+int term_fds[4][2];
+
+int term_getc(struct kvm *kvm, int term)
+{
+ unsigned char c;
+
+ if (read_in_full(term_fds[term][TERM_FD_IN], &c, 1) < 0)
+ return -1;
+
+ if (term_got_escape) {
+ term_got_escape = false;
+ if (c == 'x')
+ kvm_cpu__reboot(kvm);
+ if (c == term_escape_char)
+ return c;
+ }
+
+ if (c == term_escape_char) {
+ term_got_escape = true;
+ return -1;
+ }
+
+ return c;
+}
+
+int term_putc(char *addr, int cnt, int term)
+{
+ int written = 0;
+
+ /* Write one byte at a time; return the number of bytes written. */
+ while (written < cnt) {
+ if (write(term_fds[term][TERM_FD_OUT], addr + written, 1) != 1)
+ break;
+ written++;
+ }
+
+ return written;
+}
+
+int term_getc_iov(struct kvm *kvm, struct iovec *iov, int iovcnt, int term)
+{
+ int c;
+
+ c = term_getc(kvm, term);
+
+ if (c < 0)
+ return 0;
+
+ *((char *)iov[0].iov_base) = (char)c;
+
+ return sizeof(char);
+}
+
+int term_putc_iov(struct iovec *iov, int iovcnt, int term)
+{
+ return writev(term_fds[term][TERM_FD_OUT], iov, iovcnt);
+}
+
+bool term_readable(int term)
+{
+ struct pollfd pollfd = (struct pollfd) {
+ .fd = term_fds[term][TERM_FD_IN],
+ .events = POLLIN,
+ .revents = 0,
+ };
+
+ return poll(&pollfd, 1, 0) > 0;
+}
+
+static void term_cleanup(void)
+{
+ int i;
+
+ for (i = 0; i < 4; i++)
+ tcsetattr(term_fds[i][TERM_FD_IN], TCSANOW, &orig_term);
+}
+
+static void term_sig_cleanup(int sig)
+{
+ term_cleanup();
+ signal(sig, SIG_DFL);
+ raise(sig);
+}
+
+void term_set_tty(int term)
+{
+ struct termios orig_term;
+ int master, slave;
+ char new_pty[PATH_MAX];
+
+ if (tcgetattr(STDIN_FILENO, &orig_term) < 0)
+ die("unable to save initial standard input settings");
+
+ orig_term.c_lflag &= ~(ICANON | ECHO | ISIG);
+
+ if (openpty(&master, &slave, new_pty, &orig_term, NULL) < 0)
+ return;
+
+ close(slave);
+
+ pr_info("Assigned terminal %d to pty %s\n", term, new_pty);
+
+ term_fds[term][TERM_FD_IN] = term_fds[term][TERM_FD_OUT] = master;
+}
+
+int tty_parser(const struct option *opt, const char *arg, int unset)
+{
+ int tty = atoi(arg);
+
+ term_set_tty(tty);
+
+ return 0;
+}
+
+int term_init(struct kvm *kvm)
+{
+ struct termios term;
+ int i, r;
+
+ r = tcgetattr(STDIN_FILENO, &orig_term);
+ if (r < 0) {
+ pr_warning("unable to save initial standard input settings");
+ return r;
+ }
+
+
+ term = orig_term;
+ term.c_lflag &= ~(ICANON | ECHO | ISIG);
+ tcsetattr(STDIN_FILENO, TCSANOW, &term);
+
+ for (i = 0; i < 4; i++)
+ if (term_fds[i][TERM_FD_IN] == 0) {
+ term_fds[i][TERM_FD_IN] = STDIN_FILENO;
+ term_fds[i][TERM_FD_OUT] = STDOUT_FILENO;
+ }
+
+ signal(SIGTERM, term_sig_cleanup);
+ atexit(term_cleanup);
+
+ return 0;
+}
+dev_init(term_init);
+
+int term_exit(struct kvm *kvm)
+{
+ return 0;
+}
+dev_exit(term_exit);
--- /dev/null
+all: kernel pit boot
+
+kernel:
+ $(MAKE) -C kernel
+.PHONY: kernel
+
+pit:
+ $(MAKE) -C pit
+.PHONY: pit
+
+boot:
+ $(MAKE) -C boot
+.PHONY: boot
+
+clean:
+ $(MAKE) -C kernel clean
+ $(MAKE) -C pit clean
+ $(MAKE) -C boot clean
+.PHONY: clean
--- /dev/null
+NAME := init
+
+OBJ := $(NAME).o
+
+all: $(NAME).c
+ rm -rf rootfs
+ mkdir rootfs
+ gcc -static init.c -o rootfs/init
+ mkisofs rootfs > boot_test.iso
+
+clean:
+ rm -rf rootfs boot_test.iso
+.PHONY: clean
--- /dev/null
+#include <linux/reboot.h>
+#include <stdio.h>
+#include <sys/reboot.h>
+#include <unistd.h>
+
+int main(int argc, char *argv[])
+{
+ puts("hello, KVM guest!\r");
+
+ reboot(LINUX_REBOOT_CMD_RESTART);
+
+ return 0;
+}
--- /dev/null
+kernel.bin
+kernel.elf
--- /dev/null
+NAME := kernel
+
+BIN := $(NAME).bin
+ELF := $(NAME).elf
+OBJ := $(NAME).o
+
+all: $(BIN)
+
+$(BIN): $(ELF)
+ objcopy -O binary $< $@
+
+$(ELF): $(OBJ)
+ ld -Ttext=0x00 -nostdlib -static $< -o $@
+
+%.o: %.S
+ gcc -nostdinc -c $< -o $@
+
+clean:
+ rm -f $(BIN) $(ELF) $(OBJ)
+.PHONY: clean
--- /dev/null
+Compiling
+---------
+
+You can simply type:
+
+ $ make
+
+to build a 16-bit binary that uses the i8086 instruction set.
+
+Disassembling
+-------------
+
+Use the "-m i8086" command line option with objdump to make sure it knows we're
+dealing with the i8086 instruction set:
+
+ $ objdump -d -m i8086 kernel.elf
--- /dev/null
+ .code16gcc
+ .text
+ .globl _start
+ .type _start, @function
+_start:
+ # "This is probably the largest possible kernel that is bug free." -- Avi Kivity
+ 1:
+ jmp 1b
--- /dev/null
+*.bin
+*.elf
--- /dev/null
+NAME := tick
+
+BIN := $(NAME).bin
+ELF := $(NAME).elf
+OBJ := $(NAME).o
+
+all: $(BIN)
+
+$(BIN): $(ELF)
+ objcopy -O binary $< $@
+
+$(ELF): $(OBJ)
+ ld -Ttext=0x00 -nostdlib -static $< -o $@
+
+%.o: %.S
+ gcc -nostdinc -c $< -o $@
+
+clean:
+ rm -f $(BIN) $(ELF) $(OBJ)
+.PHONY: clean
--- /dev/null
+Compiling
+---------
+
+You can simply type:
+
+ $ make
+
+to build a 16-bit binary that uses the i8086 instruction set.
+
+Disassembling
+-------------
+
+Use the "-m i8086" command line option with objdump to make sure it knows we're
+dealing with the i8086 instruction set:
+
+ $ objdump -d -m i8086 tick.elf
--- /dev/null
+#define IO_PIC 0x20
+#define IRQ_OFFSET 32
+#define IO_PIT 0x40
+#define TIMER_FREQ 1193182
+#define TIMER_DIV(x) ((TIMER_FREQ+(x)/2)/(x))
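+
+/*
+ * Worked example: TIMER_DIV(1000) = (1193182 + 1000/2) / 1000 = 1193 = 0x04a9
+ * in integer arithmetic, so set_pit below programs the low byte 0xa9
+ * (1193 % 256) followed by the high byte 0x04 (1193 / 256).
+ */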
+
+#define TEST_COUNT 0x0200
+
+ .code16gcc
+ .text
+ .globl _start
+ .type _start, @function
+_start:
+/*
+ * fill up noop handlers
+ */
+ xorw %ax, %ax
+ xorw %di, %di
+ movw %ax, %es
+ movw $256, %cx
+fill_noop_idt:
+ movw $noop_handler, %es:(%di)
+ movw %cs, %es:2(%di)
+ add $4, %di
+ loop fill_noop_idt
+
+set_idt:
+ movw $timer_isr, %es:(IRQ_OFFSET*4)
+ movw %cs, %es:(IRQ_OFFSET*4+2)
+
+set_pic:
+ # ICW1
+ mov $0x11, %al
+ mov $(IO_PIC), %dx
+ out %al,%dx
+ # ICW2
+ mov $(IRQ_OFFSET), %al
+ mov $(IO_PIC+1), %dx
+ out %al, %dx
+ # ICW3
+ mov $0x00, %al
+ mov $(IO_PIC+1), %dx
+ out %al, %dx
+ # ICW4
+ mov $0x3, %al
+ mov $(IO_PIC+1), %dx
+ out %al, %dx
+
+set_pit:
+ # set 8254 mode
+ mov $(IO_PIT+3), %dx
+ mov $0x34, %al
+ outb %al, %dx
+ # set 8254 freq 1 kHz
+ mov $(IO_PIT), %dx
+ movb $(TIMER_DIV(1000) % 256), %al
+ outb %al, %dx
+ movb $(TIMER_DIV(1000) / 256), %al
+ outb %al, %dx
+
+enable_irq0:
+ mov $0xfe, %al
+ mov $(IO_PIC+1), %dx
+ out %al, %dx
+ sti
+loop:
+ 1:
+ jmp 1b
+
+test_ok:
+ mov $0x3f8,%dx
+ cs lea msg2, %si
+ mov $(msg2_end-msg2), %cx
+ cs rep/outsb
+
+ /* Reboot by using the i8042 reboot line */
+ mov $0xfe, %al
+ outb %al, $0x64
+
+timer_isr:
+ cli
+ pushaw
+ pushfw
+ mov $0x3f8,%dx
+ mov $0x2e, %al # .
+ out %al,%dx
+ decw count
+ jz test_ok
+ popfw
+ popaw
+ iretw
+
+noop_handler:
+ iretw
+
+count:
+ .word TEST_COUNT
+
+msg2:
+ .asciz "\nTest OK\n"
+msg2_end:
--- /dev/null
+#include "kvm/sdl.h"
+
+#include "kvm/framebuffer.h"
+#include "kvm/i8042.h"
+#include "kvm/util.h"
+#include "kvm/kvm.h"
+#include "kvm/kvm-cpu.h"
+#include "kvm/vesa.h"
+
+#include <SDL/SDL.h>
+#include <pthread.h>
+#include <signal.h>
+#include <linux/err.h>
+
+#define FRAME_RATE 25
+
+#define SCANCODE_UNKNOWN 0
+#define SCANCODE_NORMAL 1
+#define SCANCODE_ESCAPED 2
+#define SCANCODE_KEY_PAUSE 3
+#define SCANCODE_KEY_PRNTSCRN 4
+
+struct set2_scancode {
+ u8 code;
+ u8 type;
+};
+
+#define DEFINE_SC(_code) {\
+ .code = _code,\
+ .type = SCANCODE_NORMAL,\
+}
+
+/* escaped scancodes */
+#define DEFINE_ESC(_code) {\
+ .code = _code,\
+ .type = SCANCODE_ESCAPED,\
+}
+
+static const struct set2_scancode keymap[256] = {
+ [9] = DEFINE_SC(0x76), /* <esc> */
+ [10] = DEFINE_SC(0x16), /* 1 */
+ [11] = DEFINE_SC(0x1e), /* 2 */
+ [12] = DEFINE_SC(0x26), /* 3 */
+ [13] = DEFINE_SC(0x25), /* 4 */
+ [14] = DEFINE_SC(0x2e), /* 5 */
+ [15] = DEFINE_SC(0x36), /* 6 */
+ [16] = DEFINE_SC(0x3d), /* 7 */
+ [17] = DEFINE_SC(0x3e), /* 8 */
+ [18] = DEFINE_SC(0x46), /* 9 */
+ [19] = DEFINE_SC(0x45), /* 0 */
+ [20] = DEFINE_SC(0x4e), /* - */
+ [21] = DEFINE_SC(0x55), /* + */
+ [22] = DEFINE_SC(0x66), /* <backspace> */
+ [23] = DEFINE_SC(0x0d), /* <tab> */
+ [24] = DEFINE_SC(0x15), /* q */
+ [25] = DEFINE_SC(0x1d), /* w */
+ [26] = DEFINE_SC(0x24), /* e */
+ [27] = DEFINE_SC(0x2d), /* r */
+ [28] = DEFINE_SC(0x2c), /* t */
+ [29] = DEFINE_SC(0x35), /* y */
+ [30] = DEFINE_SC(0x3c), /* u */
+ [31] = DEFINE_SC(0x43), /* i */
+ [32] = DEFINE_SC(0x44), /* o */
+ [33] = DEFINE_SC(0x4d), /* p */
+ [34] = DEFINE_SC(0x54), /* [ */
+ [35] = DEFINE_SC(0x5b), /* ] */
+ [36] = DEFINE_SC(0x5a), /* <enter> */
+ [37] = DEFINE_SC(0x14), /* <left ctrl> */
+ [38] = DEFINE_SC(0x1c), /* a */
+ [39] = DEFINE_SC(0x1b), /* s */
+ [40] = DEFINE_SC(0x23), /* d */
+ [41] = DEFINE_SC(0x2b), /* f */
+ [42] = DEFINE_SC(0x34), /* g */
+ [43] = DEFINE_SC(0x33), /* h */
+ [44] = DEFINE_SC(0x3b), /* j */
+ [45] = DEFINE_SC(0x42), /* k */
+ [46] = DEFINE_SC(0x4b), /* l */
+ [47] = DEFINE_SC(0x4c), /* ; */
+ [48] = DEFINE_SC(0x52), /* ' */
+ [49] = DEFINE_SC(0x0e), /* ` */
+ [50] = DEFINE_SC(0x12), /* <left shift> */
+ [51] = DEFINE_SC(0x5d), /* \ */
+ [52] = DEFINE_SC(0x1a), /* z */
+ [53] = DEFINE_SC(0x22), /* x */
+ [54] = DEFINE_SC(0x21), /* c */
+ [55] = DEFINE_SC(0x2a), /* v */
+ [56] = DEFINE_SC(0x32), /* b */
+ [57] = DEFINE_SC(0x31), /* n */
+ [58] = DEFINE_SC(0x3a), /* m */
+ [59] = DEFINE_SC(0x41), /* < */
+ [60] = DEFINE_SC(0x49), /* > */
+ [61] = DEFINE_SC(0x4a), /* / */
+ [62] = DEFINE_SC(0x59), /* <right shift> */
+ [63] = DEFINE_SC(0x7c), /* keypad * */
+ [64] = DEFINE_SC(0x11), /* <left alt> */
+ [65] = DEFINE_SC(0x29), /* <space> */
+
+ [67] = DEFINE_SC(0x05), /* <F1> */
+ [68] = DEFINE_SC(0x06), /* <F2> */
+ [69] = DEFINE_SC(0x04), /* <F3> */
+ [70] = DEFINE_SC(0x0c), /* <F4> */
+ [71] = DEFINE_SC(0x03), /* <F5> */
+ [72] = DEFINE_SC(0x0b), /* <F6> */
+ [73] = DEFINE_SC(0x83), /* <F7> */
+ [74] = DEFINE_SC(0x0a), /* <F8> */
+ [75] = DEFINE_SC(0x01), /* <F9> */
+ [76] = DEFINE_SC(0x09), /* <F10> */
+
+ [79] = DEFINE_SC(0x6c), /* keypad 7 */
+ [80] = DEFINE_SC(0x75), /* keypad 8 */
+ [81] = DEFINE_SC(0x7d), /* keypad 9 */
+ [82] = DEFINE_SC(0x7b), /* keypad - */
+ [83] = DEFINE_SC(0x6b), /* keypad 4 */
+ [84] = DEFINE_SC(0x73), /* keypad 5 */
+ [85] = DEFINE_SC(0x74), /* keypad 6 */
+ [86] = DEFINE_SC(0x79), /* keypad + */
+ [87] = DEFINE_SC(0x69), /* keypad 1 */
+ [88] = DEFINE_SC(0x72), /* keypad 2 */
+ [89] = DEFINE_SC(0x7a), /* keypad 3 */
+ [90] = DEFINE_SC(0x70), /* keypad 0 */
+ [91] = DEFINE_SC(0x71), /* keypad . */
+
+ [94] = DEFINE_SC(0x61), /* <INT 1> */
+ [95] = DEFINE_SC(0x78), /* <F11> */
+ [96] = DEFINE_SC(0x07), /* <F12> */
+
+ [104] = DEFINE_ESC(0x5a), /* keypad <enter> */
+ [105] = DEFINE_ESC(0x14), /* <right ctrl> */
+ [106] = DEFINE_ESC(0x4a), /* keypad / */
+ [108] = DEFINE_ESC(0x11), /* <right alt> */
+ [110] = DEFINE_ESC(0x6c), /* <home> */
+ [111] = DEFINE_ESC(0x75), /* <up> */
+ [112] = DEFINE_ESC(0x7d), /* <page up> */
+ [113] = DEFINE_ESC(0x6b), /* <left> */
+ [114] = DEFINE_ESC(0x74), /* <right> */
+ [115] = DEFINE_ESC(0x69), /* <end> */
+ [116] = DEFINE_ESC(0x72), /* <down> */
+ [117] = DEFINE_ESC(0x7a), /* <page down> */
+ [118] = DEFINE_ESC(0x70), /* <ins> */
+ [119] = DEFINE_ESC(0x71), /* <delete> */
+};
+static bool running, done;
+
+static const struct set2_scancode *to_code(u8 scancode)
+{
+ return &keymap[scancode];
+}
+
+static void key_press(const struct set2_scancode *sc)
+{
+ switch (sc->type) {
+ case SCANCODE_ESCAPED:
+ kbd_queue(0xe0);
+ /* fallthrough */
+ case SCANCODE_NORMAL:
+ kbd_queue(sc->code);
+ break;
+ case SCANCODE_KEY_PAUSE:
+ kbd_queue(0xe1);
+ kbd_queue(0x14);
+ kbd_queue(0x77);
+ kbd_queue(0xe1);
+ kbd_queue(0xf0);
+ kbd_queue(0x14);
+ kbd_queue(0x77);
+ break;
+ case SCANCODE_KEY_PRNTSCRN:
+ kbd_queue(0xe0);
+ kbd_queue(0x12);
+ kbd_queue(0xe0);
+ kbd_queue(0x7c);
+ break;
+ }
+}
+
+static void key_release(const struct set2_scancode *sc)
+{
+ switch (sc->type) {
+ case SCANCODE_ESCAPED:
+ kbd_queue(0xe0);
+ /* fallthrough */
+ case SCANCODE_NORMAL:
+ kbd_queue(0xf0);
+ kbd_queue(sc->code);
+ break;
+ case SCANCODE_KEY_PAUSE:
+ /* nothing to do */
+ break;
+ case SCANCODE_KEY_PRNTSCRN:
+ kbd_queue(0xe0);
+ kbd_queue(0xf0);
+ kbd_queue(0x7c);
+ kbd_queue(0xe0);
+ kbd_queue(0xf0);
+ kbd_queue(0x12);
+ break;
+ }
+}
+
+static void *sdl__thread(void *p)
+{
+ Uint32 rmask, gmask, bmask, amask;
+ struct framebuffer *fb = p;
+ SDL_Surface *guest_screen;
+ SDL_Surface *screen;
+ SDL_Event ev;
+ Uint32 flags;
+
+ if (SDL_Init(SDL_INIT_VIDEO) != 0)
+ die("Unable to initialize SDL");
+
+ rmask = 0x000000ff;
+ gmask = 0x0000ff00;
+ bmask = 0x00ff0000;
+ amask = 0x00000000;
+
+ guest_screen = SDL_CreateRGBSurfaceFrom(fb->mem, fb->width, fb->height, fb->depth, fb->width * fb->depth / 8, rmask, gmask, bmask, amask);
+ if (!guest_screen)
+ die("Unable to create SDL RGB surface");
+
+ flags = SDL_HWSURFACE | SDL_ASYNCBLIT | SDL_HWACCEL | SDL_DOUBLEBUF;
+
+ SDL_WM_SetCaption("KVM tool", "KVM tool");
+
+ screen = SDL_SetVideoMode(fb->width, fb->height, fb->depth, flags);
+ if (!screen)
+ die("Unable to set SDL video mode");
+
+ SDL_EnableKeyRepeat(200, 50);
+
+ while (running) {
+ SDL_BlitSurface(guest_screen, NULL, screen, NULL);
+ SDL_Flip(screen);
+
+ while (SDL_PollEvent(&ev)) {
+ switch (ev.type) {
+ case SDL_KEYDOWN: {
+ const struct set2_scancode *sc = to_code(ev.key.keysym.scancode);
+ if (sc->type == SCANCODE_UNKNOWN) {
+ pr_warning("key '%d' not found in keymap", ev.key.keysym.scancode);
+ break;
+ }
+ key_press(sc);
+ break;
+ }
+ case SDL_KEYUP: {
+ const struct set2_scancode *sc = to_code(ev.key.keysym.scancode);
+ if (sc->type == SCANCODE_UNKNOWN)
+ break;
+ key_release(sc);
+ break;
+ }
+ case SDL_QUIT:
+ goto exit;
+ }
+ }
+
+ SDL_Delay(1000 / FRAME_RATE);
+ }
+
+ if (running == false && done == false) {
+ done = true;
+ return NULL;
+ }
+exit:
+ /* Mark the thread as finished so that sdl__stop() does not spin forever. */
+ done = true;
+ kvm_cpu__reboot(fb->kvm);
+
+ return NULL;
+}
+
+static int sdl__start(struct framebuffer *fb)
+{
+ pthread_t thread;
+
+ running = true;
+
+ if (pthread_create(&thread, NULL, sdl__thread, fb) != 0)
+ return -1;
+
+ return 0;
+}
+
+static int sdl__stop(struct framebuffer *fb)
+{
+ running = false;
+ while (done == false)
+ sleep(0);
+
+ return 0;
+}
+
+static struct fb_target_operations sdl_ops = {
+ .start = sdl__start,
+ .stop = sdl__stop,
+};
+
+int sdl__init(struct kvm *kvm)
+{
+ struct framebuffer *fb;
+
+ if (!kvm->cfg.sdl)
+ return 0;
+
+ fb = vesa__init(kvm);
+ if (IS_ERR(fb)) {
+ pr_err("vesa__init() failed with error %ld\n", PTR_ERR(fb));
+ return PTR_ERR(fb);
+ }
+
+ return fb__attach(fb, &sdl_ops);
+}
+dev_init(sdl__init);
+
+int sdl__exit(struct kvm *kvm)
+{
+ if (kvm->cfg.sdl)
+ return sdl__stop(NULL);
+
+ return 0;
+}
+dev_exit(sdl__exit);
--- /dev/null
+#include "kvm/vnc.h"
+
+#include "kvm/framebuffer.h"
+#include "kvm/i8042.h"
+#include "kvm/vesa.h"
+
+#include <linux/types.h>
+#include <rfb/keysym.h>
+#include <rfb/rfb.h>
+#include <pthread.h>
+#include <linux/err.h>
+
+#define VESA_QUEUE_SIZE 128
+#define VESA_IRQ 14
+
+/*
+ * This "6000" value is pretty much the result of experimentation
+ * It seems that around this value, things update pretty smoothly
+ */
+#define VESA_UPDATE_TIME 6000
+
+/*
+ * We can map the letters and numbers without a fuss,
+ * but the other characters not so much.
+ */
+static char letters[26] = {
+ 0x1c, 0x32, 0x21, 0x23, 0x24, /* a-e */
+ 0x2b, 0x34, 0x33, 0x43, 0x3b, /* f-j */
+ 0x42, 0x4b, 0x3a, 0x31, 0x44, /* k-o */
+ 0x4d, 0x15, 0x2d, 0x1b, 0x2c, /* p-t */
+ 0x3c, 0x2a, 0x1d, 0x22, 0x35, /* u-y */
+ 0x1a,
+};
+
+static rfbScreenInfoPtr server;
+static char num[10] = {
+ 0x45, 0x16, 0x1e, 0x26, 0x25, 0x2e, 0x36, 0x3d, 0x3e, 0x46, /* 0-9 */
+};
+
+/*
+ * This is called when the VNC server receives a key event.
+ * The reason this function is such a beast is that we have
+ * to convert from X11 keysyms (which is what VNC sends; they
+ * match ASCII for printable characters) to PC keyboard
+ * scancodes, which is what Linux expects to get from its
+ * keyboard. The keysyms and the scancode set don't really
+ * mesh in any good way beyond some basics with the letters
+ * and numbers.
+ */
+static void kbd_handle_key(rfbBool down, rfbKeySym key, rfbClientPtr cl)
+{
+ char tosend = 0;
+
+ if (key >= 0x41 && key <= 0x5a)
+ key += 0x20; /* convert to lowercase */
+
+ if (key >= 0x61 && key <= 0x7a) /* a-z */
+ tosend = letters[key - 0x61];
+
+ if (key >= 0x30 && key <= 0x39)
+ tosend = num[key - 0x30];
+
+ switch (key) {
+ case XK_Insert: kbd_queue(0xe0); tosend = 0x70; break;
+ case XK_Delete: kbd_queue(0xe0); tosend = 0x71; break;
+ case XK_Up: kbd_queue(0xe0); tosend = 0x75; break;
+ case XK_Down: kbd_queue(0xe0); tosend = 0x72; break;
+ case XK_Left: kbd_queue(0xe0); tosend = 0x6b; break;
+ case XK_Right: kbd_queue(0xe0); tosend = 0x74; break;
+ case XK_Page_Up: kbd_queue(0xe0); tosend = 0x7d; break;
+ case XK_Page_Down: kbd_queue(0xe0); tosend = 0x7a; break;
+ case XK_Home: kbd_queue(0xe0); tosend = 0x6c; break;
+ case XK_BackSpace: tosend = 0x66; break;
+ case XK_Tab: tosend = 0x0d; break;
+ case XK_Return: tosend = 0x5a; break;
+ case XK_Escape: tosend = 0x76; break;
+ case XK_End: kbd_queue(0xe0); tosend = 0x69; break;
+ case XK_Shift_L: tosend = 0x12; break;
+ case XK_Shift_R: tosend = 0x59; break;
+ case XK_Control_R: kbd_queue(0xe0); /* fallthrough */
+ case XK_Control_L: tosend = 0x14; break;
+ case XK_Alt_R: kbd_queue(0xe0); /* fallthrough */
+ case XK_Alt_L: tosend = 0x11; break;
+ case XK_quoteleft: tosend = 0x0e; break;
+ case XK_minus: tosend = 0x4e; break;
+ case XK_equal: tosend = 0x55; break;
+ case XK_bracketleft: tosend = 0x54; break;
+ case XK_bracketright: tosend = 0x5b; break;
+ case XK_backslash: tosend = 0x5d; break;
+ case XK_Caps_Lock: tosend = 0x58; break;
+ case XK_semicolon: tosend = 0x4c; break;
+ case XK_quoteright: tosend = 0x52; break;
+ case XK_comma: tosend = 0x41; break;
+ case XK_period: tosend = 0x49; break;
+ case XK_slash: tosend = 0x4a; break;
+ case XK_space: tosend = 0x29; break;
+
+ /*
+ * This is where I handle the shifted characters.
+ * They don't really map nicely the way A-Z maps to a-z,
+ * so I'm doing it manually
+ */
+ case XK_exclam: tosend = 0x16; break;
+ case XK_quotedbl: tosend = 0x52; break;
+ case XK_numbersign: tosend = 0x26; break;
+ case XK_dollar: tosend = 0x25; break;
+ case XK_percent: tosend = 0x2e; break;
+ case XK_ampersand: tosend = 0x3d; break;
+ case XK_parenleft: tosend = 0x46; break;
+ case XK_parenright: tosend = 0x45; break;
+ case XK_asterisk: tosend = 0x3e; break;
+ case XK_plus: tosend = 0x55; break;
+ case XK_colon: tosend = 0x4c; break;
+ case XK_less: tosend = 0x41; break;
+ case XK_greater: tosend = 0x49; break;
+ case XK_question: tosend = 0x4a; break;
+ case XK_at: tosend = 0x1e; break;
+ case XK_asciicircum: tosend = 0x36; break;
+ case XK_underscore: tosend = 0x4e; break;
+ case XK_braceleft: tosend = 0x54; break;
+ case XK_braceright: tosend = 0x5b; break;
+ case XK_bar: tosend = 0x5d; break;
+ case XK_asciitilde: tosend = 0x0e; break;
+ default: break;
+ }
+
+ /*
+ * If this is a "key up" event (the user has released the key), we
+ * need to send 0xf0 first.
+ */
+ if (!down && tosend != 0x0)
+ kbd_queue(0xf0);
+
+ if (tosend)
+ kbd_queue(tosend);
+}
+
+/* The previous X and Y coordinates of the mouse */
+static int xlast = -1, ylast = -1;
+
+/*
+ * This function is called by the VNC server whenever a mouse event occurs.
+ */
+static void kbd_handle_ptr(int buttonMask, int x, int y, rfbClientPtr cl)
+{
+ int dx, dy;
+ char b1 = 0x8;
+
+ /* The VNC mask and the PS/2 button encoding are the same */
+ b1 |= buttonMask;
+
+ if (xlast >= 0 && ylast >= 0) {
+ /* The PS/2 mouse sends deltas, not absolutes */
+ dx = x - xlast;
+ dy = ylast - y;
+
+ /* Set overflow bits if needed */
+ if (dy > 255)
+ b1 |= 0x80;
+ if (dx > 255)
+ b1 |= 0x40;
+
+ /* Set negative bits if needed */
+ if (dy < 0)
+ b1 |= 0x20;
+ if (dx < 0)
+ b1 |= 0x10;
+
+ mouse_queue(b1);
+ mouse_queue(dx);
+ mouse_queue(dy);
+ }
+
+ xlast = x;
+ ylast = y;
+ rfbDefaultPtrAddEvent(buttonMask, x, y, cl);
+}
+
+static void *vnc__thread(void *p)
+{
+ struct framebuffer *fb = p;
+ /*
+ * Make a fake argc and argv because the getscreen function
+ * seems to want it.
+ */
+ char argv[1][1] = {{0}};
+ int argc = 1;
+
+ server = rfbGetScreen(&argc, (char **) argv, fb->width, fb->height, 8, 3, 4);
+ server->frameBuffer = fb->mem;
+ server->alwaysShared = TRUE;
+ server->kbdAddEvent = kbd_handle_key;
+ server->ptrAddEvent = kbd_handle_ptr;
+ rfbInitServer(server);
+
+ while (rfbIsActive(server)) {
+ rfbMarkRectAsModified(server, 0, 0, fb->width, fb->height);
+ rfbProcessEvents(server, server->deferUpdateTime * VESA_UPDATE_TIME);
+ }
+ return NULL;
+}
+
+static int vnc__start(struct framebuffer *fb)
+{
+ pthread_t thread;
+
+ if (pthread_create(&thread, NULL, vnc__thread, fb) != 0)
+ return -1;
+
+ return 0;
+}
+
+static int vnc__stop(struct framebuffer *fb)
+{
+ rfbShutdownServer(server, TRUE);
+
+ return 0;
+}
+
+static struct fb_target_operations vnc_ops = {
+ .start = vnc__start,
+ .stop = vnc__stop,
+};
+
+int vnc__init(struct kvm *kvm)
+{
+ struct framebuffer *fb;
+
+ if (!kvm->cfg.vnc)
+ return 0;
+
+ fb = vesa__init(kvm);
+ if (IS_ERR(fb)) {
+ pr_err("vesa__init() failed with error %ld\n", PTR_ERR(fb));
+ return PTR_ERR(fb);
+ }
+
+ return fb__attach(fb, &vnc_ops);
+}
+dev_init(vnc__init);
+
+int vnc__exit(struct kvm *kvm)
+{
+ if (kvm->cfg.vnc)
+ return vnc__stop(NULL);
+
+ return 0;
+}
+dev_exit(vnc__exit);
--- /dev/null
+#!/bin/sh
+
+if [ $# -eq 1 ] ; then
+ OUTPUT=$1
+fi
+
+GVF=${OUTPUT}KVMTOOLS-VERSION-FILE
+
+LF='
+'
+
+# First check if there is a .git to get the version from git describe
+# otherwise try to get the version from the kernel makefile
+if test -d ../../.git -o -f ../../.git &&
+ VN=$(git describe --abbrev=4 HEAD 2>/dev/null) &&
+ case "$VN" in
+ *$LF*) (exit 1) ;;
+ v[0-9]*)
+ git update-index -q --refresh
+ test -z "$(git diff-index --name-only HEAD --)" ||
+ VN="$VN-dirty" ;;
+ esac
+then
+ VN=$(echo "$VN" | sed -e 's/-/./g');
+else
+ VN=$(MAKEFLAGS= make -sC ../.. kernelversion)
+fi
+
+VN=$(expr "$VN" : v*'\(.*\)')
+
+if test -r $GVF
+then
+ VC=$(sed -e 's/^KVMTOOLS_VERSION = //' <$GVF)
+else
+ VC=unset
+fi
+test "$VN" = "$VC" || {
+ echo >&2 "KVMTOOLS_VERSION = $VN"
+ echo "KVMTOOLS_VERSION = $VN" >$GVF
+}
--- /dev/null
+#!/bin/sh
+
+echo "/* Automatically generated by $0 */
+struct cmdname_help
+{
+ char name[16];
+ char help[80];
+};
+
+static struct cmdname_help common_cmds[] = {"
+
+sed -n 's/^lkvm-\([^ \t]*\).*common/\1/p' command-list.txt |
+while read cmd
+do
+ # TODO following sed command should be fixed
+ sed -n '/^NAME/,/^lkvm-'"$cmd"'/ {
+ /NAME/d
+ /--/d
+ s/.*kvm-'"$cmd"' - \(.*\)/ {"'"$cmd"'", "\1"},/
+ p
+ }' "Documentation/kvm-$cmd.txt"
+done
+echo "};"
--- /dev/null
+#include <linux/list.h>
+#include <linux/kernel.h>
+
+#include "kvm/kvm.h"
+#include "kvm/util-init.h"
+
+#define PRIORITY_LISTS 10
+
+static struct hlist_head init_lists[PRIORITY_LISTS];
+static struct hlist_head exit_lists[PRIORITY_LISTS];
+
+int init_list_add(struct init_item *t, int (*init)(struct kvm *),
+ int priority, const char *name)
+{
+ t->init = init;
+ t->fn_name = name;
+ hlist_add_head(&t->n, &init_lists[priority]);
+
+ return 0;
+}
+
+int exit_list_add(struct init_item *t, int (*init)(struct kvm *),
+ int priority, const char *name)
+{
+ t->init = init;
+ t->fn_name = name;
+ hlist_add_head(&t->n, &exit_lists[priority]);
+
+ return 0;
+}
+
+int init_list__init(struct kvm *kvm)
+{
+ unsigned int i;
+ int r = 0;
+ struct hlist_node *n;
+ struct init_item *t;
+
+ for (i = 0; i < ARRAY_SIZE(init_lists); i++)
+ hlist_for_each_entry(t, n, &init_lists[i], n) {
+ r = t->init(kvm);
+ if (r < 0) {
+ pr_warning("Failed init: %s\n", t->fn_name);
+ goto fail;
+ }
+ }
+
+fail:
+ return r;
+}
+
+int init_list__exit(struct kvm *kvm)
+{
+ int i;
+ int r = 0;
+ struct hlist_node *n;
+ struct init_item *t;
+
+ for (i = ARRAY_SIZE(exit_lists) - 1; i >= 0; i--)
+ hlist_for_each_entry(t, n, &exit_lists[i], n) {
+ r = t->init(kvm);
+ if (r < 0) {
+ pr_warning("%s failed.\n", t->fn_name);
+ goto fail;
+ }
+ }
+fail:
+ return r;
+}
--- /dev/null
+#!/bin/sh
+switch=vbr0
+/sbin/ifconfig $1 0.0.0.0 up
+/usr/sbin/brctl addif ${switch} $1
+/usr/sbin/brctl setfd ${switch} 0
+/usr/sbin/brctl stp ${switch} off
--- /dev/null
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <stdbool.h>
+
+/* user defined includes */
+#include <linux/types.h>
+#include <kvm/util.h>
+#include <kvm/parse-options.h>
+#include <kvm/strbuf.h>
+
+#define OPT_SHORT 1
+#define OPT_UNSET 2
+
+static int opterror(const struct option *opt, const char *reason, int flags)
+{
+ if (flags & OPT_SHORT)
+ return pr_err("switch `%c' %s", opt->short_name, reason);
+ if (flags & OPT_UNSET)
+ return pr_err("option `no-%s' %s", opt->long_name, reason);
+ return pr_err("option `%s' %s", opt->long_name, reason);
+}
+
+static int get_arg(struct parse_opt_ctx_t *p, const struct option *opt,
+ int flags, const char **arg)
+{
+ if (p->opt) {
+ *arg = p->opt;
+ p->opt = NULL;
+ } else if ((opt->flags & PARSE_OPT_LASTARG_DEFAULT) && (p->argc == 1 ||
+ **(p->argv + 1) == '-')) {
+ *arg = (const char *)opt->defval;
+ } else if (p->argc > 1) {
+ p->argc--;
+ *arg = *++p->argv;
+ } else
+ return opterror(opt, "requires a value", flags);
+ return 0;
+}
+
+static int readnum(const struct option *opt, int flags,
+ const char *str, char **end)
+{
+ switch (opt->type) {
+ case OPTION_INTEGER:
+ *(int *)opt->value = strtol(str, end, 0);
+ break;
+ case OPTION_UINTEGER:
+ *(unsigned int *)opt->value = strtoul(str, end, 0);
+ break;
+ case OPTION_LONG:
+ *(long *)opt->value = strtol(str, end, 0);
+ break;
+ case OPTION_U64:
+ *(u64 *)opt->value = strtoull(str, end, 0);
+ break;
+ default:
+ return opterror(opt, "invalid numeric conversion", flags);
+ }
+
+ return 0;
+}
+
+static int get_value(struct parse_opt_ctx_t *p,
+ const struct option *opt, int flags)
+{
+ const char *s, *arg = NULL;
+ const int unset = flags & OPT_UNSET;
+
+ if (unset && p->opt)
+ return opterror(opt, "takes no value", flags);
+ if (unset && (opt->flags & PARSE_OPT_NONEG))
+ return opterror(opt, "isn't available", flags);
+
+ if (!(flags & OPT_SHORT) && p->opt) {
+ switch (opt->type) {
+ case OPTION_CALLBACK:
+ if (!(opt->flags & PARSE_OPT_NOARG))
+ break;
+ /* FALLTHROUGH */
+ case OPTION_BOOLEAN:
+ case OPTION_INCR:
+ case OPTION_BIT:
+ case OPTION_SET_UINT:
+ case OPTION_SET_PTR:
+ return opterror(opt, "takes no value", flags);
+ case OPTION_END:
+ case OPTION_ARGUMENT:
+ case OPTION_GROUP:
+ case OPTION_STRING:
+ case OPTION_INTEGER:
+ case OPTION_UINTEGER:
+ case OPTION_LONG:
+ case OPTION_U64:
+ default:
+ break;
+ }
+ }
+
+ switch (opt->type) {
+ case OPTION_BIT:
+ if (unset)
+ *(int *)opt->value &= ~opt->defval;
+ else
+ *(int *)opt->value |= opt->defval;
+ return 0;
+
+ case OPTION_BOOLEAN:
+ *(bool *)opt->value = unset ? false : true;
+ return 0;
+
+ case OPTION_INCR:
+ *(int *)opt->value = unset ? 0 : *(int *)opt->value + 1;
+ return 0;
+
+ case OPTION_SET_UINT:
+ *(unsigned int *)opt->value = unset ? 0 : opt->defval;
+ return 0;
+
+ case OPTION_SET_PTR:
+ *(void **)opt->value = unset ? NULL : (void *)opt->defval;
+ return 0;
+
+ case OPTION_STRING:
+ if (unset)
+ *(const char **)opt->value = NULL;
+ else if (opt->flags & PARSE_OPT_OPTARG && !p->opt)
+ *(const char **)opt->value = (const char *)opt->defval;
+ else
+ return get_arg(p, opt, flags,
+ (const char **)opt->value);
+ return 0;
+
+ case OPTION_CALLBACK:
+ if (unset)
+ return (*opt->callback)(opt, NULL, 1) ? (-1) : 0;
+ if (opt->flags & PARSE_OPT_NOARG)
+ return (*opt->callback)(opt, NULL, 0) ? (-1) : 0;
+ if (opt->flags & PARSE_OPT_OPTARG && !p->opt)
+ return (*opt->callback)(opt, NULL, 0) ? (-1) : 0;
+ if (get_arg(p, opt, flags, &arg))
+ return -1;
+ return (*opt->callback)(opt, arg, 0) ? (-1) : 0;
+
+ case OPTION_INTEGER:
+ if (unset) {
+ *(int *)opt->value = 0;
+ return 0;
+ }
+ if (opt->flags & PARSE_OPT_OPTARG && !p->opt) {
+ *(int *)opt->value = opt->defval;
+ return 0;
+ }
+ if (get_arg(p, opt, flags, &arg))
+ return -1;
+ return readnum(opt, flags, arg, (char **)&s);
+
+ case OPTION_UINTEGER:
+ if (unset) {
+ *(unsigned int *)opt->value = 0;
+ return 0;
+ }
+ if (opt->flags & PARSE_OPT_OPTARG && !p->opt) {
+ *(unsigned int *)opt->value = opt->defval;
+ return 0;
+ }
+ if (get_arg(p, opt, flags, &arg))
+ return -1;
+ return readnum(opt, flags, arg, (char **)&s);
+
+ case OPTION_LONG:
+ if (unset) {
+ *(long *)opt->value = 0;
+ return 0;
+ }
+ if (opt->flags & PARSE_OPT_OPTARG && !p->opt) {
+ *(long *)opt->value = opt->defval;
+ return 0;
+ }
+ if (get_arg(p, opt, flags, &arg))
+ return -1;
+ return readnum(opt, flags, arg, (char **)&s);
+
+ case OPTION_U64:
+ if (unset) {
+ *(u64 *)opt->value = 0;
+ return 0;
+ }
+ if (opt->flags & PARSE_OPT_OPTARG && !p->opt) {
+ *(u64 *)opt->value = opt->defval;
+ return 0;
+ }
+ if (get_arg(p, opt, flags, &arg))
+ return -1;
+ return readnum(opt, flags, arg, (char **)&s);
+
+ case OPTION_END:
+ case OPTION_ARGUMENT:
+ case OPTION_GROUP:
+ default:
+ die("should not happen, someone must be hit on the forehead");
+ }
+}
+
+#define USAGE_OPTS_WIDTH 24
+#define USAGE_GAP 2
+
+static int usage_with_options_internal(const char * const *usagestr,
+ const struct option *opts, int full)
+{
+ if (!usagestr)
+ return PARSE_OPT_HELP;
+
+ fprintf(stderr, "\n usage: %s\n", *usagestr++);
+ while (*usagestr && **usagestr)
+ fprintf(stderr, " or: %s\n", *usagestr++);
+ while (*usagestr) {
+ fprintf(stderr, "%s%s\n",
+ **usagestr ? " " : "",
+ *usagestr);
+ usagestr++;
+ }
+
+ if (opts->type != OPTION_GROUP)
+ fputc('\n', stderr);
+
+ for (; opts->type != OPTION_END; opts++) {
+ size_t pos;
+ int pad;
+
+ if (opts->type == OPTION_GROUP) {
+ fputc('\n', stderr);
+ if (*opts->help)
+ fprintf(stderr, "%s\n", opts->help);
+ continue;
+ }
+ if (!full && (opts->flags & PARSE_OPT_HIDDEN))
+ continue;
+
+ pos = fprintf(stderr, " ");
+ if (opts->short_name)
+ pos += fprintf(stderr, "-%c", opts->short_name);
+ else
+ pos += fprintf(stderr, " ");
+
+ if (opts->long_name && opts->short_name)
+ pos += fprintf(stderr, ", ");
+ if (opts->long_name)
+ pos += fprintf(stderr, "--%s", opts->long_name);
+
+ switch (opts->type) {
+ case OPTION_ARGUMENT:
+ break;
+ case OPTION_LONG:
+ case OPTION_U64:
+ case OPTION_INTEGER:
+ case OPTION_UINTEGER:
+ if (opts->flags & PARSE_OPT_OPTARG)
+ if (opts->long_name)
+ pos += fprintf(stderr, "[=<n>]");
+ else
+ pos += fprintf(stderr, "[<n>]");
+ else
+ pos += fprintf(stderr, " <n>");
+ break;
+ case OPTION_CALLBACK:
+ if (opts->flags & PARSE_OPT_NOARG)
+ break;
+ /* FALLTHROUGH */
+ case OPTION_STRING:
+ if (opts->argh) {
+ if (opts->flags & PARSE_OPT_OPTARG)
+ if (opts->long_name)
+ pos += fprintf(stderr, "[=<%s>]", opts->argh);
+ else
+ pos += fprintf(stderr, "[<%s>]", opts->argh);
+ else
+ pos += fprintf(stderr, " <%s>", opts->argh);
+ } else {
+ if (opts->flags & PARSE_OPT_OPTARG)
+ if (opts->long_name)
+ pos += fprintf(stderr, "[=...]");
+ else
+ pos += fprintf(stderr, "[...]");
+ else
+ pos += fprintf(stderr, " ...");
+ }
+ break;
+ default: /* OPTION_{BIT,BOOLEAN,SET_UINT,SET_PTR} */
+ case OPTION_END:
+ case OPTION_GROUP:
+ case OPTION_BIT:
+ case OPTION_BOOLEAN:
+ case OPTION_INCR:
+ case OPTION_SET_UINT:
+ case OPTION_SET_PTR:
+ break;
+ }
+ if (pos <= USAGE_OPTS_WIDTH)
+ pad = USAGE_OPTS_WIDTH - pos;
+ else {
+ fputc('\n', stderr);
+ pad = USAGE_OPTS_WIDTH;
+ }
+ fprintf(stderr, "%*s%s\n", pad + USAGE_GAP, "", opts->help);
+ }
+ fputc('\n', stderr);
+
+ return PARSE_OPT_HELP;
+}
+
+void usage_with_options(const char * const *usagestr,
+ const struct option *opts)
+{
+ usage_with_options_internal(usagestr, opts, 0);
+ exit(129);
+}
+
+static void check_typos(const char *arg, const struct option *options)
+{
+ if (strlen(arg) < 3)
+ return;
+
+ if (!prefixcmp(arg, "no-")) {
+ pr_err("did you mean `--%s` (with two dashes ?)", arg);
+ exit(129);
+ }
+
+ for (; options->type != OPTION_END; options++) {
+ if (!options->long_name)
+ continue;
+ if (!prefixcmp(options->long_name, arg)) {
+ pr_err("did you mean `--%s` (with two dashes ?)", arg);
+ exit(129);
+ }
+ }
+}
+
+static int parse_options_usage(const char * const *usagestr,
+ const struct option *opts)
+{
+ return usage_with_options_internal(usagestr, opts, 0);
+}
+
+static int parse_short_opt(struct parse_opt_ctx_t *p,
+ const struct option *options)
+{
+ for (; options->type != OPTION_END; options++) {
+ if (options->short_name == *p->opt) {
+ p->opt = p->opt[1] ? p->opt + 1 : NULL;
+ return get_value(p, options, OPT_SHORT);
+ }
+ }
+ return -2;
+}
+
+static int parse_long_opt(struct parse_opt_ctx_t *p, const char *arg,
+ const struct option *options)
+{
+ const char *arg_end = strchr(arg, '=');
+ const struct option *abbrev_option = NULL, *ambiguous_option = NULL;
+ int abbrev_flags = 0, ambiguous_flags = 0;
+
+ if (!arg_end)
+ arg_end = arg + strlen(arg);
+
+ for (; options->type != OPTION_END; options++) {
+ const char *rest;
+ int flags = 0;
+
+ if (!options->long_name)
+ continue;
+
+ rest = skip_prefix(arg, options->long_name);
+ if (options->type == OPTION_ARGUMENT) {
+ if (!rest)
+ continue;
+ if (*rest == '=')
+ return opterror(options, "takes no value",
+ flags);
+ if (*rest)
+ continue;
+ p->out[p->cpidx++] = arg - 2;
+ return 0;
+ }
+ if (!rest) {
+ /* abbreviated? */
+ if (!strncmp(options->long_name, arg, arg_end - arg)) {
+is_abbreviated:
+ if (abbrev_option) {
+ /*
+ * If this is abbreviated, it is
+ * ambiguous. So when there is no
+ * exact match later, we need to
+ * error out.
+ */
+ ambiguous_option = abbrev_option;
+ ambiguous_flags = abbrev_flags;
+ }
+ if (!(flags & OPT_UNSET) && *arg_end)
+ p->opt = arg_end + 1;
+ abbrev_option = options;
+ abbrev_flags = flags;
+ continue;
+ }
+ /* negated and abbreviated very much? */
+ if (!prefixcmp("no-", arg)) {
+ flags |= OPT_UNSET;
+ goto is_abbreviated;
+ }
+ /* negated? */
+ if (strncmp(arg, "no-", 3))
+ continue;
+ flags |= OPT_UNSET;
+ rest = skip_prefix(arg + 3, options->long_name);
+ /* abbreviated and negated? */
+ if (!rest && !prefixcmp(options->long_name, arg + 3))
+ goto is_abbreviated;
+ if (!rest)
+ continue;
+ }
+ if (*rest) {
+ if (*rest != '=')
+ continue;
+ p->opt = rest + 1;
+ }
+ return get_value(p, options, flags);
+ }
+
+ if (ambiguous_option)
+ return pr_err("Ambiguous option: %s "
+ "(could be --%s%s or --%s%s)",
+ arg,
+ (ambiguous_flags & OPT_UNSET) ? "no-" : "",
+ ambiguous_option->long_name,
+ (abbrev_flags & OPT_UNSET) ? "no-" : "",
+ abbrev_option->long_name);
+ if (abbrev_option)
+ return get_value(p, abbrev_option, abbrev_flags);
+ return -2;
+}
+
+
+static void parse_options_start(struct parse_opt_ctx_t *ctx, int argc,
+ const char **argv, int flags)
+{
+ memset(ctx, 0, sizeof(*ctx));
+ ctx->argc = argc;
+ ctx->argv = argv;
+ ctx->out = argv;
+ ctx->cpidx = ((flags & PARSE_OPT_KEEP_ARGV0) != 0);
+ ctx->flags = flags;
+ if ((flags & PARSE_OPT_KEEP_UNKNOWN) &&
+ (flags & PARSE_OPT_STOP_AT_NON_OPTION))
+ die("STOP_AT_NON_OPTION and KEEP_UNKNOWN don't go together");
+}
+
+static int parse_options_end(struct parse_opt_ctx_t *ctx)
+{
+ memmove(ctx->out + ctx->cpidx, ctx->argv, ctx->argc * sizeof(*ctx->out));
+ ctx->out[ctx->cpidx + ctx->argc] = NULL;
+ return ctx->cpidx + ctx->argc;
+}
+
+
+static int parse_options_step(struct parse_opt_ctx_t *ctx,
+ const struct option *options, const char * const usagestr[])
+{
+ int internal_help = !(ctx->flags & PARSE_OPT_NO_INTERNAL_HELP);
+
+ /* we must reset ->opt; an unknown short option leaves it dangling */
+ ctx->opt = NULL;
+
+ for (; ctx->argc; ctx->argc--, ctx->argv++) {
+ const char *arg = ctx->argv[0];
+
+ if (*arg != '-' || !arg[1]) {
+ if (ctx->flags & PARSE_OPT_STOP_AT_NON_OPTION)
+ break;
+ ctx->out[ctx->cpidx++] = ctx->argv[0];
+ continue;
+ }
+
+ if (arg[1] != '-') {
+ ctx->opt = arg + 1;
+ if (internal_help && *ctx->opt == 'h')
+ return parse_options_usage(usagestr, options);
+ switch (parse_short_opt(ctx, options)) {
+ case -1:
+ return parse_options_usage(usagestr, options);
+ case -2:
+ goto unknown;
+ default:
+ break;
+ }
+ if (ctx->opt)
+ check_typos(arg + 1, options);
+ while (ctx->opt) {
+ if (internal_help && *ctx->opt == 'h')
+ return parse_options_usage(usagestr,
+ options);
+ switch (parse_short_opt(ctx, options)) {
+ case -1:
+ return parse_options_usage(usagestr,
+ options);
+ case -2:
+ /* fake a short option thing to hide
+ * the fact that we may have
+ * started to parse aggregated stuff
+ *
+ * This is leaky, too bad.
+ */
+ ctx->argv[0] = strdup(ctx->opt - 1);
+ *(char *)ctx->argv[0] = '-';
+ goto unknown;
+ default:
+ break;
+ }
+ }
+ continue;
+ }
+
+ if (!arg[2]) { /* "--" */
+ if (!(ctx->flags & PARSE_OPT_KEEP_DASHDASH)) {
+ ctx->argc--;
+ ctx->argv++;
+ }
+ break;
+ }
+
+ if (internal_help && !strcmp(arg + 2, "help-all"))
+ return usage_with_options_internal(usagestr, options,
+ 1);
+ if (internal_help && !strcmp(arg + 2, "help"))
+ return parse_options_usage(usagestr, options);
+ switch (parse_long_opt(ctx, arg + 2, options)) {
+ case -1:
+ return parse_options_usage(usagestr, options);
+ case -2:
+ goto unknown;
+ default:
+ break;
+ }
+ continue;
+unknown:
+ if (!(ctx->flags & PARSE_OPT_KEEP_UNKNOWN))
+ return PARSE_OPT_UNKNOWN;
+ ctx->out[ctx->cpidx++] = ctx->argv[0];
+ ctx->opt = NULL;
+ }
+ return PARSE_OPT_DONE;
+}
+
+int parse_options(int argc, const char **argv, const struct option *options,
+ const char * const usagestr[], int flags)
+{
+ struct parse_opt_ctx_t ctx;
+
+ parse_options_start(&ctx, argc, argv, flags);
+ switch (parse_options_step(&ctx, options, usagestr)) {
+ case PARSE_OPT_HELP:
+ exit(129);
+ case PARSE_OPT_DONE:
+ break;
+ default: /* PARSE_OPT_UNKNOWN */
+ if (ctx.argv[0][1] == '-') {
+ pr_err("unknown option `%s'", ctx.argv[0] + 2);
+ } else {
+ pr_err("unknown switch `%c'", *ctx.opt);
+ }
+ usage_with_options(usagestr, options);
+ }
+
+ return parse_options_end(&ctx);
+}
--- /dev/null
+#include <kvm/rbtree-interval.h>
+#include <stddef.h>
+#include <errno.h>
+
+struct rb_int_node *rb_int_search_single(struct rb_root *root, u64 point)
+{
+ struct rb_node *node = root->rb_node;
+ struct rb_node *lowest = NULL;
+
+ while (node) {
+ struct rb_int_node *cur = rb_int(node);
+
+ if (node->rb_left && (rb_int(node->rb_left)->max_high > point)) {
+ node = node->rb_left;
+ } else if (cur->low <= point && cur->high > point) {
+ lowest = node;
+ break;
+ } else if (point > cur->low) {
+ node = node->rb_right;
+ } else {
+ break;
+ }
+ }
+
+ if (lowest == NULL)
+ return NULL;
+
+ return rb_int(lowest);
+}
+
+struct rb_int_node *rb_int_search_range(struct rb_root *root, u64 low, u64 high)
+{
+ struct rb_int_node *range;
+
+ range = rb_int_search_single(root, low);
+ if (range == NULL)
+ return NULL;
+
+ /* Verify that 'high' does not extend past the end of the range containing 'low'. */
+ if (range->high < high)
+ return NULL;
+
+ return range;
+}
+
+static void update_node_max_high(struct rb_node *node, void *arg)
+{
+ struct rb_int_node *i_node = rb_int(node);
+
+ i_node->max_high = i_node->high;
+
+ if (node->rb_left)
+ i_node->max_high = max(i_node->max_high, rb_int(node->rb_left)->max_high);
+ if (node->rb_right)
+ i_node->max_high = max(i_node->max_high, rb_int(node->rb_right)->max_high);
+}
+
+int rb_int_insert(struct rb_root *root, struct rb_int_node *i_node)
+{
+ struct rb_node **node = &(root->rb_node), *parent = NULL;
+
+ while (*node) {
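+ struct rb_int_node *cur = rb_int(*node);
+
+ parent = *node;
+ /* Compare the u64 keys directly; a difference truncated to int can misorder them. */
+ if (i_node->low < cur->low)
+ node = &((*node)->rb_left);
+ else if (i_node->low > cur->low)
+ node = &((*node)->rb_right);
+ else
+ return -EEXIST;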
+ }
+
+ rb_link_node(&i_node->node, parent, node);
+ rb_insert_color(&i_node->node, root);
+
+ rb_augment_insert(&i_node->node, update_node_max_high, NULL);
+ return 0;
+}
+
+void rb_int_erase(struct rb_root *root, struct rb_int_node *node)
+{
+ struct rb_node *deepest;
+
+ deepest = rb_augment_erase_begin(&node->node);
+ rb_erase(&node->node, root);
+ rb_augment_erase_end(deepest, update_node_max_high, NULL);
+
+}
--- /dev/null
+#include "kvm/read-write.h"
+
+#include <sys/types.h>
+#include <sys/uio.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+
+/* Same as read(2) except that this function never returns EAGAIN or EINTR. */
+ssize_t xread(int fd, void *buf, size_t count)
+{
+ ssize_t nr;
+
+restart:
+ nr = read(fd, buf, count);
+ if ((nr < 0) && ((errno == EAGAIN) || (errno == EINTR)))
+ goto restart;
+
+ return nr;
+}
+
+/* Same as write(2) except that this function never returns EAGAIN or EINTR. */
+ssize_t xwrite(int fd, const void *buf, size_t count)
+{
+ ssize_t nr;
+
+restart:
+ nr = write(fd, buf, count);
+ if ((nr < 0) && ((errno == EAGAIN) || (errno == EINTR)))
+ goto restart;
+
+ return nr;
+}
+
+ssize_t read_in_full(int fd, void *buf, size_t count)
+{
+ ssize_t total = 0;
+ char *p = buf;
+
+ while (count > 0) {
+ ssize_t nr;
+
+ nr = xread(fd, p, count);
+ if (nr <= 0) {
+ if (total > 0)
+ return total;
+
+ return -1;
+ }
+
+ count -= nr;
+ total += nr;
+ p += nr;
+ }
+
+ return total;
+}
+
+ssize_t write_in_full(int fd, const void *buf, size_t count)
+{
+ const char *p = buf;
+ ssize_t total = 0;
+
+ while (count > 0) {
+ ssize_t nr;
+
+ nr = xwrite(fd, p, count);
+ if (nr < 0)
+ return -1;
+ if (nr == 0) {
+ errno = ENOSPC;
+ return -1;
+ }
+ count -= nr;
+ total += nr;
+ p += nr;
+ }
+
+ return total;
+}
+
+/* Same as pread(2) except that this function never returns EAGAIN or EINTR. */
+ssize_t xpread(int fd, void *buf, size_t count, off_t offset)
+{
+ ssize_t nr;
+
+restart:
+ nr = pread(fd, buf, count, offset);
+ if ((nr < 0) && ((errno == EAGAIN) || (errno == EINTR)))
+ goto restart;
+
+ return nr;
+}
+
+/* Same as pwrite(2) except that this function never returns EAGAIN or EINTR. */
+ssize_t xpwrite(int fd, const void *buf, size_t count, off_t offset)
+{
+ ssize_t nr;
+
+restart:
+ nr = pwrite(fd, buf, count, offset);
+ if ((nr < 0) && ((errno == EAGAIN) || (errno == EINTR)))
+ goto restart;
+
+ return nr;
+}
+
+ssize_t pread_in_full(int fd, void *buf, size_t count, off_t offset)
+{
+ ssize_t total = 0;
+ char *p = buf;
+
+ while (count > 0) {
+ ssize_t nr;
+
+ nr = xpread(fd, p, count, offset);
+ if (nr <= 0) {
+ if (total > 0)
+ return total;
+
+ return -1;
+ }
+
+ count -= nr;
+ total += nr;
+ p += nr;
+ offset += nr;
+ }
+
+ return total;
+}
+
+ssize_t pwrite_in_full(int fd, const void *buf, size_t count, off_t offset)
+{
+ const char *p = buf;
+ ssize_t total = 0;
+
+ while (count > 0) {
+ ssize_t nr;
+
+ nr = xpwrite(fd, p, count, offset);
+ if (nr < 0)
+ return -1;
+ if (nr == 0) {
+ errno = ENOSPC;
+ return -1;
+ }
+ count -= nr;
+ total += nr;
+ p += nr;
+ offset += nr;
+ }
+
+ return total;
+}
+
+/* Same as readv(2) except that this function never returns EAGAIN or EINTR. */
+ssize_t xreadv(int fd, const struct iovec *iov, int iovcnt)
+{
+ ssize_t nr;
+
+restart:
+ nr = readv(fd, iov, iovcnt);
+ if ((nr < 0) && ((errno == EAGAIN) || (errno == EINTR)))
+ goto restart;
+
+ return nr;
+}
+
+/* Same as writev(2) except that this function never returns EAGAIN or EINTR. */
+ssize_t xwritev(int fd, const struct iovec *iov, int iovcnt)
+{
+ ssize_t nr;
+
+restart:
+ nr = writev(fd, iov, iovcnt);
+ if ((nr < 0) && ((errno == EAGAIN) || (errno == EINTR)))
+ goto restart;
+
+ return nr;
+}
+
+static inline ssize_t get_iov_size(const struct iovec *iov, int iovcnt)
+{
+ size_t size = 0;
+ while (iovcnt--)
+ size += (iov++)->iov_len;
+
+ return size;
+}
+
+static inline void shift_iovec(const struct iovec **iov, int *iovcnt,
+ size_t nr, ssize_t *total, size_t *count, off_t *offset)
+{
+ while (*iovcnt > 0 && nr >= (*iov)->iov_len) {
+ nr -= (*iov)->iov_len;
+ *total += (*iov)->iov_len;
+ *count -= (*iov)->iov_len;
+ if (offset)
+ *offset += (*iov)->iov_len;
+ (*iovcnt)--;
+ (*iov)++;
+ }
+}
+
+ssize_t readv_in_full(int fd, const struct iovec *iov, int iovcnt)
+{
+ ssize_t total = 0;
+ size_t count = get_iov_size(iov, iovcnt);
+
+ while (count > 0) {
+ ssize_t nr;
+
+ nr = xreadv(fd, iov, iovcnt);
+ if (nr <= 0) {
+ if (total > 0)
+ return total;
+
+ return -1;
+ }
+
+ shift_iovec(&iov, &iovcnt, nr, &total, &count, NULL);
+ }
+
+ return total;
+}
+
+ssize_t writev_in_full(int fd, const struct iovec *iov, int iovcnt)
+{
+ ssize_t total = 0;
+ size_t count = get_iov_size(iov, iovcnt);
+
+ while (count > 0) {
+ ssize_t nr;
+
+ nr = xwritev(fd, iov, iovcnt);
+ if (nr < 0)
+ return -1;
+ if (nr == 0) {
+ errno = ENOSPC;
+ return -1;
+ }
+
+ shift_iovec(&iov, &iovcnt, nr, &total, &count, NULL);
+ }
+
+ return total;
+}
+
+/* Same as preadv(2) except that this function never returns EAGAIN or EINTR. */
+ssize_t xpreadv(int fd, const struct iovec *iov, int iovcnt, off_t offset)
+{
+ ssize_t nr;
+
+restart:
+ nr = preadv(fd, iov, iovcnt, offset);
+ if ((nr < 0) && ((errno == EAGAIN) || (errno == EINTR)))
+ goto restart;
+
+ return nr;
+}
+
+/* Same as pwritev(2) except that this function never returns EAGAIN or EINTR. */
+ssize_t xpwritev(int fd, const struct iovec *iov, int iovcnt, off_t offset)
+{
+ ssize_t nr;
+
+restart:
+ nr = pwritev(fd, iov, iovcnt, offset);
+ if ((nr < 0) && ((errno == EAGAIN) || (errno == EINTR)))
+ goto restart;
+
+ return nr;
+}
+
+ssize_t preadv_in_full(int fd, const struct iovec *iov, int iovcnt, off_t offset)
+{
+ ssize_t total = 0;
+ size_t count = get_iov_size(iov, iovcnt);
+
+ while (count > 0) {
+ ssize_t nr;
+
+ nr = xpreadv(fd, iov, iovcnt, offset);
+ if (nr <= 0) {
+ if (total > 0)
+ return total;
+
+ return -1;
+ }
+
+ shift_iovec(&iov, &iovcnt, nr, &total, &count, &offset);
+ }
+
+ return total;
+}
+
+ssize_t pwritev_in_full(int fd, const struct iovec *iov, int iovcnt, off_t offset)
+{
+ ssize_t total = 0;
+ size_t count = get_iov_size(iov, iovcnt);
+
+ while (count > 0) {
+ ssize_t nr;
+
+ nr = xpwritev(fd, iov, iovcnt, offset);
+ if (nr < 0)
+ return -1;
+ if (nr == 0) {
+ errno = ENOSPC;
+ return -1;
+ }
+
+ shift_iovec(&iov, &iovcnt, nr, &total, &count, &offset);
+ }
+
+ return total;
+}
+
+#ifdef CONFIG_HAS_AIO
+int aio_pwritev(io_context_t ctx, struct iocb *iocb, int fd, const struct iovec *iov, int iovcnt,
+ off_t offset, int ev, void *param)
+{
+ struct iocb *ios[1] = { iocb };
+ int ret;
+
+ io_prep_pwritev(iocb, fd, iov, iovcnt, offset);
+ io_set_eventfd(iocb, ev);
+ iocb->data = param;
+
+restart:
+ ret = io_submit(ctx, 1, ios);
+ if (ret == -EAGAIN)
+ goto restart;
+ return ret;
+}
+
+int aio_preadv(io_context_t ctx, struct iocb *iocb, int fd, const struct iovec *iov, int iovcnt,
+ off_t offset, int ev, void *param)
+{
+ struct iocb *ios[1] = { iocb };
+ int ret;
+
+ io_prep_preadv(iocb, fd, iov, iovcnt, offset);
+ io_set_eventfd(iocb, ev);
+ iocb->data = param;
+
+restart:
+ ret = io_submit(ctx, 1, ios);
+ if (ret == -EAGAIN)
+ goto restart;
+ return ret;
+}
+#endif
--- /dev/null
+#!/bin/bash
+#
+# Author: Amos Kong <kongjianjun@gmail.com>
+# Date: Apr 14, 2011
+# Description: this script creates or deletes a private bridge and
+# launches a DHCP server on the bridge using dnsmasq.
+#
+# @ ./set_private_br.sh $bridge_name $subnet_prefix
+# @ ./set_private_br.sh vbr0 192.168.33
+
+brname='vbr0'
+subnet='192.168.33'
+
+add_br()
+{
+ echo "add new private bridge: $brname"
+ /usr/sbin/brctl addbr $brname
+ echo 1 > /proc/sys/net/ipv6/conf/$brname/disable_ipv6
+ echo 1 > /proc/sys/net/ipv4/ip_forward
+ /usr/sbin/brctl stp $brname on
+ /usr/sbin/brctl setfd $brname 0
+ ifconfig $brname $subnet.1
+ ifconfig $brname up
+ # Add a NAT rule so that guests can reach the public network
+ iptables -t nat -A POSTROUTING -s $subnet.254/24 ! -d $subnet.254/24 -j MASQUERADE
+ /etc/init.d/dnsmasq stop
+ /etc/init.d/tftpd-hpa stop 2>/dev/null
+ dnsmasq --strict-order --bind-interfaces --listen-address $subnet.1 --dhcp-range $subnet.1,$subnet.254 $tftp_cmd
+}
+
+del_br()
+{
+ echo "cleanup bridge setup"
+ kill -9 `pgrep dnsmasq|tail -1`
+ ifconfig $brname down
+ /usr/sbin/brctl delbr $brname
+ iptables -t nat -D POSTROUTING -s $subnet.254/24 ! -d $subnet.254/24 -j MASQUERADE
+}
+
+
+if [ $# = 0 ]; then
+ del_br 2>/dev/null
+ exit
+fi
+if [ $# -ge 1 ]; then
+ brname="$1"
+fi
+if [ $# = 2 ]; then
+ subnet="$2"
+fi
+add_br
--- /dev/null
+
+/* user defined headers */
+#include <kvm/util.h>
+#include <kvm/strbuf.h>
+
+int prefixcmp(const char *str, const char *prefix)
+{
+ for (; ; str++, prefix++) {
+ if (!*prefix)
+ return 0;
+ else if (*str != *prefix)
+ return (unsigned char)*prefix - (unsigned char)*str;
+ }
+}
+
+/**
+ * strlcat - Append a length-limited, %NUL-terminated string to another
+ * @dest: The string to be appended to
+ * @src: The string to append to it
+ * @count: The size of the destination buffer.
+ */
+size_t strlcat(char *dest, const char *src, size_t count)
+{
+ size_t dsize = strlen(dest);
+ size_t len = strlen(src);
+ size_t res = dsize + len;
+
+ DIE_IF(dsize >= count);
+
+ dest += dsize;
+ count -= dsize;
+ if (len >= count)
+ len = count - 1;
+
+ memcpy(dest, src, len);
+ dest[len] = 0;
+
+ return res;
+}
+
+/**
+ * strlcpy - Copy a %NUL terminated string into a sized buffer
+ * @dest: Where to copy the string to
+ * @src: Where to copy the string from
+ * @size: size of destination buffer
+ *
+ * Compatible with *BSD: the result is always a valid
+ * NUL-terminated string that fits in the buffer (unless,
+ * of course, the buffer size is zero). It does not pad
+ * out the result like strncpy() does.
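+ *
+ * Example (an illustrative sketch, not part of the original comment;
+ * buf and n are hypothetical names):
+ *
+ * char buf[8];
+ * size_t n = strlcpy(buf, "hello, world", sizeof(buf));
+ *
+ * buf then holds the seven characters "hello, " plus the terminating
+ * NUL, and n is 12, the length of the source. A return value greater
+ * than or equal to the buffer size means the copy was truncated.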
+ */
+size_t strlcpy(char *dest, const char *src, size_t size)
+{
+ size_t ret = strlen(src);
+
+ if (size) {
+ size_t len = (ret >= size) ? size - 1 : ret;
+ memcpy(dest, src, len);
+ dest[len] = '\0';
+ }
+ return ret;
+}
--- /dev/null
+#include "kvm/threadpool.h"
+#include "kvm/mutex.h"
+#include "kvm/kvm.h"
+
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <pthread.h>
+#include <stdbool.h>
+
+static pthread_mutex_t job_mutex = PTHREAD_MUTEX_INITIALIZER;
+static pthread_mutex_t thread_mutex = PTHREAD_MUTEX_INITIALIZER;
+static pthread_cond_t job_cond = PTHREAD_COND_INITIALIZER;
+
+static LIST_HEAD(head);
+
+static pthread_t *threads;
+static long threadcount;
+static bool running;
+
+static struct thread_pool__job *thread_pool__job_pop_locked(void)
+{
+ struct thread_pool__job *job;
+
+ if (list_empty(&head))
+ return NULL;
+
+ job = list_first_entry(&head, struct thread_pool__job, queue);
+ list_del(&job->queue);
+
+ return job;
+}
+
+static void thread_pool__job_push_locked(struct thread_pool__job *job)
+{
+ list_add_tail(&job->queue, &head);
+}
+
+static struct thread_pool__job *thread_pool__job_pop(void)
+{
+ struct thread_pool__job *job;
+
+ mutex_lock(&job_mutex);
+ job = thread_pool__job_pop_locked();
+ mutex_unlock(&job_mutex);
+ return job;
+}
+
+static void thread_pool__job_push(struct thread_pool__job *job)
+{
+ mutex_lock(&job_mutex);
+ thread_pool__job_push_locked(job);
+ mutex_unlock(&job_mutex);
+}
+
+static void thread_pool__handle_job(struct thread_pool__job *job)
+{
+ while (job) {
+ job->callback(job->kvm, job->data);
+
+ mutex_lock(&job->mutex);
+
+ if (--job->signalcount > 0)
+ /* If the job was signaled again while we were working */
+ thread_pool__job_push(job);
+
+ mutex_unlock(&job->mutex);
+
+ job = thread_pool__job_pop();
+ }
+}
+
+static void thread_pool__threadfunc_cleanup(void *param)
+{
+ mutex_unlock(&job_mutex);
+}
+
+static void *thread_pool__threadfunc(void *param)
+{
+ pthread_cleanup_push(thread_pool__threadfunc_cleanup, NULL);
+
+ while (running) {
+ struct thread_pool__job *curjob = NULL;
+
+ mutex_lock(&job_mutex);
+ while (running && (curjob = thread_pool__job_pop_locked()) == NULL)
+ pthread_cond_wait(&job_cond, &job_mutex);
+ mutex_unlock(&job_mutex);
+
+ if (running)
+ thread_pool__handle_job(curjob);
+ }
+
+ pthread_cleanup_pop(0);
+
+ return NULL;
+}
+
+static int thread_pool__addthread(void)
+{
+ int res;
+ void *newthreads;
+
+ mutex_lock(&thread_mutex);
+ newthreads = realloc(threads, (threadcount + 1) * sizeof(pthread_t));
+ if (newthreads == NULL) {
+ mutex_unlock(&thread_mutex);
+ return -1;
+ }
+
+ threads = newthreads;
+
+ res = pthread_create(threads + threadcount, NULL,
+ thread_pool__threadfunc, NULL);
+
+ if (res == 0)
+ threadcount++;
+ mutex_unlock(&thread_mutex);
+
+ return res;
+}
+
+int thread_pool__init(struct kvm *kvm)
+{
+ unsigned long i;
+ unsigned int thread_count = sysconf(_SC_NPROCESSORS_ONLN);
+
+ running = true;
+
+ for (i = 0; i < thread_count; i++)
+ if (thread_pool__addthread() < 0)
+ return i;
+
+ return i;
+}
+late_init(thread_pool__init);
+
+int thread_pool__exit(struct kvm *kvm)
+{
+ int i;
+ void *NUL = NULL;
+
+ running = false;
+
+ for (i = 0; i < threadcount; i++) {
+ mutex_lock(&job_mutex);
+ pthread_cond_signal(&job_cond);
+ mutex_unlock(&job_mutex);
+ }
+
+ for (i = 0; i < threadcount; i++) {
+ pthread_join(threads[i], NUL);
+ }
+
+ return 0;
+}
+late_exit(thread_pool__exit);
+
+void thread_pool__do_job(struct thread_pool__job *job)
+{
+ struct thread_pool__job *jobinfo = job;
+
+ if (jobinfo == NULL || jobinfo->callback == NULL)
+ return;
+
+ mutex_lock(&jobinfo->mutex);
+ if (jobinfo->signalcount++ == 0)
+ thread_pool__job_push(job);
+ mutex_unlock(&jobinfo->mutex);
+
+ mutex_lock(&job_mutex);
+ pthread_cond_signal(&job_cond);
+ mutex_unlock(&job_mutex);
+}
--- /dev/null
+/*
+ * Taken from perf, which in turn took it from Git
+ */
+
+#include "kvm/util.h"
+
+#include <kvm/kvm.h>
+#include <linux/magic.h> /* For HUGETLBFS_MAGIC */
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/statfs.h>
+
+static void report(const char *prefix, const char *err, va_list params)
+{
+ char msg[1024];
+ vsnprintf(msg, sizeof(msg), err, params);
+ fprintf(stderr, " %s%s\n", prefix, msg);
+}
+
+static NORETURN void die_builtin(const char *err, va_list params)
+{
+ report(" Fatal: ", err, params);
+ exit(128);
+}
+
+static void error_builtin(const char *err, va_list params)
+{
+ report(" Error: ", err, params);
+}
+
+static void warn_builtin(const char *warn, va_list params)
+{
+ report(" Warning: ", warn, params);
+}
+
+static void info_builtin(const char *info, va_list params)
+{
+ report(" Info: ", info, params);
+}
+
+void die(const char *err, ...)
+{
+ va_list params;
+
+ va_start(params, err);
+ die_builtin(err, params);
+ va_end(params);
+}
+
+int pr_err(const char *err, ...)
+{
+ va_list params;
+
+ va_start(params, err);
+ error_builtin(err, params);
+ va_end(params);
+ return -1;
+}
+
+void pr_warning(const char *warn, ...)
+{
+ va_list params;
+
+ va_start(params, warn);
+ warn_builtin(warn, params);
+ va_end(params);
+}
+
+void pr_info(const char *info, ...)
+{
+ va_list params;
+
+ va_start(params, info);
+ info_builtin(info, params);
+ va_end(params);
+}
+
+void die_perror(const char *s)
+{
+ perror(s);
+ exit(1);
+}
+
+void *mmap_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size)
+{
+ char mpath[PATH_MAX];
+ int fd;
+ struct statfs sfs;
+ void *addr;
+ unsigned long blk_size;
+
+ if (statfs(htlbfs_path, &sfs) < 0)
+ die("Can't stat %s\n", htlbfs_path);
+
+ if ((unsigned int)sfs.f_type != HUGETLBFS_MAGIC)
+ die("%s is not hugetlbfs!\n", htlbfs_path);
+
+ blk_size = (unsigned long)sfs.f_bsize;
+ if (sfs.f_bsize == 0 || blk_size > size) {
+ die("Can't use hugetlbfs pagesize %ld for mem size %lld\n",
+ blk_size, size);
+ }
+
+ kvm->ram_pagesize = blk_size;
+
+ snprintf(mpath, PATH_MAX, "%s/kvmtoolXXXXXX", htlbfs_path);
+ fd = mkstemp(mpath);
+ if (fd < 0)
+ die("Can't open %s for hugetlbfs map\n", mpath);
+ unlink(mpath);
+ if (ftruncate(fd, size) < 0)
+ die("Can't ftruncate for mem mapping size %lld\n",
+ size);
+ addr = mmap(NULL, size, PROT_RW, MAP_PRIVATE, fd, 0);
+ close(fd);
+
+ return addr;
+}
+
+/* This function wraps the choice between a hugetlbfs mapping (if requested) and a normal mmap */
+void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *hugetlbfs_path, u64 size)
+{
+ if (hugetlbfs_path)
+ /*
+ * We don't /need/ to map guest RAM from hugetlbfs, but we do so
+ * if the user specifies a hugetlbfs path.
+ */
+ return mmap_hugetlbfs(kvm, hugetlbfs_path, size);
+ else {
+ kvm->ram_pagesize = getpagesize();
+ return mmap(NULL, size, PROT_RW, MAP_ANON_NORESERVE, -1, 0);
+ }
+}
--- /dev/null
+#include "kvm/util.h"
+#include "kvm/virtio-9p.h"
+
+#include <endian.h>
+#include <stdint.h>
+
+#include <linux/compiler.h>
+#include <net/9p/9p.h>
+
+static void virtio_p9_pdu_read(struct p9_pdu *pdu, void *data, size_t size)
+{
+ size_t len;
+ int i, copied = 0;
+ u16 iov_cnt = pdu->out_iov_cnt;
+ size_t offset = pdu->read_offset;
+ struct iovec *iov = pdu->out_iov;
+
+ for (i = 0; i < iov_cnt && size; i++) {
+ if (offset >= iov[i].iov_len) {
+ offset -= iov[i].iov_len;
+ continue;
+ } else {
+ len = MIN(iov[i].iov_len - offset, size);
+ memcpy(data, iov[i].iov_base + offset, len);
+ size -= len;
+ data += len;
+ offset = 0;
+ copied += len;
+ }
+ }
+ pdu->read_offset += copied;
+}
+
+static void virtio_p9_pdu_write(struct p9_pdu *pdu,
+ const void *data, size_t size)
+{
+ size_t len;
+ int i, copied = 0;
+ u16 iov_cnt = pdu->in_iov_cnt;
+ size_t offset = pdu->write_offset;
+ struct iovec *iov = pdu->in_iov;
+
+ for (i = 0; i < iov_cnt && size; i++) {
+ if (offset >= iov[i].iov_len) {
+ offset -= iov[i].iov_len;
+ continue;
+ } else {
+ len = MIN(iov[i].iov_len - offset, size);
+ memcpy(iov[i].iov_base + offset, data, len);
+ size -= len;
+ data += len;
+ offset = 0;
+ copied += len;
+ }
+ }
+ pdu->write_offset += copied;
+}
+
+static void virtio_p9_wstat_free(struct p9_wstat *stbuf)
+{
+ free(stbuf->name);
+ free(stbuf->uid);
+ free(stbuf->gid);
+ free(stbuf->muid);
+}
+
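+/*
+ * Format characters understood by virtio_p9_decode():
+ * b - 8-bit integer
+ * w - 16-bit little-endian integer
+ * d - 32-bit little-endian integer
+ * q - 64-bit little-endian integer
+ * s - string: 16-bit length followed by the bytes; the caller frees it
+ * Q - struct p9_qid
+ * S - struct p9_wstat
+ * I - struct p9_iattr_dotl
+ * Any other character sets the return value to EINVAL.
+ */
+static int virtio_p9_decode(struct p9_pdu *pdu, const char *fmt, va_list ap)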
+{
+ int retval = 0;
+ const char *ptr;
+
+ for (ptr = fmt; *ptr; ptr++) {
+ switch (*ptr) {
+ case 'b':
+ {
+ int8_t *val = va_arg(ap, int8_t *);
+ virtio_p9_pdu_read(pdu, val, sizeof(*val));
+ }
+ break;
+ case 'w':
+ {
+ int16_t le_val;
+ int16_t *val = va_arg(ap, int16_t *);
+ virtio_p9_pdu_read(pdu, &le_val, sizeof(le_val));
+ *val = le16toh(le_val);
+ }
+ break;
+ case 'd':
+ {
+ int32_t le_val;
+ int32_t *val = va_arg(ap, int32_t *);
+ virtio_p9_pdu_read(pdu, &le_val, sizeof(le_val));
+ *val = le32toh(le_val);
+ }
+ break;
+ case 'q':
+ {
+ int64_t le_val;
+ int64_t *val = va_arg(ap, int64_t *);
+ virtio_p9_pdu_read(pdu, &le_val, sizeof(le_val));
+ *val = le64toh(le_val);
+ }
+ break;
+ case 's':
+ {
+ uint16_t len;
+ char **str = va_arg(ap, char **);
+
+ virtio_p9_pdu_readf(pdu, "w", &len);
+ *str = malloc(len + 1);
+ if (*str == NULL) {
+ retval = ENOMEM;
+ break;
+ }
+ virtio_p9_pdu_read(pdu, *str, len);
+ (*str)[len] = 0;
+ }
+ break;
+ case 'Q':
+ {
+ struct p9_qid *qid = va_arg(ap, struct p9_qid *);
+ retval = virtio_p9_pdu_readf(pdu, "bdq",
+ &qid->type, &qid->version,
+ &qid->path);
+ }
+ break;
+ case 'S':
+ {
+ struct p9_wstat *stbuf = va_arg(ap, struct p9_wstat *);
+ memset(stbuf, 0, sizeof(struct p9_wstat));
+ stbuf->n_uid = stbuf->n_gid = stbuf->n_muid = -1;
+ retval = virtio_p9_pdu_readf(pdu, "wwdQdddqssss",
+ &stbuf->size, &stbuf->type,
+ &stbuf->dev, &stbuf->qid,
+ &stbuf->mode, &stbuf->atime,
+ &stbuf->mtime, &stbuf->length,
+ &stbuf->name, &stbuf->uid,
+ &stbuf->gid, &stbuf->muid);
+ if (retval)
+ virtio_p9_wstat_free(stbuf);
+ }
+ break;
+ case 'I':
+ {
+ struct p9_iattr_dotl *p9attr = va_arg(ap,
+ struct p9_iattr_dotl *);
+
+ retval = virtio_p9_pdu_readf(pdu, "ddddqqqqq",
+ &p9attr->valid,
+ &p9attr->mode,
+ &p9attr->uid,
+ &p9attr->gid,
+ &p9attr->size,
+ &p9attr->atime_sec,
+ &p9attr->atime_nsec,
+ &p9attr->mtime_sec,
+ &p9attr->mtime_nsec);
+ }
+ break;
+ default:
+ retval = EINVAL;
+ break;
+ }
+ }
+ return retval;
+}
+
+static int virtio_p9_pdu_encode(struct p9_pdu *pdu, const char *fmt, va_list ap)
+{
+ int retval = 0;
+ const char *ptr;
+
+ for (ptr = fmt; *ptr; ptr++) {
+ switch (*ptr) {
+ case 'b':
+ {
+ int8_t val = va_arg(ap, int);
+ virtio_p9_pdu_write(pdu, &val, sizeof(val));
+ }
+ break;
+ case 'w':
+ {
+ int16_t val = htole16(va_arg(ap, int));
+ virtio_p9_pdu_write(pdu, &val, sizeof(val));
+ }
+ break;
+ case 'd':
+ {
+ int32_t val = htole32(va_arg(ap, int32_t));
+ virtio_p9_pdu_write(pdu, &val, sizeof(val));
+ }
+ break;
+ case 'q':
+ {
+ int64_t val = htole64(va_arg(ap, int64_t));
+ virtio_p9_pdu_write(pdu, &val, sizeof(val));
+ }
+ break;
+ case 's':
+ {
+ uint16_t len = 0;
+ const char *s = va_arg(ap, char *);
+ if (s)
+ len = MIN(strlen(s), USHRT_MAX);
+ virtio_p9_pdu_writef(pdu, "w", len);
+ virtio_p9_pdu_write(pdu, s, len);
+ }
+ break;
+ case 'Q':
+ {
+ struct p9_qid *qid = va_arg(ap, struct p9_qid *);
+ retval = virtio_p9_pdu_writef(pdu, "bdq",
+ qid->type, qid->version,
+ qid->path);
+ }
+ break;
+ case 'S':
+ {
+ struct p9_wstat *stbuf = va_arg(ap, struct p9_wstat *);
+ retval = virtio_p9_pdu_writef(pdu, "wwdQdddqssss",
+ stbuf->size, stbuf->type,
+ stbuf->dev, &stbuf->qid,
+ stbuf->mode, stbuf->atime,
+ stbuf->mtime, stbuf->length,
+ stbuf->name, stbuf->uid,
+ stbuf->gid, stbuf->muid);
+ }
+ break;
+ case 'A':
+ {
+ struct p9_stat_dotl *stbuf = va_arg(ap,
+ struct p9_stat_dotl *);
+ retval = virtio_p9_pdu_writef(pdu,
+ "qQdddqqqqqqqqqqqqqqq",
+ stbuf->st_result_mask,
+ &stbuf->qid,
+ stbuf->st_mode,
+ stbuf->st_uid,
+ stbuf->st_gid,
+ stbuf->st_nlink,
+ stbuf->st_rdev,
+ stbuf->st_size,
+ stbuf->st_blksize,
+ stbuf->st_blocks,
+ stbuf->st_atime_sec,
+ stbuf->st_atime_nsec,
+ stbuf->st_mtime_sec,
+ stbuf->st_mtime_nsec,
+ stbuf->st_ctime_sec,
+ stbuf->st_ctime_nsec,
+ stbuf->st_btime_sec,
+ stbuf->st_btime_nsec,
+ stbuf->st_gen,
+ stbuf->st_data_version);
+ }
+ break;
+ default:
+ retval = EINVAL;
+ break;
+ }
+ }
+ return retval;
+}
+
+int virtio_p9_pdu_readf(struct p9_pdu *pdu, const char *fmt, ...)
+{
+ int ret;
+ va_list ap;
+
+ va_start(ap, fmt);
+ ret = virtio_p9_decode(pdu, fmt, ap);
+ va_end(ap);
+
+ return ret;
+}
+
+int virtio_p9_pdu_writef(struct p9_pdu *pdu, const char *fmt, ...)
+{
+ int ret;
+ va_list ap;
+
+ va_start(ap, fmt);
+ ret = virtio_p9_pdu_encode(pdu, fmt, ap);
+ va_end(ap);
+
+ return ret;
+}
--- /dev/null
+#include "kvm/virtio-pci-dev.h"
+#include "kvm/ioport.h"
+#include "kvm/util.h"
+#include "kvm/threadpool.h"
+#include "kvm/irq.h"
+#include "kvm/virtio-9p.h"
+#include "kvm/guest_compat.h"
+#include "kvm/builtin-setup.h"
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <sys/vfs.h>
+
+#include <linux/virtio_ring.h>
+#include <linux/virtio_9p.h>
+#include <net/9p/9p.h>
+
+static LIST_HEAD(devs);
+static int compat_id = -1;
+
+static int insert_new_fid(struct p9_dev *dev, struct p9_fid *fid);
+static struct p9_fid *find_or_create_fid(struct p9_dev *dev, u32 fid)
+{
+ struct rb_node *node = dev->fids.rb_node;
+ struct p9_fid *pfid = NULL;
+
+ while (node) {
+ struct p9_fid *cur = rb_entry(node, struct p9_fid, node);
+
+ if (fid < cur->fid) {
+ node = node->rb_left;
+ } else if (fid > cur->fid) {
+ node = node->rb_right;
+ } else {
+ return cur;
+ }
+ }
+
+ pfid = calloc(1, sizeof(*pfid));
+ if (!pfid)
+ return NULL;
+
+ pfid->fid = fid;
+ strcpy(pfid->abs_path, dev->root_dir);
+ pfid->path = pfid->abs_path + strlen(dev->root_dir);
+
+ insert_new_fid(dev, pfid);
+
+ return pfid;
+}
+
+static int insert_new_fid(struct p9_dev *dev, struct p9_fid *fid)
+{
+ struct rb_node **node = &(dev->fids.rb_node), *parent = NULL;
+
+ while (*node) {
+ int result = fid->fid - rb_entry(*node, struct p9_fid, node)->fid;
+
+ parent = *node;
+ if (result < 0)
+ node = &((*node)->rb_left);
+ else if (result > 0)
+ node = &((*node)->rb_right);
+ else
+ return -EEXIST;
+ }
+
+ rb_link_node(&fid->node, parent, node);
+ rb_insert_color(&fid->node, &dev->fids);
+ return 0;
+}
+
+static struct p9_fid *get_fid(struct p9_dev *p9dev, int fid)
+{
+ struct p9_fid *new;
+
+ new = find_or_create_fid(p9dev, fid);
+
+ return new;
+}
+
+/* Warning: Immediately use value returned from this function */
+static const char *rel_to_abs(struct p9_dev *p9dev,
+ const char *path, char *abs_path)
+{
+ sprintf(abs_path, "%s/%s", p9dev->root_dir, path);
+
+ return abs_path;
+}
+
+static void stat2qid(struct stat *st, struct p9_qid *qid)
+{
+ *qid = (struct p9_qid) {
+ .path = st->st_ino,
+ .version = st->st_mtime,
+ };
+
+ if (S_ISDIR(st->st_mode))
+ qid->type |= P9_QTDIR;
+}
+
+static void close_fid(struct p9_dev *p9dev, u32 fid)
+{
+ struct p9_fid *pfid = get_fid(p9dev, fid);
+
+ if (pfid->fd > 0)
+ close(pfid->fd);
+
+ if (pfid->dir)
+ closedir(pfid->dir);
+
+ rb_erase(&pfid->node, &p9dev->fids);
+ free(pfid);
+}
+
+static void virtio_p9_set_reply_header(struct p9_pdu *pdu, u32 size)
+{
+ u8 cmd;
+ u16 tag;
+
+ pdu->read_offset = sizeof(u32);
+ virtio_p9_pdu_readf(pdu, "bw", &cmd, &tag);
+ pdu->write_offset = 0;
+ /* cmd + 1 is the reply message */
+ virtio_p9_pdu_writef(pdu, "dbw", size, cmd + 1, tag);
+}
+
+static u16 virtio_p9_update_iov_cnt(struct iovec iov[], u32 count, int iov_cnt)
+{
+ int i;
+ u32 total = 0;
+ for (i = 0; (i < iov_cnt) && (total < count); i++) {
+ if (total + iov[i].iov_len > count) {
+ /* we don't need this iov fully */
+ iov[i].iov_len -= ((total + iov[i].iov_len) - count);
+ i++;
+ break;
+ }
+ total += iov[i].iov_len;
+ }
+ return i;
+}
+
+static void virtio_p9_error_reply(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, int err, u32 *outlen)
+{
+ u16 tag;
+
+ pdu->write_offset = VIRTIO_9P_HDR_LEN;
+ virtio_p9_pdu_writef(pdu, "d", err);
+ *outlen = pdu->write_offset;
+
+ /* read the tag from input */
+ pdu->read_offset = sizeof(u32) + sizeof(u8);
+ virtio_p9_pdu_readf(pdu, "w", &tag);
+
+ /* Update the header */
+ pdu->write_offset = 0;
+ virtio_p9_pdu_writef(pdu, "dbw", *outlen, P9_RLERROR, tag);
+}
+
+static void virtio_p9_version(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, u32 *outlen)
+{
+ u32 msize;
+ char *version;
+ virtio_p9_pdu_readf(pdu, "ds", &msize, &version);
+ /*
+ * Reply with the same msize the client sent us. If the request
+ * is not for 9P2000.L, reply with the version string "unknown".
+ */
+ if (!strcmp(version, VIRTIO_9P_VERSION_DOTL))
+ virtio_p9_pdu_writef(pdu, "ds", msize, version);
+ else
+ virtio_p9_pdu_writef(pdu, "ds", msize, "unknown");
+
+ *outlen = pdu->write_offset;
+ virtio_p9_set_reply_header(pdu, *outlen);
+ free(version);
+ return;
+}
+
+static void virtio_p9_clunk(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, u32 *outlen)
+{
+ u32 fid;
+
+ virtio_p9_pdu_readf(pdu, "d", &fid);
+ close_fid(p9dev, fid);
+
+ *outlen = pdu->write_offset;
+ virtio_p9_set_reply_header(pdu, *outlen);
+ return;
+}
+
+/*
+ * FIXME!! Need to map to a protocol-independent value. Upstream
+ * 9p also has the same bug.
+ */
+static int virtio_p9_openflags(int flags)
+{
+ flags &= ~(O_NOCTTY | O_ASYNC | O_CREAT | O_DIRECT);
+ flags |= O_NOFOLLOW;
+ return flags;
+}
+
+static bool is_dir(struct p9_fid *fid)
+{
+ struct stat st;
+
+ if (stat(fid->abs_path, &st) < 0)
+ return false;
+
+ return S_ISDIR(st.st_mode);
+}
+
+static void virtio_p9_open(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, u32 *outlen)
+{
+ u32 fid, flags;
+ struct stat st;
+ struct p9_qid qid;
+ struct p9_fid *new_fid;
+
+
+ virtio_p9_pdu_readf(pdu, "dd", &fid, &flags);
+ new_fid = get_fid(p9dev, fid);
+
+ if (lstat(new_fid->abs_path, &st) < 0)
+ goto err_out;
+
+ stat2qid(&st, &qid);
+
+ if (is_dir(new_fid)) {
+ new_fid->dir = opendir(new_fid->abs_path);
+ if (!new_fid->dir)
+ goto err_out;
+ } else {
+ new_fid->fd = open(new_fid->abs_path,
+ virtio_p9_openflags(flags));
+ if (new_fid->fd < 0)
+ goto err_out;
+ }
+ /* FIXME!! need to send proper iounit */
+ virtio_p9_pdu_writef(pdu, "Qd", &qid, 0);
+
+ *outlen = pdu->write_offset;
+ virtio_p9_set_reply_header(pdu, *outlen);
+ return;
+err_out:
+ virtio_p9_error_reply(p9dev, pdu, errno, outlen);
+ return;
+}
+
+static void virtio_p9_create(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, u32 *outlen)
+{
+ int fd, ret;
+ char *name;
+ struct stat st;
+ struct p9_qid qid;
+ struct p9_fid *dfid;
+ char full_path[PATH_MAX];
+ u32 dfid_val, flags, mode, gid;
+
+ virtio_p9_pdu_readf(pdu, "dsddd", &dfid_val,
+ &name, &flags, &mode, &gid);
+ dfid = get_fid(p9dev, dfid_val);
+
+ flags = virtio_p9_openflags(flags);
+
+ sprintf(full_path, "%s/%s", dfid->abs_path, name);
+ fd = open(full_path, flags | O_CREAT, mode);
+ if (fd < 0)
+ goto err_out;
+ dfid->fd = fd;
+
+ if (lstat(full_path, &st) < 0)
+ goto err_out;
+
+ ret = chmod(full_path, mode & 0777);
+ if (ret < 0)
+ goto err_out;
+
+ /* Append in place; sprintf() with overlapping buffers is undefined. */
+ strcat(dfid->path, "/");
+ strcat(dfid->path, name);
+ stat2qid(&st, &qid);
+ virtio_p9_pdu_writef(pdu, "Qd", &qid, 0);
+ *outlen = pdu->write_offset;
+ virtio_p9_set_reply_header(pdu, *outlen);
+ free(name);
+ return;
+err_out:
+ free(name);
+ virtio_p9_error_reply(p9dev, pdu, errno, outlen);
+ return;
+}
+
+static void virtio_p9_mkdir(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, u32 *outlen)
+{
+ int ret;
+ char *name;
+ struct stat st;
+ struct p9_qid qid;
+ struct p9_fid *dfid;
+ char full_path[PATH_MAX];
+ u32 dfid_val, mode, gid;
+
+ virtio_p9_pdu_readf(pdu, "dsdd", &dfid_val,
+ &name, &mode, &gid);
+ dfid = get_fid(p9dev, dfid_val);
+
+ sprintf(full_path, "%s/%s", dfid->abs_path, name);
+ ret = mkdir(full_path, mode);
+ if (ret < 0)
+ goto err_out;
+
+ if (lstat(full_path, &st) < 0)
+ goto err_out;
+
+ ret = chmod(full_path, mode & 0777);
+ if (ret < 0)
+ goto err_out;
+
+ stat2qid(&st, &qid);
+ virtio_p9_pdu_writef(pdu, "Qd", &qid, 0);
+ *outlen = pdu->write_offset;
+ virtio_p9_set_reply_header(pdu, *outlen);
+ free(name);
+ return;
+err_out:
+ free(name);
+ virtio_p9_error_reply(p9dev, pdu, errno, outlen);
+ return;
+}
+
+static void virtio_p9_walk(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, u32 *outlen)
+{
+ u8 i;
+ u16 nwqid;
+ u16 nwname;
+ struct p9_qid wqid;
+ struct p9_fid *new_fid, *old_fid;
+ u32 fid_val, newfid_val;
+
+
+ virtio_p9_pdu_readf(pdu, "ddw", &fid_val, &newfid_val, &nwname);
+ new_fid = get_fid(p9dev, newfid_val);
+
+ nwqid = 0;
+ if (nwname) {
+ struct p9_fid *fid = get_fid(p9dev, fid_val);
+
+ strcpy(new_fid->path, fid->path);
+ /* skip the space for count */
+ pdu->write_offset += sizeof(u16);
+ for (i = 0; i < nwname; i++) {
+ struct stat st;
+ char tmp[PATH_MAX] = {0};
+ char full_path[PATH_MAX];
+ char *str;
+
+ virtio_p9_pdu_readf(pdu, "s", &str);
+
+ /* Format the new path we're 'walk'ing into */
+ sprintf(tmp, "%s/%s", new_fid->path, str);
+
+ free(str);
+
+ if (lstat(rel_to_abs(p9dev, tmp, full_path), &st) < 0)
+ goto err_out;
+
+ stat2qid(&st, &wqid);
+ strcpy(new_fid->path, tmp);
+ new_fid->uid = fid->uid;
+ nwqid++;
+ virtio_p9_pdu_writef(pdu, "Q", &wqid);
+ }
+ } else {
+ /*
+ * Update write_offset so that outlen gets the correct value.
+ */
+ pdu->write_offset += sizeof(u16);
+ old_fid = get_fid(p9dev, fid_val);
+ strcpy(new_fid->path, old_fid->path);
+ new_fid->uid = old_fid->uid;
+ }
+ *outlen = pdu->write_offset;
+ pdu->write_offset = VIRTIO_9P_HDR_LEN;
+ virtio_p9_pdu_writef(pdu, "d", nwqid);
+ virtio_p9_set_reply_header(pdu, *outlen);
+ return;
+err_out:
+ virtio_p9_error_reply(p9dev, pdu, errno, outlen);
+ return;
+}
+
+static void virtio_p9_attach(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, u32 *outlen)
+{
+ char *uname;
+ char *aname;
+ struct stat st;
+ struct p9_qid qid;
+ struct p9_fid *fid;
+ u32 fid_val, afid, uid;
+
+ virtio_p9_pdu_readf(pdu, "ddssd", &fid_val, &afid,
+ &uname, &aname, &uid);
+
+ free(uname);
+ free(aname);
+
+ if (lstat(p9dev->root_dir, &st) < 0)
+ goto err_out;
+
+ stat2qid(&st, &qid);
+
+ fid = get_fid(p9dev, fid_val);
+ fid->uid = uid;
+ strcpy(fid->path, "/");
+
+ virtio_p9_pdu_writef(pdu, "Q", &qid);
+ *outlen = pdu->write_offset;
+ virtio_p9_set_reply_header(pdu, *outlen);
+ return;
+err_out:
+ virtio_p9_error_reply(p9dev, pdu, errno, outlen);
+ return;
+}
+
+static void virtio_p9_fill_stat(struct p9_dev *p9dev,
+ struct stat *st, struct p9_stat_dotl *statl)
+{
+ memset(statl, 0, sizeof(*statl));
+ statl->st_mode = st->st_mode;
+ statl->st_nlink = st->st_nlink;
+ statl->st_uid = st->st_uid;
+ statl->st_gid = st->st_gid;
+ statl->st_rdev = st->st_rdev;
+ statl->st_size = st->st_size;
+ statl->st_blksize = st->st_blksize;
+ statl->st_blocks = st->st_blocks;
+ statl->st_atime_sec = st->st_atime;
+ statl->st_atime_nsec = st->st_atim.tv_nsec;
+ statl->st_mtime_sec = st->st_mtime;
+ statl->st_mtime_nsec = st->st_mtim.tv_nsec;
+ statl->st_ctime_sec = st->st_ctime;
+ statl->st_ctime_nsec = st->st_ctim.tv_nsec;
+ /* Currently we only support BASIC fields in stat */
+ statl->st_result_mask = P9_STATS_BASIC;
+ stat2qid(st, &statl->qid);
+}
+
+static void virtio_p9_read(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, u32 *outlen)
+{
+ u64 offset;
+ u32 fid_val;
+ u16 iov_cnt;
+ void *iov_base;
+ size_t iov_len;
+ u32 count, rcount;
+ struct p9_fid *fid;
+
+
+ rcount = 0;
+ virtio_p9_pdu_readf(pdu, "dqd", &fid_val, &offset, &count);
+ fid = get_fid(p9dev, fid_val);
+
+ iov_base = pdu->in_iov[0].iov_base;
+ iov_len = pdu->in_iov[0].iov_len;
+ iov_cnt = pdu->in_iov_cnt;
+ pdu->in_iov[0].iov_base += VIRTIO_9P_HDR_LEN + sizeof(u32);
+ pdu->in_iov[0].iov_len -= VIRTIO_9P_HDR_LEN + sizeof(u32);
+ pdu->in_iov_cnt = virtio_p9_update_iov_cnt(pdu->in_iov,
+ count,
+ pdu->in_iov_cnt);
+ rcount = preadv(fid->fd, pdu->in_iov,
+ pdu->in_iov_cnt, offset);
+ if (rcount > count)
+ rcount = count;
+ /*
+ * Update the iov_base back, so that rest of
+ * pdu_writef works correctly.
+ */
+ pdu->in_iov[0].iov_base = iov_base;
+ pdu->in_iov[0].iov_len = iov_len;
+ pdu->in_iov_cnt = iov_cnt;
+
+ pdu->write_offset = VIRTIO_9P_HDR_LEN;
+ virtio_p9_pdu_writef(pdu, "d", rcount);
+ *outlen = pdu->write_offset + rcount;
+ virtio_p9_set_reply_header(pdu, *outlen);
+ return;
+}
+
+static int virtio_p9_dentry_size(struct dirent *dent)
+{
+ /*
+ * Size of each dirent:
+ * qid(13) + offset(8) + type(1) + name_len(2) + name
+ */
+ return 24 + strlen(dent->d_name);
+}
+
+static void virtio_p9_readdir(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, u32 *outlen)
+{
+ u32 fid_val;
+ u32 count, rcount;
+ struct stat st;
+ struct p9_fid *fid;
+ struct dirent *dent;
+ char full_path[PATH_MAX];
+ u64 offset, old_offset;
+
+ rcount = 0;
+ virtio_p9_pdu_readf(pdu, "dqd", &fid_val, &offset, &count);
+ fid = get_fid(p9dev, fid_val);
+
+ if (!is_dir(fid)) {
+ errno = EINVAL;
+ goto err_out;
+ }
+
+ /* Move to the offset specified */
+ seekdir(fid->dir, offset);
+
+ old_offset = offset;
+ /* If reading a dir, fill the buffer with p9_stat entries */
+ dent = readdir(fid->dir);
+
+ /* Skip the space for writing count */
+ pdu->write_offset += sizeof(u32);
+ while (dent) {
+ u32 read;
+ struct p9_qid qid;
+
+ if ((rcount + virtio_p9_dentry_size(dent)) > count) {
+ /* seek to the previous offset and return */
+ seekdir(fid->dir, old_offset);
+ break;
+ }
+ old_offset = dent->d_off;
+ lstat(rel_to_abs(p9dev, dent->d_name, full_path), &st);
+ stat2qid(&st, &qid);
+ read = pdu->write_offset;
+ virtio_p9_pdu_writef(pdu, "Qqbs", &qid, dent->d_off,
+ dent->d_type, dent->d_name);
+ rcount += pdu->write_offset - read;
+ dent = readdir(fid->dir);
+ }
+
+ pdu->write_offset = VIRTIO_9P_HDR_LEN;
+ virtio_p9_pdu_writef(pdu, "d", rcount);
+ *outlen = pdu->write_offset + rcount;
+ virtio_p9_set_reply_header(pdu, *outlen);
+ return;
+err_out:
+ virtio_p9_error_reply(p9dev, pdu, errno, outlen);
+ return;
+}
+
+
+static void virtio_p9_getattr(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, u32 *outlen)
+{
+ u32 fid_val;
+ struct stat st;
+ u64 request_mask;
+ struct p9_fid *fid;
+ struct p9_stat_dotl statl;
+
+ virtio_p9_pdu_readf(pdu, "dq", &fid_val, &request_mask);
+ fid = get_fid(p9dev, fid_val);
+ if (lstat(fid->abs_path, &st) < 0)
+ goto err_out;
+
+ virtio_p9_fill_stat(p9dev, &st, &statl);
+ virtio_p9_pdu_writef(pdu, "A", &statl);
+ *outlen = pdu->write_offset;
+ virtio_p9_set_reply_header(pdu, *outlen);
+ return;
+err_out:
+ virtio_p9_error_reply(p9dev, pdu, errno, outlen);
+ return;
+}
+
+/* FIXME!! from linux/fs.h */
+/*
+ * Attribute flags. These should be or-ed together to figure out what
+ * has been changed!
+ */
+#define ATTR_MODE (1 << 0)
+#define ATTR_UID (1 << 1)
+#define ATTR_GID (1 << 2)
+#define ATTR_SIZE (1 << 3)
+#define ATTR_ATIME (1 << 4)
+#define ATTR_MTIME (1 << 5)
+#define ATTR_CTIME (1 << 6)
+#define ATTR_ATIME_SET (1 << 7)
+#define ATTR_MTIME_SET (1 << 8)
+#define ATTR_FORCE (1 << 9) /* Not a change, but a change it */
+#define ATTR_ATTR_FLAG (1 << 10)
+#define ATTR_KILL_SUID (1 << 11)
+#define ATTR_KILL_SGID (1 << 12)
+#define ATTR_FILE (1 << 13)
+#define ATTR_KILL_PRIV (1 << 14)
+#define ATTR_OPEN (1 << 15) /* Truncating from open(O_TRUNC) */
+#define ATTR_TIMES_SET (1 << 16)
+
+#define ATTR_MASK 127
+
+static void virtio_p9_setattr(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, u32 *outlen)
+{
+ int ret = 0;
+ u32 fid_val;
+ struct p9_fid *fid;
+ struct p9_iattr_dotl p9attr;
+
+ virtio_p9_pdu_readf(pdu, "dI", &fid_val, &p9attr);
+ fid = get_fid(p9dev, fid_val);
+
+ if (p9attr.valid & ATTR_MODE) {
+ ret = chmod(fid->abs_path, p9attr.mode);
+ if (ret < 0)
+ goto err_out;
+ }
+ if (p9attr.valid & (ATTR_ATIME | ATTR_MTIME)) {
+ struct timespec times[2];
+ if (p9attr.valid & ATTR_ATIME) {
+ if (p9attr.valid & ATTR_ATIME_SET) {
+ times[0].tv_sec = p9attr.atime_sec;
+ times[0].tv_nsec = p9attr.atime_nsec;
+ } else {
+ times[0].tv_nsec = UTIME_NOW;
+ }
+ } else {
+ times[0].tv_nsec = UTIME_OMIT;
+ }
+ if (p9attr.valid & ATTR_MTIME) {
+ if (p9attr.valid & ATTR_MTIME_SET) {
+ times[1].tv_sec = p9attr.mtime_sec;
+ times[1].tv_nsec = p9attr.mtime_nsec;
+ } else {
+ times[1].tv_nsec = UTIME_NOW;
+ }
+ } else
+ times[1].tv_nsec = UTIME_OMIT;
+
+ ret = utimensat(-1, fid->abs_path, times, AT_SYMLINK_NOFOLLOW);
+ if (ret < 0)
+ goto err_out;
+ }
+ /*
+ * If the only valid entry in iattr is ctime we can call
+ * chown(-1,-1) to update the ctime of the file
+ */
+ if ((p9attr.valid & (ATTR_UID | ATTR_GID)) ||
+ ((p9attr.valid & ATTR_CTIME)
+ && !((p9attr.valid & ATTR_MASK) & ~ATTR_CTIME))) {
+ if (!(p9attr.valid & ATTR_UID))
+ p9attr.uid = -1;
+
+ if (!(p9attr.valid & ATTR_GID))
+ p9attr.gid = -1;
+
+ ret = lchown(fid->abs_path, p9attr.uid, p9attr.gid);
+ if (ret < 0)
+ goto err_out;
+ }
+ if (p9attr.valid & (ATTR_SIZE)) {
+ ret = truncate(fid->abs_path, p9attr.size);
+ if (ret < 0)
+ goto err_out;
+ }
+ *outlen = VIRTIO_9P_HDR_LEN;
+ virtio_p9_set_reply_header(pdu, *outlen);
+ return;
+err_out:
+ virtio_p9_error_reply(p9dev, pdu, errno, outlen);
+ return;
+}
+
+static void virtio_p9_write(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, u32 *outlen)
+{
+
+ u64 offset;
+ u32 fid_val;
+ u32 count;
+ ssize_t res;
+ u16 iov_cnt;
+ void *iov_base;
+ size_t iov_len;
+ struct p9_fid *fid;
+ /* u32 fid + u64 offset + u32 count */
+ int twrite_size = sizeof(u32) + sizeof(u64) + sizeof(u32);
+
+ virtio_p9_pdu_readf(pdu, "dqd", &fid_val, &offset, &count);
+ fid = get_fid(p9dev, fid_val);
+
+ iov_base = pdu->out_iov[0].iov_base;
+ iov_len = pdu->out_iov[0].iov_len;
+ iov_cnt = pdu->out_iov_cnt;
+
+ /* Adjust the iovec to skip the header and metadata */
+ pdu->out_iov[0].iov_base += (sizeof(struct p9_msg) + twrite_size);
+ pdu->out_iov[0].iov_len -= (sizeof(struct p9_msg) + twrite_size);
+ pdu->out_iov_cnt = virtio_p9_update_iov_cnt(pdu->out_iov, count,
+ pdu->out_iov_cnt);
+ res = pwritev(fid->fd, pdu->out_iov, pdu->out_iov_cnt, offset);
+ /*
+ * Restore the original iov_base, iov_len and iov_cnt so that
+ * the rest of pdu_readf works correctly.
+ */
+ pdu->out_iov[0].iov_base = iov_base;
+ pdu->out_iov[0].iov_len = iov_len;
+ pdu->out_iov_cnt = iov_cnt;
+
+ if (res < 0)
+ goto err_out;
+ virtio_p9_pdu_writef(pdu, "d", res);
+ *outlen = pdu->write_offset;
+ virtio_p9_set_reply_header(pdu, *outlen);
+ return;
+err_out:
+ virtio_p9_error_reply(p9dev, pdu, errno, outlen);
+ return;
+}
+
+static void virtio_p9_remove(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, u32 *outlen)
+{
+ int ret;
+ u32 fid_val;
+ struct p9_fid *fid;
+
+ virtio_p9_pdu_readf(pdu, "d", &fid_val);
+ fid = get_fid(p9dev, fid_val);
+
+ ret = remove(fid->abs_path);
+ if (ret < 0)
+ goto err_out;
+ *outlen = pdu->write_offset;
+ virtio_p9_set_reply_header(pdu, *outlen);
+ return;
+
+err_out:
+ virtio_p9_error_reply(p9dev, pdu, errno, outlen);
+ return;
+}
+
+static void virtio_p9_rename(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, u32 *outlen)
+{
+ int ret;
+ u32 fid_val, new_fid_val;
+ struct p9_fid *fid, *new_fid;
+ char full_path[PATH_MAX], *new_name;
+
+ virtio_p9_pdu_readf(pdu, "dds", &fid_val, &new_fid_val, &new_name);
+ fid = get_fid(p9dev, fid_val);
+ new_fid = get_fid(p9dev, new_fid_val);
+
+ sprintf(full_path, "%s/%s", new_fid->abs_path, new_name);
+ ret = rename(fid->abs_path, full_path);
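+ if (ret < 0)
+ goto err_out;
+ /* new_name was allocated by pdu_readf("s"); release it on both paths */
+ free(new_name);
+ *outlen = pdu->write_offset;
+ virtio_p9_set_reply_header(pdu, *outlen);
+ return;
+
+err_out:
+ free(new_name);
+ virtio_p9_error_reply(p9dev, pdu, errno, outlen);
+ return;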
+}
+
+static void virtio_p9_readlink(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, u32 *outlen)
+{
+ int ret;
+ u32 fid_val;
+ struct p9_fid *fid;
+ char target_path[PATH_MAX];
+
+ virtio_p9_pdu_readf(pdu, "d", &fid_val);
+ fid = get_fid(p9dev, fid_val);
+
+ memset(target_path, 0, PATH_MAX);
+ ret = readlink(fid->abs_path, target_path, PATH_MAX - 1);
+ if (ret < 0)
+ goto err_out;
+
+ virtio_p9_pdu_writef(pdu, "s", target_path);
+ *outlen = pdu->write_offset;
+ virtio_p9_set_reply_header(pdu, *outlen);
+ return;
+err_out:
+ virtio_p9_error_reply(p9dev, pdu, errno, outlen);
+ return;
+}
+
+static void virtio_p9_statfs(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, u32 *outlen)
+{
+ int ret;
+ u64 fsid;
+ u32 fid_val;
+ struct p9_fid *fid;
+ struct statfs stat_buf;
+
+ virtio_p9_pdu_readf(pdu, "d", &fid_val);
+ fid = get_fid(p9dev, fid_val);
+
+ ret = statfs(fid->abs_path, &stat_buf);
+ if (ret < 0)
+ goto err_out;
+ /* FIXME!! f_blocks needs update based on client msize */
+ fsid = (unsigned int) stat_buf.f_fsid.__val[0] |
+ (unsigned long long)stat_buf.f_fsid.__val[1] << 32;
+ virtio_p9_pdu_writef(pdu, "ddqqqqqqd", stat_buf.f_type,
+ stat_buf.f_bsize, stat_buf.f_blocks,
+ stat_buf.f_bfree, stat_buf.f_bavail,
+ stat_buf.f_files, stat_buf.f_ffree,
+ fsid, stat_buf.f_namelen);
+ *outlen = pdu->write_offset;
+ virtio_p9_set_reply_header(pdu, *outlen);
+ return;
+err_out:
+ virtio_p9_error_reply(p9dev, pdu, errno, outlen);
+ return;
+}
+
+static void virtio_p9_mknod(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, u32 *outlen)
+{
+ int ret;
+ char *name;
+ struct stat st;
+ struct p9_fid *dfid;
+ struct p9_qid qid;
+ char full_path[PATH_MAX];
+ u32 fid_val, mode, major, minor, gid;
+
+ virtio_p9_pdu_readf(pdu, "dsdddd", &fid_val, &name, &mode,
+ &major, &minor, &gid);
+
+ dfid = get_fid(p9dev, fid_val);
+ sprintf(full_path, "%s/%s", dfid->abs_path, name);
+ ret = mknod(full_path, mode, makedev(major, minor));
+ if (ret < 0)
+ goto err_out;
+
+ if (lstat(full_path, &st) < 0)
+ goto err_out;
+
+ ret = chmod(full_path, mode & 0777);
+ if (ret < 0)
+ goto err_out;
+
+ stat2qid(&st, &qid);
+ virtio_p9_pdu_writef(pdu, "Q", &qid);
+ free(name);
+ *outlen = pdu->write_offset;
+ virtio_p9_set_reply_header(pdu, *outlen);
+ return;
+err_out:
+ free(name);
+ virtio_p9_error_reply(p9dev, pdu, errno, outlen);
+ return;
+}
+
+static void virtio_p9_fsync(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, u32 *outlen)
+{
+ int ret;
+ struct p9_fid *fid;
+ u32 fid_val, datasync;
+
+ virtio_p9_pdu_readf(pdu, "dd", &fid_val, &datasync);
+ fid = get_fid(p9dev, fid_val);
+
+ if (datasync)
+ ret = fdatasync(fid->fd);
+ else
+ ret = fsync(fid->fd);
+ if (ret < 0)
+ goto err_out;
+ *outlen = pdu->write_offset;
+ virtio_p9_set_reply_header(pdu, *outlen);
+ return;
+err_out:
+ virtio_p9_error_reply(p9dev, pdu, errno, outlen);
+ return;
+}
+
+static void virtio_p9_symlink(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, u32 *outlen)
+{
+ int ret;
+ struct stat st;
+ u32 fid_val, gid;
+ struct p9_qid qid;
+ struct p9_fid *dfid;
+ char new_name[PATH_MAX];
+ char *old_path, *name;
+
+ virtio_p9_pdu_readf(pdu, "dssd", &fid_val, &name, &old_path, &gid);
+
+ dfid = get_fid(p9dev, fid_val);
+ sprintf(new_name, "%s/%s", dfid->abs_path, name);
+ ret = symlink(old_path, new_name);
+ if (ret < 0)
+ goto err_out;
+
+ if (lstat(new_name, &st) < 0)
+ goto err_out;
+
+ stat2qid(&st, &qid);
+ virtio_p9_pdu_writef(pdu, "Q", &qid);
+ free(name);
+ free(old_path);
+ *outlen = pdu->write_offset;
+ virtio_p9_set_reply_header(pdu, *outlen);
+ return;
+err_out:
+ free(name);
+ free(old_path);
+ virtio_p9_error_reply(p9dev, pdu, errno, outlen);
+ return;
+}
+
+static void virtio_p9_link(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, u32 *outlen)
+{
+ int ret;
+ char *name;
+ u32 fid_val, dfid_val;
+ struct p9_fid *dfid, *fid;
+ char full_path[PATH_MAX];
+
+ virtio_p9_pdu_readf(pdu, "dds", &dfid_val, &fid_val, &name);
+
+ dfid = get_fid(p9dev, dfid_val);
+ fid = get_fid(p9dev, fid_val);
+ sprintf(full_path, "%s/%s", dfid->abs_path, name);
+ ret = link(fid->abs_path, full_path);
+ if (ret < 0)
+ goto err_out;
+ free(name);
+ *outlen = pdu->write_offset;
+ virtio_p9_set_reply_header(pdu, *outlen);
+ return;
+err_out:
+ free(name);
+ virtio_p9_error_reply(p9dev, pdu, errno, outlen);
+ return;
+
+}
+
+static void virtio_p9_lock(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, u32 *outlen)
+{
+ u8 ret;
+ u32 fid_val;
+ struct p9_flock flock;
+
+ virtio_p9_pdu_readf(pdu, "dbdqqds", &fid_val, &flock.type,
+ &flock.flags, &flock.start, &flock.length,
+ &flock.proc_id, &flock.client_id);
+
+ /* Just return success */
+ ret = P9_LOCK_SUCCESS;
+ /* Rlock carries a one-byte status field */
+ virtio_p9_pdu_writef(pdu, "b", ret);
+ *outlen = pdu->write_offset;
+ virtio_p9_set_reply_header(pdu, *outlen);
+ free(flock.client_id);
+ return;
+}
+
+static void virtio_p9_getlock(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, u32 *outlen)
+{
+ u32 fid_val;
+ struct p9_getlock glock;
+ virtio_p9_pdu_readf(pdu, "dbqqds", &fid_val, &glock.type,
+ &glock.start, &glock.length, &glock.proc_id,
+ &glock.client_id);
+
+ /* Just return success */
+ glock.type = F_UNLCK;
+ virtio_p9_pdu_writef(pdu, "bqqds", glock.type,
+ glock.start, glock.length, glock.proc_id,
+ glock.client_id);
+ *outlen = pdu->write_offset;
+ virtio_p9_set_reply_header(pdu, *outlen);
+ free(glock.client_id);
+ return;
+}
+
+static int virtio_p9_ancestor(char *path, char *ancestor)
+{
+ int size = strlen(ancestor);
+ if (!strncmp(path, ancestor, size)) {
+ /*
+ * Now check whether ancestor is a full name or
+ * a directory component and not just part
+ * of a name.
+ */
+ if (path[size] == '\0' || path[size] == '/')
+ return 1;
+ }
+ return 0;
+}
+
+static void virtio_p9_fix_path(char *fid_path, char *old_name, char *new_name)
+{
+ char tmp_name[PATH_MAX];
+ size_t rp_sz = strlen(old_name);
+
+ if (rp_sz == strlen(fid_path)) {
+ /* replace the full name */
+ strcpy(fid_path, new_name);
+ return;
+ }
+ /* save the trailing path details */
+ strcpy(tmp_name, fid_path + rp_sz);
+ sprintf(fid_path, "%s%s", new_name, tmp_name);
+ return;
+}
+
+static void rename_fids(struct p9_dev *p9dev, char *old_name, char *new_name)
+{
+ struct rb_node *node = rb_first(&p9dev->fids);
+
+ while (node) {
+ struct p9_fid *fid = rb_entry(node, struct p9_fid, node);
+
+ if (fid->fid != P9_NOFID && virtio_p9_ancestor(fid->path, old_name)) {
+ virtio_p9_fix_path(fid->path, old_name, new_name);
+ }
+ node = rb_next(node);
+ }
+}
+
+static void virtio_p9_renameat(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, u32 *outlen)
+{
+ int ret;
+ char *old_name, *new_name;
+ u32 old_dfid_val, new_dfid_val;
+ struct p9_fid *old_dfid, *new_dfid;
+ char old_full_path[PATH_MAX], new_full_path[PATH_MAX];
+
+
+ virtio_p9_pdu_readf(pdu, "dsds", &old_dfid_val, &old_name,
+ &new_dfid_val, &new_name);
+
+ old_dfid = get_fid(p9dev, old_dfid_val);
+ new_dfid = get_fid(p9dev, new_dfid_val);
+
+ sprintf(old_full_path, "%s/%s", old_dfid->abs_path, old_name);
+ sprintf(new_full_path, "%s/%s", new_dfid->abs_path, new_name);
+ ret = rename(old_full_path, new_full_path);
+ if (ret < 0)
+ goto err_out;
+ /*
+ * Now fix path in other fids, if the renamed path is part of
+ * that.
+ */
+ rename_fids(p9dev, old_name, new_name);
+ free(old_name);
+ free(new_name);
+ *outlen = pdu->write_offset;
+ virtio_p9_set_reply_header(pdu, *outlen);
+ return;
+err_out:
+ free(old_name);
+ free(new_name);
+ virtio_p9_error_reply(p9dev, pdu, errno, outlen);
+ return;
+}
+
+static void virtio_p9_unlinkat(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, u32 *outlen)
+{
+ int ret;
+ char *name;
+ u32 fid_val, flags;
+ struct p9_fid *fid;
+ char full_path[PATH_MAX];
+
+ virtio_p9_pdu_readf(pdu, "dsd", &fid_val, &name, &flags);
+ fid = get_fid(p9dev, fid_val);
+
+ sprintf(full_path, "%s/%s", fid->abs_path, name);
+ ret = remove(full_path);
+ if (ret < 0)
+ goto err_out;
+ free(name);
+ *outlen = pdu->write_offset;
+ virtio_p9_set_reply_header(pdu, *outlen);
+ return;
+err_out:
+ free(name);
+ virtio_p9_error_reply(p9dev, pdu, errno, outlen);
+ return;
+}
+
+static void virtio_p9_flush(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, u32 *outlen)
+{
+ u16 tag, oldtag;
+
+ virtio_p9_pdu_readf(pdu, "ww", &tag, &oldtag);
+ virtio_p9_pdu_writef(pdu, "w", tag);
+ *outlen = pdu->write_offset;
+ virtio_p9_set_reply_header(pdu, *outlen);
+
+ return;
+}
+
+static void virtio_p9_eopnotsupp(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, u32 *outlen)
+{
+ virtio_p9_error_reply(p9dev, pdu, EOPNOTSUPP, outlen);
+}
+
+typedef void p9_handler(struct p9_dev *p9dev,
+ struct p9_pdu *pdu, u32 *outlen);
+
+/* FIXME should be removed when merging with latest linus tree */
+#define P9_TRENAMEAT 74
+#define P9_TUNLINKAT 76
+
+static p9_handler *virtio_9p_dotl_handler [] = {
+ [P9_TREADDIR] = virtio_p9_readdir,
+ [P9_TSTATFS] = virtio_p9_statfs,
+ [P9_TGETATTR] = virtio_p9_getattr,
+ [P9_TSETATTR] = virtio_p9_setattr,
+ [P9_TXATTRWALK] = virtio_p9_eopnotsupp,
+ [P9_TXATTRCREATE] = virtio_p9_eopnotsupp,
+ [P9_TMKNOD] = virtio_p9_mknod,
+ [P9_TLOCK] = virtio_p9_lock,
+ [P9_TGETLOCK] = virtio_p9_getlock,
+ [P9_TRENAMEAT] = virtio_p9_renameat,
+ [P9_TREADLINK] = virtio_p9_readlink,
+ [P9_TUNLINKAT] = virtio_p9_unlinkat,
+ [P9_TMKDIR] = virtio_p9_mkdir,
+ [P9_TVERSION] = virtio_p9_version,
+ [P9_TLOPEN] = virtio_p9_open,
+ [P9_TATTACH] = virtio_p9_attach,
+ [P9_TWALK] = virtio_p9_walk,
+ [P9_TCLUNK] = virtio_p9_clunk,
+ [P9_TFSYNC] = virtio_p9_fsync,
+ [P9_TREAD] = virtio_p9_read,
+ [P9_TFLUSH] = virtio_p9_flush,
+ [P9_TLINK] = virtio_p9_link,
+ [P9_TSYMLINK] = virtio_p9_symlink,
+ [P9_TLCREATE] = virtio_p9_create,
+ [P9_TWRITE] = virtio_p9_write,
+ [P9_TREMOVE] = virtio_p9_remove,
+ [P9_TRENAME] = virtio_p9_rename,
+};
+
+static struct p9_pdu *virtio_p9_pdu_init(struct kvm *kvm, struct virt_queue *vq)
+{
+ struct p9_pdu *pdu = calloc(1, sizeof(*pdu));
+ if (!pdu)
+ return NULL;
+
+ /* skip the pdu header p9_msg */
+ pdu->read_offset = VIRTIO_9P_HDR_LEN;
+ pdu->write_offset = VIRTIO_9P_HDR_LEN;
+ pdu->queue_head = virt_queue__get_inout_iov(kvm, vq, pdu->in_iov,
+ pdu->out_iov, &pdu->in_iov_cnt, &pdu->out_iov_cnt);
+ return pdu;
+}
+
+static u8 virtio_p9_get_cmd(struct p9_pdu *pdu)
+{
+ struct p9_msg *msg;
+ /*
+ * We can peek directly into the pdu for a u8
+ * value. The host endianness won't be an issue.
+ */
+ msg = pdu->out_iov[0].iov_base;
+ return msg->cmd;
+}
+
+static bool virtio_p9_do_io_request(struct kvm *kvm, struct p9_dev_job *job)
+{
+ u8 cmd;
+ u32 len = 0;
+ p9_handler *handler;
+ struct p9_dev *p9dev;
+ struct virt_queue *vq;
+ struct p9_pdu *p9pdu;
+
+ vq = job->vq;
+ p9dev = job->p9dev;
+
+ p9pdu = virtio_p9_pdu_init(kvm, vq);
+ cmd = virtio_p9_get_cmd(p9pdu);
+
+ if ((cmd >= ARRAY_SIZE(virtio_9p_dotl_handler)) ||
+ !virtio_9p_dotl_handler[cmd])
+ handler = virtio_p9_eopnotsupp;
+ else
+ handler = virtio_9p_dotl_handler[cmd];
+
+ handler(p9dev, p9pdu, &len);
+ virt_queue__set_used_elem(vq, p9pdu->queue_head, len);
+ free(p9pdu);
+ return true;
+}
+
+static void virtio_p9_do_io(struct kvm *kvm, void *param)
+{
+ struct p9_dev_job *job = (struct p9_dev_job *)param;
+ struct p9_dev *p9dev = job->p9dev;
+ struct virt_queue *vq = job->vq;
+
+ while (virt_queue__available(vq)) {
+ virtio_p9_do_io_request(kvm, job);
+ p9dev->vdev.ops->signal_vq(kvm, &p9dev->vdev, vq - p9dev->vqs);
+ }
+}
+
+static u8 *get_config(struct kvm *kvm, void *dev)
+{
+ struct p9_dev *p9dev = dev;
+
+ return ((u8 *)(p9dev->config));
+}
+
+static u32 get_host_features(struct kvm *kvm, void *dev)
+{
+ return 1 << VIRTIO_9P_MOUNT_TAG;
+}
+
+static void set_guest_features(struct kvm *kvm, void *dev, u32 features)
+{
+ struct p9_dev *p9dev = dev;
+
+ p9dev->features = features;
+}
+
+static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 pfn)
+{
+ struct p9_dev *p9dev = dev;
+ struct p9_dev_job *job;
+ struct virt_queue *queue;
+ void *p;
+
+ compat__remove_message(compat_id);
+
+ queue = &p9dev->vqs[vq];
+ queue->pfn = pfn;
+ p = guest_pfn_to_host(kvm, queue->pfn);
+ job = &p9dev->jobs[vq];
+
+ vring_init(&queue->vring, VIRTQUEUE_NUM, p, VIRTIO_PCI_VRING_ALIGN);
+
+ *job = (struct p9_dev_job) {
+ .vq = queue,
+ .p9dev = p9dev,
+ };
+ thread_pool__init_job(&job->job_id, kvm, virtio_p9_do_io, job);
+
+ return 0;
+}
+
+static int notify_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+ struct p9_dev *p9dev = dev;
+
+ thread_pool__do_job(&p9dev->jobs[vq].job_id);
+
+ return 0;
+}
+
+static int get_pfn_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+ struct p9_dev *p9dev = dev;
+
+ return p9dev->vqs[vq].pfn;
+}
+
+static int get_size_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+ return VIRTQUEUE_NUM;
+}
+
+struct virtio_ops p9_dev_virtio_ops = (struct virtio_ops) {
+ .get_config = get_config,
+ .get_host_features = get_host_features,
+ .set_guest_features = set_guest_features,
+ .init_vq = init_vq,
+ .notify_vq = notify_vq,
+ .get_pfn_vq = get_pfn_vq,
+ .get_size_vq = get_size_vq,
+};
+
+int virtio_9p_rootdir_parser(const struct option *opt, const char *arg, int unset)
+{
+ char *tag_name;
+ char tmp[PATH_MAX];
+ struct kvm *kvm = opt->ptr;
+
+ /*
+ * 9p dir can be of the form dirname,tag_name or
+ * just dirname. In the latter case we use the
+ * default tag name.
+ */
+ tag_name = strstr(arg, ",");
+ if (tag_name) {
+ *tag_name = '\0';
+ tag_name++;
+ }
+ if (realpath(arg, tmp)) {
+ if (virtio_9p__register(kvm, tmp, tag_name) < 0)
+ die("Unable to initialize virtio 9p");
+ } else
+ die("Failed resolving 9p path");
+ return 0;
+}
+
+int virtio_9p_img_name_parser(const struct option *opt, const char *arg, int unset)
+{
+ char path[PATH_MAX];
+ struct stat st;
+ struct kvm *kvm = opt->ptr;
+
+ if (stat(arg, &st) == 0 &&
+ S_ISDIR(st.st_mode)) {
+ char tmp[PATH_MAX];
+
+ if (kvm->cfg.using_rootfs)
+ die("Please use at most one rootfs directory");
+
+ if (realpath(arg, tmp) == 0 ||
+ virtio_9p__register(kvm, tmp, "/dev/root") < 0)
+ die("Unable to initialize virtio 9p");
+ kvm->cfg.using_rootfs = 1;
+ return 0;
+ }
+
+ snprintf(path, PATH_MAX, "%s%s", kvm__get_dir(), arg);
+
+ if (stat(path, &st) == 0 &&
+ S_ISDIR(st.st_mode)) {
+ char tmp[PATH_MAX];
+
+ if (kvm->cfg.using_rootfs)
+ die("Please use at most one rootfs directory");
+
+ if (realpath(path, tmp) == 0 ||
+ virtio_9p__register(kvm, tmp, "/dev/root") < 0)
+ die("Unable to initialize virtio 9p");
+ if (virtio_9p__register(kvm, "/", "hostfs") < 0)
+ die("Unable to initialize virtio 9p");
+ kvm_setup_resolv(arg);
+ kvm->cfg.using_rootfs = kvm->cfg.custom_rootfs = 1;
+ kvm->cfg.custom_rootfs_name = arg;
+ return 0;
+ }
+
+ return -1;
+}
+
+int virtio_9p__init(struct kvm *kvm)
+{
+ struct p9_dev *p9dev;
+
+ list_for_each_entry(p9dev, &devs, list) {
+ virtio_init(kvm, p9dev, &p9dev->vdev, &p9_dev_virtio_ops,
+ VIRTIO_PCI, PCI_DEVICE_ID_VIRTIO_9P, VIRTIO_ID_9P, PCI_CLASS_9P);
+ }
+
+ return 0;
+}
+virtio_dev_init(virtio_9p__init);
+
+int virtio_9p__register(struct kvm *kvm, const char *root, const char *tag_name)
+{
+ struct p9_dev *p9dev;
+ int err = 0;
+
+ p9dev = calloc(1, sizeof(*p9dev));
+ if (!p9dev)
+ return -ENOMEM;
+
+ if (!tag_name)
+ tag_name = VIRTIO_9P_DEFAULT_TAG;
+
+ p9dev->config = calloc(1, sizeof(*p9dev->config) + strlen(tag_name) + 1);
+ if (p9dev->config == NULL) {
+ err = -ENOMEM;
+ goto free_p9dev;
+ }
+
+ strcpy(p9dev->root_dir, root);
+ p9dev->config->tag_len = strlen(tag_name);
+ if (p9dev->config->tag_len > MAX_TAG_LEN) {
+ err = -EINVAL;
+ goto free_p9dev_config;
+ }
+
+ memcpy(&p9dev->config->tag, tag_name, strlen(tag_name));
+
+ list_add(&p9dev->list, &devs);
+
+ if (compat_id == -1)
+ compat_id = virtio_compat_add_message("virtio-9p", "CONFIG_NET_9P_VIRTIO");
+
+ return err;
+
+free_p9dev_config:
+ free(p9dev->config);
+free_p9dev:
+ free(p9dev);
+ return err;
+}
--- /dev/null
+#include "kvm/virtio-balloon.h"
+
+#include "kvm/virtio-pci-dev.h"
+
+#include "kvm/virtio.h"
+#include "kvm/util.h"
+#include "kvm/kvm.h"
+#include "kvm/pci.h"
+#include "kvm/threadpool.h"
+#include "kvm/guest_compat.h"
+#include "kvm/kvm-ipc.h"
+
+#include <linux/virtio_ring.h>
+#include <linux/virtio_balloon.h>
+
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/mman.h>
+#include <pthread.h>
+#include <sys/eventfd.h>
+
+#define NUM_VIRT_QUEUES 3
+#define VIRTIO_BLN_QUEUE_SIZE 128
+#define VIRTIO_BLN_INFLATE 0
+#define VIRTIO_BLN_DEFLATE 1
+#define VIRTIO_BLN_STATS 2
+
+struct bln_dev {
+ struct list_head list;
+ struct virtio_device vdev;
+
+ u32 features;
+
+ /* virtio queue */
+ struct virt_queue vqs[NUM_VIRT_QUEUES];
+ struct thread_pool__job jobs[NUM_VIRT_QUEUES];
+
+ struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
+ struct virtio_balloon_stat *cur_stat;
+ u32 cur_stat_head;
+ u16 stat_count;
+ int stat_waitfd;
+
+ struct virtio_balloon_config config;
+};
+
+static struct bln_dev bdev;
+static int compat_id = -1;
+
+static bool virtio_bln_do_io_request(struct kvm *kvm, struct bln_dev *bdev, struct virt_queue *queue)
+{
+ struct iovec iov[VIRTIO_BLN_QUEUE_SIZE];
+ unsigned int len = 0;
+ u16 out, in, head;
+ u32 *ptrs, i;
+
+ head = virt_queue__get_iov(queue, iov, &out, &in, kvm);
+ ptrs = iov[0].iov_base;
+ len = iov[0].iov_len / sizeof(u32);
+
+ for (i = 0 ; i < len ; i++) {
+ void *guest_ptr;
+
+ guest_ptr = guest_flat_to_host(kvm, ptrs[i] << VIRTIO_BALLOON_PFN_SHIFT);
+ if (queue == &bdev->vqs[VIRTIO_BLN_INFLATE]) {
+ madvise(guest_ptr, 1 << VIRTIO_BALLOON_PFN_SHIFT, MADV_DONTNEED);
+ bdev->config.actual++;
+ } else if (queue == &bdev->vqs[VIRTIO_BLN_DEFLATE]) {
+ bdev->config.actual--;
+ }
+ }
+
+ virt_queue__set_used_elem(queue, head, len);
+
+ return true;
+}
+
+static bool virtio_bln_do_stat_request(struct kvm *kvm, struct bln_dev *bdev, struct virt_queue *queue)
+{
+ struct iovec iov[VIRTIO_BLN_QUEUE_SIZE];
+ u16 out, in, head;
+ struct virtio_balloon_stat *stat;
+ u64 wait_val = 1;
+
+ head = virt_queue__get_iov(queue, iov, &out, &in, kvm);
+ stat = iov[0].iov_base;
+
+ /* Initial empty stat buffer */
+ if (bdev->cur_stat == NULL) {
+ bdev->cur_stat = stat;
+ bdev->cur_stat_head = head;
+
+ return true;
+ }
+
+ memcpy(bdev->stats, stat, iov[0].iov_len);
+
+ bdev->stat_count = iov[0].iov_len / sizeof(struct virtio_balloon_stat);
+ bdev->cur_stat = stat;
+ bdev->cur_stat_head = head;
+
+ /* This function returns bool, so report failure as false, not -EFAULT */
+ if (write(bdev->stat_waitfd, &wait_val, sizeof(wait_val)) <= 0)
+ return false;
+
+ return true;
+}
+
+static void virtio_bln_do_io(struct kvm *kvm, void *param)
+{
+ struct virt_queue *vq = param;
+
+ if (vq == &bdev.vqs[VIRTIO_BLN_STATS]) {
+ virtio_bln_do_stat_request(kvm, &bdev, vq);
+ bdev.vdev.ops->signal_vq(kvm, &bdev.vdev, VIRTIO_BLN_STATS);
+ return;
+ }
+
+ while (virt_queue__available(vq)) {
+ virtio_bln_do_io_request(kvm, &bdev, vq);
+ bdev.vdev.ops->signal_vq(kvm, &bdev.vdev, vq - bdev.vqs);
+ }
+}
+
+static int virtio_bln__collect_stats(struct kvm *kvm)
+{
+ u64 tmp;
+
+ virt_queue__set_used_elem(&bdev.vqs[VIRTIO_BLN_STATS], bdev.cur_stat_head,
+ sizeof(struct virtio_balloon_stat));
+ bdev.vdev.ops->signal_vq(kvm, &bdev.vdev, VIRTIO_BLN_STATS);
+
+ if (read(bdev.stat_waitfd, &tmp, sizeof(tmp)) <= 0)
+ return -EFAULT;
+
+ return 0;
+}
+
+static void virtio_bln__print_stats(struct kvm *kvm, int fd, u32 type, u32 len, u8 *msg)
+{
+ int r;
+
+ if (WARN_ON(type != KVM_IPC_STAT || len))
+ return;
+
+ if (virtio_bln__collect_stats(kvm) < 0)
+ return;
+
+ r = write(fd, bdev.stats, sizeof(bdev.stats));
+ if (r < 0)
+ pr_warning("Failed sending memory stats");
+}
+
+static void handle_mem(struct kvm *kvm, int fd, u32 type, u32 len, u8 *msg)
+{
+ int mem;
+
+ if (WARN_ON(type != KVM_IPC_BALLOON || len != sizeof(int)))
+ return;
+
+ mem = *(int *)msg;
+ if (mem > 0) {
+ bdev.config.num_pages += 256 * mem;
+ } else if (mem < 0) {
+ if (bdev.config.num_pages < (u32)(256 * (-mem)))
+ return;
+
+ bdev.config.num_pages += 256 * mem;
+ }
+
+ /* Notify that the configuration space has changed */
+ bdev.vdev.ops->signal_config(kvm, &bdev.vdev);
+}
+
+static u8 *get_config(struct kvm *kvm, void *dev)
+{
+ struct bln_dev *bdev = dev;
+
+ return ((u8 *)(&bdev->config));
+}
+
+static u32 get_host_features(struct kvm *kvm, void *dev)
+{
+ return 1 << VIRTIO_BALLOON_F_STATS_VQ;
+}
+
+static void set_guest_features(struct kvm *kvm, void *dev, u32 features)
+{
+ struct bln_dev *bdev = dev;
+
+ bdev->features = features;
+}
+
+static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 pfn)
+{
+ struct bln_dev *bdev = dev;
+ struct virt_queue *queue;
+ void *p;
+
+ compat__remove_message(compat_id);
+
+ queue = &bdev->vqs[vq];
+ queue->pfn = pfn;
+ p = guest_pfn_to_host(kvm, queue->pfn);
+
+ thread_pool__init_job(&bdev->jobs[vq], kvm, virtio_bln_do_io, queue);
+ vring_init(&queue->vring, VIRTIO_BLN_QUEUE_SIZE, p, VIRTIO_PCI_VRING_ALIGN);
+
+ return 0;
+}
+
+static int notify_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+ struct bln_dev *bdev = dev;
+
+ thread_pool__do_job(&bdev->jobs[vq]);
+
+ return 0;
+}
+
+static int get_pfn_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+ struct bln_dev *bdev = dev;
+
+ return bdev->vqs[vq].pfn;
+}
+
+static int get_size_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+ return VIRTIO_BLN_QUEUE_SIZE;
+}
+
+struct virtio_ops bln_dev_virtio_ops = (struct virtio_ops) {
+ .get_config = get_config,
+ .get_host_features = get_host_features,
+ .set_guest_features = set_guest_features,
+ .init_vq = init_vq,
+ .notify_vq = notify_vq,
+ .get_pfn_vq = get_pfn_vq,
+ .get_size_vq = get_size_vq,
+};
+
+int virtio_bln__init(struct kvm *kvm)
+{
+ if (!kvm->cfg.balloon)
+ return 0;
+
+ kvm_ipc__register_handler(KVM_IPC_BALLOON, handle_mem);
+ kvm_ipc__register_handler(KVM_IPC_STAT, virtio_bln__print_stats);
+
+ bdev.stat_waitfd = eventfd(0, 0);
+ memset(&bdev.config, 0, sizeof(struct virtio_balloon_config));
+
+ virtio_init(kvm, &bdev, &bdev.vdev, &bln_dev_virtio_ops,
+ VIRTIO_PCI, PCI_DEVICE_ID_VIRTIO_BLN, VIRTIO_ID_BALLOON, PCI_CLASS_BLN);
+
+ if (compat_id == -1)
+ compat_id = virtio_compat_add_message("virtio-balloon", "CONFIG_VIRTIO_BALLOON");
+
+ return 0;
+}
+virtio_dev_init(virtio_bln__init);
+
+int virtio_bln__exit(struct kvm *kvm)
+{
+ return 0;
+}
+virtio_dev_exit(virtio_bln__exit);
--- /dev/null
+#include "kvm/virtio-blk.h"
+
+#include "kvm/virtio-pci-dev.h"
+#include "kvm/disk-image.h"
+#include "kvm/mutex.h"
+#include "kvm/util.h"
+#include "kvm/kvm.h"
+#include "kvm/pci.h"
+#include "kvm/threadpool.h"
+#include "kvm/ioeventfd.h"
+#include "kvm/guest_compat.h"
+#include "kvm/virtio-pci.h"
+#include "kvm/virtio.h"
+
+#include <linux/virtio_ring.h>
+#include <linux/virtio_blk.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/types.h>
+#include <pthread.h>
+
+#define VIRTIO_BLK_MAX_DEV 4
+
+/*
+ * The header and status consume two entries.
+ */
+#define DISK_SEG_MAX (VIRTIO_BLK_QUEUE_SIZE - 2)
+#define VIRTIO_BLK_QUEUE_SIZE 256
+#define NUM_VIRT_QUEUES 1
+
+struct blk_dev_req {
+ struct virt_queue *vq;
+ struct blk_dev *bdev;
+ struct iovec iov[VIRTIO_BLK_QUEUE_SIZE];
+ u16 out, in, head;
+ struct kvm *kvm;
+};
+
+struct blk_dev {
+ pthread_mutex_t mutex;
+
+ struct list_head list;
+
+ struct virtio_device vdev;
+ struct virtio_blk_config blk_config;
+ struct disk_image *disk;
+ u32 features;
+
+ struct virt_queue vqs[NUM_VIRT_QUEUES];
+ struct blk_dev_req reqs[VIRTIO_BLK_QUEUE_SIZE];
+
+ pthread_t io_thread;
+ int io_efd;
+
+ struct kvm *kvm;
+};
+
+static LIST_HEAD(bdevs);
+static int compat_id = -1;
+
+void virtio_blk_complete(void *param, long len)
+{
+ struct blk_dev_req *req = param;
+ struct blk_dev *bdev = req->bdev;
+ int queueid = req->vq - bdev->vqs;
+ u8 *status;
+
+ /* status */
+ status = req->iov[req->out + req->in - 1].iov_base;
+ *status = (len < 0) ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK;
+
+ mutex_lock(&bdev->mutex);
+ virt_queue__set_used_elem(req->vq, req->head, len);
+ mutex_unlock(&bdev->mutex);
+
+ if (virtio_queue__should_signal(&bdev->vqs[queueid]))
+ bdev->vdev.ops->signal_vq(req->kvm, &bdev->vdev, queueid);
+}
+
+static void virtio_blk_do_io_request(struct kvm *kvm, struct blk_dev_req *req)
+{
+ struct virtio_blk_outhdr *req_hdr;
+ ssize_t block_cnt;
+ struct blk_dev *bdev;
+ struct iovec *iov;
+ u16 out, in;
+
+ block_cnt = -1;
+ bdev = req->bdev;
+ iov = req->iov;
+ out = req->out;
+ in = req->in;
+ req_hdr = iov[0].iov_base;
+
+ switch (req_hdr->type) {
+ case VIRTIO_BLK_T_IN:
+ block_cnt = disk_image__read(bdev->disk, req_hdr->sector,
+ iov + 1, in + out - 2, req);
+ break;
+ case VIRTIO_BLK_T_OUT:
+ block_cnt = disk_image__write(bdev->disk, req_hdr->sector,
+ iov + 1, in + out - 2, req);
+ break;
+ case VIRTIO_BLK_T_FLUSH:
+ block_cnt = disk_image__flush(bdev->disk);
+ virtio_blk_complete(req, block_cnt);
+ break;
+ case VIRTIO_BLK_T_GET_ID:
+ block_cnt = VIRTIO_BLK_ID_BYTES;
+ disk_image__get_serial(bdev->disk,
+ (iov + 1)->iov_base, &block_cnt);
+ virtio_blk_complete(req, block_cnt);
+ break;
+ default:
+ pr_warning("request type %d", req_hdr->type);
+ block_cnt = -1;
+ break;
+ }
+}
+
+static void virtio_blk_do_io(struct kvm *kvm, struct virt_queue *vq, struct blk_dev *bdev)
+{
+ struct blk_dev_req *req;
+ u16 head;
+
+ while (virt_queue__available(vq)) {
+ head = virt_queue__pop(vq);
+ req = &bdev->reqs[head];
+ req->head = virt_queue__get_head_iov(vq, req->iov, &req->out,
+ &req->in, head, kvm);
+ req->vq = vq;
+
+ virtio_blk_do_io_request(kvm, req);
+ }
+}
+
+static u8 *get_config(struct kvm *kvm, void *dev)
+{
+ struct blk_dev *bdev = dev;
+
+ return ((u8 *)(&bdev->blk_config));
+}
+
+static u32 get_host_features(struct kvm *kvm, void *dev)
+{
+ return 1UL << VIRTIO_BLK_F_SEG_MAX
+ | 1UL << VIRTIO_BLK_F_FLUSH
+ | 1UL << VIRTIO_RING_F_EVENT_IDX
+ | 1UL << VIRTIO_RING_F_INDIRECT_DESC;
+}
+
+static void set_guest_features(struct kvm *kvm, void *dev, u32 features)
+{
+ struct blk_dev *bdev = dev;
+
+ bdev->features = features;
+}
+
+static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 pfn)
+{
+ struct blk_dev *bdev = dev;
+ struct virt_queue *queue;
+ void *p;
+
+ compat__remove_message(compat_id);
+
+ queue = &bdev->vqs[vq];
+ queue->pfn = pfn;
+ p = guest_pfn_to_host(kvm, queue->pfn);
+
+ vring_init(&queue->vring, VIRTIO_BLK_QUEUE_SIZE, p, VIRTIO_PCI_VRING_ALIGN);
+
+ return 0;
+}
+
+static void *virtio_blk_thread(void *dev)
+{
+ struct blk_dev *bdev = dev;
+ u64 data;
+ int r;
+
+ while (1) {
+ r = read(bdev->io_efd, &data, sizeof(u64));
+ if (r < 0)
+ continue;
+ virtio_blk_do_io(bdev->kvm, &bdev->vqs[0], bdev);
+ }
+
+ pthread_exit(NULL);
+ return NULL;
+}
+
+static int notify_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+ struct blk_dev *bdev = dev;
+ u64 data = 1;
+ int r;
+
+ r = write(bdev->io_efd, &data, sizeof(data));
+ if (r < 0)
+ return r;
+
+ return 0;
+}
+
+static int get_pfn_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+ struct blk_dev *bdev = dev;
+
+ return bdev->vqs[vq].pfn;
+}
+
+static int get_size_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+ /* FIXME: dynamic */
+ return VIRTIO_BLK_QUEUE_SIZE;
+}
+
+static int set_size_vq(struct kvm *kvm, void *dev, u32 vq, int size)
+{
+ /* FIXME: dynamic */
+ return size;
+}
+
+static struct virtio_ops blk_dev_virtio_ops = (struct virtio_ops) {
+ .get_config = get_config,
+ .get_host_features = get_host_features,
+ .set_guest_features = set_guest_features,
+ .init_vq = init_vq,
+ .notify_vq = notify_vq,
+ .get_pfn_vq = get_pfn_vq,
+ .get_size_vq = get_size_vq,
+ .set_size_vq = set_size_vq,
+};
+
+static int virtio_blk__init_one(struct kvm *kvm, struct disk_image *disk)
+{
+ struct blk_dev *bdev;
+ unsigned int i;
+
+ if (!disk)
+ return -EINVAL;
+
+ bdev = calloc(1, sizeof(struct blk_dev));
+ if (bdev == NULL)
+ return -ENOMEM;
+
+ *bdev = (struct blk_dev) {
+ .mutex = PTHREAD_MUTEX_INITIALIZER,
+ .disk = disk,
+ .blk_config = (struct virtio_blk_config) {
+ .capacity = disk->size / SECTOR_SIZE,
+ .seg_max = DISK_SEG_MAX,
+ },
+ .io_efd = eventfd(0, 0),
+ .kvm = kvm,
+ };
+
+ virtio_init(kvm, bdev, &bdev->vdev, &blk_dev_virtio_ops,
+ VIRTIO_PCI, PCI_DEVICE_ID_VIRTIO_BLK, VIRTIO_ID_BLOCK, PCI_CLASS_BLK);
+
+ list_add_tail(&bdev->list, &bdevs);
+
+ for (i = 0; i < ARRAY_SIZE(bdev->reqs); i++) {
+ bdev->reqs[i].bdev = bdev;
+ bdev->reqs[i].kvm = kvm;
+ }
+
+ disk_image__set_callback(bdev->disk, virtio_blk_complete);
+
+ pthread_create(&bdev->io_thread, NULL, virtio_blk_thread, bdev);
+ if (compat_id == -1)
+ compat_id = virtio_compat_add_message("virtio-blk", "CONFIG_VIRTIO_BLK");
+
+ return 0;
+}
+
+static int virtio_blk__exit_one(struct kvm *kvm, struct blk_dev *bdev)
+{
+ list_del(&bdev->list);
+ free(bdev);
+
+ return 0;
+}
+
+int virtio_blk__init(struct kvm *kvm)
+{
+ int i, r = 0;
+
+ for (i = 0; i < kvm->nr_disks; i++) {
+ if (kvm->disks[i]->wwpn)
+ continue;
+ r = virtio_blk__init_one(kvm, kvm->disks[i]);
+ if (r < 0)
+ goto cleanup;
+ }
+
+ return 0;
+cleanup:
+ return virtio_blk__exit(kvm);
+}
+virtio_dev_init(virtio_blk__init);
+
+int virtio_blk__exit(struct kvm *kvm)
+{
+ while (!list_empty(&bdevs)) {
+ struct blk_dev *bdev;
+
+ bdev = list_first_entry(&bdevs, struct blk_dev, list);
+ virtio_blk__exit_one(kvm, bdev);
+ }
+
+ return 0;
+}
+virtio_dev_exit(virtio_blk__exit);
--- /dev/null
+#include "kvm/virtio-console.h"
+#include "kvm/virtio-pci-dev.h"
+#include "kvm/disk-image.h"
+#include "kvm/virtio.h"
+#include "kvm/ioport.h"
+#include "kvm/util.h"
+#include "kvm/term.h"
+#include "kvm/mutex.h"
+#include "kvm/kvm.h"
+#include "kvm/pci.h"
+#include "kvm/threadpool.h"
+#include "kvm/irq.h"
+#include "kvm/guest_compat.h"
+
+#include <linux/virtio_console.h>
+#include <linux/virtio_ring.h>
+#include <linux/virtio_blk.h>
+
+#include <sys/uio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <termios.h>
+#include <unistd.h>
+#include <fcntl.h>
+
+#define VIRTIO_CONSOLE_QUEUE_SIZE 128
+#define VIRTIO_CONSOLE_NUM_QUEUES 2
+#define VIRTIO_CONSOLE_RX_QUEUE 0
+#define VIRTIO_CONSOLE_TX_QUEUE 1
+
+struct con_dev {
+ pthread_mutex_t mutex;
+
+ struct virtio_device vdev;
+ struct virt_queue vqs[VIRTIO_CONSOLE_NUM_QUEUES];
+ struct virtio_console_config config;
+ u32 features;
+
+ struct thread_pool__job jobs[VIRTIO_CONSOLE_NUM_QUEUES];
+};
+
+static struct con_dev cdev = {
+ .mutex = PTHREAD_MUTEX_INITIALIZER,
+
+ .config = {
+ .cols = 80,
+ .rows = 24,
+ .max_nr_ports = 1,
+ },
+};
+
+static int compat_id = -1;
+
+/*
+ * Interrupts are injected for hvc0 only.
+ */
+static void virtio_console__inject_interrupt_callback(struct kvm *kvm, void *param)
+{
+ struct iovec iov[VIRTIO_CONSOLE_QUEUE_SIZE];
+ struct virt_queue *vq;
+ u16 out, in;
+ u16 head;
+ int len;
+
+ if (kvm->cfg.active_console != CONSOLE_VIRTIO)
+ return;
+
+ mutex_lock(&cdev.mutex);
+
+ vq = param;
+
+ if (term_readable(0) && virt_queue__available(vq)) {
+ head = virt_queue__get_iov(vq, iov, &out, &in, kvm);
+ len = term_getc_iov(kvm, iov, in, 0);
+ virt_queue__set_used_elem(vq, head, len);
+ cdev.vdev.ops->signal_vq(kvm, &cdev.vdev, vq - cdev.vqs);
+ }
+
+ mutex_unlock(&cdev.mutex);
+}
+
+void virtio_console__inject_interrupt(struct kvm *kvm)
+{
+ thread_pool__do_job(&cdev.jobs[VIRTIO_CONSOLE_RX_QUEUE]);
+}
+
+static void virtio_console_handle_callback(struct kvm *kvm, void *param)
+{
+ struct iovec iov[VIRTIO_CONSOLE_QUEUE_SIZE];
+ struct virt_queue *vq;
+ u16 out, in;
+ u16 head;
+ u32 len;
+
+ vq = param;
+
+ /*
+ * The current Linux implementation polls for the buffer
+ * to be used, rather than waiting for an interrupt.
+ * So there is no need to inject an interrupt for the tx path.
+ */
+
+ while (virt_queue__available(vq)) {
+ head = virt_queue__get_iov(vq, iov, &out, &in, kvm);
+ if (kvm->cfg.active_console == CONSOLE_VIRTIO)
+ len = term_putc_iov(iov, out, 0);
+ else
+ len = 0;
+ virt_queue__set_used_elem(vq, head, len);
+ }
+
+}
+
+static u8 *get_config(struct kvm *kvm, void *dev)
+{
+ struct con_dev *cdev = dev;
+
+ return ((u8 *)(&cdev->config));
+}
+
+static u32 get_host_features(struct kvm *kvm, void *dev)
+{
+ return 0;
+}
+
+static void set_guest_features(struct kvm *kvm, void *dev, u32 features)
+{
+ /* Unused */
+}
+
+static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 pfn)
+{
+ struct virt_queue *queue;
+ void *p;
+
+ BUG_ON(vq >= VIRTIO_CONSOLE_NUM_QUEUES);
+
+ compat__remove_message(compat_id);
+
+ queue = &cdev.vqs[vq];
+ queue->pfn = pfn;
+ p = guest_pfn_to_host(kvm, queue->pfn);
+
+ vring_init(&queue->vring, VIRTIO_CONSOLE_QUEUE_SIZE, p, VIRTIO_PCI_VRING_ALIGN);
+
+ if (vq == VIRTIO_CONSOLE_TX_QUEUE)
+ thread_pool__init_job(&cdev.jobs[vq], kvm, virtio_console_handle_callback, queue);
+ else if (vq == VIRTIO_CONSOLE_RX_QUEUE)
+ thread_pool__init_job(&cdev.jobs[vq], kvm, virtio_console__inject_interrupt_callback, queue);
+
+ return 0;
+}
+
+static int notify_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+ struct con_dev *cdev = dev;
+
+ thread_pool__do_job(&cdev->jobs[vq]);
+
+ return 0;
+}
+
+static int get_pfn_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+ struct con_dev *cdev = dev;
+
+ return cdev->vqs[vq].pfn;
+}
+
+static int get_size_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+ return VIRTIO_CONSOLE_QUEUE_SIZE;
+}
+
+static struct virtio_ops con_dev_virtio_ops = (struct virtio_ops) {
+ .get_config = get_config,
+ .get_host_features = get_host_features,
+ .set_guest_features = set_guest_features,
+ .init_vq = init_vq,
+ .notify_vq = notify_vq,
+ .get_pfn_vq = get_pfn_vq,
+ .get_size_vq = get_size_vq,
+};
+
+int virtio_console__init(struct kvm *kvm)
+{
+ if (kvm->cfg.active_console != CONSOLE_VIRTIO)
+ return 0;
+
+ virtio_init(kvm, &cdev, &cdev.vdev, &con_dev_virtio_ops,
+ VIRTIO_PCI, PCI_DEVICE_ID_VIRTIO_CONSOLE, VIRTIO_ID_CONSOLE, PCI_CLASS_CONSOLE);
+ if (compat_id == -1)
+ compat_id = virtio_compat_add_message("virtio-console", "CONFIG_VIRTIO_CONSOLE");
+
+ return 0;
+}
+virtio_dev_init(virtio_console__init);
+
+int virtio_console__exit(struct kvm *kvm)
+{
+ return 0;
+}
+virtio_dev_exit(virtio_console__exit);
--- /dev/null
+#include <linux/virtio_ring.h>
+#include <linux/types.h>
+#include <sys/uio.h>
+#include <stdlib.h>
+
+#include "kvm/guest_compat.h"
+#include "kvm/barrier.h"
+#include "kvm/virtio.h"
+#include "kvm/virtio-pci.h"
+#include "kvm/virtio-mmio.h"
+#include "kvm/util.h"
+#include "kvm/kvm.h"
+
+
+struct vring_used_elem *virt_queue__set_used_elem(struct virt_queue *queue, u32 head, u32 len)
+{
+ struct vring_used_elem *used_elem;
+
+ used_elem = &queue->vring.used->ring[queue->vring.used->idx % queue->vring.num];
+ used_elem->id = head;
+ used_elem->len = len;
+
+ /*
+ * Use wmb to ensure that used elem was updated with head and len.
+ * We need a wmb here since we can't advance idx unless we're ready
+ * to pass the used element to the guest.
+ */
+ wmb();
+ queue->vring.used->idx++;
+
+ /*
+ * Use wmb to ensure used idx has been increased before we signal the guest.
+ * Without a wmb here the guest may ignore the queue since it won't see
+ * an updated idx.
+ */
+ wmb();
+
+ return used_elem;
+}
+
+/*
+ * Each buffer in the virtqueues is actually a chain of descriptors. This
+ * function returns the next descriptor in the chain, or max if we're at the
+ * end.
+ */
+static unsigned next_desc(struct vring_desc *desc,
+ unsigned int i, unsigned int max)
+{
+ unsigned int next;
+
+ /* If this descriptor says it doesn't chain, we're done. */
+ if (!(desc[i].flags & VRING_DESC_F_NEXT))
+ return max;
+
+ /* Check they're not leading us off end of descriptors. */
+ next = desc[i].next;
+ /* Make sure compiler knows to grab that: we don't want it changing! */
+ wmb();
+
+ return next;
+}
+
+u16 virt_queue__get_head_iov(struct virt_queue *vq, struct iovec iov[], u16 *out, u16 *in, u16 head, struct kvm *kvm)
+{
+ struct vring_desc *desc;
+ u16 idx;
+ u16 max;
+
+ idx = head;
+ *out = *in = 0;
+ max = vq->vring.num;
+ desc = vq->vring.desc;
+
+ if (desc[idx].flags & VRING_DESC_F_INDIRECT) {
+ max = desc[idx].len / sizeof(struct vring_desc);
+ desc = guest_flat_to_host(kvm, desc[idx].addr);
+ idx = 0;
+ }
+
+ do {
+ /* Grab the first descriptor, and check it's OK. */
+ iov[*out + *in].iov_len = desc[idx].len;
+ iov[*out + *in].iov_base = guest_flat_to_host(kvm, desc[idx].addr);
+ /* If this is an input descriptor, increment that count. */
+ if (desc[idx].flags & VRING_DESC_F_WRITE)
+ (*in)++;
+ else
+ (*out)++;
+ } while ((idx = next_desc(desc, idx, max)) != max);
+
+ return head;
+}
+
+u16 virt_queue__get_iov(struct virt_queue *vq, struct iovec iov[], u16 *out, u16 *in, struct kvm *kvm)
+{
+ u16 head;
+
+ head = virt_queue__pop(vq);
+
+ return virt_queue__get_head_iov(vq, iov, out, in, head, kvm);
+}
+
+/* in and out are relative to guest */
+u16 virt_queue__get_inout_iov(struct kvm *kvm, struct virt_queue *queue,
+ struct iovec in_iov[], struct iovec out_iov[],
+ u16 *in, u16 *out)
+{
+ struct vring_desc *desc;
+ u16 head, idx;
+
+ idx = head = virt_queue__pop(queue);
+ *out = *in = 0;
+ do {
+ desc = virt_queue__get_desc(queue, idx);
+ if (desc->flags & VRING_DESC_F_WRITE) {
+ in_iov[*in].iov_base = guest_flat_to_host(kvm,
+ desc->addr);
+ in_iov[*in].iov_len = desc->len;
+ (*in)++;
+ } else {
+ out_iov[*out].iov_base = guest_flat_to_host(kvm,
+ desc->addr);
+ out_iov[*out].iov_len = desc->len;
+ (*out)++;
+ }
+ if (desc->flags & VRING_DESC_F_NEXT)
+ idx = desc->next;
+ else
+ break;
+ } while (1);
+
+ return head;
+}
+
+int virtio__get_dev_specific_field(int offset, bool msix, u32 *config_off)
+{
+ if (msix) {
+ if (offset < 4)
+ return VIRTIO_PCI_O_MSIX;
+ else
+ offset -= 4;
+ }
+
+ *config_off = offset;
+
+ return VIRTIO_PCI_O_CONFIG;
+}
+
+bool virtio_queue__should_signal(struct virt_queue *vq)
+{
+ u16 old_idx, new_idx, event_idx;
+
+ old_idx = vq->last_used_signalled;
+ new_idx = vq->vring.used->idx;
+ event_idx = vring_used_event(&vq->vring);
+
+ if (vring_need_event(event_idx, new_idx, old_idx)) {
+ vq->last_used_signalled = new_idx;
+ return true;
+ }
+
+ return false;
+}
+
+int virtio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
+ struct virtio_ops *ops, enum virtio_trans trans,
+ int device_id, int subsys_id, int class)
+{
+ void *virtio;
+
+ switch (trans) {
+ case VIRTIO_PCI:
+ virtio = calloc(1, sizeof(struct virtio_pci));
+ if (!virtio)
+ return -ENOMEM;
+ vdev->virtio = virtio;
+ vdev->ops = ops;
+ vdev->ops->signal_vq = virtio_pci__signal_vq;
+ vdev->ops->signal_config = virtio_pci__signal_config;
+ vdev->ops->init = virtio_pci__init;
+ vdev->ops->exit = virtio_pci__exit;
+ vdev->ops->init(kvm, dev, vdev, device_id, subsys_id, class);
+ break;
+ case VIRTIO_MMIO:
+ virtio = calloc(1, sizeof(struct virtio_mmio));
+ if (!virtio)
+ return -ENOMEM;
+ vdev->virtio = virtio;
+ vdev->ops = ops;
+ vdev->ops->signal_vq = virtio_mmio_signal_vq;
+ vdev->ops->signal_config = virtio_mmio_signal_config;
+ vdev->ops->init = virtio_mmio_init;
+ vdev->ops->exit = virtio_mmio_exit;
+ vdev->ops->init(kvm, dev, vdev, device_id, subsys_id, class);
+ break;
+ default:
+ return -1;
+ }
+
+ return 0;
+}
+
+int virtio_compat_add_message(const char *device, const char *config)
+{
+ int len = 1024;
+ int compat_id;
+ char *title;
+ char *desc;
+
+ title = malloc(len);
+ if (!title)
+ return -ENOMEM;
+
+ desc = malloc(len);
+ if (!desc) {
+ free(title);
+ return -ENOMEM;
+ }
+
+ snprintf(title, len, "%s device was not detected.", device);
+ snprintf(desc, len, "While you have requested a %s device, "
+ "the guest kernel did not initialize it.\n"
+ "\tPlease make sure that the guest kernel was "
+ "compiled with %s=y enabled in .config.",
+ device, config);
+
+ compat_id = compat__add_message(title, desc);
+
+ free(desc);
+ free(title);
+
+ return compat_id;
+}
--- /dev/null
+#include "kvm/virtio-mmio.h"
+#include "kvm/ioeventfd.h"
+#include "kvm/ioport.h"
+#include "kvm/virtio.h"
+#include "kvm/kvm.h"
+#include "kvm/irq.h"
+
+#include <linux/virtio_mmio.h>
+#include <string.h>
+
+static u32 virtio_mmio_io_space_blocks = KVM_VIRTIO_MMIO_AREA;
+
+static u32 virtio_mmio_get_io_space_block(u32 size)
+{
+ u32 block = virtio_mmio_io_space_blocks;
+ virtio_mmio_io_space_blocks += size;
+
+ return block;
+}
+
+static void virtio_mmio_ioevent_callback(struct kvm *kvm, void *param)
+{
+ struct virtio_mmio_ioevent_param *ioeventfd = param;
+ struct virtio_mmio *vmmio = ioeventfd->vdev->virtio;
+
+ ioeventfd->vdev->ops->notify_vq(kvm, vmmio->dev, ioeventfd->vq);
+}
+
+static int virtio_mmio_init_ioeventfd(struct kvm *kvm,
+ struct virtio_device *vdev, u32 vq)
+{
+ struct virtio_mmio *vmmio = vdev->virtio;
+ struct ioevent ioevent;
+ int err;
+
+ vmmio->ioeventfds[vq] = (struct virtio_mmio_ioevent_param) {
+ .vdev = vdev,
+ .vq = vq,
+ };
+
+ ioevent = (struct ioevent) {
+ .io_addr = vmmio->addr + VIRTIO_MMIO_QUEUE_NOTIFY,
+ .io_len = sizeof(u32),
+ .fn = virtio_mmio_ioevent_callback,
+ .fn_ptr = &vmmio->ioeventfds[vq],
+ .datamatch = vq,
+ .fn_kvm = kvm,
+ .fd = eventfd(0, 0),
+ };
+
+ if (vdev->use_vhost)
+ /*
+ * Vhost will poll the eventfd on the host kernel side,
+ * no need to poll in userspace.
+ */
+ err = ioeventfd__add_event(&ioevent, true, false);
+ else
+ /* Need to poll in userspace. */
+ err = ioeventfd__add_event(&ioevent, true, true);
+ if (err)
+ return err;
+
+ if (vdev->ops->notify_vq_eventfd)
+ vdev->ops->notify_vq_eventfd(kvm, vmmio->dev, vq, ioevent.fd);
+
+ return 0;
+}
+
+int virtio_mmio_signal_vq(struct kvm *kvm, struct virtio_device *vdev, u32 vq)
+{
+ struct virtio_mmio *vmmio = vdev->virtio;
+
+ vmmio->hdr.interrupt_state |= VIRTIO_MMIO_INT_VRING;
+ kvm__irq_trigger(vmmio->kvm, vmmio->irq);
+
+ return 0;
+}
+
+int virtio_mmio_signal_config(struct kvm *kvm, struct virtio_device *vdev)
+{
+ struct virtio_mmio *vmmio = vdev->virtio;
+
+ vmmio->hdr.interrupt_state |= VIRTIO_MMIO_INT_CONFIG;
+ kvm__irq_trigger(vmmio->kvm, vmmio->irq);
+
+ return 0;
+}
+
+static void virtio_mmio_device_specific(u64 addr, u8 *data, u32 len,
+ u8 is_write, struct virtio_device *vdev)
+{
+ struct virtio_mmio *vmmio = vdev->virtio;
+ u32 i;
+
+ for (i = 0; i < len; i++) {
+ if (is_write)
+ vdev->ops->get_config(vmmio->kvm, vmmio->dev)[addr + i] =
+ data[i];
+ else
+ data[i] = vdev->ops->get_config(vmmio->kvm,
+ vmmio->dev)[addr + i];
+ }
+}
+
+static void virtio_mmio_config_in(u64 addr, void *data, u32 len,
+ struct virtio_device *vdev)
+{
+ struct virtio_mmio *vmmio = vdev->virtio;
+ u32 val = 0;
+
+ switch (addr) {
+ case VIRTIO_MMIO_MAGIC_VALUE:
+ case VIRTIO_MMIO_VERSION:
+ case VIRTIO_MMIO_DEVICE_ID:
+ case VIRTIO_MMIO_VENDOR_ID:
+ case VIRTIO_MMIO_STATUS:
+ case VIRTIO_MMIO_INTERRUPT_STATUS:
+ ioport__write32(data, *(u32 *)(((void *)&vmmio->hdr) + addr));
+ break;
+ case VIRTIO_MMIO_HOST_FEATURES:
+ if (vmmio->hdr.host_features_sel == 0)
+ val = vdev->ops->get_host_features(vmmio->kvm,
+ vmmio->dev);
+ ioport__write32(data, val);
+ break;
+ case VIRTIO_MMIO_QUEUE_PFN:
+ val = vdev->ops->get_pfn_vq(vmmio->kvm, vmmio->dev,
+ vmmio->hdr.queue_sel);
+ ioport__write32(data, val);
+ break;
+ case VIRTIO_MMIO_QUEUE_NUM_MAX:
+ val = vdev->ops->get_size_vq(vmmio->kvm, vmmio->dev,
+ vmmio->hdr.queue_sel);
+ ioport__write32(data, val);
+ break;
+ default:
+ break;
+ }
+}
+
+static void virtio_mmio_config_out(u64 addr, void *data, u32 len,
+ struct virtio_device *vdev)
+{
+ struct virtio_mmio *vmmio = vdev->virtio;
+ u32 val = 0;
+
+ switch (addr) {
+ case VIRTIO_MMIO_HOST_FEATURES_SEL:
+ case VIRTIO_MMIO_GUEST_FEATURES_SEL:
+ case VIRTIO_MMIO_QUEUE_SEL:
+ case VIRTIO_MMIO_STATUS:
+ val = ioport__read32(data);
+ *(u32 *)(((void *)&vmmio->hdr) + addr) = val;
+ break;
+ case VIRTIO_MMIO_GUEST_FEATURES:
+ if (vmmio->hdr.guest_features_sel == 0) {
+ val = ioport__read32(data);
+ vdev->ops->set_guest_features(vmmio->kvm,
+ vmmio->dev, val);
+ }
+ break;
+ case VIRTIO_MMIO_GUEST_PAGE_SIZE:
+ val = ioport__read32(data);
+ vmmio->hdr.guest_page_size = val;
+ /* FIXME: set guest page size */
+ break;
+ case VIRTIO_MMIO_QUEUE_NUM:
+ val = ioport__read32(data);
+ vmmio->hdr.queue_num = val;
+ /* FIXME: set vq size */
+ vdev->ops->set_size_vq(vmmio->kvm, vmmio->dev,
+ vmmio->hdr.queue_sel, val);
+ break;
+ case VIRTIO_MMIO_QUEUE_ALIGN:
+ val = ioport__read32(data);
+ vmmio->hdr.queue_align = val;
+ /* FIXME: set used ring alignment */
+ break;
+ case VIRTIO_MMIO_QUEUE_PFN:
+ val = ioport__read32(data);
+ virtio_mmio_init_ioeventfd(vmmio->kvm, vdev, vmmio->hdr.queue_sel);
+ vdev->ops->init_vq(vmmio->kvm, vmmio->dev,
+ vmmio->hdr.queue_sel, val);
+ break;
+ case VIRTIO_MMIO_QUEUE_NOTIFY:
+ val = ioport__read32(data);
+ vdev->ops->notify_vq(vmmio->kvm, vmmio->dev, val);
+ break;
+ case VIRTIO_MMIO_INTERRUPT_ACK:
+ val = ioport__read32(data);
+ vmmio->hdr.interrupt_state &= ~val;
+ break;
+ default:
+ break;
+ }
+}
+
+static void virtio_mmio_mmio_callback(u64 addr, u8 *data, u32 len,
+ u8 is_write, void *ptr)
+{
+ struct virtio_device *vdev = ptr;
+ struct virtio_mmio *vmmio = vdev->virtio;
+ u32 offset = addr - vmmio->addr;
+
+ if (offset >= VIRTIO_MMIO_CONFIG) {
+ offset -= VIRTIO_MMIO_CONFIG;
+ virtio_mmio_device_specific(offset, data, len, is_write, ptr);
+ return;
+ }
+
+ if (is_write)
+ virtio_mmio_config_out(offset, data, len, ptr);
+ else
+ virtio_mmio_config_in(offset, data, len, ptr);
+}
+
+int virtio_mmio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
+ int device_id, int subsys_id, int class)
+{
+ struct virtio_mmio *vmmio = vdev->virtio;
+ u8 device, pin, line;
+
+ vmmio->addr = virtio_mmio_get_io_space_block(VIRTIO_MMIO_IO_SIZE);
+ vmmio->kvm = kvm;
+ vmmio->dev = dev;
+
+ kvm__register_mmio(kvm, vmmio->addr, VIRTIO_MMIO_IO_SIZE,
+ false, virtio_mmio_mmio_callback, vdev);
+
+ vmmio->hdr = (struct virtio_mmio_hdr) {
+ .magic = {'v', 'i', 'r', 't'},
+ .version = 1,
+ .device_id = device_id - 0x1000 + 1,
+ .vendor_id = 0x4d564b4c, /* 'LKVM' */
+ .queue_num_max = 256,
+ };
+
+ if (irq__register_device(subsys_id, &device, &pin, &line) < 0)
+ return -1;
+ vmmio->irq = line;
+
+ /*
+ * Instantiate guest virtio-mmio devices using the kernel command line
+ * (or module) parameter, one instance per device, e.g.
+ *
+ * virtio_mmio.device=0x200@0xd2000000:5 virtio_mmio.device=0x200@0xd2000200:6
+ */
+ pr_info("virtio_mmio.device=0x%x@0x%x:%d\n", VIRTIO_MMIO_IO_SIZE, vmmio->addr, line);
+
+ return 0;
+}
+
+int virtio_mmio_exit(struct kvm *kvm, struct virtio_device *vdev)
+{
+ struct virtio_mmio *vmmio = vdev->virtio;
+ int i;
+
+ kvm__deregister_mmio(kvm, vmmio->addr);
+
+ for (i = 0; i < VIRTIO_MMIO_MAX_VQ; i++)
+ ioeventfd__del_event(vmmio->addr + VIRTIO_MMIO_QUEUE_NOTIFY, i);
+
+ return 0;
+}
--- /dev/null
+#include "kvm/virtio-pci-dev.h"
+#include "kvm/virtio-net.h"
+#include "kvm/virtio.h"
+#include "kvm/types.h"
+#include "kvm/mutex.h"
+#include "kvm/util.h"
+#include "kvm/kvm.h"
+#include "kvm/irq.h"
+#include "kvm/uip.h"
+#include "kvm/guest_compat.h"
+
+#include <linux/vhost.h>
+#include <linux/virtio_net.h>
+#include <linux/if_tun.h>
+#include <linux/types.h>
+
+#include <arpa/inet.h>
+#include <net/if.h>
+
+#include <unistd.h>
+#include <fcntl.h>
+
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/eventfd.h>
+
+#define VIRTIO_NET_QUEUE_SIZE 256
+#define VIRTIO_NET_NUM_QUEUES 2
+#define VIRTIO_NET_RX_QUEUE 0
+#define VIRTIO_NET_TX_QUEUE 1
+
+struct net_dev;
+
+struct net_dev_operations {
+ int (*rx)(struct iovec *iov, u16 in, struct net_dev *ndev);
+ int (*tx)(struct iovec *iov, u16 out, struct net_dev *ndev);
+};
+
+struct net_dev {
+ pthread_mutex_t mutex;
+ struct virtio_device vdev;
+ struct list_head list;
+
+ struct virt_queue vqs[VIRTIO_NET_NUM_QUEUES];
+ struct virtio_net_config config;
+ u32 features;
+
+ pthread_t io_rx_thread;
+ pthread_mutex_t io_rx_lock;
+ pthread_cond_t io_rx_cond;
+
+ pthread_t io_tx_thread;
+ pthread_mutex_t io_tx_lock;
+ pthread_cond_t io_tx_cond;
+
+ int vhost_fd;
+ int tap_fd;
+ char tap_name[IFNAMSIZ];
+
+ int mode;
+
+ struct uip_info info;
+ struct net_dev_operations *ops;
+ struct kvm *kvm;
+};
+
+static LIST_HEAD(ndevs);
+static int compat_id = -1;
+
+static void *virtio_net_rx_thread(void *p)
+{
+ struct iovec iov[VIRTIO_NET_QUEUE_SIZE];
+ struct virt_queue *vq;
+ struct kvm *kvm;
+ struct net_dev *ndev = p;
+ u16 out, in;
+ u16 head;
+ int len;
+
+ kvm = ndev->kvm;
+ vq = &ndev->vqs[VIRTIO_NET_RX_QUEUE];
+
+ while (1) {
+ mutex_lock(&ndev->io_rx_lock);
+ if (!virt_queue__available(vq))
+ pthread_cond_wait(&ndev->io_rx_cond, &ndev->io_rx_lock);
+ mutex_unlock(&ndev->io_rx_lock);
+
+ while (virt_queue__available(vq)) {
+ head = virt_queue__get_iov(vq, iov, &out, &in, kvm);
+ len = ndev->ops->rx(iov, in, ndev);
+ virt_queue__set_used_elem(vq, head, len);
+
+ /* We should interrupt guest right now, otherwise latency is huge. */
+ if (virtio_queue__should_signal(&ndev->vqs[VIRTIO_NET_RX_QUEUE]))
+ ndev->vdev.ops->signal_vq(kvm, &ndev->vdev,
+ VIRTIO_NET_RX_QUEUE);
+ }
+ }
+
+ pthread_exit(NULL);
+ return NULL;
+
+}
+
+static void *virtio_net_tx_thread(void *p)
+{
+ struct iovec iov[VIRTIO_NET_QUEUE_SIZE];
+ struct virt_queue *vq;
+ struct kvm *kvm;
+ struct net_dev *ndev = p;
+ u16 out, in;
+ u16 head;
+ int len;
+
+ kvm = ndev->kvm;
+ vq = &ndev->vqs[VIRTIO_NET_TX_QUEUE];
+
+ while (1) {
+ mutex_lock(&ndev->io_tx_lock);
+ if (!virt_queue__available(vq))
+ pthread_cond_wait(&ndev->io_tx_cond, &ndev->io_tx_lock);
+ mutex_unlock(&ndev->io_tx_lock);
+
+ while (virt_queue__available(vq)) {
+ head = virt_queue__get_iov(vq, iov, &out, &in, kvm);
+ len = ndev->ops->tx(iov, out, ndev);
+ virt_queue__set_used_elem(vq, head, len);
+ }
+
+ if (virtio_queue__should_signal(&ndev->vqs[VIRTIO_NET_TX_QUEUE]))
+ ndev->vdev.ops->signal_vq(kvm, &ndev->vdev, VIRTIO_NET_TX_QUEUE);
+ }
+
+ pthread_exit(NULL);
+
+ return NULL;
+
+}
+
+static void virtio_net_handle_callback(struct kvm *kvm, struct net_dev *ndev, int queue)
+{
+ switch (queue) {
+ case VIRTIO_NET_TX_QUEUE:
+ mutex_lock(&ndev->io_tx_lock);
+ pthread_cond_signal(&ndev->io_tx_cond);
+ mutex_unlock(&ndev->io_tx_lock);
+ break;
+ case VIRTIO_NET_RX_QUEUE:
+ mutex_lock(&ndev->io_rx_lock);
+ pthread_cond_signal(&ndev->io_rx_cond);
+ mutex_unlock(&ndev->io_rx_lock);
+ break;
+ default:
+ pr_warning("Unknown queue index %d", queue);
+ }
+}
+
+static bool virtio_net__tap_init(const struct virtio_net_params *params,
+ struct net_dev *ndev)
+{
+ int sock = socket(AF_INET, SOCK_STREAM, 0);
+ int pid, status, offload, hdr_len;
+ struct sockaddr_in sin = {0};
+ struct ifreq ifr;
+
+ /* Did the user already give us the FD? */
+ if (params->fd) {
+ ndev->tap_fd = params->fd;
+ return 1;
+ }
+
+ ndev->tap_fd = open("/dev/net/tun", O_RDWR);
+ if (ndev->tap_fd < 0) {
+ pr_warning("Unable to open /dev/net/tun");
+ goto fail;
+ }
+
+ memset(&ifr, 0, sizeof(ifr));
+ ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
+ if (ioctl(ndev->tap_fd, TUNSETIFF, &ifr) < 0) {
+ pr_warning("Config tap device error. Are you root?");
+ goto fail;
+ }
+
+ strncpy(ndev->tap_name, ifr.ifr_name, sizeof(ndev->tap_name));
+
+ if (ioctl(ndev->tap_fd, TUNSETNOCSUM, 1) < 0) {
+ pr_warning("Config tap device TUNSETNOCSUM error");
+ goto fail;
+ }
+
+ hdr_len = sizeof(struct virtio_net_hdr);
+ if (ioctl(ndev->tap_fd, TUNSETVNETHDRSZ, &hdr_len) < 0)
+ pr_warning("Config tap device TUNSETVNETHDRSZ error");
+
+ offload = TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 | TUN_F_UFO;
+ if (ioctl(ndev->tap_fd, TUNSETOFFLOAD, offload) < 0) {
+ pr_warning("Config tap device TUNSETOFFLOAD error");
+ goto fail;
+ }
+
+ if (strcmp(params->script, "none")) {
+ pid = fork();
+ if (pid == 0) {
+ execl(params->script, params->script, ndev->tap_name, NULL);
+ _exit(1);
+ } else {
+ waitpid(pid, &status, 0);
+ if (WIFEXITED(status) && WEXITSTATUS(status) != 0) {
+ pr_warning("Failed to set up tap device using %s", params->script);
+ goto fail;
+ }
+ }
+ } else {
+ memset(&ifr, 0, sizeof(ifr));
+ strncpy(ifr.ifr_name, ndev->tap_name, sizeof(ndev->tap_name));
+ sin.sin_addr.s_addr = inet_addr(params->host_ip);
+ memcpy(&(ifr.ifr_addr), &sin, sizeof(ifr.ifr_addr));
+ ifr.ifr_addr.sa_family = AF_INET;
+ if (ioctl(sock, SIOCSIFADDR, &ifr) < 0) {
+ pr_warning("Could not set ip address on tap device");
+ goto fail;
+ }
+ }
+
+ memset(&ifr, 0, sizeof(ifr));
+ strncpy(ifr.ifr_name, ndev->tap_name, sizeof(ndev->tap_name));
+ ioctl(sock, SIOCGIFFLAGS, &ifr);
+ ifr.ifr_flags |= IFF_UP | IFF_RUNNING;
+ if (ioctl(sock, SIOCSIFFLAGS, &ifr) < 0)
+ pr_warning("Could not bring tap device up");
+
+ close(sock);
+
+ return 1;
+
+fail:
+ if (sock >= 0)
+ close(sock);
+ if (ndev->tap_fd >= 0)
+ close(ndev->tap_fd);
+
+ return 0;
+}
+
+static void virtio_net__io_thread_init(struct kvm *kvm, struct net_dev *ndev)
+{
+ pthread_mutex_init(&ndev->io_tx_lock, NULL);
+ pthread_mutex_init(&ndev->io_rx_lock, NULL);
+
+ pthread_cond_init(&ndev->io_tx_cond, NULL);
+ pthread_cond_init(&ndev->io_rx_cond, NULL);
+
+ pthread_create(&ndev->io_tx_thread, NULL, virtio_net_tx_thread, ndev);
+ pthread_create(&ndev->io_rx_thread, NULL, virtio_net_rx_thread, ndev);
+}
+
+static inline int tap_ops_tx(struct iovec *iov, u16 out, struct net_dev *ndev)
+{
+ return writev(ndev->tap_fd, iov, out);
+}
+
+static inline int tap_ops_rx(struct iovec *iov, u16 in, struct net_dev *ndev)
+{
+ return readv(ndev->tap_fd, iov, in);
+}
+
+static inline int uip_ops_tx(struct iovec *iov, u16 out, struct net_dev *ndev)
+{
+ return uip_tx(iov, out, &ndev->info);
+}
+
+static inline int uip_ops_rx(struct iovec *iov, u16 in, struct net_dev *ndev)
+{
+ return uip_rx(iov, in, &ndev->info);
+}
+
+static struct net_dev_operations tap_ops = {
+ .rx = tap_ops_rx,
+ .tx = tap_ops_tx,
+};
+
+static struct net_dev_operations uip_ops = {
+ .rx = uip_ops_rx,
+ .tx = uip_ops_tx,
+};
+
+static u8 *get_config(struct kvm *kvm, void *dev)
+{
+ struct net_dev *ndev = dev;
+
+ return ((u8 *)(&ndev->config));
+}
+
+static u32 get_host_features(struct kvm *kvm, void *dev)
+{
+ return 1UL << VIRTIO_NET_F_MAC
+ | 1UL << VIRTIO_NET_F_CSUM
+ | 1UL << VIRTIO_NET_F_HOST_UFO
+ | 1UL << VIRTIO_NET_F_HOST_TSO4
+ | 1UL << VIRTIO_NET_F_HOST_TSO6
+ | 1UL << VIRTIO_NET_F_GUEST_UFO
+ | 1UL << VIRTIO_NET_F_GUEST_TSO4
+ | 1UL << VIRTIO_NET_F_GUEST_TSO6
+ | 1UL << VIRTIO_RING_F_EVENT_IDX
+ | 1UL << VIRTIO_RING_F_INDIRECT_DESC;
+}
+
+static void set_guest_features(struct kvm *kvm, void *dev, u32 features)
+{
+ struct net_dev *ndev = dev;
+
+ ndev->features = features;
+}
+
+static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 pfn)
+{
+ struct vhost_vring_state state = { .index = vq };
+ struct vhost_vring_addr addr;
+ struct net_dev *ndev = dev;
+ struct virt_queue *queue;
+ void *p;
+ int r;
+
+ compat__remove_message(compat_id);
+
+ queue = &ndev->vqs[vq];
+ queue->pfn = pfn;
+ p = guest_pfn_to_host(kvm, queue->pfn);
+
+ /* FIXME: respect pci and mmio vring alignment */
+ vring_init(&queue->vring, VIRTIO_NET_QUEUE_SIZE, p, VIRTIO_PCI_VRING_ALIGN);
+
+ if (ndev->vhost_fd == 0)
+ return 0;
+
+ state.num = queue->vring.num;
+ r = ioctl(ndev->vhost_fd, VHOST_SET_VRING_NUM, &state);
+ if (r < 0)
+ die_perror("VHOST_SET_VRING_NUM failed");
+ state.num = 0;
+ r = ioctl(ndev->vhost_fd, VHOST_SET_VRING_BASE, &state);
+ if (r < 0)
+ die_perror("VHOST_SET_VRING_BASE failed");
+
+ addr = (struct vhost_vring_addr) {
+ .index = vq,
+ .desc_user_addr = (u64)(unsigned long)queue->vring.desc,
+ .avail_user_addr = (u64)(unsigned long)queue->vring.avail,
+ .used_user_addr = (u64)(unsigned long)queue->vring.used,
+ };
+
+ r = ioctl(ndev->vhost_fd, VHOST_SET_VRING_ADDR, &addr);
+ if (r < 0)
+ die_perror("VHOST_SET_VRING_ADDR failed");
+
+ return 0;
+}
+
+static void notify_vq_gsi(struct kvm *kvm, void *dev, u32 vq, u32 gsi)
+{
+ struct net_dev *ndev = dev;
+ struct kvm_irqfd irq;
+ struct vhost_vring_file file;
+ int r;
+
+ if (ndev->vhost_fd == 0)
+ return;
+
+ irq = (struct kvm_irqfd) {
+ .gsi = gsi,
+ .fd = eventfd(0, 0),
+ };
+ file = (struct vhost_vring_file) {
+ .index = vq,
+ .fd = irq.fd,
+ };
+
+ r = ioctl(kvm->vm_fd, KVM_IRQFD, &irq);
+ if (r < 0)
+ die_perror("KVM_IRQFD failed");
+
+ r = ioctl(ndev->vhost_fd, VHOST_SET_VRING_CALL, &file);
+ if (r < 0)
+ die_perror("VHOST_SET_VRING_CALL failed");
+ file.fd = ndev->tap_fd;
+ r = ioctl(ndev->vhost_fd, VHOST_NET_SET_BACKEND, &file);
+ if (r != 0)
+ die("VHOST_NET_SET_BACKEND failed %d", errno);
+
+}
+
+static void notify_vq_eventfd(struct kvm *kvm, void *dev, u32 vq, u32 efd)
+{
+ struct net_dev *ndev = dev;
+ struct vhost_vring_file file = {
+ .index = vq,
+ .fd = efd,
+ };
+ int r;
+
+ if (ndev->vhost_fd == 0)
+ return;
+
+ r = ioctl(ndev->vhost_fd, VHOST_SET_VRING_KICK, &file);
+ if (r < 0)
+ die_perror("VHOST_SET_VRING_KICK failed");
+}
+
+static int notify_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+ struct net_dev *ndev = dev;
+
+ virtio_net_handle_callback(kvm, ndev, vq);
+
+ return 0;
+}
+
+static int get_pfn_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+ struct net_dev *ndev = dev;
+
+ return ndev->vqs[vq].pfn;
+}
+
+static int get_size_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+ /* FIXME: dynamic */
+ return VIRTIO_NET_QUEUE_SIZE;
+}
+
+static int set_size_vq(struct kvm *kvm, void *dev, u32 vq, int size)
+{
+ /* FIXME: dynamic */
+ return size;
+}
+
+static struct virtio_ops net_dev_virtio_ops = (struct virtio_ops) {
+ .get_config = get_config,
+ .get_host_features = get_host_features,
+ .set_guest_features = set_guest_features,
+ .init_vq = init_vq,
+ .get_pfn_vq = get_pfn_vq,
+ .get_size_vq = get_size_vq,
+ .set_size_vq = set_size_vq,
+ .notify_vq = notify_vq,
+ .notify_vq_gsi = notify_vq_gsi,
+ .notify_vq_eventfd = notify_vq_eventfd,
+};
+
+static void virtio_net__vhost_init(struct kvm *kvm, struct net_dev *ndev)
+{
+ u64 features = 1UL << VIRTIO_RING_F_EVENT_IDX;
+ struct vhost_memory *mem;
+ int r;
+
+ ndev->vhost_fd = open("/dev/vhost-net", O_RDWR);
+ if (ndev->vhost_fd < 0)
+ die_perror("Failed opening vhost-net device");
+
+ mem = calloc(1, sizeof(*mem) + sizeof(struct vhost_memory_region));
+ if (mem == NULL)
+ die("Failed allocating memory for vhost memory map");
+
+ mem->nregions = 1;
+ mem->regions[0] = (struct vhost_memory_region) {
+ .guest_phys_addr = 0,
+ .memory_size = kvm->ram_size,
+ .userspace_addr = (unsigned long)kvm->ram_start,
+ };
+
+ r = ioctl(ndev->vhost_fd, VHOST_SET_OWNER);
+ if (r != 0)
+ die_perror("VHOST_SET_OWNER failed");
+
+ r = ioctl(ndev->vhost_fd, VHOST_SET_FEATURES, &features);
+ if (r != 0)
+ die_perror("VHOST_SET_FEATURES failed");
+ r = ioctl(ndev->vhost_fd, VHOST_SET_MEM_TABLE, mem);
+ if (r != 0)
+ die_perror("VHOST_SET_MEM_TABLE failed");
+
+ ndev->vdev.use_vhost = true;
+
+ free(mem);
+}
+
+static inline void str_to_mac(const char *str, char *mac)
+{
+ sscanf(str, "%hhx:%hhx:%hhx:%hhx:%hhx:%hhx",
+ mac, mac+1, mac+2, mac+3, mac+4, mac+5);
+}
+static int set_net_param(struct kvm *kvm, struct virtio_net_params *p,
+ const char *param, const char *val)
+{
+ if (strcmp(param, "guest_mac") == 0) {
+ str_to_mac(val, p->guest_mac);
+ } else if (strcmp(param, "mode") == 0) {
+ if (!strncmp(val, "user", 4)) {
+ int i;
+
+ for (i = 0; i < kvm->cfg.num_net_devices; i++)
+ if (kvm->cfg.net_params[i].mode == NET_MODE_USER)
+ die("Only one usermode network device allowed at a time");
+ p->mode = NET_MODE_USER;
+ } else if (!strncmp(val, "tap", 3)) {
+ p->mode = NET_MODE_TAP;
+ } else if (!strncmp(val, "none", 4)) {
+ kvm->cfg.no_net = 1;
+ return -1;
+ } else
+ die("Unknown network mode %s, please use user, tap or none", val);
+ } else if (strcmp(param, "script") == 0) {
+ p->script = strdup(val);
+ } else if (strcmp(param, "guest_ip") == 0) {
+ p->guest_ip = strdup(val);
+ } else if (strcmp(param, "host_ip") == 0) {
+ p->host_ip = strdup(val);
+ } else if (strcmp(param, "trans") == 0) {
+ p->trans = strdup(val);
+ } else if (strcmp(param, "vhost") == 0) {
+ p->vhost = atoi(val);
+ } else if (strcmp(param, "fd") == 0) {
+ p->fd = atoi(val);
+ } else
+ die("Unknown network parameter %s", param);
+
+ return 0;
+}
+
+int netdev_parser(const struct option *opt, const char *arg, int unset)
+{
+ struct virtio_net_params p;
+ char *buf = NULL, *cmd = NULL, *cur = NULL;
+ bool on_cmd = true;
+ struct kvm *kvm = opt->ptr;
+
+ if (arg) {
+ buf = strdup(arg);
+ if (buf == NULL)
+ die("Failed allocating new net buffer");
+ cur = strtok(buf, ",=");
+ }
+
+ p = (struct virtio_net_params) {
+ .guest_ip = DEFAULT_GUEST_ADDR,
+ .host_ip = DEFAULT_HOST_ADDR,
+ .script = DEFAULT_SCRIPT,
+ .mode = NET_MODE_TAP,
+ };
+
+ str_to_mac(DEFAULT_GUEST_MAC, p.guest_mac);
+ p.guest_mac[5] += kvm->cfg.num_net_devices;
+
+ while (cur) {
+ if (on_cmd) {
+ cmd = cur;
+ } else {
+ if (set_net_param(kvm, &p, cmd, cur) < 0)
+ goto done;
+ }
+ on_cmd = !on_cmd;
+
+ cur = strtok(NULL, ",=");
+ }
+
+ kvm->cfg.num_net_devices++;
+
+ kvm->cfg.net_params = realloc(kvm->cfg.net_params, kvm->cfg.num_net_devices * sizeof(*kvm->cfg.net_params));
+ if (kvm->cfg.net_params == NULL)
+ die("Failed adding new network device");
+
+ kvm->cfg.net_params[kvm->cfg.num_net_devices - 1] = p;
+
+done:
+ free(buf);
+ return 0;
+}
+
+static int virtio_net__init_one(struct virtio_net_params *params)
+{
+ int i;
+ struct net_dev *ndev;
+
+ ndev = calloc(1, sizeof(struct net_dev));
+ if (ndev == NULL)
+ return -ENOMEM;
+
+ list_add_tail(&ndev->list, &ndevs);
+
+ ndev->kvm = params->kvm;
+
+ mutex_init(&ndev->mutex);
+ ndev->config.status = VIRTIO_NET_S_LINK_UP;
+
+ for (i = 0 ; i < 6 ; i++) {
+ ndev->config.mac[i] = params->guest_mac[i];
+ ndev->info.guest_mac.addr[i] = params->guest_mac[i];
+ ndev->info.host_mac.addr[i] = params->host_mac[i];
+ }
+
+ ndev->mode = params->mode;
+ if (ndev->mode == NET_MODE_TAP) {
+ if (!virtio_net__tap_init(params, ndev))
+ die_perror("You have requested a TAP device, but creation of one has failed because");
+ ndev->ops = &tap_ops;
+ } else {
+ ndev->info.host_ip = ntohl(inet_addr(params->host_ip));
+ ndev->info.guest_ip = ntohl(inet_addr(params->guest_ip));
+ ndev->info.guest_netmask = ntohl(inet_addr("255.255.255.0"));
+ ndev->info.buf_nr = 20;
+ uip_init(&ndev->info);
+ ndev->ops = &uip_ops;
+ }
+
+ if (params->trans && strcmp(params->trans, "mmio") == 0)
+ virtio_init(params->kvm, ndev, &ndev->vdev, &net_dev_virtio_ops,
+ VIRTIO_MMIO, PCI_DEVICE_ID_VIRTIO_NET, VIRTIO_ID_NET, PCI_CLASS_NET);
+ else
+ virtio_init(params->kvm, ndev, &ndev->vdev, &net_dev_virtio_ops,
+ VIRTIO_PCI, PCI_DEVICE_ID_VIRTIO_NET, VIRTIO_ID_NET, PCI_CLASS_NET);
+
+ if (params->vhost)
+ virtio_net__vhost_init(params->kvm, ndev);
+ else
+ virtio_net__io_thread_init(params->kvm, ndev);
+
+ if (compat_id == -1)
+ compat_id = virtio_compat_add_message("virtio-net", "CONFIG_VIRTIO_NET");
+
+ return 0;
+}
+
+int virtio_net__init(struct kvm *kvm)
+{
+ int i;
+
+ for (i = 0; i < kvm->cfg.num_net_devices; i++) {
+ kvm->cfg.net_params[i].kvm = kvm;
+ virtio_net__init_one(&kvm->cfg.net_params[i]);
+ }
+
+ if (kvm->cfg.num_net_devices == 0 && kvm->cfg.no_net == 0) {
+ struct virtio_net_params net_params;
+
+ net_params = (struct virtio_net_params) {
+ .guest_ip = kvm->cfg.guest_ip,
+ .host_ip = kvm->cfg.host_ip,
+ .kvm = kvm,
+ .script = kvm->cfg.script,
+ .mode = NET_MODE_USER,
+ };
+ str_to_mac(kvm->cfg.guest_mac, net_params.guest_mac);
+ str_to_mac(kvm->cfg.host_mac, net_params.host_mac);
+
+ virtio_net__init_one(&net_params);
+ }
+
+ return 0;
+}
+virtio_dev_init(virtio_net__init);
+
+int virtio_net__exit(struct kvm *kvm)
+{
+ return 0;
+}
+virtio_dev_exit(virtio_net__exit);
--- /dev/null
+#include "kvm/virtio-pci.h"
+
+#include "kvm/ioport.h"
+#include "kvm/kvm.h"
+#include "kvm/virtio-pci-dev.h"
+#include "kvm/irq.h"
+#include "kvm/virtio.h"
+#include "kvm/ioeventfd.h"
+
+#include <sys/ioctl.h>
+#include <linux/virtio_pci.h>
+#include <linux/byteorder.h>
+#include <string.h>
+
+static void virtio_pci__ioevent_callback(struct kvm *kvm, void *param)
+{
+ struct virtio_pci_ioevent_param *ioeventfd = param;
+ struct virtio_pci *vpci = ioeventfd->vdev->virtio;
+
+ ioeventfd->vdev->ops->notify_vq(kvm, vpci->dev, ioeventfd->vq);
+}
+
+static int virtio_pci__init_ioeventfd(struct kvm *kvm, struct virtio_device *vdev, u32 vq)
+{
+ struct ioevent ioevent;
+ struct virtio_pci *vpci = vdev->virtio;
+ int r;
+
+ vpci->ioeventfds[vq] = (struct virtio_pci_ioevent_param) {
+ .vdev = vdev,
+ .vq = vq,
+ };
+
+ ioevent = (struct ioevent) {
+ .io_addr = vpci->base_addr + VIRTIO_PCI_QUEUE_NOTIFY,
+ .io_len = sizeof(u16),
+ .fn = virtio_pci__ioevent_callback,
+ .fn_ptr = &vpci->ioeventfds[vq],
+ .datamatch = vq,
+ .fn_kvm = kvm,
+ .fd = eventfd(0, 0),
+ };
+
+ if (vdev->use_vhost)
+ /*
+ * Vhost will poll the eventfd on the host kernel side,
+ * no need to poll in userspace.
+ */
+ r = ioeventfd__add_event(&ioevent, true, false);
+ else
+ /* Need to poll in userspace. */
+ r = ioeventfd__add_event(&ioevent, true, true);
+ if (r)
+ return r;
+
+ if (vdev->ops->notify_vq_eventfd)
+ vdev->ops->notify_vq_eventfd(kvm, vpci->dev, vq, ioevent.fd);
+
+ return 0;
+}
+
+static inline bool virtio_pci__msix_enabled(struct virtio_pci *vpci)
+{
+ return vpci->pci_hdr.msix.ctrl & cpu_to_le16(PCI_MSIX_FLAGS_ENABLE);
+}
+
+static bool virtio_pci__specific_io_in(struct kvm *kvm, struct virtio_device *vdev, u16 port,
+ void *data, int size, int offset)
+{
+ u32 config_offset;
+ struct virtio_pci *vpci = vdev->virtio;
+ int type = virtio__get_dev_specific_field(offset - 20,
+ virtio_pci__msix_enabled(vpci),
+ &config_offset);
+ if (type == VIRTIO_PCI_O_MSIX) {
+ switch (offset) {
+ case VIRTIO_MSI_CONFIG_VECTOR:
+ ioport__write16(data, vpci->config_vector);
+ break;
+ case VIRTIO_MSI_QUEUE_VECTOR:
+ ioport__write16(data, vpci->vq_vector[vpci->queue_selector]);
+ break;
+ };
+
+ return true;
+ } else if (type == VIRTIO_PCI_O_CONFIG) {
+ u8 cfg;
+
+ cfg = vdev->ops->get_config(kvm, vpci->dev)[config_offset];
+ ioport__write8(data, cfg);
+ return true;
+ }
+
+ return false;
+}
+
+static bool virtio_pci__io_in(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size)
+{
+ unsigned long offset;
+ bool ret = true;
+ struct virtio_device *vdev;
+ struct virtio_pci *vpci;
+ u32 val;
+
+ vdev = ioport->priv;
+ vpci = vdev->virtio;
+ offset = port - vpci->base_addr;
+
+ switch (offset) {
+ case VIRTIO_PCI_HOST_FEATURES:
+ val = vdev->ops->get_host_features(kvm, vpci->dev);
+ ioport__write32(data, val);
+ break;
+ case VIRTIO_PCI_QUEUE_PFN:
+ val = vdev->ops->get_pfn_vq(kvm, vpci->dev, vpci->queue_selector);
+ ioport__write32(data, val);
+ break;
+ case VIRTIO_PCI_QUEUE_NUM:
+ val = vdev->ops->get_size_vq(kvm, vpci->dev, vpci->queue_selector);
+ ioport__write16(data, val);
+ break;
+ case VIRTIO_PCI_STATUS:
+ ioport__write8(data, vpci->status);
+ break;
+ case VIRTIO_PCI_ISR:
+ ioport__write8(data, vpci->isr);
+ kvm__irq_line(kvm, vpci->pci_hdr.irq_line, VIRTIO_IRQ_LOW);
+ vpci->isr = VIRTIO_IRQ_LOW;
+ break;
+ default:
+ ret = virtio_pci__specific_io_in(kvm, vdev, port, data, size, offset);
+ break;
+ };
+
+ return ret;
+}
+
+static bool virtio_pci__specific_io_out(struct kvm *kvm, struct virtio_device *vdev, u16 port,
+ void *data, int size, int offset)
+{
+ struct virtio_pci *vpci = vdev->virtio;
+ u32 config_offset, gsi, vec;
+ int type = virtio__get_dev_specific_field(offset - 20, virtio_pci__msix_enabled(vpci),
+ &config_offset);
+ if (type == VIRTIO_PCI_O_MSIX) {
+ switch (offset) {
+ case VIRTIO_MSI_CONFIG_VECTOR:
+ vec = vpci->config_vector = ioport__read16(data);
+
+ gsi = irq__add_msix_route(kvm, &vpci->msix_table[vec].msg);
+
+ vpci->config_gsi = gsi;
+ break;
+ case VIRTIO_MSI_QUEUE_VECTOR:
+ vec = vpci->vq_vector[vpci->queue_selector] = ioport__read16(data);
+
+ gsi = irq__add_msix_route(kvm, &vpci->msix_table[vec].msg);
+ vpci->gsis[vpci->queue_selector] = gsi;
+ if (vdev->ops->notify_vq_gsi)
+ vdev->ops->notify_vq_gsi(kvm, vpci->dev,
+ vpci->queue_selector, gsi);
+ break;
+ };
+
+ return true;
+ } else if (type == VIRTIO_PCI_O_CONFIG) {
+ vdev->ops->get_config(kvm, vpci->dev)[config_offset] = *(u8 *)data;
+
+ return true;
+ }
+
+ return false;
+}
+
+static bool virtio_pci__io_out(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size)
+{
+ unsigned long offset;
+ bool ret = true;
+ struct virtio_device *vdev;
+ struct virtio_pci *vpci;
+ u32 val;
+
+ vdev = ioport->priv;
+ vpci = vdev->virtio;
+ offset = port - vpci->base_addr;
+
+ switch (offset) {
+ case VIRTIO_PCI_GUEST_FEATURES:
+ val = ioport__read32(data);
+ vdev->ops->set_guest_features(kvm, vpci->dev, val);
+ break;
+ case VIRTIO_PCI_QUEUE_PFN:
+ val = ioport__read32(data);
+ virtio_pci__init_ioeventfd(kvm, vdev, vpci->queue_selector);
+ vdev->ops->init_vq(kvm, vpci->dev, vpci->queue_selector, val);
+ break;
+ case VIRTIO_PCI_QUEUE_SEL:
+ vpci->queue_selector = ioport__read16(data);
+ break;
+ case VIRTIO_PCI_QUEUE_NOTIFY:
+ val = ioport__read16(data);
+ vdev->ops->notify_vq(kvm, vpci->dev, val);
+ break;
+ case VIRTIO_PCI_STATUS:
+ vpci->status = ioport__read8(data);
+ break;
+ default:
+ ret = virtio_pci__specific_io_out(kvm, vdev, port, data, size, offset);
+ break;
+ };
+
+ return ret;
+}
+
+static struct ioport_operations virtio_pci__io_ops = {
+ .io_in = virtio_pci__io_in,
+ .io_out = virtio_pci__io_out,
+};
+
+static void virtio_pci__mmio_callback(u64 addr, u8 *data, u32 len, u8 is_write, void *ptr)
+{
+ struct virtio_pci *vpci = ptr;
+ void *table;
+ u32 offset;
+
+ if (addr > vpci->msix_io_block + PCI_IO_SIZE) {
+ table = &vpci->msix_pba;
+ offset = vpci->msix_io_block + PCI_IO_SIZE;
+ } else {
+ table = &vpci->msix_table;
+ offset = vpci->msix_io_block;
+ }
+
+ if (is_write)
+ memcpy(table + addr - offset, data, len);
+ else
+ memcpy(data, table + addr - offset, len);
+}
+
+static void virtio_pci__signal_msi(struct kvm *kvm, struct virtio_pci *vpci, int vec)
+{
+ struct kvm_msi msi = {
+ .address_lo = vpci->msix_table[vec].msg.address_lo,
+ .address_hi = vpci->msix_table[vec].msg.address_hi,
+ .data = vpci->msix_table[vec].msg.data,
+ };
+
+ ioctl(kvm->vm_fd, KVM_SIGNAL_MSI, &msi);
+}
+
+int virtio_pci__signal_vq(struct kvm *kvm, struct virtio_device *vdev, u32 vq)
+{
+ struct virtio_pci *vpci = vdev->virtio;
+ int tbl = vpci->vq_vector[vq];
+
+ if (virtio_pci__msix_enabled(vpci)) {
+ if (vpci->pci_hdr.msix.ctrl & cpu_to_le16(PCI_MSIX_FLAGS_MASKALL) ||
+ vpci->msix_table[tbl].ctrl & cpu_to_le16(PCI_MSIX_ENTRY_CTRL_MASKBIT)) {
+
+ vpci->msix_pba |= 1 << tbl;
+ return 0;
+ }
+
+ if (vpci->features & VIRTIO_PCI_F_SIGNAL_MSI)
+ virtio_pci__signal_msi(kvm, vpci, vpci->vq_vector[vq]);
+ else
+ kvm__irq_trigger(kvm, vpci->gsis[vq]);
+ } else {
+ vpci->isr = VIRTIO_IRQ_HIGH;
+ kvm__irq_trigger(kvm, vpci->pci_hdr.irq_line);
+ }
+ return 0;
+}
+
+int virtio_pci__signal_config(struct kvm *kvm, struct virtio_device *vdev)
+{
+ struct virtio_pci *vpci = vdev->virtio;
+ int tbl = vpci->config_vector;
+
+ if (virtio_pci__msix_enabled(vpci)) {
+ if (vpci->pci_hdr.msix.ctrl & cpu_to_le16(PCI_MSIX_FLAGS_MASKALL) ||
+ vpci->msix_table[tbl].ctrl & cpu_to_le16(PCI_MSIX_ENTRY_CTRL_MASKBIT)) {
+
+ vpci->msix_pba |= 1 << tbl;
+ return 0;
+ }
+
+ if (vpci->features & VIRTIO_PCI_F_SIGNAL_MSI)
+ virtio_pci__signal_msi(kvm, vpci, vpci->config_vector);
+ else
+ kvm__irq_trigger(kvm, vpci->config_gsi);
+ } else {
+ vpci->isr = VIRTIO_PCI_ISR_CONFIG;
+ kvm__irq_trigger(kvm, vpci->pci_hdr.irq_line);
+ }
+
+ return 0;
+}
+
+int virtio_pci__init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
+ int device_id, int subsys_id, int class)
+{
+ struct virtio_pci *vpci = vdev->virtio;
+ u8 pin, line, ndev;
+ int r;
+
+ vpci->dev = dev;
+ vpci->msix_io_block = pci_get_io_space_block(PCI_IO_SIZE * 2);
+
+ r = ioport__register(kvm, IOPORT_EMPTY, &virtio_pci__io_ops, IOPORT_SIZE, vdev);
+ if (r < 0)
+ return r;
+
+ vpci->base_addr = (u16)r;
+ r = kvm__register_mmio(kvm, vpci->msix_io_block, PCI_IO_SIZE, false,
+ virtio_pci__mmio_callback, vpci);
+ if (r < 0)
+ goto free_ioport;
+
+ vpci->pci_hdr = (struct pci_device_header) {
+ .vendor_id = cpu_to_le16(PCI_VENDOR_ID_REDHAT_QUMRANET),
+ .device_id = cpu_to_le16(device_id),
+ .header_type = PCI_HEADER_TYPE_NORMAL,
+ .revision_id = 0,
+ .class[0] = class & 0xff,
+ .class[1] = (class >> 8) & 0xff,
+ .class[2] = (class >> 16) & 0xff,
+ .subsys_vendor_id = cpu_to_le16(PCI_SUBSYSTEM_VENDOR_ID_REDHAT_QUMRANET),
+ .subsys_id = cpu_to_le16(subsys_id),
+ .bar[0] = cpu_to_le32(vpci->base_addr
+ | PCI_BASE_ADDRESS_SPACE_IO),
+ .bar[1] = cpu_to_le32(vpci->msix_io_block
+ | PCI_BASE_ADDRESS_SPACE_MEMORY),
+ .status = cpu_to_le16(PCI_STATUS_CAP_LIST),
+ .capabilities = (void *)&vpci->pci_hdr.msix - (void *)&vpci->pci_hdr,
+ .bar_size[0] = IOPORT_SIZE,
+ .bar_size[1] = PCI_IO_SIZE,
+ .bar_size[3] = PCI_IO_SIZE,
+ };
+
+ vpci->pci_hdr.msix.cap = PCI_CAP_ID_MSIX;
+ vpci->pci_hdr.msix.next = 0;
+ /*
+ * We have at most VIRTIO_PCI_MAX_VQ entries for virtqueues and
+ * VIRTIO_PCI_MAX_CONFIG entries for config.
+ *
+ * To quote the PCI spec:
+ *
+ * System software reads this field to determine the
+ * MSI-X Table Size N, which is encoded as N-1.
+ * For example, a returned value of "00000000011"
+ * indicates a table size of 4.
+ */
+ vpci->pci_hdr.msix.ctrl = cpu_to_le16(VIRTIO_PCI_MAX_VQ + VIRTIO_PCI_MAX_CONFIG - 1);
+
+ /*
+ * Both table and PBA could be mapped on the same BAR, but for now
+ * we're not short of BARs
+ */
+ vpci->pci_hdr.msix.table_offset = cpu_to_le32(1); /* Use BAR 1 */
+ vpci->pci_hdr.msix.pba_offset = cpu_to_le32(1 | PCI_IO_SIZE); /* Use BAR 3 */
+ vpci->config_vector = 0;
+
+ r = irq__register_device(subsys_id, &ndev, &pin, &line);
+ if (r < 0)
+ goto free_mmio;
+
+ if (kvm__supports_extension(kvm, KVM_CAP_SIGNAL_MSI))
+ vpci->features |= VIRTIO_PCI_F_SIGNAL_MSI;
+
+ vpci->pci_hdr.irq_pin = pin;
+ vpci->pci_hdr.irq_line = line;
+ r = pci__register(&vpci->pci_hdr, ndev);
+ if (r < 0)
+ goto free_mmio;
+
+ return 0;
+
+free_mmio:
+ kvm__deregister_mmio(kvm, vpci->msix_io_block);
+free_ioport:
+ ioport__unregister(kvm, vpci->base_addr);
+ return r;
+}
+
+int virtio_pci__exit(struct kvm *kvm, struct virtio_device *vdev)
+{
+ struct virtio_pci *vpci = vdev->virtio;
+ int i;
+
+ kvm__deregister_mmio(kvm, vpci->msix_io_block);
+ ioport__unregister(kvm, vpci->base_addr);
+
+ for (i = 0; i < VIRTIO_PCI_MAX_VQ; i++)
+ ioeventfd__del_event(vpci->base_addr + VIRTIO_PCI_QUEUE_NOTIFY, i);
+
+ return 0;
+}
--- /dev/null
+#include "kvm/virtio-rng.h"
+
+#include "kvm/virtio-pci-dev.h"
+
+#include "kvm/virtio.h"
+#include "kvm/util.h"
+#include "kvm/kvm.h"
+#include "kvm/threadpool.h"
+#include "kvm/guest_compat.h"
+
+#include <linux/virtio_ring.h>
+#include <linux/virtio_rng.h>
+
+#include <linux/list.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <pthread.h>
+#include <linux/kernel.h>
+
+#define NUM_VIRT_QUEUES 1
+#define VIRTIO_RNG_QUEUE_SIZE 128
+
+struct rng_dev_job {
+ struct virt_queue *vq;
+ struct rng_dev *rdev;
+ struct thread_pool__job job_id;
+};
+
+struct rng_dev {
+ struct list_head list;
+ struct virtio_device vdev;
+
+ int fd;
+
+ /* virtio queue */
+ struct virt_queue vqs[NUM_VIRT_QUEUES];
+ struct rng_dev_job jobs[NUM_VIRT_QUEUES];
+};
+
+static LIST_HEAD(rdevs);
+static int compat_id = -1;
+
+static u8 *get_config(struct kvm *kvm, void *dev)
+{
+ /* Unused */
+ return 0;
+}
+
+static u32 get_host_features(struct kvm *kvm, void *dev)
+{
+ /* Unused */
+ return 0;
+}
+
+static void set_guest_features(struct kvm *kvm, void *dev, u32 features)
+{
+ /* Unused */
+}
+
+static bool virtio_rng_do_io_request(struct kvm *kvm, struct rng_dev *rdev, struct virt_queue *queue)
+{
+ struct iovec iov[VIRTIO_RNG_QUEUE_SIZE];
+ unsigned int len = 0;
+ u16 out, in, head;
+
+ head = virt_queue__get_iov(queue, iov, &out, &in, kvm);
+ len = readv(rdev->fd, iov, in);
+
+ virt_queue__set_used_elem(queue, head, len);
+
+ return true;
+}
+
+static void virtio_rng_do_io(struct kvm *kvm, void *param)
+{
+ struct rng_dev_job *job = param;
+ struct virt_queue *vq = job->vq;
+ struct rng_dev *rdev = job->rdev;
+
+ while (virt_queue__available(vq))
+ virtio_rng_do_io_request(kvm, rdev, vq);
+
+ rdev->vdev.ops->signal_vq(kvm, &rdev->vdev, vq - rdev->vqs);
+}
+
+static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 pfn)
+{
+ struct rng_dev *rdev = dev;
+ struct virt_queue *queue;
+ struct rng_dev_job *job;
+ void *p;
+
+ compat__remove_message(compat_id);
+
+ queue = &rdev->vqs[vq];
+ queue->pfn = pfn;
+ p = guest_pfn_to_host(kvm, queue->pfn);
+
+ job = &rdev->jobs[vq];
+
+ vring_init(&queue->vring, VIRTIO_RNG_QUEUE_SIZE, p, VIRTIO_PCI_VRING_ALIGN);
+
+ *job = (struct rng_dev_job) {
+ .vq = queue,
+ .rdev = rdev,
+ };
+
+ thread_pool__init_job(&job->job_id, kvm, virtio_rng_do_io, job);
+
+ return 0;
+}
+
+static int notify_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+ struct rng_dev *rdev = dev;
+
+ thread_pool__do_job(&rdev->jobs[vq].job_id);
+
+ return 0;
+}
+
+static int get_pfn_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+ struct rng_dev *rdev = dev;
+
+ return rdev->vqs[vq].pfn;
+}
+
+static int get_size_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+ return VIRTIO_RNG_QUEUE_SIZE;
+}
+
+static struct virtio_ops rng_dev_virtio_ops = (struct virtio_ops) {
+ .get_config = get_config,
+ .get_host_features = get_host_features,
+ .set_guest_features = set_guest_features,
+ .init_vq = init_vq,
+ .notify_vq = notify_vq,
+ .get_pfn_vq = get_pfn_vq,
+ .get_size_vq = get_size_vq,
+};
+
+int virtio_rng__init(struct kvm *kvm)
+{
+ struct rng_dev *rdev;
+ int r;
+
+ if (!kvm->cfg.virtio_rng)
+ return 0;
+
+ rdev = malloc(sizeof(*rdev));
+ if (rdev == NULL)
+ return -ENOMEM;
+
+ rdev->fd = open("/dev/urandom", O_RDONLY);
+ if (rdev->fd < 0) {
+ r = rdev->fd;
+ goto cleanup;
+ }
+
+ r = virtio_init(kvm, rdev, &rdev->vdev, &rng_dev_virtio_ops,
+ VIRTIO_PCI, PCI_DEVICE_ID_VIRTIO_RNG, VIRTIO_ID_RNG, PCI_CLASS_RNG);
+ if (r < 0)
+ goto cleanup;
+
+ list_add_tail(&rdev->list, &rdevs);
+
+ if (compat_id == -1)
+ compat_id = virtio_compat_add_message("virtio-rng", "CONFIG_HW_RANDOM_VIRTIO");
+ return 0;
+cleanup:
+ close(rdev->fd);
+ free(rdev);
+
+ return r;
+}
+virtio_dev_init(virtio_rng__init);
+
+int virtio_rng__exit(struct kvm *kvm)
+{
+ struct rng_dev *rdev, *tmp;
+
+ list_for_each_entry_safe(rdev, tmp, &rdevs, list) {
+ list_del(&rdev->list);
+ rdev->vdev.ops->exit(kvm, &rdev->vdev);
+ free(rdev);
+ }
+
+ return 0;
+}
+virtio_dev_exit(virtio_rng__exit);
--- /dev/null
+#include "kvm/virtio-scsi.h"
+#include "kvm/virtio-pci-dev.h"
+#include "kvm/disk-image.h"
+#include "kvm/kvm.h"
+#include "kvm/pci.h"
+#include "kvm/ioeventfd.h"
+#include "kvm/guest_compat.h"
+#include "kvm/virtio-pci.h"
+#include "kvm/virtio.h"
+
+#include <linux/kernel.h>
+#include <linux/virtio_scsi.h>
+#include <linux/vhost.h>
+
+#define VIRTIO_SCSI_QUEUE_SIZE 128
+#define NUM_VIRT_QUEUES 3
+
+static LIST_HEAD(sdevs);
+static int compat_id = -1;
+
+struct scsi_dev {
+ struct virt_queue vqs[NUM_VIRT_QUEUES];
+ struct virtio_scsi_config config;
+ struct vhost_scsi_target target;
+ u32 features;
+ int vhost_fd;
+ struct virtio_device vdev;
+ struct list_head list;
+ struct kvm *kvm;
+};
+
+static u8 *get_config(struct kvm *kvm, void *dev)
+{
+ struct scsi_dev *sdev = dev;
+
+ return ((u8 *)(&sdev->config));
+}
+
+static u32 get_host_features(struct kvm *kvm, void *dev)
+{
+ return 1UL << VIRTIO_RING_F_EVENT_IDX |
+ 1UL << VIRTIO_RING_F_INDIRECT_DESC;
+}
+
+static void set_guest_features(struct kvm *kvm, void *dev, u32 features)
+{
+ struct scsi_dev *sdev = dev;
+
+ sdev->features = features;
+}
+
+static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 pfn)
+{
+ struct vhost_vring_state state = { .index = vq };
+ struct vhost_vring_addr addr;
+ struct scsi_dev *sdev = dev;
+ struct virt_queue *queue;
+ void *p;
+ int r;
+
+ compat__remove_message(compat_id);
+
+ queue = &sdev->vqs[vq];
+ queue->pfn = pfn;
+ p = guest_pfn_to_host(kvm, queue->pfn);
+
+ vring_init(&queue->vring, VIRTIO_SCSI_QUEUE_SIZE, p, VIRTIO_PCI_VRING_ALIGN);
+
+ if (sdev->vhost_fd == 0)
+ return 0;
+
+ state.num = queue->vring.num;
+ r = ioctl(sdev->vhost_fd, VHOST_SET_VRING_NUM, &state);
+ if (r < 0)
+ die_perror("VHOST_SET_VRING_NUM failed");
+ state.num = 0;
+ r = ioctl(sdev->vhost_fd, VHOST_SET_VRING_BASE, &state);
+ if (r < 0)
+ die_perror("VHOST_SET_VRING_BASE failed");
+
+ addr = (struct vhost_vring_addr) {
+ .index = vq,
+ .desc_user_addr = (u64)(unsigned long)queue->vring.desc,
+ .avail_user_addr = (u64)(unsigned long)queue->vring.avail,
+ .used_user_addr = (u64)(unsigned long)queue->vring.used,
+ };
+
+ r = ioctl(sdev->vhost_fd, VHOST_SET_VRING_ADDR, &addr);
+ if (r < 0)
+ die_perror("VHOST_SET_VRING_ADDR failed");
+
+ return 0;
+}
+
+static void notify_vq_gsi(struct kvm *kvm, void *dev, u32 vq, u32 gsi)
+{
+ struct vhost_vring_file file;
+ struct scsi_dev *sdev = dev;
+ struct kvm_irqfd irq;
+ int r;
+
+ if (sdev->vhost_fd == 0)
+ return;
+
+ irq = (struct kvm_irqfd) {
+ .gsi = gsi,
+ .fd = eventfd(0, 0),
+ };
+ file = (struct vhost_vring_file) {
+ .index = vq,
+ .fd = irq.fd,
+ };
+
+ r = ioctl(kvm->vm_fd, KVM_IRQFD, &irq);
+ if (r < 0)
+ die_perror("KVM_IRQFD failed");
+
+ r = ioctl(sdev->vhost_fd, VHOST_SET_VRING_CALL, &file);
+ if (r < 0)
+ die_perror("VHOST_SET_VRING_CALL failed");
+
+ if (vq > 0)
+ return;
+
+ r = ioctl(sdev->vhost_fd, VHOST_SCSI_SET_ENDPOINT, &sdev->target);
+ if (r != 0)
+ die("VHOST_SCSI_SET_ENDPOINT failed %d", errno);
+}
+
+static void notify_vq_eventfd(struct kvm *kvm, void *dev, u32 vq, u32 efd)
+{
+ struct scsi_dev *sdev = dev;
+ struct vhost_vring_file file = {
+ .index = vq,
+ .fd = efd,
+ };
+ int r;
+
+ if (sdev->vhost_fd == 0)
+ return;
+
+ r = ioctl(sdev->vhost_fd, VHOST_SET_VRING_KICK, &file);
+ if (r < 0)
+ die_perror("VHOST_SET_VRING_KICK failed");
+}
+
+static int notify_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+ return 0;
+}
+
+static int get_pfn_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+ struct scsi_dev *sdev = dev;
+
+ return sdev->vqs[vq].pfn;
+}
+
+static int get_size_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+ return VIRTIO_SCSI_QUEUE_SIZE;
+}
+
+static int set_size_vq(struct kvm *kvm, void *dev, u32 vq, int size)
+{
+ return size;
+}
+
+static struct virtio_ops scsi_dev_virtio_ops = (struct virtio_ops) {
+ .get_config = get_config,
+ .get_host_features = get_host_features,
+ .set_guest_features = set_guest_features,
+ .init_vq = init_vq,
+ .get_pfn_vq = get_pfn_vq,
+ .get_size_vq = get_size_vq,
+ .set_size_vq = set_size_vq,
+ .notify_vq = notify_vq,
+ .notify_vq_gsi = notify_vq_gsi,
+ .notify_vq_eventfd = notify_vq_eventfd,
+};
+
+static void virtio_scsi_vhost_init(struct kvm *kvm, struct scsi_dev *sdev)
+{
+ struct vhost_memory *mem;
+ u64 features;
+ int r;
+
+ sdev->vhost_fd = open("/dev/vhost-scsi", O_RDWR);
+ if (sdev->vhost_fd < 0)
+ die_perror("Failed opening vhost-scsi device");
+
+ mem = calloc(1, sizeof(*mem) + sizeof(struct vhost_memory_region));
+ if (mem == NULL)
+ die("Failed allocating memory for vhost memory map");
+
+ mem->nregions = 1;
+ mem->regions[0] = (struct vhost_memory_region) {
+ .guest_phys_addr = 0,
+ .memory_size = kvm->ram_size,
+ .userspace_addr = (unsigned long)kvm->ram_start,
+ };
+
+ r = ioctl(sdev->vhost_fd, VHOST_SET_OWNER);
+ if (r != 0)
+ die_perror("VHOST_SET_OWNER failed");
+
+ r = ioctl(sdev->vhost_fd, VHOST_GET_FEATURES, &features);
+ if (r != 0)
+ die_perror("VHOST_GET_FEATURES failed");
+
+ r = ioctl(sdev->vhost_fd, VHOST_SET_FEATURES, &features);
+ if (r != 0)
+ die_perror("VHOST_SET_FEATURES failed");
+ r = ioctl(sdev->vhost_fd, VHOST_SET_MEM_TABLE, mem);
+ if (r != 0)
+ die_perror("VHOST_SET_MEM_TABLE failed");
+
+ sdev->vdev.use_vhost = true;
+
+ free(mem);
+}
+
+
+static int virtio_scsi_init_one(struct kvm *kvm, struct disk_image *disk)
+{
+ struct scsi_dev *sdev;
+
+ if (!disk)
+ return -EINVAL;
+
+ sdev = calloc(1, sizeof(struct scsi_dev));
+ if (sdev == NULL)
+ return -ENOMEM;
+
+ *sdev = (struct scsi_dev) {
+ .config = (struct virtio_scsi_config) {
+ .num_queues = NUM_VIRT_QUEUES - 2,
+ .seg_max = VIRTIO_SCSI_CDB_SIZE - 2,
+ .max_sectors = 65535,
+ .cmd_per_lun = 128,
+ .sense_size = VIRTIO_SCSI_SENSE_SIZE,
+ .cdb_size = VIRTIO_SCSI_CDB_SIZE,
+ .max_channel = 0,
+ .max_target = 0,
+ .max_lun = 16383,
+ .event_info_size = sizeof(struct virtio_scsi_event),
+ },
+ .kvm = kvm,
+ };
+ strncpy((char *)&sdev->target.vhost_wwpn, disk->wwpn, sizeof(sdev->target.vhost_wwpn));
+ sdev->target.vhost_tpgt = strtol(disk->tpgt, NULL, 0);
+
+ virtio_init(kvm, sdev, &sdev->vdev, &scsi_dev_virtio_ops,
+ VIRTIO_PCI, PCI_DEVICE_ID_VIRTIO_SCSI, VIRTIO_ID_SCSI, PCI_CLASS_BLK);
+
+ list_add_tail(&sdev->list, &sdevs);
+
+ virtio_scsi_vhost_init(kvm, sdev);
+
+ if (compat_id == -1)
+ compat_id = virtio_compat_add_message("virtio-scsi", "CONFIG_VIRTIO_SCSI");
+
+ return 0;
+}
+
+static int virtio_scsi_exit_one(struct kvm *kvm, struct scsi_dev *sdev)
+{
+ int r;
+
+ r = ioctl(sdev->vhost_fd, VHOST_SCSI_CLEAR_ENDPOINT, &sdev->target);
+ if (r != 0)
+ die("VHOST_SCSI_CLEAR_ENDPOINT failed %d", errno);
+
+ list_del(&sdev->list);
+ free(sdev);
+
+ return 0;
+}
+
+int virtio_scsi_init(struct kvm *kvm)
+{
+ int i, r = 0;
+
+ for (i = 0; i < kvm->nr_disks; i++) {
+ if (!kvm->disks[i]->wwpn)
+ continue;
+ r = virtio_scsi_init_one(kvm, kvm->disks[i]);
+ if (r < 0)
+ goto cleanup;
+ }
+
+ return 0;
+cleanup:
+ virtio_scsi_exit(kvm);
+ return r;
+}
+virtio_dev_init(virtio_scsi_init);
+
+int virtio_scsi_exit(struct kvm *kvm)
+{
+ while (!list_empty(&sdevs)) {
+ struct scsi_dev *sdev;
+
+ sdev = list_first_entry(&sdevs, struct scsi_dev, list);
+ virtio_scsi_exit_one(kvm, sdev);
+ }
+
+ return 0;
+}
+virtio_dev_exit(virtio_scsi_exit);
--- /dev/null
+#include "kvm/kvm.h"
+#include "kvm/boot-protocol.h"
+#include "kvm/e820.h"
+#include "kvm/interrupt.h"
+#include "kvm/util.h"
+
+#include <string.h>
+#include <asm/e820.h>
+
+#include "bios/bios-rom.h"
+
+struct irq_handler {
+ unsigned long address;
+ unsigned int irq;
+ void *handler;
+ size_t size;
+};
+
+#define BIOS_IRQ_PA_ADDR(name) (MB_BIOS_BEGIN + BIOS_OFFSET__##name)
+#define BIOS_IRQ_FUNC(name) ((char *)&bios_rom[BIOS_OFFSET__##name])
+#define BIOS_IRQ_SIZE(name) (BIOS_ENTRY_SIZE(BIOS_OFFSET__##name))
+
+#define DEFINE_BIOS_IRQ_HANDLER(_irq, _handler) \
+ { \
+ .irq = _irq, \
+ .address = BIOS_IRQ_PA_ADDR(_handler), \
+ .handler = BIOS_IRQ_FUNC(_handler), \
+ .size = BIOS_IRQ_SIZE(_handler), \
+ }
+
+static struct irq_handler bios_irq_handlers[] = {
+ DEFINE_BIOS_IRQ_HANDLER(0x10, bios_int10),
+ DEFINE_BIOS_IRQ_HANDLER(0x15, bios_int15),
+};
+
+static void setup_irq_handler(struct kvm *kvm, struct irq_handler *handler)
+{
+ struct real_intr_desc intr_desc;
+ void *p;
+
+ p = guest_flat_to_host(kvm, handler->address);
+ memcpy(p, handler->handler, handler->size);
+
+ intr_desc = (struct real_intr_desc) {
+ .segment = REAL_SEGMENT(MB_BIOS_BEGIN),
+ .offset = handler->address - MB_BIOS_BEGIN,
+ };
+
+ DIE_IF((handler->address - MB_BIOS_BEGIN) > 0xffffUL);
+
+ interrupt_table__set(&kvm->arch.interrupt_table, &intr_desc, handler->irq);
+}
+
+/**
+ * e820_setup - set up a simple E820 memory map
+ * @kvm - guest system descriptor
+ */
+static void e820_setup(struct kvm *kvm)
+{
+ struct e820map *e820;
+ struct e820entry *mem_map;
+ unsigned int i = 0;
+
+ e820 = guest_flat_to_host(kvm, E820_MAP_START);
+ mem_map = e820->map;
+
+ mem_map[i++] = (struct e820entry) {
+ .addr = REAL_MODE_IVT_BEGIN,
+ .size = EBDA_START - REAL_MODE_IVT_BEGIN,
+ .type = E820_RAM,
+ };
+ mem_map[i++] = (struct e820entry) {
+ .addr = EBDA_START,
+ .size = VGA_RAM_BEGIN - EBDA_START,
+ .type = E820_RESERVED,
+ };
+ mem_map[i++] = (struct e820entry) {
+ .addr = MB_BIOS_BEGIN,
+ .size = MB_BIOS_END - MB_BIOS_BEGIN,
+ .type = E820_RESERVED,
+ };
+ if (kvm->ram_size < KVM_32BIT_GAP_START) {
+ mem_map[i++] = (struct e820entry) {
+ .addr = BZ_KERNEL_START,
+ .size = kvm->ram_size - BZ_KERNEL_START,
+ .type = E820_RAM,
+ };
+ } else {
+ mem_map[i++] = (struct e820entry) {
+ .addr = BZ_KERNEL_START,
+ .size = KVM_32BIT_GAP_START - BZ_KERNEL_START,
+ .type = E820_RAM,
+ };
+ mem_map[i++] = (struct e820entry) {
+ .addr = KVM_32BIT_MAX_MEM_SIZE,
+ .size = kvm->ram_size - KVM_32BIT_MAX_MEM_SIZE,
+ .type = E820_RAM,
+ };
+ }
+
+ BUG_ON(i > E820_X_MAX);
+
+ e820->nr_map = i;
+}
+
+static void setup_vga_rom(struct kvm *kvm)
+{
+ u16 *mode;
+ void *p;
+
+ p = guest_flat_to_host(kvm, VGA_ROM_OEM_STRING);
+ memset(p, 0, VGA_ROM_OEM_STRING_SIZE);
+ strncpy(p, "KVM VESA", VGA_ROM_OEM_STRING_SIZE);
+
+ mode = guest_flat_to_host(kvm, VGA_ROM_MODES);
+ mode[0] = 0x0112;
+ mode[1] = 0xffff;
+}
+
+/**
+ * setup_bios - inject BIOS into guest memory
+ * @kvm - guest system descriptor
+ */
+void setup_bios(struct kvm *kvm)
+{
+ unsigned long address = MB_BIOS_BEGIN;
+ struct real_intr_desc intr_desc;
+ unsigned int i;
+ void *p;
+
+ /*
+ * before anything else -- clean some known areas
+ * we definitely don't want any trash here
+ */
+ p = guest_flat_to_host(kvm, BDA_START);
+ memset(p, 0, BDA_END - BDA_START);
+
+ p = guest_flat_to_host(kvm, EBDA_START);
+ memset(p, 0, EBDA_END - EBDA_START);
+
+ p = guest_flat_to_host(kvm, MB_BIOS_BEGIN);
+ memset(p, 0, MB_BIOS_END - MB_BIOS_BEGIN);
+
+ p = guest_flat_to_host(kvm, VGA_ROM_BEGIN);
+ memset(p, 0, VGA_ROM_END - VGA_ROM_BEGIN);
+
+ /* just copy the bios rom into place */
+ p = guest_flat_to_host(kvm, MB_BIOS_BEGIN);
+ memcpy(p, bios_rom, bios_rom_size);
+
+ /* E820 memory map must be present */
+ e820_setup(kvm);
+
+ /* VESA needs its own tricks */
+ setup_vga_rom(kvm);
+
+ /*
+ * Set up a *fake* real mode vector table; it has only
+ * one real handler, which just does an iret
+ */
+ address = BIOS_IRQ_PA_ADDR(bios_intfake);
+ intr_desc = (struct real_intr_desc) {
+ .segment = REAL_SEGMENT(MB_BIOS_BEGIN),
+ .offset = address - MB_BIOS_BEGIN,
+ };
+ interrupt_table__setup(&kvm->arch.interrupt_table, &intr_desc);
+
+ for (i = 0; i < ARRAY_SIZE(bios_irq_handlers); i++)
+ setup_irq_handler(kvm, &bios_irq_handlers[i]);
+
+ /* We are almost done: copy the vector table to guest address 0 */
+ p = guest_flat_to_host(kvm, 0);
+ interrupt_table__copy(&kvm->arch.interrupt_table, p, REAL_INTR_SIZE);
+}
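Note on the vector table built by setup_bios(): each real-mode IVT slot is a 4-byte offset/segment pair, and the CPU resolves it as (segment << 4) + offset. The following is a minimal sketch of that arithmetic, assuming the constants shown in this patch; ivt_entry_for() and ivt_entry_to_flat() are hypothetical helper names used only for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the patch. */
#define MB_BIOS_BEGIN		0x000f0000u
#define REAL_SEGMENT_SHIFT	4
#define REAL_SEGMENT(addr)	((addr) >> REAL_SEGMENT_SHIFT)

/* Same field order as struct real_intr_desc in the patch: offset first. */
struct real_intr_desc {
	uint16_t offset;
	uint16_t segment;
} __attribute__((packed));

/* Hypothetical helper: IVT entry for a handler located inside the BIOS ROM. */
struct real_intr_desc ivt_entry_for(uint32_t handler_flat)
{
	struct real_intr_desc d;

	/* The offset must fit in 16 bits relative to the BIOS segment. */
	assert(handler_flat >= MB_BIOS_BEGIN);
	assert(handler_flat - MB_BIOS_BEGIN <= 0xffffu);

	d.segment = REAL_SEGMENT(MB_BIOS_BEGIN);
	d.offset  = handler_flat - MB_BIOS_BEGIN;
	return d;
}

/* Hypothetical helper: real-mode translation, (segment << 4) + offset. */
uint32_t ivt_entry_to_flat(struct real_intr_desc d)
{
	return ((uint32_t)d.segment << REAL_SEGMENT_SHIFT) + d.offset;
}
```

With a made-up handler address of MB_BIOS_BEGIN + 0x40, the entry is f000:0040 and translates back to 0xf0040.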
--- /dev/null
+bios-rom.bin
+bios-rom.bin.elf
+bios-rom.h
--- /dev/null
+#include <kvm/assembly.h>
+
+ .org 0
+#ifdef CONFIG_X86_64
+ .code64
+#else
+ .code32
+#endif
+
+GLOBAL(bios_rom)
+ .incbin "x86/bios/bios.bin"
+END(bios_rom)
--- /dev/null
+#include "kvm/e820.h"
+
+#include "kvm/segment.h"
+#include "kvm/bios.h"
+
+#include <asm/processor-flags.h>
+#include <asm/e820.h>
+
+static inline void set_fs(u16 seg)
+{
+ asm volatile("movw %0,%%fs" : : "rm" (seg));
+}
+
+static inline u8 rdfs8(unsigned long addr)
+{
+ u8 v;
+
+ asm volatile("addr32 movb %%fs:%1,%0" : "=q" (v) : "m" (*(u8 *)addr));
+
+ return v;
+}
+
+static inline u32 rdfs32(unsigned long addr)
+{
+ u32 v;
+
+ asm volatile("addr32 movl %%fs:%1,%0" : "=q" (v) : "m" (*(u32 *)addr));
+
+ return v;
+}
+
+bioscall void e820_query_map(struct biosregs *regs)
+{
+ struct e820map *e820;
+ u32 map_size;
+ u16 fs_seg;
+ u32 ndx;
+
+ e820 = (struct e820map *)E820_MAP_START;
+ fs_seg = flat_to_seg16(E820_MAP_START);
+ set_fs(fs_seg);
+
+ ndx = regs->ebx;
+
+ map_size = rdfs32(flat_to_off16((u32)&e820->nr_map, fs_seg));
+
+ if (ndx < map_size) {
+ u32 start;
+ unsigned int i;
+ u8 *p;
+
+ fs_seg = flat_to_seg16(E820_MAP_START);
+ set_fs(fs_seg);
+
+ start = (u32)&e820->map[ndx];
+
+ p = (void *) regs->edi;
+
+ for (i = 0; i < sizeof(struct e820entry); i++)
+ *p++ = rdfs8(flat_to_off16(start + i, fs_seg));
+ }
+
+ regs->eax = SMAP;
+ regs->ecx = sizeof(struct e820entry);
+ regs->ebx = ++ndx;
+
+ /* Clear CF to indicate success. */
+ regs->eflags &= ~X86_EFLAGS_CF;
+
+ if (ndx >= map_size)
+ regs->ebx = 0; /* end of map */
+}
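Note on the INT 15h/E820 protocol implemented by e820_query_map(): the caller passes a continuation index in EBX (0 on the first call) and stops once the handler returns EBX = 0. The sketch below replays that control flow on ordinary memory, without the %fs accesses. The types e820map_sketch and regs_sketch and both function names are assumptions made for illustration; only SMAP and the 20-byte entry layout follow the usual Linux definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SMAP 0x534d4150u	/* ASCII "SMAP", returned in EAX */

struct e820entry {
	uint64_t addr;
	uint64_t size;
	uint32_t type;
} __attribute__((packed));

/* Hypothetical stand-in for the map stored at E820_MAP_START. */
struct e820map_sketch {
	uint32_t nr_map;
	struct e820entry map[4];
};

/* Reduced register set for the sketch; NOT the patch's struct biosregs. */
struct regs_sketch {
	uint32_t eax, ebx, ecx;
	int cf;
};

/* Same steps as the handler: copy entry EBX, then advance or terminate. */
void query_map_sketch(const struct e820map_sketch *m, struct regs_sketch *r,
		      struct e820entry *out)
{
	uint32_t ndx = r->ebx;

	assert(m->nr_map <= 4);

	if (ndx < m->nr_map)
		memcpy(out, &m->map[ndx], sizeof(*out));

	r->eax = SMAP;
	r->ecx = sizeof(*out);
	r->ebx = ++ndx;
	r->cf  = 0;		/* CF clear means success */

	if (ndx >= m->nr_map)
		r->ebx = 0;	/* end of map */
}

/* Caller side: start with EBX = 0, stop when EBX comes back as 0. */
unsigned int walk_map_sketch(const struct e820map_sketch *m,
			     struct e820entry *dst, unsigned int max)
{
	struct regs_sketch r = { .ebx = 0 };
	unsigned int n = 0;

	do {
		if (n == max)
			break;
		query_map_sketch(m, &r, &dst[n]);
		if (r.cf || r.eax != SMAP)
			break;
		n++;
	} while (r.ebx != 0);

	return n;
}
```

The sketch assumes a non-empty map; like the handler, it reports success even when the index is already past the end.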
--- /dev/null
+/*
+ * Our pretty trivial BIOS emulation
+ */
+
+#include <kvm/bios.h>
+#include <kvm/assembly.h>
+
+ .org 0
+ .code16gcc
+
+#define EFLAGS_CF (1 << 0)
+
+#include "macro.S"
+
+/* If you change these macros, remember to update 'struct biosregs' */
+.macro SAVE_BIOSREGS
+ pushl %fs
+ pushl %es
+ pushl %ds
+ pushl %edi
+ pushl %esi
+ pushl %ebp
+ pushl %esp
+ pushl %edx
+ pushl %ecx
+ pushl %ebx
+ pushl %eax
+.endm
+
+.macro RESTORE_BIOSREGS
+ popl %eax
+ popl %ebx
+ popl %ecx
+ popl %edx
+ popl %esp
+ popl %ebp
+ popl %esi
+ popl %edi
+ popl %ds
+ popl %es
+ popl %fs
+.endm
+
+/*
+ * fake interrupt handler, nothing can be faster ever
+ */
+ENTRY(bios_intfake)
+ /*
+ * Set CF to indicate failure. We don't want callers to think that the
+ * interrupt handler succeeded and then treat the return values in
+ * registers as valid data.
+ */
+ orl $EFLAGS_CF, 0x4(%esp)
+
+ IRET
+ENTRY_END(bios_intfake)
+
+/*
+ * int 10 - video - service
+ */
+ENTRY(bios_int10)
+ SAVE_BIOSREGS
+
+ movl %esp, %eax
+ /* this is way easier than doing it in assembly */
+ /* just push all the regs and jump to a C handler */
+ call int10_handler
+
+ RESTORE_BIOSREGS
+
+ /* Clear CF to indicate success. */
+ andl $~EFLAGS_CF, 0x4(%esp)
+
+ IRET
+ENTRY_END(bios_int10)
+
+ENTRY(bios_int15)
+ SAVE_BIOSREGS
+
+ movl %esp, %eax
+ call int15_handler
+
+ RESTORE_BIOSREGS
+
+ IRET
+ENTRY_END(bios_int15)
+
+GLOBAL(__locals)
+
+#include "local.S"
+
+END(__locals)
--- /dev/null
+#!/bin/sh
+
+echo "/* Autogenerated file, don't edit */"
+echo "#ifndef BIOS_OFFSETS_H"
+echo "#define BIOS_OFFSETS_H"
+
+echo ""
+echo "#define BIOS_ENTRY_SIZE(name) (name##_end - name)"
+echo ""
+
+nm bios.bin.elf | grep ' [Tt] ' | awk '{ print "#define BIOS_OFFSET__" $3 " 0x" $1; }'
+
+echo ""
+echo "#endif"
--- /dev/null
+#include "kvm/segment.h"
+#include "kvm/bios.h"
+#include "kvm/vesa.h"
+
+#include "bios/memcpy.h"
+
+#include <boot/vesa.h>
+
+static far_ptr gen_far_ptr(unsigned int pa)
+{
+ far_ptr ptr;
+
+ ptr.seg = (pa >> 4);
+ ptr.off = pa - (ptr.seg << 4);
+
+ return ptr;
+}
+
+static inline void outb(unsigned short port, unsigned char val)
+{
+ asm volatile("outb %0, %1" : : "a"(val), "Nd"(port));
+}
+
+/*
+ * It's probably much more useful to make this print to the serial
+ * line rather than print to a non-displayed VGA memory
+ */
+static inline void int10_putchar(struct biosregs *args)
+{
+ u8 al = args->eax & 0xFF;
+
+ outb(0x3f8, al);
+}
+
+static void vbe_get_mode(struct biosregs *args)
+{
+ struct vesa_mode_info *info = (struct vesa_mode_info *) args->edi;
+
+ *info = (struct vesa_mode_info) {
+ .mode_attr = 0xd9, /* 11011001 */
+ .logical_scan = VESA_WIDTH*4,
+ .h_res = VESA_WIDTH,
+ .v_res = VESA_HEIGHT,
+ .bpp = VESA_BPP,
+ .memory_layout = 6,
+ .memory_planes = 1,
+ .lfb_ptr = VESA_MEM_ADDR,
+ .rmask = 8,
+ .gmask = 8,
+ .bmask = 8,
+ .resv_mask = 8,
+ .resv_pos = 24,
+ .bpos = 16,
+ .gpos = 8,
+ };
+}
+
+static void vbe_get_info(struct biosregs *args)
+{
+ struct vesa_general_info *infop = (struct vesa_general_info *) args->edi;
+ struct vesa_general_info info;
+
+ info = (struct vesa_general_info) {
+ .signature = VESA_MAGIC,
+ .version = 0x102,
+ .vendor_string = gen_far_ptr(VGA_ROM_BEGIN),
+ .capabilities = 0x10,
+ .video_mode_ptr = gen_far_ptr(VGA_ROM_MODES),
+ .total_memory = (4 * VESA_WIDTH * VESA_HEIGHT) / 0x10000,
+ };
+
+ memcpy16(args->es, infop, args->ds, &info, sizeof(info));
+}
+
+#define VBE_STATUS_OK 0x004F
+
+static void int10_vesa(struct biosregs *args)
+{
+ u8 al;
+
+ al = args->eax & 0xff;
+
+ switch (al) {
+ case 0x00:
+ vbe_get_info(args);
+ break;
+ case 0x01:
+ vbe_get_mode(args);
+ break;
+ }
+
+ args->eax = VBE_STATUS_OK;
+}
+
+bioscall void int10_handler(struct biosregs *args)
+{
+ u8 ah;
+
+ ah = (args->eax & 0xff00) >> 8;
+
+ switch (ah) {
+ case 0x0e:
+ int10_putchar(args);
+ break;
+ case 0x4f:
+ int10_vesa(args);
+ break;
+ }
+
+}
--- /dev/null
+#include "kvm/bios.h"
+
+#include "kvm/e820.h"
+
+#include <asm/processor-flags.h>
+
+bioscall void int15_handler(struct biosregs *regs)
+{
+ switch (regs->eax) {
+ case 0xe820:
+ e820_query_map(regs);
+ break;
+ default:
+ /* Set CF to indicate failure. */
+ regs->eflags |= X86_EFLAGS_CF;
+ break;
+ }
+}
--- /dev/null
+/*
+ * Local variables for almost every BIOS irq handler
+ * Must be put somewhere inside irq handler body
+ */
+__CALLER_SS: .int 0
+__CALLER_SP: .long 0
+__CALLER_CLOBBER: .long 0
--- /dev/null
+/*
+ * handy BIOS macros
+ */
+
+/*
+ * switch to BIOS stack
+ */
+.macro stack_swap
+ movw %ss, %cs:(__CALLER_SS)
+ movl %esp, %cs:(__CALLER_SP)
+ movl %edx, %cs:(__CALLER_CLOBBER)
+ movw $MB_BIOS_SS, %dx
+ movw %dx, %ss
+ movw $MB_BIOS_SP, %sp
+ movl %cs:(__CALLER_CLOBBER), %edx
+.endm
+
+/*
+ * restore the original stack
+ */
+.macro stack_restore
+ movl %cs:(__CALLER_SP), %esp
+ movw %cs:(__CALLER_SS), %ss
+.endm
+
--- /dev/null
+#include "bios/memcpy.h"
+
+/*
+ * Copy memory area in 16-bit real mode.
+ */
+void memcpy16(u16 dst_seg, void *dst, u16 src_seg, const void *src, size_t len)
+{
+ __asm__ __volatile__ (
+ "pushw %%ds \n"
+ "pushw %%es \n"
+ "movw %[src_seg], %%ds \n"
+ "movw %[dst_seg], %%es \n"
+ "rep movsb %%ds:(%%si), %%es:(%%di) \n"
+ "popw %%es \n"
+ "popw %%ds \n"
+ :
+ : "S"(src),
+ "D"(dst),
+ "c"(len),
+ [src_seg] "r"(src_seg),
+ [dst_seg] "r"(dst_seg)
+ : "cc", "memory");
+}
--- /dev/null
+OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
+OUTPUT_ARCH(i386)
+
+SECTIONS {
+ .text 0 : {
+ *(.text)
+ }
+
+ /DISCARD/ : {
+ *(.debug*)
+ *(.data)
+ *(.bss)
+ *(.eh_frame*)
+ }
+}
+
--- /dev/null
+#include "kvm/kvm.h"
+
+#include "kvm/util.h"
+
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <stdbool.h>
+#include <fcntl.h>
+
+#define BIOS_SELECTOR 0xf000
+#define BIOS_IP 0xfff0
+#define BIOS_SP 0x8000
+
+bool kvm__load_firmware(struct kvm *kvm, const char *firmware_filename)
+{
+ struct stat st;
+ void *p;
+ int fd;
+ int nr;
+
+ fd = open(firmware_filename, O_RDONLY);
+ if (fd < 0)
+ return false;
+
+ if (fstat(fd, &st)) {
+ close(fd);
+ return false;
+ }
+
+ if (st.st_size > MB_FIRMWARE_BIOS_SIZE)
+ die("firmware image %s is too big to fit in memory (%llu KB).\n", firmware_filename, (u64)(st.st_size / 1024));
+
+ p = guest_flat_to_host(kvm, MB_FIRMWARE_BIOS_BEGIN);
+
+ while ((nr = read(fd, p, st.st_size)) > 0)
+ p += nr;
+
+ /* The image is now in guest memory; don't leak the descriptor */
+ close(fd);
+
+ kvm->arch.boot_selector = BIOS_SELECTOR;
+ kvm->arch.boot_ip = BIOS_IP;
+ kvm->arch.boot_sp = BIOS_SP;
+
+ return true;
+}
--- /dev/null
+#include "kvm/kvm-cpu.h"
+
+#include "kvm/kvm.h"
+#include "kvm/util.h"
+
+#include <sys/ioctl.h>
+#include <stdlib.h>
+
+#define CPUID_FUNC_PERFMON 0x0A
+
+#define MAX_KVM_CPUID_ENTRIES 100
+
+static void filter_cpuid(struct kvm_cpuid2 *kvm_cpuid)
+{
+ unsigned int i;
+
+ /*
+ * Filter CPUID functions that are not supported by the hypervisor.
+ */
+ for (i = 0; i < kvm_cpuid->nent; i++) {
+ struct kvm_cpuid_entry2 *entry = &kvm_cpuid->entries[i];
+
+ switch (entry->function) {
+ case 1:
+ /* Set X86_FEATURE_HYPERVISOR */
+ if (entry->index == 0)
+ entry->ecx |= (1U << 31);
+ break;
+ case 6:
+ /* Clear X86_FEATURE_EPB */
+ entry->ecx = entry->ecx & ~(1 << 3);
+ break;
+ case CPUID_FUNC_PERFMON:
+ entry->eax = 0x00; /* disable it */
+ break;
+ default:
+ /* Keep the CPUID function as-is */
+ break;
+ }
+ }
+}
+
+void kvm_cpu__setup_cpuid(struct kvm_cpu *vcpu)
+{
+ struct kvm_cpuid2 *kvm_cpuid;
+
+ kvm_cpuid = calloc(1, sizeof(*kvm_cpuid) +
+ MAX_KVM_CPUID_ENTRIES * sizeof(*kvm_cpuid->entries));
+
+ kvm_cpuid->nent = MAX_KVM_CPUID_ENTRIES;
+ if (ioctl(vcpu->kvm->sys_fd, KVM_GET_SUPPORTED_CPUID, kvm_cpuid) < 0)
+ die_perror("KVM_GET_SUPPORTED_CPUID failed");
+
+ filter_cpuid(kvm_cpuid);
+
+ if (ioctl(vcpu->vcpu_fd, KVM_SET_CPUID2, kvm_cpuid) < 0)
+ die_perror("KVM_SET_CPUID2 failed");
+
+ free(kvm_cpuid);
+}
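Note on filter_cpuid(): the per-leaf adjustments are plain bit operations on the registers reported by KVM_GET_SUPPORTED_CPUID. A minimal sketch of the same three cases follows, assuming a local stand-in struct (cpuid_entry_sketch) instead of struct kvm_cpuid_entry2 so that it builds without the KVM headers. It uses unsigned constants for the shifts, since shifting a signed 1 into bit 31 is not well defined in C.

```c
#include <assert.h>
#include <stdint.h>

#define CPUID_FUNC_PERFMON 0x0A

/* Hypothetical stand-in for the fields of struct kvm_cpuid_entry2 used here. */
struct cpuid_entry_sketch {
	uint32_t function;
	uint32_t index;
	uint32_t eax, ebx, ecx, edx;
};

/* Apply the same per-leaf adjustments as the switch in filter_cpuid(). */
void filter_entry_sketch(struct cpuid_entry_sketch *e)
{
	assert(e);

	switch (e->function) {
	case 1:
		/* Leaf 1, ECX bit 31: hypervisor present */
		if (e->index == 0)
			e->ecx |= (1u << 31);
		break;
	case 6:
		/* Leaf 6, ECX bit 3: energy-performance bias */
		e->ecx &= ~(1u << 3);
		break;
	case CPUID_FUNC_PERFMON:
		/* EAX = 0 reports no architectural perfmon support */
		e->eax = 0;
		break;
	default:
		/* Everything else is passed through unchanged */
		break;
	}
}
```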
--- /dev/null
+#ifndef ASSEMBLY_H_
+#define ASSEMBLY_H_
+
+#define __ALIGN .p2align 4, 0x90
+#define ENTRY(name) \
+ __ALIGN; \
+ .globl name; \
+ name:
+
+#define GLOBAL(name) \
+ .globl name; \
+ name:
+
+#define ENTRY_END(name) GLOBAL(name##_end)
+#define END(name) GLOBAL(name##_end)
+
+/*
+ * gas emits an operand-size override prefix for iret here,
+ * which we do not want, so hardcode the opcode for
+ * 16-bit mode
+ */
+#define IRET .byte 0xcf
+
+#endif /* ASSEMBLY_H_ */
--- /dev/null
+#ifndef _KVM_BARRIER_H_
+#define _KVM_BARRIER_H_
+
+#define barrier() asm volatile("": : :"memory")
+
+#define mb() asm volatile ("mfence": : :"memory")
+#define rmb() asm volatile ("lfence": : :"memory")
+#define wmb() asm volatile ("sfence": : :"memory")
+
+#ifdef CONFIG_SMP
+#define smp_mb() mb()
+#define smp_rmb() rmb()
+#define smp_wmb() wmb()
+#else
+#define smp_mb() barrier()
+#define smp_rmb() barrier()
+#define smp_wmb() barrier()
+#endif
+
+#endif /* _KVM_BARRIER_H_ */
--- /dev/null
+#ifndef BIOS_EXPORT_H_
+#define BIOS_EXPORT_H_
+
+struct kvm;
+
+extern char bios_rom[0];
+extern char bios_rom_end[0];
+
+#define bios_rom_size (bios_rom_end - bios_rom)
+
+extern void setup_bios(struct kvm *kvm);
+
+#endif /* BIOS_EXPORT_H_ */
--- /dev/null
+#ifndef BIOS_H_
+#define BIOS_H_
+
+/*
+ * X86-32 Memory Map (typical)
+ *                                       start      end
+ * Real Mode Interrupt Vector Table      0x00000000 0x000003FF
+ * BDA area                              0x00000400 0x000004FF
+ * Conventional Low Memory               0x00000500 0x0009FBFF
+ * EBDA area                             0x0009FC00 0x0009FFFF
+ * VIDEO RAM                             0x000A0000 0x000BFFFF
+ * VIDEO ROM (BIOS)                      0x000C0000 0x000C7FFF
+ * ROMs & unus. space (mapped hw & misc) 0x000C8000 0x000EFFFF 160 KiB (typically)
+ * Motherboard BIOS                      0x000F0000 0x000FFFFF
+ * Extended Memory                       0x00100000 0xFEBFFFFF
+ * Reserved (configs, ACPI, PnP, etc)    0xFEC00000 0xFFFFFFFF
+ */
+
+#define REAL_MODE_IVT_BEGIN 0x00000000
+#define REAL_MODE_IVT_END 0x000003ff
+
+#define BDA_START 0x00000400
+#define BDA_END 0x000004ff
+
+#define EBDA_START 0x0009fc00
+#define EBDA_END 0x0009ffff
+
+#define E820_MAP_START EBDA_START
+
+#define MB_BIOS_BEGIN 0x000f0000
+#define MB_FIRMWARE_BIOS_BEGIN 0x000e0000
+#define MB_BIOS_END 0x000fffff
+
+#define MB_BIOS_SIZE (MB_BIOS_END - MB_BIOS_BEGIN + 1)
+#define MB_FIRMWARE_BIOS_SIZE (MB_BIOS_END - MB_FIRMWARE_BIOS_BEGIN + 1)
+
+#define VGA_RAM_BEGIN 0x000a0000
+#define VGA_RAM_END 0x000bffff
+
+#define VGA_ROM_BEGIN 0x000c0000
+#define VGA_ROM_OEM_STRING VGA_ROM_BEGIN
+#define VGA_ROM_OEM_STRING_SIZE 16
+#define VGA_ROM_MODES (VGA_ROM_OEM_STRING + VGA_ROM_OEM_STRING_SIZE)
+#define VGA_ROM_MODES_SIZE 32
+#define VGA_ROM_END 0x000c7fff
+
+/* we handle one page only */
+#define VGA_RAM_SEG (VGA_RAM_BEGIN >> 4)
+#define VGA_PAGE_SIZE 0x007d0 /* 80x25 */
+
+/* real mode interrupt vector table */
+#define REAL_INTR_BASE REAL_MODE_IVT_BEGIN
+#define REAL_INTR_VECTORS 256
+
+/*
+ * BIOS stack must be at absolute predefined memory address
+ * We reserve 64 bytes for BIOS stack
+ */
+#define MB_BIOS_SS 0xfff7
+#define MB_BIOS_SP 0x40
+
+/*
+ * When interfacing with assembler code we need to be sure how
+ * arguments are passed in real mode.
+ */
+#define bioscall __attribute__((regparm(3)))
+
+#ifndef __ASSEMBLER__
+
+#include <linux/types.h>
+
+struct biosregs {
+ u32 eax;
+ u32 ebx;
+ u32 ecx;
+ u32 edx;
+ u32 esp;
+ u32 ebp;
+ u32 esi;
+ u32 edi;
+ u32 ds;
+ u32 es;
+ u32 fs;
+ u32 eip;
+ u32 eflags;
+};
+
+extern bioscall void int10_handler(struct biosregs *regs);
+extern bioscall void int15_handler(struct biosregs *regs);
+
+#endif
+
+#endif /* BIOS_H_ */
--- /dev/null
+/*
+ * Linux boot protocol specifics
+ */
+
+#ifndef BOOT_PROTOCOL_H_
+#define BOOT_PROTOCOL_H_
+
+/*
+ * The protected mode kernel part of a modern bzImage is loaded
+ * at 1 MB by default.
+ */
+#define BZ_DEFAULT_SETUP_SECTS 4
+#define BZ_KERNEL_START 0x100000UL
+#define INITRD_START 0x1000000UL
+
+#endif /* BOOT_PROTOCOL_H_ */
--- /dev/null
+#ifndef KVM__CPUFEATURE_H
+#define KVM__CPUFEATURE_H
+
+#define CPUID_VENDOR_INTEL_1 0x756e6547 /* "Genu" */
+#define CPUID_VENDOR_INTEL_2 0x49656e69 /* "ineI" */
+#define CPUID_VENDOR_INTEL_3 0x6c65746e /* "ntel" */
+
+#define CPUID_VENDOR_AMD_1 0x68747541 /* "Auth" */
+#define CPUID_VENDOR_AMD_2 0x69746e65 /* "enti" */
+#define CPUID_VENDOR_AMD_3 0x444d4163 /* "cAMD" */
+
+/*
+ * CPUID flags we need to deal with
+ */
+#define KVM__X86_FEATURE_VMX 5 /* Hardware virtualization */
+#define KVM__X86_FEATURE_SVM 2 /* Secure virtual machine */
+#define KVM__X86_FEATURE_XSAVE 26 /* XSAVE/XRSTOR/XSETBV/XGETBV */
+
+#define cpu_feature_disable(reg, feature) \
+ ((reg) & ~(1 << (feature)))
+#define cpu_feature_enable(reg, feature) \
+ ((reg) | (1 << (feature)))
+
+struct cpuid_regs {
+ u32 eax;
+ u32 ebx;
+ u32 ecx;
+ u32 edx;
+};
+
+static inline void host_cpuid(struct cpuid_regs *regs)
+{
+ asm volatile("cpuid"
+ : "=a" (regs->eax),
+ "=b" (regs->ebx),
+ "=c" (regs->ecx),
+ "=d" (regs->edx)
+ : "0" (regs->eax), "2" (regs->ecx));
+}
+
+#endif /* KVM__CPUFEATURE_H */
--- /dev/null
+#ifndef KVM__INTERRUPT_H
+#define KVM__INTERRUPT_H
+
+#include <linux/types.h>
+#include "kvm/bios.h"
+#include "kvm/bios-export.h"
+
+struct real_intr_desc {
+ u16 offset;
+ u16 segment;
+} __attribute__((packed));
+
+#define REAL_SEGMENT_SHIFT 4
+#define REAL_SEGMENT(addr) ((addr) >> REAL_SEGMENT_SHIFT)
+#define REAL_OFFSET(addr) ((addr) & ((1 << REAL_SEGMENT_SHIFT) - 1))
+#define REAL_INTR_SIZE (REAL_INTR_VECTORS * sizeof(struct real_intr_desc))
+
+struct interrupt_table {
+ struct real_intr_desc entries[REAL_INTR_VECTORS];
+};
+
+void interrupt_table__copy(struct interrupt_table *itable, void *dst, unsigned int size);
+void interrupt_table__setup(struct interrupt_table *itable, struct real_intr_desc *entry);
+void interrupt_table__set(struct interrupt_table *itable, struct real_intr_desc *entry, unsigned int num);
+
+#endif /* KVM__INTERRUPT_H */
--- /dev/null
+#ifndef KVM__KVM_ARCH_H
+#define KVM__KVM_ARCH_H
+
+#include "kvm/interrupt.h"
+#include "kvm/segment.h"
+
+#include <stdbool.h>
+#include <linux/types.h>
+#include <time.h>
+
+/*
+ * The hole includes VESA framebuffer and PCI memory.
+ */
+#define KVM_32BIT_MAX_MEM_SIZE (1ULL << 32)
+#define KVM_32BIT_GAP_SIZE (768 << 20)
+#define KVM_32BIT_GAP_START (KVM_32BIT_MAX_MEM_SIZE - KVM_32BIT_GAP_SIZE)
+
+#define KVM_MMIO_START KVM_32BIT_GAP_START
+
+/* This is the address that pci_get_io_space_block() starts allocating
+ * from. Note that this is a PCI bus address (though same on x86).
+ */
+#define KVM_PCI_MMIO_AREA (KVM_MMIO_START + 0x1000000)
+#define KVM_VIRTIO_MMIO_AREA (KVM_MMIO_START + 0x2000000)
+
+struct kvm_arch {
+ u16 boot_selector;
+ u16 boot_ip;
+ u16 boot_sp;
+
+ struct interrupt_table interrupt_table;
+};
+
+static inline void *guest_flat_to_host(struct kvm *kvm, unsigned long offset); /* In kvm.h */
+
+static inline void *guest_real_to_host(struct kvm *kvm, u16 selector, u16 offset)
+{
+ unsigned long flat = segment_to_flat(selector, offset);
+
+ return guest_flat_to_host(kvm, flat);
+}
+
+#endif /* KVM__KVM_ARCH_H */
--- /dev/null
+#ifndef KVM__KVM_CPU_ARCH_H
+#define KVM__KVM_CPU_ARCH_H
+
+/* Architecture-specific kvm_cpu definitions. */
+
+#include <linux/kvm.h> /* for struct kvm_regs */
+#include "kvm/kvm.h" /* for kvm__emulate_{mm}io() */
+#include <stdbool.h>
+#include <pthread.h>
+
+struct kvm;
+
+struct kvm_cpu {
+ pthread_t thread; /* VCPU thread */
+
+ unsigned long cpu_id;
+
+ struct kvm *kvm; /* parent KVM */
+ int vcpu_fd; /* For VCPU ioctls() */
+ struct kvm_run *kvm_run;
+
+ struct kvm_regs regs;
+ struct kvm_sregs sregs;
+ struct kvm_fpu fpu;
+
+ struct kvm_msrs *msrs; /* dynamically allocated */
+
+ u8 is_running;
+ u8 paused;
+ u8 needs_nmi;
+
+ struct kvm_coalesced_mmio_ring *ring;
+};
+
+/*
+ * As these are such simple wrappers, let's have them in the header so they'll
+ * be cheaper to call:
+ */
+static inline bool kvm_cpu__emulate_io(struct kvm *kvm, u16 port, void *data, int direction, int size, u32 count)
+{
+ return kvm__emulate_io(kvm, port, data, direction, size, count);
+}
+
+static inline bool kvm_cpu__emulate_mmio(struct kvm *kvm, u64 phys_addr, u8 *data, u32 len, u8 is_write)
+{
+ return kvm__emulate_mmio(kvm, phys_addr, data, len, is_write);
+}
+
+#endif /* KVM__KVM_CPU_ARCH_H */
--- /dev/null
+#ifndef KVM_MPTABLE_H_
+#define KVM_MPTABLE_H_
+
+struct kvm;
+
+int mptable__init(struct kvm *kvm);
+int mptable__exit(struct kvm *kvm);
+
+#endif /* KVM_MPTABLE_H_ */
--- /dev/null
+#include "kvm/interrupt.h"
+
+#include "kvm/util.h"
+
+#include <string.h>
+
+void interrupt_table__copy(struct interrupt_table *itable, void *dst, unsigned int size)
+{
+ if (size < sizeof(itable->entries))
+ die("An attempt to overwrite host memory");
+
+ memcpy(dst, itable->entries, sizeof(itable->entries));
+}
+
+void interrupt_table__setup(struct interrupt_table *itable, struct real_intr_desc *entry)
+{
+ unsigned int i;
+
+ for (i = 0; i < REAL_INTR_VECTORS; i++)
+ itable->entries[i] = *entry;
+}
+
+void interrupt_table__set(struct interrupt_table *itable,
+ struct real_intr_desc *entry, unsigned int num)
+{
+ if (num < REAL_INTR_VECTORS)
+ itable->entries[num] = *entry;
+}
--- /dev/null
+#include "kvm/ioport.h"
+
+#include <stdlib.h>
+#include <stdio.h>
+
+static bool debug_io_out(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size)
+{
+ return false;
+}
+
+static struct ioport_operations debug_ops = {
+ .io_out = debug_io_out,
+};
+
+static bool seabios_debug_io_out(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size)
+{
+ char ch;
+
+ ch = ioport__read8(data);
+
+ putchar(ch);
+
+ return true;
+}
+
+static struct ioport_operations seabios_debug_ops = {
+ .io_out = seabios_debug_io_out,
+};
+
+static bool dummy_io_in(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size)
+{
+ return true;
+}
+
+static bool dummy_io_out(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size)
+{
+ return true;
+}
+
+static struct ioport_operations dummy_read_write_ioport_ops = {
+ .io_in = dummy_io_in,
+ .io_out = dummy_io_out,
+};
+
+static struct ioport_operations dummy_write_only_ioport_ops = {
+ .io_out = dummy_io_out,
+};
+
+void ioport__setup_arch(struct kvm *kvm)
+{
+ /* Legacy ioport setup */
+
+ /* 0x0020 - 0x003F - 8259A PIC 1 */
+ ioport__register(kvm, 0x0020, &dummy_read_write_ioport_ops, 2, NULL);
+
+ /* PORT 0040-005F - PIT - PROGRAMMABLE INTERVAL TIMER (8253, 8254) */
+ ioport__register(kvm, 0x0040, &dummy_read_write_ioport_ops, 4, NULL);
+
+ /* 0x00A0 - 0x00AF - 8259A PIC 2 */
+ ioport__register(kvm, 0x00A0, &dummy_read_write_ioport_ops, 2, NULL);
+
+ /* PORT 00E0-00EF are 'motherboard specific' so we use them for our
+ internal debugging purposes. */
+ ioport__register(kvm, IOPORT_DBG, &debug_ops, 1, NULL);
+
+ /* PORT 00ED - DUMMY PORT FOR DELAY??? */
+ ioport__register(kvm, 0x00ED, &dummy_write_only_ioport_ops, 1, NULL);
+
+ /* 0x00F0 - 0x00FF - Math co-processor */
+ ioport__register(kvm, 0x00F0, &dummy_write_only_ioport_ops, 2, NULL);
+
+ /* PORT 03D4-03D5 - COLOR VIDEO - CRT CONTROL REGISTERS */
+ ioport__register(kvm, 0x03D4, &dummy_read_write_ioport_ops, 1, NULL);
+ ioport__register(kvm, 0x03D5, &dummy_write_only_ioport_ops, 1, NULL);
+
+ ioport__register(kvm, 0x402, &seabios_debug_ops, 1, NULL);
+}
--- /dev/null
+#include "kvm/irq.h"
+#include "kvm/kvm.h"
+#include "kvm/util.h"
+
+#include <linux/types.h>
+#include <linux/rbtree.h>
+#include <linux/list.h>
+#include <linux/kvm.h>
+#include <sys/ioctl.h>
+
+#include <stddef.h>
+#include <stdlib.h>
+#include <errno.h>
+
+#define IRQ_MAX_GSI 64
+#define IRQCHIP_MASTER 0
+#define IRQCHIP_SLAVE 1
+#define IRQCHIP_IOAPIC 2
+
+static u8 next_line = 5;
+static u8 next_dev = 1;
+static struct rb_root pci_tree = RB_ROOT;
+
+/* First 24 GSIs are routed between IRQCHIPs and IOAPICs */
+static u32 gsi = 24;
+
+struct kvm_irq_routing *irq_routing;
+
+static int irq__add_routing(u32 gsi, u32 type, u32 irqchip, u32 pin)
+{
+ if (gsi >= IRQ_MAX_GSI)
+ return -ENOSPC;
+
+ irq_routing->entries[irq_routing->nr++] =
+ (struct kvm_irq_routing_entry) {
+ .gsi = gsi,
+ .type = type,
+ .u.irqchip.irqchip = irqchip,
+ .u.irqchip.pin = pin,
+ };
+
+ return 0;
+}
+
+static struct pci_dev *search(struct rb_root *root, u32 id)
+{
+ struct rb_node *node = root->rb_node;
+
+ while (node) {
+ struct pci_dev *data = rb_entry(node, struct pci_dev, node);
+ int result;
+
+ result = id - data->id;
+
+ if (result < 0)
+ node = node->rb_left;
+ else if (result > 0)
+ node = node->rb_right;
+ else
+ return data;
+ }
+ return NULL;
+}
+
+static int insert(struct rb_root *root, struct pci_dev *data)
+{
+ struct rb_node **new = &(root->rb_node), *parent = NULL;
+
+ /* Figure out where to put new node */
+ while (*new) {
+ struct pci_dev *this = container_of(*new, struct pci_dev, node);
+ int result = data->id - this->id;
+
+ parent = *new;
+ if (result < 0)
+ new = &((*new)->rb_left);
+ else if (result > 0)
+ new = &((*new)->rb_right);
+ else
+ return -EEXIST;
+ }
+
+ /* Add new node and rebalance tree. */
+ rb_link_node(&data->node, parent, new);
+ rb_insert_color(&data->node, root);
+
+ return 0;
+}
+
+int irq__register_device(u32 dev, u8 *num, u8 *pin, u8 *line)
+{
+ struct pci_dev *node;
+ int r;
+
+ node = search(&pci_tree, dev);
+
+ if (!node) {
+ /* We haven't found a node - first device of its kind */
+ node = malloc(sizeof(*node));
+ if (node == NULL)
+ return -ENOMEM;
+
+ *node = (struct pci_dev) {
+ .id = dev,
+ /*
+ * PCI supports only INTA#,B#,C#,D# per device.
+ * A#,B#,C#,D# are allowed for multifunctional
+ * devices so stick with A# for our single
+ * function devices.
+ */
+ .pin = 1,
+ };
+
+ INIT_LIST_HEAD(&node->lines);
+
+ r = insert(&pci_tree, node);
+ if (r) {
+ free(node);
+ return r;
+ }
+ }
+
+ if (node) {
+ /* This device already has a pin assigned, give out a new line and device id */
+ struct irq_line *new = malloc(sizeof(*new));
+ if (new == NULL)
+ return -ENOMEM;
+
+ new->line = next_line++;
+ *line = new->line;
+ *pin = node->pin;
+ *num = next_dev++;
+
+ list_add(&new->node, &node->lines);
+
+ return 0;
+ }
+
+ return -EFAULT;
+}
+
+int irq__init(struct kvm *kvm)
+{
+ int i, r;
+
+ irq_routing = calloc(sizeof(struct kvm_irq_routing) +
+ IRQ_MAX_GSI * sizeof(struct kvm_irq_routing_entry), 1);
+ if (irq_routing == NULL)
+ return -ENOMEM;
+
+ /* Hook first 8 GSIs to master IRQCHIP */
+ for (i = 0; i < 8; i++)
+ if (i != 2)
+ irq__add_routing(i, KVM_IRQ_ROUTING_IRQCHIP, IRQCHIP_MASTER, i);
+
+ /* Hook next 8 GSIs to slave IRQCHIP */
+ for (i = 8; i < 16; i++)
+ irq__add_routing(i, KVM_IRQ_ROUTING_IRQCHIP, IRQCHIP_SLAVE, i - 8);
+
+ /* Last but not least, IOAPIC */
+ for (i = 0; i < 24; i++) {
+ if (i == 0)
+ irq__add_routing(i, KVM_IRQ_ROUTING_IRQCHIP, IRQCHIP_IOAPIC, 2);
+ else if (i != 2)
+ irq__add_routing(i, KVM_IRQ_ROUTING_IRQCHIP, IRQCHIP_IOAPIC, i);
+ }
+
+ r = ioctl(kvm->vm_fd, KVM_SET_GSI_ROUTING, irq_routing);
+ if (r) {
+ free(irq_routing);
+ return -errno;
+ }
+
+ return 0;
+}
+dev_base_init(irq__init);
+
+int irq__exit(struct kvm *kvm)
+{
+ struct rb_node *ent;
+
+ free(irq_routing);
+
+ while ((ent = rb_first(&pci_tree))) {
+ struct pci_dev *dev;
+ struct irq_line *line;
+
+ dev = rb_entry(ent, struct pci_dev, node);
+ while (!list_empty(&dev->lines)) {
+ line = list_first_entry(&dev->lines, struct irq_line, node);
+ list_del(&line->node);
+ free(line);
+ }
+ rb_erase(&dev->node, &pci_tree);
+ free(dev);
+ }
+
+ return 0;
+}
+dev_base_exit(irq__exit);
+
+int irq__add_msix_route(struct kvm *kvm, struct msi_msg *msg)
+{
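+ int r;
+
+ /* irq_routing->entries[] only has room for IRQ_MAX_GSI entries */
+ if (irq_routing->nr >= IRQ_MAX_GSI)
+ return -ENOSPC;
+
+ irq_routing->entries[irq_routing->nr++] =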
+ (struct kvm_irq_routing_entry) {
+ .gsi = gsi,
+ .type = KVM_IRQ_ROUTING_MSI,
+ .u.msi.address_hi = msg->address_hi,
+ .u.msi.address_lo = msg->address_lo,
+ .u.msi.data = msg->data,
+ };
+
+ r = ioctl(kvm->vm_fd, KVM_SET_GSI_ROUTING, irq_routing);
+ if (r)
+ return r;
+
+ return gsi++;
+}
+
+struct rb_node *irq__get_pci_tree(void)
+{
+ return rb_first(&pci_tree);
+}
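Note on the default routing set up by irq__init(): the three loops skip GSI 2 on the master PIC and on the IOAPIC, and map GSI 0 to IOAPIC pin 2. The sketch below rebuilds the same table in a plain array so the resulting entry count can be checked; struct route_sketch and build_default_routes() are hypothetical names for illustration and do not use the KVM structures.

```c
#include <assert.h>
#include <stdint.h>

#define IRQ_MAX_GSI	64
#define IRQCHIP_MASTER	0
#define IRQCHIP_SLAVE	1
#define IRQCHIP_IOAPIC	2

/* Hypothetical, reduced form of struct kvm_irq_routing_entry. */
struct route_sketch {
	uint32_t gsi;
	uint32_t irqchip;
	uint32_t pin;
};

/* Fill 'tbl' in the same order as irq__init(); return the entry count. */
unsigned int build_default_routes(struct route_sketch *tbl)
{
	unsigned int n = 0, i;

	assert(tbl);

	/* First 8 GSIs go to the master PIC, except the cascade line 2 */
	for (i = 0; i < 8; i++)
		if (i != 2)
			tbl[n++] = (struct route_sketch){ i, IRQCHIP_MASTER, i };

	/* Next 8 GSIs go to the slave PIC */
	for (i = 8; i < 16; i++)
		tbl[n++] = (struct route_sketch){ i, IRQCHIP_SLAVE, i - 8 };

	/* IOAPIC: GSI 0 lands on pin 2, GSI 2 has no entry */
	for (i = 0; i < 24; i++) {
		if (i == 0)
			tbl[n++] = (struct route_sketch){ i, IRQCHIP_IOAPIC, 2 };
		else if (i != 2)
			tbl[n++] = (struct route_sketch){ i, IRQCHIP_IOAPIC, i };
	}

	return n;
}
```

Under these assumptions the default table uses 7 + 8 + 23 = 38 of the IRQ_MAX_GSI slots, so 26 remain for MSI routes added later.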
--- /dev/null
+#include "kvm/kvm-cpu.h"
+
+#include "kvm/symbol.h"
+#include "kvm/util.h"
+#include "kvm/kvm.h"
+
+#include <asm/msr-index.h>
+#include <asm/apicdef.h>
+#include <linux/err.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <signal.h>
+#include <limits.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <stdio.h>
+
+static int debug_fd;
+
+void kvm_cpu__set_debug_fd(int fd)
+{
+ debug_fd = fd;
+}
+
+int kvm_cpu__get_debug_fd(void)
+{
+ return debug_fd;
+}
+
+static inline bool is_in_protected_mode(struct kvm_cpu *vcpu)
+{
+ return vcpu->sregs.cr0 & 0x01;
+}
+
+static inline u64 ip_to_flat(struct kvm_cpu *vcpu, u64 ip)
+{
+ u64 cs;
+
+ /*
+ * NOTE! We should take code segment base address into account here.
+ * Luckily it's usually zero because Linux uses flat memory model.
+ */
+ if (is_in_protected_mode(vcpu))
+ return ip;
+
+ cs = vcpu->sregs.cs.selector;
+
+ return ip + (cs << 4);
+}
+
+static inline u32 selector_to_base(u16 selector)
+{
+ /*
+ * KVM on Intel requires 'base' to be 'selector * 16' in real mode.
+ */
+ return (u32)selector << 4;
+}
+
+static struct kvm_cpu *kvm_cpu__new(struct kvm *kvm)
+{
+ struct kvm_cpu *vcpu;
+
+ vcpu = calloc(1, sizeof(*vcpu));
+ if (!vcpu)
+ return NULL;
+
+ vcpu->kvm = kvm;
+
+ return vcpu;
+}
+
+void kvm_cpu__delete(struct kvm_cpu *vcpu)
+{
+ if (vcpu->msrs)
+ free(vcpu->msrs);
+
+ free(vcpu);
+}
+
+static int kvm_cpu__set_lint(struct kvm_cpu *vcpu)
+{
+ struct local_apic lapic;
+
+ if (ioctl(vcpu->vcpu_fd, KVM_GET_LAPIC, &lapic))
+ return -1;
+
+ lapic.lvt_lint0.delivery_mode = APIC_MODE_EXTINT;
+ lapic.lvt_lint1.delivery_mode = APIC_MODE_NMI;
+
+ return ioctl(vcpu->vcpu_fd, KVM_SET_LAPIC, &lapic);
+}
+
+struct kvm_cpu *kvm_cpu__arch_init(struct kvm *kvm, unsigned long cpu_id)
+{
+ struct kvm_cpu *vcpu;
+ int mmap_size;
+ int coalesced_offset;
+
+ vcpu = kvm_cpu__new(kvm);
+ if (!vcpu)
+ return NULL;
+
+ vcpu->cpu_id = cpu_id;
+
+ vcpu->vcpu_fd = ioctl(vcpu->kvm->vm_fd, KVM_CREATE_VCPU, cpu_id);
+ if (vcpu->vcpu_fd < 0)
+ die_perror("KVM_CREATE_VCPU ioctl");
+
+ mmap_size = ioctl(vcpu->kvm->sys_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
+ if (mmap_size < 0)
+ die_perror("KVM_GET_VCPU_MMAP_SIZE ioctl");
+
+ vcpu->kvm_run = mmap(NULL, mmap_size, PROT_RW, MAP_SHARED, vcpu->vcpu_fd, 0);
+ if (vcpu->kvm_run == MAP_FAILED)
+ die("unable to mmap vcpu fd");
+
+ coalesced_offset = ioctl(kvm->sys_fd, KVM_CHECK_EXTENSION, KVM_CAP_COALESCED_MMIO);
+ if (coalesced_offset)
+ vcpu->ring = (void *)vcpu->kvm_run + (coalesced_offset * PAGE_SIZE);
+
+ if (kvm_cpu__set_lint(vcpu))
+ die_perror("KVM_SET_LAPIC failed");
+
+ vcpu->is_running = true;
+
+ return vcpu;
+}
+
+static struct kvm_msrs *kvm_msrs__new(size_t nmsrs)
+{
+ struct kvm_msrs *msrs = calloc(1, sizeof(*msrs) + (sizeof(struct kvm_msr_entry) * nmsrs));
+
+ if (!msrs)
+ die("out of memory");
+
+ return msrs;
+}
+
+#define KVM_MSR_ENTRY(_index, _data) \
+ (struct kvm_msr_entry) { .index = _index, .data = _data }
+
+static void kvm_cpu__setup_msrs(struct kvm_cpu *vcpu)
+{
+ unsigned long ndx = 0;
+
+ vcpu->msrs = kvm_msrs__new(100);
+
+ vcpu->msrs->entries[ndx++] = KVM_MSR_ENTRY(MSR_IA32_SYSENTER_CS, 0x0);
+ vcpu->msrs->entries[ndx++] = KVM_MSR_ENTRY(MSR_IA32_SYSENTER_ESP, 0x0);
+ vcpu->msrs->entries[ndx++] = KVM_MSR_ENTRY(MSR_IA32_SYSENTER_EIP, 0x0);
+#ifdef CONFIG_X86_64
+ vcpu->msrs->entries[ndx++] = KVM_MSR_ENTRY(MSR_STAR, 0x0);
+ vcpu->msrs->entries[ndx++] = KVM_MSR_ENTRY(MSR_CSTAR, 0x0);
+ vcpu->msrs->entries[ndx++] = KVM_MSR_ENTRY(MSR_KERNEL_GS_BASE, 0x0);
+ vcpu->msrs->entries[ndx++] = KVM_MSR_ENTRY(MSR_SYSCALL_MASK, 0x0);
+ vcpu->msrs->entries[ndx++] = KVM_MSR_ENTRY(MSR_LSTAR, 0x0);
+#endif
+ vcpu->msrs->entries[ndx++] = KVM_MSR_ENTRY(MSR_IA32_TSC, 0x0);
+ vcpu->msrs->entries[ndx++] = KVM_MSR_ENTRY(MSR_IA32_MISC_ENABLE,
+ MSR_IA32_MISC_ENABLE_FAST_STRING);
+
+ vcpu->msrs->nmsrs = ndx;
+
+ if (ioctl(vcpu->vcpu_fd, KVM_SET_MSRS, vcpu->msrs) < 0)
+ die_perror("KVM_SET_MSRS failed");
+}
+
+static void kvm_cpu__setup_fpu(struct kvm_cpu *vcpu)
+{
+ vcpu->fpu = (struct kvm_fpu) {
+ .fcw = 0x37f,
+ .mxcsr = 0x1f80,
+ };
+
+ if (ioctl(vcpu->vcpu_fd, KVM_SET_FPU, &vcpu->fpu) < 0)
+ die_perror("KVM_SET_FPU failed");
+}
+
+static void kvm_cpu__setup_regs(struct kvm_cpu *vcpu)
+{
+ vcpu->regs = (struct kvm_regs) {
+ /* We start the guest in 16-bit real mode */
+ .rflags = 0x0000000000000002ULL,
+
+ .rip = vcpu->kvm->arch.boot_ip,
+ .rsp = vcpu->kvm->arch.boot_sp,
+ .rbp = vcpu->kvm->arch.boot_sp,
+ };
+
+ if (vcpu->regs.rip > USHRT_MAX)
+ die("ip 0x%llx is too high for real mode", (u64)vcpu->regs.rip);
+
+ if (ioctl(vcpu->vcpu_fd, KVM_SET_REGS, &vcpu->regs) < 0)
+ die_perror("KVM_SET_REGS failed");
+}
+
+static void kvm_cpu__setup_sregs(struct kvm_cpu *vcpu)
+{
+ if (ioctl(vcpu->vcpu_fd, KVM_GET_SREGS, &vcpu->sregs) < 0)
+ die_perror("KVM_GET_SREGS failed");
+
+ vcpu->sregs.cs.selector = vcpu->kvm->arch.boot_selector;
+ vcpu->sregs.cs.base = selector_to_base(vcpu->kvm->arch.boot_selector);
+ vcpu->sregs.ss.selector = vcpu->kvm->arch.boot_selector;
+ vcpu->sregs.ss.base = selector_to_base(vcpu->kvm->arch.boot_selector);
+ vcpu->sregs.ds.selector = vcpu->kvm->arch.boot_selector;
+ vcpu->sregs.ds.base = selector_to_base(vcpu->kvm->arch.boot_selector);
+ vcpu->sregs.es.selector = vcpu->kvm->arch.boot_selector;
+ vcpu->sregs.es.base = selector_to_base(vcpu->kvm->arch.boot_selector);
+ vcpu->sregs.fs.selector = vcpu->kvm->arch.boot_selector;
+ vcpu->sregs.fs.base = selector_to_base(vcpu->kvm->arch.boot_selector);
+ vcpu->sregs.gs.selector = vcpu->kvm->arch.boot_selector;
+ vcpu->sregs.gs.base = selector_to_base(vcpu->kvm->arch.boot_selector);
+
+ if (ioctl(vcpu->vcpu_fd, KVM_SET_SREGS, &vcpu->sregs) < 0)
+ die_perror("KVM_SET_SREGS failed");
+}
+
+/**
+ * kvm_cpu__reset_vcpu - reset virtual CPU to a known state
+ */
+void kvm_cpu__reset_vcpu(struct kvm_cpu *vcpu)
+{
+ kvm_cpu__setup_cpuid(vcpu);
+ kvm_cpu__setup_sregs(vcpu);
+ kvm_cpu__setup_regs(vcpu);
+ kvm_cpu__setup_fpu(vcpu);
+ kvm_cpu__setup_msrs(vcpu);
+}
+
+bool kvm_cpu__handle_exit(struct kvm_cpu *vcpu)
+{
+ return false;
+}
+
+static void print_dtable(const char *name, struct kvm_dtable *dtable)
+{
+ dprintf(debug_fd, " %s %016llx %08hx\n",
+ name, (u64) dtable->base, (u16) dtable->limit);
+}
+
+static void print_segment(const char *name, struct kvm_segment *seg)
+{
+ dprintf(debug_fd, " %s %04hx %016llx %08x %02hhx %x %x %x %x %x %x %x\n",
+ name, (u16) seg->selector, (u64) seg->base, (u32) seg->limit,
+ (u8) seg->type, seg->present, seg->dpl, seg->db, seg->s, seg->l, seg->g, seg->avl);
+}
+
+void kvm_cpu__show_registers(struct kvm_cpu *vcpu)
+{
+ unsigned long cr0, cr2, cr3;
+ unsigned long cr4, cr8;
+ unsigned long rax, rbx, rcx;
+ unsigned long rdx, rsi, rdi;
+ unsigned long rbp, r8, r9;
+ unsigned long r10, r11, r12;
+ unsigned long r13, r14, r15;
+ unsigned long rip, rsp;
+ struct kvm_sregs sregs;
+ unsigned long rflags;
+ struct kvm_regs regs;
+ int i;
+
+ if (ioctl(vcpu->vcpu_fd, KVM_GET_REGS, &regs) < 0)
+ die("KVM_GET_REGS failed");
+
+ rflags = regs.rflags;
+
+ rip = regs.rip; rsp = regs.rsp;
+ rax = regs.rax; rbx = regs.rbx; rcx = regs.rcx;
+ rdx = regs.rdx; rsi = regs.rsi; rdi = regs.rdi;
+ rbp = regs.rbp; r8 = regs.r8; r9 = regs.r9;
+ r10 = regs.r10; r11 = regs.r11; r12 = regs.r12;
+ r13 = regs.r13; r14 = regs.r14; r15 = regs.r15;
+
+ dprintf(debug_fd, "\n Registers:\n");
+ dprintf(debug_fd, " ----------\n");
+ dprintf(debug_fd, " rip: %016lx rsp: %016lx flags: %016lx\n", rip, rsp, rflags);
+ dprintf(debug_fd, " rax: %016lx rbx: %016lx rcx: %016lx\n", rax, rbx, rcx);
+ dprintf(debug_fd, " rdx: %016lx rsi: %016lx rdi: %016lx\n", rdx, rsi, rdi);
+ dprintf(debug_fd, " rbp: %016lx r8: %016lx r9: %016lx\n", rbp, r8, r9);
+ dprintf(debug_fd, " r10: %016lx r11: %016lx r12: %016lx\n", r10, r11, r12);
+ dprintf(debug_fd, " r13: %016lx r14: %016lx r15: %016lx\n", r13, r14, r15);
+
+ if (ioctl(vcpu->vcpu_fd, KVM_GET_SREGS, &sregs) < 0)
+ die("KVM_GET_SREGS failed");
+
+ cr0 = sregs.cr0; cr2 = sregs.cr2; cr3 = sregs.cr3;
+ cr4 = sregs.cr4; cr8 = sregs.cr8;
+
+ dprintf(debug_fd, " cr0: %016lx cr2: %016lx cr3: %016lx\n", cr0, cr2, cr3);
+ dprintf(debug_fd, " cr4: %016lx cr8: %016lx\n", cr4, cr8);
+ dprintf(debug_fd, "\n Segment registers:\n");
+ dprintf(debug_fd, " ------------------\n");
+ dprintf(debug_fd, " register selector base limit type p dpl db s l g avl\n");
+ print_segment("cs ", &sregs.cs);
+ print_segment("ss ", &sregs.ss);
+ print_segment("ds ", &sregs.ds);
+ print_segment("es ", &sregs.es);
+ print_segment("fs ", &sregs.fs);
+ print_segment("gs ", &sregs.gs);
+ print_segment("tr ", &sregs.tr);
+ print_segment("ldt", &sregs.ldt);
+ print_dtable("gdt", &sregs.gdt);
+ print_dtable("idt", &sregs.idt);
+
+ dprintf(debug_fd, "\n APIC:\n");
+ dprintf(debug_fd, " -----\n");
+ dprintf(debug_fd, " efer: %016llx apic base: %016llx nmi: %s\n",
+ (u64) sregs.efer, (u64) sregs.apic_base,
+ (vcpu->kvm->nmi_disabled ? "disabled" : "enabled"));
+
+ dprintf(debug_fd, "\n Interrupt bitmap:\n");
+ dprintf(debug_fd, " -----------------\n");
+ for (i = 0; i < (KVM_NR_INTERRUPTS + 63) / 64; i++)
+ dprintf(debug_fd, " %016llx", (u64) sregs.interrupt_bitmap[i]);
+ dprintf(debug_fd, "\n");
+}
+
+#define MAX_SYM_LEN 128
+
+void kvm_cpu__show_code(struct kvm_cpu *vcpu)
+{
+ unsigned int code_bytes = 64;
+ unsigned int code_prologue = 43;
+ unsigned int code_len = code_bytes;
+ char sym[MAX_SYM_LEN] = SYMBOL_DEFAULT_UNKNOWN, *psym;
+ unsigned char c;
+ unsigned int i;
+ u8 *ip;
+
+ if (ioctl(vcpu->vcpu_fd, KVM_GET_REGS, &vcpu->regs) < 0)
+ die("KVM_GET_REGS failed");
+
+ if (ioctl(vcpu->vcpu_fd, KVM_GET_SREGS, &vcpu->sregs) < 0)
+ die("KVM_GET_SREGS failed");
+
+ ip = guest_flat_to_host(vcpu->kvm, ip_to_flat(vcpu, vcpu->regs.rip) - code_prologue);
+
+ dprintf(debug_fd, "\n Code:\n");
+ dprintf(debug_fd, " -----\n");
+
+ psym = symbol_lookup(vcpu->kvm, vcpu->regs.rip, sym, MAX_SYM_LEN);
+ if (IS_ERR(psym))
+ dprintf(debug_fd,
+ "Warning: symbol_lookup() failed to find symbol "
+ "with error: %ld\n", PTR_ERR(psym));
+
+ dprintf(debug_fd, " rip: [<%016lx>] %s\n\n", (unsigned long) vcpu->regs.rip, sym);
+
+ for (i = 0; i < code_len; i++, ip++) {
+ if (!host_ptr_in_ram(vcpu->kvm, ip))
+ break;
+
+ c = *ip;
+
+ if (ip == guest_flat_to_host(vcpu->kvm, ip_to_flat(vcpu, vcpu->regs.rip)))
+ dprintf(debug_fd, " <%02x>", c);
+ else
+ dprintf(debug_fd, " %02x", c);
+ }
+
+ dprintf(debug_fd, "\n");
+
+ dprintf(debug_fd, "\n Stack:\n");
+ dprintf(debug_fd, " ------\n");
+ kvm__dump_mem(vcpu->kvm, vcpu->regs.rsp, 32);
+}
+
+void kvm_cpu__show_page_tables(struct kvm_cpu *vcpu)
+{
+ u64 *pte1;
+ u64 *pte2;
+ u64 *pte3;
+ u64 *pte4;
+
+ if (!is_in_protected_mode(vcpu))
+ return;
+
+ if (ioctl(vcpu->vcpu_fd, KVM_GET_SREGS, &vcpu->sregs) < 0)
+ die("KVM_GET_SREGS failed");
+
+ pte4 = guest_flat_to_host(vcpu->kvm, vcpu->sregs.cr3);
+ if (!host_ptr_in_ram(vcpu->kvm, pte4))
+ return;
+
+ pte3 = guest_flat_to_host(vcpu->kvm, (*pte4 & ~0xfff));
+ if (!host_ptr_in_ram(vcpu->kvm, pte3))
+ return;
+
+ pte2 = guest_flat_to_host(vcpu->kvm, (*pte3 & ~0xfff));
+ if (!host_ptr_in_ram(vcpu->kvm, pte2))
+ return;
+
+ pte1 = guest_flat_to_host(vcpu->kvm, (*pte2 & ~0xfff));
+ if (!host_ptr_in_ram(vcpu->kvm, pte1))
+ return;
+
+ dprintf(debug_fd, "Page Tables:\n");
+ if (*pte2 & (1 << 7))
+ dprintf(debug_fd, " pte4: %016llx pte3: %016llx"
+ " pte2: %016llx\n",
+ *pte4, *pte3, *pte2);
+ else
+ dprintf(debug_fd, " pte4: %016llx pte3: %016llx pte2: %016"
+ "llx pte1: %016llx\n",
+ *pte4, *pte3, *pte2, *pte1);
+}
+
+void kvm_cpu__arch_nmi(struct kvm_cpu *cpu)
+{
+ struct kvm_lapic_state klapic;
+ struct local_apic *lapic = (void *)&klapic;
+
+ if (ioctl(cpu->vcpu_fd, KVM_GET_LAPIC, &klapic) != 0)
+ return;
+
+ if (lapic->lvt_lint1.mask)
+ return;
+
+ if (lapic->lvt_lint1.delivery_mode != APIC_MODE_NMI)
+ return;
+
+ ioctl(cpu->vcpu_fd, KVM_NMI);
+}
--- /dev/null
+#include "kvm/kvm.h"
+#include "kvm/boot-protocol.h"
+#include "kvm/cpufeature.h"
+#include "kvm/interrupt.h"
+#include "kvm/mptable.h"
+#include "kvm/util.h"
+#include "kvm/8250-serial.h"
+#include "kvm/virtio-console.h"
+
+#include <asm/bootparam.h>
+#include <linux/kvm.h>
+
+#include <sys/types.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <fcntl.h>
+
+struct kvm_ext kvm_req_ext[] = {
+ { DEFINE_KVM_EXT(KVM_CAP_COALESCED_MMIO) },
+ { DEFINE_KVM_EXT(KVM_CAP_SET_TSS_ADDR) },
+ { DEFINE_KVM_EXT(KVM_CAP_PIT2) },
+ { DEFINE_KVM_EXT(KVM_CAP_USER_MEMORY) },
+ { DEFINE_KVM_EXT(KVM_CAP_IRQ_ROUTING) },
+ { DEFINE_KVM_EXT(KVM_CAP_IRQCHIP) },
+ { DEFINE_KVM_EXT(KVM_CAP_HLT) },
+ { DEFINE_KVM_EXT(KVM_CAP_IRQ_INJECT_STATUS) },
+ { DEFINE_KVM_EXT(KVM_CAP_EXT_CPUID) },
+ { 0, 0 }
+};
+
+bool kvm__arch_cpu_supports_vm(void)
+{
+ struct cpuid_regs regs;
+ u32 eax_base;
+ int feature;
+
+ regs = (struct cpuid_regs) {
+ .eax = 0x00,
+ };
+ host_cpuid(&regs);
+
+ switch (regs.ebx) {
+ case CPUID_VENDOR_INTEL_1:
+ eax_base = 0x00;
+ feature = KVM__X86_FEATURE_VMX;
+ break;
+
+ case CPUID_VENDOR_AMD_1:
+ eax_base = 0x80000000;
+ feature = KVM__X86_FEATURE_SVM;
+ break;
+
+ default:
+ return false;
+ }
+
+ regs = (struct cpuid_regs) {
+ .eax = eax_base,
+ };
+ host_cpuid(&regs);
+
+ if (regs.eax < eax_base + 0x01)
+ return false;
+
+ regs = (struct cpuid_regs) {
+ .eax = eax_base + 0x01
+ };
+ host_cpuid(&regs);
+
+ return regs.ecx & (1 << feature);
+}
+
+/*
+ * Allocating RAM size bigger than 4GB requires us to leave a gap
+ * in the RAM which is used for PCI MMIO, hotplug, and unconfigured
+ * devices (see documentation of e820_setup_gap() for details).
+ *
+ * If we're required to initialize RAM bigger than 4GB, we will create
+ * a gap between 0xe0000000 and 0x100000000 in the guest virtual mem space.
+ */
+
+void kvm__init_ram(struct kvm *kvm)
+{
+ u64 phys_start, phys_size;
+ void *host_mem;
+
+ if (kvm->ram_size < KVM_32BIT_GAP_START) {
+ /* Use a single block of RAM for 32bit RAM */
+
+ phys_start = 0;
+ phys_size = kvm->ram_size;
+ host_mem = kvm->ram_start;
+
+ kvm__register_mem(kvm, phys_start, phys_size, host_mem);
+ } else {
+ /* First RAM range from zero to the PCI gap: */
+
+ phys_start = 0;
+ phys_size = KVM_32BIT_GAP_START;
+ host_mem = kvm->ram_start;
+
+ kvm__register_mem(kvm, phys_start, phys_size, host_mem);
+
+ /* Second RAM range from 4GB to the end of RAM: */
+
+ phys_start = KVM_32BIT_MAX_MEM_SIZE;
+ phys_size = kvm->ram_size - phys_start;
+ host_mem = kvm->ram_start + phys_start;
+
+ kvm__register_mem(kvm, phys_start, phys_size, host_mem);
+ }
+}
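The split performed by kvm__init_ram() can be illustrated with a small standalone sketch. The gap boundaries below are assumptions taken from the comment above (0xe0000000 to 0x100000000); the real KVM_32BIT_GAP_START and KVM_32BIT_MAX_MEM_SIZE are defined elsewhere in the tree and may differ. `split_ram` and `struct ram_range` are hypothetical names, and the function takes the RAM size as requested by the user, that is, before kvm__arch_init() adds the gap size to kvm->ram_size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values, taken from the comment above; not the tree's definitions. */
#define GAP_START	0xe0000000ULL
#define GAP_END		0x100000000ULL

struct ram_range {
	uint64_t guest_phys;	/* guest physical start address */
	uint64_t size;		/* length of the range in bytes */
	uint64_t host_offset;	/* offset into the host mapping */
};

/*
 * Fill 'out' with the guest RAM ranges for 'requested' bytes of RAM
 * and return how many ranges were written (1 or 2).
 */
static int split_ram(uint64_t requested, struct ram_range out[2])
{
	assert(out != NULL);

	if (requested < GAP_START) {
		/* Everything fits below the gap: one contiguous range. */
		out[0] = (struct ram_range) { 0, requested, 0 };
		return 1;
	}

	/* Low range: everything below the gap. */
	out[0] = (struct ram_range) { 0, GAP_START, 0 };
	/* High range: the remainder, placed at 4GB and above. */
	out[1] = (struct ram_range) { GAP_END, requested - GAP_START, GAP_END };
	return 2;
}
```

With these assumed boundaries, a 5GB guest gets 0xe0000000 bytes below the gap and the remaining 0x60000000 bytes starting at 4GB. The host offset of the high range equals its guest address because the host mapping also contains the (PROT_NONE) gap.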
+
+/* Arch-specific commandline setup */
+void kvm__arch_set_cmdline(char *cmdline, bool video)
+{
+ strcpy(cmdline, "noapic noacpi pci=conf1 reboot=k panic=1 i8042.direct=1 "
+ "i8042.dumbkbd=1 i8042.nopnp=1");
+ if (video)
+ strcat(cmdline, " video=vesafb console=tty0");
+ else
+ strcat(cmdline, " console=ttyS0 earlyprintk=serial i8042.noaux=1");
+}
+
+/* Architecture-specific KVM init */
+void kvm__arch_init(struct kvm *kvm, const char *hugetlbfs_path, u64 ram_size)
+{
+ struct kvm_pit_config pit_config = { .flags = 0, };
+ int ret;
+
+ ret = ioctl(kvm->vm_fd, KVM_SET_TSS_ADDR, 0xfffbd000);
+ if (ret < 0)
+ die_perror("KVM_SET_TSS_ADDR ioctl");
+
+ ret = ioctl(kvm->vm_fd, KVM_CREATE_PIT2, &pit_config);
+ if (ret < 0)
+ die_perror("KVM_CREATE_PIT2 ioctl");
+
+ if (ram_size < KVM_32BIT_GAP_START) {
+ kvm->ram_size = ram_size;
+ kvm->ram_start = mmap_anon_or_hugetlbfs(kvm, hugetlbfs_path, ram_size);
+ } else {
+ kvm->ram_start = mmap_anon_or_hugetlbfs(kvm, hugetlbfs_path, ram_size + KVM_32BIT_GAP_SIZE);
+ kvm->ram_size = ram_size + KVM_32BIT_GAP_SIZE;
+ if (kvm->ram_start != MAP_FAILED)
+ /*
+ * We mprotect the gap (see kvm__init_ram() for details) PROT_NONE so that
+ * if we accidentally write to it, we will know.
+ */
+ mprotect(kvm->ram_start + KVM_32BIT_GAP_START, KVM_32BIT_GAP_SIZE, PROT_NONE);
+ }
+ if (kvm->ram_start == MAP_FAILED)
+ die("out of memory");
+
+ madvise(kvm->ram_start, kvm->ram_size, MADV_MERGEABLE);
+
+ ret = ioctl(kvm->vm_fd, KVM_CREATE_IRQCHIP);
+ if (ret < 0)
+ die_perror("KVM_CREATE_IRQCHIP ioctl");
+}
+
+void kvm__arch_delete_ram(struct kvm *kvm)
+{
+ munmap(kvm->ram_start, kvm->ram_size);
+}
+
+void kvm__irq_line(struct kvm *kvm, int irq, int level)
+{
+ struct kvm_irq_level irq_level;
+
+ irq_level = (struct kvm_irq_level) {
+ {
+ .irq = irq,
+ },
+ .level = level,
+ };
+
+ if (ioctl(kvm->vm_fd, KVM_IRQ_LINE, &irq_level) < 0)
+ die_perror("KVM_IRQ_LINE failed");
+}
+
+void kvm__irq_trigger(struct kvm *kvm, int irq)
+{
+ kvm__irq_line(kvm, irq, 1);
+ kvm__irq_line(kvm, irq, 0);
+}
+
+#define BOOT_LOADER_SELECTOR 0x1000
+#define BOOT_LOADER_IP 0x0000
+#define BOOT_LOADER_SP 0x8000
+#define BOOT_CMDLINE_OFFSET 0x20000
+
+#define BOOT_PROTOCOL_REQUIRED 0x206
+#define LOAD_HIGH 0x01
+
+int load_flat_binary(struct kvm *kvm, int fd_kernel, int fd_initrd, const char *kernel_cmdline)
+{
+ void *p;
+ int nr;
+
+ /*
+ * Some architectures may support loading an initrd alongside the flat kernel,
+ * but we do not.
+ */
+ if (fd_initrd != -1)
+ pr_warning("Loading initrd with flat binary not supported.");
+
+ if (lseek(fd_kernel, 0, SEEK_SET) < 0)
+ die_perror("lseek");
+
+ p = guest_real_to_host(kvm, BOOT_LOADER_SELECTOR, BOOT_LOADER_IP);
+
+ while ((nr = read(fd_kernel, p, 65536)) > 0)
+ p += nr;
+
+ kvm->arch.boot_selector = BOOT_LOADER_SELECTOR;
+ kvm->arch.boot_ip = BOOT_LOADER_IP;
+ kvm->arch.boot_sp = BOOT_LOADER_SP;
+
+ return true;
+}
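The boot selector, IP and SP stored above are real-mode values. As a minimal sketch of how they map to flat guest addresses, assuming the usual real-mode rule that a segment base is the selector shifted left by four bits: `real_to_flat` is a hypothetical helper written for this illustration and is not the patch's guest_real_to_host(), which additionally returns a host pointer.

```c
#include <assert.h>
#include <stdint.h>

/* Same values as the BOOT_LOADER_* constants defined above. */
#define SELECTOR	0x1000
#define IP		0x0000
#define SP		0x8000

/*
 * Real-mode address translation: the segment base is the selector
 * shifted left by four bits, and the offset is added to it.
 */
static uint32_t real_to_flat(uint16_t selector, uint16_t offset)
{
	uint32_t flat = ((uint32_t)selector << 4) + offset;

	/* A 16-bit selector and offset can never reach past 0x10ffef. */
	assert(flat <= 0x10ffefU);
	return flat;
}
```

Under that rule, real_to_flat(SELECTOR, IP) is 0x10000, which is where the flat binary is copied, and real_to_flat(SELECTOR, SP) is 0x18000 for the initial stack pointer.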
+
+static const char *BZIMAGE_MAGIC = "HdrS";
+
+bool load_bzimage(struct kvm *kvm, int fd_kernel,
+ int fd_initrd, const char *kernel_cmdline, u16 vidmode)
+{
+ struct boot_params *kern_boot;
+ unsigned long setup_sects;
+ struct boot_params boot;
+ size_t cmdline_size;
+ ssize_t setup_size;
+ void *p;
+ int nr;
+
+ /*
+ * See Documentation/x86/boot.txt for details on the bzImage on-disk and
+ * memory layout.
+ */
+
+ if (lseek(fd_kernel, 0, SEEK_SET) < 0)
+ die_perror("lseek");
+
+ if (read(fd_kernel, &boot, sizeof(boot)) != sizeof(boot))
+ return false;
+
+ if (memcmp(&boot.hdr.header, BZIMAGE_MAGIC, strlen(BZIMAGE_MAGIC)))
+ return false;
+
+ if (boot.hdr.version < BOOT_PROTOCOL_REQUIRED)
+ die("Too old kernel");
+
+ if (lseek(fd_kernel, 0, SEEK_SET) < 0)
+ die_perror("lseek");
+
+ if (!boot.hdr.setup_sects)
+ boot.hdr.setup_sects = BZ_DEFAULT_SETUP_SECTS;
+ setup_sects = boot.hdr.setup_sects + 1;
+
+ setup_size = setup_sects << 9;
+ p = guest_real_to_host(kvm, BOOT_LOADER_SELECTOR, BOOT_LOADER_IP);
+
+ /* copy setup.bin to mem*/
+ if (read(fd_kernel, p, setup_size) != setup_size)
+ die_perror("read");
+
+ /* copy vmlinux.bin to BZ_KERNEL_START*/
+ p = guest_flat_to_host(kvm, BZ_KERNEL_START);
+
+ while ((nr = read(fd_kernel, p, 65536)) > 0)
+ p += nr;
+
+ p = guest_flat_to_host(kvm, BOOT_CMDLINE_OFFSET);
+ if (kernel_cmdline) {
+ cmdline_size = strlen(kernel_cmdline) + 1;
+ if (cmdline_size > boot.hdr.cmdline_size)
+ cmdline_size = boot.hdr.cmdline_size;
+
+ memset(p, 0, boot.hdr.cmdline_size);
+ memcpy(p, kernel_cmdline, cmdline_size - 1);
+ }
+
+ kern_boot = guest_real_to_host(kvm, BOOT_LOADER_SELECTOR, 0x00);
+
+ kern_boot->hdr.cmd_line_ptr = BOOT_CMDLINE_OFFSET;
+ kern_boot->hdr.type_of_loader = 0xff;
+ kern_boot->hdr.heap_end_ptr = 0xfe00;
+ kern_boot->hdr.loadflags |= CAN_USE_HEAP;
+ kern_boot->hdr.vid_mode = vidmode;
+
+ /*
+ * Read initrd image into guest memory
+ */
+ if (fd_initrd >= 0) {
+ struct stat initrd_stat;
+ unsigned long addr;
+
+ if (fstat(fd_initrd, &initrd_stat))
+ die_perror("fstat");
+
+ addr = boot.hdr.initrd_addr_max & ~0xfffff;
+ for (;;) {
+ if (addr < BZ_KERNEL_START)
+ die("Not enough memory for initrd");
+ else if (addr < (kvm->ram_size - initrd_stat.st_size))
+ break;
+ addr -= 0x100000;
+ }
+
+ p = guest_flat_to_host(kvm, addr);
+ nr = read(fd_initrd, p, initrd_stat.st_size);
+ if (nr != initrd_stat.st_size)
+ die("Failed to read initrd");
+
+ kern_boot->hdr.ramdisk_image = addr;
+ kern_boot->hdr.ramdisk_size = initrd_stat.st_size;
+ }
+
+ kvm->arch.boot_selector = BOOT_LOADER_SELECTOR;
+ /*
+ * The real-mode setup code starts at offset 0x200 of a bzImage. See
+ * Documentation/x86/boot.txt for details.
+ */
+ kvm->arch.boot_ip = BOOT_LOADER_IP + 0x200;
+ kvm->arch.boot_sp = BOOT_LOADER_SP;
+
+ return true;
+}
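Two pieces of arithmetic in load_bzimage() are easy to check in isolation: the size of the real-mode setup part and the search for an initrd load address. The sketch below assumes BZ_DEFAULT_SETUP_SECTS is 4 and BZ_KERNEL_START is 0x100000 (neither definition is shown in this patch excerpt); `setup_size` and `place_initrd` are hypothetical helpers. Unlike the patch, which calls die(), the sketch returns 0 when no address fits.

```c
#include <assert.h>
#include <stdint.h>

#define DEFAULT_SETUP_SECTS	4		/* assumed BZ_DEFAULT_SETUP_SECTS */
#define KERNEL_START		0x100000ULL	/* assumed BZ_KERNEL_START */
#define MB			0x100000ULL

/* Size in bytes of the boot sector plus the real-mode setup code. */
static uint32_t setup_size(uint8_t setup_sects)
{
	if (!setup_sects)
		setup_sects = DEFAULT_SETUP_SECTS;
	return ((uint32_t)setup_sects + 1) << 9;	/* 512-byte sectors */
}

/*
 * Pick a load address for the initrd: start at the highest
 * megabyte-aligned address the kernel allows and step down 1MB at a
 * time until the image ends below the end of RAM. Returns 0 if no
 * address at or above KERNEL_START works.
 */
static uint64_t place_initrd(uint64_t addr_max, uint64_t ram_size,
			     uint64_t initrd_size)
{
	uint64_t addr = addr_max & ~(MB - 1);

	if (initrd_size >= ram_size)
		return 0;

	while (addr >= KERNEL_START) {
		if (addr < ram_size - initrd_size) {
			assert(addr + initrd_size < ram_size);
			return addr;
		}
		addr -= MB;
	}
	return 0;
}
```

For example, with an address limit of 0x7fffffff, 64MB of RAM and a 4MB initrd, the search settles on 0x3b00000, the highest megabyte boundary strictly below ram_size - initrd_size.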
+
+/**
+ * kvm__arch_setup_firmware - inject BIOS into guest system memory
+ * @kvm - guest system descriptor
+ *
+ * This function is a main routine where we poke guest memory
+ * and install BIOS there.
+ */
+int kvm__arch_setup_firmware(struct kvm *kvm)
+{
+ /* standard minimal configuration */
+ setup_bios(kvm);
+
+ /* FIXME: SMP, ACPI and friends here */
+
+ return 0;
+}
+
+int kvm__arch_free_firmware(struct kvm *kvm)
+{
+ return 0;
+}
+
+void kvm__arch_periodic_poll(struct kvm *kvm)
+{
+ serial8250__update_consoles(kvm);
+ virtio_console__inject_interrupt(kvm);
+}
--- /dev/null
+#include "kvm/kvm.h"
+#include "kvm/bios.h"
+#include "kvm/apic.h"
+#include "kvm/mptable.h"
+#include "kvm/util.h"
+#include "kvm/irq.h"
+
+#include <linux/kernel.h>
+#include <string.h>
+
+#include <asm/mpspec_def.h>
+#include <linux/types.h>
+
+/*
+ * FIXME: please make sure the addresses borrowed
+ * for apic/ioapic never overlap! We need a global
+ * tracker of system resources (including io, mmio,
+ * and friends).
+ */
+
+static unsigned int mpf_checksum(unsigned char *mp, int len)
+{
+ unsigned int sum = 0;
+
+ while (len--)
+ sum += *mp++;
+
+ return sum & 0xFF;
+}
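mpf_checksum() is later used in the form "checksum = -mpf_checksum(...)", which makes all bytes of the structure add up to zero modulo 256. A minimal sketch of that property follows; `byte_sum` and `seal_checksum` are hypothetical names, and the sketch clears the checksum byte explicitly where the patch relies on calloc() having zeroed it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of all bytes, truncated to eight bits. */
static uint8_t byte_sum(const uint8_t *buf, size_t len)
{
	unsigned int sum = 0;

	while (len--)
		sum += *buf++;
	return sum & 0xff;
}

/*
 * Store the checksum at buf[pos] so that the bytes of the whole
 * structure add up to zero modulo 256.
 */
static void seal_checksum(uint8_t *buf, size_t len, size_t pos)
{
	assert(pos < len);
	buf[pos] = 0;				/* the field must not count itself */
	buf[pos] = (uint8_t)-byte_sum(buf, len);
}
```

A consumer of the table can then validate it by checking that byte_sum() over the full length returns 0.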
+
+static unsigned int gen_cpu_flag(unsigned int cpu, unsigned int ncpu)
+{
+ /* sets enabled/disabled | BSP/AP processor */
+ return ( (cpu < ncpu) ? CPU_ENABLED : 0) |
+ ((cpu == 0) ? CPU_BOOTPROCESSOR : 0x00);
+}
+
+#define MPTABLE_SIG_FLOATING "_MP_"
+#define MPTABLE_OEM "KVMCPU00"
+#define MPTABLE_PRODUCTID "0.1 "
+#define MPTABLE_PCIBUSTYPE "PCI "
+#define MPTABLE_ISABUSTYPE "ISA "
+
+#define MPTABLE_STRNCPY(d, s) memcpy(d, s, sizeof(d))
+
+/* It should be more than enough */
+#define MPTABLE_MAX_SIZE (32 << 20)
+
+/*
+ * Too many cpus will require x2apic mode
+ * and ACPI support, so we limit the count
+ * here for now.
+ */
+#define MPTABLE_MAX_CPUS 255
+
+static void mptable_add_irq_src(struct mpc_intsrc *mpc_intsrc,
+ u16 srcbusid, u16 srcbusirq,
+ u16 dstapic, u16 dstirq)
+{
+ *mpc_intsrc = (struct mpc_intsrc) {
+ .type = MP_INTSRC,
+ .irqtype = mp_INT,
+ .irqflag = MP_IRQDIR_DEFAULT,
+ .srcbus = srcbusid,
+ .srcbusirq = srcbusirq,
+ .dstapic = dstapic,
+ .dstirq = dstirq
+ };
+}
+
+/**
+ * mptable__init - create mptable and fill guest memory with it
+ */
+int mptable__init(struct kvm *kvm)
+{
+ unsigned long real_mpc_table, real_mpf_intel, size;
+ struct mpf_intel *mpf_intel;
+ struct mpc_table *mpc_table;
+ struct mpc_cpu *mpc_cpu;
+ struct mpc_bus *mpc_bus;
+ struct mpc_ioapic *mpc_ioapic;
+ struct mpc_intsrc *mpc_intsrc;
+ struct rb_node *pci_tree;
+
+ const int pcibusid = 0;
+ const int isabusid = 1;
+
+ unsigned int i, nentries = 0, ncpus = kvm->nrcpus;
+ unsigned int ioapicid;
+ void *last_addr;
+
+ /* That is where MP table will be in guest memory */
+ real_mpc_table = ALIGN(MB_BIOS_BEGIN + bios_rom_size, 16);
+
+ if (ncpus > MPTABLE_MAX_CPUS) {
+ pr_warning("Too many cpus: %d limited to %d",
+ ncpus, MPTABLE_MAX_CPUS);
+ ncpus = MPTABLE_MAX_CPUS;
+ }
+
+ mpc_table = calloc(1, MPTABLE_MAX_SIZE);
+ if (!mpc_table)
+ return -ENOMEM;
+
+ MPTABLE_STRNCPY(mpc_table->signature, MPC_SIGNATURE);
+ MPTABLE_STRNCPY(mpc_table->oem, MPTABLE_OEM);
+ MPTABLE_STRNCPY(mpc_table->productid, MPTABLE_PRODUCTID);
+
+ mpc_table->spec = 4;
+ mpc_table->lapic = APIC_ADDR(0);
+ mpc_table->oemcount = ncpus; /* will be updated again at end */
+
+ /*
+ * CPUs enumeration. Technically speaking we should
+ * ask either host or HV for apic version supported
+ * but for a while we simply put some random value
+ * here.
+ */
+ mpc_cpu = (void *)&mpc_table[1];
+ for (i = 0; i < ncpus; i++) {
+ mpc_cpu->type = MP_PROCESSOR;
+ mpc_cpu->apicid = i;
+ mpc_cpu->apicver = KVM_APIC_VERSION;
+ mpc_cpu->cpuflag = gen_cpu_flag(i, ncpus);
+ mpc_cpu->cpufeature = 0x600; /* some default value */
+ mpc_cpu->featureflag = 0x201; /* some default value */
+ mpc_cpu++;
+ }
+
+ last_addr = (void *)mpc_cpu;
+ nentries += ncpus;
+
+ /*
+ * PCI buses.
+ * FIXME: Some callback here to obtain real number
+ * of PCI buses present in system.
+ */
+ mpc_bus = last_addr;
+ mpc_bus->type = MP_BUS;
+ mpc_bus->busid = pcibusid;
+ MPTABLE_STRNCPY(mpc_bus->bustype, MPTABLE_PCIBUSTYPE);
+
+ last_addr = (void *)&mpc_bus[1];
+ nentries++;
+
+ /*
+ * ISA bus.
+ * FIXME: Same issue as for PCI bus.
+ */
+ mpc_bus = last_addr;
+ mpc_bus->type = MP_BUS;
+ mpc_bus->busid = isabusid;
+ MPTABLE_STRNCPY(mpc_bus->bustype, MPTABLE_ISABUSTYPE);
+
+ last_addr = (void *)&mpc_bus[1];
+ nentries++;
+
+ /*
+ * IO-APIC chip.
+ */
+ ioapicid = ncpus + 1;
+ mpc_ioapic = last_addr;
+ mpc_ioapic->type = MP_IOAPIC;
+ mpc_ioapic->apicid = ioapicid;
+ mpc_ioapic->apicver = KVM_APIC_VERSION;
+ mpc_ioapic->flags = MPC_APIC_USABLE;
+ mpc_ioapic->apicaddr = IOAPIC_ADDR(0);
+
+ last_addr = (void *)&mpc_ioapic[1];
+ nentries++;
+
+ /*
+ * IRQ sources.
+ *
+ * FIXME: Same issue as with buses. We definitely
+ * need some kind of collector routine which enumerates
+ * the resources in use first and passes them here.
+ * At the moment we know we have only virtio block device
+ * and virtio console but this is g00berfish.
+ *
+ * Also note we use PCI irqs here, none for the ISA bus yet.
+ */
+
+ for (pci_tree = irq__get_pci_tree(); pci_tree; pci_tree = rb_next(pci_tree)) {
+ struct pci_dev *dev = rb_entry(pci_tree, struct pci_dev, node);
+ struct irq_line *irq_line;
+
+ list_for_each_entry(irq_line, &dev->lines, node) {
+ unsigned char srcbusirq;
+
+ srcbusirq = (dev->id << 2) | (dev->pin - 1);
+
+ mpc_intsrc = last_addr;
+
+ mptable_add_irq_src(mpc_intsrc, pcibusid, srcbusirq, ioapicid, irq_line->line);
+ last_addr = (void *)&mpc_intsrc[1];
+ nentries++;
+ }
+ }
+
+ /*
+ * Local IRQs assignment (LINT0, LINT1)
+ */
+ mpc_intsrc = last_addr;
+ mpc_intsrc->type = MP_LINTSRC;
+ mpc_intsrc->irqtype = mp_ExtINT;
+ mpc_intsrc->irqtype = mp_INT;
+ mpc_intsrc->irqflag = MP_IRQDIR_DEFAULT;
+ mpc_intsrc->srcbus = isabusid;
+ mpc_intsrc->srcbusirq = 0;
+ mpc_intsrc->dstapic = 0; /* FIXME: BSP apic */
+ mpc_intsrc->dstirq = 0; /* LINT0 */
+
+ last_addr = (void *)&mpc_intsrc[1];
+ nentries++;
+
+ mpc_intsrc = last_addr;
+ mpc_intsrc->type = MP_LINTSRC;
+ mpc_intsrc->irqtype = mp_NMI;
+ mpc_intsrc->irqflag = MP_IRQDIR_DEFAULT;
+ mpc_intsrc->srcbus = isabusid;
+ mpc_intsrc->srcbusirq = 0;
+ mpc_intsrc->dstapic = 0; /* FIXME: BSP apic */
+ mpc_intsrc->dstirq = 1; /* LINT1 */
+
+ last_addr = (void *)&mpc_intsrc[1];
+ nentries++;
+
+ /*
+ * Floating MP table finally.
+ */
+ real_mpf_intel = ALIGN((unsigned long)last_addr - (unsigned long)mpc_table, 16);
+ mpf_intel = (void *)((unsigned long)mpc_table + real_mpf_intel);
+
+ MPTABLE_STRNCPY(mpf_intel->signature, MPTABLE_SIG_FLOATING);
+ mpf_intel->length = 1;
+ mpf_intel->specification= 4;
+ mpf_intel->physptr = (unsigned int)real_mpc_table;
+ mpf_intel->checksum = -mpf_checksum((unsigned char *)mpf_intel, sizeof(*mpf_intel));
+
+ /*
+ * No last_addr increment here please, we need last
+ * active position here to compute table size.
+ */
+
+ /*
+ * Don't forget to update header in fixed table.
+ */
+ mpc_table->oemcount = nentries;
+ mpc_table->length = last_addr - (void *)mpc_table;
+ mpc_table->checksum = -mpf_checksum((unsigned char *)mpc_table, mpc_table->length);
+
+
+ /*
+ * We will copy the whole table, no need to separate
+ * floating structure and table itself.
+ */
+ size = (unsigned long)mpf_intel + sizeof(*mpf_intel) - (unsigned long)mpc_table;
+
+ /*
+ * The final check -- never get out of system bios
+ * area. Let's also check for allocated memory overrun;
+ * in reality it's late but still useful.
+ */
+
+ if (size > (unsigned long)(MB_BIOS_END - bios_rom_size) ||
+ size > MPTABLE_MAX_SIZE) {
+ free(mpc_table);
+ pr_err("MP table is too big");
+
+ return -E2BIG;
+ }
+
+ /*
+ * OK, it is time to move it to guest memory.
+ */
+ memcpy(guest_flat_to_host(kvm, real_mpc_table), mpc_table, size);
+
+ free(mpc_table);
+
+ return 0;
+}
+firmware_init(mptable__init);
+
+int mptable__exit(struct kvm *kvm)
+{
+ return 0;
+}
+firmware_exit(mptable__exit);
--- /dev/null
+How to compile perf for Android
+=========================================
+
+I. Set the Android NDK environment
+------------------------------------------------
+
+(a). Use the Android NDK
+------------------------------------------------
+1. You need to download and install the Android Native Development Kit (NDK).
+Set the NDK variable to point to the path where you installed the NDK:
+ export NDK=/path/to/android-ndk
+
+2. Set cross-compiling environment variables for NDK toolchain and sysroot.
+For arm:
+ export NDK_TOOLCHAIN=${NDK}/toolchains/arm-linux-androideabi-4.6/prebuilt/linux-x86/bin/arm-linux-androideabi-
+ export NDK_SYSROOT=${NDK}/platforms/android-9/arch-arm
+For x86:
+ export NDK_TOOLCHAIN=${NDK}/toolchains/x86-4.6/prebuilt/linux-x86/bin/i686-linux-android-
+ export NDK_SYSROOT=${NDK}/platforms/android-9/arch-x86
+
+This method does not work for Android NDK versions up to Revision 8b.
+perf uses some bionic enhancements that are not included in these NDK versions.
+You can use method (b) described below instead.
+
+(b). Use the Android source tree
+-----------------------------------------------
+1. Download the master branch of the Android source tree.
+Set the environment for the target you want using:
+ source build/envsetup.sh
+ lunch
+
+2. Build your own NDK sysroot to contain latest bionic changes and set the
+NDK sysroot environment variable.
+ cd ${ANDROID_BUILD_TOP}/ndk
+For arm:
+ ./build/tools/build-ndk-sysroot.sh --abi=arm
+ export NDK_SYSROOT=${ANDROID_BUILD_TOP}/ndk/build/platforms/android-3/arch-arm
+For x86:
+ ./build/tools/build-ndk-sysroot.sh --abi=x86
+ export NDK_SYSROOT=${ANDROID_BUILD_TOP}/ndk/build/platforms/android-3/arch-x86
+
+3. Set the NDK toolchain environment variable.
+For arm:
+ export NDK_TOOLCHAIN=${ANDROID_TOOLCHAIN}/arm-linux-androideabi-
+For x86:
+ export NDK_TOOLCHAIN=${ANDROID_TOOLCHAIN}/i686-linux-android-
+
+II. Compile perf for Android
+------------------------------------------------
+You need to run make with the NDK toolchain and sysroot defined above:
+ make CROSS_COMPILE=${NDK_TOOLCHAIN} CFLAGS="--sysroot=${NDK_SYSROOT}"
+
+III. Install perf
+-----------------------------------------------
+You need to connect to your Android device/emulator using adb.
+Install perf using:
+ adb push perf /data/perf
+
+If you also want to use perf-archive you need busybox tools for Android.
+For installing perf-archive, you first need to replace #!/bin/bash with #!/system/bin/sh:
+ sed 's/#!\/bin\/bash/#!\/system\/bin\/sh/g' perf-archive > /tmp/perf-archive
+ chmod +x /tmp/perf-archive
+ adb push /tmp/perf-archive /data/perf-archive
+
+IV. Environment settings for running perf
+------------------------------------------------
+Some perf features need environment variables to run properly.
+You need to set these before running perf on the target:
+ adb shell
+ # PERF_PAGER=cat
+
+V. Run perf
+------------------------------------------------
+Run perf on your device/emulator to which you previously connected using adb:
+ # ./data/perf
--symfs=<directory>::
Look for files with symbols relative to this directory.
+-b::
+--baseline-only::
+ Show only items with match in baseline.
+
+-c::
+--compute::
+ Differential computation selection - delta,ratio,wdiff (default is delta).
+ If '+' is specified as a first character, the output is sorted based
+ on the computation results.
+ See COMPARISON METHODS section for more info.
+
+-p::
+--period::
+ Show period values for both compared hist entries.
+
+-F::
+--formula::
+ Show formula for given computation.
+
+COMPARISON METHODS
+------------------
+delta
+~~~~~
+If specified the 'Delta' column is displayed with value 'd' computed as:
+
+ d = A->period_percent - B->period_percent
+
+with:
+ - A/B being matching hist entry from first/second file specified
+ (or perf.data/perf.data.old) respectively.
+
+ - period_percent being the % of the hist entry period value within
+ single data file
+
+ratio
+~~~~~
+If specified the 'Ratio' column is displayed with value 'r' computed as:
+
+ r = A->period / B->period
+
+with:
+ - A/B being matching hist entry from first/second file specified
+ (or perf.data/perf.data.old) respectively.
+
+ - period being the hist entry period value
+
+wdiff
+~~~~~
+If specified the 'Weighted diff' column is displayed with value 'd' computed as:
+
+ d = B->period * WEIGHT-A - A->period * WEIGHT-B
+
+ - A/B being matching hist entry from first/second file specified
+ (or perf.data/perf.data.old) respectively.
+
+ - period being the hist entry period value
+
+ - WEIGHT-A/WEIGHT-B being user supplied weights in the '-c' option
+ behind ':' separator like '-c wdiff:1,2'.
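The three formulas above can be checked with a short standalone sketch. It follows the documented formulas literally, with A and B as plain period values; the function names are hypothetical and are not the perf_diff__compute_*() functions added by this patch, which work on struct hist_entry.

```c
#include <assert.h>
#include <stdint.h>

/* Share of a data file's total period taken by one hist entry, in percent. */
static double period_percent(uint64_t period, uint64_t total)
{
	assert(total != 0);
	return (period * 100.0) / total;
}

/* delta: d = A->period_percent - B->period_percent */
static double diff_delta(uint64_t a, uint64_t total_a,
			 uint64_t b, uint64_t total_b)
{
	return period_percent(a, total_a) - period_percent(b, total_b);
}

/* ratio: r = A->period / B->period */
static double diff_ratio(uint64_t a, uint64_t b)
{
	assert(b != 0);
	return (double)a / (double)b;
}

/* wdiff: d = B->period * WEIGHT-A - A->period * WEIGHT-B */
static int64_t diff_wdiff(uint64_t a, uint64_t b,
			  int64_t weight_a, int64_t weight_b)
{
	return (int64_t)b * weight_a - (int64_t)a * weight_b;
}
```

For an entry with a period of 300 out of 1000 in A and 200 out of 1000 in B, delta is 10 (30% minus 20%), ratio is 1.5 and, taking '-c wdiff:1,2' to mean WEIGHT-A = 1 and WEIGHT-B = 2, wdiff is 200 * 1 - 300 * 2 = -400.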
+
SEE ALSO
--------
linkperf:perf-record[1]
-include config/feature-tests.mak
-ifeq ($(call try-cc,$(SOURCE_HELLO),-Werror -fstack-protector-all),y)
+ifeq ($(call try-cc,$(SOURCE_HELLO),$(CFLAGS) -Werror -fstack-protector-all),y)
CFLAGS := $(CFLAGS) -fstack-protector-all
endif
-ifeq ($(call try-cc,$(SOURCE_HELLO),-Werror -Wstack-protector),y)
+ifeq ($(call try-cc,$(SOURCE_HELLO),$(CFLAGS) -Werror -Wstack-protector),y)
CFLAGS := $(CFLAGS) -Wstack-protector
endif
-ifeq ($(call try-cc,$(SOURCE_HELLO),-Werror -Wvolatile-register-var),y)
+ifeq ($(call try-cc,$(SOURCE_HELLO),$(CFLAGS) -Werror -Wvolatile-register-var),y)
CFLAGS := $(CFLAGS) -Wvolatile-register-var
endif
BASIC_CFLAGS = -Iutil/include -Iarch/$(ARCH)/include -I$(OUTPUT)util -I$(TRACE_EVENT_DIR) -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE
BASIC_LDFLAGS =
+ifeq ($(call try-cc,$(SOURCE_BIONIC),$(CFLAGS)),y)
+ BIONIC := 1
+ EXTLIBS := $(filter-out -lrt,$(EXTLIBS))
+ EXTLIBS := $(filter-out -lpthread,$(EXTLIBS))
+ BASIC_CFLAGS += -I.
+endif
+
# Guard against environment variables
BUILTIN_OBJS =
LIB_H =
LIB_H += util/exec_cmd.h
LIB_H += util/types.h
LIB_H += util/levenshtein.h
+LIB_H += util/machine.h
LIB_H += util/map.h
LIB_H += util/parse-options.h
LIB_H += util/parse-events.h
LIB_OBJS += $(OUTPUT)util/callchain.o
LIB_OBJS += $(OUTPUT)util/values.o
LIB_OBJS += $(OUTPUT)util/debug.o
+LIB_OBJS += $(OUTPUT)util/machine.o
LIB_OBJS += $(OUTPUT)util/map.o
LIB_OBJS += $(OUTPUT)util/pstack.o
LIB_OBJS += $(OUTPUT)util/session.o
FLAGS_LIBELF=$(ALL_CFLAGS) $(ALL_LDFLAGS) $(EXTLIBS)
ifneq ($(call try-cc,$(SOURCE_LIBELF),$(FLAGS_LIBELF)),y)
FLAGS_GLIBC=$(ALL_CFLAGS) $(ALL_LDFLAGS)
- ifneq ($(call try-cc,$(SOURCE_GLIBC),$(FLAGS_GLIBC)),y)
- msg := $(error No gnu/libc-version.h found, please install glibc-dev[el]/glibc-static);
- else
+ ifeq ($(call try-cc,$(SOURCE_GLIBC),$(FLAGS_GLIBC)),y)
+ LIBC_SUPPORT := 1
+ endif
+ ifeq ($(BIONIC),1)
+ LIBC_SUPPORT := 1
+ endif
+ ifeq ($(LIBC_SUPPORT),1)
NO_LIBELF := 1
NO_DWARF := 1
NO_DEMANGLE := 1
+ else
+ msg := $(error No gnu/libc-version.h found, please install glibc-dev[el]/glibc-static);
endif
else
FLAGS_DWARF=$(ALL_CFLAGS) -ldw -lelf $(ALL_LDFLAGS) $(EXTLIBS)
endif
endif
+ifndef NO_ON_EXIT
+ ifeq ($(call try-cc,$(SOURCE_ON_EXIT),),y)
+ BASIC_CFLAGS += -DHAVE_ON_EXIT
+ endif
+endif
+
ifndef NO_BACKTRACE
ifeq ($(call try-cc,$(SOURCE_BACKTRACE),),y)
BASIC_CFLAGS += -DBACKTRACE_SUPPORT
.sample = process_sample_event,
.mmap = perf_event__process_mmap,
.comm = perf_event__process_comm,
- .fork = perf_event__process_task,
+ .exit = perf_event__process_exit,
+ .fork = perf_event__process_fork,
.ordered_samples = true,
.ordering_requires_timestamps = true,
},
static char diff__default_sort_order[] = "dso,symbol";
static bool force;
static bool show_displacement;
+static bool show_period;
+static bool show_formula;
+static bool show_baseline_only;
+static bool sort_compute;
+
+static s64 compute_wdiff_w1;
+static s64 compute_wdiff_w2;
+
+enum {
+ COMPUTE_DELTA,
+ COMPUTE_RATIO,
+ COMPUTE_WEIGHTED_DIFF,
+ COMPUTE_MAX,
+};
+
+const char *compute_names[COMPUTE_MAX] = {
+ [COMPUTE_DELTA] = "delta",
+ [COMPUTE_RATIO] = "ratio",
+ [COMPUTE_WEIGHTED_DIFF] = "wdiff",
+};
+
+static int compute;
+
+static int setup_compute_opt_wdiff(char *opt)
+{
+ char *w1_str = opt;
+ char *w2_str;
+
+ int ret = -EINVAL;
+
+ if (!opt)
+ goto out;
+
+ w2_str = strchr(opt, ',');
+ if (!w2_str)
+ goto out;
+
+ *w2_str++ = 0x0;
+ if (!*w2_str)
+ goto out;
+
+ compute_wdiff_w1 = strtol(w1_str, NULL, 10);
+ compute_wdiff_w2 = strtol(w2_str, NULL, 10);
+
+ if (!compute_wdiff_w1 || !compute_wdiff_w2)
+ goto out;
+
+ pr_debug("compute wdiff w1(%" PRId64 ") w2(%" PRId64 ")\n",
+ compute_wdiff_w1, compute_wdiff_w2);
+
+ ret = 0;
+
+ out:
+ if (ret)
+ pr_err("Failed: wrong weight data, use 'wdiff:w1,w2'\n");
+
+ return ret;
+}
+
+static int setup_compute_opt(char *opt)
+{
+ if (compute == COMPUTE_WEIGHTED_DIFF)
+ return setup_compute_opt_wdiff(opt);
+
+ if (opt) {
+ pr_err("Failed: extra option specified '%s'\n", opt);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static int setup_compute(const struct option *opt, const char *str,
+ int unset __maybe_unused)
+{
+ int *cp = (int *) opt->value;
+ char *cstr = (char *) str;
+ char buf[50];
+ unsigned i;
+ char *option;
+
+ if (!str) {
+ *cp = COMPUTE_DELTA;
+ return 0;
+ }
+
+ if (*str == '+') {
+ sort_compute = true;
+ cstr = (char *) ++str;
+ if (!*str)
+ return 0;
+ }
+
+ option = strchr(str, ':');
+ if (option) {
+ unsigned len = option++ - str;
+
+ /*
+ * The str data are not writeable, so we need
+ * to use another buffer.
+ */
+
+ /* No option value is longer. */
+ if (len >= sizeof(buf))
+ return -EINVAL;
+
+ strncpy(buf, str, len);
+ buf[len] = 0x0;
+ cstr = buf;
+ }
+
+ for (i = 0; i < COMPUTE_MAX; i++)
+ if (!strcmp(cstr, compute_names[i])) {
+ *cp = i;
+ return setup_compute_opt(option);
+ }
+
+ pr_err("Failed: '%s' is not a computation method "
+ "(use 'delta','ratio' or 'wdiff')\n", str);
+ return -EINVAL;
+}
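The option grammar accepted by setup_compute() is "[+]name[:w1,w2]". The sketch below parses the same shape in a standalone way; `struct compute_spec` and `parse_compute` are hypothetical names. It is an illustration under stated assumptions rather than the patch's logic: unlike the patch it rejects a bare "+", and it uses the strtol() end pointer to reject trailing garbage after the weights.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct compute_spec {
	bool sort;	/* leading '+' was present */
	char name[16];	/* "delta", "ratio" or "wdiff" */
	long w1, w2;	/* weights, only meaningful for "wdiff" */
};

/* Returns 0 on success, -1 on malformed input. */
static int parse_compute(const char *str, struct compute_spec *spec)
{
	const char *colon;
	char *end;
	size_t len;

	assert(str && spec);
	memset(spec, 0, sizeof(*spec));

	if (*str == '+') {
		spec->sort = true;
		str++;
	}

	/* The method name runs up to the optional ':' separator. */
	colon = strchr(str, ':');
	len = colon ? (size_t)(colon - str) : strlen(str);
	if (len == 0 || len >= sizeof(spec->name))
		return -1;
	memcpy(spec->name, str, len);	/* memset() above keeps it terminated */

	if (!strcmp(spec->name, "wdiff")) {
		if (!colon)
			return -1;
		spec->w1 = strtol(colon + 1, &end, 10);
		if (*end != ',')
			return -1;
		spec->w2 = strtol(end + 1, &end, 10);
		if (*end != '\0' || !spec->w1 || !spec->w2)
			return -1;
		return 0;
	}

	/* delta and ratio take no extra option. */
	if (colon)
		return -1;
	if (strcmp(spec->name, "delta") && strcmp(spec->name, "ratio"))
		return -1;
	return 0;
}
```

With this sketch, "+wdiff:1,2" yields sort = true, w1 = 1 and w2 = 2, while "delta:3" and "bogus" are rejected.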
+
+static double get_period_percent(struct hist_entry *he, u64 period)
+{
+ u64 total = he->hists->stats.total_period;
+ return (period * 100.0) / total;
+}
+
+double perf_diff__compute_delta(struct hist_entry *he)
+{
+ struct hist_entry *pair = he->pair;
+ double new_percent = get_period_percent(he, he->stat.period);
+ double old_percent = pair ? get_period_percent(pair, pair->stat.period) : 0.0;
+
+ he->diff.period_ratio_delta = new_percent - old_percent;
+ he->diff.computed = true;
+ return he->diff.period_ratio_delta;
+}
+
+double perf_diff__compute_ratio(struct hist_entry *he)
+{
+ struct hist_entry *pair = he->pair;
+ double new_period = he->stat.period;
+ double old_period = pair ? pair->stat.period : 0;
+
+ he->diff.computed = true;
+ he->diff.period_ratio = pair ? (new_period / old_period) : 0;
+ return he->diff.period_ratio;
+}
+
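+/*
+ * Worked example with made-up numbers: for 'wdiff:2,3' (w1 = 2, w2 = 3),
+ * an entry with new period 100 and baseline period 120 yields
+ *   100 * 3 - 120 * 2 = 60
+ * Note that w1 weights the baseline period and w2 the new period.
+ * An entry without a pair gets a weighted diff of 0.
+ */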
+s64 perf_diff__compute_wdiff(struct hist_entry *he)
+{
+ struct hist_entry *pair = he->pair;
+ u64 new_period = he->stat.period;
+ u64 old_period = pair ? pair->stat.period : 0;
+
+ he->diff.computed = true;
+
+ if (!pair)
+ he->diff.wdiff = 0;
+ else
+ he->diff.wdiff = new_period * compute_wdiff_w2 -
+ old_period * compute_wdiff_w1;
+
+ return he->diff.wdiff;
+}
+
+static int formula_delta(struct hist_entry *he, char *buf, size_t size)
+{
+ struct hist_entry *pair = he->pair;
+
+ if (!pair)
+ return -1;
+
+ return scnprintf(buf, size,
+ "(%" PRIu64 " * 100 / %" PRIu64 ") - "
+ "(%" PRIu64 " * 100 / %" PRIu64 ")",
+ he->stat.period, he->hists->stats.total_period,
+ pair->stat.period, pair->hists->stats.total_period);
+}
+
+static int formula_ratio(struct hist_entry *he, char *buf, size_t size)
+{
+ struct hist_entry *pair = he->pair;
+ double new_period = he->stat.period;
+ double old_period = pair ? pair->stat.period : 0;
+
+ if (!pair)
+ return -1;
+
+ return scnprintf(buf, size, "%.0F / %.0F", new_period, old_period);
+}
+
+static int formula_wdiff(struct hist_entry *he, char *buf, size_t size)
+{
+ struct hist_entry *pair = he->pair;
+ u64 new_period = he->stat.period;
+ u64 old_period = pair ? pair->stat.period : 0;
+
+ if (!pair)
+ return -1;
+
+ return scnprintf(buf, size,
+ "(%" PRIu64 " * " "%" PRId64 ") - (%" PRIu64 " * " "%" PRId64 ")",
+ new_period, compute_wdiff_w2, old_period, compute_wdiff_w1);
+}
+
+int perf_diff__formula(char *buf, size_t size, struct hist_entry *he)
+{
+ switch (compute) {
+ case COMPUTE_DELTA:
+ return formula_delta(he, buf, size);
+ case COMPUTE_RATIO:
+ return formula_ratio(he, buf, size);
+ case COMPUTE_WEIGHTED_DIFF:
+ return formula_wdiff(he, buf, size);
+ default:
+ BUG_ON(1);
+ }
+
+ return -1;
+}
static int hists__add_entry(struct hists *self,
struct addr_location *al, u64 period)
return -1;
}
- if (al.filtered || al.sym == NULL)
+ if (al.filtered)
return 0;
if (hists__add_entry(&evsel->hists, &al, sample->period)) {
.sample = diff__process_sample_event,
.mmap = perf_event__process_mmap,
.comm = perf_event__process_comm,
- .exit = perf_event__process_task,
- .fork = perf_event__process_task,
+ .exit = perf_event__process_exit,
+ .fork = perf_event__process_fork,
.lost = perf_event__process_lost,
.ordered_samples = true,
.ordering_requires_timestamps = true,
}
}
+static void hists__baseline_only(struct hists *hists)
+{
+ struct rb_node *next = rb_first(&hists->entries);
+
+ while (next != NULL) {
+ struct hist_entry *he = rb_entry(next, struct hist_entry, rb_node);
+
+ next = rb_next(&he->rb_node);
+ if (!he->pair) {
+ rb_erase(&he->rb_node, &hists->entries);
+ hist_entry__free(he);
+ }
+ }
+}
+
+static void hists__precompute(struct hists *hists)
+{
+ struct rb_node *next = rb_first(&hists->entries);
+
+ while (next != NULL) {
+ struct hist_entry *he = rb_entry(next, struct hist_entry, rb_node);
+
+ next = rb_next(&he->rb_node);
+
+ switch (compute) {
+ case COMPUTE_DELTA:
+ perf_diff__compute_delta(he);
+ break;
+ case COMPUTE_RATIO:
+ perf_diff__compute_ratio(he);
+ break;
+ case COMPUTE_WEIGHTED_DIFF:
+ perf_diff__compute_wdiff(he);
+ break;
+ default:
+ BUG_ON(1);
+ }
+ }
+}
+
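+/*
+ * Comparison helpers for the resort below. The sign is inverted on
+ * purpose: a larger left value compares as "less", so an in-order walk
+ * of the resulting rbtree visits entries in descending order of the
+ * computed value, e.g. cmp_doubles(2.0, 1.0) returns -1.
+ */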
+static int64_t cmp_doubles(double l, double r)
+{
+ if (l > r)
+ return -1;
+ else if (l < r)
+ return 1;
+ else
+ return 0;
+}
+
+static int64_t
+hist_entry__cmp_compute(struct hist_entry *left, struct hist_entry *right,
+ int c)
+{
+ switch (c) {
+ case COMPUTE_DELTA:
+ {
+ double l = left->diff.period_ratio_delta;
+ double r = right->diff.period_ratio_delta;
+
+ return cmp_doubles(l, r);
+ }
+ case COMPUTE_RATIO:
+ {
+ double l = left->diff.period_ratio;
+ double r = right->diff.period_ratio;
+
+ return cmp_doubles(l, r);
+ }
+ case COMPUTE_WEIGHTED_DIFF:
+ {
+ s64 l = left->diff.wdiff;
+ s64 r = right->diff.wdiff;
+
+ return r - l;
+ }
+ default:
+ BUG_ON(1);
+ }
+
+ return 0;
+}
+
+static void insert_hist_entry_by_compute(struct rb_root *root,
+ struct hist_entry *he,
+ int c)
+{
+ struct rb_node **p = &root->rb_node;
+ struct rb_node *parent = NULL;
+ struct hist_entry *iter;
+
+ while (*p != NULL) {
+ parent = *p;
+ iter = rb_entry(parent, struct hist_entry, rb_node);
+ if (hist_entry__cmp_compute(he, iter, c) < 0)
+ p = &(*p)->rb_left;
+ else
+ p = &(*p)->rb_right;
+ }
+
+ rb_link_node(&he->rb_node, parent, p);
+ rb_insert_color(&he->rb_node, root);
+}
+
+static void hists__compute_resort(struct hists *hists)
+{
+ struct rb_root tmp = RB_ROOT;
+ struct rb_node *next = rb_first(&hists->entries);
+
+ while (next != NULL) {
+ struct hist_entry *he = rb_entry(next, struct hist_entry, rb_node);
+
+ next = rb_next(&he->rb_node);
+
+ rb_erase(&he->rb_node, &hists->entries);
+ insert_hist_entry_by_compute(&tmp, he, compute);
+ }
+
+ hists->entries = tmp;
+}
+
+static void hists__process(struct hists *old, struct hists *new)
+{
+ hists__match(old, new);
+
+ if (show_baseline_only)
+ hists__baseline_only(new);
+
+ if (sort_compute) {
+ hists__precompute(new);
+ hists__compute_resort(new);
+ }
+
+ hists__fprintf(new, true, 0, 0, stdout);
+}
+
static int __cmd_diff(void)
{
int ret, i;
first = false;
- hists__match(&evsel_old->hists, &evsel->hists);
- hists__fprintf(&evsel->hists, true, 0, 0, stdout);
+ hists__process(&evsel_old->hists, &evsel->hists);
}
out_delete:
"be more verbose (show symbol address, etc)"),
OPT_BOOLEAN('M', "displacement", &show_displacement,
"Show position displacement relative to baseline"),
+ OPT_BOOLEAN('b', "baseline-only", &show_baseline_only,
+ "Show only items with match in baseline"),
+ OPT_CALLBACK('c', "compute", &compute,
+ "delta,ratio,wdiff:w1,w2 (default delta)",
+ "Entries differential computation selection",
+ setup_compute),
+ OPT_BOOLEAN('p', "period", &show_period,
+ "Show period values."),
+ OPT_BOOLEAN('F', "formula", &show_formula,
+ "Show formula."),
OPT_BOOLEAN('D', "dump-raw-trace", &dump_trace,
"dump raw trace in ASCII"),
OPT_BOOLEAN('f', "force", &force, "don't complain, do it"),
/* No overhead column. */
perf_hpp__column_enable(PERF_HPP__OVERHEAD, false);
- /* Display baseline/delta/displacement columns. */
+ /*
+ * Display baseline/delta/ratio/displacement/
+ * formula/periods columns.
+ */
perf_hpp__column_enable(PERF_HPP__BASELINE, true);
- perf_hpp__column_enable(PERF_HPP__DELTA, true);
+
+ switch (compute) {
+ case COMPUTE_DELTA:
+ perf_hpp__column_enable(PERF_HPP__DELTA, true);
+ break;
+ case COMPUTE_RATIO:
+ perf_hpp__column_enable(PERF_HPP__RATIO, true);
+ break;
+ case COMPUTE_WEIGHTED_DIFF:
+ perf_hpp__column_enable(PERF_HPP__WEIGHTED_DIFF, true);
+ break;
+ default:
+ BUG_ON(1);
+ }
if (show_displacement)
perf_hpp__column_enable(PERF_HPP__DISPL, true);
+
+ if (show_formula)
+ perf_hpp__column_enable(PERF_HPP__FORMULA, true);
+
+ if (show_period) {
+ perf_hpp__column_enable(PERF_HPP__PERIOD, true);
+ perf_hpp__column_enable(PERF_HPP__PERIOD_BASELINE, true);
+ }
}
int cmd_diff(int argc, const char **argv, const char *prefix __maybe_unused)
return err;
}
-static int perf_event__repipe_task(struct perf_tool *tool,
+static int perf_event__repipe_fork(struct perf_tool *tool,
union perf_event *event,
struct perf_sample *sample,
struct machine *machine)
{
int err;
- err = perf_event__process_task(tool, event, sample, machine);
+ err = perf_event__process_fork(tool, event, sample, machine);
perf_event__repipe(tool, event, sample, machine);
return err;
if (inject->build_ids) {
inject->tool.sample = perf_event__inject_buildid;
inject->tool.mmap = perf_event__repipe_mmap;
- inject->tool.fork = perf_event__repipe_task;
+ inject->tool.fork = perf_event__repipe_fork;
inject->tool.tracing_data = perf_event__repipe_tracing_data;
}
static void init_kvm_event_record(struct perf_kvm *kvm)
{
- int i;
+ unsigned int i;
- for (i = 0; i < (int)EVENTS_CACHE_SIZE; i++)
+ for (i = 0; i < EVENTS_CACHE_SIZE; i++)
INIT_LIST_HEAD(&kvm->kvm_events_cache[i]);
}
BUG_ON(key->key == INVALID_KEY);
head = &kvm->kvm_events_cache[kvm_events_hash_fn(key->key)];
- list_for_each_entry(event, head, hash_entry)
+ list_for_each_entry(event, head, hash_entry) {
if (event->key.key == key->key && event->key.info == key->info)
return event;
+ }
event = kvm_alloc_init_event(key);
if (!event)
static bool update_kvm_event(struct kvm_event *event, int vcpu_id,
u64 time_diff)
{
- kvm_update_event_stats(&event->total, time_diff);
+ if (vcpu_id == -1) {
+ kvm_update_event_stats(&event->total, time_diff);
+ return true;
+ }
if (!kvm_event_expand(event, vcpu_id))
return false;
{
struct kvm_event *event;
u64 time_begin, time_diff;
+ int vcpu;
+
+ if (kvm->trace_vcpu == -1)
+ vcpu = -1;
+ else
+ vcpu = vcpu_record->vcpu_id;
event = vcpu_record->last_event;
time_begin = vcpu_record->start_time;
BUG_ON(timestamp < time_begin);
time_diff = timestamp - time_begin;
- return update_kvm_event(event, vcpu_record->vcpu_id, time_diff);
+ return update_kvm_event(event, vcpu, time_diff);
}
static
if (!vcpu_record)
return true;
+ /* only process events for the vcpus the user cares about */
+ if ((kvm->trace_vcpu != -1) &&
+ (kvm->trace_vcpu != vcpu_record->vcpu_id))
+ return true;
+
if (kvm->events_ops->is_begin_event(evsel, sample, &key))
return handle_begin_event(kvm, vcpu_record, &key, sample->time);
int vcpu = kvm->trace_vcpu;
struct kvm_event *event;
- for (i = 0; i < EVENTS_CACHE_SIZE; i++)
- list_for_each_entry(event, &kvm->kvm_events_cache[i], hash_entry)
+ for (i = 0; i < EVENTS_CACHE_SIZE; i++) {
+ list_for_each_entry(event, &kvm->kvm_events_cache[i], hash_entry) {
if (event_is_valid(event, vcpu)) {
update_total_count(kvm, event);
insert_to_result(&kvm->result, event,
kvm->compare, vcpu);
}
+ }
+ }
}
/* returns left most element of result, and erase it */
pr_info("\n");
}
- pr_info("\nTotal Samples:%lld, Total events handled time:%.2fus.\n\n",
- (unsigned long long)kvm->total_count, kvm->total_time / 1e3);
+ pr_info("\nTotal Samples:%" PRIu64 ", Total events handled time:%.2fus.\n\n",
+ kvm->total_count, kvm->total_time / 1e3);
}
static int process_sample_event(struct perf_tool *tool,
#include <sched.h>
#include <sys/mman.h>
+#ifndef HAVE_ON_EXIT
+#ifndef ATEXIT_MAX
+#define ATEXIT_MAX 32
+#endif
+static int __on_exit_count = 0;
+typedef void (*on_exit_func_t) (int, void *);
+static on_exit_func_t __on_exit_funcs[ATEXIT_MAX];
+static void *__on_exit_args[ATEXIT_MAX];
+static int __exitcode = 0;
+static void __handle_on_exit_funcs(void);
+static int on_exit(on_exit_func_t function, void *arg);
+#define exit(x) (exit)(__exitcode = (x))
+
+static int on_exit(on_exit_func_t function, void *arg)
+{
+ if (__on_exit_count == ATEXIT_MAX)
+ return -ENOMEM;
+ else if (__on_exit_count == 0)
+ atexit(__handle_on_exit_funcs);
+ __on_exit_funcs[__on_exit_count] = function;
+ __on_exit_args[__on_exit_count++] = arg;
+ return 0;
+}
+
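+/*
+ * Note: handlers run in registration order here, whereas glibc's
+ * on_exit() runs them in reverse order of registration.
+ */
+static void __handle_on_exit_funcs(void)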
+{
+ int i;
+ for (i = 0; i < __on_exit_count; i++)
+ __on_exit_funcs[i] (__exitcode, __on_exit_args[i]);
+}
+#endif
+
enum write_mode_t {
WRITE_FORCE,
WRITE_APPEND
.sample = process_sample_event,
.mmap = perf_event__process_mmap,
.comm = perf_event__process_comm,
- .exit = perf_event__process_task,
- .fork = perf_event__process_task,
+ .exit = perf_event__process_exit,
+ .fork = perf_event__process_fork,
.lost = perf_event__process_lost,
.read = process_read_event,
.attr = perf_event__process_attr,
.sample = perf_sched__process_tracepoint_sample,
.comm = perf_event__process_comm,
.lost = perf_event__process_lost,
- .fork = perf_event__process_task,
+ .exit = perf_event__process_exit,
+ .fork = perf_event__process_fork,
.ordered_samples = true,
},
.cmp_pid = LIST_HEAD_INIT(sched.cmp_pid),
.sample = process_sample_event,
.mmap = perf_event__process_mmap,
.comm = perf_event__process_comm,
- .exit = perf_event__process_task,
- .fork = perf_event__process_task,
+ .exit = perf_event__process_exit,
+ .fork = perf_event__process_fork,
.attr = perf_event__process_attr,
.event_type = perf_event__process_event_type,
.tracing_data = perf_event__process_tracing_data,
struct map *kallsyms_map, *vmlinux_map;
struct machine kallsyms, vmlinux;
enum map_type type = MAP__FUNCTION;
- long page_size = sysconf(_SC_PAGE_SIZE);
struct ref_reloc_sym ref_reloc_sym = { .name = "_stext", };
/*
static int __test__rdpmc(void)
{
- long page_size = sysconf(_SC_PAGE_SIZE);
volatile int tmp = 0;
u64 i, loops = 1000;
int n;
#include "util/color.h"
#include "util/evlist.h"
#include "util/evsel.h"
+#include "util/machine.h"
#include "util/session.h"
#include "util/symbol.h"
#include "util/thread.h"
&sample, machine);
} else if (event->header.type < PERF_RECORD_MAX) {
hists__inc_nr_events(&evsel->hists, event->header.type);
- perf_event__process(&top->tool, event, &sample, machine);
+ machine__process_event(machine, event);
} else
++session->hists.stats.nr_unknown_events;
}
struct perf_record_opts opts;
};
+static bool done = false;
+
+static void sig_handler(int sig __maybe_unused)
+{
+ done = true;
+}
+
static int trace__read_syscall_info(struct trace *trace, int id)
{
char tp_name[128];
return 0;
}
-static int trace__run(struct trace *trace)
+static int trace__run(struct trace *trace, int argc, const char **argv)
{
struct perf_evlist *evlist = perf_evlist__new(NULL, NULL);
struct perf_evsel *evsel;
int err = -1, i, nr_events = 0, before;
+ const bool forks = argc > 0;
if (evlist == NULL) {
printf("Not enough memory to run!\n");
perf_evlist__config_attrs(evlist, &trace->opts);
+ signal(SIGCHLD, sig_handler);
+ signal(SIGINT, sig_handler);
+
+ if (forks) {
+ err = perf_evlist__prepare_workload(evlist, &trace->opts, argv);
+ if (err < 0) {
+ printf("Couldn't run the workload!\n");
+ goto out_delete_evlist;
+ }
+ }
+
err = perf_evlist__open(evlist);
if (err < 0) {
printf("Couldn't create the events: %s\n", strerror(errno));
}
perf_evlist__enable(evlist);
+
+ if (forks)
+ perf_evlist__start_workload(evlist);
+
again:
before = nr_events;
}
}
- if (nr_events == before)
+ if (nr_events == before) {
+ if (done)
+ goto out_delete_evlist;
+
poll(evlist->pollfd, evlist->nr_fds, -1);
+ }
+
+ if (done)
+ perf_evlist__disable(evlist);
goto again;
int cmd_trace(int argc, const char **argv, const char *prefix __maybe_unused)
{
const char * const trace_usage[] = {
- "perf trace [<options>]",
+ "perf trace [<options>] [<command>]",
+ "perf trace [<options>] -- <command> [<options>]",
NULL
};
struct trace trace = {
OPT_END()
};
int err;
+ char bf[BUFSIZ];
argc = parse_options(argc, argv, trace_options, trace_usage, 0);
- if (argc)
- usage_with_options(trace_usage, trace_options);
+
+ err = perf_target__validate(&trace.opts.target);
+ if (err) {
+ perf_target__strerror(&trace.opts.target, err, bf, sizeof(bf));
+ printf("%s", bf);
+ return err;
+ }
err = perf_target__parse_uid(&trace.opts.target);
if (err) {
- char bf[BUFSIZ];
perf_target__strerror(&trace.opts.target, err, bf, sizeof(bf));
printf("%s", bf);
return err;
}
- return trace__run(&trace);
+ if (!argc && perf_target__none(&trace.opts.target))
+ trace.opts.target.system_wide = true;
+
+ return trace__run(&trace, argc, argv);
}
}
endef
+define SOURCE_BIONIC
+#include <android/api-level.h>
+
+int main(void)
+{
+ return __ANDROID_API__;
+}
+endef
+
define SOURCE_ELF_MMAP
#include <libelf.h>
int main(void)
return audit_open();
}
endef
-endif
\ No newline at end of file
+endif
+
+define SOURCE_ON_EXIT
+#include <stdio.h>
+
+int main(void)
+{
+ return on_exit(NULL, NULL);
+}
+endef
{
const char *cmd;
+ page_size = sysconf(_SC_PAGE_SIZE);
+
cmd = perf_extract_argv0_path(argv[0]);
if (!cmd)
cmd = "perf-help";
{
double percent = baseline_percent(he);
- return percent_color_snprintf(hpp->buf, hpp->size, " %6.2f%%", percent);
+ if (he->pair)
+ return percent_color_snprintf(hpp->buf, hpp->size, " %6.2f%%", percent);
+ else
+ return scnprintf(hpp->buf, hpp->size, " ");
}
static int hpp__entry_baseline(struct perf_hpp *hpp, struct hist_entry *he)
double percent = baseline_percent(he);
const char *fmt = symbol_conf.field_sep ? "%.2f" : " %6.2f%%";
- return scnprintf(hpp->buf, hpp->size, fmt, percent);
+ if (he->pair || symbol_conf.field_sep)
+ return scnprintf(hpp->buf, hpp->size, fmt, percent);
+ else
+ return scnprintf(hpp->buf, hpp->size, " ");
}
static int hpp__header_samples(struct perf_hpp *hpp)
return scnprintf(hpp->buf, hpp->size, fmt, he->stat.period);
}
+static int hpp__header_period_baseline(struct perf_hpp *hpp)
+{
+ const char *fmt = symbol_conf.field_sep ? "%s" : "%12s";
+
+ return scnprintf(hpp->buf, hpp->size, fmt, "Period Base");
+}
+
+static int hpp__width_period_baseline(struct perf_hpp *hpp __maybe_unused)
+{
+ return 12;
+}
+
+static int hpp__entry_period_baseline(struct perf_hpp *hpp, struct hist_entry *he)
+{
+ struct hist_entry *pair = he->pair;
+ u64 period = pair ? pair->stat.period : 0;
+ const char *fmt = symbol_conf.field_sep ? "%" PRIu64 : "%12" PRIu64;
+
+ return scnprintf(hpp->buf, hpp->size, fmt, period);
+}
static int hpp__header_delta(struct perf_hpp *hpp)
{
const char *fmt = symbol_conf.field_sep ? "%s" : "%7s";
static int hpp__entry_delta(struct perf_hpp *hpp, struct hist_entry *he)
{
- struct hist_entry *pair = he->pair;
- struct hists *pair_hists = pair ? pair->hists : NULL;
- struct hists *hists = he->hists;
- u64 old_total, new_total;
- double old_percent = 0, new_percent = 0;
- double diff;
const char *fmt = symbol_conf.field_sep ? "%s" : "%7.7s";
char buf[32] = " ";
+ double diff;
- old_total = pair_hists ? pair_hists->stats.total_period : 0;
- if (old_total > 0 && pair)
- old_percent = 100.0 * pair->stat.period / old_total;
-
- new_total = hists->stats.total_period;
- if (new_total > 0)
- new_percent = 100.0 * he->stat.period / new_total;
+ if (he->diff.computed)
+ diff = he->diff.period_ratio_delta;
+ else
+ diff = perf_diff__compute_delta(he);
- diff = new_percent - old_percent;
if (fabs(diff) >= 0.01)
scnprintf(buf, sizeof(buf), "%+4.2F%%", diff);
return scnprintf(hpp->buf, hpp->size, fmt, buf);
}
+static int hpp__header_ratio(struct perf_hpp *hpp)
+{
+ const char *fmt = symbol_conf.field_sep ? "%s" : "%14s";
+
+ return scnprintf(hpp->buf, hpp->size, fmt, "Ratio");
+}
+
+static int hpp__width_ratio(struct perf_hpp *hpp __maybe_unused)
+{
+ return 14;
+}
+
+static int hpp__entry_ratio(struct perf_hpp *hpp, struct hist_entry *he)
+{
+ const char *fmt = symbol_conf.field_sep ? "%s" : "%14s";
+ char buf[32] = " ";
+ double ratio;
+
+ if (he->diff.computed)
+ ratio = he->diff.period_ratio;
+ else
+ ratio = perf_diff__compute_ratio(he);
+
+ if (ratio > 0.0)
+ scnprintf(buf, sizeof(buf), "%+14.6F", ratio);
+
+ return scnprintf(hpp->buf, hpp->size, fmt, buf);
+}
+
+static int hpp__header_wdiff(struct perf_hpp *hpp)
+{
+ const char *fmt = symbol_conf.field_sep ? "%s" : "%14s";
+
+ return scnprintf(hpp->buf, hpp->size, fmt, "Weighted diff");
+}
+
+static int hpp__width_wdiff(struct perf_hpp *hpp __maybe_unused)
+{
+ return 14;
+}
+
+static int hpp__entry_wdiff(struct perf_hpp *hpp, struct hist_entry *he)
+{
+ const char *fmt = symbol_conf.field_sep ? "%s" : "%14s";
+ char buf[32] = " ";
+ s64 wdiff;
+
+ if (he->diff.computed)
+ wdiff = he->diff.wdiff;
+ else
+ wdiff = perf_diff__compute_wdiff(he);
+
+ if (wdiff != 0)
+ scnprintf(buf, sizeof(buf), "%14" PRId64, wdiff);
+
+ return scnprintf(hpp->buf, hpp->size, fmt, buf);
+}
+
static int hpp__header_displ(struct perf_hpp *hpp)
{
return scnprintf(hpp->buf, hpp->size, "Displ.");
return scnprintf(hpp->buf, hpp->size, fmt, buf);
}
+static int hpp__header_formula(struct perf_hpp *hpp)
+{
+ const char *fmt = symbol_conf.field_sep ? "%s" : "%70s";
+
+ return scnprintf(hpp->buf, hpp->size, fmt, "Formula");
+}
+
+static int hpp__width_formula(struct perf_hpp *hpp __maybe_unused)
+{
+ return 70;
+}
+
+static int hpp__entry_formula(struct perf_hpp *hpp, struct hist_entry *he)
+{
+ const char *fmt = symbol_conf.field_sep ? "%s" : "%-70s";
+ char buf[96] = " ";
+
+ perf_diff__formula(buf, sizeof(buf), he);
+ return scnprintf(hpp->buf, hpp->size, fmt, buf);
+}
+
#define HPP__COLOR_PRINT_FNS(_name) \
.header = hpp__header_ ## _name, \
.width = hpp__width_ ## _name, \
{ .cond = false, HPP__COLOR_PRINT_FNS(overhead_guest_us) },
{ .cond = false, HPP__PRINT_FNS(samples) },
{ .cond = false, HPP__PRINT_FNS(period) },
+ { .cond = false, HPP__PRINT_FNS(period_baseline) },
{ .cond = false, HPP__PRINT_FNS(delta) },
- { .cond = false, HPP__PRINT_FNS(displ) }
+ { .cond = false, HPP__PRINT_FNS(ratio) },
+ { .cond = false, HPP__PRINT_FNS(wdiff) },
+ { .cond = false, HPP__PRINT_FNS(displ) },
+ { .cond = false, HPP__PRINT_FNS(formula) }
};
#undef HPP__COLOR_PRINT_FNS
const char *sep = symbol_conf.field_sep;
const char *col_width = symbol_conf.col_width_list_str;
int idx, nr_rows = 0;
- char bf[64];
+ char bf[96];
struct perf_hpp dummy_hpp = {
.buf = bf,
.size = sizeof(bf),
struct perf_tool build_id__mark_dso_hit_ops = {
.sample = build_id__mark_dso_hit,
.mmap = perf_event__process_mmap,
- .fork = perf_event__process_task,
+ .fork = perf_event__process_fork,
.exit = perf_event__exit_del_thread,
.attr = perf_event__process_attr,
.build_id = perf_event__process_build_id,
#include <linux/types.h>
#include "event.h"
#include "debug.h"
+#include "machine.h"
#include "sort.h"
#include "string.h"
#include "strlist.h"
struct perf_sample *sample __maybe_unused,
struct machine *machine)
{
- struct thread *thread = machine__findnew_thread(machine, event->comm.tid);
-
- if (dump_trace)
- perf_event__fprintf_comm(event, stdout);
-
- if (thread == NULL || thread__set_comm(thread, event->comm.comm)) {
- dump_printf("problem processing PERF_RECORD_COMM, skipping event.\n");
- return -1;
- }
-
- return 0;
+ return machine__process_comm_event(machine, event);
}
int perf_event__process_lost(struct perf_tool *tool __maybe_unused,
union perf_event *event,
struct perf_sample *sample __maybe_unused,
- struct machine *machine __maybe_unused)
-{
- dump_printf(": id:%" PRIu64 ": lost:%" PRIu64 "\n",
- event->lost.id, event->lost.lost);
- return 0;
-}
-
-static void perf_event__set_kernel_mmap_len(union perf_event *event,
- struct map **maps)
+ struct machine *machine)
{
- maps[MAP__FUNCTION]->start = event->mmap.start;
- maps[MAP__FUNCTION]->end = event->mmap.start + event->mmap.len;
- /*
- * Be a bit paranoid here, some perf.data file came with
- * a zero sized synthesized MMAP event for the kernel.
- */
- if (maps[MAP__FUNCTION]->end == 0)
- maps[MAP__FUNCTION]->end = ~0ULL;
-}
-
-static int perf_event__process_kernel_mmap(struct perf_tool *tool
- __maybe_unused,
- union perf_event *event,
- struct machine *machine)
-{
- struct map *map;
- char kmmap_prefix[PATH_MAX];
- enum dso_kernel_type kernel_type;
- bool is_kernel_mmap;
-
- machine__mmap_name(machine, kmmap_prefix, sizeof(kmmap_prefix));
- if (machine__is_host(machine))
- kernel_type = DSO_TYPE_KERNEL;
- else
- kernel_type = DSO_TYPE_GUEST_KERNEL;
-
- is_kernel_mmap = memcmp(event->mmap.filename,
- kmmap_prefix,
- strlen(kmmap_prefix) - 1) == 0;
- if (event->mmap.filename[0] == '/' ||
- (!is_kernel_mmap && event->mmap.filename[0] == '[')) {
-
- char short_module_name[1024];
- char *name, *dot;
-
- if (event->mmap.filename[0] == '/') {
- name = strrchr(event->mmap.filename, '/');
- if (name == NULL)
- goto out_problem;
-
- ++name; /* skip / */
- dot = strrchr(name, '.');
- if (dot == NULL)
- goto out_problem;
- snprintf(short_module_name, sizeof(short_module_name),
- "[%.*s]", (int)(dot - name), name);
- strxfrchar(short_module_name, '-', '_');
- } else
- strcpy(short_module_name, event->mmap.filename);
-
- map = machine__new_module(machine, event->mmap.start,
- event->mmap.filename);
- if (map == NULL)
- goto out_problem;
-
- name = strdup(short_module_name);
- if (name == NULL)
- goto out_problem;
-
- map->dso->short_name = name;
- map->dso->sname_alloc = 1;
- map->end = map->start + event->mmap.len;
- } else if (is_kernel_mmap) {
- const char *symbol_name = (event->mmap.filename +
- strlen(kmmap_prefix));
- /*
- * Should be there already, from the build-id table in
- * the header.
- */
- struct dso *kernel = __dsos__findnew(&machine->kernel_dsos,
- kmmap_prefix);
- if (kernel == NULL)
- goto out_problem;
-
- kernel->kernel = kernel_type;
- if (__machine__create_kernel_maps(machine, kernel) < 0)
- goto out_problem;
-
- perf_event__set_kernel_mmap_len(event, machine->vmlinux_maps);
-
- /*
- * Avoid using a zero address (kptr_restrict) for the ref reloc
- * symbol. Effectively having zero here means that at record
- * time /proc/sys/kernel/kptr_restrict was non zero.
- */
- if (event->mmap.pgoff != 0) {
- maps__set_kallsyms_ref_reloc_sym(machine->vmlinux_maps,
- symbol_name,
- event->mmap.pgoff);
- }
-
- if (machine__is_default_guest(machine)) {
- /*
- * preload dso of guest kernel and modules
- */
- dso__load(kernel, machine->vmlinux_maps[MAP__FUNCTION],
- NULL);
- }
- }
- return 0;
-out_problem:
- return -1;
+ return machine__process_lost_event(machine, event);
}
size_t perf_event__fprintf_mmap(union perf_event *event, FILE *fp)
event->mmap.len, event->mmap.pgoff, event->mmap.filename);
}
-int perf_event__process_mmap(struct perf_tool *tool,
+int perf_event__process_mmap(struct perf_tool *tool __maybe_unused,
union perf_event *event,
struct perf_sample *sample __maybe_unused,
struct machine *machine)
{
- struct thread *thread;
- struct map *map;
- u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
- int ret = 0;
-
- if (dump_trace)
- perf_event__fprintf_mmap(event, stdout);
-
- if (cpumode == PERF_RECORD_MISC_GUEST_KERNEL ||
- cpumode == PERF_RECORD_MISC_KERNEL) {
- ret = perf_event__process_kernel_mmap(tool, event, machine);
- if (ret < 0)
- goto out_problem;
- return 0;
- }
-
- thread = machine__findnew_thread(machine, event->mmap.pid);
- if (thread == NULL)
- goto out_problem;
- map = map__new(&machine->user_dsos, event->mmap.start,
- event->mmap.len, event->mmap.pgoff,
- event->mmap.pid, event->mmap.filename,
- MAP__FUNCTION);
- if (map == NULL)
- goto out_problem;
-
- thread__insert_map(thread, map);
- return 0;
-
-out_problem:
- dump_printf("problem processing PERF_RECORD_MMAP, skipping event.\n");
- return 0;
+ return machine__process_mmap_event(machine, event);
}
size_t perf_event__fprintf_task(union perf_event *event, FILE *fp)
event->fork.ppid, event->fork.ptid);
}
-int perf_event__process_task(struct perf_tool *tool __maybe_unused,
+int perf_event__process_fork(struct perf_tool *tool __maybe_unused,
union perf_event *event,
struct perf_sample *sample __maybe_unused,
- struct machine *machine)
+ struct machine *machine)
{
- struct thread *thread = machine__findnew_thread(machine, event->fork.tid);
- struct thread *parent = machine__findnew_thread(machine, event->fork.ptid);
-
- if (dump_trace)
- perf_event__fprintf_task(event, stdout);
-
- if (event->header.type == PERF_RECORD_EXIT) {
- machine__remove_thread(machine, thread);
- return 0;
- }
-
- if (thread == NULL || parent == NULL ||
- thread__fork(thread, parent) < 0) {
- dump_printf("problem processing PERF_RECORD_FORK, skipping event.\n");
- return -1;
- }
+ return machine__process_fork_event(machine, event);
+}
- return 0;
+int perf_event__process_exit(struct perf_tool *tool __maybe_unused,
+ union perf_event *event,
+ struct perf_sample *sample __maybe_unused,
+ struct machine *machine)
+{
+ return machine__process_exit_event(machine, event);
}
size_t perf_event__fprintf(union perf_event *event, FILE *fp)
return ret;
}
-int perf_event__process(struct perf_tool *tool, union perf_event *event,
- struct perf_sample *sample, struct machine *machine)
+int perf_event__process(struct perf_tool *tool __maybe_unused,
+ union perf_event *event,
+ struct perf_sample *sample __maybe_unused,
+ struct machine *machine)
{
- switch (event->header.type) {
- case PERF_RECORD_COMM:
- perf_event__process_comm(tool, event, sample, machine);
- break;
- case PERF_RECORD_MMAP:
- perf_event__process_mmap(tool, event, sample, machine);
- break;
- case PERF_RECORD_FORK:
- case PERF_RECORD_EXIT:
- perf_event__process_task(tool, event, sample, machine);
- break;
- case PERF_RECORD_LOST:
- perf_event__process_lost(tool, event, sample, machine);
- default:
- break;
- }
-
- return 0;
+ return machine__process_event(machine, event);
}
void thread__find_addr_map(struct thread *self,
union perf_event *event,
struct perf_sample *sample,
struct machine *machine);
-int perf_event__process_task(struct perf_tool *tool,
+int perf_event__process_fork(struct perf_tool *tool,
+ union perf_event *event,
+ struct perf_sample *sample,
+ struct machine *machine);
+int perf_event__process_exit(struct perf_tool *tool,
union perf_event *event,
struct perf_sample *sample,
struct machine *machine);
union perf_event *perf_evlist__mmap_read(struct perf_evlist *evlist, int idx)
{
- /* XXX Move this to perf.c, making it generally available */
- unsigned int page_size = sysconf(_SC_PAGE_SIZE);
struct perf_mmap *md = &evlist->mmap[idx];
unsigned int head = perf_mmap__read_head(md);
unsigned int old = md->prev;
int perf_evlist__mmap(struct perf_evlist *evlist, unsigned int pages,
bool overwrite)
{
- unsigned int page_size = sysconf(_SC_PAGE_SIZE);
struct perf_evsel *evsel;
const struct cpu_map *cpus = evlist->cpus;
const struct thread_map *threads = evlist->threads;
PERF_HPP__OVERHEAD_GUEST_US,
PERF_HPP__SAMPLES,
PERF_HPP__PERIOD,
+ PERF_HPP__PERIOD_BASELINE,
PERF_HPP__DELTA,
+ PERF_HPP__RATIO,
+ PERF_HPP__WEIGHTED_DIFF,
PERF_HPP__DISPL,
+ PERF_HPP__FORMULA,
PERF_HPP__MAX_INDEX
};
unsigned int hists__sort_list_width(struct hists *self);
+double perf_diff__compute_delta(struct hist_entry *he);
+double perf_diff__compute_ratio(struct hist_entry *he);
+s64 perf_diff__compute_wdiff(struct hist_entry *he);
+int perf_diff__formula(char *buf, size_t size, struct hist_entry *he);
#endif /* __PERF_HIST_H */
--- /dev/null
+#include "debug.h"
+#include "event.h"
+#include "machine.h"
+#include "map.h"
+#include "thread.h"
+#include <stdbool.h>
+
+static struct thread *__machine__findnew_thread(struct machine *machine, pid_t pid,
+ bool create)
+{
+ struct rb_node **p = &machine->threads.rb_node;
+ struct rb_node *parent = NULL;
+ struct thread *th;
+
+ /*
+ * Front-end cache - PID lookups come in blocks,
+ * so most of the time we don't have to look up
+ * the full rbtree:
+ */
+ if (machine->last_match && machine->last_match->pid == pid)
+ return machine->last_match;
+
+ while (*p != NULL) {
+ parent = *p;
+ th = rb_entry(parent, struct thread, rb_node);
+
+ if (th->pid == pid) {
+ machine->last_match = th;
+ return th;
+ }
+
+ if (pid < th->pid)
+ p = &(*p)->rb_left;
+ else
+ p = &(*p)->rb_right;
+ }
+
+ if (!create)
+ return NULL;
+
+ th = thread__new(pid);
+ if (th != NULL) {
+ rb_link_node(&th->rb_node, parent, p);
+ rb_insert_color(&th->rb_node, &machine->threads);
+ machine->last_match = th;
+ }
+
+ return th;
+}
+
+struct thread *machine__findnew_thread(struct machine *machine, pid_t pid)
+{
+ return __machine__findnew_thread(machine, pid, true);
+}
+
+struct thread *machine__find_thread(struct machine *machine, pid_t pid)
+{
+ return __machine__findnew_thread(machine, pid, false);
+}
+
+int machine__process_comm_event(struct machine *machine, union perf_event *event)
+{
+ struct thread *thread = machine__findnew_thread(machine, event->comm.tid);
+
+ if (dump_trace)
+ perf_event__fprintf_comm(event, stdout);
+
+ if (thread == NULL || thread__set_comm(thread, event->comm.comm)) {
+ dump_printf("problem processing PERF_RECORD_COMM, skipping event.\n");
+ return -1;
+ }
+
+ return 0;
+}
+
+int machine__process_lost_event(struct machine *machine __maybe_unused,
+ union perf_event *event)
+{
+ dump_printf(": id:%" PRIu64 ": lost:%" PRIu64 "\n",
+ event->lost.id, event->lost.lost);
+ return 0;
+}
+
+static void machine__set_kernel_mmap_len(struct machine *machine,
+ union perf_event *event)
+{
+ machine->vmlinux_maps[MAP__FUNCTION]->start = event->mmap.start;
+ machine->vmlinux_maps[MAP__FUNCTION]->end = (event->mmap.start +
+ event->mmap.len);
+ /*
+ * Be a bit paranoid here: some perf.data files came with
+ * a zero-sized synthesized MMAP event for the kernel.
+ */
+ if (machine->vmlinux_maps[MAP__FUNCTION]->end == 0)
+ machine->vmlinux_maps[MAP__FUNCTION]->end = ~0ULL;
+}
+
+static int machine__process_kernel_mmap_event(struct machine *machine,
+ union perf_event *event)
+{
+ struct map *map;
+ char kmmap_prefix[PATH_MAX];
+ enum dso_kernel_type kernel_type;
+ bool is_kernel_mmap;
+
+ machine__mmap_name(machine, kmmap_prefix, sizeof(kmmap_prefix));
+ if (machine__is_host(machine))
+ kernel_type = DSO_TYPE_KERNEL;
+ else
+ kernel_type = DSO_TYPE_GUEST_KERNEL;
+
+ is_kernel_mmap = memcmp(event->mmap.filename,
+ kmmap_prefix,
+ strlen(kmmap_prefix) - 1) == 0;
+ if (event->mmap.filename[0] == '/' ||
+ (!is_kernel_mmap && event->mmap.filename[0] == '[')) {
+
+ char short_module_name[1024];
+ char *name, *dot;
+
+ if (event->mmap.filename[0] == '/') {
+ name = strrchr(event->mmap.filename, '/');
+ if (name == NULL)
+ goto out_problem;
+
+ ++name; /* skip / */
+ dot = strrchr(name, '.');
+ if (dot == NULL)
+ goto out_problem;
+ snprintf(short_module_name, sizeof(short_module_name),
+ "[%.*s]", (int)(dot - name), name);
+ strxfrchar(short_module_name, '-', '_');
+ } else
+ strcpy(short_module_name, event->mmap.filename);
+
+ map = machine__new_module(machine, event->mmap.start,
+ event->mmap.filename);
+ if (map == NULL)
+ goto out_problem;
+
+ name = strdup(short_module_name);
+ if (name == NULL)
+ goto out_problem;
+
+ map->dso->short_name = name;
+ map->dso->sname_alloc = 1;
+ map->end = map->start + event->mmap.len;
+ } else if (is_kernel_mmap) {
+ const char *symbol_name = (event->mmap.filename +
+ strlen(kmmap_prefix));
+ /*
+ * Should be there already, from the build-id table in
+ * the header.
+ */
+ struct dso *kernel = __dsos__findnew(&machine->kernel_dsos,
+ kmmap_prefix);
+ if (kernel == NULL)
+ goto out_problem;
+
+ kernel->kernel = kernel_type;
+ if (__machine__create_kernel_maps(machine, kernel) < 0)
+ goto out_problem;
+
+ machine__set_kernel_mmap_len(machine, event);
+
+ /*
+ * Avoid using a zero address (kptr_restrict) for the ref reloc
+ * symbol. A zero here effectively means that, at record time,
+ * /proc/sys/kernel/kptr_restrict was non-zero.
+ */
+ if (event->mmap.pgoff != 0) {
+ maps__set_kallsyms_ref_reloc_sym(machine->vmlinux_maps,
+ symbol_name,
+ event->mmap.pgoff);
+ }
+
+ if (machine__is_default_guest(machine)) {
+ /*
+ * preload dso of guest kernel and modules
+ */
+ dso__load(kernel, machine->vmlinux_maps[MAP__FUNCTION],
+ NULL);
+ }
+ }
+ return 0;
+out_problem:
+ return -1;
+}
+
+int machine__process_mmap_event(struct machine *machine, union perf_event *event)
+{
+ u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
+ struct thread *thread;
+ struct map *map;
+ int ret = 0;
+
+ if (dump_trace)
+ perf_event__fprintf_mmap(event, stdout);
+
+ if (cpumode == PERF_RECORD_MISC_GUEST_KERNEL ||
+ cpumode == PERF_RECORD_MISC_KERNEL) {
+ ret = machine__process_kernel_mmap_event(machine, event);
+ if (ret < 0)
+ goto out_problem;
+ return 0;
+ }
+
+ thread = machine__findnew_thread(machine, event->mmap.pid);
+ if (thread == NULL)
+ goto out_problem;
+ map = map__new(&machine->user_dsos, event->mmap.start,
+ event->mmap.len, event->mmap.pgoff,
+ event->mmap.pid, event->mmap.filename,
+ MAP__FUNCTION);
+ if (map == NULL)
+ goto out_problem;
+
+ thread__insert_map(thread, map);
+ return 0;
+
+out_problem:
+ dump_printf("problem processing PERF_RECORD_MMAP, skipping event.\n");
+ return 0;
+}
+
+int machine__process_fork_event(struct machine *machine, union perf_event *event)
+{
+ struct thread *thread = machine__findnew_thread(machine, event->fork.tid);
+ struct thread *parent = machine__findnew_thread(machine, event->fork.ptid);
+
+ if (dump_trace)
+ perf_event__fprintf_task(event, stdout);
+
+ if (thread == NULL || parent == NULL ||
+ thread__fork(thread, parent) < 0) {
+ dump_printf("problem processing PERF_RECORD_FORK, skipping event.\n");
+ return -1;
+ }
+
+ return 0;
+}
+
+int machine__process_exit_event(struct machine *machine, union perf_event *event)
+{
+ struct thread *thread = machine__find_thread(machine, event->fork.tid);
+
+ if (dump_trace)
+ perf_event__fprintf_task(event, stdout);
+
+ if (thread != NULL)
+ machine__remove_thread(machine, thread);
+
+ return 0;
+}
+
+int machine__process_event(struct machine *machine, union perf_event *event)
+{
+ int ret;
+
+ switch (event->header.type) {
+ case PERF_RECORD_COMM:
+ ret = machine__process_comm_event(machine, event); break;
+ case PERF_RECORD_MMAP:
+ ret = machine__process_mmap_event(machine, event); break;
+ case PERF_RECORD_FORK:
+ ret = machine__process_fork_event(machine, event); break;
+ case PERF_RECORD_EXIT:
+ ret = machine__process_exit_event(machine, event); break;
+ case PERF_RECORD_LOST:
+ ret = machine__process_lost_event(machine, event); break;
+ default:
+ ret = -1;
+ break;
+ }
+
+ return ret;
+}
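
The lookup in __machine__findnew_thread() above combines a one-entry "last match" cache with a tree walk, and uses a create flag to serve both the find and findnew variants. The following is a minimal standalone sketch of that pattern, not the perf implementation: it assumes a plain unbalanced binary search tree instead of the kernel rbtree, and the toy_thread, toy_machine, toy_findnew and tree_walks names are hypothetical, introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct thread and struct machine. */
struct toy_thread {
	int pid;
	struct toy_thread *left, *right;
};

struct toy_machine {
	struct toy_thread *root;       /* plain BST here, not an rbtree */
	struct toy_thread *last_match; /* one-entry front-end cache */
	unsigned long tree_walks;      /* lookups that missed the cache */
};

/*
 * Look up pid. If it is absent and create is true, allocate a node and
 * link it at the slot where the walk ended. Returns NULL when the pid is
 * absent and create is false, or when allocation fails.
 */
static struct toy_thread *toy_findnew(struct toy_machine *m, int pid,
				      bool create)
{
	struct toy_thread **p;
	struct toy_thread *th;

	assert(m != NULL);

	/* Fast path: repeated lookups of the same pid skip the tree. */
	if (m->last_match && m->last_match->pid == pid)
		return m->last_match;

	m->tree_walks++;
	p = &m->root;
	while (*p != NULL) {
		th = *p;
		if (th->pid == pid) {
			m->last_match = th;
			return th;
		}
		p = pid < th->pid ? &th->left : &th->right;
	}

	if (!create)
		return NULL;

	th = calloc(1, sizeof(*th));
	if (th != NULL) {
		th->pid = pid;
		*p = th;            /* link at the empty slot found above */
		m->last_match = th;
	}
	return th;
}
```

In this sketch a failed lookup with create set to false leaves the cache untouched, which mirrors the early return before thread__new() in the function above. The tree_walks counter exists only so the effect of the cache can be observed.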
--- /dev/null
+#ifndef __PERF_MACHINE_H
+#define __PERF_MACHINE_H
+
+#include <sys/types.h>
+
+struct thread;
+struct machine;
+union perf_event;
+
+struct thread *machine__find_thread(struct machine *machine, pid_t pid);
+
+int machine__process_comm_event(struct machine *machine, union perf_event *event);
+int machine__process_exit_event(struct machine *machine, union perf_event *event);
+int machine__process_fork_event(struct machine *machine, union perf_event *event);
+int machine__process_lost_event(struct machine *machine, union perf_event *event);
+int machine__process_mmap_event(struct machine *machine, union perf_event *event);
+int machine__process_event(struct machine *machine, union perf_event *event);
+
+#endif /* __PERF_MACHINE_H */
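
For module mmap events whose filename starts with '/', machine__process_kernel_mmap_event() derives a bracketed short name from the path: it takes the basename, drops the extension, and turns '-' into '_'. The sketch below reproduces only that string handling in isolation. The helper name toy_short_module_name is hypothetical, and a hand-written loop stands in for strxfrchar(); treat it as an illustration under those assumptions rather than the perf code itself.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of perf. Writes "[name]" into out, where
 * name is the basename of path without its last extension and with every
 * '-' replaced by '_'. Returns 0 on success, -1 if path has no '/' or the
 * basename has no '.', or if the output buffer is empty.
 */
static int toy_short_module_name(const char *path, char *out, size_t outsz)
{
	const char *name, *dot;
	char *c;

	assert(path != NULL && out != NULL);

	if (outsz == 0)
		return -1;

	name = strrchr(path, '/');
	if (name == NULL)
		return -1;
	++name; /* skip the '/' itself */

	dot = strrchr(name, '.');
	if (dot == NULL)
		return -1;

	/* "%.*s" prints only the characters before the extension. */
	snprintf(out, outsz, "[%.*s]", (int)(dot - name), name);

	/* Stand-in for strxfrchar(out, '-', '_'). */
	for (c = out; *c != '\0'; c++)
		if (*c == '-')
			*c = '_';

	return 0;
}
```

Called with an illustrative path such as "/lib/modules/x/kernel/snd-hda-intel.ko" and a 64-byte buffer, the helper fills the buffer with "[snd_hda_intel]". Paths without a '/' or without an extension are rejected, matching the two goto out_problem checks in the function this is modelled on.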
#include "../event.h"
#include "../thread.h"
#include "../trace-event.h"
-#include "../evsel.h"
PyMODINIT_FUNC initperf_trace_context(void);
{
u64 head, page_offset, file_offset, file_pos, progress_next;
int err, mmap_prot, mmap_flags, map_idx = 0;
- size_t page_size, mmap_size;
+ size_t mmap_size;
char *buf, *mmaps[8];
union perf_event *event;
uint32_t size;
perf_tool__fill_defaults(tool);
- page_size = sysconf(_SC_PAGESIZE);
-
page_offset = page_size * (data_offset / page_size);
file_offset = page_offset;
head = data_offset - page_offset;
u32 nr_events;
};
+struct hist_entry_diff {
+ bool computed;
+
+ /* PERF_HPP__DISPL */
+ int displacement;
+
+ /* PERF_HPP__DELTA */
+ double period_ratio_delta;
+
+ /* PERF_HPP__RATIO */
+ double period_ratio;
+
+ /* HISTC_WEIGHTED_DIFF */
+ s64 wdiff;
+};
+
/**
* struct hist_entry - histogram entry
*
u64 ip;
s32 cpu;
+ struct hist_entry_diff diff;
+
/* XXX These two should move to some tree widget lib */
u16 row_offset;
u16 nr_rows;
#include "util.h"
#include "debug.h"
-static struct thread *thread__new(pid_t pid)
+struct thread *thread__new(pid_t pid)
{
struct thread *self = zalloc(sizeof(*self));
map_groups__fprintf(&self->mg, verbose, fp);
}
-struct thread *machine__findnew_thread(struct machine *self, pid_t pid)
-{
- struct rb_node **p = &self->threads.rb_node;
- struct rb_node *parent = NULL;
- struct thread *th;
-
- /*
- * Font-end cache - PID lookups come in blocks,
- * so most of the time we dont have to look up
- * the full rbtree:
- */
- if (self->last_match && self->last_match->pid == pid)
- return self->last_match;
-
- while (*p != NULL) {
- parent = *p;
- th = rb_entry(parent, struct thread, rb_node);
-
- if (th->pid == pid) {
- self->last_match = th;
- return th;
- }
-
- if (pid < th->pid)
- p = &(*p)->rb_left;
- else
- p = &(*p)->rb_right;
- }
-
- th = thread__new(pid);
- if (th != NULL) {
- rb_link_node(&th->rb_node, parent, p);
- rb_insert_color(&th->rb_node, &self->threads);
- self->last_match = th;
- }
-
- return th;
-}
-
void thread__insert_map(struct thread *self, struct map *map)
{
map_groups__fixup_overlappings(&self->mg, map, verbose, stderr);
#include <linux/rbtree.h>
#include <unistd.h>
+#include <sys/types.h>
#include "symbol.h"
struct thread {
struct machine;
+struct thread *thread__new(pid_t pid);
void thread__delete(struct thread *self);
int thread__set_comm(struct thread *self, const char *comm);
int host_bigendian;
static int long_size;
-static unsigned long page_size;
-
static ssize_t calc_data_size;
static bool repipe;
/*
* XXX We need to find a better place for these things...
*/
+unsigned int page_size;
+
bool perf_host = true;
bool perf_guest = false;
void dump_stack(void);
+extern unsigned int page_size;
+
#endif