Verification todo
~~~~~~~~~~~~~~~~~
check that illegal insns on all targets don't cause the _toIR.c's
to assert.  [DONE: amd64 x86 ppc32 ppc64 arm s390]
check also with --vex-guest-chase-cond=yes

check that all targets can run their insn set tests with
--vex-guest-max-insns=1.

all targets: run some tests using --profile-flags=... to exercise
function patchProfInc_  [DONE: amd64 x86 ppc32 ppc64 arm s390]

figure out if there is a way to write a test program that checks
that event checks are actually getting triggered


Cleanups
~~~~~~~~
host_arm_isel.c and host_arm_defs.c: get rid of global var
arm_hwcaps.

host_x86_defs.c, host_amd64_defs.c: return proper VexInvalRange
records from the patchers, instead of {0,0}, so that transparent
self hosting works properly.

host_ppc_defs.h: is RdWrLR still needed?  If not, delete.
ditto ARM, Ld8S

Comments that used to be in m_scheduler.c:
   tchaining tests:
   - extensive spinrounds
   - with sched quantum = 1 -- check that handle_noredir_jump
     doesn't return with INNER_COUNTERZERO
   other:
   - out of date comment w.r.t. bit 0 set in libvex_trc_values.h
   - can VG_TRC_BORING still happen?  if not, rm
   - memory leaks in m_transtab (InEdgeArr/OutEdgeArr leaking?)
   - move do_cacheflush out of m_transtab
   - more economical unchaining when nuking an entire sector
   - ditto w.r.t. cache flushes
   - verify case of 2 paths from A to B
   - check -- is IP_AT_SYSCALL still right?


Optimisations
~~~~~~~~~~~~~
ppc: chain_XDirect: generate short form jumps when possible

ppc64: immediate generation is terrible .. should be able to do
better

arm codegen: Generate ORRS for CmpwNEZ32(Or32(x,y))

all targets: when nuking an entire sector, don't bother to undo the
patching for any translations within the sector (nor with their
invalidations).

(somewhat implausible) for jumps to disp_cp_indir, have multiple
copies of disp_cp_indir, one for each of the possible registers that
could have held the target guest address before jumping to the stub.
Then disp_cp_indir wouldn't have to reload it from memory each time.
Might also have the effect of spreading out the indirect mispredict
burden somewhat (across the multiple copies.)


Implementation notes
~~~~~~~~~~~~~~~~~~~~
T-chaining changes -- summary

* The code generators (host_blah_isel.c, host_blah_defs.[ch])
  interact more closely with Valgrind than before.  In particular
  the instruction selectors must use one of 3 different kinds of
  control-transfer instructions: XDirect, XIndir and XAssisted.
  All archs must use these in the same way; no more ad-hoc control
  transfer instructions.  (more detail below)

* With T-chaining, translations can jump between each other without
  going through the dispatcher loop every time.  This means that the
  event check (counter dec, and exit if negative) that the dispatcher
  loop previously did now needs to be compiled into each translation.

* The assembly dispatcher code (dispatch-arch-os.S) is still present.
  It still provides table lookup services for indirect branches, but
  it also provides a new feature: dispatch points, to which the
  generated code jumps.  There are 5:

  VG_(disp_cp_chain_me_to_slowEP):
  VG_(disp_cp_chain_me_to_fastEP):
     These are chain-me requests, used for Boring conditional and
     unconditional jumps to destinations known at JIT time.  The
     generated code calls these (doesn't jump to them) and the stub
     recovers the return address.  These calls never return; instead
     the call is done so that the stub knows where the calling point
     is.  It needs to know this so it can patch the calling point to
     the requested destination.  (See the sketch just after this
     list.)

  VG_(disp_cp_xindir):
     Old-style table lookup and go; used for indirect jumps.

  VG_(disp_cp_xassisted):
     Most general and slowest kind.  Can transfer to anywhere, but
     first returns to the scheduler to do some other event (eg a
     syscall) before continuing.

  VG_(disp_cp_evcheck_fail):
     Code jumps here when the event check fails.
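  For illustration, here is a minimal, self-contained C model of the
  chain-me idea described above.  It is a sketch only: the names
  (run_translation, chain_me, jump_site, final_destination) are
  hypothetical stand-ins, and a function-pointer slot plays the role
  of the patchable call site in real generated code.

     #include <stdio.h>

     typedef void (*Dest)(void);

     /* Stand-in for the eventual jump target (another translation). */
     static void final_destination ( void ) { printf("at destination\n"); }

     /* Stand-in for the patchable site inside a translation: a
        function pointer slot rather than real machine code. */
     static Dest jump_site = NULL;

     /* Stand-in for VG_(disp_cp_chain_me_to_{slow,fast}EP): it is told
        which site made the request, patches that site to point at the
        real destination, and then continues to the destination. */
     static void chain_me ( Dest* site )
     {
        printf("chain-me request from site %p\n", (void*)site);
        *site = final_destination;   /* the "patch" */
        final_destination();         /* carry on to the destination */
     }

     /* Stand-in for a translation containing one XDirect transfer. */
     static void run_translation ( void )
     {
        if (jump_site)
           jump_site();              /* already chained: go direct */
        else
           chain_me(&jump_site);     /* not yet chained: ask for patching */
     }

     int main ( void )
     {
        run_translation();   /* first run goes via the chain-me stub */
        run_translation();   /* second run transfers directly */
        return 0;
     }

  On the second run the transfer no longer involves the stub at all,
  which is the effect that chaining/patching achieves for real
  generated code.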
* New instructions in backends: XDirect, XIndir and XAssisted.

  XDirect is used for chainable jumps.  It is compiled into a call to
  VG_(disp_cp_chain_me_to_slowEP) or VG_(disp_cp_chain_me_to_fastEP).

  XIndir is used for indirect jumps.  It is compiled into a jump to
  VG_(disp_cp_xindir).

  XAssisted is used for "assisted" (do something first, then jump)
  transfers.  It is compiled into a jump to VG_(disp_cp_xassisted).

  All 3 of these may be conditional.

  More complexity: in some circumstances (no-redir translations) all
  transfers must be done with XAssisted.  In such cases the
  instruction selector will be told this.

* Patching: XDirect is compiled basically into

     %r11 = &VG_(disp_cp_chain_me_to_{slow,fast}EP)
     call *%r11

  Backends must provide a function (eg) chainXDirect_AMD64 which
  converts it into a jump to a specified destination

     jmp $delta-of-PCs

  or

     %r11 = 64-bit immediate
     jmpq *%r11

  depending on branch distance.  Backends must provide a function
  (eg) unchainXDirect_AMD64 which restores the original
  call-to-the-stub version.

* Event checks.  Each translation now has two entry points, the slow
  one (slowEP) and the fast one (fastEP).  Like this:

     slowEP:
        counter--
        if (counter < 0) goto VG_(disp_cp_evcheck_fail)
     fastEP:
        (rest of the translation)

  slowEP is used for control flow transfers that are or might be a
  back edge in the control flow graph.  Insn selectors are given the
  address of the highest guest byte in the block so they can
  determine which edges are definitely not back edges.

  The counter is placed in the first 8 bytes of the guest state, and
  the address of VG_(disp_cp_evcheck_fail) is placed in the next 8
  bytes.  This allows very compact checks on all targets, since no
  immediates need to be synthesised, eg:

     decq 0(%baseblock-pointer)
     jns  fastEP
     jmpq *8(%baseblock-pointer)
     fastEP:

  On amd64 a non-failing check is therefore 2 insns; all 3 occupy
  just 8 bytes.

  On amd64 the event check is created by a special single
  pseudo-instruction AMD64_EvCheck.
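  As a C-level illustration of the layout just described -- a sketch
  only, with made-up type and field names rather than the actual VEX
  guest-state declarations:

     #include <stdint.h>

     /* Illustrative: the first 16 bytes of the guest state as the
        event check sees them. */
     typedef struct {
        int64_t  evc_counter;    /* offset 0: decremented at each slowEP */
        uint64_t evc_failaddr;   /* offset 8: &VG_(disp_cp_evcheck_fail) */
        /* ... remainder of the guest state ... */
     } GuestStateHeader;

     /* What the generated slowEP prologue does, written in C.  On
        amd64 it is the decq/jns/jmpq sequence shown above. */
     static inline int event_check_fails ( GuestStateHeader* bb )
     {
        bb->evc_counter--;
        /* if negative, the generated code jumps to bb->evc_failaddr */
        return bb->evc_counter < 0;
     }

  Because both fields sit at small fixed offsets from the baseblock
  pointer, the generated check needs no synthesised constants, which
  is what keeps it so short.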
* BB profiling (for --profile-flags=).  The assembly dispatcher code
  (dispatch-arch-os.S) no longer deals with this and so is much
  simplified.  Instead the profile inc is compiled into each
  translation, as the insn immediately following the event check.
  Again, on amd64 a pseudo-insn AMD64_ProfInc is used.  Counters are
  now 64 bit even on 32 bit hosts, to avoid overflow.

  One complexity is that at JIT time the address of the counter is
  not known.  To solve this, VexTranslateResult now returns the
  offset of the profile inc in the generated code.  When the counter
  address is known, VEX can be called again to patch it in.  Backends
  must supply eg patchProfInc_AMD64 to make this happen.

* Front end changes (guest_blah_toIR.c).  The way the guest program
  counter is handled has changed significantly.  Previously, the
  guest PC was updated (in IR) at the start of each instruction,
  except for the first insn in an IRSB.  This is inconsistent and
  doesn't work with the new framework.

  Now, each instruction must update the guest PC as its last IR
  statement -- not its first -- and there is no special exemption for
  the first insn in the block.  As before, most of these updates are
  optimised away by ir_opt, so there are no concerns about
  efficiency.

  As a logical side effect of this, exits (IRStmt_Exit) and the
  block-end transfer are both considered to write to the guest state
  (the guest PC) and so need to be told its offset.

  IR generators (eg disInstr_AMD64) are no longer allowed to set
  IRSB::next to specify the block-end transfer address.  Instead they
  indicate, to the generic steering logic that drives them (iow,
  guest_generic_bb_to_IR.c), that the block has ended.  This then
  generates, in effect, "goto GET(PC)" (which, again, is optimised
  away).  What this does mean is that if the IR generator function
  ends the IR of the last instruction in the block with an incorrect
  assignment to the guest PC, execution will transfer to an incorrect
  destination -- making the error obvious quickly.
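  A minimal sketch of the new convention, assuming the standard IR
  constructors from libvex_ir.h; the helper name, its parameters and
  the 64-bit guest width are hypothetical:

     #include "libvex_ir.h"

     /* Hypothetical helper showing where the PC write now goes: it is
        emitted as the *last* IR statement for each guest instruction,
        rather than at the start of the next one. */
     static void put_PC_at_insn_end ( IRSB* irsb,
                                      Int   offsetof_guest_PC,
                                      ULong next_insn_addr )
     {
        addStmtToIRSB(
           irsb,
           IRStmt_Put( offsetof_guest_PC,
                       IRExpr_Const( IRConst_U64( next_insn_addr )) ) );
        /* The block-end transfer itself is no longer written by
           disInstr_*; guest_generic_bb_to_IR.c supplies what amounts
           to "goto GET(PC)", which ir_opt then folds away. */
     }

  A 32-bit guest would use IRConst_U32 instead.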