This introduces speculative constants, allowing FIFO writes to be
optimized in more places.
It also clarifies the guarantees of the FIFO optimization, changing
the location of some of the checks and potentially avoiding redundant
checks.
It should now be easier to merge sequences longer than two instructions.
Also correct some minor inconsistencies in behavior between instruction
merging cases.
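A conceptual sketch of how such a guard can work, using illustrative names rather than the real JIT interfaces: the block is compiled assuming a guest GPR holds a known value (such as the FIFO write-pointer address), so stores through it can be emitted as direct FIFO writes, and a cheap check validates the assumption before the optimized code runs.

```cpp
#include <cstdint>

// Conceptual sketch only; names are illustrative, not Dolphin's actual API.
struct SpeculativeConstant
{
  int gpr;         // guest register the assumption is about
  uint32_t value;  // value the block was compiled against
};

bool CheckSpeculativeConstants(const uint32_t* guest_gpr,
                               const SpeculativeConstant* consts, int count)
{
  for (int i = 0; i < count; i++)
  {
    if (guest_gpr[consts[i].gpr] != consts[i].value)
      return false;  // assumption broken: bail out and recompile without it
  }
  return true;
}
```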
Optimistically assume used GQRs are 0 in blocks that only use one GQR, and
bail at the start of the block and recompile if that assumption fails.
Many games use almost entirely unquantized stores (e.g. Rebel Strike, Sonic
Colors), so this will likely be a big performance improvement across the board
for games with heavy use of paired singles.
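Roughly, the guard looks like the following sketch (hypothetical names, not the real JIT types): the block records which GQR it depends on and the value it was compiled against, and the entry check requests a recompile when the assumption no longer holds.

```cpp
#include <cstdint>

// Sketch only: a block that touches a single GQR is compiled assuming that GQR
// is 0, so its psq_l/psq_st become plain unquantized float loads/stores.
struct JitBlockInfo
{
  int gqr_index = -1;        // the single GQR this block depends on, or -1 for none
  uint32_t assumed_gqr = 0;  // value the block was compiled against
};

enum class EntryResult { Run, Recompile };

EntryResult CheckBlockEntry(const JitBlockInfo& block, const uint32_t* gqr)
{
  if (block.gqr_index >= 0 && gqr[block.gqr_index] != block.assumed_gqr)
    return EntryResult::Recompile;  // assumption failed: bail and recompile
  return EntryResult::Run;
}
```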
If the inputs are both float singles, and the top half is known to be identical
to the bottom half, we can use packed arithmetic instead of scalar to skip
the movddup.
This is slower on a few rather old CPUs, as well as on Atom/Silvermont, so detect
Atom and disable it in that case.
Also avoid PPC_FP on stores if we know that the output came from a float op.
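As an illustration (not the emitter's actual code), the two strategies look like this with SSE intrinsics: when both halves of each input already hold the same double, the packed form produces a duplicated result directly, so the movddup that re-broadcasts the scalar result can be dropped.

```cpp
#include <immintrin.h>

// Illustration only: both lanes of a and b are assumed to hold the same double.
__m128d fadd_scalar_then_dup(__m128d a, __m128d b)
{
  __m128d lo = _mm_add_sd(a, b);  // addsd: only the low lane is computed
  return _mm_movedup_pd(lo);      // movddup: copy it into the high lane
}

__m128d fadd_packed(__m128d a, __m128d b)
{
  return _mm_add_pd(a, b);        // addpd: both lanes, result already duplicated
}
```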
This is a higher-level, more concise wrapper for bitsets that supports
efficiently counting and iterating over set bits. It's similar to
std::bitset, but the latter does not support efficient iteration (and at
least in libc++, the count algorithm is subpar, not that it really
matters). The converted uses include both bitsets and, notably,
considerably less efficient regular arrays (for in/out registers in
PPCAnalyst).
Unfortunately, this may slightly pessimize unoptimized builds.
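A minimal sketch of the idea (not the actual implementation): wrap an integer, use popcount for counting, and iterate by repeatedly extracting the lowest set bit.

```cpp
#include <bit>
#include <cstdint>

// Minimal sketch, not Dolphin's real BitSet.
struct BitSet32
{
  uint32_t m_val = 0;

  void set(int bit) { m_val |= 1u << bit; }
  bool test(int bit) const { return (m_val >> bit) & 1; }
  int count() const { return std::popcount(m_val); }

  struct Iterator
  {
    uint32_t rest;
    int operator*() const { return std::countr_zero(rest); }
    Iterator& operator++() { rest &= rest - 1; return *this; }  // clear lowest set bit
    bool operator!=(const Iterator& other) const { return rest != other.rest; }
  };
  Iterator begin() const { return {m_val}; }
  Iterator end() const { return {0}; }
};

// Usage: for (int reg : regs_in_use) { ... }  // visits only the set bits
```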
Should be at least a bit better than the previous LRU approach. Currently
has two basic components: whether a register is dirty (dirty registers need
to be stored, so clobbering them hurts more) and how many other registers will
be used between now and the next time that register gets used.
Also don't pre-load values that don't need to be in registers.
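A sketch of the scoring idea with made-up names (the real heuristic lives in the register cache): prefer to clobber clean registers whose next use is far away.

```cpp
// Hypothetical scoring sketch. Higher score = cheaper to clobber; the cache
// evicts the candidate with the highest score.
struct CachedRegInfo
{
  bool dirty;             // dirty registers must be stored back before reuse
  int uses_until_needed;  // other registers used before this one is needed again
};

int ClobberScore(const CachedRegInfo& reg)
{
  int score = reg.uses_until_needed;  // needed soon = expensive to evict
  if (reg.dirty)
    score -= 2;                       // the store-back costs extra instructions
  return score;
}
```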
This should dramatically reduce code size in the case of blocks with
lots of branches, and certainly doesn't hurt elsewhere either.
This can probably be improved a good bit through smarter tracking of register
usage, e.g. discarding registers that are going to be overwritten, but this
is a good start and should help reduce code size and register pressure.
Unlike that sort of change, this is a "safe" patch; it only flushes registers,
which can't affect correctness, unlike actually discarding data.
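The mechanism is roughly this (sketch with stand-in names, not the real register cache API): anything bound to a host register but absent from the "still needed in this block" set gets flushed.

```cpp
#include <bit>
#include <cstdint>

// Sketch only: flushing stores the value back and frees the host register, so
// a conservative "needed later" set can't change behaviour; at worst it costs
// an extra reload.
struct RegCacheStub
{
  uint32_t bound = 0;  // bitmask of guest GPRs currently held in host registers
  void Flush(int guest_reg) { bound &= ~(1u << guest_reg); /* store back if dirty */ }
};

void FlushUnneeded(RegCacheStub& cache, uint32_t needed_later)
{
  uint32_t flushable = cache.bound & ~needed_later;
  while (flushable != 0)
  {
    cache.Flush(std::countr_zero(flushable));  // flush the lowest-numbered candidate
    flushable &= flushable - 1;                // clear that bit
  }
}
```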
As part of this, refactor PPCAnalyst to support distinguishing between
float and integer registers (to properly handle instructions that access
both, like floating-point loads and stores).
Also update every instruction in the interpreter flags table I could find
that didn't have all the correct flags.
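For illustration, the float/integer split in the per-instruction analysis data ends up shaped roughly like this (field names are approximate):

```cpp
#include <cstdint>

// Illustrative shape only: integer and float register usage is tracked
// separately, so a float load like lfd f1, 0(r3) reports r3 as an integer
// input and f1 as a float output.
struct AnalyzedOp
{
  uint32_t gprsIn = 0, gprsOut = 0;  // integer registers read / written
  uint32_t fprsIn = 0, fprsOut = 0;  // float registers read / written
};
```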
Tries as hard as possible to push carry-using operations (like addc and adde)
next to each other. Refactor the instruction reordering to be more flexible
and allow multiple passes.
353 -> 192 x86 instructions on a carry-heavy code block in Pokemon Puzzle.
12% faster overall in Pokemon Puzzle; probably less in typical games (Virtual
Console games seem to be carry-heavy for some reason; maybe a different
compiler?).
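Keeping the carry producer and consumer adjacent lets the backend keep the carry in the host flags instead of spilling it to XER between them. The swap legality check used during reordering is roughly the following sketch (illustrative names, not the actual PPCAnalyst code):

```cpp
#include <cstdint>

// Sketch only: an instruction may be moved past its neighbour if there is no
// register dependency between them and the move doesn't break a carry chain.
struct OpDeps
{
  uint32_t gprsRead = 0, gprsWritten = 0;  // bitmasks of GPRs touched
  bool readsCarry = false, writesCarry = false;
};

bool CanSwapAdjacent(const OpDeps& a, const OpDeps& b)
{
  // Register dependencies: a writes what b touches, or b writes what a reads.
  if ((a.gprsWritten & (b.gprsRead | b.gprsWritten)) || (b.gprsWritten & a.gprsRead))
    return false;
  // Carry dependencies in either direction.
  if ((a.writesCarry && b.readsCarry) || (b.writesCarry && a.readsCarry))
    return false;
  return true;
}
```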
This is the bare minimum required to run a few games on AArch64.
Was able to run starfield and Animal Crossing to the Nintendo logo.
QEMU emulation is literally the slowest thing in the world; it maxes out at around 12 MHz on my Core i7-4930MX.
Doesn't support all the FPSCR flags, just the FPRF ones.
Add PPCAnalyzer support to remove unnecessary FPRF calculations.
POV-ray benchmark with enableFPRF forced on for an extreme comparison:
Before: 1500s
After, fmul/fmadd only: 728s
After, all float: 753s
In real games that use FPRF, like F-Zero GX, FPRF previously cost a few percent
of total runtime.
Since FPRF is so much faster now, if enableFPRF is set, just do it for every
float instruction, not just fmul/fmadd like before. I don't know if this will
fix any games, but there's little good reason not to.
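For reference, the FPRF class field (FPSCR bits 15-19: C, FL, FG, FE, FU) depends only on the sign and class of the result; a minimal sketch of that classification, per the PowerPC result-flags table:

```cpp
#include <cmath>
#include <cstdint>

// Minimal sketch of FPRF classification following the PowerPC result-flags table.
uint32_t ClassifyDouble(double d)
{
  const bool sign = std::signbit(d);
  switch (std::fpclassify(d))
  {
  case FP_NAN:       return 0x11;                // quiet NaN
  case FP_INFINITE:  return sign ? 0x09 : 0x05;  // -inf / +inf
  case FP_ZERO:      return sign ? 0x12 : 0x02;  // -0 / +0
  case FP_SUBNORMAL: return sign ? 0x18 : 0x14;  // -denormal / +denormal
  default:           return sign ? 0x08 : 0x04;  // -normal / +normal
  }
}
```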
The issue was that on a memory exception we wouldn't call into PPCAnalyst, so our code_block would retain the previous block's information.
This would cause us to compile in the previous block's instructions prior to the exception exit.