This blog post is an introduction to dynamic recompilers (dynarecs), and hopes to provide some insight on how they work and why pcsx2 uses them to speed up emulation.
It is probably easier to read on our forums, because some of the code didn't wrap nicely on our main blog page....
(
Click here to view blog post in forum)
To first understand why dynarecs are useful, you must first be familiar with a basic interpreter emulator.
Assume we are emulating a very simple processor. Processors have instruction sets which are a set of different instructions they can compute.
Lets assume the processor we are emulating is a made-up chip I'll call SL3 (super lame 3), and has only these 3 instructions (and each instruction has fixed width of 4 bytes):
MOV dest_reg, src1_reg // Move source register to destination register
ADD dest_reg, src1_reg, src2_reg // Add source1 and source2 registers, and store the result in destination register
BR relative_address // Branch (jump) to relative address (PC += relative_address * 4)
Processors generally have what we call registers which can hold data, and the processor's instructions perform the operations on these registers.
For our example, we will assume that our SL3 processor has 8 registers, and the size of these registers is 32 bits (so each register holds 32 bits of data).
Now to program for this processor, we can have the following code:
Code:
MOV reg1, reg0
ADD reg4, reg2, reg3
BR 5
What this code does is:
1) It moves register 0 to register 1 (so now register 1 holds a copy of register 0's data).
2) It adds register 2 and register 3 together, and stores the result in register 4.
3) It branches 5 instructions further away (so now it jumps to some code that is further down (not shown in above example))
So that is how we can program for the SL3 processor in assembly code. But how do we emulate it?
To actually emulate this processor we can use an
interpreter. An interpreter simply fetches each
instruction opcode and executes them accordingly (e.g. by calling emulated methods for each different instruction). The rest of the emulator (emulating other processors/peripherals of our system) can then get updated sometime in between instructions or after a group of cpu instructions are run. Interpreters are a simple and complete way to emulate a system.
(
Click here to see a C++ code example of a simple interpreter)
Using interpreters we constantly have to be fetching and executing instructions one-by-one. There is a lot of overhead in this, and minimal room for optimization since most special case optimizations will have the overhead of checking for them (so it will for example add extra if-statements and conditionals... reducing the gain from the optimization). But there's a faster way to do processor emulation which doesn't have these draw-backs... using dynamic recompilation!
The basic idea of dynamic recompilation is to translate emulated instructions once, cache the emitted translated instructions, and then run the emitted native instructions as much times as needed.
Since the instructions you read are not changing (lets leave out self-modifying code for this example), you can translate the emulated instructions into native cpu instructions (in our case x86-32 asm instructions) and then cache the translated instructions into 'blocks' of code, then just execute these blocks of native code instead of having to do the translation over and over again whenever the code needs to be read again.
So for instance remember our above SL3 program:
Code:
MOV reg1, reg0
ADD reg4, reg2, reg3
BR 5
Lets assume this code is part of some function and gets called 100's of times a second (this could sound crazy, but games/applications commonly call the same code hundreds or thousands of times a second).
Now our runCPU() interpreter function above will have to translate every instruction before it can actually compute the result.
That is, it needs to fetch the opcode of every instruction, call the emulated function based on the opcode number, then actually compute the instruction result.
But dynarecs allow us to skip the "fetch opcode" and the "call the emulated function" part, by only doing this once, and then caching the translated code into native asm blocks.
To make a good dynarec, we first need a code emitter.
An emitter is a series of functions we call that write native asm to some memory block we give it.
So we use an x86-32 emitter to write native x86-32 asm code to blocks of memory, and then later we can execute these blocks as if they were normal c++ generated functions!
PCSX2 has a very cool emitter that looks very similar to x86-32 assembly, except the instructions have an 'x' before them.
So for example:
mov eax, ecx;
is
xMOV(eax, ecx);
with the pcsx2 emitter.
Now the idea behind the dynarec we are going to write now, is that we will end blocks whenever a branch instruction is detected (which will jump to other blocks).
The code for actually recompiling these blocks looks something like this:
Code:
// This is our emulated MOV instruction
void MOV() {
u8 dest = fetch(); // Get destination register number
u8 reg1 = fetch(); // Get source 1 register number
xMOV(eax, ptr[&cpuRegs[reg1]]); // Move reg1's data to eax
xMOV(ptr[&cpuRegs[dest]], eax); // Move eax to dest register
fetch(); // This fetch is needed because every instruction in our SL3 processor is 4 bytes
}
// This is our emulated ADD instruction
void ADD() {
u8 dest = fetch(); // Get destination register number
u8 reg1 = fetch(); // Get source 1 register number
u8 reg2 = fetch(); // Get source 2 register number
xMOV(eax, ptr[&cpuRegs[reg1]]); // Move reg1's data to eax
xADD(eax, ptr[&cpuRegs[reg2]]); // Add eax with reg2's data
xMOV(ptr[&cpuRegs[dest]], eax); // Move eax to dest register
}
// This is our emulated BR (jump) instruction
void BR() {
s8 addr = fetch(); // Gets a number by which we will increment (or decrement if negative) PC by
PC = (PC - 2) + (addr * 4);
// Get a pointer to a block of x86-32 compiled code
// that was recompiled by the recompileSL3() function
u8* block_pointer = getBlock(PC);
xJMP(block_pointer); // Jump to the pointer returned by getBlock()
}
// This goes through instructions and recompiles them
// It recompiles instructions until it reaches a BR() instruction.
u8* recompileSL3(u32 startPC) {
u8* startPtr = xGetPtr(); // Gets pointer to where the emitter is currently pointing to (the start pointer of the block)
PC = startPC; // Set PC to the start PC of this block
bool do_recompile = true;
while (do_recompile) {
u8 opcode = fetch();
switch (opcode) {
case 0: MOV(); break;
case 1: ADD(); break;
case 2: // We stop recompiling on branches
BR();
do_recompile = false;
break;
}
}
return startPtr; // Returns the pointer to where our block of x86 generated code starts at
}
// This holds all the pointers to our blocks that were recompiled based on
// starting PC address. We will assume that the instruction memory for
// this processor is 16kb, which means that it can hold 1024*16 instructions
// and therefor we we have at-most 1024*16 block pointers.
static u8* blockArray[1024*16];
// This returns a pointer to our recompiled block
// If it hasn't been compiled, it'll recompile the block and then return that pointer.
// We use __fastcall because it lets us pass the startPC parameter in the ecx register
// instead of having to use the x86 stack...
u8* __fastcall getBlock(u32 startPC) {
if (blockArray[startPC] == null) {
blockArray[startPC] = recompileSL3(startPC);
}
return blockArray[startPC];
}
// Basic cpu emulator using dynamic recompilation
void runCPU() {
// This sets our emitter to start emitting instructions to rec_cache
// which is a large block of memory where we can write lots of
// x86 asm instructions to...
x86setPtr(rec_cache);
__asm {
pushad; // Save all our registers
mov ecx, PC; // Move PC parameter into ecx register (for __fastcall)
call getBlock; // Call the getBlock function
jmp eax; // The block to jump to is returned in eax
}
}
Note the above code doesn't have any logic to successfully exit once it starts executing recompiled blocks... I left this stuff out in order to not complicate things... so assume that somehow execution ends and we can get back to running the other parts of the emulator...
Also note we need to restore registers when we exit execution, and we also need to set rec_cache to the new x86 emitter address (so it can continue where it left off instead of writing over already recompiled memory blocks).
(
Click here for a more complete sl3 recompiler code with cycle counting and exit logic)
Now how does this work?
Well we basically call runCPU() which calls the getBlock() function with the original PC value.
getBlock() then checks if we have already recompiled a block with that start PC value, which we havn't yet, so it will go on to call recompileSL3() and give it the startPC value.
recompileSL3() will loop through the opcodes, fetching them and then calling the appropriate emulated functions which will write to memory x86-32 asm instructions computing the correct results.
recompileSL3() will continue looping until it reaches a BR() instruction (which will recursively call getBlock() until all the addresses have been recompiled and no more recompilation needs to happen).
Once everything has been recompiled we jump to the starting block's execution address, and that's how we actually run the execution.
The code that ends up being executed after we recompiled is only the emitted code, which were the functions prefixed with 'x' (xMOV, xADD, etc...).
Notice that that's a lot of code omitted as we don't have to fetch anything anymore, but instead just run a series of fast generated asm functions...
This means that although we have a lot of extra overhead on the actual recompilation, we end up generating some really fast code, that could be called 1000's of times a second, therefor making the initial overhead well worth it!
We can also perform many optimizations while recompiling such as regalloc, constant propagation, optimized asm generation for special opcode cases, etc... (we didn't do this stuff in our example to avoid complexities).
Hopefully this article has helped you gain a bit of understanding on how dynamic recompilation works, and why we use it in pcsx2 to gain a lot more speed over interpreters!
Also note that dynarecs can be used for more than processor emulation, we also used dynamic recompilation in pcsx2 to cache asm optimized unpack routines that the ps2's vif does.
I should also add that currently pcsx2 has the following dynarecs:
EE recompiler (for MIPS R5900 ee-core processor)
IOP recompiler (for MIPS R3000A i/o processor)
microVU recompiler (for VU0/VU1, and COP2 instructions)
Super VU recompiler (can be used instead of microVU for VU0/VU1)
newVIF unpack recompiler (recompiles vif unpack routines)
r3000air (not yet finished, but should one day supersede the IOP recompiler)
One problem that has plagued PCSX2 since as long as most anyone can remember is it's general inflexibility and instability when flipping between windowed and fullscreen modes. This was something we really sought to address as we re-write the user interface in 0.9.7.
In 0.9.6 the situation was grim: The flip never really worked right in DX9 -- there's an "escape hack" that completely shuts down the emulator just to flip out of fullscreen mode. Hitting alt-enter is usually less forgiving (the plugin handles it, and doesn't allow PCSX2 to completely shut everything down and restart). In DX10 things were better: alt-enter typically works a couple times, but do it too often, or get an unlucky flip, and it'll still result in lost or corrupted video and possibly a crash depending on video drivers and (lack of) luck.
In 0.9.7 we've completely re-done our approach to fullscreen. Instead of using what Microsoft DirectX calls
Exclusive Mode (you can read about the programmer-centric details of it
here), we're taking a more modern approach and using a special type of maximized desktop window instead. Like anything there's some advantages and disadvantages to this new approach:
Advantages of the new fullscreen method:
- Works perfectly in DX9 and DX10 alike.
- No more risk of visual corruption or crashes, and no need to shutdown the emulator to avoid them.
- Much faster and seamless flips.
- Works with any GS plugin, regardless of how the GS plugin is implemented.
- Always uses your LCD display's optimal resolution (assuming you have it set in the desktop as such, and you should).
- Integrates better with your desktop -- Alt-Tab, TaskBar, popup errors, etc. are much less prone to being... annoying. (pulling up a strategy guide in a brwoser window, for example!)
- Super-easy to implement, from a programming perspective!
Disadvantages:
- A slight bit of extra window management overhead.
- Always uses your desktop resolution on CRT displays (this is an advantage for anyone with an LCD display, but can be a disadvantage for people with CRT displays, depending on your setup .. however few of you are left)
The performance benefit of exclusive fullscreen is mostly related to Aero under Vista/Windows7. in which case the performance is sometimes
better in a window over exclusive mode (this depending on video card/drivers/etc).
The thing that really sold us on it though is the fact that overhead of the non-exclusive fullscreen mode is minimal on modern GPUs. The only real advantage of using exclusive fullscreen is that it bypasses Aero and increases the "multimedia priority" of the app. If you already have Aero disabled, then neither of those apply anyway. And looking toward the future: the next generation of GPUs will reduce that overhead of Aero even further to the point where even that will likely not make a significant impact on performance. So the idea of a seamlessly working fullscreen flip for all DirectX incarnations on all incarnations of Windows over a rather iffy, unstable, and fragile fullscreen flip that
might be 2-3% faster on legacy hardware ended up being a no-brainer.
We're still leaving the door open for adding optional support for exclusive mode fullscreen, since there could still be some use to it for special scenarios like CRT displays and TV projection; though there's no timetable for the implementation of the option -- and it would depend on the GS plugin to support it properly otherwise it'll still be the corruption/crash bomb that it's always been up to now.
One thing is for sure: The new 0.9.7 betas will use a
lot more threads than the current 0.9.6 releases. Now this doesn't necessarily mean the emulator will take advantage of quad core CPUs better than 0.9.6, least not in a gameplay sense. As I explained in my previous blog, threading is as much a function of improving responsiveness and recoverability as it is about sharing a workload across multi-core cpus, and so far most of the threading implemented into 0.9.7 is the scalable/responsive sort.
One of the major changes in 0.9.7 will be the removal of what I call an
"aggressive spinwait" in the
EmotionEngine (EEcore) emulation unit. A spinwait is a simple loop that waits on a variable to change, like so:
Code:
volatile bool IsRunning = true;
StartThreadedAction();
while( IsRunning );
// When the above while() exits, the ThreadedAction is done.
This is a very simple threading design, but it's mostly drawbacks and not many advantages. We've continued to use it up to now in PCSX2's EEcore because there wasn't much reason to do away with it; and with the EEcore being the main orchestrator of everything in PCSX2 (gui included!), having the high-resolution responsiveness of a spinwait made some sense.
On the current design in 0.9.6, the EEcore and the GUI share time on the same thread, and when the GS thread is busy, the EEcore will split time between waiting on the GS (via the spinwait) and processing GUI messages. This transition from EEcore emulation to GUI message processing was typically costly, but was necessary to handle input from the user. In the new 0.9.7 design, the EEcore has its own thread separate from the GUI. This allows us to remove the overhead of having to switch to/from GUI processing code, but it came with a somewhat unexpected drawback: the EEcore's aggressive spinwait is now suddenly
very aggressive when a game becomes GS-limited. In 0.9.6 the spinwait breaks from time to time to splice in some gui messages and make time for the GS to do its thing too. This kept everything pretty happy. But with the splicing gone, the EEcore is allowed to run free, soaking up tons of resources simply re-testing the same variable over and over.
The full impact became obvious when we realized that setting 2 software threads in GSdx caused PCSX2 to slow to a crawl (sub 1fps!) on dual core systems. That's what happens when you have three threads using spinwaits on a dual core system -- they completely starved out everything else, and to some extent each other as well. (yes, GSdx software uses spinwaits also!)
The primary solution is to get rid of the spinwait in the EEcore. In its wake we'll put the EEcore to sleep and have it wake up only once the MTGS ringbuffer has emptied to a satisfactory percentage. With the EEcore asleep, the GSdx thread(s) will have full reign over all the resources of the cpu; which will allow it to play "catch up" more efficiently than it can even in 0.9.6. This model will be an obvious win for both software rendering, and possibly DX11's multithreaded pipeline in the future.
It's the year 2009, and it's almost over at that; and as anyone reading this blog well knows, multithreaded applications are the here-and-now and future of desktop computing. It's the only way we can take advantage of multicore CPUs. But multithreaded programming offers more than just improved multicore performance. Using threaded programming is actually very important to developing software that
behaves nicely. By that I mean software that refreshes its window contents quickly, responds to your mouse clicks, and lets you cancel stuff.
For that the best approach is usually threading, with the alternative being something called "Cooperative multitasking" where by a program is written such that it splits all tasks into neat little chunks. For example, the two possible ways to implement loading an image (let's say a png image):
* Load the image one scanline at a time, and then after each scanline manually check for keyboard, mouse, or other input, and refresh the screen.
* Load the image using a thread, and let the usual "global" windows message handler dispatch keyboard, mouse, and refresh messages as usual.
The second approach has several advantages. For one, it needs fewer temporary heap allocations (which are typically slow and fragment memory). It is more responsive: windows messages will be handled in parallel to the image loading, so you don't even need to "wait" until the end of a scanline for user input to have its effect. It's also more scalable: while the first system is able to load one image at a time only in co-operative fashion (extending it to support multiple is possible, but very difficult), the threaded approach can be scaled to load dozens of images at once with no additional complications.
The drawback is that thread synchronization and
especially structured error handling across threads tends to be much more complicated than that of the linear cooperative model. If you don't have errors to handle, or don't really
care about handling errors, then threaded tasking isn't so bad.
Enter PCSX2, where everything ends up being damn complicated. Being a perfectionist, I figured I'd design the new GUI completely on the threaded model, doing away with cooperative design almost completely. Such a design should help avoid any deadlocking scenarios and allow the emu to recover from almost any error gracefully. Problem: The emulator has a lot of inter-dependent parts and pieces that need to be interlocked and synchronized, and all of them can throw out a variety of errors -- which too I'd like to handle smartly; requesting extra user input when appropriate (and not just throwing out annoying or vague message boxes).
Interlocking dependencies can be a nightmare. For example, if you start a thread that loads an image, and then
block on that thread until it completes, you're worse off than if you wrote yourself a cooperative image loader because now the whole program stalls waiting for the thread to complete anyway. And like everything else, there are two ways to handle this:
(1) Use a "friendly" blocking mechanism that periodically polls the user input and updates display. This is no better than cooperative single-thread designs though, as it has slow response times and doesn't scale well to multiple threads.
(2) Build your entire GUI around "messages" and "callbacks" (sometimes also called "signals"). This is the most flexible and user-friendly option but can add a lot of "framework" to any codebase.
I tried to use the first approach initially, because I was in a hurry to get things working. But it's been problematic since day 1, so now I'm redoing most things to use the second method instead.
The second one is in fact the recommended design by Microsoft, and one they've been using for almost everything in Windows ever since Win95. It's one of the reasons the Win32 API feels "heavy" to a lot of programmers, but as it turns out, it's not without good reason.