- Created: 23 June 2009
- Written by Jake Stine
There are a lot of buzzwords in emulator recompilation. Popular ones include:
- Intermediate Representation
- Intermediate Language
- Register Mapping
- Register Allocation
- Constant Propagation
- Constant Folding
- Register coloring
- SSA (static single assignment optimization)
What do they mean? And perhaps more importantly, what do they mean to a user of the emulator? Truth is, not much usually. Most of these things are technologies and strategies borrowed from high level language compilers like those for C/C++, C#, Java, etc. Some of them are useful to emulator recompilers, some not so much.
The first thing to consider when working on a recompiler is that we are working with what is most likely optimized code. The machine code that your favorite games are running on your PS2 is already parsed and optimized by a compiler (or in the case of older consoles, hand-optimized). Secondly, recompilers typically have a single-block scope limitation (which is hard to explain, but basically means that compilation stops when branch conditionals are encountered). This all but eliminates the usefulness of SSA and register coloring techniques, since their main benefits are in applying optimizations over a series of conditional code blocks, and elimination of dead code. Furthermore, even when higher level optimizations can be applied, the emulated CPU/Register states must still be guaranteed at frequent intervals. Unlike high level languages, an emulator must manually track things like the Program Counter, instruction pipeline stalls, and other hardware complications. So typically the benefits of such cross-block optimizations get watered down anyway.
Constant Folding and Constant Propagation are, for all intents and purposes, the same thing. They always work together, and most people use the terms interchangeably. You'll never really find a practical situation where one is used without the other. Constant folding refers to the evaluation of constant expressions like 100 * 5. Constant propagation refers to the substitution of variables with known constants, like:
x = 100 * 5;
y = x * 10;
... in which case, the value of 'y' is known at compilation time, and can be further substituted (propagated) anywhere 'y' is used (this should remind you of your 5th grade algebraic homework!). As mentioned before, many people use the terms interchangeably, so when you see another emulator talk of Constant Folding, they mean the same thing I do when I talk of Constant Propagation. The term 'propagation' is technically more correct, but folding is easier to type and looks nicer when naming functions in your recompiler.
Intermediate Language (IL) and Intermediate Representation (IR) are more or less the same thing as well. IL simply implies an IR that has a (mostly) human-readable form. That is, an IL is in fact a programming language, usually bearing some similarity to assembly code, but simpler and more sensible. An IR is just a raw collection of data sufficient to represent all information needed to optimize code and generate the final recompiled product.
In either case, the dual purposes of an IR/IL are:
- To simplify the instruction set as much as possible so that optimizations can be analyzed without requiring a lot of special-case code.
- To provide a platform-independent "stage" to the compilation process, so that ports of the recompiler to new target platforms need not be rewritten from the ground up.
Because of the complicated nature of an emulator, an IL itself is virtually useless. There's typically too much per-instruction cpu state information that has to be tracked for a human-readable language output to be viable. But for the same reason an IR can be remarkably helpful in reducing the overall complexity of a recompiler implementation, and is almost a foregone necessity when implementing Register Mapping.
Register Mapping and Register Allocation are once again a set of fairly interchangeable terms. I prefer the term mapping, but other folks like to call it allocation. Register mapping/allocation is typically one of the final stages of recompilation since it's dependent on the target platform (in our case x86), and is also one of the most complex. It's also typically not very beneficial for performance when the target platform is an x86 machine (which in our case it is), unless register mapping algorithms are very clever.
As I mentioned in my previous blog entry, the most significant factor in a recompiler's speed is simply the fact that it's acting as a decoded-instruction cache, and that it executes instructions in bursts (a block at a time instead of one-by-one). All the rest of these techniques tend to be more for the factors of code maintainability and academic challenge than for end-user performance gains.
- Created: 02 June 2009
- Written by Jake Stine
One of the less obvious things that has plagued Pcsx2's compatibility over the years is its event handling system. The system in place as of 0.9.6 is adequate for interpreter-based emulation but is not well-equipped to handle the methods that a recompiler uses to manage cpu cycle updates. This is something we aim to fix in the coming weeks.
Cycle-based Timing Explained
All cpus have a cycle rate, which is typically the Mhz/Ghz values you're most familiar with when talking about any cpu. An i7 clocked at 2.83ghz has a 2.83ghz cycle rate. Now, the actual throughput of instructions can vary greatly since each cycle of the cpu consists of several stages and multiple piplines, each of which can have dependency stalls and has varying rules for when such stalls occur. The cycle rate, however, is always 2.83ghz. Because cycle rates are a known constant, they make a good barometer for synchronizing the activities of a multi-processor design like the Playstation 2.
Why do Recompilers Complicate Event Testing?
Recompilers work as a significant speedup over interpreters by doing two things:
- Recompile the machine code of a emulated CPU (in our case MIPS instructions) into code native to the host machine (ix86 instructions).
- Prefetch and pre-decode emulated instructions, and inline them into blocks.
The thing recompilers are most well-known for -- recompiling to native machine code -- is actually the less effective of the two things recompilers do for speeding up emulation. The primary speedup typically comes from the prefetching and inlining of instructions, which in addition to eliminating the instruction fetch/decode stage (by far the slowest part of any interpreter), also allows for cross-instruction optimizations such as constant propagation and register caching/mapping. In other words, a recompiler is effectively executing emulated instructions in pre-compiled bursts. This is so important to performance that a recompiler without block-level execution would hardly be any faster than an interpreter.
As part of the design of block-level execution, the recompiled code only updates cpu cycle counts and tests for scheduled events at block boundaries. Blocks typically span 5 to 35 cycles, but in some cases can span a hundred cycles or more. When the subsequent Event Test is performed, several scheduled events may be pending execution. This is where problems can occur: The current event system implemented into Pcsx2 executes all pending events in no particular order, leading to events being executed out-of-order when multiple events time-out during a single block. Typically most events don't have dependencies on each other, or games don't use them in a way that execution order matters. But sometimes they do, and in those cases behavior can be unpredictable, or can cause the game to fail outright. To make matters worse, the pending events typically don't know how late they are, and will re-schedule subsequent events in increasingly belated fashion. The current implementation of EE and IOP counters have tons of complicated code meant to compensate for this limitation (both slow and were nearly impossible to get right).
The fix for this is to use an event system I'll call decremental delta time. It has three advantages:
- Makes it easy to execute events in scheduled order regardless of the amount of time which has passed since the last Event Test.
- Maintains relative cycle scheduling at a high level so that none of the events being re-scheduled "lose time" due to belated block-boundary event testing.
- Simplifies event handling on all levels, and provides significant speedups for event testing and event dispatching.
It's hard to know beforehand just how beneficial in-order execution of events will be. I'm anticipating that it might actually fix a few emulation problems on the IOP recompiler in particular, since it has a slow cycle rate and also has a handful of events which can have potential inter-dependencies. For that reason I'll be implementing the system first into the IOP, and then when all the chinks in its armor are worked free we'll port the EE side of the emulator over to it.
- Created: 13 February 2009
- Written by Jake Stine
Zerofrog documented the concepts of Virtual Memory a few years back. So now I figure it's VTLB's turn, since it's the new exclusive memory model used in current AVNs and any future releases.
So what is VTLB? VTLB stands for Virtual Translation Look-aside Buffer, which for most of us is a lot of common everyday words that, when put together like that, don't mean much at all.
Firstly, the memory model names VM and VTLB refer to the systems in place inside PCSX2 for emulating the Playstation2's memory, and don't actually refer to what's being emulated. In fact, neither VM or VTLB builds emulate the PS2's own TLB memory model, which can be misleading since VTLB contains the letters "TLB." The VM build was incapable of emulating the PS2's TLB without significant speed penalties and complications. VTLB on the other hand has the potential to emulate the PS2 TLB, but we haven't added functionality for it since it also depends on some other not-yet-complete areas of emulation (namely MIPS TLBMiss exceptions). As Zerofrog explained in an earlier blog, very few games utilize the TLB features of the PS2 anyway, so it's pretty much at the bottom of our wishlist at this time.
Conceptually, VTLB is surprisingly simple. It works by building a look-up of the PS2's physical memory on a per-page basis, and then defining actions or "handlers" for some pages, and defining other pages as "direct access" (fast mode). A page of memory is 4096 bytes long, so the PS2's 32Meg physical address space translates into 8192 total pages of memory, which ends up being a pretty small and efficient size as far as lookup tables are concerned. By comparison the current EErec uses a lookup table with 8 million entries!
So when a PS2 instruction performs a memory operation (usually referred to as a memOp), the VTLB grabs the lookup address. If the address has the "special handler" bit set, it routes to that handler. If the handler bit is not set, the address is treated like a normal pointer to physical memory. As an optimization, VTLB uses the sign bit of the 32 bit address for the purpose of differentiating handled memory pages from direct memory pages.
The pseudo-code looks like this, as performed for a write memOp:
uint page = ps2_addr/4096;
uptr pc_addr = vtlb_lookup[page];
if( pc_addr & 0x80000000 ) // sign bit check
*pc_addr = data;
handler[page]( ps2_addr, data );
By default, Pcsx2 utilizes the VTLB's handlers for several areas of Ps2 memory that hold hardware registers. Hardware registers are memory addresses that control the whats, whens, and hows of the Ps2 -- write to a specific memory address and the PS2 starts a DMA transfer, or changes the video mode, or plays a new audio sample. These writes have to be intercepted and handled by the emulator. The VTLB allows us to do that very efficiently.
The reason why VTLB is able to emulate the PS2's own TLB is thanks to the handlers, which can remap memory anywhere at any time, with any set of permissions. If a game decided to remap some pages of memory to a different virtual address, those pages would have the "handler bit" set TRUE, and then the handler for those pages would be instructed to remap the memOp to the appropriate physical address.
Thus, the flow would be as such:
Ps2_Virtual_Address -> VTLB_Lookup -> Handler -> Ps2_Physical_Address
The benefits of this model are three-fold: efficiency, extensibility, and ease of debugging. If a game doesn't use the Ps2's TLB, then the VTLB will simply use direct memOps (fast!). If a game does happen to use the TLB, then VTLB can remap the memory as needed, allowing that game to emulate correctly also without having to needlessly burden other games with the overhead of virtual memory remapping logic. And to top it all off, handler mappings can be traced and dumped quickly and easily at any stage of emulation.
- Created: 03 February 2009
- Written by Jake Stine
Without a doubt, creating a public release of Pcsx2 is an exhausting affair. We just got finished posting the latest and greatest in 0.9.6 (available in our Downloads Section!), and while it's nice having everything done and over with for now, it sure feels like there should have been a better way.
This time around we tried to make use of our GoogleCode Svn in "smart" fashion, and created a branch for the Release Candidate. The jury's still out on if this proved to be successful or not. Several ground-breaking fixes were submitted shortly after the RC branch was made, so we had to merge all of that stuff in. Furthermore, I got carried away and experimented with partial merges, without fully understanding the advantages of reverse merges, so I had to undo several of my own merging errors. And just to add salt to the wound, TortoiseSvn had a bug that would frequently "forget" line breaks; merging all code changes into one super-long line. -_-
So in the end, the merges required a lot of brain power, a lot of time, and may have led to some small mistakes. These were all things we were hoping the RC branch would help reduce, so it was a bit of a fail on that account.
The other stress tester when doing an official release is the updating of the compatibility list, which is both a lot of work for our dedicated testers and has the nasty side-effect of making us devs completely and totally aware of just how many games actually emulate worse now, instead of better. So each day was a mad dash to do regression testing on each new set of titles that came in as no longer being playable. This was made even more challenging by the fact that most of the regressions ended up being pretty old, dating back to the pre-Playground days (meaning they were attributed to 0.9.5 Svn revisions). We only managed to get a few of the riddles solved.
So yeah, it's true -- the overall "playable" number of games is lower in 0.9.6 compared to 0.9.4, due to many semi-obscure titles which are unable make it past the intro in 0.9.6. But on the other hand, games that are playable tend be much more accurately emulated now, and are certainly much faster. And 0.9.6 also runs a couple dozen games that 0.9.4 could not (most of which are big titles many folks have looked forward to for some time). In the meantime, though, you might want to keep that old 0.9.4 copy around for some of those titles that need it.
- Created: 06 February 2008
- Written by refraction
The new version of PCSX2 is out, development continues, ideas flow through the team and new ideas for fun spring to mind, just to forget about the work at hand for five minutes! This time, in the form of a video containing "misheard lyrics". We all listen to songs and sometimes completely miss the correct lyrics in a song, but i decided to go one step further with this following number and see what i could make in english out of a song which is sung in sweedish (afaik). The idea came around after Falcon4ever was thinking of ideas for a 404 page on a subdomain of ours to wind up one of the other team members, with the use of an annoying song.
Well, needless to say, it worked, but the song was so crap i figured I'd make it a bit more fun, sat down with Cool Edit open for a while and jotted down what i "thought" i heard!
So in its full glory here is the song with the misheard lyrics written below. You will need Flash Player installed, big thanks to Falcon4ever for mocking this up! I detest flash...
Original Song: Caramell - Caramelldansen (Speedycake Remix)
* update *
A small preloader bug prevented the video from working in IE7, it should be fixed now.