- Created: 30 January 2007
- Written by ZeroFrog
Many people have visited the forums giving ideas on how and where Pcsx2 should be optimized. While most ideas sound solid on the outside, they usually will not work in practice for various reasons. This blog will answer some of those burning questions on what Pcsx2 optimizations are important and where development work should be put in to make things run faster. We will touch on why the GPU is the bottleneck on some games and why the CPU is on others. We will also go into the distribution of workload of the various components of Pcsx2 as it is computing away. And most important, we will cover plugin design so that system resources are distributed nicely.
First, a note to the people who have played around with optimization or will play around with it. Be careful when measuring performance with frames per second! If anyone told me their optimization gained 5 fps for a certain game, I would not understand what that means! Why? Well, if a game went from 5 fps to 10 fps, that means each frame took 200 ms and now it takes 100 ms. The optimization saved 100 ms of CPU time per frame and now the game is 2x faster (this doesn't happen anymore)! If a game went from 60 fps to 65 fps, each frame took 16.7 ms and now it takes 15.4 ms. This is only 1.3 ms of saved time per frame, and the game is only 1.08 times faster. Which optimization do you think is better? Also, a 1-2% speed difference is too small to be statistically meaningful, so it cannot show that the optimization is useful. In fact, the fps counter in the title bar fluctuates by 1-2% all the time, so you'll just be picking up noise.
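The arithmetic above is easy to script. The following is a minimal sketch that converts fps readings into per-frame time and a speedup factor; the function names are made up for this illustration and are not part of Pcsx2.

```cpp
#include <cassert>

// Milliseconds spent on one frame at a given frame rate.
double frame_time_ms(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Time saved per frame (in ms) when going from old_fps to new_fps.
double saved_ms_per_frame(double old_fps, double new_fps) {
    return frame_time_ms(old_fps) - frame_time_ms(new_fps);
}

// How many times faster the new build is compared to the old one.
double speedup(double old_fps, double new_fps) {
    assert(old_fps > 0.0);
    return new_fps / old_fps;
}
```

With these helpers, 5 fps to 10 fps comes out as 100 ms saved per frame and a 2x speedup, while 60 fps to 65 fps comes out as roughly 1.3 ms saved per frame and a speedup of about 1.08x.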
- Created: 11 November 2006
- Written by Falcon4ever
It's been a while since the last site improvements, however this time we have some nice and maybe unique features on our site.
Our compatibility page has been upgraded. The new feature is the AJAX powered toggle boxes which allow you to see games with a particular status.
The status will be remembered when switching between pages.
- Created: 10 November 2006
- Written by CKemu
One of the greatest questions mankind has ever asked is; "How long does Saqib take to give testers a beta?". It's a common question raised by many testers over the years.
One could argue that such questions are not to be answered by mankind, for such knowledge would surely destroy us, well I for one believe that mankind must know, for it could be the key to unlocking one of the greatest advances in Quantum Theory, since zerofrog learned how to collapse entire galaxies with his zeroGS KOSMOS.
To calculate the real world time it will take for saqib to deliver a beta to any given tester, you have to take into account the following variables and factors.
Nagging Factor (N), does the tester have the strength to 'push' Saqib into hurrying along, for most testers this strength is given as merely a fraction of 1 as Saqib is remarkably stubborn!
Thus N often has little impact on the rate of beta delivery. However, a tester can use the offer of pornography (P) to entice Saqib to speed up, but this is a double edged sword: a value greater than 3 (3 videos, 3 photosets etc) will create a - ahem 'W' effect, causing the divider in this calculation to be reset to 1.
Coffee Power (C) is truly an awesome cosmic force in the universe; its magical beans can break through temporal barriers, allowing the user to work faster! In fact, its powers are so truly incredible that a single mug of coffee acts as a massive multiplier, with each mug being worth ^2.
Laz0ritus (L) is a common side effect caused by lack of daylight and social interaction, creating an almost coma like state. This is a severe factor and can extend waiting duration by significant amounts of time; such is the effect that the multiplier for this variable can be set as high as 24 hours (1440 minutes).
KOSMOS Temporal Pull (K): when working with KOSMOS, all testers are affected by the dragging effect caused by this massive Energy Black Hole (EBH), which consumes entire galaxy clusters constantly. However, developers are exposed to higher levels of KTP, which causes their time to move extremely slowly: for every minute that passes in their time, 10,080 minutes pass in our time. The longer a developer has been exposed to this effect, the greater the effect.
Temporal Reality Flux Syndrome (F) Early years of PCSX2 development was a risky business, those who ventured into this unknown, often came back 'different', medically the issue is little understood, but is believed to be caused by watching motion at sub 1 FPS, this causes the developer to perceive time outside of PCSX2 to be incredibly fast.
This causes the sufferer to slow down to the more comfortable PCSX2 speeds they became accustomed to in those early days, doctors and scientists have learned in recent studies that the brain and motor functions slow down by 60x normal values.
In some extreme cases it's known to produce such a slowdown that the developer is apparently petrified (Frozen in time). Others consider this is merely a visual side effect as such low levels of motion cannot be perceived by humans in normal space.
Thus the following calculation can be made:
R=Real World Time (Minutes)
S=Saqib Time (Minutes)
K=10080, L=1440, F=6, P=2, C=10, N=2, S=1.
So a single minute in Saqib Time is equal to 7.29 Days in our time, this work is theoretical at the moment and needs a great deal of refinement, however one can see via this simple equation that we'll be long dead by the time PCSX2 has Saqib's code.
One hopes that a scientist or time traveler gets a chance to see this and can offer help and advice for Saqib and his somewhat unique dilemma.
- Created: 29 October 2006
- Written by ZeroFrog
Many 64 bit architectures have been proposed; however, the x86-64 (aka AMD64) architecture has picked up a lot of speed since its initial proposal a couple of years ago. Most 64bit CPUs today support it, so it looks like a good candidate for 64bit recompilation. The x86-64 architecture offers many more registers and can potentially speed up games by a significant amount. Up to now, Pcsx2 has largely been ignoring the 64 bit arena because there have been massive compatibility issues, the developers weren't sure if it was really worth it, and adding a new bug-free and fast recompiler to the existing code base is a very painful process. Anyone seriously suggesting this to a dev would have been laughed out of the chat room. However, the upcoming 0.9.2 release is looking very stable and after doing some research, we have decided to add support for x86-64 recompilation, both for 64bit versions of Linux and Windows (yes, Linux support is returning).
Before going into technical details, I want to cover the current Pcsx2 recompilation model.
Every different instruction set requires either an interpreter or a recompiler to execute it on the PC. Both are important in emulation. Interpreters are implemented with regular high-level languages and are platform independent. They are easy to program, easy to debug, but slow. They are extremely important for testing and debugging purposes. For example, interpreting a simple 32bit EE MIPS instruction (code) might look like:
switch (code >> 26) { // the top 6 bits hold the opcode
case 0x02: // J - jump to
    pc = (code & 0x03ffffff)*4; // change the program counter
    break;
case 0x23: // LW - load word, sign extend
    gpr[Rt] = (long long)*(long*)(memory+gpr[Rs]+(short)code);
    break;
}
Recompilers, on the other hand, try to cut as many corners as possible. For example, we know the instruction at address 0x1000 will never change, so there is no reason why the CPU needs to execute the switch statement and decode the instruction every single time it executes it. So recompilers generate the minimal amount of assembly the CPU needs to execute to emulate that instruction. Because we're working with assembly, recompilation is a very platform dependent process.
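The "decode once, run many times" idea can be sketched without emitting any assembly. The example below is only a stand-in for a real recompiler: instead of generating machine code it caches a C++ closure per instruction address, so the decoding switch runs once per address. Every name here (Cpu, decode, BlockCache, decode_count) is hypothetical, only J and ADDIU are handled, and branch delay slots are ignored to keep it short.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <unordered_map>

// Hypothetical, heavily simplified CPU state (not the Pcsx2 structures).
struct Cpu {
    uint32_t pc = 0;
    int64_t gpr[32] = {};
};

using Op = std::function<void(Cpu&)>;
using Memory = std::unordered_map<uint32_t, uint32_t>;  // address -> instruction word

int decode_count = 0;  // how many times the decode cost was paid

// Decode one instruction word into a ready-to-run operation.
Op decode(uint32_t code) {
    ++decode_count;
    const uint32_t rs = (code >> 21) & 31;
    const uint32_t rt = (code >> 16) & 31;
    const int16_t imm = static_cast<int16_t>(code & 0xffff);
    switch (code >> 26) {
    case 0x02: {  // J - jump (delay slot ignored in this sketch)
        const uint32_t target = (code & 0x03ffffff) * 4;
        return [target](Cpu& c) { c.pc = target; };
    }
    case 0x09:  // ADDIU - 32bit add of a sign extended immediate
        return [rs, rt, imm](Cpu& c) {
            const uint32_t sum = static_cast<uint32_t>(c.gpr[rs]) + static_cast<uint32_t>(imm);
            if (rt != 0) c.gpr[rt] = static_cast<int32_t>(sum);  // r0 stays zero
            c.pc += 4;
        };
    default:  // anything else is treated as a no-op here
        return [](Cpu& c) { c.pc += 4; };
    }
}

// Caches the decoded operation for every address that has been executed.
struct BlockCache {
    std::unordered_map<uint32_t, Op> ops;

    void step(Cpu& c, const Memory& mem) {
        const uint32_t addr = c.pc;
        auto it = ops.find(addr);
        if (it == ops.end()) {  // first visit: decode and remember
            auto word = mem.find(addr);
            assert(word != mem.end() && "pc left the toy program");
            it = ops.emplace(addr, decode(word->second)).first;
        }
        it->second(c);  // later visits skip the decode entirely
    }
};
```

Running a two-instruction loop (an ADDIU followed by a J back to it) for many steps decodes each address only once, which is the saving the paragraph above describes. A real recompiler goes further and replaces the closure with generated assembly.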
Simple recompilers look at one instruction at a time and keep all target platform (in this case, the PS2) registers in memory. For every new instruction, the used registers are read from memory and stored in real CPU registers, then some instructions are executed, and finally the register with the result is stored back in memory. Before 0.9, Pcsx2 used to employ this type of recompilation.
More complex recompilers divide the code into simple blocks (no jumps/branches) and try to preserve target platform registers across instructions in the real CPU registers. There are many different types of register allocation algorithms using graph coloring. Such compilers might also do constant propagation elimination. A common pattern in the MIPS Emotion Engine is something like:
lui s0, 0x1000
lw s0, 0x2000(s0)
If we propagated the constants at the lw, we know that the read address is 0x10002000.
A little more complex recompiler will know that 0x10002000 corresponds to the IPU, so the assembly will call the IPU straight away (without worrying about memory location translation).
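Here is a minimal sketch of that constant tracking for the lui/lw pair. It assumes a hypothetical ConstState structure that remembers which registers hold values known at recompilation time; the IPU address range used by is_ipu_address is an assumption made for this illustration, not taken from the Pcsx2 source.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical per-block state: the value of each EE register,
// if that value is already known while recompiling.
struct ConstState {
    std::array<std::optional<uint32_t>, 32> known{};

    // lui rt, imm: rt = imm << 16, always a known constant.
    void lui(int rt, uint16_t imm) {
        assert(rt >= 0 && rt < 32);
        known[rt] = static_cast<uint32_t>(imm) << 16;
    }

    // lw rt, offset(base): returns the read address when base is constant.
    // The loaded value comes from memory, so rt is unknown afterwards.
    std::optional<uint32_t> lw(int rt, int16_t offset, int base) {
        assert(rt >= 0 && rt < 32 && base >= 0 && base < 32);
        std::optional<uint32_t> addr;
        if (known[base])
            addr = *known[base] + static_cast<uint32_t>(static_cast<int32_t>(offset));
        known[rt].reset();
        return addr;
    }
};

// Assumed register window for the IPU, for illustration only.
bool is_ipu_address(uint32_t addr) {
    return addr >= 0x10002000u && addr < 0x10002040u;
}
```

For the pattern above (s0 is register 16), calling lui(16, 0x1000) and then lw(16, 0x2000, 16) yields the address 0x10002000, so the recompiler could emit a direct call into the IPU code instead of a generic memory read.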
There are many such local optimizations, however they aren't enough. At the end of every block, all the registers will have to be pushed to memory because the next simple block that needs to be executed can't be predicted at recompilation time (ie: branch if x >= 0 depends on the value of x at runtime).
An even more complex recompiler can work on the global scale by finding out which simple blocks are connected to which. Once it knows, it can get rid of the register flushing at the end of every simple block by simply telling the next blocks to allocate the same real CPU register to the same target platform register. This is called global register allocation and sometimes uses Markov blankets for block synchronization. For those people that know Bayes nets, this is very similar, except it applies to the global simple block graph. Just think about the nodes necessary for making a specific node independent with respect to the whole graph. This will include the node's parents, children, and the children's parents. For those that just got lost... don't worry.
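For those who did not get lost, the set of nodes described above (parents, children, and the children's other parents) is easy to compute on a block graph. This sketch only builds that set; how Pcsx2 actually uses it for register synchronization is not shown, and the Graph type and function name are assumptions made for the example.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Block id -> ids of the blocks it can jump or fall through to.
using Graph = std::map<int, std::vector<int>>;

// Returns the parents, children, and children's other parents of `node`.
std::set<int> markov_blanket(const Graph& g, int node) {
    std::set<int> children;
    auto it = g.find(node);
    if (it != g.end()) children.insert(it->second.begin(), it->second.end());

    std::set<int> blanket(children);
    for (const auto& [from, successors] : g) {
        for (int to : successors) {
            // `from` is a parent of node, or a parent of one of node's children.
            if (to == node || children.count(to)) blanket.insert(from);
        }
    }
    blanket.erase(node);  // the node itself is not part of its own blanket
    return blanket;
}
```

For a graph with edges 1->3, 2->3, 2->4, 3->4 and 5->4, the blanket of block 3 is {1, 2, 4, 5}: blocks 1 and 2 are parents, block 4 is a child, and block 5 is another parent of that child.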
The Pcsx2 recompilers also use MMX and SSE(1/2/3) interchangeably. So an EE register can be in an MMX, SSE, or regular x86 register at any point in time depending on the current types of instructions (this is a nightmare to manage).
Console emulators rarely need to go through such complex recompilers because up until a couple of years ago, consoles weren't that powerful. But starting with the PS2, consoles got powerful and the Pcsx2 recompilers for the Emotion Engine and Vector Units got complex really fast. Pcsx2 0.9.1 supports all the above mentioned optimizations plus many more unmentioned ones. The VU recompiler (code named SuperVU) is by far the most complex and fastest. Anyone who wants to keep their sanity should stay away from it.
Those who remember what it was like in the 0.8.1 days can appreciate how powerful the 0.9.1 Pcsx2 optimizations are.
So why isn't x86-32 enough? Well, for starters the Playstation 2 EE has 32 128bit regular registers, 32 32bit floating point registers, and some COP0 registers. Most instructions work on 64 bits; the MMI instructions work on the full 128 bits. On the other hand, the x86 CPU has 8 32bit general purpose registers (one is for the stack), 8 64bit registers (MMX), and 8 128bit registers (SSE). And you can't combine the three that easily (ie: you can't add an x86 register to an SSE register without first transferring the x86 register to SSE or vice versa). So there's a very big difference in register sizes. Because of the small number of x86 registers, the recompiler does a lot of register thrashing (registers are spilled to memory very frequently). Each memory read/write is pretty slow, so the more thrashing, the slower the recompiled code becomes. Also, x86-32 is inherently 32bit, so a 64bit add would require 2 32bit instructions and 4 regular x86 registers for the source and result (2 if reading from memory). The EE recompiler tries to alleviate the register pressure by using the 64bit arithmetic capabilities of MMX, but MMX has a pretty limited ISA and transfers between the register sets kill performance.
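To see why a 64bit add costs two instructions on x86-32, here is a small sketch of the same operation done on 32bit halves, mirroring the add/adc instruction pair. The Reg64 type is a made-up name for this example.

```cpp
#include <cassert>
#include <cstdint>

// A 64bit value split across two 32bit registers.
struct Reg64 {
    uint32_t lo;
    uint32_t hi;
};

// 64bit add built from two 32bit adds, like add followed by adc on x86-32.
Reg64 add64(Reg64 a, Reg64 b) {
    Reg64 r;
    r.lo = a.lo + b.lo;                             // "add": low halves, wraps on overflow
    const uint32_t carry = (r.lo < a.lo) ? 1u : 0u; // carry out of the low add
    r.hi = a.hi + b.hi + carry;                     // "adc": high halves plus the carry
    return r;
}
```

Both operands and the result each occupy two 32bit registers, which is where the register pressure mentioned above comes from.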
The registers on the x86-64 architecture are: 16 64bit general purpose registers, 8 64bit MMX registers, and 16 128bit SSE registers. This amounts to twice the amount of register memory! This means much less register thrashing. On top of that, 64bit adds/shifts/etc can all be done in one instruction.
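The "twice" figure can be checked directly from the register counts given in the text. This is just the arithmetic written out; the constant names are made up for the example.

```cpp
#include <cassert>

// Total register storage in bits, using the counts quoted in the text.
constexpr int x86_32_bits = 8 * 32 + 8 * 64 + 8 * 128;    // GPR + MMX + SSE
constexpr int x86_64_bits = 16 * 64 + 8 * 64 + 16 * 128;  // GPR + MMX + SSE

constexpr int x86_32_bytes = x86_32_bits / 8;  // 224 bytes
constexpr int x86_64_bytes = x86_64_bits / 8;  // 448 bytes

static_assert(x86_64_bits == 2 * x86_32_bits, "x86-64 has exactly twice the register storage");
```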
However, the story isn't as simple as it sounds. The recompiler has to interface with regular C++ code constantly (ie: calling plugin functions), so the calling conventions on the recompiler boundaries must be followed exactly. The x86-64 specification can be found here and is pretty straightforward. However, Microsoft decided that it wanted its own specification (for reasons not quite known to anyone else)... so now there are two different calling conventions, each with a different set of registers for passing arguments to functions and a different set acting as non-volatile data! (Thanks Microsoft, it wasn't difficult enough.)
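To make the difference concrete, here are the integer argument registers and the non-volatile general purpose registers of the two conventions, as I understand the System V AMD64 ABI and the Microsoft x64 convention. The CallConv struct and arg_register helper are hypothetical names for this sketch; rsp, floating point arguments, and the non-volatile SSE registers are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct CallConv {
    std::vector<std::string> int_args;     // registers for the first integer/pointer arguments
    std::vector<std::string> nonvolatile;  // general purpose registers a callee must preserve
};

// System V AMD64 ABI (used by 64bit Linux).
const CallConv kSystemV = {
    {"rdi", "rsi", "rdx", "rcx", "r8", "r9"},
    {"rbx", "rbp", "r12", "r13", "r14", "r15"}};

// Microsoft x64 calling convention (used by 64bit Windows).
const CallConv kMicrosoft = {
    {"rcx", "rdx", "r8", "r9"},
    {"rbx", "rbp", "rdi", "rsi", "r12", "r13", "r14", "r15"}};

// Where the index-th integer argument lives: a register name, or "stack".
std::string arg_register(const CallConv& cc, std::size_t index) {
    return index < cc.int_args.size() ? cc.int_args[index] : std::string("stack");
}
```

Note that rdi and rsi carry the first two arguments on Linux but are non-volatile on Windows, so code generated at a call boundary cannot simply be shared between the two platforms.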
Because the size of the registers changed, all pointers are now 64 bits, which adds many difficulties to reading and writing from memory, incrementing the stack, etc.
Virtual memory is yet another obstacle to overcome with 64bit OSs. The AWE mapping trick (described in an earlier blog) has to be refined. But now that the address range is much bigger, there are fewer limitations. VM builds for Linux also need a completely new implementation.
Finally, if anyone has seen Pcsx2 code, they would know that inline assembly is pretty frequent in the recompilers. The reasons we use inline assembly rather than C++ code are many. Actually, some things like dynamic dispatching become impossible to do with C++ code. So, inline is necessary... and it looks like Microsoft has disabled all functionality for inline assembly in 64bit editions of Visual C++!!!! (Thanks again Microsoft, you just know where to strike hardest)
With all the mentioned challenges, it will take a couple of months to get things working reasonably stable. By that time, more people would have switched to 64bit OSs. If we're even half right in our estimates, Pcsx2 will run much faster on a 64bit OS than on a 32bit OS on the same computer once x86-64 recompilation is done.
Moral of the blog: most recompiler theory discussed here actually comes straight from compiler theory. Compilers will always be necessary as long as engineers keep coming up with new instruction set architectures (ISAs). Learn how a compiler works. I recommend Compilers: Principles, Techniques, and Tools by Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman.
- Created: 29 September 2006
- Written by Falcon4ever
Since the launch of the new site last year, several improvements have been made to the site. Some of you may have noticed that the site is looking a bit different since yesterday.
The site now contains several navigation panels to look up old news.
Another (maybe) less noticeable improvement has been made to the page caching engine. Over the past year, PCSX2.net has been using a custom written cache engine. Whilst this had been functioning well enough for some time now, it still had a few nasty bugs which were hard to trace, leading to glitches such as Page 1 of 0.
Also, due to a demand for cleaner (and easier to maintain) code, we have been looking into several template engines. Thus the engine used to cache pages has been switched.
For the current version of the portal, we're using the Smarty template engine. More information on how smarty works will follow in a later blog article.
The pcsx2.net community is pretty large (including Windows and Linux users), so it's no surprise that users are using different kinds of browsers. To be compatible with most recent browsers, most pages are XHTML 1.1 compatible (the compat page is the only exception at the moment). Because of this standard, PCSX2.net should be viewable in Firefox 1.5.x, IE6, IE7 beta, Opera 8.x and Opera 9.x.
An interesting result of this high browser compatibility is that PCSX2.net can be browsed on SONY's PSP unit! To give you an impression of how this looks, here are a few shots:
In the upcoming weeks a new function will be added to the compat page. To give you a small hint: