- Created on 10 November 2006
- Written by CKemu
One of the greatest questions mankind has ever asked is: "How long does Saqib take to give testers a beta?". It's a question many testers have raised over the years.
One could argue that such questions are not to be answered by mankind, for such knowledge would surely destroy us. Well, I for one believe that mankind must know, for it could be the key to unlocking one of the greatest advances in Quantum Theory since zerofrog learned how to collapse entire galaxies with his ZeroGS KOSMOS.
To calculate the real-world time it will take for Saqib to deliver a beta to any given tester, you have to take into account the following variables and factors.
Nagging Factor (N): does the tester have the strength to 'push' Saqib into hurrying along? For most testers this strength is given as merely a fraction of 1, as Saqib is remarkably stubborn!
Thus N often has little impact on the rate of beta delivery. However, a tester can use the offer of pornography (P) to entice Saqib to speed up, but this is a double-edged sword: a value greater than 3 (3 videos, 3 photosets etc.) will create a - ahem - 'W' effect, causing the divider in this calculation to be reset to 1.
Coffee Power (C): truly an awesome cosmic force in the universe. Its magical beans can break through temporal barriers, allowing the user to work faster! In fact, its powers are so incredible that a single mug of coffee acts as a massive multiplier, with each mug being worth ^2.
Laz0ritus (L) is a common side effect caused by lack of daylight and social interaction, creating an almost coma-like state. This is a severe factor and can extend the waiting duration by significant amounts of time; such is the effect that the multiplier for this variable can be set as high as 24 hours (1440 minutes).
KOSMOS Temporal Pull (K): when working with KOSMOS, all testers are affected by the dragging effect caused by this massive Energy Black Hole (EBH), which constantly consumes entire galaxy clusters. Developers, however, are exposed to higher levels of KTP, which causes their time to move extremely slowly: for every minute that passes in their time, 10,080 minutes pass in ours. The longer a developer has been exposed to this effect, the greater it becomes.
Temporal Reality Flux Syndrome (F): the early years of PCSX2 development were a risky business, and those who ventured into this unknown often came back 'different'. Medically the issue is little understood, but it is believed to be caused by watching motion at sub-1 FPS, which causes the developer to perceive time outside of PCSX2 as incredibly fast.
This causes the sufferer to slow down to the more comfortable PCSX2 speeds they became accustomed to in those early days; doctors and scientists have learned in recent studies that the brain and motor functions slow down by a factor of 60.
In some extreme cases it's known to produce such a slowdown that the developer appears petrified (frozen in time). Others consider this merely a visual side effect, as such low levels of motion cannot be perceived by humans in normal space.
Thus the following calculation can be made:
R=Real World Time (Minutes)
S=Saqib Time (Minutes)
K=10080, L=1440, F=6, P=2, C=10, N=2, S=1.
So a single minute in Saqib Time is equal to 7.29 days in our time. This work is theoretical at the moment and needs a great deal of refinement; however, one can see via this simple equation that we'll be long dead by the time PCSX2 has Saqib's code.
One hopes that a scientist or time traveler gets the chance to see this and can offer help and advice for Saqib and his somewhat unique dilemma.
- Created on 29 October 2006
- Written by ZeroFrog
Many 64 bit architectures have been proposed; however, the x86-64 (aka AMD64) architecture has picked up a lot of speed since its initial proposal a couple of years ago. Most 64bit CPUs today support it, so it looks like a good candidate for 64bit recompilation. The x86-64 architecture offers many more registers and can potentially speed up games by a significant amount. Up to now, Pcsx2 has largely been ignoring the 64 bit arena because there have been massive compatibility issues, the developers weren't sure if it was really worth it, and adding a new bug-free and fast recompiler to the existing code base is a very painful process. Anyone seriously suggesting this to a dev would have been laughed out of the chat room. However, the upcoming 0.9.2 release is looking very stable and after doing some research, we have decided to add support for x86-64 recompilation, both for 64bit versions of Linux and Windows (yes, Linux support is returning).
Before going into technical details, I want to cover the current Pcsx2 recompilation model.
Every different instruction set requires either an interpreter or a recompiler to execute it on the PC. Both are important in emulation. Interpreters are implemented with regular high-level languages and are platform independent. They are easy to program, easy to debug, but slow. They are extremely important for testing and debugging purposes. For example, interpreting a simple 32bit EE MIPS instruction (code) might look like:
case 0x02: // J - unconditional jump within the current 256MB region
    pc = (pc & 0xf0000000) | ((code & 0x03ffffff) << 2); // change the program counter
    break;
case 0x23: // LW - load word, sign extend to 64 bits
    gpr[Rt] = (long long)*(int*)(memory + gpr[Rs] + (short)code);
    break;
Recompilers, on the other hand, try to cut as many corners as possible. For example, we know the instruction at address 0x1000 will never change, so there is no reason why the CPU needs to execute the switch statement and decode the instruction every single time it executes it. So recompilers generate the minimal amount of assembly the CPU needs to execute to emulate that instruction. Because we're working with assembly, recompilation is a very platform dependent process.
Simple recompilers look at one instruction at a time and keep all target platform (in this case, the PS2) registers in memory. For every new instruction, the used registers are read from memory and stored in real CPU registers, then some instructions are executed, and finally the register with the result is stored back in memory. Before 0.9, Pcsx2 used to employ this type of recompilation.
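To make this concrete, here is a minimal sketch of what such a per-instruction recompiler might emit for a 32bit EE add (the emitter-style helpers and the uptr typedef are illustrative, in the spirit of Pcsx2's ix86 emitter, not its exact API):

// hypothetical recompilation of "addu rd, rs, rt" - every EE register lives
// in the in-memory gpr[] array between instructions
void recADDU(int rd, int rs, int rt)
{
    MOV32MtoR(EAX, (uptr)&gpr[rs]); // load the first operand from memory into eax
    ADD32MtoR(EAX, (uptr)&gpr[rt]); // add the second operand straight from memory
    MOV32RtoM((uptr)&gpr[rd], EAX); // flush the result back to memory immediately
}

Every emulated instruction pays for at least one load and one store, which is exactly the overhead the more complex schemes below try to remove.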
More complex recompilers divide the code into simple blocks (no jumps/branches) and try to preserve target platform registers across instructions in the real CPU registers. There are many different register allocation algorithms using graph coloring. Such compilers might also do constant propagation. A common pattern in the MIPS Emotion Engine is something like:
lui s0, 0x1000      # s0 = 0x1000 << 16 = 0x10000000
lw  s0, 0x2000(s0)  # read from 0x10000000 + 0x2000 = 0x10002000
If we propagate the constant down to the lw, we know at recompile time that the read address is 0x10002000.
A slightly more complex recompiler will know that 0x10002000 corresponds to the IPU, so the generated assembly will call the IPU handler straight away (without worrying about memory location translation).
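A minimal sketch of how such constant tracking might look (the names and structure are illustrative, not Pcsx2's actual code; u32/u16/s16 are Pcsx2-style fixed-size typedefs):

struct ConstState { bool isConst; u32 value; };
ConstState g_const[32]; // recompile-time knowledge about each EE register

void recLUI(int rt, u16 imm)
{
    g_const[rt].isConst = true;
    g_const[rt].value = (u32)imm << 16; // value fully known at recompile time
}

void recLW(int rt, int rs, s16 offset)
{
    if (g_const[rs].isConst) {
        u32 addr = g_const[rs].value + offset; // e.g. 0x10002000 -> the IPU
        // emit a direct call to the matching hardware handler,
        // skipping the memory translation entirely
    } else {
        // emit a generic, translated memory read
    }
    g_const[rt].isConst = false; // a value loaded from memory is unknown
}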
There are many such local optimizations; however, they aren't enough. At the end of every block, all the registers have to be flushed to memory because the next simple block to be executed can't be predicted at recompilation time (ie: "branch if x >= 0" depends on the value of x at runtime).
An even more complex recompiler can work on the global scale by finding out which simple blocks are connected to which. Once it knows, it can get rid of the register flushing at the end of every simple block by simply telling the next blocks to allocate the same real CPU register to the same target platform register. This is called global register allocation and sometimes uses Markov blankets for block synchronization. For those people that know Bayes nets, this is very similar, except it applies to the global simple block graph. Just think about the nodes necessary for making a specific node independent with respect to the whole graph. This will include the node's parents, children, and the children's parents. For those that just got lost... don't worry.
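In very rough terms, the propagation step might look like this (a heavily simplified, hypothetical sketch; a real allocator must also reconcile blocks whose predecessors disagree about where a register lives):

#include <vector>

struct Block {
    std::vector<Block*> succs; // blocks this one can branch to
    int alloc[32];             // EE register -> x86 register (-1 = in memory)
};

void propagateAlloc(Block* b)
{
    for (Block* s : b->succs)
        for (int r = 0; r < 32; ++r)
            if (s->alloc[r] == -1)
                s->alloc[r] = b->alloc[r]; // successor's entry state matches our
                                           // exit state, so no flush/reload needed
}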
The Pcsx2 recompilers also use MMX and SSE(1/2/3) interchangeably. So an EE register can be in an MMX, SSE, or regular x86 register at any point in time depending on the current types of instructions (this is a nightmare to manage).
Console emulators rarely need such complex recompilers because up until a couple of years ago, consoles weren't that powerful. But starting with the PS2, consoles got powerful and the Pcsx2 recompilers for the Emotion Engine and Vector Units got complex really fast. Pcsx2 0.9.1 supports all the above mentioned optimizations plus many more unmentioned ones. The VU recompiler (code named SuperVU) is by far the most complex and the fastest. Anyone who wants to keep their sanity should stay away from it.
Those who remember what it was like in the 0.8.1 days can appreciate how powerful the 0.9.1 Pcsx2 optimizations are.
So why isn't x86-32 enough? Well, for starters the Playstation 2 EE has 32 128bit general registers, 32 32bit floating point registers, and some COP0 registers. Most instructions work on 64 bits; the MMI instructions work on the full 128 bits. On the other hand, the x86 CPU has 8 32bit general purpose registers (one of which is the stack pointer), 8 64bit registers (MMX), and 8 128bit registers (SSE). And you can't combine the three that easily (ie: you can't add an x86 register to an SSE register without first transferring the x86 value to SSE or vice versa). So there's a very big difference in register sizes. Because of the small number of x86 registers, the recompiler does a lot of register thrashing (registers are spilled to memory very frequently). Each memory read/write is pretty slow, so the more thrashing, the slower the recompiled code becomes. Also, x86-32 is inherently 32bit, so a 64bit add requires 2 32bit instructions and 4 regular x86 registers for the sources and result (2 if reading from memory). The EE recompiler tries to alleviate the register pressure by using the 64bit arithmetic capabilities of MMX, but MMX has a pretty limited ISA and intra-register-set transfers kill performance.
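As a sketch, one 64bit EE add on x86-32 comes out to an add/adc pair plus the loads and stores around it (same illustrative emitter style as above; UL[0]/UL[1] stand for the low/high 32bit words of an EE register):

MOV32MtoR(EAX, (uptr)&gpr[rs].UL[0]); // low 32 bits of the first source
MOV32MtoR(EDX, (uptr)&gpr[rs].UL[1]); // high 32 bits of the first source
ADD32MtoR(EAX, (uptr)&gpr[rt].UL[0]); // add the low halves; sets the carry flag
ADC32MtoR(EDX, (uptr)&gpr[rt].UL[1]); // add the high halves plus the carry
MOV32RtoM((uptr)&gpr[rd].UL[0], EAX);
MOV32RtoM((uptr)&gpr[rd].UL[1], EDX);

On x86-64 the same operation is a single 64bit add between two registers.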
The registers on the x86-64 architecture are: 16 64bit general purpose registers, 8 64bit MMX registers, and 16 128bit SSE registers. That is roughly twice the register space! This means much less register thrashing. On top of that, 64bit adds/shifts/etc can all be done in one instruction.
However, the story isn't as simple as it sounds. The recompiler has to interface with regular C++ code constantly (ie: calling plugin functions), so the calling conventions on the recompiler boundaries must be followed exactly. The x86-64 specification can be found here and is pretty straightforward. However, Microsoft decided that it wanted its own specification (for reasons not quite known to anyone else)... so now there are two different calling conventions, each with a different set of registers for passing arguments to functions and another different set acting as non-volatile data! (Thanks Microsoft, it wasn't difficult enough.)
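For reference, the argument registers differ like this, and the recompiler has to pick the right set whenever it emits a call into C++ code (the helper below is a hypothetical sketch, not Pcsx2's actual API):

// System V AMD64 (Linux):  arguments in rdi, rsi, rdx, rcx, r8, r9
// Microsoft x64 (Windows): arguments in rcx, rdx, r8, r9, plus 32 bytes
//                          of 'shadow space' reserved on the stack
void emitLoadArg(int argn, u64 imm) // load a call argument into the right ABI register
{
#ifdef _WIN64
    static const int argRegs[] = { RCX, RDX, R8, R9 };
#else
    static const int argRegs[] = { RDI, RSI, RDX, RCX, R8, R9 };
#endif
    MOV64ItoR(argRegs[argn], imm); // emit 'mov reg, imm64'
}

The sets of callee-saved registers differ between the two ABIs as well, so register allocation around calls has to be conditional too.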
Because the size of the registers changed, all pointers are now 64 bits, which adds many difficulties to reading and writing from memory, incrementing the stack, etc.
Virtual memory is yet another obstacle to overcome with 64bit OSs. The AWE mapping trick (described in an earlier blog) has to be refined. But now that the address range is much bigger, there are fewer limitations. VM builds for Linux also need a completely new implementation.
Finally, anyone who has seen Pcsx2's code knows that inline assembly is pretty frequent in the recompilers. There are many reasons we use inline assembly rather than C++ code; in fact, some things like dynamic dispatching are impossible to do in pure C++. So inline assembly is necessary... and it looks like Microsoft has disabled inline assembly entirely in 64bit editions of Visual C++!!!! (Thanks again Microsoft, you just know where to strike hardest)
With all the mentioned challenges, it will take a couple of months to get things working reasonably stably. By then, more people will have switched to 64bit OSs. If we're even half right in our estimates, Pcsx2 will run much faster on a 64bit OS than on a 32bit OS on the same computer once x86-64 recompilation is done.
Moral of the blog: Most recompiler theory discussed here actually comes straight from compiler theory. Compilers will always be necessary as long as engineers keep coming up with new instruction set architectures (ISAs). Learn how a compiler works. I recommend Compilers: Principles, Techniques, and Tools by Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman.
- Created on 29 September 2006
- Written by Falcon4ever
Since the launch of the new site last year, several improvements have been made. Some of you may have noticed that the site has been looking a bit different since yesterday.
The site now contains several navigation panels to look up old news.
Another (maybe) less noticeable improvement has been made to the page caching engine. Over the past year, PCSX2.net has been using a custom-written cache engine. Whilst this had been functioning well enough for some time, it still had a few nasty bugs which were hard to trace, leading to glitches such as "Page 1 of 0".
Also, due to demand for cleaner (and easier to maintain) code, we have been looking into several template engines. Thus the engine used to cache pages has been switched.
For the current version of the portal, we're using the Smarty template engine. More information on how Smarty works will follow in a later blog article.
The pcsx2.net community is pretty large (including Windows and Linux users), so it's no surprise that users are using many different kinds of browsers. To be compatible with most recent browsers, most pages are XHTML 1.1 compliant (the compat page is the only exception at the moment). Because of this standard, PCSX2.net should be viewable in Firefox 1.5.x, IE6, IE7 beta, Opera 8.x and Opera 9.x.
An interesting result of this high browser compatibility is that PCSX2.net can be browsed on SONY's PSP! To give you an impression of how this looks, here are a few shots:
In the upcoming weeks a new feature will be added to the compat page. To give you a small hint:
- Created on 30 July 2006
- Written by ZeroFrog
The Playstation 2 uses co-processor 0 to implement virtual paging. Even without COP0, the Playstation 2 memory map is pretty complex and the mapping can change depending on which processor you use to read the memory from. A simple version of how the default mapping looks from the Emotion Engine side is:
- The 32Mb of main memory occupying 0000_0000 - 01ff_ffff
- Hardware registers occupying 1000_0000 - 1000_ffff
- VU/BIOS/SPU2 addresses in 1100_0000 - 1fff_ffff
- Special kernel modes etc. in 8000_0000 - bfff_ffff
- A scratch pad at some other address
- ...and of course, we can't forget the hidden addresses (thanks SONY)
To make matters worse, these mappings can change depending on the setting of COP0. (Note that at the time of writing, Pcsx2 doesn't emulate even half of COP0 correctly.) The simplest and most straightforward way to emulate this is to have another memory layer through a software Translation-Lookaside-Buffer (TLB). You pass it the PS2 address, and out comes the real physical address or some special code signifying a hardware register, etc. The problem is that every read/write has to be preceded by a TLB lookup. Considering that reads/writes are as common as addition, that's a lot of wasted cycles.
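A minimal sketch of what such a software TLB lookup might look like (structure illustrative; Pcsx2's actual tables are more involved, and u32/uptr are its fixed-size typedefs):

extern uptr tlbLUT[0x100000];   // one entry per 4Kb page of the 4Gb PS2 space
const uptr TLB_HANDLER = 1;     // low bit marks a special handler entry (illustrative)

u32 memRead32(u32 ps2addr)
{
    uptr entry = tlbLUT[ps2addr >> 12];        // one table lookup for every access
    if (entry & TLB_HANDLER)                   // special code: hardware register etc.
        return hwRead32(ps2addr);
    return *(u32*)(entry + (ps2addr & 0xfff)); // plain RAM: translated host pointer
}

Every single emulated load and store has to run through something like this, which is where the wasted cycles go.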
Well, the OS also uses virtual memory. In fact, every process has its own virtual address space driven by a real hardware TLB. If we could get away with mapping the 4Gb PS2 memory map onto the process's virtual memory, we could eliminate the need for the software translation (Figure 1). Looking at the virtual memory manipulation functions Windows XP offers, there are two major problems with this:
1. Windows XP reserves more than half the address space for OS-specific stuff. A good amount is also reserved for all of Pcsx2's working memory, executable code, and plugins (especially ZeroGS). It looks like we are left with less than 1.5 Gb of address range to implement the 4Gb PS2 memory map. Note that this problem doesn't exist on 64bit operating systems, where the address range is practically... infinite (don't quote me on this 20 years down the road).
2. The Playstation 2 allows more than one virtual page to point to the same physical page; Windows XP doesn't (I don't know about Linux). Assume that PS2 address 0x1000 points to the same physical page as address 0x0000, with each page being 4Kb. Now a write occurs at 0x1000. The game can retrieve that same value just by reading from 0x0000. In Windows XP, these have to be two different pages; so unless some clever solution/technology is discovered, we could kiss our VM dreams goodbye.
The first problem was solved somehow by introducing special address transformations before a read/write occurs.
And thankfully a clever technology presented itself for the second problem: Address Windowing Extensions (AWE). This lets Pcsx2 handle the actual physical page instead of a virtual page. We still can't map two virtual pages to the same physical page; however, what we can do instead is switch the mapping of the physical page as many times as needed! To achieve this, Pcsx2 hooks into the root exception handler and intercepts every exception the program generates. Whenever an illegal virtual page is accessed (ie, one with no physical page mapped to it), Pcsx2 gets an EXCEPTION_ACCESS_VIOLATION, remaps the correct physical page to that empty virtual page, and returns. Although I haven't calculated precisely, I'm pretty sure that switching physical pages around is computationally expensive. So all this works fine under the assumption that game developers won't be crazy enough to access two virtual pages mapping to the same physical page back-and-forth frequently... [pause].
Alas, we were wrong... again (see the floating-point article). It turns out that there are uncached and cached address ranges, so it is actually optimal for games to do exactly this bi-mapping trick: write through one virtual range and read from another. Pcsx2 tries to detect such cases and work around them, but there's no clean solution.
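For the curious, a simplified sketch of the exception-intercept idea on Windows (the remap helper is hypothetical; the AWE call underneath it would be MapUserPhysicalPages):

#include <windows.h>

bool RemapPhysicalPage(ULONG_PTR addr); // hypothetical helper doing the AWE calls

LONG WINAPI VMExceptionFilter(EXCEPTION_POINTERS* ep)
{
    if (ep->ExceptionRecord->ExceptionCode == EXCEPTION_ACCESS_VIOLATION) {
        ULONG_PTR addr = ep->ExceptionRecord->ExceptionInformation[1]; // faulting address
        // unmap the physical page from its old virtual slot, remap it at 'addr',
        // then retry the faulting instruction as if nothing happened
        if (RemapPhysicalPage(addr))
            return EXCEPTION_CONTINUE_EXECUTION;
    }
    return EXCEPTION_CONTINUE_SEARCH; // not ours; let the OS handle it
}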
And I'm going to stop here before this becomes a book.
So the ultimate question is: why doesn't VM work on some computers with 1Gb of RAM and the newest updates, while it works on others? It turns out that real-time monitoring applications like to take up some of the 1.5 Gb of leftover address range in certain processes (this might be OS-specific programs too). I have also observed that performance/debugging monitors like NvPerfHud do similar tricks. There might be other reasons for VM builds of Pcsx2 not working, because virtual memory is a pretty complicated issue.
Moral of the blog: Read an OS book. I recommend Operating System Concepts (the dinosaur book) by Abraham Silberschatz, Peter Baer Galvin, and Greg Gagne.
- Created on 24 July 2006
- Written by ZeroFrog
It is very hard to emulate the floating-point calculations of the R5900 FPU and the Vector Units on an x86 CPU because the Playstation 2 does not follow the IEEE standard. Multiplying two numbers on the FPU, the VU, and an x86 processor can give you 3 different results, all differing by a couple of bits! Operations like square root and division are even more imprecise.
Originally, we thought that a couple of bits shouldn't matter, that game developers would be crazy to rely on such precise calculations. Floating points are mostly used for world transformations or interpolation calculations, so no one would care if their Holy Sword of Armageddon was 0.00001 meters off from the main player's hand. To put it shortly, we were wrong, and game developers are crazier than we thought. Games started breaking just by changing the floating point rounding mode!
While rounding mode is a problem, the bigger nightmare is the floating-point infinities. The IEEE standard states that when a number overflows (meaning that it is larger than 3.4028234663852886E+38), the result will be infinity. Any number multiplied by infinity is infinity (even 0 * infinity = infinity). That sounds great until you figure out that the VUs don't support infinities. Instead they clamp all large numbers to the max floating point possible. This discrepancy breaks a lot of games!
For example, let's say a game developer tries to normalize a zero vector by dividing by its length, which is 0. On the VU, the end result will be (0,0,0). On x86/IEEE, the result will be (infinity, infinity, infinity). Now if the game developer uses this vector to perturb some faces for artificial hair or some type of animation, all final positions on the PS2 will remain the same. All final positions on x86 will go to infinity... and there goes the game's graphics, now figure out where the problem occurred.
The simplest solution is to clamp the vector written by the current instruction. This requires 2 SSE operations, is SLOW, and sometimes still doesn't work. To top it off, you can never dismiss the fact that game developers can be loading bad floating-point data into the VUs to begin with! Some games zero out vectors by multiplying them by zero, so the VU doesn't care at all what kind of garbage the original vector holds; x86 does care.
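The clamp itself is something along these lines (a sketch using SSE intrinsics; Pcsx2 emits the equivalent min/max instructions from the recompiler):

#include <emmintrin.h>

__m128 vuClamp(__m128 x)
{
    const __m128 maxv = _mm_castsi128_ps(_mm_set1_epi32(0x7f7fffff)); // +FLT_MAX
    const __m128 minv = _mm_castsi128_ps(_mm_set1_epi32(0xff7fffff)); // -FLT_MAX
    x = _mm_min_ps(x, maxv); // op 1: pulls +infinity (and NaN) down to +FLT_MAX
    x = _mm_max_ps(x, minv); // op 2: pulls -infinity up to -FLT_MAX
    return x;
}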
These two problems make floating-point emulation very hard to do fast and accurately. The range of bugs goes from screen flickering when a fade occurs, to disappearing characters, to spiky polygon syndrome (the most common problem, widely known as SPS).
In the end, Pcsx2 does all its floating-point operations with SSE, since it is easier to cache the registers. Two different rounding modes are used for the FPU and the VUs. Whenever a divide or rsqrt occurs on the FPU, overflow is checked. Overflow is checked much more frequently with the VUs. The fact that the VUs handle both integer and floating-point data in the same SSE register makes the checking take a little longer. In the future, Pcsx2 will read the rounding mode and overflow settings from the patch files, so that every game can be accommodated with the best/fastest settings.
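For reference, changing the SSE rounding mode means rewriting the rounding bits of the MXCSR control register; with the standard intrinsics that's a one-liner (which mode suits which unit is exactly the per-game setting mentioned above):

#include <xmmintrin.h>

void setRoundingMode(unsigned int mode) // e.g. _MM_ROUND_NEAREST, _MM_ROUND_TOWARD_ZERO
{
    _MM_SET_ROUNDING_MODE(mode); // rewrites bits 13-14 of MXCSR
}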
Moral of the blog: When comparing two floating point numbers a and b, never use a == b. Instead use something along the lines of
fabs(a-b) < epsilon
where epsilon is some very small number.
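For example, a minimal helper (the epsilon value has to be chosen per use case; this one is just illustrative):

#include <cmath>

bool nearlyEqual(float a, float b, float epsilon = 1e-6f)
{
    return std::fabs(a - b) < epsilon; // tolerate tiny representation differences
}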