- Created: 25 November 2010
- Written by refraction
Some forum members had shown quite an interest in the history of the emulator, so i thought, why not? I'll write a history of the emulator to the best of my knowledge for everybody to look at! Hopefully those who have been here longer (like bositman) can fill you in a bit more on what happened. My apologies for any inaccuracies, I didn't join the team until version 0.8.0 (January 2005)! So, here goes.....
- Created: 07 October 2010
- Written by Jake Stine
As most people probably know, PCSX2 is primarily a dual-thread application. The two main threads are described as such:
- EE/Core thread emulates the PS2's EmotionEngine (including VIF, SIF, GIF, and VUs) and the IOP (including SPU2, CDVD, and PAD)
- GS thread emulates the PS2's Graphic Synthesizer (includes texture swizzling, texture filtering, upscaling, and frame rendering)
Each thread relies on the other thread in some way -- the GS thread cannot swizzle texture data until the EE thread has uploaded said data, for example. Meanwhile, the EE thread cannot upload texture data to the GS thread if the GS thread is currently bogged down rendering last week's frame to video. During these periods, either thread will sleep, only to be woken up once the other thread has caught up in its workload.
In theory the act of sleeping the EE/GS threads should make benchmarking the CPU load registered by each thread pretty easy: all modern operating systems have built-in APIs for reading the busy/idle time of any thread on the system -- this is the same API used by your tried and true task/process manager, for example:
(Air shows off his personal favorite, ProcessExplorer, part of the SysInternals Suite)
This readout is simple, efficient, and seemingly reliable. It also avoids a lot of the annoying pitfalls one runs into trying to use common alternatives such as rdtsc and QueryPerformanceCounter.
... and this is precisely the method I decided to use for PCSX2 0.9.7.r3113 (and still in use as of r3878). Simple theory really: if the GS thread is sleeping a lot (low load) then the game is bottlenecked by EE/Core thread activity. If the EE thread is sleeping a lot and the GS thread reports 90+%, then the GS thread is the bottleneck (a problem often correctable through using lower internal resolutions, for example).
But as I've recently found out, it doesn't work as expected. -_-
It's filled with... threads!
The immediate problem faced by this simple method of load detection is that the latest wave of Windows Vista/7 GPU drivers themselves are multithreaded. It should have come as little surprise that one of the primary goals of the new DWM/Aero/DX11 systems implemented into Vista/7 is scalable parallel processing that takes better advantage of modern multi-core CPUs. Why this causes the OS built-in thread load detection to fail might be less obvious; I'll explain with an example:
When the GPU driver receives a directive to render the current scene (aka 'Present' in DirectX lingo), it sends the job to a thread dedicated to the task. That thread has a Present Queue, typically 1 or 2 frames deep, that automatically handles triple buffered vsync'd page updates. If the queue is full when the PCSX2 GS thread issues its next Present request, the GPU driver will put the GS thread to sleep until a slot in the Present Queue becomes available. End result: The GS thread reports idle time to the operating system (and to PCSX2's GS window), but the GPU is still quite overloaded and bottlenecked via work supplied to it by a different thread altogether.
In essence, it is nearly the same sort of inter-thread dependence that the EE/Core and GS threads have between each other, only now the EE/Core thread's dependency chain extends to include GS and GPU driver threads (of which there could be one or many).
The solution to this problem is to use a more traditional method of manual load checking: timing various sections of code executed in-thread via either the aforementioned rdtsc (timestamp) or QueryPerformanceCounter, read at key points in the GS thread's execution/program flow. This wasn't such a great idea a few years ago, due to K8/Athlon and P4 generation CPUs lacking a stable internal clock counter. Fortunately, all modern CPUs have a consistent counter suitable for benchmarking, so the pitfalls that have been long associated with using Intel/AMD timestamps are finally obsolete enough to not be a concern for us here.
- Created: 14 September 2010
- Written by Jake Stine
Over the past two years I have become dearly intimate with Microsoft's Visual C++ 2008 compiler, and the methods it uses for optimizing code. Now generally speaking MSVC 2008 does well -- very well -- especially for everyday "not-so-clever" code. Its global optimization feature (aka Linktime Code Generation, or LTCG) is also a tremendous advantage over GCC -- though GCC is in the process of (finally!) adding LTCG to their own C/C++ compiler. MSVC does have a few very annoying failings as an optimizer, though. The most glaring of which has to do with templated code and inlined functions.
Disclaimer: This analysis is for Visual C++ 2008 only. I have not yet analyzed MSVC 2010's code generation. Some of these glitches may be improved or different (or worse even) in 2010. I'll post an update if/when I compile information it in the future.
Edit/Update: This bug only appears to manifest itself when the input parameters are 1st or 2nd generation propagated constants (which is hard to explain if you don't know what that means). So chances of hitting this bug are not actually all that common, but still plausible in many coding scenarios.
- Created: 31 August 2010
- Written by Jake Stine
For those who don't know, DMA stands for Direct Memory Access, and it refers to logic circuits in a computer that allow for the automated transfer of system memory to and from peripherals. DMAs are beneficial because they are simple circuits that do work in parallel to the CPU -- while a DMA transfers data, the CPU is free to do other work.that requires more complex computations and logic. The end result is better utilization of the computer's maximum memory transfer bandwidth and computational/logical ability.
- Created: 19 August 2010
- Written by Jake Stine
Yes, there is a way to simulate Microsoft's VirtualAlloc behavior on Linux. Much searching of the internet did not reveal a satisfactory answer; only hints that when combined with some applied tests of my own yielded the following result:
// to RESERVE memory in Linux, use mmap with a private, anonymous, non-accessible mapping.
// The following line reserves 1gb of ram starting at 0x10000000.
void* result = mmap((void*)0x10000000, 0x40000000, PROT_NONE, MAP_PRIVATE | MAP_ANON, -1, 0);
// to COMMIT memory in Linux, use mprotect on the range of memory you'd like to commit, and
// grant the memory READ and/or WRITE access.
// The following line commits 1mb of the buffer. It will return -1 on out of memory errors.
int result3 = mprotect((void*)0x10000000, 0x100000, PROT_READ | PROT_WRITE);
When using mmap, you can create a simple uncommitted reservation of memory simply by specifying PROT_NONE on any anonymous mapping (in the world of mmap, anonymous means it has no associated file/pipe -- it's just a memory block). This is sufficient for reserving a large contiguous address range from being fragmented up by the likes of malloc. Granting the memory read and/or write privileges tells Linux to commit the memory (equivalent to VirtualAlloc with MEM_COMMIT). If there is not enough system memory to complete the call, it returns -1.
Oddly enough, though, Linux makes it so that it isn't even necessary to bother with the above solution, via a strange little hacky technique called...
This 'feature' is enabled by default in most modern Linux kernels (anything 2.6 or newer). Basically all this means is that Linux will let programs commit a lot more RAM than is actually available to the operating system! Instead of performing a "strict contract" on commit that says "oh yes we absolutely have this much ram available", Linux looks at the ram and looks at the request, and makes some arbitrary judgement call on if the program will actually use that much ram or not. In other words, just because your call to malloc returned a valid non-NULL pointer doesn't mean there's actually anywhere near that much memory available to your app. It just means that Linux doesn't think you're going to use that much.
Instead, as a program references its allocated memory, Linux commits the memory on-demand. Most of the time, programs that malloc huge amounts of ram only use a wee bit of it, so that's fine. By using overcommitted memory management, Linux avoids the dreaded "Low on virtual memory!" error that can sometimes plague Windows. This is actually highly ideal for apps like PCSX2 and the Java virtual machine, for example. Kudos!
.. oh but things do get fun if apps over-step their bounds!
Thanks to over-committing, Linux programs that run out of memory do not get error codes or NULL pointers. Instead they will typically be KILLED INSTANTLY by the kernel. They do not get out of memory errors, and they don't even get SIGSEGV or anything else that can be handled or logged. They just DIE -- because doing anything else would risk system stability. So in the long run, its still a good idea to use the Reserve/Commit management strategy even on Linux (mmap / mprotect as described above); because your app will be more likely to get proper out-of-memory errors instead of just causing itself (and possibly other processes on the system) to die suddenly and without warning or error.
Another positive for the the above mmap / mprotect example is that it will also work well on Linux systems that have over-commit disabled, since it basically does what over-commit does but without the hacky "programs die instantly without error" part if the system runs out of physical memory.