Font Size




 Well if you've kept up with pcsx2's SVN, you'll notice we recently added a MTVU (Multi-Threaded microVU1) option, which runs VU1 on its own thread.

Getting pcsx2 to use more cores is something many people have asked for, and they wondered why we weren’t doing it. Some users would go as far as to flame us pcsx2 coders saying we didn’t have the skills to do it or would say some other nonsense.


Some forum members had shown quite an interest in the history of the emulator, so i thought, why not? I'll write a history of the emulator to the best of my knowledge for everybody to look at! Hopefully those who have been here longer (like bositman) can fill you in a bit more on what happened. My apologies for any inaccuracies, I didn't join the team until version 0.8.0 (January 2005)! So, here goes.....

Jake Stine_avatar

As most people probably know, PCSX2 is primarily a dual-thread application.  The two main threads are described as such:

  • EE/Core thread emulates the PS2's EmotionEngine (including VIF, SIF, GIF, and VUs) and the IOP (including SPU2, CDVD, and PAD)
  • GS thread emulates the PS2's Graphic Synthesizer (includes texture swizzling, texture filtering, upscaling, and frame rendering)

Each thread relies on the other thread in some way -- the GS thread cannot swizzle texture data until the EE thread has uploaded said data, for example.  Meanwhile, the EE thread cannot upload texture data to the GS thread if the GS thread is currently bogged down rendering last week's frame to video.  During these periods, either thread will sleep, only to be woken up once the other thread has caught up in its workload.

In theory the act of sleeping the EE/GS threads should make benchmarking the CPU load registered by each thread pretty easy: all modern operating systems have built-in APIs for reading the busy/idle time of any thread on the system -- this is the same API used by your tried and true task/process manager, for example:


(Air shows off his personal favorite, ProcessExplorer, part of the SysInternals Suite)

This readout is simple, efficient, and seemingly reliable.  It also avoids a lot of the annoying pitfalls one runs into trying to use common alternatives such as rdtsc and QueryPerformanceCounter.

... and this is precisely the method I decided to use for PCSX2 0.9.7.r3113 (and still in use as of r3878).  Simple theory really: if the GS thread is sleeping a lot (low load) then the game is bottlenecked by EE/Core thread activity.  If the EE thread is sleeping a lot and the GS thread reports 90+%, then the GS thread is the bottleneck (a problem often correctable through using lower internal resolutions, for example).

But as I've recently found out, it doesn't work as expected. -_-

It's filled with... threads!

The immediate problem faced by this simple method of load detection is that the latest wave of Windows Vista/7 GPU drivers themselves are multithreaded.  It should have come as little surprise that one of the primary goals of the new DWM/Aero/DX11 systems implemented into Vista/7 is scalable parallel processing that takes better advantage of modern multi-core CPUs.  Why this causes the OS built-in thread load detection to fail might be less obvious; I'll explain with an example:

When the GPU driver receives a directive to render the current scene (aka 'Present' in DirectX lingo), it sends the job to a thread dedicated to the task.  That thread has a Present Queue, typically 1 or 2 frames deep, that automatically handles triple buffered vsync'd page updates.  If the queue is full when the PCSX2 GS thread issues its next Present request, the GPU driver will put the GS thread to sleep until a slot in the Present Queue becomes available.  End result: The GS thread reports idle time to the operating system (and to PCSX2's GS window), but the GPU is still quite overloaded and bottlenecked via work supplied to it by a different thread altogether.

In essence, it is nearly the same sort of inter-thread dependence that the EE/Core and GS threads have between each other, only now the EE/Core thread's dependency chain extends to include GS and GPU driver threads (of which there could be one or many).

The solution to this problem is to use a more traditional method of manual load checking: timing various sections of code executed in-thread via either the aforementioned rdtsc (timestamp) or QueryPerformanceCounter, read at key points in the GS thread's execution/program flow.  This wasn't such a great idea a few years ago, due to K8/Athlon and P4 generation CPUs lacking a stable internal clock counter.  Fortunately, all modern CPUs have a consistent counter suitable for benchmarking, so the pitfalls that have been long associated with using Intel/AMD timestamps are finally obsolete enough to not be a concern for us here.

Post a Comment!

Jake Stine_avatar

Over the past two years I have become dearly intimate with Microsoft's Visual C++ 2008 compiler, and the methods it uses for optimizing code. Now generally speaking MSVC 2008 does well -- very well -- especially for everyday "not-so-clever" code. Its global optimization feature (aka Linktime Code Generation, or LTCG) is also a tremendous advantage over GCC -- though GCC is in the process of (finally!) adding LTCG to their own C/C++ compiler. MSVC does have a few very annoying failings as an optimizer, though. The most glaring of which has to do with templated code and inlined functions.

Disclaimer: This analysis is for Visual C++ 2008 only. I have not yet analyzed MSVC 2010's code generation. Some of these glitches may be improved or different (or worse even) in 2010. I'll post an update if/when I compile information it in the future.

Edit/Update: This bug only appears to manifest itself when the input parameters are 1st or 2nd generation propagated constants (which is hard to explain if you don't know what that means). So chances of hitting this bug are not actually all that common, but still plausible in many coding scenarios.

Jake Stine_avatar

For those who don't know, DMA stands for Direct Memory Access, and it refers to logic circuits in a computer that allow for the automated transfer of system memory to and from peripherals. DMAs are beneficial because they are simple circuits that do work in parallel to the CPU -- while a DMA transfers data, the CPU is free to do other work.that requires more complex computations and logic. The end result is better utilization of the computer's maximum memory transfer bandwidth and computational/logical ability.

You are here: Home Developer Blog