
6th September 2010

Air

PCSX2 coder

[ 31 08 2010 @ 05:08 ]

PS2's Programmable DMA

For those who don't know, DMA stands for Direct Memory Access, and it refers to logic circuits in a computer that allow for the automated transfer of system memory to and from peripherals. DMAs are beneficial because they are simple circuits that do work in parallel to the CPU -- while a DMA transfers data, the CPU is free to do other work that requires more complex computations and logic. The end result is better utilization of the computer's maximum memory transfer bandwidth and computational/logical ability.

Traditionally DMAs are pretty simple. The Playstation 2's EmotionEngine, however, has an 'intelligent' programmable DMA controller (DMAC). Neatly translated, it means that the DMAC can do a lot more than just move raw data from place to place. It supports several modes of operation and has a number of special features to take advantage of the unique multi-core design of the EE. Furthermore, the EE's DMAC is much more tightly integrated with its memory bus than traditional DMAs, allowing it to transfer data with exceptional efficiency. These two features combined make the EE's DMAC a key component to PS2 games developers -- in quite a few games, the DMAC actually does more raw work than the EE Core CPU (R5900).

How The Real Thing Works

While emulating the actual hardware of the DMAC isn't usually needed, it can still be helpful to understand exactly how the PS2's real DMAC works at a hardware level. The EE DMAC operates at 147MHz (half the EE's core clock speed), and transfers 128 bits (16 bytes) of memory per cycle; meaning that the theoretical maximum transfer rate of the DMAC is roughly 2.4 GB/s (147MHz * 16 bytes = 2,352 MB/s). It's a nice number, but is technically unattainable even in ideal conditions. Further explanation will make it clear why.

The DMAC connects the PS2's 32 MB of Main Memory (RAM) to various peripheral interfaces, such as VIF (VPU), SIF (IOP), GIF (GS), and IPU (mpeg decoder). VIF, GIF, and IPU are all part of the Emotion Engine and operate at 147MHz, same as the DMAC itself. Thus each of those interfaces can send/receive data at roughly 2.4GB/s. SIF is limited by the IOP's own DMA controller and memory bus, which operates at 1/8th the speed of the EE's DMAC, or about 154MB/s.
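To make the arithmetic behind the 2.4 GB/s figure concrete, here is a minimal sketch in C. The function names are made up for illustration, and the only inputs are the two numbers quoted above (147MHz and 16 bytes per cycle).

```c
#include <assert.h>
#include <stdint.h>

/* Peak transfer rate in bytes per second, assuming one full-width transfer every clock cycle. */
static uint64_t peak_bytes_per_second(uint64_t clock_hz, uint64_t bytes_per_cycle)
{
    /* Guard against overflow of the 64-bit product. */
    assert(bytes_per_cycle != 0 && clock_hz <= UINT64_MAX / bytes_per_cycle);
    return clock_hz * bytes_per_cycle;
}

/* The figures quoted above: a 147MHz DMAC moving 128 bits (16 bytes) per cycle. */
static uint64_t ee_dmac_peak(void)
{
    return peak_bytes_per_second(147000000ULL, 16);  /* 2,352,000,000 bytes/s, or roughly 2.4 GB/s */
}
```

This is a ceiling only: it assumes a transfer on every single cycle, which is exactly the condition that the rest of this post explains is never met.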

Peripheral FIFOs

Each peripheral (VIF, GIF, SIF, IPU, etc) has a 128 or 256 byte FIFO. The FIFO helps mitigate occasional latency differences between Main Memory/SPRAM and the peripheral (some peripherals, in particular the GIF, can incur cycle stalls depending on data sent to them). Thanks to the FIFOs, data can be burst to/from memory in 128-byte blocks, which helps maximize data transfer rates since the EE's memory bus was built to operate most efficiently in those conditions. However, the maximum bandwidth of Main Memory (32MB) in ideal conditions is only ~1.2GB/s (half of the DMAC), and has additional memory bank related latencies, reducing its effective transfer rates even further. If DMA transfers are only done to/from Main Memory, the DMAC will only be able to come within about 40% of its theoretical maximum throughput.

Enter the Scratchpad!

The Scratchpad (SPRAM) is 16KB of memory integrated directly into the EmotionEngine. Because it is directly integrated on-die, it has no read/write latencies and can always be accessed at the maximum transfer rate of 2.4GB/s. The integrated nature of the SPRAM means it has to be small in order to fit -- and its lack of size is what limits its usefulness.

So in order to utilize the bandwidth potential of the EE DMAC, a PS2 programmer must find ways to use a combination of Main Memory and Scratchpad transfers in parallel: When main memory stalls due to inherent latencies, the DMAC will automatically busy itself with a pending SPRAM transfer. Likewise, while the DMAC is transferring to/from SPRAM, the EE's Main Memory becomes available to the CPU, which further improves the system's CPU throughput.

The Scratchpad's MemoryFIFO (MFIFO)

The MemoryFIFO function of the EE DMAC performs and manages two simultaneous DMA transfers, as follows:
  • Scratchpad -> Main Memory (RAM)
  • Main Memory (RAM) -> Peripheral (VIF1 or GIF)

As the buffer in memory is filled by Scratchpad, it is simultaneously drained by the attached peripheral, either VIF1 or GIF. On the surface, the MFIFO can appear to be somewhat silly, since the DMAC already has the ability to transfer directly from SPRAM -> Peripheral. Adding a stop in Main Memory might seem like a waste of the DMAC's bandwidth capacity, but in some situations the 'extra work' can result in a general improvement in overall transfer speeds.
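The fill/drain relationship is essentially a ring buffer. Below is a rough sketch in C of that idea only; it is not how the real DMAC registers work, and every name in it (MemoryFifo, mfifo_fill, mfifo_drain, the 64-quadword ring size) is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define QWC_BYTES 16u   /* one quadword: the DMAC's 128-bit transfer unit */
#define RING_QWC  64u   /* ring size in quadwords; arbitrary for this sketch */

typedef struct {
    uint8_t mem[RING_QWC * QWC_BYTES];  /* stands in for the ring buffer in Main Memory */
    size_t  head;                       /* next quadword the peripheral will drain */
    size_t  count;                      /* quadwords currently queued */
} MemoryFifo;

/* Source side (Scratchpad -> Main Memory): queue up to qwc quadwords; returns how many fit. */
static size_t mfifo_fill(MemoryFifo *f, const uint8_t *spram, size_t qwc)
{
    size_t done = 0;
    assert(f != NULL && spram != NULL);
    while (done < qwc && f->count < RING_QWC) {
        size_t tail = (f->head + f->count) % RING_QWC;
        memcpy(&f->mem[tail * QWC_BYTES], &spram[done * QWC_BYTES], QWC_BYTES);
        f->count++;
        done++;
    }
    return done;
}

/* Drain side (Main Memory -> VIF1 or GIF): hand over up to qwc quadwords; returns how many were taken. */
static size_t mfifo_drain(MemoryFifo *f, uint8_t *peripheral, size_t qwc)
{
    size_t done = 0;
    assert(f != NULL && peripheral != NULL);
    while (done < qwc && f->count > 0) {
        memcpy(&peripheral[done * QWC_BYTES], &f->mem[f->head * QWC_BYTES], QWC_BYTES);
        f->head = (f->head + 1) % RING_QWC;
        f->count--;
        done++;
    }
    return done;
}
```

When the drain side stalls, the fill side can keep going until the ring is full, and the Scratchpad is free again as soon as its data has been copied out.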

The PS2 engineers introduced the MFIFO for two reasons:

1. The scratchpad is too small. MFIFO can be used by the EE core as a place to "empty" the scratchpad after it has completed a set of data processing. While the data in the MFIFO awaits the DMAC to transfer it, the EE is free to load new raw data into Scratchpad for processing.

2. The GIF has additional bandwidth constraints since it has direct connections to three PATHs: the VU1 co-processor (GIF PATH1), VIF1 FIFO (GIF PATH2), and the DMAC's GIF channel (GIF PATH3). When transfers are active on any one of the paths, the other two paths must idle/stall until the current path's transfer completes; meaning that DMAC transfers to both GIF and VIF1 channels can have unexpectedly long stalls.

So by using MFIFO, the EE core can mitigate the unpredictable GIF/VIF1 stalls while it works on entirely new sets of data in parallel. If a GIF transfer via DMA is stalled because of other PATH1 or PATH2 transfers, the DMAC can busy itself with other transfers in the meantime, such as SPRAM->memory or memory->SPRAM. These transfers are nearly 'free' in a sense, since the DMAC would have been idle regardless -- but thanks to the MFIFO concept, the SPRAM itself will be free for use by the EE Core to continue processing data. Thus while the DMAC's overall productivity isn't affected, the EE's overall computational ability improves.

I'll talk a bit more on actual emulation details of the PS2's programmable DMA controller in future blogs, so this is To Be Continued...


Air

PCSX2 coder

[ 19 08 2010 @ 08:47 ]

VirtualAlloc on Linux

Yes, there is a way to simulate Microsoft's VirtualAlloc behavior on Linux. Much searching of the internet did not reveal a satisfactory answer; only hints that, when combined with some applied tests of my own, yielded the following result:

Code:
#include <sys/mman.h>

// to RESERVE memory in Linux, use mmap with a private, anonymous, non-accessible mapping.
// The following line reserves 1gb of address space, ideally starting at 0x10000000.  The address
// is only a hint (no MAP_FIXED), so use the returned pointer.  mmap returns MAP_FAILED on error.

void* result = mmap((void*)0x10000000, 0x40000000, PROT_NONE, MAP_PRIVATE | MAP_ANON, -1, 0);

// to COMMIT memory in Linux, use mprotect on the range of memory you'd like to commit, and
// grant the memory READ and/or WRITE access.
// The following line commits the first 1mb of the buffer.  It will return -1 on out of memory errors.

int result3 = mprotect(result, 0x100000, PROT_READ | PROT_WRITE);

When using mmap, you can create a simple uncommitted reservation of memory simply by specifying PROT_NONE on any anonymous mapping (in the world of mmap, anonymous means it has no associated file/pipe -- it's just a memory block). This is sufficient for reserving a large contiguous address range from being fragmented up by the likes of malloc. Granting the memory read and/or write privileges tells Linux to commit the memory (equivalent to VirtualAlloc with MEM_COMMIT). If there is not enough system memory to complete the call, it returns -1.
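For anyone who wants something closer to a drop-in pair of helpers, here is a minimal sketch that wraps the same two calls. The names vm_reserve, vm_commit and vm_release are invented for this example (they are not PCSX2 functions), and it assumes a Linux/glibc system.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* RESERVE: claim address space with no access rights. `hint` may be NULL.
   Returns NULL on failure (errno is set by mmap). */
static void *vm_reserve(void *hint, size_t size)
{
    void *p = mmap(hint, size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

/* COMMIT: grant read/write access to a page-aligned part of a reservation.
   Returns 0 on success or -1 on failure (errno is set by mprotect). */
static int vm_commit(void *addr, size_t size)
{
    assert(addr != NULL);
    return mprotect(addr, size, PROT_READ | PROT_WRITE);
}

/* RELEASE: give the whole reservation back to the OS. */
static int vm_release(void *addr, size_t size)
{
    return munmap(addr, size);
}
```

Typical use is to call vm_reserve once with the largest size you could ever want, then vm_commit on pieces of it as they are needed.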

Oddly enough, though, Linux makes it so that it isn't even necessary to bother with the above solution, via a strange little hacky technique called...

Over-committing Memory

This 'feature' is enabled by default in most modern Linux kernels (anything 2.6 or newer). Basically all this means is that Linux will let programs commit a lot more RAM than is actually available to the operating system! Instead of performing a "strict contract" on commit that says "oh yes we absolutely have this much ram available", Linux looks at the ram and looks at the request, and makes some arbitrary judgement call on if the program will actually use that much ram or not. In other words, just because your call to malloc returned a valid non-NULL pointer doesn't mean there's actually anywhere near that much memory available to your app. It just means that Linux doesn't think you're going to use that much.

Instead, as a program references its allocated memory, Linux commits the memory on-demand. Most of the time, programs that malloc huge amounts of ram only use a wee bit of it, so that's fine. By using overcommitted memory management, Linux avoids the dreaded "Low on virtual memory!" error that can sometimes plague Windows. This is actually highly ideal for apps like PCSX2 and the Java virtual machine, for example. Kudos!

.. oh but things do get fun if apps over-step their bounds!

Thanks to over-committing, Linux programs that run out of memory do not get error codes or NULL pointers. Instead they will typically be KILLED INSTANTLY by the kernel. They do not get out of memory errors, and they don't even get SIGSEGV or anything else that can be handled or logged. They just DIE -- because doing anything else would risk system stability. So in the long run, it's still a good idea to use the Reserve/Commit management strategy even on Linux (mmap / mprotect as described above); because your app will be more likely to get proper out-of-memory errors instead of just causing itself (and possibly other processes on the system) to die suddenly and without warning or error.

Another positive for the above mmap / mprotect example is that it will also work well on Linux systems that have over-commit disabled, since it basically does what over-commit does but without the hacky "programs die instantly without error" part if the system runs out of physical memory.
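If you want to know which policy a given machine is using, the kernel exposes it through /proc/sys/vm/overcommit_memory. A small sketch for reading it follows; the two function names are made up for this example.

```c
#include <stdio.h>

/* Returns the kernel's overcommit mode, or -1 if the setting cannot be read
   (for example on a non-Linux system). */
static int read_overcommit_mode(void)
{
    FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
    int mode = -1;
    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &mode) != 1)
        mode = -1;
    fclose(f);
    return mode;
}

/* Human-readable description of the value returned above. */
static const char *overcommit_mode_name(int mode)
{
    switch (mode) {
    case 0:  return "heuristic overcommit (the default)";
    case 1:  return "always overcommit";
    case 2:  return "strict accounting (overcommit disabled)";
    default: return "unknown";
    }
}
```

A program could log overcommit_mode_name(read_overcommit_mode()) at startup, which makes "it just died" reports a lot easier to interpret.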


Air

PCSX2 coder

[ 19 08 2010 @ 08:11 ]

Advanced memory management

Being an emulator of a fairly robust system (the PS2), PCSX2 typically consumes a lot of system RAM. It needs multitudes of caches and buffers for various things. Just to give an idea, I'll list some of the larger stuff and their current defaults:
  • PS2 main memory [32mb]
  • IOP memory [2mb]
  • EE/IOP BIOS roms [6mb]
  • Scratchpad, Hardware registers, VU memory, DMA buffers, etc [4mb]
  • VTLB indexes, lookups, and protection tables [8mb]
  • EE/R5900 recompiler cache [16mb]
  • EE/R5900 recompiler block/pc translation table [48mb]
  • R5900 memory protection mirror [32mb]
  • IOP/R3000A recompiler cache and translation table [10mb]
  • microVU recompiler code caches [16mb * 2]
  • superVU recompiler code caches [8mb]

If all of these things are reserved when PCSX2 starts, we have a base memory footprint of over 200 megs before even a single instruction of PS2 code is executed! The worst part is that we could really stand to allocate even more ram: some games need over 120 mb of recompiler caches to run properly. Currently those games are dealt with by issuing periodic recompiler resets (sluggish).

Fortunately modern operating systems have a lot of built-in features that help us out. Both Windows and Linux OSes use virtual memory mapping features of our Intel/AMD cpus to perform "virtual" allocations of large memory reserves. What this means is that initially the allocated memory has no actual physical equivalent. It is only given a physical presence once the memory is accessed (read or written). Explained as a process:

1. App requests 1gb of RAM via malloc.
2. Operating system "reserves" the 1gb of RAM, which marks the virtual addresses for use by this memory only. In this case the memory might be reserved from 0x10000000 (0.25gb) -> 0x50000000 (1.25gb).
3. Operating system "commits" the 1gb of RAM, ensuring there is enough physical and swapfile RAM to accommodate it. No actual memory or swapfile changes are made; only the tracked amount of ram/swap in reserve is altered.
4. App receives a pointer to the reserved ram.
5. App reads or writes data -- 128mb worth, let's say.
6. OS receives a page fault exception for that memory, and allocates a chunk of physical RAM for it. Other processes may be swapped out to disk at this time to make room for the memory in-use.

Only at Point 6 does any actual physical ram get used by the program. Prior to Point 6, the app has used exactly 0 bytes of RAM, in spite of allocating 1gb via malloc. This feature is implicit to both Windows and Linux and already helps work wonders on PCSX2's overall memory footprint. This is also why you might get "Low on virtual memory!" errors even though it appears as though you have lots of free ram in the System Monitor / Process Explorer, because some apps commit lots of memory but only actually access a small fraction of it.
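The numbered steps above can be sketched as a tiny C routine: allocate a big block, then write to only part of it. The function name touch_prefix is hypothetical, and the claim that untouched pages cost no physical ram assumes the allocator hands back fresh pages from the OS, which is the usual case for large requests.

```c
#include <stdlib.h>
#include <unistd.h>

/* Allocate `reserve` bytes but write to only the first `touch` bytes, one byte per page.
   Returns the number of pages written, or -1 on failure. Each first write to a fresh
   page is what makes the OS back that page with physical ram. */
static long touch_prefix(size_t reserve, size_t touch)
{
    long page = sysconf(_SC_PAGESIZE);
    volatile unsigned char *buf;
    long touched = 0;

    if (page <= 0 || touch > reserve)
        return -1;
    buf = malloc(reserve);              /* the reserve and commit steps happen in here */
    if (buf == NULL)
        return -1;
    for (size_t off = 0; off < touch; off += (size_t)page) {
        buf[off] = 1;                   /* the app writes data and the OS handles the fault */
        touched++;
    }
    free((void *)buf);
    return touched;
}
```

Calling it with a 1gb reserve and a 128mb touch mirrors the example in the list: only the touched 128mb ever needs physical pages.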

There are ways, however, to fine tune memory access and get even better memory management than the implicit Windows / Linux provisions via malloc. The first rule is a simple one, but one that many programmers probably have no idea about:

Do not use calloc, and do not clear allocated memory by default unless you absolutely have to!

Calling calloc instead of malloc causes the entire allocation (1gb in our above example) to be committed to physical memory because of its being cleared to zero. Likewise, manually clearing buffers to zero (or some other value) has the same effect. Even if only a small portion of the array ends up being used later on, it's too late: the whole thing is sucking up resources for no good reason except to express a patterned fill value. Sometimes clearing buffers cannot be avoided, but most of the time buffers need not be cleared at all, and programmers simply use calloc or manual clears out of habit.

Using Reserve and Commit to manage recompiled code buffers.

There are two phases to allocating memory on a virtual memory system, as noted in the small ordered list above. By default, malloc will reserve and commit ram together. This is done so that the system can ensure that there is enough ram and swap free to give the program the entire allocation -- if it happens to ever need it. If the commit phase fails due to there not being enough physical ram, malloc returns NULL. If you manage the reserve and commit phases separately, then you can reserve extra large swaths of memory addresses without affecting the rest of the operating system in any way; and then later on commit portions of the reserve only as needed. There aren't a whole lot of reasons why you'd need to micro-manage the virtual memory system in this way, and for most purposes simply using malloc and letting the OS do its own internal management suffices nicely. Lucky for us, PCSX2 has one!

One of the troubles with recompiled code is that it can't be allowed to move. Typically use of malloc and realloc results in allocated memory moving around as it grows or shrinks. This is fine for most purposes, but is disastrous to executable code since it invalidates all block pointers and long jumps (which use absolute addressing). In order to grow a recompiled code cache using traditional malloc, you have to clear the cache and start over -- a recompiler reset. This usually causes a lengthy hiccup in emulation speed when it happens.

Virtual memory techniques can be used to get around that. When we reserve the recompiled code cache, we reserve the upper limit of what we deem a sane cache size. In this case, the R5900 cache should be a maximum of 48mb. The 48mb is reserved from 0x30000000->0x33000000 when PCSX2 starts, and the first 4mb are committed when PCSX2 starts executing R5900 code. When the cache fills, PCSX2 automatically commits more memory in 128k increments, up to 48mb -- at which point the emulator will reset the cache and start over. Thanks to the virtual memory strategy described above, only a fraction of the 48meg allocation actually exists in physical ram unless more of the allocation is actually needed. Furthermore, computers with limited RAM resources or disabled swapfiles will still be able to run PCSX2 nicely.

Committing blocks of memory from the 48meg reserve never alters the base address of the memory, so no pointers become invalid, and no memory needs to be copied or shuffled in order to make room for the larger caches. The end result is near instantaneous increases in cache size, on-the-fly! ... and all the while maintaining a compact and efficient memory footprint for games that don't need more than the basic caches.
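Here is a rough sketch of that grow-in-place idea for Linux, using mmap and mprotect. It is not PCSX2's actual code: the CodeCache struct and the cache_reserve / cache_grow names are invented for illustration, and it grants only read/write access, where a real recompiler cache would also need execute permission.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

typedef struct {
    unsigned char *base;    /* fixed for the lifetime of the reservation */
    size_t reserved;        /* upper limit of the cache, e.g. 48mb */
    size_t committed;       /* bytes currently usable, starting at base */
} CodeCache;

/* Commit `step` more bytes at the end of the committed region. `step` should be a
   multiple of the page size. The base address never changes. Returns 0 on success,
   or -1 if the reserve is exhausted (time for a cache reset) or the commit fails. */
static int cache_grow(CodeCache *c, size_t step)
{
    assert(c != NULL && c->base != NULL);
    if (step > c->reserved - c->committed)
        return -1;
    if (mprotect(c->base + c->committed, step, PROT_READ | PROT_WRITE) != 0)
        return -1;
    c->committed += step;
    return 0;
}

/* Reserve the full address range up front, then commit only the initial portion. */
static int cache_reserve(CodeCache *c, size_t reserved, size_t initial)
{
    void *p = mmap(NULL, reserved, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    c->base = p;
    c->reserved = reserved;
    c->committed = 0;
    return cache_grow(c, initial);
}
```

With the numbers from the text, that would be cache_reserve(&cache, 48 << 20, 4 << 20) at startup and cache_grow(&cache, 128 << 10) each time the cache fills.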

On Windows this technique is implemented using VirtualAlloc, which is fairly well documented via the linked MSDN page. On Linux, however, things get a bit strange. The technique can be implemented using a combination of mmap and mprotect, but unfortunately the Linux man pages lack any actual explanation of how to perform independent reserve and commit actions (but rest assured, it can be done). Furthermore, Linux has an implicit system enabled by default called Over-committing, which basically skips phase (3) described above -- and always returns a valid pointer on calls to malloc, even if the system hasn't enough ram to accommodate the request.

Over-committing is so surpassingly hacky and evil that it deserves a blog post all to itself, so stay tuned. ;)

To be continued...


Air

PCSX2 coder

[ 07 07 2010 @ 08:30 ]

Why MSI sucks...

I just spent five weeks waiting for MSI to mail me a check back for an RMA'd HD 5770. I mailed my card in on May 28th, after getting an RMA number from MSI tech support. The check showed up today on July 7th. The amount refunded naturally doesn't cover shipping costs, either for the initial purchase or for mailing the piece of crap back to MSI. And when I consider time spent troubleshooting the GPU, dealing with tech support, packing and mailing the card back, and now shopping for a replacement (I hate shopping for GPUs, and not surprisingly, whatever I buy this time around will not be an MSI brand), I'm really feeling steamrolled.

(More gory details of the ordeal are posted here: http://forums.pcsx2.net/Thread-blog-Why-MSI-sucks?pid=126869#pid126869)

In other news, we recently freed PCSX2 of the shackles of MMX and XMM register freezes. This is a godsend for Linux and Mac users, as it means PCSX2 can finally be compiled with gcc optimizations enabled and be more stable at the same time. Thusly Linux and Mac users can most likely expect some really good things in the near future.

I may do a more detailed blog on the nightmares that plagued the Linux/Mac builds for so many years, and how we went about fixing them. But for now I'm too busy trying to pick out a new HD 5770, because MSI was too cheap to send me a replacement.


Air

PCSX2 coder

[ 11 06 2010 @ 11:52 ]

The return of the Commandline!

After its absence for many moons, the Commandline functionality will finally be restored to PCSX2. Third-party frontend and config-manager authors rejoice! ... and hopefully stop hating my guts, too.

To paraphrase Darth Vader: "Witness the power of this fully armed and operational Command line-driven battlestation."

(we all know a command line-driven death star would have been way cooler than some click-and-drag crap.)

The new PCSX2 command line should be functional in our next beta release, which should be out pretty soon, and it will work as follows:

Syntax: pcsx2 [IsoFile] --toggle --option=value ... etc
  • IsoFile - optional ISO image to load and run on startup; uses the PCSX2 internal ISO loader.

General Options :
  • --cfg=[file] {specify a custom configuration file to use instead of PCSX2.ini (does not affect plugins)}
  • --cfgpath=[dir] {specifies the config folder; applies to pcsx2 + plugins}
  • --help {display this help text}
  • --forcewiz {forces running of the First-time Wizard (selection of docs folders and what-not)}

Auto-Run Options :
  • --elf=[file] {executes an ELF image}
  • --nogui {disables display of the gui on exit (program auto-exits)}
  • --nodisc {boots with an empty dvd tray; use this to boot into the PS2 system menu}
  • --usecd {uses the configured CDVD plugin instead of IsoFile}

Compatibility Options:
  • --nohacks {disables all speedhacks}
  • --gamefixes=[fix,fix] {Enable specific gamefixes for this session. Valid fixes in 0.9.7 are: VuAddSub, VuClipFlag, FpuCompare, FpuNegDiv, XGKick, IpuWait, EETiming, SkipMpeg }
  • --fullboot {disables the quick boot feature, forcing you to sit through the PS2 startup splash screens}

Plugin Overrides (specified dlls will be used in place of configured dlls):
  • --cdvd=[dllpath] {override for the CDVD plugin}
  • --gs=[dllpath] {override for the GS plugin}
  • --spu=[dllpath] {override for the SPU2 plugin}
  • --pad=[dllpath] {override for the PAD plugin only}
  • --dev9=[dllpath] {override for the DEV9 plugin}
  • --usb=[dllpath] {override for the USB plugin only}



Air

PCSX2 coder

[ 22 03 2010 @ 18:38 ]

SPU2 is more than just sound!

The SPU2 is the Sound Processing Unit for the Playstation 2, and works a lot like the sound card in your own PC; albeit still quite unique in its approach to mixing sounds/voices and the programmable interface it provides for that. But the SPU2 is more than just sound. It's one of the more reliable timing mechanisms on the PS2 and games tend to use it as such. Without at least basic SPU2 emulation, no games will boot at all. This isn't too surprising if you understand how console hardware typically works, but what might be surprising is realizing how many games won't boot even with what appears to be fairly competent SPU2 emulation.

Until SPU2-X 1.4, no SPU2 plugin had gone the distance on implementing IRQs (Interrupt Requests). IRQs are scheduled via specific SPU2 memory addresses. When a marked memory address is accessed anywhere in SPU2 memory (either read or write), the IRQ is signaled to the IOP. The most important IRQs on DMAs and audible voice playback have been supported for eons; without these no games would boot, period! Meanwhile, many of the lacking IRQ checks were known, but glossed over because of overhead required for the checks (a couple other checks were simply overlooked). The three main culprits for causing emulation errors were as follows:

1) the "free run" feature of SPU2 voices.
2) the write-back areas for each core's mixed output.
3) Reverb Processing, which uses a series of overlapping buffers to generate feedback.


Free Running Voices

The SPU2 has 48 total voices (24 voices for each core), plus two dedicated streaming audio input sources. Each voice can play a sound effect or stream audio, and can either be stopped, looping, or 'free running.' Free running voices typically zero out their volume rather than stopping or looping, and continue to 'play' forever (albeit silently). These free running voices access inaudible areas of SPU2 memory and thus trigger IRQs unexpectedly -- except, of course, some games are cleverly designed to expect these unexpected IRQs!

Because of the overhead required to free-run otherwise silent voices, all other SPU2 plugins (until now!) have opted to ignore processing them. This is the feature that fixes Fatal Frame 2 (Project Zero 2) and a dozen more games.

Output Write-back Areas

The SPU2 defines a handful of special areas of memory where it writes back sound data at various stages of the mixing process. It's perfectly legal for a game to set an IRQ address within these buffers, and then expect it to trigger when the SPU2 does its write-back to that address. The write-back areas are mapped as follows:

Code:
0x0400 - 0x05FF  :  Core 0, Voice 1
0x0600 - 0x07FF  :  Core 0, Voice 3
0x0800 - 0x09FF  :  Core 0 Output (Left) [includes Wet/Dry/ADMA sources]
0x0A00 - 0x0BFF  :  Core 0 Output (Right) [includes Wet/Dry/ADMA sources]
0x0C00 - 0x0DFF  :  Core 1, Voice 1
0x0E00 - 0x0FFF  :  Core 1, Voice 3

// Following are results of mixing all 24 voices for the given Core.

0x1000 - 0x11FF  :  Core 0, Dry Mix (Left)
0x1200 - 0x13FF  :  Core 0, Dry Mix (Right)
0x1400 - 0x15FF  :  Core 0, Wet Mix (Left)
0x1600 - 0x17FF  :  Core 0, Wet Mix (Right)
0x1800 - 0x19FF  :  Core 1, Dry Mix (Left)
0x1A00 - 0x1BFF  :  Core 1, Dry Mix (Right)
0x1C00 - 0x1DFF  :  Core 1, Wet Mix (Left)
0x1E00 - 0x1FFF  :  Core 1, Wet Mix (Right)

Specifically, some games set an IRQA for Core0's write-back area. The IRQ can either be used as a timing mechanism, or as a synchronization point for post-processing audio effects. Most SPU2 plugins properly handled the write-backs, but overlooked the necessity of doing IRQ checks for them.
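Since every write-back area in the table above is 0x200 wide and the areas are contiguous from 0x0400 to 0x1FFF, finding the area for an address is a simple division. The sketch below shows that lookup plus the IRQ check the write-back needs; the function and table names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Write-back areas in address order, 0x200 each, starting at 0x0400. */
static const char *const kWritebackAreas[14] = {
    "Core 0, Voice 1",        "Core 0, Voice 3",
    "Core 0 Output (Left)",   "Core 0 Output (Right)",
    "Core 1, Voice 1",        "Core 1, Voice 3",
    "Core 0, Dry Mix (Left)", "Core 0, Dry Mix (Right)",
    "Core 0, Wet Mix (Left)", "Core 0, Wet Mix (Right)",
    "Core 1, Dry Mix (Left)", "Core 1, Dry Mix (Right)",
    "Core 1, Wet Mix (Left)", "Core 1, Wet Mix (Right)",
};

/* Returns the name of the write-back area containing `addr`, or NULL if there is none. */
static const char *writeback_area(uint32_t addr)
{
    if (addr < 0x0400 || addr > 0x1FFF)
        return NULL;
    return kWritebackAreas[(addr - 0x0400) / 0x0200];
}

/* True when a write-back to `addr` should signal the IRQ. */
static bool writeback_hits_irq(uint32_t addr, uint32_t irq_addr, bool irq_enabled)
{
    return irq_enabled && addr == irq_addr && writeback_area(addr) != NULL;
}
```

The second function is the part that was being overlooked: the write itself was emulated, but nothing compared its address against the IRQ address.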

Reverb Processing

The SPU2 employs a clever reverberation algorithm that utilizes multiple overlapping read and writeback buffers within SPU2 memory to generate feedback. Each step of the reverb process accesses memory and must test against the IRQ address; for a grand total of 24 IRQ tests per Core. Fortunately, all reverb activity occurs within a specified area of SPU2 memory, so for most games a single simple test can be used to exclude the IRQ test.
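The "single simple test" is just a range check on the IRQ address before doing any per-step work. A sketch of that, with made-up names and with the reverb area passed in as a start/end pair rather than read from real SPU2 registers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REVERB_STEPS 24   /* memory accesses per Core, per the text */

/* Cheap pre-check: per-step tests only matter if the IRQ address lies inside the reverb area. */
static bool reverb_needs_irq_tests(uint32_t irq_addr, bool irq_enabled,
                                   uint32_t fx_start, uint32_t fx_end)
{
    return irq_enabled && irq_addr >= fx_start && irq_addr <= fx_end;
}

/* Returns true if any of this Core's reverb accesses landed on the IRQ address. */
static bool reverb_hits_irq(const uint32_t step_addr[REVERB_STEPS],
                            uint32_t irq_addr, bool irq_enabled,
                            uint32_t fx_start, uint32_t fx_end)
{
    assert(step_addr != NULL);
    if (!reverb_needs_irq_tests(irq_addr, irq_enabled, fx_start, fx_end))
        return false;                     /* common case: skip all 24 comparisons */
    for (size_t i = 0; i < REVERB_STEPS; i++) {
        if (step_addr[i] == irq_addr)
            return true;
    }
    return false;
}
```

For most games the IRQ address sits outside the reverb area, so the cost per sample is one comparison instead of twenty-four.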


And It All Applies to SPU2null!

This is the boring part that I'm going to look into implementing soon: In order for SPU2null to be fully emulation-compliant, it must properly simulate all of these things, which basically means it needs to have a complete sound mixer implemented; including reverb buffering/addressing logic. It probably seems silly, but SPU2null would still be without any platform dependent code or sound drivers, making it an ideal base for emulation analysis and as a base for future plugins.
