The goal here is to develop applications, mostly games, to be run on the Nintendo DS, from the hobbyist point of view, as opposed to a software/game production company. This kind of development is called homebrew.
We do not support piracy or the illegal use of the copyrighted Nintendo SDK by non-licensed developers. We use here unofficial development kits, such as devkitPro. These coding efforts are to remain free and legal.
This document is a guide rather than a tutorial or a reference book: somewhat in between these two kinds of writing, information is thematically organized (instead of being introduced incrementally as in a tutorial) without aiming at the exhaustiveness of a reference source (such sources already exist, and links to them are provided). Only DS-specific knowledge is gathered here; the reader is expected to be already familiar with at least some form of programming.
Portable development will be kept in mind, as having one's game working both on a DS and on a PC is quite enjoyable. In this guide the C/C++ language will be favored. The development platform will be a standard PC, preferably using GNU/Linux.
Most of the information here came from several homebrew websites (many thanks to Cearn for TONC, to Dovoto, Chris Double and Mollusk for their respective tutorials) and from IRC discussions (notably, thanks to Wintermute [Dave Murphy] and sgstair [Stephen Stair]). More generally, many thanks to the fearless homebrewers!
This guide has been created by Olivier Boudeville (a.k.a. sye).
First of course is the Nintendo DS itself (see a video presentation). It should be a DS Lite, which is quite superior to the previous "fat" version: better screens, thinner, lighter, better battery life, more beautiful, etc. The color does not matter, but we found the black version quite sober, which makes it possible to pass it off, to some extent, as a classical PDA. This is one of the intended uses in my case, besides game programming.
The main competitor for the DS would be Sony's PlayStation Portable, PSP, although they are quite different in price, capabilities and market shares. The lack of input device other than the gamepad, the quite low autonomy, the price and Sony policies make a lot of people favour DS over PSP, at least for the usage intended here (PDA and homebrew).
As most console manufacturers enforce a unique pricing for all vendors, choose the one with the better conditions (guarantee, return policy, etc.), preferably not online: dealing with real shops is convenient, especially to have one's DS replaced whenever there are dead pixels. This apparently happens quite often, and not all vendors readily agree to replace the console.
As for me, I bought my DS Lite in France, in a FNAC shop (I could cancel my purchase within a 15-day period), in May 2007, for 150 euros, and had no dead pixels.
Here is the content of the Nintendo box:

This box includes only:
You can make a better use of your DS potential thanks to them.

From the most useful to the least:
I bought a Bigben Interactive pack of 5 accessories (6 euros) and a (black) Subsonic Premium Pack (mostly for the leather case; 15 euros), both of course for DS Lite. Hence one can have all these recommended accessories for quite a low price.
See also the Wikipedia article about DS accessories.
Having a basic console is not enough to be able to program it: Nintendo, probably due to the fear of piracy, preferred to make the life of homebrewers more difficult than technically needed, notably thanks to hardened firmware preventing modifications or execution of non-acknowledged code. For example, Wifi demos are apparently protected by an RSA check (at least parts of them are encrypted with a private key from Nintendo), and the DS uses the embedded public key to decipher these demos before running them.
Hence, to have access to a DS that can be programmed, one has to circumvent these protections. One of the simplest ways of doing so is to use special game-like cards, at least once. This is a vast subject and many cases have to be taken into account, see Booting Tools on Wikipedia for details.
As the DS is not sold with a rewritable storage medium or anything like it (no standard card reader, no USB port, etc.), some specific device is required so that the DS can load homebrew applications. Though one could instead use WMB (Wireless Multiboot), which is a method to send programs to be run on the DS using a PC with specific Wi-Fi adaptors, this method would not be convenient enough: transferred content would be lost on the DS as soon as it is powered off, a Ralink-compatible Wifi adaptor would be required, the sent applications could not occupy more than 4 megabytes, a hacked firmware would be needed, etc.
As for me, I intended to have Linux working on my DS, and for that the native 4 megabytes of RAM were quite small, so I wanted to have both 32 megabytes of additional RAM (the maximum size that can be addressed by the related DS bus) and access to mass storage on removable media.
The best and simplest bet for me was to use, in the slot-1 (the DS-specific one, small and at the top of the console, a.k.a the DS card), a SuperKey (a kind of NoPass) as a fake game used to enable the execution of code from the slot-2 (the big one at the bottom, used for Game Boy Advance compatibility, a.k.a. the GBA cartridge). This slot-2 would host a SuperCard Lite (note: avoid the rumble series, it does not provide more RAM) which, besides the 32 megabytes of RAM, offers a microSD port. It is micro indeed (not mini), and SD stands for Secure Digital.
Homebrewers not interested in running Linux on their DS and searching for a cheap and well-supported solution may choose the R4DS, a slot-1 linker that also uses microSD cards.
I bought both the SuperKey and the SuperCard Lite (black version) from FL-games for 65 euros (shipping included) and was quite happy with the cards (no compatibility issue between the two cards to be afraid of) and with the store. I would not say the same of another French store I tried.
On this subject, although we are not affiliated in any way with them and (for the moment) we never bought from them (we heard of them too late), we would recommend nevertheless electrobee, a small Canadian store run by Natrium42, as their prices and reputation are good, and also because Natrium42 contributed a lot to the DS homebrew community, especially regarding hardware add-ons.
Here are the three boxes, for the Nintendo DS Lite, for the SuperKey and for the SuperCard. Note the two-euro coin: they are quite small indeed.

Here is a view of a bare DS running the default firmware (hence the mandatory health and safety warning) and, below, the quite tiny SuperKey and SuperCard Lite themselves:

Last but not least, one should have at least one micro-SD card. I bought a 2-gigabyte Kingston one from PC-look for 28 euros, shipping included. My laptop had an SD reader/writer (well, Ubuntu could not use it properly but that is another story), so there was no real need to buy anything else. Separate USB readers/writers are quite inexpensive anyway, and it is hard to buy an R4DS with no such card reader/writer included. Check that your micro-SD card is sold with an SD adapter, so that the actual tiny micro-card can fit in a basic SD reader.
To use your micro-SD, you should first format it in FAT (i.e. FAT16, not FAT32), as factory settings are not reliable in all cases, or are not the expected ones. Formatting can be done either from Windows or from Linux (see our FAT-dedicated section).
FAT is for the moment favored over other filesystems by most homebrewers, mainly because it is adequate for small size storage media, and for the sake of simplicity: there already exists a library for it on the DS, libfat.
Finally, in my case, the full equipment cost only 265 euros.
An alternative setup to the SuperKey/SuperCard is to use the R4DS linker:

This package costs about 35 euros (cheaper than the SuperKey+SuperCard pair) and is very useful when having problems with the SD support on Linux, thanks to the USB interface. A drawback is that the R4DS, being slot-1, will not provide additional RAM to applications such as DSLinux, whereas the SuperCard does. However DSLinux works well on the R4 DS.
Such removable media (as microSD) are not the magic bullet for the frantic homebrewer: to anticipate a bit on further subjects, when generating programs for the DS from your PC, to test them you may either execute them "in place", directly on your PC thanks to DS emulators (but they do not behave always as a DS would behave) or you have to send them to the DS, one way or another.
This can be done thanks to the aforementioned microSD card, but it soon becomes tedious to swap again and again the card and its adapter. Furthermore some linkers, like the R4 DS, use a spring-based microSD slot. It makes the swap still harder, as in order to remove the microSD, one has to push it first, which may push as well the R4; in that case one may pull the full linker instead of just the microSD. Annoying when performing trial and error debugging.
An alternate way would be to use a Wifi access point or a Wifi router, preferably not a mere Wifi adapter, which would have to embed a specific chip etc. in order to work with the DS. Such devices could be for example the famous Linksys WRT54G routers (55 euros approximately) or La Fonera, both of which are inexpensive and, on a side note, run Linux; their firmware and software can be programmed in a way very similar to the one we will use to program applications for the DS. Note this may involve running specific software on the DS (ex: DSLinux) to transfer ROM wirelessly, which may or may not be more convenient than the basic card swap.
Other still less usual methods would be to rely on:
Finally, using a USB reader/writer like the one provided with the R4DS is probably one of the best solutions.
The snapshot below shows the official main screen:

What happens when the cards are inserted? Both ought to be used: if only the slot-1 one (SuperKey) is inserted, you will have two white screens, and if only the slot-2 one (SuperCard) is active, then your DS will run in backward compatibility mode for the GBA.
When both cards are inserted, you can see the SuperCard splash screen, but here one could not go further, as the microSD card was not inserted. See it in its case, on the right, the SD adapter being on the left. The SuperCard-like black object below the DS is just the default plastic filler to protect the console from dust.

Inserting the microSD card and rebooting is not really interesting, as this card is still empty.
We see nevertheless that our customized console works well and is almost ready to be used. The last step may be to upgrade the Supercard firmware, to benefit from improvements whose interest should be evaluated beforehand (beware of regressions though, and check the assessment of other users first). To upgrade the firmware, download its targeted version from the official site (ex: microsd_eng_181.rar, for the 1.81 version), extract it with unrar (installed with apt-get install unrar) thanks to unrar e microsd_eng_181.rar. Put the result (my_micro_eng_181.bin) at the root of, here, your microSD card, before booting the DS with it. Then the firmware should be upgraded by selecting this file in the Supercard menu, and pressing the A key.
Buying things was quite easy. Now, let's try to use them! First of all, let's learn what is available for our developments.
Even though Nintendo designed a low-cost handheld, the DS offers quite a lot of features:
two all-purpose processors (CPU):
Both can be running code at the same time.
The ARM (originally Acorn RISC Machine) family is quite widespread on embedded devices, and is known to be small, cheap and power-efficient. The Nintendo 64 offered roughly 100 MIPS, PlayStation 2 and PSP (PlayStation Portable) roughly 560 MIPS, GameCube 980 MIPS. According to some sources, the DS processors are under-clocked to save some power. Some people overclocked them, which involves quite some electronics skills and risks.
The overall architecture (including the ARM, the rendering engine, the keys, etc.) took as a starting point that of the GameBoy Advance. Both have an ARM7 (but the DS one is running at 33 MHz instead of 17 MHz), which partly explains the very good backward compatibility of the DS with the GBA. More generally the DS hardware is quite close to that of PocketPC devices. Performance is expected to be somewhat near that of the Nintendo 64.
More low-level information about the DS hardware:
There are two different ports in which cards can be inserted: Slot-1 and Slot-2. In both cases, cards are mostly ROM, but some of them have a small amount of non-volatile memory, usually EEPROM (notably Flash memory), to save some data (highscores, game saved, etc.).
Usual DS game cards are 33.0 × 35.0 × 3.8 mm, and weigh around 3.5 g. They can contain up to 128 megabytes, but in this case their transfer rates are smaller than the ones of 64-megabyte cards.
SuperKey and the other Slot-1 NoPass counterparts, apart from allowing the use of the Slot-2 devices, do not add any memory or feature. They just set the program counter so that it points to a memory location in the GBA slot, allowing programs to be run from storage there. They consume some power that reduces the console autonomy. A flashed firmware can be used instead, for the most daring.
There are several slot-2 add-ons (called linkers), which all provide their own set of features. From the ARM9 point of view, they can offer up to 32 megabytes of additional ROM (which can be actually RAM) and up to 64 kilobytes of SRAM, intended for game saves, preferences, etc. Any other storage these cards may include cannot be seen directly by the ARMs.
Therefore these slot-2 linkers use their built-in RAM as a mock flash cartridge, i.e. as a cache fed from their hidden mass storage. There are various systems that upload in the ARM-visible 32-megabyte address space any desired chunk from any hidden mass storage. Some DS homebrew (ex: DSLinux) also use this additional RAM, though bus and speed issues make it less useful than the DS built-in memory.
More precisely, the GBA slot bus only supports 16-bit writes, which leads to issues when needing to perform 8-bit only write operations. As for speed, some people complained about Supercard's built-in RAM or some versions of microSD cards being too slow for some GBA games.
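The 16-bit write constraint can be worked around with a read-modify-write of the enclosing halfword. The following is only a minimal sketch of that idea, run here against ordinary memory so that it can be tried on a PC; the function name write8_via_16 is hypothetical and is not taken from any DS library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes one byte through a bus that only accepts 16-bit writes:
 * read the enclosing halfword, patch the wanted byte, write it back.
 * base must point to the start of a 16-bit aligned region. */
void write8_via_16(volatile uint16_t *base, size_t byte_offset, uint8_t value)
{
    assert(base != NULL);

    volatile uint16_t *halfword = base + (byte_offset / 2);
    uint16_t current = *halfword;

    if (byte_offset & 1u) {
        /* Odd offset: with little-endian ordering this is the high byte. */
        current = (uint16_t)((current & 0x00FFu) | ((uint16_t)value << 8));
    } else {
        /* Even offset: this is the low byte. */
        current = (uint16_t)((current & 0xFF00u) | value);
    }

    *halfword = current; /* a single 16-bit write */
}
```

The cost is one extra read per byte written, which is one reason why byte-oriented code tends to be slower in this memory than in the built-in RAM.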
The SuperCard Lite is one of the most interesting linkers, as it fits well in the GBA port (it does not protrude when inserted in a DS Lite) and provides both the optimal 32 megabytes of RAM and an access to removable storage, here a microSD card.
Most of these removable cards are formatted in FAT (i.e. FAT16, not FAT32), which implies that no more than 2 gigabytes can be managed. Some linkers, like the R4DS / M3 Simply, may use FAT32 and take advantage of 4-gigabyte cards. Some other linkers are discussed in this PAlib tutorial.
Note that a supposed work-around for a DSLinux bug would require making partitions strictly smaller than 2 gigabytes. This can be done from a PC running GNU/Linux thanks to, for example, fdisk /dev/mmcblk0, then deleting the previous partition (d), creating two new primary partitions (1 and 2), the first with only 1020 megabytes, the second with the rest of the space, then setting them to FAT16 (t option, with value 6 for filesystem type). You should then have two FAT16 partitions, /dev/mmcblk0p1 and /dev/mmcblk0p2.
They can then be formatted with mkdosfs -c -F 16 -n DS-homebrew -v /dev/mmcblk0p1 and mkdosfs -c -F 16 -n DS-storage -v /dev/mmcblk0p2. Using pmount and udev makes it possible to mount these filesystems without needing root access.
Software distribution on the DS is mostly based on ROM images, which can be downloaded on a PC as unique files. These files are read by the DS system, which sees them as a kind of archive (a filesystem actually) containing possibly numerous files and directories.
Note that there are often more than one filesystem involved here:
There are several file extensions for Nintendo ROM. Each extension has a specific meaning, is associated to a specific executable format, and must be chosen adequately.
The sole purpose of ROM is to contain binaries (i.e. executables) for the two ARMs and, most often, application-specific data (icons, bitmaps, sounds, etc.), stored either as a unique flat file or as full filesystems, in one or more files, each filesystem containing in turn files and directories. Usually there is at most one filesystem. To be run from the GBA cart (slot-2), these ROM should include a loader that transfers them to the RAM.
The most frequent extensions are:
.nds (ex: myGame.nds):
this binary for the DS (slot-1) is used by official game cards and most emulators. It embeds a small header referring to a logo, a short description of the ROM content in several languages, followed by the two executable binaries needed (a region dedicated to the ARM7, then a region dedicated to the ARM9), and optionally some appended data, for example a filesystem. The logo and description texts are used when booting a game card from the firmware, or when starting to download one over wireless multiboot: GBA Movie Player and the FlashMe + WMB method support .nds files.
This file format was primarily designed to be run from RAM filled from the slot-1 (ex: M3 and DS-X), hence it did not need any specific loader. Therefore old generation .nds do not have a loader at all, and an external one must be used when run from the GBA port. This loader is generally provided by the menu loader of the flash cart device.
Some newer .nds to be run from the GBA cart nevertheless include a loader (often ndsloader.bin, see our dedicated section), which they store into some unused space of the header. Sadly, many GBA flashers mess up the NDS header, thinking it is a corrupted GBA header. An external loader is then needed again.
.ds.gba (ex: myGame.ds.gba): this binary for the DS is designed to run from GBA ROM. It is usually made of a small loader (to transfer thanks to the ARM7 the ROM content from the GBA card to the main RAM) contained in a 512-byte (GBA) header, then a .nds file, and zero or more appended data files. The code from the .nds accesses the appended data files thanks to reads from GBA ROM space. GBA flash carts, M3, and SuperCard support this. Some linkers may require the ROM to be renamed to .ds.nds
myGame.ds.gba could be renamed preferably to myGame.sc.nds (see below)
Some quite uncommon formats and extensions are:
.nds programs can be converted into ds.gba programs, i.e. ROM designed to run from slot-1 may run from slot-2. It involves prepending a loader, which may load the first appended .nds file, or allow selection of one of the .nds files.
The loader can be ndsloader.bin, which could be prepended that way:
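As a minimal sketch, assuming that the resulting .ds.gba file is simply the loader immediately followed by the .nds content (as described for the .ds.gba format), the prepending could be done with a plain file concatenation. The function names and the file paths used with them are hypothetical, and this is not necessarily the exact procedure the loader's author recommends.

```c
#include <assert.h>
#include <stdio.h>

/* Appends the whole content of the file at src_path to the open stream out.
 * Returns 0 on success, -1 on error. */
int append_file(FILE *out, const char *src_path)
{
    assert(out != NULL && src_path != NULL);

    FILE *in = fopen(src_path, "rb");
    if (in == NULL)
        return -1;

    char buffer[4096];
    size_t count;
    int status = 0;

    while ((count = fread(buffer, 1, sizeof buffer, in)) > 0) {
        if (fwrite(buffer, 1, count, out) != count) {
            status = -1;
            break;
        }
    }
    if (ferror(in))
        status = -1;

    fclose(in);
    return status;
}

/* Builds out_path as: loader content, directly followed by .nds content.
 * Returns 0 on success, -1 on error. */
int make_ds_gba(const char *loader_path, const char *nds_path,
                const char *out_path)
{
    FILE *out = fopen(out_path, "wb");
    if (out == NULL)
        return -1;

    int status = append_file(out, loader_path);
    if (status == 0)
        status = append_file(out, nds_path);

    if (fclose(out) != 0)
        status = -1;
    return status;
}
```

A call such as make_ds_gba("ndsloader.bin", "myGame.nds", "myGame.ds.gba") would then produce the slot-2 variant of the ROM.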
The other way round (from ds.gba to .nds, i.e. from slot-2 to slot-1) usually cannot be performed if not already done by the author, because most often the application relies on specific data meant to be embedded, which is not supported with a .nds file.
There are some tools, such as the Supercard Magic Homebrew Patcher, that turn your .nds / .ds.gba files into a fixed supercard .sc.nds file.
Finally, most of the recent FAT-based homebrew applications need to be patched according to the relevant DLDI script, before being run. See next section.
Before rushing for the numerous homebrew programs available on the Internet, one must understand the problem encountered with the various linkers, be they slot-1 or slot-2. This section therefore applies mostly to the ROM using the .ds.gba format, including .gba.nds and .sc.nds.
As each linker offers its own memory interface (the way it writes to its Flash memory), as soon as an application has to access memory provided by the linker, the homebrew has to be aware of the specific linker it will run with. Several solutions exist: the developer may build as many versions of his software as there are supported linkers (it soon becomes tedious and messy), or the end-user may use Chishm's DLDI, for FAT-based applications.
DLDI stands for Dynamically Linked Device Interface for libfat. The DLDI-aware software can be storage-agnostic. It has just to be built once, and patched according to the target linker it is to run with, before being transferred to the DS.
The point is that it is quite easy for the user to patch the software, and it frees the developer from having to take into account each and every linker variation on the market. Even linkers released years after one's software will be supported thanks to DLDI, with no additional efforts from the original author.
DLDI is needed whenever libfat versions more recent than Christmas 2006 are being used.
In practice, the user will need:
Our LOANI script can take care of everything needed to use the DLDI tool directly on a Linux platform, see below. Otherwise Linux users just have to put the dlditool archive (here dlditool-belowlinux-x86.zip) and the device-specific patch file (here Sclt.dldi, it could be R4tf.dldi) in a new directory and prepare them. For example:
One then just has to use the DLDI tool to patch the software (here myGame.nds) according to its linker-specific patching specification: ./dlditool Sclt.dldi myGame.nds for example. Check that the tool reported that the operation was successfully performed. The patched ROM can then be used at once.
Sadly things are sometimes more complicated: for example some linkers perform auto-patching, and it may depend on their firmware version. For example, starting from the 1.10 firmware version, the R4 DS auto-patches its ROM.
With an R4 running a prior firmware version (ex: 1.09), one should also avoid transforming a .nds into a .ds.gba and then DLDI-patching it: the resulting ROM will not even load on the R4, which is a slot-1 device and is thus unrelated to the .ds.gba format.
Another option for Windows users is to install DLDIrc, which is configured once and for all with a target linker and adds, to the contextual menu of each selected DS ROM (the menu obtained with a right-click), the possibility to patch it.
More information about DLDI: read this section on Dev-Scene.
Finally, some homebrew make use of the SRAM (actually, a non-volatile RAM) that may be provided by a flash cartridge.
Even though most DS applications are to be run standalone (i.e. by themselves, not using any other specific operating system), one can nonetheless have one's DS run a dedicated operating system (OS). Beyond the fun experience, it may allow for multitasking, or may just provide an environment familiar to the user, with useful tools and interfaces.
Should an OS be used, it will have to be executed as any other DS application. This OS will then itself be able to run the applications it is hosting. They may even run simultaneously, if the OS supports multitasking and if it does not monopolize too much of the DS resources for its own purposes, which would leave too little CPU or RAM for the hosted applications.
Among all other operating systems, homebrewers often favor Linux.
On the DS, this OS is provided thanks to DSLinux, based on uClinux.
It can be installed quite easily:
Patch dslinux.nds with the adequate DLDI device patch (refer to our DLDI nano-guide)
Copy the result (dslinux.nds and the linux directory extracted from the archive, including its own content) to the root of your SD card
See DSLinux running, with the virtual stylus-based keyboard on the bottom screen:

The terminal output is on the top screen:

See also:
Programming the DS is quite fun because it is rather down to the metal: being that close to the hardware, with no operating system, no drivers, no hardware incompatibilities (well, until we look at the flash carts) to take into account is a good pretext indeed to struggle against low level details. It is probably a very good way to practise embedded development as well. Not to mention that having one's own programs working on a console is quite satisfying.
ARM CPUs are RISC (Reduced Instruction Set Computer) processors, as opposed to CISC processors (Complex Instruction Set Computer). Most ARM instructions are expected to be executed in one clock cycle, thanks to a simpler and constant structure.
For newcomers, one of the most disturbing specificities of the DS is its having multiple processors. They are expected to share the load: even if both ARMs are rather general-purpose, the ARM9 should take care of most computation-intensive tasks, including graphic-related operations (it is the only one that can interact with the 3D rendering engine), whereas the ARM7 should concentrate on input management (touchscreen, most keys), wireless communications, audio output and any GameBoy emulation. As for the 2D/3D engines, they are dedicated as expected to visual rendering.
Roughly speaking, the ARM9 can be somewhere between 1.5 and about 4 times faster than the ARM7, depending on what task is involved.
ARM7 and ARM9 are often configured to use Thumb mode (16 bit mode), instead of the ARM mode (32 bit mode). The reason for this, in both cases, is memory space: ARM7 has limited program space, whereas ARM9 has limited instruction cache space. As for the 16 bit-wide bus accesses (there are several buses), ARM (32-bit) instructions would saturate the bandwidth and seriously decrease overall performance.
Some applications manage to rely on very tight 32-bit loops that fit in the ARM9 fast memory dedicated to instructions (ITCM, 32-kilobyte) or in the ARM7 dedicated fast RAM (IWRAM), so that they can benefit from a more "powerful" instruction set, for critical parts. Apart from the 32-bit memory regions fed by 32-bit buses, one should prefer Thumb (16-bit) instructions.
One can use the ITCM by either naming the file *.itcm.c or *.itcm.cpp or using the ITCM_CODE macro in the libnds headers. For example:
The latest devkitARM linkscripts and the default ARM7 core reserve the switchable IWRAM for ARM7 exclusive use. Nintendo official code also does this.
Neither CPU of the DS has a Floating Point Unit (FPU), so all floating-point operations must be emulated in software (there is no hardware support), which is very slow (so avoid float and double variables whenever possible). Most computations therefore use:
Look-up tables for trigonometric functions (cos, sin, etc.). See this TONC section for more details
As a consequence of having two CPUs, a DS ROM has to include at least two executables, one for each of the ARMs (not mentioning application-specific data). At runtime the executables will have to work simultaneously, which often involves having to be synchronized and to share data.
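Coming back to the lack of an FPU: fractional values are commonly handled with plain integers. The following is a minimal sketch of fixed-point arithmetic, assuming a 20.12 layout (20 integer bits, 12 fractional bits); the type and function names are hypothetical and chosen for illustration only.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed;              /* hypothetical 20.12 fixed-point type */
#define FIXED_SHIFT 12
#define FIXED_ONE   (1 << FIXED_SHIFT)   /* the value 1.0 */

fixed fixed_from_int(int value)
{
    return (fixed)(value * FIXED_ONE);
}

/* Truncates toward negative infinity on compilers that use an arithmetic
 * right shift for signed values (GCC does; the C standard leaves it
 * implementation-defined). */
int fixed_to_int(fixed value)
{
    return (int)(value >> FIXED_SHIFT);
}

/* The 64-bit intermediate avoids overflowing before the shift back. */
fixed fixed_mul(fixed a, fixed b)
{
    return (fixed)(((int64_t)a * b) >> FIXED_SHIFT);
}

fixed fixed_div(fixed a, fixed b)
{
    assert(b != 0);
    return (fixed)(((int64_t)a * FIXED_ONE) / b);
}
```

With this layout, 1.5 is stored as 6144 and 0.25 as 1024, so that fixed_mul(fixed_from_int(3), FIXED_ONE / 2) yields 6144.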
Monitoring the ARM loads is not easy, even when using emulators. Profiling can be done via timers: one can start some hardware timers when a task starts, and then stop and read them when the task is done. That information is rather precise, and you can use it to provide a lot of information about how much of the CPU power is being used, even though setting it up is a bit tedious.
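The arithmetic behind such timer-based profiling can be sketched as follows. This is only a host-side illustration: it assumes that the timers are 16-bit counters running at about 33.514 MHz divided by a prescaler of 1, 64, 256 or 1024 (figures taken from community hardware documentation, not from this guide), and the function names are hypothetical. Reading the actual timer registers is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed base frequency of the DS timers, in Hz. */
#define TIMER_BASE_HZ 33513982ULL

/* Number of ticks between two readings of a 16-bit timer. The unsigned
 * 16-bit subtraction stays correct across one counter wrap-around. */
uint16_t timer_elapsed_ticks(uint16_t start, uint16_t end)
{
    return (uint16_t)(end - start);
}

/* Converts a tick count to microseconds for a given prescaler. */
uint64_t timer_ticks_to_us(uint32_t ticks, uint32_t prescaler)
{
    assert(prescaler == 1 || prescaler == 64 ||
           prescaler == 256 || prescaler == 1024);
    return ((uint64_t)ticks * prescaler * 1000000ULL) / TIMER_BASE_HZ;
}
```

A task longer than 65536 ticks would wrap the counter more than once, so in practice a larger prescaler (or two cascaded timers) is chosen according to the expected duration.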
Some guidelines should be applied to write effective code for the DS. They include favoring when possible bit shifts (<< and >>), avoiding branching (if and switch replaced by arithmetic operations) and using & instead of % whenever possible. See also TONC section about ARM assembly.
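These guidelines can be illustrated with a few small helpers. This is a sketch with hypothetical function names; an optimizing compiler often performs the first transformations by itself, so the gain should be measured rather than assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplication and division by powers of two, written as shifts. */
uint32_t times_eight(uint32_t x)  { return x << 3; }
uint32_t div_by_four(uint32_t x)  { return x >> 2; }

/* Equivalent to x % size, but valid only when size is a power of two. */
uint32_t wrap_pow2(uint32_t x, uint32_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return x & (size - 1);
}

/* Minimum of two values without any branch: the mask is all ones when
 * a < b and all zeros otherwise, so exactly one operand is selected. */
int32_t min_branchless(int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(a < b);
    return (a & mask) | (b & ~mask);
}
```

The masking trick is typically used to wrap an index inside a ring buffer or a tile map whose size is a power of two.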
Both ARMs use the little-endian convention.
Hardware memory-mapped registers are a way for the CPUs to interact with other components of the DS, such as the sound or video hardware.
These registers behave like special memory addresses that can be read and/or written by the CPU, but the associated bits, instead of being actual memory, are in fact I/O ports of the chips the CPU is interacting with. These registers can be used for example so that the CPU sets the playback frequency of the sound hardware, or so that it reads which keys are pressed.
Most useful registers are defined by the low level library named libnds, so that plain abstract names (ex: DISPLAY_CR) can be manipulated instead of raw addresses (ex: 0x4000000).
Such addresses must be declared volatile, as the compiler cannot rely on their value remaining constant between two accesses: registers are directly controlled by other chips that can modify them at any time, hence their value should not be cached as are usual bytes in memory.
Finally, in this case, using the preprocessor to set these addresses as immediate values (#define DISPLAY_CR XXXX .... DISPLAY_CR = aValue ;) instead of using constants (volatile unsigned int * const DisplayCR = XXXX ; *DisplayCR = aValue ;) might involve fewer operations (the address being stored directly in the instruction opcode instead of being retrieved from memory), depending on the compiler optimization.
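The volatile access pattern can be sketched as follows. To remain runnable on a PC, this example points the macro at an ordinary variable standing in for a hardware register; the names REG32, FAKE_DISPLAY_CR and the three functions are hypothetical, and on the DS the address would instead be a fixed one such as 0x4000000.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a hardware register, so that the sketch runs on a PC. */
static volatile uint32_t fake_display_cr;

/* Turns an address into a volatile 32-bit lvalue: every use of the macro
 * results in a real load or store, never in a cached value. */
#define REG32(address)   (*(volatile uint32_t *)(address))
#define FAKE_DISPLAY_CR  REG32(&fake_display_cr)

void set_register_bits(uint32_t bits)
{
    FAKE_DISPLAY_CR |= bits;    /* read, modify, write back */
}

void clear_register_bits(uint32_t bits)
{
    FAKE_DISPLAY_CR &= ~bits;
}

uint32_t read_register(void)
{
    return FAKE_DISPLAY_CR;
}
```

Without the volatile qualifier, the compiler would be free to merge or drop some of these accesses, which is harmless for ordinary memory but wrong for a register that the hardware may change on its own.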
The main DS built-in memory, consisting of one big 4-megabyte block (sometimes named EWRAM, for External Working RAM) can be accessed by both ARMs, but one at a time. When both CPUs are trying to read main memory, one will have priority over the other (by default, the ARM7 has priority over the ARM9, which is a reasonable and safer setting), leading the other to wait until the first has finished its operation.
Regarding the ARM9, in the main memory there is generally its corresponding executable as well as most game data. Everything except its stack (placed in DTCM, see below) and the data declared to be read-only (placed in ROM space) is stored by default in these 4 megabytes of main memory, including application-specific code, non-const variables, global data, C++ constructors and destructors, etc. This memory is rather slow, at least compared to the ARM9 caches.
The ARM7 executable can be as well in main memory, but for performance reasons its code, together with its data, are often placed into the IWRAM (for Internal Working RAM, 64 kilobytes of fast RAM, 32-bit wide, that only the ARM7 can access). This is the case with devkitARM. Code has to be small enough to fit in this IWRAM.
From the point of view of the ARMs, the GBA cartridges are expected to be a ROM area of up to 32 megabytes. Linkers making use of removable media fake a 32-megabyte-only memory by dynamically loading the relevant memory chunks from their mass storage media.
There are two banks of Tightly Coupled Memory (TCM) in the ARM9. They are high-speed memory, directly contained in the ARM CPU core.
The DTCM, for Data Tightly Coupled Memory, is a special 16-kilobyte memory area in the ARM9 which can be mapped to reside at various actual physical addresses. It is a lot faster than the main RAM, therefore the standard ARM9 linkscript places its stack in DTCM. Due to the small size of this memory region, local variables should be used carefully. One should avoid for example declaring too many of them (ex: local arrays) or having too deep recursions.
As for the ITCM, for Instruction Tightly Coupled Memory, it is a special 32-kilobyte memory area in the ARM9 which can be mapped to reside at various actual physical addresses. It is a lot faster than the main RAM, so it should be used for small (preferably 32-bit) functions that are computation-intensive and/or frequently called. For example, libnds uses that region to store the interrupt dispatcher.
Although these two tightly coupled memories (TCM) are faster memories than RAM, are internal to the ARM9, and are used for storing high performance code/data, they are actually completely separate areas of memory from the instruction and data (L1) caches of the ARM9.
So from the ARM9 point of view, the memory hierarchy is, from closest/fastest/smallest to farthest/slowest/biggest: ARM9 caches (L1) > ARM9 TCM ("L2") > RAM and other memories.
Both ARM9 TCM seem to rely on 32-byte cache lines. As a cache line cannot be partially read or written apparently, special care must be taken when invalidating or flushing them.
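Since whole lines are affected, a buffer to be flushed or invalidated is usually widened to line boundaries first. The following is a minimal sketch of that address arithmetic only, assuming the 32-byte line size quoted above; the function names are hypothetical and the actual flush or invalidate calls are not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 32u   /* line size quoted in the text above */

/* Address of the first byte of the line containing address. */
uintptr_t line_start(uintptr_t address)
{
    return address & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
}

/* First address past the last line touched by [address, address + size). */
uintptr_t line_end(uintptr_t address, size_t size)
{
    assert(size > 0);
    return (address + size + CACHE_LINE_SIZE - 1)
           & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
}

/* Number of lines touched by the buffer. */
size_t line_count(uintptr_t address, size_t size)
{
    return (size_t)((line_end(address, size) - line_start(address))
                    / CACHE_LINE_SIZE);
}
```

A 4-byte buffer starting at 0x101E, for instance, straddles two lines, so that unrelated data sharing those lines is affected as well; aligning such buffers on 32 bytes avoids the problem.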
Finally, see the memory layout diagram from Dev-Scene website to better understand the various buses between the various memories.
An interrupt (or IRQ, for Interrupt Request) is a way for a CPU to stop immediately the current execution path in order to run another function, called an interrupt handler, instead. Hence when a hardware or software interrupt occurs, the processor saves first some information so that it knows where to go back and in which state, then the handler associated to this interrupt for that CPU is called at once and, when that function returns, this CPU continues executing the piece of code it was executing before it was interrupted, as if nothing had happened.
Interrupts therefore make it possible to perform tasks that should not wait, either because they might otherwise be missed (ex: a keypress described in a hardware register) or because the application must react directly to them (ex: a Vertical Blank Interrupt that would trigger rendering, see below).
Interrupts also make it possible to use the CPU sleep mode, since they provide a way of waking the CPU up as soon as it is needed again, thanks to a BIOS routine. In this low-power mode, the ARM9 stops processing instructions and powers down some memory banks to save battery charge. There is an ARM instruction named SWI (for SoftWare Interrupt) with one numerical parameter that means: enter the SWI handler and pass that value in order to tell which interrupt or combination of interrupts the CPU should be waiting for, from now on.
Helper libraries, including libnds, offer higher-level interrupt management. If most of the program is in interrupt handlers, then the CPU may sleep most of the time, preserving the charge of the DS batteries. Note that interrupt handlers are meant to be executed in a short time though, as otherwise they might be interrupted themselves. Too many cascading long interrupts might prevent the DS from ever returning to the main interrupted code.
By default, when an interrupt is triggered, the CPU jumps to the standard BIOS interrupt routine, which relies on a function pointer stored at a special memory address that we can write to. By storing there a pointer to one of our own functions, we can cause the interrupt to be processed by a user-specified function, i.e. we can define our custom interrupt (IRQ) handler.
More information on interrupts in general can be found in this TONC section.
The DS supports 23 different hardware interrupts, named here according to the libnds convention (unless specified otherwise, these interrupts are available both for the ARM7 and the ARM9):
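- Display: when a line, a full screen or a specific line (set in DISPSTAT, compared to REG_VCOUNT) has been displayed (IRQ_HBLANK / IRQ_VBLANK / IRQ_VCOUNT)
- Timers (IRQ_TIMER0 / IRQ_TIMER1 / IRQ_TIMER2 / IRQ_TIMER3)
- Network and Wifi (IRQ_NETWORK / IRQ_WIFI) [ARM7-specific]
- DMA (IRQ_DMA0 / IRQ_DMA1 / IRQ_DMA2 / IRQ_DMA3)
- Keys, as configured in REG_KEYCNT (IRQ_KEYS)
- Cartridge and card slots (IRQ_CART / IRQ_CARD / IRQ_CARD_LINE)
- Inter-processor communication (IRQ_IPC_SYNC / IRQ_FIFO_EMPTY / IRQ_FIFO_NOT_EMPTY)
- Geometry FIFO (IRQ_GEOMETRY_FIFO) [ARM9 only]
- SPI (IRQ_SPI) [ARM7 only]
- Lid (IRQ_LID) [ARM7 only]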
Once the IRQ subsystem has been initialized (irqInit), handlers can be associated with interrupts (irqSet, irqClear). libnds provides a default overall interrupt dispatcher that can be overridden (irqInitHandler).
Interrupts can be enabled/disabled separately (using REG_IE, or irqEnable, irqDisable), and temporarily disabled as a whole (REG_IME).
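To make the roles of the master switch and of the per-interrupt enable bits more concrete, here is a minimal host-side model in C. It is only a sketch: the irq_controller structure and the irq_raise and irq_dispatch functions are hypothetical names introduced for illustration, they are not part of libnds, and the real hardware registers behave in a more subtle way (notably regarding acknowledgement).

```c
#include <stdint.h>
#include <stddef.h>

#define IRQ_COUNT 32

typedef void (*irq_handler)(void);

/* Hypothetical software model of the registers named in the text. */
typedef struct {
    uint32_t    ime;                  /* master enable: 0 masks everything   */
    uint32_t    ie;                   /* one enable bit per interrupt source */
    uint32_t    pending;              /* one request bit per source          */
    irq_handler handlers[IRQ_COUNT];  /* optional handler per source         */
} irq_controller;

/* A source requests service: only its pending bit is set. */
void irq_raise(irq_controller *c, unsigned bit) {
    c->pending |= (uint32_t)1 << bit;
}

/* Serves every source that is both pending and enabled, provided the
   master switch is on. Returns the number of sources served. */
int irq_dispatch(irq_controller *c) {
    int served = 0;
    if (!(c->ime & 1u)) return 0;
    uint32_t active = c->pending & c->ie;
    for (unsigned bit = 0; bit < IRQ_COUNT; ++bit) {
        uint32_t mask = (uint32_t)1 << bit;
        if (active & mask) {
            c->pending &= ~mask;                       /* acknowledge */
            if (c->handlers[bit]) c->handlers[bit]();  /* run handler */
            ++served;
        }
    }
    return served;
}
```

In this model a request raised while the master switch is off is not lost: it stays pending and is served by the next dispatch once interrupts are enabled again.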
The ARM7 stores the addresses of its interrupt handlers in a hardcoded memory location, whereas the ARM9 defines these addresses relatively to the DTCM.
Both CPUs can trigger interrupts to each other (if the ARM9 allows it). It is convenient to send a notification to the other CPU when, for example, there is data for it waiting to be read in a shared area in RAM, once the calling CPU has finished filling it.
The DS screens are updated neither as a whole nor permanently: the graphics hardware draws pixels one by one, from the top left to the bottom right of each screen, line by line. It then waits for a while (a fixed duration) before starting to draw again.
If the framebuffer (the place in memory where the pixels are stored) is modified during the redrawing process, the user may see visual artefacts in the form of partly-updated images, the top-left part being rendered according to the previous state of the framebuffer, bottom-right with the current one.
During a short duration after a line is drawn, during a longer one once a full screen is rendered, the hardware remains idle. These moments can be used to perform safe rendering: no partial redraw is to be feared then.
Two special interrupts are regularly fired, so that the programs can use these two favorable periods: one, the Horizontal Blank Interrupt, occurs whenever a line has been rendered. Your program can use this first idle duration to perform rendering operations. The other one, more famous, is the Vertical Blank Interrupt (VBI), that is fired once a full screen has been redrawn. Your program should use this longer duration to perform at least framebuffer-related operations, while the hardware moves from the last line back up to the first line. This idle stage is called Vblank, as opposed to Vdraw, the screen refresh time.
As both screens are refreshed at 60 Hz, the period between two VBI is about 16.7 ms long. See more information about the GBA, which still applies here. The VBI is also called vsync, since it allows for vertical synchronisation.
Note that all other operations (input reading, sound output, application logic, etc.) can be performed regardless of these two interrupts. But the VBI, beyond its use to avoid visual artefacts, also provides a hard real-time 60 Hz time base. This time base can be used to schedule operations on a regular basis. Timers are useful for that task too.
Keys can be read thanks to several methods, including the interrupt-based one. In this case a specific IRQ handler is registered. This handler will indeed be triggered when a key is pressed, but it will not be called when the key is released, which greatly reduces the usefulness of this method for key handling.
Beyond screens and sometimes keys, FIFO, IPC and timers make heavy use of interrupts.
These SWI (for SoftWare Interrupt) are triggered by the program itself, thanks to the ARM instruction named swi, and result in a DS BIOS function being called.
One would prefer to use pre-made encapsulations for these BIOS calls, for example the ones provided by libnds (ex: swiSoftReset).
The DS supports 25 different software interrupts, which are described in the libnds BIOS section.
The full name of this processor is ARM7TDMI, meaning it is an ARM7 core (implementing the ARMv4T architecture) which can run Thumb (16-bit) code (T), has Debug support (D), a fast Multiplier (M) and an embedded ICE module (I). On the DS it has neither an instruction cache nor a data cache, but this is somewhat compensated by the fast memory it owns, the 64-kilobyte IWRAM linked with a 32-bit wide bus.
There are also two 16-kilobyte WRAM banks that can be assigned independently to the ARM7 or ARM9, with a 32-bit wide bus in both cases. The two ARMs cannot access these banks at the same time. Commonly, both banks will be mapped to the ARM7 (devkitARM defaults): as they then form a contiguous block with the ARM7 IWRAM, this processor is effectively given 96 kilobytes of fast memory.
The ARM7 is the only CPU that can be used for controlling the touchscreen. Most applications use boilerplate code that sets up an interrupt handler for the already mentioned Vertical Blank Interrupt (VBI). It is used here not for rendering, but for synchronization, so that the interrupt handler dedicated to ARM7 input reporting can be scheduled regularly. The ARM7 boilerplate code gets the values of the touchscreen parameters and stores them in a data structure the ARM9 can access.
The ARM7 is also the only CPU that can make use of the microphone, the sound playback, the wireless communications and the real-time clock.
Depending on whether a piece of data is const or not, the linker will place it either in an average, rather unconstrained memory or in IWRAM, which is the fastest but one of the smallest. Therefore each time one forgets to specify the const qualifier for an actual constant, that constant may needlessly use the much sought-after IWRAM instead of lower-end memory.
The free toolchain devkitPro includes a default ARM7 program to handle basic tasks like managing interrupts, reading the touchscreen, the microphone and the realtime clock, performing very simple sound playback, etc. It is the arm7.bin file discussed in our building section. Usually there is no need to write custom ARM7 code. The standard Makefiles include this default ARM7 program.
Due to its superior power compared to the ARM7, the ARM9 is the main processor and as such will handle most of the work. Most of the application-specific code is expected to run on it.
Beyond its instruction and data caches, the ARM9 has two additional built-in fast memories: one for instructions (ITCM, 32 kilobytes), the other for data (DTCM, 16 kilobytes). Each is accessed thanks to a dedicated 32-bit wide bus. The caches, which cache accesses to the main memory, and the TCM increase the ARM9 performance a lot, at the expense of a small additional level of complexity: as neither the ARM7 nor the DMA circuits are aware of the ARM9 caches, care must be taken not to create inconsistencies between their view of memory and that of the ARM9.
To make a better use of these caches, various primitives are provided to ensure they stay in sync with the main memory. This includes a mirror of main memory that is not cacheable (02400000-027FFFFF), and a way of flushing the data cache (DC_FlushAll).
A single ARM9 main() function defines usually:
In GBA mode the ARM9 is not powered, only the ARM7 can be used.
As each ARM has specific abilities (ex: the ARM7 is the only one that can access the hardware for sound and wireless), they have to communicate one way or another, so that each can send the other commands to be executed on the sender's behalf.
The ARM CPUs can communicate thanks to IPC (Inter-Process Communication) based on a set of registers managed as a (possibly bidirectional) FIFO (First In, First Out) data structure. It actually corresponds to message queues with an asynchronous communication protocol.
Communication between ARMs is tricky: beyond the classical issues of synchronization of concurrent accesses (parallelism between the ARMs), one has to keep in mind that the ARM7 is not aware of the ARM9 data cache, which may lead to inconsistencies when main memory is used to share data.
One solution is to have the ARM9 access the shared data through the uncached mirror of main RAM (0x02400000-0x027fffff: add 0x400000 to the normal main RAM address to get to the uncached mirror), even if the ARM9 access to the data will thus be slower. An alternative to the mirror would be to flush/invalidate the cache manually, although this approach does not seem 100% reliable.
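The address arithmetic behind the uncached mirror can be sketched as follows. The function name to_uncached_mirror is hypothetical; the constants are the ones quoted in this guide (a 4 MB main RAM whose mirror spans 0x02400000-0x027fffff).

```c
#include <stdint.h>

#define MAIN_RAM_BASE    0x02000000u
#define MAIN_RAM_SIZE    0x00400000u   /* 4 MB */
#define UNCACHED_OFFSET  0x00400000u   /* distance to the uncached mirror */

/* Returns the uncached alias of a main RAM address, or 0 if the address
   does not belong to the cached main RAM window. */
uint32_t to_uncached_mirror(uint32_t addr) {
    if (addr < MAIN_RAM_BASE || addr >= MAIN_RAM_BASE + MAIN_RAM_SIZE)
        return 0;
    return addr + UNCACHED_OFFSET;
}
```

On the DS itself the result would be cast to a pointer to volatile data before being dereferenced by the ARM9.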
Note that as soon as IRQ and/or inter-ARM accesses to shared variables in main RAM are involved, one has to rely on variables declared volatile, so that the compiler does not assume it can sometimes skip reading from main RAM. As code for the DS is usually compiled with aggressive optimisations activated (ex: -O3), these issues must be managed.
A pointer to a volatile variable of type T should be declared as volatile T * (ex: data in main RAM). If the pointer is itself volatile (ex: set from an IRQ), then declare it volatile as well: volatile T * volatile. Despite these safety measures, the data cache will still cause problems though.
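The two declarations can be compared in a small C sketch. The variable shared_word is only a stand-in for a word living in shared main RAM, and wait_for_flag is a hypothetical helper; neither comes from libnds.

```c
#include <stdint.h>

/* Stand-in for a word in shared main RAM (on a DS: a fixed address). */
uint32_t shared_word;

/* Pointer to volatile data: every access through it really reads memory. */
volatile uint32_t *status = &shared_word;

/* Volatile pointer to volatile data: the pointer itself may also change
   behind the compiler's back (ex: when it is set from an IRQ handler). */
volatile uint32_t * volatile current_buffer = &shared_word;

/* Polls a flag until it becomes non-zero or the polling budget is spent.
   Returns 1 if the flag was seen set, 0 otherwise. */
int wait_for_flag(volatile uint32_t *flag, int max_polls) {
    while (max_polls-- > 0) {
        if (*flag != 0) return 1;
    }
    return 0;
}
```

Without the volatile qualifier, an optimising compiler would be allowed to read the flag once and to turn the loop into a constant result.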
The shared memory could also be located in the so-called IPC region, which starts at the address 0x027ff000. Its purpose is to provide a safe memory area dedicated to shared variables. The usual allocators (ex: malloc, new, etc.) do not readily provide ways of specifying the target address (for in-place allocation in this area), so a given structure has to be mapped directly to the relevant part of this IPC region.
If using libnds, then this library will reserve the beginning of this IPC region for its inner workings. This is done thanks to its TransferRegion structure. Thus user data should start no earlier than (uint32)(IPC) + sizeof(TransferRegion), to avoid memory corruption.
Note that the link scripts for both ARMs leave only a 4-kilobyte space for this IPC area: the size of main RAM is restricted to 4 MB - 4 KB, then the IPC struct is defined at address 0x027ff000, i.e. 4 kilobytes before the end of the non-cached mirror of main RAM. The libnds IPC struct itself is pretty small (less than 500 bytes), but user variables (and, possibly, those of all other libraries used besides libnds) may overflow the region if they are larger than the remaining space (about 3 500 bytes).
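The layout of this region can be checked with a few lines of C. This is a sketch only: ipc_user_start and ipc_user_capacity are hypothetical helpers, and the header size is passed as a parameter because the exact size of the libnds structure is not assumed here.

```c
#include <stdint.h>
#include <stddef.h>

#define IPC_BASE  0x027ff000u   /* start of the IPC region           */
#define IPC_SIZE  0x1000u       /* 4 kilobytes left by link scripts  */

/* First usable address after a library header of header_size bytes,
   rounded up to a 4-byte boundary. */
uint32_t ipc_user_start(size_t header_size) {
    return IPC_BASE + (((uint32_t)header_size + 3u) & ~3u);
}

/* Bytes left for user data, or 0 if the header fills the region. */
uint32_t ipc_user_capacity(size_t header_size) {
    uint32_t start = ipc_user_start(header_size);
    uint32_t end   = IPC_BASE + IPC_SIZE;
    return start < end ? end - start : 0;
}
```

With a 500-byte header, ipc_user_capacity returns 3596, which matches the "about 3 500 bytes" mentioned above.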
Finally, as already explained, memory areas in the main RAM (not in the IPC area) can be used instead of this so-called IPC region, provided the usual precautions are respected (allocated from the ARM9, accessed from the non-cacheable mirror or with explicit data cache management) and provided a means of notifying the ARM7 is used (ex: the hardware FIFO).
Commands can be described by the ARM9 in a data structure for the ARM7, for example a C union or a non-abstract C++ class inheriting from an abstract command class. These commands should specify the requested action (ex: play sound) and its associated data (ex: a pointer to the samples stored in main memory). A set of commands can then be kept in shared memory, accessible by the ARM7 and ARM9.
The command set can be implemented thanks to a circular C array or a C++ (FIFO) std::queue that would be instantiated in the shared memory area, after the libnds IPC region, i.e. at the libnds-defined IPC starting address (the IPC symbol) incremented by an offset equal to the size of TransferRegion.
So the ARM9, after having initialized the queue, is expected to place a command in it whenever needed, whereas the ARM7 is expected to poll the queue regularly (ex: thanks to the VBI) in order to gather and execute commands.
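The circular array variant can be sketched in portable C as a single-producer, single-consumer ring buffer. All names here (command, command_queue, queue_push, queue_pop, the command types and the number of slots) are hypothetical. On real hardware the structure would have to live in the shared region and be accessed through uncached memory, which this host-side sketch does not show.

```c
#include <stdint.h>

#define QUEUE_SLOTS 8   /* hypothetical size; one slot always stays free */

typedef enum { CMD_NONE = 0, CMD_PLAY_SOUND, CMD_STOP_SOUND } command_type;

typedef struct {
    command_type type;
    uint32_t     data;   /* ex: address of the samples in main RAM */
} command;

typedef struct {
    volatile uint32_t head;                 /* written by the producer only */
    volatile uint32_t tail;                 /* written by the consumer only */
    volatile command  slots[QUEUE_SLOTS];
} command_queue;

void queue_init(command_queue *q) { q->head = 0; q->tail = 0; }

/* Producer side (ARM9 in the text); returns 0 if the queue is full. */
int queue_push(command_queue *q, command c) {
    uint32_t next = (q->head + 1) % QUEUE_SLOTS;
    if (next == q->tail) return 0;
    q->slots[q->head].type = c.type;
    q->slots[q->head].data = c.data;
    q->head = next;               /* publish only once the slot is filled */
    return 1;
}

/* Consumer side (ARM7 in the text, ex: once per VBI); 0 if empty. */
int queue_pop(command_queue *q, command *out) {
    if (q->tail == q->head) return 0;
    out->type = q->slots[q->tail].type;
    out->data = q->slots[q->tail].data;
    q->tail = (q->tail + 1) % QUEUE_SLOTS;
    return 1;
}
```

As each index is written by one side only, no lock is needed for this one-way use, which is consistent with the one-way limitation of this method.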
This IPC method has three drawbacks. First it is only one-way: the ARM7 cannot send commands or results to the ARM9, as no protection against concurrent accesses is available here. Second it requires the ARM7 to perform polling, at the expense of uselessly burnt CPU cycles. Third it has to fit in the tiny IPC region.
See Chris Double tutorial #6 for more details about custom-made IPC.
The DS provides a built-in interrupt-based FIFO queue. Therefore the ARM7 can receive immediately interrupt notifications, instead of having to poll, which would be rather inefficient. The FIFO handles concurrent accesses in hardware, which allows fast reliable bidirectional communications. This queue can only hold 16 items (32-bit int, that can be taken as integers or as pointers to user-defined data) on each side, so the other processor should be receiving these items in a timely manner.
Each CPU has a queue which it can put data on. The other CPU can receive this data by reading from a register, or thanks to an interrupt. It will get the oldest item that the other CPU put on the queue. A kind of small application-specific protocol could be specified to manage the queue exchanges.
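As an illustration of such a small application-specific protocol, one may pack a command identifier and a payload into each 32-bit FIFO item. The layout below (8 bits of command, 24 bits of payload) and the function names are purely hypothetical choices made for this sketch.

```c
#include <stdint.h>

/* Hypothetical item layout: top 8 bits = command id, low 24 bits = payload. */
uint32_t fifo_pack(uint8_t command_id, uint32_t payload) {
    return ((uint32_t)command_id << 24) | (payload & 0x00ffffffu);
}

/* Extracts the command identifier from a received item. */
uint8_t fifo_command(uint32_t item) {
    return (uint8_t)(item >> 24);
}

/* Extracts the 24-bit payload from a received item. */
uint32_t fifo_payload(uint32_t item) {
    return item & 0x00ffffffu;
}
```

A 24-bit payload cannot hold a full main RAM address, so with this layout pointers would have to be sent as offsets or as a separate item.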
See Chris Double tutorial #7 for more details about hardware-based IPC, and also our Ceylan-based generic high-level IPC system.
Programming the DS involves performing numerous data transfers, for graphics, sounds, application data, etc., from various regions in address space to various other regions (memory banks, slot-1, slot-2, IWRAM, etc.).
These transfers can be achieved thanks to various means, each with its own strengths and weaknesses. Starting from the most often favoured transfer method to the least in case of a significant transfer:
- DMA transfers (open to both ARMs, ex: dmaCopy): although "only" up to four of them can run simultaneously, they are interesting because they offload the transfer from the CPU. They run in the background and trigger an interrupt when finished, letting the CPU perform other tasks in the meantime. They are not necessarily the fastest of all transfer methods, but this is more than compensated by their parallel execution. A drawback is that, as long as the DMA transfer is running, both CPUs will be locked off the bus to the main RAM, to prevent the CPU and the DMA controller from accessing the bus at the same time, which would cause a collision. Thus the ARM9 should execute from the ITCM and read/write data from/to the DTCM, otherwise it will be frozen, waiting for the bus. dmaCopy cannot access the DTCM region of the ARM9, which is where the stack is placed. Moreover, if the source of a DMA transfer is in main RAM and was written by the ARM9, the ARM9 data cache must be flushed beforehand. Some interference between DMA transfers and interrupt handling has been reported; when in doubt, switch to memcpy to see whether it improves stability. As for the ARM7, apparently, even when executing from its IWRAM, it will be frozen
- memcpy (memset for setting instead of copying)
- swiCopy, but it has higher interrupt latency
- swiFastCopy, which apparently suffers from a bug
First, the DS can run in different modes, including the GBA Mode (ARM9 not used), sleep (for the ARM7), stop, halt, etc. The DS can be woken up from various IRQ (Interrupt Requests):
Second, most subsystems of the DS can be powered independently: sound speakers, Wifi, LCD screens, the two 2D engines, the 3D rendering and geometry engines.
Selective activation is useful notably to save energy.
The two screens can be managed separately, or considered as two halves of a taller single screen. They can also be used with the DS rotated by 90 degrees, on its side (portrait mode), like an open book.
Most of the information related to the graphical rendering is stored in the VRAM, for Video RAM. Its size is 656 kilobytes, and it only accepts 16 or 32-bit writes (no 8-bit writes allowed).
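Since 8-bit writes are not accepted, changing a single byte (ex: one pixel of a 256-color bitmap) requires a 16-bit read-modify-write. The helper below is a hypothetical sketch of that technique; it is exercised here on an ordinary array and assumes little-endian byte order, as on the DS.

```c
#include <stdint.h>
#include <stddef.h>

/* Writes one byte into memory that only tolerates 16-bit accesses:
   the enclosing 16-bit cell is read, patched, then written back. */
void write_byte_16bit_only(volatile uint16_t *base, size_t byte_offset,
                           uint8_t value) {
    volatile uint16_t *cell = base + (byte_offset >> 1);
    uint16_t current = *cell;
    if (byte_offset & 1)   /* odd offset: high byte of the cell */
        current = (uint16_t)((current & 0x00ffu) | ((uint16_t)value << 8));
    else                   /* even offset: low byte of the cell */
        current = (uint16_t)((current & 0xff00u) | value);
    *cell = current;
}
```

When many bytes are to be written, it is usually cheaper to assemble them two by two and to write whole 16-bit values directly.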
2D/3D rendering requires images, geometries, textures, etc., see our data storage section explaining how to access the resources needed by your application.
If the rendering takes place while the screens are redrawn, then the user will see partly updated images on the screens, leading to unwanted visual artefacts. The solution is either to modify the screen content only between two redraws or to use page flipping.
The first approach can be implemented by waiting for the aforementioned VBI (Vertical Blank Interrupt) and performing the rendering only in its handler, i.e. in hard real time. One has just to ensure that rendering does not last more than the Vblank, which is rather short. This is not always possible, except for the most simple renderings. Otherwise page flipping should be preferred.
As there are many technical choices that lead to poor trade-offs, one may benefit from some thoughts we gathered about organization and toolchain for graphical assets on the DS.
When rendering is not trivial, this task might not be short enough to fit in the Vblank. Page flipping (not to be mixed up with the more expensive double-buffering, which involves an extra screen copy) is a method that consists in rendering into a screen buffer while the hardware, simultaneously, displays another buffer. At each VBI, buffers are exchanged so that both tasks can continue. Page flipping is thus a way of eliminating nasty artefacts like tearing in animations. Palettes may have to be flipped as well as bitmaps.
Page flipping gives a lot more time to render than when rendering only during Vblank: here about 16 ms (at 60 Hz) are available.
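The principle of page flipping can be sketched on the host side with two buffers and an index. The flip_chain structure and its functions are hypothetical; on the DS the flip would instead change which region of video memory the hardware displays.

```c
#include <stdint.h>

#define SCREEN_W 256
#define SCREEN_H 192

typedef struct {
    uint16_t pages[2][SCREEN_W * SCREEN_H];
    int      shown;   /* index of the page currently displayed */
} flip_chain;

/* The page the application may draw into while the other one is shown. */
uint16_t *back_page(flip_chain *f) {
    return f->pages[1 - f->shown];
}

/* The page that is currently displayed. */
const uint16_t *front_page(const flip_chain *f) {
    return f->pages[f->shown];
}

/* To be called once per VBI: only an index changes, no pixel is copied. */
void flip(flip_chain *f) {
    f->shown = 1 - f->shown;
}
```

The absence of any pixel copy in flip is what distinguishes this method from the double-buffering mentioned above.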
In software you manipulate a main screen abstraction (main rendering core), that can be mapped either to the top actual screen or to the bottom one, and a sub screen (sub rendering core), which will correspond to the other hardware screen. They can be swapped at any time thanks to lcdSwap().
Compared to the sub core, the main core offers the following additional features:
Both screens can be seen from software like 2D Cartesian coordinate systems, the abscissa (X axis) ranging from 0 (left) to 255 (right), the ordinate (Y axis) ranging from 0 (top) to 191 (bottom).
Both cores can be set in various video modes, each with different capabilities and features. The sub core has only five different video modes, whereas the main core is more powerful: it is able to reallocate more VRAM, and has the ability to render basic vertex 3D (with a 2048 polygon limit) and several vertex shaders, such as Toon Shading. The main core also has an extra video mode called Mode 6 for large bitmaps. In addition, the display capture device can only capture the output of this core. See LiraNuna's 2D tutorial for more details.
In a given screen mode, pixels are described according to a specific pixel format, that dictates their layout in memory:
- 15-bit direct color, stored in 16 bits as xBBBBBGGGGGRRRRR
Direct color modes are simple, but use a lot of memory space when stored and a lot of bus bandwidth when blitted, compared to palette-based ones with 4-bit or 8-bit index, which are therefore often favoured.
There are libnds-provided macros (ex: RGB15) that help defining pixels according to specific formats. See Chris Double tutorial #2 for more details.
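The packing performed by such a macro can be sketched as plain C functions. The names pack_rgb15 and rgb888_to_rgb15 are hypothetical; only the xBBBBBGGGGGRRRRR bit layout is taken from this guide.

```c
#include <stdint.h>

/* Packs three 5-bit components (0..31) as xBBBBBGGGGGRRRRR. */
uint16_t pack_rgb15(unsigned r, unsigned g, unsigned b) {
    return (uint16_t)((r & 31u) | ((g & 31u) << 5) | ((b & 31u) << 10));
}

/* Converts an 8-bit-per-channel color by dropping the 3 low bits
   of each component. */
uint16_t rgb888_to_rgb15(uint8_t r, uint8_t g, uint8_t b) {
    return pack_rgb15(r >> 3u, g >> 3u, b >> 3u);
}
```

For example pure red is 0x001f and pure blue is 0x7c00, red being stored in the lowest bits.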
Depending on the video mode, each rendering core will retrieve its video-related information (ex: bitmaps, sprites, tiles, textures, maps) from hardcoded regions defined in the overall memory address space.
These are address regions (i.e. a lower and a higher address), but by default there is no actual memory at these addresses: the engines have almost no memory of their own (except for sprite attributes and base palettes).
You therefore need to map predefined chunks of actual video RAM into these address regions for the cores to use them. Each of these chunks is called a bank; there are nine of them, labelled VRAM_A through VRAM_I. Each has a particular size and is best suited for specific purposes: choosing the appropriate layout and settings is all the more important as memory space is tiny.
The banks are:
| Bank name | Bank size (in Kilobytes) |
|---|---|
| VRAM_A | 128 |
| VRAM_B | 128 |
| VRAM_C | 128 |
| VRAM_D | 128 |
| VRAM_E | 64 |
| VRAM_F | 16 |
| VRAM_G | 16 |
| VRAM_H | 32 |
| VRAM_I | 16 |
Total is thus 656 kilobytes. See also the VRAM bank mapping from Dovoto's tutorial.
The system has two 2D engines, one per screen. Each screen can be put into a variety of different modes, with different feature sets. These engines are each quite similar to the Game Boy Advance's single 2D engine, though more powerful.
The VRAM banks previously mentioned are to be mapped according to the expected layout for 2D memory, which is mostly made of:
As for sprite attributes, as said previously each 2D core has its own built-in dedicated memory, which contains 128 entries (one entry per possible sprite). Each entry is made of four 16-bit attributes, storing the size, shape and location of the associated sprite. Up to 32 out of the 128 entries can correspond to affine transformations (named rotsets), whose additional attributes specify rotation and scale. Hence up to 32 rotsets can be defined, but more than one sprite can be associated to a given rotset.
Base palettes have their own per-engine memory too. Each 2D engine has two base palettes, one for the background, one for the sprites. Each base palette contains 256 16-bit color entries, in x555 BGR format.
See also the memory layout as seen from the 2D engines, taken from Dovoto's tutorial.
Each video mode has a number of backgrounds (BG) that can be used. A background is a surface that, when drawn upon, displays on the hardware like a layer. Modes that have multiple backgrounds allow these backgrounds to be overlaid, so that they are drawn on top of each other. This, and the support of per-pixel transparency, enables for example a background with text to be overlaid on top of a background displaying an image.
Backgrounds are opposed to sprites: the former take care of the decoration, the often static environment that may take the entire screen, whereas the latter represent the living entities (small moving objects, like animated characters).
Some backgrounds behave like framebuffers (per-pixel software rendering), others are tiled. Such backgrounds, in a very similar way to palettized surfaces (that contain, instead of pixels, a palette and color indices referring to that palette) are composed of a set of tiles (the "palette") and a map, which for each cell of its grid tells which tile should be placed here, as if the screens were divided into a series of 8x8 squares. Expressed in tile units, each screen is thus 32x24.
More precisely, instead of being a kind of array of pixels, a tile-based background is a kind of array of references to tiles, each tile being a fixed-size array of pixels (a square 8 pixels wide and 8 pixels tall, i.e. 64 pixels). The 2D engine will thus sweep the background as if it were a grid of tiles: for each cell of the grid, it will read in the map array the corresponding tile entry, and will blit the tile mentioned in that entry to the cell. Hence tiles make it possible to fully tile the background.
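This two-step lookup (map entry first, tile pixel second) can be emulated in software as follows. It is a simplified sketch: the tile structure and the tiled_pixel function are hypothetical, one byte is used per pixel, and each map entry is taken as a plain tile index without any attribute bit.

```c
#include <stdint.h>

#define TILE_SIZE 8    /* tiles are 8x8 pixels            */
#define MAP_W     32   /* map width, in tiles             */
#define MAP_H     32   /* map height, in tiles            */

/* One tile: 64 palette indices. */
typedef struct { uint8_t pixels[TILE_SIZE][TILE_SIZE]; } tile;

/* Returns the palette index visible at background pixel (x, y), with
   x < MAP_W * TILE_SIZE and y < MAP_H * TILE_SIZE: the map tells which
   tile covers the cell, the tile gives the pixel. */
uint8_t tiled_pixel(const tile *tileset, const uint16_t *map,
                    unsigned x, unsigned y) {
    unsigned tile_index = map[(y / TILE_SIZE) * MAP_W + (x / TILE_SIZE)];
    return tileset[tile_index].pixels[y % TILE_SIZE][x % TILE_SIZE];
}
```

On the DS this lookup is done by the 2D engine for every pixel it draws, which is why tiled backgrounds cost almost no CPU time.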
The DS excels at tile-based rendering, which is the way to go for most games, as it is hardware-accelerated, whereas framebuffer access implies slow software rendering. A few basic tiles can create complex displays quite nicely.
There are three types of backgrounds: Rotoscale (a.k.a. Rotation), Extended Rotoscale (a.k.a. Extended) and Text. See the table of graphic modes from Dovoto's tutorial.
Rotoscale backgrounds are tile-based backgrounds (or surfaces, in SDL-like language) that can be manipulated (affine transformations: 2-axis scrolled, scaled in or out, rotated, sheared, thanks to a transformation matrix) and displayed directly on screen, all in hardware, i.e. with little overhead for the CPUs, no additional VRAM copy and no specific development effort (it only involves setting some registers). They can be smaller or bigger than a screen, in which case only the relevant parts will be shown (clipping).
Such regular rotoscale backgrounds use 8-bit tile entries. They can therefore reference only up to 256 different tiles. Each tile entry simply designates the index of its tile in character memory.
Extended Rotation Backgrounds (ERB) are more powerful versions of Rotoscale backgrounds: they support larger tilesets (1024 instead of 256, which induces larger maps), they support more palettes (256-color ones), and they can operate in bitmap mode (linear frame buffer) as well as in tiled mode.
As ERB backgrounds can be accessed (read and write) like the framebuffer, and as they provide additional hardware capabilities, they tend to make the framebuffer modes useless.
To demonstrate scrolling around an image, one may use for example a 256x256 background size (in pixels), with 16-bit color tiles (with one alpha bit). That background would be larger than the DS screen (256x192), but it would use only 128 kilobytes to store the image. VRAM_A could be chosen for that.
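The memory budget of such a bitmap background is a simple product, which can be checked as follows. Both helper names are hypothetical.

```c
#include <stddef.h>

/* Size in bytes of a direct-color bitmap background. */
size_t bitmap_bytes(size_t width, size_t height, size_t bytes_per_pixel) {
    return width * height * bytes_per_pixel;
}

/* Returns 1 if a bitmap of that size fits in a bank of bank_kb kilobytes. */
int fits_in_bank(size_t bytes, size_t bank_kb) {
    return bytes <= bank_kb * 1024;
}
```

A 256x256 background at 2 bytes per pixel takes 131072 bytes, i.e. exactly the 128 kilobytes of a bank such as VRAM_A, whereas a screen-sized 256x192 bitmap would take 96 kilobytes.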
Finally, hardware scrolling can be of great help to implement parallax scrolling (a fake sense of depth), when having multiple backgrounds sliding at different paces.
In tile mode, ERB use the same 16-bit tile entries as the text backgrounds (see below).
They are general-purpose tiled backgrounds. A text background is made of:
- a set of tiles
- a map of 16-bit tile entries: in each entry, 10 bits store the tile index (hence up to 2^10=1024 tiles), and the 6 remaining bits (the control bits) record the tile attributes (see below)
The DS tiling engine is quite feature-rich: beyond displaying tiles as they are, it can flip them (horizontally and/or vertically), and use different palettes to render the same tile, for example for palette-based animations. These per-tile settings are stored in the aforementioned 6-bit attributes of tile entries (the control bits): one bit (on/off) for each flip (horizontal/vertical), and 4 bits to designate one of the 16 sub-palettes.
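A tile entry can be packed and unpacked with a few shifts and masks. The sketch below assumes the usual GBA-style bit positions (index in bits 0-9, horizontal flip in bit 10, vertical flip in bit 11, sub-palette in bits 12-15); these positions and the tile_entry type are assumptions to be checked against a hardware reference.

```c
#include <stdint.h>

typedef struct {
    unsigned tile_index;  /* 0..1023 */
    unsigned hflip;       /* 0 or 1  */
    unsigned vflip;       /* 0 or 1  */
    unsigned palette;     /* 0..15   */
} tile_entry;

/* Builds the 16-bit map entry: 10 index bits, then the 6 control bits. */
uint16_t tile_entry_pack(tile_entry e) {
    return (uint16_t)((e.tile_index & 0x3ffu)
                    | ((e.hflip   & 1u)   << 10)
                    | ((e.vflip   & 1u)   << 11)
                    | ((e.palette & 0xfu) << 12));
}

/* Splits a 16-bit map entry back into its four fields. */
tile_entry tile_entry_unpack(uint16_t raw) {
    tile_entry e;
    e.tile_index = raw & 0x3ffu;
    e.hflip      = (raw >> 10) & 1u;
    e.vflip      = (raw >> 11) & 1u;
    e.palette    = (raw >> 12) & 0xfu;
    return e;
}
```

Flipping a tile or switching its sub-palette therefore only changes a few bits of the map, never the tile data itself.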
So with text backgrounds you have nothing like an overall screen buffer that could be written at the pixel level: you can just manipulate tiles and maps, and it is the DS hardware that will read it to recompose the screen.
The pixels of a tile are described as color indices taken from a palette. As palettes here are a collection of either 256 colors or 16 sub-palettes of 16 colors each, there are two color modes for tiles in text backgrounds:
In the 512 kilobytes of VRAM that can be used as background memory, there are locations dedicated to the storage of map data, in map bases (2 kilobytes each), or tile data, in tile bases (16 kilobytes each, a.k.a. charblocks). Map and tile bases share the same region in VRAM, whose layout must be carefully planned. Usually one counts in map base units, for example the size of a tile base is 8 map bases.
As maps can be placed anywhere in the first 64 kilobytes of background memory, and tiles anywhere in the first 256 kilobytes, a good practice may be to reserve the first 16 kilobytes of background memory for your maps (hence 8 map bases, which therefore correspond to the first tile base), and, starting at tile base #1 (instead of #0), to use the remaining 496 kilobytes (31*16) to store your tiles.
As each tile base occupies 16 kilobytes, in 256-color mode there can be no more than 256 tiles per tile base (16*1024/(64*1)), whereas in 16-color mode there can be 512 tiles per tile base (16*1024/(64*0.5)).
The tileset of a given map can refer up to 1024 tiles: more than a tile base can be referred to by a map base. Therefore in 256-color mode, there can be up to 4 tile bases per map base (1024/256), whereas in 16-color mode there can be up to 2 tile bases per map base (1024/512).
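These figures follow directly from the sizes given above, as this small C sketch shows. The function names are hypothetical; bits_per_pixel is 8 in 256-color mode and 4 in 16-color mode.

```c
#include <stddef.h>

#define TILE_BASE_BYTES   (16 * 1024)  /* one tile base            */
#define TILE_PIXELS       64           /* 8x8 pixels per tile      */
#define MAX_TILES_PER_MAP 1024         /* 10-bit tile index        */

/* Number of tiles that fit in one tile base. */
size_t tiles_per_tile_base(size_t bits_per_pixel) {
    return (size_t)TILE_BASE_BYTES * 8 / (TILE_PIXELS * bits_per_pixel);
}

/* Number of tile bases a single map can address. */
size_t tile_bases_per_map(size_t bits_per_pixel) {
    return MAX_TILES_PER_MAP / tiles_per_tile_base(bits_per_pixel);
}
```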
Map bases always have 32×32 tile entries. By default, the DS will be using a 32x32 tile map, a size of 256x256 pixels. It would be exactly as wide as a DS screen, and taller: screen height is 192 pixels, therefore there would be 64 extra rows.
To perform (hardware) scrolling, one needs a logical screen bigger than the rendered screen, in order to load the neighboring parts into the background before showing them. To create bigger backgrounds, you need to use several map bases, and to choose one out of the three other sizes: 32x64, 64x32 and 64x64. As explained in TONC, maps using several map bases cannot be accessed as a single array: per-map base accesses still have to be performed. Computing the address of a tile entry therefore requires a little more effort.
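The extra computation can be sketched as follows, assuming (as TONC describes for regular backgrounds) that the map bases of a large map are stored one after the other, row by row. The function name map_entry_index is hypothetical.

```c
#include <stddef.h>

#define BASE_TILES 32   /* each map base holds 32x32 entries */

/* Index, counted in 16-bit entries from the first map base, of the entry
   for tile (tx, ty) in a map that is map_w tiles wide (32 or 64). */
size_t map_entry_index(size_t tx, size_t ty, size_t map_w) {
    size_t bases_per_row = map_w / BASE_TILES;
    size_t base = (ty / BASE_TILES) * bases_per_row + (tx / BASE_TILES);
    return base * BASE_TILES * BASE_TILES
         + (ty % BASE_TILES) * BASE_TILES
         + (tx % BASE_TILES);
}
```

In a 64x64 map, tile (32, 0) is thus the first entry of the second map base (index 1024), not entry 32 of the first one.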
More information on regular tiled backgrounds can be found in TONC and in Dovoto's tutorial.
The framebuffer mode, or bitmap mode, is the easiest to draw to directly, but the least powerful of all modes. A framebuffer is a mode where the screen is mapped directly to a portion of video memory. Pixel information written to this memory area in 15-bit direct color format will immediately appear on the screen. Therefore to plot a pixel, whose color is defined in a 16-bit integer c, at screen location (x,y), one would just have to write c at FramebufferBaseAddress + y * ScreenPitch + x (counted in 16-bit pixels), ScreenPitch being equal here to the screen width, 256 (no extra per-scanline bytes).
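This address computation translates directly into C when the framebuffer is seen as an array of 16-bit pixels. The plot_pixel function is a hypothetical helper that also rejects off-screen coordinates.

```c
#include <stdint.h>

#define SCREEN_WIDTH  256
#define SCREEN_HEIGHT 192

/* Plots color c at (x, y). The pitch is counted in 16-bit pixels, so
   indexing a uint16_t array performs the address computation.
   Returns 0 without writing if the location is off screen. */
int plot_pixel(uint16_t *framebuffer, int x, int y, uint16_t c) {
    if (x < 0 || x >= SCREEN_WIDTH || y < 0 || y >= SCREEN_HEIGHT)
        return 0;
    framebuffer[y * SCREEN_WIDTH + x] = c;
    return 1;
}
```

On a PC the same function can draw into an ordinary array, which helps when keeping the rendering code portable.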
Although it remains quite simple, it does not offer the various useful features supported by other modes, such as sprites, scrolling, etc., not to mention 3D primitives: with the framebuffer only software rendering is performed (raster graphics). And only the main screen can use this framebuffer mode.
Hence a good practice is to prefer sprites and tiled backgrounds over bitmap modes. Very few 2D games do not rely on the hardware acceleration.
See Dovoto's tutorial about software-based line rendering using the Bresenham algorithm.
Finally, the graphics to render are most of the time available first in general-purpose formats (png, jpeg, etc.). They therefore have to be converted into data that the DS may use natively, i.e. split into the appropriate palettes, tilesets, etc. This can be either precomputed (done at build time, once and for all) or done at runtime, which may make it possible to benefit from the features of the general-purpose format, for example compression.
Tools include grit (GBA Raster Image Transmogrifier, new name for git) and gfx2gba (Linux port available) which, when given a file in 16 or 256-color PCX, TGA, SPR, BMP or TIM formats, can create a binary tileset (*.raw, with 8x8 tiles), a binary background map (*.map) and a 256-color palette file (*.bin).
One may use The Gimp to preconvert PNG and JPEG files to formats recognized by gfx2gba. There are also Python-Fu scripts for that tool, to convert images to formats appropriate for tiles and sprites. See this guide for further hints to use The Gimp with regard to the DS.
Another tool useful to convert sprites, backgrounds and textures is PAGfx, which has been ported to GNU/Linux (download here). Its user just has to update PAGfx.ini and then to run PAGfx.exe that will read it and perform its image manipulations. Prefer for example the magenta colorkey (transparent color). This will result in various .h and .c files to be used when building, so that the corresponding resources are embedded. This implies linking these data directly in the executable, which is rarely the best way to manage resources.
More information about PAGfx can be found in this section of PAlib tutorial.
Other tools include GbaGraphics, by Brunni, which works when run from Wine.
These converted graphics will then have to be stored so that the DS application can find them.
Setting up a full toolchain for graphical assets is a difficult task; we described our approach here.
Tools include gba-jpeg.
When developing a decoder, to read headers easily (notably fixed-size ones), one may use in-place structure reading, known as a (packed) overlay: it maps a sequence of bytes read from a file directly onto a C structure, provided the compiler is told how the data is padded.
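The packed overlay described above can be sketched as follows. This is a minimal illustration, not a complete decoder: the BmpFileHeader structure and the read_bmp_header function are hypothetical names, the packed attribute uses the GCC syntax, and the multi-byte fields are only correct if the CPU is little-endian like the file.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 14-byte header, laid out like the start of a BMP file.
   The packed attribute (GCC syntax) forbids any padding between the
   fields, so the structure mirrors the bytes of the file exactly. */
typedef struct __attribute__((packed)) {
    uint8_t  magic[2];    /* 'B', 'M' */
    uint32_t file_size;   /* stored little-endian in the file */
    uint16_t reserved1;
    uint16_t reserved2;
    uint32_t data_offset; /* where the pixel data starts */
} BmpFileHeader;

/* Overlays the first bytes of a file onto the structure.
   Returns 0 on success, -1 if the buffer is too short or if the
   magic bytes do not match. Assumes a little-endian CPU. */
int read_bmp_header(const uint8_t *bytes, size_t len, BmpFileHeader *out)
{
    if (len < sizeof *out)
        return -1;
    memcpy(out, bytes, sizeof *out);
    if (out->magic[0] != 'B' || out->magic[1] != 'M')
        return -1;
    return 0;
}
```

Without the packed attribute the compiler would probably insert two padding bytes after magic, and every following field would be read from the wrong offset.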
See also: Dovoto's tutorial about BMP decoding, LiraNuna's 2D tutorial, and Chris Double's tutorial #10 for more details.
Sprites are small (8x8 to 64x64 pixels) graphical objects that can be transformed independently from each other and can be used in conjunction with either bitmap or tiled background types (which are mutually exclusive). Like backgrounds, sprites are built out of tiles, but backgrounds and sprites cannot share tiles. The tiles available for sprites are stored in Object VRAM, or OVRAM for short. They can be shared between different sprites, thus changing a tile changes it for all the sprites that use it (e.g. when performing animations). The same is true for palettes.
OAM means Object Attribute Memory. This is where you control the sprites, which are managed by the DS hardware. At most 128 sprites are supported at a time, per rendering engine. Each of them has three so-called attributes (hence OAM), including its onscreen position (top-left corner of the sprite) expressed in pixels (hence sprites, even if they are made out of tiles, can be blitted to any screen location, not only at tile boundaries).
This position ranges between 0 and 511 for the abscissa, and between 0 and 255 for the ordinate. This means the sprite can be out of the screen bounds (256x192). Coordinates wrap around: a 512 abscissa is managed as if it was 0, same thing for a 256 ordinate. Other per-sprite information in OAM includes shape, size, requested flipping, and whether a rotset is used. If so, its number, in 0..31, is specified. In each rotset, a transformation matrix is specified, with zoom factors and rotation angle.
Generally speaking, the OAM is not directly read or written to, as during the VDraw this memory area is locked and cannot be accessed. This is less a limitation than a protection, as visual artefacts could be seen if the OAM was changed during screen update. Therefore programmers usually rely on OAM double buffering: they keep their own shadow copy of the OAM attributes in RAM, they can modify it at any time (including during VDraw), and during VBlank they update the real OAM from their shadow copy.
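The double buffering just described can be sketched as follows. This is only an illustration under stated assumptions: OamEntry, shadow_oam_set and shadow_oam_flush are hypothetical names, each entry is assumed to hold three 16-bit attributes followed by a fourth halfword that this sketch leaves untouched, and the address of the real OAM is left to the caller.

```c
#include <stdint.h>

#define SPRITE_COUNT 128

/* Assumed layout of one OAM entry: three 16-bit attributes, then a
   fourth halfword that this sketch never modifies. */
typedef struct {
    uint16_t attr[3];
    uint16_t unused;
} OamEntry;

/* Shadow copy living in main RAM: safe to modify at any time. */
static OamEntry shadow_oam[SPRITE_COUNT];

/* Changes one attribute of one sprite, in the shadow copy only. */
void shadow_oam_set(unsigned sprite, unsigned which, uint16_t value)
{
    if (sprite < SPRITE_COUNT && which < 3)
        shadow_oam[sprite].attr[which] = value;
}

/* To be called during VBlank: copies the attributes of the shadow
   copy into the real OAM, one 16-bit write at a time. */
void shadow_oam_flush(volatile OamEntry *hardware_oam)
{
    for (unsigned i = 0; i < SPRITE_COUNT; ++i)
        for (unsigned k = 0; k < 3; ++k)
            hardware_oam[i].attr[k] = shadow_oam[i].attr[k];
}
```

The game logic calls shadow_oam_set whenever it wants, and only the VBlank handler calls shadow_oam_flush, so the hardware never sees a half-updated sprite.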
There are three available color modes for sprites:
A special color, the colorkey, can be chosen in each of the color modes, so that the pixels that exactly match that color are not drawn at all: they will be fully transparent, which makes it possible to have sprites which do not color all their bitmap area, i.e. which can have a non-rectangular shape. For example a disc sprite should not overwrite the background in the part of its clipping rectangle that lies outside of the disc: in this area we need to see the background behind it. A good and quite usual colorkey choice is the "pure" magenta (full red, full blue, no green), as it is seldom used for real colors.
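What a colorkey means can be illustrated with a purely software blit. The sprite hardware performs the equivalent test by itself, so this sketch is only there to make the rule explicit; blit_with_colorkey is a hypothetical name, and pixels are assumed to be 16-bit values.

```c
#include <stddef.h>
#include <stdint.h>

/* Copies 'count' pixels from src to dst, skipping every source pixel
   that exactly matches the colorkey: the destination (background)
   remains visible there. */
void blit_with_colorkey(uint16_t *dst, const uint16_t *src,
                        size_t count, uint16_t colorkey)
{
    for (size_t i = 0; i < count; ++i)
        if (src[i] != colorkey)
            dst[i] = src[i];
}
```

If colors are packed with 5 bits per channel and red in the lowest bits (an assumption of this example), pure magenta would be 0x7C1F.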
As for the size of sprites, each dimension can be either 8, 16, 32 or 64 pixels (i.e. 1, 2, 4 or 8 tiles), but if a dimension is 64, then the other must be either 32 or 64 (not 8, not 16). See the corresponding table from PAlib documentation.
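The size rule can be written down as a small check, reading it as: each dimension is 8, 16, 32 or 64 pixels, and a 64-pixel dimension requires the other one to be 32 or 64. The function names are hypothetical.

```c
#include <stdbool.h>

/* True if d is one of the four dimensions a sprite side can take. */
static bool is_sprite_dimension(int d)
{
    return d == 8 || d == 16 || d == 32 || d == 64;
}

/* True if width x height is an allowed sprite size. */
bool sprite_size_is_valid(int width, int height)
{
    if (!is_sprite_dimension(width) || !is_sprite_dimension(height))
        return false;
    if (width == 64)
        return height == 32 || height == 64;
    if (height == 64)
        return width == 32 || width == 64;
    return true;
}
```

Such a check is mostly useful in asset conversion tools, to reject a bitmap before it reaches the DS.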
Each sprite must be given a number in 0..127 that identifies it, and that defines as well its priority in its background. The smaller a sprite number is, the higher its priority will be: in a given background, a given sprite will be on top of the sprites with higher numbers, as if the engine rendered the sprites incrementally, from 127 to 0. In addition to this per-sprite priority, sprites can be associated with one background in 0..3. Similarly, a background with a smaller number will be on top of other backgrounds, and its associated sprites as well. Thus, as for priority, the sprites are first sorted by background, and in a given background they are sorted by their own priority. For example sprite #0 of background #0 will always be on top of all other sprites.
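The resulting ordering can be expressed as a sort, which may help when a software renderer (for example a PC port) has to mimic the hardware. This is a sketch: SpriteRef and the function names are hypothetical, and the array is sorted in drawing order, i.e. back to front.

```c
#include <stdlib.h>

/* Hypothetical description of a sprite for ordering purposes. */
typedef struct {
    int background; /* 0..3, smaller is closer to the viewer */
    int number;     /* 0..127, smaller is closer to the viewer */
} SpriteRef;

/* Orders sprites from the furthest to the closest: higher background
   numbers first, then higher sprite numbers first. */
static int back_to_front(const void *a, const void *b)
{
    const SpriteRef *s = a;
    const SpriteRef *t = b;
    if (s->background != t->background)
        return t->background - s->background;
    return t->number - s->number;
}

/* Sorts the array in drawing order; the last element ends up on top. */
void sort_back_to_front(SpriteRef *sprites, size_t count)
{
    qsort(sprites, count, sizeof *sprites, back_to_front);
}
```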
Each sprite behaves a bit like a big tile: it has an identifier and some overall information such as flipping flags and palette information for 16-color sprites. It is nevertheless made from several real tiles, which can be stored in memory according to two conventions: 1D mapping or 2D mapping. Specific tools can be used to turn any bitmap into a suitable tileset for DS sprites.
To follow TONC naming conventions, each sprite is either regular (standard, basic) or affine (being able to be rotated, scaled, sheared).
A regular sprite has several attributes that define its size, actual shape (depending both on the shape and size attributes), position (x,y of the top-left sprite corner in screen space), background priority (higher priorities are drawn first, hence are below lower-priority ones), flipping (horizontal and/or vertical), base tile, bit depth (palette range), activation (shown/hidden), alpha-blending status, mosaic (blocky) effect status, windowing, and more.
The size of sprites can be 64x64 (the default), 64x32, 32x64.
Affine sprites are extended regular sprites. They have all regular attributes, and refer to additional affine-specific information, the previously mentioned rotsets. Each affine sprite is transformable: it references an affine 2x2 transformation matrix, which defines how it should be rotated, scaled or sheared when being rendered. This matrix transforms points expressed in screen space into points in texture space, not the other way round.
Because of round-off errors due to fixed-point matrix multiplications, one should avoid chaining transforms by reusing the previous affine matrix to compute the newer one: the error would soon grow too much. One should instead keep track of the current transformation settings (angles, zoom factor, etc.) and recompute an affine matrix from them.
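Recomputing the matrix from the stored settings could look like the following sketch. Several points are assumptions made for the example: the AffineMatrix structure and its field names, the 8.8 fixed-point format (256 stands for 1.0), the single zoom factor and the sign convention chosen for the rotation direction. The matrix is the inverse of "rotate then zoom", since it maps screen space to texture space.

```c
#include <stdint.h>

/* Hypothetical 2x2 matrix in 8.8 fixed point (256 represents 1.0). */
typedef struct {
    int16_t pa, pb;
    int16_t pc, pd;
} AffineMatrix;

/* Rebuilds the screen-to-texture matrix from the current settings.
   cos88, sin88: cosine and sine of the rotation angle, in 8.8 fixed
   point (for example taken from a lookup table).
   zoom88: zoom factor in 8.8 fixed point, must be strictly positive
   and large enough for the results to fit in 16 bits.
   The result is (1/zoom) * rotation(-angle). */
AffineMatrix affine_from_settings(int32_t cos88, int32_t sin88,
                                  int32_t zoom88)
{
    AffineMatrix m;
    m.pa = (int16_t)((cos88 * 256) / zoom88);
    m.pb = (int16_t)((sin88 * 256) / zoom88);
    m.pc = (int16_t)((-sin88 * 256) / zoom88);
    m.pd = (int16_t)((cos88 * 256) / zoom88);
    return m;
}
```

As each matrix is derived from the settings only, a rounding error made on one frame is never carried over to the next one.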
More information about these affine operations can be read from TONC, especially the affine sprite section about clipping, discretization and wrapping artefacts.
Sprites are discussed as well in Dovoto's tutorial and in this section of PAlib tutorial.
To perform animations, it is better to have a sprite refer to different tiles over time, rather than considering that a sprite owns the tiles it uses and therefore updating these tiles. Changing the tile index in the sprite attributes is quite fast and easy, but it requires that the various tiles for the animation frames fit in the tile block. For example, even with the biggest banks (128 kilobytes), no more than 32 64x64 sprites, in 256-color mode, can be stored.
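The figure above comes from simple arithmetic: a 64x64 frame at 8 bits per pixel takes 4096 bytes, and 128 kilobytes divided by 4096 gives 32. A small helper (with hypothetical names) can redo the computation for other sizes and color depths:

```c
#include <stddef.h>

/* Number of bytes taken by the tiles of one sprite frame. */
size_t sprite_frame_bytes(unsigned width, unsigned height,
                          unsigned bits_per_pixel)
{
    return (size_t)width * height * bits_per_pixel / 8;
}

/* Number of whole frames that fit in a bank of the given size. */
size_t frames_per_bank(size_t bank_bytes, unsigned width,
                       unsigned height, unsigned bits_per_pixel)
{
    size_t frame = sprite_frame_bytes(width, height, bits_per_pixel);
    return frame ? bank_bytes / frame : 0;
}
```

With 16-color sprites (4 bits per pixel) the same bank would hold twice as many 64x64 frames.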
Another method would be to load dynamically the tiles for the animations, for example thanks to DMA copies, either directly to VRAM, or in a kind of back-buffer located in main RAM that can be swapped with the current sprite buffer when the animation needs its data.
More information on bitmaps, backgrounds and sprites, coming from the GBA but mostly relevant for the DS as well.
The system's 3D hardware can perform a lot of operations:
Character rendering might have a blocky appearance due to Point (nearest neighbor) texture filtering. 2D sub-pixel glyph rendering could be used though, as it is done with the TextViewer homebrew.
The DS can render at most about 6144 vertices per frame (which translates to roughly 2048 triangles or 1536 quads per scene), at 60 frames per second. Rendering can occur only to a single screen at a time, so having 3D on both screens decreases performance significantly. The DS is more limited by its polygon budget than by its pixel fill rate.
There are 512 kilobytes of texture memory per screen, and the maximum texture size is 1024x1024 pixels.
The 3D core behaves a lot like an OpenGL state machine, allowing for wrappers and the reuse of rendering code and data. Actually OpenGL ES would be the closest encapsulation, as there is no support on the DS for floating point computations. Fixed-point is used instead.
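As an illustration of fixed-point arithmetic, here is a minimal sketch using 12 fractional bits. The 20.12 split and all the names (fx12, fx12_mul, etc.) are assumptions chosen for the example, not necessarily the ones of a given library.

```c
#include <stdint.h>

/* Hypothetical fixed-point type: 20 integer bits, 12 fractional bits. */
typedef int32_t fx12;

#define FX12_SHIFT 12
#define FX12_ONE   (1 << FX12_SHIFT) /* represents 1.0 */

/* Converts an integer to fixed point. */
static fx12 fx12_from_int(int32_t n)
{
    return n * FX12_ONE;
}

/* Converts back to an integer, dropping the fractional part
   (exact for non-negative values). */
static int32_t fx12_to_int(fx12 v)
{
    return v / FX12_ONE;
}

/* Multiplies two fixed-point values; the 64-bit intermediate avoids
   overflowing before the shift. */
static fx12 fx12_mul(fx12 a, fx12 b)
{
    return (fx12)(((int64_t)a * b) / FX12_ONE);
}

/* Divides two fixed-point values; b must not be zero. */
static fx12 fx12_div(fx12 a, fx12 b)
{
    return (fx12)(((int64_t)a * FX12_ONE) / b);
}
```

For example 0.5 is stored as 2048, and multiplying it by 5.0 (stored as 20480) yields 10240, i.e. 2.5.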
See also the TONC section about basic linear algebra and its 3D section for theoretical elements about 3D rendering.
Sound tends to consume a lot of CPU and RAM resources, hence the best is often to target first the hardware-supported features.
Only the ARM7 can access the sound hardware, so the ARM9 must somehow notify the ARM7 when it wants a sound to be played.
Sample data can be assigned independently to any of the 16 hardware channels. For a given channel, the playback frequency of the sample (e.g. 22 050 Hz), the sample size (e.g. 8 bit) and the sample data length can be set, among other settings (channel enabled or not, looping mode or one shot, volume, etc.). Stereo sound can be output, with panning from left to right for example.
Sound output requires sound data to play: see our data storage section, which explains how to access the audio resources needed by your application.
Sample data also need to be passed to the hardware in raw format, i.e. as a series of uncompressed samples without any header. This is not the case of the most usual formats (wav, mp3, OggVorbis, etc.), therefore they have to be converted one way or another to the expected raw format (beforehand or by the program itself, at start-up or in-game). This can be done thanks to the sox tool, or thanks to the wav2gba tool. More information about the conversion can be found here.
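When the conversion is done by the program itself, the first step for a WAV file is to locate the raw samples behind the header. The following sketch walks the RIFF chunks until it finds the "data" chunk; wav_find_samples is a hypothetical name, and no check or conversion of the sample format itself is attempted here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reads a 32-bit little-endian value, whatever the CPU endianness. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns a pointer to the raw samples of a WAV file loaded in memory
   and stores their size in *sample_bytes, or returns NULL if the file
   is not a RIFF/WAVE file or if it is truncated. */
const uint8_t *wav_find_samples(const uint8_t *file, size_t len,
                                size_t *sample_bytes)
{
    if (len < 12 || memcmp(file, "RIFF", 4) != 0 ||
        memcmp(file + 8, "WAVE", 4) != 0)
        return NULL;

    size_t pos = 12; /* first chunk after the RIFF/WAVE header */
    while (len - pos >= 8) {
        uint32_t size = read_le32(file + pos + 4);
        size_t body = pos + 8;
        if (size > len - body)
            return NULL; /* chunk claims more bytes than available */
        if (memcmp(file + pos, "data", 4) == 0) {
            *sample_bytes = size;
            return file + body;
        }
        pos = body + size + (size & 1u); /* chunks are 2-byte aligned */
        if (pos > len)
            return NULL;
    }
    return NULL;
}
```

Skipping a fixed number of bytes instead would break on files that carry extra chunks before the samples.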
Relying on the sound mixing hardware is the best bet: it can manage the 16 independent audio channels with about 5% memory bandwidth usage. Therefore using MOD-like soundtracks is favored, even by commercial games (moreover, according to some sources, using the official SDK implies being unable to program the ARM7, so they endure worse limitations than homebrew). Supported formats are 8 bit, 16 bit, or ADPCM. Any sound played through the DS sound hardware at a frequency other than 32768 Hz will be resampled to 32768 Hz with a nearest-neighbor algorithm.
With advanced decoding like OggVorbis or MP3, one may eat around 50% of the ARM9 CPU power, or almost 100% of the ARM7, still with a pretty low bitrate. However with careful design and appropriate settings, as we will see for example with Helix-OSDL, one can have highly-compressed high quality music on the DS with quite a low resource consumption (CPU, RAM, storage-wise).
One could use the Tremor implementation to benefit from a FPU-less OggVorbis decoder (check then that memory leak issues have been solved), even if OggVorbis decoding on the ARM7 is not proven to be feasible, CPU-wise.
There is an alternative to OggVorbis for playing highly compressed music on the ARM7: use an MP3 decoder, which is generally expected to need fewer CPU resources than OggVorbis ones.
This may be done thanks to:
When using the ARM7 for that purpose, tight 32 bit (ARM, not Thumb) loops should be favoured. If games are generally interested in shifting the load to the ARM7, other applications, less CPU-bound, may just let the decoding take place on the ARM9.
When developing a game, after the audio content has been secured (i.e. obtained with an appropriate licence), in general it has to be transformed. On GNU/Linux, powerful sound editors are Sox (a command-line tool) and Audacity (a graphical tool). Quite often the audio files do not conform totally to their expected format and/or are somewhat corrupted, causing various post-processing tools or playback to fail. Audacity can be used to fix them, by first importing and then exporting them directly, as is.
After the audio content has been manipulated (cut, cleaned, and possibly run through various effects, including volume adjustment), it has to be converted, format-wise. Quality often has to be lowered, as the handheld resources are scarce.
When targeting an embedded MP3 decoder on the DS, lightening and conversion can be performed in a single step. For that we use the LAME encoder, whose reputation is quite good.
For example, if you have a YourLongMusic.wav with the following format: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, stereo 44100 Hz (use the UNIX file command to figure it out), and if its size is about 145 megabytes (13'50 of high-quality music), you may use LAME to obtain a YourLongMusic.mp3 of only about 8 megabytes, i.e. a compression ratio of about 18:1.
To reach the previous ratio, a quite low (here constant) bitrate has to be used (80 kbps). The bitrate determines the bandwidth budget you chose, and within this budget you have to select which features are to be available. You can mainly play with two variables: the number of output channels (e.g. mono or stereo) and the output sampling frequency (e.g. 22.05, 44.1 kHz, etc.).
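These sizes can be checked with a bit of arithmetic: 13'50 is 830 seconds, uncompressed PCM takes rate x channels x bytes per sample for each second, and a constant bitrate of 80 kbps takes 10 000 bytes per second. The helper names below are hypothetical, and 1 kbps is taken as 1000 bits per second.

```c
#include <stdint.h>

/* Size in bytes of uncompressed PCM audio. */
uint64_t pcm_size_bytes(uint32_t seconds, uint32_t sample_rate_hz,
                        uint32_t channels, uint32_t bits_per_sample)
{
    return (uint64_t)seconds * sample_rate_hz * channels *
           (bits_per_sample / 8);
}

/* Approximate size in bytes of a constant-bitrate stream, ignoring
   tags and container overhead (1 kbps = 1000 bits per second). */
uint64_t cbr_size_bytes(uint32_t seconds, uint32_t kbps)
{
    return (uint64_t)seconds * kbps * 1000 / 8;
}
```

For the example file this gives about 146 million bytes before encoding and about 8.3 million bytes after, hence a ratio of roughly 18.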
Of course, for the same bandwidth budget, choosing stereo at 44.1 kHz will cause the encoding to lose a lot of audio details. So with a tiny budget of 80 kbps, to preserve the sound quality, one should probably prefer mono at 22.05 kHz. The following command can be used for such a conversion:
|
lame YourLongMusic.wav --verbose -m m -b 80 -q 0 --resample 22.05 YourLongMusic.mp3
|