Optimizing with SDL

Important note : this file, which gathers some topic-specific knowledge from the SDL mailing list, has not been integrated into our SDL documentation corner yet, as the OpenGL and audio topics already are. If you want to help us do so, feel free to tell us. Thanks !

Overview

This section gathers clues and hints for using SDL efficiently. These tricks might provide a major speed-up of one's frame rate (FPS).

Table of contents

 - Using OpenGL for per-pixel drawing
 - Counting FPS
 - Double buffering _IS_ how you overcome flickering
 - Understanding the video pipeline to use it efficiently
 - Blitting section
 - DGA section
 - Optimize for the glSDL backend
 - Optimizing one's use of SDL
 - Please react !


Using OpenGL for per-pixel drawing

You have to use OpenGL calls to read and write pixels. You should avoid those, though, since they are generally slow, and may interfere with accelerated rendering : they force the CPU to hard-sync with the accelerator, so the CPU and GPU cannot work in parallel, which can cause a major frame rate reduction.
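
Those are calls such as glReadPixels() and glDrawPixels(). A minimal sketch of the kind of read-back to avoid in the middle of a frame (x and y being arbitrary pixel coordinates) :

GLubyte pixel[4];
// The driver must flush all pending commands and wait for the GPU to finish
// before this single pixel can be returned : CPU and GPU no longer overlap.
glReadPixels(x, y, 1, 1, GL_RGBA, GL_UNSIGNED_BYTE, pixel);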
Counting FPS

You should take a sample (i.e. measure over, say, 10 frames). Something like this :

int sample = 10;      // recompute the FPS every 'sample' frames
int count  = 0;
Uint32 startTime = 0; // milliseconds, as returned by SDL_GetTicks()
float fps = 0.0f;

// This will need to be called once at the start as well.
void reset()
{
    count = 0;
    startTime = SDL_GetTicks();
}

void draw()
{
    drawframe();
    count++;
    if (count == sample)
    {
        // SDL_GetTicks() counts in milliseconds, hence the factor of 1000.
        fps = count * 1000.0f / (SDL_GetTicks() - startTime);
        reset();
    }
}

If you don't take a sample, your results will :
1) count the FPS over the entire program run, so the value will eventually
get stuck on some number ; besides, you generally don't want to take initial
program delays into account.
2) eventually overflow the frame counter.


Double buffering _IS_ how you overcome flickering

Are you getting flickering (assuming your double buffering
works as it should), or are you getting tearing ?

If you're getting tearing, you might want to look at your vsync
settings. Ideally, you want your frame rate to be in sync
with your monitor's refresh rate...
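
When rendering through OpenGL with SDL 1.2.10 or later, one way to request vsync is the SDL_GL_SWAP_CONTROL attribute. A minimal sketch (whether the driver honours the request varies) :

// Ask for retrace-synced buffer swaps ; must be set before SDL_SetVideoMode().
SDL_GL_SetAttribute(SDL_GL_SWAP_CONTROL, 1);
SDL_SetVideoMode(640, 480, 32, SDL_OPENGL);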




Understanding the video pipeline to use it efficiently


Often, poor performance comes from a lack of knowledge about the way rendering is achieved by its actors, from the CPU and its central memory to the GPU and the video card memory (VRAM), through the graphics bus, such as AGP. Choosing an efficient scheme is a bit tricky : one could argue that modern PCs are not designed for software rendering, since video cards are able to compute scenes on their own at very low overall resource cost ; but using them uniformly, on all platforms, without special rights or complex configuration changes, is difficult.

The problem is that CPU access to VRAM is very slow on modern video cards. As if that were not bad enough, memory reads are insanely slow, many times slower than writes. And some operations, such as alpha blending, happen to be read-modify-write operations...

What you are really supposed to do, for example if you want to deal with transparency, is to upload your graphics to the video card, and then use the 3D accelerator for rendering, that is, OpenGL or Direct3D, which is what modern hardware is designed for. The major advantages are that this runs extremely fast on any reasonably modern hardware, and that you get filtered scaling, rotation, color modulation, blending and the like with no significant performance impact.
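
A minimal sketch of the idea, assuming an SDL_OPENGL video mode with a 2D projection already set up, and a surface already converted to RGBA byte order with power-of-two dimensions (both assumptions matter on older hardware) ; x, y, w, h are the sprite's screen coordinates :

GLuint tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
// Upload once ; from now on, the sprite lives in VRAM.
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, surface->w, surface->h,
    0, GL_RGBA, GL_UNSIGNED_BYTE, surface->pixels);

// Each frame : draw the sprite as a textured quad.
glEnable(GL_TEXTURE_2D);
glBegin(GL_QUADS);
glTexCoord2f(0, 0); glVertex2f(x,     y);
glTexCoord2f(1, 0); glVertex2f(x + w, y);
glTexCoord2f(1, 1); glVertex2f(x + w, y + h);
glTexCoord2f(0, 1); glVertex2f(x,     y + h);
glEnd();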

However, using 3D acceleration means your game will not run without 3D drivers and libraries, and, without a 3D accelerator, you may actually see even worse frame rates than you get now. So this is not an option, unless adding a 3D accelerator with OpenGL and/or Direct3D drivers to the minimum system requirements is otherwise motivated.

There is a simple trick, though : render into a software shadow surface, copy it (plain blit) to the display surface and then flip. Repeat for each frame. Note that if you get a software display surface from SDL, you should not set up a shadow surface of your own : that would just be a waste of memory and cycles.
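
A minimal sketch of that scheme (SDL 1.2 ; renderScene() stands for whatever software rendering your application does) :

// Set-up : a double buffered display, plus a software shadow surface
// matching its format.
SDL_Surface * screen = SDL_SetVideoMode(640, 480, 32,
    SDL_HWSURFACE | SDL_DOUBLEBUF);
SDL_Surface * shadow = SDL_CreateRGBSurface(SDL_SWSURFACE,
    screen->w, screen->h, screen->format->BitsPerPixel,
    screen->format->Rmask, screen->format->Gmask,
    screen->format->Bmask, screen->format->Amask);

// Each frame :
renderScene(shadow);                         // alpha blending etc. in system RAM
SDL_BlitSurface(shadow, NULL, screen, NULL); // one write-only copy to VRAM
SDL_Flip(screen);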

That way, you get fast software rendering with alpha-blending and other stuff, but you still perform only writes to VRAM. The primary advantages are that this works everywhere, and that you can implement your own low level software blitting routines for pixel level effects. You can do that with OpenGL and Direct3D, but you cannot do things the same way, and it still has major performance issues on most systems.

> Why would double buffering reduce your frame rate ???

Because it allows SDL to use retrace-synced page flips when supported by the driver. Without double buffering, SDL won't even try to sync. If the driver doesn't support retrace sync, or it's been disabled, double buffering (provided it's still h/w page flipping !) should indeed have an insignificant impact on performance.

>> You're drawing the same amount of stuff, you're just
>> doing it on the back buffer...

...but if you're running in windowed mode, or on one of the targets that cannot do h/w page flipping, SDL_Flip() has to do an extra back->front blit.
[Back to the table of contents]


Blitting section

Managing surface formats

To prevent SDL_BlitSurface from performing a time-consuming conversion (to match bits per pixel) every time it blits a given software surface onto another (the target being in general the screen display, whose color depth is fixed and not easily changed), use SDL_DisplayFormat on the source surface which will be blitted, or SDL_DisplayFormatAlpha if you want transparency to be managed. One should indeed convert one's surfaces to the screen's depth before blitting them : it is better to do it once than to have SDL perform it each frame for each sprite on your behalf, which would make the frame rate drop tremendously. Notably, conversions from 8 to 16, or 8 to 32, bits per pixel are very expensive. Usually, the conversion ought to be done upon loading the corresponding image. Do not forget to free the old surface after calling SDL_DisplayFormat, otherwise you have a memory leak !
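
The usual loading pattern, sketched here with a hypothetical image.bmp :

SDL_Surface * temp   = SDL_LoadBMP("image.bmp");
// Use SDL_DisplayFormatAlpha() instead if transparency must be kept.
SDL_Surface * sprite = SDL_DisplayFormat(temp);
SDL_FreeSurface(temp); // free the old surface, otherwise it leaks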

If the source surface is changed every frame, one should initialize the video mode so that it fits the source format, in order to avoid an on-the-fly conversion. That however requires the user to start the X server with the corresponding bits-per-pixel setting. The other way of dealing with formats is to change the surface color depth so that it matches the display's.
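
A sketch of the first approach, assuming 16 bits per pixel source art :

// Ask for a 16 bpp mode to match the 16 bpp source surfaces ; under X11, the
// server itself must run at that depth, otherwise SDL emulates the mode with
// a shadow surface and the conversion happens anyway.
SDL_Surface * screen = SDL_SetVideoMode(640, 480, 16, SDL_SWSURFACE);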

RLE optimization

Surfaces compressed with Run-Length Encoding (RLE) are smaller and blitted faster, but cannot be accessed directly as an array of pixels. To prevent user programs from manipulating them while they are still compressed, the yourRLESurface->pixels field is set to 0 (NULL) and the RLE surface data is stored privately, as long as the surface is not locked.

To access the raw pixel data again:

SDL_LockSurface( yourRLESurface ) ;
// yourRLESurface->pixels is valid in here : locking triggered the decompression.
SDL_UnlockSurface( yourRLESurface ) ;
// yourRLESurface->pixels is no longer valid (it is set back to NULL to expose
// bugs cleanly) : the surface has been compressed again internally.

Modifying an existing RLE surface costs you a decode (when locking) and a re-encode (when unlocking). The benefit of RLE is faster rendering, thanks to the compression. If the RLE surface in question is being pre-rendered (instead of being modified frequently), this trade-off is a good one.

And if you're using surfaces with alpha transparency, there can be a decent performance increase from turning on a surface's RLEACCEL flag :

// ... create and load an SDL surface ...
SDL_SetAlpha(surface, SDL_RLEACCEL | SDL_SRCALPHA, 255);

Just be sure to read the caveats of SDL's alpha transparency on the SDL_SetAlpha() page in the SDL docs (http://sdldoc.csn.ul.ie/sdlsetalpha.php).

[See also hardware surfaces]

Obtaining best performance with video RAM

If one wants to benefit from better performance when reading from or writing to video RAM, one ought to consider the following :


[Back to the table of contents]


DGA section

DGA, which stands for Direct Graphics Access, is an extension provided by the XFree86 project. DGA is meant to be used in fullscreen mode only.

Mouse cursor, acceleration

When you are drawing directly to video memory, SDL has no way of drawing the mouse cursor behind your back without corrupting your image. As no really good workaround has been found so far, if you get an SDL_HWSURFACE, you will need to draw the cursor yourself.
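
A minimal sketch of drawing it yourself (cursorSurface being a hypothetical pre-loaded cursor image) :

SDL_ShowCursor(SDL_DISABLE); // keep SDL from trying to draw its own cursor

// Each frame, after rendering the scene :
int mouseX, mouseY;
SDL_GetMouseState(&mouseX, &mouseY);
SDL_Rect dest = { mouseX, mouseY, 0, 0 }; // w and h are ignored for blits
SDL_BlitSurface(cursorSurface, NULL, screen, &dest);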

The X server tends to provide mouse acceleration, which is not present when you are using the DGA driver. In DGA mode, the X server sends raw mouse events without doing any post-processing (to modify speed, etc.)

We could query the X server for the mouse acceleration options and use those to modify the DGA mouse events....

- If you're using alpha blending, make sure that your display surface is software, or alternatively use a software buffer surface.
- After loading images, always use SDL_DisplayFormatAlpha() on the surfaces.
- Also call SDL_SetAlpha() on the surfaces, with the SDL_RLEACCEL flag.

There is also benchmark software for SDL with OpenGL.

Run a profiler tool ('man gprof' if you're on *nix, or "Profiling/Profiler" in MSVC) and find out what is taking time. The question you should ask yourself is : "What is the bottleneck ?"
1) graphics rendering (could be further subdivided...)
2) the logic/physics of the game
3) something else ? (the OS stealing your app's time, event handling, ...)

Optimize for the glSDL backend

Running stock applications with the glSDL backend might be extremely slow. Tuning applications for performance in general, and for glSDL in particular, is described here.

a) Turn off the alpha channel (e.g. with the GIMP, flatten the image) if you don't need it.
b) Convert surfaces to the display format.
c) Enable RLE colorkey for surfaces with a lot of fully transparent pixels (if you don't need to change them).
d) Enable RLE alpha for surfaces with a lot of semi-transparent pixels (if you don't need to change them ; a sketch of c) and d) is given at the end of this section).
e) Create a buffer for the background and for surfaces that are rarely modified.
f) Try hardware surfaces (this depends on the hardware support : check first, before setting the video mode, whether the hardware has accelerated alpha surface blitting support ; otherwise a software surface may be faster).
g) Or forget all that, and render images with the OpenGL API.

Having access to the frame buffer (VRAM) only occurs when your program is run fullscreen. You'll never (someone correct me if I'm wrong, but really, I think it is this way) get a windowed hardware surface.

Profiling : with GCC, put the option -pg in both the compiling and the linking steps. Then run the program, and use gprof to see the profiling information.

How does one build an optimised SDL library ? Do export CFLAGS="-funroll-loops -fexpensive-optimizations -march=i686 -mmmx" before configure (or maybe just before make ?).

But I've found that it's the amount of data you push to display memory, rather than what you are working with in system memory, that is the bottleneck. Blitting a 32 bpp system memory surface onto a 16 bpp display surface is actually faster than blitting 32 bpp to 32 bpp !

You can't rely on the GPU doing anything much at all when using the SDL 2D API (unless you use glSDL, which uses the OpenGL API, which is more likely to be accelerated on most platforms). Now, if you use alpha blending, you can pretty much forget about acceleration (only glSDL and DirectFB support that, AFAIK). And what's worse, if you do alpha blending with the CPU in VRAM, you'll effectively downgrade the CPU to something like a 286, due to the slow VRAM reads.

In short, if you want the best performance on more than one platform, there's no simple answer. Sometimes you should use h/w surfaces for everything. Sometimes you should definitely *not* use h/w surfaces for anything but the display surface. In some cases, some kind of hybrid may be faster. (For example, if everything but alpha is accelerated, your best bet may be to use that, and do s/w alpha in small areas only.) Either way, pretty much *anything* (including streaming from the hard drive, in many cases !) is faster than reading from VRAM with the CPU... Avoid it at (almost) any cost. *Only* blit from the display surface if you know that that particular operation (for some mysterious reason) actually is accelerated, or at least relatively fast. (CPU access to "VRAM" *can* be fast on consoles and machines with integrated graphics, but don't count on it. On graphics cards with PCI, AGP and similar buses, bus-master DMA is your only option.)

I would recommend against copying dirty rectangles, unless rendering the background is extremely expensive and/or you know that blitting from the screen is accelerated. (If it isn't, you'll be doing CPU reads from VRAM, which, for the 13,546,572nd time, is insanely slow on pretty much any computer you can find today.) Instead, just re-render the dirty areas from the map to remove sprites. Either way, note that you need to do some extra work to make dirty rects work properly on page-flipping double-buffered displays : with plain dirty rects, moving sprites will leave flickering trails.
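
The sketch promised above for items c) and d), assuming an already loaded 'sprite' surface (the magenta colorkey is an arbitrary convention of this example) :

// c) RLE-accelerated colorkey : magenta marks the fully transparent pixels.
Uint32 key = SDL_MapRGB(sprite->format, 255, 0, 255);
SDL_SetColorKey(sprite, SDL_SRCCOLORKEY | SDL_RLEACCEL, key);

// d) RLE-accelerated alpha blending (255 keeps any per-pixel alpha in use).
SDL_SetAlpha(sprite, SDL_SRCALPHA | SDL_RLEACCEL, 255);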

Optimizing one's use of SDL



The first thing to do is to convert all your surfaces to the screen format, using SDL_DisplayFormat. Otherwise SDL does the conversion on the fly, and this kills the performance. Once this is done, you must optimize your layer algorithm. For example, if the first layer is almost covered by the others, then you should try to blit only the visible part of it (see the sketch below). I did this to make an efficient parallax effect with a tiled map. I can send you some code if you want.

Another point is alpha. If you use alpha, you must know that if your surfaces are in video memory, blitting will be dog slow, because the blitting is done by the CPU. If you don't need an alpha channel, and your surfaces are in video memory, the blitting is done by the GPU and is very fast.

On Linux 2.6, the image loading thread causes a MAJOR slowdown. Make sure DMA access is enabled :

hdparm -d /dev/hda
/dev/hda:
 using_dma = 1 (on)

And what does 'hdparm -i /dev/hda' reveal about the supported PIO, DMA and UDMA modes ? Running 'hdparm -t -T /dev/hda' will perform read-only speed tests and show the results.
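
A sketch of blitting only the visible part of a large background layer, as mentioned above (cameraX and cameraY being hypothetical scrolling offsets) :

// Copy only the screen-sized window of the layer that the camera sees.
SDL_Rect src;
src.x = cameraX;
src.y = cameraY;
src.w = screen->w;
src.h = screen->h;
SDL_BlitSurface(backgroundLayer, &src, screen, NULL);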





Please react !

If you have more detailed or more recent information than what is presented in this document, or if you noticed errors, omissions or points insufficiently discussed, drop us a line !




[Top]

Last update : 2006