Sam MacPherson

Flash, Haxe, Game Dev and more…

To Batch or not to Batch

Once again I’ve revisited my rendering system, this time with the intention of switching from a one-draw-call-per-sprite setup to a batched one. Overall the switch was relatively painless, but I did learn some lessons along the way, and I wanted to share some of the pitfalls I ran into.

First, I will briefly explain what a batched sprite rendering system is. The basic idea is that every time you call drawTriangles() you incur some overhead. If you are using a naive setup (like I was) you are probably calling drawTriangles() once for every sprite. This is okay for a few sprites, but once you get a couple thousand it becomes extremely inefficient. The name of the game is to minimize the number of drawTriangles() calls. To do this, we don’t immediately render the sprite when render is called. Instead we batch the sprite’s vertices onto a global vertex buffer with the intent of rendering everything at the end with one drawTriangles() call. Simple. Well, not really. Using this method has some implications:

1. We can no longer use the GPU to apply sprite-specific transforms.
2. A batch of sprites must share the same texture.

Implication (1) means that we must do the coordinate transforms on the CPU. This is not ideal, but it is unavoidable. If you have a reasonable number of sprites (say a couple thousand) then this should be fine.
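To make the idea concrete, here is a minimal sketch in Python rather than Haxe. It is not my actual renderer: the class name SpriteBatch, the x, y, u, v vertex layout and the draw_calls counter (a stand-in for a real drawTriangles() call) are all assumptions made for illustration. Each sprite's corners are transformed on the CPU and appended to one shared list, and flush() issues the single draw.

```python
import math


class SpriteBatch:
    """Collects CPU-transformed quads so they can be drawn with one call."""

    def __init__(self):
        self.vertices = []   # flat list: x, y, u, v for every vertex
        self.indices = []    # two triangles (six indices) per quad
        self.draw_calls = 0  # stand-in for real drawTriangles() calls

    def add(self, x, y, width, height, rotation=0.0):
        """Transform one sprite's corners on the CPU and append them."""
        cos_r, sin_r = math.cos(rotation), math.sin(rotation)
        base = len(self.vertices) // 4  # index of this quad's first vertex
        for cx, cy in [(0, 0), (1, 0), (1, 1), (0, 1)]:
            lx, ly = cx * width, cy * height      # corner in sprite space
            wx = x + lx * cos_r - ly * sin_r      # rotate, then translate
            wy = y + lx * sin_r + ly * cos_r
            self.vertices.extend([wx, wy, float(cx), float(cy)])
        self.indices.extend([base, base + 1, base + 2,
                             base, base + 2, base + 3])

    def flush(self):
        """Issue one (simulated) draw call and return the triangle count."""
        triangles = len(self.indices) // 3
        if triangles:
            self.draw_calls += 1
        self.vertices.clear()
        self.indices.clear()
        return triangles
```

Adding a thousand sprites and calling flush() once results in a single simulated draw call covering two thousand triangles, which is the whole point of batching.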

Because of (2) we must make separate drawTriangles() calls every time we need to switch textures. But wait, hang on. If we need to make another call every time we switch textures, doesn’t that leave us where we started, seeing as different sprites will likely have different images? The answer is yes and no. First and foremost, you can group sprites of the same image into the same draw call. However, you can go one step further. Instead of allocating a texture for every image, we allocate a global store of massive textures (2048×2048 pixels). We then stamp in all the smaller images and give the render jobs appropriate U,V coordinates. Very nice! If your game has a lot of small sprites this will be lightning fast, probably under 3 draw calls.
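As a rough illustration of the stamping step, the sketch below packs images into one large texture row by row and hands back normalized U,V rectangles. The TextureAtlas class, its stamp() method and the simple row-by-row packing are assumptions for this example, not the packing my engine uses; only the 2048×2048 size comes from the text above. No pixels are copied here, only the bookkeeping is shown.

```python
class TextureAtlas:
    """Row-by-row packer that records where each image lands in the atlas."""

    def __init__(self, size=2048):
        self.size = size
        self.cursor_x = 0      # next free x position in the current row
        self.cursor_y = 0      # top of the current row
        self.row_height = 0    # tallest image placed in the current row
        self.regions = {}      # name -> (u0, v0, u1, v1)

    def stamp(self, name, width, height):
        """Reserve space for an image and return its U,V rectangle."""
        if width > self.size or height > self.size:
            raise ValueError("image is larger than the atlas")
        if self.cursor_x + width > self.size:
            # The image does not fit in this row, so start a new one.
            self.cursor_x = 0
            self.cursor_y += self.row_height
            self.row_height = 0
        if self.cursor_y + height > self.size:
            raise ValueError("atlas is full")
        region = (self.cursor_x / self.size,
                  self.cursor_y / self.size,
                  (self.cursor_x + width) / self.size,
                  (self.cursor_y + height) / self.size)
        self.regions[name] = region
        self.cursor_x += width
        self.row_height = max(self.row_height, height)
        return region
```

Every sprite whose image lives in the same atlas can then go into the same batch, with its quad's U,V values taken from the stored rectangle instead of spanning 0 to 1.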

Ok, so we go ahead and do this and oops, we have another problem. Because we are grouping the sprites by texture, they will no longer necessarily be sorted by depth. This is a setback, but perhaps we could just batch the sprites until we encounter a texture change, then flush the buffer and start again. This will of course work, but it is only efficient if sprites of similar depth also share the same texture, which may not be true. For example, when testing this with my game I went from 3 draw calls to about 300. This is unacceptable.
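The effect is easy to reproduce with a toy example. The sketch below (the function name and the made-up scene of 300 sprites cycling through three textures are assumptions, not data from my game) counts how many flushes the flush-on-texture-change approach needs for a given draw order.

```python
def count_draw_calls(sprites):
    """Count flushes needed when a batch must end at every texture change.

    sprites is a sequence of (depth, texture_id) pairs in draw order.
    """
    calls = 0
    current = None
    for _depth, texture in sprites:
        if texture != current:
            calls += 1          # texture changed: flush and start a new batch
            current = texture
    return calls


# Hypothetical worst case: neighbouring depths never share a texture.
sprites = [(depth, depth % 3) for depth in range(300)]
by_depth = sorted(sprites, key=lambda s: s[0])
by_texture = sorted(sprites, key=lambda s: s[1])
```

With this scene, drawing in texture order takes 3 flushes, while drawing in depth order takes 300, one per sprite.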

Well, we are rendering on a 3D graphics card, so why not use the depth buffer to do the sorting for us? Every frame we assign a global depth value to each sprite and enable the depth buffer. This may appear to be the best solution possible, but it has one major flaw: you can’t use translucent textures. The reason is that the depth buffer does not understand alpha compositing. All it understands is geometry. Either a triangle is blocking something behind it or it is not. This is where we have to make a decision. There are four equally valid solutions that I have come up with.
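To see why translucency breaks, here is a toy simulation of a single pixel with a depth test and standard alpha blending. It is only a sketch: the function name, the convention that a smaller depth is nearer, and the single-number "colors" are assumptions for illustration.

```python
def render_pixel(fragments, background=0.0):
    """Simulate one pixel with depth testing and alpha blending.

    fragments is a list of (depth, color, alpha) in submission order;
    a smaller depth means nearer to the camera.
    """
    color = background
    nearest = float("inf")
    for depth, frag_color, alpha in fragments:
        if depth >= nearest:
            continue  # depth test failed: something nearer was already drawn
        nearest = depth  # the depth write happens regardless of alpha
        color = frag_color * alpha + color * (1.0 - alpha)
    return color


wall = (0.9, 1.0, 1.0)    # far away, white, opaque
glass = (0.1, 0.0, 0.5)   # near, dark, half transparent
```

Drawing the wall and then the glass gives 0.5, the expected blend. Drawing the glass first gives 0.0, because the glass writes its depth and the wall behind it is rejected before it can show through. Two opaque fragments, by contrast, give the same result in either order.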

General Purpose Solutions

1. Fall back to the render-on-texture-switch method and do some optimizations to try to align depth-locality with texture-locality. (Could be optimal depending on setup)

2. Render all opaque images first, letting the GPU depth buffer do the sorting. Then render the transparent images afterwards using method (1). (Works very well if there aren’t many transparent images)

Specific Solutions

3. Only use opaque textures. (Optimal)

4. Only use textures with quantized alpha values of either 0 or 1. (Optimal, but requires an extra instruction in the shader)
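Method (2) can be sketched roughly as follows. The dictionary keys, the function name and the convention that a larger depth value is farther away are assumptions for this example. Opaque sprites are grouped by texture, since the depth buffer sorts them, while translucent sprites are drawn afterwards from back to front.

```python
def build_render_order(sprites):
    """Return sprites in draw order: opaque by texture, then translucent
    back to front.

    Each sprite is a dict with 'depth', 'texture' and 'opaque' keys;
    a larger depth is assumed to be farther from the camera.
    """
    opaque = [s for s in sprites if s["opaque"]]
    translucent = [s for s in sprites if not s["opaque"]]
    # Order does not matter for correctness here, so minimize texture changes.
    opaque.sort(key=lambda s: s["texture"])
    # Blending needs the farthest sprite drawn first.
    translucent.sort(key=lambda s: s["depth"], reverse=True)
    return opaque + translucent
```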

All of these methods have their strengths and weaknesses. Personally I decided to go with method (4) for my game. Method (4) is very similar to method (3); they differ by only one instruction in the shader. For (4) you include a KIL opcode (kill() in hxsl) which, when the alpha channel is less than one, aborts both the pixel write and the depth buffer write.
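The effect of that extra instruction can be simulated for a single pixel. This is a sketch of the behavior described above, not shader code: the function name, the smaller-is-nearer depth convention and the single-number "colors" are assumptions.

```python
def render_pixel_with_kill(fragments, background=0.0):
    """Simulate one pixel where fragments with alpha < 1 are killed.

    fragments is a list of (depth, color, alpha) in submission order;
    a smaller depth means nearer to the camera.
    """
    color = background
    nearest = float("inf")
    for depth, frag_color, alpha in fragments:
        if alpha < 1.0:
            continue  # killed: neither color nor depth is written
        if depth < nearest:
            nearest = depth
            color = frag_color
    return color


hole = (0.1, 0.0, 0.0)    # transparent texel of a nearby sprite
floor = (0.9, 1.0, 1.0)   # opaque texel of a sprite behind it
```

Because the transparent texel never touches the depth buffer, the floor shows through it with a result of 1.0 whichever of the two is submitted first, so the sprites can be batched by texture without any depth sorting.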

There are definitely other solutions out there, but these were the best ones I could come up with after working on this for several hours. Hope this helps.


8 responses to “To Batch or not to Batch”

  1. Iain July 16, 2011 at 5:03 am

    Thanks so much for writing this! A lot of things make a lot more sense now – it explains why certain 2d molehill demos are slower than blitting or displayObjects. How do 3D engines handle transparency?

  2. Sam July 16, 2011 at 1:58 pm

    No problem. I’m glad you enjoyed reading it. I am not a 3D developer, but if I had to guess I would say that something similar would have to be done.

  3. Pingback: This Week on Twitter and Google+ | Flash Video Traning Source

  4. TheUnSpoken July 17, 2011 at 7:28 pm

    As you probably know by now, I am part of your staff on Pawn. First of all, I have no knowledge of coding at all. Although, I do plan on learning in the near future. You gave me a different aspect on how hard this actually is. I read this whole blog, and the last couple of posts as well. I understand more things that are happening with Zed. I enjoy reading these (even though I don’t understand half of it). Keep it up! Good job Sam.

  5. Qwerb October 22, 2011 at 5:08 am

    Calling a shader.draw isn’t really that slow; you can get some pretty good results by giving sprites their own vertex buffers.

  6. Sam October 22, 2011 at 1:29 pm

    Maybe not on its own, but start calling draw 5000 times and you will see a significant performance drop. Not to mention after 64 draw calls the GPU no longer runs asynchronously.

    Personally, for my current project, switching from a non-batched to a batched system made a difference of about 20ms under heavy load.

    FYI This is for flash/molehill. Not graphics in general.

  7. Kevin Newman February 2, 2012 at 11:40 am

    Let’s see if I got this right. 🙂 This means you actually do all Matrix transforms on the CPU using either Matrix3D or doing the math manually in a Vector.<Number> of vertex data? It looks like that’s what Starling does in its QuadBatch. I’m attempting to convert some 2D molehill examples I’ve seen to doing it that way – just about every (small) example I’ve seen uses a quad normal (I think I got that term right), and shader constants to adjust the positioning (Context3D.setProgramConstantsFromMatrix), and they are only set to draw the one quad per drawTriangles (though I’d guess you could expand it out to batch as many transform matrices as could fit in the available constants – replacing setProgramConstantsFromMatrix with setProgramConstantsFromVector and concatenating multiple Matrix3Ds into that). I’m going to go play with all this stuff (finally found a few minutes).

  8. Sam February 4, 2012 at 4:37 pm

    @Kevin Newman – Yes, that’s another way of doing it. Depending on the number of quads you need to render, you can get away with passing in an “array” of matrices in the vertex shader constants. As I recall there should be 128 float4 constants available, which gives 32 4×4 matrices. Mind you, you will need to pass an array index into the vertex buffer for each quad.

    It sounds like you know what you’re doing though. Good luck. 😀
