Rosario 3D

My developer blog

Category Archives: programming

Triangle render order optimization

If our render queue is fully optimized and performs the minimum number of state changes on the driver, is there anything left to optimize?
Think about it: the most frequent task for a 3D engine is to render meshes, so let's optimize mesh rendering. There is nothing more you can do in your code; once you have called glDrawElements you are done. But you can reorder the mesh data so that it is processed faster by the graphic processor.
First, almost every graphic card has a vertex cache; second, the triangles are fetched from the buffer in the order you send them.
So if you have a mesh with the following triangle indexes:
0 1 2
1 3 0
4 2 5
the first vertex processed will be 0, then 1, then 2, and then again 1.
Thanks to the internal cache, if a vertex is still there the graphic card will not run the vertex shader again but will reuse the output stored in the cache.
So maximizing cache hits is a good strategy to speed up the program if your application is vertex bound.
There are different papers about this topic:

Optimization of mesh locality for transparent vertex caching by Hoppe uses two approaches. The first grows a triangle strip in counter-clockwise order and, for each triangle added, checks the cache impact and decides whether to continue the strip or to restart from a more cache-friendly triangle (if you grow a triangle strip indefinitely you always end up with a vertex out of the cache). The second algorithm is more complex and slower, but produces better cache hit rates.

Fast Triangle Reordering for Vertex Locality and Reduced Overdraw by Pedro V. Sander, Diego Nehab and Joshua Barczak uses Hoppe's approach but subdivides the mesh into “islands” with the same normal, then sorts the islands with similar normals front to back. The impact on the cache is similar to Hoppe's algorithm, but the overdraw is reduced as well.

I will use the algorithm described by Tom Forsyth, Linear-Speed Vertex Cache Optimisation, because it is fast, easy to implement and understand, and cache size independent.

You can find the code here:
I cleaned the code and removed the boost dependency. Ok, let's start.
First of all we need to assign a score to each vertex and triangle: a vertex has a higher score if it is already in the cache and if it has few triangles attached. The triangle score is the sum of its vertex scores.

The only input is an array of integers that represents the list of vertex indexes. We will fill two arrays, one with triangle information and another with vertex information.
The triangle structure contains the vertex indexes used, the triangle score and a flag that says whether the triangle has already been visited. The vertex structure contains the vertex score, the list of connected triangles and the current position in the cache.

unsigned int polyCount = indexes_.size() / 3;
unsigned int bestTri = -1;
vector<TriangleCacheData> triangleCache(polyCount);
vector<VertexCacheData> vertexInfo(getVertexNumber());
for(size_t triIndex = 0; triIndex < polyCount; ++triIndex){
   size_t base = triIndex * 3;
   triangleCache[triIndex].indexes_[0] = indexes_[base];
   triangleCache[triIndex].indexes_[1] = indexes_[base+1];
   triangleCache[triIndex].indexes_[2] = indexes_[base+2];
   triangleCache[triIndex].added_ = false;

   vertexInfo[indexes_[base  ]].activeTriangle_.push_back(triIndex);
   vertexInfo[indexes_[base+1]].activeTriangle_.push_back(triIndex);
   vertexInfo[indexes_[base+2]].activeTriangle_.push_back(triIndex);
}

Then we compute each vertex score and reset its cache position; after that we compute the triangle scores and select the triangle with the highest score (which will be the one with the fewest neighbours).

for(vector<VertexCacheData>::iterator vIt = vertexInfo.begin(); vIt != vertexInfo.end(); ++vIt)
   vIt->score_ = findVertexScore(*vIt);
int max_score = -1;
for(unsigned int triIndex = 0; triIndex < polyCount; ++triIndex){
   int score = triangleCache[triIndex].updateScore(vertexInfo);
   if(score > max_score ){
      max_score = score;
      bestTri = triIndex;
   }
}

Then, until we have visited every triangle, we write the selected triangle back to the original array, in the optimized order, and mark it as visited. We remove the triangle from its vertexes (those vertexes will get a higher score later) and then update the cache. I implemented the cache with a simple list; I preferred simplicity over performance. The original implementation used a boost circular buffer; use whatever you want, but remember that there are a lot of erases in the middle and a lot of push backs and push fronts. In my performance tests the push backs didn't take much time.

unsigned int trisLeft = polyCount;
unsigned int triIndex = 0;
std::list<IndexType> vCache;
while(trisLeft-- > 0){
   TriangleCacheData& tri = triangleCache[bestTri];
   indexes_[triIndex  ] = tri.indexes_[0];
   indexes_[triIndex+1] = tri.indexes_[1];
   indexes_[triIndex+2] = tri.indexes_[2];
   triIndex += 3;
   tri.added_ = true;

   /// reduce valence of used vertexes
   for(IndexType const * idx = tri.indexes_; idx != tri.indexes_ + 3; ++idx ){
      VertexCacheData & vert = vertexInfo[*idx];
      vert.activeTriangle_.erase( std::find(vert.activeTriangle_.begin(), vert.activeTriangle_.end(), bestTri ) );
      std::list<IndexType>::iterator inCache = std::find(vCache.begin(), vCache.end(), *idx);
      if(inCache != vCache.end())// remove from the cache, it will be reinserted as most recent
         vCache.erase(inCache);
      vCache.push_back(*idx);
   }
   // evict the oldest entries until the cache fits
   while( vCache.size() > vertexCacheSize){
      vertexInfo[vCache.front()].cachePos_ = -1;
      vCache.pop_front();
   }
   // ...the loop body continues in the next snippets

Then we update the cache position of every vertex, so we can recompute the vertex scores.

   unsigned int cachePos = vCache.size() - 1;
   for(std::list<IndexType>::iterator cIt = vCache.begin(); cIt != vCache.end(); ++cIt){
      vertexInfo[*cIt].cachePos_ = cachePos--;
      vertexInfo[*cIt].score_ = findVertexScore(vertexInfo[*cIt]);
   }

Now for each vertex in the cache, recompute the triangle score.

int max_score = -1;
for(std::list<IndexType>::iterator cIt = vCache.begin(); cIt != vCache.end(); ++cIt){
   VertexCacheData & vert = vertexInfo[*cIt];
   for( std::vector<unsigned int>::iterator itri = vert.activeTriangle_.begin(),
     tris_end = vert.activeTriangle_.end();
     itri != tris_end; ++itri){
      int score = triangleCache[*itri].updateScore(vertexInfo);
      if(score > max_score ){
         max_score = score;
         bestTri = *itri;
      }
   }
}

Up to here everything is linear: we compute vertex scores that favour good cache candidates. But there are cases where the algorithm reaches a dead end, e.g. while traversing a mesh that represents a horse it gets to the end of a foot and can't find any suitable triangle near the cache. In that case the algorithm must restart from another place, so we have to find the next best triangle from scratch, considering only the unvisited triangles. Like that:

if( max_score == -1 ){
   for(unsigned int triIndex = 0; triIndex < polyCount; ++triIndex){
      if(!triangleCache[triIndex].added_ && triangleCache[triIndex].score_ > max_score){
         max_score = triangleCache[triIndex].score_;
         bestTri = triIndex;
      }
   }
}

My performance tests show that this is the most costly part of the code, probably because my meshes cause a lot of restarts, and on every restart I have to check all the triangles.
Note that you don't need to run the algorithm every time you load the mesh: you can precompute the optimized order in your exporter.
Now we only need the score function. The original article uses a lot of float computation and float comparison, which is bad for CPUs that have separate ALUs for float and int, i.e. almost every CPU (the Playstation is really sensitive on this topic), so the code I found uses two integer tables for the score.

int VertexArranger::findVertexScore(const VertexCacheData &vertexData){
   static int cacheScoreTable[vertexCacheSize] = {
      750, 750, 750, 1000, 949, 898, 849, 800, 753, 706, 661, 616, 573, 530, 489, 449, 410, 372, 335, 300, 266, 234, 202, 173, 145, 119, 94, 72, 51, 33, 18, 6 };
   static int valenceBoostScoreTable[16] = {
      2000, 1414, 1155, 1000, 894, 816, 756, 707, 667, 632, 603, 577, 555, 535, 516, 500};
   if(vertexData.activeTriangle_.size() == 0 )
      return -1;
   int score = 0;
   int cachePosition = vertexData.cachePos_;
   if(cachePosition >= 0){
      assert(cachePosition < vertexCacheSize);
      score = cacheScoreTable[cachePosition];
   }
   score += valenceBoostScoreTable[std::min(vertexData.activeTriangle_.size() - 1,
         size_t(15))]; // clamp the valence to the table size
   return score;
}

First note the cacheScoreTable: the first three scores are the same, because we want the same result whether you insert 0 1 2 or 2 0 1. Also, the first three values are lower than the following one, to discourage the immediate reuse of the same edge. valenceBoostScoreTable gives a large bonus to vertexes with few connections; otherwise the algorithm can leave single triangles surrounded by used triangles. These triangles would be selected only at the end and, since they appear at random positions, would have a huge impact on the cache; the boost score greatly alleviates this problem.
The only thing left is the function that computes the triangle score, which is trivial: simply sum the scores of the triangle's vertexes.

inline int VertexArranger::TriangleCacheData::updateScore(const vector<VertexCacheData>& vcd)
{
   return score_ = vcd[indexes_[0]].score_ + vcd[indexes_[1]].score_ + vcd[indexes_[2]].score_;
}

Final note
The algorithm is O(numTriangles * cacheSize). The cache size doesn't have to be the exact size of the hardware cache: with a higher cacheSize you only increase the computation time, and the algorithm will still select the best triangle. The article says that 32 is a good cache size.

Xlib + openGL programming and multithreading

When programming with Xlib or Win32 the usual program looks something like this:


while(!done){
    event = GetEvent()
    processMessage(event)
    applicationLogic()
    renderScene()
}

where processMessage() manages events like key presses or mouse moves, and applicationLogic() updates the application state (e.g. the physics simulation). This pattern has two disadvantages:

  • both processMessage() and applicationLogic() can be slow and freeze the rendering (for example, when I press return I may want to load a picture).
  • every frame polls for pending messages, and nobody likes polling.

When developing a framework the problem is quite common, because you don't know what the user will do with your framework; the only thing you know is that you don't want to freeze the rendering.
To make a long story short, we have to move the renderScene() function into another thread. It looks incredible, but I couldn't find any tutorial about this, so here it is.
I want to create an application that creates n openGL windows, each window with its own thread.
Something like this:

for(i = 0; i < numWindow; ++i){
  win = setupWindow()
  startThread(renderLoop, win) // each window renders in its own thread
}
while(running){
  event = waitForEvent() // this thread sleeps if there aren't events
  processMessage(event)
}

I tried exactly what I wrote above, but I got a lot of multithreading issues. OpenGL is fine with multithreading if you use multiple contexts, and if you use the magic function XInitThreads then Xlib is thread safe too; the problem is in GLX. It looks like it has a bug that freezes the swapbuffer call while another thread is waiting for events: when both try to acquire the display, they freeze. I tried different solutions (using xcb works on some platforms), but the most reliable one is to use multiple X connections.
Ok, enough talk, let's start with the code. I assume you already know some basic Xlib programming, so I won't explain the basic function calls.
First of all we need a structure to contain the shared memory between the main thread and the render thread.

struct RenderData {
  Display *d_;  // Display connection used for this thread
  Atom delMsg_; // Explained later
  GLXDrawable surface_;
  GLXContext  ctx_;
  thread *thread_; // render thread handle
  // Other data
  Color bg_;     // background color
  Size winSize_; // window dimensions
};

Then we need to initialize the main X connection.

int main(){
// We must call this to make X "thread safe"
  XInitThreads();
// opening the connection with X
  Display *display = XOpenDisplay(NULL);
  if(display == NULL) return -1; // Add a meaningful error message here
  int screen = XDefaultScreen(display);
// Let's choose a visual (pretty standard function, link to the full code below)
  GLXFBConfig *fbConfig = chooseFBConfig(display, screen);
  XVisualInfo* visInfo = glXGetVisualFromFBConfig(display, fbConfig[0]);
  Window root = XRootWindow(display, screen);

Now we must create an Atom to get informed about window destruction (when the user presses the X button).

Atom delMsg = XInternAtom(display, "WM_DELETE_WINDOW", True);

Now we can create the windows:

RenderData param[numWindow];
for(int i = 0; i < numWindow; ++i){
// create an X connection for each window
   param[i].d_ = XOpenDisplay(NULL);
   param[i].delMsg_ = delMsg;
   XSetWindowAttributes winAttrib;
// Need to create a color map for each connection
   winAttrib.colormap = XCreateColormap(param[i].d_, root, visInfo->visual, AllocNone);
// Create the window and the openGL context in the standard way
   param[i].surface_ = XCreateWindow(param[i].d_, root, 0, 0, 300, 200, 0, visInfo->depth, InputOutput, visInfo->visual, CWColormap, &winAttrib);
   param[i].ctx_ = glXCreateContext(param[i].d_, visInfo, NULL, True);
// Now the trick: tell X that we want to receive this window's events on the main connection
   XSelectInput(display, param[i].surface_, StructureNotifyMask);
// Register the Atom for this window
   XSetWMProtocols(param[i].d_, param[i].surface_, &delMsg, 1);
// Create a new thread (c++0x way)
   param[i].thread_ = new thread(render, param+i);
// Keep track of the association window <-> data
   wMap.insert(std::make_pair(param[i].surface_, param+i));
}

The thread constructor creates a new thread that executes the render function with param+i as its argument.
The render function simply loops endlessly until someone presses the X button.

void render(RenderData *data){
   XEvent event;
   bool done = false;
// show the window
   XMapWindow(data->d_, data->surface_);
// Each thread must have its own current context
   glXMakeCurrent(data->d_, data->surface_, data->ctx_);
   glClearColor(data->bg_.red_, data->bg_.green_, data->bg_.blue_, 1.f);
   while(!done){
      while(XPending(data->d_)){
         XNextEvent(data->d_, &event);
// Got death message
         if(event.type == ClientMessage &&
            static_cast<Atom>(event.xclient.data.l[0]) == data->delMsg_)
            done = true;
      }
      glViewport(0, 0, data->winSize_.w, data->winSize_.h);
      glClear(GL_COLOR_BUFFER_BIT);
      // do some rendering here
// Show our beautiful rendering
      glXSwapBuffers(data->d_, data->surface_);
   }
// Work finished, destroy the context and the window
   glXDestroyContext(data->d_, data->ctx_);
   XDestroyWindow(data->d_, data->surface_);
// and close the connection
   XCloseDisplay(data->d_);
}
// and close the connection

The render function renders the scene and polls for the close event. This way we are sure that our rendering will never freeze.
The last part of the code is the event handling. After the for loop that creates the windows…

int windowLeft = numWindow;
while(windowLeft > 0){
  RenderData *data;
  XEvent event;
  XNextEvent(display, &event); // blocking function
  switch(event.type){
  case ConfigureNotify: // window changed size or position
    // get the data associated to the window
     data = selectData(wMap, event.xconfigurerequest.window);
     data->winSize_.w = event.xconfigurerequest.width;
     data->winSize_.h = event.xconfigurerequest.height;
     break;
  case DestroyNotify: // the window has been destroyed
     data = selectData(wMap, event.xdestroywindow.window);
// wait for the thread to end
     data->thread_->join();
// delete the thread
     delete data->thread_;
// remove the window from the map
     wMap.erase(event.xdestroywindow.window);
     --windowLeft;
     break;
  }
}

The StructureNotifyMask we requested with XSelectInput gives us these messages: ConfigureNotify, when the window changes size or position; DestroyNotify, when the window is destroyed; and other messages like Map/Unmap that we are not using.
In this code there is a possible data corruption issue, caused by writing the window size in a non-atomic way.

Here is what you were waiting for, the whole source of the example.

#include <X11/X.h>
#include <X11/Xlib.h>
#include <GL/glx.h>
#include <GL/gl.h>
#include <thread>
#include <map>
#include <stdexcept>
#include <cstdio>

using namespace std;
GLXFBConfig *chooseFbConfig(Display *d, int screen);

static const int totalThread = 3;

struct Size{
   Size() {}
   Size(unsigned int x, unsigned int y):w(x), h(y){}
   unsigned int w, h;
};

struct Color{
   Color() {}
   Color(unsigned int c): // unpack a 0xRRGGBB color
      red(((c >> 16) & 0xff) / 255.f),
      green(((c >> 8) & 0xff) / 255.f),
      blue((c & 0xff) / 255.f)
   { }
   Color(float r, float g, float b): red(r), green(g), blue(b) {}
   float red, green, blue;
};

struct RenderData{
   RenderData() : exit(false), thread_(NULL) {}
   Display *d_;
   Atom delMsg;
   GLXContext ctx;
   GLXDrawable win;
   Color bg;
   Size winSize;
   bool exit;
   thread *thread_;
};

typedef std::map<Window, RenderData*> WinDataMap;

RenderData* selectData(WinDataMap& map, const Window& w){
   WinDataMap::iterator it = map.find(w);
   if(it == map.end())
      throw runtime_error("invalid window");
   return (it->second);
}

void render(const RenderData *data)
{
   bool done = false;
   XMapWindow(data->d_, data->win);
   glXMakeCurrent(data->d_, data->win, data->ctx);
   glClearColor(data->bg.red, data->bg.green, data->bg.blue, 1.0);
   while(!done){
      XEvent event;
      while(XPending(data->d_)){
         XNextEvent(data->d_, &event);
         if(event.type == ClientMessage &&
            static_cast<Atom>(event.xclient.data.l[0]) == data->delMsg){
            puts("got dead event.. ._.");
            done = true;
         }
      }
      glViewport(0, 0, data->winSize.w, data->winSize.h);
      glClear(GL_COLOR_BUFFER_BIT);
      glBegin(GL_QUADS);
         glVertex2d(0.0f, 0.0f);
         glVertex2d(1.0f, 0.0f);
         glVertex2d(1.0f, 1.0f);
         glVertex2d(0.0f, 1.0f);
      glEnd();
      if(glGetError() != GL_NO_ERROR)
         puts("gl error");
      glXSwapBuffers(data->d_, data->win);
   }
   glXDestroyContext(data->d_, data->ctx);
   XDestroyWindow(data->d_, data->win);
   XCloseDisplay(data->d_);
}

const unsigned int colors[totalThread] = {
   0xff0000, 0x00ff00, 0x0000ff }; // one background color per window

int main(int argc, char *argv[])
{
   Display *d;
   WinDataMap wMap;

   RenderData param[totalThread];   // window parameters
   XInitThreads(); // make Xlib "thread safe"
   d = XOpenDisplay(NULL);
   if(d == NULL){
      puts("Unable to connect to X");
      return 1;
   }
   int screen = XDefaultScreen(d);

   Atom delMsg = XInternAtom(d, "WM_DELETE_WINDOW", True);
   XSetWindowAttributes windowAttr;
   GLXFBConfig *fbConfigs = chooseFbConfig(d, screen);
   XVisualInfo *visInfo = glXGetVisualFromFBConfig(d, fbConfigs[0]);
   Window root = XRootWindow(d, screen);
   for(int i = 0; i < totalThread; ++i){
      param[i].d_ = XOpenDisplay(NULL);
      param[i].delMsg = delMsg;
      param[i].bg = Color(colors[i]);
      param[i].winSize = Size(300, 200);

      windowAttr.colormap = XCreateColormap(param[i].d_, root, visInfo->visual, AllocNone);
      param[i].win = XCreateWindow(param[i].d_, root, 200,200,
                   300,200, 0, visInfo->depth, InputOutput, visInfo->visual,
                   CWColormap, &windowAttr);
      param[i].ctx = glXCreateContext(param[i].d_, visInfo,  NULL, True);
      XSelectInput(d, param[i].win, StructureNotifyMask);
      XSetWMProtocols(d, param[i].win, &(delMsg), 1);
      printf("win %lu\n", param[i].win);
      wMap.insert(std::make_pair(param[i].win, param+i));
      param[i].thread_ = new thread(render, param+i);
   }
   int windowLeft = totalThread;
   while(windowLeft > 0){
      RenderData *pData;
      XEvent event;
      XNextEvent(d, &event);
      switch(event.type){
   // Structure notify
      case ConfigureNotify:   // size or position change
         pData = selectData(wMap, event.xconfigurerequest.window);
         printf("configure win %lu: pos [%d, %d], size [%d, %d]\n",
                event.xconfigurerequest.window,
                event.xconfigurerequest.x, event.xconfigurerequest.y,
                event.xconfigurerequest.width, event.xconfigurerequest.height);
         // Warning this is not thread safe
         pData->winSize.w = event.xconfigurerequest.width;
         pData->winSize.h = event.xconfigurerequest.height;
         break;
      case CirculateNotify:
         puts("CirculateNotify"); break;
      case DestroyNotify:
         pData = selectData(wMap, event.xdestroywindow.window);
         pData->thread_->join(); // wait for the thread to end
         delete (pData->thread_);
         pData->thread_ = NULL;
         wMap.erase(event.xdestroywindow.window);
         --windowLeft;
         break;
      case GravityNotify:
         puts("GravityNotify"); break;
      case ReparentNotify:
         puts("ReparentNotify"); break;
      case MapNotify:
         printf("Map win %lu\n", event.xmap.window); break;
      case UnmapNotify:
         printf("UnMap win %lu\n", event.xunmap.window); break;
      default: // Unhandled message
         printf("Type %d\n", event.type);
      }
   }
   XCloseDisplay(d);
   return 0;
}

GLXFBConfig* chooseFbConfig(Display *d, int screen) {
   int glAttrib[] = {
      GLX_DOUBLEBUFFER  , True, // needed by glXSwapBuffers
      GLX_RED_SIZE      , 1,
      GLX_GREEN_SIZE    , 1,
      GLX_BLUE_SIZE     , 1,
      GLX_ALPHA_SIZE    , 8,
      GLX_DEPTH_SIZE    , 24,
      None };
   int num;
   GLXFBConfig *fbConfigs = glXChooseFBConfig(d, screen, glAttrib, &num);
   if(fbConfigs == NULL) throw runtime_error("Unable to find a frame buffer config!");

   return fbConfigs;
}

To compile use:
gcc glThread.cpp -std=c++0x -pthread -l stdc++ -l X11 -l GL -o glthread


CppUnit + Hudson = Epic win

Finally I came up with a good test-driven pipeline for my code. It's easy to write a test, I don't have to write a lot of code for the tests, and everything is tested automatically.


CppUnit is a test framework inspired by JUnit. Writing tests in C++ is more difficult than in Java, because C++ doesn't have language support to navigate through classes, but once again a good design can solve a lot of programming issues.

With CppUnit, all you have to write to run your tests is:

int main(int argc, char *argv[])
{
  CppUnit::TextUi::TestRunner runner;
  CppUnit::TestResultCollector collector;
  CppUnit::TestResult result;
  result.addListener(&collector);

  CppUnit::TestFactoryRegistry &registry = CppUnit::TestFactoryRegistry::getRegistry();
  runner.addTest(registry.makeTest());
  runner.run(result);
 // writing the results to an XML file
  std::ofstream xmlFileOut("testresults.xml");
  CppUnit::XmlOutputter xmlOut(&collector, xmlFileOut);
  xmlOut.write();
  return 0;
}

This code creates all the structures needed to collect the test cases, gather the results and write everything to an XML file. The file is then analyzed by Hudson.
To create a test case you derive a class from TestFixture and write a method for each test case. A fixture is a set of tests; you can put all your tests into one fixture, but it's better to split the fixtures into different classes, for compilation speed and clarity's sake. In the test methods you can use some macros to make assertions; if an assertion fails, the test fails.
For example, here is the dot product test for my vector class:

void testDotProduct(){
  CPPUNIT_ASSERT(dot(Vec3f::xAxis, Vec3f::yAxis) == 0.0f);
  CPPUNIT_ASSERT(dot(Vec3f::xAxis, Vec3f::zAxis) == 0.0f);
  CPPUNIT_ASSERT(dot(Vec3f::yAxis, Vec3f::zAxis) == 0.0f);
  CPPUNIT_ASSERT_DOUBLES_EQUAL(dot(op1, op1),
        op1.length()*op1.length(), tollerance);  // dot(v, v) == |v|^2
  CPPUNIT_ASSERT(dot(op1, op2) == dot(op2, op1));  // commutative
  CPPUNIT_ASSERT_DOUBLES_EQUAL(dot(7.0f*op1, 5.0f*op2),
        (7.0f*5.0f)*dot(op1, op2), tollerance);  // compatible with scalar multiplication
  CPPUNIT_ASSERT(dot(op1, op2+op3) == dot(op1, op2) + dot(op1, op3));  // distributive
}

To help the test runner collect (and run) all the tests, the class should declare a static method called suite. It's a very boring and repetitive method where you have to list all the tests; to make it less boring CppUnit provides some macros:

   CPPUNIT_TEST_SUITE(VectorTest);
   CPPUNIT_TEST(testDotProduct);
   CPPUNIT_TEST_EXCEPTION(testDivideByZero, MathError);
   CPPUNIT_TEST_SUITE_END();

In the main you should add all the test suites to the runner: another boring job, and you also have to include all the test files in the main. This creates a lot of dependencies, and you have to recompile the main every time you change a test. To avoid these problems you can register your suite into a test factory using the macro CPPUNIT_TEST_SUITE_REGISTRATION, and get the tests from the factory.
You can find the whole code in the repository. You can find a guide to learn how to use CppUnit here.


Hudson is a wonderful tool that automatically compiles your code, reports errors, and analyzes the CppUnit test report (if you install the appropriate plug-in); that's why I write an XML file in the test case. It's designed to work with Java, but it can run any script to compile (so you can invoke make or any other tool chain you use). It can poll your repository and check if something has changed, but you can also trigger the build process on commit with a simple hook script in your svn repository. On every commit Hudson will check if something changed in the project, compile if necessary, and send you a mail if something is wrong. 🙂
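For the hook-script route, here is a sketch of what the svn side could look like; the Hudson host, port, job name and token are all placeholders you would replace with your own setup.

```shell
#!/bin/sh
# Subversion calls hooks/post-commit with the repository path and revision
REPOS="$1"
REV="$2"
# Ask Hudson to build the job; URL and token are placeholders for your setup
wget -q -O /dev/null \
  "http://localhost:8080/job/myproject/build?token=BUILD_TOKEN" \
  || true  # never fail the commit just because Hudson is down
```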

I still have to find a good (easy and free) code coverage analyzer, any suggestions?

Let's do optimization

A lot of programmers start with the idea of optimizing everything: they want to save the last clock cycle to make their program faster, even if it costs some readability. Writing optimized code makes you feel smart, and sometimes you can say… “WOW!! These are the smartest five lines of code I ever wrote”.


Vector class

Difficulty: medium (a lot of templates)

I want to write an article about quaternions and how beautiful they are for implementing orientation and keyframe interpolation. But before speaking about quaternions it's better to start from the basics and create a class for vectors.

Vector class

When I say vector I mean a mathematical vector, not an array. We will develop a general vector class that can be used to store our geometry data or other multidimensional data.
This implementation is inspired by the nVidia vector class that ships with the nVidia SDK, one of the best vector class implementations I have found. Let's try to make it better.


  1. Speed: the vector class should be very fast; in a 3D program we will use a lot of vector maths, and the speed of this class is crucial for the whole program.
  2. Small size: I also want my class to be compact. I will use the vector class to store vertex attributes, and I don't want any other data in the middle.
  3. Readability: I want to access the y coordinate with the .y notation.

For the first two reasons it's not wise to use virtual functions in our vector class. Virtual functions are (slightly) slower, but most importantly a virtual function inserts a hidden pointer into the class that increases its size considerably.

Let's start with the first part of the class. First of all it won't be a class but a struct, because I want all my data to be public. Yes, I could make the data private and then use getX, getY, setX, setY… but why? That would only make the code less readable. A vector is a very simple type that doesn't have anything to hide.

template<typename T>
struct Vec4Container
{
   static unsigned const len = 4;
   union{
      struct { T x, y, z, w; };
      struct { T r, g, b, a; };
      struct { T s, t, u, v; };
      T data_[len];
   };
};

I let you imagine how Vec3Container and Vec2Container are made.
I used a template so my class is type independent: I can create vectors of float, int and so on. I used a union, so I can access my values with the .x .y .z notation, or if my vector represents a color I can use .r .g .b .a, or .s .t .u .v for texture coordinates. OpenGL uses s, t, r, q, but we already used r for rgb.

Here is the real vector class, the one that performs the computation:

template<class container, class T>
struct Vec : public container
{
  static size_t size() { return container::len; }
  Vec() { }
  explicit Vec(const T *v) { for(size_t i = 0; i < size(); i++) container::data_[i]=v[i]; }
  explicit Vec(const T &v) { for(size_t i = 0; i < size(); i++) container::data_[i]=v; }
  Vec(const Vec &vv) { for(size_t i = 0; i < size(); i++) container::data_[i]=vv.data_[i]; }

This struct derives from container, so it can access the same data. I also added some constructors. The first one is the default constructor: I don't want to initialize the vector unless I explicitly need to. Then there is a constructor from an array and a constructor from a single value. Notice the explicit keyword: I don't want a single number to be silently interpreted as a Vec, which can lead to strange behaviour and confusion. I added a size() method that returns the length of the vector, useful for loops. Note that this struct doesn't add any overhead. The only problem is that, because we are using templates, C++ doesn't know where data_ comes from, and we must specify it every time.

We need one more constructor: it is very useful to write a vector like b(4, 3) or w(1, 3, 5), so we need some other classes.

template<class T>
struct Vec3 : public Vec<Vec3Container<T>, T>
{
   Vec3() {}
   explicit Vec3(const T &v) : Vec<Vec3Container<T>, T>(v){ }
   explicit Vec3(const T *v) : Vec<Vec3Container<T>, T>(v){ }
   Vec3(const T& v0, const T& v1, const T& v2){
      Vec<Vec3Container<T>, T>::x = v0;
      Vec<Vec3Container<T>, T>::y = v1;
      Vec<Vec3Container<T>, T>::z = v2;
   }
};

This class declares only the constructors. Vec2 and Vec4 are very similar.

Now we can define some basic operation for Vec:

   T& operator[] (int i){
      return container::data_[i];
   }
   const T& operator[] (int i) const {
      return container::data_[i];
   }
   Vec& operator +=(const Vec &rop){
      for(size_t i = 0; i < size(); i++)
         container::data_[i] += rop.data_[i];
      return *this;
   }
   Vec& operator -=(const Vec &rop){
      for(size_t i = 0; i < size(); i++)
         container::data_[i] -= rop.data_[i];
      return *this;
   }

There is no problem with sum or subtraction, but what should we do with multiplication?
We can define different types of multiplication for vectors: the dot product, the cross product, pairwise multiplication and scalar multiplication, and every operation deserves the * operator. For clarity's sake I'll use a named method for dot and cross product and reserve the * operator for the pairwise and scalar products.

   Vec& operator *=(const T &rop){
      for(size_t i = 0; i < size(); i++)
         container::data_[i] *= rop;
      return *this;
   }
   Vec& operator *=(const Vec &rop){
      for(size_t i = 0; i < size(); i++)
         container::data_[i] *= rop.data_[i];
      return *this;
   }

The dot product is a binary operation, so it's better to make it external to the class; the same goes for the cross product.

template<class container, typename T>
T dot(const Vec<container, T>& rOp, const Vec<container, T>& lOp)
{
  T ret(0);
  for(size_t i = 0; i < container::len; i++)
     ret += rOp[i] * lOp[i];
  return ret;
}

We can specialize the cross function for Vec3 to make it more efficient:

template<typename T>
Vec3<T> cross(const Vec3<T>& rOp, const Vec3<T>& lOp)
{
    Vec3<T> res;
    res.x = rOp.y * lOp.z - rOp.z * lOp.y;
    res.y = rOp.z * lOp.x - rOp.x * lOp.z;
    res.z = rOp.x * lOp.y - rOp.y * lOp.x;
    return res;
}

Other functions like normalize, length and the binary operators are trivial; you can implement them as an exercise. 😛
Final touch:

typedef Vec3<float> Vec3f;
typedef Vec3<int> Vec3i;
typedef Vec3<unsigned int> Vec3ui;
typedef Vec2<float> Vec2f;
typedef Vec2<int> Vec2i;

Edit: I noticed that a lot of compilers don't like templates of templates, so I modified the code a little bit.

Glus for linux

Recently Norbert Nopper released GLUS, a framework similar to (free)Glut. The first version was Windows only, but recently Pablo Alonso-Villaverde Roza wrote a version that works with X11.

I tested it this evening on Ubuntu 9.10 with a G8800 and it works perfectly.

How to build the code

GLUS uses Glew to load all the openGL extensions, and the Glus package that you download from the site already includes Glew. But if you already have Glew installed it can generate some conflicts. The Glew library that comes with Ubuntu is not the most recent and supports openGL only up to 3.0. So you can either remove Glew and use the one that comes with Glus, or download the latest Glew source and build it yourself (it takes 5 min.):

svn co glew
cd glew
#Probably you will need these packages
sudo apt-get install libxmu-dev xorg-dev
make extensions
sudo make install

Now you can go into your Glus directory and type:

make all

Now you can run the examples, make sure you have the latest driver installed.

About Glus

I'm not a big fan of Glut: it imposes a programming style that leads to tons of global variables (I'll write an article about that), but it is very simple, and for a quick test it is very good. Glus uses the same programming style, but removes all the deprecated stuff and adds some utility functions to:

  • create an openGL 3.2 context
  • read and create shaders from files
  • operate on matrices and vectors
  • load TGA textures
  • create some basic shapes (cubes, spheres, planes)


Pros:

  • Very easy to learn: if you have already programmed with Glut you don't have to learn anything new.
  • First framework that supports openGL 3.2 natively.


Cons:

  • Very young, no documentation yet.
  • It's not object oriented.
  • No support for multi-threading.
  • No license… MIT? ZLIB? Need to ask.

I still don't know why people in 2010 don't program with objects, but I guess it's a matter of taste. The lack of documentation is made up for by some examples. I'm a bit worried about the license; I hope it will adopt LGPL3.


IMHO Glus is very good for creating basic examples: with the utility functions you don't have to bother with texture loading or shader creation, which makes prototyping faster. I don't know if it's good for more complex programs; in that case you usually write your own code for matrix, image and shader manipulation, and the lack of multi-threading support really limits the possibilities.