Rosario 3D

My developer blog

Triangle render order optimization

If our render queue is fully optimized and do the minimum state changes on the driver we still can do some optimization?
Think about it, the most frequent task for a 3d engine is to render meshes. So let’s optimize meshes rendering. There is nothing you can do on your code, once you called glDrawElement you are done. But you can optimize the mesh in a way that will be processed faster by our graphic processor.
First, almost every graphic card have a vertex cache, second the triangles are fetched from the buffer in the order you send them.
So if you have a mesh with the following triangle indexes:
0 1 2
1 3 0
4 2 5
the first vertex processed will be the 0, then the 1, then 2 and then again the 1.
Cause the internal cache if the vertex is still there the graphic card will not call the vertex shader again but will reuse the output stored in the cache.
So trying to maximize the cache hit is a good strategy to speed up the program if your application is vertex bounded.
There are different paper about this topic:

Optimization of mesh locality for transparent vertex caching by Hoppe use different aproach the first grow a triangle strip in counter-clockwise order, and for each triangle added check the cache impact and decide if to continue with the strip or to restart with a more cache friendly triangle. If you grow the triangle strip to infinite you always have a vertex out of the cache. The second algorithm more complex and more slow, but produce best cache hit.

Fast Triangle Reordering for Vertex Locality and Reduce Overdraw by Pedro V. Sander, Diego Nehab, Joshua Barczak. Use the Hoppe aproach but subdivide the polygon in “island” with the same normal and then sort the island with the similar normal front to back. In this way the impact on the cache is similar to the Hoppe algorithm but also the overdraw is reduced.

I will use the algorithm described by Tom Forsyth, Linear-Speed Vertex Cache Optimisation, because is fast, is easy to implement and to understand and it’s cache size independent.

You can find the code here: http://www.rsdn.ru/forum/src/3076506.1.aspx
I cleaned the code and removed the boost dependency. Ok, let’s start.
First of all we need to assign the vertexes and the triangle a score, each vertex have higher score if is already in the cache and if have few triangle attached. The triangle score is determinated by the sum of the vertexes score.

The only input is an array of integer that represent list of vertex indexes. We will fill two arrays, one with triangles and another for vertexes information.
The triangle structure will contain the vertex index used, the triangle score and a flag that say if the triangle is already been visited. The vertex structure will contain the vertex score, the list of triangle connected and the current position in the cache.

unsigned int polyCount = indexes_.size() / 3;
unsigned int bestTri = -1;
vector<TriangleCacheData> triangleCache(polyCount);
vector<VertexCacheData> vertexInfo(getVertexNumber());
for(size_t triIndex = 0; triIndex < polyCount; ++triIndex){
   size_t base = triIndex * 3;
   triangleCache[triIndex].indexes_[0] = unpackedIndexes_[base];
   triangleCache[triIndex].indexes_[1] = unpackedIndexes_[base+1];
   triangleCache[triIndex].indexes_[2] = unpackedIndexes_[base+2];
   triangleCache[triIndex].added_ = false;

   vertexInfo[indexes_._[base  ]].activeTriangle_.push_back(triIndex);
   vertexInfo[indexes_._[base+1]].activeTriangle_.push_back(triIndex);
   vertexInfo[indexes_._[base+2]].activeTriangle_.push_back(triIndex);
}

Then we compute the vertex score and reset the cache position, then we compute the triangle score and select the triangle with the higher score (that will be the one with less neighborhood).

for(vector<VertexCacheData>::iterator vIt = vertexInfo.begin(); vIt != vertexInfo.end(); ++vIt)
      vIt->score_ = findVertexScore(*vIt);
short max_score = -1;
for(unsigned int triIndex = 0; triIndex < polyCount; ++triIndex){
   int score = triangleCache[triIndex].updateScore(vertexInfo);
   if(score > max_score ){
      max_score = score;
      bestTri = triIndex;
   }
}

Then until we had visited all triangles we put the selected triangle back to the original array, but in the optimized order and we mark the triangle as visited. Then we removed the triangle from the vertexes (these vertexes will have an higher score in the future) and then update the cache. I implemented the cache with a simple list, I preferred simplicity against performance, the original implementation used a boost circular buffer, use whatever you want, remember that there are a lot of erase in the middle and a lot of push back and push front. Doing performance test the push back didn’t take a lot of time.

unsigned int trisLeft = polyCount;
unsigned int triIndex = 0;
std::list<IndexType> vCache;

while(trisLeft--){
   TriangleCacheData& tri = triangleCache[bestTri];
   indexes_[triIndex  ] = tri.indexes_[0];
   indexes_[triIndex+1] = tri.indexes_[1];
   indexes_[triIndex+2] = tri.indexes_[2];
   triIndex += 3;
   tri.added_ = true;

   /// reduce valence of used vertexes
   for(IndexType const * idx = tri.indexes_; idx != tri.indexes_ + 3; ++idx ){
      VertexCacheData & vert = vertexInfo[*idx];
      vert.activeTriangle_.erase( std::find(vert.activeTriangle_.begin(), vert.activeTriangle_.end(), bestTri ) );
      std::list<IndexType>::iterator inCache = std::find(vCache.begin(), vCache.end(), *idx);
      if(inCache != vCache.end())// remove from cache, reput them at the beginning
         vCache.erase(inCache);
   }
   vCache.push_back(tri.indexes_[0]);
   vCache.push_back(tri.indexes_[1]);
   vCache.push_back(tri.indexes_[2]);
   while( vCache.size() > vertexCacheSize)
      vCache.pop_front();

Then we need to update the vertex position to update the vertex score.

   unsigned int cachePos = vCache.size() - 1;
   for(std::list<IndexType>::iterator cIt = vCache.begin(); cIt != vCache.end(); ++cIt){
      vertexInfo[*cIt].cachePos_ = cachePos--;
      vertexInfo[*cIt].score_ = findVertexScore(vertexInfo[*cIt]);
   }

Now for each vertex in the cache, recompute the triangle score.

int  max_score = -1;
for(std::list<IndexType>::iterator cIt = vCache.begin(); cIt != vCache.end(); ++cIt){
   VertexCacheData & vert = vertexInfo[*cIt];
   for( std::vector<unsigned int>::iterator itri = vert.activeTriangle_.begin(),
     tris_end = vert.activeTriangle_.end();
     itri != tris_end; ++itri){
      assert(!triangleCache[*itri].added_);
      int score = triangleCache[*itri].updateScore(vertexInfo);
      if(score > max_score ){
         max_score = score;
         bestTri = *itri;
      }
   }
}

Now this is linear, we compute some vertex score based if the vertex is a good candidate for cache, but there are cases where the algorithm reach a dead end i.e. it’s traversing the mesh that represent an horse and at the end of the food it can’t find any suitable triangle for the cache. In this case the algorithm must restart from another place, so we need to compute the another best triangle from scratch considering only the non visited triangle. Like that

if( max_score == -1 ){
   for(unsigned int triIndex = 0; triIndex < polyCount; ++triIndex){
      if(!triangleCache[triIndex].added_ && triangleCache[triIndex].score_ > max_score){
         max_score = triangleCache[triIndex].score_;
         bestTri = triIndex;
      }
   }
}

My performance test show that this is the most costly part of the code, probably cause I have a lot of restart in my mesh and every time I have to check all the triangles.
Note that you don’t need to run the algorithm every time you load the mesh, but you can precompute the optimized order in your exporter.
Now you only need the score function, in the original article there are a lot of float computation and float comparison, this is bad for the CPU that have different ALU for float and int, like almost every CPU (playstation is really sensible in this topic), the code I found use to table for the score.

int VertexArranger::findVertexScore(const VertexCacheData &vertexData){
   static int cacheScoreTable[vertexCacheSize] = {
      750, 750, 750, 1000, 949, 898, 849, 800, 753, 706, 661, 616, 573, 530, 489, 449, 410, 372, 335, 300, 266, 234, 202, 173, 145, 119, 94, 72, 51, 33, 18, 6 };
   static int valenceBoostScoreTable[16] = {
      2000, 1414, 1155, 1000, 894, 816, 756, 707, 667, 632, 603, 577, 555, 535, 516, 500};
   if(vertexData.activeTriangle_.size() == 0 )
      return -1;
   int score = 0.0f;
   int cachePosition = vertexData.cachePos_;
   if(cachePosition >= 0){
      assert(cachePosition < vertexCacheSize);
      score = cacheScoreTable[cachePosition];
   }
   score += valenceBoostScoreTable[std::min(vertexData.activeTriangle_.size() -1,
      sizeof(valenceBoostScoreTable)/sizeof(valenceBoostScoreTable[0])-1)];
   return score;
}

First note the cacheScoreTable, the first three score are the same, the number is the same cause we want the same result if you insert 0 1 2 or 2 0 1. Also the first three value are lower then the other one to discourage the immediate reuse of the same edge. valenceBoostScoreTable is used to give a large bonus to vertex with few connection, otherwise the algorithm can leave single triangle surrounded by used triangle. These triangles will be selected only at the end, cause they can appear random will have an huge impact on the cache, the boost score greatly alleviate this problem.
The only thing left is the function that compute the triangle score, that it’s trivial, simply sum the triangle vertex score.

inline int VertexArranger::TriangleCacheData::updateScore(const vector<VertexCacheData>& vcd)
{
   return score_ = vcd[indexes_[0]].score_ + vcd[indexes_[1]].score_ + vcd[indexes_[2]].score_;
}

Final note
The algorithm is O(numTriangle * cacheSize) the cache size don’t have to be the exact size of the cache, with higher cacheSize you will only increase the computation time but the algorithm will always select the best triangle. The article say that 32 is a good cache size.

Leave a comment