Archive for the ‘Programming’ Category

Algebraic Mesher in Python

Monday, July 14th, 2008

This year I have tried to make it a habit to track how much time I spend on numerical programming. I’d like to put in 10 hours or so a week and this lets me know if I am doing that. Anyway, I went back to my records from February when I was developing the Algebraic Mesher in C. It took me 20.9 hours to complete, with about half of that time spent in a frustrating effort to track down numerous memory problems. The promise of scripting languages is they allow you to develop programs much more efficiently, at the price of slower run time performance. So I decided to create the mesher in Python and see how long it took me.

Method

Four months had elapsed since I finished the C version and I couldn’t recall how I had programmed it. All I remembered is it was a miserable experience. Thus it seemed that I could create a Python version from scratch without undue influence from the previous effort. To get started, I reviewed the equations linking the ID numbers for vertices, faces and cells from earlier, but nothing more. I didn’t look at the C code or my follow up blog entry on the experience.

So I opened up the IDLE editor and looked at a perfectly blank page. I wasn’t sure where to start. Then I thought: I need a cell object. So I made a very simple one:

class cellclass: def __init__(self,CID): self.CID = CID

All it does is store the ID number of the cell. I made a list of cells and stored some numbers. So far so good. I then made a vertex object, and created a list of vertices:

verts=[] for i in range(NumRows+1): for j in range(NumCols+1): VID = (NumCols+1)*i + j x = j*delx; y = i*dely temvert = vertclass(VID,x,y) verts.append(temvert)

In this way I added features incrementally, in baby steps, without really any planning. Proceeding along in this fashion I had the mesh generator finished in 1.5 hours. I couldn’t believe how fast it went. The code wasn’t especially pretty – nearly everything is lumped in one routine – but it seemed to run fine.

Testing

I wanted to make sure I tested the code as thoroughly as I had tested the C version, so I opened up the C code at this point, reviewed the tests there, and replicated them, one by one in Python. With the C version, every test revealed new bugs which had to be laboriously tracked down and fixed. With the Python version, the code passed the first four tests the first time. The fifth test found a bug where I was writing multiple copies of vertices to the face objects. I fixed that in a couple of minutes.

In a little more than 2 hours I had a working mesher. And it was fun to write. The code just seemed to flow out of my finger tips (and I am not especially skilled at Python). Yes, this code probably should be rewritten into a more modular style. And perhaps it could be made more ‘Pythonic’. So maybe another hour or two are needed to really polish it up. But there is something extremely satisfying about getting a rough version working with so little effort.

One major disappointment with the C version of the program was the memory problems. I had spent some time developing a generic list structure in C, so I could make my programming in that language less painful. The list and FOR_EACH macro were supposed to make C more like Python. The thought was good, but the experience was not.

Stats

The table shows the time and lines of code needed for each version of the code. From Prechelt’s study, we expect Python to be 3.8x faster and 3.6x shorter than C/C++. Roughly the speed up in development time mirrors that in lines of code. In my case, the speed up was nearly 9x, while the needed lines of code was about 2x less. This may be due to the inordinate amount debugging time the C code required. In any case the general trend holds: Python is a lot faster to develop in.

. Time --Lines of code -- Lang (hrs) Mesher Test Code Total ------------------------------------------------------------------------------------- C 20.9 299 163 462 Python 2.4 124 89 213 ------------------------------------------------------------------------------------- Ratio 8.7x 2.4x 1.8 2.2x

Run Times

The big downside of fast development time is slow execution time. Python does all kinds of nice things for you (like memory management), but that comes at a price. To measure this, I had both the C and Python meshers create a million cell mesh and timed them. Here are the results:

. Time RAM Lang (sec) (GB) ------------------------------------------------------------------------------------ C 6.9 0.366 Python 344 1.33 ------------------------------------------------------------------------------------ Ratio 50x 3.6x

This test was run on an Intel Core2 Duo (T7300) at 2 GHz, with 2GB of RAM. I think much of the time, for both programs, went into memory allocation. This was corroborated when I tried compiling the C version with various optimizations and got no change in the timing. There may some ways to allocate memory in bigger chunks, and thus speed up the code.

There are tools for speeding up the execution of Python programs. Tools like Psyco and Pyrex can offer dramatic improvements, at least under some conditions. That might be a good experiment for next time.

File: PyAlgMesher.py

Programmer Efficiency

Monday, April 14th, 2008

Programming is a slow process. You write a few lines, test them, find some errors and write a few more lines. The C programming language seems especially slow to me. Being statically typed, you must declare every variable. This takes time. Being a low level language, each statement typically doesn’t do a whole lot. And since its compiled, you periodically must leave the editor and compile and test it. But what really eats the time is the very ugly mistakes you can easily make in C. If you forget to allocate some memory, there is no pleasant reminder to set you back on track. Instead you get a cryptic message and mysterious crash. Thus a lot of time is spent tracking such issues.

Another thing that eats a lot of time is the whole pass by value convention of C. Do I pass var, *var, **var, &var, etc? When the data structures are complex and interconnected it can be darn confusing. You pretty much have to draw it all out on a piece of paper.

When caught up in the frustrations of C programming I often think: there must be a better programming language. Recently I did an internet search to see if anyone has studied programmer efficiency as a function of programming language. While I didn’t find much, there is a very interesting study by Lutz Prechelt at Karlsruhe University: Are Scripting Languages Any Good? A Validation of Perl, Python, Rexx, and Tcl against C, C++, and Java

Prechelt’s Study

In this study Prof. Prechelt gave programmers a specific programming task. They had to convert phone numbers into words from a German dictionary using a particular encoding scheme. The programmers had different levels of experience and came from different backgrounds, but Prechelt was able to extract meaningful results from this experiment.

Prechelt looked at many different things, like memory consumption, run time, and reliability of the programs. In those measures, the scripting languages did surprisingly well. Of course a well written C program will be faster than a well written Python program, but in this study there was a huge variability in the results and many of the Python programs were as fast as many of the C/C++ or Java programs. However, our interest is in programmer efficiency. Look at this graph:

Line of Code

It shows that the scripting languages (TCL, Rexx, Python and Perl) required an average of 98 lines of code to complete the program (not including comments). The non-scripting languages (C, C++, Java) needed around 277 lines. That is a huge difference. If you look at the shortest programs, the results are even more dramatic. The shortest C/C++ program was 150 lines, while the shortest Python script was 42 lines. That is 3.6 times more code for the C/C++ program.

There is a rule of thumb in programming that it takes about the same amount of time to program a line of code regardless of language. If true, this suggests that the script writers completed their task more quickly than the compiled language programmers. That was indeed the case. Here is the plot from Prechelt’s paper:

Programming Time

Sure enough, the same trend is seen. The average scripting time was 3.8 hours while the non-scripting languages are at 14.3 hours. The minimum reported times were 1.5 hours for Python and 10.5 hours for C/C++. We probably shouldn’t put much weight on the minimum values, since the times were self reported. But the averages are worth comparing. The scripting programs were developed 3.8x faster. That is amazing when you think about it. A task that might take a week of slogging in C could be done in a little over a day with Python.

Amazing, but True?

So how believable are these results? If we trust Prof Prechelt’s methodology, they are believable for the phone-code problem, but do they apply to numerical programming? There is a hint in the paper:

The shorter program length of the script programs can
be explained by the fact that most of the actual search is
done simply by the hashing algorithm used internally by
the associative arrays. In contrast, the non-script programs
require most of the elementary steps of the search process
to be coded explicitly by the programmer. This is further
pronounced by the effort (or lack of it) for data structure
and variable declarations.

This suggests that the scripting languages were more efficient because they had high level tools, like associative arrays, that were well suited to the phone-code problem. One could argue that had suitable tools existed in the non-scripting languages, they would have enjoyed similar efficiencies.

There is a slight problem with this argument. Namely some of the non-scripting languages do have such tools, but the programmers didn’t use them:

It is an interesting observation that despite the existence
of hash table implementations in both the Java and
the C++ class libraries none of the non-script programmers
used them (but rather implemented a tree solution by hand),
whereas for the script programmers the hash tables built
into the language were the obvious choice.

The implication of this is that it is not just the presence of high level tools that gives scripting languages their efficiency; it is also the ease with which you can use those tools. Python is a good example of this. After you import a module, you can access that module’s documentation from the command line:

>>>import somemodule >>>help(somemodule)

Not only is this very handy, it also encourages one to explore what is available in modules. In addition, the language is easy to read, so you can peruse the source code for additional insights. This greatly lowers the threshold for reusing other people’s code, and that is a great productivity booster.

What about Speed?

Scripting languages are slow, typically 10x to 100x slower than compiled languages, in terms of execution speed. Is it really helpful to create a numerical program in record time if it takes two months to run? Scripting advocates point to the Pareto principle: 20% of the routines take 80% of the execution time. Thus it makes sense to code everything in the scripting language, see where the bottlenecks are, and rewrite those in C, C++ or Fortran. Sounds good. I would like to try some experiments to see for myself.

Algebraic Mesher

Thursday, March 13th, 2008

In looking for an alternative to the half edge meshing scheme, I hit upon the idea of using algebraic relations between cells, faces and vertices in order to build a mesh. Those relationships were described in the previous post. Here I want to describe how I implemented this in C.

Data Structures

Like my Half Edge mesher, the MESH_TYPE data structure consists of lists of all the vertices, faces and cells in the mesh.

struct mesh_type { //contains all mesh info ALIST cells; //cell list ALIST faces; //face list ALIST verts; //vert list };

ALIST is the generic array I had defined earlier. It consists of a small data structure that includes a pointer to an array on the heap containing the actual data. In this way, the array can be resized and bounds checking can be done. The figure shows the mesh data structure.

Mesh Data Structure

Up to this point, the structures are the same as in the Half Mesh program. Where things change is in the data structures for the cells, faces and vertices. For instance a face structure looks like

struct face_type { ALIST verts; //list of ptrs to face verts ALIST cells; //list of ptrs to faces 1 or 2 cells };

Each face in the list of faces contains lists of pointers to elements in these three master lists. Trying to draw all the relationships can get confusing. This pictures gives an idea for the faces of a cell

Cell Data Structure

Each cell in the master list contains a list of four face pointers. These pointers point to the four faces in the master list where this cell’s faces reside. Imagine how confusing that picture would be if I were to try and draw all the connections between cells, vertices and faces?

Accessing the Data

Suppose you want to average the x and y coordinates all the vertices of a cell, to make sure it matches the cell center. (This is one of the tests I used to check the meshing code.) This can be done easily using the new improved FOR_EACH macro from the generic list include file:

double totX, totY,dx,dy; VERT_TYPE **ppVert; FOR_EACH(pCell, pMesh->cells) { totX = 0; totY = 0; FOR_EACH(ppVert, pCell->verts) { pVert = *ppVert; totX += pVert->x[0]; totY += pVert->x[1]; } //Find diff betw cell center and vert ave dx = pCell->x[0] - totX/4; dy = pCell->x[1] - totY/4;

The outer loop is over the cells in the master list of cells for the mesh. The way the FOR_EACH works, is it puts a pointer to the current cell into pCell on each pass through the loop. This allows us to dereference the fields of the cell. It is the inner loop, over the cell vertices, where confusion can set in. The vertex list in a cell structure holds pointers to the actual vertices residing in the master vertex list. Since FOR_EACH returns a pointer to whatever is in the list, it returns a pointer to a pointer. That is why I defined the variable ppVert. In order to gain access to the x and y position of the vertex, we dereference ppVert, to get a pointer to the vertex. From their we can get the position. These dereferencing subtleties were not at all clear to me as I wrote this. In addition, I hadn’t always allocated memory for my lists. The result was the memory debugging nightmare I mentioned previously.

Assessment

The entire motivation for this project was to generate a meshing program that had some of the elegance of the half edge mesh, while being better suited for CFD applications. Was it successful? It is hard for me to say at this point. In terms of code length, the two implementations are nearly identical. Thus one is not a lot clumsier than the other. One thing I find troubling about the algebraic meshing approach is you can’t look at the code and tell if its right. When you see a line like

pVert = getItem(verts,CID+i+1); //SE vert

how do you know that pVert is really the south east vertex of cell CID? You need to sketch out the grid and work out the relationships. Also, I am still reeling a bit from all the memory problems I had. The half edge mesh routine came together a lot more easily.

All in all, I would say the jury is still out on this one. But now that I have created it, I hope to use it to do some work with Immersed Boundaries.

Download

If you want to play with the code to generate meshes, you can download it here. Put the files ArrayType.inc and AlgMesher.c in the same folder and compile and run:

gcc -Wall AlgMesher.c ./a.exe

When main runs, it will call a test function that tests five different aspects of a simple 100×100 mesh. It writes out the results to a text file. In this way, I tried to make the code relatively bug free. Enjoy.

Meshing by the Numbers

Saturday, March 8th, 2008

We saw earlier that the half edge data structure is a very elegant way to create a mesh. However, as I pointed out recently, this approach does have a couple of drawbacks for general CFD applications. Are there other elegant approaches worth considering? One interesting possibility is to use the algebraic relationships between vertices, faces and cells to generate the connections between them. If the mesh is rectangular, and you number the components systematically, you can work out these relationships.

Cell and Vertex Numbering

Cells & Vertices

Lets say we number our cells from left to right, starting with the bottom row, as in the figure. Any given cell can be expressed in terms of its column number and row number:

CID = N_cols*i + j

The Cell ID (CID) is found by multiplying the number of columns in the grid (N_cols) by the current row number i, and adding the current column number j. We are following the C convention of starting our numbering from zero. We can number the vertices in the same way, starting in the lower left corner. Without too much trouble, we can see that if you know the ID of the cell, you can work out the ID of the south west vertex of that cell:

VID_SW = CID + i

where VID_SW is the ID number of the southwest vertex. The equation works out this way because any row of N cells has N+1 vertices.

The ID for the southeast vertex is just one more bigger than the southwest:

VID_SE = CID + i + 1

To get the index for the northern vertices, we need to add a row of vertices to the index of the corresponding southern vertex:

VID_NW = CID + i + (N_Cols + 1) VID_NE = CID + i + 1 + (N_Cols + 1)

Face Numbering

Faces

The numbering scheme for the faces is a bit trickier. You can just name them sequentially starting from the south west corner, but I think its a bit easier if you create two lists: a list of horizontal faces (i.e., the north and south faces) and a list of vertical faces (i.e., the east and west faces). The figure shows the idea, with the horizontal faces numbered in black and the verticals in blue. With this approach, the southern faces have the same numbering as the cells, thus

FID_S = CID

The northern face ID numbers can be found by taking the southern face ID and adding the number of columns in the grid:

FID_N = CID + N_cols

For the east and west faces, each row has one more face than there are cells. This is the same as the vertex numbering, thus

FID_W = CID + i FID_E = CID + i + 1

In practice, we don’t really want to have two lists of faces. It is more convenient if we can loop through a single list. This can be achieved by appending the east-west list onto the north-south list. Once that is done, the indices for the east and west faces will need to be increased by the number of horizontal faces:

FID_W = N_HFaces + CID + i FID_E = N_HFaces + CID + i + 1

where

N_HFaces = N_Cells + N_cols

.

Implementation

I have written a C program to implement these ideas, but it was a torture. I made heavy use of my generic list structure and in the process hit a bunch snags. C is notorious for giving you the power to shoot yourself in the foot, and I dare say that I don’t have many toes left. I made a ton of memory mistakes and it took me more than 10 hours of debugging to find and correct them all. In the process, I made a number of improvements to the generic list routines (including writing a bunch of documentation), so hopefully this won’t happen again. I’ll discuss the implementation more in the next post.

Imprinting a Polygon

Saturday, February 9th, 2008

Lets say we have a rectangular mesh, perhaps created with our half mesh tool, and we want to imprint upon it some polygon shape. For instance, we may want to model the temperature or fluid flow in such a polygon and need a mesh in that shape. By imprinting the polygon onto the mesh, we can create a simple stair step mesh that can be used for modeling.

Stair Step Mesh

What is the criteria for determining if a cell is inside the polygon or not? Clearly any cell that is fully inside the polygon would be considered “in”. What about one where one of the vertices is outside? Or two are outside? For the figure above, the imprinting follows a simple rule: if the cell center is in the polygon, then the cell is inside.

Shooting Rays

To create our stair step mesh, we need to cycle through the cells in the rectangular mesh and check each cell center to see if its inside. The classic way to determine if a point lies inside a polygon is to shoot a ray from this point and see how many times it crosses the polygon. If its an even number, the point is outside. An odd number of crossings means its inside. The Wikipedia has a nice article on this here.

Even-Odd Rule

Hitting a vertex – A rare Event?

There is a subtlety with this scheme. Your ray may strike one of the vertices of the polygon. In that case, you will likely get double counting. It will appear that your ray has crossed both polygon segments N and N+1. As a result, you may conclude exactly the wrong result. My initial reaction to this situation was: it could happen, but will be extremely rare. Wrong! Both meshes and polygons are created by humans. And humans tend to like round numbers. It is much more likely you will have a vertex point at (1,1) than at (1.0001, 0.9999). Similarly for the cell positions. In my own testing, I hit this problem right away.

The solution is found in how we think about a segment. Rather than thinking of a segment as two points and the line that connects them, we can think of it as just one point and the line. The figure shows the idea.

Vertex Problem

This can be implemented with a judicious use of the > and >= operators.

Implementations

A quick search of the internet reveals many implementations of this odd-even crossing approach. Bob Stein’s article compares a few implementations, looking at lines of code, number of if statements and so on. The routines he looks at range from 44 to 91 lines of code and include from 2 to 19 if statements, and 0 to 2 divides. When I first implemented this, some years ago, my code ran 70 lines and included 7 if statements and 3 divides. In my defense, my code also checked how close a point was to a segment and included the point if it was within some tolerance. Still, it seems a bit bulky compared to some of the implementations Stein looked at. He was justifiably proud of his implementation, which is only 44 long with no divides.

A bit more looking on the web reveals the incredibly tiny 12 line routine of W. Randolph Franklin. It contains only one if and one divide. I tried it out for a couple of cases and it does seem to capture the essense of the method without a single wasted character. Here is the Franklin’s routine:

int pnpoly(int npol, float *xp, float *yp, float x, float y) { int i, j, c = 0; for (i = 0, j = npol-1; i < npol; j = i++) { if ((((yp[i]<=y) && (y<yp[j])) || ((yp[j]<=y) && (y<yp[i]))) && (x < (xp[j] - xp[i]) * (y - yp[i]) / (yp[j] - yp[i]) + xp[i])) c = !c; } return c; }

(Note this code is Copyright © 1970-2003, Wm. Randolph Franklin )

Stair Step Mesh

With a routine like this, we can easily create a stairstep mesh. You can envision a program something like this:

int main() { MESH_TYPE *pMesh; CELL_TYPE *pCell; int i; //Make a triangle float xp[10] = {0., 1.0, 1.0}; float yp[10] = {0., 0., 1.0}; //Create mesh pMesh = createMesh(0., 0., 1.0, 1.0, 0.1, 0.1); //Determine which cells in mesh are 'live' FOR_EACH(i, pCell, *(pMesh->cells)) { pCell->live = pnpoly(3, xp, yp, pCell->x, pCell->y,) } return 0; }

For this to work, we need to add a few more fields to the cell structure defined previously. Specifically, we need the cell center coordinates and a flag stating whether the cell is live or dead.

Some additional work would be required to actually use this stair step mesh in a CFD code. For instance, it is not efficient to keep dead cells in our cell list. Every time we loop through the cells, we will need an if statement:

if pCell->live …

. Also, we have not identified the boundary faces or made any effort to transfer the boundary conditions from the polygon segments to the edge faces of the stair step mesh. These are tasks for another day.

Half Edge Mesh Implementation

Saturday, February 2nd, 2008

In a previous post, I discussed the elegance of the Half Edge data structure for storing mesh information. This time, we will implement such a scheme in C. The goal is simple: create a program that meshes a 2d rectangle with a structured grid, while storing the information in the Half Edge data structure.

If we know the length and width of the rectangle, and the cell size, we can compute the number of rows and columns in our mesh. We can then loop over these rows and columns, creating cells, faces and vertices and linking them together. As we create these mesh elements, we will store them in lists, using the generalized arrays we developed earlier.

We will build the mesh row by row. The first step is to create the south most row of vertices. As we construct a row of cells, we will be building them on top of the vertices of the previous row. So for the very first row cells, we must first create a row of verts. Once this is done, we enter a pair of nested for loops where we loop over the rows and columns of the mesh. Here is the basic idea

for (i=0; i<numRows; i++) { Prepare the row for (j=0; j<numCols; j++) { Set corner vertices of cell Create new cell Create faces of cell Make remaining connections Hook up the pairs of WF and NF faces Save a few items to help with next cell or next row } wrap up this row }

Once the first row of verts is created, we begin creating cells. For the first cell of a row, we must create a north west vertex. For all other cells on that row, the north east vertex of the previous cell becomes the north west vertex for the new cell. We then need only create one vertex per cell, specially the north east one. This is shown in Step 2 of the figure. With the four vertices of the cell defined, we can then create the four faces of the cell (step 3), then create the cell and link everything together. Its then on to the next cell. In this step by step fashion we build up the mesh.

MeshSteps

A subtlety

There is a subtlety I didn’t mention in the previous post. There are 8 half faces that touch a vertex. How do you know which one to store in the vertex? For an interior vertex it doesn’t really matter, as long as its consistent with the scheme for cycling through the faces around the vertex. Specifically, if you store a face pointing toward the vertex, then you cycle through the faces with the command

face = face->pair->nextF;

If you pick a face that points toward the vertex, then you use

face = face->nextF->pair;

For vertices at the edge of the domain, greater care is needed. Not every face has a pair near the edge, so a command like

face = face->pair->nextF;

can lead to errors. Also, we need to pick the edge that allows us to go fully around the vertex. In the picture below, this can only be satisfied for vertex V by picking face F.

Faces around a vertex

In general, we need to pick the most counter clockwise edge, relative to the vertex. (That is because our circular linked list of faces is organized in a counter clockwise direction.)

More on Lists

In writing this simple mesher, I made use of the generalized arrays developed earlier. In the process of using these, I came across various things that didn’t work that well, so I changed them. For instance, the listAppend function originally returned an index to the data array where the new data was written. It turned out to be more convenient to return a pointer to that data.

I also realized that the creation and destruction functions were overly restrictive. They assume the user has already created the list object and only needs the data array allocated. There are times however, when the user would like both allocated and deallocated for him. So there are now two sets of creation and destruction functions:

void createListData(ALIST *list, long maxEl, size_t elSize) void destroyListData(ALIST *list) void createList(ALIST **list, long maxEl, size_t elSize) void destroyList(ALIST *list)

This provides greater flexibility, but also adds complexity. Speaking of complexity, the astute reader will note in the source code that I did not use these flexible arrays for the lower and oldNF arrays. Instead I used fixed arrays, dimensioned to 1000 elements. A flexible array would be perfect here, since it will auto-expand to however many rows are needed. But the extra hassle didn’t seem worth it for some something so simple.

If I had used these flexible arrays, setting and getting values would have looked like:

LL = *(VERT_TYPE *)getItem(lower, j); setItem(lower, j) = UL;

instead of

LL = lower[j]; lower[j] = UL;

That is a serious loss of readability. When getting the item from the list, you have to remember that getItem returns a pointer to the data element, so you must dereference it. To be proper, you need to cast the pointer first. That makes for some pretty darn ugly code. I could use a macro to clean it up. Perhaps something like:

#define GET_ITEM(pList, index, datatype) *(datatype *)getItem(pList, index);

I will have to experiment with it.

Wrap Up

If you want to play with the code to generate a half edge mesh, you can download it here. Put the files ArrayType.inc nd HalfMesher.c in the same folder and compile and run:

gcc -Wall HalfMesher.c ./a.exe

When main runs, it will call a test function that tests different aspects of a simple 10×10 mesh. It writes out the results to a text file. This is how I made sure the code was relatively bug free.

Naming Conventions in C

Saturday, January 12th, 2008

Having done much of my programming in Basic, I have been sweetly oblivious to the importance of case in naming variables. In Basic, NumNodes is the same as numnodes, numNodes and NUMNODES. As a result, you can type any combination that suits you and all is well. Your eye doesn’t even notice the differences after awhile. That all changes when you program in C. I am translating some meshing code from PowerBasic to C for my next blog entry and the gcc compiler is giving me tons of errors like:

HalfMesher.c: In function `createMesh': HalfMesher.c:242: error: `OldNF' undeclared (first use in this function) HalfMesher.c:254: error: structure has no member named `Faces'

And that is after going through the code line by line looking for such things!

CompoundNames

This got me thinking about naming conventions for variable names and functions. There is an amazing amount written on the topic and feverous religious battles have been fought. Still, a few patterns emerged. In the C world, there is a tendency to use all lower case for simple variable and function names. For compound names, like CellVolume, there are several options:

  • Run together: cellvolume
  • Underscores: cell_volume
  • Lower mixed case: cellVolume
  • Upper mixed case: CellVolume

Mixed case is one of several names for capitalizing the first letter of each word. (Others are camel case, medial capitals, InterCaps.) In lower mixed case, this is applied only to the words after the first. In the list above, the first two combining methods seem the most common for C programmers, followed by lower and finally upper mixed case.

I have to say that I don’t care for the run together approach because it can be ambiguous for some word combinations. For instance consider the variable cellslip. It could be interpreted as cell_slip or cells_lip. As you read the code, you need to stop and decipher this variable name. That may only take a moment, but it is an extra burden on the reader. The underscore method addresses this issue nicely, but adds to the bulk of variables, and can clutter the code. Mixed case gives you the benefits of word separation, without the extra bulk. Because C programmers have tended to prefer lower case, lower mixed case seems like the best approach for compound names.

Other Issues

It turns out there are many more issues that come up with name conventions. For instance, terseness. The C examples in Kernighan and Ritchie are filled with one and two letter variable names. Their code is compact, but it can be hard to read. At the other extreme, there is the natural language approach of Kari Laitinen. Here is a snippet of code taken from his site:

cout << "\n Please, type in two integers separated " << "with spaces: " ; cin >> first_integer_from_keyboard >> second_integer_from_keyboard ; sum_of_two_integers = first_integer_from_keyboard + second_integer_from_keyboard ;

It is very readable, but that is partly because it is also very simple. If you are solving complex equations, with lots of variables in them, your code will be highly cluttered if you follow this approach. Which is easier to read:

y = a*x*x + b*x + c;

or

Answer_to_problem_7 = First_Coefficient*x_variable*x_variable + Second_Coefficient*x_variable + Third_Coefficient;

The example above introduces the importance of idioms. Even very terse names can convey meaning if they follow a common idiom. Any mathematically aware persion will recognize the above equation as that of a second order polynomial. For this case single letter names are fine. Similarly, using i and j as index names for loops is a common idiom in programming, and so is fine too.

What is the optimum length for variable names? Steve McConnell discusses this in Code Complete. He references a 1990 study by Gorla, Benander and Benander that suggests names in the range of 10 to 16 characters where the easiest to debug.

Another important point is that care is required with global variables. Local variables can be understood in the context of the function they reside in. Global variables are generally without context. As you read code, they pop up randomly, and you need to understand them. Thus, their names must be descriptive enough to standalone, within reason.

Naming Convention

Here are the rules I have come up with for myself when programming in C. Following these should both make my life easier, and hopefully make my code easier to read.

  • Single word variable names are lower case: face
  • Compound variable names use lower mixed case: theLongName
  • Functions follow same convention as variables
  • Macros are upper case: MAX
  • Typedefs are upper case and end in _TYPE for complex types: FACE_TYPE
  • Names for lists or collections typically end in ’s’: cells
  • Global variables start with g and are more detailed: gDiscretizationScheme
  • If it helps readability, pointers can be prefixed with p: pFace
  • Abbreviations are to be minimized.
  • Idioms are used to shorten names where it makes sense