AngelikaLanger.com - Sequence Points and Expression Evaluation in C++

Angelika Langer - Training & Consulting

	HOME


	OVERVIEW BY TOPIC JAVA C++ BY COLUMN EFFECTIVE JAVA EFFECTIVE STDLIB BY MAGAZINE JAVA MAGAZIN JAVA SPEKTRUM JAVA WORLD JAVA SOLUTIONS JAVA PRO C++ REPORT CUJ OTHER
	GENERICS
	LAMBDAS
	IOSTREAMS
	ABOUT
	CONTACT

Sequence Points and Expression Evaluation in C++

Sequence Points and Expression Evaluation in C++

Sequence Points and Expression Evaluation

Visual Systems Journal, August 2002
Klaus Kreft & Angelika Langer

[There is a Russian translation available at CPP-REFERENCE.RU .]

Does anybody know what a sequence point is? Every C++ programmer should know it. However, when we pose this question even experienced programmers with many years of hands-on experience concede they have no idea. In this article we want to shed some light on the secret and explain what sequence points are and why it is important to be aware of them.

What is a sequence point?

A C++ program consists of basically two types of statements:

Declarations and definitions of types, constants, function names, etc. This is information that the compiler will read and process at compile time. Only a minor fraction of this information will show up in the executable at runtime.
Expressions . This kind of information is interpreted by the compiler and translated into instructions that will be executed at runtime in order to evaluate the expressions.

The expressions are the really interesting the part of the program because they have side effects at runtime and for this reason determine what the program will eventually do. Hence our main interest in programming is in evaluation of expressions. Sequence points have to do with expression evaluation.

Side Effects of Expression Evaluation

Before we delve into sequence points, let's clarify what the side effects of expression evaluation are. The side effects that a C++ program can produce are:

Modification of an object, that is, modification of some memory location or register.
Access to an object that is declared volatile.
Invocation of a system routine that produces side effects, like for instance input or output to files.
Invocation of functions that perform any of the above.

Hence, most of what a C++ program does is modification of memory locations and invocation of system-level routines. (If you are not sure about the effect of access to volatile objects, see the box 1 for further information.)

Box 1: Volatile Variables

The common understanding of volatile is that is tells the compiler it must not do certain optimizations. More precisely it means that the compiler must not optimize away access to any variable that is declared volatile, even if the compiler knows that its value has not been changed between the last write access and the current read access.
Example: Let us assume the compiler sees a sequence of statements like the following:

int i=0;
... many statements that do not touch i ...
if (i==0)
...

From the context the compiler can tell that the value of i must be 0 because there was no write access since it was initialized and it was not passed to any function that might have modified it. Hence the compiler could optimize away the actual access to the memory location that represents variable i because it already knows that the value must be 0. Exactly that optimization is disabled for a volatile variable.

What is volatile good for? Why would we ever want to disable such a useful optimization?

Well, sometimes we declare variables volatile that are modified from outside of the program. Examples are variables that represent a system clock for instance. The system will modify the variable's value and in such a case the compiler's program analysis will be pointless; the modification will not be visible in the program's source code. In system-level programming it is not unusual to occasionally declare variables as volatile.

But they are also used in concurrent programming where multiple threads may have access to the same variable. In that case one thread might change the value of a shared variable without any modifying access being visible in the other thread. Again, the compiler cannot tell from analysis of one thread's source code whether the variable has been changed by another thread or not, and the compiler must not optimize away any access to the shared variable. In concurrent programming we declare shared variables as volatile.

As you see from the examples, access to a volatile variable does produce a side effect: it retrieves a value from some memory location that has been provided by some other entity outside of the current program.

Order of Evaluation of Expressions

Back to sequence points: The compiler will generate executable instructions that will eventually evaluates all expressions in a program. The order in which the expressions will be evaluated is roughly the order in which they appear in the source code, that is, top to bottom and left to right. We say "roughly" because the compiler has some latitude to rearrange the order of expression evaluation if it sees potential for optimizations. However, the latitude granted to the compiler is limited - by sequence points. A sequence point is a point in the sequence of all expressions where the compiler is required to have finished the evaluation of all preceding expressions and has not yet started evaluation of any subsequent expressions.

Which means that a sequence point is a point in the program where we as programmers know which expressions (or subexpressions) have already been evaluated and which expressions (or subexpressions) have not yet been evaluated. In other words, sequence points are points in the program where we know where we stand in the process of execution of your program. Between sequence points we know nothing about the order of evaluation of expressions and subexpressions. To most programmers' surprise there are only very few sequence points defined in C++, which means that most of the time we have no idea what has happened yet and which subexpressions have not yet been evaluated.

Examples

Let's take a look at some examples. When we see an expression like i=2; then we expect that once control flow reaches the semicolon variable i will have the value 2. This expectation is correct because there is indeed a sequence point at the end of every "full expression". (For the time being do not worry about the term "full expression"; usually the end of an expression is the semicolon that marks the end of a statement.) Which means that at the end of each statement, the statement has been executed as expected.

But wait! How about this example:

x[i]=i++ + 1;

Let's assume variable i has the value 1 before we enter the statement. What will be the result of evaluation of this expression? The correct answer is: we don't know. However, programmers ever too often believe that they know what this program fragment does. Typical answers include: "x[1] will have the value 2", or "x[2] will have the value 2", or even "x[1] will have the value 3".

The third option is definitely wrong. This will not happen because i++ is a postfix increment and returns i's initial value 1; hence the value of the right hand side of the assignment is 2, and definitely not 3. (For further information on the difference between prefix and postfix increment, see box 2 .) So far so good, but we do not know which entry of the array x will be modified. Will the index be 1 or 2 when the right hand side value will be assigned to x[i]?

There is no definite answer to this question. It fully depends on the order in which the compiler evaluates the subexpressions. If the compiler starts on the right hand side of the assignment and evaluates i++ + 1 before it figures out which position in the array x must be assigned to then x[2] will be modified because i will have already been incremented in the course of evaluating the subexpression i++. Conversely, if the compiler starts on the left hand side and figures out that it must assign to position i in array x, which at that time will still be position 1, before it evaluates the right hand side then we'll end up with a modification of x[1]. Both outcomes are equally likely and equally correct.

Box 2: The Difference between Prefix and Postfix Increment and Decrement

Prefix operators differ from postfix operators in their return value.

Every operator produces a side effect and has a return value. The side effect of both prefix and postfix increment is incrementation of the variable's value. The return value, however, is different. The prefix operator return the value after the modification and the postfix operator returns the value before the modification.

Both effects, the side effect and the calculation of the return value, happen at the same time, namely when the pre- or postfix expression is evaluated. Sometimes people mistakenly believe that with a postfix increment the increment happens "afterwards", like in the following example:

f(i++);

is it sometimes expected that the i++ happens after the function call, which is not true. The expression i++ is evaluated before the function is called, which means that the increment happens before the function invoked. It is just that the return value of the expression i++ is the value before the modification and for that reason the original value is passed to the function. The side effect of the postfix increment however is definitely produced before the function will be called.

How can it be that the result of an innocent statement such as x[i]=i++ + 1; is undefined? Well, it is the immediate consequence of the fact that the order of evaluation of expressions and subexpression between sequences points is undefined. Remember, the only sequence point in this code fragment is at the end of the full expression, that is, at the semicolon. The common misconception is that people believe the assignment operator would introduce a sequence point, in which case the compiler would have to evaluate the left hand side before the right hand side. But the assignment operator does not mark a sequence point; practically none of the operators in C++ is a sequence point. (The only exception is the rarely used comma operator, as we will see later.)

Problematic vs. Safe Expressions

What is it that renders the assignemnt x[i]=i++ + 1; a problematic one whereas the assignement i=2; is harmless, in the sense that its result is well-defined and predictable? The crux is that in the expression x[i]=i++ + 1; there are two accesses to variable i and one of the accesses, namely the i++, is a modifying access. Since the order of evaluation between sequence points is not defined we do not know whether i will be modified before it will be read or whether it will be read before the modification. Hence the root of the problem is multiple access to a variable between sequence points if one the accesses is a modification.

Here is another example. What will happen here if i and j have values 1 and 2 before the statement is executed?

f(i++, j++, i+j);

Which value will be passed to function f as the third argument? Again, we don't know. It could be any of the following: 3, 4, or 5. It depends on the order in which the function arguments are evaluated. The common misconception here is that the arguments would be evaluated left to right. Or maybe right to left? In fact, there is no order whatsoever mandated by the language definition.

Latitude for Optimizations

Why actually does the language specification leave it open in which order compilers evaluate expressions between sequence points? Wouldn't matters be much clearer if a definite order of evaluation were defined?

The reason for the undefined order of evaluation between sequence points is, like ever so often in C++, efficiency. Compilers are given the liberty to evaluate expressions in the order that is most efficient for a given platform and processor. Which means that the sequence problems explained above are usually portability problems. The undefined order of evaluation might never ever bother you as long as you stick to the same compiler on the same platform. However, it can hit you as soon as you install the latest version of your compiler and recompile your source code. Or it hits you when you port your program to another platform. In any case, this type of bug is hard to track down because it hits you when you least expect it, because you can swear that you did not change a single line of code ...

The Complete List of Sequence Points in C++

So far we have only talked of the sequence point at the end of the full expression. Here is the complete list of all sequence points in C++:

at the end of a full expression
after the evaluation of all function arguments in a function call and before execution of any expressions in the function body
after copying of a returned value and before execution of any expressions outside the function
after evaluation of the first expression in a&&b, a||b, a?b:c, or a,b
after the initialization of each base and member in the constructor initialization list

Let's look at each of the sequence points in brevity:

End of a full expression . The end of the full expression is the semicolon in regular statements. In a conditional expression such as if(i==0 && j==f(x,y,z)) the end of the full expression not a semicolon but the closing bracket

Function calls. The sequence points before and after a function call mean that execution of the invoked function and the calling context is not intermingled: all statements before the function call are executed, the function arguments are evaluated (in undefined order), and then the function body is executed. Similarly on return from the function: all statements of the function body are executed; the return expression is evaluated, and then control flow continues in the calling context with the statement after the function call.

Operators . The sequence points in the logical expressions such as && and || and ternary operator ?: and the comma operator mean that the left hand side operand is evaluated before the right hand side operand. These few operands are the only operands in C++ that introduce sequence points.
Note , that the comma operator is among them, which complicates matters even further, because the comma operator is often confused with the comma as a separator, as for instance in the list of function arguments. Most of the time a comma is a separator; the comma operator is rarely used. Its most common use is in for-loops like for(i=0, j=0; i<100||j<200;++i,++j). In a list of function arguments such as f(++i,++j) the comma is just the separator between function arguments and does not imply any order of evaluation of the function arguments.
Just to demonstrate how confusing it can get, consider the following: what if f is a function with just one argument and we pass (++i,++j) to the function, like in f((++i,++j))? In that case the comma would not be a separator between function arguments, but the comma operator. As a result ++i would be evaluated before ++j. The result of that evaluation would be passed as an argument to function f, which raises the question: what is the result of the comma operator? Its the result of the evaluation of the right operand.
Equally confusing is an expression such as array[++i,++j]. Remember, positions in a 2-dimensional array are referred to as array[i][j]. Hence the comma in array[++i,++j] is not a separator, but the comma operator.

Constructor initialization list. The last sequence point between initializations of base classes and members in the constructor initialization list exists because the order of these initializations is well-defined and not left to the compiler's discretion. Note, that the order is not the order in which bases and members appear in the constructor initialization list, but is the order in which they are listed in the class definition. Example:

class Array {
private:
int* data;
unsigned size;
int lBound, hBound;
public:
Array(int low, int high)
:size(high-low+1), lBound(low), hBound(high), data(new int[size]) {}
};

The order of initialization is data(new int(size)) followed by size(high-low+1) followed by lBound(low) followed by hBound(high), which of course is problematic in this case because size will be used before is will be initialized to a sensible value. This is another known pitfall in C++, which you find explained in / MEY /.
The sequence point between initialization of each base and member in the constructor initialization list just makes sure the initializations are not intermingled. Instead they happen one after other in the order the bases and members are listed in the class definition.

Hidden Dependencies

Now that you know all sequence points you are ready for looking into a couple of additional examples. So far, we have only seen examples where the multiple access to at least variable was fairly obvious. How about this one?

x = f() + g() + h();
Is there any doubt about what will happen? At first sight it appears as though nothing could go wrong here. The 3 functions will be called in indefinite order, the sum of their return values will be calculated and an assignment will be performed. But, what if all 3 functions would access a shared static or global variable that they read and modify? We do not know in which order the 3 functions will be invoked and for that reason we would not know in which order read and write access to the shared data would be performed. Again, another sequence point challenge.

Sequence Points and Exceptions

Here is another interesting case:

f( new X(i++), new Y(i) );

Lots of things are going on here: we produce the side effects of modification of variable i, memory allocation for objects of type X and Y, and construction of these objects. We do not know in which order these side effects will be produced. Well, we do know that i++ will be evaluated before the constructor of X will be invoked and we can safely assume that the runtime system will allocate memory before it will try to initialize that memory be means of the constructor calls. But we do not know whether i will be incremented before it will be passed to the constructor of Y. And we do not know which object will be allocated first and/or which constructor will be called first. It is even allowed that the compiler may allocate the memory for both objects before it calls any of the constructors.

This lack of certainty is particularly problematic in case of exceptions being thrown by either operator new or any of the constructors. In case of an exception we would not know what has already happened and for this reason we would have no idea how to react and roll-back appropriately.

Avoid Complex Expressions

The problem in the example above was again: multiple access to the same variable including a modification. We simply tried to squeeze too many things into one expression. As a rule of thumb: avoid complex expressions. All of the problems discussed in this article can be prevented by breaking complex expressions down into smaller expressions. For instance, instead of saying

x[i]=i++ + 1;

we could say

x[i]=i + 1;
i++;

or whatever else we want to see happening. By taking the expression apart we introduce additional sequence points and as a result we have a defined order of expression evaluation. The same applies to the last examples: instead of saying

f( new X(i++), new Y(i) );

we could say

X* xptr = new X(i++);
Y* yptr = new Y(i);
f( xptr, yptr );

If we now catch a Yexception (from whose type we can tell that it was thrown by the Y constructor) then we would be able to tell that the X object must have already been successfully created and we have a much better chance to handle the exception more appropriately. Or we could wrap each of these statements into a try-block of its own if we wanted to catch and handle the bad_alloc exceptions from either of the calls to operator new.

Recommendation

Avoid complex expressions and especially those that include read and write access to the same object within one expression. With simpler expression we have significantly more sequence points and thus more control over the order of evaluation of the expressions that make up our program.

Conclusion

The order of evaluation of expressions and subexpressions is undefined between sequence points. C++ has amazingly few sequence points defined. The pitfall lies in the common misconception that there were many more sequence points than there really are. Most operators are not sequence points; the only exceptions are &&, ||, ?:, and the comma operator. Separators are never sequence points. The only sequence point you can really rely on is the end of a full expression, which is normally a semicolon.

Sequence point problems are typically portability problem that are difficult to track down in practice. Hence the recommendation: avoid them before they hit you. Keep your source code simple. Avoid complex expressions and especially those that contain multiple access to the same variable if one of the accesses is a modification. In case of doubt, break the expression down into several statements whereby you will introduce additional sequence points and will have a defined order of evaluation of your expressions.

References

/MEY/

Effective C++
Scott Meyers
Addison-Wesley, 2nd Ed., 1998

If you are interested to hear more about this and related topics you might want to check out the following seminar:

Seminar


	Reliable C++ - Avoiding Common Traps and Pitfalls 3 day seminar (open enrollment and on-site)