Doc. No.: WG21/N1653
J16/04-0093
Date: 2004-07-16
Reply to: Clark Nelson
Phone: +1-503-712-8433
Email: [email protected]

Working draft changes for C99 preprocessor synchronization

This document details the changes that need to be made to the working draft to resynchronize the preprocessor and translation phases of C++ with C99, according to the consensus of the combined Core and Evolution working group session at the Sydney meeting. Several areas with differences are explicitly excluded at this time:

Differences caused by differing terminology and by technical considerations unique to C++ are also not considered.

Most of the wording changes presented here are taken directly from C99, as of its TC1 (2001). The few exceptions, and anything about which I think there might be any remaining question, are each flagged with a "Note:".

Because most of the changes are not new drafting, but already appear in a published standard, this document probably needs little discussion. Because the Evolution group has already decided to go in this direction, and whatever issues remain are probably at the level of wording tweaks, it may be that whatever discussion is needed would be more effectively handled in the Core group. But to be on the safe side, this paper is directed to both groups.

It is my intention to continue doing textual comparisons between the C standard and subsequent working drafts of the C++ standard, and thereby discover any errors in the application of these changes, and also any changes that I may have inadvertently missed.

Predefined macros

Note: Sometime between 1989 and 1999, the C committee (or the editor) decided to put the list of predefined macro names into alphabetical order. Clearly, this was a strictly editorial change. Purely for the sake of facilitating textual comparisons between the two standards, I would recommend that the list be sorted in the C++ standard as well. But I do not present the necessary changes here, and I do not deal in detail with the order of presentation of predefined macro descriptions.

Insert in §16.8¶1:

__STDC_HOSTED__
The integer constant 1 if the implementation is a hosted implementation or the integer constant 0 if it is not.

Insert new paragraph following §16.8¶1:

The following macro names are conditionally defined by the implementation:

__STDC__
Whether __STDC__ is predefined and if so, what its value is, are implementation-defined.
__STDC_VERSION__
Whether __STDC_VERSION__ is predefined and if so, what its value is, are implementation-defined.
__STDC_ISO_10646__
An integer constant of the form yyyymmL (for example, 199712L), intended to indicate that values of type wchar_t are the coded representations of the characters defined by ISO/IEC 10646, along with all amendments and technical corrigenda as of the specified year and month.

Note: I recommend moving the description of __STDC__ (unchanged) from ¶1 to ¶2, since it really is and always has been conditionally defined in C++.
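Illustration (not proposed wording): a minimal sketch of how a program might consult these macros, assuming a compiler that accepts C++11. The function names hosted_status, check_hosted and conditional_macro_summary are hypothetical. Because __STDC__, __STDC_VERSION__ and __STDC_ISO_10646__ are only conditionally defined, each use is guarded.

```cpp
#include <cassert>
#include <string>

// Returns the value of __STDC_HOSTED__, or -1 if the implementation
// does not predefine it (for example, one that predates this change).
inline int hosted_status() {
#ifdef __STDC_HOSTED__
    return __STDC_HOSTED__;   // 1 if hosted, 0 if not
#else
    return -1;
#endif
}

// Sanity check: only the three outcomes described above are possible.
inline void check_hosted() {
    int h = hosted_status();
    assert(h == -1 || h == 0 || h == 1);
}

// Lists which of the conditionally defined names this implementation provides.
inline std::string conditional_macro_summary() {
    std::string s;
#ifdef __STDC__
    s += "__STDC__ ";
#endif
#ifdef __STDC_VERSION__
    s += "__STDC_VERSION__ ";
#endif
#ifdef __STDC_ISO_10646__
    s += "__STDC_ISO_10646__ ";
#endif
    return s;
}
```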

Change §16.8¶3:

If any of the pre-defined macro names in this subclause, or the identifier defined, is the subject of a #define or a #undef preprocessing directive, the behavior is undefined. [Note: Any other predefined macro names shall begin with a leading underscore followed by an uppercase letter or a second underscore. --end note]

Note: The added sentence is in C99, and a slightly different version was in C89. I believe that the sentence was deleted from C++ because of general reluctance on the part of the editor to use the word "shall" to impose a requirement on an implementation. It is also possible that it was believed that this sentence was redundant with the rules for reserved names in §17.4.3.1. In any event, I recommend adding the sentence (back), as it appears in C99; I believe both these possible objections are addressed by making it a non-normative note.

Pragma operator

Change §16.3.4¶3:

The resulting completely macro-replaced preprocessing token sequence is not processed as a preprocessing directive even if it resembles one, but all pragma unary operator expressions within it are then processed as specified in 16.9 below.

Insert new section §16.9:

16.9 Pragma operator

A unary operator expression of the form:

_Pragma ( string-literal )

is processed as follows: The string literal is destringized by deleting the L prefix, if present, deleting the leading and trailing double-quotes, replacing each escape sequence \" by a double-quote, and replacing each escape sequence \\ by a single backslash. The resulting sequence of characters is processed through translation phase 3 to produce preprocessing tokens that are executed as if they were the pp-tokens in a pragma directive. The original four preprocessing tokens in the unary operator expression are removed.

[Example: A directive of the form:

#pragma listing on "..\listing.dir"

can also be expressed as:

_Pragma ( "listing on \"..\\listing.dir\"" )

The latter form is processed in the same way whether it appears literally as shown, or results from macro replacement, as in:

#define LISTING(x) PRAGMA(listing on #x)
#define PRAGMA(x) _Pragma(#x)

LISTING ( ..\listing.dir )

--end example]
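Illustration (not proposed wording): the destringizing steps described in 16.9 can be modeled as an ordinary function over the spelling of a string literal. This is a sketch under stated assumptions, not how any particular implementation works: the name destringize is hypothetical, the input is assumed to be a well-formed literal including its quotes, and raw string literals (C++11) are used in the usage lines only to keep the backslashes readable.

```cpp
#include <cassert>
#include <string>

// Hypothetical model of destringizing: takes the spelling of a string
// literal (quotes and optional L prefix included) and returns the
// characters that would be handed to translation phase 3.
inline std::string destringize(const std::string& literal) {
    std::string::size_type pos = 0;
    if (!literal.empty() && literal[0] == 'L') pos = 1;   // delete the L prefix
    // Assumption: the literal is well formed, so the quotes are where expected.
    assert(literal.size() >= pos + 2);
    assert(literal[pos] == '"' && literal[literal.size() - 1] == '"');

    std::string out;
    // Walk the characters strictly between the leading and trailing quotes.
    for (std::string::size_type i = pos + 1; i + 1 < literal.size(); ++i) {
        if (literal[i] == '\\' && i + 2 < literal.size() &&
            (literal[i + 1] == '"' || literal[i + 1] == '\\')) {
            out += literal[i + 1];   // \" becomes "  and  \\ becomes a single backslash
            ++i;
        } else {
            out += literal[i];       // every other character is kept as written
        }
    }
    return out;
}
```

Applied to the spelling of the literal in the example above, the function yields the pp-tokens of the equivalent #pragma directive:

    destringize(R"lit("listing on \"..\\listing.dir\"")lit")
        == R"lit(listing on "..\listing.dir")lit"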

Change §2.1 phase 4:

Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed. If a character sequence that matches the syntax of a universal character name is produced by token concatenation (16.3.3), the behavior is undefined. A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively. All preprocessing directives are then deleted.

Variadic macros and empty macro arguments

Change the definition of control-line (§16¶1):

control-line:
# include pp-tokens new-line
# define identifier replacement-list new-line
# define identifier lparen identifier-listopt ) replacement-list new-line
# define identifier lparen ... ) replacement-list new-line
# define identifier lparen identifier-list , ... ) replacement-list new-line
# undef identifier new-line
# line pp-tokens new-line
# error pp-tokensopt new-line
# pragma pp-tokensopt new-line
# new-line

Change §16.3¶4:

If the identifier-list in the macro definition does not end with an ellipsis, the number of arguments (including those arguments consisting of no preprocessing tokens) in an invocation of a function-like macro shall equal the number of parameters in the macro definition. Otherwise, there shall be more arguments in the invocation than there are parameters in the macro definition (excluding the ...). There shall exist a ) preprocessing token that terminates the invocation.

Insert new paragraph following §16.3¶4:

The identifier __VA_ARGS__ shall occur only in the replacement-list of a function-like macro that uses the ellipsis notation in the parameters.

Change §16.3¶9:

A preprocessing directive of the form

# define identifier lparen identifier-listopt ) replacement-list new-line
# define identifier lparen ... ) replacement-list new-line
# define identifier lparen identifier-list , ... ) replacement-list new-line

defines a function-like macro with parameters, «etc.»

Note: This change incorporates part of a move of the definition of the term "function-like macro"; the rest of the move is presented below as a minor wording tweak. The confusion of presenting the changes in this way is hoped to be less than the confusion of presenting overlapping changes independently.

Change §16.3¶10:

The sequence of preprocessing tokens bounded by the outside-most matching parentheses forms the list of arguments for the function-like macro. The individual arguments within the list are separated by comma preprocessing tokens, but comma preprocessing tokens between matching inner parentheses do not separate arguments. If there are sequences of preprocessing tokens within the list of arguments that would otherwise act as preprocessing directives, the behavior is undefined.

Insert new paragraph after §16.3¶10:

If there is a ... in the identifier-list in the macro definition, then the trailing arguments, including any separating comma preprocessing tokens, are merged to form a single item: the variable arguments. The number of arguments so combined is such that, following merger, the number of arguments is one more than the number of parameters in the macro definition (excluding the ...).

Insert new paragraph after §16.3.1¶1:

An identifier __VA_ARGS__ that occurs in the replacement list shall be treated as if it were a parameter, and the variable arguments shall form the preprocessing tokens used to replace it.
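Illustration (not proposed wording): a minimal sketch of how the trailing arguments are merged into a single item, assuming a compiler that implements variadic macros as specified here. The macro names FIRST_OF and REST_AS_TEXT and the function name are hypothetical.

```cpp
#include <cassert>
#include <cstring>

// Both macros have one named parameter, so everything after the first
// comma (separating commas included) becomes the variable arguments.
#define FIRST_OF(first, ...) first
#define REST_AS_TEXT(first, ...) #__VA_ARGS__

inline bool variable_arguments_are_merged() {
    // 2, 3, 4 is treated as one item and stringized as a unit.
    return std::strcmp(REST_AS_TEXT(1, 2, 3, 4), "2, 3, 4") == 0
        && FIRST_OF(1, 2, 3, 4) == 1;
}
```

After merger each invocation has two arguments, one more than the single named parameter.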

Insert new sentence in §16.3.2¶2:

«....» If the replacement that results is not a valid character string literal, the behavior is undefined. The character string literal corresponding to an empty argument is "". The order of evaluation of # and ## operators is unspecified.
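Illustration (not proposed wording): a small check of the empty-argument case, assuming a compiler that accepts empty macro arguments and static_assert (C++11). The names STR and empty_argument_gives_empty_string are hypothetical.

```cpp
#include <cassert>

#define STR(x) #x

// STR() has one argument consisting of no preprocessing tokens, so it yields "".
static_assert(sizeof(STR()) == 1, "an empty literal holds only the terminator");
static_assert(sizeof(STR(ab)) == 3, "two characters plus the terminator");

inline bool empty_argument_gives_empty_string() {
    return STR()[0] == '\0';
}
```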

Change §16.3.3¶2:

If, in the replacement list of a function-like macro, a parameter is immediately preceded or followed by a ## preprocessing token, the parameter is replaced by the corresponding argument's preprocessing token sequence; however, if an argument consists of no preprocessing tokens, the parameter is replaced by a placemarker preprocessing token instead.«Footnote»

«Footnote:» Placemarker preprocessing tokens do not appear in the syntax because they are temporary entities that exist only within translation phase 4.

Change §16.3.3¶3:

For both object-like and function-like macro invocations, before the replacement list is reexamined for more macro names to replace, each instance of a ## preprocessing token in the replacement list (not from an argument) is deleted and the preceding preprocessing token is concatenated with the following preprocessing token. Placemarker preprocessing tokens are handled specially: concatenation of two placemarkers results in a single placemarker preprocessing token, and concatenation of a placemarker with a non-placemarker preprocessing token results in the non-placemarker preprocessing token. If the result is not a valid preprocessing token, the behavior is undefined. The resulting token is available for further macro replacement. The order of evaluation of ## operators is unspecified.
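Illustration (not proposed wording): a minimal sketch of the placemarker rules for a two-operand paste, assuming a compiler that accepts empty macro arguments. The names CAT and placemarkers_vanish are hypothetical.

```cpp
#include <cassert>

#define CAT(a, b) a ## b

inline bool placemarkers_vanish() {
    int left_only  = CAT(12, );     // 12 ## placemarker yields 12
    int right_only = CAT(, 34);     // placemarker ## 34 yields 34
    int both       = CAT(1, 2);     // ordinary concatenation yields 12
    int values[] = { 7, CAT(, ) };  // placemarker ## placemarker leaves no token behind
    return left_only == 12 && right_only == 34 && both == 12
        && sizeof(values) / sizeof(values[0]) == 1;
}
```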

Insert new paragraph following §16.3.3¶3:

[Example: In the following fragment:

#define hash_hash # ## #
#define mkstr(a) # a
#define in_between(a) mkstr(a)
#define join(c, d) in_between(c hash_hash d)
char p[] = join(x, y); // equivalent to
                       // char p[] = "x ## y";

The expansion produces, at various stages:

join(x, y)
in_between(x hash_hash y) 
in_between(x ## y)
mkstr(x ## y)
"x ## y"

In other words, expanding hash_hash produces a new token, consisting of two adjacent sharp signs, but this new token is not the ## operator. --end example]

Change §16.3.5¶5:

To illustrate the rules for redefinition and reexamination, the sequence

#define x 3
#define f(a) f(x * (a))
#undef x
#define x 2
#define g f
#define z z[0]
#define h g(~
#define m(a) a(w)
#define w 0,1
#define t(a) a
#define p() int
#define q(x) x
#define r(x,y) x ## y
#define str(x) # x

f(y+1) + f(f(z)) % t(t(g)(0) + t)(1);
g(x+(3,4)-w) | h 5) & m
      (f)^m(m);
p() i[q()] = { q(1), r(2,3), r(4,), r(,5), r(,) };
char c[2][6] = { str(hello), str() };

results in

f(2 * (y+1)) + f(2 * (f(2 * (z[0])))) % f(2 * (0)) + t(1);
f(2 * (2+(3,4)-0,1)) | f(2 * (~ 5)) & f(2 * (0,1))^m(0,1);
int i[] = { 1, 23, 4, 5, };
char c[2][6] = { "hello", "" };

Insert new paragraph after §16.3.5¶6:

To illustrate the rules for placemarker preprocessing tokens, the sequence

#define t(x,y,z) x ## y ## z
int j[] = { t(1,2,3), t(,4,5), t(6,,7), t(8,9,),
t(10,,), t(,11,), t(,,12), t(,,) };

results in

int j[] = { 123, 45, 67, 89,
10, 11, 12, };

Insert new paragraph after §16.3.5¶7:

Finally, to show the variable argument list macro facilities:

#define debug(...) fprintf(stderr, __VA_ARGS__)
#define showlist(...) puts(#__VA_ARGS__)
#define report(test, ...) ((test)?puts(#test):\
            printf(__VA_ARGS__))
debug("Flag");
debug("X = %d\n", x);
showlist(The first, second, and third items.);
report(x>y, "x is %d but y is %d", x, y);

results in

fprintf(stderr, "Flag" );
fprintf(stderr, "X = %d\n", x );
puts( "The first, second, and third items." );
((x>y)?puts("x>y"):
           printf("x is %d but y is %d", x, y));
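Illustration (not proposed wording): a self-contained variant of the examples above that writes into a buffer so the result can be inspected. It is a sketch assuming std::snprintf from C++11; the names FORMAT_INTO and format_example are hypothetical. The format string is passed as part of the variable arguments, so every invocation supplies at least one of them.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// buf must be an array (not a pointer) so that sizeof(buf) is its capacity.
#define FORMAT_INTO(buf, ...) std::snprintf((buf), sizeof(buf), __VA_ARGS__)

inline bool format_example() {
    char buf[32];
    int n = FORMAT_INTO(buf, "x is %d but y is %d", 3, 4);
    return n > 0 && std::strcmp(buf, "x is 3 but y is 4") == 0;
}
```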

String literal concatenation

Change §2.1 phase 6:

Adjacent string literal tokens are concatenated.

Change §2.13.4¶3:

In translation phase 6 (2.1), adjacent narrow string literals are concatenated and adjacent wide string literals are concatenated. If a narrow string literal token is adjacent to a wide string literal token, the result is a wide string literal. Characters in concatenated strings are kept distinct. [Example: «etc.»
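Illustration (not proposed wording): a minimal sketch of mixed concatenation, assuming an implementation that follows the rule proposed here (which is also the C99 rule). Under the existing C++ wording the same code has undefined behavior. The function name is hypothetical.

```cpp
#include <cassert>
#include <cwchar>

inline bool mixed_concatenation_is_wide() {
    // A narrow literal adjacent to a wide literal: the result is wide.
    const wchar_t* s = "abc" L"def";
    return std::wcscmp(s, L"abcdef") == 0;
}
```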

Header and include file names

Note: In investigating this issue since the Sydney meeting, I discovered that C++ (unintentionally, I'm morally certain) includes UCNs in the set of portable elements of header and source file names, by virtue of referencing the non-terminal nondigit. The simplest solution to this problem would be to adopt C99's grammar for identifiers, by changing the rules in §2.10, preceding ¶1:

identifier:
identifier-nondigit
identifier identifier-nondigit
identifier digit
identifier-nondigit:
nondigit
universal-character-name
other implementation-defined characters
nondigit: one of
_ a b c d e f g h i j k l m
n o p q r s t u v w x y z
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
digit: one of
0 1 2 3 4 5 6 7 8 9

Change §16.2¶5:

The mapping between the delimited sequence and the external source file name is implementation-defined. The implementation provides unique mappings for sequences consisting of one or more nondigits or digits (2.10) followed by a period (.) and a single nondigit. The first character shall not be a digit. The implementation may ignore the distinctions of alphabetical case.

Note: This is the only area where logically overlapping changes were made from C89 in both C99 and C++. This formulation matches what I will be proposing to the C committee, but of course I can't guarantee that it will be accepted in precisely this form.

Translation limit changes

Change the last sentence of §16.4¶3:

If the digit sequence specifies zero or a number greater than 2147483647, the behavior is undefined.
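Illustration (not proposed wording): a minimal sketch of a #line directive whose digit sequence exceeds the old limit of 32767 but is within the new one, assuming an implementation that applies the new limit. The function name line_after_directive is hypothetical.

```cpp
#include <cassert>

// The directive below gives the line that follows it the number 40000.
#line 40000
inline long line_after_directive() { return __LINE__; }
```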

Editorial changes

Clarifying phases of translation

Change §16¶1:

A preprocessing directive consists of a sequence of preprocessing tokens. The first token in the sequence is a # preprocessing token that (at the start of translation phase 4) is either the first character in the source file (optionally after white space containing no new-line characters) or that follows white space containing at least one new-line character. The last token in the sequence is the first new-line character that follows the first token in the sequence.«Footnote 138» A new-line character ends the preprocessing directive even if it occurs within what would otherwise be an invocation of a function-like macro.

Insert new example immediately before §16.1:

[Example: In:

#define EMPTY
EMPTY # include <file.h>

the sequence of preprocessing tokens on the second line is not a preprocessing directive, because it does not begin with a # at the start of translation phase 4, even though it will do so after the macro EMPTY has been replaced. --end example]

Clarifying the definition of directive

Change the definition of group-part (§16):

group-part:
if-section
control-line
text-line
# non-directive

Insert definitions before lparen:

text-line:
pp-tokensopt new-line
non-directive:
pp-tokens new-line

Insert two new paragraphs before §16¶2:

A text line shall not begin with a # preprocessing token. A non-directive shall not begin with any of the directive names appearing in the syntax.

When in a group that is skipped (16.1), the directive syntax is relaxed to allow any sequence of preprocessing tokens to occur between the directive name and the following new-line character.

Clarifying macro names with extended characters

Insert new paragraph before §16.3¶3:

There shall be white-space between the identifier and the replacement list in the definition of an object-like macro.

Rationale: Consider the following example:

#define A$B c

In an implementation which allows $ as an identifier character, this is clearly a definition of a macro named A$B. In an implementation which does not allow $ in an identifier, this could have been taken as a definition of a macro named A, with the $ interpreted as the first pp-token of the definition, without emitting a diagnostic. With this change, the second implementation would be required to issue a diagnostic (this paragraph is a constraint in C99).

Clarifying precedence of token-pasting and stringizing

Change §16.3.4¶1:

After all parameters in the replacement list have been substituted and # and ## processing has taken place, all placemarker preprocessing tokens are removed. Then, the resulting preprocessing token sequence is rescanned, along with all subsequent preprocessing tokens of the source file, for more macro names to replace.

Clarifying "scope" of macros within phases of translation

Insert a new sentence at the end of §16.3.5¶1:

A macro definition lasts (independent of block structure) until a corresponding #undef directive is encountered or (if none is encountered) until the end of the preprocessing translation unit. Macro definitions have no significance after translation phase 4.
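Illustration (not proposed wording): a minimal sketch showing that a macro definition ignores block structure and lasts until the corresponding #undef. The names LIMIT, inside_block and later_function are hypothetical.

```cpp
#include <cassert>

inline int inside_block() {
#define LIMIT 10          // defined inside a function body
    return LIMIT;
}

inline int later_function() {
    return LIMIT + 1;     // still defined here: blocks do not scope macros
}

#undef LIMIT              // the definition ends at this directive

#ifdef LIMIT
#error "LIMIT should no longer be defined"
#endif
```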

Note: This paragraph contains one of the very few uses of the term "preprocessing translation unit", which is new in C99, and is effectively equivalent to "translation unit", but applies during preprocessing (basically translation phase 4). If that term is not to be introduced into C++ (concerning which I have no strong feelings), the word "preprocessing" should not be inserted as directed above.

Clarifying the possible effects of a pragma

Change §16.6¶1:

A preprocessing directive of the form

# pragma pp-tokensopt new-line

causes the implementation to behave in an implementation-defined manner. The behavior might cause translation to fail or cause the translator or the resulting program to behave in a non-conforming manner. Any pragma that is not recognized by the implementation is ignored.

Note: In C99, this paragraph is also modified for the sake of standard pragmas. Those modifications are not presented here.

Clarifying line-splicing

Change §2.1 phase 2:

Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. If, as a result, a character sequence that matches the syntax of a universal-character-name is produced, the behavior is undefined. If a source file that is not empty does not end in a new-line character, or ends in a new-line character immediately preceded by a backslash character before any such splicing takes place, the behavior is undefined.
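Illustration (not proposed wording): the splicing rule can be modeled as a function over the characters of a source file. This is a sketch only; the name splice_lines is hypothetical, and the model ignores the undefined-behavior cases (a resulting universal-character-name, or a file that does not end in a new-line).

```cpp
#include <cassert>
#include <string>

// Removes each backslash that is immediately followed by a new-line,
// together with that new-line. No other backslash is touched.
inline std::string splice_lines(const std::string& src) {
    std::string out;
    for (std::string::size_type i = 0; i < src.size(); ++i) {
        if (src[i] == '\\' && i + 1 < src.size() && src[i + 1] == '\n') {
            ++i;              // skip the backslash and the new-line
        } else {
            out += src[i];
        }
    }
    return out;
}
```

With two backslashes at the end of a physical line, only the last one takes part in the splice, so splice_lines("a\\\\\nb") keeps the first backslash and returns "a\\b". A backslash followed by a space and then a new-line is not spliced at all.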

Note: The statement about forming a UCN-like sequence is unique to C++, and will be reviewed in that context.

Clarifying the mapping to the execution character set

Change §2.1 phase 5:

Each source character set member, escape sequence, or universal-character-name in character literals and string literals is converted to the corresponding member of the execution character set (2.13.2, 2.13.4); if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character«Footnote».

«Footnote:» An implementation need not convert all non-corresponding source characters to the same execution character.

Minor wording tweaks

Change the definition of lparen (§16):

lparen:
a ( character not immediately preceded by white-space

Change the last sentence of §16.1¶1:

«...» which evaluate to 1 if the identifier is currently defined as a macro name (that is, if it is predefined or if it has been the subject of a #define preprocessing directive without an intervening #undef directive with the same subject identifier), 0 if it is not.
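Illustration (not proposed wording): a minimal sketch of the defined operator evaluating to 1 or 0. The macro names HAVE_ALPHA, HAVE_BETA, HAVE_GAMMA and DEFINED_CHECK, and the function name, are hypothetical.

```cpp
#include <cassert>

#define HAVE_ALPHA            // defined with an empty replacement list
#define HAVE_BETA 0           // defined, even though its value is 0
#define HAVE_GAMMA
#undef HAVE_GAMMA             // the intervening #undef makes it undefined again

// Both spellings of the operator are used: defined(X) and defined X.
#if defined(HAVE_ALPHA) && defined HAVE_BETA && !defined(HAVE_GAMMA)
#define DEFINED_CHECK 1
#else
#define DEFINED_CHECK 0
#endif

inline int defined_check() { return DEFINED_CHECK; }
```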

Change and combine §16.3¶2-3:

An identifier currently defined as a macro without use of lparen (an object-like macro) may be redefined by another #define preprocessing directive provided that the second definition is an object-like macro definition and the two replacement lists are identical, otherwise the program is ill-formed. Likewise, an identifier currently defined as a macro using lparen (a function-like macro) may be redefined by another #define preprocessing directive provided that the second definition is a function-like macro definition that has the same number and spelling of parameters, and the two replacement lists are identical, otherwise the program is ill-formed.
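Illustration (not proposed wording): a minimal sketch of redefinitions that satisfy the rule, modeled on the standard's own valid-redefinition example. The replacement lists differ only in comments and in the amount of white space, which does not make them different. The function name redefinitions_agree is hypothetical.

```cpp
#include <cassert>

#define OBJ_LIKE (1-1)
#define OBJ_LIKE /* white space */ (1-1) /* other */
#define FUNC_LIKE(a) ( a )
#define FUNC_LIKE( a )( /* note the white space */ a /* other */ )

inline bool redefinitions_agree() {
    return OBJ_LIKE == 0 && FUNC_LIKE(5) == 5;
}
```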

Change §16.3¶8:

A preprocessing directive of the form

# define identifier replacement-list new-line

defines an object-like macro that causes «etc.»

Note: The combination of the previous two changes with the change presented above to §16.3¶9 (under variadic macros) move the definitions of the terms "object-like macro" and "function-like macro" forward in the text. In C99, they are moved from Constraints paragraphs to Semantics paragraphs.

Change the last sentence of §16.3.1¶1:

«....» Before being substituted, each argument’s preprocessing tokens are completely macro replaced as if they formed the rest of the preprocessing file; no other preprocessing tokens are available.

Change §16.3.4¶2:

If the name of the macro being replaced is found during this scan of the replacement list (not including the rest of the source file’s preprocessing tokens), it is not replaced. Furthermore, if any nested replacements encounter the name of the macro being replaced, it is not replaced. «etc.»
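Illustration (not proposed wording): a minimal sketch of a macro whose replacement list mentions its own name. The name is expanded once and the inner occurrence is left alone, so it refers to the variable. The names counter and self_reference_demo are hypothetical.

```cpp
#include <cassert>

inline int self_reference_demo() {
    int counter = 1;
#define counter (4 + counter)   // macro deliberately named after the variable
    int result = counter;       // expands to (4 + counter); the inner name is not replaced
#undef counter
    return result;
}
```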

Change §16.3.5¶6:

«...»
#define debug(s, t) printf("x" # s "= %d, x" # t "= %s", \
				x ## s, x ## t)
#define INCFILE(n) vers ## n
#define glue(a, b) a ## b
«...»

Change §2.1 phase 3:

«...» A source file shall not end in a partial preprocessing token or in a partial comment. «...»

Change §2.1 phase 7:

White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token (2.6). The resulting tokens are syntactically and semantically analyzed and translated as a translation unit. [Note: «etc.»

And finally, change almost all comments in examples in §16 from "C-style" to "C++-style". The only exceptions are in the valid redefinitions in §16.3.5¶7, which demonstrate the redefinition rules, and which depend on white space within the macro definition. The detailed changes are left as an editorial exercise.