Concurrency Modifications to Basic String

ISO/IEC JTC1 SC22 WG21 N2647 = 08-0157 - 2008-05-16

Alisdair Meredith, [email protected]
Hans Boehm, [email protected]
Lawrence Crowl, [email protected], [email protected]
Peter Dimov, [email protected]
Daniel Krügler, [email protected]

This paper is a modification of N2534 = 08-0044 - 2008-03-17. The changes include summaries of implementation status, clarification of front and back in the weak proposal, clarification of the requirements on c_str and data, and adjunct proposals for element and data access.

Problem

The current definition of basic_string allows only very limited concurrent access to strings. Such limited concurrency will inhibit performance in multi-threaded applications.

Analysis

We categorize string operations as follows:

construction and destruction

These operations must be single-threaded and define the temporal bounds of all other operations on an object. However, copy construction may induce a shared representation on the argument string in copy-on-write implementations.

mutating

These operations will (probably) change the state of a string variable regardless of the implementation. Furthermore, some may introduce a shared representation on the argument string in copy-on-write implementations.

operator= resize reserve clear operator+= append assign push_back insert erase pop_back replace swap operator<< getline
(Any library function that accepts a string by non-const reference.)
iterators and element access

These operations provide pointers or references to the internal string representation.

begin end rbegin rend operator[] at front back
unsharing

For non-const strings, iterator and element access operations may require copy-on-write implementation to 'unshare' the string to enable write access to the referenced characters.

Furthermore, the following operations may also require 'unsharing' the string data to provide contiguous null-terminated character arrays.

c_str data
sharing neutral

For const strings, iterators and element access operations do not inhibit concurrent accesses, nor do references through their results.

Iterators and references may be invalidated by:

  1. 'mutating' operations and
  2. the first 'unsharing' operation following a 'mutating' operation.

Since only the first 'unsharing' operation following a 'mutating' must make the string not sharable, further reference or iterator creations cannot invalidate anything.

Thus the only real anomaly seems to be that creating the initial non-const reference invalidates previously generated const references. This anomaly has unfortunate consequences. Typically, when v offers "container thread safety", we can do

#pragma omp parallel for
for( size_t i = 0, n = v.size(); i < n; ++i ) {
    v[i] = 0;
}

However, for a basic_string v, we must insert v[0]; before the pragma to render the code correct. Such a subtlety is difficult to teach and difficult to review.

Similarly, we would expect to be able to write s[0]+s[1] to add the first two characters of a string. And indeed this is correct if either s is a string or const string. However similar examples become incorrect if only one access to s is as a const string, and the other access is through a non-const reference to the same string. In that case, the second operator[] invocation may invalidate the first character reference before the character value can be obtained.

As these examples illustrate, this issue is not completely new with the introduction of threads, but it is aggravated by it.

There are also performance implications to the current design. In a copy-on-write implementation,

const string empty("");
vector<string> v;

#pragma omp parallel for
for( size_t i = 0, n = v.size(); i < n; ++i ) {
    v[i] = empty;
}

may run very slowly, since each iteration writes to the representation of empty, and thus is likely to generate a conflict cache miss. This issue may be secondary, but it argues that it is really hard to write efficient code if you do not know whether you have a copy-on-write implementation or not.

General Clarification

Any operation that may change the size of a string, or any standard library function that accepts a string by non-const reference, may potentially write to the string representation and thus is not suitable for concurrent operation.

The goal is that any operation that returns an internal reference or pointer can be called concurrently safely, but any write through the reference or pointer is not safe. However, this is not possible with a shared-buffer implementation, such as the classic reference counted variant.

Weak Proposal

Change c_str and data to not invalidate iterators and references.

We chose to leave swap as an invalidating operation to enable the small string optimization.

This change effectively requires null-terminated buffers. [Note: We think the wording is sufficient, but need confirmation.] For implementations that inline critical operations in user code, this change may affect the Application Binary Interface (ABI). The following table indicates whether or not the implementations null-terminate buffers. Entries marked with a '?' are "best guess" and not confirmed. In some cases, it has not been verified that entries apply to all libraries shipped by a vendor. No vendor is known to have a problem with the weak proposal.

compilerlibrary'98'03'0x
Apple libstdc++ yes yes yes 
CodeWarrior CodeWarrior yes yes yes? 
Comeau Dinkumware yes yes yes 
Comeau STLPort yes? yes? yes? 
GNU libstdc++ yes? yes yes? 
HP RW yes yes yes? 
IBM Dinkumware yes yes yes 
Intel/MS Dinkumware yes yes yes 
Intel/Linux libstdc++ yes yes yes? 
Microsoft Dinkumware yes yes yes 
Sun libCstd/RW yes yes yes? 
Sun STLport yes yes yes? 

21.3.1 basic_string general requirements [string.require]

In paragraph 4, edit

References, pointers, and iterators referring to the elements of a basic_string sequence may be invalidated by the following uses of that basic_string object:

21.3.7.1 basic_string accessors [string.accessors]

Edit paragraphs 1 through 4 as follows.

const charT* c_str() const;
const charT* data() const;

Returns: A pointer to the initial element of an array of length size() + 1 whose first size() elements equal the corresponding elements of the string controlled by *this and whose last element is a null character specified by charT().

Requires: The program shall not alter any of the values stored in the array. Nor shall the program treat the returned value as a valid pointer value after any subsequent call to a non-const member function of the class basic_string that designates the same object as this.

const charT* data() const;

Returns: If size() is nonzero, the member returns a pointer to the initial element of an array whose first size() elements equal the corresponding elements of the string controlled by *this. If size() is zero, the member returns a non-null pointer that is copyable and can have zero added to it.

Requires: The program shall not alter any of the values stored in the character array. Nor shall the program treat the returned value as a valid pointer value after any subsequent call to a non-const member function of basic_string that designates the same object as this.

Adjunct Weak Proposal

With the null-termination constraint on strings, it is possible to widen the string interface by allowing element access to the null.

This proposal may change the implementation of some inline string operations. We believe there is no change to existing conforming program behavior.

21.3.5 basic_string element access [string.access]

In paragraph 1, edit

const_reference operator[](size_type pos) const;
reference       operator[](size_type pos);
Requires: pos <= size().
Returns: If pos < size(), returns *(begin() + pos). Otherwise, if pos == size(), the const version returns a reference to a charT() that shall not be modified. Otherwise, the behavior is undefined.
Complexity: for const strings, constant time.

21.3.7.1 basic_string accessors [string.accessors]

To the c_str and data requirements, edit

Returns: A pointer to the initial element of an array of length size() + 1 whose first size() elements equal the corresponding elements of the string controlled by *this and whose last element is a null character specified by charT().

Returns: A pointer p such that p+i == &operator[](i) for each i in [0, size()].

Throws: for const strings, nothing.

Complexity: for const strings, constant time.

Strong Proposal

In addition to the weak proposal, we propose to make all iterator and element access operations safely concurrently executable.

We are increasing the stability of operations even in sequential code.

This change disallows copy-on-write implementations. For those implementations using copy-on-write implementations, this change would also change the Application Binary Interface (ABI). The following table indicates whether or not the implementations use copy-on-write. Entries marked with a '?' are "best guess" and not confirmed. In some cases, it has not been verified that entries apply to all libraries shipped by a vendor. Only one vendor is known to have a problem with the strong proposal.

compilerlibrary'98'03'0x
Apple libstdc++ yes yes no 
CodeWarrior CodeWarrior yes no no? 
Comeau Dinkumware yes no no 
Comeau STLPort no? no? no? 
GNU libstdc++ yes yes no 
HP RW yes yes yes? 
IBM Dinkumware 1 thread? 1 thread no 
Intel/MS Dinkumware yes no no 
Intel/Linux libstdc++ yes yes no 
Microsoft Dinkumware yes no no 
Sun libCstd/RW yes yes no 
Sun STLport no no no 

21.3.1 basic_string general requirements [string.require]

In paragraph 4, edit

References, pointers, and iterators referring to the elements of a basic_string sequence may be invalidated by the following uses of that basic_string object:

Delete paragraph 5.

[Note: These rules are formulated to allow, but not require, a reference counted implementation. A reference counted implementation must have the same semantics as a non-reference counted implementation. [ Example:

string s1("abc");
string::iterator i = s1.begin();
string s2 = s1;
*i = 'a'; // Must modify only s1

end example] —end note]

21.3.7.1 basic_string accessors [string.accessors]

Edit paragraphs 1 through 4 as follows.

const charT* c_str() const;
const charT* data() const;

Returns: A pointer to the initial element of an array of length size() + 1 whose first size() elements equal the corresponding elements of the string controlled by *this and whose last element is a null character specified by charT().

Requires: The program shall not alter any of the values stored in the array. Nor shall the program treat the returned value as a valid pointer value after any subsequent call to a non-const member function of the class basic_string that designates the same object as this.

const charT* data() const;

Returns: If size() is nonzero, the member returns a pointer to the initial element of an array whose first size() elements equal the corresponding elements of the string controlled by *this. If size() is zero, the member returns a non-null pointer that is copyable and can have zero added to it.

Requires: The program shall not alter any of the values stored in the character array. Nor shall the program treat the returned value as a valid pointer value after any subsequent call to a non-const member function of basic_string that designates the same object as this.

Adjunct Strong Proposal

With the null-termination constraint on strings, it is possible to widen the string interface by allowing element access to the null.

The difference between the adjunct weak proposal and the adjunct strong proposal is extending the throws and complexity requirements on const strings to non-const strings.

This proposal may change the implementation of some inline string operations. For implementations without copy-on-write, we believe there is no change to existing conforming program behavior. Implementions with copy-on-write, would have more significant changes.

21.3.5 basic_string element access [string.access]

In paragraph 1, edit

const_reference operator[](size_type pos) const;
reference       operator[](size_type pos);
Requires: pos <= size().
Returns: If pos < size(), returns *(begin() + pos). Otherwise, if pos == size(), the const version returns a reference to a charT() that shall not be modified. Otherwise, the behavior is undefined.
Complexity: constant time.

21.3.7.1 basic_string accessors [string.accessors]

To the c_str and data requirements, edit

Returns: A pointer to the initial element of an array of length size() + 1 whose first size() elements equal the corresponding elements of the string controlled by *this and whose last element is a null character specified by charT().

Returns: A pointer p such that p+i == &operator[](i) for each i in [0, size()].

Throws: nothing.

Complexity: constant time.

Adjunct Adjunct Proposal

With the null-termination access for operator[], as provided in the adjunct proposals, it is also possible to extend that same access to the member function at.

This proposal may change the implementation of some inline string operations. We believe the only change to existing conforming program behavior would be the absence of an exception on s.at(s.size()).

21.3.5 basic_string element access [string.access]

Edit paragraphs 2 through 4 as follows.

const_reference at(size_type pos) const;
reference       at(size_type pos);
Requires: pos < <= size()
Throws: out_of_range if pos >= > size().
Returns: operator[](pos).

Recovering the Loss

The largest potential loss in performance due to a switch away from copy-on-write implementations is the increased consumption of memory for applications with very large read-mostly strings. However, we believe that for those applications ropes are a better technical solution, and recommend a rope proposal be considered for inclusion in Library TR2.