Chapter 3. Using the String Classes

Manipulating strings is probably one of your most common tasks. Many developers say it is also the most error-prone. The Tools.h++ classes RWCString and RWWString give you the constructors, operators, and member functions you need to create, manipulate, and delete strings easily.

The member functions of class RWCString read, compare, store, restore, concatenate, prepend, and append RWCString objects and char*s. Its operators allow access to individual characters, with or without bounds checking. And the class automatically takes care of memory management: you never need to create or delete storage for the string's characters.

Class RWWString is similar to RWCString, except that RWWString works with wide characters. Since the interfaces of the two classes are similar, they can be easily interchanged. Details of these classes are described in the Class Reference. This section gives you some general examples of how RWCString works, followed by discussions of selected features of the string classes.

An Introductory Example

The following example calls on several essential features of the string classes. It shows how you might use RWCString to substitute a new version number for the old ones in a piece of documentation.

#include <rw/cstring.h>
#include <rw/regexp.h>
#include <rw/rstream.h>
 
int main(){
  RWCString a;                     //1 create string object a
 
  RWCRegexp re("V[0-9]\\.[0-9]+"); //2 define regular expression  
  while( a.readLine(cin) ){        //3 read standard input into a 
    a(re) = "V4.0";                //4 replace matched expression
    cout << a << endl;
  }
  return 0;
}

Program Input:

This text describes V1.2.  For more
information see the file install.doc.
The current version V1.2 implements...

Program Output:

This text describes V4.0.  For more
information see the file install.doc.
The current version V4.0 implements...

The program creates an RWCString object a, reads lines from standard input into a, and searches each line for a pattern matching the regular expression "V[0-9]\\.[0-9]+". A match is a version number whose major part is a single digit, V0 through V9, followed by a period and one or more digits: for example, V1.2 and V1.22, but not V12.3. When a match is found, it is replaced with the string "V4.0".

The power of this operation lies in the expression:

a(re) = "V4.0";

where () is an example of an overloaded operator. As you know, an overloaded operator is one which can perform more than one function, depending on context or argument.

In the example, the function call operator RWCString::operator() is overloaded to take an argument of type RWCRegexp, the regular expression. The operator returns either a substring that delimits the regular expression, or a null substring if a matching expression cannot be found. The program then calls the substring assignment operator, which replaces the delimited string with the contents of the right hand side, or does nothing if this is the null substring. Because Tools.h++ provides the overloaded operator, you can do a search and replace on the defined regular expression all in a single line.

You will notice that you need two backslashes in "V[0-9]\\.[0-9]+" to indicate that the special character "." is to be read literally as a decimal point. That's because the compiler removes one backslash when it evaluates a literal string. The remaining backslash alerts the regular expression evaluator to read whatever character follows literally.
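You can check the effect of the doubled backslash without Tools.h++ at hand. The sketch below is only a stand-in: it assumes a C++11 compiler and uses std::regex from the Standard C++ Library instead of RWCRegexp (this particular pattern happens to be valid in both syntaxes). The names replaceFirstVersion and checkEscaping are invented for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Replace the first version number of the form V<digit>.<digits>.
// std::regex stands in for RWCRegexp here; it is not the Tools.h++ API.
std::string replaceFirstVersion(const std::string& line,
                                const std::string& newVersion)
{
    // The source text holds two backslashes; the compiler stores one,
    // so the regex engine receives  V[0-9]\.[0-9]+
    const std::regex re("V[0-9]\\.[0-9]+");
    return std::regex_replace(line, re, newVersion,
                              std::regex_constants::format_first_only);
}

// Hypothetical self-check: what the compiler stores, what the engine matches.
void checkEscaping()
{
    // 14 characters are stored, not 15: "\\" collapses to one backslash.
    assert(std::string("V[0-9]\\.[0-9]+").size() == 14);
    // A raw string literal spells the same pattern without doubling.
    assert(std::string(R"(V[0-9]\.[0-9]+)") == "V[0-9]\\.[0-9]+");

    assert(replaceFirstVersion("describes V1.2.", "V4.0") == "describes V4.0.");
    assert(replaceFirstVersion("describes V12.3", "V4.0") == "describes V12.3");
}
```

Call checkEscaping() from your own main() to run the checks. The second assertion shows that a raw string literal is one way to avoid the doubling altogether.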

In the following fragment, RWCString uses another overloaded operator, +, to concatenate the strings s1 and s2. The global function toUpper returns an upper-case copy of the result, which is sent to cout:

RWCString s1, s2;
cin >> s1 >> s2;
cout << toUpper(s1+s2);

See the Class Reference for details on the string classes.

Lexicographic Comparisons

If you're putting together a dictionary, you'll find the lexicographic comparison operators of RWCString particularly useful. They are:

RWBoolean operator==(const RWCString&, const RWCString&);
RWBoolean operator!=(const RWCString&, const RWCString&);
RWBoolean operator< (const RWCString&, const RWCString&);
RWBoolean operator<=(const RWCString&, const RWCString&);
RWBoolean operator> (const RWCString&, const RWCString&);
RWBoolean operator>=(const RWCString&, const RWCString&);

These operators are case sensitive. If you wish to make case insensitive comparisons, you can use the member function:

int RWCString::compareTo(const RWCString& str, 
                        caseCompare = RWCString::exact) const;

Here the function returns an integer less than zero, equal to zero, or greater than zero, depending on whether self is lexicographically less than, equal to, or greater than str. The type caseCompare is an enum with values:

exact          Case sensitive
ignoreCase     Case insensitive

Its default setting is exact, which gives the same result as the comparison operators ==, !=, etc.
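To make the two comparison modes concrete, here is a minimal sketch of a three-way comparison written against std::string. It is not the Tools.h++ implementation: the function compareStrings and the local enum CaseCompare are hypothetical, and the checks assume an ASCII execution character set.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

enum CaseCompare { exact, ignoreCase };   // hypothetical mirror of the enum

// Three-way comparison: negative if lhs sorts before rhs, zero if they
// are equal, positive if lhs sorts after rhs.
int compareStrings(const std::string& lhs, const std::string& rhs,
                   CaseCompare cmp = exact)
{
    const std::size_t n = lhs.size() < rhs.size() ? lhs.size() : rhs.size();
    for (std::size_t i = 0; i < n; ++i) {
        unsigned char a = static_cast<unsigned char>(lhs[i]);
        unsigned char b = static_cast<unsigned char>(rhs[i]);
        if (cmp == ignoreCase) {
            a = static_cast<unsigned char>(std::tolower(a));
            b = static_cast<unsigned char>(std::tolower(b));
        }
        if (a != b) return a < b ? -1 : 1;
    }
    // Equal up to the shorter length: the shorter string sorts first.
    if (lhs.size() == rhs.size()) return 0;
    return lhs.size() < rhs.size() ? -1 : 1;
}

// Hypothetical self-check; assumes ASCII, where 'Z' (90) < 'a' (97).
void checkCompare()
{
    assert(compareStrings("Zebra", "apple") < 0);
    assert(compareStrings("Zebra", "apple", ignoreCase) > 0);
    assert(compareStrings("Hello", "hello", ignoreCase) == 0);
    assert(compareStrings("abc", "abcd") < 0);
}
```

The first two checks show why the distinction matters for a dictionary: with exact comparison every upper-case word sorts ahead of every lower-case one.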

For locale-specific string collations, you would use the member function:

int RWCString::collate(const RWCString& str) const;

which is an encapsulation of the Standard C library function strcoll(). This function returns results computed according to the locale-specific collating conventions set by category LC_COLLATE of the Standard C library function setlocale(). Because this is a relatively expensive calculation, you may want to pretransform one or more strings using the global function:

RWCString strXForm(const RWCString&);

then use compareTo() or one of the comparison operators, ==, !=, etc., on the results. See the Class Reference entry for RWCString: the function strXForm appears under related global functions.
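The pretransform idea can be tried directly with the Standard C library functions that these members wrap. The following sketch uses std::strxfrm() and std::strcoll() on std::string; the helper names transformForCollation, signOf, and checkCollation are made up for this illustration, and nothing here should be read as the way strXForm is actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Transform s once so that later byte-wise comparisons of transformed
// strings agree with strcoll() on the originals (current LC_COLLATE).
// Stops at the first null byte, as any C-string function would.
std::string transformForCollation(const std::string& s)
{
    // With a zero-length buffer, strxfrm() only reports the length needed.
    const std::size_t n = std::strxfrm(nullptr, s.c_str(), 0);
    std::vector<char> buf(n + 1);
    std::strxfrm(buf.data(), s.c_str(), buf.size());
    return std::string(buf.data(), n);
}

// Reduce any integer to -1, 0, or +1 so results can be compared.
int signOf(int v) { return (v > 0) - (v < 0); }

// Hypothetical self-check: both routes must order the strings the same way.
void checkCollation()
{
    const std::string a = "apple", b = "banana";
    const std::string ta = transformForCollation(a);
    const std::string tb = transformForCollation(b);
    assert(signOf(std::strcoll(a.c_str(), b.c_str())) ==
           signOf(std::strcmp(ta.c_str(), tb.c_str())));
}
```

The transformation pays off when the same strings are compared many times, as in a sort: each string is transformed once, and every later comparison is a plain byte comparison.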

Substrings

A separate RWCSubString class supports substring extraction and modification. There are no public constructors; RWCSubStrings are constructed indirectly by various member functions of RWCString, and destroyed at the first opportunity.

You can use substrings in a variety of situations. For example, you can create a substring with RWCString::operator(), then use it to initialize an RWCString:

RWCString s("this is a string");
// Construct an RWCString from a substring:
RWCString s2 = s(0, 4);                               // "this"

The result is a string s2 that contains a copy of the first four characters of s.

You can also use RWCSubStrings as lvalues in an assignment to a character string, or to an RWCString or RWCSubString:

// Construct an RWCString:
RWCString article("the");
RWCString s("this is a string");
s(0, 4) = "that";                           // "that is a string"
s(8, 1) = article;                        // "that is the string"

Note that assignment to a substring is not a conformal operation: the two sides of the assignment operator need not have the same number of characters.
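If you are curious how a substring can act as an lvalue, the following sketch shows one common technique, a proxy object that writes through to its owner. It is an illustration only, built on std::string: the classes Text and SubText are hypothetical stand-ins, and the source does not say that RWCString and RWCSubString are implemented this way.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

class Text;  // hypothetical owner class standing in for RWCString

// Hypothetical proxy standing in for RWCSubString: it remembers which
// part of the owner it covers and writes through on assignment.
class SubText {
public:
    SubText& operator=(const std::string& replacement);
    bool isNull() const { return owner_ == nullptr; }
private:
    friend class Text;                 // only Text may construct a SubText
    SubText(Text* owner, std::size_t pos, std::size_t len)
        : owner_(owner), pos_(pos), len_(len) {}
    Text* owner_;                      // null for the "null substring"
    std::size_t pos_, len_;
};

class Text {
public:
    explicit Text(const std::string& s) : data_(s) {}
    // In this sketch, an out-of-range request yields the null substring.
    SubText operator()(std::size_t pos, std::size_t len) {
        const bool ok = pos <= data_.size() && len <= data_.size() - pos;
        return ok ? SubText(this, pos, len) : SubText(nullptr, 0, 0);
    }
    const std::string& str() const { return data_; }
private:
    friend class SubText;
    std::string data_;
};

inline SubText& SubText::operator=(const std::string& replacement) {
    if (owner_) {                      // assigning to the null substring is a no-op
        owner_->data_.replace(pos_, len_, replacement);
        len_ = replacement.size();     // the proxy now covers the new text
    }
    return *this;
}

// Hypothetical self-check reproducing the assignments shown above.
void checkSubstringProxy()
{
    Text s("this is a string");
    s(0, 4) = "that";
    assert(s.str() == "that is a string");
    s(8, 1) = "the";                   // 1 character replaced by 3
    assert(s.str() == "that is the string");
}
```

Because the proxy's constructor is private, user code can obtain one only through the owner's function call operator, which mirrors the "no public constructors" rule described at the start of this section.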

Pattern Matching

Class RWCString supports a convenient interface for string searches. In the example below, the code fragment:

RWCString s("curiouser and curiouser.");
size_t i = s.index("curious");

will find the start of the first occurrence of curious in s. The comparison will be case sensitive, and the result will be that i is set to 0. To find the index of the next occurrence, you would use:

i = s.index("curious", ++i);

which will result in i set to 14. You can make a case-insensitive comparison with:

RWCString s("Curiouser and curiouser.");
size_t i = s.index("curious", 0, RWCString::ignoreCase);

which will also result in i set to 0.

If the pattern does not occur in the string, index() will return the special value RW_NPOS.
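The search-then-resume idiom generalizes into a loop that collects every occurrence. The sketch below shows the loop with std::string::find(), where std::string::npos plays the role that RW_NPOS plays for index(); the function findAll is a hypothetical helper, not part of either library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Collect the starting index of every occurrence of pattern in s,
// resuming each search one position past the previous hit (so
// overlapping occurrences are reported too).
std::vector<std::size_t> findAll(const std::string& s,
                                 const std::string& pattern)
{
    std::vector<std::size_t> hits;
    if (pattern.empty()) return hits;       // avoid a hit at every index
    for (std::size_t i = s.find(pattern);
         i != std::string::npos;            // npos: "not found" sentinel
         i = s.find(pattern, i + 1))
        hits.push_back(i);
    return hits;
}

// Hypothetical self-check using the string from the text.
void checkFindAll()
{
    const std::vector<std::size_t> hits =
        findAll("curiouser and curiouser.", "curious");
    assert(hits.size() == 2 && hits[0] == 0 && hits[1] == 14);
    assert(findAll("curiouser and curiouser.", "furious").empty());
}
```

Always compare against the sentinel before using the returned index; it is a very large unsigned value, not -1.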

Simple Regular Expressions

As part of its pattern matching capability, the Tools.h++ Class Library supports regular expression searches. See the Class Reference, under RWCRegexp, for details of the regular expression syntax. You can use a regular expression to return a substring; for example, here's how you might match all Windows messages (prefix WM_):

#include <rw/cstring.h>
#include <rw/regexp.h>
#include <rw/rstream.h>
 
int main(){
  RWCString a("A message named WM_CREATE");
 
  // Construct a Regular Expression to match Windows messages:
  RWCRegexp re("WM_[A-Z]*");
  cout << a(re) << endl;
 
  return 0;
}

Program Output:

WM_CREATE

The function call operator for RWCString has been overloaded to take an argument of type RWCRegexp. It returns an RWCSubString matching the expression, or the null substring if there is no match.

Extended Regular Expressions

This version of the Tools.h++ class library supports extended regular expression searches based on the POSIX.2 standard. (See the bibliography in Appendix D.) Extended regular expressions are the regular expressions used in the UNIX utilities lex and awk. You will find details of the regular expression syntax in the Class Reference under RWCRExpr.


Note: RWCRExpr is available only if your compiler supports exception handling and the C++ Standard Library.

Extended regular expressions can be of any length, limited only by available memory. You can use parentheses to group subexpressions, and the symbol | to create either/or regular expressions for pattern matching.
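Grouping and alternation can be tried out with std::regex in its POSIX extended mode, where parentheses group and | separates alternatives. This is a sketch under assumptions: std::regex is a stand-in for RWCRExpr, whose exact syntax is documented in the Class Reference, and the names isRegnalName and checkGrouping are invented here.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True if s is one of three names, optionally followed by a space and
// a numeral. Parentheses group; | separates the alternatives.
bool isRegnalName(const std::string& s)
{
    static const std::regex re("(Lisa|Betty|Eliza)( (I|II|III))?",
                               std::regex::extended);
    return std::regex_match(s, re);   // the whole string must match
}

// Hypothetical self-check.
void checkGrouping()
{
    assert(isRegnalName("Betty II"));
    assert(isRegnalName("Eliza"));
    assert(!isRegnalName("Betty IV"));    // IV is not among the alternatives
    assert(!isRegnalName("Elizabeth"));   // trailing "beth" is left over
}
```

The outer group followed by ? makes the whole numeral part optional, something that cannot be expressed without grouping.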

The following example shows some of the capabilities of extended regular expressions:

#include "rw/rstream.h"
#include "rw/re.h"
 
int main (){
  RWCRExpr  re("Lisa|Betty|Eliza");
  RWCString s("Betty II is the Queen of England.");
 
  s.replace(re, "Elizabeth");
  cout << s << endl;
 
  s = "Leg Leg Hurrah!";
  re = "Leg";
  s.replace(re, "Hip", RWCString::all);
  cout << s << endl;
} 
 

Program Output:

Elizabeth II is the Queen of England.
Hip Hip Hurrah! 

Note that the function call operator for RWCString has also been overloaded to take an argument of type RWCRExpr. It returns an RWCSubString matching the expression, or the null substring if there is no match.

String I/O

Class RWCString offers a rich I/O facility to and from both iostreams and Rogue Wave virtual streams.

iostreams

The standard left-shift and right-shift operators have been overloaded to work with iostreams and RWCStrings:

ostream& operator<<(ostream& stream, const RWCString& cstr);
istream& operator>>(istream& stream, RWCString& cstr);

The semantics parallel the operators:

ostream& operator<<(ostream& stream, const char*);
istream& operator>>(istream& stream, char* p);

which are defined by the Standard C++ Library that comes with your compiler. In other words, the left-shift operator << writes a null-terminated string to the given output stream. The right-shift operator >> reads a single token, delimited by white space, from the input stream into the RWCString, replacing the previous contents.

Other functions allow finer tuning of RWCString input[2]. For instance, function readLine() reads strings separated by newlines. It has an optional parameter controlling whether white space is skipped before storing characters. You can see the difference skipping white space makes in the following example:

#include <rw/cstring.h>
#include <iostream.h>
#include <fstream.h>
 
int main(){
  RWCString line;

  { int count = 0;
    ifstream istr("testfile.dat");

    // Use default value: skip whitespace
    while (line.readLine(istr))
      count++;
    cout << count << " lines, skipping whitespace.\n";
  }

  { int count = 0;
    ifstream istr("testfile.dat");

    // NB: Do not skip whitespace
    while (line.readLine(istr, FALSE))
      count++;
    cout << count << " lines, not skipping whitespace.\n";
  }

  return 0;
}

Program Input:

line 1
 
 
 
line 5

Program Output:

2 lines, skipping whitespace.
5 lines, not skipping whitespace.
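The two counts can be reproduced with standard iostreams alone, which may help you see where the difference comes from. The sketch below approximates the behavior described above with std::getline() and the std::ws manipulator; it is not how readLine() is implemented, and countLines and checkCountLines are hypothetical names.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Count the lines in a stream. When skipWhite is true, leading white
// space (including newlines, hence whole blank lines) is discarded
// before each line is read.
int countLines(std::istream& in, bool skipWhite)
{
    std::string line;
    int count = 0;
    for (;;) {
        if (skipWhite) in >> std::ws;        // discard blanks and newlines
        if (!std::getline(in, line)) break;  // nothing left to read
        ++count;
    }
    return count;
}

// Hypothetical self-check using the five-line input shown above.
void checkCountLines()
{
    const std::string text = "line 1\n \n \n \nline 5\n";
    std::istringstream skipping(text), keeping(text);
    assert(countLines(skipping, true) == 2);
    assert(countLines(keeping, false) == 5);
}
```

Because a newline is itself white space, skipping white space swallows the three blank lines along with their contents, which is why only two lines are counted.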

Virtual Streams

String operators to and from Rogue Wave virtual streams are also supported:

RWvistream&  operator>>(RWvistream& vstream, RWCString& cstr);
RWvostream&  operator<<(RWvostream& vstream, 
                        const RWCString& cstr);

By using these operators, you can save and restore a string without knowing its formatting. See Chapter 6 for details on virtual streams.

Tokenizer

You can use the class RWCTokenizer to break up a string into tokens separated by arbitrary white space. Here's an example:

#include <rw/ctoken.h>
#include <rw/cstring.h>
#include <rw/rstream.h>
 
int main(){
  RWCString a("a string with five tokens");
 
  RWCTokenizer next(a);
 
  int i = 0;
 
  // Advance until the null string is returned:
  while( !next().isNull() ) i++;
 
  cout << i << endl;
  return 0;
}

Program Output:

5

This program counts the number of tokens in the string. The function call operator for class RWCTokenizer has been overloaded to mean "advance to the next token and return it as an RWCSubString," much like other Tools.h++ iterators. When there are no more tokens, it returns the null substring. Class RWCSubString has a member function isNull() which returns TRUE if the substring is the null substring. Hence, the loop is broken. See the Class Reference under RWCTokenizer for details.
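The "call it again to get the next token" pattern is easy to imitate with the Standard C++ Library. The class below is a hypothetical stand-in, not RWCTokenizer: it copies the string into a std::istringstream and returns std::string tokens, using an empty string where the real class returns the null substring.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical stand-in for RWCTokenizer: each call yields the next
// white-space-delimited token, or an empty string when none remain.
class WordTokenizer {
public:
    explicit WordTokenizer(const std::string& s) : in_(s) {}
    std::string operator()()
    {
        std::string token;     // starts empty on every call
        in_ >> token;          // stays empty once the input is exhausted
        return token;
    }
private:
    std::istringstream in_;
};

// Advance until the empty string is returned, counting tokens.
int countTokens(const std::string& s)
{
    WordTokenizer next(s);
    int n = 0;
    while (!next().empty()) ++n;
    return n;
}

// Hypothetical self-check using the string from the example above.
void checkTokenizer()
{
    assert(countTokens("a string with five tokens") == 5);
    assert(countTokens("   ") == 0);
}
```

Overloading the function call operator lets the tokenizer object read like a function at the call site, while the stream position it carries between calls stays hidden inside the object.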

Multibyte Strings

Class RWCString provides limited support for multibyte strings, sometimes used in representing various alphabets (see Chapter 16). Because a multibyte character can consist of two or more bytes, the length of a string in bytes may be greater than or equal to the number of actual characters in the string.

If the RWCString contains multibyte characters, you should use member function mbLength() to return the number of characters. On the other hand, if you know that the RWCString does not contain any multibyte characters, then the results of length() and mbLength() will be the same, and you may want to use length() because it is much faster. Here's an example using a multibyte string stored in the variable Sun:

RWCString Sun("\306\374\315\313\306\374");
cout << Sun.length();                               // Prints "6"
cout << Sun.mbLength();                             // Prints "3"

The string in Sun is the name of the day Sunday in Kanji, using the EUC (Extended UNIX Code) multibyte code set. With the EUC, a single character may be 1 to 4 bytes long. In this example, the string Sun consists of 6 bytes, but only 3 characters.
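To see why the byte count and the character count differ, it helps to walk the bytes by hand. The sketch below does this for the Japanese variant of EUC (EUC-JP) only, assuming well-formed input: a lead byte of 0x8F starts a 3-byte character, a lead byte of 0x8E or in the range 0xA1 to 0xFE starts a 2-byte character, and anything else is a single byte. The function eucJpLength is a hypothetical helper written for this illustration; it says nothing about how mbLength() is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count characters in a well-formed EUC-JP string by inspecting each
// lead byte to learn how many bytes the character occupies.
std::size_t eucJpLength(const std::string& s)
{
    std::size_t chars = 0;
    for (std::size_t i = 0; i < s.size(); ++chars) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        if (lead == 0x8F)                       i += 3;  // 3-byte character
        else if (lead == 0x8E || lead >= 0xA1)  i += 2;  // 2-byte character
        else                                    i += 1;  // single byte
    }
    return chars;
}

// Hypothetical self-check using the bytes of the string Sun.
void checkEucJpLength()
{
    const std::string sun("\306\374\315\313\306\374");
    assert(sun.size() == 6);          // length in bytes
    assert(eucJpLength(sun) == 3);    // length in characters
    assert(eucJpLength("abc") == 3);  // plain ASCII: one byte each
}
```

The loop has to visit every character, which is the reason a character count is slower than a byte count that is simply stored with the string.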

In general, the second or later byte of a multibyte character may be null. This means the length in bytes of a character string may or may not match the length given by strlen(). Internally, RWCString makes no assumptions[3] about embedded nulls, and hence can be used safely with character sets that use null bytes. You should also keep in mind that while RWCString::data() always returns a null-terminated string, there may be earlier nulls in the string. All of these effects are summarized in the following program:

#include <rw/cstring.h>
#include <rw/rstream.h>
#include <string.h>
int main() {
  RWCString a("abc");                                        // 1
  RWCString b("abc\0def");                                   // 2
  RWCString c("abc\0def", 7);                                // 3

  cout << a.length();                               // Prints "3"
  cout << strlen(a.data());                         // Prints "3"

  cout << b.length();                               // Prints "3"
  cout << strlen(b.data());                         // Prints "3"

  cout << c.length();                               // Prints "7"
  cout << strlen(c.data());                         // Prints "3"
  return 0;
}

You will notice that two different constructors are used above. The constructor in lines 1 and 2 takes a single argument of const char*, a null-terminated string. Because it takes a single argument, it may be used in type conversion (ARM 12.3.1). The length of the result is determined the usual way, by the number of bytes before the null. The constructor in line 3 takes a const char* and a run length. The constructor will copy this many bytes, including any embedded nulls.

The length of an RWCString in bytes is always given by RWCString::length(). Because the string may include embedded nulls, this length may not match the results given by strlen().
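The same two-constructor distinction exists in std::string, which makes it a convenient place to experiment. The sketch below is an analogy only, using the Standard C++ Library rather than RWCString; hasEmbeddedNull and checkEmbeddedNull are hypothetical helpers.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// A string has an embedded null exactly when strlen() stops before
// the stored length is reached.
bool hasEmbeddedNull(const std::string& s)
{
    return std::strlen(s.c_str()) != s.size();
}

// Hypothetical self-check mirroring strings b and c from the program.
void checkEmbeddedNull()
{
    const std::string b("abc\0def");       // stops at the first null: 3 bytes
    const std::string c("abc\0def", 7);    // explicit length: all 7 bytes
    assert(b.size() == 3 && !hasEmbeddedNull(b));
    assert(c.size() == 7 && hasEmbeddedNull(c));
    assert(std::strlen(c.c_str()) == 3);   // C functions see only "abc"
}
```

A check like this is worth running before handing a string's data to any C function that expects a null-terminated argument.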

Remember that indexing and other operators (basically, all functions using an argument of type size_t) work in bytes. Hence, these operators will not give character-based results for RWCStrings containing multibyte characters.

Wide Character Strings

Class RWWString, also used in representing various alphabets, is similar to RWCString except it works with wide characters. These are much easier to manipulate than multibyte characters because they are all the same size: the size of a wchar_t.

Tools.h++ makes it easy to convert back and forth between multibyte and wide character strings. Here's an example of how to do it, built on the Sun example in the previous section:

#include <rw/cstring.h>
#include <rw/wstring.h>
#include <assert.h>
int main() {
  RWCString Sun("\306\374\315\313\306\374");
  RWWString wSun(Sun, RWWString::multiByte); // MBCS to wide string

  RWCString check = wSun.toMultiByte();
  assert(Sun==check);                                       // OK
  return 0;
}

Basically, you convert from a multibyte string to a wide string by using the special RWWString constructor:

RWWString(const char*, multiByte_);

The parameter multiByte_ is an enum with a single possible value, multiByte, as shown in the example. The multiByte argument ensures that this relatively expensive conversion is not done inadvertently. The conversion from a wide character string back to a multibyte string, using the function toMultiByte(), is similarly expensive.

If you know that your RWCString consists entirely of ASCII characters, you can greatly reduce the cost of the conversion in both directions. This is because the conversion involves a simple manipulation of high-order bits:

#include <rw/cstring.h>
#include <rw/wstring.h>
#include <assert.h>
int main() {
  RWCString EnglishSun("Sunday");                 // ASCII string
  assert(EnglishSun.isAscii());                             // OK

  // Now convert from ASCII to wide characters:
  RWWString wEnglishSun(EnglishSun, RWWString::ascii);

  assert(wEnglishSun.isAscii());                            // OK
  RWCString check = wEnglishSun.toAscii();
  assert(check==EnglishSun);                                // OK
  return 0;
}

Note how the member functions RWCString::isAscii() and RWWString::isAscii() are used to ensure that the strings consist entirely of ASCII characters. The RWWString constructor:

RWWString(const char*, ascii_);

is used to convert from ASCII to wide characters. The parameter ascii_ is an enum with a single possible value, ascii.

The member function RWWString::toAscii() is used to convert back.
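The reason the ASCII path is cheap can be seen in a few lines of standard C++: each byte simply becomes one wide character with the same value, and no locale tables are consulted. The following sketch works on std::string and std::wstring; the functions isAscii, widenAscii, and narrowAscii are hypothetical helpers that only imitate the behavior described above.

```cpp
#include <cassert>
#include <string>

// True if every byte is in the 7-bit ASCII range.
bool isAscii(const std::string& s)
{
    for (unsigned char c : s)
        if (c > 0x7F) return false;
    return true;
}

// True if every wide character is in the 7-bit ASCII range.
bool isAscii(const std::wstring& w)
{
    for (wchar_t c : w)
        if (static_cast<unsigned long>(c) > 0x7F) return false;
    return true;
}

// Widen an all-ASCII string: each byte becomes one wchar_t.
std::wstring widenAscii(const std::string& s)
{
    assert(isAscii(s));                    // precondition of the fast path
    return std::wstring(s.begin(), s.end());
}

// Narrow an all-ASCII wide string: each wchar_t becomes one byte.
std::string narrowAscii(const std::wstring& w)
{
    assert(isAscii(w));                    // precondition of the fast path
    std::string out;
    out.reserve(w.size());
    for (wchar_t c : w) out.push_back(static_cast<char>(c));
    return out;
}

// Hypothetical self-check: round trip, and rejection of multibyte data.
void checkAsciiConversion()
{
    const std::string english("Sunday");
    assert(widenAscii(english) == L"Sunday");
    assert(narrowAscii(widenAscii(english)) == english);
    assert(!isAscii(std::string("\306\374\315\313\306\374")));
}
```

The precondition checks matter: applied to a multibyte string, this byte-by-byte widening would produce one wide character per byte rather than one per character.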



[2] Details about the methods readFile(), readLine(), readString(istream&), readToDelim(), and readToken() may be found in the RWCString section of the Class Reference.

[3] However, system functions to transfer multibyte strings may make such assumptions. RWCString simply calls such functions to provide such transformations.