Inti Tutorial: String Handling

The most basic task that applications using GTK+ have to handle when dealing with international text is manipulating strings. The strings in the GTK+ interfaces are handled in the multi-byte encoding for the current locale. This allows good compatibility with existing applications that aren't explicitly enabled for multi-byte support. UTF-8 is the multi-byte encoding standard used by GTK+.

UTF-8 is an efficient encoding of Unicode character strings that recognizes the fact that the majority of text-based communications are in ASCII, and it therefore optimizes the encoding of these characters. Most code translates directly to UTF-8 with no changes at all, but because UTF-8 is a variable-length multibyte encoding you cannot calculate the number of characters from the number of bytes alone. There is also a small performance hit for working in UTF-8, probably about 5%, but this is more than offset by its advantages:

UTF-8 preserves the uniqueness of ASCII characters: bytes in the ASCII range never occur inside a multibyte sequence, so you won't mistake part of a non-ASCII character for an ASCII character.
UTF-8 is self-synchronizing: you can always distinguish a lead byte from a continuation (fill) byte, so you will never be mistaken about the beginning or the length of a multibyte character. You can start parsing backwards at the end or in the middle of a multibyte string and will soon find a synchronization point.
UTF-8 is a reasonably compact encoding: ASCII characters are not inflated, most other alphabetic characters occupy only two bytes each, no basic Unicode character needs more than three bytes and all extended Unicode characters can be expressed with four bytes.
Byte-wise comparison of UTF-8 multibyte character strings preserves the lexicographic sorting and tree-search order, and there are no byte-order problems.
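
The first two properties can be illustrated with a short sketch using only std::string. The helper names is_continuation(), utf8_char_count() and utf8_sync_back() are hypothetical and are not part of Inti or GLib; the code assumes well-formed UTF-8 input.

```cpp
#include <cstddef>
#include <string>

// True if b is a UTF-8 continuation ("fill") byte, i.e. of the form 10xxxxxx.
inline bool is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Count characters by counting the bytes that are NOT continuation bytes.
// Every character has exactly one lead byte, so this differs from size()
// as soon as the string contains a non-ASCII character.
inline std::size_t utf8_char_count(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s)
        if (!is_continuation(b)) ++n;
    return n;
}

// From an arbitrary byte position (byte_pos < s.size()), step backwards to
// the lead byte of the character containing it: the synchronization point.
inline std::size_t utf8_sync_back(const std::string& s, std::size_t byte_pos) {
    while (byte_pos > 0 &&
           is_continuation(static_cast<unsigned char>(s[byte_pos])))
        --byte_pos;
    return byte_pos;
}
```

For a string holding "a", a two-byte character and a three-byte character, size() reports six bytes while utf8_char_count() reports three characters.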

The GTK+ UTF-8 functions are declared in <glib/unicode.h>. If you look through this header file you will soon realize that a lot of extra work is required on your part when working with UTF-8.

In Inti, the use of UTF-8 strings is completely transparent because Inti includes its own std::string-compatible UTF-8 string class, called String, that does this work for you. String is declared in <inti/utf-string.h> and wraps most of the GLIB UTF-8 string functions in a standard string-like interface. For the most part, you can use String in the same way that you would use std::string. There are, however, a few important differences that you need to be aware of.

String is implemented using an internal std::string as a byte array. This allows construction from a std::string and simple conversion to a std::string with the method:

const std::string& str();

str() returns a const reference to the internal std::string, allowing the user to pass the String to functions that expect a std::string.

String's std::string-like methods use the corresponding std::string name but the meaning of two of the argument types is different. In a std::string method pos refers to a byte position within the string and n refers to the number of bytes. In an Inti::String method char_pos refers to a character position within the String, byte_pos refers to a byte position within the String, n_chars refers to the number of characters and n_bytes refers to the number of bytes. Internally, methods that take an n_chars argument have to parse the input string or character array for the number of valid UTF-8 characters, and this takes time. Therefore, you can improve efficiency by using methods that don't need to know the number of characters.

Another efficiency measure is in the implementation of the substring search methods. The find(), rfind(), find_first_of(), find_last_of(), find_first_not_of(), find_last_not_of() methods take the byte position from which to start their search and return the byte position of the first element found or npos if unsuccessful. This is the same as with std::string. For example, the find() methods are:

size_t find(const char *s, size_t byte_pos, size_t n_chars) const;
size_t find(const String& str, size_t byte_pos = 0) const;
size_t find(const char *s, size_t byte_pos = 0) const;
size_t find(char c, size_t byte_pos = 0) const;
size_t find(gunichar c, size_t byte_pos = 0) const;

A byte_pos of 0 implies the beginning of the string, which is usually where you start from. The return value, advanced past the match just found, is then passed back to the search method as the byte_pos for the next search, and so on until the method returns npos.
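
Since these methods follow the std::string convention, the search loop can be sketched with std::string::find() itself. This is an illustration only: find_all() is a hypothetical helper, and the same loop is assumed to work with String::find() because both take and return byte positions.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Collect the byte position of every non-overlapping occurrence of needle.
// Each search starts at the byte position just past the previous match.
inline std::vector<std::size_t> find_all(const std::string& haystack,
                                         const std::string& needle) {
    std::vector<std::size_t> hits;
    if (needle.empty()) return hits;  // avoid an endless loop on ""
    std::size_t byte_pos = 0;
    while ((byte_pos = haystack.find(needle, byte_pos)) != std::string::npos) {
        hits.push_back(byte_pos);
        byte_pos += needle.size();    // skip past this match
    }
    return hits;
}
```

Because the positions are byte positions, no UTF-8 parsing is needed between searches, which is where the efficiency gain comes from.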

You can convert from a character offset within a String to an integer byte index by calling:

size_t index(size_t char_pos) const;
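
To show what such a conversion involves, here is a minimal sketch over a plain std::string. The function char_to_byte_index() is hypothetical and is not the Inti implementation; it assumes well-formed UTF-8, and its behaviour at and past the end of the string is an assumption made for the example.

```cpp
#include <cstddef>
#include <string>

// Walk the string counting lead bytes until char_pos characters have been
// skipped. Returns s.size() when char_pos equals the character count and
// std::string::npos when char_pos lies beyond the end.
inline std::size_t char_to_byte_index(const std::string& s,
                                      std::size_t char_pos) {
    std::size_t chars_seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        bool is_lead = (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80;
        if (is_lead) {
            if (chars_seen == char_pos) return i;
            ++chars_seen;
        }
    }
    return chars_seen == char_pos ? s.size() : std::string::npos;
}
```

The linear walk is the reason character positions cost more than byte positions: the work grows with the offset, whereas a byte index is used directly.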

You can convert from a constant pointer to a position within a String to an integer character offset by calling:

size_t offset(const_pointer p) const;

You can convert an integer character offset within a String to a constant pointer to a position within the string by calling:

const_pointer pointer(size_t char_pos) const;

As with std::string, the size() method returns the number of bytes stored in a String. To get the number of characters in a String you must instead call:

size_t length() const;

An Inti::String includes the concept of being null. This is to simplify passing a string to a function that accepts a C-string and/or assigning a C-string to an Inti::String. A null string can only be constructed with the following call:

String s(0);

but you would never do this. What you could do is something like this:

String s = gtk_some_method_that_returns_a_c_string();

If gtk_some_method_that_returns_a_c_string() returns a null pointer the Inti::String will be null and the null() method will return true.

bool null() const;

When you want to pass a C-string to a function you call the following method:

const char* c_str() const { return null() ? (char*)0 : string_.c_str(); }

As you can see c_str() is an inline method that returns a null pointer if the string is null, otherwise it calls the internal std::string's c_str() method.
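
The null behaviour can be mimicked in a few lines. The class below, NullableString, is a hypothetical stand-in written for illustration; it is not the Inti class and only models the constructor, null() and c_str() behaviour described above.

```cpp
#include <string>

// Hypothetical model of a string that remembers whether it was built
// from a null pointer.
class NullableString {
public:
    // A null pointer yields a null string; anything else is copied.
    NullableString(const char* s)
        : string_(s ? s : ""), null_(s == nullptr) {}

    bool null() const { return null_; }

    // A null string hands back a null pointer, otherwise the internal
    // std::string's buffer is returned.
    const char* c_str() const { return null_ ? nullptr : string_.c_str(); }

    const std::string& str() const { return string_; }

private:
    std::string string_;
    bool null_;
};
```

Note that an empty string and a null string are different things in this model: NullableString("") is not null, and its c_str() returns a valid pointer to an empty C-string.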

G::Unichar operator[](size_t char_pos) const;

The index operator returns the UTF-8 character at char_pos as a G::Unichar, by value and not as a reference. G::Unichar is a gunichar wrapper class and is declared in <inti/glib/unichar.h>.

void format(const char *message_format, ...);

Another useful method is format(), which lets you do inline sprintf-style text formatting. Calling format() is equivalent to formatting a temporary character array and then calling assign(). Any characters in the string before format() is called will be replaced by the characters in the formatted text string.
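
The "format a temporary array, then assign" idea can be sketched as a free function on std::string. The name format_into() is hypothetical, and using vsnprintf() is an assumption made for the example; the source does not say how Inti formats the text internally.

```cpp
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Format into a temporary buffer, then assign the result to out, replacing
// whatever out held before. On a formatting error out is left unchanged.
inline void format_into(std::string& out, const char* message_format, ...) {
    va_list args;
    va_start(args, message_format);
    va_list args_copy;
    va_copy(args_copy, args);

    // First pass: ask vsnprintf how many bytes the result needs.
    int needed = std::vsnprintf(nullptr, 0, message_format, args);
    va_end(args);
    if (needed < 0) {
        va_end(args_copy);
        return;
    }

    // Second pass: format into a buffer with room for the terminator.
    std::vector<char> buffer(static_cast<std::size_t>(needed) + 1);
    std::vsnprintf(buffer.data(), buffer.size(), message_format, args_copy);
    va_end(args_copy);

    out.assign(buffer.data(), static_cast<std::size_t>(needed));
}
```

A call such as format_into(s, "%d item%s", 3, "s") discards the previous contents of s and leaves it holding the formatted text.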

String upper();
String upper(size_t char_pos, size_t n_bytes = npos);
String lower();
String lower(size_t char_pos, size_t n_bytes = npos);

The upper() and lower() methods return a new String converted correctly to UTF-8 upper or lower case.

You can check the validity of the UTF-8 characters in a String by calling one of the following methods:

bool validate(size_t& byte_pos) const;
bool validate(const_pointer *end = 0) const;

Both methods return true if the String is a valid UTF-8 string. After returning, the byte_pos or end argument refers to the first invalid byte, or to the end of the string if the whole string is valid.
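
To make clear what validation has to check, here is a minimal standalone sketch over std::string. The function utf8_validate() is hypothetical and is not the Inti or GLib implementation; it reports the byte position where the first malformed sequence starts, which is an assumption about how "first invalid byte" should be read.

```cpp
#include <cstddef>
#include <string>

// Returns true if s is well-formed UTF-8. On return, byte_pos is the start
// of the first malformed sequence, or s.size() if the string is valid.
inline bool utf8_validate(const std::string& s, std::size_t& byte_pos) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp, min_cp;
        if (lead < 0x80)                { len = 1; cp = lead;        min_cp = 0; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min_cp = 0x10000; }
        else break;                     // stray continuation or invalid lead byte
        if (len > n - i) break;         // sequence truncated by end of string
        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates and values above U+10FFFF.
        if (!ok || cp < min_cp || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            break;
        i += len;
    }
    byte_pos = i;
    return i == n;
}
```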

A word about iterators. String defines its own iterators that know how to iterate over UTF-8 characters in a forward direction (iterator) or reverse direction (reverse_iterator). These iterators are used just like std::string iterators, but note that std::string iterators step through bytes rather than characters, so they can't be used to walk the characters of a UTF-8 string.
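
The difference between stepping by byte and stepping by character can be seen in a minimal forward iterator. Utf8Iterator and code_points() below are hypothetical names for a sketch that assumes well-formed UTF-8; they are not the Inti iterators, whose interface is richer.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Minimal forward iterator that advances one UTF-8 character at a time.
class Utf8Iterator {
public:
    Utf8Iterator(const std::string& s, std::size_t byte_pos)
        : s_(&s), pos_(byte_pos) {}

    // Decode the character that starts at the current byte position.
    std::uint32_t operator*() const {
        unsigned char lead = static_cast<unsigned char>((*s_)[pos_]);
        std::size_t len = sequence_length(lead);
        std::uint32_t cp = lead & (len == 1 ? 0x7Fu : (0x7Fu >> len));
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) |
                 (static_cast<unsigned char>((*s_)[pos_ + k]) & 0x3Fu);
        return cp;
    }

    // Skip the whole multibyte sequence, not just one byte.
    Utf8Iterator& operator++() {
        pos_ += sequence_length(static_cast<unsigned char>((*s_)[pos_]));
        return *this;
    }

    bool operator!=(const Utf8Iterator& other) const {
        return pos_ != other.pos_;
    }

private:
    // The lead byte alone determines the length of the sequence.
    static std::size_t sequence_length(unsigned char lead) {
        if (lead < 0x80) return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        return 4;
    }

    const std::string* s_;
    std::size_t pos_;
};

// Usage: collect every code point of a UTF-8 string.
inline std::vector<std::uint32_t> code_points(const std::string& s) {
    std::vector<std::uint32_t> out;
    for (Utf8Iterator it(s, 0), end(s, s.size()); it != end; ++it)
        out.push_back(*it);
    return out;
}
```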

String defines its own standard I/O stream operators so you can use a String with any stream using the >> and << operators. There are also equivalence operators so you can compare two strings, or a string and a character array, using the == and != operators.

The String class is declared in <inti/utf-string.h> and exports many more methods than discussed here. Most class methods take a String argument as a reference and return a String by value. For efficiency, methods that usually require a string literal do not take a String but take a const char*.
