String Handling
The most basic task that applications using GTK+ have to handle when
dealing with international text is manipulating strings. The strings in
the GTK+ interfaces are handled in the multi-byte encoding for the
current locale. This allows good compatibility with existing
applications that aren't explicitly enabled for multi-byte support.
UTF-8 is the multi-byte encoding standard used by GTK+.
UTF-8 is an efficient encoding of Unicode
character-strings that recognizes the fact that the majority of
text-based communications are in ASCII, and it therefore optimizes the
encoding of these characters. Most code translates directly to UTF-8
with no changes at all, but because UTF-8 is a variable-length multibyte
encoding you cannot calculate the number of characters from the mere
number of bytes. Also, there is a small performance cost for working in
UTF-8, probably about 5%, but this is more than offset by its
advantages:
- UTF-8 preserves the uniqueness of ASCII characters, so you won't
mistake any byte of a non-ASCII character for an ASCII character.
- UTF-8 is self-segregating: you can always distinguish a lead byte
from a fill byte and you will never be mistaken about the beginning or
the length of a multibyte character. You can start parsing backwards at
the end or in the middle of a multibyte string and will soon find a
synchronization point.
- UTF-8 is a reasonably compact encoding: ASCII characters are not
inflated, most other alphabetic characters occupy only two bytes each,
no basic Unicode character needs more than three bytes and all extended
Unicode characters can be expressed with four bytes.
- UTF-8 multibyte character strings preserve the lexicographic
sorting and tree-search order and there are no byte-order problems.
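The first two properties can be illustrated with a few lines of plain C++. This is only a sketch that works on a std::string holding valid UTF-8; the helper names is_fill_byte(), count_chars() and sync_backwards() are made up for this example and are not part of GLib or Inti.

```cpp
#include <cstddef>
#include <string>

// True if b is a UTF-8 fill (continuation) byte, i.e. has the form 10xxxxxx.
// Lead bytes and ASCII bytes never match this pattern.
inline bool is_fill_byte(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

// Count characters by counting every byte that is not a fill byte.
// Assumes s already holds valid UTF-8.
std::size_t count_chars(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char b : s)
        if (!is_fill_byte(b))
            ++n;
    return n;
}

// Step backwards from an arbitrary byte position (at most s.size()) to the
// first byte of the character that contains it: the synchronization point.
std::size_t sync_backwards(const std::string& s, std::size_t byte_pos)
{
    while (byte_pos > 0 && is_fill_byte(static_cast<unsigned char>(s[byte_pos])))
        --byte_pos;
    return byte_pos;
}
```

Because a character occupies at most four bytes, sync_backwards() never has to step back more than three bytes in valid input.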
The GLib UTF-8 functions used by GTK+ are declared in <glib/gunicode.h>.
If you look through this header file you will soon realize that a lot
of extra work is required on your part when working with UTF-8.
In Inti, the use of UTF-8 strings is completely transparent because
Inti includes its own std::string-compatible UTF-8 string class,
called String, that does this work for you. String is declared in <inti/utf-string.h>
and wraps most of the GLib UTF-8 string functions in a std::string-like
interface. For the most part, you can use String in the same
way that you would use std::string. There are, however, a few important
differences that you need to be aware of.
String is implemented using an internal std::string as a byte array.
This allows construction from a std::string and simple conversion to a
std::string with the method:
const std::string& str();
str() returns a const reference to the internal std::string, allowing
the user to pass the String to functions that expect a std::string.
String's std::string-like methods use the
corresponding std::string names, but the meaning of two of the argument
types is different. In a std::string method, pos refers to a
byte position within the string and n refers to the number of
bytes. In an Inti::String method, char_pos refers to a character
position within the String, byte_pos refers to a byte position
within the String, n_chars refers to the number of characters
and n_bytes refers to the number of bytes. Internally, methods
that take an n_chars argument have to parse the input string or
character array for the number of valid UTF-8 characters, and this takes
time. Therefore, you can improve efficiency by using methods that don't
need to know the number of characters.
Another efficiency measure is in the
implementation of the substring search methods. The find(), rfind(),
find_first_of(), find_last_of(), find_first_not_of() and find_last_not_of()
methods take the byte position from which to start their search and
return the byte position of the first element found, or npos if
unsuccessful. This is the same as with std::string. For example, the
find() methods are:
size_t find(const char *s, size_t byte_pos, size_t n_chars) const;
size_t find(const String& str, size_t byte_pos = 0) const;
size_t find(const char *s, size_t byte_pos = 0) const;
size_t find(char c, size_t byte_pos = 0) const;
size_t find(gunichar c, size_t byte_pos = 0) const;
A byte_pos of 0 implies the beginning of
the string, which is usually where you start from. The return value is
then passed back to the search method as the byte_pos for the
next search, and so on until you are done.
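The search loop described above might look like the following sketch. It uses std::string::find(), which follows the same byte-position convention; it does not use Inti's String, and find_all() is a name invented for this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Collect the byte position of every non-overlapping occurrence of needle.
std::vector<std::size_t> find_all(const std::string& haystack, const std::string& needle)
{
    std::vector<std::size_t> positions;
    if (needle.empty())
        return positions;  // avoid looping forever on an empty pattern

    // A byte_pos of 0 starts the search at the beginning of the string.
    std::size_t byte_pos = haystack.find(needle, 0);
    while (byte_pos != std::string::npos) {
        positions.push_back(byte_pos);
        // Resume just past the match; the result is again a byte position.
        byte_pos = haystack.find(needle, byte_pos + needle.size());
    }
    return positions;
}
```

Note that the positions collected are byte positions. If the text before a match contains multibyte characters, the character offset of that match is smaller than its byte position.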
You can convert from a character offset within a String to an integer
byte index by calling:
size_t index(size_t char_pos) const;
You can convert from a constant pointer to a position within a String
to an integer character offset by calling:
size_t offset(const_pointer p) const;
You can convert an integer character offset within
a String to a constant pointer to a position within the string by
calling:
const_pointer pointer(size_t char_pos) const;
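To see why these conversions cost time, here is a rough sketch of what converting between character offsets and byte indexes involves. It works on a plain std::string that is assumed to hold valid UTF-8; byte_index() and char_offset() are hypothetical helpers written for this example, not the Inti methods themselves.

```cpp
#include <cstddef>
#include <string>

// Byte index of the character at char_pos, or s.size() if char_pos is at
// or past the end of the string. Assumes s holds valid UTF-8.
std::size_t byte_index(const std::string& s, std::size_t char_pos)
{
    std::size_t i = 0;
    while (i < s.size()) {
        if (char_pos == 0)
            return i;          // i is the first byte of the wanted character
        --char_pos;
        ++i;                   // skip the lead byte...
        while (i < s.size() && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)
            ++i;               // ...and any fill bytes that follow it
    }
    return i;
}

// Character offset of the character that starts at byte_pos.
std::size_t char_offset(const std::string& s, std::size_t byte_pos)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < byte_pos && i < s.size(); ++i)
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
            ++n;               // count every byte that starts a character
    return n;
}
```

Both helpers have to walk the string from the beginning, which is why byte positions are the cheaper currency to pass around when you have the choice.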
As with std::string the size() method returns the
number of allocated bytes in a String. To get the number of characters
in String you must instead call:
An Inti::String includes the concept of being null.
This is to simplify passing a string to a function that accepts a
C-string and/or assigning a C-string to an Inti::String. A null string
can only be constructed with the following call:
but you would never do this. What you could do is something like this:
String s = gtk_some_method_that_returns_a_c_string();
If gtk_some_method_that_returns_a_c_string()
returns a null pointer the Inti::String will be null and the null()
method will return true.
When you want to pass a C-string to a function you call the following
method:
const char* c_str() const { return null() ? (char*)0 : string_.c_str(); }
As you can see c_str() is an inline method that
returns a null pointer if the string is null, otherwise it
calls the internal std::string's c_str() method.
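The idea of a string that remembers whether it came from a null pointer can be sketched in a few lines. NullableString below is a hypothetical stand-in written only to illustrate the behaviour described above; it is not Inti's String class and its constructor is an assumption.

```cpp
#include <string>

// Hypothetical illustration of a string with a null state.
class NullableString
{
    std::string string_;
    bool null_;

public:
    // A null C pointer produces a null string; anything else is copied.
    NullableString(const char* s) : string_(s ? s : ""), null_(s == nullptr) {}

    bool null() const { return null_; }

    // Hand the null state back to C code instead of an empty string.
    const char* c_str() const { return null_ ? nullptr : string_.c_str(); }
};
```

The point of the extra flag is that a null string and an empty string stay distinguishable: the first round-trips to a null pointer, the second to a pointer to "".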
G::Unichar operator[](size_t char_pos) const;
The index operator returns the UTF-8 character at char_pos
as a G::Unichar, by value and not as a reference. G::Unichar is a
gunichar wrapper class and is declared in <inti/glib/unichar.h>.
void format(const char *message_format, ...);
Another useful method is format(), which lets you
do inline sprintf-style text formatting. Calling format() is equivalent
to formatting a temporary character array and then calling assign(). Any
characters in the string before format() is called will be replaced by
the characters in the formatted text string.
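The "format a temporary character array, then assign" pattern can be sketched with the standard library as follows. format_into() is a hypothetical free function acting on a std::string; how Inti implements format() internally is not shown in this tutorial, so treat this as one possible approach.

```cpp
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Format into a temporary buffer, then assign the result to target,
// replacing whatever target held before.
void format_into(std::string& target, const char* message_format, ...)
{
    va_list args;
    va_start(args, message_format);
    va_list args_copy;
    va_copy(args_copy, args);   // the argument list is consumed twice

    // First pass: measure the required length, excluding the terminator.
    int length = std::vsnprintf(nullptr, 0, message_format, args);
    va_end(args);

    if (length < 0) {           // formatting error: leave target untouched
        va_end(args_copy);
        return;
    }

    // Second pass: format into a temporary array of exactly the right size.
    std::vector<char> buffer(static_cast<std::size_t>(length) + 1);
    std::vsnprintf(buffer.data(), buffer.size(), message_format, args_copy);
    va_end(args_copy);

    target.assign(buffer.data(), static_cast<std::size_t>(length));
}
```

As with sprintf, the format string and its arguments must agree; the compiler cannot check this for you in every case.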
String upper();
String upper(size_t char_pos, size_t n_bytes = npos);
String lower();
String lower(size_t char_pos, size_t n_bytes = npos);
The upper() and lower() methods return a new String converted correctly
to UTF-8 upper or lower case.
You can check the validity of the UTF-8 characters in a String by
calling one of the following methods:
bool validate(size_t& byte_pos) const;
bool validate(const_pointer *end = 0) const;
Both methods return true if the String is
a valid UTF-8 string. After returning, the byte_pos and end
arguments point to the first invalid byte or the end of the string.
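If you are curious what validation involves, the following is a minimal sketch of a UTF-8 validator over a std::string. validate_utf8() is a hypothetical helper written for this example, not the Inti method; on failure it reports the position where the first invalid sequence starts.

```cpp
#include <cstddef>
#include <string>

// Returns true if s is valid UTF-8. On return, byte_pos is the start of
// the first invalid sequence, or s.size() if the whole string is valid.
bool validate_utf8(const std::string& s, std::size_t& byte_pos)
{
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra;      // number of fill bytes expected
        unsigned long cp;       // code point being decoded
        unsigned long min_cp;   // smallest code point allowed at this length

        if (lead < 0x80)                { extra = 0; cp = lead;        min_cp = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min_cp = 0x10000; }
        else break;             // stray fill byte or impossible lead byte

        if (extra > s.size() - i - 1)
            break;              // sequence is cut off by the end of the string

        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, UTF-16 surrogates and values past U+10FFFF.
        if (!ok || cp < min_cp || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            break;

        i += extra + 1;
    }
    byte_pos = i;
    return i == s.size();
}
```

Validating text once, at the point where it enters your program, lets the rest of the code assume well-formed UTF-8.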
A word about iterators. String defines its own iterators that know how
to iterate over UTF-8 characters in a forward direction (iterator) or
reverse direction (reverse_iterator). These iterators are used just like
std::string iterators, but note that std::string iterators step through
bytes, so they can't be used to iterate over the characters of a UTF-8
string.
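The difference between stepping through bytes and stepping through characters can be shown with a small decoding loop. This sketch assumes the input is valid UTF-8; decode_chars() is a hypothetical helper for this example and says nothing about how String's iterators are implemented.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Walk a valid UTF-8 string one character at a time and collect the
// Unicode code point of each character.
std::vector<char32_t> decode_chars(const std::string& s)
{
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len;        // total bytes in this character
        char32_t cp;            // payload bits taken from the lead byte
        if (lead < 0x80)      { len = 1; cp = lead; }
        else if (lead < 0xE0) { len = 2; cp = lead & 0x1F; }
        else if (lead < 0xF0) { len = 3; cp = lead & 0x0F; }
        else                  { len = 4; cp = lead & 0x07; }

        // Each fill byte contributes its low six bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);

        out.push_back(cp);
        i += len;               // advance by one character, not one byte
    }
    return out;
}
```

A std::string iterator over the same data would visit every byte individually, including the fill bytes in the middle of a character.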
String defines its own standard i/o stream operators so you can pass a
String to any stream using the >> and << operators. There
are also equivalence operators so you can compare two strings or a string
and a character array using the == and != operators.
The String class is declared in <inti/utf-string.h>
and exports many more methods than discussed here. Most class methods
take a String argument as a reference and return a String by value. For
efficiency, methods that usually require a string literal do not take a
String but take a const char*.