class ref name: regex
category-group: strings
layer: 2
header file: z_regex.h

synopsis.
The regex_o class is a general-purpose regular expression parser. With it, you can search string objects (string_o) for substrings that match a particular regular expression. More than one style of regular expression pattern is supported.

There are various implementations of regular expression parsing, each with its own nuances. Vettrasoft has used some of these implementations for this object; in particular, an implementation by Henry Spencer is included. The regex syntax is based on version 8 of BSD unix (see "regexp(3)"), not System V unix.

Unfortunately the language of regular expressions is difficult to standardize. The following search patterns do work for the regex object:

  "(foo|bar|more)" - matches any item in a list. If the search string
  contains any of "foo", "bar", or "more", it will be matched.
  "[A-Z/.]" - matches a single character from the set inside the
  brackets. Here, upper case letters, slash, and dot will match
  (A, B, C, .. Z, '/', or '.').
  "(de\>)*" - matches 0 or more occurrences of a word ending in "de"
  [eg "made"].

description.
The regex_o class provides a simple way to do regular expression searches. The header file contains a number of classes, including regex_o, iregex_o, sun_regex_o, and hs_regex_o. Only the first is intended for public consumption. iregex_o exists as a manager for any and all "implementation" classes that do the actual work. A handle/body idiom (see James Coplien's purple book for more info) is employed so that new implementations can be accommodated in the future.
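
For readers unfamiliar with the handle/body idiom, here is a minimal sketch of the general technique. All names in it (regex_handle, regex_body, spencer_body, sun_body, Style) are hypothetical and are not the actual contents of z_regex.h; it only shows how a public "handle" class can forward to an interchangeable "body" chosen at construction time.

```cpp
#include <memory>
#include <string>

// Hypothetical names throughout; none of these are Vettrasoft classes.
enum class Style { HenrySpencer, Sun };

// "Body": the abstract interface every implementation must provide.
class regex_body {
public:
    virtual ~regex_body() = default;
    virtual std::string name() const = 0;
    virtual std::unique_ptr<regex_body> clone() const = 0;
};

class spencer_body : public regex_body {
public:
    std::string name() const override { return "Henry Spencer"; }
    std::unique_ptr<regex_body> clone() const override {
        return std::make_unique<spencer_body>(*this);
    }
};

class sun_body : public regex_body {
public:
    std::string name() const override { return "Sun"; }
    std::unique_ptr<regex_body> clone() const override {
        return std::make_unique<sun_body>(*this);
    }
};

// "Handle": the only class callers use; it owns a body and forwards to it.
class regex_handle {
public:
    explicit regex_handle(Style how = Style::HenrySpencer)
        : body_(make_body(how)) {}

    // Copying the handle deep-copies the body through clone().
    regex_handle(const regex_handle& rhs) : body_(rhs.body_->clone()) {}
    regex_handle& operator=(const regex_handle& rhs) {
        if (this != &rhs) body_ = rhs.body_->clone();
        return *this;
    }

    std::string implementation() const { return body_->name(); }

private:
    static std::unique_ptr<regex_body> make_body(Style how) {
        if (how == Style::Sun) return std::make_unique<sun_body>();
        return std::make_unique<spencer_body>();
    }
    std::unique_ptr<regex_body> body_;
};
```

Adding another implementation in such a design means writing one more body class and one more enumerator value; code that uses the handle does not change.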

You can specify the type of implementation to use when the object is created, via a simple enumerator parameter. The basic usage of this object is like so:

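    #include "z_regex.h"
    regex_o R;                          // default style: Henry Spencer
    string_o s ("All ze Kings Horses Bade a farewell to Kingz. That is all..");
    string_o s_out;                     // receives the matched text
    size_t j;                           // starting index of the match
    int ie;                             // 0 if a match was found
    R.set_pattern ("(a|b).*z");
    j = R.search (s, &s_out, &ie, 0);   // offset 0: search from the start of s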
Here, a regex object R is created, using the default style "Henry_Spencer". Alternatively, you can use the "Sun" style by changing the first line to:
    regex_o R (z_RegExp_Style_Sun);
The next step is to specify the regular expression pattern, which in this case is "(a|b).*z". This hunts for text that begins with an 'a' or a 'b', is followed by any number of characters, and ends with a 'z'. Finally, the string object "s" is searched. If a match is found, it is put into output parameter "s_out", ie is set to 0, and the return value j is the index (in characters from the start of s) at which the matched text begins.

If there is no matching string in s, ie will be non-zero.
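
The calling convention just described (matched text through an output parameter, an error indicator, and the starting index as the return value) can be sketched with the standard library. This is a stand-in only: search_sketch is a hypothetical function built on std::regex with its default ECMAScript grammar, not the regex_o implementation, and the two may differ in pattern syntax.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical stand-in that mimics the calling convention described above.
// It uses std::regex, NOT regex_o.
std::size_t search_sketch(const std::string& target, const std::string& pattern,
                          std::string* ans, int* pe, std::size_t offset = 0)
{
    *pe = 1;                                   // non-zero until a match is found
    if (offset > target.size())
        return 0;                              // nothing left to search

    const std::regex re(pattern);              // ECMAScript grammar by default
    std::smatch m;
    const auto first =
        target.cbegin() + static_cast<std::string::difference_type>(offset);
    if (!std::regex_search(first, target.cend(), m, re))
        return 0;                              // *ans is left untouched

    *ans = m.str();                            // the matched text
    *pe = 0;                                   // 0 means "match found"
    // m.position(0) is relative to 'first', so add the offset back.
    return offset + static_cast<std::size_t>(m.position(0));
}
```

With the string and pattern from the example above, this stand-in reports the match "ade a farewell to Kingz" starting at index 21: the first lower-case 'a' or 'b' is the 'a' in "Bade", and ".*" runs on to the last 'z'.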

The following list provides examples that are known to work for the default ("Henry Spencer") implementation of regex_o:

ref.  c++ pattern string                                search against
[A]   "\\([a-zA-Z0-9-]+@[a-zA-Z.]+craig\\.org\\)"       the email address (prw-36427@job.craig.org) belongs to an ad
[B]   "[0-9][0-9][0-9][\\./-][0-9][0-9][0-9][\\. -]"    part of a NANP phone #: "978.548-" ( not 123.456 )
[C]   "[0-9][0-9][0-9][\\./-].*[0-9][0-9][0-9][\\. -]"  part of a NANP phone #: "978.548-" ( not 123.456 )
[D]   "(a|b).*z"                                        All ze Kings Horses BADE a farewell to Kingz; oh my..

[A] some vital points:
  (a) to match a parenthesis literally, you must escape it (in c/c++ source code, that means 2 back-slash characters!);
  (b) to match a literal dot ('.'), escape it with a backslash unless it is inside square brackets;
  (c) to match a dash (aka hyphen, or '-') inside square brackets, make it the last character in the brackets: [a-z0-9-], not [-a-z0-9];
  (d) a dot ('.') can go anywhere inside square brackets, as long as it doesn't interfere with a range (eg "0-9").

In example [A], the matched substring is "(prw-36427@job.craig.org)".

[B] The principal thing here, beyond the long strings of number checks, is the grouping of '\', '.', '/', or '-' as matches for a single character. Also note that the target body text has 2 possible matches ("978.548-" and "123.456 "), but only the first is returned.

In example [B], the matched substring is "978.548-".

[C] This differs from [B] only by the addition of ".*" in the middle of the search pattern. This demonstrates "greedy matching": the ".*" gobbles up as many characters as it can, as long as the part of the pattern after the ".*" still finds a match, which here is "456 ".

In example [C], the matched substring is [978.548-" ( not 123.456 ] (shown here between square brackets; it includes the embedded double-quote character and the trailing space).

[D] An elaboration of the example provided earlier on this page; here you can see how a multiple-choice construct like "(a|b)" works. In example [D], the matched substring is "a farewell to Kingz". You may notice that regular expressions are case-sensitive: otherwise, "BADE ".. would have been included in the results. Also note that the match begins at the first position where the pattern can succeed: "arewell to Kingz" would satisfy the pattern too, but it starts later and is shorter, so it is not what is returned.
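
The four examples above can be cross-checked outside of regex_o. The sketch below uses std::regex (default ECMAScript grammar) as a stand-in, which is an assumption: it is not the Henry Spencer implementation, although these particular patterns behave the same way in both as far as the matches reported above go. The names Example, kExamples, first_match and count_agreeing_examples are hypothetical.

```cpp
#include <regex>
#include <string>

// One row of the example table: pattern, text searched, match reported above.
struct Example {
    const char* ref;
    const char* pattern;
    const char* text;
    const char* expected;
};

const Example kExamples[] = {
    {"A", "\\([a-zA-Z0-9-]+@[a-zA-Z.]+craig\\.org\\)",
          "the email address (prw-36427@job.craig.org) belongs to an ad",
          "(prw-36427@job.craig.org)"},
    {"B", "[0-9][0-9][0-9][\\./-][0-9][0-9][0-9][\\. -]",
          "part of a NANP phone #: \"978.548-\" ( not 123.456 )",
          "978.548-"},
    {"C", "[0-9][0-9][0-9][\\./-].*[0-9][0-9][0-9][\\. -]",
          "part of a NANP phone #: \"978.548-\" ( not 123.456 )",
          "978.548-\" ( not 123.456 "},
    {"D", "(a|b).*z",
          "All ze Kings Horses BADE a farewell to Kingz; oh my..",
          "a farewell to Kingz"},
};

// Returns the first substring of 'text' that matches 'pattern', or "" if none.
// std::regex is a stand-in here, not regex_o.
std::string first_match(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str() : std::string();
}

// Counts how many rows produce exactly the match reported in the text above.
int count_agreeing_examples()
{
    int ok = 0;
    for (const Example& e : kExamples)
        if (first_match(e.pattern, e.text) == e.expected)
            ++ok;
    return ok;
}
```

A check like this is no substitute for testing a pattern against regex_o itself, since the two engines do not share a syntax in general.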

member functions (primary)

regex_o()
SIGNATURE: regex_o ()
SYNOPSIS: creates a new regex object, completely devoid of contents.
 

regex_o(regex_o)
SIGNATURE: regex_o (const regex_o &rhs)
SYNOPSIS: creates a new regular expression object; an exact image of the object "rhs".
 

operator = (regex_o)
SIGNATURE: const regex_o &operator = (const regex_o &)
SYNOPSIS: copies exactly the RHS object.
 

destructor
SIGNATURE: ~regex_o ()
SYNOPSIS: virtual destructor. The object instance is reset.
 

regex_o(<args>)
SIGNATURE: regex_o (Regex_Style how)
SYNOPSIS: create a new regex object using the "how" implementation.
PARAMETERS

  • how: one of z_RegExp_Style_Henry_Spencer or z_RegExp_Style_Sun

    regex_o(<args>)
    SIGNATURE: regex_o (const string_o &s)
    SYNOPSIS: create a new regex object, and set the search pattern to "s"
     

    regex_o(<args>)
    SIGNATURE: regex_o (const char *buf)
    SYNOPSIS:
    create a new regex object, and set the search pattern to "buf". Note: this member function was included due to problems with some older compilers in the past.
     

    pattern()
    SIGNATURE: string_o pattern () const
    SYNOPSIS: returns the current search pattern the object has. If there is none, an empty string ("") is returned.
     

    reset()
    SIGNATURE: int reset ()
    SYNOPSIS: resets the object to at-construction state. Any loaded search patterns are destroyed.
     

    clone()
    SIGNATURE: iregex_o *clone () const
    SYNOPSIS: makes a copy of the current object and returns a pointer to the new copy.
     

    set_pattern()
    SIGNATURE: int set_pattern (const string_o &s)
    SYNOPSIS: sets the object's search pattern to "s", which must contain a valid regular expression.
     

    got_match()
    SIGNATURE: boolean got_match (const string_o &target, string_o &match, size_t n) const
    SYNOPSIS:
    search for a pattern match in "target". This member function is a wrapper around function search(), but is designed to be more convenient and easier to use:

    #include <iostream>
    #include "z_regex.h"
    regex_o R;
    size_t j = 0;                           // offset 0: search from the start
    string_o si("[Some days], in some ways"), so;
    R.set_pattern ("\\[[A-Za-z ]+\\]");     // literal '[', letters or spaces, literal ']'
    if (R.got_match(si, so, j))
        std::cout << "found a match!: " << so << std::endl;
    else
        std::cout << "no match.\n";
    
    If a matching sub-string is found in the string to search ('target'), the match is put into output string 'match', and TRUE is returned. If there is no matching substring, FALSE is returned and 'match' is set to an empty string.
    Also, an offset index value must be provided in the [mandatory] 3rd parameter ('n'). To start the search at the beginning of the string, set n to 0.
    PARAMETERS
    • target: the string to be searched
    • match: a string object containing the text that matches the search expression, found in "target". This variable is set to an empty string ("") if no match was found.
    • n: this is both input and output. Set it to the offset (in characters) into 'target' at which to start the search. To search all of 'target', set this value to 0.
    RETURNS:
    TRUE: a match was found
    FALSE: no match
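
One common use of an offset parameter is stepping through every match in a string rather than only the first. The sketch below shows that pattern with std::regex as a stand-in. It is an assumption, not the regex_o implementation: got_match_sketch and all_matches are hypothetical names, and 'n' is taken by reference and advanced past each match, which is only one possible reading of "both input and output".

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical stand-in: searches 'target' from offset 'n'. On success the
// match is stored in 'match' and 'n' is moved to just past the match.
bool got_match_sketch(const std::regex& re, const std::string& target,
                      std::string& match, std::size_t& n)
{
    match.clear();                            // empty string when nothing matches
    if (n > target.size())
        return false;

    std::smatch m;
    const auto first =
        target.cbegin() + static_cast<std::string::difference_type>(n);
    if (!std::regex_search(first, target.cend(), m, re))
        return false;

    match = m.str();
    // End of the match, relative to 'first'.
    const std::size_t end = static_cast<std::size_t>(m.position(0) + m.length(0));
    // Step at least one character so an empty match cannot loop forever.
    n += (m.length(0) > 0) ? end : end + 1;
    return true;
}

// Collects every non-overlapping match of 'pattern' in 'target'.
std::vector<std::string> all_matches(const std::string& pattern,
                                     const std::string& target)
{
    const std::regex re(pattern);
    std::vector<std::string> found;
    std::string match;
    std::size_t n = 0;                        // 0: start at the beginning
    while (got_match_sketch(re, target, match, n))
        found.push_back(match);
    return found;
}
```

Applied to example [B] from the description, this loop yields both candidates, "978.548-" and then "123.456 ", where a single call returns only the first.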
     

    search()
    SIGNATURE: size_t search (const string_o &target, string_o *ans, int *pe, size_t offset = 0) const
    SYNOPSIS:
    initiate a search for a pattern match in "target". This is the primary operation of this class. A search pattern must be set prior to using this call via member function "set_pattern()".
    PARAMETERS

    • target: the string to be searched
    • ans: a string object containing the text that matches the search expression, found in "target". This variable is set only if such a match was found.
    • pe: error indicator output variable. A value of 0 indicates that a match was found. Non-zero indicates nothing in "target" matched the search pattern (or an error occurred).
    • offset: where in "target" to begin the search (the first character is at position 0).
     

    set_expr()
    SIGNATURE: void set_expr (const char *buf)
    SYNOPSIS: THIS FUNCTION IS NOT IMPLEMENTED. IGNORE
     

    set_mode()
    SIGNATURE: void set_mode (const char *buf)
    SYNOPSIS: THIS FUNCTION IS NOT IMPLEMENTED. IGNORE
     

    note.
    This section does not explain what a regular expression is - if you are not familiar with regular expressions, there is a huge amount of literature on the internet and in books. Many other computer languages incorporate them (eg, perl).

    It would behoove you to verify that a particular regular expression functions the way you expect it to before using it in an application. A good way to test this is to run the QA program for regex in interactive mode (available from Vettrasoft).

    warnings.
    CAUTION!: The size of the pattern after internal expansion, using the default method ("Henry Spencer"), must not exceed 256 characters. Currently there is no way to tell how big the resultant expanded buffer is. A rough estimate: a search pattern 114 bytes long (based on one of the examples above) expands to 260 bytes (a factor of about 2.3). A sign that you have a buffer overrun is set_pattern() returning -2.
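
The arithmetic in this warning can be turned into a quick pre-check. The sketch below is only a heuristic built from the two figures quoted above (the 256-character limit and the factor of about 2.3); the names kExpandedLimit, estimated_expanded_size and probably_fits are hypothetical and are not part of the library. The real expansion depends on the pattern, so the return value of set_pattern() remains the authority.

```cpp
#include <cstddef>

// Limit quoted in the warning above.
constexpr std::size_t kExpandedLimit = 256;

// Rough estimate of the expanded size: pattern length times ~2.3,
// done in integer arithmetic (23/10) and rounded up.
constexpr std::size_t estimated_expanded_size(std::size_t pattern_len)
{
    return (pattern_len * 23 + 9) / 10;
}

// True if the estimate stays within the limit. This is a heuristic only.
constexpr bool probably_fits(std::size_t pattern_len)
{
    return estimated_expanded_size(pattern_len) <= kExpandedLimit;
}
```

By this estimate, patterns of up to roughly 111 bytes stay under the limit, while the 114-byte pattern mentioned above does not.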

    bugs.
    When using regex_o in searches, it may on rare occasion crash.