class ref name: kw_parser
class name: kw_parser_o
category-group: strings
layer: 4
header file: z_keyword.h
dependencies:
libraries: libz00.lib libz01.lib libz02.lib libz03.lib libz04.lib

synopsis.
The "keyword parser" consists of a large group of related classes:

keyword_o, kw_error_handler_o, kw_schema_o, cat_parser_o, line_item_parser_o, text_item_parser_o, mword_item_parser_o, mline_item_parser_o, rblock_item_parser_o, mblock_item_parser_o, and special_item_parser_o.

These exist to parse a block of multi-line text and load the contents into a database.

description.
The "keyword parser" object parses a block of text written in a specific format. The "keyword" parsing system translates the data in this visual format to a data-bag format. From data bags, the information can be easily accessed by a program or saved to a database. All of the work this system does involves parsing text of a specific format. In a way, it is like a glorified key-value parser (other software systems exist that manage data in key-value format):

    "item1=<value1> item2=<value2>" ..
For example:
    "type='message packet' proto="tcp/ip" packet_no=450 length=1024"
A 'keyword' identifies the type of information, and a user-supplied schema identifies how the corresponding text can be written. General rules govern the overall appearance. Each keyword is found at the start of a line, and the line consists of the keyword followed by its data. Depending on the type of keyword, the data may continue on subsequent lines. The most common syntax for a keyword item is:
    <keyword>:\t<text>
(the "\t" represents a single TAB character). The keyword is a "word" (any letter, followed by any combination of letters, numbers, or "_"). It must be at the start of a line of text, inside a block of text with 1 or more lines. It is then followed by a delimiter, which separates the keyword from its data. The rest of the line is the keyword's data. Whether this data can repeat, span multiple lines, have sub-data within it, etc. is defined entirely by you, by specifying the 'type' of keyword. Using this class involves 4 stages:

(1) defining a "schema" - a string listing the legal keywords and their types;
(2) creating a keyword parser variable that uses the schema;
(3) converting a block of text into a data bag, by applying the keyword parser variable to some block of text;
(4) using the data, by pulling out whatever is interesting from the resultant data bag.
Given this input text block:

name:	Human Bean
range:	Americas, Asia, Africa, Australia
age:	20,000 years
note:	Human Bean is possibly a distant cousin of Mr. Bean. It is
	vaguely related to the turnip, but can move about, causing
	havoc and destruction everywhere it goes.
And given a schema containing this info:
const string_o HB_Schema =
"schema ( /* <SCHEMA_OPTIONS> */ ) \n\
    keywords\
    (\
        name  (category zkw_Line_Item) \
        age   (category zkw_Line_Item) \
        range (category zkw_Multiword_Item) \
        note  (category zkw_Text) \
    )\
)";
The parser would then produce a recursive data bag containing the following (some parts have been slightly condensed here):
    animal
    (
        name "Human Bean"
        range "< Americas Asia Africa Australia >"
        age "20,000 years"
        note "Human Bean is possibly ..
	    causing havoc and destruction everywhere it goes."
    )
The types of keywords are:

LINE ITEM (zkw_Line_Item) -- the data is found on the same line as the keyword.

REPEATABLE LINE ITEM (zkw_Multiline_Item) -- the data is found 1 per line; lines may repeat.

REPEATABLE WORD (zkw_Multiword_Item) -- a "word" in this context is any text up to a comma or end-of-line. There can be many lines.

REPEATABLE BLOCK (zkw_Repeatable_Block) -- this is like the "text" type, but the block can occur multiple times. If a line belongs to a repeatable block, it must be preceded by 2 occurrences of the "pre-line delimiter" (e.g., 2 tabs).

TEXT (zkw_Text) -- a multi-line block of text. If a tab starts the line, then the rest of the line is part of the text block. The character need not be a tab; you can configure your delimiters and line-start sequences to be whatever you want.

SPECIAL (zkw_Special) -- this is for text that does not fit into any of the previous parsing patterns. This is currently not supported (and possibly never will be). There are no known keywords of this type.
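
To make the differences between the categories concrete, here is a minimal standalone sketch in standard C++. It does not use z_keyword.h and is not how the library is implemented: the Category enum and the shape_values() function are hypothetical names invented for this illustration, the comma is hard-coded as the word delimiter, and joining text lines with a newline is an assumption. Repeatable blocks and the unsupported special type are left out.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for four of the categories listed above; the real
// zkw_* constants come from z_keyword.h and may be defined differently.
enum class Category { LineItem, MultilineItem, MultiwordItem, Text };

// Turn the raw data lines collected for one keyword into a list of values.
// 'lines' holds the keyword's first line of data plus any continuation
// lines, with keyword, separator and line-start tabs already removed.
std::vector<std::string> shape_values(Category cat,
                                      const std::vector<std::string> &lines)
{
    std::vector<std::string> out;
    if (lines.empty())
        return out;

    switch (cat)
    {
    case Category::LineItem:            // one value: the keyword's own line
        out.push_back(lines[0]);
        break;

    case Category::MultilineItem:       // one value per line
        out = lines;
        break;

    case Category::MultiwordItem:       // one value per comma-separated word
        for (const std::string &line : lines)
        {
            std::string::size_type start = 0;
            while (start <= line.size())
            {
                std::string::size_type end = line.find(',', start);
                if (end == std::string::npos)
                    end = line.size();
                std::string word = line.substr(start, end - start);
                // trim surrounding blanks; skip words that are empty
                std::string::size_type b = word.find_first_not_of(" \t");
                std::string::size_type e = word.find_last_not_of(" \t");
                if (b != std::string::npos)
                    out.push_back(word.substr(b, e - b + 1));
                start = end + 1;
            }
        }
        break;

    case Category::Text:                // one value: all lines joined
    {
        std::string text;
        for (std::size_t i = 0; i < lines.size(); ++i)
        {
            if (i > 0)
                text += '\n';           // assumption: line breaks are kept
            text += lines[i];
        }
        out.push_back(text);
        break;
    }
    }
    return out;
}
```

With this sketch, the data "20,000 years" stays a single value when treated as a line item, but would be cut at the comma if the same keyword were declared as a repeatable word. That is the practical reason to pick the category of each keyword with care.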

Here is another sample block of keywords:

keyword	format type
Phone	repeatable word
Note	text
Source	line item

Phone:	(440) 888-1212, 555-1211
Note:	Any text can start a note. If the text spans several lines,
	each line after the first starts with a tab.
Source:	the Wall Street Journal

Text in a keyword block must conform to the rules of keywords in order for it to parse successfully. If an error occurs during parsing, the location of the error (e.g., the number of items successfully parsed) is saved in an output parameter of the parse() function. Thus, you can find out where a format violation is, and go fix it. The parsing process is aborted upon encountering an error.

One primary rule about keyword text: each keyword can appear only once per block of text parsed, i.e., there cannot be duplicate keywords. All keywords are at the beginning of a line, and are followed by a keyword separator: one or more characters that separate the keyword from its data. Usually (but not necessarily), this is a colon (":"). After that follow the "keyword text pre-line delimiters": one or more characters indicating that the text on that line belongs to the keyword. These must be at the start of a line, except for the first line, where they follow immediately after the keyword separator. By default, the pre-line delimiter is simply a tab.
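
The rules above can be sketched in a few dozen lines of standard C++. This is only an illustration of the format, not the library's parse(): ParseResult, is_keyword() and parse_block() are hypothetical names, every keyword is treated as free text regardless of schema, and joining continuation lines with a newline is an assumption.

```cpp
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Outcome of the sketch parser: items in input order, plus how many
// keywords were accepted before the first error (if any).
struct ParseResult
{
    std::vector<std::pair<std::string, std::string>> items;
    int num_ok = 0;
    std::string error;          // empty when the whole block parsed cleanly
};

// A keyword is a letter followed by letters, digits, or '_'.
bool is_keyword(const std::string &w)
{
    if (w.empty() || !std::isalpha(static_cast<unsigned char>(w[0])))
        return false;
    for (char c : w)
        if (!std::isalnum(static_cast<unsigned char>(c)) && c != '_')
            return false;
    return true;
}

// Hypothetical helper (not part of z_keyword.h). 'sep' is the keyword
// separator and 'pre' the pre-line delimiter; parsing stops at the first
// format violation, leaving num_ok at the count of keywords read so far.
ParseResult parse_block(const std::string &block,
                        const std::string &sep = ":",
                        const std::string &pre = "\t")
{
    ParseResult r;
    std::string::size_type pos = 0;
    while (pos < block.size())
    {
        std::string::size_type eol = block.find('\n', pos);
        if (eol == std::string::npos)
            eol = block.size();
        std::string line = block.substr(pos, eol - pos);
        pos = eol + 1;
        if (line.empty())
            continue;

        // A line starting with the pre-line delimiter continues the
        // data of the most recent keyword.
        if (!pre.empty() && line.compare(0, pre.size(), pre) == 0)
        {
            if (r.items.empty())
            {
                r.error = "continuation line before any keyword";
                return r;
            }
            r.items.back().second += "\n" + line.substr(pre.size());
            continue;
        }

        // Otherwise the line must start with "<keyword><sep>".
        std::string::size_type s = line.find(sep);
        std::string key = (s == std::string::npos) ? "" : line.substr(0, s);
        if (!is_keyword(key))
        {
            r.error = "bad keyword line: " + line;
            return r;
        }
        for (const auto &item : r.items)
            if (item.first == key)
            {
                r.error = "duplicate keyword: " + key;
                return r;
            }

        // Drop the pre-line delimiter that follows the separator, if present.
        std::string data = line.substr(s + sep.size());
        if (!pre.empty() && data.compare(0, pre.size(), pre) == 0)
            data.erase(0, pre.size());
        r.items.emplace_back(key, data);
        ++r.num_ok;
    }
    return r;
}
```

Calling parse_block() on a block that repeats a keyword returns with a non-empty error and num_ok set to the number of keywords accepted before the duplicate, which mirrors the idea of reporting how far parsing got.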

Keyword processing is highly configurable. A set of keywords must be loaded prior to parsing. The keyword separator and pre-line delimiter can be set to any string of characters. This is done by simply putting the control characters in a string (e.g., a char *), passing the character string to a textstring_o variable, which in turn becomes an input parameter to your keyword parser object instance (kw_parser_o). See the example below.

The large number of classes found in the header file may look intimidating, but using the keyword parser is easy. After declaring the schema in a character buffer, only 2 lines of code are required - constructing a variable, and calling parse():

    int ie = 0, num_ok;
    kw_parser_o parser("my_object", rec_dbag_o(CLASS_SCHEMA), TRUE, 0);
    rec_dbag_o data_obj = parser.parse (SAMPLE_CLASS, num_ok, &ie);
In this case, CLASS_SCHEMA is the name of a string_o variable containing the schema, and SAMPLE_CLASS is a textstring_o variable containing actual data. Afterwards, the next task is to extract whatever data is desired from the recursive data bag data_obj. You should have a good understanding of data bags prior to using this class.

member functions (primary)

kw_parser_o(<args>)
SIGNATURE: kw_parser_o (const string_o &name, rec_dbag_o &schema, boolean to_stop, int max_errs)
SYNOPSIS: instantiate a keyword parser instance. Does preliminary object configuration.
PARAMETERS
  • name - the name of the parser. You must assign a name (any combination of letters).
  • schema - the 'schema' the parser should use, as a recursive data bag.
  • to_stop - a boolean flag that tells the parsing routine whether to stop when an error occurs. If set to FALSE, the parser will just skip the errant block.
  • max_errs - maximum number of errors to allow. 0 means no limit.
DESCRIPTION:
Setting up the 'schema' data bag is typically done by assigning a char * to a string object, then passing the string object as an argument to an anonymous data bag constructor, like so:
const string_o PATIENT_SCHEMA =
"schema \n\
( \n\
    options\n\
    (\n\
        keyword_dlm     \":  \" \n\
        comment_start   \"#\" \n\
    )\n\
    keywords\n\
    (\n\
        name   (category zkw_line_item) \n\
        phone  (category zkw_multiword_item) \n\
        cond   (category zkw_text) \n\
        incom  (category zkw_line_item) \n\
    )\n\
)";
    kw_parser_o parser ("patient", rec_dbag_o(PATIENT_SCHEMA), TRUE, 0);

 

parse()
SIGNATURE: rec_dbag_o parse (const textstring_o &block, int &lc, int *xpie = 0)
PARAMETERS

  • block - a text string object containing the actual data to parse. Using the example of the constructor's "schema" parameter (above), you might pass in the variable 'PATIENT_01':
    textstring_o PATIENT_01 =
        "name:\tMarina Maksimova\n"
        "cond:\tmild bulimia, and an advanced case of halitosis.\n"
        "\tthe patient seems to be oblivious of her\n"
        "\tcondition, resulting in all doctors refusing\n"
        "\tto examine her.\n"
        "incom:\t500 rubles/year\n";
    
  • lc - "start line number". WARNING: needs further analysis to uncover its meaning/purpose.
  • xpie - standard, optional error flag output variable. pass in a pointer to an int variable if you are interested in seeing the cause of any errors.
 

error_msg()
SIGNATURE: inline const string_o &error_msg (int) const
WARNING: obscure usage (needs further analysis)
 

error_cnt()
SIGNATURE: inline int error_cnt () const
WARNING: obscure usage (needs further analysis)
 

note.
The business class ("business_o") currently has its own, older implementation of the keyword processor system. This separate set of code (to our horror) duplicates the functionality of the keyword processor. Hopefully the code will one day be upgraded. Since it works as-is, and such an upgrade is a low-priority item, this duplication probably still exists today.

The format was originally developed to display data about businesses in a concise format. Relevant contact info about businesses was saved into text files. A concise, easy-to-read format evolved over time. The format for recording data consisted of the business name, followed by its address, or, if many, a list of addresses. This was followed by a block of text where all other data about the business was put. The format for this information evolved into a general-purpose format that can be applied to other things besides businesses. In fact, the format can be used for almost anything (inventory items, dental records, rare fruits taxonomy, automobile parts lists, recipes, etc).

warnings.
Exercise caution when setting up control strings (i.e., keyword_delim, word_delim, line_start, block_start) in the schema string block. Including or excluding a seemingly innocuous character may have consequences. For example, for 'multi-word' items, if "word_delim" is set to only a comma (",") as opposed to a comma and space (", "), you may get unexpected results if your line of data looks something like "red, blue, green": the space preceding each word becomes part of the word!
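
The effect of the delimiter choice is easy to reproduce outside the library. The sketch below uses only standard C++; split_words() is a hypothetical helper written for this illustration, and it deliberately does no trimming, so it shows the raw result of splitting on the exact delimiter string. The library's own behavior may differ (the history section notes that string trimming was added at some point).

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of z_keyword.h): split 'line' on every
// occurrence of the exact delimiter string 'delim', with no trimming.
std::vector<std::string> split_words(const std::string &line,
                                     const std::string &delim)
{
    std::vector<std::string> words;
    if (delim.empty())
    {
        words.push_back(line);          // nothing to split on
        return words;
    }

    std::string::size_type start = 0;
    std::string::size_type pos;
    while ((pos = line.find(delim, start)) != std::string::npos)
    {
        words.push_back(line.substr(start, pos - start));
        start = pos + delim.size();
    }
    words.push_back(line.substr(start)); // text after the last delimiter
    return words;
}
```

Splitting "red, blue, green" with "," yields "red", " blue" and " green" (note the leading spaces), while splitting with ", " yields "red", "blue" and "green".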

examples.
The example below is a complete, simple working example. It shows an object ("plant schema") that has 5 simple fields, including alias, which is a list. The program processes 4 plants. After each parse, every field of the plant object is printed.

#include "stdafx.h"
#include <iostream>
#include "z_func.h"
#include "z_keyword.h"

const string_o Plant_Schema =
"schema\
(\
    options\
    (\
        keyword_delim     \":	\" \
        word_delim        \", \" \
        line_start      \"	\" \
        block_start     \"	\" \
        block_line_start   \"		\" \
        special_start   \"		\" \
        comment_start   \"#\" \
    )\
    keywords\
    (\
        name   (category zkw_Line_Item) \
        botany (category zkw_Line_Item) \
        status (category zkw_Line_Item) \
        alias  (category zkw_Multiword_Item) \
        descr  (category zkw_Text) \
    )\
)";

textstring_o plant_01 =
"name:	Bok Choy\n\
botany:	Brassa rapa\n\
status:	fairly common\n\
alias:	Chinese White Cabbage, Tai sai\n";

textstring_o plant_02 =
"name:	Biddle's Lupine\n\
botany:	Lupinus oreganus\n\
status:	G3\n\
descr:	this plant is a treat for the eyes. Its large, palmately-compound,\n\
	hairy leaves are a vibrant green, set off by a tall spike of white\n\
	flowers. The seeds, about the size of a lentil or slightly larger,\n\
	range in color from light peach to a beautiful brick.\n\
	Lupinus biddlei populations are restricted to two distinct locations\n\
	in two areas of eastern Oregon, separated by approximately 50 km.\n\
	It is currently considered to be vulnerable to extinction.\n";

textstring_o plant_03 =
"name:	Lotus Berthelotii\n\
status:	RARE\n\
descr:	Lotus Berthelotii is a perennial plant native to the Canary Islands,\n\
	in the genus Lotus. It has leaves divided into 3-5 slender leaflets,\n\
	each leaflet 1-2 cm long and 1 mm broad, densely covered with fine\n\
	silvery hairs. Flowers are orange-red to red, and peaflower-shaped.\n\
	This plant is either extinct in the wild or persists as a few\n\
	individuals. In 1884 it was already classed as 'exceedingly rare'.\n\
	Decline was most likely inevitable, however, because of lack of\n\
	pollinators. The plant is obviously adapted to be pollinated by\n\
	birds, but no such birds remain in the Canaries\n";

textstring_o plant_04 =
"name:	Atemoya\n\
botany:	Annona Hybrid\n\
status:	unusual\n\
alias:	Sugar Apple, Sweetsop\n\
descr:	a creamy, pudding-like fruit. Originated in West Indies,\n\
	and grown in Florida.\n";

//----------------------------------------------------------------------
int main (int argc, char *argv[])
{
    int ie = 0, i = 0, j = 0, k = 0, zero = 0;
    z_start();

    textstring_o plant[4];
    plant[0] = plant_01; plant[1] = plant_02;
    plant[2] = plant_03; plant[3] = plant_04;

    string_o snam[5];
    snam[0] = "name";
    snam[1] = "botany";
    snam[2] = "status";
    snam[3] = "alias";                  // TO BE SKIPPED (it's a list)
    snam[4] = "descr";

    // the parser is named 'parser' so that it does not clash with the plant[] array
    kw_parser_o parser("rare_plants", rec_dbag_o(Plant_Schema), TRUE, 0);

    rec_dbag_o my_plant;
    string_o s;

    for (i=0; i < 4; i++)
    {
        my_plant = parser.parse (plant[i], zero, &ie);  // THE MEAT

        for (j = 0; j < 5; j++)
        {
            if (j != 3)
            {
                s = my_plant.get (snam[j], &ie);
                if (!ie)
                {
                    std::cout << snam[j] << ": ";       // print field name
                    std::cout << s << "\n";             // print field value
                }
            }
            else
            {
                // cycle thru the alias list, and print each alias
                array_dbag_o &alias_list
                    = (array_dbag_o &) my_plant.get_dbag ("alias", &ie);
                if (!ie)
                {
                    count_t n = alias_list.size();
                    if (n > 0)
                    {
                        std::cout << "other names: ";
                        for (k = 0; k < n; k++)
                        {
                            s = alias_list.get (k, &ie);
                            std::cout << "\"" << s << "\"";
                            if (k < n-1)
                                std::cout << "; ";
                        }
                        std::cout << std::endl;
                    }
                }
            }
        }

        std::cout << "------------------------\n";
    }

    z_finish();
    return 0;
}

limitations.
If there is a syntax error in the data bag containing the object's schema, the error reporting is not very helpful.

bugs.
"zkw_Special" (keyword format type) is currently BROKEN and should NOT BE USED. Using it can lead to unpredictable behaviour and even crashing the program.

history.

??? 09/12/2002: 'kw_parser_o' created [--AU]
Tue 10/08/2002: checked in kwparse.h -> source-safe; g.p. cleanup
Tue 02/15/2011: renamed header file (was: 'z_kwkeyword.h')
Thu 02/17/2011: added 'zkw_Multiline_Item'
Sun 03/13/2011: added string trimming, & culling trailing delim