category-group: strings (virtual group - kwparser)
layer(s): 4

header file(s): z_keyword.h

classes in this group: kw_parser_o

support/internal classes: keyword_o, kw_error_handler_o, kw_schema_o, cat_parser_o,
                                            line_item_parser_o, text_item_parser_o, mword_item_parser_o,
                                            mline_item_parser_o, rblock_item_parser_o, mblock_item_parser_o,
                                            special_item_parser_o

description.
The "KeyWord-PARSE" object parses a block of text in a specific format. The "keyword" parsing system translates the data in this visual format to a data-bag format. From data bags, the information can be easily accessed via a program, or saved to a database. Most (actually, all) of the work this system does involves parsing text of a specific format. In a way, it is like a glorified key-value parser on steroids. Other software systems exist that manage data in a format like this:

    "item1=<value1> item2=<value2>" ..
for example:
    "type='message packet' proto="tcp/ip" packet_no=450 length=1024"
A 'keyword' identifies the type of information, and a user-supplied schema identifies how the corresponding text can be written. General rules govern the overall appearance. Each keyword is found at the start of a line, and the line consists of the keyword followed by its data. Depending on the type of keyword, the data may continue on subsequent lines. The most common syntax for a keyword item is:
    <keyword>:\t<text>
(the "\t" represents a single TAB character). The keyword is a "word" (any letter, followed by any combination of letters, numbers, or "_"). It must be at the start of a line of text, inside a block of text with 1 or more lines. It is then followed by a delimiter, which separates the keyword from its data. The rest of the line is the keyword's data. Whether this data can repeat, span multiple lines, have sub-data within it, etc. is defined entirely by you. This is done by specifying the 'type' of keyword. There are 4 stages in using this class:

(1) defining a "schema" - a string listing the legal keywords and their types;
(2) creating a keyword parser variable that uses the schema;
(3) converting a block of text into a data bag, by applying the keyword parser variable to some block of text;
(4) using the data, by pulling out whatever is interesting from the resultant data bag.
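
The "<keyword>:\t<text>" rule described above can be sketched in standalone C++. This is only an illustration under stated assumptions: is_keyword() and split_keyword_line() are hypothetical helpers that are not part of z_keyword.h, and the default delimiter (a colon followed by a TAB) is assumed from the sample text rather than taken from the library source.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not in z_keyword.h).
// A keyword is a letter followed by any mix of letters, digits, or '_'.
bool is_keyword(const std::string &word)
{
    if (word.empty() || !std::isalpha(static_cast<unsigned char>(word[0])))
        return false;
    for (char c : word)
        if (!std::isalnum(static_cast<unsigned char>(c)) && c != '_')
            return false;
    return true;
}

// Hypothetical helper (not in z_keyword.h).
// Splits "<keyword><delim><text>" into its two parts. Returns false when
// the line does not begin with a valid keyword followed by the delimiter.
bool split_keyword_line(const std::string &line,
                        std::string &keyword, std::string &text,
                        const std::string &delim = ":\t")
{
    std::string::size_type pos = line.find(delim);
    if (pos == std::string::npos)
        return false;                       // no delimiter on this line
    std::string candidate = line.substr(0, pos);
    if (!is_keyword(candidate))
        return false;                       // text before delimiter is not a "word"
    keyword = candidate;
    text = line.substr(pos + delim.size());
    return true;
}
```

Under these assumptions, a line such as "name:<TAB>Human Bean" splits into the keyword "name" and the text "Human Bean", while a continuation line that starts with a TAB is rejected because it carries no keyword.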
Given this input text block:

name:	Human Bean
range:	Americas, Asia, Africa, Australia
age:	20,000 years
note:	Human Bean is possibly a distant cousin of Mr. Bean. It is
	vaguely related to the turnip, but can move about, causing
	havoc and destruction everywhere it goes.
And given a schema containing this info:
const string_o HB_Schema =
"schema ( /* <SCHEMA_OPTIONS> */ ) \n\
    keywords\
    (\
        name  (category zkw_Line_Item) \
        age   (category zkw_Line_Item) \
        range (category zkw_Multiword_Item) \
        note  (category zkw_Text) \
    )\
)";
the parser would produce a recursive data bag containing the following (some parts have been slightly condensed here):
    animal
    (
        name "Human Bean" range "< Americas Asia Africa Australia >" age "20,000 years"
        note "Human Bean is possibly ..
	    causing havoc and destruction everywhere it goes."
    )
The types of keywords are:

LINE ITEM (zkw_Line_Item) -- the data is found on the same line as the keyword.

REPEATABLE LINE ITEM (zkw_Multiline_Item) -- the data is found 1 per line; lines may repeat.

REPEATABLE WORD (zkw_Multiword_Item) -- a "word" in this context is any text up to a comma or end-of-line. There can be many lines.

REPEATABLE BLOCK (zkw_Repeatable_Block) -- this is like the TEXT type, but the block can occur multiple times. If a line belongs to a repeatable block, it must be preceded by 2 occurrences of the "pre-line delimiter" (e.g., 2 tabs).

TEXT (zkw_Text) -- a multi-line block of text. If a tab starts the line, then the rest of that line is part of the text block. Actually, the character need not be a tab - you can configure your delimiters and line-start sequences to be whatever you want.

SPECIAL (zkw_Special) -- this is for text that does not fit into any of the previous parsing patterns. This is currently not supported (and possibly never will be). There are no known keywords of this type.
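
The difference between a TEXT continuation line (1 pre-line delimiter) and a REPEATABLE BLOCK line (2 pre-line delimiters) can be sketched by counting the delimiters at the start of a line. This is a simplified illustration, not the library's code: LineKind, count_line_starts() and classify() are hypothetical names, and the real parser may use separately configured start sequences for text and for blocks.

```cpp
#include <string>

// Hypothetical classification, for illustration only.
enum class LineKind { KeywordCandidate, Continuation, BlockLine };

// Counts how many times line_start (default: one TAB) is repeated at the
// very beginning of the line. An empty line_start yields 0.
int count_line_starts(const std::string &line,
                      const std::string &line_start = "\t")
{
    if (line_start.empty())
        return 0;
    int n = 0;
    std::string::size_type pos = 0;
    // pos never exceeds line.size(), so compare() cannot throw here
    while (line.compare(pos, line_start.size(), line_start) == 0)
    {
        ++n;
        pos += line_start.size();
    }
    return n;
}

// 0 delimiters: the line may start with a keyword;
// 1 delimiter : the line continues the current item's text;
// 2 or more   : the line belongs to a repeatable block.
LineKind classify(const std::string &line,
                  const std::string &line_start = "\t")
{
    int n = count_line_starts(line, line_start);
    if (n == 0) return LineKind::KeywordCandidate;
    if (n == 1) return LineKind::Continuation;
    return LineKind::BlockLine;
}
```

Because the delimiter is passed in as a string, the same sketch works if the line-start sequence is configured to something other than a tab (two spaces, for instance).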

Here is another sample block of keywords:

keyword	format type
Phone	repeatable word
Note	text
Source	line item

Phone:	(440) 888-1212, 555-1211
Note:	Any text can start a note. If the text spans several lines,
	each line after the first starts with a tab.
Source:	the Wall Street Journal

Text in a keyword block must conform to the rules of keywords in order for it to parse successfully. If an error occurs during parsing, the location of the error - e.g., the number of items successfully parsed - is saved in an output parameter of the "parse()" function. Thus, you can find out where a format violation is, and go fix it. The parsing process is aborted upon encountering an error. One primary rule about keyword text: each keyword may appear only once per block of text parsed, i.e., there cannot be duplicate keywords. All keywords are at the beginning of a line, and are followed by a keyword separator - one or more characters that separate the keyword from its data. Usually (but not necessarily), this is a colon (":"). After that follow the "keyword text pre-line delimiters" - one or more characters indicating that the text on that line belongs to the keyword. These must be at the start of a line, except on the first line, where they follow immediately after the keyword separator. By default, the pre-line delimiter is simply a tab.
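
The abort-on-error behaviour and the "no duplicate keywords" rule can be illustrated with a small standalone sketch. It is a stand-in written for this explanation, not the kw_parser_o implementation: parse_block() is a hypothetical function, it stores results in a std::map instead of a data bag, and it assumes the default separator (colon plus TAB) and a single TAB as the pre-line delimiter.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical stand-in for the parse step (not the library's code).
// Collects "keyword -> text" pairs from a block of text. Lines starting
// with a TAB are appended to the most recent keyword's text. Parsing stops
// at the first violation and returns false; num_ok then holds the number
// of items that were accepted before the error.
bool parse_block(const std::string &block,
                 std::map<std::string, std::string> &items, int &num_ok)
{
    const std::string sep = ":\t";          // assumed keyword separator
    std::istringstream in(block);
    std::string line, current;
    items.clear();
    num_ok = 0;
    while (std::getline(in, line))
    {
        if (line.empty())
            continue;                       // skip blank lines
        if (line[0] == '\t')
        {
            if (current.empty())
                return false;               // continuation without a keyword
            items[current] += "\n" + line.substr(1);
            continue;
        }
        std::string::size_type pos = line.find(sep);
        if (pos == std::string::npos || pos == 0)
            return false;                   // missing keyword or separator
        std::string key = line.substr(0, pos);
        if (items.count(key) != 0)
            return false;                   // duplicate keyword
        items[key] = line.substr(pos + sep.size());
        current = key;
        ++num_ok;
    }
    return true;
}
```

With this sketch, a block containing "name", "age", and then a second "name" line fails with num_ok equal to 2, which points at the third item as the one to fix.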

Keyword processing is highly configurable. A set of keywords must be loaded prior to parsing. The keyword separator and pre-line "delimiter" can be set to any string of characters. This is done by simply putting the control characters in a string (eg, a char *), passing the character string to a textstring_o variable, which in turn becomes an input parameter to your keyword parser object-instance (kw_parser_o). See the example below.

The large number of classes found in the header file may look intimidating, but using the keyword parser is easy. After declaring the schema in a character buffer, only 2 lines of code are required - constructing a variable, and calling parse():

    int ie = 0, num_ok;
    kw_parser_o parser("my_object", rec_dbag_o(CLASS_SCHEMA), TRUE, 0);
    rec_dbag_o data_obj = parser.parse (SAMPLE_CLASS, num_ok, &ie);
In this case, CLASS_SCHEMA is the name of a string_o variable containing the schema, and SAMPLE_CLASS is a textstring_o variable containing actual data. Afterwards, the next task is to extract whatever data is desired from the recursive data bag data_obj. You should have a good understanding of data bags prior to using this class.

note.
The business class ("business_o") currently has its own, older implementation of the keyword processor system. This separate set of code (to our horror) duplicates the functionality of the keyword processor. Hopefully the code will one day be upgraded. Because it works as-is and such an upgrade is a low-priority item, this duplication probably still exists today.

warnings.
Exercise caution when setting up control strings (ie, keyword_delim, word_delim, line_start, block_start) in the schema string block. Including or excluding an innocuous character may have consequences. For example, for 'multi-word' items, if "word_delim" is set to only a comma (",") as opposed to a comma and space (", "), you may get unexpected results if your line of data looks something like "red, blue, green" - the space preceding each word becomes part of the word!
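
The word_delim pitfall can be reproduced with a plain substring split. The function split_words() below is a hypothetical stand-in written for this illustration, not the library's code; since the history section mentions that string trimming was added later, the actual parser may behave more leniently than this sketch.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in (not in z_keyword.h): splits data on every
// occurrence of word_delim, without trimming the resulting words.
std::vector<std::string> split_words(const std::string &data,
                                     const std::string &word_delim)
{
    std::vector<std::string> words;
    if (word_delim.empty())
    {
        words.push_back(data);              // nothing to split on
        return words;
    }
    std::string::size_type start = 0, pos;
    while ((pos = data.find(word_delim, start)) != std::string::npos)
    {
        words.push_back(data.substr(start, pos - start));
        start = pos + word_delim.size();
    }
    words.push_back(data.substr(start));    // text after the last delimiter
    return words;
}
```

Splitting "red, blue, green" on "," yields "red", " blue", " green" (note the leading spaces), whereas splitting on ", " yields "red", "blue", "green".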

examples.
The example below is a complete, simple working example. It shows an object ("plant schema") that has 5 simple fields, including alias, which is a list. The program processes 4 plants. After each parse, every field that the plant object has is printed.

#include "stdafx.h"
#include "z_func.h"
#include "z_keyword.h"
#include <iostream>                     // std::cout, std::endl


const string_o Plant_Schema =
"schema\
(\
    options\
    (\
        keyword_delim     \":	\" \
        word_delim        \", \" \
        line_start      \"	\" \
        block_start     \"	\" \
        block_line_start   \"		\" \
        special_start   \"		\" \
        comment_start   \"#\" \
    )\
    keywords\
    (\
        name   (category zkw_Line_Item) \
        botany (category zkw_Line_Item) \
        status (category zkw_Line_Item) \
        alias  (category zkw_Multiword_Item) \
        descr  (category zkw_Text) \
    )\
)";

textstring_o plant_01 =
"name:	Bok Choy\n\
botany:	Brassica rapa\n\
status:	fairly common\n\
alias:	Chinese White Cabbage, Tai sai\n";

textstring_o plant_02 =
"name:	Biddle's Lupine\n\
botany:	Lupinus oreganus\n\
status:	G3\n\
descr:	this plant is a treat for the eyes. Its large, palmately-compound,\n\
	hairy leaves are a vibrant green, set off by a tall spike of white\n\
	flowers. The seeds, about the size of a lentil or slightly larger,\n\
	range in color from light peach to a beautiful brick.\n\
	Lupinus biddlei populations are restricted to two distinct locations\n\
	in two areas of eastern Oregon, separated by approximately 50 km.\n\
	It is currently considered to be vulnerable to extinction.\n";

textstring_o plant_03 =
"name:	Lotus Berthelotii\n\
status:	RARE\n\
descr:	Lotus Berthelotii is a perennial plant native to the Canary Islands,\n\
	in the genus Lotus. It has leaves divided into 3-5 slender leaflets,\n\
	each leaflet 1-2 cm long and 1 mm broad, densely covered with fine\n\
	silvery hairs. Flowers are orange-red to red, and peaflower-shaped.\n\
	This plant is either extinct in the wild or persists as a few\n\
	individuals. In 1884 it was already classed as 'exceedingly rare'.\n\
	Decline was most likely inevitable, however, because of lack of\n\
	pollinators. The plant is obviously adapted to be pollinated by\n\
	birds, but no such birds remain in the Canaries\n";

textstring_o plant_04 =
"name:	Atemoya\n\
botany:	Annona Hybrid\n\
status:	unusual\n\
alias:	Sugar Apple, Sweetsop\n\
descr:	a creamy, pudding-like fruit. Originated in West Indies,\n\
	and grown in Florida.\n";


//----------------------------------------------------------------------
int main (int argc, char *argv[])
{
    z_start();

    textstring_o plant[4];
    plant[0] = plant_01; plant[1] = plant_02;
    plant[2] = plant_03; plant[3] = plant_04;

    string_o snam[5];
    snam[0] = "name";
    snam[1] = "botany";
    snam[2] = "status";
    snam[3] = "alias";                  // TO BE SKIPPED (it's a list)
    snam[4] = "descr";

    // named 'parser' so it does not clash with the plant[] array above
    kw_parser_o parser("rare_plants", rec_dbag_o(Plant_Schema), TRUE, 0);

    int ie = 0, i = 0, j = 0, k = 0, num_ok = 0;
    rec_dbag_o my_plant;
    string_o s;

    for (i = 0; i < 4; i ++)
    {
        my_plant = parser.parse (plant[i], num_ok, &ie);   // THE MEAT

        for (j = 0; j < 5; j++)
        {
            if (j != 3)
            {
                s = my_plant.get (snam[j], &ie);
                if (!ie)
                {
                    std::cout << snam[j] << ": ";   // print field name
                    std::cout << s << "\n";         // then the field's value
                }
            }
            else
            {
                // cycle thru the alias list, and print each alias
                array_dbag_o &alias_list
                    = (array_dbag_o &) my_plant.get_dbag ("alias", &ie);
                if (!ie)
                {
                    count_t n = alias_list.size();
                    if (n > 0)
                    {
                        std::cout << "other names: ";
                        for (k = 0; k < n; k++)
                        {
                            s = alias_list.get (k, &ie);
                            std::cout << "\"" << s << "\"";
                            if (k < n-1)
                                std::cout << "; ";
                        }
                        std::cout << std::endl;
                    }
                }
            }
        }

        std::cout << "------------------------\n";
    }

    z_finish();
    return (0);
}

history.

        ??? 09/12/2002: 'kw_parser_o' created {--AU}
        Tue 10/08/2002: checked [kwparse.h] -> source-safe; cleanup
        Tue 02/15/2011: renamed header file (was: "z_kwkeyword.h")
        Thu 02/17/2011: added 'zkw_Multiline_Item'
        Sun 03/13/2011: added string trimming, & culling trailing delim

historical.
The format was originally developed to display data about businesses in a concise format. Relevant contact info about businesses was saved into text files. A concise, easy-to-read format evolved over time. The format for recording data consisted of the business name, followed by its address, or, if many, a list of addresses. This was followed by a block of text where all other data about the business was put. The format for this information evolved into a general-purpose format that can be applied to other things besides businesses. In fact, the format can be used for almost anything (inventory items, dental records, rare fruits taxonomy, automobile parts lists, recipes, etc).

limitations.
If there is a syntax error in the data bag containing the object's schema, the error reporting is not very helpful.