The GEDCOM 5 Parser (TGP)

Both Gigatrees and VGedX use The GEDCOM 5 Parser (TGP) to import your GEDCOM file. TGP is available for download. It is a dictionary-based parser built upon the GEDCOM 5 grammar and the GEDCOM 5.x dictionaries defined in the specifications for the revisions listed below. It can be extended to support additional GEDCOM dictionaries as needed.

  • GEDCOM 5.5 Rev. 1 (January 2, 1996)
  • GEDCOM 5.5 Rev. 2 (January 10, 1996)
  • GEDCOM 5.5.1
  • GEDCOM 5.6 (Draft)
TGP, upon which Gigatrees and VGedX are based, validates the GEDCOM 5 grammar as defined in the specifications listed above. GEDCOM 5.5.5 changed the underlying format, and GEDCOM 7.0 uses another format entirely. As such, TGP cannot be used to validate those data formats without significant changes. It is for this reason that I made the source code for TGP available: other developers can extend its capabilities by adding further GEDCOM grammar and GEDCOM dictionary support. GEDCOM 6.0 XML is another beast entirely, being an XML format.

Description

In addition to parsing, GEDCOM validation requires checking the file against the specification requirements of both the GEDCOM grammar (line and file syntax) and the GEDCOM dictionary (record format, data types, data formats and data values). In this respect, TGP is only a partial validator: it does not validate data formats or data values in all cases, and it does not have complete coverage of the validation tests listed below. This is because in many cases the data types and values defined in the specification are incomplete or inconsistent. TGP will validate each record's defined properties, i.e. minimum and maximum occurrences, id reference types, etc. It will not validate any enumerated values. Data consistency testing is not part of GEDCOM validation.

Validation

Before we can describe how TGP validates a GEDCOM file, we must first provide some technical details.

Data Formats

Symbols

() parentheses  = grouped components
[] brackets     = optional components
*  asterisk     = multiple occurrences of a component
-  dash         = range of values of a component
|  pipe         = alternative components (or)

Characters

Character               ASCII value
=========               ===========
tab                     = 0x09
line feed               = 0x0A
carriage return         = 0x0D
space                   = 0x20
exclamation point (!)   = 0x21
cross hatch (#)         = 0x23
colon (:)               = 0x3A
ampersand (@) [at sign] = 0x40
underscore (_)          = 0x5F

Character Sets

Character set           ASCII range
=============           ===========
number digit (0-9)      = (0x30 - 0x39)
alpha char (a-zA-Z_)    = (0x41 - 0x5A) | (0x61 - 0x7A) | 0x5F
non-alpha char          = (0x21 - 0x2F) | (0x3A - 0x3F) | (0x5B - 0x5E) | (0x7B - 0x7E) | (0x80 - 0xFE) | 0x60

Character Groups

alphanum                = (alpha char | number digit)
printable character     = alphanum | non-alpha char | space | cross hatch

Strings

double-at string (@@)   = ampersand + ampersand
number string           = number digit + [number digit]*
alphanum string         = alphanum + [alphanum]*
pointer id              = (alphanum | exclamation point) + [printable character]*
pointer string          = ampersand + pointer id + ampersand
embedded id string      = ampersand + [pointer id +] exclamation point + pointer id + ampersand
escape string           = ampersand + cross hatch + (printable character | double-at string)* + ampersand + [space] + (printable character)*
value string            = printable character + [printable character]*
data string             = (value string | escape string) [+ (value string | escape string)]*
delimiter               = space
terminator              = carriage return | line feed | (carriage return + line feed) | (line feed + carriage return)
whitespace              = ([tab]* + [space]* + [terminator]*)* 
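
The character sets and string definitions above translate almost directly into predicates. The following is a minimal sketch and not TGP's code; every function name is hypothetical, and it follows the tables literally, so the at sign (0x40) is not a printable character and can only appear as the delimiter of a pointer string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers mirroring the tables above; not part of TGP.
inline bool isDigit(unsigned char c) { return c >= 0x30 && c <= 0x39; }

inline bool isAlpha(unsigned char c) {
  return (c >= 0x41 && c <= 0x5A) || (c >= 0x61 && c <= 0x7A) || c == 0x5F;
}

inline bool isNonAlpha(unsigned char c) {
  return (c >= 0x21 && c <= 0x2F) || (c >= 0x3A && c <= 0x3F) ||
         (c >= 0x5B && c <= 0x5E) || (c >= 0x7B && c <= 0x7E) ||
         (c >= 0x80 && c <= 0xFE) || c == 0x60;
}

inline bool isAlphanum(unsigned char c) { return isAlpha(c) || isDigit(c); }

// printable character = alphanum | non-alpha char | space | cross hatch
inline bool isPrintable(unsigned char c) {
  return isAlphanum(c) || isNonAlpha(c) || c == 0x20 || c == 0x23;
}

// number string = number digit + [number digit]*
inline bool isNumberString(const std::string& s) {
  if (s.empty()) return false;
  for (unsigned char c : s) {
    if (!isDigit(c)) return false;
  }
  return true;
}

// alphanum string = alphanum + [alphanum]*
inline bool isAlphanumString(const std::string& s) {
  if (s.empty()) return false;
  for (unsigned char c : s) {
    if (!isAlphanum(c)) return false;
  }
  return true;
}

// pointer string = '@' + pointer id + '@', where
// pointer id = (alphanum | '!') + [printable character]*
inline bool isPointerString(const std::string& s) {
  if (s.size() < 3 || s.front() != '@' || s.back() != '@') return false;
  const unsigned char first = static_cast<unsigned char>(s[1]);
  if (!isAlphanum(first) && first != 0x21) return false;
  for (std::size_t i = 2; i + 1 < s.size(); ++i) {
    if (!isPrintable(static_cast<unsigned char>(s[i]))) return false;
  }
  return true;
}
```

With these definitions, isPointerString("@I1@") and isPointerString("@!O1@") both hold, while a bare "@@" is rejected because it has no pointer id.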

Validation Tests

GEDCOM validation testing includes two types of tests: tests of the GEDCOM data format (the grammar) and tests of the GEDCOM form (the dictionary).

GEDCOM 5 Line Syntax

All of the supported GEDCOM Dictionaries use the same GEDCOM 5.5 data format, which defines a line as having the following syntax:

line = [whitespace +] level + [delimiter + record_id +] delimiter + tag + [delimiter + reference_id +] terminator

or

line = [whitespace +] level + [delimiter + record_id +] delimiter + tag + [delimiter + line_value +] terminator
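
To make the line syntax concrete, here is a minimal sketch of splitting one line into its parts. It is not how TGP itself is written; ParsedLine and parseLine are hypothetical names, and the sketch only separates the fields without applying the grammar tests described in the next section.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type; TGP's own TGPRecord is richer.
struct ParsedLine {
  bool ok = false;
  int level = 0;
  std::string recordId;  // includes the surrounding '@', empty if absent
  std::string tag;
  std::string value;     // reference_id or line_value, empty if absent
};

ParsedLine parseLine(const std::string& raw) {
  ParsedLine out;

  // Drop the terminator (any mix of CR and LF at the end).
  std::size_t end = raw.size();
  while (end > 0 && (raw[end - 1] == '\r' || raw[end - 1] == '\n')) --end;

  // Skip the optional leading whitespace.
  std::size_t i = 0;
  while (i < end && (raw[i] == ' ' || raw[i] == '\t' ||
                     raw[i] == '\r' || raw[i] == '\n')) ++i;

  // Level: one or two digits, followed by a delimiter.
  std::size_t start = i;
  while (i < end && raw[i] >= '0' && raw[i] <= '9') ++i;
  if (i == start || i - start > 2) return out;
  out.level = std::stoi(raw.substr(start, i - start));
  if (i >= end || raw[i] != ' ') return out;
  ++i;

  // Optional record_id: starts with '@' and runs to the next '@'.
  if (i < end && raw[i] == '@') {
    const std::size_t close = raw.find('@', i + 1);
    if (close == std::string::npos || close >= end) return out;
    out.recordId = raw.substr(i, close - i + 1);
    i = close + 1;
    if (i >= end || raw[i] != ' ') return out;
    ++i;
  }

  // Tag: everything up to the next delimiter.
  start = i;
  while (i < end && raw[i] != ' ') ++i;
  if (i == start) return out;
  out.tag = raw.substr(start, i - start);

  // Optional reference_id or line_value: the rest of the line.
  if (i < end) out.value = raw.substr(i + 1, end - (i + 1));

  out.ok = true;
  return out;
}
```

For example, parseLine("0 @I1@ INDI\r\n") yields level 0, record_id "@I1@" and tag "INDI", while parseLine("1 FAMC @F1@") leaves the record_id empty and returns "@F1@" as the value.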

GEDCOM 5 Grammar Tests

The following is a list of the requirements of the GEDCOM 5 grammar. Unsupported tests are marked as such. String lengths are measured in characters, not bytes.

  1. The level is a number string.
  2. Level numbers should not contain leading zeroes.
  3. The minimum level number is 0.
  4. The maximum level number is 99.
  5. The maximum level number increment is 1.
  6. The level must be followed by a delimiter.

  7. A record_id can be a pointer string or an embedded id string.
  8. The length of a record_id is between 3 and 22 characters.
  9. The record_id must be followed by a delimiter.
  10. The record_id must be unique to the file.

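  11. An embedded record_id may omit its container's id, which is then implied; the implied and explicit forms name the same record and count as duplicates of each other, for example: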

    0 @I1@ INDI
    1 @!O1@ OBJE (I1 is implied)
    1 @I1!O1@ OBJE (duplicates not allowed)

    0 @I1@ INDI (duplicates not allowed)

  12. The tag is an alphanum string.
  13. The length of the tag is between 1 and 31 characters.
  14. The first 15 characters of the tag must be unique.

  15. A reference_id is a pointer string.
  16. The length of a reference_id is between 3 and 22 characters.
  17. The reference_id must be preceded by a delimiter.
  18. The reference_id must be followed by a terminator.
  19. The presence of a reference_id implies that the record_id exists in the file unless a colon is present.
  20. If the reference_id contains an exclamation point, the record_id must exist in an embedded record contained within the same logical record.

    for example:

    0 @I1@ INDI
    1 @I1!O1@ OBJE
    1 OBJE @I1!O1@
    1 OBJE @!O1@ (I1 is implied)

    0 @I2@ INDI
    1 OBJE @I1!O1@ (not allowed)

  21. A line_value is a data string.
  22. The line_value must be preceded by a delimiter.
  23. The line_value must be followed by a terminator.
  24. If an ampersand (@) is desired as part of the line_value, it must be included as a double-at string (e.g. name@@school.edu).

  25. The maximum length of a line is 255 characters.
  26. The maximum length of a logical record is 32 kilobytes (logical records are delineated by level numbers equal to 0 (zero)). [NOT SUPPORTED]
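
As an illustration of how the level rules (1 through 5) can be checked, here is a minimal sketch. It is not TGP's implementation; checkLevel is a hypothetical name, and the returned texts are borrowed from the Errors and Warnings lists in Validation Results.

```cpp
#include <string>
#include <vector>

// Checks the level rules for one level token. prevLevel is the level of the
// preceding line, or -1 for the first line of the file.
// Hypothetical helper; the status texts come from the lists in this document.
std::vector<std::string> checkLevel(const std::string& token, int prevLevel) {
  std::vector<std::string> findings;

  // Rule 1: the level is a number string.
  if (token.empty()) {
    findings.push_back("Level number expected");
    return findings;
  }
  for (char c : token) {
    if (c < '0' || c > '9') {
      findings.push_back("Level number expected");
      return findings;
    }
  }

  // Rule 2: no leading zeroes.
  if (token.size() > 1 && token[0] == '0') {
    findings.push_back("Level has leading zero");
  }

  // Rules 3 and 4: the value lies between 0 and 99. Stop accumulating once
  // the limit is passed so that very long tokens cannot overflow.
  int level = 0;
  for (char c : token) {
    level = level * 10 + (c - '0');
    if (level > 99) break;
  }
  if (level > 99) findings.push_back("Level number exceeds limit");

  // Rule 5: a level may be at most one deeper than the previous line.
  if (level > prevLevel + 1) findings.push_back("Level number gap");

  return findings;
}
```

Calling checkLevel("3", 1) reports a level number gap, whereas checkLevel("2", 1) and checkLevel("0", 5) report nothing, since a line may return to any shallower level.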

GEDCOM Dictionary Tests

To validate the GEDCOM dictionary, TGP compares the structure of the logical records to the GEDCOM dictionary template associated with its GEDCOM version. It also validates general GEDCOM dictionary constructs common to all supported GEDCOM versions.

  1. The GEDCOM version must be either "5.5", "5.5.1" or "5.6".
  2. Each line should match the GEDCOM dictionary template unless the line has a user defined tag beginning with an underscore.
  3. Each record_id should be referenced from within the same file.
  4. If the template expects a record_id, then the line must have a record_id of the same type.
  5. If the template expects no record_id, then the line must not have a record_id.
  6. If the template expects a reference_id, then the line must have a reference_id of the same type.
  7. If the template expects no reference_id, then the line must not have a reference_id.
  8. If the template defines a minimum number of record occurrences, then the record should not have fewer.
  9. If the template defines a maximum number of record occurrences, then the record should not have more.
  10. If the template defines a minimum line_value length, then the line_value should not be shorter.
  11. If the template defines a maximum line_value length, then the line_value should not be longer.
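
The occurrence tests (8 and 9) and the user-defined tag exception in test 2 can be sketched as follows. This is an assumption about how such a check might look, not TGP's code: TemplateEntry and checkOccurrences are hypothetical, and a real dictionary template carries far more than occurrence limits.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical template entry: the allowed number of occurrences of a
// subrecord tag inside its container. maxOccurs < 0 means "no upper limit".
struct TemplateEntry {
  int minOccurs;
  int maxOccurs;
};

// Compares the subrecord tags found in one record against its template.
std::vector<std::string> checkOccurrences(
    const std::map<std::string, TemplateEntry>& tmpl,
    const std::vector<std::string>& childTags) {
  std::map<std::string, int> counts;
  std::vector<std::string> findings;

  for (const std::string& tag : childTags) {
    // User-defined tags (leading underscore) are exempt from the template.
    if (!tag.empty() && tag[0] == '_') continue;
    if (tmpl.find(tag) == tmpl.end()) {
      findings.push_back("Undefined record found: " + tag);
      continue;
    }
    ++counts[tag];
  }

  for (const auto& kv : tmpl) {
    const auto it = counts.find(kv.first);
    const int n = (it == counts.end()) ? 0 : it->second;
    if (n < kv.second.minOccurs) {
      findings.push_back("Too few occurrences of tag: " + kv.first);
    }
    if (kv.second.maxOccurs >= 0 && n > kv.second.maxOccurs) {
      findings.push_back("Too many occurrences of tag: " + kv.first);
    }
  }
  return findings;
}
```

Given a made-up template that requires at least one NAME and allows at most one SEX, a record containing two SEX lines and a _UID line would be reported for too few NAME and too many SEX occurrences, while the _UID line is ignored.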

Distribution

The distribution file includes the source code for The GEDCOM 5 Parser (TGP) and the four GEDCOM dictionaries. Unlike a typical open source project, however, it does not include any build files or the wide array of utility classes that may be defined externally to the code provided. The source code is written in C++, is provided 'as is' without any warranty, and should be easily converted for use in your genealogy application. You will need to write your own wrappers.

In addition to file import capabilities, TGP also supports building and exporting a GEDCOM file based upon its traversal of any of the GEDCOM dictionary trees. These exported files can then be used as input for GEDCOM importing applications such as TGP, VGedX and Gigatrees, as well as any other application that supports importing GEDCOM files. If an application cannot import one of these test files without error, then it is probably not using a dictionary-based parser like TGP. The generated test files ( 55r1.ged, 55r2.ged, 551.ged and 56.ged ) are included in the distribution for VGedX so users can test its importing abilities. The test files use typical values only, and are therefore not useful for testing boundary use cases. The VGedX distribution includes an additional file ( test_grammar.ged ) that will test some boundary conditions.
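
The idea of generating a test file by walking a dictionary tree can be sketched in a few lines. This is only an illustration under assumed types: TemplateNode and emit are hypothetical, and TGP's real dictionary trees also carry ids, occurrence limits and value lengths.

```cpp
#include <string>
#include <vector>

// Hypothetical dictionary node: a tag, a typical value, and its subrecords.
struct TemplateNode {
  std::string tag;
  std::string sampleValue;  // typical value, may be empty
  std::vector<TemplateNode> children;
};

// Depth-first traversal that appends one GEDCOM line per node to `out`.
void emit(const TemplateNode& node, int level, std::string& out) {
  out += std::to_string(level) + " " + node.tag;
  if (!node.sampleValue.empty()) out += " " + node.sampleValue;
  out += "\n";
  for (const TemplateNode& child : node.children) {
    emit(child, level + 1, out);
  }
}
```

Starting the traversal at level 0 for each top-level node produces lines whose level numbers never increase by more than one, which is what the grammar requires.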

Database Structure

Each TGP record contains the information provided in the GEDCOM file as well as a pointer to the record's container, a subfield vector, a pointer to a reference record (if applicable), the unique GEDCOM state, a naked pointer for adding application-specific data and a naked application-specific data type. All text has been concatenated, and the concatenation ( CONC ) and continuation ( CONT ) records have been discarded. Other minor cleanup may have been performed as well. Together, these should provide all the interface necessary to access your genealogy database programmatically. If not, you have the source and are allowed to modify/port/rework it.

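class TGPRecord
{
public:
  unsigned long  m_lineNum;       // line number in the GEDCOM file
  int            m_level;         // level number
  String         m_id;            // record_id, if any
  String         m_tag;           // tag
  String         m_lineValue;     // line_value (CONC/CONT already merged)

  GedcomStates   m_state;         // unique GEDCOM state
  TGPRecord*     m_referencePtr;  // referenced record, if applicable
  TGPRecord*     m_containerPtr;  // container of this record
  TGPRecordList  m_list;          // subfield vector

  void*          m_dataPtr;       // application-specific data
  int            m_dataType;      // application-specific data type
};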
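
The CONC/CONT cleanup mentioned above can be pictured with a small stand-in structure. This sketch does not use TGPRecord and is not TGP's code: Node and mergeContinuations are hypothetical, and it assumes the usual GEDCOM 5.5 convention that CONC text is appended directly while CONT text starts a new line.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a parsed record; not the TGPRecord class.
struct Node {
  std::string tag;
  std::string value;
  std::vector<Node> children;
};

// Folds CONC and CONT subrecords into the parent's value and discards them.
void mergeContinuations(Node& node) {
  std::vector<Node> kept;
  for (Node& child : node.children) {
    if (child.tag == "CONC") {
      node.value += child.value;         // same line, no separator
    } else if (child.tag == "CONT") {
      node.value += "\n" + child.value;  // new line of text
    } else {
      mergeContinuations(child);         // clean up nested records too
      kept.push_back(std::move(child));
    }
  }
  node.children = std::move(kept);
}
```

After the call, a NOTE record with one CONC and one CONT subrecord holds the complete text in its own value, and only its other subrecords remain in the child vector.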

Usage

TGP creates an internal data representation of the parsed GEDCOM file and provides the accessors needed to work with it. The TGP database is an empty record holding a vector of level 0 records that can be accessed using the getContents() method. Each of these records holds a vector of subfields. The database's root record can be accessed by calling TGPDatabase(). To parse a GEDCOM file, your application might do something like the following:

// FOREACH and PRINT are not standard C++; they stand for utility
// macros defined outside this snippet
GedcomFile* importFile = new GedcomFile(Config::ImportFile);
if (importFile != NULL) {

  // parse the file
  if (GedcomParser::ParseFile(importFile) == TGP_SUCCESS) {

    // process the returned statuses
    // uses a map where the first element is the line number
    // and the second element is the record carrying the status codes
    FOREACH (elem, *GedcomParser::RecordsWithStatus()) {
      if (elem.second != NULL) {

        // retrieve the status codes
        FOREACH (code, elem.second->getStatus()) {

          // print the line number followed by the status code's text
          PRINT ("Line: " + elem.first + " => Status: " + GedcomParser::GetErrorText(code));
        }
      }
    }

    // process the database
    ProcessDatabase(GedcomParser::TGPDatabase());
  }
}

// recursive function
void ProcessDatabase(TGPRecord* record) {
  if (record != NULL) {

    // process record here

    // process the record fields
    FOREACH (contents, record->getContents()) {

      // repeat for each record field
      ProcessDatabase(contents);
    }
  }
}

Validation Results

Please consult the GEDCOM standards for more information on the Errors, Warnings and Alerts listed below, which you can expect TGP to detect and display. Errors are those most likely to cause a GEDCOM file to fail to import, or to be partially unusable if successfully imported. Warnings are serious violations of the GEDCOM standard, but applications should have no trouble handling them. Many of them concern issues that were important to computer programmers of the 1970s, like record size and line length. Lastly, Alerts are informational only and are not a violation of the GEDCOM standard.

Errors

Unsupported GEDCOM version detected
Level number expected
Level number gap
Invalid ID length
ID missing
Invalid ID reference length
Tag Expected
Data contains non-printable characters
ID reference missing
Unexpected ID reference
Invalid ID reference type
ID reference substitution
Duplicate record found
Referenced record not found

Warnings

Level number exceeds limit
Level has leading zero
ID delimiter missing
Invalid ID length
Invalid ID character
Invalid ID reference length
Invalid ID reference character
Invalid tag length
Invalid tag character
Too few occurrences of tag
Too many occurrences of tag
Data contains tabs
Maximum line length exceeded
Data missing
Insufficient data
Maximum data length exceeded
Data not expected
Trailing data not expected
Unpaired ampersand (@)
Undefined record found
Record not referenced
Too many delimiters

Alerts

Trailing spaces not expected
User defined record found
Data contains 8-bit ASCII characters
Embedded ID reference uses an implied record id