diff --git a/README.md b/README.md new file mode 100644 index 00000000..17e4dd6e --- /dev/null +++ b/README.md @@ -0,0 +1,34 @@ +# FamilySearch GEDCOM + +The official FamilySearch GEDCOM specification for exchanging genealogical data. + +This repository is for the collaborative development of the FamilySearch GEDCOM specification. +If you are looking for the specification itself, see . + +If you are looking for FamilySearch's GEDCOM 5.5.1 Java parser, which previously had this same repository name, see + + +## Repository structure + +- [`changelog.md`](changelog.md) is a running log of major changes made to the specification. +- [`specification/`](specification/) contains the FamilySearch GEDCOM specification: + - [`specification/gedcom.md`](specification/gedcom.md) is the source document used to define the FamilySearch GEDCOM specification. It is written in pandoc-flavor markdown and is intended to be more easily written than read. + - other files are rendered versions of `gedcom.md`. One of these is likely to be the one users of the specification want. +- [`build/`](build/) contains files needed to render the specification + - See [`build/README.md`](build/) for more +- [`extracted-files/`](extracted-files/) contains digested information automatically extracted from the specification. All files in this directory are automatically generated by scripts in the [`build/`](build/) directory. + - [`extracted-files/grammar.abnf`](extracted-files/grammar.abnf) contains all the character-level ABNF for parsing lines and datatypes + - [`extracted-files/grammar.gedstruct`](extracted-files/grammar.gedstruct) contains a custom structure organization metasyntax + - [`extracted-files/tags/`](extracted-files/tags/) contains summary information for each -based URI defined in the specification. + +## Branches + +- `main` contains the current release. + Patch versions are generally pushed directly to `main` upon approval. 
+ +- `next-minor` contains a working draft of the next minor release. Changes from `main` have been discussed and approved by the working group supervising the next minor release, but have not been fully vetted and approved for inclusion in the standard and may change at any time without notice. + +- `next-major` contains a working draft of the next major release. Changes from `main` have been discussed and approved by the working group supervising the next major release, but have not been fully vetted and approved for inclusion in the standard and may change at any time without notice. + +- All other branches are for conversation drafts that may or may not be incorporated into a future version of the specification. + diff --git a/build/README.md b/build/README.md new file mode 100644 index 00000000..1fca741f --- /dev/null +++ b/build/README.md @@ -0,0 +1,59 @@ +This directory is used to convert the `specifications/gedcom.md` source file into fully-hyperlinked HTML and PDF. + +# Building -- quick-start guide + +1. Install dependencies: + + - [python 3](https://python.org) + - [pandoc](https://pandoc.org) + - [weasyprint](https://weasyprint.org) installed by running `python3 -mpip install --user --upgrade weasyprint` + - [git](https://git-scm.com/) + - `make`-compatible executable + +2. From the directory containing this README, run `make` + +# Building -- how it works + +Getting from `gedcom.md` to `gedcom.pdf` is a multi-step process, all of which is handled by the `Makefile`: + +1. `hyperlink.py` reads `gedcom.md` and adds hyperlinks into `gedcom-tmp.md`. It is somewhat dependent on the internal formatting of `gedcom.md` and may need adjustment if, e.g., tables are switched to a different markdown table format. + +2. `pandoc` converts `gedcom-tmp.md` into `gedcom-tmp.html`. + It uses `template.html` for structure, + `pandoc.css` for styling, + and `gedcom.xml`, `gedstruct.xml`, and `abnf.xml` for syntax highlighting. 
+ + Pandoc's command-line options include + + - syntax highlighting options: + - `--syntax-definition=gedcom.xml` + - `--syntax-definition=gedstruct.xml` + - `--syntax-definition=abnf.xml` + - `--highlight-style=kate` + - general formatting options + - `--from=markdown+smart` + - `--standalone` + - `--toc` + - `--number-sections` + - `--self-contained` + - `--metadata="date:`date you want on the cover page`"` + - stylistic options + - `--css=pandoc.css` + - `--template=template.html` + - input/output options + - `--wrap=none` + - `--to=html5` + - `--output=gedcom-tmp.html` + - `gedcom-tmp.md` + +3. `hyperlink-code.py` converts `gedcom-tmp.html` into `gedcom.html` by + + - removing all `col` and `colgroup` elements, which are incorrectly handled by some versions of the webkit rendering engine used by weasyprint. + - adding hyperlinks inside code blocks (which markdown cannot do) + + This is dependent on the code environment classes created by syntax highlighting, and may need adjusting if pandoc changes these class names or if the syntax highlighting definition XML files are edited. + +4. `python3 -mweasyprint gedcom.html gedcom.pdf` turns the HTML into PDF + + Note that a relatively recent version of `weasyprint` (published in 2020 or later) is needed to correctly handle syntax-highlighted code blocks. + Also note that it is expected that this will emit a variety of warning messages based on CSS rules intended for screen, not print. If it emits any error messages, those should be resolved whether they impede the creation of the PDF or not. 
diff --git a/build/abnf.xml b/build/abnf.xml new file mode 100644 index 00000000..d225e484 --- /dev/null +++ b/build/abnf.xml @@ -0,0 +1,90 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/build/extract-grammars.py b/build/extract-grammars.py new file mode 100644 index 00000000..7edb97c9 --- /dev/null +++ b/build/extract-grammars.py @@ -0,0 +1,59 @@ +from sys import argv +from os.path import join, dirname, isfile, isdir, exists +from os import makedirs + +def get_paths(): + """Parses command-line arguments, if present; else uses defaults""" + spec = join(dirname(argv[0]),'../GEDCOM.md') if len(argv) < 2 or not isfile(argv[1]) else argv[1] + dest = join(dirname(argv[0]),'../') + for arg in argv: + if arg and isdir(arg): + dest = arg + break + if arg and not exists(arg) and arg[0] != '-' and isdir(dirname(arg)): + dest = arg + break + + if not isdir(dest): + makedirs(dest) + + return spec, dest + + +if __name__ == '__main__': + src, dst = get_paths() + abnf = [] + gedstruct = [] + where = None + header = '' + with open(src) as f: + for line in f: + if line.startswith('```'): + if where: + if where == 'abnf': abnf.append('\n\n') + elif where == 'gedstruct': gedstruct.append('\n\n') + where = None + elif 'gedstruct' in line: + where = 'gedstruct' + if header: + gedstruct.append(header.replace('`', '') + '\n') + header = '' + elif 'abnf' in line: + where = 'abnf' + if header: + abnf.append('; ' + '-'*13 + ' ' +header + ' ' + '-'*13 + '\n\n') + header = '' + elif where == 'abnf': abnf.append(line) + elif where == 'gedstruct': gedstruct.append(line) + elif line.startswith('#'): + header = line + if '{' in header: header = header[:header.find('{')] + header = header.strip('# \n\r\t') + with open(join(dst,'grammar.abnf'), 'w') as f: + f.write('''; This document is in ABNF, see +; This document uses RFC 7405 to add case-sensitive 
literals to ABNF. + +''') + f.write(''.join(abnf)) + with open(join(dst,'grammar.gedstruct'), 'w') as f: + f.write(''.join(gedstruct)) diff --git a/build/gedcom.xml b/build/gedcom.xml new file mode 100644 index 00000000..c6a75b20 --- /dev/null +++ b/build/gedcom.xml @@ -0,0 +1,39 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/build/gedstruct.xml b/build/gedstruct.xml new file mode 100644 index 00000000..0f7768fa --- /dev/null +++ b/build/gedstruct.xml @@ -0,0 +1,84 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/build/hyperlink-code.py b/build/hyperlink-code.py new file mode 100644 index 00000000..d8baaf7c --- /dev/null +++ b/build/hyperlink-code.py @@ -0,0 +1,62 @@ +import re +import sys + +if len(sys.argv) < 3: + print("USAGE:",sys.argv[0],"input.html output.html", file=sys.stderr) + sys.exit(1) + +src = sys.argv[1] +dst = sys.argv[2] + + + + +with open(src) as fh: + doc = fh.read() + +# remove col and colgroup elements, which confuse some HTML rendering engines +doc = re.sub(r']*>','',doc) + +# find header IDs +targets = re.findall(r']*id="([^"]*)"', doc) +special = {a[1]:a[0] for a in re.findall(r']*>([A-Z][A-Za-z]*)', doc)} + +# find tags in tables (individual events, etc) +table_tags = {} +for table in re.finditer(r']*id="([^"]*)".*?(.*?)
', doc): + anchor, body = table.groups() + if 'Tag' in body: + for tag in re.findall(r'([^<]*)', body): + table_tags[tag] = anchor + +def anchorify(m): + full = m.group(0) + tag = m.group(1) + tag2 = m.groups()[-1].replace('#','-') + if tag2 in targets: + full = full.replace(tag, ''+tag+'') + elif tag2 in special: + full = full.replace(tag, ''+tag+'') + elif tag2 in table_tags: + full = full.replace(tag, ''+tag+'') + elif tag2.lower().replace(' ','-') in targets: + full = full.replace(tag, ''+tag+'') + return full + +doc = re.sub(r'(g7:[^<]*)\1]*ged(?:struct|com)[^>]*>.*?)', doc, flags=re.DOTALL) + +# to do: find anchors for n @XREF:...@ and link tag @XREF:...@ to them +# to do: make code links to table tag definitions + +with open(dst, 'w') as to: + for i in range(len(chunks)): + txt = chunks[i] + if (i&1): + # txt = re.sub(r'\s*([^ <]+)\s*', anchorify, txt) + txt = re.sub(r'\s*<<([^ &]+)>>\s*', anchorify, txt) + txt = re.sub(r'(g7:([^<]+))', anchorify, txt) + txt = re.sub(r'<([A-Z][a-z][A-Za-z ]+)(?:>|:)', anchorify, txt) + txt = re.sub(r':([A-Z][a-z][A-Za-z ]+)>', anchorify, txt) + to.write(txt) diff --git a/build/hyperlink.py b/build/hyperlink.py new file mode 100644 index 00000000..4c35b818 --- /dev/null +++ b/build/hyperlink.py @@ -0,0 +1,107 @@ +import re +import sys + +if len(sys.argv) < 3: + print("USAGE:",sys.argv[0],"input.md output.md", file=sys.stderr) + sys.exit(1) + +src = sys.argv[1] +dst = sys.argv[2] + + +def slugify(bit): + if '`g7:' in bit: + si = bit.rfind('`g7:')+4 + ei = bit.find('`', si) + slug = bit[si:ei].replace('#','-') + elif '`' in bit: + bit = re.search('`[A-Z0-9_`.]+`', bit) + slug = bit.group(0).replace('`','').replace('.','-') + else: + slug = re.sub('[^-._a-z0-9]+','-', bit.lower()) + return slug + +# Step 1: find all anchors and ABNF rules +slugs = {} +abnf_rules = {} +table_tags = {} +header_row = None +with open(src) as f: + num = 0 + inabnf = False + for line in f: + num += 1 + if line[0] == '#': + if '`' in line and '{' not in line: + 
slug = slugify(line.replace("'s ",'.')) + elif '{' in line and line.find('#', line.find('{')) > 0: + slug = line[line.rfind('#')+1:] + slug = slug[:slug.find('}')] + else: + if '{' in line: line = line[:line.find('{')] + slug = slugify(line.strip('# \n\r')) + if slug in slugs: + raise Exception('Duplicate slug '+slug) + slugs[slug] = num + elif '`abnf' in line: + inabnf = True + elif inabnf and '`' in line: + inabnf = False + elif inabnf and line[0] != ' ' and '=' in line: + abnf_rules[line.split()[0]] = slug + elif not inabnf: + if header_row: + if '|' not in line: header_row = None + elif 'Tag' in header_row and '`' in line: + table_tags[line.split('`')[1]] = slug + elif '|' in line: header_row = line + +last = {} +def linkable(line, num, istable=False): + """Finds linkable items in a line of text and adds links for them""" + def linkify(txt, slug): + near = abs(slugs[slug]-num) < 20 or abs(last.get(slug,-100)-num) < 20 + last[slug] = num + if near: + return '['+txt+'](#'+slug+'){.close}' + else: + return '['+txt+'](#'+slug+')' + + def repl(m): + slug = slugify(m.group(0)) + if slug in slugs: + return linkify(m.group(0), slug) + return m.group(0) + def abnf(m): + if m.group(1) in abnf_rules: + slug = abnf_rules[m.group(1)] + return linkify(m.group(0), slug) + if m.group(1) in table_tags: + slug = table_tags[m.group(1)] + return linkify(m.group(0), slug) + return m.group(0) + uried = re.sub(r'(? 
:first-child:before { + content: "Example \2014\00A0 "; + font-style: italic; + color: #005A9C; +} + +.note { + margin: 1em; + background: #FCFCFC; + border-left: #C0C0C0 solid 4px; + padding: 0em 0.5em; +} +.note > :first-child:before { + content: "Note \2014\00A0 "; + font-style: italic; +} + +code.uri { + font-weight: normal; + background: none !important; +} +code.uri:before { content: " <"; } +code.uri:after { content: ">"; } + +code .sc { color: #2879a4 !important; } /* kate style makes this too pale */ diff --git a/build/template.html b/build/template.html new file mode 100644 index 00000000..408a1a2f --- /dev/null +++ b/build/template.html @@ -0,0 +1,93 @@ + + + + + +$for(author-meta)$ + +$endfor$ +$if(date-meta)$ + +$endif$ +$if(keywords)$ + +$endif$ + $if(title-prefix)$$title-prefix$ – $endif$$pagetitle$ + +$for(css)$ + +$endfor$ +$if(math)$ + $math$ +$endif$ + +$for(header-includes)$ + $header-includes$ +$endfor$ + + +$for(include-before)$ +$include-before$ +$endfor$ + +
+

$title$

+$if(subtitle)$ +

$subtitle$

+$endif$ + +
+$for(author)$ +$author$ +$endfor$ +
+ +$if(date)$ +

$date$

+$endif$ + +$if(address)$ +
+Suggestions and Correspondence: +
+$address$ +
+ +
+$endif$ + +$if(copyright)$ + +$endif$ +
+ + + + + +$if(toc)$ + +$endif$ +
+$body$ +
+$for(include-after)$ +$include-after$ +$endfor$ + + diff --git a/build/uri-def.py b/build/uri-def.py new file mode 100644 index 00000000..b0c81219 --- /dev/null +++ b/build/uri-def.py @@ -0,0 +1,343 @@ +import re +from sys import argv, stderr +from os.path import isfile, isdir, exists, dirname, join +from os import makedirs +from subprocess import run + + +def get_paths(): + """Parses command-line arguments, if present; else uses defaults""" + spec = join(dirname(argv[0]),'../specifications/gedcom.md') if len(argv) < 2 or not isfile(argv[1]) else argv[1] + dest = join(dirname(argv[0]),'../extracted-files/tags') + for arg in argv: + if arg and isdir(arg): + dest = arg + break + if arg and not exists(arg) and arg[0] != '-' and isdir(dirname(arg)): + dest = arg + break + + if not isdir(dest): + makedirs(dest) + + return spec, dest + +def get_text(spec): + """Reads the contents of the given file""" + with open(spec) as fh: return fh.read() + +def get_prefixes(txt): + """Find and parse prefix definition tables""" + pfx = {} + for pfxtable in re.finditer(r'([^\n]*)Short Prefix *\| *URI Prefix *\|(\s*\|[^\n]*)*', txt): + for abbr, lng in re.findall(r'`([^`]*)` *\| *`([^`]*)`', pfxtable.group(0)): + pfx[abbr] = lng + return pfx + +def find_datatypes(txt, g7): + """Returns datatype:uri and adds URI suffixes to g7""" + dturi = {} + for section in re.finditer(r'^#+ *([^\n]*)\n+((?:[^\n]|\n+[^\n#])*[^\n]*URI for[^\n]*datatypes? is(?:[^\n]|\n+[^\n#])*)', txt, re.M): + for dt, uri in re.findall(r'URI[^\n]*`([^\n`]*)` datatype[^\n]*`([^`\n:]*:[^\n`]*)`', section.group(0)): + dturi[dt] = uri + if uri.startswith('g7:'): + if '#' in uri: uri = uri[:uri.find('#')] + if uri[3:] not in g7: + g7[uri[3:]] = ('datatype', [section.group(2).strip()]) + return dturi + +def find_cat_tables(txt, g7, tagsets): + """Looks for tables of tags preceded by a concatenation-based URI + + Raises an exception if any URI is repeated with distinct definitions. 
This code contains a hard-coded fix for BIRTH which has the same unifying concept but distinct text in the spec. + + Returns a {structure:[list,of,allowed,enums]} mapping + """ + hard_code = { + "g7:enum-BIRTH": 'Associated with birth, such as a birth name or birth parents.', + } + cats = {} + enums = {} + for bit in re.finditer(r'by\s+concatenating\s+`([^`]*)`', txt): + i = txt.rfind('\n#', 0, bit.start()) + j = txt.find(' ',i) + j = txt.find(txt[i:j+1], j) + sect = txt[i:j].replace('(Latter-Day Saint Ordinance)','`ord`') ## <- hack for ord-STAT + for entry in re.finditer(r'`([A-Z0-9_]+)` *\| *(.*?) *[|\n]', sect): + enum, meaning = entry.groups() + pfx = bit.group(1)+enum + if 'The URI of this' in meaning: + meaning, tail = meaning.split('The URI of this') + pfx = tail.split('`')[1] + meaning = hard_code.get(pfx,meaning) + if pfx in cats and meaning != cats[pfx]: + raise Exception('Concatenated URI '+pfx+' has multiple definitions:' + + '\n '+cats[pfx] + + '\n '+meaning + ) + if 'enum-' in pfx: + k1 = sect.find('`', sect.rfind('\n#', 0, entry.start())) + k2 = sect.rfind('`', 0, sect.find('\n', k1)) + key = sect[k1:k2].replace('`','').replace('.','-') + enums.setdefault(key,[]).append(pfx) + if pfx not in cats: + cats[pfx] = meaning + if pfx.startswith('g7:'): + if pfx[3:] in g7: + raise Exception(pfx+' defined as an enumeration and a '+g7[pfx[3:]][0]) + g7[pfx[3:]] = ('enumeration', [meaning]) + return enums + +def find_calendars(txt, g7): + """Looks for sections defining a `g7:cal-` URI""" + for bit in re.finditer(r'#+ `[^`]*`[^\n]*\n+((?:\n+(?!#)|[^\n])*is `g7:(cal-[^`]*)`(?:\n+(?!#)|[^\n#])*)', txt): + g7[bit.group(2)] = ('calendar',[bit.group(1)]) + + +def joint_card(c1,c2): + """Given two cardinalities, combine them.""" + return '{' + ('1' if c1[1] == c2[1] == '1' else '0') + ':' + ('1' if c1[3] == c2[3] == '1' else 'M') + '}' + +def parse_rules(txt): + """returns {rule:[(card,uri),(card,uir),...] 
for each level-n + production of the rule, even if indirect (via another rule), + regardless of if alternation or set.""" + # Find gedstruct context + rule_becomes = {} + rule_becomes_rule = {} + for rule,block,notes in re.findall(r'# *`([A-Z_0-9]+)` *:=\s+```+[^\n]*\n([^`]*)``+[^\n]+((?:[^\n]|\n(?!#))*)', txt): + for card, uri in re.findall(r'^n [A-Z@][^\n]*(\{.:.\}) *(\S+:\S+)', block, re.M): + rule_becomes.setdefault(rule,[]).append((card, uri)) + for r2, card in re.findall(r'^n <<([^>]*)>>[^\n]*(\{.:.\})', block, re.M): + rule_becomes_rule.setdefault(rule,[]).append((card, r2)) + # Fixed-point rule-to-rule resolution + again = True + while again: + again = False + for r1,rset in tuple(rule_becomes_rule.items()): + flat = True + for c,r2 in rset: + if r2 in rule_becomes_rule: + flat = False + if flat: + for c,r2 in rset: + rule_becomes.setdefault(r1,[]).extend((joint_card(c,c2),uri) for (c2,uri) in rule_becomes[r2]) + del rule_becomes_rule[r1] + else: + again = True + return rule_becomes + +def new_key(val, d, *keys, msg=''): + """Helper method to add to a (nested) dict and raise if present""" + for k in keys[:-1]: + d = d.setdefault(k, {}) + if keys[-1] in d: + if d[keys[-1]] != val: + raise Exception(msg+'Duplicate key: '+str(keys)) + else: d[keys[-1]] = val + +def parse_gedstruct(txt, rules, dtypes): + """Reads through all gedstruct blocks to find payloads, substructures, and superstructures""" + sup,sub,payload = {}, {}, {} + for block in re.findall(r'```[^\n]*gedstruct[^\n]*\n([^`]*)\n```', txt): + stack = [] + for line in block.split('\n'): + parts = line.strip().split() + if len(parts) < 3: + if line not in ('[','|',']'): + raise Exception('Invalid gedstruct line: '+repr(line)) + continue + if parts[1].startswith('@'): del parts[1] + if parts[0] == 'n': stack = [] + else: + n = int(parts[0]) + while n < len(stack): stack.pop() + if parts[1].startswith('<'): + card = parts[2] + if len(stack): + for c,u in rules[parts[1][2:-2]]: + 
new_key(joint_card(card,c), sup, u, stack[-1], msg='rule sup: ') + new_key(joint_card(card,c), sub, stack[-1], u, msg='rule sub: ') + else: + uri = parts[-1] + if '{' in uri: + uri = parts[1]+' pseudostructure' + card = parts[-2] + if len(parts) > 4: + p = ' '.join(parts[2:-2])[1:-1] + if p.startswith('': pass + else: p = dtypes[p] + else: p = None + new_key(p, payload, uri, msg='payload: ') + if len(stack): + new_key(card, sup, uri, stack[-1], msg='line sup: ') + new_key(card, sub, stack[-1], uri, msg='line sub: ') + stack.append(uri) + return {k:{'sub':sub.get(k,[]),'sup':sup.get(k,[]),'pay':payload.get(k)} for k in sub.keys()|sup.keys()|payload.keys()} + +def find_descriptions(txt, g7, ssp): + """Collects structure definitions as follows: + + - Sections '#+ TAG (Name) `g7:FULL.TAG`' + - Sections '#+ `RULE` :=' with only one level-n struct + - Rows in tables 'Tag | Name
URI | Description' + + Returns a {section header:[list,of,uris]} mapping + """ + + # structure sections + for name,uri,desc in re.findall(r'#+ `[^`]*`[^\n]*\(([^)]*)\)[^\n]*`([^:`\n]*:[^`\n]*)`[^\n]*\n+((?:\n+(?!#)|[^\n])*)', txt): + if uri not in ssp: + raise Exception('Found section for '+uri+' but no gedstruct') + if uri.startswith('g7:'): + g7.setdefault(uri[3:],('structure',[],ssp[uri]))[1].extend(( + name.strip(), + desc.strip() + )) + for other in re.findall(r'[Aa] type of `(\S*)`', desc): + m = re.search('^#+ +`'+other+r'`[^\n`]*\n((?:[^\n]+|\n+(?!#))*)', txt, re.M) + if m: + g7[uri[3:]][1].append(m.group(1).strip()) + + # error check that gedstruct and sections align + for uri in ssp: + if 'pseudostructure' in uri: continue + if uri.startswith('g7:') and uri[3:] not in g7: + raise Exception('Found gedstruct for '+uri+' but no section') + + # gedstruct sections + for uri, desc in re.findall(r'#+ *`[^`]*` *:=[^\n]*\n+`+[^\n]*\n+n [^\n]*\} *(\S+:\S+) *(?:\n [^\n]*)*\n`+[^\n]*\n+((?:[^\n]|\n(?!#))*)', txt): + g7[uri[3:]][1].append(desc.strip()) + + tagsets = {} + # tag tables + for table in re.finditer(r'\n#+ (\S[-A-Za-z0-9 ]*[a-z0-9])[^#]*?Tag *\| *Name[^|\n]*\| *Description[^\n]*((?:\n[^\n|]*\|[^\n|]*\|[^\n]*)*)', txt): + pfx = '' + header = table.group(1) + if header.startswith('Fam'): pfx = 'FAM-' + if header.startswith('Indi'): pfx = 'INDI-' + for tag, name, desc in re.findall(r'`([A-Z_0-9]+)` *\| *([^|\n]*?) *\| *([^|\n]*[^ |\n]) *', table.group(2)): + if ' + +author: + - | + Prepared by the + + :::{style="font-size:130%"} + Family History Department
+ The Church of Jesus Christ of Latter-day Saints + ::: +address: | + **Family History Department**
+ 15 East South Temple Street
+ Salt Lake City, UT 84150 USA +toc-title: Contents +lang: en +... + +# Introduction {.unnumbered} + +FamilySearch GEDCOM 7.0 was released in 2021 as the latest version of the GEDCOM format for the transmission and storage of genealogical information. +GEDCOM was developed by the Family History Department of The Church of Jesus Christ of Latter-day Saints to provide a flexible, uniform format for exchanging computerized genealogical data. +Its first purpose is to foster sharing genealogical information and to develop a wide range of inter-operable software products to assist genealogists, historians, and other researchers. +Its second purpose is as a long-term storage format for preserving genealogical information in an open, standard format that will be accessible and understood by future genealogists and the systems they use. + +"GEDCOM" is an acronym for **GE**nealogical **D**ata **COM**munication and is traditionally pronounced "ˈdʒɛdkɑm." + +## Purpose and Content of *The FamilySearch GEDCOM Specification* {.unnumbered} + +*The FamilySearch GEDCOM Specification* is a technical document written for computer programmers, system developers, and technology-aware users. +This document describes a document and file format as follows: + +- A hierarchical container format (see [Chapter 1](#container)) +- A set of data types (see [Chapter 2](#datatypes)) +- A set of genealogical structures (see [Chapter 3](#gedcom-structures)) +- The FamilySearch GEDZIP file format (see [Chapter 4](#gedzip)) + +Chapter 1 describes a hierarchical container format. +This container format is a general-purpose data representation language for representing any kind of structured information in a sequential medium, +similar to XML, JSON, YAML, or SDLang. +Chapter 1 discusses the syntax and identification of structured information in general, +but it does not deal with the semantic content of any particular kind of data. 
+ +Chapter 2 describes several data types used to represent genealogical information, +such as a date format that permits dating in multiple calendar systems. + +Chapter 3 describes a set of nested genealogical structures +for representing +historical claims, such as individuals, families, and events; +sourcing information, such as sources, repositories, and citations; +and research metadata, such as information about researchers and rights. + +A set of structures conforming to the first 3 chapters of *The FamilySearch GEDCOM Specification* is called a FamilySearch GEDCOM dataset. +A string of octets encoding a dataset is called a data stream. + +Chapter 4 describes a file format for bundling a dataset +with a set of media files or other supporting documents. + +:::note +Prior to 7.0: + +- The container format was called "the GEDCOM data format." +- The data types were unnamed and described in various places throughout the document. +- The genealogical structures were known as "the Lineage-Linked GEDCOM Form." +::: + +## Purpose for Version 7.x {.unnumbered} + +There have been multiple prior releases of this specification, with somewhat idiosyncratic version numbering. +The first public comment draft was released in 1984. +The previous major version was 5.5.1 which was released in draft status in November 1999 +and re-released as a standard in October 2019. + +Version 7.0 has a number of goals, including + +- Clarify ambiguities in the specification. +- Simplify implementations by removing special-case handling. +- Modernize character encoding, length restrictions, and specification wording. +- Introduce semantic versioning (see ). +- Add better multimedia handling, negative assertions, and rich-text notes. +- Add support for common extensions to 5.5.1. +- Provide tools for better interoperability of extensions. + +Version 7.0 introduces several breaking changes with version 5.5.1; +5.5.1 files are, in general, not valid 7.0 files and *vice versa*. 
+These breaking changes were necessary to remove complicated constructs left over from earlier versions. +For a complete list of changes, see the accompanying changelog. + +## A Guide to Version Numbers {.unnumbered} + +Starting with version 7.0.0, version numbers use semantic versioning. +The 3 numbers are titled *major*.*minor*.*patch*. + +A new *major* version may make arbitrary changes to the specification. Distinct major versions are not in general either forward or backward compatible with one another. + +A new *minor* version will preserve the validity of data from all previous minor versions. It may make additional data valid, for example by adding new structure types, allowing current structures in new contexts, or adding new enumerated values or calendars. A minor release will not change the semantic meaning of data from previous minor releases, so for example a 7.0 document is also a valid 7.1 document and represents the same information in both. + +A new *patch* version is a clarified or improved specification for the same data and introduces no changes in the data itself. Any software that correctly implements *X*.*Y*.*Z* also correctly implements *X*.*Y*.*W*. If there is an ambiguity or contradiction in the specification, it will be resolved in a patch version unless it is known that implementations interpreted the spec differently and that clarifying the intended meaning would cause incompatibilities between those implementations. + +It is recommended that implementations accept all data at their own or a lesser minor version, regardless of the patch version. +It is also recommended that they import data from subsequent minor versions by treating any unexpected structures, enumerations, or calendars as if they were [extensions]. 
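The two recommendations above can be sketched in Python as a small decision function. This is only an illustration under stated assumptions: the function names and the three return labels are hypothetical and not defined by the specification, and `"incompatible"` is shorthand for "not in general compatible".

```python
def parse_version(text):
    """Split 'major.minor.patch' into three ints; missing parts count as 0."""
    parts = [int(p) for p in text.split(".")]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])


def import_mode(reader, data):
    """Suggest how a reader at version `reader` might treat `data`.

    The labels returned here are hypothetical names for this sketch.
    """
    r_major, r_minor, _ = parse_version(reader)
    d_major, d_minor, _ = parse_version(data)
    if r_major != d_major:
        # Distinct major versions are not, in general, compatible.
        return "incompatible"
    if d_minor <= r_minor:
        # Same or lesser minor version: accept, whatever the patch version.
        return "accept"
    # Later minor version: import, treating unexpected structures,
    # enumerations, or calendars as if they were extensions.
    return "accept-unknown-as-extensions"
```

For instance, `import_mode("7.1.0", "7.0.5")` falls in the "accept" case because the patch number plays no role in the comparison.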
+ +## URIs and Prefix Notation {.unnumbered} + +This document defines [Uniform Resource Identifiers (URIs)](https://tools.ietf.org/html/rfc3986) to unambiguously identify various concepts, including structure types, data types, calendars, enumerated values, and so on. +In a few places, existing URIs defined by other bodies are used, following the best practice that a new URI should not be introduced for a concept for which a URI is known. + +Rather than write out URIs in full, we use prefix notation: +any URI beginning with 1 of the following short prefixes followed by a colon +is shorthand for a URI beginning with the corresponding URI prefix + +| Short Prefix | URI Prefix | +|:-------------|:------------------------------------| +| `g7` | `https://gedcom.io/terms/v7/` | +| `xsd` | `http://www.w3.org/2001/XMLSchema#` | +| `dcat` | `http://www.w3.org/ns/dcat#` | + +:::example +When the specification says `xsd:string`, it means `http://www.w3.org/2001/XMLSchema#string`. +This is a specification shorthand only; the string "`xsd:string`" is not the URI defining this concept, `http://www.w3.org/2001/XMLSchema#string` is. +::: + + +# Hierarchical container format {#container} + +## Characters + +Each data stream is a sequence of octets or bytes. +The octets encode a sequence of characters according to the UTF-8 character encoding as described in §10.2 of [ISO/IEC 10646:2020](https://www.iso.org/standard/76835.html). + +:::note +Previous versions allowed multiple character encodings, defaulting to ANSEL. +7.0 only uses the UTF-8 character encoding. +::: + +A file containing a FamilySearch GEDCOM data stream should use the filename extension `.ged`. + +The first character in each data stream should be U+FEFF, the byte-order mark. +If present, this initial character has no meaning within this specification but serves to indicate to other systems that the file uses the UTF-8 character encoding. 
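As an illustration of the prefix notation and of the optional byte-order mark, here is a minimal Python sketch. The names `PREFIXES`, `expand_prefix`, and `decode_stream` are hypothetical helpers for this example; only the prefix table and the UTF-8 and U+FEFF rules come from the text above.

```python
# Short prefixes and URI prefixes, copied from the table in this document.
PREFIXES = {
    "g7": "https://gedcom.io/terms/v7/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "dcat": "http://www.w3.org/ns/dcat#",
}

BOM = "\ufeff"  # U+FEFF, the byte-order mark


def expand_prefix(short):
    """Expand e.g. 'xsd:string' to its full URI; leave other strings alone."""
    prefix, colon, rest = short.partition(":")
    if colon and prefix in PREFIXES:
        return PREFIXES[prefix] + rest
    return short


def decode_stream(octets):
    """Decode a data stream as UTF-8 and drop a leading U+FEFF if present.

    Invalid UTF-8 raises UnicodeDecodeError. This sketch does not check
    for the banned characters.
    """
    text = octets.decode("utf-8")
    if text.startswith(BOM):
        text = text[1:]
    return text
```

Because the byte-order mark is recommended but not required, `decode_stream` accepts streams with or without it and returns the same text in both cases.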
+ +Certain characters must not appear anywhere within a data stream: + +- The C0 control characters other than tab and line endings (U+0000--U+001F except U+0009, U+000A and U+000D) +- The DEL character (U+007F) +- Surrogates (U+D800--U+DFFF) +- Invalid code points (U+FFFE and U+FFFF) + +Implementations should be aware that bytes per character and characters per glyph are both variable when using UTF-8. +Use of Unicode-aware processing and display libraries is recommended. + +Character-level grammars are specified in this document using +Augmented Backus-Naur Form (ABNF) +as defined in IETF STD 68 () +and modified in IETF RFC 7405 (). +We use the term "production" to refer to an ABNF rule, supported by any other rules it references. + +:::note +The following is a brief summary of the parts of ABNF, as defined by STD 68 and RFC 7405, that are used in this document: + +- A rule consists of a rulename, an equals sign `=`, and 1 or more alternative matches. +- Alternatives are separated by slashes `/`. +- The first line of a rule must not be indented; the second and subsequent lines of a rule must be indented. +- Comments are introduced with a semi-colon `;`. +- Unicode codepoints are given in hexadecimal preceded by `%x`. Ranges of allowed codepoints are given with a hyphen `-`. +- Double quotes delimit literal strings. Literal strings are case-insensitive unless they are preceded by `%s`. +- Parentheses `()` group elements. Brackets `[]` mark optional content. Preceding a group or element by `*` means any number may be included. Preceding a group or element by `1*` means 1 or more may be included. 
+::: + +The banned characters can be expressed in ABNF as production `banned`: + +```abnf +banned = %x00-08 / %x0B-0C / %x0E-1F ; C0 other than LF CR and Tab + / %x7F ; DEL + / %x80-9F ; C1 + / %xD800-DFFF ; Surrogates + / %xFFFE-FFFF ; invalid +; All other rules assume the absence of any banned characters +``` + +All other ABNF expressions in this document assume the absence of any characters matching production `banned`. + +This document additionally makes use of the following named character sets in ABNF: + + +```abnf +digit = %x30-39 ; 0 through 9 +nonzero = %x31-39 ; 1 through 9 +ucletter = %x41-5A ; A through Z +underscore = %x5F ; _ +atsign = %x40 ; @ +``` + +## Structures + +A **structure** consists of a structure type, an optional **payload**, and a collection of substructures. +The payload is a value expressed as a string using 1 of several data types, as described in [Chapter 2](#datatypes). + +Every structure is either a **record**, meaning it is not contained in any other structure's collection of substructures, +or it is a **substructure** of exactly 1 other structure. The other structure is called its **superstructure**. +Each substructure either refines the meaning of its superstructure, provides metadata about its superstructure, or introduces new data that is closely related to its superstructure. + +The collection of substructures is partially ordered. +Substructures with the same structure type are in a fixed order, +but substructures with different structure types may be reordered. +The order of substructures of a single type indicates user preference, with the first substructure being the most-preferred value, +unless a different meaning is explicitly indicated in the structure's definition. + +A structure must have either a non-empty payload or at least 1 substructure. +Empty payloads and missing payloads are considered equivalent. The remainder of this document uses "payload" as shorthand for "non-empty payload". 
+ +:::note +Unlike structures, pseudo-structures needn't have either payloads or substructures. `TRLR` never has either, and `CONT` doesn't when payloads contain empty lines. +::: + +A structure is a representation of data about its **subject**. Examples include the entity, event, claim, or activity that the structure describes. + +Datasets also contain 3 types of pseudo-structures: + +- The header resembles a record, comes first in each document, and contains metadata about the entire document in its substructures. + See [The Header](#the-header) for more. + +- The trailer resembles a record, comes last in each document, and cannot contain substructures. + +- A line continuation resembles a substructure, comes before any other substructures, is used to encode multi-line payloads, and cannot contain substructures. + +Previous versions limited the number of characters that could appear in a structure, record, and payload. Those restrictions were removed in 7.0. + +## Lines + +A **line** is a string representation of (part of) a *structure*. +A line consists of a level, optional cross-reference identifier, tag, optional line value, and line terminator. +It matches the production `Line`: + +```abnf +Line = Level D [Xref D] Tag [D LineVal] EOL + +Level = "0" / nonzero *digit +D = %x20 ; space +Xref = atsign 1*tagchar atsign ; but not "@VOID@" +Tag = stdTag / extTag +LineVal = pointer / lineStr +EOL = %x0D [%x0A] / %x0A ; CR-LF, CR, or LF + +stdTag = ucletter *tagchar +extTag = underscore 1*tagchar +tagchar = ucletter / digit / underscore + +pointer = voidPtr / Xref +voidPtr = %s"@VOID@" + +nonAt = %x09 / %x20-3F / %x41-10FFFF ; non-EOL, non-@ +nonEOL = %x09 / %x20-10FFFF ; non-EOL +lineStr = (nonAt / atsign atsign) *nonEOL ; leading @ doubled +``` + +The **level** matches production `Level` and is used to encode substructure relationships. +Any line with level $0$ encodes a record or a record-like pseudo-structure. 
+Any line with level $x > 0$ encodes a substructure of the structure encoded by the nearest preceding line with level $x-1$. + +:::note +Previous versions allowed spaces and blank lines to precede the level of a line. +That permission was removed from 7.0 to simplify parsing. +::: + +The **cross-reference identifier** matches production `Xref` (but not `voidPtr`) and indicates that this is a structure to which pointer-type payloads may point. +Each cross-reference identifier must be unique within a given data document. +Cross-reference identifiers are not retained between data streams and should not be made visible to the user to avoid them referring to transient data within notes or other durable data. + +Each record to which other structures point must have a cross-reference identifier. +A record to which no structures point may have a cross-reference identifier, but does not need to have one. +A substructure or pseudo-structure must not have a cross-reference identifier. + +The **tag** matches production `Tag` and encodes the structure's type. +Tags that match the production `stdTag` are defined in this document. +Tags that match `extTag` are defined according to [Extensions]. + +The **line value** matches production `LineVal` and encodes the structure's payload. +Line value content is sufficient to distinguish between pointers and line strings. +Pointers are encoded as the cross-reference identifier of the pointed-to structure. +Each non-pointer payload may be encoded in 1 or more line strings (line continuations encode multi-line payloads in several line strings). +The exact encoding of non-pointer payloads is dependent on the datatype of the payload, as determined by the structure type. +The datatype of non-pointer payloads cannot be fully determined by line value content alone. + +If a line value matches production `Xref`, the same value must occur as the cross-reference identifier of a structure within the document. 
+The special `voidPtr` production is provided to encode null pointers. + +If the first character of the string stored in a line string is U+0040 (`@`), the line string must escape that character by doubling that `@`. + +:::note +Previous versions required doubling all `@` in a line value, but such doubling was not widely implemented in practice. +`@` is only doubled in this version if it is the first character of a line string. +::: + +:::example +A structure with tag `NOTE`, level 1, and a 2-line payload where the first line is "`me@example.com is my email`" and the second line is "`@me and @I are my social media handles`" would be encoded as + +```gedcom +1 NOTE me@example.com is my email +2 CONT @@me and @I are my social media handles +``` +::: + +:::note +Line values that match neither `Xref` nor `lineStr` are prohibited. They have been used in previous versions (for example, a line value beginning `@#D` was a date in versions 4.0 through 5.5.1) and may be used again in a future version if an appropriate need arises. +::: + +The components of a line are each separated by a single **delimiter** matching production `D`. A delimiter is always a single space character (U+0020). Using multiple delimiters between components of a line is prohibited. Thus if the tag is followed by 2 spaces, the first space is a delimiter and the second space is part of the line value. + +All characters in a payload must be preserved in the corresponding line value, including preserving any leading or trailing spaces. + +Each line is ended by a **line terminator** matching production `EOL`. A line terminator may be a carriage return U+000D, line feed U+000A, or a carriage return followed by a line feed. The same line terminator should be used on every line of a given document. + +Line values cannot contain internal line terminators, but some payloads can. If a payload contains a line terminator, the payload is split on the line terminators into several payloads. 
The first of these split payloads is encoded as the line value of the structure's line, and each subsequent split payload is encoded as the line value of a **line continuation** pseudo-structure of the structure. +The tag of a line continuation pseudo-structure is `CONT`. +The order of the line continuation pseudo-structures matches the order of the lines of text in the payload. + +Line continuation pseudo-structures are not considered to be structures nor to be part of a structure's collection of substructures. They must appear immediately following the line whose payload they are encoding and before any other line. + +Because line terminators in payloads are encoded using line continuations, it is not possible to distinguish between U+000D and U+000A in payloads. + +:::note +Previous versions limited the number of characters that could appear in a tag, cross-reference identifier, and line-value. +Those restrictions were removed in version 7.0. +The `CONC` pseudo-structure, which allowed line values to have a shorter length restriction than payloads, was also removed. +::: + +:::example +The following are examples of valid but unrelated lines: + +- level 0, cross-reference identifier `@I1234@`, tag `INDI`, no line value. + + ````gedcom + 0 @I1234@ INDI + ```` + +- level 1, no cross-reference identifier, tag `CHIL`, pointer line value pointing to the structure with cross-reference identifier "`@I1234@`". + + ````gedcom + 1 CHIL @I1234@ + ```` + +- level 1, no cross-reference identifier, tag `NOTE`, and line value + continuation pseudo-structure to encode a 4-line payload string: "`This is a note field that`", "`  spans four lines.`", “”, and "`(the third line was blank)`". Note that leading and trailing spaces are preserved. + + ````gedcom + 1 NOTE This is a note field that + 2 CONT spans four lines. 
+ 2 CONT + 2 CONT (the third line was blank) + ```` +::: + +## The Header and Trailer {#the-header} + +Every dataset must begin with a header pseudo-structure and end with a trailer pseudo-structure. + +The trailer pseudo-structure has level `0`, tag `TRLR`, and no line value or substructures. +The trailer has no semantic meaning; it is present only to mark the end of the dataset. + +The header pseudo-structure has level `0`, tag `HEAD`, and no line value. +The substructures of the header pseudo-structure provide metadata about the entire dataset. +Some of those substructures are defined here; +others are defined in [Chapter 3](#gedcom-structures) or by extensions. + +Every header must contain a substructure with a known tag that identifies the specification with which the dataset complies. +For FamilySearch GEDCOM 7.0, this is the `GEDC` structure described in [Chapter 3](#GEDC). + +A header should contain an extension schema structure with tag `SCHMA` +as described in [Extensions]. + +## Extensions + +A **standard structure** is a structure whose type, tag, meaning, superstructure, and cardinality within the superstructure are described in this document. This includes records such as `INDI` and substructures such as `INDI`.`NAME`. + +Two forms of **extension structures** are permitted: + +- A **tagged extension structure** is a structure whose tag matches production `extTag`. Tagged extension structures may appear as records or substructures of any other structure. +- An **extended-use standard structure** is a structure whose type, tag, and meaning are defined in this document and whose superstructure is a tagged extension structure. + +Extension structures may have substructures, which may be either tagged extension structures or extended-use standard structures. + +All other non-standard structures are prohibited. 
Examples of prohibited structures include, but are not limited to, + +- any structure with a tag matching production `stdTag` that is not defined in this document; +- any substructure with cardinality `{0:1}` appearing more than once; +- a standard substructure appearing as a record or vice-versa; +- a standard structure whose payload does not match the requirements of this document. + +:::note +In some cases, an extension may need to allow multiple structures where this document allows only 1. The recommended way to do this is to create an extension tag and URI and serve a page describing how the semantics of the structure have been extended to allow multiple instances. + +:::example +Suppose I have multiple sources that give different ages of the wife at a wedding; however, this specification allows only 1 `MARR`.`WIFE`.`AGE`. An extension could not include multiple `MARR`.`WIFE` nor `MARR`.`WIFE`.`AGE`, but could define a new extension `_AGE`, give it a URL, and provide the following definition of this extension structure type at that URL: + +> Alternate age: an age attested by some source, but not accepted by the researcher as the actual age of the individual. If the age is accepted by the researcher, the standard tag `AGE` should be used instead. + +This alternate age extension structure could be used as follows: + +```gedcom +1 MARR +2 WIFE +3 AGE 27y +3 _AGE 22y +``` +::: +::: + +Enumerated values may be extended with new values that match production `extTag`. +Enumerations may not use standard values from other enumeration sets. + +:::example +The following is not allowed because `PARENT` is defined as a value for `ROLE`, not for `RESN` + +```gedcom +0 @BAD@ INDI +1 RESN PARENT +1 NOTE The above enumeration value is not allowed +``` +::: + +Dates may be extended provided they use a calendar that matches production `extTag`. +Dates with extension calendars may also use extension months and epochs. 
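
Extension enumeration values, calendars, months, and epochs all rely on the `extTag` production given in the `Line` grammar. A minimal Python sketch of telling `extTag` apart from `stdTag` follows; the function name `classify_tag` and its return strings are assumptions made for illustration, and the regular expressions are a direct transcription of the two ABNF rules rather than anything defined by this document.

```python
import re

# extTag = underscore 1*tagchar, where tagchar = ucletter / digit / underscore
EXT_TAG = re.compile(r"_[A-Z0-9_]+")
# stdTag = ucletter *tagchar
STD_TAG = re.compile(r"[A-Z][A-Z0-9_]*")


def classify_tag(value: str) -> str:
    """Return "extTag", "stdTag", or "invalid" for a candidate tag string."""
    if EXT_TAG.fullmatch(value):
        return "extTag"
    if STD_TAG.fullmatch(value):
        return "stdTag"
    return "invalid"
```

Under this sketch `_AGE` is classified as an extension tag and `PARENT` as a standard tag. Whether a standard value is permitted in a given enumeration set still has to be checked against that set's definition.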
+ + +### Extension Tags + +Each use of the `extTag` production is called an extension tag, +including when used as a tag, calendar, month, epoch, or enumerated value. +Each `extTag` is either a *documented extension tag* or an *undocumented extension tag*. +It is recommended that documented extension tags be used instead of undocumented extension tags wherever possible. + +A **documented extension tag** is a tag that is mapped to a URI using the schema structure. +The schema structure is a substructure of the header with tag `SCHMA`. +It should appear within the document before any extension tags. +The schema's substructures are tag definitions. + +A tag definition is a structure with tag `TAG`. +Its payload is an extension tag, a space, and a URI +and defines that extension tag to be an abbreviation for that URI within the current document. + +:::example +The following header + +```gedcom +0 HEAD +1 SCHMA +2 TAG _SKYPEID http://xmlns.com/foaf/0.1/skypeID +2 TAG _MEMBER http://xmlns.com/foaf/0.1/member +``` + +defines the following tags + +| Tag | Means | +| :---- | :---- | +| `_SKYPEID` | `http://xmlns.com/foaf/0.1/skypeID` | +| `_MEMBER` | `http://xmlns.com/foaf/0.1/member` | +::: + +The meaning of a documented extension tag is identified by its URI, not its tag. +Documented extension tags can be changed freely by modifying the schema, +though it is recommended that documented extension tags not be changed. +However, a tag change may be necessary if a product picks the same tags for URIs that another product uses for different URIs. + +:::example +The following 2 document fragments are semantically equivalent +and a system importing one may export it as the other without change of meaning. 
+ +```gedcom +0 HEAD +1 SCHMA +2 TAG _SKYPEID http://xmlns.com/foaf/0.1/skypeID +0 @I0@ INDI +1 _SKYPEID example.person +``` + +```gedcom +0 HEAD +1 SCHMA +2 TAG _SI http://xmlns.com/foaf/0.1/skypeID +0 @I0@ INDI +1 _SI example.person +``` +::: + +An extension tag that is not given a URI in the schema structure is called an **undocumented extension tag**. +The meaning of an undocumented extension tag is identified by its tag. + + +### Requirements and Recommendations + +- It is recommended that applications not use undocumented extension tags. +- It is required that each tag definition's extension tag be unique within the document. +- It is recommended that each documented extension tag's URI be unique within the document. +- It is recommended that extension creators use URLs as their URIs +and serve a page describing the meaning of an extension at its URL. +- It is recommended that extensions use extended-use standard structures instead of tagged extension structures if extended-use standard structures will suffice. + +Future versions may include additional recommendations relating to documentation, machine-readable documentation, or embedded metadata about extensions within the schema. + +### Extension versus Standard + +Standard structures take priority over extensions. +Data contained in extension tags will not be interpreted by other systems correctly unless the other system supports that particular extension. +In particular, those supporting extensions should keep in mind the following: + +- If a standard structure is present that contradicts an extension that is present, the standard structure has priority and the extension should be updated to align with it. + +
+ + If a document has an extension `_ISODATE` in ISO 8601 format that disagrees with a `DATE` in the `DateValue` format, the `DATE` shall be taken as more correct and the `_ISODATE` updated to reflect that. + +
+ +- If a standard structure can be extracted as a subset of the semantics of an extension, the standard tag must be generated along with the extension and kept in sync with it by systems understanding the extension. + +
+ + If a document has an extension `_LOC` providing a detailed hierarchical place representation with historical names, boundaries, and the like, it must also generate the corresponding `PLAC` structures with the subset of that information which `PLAC` can represent. + +
+ +- If an extension can be extracted as a subset of the semantics of a standard structure, or if the extension and standard structure only sometimes align, then the standard structure should be included if and only if the semantics align in this case. + +
+ + If a document has an extension `_PARTNER` that generalizes `HUSB` and `WIFE` and some `ASSO` `ROLE`s, then it should pair the extension with those standard structures if and only if it knows which one applies. + +
+ +
+ + If a document has an extension `_HOUSEHOLD` that is the same as `FAM` in some situations but not in others, then it should keep the `_HOUSEHOLD` and `FAM` in sync if and only if they align. + +
+ +- Six standard structure types are exceptions to these rules: + `NOTE`, `SNOTE`, `INDI`.`EVEN`, `FAM`.`EVEN`, `INDI`.`FACT`, and `FAM`.`FACT`. + Each of these allows human-readable text to describe information that cannot be captured in more-specific structures. + As such, all other structures express information that could be described using 1 or more of those structure types. + Extensions do not need to duplicate their information using any of those structures. + +
+ + If a document has an extension `_MEMBER` that indicates membership in clubs, boards, and other groups, + it is not required to duplicate that information in an `INDI`.`FACT` + because `INDI`.`FACT` is 1 of the 6 special structure types listed above. + +
+ +
+ + If a document has an extension `_WEIGHT` that describes the weight of a person, + it must duplicate that information in an `INDI`.`DSCR` + because `INDI`.`DSCR` is not 1 of the 6 special structure types listed above. + +
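
The schema mapping described under Extension Tags can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `read_schema` is hypothetical, input lines are assumed to be well-formed and already split on line terminators, and the sketch raises an error on a repeated tag definition because each tag definition's extension tag is required to be unique within the document.

```python
from typing import Dict, Iterable


def read_schema(lines: Iterable[str]) -> Dict[str, str]:
    """Collect `TAG` definitions found under `HEAD`.`SCHMA` into {tag: uri}."""
    mapping: Dict[str, str] = {}
    in_head = in_schema = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        parts = line.split(" ", 2)  # level, tag (or xref), rest of line
        level, tag = parts[0], parts[1]
        value = parts[2] if len(parts) > 2 else ""
        if level == "0":
            # The header never has a cross-reference identifier
            in_head, in_schema = (tag == "HEAD"), False
        elif level == "1":
            in_schema = in_head and tag == "SCHMA"
        elif level == "2" and in_schema and tag == "TAG":
            # Payload is an extension tag, a space, and a URI
            ext_tag, _, uri = value.partition(" ")
            if ext_tag in mapping:
                raise ValueError(f"duplicate tag definition: {ext_tag}")
            mapping[ext_tag] = uri
    return mapping


example = [
    "0 HEAD",
    "1 SCHMA",
    "2 TAG _SKYPEID http://xmlns.com/foaf/0.1/skypeID",
    "2 TAG _MEMBER http://xmlns.com/foaf/0.1/member",
    "0 @I0@ INDI",
    "1 _SKYPEID example.person",
    "0 TRLR",
]
schema = read_schema(example)
```

Because the meaning of a documented extension tag is identified by its URI, a reader built along these lines would compare `schema["_SKYPEID"]` rather than the tag text when deciding whether two documents use the same extension.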
+ +## Removing data + +There may be situations where data needs to be removed from a dataset, such as when a user requests its deletion or marks it as confidential and not for export. + +In general, removed data should result in removed structures. + +Pointers to a removed structure should be replaced with `voidPtr`s. + +If removal of a structure makes the superstructure invalid because the superstructure required the substructure, the structure should instead be retained and have its payload changed to a `voidPtr` if a pointer, or to a datatype-appropriate empty value if a non-pointer. + +If removing a structure leaves its superstructure with no payload and no substructures, the superstructure should also be removed. + + + +# Data types {#datatypes} + +Every line value (with any continuation pseudo-structures) is a string. +However, those strings can encode 1 of several conceptual datatypes. + +## Text + +A free-text string is text in a human language. +Conceptually, it may be either a user-generated string or a source-generated string. +Programmatically, both are treated as unconstrained sequences of characters with an associated language. + +```abnf +anychar = %x09-10FFFF ; but not banned, as with all ABNF rules +Text = *anychar +``` + +The URI for the `Text` datatype is `xsd:string`. + +## Integer + +An integer is a non-empty sequence of ASCII decimal digits +and represents a non-negative integer in base-10. +Leading zeros have no semantic meaning and should be omitted. + +```abnf +Integer = 1*digit +``` + +Negative integers are not supported by this specification. + +The URI for the `Integer` datatype is `xsd:nonNegativeInteger`. + +## Enumeration + +An enumeration is a selection from a set of options. +They are represented as a string matching the same production as a tag, including the rules about extensions beginning with `_` (U+005F) and being mapped to URIs by a schema. 
+ +```abnf +Enum = Tag +``` + +Each enumeration value has a distinct meaning +as identified by its corresponding URI. + +The URI for the `Enum` datatype is `g7:type-Enum`. + +## Date + +The date formats defined in this specification +include the ability to store approximate dates, date periods, and dates expressed in different calendars. + +Technically, there are 3 distinct date datatypes: + +- `DateValue` is a generic type that can express many kinds of dates. +- `DateExact` is used for timestamps and other fully-known dates. +- `DatePeriod` is used to express time intervals that span multiple days. + + +```abnf +DateValue = date / DatePeriod / dateRange / dateApprox +DateExact = day D month D year ; in Gregorian calendar +DatePeriod = %s"FROM" D date [D %s"TO" D date] + / %s"TO" D date + +date = [calendar D] [[day D] month D] year [D epoch] +dateRange = %s"BET" D date D %s"AND" D date + / %s"AFT" D date + / %s"BEF" D date +dateApprox = (%s"ABT" / %s"CAL" / %s"EST") D date + +dateRestrict = %s"FROM" / %s"TO" / %s"BET" / %s"AND" / %s"BEF" + / %s"AFT" / %s"ABT" / %s"CAL" / %s"EST" / %s"BCE" + +calendar = %s"GREGORIAN" / %s"JULIAN" / %s"FRENCH_R" / %s"HEBREW" + / extTag + +day = Integer +year = Integer +month = stdTag / extTag ; constrained by calendar +epoch = %s"BCE" / extTag ; constrained by calendar +``` + +In addition to the constraints above: + +- The allowable `month`s and `epoch`s are determined by the `calendar`. +- No calendar names, months, or epochs match `dateRestrict`. +- Extension calendars (those with `extTag` for their `calendar`) must use `extTag`, not `stdTag`, for months. + +An absent `calendar` is equivalent to the calendar `GREGORIAN`. + +The grammar above allows for `date`s to be preceded by various words. The meaning of these words is given as follows: + +|Production| Meaning | +|:---------|:-------------------------------------------| +|`FROM` *x*|Lasted for multiple days, beginning on *x*. 
| +|`TO` *x* |Lasted for multiple days, ending on *x*. | +|`BET` *x*<br/>`AFT` *x*|Exact date unknown, but no earlier than *x*.| +|`AND` *x*<br/>
`BEF` *x*|Exact date unknown, but no later than *x*. | +|`ABT` *x* |Exact date unknown, but near *x*. | +|`CAL` *x* |*x* is calculated from other data. | +|`EST` *x* |Exact date unknown, but near *x*; and *x* is calculated from other data.| + +Known calendars and tips for handling dual dating and extension calendars are given in [Appendix A: Calendars and Dates](#A-calendars). + +Date payloads may also be omitted entirely if no suitable form is known but a substructure (such as a `PHRASE` or `TIME`) is desired. + +:::note +Versions 5.3 through 5.5.1 allowed phrases inside `DateValue` payloads. +Date phrases were moved to the `PHRASE` substructure in version 7.0. +::: + +:::note +As defined by the grammar above, every date must have a year. +If no year is known, the entire date may be omitted. + +:::example +The following is an appropriate way to handle a missing year + +```gedcom +2 DATE +3 PHRASE 5 January (year unknown) +``` +::: +::: + +The URI for the `DateValue` datatype is `g7:type-Date`. + +The URI for the `DateExact` datatype is `g7:type-Date#exact`. + +The URI for the `DatePeriod` datatype is `g7:type-Date#period`. + +## Time + +Time is represented on a 24-hour clock (for example, 23:00 rather than 11:00 PM). +It may be represented either in event-local time or in Coordinated Universal Time (UTC). +UTC is indicated by including a `Z` (U+005A) after the time value; event-local time is indicated by its absence. + +```abnf +Time = hour ":" minute [":" second ["." fraction]] [%s"Z"] + +hour = digit / ("0" / "1") digit / "2" ("0" / "1" / "2" / "3") +minute = ("0" / "1" / "2" / "3" / "4" / "5") digit +second = ("0" / "1" / "2" / "3" / "4" / "5") digit +fraction = 1*digit +``` + +:::note +The above grammar prohibits end-of-day instant `24:00:00` and leap-seconds. It allows both `02:50` and `2:50` as the same time. +::: + +The URI for the `Time` datatype is `g7:type-Time`. + +## Age + +Ages are represented by counts of years, months, weeks, and days. 
+ +```abnf +Age = [ageBound D] ageDuration + +ageBound = "<" / ">" +ageDuration = years [D months] [D weeks] [D days] + / months [D weeks] [D days] + / weeks [D days] + / days + +years = Integer %x79 ; 35y +months = Integer %x6D ; 11m +weeks = Integer %x77 ; 8w +days = Integer %x64 ; 21d +``` + +Where + +|Production |Meaning | +|:----------|:-------------------------------------------------| +|`<` | The real age was less than the provided age | +|`>` | The real age was greater than the provided age | +|`years` | a number of years | +|`months` | a number of months | +|`weeks` | a number of weeks | +|`days` | a number of days | + +Non-integer numbers should be rounded down to an integer. Thus, if someone has lived for 363.5 days, their age might be written as `363d`, `51w 6d`, `51w`, `0y`, etc. + +Because numbers are rounded down, `>` effectively includes its endpoint; that is, the age `> 8d` includes people who have lived 8 days + a few seconds. + +Different cultures count ages differently. Some increment years on the anniversary of birth and others at particular seasons. Some round to the nearest year, others round years down, others round years up. Because users may be unaware of these traditions or may fail to convert them to the round-down convention, errors in age of up to a year are common. + +Age payloads may also be omitted entirely if no suitable form is known but a substructure (such as a `PHRASE`) is desired. + +:::note +Versions 5.5 and 5.5.1 allowed a few specific phrases inside `Age` payloads. +Age phrases were moved to the `PHRASE` substructure in 7.0. +::: + +The URI for the `Age` datatype is `g7:type-Age`. + + +## List + +A list is a meta-syntax representing a sequence of values with another datatype. +Two list datatypes are used in this document: List:Text and List:Enum. +Lists are serialized in a comma-separated form, delimited by a comma (U+002C `,`) and any number of spaces (U+0020) between each item. 
+It is recommended that a comma-space pair (U+002C U+0020) be used as the delimiter. + +```abnf +List = listItem *(listDelim listItem) +listItem = "" / nocommasp / nocommasp *nocomma nocommasp +listDelim = *D "," *D +nocomma = %x09-2B / %x2D-10FFFF +nocommasp = %x09-1D / %x21-2B / %x2D-10FFFF + +List-Text = List +List-Enum = Enum *(listDelim Enum) +``` + +If valid for the underlying type, empty strings may be included in a list by having no characters between delimiters. + +:::example +A `List:Text` with value "`, , one, more,`" has 5 `Text`-type values: 2 empty strings, the string "`one`", the string "`more`", and 1 more empty string. +::: + +There is no escaping mechanism to allow lists of entries that begin or end with spaces or that contain comma characters. + +The URI for the `List:Text` datatype is `g7:type-List#Text`. + +The URI for the `List:Enum` datatype is `g7:type-List#Enum`. + + +## Personal Name + +A personal name is mostly free-text. It should be the name as written in the culture of the individual and should not contain line breaks, repeated spaces, or characters not part of the written form of a name (except for U+002F as explained below). + +```abnf +NamePersonal = nameStr + / [nameStr] "/" [nameStr] "/" [nameStr] + +nameChar = %x20-2E / %x30-10FFFF ; any but '/' and '\t' +nameStr = 1*nameChar +``` + +The character U+002F (`/`, slash or solidus) has special meaning in a personal name, being used to delimit the portion of the name that most closely matches the concept of a surname, family name, or the like. +This specification does not provide any standard way of representing names that contain U+002F. + +The URI for the `PersonalName` datatype is `g7:type-Name`. + +## Language + +The language datatype represents a human language or family of related languages, as defined by the IETF in [BCP 47](https://tools.ietf.org/html/bcp47). 
+It consists of a sequence of language subtags separated by hyphens, +where language subtags are [registered by the IANA](https://www.iana.org/assignments/language-subtag-registry). + +The ABNF grammar for language tags is given in BCP 47, section 2.1, production `Language-Tag`. + +The URI for the `Language` datatype is `xsd:Language`. + +## Media Type + +The media type datatype represents the encoding of information in bytes or characters, as defined by the IETF in [RFC 2045](https://tools.ietf.org/html/rfc2045) and [registered by the IANA](http://www.iana.org/assignments/media-types/). + +The official grammar for media type is given in RFC 2045, section 5.1. +However, that document does not give stand-alone ABNF, instead referring to registration rules and describing some components in English. +The programmatic parts of the media type grammar can be summarized as follows: + +```abnf +MediaType = mt-type "/" mt-subtype *(";" mt-parameter) + +mt-type = mt-token +mt-subtype = mt-token +mt-parameter = mt-attribute "=" mt-value +mt-token = 1*mt-char +mt-attribute = mt-token +mt-value = mt-token / mt-qstring +mt-char = %x20-21 / %x23-27 / %x2A-2B / %x2D-2E ; not "(),/ + / %x30-39 / %x41-5A / %x5E-7E ; not :;<=>?@[\] + +mt-qstring = %x22 *(mt-qtext / mt-qpair) %x22 +mt-qtext = %x09-0A / %x20-21 / %x23-5B / %x5D-7E ; not CR "\ +mt-qpair = "\" %x09-7E +``` + +The URI for the `MediaType` datatype is `dcat:mediaType`. + +## Special + +The special datatype is a string conforming to a case-specific standard or constraints. The constraints on each special datatype instance are either unique to that structure type or are not simply expressed. +For example, the payload of an `IDNO` structure may obey different rules for each possible `TYPE` substructure. + +Each special datatype is distinct. +The URI for the generic datatype subsuming all `Special` datatypes is `xsd:string` (the same as the `Text` datatype). 
+ +```abnf +Special = Text +``` + + +# Genealogical structures {#gedcom-structures} + +This chapter describes a set of structure types for exchanging family-based lineage-linked genealogical information. +Lineage-linked data pertains to individuals linked in family relationships across multiple generations. + +The genealogical structures defined in this chapter are based on the general framework of the container format and data types defined in Chapters 1 and 2. + +Historically, these genealogical structures were used as the only form approved for exchanging data with Ancestral File, TempleReady and other Family History resource files. +Those systems were all replaced between 1999 and 2019, and GEDCOM-X () was introduced as the new syntax for communication with their replacements. +FamilySearch GEDCOM 7.0 and GEDCOM-X have similar expressive power, +but as of 2021 GEDCOM is more common for exchanging single-researcher files between applications +and GEDCOM-X is more common for transferring bulk data and communication directly between applications. + +The basic description of the genealogical structures' organization is presented in the following 3 major sections: + +* "[Structure Organization]" describes records and other nested structures. +* "[Structure Meaning]" provides a definition of each structure by its tag. +* "[Enumeration Values]" provides a definition of each enumeration value by its containing structure. + +## A Metasyntax for Structure Organization + +The structures, with their payloads and substructures, +are represented using a custom metasyntax. +The intent of this metasyntax is to resemble the line encoding of allowable structures. In the metasyntax: + +- Options are placed between brackets `[` and `]` and have choices separated by pipes `|`. +- Named sets of rules are indicated with a name followed by `:=`. +- Level markers are used to indicate substructure relationships. + - `0` means "must be a record". 
+ - `n` means "level inherited from rule instantiation". + - `+1`, `+2`, etc., indicate nesting within the nearest preceding structure with lesser level. +- Four cardinality markers are used: `{0:1}`, `{1:1}`, `{0:M}`, and `{1:M}`. + - `{0:` means "optional" -- the structure may be omitted + - `{1:` means "required" -- at least 1 must appear + - `:M}` means "any number" -- 1 or more structures may appear. + Unless otherwise specified, the first is the most-preferred value. + If an application needs to display just 1 of several `NAME`s, `BIRT`s, etc., it should show the first such structure unless more specific selection criteria are available. + - `:1}` means "singular" -- at most 1 may appear; a second must not be present. + + Systems interested in violating the cardinality rules should instead create [extension structures](#extensions) with different cardinality. +- Rule instantiation is indicated by the rule name in double angle-brackets (such as `<<`rule name`>>`) and a cardinality marker. + + The cardinality markers of rule instantiations and their referenced line templates are combined such that the resulting cardinality + is required only if both combined cardinalities are required + and singular only if both combined cardinalities are singular. + +
+ The definition of the `FAM` record has the line + + ````gedstruct + +1 <<CREATION_DATE>> {0:1} + ```` + + and the `CREATION_DATE` rule begins + + ````gedstruct + n CREA {1:1} g7:CREA + ```` + + Thus, a `FAM` record has an optional singular `CREA` substructure + (that is, cardinality `{0:1}`): + optional because the rule instantiation is optional, and singular because both markers are singular. +
+ +- Line templates have several parts: + - An optional cross-reference template `@XREF:`tag`@`, meaning this structure may be pointed to by other structures. + + Structures that are not pointed to by other structures need not have a [cross-reference identifier](#lines) even if their line template has a cross-reference template. + - The standard tag for this structure. + - An optional payload descriptor; if present this is 1 of the following: + + - `@<XREF:`tag`>@` means a pointer to a structure with this cross-reference template; `@VOID@` is also permitted. + - `<`datatype`>` means a non-pointer payload, as described in [Data types](#datatypes). If the datatype allows the empty string, the payload may be omitted. + - `[`text`|]` means the payload is optional but if present must be the given text. + + If there is a payload descriptor, a payload that matches the descriptor is required of the described structure unless the descriptor says the payload is optional. + + If there is no payload descriptor, the described structure must not have a payload. + + - A cardinality marker. + - The URI of this structure type. + + Pseudo-structures do not have a URI. + +The context of a structure's superstructure may be necessary in addition to the structure's standard tag to fully determine its structure type. +To refer to a structure in the context of its superstructure, +tags are written with intervening periods. +For example, `GEDC`.`VERS` refers to a structure with tag `VERS` +and a superstructure with tag `GEDC`. + + +## Structure Organization + +### Document + +#### Dataset := {-} + +```gedstruct
+0 <<HEADER>> {1:1} +0 <<RECORD>> {0:M} +0 TRLR {1:1} +``` + +#### `RECORD` := + +```gedstruct +[ +n <> {1:1} +| +n <> {1:1} +| +n <> {1:1} +| +n <> {1:1} +| +n <> {1:1} +| +n <> {1:1} +| +n <> {1:1} +] +``` + +#### `HEADER` := + +```gedstruct +n HEAD {1:1} + +1 GEDC {1:1} g7:GEDC + +2 VERS {1:1} g7:GEDC-VERS + +1 SCHMA {0:1} g7:SCHMA + +2 TAG {0:M} g7:TAG + +1 SOUR {0:1} g7:HEAD-SOUR + +2 VERS {0:1} g7:VERS + +2 NAME {0:1} g7:NAME + +2 CORP {0:1} g7:CORP + +3 <> {0:1} + +3 PHON {0:M} g7:PHON + +3 EMAIL {0:M} g7:EMAIL + +3 FAX {0:M} g7:FAX + +3 WWW {0:M} g7:WWW + +2 DATA {0:1} g7:HEAD-SOUR-DATA + +3 DATE {0:1} g7:DATE-exact + +4 TIME