GRM syntax

Typography

On the typography used in this document.

A character is written as its representation in the C programming language, followed by its name and its Unicode code point. Depending on the type of character, one or more of these elements can be omitted. For example, a newline character is written as \n (NEWLINE, U+000A).
When we write "from X to Z", we mean any element of the sequence X, Y, Z. In other words, the range is inclusive: Z is included.

Glossary

Attribute: An element which gives additional information for a Named Node. An Attribute has an Attribute Name and optionally an Attribute Value.
Comment: An element that is not interpreted by a GRM parser. The purpose of a Comment is the same as the purpose of a comment in any programming or markup language.
GRM: A markup language aimed to be short to write, easy to parse and customizable. GRM stands for "GeneRic Markup".
GRM Definition Document: A special GRM document used to describe which Named Nodes and Marks are defined. This is optional.
GRM Lite: A subset of GRM which does not contain Marks.
Mark: A shortcut for a Named Node containing no Attributes. A Mark has a starting visible character and an ending visible character. A Mark can be a Verbatim Mark which means that no characters between its starting and its ending characters are interpreted as GRM language characters.
Node: A GRM document is a set of Nodes. A Node can be a Text Node or a Named Node.
Text Node: A Node which only represents text.
Named Node: A Node which has a Name, optionally Attributes and optionally child Nodes.

Syntax in human language

Category 1: Encoding and characters

Here are the rules for the encoding and the characters used for a valid GRM document.

Rule C1R01

Unicode is used, preferably encoded as UTF-8. A GRM parser must support characters outside the Basic Multilingual Plane (BMP), that is, characters whose Unicode code point is above U+FFFF. For example: emojis.

Rule C1R02

GRM documents are case-sensitive. This means that a and A are two distinct characters.

Rule C1R03

The following control characters are totally ignored by GRM. This means that a GRM document containing them is valid, but the parser ignores those characters.

From NULL (U+0000) to BACKSPACE (U+0008).
From VERTICAL TAB (U+000B) to INFORMATION SEPARATOR ONE (U+001F).
From DELETE (U+007F) to APPLICATION PROGRAM COMMAND (U+009F).
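
As an illustration of Rule C1R03, here is a minimal Python sketch that removes the ignored control characters before parsing. The function name strip_ignored is an assumption made for this example and is not part of GRM. A real parser may prefer to skip these characters on the fly instead of copying the input.

```python
import re

# Rule C1R03 ranges, both ends included:
# U+0000 to U+0008, U+000B to U+001F, U+007F to U+009F.
_IGNORED = re.compile(r"[\x00-\x08\x0b-\x1f\x7f-\x9f]")


def strip_ignored(text: str) -> str:
    """Return text without the control characters that GRM ignores."""
    return _IGNORED.sub("", text)
```

Note that CARRIAGE RETURN (U+000D) falls inside the second range, so a document using \r\n line endings is read as if it used \n only.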

Category 2: Syntax definition axioms

Here is the basis of the syntax definition. This category only contains rules that apply to a single character.

Remember that some characters are completely ignored (Rule C1R03).

Syntax for general single characters

newline

The character \n (NEWLINE, U+000A).

tab

The character \t (HORIZONTAL TAB, U+0009).

space

The character SPACE (U+0020).

whitespace

A newline or a tab or a space.

letter

Any character from a to z or from A to Z.

digit

Any character from 0 to 9.

alphanumeric

Any letter or digit.

invisible-char

Any character that cannot be seen by someone reading the document. In other words, it is a character for which a printer would not use any ink. Listing every possible invisible-char is too daunting a task, so here is a non-exhaustive list.

A whitespace.
A character listed in Invisible Characters:
- NO-BREAK SPACE (U+00A0);
- SOFT HYPHEN (U+00AD);
- COMBINING GRAPHEME JOINER (U+034F);
- ARABIC LETTER MARK (U+061C);
- from HANGUL CHOSEONG FILLER (U+115F) to HANGUL JUNGSEONG FILLER (U+1160);
- from KHMER VOWEL INHERENT AQ (U+17B4) to KHMER VOWEL INHERENT AA (U+17B5);
- from MONGOLIAN FREE VARIATION SELECTOR ONE (U+180B) to MONGOLIAN VOWEL SEPARATOR (U+180E);
- from EN QUAD (U+2000) to RIGHT-TO-LEFT MARK (U+200F);
- from LEFT-TO-RIGHT EMBEDDING (U+202A) to NARROW NO-BREAK SPACE (U+202F);
- from MEDIUM MATHEMATICAL SPACE (U+205F) to NOMINAL DIGIT SHAPES (U+206F);
- BRAILLE PATTERN BLANK (U+2800);
- IDEOGRAPHIC SPACE (U+3000);
- HANGUL FILLER (U+3164);
- from VARIATION SELECTOR-1 (U+FE00) to VARIATION SELECTOR-16 (U+FE0F);
- ZERO WIDTH NO-BREAK SPACE (U+FEFF);
- HALFWIDTH HANGUL FILLER (U+FFA0);
- OBJECT REPLACEMENT CHARACTER (U+FFFC).
A non-character listed in WHATWG community and other sources:
- from U+FDC8 to U+FDCE;
- from U+FDD0 to U+FDEF;
- from U+FFFE to U+FFFF.
Other invisible characters or non-characters not listed previously.
Other invisible characters beyond U+FFFF.

visible-char

Any character that is not an invisible-char.
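
The list above can be turned into a lookup table. The following Python sketch only covers the code points listed explicitly in this document, so it is as non-exhaustive as the list itself. The names INVISIBLE_RANGES, is_invisible_char and is_visible_char are assumptions made for this example.

```python
# Inclusive code point ranges copied from the invisible-char list.
INVISIBLE_RANGES = [
    (0x0009, 0x000A),  # tab and newline (whitespace)
    (0x0020, 0x0020),  # space (whitespace)
    (0x00A0, 0x00A0), (0x00AD, 0x00AD), (0x034F, 0x034F),
    (0x061C, 0x061C), (0x115F, 0x1160), (0x17B4, 0x17B5),
    (0x180B, 0x180E), (0x2000, 0x200F), (0x202A, 0x202F),
    (0x205F, 0x206F), (0x2800, 0x2800), (0x3000, 0x3000),
    (0x3164, 0x3164), (0xFE00, 0xFE0F), (0xFEFF, 0xFEFF),
    (0xFFA0, 0xFFA0), (0xFFFC, 0xFFFC),
    # Non-characters listed in this document.
    (0xFDC8, 0xFDCE), (0xFDD0, 0xFDEF), (0xFFFE, 0xFFFF),
]


def is_invisible_char(ch: str) -> bool:
    """True if ch is one of the invisible characters listed explicitly."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in INVISIBLE_RANGES)


def is_visible_char(ch: str) -> bool:
    """A visible-char is any character that is not an invisible-char."""
    return not is_invisible_char(ch)
```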

Syntax for GRM single characters

grm-char

An invisible-char or a visible-char.

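comment-start

The character #.

named-node-start

The character {.

named-node-end

The character }.

node-name-char

An alphanumeric or _ or -.

text-node-escape

The character \.

attributes-start

The character [.

attributes-end

The character ].

attribute-assign

The character =.

att-name-char

An alphanumeric or _ or -.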

att-val-str-delim

The character ".

att-val-str-escape

The character \.

att-val-str-escapable

The character " or \ or n or t.

att-val-str-non-escape

A grm-char excluding ", \, newline or tab.

Category 3: Syntax definition for GRM Lite

Here we are defining GRM without the concept of Marks. For convenience, we call that GRM Lite.

By convention the file extension for a GRM document is .grm. For example: blog-post.grm.

Concept

A GRM document, defined below as grm-document, is a collection of Nodes. There are two types of Node: Named Node and Text Node.

A Named Node, defined below as named-node, is a Node which has a Name, optionally Attributes and optionally child Nodes.
A Text Node, defined below as text-node, is a Node which only represents text. A Text Node cannot have any Attributes nor child Nodes.
An Attribute, defined below as attribute, gives more information on its Node. An Attribute has an Attribute Name, defined below as attribute-name, and optionally an Attribute Value, defined below as attribute-value.
Just like XML or HTML, GRM has Comments. A Comment, defined below as comment, is a text ignored by the GRM parser.

Syntax

comment

Starts with comment-start and continues until the first newline or the end of the document. A comment does not capture the newline.

From here on, the syntax definitions ignore every comment.

useless

Zero or more whitespace.

att-val-str-char

Any of the following:

an att-val-str-non-escape;
the sequence att-val-str-escape, att-val-str-escapable.

att-val-str

An att-val-str-delim followed by zero or more att-val-str-char followed by att-val-str-delim.

attribute-name

One or more att-name-char.

attribute-value

An att-val-str.

attribute

Any of the following:

an attribute-name;
the sequence attribute-name, useless, attribute-assign, useless, attribute-value.

attribute-seq

Any of the following:

an attribute;
the sequence attribute, useless, attribute-seq.

attributes

The sequence attributes-start, useless, optionally attribute-seq, useless, attributes-end.

node-name

One or more node-name-char.

text-node-char

Any of the following:

grm-char except (text-node-escape or named-node-start or named-node-end or user-mark);
the sequence text-node-escape, grm-char.

text-node

One or more text-node-char.

named-node

The sequence named-node-start, useless, its node-name, useless, optionally its attributes, useless, zero or more child node, named-node-end.

node

A named-node or a text-node.

grm-document

Zero or more node.
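
To make the grammar above more concrete, here is a small recursive descent parser for GRM Lite written in Python. It is only a sketch under several assumptions: comments and ignored control characters have already been removed, the reserved characters are { } [ ] = \ and ", names are made of alphanumerics, _ and -, malformed input raises SyntaxError (GRM does not define error handling here), and Nodes are returned as plain dictionaries. The names LiteParser and parse are hypothetical. The handling of whitespace follows Rules C3R01 to C3R04.

```python
WHITESPACE = set(" \t\n")
NAME_CHARS = set("abcdefghijklmnopqrstuvwxyz"
                 "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-")
STRING_ESCAPES = {'"': '"', "\\": "\\", "n": "\n", "t": "\t"}


class LiteParser:
    def __init__(self, source):
        self.src = source
        self.pos = 0

    def peek(self):
        # Returns "" once the end of the document is reached.
        return self.src[self.pos] if self.pos < len(self.src) else ""

    def skip_useless(self):
        while self.peek() in WHITESPACE:
            self.pos += 1

    def name(self):
        start = self.pos
        while self.peek() in NAME_CHARS:
            self.pos += 1
        if start == self.pos:
            raise SyntaxError(f"name expected at offset {start}")
        return self.src[start:self.pos]

    def string(self):
        if self.peek() != '"':
            raise SyntaxError(f"string expected at offset {self.pos}")
        self.pos += 1
        out = []
        while True:
            ch = self.peek()
            if ch in ("", "\n", "\t"):
                raise SyntaxError(f"invalid string at offset {self.pos}")
            self.pos += 1
            if ch == '"':
                return "".join(out)
            if ch == "\\":
                escaped = self.peek()
                if escaped not in STRING_ESCAPES:
                    raise SyntaxError(f"bad escape at offset {self.pos}")
                self.pos += 1
                ch = STRING_ESCAPES[escaped]
            out.append(ch)

    def attributes(self):
        attrs = {}
        self.pos += 1  # consume attributes-start
        self.skip_useless()
        while self.peek() != "]":
            key = self.name()
            self.skip_useless()
            value = None  # null when no attribute-value is present
            if self.peek() == "=":
                self.pos += 1
                self.skip_useless()
                value = self.string()
                self.skip_useless()
            attrs[key] = value  # a later definition replaces an earlier one
        self.pos += 1  # consume attributes-end
        return attrs

    def text(self):
        out = []
        while self.peek() not in ("", "{", "}"):
            ch = self.peek()
            self.pos += 1
            if ch == "\\":  # keep only the escaped character
                ch = self.peek()
                self.pos += 1
            out.append(ch)
        return "".join(out)

    def named_node(self):
        self.pos += 1  # consume named-node-start
        self.skip_useless()
        name = self.name()
        self.skip_useless()
        attrs = self.attributes() if self.peek() == "[" else {}
        self.skip_useless()  # unescaped leading whitespace is useless
        children = self.nodes(inside=True)
        self.pos += 1  # consume named-node-end
        return {"name": name, "attributes": attrs, "children": children}

    def nodes(self, inside):
        result = []
        while True:
            ch = self.peek()
            if ch == "" or ch == "}":
                if (ch == "}") != inside:
                    raise SyntaxError(f"unbalanced node at offset {self.pos}")
                return result
            node = self.named_node() if ch == "{" else self.text()
            if node != "":
                result.append(node)


def parse(source):
    return LiteParser(source).nodes(inside=False)
```

For example, parse("{p  text }") returns a single p node whose only child is the Text Node "text ": the whitespace before the text is useless, while the trailing whitespace belongs to the Text Node.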

Rules

Rule C3R01

For the first child node of a named-node to be a text-node starting with a whitespace, the text-node must start with the sequence text-node-escape, whitespace.

Rule C3R02

If text-node-escape is not used for the first character of the text-node then the whitespace is considered as part of useless.

Rule C3R03

It is not necessary to escape a whitespace of a text-node outside of the context of Rule C3R01 and Rule C3R02.

Rule C3R04

In named-node definition, there is no useless between a child node and the named-node-end because the whitespace is part of the last child text-node.

Rule C3R05

If text-node-escape is the last character of the document, then it is simply ignored.

Rule C3R06

When the parser returns what is captured in a text-node, it only returns the grm-char for the sequence text-node-escape, grm-char. In other words if a text-node contains \a then only a is captured by Text Node and \ is ignored. This is the same thing for \\ for which only one \ is returned.
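
Rules C3R05 and C3R06 can be sketched in Python as follows. The function name unescape_text is an assumption made for this example, and its input is assumed to be the raw characters of a single text-node, with text-node-escape being \.

```python
def unescape_text(raw: str) -> str:
    """Return the captured content of a text-node given its raw characters."""
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch == "\\":
            # Rule C3R06: drop the escape and keep the escaped character.
            # Rule C3R05: a trailing escape has nothing to escape, so it
            # is ignored.
            if i + 1 < len(raw):
                out.append(raw[i + 1])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```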

Rule C3R07

When the parser returns what is captured in an attribute-value, it does not return the starting nor the ending att-val-str-delim.

Rule C3R08

When the parser returns what is captured in an attribute-value, it interprets the escaped characters as follows:

the sequence att-val-str-escape, " is interpreted as the character ";
the sequence att-val-str-escape, \ is interpreted as the character \;
the sequence att-val-str-escape, n is interpreted as the character \n (NEWLINE, U+000A);
the sequence att-val-str-escape, t is interpreted as the character \t (HORIZONTAL TAB, U+0009).
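
Rules C3R07 and C3R08 can be sketched in Python as follows. The function name decode_att_val_str is an assumption made for this example. Raising ValueError on malformed input is also an assumption, since GRM does not define error reporting here.

```python
_ESCAPES = {'"': '"', "\\": "\\", "n": "\n", "t": "\t"}


def decode_att_val_str(raw: str) -> str:
    """Decode an att-val-str, given with its two delimiters."""
    if len(raw) < 2 or raw[0] != '"' or raw[-1] != '"':
        raise ValueError("an att-val-str starts and ends with a double quote")
    body = raw[1:-1]  # Rule C3R07: the delimiters are not returned
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            # Rule C3R08: only ", \, n and t can follow the escape.
            if i + 1 >= len(body) or body[i + 1] not in _ESCAPES:
                raise ValueError("invalid escape sequence")
            out.append(_ESCAPES[body[i + 1]])
            i += 2
        elif ch in ('"', "\n", "\t"):
            raise ValueError("this character must be escaped")
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```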

Rule C3R09

If the same attribute-name is used for the same named-node then only its last definition is considered and all its previous definitions are ignored.

Rule C3R10

An attribute without an attribute-value does not have the same meaning as an attribute with an empty string as attribute-value. We say that the attribute-value is null when it is not present.
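
Rules C3R09 and C3R10 can be illustrated with a short Python sketch. It assumes that the attributes of one named-node were already parsed into (name, value) pairs in document order, where value is None when no attribute-value is present. The function name collapse_attributes is hypothetical.

```python
def collapse_attributes(pairs):
    """Keep only the last definition of each attribute-name (Rule C3R09)."""
    result = {}
    for name, value in pairs:
        result[name] = value  # a later pair overwrites an earlier one
    return result


attrs = collapse_attributes([
    ("id", "a"),
    ("hidden", None),  # no attribute-value: null (Rule C3R10)
    ("id", "b"),       # redefinition: replaces "a"
    ("title", ""),     # empty string: present, but not null
])
```

Here attrs["id"] is "b", attrs["hidden"] is None and attrs["title"] is the empty string, so the null value and the empty value stay distinguishable.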

Category 4: Syntax definition for GRM

We previously defined what we called for convenience GRM Lite. GRM has a notion of Marks which we excluded in GRM Lite.

Concept

GRM is mostly aimed at writing prose documents. When writing, we do not want to write lengthy markup. A Mark is a way to shorten the markup syntax.

A Mark, defined below as mark, is a short way to write a specific Named Node without Attributes. A Mark has one Mark Start Character, defined below as mark-start, and one Mark End Character, defined below as mark-end.

GRM aims to be flexible. It is up to the developer of the software which uses a GRM parser to define which Named Node a Mark represents. It is also up to this developer to define which characters are used as Mark Start Character and Mark End Character.

A Mark can be a Verbatim Mark, defined below as verbatim-mark. In a Verbatim Mark, all characters until its Mark End Character are part of its child Text Node. This means that inside a Verbatim Mark any special character is considered as a character of a Text Node and \ is not needed to escape anything.

A non Verbatim Mark is defined below as non-verbatim-mark.

It is up to the developer of the software which uses a GRM parser to define if a Mark is a Verbatim Mark or not.

Syntax

We define below the rest of the GRM syntax. For the following syntax definitions, we ignore all comment.

The syntax definition for GRM Lite applies here too.

valid-mark-char

Any visible-char except one of the following:

comment-start;
named-node-start;
named-node-end;
node-name-char;
text-node-escape;
att-name-char;
attributes-start;
attributes-end;
attribute-assign;
att-val-str-delim.
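
The exclusion list above can be sketched in Python as follows. The function name is_valid_mark_char is an assumption made for this example, and so is the set of excluded characters it uses: #, {, }, \, [, ], = and ", plus alphanumerics, _ and -. The visibility test is only an approximation based on str.isprintable(), which does not match the invisible-char list exactly (for instance it accepts BRAILLE PATTERN BLANK).

```python
import string

# Single characters with a syntactic role in GRM.
RESERVED = set('#{}\\[]="')
# Characters usable in node names and attribute names.
NAME_CHARS = set(string.ascii_letters + string.digits + "_-")


def is_valid_mark_char(ch: str) -> bool:
    """Approximate test for valid-mark-char."""
    if len(ch) != 1:
        return False
    # Approximation of visible-char: printable and not a plain space.
    if ch == " " or not ch.isprintable():
        return False
    return ch not in RESERVED and ch not in NAME_CHARS
```

With this definition, characters such as *, <, > or ` can be used for Marks, while _ and - cannot because they are allowed in names.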

mark-start

Any valid-mark-char except one already used for defining a mark-start or a mark-end for another mark.

mark-end

Any valid-mark-char except one already used for defining a mark-start or a mark-end for another mark.

user-mark

Any mark-start or mark-end defined by the user of the parser.

non-verbatim-mark

The sequence its mark-start, zero or more child node, its mark-end.

mark

A non-verbatim-mark or a verbatim-mark.

verbatim-mark

The sequence its mark-start, zero or more (grm-char except its mark-end) as its child text-node, its mark-end.

Rules

Rule C4R01

For the same mark, mark-start and mark-end can be the same valid-mark-char.

Rule C4R02

It is an error if marks are interleaved instead of being properly nested. For example if Mark A is defined by * and * and Mark B is defined by < and > then the sequence *<*> is incorrect.
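
Rule C4R02 can be checked with a stack of expected Mark End Characters. The Python sketch below is deliberately simplified: it only looks at Mark characters and ignores escapes, Verbatim Marks, Named Nodes and comments. It also reports a Mark that is never closed, since the grammar requires a mark-end. The function name marks_are_nested is an assumption made for this example.

```python
def marks_are_nested(text, marks):
    """Return True if the marks found in text are properly nested.

    marks is a list of (mark_start, mark_end) pairs.
    """
    end_for_start = {start: end for start, end in marks}
    all_ends = {end for _, end in marks}
    expected_ends = []  # stack of the mark-end characters still awaited
    for ch in text:
        if expected_ends and ch == expected_ends[-1]:
            expected_ends.pop()  # closes the innermost open mark
        elif ch in end_for_start:
            expected_ends.append(end_for_start[ch])  # opens a new mark
        elif ch in all_ends:
            return False  # closes a mark that is not the innermost one
    return not expected_ends


MARKS = [("*", "*"), ("<", ">")]
```

With these definitions, marks_are_nested("*<x>*", MARKS) is True and marks_are_nested("*<*>", MARKS) is False.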

Rule C4R03

whitespace after a mark-start is not ignored. If there is a whitespace just after the mark-start then this whitespace starts a text-node.

Rule C4R04

The child text-node of a verbatim-mark cannot contain the mark-end of this verbatim-mark since the presence of this character ends the mark definition.

Rule C4R05

The child text-node of a verbatim-mark cannot escape any characters because even text-node-escape is considered as a regular character. Outside of a verbatim-mark, text-node-escape still escapes any user-mark.

Rule C4R06

A verbatim-mark cannot contain comments because even comment-start is considered as a regular character.

Category 5: GRM Definition Document

Concept

A GRM Definition Document is a special GRM document used to describe which Named Nodes and Marks are defined. This document is not mandatory but it can be a good help for the people writing GRM documents interpreted by a specific piece of software. A GRM Definition Document could be used by linters or LSP servers and clients to help someone writing a GRM document.

By convention the file extension for a GRM Definition Document is .grmd. For example: website.grmd.

A GRM Definition Document is written in GRM Lite. Text Nodes in a GRM Definition Document are used to comment the definition made by their parent Named Node or to comment the whole document if they are not inside a Named Node. Other Text Nodes should be completely ignored.

Note that a GRM Definition Document is merely an indication. It is not the role of the parser to ensure that a GRM document complies with a GRM Definition Document. This role belongs to the software that uses the parser to generate documents from a GRM document.

Syntax

Below is the GRM Definition Document of what can be found in a GRM Definition Document.

GRM

Definition for GRM Definition Document.

{node [name="node"] Defines a Named Node.
	{attribute [name="name"]
		Defines the name of the Node.
		This "name" Attribute is mandatory and must have a non-empty Attribute Value.
	}
}

{node [name="attribute"] Defines an Attribute in a Named Node. This "attribute" Node must be the child of a "node" Node.
	{attribute [name="name"]
		Defines the Attribute Name.
		This "name" Attribute is mandatory and must have a non-empty Attribute Value.
	}
	{attribute [name="optional" optional nullable]
		By default when an Attribute is defined, it is considered as mandatory.
		Defining an Attribute as "optional" means that this Attribute can be absent.
	}
	{attribute [name="nullable" optional nullable]
		By default when an Attribute is defined, it is considered as requiring an Attribute Value.
		Defining an Attribute as "nullable" means that this Attribute Value can be absent.
	}
}

{node [name="mark"] Defines a Mark.
	{attribute [name="name"]
		References the Named Node that this Mark is referencing.
		Multiple marks can have the same name.
	}
	{attribute [name="start"]
		Defines the Mark Start Character.
		This Attribute is mandatory and its value must be one single valid mark character not used yet for other Marks.
		Its value can be the same as the value for the "end" Attribute defined for the same "mark" Node.
	}
	{attribute [name="end"]
		Defines the Mark End Character.
		This Attribute is mandatory and its value must be one single valid mark character not used yet for other Marks.
		Its value can be the same as the value for the "start" Attribute defined for the same "mark" Node.
	}
	{attribute [name="verbatim" optional nullable]
		Defines if the Mark is a Verbatim Mark.
		By default, a Mark is not a Verbatim Mark.
	}
}

Rules

We do not define "types" (boolean, integer, number, enumerated values...) for the Attributes. The reason is simple: GRM is a simple markup language for writing text. GRM is not the right tool for structuring complex precise data.

Rule C5R01

Only named-node and attributes previously defined are allowed. In other words:

The allowed Named Nodes are node, attribute and mark;
node Named Node must have an Attribute name;
attribute Named Node has the Attributes name (mandatory), optional (optional) and nullable (optional);
mark Named Node has the Attributes name (mandatory), start (mandatory), end (mandatory) and verbatim (optional).

Rule C5R02

node and mark named-node must be at the top level of the document. In other words, they have no parent.

Rule C5R03

attribute Named Node is a direct child of node Named Node.

Rule C5R04

As for Attributes, the last definition of the same thing is the one considered.

If a node Named Node redefines a Named Node previously defined, the old definition is forgotten and the new one is used.
If a mark Named Node redefines a Mark previously defined, the old definition is forgotten and the new one is used.

Rule C5R05

Text Nodes are considered as comment for their parent node. All sibling Text Nodes are concatenated into a single comment which preserves all the whitespace present. It is up to the parser library user to trim and simplify the whitespace. Also the first level comment (i.e. the one without any parent) is the comment for the whole definition.

Category 6: JSON representation

A GRM document can be represented in JSON. A parser must implement a way to export a GRM document to JSON.

Rule C6R01

When exported as a JSON string, some characters must have this representation in order to be valid JSON string characters.

newline is written \n.
tab is written \t.
\ is written \\.
" is written \".

Rule C6R02

attribute is exported into an object with:

the JSON field representing its attribute-name;
the JSON value representing its attribute-value as a JSON string if it exists, or null.

Rule C6R03

text-node is exported into a JSON string.

Rule C6R04

named-node is exported into an object with:

a field name set to the JSON string representation of its node-name;
a field attributes set to a JSON array containing the JSON representation of each attribute of this node;
a field children set to a JSON array of the child node exported into JSON.

Rule C6R05

Conceptually a mark is a reference to a named-node so it is not represented in JSON.

Rule C6R06

grm-document is exported into a JSON array containing each of its node exported into JSON.
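
The export rules above can be sketched in Python with the standard json module, which takes care of the string escaping of Rule C6R01. The NamedNode class and the function names are assumptions made for this example. The sketch also relies on one possible reading of Rule C6R02, in which each attribute becomes an object with a single field: the attribute-name as key and the attribute-value (or null) as value.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union


@dataclass
class NamedNode:
    name: str
    # (attribute-name, attribute-value or None) pairs.
    attributes: List[Tuple[str, Optional[str]]] = field(default_factory=list)
    # A child is either a NamedNode or a str (a text-node).
    children: List[Union["NamedNode", str]] = field(default_factory=list)


def node_to_json(node):
    """Convert a node into JSON-compatible Python values."""
    if isinstance(node, str):
        return node  # Rule C6R03: a text-node is a JSON string
    return {  # Rule C6R04
        "name": node.name,
        "attributes": [{name: value} for name, value in node.attributes],
        "children": [node_to_json(child) for child in node.children],
    }


def document_to_json(nodes) -> str:
    """Rule C6R06: a grm-document is a JSON array of its nodes."""
    return json.dumps([node_to_json(node) for node in nodes],
                      ensure_ascii=False)


document = [NamedNode("a", [("href", "x"), ("hidden", None)], ["hi\n"])]
exported = document_to_json(document)
```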