GRM syntax

Typography

On the typography used in this document.

Glossary

Attribute
An element which gives additional information for a Named Node. An Attribute has an Attribute Name and optionally an Attribute Value.
Comment
An element that is not interpreted by a GRM parser. A Comment serves the same purpose as a comment in any programming or markup language.
GRM
A markup language designed to be short to write, easy to parse and customizable. GRM stands for "GeneRic Markup".
GRM Definition Document
A special GRM document used to describe which Named Nodes and Marks are defined. This document is optional.
GRM Lite
A subset of GRM which does not contain Marks.
Mark
A shortcut for a Named Node containing no Attributes. A Mark has a starting visible character and an ending visible character. A Mark can be a Verbatim Mark which means that no characters between its starting and its ending characters are interpreted as GRM language characters.
Node
A GRM document is a set of Nodes. A Node can be a Text Node or a Named Node.
Text Node
A Node which only represents text.
Named Node
A Node which has a Name, optionally Attributes and optionally child Nodes.

Syntax in human language

Category 1: Encoding and characters

Here are the rules for the encoding and the characters used for a valid GRM document.

Rule C1R01
Unicode is used, preferably encoded as UTF-8. A GRM parser must support characters outside the Basic Multilingual Plane (BMP), that is, characters whose Unicode code point is above U+FFFF. For example: emojis.
Rule C1R02
GRM documents are case-sensitive. This means that a and A are two distinct characters.
Rule C1R03
The following control characters are ignored by GRM. This means that a GRM document containing them is valid, but the parser skips those characters.
  • From NULL (U+0000) to BACKSPACE (U+0008).
  • From VERTICAL TAB (U+000B) to INFORMATION SEPARATOR ONE (U+001F).
  • From DELETE (U+007F) to APPLICATION PROGRAM COMMAND (U+009F).
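The three ranges of Rule C1R03 can be sketched as a small pre-processing filter. This is only an illustration: the function names `is_ignored_control` and `strip_ignored_controls` are hypothetical, and a real parser might skip these characters while scanning instead of in a separate pass.

```python
def is_ignored_control(ch: str) -> bool:
    """Return True if `ch` falls in one of the three ranges of Rule C1R03."""
    cp = ord(ch)
    return (
        0x0000 <= cp <= 0x0008      # NULL .. BACKSPACE
        or 0x000B <= cp <= 0x001F   # VERTICAL TAB .. INFORMATION SEPARATOR ONE
        or 0x007F <= cp <= 0x009F   # DELETE .. APPLICATION PROGRAM COMMAND
    )


def strip_ignored_controls(text: str) -> str:
    """Drop every ignored control character from `text`."""
    return "".join(ch for ch in text if not is_ignored_control(ch))
```

Note that HORIZONTAL TAB (U+0009) and NEWLINE (U+000A) sit between the first two ranges and are kept, while CARRIAGE RETURN (U+000D) falls in the second range and is dropped.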

Category 2: Syntax definition axioms

Here is the basis of the syntax definition. This category only contains rules that apply to a single character.

Remember that some characters are completely ignored (Rule C1R03).

Syntax for general single characters

newline
The character \n (NEWLINE, U+000A).
tab
The character \t (HORIZONTAL TAB, U+0009).
space
The character SPACE (U+0020).
whitespace
A newline or a tab or a space.
letter
Any character from a to z or from A to Z.
digit
Any character from 0 to 9.
alphanumeric
Any letter or digit.
invisible-char
Any character that cannot be seen by someone reading the document. In other words, it is a character for which a printer would not use any ink. Listing every possible invisible-char is too daunting a task, so here is a non-exhaustive list.
  • A whitespace.
  • A character listed in Invisible Characters:
    • NO-BREAK SPACE (U+00A0);
    • SOFT HYPHEN (U+00AD);
    • COMBINING GRAPHEME JOINER (U+034F);
    • ARABIC LETTER MARK (U+061C);
    • from HANGUL CHOSEONG FILLER (U+115F) to HANGUL JUNGSEONG FILLER (U+1160);
    • from KHMER VOWEL INHERENT AQ (U+17B4) to KHMER VOWEL INHERENT AA (U+17B5);
    • from MONGOLIAN FREE VARIATION SELECTOR ONE (U+180B) to MONGOLIAN VOWEL SEPARATOR (U+180E);
    • from EN QUAD (U+2000) to RIGHT-TO-LEFT MARK (U+200F);
    • from LEFT-TO-RIGHT EMBEDDING (U+202A) to NARROW NO-BREAK SPACE (U+202F);
    • from MEDIUM MATHEMATICAL SPACE (U+205F) to NOMINAL DIGIT SHAPES (U+206F);
    • BRAILLE PATTERN BLANK (U+2800);
    • IDEOGRAPHIC SPACE (U+3000);
    • HANGUL FILLER (U+3164);
    • from VARIATION SELECTOR-1 (U+FE00) to VARIATION SELECTOR-16 (U+FE0F);
    • ZERO WIDTH NO-BREAK SPACE (U+FEFF);
    • HALFWIDTH HANGUL FILLER (U+FFA0);
    • OBJECT REPLACEMENT CHARACTER (U+FFFC).
  • A non-character listed by the WHATWG community and other sources:
    • from U+FDC8 to U+FDCE;
    • from U+FDD0 to U+FDEF;
    • from U+FFFE to U+FFFF.
  • Other invisible characters or non-characters not listed previously.
  • Other invisible characters beyond U+FFFF.
visible-char
Any character that is not an invisible-char.
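As a rough illustration, the single-character classes above could be written as predicates. The function names are hypothetical, each function expects a one-character string, and `INVISIBLE_RANGES` contains only the code points listed explicitly above, so it is as non-exhaustive as the list itself.

```python
# Inclusive code point ranges copied from the non-exhaustive list above.
INVISIBLE_RANGES = [
    (0x00A0, 0x00A0), (0x00AD, 0x00AD), (0x034F, 0x034F), (0x061C, 0x061C),
    (0x115F, 0x1160), (0x17B4, 0x17B5), (0x180B, 0x180E), (0x2000, 0x200F),
    (0x202A, 0x202F), (0x205F, 0x206F), (0x2800, 0x2800), (0x3000, 0x3000),
    (0x3164, 0x3164), (0xFE00, 0xFE0F), (0xFEFF, 0xFEFF), (0xFFA0, 0xFFA0),
    (0xFFFC, 0xFFFC), (0xFDC8, 0xFDCE), (0xFDD0, 0xFDEF), (0xFFFE, 0xFFFF),
]


def is_whitespace(ch: str) -> bool:
    return ch in ("\n", "\t", " ")


def is_letter(ch: str) -> bool:
    # ASCII only: str.isalpha() would also accept letters such as "é".
    cp = ord(ch)
    return 0x61 <= cp <= 0x7A or 0x41 <= cp <= 0x5A


def is_digit(ch: str) -> bool:
    return 0x30 <= ord(ch) <= 0x39


def is_alphanumeric(ch: str) -> bool:
    return is_letter(ch) or is_digit(ch)


def is_invisible_char(ch: str) -> bool:
    cp = ord(ch)
    return is_whitespace(ch) or any(lo <= cp <= hi for lo, hi in INVISIBLE_RANGES)


def is_visible_char(ch: str) -> bool:
    return not is_invisible_char(ch)
```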

Syntax for GRM single characters

grm-char
An invisible-char or a visible-char.
comment-start
The character #.
named-node-start
The character {.
named-node-end
The character }.
node-name-char
An alphanumeric or _ or -.
text-node-escape
The character \.
attributes-start
The character [.
attributes-end
The character ].
attribute-assign
The character =.
att-name-char
An alphanumeric or _ or -.
att-val-str-delim
The character ".
att-val-str-escape
The character \.
att-val-str-escapable
The character " or \ or n or t.
att-val-str-non-escape
A grm-char excluding ", \, newline or tab.

Category 3: Syntax definition for GRM Lite

Here we are defining GRM without the concept of Marks. For convenience, we call that GRM Lite.

By convention the file extension for a GRM document is .grm. For example: blog-post.grm.

Concept

A GRM document, defined below as grm-document, is a collection of Nodes. There are two types of Node: Named Node and Text Node.
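One possible in-memory model for these concepts is sketched below. The class and field names are assumptions made for illustration, not part of GRM itself; `value=None` stands for an Attribute that has no Attribute Value.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Attribute:
    name: str
    value: Optional[str] = None  # None: the Attribute has no Attribute Value


@dataclass
class TextNode:
    text: str


@dataclass
class NamedNode:
    name: str
    attributes: List[Attribute] = field(default_factory=list)
    children: List[Union[TextNode, "NamedNode"]] = field(default_factory=list)


# A Node is either a Text Node or a Named Node; a document is a list of Nodes.
Node = Union[TextNode, NamedNode]

example_document: List[Node] = [
    NamedNode("title", children=[TextNode("Hello")]),
    TextNode("\n"),
]
```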

Syntax

comment
Starts with comment-start and continues until the first newline or the end of the document. A comment does not capture the newline.

From here on, the syntax definitions ignore every comment.
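The comment definition can be approximated with a regular expression. This sketch makes a strong simplifying assumption that the definition above does not state: every # starts a comment, so it does not account for a # that is escaped or that appears inside an attribute value. The names `COMMENT`, `find_comments` and `strip_comments` are hypothetical.

```python
import re

# comment-start followed by everything up to, but not including, the newline.
COMMENT = re.compile(r"#[^\n]*")


def find_comments(source: str) -> list:
    """Return every comment found in `source`, without its trailing newline."""
    return COMMENT.findall(source)


def strip_comments(source: str) -> str:
    """Remove every comment; the newline that ends a comment is preserved."""
    return COMMENT.sub("", source)
```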

useless
Zero or more whitespace.
att-val-str-char
Any of the following:
att-val-str
An att-val-str-delim followed by zero or more att-val-str-char followed by att-val-str-delim.
attribute-name
One or more att-name-char.
attribute-value
An att-val-str.
attribute
Any of the following:
attribute-seq
Any of the following:
attributes
The sequence attributes-start, useless, optionally attribute-seq, useless, attributes-end.
node-name
One or more node-name-char.
text-node-char
Any of the following:
text-node
One or more text-node-char.
named-node
The sequence named-node-start, useless, its node-name, useless, optionally its attributes, useless, zero or more child node, named-node-end.
node
A named-node or a text-node.
grm-document
Zero or more node.
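Since node-name-char and att-name-char accept the same characters, and both node-name and attribute-name are "one or more" of them, a single check can cover both. A minimal sketch, with the hypothetical helper name `is_valid_name`:

```python
import re

# One or more of: ASCII letter, ASCII digit, "_" or "-".
# An explicit class is used because \w would also match non-ASCII letters.
NAME = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_name(name: str) -> bool:
    """Check a candidate node-name or attribute-name."""
    return NAME.fullmatch(name) is not None
```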

Rules

Rule C3R01
For the first child node of a named-node to be a text-node starting with a whitespace, the text-node must start with the sequence text-node-escape, whitespace.
Rule C3R02
If text-node-escape is not used for the first character of the text-node then the whitespace is considered as part of useless.
Rule C3R03
It is not necessary to escape a whitespace of a text-node outside of the context of Rule C3R01 and Rule C3R02.
Rule C3R04
In named-node definition, there is no useless between a child node and the named-node-end because the whitespace is part of the last child text-node.
Rule C3R05
If text-node-escape is the last character of the document, then it is simply ignored.
Rule C3R06
When the parser returns what is captured in a text-node, it returns only the grm-char of each sequence text-node-escape, grm-char. In other words, if a text-node contains \a then only a is captured by the Text Node and \ is dropped. The same applies to \\, for which only one \ is returned.
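Rules C3R05 and C3R06 together describe a simple unescaping pass over the raw characters of a text-node. A minimal sketch, assuming the raw text has already been isolated by the parser (the function name `unescape_text_node` is hypothetical):

```python
def unescape_text_node(raw: str) -> str:
    """Apply Rule C3R06, and Rule C3R05 for a trailing escape character."""
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch == "\\":
            if i + 1 < len(raw):
                out.append(raw[i + 1])  # keep only the escaped character
            # A backslash with nothing after it is simply ignored (C3R05).
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```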
Rule C3R07
When the parser returns what is captured in an attribute-value, it does not return the starting nor the ending att-val-str-delim.
Rule C3R08
When the parser returns what is captured in an attribute-value, it interprets the escaped characters as follows:
Rule C3R09
If the same attribute-name is used for the same named-node then only its last definition is considered and all its previous definitions are ignored.
Rule C3R10
An attribute without an attribute-value does not have the same meaning as an attribute with an empty string as attribute-value. We say that the attribute-value is null when it is not present.
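Rules C3R09 and C3R10 can be illustrated by folding the attributes of one named-node into a mapping. In this sketch the input is assumed to be a list of (name, value) pairs in document order, with `None` standing for a null attribute-value; the function name `resolve_attributes` is hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def resolve_attributes(
    pairs: List[Tuple[str, Optional[str]]]
) -> Dict[str, Optional[str]]:
    """Keep the last definition of each attribute-name (Rule C3R09)."""
    resolved: Dict[str, Optional[str]] = {}
    for name, value in pairs:
        resolved[name] = value  # a later definition overwrites an earlier one
    return resolved


attrs = resolve_attributes([("lang", "en"), ("draft", None), ("title", ""), ("lang", "fr")])
```

Here `attrs["draft"]` is `None` (null) while `attrs["title"]` is the empty string, which keeps the distinction required by Rule C3R10.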

Category 4: Syntax definition for GRM

We previously defined what we called for convenience GRM Lite. GRM has a notion of Marks which we excluded in GRM Lite.

Concept

GRM is mostly aimed at writing prose documents. When writing prose, we do not want lengthy markup syntax. A Mark is a way to shorten the markup syntax.

A Mark, defined below as mark, is a short way to write a specific Named Node without Attributes. A Mark has one Mark Start Character, defined below as mark-start, and one Mark End Character, defined below as mark-end.

GRM aims to be flexible. It is up to the developer of the software which uses a GRM parser to define which Named Node a Mark represents. It is also up to this developer to define which characters are used as Mark Start Character and Mark End Character.

A Mark can be a Verbatim Mark, defined below as verbatim-mark. In a Verbatim Mark, all characters up to its Mark End Character are part of its child Text Node. This means that inside a Verbatim Mark, special characters are treated as characters of a Text Node and \ is not needed to escape anything.

A non-Verbatim Mark is defined below as non-verbatim-mark.

It is up to the developer of the software which uses a GRM parser to decide whether a Mark is a Verbatim Mark or not.

Syntax

We define below the rest of the GRM syntax. For the following syntax definitions, we ignore all comment.

The syntax definition for GRM Lite applies here too.

valid-mark-char
Any visible-char except one of the following:
mark-start
Any valid-mark-char except one already used for defining a mark-start or a mark-end for another mark.
mark-end
Any valid-mark-char except one already used for defining a mark-start or a mark-end for another mark.
user-mark
Any mark-start or mark-end defined by the user of the parser.
non-verbatim-mark
The sequence its mark-start, zero or more child node, its mark-end.
mark
A non-verbatim-mark or a verbatim-mark.
verbatim-mark
The sequence its mark-start, zero or more (grm-char except its mark-end) as its child text-node, its mark-end.

Rules

Rule C4R01
For the same mark, mark-start and mark-end can be the same valid-mark-char.
Rule C4R02
It is an error if marks overlap instead of nesting. For example, if Mark A is defined by * and * and Mark B is defined by < and >, then the sequence *<*> is incorrect.
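One way to detect such tangled marks is a stack of pending mark-end characters. The sketch below is a simplification made for illustration: it only looks at mark characters and text-node-escape, ignores named nodes and verbatim marks, and also treats an unclosed mark or a stray mark-end as an error, which the rule above does not spell out. The name `check_mark_nesting` and the `marks` mapping (mark-start to mark-end) are hypothetical.

```python
from typing import Dict, List


def check_mark_nesting(text: str, marks: Dict[str, str]) -> bool:
    """Return True when every mark in `text` is closed in nesting order."""
    ends = set(marks.values())
    stack: List[str] = []  # mark-end characters still expected, innermost last
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2  # an escaped character is never a mark delimiter
            continue
        if stack and ch == stack[-1]:
            stack.pop()              # closes the innermost open mark
        elif ch in marks:
            stack.append(marks[ch])  # opens a new mark
        elif ch in ends:
            return False             # closes a mark that is not the innermost one
        i += 1
    return not stack


marks = {"*": "*", "<": ">"}
```

With these definitions, `check_mark_nesting("*<*>", marks)` is `False` and `check_mark_nesting("*<x>*", marks)` is `True`.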
Rule C4R03
whitespace after a mark-start is not ignored. If there is a whitespace just after the mark-start then this whitespace starts a text-node.
Rule C4R04
The child text-node of a verbatim-mark cannot contain the mark-end of this verbatim-mark since the presence of this character ends the mark definition.
Rule C4R05
The child text-node of a verbatim-mark cannot escape any characters because even text-node-escape is considered as a regular character. Outside of a verbatim-mark, text-node-escape still escapes any user-mark.
Rule C4R06
A verbatim-mark cannot contain comments because even comment-start is considered as a regular character.
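Rules C4R04 to C4R06 make a verbatim-mark easy to read: everything up to the first mark-end is taken literally. A minimal sketch, assuming the parser has just consumed the mark-start; the function name `read_verbatim_mark` is hypothetical, and raising an error for an unclosed mark is an assumption, since the rules above do not say what happens in that case.

```python
from typing import Tuple


def read_verbatim_mark(text: str, pos: int, mark_end: str) -> Tuple[str, int]:
    """Read the child text of a verbatim-mark.

    `pos` is the index just after the mark-start. Returns the child text and
    the index just after the mark-end.
    """
    end = text.find(mark_end, pos)
    if end == -1:
        raise ValueError("verbatim mark is not closed")  # assumed behaviour
    # No unescaping and no comment handling: every character is literal.
    return text[pos:end], end + 1


sample = "`a \\ # {b}` tail"
child, next_pos = read_verbatim_mark(sample, 1, "`")
```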

Category 5: GRM Definition Document

Concept

A GRM Definition Document is a special GRM document used to describe which Named Nodes and Marks are defined. This document is not mandatory, but it can help people who write GRM documents interpreted by a specific piece of software. A GRM Definition Document could be used by linters or LSP servers and clients to help someone writing a GRM document.

By convention the file extension for a GRM Definition Document is .grmd. For example: website.grmd.

A GRM Definition Document is written in GRM Lite. Text Nodes in a GRM Definition Document are used to comment the definition made by their parent Named Node or to comment the whole document if they are not inside a Named Node. Other Text Nodes should be completely ignored.

Note that a GRM Definition Document is merely an indication. It is not the role of the parser to ensure that the definitions in a GRM document comply with a GRM Definition Document. This role belongs to the software that uses the parser to generate documents from a GRM document.

Syntax

Below is the GRM Definition Document of what can be found in a GRM Definition Document.

GRM
Definition for GRM Definition Document.

{node [name="node"] Defines a Named Node.
	{attribute [name="name"] Defines the name of the Node. This "name" Attribute is mandatory and must have a non-empty Attribute Value.}
}

{node [name="attribute"] Defines an Attribute in a Named Node. This "attribute" Node must be the child of a "node" Node.
	{attribute [name="name"]
		Defines the Attribute Name.
		This "name" Attribute is mandatory and must have a non-empty Attribute Value.
	}
	{attribute [name="optional" optional nullable]
		By default when an Attribute is defined, it is considered as mandatory.
		Defining an Attribute as "optional" means that this Attribute can be absent.
	}
	{attribute [name="nullable" optional nullable]
		By default when an Attribute is defined, it is considered as requiring an Attribute Value.
		Defining an Attribute as "nullable" means that this Attribute Value can be absent.
	}
}

{node [name="mark"] Defines a Mark.
	{attribute [name="name"]
		Names the Named Node that this Mark stands for.
		Multiple marks can have the same name.
	}
	{attribute [name="start"]
		Defines the Mark Start Character.
		This Attribute is mandatory and its value must be one single valid mark character not used yet for other Marks.
		Its value can be the same as the value for the "end" Attribute defined for the same "mark" Node.
	}
	{attribute [name="end"]
		Defines the Mark End Character.
		This Attribute is mandatory and its value must be one single valid mark character not used yet for other Marks.
		Its value can be the same as the value for the "start" Attribute defined for the same "mark" Node.
	}
	{attribute [name="verbatim" optional nullable]
		Defines if the Mark is a Verbatim Mark.
		By default, a Mark is not a Verbatim Mark.
	}
}

Rules

We do not define "types" (boolean, integer, number, enumerated values...) for the Attributes. The reason is simple: GRM is a simple markup language for writing text. GRM is not the right tool for structuring complex, precise data.

Rule C5R01
Only named-node and attributes previously defined are allowed. In other words:
Rule C5R02
node and mark Named Nodes must be at the first level of the hierarchy. In other words, they have no parent.
Rule C5R03
An attribute Named Node is a direct child of a node Named Node.
Rule C5R04
As with Attributes, the last definition of the same thing is the one considered.
  • If a node Named Node redefines a Named Node previously defined, the old definition is forgotten and the new one is used.
  • If a mark Named Node redefines a Mark previously defined, the old definition is forgotten and the new one is used.
Rule C5R05
Text Nodes are considered as comments for their parent node. All sibling Text Nodes are concatenated into a single comment which preserves all the whitespace present. It is up to the parser library user to trim and simplify the whitespace. The first-level comment (i.e. the one without any parent) is the comment for the whole definition.
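Rules C5R04 and C5R05 could be applied after parsing along the following lines. This is a sketch under assumptions: parsed Text Nodes are represented as plain strings and Named Nodes as dictionaries, and the helper names `merge_node_definitions` and `comment_of` are hypothetical.

```python
from typing import Dict, List


def merge_node_definitions(definitions: List[dict]) -> Dict[str, dict]:
    """Keep only the last definition seen for each Named Node name (Rule C5R04)."""
    registry: Dict[str, dict] = {}
    for definition in definitions:
        registry[definition["name"]] = definition  # forget the older definition
    return registry


def comment_of(children: list) -> str:
    """Concatenate sibling Text Nodes without trimming anything (Rule C5R05)."""
    return "".join(child for child in children if isinstance(child, str))
```

Trimming or collapsing the whitespace of the resulting comment is left to the caller, as the rule states.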

Category 6: JSON representation

A GRM document can be represented in JSON. A parser must implement a way to export a GRM document to JSON.

Rule C6R01
When exported as a JSON string, some characters must have this representation in order to be valid JSON string characters.
  • newline is written \n.
  • tab is written \t.
  • \ is written \\.
  • " is written \".
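The four replacements of Rule C6R01 can be written by hand as shown below; the function name `escape_json_string` is hypothetical. The backslash must be replaced first, otherwise the backslashes introduced by the other replacements would be doubled. In practice a standard JSON library, such as Python's `json.dumps`, already performs these replacements.

```python
import json


def escape_json_string(text: str) -> str:
    """Escape `text` so it can be placed between the quotes of a JSON string."""
    return (
        text.replace("\\", "\\\\")  # must come first
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\t", "\\t")
    )


original = 'say "hi"\n\tdone \\ end'
escaped = escape_json_string(original)
# Wrapping the result in quotes gives a JSON string that decodes to the original.
round_trip = json.loads('"' + escaped + '"')
```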
Rule C6R02
attribute is exported into an object with:
Rule C6R03
text-node is exported into a JSON string.
Rule C6R04
named-node is exported into an object with:
  • a field name set to the JSON string representation of its node-name;
  • a field attributes set to a JSON array containing the JSON representation of each attribute of this node;
  • a field children set to a JSON array of the child node exported into JSON.
Rule C6R05
Conceptually a mark is a reference to a named-node, so it is not represented in JSON.
Rule C6R06
grm-document is exported into a JSON array containing each of its node exported into JSON.
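The export rules can be sketched as a few small functions. The model classes are repeated in minimal form so the block is self-contained, and all class and function names are hypothetical. The fields of the object produced for an attribute are not reproduced in Rule C6R02 above, so the `name` and `value` fields used here (with `null` for a missing attribute-value) are an assumption.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Attribute:
    name: str
    value: Optional[str] = None


@dataclass
class TextNode:
    text: str


@dataclass
class NamedNode:
    name: str
    attributes: List[Attribute] = field(default_factory=list)
    children: List[Union[TextNode, "NamedNode"]] = field(default_factory=list)


def export_attribute(attribute: Attribute) -> dict:
    # ASSUMPTION: field names chosen for illustration only.
    return {"name": attribute.name, "value": attribute.value}


def export_node(node):
    if isinstance(node, TextNode):
        return node.text  # C6R03: a text-node becomes a JSON string
    return {              # C6R04: a named-node becomes an object
        "name": node.name,
        "attributes": [export_attribute(a) for a in node.attributes],
        "children": [export_node(child) for child in node.children],
    }


def export_document(nodes) -> str:
    # C6R06: the document is a JSON array of its exported nodes.
    return json.dumps([export_node(n) for n in nodes], ensure_ascii=False)


document = [
    NamedNode("p", [Attribute("lang", "en"), Attribute("draft")],
              [TextNode("Hi "), NamedNode("b", children=[TextNode("you")])]),
    TextNode("\n"),
]
exported = export_document(document)
```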