2009 lines
42 KiB
Markdown
2009 lines
42 KiB
Markdown
# regexp-tree
|
|
|
|
[![Build Status](https://travis-ci.org/DmitrySoshnikov/regexp-tree.svg?branch=master)](https://travis-ci.org/DmitrySoshnikov/regexp-tree) [![npm version](https://badge.fury.io/js/regexp-tree.svg)](https://badge.fury.io/js/regexp-tree) [![npm downloads](https://img.shields.io/npm/dt/regexp-tree.svg)](https://www.npmjs.com/package/regexp-tree)
|
|
|
|
Regular expressions processor in JavaScript
|
|
|
|
TL;DR: **RegExp Tree** is a _regular expressions processor_, which includes _parser_, _traversal_, _transformer_, _optimizer_, and _interpreter_ APIs.
|
|
|
|
You can get an overview of the tool in [this article](https://medium.com/@DmitrySoshnikov/regexp-tree-a-regular-expressions-parser-with-a-simple-ast-format-bcd4d5580df6).
|
|
|
|
### Table of Contents
|
|
|
|
- [Installation](#installation)
|
|
- [Development](#development)
|
|
- [Usage as a CLI](#usage-as-a-cli)
|
|
- [Usage from Node](#usage-from-node)
|
|
- [Capturing locations](#capturing-locations)
|
|
- [Using traversal API](#using-traversal-api)
|
|
- [Using transform API](#using-transform-api)
|
|
- [Transform plugins](#transform-plugins)
|
|
- [Using generator API](#using-generator-api)
|
|
- [Using optimizer API](#using-optimizer-api)
|
|
- [Optimizer ESLint plugin](#optimizer-eslint-plugin)
|
|
- [Using compat-transpiler API](#using-compat-transpiler-api)
|
|
- [Compat-transpiler Babel plugin](#compat-transpiler-babel-plugin)
|
|
- [RegExp extensions](#regexp-extensions)
|
|
- [RegExp extensions Babel plugin](#regexp-extensions-babel-plugin)
|
|
- [Creating RegExp objects](#creating-regexp-objects)
|
|
- [Executing regexes](#executing-regexes)
|
|
- [Using interpreter API](#using-interpreter-api)
|
|
- [Printing NFA/DFA tables](#printing-nfadfa-tables)
|
|
- [AST nodes specification](#ast-nodes-specification)
|
|
|
|
### Installation
|
|
|
|
The parser can be installed as an [npm module](https://www.npmjs.com/package/regexp-tree):
|
|
|
|
```
|
|
npm install -g regexp-tree
|
|
```
|
|
|
|
You can also [try it online](https://astexplorer.net/#/gist/4ea2b52f0e546af6fb14f9b2f5671c1c/39b55944da3e5782396ffa1fea3ba68d126cd394) using _AST Explorer_.
|
|
|
|
### Development
|
|
|
|
1. Fork https://github.com/DmitrySoshnikov/regexp-tree repo
|
|
2. If there is an actual issue from the [issues](https://github.com/DmitrySoshnikov/regexp-tree/issues) list you'd like to work on, feel free to assign it yourself, or comment on it to avoid collisions (open a new issue if needed)
|
|
3. Make your changes
|
|
4. Make sure `npm test` still passes (add new tests if needed)
|
|
5. Submit a PR
|
|
|
|
The _regexp-tree_ parser is implemented as an automatic LR parser using [Syntax](https://www.npmjs.com/package/syntax-cli) tool. The parser module is generated from the [regexp grammar](https://github.com/DmitrySoshnikov/regexp-tree/blob/master/src/parser/regexp.bnf), which is based on the regular expressions grammar used in ECMAScript.
|
|
|
|
For development from the github repository, run `build` command to generate the parser module, and transpile JS code:
|
|
|
|
```
|
|
git clone https://github.com/<your-github-account>/regexp-tree.git
|
|
cd regexp-tree
|
|
npm install
|
|
npm run build
|
|
```
|
|
|
|
> NOTE: JS code transpilation is used to support older versions of Node. For faster development cycle you can use `npm run watch` command, which continuously transpiles JS code.
|
|
|
|
### Usage as a CLI
|
|
|
|
**Note:** the CLI is exposed as its own [regexp-tree-cli](https://www.npmjs.com/package/regexp-tree-cli) module.
|
|
|
|
Check the options available from CLI:
|
|
|
|
```
|
|
regexp-tree-cli --help
|
|
```
|
|
|
|
```
|
|
Usage: regexp-tree-cli [options]
|
|
|
|
Options:
|
|
-e, --expression A regular expression to be parsed
|
|
-l, --loc Whether to capture AST node locations
|
|
-o, --optimize Applies optimizer on the passed expression
|
|
-c, --compat Applies compat-transpiler on the passed expression
|
|
-t, --table Print NFA/DFA transition tables (nfa/dfa/all)
|
|
```
|
|
|
|
To parse a regular expression, pass `-e` option:
|
|
|
|
```
|
|
regexp-tree-cli -e '/a|b/i'
|
|
```
|
|
|
|
Which produces an AST node corresponding to this regular expression:
|
|
|
|
```js
|
|
{
|
|
type: 'RegExp',
|
|
body: {
|
|
type: 'Disjunction',
|
|
left: {
|
|
type: 'Char',
|
|
value: 'a',
|
|
symbol: 'a',
|
|
kind: 'simple',
|
|
codePoint: 97
|
|
},
|
|
right: {
|
|
type: 'Char',
|
|
value: 'b',
|
|
symbol: 'b',
|
|
kind: 'simple',
|
|
codePoint: 98
|
|
}
|
|
},
|
|
flags: 'i',
|
|
}
|
|
```
|
|
|
|
> NOTE: the format of a regexp is `/ Body / OptionalFlags`.
|
|
|
|
### Usage from Node
|
|
|
|
The parser can also be used as a Node module:
|
|
|
|
```js
|
|
const regexpTree = require('regexp-tree');
|
|
|
|
console.log(regexpTree.parse(/a|b/i)); // RegExp AST
|
|
```
|
|
|
|
Note, _regexp-tree_ supports parsing regexes from strings, and also from actual `RegExp` objects (in general -- from any object which can be coerced to a string). If some feature is not implemented yet in an actual JavaScript RegExp, it should be passed as a string:
|
|
|
|
```js
|
|
// Pass an actual JS RegExp object.
|
|
regexpTree.parse(/a|b/i);
|
|
|
|
// Pass a string, since `s` flag may not be supported in older versions.
|
|
regexpTree.parse('/./s');
|
|
```
|
|
|
|
Also note, that in string-mode, escaping is done using two slashes `\\` per JavaScript:
|
|
|
|
```js
|
|
// As an actual regexp.
|
|
regexpTree.parse(/\n/);
|
|
|
|
// As a string.
|
|
regexpTree.parse('/\\n/');
|
|
```
|
|
|
|
### Capturing locations
|
|
|
|
For source code transformation tools it might be useful also to capture _locations_ of the AST nodes. From the command line it's controlled via the `-l` option:
|
|
|
|
```
|
|
regexp-tree-cli -e '/ab/' -l
|
|
```
|
|
|
|
This attaches `loc` object to each AST node:
|
|
|
|
```js
|
|
{
|
|
type: 'RegExp',
|
|
body: {
|
|
type: 'Alternative',
|
|
expressions: [
|
|
{
|
|
type: 'Char',
|
|
value: 'a',
|
|
symbol: 'a',
|
|
kind: 'simple',
|
|
codePoint: 97,
|
|
loc: {
|
|
start: {
|
|
line: 1,
|
|
column: 1,
|
|
offset: 1,
|
|
},
|
|
end: {
|
|
line: 1,
|
|
column: 2,
|
|
offset: 2,
|
|
},
|
|
}
|
|
},
|
|
{
|
|
type: 'Char',
|
|
value: 'b',
|
|
symbol: 'b',
|
|
kind: 'simple',
|
|
codePoint: 98,
|
|
loc: {
|
|
start: {
|
|
line: 1,
|
|
column: 2,
|
|
offset: 2,
|
|
},
|
|
end: {
|
|
line: 1,
|
|
column: 3,
|
|
offset: 3,
|
|
},
|
|
}
|
|
}
|
|
],
|
|
loc: {
|
|
start: {
|
|
line: 1,
|
|
column: 1,
|
|
offset: 1,
|
|
},
|
|
end: {
|
|
line: 1,
|
|
column: 3,
|
|
offset: 3,
|
|
},
|
|
}
|
|
},
|
|
flags: '',
|
|
loc: {
|
|
start: {
|
|
line: 1,
|
|
column: 0,
|
|
offset: 0,
|
|
},
|
|
end: {
|
|
line: 1,
|
|
column: 4,
|
|
offset: 4,
|
|
},
|
|
}
|
|
}
|
|
```
|
|
|
|
From Node it's controlled via `setOptions` method exposed on the parser:
|
|
|
|
```js
|
|
const regexpTree = require('regexp-tree');
|
|
|
|
const parsed = regexpTree
|
|
.parser
|
|
.setOptions({captureLocations: true})
|
|
.parse(/a|b/);
|
|
```
|
|
|
|
The `setOptions` method sets global options, which are preserved between calls. It is also possible to provide options per a single `parse` call, which might be more preferred:
|
|
|
|
```js
|
|
const regexpTree = require('regexp-tree');
|
|
|
|
const parsed = regexpTree.parse(/a|b/, {
|
|
captureLocations: true,
|
|
});
|
|
```
|
|
|
|
### Using traversal API
|
|
|
|
The [traverse](https://github.com/DmitrySoshnikov/regexp-tree/tree/master/src/traverse) module allows handling needed AST nodes using the _visitor_ pattern. In Node the module is exposed as the `regexpTree.traverse` method. Handlers receive an instance of the [NodePath](https://github.com/DmitrySoshnikov/regexp-tree/blob/master/src/traverse/README.md#nodepath-class) class, which encapsulates `node` itself, its `parent` node, `property`, and `index` (in case the node is part of a collection).
|
|
|
|
Visiting a node follows this algorithm:
|
|
- call `pre` handler.
|
|
- recurse into node's children.
|
|
- call `post` handler.
|
|
|
|
For each node type of interest, you can provide either:
|
|
- a function (`pre`).
|
|
- an object with members `pre` and `post`.
|
|
|
|
You can also provide a `*` handler which will be executed on every node.
|
|
|
|
Example:
|
|
|
|
```js
|
|
const regexpTree = require('regexp-tree');
|
|
|
|
// Get AST.
|
|
const ast = regexpTree.parse('/[a-z]{1,}/');
|
|
|
|
// Traverse AST nodes.
|
|
regexpTree.traverse(ast, {
|
|
|
|
// Visit every node before any type-specific handlers.
|
|
'*': function({node}) {
|
|
...
|
|
},
|
|
|
|
// Handle "Quantifier" node type.
|
|
Quantifier({node}) {
|
|
...
|
|
},
|
|
|
|
// Handle "Char" node type, before and after.
|
|
Char: {
|
|
pre({node}) {
|
|
...
|
|
},
|
|
post({node}) {
|
|
...
|
|
}
|
|
}
|
|
|
|
});
|
|
|
|
// Generate the regexp.
|
|
const re = regexpTree.generate(ast);
|
|
|
|
console.log(re); // '/[a-z]+/'
|
|
```
|
|
|
|
### Using transform API
|
|
|
|
> NOTE: you can play with transformation APIs, and write actual transforms for quick tests in AST Explorer. See [this example](http://astexplorer.net/#/gist/d293d22742b42cd1f7ee7b7e5dc6f697/39b0aabc42fb6fb106b9e368341d3300098f08c0).
|
|
|
|
While traverse module provides basic traversal API, which can be used for any purposes of AST handling, _transform_ module focuses mainly on _transformation_ of regular expressions.
|
|
|
|
It accepts a regular expressions in different formats (string, an actual `RegExp` object, or an AST), applies a set of transformations, and retuns an instance of [TransformResult](https://github.com/DmitrySoshnikov/regexp-tree/blob/master/src/transform/README.md#transformresult). Handles receive as a parameter the same [NodePath](https://github.com/DmitrySoshnikov/regexp-tree/blob/master/src/traverse/README.md#nodepath-class) object used in traverse.
|
|
|
|
Example:
|
|
|
|
```js
|
|
const regexpTree = require('regexp-tree');
|
|
|
|
// Handle nodes.
|
|
const re = regexpTree.transform('/[a-z]{1,}/i', {
|
|
|
|
/**
|
|
* Handle "Quantifier" node type,
|
|
* transforming `{1,}` quantifier to `+`.
|
|
*/
|
|
Quantifier(path) {
|
|
const {node} = path;
|
|
|
|
// {1,} -> +
|
|
if (
|
|
node.kind === 'Range' &&
|
|
node.from === 1 &&
|
|
!node.to
|
|
) {
|
|
path.replace({
|
|
type: 'Quantifier',
|
|
kind: '+',
|
|
greedy: node.greedy,
|
|
});
|
|
}
|
|
},
|
|
});
|
|
|
|
console.log(re.toString()); // '/[a-z]+/i'
|
|
console.log(re.toRegExp()); // /[a-z]+/i
|
|
console.log(re.getAST()); // AST for /[a-z]+/i
|
|
```
|
|
|
|
#### Transform plugins
|
|
|
|
A _transformation plugin_ is a module which exports a _transformation handler_. We have seen [above](#using-transform-api) how we can pass a handler object directly to the `regexpTree.transform` method, here we extract it into a separate module, so it can be implemented and shared independently:
|
|
|
|
Example of a plugin:
|
|
|
|
```js
|
|
// file: ./regexp-tree-a-to-b-transform.js
|
|
|
|
|
|
/**
|
|
* This plugin replaces chars 'a' with chars 'b'.
|
|
*/
|
|
module.exports = {
|
|
Char({node}) {
|
|
if (node.kind === 'simple' && node.value === 'a') {
|
|
node.value = 'b';
|
|
node.symbol = 'b';
|
|
node.codePoint = 98;
|
|
}
|
|
},
|
|
};
|
|
```
|
|
|
|
Once we have this plugin ready, we can require it, and pass to the `transform` function:
|
|
|
|
```js
|
|
const regexpTree = require('regexp-tree');
|
|
const plugin = require('./regexp-tree-a-to-b-transform');
|
|
|
|
const re = regexpTree.transform(/(a|c)a+[a-z]/, plugin);
|
|
|
|
console.log(re.toRegExp()); // /(b|c)b+[b-z]/
|
|
```
|
|
|
|
> NOTE: we can also pass a _list of plugins_ to the `regexpTree.transform`. In this case the plugins are applied in one pass in order. Another approach is to run several sequential calls to `transform`, setting up a pipeline, when a transformed AST is passed further to another plugin, etc.
|
|
|
|
You can see other examples of transform plugins in the [optimizer/transforms](https://github.com/DmitrySoshnikov/regexp-tree/tree/master/src/optimizer/transforms) or in the [compat-transpiler/transforms](https://github.com/DmitrySoshnikov/regexp-tree/tree/master/src/compat-transpiler/transforms) directories.
|
|
|
|
### Using generator API
|
|
|
|
The [generator](https://github.com/DmitrySoshnikov/regexp-tree/tree/master/src/generator) module generates regular expressions from corresponding AST nodes. In Node the module is exposed as `regexpTree.generate` method.
|
|
|
|
Example:
|
|
|
|
```js
|
|
const regexpTree = require('regexp-tree');
|
|
|
|
const re = regexpTree.generate({
|
|
type: 'RegExp',
|
|
body: {
|
|
type: 'Char',
|
|
value: 'a',
|
|
symbol: 'a',
|
|
kind: 'simple',
|
|
codePoint: 97
|
|
},
|
|
flags: 'i',
|
|
});
|
|
|
|
console.log(re); // '/a/i'
|
|
```
|
|
|
|
### Using optimizer API
|
|
|
|
[Optimizer](https://github.com/DmitrySoshnikov/regexp-tree/tree/master/src/optimizer) transforms your regexp into an _optimized_ version, replacing some sub-expressions with their idiomatic patterns. This might be good for different kinds of minifiers, as well as for regexp machines.
|
|
|
|
> NOTE: the Optimizer is implemented as a set of _regexp-tree_ [plugins](#transform-plugins).
|
|
|
|
Example:
|
|
|
|
```js
|
|
const regexpTree = require('regexp-tree');
|
|
|
|
const originalRe = /[a-zA-Z_0-9][A-Z_\da-z]*\e{1,}/;
|
|
|
|
const optimizedRe = regexpTree
|
|
.optimize(originalRe)
|
|
.toRegExp();
|
|
|
|
console.log(optimizedRe); // /\w+e+/
|
|
```
|
|
|
|
From CLI the optimizer is available via `--optimize` (`-o`) option:
|
|
|
|
```
|
|
regexp-tree-cli -e '/[a-zA-Z_0-9][A-Z_\da-z]*\e{1,}/' -o
|
|
```
|
|
|
|
Result:
|
|
|
|
```
|
|
Optimized: /\w+e+/
|
|
```
|
|
|
|
See the [optimizer README](https://github.com/DmitrySoshnikov/regexp-tree/tree/master/src/optimizer) for more details.
|
|
|
|
#### Optimizer ESLint plugin
|
|
|
|
The [optimizer](https://github.com/DmitrySoshnikov/regexp-tree/tree/master/src/optimizer) module is also available as an _ESLint plugin_, which can be installed at: [eslint-plugin-optimize-regex](https://www.npmjs.com/package/eslint-plugin-optimize-regex).
|
|
|
|
### Using compat-transpiler API
|
|
|
|
The [compat-transpiler](https://github.com/DmitrySoshnikov/regexp-tree/tree/master/src/compat-transpiler) module translates your regexp in new format or in new syntax, into an equivalent regexp in a legacy representation, so it can be used in engines which don't yet implement the new syntax.
|
|
|
|
> NOTE: the compat-transpiler is implemented as a set of _regexp-tree_ [plugins](#transform-plugins).
|
|
|
|
Example, "dotAll" `s` flag:
|
|
|
|
|
|
```js
|
|
/./s
|
|
```
|
|
|
|
Is translated into:
|
|
|
|
```js
|
|
/[\0-\uFFFF]/
|
|
```
|
|
|
|
Or [named capturing groups](#named-capturing-group):
|
|
|
|
```js
|
|
/(?<value>a)\k<value>\1/
|
|
```
|
|
|
|
Becomes:
|
|
|
|
```js
|
|
/(a)\1\1/
|
|
```
|
|
|
|
To use the API from Node:
|
|
|
|
```js
|
|
const regexpTree = require('regexp-tree');
|
|
|
|
// Using new syntax.
|
|
const originalRe = '/(?<all>.)\\k<all>/s';
|
|
|
|
// For legacy engines.
|
|
const compatTranspiledRe = regexpTree
|
|
.compatTranspile(originalRe)
|
|
.toRegExp();
|
|
|
|
console.log(compatTranspiledRe); // /([\0-\uFFFF])\1/
|
|
```
|
|
|
|
From CLI the compat-transpiler is available via `--compat` (`-c`) option:
|
|
|
|
```
|
|
regexp-tree-cli -e '/(?<all>.)\k<all>/s' -c
|
|
```
|
|
|
|
Result:
|
|
|
|
```
|
|
Compat: /([\0-\uFFFF])\1/
|
|
```
|
|
|
|
#### Compat-transpiler Babel plugin
|
|
|
|
The [compat-transpiler](https://github.com/DmitrySoshnikov/regexp-tree/tree/master/src/compat-transpiler) module is also available as a _Babel plugin_, which can be installed at: [babel-plugin-transform-modern-regexp](https://www.npmjs.com/package/babel-plugin-transform-modern-regexp).
|
|
|
|
Note, the plugin also includes [extended regexp](#regexp-extensions) features.
|
|
|
|
### RegExp extensions
|
|
|
|
Besides future proposals, like [named capturing group](#named-capturing-group), and other which are being currently standardized, _regexp-tree_ also supports _non-standard_ features.
|
|
|
|
> NOTE: _"non-standard"_ means specifically ECMAScript standard, since in other regexp egnines, e.g. PCRE, Python, etc. these features are standard.
|
|
|
|
One of such featurs is `x` flag, which enables _extended_ mode of regular expressions. In this mode most of whitespaces are ignored, and expressions can use #-comments.
|
|
|
|
Example:
|
|
|
|
```regex
|
|
/
|
|
# A regular expression for date.
|
|
|
|
(?<year>\d{4})- # year part of a date
|
|
(?<month>\d{2})- # month part of a date
|
|
(?<day>\d{2}) # day part of a date
|
|
|
|
/x
|
|
```
|
|
|
|
This is normally parsed by the _regexp-tree_ parser, and [compat-transpiler](#using-compat-transpiler-api) has full support for it; it's translated into:
|
|
|
|
```regex
|
|
/(\d{4})-(\d{2})-(\d{2})/
|
|
```
|
|
|
|
#### RegExp extensions Babel plugin
|
|
|
|
The regexp extensions are also available as a _Babel plugin_, which can be installed at: [babel-plugin-transform-modern-regexp](https://www.npmjs.com/package/babel-plugin-transform-modern-regexp).
|
|
|
|
Note, the plugin also includes [compat-transpiler](#using-compat-transpiler-api) features.
|
|
|
|
### Creating RegExp objects
|
|
|
|
To create an actual `RegExp` JavaScript object, we can use `regexpTree.toRegExp` method:
|
|
|
|
```js
|
|
const regexpTree = require('regexp-tree');
|
|
|
|
const re = regexpTree.toRegExp('/[a-z]/i');
|
|
|
|
console.log(
|
|
re.test('a'), // true
|
|
re.test('Z'), // true
|
|
);
|
|
```
|
|
|
|
### Executing regexes
|
|
|
|
It is also possible to execute regular expressions using `exec` API method, which has support for new syntax, and features, such as [named capturing group](#named-capturing-group), etc:
|
|
|
|
```js
|
|
const regexpTree = require('regexp-tree');
|
|
|
|
const re = `/
|
|
|
|
# A regular expression for date.
|
|
|
|
(?<year>\\d{4})- # year part of a date
|
|
(?<month>\\d{2})- # month part of a date
|
|
(?<day>\\d{2}) # day part of a date
|
|
|
|
/x`;
|
|
|
|
const string = '2017-04-14';
|
|
|
|
const result = regexpTree.exec(re, string);
|
|
|
|
console.log(result.groups); // {year: '2017', month: '04', day: '14'}
|
|
```
|
|
|
|
### Using interpreter API
|
|
|
|
> NOTE: you can read more about implementation details of the interpreter in [this series of articles](https://medium.com/@DmitrySoshnikov/building-a-regexp-machine-part-1-regular-grammars-d4986b585d7e).
|
|
|
|
In addition to executing regular expressions using JavaScript built-in RegExp engine, RegExp Tree also implements own [interpreter](https://github.com/DmitrySoshnikov/regexp-tree/tree/master/src/interpreter/finite-automaton) based on classic NFA/DFA finite automaton engine.
|
|
|
|
Currently it aims educational purposes -- to trace the regexp matching process, transitioning in NFA/DFA states. It also allows building state transitioning table, which can be used for custom implementation. In API the module is exposed as `fa` (finite-automaton) object.
|
|
|
|
Example:
|
|
|
|
```js
|
|
const {fa} = require('regexp-tree');
|
|
|
|
const re = /ab|c*/;
|
|
|
|
console.log(fa.test(re, 'ab')); // true
|
|
console.log(fa.test(re, '')); // true
|
|
console.log(fa.test(re, 'c')); // true
|
|
|
|
// NFA, and its transition table.
|
|
const nfa = fa.toNFA(re);
|
|
console.log(nfa.getTransitionTable());
|
|
|
|
// DFA, and its transition table.
|
|
const dfa = fa.toDFA(re);
|
|
console.log(dfa.getTransitionTable());
|
|
```
|
|
|
|
For more granular work with NFA and DFA, `fa` module also exposes convenient builders, so you can build NFA fragments directly:
|
|
|
|
```js
|
|
const {fa} = require('regexp-tree');
|
|
|
|
const {
|
|
alt,
|
|
char,
|
|
or,
|
|
rep,
|
|
} = fa.builders;
|
|
|
|
// ab|c*
|
|
const re = or(
|
|
alt(char('a'), char('b')),
|
|
rep(char('c'))
|
|
);
|
|
|
|
console.log(re.matches('ab')); // true
|
|
console.log(re.matches('')); // true
|
|
console.log(re.matches('c')); // true
|
|
|
|
// Build DFA from NFA
|
|
const {DFA} = fa;
|
|
|
|
const reDFA = new DFA(re);
|
|
|
|
console.log(reDFA.matches('ab')); // true
|
|
console.log(reDFA.matches('')); // true
|
|
console.log(reDFA.matches('c')); // true
|
|
```
|
|
|
|
#### Printing NFA/DFA tables
|
|
|
|
The `--table` option allows displaying NFA/DFA transition tables. RegExp Tree also applies _DFA minimization_ (using _N-equivalence_ algorithm), and produces the minimal transition table as its final result.
|
|
|
|
In the example below for the `/a|b|c/` regexp, we first obtain the NFA transition table, which is further converted to the original DFA transition table (down from the 10 non-deterministic states to 4 deterministic states), and eventually minimized to the final DFA table (from 4 to only 2 states).
|
|
|
|
```
|
|
./bin/regexp-tree-cli -e '/a|b|c/' --table all
|
|
```
|
|
|
|
Result:
|
|
|
|
```
|
|
> - starting
|
|
✓ - accepting
|
|
|
|
NFA transition table:
|
|
|
|
┌─────┬───┬───┬────┬─────────────┐
|
|
│ │ a │ b │ c │ ε* │
|
|
├─────┼───┼───┼────┼─────────────┤
|
|
│ 1 > │ │ │ │ {1,2,3,7,9} │
|
|
├─────┼───┼───┼────┼─────────────┤
|
|
│ 2 │ │ │ │ {2,3,7} │
|
|
├─────┼───┼───┼────┼─────────────┤
|
|
│ 3 │ 4 │ │ │ 3 │
|
|
├─────┼───┼───┼────┼─────────────┤
|
|
│ 4 │ │ │ │ {4,5,6} │
|
|
├─────┼───┼───┼────┼─────────────┤
|
|
│ 5 │ │ │ │ {5,6} │
|
|
├─────┼───┼───┼────┼─────────────┤
|
|
│ 6 ✓ │ │ │ │ 6 │
|
|
├─────┼───┼───┼────┼─────────────┤
|
|
│ 7 │ │ 8 │ │ 7 │
|
|
├─────┼───┼───┼────┼─────────────┤
|
|
│ 8 │ │ │ │ {8,5,6} │
|
|
├─────┼───┼───┼────┼─────────────┤
|
|
│ 9 │ │ │ 10 │ 9 │
|
|
├─────┼───┼───┼────┼─────────────┤
|
|
│ 10 │ │ │ │ {10,6} │
|
|
└─────┴───┴───┴────┴─────────────┘
|
|
|
|
|
|
DFA: Original transition table:
|
|
|
|
┌─────┬───┬───┬───┐
|
|
│ │ a │ b │ c │
|
|
├─────┼───┼───┼───┤
|
|
│ 1 > │ 4 │ 3 │ 2 │
|
|
├─────┼───┼───┼───┤
|
|
│ 2 ✓ │ │ │ │
|
|
├─────┼───┼───┼───┤
|
|
│ 3 ✓ │ │ │ │
|
|
├─────┼───┼───┼───┤
|
|
│ 4 ✓ │ │ │ │
|
|
└─────┴───┴───┴───┘
|
|
|
|
|
|
DFA: Minimized transition table:
|
|
|
|
┌─────┬───┬───┬───┐
|
|
│ │ a │ b │ c │
|
|
├─────┼───┼───┼───┤
|
|
│ 1 > │ 2 │ 2 │ 2 │
|
|
├─────┼───┼───┼───┤
|
|
│ 2 ✓ │ │ │ │
|
|
└─────┴───┴───┴───┘
|
|
```
|
|
|
|
### AST nodes specification
|
|
|
|
Below are the AST node types for different regular expressions patterns:
|
|
|
|
- [Char](#char)
|
|
- [Simple char](#simple-char)
|
|
- [Escaped char](#escaped-char)
|
|
- [Meta char](#meta-char)
|
|
- [Control char](#control-char)
|
|
- [Hex char-code](#hex-char-code)
|
|
- [Decimal char-code](#decimal-char-code)
|
|
- [Octal char-code](#octal-char-code)
|
|
- [Unicode](#unicode)
|
|
- [Character class](#character-class)
|
|
- [Positive character class](#positive-character-class)
|
|
- [Negative character class](#negative-character-class)
|
|
- [Character class ranges](#character-class-ranges)
|
|
- [Unicode properties](#unicode-properties)
|
|
- [Alternative](#alternative)
|
|
- [Disjunction](#disjunction)
|
|
- [Groups](#groups)
|
|
- [Capturing group](#capturing-group)
|
|
- [Named capturing group](#named-capturing-group)
|
|
- [Non-capturing group](#non-capturing-group)
|
|
- [Backreferences](#backreferences)
|
|
- [Quantifiers](#quantifiers)
|
|
- [? zero-or-one](#-zero-or-one)
|
|
- [* zero-or-more](#-zero-or-more)
|
|
- [+ one-or-more](#-one-or-more)
|
|
- [Range-based quantifiers](#range-based-quantifiers)
|
|
- [Exact number of matches](#exact-number-of-matches)
|
|
- [Open range](#open-range)
|
|
- [Closed range](#closed-range)
|
|
- [Non-greedy](#non-greedy)
|
|
- [Assertions](#assertions)
|
|
- [^ begin marker](#-begin-marker)
|
|
- [$ end marker](#-end-marker)
|
|
- [Boundary assertions](#boundary-assertions)
|
|
- [Lookahead assertions](#lookahead-assertions)
|
|
- [Positive lookahead assertion](#positive-lookahead-assertion)
|
|
- [Negative lookahead assertion](#negative-lookahead-assertion)
|
|
- [Lookbehind assertions](#lookbehind-assertions)
|
|
- [Positive lookbehind assertion](#positive-lookbehind-assertion)
|
|
- [Negative lookbehind assertion](#negative-lookbehind-assertion)
|
|
|
|
#### Char
|
|
|
|
A basic building block, single character. Can be _escaped_, and be of different _kinds_.
|
|
|
|
##### Simple char
|
|
|
|
Basic _non-escaped_ char in a regexp:
|
|
|
|
```
|
|
z
|
|
```
|
|
|
|
Node:
|
|
|
|
```js
|
|
{
|
|
type: 'Char',
|
|
value: 'z',
|
|
symbol: 'z',
|
|
kind: 'simple',
|
|
codePoint: 122
|
|
}
|
|
```
|
|
|
|
> NOTE: to test this from CLI, the char should be in an actual regexp -- `/z/`.
|
|
|
|
##### Escaped char
|
|
|
|
```
|
|
\z
|
|
```
|
|
|
|
The same value, `escaped` flag is added:
|
|
|
|
```js
|
|
{
|
|
type: 'Char',
|
|
value: 'z',
|
|
symbol: 'z',
|
|
kind: 'simple',
|
|
codePoint: 122,
|
|
escaped: true
|
|
}
|
|
```
|
|
|
|
Escaping is mostly used with meta symbols:
|
|
|
|
```
|
|
// Syntax error
|
|
*
|
|
```
|
|
|
|
```
|
|
\*
|
|
```
|
|
|
|
OK, node:
|
|
|
|
```js
|
|
{
|
|
type: 'Char',
|
|
value: '*',
|
|
symbol: '*',
|
|
kind: 'simple',
|
|
codePoint: 42,
|
|
escaped: true
|
|
}
|
|
```
|
|
|
|
##### Meta char
|
|
|
|
A _meta character_ should not be confused with an [escaped char](#escaped-char).
|
|
|
|
Example:
|
|
|
|
```
|
|
\n
|
|
```
|
|
|
|
Node:
|
|
|
|
```js
|
|
{
|
|
type: 'Char',
|
|
value: '\\n',
|
|
symbol: '\n',
|
|
kind: 'meta',
|
|
codePoint: 10
|
|
}
|
|
```
|
|
|
|
Among other meta character are: `.`, `\f`, `\r`, `\n`, `\t`, `\v`, `\0`, `[\b]` (backspace char), `\s`, `\S`, `\w`, `\W`, `\d`, `\D`.
|
|
|
|
> NOTE: Meta characters representing ranges (like `.`, `\s`, etc.) have `undefined` value for `symbol` and `NaN` for `codePoint`.
|
|
|
|
> NOTE: `\b` and `\B` are parsed as `Assertion` node type, not `Char`.
|
|
|
|
##### Control char
|
|
|
|
A char preceded with `\c`, e.g. `\cx`, which stands for `CTRL+x`:
|
|
|
|
```
|
|
\cx
|
|
```
|
|
|
|
Node:
|
|
|
|
```js
|
|
{
|
|
type: 'Char',
|
|
value: '\\cx',
|
|
symbol: undefined,
|
|
kind: 'control',
|
|
codePoint: NaN
|
|
}
|
|
```
|
|
|
|
##### HEX char-code
|
|
|
|
A char preceded with `\x`, followed by a HEX-code, e.g. `\x3B` (symbol `;`):
|
|
|
|
```
|
|
\x3B
|
|
```
|
|
|
|
Node:
|
|
|
|
```js
|
|
{
|
|
type: 'Char',
|
|
value: '\\x3B',
|
|
symbol: ';',
|
|
kind: 'hex',
|
|
codePoint: 59
|
|
}
|
|
```
|
|
|
|
##### Decimal char-code
|
|
|
|
Char-code:
|
|
|
|
```
|
|
\42
|
|
```
|
|
|
|
Node:
|
|
|
|
```js
|
|
{
|
|
type: 'Char',
|
|
value: '\\42',
|
|
symbol: '*',
|
|
kind: 'decimal',
|
|
codePoint: 42
|
|
}
|
|
```
|
|
|
|
##### Octal char-code
|
|
|
|
Char-code started with `\0`, followed by an octal number:
|
|
|
|
```
|
|
\073
|
|
```
|
|
|
|
Node:
|
|
|
|
```js
|
|
{
|
|
type: 'Char',
|
|
value: '\\073',
|
|
symbol: ';',
|
|
kind: 'oct',
|
|
codePoint: 59
|
|
}
|
|
```
|
|
|
|
##### Unicode
|
|
|
|
Unicode char started with `\u`, followed by a hex number:
|
|
|
|
```
|
|
\u003B
|
|
```
|
|
|
|
Node:
|
|
|
|
```js
|
|
{
|
|
type: 'Char',
|
|
value: '\\u003B',
|
|
symbol: ';',
|
|
kind: 'unicode',
|
|
codePoint: 59
|
|
}
|
|
```
|
|
|
|
When using the `u` flag, unicode chars can also be represented using `\u` followed by a hex number between curly braces:
|
|
|
|
```
|
|
\u{1F680}
|
|
```
|
|
|
|
Node:
|
|
|
|
```js
|
|
{
|
|
type: 'Char',
|
|
value: '\\u{1F680}',
|
|
symbol: '🚀',
|
|
kind: 'unicode',
|
|
codePoint: 128640
|
|
}
|
|
```
|
|
|
|
When using the `u` flag, unicode chars can also be represented using a surrogate pair:
|
|
|
|
```
|
|
\ud83d\ude80
|
|
```
|
|
|
|
Node:
|
|
|
|
```js
|
|
{
|
|
type: 'Char',
|
|
value: '\\ud83d\\ude80',
|
|
symbol: '🚀',
|
|
kind: 'unicode',
|
|
codePoint: 128640,
|
|
isSurrogatePair: true
|
|
}
|
|
```
|
|
|
|
#### Character class
|
|
|
|
Character classes define a _set_ of characters. A set may include as simple characters, as well as _character ranges_. A class can be _positive_ (any from the characters in the class match), or _negative_ (any _but_ the characters from the class match).
|
|
|
|
##### Positive character class
|
|
|
|
A positive character class is defined between `[` and `]` brackets:
|
|
|
|
```
|
|
[a*]
|
|
```
|
|
|
|
A node:
|
|
|
|
```js
|
|
{
|
|
type: 'CharacterClass',
|
|
expressions: [
|
|
{
|
|
type: 'Char',
|
|
value: 'a',
|
|
symbol: 'a',
|
|
kind: 'simple',
|
|
codePoint: 97
|
|
},
|
|
{
|
|
type: 'Char',
|
|
value: '*',
|
|
symbol: '*',
|
|
kind: 'simple',
|
|
codePoint: 42
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
> NOTE: some meta symbols are treated as normal characters in a character class. E.g. `*` is not a repetition quantifier, but a simple char.
|
|
|
|
##### Negative character class
|
|
|
|
A negative character class is defined between `[^` and `]` brackets:
|
|
|
|
```
|
|
[^ab]
|
|
```
|
|
|
|
An AST node is the same, just `negative` property is added:
|
|
|
|
```js
|
|
{
|
|
type: 'CharacterClass',
|
|
negative: true,
|
|
expressions: [
|
|
{
|
|
type: 'Char',
|
|
value: 'a',
|
|
symbol: 'a',
|
|
kind: 'simple',
|
|
codePoint: 97
|
|
},
|
|
{
|
|
type: 'Char',
|
|
value: 'b',
|
|
symbol: 'b',
|
|
kind: 'simple',
|
|
codePoint: 98
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
##### Character class ranges
|
|
|
|
As mentioned, a character class may also contain _ranges_ of symbols:
|
|
|
|
```
|
|
[a-z]
|
|
```
|
|
|
|
A node:
|
|
|
|
```js
|
|
{
|
|
type: 'CharacterClass',
|
|
expressions: [
|
|
{
|
|
type: 'ClassRange',
|
|
from: {
|
|
type: 'Char',
|
|
value: 'a',
|
|
symbol: 'a',
|
|
kind: 'simple',
|
|
codePoint: 97
|
|
},
|
|
to: {
|
|
type: 'Char',
|
|
value: 'z',
|
|
symbol: 'z',
|
|
kind: 'simple',
|
|
codePoint: 122
|
|
}
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
> NOTE: it is a _syntax error_ if `to` value is less than `from` value: `/[z-a]/`.
|
|
|
|
The range value can be the same for `from` and `to`, and the special range `-` character is treated as a simple character when it stands in a char position:
|
|
|
|
```
|
|
// from: 'a', to: 'a'
|
|
[a-a]
|
|
|
|
// from: '-', to: '-'
|
|
[---]
|
|
|
|
// simple '-' char:
|
|
[-]
|
|
|
|
// 3 ranges:
|
|
[a-zA-Z0-9]+
|
|
```
|
|
|
|
#### Unicode properties
|
|
|
|
Unicode property escapes are a new type of escape sequence available in regular expressions that have the `u` flag set. With this feature it is possible to write Unicode expressions as:
|
|
|
|
```js
|
|
const greekSymbolRe = /\p{Script=Greek}/u;
|
|
|
|
greekSymbolRe.test('π'); // true
|
|
```
|
|
|
|
The AST node for this expression is:
|
|
|
|
```js
|
|
{
|
|
type: 'UnicodeProperty',
|
|
name: 'Script',
|
|
value: 'Greek',
|
|
negative: false,
|
|
shorthand: false,
|
|
binary: false,
|
|
canonicalName: 'Script',
|
|
canonicalValue: 'Greek'
|
|
}
|
|
```
|
|
|
|
All possible property names, values, and their aliases can be found at the [specification](https://tc39.github.io/ecma262/#sec-runtime-semantics-unicodematchproperty-p).
|
|
|
|
For `General_Category` it is possible to use a shorthand:
|
|
|
|
```js
|
|
/\p{Letter}/u; // Shorthand
|
|
|
|
/\p{General_Category=Letter}/u; // Full notation
|
|
```
|
|
|
|
Binary names use the single value as well:
|
|
|
|
```js
|
|
/\p{ASCII_Hex_Digit}/u; // Same as: /[0-9A-Fa-f]/
|
|
```
|
|
|
|
The capitalized `P` defines the negation of the expression:
|
|
|
|
```js
|
|
/\P{ASCII_Hex_Digit}/u; // NOT a ASCII Hex digit
|
|
```
|
|
|
|
#### Alternative
|
|
|
|
An _alternative_ (or _concatenation_) defines a chain of patterns followed one after another:
|
|
|
|
```
|
|
abc
|
|
```
|
|
|
|
A node:
|
|
|
|
```js
|
|
{
|
|
type: 'Alternative',
|
|
expressions: [
|
|
{
|
|
type: 'Char',
|
|
value: 'a',
|
|
symbol: 'a',
|
|
kind: 'simple',
|
|
codePoint: 97
|
|
},
|
|
{
|
|
type: 'Char',
|
|
value: 'b',
|
|
symbol: 'b',
|
|
kind: 'simple',
|
|
codePoint: 98
|
|
},
|
|
{
|
|
type: 'Char',
|
|
value: 'c',
|
|
symbol: 'c',
|
|
kind: 'simple',
|
|
codePoint: 99
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
Another examples:
|
|
|
|
```
|
|
// 'a' with a quantifier, followed by 'b'
|
|
a?b
|
|
|
|
// A group followed by a class:
|
|
(ab)[a-z]
|
|
```
|
|
|
|
#### Disjunction
|
|
|
|
The _disjunction_ defines "OR" operation for regexp patterns. It's a _binary_ operation, having `left`, and `right` nodes.
|
|
|
|
Matches `a` or `b`:
|
|
|
|
```
|
|
a|b
|
|
```
|
|
|
|
A node:
|
|
|
|
```js
|
|
{
|
|
type: 'Disjunction',
|
|
left: {
|
|
type: 'Char',
|
|
value: 'a',
|
|
symbol: 'a',
|
|
kind: 'simple',
|
|
codePoint: 97
|
|
},
|
|
right: {
|
|
type: 'Char',
|
|
value: 'b',
|
|
symbol: 'b',
|
|
kind: 'simple',
|
|
codePoint: 98
|
|
}
|
|
}
|
|
```
|
|
|
|
#### Groups
|
|
|
|
The groups play two roles: they define _grouping precedence_, and allow to _capture_ needed sub-expressions in case of a capturing group.
|
|
|
|
##### Capturing group
|
|
|
|
_"Capturing"_ means the matched string can be referred later by a user, including in the pattern itself -- by using [backreferences](#backreferences).
|
|
|
|
Char `a`, and `b` are grouped, followed by the `c` char:
|
|
|
|
```
|
|
(ab)c
|
|
```
|
|
|
|
A node:
|
|
|
|
```js
|
|
{
|
|
type: 'Alternative',
|
|
expressions: [
|
|
{
|
|
type: 'Group',
|
|
capturing: true,
|
|
number: 1,
|
|
expression: {
|
|
type: 'Alternative',
|
|
expressions: [
|
|
{
|
|
type: 'Char',
|
|
value: 'a',
|
|
symbol: 'a',
|
|
kind: 'simple',
|
|
codePoint: 97
|
|
},
|
|
{
|
|
type: 'Char',
|
|
value: 'b',
|
|
symbol: 'b',
|
|
kind: 'simple',
|
|
codePoint: 98
|
|
}
|
|
]
|
|
}
|
|
},
|
|
{
|
|
type: 'Char',
|
|
value: 'c',
|
|
symbol: 'c',
|
|
kind: 'simple',
|
|
codePoint: 99
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
As we can see, it also tracks the number of the group.
|
|
|
|
Another example:
|
|
|
|
```
|
|
// A grouped disjunction of a symbol, and a character class:
|
|
(5|[a-z])
|
|
```
|
|
|
|
##### Named capturing group
|
|
|
|
> NOTE: _Named capturing groups_ are not yet supported by JavaScript RegExp. It is an ECMAScript [proposal](https://tc39.github.io/proposal-regexp-named-groups/) which is at stage 3 at the moment.
|
|
|
|
A capturing group can be given a name using the `(?<name>...)` syntax, for any identifier `name`.
|
|
|
|
For example, a regular expressions for a date:
|
|
|
|
```js
|
|
/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/u
|
|
```
|
|
|
|
For the group:
|
|
|
|
```js
|
|
(?<foo>x)
|
|
```
|
|
|
|
We have the following node (the `name` property with value `foo` is added):
|
|
|
|
```js
|
|
{
|
|
type: 'Group',
|
|
capturing: true,
|
|
name: 'foo',
|
|
number: 1,
|
|
expression: {
|
|
type: 'Char',
|
|
value: 'x',
|
|
symbol: 'x',
|
|
kind: 'simple',
|
|
codePoint: 120
|
|
}
|
|
}
|
|
```
|
|
|
|
##### Non-capturing group
|
|
|
|
Sometimes we don't need to actually capture the matched string from a group. In this case we can use a _non-capturing_ group:
|
|
|
|
Char `a`, and `b` are grouped, _but not captured_, followed by the `c` char:
|
|
|
|
```
|
|
(?:ab)c
|
|
```
|
|
|
|
The same node, the `capturing` flag is `false`:
|
|
|
|
```js
|
|
{
|
|
type: 'Alternative',
|
|
expressions: [
|
|
{
|
|
type: 'Group',
|
|
capturing: false,
|
|
expression: {
|
|
type: 'Alternative',
|
|
expressions: [
|
|
{
|
|
type: 'Char',
|
|
value: 'a',
|
|
symbol: 'a',
|
|
kind: 'simple',
|
|
codePoint: 97
|
|
},
|
|
{
|
|
type: 'Char',
|
|
value: 'b',
|
|
symbol: 'b',
|
|
kind: 'simple',
|
|
codePoint: 98
|
|
}
|
|
]
|
|
}
|
|
},
|
|
{
|
|
type: 'Char',
|
|
value: 'c',
|
|
symbol: 'c',
|
|
kind: 'simple',
|
|
codePoint: 99
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
##### Backreferences
|
|
|
|
A [capturing group](#capturing-group) can be referenced in the pattern using notation of an escaped group number.
|
|
|
|
Matches `abab` string:
|
|
|
|
```
|
|
(ab)\1
|
|
```
|
|
|
|
A node:
|
|
|
|
```js
|
|
{
|
|
type: 'Alternative',
|
|
expressions: [
|
|
{
|
|
type: 'Group',
|
|
capturing: true,
|
|
number: 1,
|
|
expression: {
|
|
type: 'Alternative',
|
|
expressions: [
|
|
{
|
|
type: 'Char',
|
|
value: 'a',
|
|
symbol: 'a',
|
|
kind: 'simple',
|
|
codePoint: 97
|
|
},
|
|
{
|
|
type: 'Char',
|
|
value: 'b',
|
|
symbol: 'b',
|
|
kind: 'simple',
|
|
codePoint: 98
|
|
}
|
|
]
|
|
}
|
|
},
|
|
{
|
|
type: 'Backreference',
|
|
kind: 'number',
|
|
number: 1,
|
|
reference: 1,
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
A [named capturing group](#named-capturing-group) can be accessed using `\k<name>` pattern, and also using a numbered reference.
|
|
|
|
Matches `www`:
|
|
|
|
```js
|
|
(?<foo>w)\k<foo>\1
|
|
```
|
|
|
|
A node:
|
|
|
|
```js
|
|
{
|
|
type: 'Alternative',
|
|
expressions: [
|
|
{
|
|
type: 'Group',
|
|
capturing: true,
|
|
name: 'foo',
|
|
number: 1,
|
|
expression: {
|
|
type: 'Char',
|
|
value: 'w',
|
|
symbol: 'w',
|
|
kind: 'simple',
|
|
codePoint: 119
|
|
}
|
|
},
|
|
{
|
|
type: 'Backreference',
|
|
kind: 'name',
|
|
number: 1,
|
|
reference: 'foo'
|
|
},
|
|
{
|
|
type: 'Backreference',
|
|
kind: 'number',
|
|
number: 1,
|
|
reference: 1
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
#### Quantifiers
|
|
|
|
Quantifiers specify _repetition_ of a regular expression (or of its part). Below are the quantifiers which _wrap_ a parsed expression into a `Repetition` node. The quantifier itself can be of different _kinds_, and has `Quantifier` node type.
|
|
|
|
##### ? zero-or-one
|
|
|
|
The `?` quantifier is short for `{0,1}`.
|
|
|
|
```
|
|
a?
|
|
```
|
|
|
|
Node:
|
|
|
|
```js
|
|
{
|
|
type: 'Repetition',
|
|
expression: {
|
|
type: 'Char',
|
|
value: 'a',
|
|
symbol: 'a',
|
|
kind: 'simple',
|
|
codePoint: 97
|
|
},
|
|
quantifier: {
|
|
type: 'Quantifier',
|
|
kind: '?',
|
|
greedy: true
|
|
}
|
|
}
|
|
```
|
|
|
|
##### * zero-or-more
|
|
|
|
The `*` quantifier is short for `{0,}`.
|
|
|
|
```
|
|
a*
|
|
```
|
|
|
|
Node:
|
|
|
|
```js
|
|
{
|
|
type: 'Repetition',
|
|
expression: {
|
|
type: 'Char',
|
|
value: 'a',
|
|
symbol: 'a',
|
|
kind: 'simple',
|
|
codePoint: 97
|
|
},
|
|
quantifier: {
|
|
type: 'Quantifier',
|
|
kind: '*',
|
|
greedy: true
|
|
}
|
|
}
|
|
```
|
|
|
|
##### + one-or-more
|
|
|
|
The `+` quantifier is short for `{1,}`.
|
|
|
|
```
|
|
// Same as `aa*`, or `a{1,}`
|
|
a+
|
|
```
|
|
|
|
Node:
|
|
|
|
```js
|
|
{
|
|
type: 'Repetition',
|
|
expression: {
|
|
type: 'Char',
|
|
value: 'a',
|
|
symbol: 'a',
|
|
kind: 'simple',
|
|
codePoint: 97
|
|
},
|
|
quantifier: {
|
|
type: 'Quantifier',
|
|
kind: '+',
|
|
greedy: true
|
|
}
|
|
}
|
|
```
|
|
|
|
##### Range-based quantifiers
|
|
|
|
Explicit _range-based_ quantifiers are parsed as follows:
|
|
|
|
###### Exact number of matches
|
|
|
|
```
|
|
a{3}
|
|
```
|
|
|
|
The type of the quantifier is `Range`, and `from`, and `to` properties have the same value:
|
|
|
|
```js
|
|
{
|
|
type: 'Repetition',
|
|
expression: {
|
|
type: 'Char',
|
|
value: 'a',
|
|
symbol: 'a',
|
|
kind: 'simple',
|
|
codePoint: 97
|
|
},
|
|
quantifier: {
|
|
type: 'Quantifier',
|
|
kind: 'Range',
|
|
from: 3,
|
|
to: 3,
|
|
greedy: true
|
|
}
|
|
}
|
|
```
|
|
|
|
###### Open range
|
|
|
|
An open range doesn't have max value (assuming semantic "more", or Infinity value):
|
|
|
|
```
|
|
a{3,}
|
|
```
|
|
|
|
An AST node for such range doesn't contain `to` property:
|
|
|
|
```js
|
|
{
|
|
type: 'Repetition',
|
|
expression: {
|
|
type: 'Char',
|
|
value: 'a',
|
|
symbol: 'a',
|
|
kind: 'simple',
|
|
codePoint: 97
|
|
},
|
|
quantifier: {
|
|
type: 'Quantifier',
|
|
kind: 'Range',
|
|
from: 3,
|
|
greedy: true
|
|
}
|
|
}
|
|
```
|
|
|
|
###### Closed range
|
|
|
|
A closed range has explicit max value: (which syntactically can be the same as min value):
|
|
|
|
```
|
|
a{3,5}
|
|
|
|
// Same as a{3}
|
|
a{3,3}
|
|
```
|
|
|
|
An AST node for a closed range:
|
|
|
|
```js
|
|
{
|
|
type: 'Repetition',
|
|
expression: {
|
|
type: 'Char',
|
|
value: 'a',
|
|
symbol: 'a',
|
|
kind: 'simple',
|
|
codePoint: 97
|
|
},
|
|
quantifier: {
|
|
type: 'Quantifier',
|
|
kind: 'Range',
|
|
from: 3,
|
|
to: 5,
|
|
greedy: true
|
|
}
|
|
}
|
|
```
|
|
|
|
> NOTE: it is a _syntax error_ if the max value is less than min value: `/a{3,2}/`
|
|
|
|
##### Non-greedy
|
|
|
|
If any quantifier is followed by the `?`, the quantifier becomes _non-greedy_.
|
|
|
|
Example:
|
|
|
|
```
|
|
a+?
|
|
```
|
|
|
|
Node:
|
|
|
|
```js
|
|
{
|
|
type: 'Repetition',
|
|
expression: {
|
|
type: 'Char',
|
|
value: 'a',
|
|
symbol: 'a',
|
|
kind: 'simple',
|
|
codePoint: 97
|
|
},
|
|
quantifier: {
|
|
type: 'Quantifier',
|
|
kind: '+',
|
|
greedy: false
|
|
}
|
|
}
|
|
```
|
|
|
|
Other examples:
|
|
|
|
```
|
|
a??
|
|
a*?
|
|
a{1}?
|
|
a{1,}?
|
|
a{1,3}?
|
|
```
|
|
|
|
#### Assertions
|
|
|
|
Assertions appear as separate AST nodes, however instread of manipulating on the characters themselves, they _assert_ certain conditions of a matching string. Examples: `^` -- beginning of a string (or a line in multiline mode), `$` -- end of a string, etc.
|
|
|
|
##### ^ begin marker
|
|
|
|
The `^` assertion checks whether a scanner is at the beginning of a string (or a line in multiline mode).
|
|
|
|
In the example below `^` is not a property of the `a` symbol, but a separate AST node for the assertion. The parsed node is actually an `Alternative` with two nodes:
|
|
|
|
```
|
|
^a
|
|
```
|
|
|
|
The node:
|
|
|
|
```js
|
|
{
|
|
type: 'Alternative',
|
|
expressions: [
|
|
{
|
|
type: 'Assertion',
|
|
kind: '^'
|
|
},
|
|
{
|
|
type: 'Char',
|
|
value: 'a',
|
|
symbol: 'a',
|
|
kind: 'simple',
|
|
codePoint: 97
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
Since assertion is a separate node, it may appear anywhere in the matching string. The following regexp is completely valid, and asserts beginning of the string; it'll match an empty string:
|
|
|
|
```
|
|
^^^^^
|
|
```
|
|
|
|
##### $ end marker
|
|
|
|
The `$` assertion is similar to `^`, but asserts the end of a string (or a line in a multiline mode):
|
|
|
|
```
|
|
a$
|
|
```
|
|
|
|
A node:
|
|
|
|
```js
|
|
{
|
|
type: 'Alternative',
|
|
expressions: [
|
|
{
|
|
type: 'Char',
|
|
value: 'a',
|
|
symbol: 'a',
|
|
kind: 'simple',
|
|
codePoint: 97
|
|
},
|
|
{
|
|
type: 'Assertion',
|
|
kind: '$'
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
And again, this is a completely valid regexp, and matches an empty string:
|
|
|
|
```
|
|
^^^^$$$$$
|
|
|
|
// valid too:
|
|
$^
|
|
```
|
|
|
|
##### Boundary assertions
|
|
|
|
The `\b` assertion check for _word boundary_, i.e. the position between a word and a space.
|
|
|
|
Matches `x` in `x y`, but not in `xy`:
|
|
|
|
```
|
|
x\b
|
|
```
|
|
|
|
A node:
|
|
|
|
```js
|
|
{
|
|
type: 'Alternative',
|
|
expressions: [
|
|
{
|
|
type: 'Char',
|
|
value: 'x',
|
|
symbol: 'x',
|
|
kind: 'simple',
|
|
codePoint: 120
|
|
},
|
|
{
|
|
type: 'Assertion',
|
|
kind: '\\b'
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
The `\B` is vice-versa checks for _non-word_ boundary. The following example matches `x` in `xy`, but not in `x y`:
|
|
|
|
```
|
|
x\B
|
|
```
|
|
|
|
A node is the same:
|
|
|
|
```js
|
|
{
|
|
type: 'Alternative',
|
|
expressions: [
|
|
{
|
|
type: 'Char',
|
|
value: 'x',
|
|
symbol: 'x',
|
|
kind: 'simple',
|
|
codePoint: 120
|
|
},
|
|
{
|
|
type: 'Assertion',
|
|
kind: '\\B'
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
##### Lookahead assertions
|
|
|
|
These assertions check whether a pattern is _followed_ (or not followed for the negative assertion) by another pattern.
|
|
|
|
###### Positive lookahead assertion
|
|
|
|
Matches `a` only if it's followed by `b`:
|
|
|
|
```
|
|
a(?=b)
|
|
```
|
|
|
|
A node:
|
|
|
|
```js
|
|
{
|
|
type: 'Alternative',
|
|
expressions: [
|
|
{
|
|
type: 'Char',
|
|
value: 'a',
|
|
symbol: 'a',
|
|
kind: 'simple',
|
|
codePoint: 97
|
|
},
|
|
{
|
|
type: 'Assertion',
|
|
kind: 'Lookahead',
|
|
assertion: {
|
|
type: 'Char',
|
|
value: 'b',
|
|
symbol: 'b',
|
|
kind: 'simple',
|
|
codePoint: 98
|
|
}
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
###### Negative lookahead assertion
|
|
|
|
Matches `a` only if it's _not_ followed by `b`:
|
|
|
|
```
|
|
a(?!b)
|
|
```
|
|
|
|
A node is similar, just `negative` flag is added:
|
|
|
|
```js
|
|
{
|
|
type: 'Alternative',
|
|
expressions: [
|
|
{
|
|
type: 'Char',
|
|
value: 'a',
|
|
symbol: 'a',
|
|
kind: 'simple',
|
|
codePoint: 97
|
|
},
|
|
{
|
|
type: 'Assertion',
|
|
kind: 'Lookahead',
|
|
negative: true,
|
|
assertion: {
|
|
type: 'Char',
|
|
value: 'b',
|
|
symbol: 'b',
|
|
kind: 'simple',
|
|
codePoint: 98
|
|
}
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
##### Lookbehind assertions
|
|
|
|
> NOTE: _Lookbehind assertions_ are not yet supported by JavaScript RegExp. It is an ECMAScript [proposal](https://tc39.github.io/proposal-regexp-lookbehind/) which is at stage 3 at the moment.
|
|
|
|
These assertions check whether a pattern is _preceded_ (or not preceded for the negative assertion) by another pattern.
|
|
|
|
###### Positive lookbehind assertion
|
|
|
|
Matches `b` only if it's preceded by `a`:
|
|
|
|
```
|
|
(?<=a)b
|
|
```
|
|
|
|
A node:
|
|
|
|
```js
|
|
{
|
|
type: 'Alternative',
|
|
expressions: [
|
|
{
|
|
type: 'Assertion',
|
|
kind: 'Lookbehind',
|
|
assertion: {
|
|
type: 'Char',
|
|
value: 'a',
|
|
symbol: 'a',
|
|
kind: 'simple',
|
|
codePoint: 97
|
|
}
|
|
},
|
|
{
|
|
type: 'Char',
|
|
value: 'b',
|
|
symbol: 'b',
|
|
kind: 'simple',
|
|
codePoint: 98
|
|
},
|
|
]
|
|
}
|
|
```
|
|
|
|
###### Negative lookbehind assertion
|
|
|
|
Matches `b` only if it's _not_ preceded by `a`:
|
|
|
|
```
|
|
(?<!a)b
|
|
```
|
|
|
|
A node:
|
|
|
|
```js
|
|
{
|
|
type: 'Alternative',
|
|
expressions: [
|
|
{
|
|
type: 'Assertion',
|
|
kind: 'Lookbehind',
|
|
negative: true,
|
|
assertion: {
|
|
type: 'Char',
|
|
value: 'a',
|
|
symbol: 'a',
|
|
kind: 'simple',
|
|
codePoint: 97
|
|
}
|
|
},
|
|
{
|
|
type: 'Char',
|
|
value: 'b',
|
|
symbol: 'b',
|
|
kind: 'simple',
|
|
codePoint: 98
|
|
},
|
|
]
|
|
}
|
|
```
|