Forum: Too Lazy BBS

[Python-announce] TatSu v5.17.0

From Juancarlo Añez <apalala@gmail.com> to comp.lang.python.announce on Mon Feb 16 16:09:00 2026

From Newsgroup: comp.lang.python.announce

TatSu is a tool that takes grammars in a superset of EBNF as input, and outputs memoizing (Packrat) PEG parsers in Python. The classic variations
of EBNF (Tomassetti, EasyExtend, Wirth) and ISO EBNF are also supported as input grammar formats.

The Overdue Major Refactoring
------------------------------------------

Maintenance and contributions to TatSu have been more difficult than
necessary because of the way the code evolved through its lifetime.

- Very long modules and classes that try to do too much
- Algorithms difficult to understand or with incorrect semantics
- Basic features missing, because the above made them hard to implement

This release is a major refactoring of the code in TatSu.

- Complex modules were partitioned into sub-modules and classes with
well-defined purpose
- Several algorithms were rewritten to make their semantics clear and
evident, and their implementation more efficient
- Many unit tests were added to assert the semantics of complex
algorithms
- Several user-facing features were added as they became easier to
implement

For the details about the many changes please take a look at the commit
log.

- pypi: https://pypi.org/project/TatSu/
- docs: https://tatsu.readthedocs.io/
- repo: https://github.com/neogeny/TatSu

Every effort has been made to preserve backwards compatibility by
keeping most unit tests intact and testing with projects with large
grammars and complex processing. If something escaped those tests, there
will be a bugfix release with the fixes soon enough.

User-Facing Changes
-------------------

- The TatSu documentation has been improved and expanded, and it has a
better look and feel with improved navigation.

- TatSu doesn't care about file names, but the default extension used in
unit tests, examples, and documentation for grammars is now .tatsu

- EBNF, both ISO and the classic variations, is fully supported as
grammar input format

- Now tatsu.parse(..., asmodel=True) produces a model that matches the
::Type declarations in the grammar (see the models documentation for
a thorough review of the features).

- walkers.NodeWalker now handles all known types of input. Also:

  - DepthFirstWalker was reimplemented to ensure DFS semantics
  - PostOrderDepthFirstWalker walks children before parents
  - PreOrderWalker was broken and crazy. It was rewritten as a
    BreadthFirstWalker with the correct semantics
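
The traversal orders named above differ only in when a node is visited relative to its children. The following is a minimal sketch on a toy tree, not TatSu's walker code; the `Tree` class and the three function names are hypothetical and exist only for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Tree:
    # Hypothetical toy node; TatSu walkers operate on Node objects instead.
    name: str
    children: list = field(default_factory=list)


def preorder(node):
    """Depth-first: visit the parent, then each subtree in turn."""
    yield node.name
    for child in node.children:
        yield from preorder(child)


def postorder(node):
    """Depth-first: visit all children before their parent."""
    for child in node.children:
        yield from postorder(child)
    yield node.name


def breadth_first(node):
    """Visit nodes level by level using a FIFO queue."""
    queue = deque([node])
    while queue:
        current = queue.popleft()
        yield current.name
        queue.extend(current.children)


tree = Tree("a", [Tree("b", [Tree("d")]), Tree("c")])
print(list(preorder(tree)))       # ['a', 'b', 'd', 'c']
print(list(postorder(tree)))      # ['d', 'b', 'c', 'a']
print(list(breadth_first(tree)))  # ['a', 'b', 'c', 'd']
```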

- Constant expressions in a grammar are now evaluated deeply with
multiple passes of eval() so as to produce results that are intuitively
correct:

    from tatsu import parse

    def test_constant_math():
        grammar = r"""
            start = a:`7` b:`2` @:```{a} / {b}``` $ ;
        """
        result = parse(grammar, '', trace=True)
        assert result == 3.5

- Evaluation of Python expressions by the parsing engine now uses
safe_eval(), a hardened firewall against most security attacks
targeting eval() (see the safeeval module for details)
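
One common way to harden eval() is to inspect the expression's syntax tree before evaluating it. The sketch below shows that general technique with the standard library only. It is not the code in TatSu's safeeval module; the `checked_eval` name and the arithmetic-only whitelist are assumptions made for the example.

```python
import ast

# Assumption: a small whitelist that permits arithmetic on literals only.
ALLOWED_NODES = (
    ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant,
    ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub, ast.UAdd,
)


def checked_eval(expression: str):
    """Evaluate an arithmetic expression after checking every AST node."""
    tree = ast.parse(expression, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
    # No builtins and no names are made available to the evaluated code.
    return eval(compile(tree, "<expr>", "eval"), {"__builtins__": {}}, {})


print(checked_eval("7 / 2"))  # 3.5
```

Calls, attribute access, and name lookups are rejected before anything runs, so an input such as `__import__('os')` raises ValueError instead of being evaluated.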

- Because None is a valid initial value for attributes and a frequent
return value for callables, the required logic for undefined values
was moved to the notnone module, which declares Undefined as an alias
for notnone.NotNone

In [1]: from tatsu.util.undefined import Undefined
In [2]: u = Undefined
In [3]: u is None
Out[3]: False
In [4]: u is Undefined
Out[4]: True
In [5]: Undefined is None
Out[5]: False
In [6]: d = u or 'OK'
In [7]: d
Out[7]: 'OK'

- objectmodel.Node was rewritten to give it clear semantics and
efficiency

  - New attributes to Node after initialization generate a warning if
    the name of a method is being shadowed. This change avoids confusing
    @dataclass, which is used in generated object models.
  - Node equality is explicitly defined as object identity. No attempts
    are made at comparing Node structurally.
  - Node.children() has the expected semantics, and is much more
    efficient.

- Node.parseinfo is now honored by the parsing engine (previously, only
results of type AST could have a parseinfo). Generation of parseinfo
is disabled by default, and is enabled by passing parseinfo=True to
the API entry points.

    import tatsu

    def test_node_parseinfo(self):
        grammar = """
            @@grammar :: Test
            start::Test = true | false ;
            true = "test" @:`True` $;
            false = "test" @:`False` $;
        """

        text = 'test'
        node = tatsu.parse(grammar, text, asmodel=True, parseinfo=True)
        assert type(node).__name__ == 'Test'
        assert node.ast is True
        assert node.parseinfo is not None
        assert node.parseinfo.pos == 0
        assert node.parseinfo.endpos == len(text)

- Synthetic classes created by synth.synthetize() during parsing with
ModelBuilderSemantics behave more consistently, and now have
SynthNode(BaseNode) as their base class

- Now ast.AST has consistent semantics of a dict that allows access to
contents using the attribute interface
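
The idea of a dict that can also be read through attributes can be sketched in a few lines. This is an illustration only, not the implementation of ast.AST; the `AttrDict` name is hypothetical.

```python
class AttrDict(dict):
    """Hypothetical dict whose keys can also be read as attributes."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so dict
        # methods such as keys() and items() keep working.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


node = AttrDict(left=1, op="+", right=2)
print(node.op, node["left"])  # + 1
```

In this sketch a key that collides with a dict method name (for example `keys`) is reachable only with item access, because regular attribute lookup finds the method first.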

- asjson() and friends now cover all known cases with improved
consistency and efficiency, so there are fewer demands on clients of
the API

- Entry points no longer list a large subset of the configuration
options defined in ParserConfig, but still accept them through
**settings keyword arguments. Now ParserConfig verifies that the
settings passed to it are valid, eliminating the frustration of passing
an incorrect setting name (a typo) and hoping it has the intended
effect.
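
Validating **settings against the declared fields of a configuration class can be sketched as follows. The `Config` class and its three fields are hypothetical stand-ins; the real ParserConfig defines many more options and may report errors differently.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class Config:
    # Hypothetical settings used only for this illustration.
    trace: bool = False
    parseinfo: bool = False
    asmodel: bool = False

    def override(self, **settings):
        """Return a copy with settings applied, rejecting unknown names."""
        known = {f.name for f in fields(self)}
        unknown = sorted(set(settings) - known)
        if unknown:
            raise TypeError(f"unknown settings: {', '.join(unknown)}")
        return replace(self, **settings)


config = Config().override(parseinfo=True)
print(config)
```

With this check, a misspelled name such as `praseinfo=True` fails immediately with a TypeError instead of being silently ignored.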

- TatSu still has no library dependencies for its core functionality,
but several libraries are used during its development and testing. The
TatSu development configuration uses uv and hatch. Several
requirements-xyz.txt files are generated for the benefit of those using
pip with pyenv, virtualenvwrapper, or virtualenv

- All attempts at recovering comments from parsed input were removed. It
never worked, so it had no use. Comment recovery may be attempted in
the future.

- All pre-existing grammars are compatible with this version of TatSu.

- Previously generated Python parsers and models work with this version
of TatSu, yet you should consider generating them anew to take
advantage of the improved speed, layout, and features.

- CAVEAT: Several functions, methods, and argument names were
deprecated. They can still be used, but warnings will be issued at
runtime.

- CAVEAT: If there are invalid strings or regex patterns in your
grammars YOU MUST fix them because now the grammar parser validates
strings and patterns.
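
Before upgrading, the regex patterns used in a grammar can be given a first check with the standard `re` module. The sketch below assumes that a pattern rejected by `re.compile()` would also be rejected by the grammar parser; TatSu's own validation may be stricter. The `invalid_patterns` helper is hypothetical.

```python
import re


def invalid_patterns(patterns):
    """Return (pattern, error message) pairs for patterns re cannot compile."""
    problems = []
    for pattern in patterns:
        try:
            re.compile(pattern)
        except re.error as exc:
            problems.append((pattern, str(exc)))
    return problems


for pattern, message in invalid_patterns([r"\d+", r"(unclosed", r"[a-z]*"]):
    print(f"{pattern!r}: {message}")
```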

- Many of the functions that TatSu defines for its own use are useful in
other contexts. Some examples are:

from tatsu.safeeval import is_eval_safe
from tatsu.safeeval import hasshable
from tatsu.safeeval import make_hashable
from tatsu.util import safe_name
from tatsu.util.misc import find_from_rematch
from tatsu.util.misc import topsort
from tatsu.util.undefined import Undefined
# ...

--
Juancarlo Añez
mailto:apalala@gmail.com
--- Synchronet 3.21b-Linux NewsLink 1.2

From Juancarlo Añez <apalala@gmail.com> to comp.lang.python.announce on Wed May 6 12:48:25 2026

TatSu is a tool that takes grammars in a superset of EBNF as input, and outputs memoizing (Packrat) PEG parsers in Python. The classic variations
of EBNF (Tomassetti, EasyExtend, Wirth) and ISO EBNF are also supported as input grammar formats.

* https://pypi.org/project/TatSu/
* https://github.com/neogeny/TatSu
* https://tatsu.readthedocs.io/

v5.19.0
----------

- The $-> (EOL) expression was introduced in the grammar language to
match and consume the whitespace up to and including the next line
break, using the Python semantics of os.linesep. The match interprets
whitespace using the Python definition as implemented by
str.isspace(), so beware when a particular definition of whitespace is
part of the language to parse.
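
The behavior described above can be approximated in plain Python to see which characters `str.isspace()` accepts. This is a sketch under stated assumptions, not TatSu's implementation: the `match_eol` function is hypothetical, it treats only "\n", "\r", and "\r\n" as line breaks, and it fails at the end of the text.

```python
from typing import Optional


def match_eol(text: str, pos: int) -> Optional[int]:
    """Return the position just after the next line break, or None.

    Only whitespace, as defined by str.isspace(), may precede the break.
    """
    i = pos
    while i < len(text) and text[i].isspace() and text[i] not in "\r\n":
        i += 1
    if text.startswith("\r\n", i):
        return i + 2
    if i < len(text) and text[i] in "\r\n":
        return i + 1
    return None


print(match_eol("a  \nb", 1))        # 4
print(match_eol("a \xa0\r\nb", 1))   # 5: the no-break space counts as whitespace
print(match_eol("a b", 1))           # None: 'b' comes before any line break
```

The second call shows why the caveat matters: a no-break space is whitespace for `str.isspace()`, which may not match the definition of whitespace in the language being parsed.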

- The @nostak decorator for rules was added to the grammar. The setting
hints the tracer and error handler that the rule should not be part of
the call stack. The setting is useful to avoid noise in traces when
low-level rules (like those for qualified or attributed identifiers)
form their own small hierarchy.

- The file extension for TatSu grammars is now .ebnf. The grammar
language is, after all, an extension of the best-known forms of EBNF
syntax. Syntax highlighters may recognize the extension.

- The benchmark in tatsu.tool.bench was used over several large grammars
and large input sets to evaluate parser strategies. The result is that
there is a 1.3x performance advantage in generating a Python program
versus using the in-memory model of the parsed TatSu grammar for
parsing. In tests with complex projects (Java) the performance
difference is not perceivable. The codspeed benchmark that runs with
unit tests on GitHub doesn't see the performance difference either.

Now TatSu uses for bootstrap a module that loads its own grammar model
as the main parser (the one used by tatsu.compile()). The previous
kind of parser can still be generated with
tatsu.to_python_sourcecode(), which remains well tested in several
unit tests. The new model-based kind of parser can be generated with
tatsu.to_parsermodel_sourcecode().

Note that you don't need to generate any source code for a parser in
your own projects. TatSu does generate a module to make it faster to
bootstrap a parser from its own grammar. In your projects you can run
the usual steps to have a performant parser:

import tatsu

grammartext = ...

model = tatsu.compile(grammartext, asmodel=True)
output = model.parse(input)

Generating a module with classes for the type definitions in the
grammar is still useful.

from pathlib import Path
import tatsu

grammartext = ...

sourcecode = tatsu.to_python_model(grammartext)
Path('./modelclases.py').write_text(sourcecode)

- Optimizations in the parser logic produce parsing speeds comparable to
those of TatSu v5.16 with any parsing strategy (model or generated
code).

- The old parser and model generator modules in tatsu.codegen have been
deleted. Using pyrefly revealed that they are both incorrect and
non-working. Their defunctness was caused by the lack of unit tests
and their lack of use since tatsu.ngcodegen was introduced several
years ago. The helper modules codegen.cgbase and codegen.rendering
remain in case any old projects use them for their own code
generation.

- The g2e example in ./examples/g2e was removed. The example had become
irrelevant now that the new PEG parser in Python uses a pegen-style
grammar for the language that is fewer than 1000 lines long. The
TatSu grammar for ANTLR in ./examples/g2e/antlr.tatsu can still parse
ANTLR grammars, but there's no test case for it. The semantics in
g2e.semantics.ANTLRSemantics try to do everything in a single pass
(like substituting simple TOKEN rules by their value), when
transformation of the parsed input grammar model should be more stable
and easier to understand with a simpler approach.

- There's no longer a separate stack for the state of cut. The state of
cut is kept in the general state stack.

- A new @statescope context manager takes care of handling the state
stack in most cases.
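
A context manager that pushes a state on entry and pops it on exit is a common way to keep a state stack balanced. The sketch below shows that pattern with `contextlib`; the `Engine` class, its `states` list, and the shape of the state dict are hypothetical and do not describe TatSu's internals.

```python
from contextlib import contextmanager


class Engine:
    """Hypothetical parser context holding a stack of state dicts."""

    def __init__(self):
        # The cut flag lives in the general state, not in its own stack.
        self.states = [{"cut": False}]

    @property
    def state(self):
        return self.states[-1]

    @contextmanager
    def statescope(self):
        # Push a fresh state on entry and always pop it on exit,
        # even when the body raises (for example, on a failed parse).
        self.states.append({"cut": False})
        try:
            yield self.state
        finally:
            self.states.pop()


engine = Engine()
with engine.statescope() as scoped:
    scoped["cut"] = True
print(engine.state)  # {'cut': False}
```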

- Lookaheads are always memoized. Configuration settings for disabling
it have been deprecated and disabled.

- A new ParserConfig.perlinememos: float configuration sets a
(perlinememos * linecount) bound on the total number of memoization
entries that are allowed on each parse.
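
The arithmetic of the bound can be illustrated with a small memo table. This is a sketch only: the `BoundedMemo` class is hypothetical, the line count is taken as the number of newlines plus one, and new entries are simply dropped once the table is full, which may differ from what TatSu does when the bound is reached.

```python
class BoundedMemo:
    """Hypothetical memo table capped at perlinememos * linecount entries."""

    def __init__(self, text: str, perlinememos: float):
        linecount = text.count("\n") + 1
        self.limit = int(perlinememos * linecount)
        self.entries = {}

    def store(self, key, value) -> bool:
        """Store a result unless the table is full; report whether it was kept."""
        if key not in self.entries and len(self.entries) >= self.limit:
            return False
        self.entries[key] = value
        return True


memo = BoundedMemo("first line\nsecond line", perlinememos=1.5)
print(memo.limit)  # 3, that is int(1.5 * 2)
```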

- Incorporated zuban to the set of type linters.

- Introduced objectmodel.ctx.CanParse(Protocol) defining the parse()
method for entry point to parsing.
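
A protocol of this kind lets any object with a suitable parse() method be used as a parser without inheriting from a common base class. The sketch below uses `typing.Protocol` from the standard library. The `CanParse` class here is a stand-in with an assumed signature, and `UpperParser` and `run` are hypothetical.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class CanParse(Protocol):
    # Assumed signature; the real protocol may declare other parameters.
    def parse(self, text: str, **settings: Any) -> Any: ...


class UpperParser:
    """Hypothetical class that satisfies the protocol structurally."""

    def parse(self, text: str, **settings: Any) -> Any:
        return text.upper()


def run(parser: CanParse, text: str) -> Any:
    """Accept anything that can parse, regardless of its class hierarchy."""
    return parser.parse(text)


print(run(UpperParser(), "ab"))  # AB
```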

- An important refactoring was done to get rid of the legacy names
"tokenizing" and "tokenizer", which didn't abide by the theory and
practice of parsing. Now the names are tatsu.input, tatsu.input.text, and
tatsu.input.text.Text. The old names are still available as legacy for
backwards compatibility.

- Rule includes (RuleInclude) previously kept an actual copy of the
included rule in the model. To preserve consistent semantics, the only
mentions of Rule in a model are now at the top level, in Grammar.rules
and Grammar.rulemap.

- Grammar models that haven't been compiled from a grammar but instead
loaded from the JSON or Python representations don't need to be
analyzed for left recursion, because the markers of the analysis are
already in the loaded models. A new Grammar.analyzed: bool attribute
was added to quickly check if a grammar model from any source has
already been analyzed.

- Support for #include in grammars has been dropped. It was always a bad
idea. Text-to-text preprocessing doesn't belong in the grammar in part
because it doesn't apply to input sources that are not text, like that
of tokenizers or streams. The class tatsu.input.buffer.Buffer still
has all the infrastructure for supporting C-style or COBOL-style
textual includes, and its definition of BufferCursor honors it. Buffer
keeps track of which file was the source of each line of input,
something essential for good error reporting. During compilation of
grammar text to a Grammar object, the grammar text is the parser's
input, so the Cursor semantics regarding the parsing still apply.

- The CLI tool now has a --json option to produce the JSON version of
the model for a grammar. Re-importing of a JSON model is not yet
implemented in TatSu, but TieXiu uses them successfully as the fast
way to import a TatSu grammar model.

--
Juancarlo Añez
mailto:apalala@gmail.com
--- Synchronet 3.21f-Linux NewsLink 1.2
