Many times I need to parse/decode a text string that comes from an
external system, over a serial bus, MQTT, and so on.
Many times this string has a fixed syntax/layout. In order to parse this
string, I create a custom parser every time, which can be tedious,
cumbersome and error prone.
For example, suppose you have to decode a string from a modem that uses
AT commands. Many answers from the modem have the following schema:
\r\n+<prefix>: <field1>,<field2>\r\n
\r\nOK\r\n
The prefix is known, and the number and type of fields are known too. With
a regex, the parser would be simple.
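For a response matching the schema above, a parser built on POSIX regcomp()/regexec() might look like this minimal sketch. The "+CSQ" prefix and the two integer fields (signal quality report) are used as an illustrative example; the function name is made up:

```c
#include <regex.h>
#include <stdlib.h>

/* Parse a response of the form "\r\n+CSQ: <rssi>,<ber>\r\nOK\r\n".
 * Returns 0 on success and fills in the two integer fields. */
int parse_csq(const char *resp, int *rssi, int *ber)
{
    regex_t re;
    regmatch_t m[3];    /* whole match plus two capture groups */

    if (regcomp(&re, "\\+CSQ: ([0-9]+),([0-9]+)", REG_EXTENDED) != 0)
        return -1;

    int rc = regexec(&re, resp, 3, m, 0);
    if (rc == 0) {
        /* rm_so is the offset of each capture group in resp */
        *rssi = atoi(resp + m[1].rm_so);
        *ber  = atoi(resp + m[2].rm_so);
    }
    regfree(&re);
    return rc == 0 ? 0 : -1;
}
```

Whether a full regex engine is appropriate on a small embedded target is a separate question, as the rest of the thread discusses.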
On 21/01/2025 17:03, Stefan Reuther wrote:
On 21.01.2025 15:31, pozz wrote:
Many times I need to parse/decode a text string that comes from an[...]
external system, over a serial bus, MQTT, and so on.
Many times this string has a fixed syntax/layout. In order to parse this
string, I create a custom parser every time, which can be tedious,
cumbersome and error prone.
I don't see a question in this posting,
The hidden question was whether there's a better approach than handcrafted
parsers.
but isn't this the task that
'lex' is intended to be used for?
I will look at it.
(Personally, I have no problem with handcrafted parsers.)
Stefan
On Tue, 21 Jan 2025 18:03:48 +0100, pozz <pozzugno@gmail.com> wrote:
On 21/01/2025 17:03, Stefan Reuther wrote:
On 21.01.2025 15:31, pozz wrote:
Many times I need to parse/decode a text string that comes from an[...]
external system, over a serial bus, MQTT, and so on.
Many times this string has a fixed syntax/layout. In order to parse this
string, I create a custom parser every time, which can be tedious,
cumbersome and error prone.
I don't see a question in this posting,
The hidden question was whether there's a better approach than handcrafted
parsers.
but isn't this the task that
'lex' is intended to be used for?
I will look at it.
(Personally, I have no problem with handcrafted parsers.)
So long as they are correct 8-)
Stefan
Lex and Flex create table driven lexers (and driver code for them).
Under certain circumstances Flex can create far smaller tables than
Lex, but likely either would be massive overkill for the scenario you described.
Minding David's warnings about lexer size, if you really want to try
using regex, I would recommend RE2C. RE2C is a preprocessor that
generates simple recursive code to directly implement matching of
regex strings in your code. There are versions available for several languages.
https://re2c.org/
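To give a taste of what RE2C input looks like, here is a minimal sketch. It is preprocessor input, not compilable C on its own (run it through `re2c` first), and the "+CSQ"/"+CREG" prefixes are just illustrative AT-response examples:

```c
/* classify.re -- run through: re2c classify.re -o classify.c */
int classify(const char *YYCURSOR)
{
    const char *YYMARKER;
    /*!re2c
        re2c:define:YYCTYPE = char;
        re2c:yyfill:enable  = 0;

        "+CSQ:"  { return 1; }
        "+CREG:" { return 2; }
        *        { return 0; }
    */
}
```

RE2C expands the `/*!re2c ... */` block into direct-coded character comparisons, so there are no tables and no runtime regex engine in the generated file.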
On 22.01.2025 01:38, George Neuner wrote:
On Tue, 21 Jan 2025 18:03:48 +0100, pozz <pozzugno@gmail.com> wrote:
(Personally, I have no problem with handcrafted parsers.)
So long as they are correct 8-)
Correctness has an inverse correlation with complexity, so optimize for
non-complexity.
I would implement a two-stage parser: first break the lines into a
buffer, then throw a bunch of statements like
if (Parser p(str); p.matchString("+")
&& p.matchTextUntil(":", &prefix)
&& p.matchWhitespace() ...)
at this, with Parser being a small C++ class wrapping the individual
matching operations (strncmp, strspn, etc.)
Surely this is more complex than a regex/template, but still easy enough
to be "obviously correct".
Lex and Flex create table driven lexers (and driver code for them).
Under certain circumstances Flex can create far smaller tables than
Lex, but likely either would be massive overkill for the scenario you
described.
Maybe, maybe not. I find it hard to extrapolate to the complete task
from the two examples given. If there's hundreds of these templates,
that need to be matched bit-by-bit, I have the impression that lex would
be a quick and easy way to pull them out of a byte stream.
But splitting it into lines first, and then tackling each line on its
own (...using lex, maybe? Or any other tool. Or a parser class.) might
be a good option as well. For example, this can answer the question
whether linefeeds are required to be \r\n, or whether a single \n also
suffices, in a central place. And if you decide that you want to do a
hard connection close if you see a \r or \n outside a \r\n sequence (to
prevent an attack such as SMTP smuggling), that would be easy.
Stefan
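The line-first stage described above might look like this minimal sketch. The buffer size and the policy of accepting a bare '\n' as a line terminator are assumptions, made in exactly one place as suggested:

```c
#include <stddef.h>

/* Feed bytes one at a time; returns 1 when a complete line is in buf
 * (NUL-terminated, line ending stripped). '\r' is swallowed and '\n'
 * decides end-of-line, so both \r\n and bare \n terminate a line. */
typedef struct {
    char   buf[128];
    size_t len;
} line_buf;

int feed_byte(line_buf *lb, char c)
{
    if (c == '\r')
        return 0;
    if (c == '\n') {
        lb->buf[lb->len] = '\0';
        lb->len = 0;
        return 1;
    }
    if (lb->len < sizeof lb->buf - 1)   /* silently truncate overlong lines */
        lb->buf[lb->len++] = c;
    return 0;
}
```

The empty lines produced by the leading "\r\n" of modem responses can simply be skipped by the caller; each non-empty line then goes to a per-line matcher.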
On 22/01/2025 01:38, George Neuner wrote:
On Tue, 21 Jan 2025 18:03:48 +0100, pozz <pozzugno@gmail.com> wrote:
On 21/01/2025 17:03, Stefan Reuther wrote:
On 21.01.2025 15:31, pozz wrote:
Many times I need to parse/decode a text string that comes from an[...]
external system, over a serial bus, MQTT, and so on.
Many times this string has a fixed syntax/layout. In order to parse this
string, I create a custom parser every time, which can be tedious,
cumbersome and error prone.
I don't see a question in this posting,
The hidden question was whether there's a better approach than handcrafted
parsers.
but isn't this the task that
'lex' is intended to be used for?
I will look at it.
(Personally, I have no problem with handcrafted parsers.)
So long as they are correct 8-)
This is vital. You want a /lot/ of test cases to check the algorithm.
Stefan
Lex and Flex create table driven lexers (and driver code for them).
Under certain circumstances Flex can create far smaller tables than
Lex, but likely either would be massive overkill for the scenario you
described.
Minding David's warnings about lexer size, if you really want to try
using regex, I would recommend RE2C. RE2C is a preprocessor that
generates simple recursive code to directly implement matching of
regex strings in your code. There are versions available for several
languages.
https://re2c.org/
The "best" solution depends on the OP's knowledge, the variety of the >patterns needed, the resources of the target system, and restrictions on >things like programming language support. For example, the C++ template >based project I suggested earlier (which I have not tried myself) should
give quite efficient results, but it requires a modern C++ compiler.
I think if the OP is only looking for a few patterns, or styles of
pattern, then regexes and powerful code generator systems are overkill.
It will take more work to learn and understand them, and code generated
by tools like lex and flex is not designed to be human-friendly, nor is
it likely to match well with coding standards for small embedded systems.
I'd probably just have a series of matcher functions for different parts
(fixed string, numeric field as integer, flag field as boolean, etc.)
and have manual parsers for the different types. As a C++ user I'd be
returning std::optional<> types here and using the new "and_then"
methods to give neat chains, but a C programmer might want to pass a
pointer to a value variable and return "bool" for success. If I had a
lot of such patterns to match, then I might use templates for generating
the higher level matchers - for C, it would be either a macro system or
an external Python script.
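The C variant described above (value via out-pointer, bool for success) chains naturally with &&. A minimal sketch, with illustrative names and the +CSQ layout from earlier in the thread as the example:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Each matcher advances *s past what it consumed and returns false
 * on mismatch, leaving earlier results untouched on failure. */
static bool match_lit(const char **s, const char *lit)
{
    size_t n = strlen(lit);
    if (strncmp(*s, lit, n) != 0) return false;
    *s += n;
    return true;
}

static bool match_int(const char **s, int *out)
{
    char *end;
    long v = strtol(*s, &end, 10);
    if (end == *s) return false;    /* no digits consumed */
    *out = (int)v;
    *s = end;
    return true;
}

/* Higher-level matcher for one response line, built from the pieces. */
bool parse_csq_line(const char *s, int *rssi, int *ber)
{
    return match_lit(&s, "+CSQ: ")
        && match_int(&s, rssi)
        && match_lit(&s, ",")
        && match_int(&s, ber);
}
```

Because && short-circuits, the chain stops at the first mismatch, which keeps each higher-level matcher a one-expression description of the line's layout.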
Or just use sscanf() :-)
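For the fixed-layout example at the start of the thread, the sscanf() route really is short; a sketch, with the +CSQ layout again as an assumed example:

```c
#include <stdio.h>

/* Parse "\r\n+CSQ: <rssi>,<ber>\r\n..." with sscanf alone. The leading
 * space in the format string skips the \r\n whitespace before the prefix.
 * Returns nonzero if both fields were converted. */
int parse_csq_scanf(const char *resp, int *rssi, int *ber)
{
    return sscanf(resp, " +CSQ: %d,%d", rssi, ber) == 2;
}
```

The usual caveats apply: sscanf gives little control over overflow or trailing garbage, so it suits trusted, fixed-format input better than hostile byte streams.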