• A simpler way to tokenize and parse?

    From Roger L Costello@costello@mitre.org to comp.compilers on Fri Mar 24 14:45:40 2023
    From Newsgroup: comp.compilers

    Hello Compiler Experts!

    I am reading the book, "Programming Languages, Application and Interpretation" by Shriram Krishnamurthi.

    The book says that Lisp and Scheme have a primitive called "read".

    The book says, "The read primitive is a crown jewel of Lisp and Scheme."

    Some of my notes from reading the book:

    - Read does tokenizing and reading.
    - Read returns a value known as an s-expression.
    - The s-expression is an intermediate representation.
    - The output of read is either a number or a list. That's it!

    Example of tokenizing/parsing using read:

    (+ 3 4) --> read --> (list `+ 3 4) --> parse --> (add (num 3) (num 4))

    The first expression (+ 3 4) is the concrete syntax.
    The middle expression (list `+ 3 4) is an s-expression. It is an intermediate representation.
    The last expression (add (num 3) (num 4)) is the abstract syntax.

    The book says: read is one of the great ideas of computer science. It helps decompose a fundamentally difficult process - generalized parsing of the input stream - into two simple processes:

    (1) reading the input stream into an intermediate representation
    (2) parsing that intermediate representation
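
    The decomposition above can be sketched in a few lines of Python (an
    illustrative toy of my own, not code from the book): a read that turns
    characters into nested lists, and a parse that walks those lists.

```python
# Sketch of the read/parse split: read knows only parentheses, numbers
# and symbols; parse is the step that assigns meaning.

def read(src):
    """Tokenize and read one s-expression from a string."""
    tokens = src.replace("(", " ( ").replace(")", " ) ").split()
    def read_from(pos):
        tok = tokens[pos]
        if tok == "(":
            lst = []
            pos += 1
            while tokens[pos] != ")":
                item, pos = read_from(pos)
                lst.append(item)
            return lst, pos + 1  # skip the closing ")"
        try:
            return int(tok), pos + 1   # a number...
        except ValueError:
            return tok, pos + 1        # ...or a symbol
    expr, _ = read_from(0)
    return expr

def parse(sexp):
    """Turn the s-expression into abstract syntax."""
    if isinstance(sexp, int):
        return ("num", sexp)
    op, left, right = sexp
    assert op == "+"
    return ("add", parse(left), parse(right))

print(read("(+ 3 4)"))          # ['+', 3, 4]
print(parse(read("(+ 3 4)")))   # ('add', ('num', 3), ('num', 4))
```

    Note that read never needs to know what + means; only parse assigns
    meaning, which is what keeps each of the two steps simple.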

    I've read several compiler books and none of them talked about this. They talk about creating a lexer to generate a stream of tokens and a parser that receives the tokens and arranges them into a tree data structure. Why no mention of the "crown jewel" of tokenizing/parsing? Why no mention of "one of the great ideas of computer science"?

    I have done some work with Flex and Bison and recently I've done some
    work with building parsers using read. My experience is that the latter
    is much easier. Why isn't read more widely discussed and used in the
    compiler community?
    Surely the concept that read embodies is not specific to Lisp and Scheme, right?

    /Roger
    [Yes, it's specific to Lisp and Scheme. They have an extremely simple
    syntax called S expressions of nested parenthesized lists of
    space-separated tokens with some quoting. The original plan was that Lisp 2
    would have M expressions that looked more like a normal language but
    it's over 50 years later and they still haven't gotten around to it.
    -John]
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Spiros Bousbouras@spibou@gmail.com to comp.compilers on Sat Mar 25 11:55:35 2023
    From Newsgroup: comp.compilers

    On Fri, 24 Mar 2023 14:45:40 +0000
    Roger L Costello <costello@mitre.org> wrote:
    > Hello Compiler Experts!
    >
    > I am reading the book, "Programming Languages, Application and
    > Interpretation" by Shriram Krishnamurthi.
    >
    > The book says that Lisp and Scheme have a primitive called "read".
    >
    > The book says, "The read primitive is a crown jewel of Lisp and Scheme."
    >
    > Some of my notes from reading the book:
    >
    > - Read does tokenizing and reading.
    > - Read returns a value known as an s-expression.
    > - The s-expression is an intermediate representation.
    > - The output of read is either a number or a list. That's it!

    For educational examples perhaps it's only a number or a list. But for
    real world usage it has to be more. In Common Lisp it can be a string or
    a symbol (an identifier, more or less) or a vector or a number of other
    Common Lisp types. If the programmer defines his own classes (which then
    count as new types), syntax is automatically created to be able to read
    and return such objects too.

    [...]

    > I've read several compiler books and none of them talked about this.
    > They talk about creating a lexer to generate a stream of tokens and a
    > parser that receives the tokens and arranges them into a tree data
    > structure. Why no mention of the "crown jewel" of tokenizing/parsing?
    > Why no mention of "one of the great ideas of computer science"?

    > I have done some work with Flex and Bison and recently I've done some
    > work with building parsers using read. My experience is that the latter
    > is much easier. Why isn't read more widely discussed and used in the
    > compiler community?

    Probably because there really isn't much to say. S-expressions are
    straightforward to parse, so if they work for your needs then you don't
    need to read any compiler books about it.

    > Surely the concept that read embodies is not specific to Lisp and
    > Scheme, right?

    It is specific to when you have a very simple and uniform syntax, and
    experience suggests that this isn't to most people's taste. Whether it
    is a result of "mental wiring" or of tradition (including mathematical
    tradition) to which one gets exposed from a young age, I don't know.
    What I mean by this is that most people seem to find it easier to read

    a + b * c

    as opposed to

    (+ a (* b c))

    and I don't know if this is just the result of early exposure or an
    inherent part of how most humans' brains function.

    Another issue is that sometimes people have to turn mathematical
    notation into computer programs, and it is an extra step to transform
    a + b * c into (+ a (* b c)), regardless of which one finds easier to
    read in isolation.
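
    For what it's worth, that extra step is mechanical. A small
    precedence-climbing sketch in Python (my own illustration, with
    hypothetical helper names, not from any post here) that turns the infix
    token list into the prefix form:

```python
# Convert infix tokens to a prefix tree using operator precedence.
# Parentheses in the input are not handled in this sketch.

PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def to_prefix(tokens, min_prec=1):
    """Precedence-climbing parse of an infix token list into a tree."""
    left = tokens.pop(0)  # an operand
    while tokens and PREC.get(tokens[0], 0) >= min_prec:
        op = tokens.pop(0)
        right = to_prefix(tokens, PREC[op] + 1)  # bind tighter on the right
        left = [op, left, right]
    return left

def show(tree):
    """Print the tree in Lisp notation."""
    if isinstance(tree, list):
        return "(" + " ".join(show(t) for t in tree) + ")"
    return tree

print(show(to_prefix("a + b * c".split())))  # (+ a (* b c))
```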


    In mathematical logic the formal syntax also specifies a uniform and
    simple notation, usually based on parentheses, but even there authors
    immediately introduce conventions about operator precedence so that you
    don't have to read (and they don't have to type) so many parentheses.
    So what in formal syntax would be, for example, ((A reo B) raA C)
    becomes A reo B raA C, where reo is specified to have higher precedence
    than raA.


    I note that Forth also has a very simple and uniform syntax and again Forth isn't very popular.

    /Roger

    Moderator wrote:
    > [Yes, it's specific to Lisp and Scheme. They have an extremely simple
    > syntax called S expressions of nested parenthesized lists of space
    > separated tokens with some quoting. The original plan was that Lisp 2
    > would have M expressions that looked more like a normal language but
    > it's over 50 years later and they still haven't gotten around to it.

    Actually Common Lisp has gotten around to it. I have seen Common Lisp
    libraries which create a more conventional syntax for Common Lisp and
    even claim to retain the power of writing macros. But I've never paid
    much attention because the usual Common Lisp syntax works fine for me.
    So I can't provide links; perhaps someone on comp.lang.lisp will know.
    I don't think that such libraries have seen much if any use. From what
    I recall, even their authors did not claim that they prefer the
    different syntax but they were simply hoping that with a more
    conventional syntactic wrapper Common Lisp (or some Lisp) would become
    more popular; or perhaps they saw it as an interesting intellectual
    exercise.

    --
    vlaho.ninja/prog
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.compilers on Sat Mar 25 13:14:31 2023
    From Newsgroup: comp.compilers

    [[...] The original plan was that Lisp 2
    would have M expressions that looked more like a normal language but
    it's over 50 years later and they still haven't gotten around to it.
    -John]

    Actually they have. Some HOPL paper (or several of them) discuss
    this: There were several attempts at an Algol-like syntax, but Lisp
    programmers found that they preferred programming in S-Expressions
    over the Algol-like syntax, whether it's M-Expressions, Dylan syntax,
    or several other attempts.

    The only language which might be considered a success at having an
    Algol-like syntax in something similar to Lisp is JavaScript. Maybe
    this is just because JavaScript is far enough from Lisp not just in
    syntax, and there is no S-expression syntax for JavaScript, is there?

    - anton
    --
    M. Anton Ertl
    anton@mips.complang.tuwien.ac.at
    http://www.complang.tuwien.ac.at/anton/
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Lieven Marchand@mal@wyrd.be to comp.compilers on Sat Mar 25 19:58:58 2023
    From Newsgroup: comp.compilers

    Roger L Costello <costello@mitre.org> writes:

    > I have done some work with Flex and Bison and recently I've done some
    > work with building parsers using read. My experience is that the
    > latter is much easier. Why isn't read more widely discussed and used
    > in the compiler community? Surely the concept that read embodies is
    > not specific to Lisp and Scheme, right?

    Apart from the already mentioned problem that it forces you into a
    syntax that a lot of people don't like, there's also the problem that
    you have to deal with hostile input. Where you expect "(+ 2 3)" someone
    will enter "(+ 2 3 #.(progn (launch-the-nukes) 4))". A lot of security
    problems in real world settings come from not correctly validating
    inputs and by the time you have worked around all these problems read
    isn't all that easy anymore. C for example has a somewhat similar
    facility scanf that tries to pattern match input and is also considered
    unsafe. A good rule of thumb for production ready software is to define
    a grammar for valid input and provide a validating parser.

    --
    Laat hulle almal sterf. Ek is tevrede om die wêreld te sien brand en
    die vallende konings te spot. Ek en my aasdier sal loop op die as van
    die verwoeste aarde.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From gah4@gah4@u.washington.edu to comp.compilers on Sat Mar 25 14:32:18 2023
    From Newsgroup: comp.compilers

    On Saturday, March 25, 2023 at 7:48:45 AM UTC-7, Spiros Bousbouras wrote:

    (snip)

    > It is specific to when you have a very simple and uniform syntax and
    > experience suggests that this isn't to most people's taste. Whether
    > it is a result of "mental wiring" or tradition (including
    > mathematical tradition) to which one gets exposed from a young age,
    > I don't know. What I mean by this is that most people seem to find it
    > easier to read
    > a + b * c
    > as opposed to
    > (+ a (* b c))
    >
    > and I don't know if this is just the result of early exposure or an
    > inherent part of how most humans' brains function.

    Years ago, when there was actual competition between TI and HP
    for calculator sales, there was much discussion on the advantages
    of HP's RPN (postfix notation) vs. TI's algebraic (infix notation).

    That seems to have gone away now, and is not discussed much.

    It always seemed to me that it was easier to think in postfix
    terms, but easier to write in infix notation. You won't
    find algebra or calculus books writing expressions in prefix or
    postfix form.

    No idea about mental wiring or being exposed at a young age.

    I wonder about studies of young(er) kids learning to use
    calculators or in teaching math to younger kids.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Kaz Kylheku@864-117-4973@kylheku.com to comp.compilers on Sun Mar 26 00:46:40 2023
    From Newsgroup: comp.compilers

    On 2023-03-25, Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    > [[...] The original plan was that Lisp 2
    > would have M expressions that looked more like a normal language but
    > it's over 50 years later and they still haven't gotten around to it.
    > -John]
    >
    > Actually they have. Some HOPL paper (or several of them) discuss
    > this: There were several attempts at an Algol-like syntax, but Lisp
    > programmers found that they preferred programming in S-Expressions
    > over the Algol-like syntax, whether it's M-Expressions, Dylan syntax,
    > or several other attempts.

    The situation is more nuanced. Common Lisp has programmable read tables
    which let you have any surface syntax, and this is used. It's just not
    the predominant mode of writing the bulk of the code. For instance, the cl-interpol library provides string syntax with interpolation.

    There exists an open source module for Common Lisp called
    named-readtables which provides disciplined registration for managing
    multiple read-tables. If a developer wants to mix multiple
    read-table-based syntaxes in the same source, they can clash.

    Racket is a popular language based on Scheme, which also has
    programmable syntax. A Racket source file can begin with a #lang
    directive which indicates which language module is being used;
    that syntax then applies to the rest of the file. I have the
    impression that this is fairly widely used in the Racket world.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Kaz Kylheku@864-117-4973@kylheku.com to comp.compilers on Sun Mar 26 01:17:40 2023
    From Newsgroup: comp.compilers

    On 2023-03-24, Roger L Costello <costello@mitre.org> wrote:
    > Example of tokenizing/parsing using read:
    >
    > (+ 3 4) --> read --> (list `+ 3 4) --> parse --> (add (num 3) (num 4))

    You've not quite hit upon how it works, and I'd encourage you to keep exploring.

    Read takes the seven characters (+ 3 4) and returns an object
    which stands for the same thing. When Lisp programmers discuss
    that object, they refer to it using the same notation (+ 3 4).

    Actual copy-paste from a Lisp session:

    [1]> (read-from-string "(+ 3 4)")
    (+ 3 4) ;
    7

    The second return value of read-from-string, 7, isn't the
    value of the expression; it's the position of the first
    character of the string which was not read. Our expression
    is seven characters long.

    > The first expression (+ 3 4) is the concrete syntax.
    > The middle expression (list `+ 3 4) is an s-expression. It is an
    > intermediate representation.

    "S-expression" actually refers to the character syntax. The object
    in memory is just an expression.

    The reader in Lisps like Scheme and Common Lisp perpetrates no such
    embellishment. The symbol "list" and quotation around the + will not
    appear from reading "(+ 3 4)". You get a three-element list, made up out
    of three cons cells (pair-like objects), whose elements are strictly
    those that are implied by the read syntax: the + symbol and the two
    numbers.

    > The last expression (add (num 3) (num 4)) is the abstract syntax.

    No such thing is user-visible in any mainstream Lisp. Lisp interpreters directly evaluate the (+ 3 4) object.

    Lisp compilers potentially build some annotated syntax tree, but
    this is not a documented feature of any Lisp that I know; it will be
    an internal matter.

    Compiling the raw (+ 3 4) form is perfectly possible.

    > The book says: read is one of the great ideas of computer science. It
    > helps decompose a fundamentally difficult process - generalized
    > parsing of the input stream - into two simple processes:
    >
    > (1) reading the input stream into an intermediate representation
    > (2) parsing that intermediate representation

    The bigger idea in Lisp is actually "print-read consistency": that
    objects have a printed notation that the machine can produce, which the
    machine can read to reproduce a similar object.

    Not all objects have print-read consistency in Lisp, but things are
    usually strict in the mature Lisp dialects. If something doesn't have
    print-read consistency, it will print in an unreadable form that
    signals an error if you try to read it back.

    In Common Lisp, the character sequence #< (sharpsign less-than),
    in the standard read-table, signals an error. Objects which
    don't have a printed notation that can be read can use that
    syntax, e.g. #<socket-handle 10.1.2.3:8080>.
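
    A rough analog outside Lisp, purely for illustration (my example,
    using Python's standard library, not anything from this post): repr
    gives a printed notation, ast.literal_eval reads it back, and objects
    without a readable notation print in a <...> form that errors when
    read, much like #<...>.

```python
import ast

# Print-read consistency: the printed form reads back to an equal object.
obj = ["+", 3, 4]
printed = repr(obj)                      # "['+', 3, 4]"
assert ast.literal_eval(printed) == obj  # round trip succeeds

# Objects with no readable notation print in a "<...>" form that cannot
# be read back, in the same spirit as Common Lisp's #<...> syntax:
try:
    ast.literal_eval(repr(open))  # repr(open) is "<built-in function open>"
    print("unexpectedly readable")
except (ValueError, SyntaxError):
    print("unreadable, like #<...>")
```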

    > I've read several compiler books and none of them talked about this.
    > They talk about creating a lexer to generate a stream of tokens and a
    > parser that receives the tokens and arranges them into a tree data
    > structure. Why no mention of the "crown jewel" of tokenizing/parsing?
    > Why no mention of "one of the great ideas of computer science"?

    It's because we are not in a branch of the parallel universe in which a
    lot of people know about and program in Lisp.

    The Lisp microcosm has a lot to say on many topics, but computing
    is largely ignorant of it.

    > I have done some work with Flex and Bison and recently I've done some
    > work with building parsers using read. My experience is that the
    > latter is much easier. Why isn't read more widely discussed and used
    > in the compiler community? Surely the concept that read embodies is
    > not specific to Lisp and Scheme, right?

    S-expressions do crop up outside of Lisp.

    The IMAP4 protocol uses them.

    The GNU C compiler uses a form of S-expression internally.
    Look up RTL:

    https://gcc.gnu.org/onlinedocs/gccint/RTL.html#RTL

    The Rational Rose object design tool stores files in an S-expression
    format called Petal.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Spiros Bousbouras@spibou@gmail.com to comp.compilers on Sun Mar 26 14:10:14 2023
    From Newsgroup: comp.compilers

    On Sat, 25 Mar 2023 19:58:58 +0100
    Lieven Marchand <mal@wyrd.be> wrote:
    > Apart from the already mentioned problem that it forces you into a
    > syntax that a lot of people don't like, there's also the problem that
    > you have to deal with hostile input. Where you expect "(+ 2 3)"
    > someone will enter "(+ 2 3 #.(progn (launch-the-nukes) 4))".

    The ability to read S-expressions does not imply that you must also have
    a "read and evaluate" facility. In any case, in Common Lisp you can
    trivially disable it by setting *READ-EVAL* to NIL.
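
    For readers outside Lisp, the same separation of reading from
    evaluation exists elsewhere. For example, Python's ast.literal_eval
    (my analogy, not from the post) reads data syntax but rejects anything
    that would execute code:

```python
import ast

# Plain data reads fine:
assert ast.literal_eval("(2, 3)") == (2, 3)

# Anything that would run code is rejected by the reader itself,
# in the spirit of reading with *READ-EVAL* bound to NIL:
try:
    ast.literal_eval("__import__('os').system('launch-the-nukes')")
except ValueError:
    print("rejected at read time")
```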

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Christopher F Clark@christopher.f.clark@compiler-resources.com to comp.compilers on Sun Mar 26 17:11:14 2023
    From Newsgroup: comp.compilers

    If I recall correctly, at one time, PCCTS and its related tool
    Sorcerer used an S-expression-like representation for ASTs, including
    an equivalent to CAR and CDR, whose names I don't recall but they were
    something like "down" and "right".

    More recent versions of ANTLR don't expose that as far as I have seen,
    preferring to use a Visitor pattern for traversals.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Kaz Kylheku@864-117-4973@kylheku.com to comp.compilers on Sun Mar 26 18:19:23 2023
    From Newsgroup: comp.compilers

    On 2023-03-25, Lieven Marchand <mal@wyrd.be> wrote:
    > Roger L Costello <costello@mitre.org> writes:
    >
    >> I have done some work with Flex and Bison and recently I've done
    >> some work with building parsers using read. My experience is that
    >> the latter is much easier. Why isn't read more widely discussed and
    >> used in the compiler community? Surely the concept that read
    >> embodies is not specific to Lisp and Scheme, right?
    >
    > Apart from the already mentioned problem that it forces you into a
    > syntax that a lot of people don't like, there's also the problem that
    > you have to deal with hostile input. Where you expect "(+ 2 3)"
    > someone will enter "(+ 2 3 #.(progn (launch-the-nukes) 4))". A lot
    > of security

    Not every Lisp dialect has hash-dot read-time evaluation; that's
    a feature of Common Lisp, disabled by setting/binding *read-eval*
    to nil. I don't seem to recall that Scheme has it. I deliberately
    kept it out of TXR Lisp.

    However, that doesn't disable compile-time evaluation in macros,
    which kicks in if you feed the read code to the compile function.
    compile must be regarded the same as eval from a security POV.
    We are seeing compile-time evaluation in newer languages,
    though.

    It's not a bona-fide security issue, except in applications that
    dynamically compile untrusted input, since the aim is almost always
    to execute it; whether the malice happens at compile time or run
    time, both have to be sandboxed.

    When you're building an open-source program, it's a given that
    you're running its code: shell scripts, make files or what
    have you. It doesn't need read-time evaluation to perpetrate
    malice.

    > problems in real world settings come from not correctly validating
    > inputs and by the time you have worked around all these problems read
    > isn't all that easy anymore. C for example has a somewhat similar
    > facility scanf that tries to pattern match input and is also
    > considered unsafe.

    scanf is unsafe, but not in the way that hash-dot read-time evaluation
    is unsafe. The situations are not comparable.

    scanf doesn't feature a documented, reliable scan-time programming language
    that the *input* can use to extend the program which calls scanf,
    and which can be turned off by a flag.

    You have to exploit a buffer overflow or whatever, enabled by careless
    use of scanf.

    All they have in common is that calling read on untrusted input without disabling *read-eval* is a kind of careless use of read.

    But setting *read-eval* to nil is something you may be able to do in
    just one place in the entire application. (Anything which needs
    *read-eval* can opt in using (let ((*read-eval* t)) ...) around its
    calls to read.)

    > A good rule of thumb for production ready software is to define
    > a grammar for valid input and provide a validating parser.

    Sure, if you want to waste your time defining grammars and
    writing validating parsers.

    This is no longer done that much outside of the Lisp world. People use XML, JSON, Yaml, ..., whose grammar they definitely didn't design or
    implement, and validate the content/shape of the object that comes out.
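
    A sketch of that style in Python (my illustration, with hypothetical
    field names): parse with a generic reader whose grammar you didn't
    write, then validate the shape of the object rather than the
    characters.

```python
import json

def load_config(text):
    """Parse with a generic reader, then validate the object's shape."""
    obj = json.loads(text)  # the JSON grammar is not ours to define
    if not isinstance(obj, dict):
        raise ValueError("expected an object")
    if not isinstance(obj.get("host"), str):
        raise ValueError("'host' must be a string")
    if not isinstance(obj.get("port"), int):
        raise ValueError("'port' must be an integer")
    return obj

cfg = load_config('{"host": "example.org", "port": 8080}')
print(cfg["host"])  # example.org
```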

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.compilers on Mon Mar 27 05:08:43 2023
    From Newsgroup: comp.compilers

    Christopher F Clark <christopher.f.clark@compiler-resources.com> schrieb:
    > If I recall correctly, at one time, PCCTS and its related tool
    > Sorcerer used an S-expression-like representation for ASTs, including
    > an equivalent to CAR and CDR, whose names I don't recall but they
    > were something like "down" and "right".

    S-expressions are a fairly natural representation of trees,
    at least for expressions.

    An example: If you use gfortran's -fdump-fortran-original option, you
    will get its internal expression after parsing on standard output.

    For example, the code of

    real function foo(x,y)
    foo = 2.3 + 5.0*x*y
    end function foo

    is represented as

    ASSIGN foo:foo (+ 2.29999995 (* (* 5.00000000 foo:x) foo:y))

    where the internal representation is a tree.
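
    The point is easy to demonstrate (my own sketch, not gfortran code):
    a recursive tree flattens to S-expression notation in a few lines.

```python
def sexp(tree):
    """Print a nested-tuple expression tree in S-expression notation."""
    if isinstance(tree, tuple):
        op, *args = tree
        return "(" + " ".join([op] + [sexp(a) for a in args]) + ")"
    return str(tree)

# foo = 2.3 + 5.0*x*y, as a nested tuple tree:
tree = ("+", 2.29999995, ("*", ("*", 5.0, "foo:x"), "foo:y"))
print(sexp(tree))  # (+ 2.29999995 (* (* 5.0 foo:x) foo:y))
```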

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Lieven Marchand@mal@wyrd.be to comp.compilers on Mon Mar 27 23:15:02 2023
    From Newsgroup: comp.compilers

    Kaz Kylheku <864-117-4973@kylheku.com> writes:

    > Not every Lisp dialect has hash-dot read-time evaluation; that's
    > a feature of Common Lisp, disabled by setting/binding *read-eval*
    > to nil. I don't seem to recall that Scheme has it. I deliberately
    > kept it out of TXR Lisp.

    You also should wrap a use of read in with-standard-io-syntax, and
    even then there can still be surprises in CL. One implementation I
    used still had, by default, the #, reader macro from the CLtL2 era
    that has since disappeared from the standard.

    >> A good rule of thumb for production ready software is to define
    >> a grammar for valid input and provide a validating parser.
    >
    > Sure, if you want to waste your time defining grammars and
    > writing validating parsers.
    >
    > This is no longer done that much outside of the Lisp world. People
    > use XML, JSON, Yaml, ..., whose grammar they definitely didn't design
    > or implement, and validate the content/shape of the object that comes
    > out.

    The Haskell and the Rust world have partially gone to applicative and
    monadic parsing. In such a paradigm it's not that hard to write
    grammars.

    --
    Laat hulle almal sterf. Ek is tevrede om die wêreld te sien brand en
    die vallende konings te spot. Ek en my aasdier sal loop op die as van
    die verwoeste aarde.
    --- Synchronet 3.21b-Linux NewsLink 1.2