• A simpler way to tokenize and parse?

    From Roger L Costello@costello@mitre.org to comp.compilers on Fri Mar 24 14:45:40 2023
    From Newsgroup: comp.compilers

    Hello Compiler Experts!

    I am reading the book, "Programming Languages, Application and Interpretation" by Shriram Krishnamurthi.

    The book says that Lisp and Scheme have a primitive called "read".

    The book says, "The read primitive is a crown jewel of Lisp and Scheme."

    Some of my notes from reading the book:

    - Read does tokenizing and reading.
    - Read returns a value known as an s-expression.
    - The s-expression is an intermediate representation.
    - The output of read is either a number or a list. That's it!

    Example of tokenizing/parsing using read:

    (+ 3 4) --> read --> (list `+ 3 4) --> parse --> (add (num 3) (num 4))

    The first expression (+ 3 4) is the concrete syntax.
    The middle expression (list `+ 3 4) is an s-expression. It is an intermediate representation.
    The last expression (add (num 3) (num 4)) is the abstract syntax.

    The book says: read is one of the great ideas of computer science. It helps decompose a fundamentally difficult process - generalized parsing of the input stream - into two simple processes:

    (1) reading the input stream into an intermediate representation
    (2) parsing that intermediate representation
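
    The decomposition above can be sketched in a few lines of Python (an
    illustrative toy of my own, not code from the book): a read that turns
    characters into nested lists, and a parse that walks those lists.

```python
# Sketch of the read/parse split: read knows only parentheses, numbers
# and symbols; parse is the step that assigns meaning.

def read(src):
    """Tokenize and read one s-expression from a string."""
    tokens = src.replace("(", " ( ").replace(")", " ) ").split()
    def read_from(pos):
        tok = tokens[pos]
        if tok == "(":
            lst = []
            pos += 1
            while tokens[pos] != ")":
                item, pos = read_from(pos)
                lst.append(item)
            return lst, pos + 1  # skip the closing ")"
        try:
            return int(tok), pos + 1   # a number...
        except ValueError:
            return tok, pos + 1        # ...or a symbol
    expr, _ = read_from(0)
    return expr

def parse(sexp):
    """Turn the s-expression into abstract syntax."""
    if isinstance(sexp, int):
        return ("num", sexp)
    op, left, right = sexp
    assert op == "+"
    return ("add", parse(left), parse(right))

print(read("(+ 3 4)"))          # ['+', 3, 4]
print(parse(read("(+ 3 4)")))   # ('add', ('num', 3), ('num', 4))
```

    Note that read never needs to know what + means; only parse assigns
    meaning, which is what keeps each of the two steps simple.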

    I've read several compiler books and none of them talked about this. They talk about creating a lexer to generate a stream of tokens and a parser that receives the tokens and arranges them into a tree data structure. Why no mention of the "crown jewel" of tokenizing/parsing? Why no mention of "one of the great ideas of computer science"?

    I have done some work with Flex and Bison and recently I've done some
    work with building parsers using read. My experience is that the latter
    is much easier. Why isn't read more widely discussed and used in the
    compiler community?
    Surely the concept that read embodies is not specific to Lisp and Scheme, right?

    /Roger
    [Yes, it's specific to Lisp and Scheme. They have an extremely simple
    syntax called S expressions of nested parenthesized lists of
    space-separated tokens with some quoting. The original plan was that Lisp 2
    would have M expressions that looked more like a normal language but
    it's over 50 years later and they still haven't gotten around to it.
    -John]
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Spiros Bousbouras@spibou@gmail.com to comp.compilers on Sat Mar 25 11:55:35 2023
    From Newsgroup: comp.compilers

    On Fri, 24 Mar 2023 14:45:40 +0000
    Roger L Costello <costello@mitre.org> wrote:
    > Hello Compiler Experts!
    >
    > I am reading the book, "Programming Languages, Application and
    > Interpretation" by Shriram Krishnamurthi.
    >
    > The book says that Lisp and Scheme have a primitive called "read".
    >
    > The book says, "The read primitive is a crown jewel of Lisp and Scheme."
    >
    > Some of my notes from reading the book:
    >
    > - Read does tokenizing and reading.
    > - Read returns a value known as an s-expression.
    > - The s-expression is an intermediate representation.
    > - The output of read is either a number or a list. That's it!

    For educational examples perhaps it's only a number or a list. But for
    real world usage it has to be more. In Common Lisp it can be a string or
    a symbol (an identifier, more or less) or a vector or a number of other
    Common Lisp types. If the programmer defines his own classes (which then
    count as new types), syntax is automatically created to be able to read
    and return such objects too.

    [...]

    > I've read several compiler books and none of them talked about this.
    > They talk about creating a lexer to generate a stream of tokens and a
    > parser that receives the tokens and arranges them into a tree data
    > structure. Why no mention of the "crown jewel" of tokenizing/parsing?
    > Why no mention of "one of the great ideas of computer science"?

    > I have done some work with Flex and Bison and recently I've done some
    > work with building parsers using read. My experience is that the latter
    > is much easier. Why isn't read more widely discussed and used in the
    > compiler community?

    Probably because there really isn't much to say. S-expressions are
    straightforward to parse, so if they work for your needs then you don't
    need to read any compiler books about it.

    > Surely the concept that read embodies is not specific to Lisp and
    > Scheme, right?

    It is specific to when you have a very simple and uniform syntax, and
    experience suggests that this isn't to most people's taste. Whether it
    is a result of "mental wiring" or of tradition (including mathematical
    tradition) to which one gets exposed from a young age, I don't know.
    What I mean by this is that most people seem to find it easier to read

    a + b * c

    as opposed to

    (+ a (* b c))

    and I don't know if this is just the result of early exposure or an
    inherent part of how most humans' brains function.

    Another issue is that sometimes people have to turn mathematical
    notation into computer programs, and it is an extra step to transform
    a + b * c into (+ a (* b c)), regardless of which one finds easier to
    read in isolation.
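
    For what it's worth, that extra step is mechanical. A small
    precedence-climbing sketch in Python (my own illustration, with
    hypothetical helper names, not from any post here) that turns the infix
    token list into the prefix form:

```python
# Convert infix tokens to a prefix tree using operator precedence.
# Parentheses in the input are not handled in this sketch.

PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def to_prefix(tokens, min_prec=1):
    """Precedence-climbing parse of an infix token list into a tree."""
    left = tokens.pop(0)  # an operand
    while tokens and PREC.get(tokens[0], 0) >= min_prec:
        op = tokens.pop(0)
        right = to_prefix(tokens, PREC[op] + 1)  # bind tighter on the right
        left = [op, left, right]
    return left

def show(tree):
    """Print the tree in Lisp notation."""
    if isinstance(tree, list):
        return "(" + " ".join(show(t) for t in tree) + ")"
    return tree

print(show(to_prefix("a + b * c".split())))  # (+ a (* b c))
```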


    In mathematical logic the formal syntax also specifies a uniform and
    simple notation, usually based on parentheses, but even there authors
    immediately introduce conventions about operator precedence so that you
    don't have to read (and they don't have to type) so many parentheses.
    So what in formal syntax would be, for example, ((A reo B) raA C)
    becomes A reo B raA C, where reo is specified to have higher precedence
    than raA.


    I note that Forth also has a very simple and uniform syntax and again Forth isn't very popular.

    /Roger

    Moderator wrote:
    > [Yes, it's specific to Lisp and Scheme. They have an extremely simple
    > syntax called S expressions of nested parenthesized lists of space
    > separated tokens with some quoting. The original plan was that Lisp 2
    > would have M expressions that looked more like a normal language but
    > it's over 50 years later and they still haven't gotten around to it.

    Actually Common Lisp has gotten around to it. I have seen Common Lisp
    libraries which create a more conventional syntax for Common Lisp and
    even claim to retain the power of writing macros. But I've never paid
    much attention because the usual Common Lisp syntax works fine for me.
    So I can't provide links; perhaps someone on comp.lang.lisp will know.
    I don't think that such libraries have seen much if any use. From what
    I recall, even their authors did not claim that they prefer the
    different syntax but they were simply hoping that with a more
    conventional syntactic wrapper Common Lisp (or some Lisp) would become
    more popular; or perhaps they saw it as an interesting intellectual
    exercise.

    --
    vlaho.ninja/prog
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.compilers on Sat Mar 25 13:14:31 2023
    From Newsgroup: comp.compilers

    [[...] The original plan was that Lisp 2
    would have M expressions that looked more like a normal language but
    it's over 50 years later and they still haven't gotten around to it.
    -John]

    Actually they have. Some HOPL paper (or several of them) discuss
    this: There were several attempts at an Algol-like syntax, but Lisp
    programmers found that they preferred programming in S-Expressions
    over the Algol-like syntax, whether it's M-Expressions, Dylan syntax,
    or several other attempts.

    The only language which might be considered a success at having an
    Algol-like syntax in something similar to Lisp is JavaScript. Maybe
    this is just because JavaScript is far enough from Lisp not just in
    syntax, and there is no S-expression syntax for JavaScript, is there?

    - anton
    --
    M. Anton Ertl
    anton@mips.complang.tuwien.ac.at
    http://www.complang.tuwien.ac.at/anton/
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Lieven Marchand@mal@wyrd.be to comp.compilers on Sat Mar 25 19:58:58 2023
    From Newsgroup: comp.compilers

    Roger L Costello <costello@mitre.org> writes:

    > I have done some work with Flex and Bison and recently I've done some
    > work with building parsers using read. My experience is that the
    > latter is much easier. Why isn't read more widely discussed and used
    > in the compiler community? Surely the concept that read embodies is
    > not specific to Lisp and Scheme, right?

    Apart from the already mentioned problem that it forces you into a
    syntax that a lot of people don't like, there's also the problem that
    you have to deal with hostile input. Where you expect "(+ 2 3)" someone
    will enter "(+ 2 3 #.(progn (launch-the-nukes) 4))". A lot of security
    problems in real world settings come from not correctly validating
    inputs and by the time you have worked around all these problems read
    isn't all that easy anymore. C for example has a somewhat similar
    facility scanf that tries to pattern match input and is also considered
    unsafe. A good rule of thumb for production ready software is to define
    a grammar for valid input and provide a validating parser.

    --
    Laat hulle almal sterf. Ek is tevrede om die wêreld te sien brand en
    die vallende konings te spot. Ek en my aasdier sal loop op die as van
    die verwoeste aarde.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From gah4@gah4@u.washington.edu to comp.compilers on Sat Mar 25 14:32:18 2023
    From Newsgroup: comp.compilers

    On Saturday, March 25, 2023 at 7:48:45 AM UTC-7, Spiros Bousbouras wrote:

    (snip)

    > It is specific to when you have a very simple and uniform syntax and
    > experience suggests that this isn't to most people's taste. Whether
    > it is a result of "mental wiring" or tradition (including
    > mathematical tradition) to which one gets exposed from a young age,
    > I don't know. What I mean by this is that most people seem to find it
    > easier to read
    > a + b * c
    > as opposed to
    > (+ a (* b c))
    >
    > and I don't know if this is just the result of early exposure or an
    > inherent part of how most humans' brains function.

    Years ago, when there was actual competition between TI and HP
    for calculator sales, there was much discussion on the advantages
    of HP's RPN (postfix notation) vs. TI's algebraic (infix notation).

    That seems to have gone away now, and is not discussed much.

    It always seemed to me that it was easier to think in postfix
    terms, but easier to write in infix notation. You won't
    find algebra or calculus books writing expressions in prefix or
    postfix form.

    No idea about mental wiring or being exposed at a young age.

    I wonder about studies of young(er) kids learning to use
    calculators or in teaching math to younger kids.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Kaz Kylheku@864-117-4973@kylheku.com to comp.compilers on Sun Mar 26 00:46:40 2023
    From Newsgroup: comp.compilers

    On 2023-03-25, Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    > [[...] The original plan was that Lisp 2
    > would have M expressions that looked more like a normal language but
    > it's over 50 years later and they still haven't gotten around to it.
    > -John]
    >
    > Actually they have. Some HOPL paper (or several of them) discuss
    > this: There were several attempts at an Algol-like syntax, but Lisp
    > programmers found that they preferred programming in S-Expressions
    > over the Algol-like syntax, whether it's M-Expressions, Dylan syntax,
    > or several other attempts.

    The situation is more nuanced. Common Lisp has programmable read tables
    which let you have any surface syntax, and this is used. It's just not
    the predominant mode of writing the bulk of the code. For instance, the cl-interpol library provides string syntax with interpolation.

    There exists an open source module for Common Lisp called
    named-readtables which provides disciplined registration for managing
    multiple read-tables. If a developer wants to mix multiple
    read-table-based syntaxes in the same source, they can clash.

    Racket is a popular language based on Scheme, which also has
    programmable syntax. A Racket source file can begin with a #lang
    directive which indicates which language module is being used;
    that syntax then applies to the rest of the file. I have the
    impression that this is fairly widely used in the Racket world.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Kaz Kylheku@864-117-4973@kylheku.com to comp.compilers on Sun Mar 26 01:17:40 2023
    From Newsgroup: comp.compilers

    On 2023-03-24, Roger L Costello <costello@mitre.org> wrote:
    > Example of tokenizing/parsing using read:
    >
    > (+ 3 4) --> read --> (list `+ 3 4) --> parse --> (add (num 3) (num 4))

    You've not quite hit upon how it works, and I'd encourage you to keep exploring.

    Read takes the seven characters (+ 3 4) and returns an object
    which stands for the same thing. When Lisp programmers discuss
    that object, they refer to it using the same notation (+ 3 4).

    Actual copy-paste from a Lisp session:

    [1]> (read-from-string "(+ 3 4)")
    (+ 3 4) ;
    7

    The second return value of read-from-string, 7, isn't the
    value of the expression; it's the position of the first
    character of the string which was not read. Our expression
    is seven characters long.

    > The first expression (+ 3 4) is the concrete syntax.
    > The middle expression (list `+ 3 4) is an s-expression. It is an
    > intermediate representation.

    "S-expression" actually refers to the character syntax. The object
    in memory is just an expression.

    The reader in Lisps like Scheme and Common Lisp perpetrates no such
    embellishment. The symbol "list" and quotation around the + will not
    appear from reading "(+ 3 4)". You get a three-element list, made up out
    of three cons cells (pair-like objects), whose elements are strictly
    those that are implied by the read syntax: the + symbol and the two
    numbers.

    > The last expression (add (num 3) (num 4)) is the abstract syntax.

    No such thing is user-visible in any mainstream Lisp. Lisp interpreters directly evaluate the (+ 3 4) object.

    Lisp compilers potentially build some annotated syntax tree, but
    this is not a documented feature of any Lisp that I know; it will be
    an internal matter.

    Compiling the raw (+ 3 4) form is perfectly possible.

    > The book says: read is one of the great ideas of computer science. It
    > helps decompose a fundamentally difficult process - generalized
    > parsing of the input stream - into two simple processes:
    >
    > (1) reading the input stream into an intermediate representation
    > (2) parsing that intermediate representation

    The bigger idea in Lisp is actually "print-read consistency": that
    objects have a printed notation that the machine can produce, which the
    machine can read to reproduce a similar object.

    Not all objects have print-read consistency in Lisp, but things are
    usually strict in the mature Lisp dialects. If something doesn't have
    print-read consistency, it will print in an unreadable form that
    signals an error if you try to read it back.

    In Common Lisp, the character sequence #< (sharpsign less-than),
    in the standard read-table, signals an error. Objects which
    don't have a printed notation that can be read can use that
    syntax, e.g. #<socket-handle 10.1.2.3:8080>.
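
    A rough analog outside Lisp, purely for illustration (my example,
    using Python's standard library, not anything from this post): repr
    gives a printed notation, ast.literal_eval reads it back, and objects
    without a readable notation print in a <...> form that errors when
    read, much like #<...>.

```python
import ast

# Print-read consistency: the printed form reads back to an equal object.
obj = ["+", 3, 4]
printed = repr(obj)                      # "['+', 3, 4]"
assert ast.literal_eval(printed) == obj  # round trip succeeds

# Objects with no readable notation print in a "<...>" form that cannot
# be read back, in the same spirit as Common Lisp's #<...> syntax:
try:
    ast.literal_eval(repr(open))  # repr(open) is "<built-in function open>"
    print("unexpectedly readable")
except (ValueError, SyntaxError):
    print("unreadable, like #<...>")
```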

    > I've read several compiler books and none of them talked about this.
    > They talk about creating a lexer to generate a stream of tokens and a
    > parser that receives the tokens and arranges them into a tree data
    > structure. Why no mention of the "crown jewel" of tokenizing/parsing?
    > Why no mention of "one of the great ideas of computer science"?

    It's because we are not in a branch of the parallel universe in which a
    lot of people know about and program in Lisp.

    The Lisp microcosm has a lot to say on many topics, but computing
    is largely ignorant of it.

    > I have done some work with Flex and Bison and recently I've done some
    > work with building parsers using read. My experience is that the
    > latter is much easier. Why isn't read more widely discussed and used
    > in the compiler community? Surely the concept that read embodies is
    > not specific to Lisp and Scheme, right?

    S-expressions do crop up outside of Lisp.

    The IMAP4 protocol uses them.

    The GNU C compiler uses a form of S-expression internally.
    Look up RTL:

    https://gcc.gnu.org/onlinedocs/gccint/RTL.html#RTL

    The Rational Rose object design tool stores files in an S-expression
    format called Petal.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Spiros Bousbouras@spibou@gmail.com to comp.compilers on Sun Mar 26 14:10:14 2023
    From Newsgroup: comp.compilers

    On Sat, 25 Mar 2023 19:58:58 +0100
    Lieven Marchand <mal@wyrd.be> wrote:
    > Apart from the already mentioned problem that it forces you into a
    > syntax that a lot of people don't like, there's also the problem that
    > you have to deal with hostile input. Where you expect "(+ 2 3)"
    > someone will enter "(+ 2 3 #.(progn (launch-the-nukes) 4))".

    The ability to read S-expressions does not imply that you must also have
    a "read and evaluate" facility. In any case, in Common Lisp you can
    trivially disable it by setting *READ-EVAL* to NIL.
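
    For readers outside Lisp, the same separation of reading from
    evaluation exists elsewhere. For example, Python's ast.literal_eval
    (my analogy, not from the post) reads data syntax but rejects anything
    that would execute code:

```python
import ast

# Plain data reads fine:
assert ast.literal_eval("(2, 3)") == (2, 3)

# Anything that would run code is rejected by the reader itself,
# in the spirit of reading with *READ-EVAL* bound to NIL:
try:
    ast.literal_eval("__import__('os').system('launch-the-nukes')")
except ValueError:
    print("rejected at read time")
```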

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Christopher F Clark@christopher.f.clark@compiler-resources.com to comp.compilers on Sun Mar 26 17:11:14 2023
    From Newsgroup: comp.compilers

    If I recall correctly, at one time, PCCTS and its related tool
    Sorcerer used an S-expression-like representation for ASTs, including
    an equivalent to CAR and CDR, whose names I don't recall but they were
    something like "down" and "right".

    More recent versions of ANTLR don't expose that as far as I have seen,
    preferring to use a Visitor pattern for traversals.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Kaz Kylheku@864-117-4973@kylheku.com to comp.compilers on Sun Mar 26 18:19:23 2023
    From Newsgroup: comp.compilers

    On 2023-03-25, Lieven Marchand <mal@wyrd.be> wrote:
    > Roger L Costello <costello@mitre.org> writes:
    >
    >> I have done some work with Flex and Bison and recently I've done
    >> some work with building parsers using read. My experience is that
    >> the latter is much easier. Why isn't read more widely discussed and
    >> used in the compiler community? Surely the concept that read
    >> embodies is not specific to Lisp and Scheme, right?
    >
    > Apart from the already mentioned problem that it forces you into a
    > syntax that a lot of people don't like, there's also the problem that
    > you have to deal with hostile input. Where you expect "(+ 2 3)"
    > someone will enter "(+ 2 3 #.(progn (launch-the-nukes) 4))". A lot
    > of security

    Not every Lisp dialect has hash-dot read-time evaluation; that's
    a feature of Common Lisp, disabled by setting/binding *read-eval*
    to nil. I don't seem to recall that Scheme has it. I deliberately
    kept it out of TXR Lisp.

    However, that doesn't disable compile-time evaluation in macros,
    which kicks in if you feed the read code to the compile function.
    compile must be regarded the same as eval from a security POV.
    We are seeing compile-time evaluation in newer languages,
    though.

    It's not a bona-fide security issue, except in applications that
    dynamically compile untrusted input, since the aim is almost always
    to execute it; whether the malice happens at compile time or run
    time, both have to be sandboxed.

    When you're building an open-source program, it's a given that
    you're running its code: shell scripts, make files or what
    have you. It doesn't need read-time evaluation to perpetrate
    malice.

    > problems in real world settings come from not correctly validating
    > inputs and by the time you have worked around all these problems read
    > isn't all that easy anymore. C for example has a somewhat similar
    > facility scanf that tries to pattern match input and is also
    > considered unsafe.

    scanf is unsafe, but not in the way that hash-dot read-time evaluation
    is unsafe. The situations are not comparable.

    scanf doesn't feature a documented, reliable scan-time programming language
    that the *input* can use to extend the program which calls scanf,
    and which can be turned off by a flag.

    You have to exploit a buffer overflow or whatever, enabled by careless
    use of scanf.

    All they have in common is that calling read on untrusted input without disabling *read-eval* is a kind of careless use of read.

    But setting *read-eval* to nil is something you may be able to do in
    just one place in the entire application. (Anything which needs
    *read-eval* can opt in using (let ((*read-eval* t)) ...) around its
    calls to read.)

    > A good rule of thumb for production ready software is to define
    > a grammar for valid input and provide a validating parser.

    Sure, if you want to waste your time defining grammars and
    writing validating parsers.

    This is no longer done that much outside of the Lisp world. People use XML, JSON, Yaml, ..., whose grammar they definitely didn't design or
    implement, and validate the content/shape of the object that comes out.
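
    A sketch of that style in Python (my illustration, with hypothetical
    field names): parse with a generic reader whose grammar you didn't
    write, then validate the shape of the object rather than the
    characters.

```python
import json

def load_config(text):
    """Parse with a generic reader, then validate the object's shape."""
    obj = json.loads(text)  # the JSON grammar is not ours to define
    if not isinstance(obj, dict):
        raise ValueError("expected an object")
    if not isinstance(obj.get("host"), str):
        raise ValueError("'host' must be a string")
    if not isinstance(obj.get("port"), int):
        raise ValueError("'port' must be an integer")
    return obj

cfg = load_config('{"host": "example.org", "port": 8080}')
print(cfg["host"])  # example.org
```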

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.compilers on Mon Mar 27 05:08:43 2023
    From Newsgroup: comp.compilers

    Christopher F Clark <christopher.f.clark@compiler-resources.com> schrieb:
    > If I recall correctly, at one time, PCCTS and its related tool
    > Sorcerer used an S-expression-like representation for ASTs, including
    > an equivalent to CAR and CDR, whose names I don't recall but they
    > were something like "down" and "right".

    S-expressions are a fairly natural representation of trees,
    at least for expressions.

    An example: If you use gfortran's -fdump-fortran-original option, you
    will get its internal expression after parsing on standard output.

    For example, the code of

    real function foo(x,y)
    foo = 2.3 + 5.0*x*y
    end function foo

    is represented as

    ASSIGN foo:foo (+ 2.29999995 (* (* 5.00000000 foo:x) foo:y))

    where the internal representation is a tree.
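
    The point is easy to demonstrate (my own sketch, not gfortran code):
    a recursive tree flattens to S-expression notation in a few lines.

```python
def sexp(tree):
    """Print a nested-tuple expression tree in S-expression notation."""
    if isinstance(tree, tuple):
        op, *args = tree
        return "(" + " ".join([op] + [sexp(a) for a in args]) + ")"
    return str(tree)

# foo = 2.3 + 5.0*x*y, as a nested tuple tree:
tree = ("+", 2.29999995, ("*", ("*", 5.0, "foo:x"), "foo:y"))
print(sexp(tree))  # (+ 2.29999995 (* (* 5.0 foo:x) foo:y))
```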

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Lieven Marchand@mal@wyrd.be to comp.compilers on Mon Mar 27 23:15:02 2023
    From Newsgroup: comp.compilers

    Kaz Kylheku <864-117-4973@kylheku.com> writes:

    > Not every Lisp dialect has hash-dot read-time evaluation; that's
    > a feature of Common Lisp, disabled by setting/binding *read-eval*
    > to nil. I don't seem to recall that Scheme has it. I deliberately
    > kept it out of TXR Lisp.

    You also should wrap a use of read in with-standard-io-syntax, and
    even then there can still be surprises in CL. One implementation I
    used still had, by default, the #, reader macro from the CLtL2 era
    that has since disappeared from the standard.

    >> A good rule of thumb for production ready software is to define
    >> a grammar for valid input and provide a validating parser.
    >
    > Sure, if you want to waste your time defining grammars and
    > writing validating parsers.
    >
    > This is no longer done that much outside of the Lisp world. People
    > use XML, JSON, Yaml, ..., whose grammar they definitely didn't design
    > or implement, and validate the content/shape of the object that comes
    > out.

    The Haskell and the Rust world have partially gone to applicative and
    monadic parsing. In such a paradigm it's not that hard to write
    grammars.

    --
    Laat hulle almal sterf. Ek is tevrede om die wêreld te sien brand en
    die vallende konings te spot. Ek en my aasdier sal loop op die as van
    die verwoeste aarde.
    --- Synchronet 3.21b-Linux NewsLink 1.2