• Bad performance of back-references

    From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.unix.shell on Sun Jun 1 14:07:33 2025
    From Newsgroup: comp.unix.shell

    In a recent application case I used back-references to find duplicate
    strings in large data sets. I tested that with grep as in

    grep -E -o '(.{42}).*\1'

    While this is obviously a neat way to formulate and solve the task in
    principle it is impractical for performance reasons.[*]

    Applied on my MB sized data the task did not terminate and I killed
    the process after a day.

    I also implemented the desired function explicitly (using two nested
    loops) in a couple of languages (interpreted or compiled). All those hand-crafted and non-optimized implementations terminated and did
    that within minutes or up to only few hours (depending on the pattern
    length).

    My astonishment is why the back-reference implementation performs so
    badly here with 'grep'.

    Janis

    [*] Note: Back-references are not from the Regular Expression functions
    class so they cannot be done in O(N) or O(N+K); so I don't expect this complexity where I use them. This is not the question here, just to be
    clear.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.unix.shell on Sun Jun 1 12:52:58 2025
    From Newsgroup: comp.unix.shell

    In article <101hfq7$22v3c$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    In a recent application case I used back-references to find duplicate
    strings in large data sets. I tested that with grep as in

    grep -E -o '(.{42}).*\1'

    While this is obviously a neat way to formulate and solve the task in >principle it is impractical for performance reasons.[*]

    Applied on my MB sized data the task did not terminate and I killed
    the process after a day.

    I also implemented the desired function explicitly (using two nested
    loops) in a couple of languages (interpreted or compiled). All those >hand-crafted and non-optimized implementations terminated and did
    that within minutes or up to only few hours (depending on the pattern >length).

    My astonishment is why the back-reference implementation performs so
    badly here with 'grep'.

    Janis

    [*] Note: Back-references are not from the Regular Expression functions
    class so they cannot be done in O(N) or O(N+K); so I don't expect this >complexity where I use them. This is not the question here, just to be
    clear.

    Your results don't surprise me in the the least.

    First, "back references" make "regular expressions" not regular,
    in the formal sense sense that they are no longer isomorphic to
    deterministic finite automata or their NDFA simulations.
    Matching DFA is inherently linear, but _creating_ DFAs can be
    exponential; NDFAs can be created in reasonable time (I forget
    the exact complexity, I'm afraid) though executing them may be
    superlinear; regardless it's much better than exponential.

    Second, most implementations that support backreferences use
    backtracking to do so, which can be exponential in both space
    and time.

    There's some good background information here: https://swtch.com/~rsc/regexp/regexp1.html

    The bottom line is that regexp that use back tracking are not
    actually regular expressions, and there is no known way to make
    them fast generally.

    - Dan C.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.unix.shell on Sun Jun 1 15:07:38 2025
    From Newsgroup: comp.unix.shell

    On 01.06.2025 14:52, Dan Cross wrote:
    In article <101hfq7$22v3c$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    In a recent application case I used back-references to find duplicate
    strings in large data sets. I tested that with grep as in

    grep -E -o '(.{42}).*\1'

    While this is obviously a neat way to formulate and solve the task in
    principle it is impractical for performance reasons.[*]

    Applied on my MB sized data the task did not terminate and I killed
    the process after a day.

    I also implemented the desired function explicitly (using two nested
    loops) in a couple of languages (interpreted or compiled). All those
    hand-crafted and non-optimized implementations terminated and did
    that within minutes or up to only few hours (depending on the pattern
    length).

    My astonishment is why the back-reference implementation performs so
    badly here with 'grep'.

    Janis

    [*] Note: Back-references are not from the Regular Expression functions
    class so they cannot be done in O(N) or O(N+K); so I don't expect this
    complexity where I use them. This is not the question here, just to be
    clear.

    Your results don't surprise me in the the least.

    First, "back references" make "regular expressions" not regular,
    in the formal sense sense that they are no longer isomorphic to
    deterministic finite automata or their NDFA simulations.

    I'm surprised you give me that answer since with my footnote
    above I explicitly intended to exactly prevent focusing on
    or distracting about this already known fact.

    Matching DFA is inherently linear, but _creating_ DFAs can be
    exponential; NDFAs can be created in reasonable time (I forget
    the exact complexity, I'm afraid) though executing them may be
    superlinear; regardless it's much better than exponential.

    Second, most implementations that support backreferences use
    backtracking to do so, which can be exponential in both space
    and time.

    I'd think that most applications that support back-references
    might rely on the same library (and not reinvent the wheel).

    If the accessible implementations will all resort to backtracking
    as the sole implementation for back-references it indeed explains
    something.


    There's some good background information here: https://swtch.com/~rsc/regexp/regexp1.html

    The bottom line is that regexp that use back tracking are not
    actually regular expressions, and there is no known way to make
    them fast generally.

    - Dan C.


    Thanks.

    Janis

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.unix.shell on Mon Jun 2 11:20:25 2025
    From Newsgroup: comp.unix.shell

    In article <101hjas$242rp$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    On 01.06.2025 14:52, Dan Cross wrote:
    In article <101hfq7$22v3c$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    In a recent application case I used back-references to find duplicate
    strings in large data sets. I tested that with grep as in

    grep -E -o '(.{42}).*\1'

    While this is obviously a neat way to formulate and solve the task in
    principle it is impractical for performance reasons.[*]

    Applied on my MB sized data the task did not terminate and I killed
    the process after a day.

    I also implemented the desired function explicitly (using two nested
    loops) in a couple of languages (interpreted or compiled). All those
    hand-crafted and non-optimized implementations terminated and did
    that within minutes or up to only few hours (depending on the pattern
    length).

    My astonishment is why the back-reference implementation performs so
    badly here with 'grep'.

    Janis

    [*] Note: Back-references are not from the Regular Expression functions
    class so they cannot be done in O(N) or O(N+K); so I don't expect this
    complexity where I use them. This is not the question here, just to be
    clear.

    Your results don't surprise me in the the least.

    First, "back references" make "regular expressions" not regular,
    in the formal sense sense that they are no longer isomorphic to
    deterministic finite automata or their NDFA simulations.

    I'm surprised you give me that answer since with my footnote
    above I explicitly intended to exactly prevent focusing on
    or distracting about this already known fact.

    Oh, sorry; I missed the footnote. It's odd, because that really
    is the question in some sense.

    Matching DFA is inherently linear, but _creating_ DFAs can be
    exponential; NDFAs can be created in reasonable time (I forget
    the exact complexity, I'm afraid) though executing them may be
    superlinear; regardless it's much better than exponential.

    Second, most implementations that support backreferences use
    backtracking to do so, which can be exponential in both space
    and time.

    I'd think that most applications that support back-references
    might rely on the same library (and not reinvent the wheel).

    I don't see how, when the functionality is implemented in many
    different languages.

    Historically, regular expressions were one of those things where
    library support wasn't super awesome. This is why Spencer's
    library was so popular, but that used backtracking.

    If the accessible implementations will all resort to backtracking
    as the sole implementation for back-references it indeed explains
    something.

    I don't know that anyone knows of a better way for the general
    case. As I said, there is no known fast (as in, less than
    exponential) and general solution to matching with back
    references.

    One can probably write something faster for specific use cases,
    but that would obviously be specific to those cases.

    There's some good background information here:
    https://swtch.com/~rsc/regexp/regexp1.html

    The bottom line is that regexp that use back tracking are not
    actually regular expressions, and there is no known way to make
    them fast generally.

    Thanks.

    Happy to help.

    - Dan C.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.unix.shell on Sat Jun 7 23:48:14 2025
    From Newsgroup: comp.unix.shell

    On 02.06.2025 13:20, Dan Cross wrote:
    In article <101hjas$242rp$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    On 01.06.2025 14:52, Dan Cross wrote:
    In article <101hfq7$22v3c$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    In a recent application case I used back-references to find duplicate
    strings in large data sets. I tested that with grep as in

    grep -E -o '(.{42}).*\1'

    While this is obviously a neat way to formulate and solve the task in
    principle it is impractical for performance reasons.[*]

    Applied on my MB sized data the task did not terminate and I killed
    the process after a day.

    I also implemented the desired function explicitly (using two nested
    loops) in a couple of languages (interpreted or compiled). All those
    hand-crafted and non-optimized implementations terminated and did
    that within minutes or up to only few hours (depending on the pattern
    length).

    My astonishment is why the back-reference implementation performs so
    badly here with 'grep'.

    Janis

    [*] Note: Back-references are not from the Regular Expression functions >>>> class so they cannot be done in O(N) or O(N+K); so I don't expect this >>>> complexity where I use them. This is not the question here, just to be >>>> clear.

    Your results don't surprise me in the the least.

    First, "back references" make "regular expressions" not regular,
    in the formal sense sense that they are no longer isomorphic to
    deterministic finite automata or their NDFA simulations.

    I'm surprised you give me that answer since with my footnote
    above I explicitly intended to exactly prevent focusing on
    or distracting about this already known fact.

    Oh, sorry; I missed the footnote. It's odd, because that really
    is the question in some sense.

    Yes, of course. It's just the magnitude that is so frustrating,
    especially in comparison to other solutions. (See below.)

    (Myself I'm rarely using expressions that go beyond the Regular
    Expressions formal class, since I usually want to rely on the O(N)
    or O(N+K) complexity. But in the current case the simple solution
    was just tempting.)


    Matching DFA is inherently linear, but _creating_ DFAs can be
    exponential; NDFAs can be created in reasonable time (I forget
    the exact complexity, I'm afraid) though executing them may be
    superlinear; regardless it's much better than exponential.

    Second, most implementations that support backreferences use
    backtracking to do so, which can be exponential in both space
    and time.

    I'd think that most applications that support back-references
    might rely on the same library (and not reinvent the wheel).

    I don't see how, when the functionality is implemented in many
    different languages.

    Oh, I wasn't primarily speaking about different languages. Here
    I had tools in mind that are supposedly all written in "C", like
    'grep', 'sed', etc. But of course also other libraries and tools
    written in other languages may access (for example) "C" libraries
    that provide that functionality; many languages support external
    access to functions written in other languages.

    That said; there's of course also various other implementations
    anyway. (I hope not all do the same mistakes.)


    Historically, regular expressions were one of those things where
    library support wasn't super awesome. This is why Spencer's
    library was so popular, but that used backtracking.

    If the accessible implementations will all resort to backtracking
    as the sole implementation for back-references it indeed explains
    something.

    I don't know that anyone knows of a better way for the general
    case. As I said, there is no known fast (as in, less than
    exponential) and general solution to matching with back
    references.

    It's certainly a correct observation that if you have a suboptimal
    but "general" algorithmic solution that it's tempting to use that.


    One can probably write something faster for specific use cases,
    but that would obviously be specific to those cases.

    To be honest, I haven't to any significant depth studied details
    of various "beyond-regular" implementations; with formal strictly
    Regular Expressions it's obvious and simple.

    I've asked that question mainly because the performance measures
    where too far apart. My simple 'grep' based solution '(.{12}).*\1'
    terminated (finally!) after 27.5 hours (more than a day).[*] Where
    a bulky hand-crafted nested loops required only a fraction of that;
    don't recall exactly but I think it was in the low minutes(!) range.
    That was some Awk code that even did a lot of unnecessary costly
    copying (because of the necessity, with the given functions, of
    repeated unavoidable huge copies). With yet another approach I had
    got that task down to 4 seconds, but that's not my expectation with
    a standard regexp library. But 27.5 hours is way off, IMO, compared
    to the straight forward primitive approach with 2 minutes; a factor
    of magnitude approx. 1000 times slower.

    If a "general solution" can be so costly then, I think, something
    better should be implemented than a "one-backtracking-fits-all".

    Janis

    [*] Note the comparable small string size of 12 of the substring to
    match. (This value is also much smaller than the sample in my OP.)
    For sized of {6} on a MB sized file it's okay, then it's still in
    the seconds range. But, as opposed to FSMs, the algorithm is badly
    scaling.

    [...]


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.unix.shell on Mon Jun 9 12:17:57 2025
    From Newsgroup: comp.unix.shell

    In article <1022c30$3cdq3$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    On 02.06.2025 13:20, Dan Cross wrote:
    In article <101hjas$242rp$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    On 01.06.2025 14:52, Dan Cross wrote:
    In article <101hfq7$22v3c$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    In a recent application case I used back-references to find duplicate >>>>> strings in large data sets. I tested that with grep as in

    grep -E -o '(.{42}).*\1'

    While this is obviously a neat way to formulate and solve the task in >>>>> principle it is impractical for performance reasons.[*]

    Applied on my MB sized data the task did not terminate and I killed
    the process after a day.

    I also implemented the desired function explicitly (using two nested >>>>> loops) in a couple of languages (interpreted or compiled). All those >>>>> hand-crafted and non-optimized implementations terminated and did
    that within minutes or up to only few hours (depending on the pattern >>>>> length).

    My astonishment is why the back-reference implementation performs so >>>>> badly here with 'grep'.

    Janis

    [*] Note: Back-references are not from the Regular Expression functions >>>>> class so they cannot be done in O(N) or O(N+K); so I don't expect this >>>>> complexity where I use them. This is not the question here, just to be >>>>> clear.

    Your results don't surprise me in the the least.

    First, "back references" make "regular expressions" not regular,
    in the formal sense sense that they are no longer isomorphic to
    deterministic finite automata or their NDFA simulations.

    I'm surprised you give me that answer since with my footnote
    above I explicitly intended to exactly prevent focusing on
    or distracting about this already known fact.

    Oh, sorry; I missed the footnote. It's odd, because that really
    is the question in some sense.

    Yes, of course. It's just the magnitude that is so frustrating,
    especially in comparison to other solutions. (See below.)

    Well, factorials get big really fast.

    6! = 720
    12! = 479,001,600

    That is, a factor of two between which factorial we want (6 vs
    12) but more than half a million between the value.

    (Myself I'm rarely using expressions that go beyond the Regular
    Expressions formal class, since I usually want to rely on the O(N)
    or O(N+K) complexity. But in the current case the simple solution
    was just tempting.)

    Yup. That's why they're there. But they're deceptive with
    respect to their complexity. Moreover, since authors like
    Friedl misrepresent the theory in popular books on the subject,
    people use what they think are "regular expressions" and get
    confused by the observed runtimes.

    Matching DFA is inherently linear, but _creating_ DFAs can be
    exponential; NDFAs can be created in reasonable time (I forget
    the exact complexity, I'm afraid) though executing them may be
    superlinear; regardless it's much better than exponential.

    Second, most implementations that support backreferences use
    backtracking to do so, which can be exponential in both space
    and time.

    I'd think that most applications that support back-references
    might rely on the same library (and not reinvent the wheel).

    I don't see how, when the functionality is implemented in many
    different languages.

    Oh, I wasn't primarily speaking about different languages. Here
    I had tools in mind that are supposedly all written in "C", like
    'grep', 'sed', etc. But of course also other libraries and tools
    written in other languages may access (for example) "C" libraries
    that provide that functionality; many languages support external
    access to functions written in other languages.

    Traditionally, those tools rolled their own versions. The
    regular expression libraries people have used over the years
    came somewhat later, in the evolutinary timeline.

    That said; there's of course also various other implementations
    anyway. (I hope not all do the same mistakes.)

    They don't. If you can use something like re2 (written in C++,
    which may constrain portability) then you're already better off.

    I don't know that anyone knows of a better way for the general
    case. As I said, there is no known fast (as in, less than
    exponential) and general solution to matching with back
    references.

    It's certainly a correct observation that if you have a suboptimal
    but "general" algorithmic solution that it's tempting to use that.

    For a general tool, it's unclear to me how one could do anything
    else.

    One can probably write something faster for specific use cases,
    but that would obviously be specific to those cases.

    To be honest, I haven't to any significant depth studied details
    of various "beyond-regular" implementations; with formal strictly
    Regular Expressions it's obvious and simple.

    Mmm, I'm not sure I entirely agree with that. NDFA analysis can
    be subtle, but the real thing that twists people's noggins is
    how DFA _creation_ can be exponential.

    I've asked that question mainly because the performance measures
    where too far apart. My simple 'grep' based solution '(.{12}).*\1'
    terminated (finally!) after 27.5 hours (more than a day).[*] Where
    a bulky hand-crafted nested loops required only a fraction of that;
    don't recall exactly but I think it was in the low minutes(!) range.
    That was some Awk code that even did a lot of unnecessary costly
    copying (because of the necessity, with the given functions, of
    repeated unavoidable huge copies). With yet another approach I had
    got that task down to 4 seconds, but that's not my expectation with
    a standard regexp library. But 27.5 hours is way off, IMO, compared
    to the straight forward primitive approach with 2 minutes; a factor
    of magnitude approx. 1000 times slower.

    If a "general solution" can be so costly then, I think, something
    better should be implemented than a "one-backtracking-fits-all".

    Well, if someone comes up with something better, they will have
    solved a very important open problem in computer science. :-)

    [*] Note the comparable small string size of 12 of the substring to
    match. (This value is also much smaller than the sample in my OP.)
    For sized of {6} on a MB sized file it's okay, then it's still in
    the seconds range. But, as opposed to FSMs, the algorithm is badly
    scaling.

    See above to see how this scales with exponentiation.

    - Dan C.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.unix.shell on Wed Jun 11 10:31:43 2025
    From Newsgroup: comp.unix.shell

    On 09.06.2025 14:17, Dan Cross wrote:
    In article <1022c30$3cdq3$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

    grep -E -o '(.{42}).*\1'

    [...]

    Yes, of course. It's just the magnitude that is so frustrating,
    especially in comparison to other solutions. (See below.)

    Well, factorials get big really fast.

    Of course; given the algorithm(s) the complexity follows.

    [...]

    (Myself I'm rarely using expressions that go beyond the Regular
    Expressions formal class, since I usually want to rely on the O(N)
    or O(N+K) complexity. But in the current case the simple solution
    was just tempting.)

    Yup. That's why they're there. But they're deceptive with
    respect to their complexity. Moreover, since authors like
    Friedl misrepresent the theory in popular books on the subject,

    I haven't read his books. Are they in any way (sort of) standard
    literature?

    people use what they think are "regular expressions" and get
    confused by the observed runtimes.

    *shrug* Can't tell. (For me it's obvious, and in the past I've
    also regularly pointed that out in the respective discussions.)

    But I recall some bad matching behavior in (I think) some Perl
    version; a very primitive RE (with no back-references or such)
    led to exponential runtime behavior.


    Traditionally, those tools rolled their own versions. The
    regular expression libraries people have used over the years
    came somewhat later, in the evolutinary timeline.

    I'm lacking an overview here. One thing I know of GNU Awk - but
    that's of course just one example - is that a standard library
    had been used, and more recently an alternative library has been
    examined. All I remember is that they said it does implement in
    some respect improvements. GNU Awk is relying a lot on existing
    functions as sort of an adapter.

    (But Awk doesn't support back-references, so it won't apply to
    the topic of this thread.)

    [...]

    I don't know that anyone knows of a better way for the general
    case. As I said, there is no known fast (as in, less than
    exponential) and general solution to matching with back
    references.

    It's certainly a correct observation that if you have a suboptimal
    but "general" algorithmic solution that it's tempting to use that.

    For a general tool, it's unclear to me how one could do anything
    else.

    Well, it's not uncommon to have tools separate function classes.
    In case of Regexps you could certainly tell apart RE expressions
    that use [true] Regular Expression patterns and those beyond.


    One can probably write something faster for specific use cases,
    but that would obviously be specific to those cases.

    To be honest, I haven't to any significant depth studied details
    of various "beyond-regular" implementations; with formal strictly
    Regular Expressions it's obvious and simple.

    Mmm, I'm not sure I entirely agree with that. NDFA analysis can
    be subtle, but the real thing that twists people's noggins is
    how DFA _creation_ can be exponential.

    Well, I agree if you mean that you can formulate non-deterministic
    RE patterns that would lead to NDFAs whose transformation to DFAs
    may become non-trivial - although patterns are mostly short compared
    to the data they are applied to, so even in those rarer cases it may
    not be an issue.

    Myself I seem to have generally always defined deterministic Regular
    Expression patterns. The linear complexity follows.


    [...]

    If a "general solution" can be so costly then, I think, something
    better should be implemented than a "one-backtracking-fits-all".

    Well, if someone comes up with something better, they will have
    solved a very important open problem in computer science. :-)

    As said, I haven't dived deeper into the "beyond RE" case. But are
    you saying that this class of patterns can *only* be solved by
    expensive backtracking? - I'd doubt that, but I'm too old and too
    lazy to do any fundamental research. ;-)

    Janis

    [...]


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.unix.shell on Wed Jun 11 13:48:31 2025
    From Newsgroup: comp.unix.shell

    In article <102beth$1sdmr$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    On 09.06.2025 14:17, Dan Cross wrote:
    [...]

    (Myself I'm rarely using expressions that go beyond the Regular
    Expressions formal class, since I usually want to rely on the O(N)
    or O(N+K) complexity. But in the current case the simple solution
    was just tempting.)

    Yup. That's why they're there. But they're deceptive with
    respect to their complexity. Moreover, since authors like
    Friedl misrepresent the theory in popular books on the subject,

    I haven't read his books. Are they in any way (sort of) standard
    literature?

    No. He wrote a book called, "Mastering Regular Expressions"
    where he mixes up some of the formal terminology. In particular
    he calls things "NDFAs" that are not, in fact, non-deterministic
    finite automata. When corrected, he got somewhat indignant.

    That is to say, he's made factually incorrect statements (even
    in his book!) and refused to correct them.

    people use what they think are "regular expressions" and get
    confused by the observed runtimes.

    *shrug* Can't tell. (For me it's obvious, and in the past I've
    also regularly pointed that out in the respective discussions.)

    But I recall some bad matching behavior in (I think) some Perl
    version; a very primitive RE (with no back-references or such)
    led to exponential runtime behavior.

    An easy way to implement even real regular expressions is with
    backtracking, and building an NDFA simulation is rather more
    work. Consequently, a lot of early "regexp" libraries took this
    approach, and could suffer from exponential blowup, even using
    standard regular expressions that would be linear in a DFA
    implementation. Perl, in particular, did this, borrowing Henry
    Spencer's RE library, which used backtracking. The version of
    regular expressions described in "The Practice of Programming"
    by Kernighan and Pike also uses backtracking; it's only a page
    or two, but exhibits exponential behavior in the worst case.

    Traditionally, those tools rolled their own versions. The
    regular expression libraries people have used over the years
    came somewhat later, in the evolutinary timeline.

    I'm lacking an overview here. One thing I know of GNU Awk - but
    that's of course just one example - is that a standard library
    had been used, and more recently an alternative library has been
    examined. All I remember is that they said it does implement in
    some respect improvements. GNU Awk is relying a lot on existing
    functions as sort of an adapter.

    (But Awk doesn't support back-references, so it won't apply to
    the topic of this thread.)

    GNU awk appears to have imported a regular expression library
    that does DFA construction, if possible, and an NDFA simulation
    if not. The DFA code was written by Mike Haertel; I assume it
    came via gnulib or something.

    It's quite a bit of code, wrapped in a few layers. It doesn't
    seem like they link to e.g. a system library or something that
    already comes with the base OS.

    [...]

    I don't know that anyone knows of a better way for the general
    case. As I said, there is no known fast (as in, less than
    exponential) and general solution to matching with back
    references.

    It's certainly a correct observation that if you have a suboptimal
    but "general" algorithmic solution that it's tempting to use that.

    For a general tool, it's unclear to me how one could do anything
    else.

    Well, it's not uncommon to have tools separate function classes.
    In case of Regexps you could certainly tell apart RE expressions
    that use [true] Regular Expression patterns and those beyond.

    Tcl does something like that.

    To be honest, I haven't to any significant depth studied details
    of various "beyond-regular" implementations; with formal strictly
    Regular Expressions it's obvious and simple.

    Mmm, I'm not sure I entirely agree with that. NDFA analysis can
    be subtle, but the real thing that twists people's noggins is
    how DFA _creation_ can be exponential.

    Well, I agree if you mean that you can formulate non-deterministic
    RE patterns that would lead to NDFAs whose transformation to DFAs
    may become non-trivial - although patterns are mostly short compared
    to the data they are applied to, so even in those rarer cases it may
    not be an issue.

    Myself I seem to have generally always defined deterministic Regular >Expression patterns. The linear complexity follows.

    Careful here; many people assume that but really end up with
    something that's simulated by an NDFA, which is superlinear
    (though still not exponential). Any DFA can be _simulated_ by
    an NDFA; it's a tradeoff between constuction and runtime.

    [...]

    If a "general solution" can be so costly then, I think, something
    better should be implemented than a "one-backtracking-fits-all".

    Well, if someone comes up with something better, they will have
    solved a very important open problem in computer science. :-)

    As said, I haven't dived deeper into the "beyond RE" case. But are
    you saying that this class of patterns can *only* be solved by
    expensive backtracking? - I'd doubt that, but I'm too old and too
    lazy to do any fundamental research. ;-)

    What I'm saying is that no one actually knows, but if there's a
    quicker way to do it, no one has found it yet, and a lot of
    people have tried over a lot of years.

    As of a few years ago, the best known algorithm for supporting
    backreferences had exponential complexity. As far as I know,
    nothing has changed about that since the last time I looked
    closely. People _have_ done some work with fixing the number
    of back references and so on, and maybe shown that that problem
    is in P (it's somewhat unclear) but that's a) not the general
    case, and b) the runtimes are still awful. Like, O(N^2k), where
    k is number of capture groups, kind of awful.

    Again, I point to Russ Cox's reference on the subject, which is
    a fairly gentle introduction for the lay reader: https://swtch.com/~rsc/regexp/regexp1.html

    - Dan C.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Janis Papanagnou@janis_papanagnou@hotmail.com to comp.unix.shell on Thu Jun 12 03:58:46 2025
    From Newsgroup: comp.unix.shell

    On 11.06.25 15:48, Dan Cross wrote:
    [...]

    Again, I point to Russ Cox's reference on the subject, which is
    a fairly gentle introduction for the lay reader: https://swtch.com/~rsc/regexp/regexp1.html

    Looks somewhat familiar; maybe I had read that some time ago.
    I'll (re-)read it to refresh the details. - Thanks.

    Janis

    --- Synchronet 3.21b-Linux NewsLink 1.2