[ prog / sol / mona ]

prog


How can I run my own instance of this

91 2020-02-25 10:50

Here are the two conditions necessary for irregex-search to break its "leftmost, longest" contract:
1. Range iteration instead of star iteration.
2. An alternation where the first branch is a prefix of the second branch.
A minimal example:

$ mit-scheme --load irregex.scm
[...]
;Loading "irregex.scm"... done
1 ]=> (define (imsis re str) (irregex-match-substring (irregex-search re str)))
;Value: imsis
1 ]=> (imsis "(a|ab){0,3}" "abab")
;Value 35: "a"

Irregex-match keeps working:

1 ]=> (define (imsim re str) (irregex-match-substring (irregex-match re str)))
;Value: imsim
1 ]=> (imsim  "(a|ab){0,3}" "abab")
;Value 36: "abab"

If only one condition is met, irregex-search works too:

1 ]=> (imsis "(ab|a){0,3}" "abab")
;Value 34: "abab"
1 ]=> (imsis "(a|ab)*" "abab")
;Value 33: "abab"

I think this might be worth telling the shinnoid. I cannot see any obvious contact info on
http://synthcode.com/scheme/irregex/
but his git commits on
https://github.com/ashinn/irregex
are by "Alex Shinn <alexshinn@gmail.com>". Well, at least he's not one of those protonmail people. If someone could drop him a line, that would be great.

>>90

That's what I tried
Can't remember what the problem was
I could have miswritten it

Clearly, it was a typo of some sort. No worries.

92 2020-02-25 21:48

Here is something peculiar. Irregex-match uses irregex-match/chunked which has two branches, one with dfa-match/longest and another with irregex-nfa. But the irregex-search/matches used by irregex-search has three branches: dfa-match/longest, dfa-match/shortest and irregex-search/backtrack. However, it is not a difference between these branches that breaks irregex-search. With some bronze age debug prints we have:

$ guile -l irregex.scm 
[...]
scheme@(guile-user)> (define (imsis re str) (irregex-match-substring (irregex-search re str)))
scheme@(guile-user)> (imsis "(a|ab){0,3}" "abab")
is/m->backtrack $1 = "a"
scheme@(guile-user)> (imsis "(ab|a){0,3}" "abab")
is/m->backtrack $2 = "abab"
scheme@(guile-user)> (imsis "(a|ab)*" "abab")
is/m->dfa is/m->shortest $3 = "abab"
scheme@(guile-user)> 

So both range iteration versions take the backtrack branch, but only the prefix-first version breaks irregex-search. It would seem that the problem might be somewhere within irregex-search/backtrack.

93 2020-02-26 01:35

Irregex PCRE x{m,n} ranges are parsed into SRE (** m n x) ranges in string->sre -> named let lp -> main parsing -> case c -> branch (#\{). The crucial bit is:

(m
 (lp (+ j 1) (+ j 1) flags `((** ,n ,m ,x) ,@tail) st))

which is at lines {962,963} in the "0.9.6: 2016/12/05" version of deps/irregex.scm from SchemeBBS on "20 Feb, 2020 4 commits". The irregex-search breakage does not depend on PCRE parsing and can be demonstrated directly on SRE. The same conditions I explained in >>91 hold for SRE:

$ guile -l irregex.scm 
scheme@(guile-user)> (define (imsis re str) (irregex-match-substring (irregex-search re str)))
scheme@(guile-user)> (imsis '(** 0 3 (or "a" "ab")) "abab")
is/m->backtrack $1 = "a"
scheme@(guile-user)> (imsis '(** 0 3 (or "ab" "a")) "abab")
is/m->backtrack $2 = "abab"
scheme@(guile-user)> (imsis '(* (or "a" "ab")) "abab")
is/m->dfa is/m->shortest $3 = "abab"
scheme@(guile-user)> 

The bug is now quite likely to be in irregex-search/backtrack with (** m n (or x y)) constructs.

SRE irregex reference:
http://synthcode.com/scheme/irregex/#SECTION_3.2

94 2020-02-26 13:10

Irregex-search/backtrack uses irregex-nfa, which is just an accessor into the irregex representation, which is a vector. The nfa is index 3 in make-irregex, and is computed by the final branch of sre->irregex via sre->procedure. Sre->procedure composes a tree of lambdas over the tree of a SRE like '(** 0 3 (or "a" "ab")). The "case (car sre)" has (or) and (**) branches, the bug should be here.

With the benefit of hindsight, this can be gleaned from the REPL:

$ guile -l irregex.scm 
scheme@(guile-user)> (irregex '(** 0 3 (or "a" "ab")))
$2 = #(*irregex-tag* #f #f #<procedure 55af37871090 at irregex.scm:3161:21 (cnk init src str i end matches fail)> 0 0 #((0 . 6)) ())
scheme@(guile-user)> 

Irregex.scm:3161:21 is where the lambda of (**) lives.

95 2020-02-26 23:52

Here are two more pieces of information. First, the (or) branch of sre->procedure doesn't follow "leftmost, longest":

scheme@(guile-user)> (define (imsis re str) (irregex-match-substring (irregex-search re str)))
scheme@(guile-user)> (imsis '(** 1 1 (or "free" "freedom")) "freedom")
$11 = "free"

The fixed range of 1 is only there to cause irregex to use sre->procedure. A plain 'or' uses a vanilla NFA, the implementation of which seems to work. Second, the 'or' can be forced to use later branches to reach a non-zero lower range limit, but no further, again in violation of "leftmost, longest":

scheme@(guile-user)> (imsis '(** 2 4 (or "a" "ab")) "abababab")
$12 = "aba"
scheme@(guile-user)> (imsis '(** 3 4 (or "a" "ab")) "abababab")
$13 = "ababa"

The way in which the (or) branch is broken seems fairly clear, but for the (**) branch I do not have all the details yet.

97 2020-02-28 04:02

As an aside, before I explain why (or) and (**) are broken in sre->procedure, there is a "Fix exponential explosion in backtrack compilation" commit to irregex by Peter Bex on "Dec 5, 2016".
https://github.com/ashinn/irregex/commit/a16ffc86eca15fca9e40607d41de3cea9cf868f1
It only came to my attention because it contains the current implementation of the (+) branch of sre->procedure. While "define * in terms of +, instead of vice versa" is a fine idea, you still need a working (+). This (+), however, also takes a light-hearted comedy approach to the "POSIX leftmost, longest semantics" guaranteed by the documentation. As explained in >>95 the fixed range of 1 is only there to cause irregex to use sre->procedure.

$ guile -l irregex.scm
[...]
scheme@(guile-user)> (define (imsis re str) (irregex-match-substring (irregex-search re str)))
scheme@(guile-user)> (define (inout re n)
   (let* ((sin  (string-join (make-list n "a") ""))
          (sout (imsis re sin)))
      (simple-format #t  " in ~A ~A\nout ~A ~A\n" (string-length sin) sin (string-length sout) sout)))
scheme@(guile-user)> (inout '(** 1 1 (+ (or "aaa" "aaaaa"))) 8)
 in 8 aaaaaaaa
out 6 aaaaaa
scheme@(guile-user)> (inout '(** 1 1 (+ (or "aaa" "aaaaa"))) 9)
 in 9 aaaaaaaaa
out 9 aaaaaaaaa
scheme@(guile-user)> (inout '(** 1 1 (+ (or "aaa" "aaaaa"))) 10)
 in 10 aaaaaaaaaa
out 9 aaaaaaaaa

This class of tests can also be made to fail if the prefix is the second alternative:

scheme@(guile-user)> (inout '(** 1 1 (+ (or "aaaaa" "aaa"))) 9)
 in 9 aaaaaaaaa
out 8 aaaaaaaa
scheme@(guile-user)> (inout '(** 1 1 (+ (or "aaaaa" "aaa"))) 10)
 in 10 aaaaaaaaaa
out 10 aaaaaaaaaa
scheme@(guile-user)> (inout '(** 1 1 (+ (or "aaaaa" "aaa"))) 11)
 in 11 aaaaaaaaaaa
out 10 aaaaaaaaaa

Contrast this with grep, whose authors appear to actually know what they're doing. The {1,1} is only there for equivalence.

$ g () {
> local sin sout len
> len () { echo "$1" | gawk '{ print length ($0) }'; }
> sin=$(printf "%$2s" "" | tr ' ' 'a')
> echo " in $(len "$sin") $sin"
> sout=$(echo "$sin" | grep -E -oe "$1")
> echo "out $(len "$sout") $sout"
> }
$ g '(aaa|aaaaa)+{1,1}' 8
 in 8 aaaaaaaa
out 8 aaaaaaaa
$ g '(aaa|aaaaa)+{1,1}' 9
 in 9 aaaaaaaaa
out 9 aaaaaaaaa
$ g '(aaa|aaaaa)+{1,1}' 10
 in 10 aaaaaaaaaa
out 10 aaaaaaaaaa

And with reversed alternatives:

$ g '(aaaaa|aaa)+{1,1}' 9
 in 9 aaaaaaaaa
out 9 aaaaaaaaa
$ g '(aaaaa|aaa)+{1,1}' 10
 in 10 aaaaaaaaaa
out 10 aaaaaaaaaa
$ g '(aaaaa|aaa)+{1,1}' 11
 in 11 aaaaaaaaaaa
out 11 aaaaaaaaaaa

Great commit you have there.

>>96
Indeed, I hear entomologists can be quite peculiar people.

100 2020-03-01 11:17

[1/5]
OK, here is the cold, hard truth. The (or) branch of sre->procedure looks like this:

((or)
 (case (length (cdr sre))
   ((0) (lambda (cnk init src str i end matches fail) (fail)))
   ((1) (rec (cadr sre)))
   (else
    (let* ((first (rec (cadr sre)))
           (rest (lp (sre-alternate (cddr sre))
                     (+ n (sre-count-submatches (cadr sre)))
                     flags
                     next)))
      (lambda (cnk init src str i end matches fail)
        (first cnk init src str i end matches
               (lambda ()
                 (rest cnk init src str i end matches fail))))))))

It tries the first alternative, and if this fails it tries the (or) of the other alternatives. There is not even an attempt at the "POSIX leftmost, longest semantics". Since the (or) might be the entire regex, this is enough to make the sre->procedure call break the same semantics. Furthermore, in the general case where we do not restrict ourselves to alternatives of simple structure to construct a minimal >>91 failure case, the trick of reordering the alternatives cannot save the current (or). Consider "(aaa)*(ba)*" and "(aaaaa)*(ab)*". If we put the 3-group first, we can test on "aaaaaabab":

$ guile -l irregex.scm 
[...]
scheme@(guile-user)> (define (imsis re str) (irregex-match-substring (irregex-search re str)))
scheme@(guile-user)> (imsis "((aaa)*(ba)*|(aaaaa)*(ab)*){1,1}" "aaaaaabab")
$1 = "aaaaaaba"

which leaves off the final "b". The second branch would have taken one more character:

scheme@(guile-user)> (imsis "((aaaaa)*(ab)*){1,1}" "aaaaaabab")
$2 = "aaaaaabab"

If we put the 5-group first, we can test on "aaaaaababa":

scheme@(guile-user)> (imsis "((aaaaa)*(ab)*|(aaa)*(ba)*){1,1}" "aaaaaababa")
$3 = "aaaaaabab"

which leaves off the final "a". The second branch would again have taken one more character:

scheme@(guile-user)> (imsis "((aaa)*(ba)*){1,1}" "aaaaaababa")
$4 = "aaaaaababa"

The test string can be grown to arbitrary size by prefixing groups of 15 'a's. There is no trick that can substitute in the general case for the (or) properly searching its match space and returning the match candidates in non-increasing order of length.

101 2020-03-01 11:18

[2/5]
As for (**), the reason it breaks is only marginally more complicated. To illustrate the explanation with an example, we begin by replacing the string atom matcher with a thin proxy that reports the progress of the computation:

$ TZ=GMT diff -u schemebbs/deps/irregex.scm edit/deps/irregex.scm 
--- schemebbs/deps/irregex.scm	2020-02-17 21:53:31.563445679 +0000
+++ edit/deps/irregex.scm	2020-02-29 02:01:57.844878000 +0000
@@ -3508,7 +3508,10 @@
                     (fail))))
           ))
      ((string? sre)
-      (rec (sre-sequence (string->list sre)))
+      (let ((sub (rec (sre-sequence (string->list sre)))))
+        (lambda (cnk init src str i end matches fail)
+          (simple-format #t "trying ~A at ~A\n" sre i)
+          (sub cnk init src str i end matches fail)))
 ;; XXXX reintroduce faster string matching on chunks
 ;;       (if (flag-set? flags ~case-insensitive?)
 ;;           (rec (sre-sequence (string->list sre)))

The (**) branch is:

((**)
 (cond
  ((or (and (number? (cadr sre))
            (number? (caddr sre))
            (> (cadr sre) (caddr sre)))
       (and (not (cadr sre)) (caddr sre)))
   (lambda (cnk init src str i end matches fail) (fail)))
  (else
   (letrec
       ((from (cadr sre))
        (to (caddr sre))
        (body-contents (sre-sequence (cdddr sre)))
        (body
         (lambda (count)
           (lp body-contents
               n
               flags
               (lambda (cnk init src str i end matches fail)
                 (if (and to (= count to))
                     (next cnk init src str i end matches fail)
                     ((body (+ 1 count))
                      cnk init src str i end matches
                      (lambda ()
                        (if (>= count from)
                            (next cnk init src str i end matches fail)
                            (fail))))))))))
     (if (and (zero? from) to (zero? to))
         next
         (lambda (cnk init src str i end matches fail)
           ((body 1) cnk init src str i end matches
            (lambda ()
              (if (zero? from)
                  (next cnk init src str i end matches fail)
                  (fail))))))))))

Here is a talkative run on the last example of >>95:

scheme@(guile-user)> (imsis '(** 3 4 (or "a" "ab")) "abababab")
trying a at 0
trying a at 1
trying ab at 1
trying ab at 0
trying a at 2
trying a at 3
trying ab at 3
trying ab at 2
trying a at 4
trying a at 5
trying ab at 5
$1 = "ababa"
scheme@(guile-user)> 

The "a at 0" and "a at 2" are both retried with "ab" because the overall repeat counts they yielded were 1 and 2, under the lower limit of 3. These retries happen on the alternate of the innermost conditional of 'body'. But after "a at 4", which makes both attempts at 5 fail, there is no retry with "ab at 4". This is because the produced repeat count is now 3, and the same conditional declares it a success because extending to 4 repeats failed. There is no attempt to extend with further retries, even though it would obviously work since the target string is composed of 4 "ab"s.

The () branch is not interested in "leftmost, longest", it stops at the first repeat count extension failure within the allowed range. The only way it can produce "leftmost, longest" on its own is to only have repeat count extension failures under the lower limit, but smooth sailing from the lower limit to the upper limit. Since the () might be the entire regex, this is also enough to make the sre->procedure call break the same semantics.

102 2020-03-01 11:19

[3/5]
However, I promised the cold, hard truth and this was only the cold part, so here is the rest. The disregard for "leftmost, longest" in sre->procedure is not limited to (or) and (**), it is present in all branches that can take multiple paths. This is meant to be salvaged by the external user of sre->procedure via the 'fail' lambda that is returned as part of the 'matches' object in the named let lp. On a successful match, 'fail' is actually the retry continuation. This is how irregex-match works. Its driver for sre->procedure is the 'else' branch of irregex-match/chunked:

(define (irregex-match/chunked irx cnk src)
  (let* ((irx (irregex irx))
         (matches (irregex-new-matches irx)))
    (irregex-match-chunker-set! matches cnk)
    (cond
     ((irregex-dfa irx)
      [...]
     (else
      (let* ((matcher (irregex-nfa irx))
             (str ((chunker-get-str cnk) src))
             (i ((chunker-get-start cnk) src))
             (end ((chunker-get-end cnk) src))
             (init (cons src i)))
        (let lp ((m (matcher cnk init src str i end matches (lambda () #f))))
          (and m
               (cond
                ((and (not ((chunker-get-next cnk)
                            (%irregex-match-end-chunk m 0)))
                      (= ((chunker-get-end cnk)
                          (%irregex-match-end-chunk m 0))
                         (%irregex-match-end-index m 0)))
                 (%irregex-match-fail-set! m #f)
                 m)
                ((%irregex-match-fail m)
                 (lp ((%irregex-match-fail m))))
                (else
                 #f)))))))))

Whenever there is a match that does not exhaust the input, and a retry continuation exists, the retry is called by the "(lp ((%irregex-match-fail m)))" branch. This means that if a full match is possible, it will be found. Here is the above (**) example with irregex-match and one more debug print:

scheme@(guile-user)> (define (imsim re str) (irregex-match-substring (irregex-match re str)))
scheme@(guile-user)> (imsim '(** 3 4 (or "a" "ab")) "abababab")
trying a at 0
trying a at 1
trying ab at 1
trying ab at 0
trying a at 2
trying a at 3
trying ab at 3
trying ab at 2
trying a at 4
trying a at 5
trying ab at 5
retry by irregex-match/chunked
trying ab at 4
trying a at 6
retry by irregex-match/chunked
trying ab at 6
$1 = "abababab"
scheme@(guile-user)> 

Irregex-match/chunked has to override sre->procedure's result twice to get the full match.

103 2020-03-01 11:20

[4/5]
By contrast, irregex-search's driver for sre->procedure is irregex-search/backtrack:

(define (irregex-search/backtrack irx cnk init src i matches)
  (let ((matcher (irregex-nfa irx))
        (str ((chunker-get-str cnk) src))
        (end ((chunker-get-end cnk) src))
        (get-next (chunker-get-next cnk)))
    (if (flag-set? (irregex-flags irx) ~searcher?)
        (matcher cnk init src str i end matches (lambda () #f))
        (let lp ((src2 src)
                 (str str)
                 (i i)
                 (end end))
          (cond
           ((matcher cnk init src2 str i end matches (lambda () #f))
            (irregex-match-start-chunk-set! matches 0 src2)
            (irregex-match-start-index-set! matches 0 i)
            matches)
           ((< i end)
            (lp src2 str (+ i 1) end))
           (else
            (let ((src2 (get-next src2)))
              (if src2
                  (lp src2
                      ((chunker-get-str cnk) src2)
                      ((chunker-get-start cnk) src2)
                      ((chunker-get-end cnk) src2))
                  #f))))))))

If there was no match, it advances one character or acquires more input. But if there was a match, it is immediately returned. What is completely missing is any backtracking to attempt to find a longer match at the current position, to achieve "leftmost, longest". This procedure performs backtracking in the same way that the DPRK is democratic, all of the backtracking is in the procedure name, none in the code. According to git blame, this way of doing backtracking is from the "http-url fix to support multiple directories in the path" commit by ashinn on "Aug 15, 2012".
https://github.com/ashinn/irregex/commit/3de590ee0b517ab786a42a6a75922890e7972acf
Here is an example to illustrate the underlying issue. Consider 5-groups and 3-groups on an input of 9:

scheme@(guile-user)> (imsim '(** 1 1 (+ (or "aaaaa" "aaa"))) "aaaaaaaaa")
trying aaaaa at 0
trying aaaaa at 5
trying aaa at 5
trying aaaaa at 8
trying aaa at 8
retry by irregex-match/chunked
retry by irregex-match/chunked
trying aaa at 0
trying aaaaa at 3
trying aaaaa at 8
trying aaa at 8
retry by irregex-match/chunked
trying aaa at 3
trying aaaaa at 6
trying aaa at 6
trying aaaaa at 9
trying aaa at 9
$1 = "aaaaaaaaa"
scheme@(guile-user)> 

In order to get the longest overall match, the 5-group needs to back off twice, at 0 and at 3. The (or) itself did nothing wrong here, it provided the longest match it could at each step, due to the order of alternatives. It just so happens that in this case those are not the paths that lead to overall success. In the general case, proper backtracking is not a gentle suggestion but a hard requirement. As this example shows, and as one might recall from paying attention in college, the crux of the matter is this: regex matching can possess global maxima that are assembled entirely from local minima.

104 2020-03-01 11:21

[5/5]
And as a bonus, here is a DOS attack on irregex. The (+) branch of sre->procedure is:

((+)
 (cond
  ((sre-empty? (sre-sequence (cdr sre)))
   (error "invalid sre: empty +" sre))
  (else
   (letrec
       ((body
         (lp (sre-sequence (cdr sre))
             n
             flags
             (lambda (cnk init src str i end matches fail)
               (body cnk init src str i end matches
                     (lambda ()
                       (next cnk init src str i end matches fail)
                       ))))))
     body))))

It unconditionally loops its subexpression and only concedes when that fails. It does not even give one neutrino that the input might be exhausted. Protection from being passed a null matcher is provided by the sre-empty? test. Sre-empty? looks like this:

;; returns #t if the sre can ever be empty
(define (sre-empty? sre)
  (if (pair? sre)
      (case (car sre)
        ((* ? look-ahead look-behind neg-look-ahead neg-look-behind) #t)
        ((**) (or (not (number? (cadr sre))) (zero? (cadr sre))))
        ((or) (any sre-empty? (cdr sre)))
        ((: seq $ submatch => submatch-named + atomic)
         (every sre-empty? (cdr sre)))
        (else #f))
      (memq sre '(epsilon bos eos bol eol bow eow commit))))

It handles pairs and symbols. Irregex represents string atoms with raw strings, so that we can write '(or "a" "b") instead of having to write something like '(or (s "a") (s "b")). This means that strings can be SREs and they have their own branch in sre->procedure, it is the (string? sre) branch that we patched above for debug prints. But sre-empty? returns #f for everything it doesn't care about, and it doesn't care about strings. Naturally, this is an invitation to pass the empty string to (+) via SREs, e.g.:

scheme@(guile-user)> (sre-empty? '(or "aa" ""))
$1 = #f
scheme@(guile-user)> (imsis '(** 1 1 (+ (or "aa" ""))) "aaaa")

This will keep one of your CPU cores on 100% for a while. According to git blame, this sre-empty? is from the same commit by ashinn as the backtracking. In light of the comment above the procedure, "can ever be empty", here is how complicated the fix is: you have to consider the possibility that the empty string might sometimes be empty. Until you do, this is an easy DOS where SREs are directly accepted. The PCRE equivalent is:

scheme@(guile-user)> (imsis "(aa|){1,}" "aaaa")

which will also spin up your fans, and uses the fact that the author thought it was perfectly reasonable to add null matcher protection to * and + but not to upper unbounded ranges, neither in string->sre nor in sre->procedure. As I have already asked in >>91 perhaps somebody could bring the shinnoid up to speed.

And now you can have your thread back. I'm done with this irregex thing.

105 2020-03-01 11:24

Where you have the bold nonsense in [2/5] just imagine (**) for ().

301


VIP:

do not edit these