[ prog / sol / mona ]

prog


What are you working on?

143 2020-07-14 21:13

There are three users of irregex-fold/fast in irregex.scm: irregex-replace/all, irregex-extract and irregex-split. There is also a public wrapper: irregex-fold. Irregex-replace/all has the double match bug >>140 which has been dealt with. Irregex-extract also has a double match bug:
https://github.com/ashinn/irregex/blob/ac27338c5b490d19624c30d787c78bbfa45e1f11/irregex.scm#L3936

$ guile --no-auto-compile -l irregex.scm 
GNU Guile 2.2.3
[...]
scheme@(guile-user)> (irregex-extract "(?=a)" "---a---")
$1 = ("" "")

but its lambdas are so simple that after the irregex-fold/fast fix >>140 it starts working:

$ guile --no-auto-compile -l irregex2.scm 
GNU Guile 2.2.3
[...]
scheme@(guile-user)> (irregex-extract "(?=a)" "---a---")
$1 = ("")

The truly strange one is irregex-split.
https://github.com/ashinn/irregex/blob/ac27338c5b490d19624c30d787c78bbfa45e1f11/irregex.scm#L3946

(define (irregex-split irx str . o)
  (if (not (string? str)) (error "irregex-split: not a string" str))
  (let ((start (if (pair? o) (car o) 0))
        (end (if (and (pair? o) (pair? (cdr o))) (cadr o) (string-length str))))
    (irregex-fold/fast
     irx
     (lambda (i m a)
       (cond
        ((= i (%irregex-match-end-index m 0))
         ;; empty match, include the skipped char to rejoin in finish
         (cons (string-ref str i) a))
        ((= i (%irregex-match-start-index m 0))
         a)
        (else
         (cons (substring str i (%irregex-match-start-index m 0)) a))))
     '()
     str
     (lambda (i a)
       (let lp ((ls (if (= i end) a (cons (substring str i end) a)))
                (res '())
                (was-char? #f))
         (cond
          ((null? ls) res)
          ((char? (car ls))
           (lp (cdr ls)
               (if (or was-char? (null? res))
                   (cons (string (car ls)) res)
                   (cons (string-append (string (car ls)) (car res))
                         (cdr res)))
               #t))
          (else (lp (cdr ls) (cons (car ls) res) #f)))))
     start
     end)))

It relies on the bug that it will see offset empty matches twice. On the first sighting it adds the prefix, and on the second one it adds the skipped character. The point of the chunk tampering behavior of irregex-fold/fast seems to be to use the chunk starting position as an undeclared part of the kons accumulator for irregex-split. However, with all of its cases and intricate postprocessing, irregex-split is still wrong:

$ guile --no-auto-compile -l irregex.scm 
GNU Guile 2.2.3
[...]
scheme@(guile-user)> (irregex-split "-" "-a-")
$8 = ("a")
scheme@(guile-user)> (irregex-split "a" "---a")
$9 = ("---")
scheme@(guile-user)> (irregex-split "a" "---aa")
$10 = ("---")
scheme@(guile-user)> (irregex-split "a" "aaa")
$11 = ()
scheme@(guile-user)> (irregex-split 'any "aaa")
$13 = ()

Irregex-split needs a careful fix, but one that relies on the corrected irregex-fold/fast.

>>142

Is this some important library that many projects depend on?

It is the default regex library in some scheme implementations. I cannot speak to its importance and number of dependents. I only started looking at it because it is used in schemebbs and we hit upon a bug in the quotelink regex in the previous thread.

199


VIP:

do not edit these