[plt-scheme] help about regexp or read-line

From: Matthew Flatt (mflatt at cs.utah.edu)
Date: Fri Apr 27 03:44:19 EDT 2007

At Fri, 27 Apr 2007 11:16:37 +0400, wwall wrote:
> I have a problem with this code:
> (define rx #rx"[_A-Za-zА-Яа-я0-9]+")
> (define zz "function яя(z){ret?rn 1+z;}  zzz(2);")
> (regexp-match-positions rx (open-input-string zz))
> returns ((0 . 8)).
> This is right, but if I define zz as
> (define zz "функция яя(z){ret?rn 1+z;}  zzz(2);")
> then (regexp-match-positions rx (open-input-string zz)) returns ((0 . 14)).
> I think it is because I use UTF, but I have a question: how can I correct
> this error?

Yes, it's a limitation of regexps on strings. The position results of
`regexp-match-positions' are always in terms of bytes, and strings are
implicitly encoded via UTF-8 to obtain bytes.
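
As a small illustration of why the two results differ (a sketch, not part of the original exchange): each Cyrillic letter in the example takes two bytes in UTF-8, so the byte count is twice the character count. The name `bs` below is just an illustrative binding.

```scheme
;; "функция" is 7 characters, but 14 bytes once encoded as UTF-8.
(string-length "функция")                       ; => 7
(bytes-length (string->bytes/utf-8 "функция"))  ; => 14

;; A byte offset can be turned back into a character offset by
;; counting the characters encoded in the bytes before that offset.
(define bs (string->bytes/utf-8 "функция яя"))
(bytes-utf-8-length bs #\? 0 14)                ; => 7
```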

Here's one way to get the answer in terms of characters (note that it
consumes the port's input through the end of the match):

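 (define (regexp-match-string-positions rx port)
   ;; Peek, so the matched bytes are still in the port to be read below
   (let ([m (regexp-match-peek-positions rx port)])
     (and 
      m
      ;; Read the bytes before the match and count them as characters
      (let ([start (bytes-utf-8-length (read-bytes (caar m) port))])
        (list 
         (cons start
               ;; Read the matched bytes and count them as characters
               (+ start 
                  (bytes-utf-8-length 
                   (read-bytes (- (cdar m) (caar m)) port)))))))))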
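
If the text is already in memory as a string rather than behind a port, the same idea can be sketched without consuming any input: match against the UTF-8 bytes, then convert each byte offset to a character offset. The helper name `string-match-char-positions` is hypothetical and not part of PLT Scheme; treat this as one possible approach rather than the recommended one.

```scheme
(define rx #rx"[_A-Za-zА-Яа-я0-9]+")

;; Hypothetical helper: like regexp-match-positions, but always
;; reports character positions for a string input.
(define (string-match-char-positions rx str)
  (let* ([bs (string->bytes/utf-8 str)]
         [m (regexp-match-positions rx bs)])   ; byte positions
    (and m
         (map (lambda (p)
                (and p   ; unmatched groups are reported as #f
                     (cons (bytes-utf-8-length bs #\? 0 (car p))
                           (bytes-utf-8-length bs #\? 0 (cdr p)))))
              m))))

(string-match-char-positions rx "функция яя(z)")  ; => ((0 . 7))
```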

Matthew



Posted on the users mailing list.