[racket] Regex for blank line?

From: Jon Zeppieri (zeppieri at gmail.com)
Date: Wed Jun 8 15:47:49 EDT 2011

I have functions below that pass your tests, but a lot depends on what
you count as whitespace (not to mention newlines, since one of your
comments suggests that you want to match all Unicode
line-terminators). The functions below assume that "newline" means
#\u000A and "whitespace" means space, tab, newline, formfeed, return.

Both regexps use [^\S\n] to match non-newline whitespace.
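
The double negative is easy to misread, so here is a small sketch of what
that class accepts. The names non-newline-ws and non-newline-ws? are mine,
not part of the functions below, and I only try a handful of ASCII
characters:

```racket
#lang racket
;; [^\S\n] reads as "not (non-whitespace or newline)",
;; i.e. \s whitespace with the newline taken out.
;; In a Racket string the backslash of \S must be doubled; the "\n" is an
;; actual newline character, which is allowed inside a character class.
(define non-newline-ws #px"^[^\\S\n]$")

;; Hypothetical helper: does the one-character string for c match the class?
(define (non-newline-ws? c)
  (regexp-match? non-newline-ws (string c)))

(non-newline-ws? #\space)   ; #t
(non-newline-ws? #\tab)     ; #t
(non-newline-ws? #\return)  ; #t
(non-newline-ws? #\newline) ; #f
(non-newline-ws? #\A)       ; #f
```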

On Wed, Jun 8, 2011 at 2:13 PM, Richard Lawrence
<richard.lawrence at berkeley.edu> wrote:
> Hi everyone,
>
> I'm sure this is a really trivial question, but I've been trying on my
> own for some time now, and I can't quite figure it out.  I am trying to
> define a pair of functions, skip-whitespace and skip-blank-line, that do
> the following:
>
> - skip-whitespace should consume any whitespace characters from an input
>  port, possibly up to and including a single newline, but it should not
>  consume any more whitespace after a newline--i.e., it should not skip a
>  blank line in the input
>
> e.g.,
> (define ip (open-input-string "  ABC"))
> (define ip2 (open-input-string "  \n\t\nABC"))
> (define ip3 (open-input-string "ABC"))
> (skip-whitespace ip) (skip-whitespace ip2) (skip-whitespace ip3)
> (peek-char ip) ; should be #\A
> (peek-char ip2) ; should be #\tab
> (peek-char ip3) ; should be #\A

(: skip-whitespace (Input-Port -> Boolean))
(define (skip-whitespace in)
  (and (regexp-match #px"[^\\S\\\n]*\\\n?" in) #t))

>
> - skip-blank-line should consume whitespace characters from an input
>  port just in case that sequence of whitespace characters ends in a
>  newline, and not consume any input otherwise
>
> e.g.,
> (define ip (open-input-string "  ABC"))
> (define ip2 (open-input-string "  \n\t\nABC"))
> (define ip3 (open-input-string "ABC"))
> (skip-blank-line ip) (skip-blank-line ip2) (skip-blank-line ip3)
> (peek-char ip) ; should be #\space
> (peek-char ip2) ; should be #\tab
> (peek-char ip3) ; should be #\A

(: skip-blank-line (Input-Port -> Boolean))
(define (skip-blank-line in)
  (and (regexp-try-match #px"^[^\\S\\\n]*\\\n" in) #t))
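
The reason for regexp-try-match here rather than regexp-match is what
happens to the port when the match fails. A rough illustration in untyped
Racket follows; the names pattern, p1, p2, p3 and friends are just for this
sketch, and I spell the newline as a plain "\n" in the pattern, which I
expect to behave the same as the escaped form above:

```racket
#lang racket
(define pattern #px"^[^\\S\n]*\n")

;; regexp-match on a port: a failed match still reads the port
;; (to end-of-file here), so the input is gone afterwards.
(define p1 (open-input-string "  ABC"))
(define r1 (regexp-match pattern p1))      ; #f
(define after-match (peek-char p1))        ; end-of-file object

;; regexp-try-match: a failed match leaves the port untouched.
(define p2 (open-input-string "  ABC"))
(define r2 (regexp-try-match pattern p2))  ; #f
(define after-try (peek-char p2))          ; #\space

;; On success, only the matched bytes are consumed.
(define p3 (open-input-string "  \n\t\nABC"))
(define r3 (regexp-try-match pattern p3))  ; '(#"  \n")
(define after-success (peek-char p3))      ; #\tab
```

The leading ^ matters with regexp-try-match: without it, a successful match
further along in the port would also discard everything in front of it.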

> [snip]

> This works fine. But I can't figure out how to write the parallel regexp
> for skip-blank-line.  All the regexps I can come up with either read too
> much whitespace or too little.
>
> #lang typed/racket
> (: skip-blank-line (Input-Port -> Boolean))
> (define (skip-blank-line in)
>  (if (try-read #px"^[[:blank:]]*$" in) #t #f))
>
> This consumes too little in the second case: it doesn't consume the
> initial spaces and newline of ip2; the next char is #\space rather than
> #\tab.  (The same is true if I change the character class :blank: to
> :space:.)

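By default, $ matches only at the end of input. It matches at a newline
only in multi mode, (?m:...), and even there it matches the position just
before the newline without consuming it [see
http://docs.racket-lang.org/reference/regexp.html?q=regexp#(def._((quote._~23~25kernel)._regexp))]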
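
To make that concrete, here is a small sketch run against a string with the
same contents as your ip2. The names input, default-mode, multi-mode and
with-newline are only for illustration:

```racket
#lang racket
(define input "  \n\t\nABC")

;; Default mode: $ only matches at the very end of the input, so this fails.
(define default-mode (regexp-match #px"^[[:blank:]]*$" input))      ; #f

;; Multi mode: $ matches just before the newline, but it is zero-width,
;; so the newline itself is not part of the match.
(define multi-mode (regexp-match #px"(?m:^[[:blank:]]*$)" input))   ; '("  ")

;; Matching the newline explicitly includes it in the match.
(define with-newline (regexp-match #px"^[[:blank:]]*\n" input))     ; '("  \n")
```

So even in multi mode, a pattern ending in $ would leave the newline unread
on a port, which is why the regexp above ends in an explicit newline.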

> ... but what I could
> really use is a character class that just matches line-terminators,
> instead of :space:.  That seems to be the job of "\\p{Zl}", but I guess
> there's something I don't understand about that, because (regexp-match
> #px"\\p{Zl}" "\n") doesn't match anything.)

The newline's Unicode character category is Cc, not Zl. But Cc will
match far more than what you want.
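
You can check the categories directly with char-general-category. A quick
sketch; the choice of tab and U+0007 as "other Cc characters" is just mine,
for illustration:

```racket
#lang racket
;; General categories come back as lowercase symbols.
(char-general-category #\newline)      ; 'cc
(char-general-category #\u2028)        ; 'zl  (LINE SEPARATOR)

;; \p{Zl} matches U+2028 but not the ordinary newline.
(regexp-match? #px"\\p{Zl}" "\n")      ; #f
(regexp-match? #px"\\p{Zl}" "\u2028")  ; #t

;; \p{Cc} matches the newline, but also other control characters.
(regexp-match? #px"\\p{Cc}" "\n")      ; #t
(regexp-match? #px"\\p{Cc}" "\t")      ; #t
(regexp-match? #px"\\p{Cc}" "\u0007")  ; #t
```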

-Jon



Posted on the users mailing list.