[racket-dev] regexp.c and lookahead

From: Tony Garnock-Jones (tonyg at ccs.neu.edu)
Date: Sat Jun 14 18:18:05 EDT 2014

Hi all,

At the moment, when regexp.c runs out of buffered lookahead during a
regexp-try-match, it peeks a few bytes. However, it looks like it will
never peek *fewer* than 16 bytes (unless eof occurs before then).

I have written the package "incremental-input", which lets a blocking
read (e.g. read-json) be fed input as it becomes available, event-style.

When testing using read-json from the "json" collection, I find that it
blocks unnecessarily even though a complete input is available. See
https://github.com/tonyg/racket-incremental-input/blob/master/incremental-input/main.rkt#L148-L157
to see how it manifests. Deleting any of the whitespace in the
byte-string on line 147 causes an unnecessary suspension.

The proximate cause is the greedy lookahead buffering in regexp-try-match.

Would it be possible for regexp.c to be satisfied with a partially full
lookahead buffer, so long as it is long enough to properly evaluate the
regexp under consideration? The specific regexp-try-match being used
here is just "^]", which needs just one byte of lookahead.
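
A minimal sketch of a way to observe this outside the package, assuming an
ordinary make-pipe port is an acceptable stand-in for the incremental-input
wrapper (it may not be: a pipe can report partial availability differently
from a custom port). The helper name `probe` is hypothetical and not part of
any library. It writes some bytes to a pipe, leaves the pipe open so that no
eof arrives, and reports whether regexp-try-match on "^]" returns within a
timeout or is still waiting for more lookahead.

```racket
#lang racket/base

;; Hypothetical helper, for illustration only.
;; Writes `bs` to a fresh pipe, keeps the pipe open (so no eof is seen),
;; and checks whether regexp-try-match returns within `secs` seconds.
(define (probe bs secs)
  (define-values (in out) (make-pipe))
  (write-bytes bs out)
  (flush-output out)
  (define ch (make-channel))
  (define worker
    (thread
     (lambda ()
       ;; Box the result so that a failed match (#f) can still be
       ;; told apart from a timeout, which also yields #f below.
       (channel-put ch (box (regexp-try-match #rx"^]" in))))))
  (define answer (sync/timeout secs ch))
  ;; Stop the worker if it is still waiting for more input.
  (kill-thread worker)
  (if answer
      (list 'returned (unbox answer))
      'blocked))

;; Only the single byte the regexp needs is available.
(probe #"]" 1)

;; The same byte followed by 16 bytes of padding.
(probe #"]                " 1)
```

If the first call reports 'blocked while the second returns a match, that
would be consistent with the matcher insisting on a minimum amount of
lookahead rather than on what the regexp itself requires.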

Alternatively, perhaps I'm overlooking something simpler. Perhaps
there's something I can do in (make-wrapper) with progress-events or
similar, to convince regexp.c to work with what it has before asking for
more?

After all, (read-json) at the REPL seems to detect the end of a JSON
term without gratuitous whitespace or eof!

Tony


Posted on the dev mailing list.