[racket-dev] Speeding up `in-directory`

From: Sam Tobin-Hochstadt (samth at cs.indiana.edu)
Date: Wed Sep 4 12:05:47 EDT 2013

That's very nice -- and faster, to boot. One thing I'm confused about
-- why is the `parameterize` around the second call to
`directory-list` needed?  Does it query `current-directory`
internally?

Oh, and a `define-sequence-syntax` version is even a little faster.
I'll push all of this soon.

Sam

On Wed, Sep 4, 2013 at 11:03 AM, Matthew Flatt <mflatt at cs.utah.edu> wrote:
> When continuations are too expensive, consider representing the
> continuation explicitly. In this case, I think a list of paths
> represents the continuation easily enough.
>
> (define (in-directory5 [orig-dir #f])
>   (define init-dir
>     (or orig-dir (current-directory)))
>   ;; current state of the sequence is a list of paths to produce; when
>   ;; incrementing past a directory, add the directory's immediate
>   ;; content to the front of the list:
>   (define (next l)
>     (if (directory-exists? (car l))
>         (append (if orig-dir
>                     ;; paths are always complete:
>                     (directory-list (car l) #:build? #t)
>                     ;; work relative to `orig-dir`:
>                     (parameterize ([current-directory init-dir])
>                       (directory-list (car l) #:build? #t)))
>                 (cdr l))
>         (cdr l)))
>   (make-do-sequence
>    (lambda ()
>      (values
>       car
>       next
>       (if orig-dir
>           (directory-list orig-dir #:build? #t)
>           (directory-list))
>       pair?
>       #f
>       #f))))
>
> At Wed, 4 Sep 2013 10:01:03 -0400, Sam Tobin-Hochstadt wrote:
>> Inspired by a post about a faster directory iteration in Haskell [1],
>> I decided to try doing the same for Racket.  The results are here:
>> https://gist.github.com/samth/6437192
>>
>> The current implementation uses continuations, which are pretty slow.
>> The fastest solution would fuse the traversal and processing code, but
>> Racket's optimizer isn't there yet for any implementation I can come
>> up with.  I tried a few different strategies:
>>
>> 1. Build a list of all of the entries, and then just return that.
>> This is quite fast (3x faster than `in-directory), even for 30k+
>> entries, but it does excessive IO when it's not needed.
>> 2. Use a thread and a channel to communicate.  This is about twice as
>> slow (on my SSD) as the list approach, but it doesn't do any excessive
>> IO.
>> 3. Use a thread, a channel, and a vector as a buffer.  This can be as
>> fast as or faster than the list approach with a large buffer size
>> (1000 elements), but it's quite competitive even with a buffer size of
>> 5.
>>
>> My channel implementation reads one entry ahead of where the current
>> `in-directory` is.
>>
>> If we think it's important never to do _any_ extra I/O, then I think
>> we should use a modified version of my 2nd implementation. If we're
>> willing to relax that, we should probably go with the 3rd approach
>> with a buffer size of 5 or 10.
>>
>> Note that even my fastest code is 2.5x slower than Unix `find`, even
>> when discounting the Racket startup overhead.
>>
>> Sam
>>
>> [1] https://github.com/JohnLato/posix-paths
>> _________________________
>>   Racket Developers list:
>>   http://lists.racket-lang.org/dev

Posted on the dev mailing list.