[racket-dev] Speeding up `in-directory`
That's very nice -- and faster, to boot. One thing I'm confused about
-- why is the `parameterize` around the second call to
`directory-list` needed? Does it query `current-directory`
internally?
Oh, and a `define-sequence-syntax` version is even a little faster.
I'll push all of this soon.
Sam
On Wed, Sep 4, 2013 at 11:03 AM, Matthew Flatt <mflatt at cs.utah.edu> wrote:
> When continuations are too expensive, consider representing the
> continuation explicitly. In this case, I think a list of paths
> represents the continuation easily enough.
>
> (define (in-directory5 [orig-dir #f])
> (define init-dir
> (or orig-dir (current-directory)))
> ;; current state of the sequence is a list of paths to produce; when
> ;; incrementing past a directory, add the directory's immediate
> ;; content to the front of the list:
> (define (next l)
> (if (directory-exists? (car l))
> (append (if orig-dir
> ;; paths are always complete:
> (directory-list (car l) #:build? #t)
> ;; work relative to `orig-dir`:
> (parameterize ([current-directory init-dir])
> (directory-list (car l) #:build? #t)))
> (cdr l))
> (cdr l)))
> (make-do-sequence
> (lambda ()
> (values
> car
> next
> (if orig-dir
> (directory-list orig-dir #:build? #t)
> (directory-list))
> pair?
> #f
> #f))))
>
> At Wed, 4 Sep 2013 10:01:03 -0400, Sam Tobin-Hochstadt wrote:
>> Inspired by a post about a faster directory iteration in Haskell [1],
>> I decided to try doing the same for Racket. The results are here:
>> https://gist.github.com/samth/6437192
>>
>> The current implementation uses continuations, which are pretty slow.
>> The fastest solution would fuse the traversal and processing code, but
>> Racket's optimizer isn't there yet for any implementation I can come
>> up with. I tried a few different strategies:
>>
>> 1. Build a list of all of the entries, and then just return that.
>> This is quite fast (3x faster than `in-directory), even for 30k+
>> entries, but it does excessive IO when it's not needed.
>> 2. Use a thread and a channel to communicate. This is about twice as
>> slow (on my SSD) as the list approach, but it doesn't do any excessive
>> IO.
>> 3. Use a thread, a channel, and a vector as a buffer. This can be as
>> fast as or faster than the list approach with a large buffer size
>> (1000 elements), but it's quite competitive even with a buffer size of
>> 5.
>>
>> My channel implementation reads one entry ahead of where the current
>> `in-directory` is.
>>
>> If we think it's important never to do _any_ extra I/O, then I think
>> we should use a modified version of my 2nd implementation. If we're
>> willing to relax that, we should probably go with the 3rd approach
>> with a buffer size of 5 or 10.
>>
>> Note that even my fastest code is 2.5x slower than Unix `find`, even
>> when discounting the Racket startup overhead.
>>
>> Sam
>>
>> [1] https://github.com/JohnLato/posix-paths
>> _________________________
>> Racket Developers list:
>> http://lists.racket-lang.org/dev