[racket-dev] Speeding up `in-directory`

From: Matthew Flatt (mflatt at cs.utah.edu)
Date: Wed Sep 4 11:03:26 EDT 2013

When continuations are too expensive, consider representing the
continuation explicitly. In this case, I think a list of paths
represents the continuation easily enough.

(define (in-directory5 [orig-dir #f])
  (define init-dir
    (or orig-dir (current-directory)))
  ;; current state of the sequence is a list of paths to produce; when
  ;; incrementing past a directory, add the directory's immediate
  ;; content to the front of the list:
  (define (next l)
    (if (directory-exists? (car l))
        (append (if orig-dir
                    ;; paths are always complete:
                    (directory-list (car l) #:build? #t)
                    ;; work relative to `orig-dir`:
                    (parameterize ([current-directory init-dir])
                      (directory-list (car l) #:build? #t)))
                (cdr l))
        (cdr l)))
  (make-do-sequence
   (lambda ()
     (values
      car
      next
      (if orig-dir
          (directory-list orig-dir #:build? #t)
          (directory-list))
      pair?
      #f
      #f))))

At Wed, 4 Sep 2013 10:01:03 -0400, Sam Tobin-Hochstadt wrote:
> Inspired by a post about a faster directory iteration in Haskell [1],
> I decided to try doing the same for Racket.  The results are here:
> https://gist.github.com/samth/6437192
> 
> The current implementation uses continuations, which are pretty slow.
> The fastest solution would fuse the traversal and processing code, but
> Racket's optimizer isn't there yet for any implementation I can come
> up with.  I tried a few different strategies:
> 
> 1. Build a list of all of the entries, and then just return that.
> This is quite fast (3x faster than `in-directory), even for 30k+
> entries, but it does excessive IO when it's not needed.
> 2. Use a thread and a channel to communicate.  This is about twice as
> slow (on my SSD) as the list approach, but it doesn't do any excessive
> IO.
> 3. Use a thread, a channel, and a vector as a buffer.  This can be as
> fast as or faster than the list approach with a large buffer size
> (1000 elements), but it's quite competitive even with a buffer size of
> 5.
> 
> My channel implementation reads one entry ahead of where the current
> `in-directory` is.
> 
> If we think it's important never to do _any_ extra I/O, then I think
> we should use a modified version of my 2nd implementation. If we're
> willing to relax that, we should probably go with the 3rd approach
> with a buffer size of 5 or 10.
> 
> Note that even my fastest code is 2.5x slower than Unix `find`, even
> when discounting the Racket startup overhead.
> 
> Sam
> 
> [1] https://github.com/JohnLato/posix-paths
> _________________________
>   Racket Developers list:
>   http://lists.racket-lang.org/dev

Posted on the dev mailing list.