[racket-dev] Speeding up `in-directory`

From: Matthew Flatt (mflatt at cs.utah.edu)
Date: Wed Sep 4 12:29:26 EDT 2013

At Wed, 4 Sep 2013 12:05:47 -0400, Sam Tobin-Hochstadt wrote:
> That's very nice -- and faster, to boot. One thing I'm confused about
> -- why is the `parameterize` around the second call to
> `directory-list` needed?  Does it query `current-directory`
> internally?

Yes: `(car l)` will be a path that's mean to be relative to the initial
directory, and the result paths should be similarly relative.

I now see that the "always a complete path" comment below is wrong, and
the code generally doesn't work right if `current-directory` changes
between the time that `in-directory` is called and the sequence is
started. Here's a corrected version (still not tested enough), followed
by a version that I like more (also not tested enough).

(define (in-directory5 [orig-dir #f])
  (define init-dir (current-directory))
  ;; current state of the sequence is a list of paths to produce; when
  ;; incrementing past a directory, add the directory's immediate
  ;; content to the front of the list:
  (define (next l)
    (if (directory-exists? (car l))
        (append (parameterize ([current-directory init-dir])
                  (directory-list (car l) #:build? #t))
                (cdr l))
        (cdr l)))
  (make-do-sequence
   (lambda ()
     (values
      car
      next
      (parameterize ([current-directory init-dir])
        (if orig-dir
            (directory-list orig-dir #:build? #t)
            (directory-list)))
      pair?
      #f
      #f))))

(define (in-directory6 [orig-dir #f])
  (define init-dir (current-directory))
  ;; current state of the sequence is a list of paths to produce; when
  ;; incrementing past a directory, add the directory's immediate
  ;; content to the front of the list:
  (define (next l)
    (if (directory-exists? (car l))
        (append (let ([d (car l)])
                  (for/list ([f (in-list
                                 (directory-list 
                                  (path->complete-path d init-dir)))])
                    (build-path d f)))
                (cdr l))
        (cdr l)))
  (make-do-sequence
   (lambda ()
     (values
      car
      next
      (if orig-dir
          (next (list orig-dir))
          (directory-list init-dir))
      pair?
      #f
      #f))))


> Oh, and a `define-sequence-syntax` version is even a little faster.
> I'll push all of this soon.
> 
> Sam
> 
> On Wed, Sep 4, 2013 at 11:03 AM, Matthew Flatt <mflatt at cs.utah.edu> wrote:
> > When continuations are too expensive, consider representing the
> > continuation explicitly. In this case, I think a list of paths
> > represents the continuation easily enough.
> >
> > (define (in-directory5 [orig-dir #f])
> >   (define init-dir
> >     (or orig-dir (current-directory)))
> >   ;; current state of the sequence is a list of paths to produce; when
> >   ;; incrementing past a directory, add the directory's immediate
> >   ;; content to the front of the list:
> >   (define (next l)
> >     (if (directory-exists? (car l))
> >         (append (if orig-dir
> >                     ;; paths are always complete:
> >                     (directory-list (car l) #:build? #t)
> >                     ;; work relative to `orig-dir`:
> >                     (parameterize ([current-directory init-dir])
> >                       (directory-list (car l) #:build? #t)))
> >                 (cdr l))
> >         (cdr l)))
> >   (make-do-sequence
> >    (lambda ()
> >      (values
> >       car
> >       next
> >       (if orig-dir
> >           (directory-list orig-dir #:build? #t)
> >           (directory-list))
> >       pair?
> >       #f
> >       #f))))
> >
> > At Wed, 4 Sep 2013 10:01:03 -0400, Sam Tobin-Hochstadt wrote:
> >> Inspired by a post about a faster directory iteration in Haskell [1],
> >> I decided to try doing the same for Racket.  The results are here:
> >> https://gist.github.com/samth/6437192
> >>
> >> The current implementation uses continuations, which are pretty slow.
> >> The fastest solution would fuse the traversal and processing code, but
> >> Racket's optimizer isn't there yet for any implementation I can come
> >> up with.  I tried a few different strategies:
> >>
> >> 1. Build a list of all of the entries, and then just return that.
> >> This is quite fast (3x faster than `in-directory), even for 30k+
> >> entries, but it does excessive IO when it's not needed.
> >> 2. Use a thread and a channel to communicate.  This is about twice as
> >> slow (on my SSD) as the list approach, but it doesn't do any excessive
> >> IO.
> >> 3. Use a thread, a channel, and a vector as a buffer.  This can be as
> >> fast as or faster than the list approach with a large buffer size
> >> (1000 elements), but it's quite competitive even with a buffer size of
> >> 5.
> >>
> >> My channel implementation reads one entry ahead of where the current
> >> `in-directory` is.
> >>
> >> If we think it's important never to do _any_ extra I/O, then I think
> >> we should use a modified version of my 2nd implementation. If we're
> >> willing to relax that, we should probably go with the 3rd approach
> >> with a buffer size of 5 or 10.
> >>
> >> Note that even my fastest code is 2.5x slower than Unix `find`, even
> >> when discounting the Racket startup overhead.
> >>
> >> Sam
> >>
> >> [1] https://github.com/JohnLato/posix-paths
> >> _________________________
> >>   Racket Developers list:
> >>   http://lists.racket-lang.org/dev

Posted on the dev mailing list.