[racket-dev] Replacing the splay tree token-tree% with an rb-tree?

From: Robby Findler (robby at eecs.northwestern.edu)
Date: Tue Nov 13 20:15:10 EST 2012

That sounds fantastic! Thanks!

In your stress test, try doing random insertions/deletions to try to
find things that break the red/black invariant.

Robby

On Tue, Nov 13, 2012 at 5:21 PM, Danny Yoo <dyoo at hashcollision.org> wrote:
> Hi Robby,
>
> I've been looking at a profile of DrRacket when opening/indenting large
> files.  Using the statistical profiler, I've measured a large chunk of time
> (around 40%) is dedicated to the search operation of the existing
> syntax-color token-tree.  That seems too high, so I've been trying to
> understand what's going on there.
>
> I suspect that the access pattern that's applied on the tree during
> indentation resembles the worst-case scenario of splay trees: if we search
> each node in sequential order, the structure of a splay tree will linearize,
> ruining the dynamic balance.  Indentation uses a repeated, sequential search
> across the tokens.
>
> Concretely, if I add the following diagnostic to the token-tree% class:
>
> --------------------------------------------------
> +  (define (height node)
> +    (cond
> +     ((not node) 0)
> +     (else
> +      (add1 (max (height (node-left node))
> +                 (height (node-right node)))))))
> +
>        ;; Moves the node at key-position to the root
>        (define/public (search! key-position)
>          (when root
> -          (set! root (internal-search root key-position))))
> +          (set! root (internal-search root key-position))
> +          (printf "debug: size: ~a, height: ~a\n" (size root 0) (height
> root)))
> +
> --------------------------------------------------
>
> and open up a large file like collects/drracket/private/unit.rkt, I can see
> that the tree is much deeper than it deserves to be.  During a tabify-all,
> the tree does appear to get huge; it swings between a height in the hundreds
> into the thousands thousands.  That's much too tall, considering that there
> are only about 40000 tokens in the tree.
>
>
> I believe that a balanced BST should give us better performance.  I have
> been writing an alternative implementation of the token tree based on
> red-black trees.
>
>     https://github.com/dyoo/new-token-tree
>
> It is unfortunately missing the necessary splitting operations.  But I plan
> to implement split and concat with the algorithm described in:
>
>     http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.109.4875
>
> so I don't anticipate fundamental problems.
>
>
> I wanted to surprise you by getting this all working by mid-week, but it's
> taking longer than I thought... :)  So I might as well run it by you to make
> sure the idea is sound before I go further on this track.

Posted on the dev mailing list.