[plt-scheme] Strategies for optimizing suffix tree code for PLT Scheme?

From: Danny Yoo (dyoo at hkn.eecs.berkeley.edu)
Date: Sun Nov 13 01:49:16 EST 2005

Hi everyone,

I'm trying to polish up my implementation of the Ukkonen algorithm for
building suffix trees.  I have two subprojects I'd like to apply this
toward: I want to see if they'd be useful in writing a copy-and-paste
detector in the style of PMD, and I want to visualize these trees using
MrLib graph snips.

Anyway, I have a prerelease here:

     http://hkn.eecs.berkeley.edu/~dyoo/tmp/suffixtree-v4309.tar.gz

At the moment, there's no real documentation --- my apologies!  The main
code there is 'ukkonen2.ss' and 'label.ss', with extensive unit tests in
'test-ukkonen2.ss'.

Suffix trees can be constructed with the following:

;;;;;;
> (require "ukkonen2.ss")
> (require "label.ss")
> (enable-debugging)
> (define tree (new-suffix-tree))
> (suffix-tree-add! tree (make-label "00010010$"))
;; [debugging output appears here that shows the trace of construction]
>
> (tree-contains? tree (make-label "001"))
#t
> (tree-contains? tree (make-label "0011"))
#f
>
> (define root (suffix-tree-root tree))
> (node-children root)
(#<struct:node> #<struct:node> #<struct:node>)
> (label->string (node-up-label (car (node-children root))))
"$"
> (label->string (node-up-label (cadr (node-children root))))
"10"
> (label->string (node-up-label (caddr (node-children root))))
"0"
;;;;;;


Although construction does appear to go in linear time, I want to train
myself to write tight and idiomatic PLT Scheme code.  I'm in the middle of
tracing down places that collect garbage because I see, with the help of
the timing test script 'time-tree-building.ss', that there are sporatic
discontinuities while running the algorithm across my test set.  I'm
assuming these to be the GC in action, and want to get rid of those bumps
as best as I can.

Toward that end, I'm lambda lifting all my utility functions and named
lets by hand to avoid closure construction, but that's a bit tedious.
Are there other optimizations I can do here, or unobvious PLT Scheme
gotchas that I should avoid?  Or are there automated tools to apply
optimizing source transformations?

Also, any other comments, advice, and criticism on the code would be
greatly appreciated.  Thanks for your help!



Posted on the users mailing list.