[plt-scheme] Interacting w/ MzScheme
Don-
On Dec 10, 2004, at 12:33 AM, Don Blaheta wrote:
> For list-related administrative tasks:
> http://list.cs.brown.edu/mailman/listinfo/plt-scheme
>
> Quoth Matthias Felleisen:
>> Way back when I was young and at Rice, I actually started planning a
>> tree diff -- and promptly forgot about it that afternoon.
>
> I needed a tree diff for some of my natural language processing
> research, and I got a lot of mileage out of plain old gnudiff with a
> little bit of preprocessing. Is the problem algorithmic, or just a
> "getting around to it" issue?
>
I don't know what Matthias's response is going to be, but I'd say big
part of it is that its an algorithmic problem. The best algorithms for
calculating a minimal tree diff tend to be O(n^4) or so, where n is the
number of nodes in the tree.
Of course, this isn't a proof that you are stuck with O(n^4) to get
useful results. The standard diff implementation doesn't calculate a
minimal diff either, so you might be able to choose a different
algorithm and get usable results in a reasonable amount of time (see
the work of Chawathe and Garcia-Molina referenced below).
----
Another problem that I don't think many people have given much thought
to (and thus would be a big part of the research) is how to *display*
the feedback of the tree diff to the user; these algorithms all tend to
return edit script, which are just sequences of tree editing
operations. This is because that's the way the algorithms for string
diff were formulated as well; its just that there happens to be some
very nice ways to present string edit scripts graphically. I'm not
sure that the tree edit scripts will be so easily applied to the
visualized s-exp in your editors, especially if you choose to include
operations like "copy-subtree" or "move-subtree" in your set of edit
operations.
----
Here's my summary of a literature review I did during October while
preparing a project proposal.
- Zhang and Shasha [15] give an algorithm for solving the tree edit
problem in time
O(|T1||T2| min{height (T1), |leaves(T1)|} min{height (T2),
|leaves(T2)|})
- Bille [3] develops a unified framework for describing tree-diff
algorithms and summarizes the worst-case behavior of various tree diff
algorithms (including Zhang and Shasha’s work) with various optional
constraints on the problem domain (ordered versus unordered children,
subtree matchings, unit-cost, etc).
- Barnard et al. [2] extend previous work in Tree-to-Tree correction
(not specifically about program parse trees), adding Subtree Insertion,
Deletion, and Swapping to the set of allowed operations.
- Chawathe and Garcia-Molina [6, 5] explored Tree transformation
supporting subtree move and copy. where the rules are applied in
parallel rather than in sequence. Their technique produces non-optimal
solutions (in order to allow a more efficient algorithm), reducing the
problem of change detection to one of choosing a min-cost edge cover of
a bipartite graph.
- Yang [13] developed a tool for calculating differences between parse
trees of C programs, focusing mainly on how to generate matchings
between the nodes of the parse tree and on how to display the diff to
the user. His tool did not support a Copy-Subtree operation in the tree
edit scripts, and did not encode any knowledge of the operational
semantics of C or of any common refactorings for C programs. Yang did
consider some tree comparison algorithms much such like those those in
Section 7.1. However, he discounted their relevance for his problem,
since he felt that their reliance on a distance metric between nodes
(specifically with respect to the relabeling tree operation) was at
odds with the constraints imposed upon him by the C language and its
mixing of types and constants in a single namespace that cannot be
distinguished by its context-free grammar.
- Horwitz [7] and Yang et al. [14] did incorporate knowledge of program
behavior into a diff tool. Their approaches handled variable renaming
and other semantics preserving transformations. However, their tool was
only applicable to a particularly weak language (no procedural
abstraction; just variables, constants, assignments, conditionals, and
whileloops).
- Binkley [4] applies semantic differencing on dependence graphs to
determine which test cases in a regression test suite should be re-run,
and also to generate a new program that captures only the behavior of
the affected components. While the technology employed by Binkley may
be applicable to the work here, the goals are decidedly different: his
tool was meant to produce output intended for consumption by a testing
tool, while my tool is intended to provide direct feedback for human
developers.
----
Bibliography
[2] David T. Barnard, Gwen Clarke, and Nicolas Duncan. Tree-to-tree
correction for document trees. Technical Report 95-372, Queen’s
University, January 1995.
[3] Philip Bille. Tree edit distance, alignment distance and inclusion.
Technical Report TR-2003-23, IT University of Copenhagen, March 2003.
[4] D. Binkley. Using semantic differencing to reduce the cost of
regression testing. In Proceedings of the International Conference on
Software Maintenance 1992, pages 41– 50, 1992.
[5] S. Chawathe and H. Garcia-Molina. An expressive model for comparing
tree-structured data, 1997.
[6] Sudarshan S. Chawathe, Anand Rajaraman, Garcia-Molina
Garcia-Molina, and Jennifer Widom. Change detection in hierarchically
structured information. In Proceedings of theACM SIGMOD International
Conference on Management of Data, volume 25, 2 of ACM SIGMOD Record,
pages 493–504, New York, June 4–6 1996. ACM Press.
[7] Susan Horwitz. Identifying the semantic and textual differences
between two versions of a program. ACM SIGPLAN Notices, 25(6):234–245,
June 1990.
[12] Joel Winstead and David Evans. Towards differential program
analysis. In Proceedings WODA 2003 the Workshop on Dynamic Analysis,
pages 37–40, Portland, Oregon, May 2003.
[13] Wuu Yang. Identifying syntactic differences between two programs.
Software - Practice and Experience, 21(7):739–755, 1991.
[14] Wuu Yang, Susan Horwitz, and Thomas Reps. Detecting program
components with equivalent behaviors. Technical Report CS-TR-1989-840,
University of Wisconsin, 1989.
[15] K. Zhang and D. Shasha. Simple fast algorithms for the editing
distance between trees and related problems. SIAM J. Comput.,
18(6):1245–1262, 1989.
-Felix
----
"I probably would, Nick... but I'd appreciate it more
if you wouldn't park it on my kid."
-www.redmeat.com