[plt-scheme] Interacting w/ MzScheme

Fri Dec 10 01:00:54 EST 2004

Don-

On Dec 10, 2004, at 12:33 AM, Don Blaheta wrote:

>   For list-related administrative tasks:
>   http://list.cs.brown.edu/mailman/listinfo/plt-scheme
>
> Quoth Matthias Felleisen:
>> Way back when I was young and at Rice, I actually started planning a
>> tree diff -- and promptly forgot about it that afternoon.
>
> I needed a tree diff for some of my natural language processing
> research, and I got a lot of mileage out of plain old gnudiff with a
> little bit of preprocessing.  Is the problem algorithmic, or just a
> "getting around to it" issue?
>

I don't know what Matthias's response is going to be, but I'd say big 
part of it is that its an algorithmic problem.  The best algorithms for 
calculating a minimal tree diff tend to be O(n^4) or so, where n is the 
number of nodes in the tree.

Of course, this isn't a proof that you are stuck with O(n^4) to get 
useful results.  The standard diff implementation doesn't calculate a 
minimal diff either, so you might be able to choose a different 
algorithm and get usable results in a reasonable amount of time (see 
the work of Chawathe and Garcia-Molina referenced below).

----

Another problem that I don't think many people have given much thought 
to (and thus would be a big part of the research) is how to *display* 
the feedback of the tree diff to the user; these algorithms all tend to 
return edit script, which are just sequences of tree editing 
operations.  This is because that's the way the algorithms for string 
diff were formulated as well; its just that there happens to be some 
very nice ways to present string edit scripts graphically.  I'm not 
sure that the tree edit scripts will be so easily applied to the 
visualized s-exp in your editors, especially if you choose to include 
operations like "copy-subtree" or "move-subtree" in your set of edit 
operations.

----

Here's my summary of a literature review I did during October while 
preparing a project proposal.

- Zhang and Shasha [15] give an algorithm for solving the tree edit 
problem in time
O(|T1||T2| min{height (T1), |leaves(T1)|} min{height (T2), 
|leaves(T2)|})

- Bille [3] develops a unified framework for describing tree-diff 
algorithms and summarizes the worst-case behavior of various tree diff 
algorithms (including Zhang and Shasha’s work) with various optional 
constraints on the problem domain (ordered versus unordered children, 
subtree matchings, unit-cost, etc).

- Barnard et al. [2] extend previous work in Tree-to-Tree correction 
(not specifically about program parse trees), adding Subtree Insertion, 
Deletion, and Swapping to the set of allowed operations.

- Chawathe and Garcia-Molina [6, 5] explored Tree transformation 
supporting subtree move and copy. where the rules are applied in 
parallel rather than in sequence. Their technique produces non-optimal 
solutions (in order to allow a more efficient algorithm), reducing the 
problem of change detection to one of choosing a min-cost edge cover of 
a bipartite graph.

- Yang [13] developed a tool for calculating differences between parse 
trees of C programs, focusing mainly on how to generate matchings 
between the nodes of the parse tree and on how to display the diff to 
the user. His tool did not support a Copy-Subtree operation in the tree 
edit scripts, and did not encode any knowledge of the operational 
semantics of C or of any common refactorings for C programs. Yang did 
consider some tree comparison algorithms much such like those those in 
Section 7.1. However, he discounted their relevance for his problem, 
since he felt that their reliance on a distance metric between nodes 
(specifically with respect to the relabeling tree operation) was at 
odds with the constraints imposed upon him by the C language and its 
mixing of types and constants in a single namespace that cannot be 
distinguished by its context-free grammar.

- Horwitz [7] and Yang et al. [14] did incorporate knowledge of program 
behavior into a diff tool. Their approaches handled variable renaming 
and other semantics preserving transformations. However, their tool was 
only applicable to a particularly weak language (no procedural 
abstraction; just variables, constants, assignments, conditionals, and 
whileloops).

- Binkley [4] applies semantic differencing on dependence graphs to 
determine which test cases in a regression test suite should be re-run, 
and also to generate a new program that captures only the behavior of 
the affected components. While the technology employed by Binkley may 
be applicable to the work here, the goals are decidedly different: his 
tool was meant to produce output intended for consumption by a testing 
tool, while my tool is intended to provide direct feedback for human 
developers.

----

Bibliography

[2] David T. Barnard, Gwen Clarke, and Nicolas Duncan. Tree-to-tree 
correction for document trees. Technical Report 95-372, Queen’s 
University, January 1995.

[3] Philip Bille. Tree edit distance, alignment distance and inclusion. 
Technical Report TR-2003-23, IT University of Copenhagen, March 2003.

[4] D. Binkley. Using semantic differencing to reduce the cost of 
regression testing. In Proceedings of the International Conference on 
Software Maintenance 1992, pages 41– 50, 1992.

[5] S. Chawathe and H. Garcia-Molina. An expressive model for comparing 
tree-structured data, 1997.

[6] Sudarshan S. Chawathe, Anand Rajaraman, Garcia-Molina 
Garcia-Molina, and Jennifer Widom. Change detection in hierarchically 
structured information. In Proceedings of theACM SIGMOD International 
Conference on Management of Data, volume 25, 2 of ACM SIGMOD Record, 
pages 493–504, New York, June 4–6 1996. ACM Press.

[7] Susan Horwitz. Identifying the semantic and textual differences 
between two versions of a program. ACM SIGPLAN Notices, 25(6):234–245, 
June 1990.

[12] Joel Winstead and David Evans. Towards differential program 
analysis. In Proceedings WODA 2003 the Workshop on Dynamic Analysis, 
pages 37–40, Portland, Oregon, May 2003.

[13] Wuu Yang. Identifying syntactic differences between two programs. 
Software - Practice and Experience, 21(7):739–755, 1991.

[14] Wuu Yang, Susan Horwitz, and Thomas Reps. Detecting program 
components with equivalent behaviors. Technical Report CS-TR-1989-840, 
University of Wisconsin, 1989.

[15] K. Zhang and D. Shasha. Simple fast algorithms for the editing 
distance between trees and related problems. SIAM J. Comput., 
18(6):1245–1262, 1989.

-Felix

----
"I probably would, Nick... but I'd appreciate it more
  if you wouldn't park it on my kid."
  -www.redmeat.com