[racket-dev] proposal for moving to packages: repository

From: Eli Barzilay (eli at barzilay.org)
Date: Wed May 29 08:00:31 EDT 2013

Now for the problems that are likely worth paying attention to, and
suggestions for improving things...

The quick summary of what I'm going to say is that I think a
significant improvement is possible with some more work, including a
small amount of manual intervention.  Because of this, I think that
it's best to work with a whole-repository database of file movements,
which will be generated automatically but can be revised manually to
fix things.  Your scripts will change to parse this file instead of
running git directly, but since the format will be uniform, this
should be easy to adjust.

And a point of clarification: as you noted, these problems are not
things that you'll see in blames now.  For example, cases of
misidentification are in many places obviously nonsense, and real
cases are rare.  Another example is a commit that removed a bunch of
code that you want to go over: currently, you'll see the commit that
removed a file in your history, and the removed file is visible in
that commit, but it won't be if it's "truncated away".

I'll repeat here that I'm personally fine with not doing any of this,
but I think that most people do care about losing these bits.  Also,
note that some of these problems are likely to go away in some future
git (for example, search for "fractions" in the problems below to see
a feature that git doesn't have now but might improve in the future),
so an improved future "blame" will actually produce better output when
things are fixed manually even though currently the result won't
differ as much with these fixes.


A good starting point for the whole-repo database of file movements
is:

  git log --date-order --format='----%n%h %ai %s' \
      --name-status -M -C --find-copies-harder -l20000 -B

For reference, I've put this output here:

  http://tmp.barzilay.org/git-log.txt

I'm thinking of starting with this text, and manually fixing things
like removing bogus copies/moves, and adding ones that git missed.  In
addition, there should be some "enrichment" to the format, to specify
where deleted files go -- so it's possible to go over removed files
(everything that starts with "D") and assign them to package repos.
(Many of them are easy to do since their destination package is
obvious.)
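
As a rough sketch of how a script could read the output of the command
above into such a database (Python is an arbitrary choice here, and the
names `Commit`, `Change`, `parse_log`, and `deleted_paths` are made up
for illustration, not part of any existing script), assuming each commit
starts with a "----" line, then a "%h %ai %s" header, then tab-separated
name-status lines:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Change:
    status: str           # "A", "M", "D", "R", or "C"
    score: Optional[int]  # similarity percentage for R/C entries, else None
    paths: List[str]      # one path, or [source, destination] for R/C


@dataclass
class Commit:
    sha: str
    date: str
    subject: str
    changes: List[Change] = field(default_factory=list)


def parse_log(text: str) -> List[Commit]:
    """Parse log text in the assumed '----' / header / name-status layout."""
    commits: List[Commit] = []
    expect_header = False
    for line in text.splitlines():
        if line == "----":
            expect_header = True
        elif expect_header:
            # "%ai" prints three space-separated fields: date, time, zone.
            parts = line.split(" ", 4)
            parts += [""] * (5 - len(parts))
            sha, day, clock, zone, subject = parts
            commits.append(Commit(sha, f"{day} {clock} {zone}", subject))
            expect_header = False
        elif "\t" in line and commits:
            # Status lines look like "M<TAB>path" or "C100<TAB>src<TAB>dst".
            code, *paths = line.strip().split("\t")
            score = int(code[1:]) if len(code) > 1 else None
            commits[-1].changes.append(Change(code[0], score, paths))
    return commits


def deleted_paths(commits: List[Commit]) -> List[Tuple[str, str]]:
    """List (sha, path) for every "D" entry, ready to be assigned by hand."""
    return [(c.sha, ch.paths[0])
            for c in commits
            for ch in c.changes
            if ch.status == "D"]
```

Keeping the parsed records separate from git means that a manual fix
(removing a bogus copy, adding a missed move, tagging a deleted file
with its package) is just an edit to the text file that this reads.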

The following is a list of problem examples, which can be addressed as
above.


Here is a problem where some potentially useful history is lost:

2a94ca9 Eric Dobson (3 weeks ago) Cleanup tc-lambda-unit.
  M	collects/typed-racket/typecheck/tc-lambda-unit.rkt
  D	collects/typed-racket/typecheck/parse-cl.rkt

c25ed74 Stephen Bloch (7 weeks ago) Moved error-message tests into a module+ in main source file.
  M	collects/picturing-programs/private/tiles.rkt
  D	collects/picturing-programs/tests/tiles-error-tests.rkt

1838953 Vincent St-Amour (5 months ago) Move define-inline to racket/performance-hint.
  M	collects/scribblings/reference/syntax.scrbl
  D	collects/unstable/scribblings/inline.scrbl

=> In these cases, the second file got "cleaned up" into the first, but
   git considers them unrelated by default, so the part of the first
   file's history that lived in the second is lost unless it is kept
   explicitly.

9f337c6 Jay McCarthy (10 weeks ago) Removing the planet2 name from the code
  A	collects/tests/pkg/tests-checksums.rkt
  A	collects/tests/pkg/tests-conflicts.rkt
  A	collects/tests/pkg/tests-deps.rkt
  D	collects/tests/planet2/tests-checksums.rkt
  D	collects/tests/planet2/tests-conflicts.rkt
  D	collects/tests/planet2/tests-deps.rkt
  ... lots of these ...

=> In these cases files were renamed with enough changes that git
   misses the fact that they were renamed.  (BTW, for this reason I
   recommended that renames be done without other modifications, with
   any changes made in a separate commit.)
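
One possible way to find candidates for such missed renames, as a sketch
only: within a single commit, pair a deleted path with an added path
when they share a file name and the match is unambiguous.  This is a
heuristic of my own for illustration (the function name
`pair_missed_renames` is made up), and its output would still need a
manual look before going into the database.

```python
import posixpath
from collections import defaultdict
from typing import Iterable, List, Tuple


def pair_missed_renames(
    changes: Iterable[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Given (status, path) entries of one commit, return (old, new) pairs.

    A pair is reported only when exactly one deleted and exactly one
    added path share the same base name, so common names such as
    "info.rkt" that appear several times are left for manual review.
    """
    deleted = defaultdict(list)
    added = defaultdict(list)
    for status, path in changes:
        name = posixpath.basename(path)
        if status == "D":
            deleted[name].append(path)
        elif status == "A":
            added[name].append(path)
    pairs = []
    for name, olds in deleted.items():
        news = added.get(name, [])
        if len(olds) == 1 and len(news) == 1:
            pairs.append((olds[0], news[0]))
    return sorted(pairs)
```

For the planet2 commit above, this would pair each
collects/tests/planet2/tests-*.rkt file with its namesake under
collects/tests/pkg/.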

Lowering the similarity threshold might help with the above, but the
first problem is that git measures changes relative to the overall file
size, so if the second file is big enough, it will not help.  Also,
there are these problems:

198a65a Matthew Flatt (13 days ago) raco pkg create: support "source" and "binary" bundling
  C100	collects/launcher/sh	collects/tests/pkg/test-pkgs/pkg-x/nobin-top.txt
  ...

6c1e163 Matthew Flatt (1 year, 2 months ago) add missing "jfp.css"
  C100	collects/launcher/sh	collects/scribble/jfp/jfp.css

=> Empty files are an obvious problem here, since they are 100% similar,
   and therefore each is considered a copy of some random empty file.
   We cannot just ignore empty files, since this happens with other
   files too:

fae660b Jay McCarthy (7 months ago) Release Planet 2 (beta)
  C056	collects/meta/drdr2/analyzer/analyzer.rkt	collects/tests/planet2/test-pkgs/planet2-test1-conflict/planet2-test1/conflict.rkt
  ... many more ...

b2b5875 Blake Johnson (2 years, 7 months ago) replacing self modidx refs and tests
  C092	collects/meta/drdr2/analyzer/analyzer.rkt	collects/tests/compiler/demodularizer/tests/racket-5.rkt

=> In these cases, the drdr2 file is just "#lang racket", and git
   considers it the original of the second file, since the second just
   adds an "(exit 43)" line (first example) or "5" (second).  This
   happens elsewhere too, a lot with readers and info files:

145efa6 Jay McCarthy (1 year, 2 months ago) Adding #lang web-server/base
  C061	collects/frtime/lang/reader.rkt	collects/web-server/base/lang/reader.rkt
32d2a9c Matthew Flatt (3 years, 1 month ago) fix scheme/load and racket/load
  C087	collects/slideshow/lang/reader.rkt	collects/racket/load/lang/reader.rkt
c7e723e Matthew Flatt (3 years, 1 month ago) somewhat rackety core docs
  C060	collects/frtime/opt/lang/reader.ss	collects/racket/signature/lang/reader.ss
  C066	collects/frtime/reactive/lang/reader.ss	collects/racket/unit/lang/reader.ss

09bed0d Kevin Tew (1 year, 3 months ago) Initial Distributed Places commit
  C100	collects/combinator-parser/info.rkt	collects/racket/place/distributed/info.rkt
e788903 Eli Barzilay (1 year, 9 months ago) Remove a bunch of no-longer-needed `compile-omit-paths', and move the few ones into the subcollections.
  C100	collects/embedded-gui/private/tests/info.rkt	collects/tests/gracket/info.rkt
  C060	collects/embedded-gui/private/tests/info.rkt	collects/tests/plai/info.rkt
  C056	collects/embedded-gui/private/tests/info.rkt	collects/tests/planet/info.rkt
  ...
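
A first pass over such entries could flag copies whose source is empty
or nearly empty, so they can be reviewed and removed by hand.  The
sketch below assumes the line count of each source file has already
been collected somewhere (the `line_counts` mapping, the function name,
and the `min_lines` cutoff of 3 are all made up for illustration):

```python
from typing import Dict, Iterable, List, Tuple

CopyEntry = Tuple[int, str, str]  # (similarity score, source, destination)


def suspicious_copies(
    copies: Iterable[CopyEntry],
    line_counts: Dict[str, int],
    min_lines: int = 3,
) -> List[CopyEntry]:
    """Return the copy entries whose source has fewer than min_lines lines.

    Sources missing from line_counts are not flagged, since nothing is
    known about them.
    """
    flagged = []
    for score, src, dst in copies:
        count = line_counts.get(src)
        if count is not None and count < min_lines:
            flagged.append((score, src, dst))
    return flagged
```

This only catches the empty-file and "#lang racket" kind of source; it
says nothing about short readers and info files that happen to be a few
lines longer than the cutoff.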

There are also counterexamples:

2d12431 Matthew Flatt (5 months ago) move and fixup docs for the "help" collection
  R052	collects/scribblings/tools/documentation-utils.scrbl	collects/help/help.scrbl

fd7d8a4 Ryan Culpepper (6 months ago) move lazy-require to racket/lazy-require
  R064	collects/unstable/lazy-require.rkt	collects/racket/lazy-require.rkt

b53e458 Matthew Flatt (9 months ago) add `racket/format'
  C068	collects/unstable/cat.rkt	collects/racket/format.rkt

=> Here the similarity is very low (the cutoff is 50%), but each one is
   actually a correct identification of a move that should be kept.
   (These would also not be a problem had people done the
   move-in-a-separate-commit thing.)

This can probably be improved a lot with a script that counts the
changed lines for low percentages and weighs in the length of the file
in lines.  (The log command line allows specifying the threshold only
in fractions, so it has to be a separate script.)
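
A minimal sketch of what such a script could compute, assuming the two
file versions are already available as text: it counts the lines the
two versions share, and accepts a move only if that count is large
enough both in absolute terms and relative to the shorter file.  The
names and both thresholds are made up for illustration, and the line
matching uses Python's difflib rather than git's own similarity
measure.

```python
import difflib
from typing import Tuple


def line_stats(old_text: str, new_text: str) -> Tuple[int, int, int]:
    """Return (shared lines, lines in old, lines in new)."""
    old = old_text.splitlines()
    new = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return kept, len(old), len(new)


def plausible_move(
    old_text: str,
    new_text: str,
    min_kept_lines: int = 10,        # hypothetical absolute cutoff
    min_kept_fraction: float = 0.3,  # hypothetical relative cutoff
) -> bool:
    """Guess whether new_text is a real move or copy of old_text.

    The absolute cutoff rejects tiny sources such as a lone
    "#lang racket" line; the relative cutoff is measured against the
    shorter file, so a small file folded into a big one can still pass.
    """
    kept, n_old, n_new = line_stats(old_text, new_text)
    if n_old == 0 or n_new == 0:
        return False
    shorter = min(n_old, n_new)
    return kept >= min_kept_lines and kept / shorter >= min_kept_fraction
```

With these cutoffs a "#lang racket" file plus one added line is
rejected, because only one line is shared, regardless of how high the
percentage is.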


-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                    http://barzilay.org/                   Maze is Life!

Posted on the dev mailing list.