[racket-dev] proposal for moving to packages: repository

From: Eli Barzilay (eli at barzilay.org)
Date: Fri May 24 03:26:45 EDT 2013

8 hours ago, Matthew Flatt wrote:
> At Thu, 23 May 2013 07:09:17 -0400, Eli Barzilay wrote:
> > "Relevant history" is vague.
> 
> The history I want corresponds to `git log --follow' on each of the
> files that end up in a repository.

(In this context this is clear; the problem in Carl's post is that he
seemed to be suggesting keeping the whole repository and doing the
split by removing material from clones -- which is an even fuller
history, but one that has large parts that are irrelevant.)
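
(For concreteness, here is a rough sketch of how I'd list that
per-file history -- the path at the end is a made-up example:)

    #lang racket
    ;; Print the history that `git log --follow' would keep for a
    ;; single file.
    (define git (find-executable-path "git"))
    (define (file-history path)
      (with-output-to-string
        (lambda ()
          (system* git "log" "--follow" "--oneline" "--" path))))
    (display (file-history "collects/racket/list.rkt")) ; made-up path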


> That's true if you use `git filter-branch' in a particular way. I'll
> suggest an alternative way, which involves filtering the set of
> files in a commit-specific way. That is, the right set of files to
> keep for each commit are not the ones in the final place, but the
> ones whose history we need at each commit.

If that can be done reliably, then of course it makes it possible to
do the split reliably after the first restructure.  It does come with
a set of issues, though...

> [... scripts description ...]

Here are a bunch of things that I thought about as I went over this.
In no particular order, probably not exhaustive, and possibly
repetitive:

* Minor: better to use `find-executable-path' since it's common to
  find systems (like mine) with an antique git in /usr/bin and a
  modern one elsewhere.  (In my case, both scripts failed since
  /usr/bin has an antique version.)
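
  Something along these lines instead of hardcoding the path (a
  drop-in fragment; the error message is made up):

    ;; Prefer whatever `git' comes first on the user's PATH, and fail
    ;; early if there is none.
    (define git
      (or (find-executable-path "git")
          (error 'slice "cannot find a `git' executable in the PATH")))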

* There is an important point of fragility here: you're relying on git
  to be able to find all of the relevant file movements (renames and
  copies), which might not always be correct.  On one hand you don't
  want to miss these operations, and on the other you don't want the
  similarity threshold so low that it identifies bogus copies and
  renames.
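
  For example, pulling the movement records for a single commit with
  explicit thresholds might look like this (a fragment of the same
  hypothetical script as above; the 60%/70% numbers are guesses that
  would need tuning):

    ;; Rename/copy records for one commit, with explicit similarity
    ;; thresholds; `--no-commit-id' keeps the output to bare entries.
    (define (commit-movements sha1)
      (with-output-to-string
        (lambda ()
          (system* git "diff-tree" "-r" "--no-commit-id"
                   "-M60%" "-C70%" "--name-status" sha1))))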

* Because of this, I think that it's really best to inspect the
  results manually.  The danger of bogus copies, for example, is real,
  especially with small and very boilerplate-ish files like "info.rkt"
  files.  If there's a mistaken identification of such a copy you can
  end up with a bogus directory kept in the trimmed repo.  In
  addition, consider this information that the script detects via git
  for a specific commit:

    A/f1.ss renamed to B/f1.rkt
    A/f2.ss renamed to B/f2.rkt
    ...
    A/f47.ss renamed to B/f47.rkt
    A/f48.ss renamed to B/f48.rkt
    A/f49.ss deleted
    A/f50.ss deleted
    B/f49.rkt created
    B/f50.rkt created

  For a human reviewer it's pretty clear that this is just a
  misidentification of two more moves (likely with the kind of
  restructures that we did in the past, where a single commit both
  moves a file and changes its contents).  This is why on one hand I
  *really* like to use such scripts (to make sure that I don't miss
  such things), but OTOH I want to review the analysis results to see
  potential problems and either fix them manually or figure out a way
  to improve the analysis and run it again.
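
  A first pass of that review can even be automated; a rough sketch
  that flags deletions with a suspiciously matching creation (the
  .ss -> .rkt pattern is from our own history):

    ;; Flag `D'eleted files that have an `A'dded counterpart that
    ;; looks like an unrecognized rename.
    (define (basename p) (regexp-replace #rx"^.*/" p ""))
    (define (suspicious-pairs lines) ; lines like "D\tA/f49.ss"
      (define (paths status)
        (for/list ([l lines]
                   #:when (regexp-match? (string-append "^" status "\t") l))
          (cadr (string-split l "\t"))))
      (for*/list ([d (paths "D")] [a (paths "A")]
                  #:when (equal? (regexp-replace #rx"[.]ss$" (basename d)
                                                 ".rkt")
                                 (basename a)))
        (cons d a)))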

* I'd also worry about file movements onto paths that existed at some
  point under a different final path, and about exactly the situations
  you described, where a file was left behind but is completely new
  and should be considered separate (as in the case of a file that is
  moved with a stub created in its place).

* The script should also take care of files that were removed in the
  past.  For example, the drscheme collection had some file that got
  removed, and later (completely unrelatedly) most of the contents
  migrated to drracket.  If the result of the analysis is that most of
  the material moved this way, and because of that you decide to keep
  the old drscheme collection -- you'd also want to keep that file
  that disappeared before the move, since it's part of the relevant
  history.

  So I'd modify this script to run on the *complete* repository -- the
  whole tree and all commits -- and generate information about
  movements.  Possibly do what your script does for the whole tree,
  then add a second step that looks for files that are unaccounted for
  in the results, and decide what to do with them.

  I think that this also means that it makes sense to create a global
  database of all file movements in a single scan, instead of running
  it for each package.
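
  A sketch of such a single-scan database, keyed on the commit sha1
  (again a fragment of the same hypothetical script):

    ;; One pass over the whole history, collecting every rename/copy
    ;; record into a global sha1 -> entries table.
    (define (movement-database)
      (define out
        (with-output-to-string
          (lambda ()
            (system* git "log" "--all" "--name-status"
                     "-M" "-C" "--format=commit %H"))))
      (define db (make-hash))
      (define current #f)
      (for ([l (in-lines (open-input-string out))])
        (cond [(regexp-match #rx"^commit (.*)$" l)
               => (lambda (m) (set! current (cadr m)))]
              [(regexp-match? #rx"^[RC][0-9]*\t" l)
               (hash-update! db current (lambda (v) (cons l v)) '())]))
      db)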

* Technical: I thought that it might make sense to use a racket server
  (with netcat for the actual command), or to have it "compile" a
  /bin/sh script that does the actual work, instead of relying on
  `racket/kernel' for startup speed.  However, when I tried it on the
  plt tree, it started out spitting out new commits rapidly but
  eventually slowed down to more than a second between commits, so
  probably even the kernel trick is not helping much...
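
  For the record, the "compile to /bin/sh" idea would be roughly this
  kind of thing -- emit one script up front, so the per-commit filter
  has no racket startup at all (a sketch; the quoting would need more
  care for paths with odd characters):

    ;; Turn a list of (old . new) move pairs into a shell script; the
    ;; `test' guard is because a file may not exist in older commits.
    (define (write-filter-script moves [file "do-moves.sh"])
      (with-output-to-file file #:exists 'replace
        (lambda ()
          (printf "#!/bin/sh\n")
          (for ([m moves])
            (printf
             "if test -e '~a'; then mkdir -p \"$(dirname '~a')\" && mv '~a' '~a'; fi\n"
             (car m) (cdr m) (car m) (cdr m))))))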

* Actually, given the huge amount of time it's running (see the next
  bullet), it's probably best to make it do the movements from all
  paths at the same time.  In this specific context, this means
  rewriting the package-restructured repo (from the first step) into a
  package-restructured repo (possibly with the same toplevel names)
  with all the files moved to their correct places; the resulting repo
  can then be conveniently split into the sub-repos with a simple
  subdirectory filter, as sketched below.
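
  The cheap last step would be something like this, run in a fresh
  clone per package (the package path is made up):

    ;; Split one package out of the fully-restructured repo.
    (define (split-package pkg) ; e.g. (split-package "pkgs/drracket")
      (system* git "filter-branch" "--subdirectory-filter" pkg
               "--" "--all"))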

* And speaking of the time: what I saw was about 19k commits (I
  think; I killed it and am speaking from memory now), where it
  started out very fast and then slowed down considerably.  After
  about 5 hours it was about half-way through the 19k commits, and the
  rate had reached about 1.5 seconds per commit.  Assuming linear
  growth in time per commit, the total time grows quadratically with
  the number of commits, so the full run is four times the first half:
  about 4 x 5 = 20 hours.  (I didn't leave it running, since I don't
  want it to disturb the nightly build that will start soon.)

  This is not making it impossible -- just very hard to do reliably,
  so I really wouldn't want to see this going without a human
  supervising eye, as I described above.

* Much better would be to run this to generate human-readable and
  editable output: then not only go over this output manually to make
  sure that it all makes sense, but also identify points of
  optimization.  For example, knowing that all of drscheme/* moved to
  drracket/* is going to work out much better than dealing with each
  file separately as each commit is redone.
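
  Something as simple as one movement per line would do; a made-up
  excerpt of the kind of editable output I have in mind, where a
  reviewer can collapse per-file lines into a wildcard rule:

    # commit  old path                    -> new path
    deadbee   collects/drscheme/unit.ss   -> collects/drracket/unit.rkt
    deadbee   collects/drscheme/frame.ss  -> collects/drracket/frame.rkt
    # after review, collapsed to:
    deadbee   collects/drscheme/*         -> collects/drracket/*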

* Re the commit messages being edited with `--msg-filter': one thing
  to note is that there are already such lines in the portion of the
  history that was ported from subversion.  Even for those commits you
  probably still want to add the sha1s, since there might be
  references to the sha1s of an svn-imported commit -- such a message
  would end with the svn revision, then the original sha1 from the
  first translation of the svn commit.
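
  For reference, appending the sha1 is a one-liner kind of filter,
  since filter-branch sets $GIT_COMMIT to the sha1 of the commit being
  rewritten (a sketch, again driven from the script):

    ;; Append the pre-rewrite sha1 to every commit message.
    (system* git "filter-branch" "--msg-filter"
             "cat; echo; echo \"(original commit: $GIT_COMMIT)\""
             "--" "--all")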

* It's not clear to me what you want to do at this point, but you
  originally described two filter-branch steps: one to restructure the
  repository soon, and another to split it into packages.  If this is
  still the plan, then each of these steps would need to leave a
  historical sha1 behind.

  Alternatively, do the first restructure with in-repo moves instead;
  in that case it would be a good idea to run the slice script and
  make sure that it finds all of *these* renames correctly, as a
  first-level sanity test.
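
  That test could reuse `commit-movements' from above: the restructure
  commit should come out as all renames, so any other entry is a
  detection miss worth reviewing (a sketch):

    ;; #t iff every entry of the commit is an `R'ename record.
    (define (all-renames? sha1)
      (for/and ([l (in-lines (open-input-string (commit-movements sha1)))]
                #:when (not (equal? l "")))
        (regexp-match? #rx"^R" l)))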

* Still, I consider all of this a huge amount of work, and I still
  don't see any benefit in doing it.  Just the time spent on these
  explanations is more than what I'd spend on what I suggested
  yesterday wrt starting a separate setup step for the core and for
  the rest.

  BTW, the script is still useful, of course -- I'd probably do
  something similar, except that I'd use some shell scripts to inspect
  the history of all files, and refine it as I described above.  The
  thing is that this can be done without any effect on current
  progress, since the split during the build is made on current
  directories.

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                    http://barzilay.org/                   Maze is Life!
