[racket-dev] proposal for moving to packages: repository
8 hours ago, Matthew Flatt wrote:
> At Thu, 23 May 2013 07:09:17 -0400, Eli Barzilay wrote:
> > "Relevant history" is vague.
>
> The history I want corresponds to `git log --follow' on each of the
> files that end up in a repository.
(In this context this is clear; the problem in Carl's post is that
he seemed to be suggesting keeping the whole repository and doing
the split by removing material from clones -- which is an even
fuller history, but one with large parts that are irrelevant.)
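As a concrete illustration of what `git log --follow' recovers, here
is a throwaway-repository sketch (all names and contents are made up
for the demo):

```shell
# Build a tiny repository in which a file is renamed, then compare
# the history seen with and without --follow.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
echo '(define x 1)' > f1.ss
git add f1.ss
git -c user.name=demo -c user.email=demo@demo commit -q -m 'add f1.ss'
git mv f1.ss f1.rkt
git -c user.name=demo -c user.email=demo@demo commit -q -m 'rename to f1.rkt'
# Without --follow, only the rename commit mentions f1.rkt;
# with --follow, git walks back through the rename as well.
git log --oneline -- f1.rkt
git log --follow --oneline -- f1.rkt
```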
> That's true if you use `git filter-branch' in a particular way. I'll
> suggest an alternative way, which involves filtering the set of
> files in a commit-specific way. That is, the right set of files to
> keep for each commit are not the ones in the final place, but the
> ones whose history we need at each commit.
If that can be done reliably, then of course it makes it possible to
do the split reliably after the first restructure.  It does come with
a set of issues, though...
> [... scripts description ...]
Here are a bunch of things that I thought about as I went over this.
In no particular order, probably not exhaustive, and possibly
repetitive:
* Minor: better to use `find-executable-path' since it's common to
find systems (like mine) with an antique git in /usr/bin and a
modern one elsewhere. (In my case, both scripts failed since
/usr/bin has an antique version.)
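The same precaution in shell form: resolve git via a PATH lookup and
refuse an antique version up front (the 1.7 cutoff below is only an
illustrative threshold):

```shell
# Locate git via PATH lookup rather than hard-coding /usr/bin/git,
# and bail out early on an antique version.
GIT=$(command -v git) || { echo 'no git on PATH' >&2; exit 1; }
ver=$("$GIT" --version | awk '{print $3}')
case "$ver" in
  0.*|1.[0-6].*) echo "git $ver is too old" >&2; exit 1 ;;
esac
echo "using $GIT (version $ver)"
```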
* There is an important point of fragility here: you're relying on
git to find all of the relevant file movements (renames and copies),
which might not always be correct.  On one hand, you don't want to
miss these operations; on the other, you don't want a similarity
threshold so low that it identifies bogus copies and renames.
* Because of this, I think that it's really best to inspect the
results manually. The danger of bogus copies, for example, is real,
especially with small and very boilerplate-ish files like "info.rkt"
files. If there's a mistaken identification of such a copy you can
end up with a bogus directory kept in the trimmed repo. In
addition, consider this information that the script detects via git
for a specific commit:
A/f1.ss renamed to B/f1.rkt
A/f2.ss renamed to B/f2.rkt
...
A/f47.ss renamed to B/f47.rkt
A/f48.ss renamed to B/f48.rkt
A/f49.ss deleted
A/f50.ss deleted
B/f49.rkt created
B/f50.rkt created
For a human reviewer, it's pretty clear that this is just a
misidentification of two more moves (likely with the kind of
restructures we did in the past, where a single commit both moves a
file and changes its contents).  This is why on one hand I
*really* like to use such scripts (to make sure that I don't miss
such things), but OTOH I want to review the analysis results to see
potential problems and either fix them manually or figure out a way
to improve the analysis and run it again.
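This is exactly the rename-detection threshold at work; here is a
small demo (made-up file names) of how a move-plus-edit commit flips
from a detected rename to a delete/create pair as the similarity
threshold rises:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
seq 1 50 > f49.ss
git add f49.ss
git -c user.name=demo -c user.email=demo@demo commit -q -m 'add f49.ss'
# Move the file *and* edit it in the same commit, as the old
# restructures did:
git mv f49.ss f49.rkt
echo ';; changed' >> f49.rkt
git add f49.rkt
git -c user.name=demo -c user.email=demo@demo commit -q -m 'move and edit'
# Default threshold: reported as a rename with a similarity score.
git show --name-status --format= -M HEAD
# Exact-match-only threshold: degrades to delete + create.
git show --name-status --format= -M100% HEAD
```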
* Also, I'd worry about file movements on top of paths that at some
point existed under a different final path, and about exactly the
situations you described, where a file was left behind but is
completely new and should be considered separate (as in the case of
a file that was moved away with a stub created in its place).
* The script should also take care of files that were removed in the
past.  For example, the drscheme collection had a file that was
removed, and later (in a completely unrelated change) most of the
contents migrated to drracket.  If the analysis concludes that most
of the material moved this way, and because of that you decide to
keep the old drscheme collection -- you'd also want to keep the file
that disappeared before the move, since it's part of the relevant
history.
So I'd modify this script to run on the *complete* repository -- the
whole tree and all commits -- and generate information about
movements.  Possibly do what your script does for the whole tree,
then add a second step that looks for files that are unaccounted for
in the results, and decides what to do with them.
I think that this also means that it makes sense to create a global
database of all file movements in a single scan, instead of running
it for each package.
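A sketch of that single global scan, again on a throwaway repository:
one pass over all commits, logging every detected rename or copy with
its commit into a file that can then be reviewed and hand-edited (the
file name and format are just for illustration):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
echo '(define y 2)' > a.ss
git add a.ss
git -c user.name=demo -c user.email=demo@demo commit -q -m 'add a.ss'
git mv a.ss a.rkt
git -c user.name=demo -c user.email=demo@demo commit -q -m 'rename'
# One pass over the whole history: record each R (rename) or
# C (copy) line together with the commit that introduced it.
git log --all --name-status --format='commit %H' -M -C \
  | awk '/^commit /{c=$2} /^[RC]/{print c "\t" $0}' > movements.txt
cat movements.txt
```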
* Technical: I thought that it might make sense to use a racket server
(with netcat for the actual command), or have it "compile" a /bin/sh
script to do the actual work instead of using `racket/kernel' for
speed.  However, when I tried it on the plt tree, it started by
spitting out new commits rapidly, but eventually slowed down to more
than a second between commits, so probably even the kernel trick
doesn't help much...
* Actually, given the huge amount of time it runs for (see the next
bullet), it's probably best to make it do the movements from all
paths at the same time.  In this specific context, this means that
it turns the package-restructured repo (from the first step) into a
package-restructured repo (possibly with the same toplevel names)
with all the files moved to their correct places, and the resulting
repo can now be conveniently split into the sub-repos with a simple
subdirectory filter.
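For that final step, once every file sits in its final place
throughout history, each sub-repo does fall out of a plain
subdirectory filter; a sketch with made-up package names:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
mkdir -p pkgs/foo pkgs/bar
echo '(define a 1)' > pkgs/foo/main.rkt
echo '(define b 2)' > pkgs/bar/main.rkt
git add .
git -c user.name=demo -c user.email=demo@demo commit -q -m 'initial'
# Keep only the history of pkgs/foo, promoting it to the root.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f \
  --subdirectory-filter pkgs/foo -- --all >/dev/null 2>&1
git reset --hard -q
git ls-files
```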
* And speaking of the time: what I saw is about 19k commits (I
think; I killed it and am speaking from memory now), where it
started out very fast, then slowed down considerably.  After about
5 hours it was about half-way through the 19k commits, and the rate
reached about 1.5 seconds per commit.  Assuming a linear growth in
time per commit, this means that the whole operation is something
that would take about 20 hours. (I didn't leave it running, since I
don't want it to disturb the nightly build that will start soon.)
This does not make it impossible -- just very hard to do reliably,
so I really wouldn't want to see this going without a supervising
human eye, as I described above.
* Much better would be to run this to generate human-readable,
editable output: then not only go over the output manually to make
sure that it all makes sense, but also identify points of
optimization.  For example, knowing that all of drscheme/* moved to
drracket/* is going to work out much better than dealing with each
file separately when re-doing each commit.
* Re the commit messages being edited with "--msg-filter": one thing
to note is that there are already such lines in the portion of the
history that was ported from subversion.  Even for those commits,
you probably still want to add the sha1s, since there might be
references to the sha1 of an svn-imported commit; such a message
would then end with the svn revision followed by the sha1 of the
first translation of the svn commit.
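For reference, inside `--msg-filter' the pre-rewrite id is available
as $GIT_COMMIT, so appending it is a one-liner (the label used in the
appended line here is just a placeholder):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git -c user.name=demo -c user.email=demo@demo commit -q --allow-empty -m 'some change'
# Append the pre-rewrite sha1 to every rewritten commit message;
# $GIT_COMMIT is expanded by the filter for each commit.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f \
  --msg-filter 'cat; echo; echo "original commit: $GIT_COMMIT"' \
  -- --all >/dev/null 2>&1
git log -1 --format=%B
```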
* It's not clear to me what you want to do at this point, but you
originally described two filter-branch steps: one to restructure the
repository soon, and another to split into packages. If this is
still the plan, then each of these steps would need to leave a
historical sha1 behind.
Alternatively, do the first restructure with in-repo moves instead,
but then it would be a good idea to run the slice script and make
sure that it succeeds in finding all of *these* renames correctly,
as a first-level sanity test.
* Still, I consider all of this a huge amount of work, and I still
don't see any benefit to doing it.  Just the time spent on these
explanations is more than what I'd spend on what I suggested
yesterday wrt starting a separate setup step for the core and for
the rest.
BTW, the script is still useful, of course -- I'd probably do
something similar, except that I'd use some shell scripts to inspect
the history of all files, and refine it as I described above. The
thing is that this can be done without any effect on current
progress, since the split during the build is made on current
directories.
--
((lambda (x) (x x)) (lambda (x) (x x))) Eli Barzilay:
http://barzilay.org/ Maze is Life!