[racket-dev] proposal for moving to packages: repository

From: Eli Barzilay (eli at barzilay.org)
Date: Thu May 23 05:41:35 EDT 2013

9 hours ago, Matthew Flatt wrote:
> At Wed, 22 May 2013 14:50:41 -0400, Eli Barzilay wrote:
> > That's true, but the downside of changing the structure and having
> > files and directories move post structure change will completely
> > destroy the relevant edit history of the files, since it will not
> > be carried over to the repos once it's split.
> 
> It's possible that we're talking past each other due to me not getting
> this point.

(Obligatory re-disclaimer: I consider the problem with forcing people
to change their working environment much more severe.)


> Why is it not possible to carry over history?
> 
> The history I want corresponds to `git log --follow' on each of the
> files that end up in a repository. I'm pretty sure that such a
> history of commits can be generated for any given set of files, even
> if no ready-made tool exists already (i.e., 'git' is plenty flexible
> that I can script it myself).
> 
> Or maybe I'm missing some larger reason?

The thing to remember is just how simple git is...  There's no magical
way to carry over a history artificially -- it's whatever is in the
commits.

To make this more concrete (and more verbose), in this context the
point is that git filter-branch is a simple tool that basically
replays the complete history, allowing you to plant various hooks to
change the directory structure, commit messages or whatever.  The new
history is whatever new commits are in the revised repository, with no
way to make up a history with anything else.

Now, to make my first point about the potential loss of history that
is inherent in the process -- say that you want to split out a
"drracket" repo in a naive way: taking just that one directory.  Since
it's done naively, the resulting repository will not have the
"drscheme" directory and its contents, which means that you lose all
of the history that the files had there.  To try that (in a fresh
clone, of course) -- first, look at the history of a random file in
it:

  F=collects/drracket/private/app.rkt
  git log --format='----%n%h %s' --name-only --follow -- "$F"

Now do the revision:

  S=collects/drracket
  git filter-branch --prune-empty --subdirectory-filter "$S" -- --all

And run the same log command again; the history is gone:

  git log --format='----%n%h %s' --name-only --follow -- "$F"

If you look at the *new* file, you do see the history, but the
revisions made in "drscheme" are gone:

  git log --format='----%n%h %s' --name-only --follow -- private/app.rkt

In any case, this danger is there no matter what, especially in our
case since code has been moving around in the "racket" switch.  I
*hope* that most of it will be simple: like carrying along the
"drscheme" directory with "drracket", the "scheme" and "mzlib" with
"racket", etc.  Later on, if these things move to "compat" packages,
the irrelevant directories get removed from the repo without
surgeries, so the history will still be there.  This shows some of the
tricks that might be involved in the current switch: if you'd want to
have some "compat" package *now*, the right thing to do would be:

  * do a simple filter-branch to extract "drscheme" (and other such
    collections) in a new repository for "compat"

  * for "drracket": do a filter-branch that keeps *both* directories
    in, then commit a removal of "drscheme".  (Optionally, use rebase
    to move the deletion backward...)
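
A minimal sketch of the second bullet, run on a throwaway repository
rather than a real clone.  The file names and commit messages are made
up for illustration, and the `git rm` / `git reset` pair inside
`--index-filter` is just one common idiom for keeping several
directories; it is an assumption, not necessarily what the actual
split would use:

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# Throwaway repo with a hypothetical drscheme -> drracket rename.
cd "$(mktemp -d)"
git init -q .
mkdir -p collects/drscheme collects/other
echo app   > collects/drscheme/app.ss
echo stub  > collects/drscheme/main.ss
echo other > collects/other/o.rkt
git add -A
git commit -q -m "initial"
mkdir -p collects/drracket
git mv collects/drscheme/app.ss collects/drracket/app.rkt
git commit -q -m "rename to drracket"

# Keep *both* directories in every commit: empty the index, then
# restore only the two directories from the commit being rewritten.
git filter-branch --prune-empty --index-filter '
  git rm -r -q --cached --ignore-unmatch -- . &&
  git reset -q "$GIT_COMMIT" -- collects/drracket collects/drscheme
' -- --all >/dev/null

# Now remove "drscheme" with an ordinary commit, so its history stays.
git rm -r -q collects/drscheme
git commit -q -m "remove drscheme (moved to compat)"

# The log still follows the file back into its "drscheme" days.
kept=$(git log --format=%s --follow -- collects/drracket/app.rkt)
echo "$kept"
```

The point of doing the removal as a regular commit is that the
"drscheme" revisions remain reachable through `--follow`, unlike with
the naive `--subdirectory-filter` split.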

Going back to the repo structure change that you want: the reason I
said that moving files between the package directories after the
restructure is destructive should be clear now.  Say that you move
collects/A/x into foo/A/x as part of the restructure.  Later you
realize that A/x should go into the bar package instead, so you just
move it to bar/A/x.  The history is now in, including the rename, but
later on when bar is split into a separate repo, the history of the
file before that move is gone.  Instead, it appears in the foo
repository, ending with the file being deleted.
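
To see this effect in isolation, here is a small sketch on a scratch
repository.  The foo/A/x and bar/A/x paths are the ones from the
paragraph above; the commit messages and file contents are made up for
illustration:

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

cd "$(mktemp -d)"
git init -q .
mkdir -p foo/A
echo v1 > foo/A/x; git add -A; git commit -q -m "add x in foo"
echo v2 > foo/A/x; git commit -q -am "edit x in foo"
mkdir -p bar/A
git mv foo/A/x bar/A/x; git commit -q -m "move x to bar"
echo v3 > bar/A/x; git commit -q -am "edit x in bar"

# Before the split, --follow sees the whole history across the move.
before=$(git log --format=%s --follow -- bar/A/x | wc -l | tr -d ' ')

# Split "bar" out the naive way.
git filter-branch --prune-empty --subdirectory-filter bar -- --all >/dev/null

# After the split, only the move and the later edit are left.
after=$(git log --format=%s --follow -- A/x | wc -l | tr -d ' ')
echo "commits before split: $before, after split: $after"
```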

One way to get around this is to avoid moving the file -- instead, do
another filter-branch surgery.  This will be a mess since each such
change will mean rebuilding the repository with all the pain that this
implies.  Another way to get around it is to keep track of these
moving commits, and when the time comes to split into package repos,
you first do another surgery on the whole repo which moves foo/A/x to
bar/A/x for all of the commits before the move (not after, since that
could lead to other problems), and then do the split.
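
A rough sketch of that second workaround, again on a scratch
repository.  The `MOVE` variable (the recorded moving commit) and the
use of `--tree-filter` with `git merge-base --is-ancestor` are
assumptions made for the example; a real script would probably use a
faster index-based filter and a list of such moves:

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

cd "$(mktemp -d)"
git init -q .
mkdir -p foo/A
echo v1 > foo/A/x; git add -A; git commit -q -m "add x in foo"
echo v2 > foo/A/x; git commit -q -am "edit x in foo"
mkdir -p bar/A
git mv foo/A/x bar/A/x; git commit -q -m "move x to bar"
MOVE=$(git rev-parse HEAD); export MOVE   # the tracked moving commit
echo v3 > bar/A/x; git commit -q -am "edit x in bar"

# Surgery: in every commit up to the move (and not after it), make the
# file live at its final location.  The move commit itself becomes
# empty and is dropped by --prune-empty.
git filter-branch --prune-empty --tree-filter '
  if git merge-base --is-ancestor "$GIT_COMMIT" "$MOVE" && [ -e foo/A/x ]; then
    mkdir -p bar/A && mv foo/A/x bar/A/x
  fi
' -- --all >/dev/null

# Now the split keeps the full history of the file.
git filter-branch -f --prune-empty --subdirectory-filter bar -- --all >/dev/null
kept=$(git log --format=%s -- A/x | wc -l | tr -d ' ')
echo "commits kept for A/x: $kept"
```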

This might work, but besides being very error-prone, it means doing
the same kind of file-movement tracking that I'm talking about anyway.
So take this all as saying that the movement of files between packages
needs to be tracked anyway -- but with my suggestion the movement is
delayed until it's known to be final before the repo split, which
makes it more robust overall.

----

But really, the much more tempting aspect for me is that this can be
done now -- if you give me a list of packages and files, I can already
do the movement script.

Actually, in an attempt to tempt you more, here's what I can do now
(as in the very near future):

Start from the list of directories/files in your min repo as a
specification of the contents of the core package, and decide that
everything else is in another "everything-else" package.  (Since
there's no actual file movements, it is cheap to use temporary names
and partial specifications.)

Then, change how the build works on the main machine (leave the other
machines as is for now): after the initial few steps of updating
version files etc., the script doesn't use a repo -- it uses just the
exported directory.  So after it exports the directory for building,
the main machine will:

  - run the script to get the package directories, so you get
    something like (in $PLTHOME, wherever the build works):

      collects \
      doc       \  all of these
      man       /  are empty
      src      /
      core/collects
      core/man
      core/src
      everything-else/collects

  - move core/* up a level (and remove the empty "core"
    directory)

  - do the regular build: executables + raco setup

  - next, move everything-else/* up a level too

  - run another setup

This means that now the build makes sure that the dependencies are
fine: that the core doesn't depend on everything-else.  Later on, we
can split another package out from everything-else, and insert it into
the above sequence: build the core, add P, run setup, add everything
else, run a final setup.  It can even get more sophisticated:

  - build core,
  - add P1, setup, move the built P1 out,
  - add P2, setup, move the built P2 out,
  - add everything-else and the built P1 & P2, run a final setup

Yes, this is duplicating the dependency info between the packages, but
this is all done temporarily (and for a small number of packages)
until the proper package-based build is working and replaces it.
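
As a sketch of the two-stage sequence described above, under stated
assumptions: the scratch directory stands in for $PLTHOME, the
`merge_up` and `run_setup` helpers are hypothetical names, and
`run_setup` is only a placeholder for the real build step (executables
and/or `raco setup`) that records which collections are visible when
it runs:

```shell
set -e
root=$(mktemp -d)        # stands in for $PLTHOME
cd "$root"

# Hypothetical layout as produced by the package-splitting script.
mkdir -p collects doc man src
mkdir -p core/collects/racket core/man core/src
mkdir -p everything-else/collects/drracket
echo core  > core/collects/racket/main.rkt
echo extra > everything-else/collects/drracket/main.rkt

# Move PKG/* up a level, merging into directories that already exist,
# then remove the emptied package directory.
merge_up() {
  cp -R "$1"/. .
  rm -rf "$1"
}

# Placeholder for the real build step; here it just lists the
# collections that the build would be able to see at this stage.
run_setup() {
  ls collects | tr '\n' ' '
}

merge_up core
stage1=$(run_setup)      # core must build without anything else
merge_up everything-else
stage2=$(run_setup)      # then everything else, on top of core

echo "stage 1 saw: $stage1"
echo "stage 2 saw: $stage2"
```

Inserting another package P between the two stages would be one more
`merge_up P; run_setup` pair.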

In other words -- not only is my suggestion implementable now, it
allows the project to proceed faster: you can go on with doing the
package build, while everyone needs to deal with respecting
dependencies (deciding on which package a file goes with, avoiding
breaking these dependencies).

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                    http://barzilay.org/                   Maze is Life!

Posted on the dev mailing list.