[racket-dev] proposal for moving to packages: repository

From: Eli Barzilay (eli at barzilay.org)
Date: Fri May 24 02:41:22 EDT 2013

Yesterday, Robby Findler wrote:
> Hi Eli: I'm trying to understand your point. Do I have this right?
> 
> Background: The git history consists of a series of checkpoints in time
> of the entire repository, not a collection of individual files.

Yes, although the difference between "entire repository" and
"individual files" is mostly theoretical.  The main point is that the
log history is made from changes to *content* -- you can't have some
made-up history planted artificially for a file.  (And it is the same
for most CMSs if not all; the main difference here is that git doesn't
keep meta information about copying and renaming.)


> So, when I do "git log x.rkt" then what I get is essentially a
> filtered list (except where people didn't properly rebase, but let's
> ignore that) of those checkpoints: all the ones where "x.rkt"
> changed.

Exactly.  (I don't get the rebase comment though -- even without
rebasing what you get is this filtered history.)
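
To make the "filtered list" idea concrete, here is a toy sketch in
Python.  It is only a model under simplifying assumptions (linear
history, no merges), not how git stores or walks history; the `Commit`
type and the `log_for_path` helper are names made up for this
illustration.

```python
# Toy model (not git's real data structures): each commit is a full
# snapshot of the tree, mapping path -> content, plus author and message.
from typing import NamedTuple


class Commit(NamedTuple):
    author: str
    message: str
    tree: dict  # path -> file content


def log_for_path(history, path):
    """Return the commits whose content at `path` differs from the parent's."""
    shown = []
    previous = None  # content at `path` in the parent commit (None = absent)
    for commit in history:
        current = commit.tree.get(path)
        if current != previous:
            shown.append(commit)
        previous = current
    return shown


history = [
    Commit("foo", "created x", {"x.rkt": "v1"}),
    Commit("bar", "added y", {"x.rkt": "v1", "y.rkt": "v1"}),
    Commit("baz", "edited x", {"x.rkt": "v2", "y.rkt": "v1"}),
]

print([c.message for c in log_for_path(history, "x.rkt")])
# prints ['created x', 'edited x']
```

The "added y" commit is a checkpoint of the whole tree like any other,
but it is filtered out of the log for x.rkt because the content at
that path did not change in it.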


> Big Question: The issue is, then, when we split up the current repo
> into smaller repos, what are the series of checkpoints that we're
> going to "make up" for the individual repos? Right? 

Yes, but it can get a bit subtle.  Like I said, the `filter-branch'
tool is basically replaying the entire history, giving you points to
inject hooks that can modify the tree or the commits, etc.  Note that
in all uses that were mentioned, there was a "--prune-empty" flag,
which means that commits that didn't have any change are dropped.  I'm
mentioning this because some people might have an illusion that it's
better to *not* do that and keep these commits.  Here's an example why
this is not useful: say that you have this edit sequence:

  foo at somewhere creates A/x, with log message "created x"
  bar at somewhere edits it, with log message "edited x"
  baz at somewhere moves it into B/x, with log "renamed A to B"

If at this point you use any git tools, they can see the real history.
For example, you can use `blame' to see which lines were written by
which user.  Also, assuming that these are all the changes, a git log
will show the three commits as they appear above.

Now, if you use filter-branch to modify the repository and keep
only the "B" directory, but you *don't* use the "--prune-empty" flag:
the fact that you want to keep these other commits won't help -- the
full history would have the three commits with the same three
messages, but doing a log for just the file would show only the
commits for the file, so the first two commits won't be shown.
Similarly, blame can't show anything useful -- you'll only see
baz at somewhere as the author of the entire file.  And the reason this
makes sense is that the full commit history has the first two commits,
but they had no change -- so there's nothing that ties them to the
file in the trimmed repository, let alone something that relates them
to specific lines in the file...
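
The same three-commit example can be played out in a toy Python model.
This is a sketch under loud assumptions: it does not run git or
filter-branch, the helpers `keep_only`, `log_for_path` and `blame` are
made up for this illustration, and the `blame` here is a crude
content-based stand-in for what the real tool does.

```python
# Toy model only (not how git stores or rewrites history): a commit is
# (author, message, tree), where tree maps path -> file content.
original = [
    ("foo", "created x", {"A/x": "line one\n"}),
    ("bar", "edited x", {"A/x": "line one\nline two\n"}),
    ("baz", "renamed A to B", {"B/x": "line one\nline two\n"}),
]


def keep_only(history, prefix):
    """Keep only paths under `prefix` in every commit; empty commits stay."""
    return [
        (author, msg, {p: c for p, c in tree.items() if p.startswith(prefix)})
        for author, msg, tree in history
    ]


def log_for_path(history, path):
    """Messages of the commits whose content at `path` differs from the parent's."""
    shown, previous = [], None
    for _author, msg, tree in history:
        current = tree.get(path)
        if current != previous:
            shown.append(msg)
        previous = current
    return shown


def blame(history):
    """Crude content-based blame: a line belongs to whoever first introduced
    that text anywhere in the tree, so ownership survives a rename."""
    owners = {}
    for author, _msg, tree in history:
        lines = [l for content in tree.values() for l in content.splitlines()]
        owners = {line: owners.get(line, author) for line in lines}
    return owners


trimmed = keep_only(original, "B/")

print([msg for _, msg, _ in trimmed])  # all three messages are still there
print(log_for_path(trimmed, "B/x"))    # ['renamed A to B']
print(blame(original))                 # {'line one': 'foo', 'line two': 'bar'}
print(blame(trimmed))                  # {'line one': 'baz', 'line two': 'baz'}
```

In the trimmed history the first two commits have empty trees, so
keeping them adds nothing that a log or a blame of B/x could use.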

(Two notes: (a) This is just a demonstration -- obviously, this is a
trimming that is done in a bad way since it dropped A even though it's
part of the history of B.  (b) Actually, it looks like the
"--subdirectory-filter" drops empty commits anyway, but the above
explains why it makes sense to do that.)


> Your Advice: And, IIUC, you're suggesting that the best way to deal
> with this question is to defer it until we are more sure of the
> actual split we want to make. So we don't mess with the history at
> all

The point is that every such messing-with-history should be done very
carefully and checked thoroughly, since the chance to mess things up is
very real.  In the above, it's obvious that I should not have dropped A
in the filter -- but if it's some random single file which you had in
the framework collection, out of tons of other files in the drracket
package, then it's unlikely that I will catch it -- which is why I
prefer using tools for these things and resolving all such issues with
the people who know about the code.


> and instead just work at the level of some script that we can run to
> just use "mv" and company to move things around.  When we know
> exactly what ends up going where, then we can figure out how to make
> up a new, useful history for the separate repositories.
> 
> Is that the point?

The thing is that having two such filters (one to restructure the big
repository and one to split it) both increases the chances of making
mistakes and makes the second restructuring much harder to do, to the
point where doing it manually is infeasible, which is why I said that
it will guarantee losing history.

(And I'll reply to Matthew's suggested tool next.)

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                    http://barzilay.org/                   Maze is Life!


Posted on the dev mailing list.