[racket-dev] Packages

From: Eli Barzilay (eli at barzilay.org)
Date: Mon Apr 8 16:18:18 EDT 2013

This is a (long) criticism of the current state of the package system.
(It is a by-product of PR13669, where I raised that point.)

Executive summary: I very strongly think that "pkg create" should
change.  See the bottom for my suggestion.


* Most code development happens in a single collection

It all starts at how someone is expected to approach developing a
"package", regardless of how this is defined in the current system.  A
quick way to get to the core of the problem is to consider what you'd
expect to do when you start working on some "foo" package.  Under
probably most package systems and most distribution systems, you'd
begin with a "foo" directory and work there.  But with the new package
system, you get an extra level of directory structure: you need to
make up a "foo" directory with a "foo" subdirectory (usually).
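
To make the extra level concrete, here is a minimal sketch that builds
both layouts in a scratch directory.  The names "conventional",
"pkg-system", and "main.rkt" are made up for illustration; only the
"foo" directory with a "foo" subdirectory comes from the description
above.

```shell
# Scratch area so nothing outside it is touched.
work=$(mktemp -d)

# Layout most systems expect: code sits directly in "foo".
mkdir -p "$work/conventional/foo"
: > "$work/conventional/foo/main.rkt"      # hypothetical source file

# Layout the new package system expects: a "foo" package directory
# that contains a "foo" collection directory.
mkdir -p "$work/pkg-system/foo/foo"
: > "$work/pkg-system/foo/foo/main.rkt"    # same hypothetical file, one level deeper

# List both trees, relative to the scratch area.
(cd "$work" && find . -type f | sort)
```

The listing shows the same file twice, with the second copy one
directory deeper; that extra level is the whole difference.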

I know that the way I develop packages (again, not using the formal
definition of the package system) does *not* do that.  I also know
that many other people don't do that.  In fact, I can think of very
few *existing* examples that would benefit from such a thing.  This is
in contrast to Jay's view: he said that he used multiple collections
per package because he wanted it to "match what users have".

So just to make sure that I'm not talking out of my ass, I looked at
all of the existing packages.  Here are some numbers (manual count, so
it's all estimated):

  42 Single-directory packages, holding at most some meta stuff like a
     README file at the top (IIRC, there was only one case with an
     info.rkt file too).  Out of these, about 8 use an existing
     collection name like "data", "file", or "net".

  19 Multi-directory packages.

This makes it look like there's a good case for multi-collection
packages, but:

  14 of these multi-collection things have a single collection and a
     "tests" one.  As discussed in the past several times, the general
     agreement was that it's better to have tests inside the relevant
     collection (and the future trend is likely to shift to have tests
     *in* the source code).

So the real numbers are 56 single-collection packages, and 5 multi
ones.  Of these multi-collection packages, one was exactly the case
that I thought would benefit from this: Carl's "mischief", which is
declared as "a bunch of stuff" and arguably would benefit from
splitting into proper packages should there be sufficient demand.  In
the same category, there is soegaard's even more explicitly named
"this-and-that".

This leaves exactly 3 cases where the multi-collection is really used
-- and two of them are Jay's packages (the other is from dvanhorn).

I take this as reaffirming my guess that pretty much all development
happens in a single collection.  I'll note that there is, however, a
point for using existing collection names -- not a strong one, but
the ~20% (8 of the first 42) number of packages that use an existing
collection name was roughly the same in the multi-packages too.


* The package = multiple-collections feature is bad

Given the above, one way in which the multiple-collections per package
is bad is obvious: it's yet another complication in the way of a
random hacker who wants to contribute code.  It means that developers
need to unnaturally move code into a subdirectory, including existing
code in repositories.  That's a *real* problem in some cases.  Two
quick examples:

  * Like many other people, I have my "directory of stuff", with
    random code and random collections.  If I make that into a
    package, then any later publishing of some part of it as its own
    package means that I need to shuffle files around.  With an
    existing repository, and especially if I want to maintain my
    revision history, this leads to yet more acrobatics than a quick
    move to a subdirectory.

  * The handin-server and -client should clearly be developed
    together -- but they should not be distributed together, since
    only the latter is what students need.  The best way that I can
    think of to address this is still bad: make them into a single
    package, and add instructions on packaging just the client to
    students.  It's true that such instructions already exist -- but
    there is no reason to complicate these instructions.

Another way to see why multiple collections per package are bad is to
consider the "raco link" command.  This command takes the collection
*names*, and originally this was the only thing it did.  Only after
Matthew implemented it did he add the `--root' option -- and he did
that after a request that I made, with the explicit scenario in mind of
accommodating such a "directory of stuff".  The package system, as
currently implemented, takes this non-default `--root' flag, and
adopts its behavior as the default.

But the problems are not only at a technical level, they're also
higher up.  Making collection roots into the unit of distribution
means that people need to be aware of them.  In fact, this is actually
making a "collection root" into a new concept -- before the package
system it was just a place to look for toplevel collections, but now
it has turned into sometimes a place for collections, and sometimes a
container of multi-collection packages (as well as such a place).


* What can be done

Just to be clear, I completely agree that it would be insane at this
point to do some kind of an incompatible change.  But looking at the
list of "raco pkg" subcommands, there's one command ("install") that
deals with a package URL, several commands that deal with the name of
an installed package, and one command -- "create" -- that deals with
these "package directories".  So if just this command is revisited, the
issue can be resolved.

I originally thought that it makes sense to have a new command
that packages specified collection directories (or names) instead of
collection roots.  It's a small change: you just name the
collection(s) instead of naming a root that has the collections as
subdirectories.

Jay suggested "packagify", which was actually a good hint for me to
do this writeup: I thought about what exactly bothers me about having
such a weird name -- and the thing is that I think that it's this
command that should be the more popular one, so a weird name for it
would not be a good choice.  The next obvious thing to consider is a
better name -- something like "pack" -- and the problem with that is
that it will be very confusing for users to have both "pack" and
"create" with these subtle differences.

The next name that I considered was something like "pack-collection",
or even possibly adding "pack-collection" and renaming "create" to
"pack-collection-root".  But this is bad also for exactly the
reason that Matthew said in the PR, which I think is a very good
guideline for this suggestion and for the overall design of the package
system:

| I thought it would be a simpler path for people who already
| understand collections, but it turned out to be more complex and
| more confusing to have more ways of doing things.

So the problem of having two "pack-" or "create-" variants is that
people should still be aware of the two things, and more specifically,
the concept of a "collection root directory" (or whatever it gets
called) doesn't go away.

Taking "raco link" as the model, I now think that the package system
(or specifically, the "create" command) should do exactly what "raco
link" does: the default would accept a collection directory and make
it into a package, and with a "--root" flag, it would package up the
whole specified collection root.

There are a few technical details to deal with.  The few that I see
are:

* What happens when there is more than one collection specified in a
  single "create" command.  Following the above analysis of existing
  packages, I think that it makes sense to have the "main" collection
  be the first, and optionally further "support" collections specified
  later -- which means that the package meta-data is taken from the
  first collection.  The reason this follows what I see now is that
  most cases of two directories had a "tests" directory, and it makes
  sense to do something like

    raco pkg create path/to/foo path/to/tests/foo

  and given that "path/to" is one of my roots, the "create" command
  will package the two collections of "foo" and "tests/foo".

* Another question is what happens when I specify a collection that is
  not a toplevel collection.  The way that this can be done is what I
  wrote above: track it to its root, and use that as the path to the
  collection in the package.

* Finally, there's the question of manually packaged directories and
  single-collection repositories.  I think that both of these cases
  should be dealt with in a similar way -- when you create a package
  URL, you also specify whether it is a "--root" mode or not, with the
  default being off.  Existing URLs will be treated as being in
  "--root" mode so they all continue to work fine.

  Alternatively, this could be specified in the package's toplevel
  info.rkt file, which "pkg create" would check, but with a default of
  non-"--root" this means changing existing repositories.
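
The path handling in the first two points above can be sketched in a
few lines of shell.  This is only an illustration of the proposal, not
how "raco pkg create" is implemented: the helper names
"collection_path" and "package_layout" are hypothetical, and the
collection root is passed in explicitly rather than found among the
configured roots.

```shell
# Hypothetical helper: print the path of collection directory $2
# relative to collection root $1, or fail if $2 is not under $1.
collection_path() {
  cp_root=${1%/}          # drop one trailing slash, if any
  cp_dir=${2%/}
  case "$cp_dir" in
    "$cp_root"/*) printf '%s\n' "${cp_dir#"$cp_root"/}" ;;
    *)            echo "not under root: $cp_dir" >&2; return 1 ;;
  esac
}

# Hypothetical helper: given a root and one or more collection
# directories, print each collection's in-package path.  The first
# one is the "main" collection that would supply the meta-data.
package_layout() {
  root=$1; shift
  main=
  for dir in "$@"; do
    rel=$(collection_path "$root" "$dir") || return 1
    [ -n "$main" ] || main=$rel
    printf 'collection: %s\n' "$rel"
  done
  printf 'meta-data from: %s\n' "$main"
}

# The example from above, with "path/to" assumed to be a root.
package_layout path/to path/to/foo path/to/tests/foo
```

With "path/to" as the root, the two directories come out as the
collections "foo" and "tests/foo", and "foo" is the one that supplies
the meta-data.  A directory outside the root is rejected instead of
being guessed at.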

This is a relatively minor change, but I think that conceptually it
greatly simplifies things.  One of the main problems I had with planet
is that it was too heavy for random users.  The new system is
certainly lighter, but I think that such a change will make it
significantly more usable in that it's much closer to "just dump your
bunch of files on the web".

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                    http://barzilay.org/                   Maze is Life!

Posted on the dev mailing list.