[racket-dev] Symlink trouble

From: Matthew Flatt (mflatt at cs.utah.edu)
Date: Wed Apr 17 10:39:22 EDT 2013

Yes, I think Racket should use PWD --- if the expansion of soft links
produces the same path as getcwd(), which seems to be what "/bin/pwd"
does.

Should Racket also set PWD (optionally, but by default) when it creates
a subprocess? I think probably so.

To make sure we're all on the same page:

The general problem is that there can be more than one filesystem path
that reaches a file. It would be great if we could normalize every path
to a canonical form, but path normalization in general seems to be
intractable due to the possibilities of soft links, hard links,
multiple mount points, case-sensitivity choices, and probably other
twists that I'm forgetting. We have therefore settled on different
definitions of "same file", depending on the context.

For module paths, "same file" involves only syntactic normalizations of
the pathname (e.g., no checking for soft links). Various pieces of the
system are carefully implemented to be consistent with syntactic
normalization. For example, suppose that PLTCOLLECTS is set to
"/home/mflatt/plt", but "/home/mflatt" is a symlink to "/Users/mflatt";
pathnames associated to modules that are accessed via collection will
consistently use "/home/mflatt", and not somehow hop over to
"/Users/mflatt". As long as a user is similarly consistent when
supplying paths, it all works out.

Unfortunately, `current-directory' is a place where you don't get to
choose the path. You might say "/home/mflatt/plt" to get to a Racket
installation, but to initialize `current-directory', the path gets
turned into an inode and back to a path via getcwd() --- exactly the
sort of thing that breaks a syntactic view of "same".

The PWD environment variable addresses the problem with getcwd(): nice
shells set PWD based on a syntactic derivation of the current
directory, instead of an inode-based derivation.
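The two derivations can be seen side by side in a shell session. This is
only a minimal sketch: it assumes a POSIX shell whose "pwd" supports -L
and -P, and the directory names "real" and "link" are hypothetical,
created under a temporary directory just for the demonstration.

```shell
# Hypothetical layout: "real" is a directory, "link" is a symlink to it.
tmp=$(mktemp -d)
mkdir "$tmp/real"
ln -s real "$tmp/link"

# Enter the directory through the symlink, as a user might.
cd "$tmp/link"

logical=$(pwd -L)    # syntactic view, taken from PWD: still ends in "link"
physical=$(pwd -P)   # getcwd()-style view: the symlink is resolved to "real"

echo "logical:  $logical"
echo "physical: $physical"

# Clean up the demonstration directory.
cd /
rm -rf "$tmp"
```

The logical path keeps the spelling the user typed, while the physical
path is what a program sees if it only asks the OS via getcwd().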

So, Racket should take advantage of the information that nice shells
provide. Probably it should also act as a nice shell by default.
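The check described at the top of this message (use PWD only if expanding
its soft links gives the same path as getcwd()) could be sketched roughly
as follows. This is shell, not Racket's actual startup code; the function
name "choose_cwd" and the directory names are hypothetical, and "pwd -P"
stands in for getcwd().

```shell
# choose_cwd is a hypothetical helper, not part of Racket or any shell.
# It prints the candidate path (default: $PWD) when that path is absolute
# and names the same directory as the physical working directory;
# otherwise it prints the physical path.
choose_cwd() {
    candidate=${1-$PWD}
    physical=$(pwd -P)
    case "$candidate" in
        /*)
            # Resolve symlinks in the candidate inside a subshell and
            # compare the result with the physical working directory.
            resolved=$(cd "$candidate" 2>/dev/null && pwd -P) || resolved=
            if [ "$resolved" = "$physical" ]; then
                printf '%s\n' "$candidate"
                return 0
            fi
            ;;
    esac
    printf '%s\n' "$physical"
}

# Demonstration with hypothetical directories under a temporary directory.
tmp=$(mktemp -d)
mkdir "$tmp/real" "$tmp/other"
ln -s real "$tmp/link"
cd "$tmp/link"

kept=$(choose_cwd "$PWD")            # consistent: symlinked spelling is kept
fallback=$(choose_cwd "$tmp/other")  # different directory: physical path

echo "kept:     $kept"
echo "fallback: $fallback"

cd /
rm -rf "$tmp"
```

A stale or hand-edited PWD fails the comparison and is ignored, so the
result is never worse than plain getcwd(). A process acting as a "nice
shell" would pass the chosen value on as PWD when it starts a subprocess.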

(As it happens, I use "csh" on Mac OS X, and it's not nice in the above
sense. That helps explain why I never got PWD vs. getcwd() before.)

At Wed, 17 Apr 2013 12:06:29 +0200, Tobias Hammer wrote:
> Hi,
> I am currently implementing an application that relies heavily on Racket's
> great serialize functionality to exchange data between Racket processes on
> different computers. That worked well until I stumbled over a very
> confusing behavior of Racket's filesystem and module path resolution.
> I will explain first, what i observed and then why this causes some  
> trouble:
> * relative (module) paths are resolved with something like (or  
> (current-load-directory) (current-directory))
> * collection paths are resolved with
>   (find-executable-path (find-system-path 'exec-file) (find-system-path  
> 'collects-dir)) for the system collection and with the given path for the  
> others
> * you can require a module relative and via collection, if they resolve to  
> the same name, there is no error
> serialize stores the module path and symbol where the deserialize function  
> can be found. It's interesting how this module path is determined
> * If the file containing the deserialize identifier (if implemented by
> hand, or the file where e.g. serializable-struct is used) is loaded via
> collection, then the serialized stream contains a collection path  
> (determined via identifier binding and mpi magic)
> * If this file is loaded relative, the fallback method with  
> current-(load)-directory is used
> Nothing special so far, but the fun starts with how current-directory is  
> initialized. It uses (on *nix systems) getcwd() but this function returns  
> the path with all symbolic links resolved (getcwd is only a thin  
> OS-wrapper, and the OS provides nothing else).
> This little detail can easily break the serialization framework (and maybe  
> other things too).
> The scenario is a file that is in a path containing a symlink and that is  
> in the current collections, e.g
> /abc/symlink/more/def/file.rkt
> and PLTCOLLECTS="/abc/symlink/more:"
> and file.rkt contains a serializable-struct definition.
> Now one racket process loads "file.rkt" relative, serializes a struct  
> instance and sends it to another racket process. The other process loads  
> def/file via collection and deserializes the struct. The receiver now has a
> struct that is of a different type and that he can't access.
> This fails because the serialized data contains the absolute symlink-free  
> path that differs from the path the receiver used to load file.rkt  
> (because for collection dirs symlinks are not resolved).
> The same happens of course when the data is sent to another computer that
> has a symlink in the path to file.rkt, even if they both load the same way.
> The confusing thing is that from the user's point of view everything is
> consistent. His working directory and collections all point to the same  
> location.
> It is clear that this behavior is by no means limited to Racket, as nearly
> all programming languages use getcwd internally. A quick google search for  
> getcwd and symlinks gives a lot of results...
> I came up with a few solutions but i would like to get some feedback on  
> them. They all more or less use that the shell keeps track of the 'real'  
> (better: visible) working directory. Most *nix shells set 'PWD' in the  
> environment but it is not guaranteed and can of course be altered by the
> user.
> - The quick and very dirty hack is to set the current-directory before any
> user code is executed
> racket -e '(current-directory (or (getenv "PWD") (current-directory)))' program.rkt
> Too ugly to really use it...
> - A better fix would be to change how the current-directory parameter is  
> initialized during the startup. It could be some heuristic that tries to  
> use the env-variable if it is a complete and existing path and falls back  
> to getcwd otherwise. As far as I can tell, this won't break anything
> because, after this one time at startup, the C side's cwd and Racket's
> parameter are completely decoupled.
> - A more conservative solution would be a command line argument to racket  
> to set the initial value for current-directory. One could then populate it  
> with env's PWD or from `pwd` or whatever suits.
> I would appreciate any feedback on how i can work around this behavior  
> (except don't use symlinks ...) or if i missed something obvious. If not,  
> would either of the two real solutions be viable? They shouldn't be too hard
> to implement; I could create a patch if one of them seems ok.
> Tobias
> -- 
> ---------------------------------------------------------
> Tobias Hammer
> DLR / Robotics and Mechatronics Center (RMC)
> Muenchner Str. 20, D-82234 Wessling
> Tel.: 08153/28-1487
> Mail: tobias.hammer at dlr.de
> _________________________
>   Racket Developers list:
>   http://lists.racket-lang.org/dev

Posted on the dev mailing list.