[plt-dev] symbol->string and mutability

From: Matthew Flatt (mflatt at cs.utah.edu)
Date: Thu Jun 18 16:03:56 EDT 2009

At Thu, 18 Jun 2009 11:30:53 -0400, Carl Eastlund wrote:
> On Thu, Jun 18, 2009 at 3:35 AM, Matthew Flatt<mflatt at cs.utah.edu> wrote:
> > At Wed, 17 Jun 2009 20:28:10 -0400, Carl Eastlund wrote:
> >> Why do symbol->string and keyword->string produce mutable strings?  In
> >> so doing, they have to allocate a new string every time.  Is there any
> >> way to get at an immutable string that is not allocated more than
> >> once?  I would prefer that this be the default behavior; R6RS already
> >> specifies that symbol->string produces an immutable string, for
> >> instance.
> >
> > Symbols and keywords are represented internally in UTF-8, while strings
> > are represented internally as UTF-32. So, there's not an obvious way to
> > have `symbol->string' avoid allocation, except by either caching a
> > string reference in the symbol (probably not worth the extra space,
> > since most symbols are never converted) or keeping a symbol-to-string
> > mapping in a hash table (which any programmer can do externally).
> >
> > I think it would be a good idea to switch to an immutable-string
> > result, but considering potential incompatibility, it has never seemed
> > worthwhile in the short run.
> 
> I see.  I have contracts set up to accept only symbols and keywords
> whose names are ASCII strings; I was planning to use a weak, eq?-based
> hash of their names to shortcut the test.  Apparently, though, I
> cannot get eq?-unique names for symbols and keywords.  If I hash the
> symbols and keywords themselves, I believe the weak table can never
> reclaim the space (since interned symbols and keywords are forgeable);

No --- symbols and keywords are GCed, so a weak hash table would work.

(And weakness in hash tables isn't about whether you could synthesize
the key. We have `equal?'-based hash tables with weak keys, after all.)
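
As a minimal sketch of the external symbol-to-string mapping mentioned
earlier, combined with a weak table keyed on the symbols themselves:
the name `cached-symbol-name' and the table `symbol-names' are
hypothetical helpers written for illustration, not existing library
functions. Each symbol is converted at most once while its table entry
is live, and repeated lookups return the same immutable string.

```racket
#lang racket/base

;; Hypothetical cache: maps each symbol to a single immutable string.
;; Keys are held weakly and compared with eq?, so an entry can be
;; reclaimed once its symbol is no longer reachable elsewhere.
(define symbol-names (make-weak-hasheq))

;; Hypothetical helper: returns the cached immutable name of `sym`,
;; allocating and storing it only on the first request.
(define (cached-symbol-name sym)
  (hash-ref! symbol-names sym
             (lambda ()
               (string->immutable-string (symbol->string sym)))))
```

Because the cached strings do not refer back to their symbols, the
values do not keep the weak keys alive. A keyword version would look
the same with `keyword->string' in place of `symbol->string'.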

> However, while I'm musing out loud... would it be possible to have
> symbol->bytes and keyword->bytes that produce the UTF-8 representation
> (presumably with guarantees of uniqueness, immutability, and proper
> UTF-8 encoding)?

Do you mean that `symbol->bytes' would avoid allocation, which is
possible because the symbol is UTF-8 encoded?

If so, there's another part of the representation story that I left out
last time. A symbol's content is "inlined" into the allocated symbol
record, while a string or a byte string is a record containing a
pointer to the string's characters. This difference has to do with C
interoperability and a GC-based prohibition on pointers into the
interior of an allocated object. So, there are many ways in which the
current representations don't yield a cheap `symbol->bytes' operation.
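
For comparison, a `symbol->bytes' can already be written at the user
level by going through the string conversion. The sketch below uses a
hypothetical name, `symbol->bytes*', and it is not the cheap operation
discussed above: it allocates an intermediate string and a new byte
string on every call, and it gives no uniqueness guarantee.

```racket
#lang racket/base

;; Hypothetical helper: UTF-8 encoding of a symbol's name as an
;; immutable byte string. Every call allocates; two calls on the same
;; symbol yield equal? results, with no promise that they are eq?.
(define (symbol->bytes* sym)
  (bytes->immutable-bytes
   (string->bytes/utf-8 (symbol->string sym))))
```

Wrapping this in a weak `eq?'-based table keyed on the symbol would add
uniqueness per symbol, at the cost of the extra table.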



Posted on the dev mailing list.