[plt-dev] symbol->string and mutability
On Thu, Jun 18, 2009 at 3:35 AM, Matthew Flatt <mflatt at cs.utah.edu> wrote:
> At Wed, 17 Jun 2009 20:28:10 -0400, Carl Eastlund wrote:
>> Why do symbol->string and keyword->string produce mutable strings? In
>> so doing, they have to allocate a new string every time. Is there any
>> way to get at an immutable string that is not allocated more than
>> once? I would prefer that this be the default behavior; R6RS already
>> specifies that symbol->string produces an immutable string, for
>> instance.
>
> Symbols and keywords are represented internally in UTF-8, while strings
> are represented internally as UTF-32. So, there's not an obvious way to
> have `symbol->string' avoid allocation, except by either caching a
> string reference in the symbol (probably not worth the extra space,
> since most symbols are never converted) or keeping a symbol-to-string
> mapping in a hash table (which any programmer can do externally).
>
> I think it would be a good idea to switch to an immutable-string
> result, but considering potential incompatibility, it has never seemed
> worthwhile in the short run.
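
For reference, the external symbol-to-string mapping mentioned above might look roughly like the sketch below. The name `cached-symbol-name` is hypothetical and not part of the standard library. The table is a weak, eq?-keyed hash, on the assumption that one immutable string per symbol is what is wanted.

```racket
#lang racket/base

;; Weak, eq?-keyed table: symbol -> immutable string.
;; (Hypothetical helper for illustration; not a library API.)
(define symbol-names (make-weak-hasheq))

(define (cached-symbol-name sym)
  (or (hash-ref symbol-names sym #f)
      ;; First request for this symbol: convert once and remember the result.
      (let ([str (string->immutable-string (symbol->string sym))])
        (hash-set! symbol-names sym str)
        str)))
```

With this, repeated calls on the same symbol return the same string object, so (eq? (cached-symbol-name 'foo) (cached-symbol-name 'foo)) holds, at the cost of one table lookup per call.
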
I see. I have contracts set up to accept only symbols and keywords
whose names are ASCII strings; I was planning to use a weak, eq?-based
hash of their names to shortcut the test. Apparently, though, I
cannot get eq?-unique names for symbols and keywords. If I hash the
symbols and keywords themselves, I believe the weak table can never
reclaim the space (since interned symbols and keywords are forgeable);
if I use an equal? hash, it defeats the purpose. In the end, this is
probably premature optimization; symbol and keyword names are usually
short, so I can just use an unhashed check.
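
A minimal sketch of the unhashed check, assuming "ASCII" means every character has a code point below 128. The predicate names here are made up for illustration.

```racket
#lang racket/base

;; True when every character in str is ASCII (code point < 128).
(define (ascii-string? str)
  (for/and ([ch (in-string str)])
    (< (char->integer ch) 128)))

;; Hypothetical contract predicates: each call allocates a fresh name
;; string and scans it, with no caching.
(define (ascii-symbol? v)
  (and (symbol? v) (ascii-string? (symbol->string v))))

(define (ascii-keyword? v)
  (and (keyword? v) (ascii-string? (keyword->string v))))
```

The cost is linear in the length of the name on every check, which should be acceptable for typical short identifiers.
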
However, while I'm musing out loud... would it be possible to have
symbol->bytes and keyword->bytes that produce the UTF-8 representation
(presumably with guarantees of uniqueness, immutability, and proper
UTF-8 encoding)?
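
To make the request concrete, here is roughly what I can write today. The names `symbol->utf8-bytes` and `keyword->utf8-bytes` are hypothetical. This sketch gives immutability and valid UTF-8, but it goes through an intermediate string and allocates on every call, so it provides no uniqueness guarantee.

```racket
#lang racket/base

;; Hypothetical stand-ins for the proposed symbol->bytes / keyword->bytes.
;; Each call re-encodes the name, so results are equal? but not eq?.
(define (symbol->utf8-bytes sym)
  (bytes->immutable-bytes (string->bytes/utf-8 (symbol->string sym))))

(define (keyword->utf8-bytes kw)
  (bytes->immutable-bytes (string->bytes/utf-8 (keyword->string kw))))
```

A built-in version could presumably hand back the internal representation directly and skip both the string and the re-encoding.
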
--Carl