[plt-scheme] bytes vs u8vector

From: Lauri Alanko (la at iki.fi)
Date: Sat Jan 28 16:57:31 EST 2006

On Sat, Jan 28, 2006 at 01:32:06PM -0500, Eli Barzilay wrote:
> Except for the extra zero that must be there.

> It just happens that byte strings stand for a C char* which can be
> viewed as a byte vector, but it is more common to use them as strings.

> That's not part of the foreign interface, it is an assumption that is
> built into mzscheme.  (Unfortunate or not, the reason it is there is
> obvious.)

There must be something fundamental that I'm just not grokking here. To
me it is not at all obvious why a byte string should have a zero at the
end. Since native byte string operations know the string's exact length,
the only conceivable reason I can think of for null-termination is for
interfacing with legacy C code. Yet you say that this has nothing to do
with the FFI.

I don't see why byte strings and char* would be equated anyway. A byte
string is exactly like an ordinary string except that it stores octets
instead of characters. But in C code, a zero-terminated char array is
predominantly used for storing textual information, that is, strings of
_characters_, not octets: when you have a C-string, you're typically
more interested in what letters are printed out rather than which octet
values your string contains. Indeed, a char in C may be larger than a
single octet. (IIRC, some Cray-based platforms used to have 64-bit
chars.)
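
To make the distinction concrete, here is a minimal C sketch of the two
views of the same memory. The `byte_buf` struct and both function names
are hypothetical stand-ins for illustration, not mzscheme's actual
representation: one view trusts the stored length, the other scans for a
terminating zero the way legacy C code does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical length-carrying byte buffer, standing in for a Scheme
   byte string. This is an assumption, not mzscheme's real layout. */
struct byte_buf {
    const unsigned char *data;
    size_t len;
};

/* Length as native byte string operations see it: the stored count. */
size_t buf_length(const struct byte_buf *b) {
    assert(b != NULL);
    return b->len;
}

/* Length as legacy C code sees it: the number of octets before the
   first zero. Only safe if a zero octet really is present in the
   memory that data points to. */
size_t c_string_length(const struct byte_buf *b) {
    assert(b != NULL && b->data != NULL);
    return strlen((const char *)b->data);
}
```

For a buffer holding the five octets 'a', 'b', 0, 'c', 'd' followed by a
terminator, the first function reports 5 and the second reports 2, so
the terminator only helps when the contents are known to be zero-free
text.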

Hence the most meaning-preserving representation for a C string on the
Scheme side would be an ordinary character string. If lower-level direct
manipulation of buffers is desired on the Scheme side, then the Scheme
code can just as well manually ensure that the buffers it sends to C are
null-terminated. Conceivably there could be a special data type in the
FFI for C strings, one that is zero-terminated and allows only non-zero
octets within it, but such a type should certainly be separate from the
general-purpose byte strings that are used in plain Scheme programming.
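
As a rough sketch of what such a separate C-string type might enforce at
its boundary, the conversion below copies a plain byte buffer into a
zero-terminated string and refuses input that contains a zero octet. The
function name `bytes_to_c_string` is hypothetical and does not
correspond to anything in the mzscheme FFI.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical conversion from a general-purpose byte buffer to a
   C string. Returns a freshly allocated, zero-terminated copy of
   data[0..len), or NULL if the input contains a zero octet or the
   allocation fails. The caller releases the result with free(). */
char *bytes_to_c_string(const unsigned char *data, size_t len) {
    assert(data != NULL || len == 0);
    if (len == SIZE_MAX)
        return NULL;                      /* len + 1 would overflow */
    if (len > 0 && memchr(data, 0, len) != NULL)
        return NULL;                      /* interior zero: not a valid C string */
    char *s = malloc(len + 1);
    if (s == NULL)
        return NULL;
    if (len > 0)
        memcpy(s, data, len);
    s[len] = '\0';                        /* the terminator lives only in the copy */
    return s;
}
```

With this split, ordinary byte strings would need no hidden terminator,
and only values that cross into C pay for the check and the extra octet.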

> > correctly that it's possible to create byte strings whose data
> > actually resides at some pointer that's been returned from the C
> > world?

> The GC just ignores all pointers to memory that it does not manage.

The problem again is that a byte string is a general-purpose data type,
and those are supposed to be safe for the casual programmer. This means
that the buffer that the byte string uses should be guaranteed to stay
usable until the byte string is no longer used. But if the buffer is
foreign, it might get freed by some C code while the byte string remains
alive.

Sure, this can be remedied by taking ownership of, or a reference to,
the buffer and registering a finalizer that releases it, but I just feel
queasy that what I thought to be a primitive, tightly managed,
Scheme-only object might in fact be a window to a dangerous foreign
world where dragons lurk just beyond sight...
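
The ownership remedy mentioned above could look roughly like the
following C sketch. Everything here is hypothetical (the `foreign_bytes`
struct and both functions are invented for illustration): the wrapper
records which function releases the foreign buffer, and a collector's
finalizer hook would call `foreign_bytes_finalize` once the byte string
becomes unreachable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical wrapper: a view onto foreign memory together with the
   function that releases it. */
struct foreign_bytes {
    unsigned char *data;
    size_t len;
    void (*release)(void *);   /* e.g. free, or a library's own release call */
};

/* Take ownership of buf. From here on, no other code may free it.
   Returns NULL (and does not take ownership) if allocation fails. */
struct foreign_bytes *foreign_bytes_adopt(unsigned char *buf, size_t len,
                                          void (*release)(void *)) {
    struct foreign_bytes *fb = malloc(sizeof *fb);
    if (fb == NULL)
        return NULL;
    fb->data = buf;
    fb->len = len;
    fb->release = release;
    return fb;
}

/* What a GC finalizer would call when the wrapper is unreachable:
   release the foreign buffer, then the wrapper itself. */
void foreign_bytes_finalize(struct foreign_bytes *fb) {
    if (fb == NULL)
        return;
    if (fb->release != NULL)
        fb->release(fb->data);
    free(fb);
}
```

This only keeps the buffer alive if the C side honors the transfer of
ownership. Nothing in the sketch stops other C code from freeing the
buffer behind the wrapper's back, which is exactly the worry.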


Lauri


Posted on the users mailing list.