[plt-scheme] bytes vs u8vector

From: Lauri Alanko (la at iki.fi)
Date: Sat Jan 28 18:08:00 EST 2006

On Sat, Jan 28, 2006 at 05:19:12PM -0500, Eli Barzilay wrote:
> > To me it is not at all obvious why a byte string should have a zero
> > at the end.
> 
> Historical reason: strings in v20x turned to byte-strings in v300.

Yes, but v20x strings were octet sequences, too, and you could have a
zero inside them. Hence you couldn't reliably use C-style string
operations on arbitrary Scheme strings even then.
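
The embedded-zero problem can be sketched in C. This is only an illustration: the struct `byte_seq` and the helpers `c_visible_length` and `embedded_zero_example` are hypothetical names invented here, not part of MzScheme or its FFI. The only library call relied on is the standard `strlen`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a Scheme string: octets plus an explicit length. */
struct byte_seq {
    const unsigned char *data;
    size_t len;
};

/* The length a C-style string operation sees: it stops at the first zero octet. */
static size_t c_visible_length(struct byte_seq s) {
    assert(s.data != NULL);
    return strlen((const char *)s.data);
}

/* Five octets of content with a zero in the middle, followed by a terminator. */
static struct byte_seq embedded_zero_example(void) {
    static const unsigned char raw[] = { 'a', 'b', 0, 'c', 'd', 0 };
    struct byte_seq s = { raw, 5 };
    return s;
}
```

For the value returned by `embedded_zero_example`, `len` is 5 while `c_visible_length` returns 2, so the trailing terminator does not make the sequence safe to pass to C string functions.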

> You're putting things upside down -- IIUC, you're saying that mzscheme
> byte-strings should not be nul-terminated, and the foreign interface
> should provide such a type in addition.

Not exactly. I don't really care about the internal implementation of
byte strings as such. But I want them to be identified with u8vectors,
so if the FFI says that u8vectors don't need to be NUL-terminated, then
byte strings shouldn't need to be, either.

> Currently, byte-strings are nul-terminated and the foreign interface
> adds a type for generic byte vectors.

And _that_ is upside-down. Generic byte vectors are useful to the casual
Scheme programmer. NUL-terminated char arrays are relevant only when
interfacing with C.

> See the paper that describes the foreign system -- when you write code
> that uses this library, you must write `(unsafe!)' to get the full
> power of the library.  This is equivalent to a statement that you know
> that the Scheme code you're writing is equivalent to C code, and as
> such it is exposed to the usual low-level/C dangers.

There are many kinds of "usual low-level dangers". Of course unsafe code
can in principle break anything at all. However, a reasonably talented C
programmer rarely makes buffer overflow or uninitialized pointer
dereferencing errors, and instead makes "ordinary programming mistakes"
about program logic. Memory allocation issues, such as freeing a buffer
that is still in use, are a prime example of such mistakes.

Using module boundaries to delimit blame for errors is a good idea (as
contracts demonstrate), but foreign byte strings seriously undermine this
technique.

When we get an error, we must be able to identify the module to blame
for it. In the simplest case, we simply need to see who was calling whom
when the error was detected. Alternatively, we can sometimes at least
identify who was responsible for a broken object by looking at its type
tag.
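
As a rough sketch of blame by type tag, assume every object carries a tag and each tag is constructed by exactly one module. The tag names and the module names `"runtime"` and `"ffi-client"` below are made up for illustration and do not correspond to anything in MzScheme.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type tags: each kind of object is built by exactly one module. */
enum obj_tag { TAG_BYTES, TAG_FOREIGN_BUFFER };

struct object {
    enum obj_tag tag;
};

/* Map a broken object's tag to the (hypothetical) module that constructs it. */
static const char *module_to_blame(const struct object *broken) {
    assert(broken != NULL);
    switch (broken->tag) {
    case TAG_BYTES:          return "runtime";
    case TAG_FOREIGN_BUFFER: return "ffi-client";
    }
    return "unknown";
}
```

The lookup is only as good as the tags: if an object built by unsafe code carried `TAG_BYTES`, the blame would land on the wrong module.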

But with the current foreign byte strings, a completely ordinary module
that uses no unsafe features whatsoever can operate on an
ordinary-looking byte string and still wreak havoc, because the
underlying buffer has been freed and reallocated in the meantime.
Tracking down this kind of problem becomes just as hard as it is in C.
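
The scenario can be simulated in C without actually invoking undefined behaviour. This is a sketch under stated assumptions: a toy one-slot allocator stands in for malloc/free so that "freed and reallocated" reliably yields the same memory, and all names (`toy_alloc`, `toy_free`, `byte_string`, the `unsafe_*` and `safe_*` functions) are hypothetical, not MzScheme API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SLOT_SIZE 16

/* Toy allocator with a single reusable slot (stand-in for malloc/free). */
static unsigned char slot[SLOT_SIZE];
static int slot_in_use = 0;

static unsigned char *toy_alloc(void) {
    assert(!slot_in_use);
    slot_in_use = 1;
    return slot;
}

static void toy_free(unsigned char *p) {
    assert(p == slot && slot_in_use);
    slot_in_use = 0;
}

/* What a safe module sees: nothing marks this value as foreign-backed. */
struct byte_string {
    unsigned char *data;
    size_t len;
};

/* "Unsafe" module: wraps a foreign buffer as an ordinary-looking byte string. */
static struct byte_string unsafe_make_view(const char *text) {
    size_t n = strlen(text);
    assert(n < SLOT_SIZE);
    unsigned char *buf = toy_alloc();
    memcpy(buf, text, n + 1);
    struct byte_string bs = { buf, n };
    return bs;
}

/* "Unsafe" module: frees the buffer and reuses it while the view still exists. */
static void unsafe_release_and_reuse(struct byte_string bs, const char *other) {
    size_t n = strlen(other);
    assert(n < SLOT_SIZE);
    toy_free(bs.data);
    unsigned char *buf = toy_alloc();  /* same memory is handed out again */
    memcpy(buf, other, n + 1);
}

/* "Safe" module: only reads, yet observes whatever now lives in the buffer. */
static int safe_first_byte(struct byte_string bs) {
    return bs.len > 0 ? bs.data[0] : -1;
}
```

After `unsafe_make_view("hello")` followed by `unsafe_release_and_reuse(bs, "XYZ")`, the safe reader gets 'X' from a byte string that still claims length 5, and nothing in the value points back to the module that caused it.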

In addition to module boundaries, datatype boundaries are also important
blame delimiters. Foreign byte strings break those.


Lauri


Posted on the users mailing list.