[plt-scheme] bytes vs u8vector

From: Eli Barzilay (eli at barzilay.org)
Date: Sat Jan 28 18:29:15 EST 2006

On Jan 29, Lauri Alanko wrote:
> On Sat, Jan 28, 2006 at 05:19:12PM -0500, Eli Barzilay wrote:
> > > To me it is not at all obvious why a byte string should have a zero
> > > at the end.
> > 
> > Historical reason: strings in v20x turned to byte-strings in v300.
> 
> Yes, but v20x strings were octet sequences, too, and you could have
> a zero inside them. Hence you couldn't reliably use C-style string
> operations on arbitrary Scheme strings even then.

Sure you could -- if there's a zero in them, then most C functions
would not use the whole string, but the main danger that the
terminating nul protects against is referencing memory you should not
reference.
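
To make the distinction concrete, here is a minimal C sketch.  The
`byte_string' struct and both function names are hypothetical
stand-ins, not mzscheme's actual representation; the only assumption
is a buffer that stores an explicit length plus one terminating nul
after the payload.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a byte string: an explicit length plus a
   terminating nul stored right after the payload. */
typedef struct {
    size_t len;       /* number of payload bytes */
    char data[16];    /* payload followed by the terminating nul */
} byte_string;

/* Build a byte string from raw bytes, always appending the terminator. */
byte_string make_byte_string(const char *bytes, size_t len)
{
    byte_string s;
    assert(len < sizeof s.data);   /* leave room for the terminator */
    memcpy(s.data, bytes, len);
    s.data[len] = '\0';
    s.len = len;
    return s;
}

/* What a C string function sees: everything up to the first nul. */
size_t c_visible_length(const byte_string *s)
{
    return strlen(s->data);
}
```

With a payload of the five bytes `a b 0 c d', `len' is 5 while
`c_visible_length' reports 2: the C side sees a truncated string, but
`strlen' is still guaranteed to stop at or before `data[len]', so it
never reads outside the buffer.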


> > You're putting things upside down -- IIUC, you're saying that
> > mzscheme byte-strings should not be nul-terminated, and the
> > foreign interface should provide such a type in addition.
> 
> Not exactly. I don't really care about the internal implementation
> of byte strings as such. But I want them to be identified with
> u8vectors, so if the FFI says that u8vectors don't need to be
> null-terminated, then byte strings shouldn't need to be, either.

Sure you care about them -- you care about not being able to use byte
strings as u8vectors.


> > Currently, byte-strings are nul-terminated and the foreign
> > interface adds a type for generic byte vectors.
> 
> And _that_ is upside-down. Generic byte vectors are useful to the
> casual scheme programmer. NUL-terminated char arrays are relevant
> only when interfacing with C.

I don't think that there is anything I can add to this discussion.  I
will shut up about this issue now.


> > See the paper that describes the foreign system -- when you write
> > code that uses this library, you must write `(unsafe!)' to get the
> > full power of the library.  This is equivalent to a statement that
> > you know that the Scheme code you're writing is equivalent to C
> > code, and as such it is exposed to the usual low-level/C dangers.
> 
> There are many kinds of "usual low-level dangers". Of course unsafe
> code can in principle break anything at all.

And Scheme code is safe from certain things like segfaults, unless it
uses foreign.


> But with the current foreign byte strings,

The term "byte strings" describes a data type that is part of
mzscheme, not the foreign interface.  "Foreign byte strings" is
therefore bogus in nature.


> it is possible that a completely ordinary module that uses no unsafe
> features whatsoever operates on an ordinary-looking byte string and
> this will wreak havoc since the buffer had been freed and
> reallocated in the meanwhile.  Tracking this kind of a problem
> becomes just as hard as it is in C.

It is not.  When you write an interface to a foreign library, you
should protect your users against such problems -- if they get
writeable strings that can be freed, or any other kind of pointers
that can be invalidated, then it's your bug -- you should copy such
objects, or wrap them in new types with operations that are safe.
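
The "copy such objects" option can be sketched in C as follows.
Everything here is hypothetical: `lib_get_name' and `lib_shutdown'
stand in for some foreign library that owns its buffer, and
`safe_get_name' stands in for the glue code.  In Scheme glue code the
copy would be a fresh, garbage-collected byte string rather than a
malloc'ed block.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical foreign library: it owns this buffer and may free or
   reallocate it at any time. */
static char *lib_buffer = NULL;

const char *lib_get_name(void)
{
    free(lib_buffer);
    lib_buffer = malloc(6);
    assert(lib_buffer != NULL);
    memcpy(lib_buffer, "hello", 6);   /* five characters plus the nul */
    return lib_buffer;                /* borrowed pointer, not owned by caller */
}

void lib_shutdown(void)
{
    free(lib_buffer);                 /* invalidates every borrowed pointer */
    lib_buffer = NULL;
}

/* Glue-code wrapper: hand out a private copy that the caller owns, so
   later frees inside the library cannot invalidate it.  Returns NULL
   if the copy cannot be allocated. */
char *safe_get_name(void)
{
    const char *borrowed = lib_get_name();
    size_t n = strlen(borrowed) + 1;  /* include the terminating nul */
    char *copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, borrowed, n);
    return copy;
}
```

A caller that keeps the result of `safe_get_name' can still read it
after `lib_shutdown' has run, which is exactly what a caller holding
the raw pointer from `lib_get_name' cannot do.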

The only exception is if you write a library for glue code libraries,
and in that case you should define your own `unsafe!'-like form.

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                  http://www.barzilay.org/                 Maze is Life!


Posted on the users mailing list.