[plt-scheme] bytes vs u8vector

From: Eli Barzilay (eli at barzilay.org)
Date: Sat Jan 28 17:19:12 EST 2006

On Jan 28, Lauri Alanko wrote:
> On Sat, Jan 28, 2006 at 01:32:06PM -0500, Eli Barzilay wrote:
> > Except for the extra zero that must be there.
> 
> > It just happens that byte strings stand for a C char* which can be
> > viewed as a byte vector, but it is more common to use them as strings.
> 
> > That's not part of the foreign interface, it is an assumption that is
> > built into mzscheme.  (Unfortunate or not, the reason it is there is
> > obvious.)
> 
> There must be something fundamental that I'm just not grokking here.
> To me it is not at all obvious why a byte string should have a zero
> at the end.

Historical reason: strings in v20x became byte strings in v300.

Practical reason: uses of `string' operations in v20x are now easy to
port -- leave them as strings if you care about the value as a string,
or turn them into byte strings if you care about an arbitrary block of
data (as in port operations).
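
A small sketch of the v300 distinction (using `open-input-bytes' to
stand in for an arbitrary byte-oriented port):

    ;; Textual value: keep it a string.
    (define s "hello")
    ;; Arbitrary block of data: use a byte string.
    (define b (string->bytes/utf-8 s))          ; => #"hello"
    (bytes-length b)                            ; => 5 (the nul is extra)
    ;; Port operations traffic in bytes:
    (read-bytes 4 (open-input-bytes #"data"))   ; => #"data"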

> Since native byte string operations know the string's exact length,
> the only conceivable reason I can think of for null-termination is
> for interfacing with legacy C code.

Practical reason #2: the `legacy' in the previous sentence is a little
misleading -- there is plenty of current Unicode-unaware code that
simply uses char*.  (I think it will be decades before this is truly
a legacy issue.)
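
For instance, handing a byte string straight to a char*-consuming libc
function (a sketch; `_int' stands in for size_t here):

    (require (lib "foreign.ss"))
    (unsafe!)
    ;; strlen walks the data until it hits a nul -- which a mzscheme
    ;; byte string is guaranteed to have.
    (define strlen
      (get-ffi-obj "strlen" #f (_fun _bytes -> _int)))
    (strlen #"hello")   ; => 5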

> Yet you say that this has nothing to do with the FFI.

Not-my-department reason: the implementation of byte strings is in the
mzscheme core, not in the foreign interface -- its purpose corresponds
to the three reasons above.


> I don't see why byte strings and char* would be equated anyway. A
> byte string is exactly like an ordinary string except that it stores
> octets instead of characters. But in C code, a zero-terminated char
> array is predominantly used for storing textual information, that
> is, strings of _characters_, not octets: when you have a C-string,
> you're typically more interested in what letters are printed out
> rather than which octet values your string contains. Indeed, a char
> in C may be larger than a single octet. (IIRC, some Cray-based
> platforms used to have 64-bit chars.)

[And mzscheme did not support such characters.]  In any case, you have
a problem with the mzscheme core, not with the foreign interface.  I'm
sure that Matthew can come up with many additional reasons why the
nul-terminator is useful.


> Hence the most meaning-preserving representation for a C-string at
> the Scheme side would be an ordinary character string. If
> lower-level direct manipulation of buffers is desired at Scheme
> side, then the Scheme code can just as well manually ensure that the
> buffers it sends to C are null-terminated. Conceivably there could be
> a special data type in the FFI for C strings such that it was
> zero-terminated and would only allow non-zero octets within it, but
> such a type should certainly be separate from the general-purpose
> byte strings that are used in plain Scheme programming.

You're putting things upside down -- IIUC, you're saying that mzscheme
byte-strings should not be nul-terminated, and the foreign interface
should provide such a type in addition.  Currently, byte-strings are
nul-terminated and the foreign interface adds a type for generic byte
vectors.
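
Roughly (a sketch, assuming the srfi-4-style u8vector operations that
come with `(lib "foreign.ss")'):

    (require (lib "foreign.ss"))
    ;; Core byte string: nul-terminated, usable as a C string.
    (define bs #"abc")
    ;; Foreign-interface byte vector: a plain block of octets, no nul.
    (define v (make-u8vector 3 0))
    (u8vector-set! v 0 255)
    (u8vector-ref v 0)   ; => 255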


> > > correctly that it's possible to create byte strings whose data
> > > actually resides at some pointer that's been returned from the C
> > > world?
> 
> > The GC just ignores all pointers to memory that it does not
> > manage.
> 
> The problem again is that a byte string is a general-purpose data
> type and those are supposed to be safe to the casual programmer.
> This means that the buffer that the byte string uses should be
> guaranteed to stay usable until the byte string is no longer used.
> But if the buffer is foreign, it might get freed by some C code
> while the byte string remains alive.

And that's a problem that the foreign interface is very intentionally
not dealing with.  The philosophy is that if you're dealing with some
external library, then you should know when pointers become unusable.
You get the facilities for automatic management, but it's your
responsibility to write that code, since you know the behavior of the
foreign code in question.
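
Something like the following, where `libfoo', `make_buffer', and
`free_buffer' are hypothetical stand-ins for whatever library you're
wrapping:

    (require (lib "foreign.ss"))
    (unsafe!)
    (define libfoo (ffi-lib "libfoo"))           ; hypothetical library
    (define make-buffer
      (get-ffi-obj "make_buffer" libfoo (_fun _int -> _pointer)))
    (define free-buffer
      (get-ffi-obj "free_buffer" libfoo (_fun _pointer -> _void)))
    ;; You know this library's rule ("caller must call free_buffer"),
    ;; so you encode it yourself:
    (define (make-managed-buffer n)
      (let ([p (make-buffer n)])
        (register-finalizer p free-buffer)       ; run when p is garbage
        p))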


> Sure, this can be remedied by getting ownership of or a reference to
> the buffer and registering a finalizer that releases it, but I just
> feel queasy that what I thought to be a primitive tightly managed
> scheme-only object might in fact be a window to a dangerous foreign
> world where dragons lurk just beyond sight...

See the paper that describes the foreign system -- when you write code
that uses this library, you must write `(unsafe!)' to get its full
power.  This is equivalent to declaring that you know the Scheme code
you're writing is effectively C code, and as such is exposed to the
usual low-level C dangers.  Previously, if I ran any mzscheme code and
got a segfault, the responsibility was strictly on the C code that
implements the Scheme core and extensions -- now the responsibility is
either in such code *or* in Scheme code that uses the foreign library
with `(unsafe!)'.
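
In module form the protocol looks like this (a sketch; `getpid' is just
a convenient libc function to call):

    (module ffi-demo mzscheme
      (require (lib "foreign.ss"))
      ;; Without this declaration the dangerous bindings are not
      ;; available -- writing it means accepting C-level responsibility.
      (unsafe!)
      (define getpid
        (get-ffi-obj "getpid" #f (_fun -> _int)))
      (printf "pid: ~a~n" (getpid)))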

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                  http://www.barzilay.org/                 Maze is Life!

