Manipulating byte strings -- SOLVED

Started by Thorstein, December 28, 2020, 05:44:44 PM


Thorstein

[See solution in thread below.]



I'm trying to implement several versions of the Lempel-Ziv-x and Snappy compression algorithms.  Ordinarily, I like to get my logic straight in Lisp, and then, if I need the speed, I port the tight loops to a C library.  In this case, however, newLISP has been unusually difficult to debug.  I wonder if there are some simple code patterns I'm overlooking.



It would, of course, be simpler to use a build of newLISP without UTF-8 support, but I want to compress UTF-8 strings that I'm processing within newLISP.



So given a UTF-8 string us, I understand that (slice us i 1) will give me an 8-bit "char". I also found that defining


(define (byte s (i 0))
  (char s i true))

helped in some situations. But then I ran into problems trying to unpack a code like 32765 into two bytes.  I expected the following to give me the low byte, 253:

> (mod 32765 256)
253

;; but
> (byte (mod 32765 256)) 
ý

;; and
>(byte (byte (mod 32765 256)))
195


And while the following round trip through (char) looks OK

>(char (char (mod 32765 256)))
253

>(char (mod 32765 256))
"ý"

>(length "ý")
2

the UTF-8 char length messes with the byte discipline of the compression algorithms.



At last, I found that (pack) can work:

>(pack "b" (& 32765 0xff))
"�"

;; and
> (byte (pack "b" (& 32765 0xff)))
253

;; (and for the high byte):
>(byte (pack "b" (/ 32765 256)))
127


But, a little confusingly, there were still some gotchas.  For example, (pack) gives a wrong result when fed the output of (mod):

> (byte (pack "b"  (mod 32765 256)))
16
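A likely cause, offered here as an assumption rather than a confirmed diagnosis: in newLISP, mod is the floating-point modulo, while % is the integer one, so (mod 32765 256) hands (pack) a float that merely prints as 253. The sketch below checks the types and uses % instead. It relies only on built-ins already used in this thread, plus float?, integer? and %.

```newlisp
;; Check what kind of number each modulo operator returns.
(float? (mod 32765 256))      ; is the result of mod a float?
(integer? (% 32765 256))      ; is the result of % an integer?

;; Pack the integer low byte, then read it back as a byte value.
(set 'low (pack "b" (% 32765 256)))
(length low)                  ; byte length of the packed string
(char low 0 true)             ; byte value at offset 0
```

If the assumption holds, (% 32765 256) and (& 32765 0xff) are interchangeable for non-negative codes, and either is safe to pass to (pack "b" ...).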

So, long story short, I've got these manipulations more or less working, but I wonder if there's a more direct way to manipulate such bytes and 8-bit chars?

fdb

#1
Not sure what you are trying to do here, but according to the documentation on UTF-8 you can use explode: "Use explode to obtain an array of UTF-8 characters and to manipulate characters rather than bytes when a UTF-8–enabled function is unavailable".



Hope this helps!

Thorstein

#2
Thanks, fdb, but I don't want utf-8 chars; I need a byte stream.



But I think I've found a major source of my confusion, so I'll mark this thread "SOLVED":



(char "x") returns a Unicode code point, not a byte value.  This should perhaps have been obvious to me, but "code-point" is not mentioned in the Manual. Consequently, my original helper function misbehaves when given a number:


(define (byte s (i 0))
  (char s i true))
;; returns a utf-8 char
(byte 218)
"Ú"
;; and even though the code-point for "Ú" is 218
(char "Ú")
218
;; the encoding of the code-point is of length 2 !!
(length (byte 218))
2  
;; so, confusingly,
(char (byte 218))
218                      ;; the code-point value fits in one byte
;; but
(byte (byte 218))
195                    


where 195 is the first byte of the 2-byte UTF-8 encoding of that code point.
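To make the distinction concrete, here is a minimal sketch that inspects both bytes of that encoding. It assumes a UTF-8 enabled build (utf8len is only available there); the variable name s is just for illustration.

```newlisp
;; Code point 218 (0xDA) is encoded in UTF-8 as the two bytes 0xC3 0x9A.
(set 's (char 218))     ; "Ú" on a UTF-8 build
(length s)              ; byte length: 2
(utf8len s)             ; character count: 1
(char s 0 true)         ; first byte:  195 (0xC3)
(char s 1 true)         ; second byte: 154 (0x9A)
(char s)                ; code point:  218
```

So (length) and (char s i true) see bytes, while (char s) and (utf8len) see characters.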



It appears my helper function should have been this:


(define (byte x)
  (if (number? x)
      (pack "b" x)
      (char x 0 true)))
;; and now
(byte (byte 218))
218


Put differently, (char) maps code points to one-byte strings and back only in the ASCII range (0x00-0x7f).  For code points from 0x80 upward, including 0x80-0xff, it produces a multi-byte UTF-8 string instead.



My code now uses (slice stringx n 1) to fetch a byte and the revised (byte) to reciprocally transform 1-byte chars.  It never directly calls (char), and this appears to be working. Yay!
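For the original problem of splitting a code such as 32765 into two bytes, a rough sketch along the same lines follows. The names code-to-bytes and bytes-to-code are hypothetical helpers invented for this example, and the low-byte-first order is an arbitrary choice; a real format would dictate the order.

```newlisp
;; Split a 16-bit code into a 2-byte string, low byte first.
(define (code-to-bytes code)
  (pack "b b" (& code 0xff) (& (>> code 8) 0xff)))

;; Rebuild the code from a 2-byte string produced above.
(define (bytes-to-code s)
  (let (b (unpack "b b" s))
    (| (b 0) (<< (b 1) 8))))

(set 'packed (code-to-bytes 32765))
(length packed)             ; 2 bytes
(unpack "b b" packed)       ; (253 127)
(bytes-to-code packed)      ; 32765
```

Using only integer operators (&, >>, <<, |) together with (pack) and (unpack) keeps (char) and its code-point semantics out of the byte stream entirely.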