Title: Manipulating byte strings -- SOLVED
Post by: Thorstein on December 28, 2020, 05:44:44 PM

[See solution in thread below.]

I'm trying to implement several versions of the Lempel-Ziv-x and Snappy compression algorithms. Ordinarily, I like to get my logic straight in Lisp, and then, if I need the speed, I'll port the tight loops to a C library. In this case, however, newLISP has been atypically difficult to debug. I wonder if there are some simple code patterns I'm overlooking.

It would, of course, be simpler to use a build of newLISP without UTF-8 support, but I want to compress UTF-8 strings that I'm processing within newLISP.

So given a UTF-8 string us, I understand that (slice us i 1) will give me an 8-bit "char". I also found that defining

Code:
(define (byte s (i 0))
  (char s i true))

helped in some situations. But then I ran into problems trying to split a code like 32765 into two bytes. I thought I could get the low byte, 253, as follows.

Code:

> (mod 32765 256)
253

;; but
> (byte (mod 32765 256)) 
ý

;; and
>(byte (byte (mod 32765 256)))
195

And while the following round trip through (char) looks OK

Code:

>(char (char (mod 32765 256)))
253

>(char (mod 32765 256))
"ý"

>(length "ý")
2

the UTF-8 char length messes with the byte discipline of the compression algorithms.

At last, I found that (pack) can work:

Code:

>(pack "b" (& 32765 0xff))
"�"

;; and
> (byte (pack "b" (& 32765 0xff)))
253

;; (and for the high byte):
>(byte (pack "b" (/ 32765 256)))
127

But, a little confusingly, there were still some gotchas. For example, (pack) gives an unexpected result when fed the output of (mod):

Code:

> (byte (pack "b"  (mod 32765 256)))
16
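
One detail that may be relevant here, though nothing in this thread confirms it as the cause: in newLISP, mod is the floating-point modulo and returns a float, while % and & work on integers. The sketch below only checks the result types and then feeds an integer remainder to (pack); it assumes nothing beyond the built-ins float?, integer?, pack and unpack.

```newlisp
;; mod is the floating-point modulo; % and & are integer operators.
(println (float? (mod 32765 256)))
(println (integer? (% 32765 256)))
(println (integer? (& 32765 0xff)))

;; Integer-only low byte (32765 = 0x7FFD, so the low byte is 0xFD = 253),
;; packed into one raw byte and read back as a list of byte values.
(println (unpack "b" (pack "b" (% 32765 256))))
```

If the float result is indeed what trips up (pack), sticking to % or & for byte arithmetic sidesteps the question entirely.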

So, long story short, I've got these manipulations more or less working, but I wonder if there's a more direct way to manipulate such bytes and 8-bit chars?

Title: Re: Manipulating byte strings
Post by: fdb on December 29, 2020, 05:37:47 AM

Not sure what you are trying to do here, but according to the documentation on UTF-8 you can use explode: "Use explode to obtain an array of UTF-8 characters and to manipulate characters rather than bytes when a UTF-8–enabled function is unavailable".

Hope this helps!

Title: SOLVED: Re: Manipulating byte strings
Post by: Thorstein on January 05, 2021, 07:47:54 AM

Thanks, fdb, but I don't want utf-8 chars; I need a byte stream.

But I think I've found a major source of my confusion, so I'll mark this thread "SOLVED":

(char "x") => a Unicode code-point. This should perhaps have been obvious to me, but "code-point" is not mentioned in the Manual. Consequently my helper function

Code:
(define (byte s (i 0))
  (char s i true))
;; returns a utf-8 char
(byte 218)
"Ú"
;; and even though the code-point for "Ú" is 218
(char "Ú")
218
;; the encoding of the code-point is of length 2 !!
(length (byte 218))
2  
;; so, confusingly,
(char (byte 218))
218                      ;; the code-point is one byte long
;; but
(byte (byte 218))
195

where 195 is the first byte of the two-byte UTF-8 encoding of that code point.
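
To make the two-byte encoding visible, here is a minimal sketch that lists the raw byte values of a string. The helper name bytes-of is hypothetical, and the first result assumes a UTF-8 enabled build, where (char 218) returns the UTF-8 encoding of code point 218.

```newlisp
;; Hypothetical helper: return the raw byte values of string s as a list.
;; (length s) counts bytes, so the format string gets one "b" per byte.
(define (bytes-of s)
  (unpack (dup "b" (length s)) s))

;; On a UTF-8 build, code point 218 (U+00DA) is encoded as 0xC3 0x9A.
(println (bytes-of (char 218)))      ; UTF-8 build: (195 154)

;; A single raw byte made with pack, for comparison.
(println (bytes-of (pack "b" 218)))  ; (218)
```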

It appears my helper function should have been this:

Code:
(define (byte x)
  (if (number? x)
      (pack "b" x)        ; number -> one-byte string
      (char x 0 true)))   ; string -> value of its first byte
;; and now
(byte (byte 218))
218

Said differently, (char) maps code points in the ASCII range (0x00-0x7f) to one-byte strings and back, but for code points from 0x80 upward it produces a multi-byte UTF-8 string rather than a single byte.

My code now uses (slice stringx n 1) to fetch a byte and the revised (byte) to convert between byte values and one-byte strings. It never calls (char) directly, and this appears to be working. Yay!
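
For the two-byte codes mentioned at the start of the thread, the same byte-only approach can be sketched as a round trip. The names encode16 and decode16 are hypothetical, and the high-byte-first order is an assumption; this illustrates the idea and is not the code actually used in the compressor.

```newlisp
;; Hypothetical helper: pack a 16-bit code into a two-byte string,
;; high byte first, using integer operators only.
(define (encode16 code)
  (pack "bb" (& (>> code 8) 0xff) (& code 0xff)))

;; Hypothetical helper: read a 16-bit code back from byte offset i of s.
;; slice and unpack both work on raw bytes here, not on UTF-8 characters.
(define (decode16 s (i 0))
  (let (b (unpack "bb" (slice s i 2)))
    (+ (<< (b 0) 8) (b 1))))

(println (length (encode16 32765)))     ; 2
(println (decode16 (encode16 32765)))   ; 32765
```

Because neither helper goes through (char) on a number, the result is the same length for every code from 0 to 65535.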

newLISP Fan Club
