[See solution in thread below.]
I'm trying to implement several versions of the Lempel-Ziv-x and Snappy compression algorithms. Ordinarily, I like to get my logic straight in Lisp, and then, if I need the speed, I'll port the tight loops to a C library. In this case, however, newLISP has been atypically difficult to debug. I wonder if there are some simple code patterns I'm overlooking.
It would, of course, be simpler to use a non-UTF-8 build of newLISP, but I want to compress UTF-8 strings that I'm processing within newLISP.
So given a UTF-8 string us, I understand that (slice us i 1) will give me an 8-bit "char". I also found that defining
(define (byte s (i 0))
  (char s i true))
helped in some situations. But then I ran into problems trying to unpack a code like 32765 into two bytes. I thought I could get the low byte, 253, like this:
> (mod 32765 256)
253
;; but
> (byte (mod 32765 256))
ý
;; and
>(byte (byte (mod 32765 256)))
195
And while the following round trip through (char) looks OK
>(char (char (mod 32765 256)))
253
>(char (mod 32765 256))
"ý"
>(length "ý")
2
the UTF-8 char length messes with the byte discipline of the compression algorithms.
At last, I found that (pack) can work:
>(pack "b" (& 32765 0xff))
"�"
;; and
> (byte (pack "b" (& 32765 0xff)))
253
;; (and for the high byte):
>(byte (pack "b" (/ 32765 256)))
127
But, a little confusingly, there were still some gotchas. For example, (pack) doesn't work with (mod):
> (byte (pack "b" (mod 32765 256)))
16
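A likely explanation, offered as an assumption rather than something confirmed in this thread: in newLISP, mod is the floating-point modulo, so (mod 32765 256) returns a float that merely prints as 253, while the "b" format of (pack) expects an integer. The integer operators % and & sidestep this. A minimal sketch, reusing the (byte) helper defined above:

```newlisp
;; the (byte) helper from earlier in this post
(define (byte s (i 0))
  (char s i true))

(float? (mod 32765 256))                  ; check whether mod hands back a float
(integer? (% 32765 256))                  ; % is the integer remainder

(byte (pack "b" (% 32765 256)))           ; low byte via integer remainder, expected 253
(byte (pack "b" (int (mod 32765 256))))   ; or convert the float to an integer first
```

If mod is wanted for some other reason, wrapping its result in (int) before packing should keep (pack) on integer input.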
So, long story short, I've got these manipulations more or less working, but I wonder if there's a more direct way to manipulate such bytes and 8-bit chars?
Not sure what you are trying to do here, but according to the documentation on UTF-8 you can use explode: "Use explode to obtain an array of UTF-8 characters and to manipulate characters rather than bytes when a UTF-8–enabled function is unavailable".
Hope this helps!
Thanks, fdb, but I don't want utf-8 chars; I need a byte stream.
But I think I've found a major source of my confusion, so I'll mark this thread "SOLVED":
(char "x") => a Unicode code-point. This should perhaps have been obvious to me, but "code-point" is not mentioned in the Manual. Consequently my helper function
(define (byte s (i 0))
  (char s i true))
;; returns a utf-8 char
(byte 218)
"Ú"
;; and even though the code-point for "Ú" is 218
(char "Ú")
218
;; the encoding of the code-point is of length 2 !!
(length (byte 218))
2
;; so, confusingly,
(char (byte 218))
218 ;; the code-point is one byte long
;; but
(byte (byte 218))
195
where 195 is the first byte of the 2-byte code-point encoding.
It appears my helper function should have been this:
(define (byte x)
  (if (number? x)
      (pack "b" x)          ; integer -> 1-byte string
      (char x 0 true)))     ; 1-byte string -> integer
;; and now
(byte (byte 218))
218
Said differently, while (char) reciprocally translates one-byte code-points in the lower ASCII range to one-byte chars, it does not do so for code-points in the range 0x80-0xff and beyond, because those are encoded as two or more bytes in UTF-8.
My code now uses (slice stringx n 1) to fetch a byte and the revised (byte) to reciprocally transform 1-byte chars. It never directly calls (char), and this appears to be working. Yay!
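For the original task of splitting a code such as 32765 into two bytes and reassembling it, the same (pack)-based approach can be extended. A minimal sketch, assuming the codes fit in 16 bits and that high-byte-first order is wanted; code-to-bytes and bytes-to-code are hypothetical helper names, not part of newLISP or of this thread:

```newlisp
;; hypothetical helper: 16-bit code -> 2-byte string, high byte first
(define (code-to-bytes n)
  (pack "bb" (& (>> n 8) 0xff) (& n 0xff)))

;; hypothetical helper: 2-byte string -> 16-bit code
(define (bytes-to-code s)
  ;; unpack "bb" yields a list of two unsigned byte values
  (let (b (unpack "bb" s))
    (+ (<< (b 0) 8) (b 1))))

(length (code-to-bytes 32765))          ; byte length of the packed string, expected 2
(unpack "bb" (code-to-bytes 32765))     ; expected (127 253)
(bytes-to-code (code-to-bytes 32765))   ; expected 32765
```

Only integer operators (&, >>, <<) are used here, so no float ever reaches (pack), and (char) is never involved, so no UTF-8 encoding is applied to the bytes.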