get-wide-string builtin function?

Started by ryuo, July 07, 2014, 02:35:04 AM


ryuo

I was able to work out a user-defined function for converting a wchar_t array to a layout that appears compatible with what the 'unicode' function does. However, I was wondering if it could be converted to a builtin primitive function. It seems like a good idea to me because there is already a builtin function for converting UTF-32 to UTF-8 and UTF-8 to UTF-32. Basically, the idea is to have a function like 'get-string', but that works on UTF-32 character arrays that are acquired from a C library somehow. I tried 'get-string', and it simply does not work. Here is my approach using existing builtin functions:



(define (get-wide-string ptr)
  (let (p ptr)
    ; scan 4 bytes at a time until the 32-bit zero terminator is found
    (while (!= ((unpack "lu" p) 0) 0)
      (set 'p (+ p 4)))
    ; copy the buffer including the 4 terminating zero bytes,
    ; so the result has the same layout that 'unicode' produces
    ((unpack (format "s%d" (- p ptr -4)) ptr) 0)))

(println (utf8 (get-wide-string result)))


'result' is a pointer to a wchar_t from a shared library on Linux.



I know wchar_t has no guaranteed size. However, for the systems where wchar_t is used to hold UTF-32, I thought it would be nice to have a builtin function that can copy the wchar_t string and wrap it into a newLISP cell. Basically, it would convert the string to the same form that 'unicode' produces, so it could then be fed to the 'utf8' function to convert it back to the encoding that newLISP uses.



I feel this would be a good addition to newLISP because it would allow newLISP to give and receive UTF-32 strings to and from C libraries. Right now, you can only give UTF-32 strings to C libraries. And while I seldom see UTF-32 used by C libraries, it is used by a Text User Interface library I am trying to write an interface module for. I would prefer this be implemented within the interpreter, since doing it in interpreted code is slower.
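
To illustrate the "give" direction that already works today, something like the following is possible (libtui.so and tui_put_wstring are made-up names standing in for the actual TUI library):

; giving UTF-32 to a C function is already possible
(import "libtui.so" "tui_put_wstring")   ; hypothetical library and function
; 'unicode' converts newLISP's internal UTF-8 string to UTF-32 first
(tui_put_wstring (unicode "hello"))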



However, you may have other reasons why it shouldn't be a builtin. Still, I think it's a good idea, because converting very large strings would probably be much faster with the code implemented in C. Thoughts, Lutz?

Lutz

#1
The current get-string stops at the first zero byte it encounters, so it is only usable for getting text strings in ASCII or UTF-8 form.



Version 10.6.1 now adds an optional <bytes> size parameter to copy any kind of content from a memory address. This also works for UTF-32 (4 bytes for each Unicode character). Of course, you need the size of the buffer, which hopefully is supplied by the library you are using or can be obtained otherwise; specifying it also prevents memory overruns.



The following demo converts a UTF-8 string to UTF-32, then retrieves that Unicode string from the raw memory address:



newLISP v.10.6.1 64-bit on OSX IPv4/6 UTF-8 libffi, options: newlisp -h

> (set 'utf32 (unicode "我能吞下玻璃而不伤身体。"))
"17b0000??000030T000011N0000?s0000?t0000f?0000rN0000$O0000??0000SO0000020000000000000"
> (set 'addr (address utf32))
140592856255712
> (utf8 (get-string addr 52))
"我能吞下玻璃而不伤身体。"
>


see also: http://www.newlisp.org/downloads/development/inprogress/

ryuo

#2
I've been thinking that an additional optional parameter specifying the "character size" to use when looking for a null terminator would be a nice addition to the get-string function. This would be the number of bytes used for a single element in the string. Obviously, the number of bytes cannot be guaranteed to equal the number of characters because of how UTF-8 and UTF-16 work. Only UTF-32 can guarantee this to my knowledge.



What use does this have? It means you would be able to tell newLISP what size to use for each "character" when looking for the null character. It's nice if you don't know the size in advance but do know that the string is null-terminated.



1, 2, and 4 would be good settings for this parameter. It could be implemented by counting how many consecutive null bytes have been encountered. As far as I know, when an integer is set to 0, each of the bytes used to store it is also set to 0, so the endianness of the CPU should have no effect on this. When the count of consecutive null bytes matches the "character size", the null terminator has been found. With a character size of 1, the old behavior would remain the same as before.
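
For illustration, here is roughly what I mean as a user-level sketch (get-terminated-string and the per-width unpack formats "b"/"u"/"lu" are just my choices for the example, not an existing API; unlike get-wide-string above, it excludes the terminator from the result):

; hypothetical user-level version of the proposed behavior
(define (get-terminated-string addr char-size)
  (let (fmt (case char-size (1 "b") (2 "u") (4 "lu"))  ; bytes per "character"
        p   addr)
    ; advance one character unit at a time until an all-zero unit is found
    (while (!= ((unpack fmt p) 0) 0)
      (set 'p (+ p char-size)))
    ; copy everything up to, but not including, the terminator
    (if (= p addr) "" ((unpack (format "s%d" (- p addr)) addr) 0))))

; e.g. (utf8 (get-terminated-string result 4)) for a UTF-32 string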



I thought this would be a nice addition on top of what you wrote about, but it's up to you ultimately. I am just trying to help improve newLISP where it seems to be lacking. Thanks for reading.

TedWalther

#3
What if get-string had an option at the end? The default would be a standard C byte-stream string, as it is now. Then you could use the symbols 'utf8, 'utf16, or 'utf32 to read in strings in the appropriate format. That style also gives flexibility for reading in other types of character strings in the future; for instance, I've heard there are some Asian character encodings comparable to Unicode.



And instead of specifying the size of the string, let the default be 0-terminated, as in C strings. But allow the final option to be a comparator function that checks the character and tells whether the string terminates or not. Or just a character that the function compares itself.



But then things start to get complex, sigh. Probably best to stick with 0 termination and leave it at that, or specify a number of "characters", the way Lutz currently has it.



But it would be nice to have this function call pattern:



(get-string my-address)
(get-string my-address char)
(get-string my-address char 18)
(get-string my-address utf8)
(get-string my-address utf16)
(get-string my-address utf32)


I don't have details, but there is a 2-byte character encoding that is used in some Asian languages; having this funcall pattern would allow that to be incorporated too.
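
In user code the dispatch could look something like this rough sketch (none of these names are real; the utf32 branch just reuses the get-wide-string function from the first post, and the rest are stubs):

(define (get-string-enc addr enc)
  (case enc
    (utf32 (utf8 (get-wide-string addr)))  ; reuse get-wide-string from above
    (ascii (get-string addr))
    (true  (get-string addr))))            ; default: plain C string

; (get-string-enc my-address 'utf32)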
Cavemen in bearskins invaded the ivory towers of Artificial Intelligence.  Nine months later, they left with a baby named newLISP.  The women of the ivory towers wept and wailed.  "Abomination!" they cried.

Lutz

#4
This syntax covers all cases, even if the string buffers are not written in a specific format:



; old syntax
(get-string <address>) ; gets a zero-terminated ASCII or UTF-8 string

; new optional parameters
(get-string <address> <bytes>) ; gets a buffer of <bytes> bytes of binary data
(get-string <address> <bytes> <limit>) ; reads at most <bytes> bytes, up to <limit>


The second pattern would work for UTF-32/UCS-4 this way:

(get-string <address> <bytes> "00000000")

Any other string of any length or content could be specified as <limit> to stop reading.



Setting a maximum number of bytes to read via <bytes> also makes get-string safer, preventing it from running into invalid memory.
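
Building on the demo from the previous post, a UTF-32 string of unknown length could then be read like this (512 is just an arbitrary upper bound; the four zero terminator bytes are written out here as the string "\000\000\000\000"):

(set 'utf32 (unicode "我能吞下玻璃而不伤身体。"))
(set 'addr (address utf32))
; read at most 512 bytes, stopping at the first 4 zero bytes
(utf8 (get-string addr 512 "\000\000\000\000"))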



See: http://www.newlisp.org/downloads/development/inprogress/



Ps: "0000" and "00000000" are searched only on 2/4 byte borders for UCS-2/4

TedWalther

#5
Quote
    Ps: "0000" and "00000000" are searched only on 2/4 byte borders for UCS-2/4


Lutz, my head is exploding.  Please explain?



I'm aware that the 2-byte and 4-byte 0 should only be searched for on 2/4-byte borders for UCS-2/4. How are you communicating to get-string that that is what it should do? Does it assume that the "char-width" is the same as the length of the terminating string?



And does it terminate before reading the full <bytes> count if it finds the terminating string?



And can I put nil in for <bytes> when I don't know how big the string might be, because I want the 0 null character to specify the end of the string?



And with UTF-8, this would fail entirely.



The way it is now, perhaps we need two functions:



get-bytes, which just grabs a stream of 8-bit bytes.



and



get-string, which grabs a stream of "characters", where you specify the character encoding, UTF-8, UTF-16, UTF-32, ASCII, etc.



Because bytes and characters are two separate entities. Smushing them together into one function is making me dizzy.
Cavemen in bearskins invaded the ivory towers of Artificial Intelligence.  Nine months later, they left with a baby named newLISP.  The women of the ivory towers wept and wailed.  "Abomination!" they cried.

TedWalther

#6
Ok, I looked at the Changelog for 10.6.1 "inprogress", and I see that if you use the special values '0000' or '00000000', the reads are aligned on 2- and 4-byte boundaries. This... doesn't feel right. It feels like magic values. I mean, it works. But wouldn't it be better to take that special logic out of get-string, and have get-bytes and get-string be separate functions?



The Asian encodings I was thinking of are Guobiao, Big5, and maybe HZ. Japanese also has 4 encodings. Unicode is the standard, and I'm not asking you to support other encodings, Lutz, but please leave things in such a fashion that the others are easily accommodated.



For instance, utf16, utf32, utf8, and ascii would be symbols referencing internal string reader functions. But if someone wants to implement Big5 or ShiftJIS, let them implement it as a newLISP function and use that as the symbol in the (get-string) funcall. Wouldn't that be lispier?
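
A rough sketch of what I mean (none of these names exist; decode-big5 stands for a decoder someone would write in newLISP, taking raw bytes and returning a UTF-8 string):

; built-in encodings could be handled internally, anything else by a
; user-supplied decoder function
(define (get-string-using addr size decoder)
  (decoder (get-string addr size)))

; hypothetical usage, once someone has written decode-big5:
; (get-string-using my-address 256 decode-big5)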





From Wikipedia:


Quote
    Guobiao is mainly used in Mainland China and Singapore. All Guobiao standards are prefixed by GB, the latest version is GB18030 which is a one, two or four byte encoding.

    Big5, used in Taiwan, Hong Kong and Macau, is a one or two byte encoding.

    Unicode, with the set of CJK Unified Ideographs.



    Other encoding schemes, such as HZ, were also used in the early days.
Cavemen in bearskins invaded the ivory towers of Artificial Intelligence.  Nine months later, they left with a baby named newLISP.  The women of the ivory towers wept and wailed.  "Abomination!" they cried.