newLISP Fan Club

Forum => newLISP in the real world => Topic started by: hivecluster on April 22, 2009, 05:23:53 PM

Title: russian text and find-all problem
Post by: hivecluster on April 22, 2009, 05:23:53 PM
hello, everybody

look here:

newLISP v.10.0.2 on Linux IPv4 UTF-8, execute 'newlisp -h' for more info.
> (set-locale "ru_RU.utf8")
("ru_RU.utf8" ",")
> (find-all "eye" "EYE eYe eye" $0 1)
("EYE" "eYe" "eye")
> (find-all "глаз" "ГЛАЗ гЛаз глаз" $0 1)
("глаз")

Can you explain why find-all doesn't match all the words?



P.S. Sorry for my pidgin English :(
Title:
Post by: Lutz on April 23, 2009, 04:39:32 AM
Regular expressions in newLISP are handled by the PCRE library (http://pcre.org). When PCRE is compiled, its tables for upper/lower-casing, case flipping and character classification (letters, numbers, hex digits etc.) are built for one specific locale.



The standard newLISP distribution contains a file, pcre-chartables.c, which is automatically generated for a specific locale. In newLISP this is the so-called 'C' locale. It handles casing etc. only for the first page of one-byte characters in the UTF-8 character set, but guarantees internationally consistent behavior of newLISP, at least for English. This is why the case-insensitive option 1 works for "eye" but not for "глаз". When newLISP starts up, it puts itself into this locale.



As a workaround you could do something like this:


(find-all (lower-case search-str) (lower-case text-str))


Of course, this depends on newLISP's upper-case and lower-case routines working correctly in your locale's UTF-8 implementation and for the character set used. The locale needs working tables so that the C library's towupper() and towlower() functions pick the right character and case.
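A minimal sketch of that workaround applied to the Russian example from the first post. It assumes a UTF-8 build of newLISP and an installed ru_RU.utf8 locale; find-all-nocase is a hypothetical helper name, not a newLISP built-in.

```newlisp
; Sketch only: assumes a UTF-8 build of newLISP and a locale in which
; lower-case handles Cyrillic. find-all-nocase is a hypothetical helper.
(set-locale "ru_RU.utf8")

(define (find-all-nocase search-str text-str)
  ; lower-case both the pattern and the text, then match;
  ; 2048 tells PCRE to treat pattern and text as UTF-8 characters
  (find-all (lower-case search-str) (lower-case text-str) $0 2048))

(println (find-all-nocase "глаз" "ГЛАЗ гЛаз глаз"))
```

Two caveats with this approach: the matches come back lower-cased rather than as they appear in the text, and lower-casing the pattern also changes regex escapes such as \S to \s, so it is only safe for plain-text patterns.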



Last but not least, when working with UTF-8 text, all regex flags should be or'ed with 2048 (see the docs for regex). It makes the following difference:


; wrong because (char 937) should count as only one UTF-8 character
(find (append "." (char 937) ".") (append (char 937) (char 937) (char 937)) 0) => 1

; correct because the two bytes of (char 937) are treated as one UTF-8 character
(find (append "." (char 937) ".") (append (char 937) (char 937) (char 937)) 2048) => 0


The character used here is the Greek Omega character. I have coded it as (char 937), so you can copy/paste the code without problems. This is what I really did:


(find ".Ω." "ΩΩΩ" 2048) => 0 ; correct offset 0
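To see why the flag matters, the byte count and character count of the test string can be compared directly. This sketch assumes a UTF-8 build of newLISP (utf8len is only available there); the variable names omega and s are just for illustration.

```newlisp
; Assumes a UTF-8 build of newLISP.
(set 'omega (char 937))                 ; Greek Omega, two bytes in UTF-8
(set 's (append omega omega omega))

(println (length s))     ; 6, length counts bytes
(println (utf8len s))    ; 3, utf8len counts characters

; Without 2048 the dot matches a single byte, so the match starts mid-character
(println (find (append "." omega ".") s 0))      ; 1
; With 2048 the dot matches a whole UTF-8 character
(println (find (append "." omega ".") s 2048))   ; 0

; Flags are combined with bitwise or, e.g. case-insensitive (1) plus UTF-8 (2048)
(println (| 1 2048))     ; 2049
```

Whether case-insensitive matching with (| 1 2048) also covers Cyrillic depends on how the bundled PCRE was built, which this thread does not confirm.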
Title:
Post by: Fritz on September 27, 2009, 12:54:29 PM
Sorry for necroposting. (upper-case) and (lower-case) won't work with Russian letters.



Maybe international support is already implemented somewhere deep in newLISP? (I'm working in "newLISP v.10.1.2 on Win32 IPv4" now.)



If it isn't, but you plan to add it, here is the Russian alphabet (33 letters):



Lower:

"\224\225\226\227\228\229\184\230\231\232\233\234\235\236\237\238\239\240\241\242\243\244\245\246\247\248\249\250\251\252\253\254\255"



Upper:

"\192\193\194\195\196\197\168\198\199\200\201\202\203\204\205\206\207\208\209\210\211\212\213\214\215\216\217\218\219\220\221\222\223"



Btw, lower-case and upper-case are only part of the problem. When there is an error in a function with a Russian name, I have to decipher messages like "ERR: missing parenthesis : "...238239224 14)n ".
Title:
Post by: cormullion on September 27, 2009, 02:39:27 PM
It appears to work ok on my MacOS X UTF-8 newLISP 10.1.5:


(set-locale "ru_RU.UTF-8")

(println "ангел снега")
ангел снега
(println (upper-case "ангел снега"))
АНГЕЛ СНЕГА

(println (upper-case "Bглядывaяcь в глyбинy вpeмeни"))
BГЛЯДЫВAЯCЬ В ГЛYБИНY ВPEМEНИ

(println (lower-case "Bглядывaяcь в глyбинy вpeмeни"))
bглядывaяcь в глyбинy вpeмeни

(println (title-case "Bглядывaяcь в глyбинy вpeмeни"))
Bглядывaяcь в глyбинy вpeмeни


although it's hard to tell - the letters look similar regardless of case... (sorry, don't speak Russian... :)
Title:
Post by: Fritz on September 27, 2009, 02:51:48 PM
Yep, just tried it: it works with the UTF-8 version. (I have just installed it to check.)



Funny, but now I've got another problem: the GUI is unable to load my config file from my home directory, because the directory has a Russian name. I think I'll be able to resolve it by renaming my Windows user.



http://img7.imageshost.ru/imgs/090928/a9bf2789ba/04c9b.jpg