russian text and find-all problem

Started by hivecluster, April 22, 2009, 05:23:53 PM


hivecluster

hello, everybody

look here:

newLISP v.10.0.2 on Linux IPv4 UTF-8, execute 'newlisp -h' for more info.
> (set-locale "ru_RU.utf8")
("ru_RU.utf8" ",")
> (find-all "eye" "EYE eYe eye" $0 1)
("EYE" "eYe" "eye")
> (find-all "глаз" "ГЛАЗ гЛаз глаз" $0 1)
("глаз")

Can you explain why find-all doesn't match all the words?



P.S. Sorry for my pidgin English :(

Lutz

#1
Regular expressions are handled by the PCRE (http://pcre.org) library code that newLISP uses. When PCRE is compiled, it is compiled with upper/lower casing, case flipping and character classification (letters, numbers, hex digits, etc.) for a specific locale.



The standard newLISP distribution contains a file, pcre-chartables.c, which is generated automatically for a specific locale. In newLISP this is the so-called 'C' locale. It handles casing etc. only for the first page of one-byte characters in the UTF-8 character set (the ASCII range), but it guarantees internationally consistent behavior of newLISP, at least for the English language. When newLISP starts up, it puts itself into this locale.



As a workaround you could do something like this:


(find-all (lower-case search-str) (lower-case text-str))


Of course, this depends on newLISP's upper-case and lower-case routines working correctly in your locale's UTF-8 implementation and for the character set used. The locale needs working tables so that the C library's towupper() and towlower() functions can pick the right character and case.
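
The workaround above can be wrapped in a small helper. This is only a sketch: find-all-nocase is a made-up name, not a built-in, and it assumes a UTF-8 build of newLISP in which lower-case handles the characters of your locale.

```newlisp
; Hypothetical helper (not part of newLISP): case-insensitive find-all
; that avoids PCRE's caseless option by lower-casing both arguments first.
(define (find-all-nocase pattern text)
  ; 2048 = PCRE_UTF8, so multibyte sequences are treated as single characters
  (find-all (lower-case pattern) (lower-case text) $0 2048))

(set-locale "ru_RU.utf8")
(find-all-nocase "глаз" "ГЛАЗ гЛаз глаз")
```

Two caveats: the matches come back lower-cased rather than in their original spelling, and lower-casing the pattern also changes regex escapes such as \S or \W, so this is only suitable for literal search words.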



Last but not least, when working with UTF-8 text, all regex flags should be or'ed with 2048 (see the docs for regex). It makes the following difference:


; wrong because (char 937) should count as only one UTF-8 character
(find (append "." (char 937) ".") (append (char 937) (char 937) (char 937)) 0) => 1

; correct because the first two bytes in (char 937) form one UTF-8 character
(find (append "." (char 937) ".") (append (char 937) (char 937) (char 937)) 2048) => 0


The character used here is the Greek capital Omega. I have coded it as (char 937) so that you can copy/paste the code without problems. This is what I really did:


(find ".Ω." "ΩΩΩ" 2048) => 0 ; correct offset 0
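
The offsets above differ because, without the 2048 flag, the pattern is matched byte by byte, and Omega takes two bytes in UTF-8. A minimal sketch of that difference, assuming a UTF-8 enabled newLISP build (utf8len is only available there); the symbol names omega and text are just for illustration:

```newlisp
(set 'omega (char 937))     ; Greek capital Omega, two bytes in UTF-8
(set 'text (dup omega 3))   ; three Omegas in a row

(length text)               ; counts bytes: 6
(utf8len text)              ; counts UTF-8 characters: 3

; Regex options are bit flags, so they can be combined with a bitwise or:
; 1 = PCRE_CASELESS, 2048 = PCRE_UTF8
(find (append "." omega ".") text (| 1 2048))
```

Note that offsets returned by find are byte offsets, which is why the byte-wise match in the first example starts at offset 1, in the middle of the first Omega.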

Fritz

#2
Sorry for necroposting. (upper-case) and (lower-case) won't work with Russian letters.



Maybe international support is already implemented somewhere deep in newLISP? (I am working in "newLISP v.10.1.2 on Win32 IPv4" now.)



If it is not, but you plan to add it, here is the Russian alphabet (33 letters):



Lower:

"\224\225\226\227\228\229\184\230\231\232\233\234\235\236\237\238\239\240\241\242\243\244\245\246\247\248\249\250\251\252\253\254\255"



Upper:

"\192\193\194\195\196\197\168\198\199\200\201\202\203\204\205\206\207\208\209\210\211\212\213\214\215\216\217\218\219\220\221\222\223"



By the way, lower-case and upper-case are only part of the problem. When I have an error in a function with a Russian name, I have to decipher messages like "ERR: missing parenthesis : "...\238\239\224 14)\n ".

cormullion

#3
It appears to work ok on my MacOS X UTF-8 newLISP 10.1.5:


(set-locale "ru_RU.UTF-8")

(println "ангел снега")
ангел снега
(println (upper-case "ангел снега"))
АНГЕЛ СНЕГА

(println (upper-case "Bглядывaяcь в глyбинy вpeмeни"))
BГЛЯДЫВAЯCЬ В ГЛYБИНY ВPEМEНИ

(println (lower-case "Bглядывaяcь в глyбинy вpeмeни"))
bглядывaяcь в глyбинy вpeмeни

(println (title-case "Bглядывaяcь в глyбинy вpeмeни"))
Bглядывaяcь в глyбинy вpeмeни


although it's hard to tell - the letters look similar regardless of case... (sorry, don't speak Russian... :)
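
Since it is hard to judge by eye, the result can also be checked programmatically. A small sketch, assuming the same UTF-8 build and locale as above; the symbol name s is arbitrary:

```newlisp
(set-locale "ru_RU.UTF-8")
(set 's "ангел снега")

; true if upper-case changed at least one character
(!= (upper-case s) s)

; true if converting up and back down restores the all-lower-case original
(= (lower-case (upper-case s)) s)
```

If the first expression returns nil, the case tables for the current locale are not covering these characters.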

Fritz

#4
Yep, just tried it: it works with UTF-8. (I have just installed the UTF-8 version to check.)



Funny, but I have run into another problem: now the GUI is unable to load my config file from my home directory, because it has a Russian name. I think I'll be able to resolve it by renaming my Windows user.



http://img7.imageshost.ru/imgs/090928/a9bf2789ba/04c9b.jpg