Localization in newLISP

Started by Lutz, November 20, 2003, 07:19:36 AM

Lutz

This is what I found out so far:



(1) The categories used in set-locale differ on each OS, and even between versions of GCC! Different LC_ options are also available; some platforms have more, some have fewer.



(2) Perl and Python ship ignoring the locale, but have simple statements to switch it on.



(3) PCRE regular expression module: the file "chartables.c" has to be generated for the locale in question before newLISP gets compiled. In newLISP this identical file is called "pcre-chartables.c" and is shipped for the ISO C locale.



This is what I propose:



newLISP will ship ignoring the locale, as before, but will offer a simple means to switch to the local locale; power users can switch individual LC_ options on and off.



This is how it could work:



Out of the box, newLISP comes with the default ISO C locale. You can then do the following:



(set-locale) => "C"  ;; tells you what you have currently



Now you can do:



(set-locale "")  ;; switches to your location's locale



This switches to your local locale with all options turned on (LC_ALL). It works like "use locale" in Perl.



(set-locale "es_US") ; switches to something else



You can use "locale -a" at the shell level on some Unix systems to find out what you have. On a German Unix machine I got about 50 different strings you could use. The above "es_US" (Spanish as used in the US) worked well with upper-casing etc. on that German Linux machine.



If you are a power user, you can pass a second parameter to the set-locale function: an integer selecting a specific option such as LC_MONETARY, LC_NUMERIC etc. These numbers are different on each platform/compiler/version; you can find them in a file called 'locale.h' on your system.
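At the C level, the per-category switching described above could look roughly like the sketch below. The helper names print_category_numbers and set_one_category are made up for illustration and are not newLISP internals; only setlocale() and the LC_ macros come from the standard 'locale.h'.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>

/* Print the integers this platform's locale.h assigns to the standard
   categories. The values are implementation-defined, which is why they
   differ between platforms and compilers. */
void print_category_numbers(void) {
    printf("LC_ALL=%d LC_COLLATE=%d LC_CTYPE=%d\n",
           LC_ALL, LC_COLLATE, LC_CTYPE);
    printf("LC_MONETARY=%d LC_NUMERIC=%d LC_TIME=%d\n",
           LC_MONETARY, LC_NUMERIC, LC_TIME);
}

/* Hypothetical helper: switch a single category and leave the others alone.
   Returns the new locale name, or NULL if the request cannot be honored. */
const char *set_one_category(int category, const char *name) {
    assert(name != NULL);
    return setlocale(category, name);
}
```

Calling print_category_numbers() on two different systems is a quick way to see why the integers cannot be hard-coded in portable scripts.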



To switch back to ISO C mode, do:



(set-locale "C")



This will put newLISP back to where it was on startup.
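For reference, the query / switch / restore sequence above maps fairly directly onto the C setlocale() call. The following is only a sketch under that assumption; current_locale, switch_locale and in_c_locale are hypothetical helper names, not the actual newLISP implementation.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Like (set-locale) with no argument: report the current locale. */
const char *current_locale(void) {
    return setlocale(LC_ALL, NULL);   /* NULL means "query only" */
}

/* Like (set-locale "name"): switch every category at once (LC_ALL).
   ""  selects the locale from the environment,
   "C" restores the state every C program starts in.
   Returns NULL if the platform does not know the requested locale. */
const char *switch_locale(const char *name) {
    assert(name != NULL);
    return setlocale(LC_ALL, name);
}

/* Returns 1 if the program is currently in the plain ISO C locale. */
int in_c_locale(void) {
    const char *now = current_locale();
    return now != NULL && strcmp(now, "C") == 0;
}
```

A caller should check the result of switch_locale for NULL, since a name such as "es_US" may simply not be installed on a given machine.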



Regarding regular expressions:



PCRE has to be configured before compiling newLISP to generate "chartables.c" (called "pcre-chartables.c" in newLISP). We could collect "chartables.c" files made in different countries and publish them in the newLISP download directory, where people can grab them for their own compiles.
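To illustrate why such a table depends on the locale it was generated under, here is a simplified sketch that builds a 256-entry case-flip table from whatever locale is active. It is not PCRE's generator and does not reproduce the real "chartables.c" layout; build_flip_table is a hypothetical name and only the standard 'ctype.h' calls are used.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>

/* Fill table[c] with the opposite-case byte for c under the current locale.
   Bytes that the locale does not classify as letters map to themselves. */
void build_flip_table(unsigned char table[256]) {
    assert(table != NULL);
    for (int c = 0; c < 256; c++) {
        if (islower(c))
            table[c] = (unsigned char)toupper(c);
        else if (isupper(c))
            table[c] = (unsigned char)tolower(c);
        else
            table[c] = (unsigned char)c;
    }
}
```

Generated under the "C" locale, only the entries for A-Z and a-z differ from the identity mapping; generated after a setlocale() call to a locale with an 8-bit character set, the accented letters would get entries too, provided the platform has that locale installed.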



Lutz

nigelbrown

#1
Sounds like a good plan.

Question: with newLISP being explicitly set to the C locale, will this change any previous alignment between toupper and PCRE regarding what counts as uppercase? (I don't know how well aligned they were in the past without an explicit locale setting, but explicitly setting the locale may change how toupper() handles 251, for example.)

Or, more simply: does the explicit C locale differ from the current stable release's behaviour?



I could fiddle with examples to find this out but thought you might have the answers at your finger-tips.



Nigel

Lutz

#2
The "C" locale and the PCRE handling of characters align completely, which is no surprise, because neither of them handles upper/lower casing of characters in the 8-bit portion of the character set.
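A small C check of that claim, using byte 251 from the question above (û in ISO-8859-1). This is a sketch with a hypothetical helper name, upper_bytes, and says nothing about how newLISP calls toupper() internally.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Upper-case n bytes in place using the current locale's toupper().
   The buffer is unsigned char so that 8-bit values are passed to
   toupper() as 128..255, never as negative numbers. */
void upper_bytes(unsigned char *s, size_t n) {
    assert(s != NULL);
    for (size_t i = 0; i < n; i++)
        s[i] = (unsigned char)toupper(s[i]);
}
```

In the "C" locale this turns 'a' and 'z' into 'A' and 'Z' but leaves byte 251 as it is, because that locale treats only a-z as lower-case letters.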



All the 'specific' locales, for example "en_US", do a perfect job of upper/lower casing all accented and umlauted characters in newLISP, which seems to include all French, German, Spanish and Scandinavian characters. If "en_US", which uses ISO-8859-1, were available on all platforms it would be the way to go, because you could at least cover all Western European languages with one setup.



In that case I would just fabricate an ISO-8859-1 "chartables.c" for PCRE, and 85% of the current users in the world would be happy. Unfortunately the "C" locale is the only one guaranteed to be there on all platforms. It also offers the same character set as ISO-8859-1, but neither newLISP nor PCRE can upper/lower case it.



I am still researching (I have access to Solaris and Mac OS X), but so far the only thing I can get on the Win32 and BSD compiles is the 7-bit "C" locale; I have not checked Solaris and OS X yet.



Lutz