Encoding surrealism in WIN10: UTF-8-NEWLISP in CMD.EXE

Started by IVShilov, April 04, 2018, 02:13:37 AM

IVShilov

I spent 8 hours figuring out HOW it works in windows cmd.exe and found a paradox.

Two paradoxes.

Try this yourself; all code in this post is copied and pasted from a cmd.exe window.



Start CMD.EXE, run newlisp.exe without any init.lsp, and pass it a valid Cyrillic file path as the first parameter:


D:\tmp>r:\bin\newlisp\newlisp.exe -n "D:\tmp\Ё.doc"
newLISP v.10.7.1 64-bit on Windows IPv4/6 UTF-8 libffi, options: newlisp -h

> (last (main-args))
"D:\tmp\╨╕.doc"             # two symbols - not one, it's UNICODE
> (load {R:\bin\newlisp\modules\iconv.lsp})
MAIN
> (file? (last (main-args)))  # may be it understands as valid path?
nil
> (Iconv:convert (last (main-args)) {UTF-8} {CP866}) # OK, de-UNICODE it
"D:\tmp\и.doc"
> # one symbol, but there must be "Ё"!

After hours of encoding and decoding between UTF-8, CP866 and CP1251, a lucky shot in the dark produced paradox one: the UTF-8 path, once converted to CP866, must be converted again, this time from CP1251 to CP866:

> (Iconv:convert (Iconv:convert (last (main-args)) {UTF-8} {CP866}) {CP1251} {CP866})
"D:\tmp\Ё.doc"
> #  no logic, but now we have a readable file path!
> (file? (Iconv:convert (Iconv:convert (last (main-args)) {UTF-8} {CP866}) {CP1251} {CP866}) ) # but what about this thinks newlisp itself?
nil

newLISP thinks there is no such file, but I think there is: I can see "D:\tmp\Ё.doc".
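
One way to make sense of paradox one is to follow the bytes. The sketch below uses Python's built-in codecs as a stand-in for the conversions (Python only because its code page tables are built in). The assumption that the argument arrives as CP1251 bytes and is then misread as CP866 is a guess that happens to reproduce the output above; it is not confirmed from the newLISP sources. The helper name show_in_console is hypothetical.

```python
# Sketch only: Python codecs stand in for what Windows and iconv do.
# The conversion chain itself is an ASSUMPTION, not a documented fact.

def show_in_console(raw: bytes, codepage: str) -> str:
    """Render raw bytes the way a console set to `codepage` would."""
    return raw.decode(codepage, errors="replace")

# Assumption: the argument reaches the program as ANSI (CP1251) bytes...
ansi_bytes = "Ё".encode("cp1251")                 # b'\xa8'
# ...which are misread as OEM (CP866) and then stored as UTF-8.
misread = ansi_bytes.decode("cp866")              # 'и'
stored = misread.encode("utf-8")                  # b'\xd0\xb8'

# What (last (main-args)) looks like in a CP866 console window.
seen = show_in_console(stored, "cp866")           # '╨╕'

# First Iconv:convert call, UTF-8 -> CP866.
step1 = stored.decode("utf-8").encode("cp866")    # b'\xa8', shown as 'и'
# Second Iconv:convert call, CP1251 -> CP866.
step2 = step1.decode("cp1251").encode("cp866")    # b'\xf0', shown as 'Ё'

# For comparison, the UTF-8 spelling of the same letter.
utf8_yo = "Ё".encode("utf-8")                     # b'\xd0\x81'

print(seen, show_in_console(step1, "cp866"), show_in_console(step2, "cp866"))
```

If the guess holds, the second conversion yields the single byte F0, which displays as "Ё" in a CP866 window but is not the UTF-8 sequence D0 81. That may be why the path looks right on screen while file? still answers nil.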

Paradox two:

> (write-file {D:\tmp\1.txt} {1}) # OK, newlisp, does the file you create by yourself...
1
> (file? {D:\tmp\1.txt}) # ... would be a truly file?
true
> (write-file {D:\tmp\Ё.txt} {Ё}) # OK, now special case
1
> (file? {D:\tmp\Ё.txt})
true
> (file? {D:\tmp\Ё.doc})
nil
>


Ok, explorer.exe, what do you think about that?

Ё.doc:
[attachment=1]Ё.doc.jpg[/attachment]
Ё.txt
[attachment=2]Ё.txt.jpg[/attachment]

Folks, I think only some kind of data flow diagram can clearly show what is going on under the hood of the GUI and where the silent charset translations take place.
[attachment=0]DFD CMD-newlisp-OS.jpg[/attachment]
As far as I know, CMD.EXE works in CP866, the file system stores file paths in CP1251, and newlisp.exe works internally in UTF-8. Let's discuss.

TedWalther

#1
Wow, good detective work.  Thank you for that nice diagram.  Do you have a program to auto-generate it from a script, or was it a hand drawn work of art?  I ask because often I'd like to make similar diagrams to illustrate things.
Cavemen in bearskins invaded the ivory towers of Artificial Intelligence.  Nine months later, they left with a baby named newLISP.  The women of the ivory towers wept and wailed.  "Abomination!" they cried.

IVShilov

#2
The diagram is purely handmade, not generated by a script.



In cmd.exe, the INPUT and OUTPUT encoding for a started process can be changed with the "chcp" command (CHange Code Page, https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/chcp):

> (exec "chcp")         # get current code page
("Active code page: 866")
> (exec "chcp 1251") # change code page to CP1251
("Active code page: 1251")
>
(prompt-event
 (fn (ctx)
   (string
    (last (parse (first (exec "chcp")) { }))
    { > })))

$prompt-event
1251 > # now for debugging we see chcp in prompt

To see clearly what is going on, I use command-event and reader-event handlers:

1251 > (reader-event (lambda (ex) (println "reader-event IN: " ex)))
 => (reader-event (lambda (ex) (println "reader-event IN: " ex)))
$reader-event
1251 > (char "Ё")
reader-event IN: (char "Ё")
168
1251 > (command-event (fn (s)(println {command-event IN:} s) s))
reader-event IN: (command-event (lambda (s) (println "command-event IN:" s) s))
$command-event
1251 >

Now try the "Greek omega test" from the newLISP manual with the CP1251 and UTF-8 chcp settings:

65001 > (println (char 937))
command-event IN:(println (char 937))
reader-event IN: (println (char 937))
Ω
"Ω"

Output is good,

65001 > (println "Ω")
command-event IN:(println

ERR: missing parenthesis : "...(println"
65001 >

It looks like cmd.exe never passed "Ω" to the newlisp subprocess: note that "command-event IN:(println" shows the string cut off.



Try CP1251:

1251 > (println (char 937))
command-event IN:(println (char 937))
reader-event IN: (println (char 937))
О©
"О©"

Output fails, and

1251 > (print "Ω")
command-event IN:(print "?")
reader-event IN: (print "?")
?"?"
1251 >

Unsuccessful too, because a UTF-8 -> CP1251 conversion is needed, and CP1251 has no "Ω" letter.
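
The CP1251 results above are what the code page tables predict. Here is a small check with Python's built-in codecs, used only as a calculator; nothing in it calls newLISP or cmd.exe, and the variable names are made up for the illustration.

```python
# UTF-8 bytes that (char 937) produces for the Greek capital omega.
omega_utf8 = "Ω".encode("utf-8")                          # b'\xce\xa9'

# A CP1251 console renders those two bytes as two separate characters.
as_seen_in_1251 = omega_utf8.decode("cp1251")             # Cyrillic 'О' + '©'

# CP1251 has no omega, so typed input degrades to a question mark.
typed_in_1251 = "Ω".encode("cp1251", errors="replace")    # b'?'

# The CP1251 code of 'Ё', matching what (char "Ё") printed above.
yo_code = "Ё".encode("cp1251")[0]                         # 168

print(as_seen_in_1251, typed_in_1251, yo_code)
```

So the garbled output is UTF-8 shown through the wrong table, and the lost input is a character the table cannot hold at all.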



Two days out of luck.

Possible solutions:

A) set cmd.exe to "chcp 1251":

 - translate INPUT in newlisp from CP1251 to UTF-8 with command-event;

 - translate OUTPUT from newlisp from UTF-8 to CP1251 with another event handler; I don't know of one.

B) set cmd.exe to "chcp 65001" and figure out input translation by reading the Microsoft docs.
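
The two translations in option A can be sketched as a pair of functions. This is a Python illustration of the idea only, not newLISP code and not a tested fix; the names CONSOLE_CP, console_to_utf8 and utf8_to_console are hypothetical, and the code page is assumed to match "chcp 1251".

```python
CONSOLE_CP = "cp1251"  # assumption: the console was set with "chcp 1251"

def console_to_utf8(raw: bytes, console_cp: str = CONSOLE_CP) -> bytes:
    """INPUT side: re-encode console bytes as UTF-8 for the interpreter."""
    return raw.decode(console_cp).encode("utf-8")

def utf8_to_console(raw: bytes, console_cp: str = CONSOLE_CP) -> bytes:
    """OUTPUT side: re-encode UTF-8 for the console.

    Characters missing from the code page become '?'.
    """
    return raw.decode("utf-8").encode(console_cp, errors="replace")

# 'Ё' typed in a CP1251 console arrives as the single byte A8.
print(console_to_utf8(b"\xa8"))               # UTF-8 bytes D0 81
print(utf8_to_console("Ё".encode("utf-8")))   # back to the byte A8
print(utf8_to_console("Ω".encode("utf-8")))   # omega does not fit
```

The input half is what a command-event handler would have to do. The output half has the limit already seen with the omega test: anything outside CP1251 cannot survive the trip.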



Python has this problem too: https://github.com/Drekin/win-unicode-console/tree/development#win-unicode-console

IVShilov

#3
Quote from: "IVShilov" Two days out of luck.

Much more days out of luck, but some light illuminates the darkness.



Many (all?) UTF-8 apps started from (in?) CMD.EXE have problems in the Windows environment (Python too).



A full view of what is under the hood, from (the depths of) MS: https://devblogs.microsoft.com/commandline/windows-command-line-backgrounder/



This guy knows how to make CMD.EXE fully support UTF-8:

1. The best answer in the thread "How to use unicode characters in Windows command line?" briefly explains the problem: https://stackoverflow.com/questions/388490/how-to-use-unicode-characters-in-windows-command-line

2. His site with solutions: https://math.berkeley.edu/~serganov/ilyaz.org/keyboard/



In any case, we need functions for encoding/decoding (I still cannot import libiconv and am forced to use iconv.exe) and other batteries in the newlisp distro, like Python has.

UPD: IMHO, at a minimum we need a predicate (utf? str) like
(define (utf? str) (= (length str) (utf8len str)))
to figure out whether to expect problems or not.
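
In UTF-8 builds of newLISP, length counts bytes and utf8len counts characters, so the comparison above is true only when every character takes a single byte. A rough Python mirror of that idea follows, next to a stricter validity check for comparison. Both function names are hypothetical, and this is a sketch of the logic rather than a proposal for the newLISP API.

```python
def same_length(text: str) -> bool:
    """Mirror of (= (length str) (utf8len str)): byte count equals character count."""
    return len(text.encode("utf-8")) == len(text)

def is_valid_utf8(raw: bytes) -> bool:
    """Stricter check: do the bytes decode as UTF-8 at all?"""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(same_length("1.txt"))           # only single-byte characters
print(same_length("Ё.doc"))           # 'Ё' needs two bytes in UTF-8
print(is_valid_utf8(b"\xd0\x81"))     # 'Ё' spelled in UTF-8
print(is_valid_utf8(b"\xa8"))         # 'Ё' spelled in CP1251
```

The length comparison says whether a string contains anything beyond single-byte characters. The validity check says whether the bytes are UTF-8 in the first place, which is the case that went wrong with the CP1251 byte A8.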