iconv/recode

Started by Dmi, January 26, 2006, 01:42:27 PM

Dmi

I just wrote an iconv wrapper, and it seems to be working.

In principle it can convert to and from any encoding that libc's "iconv()" supports.

It has only been tested on Linux so far, and I'm looking for comments on the code.

 
; iconv interface. See man 3 iconv. (C) Dmi
(context 'ICONV)

(set 'funcs '("iconv_open" "iconv" "iconv_close"))
(dolist (l funcs) (import "libc.so.6" l))

; iconv library function wrapper
; str - string to recode
; with no-force, makes a single iconv call and returns a list with the
;   iconv result, in-buffer/rest size, out-buffer/rest size.
; without no-force, if the iconv call returns an error, drops the broken
;   character and repeats iconv for the rest. Returns the recoded string.
(define (recode str no-force)
  (letn (strlen (length str)
         buflen (+ (* 2 strlen) 4)
         buf (dup "\000" buflen)
         plen (pack "lu" buflen)
         pbuf (pack "ld" buf)
         rsize 0 bufrest 0)
     (do-until (or no-force (>= rsize 0) (<= strlen 0))
       (set 'pstrlen (pack "lu" strlen)
            'pstr (pack "ld" str))
       (set 'rsize (iconv cds pstr pstrlen pbuf plen))
       (set 'strlen ((unpack "lu" pstrlen) 0))
       (set 'str (get-string ((unpack "ld" pstr) 0)))
       (unless no-force
         (set 'str (1 strlen str)
              'strlen (- strlen 1))))
     (set 'buf (slice buf 0 (find "\000" buf))
          'bufrest ((unpack "lu" plen) 0))
       (if no-force
         (list rsize str strlen buf bufrest)
         buf)))

; recode at once: does iconv_open, iconv and iconv_close together.
; also shows the proposed usage of "recode"
(define (recode-once cfrom cto str no-force)
    (let (cds (iconv_open cto cfrom) res nil)
        (if (= -1 cds) -1
            (begin
                (set 'res (recode str no-force))
                (iconv_close cds)
                res))))

(context MAIN)

; example
;(println (ICONV:recode-once "KOI8R" "UTF-8" "abc провdefерка юникодаgh"))
WBR, Dmi

Lutz

#1
Because the formats "lu" and "ld" are 32 bits wide, instead of:



(set 'strlen ((unpack "lu" pstrlen) 0))



you can use the faster and shorter:



(set 'strlen (get-int pstrlen))



There are several places in the code where you can do this.
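
As a rough C analogy for what both forms do: the packed string is just four bytes in native byte order, and reading the number back is a single copy out of that buffer. This is only a sketch, not newLISP's actual implementation, and the helper names pack_u32 and read_u32 are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store a 32-bit value into a 4-byte buffer in native
   byte order, roughly what (pack "lu" n) produces. */
void pack_u32(unsigned char *buf, uint32_t value)
{
    assert(buf != NULL);
    memcpy(buf, &value, sizeof value);
}

/* Hypothetical helper: read the 32-bit value back out of the buffer,
   roughly what get-int or (unpack "lu" ...) does. */
uint32_t read_u32(const unsigned char *buf)
{
    uint32_t value;
    assert(buf != NULL);
    memcpy(&value, buf, sizeof value);
    return value;
}
```

Packing a value and reading it back with these two helpers gives the original number again, which is all the iconv wrapper needs from its length cells.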



Instead of:



(set 'buf (slice buf 0 (find "\000" buf)))



you could do the shorter:



(set 'buf (string buf))



The 'string' function stops copying at the first \000 it encounters, but your method executes faster on short strings.



Lutz

Dmi

#2
Very nice! Thanks, Lutz!

I'll correct that.



There is one more point I'd like to raise:

man 3 iconv:
size_t iconv(iconv_t cd,
                     char **inbuf, size_t *inbytesleft,
                     char **outbuf, size_t *outbytesleft);

During conversion, iconv() advances the pointer stored in *inbuf.

So if iconv() raises an error, *inbuf points to the rest of the buffer that has not been converted yet.

In pure C, I can therefore increment *inbuf to eat the invalid char and pass it to iconv() again.
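
A minimal sketch of that skip-and-retry idea in C, assuming a descriptor that was already opened with iconv_open(). The function name recode_skip_invalid and its buffer convention are made up for illustration and are not part of the wrapper above; it also does not flush shift state for stateful encodings.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert `in` with the open descriptor `cd`, dropping
   one input byte whenever iconv() reports an invalid or incomplete sequence.
   Writes a NUL-terminated result to `out` and returns the number of bytes
   written (not counting the NUL), or (size_t)-1 if the output buffer is too
   small or another error occurs. */
size_t recode_skip_invalid(iconv_t cd, const char *in, size_t in_len,
                           char *out, size_t out_cap)
{
    assert(out != NULL && out_cap > 0);

    char *in_ptr = (char *)in;      /* iconv() takes char **, but does not modify the input bytes */
    size_t in_left = in_len;
    char *out_ptr = out;
    size_t out_left = out_cap - 1;  /* keep room for the terminating NUL */

    while (in_left > 0) {
        size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        if (rc != (size_t)-1)
            break;                  /* the rest of the input was converted */
        if (errno == EILSEQ || errno == EINVAL) {
            /* in_ptr now points at the offending byte: step over it, retry */
            in_ptr++;
            in_left--;
        } else {
            return (size_t)-1;      /* E2BIG or an unexpected error */
        }
    }
    *out_ptr = '\0';
    return (size_t)(out_ptr - out);
}
```

With a UTF-8 to ISO-8859-1 descriptor, for example, an input containing the stray byte 0xFF between "abc" and "def" should come out as "abcdef", since 0xFF can never start a valid UTF-8 sequence.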

In newLISP, I emulate this by copying the string:
(set 'str (get-string ((unpack "ld" pstr) 0)))
(set 'str (1 strlen str))
(set 'pstr (pack "ld" str))
(iconv ... pstr ...)

This isn't critical, but maybe there is some arithmetic trick in newLISP that can directly increment the string's pointer ((unpack "ld" pstr) 0) before I call (get-string)?
WBR, Dmi

Lutz

#3
((unpack "ld" pstr) 0)



or shorter



(get-int pstr)



gives you the pointer, which is really just a number/address of the buffer, which you could increment directly:



(set 'str (get-string (+ (get-int pstr) 1)))

(set 'pstr (pack "ld" str))



The arguments of 'pack', 'unpack', 'get-string' and 'get-int' are just numbers:

> (set 'str "hello")
"hello"
> (address str)
3168672
> (get-string 3168672)
"hello"
> (set 'pstr (pack "ld" 3168672))
"\0000Y\160"
> (get-string 3168673)
"ello"
> (get-string 3168674)
"llo"
> (get-string 3168675)
"lo"
>

That means you can just add or subtract from them.



Lutz

Dmi

#4
Cool! The final version is almost ready!

...but I just found the (address) function.

Playing with it shows that it is sometimes equivalent to (pack "lu"), but not always:
(define (recode str no-force)
  (letn (strlen (length str)
         buflen (+ (* 2 strlen) 4)
         buf (dup "\000" buflen)
         plen (address buflen)           ; XXX equivalent point
         pbuf (pack "ld" buf)
         istr (address str)
         rsize 0 bufrest 0)
     (do-until (or no-force (>= rsize 0) (<= strlen 0))
       (set 'pstrlen (pack "lu" strlen)  ; XXX strange point
            'pstr (address istr))
       (set 'rsize (iconv cds pstr pstrlen pbuf plen))
       (set 'strlen (get-int pstrlen)
            'istr (get-int pstr))
       (println (get-int pstrlen) ":" (get-int plen)) ; debug
       (sleep 500)                                            ; debug
       (unless no-force (begin (inc 'istr) (dec 'strlen))))
     (set 'str (get-string istr)
          'buf (string buf)
          'bufrest (get-int plen))
       (if no-force
         (list rsize str strlen buf bufrest)
         buf)))

The debug output prints the remaining lengths of str and buf, as calculated by iconv().

If I change (pack "lu" strlen) to (address strlen) at the line marked "strange point", then the address of _str_ is printed instead of strlen: (get-string (get-int pstrlen)) shows the shrinking remainder of str.

Interestingly, at the line marked "equivalent point", plen can safely be defined with either (address) or (pack).

What is happening?

P.S. When I test this branch, I pass a str that mixes ASCII and non-ASCII characters (such as the example in the first post) and specify two encodings that share no national characters (for example KOI8-R and ISO-8859-1), so only the ASCII characters remain after conversion.
WBR, Dmi

Dmi

#5
Addition:

If I change (get-int pstrlen) to simply strlen in the debug println, then I see that strlen is correct (and decrementing).

I feel I've made a mistake somewhere, but I can't find it myself.
WBR, Dmi

Dmi

#6
This is a final version (I hope ;-)

Origin at http://en.feautec.pp.ru/SiteNews/IconvAndRecode
; iconv interface. See man 3 iconv.
; (C) Dmitry Chernyak. Licence: free. Warranty: none.
; http://en.feautec.pp.ru
(context 'ICONV)

(set 'funcs '("iconv_open" "iconv" "iconv_close"))
(dolist (l funcs) (import "libc.so.6" l))

; iconv library function wrapper
; str - string to recode
; with no-force, makes a single iconv call and returns a list with the
;   iconv result, in-buffer/rest size, out-buffer/rest size.
; without no-force, if the iconv call returns an error, drops the broken
;   character and repeats iconv for the rest. Returns the recoded string.
(define (recode cds str no-force)
  (letn (strlen (length str)
         buflen (+ (* 2 strlen) 4)
         buf (dup "\000" buflen)
         plen (address buflen)
         pbuf (pack "ld" buf)
         istr (address str)
         rsize 0 bufrest 0)
     (do-until (or no-force (>= rsize 0) (<= strlen 0))
       (let (pstrlen (address strlen) pstr (address istr))
         (set 'rsize (iconv cds pstr pstrlen pbuf plen))
         (set 'strlen (get-int pstrlen)
              'istr (get-int pstr))
         (unless no-force (begin (inc 'istr) (dec 'strlen)))))
     (set 'str (get-string istr)
          'buf (string buf)
          'bufrest (get-int plen))
       (if no-force
         (list rsize str strlen buf bufrest)
         buf)))

; recode at once: does iconv_open, iconv and iconv_close together.
; also shows a possible usage of "recode"
(define (recode-once cfrom cto str no-force)
  (let (cds (iconv_open cto cfrom) res nil)
    (if (= -1 cds) -1
      (begin
        (set 'res (recode cds str no-force))
        (iconv_close cds)
        res))))

(context MAIN)

; example
;(println (set 'test "abc провdefерка юникодаgh") "\n"
;         (set 'test1 (ICONV:recode-once "KOI8R" "UTF-8" test)) "\n"
;         (ICONV:recode-once "UTF-8" "KOI8R" test1))
WBR, Dmi