In UTF-8 mode, regex returns byte length

Started by ssqq, June 15, 2014, 07:43:25 AM

ssqq

> (length (char 0xff))
2
> (utf8len (char 0xff))
1
> (regex (char 0xff) (char 0xff))
("ÿ" 0 2)
> (regex (char 0xff) (char 0xff) 2048)
("ÿ" 0 2)


I think that in UTF-8 mode, regex should return the match offset and length in characters, not in bytes.

Lutz

#1
Thanks for reporting this. Fixed for v.10.6.1 here:

http://www.newlisp.org/downloads/development/inprogress/

> (regex "Ω" "Ω" 2048)
("Ω" 0 1)
> (regex "Ω" "Ω")
("Ω" 0 2)
>

ssqq

#2
I wrote a function to get the character index from a byte index.


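; Look up a byte offset among the cumulative byte lengths of the characters.
(define (get-utf8-index utf8-str byte-index)
  (letn ((str-lst (explode utf8-str))             ; one element per character
          (str-len (length str-lst))
          (chop-lst (sequence (- str-len 1) 0))   ; number of trailing elements to drop
          (char-len-lst (map length str-lst)))    ; byte length of each character
        ; sum each prefix of char-len-lst, then find byte-index among the sums
        (find byte-index
              (map (curry apply +)
                   (map (curry chop char-len-lst) chop-lst)))))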


(get-utf8-index (dup (char 0xff) 10) 12) --> 6
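
A shorter way to do the same conversion might be to count the characters in the bytes that precede the offset. The sketch below is not from the thread. It assumes a UTF-8 enabled newLISP, that slice counts bytes while utf8len counts characters, and that the byte offset falls on a character boundary. The name byte-to-char-index is hypothetical.

```newlisp
; Hypothetical helper, not from the thread: convert a byte offset
; (as returned by regex without option 2048) into a character offset.
; Assumes a UTF-8 enabled newLISP and an offset on a character boundary.
(define (byte-to-char-index utf8-str byte-index)
  ; slice takes the first byte-index bytes; utf8len counts their characters
  (utf8len (slice utf8-str 0 byte-index)))

; Ten two-byte characters: byte offset 12 lies after six of them.
(byte-to-char-index (dup (char 0xff) 10) 12)
```

For the test string used above, byte offset 12 comes after six two-byte characters, so the expected result is 6. From v.10.6.1 on, passing option 2048 to regex should make this conversion unnecessary.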