In UTF-8 mode, regex returns byte length

ssqq · June 15, 2014, 07:43:25 AM

> (length (char 0xff))
2
> (utf8len (char 0xff))
1
> (regex (char 0xff) (char 0xff))
("ÿ" 0 2)
> (regex (char 0xff) (char 0xff) 2048)
("ÿ" 0 2)

I think that in UTF-8 mode, regex should return the match offset and length in characters, not bytes.

Lutz · June 15, 2014, 08:25:44 AM

Thanks for reporting this. Fixed for v.10.6.1 here:

http://www.newlisp.org/downloads/development/inprogress/

> (regex "Ω" "Ω" 2048)
("Ω" 0 1)
> (regex "Ω" "Ω")
("Ω" 0 2)
>

ssqq · June 15, 2014, 08:47:10 AM

I wrote a function that converts a byte index into a UTF-8 character index.

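;; Convert a byte offset in a UTF-8 string to a character offset.
;; Returns nil if byte-index is not on a character boundary
;; or lies beyond the end of the string.
(define (get-utf8-index utf8-str byte-index)
  (letn ((str-lst (explode utf8-str))            ; list of characters
         (str-len (length str-lst))              ; number of characters
         (chop-lst (sequence (- str-len 1) 0))   ; how many items to chop from the end
         (char-len-lst (map length str-lst)))    ; byte length of each character
    ;; Cumulative byte lengths of the first 0, 1, 2, ... characters;
    ;; the position found in that list is the character offset.
    (find byte-index
          (cons 0
                (map (curry apply +)
                     (map (curry chop char-len-lst) chop-lst))))))

(get-utf8-index (dup (char 0xff) 10) 12) ; --> 6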
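
A shorter way to get the same number might be to count the characters in the bytes that precede the offset. This is only a sketch: it assumes a UTF-8 enabled newLISP in which slice takes byte offsets and utf8len counts characters, as the manual describes. The name byte-to-char-index is made up for illustration and is not a built-in.

```newlisp
;; Sketch, assuming a UTF-8 enabled newLISP where slice works on bytes
;; and utf8len counts characters.
;; byte-to-char-index is an illustrative name, not part of newLISP.
(define (byte-to-char-index utf8-str byte-index)
  ;; take the bytes before the offset, then count their characters
  (utf8len (slice utf8-str 0 byte-index)))

;; same input as above: ten 2-byte characters, byte offset 12
(byte-to-char-index (dup (char 0xff) 10) 12)
```

Unlike get-utf8-index, this version does not report an offset that falls in the middle of a character, so it is best applied to offsets that regex itself returned.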

newLISP Fan Club
