leave-string: any idea for simplifying?

Started by hartrock, October 11, 2015, 10:19:10 AM


hartrock

Here is a function leave-string that keeps the left part (positive pos) or the right part (negative pos) of a string, and which also works when pos is out of range:

>
(define (leave-string str pos)
  (if (>= pos 0)
      (0 pos str)
      ((- (min (- pos) (length str))) str)))
;;
(set 'str "foobar")
(leave-string str 3) ; first 3
(leave-string str -3) ; last 3
;;
(leave-string str 7) ; first 7 (all with pos one out of range)
(leave-string str -7) ; last 7 (all with pos one out of range)

(lambda (str pos)
 (if (>= pos 0)
  (0 pos str)
  ((- (min (- pos) (length str))) str)))
"foobar"
"foo"
"bar"
"foobar"
"foobar"
>
Any idea for simplifying the negative pos case?

Or is there some other way to get this string-truncating functionality that I don't have on the radar?

rrq

#1
Not really simpler, but the negative index case would be slightly shorter with the following.
((length 0 pos str) str)

xytroxon

#2
A potential problem exists in handling UTF-8 string values in UTF-8 enabled versions of newLISP. In UTF-8 versions, the length function would return the number of bytes and not the number of UTF-8 characters in the string. Hence, your function would not return the correct number of characters from your string!

In such cases utf8len must be used. Since single-byte ASCII characters (0-127) are a subset of UTF-8, length function problems may not be noticed until a user has multi-byte (non-English) UTF-8 characters to process.

A further complication is that non-UTF-8 versions of newLISP do not include the utf8len function!

A possible solution is to use this "ambidextrous" strlen function in place of length:

(define (strlen str) (if (= (& (sys-info 9) 128) 128) (utf8len str) (length str)))

But of course it fails to return the correct length if your users are trying to process multi-byte UTF-8 character strings on non-UTF-8 versions of newLISP, for example when dealing with UTF-8 HTML pages that include "fancy" left and right quotes in otherwise "English only" text.
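
A small sketch of the difference, assuming the source is stored as UTF-8; the sample string is only for illustration. Each of these characters occupies three bytes in UTF-8, so byte count and character count differ:

```newlisp
;; strlen as defined above: character count on UTF-8 builds, byte count otherwise
(define (strlen str)
  (if (= (& (sys-info 9) 128) 128) (utf8len str) (length str)))

(set 's "我能")        ; assumed to be stored as UTF-8: 2 characters, 6 bytes
(println (length s))  ; byte count: 6
(println (strlen s))  ; 2 on a UTF-8 enabled build, 6 otherwise
```

On a non-UTF-8 build both calls report 6, which is exactly the silent miscount described above.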

A truly "simplified" or "efficient" version of your function may not be entirely possible depending on how robust you want your code to be - like in module code designed to run on all versions of newLISP.

-- xytroxon
"Many computers can print only capital letters, so we shall not use lowercase letters."

-- Let's Talk Lisp (c) 1976

hartrock

#3
Quote from: "ralph.ronnquist"
Not really simpler, but the negative index case would be slightly shorter with the following.
((length 0 pos str) str)

Thanks for the suggestion!

Corrected, it is:

((length (0 pos str)) str)
; which is shorter and easier to understand than:

((- (min (- pos) (length str))) str)

The longer variant avoids copying str just to compute a valid negative index, though.
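
Put together, the function with the corrected shorter negative branch might look like this (a sketch; only the in-range calls are annotated with results):

```newlisp
(define (leave-string str pos)
  (if (>= pos 0)
      (0 pos str)                   ; keep the first pos bytes
      ((length (0 pos str)) str)))  ; skip all but the last (- pos) bytes

(leave-string "foobar" 3)   ; "foo"
(leave-string "foobar" -3)  ; "bar"
```

The inner (0 pos str) builds a temporary string only to measure its length, which is the copy mentioned above.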

The asymmetry that triggered this thread is:

> (set 'str "foobar")
"foobar"
> (11 str)
""
> (-11 str)

ERR: invalid string index
>
(The same holds for (slice pos str).) There may be reasons for these semantics, though: for example, negative indices near the tail of a list are inefficient compared to positive ones near its head, which could be a reason for making the use of negative ones less comfortable.

hartrock

#4
Quote from: "xytroxon"
A potential problem exists in handling UTF-8 string values in UTF-8 enabled versions of newLISP. In UTF-8 versions, the length function would return the number of bytes and not the number of UTF-8 characters in the string. Hence, your function would not return the correct number of characters from your string!

Thanks for the pointer!

This is indeed a problem for the use case that triggered this thread: the stdin/stdout/stderr of a newLISP interpreter is transferred via the websocket protocol, and the text is sent in chunks so that intermediate results are available during longer-running computations. The chunks are separated on byte boundaries, but each chunk is transferred as an encoded JSON string. JSON encoding can fail if a UTF-8 character is cut at the border between two chunks, so that its fragments end up in different chunks.
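
One possible way to avoid such a split is to move a proposed cut position back until it no longer points at a UTF-8 continuation byte (bit pattern 10xxxxxx). The following is only a sketch: utf8-safe-cut is a hypothetical helper name, pos is assumed to be a non-negative byte offset, and the input is assumed to be valid UTF-8. It relies on implicit slicing working on byte boundaries.

```newlisp
;; Hypothetical helper: largest cut position <= pos (in bytes) that does not
;; fall inside a multi-byte UTF-8 sequence.
(define (utf8-safe-cut str pos)
  (let (len (length str) n pos)
    (if (> n len) (set 'n len))       ; clamp to the end of the string
    ;; a cut at 0 or at len is always safe; otherwise step back while the
    ;; byte just after the cut is a continuation byte (10xxxxxx)
    (while (and (> n 0) (< n len)
                (= (& (first (unpack "b" (n 1 str))) 0xC0) 0x80))
      (dec n))
    n))

(set 's "我能")             ; assumed UTF-8: two characters of three bytes each
(utf8-safe-cut s 4)        ; 3, since byte 4 lies inside the second character
(0 (utf8-safe-cut s 4) s)  ; "我"
```

A chunker could then send (0 n str) and carry the remaining bytes over into the next chunk.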

So there is something to do; trying to provoke the problem does trigger an error (both the code to be eval'ed and the results of the evaluations are transferred via websocket as UTF-8 text):

> (set 's "我能吞下玻璃而不伤身体。")(dup s 500) ; dup leading to chunked transfer
"我能吞下玻璃而不伤身体。"
[text]我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。...[/text]
> (set 's "我能吞下玻璃而不伤身体。")(dup s 500) ; dup leading to chunked transfer
-> This run failed while transferring the second dup result (chunk borders vary); the first, successful one is shown truncated.