leave-string: any idea for simplifying?

hartrock · October 11, 2015, 10:19:10 AM

The following function leave-string keeps the left part (positive pos) or the right part (negative pos) of a string, and it also works when pos is out of range:

> 
(define (leave-string str pos)
  (if (>= pos 0)
      (0 pos str)
      ((- (min (- pos) (length str))) str)))
;;
(set 'str "foobar")
(leave-string str 3) ; first 3
(leave-string str -3) ; last 3
;;
(leave-string str 7) ; first 7 (all with pos one out of range)
(leave-string str -7) ; last 7 (all with pos one out of range)

(lambda (str pos) 
 (if (>= pos 0) 
  (0 pos str) 
  ((- (min (- pos) (length str))) str)))
"foobar"
"foo"
"bar"
"foobar"
"foobar"
>

Any ideas for simplifying the negative pos case?

Or is there another way to get this string-truncating behavior that I don't have on my radar?

rrq · October 15, 2015, 05:28:05 AM

Not really simpler, but the negative index case would be slightly shorter with the following.

((length 0 pos str) str)

xytroxon · October 15, 2015, 10:34:32 AM

A potential problem exists in handling UTF-8 string values in UTF-8 enabled versions of newLISP. In UTF-8 versions, the length function would return the number of bytes and not the number of UTF-8 characters in the string. Hence, your function would not return the correct number of characters from your string!

In such cases utf8len must be used. Since single-byte ASCII characters (0-127) are a subset of UTF-8, length function problems may not be noticed until a user has multi-byte (non-English) UTF-8 characters to process.

A further complication is that non-UTF-8 versions of newLISP do not include the utf8len function!

A possible solution is to use this "ambidextrous" strlen function in place of length:

(define (strlen str) (if (= (& (sys-info 9) 128) 128) (utf8len str) (length str)))

But of course it fails to return the correct length if your users process multi-byte UTF-8 strings on non-UTF-8 versions of newLISP, for example when dealing with UTF-8 HTML pages that include "fancy" left and right quotes in otherwise "English only" text.
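To make the byte/character mismatch concrete, here is a minimal sketch. The helper name utf8-build? is made up for this illustration; it reuses the sys-info test from strlen. The example assumes the source text itself is stored as UTF-8.

```newlisp
; Hypothetical helper: true when the running newLISP was built with
; UTF-8 support (same sys-info test as in strlen).
(define (utf8-build?)
  (= (& (sys-info 9) 128) 128))

; Two curly quotes (3 bytes each in UTF-8) around two ASCII letters.
; Assumption: this source text is stored as UTF-8.
(set 'quoted "“hi”")

(length quoted)                      ; → 8 on every build: length counts bytes
(if (utf8-build?) (utf8len quoted))  ; → 4 on a UTF-8 build, nil otherwise
```

The guard keeps utf8len from being called on builds that do not provide it.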

A truly "simplified" or "efficient" version of your function may not be entirely possible depending on how robust you want your code to be - like in module code designed to run on all versions of newLISP.
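One possible way to count characters instead of bytes is to go through explode, which splits on UTF-8 character boundaries in UTF-8 enabled versions and on bytes otherwise. The following is only a sketch under that assumption; leave-chars is a hypothetical name, and on a non-UTF-8 build it simply falls back to byte behavior.

```newlisp
; Hypothetical character-based variant of leave-string.
; Assumption: explode yields one list element per UTF-8 character on
; UTF-8 builds, and one element per byte on other builds.
(define (leave-chars str pos)
  (let (chars (explode str))
    (join
      (if (>= pos 0)
          (slice chars 0 pos)                          ; first pos characters
          ; last (- pos) characters; the start offset is clamped at 0
          (slice chars (max 0 (+ (length chars) pos)))))))

(leave-chars "foobar" 3)         ; → "foo"
(leave-chars "foobar" -3)        ; → "bar"
(leave-chars "我能吞下玻璃" 2)    ; → "我能" on a UTF-8 build
(leave-chars "我能吞下玻璃" -2)   ; → "玻璃" on a UTF-8 build
```

Building the intermediate list costs time and memory, so this trades efficiency for robustness.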

-- xytroxon

hartrock · October 16, 2015, 04:39:58 PM

Quote from: ralph.ronnquist
Not really simpler, but the negative index case would be slightly shorter with the following.
((length 0 pos str) str)

Thanks for the suggestion!

Corrected, it is:

((length (0 pos str)) str)

This is shorter and easier to understand than:

((- (min (- pos) (length str))) str)

The longer variant avoids copying str to compute a valid negative index, though.
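A further option, sketched here only for comparison and not taken from the posts above, is to clamp a positive start offset instead of a negative index. It avoids the copy and never passes a negative index at all. The name leave-string-clamped is hypothetical.

```newlisp
; Hypothetical variant: compute a start offset that never drops below 0,
; so no negative index reaches slice.
(define (leave-string-clamped str pos)
  (if (>= pos 0)
      (slice str 0 pos)                           ; left part; an oversized pos yields the whole string
      (slice str (max 0 (+ (length str) pos)))))  ; right part; offset clamped at 0

(leave-string-clamped "foobar" 3)    ; → "foo"
(leave-string-clamped "foobar" -3)   ; → "bar"
(leave-string-clamped "foobar" 7)    ; → "foobar"
(leave-string-clamped "foobar" -7)   ; → "foobar"
```

Like the original, this works on bytes, since length and slice both count bytes for strings.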

The asymmetry that triggered this thread is:

> (set 'str "foobar")
"foobar"
> (11 str)
""
> (-11 str)

ERR: invalid string index
>

(The same holds for (slice str pos).) There may be reasons for these semantics, though: for example, negative indices near the tail of a list are inefficient compared to positive ones near its head, which could be a reason for making negative indices less comfortable to use.

hartrock · October 16, 2015, 06:52:23 PM

Quote from: xytroxon
A potential problem exists in handling UTF-8 string values in UTF-8 enabled versions of newLISP. In UTF-8 versions, the length function would return the number of bytes and not the number of UTF-8 characters in the string. Hence, your function would not return the correct number of characters from your string!

Thanks for the pointer!

This is indeed a problem for the use case that triggered this thread: the stdin/stdout/stderr of a newLISP interpreter are transferred via the WebSocket protocol, and the text is transferred in chunks so that intermediate results are available during longer-running computations. Separating these chunks is byte-string oriented, but the chunks are transferred as encoded JSON strings. JSON encoding can fail if a UTF-8 character is cut at the border between two chunks, so that its fragments end up in different chunks.
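One conceivable way to avoid cutting inside a character is to move a proposed cut position back until it no longer points at a UTF-8 continuation byte (bit pattern 10xxxxxx). This is a sketch only, not the code used in the WebSocket transfer described here; utf8-safe-cut is a hypothetical name, and the input is assumed to be valid UTF-8.

```newlisp
; Hypothetical helper: largest byte offset <= pos at which str can be cut
; without splitting a multi-byte UTF-8 sequence.
(define (utf8-safe-cut str pos)
  (let (len (length str) cut pos)
    (if (> cut len) (set 'cut len))
    ; step back while the byte at the cut position is a continuation byte
    (while (and (> cut 0) (< cut len)
                (= (& (first (unpack "b" (slice str cut 1))) 0xC0) 0x80))
      (dec cut))
    cut))

(set 's "我能")       ; 2 characters, 3 bytes each when stored as UTF-8
(utf8-safe-cut s 4)   ; → 3: offset 4 lies inside the second character
(utf8-safe-cut s 3)   ; → 3: already on a character boundary
(utf8-safe-cut s 6)   ; → 6: end of string
```

Because slice and unpack work on bytes, this does not depend on whether the newLISP build is UTF-8 enabled. A chunk would then be (0 (utf8-safe-cut str n) str), with the remainder carried over to the next chunk.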

So there is something to do; trying to provoke the problem triggers an error (the code to be eval'ed and the results of the evaluations are transferred via WebSocket as UTF-8 text):

> (set 's "我能吞下玻璃而不伤身体。")(dup s 500) ; dup leading to chunked transfer
"我能吞下玻璃而不伤身体。"
[text]我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。...[/text]
> (set 's "我能吞下玻璃而不伤身体。")(dup s 500) ; dup leading to chunked transfer

-> This run failed while transferring the second dup result (chunk borders vary); the first, successful one is shown truncated.

newLISP Fan Club
