"save" makes improper [text] strings; impossible to reload

Started by TedWalther, August 19, 2015, 05:29:26 PM


TedWalther

I stored some text in a context.  That text contained a lot of cut and pasted newlisp code.



I did (save "myfile.lsp" MySnippets)



It saved it all.



Then I did (load "myfile.lsp").  Uh-oh.  Error.



ERR: symbol expected : [text] blahblahblah


There is no line number, and it is a big file, so grep doesn't help.  Pop open vi, and see the problem: there were embedded [text] tags in the data.



Simple way to reproduce problem:



Make a file with more than 2048 bytes in it.  I just did this:



$ vi foo.txt
type this: i[/text]<ESC>2050Im<ESC>:wq
$ newlisp
(set 'foo (read-file "foo.txt"))
 => works correctly
(save "foo.lsp" 'foo)
=> it doesn't complain...
(load "foo.lsp")
ERR: symbol expected in function set : [/text]
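The same failure can be reproduced without vi; this is a minimal sketch of the idea (it assumes a newLISP build from before the fix discussed later in this thread):

```newlisp
; build a string > 2048 bytes that contains a literal [/text] tag
(set 'foo (append "[/text]" (dup "m" 2050)))
(save "foo.lsp" 'foo)   ; save picks [text]...[/text] because of the length
(load "foo.lsp")        ; the embedded [/text] ends the literal too early
; ERR: symbol expected
```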


Lutz, would it make sense to have "" be the default representation for strings, and let the interpreter track internally whether the string is 2048 bytes or longer?



Update



What I mean by that: does the programmer really need to know which internal string implementation is being used? String representation is for programmers to use.  Doesn't the interpreter have enough information to switch between a 2048-byte buffer and an arbitrarily sized string on its own?
Cavemen in bearskins invaded the ivory towers of Artificial Intelligence.  Nine months later, they left with a baby named newLISP.  The women of the ivory towers wept and wailed.  "Abomination!" they cried.

TedWalther

As a temporary workaround, before I ever "save" a string, I do (replace {]} my_string {]}).  And then after saving and loading it, I reverse it with (replace {]} my_string {]})
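The two replace calls above were mangled by the forum software (both now show identical arguments), so the exact original escape is lost. As one possible shape of such a workaround, a sketch - the helper names and the escape token are invented here, not from the original post:

```newlisp
; escape the closing tag before save, restore it after load;
; "[#SLASH-TEXT#]" is an arbitrary token assumed not to occur in the data
(define (escape-tags s)   (replace "[/text]" s "[#SLASH-TEXT#]"))
(define (unescape-tags s) (replace "[#SLASH-TEXT#]" s "[/text]"))
```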

Lutz

The distinction between strings shorter than 2048 and those 2048 or longer is made for speed. The internal string representation is the same; what differs is the method used for reading strings from source. Strings shorter than 2048, delimited by "" or {}, are read with pre-allocated buffers or on the stack with a single read(). On output, newLISP looks at the length and then chooses between "" and [text][/text] tags. For longer strings, stream reading is used. The short kind, the most common in program source, can also be handled fast on the C stack, while the long kind cannot.



But the three sets of delimiters also offer the programmer different ways of encoding strings. E.g. for web work, [text][/text] tags are very convenient because they take text as is, including line feeds. {} strings are useful for regular expressions, avoiding double backslashes.
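A short sketch of the three delimiter styles side by side (standard newLISP behavior):

```newlisp
(set 'a "one\ttwo\n")      ; "" strings process escapes like \t and \n
(set 'b {\d+\s+\w+})       ; {} strings take backslashes literally - handy for regex
(regex b "match 42 now")   ; no double backslashing needed
(set 'c [text]
taken verbatim, including "quotes", {braces}
and literal line feeds
[/text])                   ; [text] tags: everything as is
```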

TedWalther

Lutz, can you change things so that the "save" function works?  Is save much good if you can't "load" its output?

Lutz

see here: http://www.newlisp.org/downloads/development/inprogress/

hartrock

But this does not work for unprintable chars.



This is OK:

> (set 's "this is \000 a test")
"this is \000 a test"
; but here the \000 has silently gone away:

> (set 'sb (dup s 150))
[text]this is  a testthis is  a testthis is  a test[... "this is  a test" repeated 150 times; the \000 bytes are not visible in the [text] display ...]this is  a test[/text]
>

Any ideas?

It would be nice to have a robust way to store/load strings whatever they contain.

Or alternatively: if strings are the wrong data type for storing unprintable chars, this could be an error case (but for shorter strings it is allowed, so the switch between "" and [text][/text] leads to inconsistent behavior here).

TedWalther

I was thinking, from the point of view of correctness, it is easiest just to have "" for strings.  The Go language has `` for unescaped strings. {} strings are convenient, but it can sometimes bite that they may only contain balanced braces.  I think overall, if "" worked for any size, I'd rarely use the other kinds, especially if "" strings allowed literal characters inside, so I could use them the same way as [text] and {} for multiline strings.



It is a huge "gotcha" to have a string type (two string types) that works only on some content, with no escape mechanism to include all content.  And then the one type of string that does allow all content is limited to 2048 bytes.  Aaarrghghh!



Lutz, how about this: when doing "save", if the string is longer than 2048 bytes, instead of converting it to [text], convert it to (list num num num ...) where each num is a byte with a value in the range 0..255.  Or else (string str1 str2 ...) where str1 is a 2048-byte string in "" representation, as is str2, and so on until all the content is represented.
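The chunked alternative can be sketched with explode, which already splits a string into fixed-size pieces (the 2047 figure just follows the discussion above):

```newlisp
(set 'big (dup "x" 5000))
(set 'chunks (explode big 2047))   ; pieces of at most 2047 bytes each
; the saved form would then read back as:
; (set 'big (string "chunk1" "chunk2" "chunk3"))
```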

Lutz

In Hartrock's example the \000 characters are still in sb, but text in [text],[/text] tags does not escape characters, so \000 is not shown, although it is part of sb.



Right now in 10.6.4 (in progress), base64 encoding is used by save for strings > 2047 characters that contain the [/text] tag. I could drop the [/text] condition and always use base64 for strings longer than 2047 in the save function. That would make Hartrock's sb string usable with save.



But I don't want to give up the [text],[/text] tags for completely unescaped text (not binary content). This is frequently used in web programming. I also don't want to eliminate the 2047 limit for "-quoted strings, for speed in processing.



The need to display code > 2047 bytes containing non-displayable binary info is very rare. save will then work with binary contents too, if it always uses base64 transformation on strings > 2047 characters.
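For reference, the base64 round trip is lossless for binary content; a short sketch with newLISP's built-in base64-enc/base64-dec:

```newlisp
(set 'bin (dup "ab\000" 1000))     ; 3000 bytes including embedded nulls
(set 'enc (base64-enc bin))        ; printable, safe to write into a .lsp file
(= bin (base64-dec enc))           ; lossless: decoding restores the nulls
```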

TedWalther

The base64 solution bothers me, because it breaks the human readability of the (save) output.  Can you share a bit more info about the speedups you achieved with the three different string types?  How about a fourth type of string, L"..." for "" strings longer than 2048 bytes?



update



Or instead of L"my string", do it Python style: """my string"""

hartrock

After doing some other newLISP-related stuff, back to this interesting discussion.
Quote from: "Lutz"
In Hartrock's example the \000 characters are still in sb, but text in [text],[/text] tags does not escape characters, so \000 is not shown, although it is part of sb.

This is OK for using these strings inside newLISP; but it causes problems when transferring their [text]...[/text] representation, because that omits part of the string (information loss).



There is such a use case: I'm in the process of writing a newLISP 'Inspector' server app for inspecting all newLISP symbols by a browser (planned to publish it in a while). Therefore all strings will be transferred from server to browser (via JSON) by using their :source representation.

Currently there is (using newlisp-10.6.4.tgz  2015-08-31):

[unprintable chars] "bar\000buz"
- OK, no information loss -, and

[unprintable chars] [text]this is  a testthis is  a test[... "this is  a test" repeated; every \000 dropped from the display ...]this is  a test[/text]
- not optimal, because some parts of the string are missing -

as part of info in browser window.

 
Quote from: "Lutz"
Right now in 10.6.4 (in progress) base64 encoding is used for strings > 2047 characters and containing the [/text] tag when using save. I could drop the condition for [/text] and always use base64 for strings longer than 2047 in the save function. That would make Hartrock's sb string usable for the save function.

Quote from: "TedWalther"
The base64 solution bothers me, because it breaks the human readability of the (save) output.

Inspecting a newLISP system for development/debugging purposes needs two things:

1. accuracy, and
2. readability.

    For serving 1., the internal representation has to be shown, which even for short strings is not optimal regarding 2.; e.g. the

    [text]
    One line...
    second line...
    third line...
    [/text]

    representation is more readable than its internal "\nOne line...\nsecond line...\nthird line...\n" representation (for the 'Inspector' app I'm preferring accuracy, if in doubt).


    Quote from: "Lutz"
    But I don't want to give up the [text],[/text] tags to use completely unescaped text (not binary content). This is frequently used in web programming.

    They are also very nice for showing loaded text files as symbol evaluations (*.html, *.txt, *.js, etc.).


    Quote from: "Lutz"
    I also don't want to eliminate the 2047 limit for "-quoted strings, for speed in processing.


    Quote from: "Lutz"
    The need to display code > 2047 and containing non-displayable binary info is very rare.

    For debugging purposes there is such a need.


    Quote from: "Lutz"
    The save now will work with binary contents too if it always uses base64 transformation on strings > 2047 characters.

    This is bad for readability of the - longer - text files mentioned above; I just tried to encode at https://www.base64encode.org/:

    Hello World!
    ; resulting in Base64 format (UTF-8):

    SGVsbG8gV29ybGQh

    Note: the result is longer than the source.

    Idea:

    What about switching to base64 transformation (or another accurate variant) only when there are unprintable chars?

    This would give readability in most cases, but would also provide accuracy in the rarer ones.


    Quote from: "TedWalther"
    Lutz, how about this; when doing "save", if the string is longer than 2048 bytes, instead of converting it to [text], convert it to (list num num num) where each num is a byte, value in the range 0..255  Or else (string str1 str2...) where str1 is a 2048 byte string in "" representation, as is str2, up until all the content is represented.

    I like the simplicity and readability (compared with base64 encoding) of this approach.



    Idea:

    To combine the best of all variants:
    1. "..." as now for short strings.
    2. [text]...[/text] for longer strings not containing unprintable chars.
    3. (string str1 str2 ...) encoding for longer strings containing a small amount of unprintable chars.
    4. base64 encoding for longer strings containing a significant amount of unprintable chars (binary data in strings would not be very readable in other representations either).

    But 4. should be used only then - due to its unreadability (checking values of single bytes is not possible) - if it gives real improvements regarding memory or speed (possibly I'm missing an important point here) compared with 3. (when is this the case?).
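The combined rule could be sketched as follows; the helper, the 0.1 threshold, and the treatment of tab/newline as printable are my own guesses, not from the thread:

```newlisp
; fraction of control bytes (ignoring tab and newline) in a string;
; byte-oriented view - a UTF-8 build would see code points instead
(define (unprintable-ratio s)
  (if (= s "") 0
    (div (length (filter (fn (c) (and (< (char c) 32)
                                      (!= c "\t") (!= c "\n")))
                         (explode s)))
         (length s))))

(define (pick-repr s)
  (cond ((< (length s) 2048) "quoted")                  ; variant 1
        ((= 0 (unprintable-ratio s)) "text-tags")       ; variant 2
        ((< (unprintable-ratio s) 0.1) "string-chunks") ; variant 3
        (true "base64")))                               ; variant 4
```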

    Lutz

    Strings longer than 2047 characters will now be broken up by save into portions of up to 72 characters, delimited by normal quotes "..." and escaped for unprintable characters.



    For example:



    (set 'str (dup "Part of \000 a long string with more than 2047 chars. " 45))


    produces using save:



    (set 'str (append
    "Part of \000 a long string with more than 2047 chars. Part of \000 a long stri"
    "ng with more than 2047 chars. Part of \000 a long string with more than 204"
    ...
    "tring with more than 2047 chars. Part of \000 a long string with more than "
    "2047 chars. Part of \000 a long string with more than 2047 chars. "))


    The internal representation of a string is always the same: address and length. The internal I/O routines decide the format by looking at the length field of the string and the device used.



    http://www.newlisp.org/downloads/development/inprogress/

    TedWalther

    Thank you Lutz.  I woke up from a dream this morning with another solution, but then saw you had implemented this one.  I'm very glad and appreciate the one you implemented.  Here was the one I saw in the dream:



    (length str_one) => 5000
    str_one => [text 5000].....[/text]


    Also, funnily enough, I came across this problem in exactly the same way as hartrock: transmitting newlisp code across the internet embedded in JSON.

    hartrock

    Quote from: "Lutz"
    save strings longer than 2047 will now be broken up in portions of up to 72 characters delimited by normal quotes "..." and escaped for unprintable characters.

    I like this solution: accurate and even more readable than longer "..." chunks.


    Quote from: "Lutz"
    The internal representation of a string is always the same: address and length. The internal I/O routines decide the format by looking at the length field of the string and the device used.

    So there will be output of longer strings in [text]...[/text] format printed by the newLISP interpreter (please correct me, if I'm wrong).



    For having nice output of longer strings (e.g. from reading text files) in the browser, I like the [text]...[/text] variant for strings only containing 'normal' text chars; e.g. currently there is something like this in the browser window of 'Inspector' app:

    WS_T_Content: _jqtree.css  ->
    [text]ul.jqtree-tree {
      list-style: none outside;
      margin-left: 0;
      margin-bottom: 0;
      padding: 0; }
      ul.jqtree-tree ul.jqtree_common {
        list-style: none outside;
        margin-left: 12px;
        margin-right: 0;
        margin-bottom: 0;
        padding: 0;
        display: block; }
      ul.jqtree-tree li.jqtree-closed > ul.jqtree_common {
        display: none; }
      ul.jqtree-tree li.jqtree_common {
        clear: both;
        list-style-type: none; }
      ul.jqtree-tree .jqtree-toggler {
        border-bottom: none;
        color: #333;
        text-decoration: none;
        margin-right: 0.5em;
        vertical-align: middle;
        /* [sr] Solves the problem with toggler char occupying less than 1em, by
         * enforcing to use exactly this space. */
        float: left; width: 1em; }
        ul.jqtree-tree .jqtree-toggler:hover {
          color: #000;
          text-decoration: none; }
        ul.jqtree-tree .jqtree-toggler.jqtree-closed {
          background-position: 0 0; }
      ul.jqtree-tree .jqtree-element {
        cursor: pointer;
        position: relative; }
      ul.jqtree-tree...

    Is it safe to assume that all chars not represented as \nnn escape sequences in the "..." format are printable chars suited for [text]...[/text] input/output?

    Then I would have a criterion for writing a function that converts the chunked "..." format into [text]...[/text] format, for displaying a 'friendly' string nicely in the browser: of course [text],[/text] tags inside the string (and probably backslashes) should be treated specially (possibly by enforcing chunked "..." output, to have no information loss).

    It is important to keep a correct (no information loss) "..." chunk input/output representation for 'nasty' strings.
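Such a criterion could be sketched as a predicate; this is a hypothetical helper of my own, not an existing newLISP function, and it works on characters as explode delivers them:

```newlisp
; true when showing s between [text]...[/text] tags would be lossless:
; no closing tag inside, and only printable chars plus tab and newline
(define (text-safe? s)
  (and (not (find "[/text]" s))
       (for-all (fn (c) (or (>= (char c) 32) (= c "\t") (= c "\n")))
                (explode s))))
```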



    I don't know how robust this approach would be - e.g. regarding invalid UTF-8 sequences (https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences) - or what it really should be to get no surprises for 'nasty' strings; so any hints could be helpful here (string representations are a huge field...).



    Perfect would be to have a function like e.g. valid-utf8? or text? (the latter checking for valid UTF-8, no embedded [text],[/text] tags, and possibly no backslashes inside - you get the idea): the latter would be ideal for my use case, but the former would be more flexible for being used in other ones, too...



    PS: Possibly the best default for an 'Inspector' app targeted at developers is to show the internal "" rep (accuracy), with the possibility to switch to a user-friendly [text] view (which may have information loss).

    hartrock

    Two more ideas (after sleeping).

    1. extend seems to be faster:

       > (time (append "foo" "bar" "buz") 1000)
       0.733
       > (time (extend "foo" "bar" "buz") 1000)
       0.427
    2.
       Quote from: "hartrock"
       Quote from: "Lutz"
       save strings longer than 2047 will now be broken up in portions of up to 72 characters delimited by normal quotes "..." and escaped for unprintable characters.

       I like this solution: accurate and even more readable than longer "..." chunks.

    This is not necessarily true.

    But there is another variant:

    (extend "First line...\n\tsecond line with tab...\n  third line with indent...\n")

    could also be:

    (extend
      "First line...\n"
      "\tsecond line with tab...\n"
      "  third line with indent...\n"
    )

    This would preserve WS formatting to some extent!

    Strings with very many \n's inside (each converted to 5 instead of 2 chars in output) should be very rare, so I think this is no problem here.

    As an upper line-length limit - if there is no \n in sight - 2047 could be used...



    For inspecting a newLISP system this would be a nice representation avoiding all [text]...[/text] issues: smaller strings up to 2047 chars could be shown optionally in this format, too.
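The newline-preserving split described above could be sketched like this; the helper name is my own, and find-all with a regex is used so each line keeps its trailing \n (standard newLISP functions assumed):

```newlisp
; split s after each newline; pieces longer than max-len are cut further
(define (nl-chunks s (max-len 2047))
  (let (chunks '())
    ; each regex match is one line including its \n (or a final tail without one)
    (dolist (ln (find-all {[^\n]*\n|[^\n]+} s))
      (extend chunks (explode ln max-len)))
    chunks))

; e.g. (nl-chunks "First line...\n\tsecond line with tab...\n")
; gives ("First line...\n" "\tsecond line with tab...\n")
```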



    BTW:

    Recently I have checked exporting all syms and their values in JSON format, with some JSON whitespace for user-friendly formatting - indents, NLs - or without it for saving space: the saving was only about 7% when switching from e.g.

    {
    "Class:Class": {
      "type": "sym",
      "prefix": "Class", "term": "Class",
      "global?": false, "protected?": false,
      "val": {
        "type": "lambda",
        "val": "(lambda () (cons (context) (args)))"
      }
    },
    to (one line):

    {"Class:Class":{"type":"sym","prefix":"Class","term":"Class","global?":false,"protected?":false,"val":{"type":"lambda","val":"(lambda () (cons (context) (args)))"}},

    Lutz

    After doing a few more benchmarks, I stay with append, which is faster when appending more than a few strings. In the save case you have at least about 30 strings of about 72 characters each. extend does a realloc() on each string, while append allocates memory in bigger chunks.