escaping characters problem

Started by cormullion, January 20, 2007, 04:56:18 AM


cormullion

I'm processing some text so that various characters are replaced with alternative formulations. The problem is that the curly braces in the original text need to be replaced with \letteropenbrace{} but the backslashes need to be replaced with \letterbackslash{}, and I can't do both these operations independently.



One possible solution that occurred to me was to replace the '{' with some unique text that couldn't possibly occur in newLISP source code, then to replace it again once I'd finished the other characters. But is this the most efficient way of doing a sequence of changes? It looks inefficient as well, but perhaps it isn't.


(set 'c {text with {braces} and a \backslash})  ; any input string

(set 'uuid1 (uuid))
(set 'uuid2 (uuid))

(replace "{" c uuid1)
(replace "}" c uuid2)

; backslash first, because every later replacement inserts a backslash
(replace {\} c {\letterbackslash{}})
(replace {$} c {\letterdollar{}} )
(replace {#} c {\letterhash{}})
(replace {!} c {\letterexclamationmark{}} )
(replace {|} c {\letterbar{}})
(replace {@} c {\letterat{}})
(replace {^} c {\letterhat{}})
(replace "%" c {\letterpercent{}} )
(replace "/" c {\letterslash{}} )
(replace "<" c {\letterless{}} )
(replace ">" c {\lettermore{}} )
(replace "~" c {\lettertilde{}} )
(replace "&" c {\letterampersand{}})
(replace "?" c {\letterquestionmark{}})
(replace "_" c {\letterunderscore{}})
(replace "'" c {\lettersinglequote{}})

; finally, replace the { and }
(replace uuid1 c {\letteropenbrace{}})
(replace uuid2 c {\letterclosebrace{}})


Is there a better solution?

Lutz

#1
I had the same problem when writing the Wiki and IDE software and then again when writing syntax.cgi. I came to the same solution you are using:



(1) translate strings or characters into something else to protect them from change by 'replace'



(2) do the replacements



(3) translate the protected strings and characters back



'replace' is pretty fast, especially when using raw string replacement without regular expressions. The only thing to consider is that using 'uuid' for each character is perhaps a bit expensive, because each uuid is 36 characters long, although it is a 100% safe solution, because a uuid is unique :). It all depends on the characteristics of your text.
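
The three steps above can be sketched with short placeholders instead of uuids. This is only an illustration under stated assumptions: the function name escape-tex is hypothetical, the control characters 1 and 2 are assumed never to occur in the input text, the replacement commands are assumed to carry a leading backslash, and only two of the ordinary characters are shown.

```newlisp
; Hypothetical helper, not from the thread: protect, replace, restore.
(define (escape-tex str)
  (let (lb (char 1) rb (char 2))      ; placeholders assumed absent from str
    ; (1) protect the braces
    (replace "{" str lb)
    (replace "}" str rb)
    ; (2) ordinary replacements; backslash first, since the
    ;     later replacement texts themselves contain backslashes
    (replace {\} str {\letterbackslash{}})
    (replace "$" str {\letterdollar{}})
    ; (3) translate the protected characters back
    (replace lb str {\letteropenbrace{}})
    (replace rb str {\letterclosebrace{}})
    str))

(escape-tex {a{b}\c$})
;-> "a\letteropenbrace{}b\letterclosebrace{}\letterbackslash{}c\letterdollar{}"
```

A one-character placeholder keeps the intermediate string short, at the price of having to know that the chosen characters really are unused in the text.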



What I do in the Wiki is using HTML coded characters:


        (replace "&" str "&amp;")
        (replace "<" str "&lt;")
        (replace ">" str "&gt;")
        (replace " " str "&nbsp;")


In normal text these sequences pretty much never occur, and at least in the Wiki and IDE programs they proved to be a safe choice.
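
A small sketch of why the order of these replacements matters, and of the translation back. The names html-escape and html-unescape are hypothetical and not taken from the Wiki source; the space to &nbsp; step is left out to keep the example short.

```newlisp
; Hypothetical helpers illustrating the ordering of entity replacements.
(define (html-escape str)
  (replace "&" str "&amp;")   ; first, or the & in &lt; and &gt; gets escaped again
  (replace "<" str "&lt;")
  (replace ">" str "&gt;")
  str)

(define (html-unescape str)
  (replace "&lt;" str "<")
  (replace "&gt;" str ">")
  (replace "&amp;" str "&")   ; last, the mirror image of html-escape
  str)

(html-escape "a < b && c > d")
;-> "a &lt; b &amp;&amp; c &gt; d"
```

With this ordering, text that already contains something like &lt; survives a round trip through both functions unchanged.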



Lutz

cormullion

#2
Using HTML is a good idea - although as soon as you write it, of course, you realise that it's an example of the sort of text that will break your script.



Or to put it another way, can the script process itself?



Everything's very quiet on the newLISP front recently? Is everybody switching to Lua? .... :-/

HPW

#3
Quote:
Everything's very quiet on the newLISP front recently? Is everybody switching to Lua? .... :-/


I would not call it quiet, since 9.1 is on the horizon now.

No reason to switch anywhere, simply using this great language!

As Paul Graham states: 'Try to solve hard problems'.



(hpwNewLISP 2.19 gets more tightly integrated with the new private variable feature in neobook 5.5.3.)



Improving things that are already very good gets more and more difficult!

;-)
Hans-Peter

m i c h a e l

#4
Hi Cormullion!


Quote from: "Cormullion"
Everything's very quiet on the newLISP front recently? Is everybody switching to Lua?


Perish the thought! Melissa and I have been hard at work on getting a version of neglOOk running on newLISP wiki. It is turning out quite awesome, if I may say so myself ;-) We hope to go live with it very soon.



Working with newLISP wiki has led me to an even greater appreciation of the inspiring work Lutz is doing on both the wiki and the language.



Lua is one of the languages I went some distance to learn. There are many things to commend in Lua, but one of them is not her speed of text processing using regular expressions (in my own experience, of course). This is the number one thing I end up using a language for.



Since I started programming in newLISP, I have not had a desire to try other languages. This has given me a chance to actually become more familiar with one language, while allowing me to see the logic (and the beauty) of the design that has gone into newLISP.



The wall is calling this flower back!



m i c h a e l



P.S. Who's Paul Graham? ;-)

cormullion

#5
Hi m i c h a e l - glad to see you around...! looking forward to neglooking at your site.



Don't get me wrong - I'm not saying anything negative about newLISP. It's partly that I was reading a bit about Lua recently (not really by choice, more because of luatex) and saw the page at http://www.lua.org/uses.html - quite an impressive list of people who've built it into their products or applications. I'm no programmer, so I don't know what it takes to 'get a programming language into' a product, but I do know that sometimes I'd prefer to use my own favourite language rather than have to learn another one, so I wish that more people chose newLISP...



Also partly because it's been quiet anyway round here.

HPW

#6
Quote:
P.S. Who's Paul Graham?


The man who is promising ARC! ;-)


Quote:
When will that be? We have no idea. We reserve the right to take a very long time. It has been over 40 years since McCarthy first described Lisp. Another 2 or 3 aren't going to kill anyone. So please don't send us mail asking what Arc's status is or when it will be done. (When it's done, we'll tell you.)


I am glad that Lutz does not promise things, he makes them come alive!
Hans-Peter

William James

#7
Quote:
I'm processing some text so that various characters are replaced with alternative formulations. The problem is that the curly braces in the original text need to be replaced with \letteropenbrace{} but the backslashes need to be replaced with \letterbackslash{}, and I can't do both these operations independently.



One possible solution that occurred to me was to replace the '{' with some unique text that couldn't possibly occur in newLISP source code, then to replace it again once I'd finished the other characters. But is this the most efficient way of doing a sequence of changes? It looks inefficient as well, but perhaps it isn't.


I think that the best solution is to pass through the string only once.  That way, the replacements won't be replaced.



(define (translate ch)
  (case ch
    ({\} {\letterbackslash{}})
    ({$} {\letterdollar{}} )
    ({#} {\letterhash{}})
    ({!} {\letterexclamationmark{}} )
    ({|} {\letterbar{}})
    ({@} {\letterat{}})
    ({^} {\letterhat{}})
    ("%" {\letterpercent{}} )
    ("/" {\letterslash{}} )
    ("<" {\letterless{}} )
    (">" {\lettermore{}} )
    ("~" {\lettertilde{}} )
    ("&" {\letterampersand{}})
    ("?" {\letterquestionmark{}})
    ("_" {\letterunderscore{}})
    ("'" {\lettersinglequote{}})
    ("{" {\letteropenbrace{}})
    ("}" {\letterclosebrace{}})
    (true ch)))

(setq text {A set {2 3 5 e f} of numbers & letters.})

(replace "." text (translate $0) 0)
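
The same single-pass idea can also be driven by a lookup table rather than a case form, which keeps the character list as data. A minimal sketch, assuming backslash-prefixed replacement commands; the names tex-table, tex-translate and sample are made up for illustration and only four entries are shown.

```newlisp
; Hypothetical table-driven variant of the single-pass translation.
(set 'tex-table '(
  ({\} {\letterbackslash{}})
  ("{" {\letteropenbrace{}})
  ("}" {\letterclosebrace{}})
  ("&" {\letterampersand{}})))

; lookup returns nil for characters not in the table, so fall back to ch
(define (tex-translate ch)
  (or (lookup ch tex-table) ch))

(set 'sample {A set {2 3} & a \ mark.})

; the regex "." visits each character once; replacement text is never rescanned
(replace "." sample (tex-translate $0) 0)
;-> "A set \letteropenbrace{}2 3\letterclosebrace{} \letterampersand{} a \letterbackslash{} mark."
```

Because each character of the input is matched exactly once, the order of the table entries does not matter and no placeholders are needed.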

TedWalther

#8
Yes, that is the solution I came to as well.
Cavemen in bearskins invaded the ivory towers of Artificial Intelligence.  Nine months later, they left with a baby named newLISP.  The women of the ivory towers wept and wailed.  "Abomination!" they cried.

schilling.klaus

#9
Does this approach also work for replacing unicode characters outside the ascii range with numerical html character references or the other way round?

Lutz

#10
Yes, you can use Unicode characters in regular expressions when using a UTF-8 enabled version of newLISP:


(replace "死" "abc死def死ghi" "生" 0) ;-> "abc生def生ghi"

; the same using \u Unicode escapes instead of UTF-8 literals

(replace "\u6b7b" "abc\u6b7bdef\u6b7b" "\u751F" 0) ;-> "abc生def生"
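
For the numerical character references asked about above, something along these lines might work. It is a sketch only, assuming a UTF-8 enabled newLISP, where 'char' converts between a character and its code point and regex option 2048 (PCRE_UTF8) makes the pattern match whole characters rather than bytes. The names to-char-refs and from-char-refs are hypothetical.

```newlisp
; Hypothetical helpers; assume a UTF-8 enabled newLISP build.

; replace every non-ASCII character with a decimal reference like &#27515;
(define (to-char-refs str)
  (replace {[^\x00-\x7F]} str (string "&#" (char $0) ";") 2048)
  str)

; the other way round: turn decimal references back into characters
; (base 10 is given explicitly so leading zeros are not read as octal)
(define (from-char-refs str)
  (replace {&#(\d+);} str (char (int $1 0 10)) 0)
  str)

(from-char-refs (to-char-refs "abc死def"))
```

On a build without UTF-8 support, 'char' works on single bytes, so this sketch would not give meaningful references there.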


ps: you need a modern UTF8 enabled web browser and Chinese fonts to see this post correctly