cleaning strings

Started by joejoe, May 13, 2012, 09:37:41 PM

joejoe

Hi -



I would like to know how to do one thing, please.



I would like to know how to remove string elements which are less than 4 characters long from my string. (By characters, I mean letters, numbers, and symbols such as ! # $ / , ' ; : etc.)



For example:



If my string is this:


("a bb ccc dddd eeeee ffffff")

I would like to know which function is best to make this list only a list of strings four or more characters long.


("dddd eeeee ffffff")



I tried replace-ing


(replace "[.]" title "" 0)

and


(replace "[+]" title "" 0)

but that did not seem to clean out one character strings.



I tried various maneuvers w/ define and length but wound up lost and without solution.



Am I simply missing the "magic" regex that means "characters less than 4 characters long"?



And is replace the correct function to "remove" these things?



Thank you for any guidance!

xytroxon

#1
One method is to parse the line into words, define a small? predicate, then use the clean function to drop the short words. The join function can be used to make a string again.



(setq input (parse "a bb ccc dddd eeeee ffffff"))

(println input)

(define (small? x) (< (length x) 4))

(setq output (clean small? input))

(println output)
(println (join output " "))

(exit)


-- xytroxon
"Many computers can print only capital letters, so we shall not use lowercase letters."

-- Let's Talk Lisp (c) 1976

bairui

#2
Cool. I was thinking of something along the same lines:



(setf input "a bb ccc dddd eeeee ffffff")

(filter (fn (s) (>= (length s) 4)) (parse input))

joejoe

#3
Thanks xytroxon and bairui!



I appreciate the examples and now understand how to use them.



Two things -



I tried both examples with this string (containing random non letter/number characters):


(setq input (parse "! @ # $$$ *- a bb ccc dddd eeeee ffffff"))

(println input)

(define (small? x) (< (length x) 4))

(setq output (clean small? input))

(println output)
(println (join output " "))

(exit)


On xytroxon's example, I get a result of this:


("!" "@")
()


On bairui's example, I get this:


()

I am still after only the string components with 4 or more characters, meaning somehow strip out those exclamations, symbols, etc. Is this possible?



The second thing,



For the examples you both provided, am I close to the correct way to translate them to a list example?


(setq 'input '("a" "bb" "ccc" "dddd" "eeeee" "ffffff"))

(println input)

(define (small? x) (< (length x) 4))

(setq output (clean small? input))

(println output)
(println (join output " "))

(exit)


I am getting this:


nil

ERR: list expected in function clean : input


and for bairui's example:


(setq 'input '("a" "bb" "ccc" "dddd" "eeeee" "ffffff"))
(filter (fn (s) (>= (length s) 4)) (parse input))
(exit)


I get this, too:


ERR: string expected in function parse : input

Okay and thanks!

xytroxon

#4
You have a symbol quoting error...



Use: (set 'input ...



or: (setq input ...  or (setf input ...



but not: (setq 'input ...  nor  (setf 'input ...



(setq and (setf are the same as using (set '
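To make the difference concrete, here is a minimal sketch (not from the original posts) of the three assignment forms applied to the list version of the example. It assumes the data is already a list, so the parse step is skipped, since parse expects a string. The small? predicate is the one used earlier in the thread.

```newlisp
; All three forms bind the same list to the symbol input.
(set 'input '("a" "bb" "ccc" "dddd" "eeeee" "ffffff"))
(setq input '("a" "bb" "ccc" "dddd" "eeeee" "ffffff"))
(setf input '("a" "bb" "ccc" "dddd" "eeeee" "ffffff"))

; input is already a list, so clean can be applied directly (no parse).
(define (small? x) (< (length x) 4))
(println (clean small? input))   ; ("dddd" "eeeee" "ffffff")
```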



-----------------



We can then also add some regex to parse to force it to break on one or more whitespace chars {\s+}



(setq input (parse "! @ # $$$ *- a bb ccc dddd eeeee ffffff" {\s+} 0))

(println input)

(define (small? x) (< (length x) 4))

(setq output (clean small? input))

(println output)

(println (join output " "))

(exit)



cormullion

#5
Joejoe, you should usually use parse with a string-break argument and optionally a regex option:


(parse string string-break regex-option)

otherwise you will see unexpected results, as newLISP tries to treat your input as source code.


> (parse "this is #1 in a list of 3")
("this" "is")
> (parse "well ; there's a thing!")
("well")
> (parse "[This sentence isn't going to be broken into words, whatever you do.")
("[This sentence isn't going to be broken into words, whatever you do.]")
> (parse "0800-074-085")
("0" "800" "-074" "-0" "85")
>
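As a rough sketch of this advice applied to the earlier test string (my own illustration, assuming the tokens are separated by single spaces), passing " " as the break string keeps parse from tokenizing the input as source code, so the symbols survive as ordinary tokens:

```newlisp
; Split on a literal single space instead of using source-code rules.
(set 'words (parse "! @ # $$$ *- a bb ccc dddd eeeee ffffff" " "))
(println words)

; Keep only the tokens that are four or more characters long.
(set 'kept (filter (fn (s) (>= (length s) 4)) words))
(println (join kept " "))   ; dddd eeeee ffffff
```

The names words and kept are only placeholders for this example.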
 

joejoe

#6
xytroxon, major thanks on using set, setq and setf properly! got it!



cormullion, thanks for the parse guidance because I will be using that a lot! ;0)

Lutz

#7
You could also use 'find-all'. In that case the regular expression describes a class of tokens instead of break strings:


(set 'input  "! @ # $$$$$ *- a bb ccc dddd eeeee ffffff")

(find-all {\w{4,}} input)   => ("dddd" "eeeee" "ffffff")

(find-all "\\w{4,}" input)   => ("dddd" "eeeee" "ffffff")

(find-all "[^ ]{4,}" input)   => ("$$$$$" "dddd" "eeeee" "ffffff")
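Since the original question asked for a cleaned string rather than a list, here is a small sketch (an addition, not part of the post above) that joins the find-all result back together. The variable name cleaned is just an illustration.

```newlisp
(set 'input "! @ # $$$$$ *- a bb ccc dddd eeeee ffffff")

; {[^ ]{4,}} matches runs of four or more non-space characters.
(set 'cleaned (join (find-all {[^ ]{4,}} input) " "))
(println cleaned)   ; $$$$$ dddd eeeee ffffff
```

Swapping the pattern for {\w{4,}} would drop the "$$$$$" token as well, because \w only matches letters, digits, and the underscore.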

joejoe

#8
Most excellent!



That is the magic regex of 4+ characters! :0)



Thanks very much Lutz and I will study the slight differences in your regexes.



Very much appreciated and thanks again!