cleaning strings

joejoe · May 13, 2012, 09:37:41 PM

Hi -

I would like to know how to do one thing, please.

I would like to know how to remove string elements which are less than 4 characters long from my string. (By characters, I mean letters, numbers, and punctuation such as ! # $ / , ' ; : and so on.)

For example:

If my string is this:

("a bb ccc dddd eeeee ffffff")

I would like to know which function is best for reducing this to only the elements that are four or more characters long:

("dddd eeeee ffffff")

I tried replace-ing

(replace "[.]" title "" 0)

and

(replace "[+]" title "" 0)

but that did not seem to clean out one character strings.

I tried various maneuvers w/ define and length but wound up lost and without solution.

Am I simply missing the "magic" regex that means "characters less than 4 characters long"?

And is replace the correct function to "remove" these things?

Thank you for any guidance!

xytroxon · May 14, 2012, 01:48:25 AM

One method is to parse the line into words, define a small? predicate, then use the clean function to drop the words that satisfy it. The join function can be used to make a string again.


(setq input (parse "a bb ccc dddd eeeee ffffff"))

(println input)

(define (small? x) (< (length x) 4))

(setq output (clean small? input))

(println output)
(println (join output " "))

(exit)

-- xytroxon

bairui · May 14, 2012, 02:00:04 AM

Cool. I was thinking of something along the same lines:

(setf input "a bb ccc dddd eeeee ffffff")

(filter (fn (s) (>= (length s) 4)) (parse input))

joejoe · May 14, 2012, 01:46:33 PM

Thanks, xytroxon and bairui!

I appreciate the examples and now understand how to use them.

Two things -

I tried both examples with this string (containing random non letter/number characters):

(setq input (parse "! @ # $$$ *- a bb ccc dddd eeeee ffffff"))

(println input)

(define (small? x) (< (length x) 4))

(setq output (clean small? input))

(println output)
(println (join output " "))

(exit)

On xytroxon's example, I get a result of this:

("!" "@")
()

On bairui's example, I get this:

()

I am still after only the string components with 4 or more characters, meaning somehow strip out those exclamations, symbols, etc. Is this possible?

The second thing:

Am I close to the correct way to translate the examples you both provided to a list example?

(setq 'input '("a" "bb" "ccc" "dddd" "eeeee" "ffffff"))

(println input)

(define (small? x) (< (length x) 4))

(setq output (clean small? input))

(println output)
(println (join output " "))

(exit)

I am getting this:

nil

ERR: list expected in function clean : input

and for bairui's example:

(setq 'input '("a" "bb" "ccc" "dddd" "eeeee" "ffffff"))
(filter (fn (s) (>= (length s) 4)) (parse input))
(exit)

I get this, too:

ERR: string expected in function parse : input

Okay and thanks!

xytroxon · May 15, 2012, 01:02:00 AM

You have a symbol quoting error...

Use: (set 'input ...

or: (setq input ... or (setf input ...

but not: (setq 'input ... nor (setf 'input ...

(setq and (setf are the same as using (set '
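As a minimal sketch of the list version with the quoting fixed: the three assignment forms below are interchangeable, and because input is already a list of strings, it goes straight to clean (or filter) with no parse step in between.

```newlisp
; Three equivalent ways to bind the symbol input to a list of strings.
(set 'input '("a" "bb" "ccc" "dddd" "eeeee" "ffffff"))
(setq input '("a" "bb" "ccc" "dddd" "eeeee" "ffffff"))
(setf input '("a" "bb" "ccc" "dddd" "eeeee" "ffffff"))

; True for strings shorter than 4 characters.
(define (small? x) (< (length x) 4))

; input is a list, so clean can work on it directly.
(setq output (clean small? input))

(println output)            ; prints ("dddd" "eeeee" "ffffff")
(println (join output " ")) ; prints dddd eeeee ffffff
```

The same holds for the filter version: use (filter (fn (s) (>= (length s) 4)) input) rather than wrapping input in parse, since parse expects a string.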

-----------------

We can then also add a regex to parse to force it to break on one or more whitespace characters, {\s+}:

(setq input (parse "! @ # $$$ *- a bb ccc dddd eeeee ffffff" {\s+} 0))

(println input)

(define (small? x) (< (length x) 4))

(setq output (clean small? input))

(println output)

(println (join output " "))

(exit)

-- xytroxon

cormullion · May 15, 2012, 04:48:23 AM

Joejoe, you should usually use parse with a string-break argument and optionally a regex option:

(parse string string-break regex-option)

otherwise you will see unexpected results, as newLISP tries to treat your input as source code.

> (parse "this is #1 in a list of 3")
("this" "is")
> (parse "well ; there's a thing!")
("well")
> (parse "[This sentence isn't going to be broken into words, whatever you do.")
("[This sentence isn't going to be broken into words, whatever you do.]")
> (parse "0800-074-085")
("0" "800" "-074" "-0" "85")
>
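For comparison, here is a rough sketch of similar inputs with an explicit break string. The choice of " " and "-" as separators is just an assumption about what the data looks like; the last line uses a regex break with option 0.

```newlisp
; Splitting on a literal space keeps "#", ";" and digits as ordinary text.
(println (parse "this is #1 in a list of 3" " "))
; prints ("this" "is" "#1" "in" "a" "list" "of" "3")

(println (parse "well ; there's a thing!" " "))
; prints ("well" ";" "there's" "a" "thing!")

; Splitting on the hyphen keeps each group of digits intact.
(println (parse "0800-074-085" "-"))
; prints ("0800" "074" "085")

; A regex break (option 0) treats any run of whitespace as one separator.
(println (parse "a   bb \t ccc" {\s+} 0))
; prints ("a" "bb" "ccc")
```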

joejoe · May 15, 2012, 06:43:41 PM

xytroxon, major thanks on using set, setq and setf properly! got it!

cormullion, thanks for the parse guidance because I will be using that a lot! ;0)

Lutz · May 16, 2012, 12:10:51 AM

You could also use 'find-all'. In that case the regular expression describes a class of tokens instead of break strings:

(set 'input  "! @ # $$$$$ *- a bb ccc dddd eeeee ffffff")

(find-all {\w{4,}} input)   => ("dddd" "eeeee" "ffffff")

(find-all "\\w{4,}" input)   => ("dddd" "eeeee" "ffffff")

(find-all "[^ ]{4,}" input)   => ("$$$$$" "dddd" "eeeee" "ffffff")
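To get back to a single cleaned string, the find-all result can be passed to join. This is only a sketch: keep-long is a hypothetical helper name, and it assumes a token is any run of four or more non-whitespace characters (\S), which behaves like the [^ ] pattern above but also breaks on tabs and newlines.

```newlisp
(set 'input "! @ # $$$$$ *- a bb ccc dddd eeeee ffffff")

; keep-long (hypothetical helper): collect every run of 4 or more
; non-whitespace characters, then glue the runs together with spaces.
(define (keep-long str)
  (join (find-all {\S{4,}} str) " "))

(println (keep-long input))   ; prints $$$$$ dddd eeeee ffffff
```

Swapping {\S{4,}} for {\w{4,}} would drop "$$$$$" as well, since \w only matches letters, digits and the underscore.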

joejoe · May 16, 2012, 07:31:18 PM

Most excellent!

That is the magic regex of 4+ characters! :0)

Thanks very much Lutz and I will study the slight differences in your regexes.

Very much appreciated and thanks again!

newLISP Fan Club
