replace regular expressions

Started by newdep, December 07, 2008, 01:22:26 AM

newdep

Finally, a good statement from Larry Wall (of Perl) regarding Perl 6:



"It will break backward compatibility [but] in order to simplify it we have to get rid of old cruft, particularly the regular expression cruft," Wall said. "A lot of the unreadability of Perl is related to the regular expression syntax – and we didn't do that, we got it from Unix. It needs to be end-of-lifed. Regular expressions are not strings, they are a sub-language. We took it and made it worse. There is this two-pass nature that is evil."

http://www.computerworld.com.au/article/269758/perl_6_break_compatibility_support_other_interpreters?eid=-6787



...Now... Rebol has already built its own very readable parsing dialect, the most elegant dialect I know of...



Now the question arises: can newLISP come back (because we had some functions in the past) with a readable string language instead of regex?

I have often thought about this, but I never got to the point of setting something up... One thing is sure: regex (those ugly ants) does not enhance a language; it makes it flexible but not readable.



Regex is flexible, no doubt, but when I want to do a simple multiple-string search and replace (which is the case 99% of the time), I would like to see no regex at all ;-) Yes, we get used to using it, but it's time for readability. Secondly, regexes don't fit into Lisp.
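
To show the kind of thing I mean, here is a minimal sketch of a regex-free multiple-string replacement. It is written in Python purely for illustration; `replace_many` and its `table` argument are hypothetical names, not an existing newLISP or Python function.

```python
def replace_many(text, table):
    """Replace every key of `table` found in `text` with its value.

    Hypothetical helper: scans the text once from left to right, so a
    replacement is never itself replaced again.
    """
    # Try longer keys first so "foobar" wins over "foo" at the same position.
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if key and text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No key matches here: copy one character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(replace_many("the cat sat on the mat", {"cat": "dog", "mat": "rug"}))
```

Because the text is scanned only once, swapping two strings (a for b and b for a) works as expected, which is awkward to get right with chained single replacements.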



I vote for an alternative ;-)
-- (define? (Cornflakes))

Lutz

#1
We had 'match' for strings in older versions, but it simply wasn't that powerful. The current regex form all the world uses is ugly, but at least it's a standard and highly developed and efficient.



If a different regex pattern syntax gets developed, one of the regex experts, e.g. Philip Hazel, should do it, because the back-end would always be the same highly optimized engine.

xytroxon

#2
The only concern I have is that the current version used is outdated or has errors.



But PCRE code went to multiple files in later versions :(

http://www.pcre.org/



And I wouldn't even use newLISP(R) if it weren't for the regex support ;)



An area for exploration is PEG.



LPeg - Parsing Expression Grammars For Lua

http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html

"LPeg is a new pattern-matching library for Lua, based on Parsing Expression Grammars (PEGs)."



Lua Wiki: Lpeg Recipes

http://lua-users.org/wiki/LpegRecipes



-- xytroxon
"Many computers can print only capital letters, so we shall not use lowercase letters."

-- Let's Talk Lisp (c) 1976

newdep

#3
Lutz, what if 'match' came back in a more powerful form for strings?



I would really love a generic function on strings in newLISP. I always struggle with the incompatibilities between find / find-all and regex on expressions.



The time I spend finding the right regex is no fun... I would rather spend time on constructing good nested functions and lispy code than on doctoring regexes ;-)



By the way, I did build some newLISP functions that do the work of regex for my personal use, but these aren't as quick as regex, so a C version would in my case be very welcome (like the old match).

kib2

#4
Maybe you can try to implement PEGs in newLISP; there are already some Scheme/CL implementations listed here:



http://pdos.csail.mit.edu/~baford/packrat/

HPW

#5
For regex problems there are helper tools:



http://www.weitz.de/regex-coach/

http://www.regexbuddy.com/

http://www.ultrapico.com/Expresso.htm

http://www.radsoftware.com.au/regexdesigner/

Or as a service:

http://www.nettz.de/Service/regexp/

http://gskinner.com/RegExr/
Hans-Peter

DrDave

#6
Quote from: "Lutz"
We had 'match' for strings in older versions, but it simply wasn't that powerful. The current regex form all the world uses is ugly, but at least it's a standard and highly developed and efficient.

If a different regex pattern syntax gets developed, one of the regex experts, e.g. Philip Hazel, should do it, because the back-end would always be the same highly optimized engine.

I wonder how many programs implementing regex really need the efficiency and could make much more readable code with some less efficient but still acceptable coding. Granted you can condense into a regular expression what would amount to a lot of alternate code, but if you strive for minimizing code over clarity, you aren't going to be coding in any mainstream language anyway.



My dislike of regular expressions is because they are not easy to write, mostly non-intuitive, not easy to debug, and not easy to maintain. These all go against "clarity and correctness".


Quote: "...it is better to first strive for clarity and correctness and to make programs efficient only if really needed."

"Getting Started with Erlang", version 5.6.2

newdep

#7
There are indeed plenty of examples on the net of how to create regexes.

Even funnier is the fact that people build helpers for them; that in itself says enough about regex ;-)

The difficult part of regex is building a safe "repetition", which is something I never get right in the first 10 tries...

In itself regex is easy to learn, but the exceptions make it difficult and cryptic... Finding problems in repetition regexes is a nightmare in itself... brrr...

For me it is clear in that case: I would rather use readable code and find the problem in no time, with slower parsing code, than debug regexes over 3 cups of coffee...

m35

#8
Jeremy Dunn (http://www.alh.net/newlisp/phpbb/profile.php?mode=viewprofile&u=79) started an interesting module to help make regex easier to read in newLISP: http://www.alh.net/newlisp/phpbb/viewtopic.php?t=1243

cormullion

#9
How about porting PEG (http://emacswiki.org/emacs/peg.el) to newLISP... It looks like a natural fit.


;; This file implements a macro `peg-parse' which parses the current
;; buffer according to a PEG.  E.g. we can match integers with a PEG
;; like this:
;;
;;  (peg-parse (number   sign digit (* digit))
;;             (sign     (or "+" "-" ""))
;;             (digit    [0-9]))
;;
;; In contrast to regexps, PEGs allow us to define recursive rules.  A
;; PEG is a list of rules.  A rule is written as (NAME PE ...).
;; E.g. (sign (or "+" "-" "")) is a rule with the name "sign".  The
;; syntax for Parsing Expression (PE) is as follows:
;;
;; Description          Lisp            Haskell, as in Ford's paper
;; Sequence             (and e1 e2)     e1 e2
;; Prioritized Choice   (or e1 e2)      e1 / e2
;; Not-predicate        (not e)         !e
;; And-predicate        (if e)          &e
;; Any character        (any)           .
;; Literal string       "abc"           "abc"
;; Character C          (char c)        'c'
;; Zero-or-more         (* e)           e*
;; One-or-more          (+ e)           e+
;; Optional             (opt e)         e?
;; Character range      (range a b)     [a-b]
;; Character set        [a-b "+*" ?x]   [a-b+*x]  ; note: [] is an elisp vector
;; Character classes    [ascii cntrl]
;; Beginning-of-Buffer  (bob)
;; End-of-Buffer        (eob)
;; Beginning-of-Line    (bol)
;; End-of-Line          (eol)
;; Beginning-of-Word    (bow)
;; End-of-Word          (eow)
;; Beginning-of-Symbol  (bos)
;; End-of-Symbol        (eos)
;; Syntax-Class         (syntax-class NAME)
;;
;; `peg-parse' also supports parsing actions, i.e. Lisp snippets which
;; are executed when a PE matches.  This can be used to construct
;; syntax trees or for similar tasks.  Actions are written as
;;
;;  (action FORM)          ; evaluate FORM
;;  `(VAR... -- FORM...)   ; stack action
;;
;; Actions don't consume input, but are executed at the point of
;; match.  A "stack action" takes VARs from the "value stack" and
;; pushes the result of evaluating FORMs to that stack.  See
;; `peg-ex-parse-int' for an example.
;;
;; Derived Operators:
;;
;; The following operators are implemented as combinations of
;; primitive expressions:
;;
;; (substring E)  ; match E and push the substring for the matched region
;; (region E)     ; match E and push the corresponding start and end positions
;; (replace E RPL); match E and replace the matched region with RPL.
;; (list E)       ; match E and push a list of all items that E produces.
;;
;; Regexp equivalents:
;;
;; Here are some examples for regexps and how those could be written as PE.
;; [Most are taken from rx.el]
;;
;; "^[a-z]*"
;; (and (bol) (* [a-z]))
;;
;; "\n[^ \t]"
;; (and "\n" (not [" \t"]) (any))
;;
;; "\\*\\*\\* EOOH \\*\\*\\*\n"
;; "*** EOOH ***\n"
;;
;; "\\<\\(catch\\|finally\\)\\>[^_]"
;; (and (bow) (or "catch" "finally") (eow) (not "_") (any))
;;
;; "[ \t\n]*:\\([^:]+\\|$\\)"
;; (and (* [" \t\n"]) ":" (or (+ (not ":") (any)) (eol)))
;;
;; "^content-transfer-encoding:\\(\n?[\t ]\\)*quoted-printable\\(\n?[\t ]\\)*"
;; (and (bol)
;;      "content-transfer-encoding:"
;;      (* (opt "\n") ["\t "])
;;      "quoted-printable"
;;      (* (opt "\n") ["\t "]))
;;
;; "\\$[I]d: [^ ]+ \\([^ ]+\\) "
;; (and "$Id: " (+ (not " ") (any)) " " (+ (not " ") (any)) " ")
;;
;; "^;;\\s-*\n\\|^\n"
;; (or (and (bol) ";;" (* (syntax-class whitespace)) "\n")
;;     (and (bol) "\n"))
;;
;; "\\\\\\\\\\[\\w+"
;; (and "\\\\[" (+ (syntax-class word)))
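
To make the integer rule from the header above concrete, here is a rough sketch of the same grammar (number = sign digit (* digit)) built from a few PEG-style combinators. It is in Python rather than Emacs Lisp or newLISP, only as an illustration; every name in it (`lit`, `char_range`, `seq`, `choice`, `many`, `match_number`) is made up for this sketch and is not part of peg.el.

```python
# Each parser takes (text, pos) and returns the position after the match,
# or None on failure. All names are hypothetical, not taken from peg.el.

def lit(s):
    # Literal string: "abc"
    return lambda text, pos: pos + len(s) if text.startswith(s, pos) else None

def char_range(lo, hi):
    # Character range: (range a b)
    def parse(text, pos):
        if pos < len(text) and lo <= text[pos] <= hi:
            return pos + 1
        return None
    return parse

def seq(*parsers):
    # Sequence: (and e1 e2 ...)
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def choice(*parsers):
    # Prioritized choice: (or e1 e2 ...), the first alternative that matches wins
    def parse(text, pos):
        for p in parsers:
            end = p(text, pos)
            if end is not None:
                return end
        return None
    return parse

def many(p):
    # Zero-or-more: (* e), greedy; stops on failure or on an empty match
    def parse(text, pos):
        while True:
            end = p(text, pos)
            if end is None or end == pos:
                return pos
            pos = end
    return parse

sign = choice(lit("+"), lit("-"), lit(""))
digit = char_range("0", "9")
number = seq(sign, digit, many(digit))

def match_number(text):
    """Return the integer prefix of `text`, or None if there is none."""
    end = number(text, 0)
    return None if end is None else text[:end]

print(match_number("-42abc"))
```

Since each rule is an ordinary function, a rule could also refer to itself through a named function, which is where the recursion mentioned in the header comes from.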


I haven't tried converting emacs-lisp to newLISP yet. Is it hard? :)

Jeremy Dunn

#10
Corm,



I like what you posted because it is along the lines of what I was trying to do only developed further. There was also this approach that was posted on the Arc forum that deserves looking at too.



http://www.lisperati.com/arc/regex.html



One of the criticisms that I have of what you posted is that I don't like the idea of using similar function names for operations that are not the same. When looking at code casually you should always know that something is going on. If I glance at AND and OR in the above code I cannot immediately see that we are actually doing something different. Far better to have something like RGX-AND and RGX-OR or perhaps $AND and $OR. Similarly * and + will be seen as being multiplication and addition. I understand the desire to stay with standard symbols but a function name should be a name that is distinct from other names. If we try to carry over too much that is old then that belies our belief that it needs to be changed.
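
As a rough sketch of the naming idea, here are distinctly named builder functions that only assemble a conventional regex string, so the usual engine still does the matching. This is Python for illustration only; `rgx_lit`, `rgx_and`, `rgx_or`, `rgx_star` and `rgx_plus` are hypothetical names, not functions from my module or from any library.

```python
import re

def rgx_lit(s):
    # Literal text, with regex metacharacters escaped
    return re.escape(s)

def rgx_and(*parts):
    # Sequence: each part must match in order
    return "".join(parts)

def rgx_or(*parts):
    # Alternation, wrapped in a non-capturing group
    return "(?:" + "|".join(parts) + ")"

def rgx_star(part):
    # Zero or more repetitions
    return "(?:" + part + ")*"

def rgx_plus(part):
    # One or more repetitions
    return "(?:" + part + ")+"

DIGIT = "[0-9]"

# Optional sign followed by at least one digit
number = rgx_and(rgx_or(rgx_lit("+"), rgx_lit("-"), ""), rgx_plus(DIGIT))

print(bool(re.fullmatch(number, "-42")))
```

With the RGX prefix there is no way to mistake these for the logical AND/OR or for arithmetic, which is the whole point of the criticism.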