regex help

Started by joejoe, April 19, 2012, 09:27:36 PM


joejoe

I've tried for the last three hours to get this one. :0)



I *am* able to get a regex to work with this:


find-all {<title>([^.]+)</title>}

on the XML below. But I have tried this (and about 60 other variations) without success:


find-all {<item>([^.]+)</item>}

to get the entire text between the <item> and </item> tags.



Is there not a simple way to say, look for the <item> tag and then get everything until you see the </item> tag?



Any suggestion would really be appreciated. Thanks very much! :0)


<item><title>Very Cute Solar Powered Plant Decoration Flower Random Color </title><description><![CDATA[<table border='0' cellpadding='8'> <tr><td> <a href= 'http://rover.ebay.com/rover/1/711-53200-19255-0/1?ff3=2&toolid=10039&campid=5337070286&item=120898233743&vectorid=229466&lgeo=1' target='_blank'><img src='http://thumbs4.ebaystatic.com/pict/1208982337434040_1.jpg' border='0'/></a></td><td><strong>$1.99</strong><br>End Date: Friday May-18-2012 18:43:55 PDT<br>Buy It Now for only: $1.99<br><a href='http://rover.ebay.com/rover/1/711-53200-19255-0/1?ff3=2&toolid=10039&campid=5337070286&item=120898233743&vectorid=229466&lgeo=1' target='_blank'>Buy It Now</a> | <a href='http://rover.ebay.com/rover/1/711-53200-19255-0/1?ff3=4&toolid=10039&campid=5337070286&vectorid=229466&lgeo=1&mpre=http%3A%2F%2Fcgi1.ebay.com%2Fws%2FeBayISAPI.dll%3FMfcISAPICommand%3DMakeTrack%26item%3D120898233743%26ssPageName%3DRSS%3AB%3ASRCH%3AUS%3A104' target='_blank'>Add to watch list</a></td></tr> </table>]]></description><pubDate>18 Apr 2012 18:53:18 GMT-07:00</pubDate><guid>120898233743</guid><link>http://rover.ebay.com/rover/1/711-53200-19255-0/1?ff3=2&toolid=10039&campid=5337070286&item=120898233743&vectorid=229466&lgeo=1</link><e:BidCount></e:BidCount><e:CurrentPrice>1.99</e:CurrentPrice><e:ListingType>FixedPrice</e:ListingType><e:BuyItNowPrice></e:BuyItNowPrice><e:ListingEndTime>2012-05-19T01:43:55.000Z</e:ListingEndTime><e:ListOrder>120898233743</e:ListOrder><e:PaymentMethod>PayPal</e:PaymentMethod></item>

m i c h a e l

#1
Hi joejoe!



Will this work for you?


find-all {<item>(.+?)</item>}


Hope this helps.
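
For anyone wondering why the first attempt failed: [^.]+ means "one or more characters that are not a literal period", so it can never get past the first . it meets. The title in the sample has no period in it, but the item body is full of them (every URL), so the match never reaches </item>. By contrast, .+? means "any characters, as few as possible", which stops at the first closing tag. Here is a small sketch of the difference; the sample string is made up for illustration and only stands in for the real feed:

```newlisp
;; sample is a hypothetical, shortened stand-in for the feed text
(set 'sample (append
  "<item><title>A b</title><link>http://x.com/1</link></item>"
  "<item><title>C d</title><link>http://x.com/2</link></item>"))

;; [^.]+ cannot cross the periods in the URLs, so no item is matched
(println (find-all {<item>([^.]+)</item>} sample))

;; .+? is non-greedy: each match ends at the first </item> it can reach
(println (find-all {<item>(.+?)</item>} sample))

;; passing $1 as a third argument collects only the captured text,
;; without the surrounding <item> tags
(println (find-all {<item>(.+?)</item>} sample $1))

;; by default . does not match a newline; regex option 4 (PCRE_DOTALL)
;; lets it, which matters if an item is spread over several lines
(println (find-all {<item>(.+?)</item>} "<item>a\nb</item>" $1 4))
```

Whether option 4 is needed depends on whether the real feed has line breaks inside an item; the sample posted above is all on one line, so the plain pattern is enough there.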



m i c h a e l

joejoe

#2
m i c h a e l -



It sure does! I gotta remember KISS when I tackle regexes. :0)



thanks big friend!



joejoe

joejoe

#3
One further request on this, if I may:



I have a list which contains a lot of texts like the large one above, all similar.



What is my best approach on extracting the same item information out of each item in the list, but keeping it associated with the same item?



I want to create an xpath file, so that in the end I have something similar to this:


;; <eb>
;;   <ebitem>
;;     <ebtitle>ebtitle</ebtitle>
;;     <ebprice>ebprice</ebprice>
;;     <eblink>eblink</eblink>
;;     <ebimage>ebimage</ebimage>
;;   </ebitem>
;;   <ebitem>
;;     <ebtitle>ebtitle</ebtitle>
;;     <ebprice>ebprice</ebprice>
;;     <eblink>eblink</eblink>
;;     <ebimage>ebimage</ebimage>
;;   </ebitem>
;; </eb>


Do I do a dolist and, within that dolist, find-all the desired info and push the matches into my template?



Or is it easier to replace the unnecessary pieces of each <item> and insert the xpath tags around what is left?



I guess what I am not sure how to do is operate on the items in a list: perform several operations on one item, then move to the next item and do the same there.



What would be the suggested approach to get the data out of <item> and into the xpath template?



I am new to newLISP (nL) and to programming, and would just like some direction so I can carry out the work.



Thanks kindly again.

cormullion

#4
Perhaps something like this:


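;; html-text is assumed to hold the whole feed as one string;
;; the outer find-all returns each <item>...</item> block in turn
(dolist (item (find-all {<item>(.+?)</item>} html-text))
    ;; the third argument is evaluated for every match, so each
    ;; captured group ($1) is saved into a variable as a side effect
    (find-all {<title>(.+?)</title>} item (set 'title $1))
    (find-all {<e:CurrentPrice>(.+?)</e:CurrentPrice>} item (set 'price $1))
    (find-all {<link>(.+?)</link>} item (set 'link $1))
    ;; \. matches a literal period before jpg
    (find-all {<img src='(.+?\.jpg)'} item (set 'image $1))
    ;; fill the four values into the template and print it
    (println
      (format {
        <ebitem>
          <ebtitle>%s</ebtitle>
          <ebprice>%s</ebprice>
          <eblink>%s</eblink>
          <ebimage>%s</ebimage>
        </ebitem>}
        title price link image)))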


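although this will fail if the information is not found by the regexen: a variable is only set when its pattern matches, so otherwise it keeps whatever value it had before...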
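
One possible way around the not-found case, as a rough sketch rather than the definitive fix: look each field up with regex, which returns nil when nothing matches, and fall back to an empty string. The helper name first-match and the tiny html-text sample are made up for illustration (the second item has no price or image on purpose); the patterns are the ones from the loop above.

```newlisp
;; hypothetical sample feed: the second item lacks a price and an image
(set 'html-text (append
  "<item><title>Flower</title><link>http://x.com/1</link>"
  "<img src='http://x.com/1.jpg' border='0'/>"
  "<e:CurrentPrice>1.99</e:CurrentPrice></item>"
  "<item><title>No price</title><link>http://x.com/2</link></item>"))

;; first-match is a hypothetical helper: it returns the first captured
;; group of pattern in text, or "" when the pattern does not match
(define (first-match pattern text)
  (if (regex pattern text) $1 ""))

(println "<eb>")
(dolist (item (find-all {<item>(.+?)</item>} html-text))
  ;; every field is looked up fresh for this item, so a missing field
  ;; prints as empty instead of reusing the previous item's value
  (println
    (format {
  <ebitem>
    <ebtitle>%s</ebtitle>
    <ebprice>%s</ebprice>
    <eblink>%s</eblink>
    <ebimage>%s</ebimage>
  </ebitem>}
      (first-match {<title>(.+?)</title>} item)
      (first-match {<e:CurrentPrice>(.+?)</e:CurrentPrice>} item)
      (first-match {<link>(.+?)</link>} item)
      (first-match {<img src='(.+?\.jpg)'} item))))
(println "</eb>")
```

This also wraps the output in the <eb> element from the template in post #3. An empty element for a missing field is only one choice; skipping the item or printing a placeholder would work just as well.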

joejoe

#5
cormullion,



To the rescue, in character, thank you!



I see now how to get data out of a dolist loop and similarly now know how to wield the incredibly useful format function.



Many big thanks for that!