newLISP Fan Club

Forum => newLISP in the real world => Topic started by: joejoe on November 07, 2011, 08:27:06 PM

Title: scrape url and replace to make new list
Post by: joejoe on November 07, 2011, 08:27:06 PM
Hi - I am using Cormullion's beautiful Intro to nL example:



http://en.wikibooks.org/wiki/Introduction_to_newLISP/The_Internet#Accessing_web_pages



He shows:


(set 'the-source (get-url "http://www.apple.com"))
(replace {src="(http\S*?jpg)"} the-source (push $1 images-list -1) 0)
(println images-list)


and so I am trying to modify it to pull different info from a different web page:


(set 'the-source (get-url "http://nukene.ws/headlines"))
(replace {<h2>\S*?</h2>} the-source (push $1 images-list -1) 0)
(println images-list)


(Essentially I am after the headline article titles between the <h2>...</h2> tags.)



I am getting


nil

as an answer, and I'm suspecting I have an incorrect regex.



I've tried various versions of the above using the \ in front of various characters without success.



Would one be so kind as to point out my shortcoming? Much appreciated and thank you a lot! :0)
Title: Re: scrape url and replace to make new list
Post by: joejoe on November 07, 2011, 08:31:49 PM
Tried this too:


(set 'the-source (get-url "http://nukene.ws/headlines"))
(replace {(<h2>\S*?</h2>)} the-source (push $1 images-list -1) 0)
(println images-list)
Title: Re: scrape url and replace to make new list
Post by: cormullion on November 08, 2011, 02:28:40 AM
Ah yes, the Joy of Regex...



I think your problem is that there are no headlines without whitespace on that page. The original code you copied is looking for URLs, which can't have whitespace. So \S* will find URLs. However, the text between the h2 tags always contains spaces (unless there was a one-word headline, such as "Boom"). Hence no match, because \S is looking for non-whitespace only.
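
To illustrate the point about \S, here is a minimal sketch. The sample string is a made-up stand-in for the fetched page, and the list names are hypothetical; only the two patterns differ.

```newlisp
; Hypothetical sample text standing in for the fetched page.
(set 'sample {<h2>Boom</h2> <h2>Two words</h2>})

; \S*? matches non-whitespace only, so just the one-word headline is captured.
(set 'one-word '())
(replace {<h2>(\S*?)</h2>} (copy sample) (push $1 one-word -1) 0)
(println one-word)   ; ("Boom")

; .*? matches any character except a newline, so both headlines are captured.
(set 'all-words '())
(replace {<h2>(.*?)</h2>} (copy sample) (push $1 all-words -1) 0)
(println all-words)  ; ("Boom" "Two words")
```

The non-greedy .*? stops at the first closing tag, which matters when several headlines share one line.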



You've also noticed that the parentheses inside regex patterns correspond to the $1.. tags you use in the action expressions. Without those, $1 doesn't refer to the results of the regex search.



And you need to watch out for backslashes in regex patterns, too. They have to be doubled if you use double quotes rather than braces...
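
A small sketch of the backslash point, using made-up symbol names: the same pattern is written once in braces and once in double quotes.

```newlisp
; In braces the backslash is taken literally.
(set 'braced {\S+})
; In double quotes the backslash must be doubled to survive string parsing.
(set 'quoted "\\S+")

; Both strings hold the same three characters: backslash, S, plus.
(println (= braced quoted))   ; true
(println (length quoted))     ; 3
```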



If you want to experiment with regex in newLISP, take a look at grepper.lsp, an interactive regex tester tuned for newLISP... It's somewhere on http://github.com/cormullion/newlisp-projects.
Title: Re: scrape url and replace to make new list
Post by: joejoe on November 08, 2011, 12:53:59 PM
cormullion, thanks!



With your notes, I managed to get at what I was after with this:


(set 'the-source (get-url "http://nukene.ws/headlines"))
(replace {<h2>(.+)</h2>} the-source (push $1 images-list -1) 0)
(println images-list)


I'm getting strange characters that I guess are complex text characters that don't render on the shell:


"11/8/2011 Tokyo Starts Burning Radioactive Waste from Other Areas … Tokyo Governor Tells Residents to “Shut Up” and Stop Complaining About It"
 "Japan, France consider nuclear power costs" "Japan Times: People “fed up with the shroud of secrecy” in Fukushima — Starting to smuggle in journalists â€” Must rely on media for help"


Also I see a lot of double box characters next to the â characters, as well as ' a lot.



Am I correct that I would need another regex to process these correctly, or at least transform them into normal quotation marks? I am going to be using these titles to post to a web form.



Thanks again for the regex pointers, 'mullion! ;0)
Title: Re: scrape url and replace to make new list
Post by: cormullion on November 08, 2011, 02:11:57 PM
Unicode characters processed by newLISP vary in appearance depending on how you output them - the newLISP manual has more, under Unicode. To over-simplify, console output shows them escaped, but printed output shows them converted. For example, compare:


Geiger-Muller \206\178+\206\179 G-M SI-3 BG TUBE COUNTER 10pcs new

Geiger-Muller β+γ G-M SI-3 BG TUBE COUNTER 10pcs new


Of course, your system should be UTF-8 and you should use fonts that have Unicode characters. The web page you're looking at is utf-8 encoded.
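
As a rough sketch of the escaped form versus the printed form: the string below spells out the UTF-8 bytes of β and γ using newLISP's decimal \nnn escapes. The symbol name is made up, and the printed glyphs assume a UTF-8 terminal.

```newlisp
; \206\178 and \206\179 are the UTF-8 byte pairs for the Greek letters beta and gamma.
(set 'label "\206\178+\206\179")

; println writes the raw bytes; a UTF-8 terminal renders them as the Greek letters.
(println label)

; length counts bytes, not characters: two bytes per letter plus the "+".
(println (length label))   ; 5
```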



As for the ampersands, these are HTML encodings for characters outside the restricted ASCII set, and would be converted using any standard HTML-encode/decode function. I think there's a few knocking about...
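
For the HTML encodings, a minimal sketch of a decoder follows. decode-entities is a hypothetical helper, not a newLISP built-in; it covers only decimal numeric entities and four named ones. Code points above 127 come out as proper characters only on a UTF-8 build of newLISP.

```newlisp
; Hypothetical helper (not part of newLISP): decodes a few common HTML entities.
(define (decode-entities text)
  ; Decimal numeric entities such as &#039; (base 10 is forced so a
  ; leading zero is not read as octal).
  (replace {&#(\d+);} text (char (int $1 0 10)) 0)
  (replace "&quot;" text {"})
  (replace "&lt;" text "<")
  (replace "&gt;" text ">")
  ; Decode &amp; last so that "&amp;lt;" does not end up as "<".
  (replace "&amp;" text "&")
  text)

(println (decode-entities "Tom &amp; Jerry&#039;s &quot;show&quot;"))
; Tom & Jerry's "show"
```

Strings are passed to the function by value, so the caller's original text is left unchanged.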
Title: Re: scrape url and replace to make new list
Post by: joejoe on November 08, 2011, 03:37:35 PM
Thanks again Cormullion!



I appreciate the guidance, always.