scrape url and replace to make new list

Started by joejoe, November 07, 2011, 08:27:06 PM


joejoe

Hi - I am using Cormullion's beautiful Intro to nL example:



http://en.wikibooks.org/wiki/Introduction_to_newLISP/The_Internet#Accessing_web_pages



He shows:


(set 'the-source (get-url "http://www.apple.com"))
(replace {src="(http\S*?jpg)"} the-source (push $1 images-list -1) 0)
(println images-list)


and so I am trying to modify it to pull different info from a different web page:


(set 'the-source (get-url "http://nukene.ws/headlines"))
(replace {<h2>\S*?</h2>} the-source (push $1 images-list -1) 0)
(println images-list)


(Essentially I am after the headline article titles between the <h2>...</h2> tags.)



I am getting


nil

as an answer, and I'm suspecting I have an incorrect regex.



I've tried various versions of the above using the \ in front of various characters, without success.



Would someone be so kind as to point out my shortcoming? Much appreciated and thank you a lot! :0)

joejoe

#1
Tried this too:


(set 'the-source (get-url "http://nukene.ws/headlines"))
(replace {(<h2>\S*?</h2>)} the-source (push $1 images-list -1) 0)
(println images-list)

cormullion

#2
Ah yes, the Joy of Regex...



I think your problem is that there are no headlines without whitespace on that page. The original code you copied is looking for URLs, which can't contain whitespace, so \S* will find them. However, the text between the h2 tags always contains spaces (unless there was a one-word headline, such as "Boom"). Hence no match, because \S matches non-whitespace only.
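
For example, something like this at the console shows the difference (just a sketch with a made-up headline, not the real page):


(set 'sample "<h2>Boom today</h2>")
; \S can only match non-whitespace, so the space in the headline breaks the match
(regex {<h2>(\S*?)</h2>} sample)   ;-> nil
; . matches any character except newline, so the whole headline is captured
(regex {<h2>(.+?)</h2>} sample)    ;-> match found; $1 now holds "Boom today"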



You've also noticed that the parentheses inside regex patterns correspond to the $1, $2, ... variables you use in the action expression. Without the parentheses, $1 doesn't refer to the results of the regex search.
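
Here's a tiny sketch of that (made-up strings, but the same shape as your code):


(set 'text "<h2>First</h2> and <h2>Second</h2>")
(set 'titles '())
; the parentheses around .*? make it group 1, so $1 in the action expression
; is just the text between the tags ($0 would be the whole <h2>...</h2> match)
(replace {<h2>(.*?)</h2>} text (push $1 titles -1) 0)
(println titles)   ;-> ("First" "Second")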



And you need to watch out for backslashes in regex patterns, too. They have to be doubled if you use double quotes...
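
A quick way to convince yourself - both of these print the same pattern:


(println {src="(http\S*?jpg)"})       ; braces: backslash stays as typed
(println "src=\"(http\\S*?jpg)\"")    ; double quotes: backslash doubled, quotes escaped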



If you want to experiment with regex in newLISP, take a look at grepper.lsp, an interactive regex tester tuned for newLISP... It's somewhere on http://github.com/cormullion/newlisp-projects.

joejoe

#3
cormullion, thanks!



With your notes, I managed to get at what I was after with this:


(set 'the-source (get-url "http://nukene.ws/headlines"))
(replace {<h2>(.+)</h2>} the-source (push $1 images-list -1) 0)
(println images-list)


I'm getting strange characters that I guess are complex text characters that don't render in the shell:


"11/8/2011 Tokyo Starts Burning Radioactive Waste from Other Areas … Tokyo Go                                                               vernor Tells Residents to “Shut Up” and Stop Complaining About It"
 "Japan, France consider nuclear power costs" "Japan Times: People “fed up wit                                                               h the shroud of secrecy” in Fukushima — Starting to smuggle in journalists â                                                               €” Must rely on media for help"


Also I see a lot of double box characters next to the â characters, as well as a lot of ' characters.



Am I correct that I would need another regex to process these correctly, or at least transform them into normal quotation marks? I am going to be using these titles to post to a web form.



Thanks again for the regex pointers, 'mullion! ;0)

cormullion

#4
Unicode characters processed by newLISP vary in appearance depending on how you output them - the newLISP manual has more, under Unicode. To over-simplify, console output shows them escaped, but printed output shows them converted. For example, compare:


Geiger-Muller \206\178+\206\179 G-M SI-3 BG TUBE COUNTER 10pcs new

Geiger-Muller β+γ G-M SI-3 BG TUBE COUNTER 10pcs new


Of course, your system should be UTF-8 and you should use fonts that have Unicode characters. The web page you're looking at is UTF-8 encoded.



As for the ampersands, these are HTML encodings for characters outside the restricted ASCII set, and would be converted using any standard HTML-encode/decode function. I think there are a few knocking about...
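
Something like this would do for a handful of the common ones (a minimal sketch - decode-entities is a made-up helper, not a built-in; extend the table as needed):


(define (decode-entities str)
  ; swap each entity for a plain ASCII stand-in; the string form of
  ; replace changes every occurrence in str
  (dolist (pair '(("&amp;" "&") ("&quot;" "\"") ("&#039;" "'")
                  ("&#8216;" "'") ("&#8217;" "'")
                  ("&#8220;" "\"") ("&#8221;" "\"")))
    (replace (first pair) str (last pair)))
  str)

(println (decode-entities "Residents told to &#8220;Shut Up&#8221; &amp; stop complaining"))
;-> Residents told to "Shut Up" & stop complaining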

joejoe

#5
Thanks again Cormullion!



I appreciate the guidance, always.