get-url -> ERR: HTTP document empty

Started by Darth.Severus, March 26, 2013, 03:52:39 PM


Darth.Severus

I'm using some code in a program to get the links of a website, including its subpages. But it's not working; I often get "ERR: HTTP document empty". I put an until loop into the code so that it retries several times, always after a few minutes. My first thought was that I'm blocked by the server, but that doesn't seem to be the case: if I open a newLISP shell and write (get-url url), I get the site, while I still have the same IP.



(define (pick-links url)
  (setq page (get-url url))
  (println (1 20 page))                  ; testing
  (write-file "page" page)               ; also testing
  (until (not (starts-with page "ERR: HTTP document empty"))
    (and (sleep 600000) (setq page (get-url url))))
  (setq linklist (join (find-all "<a href=([^>]+)>([^>]*)</a>" page) "<br>\n"))
  (setq linklist (replace {"} linklist "`"))
  (setq parsedlist (parse linklist "\n"))
  (setq page nil))
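
For comparison, here is a minimal sketch of the same retry idea with a per-request timeout and a capped number of attempts, so the function gives up instead of looping forever. The 10-second timeout, the 60-second pause, the retry count and the name fetch-page are all just assumptions for illustration, not part of the original code:

(define (fetch-page url (retries 5))
  (let (page (get-url url 10000))        ; 10-second timeout per request
    (while (and (starts-with page "ERR:") (> retries 0))
      (sleep 60000)                      ; wait a minute before retrying
      (dec retries)
      (setq page (get-url url 10000)))
    (if (starts-with page "ERR:") nil page)))  ; nil when every attempt failed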

cormullion

#1
It worked on the first 6 sites I tried it on. (I removed the 'write-file' line.) Perhaps it's site-specific...?

Darth.Severus

#2
Ahhh, my usual error: overthinking it instead of trying the obvious. I always tested it with the same website... -> facepalm.



However, it works now. I'm using the dump option of the w3m browser to save the sites to my disk before I process them.
(eval-string (string {(exec "w3m } url { -dump_source -T text/html > temppage.html")}))
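
For reference, the same call can be made without eval-string by building the command string directly; this is only a sketch, assuming url contains nothing that would need shell quoting, and it reuses the same temppage.html file:

(exec (string "w3m " url " -dump_source -T text/html > temppage.html"))
(setq page (read-file "temppage.html"))  ; read the dumped source back in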

I think the problem was a security measure on the website; maybe it blocks non-browser clients when they try to fetch more than a couple of pages.
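
If that is the cause, another thing worth trying is sending a custom request header with get-url, so the request looks more like a browser. This is just a sketch; the timeout value and the User-Agent string below are only examples:

(setq page (get-url url 10000 "User-Agent: Mozilla/5.0 (X11; Linux x86_64)\r\n"))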



Thanks, anyway.