Messages - ino-news

#1
Anything else we might add? / April 02, 2007, 01:33:44 PM
Quote from: "ino-news"
there are a number of leads i looked up in the 'net.  once i'm finished with the parsing, i need to cook up the "multipart/form-data" content-type for the PUT method.


(i meant "POST" there.)



there are two promising lisp html-parsers out there:



http://www.neilvandyke.org/htmlprag/

http://www.stklos.org/Doc/extensions/htmlprag/htmlprag.html

(which is a stklos port of the former).  htmlprag is a bunch of functions: a tokenizer generates a function which delivers the next tag on successive invocations, and there is a structure parser and an HTML emitter.  the structure comes out as:



(html->shtml
 "<html><head><title></title><title>whatever</title></head><body>
<a href=\"url\">link</a><p align=center><ul compact style=\"aa\">
<p>BLah<!-- comment <comment> --> <i> italic <b> bold <tt> ened</i>
still < bold </b></body><P> But not done yet...")
=>
(*TOP* (html (head (title) (title "whatever"))
             (body "\n"
                   (a (@ (href "url")) "link")
                   (p (@ (align "center"))
                      (ul (@ (compact) (style "aa")) "\n"))
                   (p "BLah"
                      (*COMMENT* " comment <comment> ")
                      " "
                      (i " italic " (b " bold " (tt " ened")))
                      "\n"
                      "still < bold "))
             (p " But not done yet...")))


then there's a file named "html4each.scm" in the famous slib: http://www.google.com/search?q=html4each .  if your system allows installing the slib, look at this file.  it scans HTML, calling a user-supplied procedure with a tag-string as its sole argument.



since the guile scheme interpreter is quite user-friendly and the slib was present as well, i did my program in guile instead of newlisp.  it took about three days until it was alpha ready, just like the newlisp program before.  seems to be standard ...  --clemens
#2
Anything else we might add? / March 28, 2007, 01:59:17 PM
Quote from: "ino-news"
Quote from: "Dmi"
Btw, about awk: U may redefine FS to something unusual and regexps will works through "\n".  Also U may remember the rest of "$0" of previous iteration and concatenate it in the current one...


this one i don't understand, you'd have to explain from "remember the rest of $0 ..." onwards.


sorry, i had overlooked that you were talking about awk.  now it makes sense to me.



--clemens
#3
Anything else we might add? / March 28, 2007, 12:10:24 PM
Quote from: "Dmi"
In fact the current algo isn't good to differ where the current closing angle brace ">" closes "<input>", "<textarea>" or another tag.


i thought so.  the problem with your current approach lies in the puny little ">" which closes tags like "<input ...>", because a single ">" closes any tag anywhere.  i was thinking about making your code more selective.  for every tag recognized there is a list of tags it closes, and i could add specific tags after "tag-open" to tell it which tags to look for that are definitely going to close it.  anyway, your code defines a tag stack, which can be useful to attack the structure.


Quote from: "Dmi"
U have several ways here:



1. Preparse an html contents to transform all "<input attrs> into <input> attrs </input>" and so on. (replace will help U).



and use rules like


(input "<input> " tag-open ())
(input/ "</input> " tag-close (input))


etc.


that's a good idea!  i had already thought of preprocessing the input to make it more legible.
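
just to check that i read it right, the transform would be something like this (untested, and the pattern is only a rough cut):

;; untested sketch: rewrite "<input attrs>" as "<input> attrs </input>" so the
;; existing open/close rules can treat the attribute string as ordinary text
(define (explode-input-tags html)
  (replace "<input([^>]*)>" html (string "<input> " $1 " </input>") 1))  ; 1 = PCRE_CASELESS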


Quote from: "Dmi"
2. Preparse an html contents to transform all "<[^ ]* [^ ]*>" into "<\1> <attrs> \2 </attrs>"



and use one universal rule:


(attrs "<attrs> " tag-open ())
(attrs/ "</attrs> " tag-close (input))


And U'll have _all_ tag attributes wrapped as (attrs attribute-string)


another good one, which i could combine with the previous suggestion.
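
again, just to check my reading, roughly this (untested):

;; untested sketch: wrap whatever follows a tag name in an <attrs> ... </attrs> pair
(define (wrap-attrs html)
  (replace "<([a-zA-Z]+) ([^>]*)>" html
           (string "<" $1 "> <attrs> " $2 " </attrs>") 0))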


Quote from: "Dmi"
Btw, about awk: U may redefine FS to something unusual and regexps will works through "\n".  Also U may remember the rest of "$0" of previous iteration and concatenate it in the current one...


this one i don't understand, you'd have to explain from "remember the rest of $0 ..." onwards.



currently, i'm playing with the following idea:



since there are only a few tags needed for the task, i can preprocess the input to delete all the tags that don't have to be interpreted.  in that same step i'd make the input uniform to "(parse...)" by inserting spaces or other markers.  then, only "<input...>" and "<textarea>" tags will be left.  they vary in the number and type of tags used.  then i'd run a table like the one from your code against them, but use "(match...)" to find and extract the information i need.



since i don't control the markup on the page and it can change anytime, i need to be flexible, but only on a small number of items.
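
to make that concrete, the extraction step could look roughly like this (untested; i've sketched it with find-all and regex rather than match for now, and "page" is just a placeholder for the downloaded html):

;; untested sketch of the plan: keep only the interesting tags, then pull
;; name= and value= out of each one; "page" is a placeholder variable
(define (interesting-tags page)
  (find-all "<(input|textarea)[^>]*>" page $0 1))          ; 1 = PCRE_CASELESS

(define (name-value tag)
  (list (if (regex {name="([^"]*)"} tag 1) ($ 1))
        (if (regex {value="([^"]*)"} tag 1) ($ 1))))

;(map name-value (interesting-tags page))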



there are a number of leads i looked up in the 'net.  once i'm finished with the parsing, i need to cook up the "multipart/form-data" content-type for the PUT method.



thanks for your suggestions,  --clemens
#4
Anything else we might add? / March 27, 2007, 08:18:35 AM
Quote from: "ino-news"
does the pcre-lib built into newlisp allow pattern matching across lines?


also, in addition to just recognizing the tags, i see no immediate possibility to extract their attributes, which would eliminate another step for my program.  Dmi, do you see the possibility to use parenthesized groups in the regular expressions of the tags table to have this info ready in the output?  --clemens
#5
Anything else we might add? / March 27, 2007, 07:53:12 AM
Quote from: "Dmi"
Try http://en.feautec.pp.ru/store/libs/tags.lsp



This isn't an xml instrument, but was mainly written for parsing a complex html.  There is no separate documentation, but an example at the end of the module.  Pay separate attention to "parse-tags" and "structure-tags" functions.


that's what i had in mind.  lutz:  using "tidy" is not a bad idea, but i was thinking more along the lines of Dmi's solution.  he uses regular expressions in his recognizers, all well kept together in a table at the beginning of his code.



Dmi:  your code returns a flat "tag-soup".  it recognizes the (few) tags it has recognizers for and leaves the rest as strings.  not bad, but not what i need.  actually, i want code to fill in a form.  for this, i could ignore (leave as strings to be discarded after a cursory check) everything _except_



"<input type=(checkbox|hidden|submit) (name|value) ..." and

"<textarea name=...>"



your central piece of code is:

; tags: ((tag open? (close-tag close-tag ...) (tag ...))
(define (parse-tags tags str)
  (let (res (list str))
    (dolist (t tags)
      (set 'res (map (fn (x)
       (if (string? x)
         (letn (lst (parse x (t 1) 0)
          len (length lst))
           (if (< len 1) (set 'len 1))
           (dotimes (l (lesser len))
             (push (t 0) lst (greater (* 2 l))))
           (filter (fn (x) (or (not (string? x))
             (not (empty? x))))  lst))
         x))
         (flat res))))
    (flat res)))


what happens if i leave out the "(flat res)" in "(parse-tags)"? do you say "(list str)" only to have a list for "map" to work on?



also, my current version is an awk-script.  it does its job well, but frequently name, value pairs in <input...> tags span multiple lines, so i have to set flags stating "collect-in-progress", because awk doesn't handle regular expressions spanning multiple lines.  now i could do away with all that if i can get the proper regular expression into the recognizer item.  example:



<input type="checkbox" name="send3" checked>
<input type="hidden" name="type3" value="www">
<input type="hidden" name="master3" value="abuse-gd@china-netcom.com">
<input type="hidden" name="info3"
 value="http%3A%2F%2Fdestine.descabc.com%2F%3Fcharitable%2F">To: abuse-gd@china-netcom.com



does the pcre-lib built into newlisp allow pattern matching across lines?  --clemens
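
ps: after a quick look at the manual i suspect the answer is yes -- option 4 should be PCRE_DOTALL, and a negated character class like [^>] crosses line breaks anyway, so something like this (untested) ought to collect the name/value pairs even when a tag spans several lines:

;; untested: [^>]* happily runs across line breaks, and option 4 (PCRE_DOTALL)
;; would let "." do the same; "page" stands for the downloaded html
(find-all {<input[^>]*name="([^"]*)"[^>]*value="([^"]*)"[^>]*>}
          page (list $1 $2) 4)
;; hoped-for result: (("type3" "www") ("master3" "abuse-gd@china-netcom.com") ...)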
#6
i would like to xml-parse the type of HTML pages one finds "in the wild".  some tags are used in HTML which lead to non well-formed XML, most notably <p> or <br>.  i could replace them with something like an XML NOOP or a simple space before running xml-parse.
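
roughly this kind of thing, completely untested (the tag list is just a guess at the usual offenders):

;; untested: blank out the usual offenders before handing the page to xml-parse;
;; "page" stands for the raw html string
(define (html->xmlish page)
  (replace {</?(br|hr|p)\b[^>]*>} page " " 1))   ; 1 = PCRE_CASELESS

;(xml-parse (html->xmlish page))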



is there a recipe you could recommend to turn "normal" HTML pages into something xml-parsable?  --clemens
#7
the manual states that i can use post-url with any valid content-type, but the example just shows unencoded get-type query strings, where tokens are separated by `&' (ampersands) and each token is a name, value pair separated by `=' (equals sign).  now i need to automatically fill a form demanding content-type "form-data", which is different, more like multipart MIME emails.  example:



Content-Type: multipart/form-data; boundary=12345
Content-Length: 1525

--12345
Content-Disposition: form-data; name="action"

flexsend
--12345
Content-Disposition: form-data; name="spamid"

1261694132
--12345
Content-Disposition: form-data; name="crc"

2cf33919657cb46919b903201d8999cb
--12345
Content-Disposition: form-data; name="date"

25 Mar 2007 09:53:52 -0000
--12345

and so forth.  how do i tell post-url to encode this format?  --clemens
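
ps: in case it helps to see what i'm after -- this is roughly what i expect to end up writing by hand if post-url won't encode it for me (untested; passing the content-type as the third argument is just my reading of the manual, and the url is a placeholder):

;; untested sketch: build the multipart body manually and pass the matching
;; content-type (including the boundary) as post-url's third argument
(define (form-data-body boundary fields)
  (let (parts (map (fn (f)
                     (string "--" boundary "\r\n"
                             {Content-Disposition: form-data; name="} (f 0) {"}
                             "\r\n\r\n" (f 1) "\r\n"))
                   fields))
    (string (join parts) "--" boundary "--\r\n")))

;(post-url "http://www.example.com/form"        ; placeholder url
;          (form-data-body "12345" '(("action" "flexsend") ("spamid" "1261694132")))
;          "multipart/form-data; boundary=12345")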
#8
newLISP newS / March 25, 2007, 06:49:12 AM
Quote from: "Lutz"
http://newlisp.org/downloads/development/


there's one thing i'd like to have for a future release: local-domain aka unix sockets.  this would allow me to do "net-select" on both INET and local filesystem sockets.  then i'd do away with net-select'ing on the INET sockets with a timeout, checking for available descriptors and handling fifos somewhere in between.  net-select would give me a nice, clean interface.



-- clemens
#9
Quote from: "ino-news"
all sorts of other IPs of varying string length, as if there was a limit built-in to newlisp stating that IPs like "221.208.208.101" cannot happen.  could someone please check newlisp's C-source or verify this?


i think i found the bug:  look at "newlisp-9.1.0/nl-sock.c".  there are three occurrences of the idiom:



snprintf(IPaddress, 15, "%d.%d.%d.%d", ipNo[0], ipNo[1], ipNo[2], ipNo[3]);


the man page states:

"The snprintf() and vsnprintf() functions will write at most size-1 of the characters printed into the output string (the size'th character then gets the terminating `\0'); if the return value is greater than or equal to the size argument, the string was too short and some of the printed characters were discarded.  The output is always null-terminated."



so the buffer receiving "IPaddress" must be at least 16 bytes in size; the line should say:



snprintf(IPaddress, 16, "%d.%d.%d.%d", ipNo[0], ipNo[1], ipNo[2], ipNo[3]);


and the last byte of "IPaddress" will have the customary "00" null

byte.  glancing at the code, the calling functions have their buffer for

the string representation of the IP address correctly sized to be 16

bytes, so maybe changing those snprintf-lines would be enough to fix the

problem.  i don't dare to do this beeing unfamiliar with the code base,

but i strongly recommend a bug-fix release.  i'd be happy with a patch

for the time beeing, as i really need the functionality.
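
(the arithmetic, just to double-check myself: a full dotted-quad address is 15 characters, so a size argument of 15 keeps only the first 14 of them plus the terminator:)

(length "221.208.208.101")          ;=> 15
(slice "221.208.208.101" 0 14)      ;=> "221.208.208.10", which is what shows up in the log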



-- clemens
#10
Anything else we might add? / March 24, 2007, 06:02:18 PM
Quote from: "Lutz"
Both newlisp.vim 1.0.40 shipped in newlisp-9.0.0.tgz and 1.0.50 shipped in newlisp-9.1.1.tgz have 'letn', both in the util directory of the source distribution.



http://newlisp.org/code/newlisp.vim.txt


ah!  i installed from the freebsd ports collection, which doesn't install these files.  thanks to your link, i have them in place now.



-- clemens
#11
newLISP v.9.1.0 on freebsd-6-stable:



i have an application doing:

(define (do-udp-connect fd)
  (letn ((peer (net-receive-from fd 1))
         (last-error (net-error)))
        (if (nil? last-error)
          (let ((ip (nth 1 peer)))
            (if (!= last-udp-ip-seen ip)
              (begin
                (log-n-block ip (target-ip fd) (target-port fd) "udp")
                (set 'last-udp-ip-seen ip)))
            true)
          (begin
            (set 'prob-fd fd)
            (set 'last-net-error last-error)
            nil))))


this is to temporarily block UDP connections from hosts attacking, eg. the micro$oft pop-up "service" peddling registry-fix software.  funny thing is this: i have all sorts of entries in the log showing 11-digit IPs like (example) "221.208.208.10".  when checking the firewall log i notice that this particular IP _never_ showed up, but eg. "221.208.208.101" or "221.208.208.100".  pulling my hair, time passes.  it occurs to me to count the digits of the IPs and voila: never once does a (valid!) _12_ digit IP (like xxx.xxx.xxx.xxx, with x=anydigit) get returned!



all sorts of other IPs of varying string length, as if there was a limit built-in to newlisp stating that IPs like "221.208.208.101" cannot happen.  could someone please check newlisp's C-source or verify this?  oh, by the way, a similar routine "do-tcp-connect" exists using "net-peer" to acquire the peer's IP, and it, too, shows this behaviour.



i can correlate events logged by "log-n-block", which keeps timestamps, to firewall entries made by tcpdump, and it really looks like 12-digit IPs triggered the events, but got logged by newlisp's routines with the last digit chopped off!



-- clemens
#12
the keyword "letn" is missing, too.  --clemens
#13
Anything else we might add? / March 24, 2007, 01:18:15 PM
Quote from: "Lutz"
Not sure what you mean, it works for me:



> (multiplier:make 'double 2)
double:double
> double:multi
2
> (save "double.lsp" 'double)
true
> !cat double.lsp
(context 'double)

(define (double:double x)
  (* x multi))

(set 'multi 2)

(context 'MAIN)


what exactly did you do? Can you paste the session?


i don't have it anymore, but i remember that something went wrong.  i pasted the original code.  newlisp throws an error when it meets unbalanced parentheses.  so i "repaired" the multiplier definition, but i was back in context MAIN then, due to my error.



what i didn't know was the ability to use shell code on newlisp's command line!  and what i still don't understand is how the symbol 'multi gets set in the cloned context 'double without being explicitly set in the original 'multiplier.  can you explain this to me?  i only see it used there.  --clemens
#14
Anything else we might add? / March 24, 2007, 10:10:09 AM
Quote from: "Lutz"
here is a closure-like multiplier using contexts:


(context 'multiplier)

(define (multiplier:multiplier x) (* x multi))

(define (multiplier:make ctx multi)
        (def-new 'multiplier:multiplier (sym ctx ctx)))

(context 'MAIN)



there's something wrong here: multiplier:make doesn't use argument "multi", which should be saved in the multiplier context.  otherwise, the evaluation of (double some-number) breaks: "value expected in function * : multi".
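
something like this is what i would have expected (just a guess, untested):

;; untested guess: copy the function into the new context and also store the
;; multiplier there, so double:multi exists when (double x) is evaluated
(define (multiplier:make ctx multi)
  (def-new 'multiplier:multiplier (sym ctx ctx))
  (set (sym "multi" ctx) multi))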



clemens
#15
Anything else we might add? / March 23, 2007, 01:50:34 PM
Quote from: "cormullion"
Why do you think my version was slower, though...? Is parse working harder than find-all - I suppose it is returning much more info...


i guess find-all lets the pcre-lib do most of the work.  --clemens
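
(the lazy way to settle it would be to time both, e.g.:)

;; untested: "text" is a placeholder for the string being split
(time (find-all {[^ ]+} text) 1000)
(time (parse text " ") 1000)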