Searching in lists

Started by didi, January 08, 2013, 10:16:07 PM

didi

I want to import data directly from WordPress into my ZZBlogX.  I can export the data as XML files.  I've parsed one with xml-parse and filtered out the unnecessary things.  Now my problem is how to search through the result, e.g. to look for "title" and then take the next element, which is the title itself, to build a title list.


( do-while (find "title" outlist)
  ( set 'i  ( find "title" outlist ))
  ( set 'outlist ( i outlist))
  ( push (pop outlist) title-list )
)


That doesn't work.  Would it be better to work with slice, e.g. get the index of the element and make a new list with slice?

bairui

#1
Without seeing your data, I can only guess at the structure. As such, this code will almost certainly fail:


(map rest (ref-all '("title" *) x match true))

Hopefully the intent survives though and you're able to find a built-in search function to suit your needs. There are about three hundred of them by my last count, once you factor in the myriad switches and levers and moon phases that affect their operation.



Ok... it is perhaps a bit persnickety of me to blame my tools for my own inability to learn/memorise the many different search functions in newLISP. I need to find a way to better absorb and retain this knowledge. :-/

cormullion

#2
Without seeing your file, it's not easy to suggest anything precisely, but for general tips on importing XML, see if this Wikibooks chapter is useful: http://en.wikibooks.org/wiki/Introduction_to_newLISP/Working_with_XML

cormullion

#3
Moon phases ... :)

didi

#4
The wikibook shows exactly what I'm looking for.

I'll try to adapt that for WordPress.  I'll be back once everything works.



This works now, too - but not elegant:

( do-while (find "title" outlist)
  ( set 'i  ( find "title" outlist ))
  ( set 'outlist ( i outlist))
  ( pop outlist )
  ( push (pop outlist) title-list ))


(outlist is the cleaned and flattened parsed XML; title-list is the list of titles I was looking for.)
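
For what it's worth, the same walk can be written as a small function using find and slice, which also covers the slice idea from the first post. This is only a sketch: sample-flat is made-up data standing in for outlist, and collect-after is a hypothetical helper name, not something from ZZBlogX or newLISP itself.

```newlisp
; Made-up flat list, shaped like a cleaned and flattened parse result.
(set 'sample-flat '("item" "title" "First post" "link" "http://example.com/1"
                    "item" "title" "Second post"))

; collect-after (hypothetical helper): return every element that directly
; follows an occurrence of KEY in the flat list LST.
(define (collect-after KEY LST)
  (let (result '() i nil)
    (while (set 'i (find KEY LST))
      (set 'LST (slice LST i))                 ; LST now starts at KEY
      (pop LST)                                ; drop KEY itself
      (if LST (push (pop LST) result -1)))     ; keep the element after it
    result))

(set 'title-list (collect-after "title" sample-flat))
```

Because newLISP passes the list by value, sample-flat is left untouched, and using while instead of do-while means the body never runs when there is no "title" at all.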

rickyboy

#5
didi, I don't know exactly your context, but I have exported my data from Wordpress before, so I am going to guess your context.  I'm supposing now that you have completed the export into an XML file and that you have "slurped" that into newLISP and it is now an SXML (list) data structure.



But just to back up a moment, say I exported my Wordpress to a file called wp-export.xml.  Then, in newLISP, I would "slurp" that file like this.


(define (sxml<-file FILEPATH)
  (xml-type-tags nil nil nil nil)
  (xml-parse (read-file FILEPATH) (+ 1 2 4 16)))

(define *wp-export-db* (sxml<-file "wp-export.xml"))

I had problems using the 8 flag to xml-parse, so I left it out.  The 8 flag makes xml-parse turn the tag names into newLISP symbols, which normally are convenient, but there are some tag names I wanted to use as search criteria which had a colon in the name, like "wp:status".  newLISP uses the colon too for symbol context qualification.  If my recollection is correct, during my testing/debugging, some function like match was failing to match on these qualified tag names, so without any further investigation (because I'm lazy like that), I quickly switched to using strings for tag names.  To do this, leave the 8 flag off in the call to xml-parse.
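
To make the effect of leaving out the 8 flag concrete, here is a minimal sketch on a made-up XML string (tiny-xml and tiny-sxml are names invented for this illustration, not part of any export).

```newlisp
; Made-up one-item document instead of a real WordPress export.
(set 'tiny-xml "<item><title>Hello</title><wp:status>publish</wp:status></item>")

(xml-type-tags nil nil nil nil)   ; suppress the type tags around elements and text

; Same flags as above: 1, 2, 4 and 16, but not 8, so tag names stay strings.
(set 'tiny-sxml (xml-parse tiny-xml (+ 1 2 4 16)))

; Search the parsed list for the title element using string tag names.
(set 'found (ref '("title" "Hello") tiny-sxml))
```

If the parse behaves as described, tiny-sxml is a nested list whose elements look like ("title" "Hello") and ("wp:status" "publish"), which is the shape the match patterns further down assume.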



Using the ref-all function with match as a helper is a pretty good way to munge the SXML (as bairui and cormullion have mentioned); so I'm going to use ref-all also.  I haven't thought about what is a better or the best way to process XML, so you'll hear no unqualified comments from me on that. :)  At any rate, here is a helper function I wrote, because I don't like the look of the calling code having match and true at the end of the call to ref-all.  That's just me though.


(define (ref-all-match KEY LST) (ref-all KEY LST match true))
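
As a small illustration of what the trailing match and true do, here is a sketch on hand-written data (toy-sxml is made up and only shaped like a parsed export).

```newlisp
; Made-up data shaped like a parsed export.
(set 'toy-sxml '(("channel"
                  ("item" ("title" "First") ("wp:status" "publish"))
                  ("item" ("title" "Second") ("wp:status" "draft")))))

; With match as comparison function, the key is treated as a pattern;
; without the trailing true, ref-all returns the index vectors of the hits.
(set 'where (ref-all '("title" *) toy-sxml match))

; With the trailing true, it returns the matching sub-lists themselves.
(set 'what (ref-all '("title" *) toy-sxml match true))
```

Here what should hold the two title elements, ("title" "First") and ("title" "Second"), which is why a final (map last ...) is enough to get the bare strings.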

So you could write a function to get all the titles from the export, like this.


(define (wpx-get-titles WPX)
  (map last
       (ref-all-match '("title" *)
         (ref-all-match '("item" ("title" *) *) WPX))))

This might yield more "titles" than you want.  When I run it on my export, I get this.


> (wpx-get-titles *wp-export-db*)
("Sample Page" "About" "Things" "title" "Home" "Home" "title" "Notes on Clojure Records"
 "Unit-Slope" "basis" "Horizontal" "Skating" "tumble" "separate" "simplescene" "simplescene"
 "spirograph-demo" "StarFish" "Unit-Slope" "half-slope (1)" "Permutations of a Multiset"
 "Ripping Access Databases in Clojure" "Reactive Swing via Observables Pt 1"
 "Reactive Swing and Observables Pt 2" "Reactive Swing and Observables Pt 3"
 "Reactive Swing and Observables pt 4" "Reactive Swing and Observables Pt 5" "Reactive Swing and Observables Pt 6"
 "Fun with Functional Reactive Programming Pt 1" "A Simple Scene Graph in Clojure")

I don't know what you want out of the title extraction, but I was looking for blog post titles.  Wordpress uses the "title" tag in a more general sense.  For instance, in my output, "Sample Page", "About", "Things" are titles of Wordpress Pages.  "Unit-Slope", "basis", ..., "half-slope (1)" are titles of attachments (like imported JPEGs).  The others are blog post titles.  Notice also that there are "title" titles (the 4th and 7th elements).  These are remnants of empty title elements, which could be a problem, but in the spirit of laziness, let's move on, shall we?  :)
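
For anyone less lazy, the stray "title" entries can be explained and avoided with a few lines. This is a sketch under the assumption that an empty title element parses to the one-element list ("title"); safe-text is a hypothetical helper name.

```newlisp
; Assumed shapes: a normal title element and an empty one.
(set 'full-title  '("title" "Home"))
(set 'empty-title '("title"))

; last on the empty element returns the tag name, since it is the only member.
(set 'stray (last empty-title))

; safe-text (hypothetical helper): return "" when the element has no text.
(define (safe-text ELEMENT)
  (if (> (length ELEMENT) 1) (last ELEMENT) ""))

(set 'cleaned (map safe-text (list full-title empty-title)))
```

Mapping safe-text instead of last would leave an empty string where the stray "title" used to be, which is easy to filter out afterwards.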



I wanted just the published post titles, so I needed some more munging.


(define (wpx-get-published-post-titles WPX)
  (map last
       (ref-all-match '("title" *)
         (ref-all-match '("item" ("title" *) *
                                 ("wp:status" "publish") *
                                 ("wp:post_type" "post") *)
                        WPX))))

That should pretty much do it.  Here's my trial run.


> (wpx-get-published-post-titles *wp-export-db*)
("Permutations of a Multiset" "Ripping Access Databases in Clojure"
 "Reactive Swing via Observables Pt 1" "Reactive Swing and Observables Pt 2"
 "Reactive Swing and Observables Pt 3"  "Reactive Swing and Observables pt 4"
 "Reactive Swing and Observables Pt 5" "Reactive Swing and Observables Pt 6"
 "Fun with Functional Reactive Programming Pt 1" "A Simple Scene Graph in Clojure")

The only potential problem I can see with this function definition is with the reliance on the order of the "title", "wp:status" and "wp:post_type" elements.  In general, we should be concerned with it, but my export was small enough for me to notice that these elements are consistently output in the order indicated by the function definition above.



As extra credit, the following is some code that also worked for me to get the published post titles.  This set of functions doesn't rely on the order of the sub-elements (like "wp:status" and "wp:post_type") which are used for filtering.  And I like the "small building blocks" design -- I like Legos too. :)


(define (wpx-extract-items WPX)
  (ref-all-match '("item" *) WPX))

(define (mfilter M LST)
  (filter (curry member M) LST))

(define (wpx-get-post-items WPX)
  (mfilter '("wp:post_type" "post") (wpx-extract-items WPX)))

(define (wpx-get-published-post-items WPX)
  (mfilter '("wp:status" "publish") (wpx-get-post-items WPX)))

(define (wpx-get-post-titles WPX)
  (map (curry lookup "title") (wpx-get-post-items WPX)))

(define (wpx-get-published-post-titles WPX)
  (map (curry lookup "title") (wpx-get-published-post-items WPX)))
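
To see why these building blocks do not care about element order, here is a small sketch on a hand-written item (toy-item is made up, with the sub-elements deliberately shuffled).

```newlisp
; Made-up item whose sub-elements are in a different order than in the export.
(set 'toy-item '("item" ("wp:post_type" "post") ("title" "Hello") ("wp:status" "publish")))

; member only asks whether the sub-list occurs somewhere in the item.
(set 'is-published (member '("wp:status" "publish") toy-item))
(set 'is-post      (member '("wp:post_type" "post") toy-item))

; lookup finds the sub-list keyed by "title" and returns its last element.
(set 'toy-title (lookup "title" toy-item))
```

Both member calls return a non-nil tail of the item wherever the sub-list sits, so filter keeps the item, and lookup picks out the title text regardless of position.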

Perhaps other people will share the way they like to process (S)XML. I'm very curious.  Thanks!



P.S. -- Nice find on the wikibooks, cormullion!
(λx. x x) (λx. x x)

cormullion

#6
You could make this into a blog post... :) Always good to read something by you!

rickyboy

#7
That's your job, my friend.  :)  I miss Unbalanced Parentheses ...

cormullion

#8
Thanks! Someone else's turn now, though :)

Lutz

#9
Quote: ... like "wp:status".  newLISP uses the colon too for symbol context qualification.


In the next version, newLISP translates colons in XML tag names to dots in symbol names:



http://www.newlisp.org/downloads/development/inprogress/CHANGES-10.4.6.txt

rickyboy

#10
Thanks, Lutz!

didi

#11
Thanks Ricky, that's really awesome.  I can hardly wait until this evening to test it!

cormullion

#12
Once I tried to make it so that an XML file would execute itself. So after converting an xml file to sxml, you'd then evaluate the sxml, having previously defined functions called title, item or status which would evaluate their attributes in turn. I think it all went wrong with the @ signs - can't remember now. But at the time it seemed more logical to let the title and item functions decide what to do and when to do it, rather than try to scan through a big list of stuff and extract the information in procedural style.
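
A rough sketch of that idea, on a hand-written list rather than real parser output (toy-doc, collected and the three function bodies are all made up for illustration, and attributes with @ are left out entirely):

```newlisp
; Titles seen while the document "runs".
(set 'collected '())

; Each tag name is a function; its arguments are the already-evaluated children.
(define (title) (push (first (args)) collected -1))
(define (status) nil)              ; this sketch ignores the status
(define (item) (length (args)))    ; an item just reports how many children it had

; Hand-written stand-in for SXML parsed with symbol tag names.
(set 'toy-doc '(item (title "First") (status "publish")))

(set 'child-count (eval toy-doc))
```

After the eval, collected holds the title that the title function decided to keep, without any explicit scanning of the list.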

didi

#13
Ricky's code works fine.  I got all the titles out of a 600 KB text file in a blink.



Now I want to get the post content, marked as <content:encoded>.  I changed "title" to "content:encoded" and tested every kind and combination of "*", but nothing worked.  Maybe I couldn't find the right pattern, or it doesn't work.  Here is one sample post in XML:


<item>
<title>TEST</title>
<link>http://www.obeabe.de/?p=1090</link>
<pubDate>Thu, 10 Jan 2013 18:38:16 +0000</pubDate>
<dc:creator><![CDATA[Didi]]></dc:creator>

<category><![CDATA[Allgemein]]></category>

<category domain="category" nicename="allgemein"><![CDATA[Allgemein]]></category>

<guid isPermaLink="false">http://www.obeabe.de/?p=1090</guid>
<description></description>
<content:encoded><![CDATA[Testpost Testpost]]></content:encoded>
<excerpt:encoded><![CDATA[]]></excerpt:encoded>
<wp:post_id>1090</wp:post_id>
<wp:post_date>2013-01-10 18:38:16</wp:post_date>
<wp:post_date_gmt>2013-01-10 18:38:16</wp:post_date_gmt>
<wp:comment_status>open</wp:comment_status>
<wp:ping_status>open</wp:ping_status>
<wp:post_name>test</wp:post_name>
<wp:status>publish</wp:status>
<wp:post_parent>0</wp:post_parent>
<wp:menu_order>0</wp:menu_order>
<wp:post_type>post</wp:post_type>
<wp:post_password></wp:post_password>
<wp:is_sticky>0</wp:is_sticky>
<wp:postmeta>
<wp:meta_key>_edit_lock</wp:meta_key>
<wp:meta_value>1357843097</wp:meta_value>
</wp:postmeta>
<wp:postmeta>
<wp:meta_key>_edit_last</wp:meta_key>
<wp:meta_value>1</wp:meta_value>
</wp:postmeta>
</item>


The result should be "Testpost Testpost" .  Maybe someone has an idea.

rickyboy

#14
This is what I would define if I wanted to be in the (aforementioned) "Lego" scheme.


(define *didi-example-item* (sxml<-file "didi-example-item.xml"))

(define (wpx-get-post-contents WPX)
  (map (curry lookup "content:encoded") (wpx-get-post-items WPX)))

Then, say this in the REPL.


> (wpx-get-post-contents *didi-example-item*)
("Testpost Testpost")
>