Searching in lists

didi · January 08, 2013, 10:16:07 PM

I want to import data directly from WordPress into my ZZBlogX. I can export the data as XML files. I've parsed the export with xml-parse and filtered out the unnecessary parts. Now my problem is searching through the resulting list: for example, finding "title" and then taking the next element, which is the actual title, to build a list of titles.

Code:
( do-while (find "title" outlist)
  ( set 'i  ( find "title" outlist ))
  ( set 'outlist ( i outlist))
  ( push (pop outlist) title-list )
)

That doesn't work. Would it be better to work with slice, e.g. get the index of the element and make a new list with slice?

bairui · January 08, 2013, 11:45:48 PM

Without seeing your data, I can only guess at the structure. As such, this code will almost certainly fail:

Code:
(map rest (ref-all '("title" *) x match true))

Hopefully the intent survives though and you're able to find a built-in search function to suit your needs. There are about three hundred of them by my last count, once you factor in the myriad switches and levers and moon phases that affect their operation.

Ok... it is perhaps a bit persnickety of me to blame my tools for my own inability to learn/memorise the many different search functions in newLISP. I need to find a way to better absorb and retain this knowledge. :-/

cormullion · January 09, 2013, 12:37:05 AM

Without seeing your file, it's not easy to suggest anything precisely, but for general tips on importing XML, see if this Wikibooks chapter is useful: http://en.wikibooks.org/wiki/Introduction_to_newLISP/Working_with_XML

cormullion · January 09, 2013, 12:39:50 AM

Moon phases ... :)

didi · January 09, 2013, 01:19:28 PM

The wikibook shows exactly what I'm looking for.

I'll try to adapt that for WordPress. I'll be back once everything works.

This works now, too, but it is not elegant:

Code:
( do-while (find "title" outlist)
  ( set 'i  ( find "title" outlist ))
  ( set 'outlist ( i outlist))
  ( pop outlist )
  ( push (pop outlist) title-list ))

(outlist is the cleaned and flattened parsed XML; title-list is the list of titles I was looking for.)
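
For comparison, the same scan can be written so that it stops cleanly once no "title" is left. This is only a sketch, assuming outlist is a flat list of strings in which every "title" is immediately followed by its text; the sample data is made up for illustration.

```newlisp
;; Hypothetical flattened data, standing in for the cleaned xml-parse output.
(set 'outlist '("item" "title" "First post" "link" "http://example.com/1"
                "item" "title" "Second post" "link" "http://example.com/2"))
(set 'title-list '())

;; find returns the index of the first "title", or nil when none is left.
;; Index 0 counts as true in newLISP, so a match at the front still loops.
(while (set 'i (find "title" outlist))
  (push (nth (+ i 1) outlist) title-list -1)   ; element after "title", appended
  (set 'outlist (slice outlist (+ i 2))))      ; continue behind that element

title-list
;; → ("First post" "Second post")
```

Unlike do-while, while tests the condition before the first pass, so a list without any "title" simply leaves title-list empty. Like the loop above, it consumes outlist as it goes.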

rickyboy · January 09, 2013, 02:27:06 PM

didi, I don't know exactly your context, but I have exported my data from Wordpress before, so I am going to guess your context. I'm supposing now that you have completed the export into an XML file and that you have "slurped" that into newLISP and it is now an SXML (list) data structure.

But just to back up a moment, say I exported my Wordpress to a file called wp-export.xml. Then, in newLISP, I would "slurp" that file like this.

Code:
(define (sxml<-file FILEPATH)
  (xml-type-tags nil nil nil nil)
  (xml-parse (read-file FILEPATH) (+ 1 2 4 16)))

(define *wp-export-db* (sxml<-file "wp-export.xml"))

I had problems using the 8 flag to xml-parse, so I left it out. The 8 flag makes xml-parse turn the tag names into newLISP symbols, which normally are convenient, but there are some tag names I wanted to use as search criteria which had a colon in the name, like "wp:status". newLISP uses the colon too for symbol context qualification. If my recollection is correct, during my testing/debugging, some function like match was failing to match on these qualified tag names, so without any further investigation (because I'm lazy like that), I quickly switched to using strings for tag names. To do this, leave the 8 flag off in the call to xml-parse.
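
To make the flag arithmetic concrete, here is a small sketch that parses an inline string instead of a file. The sample string and the names sample and parsed are made up for illustration; the meaning of each option bit is taken from the xml-parse entry in the newLISP manual.

```newlisp
;; Suppress the "TEXT"/"CDATA"/"COMMENT"/"ELEMENT" type tags in the result.
(xml-type-tags nil nil nil nil)

;; xml-parse option bits:
;;   1  suppress whitespace-only text nodes
;;   2  suppress empty attribute lists
;;   4  suppress comment nodes
;;   8  translate tag names into symbols (left out here on purpose)
;;  16  SXML-style attributes, (@ (name "value"))
(set 'sample "<item><title>TEST</title><wp:status>publish</wp:status></item>")
(set 'parsed (xml-parse sample (+ 1 2 4 16)))

;; Tag names stay strings, so a colon in a search key is harmless.
(lookup "wp:status" (first parsed))
```

Evaluating parsed in the REPL is a quick way to see what shape the search patterns have to match.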

Using the ref-all function with match as a helper is a pretty good way to munge the SXML (as bairui and cormullion have mentioned); so I'm going to use ref-all also. I haven't thought about what is a better or the best way to process XML, so you'll hear no unqualified comments from me on that. :) At any rate, here is a helper function I wrote, because I don't like the look of the calling code having match and true at the end of the call to ref-all. That's just me though.

Code:
(define (ref-all-match KEY LST) (ref-all KEY LST match true))
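
A quick way to see what this helper returns is to try it on a tiny hand-written list. The list tiny below is invented for illustration and is much smaller than a real export; the trailing true is what makes ref-all return the matching sublists rather than their index vectors.

```newlisp
(define (ref-all-match KEY LST) (ref-all KEY LST match true))

;; Hypothetical miniature SXML, shaped like a Wordpress export.
(set 'tiny
  '(("rss"
     ("channel"
      ("title" "My Blog")
      ("item" ("title" "First post") ("wp:status" "publish"))
      ("item" ("title" "Second post") ("wp:status" "draft"))))))

;; The * wildcard in the match pattern stands for "any remaining elements".
(set 'found (ref-all-match '("title" *) tiny))
;; → (("title" "My Blog") ("title" "First post") ("title" "Second post"))

(map last found)
;; → ("My Blog" "First post" "Second post")
```

Note that the channel title is found too, because ref-all searches at every nesting level. That is why the next function first narrows the search to "item" elements.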

So you could write a function to get all the titles from the export, like this.

Code:
(define (wpx-get-titles WPX)
  (map last
       (ref-all-match '("title" *)
         (ref-all-match '("item" ("title" *) *) WPX))))

This might yield more "titles" than you want. When I run it on my export, I get this.

Code:
> (wpx-get-titles *wp-export-db*)
("Sample Page" "About" "Things" "title" "Home" "Home" "title" "Notes on Clojure Records" 
 "Unit-Slope" "basis" "Horizontal" "Skating" "tumble" "separate" "simplescene" "simplescene" 
 "spirograph-demo" "StarFish" "Unit-Slope" "half-slope (1)" "Permutations of a Multiset" 
 "Ripping Access Databases in Clojure" "Reactive Swing via Observables Pt 1"
 "Reactive Swing and Observables Pt 2" "Reactive Swing and Observables Pt 3" 
 "Reactive Swing and Observables pt 4" "Reactive Swing and Observables Pt 5" "Reactive Swing and Observables Pt 6" 
 "Fun with Functional Reactive Programming Pt 1" "A Simple Scene Graph in Clojure")

I don't know what you want out of the title extraction, but I was looking for blog post titles. Wordpress uses the "title" tag in a more general sense. For instance, in my output, "Sample Page", "About", "Things" are titles of Wordpress Pages. "Unit-Slope", "basis", ..., "half-slope (1)" are titles of attachments (like imported JPEGs). The others are blog post titles. Notice also that there is a "title" title (the 7th element). This is a remnant of an empty title element, which could be a problem, but in the spirit of laziness, let's move on, shall we? :)

I wanted just the published post titles, so I needed some more munging.

Code:
(define (wpx-get-published-post-titles WPX)
  (map last
       (ref-all-match '("title" *)
         (ref-all-match '("item" ("title" *) *
                                 ("wp:status" "publish") *
                                 ("wp:post_type" "post") *)
                        WPX))))

That should pretty much do it. Here's my trial run.

Code:
> (wpx-get-published-post-titles *wp-export-db*)
("Permutations of a Multiset" "Ripping Access Databases in Clojure" 
 "Reactive Swing via Observables Pt 1" "Reactive Swing and Observables Pt 2"
 "Reactive Swing and Observables Pt 3"  "Reactive Swing and Observables pt 4"
 "Reactive Swing and Observables Pt 5" "Reactive Swing and Observables Pt 6" 
 "Fun with Functional Reactive Programming Pt 1" "A Simple Scene Graph in Clojure")

The only potential problem I can see with this function definition is with the reliance on the order of the "title", "wp:status" and "wp:post_type" elements. In general, we should be concerned with it, but my export was small enough for me to notice that these elements are consistently output in the order indicated by the function definition above.

As extra credit, the following is some code that also worked for me to get the published post titles. This set of functions doesn't rely on the order of the sub-elements (like "wp:status" and "wp:post_type") which are used for filtering. And I like the "small building blocks" design -- I like Legos too. :)

Code:
(define (wpx-extract-items WPX)
  (ref-all-match '("item" *) WPX))

(define (mfilter M LST)
  (filter (curry member M) LST))

(define (wpx-get-post-items WPX)
  (mfilter '("wp:post_type" "post") (wpx-extract-items WPX)))

(define (wpx-get-published-post-items WPX)
  (mfilter '("wp:status" "publish") (wpx-get-post-items WPX)))

(define (wpx-get-post-titles WPX)
  (map (curry lookup "title") (wpx-get-post-items WPX)))

(define (wpx-get-published-post-titles WPX)
  (map (curry lookup "title") (wpx-get-published-post-items WPX)))
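
To illustrate the order independence, here is a sketch on three made-up items whose sub-elements appear in different orders. The items list is invented for illustration; mfilter is the same definition as above.

```newlisp
(define (mfilter M LST)
  (filter (curry member M) LST))

;; Hypothetical items; note that the second one lists wp:status first.
(set 'items
  '(("item" ("title" "First post") ("wp:post_type" "post") ("wp:status" "publish"))
    ("item" ("wp:status" "draft") ("title" "Second post") ("wp:post_type" "post"))
    ("item" ("title" "About") ("wp:post_type" "page") ("wp:status" "publish"))))

;; member only asks whether the pair occurs somewhere in the item,
;; and lookup finds the "title" pair wherever it sits.
(set 'published
  (map (curry lookup "title")
       (mfilter '("wp:status" "publish")
                (mfilter '("wp:post_type" "post") items))))
;; → ("First post")
```

The draft and the page are filtered out by one pass each, and neither pass cares where the pair sits inside the item.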

Perhaps other people will share the way they like to process (S)XML. I'm very curious. Thanks!

P.S. -- Nice find on the wikibooks, cormullion!

cormullion · January 09, 2013, 02:47:17 PM

You could make this into a blog post... :) Always good to read something by you!

rickyboy · January 09, 2013, 02:59:09 PM

That's your job, my friend. :) I miss Unbalanced Parentheses ...

cormullion · January 09, 2013, 03:01:56 PM

Thanks! Someone else's turn now, though :)

Lutz · January 09, 2013, 03:57:30 PM

Quote: like "wp:status". newLISP uses the colon too for symbol context qualification.

In the next version, newLISP translates colons in XML tag names to dots in symbol names:

http://www.newlisp.org/downloads/development/inprogress/CHANGES-10.4.6.txt

rickyboy · January 09, 2013, 08:26:41 PM

Thanks, Lutz!

didi · January 09, 2013, 10:19:09 PM

Thanks, Ricky, that's really awesome. I can hardly wait till the evening to test it!

cormullion · January 10, 2013, 08:27:21 AM

Once I tried to make it so that an XML file would execute itself. So after converting an xml file to sxml, you'd then evaluate the sxml, having previously defined functions called title, item or status which would evaluate their attributes in turn. I think it all went wrong with the @ signs - can't remember now. But at the time it seemed more logical to let the title and item functions decide what to do and when to do it, rather than try to scan through a big list of stuff and extract the information in procedural style.

didi · January 10, 2013, 10:52:35 AM

Ricky's code works fine. I got all the titles out of a 600 KB text file in a blink.

Now I want to get the post content, which is marked as <content:encoded>. I changed "title" to "content:encoded" and tested every kind and combination of "*", but nothing worked. Maybe I couldn't find the right pattern, or it doesn't work. Here is one sample post in XML:

Code:
<item>
<title>TEST</title>
<link>http://www.obeabe.de/?p=1090</link>
<pubDate>Thu, 10 Jan 2013 18:38:16 +0000</pubDate>
<dc:creator><![CDATA[Didi]]></dc:creator>

		<category><![CDATA[Allgemein]]></category>

		<category domain="category" nicename="allgemein"><![CDATA[Allgemein]]></category>

<guid isPermaLink="false">http://www.obeabe.de/?p=1090</guid>
<description></description>
<content:encoded><![CDATA[Testpost Testpost]]></content:encoded>
<excerpt:encoded><![CDATA[]]></excerpt:encoded>
<wp:post_id>1090</wp:post_id>
<wp:post_date>2013-01-10 18:38:16</wp:post_date>
<wp:post_date_gmt>2013-01-10 18:38:16</wp:post_date_gmt>
<wp:comment_status>open</wp:comment_status>
<wp:ping_status>open</wp:ping_status>
<wp:post_name>test</wp:post_name>
<wp:status>publish</wp:status>
<wp:post_parent>0</wp:post_parent>
<wp:menu_order>0</wp:menu_order>
<wp:post_type>post</wp:post_type>
<wp:post_password></wp:post_password>
<wp:is_sticky>0</wp:is_sticky>
<wp:postmeta>
<wp:meta_key>_edit_lock</wp:meta_key>
<wp:meta_value>1357843097</wp:meta_value>
</wp:postmeta>
<wp:postmeta>
<wp:meta_key>_edit_last</wp:meta_key>
<wp:meta_value>1</wp:meta_value>
</wp:postmeta>
	</item>

The result should be "Testpost Testpost". Maybe someone has an idea.

rickyboy · January 10, 2013, 11:19:38 AM

This is what I would define if I wanted to be in the (aforementioned) "Lego" scheme.

Code:
(define *didi-example-item* (sxml<-file "didi-example-item.xml"))

(define (wpx-get-post-contents WPX)
  (map (curry lookup "content:encoded") (wpx-get-post-items WPX)))

Then, say this in the REPL.

Code:
> (wpx-get-post-contents *didi-example-item*)
("Testpost Testpost")
>
