Searching in lists

Started by didi, January 08, 2013, 10:16:07 PM

rickyboy

#15
If you want to transform the WP export file into another format (like your blog data format), I'd recommend that you process whole posts ("items") instead of extracting just the constituents.  That way, you don't lose the coupling structure of the data.



Here's how I would define a function to do that -- again, in "Lego land".


(define (wpx-process-all-posts-from-wp-export WP-EXPORT-FILENAME)
  (let (posts (wpx-get-post-items
               (wpx-extract-items
                (sxml<-file WP-EXPORT-FILENAME)))
        post-filter
          (fn (post)
            (list 'post
                  (list 'title (lookup "title" post))
                  (list 'link (lookup "link" post))
                  (list 'author (lookup "dc:creator" post))
                  (list 'date (lookup "wp:post_date" post))
                  (list 'text (lookup "content:encoded" post)))))
    (map post-filter posts)))

I tested this on an export file I generated from my (old, v3.2.1) WordPress blog and it worked great.  But since that example is too big to display here, here's how it works on your singleton example.


> (wpx-process-all-posts-from-wp-export "didi-example-item.xml")
((post (title "TEST")
       (link "http://www.obeabe.de/?p=1090")
       (author "Didi")
       (date "2013-01-10 18:38:16")
       (text "Testpost Testpost")))

Notice that in the post-filter function I'm extracting only the constituents I want (namely title, URL, author, date and text) and associating them with newLISP symbols.  You only need to change this part of post-filter to get whatever corresponds to your blog data structures (the ones you commit to your nldb backend).
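To make that concrete, here is a minimal sketch of a changed filter.  The name post-filter-short, the new keys (headline, published, body) and the sample-post data are all made up for illustration; sample-post is just an association list shaped the way lookup expects, so the snippet runs without the wpx-* helpers.

```newlisp
;; Hypothetical variant of post-filter: keeps only title, date and text,
;; and stores them under different (made-up) symbols.
(define (post-filter-short post)
  (list 'post
        (list 'headline (lookup "title" post))
        (list 'published (lookup "wp:post_date" post))
        (list 'body (lookup "content:encoded" post))))

;; Made-up stand-in for one parsed item: an association list with string keys.
(set 'sample-post
     '(("title" "TEST")
       ("link" "http://www.obeabe.de/?p=1090")
       ("dc:creator" "Didi")
       ("wp:post_date" "2013-01-10 18:38:16")
       ("content:encoded" "Testpost Testpost")))

(println (post-filter-short sample-post))
```

Any key you don't look up (here "link" and "dc:creator") simply doesn't make it into the result.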



(nldb FTW!)
(λx. x x) (λx. x x)

didi

#16
Thanks Ricky!! I like the "Lego" style.

That's exactly what I wanted. In the end, all WordPress posts should be in Cormullion's nldb.lsp database, so that I can convert a WordPress blog into a static blog with ZZBlogX.



Many thanks to all.  Hope I can show you the final result soon.

didi

#17
With WordPress 3.x everything works fine. With 2.x I had to delete all the YouTube video embeds, because xml-parse returned nil due to XML that was not well-formed. To analyze it I opened the XML file in Firefox, which gives detailed information about the errors. I had to throw out entries like this:
<wp:postmeta>
<wp:meta_key>_oembed_f109b1315b82821c5f7d1d98a3530231</wp:meta_key>
<wp:meta_value><iframe width="500" height="375" src="http://www.youtube.com/embed/kKH77hfWYPU?fs=1&feature=oembed" frameborder="0" allowfullscreen></iframe></wp:meta_value>
</wp:postmeta>


BTW: newLISP is great. Ricky's few lines transformed my 600 kB XML file into a post list of 359 kB in no time!

rickyboy

#18
There are two things that xml-parse (and other XML parsers) will not like about this input: (1) any ampersand character in an attribute value needs to be escaped (as &amp;), and (2) the allowfullscreen attribute is not expressed as a key-value pair, e.g. allowfullscreen="Yes".  This is all according to the XML standard, although don't quote me as I'm not an expert.  The problems arise because the parser looks at the <iframe ...></iframe> part as XML, when it's not.  That's WordPress's fault, not ours.



It seems to me that the WordPress export is producing "dirty XML".  IMO, they should have enveloped the iframe (X)HTML element in something like a <![CDATA[ ...]]>, as they did with other DB values.



By the way, as soon as I used the "CDATA envelope", xml-parse happily parsed it.


> (sxml<-file "didi-example-2-fix.xml")
(("wp:postmeta" ("wp:meta_key" "_oembed_f109b1315b82821c5f7d1d98a3530231") ("wp:meta_value"
   "<iframe width="500" height="375" src="http://www.youtube.com/embed/kKH77hfWYPU?fs=1&feature=oembed" frameborder="0" allowfullscreen></iframe>")))
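If you'd rather not patch the export file by hand, the same "CDATA envelope" could be applied as a pre-processing step before xml-parse.  The following is only a sketch: wrap-meta-values is a made-up name, and it assumes that an existing CDATA section starts immediately after the opening tag and that no meta value contains the sequence "]]>".

```newlisp
;; Sketch: wrap the body of every <wp:meta_value> in a CDATA section,
;; skipping values that already start with one.
(define (wrap-meta-values xml-text)
  (replace {<wp:meta_value>(?!<!\[CDATA\[)(.*?)</wp:meta_value>}
           xml-text
           (string "<wp:meta_value><![CDATA[" $1 "]]></wp:meta_value>")
           4))   ; regex option 4 = PCRE_DOTALL, so "." also matches newlines

;; The problematic entry from the 2.x export, as one string.
(set 'raw (string
  {<wp:postmeta>}
  {<wp:meta_key>_oembed_f109b1315b82821c5f7d1d98a3530231</wp:meta_key>}
  {<wp:meta_value><iframe width="500" height="375" src="http://www.youtube.com/embed/kKH77hfWYPU?fs=1&feature=oembed" frameborder="0" allowfullscreen></iframe></wp:meta_value>}
  {</wp:postmeta>}))

(set 'fixed (wrap-meta-values raw))
(println fixed)

;; Parse the repaired text; if this still returns nil,
;; (xml-error) reports what the parser tripped over.
(println (xml-parse fixed))
```

For a whole export you would run the text from (read-file ...) through wrap-meta-values first and hand the result to the parser.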

Of course, WordPress doesn't care that an intermediate process they don't know about (like our XML processing here) would be using their output -- only that their export and import processes work as functional inverses in WordPress.  :(

didi

#19
Thanks Ricky!



It seems that WordPress 3.x gets this right and the problem only affects the older 2.9x versions.



Currently I'm struggling with some regex functions to replace the image and video links.  More once I've solved that.