Content migration using import_from_xml.php

G'day,


I'm trying to copy data from an old content management system into a new web site on our new Matrix system. In the absence of an export/import instruction manual, my plan has been to look at the output of the export_to_xml.php script, run on an existing Matrix site, and to use this as the 'instruction manual' for producing my own XML import file from the data in our old CMS.



It seems that I have to generate XML actions for the following for each page:



create_asset (Page_Standard)

add_web_path

create_asset (Bodycopy)

create_asset (Bodycopy_Div)

create_asset (Content_Type_WYSIWYG)

set_attribute_value (Page_Standard, short_name)

set_attribute_value (Page_Standard, name)

set_attribute_value (Bodycopy, name) = "Page Contents"

set_attribute_value (Bodycopy_Div, name) = "Content DIV ###" … where "###" is the asset ID of the standard page created above

set_attribute_value (Content_Type_WYSIWYG, name) = "DIV Content"

set_attribute_value (Content_Type_WYSIWYG, html) … this is the actual page HTML content
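
As a rough illustration of the list above, here is a minimal Python sketch that builds the same sequence of actions for one page and writes them out as XML. The element names (`actions`, `action`, `action_type`, `type_code`, `attribute`, `value`) and the `key` placeholder standing in for "###" are assumptions made for this example, not the confirmed import schema; they should be checked against real export_to_xml.php output before use.

```python
import xml.etree.ElementTree as ET

# Asset types created under each page, in the order listed above.
CHILD_TYPES = ["Bodycopy", "Bodycopy_Div", "Content_Type_WYSIWYG"]


def build_page_actions(key, name, short_name, web_path, html):
    """Return the ordered actions for one page as plain dicts.

    `key` is a hypothetical per-page identifier standing in for "###".
    """
    actions = [
        {"action_type": "create_asset", "type_code": "Page_Standard"},
        {"action_type": "add_web_path", "type_code": "Page_Standard",
         "value": web_path},
    ]
    for type_code in CHILD_TYPES:
        actions.append({"action_type": "create_asset", "type_code": type_code})

    attributes = [
        ("Page_Standard", "short_name", short_name),
        ("Page_Standard", "name", name),
        ("Bodycopy", "name", "Page Contents"),
        ("Bodycopy_Div", "name", f"Content DIV {key}"),
        ("Content_Type_WYSIWYG", "name", "DIV Content"),
        ("Content_Type_WYSIWYG", "html", html),
    ]
    for type_code, attribute, value in attributes:
        actions.append({
            "action_type": "set_attribute_value",
            "type_code": type_code,
            "attribute": attribute,
            "value": value,
        })
    return actions


def to_xml(actions):
    """Serialise the actions; element names here are assumed, not official."""
    root = ET.Element("actions")
    for action in actions:
        node = ET.SubElement(root, "action")
        for field, value in action.items():
            ET.SubElement(node, field).text = value
    return ET.tostring(root, encoding="unicode")


page = build_page_actions("42", "About us", "About", "about", "<p>Hello</p>")
xml_text = to_xml(page)
```

ElementTree escapes the page HTML as entities, which is valid XML; whether the import script expects that or some other encoding is something the exported file should show.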



There's also the opportunity to parse the HTML of each of my old pages, update the in-page link URLs to point to the newly created Matrix pages, and then add a 'notice' link via:



create_link (Bodycopy_Div)
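
The link-updating step could look something like the sketch below. It assumes you already have a mapping from old URLs to new ones (the `url_map` dict and the function name are hypothetical), and it uses a simple regular expression over quoted `href` attributes, which may not cope with badly formed HTML; a proper HTML parser would be safer for messy pages.

```python
import re

# Matches href="..." or href='...' and captures the quote and the URL.
HREF_RE = re.compile(r'''(href\s*=\s*)(["'])(.*?)\2''', re.IGNORECASE)


def rewrite_links(html, url_map):
    """Replace href values found in url_map; leave all others untouched."""
    def swap(match):
        prefix, quote, url = match.groups()
        return f"{prefix}{quote}{url_map.get(url, url)}{quote}"
    return HREF_RE.sub(swap, html)


old_html = '<a href="/old/about.html">About</a> <a href="http://example.com/">Ext</a>'
new_html = rewrite_links(old_html, {"/old/about.html": "/about"})
```

Links that are not in the mapping, such as external URLs, pass through unchanged, so the same function can be run over every page.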



I'm wondering if anyone else has had a go at doing this sort of thing, and whether you have any suggestions for things I may have left out of the above list. For example, I haven't attempted to migrate the page permissions because we're happy to reset those later. I'm also probably not going to worry about migrating files (PDF, images, etc). My concern, though, is that I might fail to migrate something essential, which would cause Matrix to get 'confused'.



For example, the XML file I'm using as an example has a set_Bodycopy_Div_###shadow_links action that just sets shadow_links to an empty array … I'd like to leave this out because I don't understand what shadow links are. Does anyone know if this is ok? There's also a set_Bodycopy_Div###_attributes action that sets "attributes" to an empty array as well - and I'd also like to leave this out, but don't know if Matrix requires it to be set. Other things I'd like to leave out of the XML import file are htmltidy_status and htmltidy_errors.



We need to migrate about 4,500 HTML pages into Matrix 3.26.3.



Thanks in advance for any comments or suggestions you might have.



Warwick




Hi Warwick,



I can't comment on the import as we have our own custom script for importing content, and it does not set permissions. It can create standard pages though, based on standard XML, in the location of your choice.



I am working on a migration project that may be of interest. I am having to screen scrape some existing content, process it, resolve URLs, clean up content, import the old images into the new system, etc. We are scraping rather than querying the DB because our pages are stitched together from many pieces, so it is better to let the CMS do the assembly.



This is all written in Ruby, and I am not done yet. At the moment it grabs a DOM node from a page, cleans and parses it, generates a hash of the elements, and imports them into a CMS (not Matrix). I have a plan for the other problems (migrating links and images). Happy to share any of this if it sounds useful.



Richard






Hi Richard,



I have started working on something similar in Ruby and Rails, and I would be keen to discuss our experiences if you have some time.





I look forward to your response.



Kind regards,



Luke


Sure. DM me your email address.

I missed this thread before, sorry.


I’ve done exactly what you’re attempting: moved around 30,000 assets from one (older) Matrix instance into a new one via Matrix’s XML export/import. I did bodycopy (i.e. page content) as well as images, PDFs, docs, etc., and remapped all the linked asset IDs in HTML content. I also did some content cleaning with Tidy.
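
For anyone wondering what the asset ID remapping might involve, here is a minimal Python sketch (not the code referred to in this post). It assumes internal links in the HTML take the form `./?a=<asset id>`, and that you have built an old-to-new ID mapping during the import; the `id_map` dict and the function name are hypothetical. Check your own exported content to confirm the link format before relying on it.

```python
import re

# Assumed internal link format: ./?a=<numeric asset id>
ASSET_LINK_RE = re.compile(r"(\./\?a=)(\d+)")


def remap_asset_ids(html, id_map):
    """Swap old asset IDs for new ones; unmapped IDs are left as they are."""
    def swap(match):
        old_id = int(match.group(2))
        return match.group(1) + str(id_map.get(old_id, old_id))
    return ASSET_LINK_RE.sub(swap, html)


content = '<a href="./?a=120">Report</a> <img src="./?a=7">'
remapped = remap_asset_ids(content, {120: 5120})
```

Leaving unmapped IDs alone makes it easy to search the result afterwards for links that still point at assets which were never migrated.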



Sing out if you’re still looking for info. I have around 8,000 lines of rather unattractive code that does it. :D