Taking a static archive of a Squiz Matrix hosted site - any advice?


(Douglas (@finnatic at @waikato)) #1

Hi, I'm trying to take a static archive or mirror of a Squiz Matrix hosted site using wget, and I'm just wondering if anyone else has had to do likewise and has any advice.


I'm encountering a problem where child assets exist, so that you have a page appearing at both:



http://webserver/parent-asset

&

http://webserver/parent-asset/child-asset



wget is clobbering parent-asset and turning it into a folder. Using -nc so the file isn't clobbered doesn't seem to work very well either: on the first run child-asset still clobbers it, so a second run is needed to create the file, and even then there's further work to move it from parent-asset.1 to parent-asset/index.html or similar.
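
For context, the sort of invocation I mean is roughly the following - the URL and flags are illustrative only, not the exact command we run:

    # hypothetical mirror run: wget first saves parent-asset as a file,
    # then has to turn it into a directory to hold child-asset
    wget --mirror --page-requisites --convert-links http://webserver/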



Does anyone have any experience doing likewise and have any tips to share?


(Douglas (@finnatic at @waikato)) #2

[quote]
I'm encountering a problem where child assets exist, so that you have a page appearing at both:



http://webserver/parent-asset

&

http://webserver/parent-asset/child-asset

[/quote]



I've found a solution with the -E option, which forces a .html extension onto the retrieved pages, so that I have:



parent-asset.html & parent-asset/child-asset.html.
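
For example, something along these lines (a sketch only - the exact flags will depend on what else you need):

    # -E (--adjust-extension) appends .html, so the parent page is saved as
    # parent-asset.html and no longer collides with the parent-asset/ directory
    wget --mirror --page-requisites --convert-links -E http://webserver/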



I still have work to do (in terms of sorting out /__data/ files and so on) - if anyone does have tips on making a mirror / archive of a Squiz Matrix hosted site, I'd be interested.


(Pw) #3

Maybe WinHTTrack will help? http://www.httrack.com/


(Rwahyudi) #4

Hi Douglas,


I'm currently trialling httrack to archive our site.



So far it does the job, but there are a few issues that I would like to address before putting it into production.

    httrack http://www.example.com --mirror --path /web/apache2/htdocs/example/ -%S /web/apache2/htdocs/example-filter.httrack --robots=0 -q -o0 -X --connection-per-second=50 --sockets=10 --cache=0 -%U apache --http-10 --clean


example-filter.httrack contains filters to include / exclude URLs, e.g.:
    -*
    +http://www.example.com/*
    +http://www.example.com/_designs/*
    -http://www.example.com/__data/*
    etc .. 


Items under __data are created using rsync off $MATRIX_HOME/data/public.
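
Something along these lines, with the paths assumed from the layout above (adjust to your install):

    # assumed paths: copy Matrix's public data directory into the static mirror's __data
    rsync -av $MATRIX_HOME/data/public/ /web/apache2/htdocs/example/__data/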

(Douglas (@finnatic at @waikato)) #5

[quote]
I'm currently trialling httrack to archive our site.

[/quote]



Hi - thanks for the reply. I've trialled HTTrack before as well, albeit on Windows, and we opted for wget in archive mode in the end on that occasion. From the command you provided, it looks like you're using the Linux version?


[quote]

Items under __data is created using rsync off $MATRIX_HOME/data/public

[/quote]



We're doing something similar. For static archives (or mirrors) we don't need __lib, do we?


(Darren Johnston) #6

[quote]
Hi - thanks for the reply. I've trialled HTTrack before as well, albeit on Windows, and we opted for wget in archive mode in the end on that occasion. From the command you provided, it looks like you're using the Linux version?

We're doing something similar. For static archives (or mirrors) we don't need __lib, do we?

[/quote]



Hi Douglas,



The new Export Assets to XML Tool will give you a backup of the whole site (the HTML code is inside the XML file, though), if that's useful?

You can then import it back into Matrix when you need to.



Regards, Darren


(Douglas (@finnatic at @waikato)) #7

[quote]
The new Export Assets to XML Tool will give you a backup of the whole site (the HTML code is inside the XML file, though), if that's useful?

You can then import it back into Matrix when you need to.

[/quote]



Hi Darren,



I'll have to have a play, but that sounds like additional work to extract all of the HTML for each of the files. The idea here is to be able to mirror a Squiz Matrix site via Apache (etc.) without Matrix being up / available.