Hi, I'm trying to take a static archive or mirror of a Squiz Matrix hosted site using wget, and I'm just wondering if anyone else has had to do likewise and has any advice.
I'm encountering a problem where child assets exist, so that you have a page appearing at both:
http://webserver/parent-asset
&
http://webserver/parent-asset/child-asset
wget is clobbering parent-asset and turning it into a folder. Using -nc to avoid clobbering the file doesn't seem to work very well either: on the first run child-asset still clobbers it, so a second run is needed to recreate it, and even then there's further work to move it from parent-asset.1 to parent-asset/index.html or similar.
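For context, the sort of invocation I mean is roughly this (the hostname and destination directory are just placeholders):

wget -r -l inf -nc -P /var/www/archive http://webserver/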
Does anyone have any experience doing likewise and have any tips to share?
[quote]
I'm encountering a problem where child assets exist, so that you have a page appearing at both:
http://webserver/parent-asset
&
http://webserver/parent-asset/child-asset
[/quote]
I've found a solution with the -E option which forces a .html extension onto the retrieved pages so that I have:
parent-asset.html & parent-asset/child-asset.html.
I still have work to do (in terms of sorting out /__data/ files and so on), but if anyone does have tips on making a mirror / archive of a Squiz Matrix hosted site I'd be interested.
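For anyone trying the same, a wget invocation along these lines should produce that layout (hostname and output directory are placeholders; -E is shorthand for --adjust-extension):

wget --mirror --page-requisites --convert-links -E -P /var/www/archive http://webserver/

--mirror turns on recursion and timestamping, --page-requisites pulls in CSS/JS/images, and --convert-links rewrites links so the copy can be browsed offline.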
Hi Douglas,
I'm currently trialling httrack to archive our site.
So far it does the job, but there are a few issues that I would like to address before putting it into production.
- It crashes/freezes randomly
- It creates lots of .tmp files
- www.example.com/about is created as www.example.com/about.html, and index.html is not present under www.example.com/about/ (a possible workaround is sketched below).
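The post-processing workaround I'm looking at for the missing index.html files (the path is a placeholder and this isn't in production yet):

# for every foo.html that has a sibling directory foo/, copy it in as foo/index.html
find /web/apache2/htdocs/example -name '*.html' | while read -r f; do
  d="${f%.html}"
  [ -d "$d" ] && [ ! -e "$d/index.html" ] && cp "$f" "$d/index.html"
done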
Here is the command that I use to mirror the site:
httrack http://www.example.com --mirror --path /web/apache2/htdocs/example/ -%S /web/apache2/htdocs/example-filter.httrack --robots=0 -q -o0 -X --connection-per-second=50 --sockets=10 --cache=0 -%U apache --htt10 --clean
example-filter.httrack contains filters to include/exclude URLs, e.g.:
-* +http://www.example.com/* +http://www.example.com/_designs/* -http://www.example.com/__data/* etc ..
Items under __data are created using rsync off $MATRIX_HOME/data/public.
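The rsync call is along these lines (the destination is assumed to match the httrack --path above):

rsync -av $MATRIX_HOME/data/public/ /web/apache2/htdocs/example/__data/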
[quote]
I'm currently trialling httrack to archive our site.
[/quote]
Hi - thanks for the reply. I've trialled HTTrack before as well, albeit on Windows, and on that occasion we opted for wget in archive mode in the end. From the command you provide it looks like you're using the Linux version?
[quote]
Items under __data are created using rsync off $MATRIX_HOME/data/public.
[/quote]
We're doing something similar. For static archives (or mirrors) we don't need __lib, do we?
[quote]
Hi - thanks for the reply. I've trialled HTTrack before as well, albeit on Windows, and on that occasion we opted for wget in archive mode in the end. From the command you provide it looks like you're using the Linux version?
We're doing something similar. For static archives (or mirrors) we don't need __lib, do we?
[/quote]
Hi Douglas,
The new Export Assets to XML Tool will give you a backup of the whole site (the HTML code is inside the XML file, though), if that's useful.
You can then import it back into Matrix when you need to.
Regards, Darren
[quote]
The new Export Assets to XML Tool will give you a backup of the whole site (the HTML code is inside the XML file, though), if that's useful.
You can then import it back into Matrix when you need to.
[/quote]
Hi Darren,
I'll have to have a play, but that sounds like additional work to extract all of the HTML for each of the files. The idea here is to be able to mirror a Squiz Matrix site via Apache (etc.) without Matrix being up or available.
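Roughly what I have in mind is just a plain Apache vhost pointing at the mirrored tree; the hostname and document root below are placeholders:

<VirtualHost *:80>
    # serve the static mirror directly, no Matrix backend required
    ServerName archive.example.com
    DocumentRoot /web/apache2/htdocs/example
    DirectoryIndex index.html
    <Directory /web/apache2/htdocs/example>
        Options -Indexes
        Require all granted
    </Directory>
</VirtualHost>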