Cloning a whole website


(Mariner Peter) #1

I am struggling to figure out how to do this, or if in fact it's possible without some heavy duty script hacking.
Here's the problem:



We have a large corporate intranet that is live and has numerous users looking after their own areas of it. At this moment two things are happening as part of a redevelopment, one a facelift, and two, a content restructure.



Given that we do not want to break the production site, I would love to clone the entire site and do all the work on that. At some future time when all the work is complete and approved by all stakeholders, we'd simply change the site URL to point to the redeveloped instance.



I have already pretty much established that it is not possible to clone an entire site through the backend admin interface, and it appears the only option is via the export / import backend scripts.



Having just recently completed a major migration from a legacy CMS, I am pretty familiar with these - and I know they have their limitations. For starters, the XML output file for a whole site of some 1200 pages would be huge, and given my experience with the import script I'd be lucky to get 1% through it before the script timed out. I'm not about to increase the execution time to 12 hours in a prod environment either…



It's possible I could export "branches" of the site, and import these piece by piece, but again, this is a pretty risky proposition. One missed asset and the import XML files will bomb. I'd have to say that for all but the simplest website, or tiniest website node, the import / export scripts are not viable options.



Of course, there may be some completely left field way of doing this I haven't even come close to thinking about.



Has anyone else faced a similar situation? What did you do?



I'm open to suggestions!


(Nic Hubbard) #2

IMHO.


Cloning entire site = Very bad idea



No matter how you try it. :frowning:


(Mark Brydon) #3

[quote]Cloning entire site = Very bad idea
No matter how you try it. :([/quote]

I second and third that…

The XML export process does not handle every possible exporting scenario.



I would recommend taking a backup of the system using scripts/backup.sh and setting it up with a separate database.


(Mariner Peter) #4

[quote]IMHO.


Cloning entire site = Very bad idea



No matter how you try it. :([/quote]



Re-organising large slabs of content on a production website with multiple publishers and hundreds of users is probably no less of a bad idea. :wink:


(Edison Wang) #5

My suggestion is to use the export_to_xml.php / import_from_xml.php scripts. They can be found in scripts/import. Make sure you use the newest versions, because the old versions will not work well for an entire site.


Exporting portions of a site is not recommended, because those portions will fail to link to each other in the new environment (e.g. user permissions on assets).

It's highly recommended to export designs, users and sites in one go. Please refer to the comments in these scripts for instructions.



The possible steps and outcomes:

  1. export_to_xml.php: generates the export XML file, typically within a minute
  2. import_from_xml.php: imports all assets, taking anywhere from 20 minutes to a couple of hours depending on server speed and the number of assets
  3. manually check and fix the newly created assets, because some hard-coded links and attribute values (e.g. www.olddomain.com/xx) cannot be converted automatically





Feel free to give it a go, since the export process is quick and has no impact on the system.
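As a rough illustration of the manual checking step above, a small script could list the hard-coded old-domain links in an export file so they can be reviewed after import. This is a Python sketch, not a Matrix tool; the domain pattern, helper name and sample input are mine.

```python
import re

# Assumed pattern for hard-coded links to the old domain; adjust to taste.
OLD_DOMAIN = re.compile(r'https?://www\.olddomain\.com/[^\s"<]*')

def find_hardcoded_links(xml_text):
    """Return (line_number, url) pairs for every old-domain link found."""
    hits = []
    for lineno, line in enumerate(xml_text.splitlines(), start=1):
        for match in OLD_DOMAIN.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits

sample = '<asset>\n<a href="http://www.olddomain.com/xx">old</a>\n</asset>'
print(find_hardcoded_links(sample))  # [(2, 'http://www.olddomain.com/xx')]
```

You would still fix the links by hand (or with asset-map knowledge), but at least you would know where they all are.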


(Mariner Peter) #6

[quote]My suggestion is to use the export_to_xml.php / import_from_xml.php scripts. They can be found in scripts/import. Make sure you use the newest versions, because the old versions will not work well for an entire site.

Exporting portions of a site is not recommended, because those portions will fail to link to each other in the new environment (e.g. user permissions on assets).

It's highly recommended to export designs, users and sites in one go. Please refer to the comments in these scripts for instructions.

The possible steps and outcomes:

  1. export_to_xml.php: generates the export XML file, typically within a minute
  2. import_from_xml.php: imports all assets, taking anywhere from 20 minutes to a couple of hours depending on server speed and the number of assets
  3. manually check and fix the newly created assets, because some hard-coded links and attribute values (e.g. www.olddomain.com/xx) cannot be converted automatically

Feel free to give it a go, since the export process is quick and has no impact on the system.[/quote]



Hi Edison.



Those scripts would want to be an order of magnitude improvement over the old ones. The size of an XML file for just 10 assets is surprisingly big… let alone for over 1000. Last year I migrated 8000 assets from a legacy CMS into MySource using a script I wrote to create the page + content and media assets… and I had to break the XML up into 350-odd files so I wouldn't get PHP execution timeouts. And I had the execution time increased from the typical 3 minutes to 30 minutes! :o



I also don't have a high level of confidence in this approach when even a small test of a direct export_to / import_from results in the spectacularly unhelpful message:



"XML Error: mismatched tag"



Pinpointing the problem(s) in a 5 MB XML file may also be a tad tricky… I am guessing there's not just "one" mismatched tag in my test export file. :wink:
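For what it's worth, a generic XML parser will at least tell you where the first problem sits, which "XML Error: mismatched tag" on its own does not. A minimal Python sketch (nothing Matrix-specific; the broken sample input is made up):

```python
from xml.parsers import expat

def locate_xml_error(xml_text):
    """Parse with expat; return (line, column, message) for the first error,
    or None if the document is well-formed."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(xml_text, True)
        return None
    except expat.ExpatError as err:
        return (err.lineno, err.offset, expat.errors.messages[err.code])

# A deliberately broken export fragment: <name> is never closed.
broken = "<assets>\n  <asset><name>Home</asset>\n</assets>"
print(locate_xml_error(broken))
```

Running the export file through something like this narrows a 5 MB haystack down to one line number per pass; fix, re-run, repeat.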



Where you have multiple sites on a MySource instance as we do, creating a new MySource instance from a database export / import is also not a viable option. You would have to freeze publishing on sites that had nothing to do with the one in question, just to ensure parity once the work on the redev site had been completed. The even less desirable alternative would be to ask publishers of sites that have nothing to do with the one under redevelopment to publish to two MySource instances!



Where you have just one site on a MySource instance this would work, provided you had a way of tracking all the content changes on the live site and repeating them on the development site. Putting a publishing freeze in place would alleviate this, but on a big site with tens of thousands of pages, this has its own issues and could drag on for months and months.



A cloned individual site gives you something to work on without affecting users or publishers, or other sites running on the same MySource instance… and something for groups and stakeholders to sign off on.



Perhaps outside of government this process is far simpler. :wink:


(Dw Andrew) #7

You're right, this is really quite a complicated problem. With Matrix you need to do a lot of things like this live, or you are pretty much forced to do the work twice (once in backup/dev and then again in production), as there is no simple migration of a site/assets from one system to another. To do what you want, I think you would really need to have your own scripts developed.


I wonder if it is worth attempting on a dev system: clone the page inside Matrix, delete the old site and then change the paths for the cloned site… you would probably need to increase your HIPO memory settings etc. Or, possibly take a backup of the site you are changing and then serve up pages from the backup while you make changes to the new version (they will have separate file systems and databases)?


(Nic Hubbard) #8

If I were you I would just rebuild the site. :slight_smile: The second time around you are going to build it much cleaner, and have learned a lot from the first build.


(Mariner Peter) #9

We've been kicking this 'round most of the day.


We are now considering the following plan:


  1. snapshot the current production instance of MySource
  2. build a second instance of MySource using this snapshot
  3. change the Apache & MySource config to run the two unaffected sites, plus the redevelopment site, from the new MySource instance. The production version of the site being restructured would remain on the old MySource instance
  4. track changes on the site running on the old instance, while carrying out the content restructure on the new instance (not publicly visible)
  5. once the restructure is complete, freeze the old site and update any content that has changed on the new site
  6. when it's all signed off, change the Apache & MySource config to point to the new site on the newer MySource instance (so all sites will be running from the newer MySource instance)
  7. decommission the old instance of MySource



This has a whole load of advantages, and avoids the problems of stalling HIPO tasks and import_from_xml script timeouts / badly formed XML.



In spite of this, it's somewhat surprising to learn that MySource doesn't have any decent recursive asset copy / clone functionality. Owners of large corporate websites running in shared hosting environments or on black box / appliance versions of MySource would be stuck if they had to do this.

(Peter Sheppard) #10

What you've highlighted above, taking a snapshot of the system into another location, is how we approached rebuilding our intranet last year, except that we upgraded a major version and moved to a new PHP 5 64-bit platform at the same time.


We have various replication and redundancy arrangements at the system level, and it's much easier to maintain database consistency that way.



If you were to clone the site within the system (even if you could), how would you deal with things such as links embedded within pages and whether they should be updated to point to the cloned assets etc? It's a massive minefield!


(Mariner Peter) #11

[quote]What you've highlighted above, taking a snapshot of the system into another location, is how we approached rebuilding our intranet last year, except that we upgraded a major version and moved to a new PHP 5 64-bit platform at the same time.

We have various replication and redundancy arrangements at the system level, and it's much easier to maintain database consistency that way.

If you were to clone the site within the system (even if you could), how would you deal with things such as links embedded within pages and whether they should be updated to point to the cloned assets etc? It's a massive minefield![/quote]





Crazy thought (setting myself up to be shot down in flames, I'm sure :wink: ), but aren't the asset ID offsets fixed? If you locked off access for the duration, couldn't some amazing script find the first and last asset ID under the root you want to clone, then re-create copies of all those assets starting from one plus the last created asset ID in the system? All embedded links would just have the offset added… and they'd point to the corresponding cloned assets in the new structure. All child / parent asset relationships would be preserved.



Piece of cake. :stuck_out_tongue:
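For the curious, the offset idea can be sketched in a few lines. This is a toy Python version only: the `./?a=NNN` link form, the flat id-to-content dict and the helper name are all assumptions for illustration, not how Matrix actually stores assets or links internally.

```python
import re

def clone_with_offset(assets, offset):
    """assets: dict of asset_id -> content. Return a cloned dict with every
    ID shifted by `offset` and every embedded ./?a=NNN link rewritten by the
    same offset, so internal links point at the cloned copies."""
    link = re.compile(r'\./\?a=(\d+)')
    cloned = {}
    for old_id, content in assets.items():
        new_content = link.sub(
            lambda m: './?a=%d' % (int(m.group(1)) + offset), content)
        cloned[old_id + offset] = new_content
    return cloned

site = {100: 'Home, see ./?a=101', 101: 'About, back to ./?a=100'}
print(clone_with_offset(site, 1000))
# {1100: 'Home, see ./?a=1101', 1101: 'About, back to ./?a=1100'}
```

The hard part in practice is everything this toy skips: attribute values, permissions, metadata, designs, and links serialised in forms other than a simple URL parameter.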


(Ryan Archer) #12

 

[quote]crazy thought ( setting myself up to be shot down in flames, I'm sure :wink: ) but aren't the asset ID offsets fixed? If you locked off access for the duration, couldn't some amazing script find the first and last asset ID under the root you want to clone, then re-create copies of all those assets starting from one plus the last created asset in the system? All embedded links would just have to have the offset added… and they'd point to the same cloned assets in the new structure. All child / parent asset relationships would be preserved.[/quote]

"I'd buy that for a dollar!"


(Ryan Archer) #13

[quote]My suggestion is to use the export_to_xml.php / import_from_xml.php scripts. They can be found in scripts/import. Make sure you use the newest versions, because the old versions will not work well for an entire site.

Exporting portions of a site is not recommended, because those portions will fail to link to each other in the new environment (e.g. user permissions on assets).

It's highly recommended to export designs, users and sites in one go. Please refer to the comments in these scripts for instructions.

The possible steps and outcomes:
export_to_xml.php: generates the export XML file, typically within a minute
import_from_xml.php: imports all assets, taking anywhere from 20 minutes to a couple of hours depending on server speed and the number of assets
manually check and fix the newly created assets, because some hard-coded links and attribute values (e.g. www.olddomain.com/xx) cannot be converted automatically

Feel free to give it a go, since the export process is quick and has no impact on the system.[/quote]

 

Are there instructions for actually running the script?

Do I edit the PHP file and then enter its path in the browser and hit "enter" to start the script?

I'm just finding the Squiz guide very short, and I'm not familiar with running PHP server scripts on Squiz, let alone on really any other system.


(Ryan Archer) #14

Well, going by the import-from-XML tool… it falls over if the XML file is about 30 MB. It worked with a 10 MB file.

That's pretty lame. Can the PHP scripts put up a better fight? If so, can the "scripts" folder be reached through FTP? I'm just unable to locate it within the dashboard.
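One generic workaround for size limits like that is to split the export into smaller batches before importing. This is a sketch only: it assumes the export is a flat list of elements under a single root, which a real Matrix export file may well not be (nested assets would break a naive split like this).

```python
import xml.etree.ElementTree as ET

def split_export(xml_text, batch_size):
    """Split the root element's children into batches of batch_size,
    returning a list of smaller XML documents as strings."""
    root = ET.fromstring(xml_text)
    children = list(root)
    batches = []
    for i in range(0, len(children), batch_size):
        part = ET.Element(root.tag)           # same root tag for each batch
        part.extend(children[i:i + batch_size])
        batches.append(ET.tostring(part, encoding='unicode'))
    return batches

# Hypothetical flat export: 5 assets split into batches of 2 -> 3 files.
big = '<assets>' + ''.join('<asset id="%d"/>' % n for n in range(5)) + '</assets>'
parts = split_export(big, 2)
print(len(parts))  # 3
```

Note this is exactly the "export branches, import piece by piece" approach warned against earlier in the thread: any cross-batch references (permissions, links) would still need fixing up afterwards.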