1 big metadata schema or several small schemas

evanwills · 2015-06-05 02:09:15 UTC

We have been using a couple of quite large metadata schema across our sites, one with 30 fields spread over 6 section and the other with 22 fields also spread over 6 section. The problem we are now having is that when I want to make changes, regenerating the schemas is causing weird random errors. (Each of these schemas is applied to roughly 100k of assets. Most of the 100k have both)

Is it better to use multiple smaller schemas (e.g. split the large schemas into one schema per section) or continue as we have been with fewer larger schemas?

quimby · 2015-06-06 00:19:15 UTC

I've run into similar problems and tried multiple different ways - this is what I came up with as the best solution for us:

- Each site has it's own schema applied at the site asset and cascaded down.

- Asset restrictions used heavily to control which asset types have what.

- Use specialised schemas only where it is called for (e.g. for some custom functonality) and apply in that area only.

- Strictly follow the one schema per site rule (I'll explain why in a bit).

Now for the re-use bit:

- Once you have defined a schema for a site, if there are fields and/or sections that are shared, use Matrix linking to link those fields and/or sections under the site's schema.

- To achieve this I find it useful to have a 'placeholder' schema that never gets applied to anything, just holds commonly shared schema sections/fields.

- I recommend linking sections only, and giving them a suitable name to indicate they are linked. Linking fields is possible, but the name is often very important, so you would want some way to identify that it is shared - otherwise changes can inadvertently affect other sites.

To summarise, you are re-using the sections/fields, but the schema, the parent, is standalone and specific only to the site.

Does this really work?

Very much YES. Metadata (as far as I can see) is stored as a tertiary table, so each field has a unique identifier which is simply referenced by id (primary key??), rather than by name or it's location under one schema or another. I have even re-linked sections under totally new schemas and renamed all the fields, and the assets retained their existing values.

This also fixes the HIPO problem. (following is educated guess on my part - Squiz staff may want to clarify - or not). When a regen starts the HIPO job tries to figure our a list of all the assets it needs to act on. If you have applied the same schema to multiple large sites, this ends up as a very large list. The HIPO job either times out and cannot be restarted or you get an out-of-memory error.

This fixes your metadata issue because regen's are for the schema only - it doesn't branch out to all assets that may have linked sections as well. This means the list the HIPO generates that needs to act on is much smaller, but you have to run a regen a couple more times - once for each site. But it's worth it to get the HIPO to actually run. The *sections* underneath are linked to multiple schemas and therefore referenced by many, many more assets, but only the assets with the *schema* applied will be include in the HIPO job.

Also note you don't ** have to ** run a regen every time you change metadata, only if you want the changes immediately visible. Otherwise you can just leave it and the metadata will regenerate itself per asset the next time the asset is updated.

Another advantage is you can use global triggers that act on metadata fields. We have a number of 'System-wide' triggers for expiries, automated backups, record-keeping, etc. Triggers reference metadata fields via their id's, not their names, so to get a trigger to work on a new site, simply add the site to the root nodes in the trigger conditions and link the relevant section under that site's schema. You're done. And still no need to run a regen, it will figure out when the trigger next runs.

The one downside is to watch out for changes in one site affecting other sites, as it's linked. Though this could also be a feature, depending upon what you are looking for (e.g. changing every 'DC.' field to 'DCTERMS.' for us was really easy).

In your case I'm not sure how big your sites are - but if they are still too big with this approach then you could do one parent schema per top-level-section, as you suggest, but link the schema sections between them for re-use. Exactly the same principle then.

Cheers