Clean up duplicate assets

What is the recommended and/or easiest way to clean up duplicate files which have been added as assets? We have several users putting content in the site, and I am noticing instances where an asset (generally a PDF attachment or similar) that is linked to from different locations in the site has been added as an asset in multiple places. I want to delete the extra assets and reroute the links to the remaining asset in as easy a way as possible.


Sorry if this is a bit mundane. I know it's in there, I just can't remember where.

There is actually no completely automated way to do this – you could write a script (or get it written) that would extract the details of all file-based assets to build a large array and then compare each file against every other file to find matches (probably by filename/size). Then you would probably want some manual validation to ensure that the two files are indeed identical before you remove one of them.


Though, you could do a backend Quick Search on filenames as a quicker, simpler test – but that requires knowing the filenames in the first place. Also, you could create search folder assets to find all the PDFs in your site so that you can scan that for duplicate filenames perhaps.
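
For anyone who wants to try the script route, here is a rough sketch in Python. It is not a Matrix tool and it knows nothing about assets; it just walks a directory on disk. It also departs slightly from the filename/size idea: it groups files by size as a cheap first pass, then compares SHA-256 hashes of the contents, so files with different names but identical contents are still caught. The function names and the root path are made up for illustration.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return sorted lists of paths under root whose contents are identical."""
    # First pass: bucket every file by size (cheap, no file reads).
    by_size = defaultdict(list)
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    # Second pass: hash only files that share a size with another file.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return sorted(groups)


if __name__ == "__main__":
    # "/path/to/data" is a placeholder, not a real Matrix location.
    for group in find_duplicates("/path/to/data"):
        print(group)
```

The output is only a list of candidate groups; someone would still need to check each group by hand before trashing anything.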

You could use fdupes in the data directory - http://netdial.caribe.net/~adrian2/fdupes.html


The last number in the data path before the file will be the asset id. That’ll just identify the dupes for you; you’ll still need someone to go through and actually link up the assets nicely and trash the dupes.



data/private/assets/pdf_file/0003/103278/ - 103278 is the assetid.
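
If you end up with a long list of paths from a dupe finder, pulling the asset id out of each one can be scripted. A minimal Python sketch, assuming every path has the same shape as the example above (the asset id is the directory that directly contains the file); the function name and the file name `report.pdf` are hypothetical:

```python
from pathlib import PurePosixPath


def asset_id_from_path(file_path):
    """Return the name of the directory holding the file if it is numeric, else None."""
    parent = PurePosixPath(file_path).parent.name
    return parent if parent.isdigit() else None


# "report.pdf" is an invented file name used only for illustration.
print(asset_id_from_path("data/private/assets/pdf_file/0003/103278/report.pdf"))
```

Paths that do not end in a numeric directory return None, so they can be filtered out and checked by hand.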



Edit: you might want to be careful that fdupes actually identifies the current version of the file, not some file that’s just sitting around because the file versioning thing often doesn’t clean up how you might expect.


File versioning doesn't clean up at all, which is the point. :) If you have rollback enabled, the file_versioning location will store every version of the file ever uploaded.

[quote]File versioning doesn't clean up at all, which is the point. :) If you have rollback enabled, the file_versioning location will store every version of the file ever uploaded.[/quote]

I realise the file in the actual file versioning folder should remain but it doesn't make sense to me to leave two files in the asset data path when a file name is changed…


I thought this was fixed -- i.e. that Matrix removes the old filename. If not, please submit this to the Bug Tracker.

I've only noticed it on our prod install, which is still 3.14.2, it's highly possible you guys fixed it.


I'm pretty sure we did. I don't have time to go scanning the Bug Tracker now, though.

The bugs were fixed even before 3.10 came out. There are no reported bugs in 3.14 for this issue. If you can reproduce it on newly uploaded files, please submit a new bug for it.

Can't reproduce this on our 3.14.2 test box, can't get the go ahead from the boss to play with the prod box. If I manage to reproduce this at some point in the future I will post a bug. I suspect seeing as you guys have fixed it that the few assets we have like this got that way while we were having fairly severe DB outages.

Hi, sorry for not replying, I was at Squiz doing some training…


Thanks for your replies, but you know I wasn't even talking about automating this. Perhaps making an asset list that displayed file size would be a way to show up which files have an identical size which is a pretty good clue. I've used a dupe finding program for my mp3 collection but I don't know about using it on a dataset…



But I was only looking for a way to do it manually as I go. It's there on the HIPO that pops up when I delete an item!


The Move to Trash HIPO doesn't show duplicates -- it merely shows whether or not the asset(s) being deleted are currently in use anywhere else in the system.

It does however give the option to relink to a different asset.


URL Remapping

When these assets are moved to the trash, any URLs that they currently have will be broken. To remap their URLs to another asset (Eg. a "Page not found" asset), select that asset in the field below.



That's all I was after. Though obviously there should be some sort of system tool that could discover duplicate assets.



Give me a bell when it is ready, ta.


This only adds an entry into the Remap Manager so that if a user has the direct URL of the asset bookmarked, Matrix can redirect them to a new asset. It does not change any internal links to that asset, so those links will be broken.

[quote]That's all I was after. Though obviously there should be some sort of system tool that could discover duplicate assets.

Give me a bell when it is ready, ta.[/quote]

If you want this to be developed, you should speak to your Squiz account manager about funding it. :)

So if my system admin is fond of 'cleaning up the remap manager' then it will break the link? Thanks for pointing it out. If that is the case, I was misled by the HIPO dialogue, as I assumed it changed the link itself.

Correct -- if your sysadmin removes Remap Manager links then this will break.