Decode HTML entities via keyword modifier


(*) #1

Does any method to decode HTML entities within keywords exist?

I’ve looked through the documentation at https://matrix.squiz.net/manuals/keyword-replacements/chapters/keyword-modifiers and it doesn’t look like it – you’ve got htmlentities to encode but no way of decoding. As a workaround, I’ve considered using a Regular Expression but it’s going to be cumbersome to manage so many entities, even if I don’t try and regex them all.

In PHP, you’d just be calling html_entity_decode() and I’d imagine adding the implementation as a new modifier would be straightforward.

Thoughts?

Matrix Version: 5.3.3.0


(Peter McLeod) #2

Hi
Using the ‘^unescapehtml’ keyword modifier might work to some degree, but I’m not sure it will in all cases.
Thanks
Peter


(*) #3

Thanks for the suggestion. Unfortunately, since unescapehtml uses htmlspecialchars_decode() in its implementation, it only handles decoding &amp;, &quot;, &#039;, &lt; and &gt; (docs).
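For anyone comparing the two behaviours, here is a rough Python sketch of the difference. It is not Matrix code: the five-entity mapping is written out by hand to approximate what htmlspecialchars_decode() covers (with single quotes enabled), and Python's html.unescape() stands in for PHP's html_entity_decode().

```python
import html

# Hand-written approximation of the entities htmlspecialchars_decode() handles.
# This mapping is an assumption for illustration, not taken from Matrix itself.
SPECIAL_CHARS = {
    "&lt;": "<",
    "&gt;": ">",
    "&quot;": '"',
    "&#039;": "'",
    "&amp;": "&",  # kept last so "&amp;lt;" becomes "&lt;" rather than "<"
}


def decode_special_chars_only(text: str) -> str:
    """Decode only the special-character entities; all others are left as-is."""
    for entity, char in SPECIAL_CHARS.items():
        text = text.replace(entity, char)
    return text


sample = "Caf&eacute; &ndash; &quot;open&quot; &amp; friendly"

# Limited decode: &eacute; and &ndash; survive untouched.
print(decode_special_chars_only(sample))

# Full decode: every named and numeric entity is converted.
print(html.unescape(sample))
```

The first call leaves &eacute; and &ndash; in place, which is the gap described above; the second decodes everything.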


(*) #4

In case anyone’s curious, you can fake it with a Regular Expression asset. You just need to know that the regex format needs some cajoling in the case of ; (semicolon) characters in the original content. What appeared to be a string of &ndash; needs to be matched against the string &ndash\;, which means the regex /&ndash\\;/ (noting the escape of the backslash) is needed.

In my case, I want my metadata description field to feature the actual characters, not their HTML entities, so I used the following:

%metadata_field_page.byline^striphtml^trim^preg_replace:12345%

where 12345 is the asset ID of the Regular Expression asset mentioned above.
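To give a feel for what the hand-built approach amounts to, here is a minimal Python sketch of the same idea outside Matrix. The entity list and the function name are made up for illustration; a real Regular Expression asset would hold whichever pattern and replacement pairs your content needs.

```python
import re

# Hypothetical, hand-maintained list of entities and their replacements.
ENTITY_MAP = {
    "&ndash;": "\u2013",   # en dash
    "&mdash;": "\u2014",   # em dash
    "&lsquo;": "\u2018",   # left single quote
    "&rsquo;": "\u2019",   # right single quote
    "&ldquo;": "\u201c",   # left double quote
    "&rdquo;": "\u201d",   # right double quote
    "&eacute;": "\u00e9",  # e with acute accent
}

# One alternation pattern covering every listed entity.
ENTITY_PATTERN = re.compile("|".join(re.escape(e) for e in ENTITY_MAP))


def decode_known_entities(text: str) -> str:
    """Replace only the entities listed in ENTITY_MAP; anything else is kept."""
    return ENTITY_PATTERN.sub(lambda m: ENTITY_MAP[m.group(0)], text)


print(decode_known_entities("2010&ndash;2015 &copy; Caf&eacute;"))
```

Note that &copy; passes through unchanged because it is not in the list, which is the maintenance burden with this approach.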

Whilst this is a temporary workaround, a htmlentities_decode modifier still needs implementing.


(Peter McLeod) #5

Hi

Could also just remove all entities with something like:

%metadata_field_page.byline^striphtml^trim^replace:(&[a-zA-Z0-9#]{2,};): %

This would capture entities in either name or number format (e.g. for an en dash: &ndash; OR &#8211; OR &#x2013;), but would also remove language characters such as those with diacritical marks, which may not be ideal depending on your specific requirements.
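As a quick way to check what that pattern does and does not match, here is a small Python sketch using the same character class. It assumes the replacement is a single space, and the function name is only illustrative.

```python
import re

# Same pattern as in the keyword: "&", two or more of [a-zA-Z0-9#], then ";".
ENTITY_RE = re.compile(r"&[a-zA-Z0-9#]{2,};")


def strip_entities(text: str) -> str:
    """Replace every matched entity with a single space (assumed replacement)."""
    return ENTITY_RE.sub(" ", text)


# Named, decimal and hexadecimal forms of the en dash are all matched.
for form in ("&ndash;", "&#8211;", "&#x2013;"):
    print(repr(strip_entities("2010" + form + "2015")))

# An accented letter written as an entity is stripped as well.
print(repr(strip_entities("Caf&eacute;")))
```

The last line shows the downside: the accented letter is lost rather than decoded.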

Thanks
Peter


(*) #6

Thanks again, but the aim isn’t to get rid of any part of the content. In my current case, that’d get rid of quotation marks, dashes and foreign characters, affecting the readability and actual content. Also, for what it’s worth, asking editors to not use special characters is untenable – the content could contain words or names with foreign characters.

I’ve added this into the Squiz Map at https://squizmap.squiz.net/matrix/10314 and in the meantime, I’ll build the Regex by hand.


(Kieran) #7

I see this didn’t go anywhere. Strange to have encode but not decode.
Was hoping not to have to use client-side JavaScript to decode a long string into actual HTML tags.