I've created an asset listing that generates an RSS feed based on new content for a section of my site. I've got everything working, but I'm having trouble with HTML character entities and special characters: they break the XML. Entries containing bad characters or HTML entities just get omitted when the feed is parsed (or, worse still, they break the parser altogether when using PHP's SimpleXMLElement). The problem is that I can't remove or replace the character entities.
I've tried a number of things, including using keyword modifiers:
- ^striphtml, which is helpful but doesn't remove the entities or fix the special chars.
- ^urlencode, which does make the HTML entities and special chars safe for use in XML, but makes the feed useless because even the white space is URL encoded.
- ^as_xml, which is meant for arrays (not strings) and doesn't do anything in my use case.
- ^escapehtml, which only replaces one set of HTML character entities with another.
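To make it clearer what I'm after, here is a minimal sketch of the transformation, written in Python purely for illustration rather than as a keyword modifier. The name `xml_safe` is hypothetical and not something the CMS provides. The idea is to decode the HTML entities into real characters first, then escape only the characters XML itself reserves:

```python
import html
from xml.sax.saxutils import escape


def xml_safe(text: str) -> str:
    """Hypothetical helper: turn HTML-flavoured text into XML-safe text."""
    # Decode HTML entities such as &eacute; or &nbsp; into real characters.
    decoded = html.unescape(text)
    # Escape only what XML requires in element content: &, < and >.
    return escape(decoded)


if __name__ == "__main__":
    print(xml_safe("Caf&eacute; &amp; bar &lt;open&gt;"))
```

In PHP, I'd expect roughly the same result from combining `html_entity_decode()` with `htmlspecialchars()`, but I haven't found a way to do that from within the listing itself.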
Part of the problem is that we're serving pages with the charset in the HTTP headers set to ISO-8859-1 but the charset in the HTML set to UTF-8. (I know... WTF!!!)
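As a rough illustration of why that mismatch hurts (again in Python, with a hypothetical helper name), this is what happens when UTF-8 bytes are read as ISO-8859-1:

```python
def misread_as_latin1(text: str) -> str:
    """Hypothetical helper: simulate a UTF-8 page decoded as ISO-8859-1."""
    # Encode as UTF-8 (what the HTML claims), then decode as
    # ISO-8859-1 (what the HTTP header claims).
    return text.encode("utf-8").decode("iso-8859-1")


if __name__ == "__main__":
    print(misread_as_latin1("Café"))  # prints "CafÃ©"
```

Each non-ASCII character turns into two junk characters, which is the kind of thing that ends up in the feed.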
The simple solution is to get editors to be more careful about what they enter. However, that's easier said than done.