Content not showing after 'illegal' characters in EES

uoebusiness · October 16, 2010, 7:31am

Edit: this should have been in the EES Phase 2 thread. Maybe a mod can move it?

I've come across this a few times in FF. Add content via Matrix as normal using the standard WYSIWYG div and all displays fine. View in EES and it displays the same way but if you edit it, save and then preview, on some occasions large chunks of the text will be missing. If you view in Matrix (or on the site) it's still all there.

What seems to be happening is that Matrix is accepting illegal characters in WYSIWYG but after editing in EES it then only displays up to the illegal character. e.g.

Full text:

New industries, new types of firms and organizations, new types of people, and new nations are increasingly practicing international business, and are doing it in new ways - within new structures. The rapid change we witness in the types of international businesses, and in how they practice international businesses continues unabated.

After editing in EES:

New industries, new types of firms and organizations, new types of people, and new nations are increasingly practicing international business, and are doing it in new ways

Not an actual example but just to illustrate. All that seems to be happening is that in EES it's stopping the display at the first character it doesn't recognise, but only after editing it.

Obviously not a show stopper, as the text is still all there, but as we have less technical people using EES it's a concern that they would think it has been deleted and might copy and paste the missing stuff back in thereby duplicating what is already there.

Anthony_Barnes · October 17, 2010, 11:36pm

Is your Matrix system set to UTF-8 or ISO? There is a known issue with systems still using ISO character set concerning some of the data that is POST'ed back using jQuery. The content header can be overridden, but a more robust solution would be to look at conversion of the site to support UTF-8.

uoebusiness · October 18, 2010, 7:10am

[quote]
Is your Matrix system set to UTF-8 or ISO? There is a known issue with systems still using ISO character set concerning some of the data that is POST'ed back using jQuery. The content header can be overridden, but a more robust solution would be to look at conversion of the site to support UTF-8.

[/quote]

It is indeed set to: Default Character Set - Western European (ISO). What impact would changing this setting to UTF-8 have, apart from fixing the above? Is the conversion of the site you talk about just making the switch?

nnhubbard · October 18, 2010, 5:17pm

[quote]
It is indeed set to: Default Character Set - Western European (ISO). What impact would changing this setting to UTF-8 have, apart from fixing the above? Is the conversion of the site you talk about just making the switch?

[/quote]

All special characters on your website will turn into a strange ? character, so you will have to go through every page, every metadata field, and fix those. We did this change, but it took about a week to clean up all the pages.

uoebusiness · October 18, 2010, 6:34pm

[quote]
All special characters on your website will turn into a strange ? character, so you will have to go through every page, every metadata field, and fix those. We did this change, but it took about a week to clean up all the pages.

[/quote]

Ouch. Oh well if needs must, my team are going to be impressed when I tell them this.

nnhubbard · October 18, 2010, 8:43pm

[quote]
Ouch. Oh well if needs must, my team are going to be impressed when I tell them this.

[/quote]

Before the switch I used the system find and replace tool to try and replace as many special characters as I could, such as MS Word "smart quotes".

uoebusiness · October 28, 2010, 1:04pm

We have made the switch in Matrix and cleaned up the odd characters that appeared on the sites, fortunately not too many. However we have a problem with remote content. If there is a non-breaking space (&nbsp;) in the code then it also gets assigned the diamond with a ? in it. Why would it not recognise a standard HTML string like that?

Also our Postgres server is not running UTF-8 so will this have an impact? Our sys admin is away on a beach somewhere so it will be a while before we can get that converted.

Anthony_Barnes · October 28, 2010, 10:46pm

[quote]
We have made the switch in Matrix and cleaned up the odd characters that appeared on the sites, fortunately not too many. However we have a problem with remote content. If there is a non-breaking space (&nbsp;) in the code then it also gets assigned the diamond with a ? in it. Why would it not recognise a standard HTML string like that?

Also our Postgres server is not running UTF-8 so will this have an impact? Our sys admin is away on a beach somewhere so it will be a while before we can get that converted.

[/quote]

Did you change both the Matrix system configuration and your design to support the UTF-8 charset? It needs to be updated in both spots. The database, in all likelihood, will be SQL_ASCII, which will support both character sets. You won't need to change that; it's only the data itself that needs to be fixed (which it sounds like you have already done).
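One possible explanation for the diamond, offered as a guess rather than anything confirmed in this thread: if the remote page is served as ISO-8859-1 and contains the raw non-breaking space byte (0xA0) rather than the `&nbsp;` entity, that byte is not valid on its own in UTF-8, and a UTF-8 decoder substitutes U+FFFD (the diamond with a ? in it). A minimal sketch using the standard `TextEncoder`/`TextDecoder` objects; the variable names are only for illustration:

```javascript
// U+00A0 (non-breaking space) is the single byte 0xA0 in ISO-8859-1,
// but UTF-8 encodes it as two bytes: 0xC2 0xA0.
var utf8Bytes = Array.from(new TextEncoder().encode("\u00A0"));

// Simulate ISO-8859-1 content ("a", raw 0xA0, "b") being read as UTF-8.
// A lone 0xA0 is an invalid sequence, so the decoder emits U+FFFD for it.
var isoBytes = new Uint8Array([0x61, 0xA0, 0x62]);
var decodedAsUtf8 = new TextDecoder("utf-8").decode(isoBytes);

console.log(utf8Bytes);                          // the two UTF-8 bytes
console.log(decodedAsUtf8 === "a\uFFFDb");       // whether the diamond appeared
```

If that is what is happening, the remote content would need to be converted to UTF-8 (or have the character written as an entity) before Matrix displays it.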

I've done a fair bit of investigation with this character set issue and it appears to be a function of the HTTP Request object for browsers. Everything is posted using UTF-8, no matter if you try and inject your own headers into the request. This means that we can't really fix it for EES, but we can do some sneaky things to work around it.
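To illustrate the UTF-8 point: jQuery's `serialize()` builds its output with the browser's `encodeURIComponent`, which always percent-encodes the UTF-8 bytes of a character regardless of the charset the page was served in. A small sketch (the `samples` object is just for illustration) showing where the serialised strings handled by the plugin below come from:

```javascript
// encodeURIComponent emits the UTF-8 byte sequence of each character,
// no matter what charset the surrounding page declares.
var samples = {
    rightSingleQuote: encodeURIComponent("\u2019"), // MS Word "smart" apostrophe
    leftSingleQuote:  encodeURIComponent("\u2018"),
    enDash:           encodeURIComponent("\u2013")
};

console.log(samples.rightSingleQuote); // %E2%80%99
console.log(samples.leftSingleQuote);  // %E2%80%98
console.log(samples.enDash);           // %E2%80%93
```

A system expecting ISO-8859-1 receives those three bytes per character and has no single-byte character to map them to, which is consistent with the truncated content reported at the top of the thread.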

There is a plugin (which requires more testing on systems with actual character set issues) that will convert some invalid characters in the serialised POST requests on the fly. I've tested this locally for some of the characters reported as issues by clients, and it seems to work fine as a temporary measure during the transition from Western ISO to UTF-8.

    
    /**
     * Plugin to replace characters known to cause issues between character
     * set changes in Matrix (particularly the old default of ISO-8859-1).
     * Functions that intercept data posted back via ajax do some replacements before
     * passing them on.
     *
     * 
     *
     * @copyright 2010 Squiz Pty Ltd (ABN 77 084 670 600)
     * @license   GNU General Public License v.2
     * @version   SVN: $Id: EasyEditCharSetPlugin.js 722 2010-10-22 03:45:00Z abarnes $
     * @link
     */
    
    EasyEdit.plugins.charSet = {
    init: function() {
        var self = this;
        
        // Process for serialised form data
        EasyEditEventManager.bind('EasyEditBeforeLoad',self.rewriteSerialised);
        
        // Process for any data posted via js api
        // (This will run for every JS API call - may have performance impact)
        // Uncomment this line only if required.
        //EasyEditEventManager.bind('EasyEditBeforeLoad',self.rewriteMakeRequest);
    },
    
    /**
     * Converts entities posted either as
        - HTML entity equivalents
        - Invalid characters
        - Serialised invalid characters
     * Into format that will avoid issues with systems that previously supported
     * iso-8859-1. This function does a best guess replacement for known characters
     * and a ? replacement for unknowns.
     */
    convertEntities: function(data)
    {
        // HTML Entity replacements
        // Eg, "&#8211;" will be replaced with "-"
        var replacements = {
            "8211":     "-",
            "8217":     "'",
            "8216":     "'"
        };
        
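        // Assumed reconstruction: the original loop body was lost from the post.
        // Per the docblock above, known codes use the table and unknown
        // codes become "?".
        data = data.replace(/&#([0-9]+);/g, function(match, code) {
            return replacements.hasOwnProperty(code) ? replacements[code] : "?";
        });
    
        // Replace small tilde and non-breaking space with a plain space
        // (no "|" inside the character class: it would match a literal pipe)
        data = data.replace(/[\u02DC\u00A0]/g, " ");
    
        // Replace serialised equivalents (global flag so every occurrence is caught)
        data = data.replace(/%E2%80%99/g, "'"); // Serialised right single quote
        data = data.replace(/%E2%80%98/g, "'"); // Serialised left single quote
        data = data.replace(/%E2%80%93/g, "-"); // Serialised en dash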
        
        return data;
    },
    
    /**
     * Rewrites the jQuery serialize function so all data is
     * filtered through the plugin function after it has been serialized.
     */
    rewriteSerialised: function()
    {      
        (function(oldSerialize){
            jQuery.fn.serialize = function(data){
                var newData = EasyEdit.plugins.charSet.convertEntities(oldSerialize.call(this,data));
                return newData;
            };
        })(jQuery.fn.serialize);
    
    },
    
    /**
     * Rewrites the JS API makeRequest function so it can pre-process the 
     * URL parameter to strip out any invalid characters.
     *
     * This function is generally not required, but the option exists to use it
     */
    rewriteMakeRequest: function()
    {
        (function(oldMakeRequest){
            window.makeRequest = function(url, receive, dataCallback) {
                var newUrl = EasyEdit.plugins.charSet.convertEntities(url);
                oldMakeRequest.apply(this,[newUrl,receive,dataCallback]);
            };
        })(makeRequest);
    }
    };

Caveat: This was only really made to deal with the 3 problem characters highlighted to us; the serialise rewrite would probably need extra code added to handle other invalid characters.
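One way the serialised replacements might be extended is to drive them from a table and let `encodeURIComponent` generate the percent-encoded forms, rather than hard-coding each one. This is only a sketch under assumptions: `asciiFallbacks` and `replaceSerialisedCharacters` are hypothetical names that are not part of the plugin or of EES, and the table contents are examples to be adjusted for whatever characters cause trouble on a given system.

```javascript
// Hypothetical table of problem characters and their plain ASCII stand-ins.
var asciiFallbacks = {
    "\u2018": "'",   // left single quote
    "\u2019": "'",   // right single quote
    "\u201C": "\"",  // left double quote
    "\u201D": "\"",  // right double quote
    "\u2013": "-",   // en dash
    "\u2014": "-"    // em dash
};

// Replaces every percent-encoded occurrence of each table key in an
// already-serialised string with the percent-encoded fallback.
function replaceSerialisedCharacters(serialised, fallbacks) {
    Object.keys(fallbacks).forEach(function (character) {
        var encodedFrom = encodeURIComponent(character);
        var encodedTo = encodeURIComponent(fallbacks[character]);
        // split/join replaces all occurrences without needing regex escaping
        serialised = serialised.split(encodedFrom).join(encodedTo);
    });
    return serialised;
}

var example = replaceSerialisedCharacters(
    "title=It%E2%80%99s+2009%E2%80%932010&q=%E2%80%9Chi%E2%80%9D",
    asciiFallbacks
);
console.log(example); // title=It's+2009-2010&q=%22hi%22
```

The fallback is itself passed through `encodeURIComponent` so that characters such as the double quote stay valid inside the serialised string.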