Character Encoding


(Mike Hatcher) #1

Hi


I appear to have got myself into a bit of a state with character encoding and I need some advice on what to do…



This is what my situation used to be:

Matrix 3.10.3 sending Western ISO (I presume this is ISO 8859-1?)

My parse file created as ISO 8859-1

My meta tag saying it was UTF 8.

Don't ask how I got myself so mixed up because it's just plain embarrassing :unsure:



My situation now:

Matrix is still Western ISO (I tried changing to UTF 8 and it screwed up all my characters - Pound & Euro currency signs mostly)

I converted my parse file to UTF-8 encoding

My meta tag is still UTF 8



The problem/issue:

  1. How do I convert all my assets in Matrix into UTF 8 without my entities going nuts?
  2. If I change Matrix's encoding setup will all assets from then on be UTF 8?
  3. Are there any other issues that I would need to be aware of?
  4. Please, please can someone else post and let me know I'm not alone in my stupidity and they have done this too :stuck_out_tongue:



Thanks

Mike

(Ben Caldwell) #2

Hi saffamike,


I take it from your post that you want to move everything over to UTF-8. Unfortunately, if you have any extended-ASCII characters (e.g. pound sign, cent sign, euro sign, accented letters) in your ISO 8859-1 encoded content, you are going to run into trouble.



Your content is encoded according to the system's ‘Default Character Set’ setting. Changing this setting to UTF-8 does not convert existing content: when you then view it, you are technically just viewing ISO 8859-1 data (assuming the content was created or last submitted using ISO 8859-1) interpreted as UTF-8.
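As a rough illustration outside Matrix (plain Python, nothing Matrix-specific is assumed), this is what happens when bytes written as ISO 8859-1 are read back as UTF-8. The sample string is made up for the example.

```python
# Illustration only: the same text stored as ISO 8859-1 and as UTF-8.
text = "£100"
latin1_bytes = text.encode("iso-8859-1")  # pound sign is the single byte 0xA3
utf8_bytes = text.encode("utf-8")         # pound sign is the two bytes 0xC2 0xA3


def decode_as_utf8(data: bytes) -> str:
    # errors="replace" substitutes U+FFFD for bytes that are not valid UTF-8,
    # which is roughly what a browser shows as a "broken character" glyph.
    return data.decode("utf-8", errors="replace")


misread = decode_as_utf8(latin1_bytes)       # stored as ISO 8859-1, read as UTF-8
correct = latin1_bytes.decode("iso-8859-1")  # read with the encoding it was stored in

print(repr(misread))
print(repr(correct))
```

The stored bytes never change here; only the label used to interpret them does, which is why switching the setting alone corrupts the display.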



The reason your currency symbols etc. are not displaying correctly is that the single bytes ISO 8859-1 uses for extended-ASCII characters are not valid sequences in UTF-8. The solution to this is to go to your content and re-enter the symbol. Even better, if you use an HTML Entity (such as &pound; or &euro;), you will be able to switch between encodings without concern for corruption. You can see a complete list of HTML Entities here.
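A minimal sketch of the entity idea in Python, assuming you have the text available as a string outside Matrix. The function name `symbols_to_entities` and the default symbol list are made up for this example; `html.entities.codepoint2name` is the standard-library table of named entities.

```python
from html.entities import codepoint2name


def symbols_to_entities(text: str, symbols: str = "£€¢") -> str:
    """Replace only the listed symbols with named HTML entities."""
    out = []
    for ch in text:
        if ch in symbols and ord(ch) in codepoint2name:
            # e.g. "£" (code point 163) becomes "&pound;"
            out.append(f"&{codepoint2name[ord(ch)]};")
        else:
            out.append(ch)
    return "".join(out)


print(symbols_to_entities("Price: £10 or €12"))
```

The entity itself is plain ASCII, so its bytes are identical in ISO 8859-1 and UTF-8. Accented letters are deliberately left alone by this sketch.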



It may be possible to use the Global Search and Replace tool to do this. You could possibly use the invalid representation of the currency symbol as the search and the HTML Entity as the replacement (you would need to make sure that your system ‘Default Character Set’ is set to the encoding in which the characters are represented incorrectly). I have not tested this.



Also, you need to make sure that your design does not attempt to set the character set in the content-type header (if it has one). Matrix will send the correct character set as part of the HTTP Headers.
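One way to spot a mismatch between the HTTP header and a meta tag in the design is to compare the two charsets for a page. This is only a sketch: the helper names are hypothetical, the regular expression is a rough match rather than a real HTML parser, and the sample header and markup are invented to mirror the situation in post #1.

```python
import re
from email.message import Message


def header_charset(content_type: str):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None if absent


def meta_charset(html_text: str):
    """Rough search for a charset declared in a <meta> tag."""
    m = re.search(r'<meta[^>]*charset\s*=\s*["\']?([\w-]+)', html_text, re.IGNORECASE)
    return m.group(1).lower() if m else None


# Invented sample values: server says ISO 8859-1, design says UTF-8.
header = "text/html; charset=ISO-8859-1"
page = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'

if header_charset(header) != meta_charset(page):
    print("Mismatch:", header_charset(header), "vs", meta_charset(page))
```

When the two disagree, browsers generally follow the HTTP header, so a leftover meta tag mostly just misleads whoever reads the source.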



Hope this helps.


(Mike Hatcher) #3

Hi Ben


Thanks for the advice here and for the link to the entity reference sheet. Another quick question…



I have some other entities on my home page that link to other languages. I had a look at the link you sent me and spent quite a while searching for the entities for the characters below. I have also included the html below. Please can you point me in the right direction.



Français

Český

Português

Español

Русско

한국어

普通□

日本語


    


Thanks
Mike

(Rayn Ong) #4

You might find this conversion tool useful: http://minutillo.com/steve/convert/.


And I think the second last should be one of the following, instead of 普通□. Simplified Chinese: 普通话; Traditional Chinese: 普通話


(Ben Caldwell) #5

You should not enter these characters as HTML Entities.


Note: If you do enter these as HTML Entities using the HTML Source plugin of the WYSIWYG Content Type, they will be automatically converted to their character equivalents.
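To show what "converted to their character equivalents" means in general terms, here is a small Python sketch of the round trip between characters and numeric character references. It uses only the standard library and says nothing about how Matrix performs the conversion internally.

```python
import html

name = "한국어"

# Characters -> numeric character references (pure ASCII output).
as_refs = name.encode("ascii", "xmlcharrefreplace").decode("ascii")

# Numeric character references -> characters again.
back = html.unescape(as_refs)

print(as_refs)
print(back == name)
```

Since the references decode back to exactly the same characters, entering them by hand gains nothing once the page is genuinely served as UTF-8.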



Currently, only symbols are converted to their entity equivalents. This is done so that the Search Manager indexes them as searchable words.