Import_asset_csv_to_matrix.php importing content with special characters throwing errors


(Nic Hubbard) #1

When importing content using the import_asset_csv_to_matrix.php script it works great until it reaches data in the .csv that contains a special character, such as £. It doesn't seem to be an easy task to replace all special characters since there could be other languages within the imported data.

 

Here is the error I am getting:

Fatal error: Uncaught exception 'Exception' with message: 'Cannot set metadata field #9224 to "Third US OK for Regeneron?s Eylea" for asset ID #9326 in context #0, DB error while inserting new values: "SQLSTATE[22021]: Character not in repertoire: 7 ERROR:  invalid byte sequence for encoding "UTF8": 0xcd73" [SYS0339]' in [SYSTEM_ROOT]/core/include/metadata_manager.inc(1128): setMetadata('9326...', Array)
#1 [SYSTEM_ROOT]/scripts/import/import_asset_csv_to_matrix.php(539): editMetadata(Object(News_Item), Array, Array, '9210...')
#2 [SYSTEM_ROOT]/scripts/import/import_asset_csv_to_matrix.php(312): createAsset(Array, 'news_item', Object(Folder), '9210...', Array, Array)
#3 [SYSTEM_ROOT]/scripts/import/import_asset_csv_to_matrix.php(757): importAssets('data-import.csv', 'news_item', '9209...', '9210...', Array, Array, false, '', '', Array)

Since the DB encoding is UTF8, is there any way around this issue? Or are we going to have to replace all special characters in the data with their HTML entities? Shouldn't the script be doing this?


(Marcus Fong) #2

The DB seems to be complaining that you have a non-UTF8 character in your input. What character is the question mark in "Third US OK for Regeneron?s Eylea"? I'm not having much luck searching for 0xcd73 - all I can find is references to a Korean character, which doesn't seem to quite match the surrounding text...

 

I believe we generally use "iconv" to perform character set conversion (for example, in the system_integrity_fix_char_encoding script) but it requires both the input and output encodings to be specified, so the import scripts don't attempt any kind of autoconversion.

 

If you do know what the source encoding is (since in this case your target would be UTF-8) then you could try running the "iconv" command-line utility directly on a copy of your CSV file and retrying the import. Although if you have a mix of different character encodings in your input, then that could be tricky - I'm not sure there's an easy solution in that case.


(Nic Hubbard) #3

The DB seems to be complaining that you have a non-UTF8 character in your input. What character is the question mark in "Third US OK for Regeneron?s Eylea"? I'm not having much luck searching for 0xcd73 - all I can find is references to a Korean character, which doesn't seem to quite match the surrounding text...

 

The text is: Third US OK for RegeneronÍs Eylea

 

The strange thing is, like I mentioned, that the pound sign, £ threw the same error. But I did a find and replace to replace it with its HTML entity. Strange that UTF8 isn't taking these characters.


(Nic Hubbard) #4

The DB seems to be complaining that you have a non-UTF8 character in your input. What character is the question mark in "Third US OK for Regeneron?s Eylea"? I'm not having much luck searching for 0xcd73 - all I can find is references to a Korean character, which doesn't seem to quite match the surrounding text...

 

I figured out the issue. It was Mac Excel again. It wasn't saving the .csv as UTF8. Had to use TextEdit to change the encoding. 


(Marcus Fong) #5

It is odd - you can see in the script that the underlying byte sequence is apparently "0xcd73", which isn't a UTF8 character...

 

...hmm, that's interesting, though. The UTF8 sequence is totally different (0xc38d), but that character is "0x00cd" in UTF16: http://www.fileformat.info/info/unicode/char/00cd/index.htm

 

I wonder if you've got UTF16 in there rather than UTF8, and the leading 0x00 is being interpreted as a UTF8 null character? 0x73 is the ASCII/UTF8 lowercase "s", so if the relevant bytes were:

 

00 cd 73

 

then that would make sense. What do you get if you run:

 

iconv -f UTF-16 -t UTF-8 filename.csv > filename_utf8.csv

 

and try importing that?


(Marcus Fong) #6

Oops - I got pulled away to look at something else in the middle of writing that post, and didn't see your reply until afterwards. Glad you sorted it out!