Has anyone had any experience with the Structured File Import Tool?
I'm trying to use it to convert existing MS Word documents to HTML, but the import frequently fails with a fatal error caused by unrecognised characters. An HTML page created from scratch and checked to be well formed imports perfectly, but that isn't very useful. I suspect the extra junk that MS Word adds to the markup is what breaks the tool, but the manuals (http://manuals.matrix.squizsuite.net/tools/chapters/structured-file-import-tool) suggest converting from Word if you lack the required HTML skills. Should I be saving as a filtered HTML file, or are there any MS Word settings that produce cleaner output? Does it make a difference if the Word file has background images, header and footer content, or a table of contents?
Ultimately I'd like to have a guide I can pass out to system users so they can create pages from correctly marked-up Word files. If anyone has had any real-world success, I'd like to hear what you did to achieve it.
Using structured file import tool
What's the error you get when the import is breaking? Check data/private/error.log for details.
The tool already has an option to remove Microsoft Word tags, so the extra markup MS Word inserts in the HTML file should not be a problem.
There is also an option to run tidy on the HTML file, so I assume that should be fine too.
The tool will create image assets for you and link them properly in the content. There is an example in the manual.
This is the error that is shown. Sometimes if the asset is created, I get a fatal error thrown, but the text refers to a different file/line.
Fatal error: Uncaught exception 'Exception' with message 'Unable to update HIPO job vars due to database error: SQLSTATE[22021]: Character not in repertoire: 7 ERROR: invalid byte sequence for encoding "UTF8": 0xa9 HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".' in /var/www/matrix/core/hipo/hipo_job.inc:820 Stack trace: #0 /var/www/matrix/core/hipo/hipo_job.inc(1336): HIPO_Job->save() #1 /var/www/matrix/core/hipo/hipo_job.inc(1174): HIPO_Job->processWeb(Array, 'HIPO_Job_Struct...') #2 /var/www/matrix/core/hipo/hipo_herder.inc(236): HIPO_Job->process() #3 /var/www/matrix/core/include/mysource.inc(1691): HIPO_Herder->processWeb() #4 /var/www/matrix/core/include/mysource.inc(463): MySource->_processGlobalActions() #5 /var/www/matrix/core/include/init.inc(281): MySource->init() #6 /var/www/matrix/core/web/index.php(28): require_once('/var/www/matrix...') #7 {main} thrown in /var/www/matrix/core/hipo/hipo_job.inc on line 820
I've tried uploading with all the options checked to remove the Microsoft tags and tidy the HTML.
If there are images referenced in the uploaded file but no 'files' folder, could that be the reason for the error? I would have assumed I would just get broken images.
That error is caused by non-UTF-8 characters in the HTML file. The database expects UTF-8, so any byte that is not valid UTF-8 (0xa9 in your case) is rejected as an invalid byte sequence.
The tool doesn't have an option to convert characters for you, so you will have to either make sure the document is free of those characters or clean it up manually before importing.
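To track down which characters are the problem before importing, something like the following could help. It is only a rough sketch in Python and not part of Matrix; the function name `find_invalid_utf8` is made up for illustration. It reports the offset and value of every byte that fails UTF-8 decoding. A lone 0xa9, as in the error above, is the copyright sign in Latin-1/Windows-1252, which is not valid UTF-8 on its own.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, byte) pairs for bytes that are not valid UTF-8.

    Hypothetical helper for illustration; not part of Matrix.
    """
    problems = []
    pos = 0
    while pos < len(data):
        try:
            # Try to decode everything from the current position onward.
            data[pos:].decode("utf-8")
            break  # the rest of the data is clean
        except UnicodeDecodeError as exc:
            bad = pos + exc.start          # absolute offset of the bad byte
            problems.append((bad, data[bad]))
            pos = bad + 1                  # skip it and keep scanning
    return problems


if __name__ == "__main__":
    sample = b"Copyright \xa9 2010"
    for offset, byte in find_invalid_utf8(sample):
        print(f"offset {offset}: 0x{byte:02x}")
```

For a real file you would pass in the result of `open(path, "rb").read()`; an empty list means the file is already valid UTF-8.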
Hmm I thought it might be due to encoding and odd characters. Here is a link to a test document that I am trying to get to work, but I'm still getting the same error and I think it may be due to the bulleted list in the file. http://www.4shared.com/file/13gHG6Iu/docx.html
My process is:
- Markup MS Word document with the correct headings, list types, etc
- Save file as Web Page Filtered
- Upload file(s) to /var/www/html
- In Squiz click on Structured File Import Tool
- Select the file .htm
- Choose root node
- Split on heading one, and heading two
- Check all other options - HTML tidy, create CSS, etc
- Commit
But I am still getting the error shown in my previous post. Maybe someone can try the sample file I provided and see if they can get it to work, or let me know if there is a really simple fix that I just haven't spotted yet.
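One possible extra step between saving from Word and uploading would be to re-encode the file as UTF-8. The sketch below is Python and rests on an assumption: that Word saved the filtered HTML as Windows-1252 (check the `charset` in the file's meta tag to confirm). The function name `to_utf8` is hypothetical, and this has not been tested against the import tool itself.

```python
import re


def to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Re-encode HTML bytes as UTF-8 and update the declared charset.

    Assumes the input is Windows-1252 unless it already decodes as UTF-8.
    """
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8, leave untouched
    except UnicodeDecodeError:
        pass

    # Bytes undefined in the source encoding become U+FFFD instead of failing.
    text = raw.decode(source_encoding, errors="replace")

    # Point the first charset declaration at utf-8 so it matches the new bytes.
    text = re.sub(
        r"(charset\s*=\s*[\"']?)[\w-]+",
        r"\g<1>utf-8",
        text,
        count=1,
        flags=re.IGNORECASE,
    )
    return text.encode("utf-8")
```

Usage would be to read the saved .htm file in binary mode, pass the bytes through `to_utf8`, write the result back out in binary mode, and then import that copy instead of the original.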