PDF failing to copy to public folder correctly

joel · 2015-05-07 06:25:31 UTC

I've got a PDF file that is failing to publish correctly.

The initial asset creation goes as expected, and I can find the file in /data/private/assets/pdf_file/XXXX/XXXXX

I set the file live, and try to download it... nada. Looking in its public folder /data/public/assets/pdf_file/XXXX/XXXXX the file is present, however it's an empty file.

No errors in the Matrix logs. I'm pretty sure it's some sort of issue with the file in question, as other PDF documents work just fine.

Anyone got any suggestions for the next step in troubleshooting this? I've 'fixed' it for now by just copying the file directly to the public folder, but I'd like to figure out what's going on.
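In case other files are affected the same way, one quick check is to list zero-byte files under the public folder. This is only a rough sketch: `find_empty_files` is an illustrative name, not a Matrix tool, and the example path is the one from this post.

```shell
# find_empty_files is a hypothetical helper name, not part of Matrix.
# It lists regular files of size zero under the given directory.
find_empty_files() {
    find "$1" -type f -size 0
}

# Example (path taken from the post above):
#   find_empty_files /data/public/assets/pdf_file
```

Any file it prints exists in the public folder but has no content, which matches the symptom described here.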

nnhubbard · 2015-05-07 06:33:07 UTC

Is it password protected? If so, do you have the External Tool installed for indexing password-protected PDF files? We had some issues with this in the past.

Bart · 2015-05-07 23:16:42 UTC

How was the PDF file created? Via admin interface, Edit+, or asset builder on the front end?

joel · 2015-05-08 01:25:19 UTC

Nic - Yeah, the file has some restrictions placed on it, which seem to be the cause of the problem.

Bart - Edit+ initially, but same result using the admin interface.

I think I've found why nothing is being logged though:

> pdftohtml 445-Section1100-Processing-150401.pdf
Error: Copying of text from this document is not allowed.
> echo $?
0

pdf_file.inc line 129:

if ($status != '0') {

pdftohtml returns a status code of 0 even though the conversion failed, so Matrix can't raise an error based on the status code. Newer versions of pdftohtml are supposed to return proper status codes. The version installed on the Matrix system (rh6) is 0.12.4, but even the version that comes with CentOS 7 (0.22.5) seems to return 0 regardless of success:

> pdftohtml heldgfdg
I/O Error: Couldn't open file 'heldgfdg': No such file or directory.
> echo $?
0
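Since the exit status can't be trusted here, one possible workaround is to also look at what the tool prints. A minimal shell sketch, assuming the failure text ("Error: ...", "I/O Error: ...") shows up on stdout or stderr as in the runs above. `run_checked` is a hypothetical helper name, not something in Matrix or poppler, and matching on the word "Error" is deliberately crude.

```shell
# run_checked is a hypothetical helper, not part of Matrix or poppler.
# It runs the given command and reports failure if either
#   - the exit status is non-zero, or
#   - the captured output (stdout + stderr) contains "Error".
run_checked() {
    out_file=$(mktemp) || return 2
    "$@" >"$out_file" 2>&1
    status=$?
    if [ "$status" -eq 0 ] && grep -q 'Error' "$out_file"; then
        status=1
    fi
    cat "$out_file"      # pass the captured output through to the caller
    rm -f "$out_file"
    return "$status"
}

# Example:
#   run_checked pdftohtml 445-Section1100-Processing-150401.pdf \
#       || echo "conversion failed"
```

The same idea would apply on the PHP side: check the captured output as well as the status before treating the conversion as successful.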

That said - I'm curious why pdftohtml was used in the first place. Given that strip_tags is run on the result anyway, why not just use pdftotext?

Bart · 2015-05-11 01:06:04 UTC

Yeah, not sure why pdftohtml was chosen over pdftotext originally, but in the upcoming 5.2.2.0 release we've added support for Apache Tika (https://tika.apache.org/download.html) to handle indexing of both Word and PDF files.

joel · 2015-05-11 01:12:57 UTC

"in the next upcoming 5.2.2.0 release we've added support for Apache Tika (https://tika.apache.org/download.html) to handle indexing of both word and pdfs."

Ah, that looks like it ought to simplify the indexing of files.