Search indexing of OCR-scanned PDFs

I've just noticed that the search functionality doesn't index the contents of PDF files that have been scanned with OCR (optical character recognition).


Only the words in the PDF that are real text, and not originally from a scanned object, appear in results. Any idea if/when this is going to be supported? Or should I already be doing something to enable indexing of the OCR parts?



The pdftohtml and antiword options are enabled on the system.

If pdftohtml can't read the contents, we can't index them. It seems it can't recognise the text in your case, so you're never going to be able to get Matrix to search it. You could try running pdftohtml on the command line just to be sure, though.
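
If it helps, here is a rough sketch of that command-line check wrapped in a small Python script. It is not how Matrix calls pdftohtml internally; it only shows whether pdftohtml gets any text out of a given file. The flags used (-stdout, -noframes, -i) are assumed from the poppler build of pdftohtml and may differ on your install, and the function names and the 20-character threshold are made up for illustration.

```python
import re
import subprocess
from html import unescape


def visible_text(html: str) -> str:
    """Strip markup and collapse whitespace, leaving only the words in the HTML."""
    # Drop script/style/title blocks first (the title usually just holds the file name).
    html = re.sub(r"(?is)<(script|style|title)\b.*?</\1>", " ", html)
    # Replace every remaining tag with a space, then decode entities such as &nbsp;.
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return " ".join(unescape(text).split())


def pdf_to_html(pdf_path: str) -> str:
    """Run pdftohtml and return its HTML output as a string."""
    # Assumed flags (poppler build): write to stdout, single page, ignore images.
    result = subprocess.run(
        ["pdftohtml", "-stdout", "-noframes", "-i", pdf_path],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout


def report(pdf_path: str, min_chars: int = 20) -> str:
    """Say whether pdftohtml found a usable amount of text (threshold is arbitrary)."""
    text = visible_text(pdf_to_html(pdf_path))
    if len(text) < min_chars:
        return f"{pdf_path}: almost no text extracted, so nothing to index"
    return f"{pdf_path}: {len(text)} characters of text extracted"
```

Calling print(report("scanned.pdf")) on one of the problem files (the file name here is only a placeholder) should tell you whether the scanned pages come out as text at all. If they don't, the search index won't see them either.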


If you want more advanced searching over documents, you might want to consider a dedicated search provider, like Funnelback or Google.

That answered the question


Thanks very much