I have a problem indexing PDF files.
I'm using Solr 8.11.1 and create an index from PDF files using DIH. Everything works well, except that the PDFs contain discretionary hyphens (soft hyphens). The PDFs were created in InDesign, and discretionary hyphens were inserted into some of the long words. For example, the word *uncopyrightable* is divided like this: *un-co-py-righ-tab-le* (each hyphen shows where a discretionary hyphen is). The word is not necessarily wrapped onto another line.
Because of this, I get several words in the index (*un*, *co*, *py*, *righ*, *tab*, *le*) instead of the single word *uncopyrightable*. The same happens with many other words, so I can't find these words in the index.
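To make the symptom concrete, here is a minimal Python sketch (not Solr code) of what seems to happen. The `tokenize` helper is a hypothetical stand-in for a whitespace-based analyzer, and the assumption that each soft hyphen arrives as a space is taken from the observations further down.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD, the discretionary (soft) hyphen


def tokenize(text):
    """Split on whitespace; a hypothetical stand-in for a real analyzer."""
    return text.split()


# The word as it is stored in the PDF, with a soft hyphen at each break point.
word = SOFT_HYPHEN.join(["un", "co", "py", "righ", "tab", "le"])

# Assumed behaviour of the extraction step: every soft hyphen becomes a space.
as_extracted = word.replace(SOFT_HYPHEN, " ")

# Desired behaviour: soft hyphens are dropped, leaving one word.
as_wanted = word.replace(SOFT_HYPHEN, "")

print(tokenize(as_extracted))  # six fragments
print(tokenize(as_wanted))     # one word
```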
In tika-data-config I tried to replace the character (Unicode U+00AD) with an empty string:

    <entity name="pdf" processor="TikaEntityProcessor"
            url="${file.fileAbsolutePath}" format="text"
            transformer="TemplateTransformer,RegexTransformer">
        <field column="text" regex="\u00AD" replaceWith="" sourceColName="text"/>
    </entity>

But this had no effect.
Then I tried this:

    <field column="text" regex="un co py righ tab le" replaceWith="777" sourceColName="text"/>

And I got 777 in the index.
So it seems the discretionary hyphen is turned into a space before the text ever reaches the transformers in tika-data-config, which would explain why the U+00AD regex never matches.
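One way to confirm which character actually separates the fragments is to dump the code points of the extracted text. This is a rough Python sketch that uses only the standard library. `describe_separators` is a hypothetical helper name, and the sample strings are stand-ins for text copied out of the extraction output.

```python
import unicodedata


def describe_separators(text):
    """Return (code point, Unicode name) for every non-alphanumeric character."""
    found = []
    for ch in text:
        if not ch.isalnum():
            found.append((f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")))
    return found


# Stand-in samples: one with a plain space, one with a real soft hyphen.
sample_with_space = "un co"
sample_with_soft_hyphen = "un\u00adco"

print(describe_separators(sample_with_space))
print(describe_separators(sample_with_soft_hyphen))
```

If the real extracted text reports `U+0020` between the fragments, the soft hyphen was already replaced upstream, and a regex on `\u00AD` has nothing left to match.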
How can this problem be solved?
For reference: if I open the PDF file in Adobe Reader and copy and paste the text into Word, no spaces appear. If I open it in PDF-XChange Viewer and paste the text into Word, spaces do appear. If I open it in Microsoft Edge, the hyphen positions show up as icons of a question mark in a diamond.
I have no way to fix the PDFs themselves, and there are a lot of them.