PDF Text Direction


How is text direction for right-to-left languages, like Arabic, encoded in PDF? My understanding is that since PDF is fundamentally a graphical format, text direction doesn't really need to be encoded. Rather, the glyphs simply need to be painted on the page from right to left. However, the PDF reference manual mentions an attribute called WritingMode, where you can specify combinations of left-to-right, right-to-left, top-to-bottom, and bottom-to-top.

So my questions are:

(1) If my understanding is correct, and RTL or LTR is merely expressed by the way the glyphs are painted on-screen, what is the point of the WritingMode attribute?

(2) If there is no actual directionality information encoded in the PDF file, other than the order in which the glyphs are painted, how does a PDF-to-Text program know whether a given line is supposed to be read right-to-left or left-to-right? (I suppose the PDF-to-Text program could just check whether the Unicode codepoints extracted from a ToUnicode map fall into a range that corresponds to an RTL language.)
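To make the idea in (2) concrete, here is a minimal sketch of what I have in mind, assuming the text has already been mapped to Unicode through the ToUnicode map. The function name `guess_direction` is my own invention, not something from any PDF library. Instead of hard-coding codepoint ranges, it uses the bidirectional class that Python's `unicodedata` module reports for each character and takes a simple majority vote, which is a rough heuristic and not the full Unicode bidirectional algorithm.

```python
import unicodedata

def guess_direction(text):
    """Guess the reading direction of one extracted line (hypothetical helper).

    Counts characters with a strong bidirectional class: "R" and "AL"
    (Hebrew, Arabic and similar scripts) vote for right-to-left, "L" votes
    for left-to-right. Digits, spaces and punctuation are weak or neutral
    and do not vote.
    """
    rtl = sum(1 for ch in text if unicodedata.bidirectional(ch) in ("R", "AL"))
    ltr = sum(1 for ch in text if unicodedata.bidirectional(ch) == "L")
    if rtl == 0 and ltr == 0:
        return "neutral"
    return "rtl" if rtl > ltr else "ltr"
```

A line made up only of digits and spaces comes back as "neutral", so a real extractor would presumably have to fall back on the surrounding lines in that case.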


There are 2 answers

1
mark stephens

Text direction will be set in the text rendering matrix (Trm).
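For reference, the PDF specification defines Trm as the product of a text-state matrix (font size, horizontal scaling, rise), the text matrix Tm and the current transformation matrix CTM. The sketch below computes that product and reads off the device-space direction in which a positive horizontal glyph displacement moves. The helper names are hypothetical and the matrices are plain 3x3 lists in the row-vector convention PDF uses. Note that a Trm with a negative x scale would also mirror the glyph shapes, so take this as a way to inspect the matrix and not as a claim about how producers actually encode RTL text.

```python
def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def text_rendering_matrix(font_size, h_scale, rise, tm, ctm):
    """Trm = [Tfs*Th 0 0; 0 Tfs 0; 0 Trise 1] x Tm x CTM.

    h_scale is the horizontal scaling as a fraction (1.0 means 100%).
    A PDF matrix [a b c d e f] is written here as
    [[a, b, 0], [c, d, 0], [e, f, 1]].
    """
    params = [[font_size * h_scale, 0, 0],
              [0, font_size, 0],
              [0, rise, 1]]
    return matmul(matmul(params, tm), ctm)

def advance_vector(trm):
    """Device-space direction of a positive horizontal glyph displacement."""
    return (trm[0][0], trm[0][1])

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# Assumed example: a text matrix that flips the x axis and moves to (100, 700).
flipped_tm = [[-1, 0, 0], [0, 1, 0], [100, 700, 1]]

normal = text_rendering_matrix(12, 1.0, 0, identity, identity)
flipped = text_rendering_matrix(12, 1.0, 0, flipped_tm, identity)
```

With these inputs `advance_vector(normal)` points along positive x and `advance_vector(flipped)` points along negative x, which is the kind of signal an extractor could check.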

0
plinth

WritingMode is only for Tagged PDF, if I'm reading the spec correctly. If a PDF doesn't contain the appropriate logical structure, you don't get WritingMode.

The general answer, as I understand it, is "it depends". In R-L writing, you probably have the text advance information encoded in the font, so a single text placement advances the text to the right place. I say "probably" because the generating software might ignore this and place each glyph on its own, regardless of the text advance in the font. Then you get fun languages like Arabic and Hebrew, which aren't strictly R-L: numbers still run L-R within an R-L line.
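To illustrate the last point, here is a rough sketch of how an extractor might rebuild reading order when every glyph has been placed individually. It assumes you already have `(x, char)` pairs for one line that you have decided is R-L. The function name `logical_order` is made up for this example. It reads glyphs from right to left but keeps runs of digits in left-to-right order. It is deliberately simplified: it only handles bare digit runs (no decimal points, no embedded Latin words) and is nowhere near a full implementation of the Unicode bidirectional algorithm.

```python
import unicodedata

def logical_order(glyphs):
    """Return the reading-order string for one R-L line (hypothetical helper).

    glyphs: list of (x, char) pairs, where x is the glyph origin on the page.
    """
    # Visual order, rightmost glyph first.
    visual = sorted(glyphs, key=lambda g: g[0], reverse=True)
    out, digit_run = [], []
    for _, ch in visual:
        if unicodedata.bidirectional(ch) in ("EN", "AN"):
            # Digits are being visited right to left; hold them back.
            digit_run.append(ch)
        else:
            # Flush any pending digits in left-to-right order.
            out.extend(reversed(digit_run))
            digit_run = []
            out.append(ch)
    out.extend(reversed(digit_run))
    return "".join(out)

# Two Hebrew letters at the right, then a space, then "12" at the far left.
line = [(0, "1"), (10, "2"), (20, " "), (30, "\u05d1"), (40, "\u05d0")]
result = logical_order(line)
```

Here `result` holds the two letters, the space and then "12", which is the order a reader would follow even though the digits sit at the left edge of the line.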