I've hit a limit when indexing PDF files in Windows Search, specifically on the array size of the System.Keywords property. Everything works fine up to 20 tags, but any tags beyond that aren't included in the index.
My first instinct was to see what the IFilter was capturing, and using filtdump.exe I got the following output:
CHUNK: ---------------------------------------------------------------
Attribute = {F29F85E0-4FF9-1068-AB91-08002B27B3D9}\5 (System.Keywords)
idChunk = 3
BreakType = 0 (No Break)
Flags (chunkstate) = (Value)
Locale = 0 (0x0)
IdChunkSource = 0
cwcStartSource = 0
cwcLenSource = 0
VALUE: ---------------------------------------------------------------
Type = 31 (0x1f), VT_LPWSTR
Value = "TAG1; TAG2; TAG3; TAG4; TAG5; TAG6; TAG7; TAG8; TAG9; TAG10; TAG11; TAG12; TAG13; TAG14; TAG15; TAG16; TAG17; TAG18; TAG19; TAG20; TAG21"
So I could see that the IFilter was retrieving all 21 tags, which means the final tag was being dropped somewhere further down the indexing pipeline.
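To double-check the count, here is a minimal Python sketch that splits the VT_LPWSTR value from the dump on semicolons. The helper name split_keywords is my own, and I'm assuming the semicolon is the only delimiter in play.

```python
def split_keywords(value: str) -> list[str]:
    """Split a semicolon-delimited keyword string into trimmed, non-empty tags."""
    return [tag.strip() for tag in value.split(";") if tag.strip()]


# Rebuild the same string that filtdump.exe reported: "TAG1; TAG2; ...; TAG21"
filtdump_value = "; ".join(f"TAG{i}" for i in range(1, 22))

tags = split_keywords(filtdump_value)
print(len(tags))   # prints 21
print(tags[-1])    # prints TAG21
```

So the 21st tag is definitely present in what the filter emits.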
Dumping the property schema for System.Keywords, I got the following:
Property Key: {F29F85E0-4FF9-1068-AB91-08002B27B3D9} 5
Canonical Name: System.Keywords
Property Type: VT_VECTOR | VT_LPWSTR
Display Name: Tags
Edit Invitation: Add a tag
Type Flags: PDTF_MULTIPLEVALUES | PDTF_CANGROUPBY | PDTF_CANSTACKBY | PDTF_ISTREEPROPERTY | PDTF_ISVIEWABLE | PDTF_ISSYSTEMPROPERTY
View Flags:
Default Column Width: 11
Display Type: PDDT_STRING
Column State: SHCOLSTATE_TYPE_STR
Grouping Range: PDGR_DISCRETE
Relative Desc. Type: PDRDT_GENERAL
Sort Description: PDSD_A_Z
Sort Desc. Labels: A on top/Z on top
Aggregation Type: PDAT_UNION
Condition Type: PDCOT_STRING
Condition Operation: COP_WORD_EQUAL
Enumerated Types: 0
Search Info Flags: PDSIF_ININVERTEDINDEX | PDSIF_ISCOLUMN | PDSIF_ISCOLUMNSPARSE
Column Index Type: <not specified>
Projection String: System.Keywords
Max Size: 512
Also, looking at the documentation for System.Keywords, there is no mention of a maximum size or a limit on the number of items.
The documentation does, however, mention a maxSize attribute:
Optional. Indicates the maximum size allowed for the property value stored in the Windows search database. This limit applies to the individual elements of a vector, not the vector as a whole. Values beyond this size are truncated. The default is "128" (bytes). Currently, Windows Search does not use the maxSize when calculating the amount of data it accepts from a file. Instead, the limit Windows Search uses is the product of the size of the file and the MaxGrowFactor (file size N * MaxGrowFactor) read from the registry at HKEY_LOCAL_MACHINE->Software->Microsoft->Windows Search->Gathering Manager->MaxGrowFactor. The default MaxGrowFactor is four (4). Consequently, if your file type tends to be small in total size but have larger properties, Windows Search may not accept all the property data you want to emit. However, you can increase the MaxGrowFactor to suit your needs.
However, it isn't clear to me whether this affects the number of elements in the array. I'm guessing that this truncation is occurring in the Gatherer component of Windows Search, so I'm wondering if there are any registry settings involved.
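To put a number on the MaxGrowFactor rule quoted above, here is a rough Python sketch of the arithmetic as I read it (file size N * MaxGrowFactor). The function names are hypothetical, and I'm assuming the limit is a simple byte count compared against the data emitted for the file, which the documentation doesn't spell out.

```python
DEFAULT_MAX_GROW_FACTOR = 4  # default value quoted in the documentation


def max_accepted_bytes(file_size_bytes: int,
                       max_grow_factor: int = DEFAULT_MAX_GROW_FACTOR) -> int:
    """Documented limit: file size N multiplied by MaxGrowFactor."""
    return file_size_bytes * max_grow_factor


def within_limit(file_size_bytes: int, emitted_bytes: int,
                 max_grow_factor: int = DEFAULT_MAX_GROW_FACTOR) -> bool:
    """Assumption: emitted data is accepted while it stays at or under the limit."""
    return emitted_bytes <= max_accepted_bytes(file_size_bytes, max_grow_factor)


# Example: a 100 KB PDF with the default factor
print(max_accepted_bytes(100 * 1024))  # prints 409600
```

If that reading is right, this limit scales with file size, so on its own it wouldn't explain a cutoff at exactly 20 elements.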
FWIW, I did look at the Windows Search database (Windows.edb) using the ESE Database View utility, and I could see from the schema that the column is a large binary type, so there shouldn't be a limitation there. Looking at the raw value, I could see the bytes for the tag values (separated by NUL characters) and terminated with an @ character. But there were only 20 values, not 21, confirming the limit.
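For anyone wanting to repeat that check, this is roughly how the raw column value could be split into tags in Python. It is only a sketch: I'm assuming the text is UTF-16LE, that the values are separated by a NUL character and that a single trailing @ ends the value. The function name parse_raw_keywords and the sample bytes are made up to mirror that layout; they are not a real dump from Windows.edb.

```python
def parse_raw_keywords(raw: bytes, encoding: str = "utf-16-le") -> list[str]:
    """Split a raw stored value into tags (assumed layout: NUL-separated, '@'-terminated)."""
    text = raw.decode(encoding)
    if text.endswith("@"):
        text = text[:-1]  # drop the assumed terminator
    # Ignore empty pieces, e.g. if a NUL also precedes the terminator
    return [value for value in text.split("\x00") if value]


# Made-up sample with the same shape as what I saw: 20 tags, then '@'
sample = ("\x00".join(f"TAG{i}" for i in range(1, 21)) + "@").encode("utf-16-le")

values = parse_raw_keywords(sample)
print(len(values))  # prints 20
```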
I've reached the end of my research, but I'm still no further along. Is it possible to extend the array size for System.Keywords or is it a hard-coded limit in the Gatherer component? Any help would be most appreciated, thanks in advance!