We are indexing resume documents using the Elasticsearch Java API. It works fine: when we search for a keyword, it returns the correct documents containing that keyword.
However, we want to index the documents in a more structured ("deep") way. For example, a resume has 'Skills' and, for each skill, a number of 'Skill Months' (say 13 months). We want to search for a skill and restrict the skill months to a range, for example between 10 and 15 months, and get back only the documents that match.
How can we do this?
Here is the code used for indexing:
IndexResponse response = client
        .prepareIndex(userName, document.getType(), document.getId())
        .setSource(extractDocument(document))
        .execute()
        .actionGet();
public XContentBuilder extractDocument(Document document) throws IOException, NoSuchAlgorithmException {
    // Extracting content with Tika
    int indexedChars = 100000;
    Metadata metadata = new Metadata();
    String parsedContent;

    try {
        // Set the maximum length of strings returned by the parseToString method, -1 sets no limit
        parsedContent = tika().parseToString(new BytesStreamInput(
                Base64.decode(document.getContent().getBytes()), false), metadata, indexedChars);
    } catch (Throwable e) {
        logger.debug("Failed to extract [" + indexedChars + "] characters of text for [" + document.getName() + "]", e);
        System.out.println("Failed to extract [" + indexedChars + "] characters of text for [" + document.getName() + "]" + e);
        parsedContent = "";
    }

    XContentBuilder source = jsonBuilder().startObject();
    if (logger.isTraceEnabled()) {
        source.prettyPrint();
    }

    // File
    source
        .startObject(FsRiverUtil.Doc.FILE)
        .field(FsRiverUtil.Doc.File.FILENAME, document.getName())
        .field(FsRiverUtil.Doc.File.LAST_MODIFIED, new Date())
        .field(FsRiverUtil.Doc.File.INDEXING_DATE, new Date())
        .field(FsRiverUtil.Doc.File.CONTENT_TYPE, document.getContentType() != null ? document.getContentType() : metadata.get(Metadata.CONTENT_TYPE))
        .field(FsRiverUtil.Doc.File.URL, "file://" + (new File(".", document.getName())).toString());

    if (metadata.get(Metadata.CONTENT_LENGTH) != null) {
        // We try to get CONTENT_LENGTH from Tika first
        source.field(FsRiverUtil.Doc.File.FILESIZE, metadata.get(Metadata.CONTENT_LENGTH));
    } else {
        // Otherwise, we use our byte[] length
        source.field(FsRiverUtil.Doc.File.FILESIZE, Base64.decode(document.getContent().getBytes()).length);
    }
    source.endObject(); // File

    // Path
    source
        .startObject(FsRiverUtil.Doc.PATH)
        .field(FsRiverUtil.Doc.Path.ENCODED, SignTool.sign("."))
        .field(FsRiverUtil.Doc.Path.ROOT, ".")
        .field(FsRiverUtil.Doc.Path.VIRTUAL, ".")
        .field(FsRiverUtil.Doc.Path.REAL, (new File(".", document.getName())).toString())
        .endObject(); // Path

    // Meta
    source
        .startObject(FsRiverUtil.Doc.META)
        .field(FsRiverUtil.Doc.Meta.AUTHOR, metadata.get(Metadata.AUTHOR))
        .field(FsRiverUtil.Doc.Meta.TITLE, metadata.get(Metadata.TITLE) != null ? metadata.get(Metadata.TITLE) : document.getName())
        .field(FsRiverUtil.Doc.Meta.DATE, metadata.get(Metadata.DATE))
        .array(FsRiverUtil.Doc.Meta.KEYWORDS, Strings.commaDelimitedListToStringArray(metadata.get(Metadata.KEYWORDS)))
        .endObject(); // Meta

    // Doc content
    source.field(FsRiverUtil.Doc.CONTENT, parsedContent);

    // Doc as binary attachment
    source.field(FsRiverUtil.Doc.ATTACHMENT, document.getContent());

    // End of our document
    source.endObject();
    return source;
}
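To make the skills searchable by a month range (the requirement above), the skill entries would have to be written into the source as structured fields, not only as part of the extracted text. The sketch below is only an illustration of how that could look, added just before the final source.endObject(); it assumes a hypothetical document.getSkills() that returns skill-name/months pairs (java.util.Map), and the field names "skills", "name" and "months" are my own choice, not part of the existing code:

// Hypothetical sketch: index each skill together with its months as a structured entry.
// Assumes document.getSkills() returns a Map<String, Integer> of skill name -> months.
source.startArray("skills");
for (Map.Entry<String, Integer> skill : document.getSkills().entrySet()) {
    source.startObject()
          .field("name", skill.getKey())      // e.g. "java"
          .field("months", skill.getValue())  // e.g. 13
          .endObject();
}
source.endArray();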
The code below is used to get the search response:
QueryBuilder qb;
if (query == null || query.trim().length() <= 0) {
    qb = QueryBuilders.matchAllQuery();
} else {
    qb = QueryBuilders.queryString(query); // query is a name or other search string
}

org.elasticsearch.action.search.SearchResponse searchHits = node.client()
        .prepareSearch()
        .setIndices("ankur")
        .setQuery(qb)
        .setFrom(0).setSize(1000)
        .addHighlightedField("file.filename")
        .addHighlightedField("content")
        .addHighlightedField("meta.title")
        .setHighlighterPreTags("<span class='badge badge-info'>")
        .setHighlighterPostTags("</span>")
        .addFields("*", "_source")
        .execute().actionGet();
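With the skills indexed as structured fields (as in the sketch above), the 10-to-15-months requirement could then be expressed by combining a term query on the skill name with a range query on the months. This is only a sketch and assumes the illustrative field names "skills.name" and "skills.months" introduced earlier:

// Sketch: find resumes that mention the skill "java" with 10 to 15 months of experience.
QueryBuilder skillQuery = QueryBuilders.boolQuery()
        .must(QueryBuilders.termQuery("skills.name", "java"))
        .must(QueryBuilders.rangeQuery("skills.months").from(10).to(15));

org.elasticsearch.action.search.SearchResponse skillHits = node.client()
        .prepareSearch("ankur")
        .setQuery(skillQuery)
        .execute().actionGet();

If the skills objects were mapped as a nested type (so each name stays paired with its own months value), the bool query would additionally need to be wrapped in QueryBuilders.nestedQuery("skills", ...).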
Elasticsearch indexes all fields by default to provide better search capabilities. Before you put your JSON documents under a type, it is a good idea to define your mappings (refer: https://www.elastic.co/guide/en/elasticsearch/guide/current/mapping-analysis.html).
When you want to search by an exact keyword, you may need to keep a particular field from being analyzed. While indexing a document, field values are analyzed and then indexed; you can tell Elasticsearch to index a field "not_analyzed", and its value will then be indexed as-is. This gives you better search results for exact matches.
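As a concrete illustration, a mapping along these lines could be put on the index before indexing any documents. This is only a sketch: the type name "resume" and the skills/name/months fields are assumptions, not taken from the code above, and "ankur" is the index name used in the search code:

// Sketch: mapping with a not_analyzed keyword field and an integer field for range queries.
XContentBuilder mapping = jsonBuilder()
    .startObject()
        .startObject("resume")                              // the mapping type
            .startObject("properties")
                .startObject("skills")
                    .startObject("properties")
                        .startObject("name")
                            .field("type", "string")
                            .field("index", "not_analyzed") // indexed as-is, for exact matches
                        .endObject()
                        .startObject("months")
                            .field("type", "integer")       // numeric, so range queries work
                        .endObject()
                    .endObject()
                .endObject()
            .endObject()
        .endObject()
    .endObject();

client.admin().indices()
      .preparePutMapping("ankur")
      .setType("resume")
      .setSource(mapping)
      .execute().actionGet();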
As for defining your JSON document, it would also be good to use a library for building the JSON. I prefer the Jackson library for working with JSON documents; it will reduce the lines of code in your project.
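For example, with Jackson the document could be described as plain POJOs and serialized in one call, and the prepareIndex call then takes the JSON string as its source. The ResumeDoc/Skill classes below are hypothetical and only meant to show the idea:

import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Arrays;
import java.util.List;

// Hypothetical POJOs describing the resume structure (not taken from the original code).
public class ResumeDoc {
    public String filename;
    public String content;
    public List<Skill> skills;

    public static class Skill {
        public String name;
        public int months;

        public Skill(String name, int months) {
            this.name = name;
            this.months = months;
        }
    }

    // Build the JSON source for one resume; pass the result to prepareIndex(...).setSource(json).
    public static String toJson(String filename, String content) throws Exception {
        ResumeDoc doc = new ResumeDoc();
        doc.filename = filename;
        doc.content = content;
        doc.skills = Arrays.asList(new Skill("java", 13), new Skill("elasticsearch", 8));
        return new ObjectMapper().writeValueAsString(doc);
    }
}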