Indexing a document in Elasticsearch, Java API


We are indexing resume documents using the Elasticsearch Java API. It works fine: when we search for a keyword, it returns the correct response (the documents that contain that keyword).

But we want to index the documents more deeply. For example, a resume has 'Skills', and each skill has a 'Skill Months' value; a skill's months value might be 13 in the document. So when we search for that skill and constrain skill months to between 10 and 15 in the Elasticsearch query, we want that record (document) back.

How can we do this?

Here is the indexing code:

IndexResponse response = client
        .prepareIndex(userName, document.getType(), document.getId())
        .setSource(extractDocument(document))
        .execute()
        .actionGet();

    public XContentBuilder extractDocument(Document document) throws IOException, NoSuchAlgorithmException {
        // Extracting content with Tika
        int indexedChars = 100000;
        Metadata metadata = new Metadata();

        String parsedContent;
        try {
            // Set the maximum length of strings returned by the parseToString method, -1 sets no limit
            parsedContent = tika().parseToString(new BytesStreamInput(
                Base64.decode(document.getContent().getBytes()), false), metadata, indexedChars);
        } catch (Throwable e) {
            // Log once at a visible level instead of duplicating the message on stdout
            logger.warn("Failed to extract [" + indexedChars + "] characters of text for [" + document.getName() + "]", e);
            parsedContent = "";
        }

        XContentBuilder source = jsonBuilder().startObject();

        if (logger.isTraceEnabled()) {
            source.prettyPrint();
        }

        // File
        source
            .startObject(FsRiverUtil.Doc.FILE)
            .field(FsRiverUtil.Doc.File.FILENAME, document.getName())
            .field(FsRiverUtil.Doc.File.LAST_MODIFIED, new Date())
            .field(FsRiverUtil.Doc.File.INDEXING_DATE, new Date())
            .field(FsRiverUtil.Doc.File.CONTENT_TYPE, document.getContentType() != null ? document.getContentType() : metadata.get(Metadata.CONTENT_TYPE))
            .field(FsRiverUtil.Doc.File.URL, "file://" + (new File(".", document.getName())).toString());

        if (metadata.get(Metadata.CONTENT_LENGTH) != null) {
            // We try to get CONTENT_LENGTH from Tika first
            source.field(FsRiverUtil.Doc.File.FILESIZE, metadata.get(Metadata.CONTENT_LENGTH));
        } else {
            // Otherwise, we use our byte[] length
            source.field(FsRiverUtil.Doc.File.FILESIZE, Base64.decode(document.getContent().getBytes()).length);
        }
        source.endObject(); // File

        // Path
        source
            .startObject(FsRiverUtil.Doc.PATH)
            .field(FsRiverUtil.Doc.Path.ENCODED, SignTool.sign("."))
            .field(FsRiverUtil.Doc.Path.ROOT, ".")
            .field(FsRiverUtil.Doc.Path.VIRTUAL, ".")
            .field(FsRiverUtil.Doc.Path.REAL, (new File(".", document.getName())).toString())
            .endObject(); // Path

        // Meta
        source
            .startObject(FsRiverUtil.Doc.META)
            .field(FsRiverUtil.Doc.Meta.AUTHOR, metadata.get(Metadata.AUTHOR))
            .field(FsRiverUtil.Doc.Meta.TITLE, metadata.get(Metadata.TITLE) != null ? metadata.get(Metadata.TITLE) : document.getName())
            .field(FsRiverUtil.Doc.Meta.DATE, metadata.get(Metadata.DATE))
            .array(FsRiverUtil.Doc.Meta.KEYWORDS, Strings.commaDelimitedListToStringArray(metadata.get(Metadata.KEYWORDS)))
            .endObject(); // Meta


        // Doc content
        source.field(FsRiverUtil.Doc.CONTENT, parsedContent);

        // Doc as binary attachment
        source.field(FsRiverUtil.Doc.ATTACHMENT, document.getContent());

        // End of our document
        source.endObject();

        return source;
    }
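
What we would like, conceptually, is to index the skills as structured data alongside the extracted text, something like the sketch below (Skill, document.getSkills(), getName() and getMonths() are hypothetical here; they do not exist in our code yet):

    // Hypothetical sketch: index each skill as a sub-object so that
    // "months" can later be queried as a number.
    source.startArray("skills");
    for (Skill skill : document.getSkills()) { // assumed accessor, for illustration only
        source.startObject()
              .field("name", skill.getName())     // e.g. "Java"
              .field("months", skill.getMonths()) // e.g. 13
              .endObject();
    }
    source.endArray();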

The code below is used to get the search response:

QueryBuilder qb;
if (query == null || query.trim().length() <= 0) {
    qb = QueryBuilders.matchAllQuery();
} else {
    qb = QueryBuilders.queryString(query); // query is a name or search string
}

org.elasticsearch.action.search.SearchResponse searchHits = node.client()
        .prepareSearch()
        .setIndices("ankur")
        .setQuery(qb)
        .setFrom(0).setSize(1000)
        .addHighlightedField("file.filename")
        .addHighlightedField("content")
        .addHighlightedField("meta.title")
        .setHighlighterPreTags("<span class='badge badge-info'>")
        .setHighlighterPostTags("</span>")
        .addFields("*", "_source")
        .execute().actionGet();

1 Answer

Answered by Thamizharasu:

Elasticsearch indexes all fields by default to provide better search capabilities. Before you put your JSON documents under a type, it is worth defining your mappings (refer: https://www.elastic.co/guide/en/elasticsearch/guide/current/mapping-analysis.html).
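
For example, a mapping for a resume type could declare the skills as a nested object with a numeric months field. This is a minimal sketch, assuming your index is named "ankur", the type is named "resume", and the field names skills/name/months from the question; adjust them to your real names:

    // Sketch: define the mapping before indexing any document of this type
    XContentBuilder mapping = jsonBuilder()
        .startObject()
            .startObject("resume")                      // mapping type (assumed name)
                .startObject("properties")
                    .startObject("skills")
                        .field("type", "nested")        // each skill is its own sub-document
                        .startObject("properties")
                            .startObject("name")
                                .field("type", "string")
                                .field("index", "not_analyzed") // keep the exact keyword
                            .endObject()
                            .startObject("months")
                                .field("type", "integer")       // numeric, so range queries work
                            .endObject()
                        .endObject()
                    .endObject()
                .endObject()
            .endObject()
        .endObject();

    client.admin().indices().preparePutMapping("ankur")
            .setType("resume")
            .setSource(mapping)
            .execute().actionGet();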

When you want to search by an exact keyword, you may need to skip analysis for a particular field: by default, field values are analyzed while a document is indexed, and only then indexed. You can tell Elasticsearch not to do this by marking the field "not_analyzed"; its value is then indexed as-is. This way you get more precise search results.
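
With a mapping like the one above in place, the "skill with 10 to 15 months" search from the question can be expressed as a nested query that combines a term query on the not_analyzed skill name with a range query on the months. Again a sketch, assuming the field names above:

    // Sketch: find documents whose "Java" skill has between 10 and 15 months
    QueryBuilder skillQuery = QueryBuilders.nestedQuery("skills",
            QueryBuilders.boolQuery()
                    .must(QueryBuilders.termQuery("skills.name", "Java")) // exact skill name
                    .must(QueryBuilders.rangeQuery("skills.months").gte(10).lte(15)));

    SearchResponse response = node.client()
            .prepareSearch("ankur")
            .setQuery(skillQuery)
            .execute().actionGet();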

As for building your JSON documents, it is a good idea to use a library to construct the JSON. I prefer the Jackson library for working with JSON documents; it will reduce the lines of code in your project.
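
For instance, with Jackson the skills structure can be built from plain Java objects instead of hand-written builder calls. A sketch, where the Resume and Skill classes are illustrative:

    // Illustrative POJOs; Jackson serializes public fields by default
    public class Skill {
        public String name;
        public int months;
    }

    public class Resume {
        public String filename;
        public List<Skill> skills;
    }

    // ... then, to index:
    ObjectMapper mapper = new ObjectMapper();        // com.fasterxml.jackson.databind.ObjectMapper
    String json = mapper.writeValueAsString(resume); // POJO -> JSON string
    client.prepareIndex("ankur", "resume", document.getId())
            .setSource(json)
            .execute().actionGet();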