Normalizing SOLR records for sharding: _version_ issues


As a part of my DSpace instance, I have a SOLR repository containing 12 million usage statistics records. Some records have migrated through multiple SOLR upgrades and do not conform to the current schema. 5 million of these records are missing a unique id field specified in my schema.

The DSpace system provides a mechanism to shard older usage statistics records into a separate Solr shard, using the following code:

DSPACE SHARD LOGIC:

        for (File tempCsv : filesToUpload) {
            //Upload the data in the csv files to our new solr core
            ContentStreamUpdateRequest contentStreamUpdateRequest = new ContentStreamUpdateRequest("/update/csv");
            contentStreamUpdateRequest.setParam("stream.contentType", "text/plain;charset=utf-8");
            contentStreamUpdateRequest.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
            contentStreamUpdateRequest.addFile(tempCsv, "text/plain;charset=utf-8");

            statisticsYearServer.request(contentStreamUpdateRequest);
        }
        statisticsYearServer.commit(true, true);

When I ran this process, I received an error message for each record missing the unique id field, and those 5 million records were dropped by the process.

I have attempted to replace these 5 million records in order to force the creation of a unique id field on each record. Here is the code that I am running to trigger that update. The query myQuery iterates over batches of several thousand records.

MY RECORD REPAIR PROCESS:

    ArrayList<SolrInputDocument> idocs = new ArrayList<SolrInputDocument>();
    SolrQuery sq = new SolrQuery();
    sq.setQuery(myQuery);
    sq.setRows(MAX);
    sq.setSort("time", ORDER.asc);

    QueryResponse resp  = server.query(sq);
    SolrDocumentList list = resp.getResults();

    if (list.size() > 0) {
        for(int i=0; i<list.size(); i++) {
            SolrDocument doc = list.get(i);
            SolrInputDocument idoc = ClientUtils.toSolrInputDocument(doc);
            idocs.add(idoc);
        }           
    }

    server.add(idocs);
    server.commit(true, true);
    server.deleteByQuery(myQuery);
    server.commit(true, true);

After running this process, all of the records in the repository have a unique id assigned. The records that I have touched also have a _version_ field present.

When I attempt to re-run the sharding process that I included above, I receive an error related to the _version_ field value and the process terminates. If I attempt to set the version field explicitly, I receive the same error.

Here is the error message that I am encountering when I invoke the shard process:

Exception: version conflict for e8b7ba64-8c1e-4963-8bcb-f36b33216d69 expected=1484794833191043072 actual=-1
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: version conflict for e8b7ba64-8c1e-4963-8bcb-f36b33216d69 expected=1484794833191043072 actual=-1
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:424)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)

My goal is to repair my records so that I can run the shard process provided by DSpace. Can you recommend any additional action that I should take to repair these records?


There are 3 answers

0
schweerelos On BEST ANSWER

The sharding code in SolrLogger copies records into a new, empty core. The problem is that DSpace usage statistics documents from about DSpace 3 onwards contain a _version_ field, and this field is included in the copy during sharding.

When documents containing a _version_ field are added to a Solr index, this triggers Solr's optimistic concurrency functionality, which checks for an existing document with the same unique ID in the index. The logic goes roughly like this (see http://yonik.com/solr/optimistic-concurrency/):

  • _version_ > 1: Document version must exactly match
  • _version_ = 1: Document must exist
  • _version_ < 0: Document must not exist
  • _version_ = 0: Don't care (normal overwrite if exists)

The usage statistics documents containing a _version_ value > 1 thus make Solr look for an existing document with the same unique ID in the newly created year shard; however, clearly there is no such document at that point, hence the version conflict.
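To make the rules above concrete, here is a small stand-alone model of the check. It is only a sketch of the logic as listed, not Solr's actual implementation; the class name, method name and the `NOT_FOUND` constant are made up for illustration, and it assumes that `actual=-1` in the error message is how Solr reports that no document with that ID was found.

```java
/** Simplified model of the _version_ rules listed above; not Solr's own code. */
final class VersionCheckSketch {

    /** Hypothetical marker for "no document with this id exists" (assumed to match "actual=-1"). */
    static final long NOT_FOUND = -1L;

    /**
     * @param submitted _version_ value carried by the incoming document
     * @param existing  _version_ of the stored document with the same id, or NOT_FOUND
     * @return true if the update would be accepted under the rules above
     */
    static boolean accepts(long submitted, long existing) {
        boolean exists = existing != NOT_FOUND;
        if (submitted > 1) {
            return exists && submitted == existing; // version must match exactly
        } else if (submitted == 1) {
            return exists;                          // document must exist
        } else if (submitted < 0) {
            return !exists;                         // document must not exist
        }
        return true;                                // 0: no check, normal overwrite
    }

    public static void main(String[] args) {
        // A record copied into an empty year shard with its old _version_ is rejected ...
        System.out.println(accepts(1484794833191043072L, NOT_FOUND)); // false
        // ... while the same record without a version constraint is accepted.
        System.out.println(accepts(0L, NOT_FOUND));                   // true
    }
}
```

The first call mirrors the situation in the question: the incoming document carries a version greater than 1, the new core holds nothing under that ID, and the update is refused.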

The copy process during the sharding creates temporary CSV files that are then imported into the new core. Luckily, Solr's CSV update handler can be told to exclude specific fields from the import, using the skip parameter: https://wiki.apache.org/solr/UpdateCSV#skip

Changing the sharding code like so (the line marked + is the addition)

    //Upload the data in the csv files to our new solr core
    ContentStreamUpdateRequest contentStreamUpdateRequest = new ContentStreamUpdateRequest("/update/csv");
    contentStreamUpdateRequest.setParam("stream.contentType", "text/plain;charset=utf-8");
  + contentStreamUpdateRequest.setParam("skip", "_version_");
    contentStreamUpdateRequest.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
    contentStreamUpdateRequest.addFile(tempCsv, "text/plain;charset=utf-8");

skips the _version_ field, which in turn disables the optimistic concurrency check.

This is discussed in https://jira.duraspace.org/browse/DS-2212 with a pull request at https://github.com/DSpace/DSpace/pull/893; hopefully this will be included in DSpace 5.2.

1
Adán On

It should be easier to modify the generated CSV.

Try adding the id to the CSV directly: write a method that does this and call it before the file is queued for upload.

        FileUtils.copyInputStreamToFile(csvInputstream, csvFile);

        // <- call a method here that reopens the csv file and adds the mandatory id to each line

        filesToUpload.add(csvFile);

        //Add 10000 & start over again
        yearQueryParams.put(CommonParams.START, String.valueOf((i + 10000)));
    }

    for (File tempCsv : filesToUpload) {

    (...)
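A rough idea of what such a method could look like is sketched below. Everything in it is hypothetical: the class and method names are invented, the id column is assumed to be called `uid`, and the header of the exported CSV is assumed not to contain that column at all. It works on lines already read into memory and appends the new column at the end of each row, so it never has to parse the existing values, but it would still break on records whose quoted values span several lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.UUID;

/** Hypothetical helper: appends a "uid" column with a fresh UUID to each CSV row. */
final class CsvUidSketch {

    /**
     * @param lines CSV content, header first, one record per line
     *              (assumption: no quoted values containing line breaks)
     * @return a new list with a trailing "uid" column, or an unchanged copy
     *         if the header already names a "uid" column
     */
    static List<String> addUid(List<String> lines) {
        List<String> out = new ArrayList<>();
        if (lines.isEmpty()) {
            return out;
        }
        String header = lines.get(0);
        // Field names are assumed not to contain commas, so a plain split is enough here.
        if (Arrays.asList(header.split(",", -1)).contains("uid")) {
            return new ArrayList<>(lines); // column already present; leave the data alone
        }
        out.add(header + ",uid");
        for (String row : lines.subList(1, lines.size())) {
            out.add(row + "," + UUID.randomUUID()); // one new random id per record
        }
        return out;
    }
}
```

The file itself could be read and rewritten with `java.nio.file.Files.readAllLines` and `Files.write` around this call.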

0
Havenless On

I was trying to upgrade 1.8.3 to 4.2 with 4 million records, all missing uid and version. I wrote a script to read from Solr (in batches of 10,000), write copies back in, and finally delete the originals. The results looked good until I tried sharding, when I saw the same issue reported here.

The CSV files contained correct version numbers. The exception report was

Exception: version conflict for 38dbd4db-240e-4c9b-a927-271fee5db750 expected=1490271991641407488 actual=-1
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: version conflict for 38dbd4db-240e-4c9b-a927-271fee5db750 expected=1490271991641407488 actual=-1

The first record in temp/temp.2012.0.csv begins

38dbd4db-240e-4c9b-a927-271fee5db750,1490271991641407488, ...