How can I iteratively query a large MySQL dataset from within Solr's data importer?

790 views Asked by At

I have 256 tables in my production database. Each table has around 10,000,000 rows -- I can't get all of the rows from any one table in one SELECT query.

Actually I put the query-phases in solr/conf/data-config.xml just like this as a test; it has one table and I'm selecting all rows.

I use Solr (which is based on Java). And I trigger the indexing of the data like so:

curl --max-time 36000 -i localhost:portNum/solr/dataimport?command=full-import

This is what's in the XML:

<entity name="file" query="SELECT *, LOWER(REPLACE(sw_name, ' ', '')) packed_sw_name, CAST(group_id as char) group_id_s FROM tblTmp1"/>

But now, I want to get the parameter for table number(tblNum) for data-import and select rows between id ranges iteratively -- with dollar-sign variable substitution, the query would look like this:

<entity name="file" query="SELECT *, LOWER(REPLACE(sw_name, ' ', '')) packed_sw_name, CAST(group_id as char) group_id_s FROM tblTmp1_$tblNumwhere id >=$startSnand id<$endSn" />

How can I query the database iteratively, and in which file should I set this up?

1

There are 1 answers

0
fish2000 On

I don't know if this is the best place to write the SQL query, if you want to add that kind of logic.

If you do use the importer as configured, you'll want to use a cursor to do your selects in a loop -- putting the results into a temporary table for Solr. I don't know how that works w/r/t MySQL syntax, but as I recall it's similar (but not identical) to other SQL dialects -- here is an example of a cursor query in PostgreSQL's PL/PgSQL language:

CREATE FUNCTION urge_to_merge( ) RETURNS INTEGER AS '

    DECLARE 
        pacman              CURSOR FOR SELECT * FROM forsale_fsuserprofileimage;
        pellet              forsale_fsuserprofileimage%ROWTYPE;



    BEGIN 

        OPEN pacman;
        LOOP
            FETCH pacman INTO pellet;
            EXIT WHEN NOT FOUND;

            UPDATE forsale_fsuserprofile 
            SET 
                image = pellet.image,
                w = pellet.w,
                h = pellet.h
            WHERE 
                id = pellet.userprofile_id;

        END LOOP;

        RETURN( 1 );

    END;


' LANGUAGE 'plpgsql';

I would suggest ditching the Solr stock data importer and writing a simple CLI program -- that way, you can control when your queries are executed. You can write it in Java, but you don't have to; with Solr's HTTP API you can use any other language, as long as it can make HTTP requests and MySQL API calls: Python, Ruby, bash scripts with curl, even PHP-CLI if that's your thing.

Many languages already have bindings to Solr's HTTP API you can use out-of-the-box (I can vouch for Python and Ruby in this regard) but here's an article with examples of how easy it is to use the API yourself: https://lucene.apache.org/solr/guide/7_7/searching.html#searching