I've been trying to index a large number of documents (~200 million) into Solr, using pysolr to do the indexing. However, the Solr server keeps going down during indexing, and never at a consistent point: sometimes after 100 million documents have been indexed, sometimes after ~180 million. I'm not sure why this is happening. Is it because of the open file limit, i.e. is it related to the warning I get when starting the server with bin/solr start?
* [WARN] * Your open file limit is currently 1024. It should be set to 65000 to avoid operational disruption.
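To see what that limit actually is on my machine, I think something like the following works. It is only a sketch: it uses Python's standard `resource` module (Unix only), and it reports the limit of the Python process itself, so it only reflects what Solr gets under the assumption that Solr is started from the same shell and user. The names `open_file_limits`, `describe_limit` and `RECOMMENDED_LIMIT` are my own.

```python
import resource

# Value taken from the Solr startup warning quoted above.
RECOMMENDED_LIMIT = 65000


def open_file_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe_limit(value):
    """Render a limit as text, treating RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


soft, hard = open_file_limits()
print(f"soft limit: {describe_limit(soft)}, hard limit: {describe_limit(hard)}")

# The soft limit is the one that is enforced; it can be raised up to the hard limit.
if soft != resource.RLIM_INFINITY and soft < RECOMMENDED_LIMIT:
    print(f"soft limit is below the {RECOMMENDED_LIMIT} that Solr asks for")
```

On my server I would expect the soft limit to come out as 1024, matching the warning.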
I used multiprocessing while indexing, with chunks of 25000 documents (I also tried bigger chunks, and tried without multiprocessing, and it still crashed). Is it because too many requests are being sent to Solr? My Python code is below.
import concurrent.futures
import csv
import os
from glob import glob

import pysolr

# always_commit=True makes pysolr commit after every add() call
solr = pysolr.Solr('http://localhost:8983/solr/collection_name', always_commit=True)


def insert_into_solr(filepath):
    """ Inserts records into an empty solr index which has already been created."""
    record_number = 0
    list_for_solr = []
    with open(filepath, "r") as file:
        # Strip NUL characters before the lines reach csv.reader
        csv_reader = csv.reader((line.replace('\0', '') for line in file), delimiter='\t', quoting=csv.QUOTE_NONE)
        for paper_id, paper_reference_id, context in csv_reader:
            # int, int, string
            record_number += 1
            solr_record = {}
            solr_record['paper_id'] = paper_id
            solr_record['reference_id'] = paper_reference_id
            solr_record['context'] = context
            list_for_solr.append(solr_record)
            # Chunks of 25000
            if record_number % 25000 == 0:
                try:
                    solr.add(list_for_solr)
                except Exception as e:
                    print(e, record_number, filepath)
                list_for_solr = []
                print(record_number)
    # Send whatever is left over after the last full chunk
    try:
        solr.add(list_for_solr)
    except Exception as e:
        print(e, record_number, filepath)


def create_concurrent_futures():
    """ Uses all the cores to do the parsing and inserting"""
    folderpath = '.../'
    refs_files = glob(os.path.join(folderpath, '*.txt'))
    with concurrent.futures.ProcessPoolExecutor() as executor:
        executor.map(insert_into_solr, refs_files, chunksize=1)


if __name__ == '__main__':
    create_concurrent_futures()
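Stripped of the Solr calls, the batching that the code above does amounts to the following. This is a simplified sketch rather than my exact code: `chunked` is a helper name I made up for illustration (it is not part of pysolr), and the small chunk size is only there to make the behaviour easy to see.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            # Source exhausted: stop without yielding an empty chunk
            return
        yield chunk


# Each yielded list corresponds to one solr.add(...) call in the real code,
# where size would be 25000 instead of 3.
for chunk in chunked(range(7), 3):
    print(len(chunk), chunk)
```

With always_commit=True, each of these chunks means one add request followed by a commit, from every worker process.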
I read somewhere that a standard Solr installation has a hard limit of around 2.14 billion documents. Is it better to use SolrCloud (which I have never configured) when there are hundreds of millions of documents? Will it help with this problem? (I also have another file with 1.4 billion documents which needs to be indexed after this.) I have only one server, so is there any point in trying to configure SolrCloud?
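For reference, here is the arithmetic behind my worry about that limit. It is a rough sketch with two assumptions of mine: that the "around 2.14 billion" figure is roughly the largest signed 32-bit integer, and that both files would end up in the same index.

```python
# Assumption: the limit I read about is roughly the largest signed 32-bit integer.
APPROX_DOC_LIMIT = 2**31 - 1

# Approximate document counts from my two input files.
FIRST_FILE_DOCS = 200_000_000
SECOND_FILE_DOCS = 1_400_000_000

total_docs = FIRST_FILE_DOCS + SECOND_FILE_DOCS
headroom = APPROX_DOC_LIMIT - total_docs

print(f"approximate limit: {APPROX_DOC_LIMIT:,}")
print(f"total to index:    {total_docs:,}")
print(f"headroom:          {headroom:,}")
```

So on paper both files would fit in a single index, which is why I am unsure whether the document limit has anything to do with the crashes.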