What's the fastest way to scan a very large file in Java?

11.4k views

Imagine I have a very large text file. Performance really matters.

All I want to do is scan it for a certain string. Maybe I want to count how many times it occurs, but that is not really the point.

The point is: what's the fastest way?

I don't care about maintainability; it needs to be fast.

Fast is key.

8

There are 8 answers

2
Joel On BEST ANSWER

For a one-off search, use a Scanner, as suggested here:

A simple technique that could well be considerably faster than indexOf() is to use a Scanner, with the method findWithinHorizon(). If you use a constructor that takes a File object, Scanner will internally make a FileChannel to read the file. And for pattern matching it will end up using a Boyer-Moore algorithm for efficient string searching.
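A minimal sketch of the Scanner approach described above, assuming the file is UTF-8 text and the search string should be treated as a literal. The class and method names (`ScannerSearch`, `countOccurrences`) are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Scanner;
import java.util.regex.Pattern;

final class ScannerSearch {
    /** Counts non-overlapping occurrences of a literal string in a text file. */
    static long countOccurrences(Path file, String needle) throws IOException {
        // Pattern.quote turns the needle into a literal, so regex
        // metacharacters such as '.' are not interpreted.
        Pattern pattern = Pattern.compile(Pattern.quote(needle));
        long count = 0;
        // Assumption: the file is encoded as UTF-8.
        try (Scanner scanner = new Scanner(file, "UTF-8")) {
            // A horizon of 0 means "no bound": keep searching to end of input.
            // Each successful call advances the scanner past the match.
            while (scanner.findWithinHorizon(pattern, 0) != null) {
                count++;
            }
        }
        return count;
    }
}
```

One thing to keep in mind: with a horizon of 0 the Scanner may buffer a lot of input while it looks for the next match, so for a truly huge file with few matches it is worth measuring memory use as well as time.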

1
Qwerky On

Load the whole file into memory and then use a string searching algorithm such as Knuth-Morris-Pratt.
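A rough sketch of that idea, working on raw bytes and assuming the needle can be encoded as UTF-8. The names (`KmpSearch`, `failureTable`, `countInFile`) are illustrative only. Note that `Files.readAllBytes` returns a single array, so this only works for files that fit in one Java array (under about 2 GB) and in the heap.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class KmpSearch {
    /** fail[i] = length of the longest proper prefix of pattern[0..i] that is also a suffix of it. */
    static int[] failureTable(byte[] pattern) {
        int[] fail = new int[pattern.length];
        int k = 0;
        for (int i = 1; i < pattern.length; i++) {
            while (k > 0 && pattern[i] != pattern[k]) k = fail[k - 1];
            if (pattern[i] == pattern[k]) k++;
            fail[i] = k;
        }
        return fail;
    }

    /** Counts occurrences of pattern in text, overlapping ones included. */
    static long count(byte[] text, byte[] pattern) {
        if (pattern.length == 0) return 0;
        int[] fail = failureTable(pattern);
        long count = 0;
        int k = 0; // number of pattern bytes currently matched
        for (byte b : text) {
            while (k > 0 && b != pattern[k]) k = fail[k - 1];
            if (b == pattern[k]) k++;
            if (k == pattern.length) {
                count++;
                k = fail[k - 1]; // continue, allowing overlapping matches
            }
        }
        return count;
    }

    /** Loads the whole file into memory and searches it. */
    static long countInFile(Path file, String needle) throws IOException {
        byte[] text = Files.readAllBytes(file);
        return count(text, needle.getBytes(StandardCharsets.UTF_8));
    }
}
```

The text is scanned exactly once and never backs up, which is the property that makes KMP attractive here.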

Edit:
A quick Google search turns up this string searching library, which seems to implement a few different string search algorithms. Note that I've never used it, so I can't vouch for it.

3
Michael Borgwardt On

First of all, use NIO (FileChannel) rather than the java.io classes. Second, use an efficient string search algorithm such as Boyer-Moore.
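One way this could look, as a sketch rather than a tuned implementation: read the file through a FileChannel in chunks and search each chunk with Boyer-Moore-Horspool, a simplified Boyer-Moore variant that uses only the bad-character shift. The names (`ChannelSearch`, `shiftTable`, `countInFile`) and the `chunkSize` parameter are assumptions for the example. The last m-1 bytes of each chunk are carried over so that matches spanning a chunk boundary are not missed.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

final class ChannelSearch {
    /** Horspool shift table: how far the window may move, keyed by its last byte. */
    static int[] shiftTable(byte[] pat) {
        int[] shift = new int[256];
        Arrays.fill(shift, pat.length);
        for (int i = 0; i < pat.length - 1; i++) {
            shift[pat[i] & 0xFF] = pat.length - 1 - i;
        }
        return shift;
    }

    /** Counts matches (overlapping included) of pat in buf[0..limit). */
    static long countInBuffer(byte[] buf, int limit, byte[] pat, int[] shift) {
        int m = pat.length;
        long count = 0;
        int pos = 0;
        while (pos + m <= limit) {
            int j = m - 1;
            while (j >= 0 && buf[pos + j] == pat[j]) j--; // compare right to left
            if (j < 0) count++;
            pos += shift[buf[pos + m - 1] & 0xFF];
        }
        return count;
    }

    static long countInFile(Path file, String needle, int chunkSize) throws IOException {
        byte[] pat = needle.getBytes(StandardCharsets.UTF_8);
        int m = pat.length;
        if (m == 0) return 0;
        int[] shift = shiftTable(pat);
        byte[] buf = new byte[Math.max(chunkSize, m) + m - 1];
        ByteBuffer bb = ByteBuffer.wrap(buf);
        long count = 0;
        int carried = 0; // bytes kept from the end of the previous chunk
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            while (true) {
                bb.clear();
                bb.position(carried); // new data goes after the carried bytes
                int n = ch.read(bb);
                if (n < 0) break;
                int limit = carried + n;
                count += countInBuffer(buf, limit, pat, shift);
                // Fewer than m bytes are carried, so no match is counted twice.
                carried = Math.min(m - 1, limit);
                System.arraycopy(buf, limit - carried, buf, 0, carried);
            }
        }
        return count;
    }
}
```

A chunk size somewhere in the range of tens of kilobytes to a few megabytes is a reasonable starting point; the best value depends on the machine, so measure.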

If you need to search through the same file multiple times for different strings, you'll want to construct some kind of index, so take a look at Lucene.

0
Aravind Yarram On

Use the right tool: a full-text search library

My suggestion is to build an in-memory index (or a file-based index with caching enabled) and then perform the search on it. As @Michael Borgwardt suggested, Lucene is the best library out there.

0
Adriaan Koster On

I don't know if this is a stupid suggestion, but isn't grep a pretty efficient file searching tool? Maybe you can call it using Runtime.getRuntime().exec(..)
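For what it's worth, here is a sketch of calling grep from Java. It uses ProcessBuilder rather than Runtime.exec (my substitution, since it makes argument handling easier) and assumes a POSIX-style `grep` is on the PATH. The class and method names are made up. Note that `grep -c` counts matching lines, not individual occurrences.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

final class GrepSearch {
    /** Returns the number of lines containing the literal needle, as reported by grep -c. */
    static long countMatchingLines(Path file, String needle)
            throws IOException, InterruptedException {
        // -c: print only the count of matching lines
        // -F: treat the needle as a fixed string, not a regex
        // -e: the next argument is the pattern, even if it starts with '-'
        // --: everything after this is a file name
        ProcessBuilder pb = new ProcessBuilder(
                "grep", "-c", "-F", "-e", needle, "--", file.toString());
        pb.redirectErrorStream(true); // fold error messages into the same stream
        Process process = pb.start();
        String output;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            output = reader.readLine();
        }
        int exit = process.waitFor();
        // grep exits with 0 if lines matched, 1 if none matched, >1 on error.
        if (exit > 1 || output == null) {
            throw new IOException("grep failed (exit " + exit + "): " + output);
        }
        return Long.parseLong(output.trim());
    }
}
```

This ties the program to the platform, so it only makes sense where grep is known to be installed.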

0
Richard H On

It depends on whether you need to do more than one search per file. If you need to do just one search, read the file in from disk and parse it using the tools suggested by Michael Borgwardt. If you need to do more than one search, you should probably build an index of the file with a tool like Lucene: read the file in, tokenise it, and store the tokens in the index. If the index is small enough, keep it in RAM (Lucene gives you the option of a RAM-based or disk-backed index). If not, keep it on disk. And if it is too large for RAM and you are very, very, very concerned about speed, store your index on a solid-state/flash drive.
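To make the "read, tokenise, index" idea concrete, here is a toy in-memory inverted index in plain Java. This is not Lucene and does not use its API; it is only a sketch of the principle, with invented names (`TinyIndex`, `linesContaining`), a deliberately crude tokeniser, and UTF-8 assumed. It answers whole-word lookups only, not arbitrary substring searches.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class TinyIndex {
    // token -> 1-based numbers of the lines on which the token occurs
    private final Map<String, List<Integer>> postings = new HashMap<>();

    /** Reads the file once and records, for every token, the lines it appears on. */
    static TinyIndex build(Path file) throws IOException {
        TinyIndex index = new TinyIndex();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            int lineNo = 0;
            while ((line = reader.readLine()) != null) {
                lineNo++;
                // Crude tokeniser: lower-case, split on runs of non-word characters.
                for (String token : line.toLowerCase(Locale.ROOT).split("\\W+")) {
                    if (token.isEmpty()) continue;
                    List<Integer> lines =
                            index.postings.computeIfAbsent(token, t -> new ArrayList<>());
                    // Record each line at most once per token.
                    if (lines.isEmpty() || lines.get(lines.size() - 1) != lineNo) {
                        lines.add(lineNo);
                    }
                }
            }
        }
        return index;
    }

    /** Looks a word up in the index; no file I/O happens here. */
    List<Integer> linesContaining(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptyList());
    }
}
```

Building the index costs one full pass over the file, so it only pays off when the same file is searched repeatedly.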

1
Please treat your mods well. On

Whatever the specifics, memory-mapped I/O is usually the answer.

Edit: depending on your requirements, you could try importing the file into an SQL database and then querying it through JDBC.

Edit2: this thread at JavaRanch has some other ideas involving FileChannel. I think it might be exactly what you are looking for.
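A sketch of what memory-mapped searching could look like. A single mapping in Java is limited to Integer.MAX_VALUE bytes, so for a very large file this maps one window at a time, with m-1 extra bytes at the end of each window so matches that straddle two windows are still found. The names (`MappedSearch`, `windowSize`) are illustrative, and the inner comparison is a plain byte-by-byte check to keep the example short; a smarter search algorithm could be dropped in.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedSearch {
    /** Counts occurrences (overlapping included) of needle, mapping the file window by window. */
    static long countInFile(Path file, String needle, long windowSize) throws IOException {
        byte[] pat = needle.getBytes(StandardCharsets.UTF_8);
        int m = pat.length;
        if (m == 0) return 0;
        if (windowSize < 1 || windowSize > Integer.MAX_VALUE - m) {
            throw new IllegalArgumentException("windowSize out of range: " + windowSize);
        }
        long count = 0;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            for (long start = 0; start < size; start += windowSize) {
                // Map m-1 bytes beyond the window so boundary matches are visible.
                long length = Math.min(windowSize + m - 1, size - start);
                MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, start, length);
                // Only match starts inside [0, windowSize) belong to this window,
                // so every file offset is examined exactly once.
                int starts = (int) Math.min(windowSize, length - m + 1);
                for (int i = 0; i < starts; i++) {
                    int j = 0;
                    while (j < m && map.get(i + j) == pat[j]) j++;
                    if (j == m) count++;
                }
            }
        }
        return count;
    }
}
```

In practice the window would be something large, such as 1L << 30; the tiny windows are only useful for checking the boundary handling.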

0
Kellindil On

I'd say the fastest you can get is to use a BufferedInputStream on top of a FileInputStream, or to use custom buffers if you want to avoid the BufferedInputStream instantiation.

This will explain it better than I can: http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/