I am trying to read lines from a file that may be large.
To improve performance, I tried using a memory-mapped file. But when I compared the two, the mapped-file approach turned out to be slightly slower than reading with a BufferedReader.
    public long chunkMappedFile(String filePath, int trunkSize) throws IOException {
        long begin = System.currentTimeMillis();
        logger.info("Processing imei file, mapped file [{}], trunk size = {} ", filePath, trunkSize);
        //Create file object
        File file = new File(filePath);
        //Get file channel in readonly mode
        FileChannel fileChannel = new RandomAccessFile(file, "r").getChannel();
        long positionStart = 0;
        StringBuilder line = new StringBuilder();
        long lineCnt = 0;
        while (positionStart < fileChannel.size()) {
            long mapSize = positionStart + trunkSize < fileChannel.size() ? trunkSize : fileChannel.size() - positionStart;
            MappedByteBuffer buffer = fileChannel.map(FileChannel.MapMode.READ_ONLY, positionStart, mapSize); //mapped read
            for (int i = 0; i < buffer.limit(); i++) {
                char c = (char) buffer.get();
                //System.out.print(c); //Print the content of file
                if ('\n' != c) {
                    line.append(c);
                } else { // line ends
                    processor.processLine(line.toString());
                    if (++lineCnt % 100000 == 0) {
                        try {
                            logger.info("mappedfile processed {} lines already, sleep 1ms", lineCnt);
                            Thread.sleep(1);
                        } catch (InterruptedException e) {}
                    }
                    line = new StringBuilder();
                }
            }
            closeDirectBuffer(buffer);
            positionStart = positionStart + buffer.limit();
        }
        long end = System.currentTimeMillis();
        logger.info("chunkMappedFile {} , trunkSize: {}, cost : {} ", filePath, trunkSize, end - begin);
        return lineCnt;
    }
    public long normalFileRead(String filePath) throws IOException {
        long begin = System.currentTimeMillis();
        logger.info("Processing imei file, Normal read file [{}] ", filePath);
        long lineCnt = 0;
        try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
            String line;
            while ((line = br.readLine()) != null) {
                processor.processLine(line.toString());
                if (++lineCnt % 100000 == 0) {
                    try {
                        logger.info("file processed {} lines already, sleep 1ms", lineCnt);
                        Thread.sleep(1);
                    } catch (InterruptedException e) {}
                }
            }
        }
        long end = System.currentTimeMillis();
        logger.info("normalFileRead {} , cost : {} ", filePath, end - begin);
        return lineCnt;
    }
Test results on Linux, reading a 537 MB file:
MappedBuffer way:
2017-09-28 14:33:19.277 [main] INFO com.oppo.push.ts.dispatcher.imei2device.ImeiTransformerOfflineImpl - process imei file ends:/push/file/imei2device-local/20170928/imei2device-13 , lines :12758858 , cost :14804 , lines per seconds: 861852.0670089165
BufferedReader way:
2017-09-28 14:27:03.374 [main] INFO com.oppo.push.ts.dispatcher.imei2device.ImeiTransformerOfflineImpl - process imei file ends:/push/file/imei2device-local/20170928/imei2device-13 , lines :12758858 , cost :13001 , lines per seconds: 981375.1249903854
That is the thing: file IO isn't straightforward or easy.
You have to keep in mind that your operating system has a huge impact on what exactly is going to happen. In that sense: there are no solid rules that would work for all JVM implementations on all platforms.
When you really have to worry about the last bit of performance, doing in-depth profiling on your target platform is the primary solution.
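As a starting point for that kind of measurement, here is a minimal sketch of a timing harness. It is not the code from the question: the class name `ReadBenchmarkSketch`, its method names, the 64 MB chunk size and the five rounds are all assumptions made for illustration, and it only counts lines instead of calling `processor.processLine`. It runs both approaches several times in the same JVM, so that the first round (cold JIT, possibly cold page cache) can be told apart from the later ones.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical harness; all names here are illustrative, not from the question.
final class ReadBenchmarkSketch {

    // A task that may throw IOException, so both readers can be timed the same way.
    interface LineCounter {
        long count() throws IOException;
    }

    // Counts lines with a BufferedReader. ISO-8859-1 is an assumption that
    // maps every byte to one char, so no decoding error can occur.
    static long countLinesBuffered(Path path) throws IOException {
        long lines = 0;
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.ISO_8859_1)) {
            while (reader.readLine() != null) {
                lines++;
            }
        }
        return lines;
    }

    // Counts '\n' bytes through read-only mappings of at most chunkSize bytes each.
    // Note: a final line without a trailing '\n' is not counted here, while
    // readLine() above would count it.
    static long countLinesMapped(Path path, int chunkSize) throws IOException {
        long lines = 0;
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = channel.size();
            for (long pos = 0; pos < size; pos += chunkSize) {
                long len = Math.min(chunkSize, size - pos);
                MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (buf.hasRemaining()) {
                    if (buf.get() == '\n') {
                        lines++;
                    }
                }
            }
        }
        return lines;
    }

    // Returns the wall-clock time of one run in milliseconds.
    static long timeMillis(LineCounter counter) throws IOException {
        long start = System.nanoTime();
        counter.count();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws IOException {
        Path path = Paths.get(args[0]);
        int chunkSize = 64 * 1024 * 1024; // arbitrary choice for this sketch
        for (int round = 0; round < 5; round++) {
            long buffered = timeMillis(() -> countLinesBuffered(path));
            long mapped = timeMillis(() -> countLinesMapped(path, chunkSize));
            System.out.printf("round %d: buffered=%d ms, mapped=%d ms%n", round, buffered, mapped);
        }
    }
}
```

A loop like this is still crude. For numbers you intend to act on, a dedicated benchmarking tool and repeated runs on the actual target machine are the safer route.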
Beyond that, you are getting the "performance" aspect wrong. Memory-mapped IO doesn't magically speed up reading a single file once within one application. Its major advantages go along this path:
(quoted from this answer on using the C mmap() system call)
In other words: your example is about reading a file's contents. In the end, the OS still has to go to the drive and read all the bytes from there; that is, it reads the disk content and puts it in memory. When you do that for the first time, it really doesn't matter that you do some "special" things on top of that. On the contrary: because you do "special" things, the memory-mapped approach might even be slower, due to its overhead compared to an "ordinary" read.
And coming back to my first point: even if you had 5 processes reading the same file, the memory-mapped approach wouldn't necessarily be faster. Linux might figure: I already read that file into memory, and it didn't change. So even without explicit "memory mapping", the Linux kernel might cache the file's contents.
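One way to observe that caching effect yourself is to read the same file twice with ordinary stream IO and compare the timings. The sketch below assumes a hypothetical class `PageCacheSketch` and a 64 KB read buffer, neither of which comes from the question. On a machine with enough free memory the second pass is often faster, but that is not guaranteed, and the file may already be cached before the first pass starts.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical demo class; the name and buffer size are illustrative only.
final class PageCacheSketch {

    // Reads the whole file with plain stream IO and returns the number of bytes read.
    static long readAll(Path path) throws IOException {
        byte[] buf = new byte[1 << 16];
        long total = 0;
        try (InputStream in = Files.newInputStream(path)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path path = Paths.get(args[0]);
        // Two passes over the same file, with no memory mapping involved.
        for (int pass = 1; pass <= 2; pass++) {
            long start = System.nanoTime();
            long bytes = readAll(path);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("pass %d: %d bytes in %d ms%n", pass, bytes, millis);
        }
    }
}
```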