I'm reading a text file with Apache Commons I/O and the CPU is pinned at 100% with a large file (23 GB, ~404 million lines). My code snippet is below:
try (LineIterator it = FileUtils.lineIterator(file1, "UTF-8")) {
    while (it.hasNext()) {
        String lineR = it.nextLine();
        // do something with line
        bytesRead += lineR.length();
        int percent = (int) (bytesRead * 100 / totalBytes);
        if (percent > prePercent && percent % 5 == 0) {
            log.info(percent + "% " + prefix + " read.");
            prePercent = percent;
        }
        // split on \t or " ", get domainName
        String domainName = Arrays.stream(lineR.split("[\t ]"))
                .filter(token -> token.contains(prefix))
                .findFirst()
                .orElse(" ");
        uniqueNameDomainSet.add(domainName.substring(0, domainName.length() - 1));
    }
}
I don't think the problem is in Apache Commons I/O itself, so which part could be causing the full CPU load?
One problem is that String.split() uses a regex to match the delimiter: each time you invoke split(), the delimiter [\t ] has to be recompiled. You can optimize this a bit by using Pattern.split() so that the compiled pattern is reused.
Try this:
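A minimal sketch of that change, keeping the rest of your loop as it is and assuming the same file1, prefix, and uniqueNameDomainSet variables from your snippet:

import java.util.Arrays;
import java.util.regex.Pattern;

// compile the delimiter regex once (e.g. as a class-level constant), not on every line
private static final Pattern DELIMITER = Pattern.compile("[\t ]");

try (LineIterator it = FileUtils.lineIterator(file1, "UTF-8")) {
    while (it.hasNext()) {
        String lineR = it.nextLine();
        // DELIMITER.split() reuses the compiled pattern instead of recompiling "[\t ]" per call
        String domainName = Arrays.stream(DELIMITER.split(lineR))
                .filter(token -> token.contains(prefix))
                .findFirst()
                .orElse(" ");
        uniqueNameDomainSet.add(domainName.substring(0, domainName.length() - 1));
    }
}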
Also, regex matching is a costly operation. I suggest using a single delimiter character in your file, e.g. '\t', so that you can write a custom split with String.indexOf, or use Guava's Splitter, which is much more efficient.
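For example, here is a minimal hand-rolled sketch based on String.indexOf, assuming a single '\t' delimiter and the same prefix and uniqueNameDomainSet variables as above (the helper name addDomainName is just for illustration):

import java.util.Set;

// Hand-rolled split: scan the line with indexOf instead of a regex.
// Assumes fields are separated by a single '\t' character.
static void addDomainName(String lineR, String prefix, Set<String> uniqueNameDomainSet) {
    int start = 0;
    while (start < lineR.length()) {
        int end = lineR.indexOf('\t', start);
        if (end < 0) {
            end = lineR.length(); // last field on the line
        }
        String token = lineR.substring(start, end);
        if (!token.isEmpty() && token.contains(prefix)) {
            // same trailing-character trim as in your snippet
            uniqueNameDomainSet.add(token.substring(0, token.length() - 1));
            return; // first matching field only, like findFirst()
        }
        start = end + 1;
    }
}

With Guava, Splitter.on('\t').split(lineR) gives you the same tokens as an Iterable without any regex work.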