I'm reading a text file with Apache Commons I/O and the CPU is pinned at 100% with a large file (23 GB, ~404 million lines). My code snippet is below:
try (LineIterator it = FileUtils.lineIterator(file1, "UTF-8")) {
    while (it.hasNext()) {
        String lineR = it.nextLine();
        // do something with line
        bytesRead += lineR.length();
        int percent = (int) (bytesRead * 100 / totalBytes);
        if (percent > prePercent && percent % 5 == 0) {
            log.info(percent + "% " + prefix + " read.");
            prePercent = percent;
        }
        // split on \t or " ", get domainName
        String domainName = Arrays.stream(lineR.split("[\t ]"))
                .filter(token -> token.contains(prefix))
                .findFirst()
                .orElse(" ");
        uniqueNameDomainSet.add(domainName.substring(0, domainName.length() - 1));
    }
}
I don't think the problem is in Apache Commons I/O itself, so which part could be causing the full CPU load?
One problem is that String.split() uses a regex to match the delimiter: each time you invoke split(), the delimiter [\t ] has to be recompiled. You can optimize this a bit by using Pattern.split() so that the compiled pattern is reused.
Try this:
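A minimal sketch of that change, keeping the rest of your loop as it is and assuming the same file1, prefix, and uniqueNameDomainSet variables from your snippet:

import java.util.Arrays;
import java.util.regex.Pattern;

// compile the delimiter regex once (e.g. as a class-level constant), not on every line
private static final Pattern DELIMITER = Pattern.compile("[\t ]");

try (LineIterator it = FileUtils.lineIterator(file1, "UTF-8")) {
    while (it.hasNext()) {
        String lineR = it.nextLine();
        // DELIMITER.split() reuses the compiled pattern instead of recompiling "[\t ]" per call
        String domainName = Arrays.stream(DELIMITER.split(lineR))
                .filter(token -> token.contains(prefix))
                .findFirst()
                .orElse(" ");
        uniqueNameDomainSet.add(domainName.substring(0, domainName.length() - 1));
    }
}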
Also, regex matching is a costly operation. I suggest using a single delimiter character in your file, e.g. '\t', so that you can write a custom split with String.indexOf, or use Guava's Splitter, which is much more efficient.
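For example, here is a minimal hand-rolled sketch based on String.indexOf, assuming a single '\t' delimiter and the same prefix and uniqueNameDomainSet variables as above (the helper name addDomainName is just for illustration):

import java.util.Set;

// Hand-rolled split: scan the line with indexOf instead of a regex.
// Assumes fields are separated by a single '\t' character.
static void addDomainName(String lineR, String prefix, Set<String> uniqueNameDomainSet) {
    int start = 0;
    while (start < lineR.length()) {
        int end = lineR.indexOf('\t', start);
        if (end < 0) {
            end = lineR.length(); // last field on the line
        }
        String token = lineR.substring(start, end);
        if (!token.isEmpty() && token.contains(prefix)) {
            // same trailing-character trim as in your snippet
            uniqueNameDomainSet.add(token.substring(0, token.length() - 1));
            return; // first matching field only, like findFirst()
        }
        start = end + 1;
    }
}

With Guava, Splitter.on('\t').split(lineR) gives you the same tokens as an Iterable without any regex work.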