Node.js "readline" + "fs.createReadStream": Specify start & end line number


https://nodejs.org/api/readline.html

provides this solution for reading large files like CSVs line by line:

const { once } = require('events');
const { createReadStream } = require('fs');
const { createInterface } = require('readline');

(async function processLineByLine() {
  try {
    const rl = createInterface({
      input: createReadStream('big-file.txt'),
      crlfDelay: Infinity
    });

    rl.on('line', (line) => {
      // Process the line.
    });

    await once(rl, 'close');

    console.log('File processed.');
  } catch (err) {
    console.error(err);
  }
})();

But I don't want to read the entire file from beginning to end but parts of it, say from line number 1 to 10000, 20000 to 30000, etc.

Basically I want to be able to set a 'start' & 'end' line for a given run of my function.

Is this doable with readline & fs.createReadStream? If not please suggest alternate approach.

PS: It's a large file (around 1 GB) & loading it in memory causes memory issues.

There are 2 answers

jfriend00 (best answer)

But I don't want to read the entire file from beginning to end but parts of it say from line number 1 to 10000, 20000 to 30000, etc.

Unless your lines are of fixed, identical length, there is NO way to know where line 10,000 starts without reading from the beginning of the file and counting lines until you get to line 10,000. That's how text files with variable length lines work. Lines in the file are not physical structures that the file system knows anything about. To the file system, the file is just a gigantic blob of data. The concept of lines is something we invent at a higher level and thus the file system or OS knows nothing about lines. The only way to know where lines are is to read the data and "parse" it into lines by searching for line delimiters. So, line 10,000 is only found by searching for the 10,000th line delimiter starting from the beginning of the file and counting.

There is no way around it, unless you preprocess the file into a more efficient format (like a database) or create an index of line positions.

Basically I want to be able to set a 'start' & 'end' line for a given run of my function.

The only way to do that is to "index" the data ahead of time so you already know where each line starts/ends. Some text editors made to handle very large files do this. They read through the file (perhaps lazily) reading every line and build an in-memory index of what file offset each line starts at. Then, they can retrieve specific blocks of lines by consulting the index and reading that set of data from the file.
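As a rough illustration of the indexing idea, here is a minimal sketch that makes one pass over the file to record the byte offset at which each line starts, then uses the start and end options of fs.createReadStream to fetch a block of lines. The helper names buildLineIndex and readIndexedLines are hypothetical, not part of Node.js, and the sketch assumes "\n" or "\r\n" line endings, UTF-8 text, and a Node.js version in which readable streams are async iterable.

```javascript
const { createReadStream } = require('fs');

// Hypothetical helper: one pass over the file, recording the byte offset
// at which each line starts. Only the offsets are kept in memory.
async function buildLineIndex(path) {
  const offsets = [0]; // line 1 starts at byte 0
  let position = 0;    // absolute file offset of the current chunk
  for await (const chunk of createReadStream(path)) {
    let i = chunk.indexOf(0x0a); // 0x0a is "\n"
    while (i !== -1) {
      offsets.push(position + i + 1);
      i = chunk.indexOf(0x0a, i + 1);
    }
    position += chunk.length;
  }
  // A trailing newline does not start another line, so drop that entry.
  if (offsets.length > 1 && offsets[offsets.length - 1] === position) {
    offsets.pop();
  }
  return { offsets, size: position };
}

// Hypothetical helper: read lines startLine..endLine (1-based, inclusive)
// by seeking straight to the recorded offsets.
async function readIndexedLines(path, index, startLine, endLine) {
  const start = index.offsets[startLine - 1];
  // The block ends where line endLine + 1 begins, or at end of file.
  const stop = endLine < index.offsets.length ? index.offsets[endLine] : index.size;
  if (start === undefined || stop <= start) return [];
  const chunks = [];
  // The `end` option is inclusive, hence stop - 1.
  for await (const chunk of createReadStream(path, { start, end: stop - 1 })) {
    chunks.push(chunk);
  }
  const text = Buffer.concat(chunks).toString('utf8');
  return text.replace(/\r?\n$/, '').split(/\r?\n/);
}
```

The index costs one full read of the file, but every later request for a block of lines only touches the bytes of that block. For a file with a very large number of lines, storing the offset of every Nth line instead would keep the index smaller at the price of a short scan per request.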

Is this doable with readline & fs.createReadStream?

Without fixed length lines, there's no way to know where in the file line 10,000 starts without counting from the beginning.
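Counting from the beginning does not require holding the file in memory, though. The following is a minimal sketch, assuming a Node.js version in which readline interfaces are async iterable; linesInRange is a hypothetical helper name, not a readline API. It still reads and counts every line before the start line, but it only hands back the requested lines and stops reading once the end line is reached.

```javascript
const { createReadStream } = require('fs');
const { createInterface } = require('readline');

// Hypothetical helper: yields lines startLine..endLine (1-based, inclusive).
async function* linesInRange(path, startLine, endLine) {
  const input = createReadStream(path);
  const rl = createInterface({ input, crlfDelay: Infinity });
  let lineNo = 0;
  try {
    for await (const line of rl) {
      lineNo += 1;
      // Lines before startLine are still read and counted, just not yielded.
      if (lineNo >= startLine) yield line;
      // Stop early so the rest of the file is never read.
      if (lineNo >= endLine) break;
    }
  } finally {
    rl.close();
    input.destroy(); // release the file descriptor
  }
}
```

It could be used like this:

```javascript
(async () => {
  for await (const line of linesInRange('big-file.txt', 20000, 30000)) {
    // Process the line.
  }
})();
```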

It's a large file (around 1 GB) & loading it in memory causes memory issues.

Streaming the file a line at a time with the linereader module or others that do something similar will handle the memory issue just fine so that only a block of data from the file is in memory at any given time. You can handle arbitrarily large files even in a small memory system this way.

leitning

A newline is just a character (or two characters if you're on Windows); you have no way of knowing where those characters are without processing the file.

You are, however, able to read only a certain byte range in a file. If you know for a fact that every line occupies exactly 64 bytes (line terminator included), you can skip the first 100 lines by starting your read at byte 6400, and you can read only the next 100 lines by stopping just before byte 12800. Because the end option of createReadStream is inclusive, that means end: 12799.

Details on how to specify start and end points are available in the createReadStream docs.
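For the fixed-length case, the byte arithmetic can be sketched as follows. This is only an illustration under the assumption stated above, that every line occupies exactly the same number of bytes including its line terminator; fixedWidthRange and readFixedWidthLines are hypothetical helper names.

```javascript
const { createReadStream } = require('fs');

// Assumption: every line occupies exactly bytesPerLine bytes, terminator included.
// Returns the byte range covering lines startLine..endLine (1-based, inclusive).
function fixedWidthRange(bytesPerLine, startLine, endLine) {
  return {
    start: (startLine - 1) * bytesPerLine,
    end: endLine * bytesPerLine - 1, // createReadStream treats `end` as inclusive
  };
}

// Reads only the bytes of the requested lines and returns them as one string.
async function readFixedWidthLines(path, bytesPerLine, startLine, endLine) {
  const range = fixedWidthRange(bytesPerLine, startLine, endLine);
  const chunks = [];
  for await (const chunk of createReadStream(path, range)) {
    chunks.push(chunk);
  }
  return Buffer.concat(chunks).toString('utf8');
}
```

With 64-byte lines, fixedWidthRange(64, 101, 200) gives start 6400 and end 12799, which skips the first 100 lines and reads the next 100.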