Suppose I have data in txt format structured like this:
|-------------------------|
| ID 1 |
| Category A |
| Class ABC |
| Type 1 |
| Subcat A |
| positional entry 1 |
| positional entry 2 |
|-------------------------|
| ID 2 |
| Category B |
| Class ABC |
| Type 2 |
| Subcat B |
| positional entry 1 |
| positional entry 2 |
| positional entry 3 |
|-------------------------|
| ID 3 |
| Category A |
| Class E |
| Type 4 |
| Subcat A |
| positional entry 1 |
|-------------------------|
The data is stored in one large txt file (approx. 7 GB) with more than 100 million rows. I want to extract only those IDs that fall into Category A and Subcat A. However, I also want the positional entries. Their number is not fixed: it is unknown how many such lines there are, so it can vary for each ID.
I tried opening it as a text file and going through it line by line. My problem here is that each time the file pointer moves to a new line, the information from the previous line is lost, so to speak, although I could try to set flags that are retained.
My second approach was to first extract the beginning and end of each ID into a list, and then check the position where an ID that has Category A and Subcat A starts. However, I have so many rows that storing this information as ranges in lists with that many elements is not possible. I wanted to then check, for each ID, which range it falls into.
Expected output:
|-------------------------|
| ID 1 |
| Category A |
| Class ABC |
| Type 1 |
| Subcat A |
| positional entry 1 |
| positional entry 2 |
|-------------------------|
| ID 3 |
| Category A |
| Class E |
| Type 4 |
| Subcat A |
| positional entry 1 |
|-------------------------|
How can I do this extraction?
Edit: positional entry 1, 2 and so on just means that these can be varying entries. These are lines with, for example, text entries which I need for later analysis.
Edit 2 according to Zach Young's answer:
When I adapt the code to the following:
import csv

DIVIDER = "-------------------------"

f_in = open(r"C:\myfile\testfile.txt")

block: list[str] = []
with open(r"C:\myfile\output.csv", "w", newline="") as f_out:
    writer = csv.writer(f_out)
    for line in f_in:
        # Drop the pipe characters and surrounding whitespace.
        line = line.replace("|", "").strip()
        if line == DIVIDER:
            # A divider closes the current block: evaluate it, then reset.
            if len(block) > 4 and block[1] == "Category A" and block[4] == "Subcat A":
                print("check")
                print(block)
                writer.writerow(block)
            block = []
            continue
        block.append(line)
Then I get the output as in the answer below; however, I do not want the transposing. I would like to have the expected output as I wrote it here.
I see this problem as:
...and, printing dividers at beginning and end of file, and between printed blocks.
I've created two files from the unmodified samples you shared. Both have no extraneous whitespace at the beginning; both have a final linebreak (an "empty line" at the bottom):
input.txt:
target.txt:
To start off, I want to see what capturing and printing every block looks like. Consider this simple program that just reads a block, and prints a block, but only after a complete block has been read:
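The listing for this step is not reproduced above. A minimal sketch of such a read-then-print program follows; it is an assumption about its shape, not the original code. The names `copy_blocks` and `SAMPLE` are hypothetical, and an in-memory copy of the question's sample stands in for input.txt so the sketch runs on its own.

```python
import io
import sys

DIVIDER = "|-------------------------|"

# In-memory stand-in for input.txt (the sample from the question).
SAMPLE = """\
|-------------------------|
| ID 1 |
| Category A |
| Class ABC |
| Type 1 |
| Subcat A |
| positional entry 1 |
| positional entry 2 |
|-------------------------|
| ID 2 |
| Category B |
| Class ABC |
| Type 2 |
| Subcat B |
| positional entry 1 |
| positional entry 2 |
| positional entry 3 |
|-------------------------|
| ID 3 |
| Category A |
| Class E |
| Type 4 |
| Subcat A |
| positional entry 1 |
|-------------------------|
"""


def copy_blocks(lines, out):
    """Write every block to `out`, but only once the block is complete."""
    block = []
    for line in lines:
        line = line.rstrip("\n")
        if line == DIVIDER:
            if block:  # a divider completes the block collected so far
                out.write(DIVIDER + "\n")
                out.write("\n".join(block) + "\n")
            block = []
            continue
        block.append(line)
    out.write(DIVIDER + "\n")  # closing divider at the end of the output


copy_blocks(io.StringIO(SAMPLE), sys.stdout)
```

Only one block is held in memory at a time, so the same loop should work on a very large file by passing an open file object instead of the `io.StringIO` wrapper.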
I expect to get out exactly what I put in. I test that:
diff doesn't complain, so I've got at least that much correct.
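The exact command used for this check is not shown above. For readers without a diff tool at hand, the same kind of comparison can be sketched in Python with the standard-library `difflib` module. The helper name `diff_lines` is hypothetical, and the short strings stand in for the contents of the two files being compared.

```python
import difflib


def diff_lines(expected, actual):
    """Return unified-diff lines; an empty list means the texts are identical."""
    return list(
        difflib.unified_diff(
            expected.splitlines(keepends=True),
            actual.splitlines(keepends=True),
            fromfile="expected",
            tofile="actual",
        )
    )


same = diff_lines("| ID 1 |\n", "| ID 1 |\n")
different = diff_lines("| ID 1 |\n", "| ID 2 |\n")

print(len(same))           # no diff lines when the texts match
print("".join(different))  # lists the removed and the added line
```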
Next, I'll add in the evaluation step:
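The listing for this step is also missing above. As a sketch only, and assuming the field positions from the question (second line is the category, fifth line is the subcategory), the evaluation could look like this. `qualifies`, `filter_blocks` and `SAMPLE` are hypothetical names, and the in-memory sample again stands in for input.txt.

```python
import io
import sys

DIVIDER = "|-------------------------|"

# In-memory stand-in for input.txt (the sample from the question).
SAMPLE = """\
|-------------------------|
| ID 1 |
| Category A |
| Class ABC |
| Type 1 |
| Subcat A |
| positional entry 1 |
| positional entry 2 |
|-------------------------|
| ID 2 |
| Category B |
| Class ABC |
| Type 2 |
| Subcat B |
| positional entry 1 |
| positional entry 2 |
| positional entry 3 |
|-------------------------|
| ID 3 |
| Category A |
| Class E |
| Type 4 |
| Subcat A |
| positional entry 1 |
|-------------------------|
"""


def qualifies(block):
    """True if line 2 is 'Category A' and line 5 is 'Subcat A' (assumed layout)."""
    fields = [line.strip("| ") for line in block]
    return len(fields) > 4 and fields[1] == "Category A" and fields[4] == "Subcat A"


def filter_blocks(lines, out):
    """Write only qualifying blocks, each preceded by a divider."""
    block = []
    for line in lines:
        line = line.rstrip("\n")
        if line == DIVIDER:
            if qualifies(block):
                out.write(DIVIDER + "\n")
                out.write("\n".join(block) + "\n")
            block = []
            continue
        block.append(line)
    out.write(DIVIDER + "\n")  # closing divider at the end of the output


filter_blocks(io.StringIO(SAMPLE), sys.stdout)
```

The original lines are written back unchanged, so the output keeps the one-entry-per-line layout instead of turning each block into a single CSV row.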
I expect to get an output that matches target.txt. I test that:
Please recreate the conditions for those tests exactly and try those two sample programs without any modifications, not even to the input path.
That will mean making sure the programs can access a file named input.txt. A straightforward way to do that would be to:
- cd to that directory
- Z:\your\path\to\python3 main1.py > output.txt, and the same for main2.py

Do not change anything/add variables to the test, yet.
Once you've verified the second script exactly produces target.txt, you can move on to trying your input text.
Because of the size of your file, to start, I recommend adding a simple counter and halting once the program has read enough blocks that some should qualify; say, after 100 blocks:
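The counter version is not shown above either. One way it might look, as a sketch under the same layout assumptions: `filter_first_blocks`, `make_sample` and the `max_blocks` parameter are hypothetical, and the generated sample only stands in for the real file.

```python
import io
import sys

DIVIDER = "|-------------------------|"


def qualifies(block):
    """True if line 2 is 'Category A' and line 5 is 'Subcat A' (assumed layout)."""
    fields = [line.strip("| ") for line in block]
    return len(fields) > 4 and fields[1] == "Category A" and fields[4] == "Subcat A"


def filter_first_blocks(lines, out, max_blocks=100):
    """Filter as before, but stop after `max_blocks` complete blocks were read."""
    block = []
    blocks_read = 0
    for line in lines:
        line = line.rstrip("\n")
        if line != DIVIDER:
            block.append(line)
            continue
        if block:
            blocks_read += 1
            if qualifies(block):
                out.write(DIVIDER + "\n")
                out.write("\n".join(block) + "\n")
            block = []
            if blocks_read >= max_blocks:
                break  # halt early; the rest of the input is never read
    out.write(DIVIDER + "\n")
    return blocks_read


def make_sample(n):
    """Build n blocks in memory; odd IDs are Category A / Subcat A."""
    parts = [DIVIDER]
    for i in range(1, n + 1):
        cat = "A" if i % 2 else "B"
        parts += [
            f"| ID {i} |",
            f"| Category {cat} |",
            "| Class ABC |",
            "| Type 1 |",
            f"| Subcat {cat} |",
            "| positional entry 1 |",
            DIVIDER,
        ]
    return "\n".join(parts) + "\n"


# Small demo: 10 blocks available, stop after 4 of them.
blocks_read = filter_first_blocks(io.StringIO(make_sample(10)), sys.stdout, max_blocks=4)
```

Because the loop breaks as soon as the limit is reached, a trial run on the 7 GB file should return quickly instead of scanning every row.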
After that, maybe add in a file to write to, to avoid the prints and the redirect. You don't need to use with open(...) for these programs: once the program finishes (even from an unhandled exception), Python will close all open files.
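Writing straight to a file could be sketched as below. The function name `filter_file` and the file names are hypothetical, and the demo runs in a temporary directory so that no real files are touched. The sketch does use `with`, but only so the demo can read the output back immediately; as noted above, it is optional for a short script.

```python
import os
import tempfile

DIVIDER = "|-------------------------|"


def qualifies(block):
    """True if line 2 is 'Category A' and line 5 is 'Subcat A' (assumed layout)."""
    fields = [line.strip("| ") for line in block]
    return len(fields) > 4 and fields[1] == "Category A" and fields[4] == "Subcat A"


def filter_file(in_path, out_path):
    """Copy qualifying blocks from in_path to out_path; return how many were kept."""
    kept = 0
    block = []
    with open(in_path) as f_in, open(out_path, "w") as f_out:
        for line in f_in:
            line = line.rstrip("\n")
            if line != DIVIDER:
                block.append(line)
                continue
            if qualifies(block):
                f_out.write(DIVIDER + "\n" + "\n".join(block) + "\n")
                kept += 1
            block = []
        f_out.write(DIVIDER + "\n")  # closing divider
    return kept


# Two-block stand-in for the real input: only ID 1 should qualify.
sample_lines = [
    DIVIDER,
    "| ID 1 |", "| Category A |", "| Class ABC |", "| Type 1 |",
    "| Subcat A |", "| positional entry 1 |",
    DIVIDER,
    "| ID 2 |", "| Category B |", "| Class ABC |", "| Type 2 |",
    "| Subcat B |", "| positional entry 1 |",
    DIVIDER,
]

with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "input.txt")
    out_path = os.path.join(tmp, "output.txt")
    with open(in_path, "w") as f:
        f.write("\n".join(sample_lines) + "\n")
    kept = filter_file(in_path, out_path)
    with open(out_path) as f:
        result = f.read()

print(kept)
print(result, end="")
```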