I have to identify paragraphs from multiple text files (.txt) and create a dataframe of [paragraph1, "text of file1 in paragraphs"]

with open("/home/xxxx/Downloads/DataEnginner9.txt", "r") as f:
    for line in f:
        print(line)

When I run this code, I only get the text one line at a time.

The code above reads the file and prints it line by line, but I want to identify paragraphs across multiple files and build a DataFrame with the file name in the first column and that file's entire content, split into paragraphs, in the second column of the same row. Example DataFrame:

[file1, content of file1 split into paragraphs; file2, content of file2 split into paragraphs; . . . ]

Below is the sample output generated by the above script from one file.

Job description

Responsibilities

Work collaboratively with a global team to design, develop

scalable, maintainable and reliable services that process very large quantities

data using Big Data technologies (100 billion daily indicators, 6 TB/day before

compression).

Familiar with Object oriented development, with specific experience

in at least one major OO language(knowledge of Java is mandatory and if

possible java 8). Nice to have: Knowledge of functional programming.

Perform end-to-end software development life cycle functions

including Design, Development, Performance Analysis & Tuning, Optimization,

Testing and Product Maintenance.

There is 1 answer

Suresh Kumar On
def txt(filepath):
    """Return the paragraphs of one text file as a list of strings."""
    paragraphs = []
    paragraph = ''
    with open(filepath) as f:
        for line in f:
            if line.isspace():  # a blank line ends the current paragraph
                if paragraph:
                    paragraphs.append(paragraph)
                    paragraph = ''
            else:
                # join the lines of one paragraph with single spaces
                paragraph = (paragraph + ' ' + line.strip()).strip()
    if paragraph:  # keep the last paragraph if the file has no trailing blank line
        paragraphs.append(paragraph)
    return paragraphs
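
To get from a single file to the DataFrame the question asks for, the same blank-line rule can be applied to every file matching a pattern. The following is only a sketch under some assumptions: the helper names `split_paragraphs`, `collect_rows` and `build_dataframe` are hypothetical, the files are assumed to be UTF-8, and paragraphs are assumed to be separated by at least one blank line in the files themselves. Note that `print(line)` in the question adds its own newline after each line, so the blank lines in the sample output may not exist in the file; if they do not, every file will come back as a single paragraph and a different splitting rule is needed.

```python
import glob
import os
import re


def split_paragraphs(text):
    """Split text into paragraphs on one or more blank lines."""
    blocks = re.split(r"\n\s*\n", text.strip())
    # Collapse the line breaks inside each paragraph into single spaces.
    return [" ".join(block.split()) for block in blocks if block.strip()]


def collect_rows(pattern):
    """Return one {"title", "paragraphs"} dict per file matching pattern."""
    rows = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            rows.append({
                "title": os.path.basename(path),
                "paragraphs": split_paragraphs(f.read()),
            })
    return rows


def build_dataframe(pattern):
    """Build a DataFrame with one row per file (requires pandas)."""
    import pandas as pd  # imported here so the helpers above run without pandas
    return pd.DataFrame(collect_rows(pattern), columns=["title", "paragraphs"])
```

Calling `build_dataframe("/home/xxxx/Downloads/*.txt")` (the glob pattern is an assumption based on the path in the question) should give one row per file, with the file name in `title` and a list of paragraph strings in `paragraphs`. If one row per paragraph is preferred, `DataFrame.explode("paragraphs")` can be applied to the result.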