Extract a paragraph between headings that contains a specific set of words


I have a text file containing data as follows:

History

The term "data science" (originally used interchangeably with "datalogy") has existed for over thirty years and was used initially as a substitute for computer science by Peter Naur in 1960. In 1974, Naur published Concise Survey of Computer Methods, which freely used the term data science in its survey of the contemporary data processing methods that are used in a wide range of application

Application 

In the 2010–2011 time frame, data science software reached an inflection point where open source software started supplanting proprietary software.[30] The use of open source software enables modifying and extending the software, and it allows sharing of the resulting algorithms

Now I would like to extract the paragraph or particular section that contains a specific set of words, such as {"software", "opensource"}.

I have tried regexp and an if loop but couldn't get the output I need. Can anyone help me out?


There are 2 answers

francisco sollima

Use a regular expression:

import re
my_string = """History

The term "data science" (originally used interchangeably with "datalogy") has existed for over thirty years and was used initially as a substitute for computer science by Peter Naur in 1960. In 1974, Naur published Concise Survey of Computer Methods, which freely used the term data science in its survey of the contemporary data processing methods that are used in a wide range of application

Application 

In the 2010–2011 time frame, data science software reached an inflection point where open source software started supplanting proprietary software.[30] The use of open source software enables modifying and extending the software, and it allows sharing of the resulting algorithms
"""
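# Raw string avoids the invalid "\s" escape warning; ^...$ with re.MULTILINE
# matches whole lines, including the first line or one starting with a keyword.
pattern = r'^.*(?:software|open\s?source).*$'
paragraph_list = re.findall(pattern, my_string, flags=re.MULTILINE | re.IGNORECASE)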
print(paragraph_list)

You end up with every paragraph that contains the keywords you mentioned in the list paragraph_list.

EDIT

If you want the keywords to be dynamic, or provided by a list/tuple:

import re
keywords = ('software', 'open source')

my_string = """History

The term "data science" (originally used interchangeably with "datalogy") has existed for over thirty years and was used initially as a substitute for computer science by Peter Naur in 1960. In 1974, Naur published Concise Survey of Computer Methods, which freely used the term data science in its survey of the contemporary data processing methods that are used in a wide range of application

Application 

In the 2010–2011 time frame, data science software reached an inflection point where open source software started supplanting proprietary software.[30] The use of open source software enables modifying and extending the software, and it allows sharing of the resulting algorithms
"""
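# re.escape keeps keywords with special characters (e.g. "c++") from breaking the pattern.
pattern = r'^.*(?:' + '|'.join(map(re.escape, keywords)) + r').*$'
paragraph_list = re.findall(pattern, my_string, flags=re.MULTILINE | re.IGNORECASE)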
print(paragraph_list)
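
If the heading is wanted together with its paragraph, one possible approach is to split the text on blank lines and pair each heading with the block that follows it. This is only a sketch and not part of the answer above: the function name `sections_with_keywords` is hypothetical, and it assumes the file strictly alternates heading, paragraph, heading, paragraph, as in the sample from the question.

```python
import re

def sections_with_keywords(text, keywords):
    """Return (heading, paragraph) pairs whose paragraph mentions any keyword.

    Assumes the text alternates heading / paragraph, separated by blank lines.
    """
    # Split on one or more blank lines and drop empty chunks.
    blocks = [b.strip() for b in re.split(r'\n\s*\n', text) if b.strip()]
    # Build one case-insensitive alternation from the escaped keywords.
    pattern = re.compile('|'.join(map(re.escape, keywords)), re.IGNORECASE)
    # Pair block 0 with block 1, block 2 with block 3, and so on.
    pairs = zip(blocks[0::2], blocks[1::2])
    return [(h, p) for h, p in pairs if pattern.search(p)]

# Shortened stand-in for the text in the question.
sample = (
    "History\n\n"
    "The term data science has existed for over thirty years.\n\n"
    "Application\n\n"
    "Open source software started supplanting proprietary software.\n"
)
print(sections_with_keywords(sample, ('software', 'open source')))
```

With the shortened sample, only the "Application" section is returned, because the "History" paragraph mentions neither keyword.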

Dadep

You can easily check whether a substring is part of a larger string:

>>> text = 'In the 2010–2011 time frame, data science software reached an inflection point where open source software started supplanting proprietary software.[30] The use of open source software enables modifying and extending the software, and it allows sharing of the resulting algorithms'
>>> "software" in text
True

You can also extract the lines of your file that contain a specific word:

>>> with open('yourfile.txt', 'r') as f:
...     result = [line for line in f if 'software' in line]
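
The same idea can be extended to several keywords and made case-insensitive. The following is a minimal sketch under the assumption that each paragraph sits on a single line, as in the question's file; the helper name `lines_with_any` is made up for illustration, and it accepts any iterable of lines so it works with an open file or a plain list.

```python
def lines_with_any(lines, keywords):
    """Keep the lines that contain at least one keyword, ignoring case."""
    lowered = [k.lower() for k in keywords]
    return [line.rstrip('\n') for line in lines
            if any(k in line.lower() for k in lowered)]

# With a real file (the name is a placeholder):
# with open('yourfile.txt') as f:
#     result = lines_with_any(f, ['software', 'open source'])

demo = ["History\n", "Data science software reached an inflection point.\n"]
print(lines_with_any(demo, ['Software', 'open source']))
```

Plain substring checks are enough here; a regular expression is only needed when the keywords have variants such as "open source" versus "opensource".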