How to remove a custom word pattern from a text using NLTK with Python

Question

Asked by Punuth on 07 June 2015 at 12:39

I am currently working on a project that analyzes the quality of examination paper questions. I am using Python 3.4 with NLTK.
First, I want to extract each question separately from the text. The question paper format is given below.

 (Q1). What is web 3.0?
 (Q2). Explain about blogs.
 (Q3). What is mean by semantic web?
       and so on ........

Now I want to extract the questions one by one, without the question number (the question number format is always the same as shown above). My result should look something like this.

 What is web 3.0?
 Explain about blogs.
 What is mean by semantic web?

So how can I tackle this problem with Python 3.4 and NLTK?
Thank you

There are 3 answers

alvas On 07 June 2015 at 13:38

If the (QX) label is always separated from the question text by a space, you can do this:

>>> text = """(Q1). What is web 3.0?
...  (Q2). Explain about blogs.
...  (Q3). What is mean by semantic web?"""
>>> for line in text.split('\n'):
...     print(line.strip().partition(' ')[2])
... 
What is web 3.0?
Explain about blogs.
What is mean by semantic web?

AvidLearner On 07 June 2015 at 13:17

If every sentence starts with this pattern, it is easy to parse: you can use split to remove the prefix:

sentences = [ "(Q1). What is web 3.0?",
              "(Q2). Explain about blogs.",
              "(Q3). What is mean by semantic web?"]
for sen in sentences:
    print(sen.split('). ', 1)[1])

This will print:

What is web 3.0?
Explain about blogs.
What is mean by semantic web?
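
One thing to watch with the split approach is that indexing [1] raises an IndexError on a line that does not contain the separator. A possible guarded variant is sketched below; the helper name drop_label is hypothetical, and returning unlabeled lines unchanged is just one reasonable choice.

```python
def drop_label(sentence, sep="). "):
    """Remove the text up to the first occurrence of sep, if sep is present."""
    # maxsplit=1 means only the first "). " is treated as the label boundary
    parts = sentence.split(sep, 1)
    # Two parts means the separator was found; otherwise leave the line alone
    return parts[1] if len(parts) == 2 else sentence


sentences = ["(Q1). What is web 3.0?",
             "(Q2). Explain about blogs.",
             "A line without a label"]

for sen in sentences:
    print(drop_label(sen))
```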

alexis (accepted answer) On 07 June 2015 at 13:25

You'll probably need to detect lines containing a question, then extract the question and drop the question number. The regexp for detecting a question label is

import re

qnum_pattern = r"^\s*\(Q\d+\)\.\s+"

You can use it to pull out the questions like this:

questions = [re.sub(qnum_pattern, "", line)
             for line in text
             if re.search(qnum_pattern, line)]

Here, text must be an iterable of lines, such as a list of strings or a file opened for reading. If it is a single string, iterate over text.splitlines() instead.
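
Put together, the regexp approach might look like the following self-contained sketch. The names QNUM_PATTERN, extract_questions and paper are chosen here for illustration, and the sample text is the one from the question, including the trailing line that is not a question.

```python
import re

# Optional leading whitespace, "(Q<digits>)", a period, then whitespace
QNUM_PATTERN = re.compile(r"^\s*\(Q\d+\)\.\s+")


def extract_questions(lines):
    """Return the question text of every line that starts with a (Qn). label."""
    return [QNUM_PATTERN.sub("", line).rstrip()
            for line in lines
            if QNUM_PATTERN.search(line)]


paper = """ (Q1). What is web 3.0?
 (Q2). Explain about blogs.
 (Q3). What is mean by semantic web?
       and so on ........"""

questions = extract_questions(paper.splitlines())
for question in questions:
    print(question)
```

Lines that do not start with a question label, such as the last one in the sample, are filtered out rather than passed through. The rstrip() call is there so that the same function also works on lines read from a file, which keep their trailing newline.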

But if you had no idea how to approach this, you have your work cut out for you with the rest of the assignment. I recommend spending some time on the Python tutorial or other introductory materials.
