Apply POS tag to nested list

I'm trying to go through multiple sentences in a text. Each sentence is stored in a nested list structure (i.e. a list in which each sentence is itself a list). I then want to apply a POS tag to each token in the sentence and store the result in another nested list. Ultimately, I want to add this to a dataframe and export it to Excel in one column (where each row is a sentence).

The trouble I'm having is that the POS tag list only seems to capture the last sentence in the text. Here is part of the code:

for sentences in doc1.sents:  # iterates over sentences in doc
    for match_id, start, end in phrase_matcher(nlp(sentences.text)):
        if nlp.vocab.strings[match_id] in ["key"]:
            found_sentences = sentences.text
            duplicate_sentence_list.append(found_sentences)
            all_separated_words_list.append(text_preprocessing(found_sentences))
            tokens = nltk.word_tokenize(sentence)
            tags = nltk.pos_tag(tokens)
            pos_list.append(tags)

When I try adding the POS tag to a for loop like below:

for sentences in doc1.sents:  # iterates over sentences in doc
    for match_id, start, end in phrase_matcher(nlp(sentences.text)):
        if nlp.vocab.strings[match_id] in ["key"]:
            found_sentences = sentences.text
            duplicate_sentence_list.append(found_sentences)
            all_separated_words_list.append(text_preprocessing(found_sentences))
            for i in found_sentences:
                pos_list.append(nltk.pos_tag(i))

I get this error:

TypeError: tokens: expected a list of strings, got a string

When I change the for loop to use the nested list (all_separated_words_list), I get this error:

Output exceeds the size limit. Open the full output data in a text editor

AttributeError                            Traceback (most recent call last)
/var/folders/6g/n1v5s0vj77xc2htytg4spx_r0000gn/T/ipykernel_17689/361983526.py in
     14 all_separated_words_list.append(text_preprocessing(found_sentences))
     15 for i in found_sentences:
     16     pos_list.append(nltk.pos_tag(all_separated_words_list))
     17 # tokens = nltk.word_tokenize(i)
     18 # tags = nltk.pos_tag(tokens)

~/opt/anaconda3/lib/python3.9/site-packages/nltk/tag/__init__.py in pos_tag(tokens, tagset, lang)
    164     """
    165     tagger = _get_tagger(lang)
    166     return _pos_tag(tokens, tagset, tagger, lang)
    167
    168

~/opt/anaconda3/lib/python3.9/site-packages/nltk/tag/__init__.py in _pos_tag(tokens, tagset, tagger, lang)
    121
    122     else:
    123         tagged_tokens = tagger.tag(tokens)
    124         if tagset:  # Maps to the specified tagset.
    125             if lang == "eng":

~/opt/anaconda3/lib/python3.9/site-packages/nltk/tag/perceptron.py in tag(self, tokens, return_conf, use_tagdict)
    178         output = []
...
    277         if word.isdigit() and len(word) == 4:
    278             return "!YEAR"
    279         if word and word[0].isdigit():

AttributeError: 'list' object has no attribute 'isdigit'

So I'm not sure how to proceed. I would appreciate any help.

1 Answer

Answer by larapsodia

The final error message (the AttributeError) is telling you that the tagger expected a string, but got a list instead.

for i in found_sentences:
    pos_list.append(nltk.pos_tag(i))

I suspect what's happening is that at this point you think you're giving it a single sentence and then iterating over the words in it, but found_sentences is actually a list of sentences. So when it iterates over them, it finds a list (the tokenized sentence) instead of a string (the individual word).
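To make the type mismatch concrete, here is a small sketch that needs no NLTK at all and only inspects what each loop would hand to the tagger. The sample values are made up for illustration. In the code as posted, found_sentences is assigned sentences.text, which would make it a single string, so the first loop would walk over characters; that would fit the "expected a list of strings, got a string" message. The helper is_flat_token_list is a hypothetical name, not part of NLTK.

```python
# Hypothetical sample data standing in for the variables in the question.
found_sentences = "The key is here"  # a single str, like sentences.text
all_separated_words_list = [["the", "key"], ["another", "key"]]  # one token list per sentence

# Looping over a str yields single characters, not words.
chars = [i for i in found_sentences]
print(chars[:3])

# Looping over the nested list yields one list per sentence.
item_types = [type(s).__name__ for s in all_separated_words_list]
print(item_types)


def is_flat_token_list(obj):
    """True only for a list whose items are all strings (one tokenized sentence)."""
    return isinstance(obj, list) and all(isinstance(t, str) for t in obj)


print(is_flat_token_list(found_sentences))              # a bare string
print(is_flat_token_list(all_separated_words_list))     # the whole nested list
print(is_flat_token_list(all_separated_words_list[0]))  # one inner list
```

Only the last shape, a single inner list of token strings, is what nltk.pos_tag is meant to receive in one call.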

Go back over your code, looking at the output of each line, and you'll be able to see where it is going wrong.
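Once the shapes are clear, one possible way to build the nested POS list is to tokenize and tag one sentence at a time. The following is a minimal sketch under stated assumptions, not the asker's actual pipeline: tag_sentences and to_cell are hypothetical helper names, the input is assumed to be a list of sentence strings (such as duplicate_sentence_list), and the tokenizer and tagger can be swapped in as parameters. When they are not supplied, the sketch falls back to nltk.word_tokenize and nltk.pos_tag, which need NLTK's tokenizer and tagger data to be downloaded.

```python
def tag_sentences(sentences, tokenize=None, tag=None):
    """Return one list of (token, tag) pairs per sentence string."""
    if tokenize is None or tag is None:
        # Imported lazily so the function can also run with custom callables.
        import nltk
        tokenize = tokenize or nltk.word_tokenize
        tag = tag or nltk.pos_tag

    pos_list = []
    for sentence in sentences:
        tokens = tokenize(sentence)   # str -> list of token strings
        pos_list.append(tag(tokens))  # list of strings -> list of (token, tag)
    return pos_list


def to_cell(tagged_sentence):
    """Join one tagged sentence into a single string, e.g. for one spreadsheet cell."""
    return " ".join(f"{word}/{pos}" for word, pos in tagged_sentence)
```

Calling tag_sentences(duplicate_sentence_list) should then give a list with one entry per matched sentence, and [to_cell(t) for t in ...] gives one string per row, which can be used as a single dataframe column. If the sentences are already tokenized (as all_separated_words_list appears to be), nltk.pos_tag_sents accepts a list of token lists directly and may be a simpler fit.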