How to define pos_pattern to extract nouns followed by zero or more nouns or adjectives with KeyphraseCountVectorizer?

Question


Asked by Mhmd Rokaimi on 14 January 2023 at 07:17

I'm trying to extract Arabic keywords from tweets. I'm using KeyBERT with KeyphraseCountVectorizer:

vectorizer = KeyphraseCountVectorizer(pos_pattern='< N.*>*')

I'd like to write a more specific POS-pattern regular expression that selects nouns followed by zero or more nouns or adjectives, but not verbs. Could you please help me write the right pattern? Thank you.


There is 1 answer

**Kyle F. Hartzenberg** · Accepted Answer · 2023-01-14T22:26:09+00:00

I interpret your requirement to match "nouns followed by zero or more sequence of nouns or adjectives" as matching one or more consecutive nouns (i.e. <N.*>+), followed by zero or more adjectives (i.e. <J.*>*). Putting these together gives the full pattern:

vectorizer = KeyphraseCountVectorizer(pos_pattern="<N.*>+<J.*>*")
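To see what this pattern selects, and how it differs from a stricter reading of the question (one noun followed by any mix of nouns and adjectives), here is a minimal sketch in plain Python. It does not call keyphrase_vectorizers or spaCy. The helper names, the toy sentence (English glosses in noun-then-adjective order) and its Penn-style tags (NN, JJ, VB, RB) are all assumptions made for illustration, and the alternative pattern <N.*><N.*|J.*>* assumes that pos_pattern follows NLTK-style chunk tag-pattern syntax, which is worth checking against your installed version.

```python
import re

def tag_pattern_to_regex(pos_pattern):
    # Hypothetical helper: approximates tag-pattern matching with plain `re`.
    # Inside <...>, "." is narrowed to [^<>] so it cannot cross a tag boundary.
    def convert(match):
        body = match.group(1).replace(".", "[^<>]")
        return "(?:<(?:" + body + ")>)"
    return re.compile(re.sub(r"<([^<>]+)>", convert, pos_pattern))

def extract_phrases(tagged_tokens, pos_pattern):
    # Hypothetical helper: returns the token spans whose tags match the pattern.
    regex = tag_pattern_to_regex(pos_pattern)
    starts, ends, pieces, offset = {}, {}, [], 0
    for index, (_, tag) in enumerate(tagged_tokens):
        piece = "<" + tag + ">"
        starts[offset] = index          # character offset where this token's tag begins
        offset += len(piece)
        ends[offset] = index            # character offset just past this token's tag
        pieces.append(piece)
    phrases = []
    for match in regex.finditer("".join(pieces)):
        if match.start() == match.end():
            continue                    # skip empty matches
        first, last = starts[match.start()], ends[match.end()]
        phrases.append(" ".join(word for word, _ in tagged_tokens[first:last + 1]))
    return phrases

# Hand-tagged toy input (assumed tags, not produced by a real tagger).
tagged = [("reads", "VB"), ("network", "NN"), ("neural", "JJ"),
          ("model", "NN"), ("quickly", "RB")]

# Pattern from the answer: nouns first, then adjectives.
answer_result = extract_phrases(tagged, "<N.*>+<J.*>*")
# Alternative reading: one noun, then any mix of nouns and adjectives.
mixed_result = extract_phrases(tagged, "<N.*><N.*|J.*>*")

print(answer_result)  # ['network neural', 'model']
print(mixed_result)   # ['network neural model']
```

Neither pattern can match the verb or the adverb, so both satisfy the "but not verbs" requirement. The difference only shows up when a noun follows an adjective: the first pattern ends the phrase there, while the second keeps extending it.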

As a side note, you mention that you are trying to extract Arabic keywords. As I understand it, the keyphrase_vectorizers package relies on the text being annotated with spaCy POS tags, so to switch from the default language (English) you have to load a pipeline/model for the desired language and set the stop words to those of that language. For example, to use the Keyphrase Vectorizer for German:

vectorizer = KeyphraseCountVectorizer(spacy_pipeline='de_core_news_sm', stop_words='german')

However, at present spaCy does not have a pipeline trained for Arabic text, which means that using KeyphraseCountVectorizer with Arabic text is not possible without workarounds (you may have already solved this, but I thought I'd mention it).
