Data Extraction using Regex

I have data in a text file "file.txt"

Recipes & Menus
Expert Advice
Ingredients
Holidays & Events
Community
Video
SUMMER COOKING
Lentil and Brown Rice Soup
Gourmet January 1991
3.5/4
reviews (83)
90%
make it again
Some soups genuinely do inspire a devotion akin to love, and this is one of
them. In the cold of winter, when Gourmet editors ponder the matter of what soup
Cook
Reviews (83)
YIELD: Makes about 14 cups, serving 6 to 8
Ingredients
5 cups chicken broth
1 1/2 cups lentils, picked over and rinsed
1 cup brown rice
a 32- to 35-ounce can tomatoes, drained, reserving the juice, and chopped
3 carrots, halved lengthwise and cut crosswise into 1/4-inch pieces
1 onion, chopped
1 stalk of celery, chopped
3 garlic cloves, minced
1/2 teaspoon crumbled dried basil
1/2 teaspoon crumbled dried orégano
1/4 teaspoon crumbled dried thyme
1 bay leaf
1/2 cup minced fresh parsley leaves
2 tablespoons cider vinegar, or to taste
Preparation
In a heavy kettle combine the broth, 3 cups water, the lentils, the rice, the tomatoes with the reserved juice,

I want to extract the data between Ingredients and Preparation.
I wrote the following regex for it:

(?s).*?Ingredients(.*?)Preparation.*

But it extracts everything from the first Ingredients (the one on the 3rd line of
file.txt) up to Preparation, instead of only the lines between the Ingredients
heading and Preparation.
What should I change in my regex to fix this?
Thanks in advance!

There are 4 answers

Dmitry Egorov (best answer)

Try making your first .* greedy. It will consume as much as it can, skipping every earlier Ingredients, so the match starts at the last Ingredients before Preparation:

(?s).*Ingredients(.*?)Preparation.*

Demo: https://regex101.com/r/mQ5eK5/1
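
To make the difference concrete, here is a minimal Python sketch. Python is an assumption on my part, since the question does not name a regex engine, and sample_text is a shortened stand-in for file.txt rather than the full file:

```python
import re

# Shortened stand-in for file.txt (assumption): a navigation-menu
# "Ingredients" appears before the real "Ingredients" heading.
sample_text = """Recipes & Menus
Ingredients
Holidays & Events
Lentil and Brown Rice Soup
Ingredients
5 cups chicken broth
1 cup brown rice
Preparation
In a heavy kettle combine the broth
"""

# Lazy prefix: stops at the FIRST "Ingredients" it can find.
lazy = re.match(r"(?s).*?Ingredients(.*?)Preparation.*", sample_text)

# Greedy prefix: backtracks from the end, so it lands on the LAST
# "Ingredients" that is still followed by "Preparation".
greedy = re.match(r"(?s).*Ingredients(.*?)Preparation.*", sample_text)

lazy_lines = lazy.group(1).strip().splitlines()
greedy_lines = greedy.group(1).strip().splitlines()

print(lazy_lines)    # includes the menu entries and the second heading
print(greedy_lines)  # only the ingredient lines
```

Note that the greedy version relies on Preparation appearing only once after the heading you care about; if the file held several recipes, it would pick the last Ingredients/Preparation pair.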

Tensibai

(?s).*?[*]{2}Ingredients[*]{2}(.*?)[*]{2}Preparation[*]{2}.*

[*]{2} tells the regex you want one of the characters in the list (here only *) exactly twice ({2}).

I prefer character classes to escaping; I find them more readable than this:

(?s).*?\*{2}Ingredients\*{2}(.*?)\*{2}Preparation\*{2}.*

Depending on the language you're using, you may have to escape the backslash too.
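
As a rough illustration in Python (my choice of language, not something stated in the question), both spellings match the same thing. The marked_text sample is hypothetical and assumes the headings in the file are literally wrapped in double asterisks, which is what the patterns above expect:

```python
import re

# Hypothetical sample: headings wrapped in literal double asterisks.
marked_text = (
    "Ingredients\n"
    "Holidays & Events\n"
    "**Ingredients**\n"
    "1 onion, chopped\n"
    "**Preparation**\n"
    "In a heavy kettle combine the broth\n"
)

# Character-class form: [*] is a literal asterisk, no backslash needed.
class_pattern = r"(?s).*?[*]{2}Ingredients[*]{2}(.*?)[*]{2}Preparation[*]{2}.*"

# Escaped form in a raw string: the backslash reaches the regex engine as-is.
escaped_pattern = r"(?s).*?\*{2}Ingredients\*{2}(.*?)\*{2}Preparation\*{2}.*"

# Same escaped form in a normal string: each backslash must be doubled.
doubled_pattern = "(?s).*?\\*{2}Ingredients\\*{2}(.*?)\\*{2}Preparation\\*{2}.*"

class_result = re.match(class_pattern, marked_text).group(1)
escaped_result = re.match(escaped_pattern, marked_text).group(1)

print(repr(class_result))
print(escaped_pattern == doubled_pattern)
```

The plain Ingredients on the first line is skipped because it is not surrounded by asterisks, so the lazy prefix is harmless here.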

Casimir et Hippolyte

You can use a lookahead that checks that each line is not Ingredients. This way the test runs only once per line, at the start of the line, instead of at every character:

(?m)^Ingredients\R((?:(?!Ingredients$).*\R)+?)Preparation$ 

pattern details:

(?m)             # switch on multiline mode (^ and $ match at the start and end of each line)
^Ingredients\R   # "Ingredients" at the start of the line followed by a new line
(   # capture group 1
    (?:          # open a non-capturing group
        (?!Ingredients$) # negative lookahead to check that the line is not "Ingredients"
        .*\R             # the line
    )+? # repeat until "Preparation"
)
Preparation$

Note: since you didn't say what regex engine you use, it is possible that \R is not supported. In this case, replace it with \r?\n.
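
For example, Python's re module does not support \R, so a sketch there uses \r?\n throughout. Python and the shortened recipe_text sample are assumptions for illustration only:

```python
import re

# Shortened stand-in for file.txt (assumption), with a menu "Ingredients"
# line before the real heading.
recipe_text = """Recipes & Menus
Ingredients
Holidays & Events
Lentil and Brown Rice Soup
Ingredients
5 cups chicken broth
1 cup brown rice
Preparation
In a heavy kettle combine the broth
"""

# Same pattern as above with \R replaced by \r?\n.
line_pattern = re.compile(
    r"(?m)^Ingredients\r?\n((?:(?!Ingredients$).*\r?\n)+?)Preparation$"
)

# search() is needed because the match does not start at position 0.
found = line_pattern.search(recipe_text)
ingredient_lines = found.group(1).splitlines()
print(ingredient_lines)
```

The attempt that starts at the menu entry fails as soon as the lookahead meets the second Ingredients line, so the search moves on and succeeds from the real heading.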

Wiktor Stribiżew

You can keep the first .* greedy and make only the second one lazy (.*?):

(?s).*\bIngredients\b(.*?)\bPreparation\b

Or you can use a tempered greedy token, in which case you do not need the first .*:

(?s)\bIngredients\b(?:(?!\b(?:Ingredients|Preparation)\b).)*\bPreparation\b

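
A small Python sketch of the tempered greedy token follows. Python is assumed here, the soup_text sample is a shortened stand-in for file.txt, and the capture group around the token is an addition for this example so the ingredient lines can be read from group(1):

```python
import re

# Shortened stand-in for file.txt (assumption).
soup_text = """Recipes & Menus
Ingredients
Holidays & Events
Lentil and Brown Rice Soup
Ingredients
5 cups chicken broth
1 cup brown rice
Preparation
In a heavy kettle combine the broth
"""

# Each "." is consumed only if neither keyword starts at that position,
# so the captured text can never run across another "Ingredients".
# The outer capture group is added for this example.
tempered = re.compile(
    r"(?s)\bIngredients\b"
    r"((?:(?!\b(?:Ingredients|Preparation)\b).)*)"
    r"\bPreparation\b"
)

hit = tempered.search(soup_text)
tempered_lines = hit.group(1).strip().splitlines()
print(tempered_lines)
```

Because the lookahead runs at every character, this form does more work per character than the line-based lookahead, but it needs no leading .* and does not depend on the keywords sitting on lines of their own.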