Python "SyntaxError: EOL while scanning string literal" error while using pandas DataFrame.query() method

77 views Asked by At

I'm sorry it's probably a very silly question but I can't resolve the problem on my own.

I need to write a very large condition in a df.query() method to filter my data. But for some reason I have an error.

My code

data_normal = data.query('(total_area < 400) & (rooms < 10) & (living_area > 10) & (living_area < 200) \
        & (kitchen_area > 5) & (kitchen_area < 40) & (celling_height >= 2.5) & (last_price < 200000000) \ 
        & (last_price >= 1000000) & (floors_total < 30) & (days_exposition < 800) & (days_exposition > 5) \ 
        & (cityCenters_nearest <= 52000) & (parks_nearest <= 1500) & (parks_nearest > 10)')

returning an error:

File "/tmp/ipykernel_306/3185002971.py", line 7 & (kitchen_area > 5) & (kitchen_area < 40) & (celling_height >= 2.5) & (last_price < 200000000) \

^ SyntaxError: EOL while scanning string literal

But my previous code

data_normal = data.query('(total_area < 400) & (rooms < 10) & (living_area > 10) & (living_area < 200) \
        & (cityCenters_nearest <= 52000) & (parks_nearest <= 1500) & (parks_nearest > 10)')

does not catch any errors. I don't know why.

I tried to find quotation that caused this error but I still don't see it as if I'm blind. Help me please.

I've tried multi-line expression with like

data_normal = data.query('''(total_area < 400) & (rooms < 10) & (living_area > 10) & (living_area < 200) \
        & (kitchen_area > 5) & (kitchen_area < 40) & (celling_height >= 2.5) & (last_price < 200000000) \ 
        & (last_price >= 1000000) & (floors_total < 30) & (days_exposition < 800) & (days_exposition > 5) \ 
        & (cityCenters_nearest <= 52000) & (parks_nearest <= 1500) & (parks_nearest > 10)''')
5

There are 5 answers

0
cornisto On

It might be an issue with multi-line query string and slashes. Maybe try to put it in one line first and then gradually split and make it more readable.

This worked for me:

data_normal = data.query('(total_area < 400) & (rooms < 10) & (living_area > 10) & (living_area < 200) \
    & (kitchen_area > 5) & (kitchen_area < 40) & (celling_height >= 2.5) & (last_price < 200000000) \
    & (last_price >= 1000000) & (floors_total < 30) & (days_exposition < 800) & (days_exposition > 5) \
    & (cityCenters_nearest <= 52000) & (parks_nearest <= 1500) & (parks_nearest > 10)')
3
AKX On

You'll probably want to write your query a little more verbosely, and with a little less backslash continuations:

data_normal = data.query("""
(
    (total_area < 400)
    & (rooms < 10)
    & (living_area > 10)
    & (living_area < 200)
    & (kitchen_area > 5)
    & (kitchen_area < 40)
    & (celling_height >= 2.5)
    & (last_price < 200000000)
    & (last_price >= 1000000)
    & (floors_total < 30)
    & (days_exposition < 800)
    & (days_exposition > 5)
    & (cityCenters_nearest <= 52000)
    & (parks_nearest <= 1500)
    & (parks_nearest > 10)
)
""".strip().replace("\n", ""))
0
chepner On

I wouldn't hard-code a single string like this, for readability purposes. Construct a list of individual conditions, then join them into a single string. For example,

# The documentation says & has the same precedence as `and`; you
# can probably drop the parentheses.
conditions = [
    '(total_area < 400)',
    '(rooms < 10)',
    ...
    '(parks_nearest > 10)'
]

data_normal = data.query('&'.join(conditions))
0
Milo Buwalda On

You have to set the backslash at the end of the line without any whitespaces after it.

This fixes it for me:

data_normal = data.query('(total_area < 400) & (rooms < 10) & (living_area > 10) & (living_area < 200)\
    & (kitchen_area > 5) & (kitchen_area < 40) & (celling_height >= 2.5) & (last_price < 200000000)\
    & (last_price >= 1000000) & (floors_total < 30) & (days_exposition < 800) & (days_exposition > 5)\
    & (cityCenters_nearest <= 52000) & (parks_nearest <= 1500) & (parks_nearest > 10)')

I would suggest you look at the answers above to reformat your query to make it more readable.

0
Vikas Sharma On

I would suggest you avoid using data.query() and instead look to mask the conditions explicitly to increase both the readability and maintainability of your code.

Try this:

mask = (
    data['total_area'] < 400) & 
    (data['rooms'] < 10) & 
    (data['living_area'] > 10) & 
    (data['living_area'] < 200) & 
    (data['kitchen_area'] > 5) & 
    (data['kitchen_area'] < 40) & 
    (data['celling_height'] >= 2.5) & 
    (data['last_price'] < 200000000) & 
    (data['last_price'] >= 1000000) & 
    (data['floors_total'] < 30) & 
    (data['days_exposition'] < 800) & 
    (data['days_exposition'] > 5) & 
    (data['cityCenters_nearest'] <= 52000) & 
    (data['parks_nearest'] <= 1500) & 
    (data['parks_nearest'] > 10)
) 
data_normal = data[mask]

You can go more hardcore by doing this:

mask_total_area = data['total_area'] < 400
mask_rooms = data['rooms'] < 10
mask_living_area = (data['living_area'] > 10) & (data['living_area'] < 200)
mask_kitchen_area = (data['kitchen_area'] > 5) & (data['kitchen_area'] < 40)
mask_ceiling_height = data['ceiling_height'] >= 2.5
mask_last_price = (data['last_price'] < 200000000) & (data['last_price'] >= 1000000)
mask_floors_total = data['floors_total'] < 30
mask_days_exposition = (data['days_exposition'] < 800) & (data['days_exposition'] > 5)
mask_cityCenters_nearest = data['cityCenters_nearest'] <= 52000
mask_parks_nearest = (data['parks_nearest'] <= 1500) & (data['parks_nearest'] > 10)

data_normal = data[
    mask_total_area & mask_rooms & mask_living_area & mask_kitchen_area & mask_ceiling_height & 
    mask_last_price & mask_floors_total & mask_days_exposition & mask_cityCenters_nearest & mask_parks_nearest
]