Multiple functions in Streamlit


I'm trying to create a web-scraping app that fetches data from a certain real estate website and returns a dataset including prices and cities. Having returned the dataset, the script should include a slider that will allow me to set a price and filter by it. However, Streamlit reruns the script every time I even slightly move the slider, and the process of web-scraping, cleaning and structuring data begins anew. I've tried using session_state, but it feels like I'm missing something, because it still reloads the whole thing. What can I do about it?

starter = st.checkbox('Begin Data Gathering')
if starter:
    def dataframe_creation(linker):
      link = linker
      br = requests.get(link)
      page = br.content
      soup = bs(page, 'html.parser')
      #PRICES
      advert = soup.find_all("div", attrs={"class": "advert__content-header"})
      price_list = [] ##################FINAL COLUMN
      import re

      html_code = str(advert)
      matches = re.findall(r'<b>(€)</b>(.*?)</span>', html_code)

      for match in matches:
          euro = match[0]
          number = match[1]
          number = number.split(" ")[0]
          number22 = number.replace(".", "")
          price_list.append(number22)
      # CITY LOCATION
      city = soup.find_all("div", attrs={"class": "advert__content-place"})
      city2 = str(city)
      city3 = re.findall(r'>(.*?)</div>', city2)
      city_list = [] #### THIS DATA
      for z in city3:
        kols = z.split(",")[0]
        city_list.append(kols)
      finale1 = pd.DataFrame({
             "Prices": price_list,
             "Cities": city_list})
      finale1['Prices'] = finale1['Prices'].astype(int)
      finale1
slider = st.slider("Price in Euros", min_value=200, max_value=5000, step=10)
finale1[finale1['Prices'] <= int(slider)]

There is 1 answer

AlGM93 On

Essentially, the piece you are missing is called caching. As you said, every time the state of your app changes, i.e. the state of any of the Streamlit components you have laid out, Streamlit reruns the page script on the server and serves the result back to the client (browser). That is expected behaviour.

To make these rerenders more efficient, Streamlit has two mechanisms that act as a kind of memory: session state and caching. In summary:

  • Session state holds the memory of what is known. It is a dictionary-like object (st.session_state) where you can keep variables that persist between rerenders, e.g. states of buttons, counters ...
  • Caching holds the memory of what has been done. It uses a decorator and a common technique called memoization to recognise calls that have already been executed with unchanged inputs, so the work is not repeated. Here are the docs: https://docs.streamlit.io/library/advanced-features/caching.
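
The rerun behaviour and the role of the session dictionary can be modelled in plain Python. This is only a toy sketch, not how Streamlit is implemented: `run_script`, `slow_load`, the `session` dict, and the prices are hypothetical stand-ins for the page script, the lengthy step, `st.session_state`, and the scraped data.

```python
# Toy model of Streamlit reruns (NOT Streamlit's real implementation).
# `session` stands in for st.session_state, `run_script` for the page script.

load_calls = 0  # counts how often the slow step actually runs


def slow_load():
    """Hypothetical stand-in for a lengthy step such as scraping."""
    global load_calls
    load_calls += 1
    return [250, 900, 1800, 4200]  # made-up prices


def run_script(session, max_price):
    """One top-to-bottom execution of the page, as happens on every widget change."""
    if "prices" not in session:  # only true on the first run of a session
        session["prices"] = slow_load()
    return [p for p in session["prices"] if p <= max_price]


session = {}  # survives between reruns, like st.session_state
first = run_script(session, max_price=1000)   # slow step runs here
second = run_script(session, max_price=2000)  # rerun: slow step is skipped
print(first, second, load_calls)
```

Without the `if "prices" not in session` guard, every call to `run_script` would call `slow_load` again, which is the situation described in the question.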

Example: if you have a Streamlit page that contains code for a lengthy process, such as reading a very large CSV, then every time you change the state of the page by interacting with a Streamlit element it will execute your page code from top to bottom, reading the CSV again and making your app slow.

import pandas as pd

# <page layout>
df = pd.read_csv("some.csv")  # Re-executed on every action.
# <page layout>

To correct this behaviour we encapsulate the lengthy process in a function and add a decorator (that's it). The new code will look like this:

import pandas as pd
import streamlit as st

# <page layout>
@st.cache_data
def csv_reader(path):
    # Use the argument, so the cached result depends on the file requested.
    return pd.read_csv(path)

df = csv_reader(path="some.csv")
# <page layout>

A bit more verbose, but much better. The first time the page is rendered, the function csv_reader is executed. At the end of the execution a hash is created from the function's name and the values passed to it, and the pair (hash, function_result) is stored in an internal dictionary as a key-value pair. On later rerenders, when Streamlit calls the function, it first looks up in that internal dictionary whether the function has already been called with those parameters. Since it has, the result is recovered from the dictionary without re-executing the function. The lengthy part is no longer repeated.

This approach is not useful if the function is always called with different parameters or has an element of randomness in it, but I believe this is not the case here. I don't have a complete piece of your code to test, but I believe your issue will be solved by wrapping your function with the caching decorator and explicitly returning the dataframe, such as:

@st.cache_data
def dataframe_creation(linker):
    ...
    return finale1

finale1 = dataframe_creation(linker=some_data)
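
The earlier caveat about functions that are always called with different parameters can be seen with the standard library's `functools.lru_cache`, used here only as a stand-in for Streamlit's decorator. The function `slow_square` is a made-up example.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def slow_square(n):
    """Pretend this is an expensive computation."""
    return n * n


# Every call uses a new argument, so the cache never helps.
for n in [1, 2, 3, 4]:
    slow_square(n)
distinct = slow_square.cache_info()

slow_square.cache_clear()  # resets both the stored results and the statistics

# The same argument is repeated, so only the first call does the work.
for n in [2, 2, 2, 2]:
    slow_square(n)
repeated = slow_square.cache_info()

print(distinct.hits, distinct.misses, repeated.hits, repeated.misses)
```

In the scraping app the URL passed to dataframe_creation stays the same while the slider moves, so it corresponds to the second loop: one real execution followed by cache hits.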