Data normalization and rescaling value in Python

I have a dataset containing URLs, each with a publish date (YYYY-MM-DD) and a visit count. I want to calculate a benchmark (average) of visits over a complete year. The pages were published on different dates, so they should not all carry the same weight. For example, a page published in August with 10,000 visits should contribute more than a page published in March with 11,000 visits.

My dataset was shared as an external link.

First step:

First, I want to add a time frame column to my dataset, calculated from the publish date. For example, if a page was published on 2019-12-10, the column should give the month and year of publication together with the number of months elapsed up to today. Expected output: (Dec 2019, 9 Months), i.e. (month and year in which the page was published, total months from today).

Second step:

I want to normalize/rescale my data (visits) on the basis of the time frame column calculated in step 1.

How can I calculate the average/benchmark?

There is 1 answer

Answer by Maryam:

For the first step you can use the following code. Read the dataframe:

import pandas as pd
df = pd.read_csv("your_df.csv")

My example dataframe is as follows:

            Pub.Dates Type  Visits
0  2019-12-10 00:00:00    A    1000
1  2019-12-15 00:00:00    A    5000
2  2018-06-10 00:00:00    B    6000
3  2018-03-04 00:00:00    B   12000
4  2019-02-10 00:00:00    A    3000

To normalize the dates, first define a function that normalizes a single date:

from datetime import datetime

def normalize_date(date): # input: '2019-12-10 00:00:00'
    date_obj = datetime.strptime(date,"%Y-%m-%d %H:%M:%S") # get datetime object
    date_to_str = date_obj.strftime("%B %Y") # 'December 2019'
    diff_date = datetime.now() - date_obj # timedelta between today and the publish date
    diff_month = diff_date.days // 30 # approximate months, treating a month as 30 days
    normalized_value = date_to_str + ", " + str(diff_month) + " months"
    return normalized_value # 'December 2019, 9 months'

Now apply the function to every value in the date column:

df["Pub.Dates"] = df["Pub.Dates"].map(normalize_date)

The normalized dataframe will be:

                  Pub.Dates Type  Visits
0   December 2019, 9 months    A    1000
1   December 2019, 9 months    A    5000
2      June 2018, 27 months    B    6000
3     March 2018, 31 months    B   12000
4  February 2019, 19 months    A    3000
5       July 2020, 2 months    C    9000
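If you need the elapsed months as a number later on (for rescaling in step 2), it may be easier to keep the label and the month count in separate columns. Here is a minimal sketch of that idea, reusing three rows from the example above. The column names `Pub.Month` and `Months.Live` and the fixed reference date are my own assumptions, and it counts calendar months instead of 30-day blocks, so a count can differ by one from the function above.

```python
import pandas as pd

# Three rows taken from the example dataframe above.
df = pd.DataFrame({
    "Pub.Dates": ["2019-12-10 00:00:00", "2018-06-10 00:00:00", "2018-03-04 00:00:00"],
    "Type": ["A", "B", "B"],
    "Visits": [1000, 6000, 12000],
})

# Assumed fixed reference date so the result is reproducible;
# in practice you would probably use pd.Timestamp.now().
today = pd.Timestamp("2020-09-20")

# Parse the strings once into real datetimes.
pub = pd.to_datetime(df["Pub.Dates"])

# Human-readable label, e.g. "December 2019" (month names follow the locale).
df["Pub.Month"] = pub.dt.strftime("%B %Y")

# Calendar-month difference, ignoring the day of the month.
df["Months.Live"] = (today.year - pub.dt.year) * 12 + (today.month - pub.dt.month)

print(df[["Pub.Month", "Months.Live", "Visits"]])
```

Keeping `Months.Live` numeric means you can divide by it directly instead of parsing it back out of a string.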

For the second step, if there are multiple records per month, group by the date and any other columns you need, then take the mean (note that the group keys must be passed as a list, not a tuple):

average_in_visits = df.groupby(["Pub.Dates", "Type"]).mean()

The result will be:

                               Visits
Pub.Dates                Type        
December 2019, 9 months  A       3000
February 2019, 19 months A       3000
July 2020, 2 months      C       9000
June 2018, 27 months     B       6000
March 2018, 31 months    B      12000
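The grouping above averages raw visits. If the goal is to rescale visits by how long each page has been live, one possible reading is "visits per month live", scaled to 12 months. The sketch below assumes that interpretation, which the question does not spell out. `Months.Live`, `Visits.Per.Month` and `Visits.Annualized` are hypothetical column names, and the month counts are the ones shown in the output above.

```python
import pandas as pd

# Visits and month counts taken from the example output above.
df = pd.DataFrame({
    "Type": ["A", "A", "B", "B", "A"],
    "Visits": [1000, 5000, 6000, 12000, 3000],
    "Months.Live": [9, 9, 27, 31, 19],
})

# Treat pages published this month as one month live to avoid dividing by zero.
months = df["Months.Live"].clip(lower=1)

# Rescale: average visits per month the page has been live.
df["Visits.Per.Month"] = df["Visits"] / months

# Project the monthly rate onto a full year (assumes a constant visit rate).
df["Visits.Annualized"] = df["Visits.Per.Month"] * 12

# Overall benchmark and a per-type benchmark.
benchmark = df["Visits.Annualized"].mean()
benchmark_by_type = df.groupby("Type")["Visits.Annualized"].mean()

print(round(benchmark, 2))
print(benchmark_by_type.round(2))
```

With this rescaling, a recently published page with many visits counts for more than an older page with a similar total, which seems to be what the question is after. A constant visit rate is a strong assumption, so treat the annualized figure as a rough benchmark only.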