I am working with a huge denormalized table on SQL Server (10 columns x 130M rows). Take this as an example of the data:
import pandas as pd
import numpy as np

data = pd.DataFrame({
    'status': ['pending', 'pending', 'pending', 'canceled', 'canceled', 'canceled', 'confirmed', 'confirmed', 'confirmed'],
    'clientId': ['A', 'B', 'C', 'A', 'D', 'C', 'A', 'B', 'C'],
    'partner': ['A', np.nan, 'C', 'A', np.nan, 'C', 'A', np.nan, 'C'],
    'product': ['afiliates', 'pre-paid', 'giftcard', 'afiliates', 'pre-paid', 'giftcard', 'afiliates', 'pre-paid', 'giftcard'],
    'brand': ['brand_1', 'brand_2', 'brand_3', 'brand_1', 'brand_2', 'brand_3', 'brand_1', 'brand_3', 'brand_3'],
    'gmv': [100, 100, 100, 100, 100, 100, 100, 100, 100]})
data = data.astype({'partner': 'category', 'status': 'category', 'product': 'category', 'brand': 'category'})
As you can see, many of its columns are categories/strings that could be factorized (replaced by a small integer ID that joins back to a lookup table). My question is whether there is an easy way to extract a separate lookup "dataframe" from each categorical column and factorize the main table, so that fewer bytes are transmitted per query and it runs faster. Is there an easy library for this?
I would expect to get this output:
data = pd.DataFrame({
    'status': ['1', '1', '1', '2', '2', '2', '3', '3', '3'],
    'clientId': ['1', '2', '3', '1', '4', '3', '1', '2', '3'],
    'partner': ['A', np.nan, 'C', 'A', np.nan, 'C', 'A', np.nan, 'C'],
    'product': ['afiliates', 'pre-paid', 'giftcard', 'afiliates', 'pre-paid', 'giftcard', 'afiliates', 'pre-paid', 'giftcard'],
    'brand': ['brand_1', 'brand_2', 'brand_3', 'brand_1', 'brand_2', 'brand_3', 'brand_1', 'brand_3', 'brand_3'],
    'gmv': [100, 100, 100, 100, 100, 100, 100, 100, 100]})
status_df = {1: 'pending', 2: 'canceled', 3: 'confirmed'}
clientid = {1: 'A', 2: 'B', 3: 'C', 4: 'D'}
and so on!
Bonus question! My table is big, so I would probably need to do this with Dask.
You can use factorize to do this.
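A minimal sketch of the approach, using the status and clientId columns from the question (note that pd.factorize assigns 0-based codes, so the IDs start at 0 rather than 1 as in your expected output):

```python
import pandas as pd

data = pd.DataFrame({
    'status': ['pending', 'pending', 'pending', 'canceled', 'canceled',
               'canceled', 'confirmed', 'confirmed', 'confirmed'],
    'clientId': ['A', 'B', 'C', 'A', 'D', 'C', 'A', 'B', 'C'],
    'gmv': [100] * 9})

lookups = {}
for col in ['status', 'clientId']:
    # factorize returns (codes, uniques): the integer codes plus the
    # distinct values, in order of first appearance
    codes, uniques = pd.factorize(data[col])
    data[col] = codes                         # replace strings with small ints
    lookups[col] = dict(enumerate(uniques))   # lookup table: code -> value

print(data)
print(lookups['status'])    # {0: 'pending', 1: 'canceled', 2: 'confirmed'}
print(lookups['clientId'])  # {0: 'A', 1: 'B', 2: 'C', 3: 'D'}
```

Each lookups[col] dict can then be persisted as its own small dimension table and joined back when needed.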
For columns like partner, where there are NaN values, you can choose to have them replaced with -1 (the default behaviour), or to have NaN included in partner_df (along with its own index) by specifying use_na_sentinel=False.
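A small illustration of the two behaviours on a partner-like column (use_na_sentinel requires pandas 1.5 or newer; in older versions the equivalent keyword was na_sentinel):

```python
import numpy as np
import pandas as pd

partner = pd.Series(['A', np.nan, 'C', 'A', np.nan, 'C'])

# Default: NaN is mapped to the sentinel -1 and excluded from uniques
codes, uniques = pd.factorize(partner)
print(list(codes))    # [0, -1, 1, 0, -1, 1]
print(list(uniques))  # ['A', 'C']

# With use_na_sentinel=False, NaN gets its own code and appears in uniques
codes2, uniques2 = pd.factorize(partner, use_na_sentinel=False)
print(list(codes2))   # [0, 1, 2, 0, 1, 2]
```

Which option you want depends on whether your lookup/dimension table should carry an explicit row for "unknown partner" or whether -1 in the fact table is enough.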