I have a dataset that contains innovation data of companies, by using some regex, I want to retrieve the licence data

company licences/patents
1       UX226, licence-pp-zz, licence-zz-pp, licence-xx-tt
2       VV3346E, SS345
3       licence-dd-zz
4       UT223, licence, ss
5       XBTYU, licence-tt-kk, licence-ss-tt
6       xc, zz
7       licence-xb-xz

Desired output:

company licences/patents                                    licence
1       UX226, licence-pp-zz, licence-zz-pp, licence-xx-tt  licence-pp-zz, licence-zz-pp, licence-xx-tt 
2       VV3346E, SS345
3       licence-dd-zz                                       licence-dd-zz
4       UT223, licence, ss
5       XBTYU, licence-tt-kk, licence-ss-tt                 licence-tt-kk, licence-ss-tt
6       xc, zz
7       licence-xb-xz                                       licence-xb-xz

5 Answers

3
Scott Boston On Best Solutions

You can try:

df['licence'] = df['licences/patents'].str.extractall('(licence-\w{2}-\w{2})')\
  .unstack().apply(lambda x: ', '.join(x.dropna()), axis=1)

Output:

   company                                   licences/patents                                      licence
0        1  UX226, licence-pp-zz, licence-zz-pp, licence-x...  licence-pp-zz, licence-zz-pp, licence-xx-tt
1        2                                     VV3346E, SS345                                          NaN
2        3                                      licence-dd-zz                                licence-dd-zz
3        4                                 UT223, licence, ss                                          NaN
4        5                XBTYU, licence-tt-kk, licence-ss-tt                 licence-tt-kk, licence-ss-tt
5        6                                             xc, zz                                          NaN
6        7                                      licence-xb-xz                                licence-xb-xz
1
Emma On

This RegEx might help you to create one group, $1, where your desired output may be:

(licence-[a-z]{2}-[a-z]{2})

RegEx

1
r_hudson On

df['license'] = df['licences/patents'].apply(lambda x: ''.join(re.findall('lice.*',x)))

It will create new column license with licenses stripped from original column.

2
Chris A On

Another approach, using Series.str.findall and Series.str.join:

df['licence'] = df['licences/patents'].str.findall(r'(licence[^,]*)').str.join(', ')

[out]

   company                                   licences/patents  \
0        1  UX226, licence-pp-zz, licence-zz-pp, licence-x...   
1        2                                     VV3346E, SS345   
2        3                                      licence-dd-zz   
3        4                                 UT223, licence, ss   
4        5                XBTYU, licence-tt-kk, licence-ss-tt   
5        6                                             xc, zz   
6        7                                      licence-xb-xz   

                                       licence  
0  licence-pp-zz, licence-zz-pp, licence-xx-tt  
1                                               
2                                licence-dd-zz  
3                                      licence  
4                 licence-tt-kk, licence-ss-tt  
5                                               
6                                licence-xb-xz  
0
hacker315 On

Try below code:

df['licences/patents'].str.findall('(licence-[\w\-]+)').apply(", ".join)