Create column in OpenRefine using a logic test on an existing column?

3.4k views Asked by At

I have an original data set with over 4,500 row entries representing land transactions. One of the existing columns is "Place," and in each cell value I've put the locality name usually as Some Neighborhood, but sometimes when there is a place I don't know, I'll add additional info from the notary records, e.g., Some Neighborhood, Some County or even as Some Neighborhood, Some County (nicknamed).

I want to create two new columns using the info from this one column. The first column would contain the upper-level administrative division, and the second column the administrative division above that -- kinda like zooming out.

I think I would need to create a new column based on existing column using a sort of logical test or if-if-if- test where I can enumerate the values of Some Neighborhood that belong to County 1 or County 2 or City 1, etc. Then, based on this new column, I'd create another that would have State 1, State 2, etc., presumably using the same sort of test.

How would I write this using regular expressions in Open Refine GREL? How can I string together multiple possible values that are found in the cells? I tried using cells and value.contains but I do not know how to string together multiples like this.

So again to recap, I want to script a way to create a new column that contains new cell values based on a column of existing values, something like

"if Street1,Street2, Street5, but not Street3, or Street4, then County1"

OR

"if Street1 or Street2 or Street5, then County1 AND THEN IF Street3 or Street4 then County2"


EDIT: Here's some of the data:

land sale   0.350   carreau 350 gourdes Bullet
land sale   1.000   carreau 700 gourdes Campèche
land sale   0.200   carreau 220 gourdes Bremont
land sale   0.500   carreau 150 gourdes Pierrette
land sale   5.000   carreau 225 gourdes Lagenivrée
land sale   0.125   carreau 200 gourdes Bullet
land sale   1.000   carreau 300 gourdes Tozin
land sale   0.125   carreau 100 gourdes Dufort
land sale   0.250   carreau 135 gourdes Charitte, Savann Brute
land sale   0.500   carreau 300 gourdes Ravines des Roches
land sale   0.500   carreau 80  gourdes Isidore (Nègre Libre)
land sale   0.500   carreau 215 gourdes Nordette (Boures)
land sale   0.250   carreau 200 gourdes Bullet (Morne Montègue)

And here's an example of the expected result with two new columns, let's say "Commune" and "Section":

land sale   0.350   carreau 350 gourdes Limonade    Bwadlans    Bullet
land sale   1.000   carreau 700 gourdes Limonade    Bwadlans    Campèche
land sale   0.200   carreau 220 gourdes Limonade    Bwadlans    Bremont
land sale   0.500   carreau 150 gourdes Limonade    Roucou      Pierrette
land sale   5.000   carreau 225 gourdes Limonade    Roucou      Lagenivrée
land sale   0.125   carreau 200 gourdes Limonade    Bwadlans    Bullet
land sale   1.000   carreau 300 gourdes Quart_Mor   Sablé       Tozin
land sale   0.125   carreau 100 gourdes Limonade    Bwadlans    Dufort
land sale   0.250   carreau 135 gourdes Limonade    Bwadlans    Charitte, Savann Brute
land sale   0.500   carreau 300 gourdes Limonade    Bwadlans    Ravines des Roches
land sale   0.500   carreau 80  gourdes Limonade    Bwadlans    Isidore (Nègre Libre)
land sale   0.500   carreau 215 gourdes Limonade    Bwadlans    Nordette (Boures)
land sale   0.250   carreau 200 gourdes Limonade    Bwadlans    Bullet (Morne Montègue)
2

There are 2 answers

0
Owen Stephens On BEST ANSWER

There are a number of approaches and the one you choose may depend on how many values you have in the initial Place column.

I recommend the first approach as a starting point

Approach 1 - use facets:

  • a) Add blank columns (e.g. Commune, Section) (do this by using 'add column based on this column' from any existing column, and using 'null' as GREL formula)
  • b) Facet on Place column, select a set of values in the same Commune and Section (e.g. 'Bullet','Campéche','Bremont', etc.) and then write formula to update Commune and Section columns with 'Bwadlans' and 'Limonade'
  • c) Repeat for each commune/section with a group of Places

Approach 2 - use a lookup: If you have a mapping of Places -> Commune & Sections already, you can create this as a separate OpenRefine project, then use 'cross' to lookup a Commune/Section for each Place

  • a) Create OpenRefine project with columns for Place, Commune and Section (one row per place)
  • b) In the initial project create a new column based on the Place column with GREL like: cell.cross('place mapping project','Place')[0].cells["Commune"].value
  • c) Repeat (b) but for the Section value e.g. cell.cross('place mapping project','Place')[0].cells["Section"].value

Approach 3 - use conditionals: This is I think the approach you asked for in the first place, but I don't think it is necessarily the best option

  • a) using 'add column based on this column' from any Place column, use GREL like if(or(value=="Bullet",value=="Campéche"),"Bwadlans","")
  • b) for other locations you can write similar GREL on the new column by referring to the value in the Place column - e.g. if(or(cells["Place"].value=="Pierrette",cells["Place"].value=="Lagenivrée"),"Roucou")

Note that to use more than two conditions you have to nest 'or' statements e.g.: or(or(value=="Bullet",value=="Campéche"),value=="Bremont")

Approaches 2 and 3 can be made easier by installing OpenRefine extensions which help with this process (recommend the VIB-BITS extension for Approach 2, and looking at the gokbutils extension with the 'inArray' function for Approach 3 - which would avoid complex 'or' statements)

0
Ettore Rizza On

As you can see in Owen's answer (Approach 3), conditionnals are tricky with GREL. If this is really what you want to do, use Python/Jython instead. Here is an example based on your data:

value = value.strip().lower()

if "pierrette" in value or "lagenivrée" in value:
    return "Limonade||Roucou"
elif "tozin" in value:
    return "Quart_Mor||Sablé"
else:
    return "Limonade||Bwadlans"

Then just split the new column using || as a separator (see the screencast).

enter image description here

You can of course create more complex conditions using brackets and not in.

Fake example :

    value = value.strip().lower()

    if "pierrette" in value or "lagenivrée" in value:
        return "Limonade||Roucou"
    elif "tozin" in value:
        return "Quart_Mor||Sablé"
    elif ( ("ravinne" in value or "lagenivrée" in value) 
           and ("des roches" not in value or "savan" not in value)):
        return "Somewhat||Somewath else"
    else:
        return "Limonade||Bwadlans"