Drawing an analogy from SQL/Relational Algebra to *nix command-line filters/tools

230 views Asked by At

Tools like cat/tac, grep/sed, head/tail, cut,sort/uniq, comm and join make it easy - when chained together - to extract data from one or multiple files. The dominating paradigm in this context is that of creating a "pipeline" or applying "filters" to extract information.

I was thinking that a different perspective, that of Relational Operators, may be useful when applied to text files, at least as a conceptual framework which may help me better utilize and combine these tools to extract and transform data.

My question is: Could someone suggest me some examples showing an analogy between relational operators/SQL queries and Unix filtering pipelines?

To make it more specific and potentially more useful, I provide here two tabular white space-separated datasets created with StackExchange Data Explorer, containing numerical, nominal and ordinal values:

$ cat so_tag_rank.txt
Tag           Group      MayRank   TotalRank  QuestionsInMay  QuestionsTotal
javascript    Lang       1         2          25111           869428
java          Lang       2         1          21331           870935
android       Platform   3         5          17351           682459
php           Lang       4         4          16910           761350
c#            Lang       5         3          15296           805666
jquery        Library    6         6          12950           621883
python        Lang       7         7          12394           434282
html          Markup     8         8          11672           422871
ios           Platform   9         10         9465            340714
css           Style      10        12         8141            310651


$ cat so_upvote_ratio.txt
Tags          Group       Upvotes    Downvotes    UD_ratio
c#            Lang        3692949    110750       3
java          Lang        3643245    112450       3.09
javascript    Lang        3326801    122302       3.68
c++           Lang        2247424    71309        3.17
python        Lang        2227963    59990        2.69
php           Lang        2177780    106569       4.89
android       Platform    2094791    59973        2.86
jquery        Library     2054170    68602        3.34
html          Markup      1444111    67129        4.65
.net          Platform    1334296    38208        2.86

For the sake of simplicity, let's specify that we don't care about named columns (you may ignore headers and specify columns positionally - Symbolic names can be simulated using variables in a shell script).

I would be interested in oneliners/scripts using one or more of the commands: cat/tac, grep, sed, awk, head/tail, cut, sort, uniq, comm, diff, join, rev, tee, paste in a sh/bash shell, simulating some of the following SQL commands/Relational operations:

Selection, Projection, Union

  SELECT [DISTINCT] select_expr [FROM file_refs]  
    [WHERE where_condition] | [WHERE_IN strlist]  
    [GROUP BY {col | expr}]  
    [HAVING where_condition]  
    [ORDER BY {col | expr} [ASC | DESC]]  
    [LIMIT {num_rows OFFSET offset}]  
    UNION [ALL | DISTINCT] SELECT ...(sub-query)..  

Joins/Products:

  [LEFT | RIGHT] [INNER | OUTER] JOIN [ON {on_expression} | USING {using_expression}] 

Functions and comparison

Aggregation   : CONCAT, COUNT, SUM, MAX, MIN, AVG, MEDIAN, MODE, INDEXOF_MAX, INDEXOF_MIN  
Comparison    : =,>,<, LIKE/NOT LIKE, NULL/NOT NULL, IN/NOT IN, Dates comparison
Boolean operators : AND,OR,NOT,XOR

Advanced operations

Band/Range JOIN, Self JOIN  
Views (Temporary files?), Difference/Except, Division  
Sub-queries/Nested queries  
Contingency tables 

I would prefer creativity over portability in this case.

1

There are 1 answers

5
Erwin Smout On

The relational model was conceived as a general-purpose data management tool. As such, it is capable and powerful enough to express, basically, everything and anything.

As a consequence, the kind of stuff that is expressible/doable in *ix pipeline commands is a proper subset of what is expressible/doable in RA. (But note that RA requires the stuff you're working on to have a declared relational structure, which the output of *ix commands do not.)

For example, a FIND filter (sorry for this ancient DOS example) is somewhat akin to WHERE X LIKE '%myfind%', clearly a restriction.

For example, the fact of pipelining itself is somewhat akin to relational operator nesting.

But to be honest, most of the operations supported by RA do not seem to have a more or less direct counterpart in the commands set.