Tools like cat/tac, grep/sed, head/tail, cut, sort/uniq, comm and join make it easy - when chained together - to extract data from one or multiple files.
The dominating paradigm in this context is that of creating a "pipeline" or applying "filters" to extract information.
I was thinking that a different perspective - that of relational operators - may be useful when applied to text files, at least as a conceptual framework that could help me better use and combine these tools to extract and transform data.
My question is: Could someone suggest some examples showing an analogy between relational operators/SQL queries and Unix filtering pipelines?
To make it more specific and potentially more useful, I provide here two tabular whitespace-separated datasets created with StackExchange Data Explorer, containing numerical, nominal and ordinal values:
$ cat so_tag_rank.txt
Tag Group MayRank TotalRank QuestionsInMay QuestionsTotal
javascript Lang 1 2 25111 869428
java Lang 2 1 21331 870935
android Platform 3 5 17351 682459
php Lang 4 4 16910 761350
c# Lang 5 3 15296 805666
jquery Library 6 6 12950 621883
python Lang 7 7 12394 434282
html Markup 8 8 11672 422871
ios Platform 9 10 9465 340714
css Style 10 12 8141 310651
$ cat so_upvote_ratio.txt
Tags Group Upvotes Downvotes UD_ratio
c# Lang 3692949 110750 3
java Lang 3643245 112450 3.09
javascript Lang 3326801 122302 3.68
c++ Lang 2247424 71309 3.17
python Lang 2227963 59990 2.69
php Lang 2177780 106569 4.89
android Platform 2094791 59973 2.86
jquery Library 2054170 68602 3.34
html Markup 1444111 67129 4.65
.net Platform 1334296 38208 2.86
For the sake of simplicity, let's specify that we don't care about named columns: you may ignore the headers and refer to columns positionally (symbolic names can be simulated with variables in a shell script).
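For instance, I picture something along these lines (a rough, untested sketch; the column numbers are taken from the sample data above):
$ Tag=1; QuestionsInMay=5
$ awk -v tag="$Tag" -v qmay="$QuestionsInMay" 'NR > 1 {print $tag, $qmay}' so_tag_rank.txt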
I would be interested in one-liners/scripts using one or more of the commands cat/tac, grep, sed, awk, head/tail, cut, sort, uniq, comm, diff, join, rev, tee, paste in a sh/bash shell, simulating some of the following SQL commands/relational operations:
Selection, Projection, Union
SELECT [DISTINCT] select_expr [FROM file_refs]
[WHERE where_condition] | [WHERE_IN strlist]
[GROUP BY {col | expr}]
[HAVING where_condition]
[ORDER BY {col | expr} [ASC | DESC]]
[LIMIT {num_rows OFFSET offset}]
UNION [ALL | DISTINCT] SELECT ...(sub-query)..
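For instance, I imagine that something roughly like SELECT Tag, QuestionsInMay FROM so_tag_rank WHERE Group = 'Lang' ORDER BY QuestionsInMay DESC LIMIT 3 might become (untested sketch, columns numbered as in the sample data):
$ awk 'NR > 1 && $2 == "Lang" {print $1, $5}' so_tag_rank.txt | sort -k2,2nr | head -n 3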
Joins/Products:
[LEFT | RIGHT] [INNER | OUTER] JOIN [ON {on_expression} | USING {using_expression}]
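As a starting point for the joins, I imagine something like the following inner join on the tag column (field 1 in both files); a sketch assuming bash process substitution, since join(1) wants its inputs sorted on the join field and without headers:
$ join <(tail -n +2 so_tag_rank.txt | sort -k1,1) <(tail -n +2 so_upvote_ratio.txt | sort -k1,1)
(Adding -a 1 should keep unmatched lines from the first file, which is roughly a LEFT OUTER JOIN.)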
Functions and comparison
Aggregation: CONCAT, COUNT, SUM, MAX, MIN, AVG, MEDIAN, MODE, INDEXOF_MAX, INDEXOF_MIN
Comparison: =, >, <, LIKE/NOT LIKE, NULL/NOT NULL, IN/NOT IN, date comparison
Boolean operators: AND, OR, NOT, XOR
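For the aggregation part, I picture awk doing the GROUP BY bookkeeping; e.g. something like SELECT Group, COUNT(*), SUM(QuestionsInMay) FROM so_tag_rank GROUP BY Group might be sketched (untested) as:
$ awk 'NR > 1 {n[$2]++; s[$2] += $5} END {for (g in n) print g, n[g], s[g]}' so_tag_rank.txt
(The output order of the for-in loop is arbitrary, so an ORDER BY would presumably need a trailing sort.)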
Advanced operations
Band/Range JOIN, Self JOIN
Views (Temporary files?), Difference/Except, Division
Sub-queries/Nested queries
Contingency tables
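To give one concrete case from this list, I suspect Difference/Except on the tag column (tags ranked in May but absent from the up-vote file) could be approximated with comm on sorted, projected inputs (untested sketch, bash process substitution assumed):
$ comm -23 <(awk 'NR > 1 {print $1}' so_tag_rank.txt | sort) <(awk 'NR > 1 {print $1}' so_upvote_ratio.txt | sort)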
I would prefer creativity over portability in this case.
The relational model was conceived as a general-purpose data management tool. As such, it is capable and powerful enough to express, basically, everything and anything.
As a consequence, the kind of stuff that is expressible/doable in *ix pipeline commands is a proper subset of what is expressible/doable in RA. (But note that RA requires the stuff you're working on to have a declared relational structure, which the output of *ix commands does not.)
For example, a FIND filter (sorry for this ancient DOS example) is somewhat akin to WHERE X LIKE '%myfind%', clearly a restriction.
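A rough Unix counterpart (just a sketch on your first file) would be grep, or awk if the match has to be limited to one column:
$ grep 'java' so_tag_rank.txt            # ~ WHERE any_column LIKE '%java%'
$ awk '$1 ~ /java/' so_tag_rank.txt      # ~ WHERE Tag LIKE '%java%'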
For example, the fact of pipelining itself is somewhat akin to relational operator nesting.
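In relational terms the pipeline composes the operators left to right instead of nesting them inside out; for instance (a sketch, not tested), restriction, then projection, then ordering:
$ awk '$2 == "Lang"' so_tag_rank.txt | awk '{print $1, $6}' | sort -k2,2nr
which plays roughly the role of SELECT Tag, QuestionsTotal FROM so_tag_rank WHERE Group = 'Lang' ORDER BY QuestionsTotal DESC.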
But to be honest, most of the operations supported by RA do not seem to have a more or less direct counterpart in the command set.