Miller returns nothing to stdout


I am currently working with a huge TSV file (~5,000 columns and 500,000 records) structured approximately as follows:

f.ID    f.1.0.0    f.2.0.0    f.3.0.1    f.3.0.2
1    A    22    B32    -1    
2    F    38    B1    65 

I cannot inspect it manually, but I have a sister file that should be in the same file format (with the join key f.ID in common).

Everything works fine on the sister file:

$ mlr --itsv cut -f f.ID file1.tab | head -n2
f.ID=1
f.ID=2

But when I try to subset the main file on known columns (e.g. f.ID), Miller returns nothing:

$ mlr --itsv cut -f f.ID file2.tab | head -n2

I am having a hard time working out how to diagnose what is going on with this file, as I suspect it is formatted in a non-standard way. Is there a way to see what Miller is doing for each record, or to find out where it is failing?

5

There are 5 answers

5
aborruso On BEST ANSWER

If you can use another tool, try the DuckDB CLI. The dots in the regex are escaped so that they match literally rather than matching any character:

duckdb --csv -c "SELECT COLUMNS('^f\.1\.0\.0$') from read_csv_auto('input.tsv');" >output.csv

Start with a limited number of rows:

duckdb --csv -c "SELECT COLUMNS('^f\.1\.0\.0$') from read_csv_auto('input.tsv') limit 1000;" >output_1000.csv
2
aborruso On

You could run a check for each line:

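while IFS= read -r line; do
  # Feed the current line to Miller; without the pipe, mlr would read the rest of input.tsv from stdin itself
  printf '%s\n' "$line" | mlr --tsv -N check
done < input.tsv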
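
Spawning one process per line gets slow on 500,000 records, so a cheaper first pass could be to count the tab-separated fields on every line with awk. This is only a sketch: it assumes the file is meant to be strictly tab-delimited, and the function names field_count_histogram and mismatched_lines are made up for illustration, not part of Miller.

```shell
# Print "field_count number_of_lines" pairs, smallest field count first.
field_count_histogram() {
  awk -F '\t' '{ n[NF]++ } END { for (k in n) print k, n[k] }' "$1" | sort -n
}

# Print every line number whose field count differs from the header's.
mismatched_lines() {
  awk -F '\t' '
    NR == 1    { want = NF; next }
    NF != want { print NR ": " NF " fields, expected " want }
  ' "$1"
}
```

If field_count_histogram prints a single row, every line has the same number of tab-separated fields as the header, and the problem is probably elsewhere. If it prints several rows, piping mismatched_lines input.tsv into head shows the first offending line numbers.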
1
Sebastian On

Strange!

I copied and pasted your example lines into a file. I first had to strip the trailing blanks from each line and then replace each run of spaces with a single tab.

First step, with Perl:

$ perl -pi.bak -e 's/\h+$//;s/\h+/\t/g' f.csv

which leads to the following file content (tabs instead of spaces):

$ cat -vet f.csv
f.ID^If.1.0.0^If.2.0.0^If.3.0.1^If.3.0.2$
1^IA^I22^IB32^I-1$
2^IF^I38^IB1^I65$

Second step, with Miller:

$ mlr --csv --ifs '\t' cut -f f.ID f.csv

and got

f.ID
1
2

or, for the last column:

$ mlr --csv --ifs '\t' cut -f f.3.0.2 f.csv

f.3.0.2
-1
65

--

Your original mlr command also works on this cleaned file:

$ mlr --itsv cut -f f.ID f.csv

results in:

f.ID=1
f.ID=2

Hope I have been able to shed some light on this. (Miller version is 6.11.0.)

0
Sebastian On

@Giulio Centorame

As I said before, your data lacks a single, consistent separator. I took a closer look at your example file (file1_header.txt), and this is visible immediately.

Your first column (f.eid) is separated from the second (f.3.0.0) by a tab, whereas column 2 is separated from column 3 (f.3.1.0) by a space. Below, tabs are written as the regular expression \t and spaces as \s. This is the beginning of your header:

f.eid\t{1}f.3.0.0\s{1}f.3.1.0\s{1}f.3.2.0 ....

The spaces continue up to column 14, where a tab appears as the separator again. The number of consecutive tabs also changes from one to two (\t{2}):

.... f.6.2.0\s{1}f.19.0.0\t{2}f.21.0.0\t{2}f.21.1.0 ....

So I'm not surprised at all that mlr can't cope with this.
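
To repeat this kind of inspection on another file, the whitespace runs in the header can be tallied with awk. This is a rough sketch rather than anything Miller provides: header_separators is a made-up name, and the <TAB> and <SP> markers are just a display convention chosen here.

```shell
# Tally each distinct run of tabs/spaces found in the header line.
# Output lines look like "count run", with tabs shown as <TAB> and spaces as <SP>.
header_separators() {
  head -n 1 "$1" | awk '{
    s = $0
    while (match(s, /[ \t]+/)) {
      run = substr(s, RSTART, RLENGTH)   # the separator run just found
      gsub(/\t/, "<TAB>", run)           # make tabs visible
      gsub(/ /, "<SP>", run)             # make spaces visible
      count[run]++
      s = substr(s, RSTART + RLENGTH)    # continue after this run
    }
    for (r in count) print count[r], r
  }' | LC_ALL=C sort
}
```

A cleanly tab-separated header should produce a single line ending in <TAB>. Any additional lines, such as <SP> or <TAB><TAB>, point to the kind of mixed separators described above.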

I cleaned up your header, appended a dummy line of data below it (sorry for that) and voilà: mlr has no problem with it.

New (cropped) sample file with 1 space as separator:

"f.eid f.3.0.0 f.3.1.0 f.3.2.0 f.4.0.0 f.4.1.0 f.4.2.0 f.5.0.0 f.5.1.0 f.5.2.0 f.6.0.0 f.6.1.0 f.6.2.0 f.19.0.0 f.21.0.0 f.21.1.0

data_f.eid data_f.3.0.0 data_f.3.1.0 data_f.3.2.0 data_f.4.0.0 data_f.4.1.0 data_f.4.2.0 data_f.5.0.0 data_f.5.1.0 data_f.5.2.0 data_f.6.0.0 data_f.6.1.0 data_f.6.2.0 data_f.19.0.0 data_f.21.0.0 data_f.21.1.0"

$ mlr --csv --fs ' ' --opprint --from guilio-file1_header-cropped.txt cut -f f.4.1.0,f.6.2.0,f.21.1.0

--->

f.4.1.0 f.6.2.0 f.21.1.0
data_f.4.1.0 data_f.6.2.0 data_f.21.1.0

It would be interesting to see some real-world data. Just a few lines, Giulio.

0
John Kerl On

Here is a repro script, and a status update:

https://github.com/johnkerl/miller/issues/1501#issuecomment-1962439382