Trying to concat a list of 12 AnnData objects but am getting duplicates

I'm prepping our experimental data for trajectory analysis by partially following this guide, and I have a list of 12 AnnData objects that I read in from loom files. Six of them come from one sequencing run and the other six come from another. I followed the guide's recommendation to generate spliced/unspliced count matrices using velocyto, which is how I got the loom files.

Anyway, that's all background information. I'm trying to merge all of these AnnData objects into one.

>>> loom_data
[AnnData object with n_obs × n_vars = 5000 × 36601
    obs: 'comp.ident'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
    layers: 'ambiguous', 'matrix', 'spliced', 'unspliced', AnnData object with n_obs × n_vars = 5773 × 36601
    obs: 'comp.ident'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
    layers: 'ambiguous', 'matrix', 'spliced', 'unspliced', AnnData object with n_obs × n_vars = 6807 × 36601
    obs: 'comp.ident'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
    layers: 'ambiguous', 'matrix', 'spliced', 'unspliced', AnnData object with n_obs × n_vars = 5613 × 36601
    obs: 'comp.ident'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
    layers: 'ambiguous', 'matrix', 'spliced', 'unspliced', AnnData object with n_obs × n_vars = 6052 × 36601
    obs: 'comp.ident'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
    layers: 'ambiguous', 'matrix', 'spliced', 'unspliced', AnnData object with n_obs × n_vars = 3500 × 36601
    obs: 'comp.ident'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
    layers: 'ambiguous', 'matrix', 'spliced', 'unspliced', AnnData object with n_obs × n_vars = 10510 × 36601
    obs: 'comp.ident'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
    layers: 'ambiguous', 'matrix', 'spliced', 'unspliced', AnnData object with n_obs × n_vars = 9356 × 36601
    obs: 'comp.ident'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
    layers: 'ambiguous', 'matrix', 'spliced', 'unspliced', AnnData object with n_obs × n_vars = 3246 × 36601
    obs: 'comp.ident'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
    layers: 'ambiguous', 'matrix', 'spliced', 'unspliced', AnnData object with n_obs × n_vars = 1132 × 36601
    obs: 'comp.ident'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
    layers: 'ambiguous', 'matrix', 'spliced', 'unspliced', AnnData object with n_obs × n_vars = 13595 × 36601
    obs: 'comp.ident'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
    layers: 'ambiguous', 'matrix', 'spliced', 'unspliced', AnnData object with n_obs × n_vars = 9541 × 36601
    obs: 'comp.ident'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
    layers: 'ambiguous', 'matrix', 'spliced', 'unspliced']

I'm having trouble understanding how to do this. All of the barcodes, which are loom_data[i].obs.index, are unique and carry a suffix for the sample they correspond to. Ultimately, I want to bring these layers into another AnnData object using scVelo.
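The uniqueness claim is easy to double-check before merging. Here is a minimal sketch in plain Python; the helper name `shared_barcodes` and the toy barcodes are hypothetical, and in practice the input would be something like `[list(a.obs_names) for a in loom_data]`.

```python
from collections import Counter


def shared_barcodes(barcode_lists):
    """Return every barcode that occurs more than once across (or within) the lists."""
    counts = Counter(bc for barcodes in barcode_lists for bc in barcodes)
    return sorted(bc for bc, n in counts.items() if n > 1)


# Toy stand-ins for the per-sample obs indices (hypothetical values)
toy = [["AAAC-s1", "AAAG-s1"], ["AAAC-s2", "AAAG-s2"]]
clashes = shared_barcodes(toy)
print(clashes)  # an empty list means no barcode is repeated
```

An empty result supports the assumption that no cell would be merged with, or collide with, a cell from another sample.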

The issue is the call to sc.concat. The genes are what overlap between the objects; none of the barcodes are supposed to match across the 12 list elements. So I chose the vars axis, which I think is axis 1, and asked for the union of the elements on the other axis:

test = sc.concat(loom_data, axis = 1, join = 'outer')

But when I call the line above, I get a concatenation of all of the genes with the names made unique, even though I want to consolidate their counts:

>>> test
AnnData object with n_obs × n_vars = 80125 × 439212
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
    layers: 'ambiguous', 'matrix', 'spliced', 'unspliced'
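For reference, the reported shape matches what one would expect if axis = 1 places the 12 gene axes side by side while the outer join keeps every barcode. A small arithmetic sketch, using only the numbers from the loom_data printout above (the variable names are illustrative):

```python
# Cell counts of the 12 objects, copied from the loom_data printout
n_obs = [5000, 5773, 6807, 5613, 6052, 3500,
         10510, 9356, 3246, 1132, 13595, 9541]
n_vars_each = 36601  # every object reports the same number of genes

# Concatenating along the var axis adds the gene columns together
total_vars = len(n_obs) * n_vars_each
# No barcode is shared, so the union of cells is simply the sum
total_obs = sum(n_obs)

print(total_obs, total_vars)  # 80125 439212
```

So the genes were not consolidated at all: each object contributed its own full block of 36601 columns.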

I want the genes that aren't present in all samples to just have 0 counts. How would this be possible?

1 answer

CelineDion On

I just made the var names unique first and then concatenated along the obs axis (axis 0) instead of the var axis.

import scanpy as sc

# for each element in loom_data, make var names unique
for ldata in loom_data:
    ldata.var_names_make_unique()

# stack the cells (axis 0); the outer join keeps the union of genes
test = sc.concat(loom_data, axis=0, join='outer')
>>> test
AnnData object with n_obs × n_vars = 80125 × 36601
    obs: 'comp.ident'
    layers: 'ambiguous', 'matrix', 'spliced', 'unspliced'

~36k genes is about what I expected.