I have a huge dataset (10k+ rows) that I've visualized as a bubble plot in ggplot2. However, I need to order both axes. I am very familiar with using factor() and a levels list to order variables, but this simply isn't realistic given the amount of data.
Example of using factor() in the past:
primary_order_list4 <- c("RHM","TS","SCS","LIL","STN","STS")
fenorm.gene.count$Pond <- factor(fenorm.gene.count$Pond, levels=primary_order_list4)
That won't work here because I have a few hundred different variables, several of which are gene annotations (and super messy, if you know you know).
X-axis:
I have the column Sample, which combines the location name (e.g. SCS), the year (e.g. 19), and the depth (e.g. 3), giving for example SCS19-3. I am trying to order the samples by each pond's increasing distance, stored in the column GOA. For example, the pond SCS has a GOA of 18.7 and the pond RHM has a GOA of 6.4, so RHM would precede SCS. Secondly, within each pond, I would then like to order by the variable depthcenter, from shallow (0.5) to deep (19.5).
I have tried ordering the df by increasing depthcenter and then by GOA, but when this is plotted with ggplot, the x-axis is still ordered alphabetically:
fenorm.gene.count <- fenorm.gene.count[order(fenorm.gene.count$depthcenter),]
ugh <- fenorm.gene.count[order(fenorm.gene.count$GOA),]
ughplot <- ggplot(fenorm.gene.count, aes(x=Sample,y=HMM,fill=Pond)) + geom_point(aes(size=ifelse(total_count==0, NA, total_count)))
I am also trying to do something similar with the y-axis variables.
Here is a short version of my df:
dput(head(fenorm.gene.count))
structure(list(Project = c("Ga0598240", "Ga0598240", "Ga0598240",
"Ga0598240", "Ga0598240", "Ga0598240"), category = c("iron_reduction",
"iron_reduction", "iron_oxidation", "iron_oxidation", "iron_reduction",
"iron_oxidation"), HMM = c("MtrC_TIGR03507",
"OmcS", "OmcF", "Cyc2_repCluster2", "PF01032-FecCD-YfhA-FpvE-YfeCD_transport_family",
"PvuCD-FhuB-CbrBC-FeuB-CbrC-YfhA"), total_count = c(10L,
1L, 17L, 55L, 15L, 1L), norm = c(0.00136836343732895, 0.000136836343732895,
0.00232621784345922, 0.00752599890530925, 0.00205254515599343,
0.000136836343732895), Sample = c("LIL19-1", "LIL19-1", "LIL19-1",
"LIL19-1", "LIL19-1", "LIL19-1"), Portal = c(3300064049, 3300064049,
3300064049, 3300064049, 3300064049, 3300064049), G2 = c("N",
"N", "N", "N", "N", "N"), Pond = c("LIL", "LIL", "LIL", "LIL",
"LIL", "LIL"), year = c(2019L, 2019L, 2019L, 2019L, 2019L, 2019L
), depthcenter = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5), toplow = c("top",
"top", "top", "top", "top", "top"), GOA = c(19.4, 19.4, 19.4,
19.4, 19.4, 19.4), elevation = c(8.2, 8.2, 8.2, 8.2, 8.2, 8.2
)), row.names = 65:70, class = "data.frame")
Simply reordering your dataset will have no effect. You have to convert the variable to a factor with the order of levels set in your desired order. To do this, use e.g. factor(fenorm.gene.count$HMM, levels = unique(fenorm.gene.count$HMM)) after reordering your data, or forcats::fct_inorder().
Note: In the example below I reordered by total_count, as your depthcenter column has the same value for each observation.
Additionally, it might also be worthwhile to consider reorder() or forcats::fct_reorder(), which both allow you to reorder by a numeric variable without the need to reorder the dataset first.
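As a minimal sketch of both approaches, the code below uses a small toy data frame. Its values are made up for illustration (only the column names follow the question), and depthcenter is varied so that the within-pond ordering is visible. The x-axis variable is ordered by sorting the rows on GOA and then depthcenter and fixing the factor levels in order of first appearance; the y-axis variable is ordered with reorder() by summed total_count. The plotting call is guarded because it assumes ggplot2 is installed, and shape = 21 is an added assumption so that the fill aesthetic is actually drawn.

```r
# Toy data: hypothetical values, column names as in the question
df <- data.frame(
  Sample      = c("SCS19-3", "SCS19-1", "RHM19-2", "RHM19-1", "LIL19-1"),
  Pond        = c("SCS", "SCS", "RHM", "RHM", "LIL"),
  GOA         = c(18.7, 18.7, 6.4, 6.4, 19.4),
  depthcenter = c(5.5, 0.5, 2.5, 0.5, 0.5),
  HMM         = c("OmcS", "OmcF", "OmcS", "Cyc2_repCluster2", "OmcF"),
  total_count = c(1L, 17L, 10L, 55L, 15L),
  stringsAsFactors = FALSE
)

# X-axis: sort rows by GOA, then by depthcenter within each GOA value ...
df <- df[order(df$GOA, df$depthcenter), ]
# ... and freeze that row order as the factor level order
df$Sample <- factor(df$Sample, levels = unique(df$Sample))
levels(df$Sample)
# "RHM19-1" "RHM19-2" "SCS19-1" "SCS19-3" "LIL19-1"

# Y-axis: order HMM by its summed total_count, no row sorting needed
df$HMM <- reorder(df$HMM, df$total_count, FUN = sum)
levels(df$HMM)
# "OmcS" "OmcF" "Cyc2_repCluster2"

# ggplot2 uses the factor level order for discrete axes
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(df, ggplot2::aes(x = Sample, y = HMM, fill = Pond)) +
    ggplot2::geom_point(ggplot2::aes(size = total_count), shape = 21)
}
```

Because order() accepts several sort keys, no hand-written levels list is needed: the level order falls out of the numeric columns, which should scale to a few hundred distinct values.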