Order character values on x-axis based on separate values in the same df


I have a huge dataset (10k+ rows) that I've visualized as a bubble plot in ggplot2. However, I need to organize both axes. I am very familiar with using factor() and a levels list to order variables, but this simply isn't realistic given the amount of data.

Example of using factor() in the past:

primary_order_list4 <- c("RHM","TS","SCS","LIL","STN","STS")
fenorm.gene.count$Pond <- factor(fenorm.gene.count$Pond,levels=primary_order_list4)

That won't work here because I have a few hundred different variables, several of which are gene annotations (and super messy, if you know you know).

X-axis: The column Sample combines the location name (e.g. SCS), the year (e.g. 19), and the depth (e.g. 3), giving values like SCS19-3. I am trying to order the samples by each pond's distance, stored in the column GOA, in increasing order. For example, the pond SCS has a GOA of 18.7 and the pond RHM has a GOA of 6.4, so RHM would precede SCS. Then, within each pond, I would like to order by the variable depthcenter from shallow (0.5) to deep (19.5).

I have tried ordering the df by increasing depthcenter and then GOA, but when it is plotted with ggplot, the x-axis is still ordered alphabetically:

fenorm.gene.count <- fenorm.gene.count[order(fenorm.gene.count$depthcenter),]
ugh <- fenorm.gene.count[order(fenorm.gene.count$GOA),]
ughplot <- ggplot(fenorm.gene.count, aes(x=Sample,y=HMM,fill=Pond)) + geom_point(aes(size=ifelse(total_count==0, NA, total_count)))

I am also trying to do something similar with the y-axis variables.

Here is a short version of my df:

    dput(head(fenorm.gene.count))
    structure(list(Project = c("Ga0598240", "Ga0598240", "Ga0598240", 
    "Ga0598240", "Ga0598240", "Ga0598240"), category = c("iron_reduction", 
    "iron_reduction", "iron_oxidation", "iron_oxidation", "iron_reduction", 
    "iron_oxidation"), HMM = c("MtrC_TIGR03507", 
    "OmcS", "OmcF", "Cyc2_repCluster2", "PF01032-FecCD-YfhA-FpvE-YfeCD_transport_family", 
    "PvuCD-FhuB-CbrBC-FeuB-CbrC-YfhA"), total_count = c(10L, 
    1L, 17L, 55L, 15L, 1L), norm = c(0.00136836343732895, 0.000136836343732895, 
    0.00232621784345922, 0.00752599890530925, 0.00205254515599343, 
    0.000136836343732895), Sample = c("LIL19-1", "LIL19-1", "LIL19-1", 
    "LIL19-1", "LIL19-1", "LIL19-1"), Portal = c(3300064049, 3300064049, 
    3300064049, 3300064049, 3300064049, 3300064049), G2 = c("N", 
    "N", "N", "N", "N", "N"), Pond = c("LIL", "LIL", "LIL", "LIL", 
    "LIL", "LIL"), year = c(2019L, 2019L, 2019L, 2019L, 2019L, 2019L
    ), depthcenter = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5), toplow = c("top", 
    "top", "top", "top", "top", "top"), GOA = c(19.4, 19.4, 19.4, 
    19.4, 19.4, 19.4), elevation = c(8.2, 8.2, 8.2, 8.2, 8.2, 8.2
    )), row.names = 65:70, class = "data.frame")

Answer by stefan (best answer):

Simply reordering your dataset will have no effect. You have to convert the variable to a factor with the levels set in your desired order. To do this, use e.g. factor(fenorm.gene.count$HMM, levels = unique(fenorm.gene.count$HMM)) after reordering your data, or forcats::fct_inorder().
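To illustrate why row order alone does not help, here is a minimal base R sketch. The vector `pond` is a made-up toy example using pond names from the question. By default factor() sorts its levels alphabetically, and a character column on a discrete ggplot2 axis is ordered the same way, so the levels have to be set explicitly.

```r
# Toy vector (hypothetical), already in the desired order RHM, SCS, LIL
pond <- c("RHM", "SCS", "LIL")

# Default: levels are sorted alphabetically, the input order is ignored
default_levels <- levels(factor(pond))

# Explicit: levels follow the order of first appearance in the data
inorder_levels <- levels(factor(pond, levels = unique(pond)))

print(default_levels)
print(inorder_levels)
```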

Note: In the example below I reordered by total_count as your depthcenter column has the same value for each observation.

library(ggplot2)

fenorm.gene.count <- fenorm.gene.count[
  order(fenorm.gene.count$total_count),
]

fenorm.gene.count$HMM <- factor(
  fenorm.gene.count$HMM,
  levels = unique(fenorm.gene.count$HMM)
)

ggplot(fenorm.gene.count, aes(x = Sample, y = HMM, fill = Pond)) +
  geom_point(aes(size = ifelse(total_count == 0, NA, total_count))) +
  labs(size = NULL)
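The same approach should carry over to the ordering asked for in the question (GOA first, then depthcenter within each pond). The following is only a sketch on toy data: the data frame `toy`, its sample names and its depth values are made up for illustration, while the GOA values for RHM, SCS and LIL are the ones quoted in the question.

```r
# Toy data (hypothetical sample names and depths; GOA values from the question)
toy <- data.frame(
  Sample      = c("SCS19-3", "LIL19-1", "RHM19-2", "SCS19-1", "RHM19-1", "LIL19-1"),
  Pond        = c("SCS", "LIL", "RHM", "SCS", "RHM", "LIL"),
  GOA         = c(18.7, 19.4, 6.4, 18.7, 6.4, 19.4),
  depthcenter = c(4.5, 0.5, 2.5, 0.5, 0.5, 0.5)
)

# Sort rows by GOA first, then by depthcenter within each GOA value
toy <- toy[order(toy$GOA, toy$depthcenter), ]

# Fix the factor levels to the order of first appearance in the sorted data
toy$Sample <- factor(toy$Sample, levels = unique(toy$Sample))

print(levels(toy$Sample))
```

Passing both keys to a single order() call matters here: sorting by depthcenter and then by GOA in two separate steps, as in the question, does not reliably give the nested order.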


Additionally, it might be worthwhile to consider reorder() or forcats::fct_reorder(), both of which let you reorder by a numeric variable without having to reorder the dataset first.

ggplot(fenorm.gene.count, aes(x = Sample, y = reorder(HMM, total_count), fill = Pond)) +
  geom_point(aes(size = ifelse(total_count == 0, NA, total_count))) +
  labs(size = NULL)
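One thing to keep in mind with reorder(): when a level occurs in several rows (as each HMM would across many samples), it summarises the numeric values per level, using the mean by default, and a different summary can be passed via FUN. A small sketch on made-up counts (the data frame `toy_counts` is hypothetical):

```r
# Toy data (hypothetical counts): OmcS appears twice, the others once
toy_counts <- data.frame(
  HMM         = c("OmcS", "OmcF", "Cyc2_repCluster2", "OmcS"),
  total_count = c(1, 17, 55, 30)
)

# Default summary is the mean per level: OmcS = 15.5, OmcF = 17, Cyc2 = 55
by_mean <- reorder(toy_counts$HMM, toy_counts$total_count)

# Using the sum instead: OmcF = 17, OmcS = 31, Cyc2 = 55
by_sum <- reorder(toy_counts$HMM, toy_counts$total_count, FUN = sum)

print(levels(by_mean))
print(levels(by_sum))
```

Which summary is appropriate depends on what the axis order is meant to convey, so the choice of FUN here is only an example.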