Create a parallel-coordinate plot based on the number of occurrences in R

I have a data frame (df) as shown below. It records, for two students over three terms, which subjects they took and whether they passed or failed each one.
I would like to draw a parallel-coordinate plot that traces each student's path through the terms, so that I can see which route was taken to reach the end.

ID  term  subject  result
1    1     math01    fail
1    1     Phys01    pass
1    1     chem01    pass
1    2     math01    pass
1    2     math02    fail
1    3     math02    fail
1    3     cmp01     pass
2    1     math01    fail
2    1     phys01    pass
2    2     math01    pass
2    2     math02    pass
2    3     cmp01     pass
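
For reproducibility, here is a minimal sketch of the same table as an R data frame. The name df is an assumption chosen to match the description above, and the subject names are copied exactly as listed, so "Phys01" and "phys01" count as two different subjects.

```r
# Rebuild the example table; values are copied row by row from the listing above
df <- data.frame(
    ID      = rep(c(1, 2), times = c(7, 5)),
    term    = c(1, 1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3),
    subject = c("math01", "Phys01", "chem01", "math01", "math02", "math02", "cmp01",
                "math01", "phys01", "math01", "math02", "cmp01"),
    result  = c("fail", "pass", "pass", "pass", "fail", "fail", "pass",
                "fail", "pass", "pass", "pass", "pass"),
    stringsAsFactors = FALSE
)
```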

The desired result would be similar to the image below.
Each block under a term shows a subject taken in that term, combined with the value of the result column (fail or pass). The size of a block should correspond to the number of students who took that subject with that result. For example, if most students fail math01 in term 1, the math01fail block should be the largest block under term 1.

The connecting lines show which subjects a student took in one term and which they took in the next. The thickness of a line corresponds to the number of students making that transition. For example, if many students fail math01 in term 1 (math01fail), then retake it in term 2 and pass (math01pass), the line from math01fail to math01pass should be correspondingly thicker.

How can I create such a plot in R?

Answer by Maurits Evers (best answer)

I think you're better off approaching this problem from a graph point of view, rather than in the context of parallel coordinates.

Here is what I would do:

  1. Start by loading the necessary libraries.

    library(tidyverse)
    library(igraph)
    
  2. Define the edge list of the graph. To do so, we self-join df by ID and keep only the rows that pair consecutive terms (term i and term i+1). Each row then represents a link from a subject in term i to a subject in term i+1 for one student.

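    # Pair every row of a student with every other row of the same student
    # (dplyr >= 1.1 warns about the many-to-many match, which is intended here)
    el <- left_join(df, df, by = "ID") %>%
        # Keep only pairs that link term i to term i + 1
        filter(term.x == term.y - 1) %>%
        # Prefix the term numbers, e.g. 1 becomes "term1"
        mutate(across(starts_with("term"), ~ paste0("term", .x))) %>%
        # Build node labels such as "term1\nmath01\nfail"
        unite(from, term.x, subject.x, result.x, sep = "\n") %>%
        unite(to, term.y, subject.y, result.y, sep = "\n") %>%
        select(from, to) %>%
        group_by(from, to) %>%
        # 1 student gives weight 1, 2 students give weight 6, and so on
        summarise(weight = (n() - 1) * 5 + 1, .groups = "drop")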
    

    We add a weight column that grows with the number of students on each edge. The reason we don't simply use weight = n() is purely aesthetic: the formula makes edges shared by more than one student clearly thicker.

  3. Next, we define the node list. The key here is to add columns x and y, which determine the grid layout of the nodes.

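    nl <- df %>%
        mutate(term = paste0("term", term)) %>%
        arrange(term) %>%
        # One node per distinct term/subject/result combination
        distinct(term, subject, result) %>%
        # x: column of the grid, one per term
        mutate(x = as.integer(as.factor(term))) %>%
        group_by(term) %>%
        # y: position of the node within its term
        mutate(y = row_number()) %>%
        ungroup() %>%
        # Same label format as the from/to columns of el
        unite(node, term, subject, result, sep = "\n")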
    

    Note that the entries in the first column of nl must match those in the first two columns of el.

  4. We are now ready to construct an igraph graph from the two data frames and plot it.

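    # Edges come from el, vertices (including the x and y columns) from nl
    gr <- graph_from_data_frame(d = el, vertices = nl, directed = FALSE)
    plot(
        gr,
        # Line thickness follows the edge weight computed in step 2
        edge.width = E(gr)$weight,
        vertex.shape = "rectangle",
        vertex.size = 50, vertex.size2 = 50,
        vertex.color = "white",
        vertex.label.family = "sans",
        vertex.label.cex = 0.7)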
    

    The resulting plot may need some further tweaking/polishing, but this should get you started.
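
As one possible tweak, the block size could follow the number of students per block, as asked in the question. The snippet below is only a sketch: it rebuilds the example data so that it runs on its own, and the names node_counts and node_size, as well as the scaling constants 30 and 15, are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)

# Example data from the question, copied as posted
df <- data.frame(
    ID      = rep(c(1, 2), times = c(7, 5)),
    term    = c(1, 1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3),
    subject = c("math01", "Phys01", "chem01", "math01", "math02", "math02", "cmp01",
                "math01", "phys01", "math01", "math02", "cmp01"),
    result  = c("fail", "pass", "pass", "pass", "fail", "fail", "pass",
                "fail", "pass", "pass", "pass", "pass"),
    stringsAsFactors = FALSE
)

# Count the students in each term/subject/result block and build
# the same node labels as in the node list above
node_counts <- df %>%
    mutate(term = paste0("term", term)) %>%
    count(term, subject, result, name = "n_students") %>%
    unite(node, term, subject, result, sep = "\n")

# Hypothetical scaling: size 30 for one student, plus 15 per extra student
node_size <- setNames(30 + 15 * (node_counts$n_students - 1), node_counts$node)
```

With the graph gr from step 4, these sizes could then be passed to plot() as vertex.size = node_size[V(gr)$name] (and likewise for vertex.size2) in place of the fixed value 50.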