Overlapping test and visualization


I have the following data frame:

       window_start  window_end  dataset
29125       1828457     1828868     129C
29126       1891493     1891904     129C
29127       2312557     2312968     129C
29128       3745905     3746316     129C
29129       5036701     5037112     129C
...             ...         ...      ...
49838     185443673   185444084     172C
49840     186261905   186262316     172C
49841     186888969   186889380     172C
49980     187896721   187897132     172C
49987     190067549   190067960     172C

530 rows × 3 columns

I would like to get two results:

1. identify the overlapping regions numerically across all the intervals (e.g. [1828450, 1828860], etc.);
2. visualize all the intervals in a matplotlib plot similar to the one shown below.

[image]

I already tried the following code for point 2, but it shows nothing:

# pl is assumed to be matplotlib.pyplot (it could also be pylab)
import matplotlib.pyplot as pl

# Select the rows of one dataset once, then take the start and end columns
subset = AllC_chr1[AllC_chr1.dataset == '129C']
x_start = subset.window_start.to_numpy()
x_end = subset.window_end.to_numpy()
y = subset.index
pl.figure()
pl.barh(y / 1000, width=x_end - x_start, left=x_start)

Any suggestions will be welcome.

Thank you for your support

There is 1 answer

Answered by JohanC

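The main problem is that the bars are extremely narrow compared to the distance between them: each window is only a few hundred units wide, while the x-axis spans millions. At that scale only the outlines of the bars are drawn, not their interior, so with the default white edge color nothing is visible. You can change the edge color to something else, for example edgecolor='blue'.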
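To put rough numbers on this, here is a small sketch that uses only the five 129C sample rows shown in the question. The 1000-pixel axis width is an assumption made for illustration; the real figure size may differ.

```python
# Sample rows for dataset 129C, copied from the question.
window_start = [1828457, 1891493, 2312557, 3745905, 5036701]
window_end = [1828868, 1891904, 2312968, 3746316, 5037112]

# Each bar is only as wide as its window.
widths = [end - start for start, end in zip(window_start, window_end)]

# The x-axis has to cover everything from the smallest start to the largest end.
axis_span = max(window_end) - min(window_start)

# Assumption for illustration: about 1000 pixels of horizontal drawing area.
axis_pixels = 1000
bar_pixels = [width / axis_span * axis_pixels for width in widths]

print("bar widths in data units:", widths)
print("axis span in data units:", axis_span)
print("bar widths in pixels:", [round(p, 3) for p in bar_pixels])
```

Every bar comes out well under one pixel wide, so there is no interior left to fill. With the full 530 rows the axis span is much larger still, which makes the bars even thinner.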

You can use the 'dataset' column for the y-axis, to get them automatically labeled. Bar plots are drawn with "sticky edges" (setting the left margin to zero). If that isn't desired, ax.use_sticky_edges can be turned off.

With matplotlib, it is highly recommended to import matplotlib.pyplot as plt. This makes the code easier to compare with example code, and easier for others to read. The object-oriented interface (fig, ax) also makes it clearer what is going on.

import matplotlib.pyplot as plt
import pandas as pd

AllC_chr1 = pd.DataFrame({
    'window_start': [1828457, 1891493, 2312557, 3745905, 5036701, 185443673, 186261905, 186888969, 187896721,
                     190067549],
    'window_end': [1828868, 1891904, 2312968, 3746316, 5037112, 185444084, 186262316, 186889380, 187897132, 190067960],
    'dataset': ['129C', '129C', '129C', '129C', '129C', '172C', '172C', '172C', '172C', '172C']},
    index=[29125, 29126, 29127, 29128, 29129, 49838, 49840, 49841, 49980, 49987])

df = AllC_chr1
# To plot a single dataset instead, filter first:
# df = AllC_chr1[AllC_chr1['dataset'] == '129C']

fig, ax = plt.subplots(figsize=(15, 3))
ax.barh(df['dataset'], left=df['window_start'],
        width=df['window_end'] - df['window_start'], edgecolor='blue')
# Disable sticky edges
ax.use_sticky_edges = False
# Set the x-axis tick labels to millions
# (passing a function directly needs matplotlib 3.3 or newer)
ax.xaxis.set_major_formatter(lambda x, pos: f"{x / 1000000:g}M")

plt.tight_layout()
plt.show()

[Figure: horizontal bar plot showing differences]
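For point 1 of the question (finding the overlapping regions numerically), one possible approach is to sort the intervals by start position and compare each interval only with the ones that start before it ends. The sketch below is not part of the plotting code above. The function name find_overlaps and the toy intervals are hypothetical, made up for illustration. It also assumes that intervals are closed, so touching end points count as an overlap.

```python
def find_overlaps(intervals):
    """Return (start, end, label_a, label_b) for every pair of overlapping intervals.

    intervals: iterable of (start, end, label) tuples with start <= end.
    Intervals are treated as closed, so shared end points count as an overlap.
    """
    ordered = sorted(intervals)  # sort by start, then by end
    overlaps = []
    for i, (start_a, end_a, label_a) in enumerate(ordered):
        for start_b, end_b, label_b in ordered[i + 1:]:
            if start_b > end_a:
                break  # all later intervals start even further to the right
            # start_b >= start_a because of the sort, so the overlap starts at start_b
            overlaps.append((start_b, min(end_a, end_b), label_a, label_b))
    return overlaps


# Hypothetical toy intervals, invented so that some of them overlap.
toy_intervals = [
    (100, 511, "129C"),
    (400, 811, "172C"),
    (2000, 2411, "129C"),
    (5000, 5411, "172C"),
    (5300, 5711, "129C"),
]

# The ten sample rows from the question.
sample_intervals = [
    (1828457, 1828868, "129C"), (1891493, 1891904, "129C"),
    (2312557, 2312968, "129C"), (3745905, 3746316, "129C"),
    (5036701, 5037112, "129C"), (185443673, 185444084, "172C"),
    (186261905, 186262316, "172C"), (186888969, 186889380, "172C"),
    (187896721, 187897132, "172C"), (190067549, 190067960, "172C"),
]

toy_overlaps = find_overlaps(toy_intervals)
sample_overlaps = find_overlaps(sample_intervals)
print(toy_overlaps)
print(sample_overlaps)
```

With the data frame from the question, the input could be built with zip(df['window_start'], df['window_end'], df['dataset']). None of the ten sample rows overlap each other, so the second result is an empty list. To keep only overlaps between different datasets, filter the result on label_a != label_b.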