Find a way to ensure alignment of two interval tables. #37

Open
golobor opened this issue May 18, 2020 · 5 comments

Comments

@golobor

golobor commented May 18, 2020

We need a function to synchronize the indices of two tables with almost identical intervals. This is typically needed to safely transfer columns between these tables. Alternatively, we could have a function that transfers columns between two tables with almost identical intervals.

@nvictus, is this a good summary of your request?

@nvictus

nvictus commented May 19, 2020

Yeah, effectively it's about "re-indexing" the two tables to have the same "index", where the "index" is chrom, start, end. That means it is equivalent to a join which could be inner, outer, left or right (but without actually merging into one table). The most common case here would probably be outer.

If we forget about the left and right options, it could also be cast as an n-ary operation:

dfs = align_tables(**dfs)  # or align_tables(dfs) ?

@gfudenberg

Two ideas for implementation:
https://gist.github.com/gfudenberg/3b016cdc20b3e482cdd9411d3542f823
Is this what you had in mind, @nvictus?

@nvictus

nvictus commented Sep 8, 2021

I think it would have to return the dataframes with a common (reset) index, and chrom, start and end values, determined by some kind of join on the intervals in the inputs.

e.g.

import pandas as pd

df1 = pd.DataFrame([
    ['chr1', 0, 1000, 'a'],
    ['chr1', 1000, 2000, 'b'],
], columns=['chrom', 'start', 'end', 'foo'])

df2 = pd.DataFrame([
    ['chr1', 0, 1000, 'c'],
    ['chr1', 1000, 2000, 'd'],
    ['chr1', 2000, 3000, 'e'],
], columns=['chrom', 'start', 'end', 'bar'])

df3 = pd.DataFrame([
    ['chr1', 0, 1000, 'f'],
    ['chr1', 1000, 2000, 'g'],
    ['chrX', 0, 1000, 'x'],
    ['chrX', 1000, 2000, 'y'],
], columns=['chrom', 'start', 'end', 'baz'])

>>> df1, df2, df3 = align_tables([df1, df2, df3], how='outer')
>>> df1
  chrom  start   end  foo
0  chr1      0  1000    a
1  chr1   1000  2000    b
2  chr1   2000  3000  NaN
3  chrX      0  1000  NaN
4  chrX   1000  2000  NaN
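
A minimal sketch of how such a function might behave, assuming plain pandas and exact matching on chrom, start and end. The name align_tables and its how argument come from the example above; the KEYS constant, the restriction to 'outer' and 'inner', and the duplicate check are assumptions made for illustration, not an existing API.

```python
from functools import reduce

import pandas as pd

KEYS = ["chrom", "start", "end"]  # assumed alignment columns


def align_tables(dfs, how="outer", keys=KEYS):
    """Hypothetical sketch: give every table the same (chrom, start, end) rows."""
    keys = list(keys)
    indexed = [df.set_index(keys) for df in dfs]
    for i, df in enumerate(indexed):
        # Reindexing is only well defined when each interval appears once.
        if df.index.has_duplicates:
            raise ValueError(f"table {i} contains duplicate intervals")
    indexes = [df.index for df in indexed]
    if how == "outer":
        common = reduce(lambda a, b: a.union(b), indexes)
    elif how == "inner":
        common = reduce(lambda a, b: a.intersection(b), indexes)
    else:
        raise ValueError("this sketch only supports how='outer' or how='inner'")
    common = common.sort_values().set_names(keys)
    # Rows missing from a table are filled with NaN; the index is then reset.
    return [df.reindex(common).reset_index() for df in indexed]


left = pd.DataFrame(
    [["chr1", 0, 1000, "a"], ["chr1", 1000, 2000, "b"]],
    columns=KEYS + ["foo"],
)
right = pd.DataFrame(
    [["chr1", 0, 1000, "c"], ["chrX", 0, 1000, "x"]],
    columns=KEYS + ["bar"],
)
left_aligned, right_aligned = align_tables([left, right], how="outer")
print(left_aligned)
print(right_aligned)
```

Because every returned table shares the same reset index and the same chrom, start and end values, a column can be copied across by plain assignment, e.g. left_aligned["bar"] = right_aligned["bar"]. Left and right joins are left out here since they do not generalize to the n-ary case.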

@gfudenberg

gfudenberg commented Sep 15, 2021

So it seems that a requirement for alignability is that any interval in df1 can overlap 1 or 0 intervals in df2 (and vice versa). Anything else?

Your example has uniform-width bins. Were there use cases where non-uniform bins would make sense? If not, this sounds like a function join_bintables() that could follow nicely from bin definitions.

Also, what was the use case where you'd want to return individual 'aligned' dfs, rather than a df with joined columns?
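
The "1 or 0 overlaps" requirement could be checked roughly as follows. This is a sketch only: count_overlaps and is_alignable are hypothetical helper names, intervals are assumed to be half-open [start, end), and the row-by-row loop is quadratic, so it is meant to illustrate the condition rather than to be fast.

```python
import pandas as pd


def count_overlaps(df1, df2):
    """For each row of df1, count the rows of df2 that it overlaps."""
    counts = []
    rows = df1[["chrom", "start", "end"]].itertuples(index=False, name=None)
    for chrom, start, end in rows:
        # Half-open intervals overlap when each one starts before the other ends.
        hits = (df2["chrom"] == chrom) & (df2["start"] < end) & (df2["end"] > start)
        counts.append(int(hits.sum()))
    return pd.Series(counts, index=df1.index)


def is_alignable(df1, df2):
    """True if no interval in either table overlaps more than one in the other."""
    forward = (count_overlaps(df1, df2) <= 1).all()
    backward = (count_overlaps(df2, df1) <= 1).all()
    return bool(forward and backward)


cols = ["chrom", "start", "end"]
bins_a = pd.DataFrame([["chr1", 0, 1000], ["chr1", 1000, 2000]], columns=cols)
bins_b = pd.DataFrame(
    [["chr1", 0, 1000], ["chr1", 1000, 2000], ["chr1", 2000, 3000]], columns=cols
)
shifted = pd.DataFrame([["chr1", 500, 1500]], columns=cols)

print(is_alignable(bins_a, bins_b))   # bins line up one-to-one or not at all
print(is_alignable(bins_a, shifted))  # the shifted bin straddles two bins
```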

@nvictus

nvictus commented Jan 14, 2022

For this to work, the alignment multiindex ('chrom', 'start', 'end') + potentially others would have to be unique, i.e. no duplicates, so this would have to be checked first.
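
A minimal sketch of that uniqueness check, assuming plain pandas. The helper name check_unique_intervals is hypothetical, and the keys argument stands in for ('chrom', 'start', 'end') plus any other columns used for alignment.

```python
import pandas as pd


def check_unique_intervals(df, keys=("chrom", "start", "end")):
    """Raise ValueError if any combination of the key columns is repeated."""
    # keep=False marks every member of a duplicated group, not only the repeats.
    dup = df.duplicated(subset=list(keys), keep=False)
    if dup.any():
        raise ValueError(
            f"{int(dup.sum())} rows share a duplicated key; "
            f"offending row labels: {list(df.index[dup])}"
        )


ok = pd.DataFrame(
    [["chr1", 0, 1000, "a"], ["chr1", 1000, 2000, "b"]],
    columns=["chrom", "start", "end", "foo"],
)
bad = pd.DataFrame(
    [["chr1", 0, 1000, "a"], ["chr1", 0, 1000, "b"]],
    columns=["chrom", "start", "end", "foo"],
)

check_unique_intervals(ok)  # passes silently
try:
    check_unique_intervals(bad)
    raised = False
except ValueError:
    raised = True
print(raised)
```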
