[FEATURE] Limit and Limit by Partition #128

Closed
kvnkho opened this issue Dec 19, 2020 · 1 comment · Fixed by #131

Comments

@kvnkho
Collaborator

kvnkho commented Dec 19, 2020

Implement a new method/transformer to limit the number of rows returned from a dataframe.

Look at the Spark documentation: https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.limit
Pandas doesn't have a direct equivalent. Maybe the backend can use df.head(), or df.sample() if we want the result to be random.
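
As a rough illustration of the two pandas options mentioned above (a sketch only, not Fugue's backend code): head(n) returns the first n rows in the frame's current order, while sample(n) draws n rows at random. The data mirrors the examples later in this thread, and the random_state value is an arbitrary assumption used to make the draw reproducible.

```python
import pandas as pd

df = pd.DataFrame([[0, 1], [0, 2], [1, 3]], columns=["a", "b"])

# Deterministic: the first n rows in the frame's current order.
first_rows = df.head(2)

# Random: n rows drawn without replacement. random_state is only set
# here so that the illustration is reproducible.
random_rows = df.sample(n=2, random_state=0)
```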

@goodwanghan
Collaborator

It should be head for pandas.

For limit, if no order by is specified, then returning any rows is valid, so we don't have to sample the dataset.

We may expect something like this:

with FugueWorkflow() as dag:
    df = dag.df([[0,1],[0,2],[1,3]],"a:int,b:int")
    df.partition(by=["a"], presort="b desc").limit(1).show()

It should extract [0,2],[1,3].

Another case: if there is no partition by, it should be something like:
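
To make the expected result concrete, here is a minimal pandas sketch of the same semantics (presort within each partition, then keep the first row per partition). It is only an assumption about how a pandas backend might behave, not the implementation.

```python
import pandas as pd

df = pd.DataFrame([[0, 1], [0, 2], [1, 3]], columns=["a", "b"])

# presort="b desc": order rows so that the largest b comes first.
presorted = df.sort_values("b", ascending=False)

# partition by "a", then keep the first row of each partition.
limited = presorted.groupby("a", sort=False).head(1)

# Sort by the partition key only to make the output order stable.
result = limited.sort_values("a").values.tolist()
print(result)  # [[0, 2], [1, 3]]
```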

with FugueWorkflow() as dag:
    df = dag.df([[0,1],[0,2],[1,3]],"a:int,b:int")
    df.limit(1, presort="a,b desc").show()

It should extract [0,2].

For the simplest case:
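
The unpartitioned case can be sketched in pandas the same way. The parse_presort helper below is hypothetical; it assumes a presort spec is a comma-separated list of column names, each optionally followed by "asc" or "desc", with ascending as the default.

```python
import pandas as pd


def parse_presort(spec):
    """Hypothetical helper: split "a,b desc" into columns and ascending flags."""
    columns, ascending = [], []
    for part in spec.split(","):
        tokens = part.split()
        if not tokens:
            continue  # ignore empty segments such as a trailing comma
        columns.append(tokens[0])
        # Ascending unless the column is explicitly marked "desc".
        ascending.append(len(tokens) < 2 or tokens[1].lower() != "desc")
    return columns, ascending


df = pd.DataFrame([[0, 1], [0, 2], [1, 3]], columns=["a", "b"])

columns, ascending = parse_presort("a,b desc")

# Sort the whole frame by the presort spec, then keep the first row.
result = df.sort_values(columns, ascending=ascending).head(1).values.tolist()
print(result)  # [[0, 2]]
```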

with FugueWorkflow() as dag:
    df = dag.df([[0,1],[0,2],[1,3]],"a:int,b:int")
    df.limit(1).show()

Returning any row of df should be valid because, in a distributed environment, we don't guarantee order across partitions.

For this problem, we need to first implement limit at the engine level, then at the workflow level.

@goodwanghan goodwanghan added this to the 0.5.0 milestone Dec 20, 2020
@kvnkho kvnkho linked a pull request Jan 11, 2021 that will close this issue
@kvnkho kvnkho closed this as completed Jan 11, 2021