[FEATURE] Limit and Limit by Partition #128

kvnkho · 2020-12-19T17:00:28Z

Implement a new method/transformer to limit the number of rows returned from a DataFrame.

See the Spark documentation: https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.limit
Pandas has no direct equivalent. The backend could use df.head(), or df.sample() if we want the rows to be random.

goodwanghan · 2020-12-20T05:46:34Z

For pandas, it should be head.

For limit, if no ordering is specified, returning any rows is valid, so we don't have to sample the dataset.
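To make the head vs. sample distinction concrete, here is a minimal pandas-only sketch. It is not Fugue code; the variable names are just for illustration.

```python
import pandas as pd

df = pd.DataFrame([[0, 1], [0, 2], [1, 3]], columns=["a", "b"])

# Deterministic: the first n rows in the frame's current order.
first = df.head(1)

# Random: any n rows; random_state only makes the draw repeatable.
sampled = df.sample(n=1, random_state=0)

print(first.values.tolist())
print(sampled.values.tolist())
```

Since an unordered limit may return any rows, head is the cheaper choice because it avoids the random draw entirely.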

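We may expect something like this:

from fugue import FugueWorkflow

with FugueWorkflow() as dag:
    df = dag.df([[0,1],[0,2],[1,3]],"a:int,b:int")
    # take the top row of each partition of a, ordered by b descending
    df.partition(by=["a"], presort="b desc").limit(1).show()

It should extract [0,2],[1,3].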
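For reference, the same partitioned limit can be sketched in plain pandas. This is only an illustration of the expected semantics, not the engine implementation; limit_by_partition and its parameters are hypothetical names.

```python
import pandas as pd


def limit_by_partition(df, n, by, presort_col, ascending=False):
    """Keep the first n rows of each partition after sorting by presort_col."""
    # Sort once so that each group sees its rows in presort order.
    ordered = df.sort_values(presort_col, ascending=ascending, kind="stable")
    # groupby(...).head(n) keeps the first n rows of every group.
    limited = ordered.groupby(by).head(n)
    # Order the output by the partition keys to make it easy to compare.
    return limited.sort_values(by, kind="stable").reset_index(drop=True)


df = pd.DataFrame([[0, 1], [0, 2], [1, 3]], columns=["a", "b"])
result = limit_by_partition(df, n=1, by=["a"], presort_col="b")
print(result.values.tolist())
```

The final sort by the partition keys is only there for readable output; the limit itself makes no promise about the order of partitions.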

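In another case, if there is no partition by, it should be something like:

from fugue import FugueWorkflow

with FugueWorkflow() as dag:
    df = dag.df([[0,1],[0,2],[1,3]],"a:int,b:int")
    # sort the whole dataframe by a ascending, then b descending, and take the first row
    df.limit(1, presort="a,b desc").show()

It should extract [0,2].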
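A pandas sketch of this global case, assuming presort="a,b desc" means a ascending, then b descending. limit_presorted is a hypothetical helper, not part of Fugue.

```python
import pandas as pd


def limit_presorted(df, n, keys, ascending):
    """Sort the whole frame by keys, then keep the first n rows."""
    ordered = df.sort_values(keys, ascending=ascending, kind="stable")
    return ordered.head(n).reset_index(drop=True)


df = pd.DataFrame([[0, 1], [0, 2], [1, 3]], columns=["a", "b"])
# "a,b desc": a ascending, b descending
top = limit_presorted(df, n=1, keys=["a", "b"], ascending=[True, False])
print(top.values.tolist())
```

On a distributed engine, this global sort is the expensive part, which is why the unsorted limit is allowed to return any rows.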

For the simplest case:

with FugueWorkflow() as dag:
    df = dag.df([[0,1],[0,2],[1,3]],"a:int,b:int")
    df.limit(1).show()

Returning any row of df should be valid because, in a distributed environment, we don't guarantee order across partitions.

For this, we first need to implement limit at the engine level, then at the workflow level.

goodwanghan assigned kvnkho Dec 20, 2020

goodwanghan added this to the 0.5.0 milestone Dec 20, 2020

goodwanghan added the core feature, enhancement, high priority, and programming interface labels Dec 20, 2020

kvnkho linked a pull request Jan 11, 2021 that will close this issue

limit with partition by #131

Merged

kvnkho closed this as completed Jan 11, 2021
