Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[VL] Results are mismatch with Vanilla Spark after multi left-join and complex group-by in release-1.1 #5372

Open
kecookier opened this issue Apr 11, 2024 · 0 comments
Labels
bug Something isn't working triage

Comments

@kecookier
Copy link
Contributor

Backend

VL (Velox)

Bug description

select lu.uuid uuid,
       lu.user_id user_id,
       lu.poi_id poi_id,
       ex.1day_exposure 1day_exposure,
       ex.3day_exposure 3day_exposure,
       ex.7day_exposure 3day_exposure,
       click.1days_click 1days_click,
       click.3days_click 3days_click,
       click.7days_click 7days_click,
       order.1days_order 1days_order,
       order.3days_order 3days_order,
       order.7days_order 7days_order,
       exp_nord.1days_expnoord 1days_expnoord,
       exp_nord.3days_expnoord 3days_expnoord,
       exp_nord.7days_expnoord 7days_expnoord,
       exp_noclick.1days_expnoclk,
       exp_noclick.3days_expnoclk,
       exp_noclick.7days_expnoclk,
       clk_nord.1days_clknoord,
       clk_nord.3days_clknoord,
       clk_nord.7days_clknoord
  from (
        select dp_id as uuid,
               user_id,
               poi_id
          from table_b
         where partition_date between '2024-03-22' and '2024-03-28'
           and page_id in(40041942,40744225,41861912)
           and partition_app='my_app'
         group by dp_id,
                  user_id,
                  poi_id
       ) lu
  left join(
        select poi_id,
               user_id,
               uuid,
               sum(if(partition_date = '2024-03-28', 1, 0)) as 1day_exposure,
               sum(if(partition_date between '2024-03-26' and '2024-03-28', 1, 0)) as 3day_exposure,
               sum(if(partition_date between '2024-03-22' and '2024-03-28', 1, 0)) as 7day_exposure
          from table_a
         where partition_date between '2024-03-22' and '2024-03-28'
           and partition_app = 'my_app'
           and event_id in ('yy')
           and uuid is not null
           and uuid <> ''
         group by poi_id ,user_id,
               uuid
       )ex on(lu.uuid=ex.uuid and lu.user_id=ex.user_id and lu.poi_id=ex.poi_id)
  left join (
        select poi_id,
               user_id,
               uuid,
               sum(if(partition_date = '2024-03-28', 1, 0)) as 1days_click,
               sum(if(partition_date between '2024-03-26' and '2024-03-28', 1, 0)) as 3days_click,
               sum(if(partition_date between '2024-03-22' and '2024-03-28', 1, 0)) as 7days_click
          from table_a
         where partition_date between '2024-03-22' and '2024-03-28'
           and partition_app = 'my_app' -- mt

           and event_id in ('xx')
           and uuid is not null
           and uuid <> ''
         group by poi_id,user_id,
               uuid
       )click on(lu.uuid=click.uuid and lu.user_id=click.user_id and lu.poi_id=click.poi_id)
  left join (
        select uuid,
               user_id,
               poi_id,
               sum(if(partition_date = '2024-03-28', 1, 0)) as 1days_order,
               sum(if(partition_date between '2024-03-26' and '2024-03-28', 1, 0)) as 3days_order,
               sum(if(partition_date between '2024-03-22' and '2024-03-28', 1, 0)) as 7days_order from(
                select uuid,
                       user_id,
                       deal_id,
                       poi_id,
                       req_id,
                       order_id,
                       partition_date
                  from table_c
                 where platform_type='dp'
                   and (partition_date between '2024-03-22'and '2024-03-28')
                   and is_pay = 1
                   and uuid is not null
                   and uuid <> ''
                 group by uuid,
                          user_id,
                          deal_id,
                          poi_id,
                          req_id,
                          order_id,
                          partition_date
               )
         group by uuid,
                  user_id,
                  poi_id
       )
 order on(lu.uuid=order.uuid and lu.user_id=order.user_id and lu.poi_id=order.poi_id)
  left join (
        select exp_noder.uuid,
               exp_noder.user_id,
               exp_noder.poi_id,
               sum(if(partition_date = '2024-03-28', 1, 0)) as 1days_expnoord,
               sum(if(partition_date between '2024-03-26' and '2024-03-28', 1, 0)) as 3days_expnoord,
               sum(if(partition_date between '2024-03-22' and '2024-03-28', 1, 0)) as 7days_expnoord
          from (
                select exp_noder_exp.uuid,
                       exp_noder_exp.user_id,
                       exp_noder_exp.poi_id,
                       exp_noder_exp.partition_date
                  from (
                        select poi_id,
                               user_id,
                               uuid,
                               partition_date
                          from table_a
                         where partition_date between '2024-03-22' and '2024-03-28'
                           and partition_app = 'my_app'
                           and event_id in ('b_mrroly')
                           and uuid is not null
                           and uuid <> ''
                       )exp_noder_exp
                  left join (
                        select uuid,
                               user_id,
                               deal_id,
                               poi_id,
                               req_id,
                               order_id,
                               partition_date
                          from table_c
                         where platform_type='dp'
                           and (partition_date between '2024-03-22'and '2024-03-28')
                           and is_pay = 1
                           and uuid is not null
                           and uuid <> ''
                         group by uuid,
                                  user_id,
                                  deal_id,
                                  poi_id,
                                  req_id,
                                  order_id,
                                  partition_date
                       )exp_noder_ord
                    on (exp_noder_exp.uuid=exp_noder_ord.uuid and exp_noder_exp.user_id=exp_noder_ord.user_id and exp_noder_exp.poi_id=exp_noder_ord.poi_id and exp_noder_exp.partition_date=exp_noder_ord.partition_date)
                 where exp_noder_ord.uuid is null
                   and exp_noder_ord.user_id is null
                   and exp_noder_ord.poi_id is null
               )exp_noder
         group by exp_noder.uuid,
                  exp_noder.user_id,
                  exp_noder.poi_id
       )exp_nord
    on (lu.uuid=exp_nord.uuid and lu.user_id=exp_nord.user_id and lu.poi_id=exp_nord.poi_id)
  left join (
        select exp_noclk.uuid,
               exp_noclk.user_id,
               exp_noclk.poi_id,
               sum(if(partition_date = '2024-03-28', 1, 0)) as 1days_expnoclk,
               sum(if(partition_date between '2024-03-26' and '2024-03-28', 1, 0)) as 3days_expnoclk,
               sum(if(partition_date between '2024-03-22' and '2024-03-28', 1, 0)) as 7days_expnoclk
          from (
                select exp_noclk_exp.uuid,
                       exp_noclk_exp.user_id,
                       exp_noclk_exp.poi_id,
                       exp_noclk_exp.partition_date
                  from (
                        select poi_id,
                               user_id,
                               uuid,
                               partition_date
                          from table_a
                         where partition_date between '2024-03-22' and '2024-03-28'
                           and partition_app = 'my_app'
                           and event_id in ('yy')
                           and uuid is not null
                           and uuid <> ''
                       )exp_noclk_exp
                  left join (
                        select poi_id,
                               user_id,
                               uuid,
                               partition_date
                          from table_a
                         where partition_date between '2024-03-22' and '2024-03-28'
                           and partition_app = 'my_app' -- mt

                           and event_id in ('xx')
                           and uuid is not null
                           and uuid <> ''
                       )exp_noclk_clk
                    on (exp_noclk_exp.uuid=exp_noclk_clk.uuid and exp_noclk_exp.user_id=exp_noclk_clk.user_id and exp_noclk_exp.poi_id=exp_noclk_clk.poi_id and exp_noclk_exp.partition_date=exp_noclk_clk.partition_date)
                 where exp_noclk_clk.uuid is null
                   and exp_noclk_clk.user_id is null
                   and exp_noclk_clk.poi_id is null
               )exp_noclk
         group by exp_noclk.uuid,
                  exp_noclk.user_id,
                  exp_noclk.poi_id
       )exp_noclick
    on (lu.uuid=exp_noclick.uuid and lu.user_id=exp_noclick.user_id and lu.poi_id=exp_noclick.poi_id)
  left join (
        select clk_noder.uuid,
               clk_noder.user_id,
               clk_noder.poi_id,
               sum(if(partition_date = '2024-03-28', 1, 0)) as 1days_clknoord,
               sum(if(partition_date between '2024-03-26' and '2024-03-28', 1, 0)) as 3days_clknoord,
               sum(if(partition_date between '2024-03-22' and '2024-03-28', 1, 0)) as 7days_clknoord
          from (
                select clk_noder_clk.uuid,
                       clk_noder_clk.user_id,
                       clk_noder_clk.poi_id,
                       clk_noder_clk.partition_date
                  from (
                        select poi_id,
                               user_id,
                               uuid,
                               partition_date
                          from table_a
                         where partition_date between '2024-03-22' and '2024-03-28'
                           and partition_app = 'my_app' -- mt

                           and event_id in ('xx')
                           and uuid is not null
                           and uuid <> ''
                       )clk_noder_clk
                  left join (
                        select uuid,
                               user_id,
                               deal_id,
                               poi_id,
                               req_id,
                               order_id,
                               partition_date
                          from table_c
                         where platform_type='dp'
                           and (partition_date between '2024-03-22'and '2024-03-28')
                           and is_pay = 1
                           and uuid is not null
                           and uuid <> ''
                         group by uuid,
                                  user_id,
                                  deal_id,
                                  poi_id,
                                  req_id,
                                  order_id,
                                  partition_date
                       )clk_noder_ord
                    on (clk_noder_clk.uuid=clk_noder_ord.uuid and clk_noder_clk.user_id=clk_noder_ord.user_id and clk_noder_clk.poi_id=clk_noder_ord.poi_id and clk_noder_clk.partition_date=clk_noder_ord.partition_date)
                 where clk_noder_ord.uuid is null
                   and clk_noder_ord.user_id is null
                   and clk_noder_ord.poi_id is null
               )clk_noder
         group by clk_noder.uuid,
                  clk_noder.user_id,
                  clk_noder.poi_id
       )clk_nord
    on (lu.uuid=clk_nord.uuid and lu.user_id=clk_nord.user_id and lu.poi_id=clk_nord.poi_id)
    where 
       ex.1day_exposure is not null or
       ex.3day_exposure is not null or
       ex.7day_exposure is not null or
       click.1days_click is not null or
       click.3days_click is not null or
       click.7days_click is not null or
       order.1days_order is not null or
       order.3days_order is not null or
       order.7days_order is not null or
       exp_nord.1days_expnoord is not null or
       exp_nord.3days_expnoord is not null or
       exp_nord.7days_expnoord is not null or
       exp_noclick.1days_expnoclk is not null or
       exp_noclick.3days_expnoclk is not null or
       exp_noclick.7days_expnoclk is not null or
       clk_nord.1days_clknoord is not null or
       clk_nord.3days_clknoord is not null or
       clk_nord.7days_clknoord is not null

table_a with data

        uuid         | user_id |   poi_id   | partition_date
---------------------+---------+------------+----------------
 -146174509746728999 | 6322846 | 1186034547 | 2024-03-28
 -146174509746728999 | 6322846 | 1650181797 | 2024-03-28

Wrong result is :

   poi_id   |      uuid         | user_id | gluten_1days_expnoord | vanilla_1days_expnoord
------------+---------------------+---------+----------+-----------+
 1650181797 | -146174509746728999 | 6322846 |        2 |         1 
 1186034547 | -146174509746728999 | 6322846 |        2 |         1 

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

@kecookier kecookier added bug Something isn't working triage labels Apr 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage
Projects
None yet
Development

No branches or pull requests

1 participant