Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize agg queries #36

Merged
merged 1 commit into from
Dec 4, 2023
Merged

optimize agg queries #36

merged 1 commit into from
Dec 4, 2023

Conversation

OriHoch
Copy link
Contributor

@OriHoch OriHoch commented Dec 3, 2023

this should reduce the query times without changing the results

Copy link

@AdlerShay AdlerShay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙏

@OriHoch OriHoch merged commit 47ede7e into main Dec 4, 2023
2 checks passed
@OriHoch
Copy link
Contributor Author

OriHoch commented Dec 4, 2023

@AdlerShay @ShayAdler FYI - fixed a bug (merged directly to main) - 0dbf39a

@ShayAdler
Copy link
Contributor

@OriHoch why was only a day of tolerance problematic?

@OriHoch
Copy link
Contributor Author

OriHoch commented Dec 4, 2023

the gtfs date is the date that the data was collected on, it's not necesarily related to the date of the actual ride

I noticed it on the site where it showed less data with the optimization then without..

@ShayAdler
Copy link
Contributor

ShayAdler commented Dec 6, 2023

Alright I see it in the code now, but I was sure that this is the day we downloaded it from the MOT FTP, the therefore we still have reference to the original date.
Can you help me here with understanding the gtfs loading process? (when are we downloading from the ftp? when is the original date is getting lost in the way?).
Also, is the gtfs_ride.start_time also something we can't trust on (regarding the date, not the hour)?

@OriHoch
Copy link
Contributor Author

OriHoch commented Dec 6, 2023

yes, gtfs date is the date we downloaded the data from MOT FTP, so it's not directly related to the ride dates/times.. but it is confusing yes..
for example in the data we downloaded on Thursday we might get rides which happened on Sunday.. or something like that

gtfs_ride.start_time is more reliable for both date and time because it should be the date from the actual gtfs data

but I'm not sure, I might be mistaken, it's very confusing..

@ShayAdler
Copy link
Contributor

We actually rely on that when setting the siri_ride.scheduled_time_gtfs_ride_id to optimize the search there as well, maybe we shouldn't?
Maybe best would be adding an ETL that will update the route date based on the most common gtfs_rides.start_time? I definitely feel like I understand nothing again 😅

Basically AFAIK, we only use that date to narrow down searches, as the date that really matters is the ride's.

I think it will help me to look at the raw data, do we store it like we store siri?

@OriHoch
Copy link
Contributor Author

OriHoch commented Dec 7, 2023

@ShayAdler
Copy link
Contributor

@OriHoch can you expand on how did you see this bug?
I wan unable to find such cases in the data (this returns 0 results) -

-- get all rides in which the diff between the gtfs_route.date and the gtfs_ride.start_time is more than 1 day. Sometimes we do have 1 day diff, mostly when the ride hour is late.
select EXtract(epoch from date_trunc('day', gtfs_ride.start_time) - gr.date) , * from gtfs_ride join gtfs_route gr on gtfs_ride.gtfs_route_id = gr.id
where date_trunc('day', gtfs_ride.start_time) != gr.date
  and EXtract(epoch from date_trunc('day', gtfs_ride.start_time) - gr.date) not in (-86400, 86400)
and date_trunc('day', gtfs_ride.start_time) between '2023-10-01' and '2023-10-30'

@OriHoch
Copy link
Contributor Author

OriHoch commented Dec 16, 2023

This is the original complaint - https://hasadna.slack.com/archives/C1M8YT5AN/p1701675411278589
this change fixed it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants