optimize agg queries #36
🙏
@AdlerShay @ShayAdler FYI - fixed a bug (merged directly to main) - 0dbf39a |
@OriHoch why was only a day of tolerance problematic? |
the gtfs date is the date that the data was collected on, it's not necessarily related to the date of the actual ride. I noticed it on the site, where it showed less data with the optimization than without.. |
Alright, I see it in the code now, but I was sure that this is the day we downloaded it from the MOT FTP, and therefore we still have a reference to the original date. |
yes, gtfs date is the date we downloaded the data from MOT FTP, so it's not directly related to the ride dates/times.. but it is confusing yes.. gtfs_ride.start_time is more reliable for both date and time because it should be the date from the actual gtfs data but I'm not sure, I might be mistaken, it's very confusing.. |
We actually rely on that when setting the Basically AFAIK, we only use that date to narrow down searches, as the date that really matters is the ride's. I think it will help me to look at the raw data, do we store it like we store siri? |
yes, the raw data is available, see here - https://github.com/hasadna/open-bus-pipelines/blob/main/STRIDE_ETL_PROCESSES.md#gtfs-etl-download-upload |
@OriHoch can you expand on how you saw this bug?
-- get all rides in which the diff between gtfs_route.date and gtfs_ride.start_time is more than 1 day.
-- sometimes we do have a 1 day diff, mostly when the ride hour is late.
select extract(epoch from date_trunc('day', gtfs_ride.start_time) - gr.date), *
from gtfs_ride
join gtfs_route gr on gtfs_ride.gtfs_route_id = gr.id
where date_trunc('day', gtfs_ride.start_time) != gr.date
and extract(epoch from date_trunc('day', gtfs_ride.start_time) - gr.date) not in (-86400, 86400)
and date_trunc('day', gtfs_ride.start_time) between '2023-10-01' and '2023-10-30' |
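For readers following the thread, the check the query above performs can be sketched in Python. This is only an illustration: the sample rows are made up, the helper names `day_diff` and `outside_tolerance` are hypothetical, and timezone handling is ignored.

```python
from datetime import date, datetime

def day_diff(route_date: date, ride_start: datetime) -> int:
    """Whole-day difference between the ride's start date and the route's gtfs date."""
    return (ride_start.date() - route_date).days

def outside_tolerance(route_date: date, ride_start: datetime, tolerance_days: int = 1) -> bool:
    """True when the two dates differ by more than the allowed number of days."""
    return abs(day_diff(route_date, ride_start)) > tolerance_days

# Hypothetical (gtfs_route.date, gtfs_ride.start_time) pairs, not real data.
rows = [
    (date(2023, 10, 1), datetime(2023, 10, 1, 23, 30)),  # same day
    (date(2023, 10, 1), datetime(2023, 10, 2, 0, 15)),   # late ride, one day later
    (date(2023, 10, 1), datetime(2023, 10, 4, 8, 0)),    # three days later
]

# Only rows beyond the one-day tolerance are the ones the query is looking for.
flagged = [row for row in rows if outside_tolerance(*row)]
```

With a one-day tolerance, only the last sample row ends up in `flagged`; the first two are the same-day and late-ride cases the query deliberately excludes.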
This is the original complaint - https://hasadna.slack.com/archives/C1M8YT5AN/p1701675411278589 |
this should reduce the query times without changing the results