This repository has been archived by the owner on Apr 22, 2023. It is now read-only.

Insert 2021 class profiles and points thresholds for Warsaw schools #43

Open
micorix opened this issue Apr 15, 2022 · 24 comments

@micorix
Member

micorix commented Apr 15, 2022

We got data regarding past class profiles and points thresholds for schools in Warsaw.

@Anakin100100 can you add them to the db?

The file is not ideal (every second row needs to be ignored) but it was the best we could come up with when converting the PDF using some online tools.
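Dropping every second row can be sketched in a few lines of Ruby. This is only an illustration: the row contents below are made up, and it assumes the useful record is the first row of each pair (swap `first` for `last` if the filler row comes first). `keep_every_other_row` is a hypothetical helper, not something from the repo.

```ruby
# Made-up rows shaped like the converted spreadsheet might be.
# Assumption: in each pair of rows, the first one holds the data
# and the second one is filler that should be ignored.
rows = [
  ["School A", "1A", "mat-fiz", "150.5"],
  [nil, nil, nil, nil],
  ["School B", "1B", "biol-chem", "162.0"],
  [nil, nil, nil, nil]
]

# Hypothetical helper: keeps the first row of every pair.
def keep_every_other_row(rows)
  rows.each_slice(2).map(&:first)
end

cleaned = keep_every_other_row(rows)
cleaned.each { |row| puts row.join(" | ") }
```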

If possible, please also add the original PDF to the repo for clarity and transparency.

PDF converted to Excel:
https://docs.google.com/spreadsheets/d/1a13O1QidSuWR4Xgf_FmDj4KDkbzVq2hz/edit?usp=sharing&ouid=115685832088064624071&rtpof=true&sd=true

Original PDF:
https://drive.google.com/file/d/12aRWJekAPh-rnqG7rJhk_8ujaf3Pzgdk/view?usp=sharing

@micorix
Member Author

micorix commented Apr 21, 2022

@Anakin100100 bump

@Anakin100100
Contributor

I think that class profiles, together with the requirements for getting in, are general enough that we should think about how to approach this in a way that is not Warsaw-specific. Can we implement filtering based on additive components here? Something like tick boxes where users select the extended subjects they are interested in. @micorix what do you think about this approach?

@micorix
Member Author

micorix commented Apr 21, 2022

I'm not sure I follow, but yeah, we planned on having a multi-select field with extended subjects as the only way to filter class profiles.

@micorix
Member Author

micorix commented Apr 21, 2022

It would be something like the school type filter here: https://test.po8klasie.pl/warszawa/search

@Anakin100100
Contributor

@wojtodzio @micorix can we set up a meeting? Tomorrow would be tough because I'm attending a hacknight at Hackerspace Silesia, so I won't be available from 5 pm, and I have to get to Bielsko first. If you are up for it, I'm free on the weekend. On Monday I'm free before 8 pm.

@micorix
Member Author

micorix commented Apr 21, 2022

Monday and Saturday would be fine with me but I would need to confirm.

What do you want to discuss?

@Anakin100100
Contributor

Anakin100100 commented Apr 21, 2022

1. Identifying the key tasks required to present anything to the public.
2. Data modelling as we add more providers, that is, the data interfaces that data from the different providers will be mapped onto. For example, the extended subjects are currently a space-joined string stored in a field on the Institution model. This is not sustainable once we want to consider different combinations of extended subjects and separate score requirements.
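A small Ruby sketch of why a single space-joined field falls short. Everything here is illustrative: `ClassProfile` is a hypothetical value object rather than a model from the project, and the subject lists and point values are made up.

```ruby
require "set"

# Hypothetical shape, for illustration only (not the project's schema).
ClassProfile = Struct.new(:extended_subjects, :min_points)

# Current shape: every extended subject of the school in one space-joined string.
flat_field = "Matematyka Fizyka Biologia Chemia"
flat_subjects = flat_field.split.to_set

# Alternative shape: one record per combination, each with its own threshold.
# The point values are made up.
profiles = [
  ClassProfile.new(Set["Matematyka", "Fizyka"], 160.0),
  ClassProfile.new(Set["Biologia", "Chemia"], 150.0)
]

wanted = Set["Matematyka", "Biologia"]

# The flat field reports a match although no single class offers both subjects.
flat_match = wanted.subset?(flat_subjects)
profile_match = profiles.any? { |profile| wanted.subset?(profile.extended_subjects) }

puts "flat field matches: #{flat_match}"
puts "a class profile matches: #{profile_match}"
```

With one record per combination, each combination can also carry its own score requirement, which the flat string has no place for.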

@micorix @wojtodzio feel free to add your topics to the agenda

@micorix
Member Author

micorix commented Apr 21, 2022

Ok.

I know that it is hard to estimate before the meeting but when approximately do you think the backend will be ready for the first official release?

@Anakin100100
Contributor

@micorix this depends on what we need for the first official release, which is not clearly defined yet.

@micorix
Member Author

micorix commented Apr 22, 2022

Ok, so to clarify. The most urgent 🔥 priorities are:

  1. allow users to filter by extended subjects (comma separated)
  2. insert past class profiles with point thresholds from Warsaw

@micorix
Member Author

micorix commented Apr 23, 2022

@Anakin100100 are you able to provide rough estimates for the priorities listed above?

@Anakin100100
Contributor

@micorix The filtering that is available for the data from Gdynia has to be changed. Extended subjects are going to be a separate model that is regenerated together with the database. The extended subject combination will be a separate model belonging to an institution. After all the other filters have run, we will filter by extended subjects in memory. I will work on it tomorrow and may finish it then, but that's not guaranteed.
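The in-memory step could look roughly like the sketch below. It uses plain Ruby structs as hypothetical stand-ins for the models (`SubjectCombination`, `Institution` and `filter_by_extended_subjects` are illustrative names), assumes an institution can have several combinations, and uses made-up school data.

```ruby
require "set"

# Hypothetical stand-ins for the models described above.
SubjectCombination = Struct.new(:subject_names)
Institution = Struct.new(:name, :subject_combinations)

# Keeps institutions that have at least one combination containing every
# requested subject; extra subjects in the combination are allowed.
def filter_by_extended_subjects(institutions, requested)
  requested = requested.to_set
  return institutions if requested.empty?

  institutions.select do |institution|
    institution.subject_combinations.any? do |combination|
      requested.subset?(combination.subject_names.to_set)
    end
  end
end

# Pretend this list is what the other (database) filters returned.
prefiltered = [
  Institution.new("School A", [SubjectCombination.new(%w[Polski Matematyka Fizyka])]),
  Institution.new("School B", [SubjectCombination.new(%w[Polski Historia])]),
  Institution.new("School C", [])
]

matches = filter_by_extended_subjects(prefiltered, %w[Polski Matematyka])
puts matches.map(&:name).inspect
```

Filtering in memory keeps the query simple, at the cost of loading every pre-filtered institution together with its combinations.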

@micorix
Member Author

micorix commented Apr 23, 2022

Ok, great! How about the data from Warsaw?

@Anakin100100
Contributor

This data is hard to work with because of the low degree of standardisation, various irregularities, and a huge number of edge cases introduced by the vocational schools. I'll do my best, but using this data is challenging.

@Anakin100100
Contributor

@wojtodzio I need your help. I need to be able to filter on class profiles, given as an array of strings, something like ["Polski", "Matematyka"]. The model hierarchy looks as follows: Institution has one Subject Set, which has many Subjects whose names are the values in the array above. We need to return only the institutions whose subject set contains every specified subject name; additional subjects may be present as well. For now I've written something like this:

if @class_profiles != nil
  institutions = institutions.where(:subject_set => {
    where: {
      :subject => {
        where: { name: @class_profiles }
      }
    }
  })
end

which of course gives an error

NoMethodError (undefined method `key?' for nil:NilClass

  klass&.columns_hash.key?(column_name)
                     ^^^^^):

How should I go about writing this? Is there something in Active Record that I'm missing, or is writing it in raw SQL the only way to do this?

@Anakin100100
Contributor

@wojtodzio before you can try debugging this issue, you have to populate the database with subjects and subject sets for Warsaw schools using CreateSubjectsJob.new.perform_now and then ProcessWarsawDataJob.new.perform_now (this one takes 15 minutes, so it's a good idea to start it beforehand).

@wojtodzio
Contributor

wojtodzio commented Apr 25, 2022

@Anakin100100 I'm not sure what I'm doing wrong, but ProcessWarsawDataJob.new.perform_now takes only a few seconds for me, and it doesn't create any institutions in the DB.
What you're trying to do here is to filter institutions based on different tables. To do that, you need to join those tables to institutions. In this case, it'll be an INNER JOIN, as you don't care about institutions without any subjects (a short reminder on SQL joins).
After joining the tables, you'll be able to filter on them. Also, in the resulting set, each institution will appear once for every matching joined row, so you'll have to remove the duplicates.
You can do that either by grouping results by the institution's ID and then filtering them using the HAVING clause or by filtering them using the WHERE clause and removing duplicates using a DISTINCT on the institution's values.

I think something like that should work (I haven't tested it, though):

Institution.joins(subject_sets: :subjects).where(subject_sets: { subjects: { name: ["Polski", "Matematyka"] } }).distinct

@Anakin100100
Contributor

@wojtodzio Thanks for the info on how to solve this issue. Assuming that all migrations have been run, the file not being loaded correctly could be responsible for that: if there are no records to iterate over, the job would finish instantly. Can you confirm that inside ProcessWarsawDataService the raw_school_data contains roughly 800 records after loading the file (it should be inside the data directory)?

@wojtodzio
Contributor

Yes, it does. It seems like institutions are missing. I guess I should also run something like CreateInstitutionRecordsJob? It may be worth preparing a simple setup script so that a new developer can bootstrap all the data in one command

@Anakin100100
Contributor

@wojtodzio Not really, the whole process is described in docs/regenerating_the_database.md: first create the institution types and then enqueue the jobs for each type

@Anakin100100
Contributor

Using the job for that

@wojtodzio
Contributor

Right, I've just tried running the regeneration script from master, and then this job from your branch, and it seems to have helped (or at least the job is taking more than a few sec. to finish 😂)

@wojtodzio
Contributor

Oh, and I've just realized that you wanted ALL of the subjects to be there. You can do group + having, then. E.g., something like this (I haven't tested it):

Institution.joins(subject_sets: :subjects).group(:id).having("ARRAY_AGG(subjects.name::text) @> ARRAY['Polski']")
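To make the difference between the two approaches concrete, here is a plain Ruby imitation of what the database does, run over made-up joined rows (one row per institution and subject pair). It is only a sketch of the semantics, not a replacement for the queries, and the ids and subjects are invented.

```ruby
# Made-up rows shaped like the result of joining institutions to subjects:
# one row per (institution, subject) pair.
joined_rows = [
  { institution_id: 1, subject: "Polski" },
  { institution_id: 1, subject: "Matematyka" },
  { institution_id: 2, subject: "Polski" },
  { institution_id: 3, subject: "Historia" }
]
wanted = %w[Polski Matematyka]

# Like WHERE name IN (...) plus DISTINCT:
# an institution matches if it has ANY of the wanted subjects.
any_ids = joined_rows
  .select { |row| wanted.include?(row[:subject]) }
  .map { |row| row[:institution_id] }
  .uniq

# Like GROUP BY id plus HAVING ARRAY_AGG(name) @> wanted:
# an institution matches only if it has ALL of the wanted subjects.
all_ids = joined_rows
  .group_by { |row| row[:institution_id] }
  .select { |_id, rows| (wanted - rows.map { |row| row[:subject] }).empty? }
  .keys

puts "any: #{any_ids.inspect}"
puts "all: #{all_ids.inspect}"
```

Institution 2 offers only one of the two subjects, so it passes the WHERE-style filter but not the HAVING-style one.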

@micorix
Member Author

micorix commented May 24, 2022

@Anakin100100 what's the status of this issue?
