
Document how to create initial data for tables #36

Closed
3 of 4 tasks
Tracked by #35
fyliu opened this issue Sep 1, 2022 · 10 comments
Labels
complexity: missing dependency Issue has dependencies documentation Improvements or additions to documentation feature: docs: team guide Same as PD team documentation? research Issue involving doing research role: technical writing s: PD team stakeholder: People Depot Team size: 2pt Can be done in 7-12 hours

Comments


fyliu commented Sep 1, 2022

Overview

Since some of our database tables need to be pre-populated with data, we need to decide on a way to do it and document it.

Action Items

Resources/Instructions

Whatever solution we choose needs to hook into django's migration system, or run as a script, so the data can be auto-inserted on a clean build.

  1. Data-creation code directly inside the migration file, or used as a standalone script.
    1. From spreadsheet
      1. Use App Script to generate json file
      2. Use a script to generate the python code for adding the data
    2. From existing table data
      1. Use dumpscript to generate the python code for adding the data
  2. Load SQL script from django migration file
    Place the sql script in <app>/sql/<app>.sql
    1. From spreadsheet
      1. Use App Script to generate json file
      2. Use a script to generate the SQL commands from json
    2. From existing table data
      1. Use DBeaver to generate insert statements
  3. Load fixture from django migration file
    Place the fixture file in <app>/fixtures/<app>.json
    1. From spreadsheet
      1. Use App Script to generate json file
      2. Use a script to rewrite json in django migration format
    2. From existing table data
      1. Use dumpdata to generate json from data already in the db
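As a sketch of the spreadsheet path for option 1 (all model and field names here are hypothetical, not the project's actual schema), a small script could turn the App Script JSON export into Python data-creation code:

```python
import json

# Hypothetical example of rows exported from the spreadsheet via App Script.
SPREADSHEET_JSON = '[{"name": "Hack for LA"}, {"name": "Code for America"}]'

def generate_insert_code(json_text, model_name):
    """Emit one ORM insert statement per row of the exported JSON."""
    rows = json.loads(json_text)
    lines = [f"{model_name}.objects.get_or_create(**{row!r})" for row in rows]
    return "\n".join(lines)

# Prints lines like: Organization.objects.get_or_create(**{'name': 'Hack for LA'})
print(generate_insert_code(SPREADSHEET_JSON, "Organization"))
```

The generated statements could then be pasted into a migration's RunPython function or kept as a standalone script.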
@fyliu fyliu added documentation Improvements or additions to documentation research Issue involving doing research labels Sep 1, 2022

fyliu commented Sep 1, 2022

I've looked into options 1 and 3ii. Both are viable, and I think the fixture option is the better one since we can also use it for testing. That said, the django_db decorator automatically applies all the migrations, so any of the options would work for tests that use it. We don't have any non-django_db tests, but we might eventually, so the fixture option is still preferable if other factors are equal.

In a call with @ExperimentsInHonesty, I learned about the possibility of generating json from G sheets using an app script. I later tried option 3i, and it also works well. Its advantage is that it may be a more direct path from the initial spreadsheet to the final migration file. The options that dump data from existing tables have the extra step of first inserting the data into the system through the admin site, which is a potential source of user error.

Comparing the work involved between starting from the spreadsheet and from existing table data:

  • both

    • same migrations file setup
  • from sheet

    • more direct, one less step for potential copy/paste error
      • there's a single copy/paste and save as json file
    • need extra script to reshape json from spreadsheet to the proper format
  • from table data

    • need to add data into the db first, from admin site, which is relatively user-friendly
    • the amount of this manual work depends on how many rows need to be inserted. Most tables are very simple. SDG-related table rows can be a chore.
    • need a script or manual work to remove automatic fields like uuid

Now that it's laid out, starting from sheets data has more opportunities for automation, so it feels more future-proof.

The next step will be to evaluate if the option 3i route I've done will be adequate for our needs.

  • Problem: omitting created_at and updated_at values causes an error because those fields are non-nullable. Making them nullable gets around this, but then those rows have no timestamps.
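For the option 3i route, the spreadsheet JSON has to be reshaped into Django's fixture structure, a list of objects with model, pk, and fields keys. A minimal sketch of that reshaping step (the app label and field names are made up for illustration):

```python
import json

def to_fixture(rows, model_label, start_pk=1):
    """Reshape plain row dicts into Django's fixture structure."""
    return [
        {"model": model_label, "pk": pk, "fields": row}
        for pk, row in enumerate(rows, start=start_pk)
    ]

rows = json.loads('[{"name": "Engineering"}, {"name": "Design"}]')
fixture = to_fixture(rows, "core.practicearea")
print(json.dumps(fixture, indent=2))
```

The resulting file would go in <app>/fixtures/<app>.json for loaddata to pick up; note that the timestamp problem described above still applies to fixtures built this way.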


fyliu commented Sep 1, 2022

Another possibility

This one imports the data from a CSV file. The code is custom to each model but it can likely be made more generic.
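A rough sketch of that CSV approach (column and model names are hypothetical): the parsing half is generic, and only the final ORM call is model-specific.

```python
import csv
import io

CSV_TEXT = """name,description
Citizen Engagement,Civic participation projects
Workforce Development,Job readiness projects
"""

def read_rows(csv_text):
    """Parse CSV into a list of field dicts, one per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))

rows = read_rows(CSV_TEXT)
# In a Django migration or script, each row could then be inserted with
# something like (hypothetical model):
#     ProgramArea.objects.get_or_create(**row)
print(rows)
```

Keeping the parsing separate from the insert call is what would let the code become generic across models.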


fyliu commented Sep 2, 2022

To answer the null issue from 2 comments ago: loading from a fixture deliberately does not trigger the auto-timestamp behavior defined in the model. Fixture loading is designed for dumping and reloading data that already existed in the database at some point, not for loading initial data.
See this django ticket for details about why they won't change this behavior.

So we now have to choose:

  1. Generate the fixtures from the spreadsheet and allow the timestamp fields to be null. Not having timestamps on the pre-populated data may not be a real disadvantage.
  2. Manually insert the data into django and use dumpdata as it's intended to be used. This preserves the timestamps from the db.
  3. Generate python code from the spreadsheet so the data goes through the normal insertion path and gets its timestamps.

@fyliu fyliu mentioned this issue Sep 27, 2022

fyliu commented Sep 27, 2022

This is documented here and linked from the wiki.

The strategy is to export the JSON from the spreadsheet, convert it into a python script that can insert the data, then create a migration file that will call the script when it's processed.

The advantages of using a python script as opposed to importing the data in JSON form using loaddata:

  • it will generate timestamps, whereas loaddata needs the JSON file to contain timestamps for each record
  • loaddata requires the data to be in the database and dumped using dumpdata
  • manually (via script) creating the JSON data dump is possible, but it won't have the necessary timestamps
  • Making timestamps optional will get around the requirement, but that's not good practice
  • manually (via script) adding timestamps is possible, but it would look odd to have timestamps from well before the website launched. The JSON file would also need to be saved to revision control, which is also awkward.
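The pipeline above might look roughly like the following sketch (the app, model, and data are hypothetical, not taken from the project's actual documentation). The insert logic lives in a plain function that takes the model class, so a migration can pass in the historical model:

```python
# Hypothetical initial data, as converted from the spreadsheet's JSON export.
PROGRAM_AREAS = ["Citizen Engagement", "Environment", "Justice"]

def run(program_area_model):
    """Insert initial rows; created_at/updated_at are set by the model."""
    for name in PROGRAM_AREAS:
        program_area_model.objects.get_or_create(name=name)

# A migration could wire this up along these lines (sketch, not verified
# against the project's code):
#
#     from django.db import migrations
#
#     def forward(apps, schema_editor):
#         run(apps.get_model("core", "ProgramArea"))
#
#     class Migration(migrations.Migration):
#         dependencies = [("core", "0001_initial")]
#         operations = [migrations.RunPython(forward, migrations.RunPython.noop)]
```

Because the rows go through the ORM's normal create path, the model's auto timestamp fields are populated, which is the advantage over loaddata described above.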


fyliu commented Oct 3, 2022

I finished this and then realized there's a better way to do it. If this project is supposed to be generic, the initial data shouldn't be baked into migrations. It should be left as individual scripts that we run on initial database setup, and that other organizations can customize for their needs.


fyliu commented Mar 30, 2023

I have a draft PR #141 for this. It's a draft because it depends on #140 being merged.


fyliu commented Apr 6, 2023

I modified the PR to not depend on sphinx so we can move this forward.


fyliu commented Apr 10, 2023

There's no model created yet that needs initial data, which makes it harder to review the database-insertion part of #141. I will go implement #24.


fyliu commented Apr 12, 2023

Looks like #24 has the same structure as #35 and we don't have a direction for working on #35 yet. So I can't work on #24.

The other thing I could do is pull the SOC_Major table issue out of the ice box, but I would rather not, since it isn't going to be prioritized until far in the future.

Blocker: I need to implement a table with initial data, and there's no good one to do. This is blocking on #35.


fyliu commented Jan 17, 2024

This issue was implemented by PR #226

Projects
Status: ✅Done
2 participants