init tool to upload csv file of kktix ticket #23
Conversation
(force-pushed from e22f363 to fd34e63)
Hi @uranusjr, thanks for the suggestion. Here are the pros and cons that come to my mind:
We all want as little manual process as possible. However, readability is also important. My gut feeling is that we may mix an "automatic smart parser" with a mapping table.
Do you think this is a good idea? Besides, is there any conventional rule for naming the BigQuery column fields for data sanitizing? If you @david30907d @uranusjr know of one, please let me know.
btw, @tai271828 what do you think about uploading some personal information (name or email) to BQ as the primary key? We cannot calculate … Also, per discussion, we can set up proper authentication and have volunteers sign a non-disclosure agreement.
I’d just go with a smart-ish parser and keep the mapping table as small as possible, for fields that cannot be easily parsed. All fields listed here are very easily transformable, so no mapping table is needed yet.

```python
import re

def make_field_name_readable(raw: str) -> str:
    # Keep everything before the first double underscore and
    # turn the remaining single underscores into spaces.
    match = re.search(r"__", raw)
    if not match:
        return raw
    return raw[:match.start()].replace("_", " ")
```
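For illustration, this is how the snippet behaves on a couple of made-up KKTIX-style column names (hypothetical examples, not actual exported headers):

```python
# Hypothetical KKTIX-style headers, for illustration only.
print(make_field_name_readable("ticket_type__123"))  # -> "ticket type"
print(make_field_name_readable("paid_date"))         # -> "paid_date" (no "__", unchanged)
```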
btw @tai271828 I don't have a convention for column naming, but I have one for tables.
Interesting, thank you @uranusjr @david30907d for the information. Let me follow up with an update to this pull request then : )
From the maintenance point of view, I would avoid a third-party service (the Data Catalog service in this case) at the very beginning while the team is small. Using a third-party service means 1) communication overhead and 2) the effort of maintaining the service, and we have higher-priority tasks. A third-party service as a solution can be considered again when the data volume is large and updates are frequent; otherwise it is overkill in our (current) case.

@uranusjr's code snippet works like a charm (see the quotation below). I will update my parser based on it. With the "new" parser, the columns look like:

It looks much nicer (than my previous dumb code). Some more special characters like … So, I feel like some mapping table is still needed... let me think about this a bit more. 🤔

Regarding the primary key topic: @david30907d understood. I will re-upload the raw data. -tai
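As a sketch of the direction above (the override entries are hypothetical; the real mapping table contents are still under discussion in this thread), the smart parser could fall back to a small manual mapping for fields it cannot clean up:

```python
import re

# Hypothetical manual overrides for fields the smart parser cannot handle.
# The actual mapping table contents are still to be decided.
FIELD_NAME_OVERRIDES = {
    "nick_name_display_name_": "nickname",
}

def parse_field_name(raw: str) -> str:
    # Prefer an explicit override, then fall back to the smart parser.
    if raw in FIELD_NAME_OVERRIDES:
        return FIELD_NAME_OVERRIDES[raw]
    match = re.search(r"__", raw)
    if not match:
        return raw
    return raw[:match.start()].replace("_", " ")
```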
```python
if __name__ == "__main__":
    main()
```
Please re-upload raw data to provide better primary keys.
Using hashed emails in the end. The data will be uploaded later. See #23 (comment)
@uranusjr @david30907d names like … Update: oh, I notice that BigQuery will convert …
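For reference, BigQuery column names may contain only letters, digits, and underscores and must start with a letter or an underscore, so other characters get converted on load. A minimal sanitizer mimicking that rule (my own sketch, not the code in this PR) could look like:

```python
import re

def to_bigquery_safe(name: str) -> str:
    # Replace characters BigQuery disallows in column names with underscores.
    safe = re.sub(r"[^A-Za-z0-9_]", "_", name)
    # Column names must start with a letter or an underscore.
    if not re.match(r"[A-Za-z_]", safe):
        safe = "_" + safe
    return safe
```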
up to you haha 😄
The current parsing result is shown below. Time to decide the mapping table...
2¢
@uranusjr thanks for the suggestion. I had the same thought as well (and implemented it earlier than your comment XD). I currently follow the conventions mentioned in this comment #23 (comment), e.g. singular nouns, snake case, etc.
Currently it looks like this:
@uranusjr forgive my ignorance, what is 2¢?
https://en.wikipedia.org/wiki/My_two_cents
They are … Let me know if any names would be easier for you to understand. The goal is to make the fields as self-descriptive as possible. Thanks!
(force-pushed from fd34e63 to 9f2a083)
This pull request is ready for review again. The pre-processing of the column names is shown below [1]. Additionally, the revised/enhanced pull request includes the following changes:

[1]
Besides, … The reason is that this pull request is growing too large, and in my two cents, we should pause here a bit to make sure we all agree on how we upload the data. I will make the following changes/enhancements when I am also ready to upload all the data I have from past years, to make sure the column names of the tables are all consistent.
(force-pushed from 9f2a083 to d395a2a)
(force-pushed from d395a2a to fcc8dd5)
The tool uploads the ticket information exported from KKTIX to BigQuery and pre-processes the "raw" data so the table column naming is more BigQuery friendly. Uploading data by default is dangerous, so dry-run mode is the default. We want the column names to be as consistent as possible across years, so we use some heuristics; we may need to maintain the heuristics annually. Luckily, and ideally, the annual maintenance will be one-off.
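A minimal sketch of the dry-run-by-default interface described above (the short flag names follow the test command in this pull request; the long option names and messages are my assumptions):

```python
import argparse

def main():
    parser = argparse.ArgumentParser(
        description="Upload a KKTIX ticket CSV to BigQuery (dry run by default)."
    )
    parser.add_argument("csv_file", help="CSV file exported from KKTIX")
    parser.add_argument("-p", "--project-id", required=True)
    parser.add_argument("-d", "--dataset-name", required=True)
    parser.add_argument("-t", "--table-name", required=True)
    # Uploading is opt-in: without --upload the tool only previews column names.
    parser.add_argument("--upload", action="store_true",
                        help="actually create the table instead of dry-running")
    args = parser.parse_args()
    if not args.upload:
        print("Dry run: no data will be uploaded.")

if __name__ == "__main__":
    main()
```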
(force-pushed from fcc8dd5 to 6dc9d57)
utACK, btw it might be good to document this up~
hope the 2022 rookies can run this tool by themselves
Types of changes
Description
When uploading the CSV file of KKTIX tickets, the column field titles parsed automatically by BigQuery are very ugly, with a lot of underscores. This tool not only helps users upload the CSV file but also automatically handles and sanitizes the column field titles. See the attached images for the "as-is" and "to-be" results.
As-is (manually uploaded using the BigQuery dashboard UI)
Steps to Test This Pull Request
```
./upload-kktix-ticket-csv-to-bigquery.py ticket.csv -p bigquery-project-id -d dataset-name -t table-name
```
Expected behavior
The column names of the data pending upload will be shown. If the `--upload` argument is appended, a BigQuery table named after `table-name` is created under the `dataset-name` of `bigquery-project-id`.
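For context, a minimal sketch of what the `--upload` path could look like with the google-cloud-bigquery client library (an illustration under assumed function names, not necessarily how this PR implements it):

```python
from google.cloud import bigquery

def upload_csv(csv_path: str, project_id: str, dataset_name: str, table_name: str) -> None:
    client = bigquery.Client(project=project_id)
    table_id = f"{project_id}.{dataset_name}.{table_name}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the (already sanitized) header row
        autodetect=True,
    )
    with open(csv_path, "rb") as f:
        job = client.load_table_from_file(f, table_id, job_config=job_config)
    job.result()  # block until the load job finishes
```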
Related Issue
#5
Additional context
Internal Trello card status of the pycontw organization team:
https://trello.com/c/yRGq1sZ3/11-kktix-%E7%9A%84%E8%B3%87%E6%96%99%EF%BC%8C%E7%94%A8-airflow-%E5%AF%A6%E4%BD%9C-etl-%E4%B8%A6%E5%AD%98%E9%80%B2-bigquery