
init tool to upload csv file of kktix ticket #23

Merged
merged 1 commit into master from pr-upload-ticket-csv on Jun 5, 2021

Conversation

tai271828
Member

@tai271828 tai271828 commented Apr 5, 2021

Types of changes

  • New feature

Description

When uploading the csv file of a kktix ticket, the column field names parsed automatically by BigQuery are very ugly, with a lot of underscores. This tool not only helps users upload the csv file, but also automatically sanitizes the column field names. See the attached images for the "as-is" and "to-be" results.

As-is (manually uploaded via the BigQuery dashboard UI)
Selection_206

Steps to Test This Pull Request

./upload-kktix-ticket-csv-to-bigquery.py ticket.csv -p bigquery-project-id -d dataset-name -t table-name

Expected behavior

The column names of the data pending upload will be shown.

If the --upload argument is appended, a BigQuery table named after table-name is created under dataset-name of bigquery-project-id.
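
For reference, here is a minimal sketch of how the interface above could be wired up, with dry-run as the default behavior. The flag names come from the steps above, but the internals are illustrative and not necessarily the actual code in this pull request:

import argparse

import pandas as pd

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("csv_file", help="csv file of kktix ticket")
    parser.add_argument("-p", "--project", required=True, help="bigquery project id")
    parser.add_argument("-d", "--dataset", required=True, help="dataset name")
    parser.add_argument("-t", "--table", required=True, help="table name")
    parser.add_argument(
        "--upload",
        action="store_true",
        help="actually create the table; without this flag it is a dry run",
    )
    args = parser.parse_args()

    df = pd.read_csv(args.csv_file)
    # Dry-run output: show the column names of the data pending upload.
    print(df.columns)

    if args.upload:
        # Create the table named after --table under --dataset of --project.
        # (The BigQuery upload itself is omitted in this sketch.)
        ...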

Related Issue

#5

Additional context

Internal Trello card of the pycontw organization team:
https://trello.com/c/yRGq1sZ3/11-kktix-%E7%9A%84%E8%B3%87%E6%96%99%EF%BC%8C%E7%94%A8-airflow-%E5%AF%A6%E4%BD%9C-etl-%E4%B8%A6%E5%AD%98%E9%80%B2-bigquery

@tai271828 tai271828 force-pushed the pr-upload-ticket-csv branch 2 times, most recently from e22f363 to fd34e63 on April 5, 2021 22:33
@uranusjr
Member

uranusjr commented Apr 9, 2021

To-be (uploaded by the tool upload-kktix-ticket-csv-to-bigquery.py)
Selection_207

Since we’re processing the column names anyway, why not go all the way and make it as readable as possible? I would expect the UI to show “Ticket type”, “Payment status”, “Company”, “Job title” etc.

@tai271828
Member Author

To-be (uploaded by the tool upload-kktix-ticket-csv-to-bigquery.py)
Selection_207

Since we’re processing the column names anyway, why not go all the way and make it as readable as possible? I would expect the UI to show “Ticket type”, “Payment status”, “Company”, “Job title” etc.

Hi @uranusjr, thanks for the suggestion. Here are the pros and cons that come to my mind:

  • pros
    • readable
    • or even more, we could standardize the field names across different years. The raw names of the ticket information differ a bit every year and are maintained by the registration team.
  • cons
    • If we want to make it readable, we very likely have to manually maintain a mapping table in the code, e.g. {"some description from the raw csv file": "the readable field name"}; a corresponding real-world case would look like {"如何得知 PyCon TW?,Have you ever attended PyCon TW?": "Attended PyCon TW N Times"} (it is currently HaveyoueverattendedPyConTWPyConTW)

We all want as little manual process as possible. However, readability is also important. My gut feeling is that we may mix an "automatic smart parser" with the mapping table, like:

if raw_field not in mapping_table:
    new_field = smart_parser(raw_field)
else:
    new_field = mapping_table[raw_field]

Do you think this is a good idea?

Besides, is there any conventional rule for naming BigQuery column fields when sanitizing data? If you @david30907d @uranusjr know of one, please let me know.

@david30907d
Collaborator


  1. @tai271828 yeah, the pros and cons both make sense to me, and I'm okay with keeping your implementation. A third alternative might be keeping the raw kktix naming and maintaining a Data Catalog service as documentation. This kind of service (there is also an open-source version) would let data analysts better understand what the data is, how to use it, who maintains this pipeline, etc. Thoughts on this ^ @tai271828 @uranusjr @Lee-W
  2. I don't have a convention for column naming, but I have one for tables.
    Taking the kktix table as an example, the table name would be ods_kktix_primaryKey_year (see the illustration below).
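
For illustration, a table name following that convention could be built like this (the values below are hypothetical):

# Hypothetical values; only the ods_kktix_primaryKey_year pattern is the convention.
primary_key, year = "attendee", 2020
table_name = f"ods_kktix_{primary_key}_{year}"  # -> "ods_kktix_attendee_2020"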


@david30907d
Collaborator

btw, @tai271828, what do you think about uploading some personal information (name or email) to BQ as a primary key?

We cannot calculate retention if there's no primary key.

Also, per discussion, we can set up some proper authentication and have volunteers sign the non-disclosure agreement.

@david30907d david30907d self-requested a review April 10, 2021 04:25
@uranusjr
Member

We all want as little manual process as possible. However, readability is also important. My gut feeling is that we may mix an "automatic smart parser" with the mapping table, like:

if raw_field not in mapping_table:
    new_field = smart_parser(raw_field)
else:
    new_field = mapping_table[raw_field]

I’d just go with a smart-ish parser and keep the mapping table as small as possible, for fields that cannot be easily parsed. All fields listed here are very easily transformable, so no mapping table is needed yet.

import re

def make_field_name_readable(raw: str) -> str:
    match = re.search(r"__", raw)
    if not match:
        return raw
    return raw[:match.start()].replace("_", " ")
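
For reference, a quick trace of what this helper would produce on a made-up auto-generated header (the input string here is only illustrative; the real auto-generated names differ):

print(make_field_name_readable("Years_of_Using_Python____Python___"))  # -> "Years of Using Python"
print(make_field_name_readable("Price"))  # -> "Price"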

@david30907d
Collaborator

btw @tai271828

I don't have a convention for column naming, but I have one for tables.
Taking the kktix table as an example, the table name would be ods_kktix_primaryKey_year


@tai271828
Member Author

Interesting, thank you @uranusjr @david30907d for the information. Let me follow up with an update to this pull request then :)

@tai271828 tai271828 marked this pull request as draft April 24, 2021 16:36
@tai271828
Member Author

tai271828 commented May 8, 2021

From the maintenance point of view, I would avoid a third-party service (a Data Catalog service in this case) at the very beginning, while the team is small.

  • what we have
    • number of maintainers: our core team is small (< 5 people)
    • one-off ticket data uploading annually

Using a third-party service means 1) communication overhead and 2) maintenance effort for the service. We have higher-priority tasks. A third-party service will be considered again as a solution when the data volume is large and updates are frequent. Otherwise it is overkill in our (current) case.

@uranusjr 's code snippet works like a charm (see the quotation below). I will update my parser based on @uranusjr 's code snippet. With the "new" parser, the columns look like:

Index(['Ticket Type', 'Payment Status', 'Tags', 'Paid Date', 'Price',
       'Dietary Habit / 餐點偏好', 'Years of Using Python / 使用 Python 多久',
       'Area of Interest / 興趣領域',
       'Company  / 服務單位 (For students or teachers, fill in the School + Department Name)',
       'Job Title / 職稱 (If you are a student, fill in "student")',
       'Come From / 國家或地區', 'Departure from (Regions) / 出發區域', 'Gender / 性別',
       '是否願意收到贊助商轉發 Email 訊息', '是否願意提供 Email 給贊助商',
       'Privacy Policy of PyCon TW 2020',
       'I've already read and I accept the Privacy Policy of PyConTW 2020 / 我已閱讀並同意 PyCon TW 2020 個人資料保護
聲明'],
      dtype='object')

It looks much nicer (than my previous dumb code). Some special characters like / and + are still not welcome, but I think that would be easy to improve with a revised version of @uranusjr 's code snippet. It would be much, much nicer if the column names could follow several rules I have in mind (if we don't know of more best practices):

  1. singular nouns
  2. lower case
  3. underscore-separated words (snake case)
  4. full words (if possible) except common abbreviations
  5. ...more, see this article. The article is about SQL, but I think some of the rules still apply to our case.
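
For illustration, a rough normalizer following rules 1-4 above could look like the sketch below (purely a sketch under those rules, not the code in this pull request):

import re

def to_snake_case_column(raw: str) -> str:
    # Keep ASCII letters and digits only, then join lowercased words with underscores.
    words = re.sub(r"[^A-Za-z0-9]+", " ", raw).split()
    return "_".join(word.lower() for word in words)

print(to_snake_case_column("Ticket Type"))  # -> ticket_type
print(to_snake_case_column("Dietary Habit / 餐點偏好"))  # -> dietary_habit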

So, I feel like some mapping table is still needed ... let me think about this a bit more. 🤔

Regarding the primary key topic: @david30907d understood. I will re-upload the raw data.

-tai

We all want as little manual process as possible. However, readability is also important. My gut feeling is that we may mix an "automatic smart parser" with the mapping table, like:

if raw_field not in mapping_table:
    new_field = smart_parser(raw_field)
else:
    new_field = mapping_table[raw_field]

I’d just go with a smart-ish parser and keep the mapping table as small as possible, for fields that cannot be easily parsed. All fields listed here are very easily transformable, so no mapping table is needed yet.

import re

def make_field_name_readable(raw: str) -> str:
    match = re.search(r"__", raw)
    if not match:
        return raw
    return raw[:match.start()].replace("_", " ")

@tai271828 tai271828 self-assigned this May 8, 2021
@tai271828 tai271828 added the enhancement New feature or request label May 8, 2021
@tai271828 tai271828 marked this pull request as ready for review May 8, 2021 22:05
@tai271828 tai271828 marked this pull request as draft May 8, 2021 22:05
@tai271828 tai271828 left a comment

Please re-upload raw data to provide better primary keys.

if __name__ == "__main__":
    main()

@tai271828 tai271828 left a comment

Please re-upload raw data to provide better primary keys.

@tai271828 tai271828 left a comment

Using hashed emails in the end. The data will be uploaded later. See #23 (comment)
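
For context, one common way to derive a pseudonymous primary key from an email address is sketched below (the exact hashing scheme is an assumption here, not taken from this pull request):

import hashlib

def hash_email(email: str) -> str:
    # Normalize first so the same address always yields the same key.
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()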

@tai271828
Member Author

tai271828 commented May 8, 2021

@uranusjr @david30907d names like Company / 服務單位 (For students or teachers, fill in the School + Department Name) are really bad to me. What would you expect as a user when querying BigQuery? For example, organization?

Update: oh, I notice that BigQuery converts /, (, or + to _ or space.

@david30907d
Collaborator

@uranusjr @david30907d names like Company / 服務單位 (For students or teachers, fill in the School + Department Name) are really bad to me. What would you expect as a user when querying BigQuery? For example, organization?

Update: oh, I notice that BigQuery converts /, (, or + to _ or space.

up to you haha 😄

@tai271828
Member Author

The current parsing result is shown below. Time to decide the mapping table...

Column names (as-is):
Index(['Ticket Type', 'Payment Status', 'Tags', 'Paid Date', 'Price',
       '# invoice policy #', 'Invoiced Company Name / 發票抬頭 (Optional)',
       'Unified Business No. / 發票統編 (Optional)', 'Dietary Habit / 餐點偏好',
       'Years of Using Python / 使用 Python 多久', 'Area of Interest / 興趣領域',
       'Company  / 服務單位 (For students or teachers, fill in the School + Department Name)',
       'Job Title / 職稱 (If you are a student, fill in "student")',
       'Come From / 國家或地區', 'Departure from (Regions) / 出發區域',
       'How did you find out PyCon TW? / 如何得知 PyCon TW?',
       'Have you ever attended PyCon TW?/ 是否曾參加 PyCon TW?',
       'Do you know we have Financial Aid this year? / 請問您知道今年有財務補助嗎?',
       'Gender / 生理性別', 'PyNight 參加意願', '是否願意收到贊助商轉發 Email From Sponsor訊息',
       '是否願意提供 Email2Sponsor 給贊助商',
       'Privacy Policy of PyCon TW 2020 / PyCon TW 2020 個人資料保護聲明 bit.ly/3eipAut',
       'I've already read and I accept the Privacy Policy of PyCon TW 2020'],
      dtype='object')
Column names (to-be):
Index(['ticket_type', 'payment_status', 'tags', 'paid_date', 'price',
       'invoice_policy', 'invoiced_company_name_optional',
       'unified_business_no_optional', 'dietary_habit',
       'years_of_using_python_python', 'area_of_interest',
       'company_for_students_or_teachers_fill_in_the_school_department_name',
       'job_title_if_you_are_a_student_fill_in_student', 'come_from',
       'departure_from_regions', 'how_did_you_find_out_pycon_tw_pycon_tw',
       'have_you_ever_attended_pycon_tw_pycon_tw',
       'do_you_know_we_have_financial_aid_this_year', 'gender', 'pynight',
       'email_from_sponsor', 'email2sponsor',
       'privacy_policy_of_pycon_tw_2020_pycon_tw_2020_bitly3eipaut',
       'ive_already_read_and_i_accept_the_privacy_policy_of_pycon_tw_2020'],
      dtype='object')

@uranusjr
Member

uranusjr commented May 29, 2021

2/c

  • invoiced_company_name_optional → company_name
  • unified_business_no_optional → unified_business_number
  • years_of_using_python_python → years_of_using_python
  • company_for_students_or_teachers_fill_in_the_school_department_name → company
  • job_title_if_you_are_a_student_fill_in_student → job_title
  • how_did_you_find_out_pycon_tw_pycon_tw → how_did_you_find_out_pycon_tw
  • have_you_ever_attended_pycon_tw_pycon_tw → have_you_ever_attended_pycon_tw
  • email2sponsor → ??? (Don't understand the original text either)
  • privacy_policy_of_pycon_tw_2020_pycon_tw_2020_bitly3eipaut → Drop this?
  • ive_already_read_and_i_accept_the_privacy_policy_of_pycon_tw_2020 → Drop this?

@tai271828
Member Author


@uranusjr thanks for the suggestions. I had the same thoughts as well (and implemented them before your comments XD)

I currently follow the conventions mentioned in this comment: #23 (comment), e.g. singular nouns, snake case, etc.

@tai271828
Member Author

tai271828 commented May 30, 2021

Currently it looks like this:

Column names (as-is):
Index(['Ticket Type', 'Payment Status', 'Tags', 'Paid Date', 'Price',
       '# invoice policy #', 'Invoiced Company Name / 發票抬頭 (Optional)',
       'Unified Business No. / 發票統編 (Optional)', 'Dietary Habit / 餐點偏好',
       'Years of Using Python / 使用 Python 多久', 'Area of Interest / 興趣領域',
       'Company  / 服務單位 (For students or teachers, fill in the School + Department Name)',
       'Job Title / 職稱 (If you are a student, fill in "student")',
       'Come From / 國家或地區', 'Departure from (Regions) / 出發區域',
       'How did you find out PyCon TW? / 如何得知 PyCon TW?',
       'Have you ever attended PyCon TW?/ 是否曾參加 PyCon TW?',
       'Do you know we have Financial Aid this year? / 請問您知道今年有財務補助嗎?',
       'Gender / 生理性別', 'PyNight 參加意願僅供統計人數,實際是否舉辦需由官方另行公告', 'PyNight 參加意願',
       '是否願意收到贊助商轉發 Email 訊息', '是否願意提供 Email 給贊助商',
       'I've already read and I accept the Privacy Policy of PyCon TW 2020 / 我已閱讀並同意 PyCon TW 2020 個人資料保護聲明',
       'I've already read and I accept the Epidemic Prevention of PyCon TW 2020 / 我已閱讀並同意 PyCon TW 2020 COVID-19 防疫守則',
       'Contact Email'],
      dtype='object')

Column names (to-be):
Index(['ticket_type', 'payment_status', 'tags', 'paid_date', 'price',
       'invoice_policy', 'invoiced_company_name_optional',
       'unified_business_no_optional', 'dietary_habit',
       'years_of_using_python', 'area_of_interest', 'organization', 'job_role',
       'country_or_region', 'departure_from_region',
       'how_did_you_know_pycon_tw', 'have_you_ever_attended_pycon_tw',
       'do_you_know_we_have_financial_aid_this_year', 'gender',
       'pynight_attendee_numbers', 'pynight_attending_or_not',
       'email_from_sponsor', 'email_to_sponsor',
       'ive_already_read_and_i_accept_the_privacy_policy_of_pycon_tw',
       'ive_already_read_and_i_accept_the_epidemic_prevention_of_pycon_tw',
       'email'],
      dtype='object')

@david30907d
Collaborator

@uranusjr forgive my ignorance, what is 2/c ?😄

@uranusjr
Member

what is 2/c ?

https://en.wikipedia.org/wiki/My_two_cents

Currently it looks like this

email_from_sponsor and email_to_sponsor are confusing to me.

@tai271828
Member Author

tai271828 commented May 30, 2021

what is 2/c ?

https://en.wikipedia.org/wiki/My_two_cents

Currently it looks like this

email_from_sponsor and email_to_sponsor are confusing to me.

They are '是否願意收到贊助商轉發 Email 訊息' (willing to receive emails forwarded by sponsors) and '是否願意提供 Email 給贊助商' (willing to provide their email to sponsors) respectively :)

Let me know if there are any names that would be easier for you to understand. The goal is to make the fields as self-descriptive as possible. Thanks!

@tai271828
Member Author

This pull request is ready for review again. The pre-processing of the column names is shown below [1].

Additionally, the revised/enhanced pull request includes the following changes:

  1. Fix-ups according to your suggestions. Thank you everyone for your input; it makes this tool much better.
  2. Add test cases.
  3. More minor fool-proofing features to prevent mistakes when uploading.

[1]

INFO:root:Column names (as-is):
INFO:root:Index(['Ticket Type', 'Payment Status', 'Tags', 'Paid Date', 'Price',
       '# invoice policy #', 'Invoiced Company Name / 發票抬頭 (Optional)',
       'Unified Business No. / 發票統編 (Optional)', 'Dietary Habit / 餐點偏好',
       'Years of Using Python / 使用 Python 多久', 'Area of Interest / 興趣領域',
       'Company  / 服務單位 (For students or teachers, fill in the School + Department Name)',
       'Job Title / 職稱 (If you are a student, fill in "student")',
       'Come From / 國家或地區', 'Departure from (Regions) / 出發區域',
       'How did you find out PyCon TW? / 如何得知 PyCon TW?',
       'Have you ever attended PyCon TW?/ 是否曾參加 PyCon TW?',
       'Do you know we have Financial Aid this year? / 請問您知道今年有財務補助嗎?',
       'Gender / 生理性別', 'PyNight 參加意願僅供統計人數,實際是否舉辦需由官方另行公告', 'PyNight 參加意願',
       '是否願意收到贊助商轉發 Email 訊息', '是否願意提供 Email 給贊助商',
       'I've already read and I accept the Privacy Policy of PyCon TW 2020 / 我已閱讀並同意 PyCon TW 2020 個人資料保護聲明',
       'I've already read and I accept the Epidemic Prevention of PyCon TW 2020 / 我已閱讀並同意 PyCon TW 2020 COVID-19 防疫守則',
       'Contact Email'],
      dtype='object')
INFO:root:
INFO:root:Column names (to-be):
INFO:root:Index(['ticket_type', 'payment_status', 'tags', 'paid_date', 'price',
       'invoice_policy', 'invoiced_company_name', 'unified_business_no',
       'dietary_habit', 'years_of_using_python', 'area_of_interest',
       'organization', 'job_title', 'country_or_region',
       'departure_from_region', 'how_did_you_know_pycon_tw',
       'have_you_ever_attended_pycon_tw', 'know_financial_aid', 'gender',
       'pynight_attendee_numbers', 'pynight_attending_or_not',
       'email_from_sponsor', 'email_to_sponsor',
       'ive_already_read_and_i_accept_the_privacy_policy_of_pycon_tw',
       'ive_already_read_and_i_accept_the_epidemic_prevention_of_pycon_tw',
       'email'],
      dtype='object')

Process finished with exit code 0

@tai271828
Member Author

Besides,

  1. I have not "officially" re-uploaded the data
  2. I have not "officially" uploaded the data from past years (before 2020)

The reason is that this pull request is growing too large, and in my two cents, we should pause here a bit to make sure we all agree on how we upload the data. I will make the following changes/enhancements when I am also ready to upload all the data I have from past years, to make sure the column names of the tables are all consistent.

@tai271828 tai271828 marked this pull request as ready for review May 30, 2021 21:36
The tool is used to upload the ticket information exported from kktix to
bigquery, and to pre-process the "raw" data so that the column naming of
the tables is more bigquery friendly.

It's dangerous to upload data by default, so we use dry-run mode by default.

We would like to make the column names as consistent as possible across
years, so we use some heuristics. We may need to maintain the heuristics
annually. Luckily and ideally, the annual maintenance will be one-off.
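
As a rough sketch of the upload step the docstring describes, assuming pandas and the google-cloud-bigquery client (identifiers below are illustrative, not necessarily the PR's actual code):

import pandas as pd
from google.cloud import bigquery

def upload_dataframe(df: pd.DataFrame, project: str, dataset: str, table: str) -> None:
    client = bigquery.Client(project=project)
    table_id = f"{project}.{dataset}.{table}"
    # Load the pre-processed dataframe into the target table.
    job = client.load_table_from_dataframe(df, table_id)
    job.result()  # wait for the load job to complete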
Collaborator

@david30907d david30907d left a comment

utACK, btw it might be good to document this up~
hope the 2022 rookies can run this tool by themselves

@tai271828 tai271828 merged commit 6f3b4cf into master Jun 5, 2021
@tai271828 tai271828 deleted the pr-upload-ticket-csv branch June 5, 2021 09:57
@tai271828 tai271828 mentioned this pull request Jun 5, 2021