Html2parquet example #804

Draft
wants to merge 4 commits into
base: dev

Conversation

touma-I (Collaborator) commented Nov 16, 2024

Why are these changes needed?

This PR includes an example showing how html2parquet can be used in a notebook.

Related issue number (if any).

#788

touma-I (Collaborator, Author) commented Nov 16, 2024

@shahrokhDaijavad @sungeunan-ibm Before I forget and drop this thread, please review/edit/approve as you see fit. This is the notebook I created last week showing how to invoke html2parquet from within a notebook. It can evolve further based on direct reviews. Thanks.

"\n",
"# create parameters\n",
"local_conf = {\n",
" \"input_folder\": \"input\",\n",
Collaborator

/path/to/your/input/folder
/path/to/your/output/folder

^ To make clear that these are folders, not files.
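One way to enforce the folder-not-file expectation in the notebook is to validate the paths before building `local_conf`. This is only a sketch: `make_local_conf` is a hypothetical helper that is not part of the PR, and the `output_folder` key name is an assumption, since the diff excerpt above shows only `input_folder`.

```python
from pathlib import Path


def make_local_conf(input_folder: str, output_folder: str) -> dict:
    """Hypothetical helper: build local_conf after checking both paths are folders."""
    in_path = Path(input_folder)
    out_path = Path(output_folder)

    # The input must already exist and must be a directory, not a single file.
    if not in_path.is_dir():
        raise NotADirectoryError(f"input_folder must be an existing folder: {in_path}")

    # The output may not exist yet, but it must not be an existing file.
    if out_path.exists() and not out_path.is_dir():
        raise NotADirectoryError(f"output_folder must be a folder: {out_path}")
    out_path.mkdir(parents=True, exist_ok=True)

    # "output_folder" is an assumed key name; only "input_folder" appears in the diff.
    return {"input_folder": str(in_path), "output_folder": str(out_path)}
```

Calling `make_local_conf("/path/to/your/input/folder", "/path/to/your/output/folder")` would then fail early if either argument points at a file.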

"params = {\n",
" # Data access. Only required parameters are specified\n",
" \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n",
" \"data_files_to_use\": ast.literal_eval(\"['.html']\"),\n",
Collaborator

"data_files_to_use": ast.literal_eval("['.zip', '.html']"),

Support zip as well.
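For readers unfamiliar with the `ast.literal_eval` call, the snippet below shows what the suggested value evaluates to and how an extension list like this could select files. The `matches` function and the sample file names are purely illustrative; how the toolkit itself applies `data_files_to_use` is not shown in this PR.

```python
import ast

# literal_eval safely parses the string into a plain Python list of strings.
extensions = ast.literal_eval("['.zip', '.html']")


def matches(filename: str, allowed: list) -> bool:
    """Illustrative filter: True if the file name ends with one of the extensions."""
    return filename.lower().endswith(tuple(allowed))


# Hypothetical file names, used only to demonstrate the filter.
files = ["index.html", "archive.zip", "notes.txt"]
selected = [name for name in files if matches(name, extensions)]
```

With the original `['.html']` value, `archive.zip` would be skipped by a filter of this kind, which is the gap the suggestion addresses.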

"source": [
"import pyarrow.parquet as pq\n",
"import pandas as pd\n",
"table = pq.read_table('output/ai-alliance-index.parquet')\n",
Collaborator

table = pq.read_table('/path/to/your/output/folder/sample.parquet')

Users don't have "ai-alliance-index.parquet", so this makes the path general.
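Another way to avoid a hard-coded file name is to discover whatever parquet files the transform wrote. This is a minimal sketch, assuming the output folder contains `.parquet` files at its top level; `list_parquet_files` and `read_first_table` are hypothetical helpers, not part of the notebook.

```python
from pathlib import Path


def list_parquet_files(output_folder: str) -> list:
    """Return the .parquet files directly inside output_folder, sorted by name."""
    return sorted(Path(output_folder).glob("*.parquet"))


def read_first_table(output_folder: str):
    """Read the first parquet file found in output_folder as a pyarrow Table."""
    # Imported here so that listing files works even without pyarrow installed.
    import pyarrow.parquet as pq

    files = list_parquet_files(output_folder)
    if not files:
        raise FileNotFoundError(f"no .parquet files found in {output_folder}")
    return pq.read_table(files[0])
```

In the notebook, `read_first_table('/path/to/your/output/folder').to_pandas()` would then display the result regardless of the input file's name.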

sungeunan-ibm (Collaborator) left a comment

I left some comments.

2 participants