Html2parquet example #804
base: dev
Signed-off-by: Maroun Touma <[email protected]>
@shahrokhDaijavad @sungeunan-ibm Before I forget and drop this thread, please review, edit, or approve as you see fit. This is the notebook I created last week showing how to invoke html2parquet from within a notebook. It can evolve further based on direct reviews. Thanks.
"\n",
"# create parameters\n",
"local_conf = {\n",
" \"input_folder\": \"input\",\n",
/path/to/your/input/folder
/path/to/your/output/folder
^ To make it clear that these are folders, not files.
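As a minimal sketch of the suggestion above, the configuration could be built from explicit folder paths and checked before the transform runs. The helper `build_local_conf` is hypothetical and not part of the notebook or of html2parquet; the `output_folder` key is assumed to mirror the `input_folder` key shown in the diff, and temporary folders stand in for `/path/to/your/input/folder` and `/path/to/your/output/folder`.

```python
import tempfile
from pathlib import Path


def build_local_conf(input_folder: str, output_folder: str) -> dict:
    """Hypothetical helper: build local_conf after checking that both paths are folders."""
    in_path = Path(input_folder)
    out_path = Path(output_folder)
    # The input must already exist and must be a folder, not a single file.
    if not in_path.is_dir():
        raise ValueError(f"input_folder must be an existing folder: {in_path}")
    # The output may not exist yet, but it must not be a regular file.
    if out_path.exists() and not out_path.is_dir():
        raise ValueError(f"output_folder must be a folder, not a file: {out_path}")
    out_path.mkdir(parents=True, exist_ok=True)
    # "output_folder" is an assumed key name, chosen to match "input_folder".
    return {"input_folder": str(in_path), "output_folder": str(out_path)}


# Demo: temporary folders stand in for the real input and output folders.
with tempfile.TemporaryDirectory() as tmp:
    demo_input = Path(tmp) / "input"
    demo_input.mkdir()
    local_conf = build_local_conf(str(demo_input), str(Path(tmp) / "output"))
    print(sorted(local_conf))
```

Passing the path of an `.html` file instead of its parent folder raises `ValueError` here, which is the mistake the placeholder paths are meant to prevent.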
"params = {\n", | ||
" # Data access. Only required parameters are specified\n", | ||
" \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", | ||
" \"data_files_to_use\": ast.literal_eval(\"['.html']\"),\n", |
"data_files_to_use": ast.literal_eval("['.zip', '.html']"),
Support .zip files as well.
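To illustrate what this suggestion changes, the sketch below shows the value that `ast.literal_eval` produces for the proposed string and applies it to a few made-up file names. The `select_files` helper and the file names are hypothetical; the actual filtering happens inside the transform's data access layer and may differ in detail, for example in how it treats upper-case extensions.

```python
import ast
from pathlib import Path

# The string from the suggestion is parsed into a plain Python list of extensions.
extensions = ast.literal_eval("['.zip', '.html']")


def select_files(names, allowed):
    """Hypothetical helper: keep names whose extension is in the allowed list."""
    # Lower-casing the suffix is an assumption made for this sketch only.
    return [name for name in names if Path(name).suffix.lower() in allowed]


# Made-up file names, used only to show the effect of the extension list.
candidates = ["index.html", "archive.zip", "notes.txt", "PAGE.HTML"]
selected = select_files(candidates, extensions)
print(selected)
```

With only `['.html']` in the list, `archive.zip` would be skipped, which is why the wider list is proposed.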
"source": [ | ||
"import pyarrow.parquet as pq\n", | ||
"import pandas as pd\n", | ||
"table = pq.read_table('output/ai-alliance-index.parquet')\n", |
table = pq.read_table('/path/to/your/output/folder/sample.parquet')
Users don't have "ai-alliance-index.parquet", so this makes the path general.
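One way to keep the cell general, sketched under the assumption that the output folder holds at least one Parquet file written by the transform, is to discover the file instead of naming it. `find_parquet_files` and `load_first_table` are hypothetical helpers and not part of the notebook; `pq.read_table` is the same pyarrow call the cell already uses.

```python
from pathlib import Path


def find_parquet_files(output_folder: str) -> list:
    """Hypothetical helper: list the .parquet files in a folder, sorted by name."""
    return sorted(Path(output_folder).glob("*.parquet"))


def load_first_table(output_folder: str):
    """Hypothetical helper: read the first Parquet file found in the folder."""
    # Imported here so the file discovery above works without pyarrow installed.
    import pyarrow.parquet as pq

    files = find_parquet_files(output_folder)
    if not files:
        raise FileNotFoundError(f"no .parquet files found in {output_folder}")
    return pq.read_table(str(files[0]))
```

In the notebook this could be used as `table = load_first_table('/path/to/your/output/folder')`, followed by `table.to_pandas()` to inspect the result.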
I left some comments.
Why are these changes needed?
This PR includes an example showing how html2parquet can be used in a notebook.
Related issue number (if any).
#788