Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat/Migration - GitHub Source to Connector V2 Structure #157

Open
wants to merge 62 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
62 commits
Select commit Hold shift + click to select a range
823dfc7
Update CHANGELOG with 0.0.23-dev0
unstructured-theron Oct 8, 2024
32d3884
Upgrade version to 0.0.23-dev0
unstructured-theron Oct 8, 2024
20fe615
Update expected outputs to support V2
unstructured-theron Oct 8, 2024
25fc077
Add add_source_entry to GitHub V2
unstructured-theron Oct 8, 2024
974ef3c
Add GitHub Source V2
unstructured-theron Oct 8, 2024
10ca416
Merge branch 'main' into DS-90-github-source-v2
unstructured-theron Oct 8, 2024
9b07474
lint: updating to black pattern
unstructured-theron Oct 8, 2024
49aa703
lint: updating to flake8 pattern
unstructured-theron Oct 8, 2024
bd2f6a6
cicd: rename to access token
unstructured-theron Oct 8, 2024
204918e
lint: updating to black pattern
unstructured-theron Oct 8, 2024
d5e26f1
Upgrade pygithub version to >= 2.4.0
unstructured-theron Oct 9, 2024
e3022e8
Refactoring the precheck method
unstructured-theron Oct 9, 2024
5c282a0
lint: fix imports
unstructured-theron Oct 9, 2024
7681f92
lint: fix imports
unstructured-theron Oct 9, 2024
3855545
Update expected outputs
unstructured-theron Oct 9, 2024
1fd37d7
Reverting github.sh
unstructured-theron Oct 9, 2024
d9f0788
Reverting github.sh
unstructured-theron Oct 9, 2024
8799db8
Add exclude metadata to github.sh
unstructured-theron Oct 13, 2024
c14eff0
Rename access token
unstructured-theron Oct 13, 2024
48e74bf
Updating the expected outputs (ignored the date fields)
unstructured-theron Oct 13, 2024
ce3f8a1
Update the expected outputs with permissions_data
unstructured-theron Oct 15, 2024
2d05ba1
Merge branch 'main' into DS-90-github-source-v2
unstructured-theron Oct 15, 2024
01223fd
GitHub: fixing commented issues
unstructured-theron Oct 16, 2024
d2711b3
Merge branch 'main' into DS-90-github-source-v2
unstructured-theron Oct 16, 2024
c437139
github.sh: rename to --file-glob
unstructured-theron Oct 16, 2024
d38b8db
github.sh: forcing raise exceptions
unstructured-theron Oct 16, 2024
2e81bcf
Merge branch 'main' into DS-90-github-source-v2
unstructured-theron Oct 16, 2024
98ed489
Upgrading version to 0.0.26-dev5
unstructured-theron Oct 16, 2024
e78ddc2
fix download file path
unstructured-theron Oct 16, 2024
aaf8ab6
Merge branch 'main' into DS-90-github-source-v2
unstructured-theron Oct 17, 2024
1b5cdf4
Merge branch 'main' into DS-90-github-source-v2
unstructured-theron Oct 17, 2024
f4c82b4
Improving description and logs
unstructured-theron Oct 21, 2024
10c9aea
Merge branch 'main' into DS-90-github-source-v2
unstructured-theron Oct 21, 2024
c91531f
fix lint
unstructured-theron Oct 21, 2024
423c206
fix lint
unstructured-theron Oct 21, 2024
a9ba1c5
fix lint
unstructured-theron Oct 21, 2024
0b67c65
Fixing doc methods
unstructured-theron Oct 21, 2024
902f69d
Fix methods run and run_async
unstructured-theron Oct 21, 2024
85ba0d9
Ad method "is_async"
unstructured-theron Oct 21, 2024
618d89e
Modify "additional_metadata" to add metadata just if the fields exist
unstructured-theron Oct 21, 2024
f35bc00
Set default value for Secret Field
unstructured-theron Oct 21, 2024
4c527b4
Merge branch 'main' into DS-90-github-source-v2
unstructured-theron Oct 21, 2024
f52b7be
fix black
unstructured-theron Oct 21, 2024
bce778f
Change to use model_validator
unstructured-theron Oct 22, 2024
50b2db5
Change syntax code of with open file
unstructured-theron Oct 22, 2024
49ee54f
Add recursive flag
unstructured-theron Oct 22, 2024
4b0f9a1
fix lint
unstructured-theron Oct 22, 2024
c8e7908
Change syntax code of with open file
unstructured-theron Oct 22, 2024
8c78370
Add metadata to new "recursive" field on Indexer
unstructured-theron Oct 22, 2024
11e119e
lint: run black
unstructured-theron Oct 22, 2024
fe0d496
Move recursive to IndexerConfig
unstructured-theron Oct 22, 2024
28186f3
Fix typo connection_config
unstructured-theron Oct 23, 2024
6057310
fix to handle all expections
unstructured-theron Oct 23, 2024
88df7cc
Merge branch 'main' into DS-90-github-source-v2
unstructured-theron Oct 23, 2024
6dbc4cc
fix typo path
unstructured-theron Oct 23, 2024
953bb72
Merge branch 'main' into DS-90-github-source-v2
unstructured-theron Oct 24, 2024
4ca279f
Fix changelog
unstructured-theron Oct 24, 2024
d5e3b1b
fix log
unstructured-theron Oct 24, 2024
a7dba0a
fix log
unstructured-theron Oct 24, 2024
b5e0b74
fix docstring
unstructured-theron Oct 25, 2024
725881c
Add more descriptive error message
bryan-unstructured Nov 7, 2024
44d7560
comment out the clarifai test which prevents from passing CI tests
bryan-unstructured Nov 7, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,9 @@
# 0.2.0-dev0

### Enhancements

* **Added migration for GitHub Source V2**

## 0.2.0

### Enhancements
Expand Down
3 changes: 1 addition & 2 deletions requirements/connectors/github.in
Original file line number Diff line number Diff line change
@@ -1,5 +1,4 @@
-c ../common/constraints.txt

# NOTE - pygithub==1.58.0 fails due to https://github.com/PyGithub/PyGithub/issues/2436
pygithub>1.58.0
pygithub>=2.4.0
unstructured-theron marked this conversation as resolved.
Show resolved Hide resolved
requests
100 changes: 75 additions & 25 deletions test_e2e/expected-structured-output/github/LICENSE.txt.json
Original file line number Diff line number Diff line change
@@ -1,57 +1,107 @@
[
{
"type": "Title",
"element_id": "52585ab256e2832166ca185be6c76cc9",
"text": "Downloadify: Client Side File Creation JavaScript + Flash Library",
"metadata": {
"filetype": "text/plain",
"languages": [
"eng"
]
},
"text": "Downloadify: Client Side File Creation JavaScript + Flash Library",
"type": "Title"
],
"filetype": "text/plain",
"data_source": {
"url": "https://api.github.com/repos/dcneiner/Downloadify/git/blobs/2c4f1ab8689a6dfef4ee7d13d4d935cb6663a7e4",
"version": "W/\"bb342a3e84a4ce514665385d7d61fb2922b0705ff23ad599a3e2d355aabe3f21\"",
"record_locator": {
"repo_path": "dcneiner/Downloadify",
"file_path": "LICENSE.txt"
},
"permissions_data": null,
"filesize_bytes": 1127
}
}
},
{
"type": "Title",
"element_id": "107ab54e7143d022fee38d5dfe235f89",
"text": "Copyright (c) 2009 Douglas C. Neiner",
"metadata": {
"filetype": "text/plain",
"languages": [
"eng"
]
},
"text": "Copyright (c) 2009 Douglas C. Neiner",
"type": "Title"
],
"filetype": "text/plain",
"data_source": {
"url": "https://api.github.com/repos/dcneiner/Downloadify/git/blobs/2c4f1ab8689a6dfef4ee7d13d4d935cb6663a7e4",
"version": "W/\"bb342a3e84a4ce514665385d7d61fb2922b0705ff23ad599a3e2d355aabe3f21\"",
"record_locator": {
"repo_path": "dcneiner/Downloadify",
"file_path": "LICENSE.txt"
},
"permissions_data": null,
"filesize_bytes": 1127
}
}
},
{
"type": "NarrativeText",
"element_id": "1cd03f5c7eea429178fc15c9d6c4cbd4",
"text": "Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:",
"metadata": {
"filetype": "text/plain",
"languages": [
"eng"
]
},
"text": "Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:",
"type": "NarrativeText"
],
"filetype": "text/plain",
"data_source": {
"url": "https://api.github.com/repos/dcneiner/Downloadify/git/blobs/2c4f1ab8689a6dfef4ee7d13d4d935cb6663a7e4",
"version": "W/\"bb342a3e84a4ce514665385d7d61fb2922b0705ff23ad599a3e2d355aabe3f21\"",
"record_locator": {
"repo_path": "dcneiner/Downloadify",
"file_path": "LICENSE.txt"
},
"permissions_data": null,
"filesize_bytes": 1127
}
}
},
{
"type": "NarrativeText",
"element_id": "5da204497a4873a8d0f71ad7865cea7e",
"text": "The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.",
"metadata": {
"filetype": "text/plain",
"languages": [
"eng"
]
},
"text": "The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.",
"type": "NarrativeText"
],
"filetype": "text/plain",
"data_source": {
"url": "https://api.github.com/repos/dcneiner/Downloadify/git/blobs/2c4f1ab8689a6dfef4ee7d13d4d935cb6663a7e4",
"version": "W/\"bb342a3e84a4ce514665385d7d61fb2922b0705ff23ad599a3e2d355aabe3f21\"",
"record_locator": {
"repo_path": "dcneiner/Downloadify",
"file_path": "LICENSE.txt"
},
"permissions_data": null,
"filesize_bytes": 1127
}
}
},
{
"type": "NarrativeText",
"element_id": "1b454f06bfa94b6d367e0e812ae32655",
"text": "THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.",
"metadata": {
"filetype": "text/plain",
"languages": [
"eng"
]
},
"text": "THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.",
"type": "NarrativeText"
],
"filetype": "text/plain",
"data_source": {
"url": "https://api.github.com/repos/dcneiner/Downloadify/git/blobs/2c4f1ab8689a6dfef4ee7d13d4d935cb6663a7e4",
"version": "W/\"bb342a3e84a4ce514665385d7d61fb2922b0705ff23ad599a3e2d355aabe3f21\"",
"record_locator": {
"repo_path": "dcneiner/Downloadify",
"file_path": "LICENSE.txt"
},
"permissions_data": null,
"filesize_bytes": 1127
}
}
}
]
86 changes: 63 additions & 23 deletions test_e2e/expected-structured-output/github/test.html.json
Original file line number Diff line number Diff line change
@@ -1,52 +1,92 @@
[
{
"type": "Title",
"element_id": "218722ac66e142a570ab2053b430c6c4",
"text": "Downloadify Example",
"metadata": {
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "Downloadify Example",
"type": "Title"
],
"filetype": "text/html",
"data_source": {
"url": "https://api.github.com/repos/dcneiner/Downloadify/git/blobs/c63c8fc21d46d44de85a14a3ed4baec0348ce344",
"version": "W/\"bb342a3e84a4ce514665385d7d61fb2922b0705ff23ad599a3e2d355aabe3f21\"",
"record_locator": {
"repo_path": "dcneiner/Downloadify",
"file_path": "test.html"
},
"permissions_data": null,
"filesize_bytes": 3001
}
}
},
{
"type": "Title",
"element_id": "bf0fab1925c4b2cbb23a53afce28ebd2",
"text": "More info available at the Github Project Page",
"metadata": {
"filetype": "text/html",
"languages": [
"eng"
],
"link_texts": [
"Github Project Page"
],
"link_urls": [
"http://github.com/dcneiner/Downloadify"
]
},
"text": "More info available at the Github Project Page",
"type": "Title"
],
"languages": [
"eng"
],
"filetype": "text/html",
"data_source": {
"url": "https://api.github.com/repos/dcneiner/Downloadify/git/blobs/c63c8fc21d46d44de85a14a3ed4baec0348ce344",
"version": "W/\"bb342a3e84a4ce514665385d7d61fb2922b0705ff23ad599a3e2d355aabe3f21\"",
"record_locator": {
"repo_path": "dcneiner/Downloadify",
"file_path": "test.html"
},
"permissions_data": null,
"filesize_bytes": 3001
}
}
},
{
"type": "Title",
"element_id": "395aed29cd13842fede90a1a8677aa4b",
"text": "Downloadify Invoke Script For This Page",
"metadata": {
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "Downloadify Invoke Script For This Page",
"type": "Title"
],
"filetype": "text/html",
"data_source": {
"url": "https://api.github.com/repos/dcneiner/Downloadify/git/blobs/c63c8fc21d46d44de85a14a3ed4baec0348ce344",
"version": "W/\"bb342a3e84a4ce514665385d7d61fb2922b0705ff23ad599a3e2d355aabe3f21\"",
"record_locator": {
"repo_path": "dcneiner/Downloadify",
"file_path": "test.html"
},
"permissions_data": null,
"filesize_bytes": 3001
}
}
},
{
"type": "NarrativeText",
"element_id": "2e22c39e004cb7d566294080c976efc8",
"text": "Downloadify.create('downloadify',{\n filename: function(){\n return document.getElementById('filename').value;\n },\n data: function(){ \n return document.getElementById('data').value;\n },\n onComplete: function(){ \n alert('Your File Has Been Saved!'); \n },\n onCancel: function(){ \n alert('You have cancelled the saving of this file.');\n },\n onError: function(){ \n alert('You must put something in the File Contents or there will be nothing to save!'); \n },\n swf: 'media/downloadify.swf',\n downloadImage: 'images/download.png',\n width: 100,\n height: 30,\n transparent: true,\n append: false\n});",
"metadata": {
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "Downloadify.create('downloadify',{\n filename: function(){\n return document.getElementById('filename').value;\n },\n data: function(){ \n return document.getElementById('data').value;\n },\n onComplete: function(){ \n alert('Your File Has Been Saved!'); \n },\n onCancel: function(){ \n alert('You have cancelled the saving of this file.');\n },\n onError: function(){ \n alert('You must put something in the File Contents or there will be nothing to save!'); \n },\n swf: 'media/downloadify.swf',\n downloadImage: 'images/download.png',\n width: 100,\n height: 30,\n transparent: true,\n append: false\n});",
"type": "NarrativeText"
],
"filetype": "text/html",
"data_source": {
"url": "https://api.github.com/repos/dcneiner/Downloadify/git/blobs/c63c8fc21d46d44de85a14a3ed4baec0348ce344",
"version": "W/\"bb342a3e84a4ce514665385d7d61fb2922b0705ff23ad599a3e2d355aabe3f21\"",
"record_locator": {
"repo_path": "dcneiner/Downloadify",
"file_path": "test.html"
},
"permissions_data": null,
"filesize_bytes": 3001
}
}
}
]
4 changes: 2 additions & 2 deletions test_e2e/src/github.sh
Original file line number Diff line number Diff line change
Expand Up @@ -46,14 +46,14 @@ PYTHONPATH=${PYTHONPATH:-.} "$RUN_SCRIPT" \
--partition-endpoint "https://api.unstructuredapp.io" \
--num-processes "$max_processes" \
--download-dir "$DOWNLOAD_DIR" \
--metadata-exclude coordinates,filename,file_directory,metadata.data_source.date_processed,metadata.last_modified,metadata.detection_class_prob,metadata.parent_id,metadata.category_depth \
--metadata-exclude coordinates,filename,file_directory,metadata.data_source.date_processed,metadata.last_modified,metadata.detection_class_prob,metadata.parent_id,metadata.category_depth,metadata.data_source.date_created,metadata.data_source.date_modified \
--strategy hi_res \
--preserve-downloads \
--reprocess \
--output-dir "$OUTPUT_DIR" \
--verbose \
--url dcneiner/Downloadify \
--git-file-glob '*.html,*.txt' \
--file-glob '*.html,*.txt' \
--work-dir "$WORK_DIR" \
$ACCESS_TOKEN_FLAGS

Expand Down
2 changes: 1 addition & 1 deletion test_e2e/test-dest.sh
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@ all_tests=(
'azure-cognitive-search.sh'
'box.sh'
'chroma.sh'
'clarifai.sh'
# 'clarifai.sh'
'couchbase.sh'
'dropbox.sh'
'elasticsearch.sh'
Expand Down
2 changes: 1 addition & 1 deletion unstructured_ingest/__version__.py
Original file line number Diff line number Diff line change
@@ -1 +1 @@
__version__ = "0.2.0" # pragma: no cover
__version__ = "0.2.0-dev0" # pragma: no cover
4 changes: 4 additions & 0 deletions unstructured_ingest/v2/processes/connectors/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,8 @@
from .delta_table import delta_table_destination_entry
from .elasticsearch import CONNECTOR_TYPE as ELASTICSEARCH_CONNECTOR_TYPE
from .elasticsearch import elasticsearch_destination_entry, elasticsearch_source_entry
from .github import CONNECTOR_TYPE as GITHUB_CONNECTOR_TYPE
from .github import github_source_entry
from .google_drive import CONNECTOR_TYPE as GOOGLE_DRIVE_CONNECTOR_TYPE
from .google_drive import google_drive_source_entry
from .kdbai import CONNECTOR_TYPE as KDBAI_CONNECTOR_TYPE
Expand Down Expand Up @@ -67,6 +69,8 @@
destination_type=ELASTICSEARCH_CONNECTOR_TYPE, entry=elasticsearch_destination_entry
)

add_source_entry(source_type=GITHUB_CONNECTOR_TYPE, entry=github_source_entry)

add_source_entry(source_type=GOOGLE_DRIVE_CONNECTOR_TYPE, entry=google_drive_source_entry)

add_source_entry(source_type=LOCAL_CONNECTOR_TYPE, entry=local_source_entry)
Expand Down
Loading