Instagram Pipelines [DRAFT] #76

Draft: 17 commits to merge into base: master

arifluthfi16
Contributor

@arifluthfi16 arifluthfi16 commented Jan 22, 2022

This solves #60

Current Progress:

Created the scraping API, which can scrape:

  • Profile Data
  • Posts Data
  • Hashtag Data
  • Locations Data
  • Reels
  • IGTV

Todo:

  • Dockerize Scraping API ✅
  • Select data that matters (I need feedback on this)
  • Add Proxy Support ✅
  • Convert Locations Data into H3 Indexes
  • Create Pipelines to Store Data
  • Dockerize Pipelines

Problems:

  • Scraping locations data requires a session_id, which is obtained by logging in.
    But even with a valid session_id, there is a chance the request will fail and the account will be disabled. Source

Examples:

Hashtag Scraping:

Request:

[
    "https://www.instagram.com/explore/tags/googlepixel3/",
    "https://www.instagram.com/explore/tags/google/"
]
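
For reference, here is a minimal sketch of how a request list like the one above could be validated before scraping. It only pulls the tag name out of each hashtag URL. The helper name `hashtag_from_url` is hypothetical and is not taken from this PR's code.

```python
from urllib.parse import urlparse


def hashtag_from_url(url: str) -> str:
    """Return the tag name from a URL shaped like .../explore/tags/<name>/."""
    # Split the path into its non-empty segments, e.g. ["explore", "tags", "google"].
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) != 3 or parts[:2] != ["explore", "tags"]:
        raise ValueError(f"not a hashtag URL: {url}")
    return parts[2]


urls = [
    "https://www.instagram.com/explore/tags/googlepixel3/",
    "https://www.instagram.com/explore/tags/google/",
]
tags = [hashtag_from_url(u) for u in urls]
```

The extracted names line up with the "name" field of each response item, which makes it easy to check that every requested tag came back.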

Response:

[
    {
        "allow_following": false,
        "amount_of_posts": 178563,
        "id": "17852487268191726",
        "is_following": false,
        "is_top_media_only": false,
        "name": "googlepixel3",
        "profile_pic_url": "https://instagram.fbdo1-2.fna.fbcdn.net/v/t51.2885-15/e35/s150x150/240950065_1011889952946184_2416560484104956328_n.jpg?_nc_ht=instagram.fbdo1-2.fna.fbcdn.net&_nc_cat=104&_nc_ohc=pUqmR4ouF8AAX9MBxyA&edm=ABZsPhsBAAAA&ccb=7-4&oh=00_AT8BIF4IN-mzBBJxJnPEMR2fHL3e6praejBMXu6KXhbGVQ&oe=61EE4C5A&_nc_sid=4efc9f"
    },
    {
        "allow_following": false,
        "amount_of_posts": 11056646,
        "id": "17843843635029645",
        "is_following": false,
        "is_top_media_only": false,
        "name": "google",
        "profile_pic_url": "https://instagram.fbdo1-1.fna.fbcdn.net/v/t51.2885-15/e35/c0.180.1440.1440a/s150x150/271992644_353756903249047_2268745757588465676_n.webp.jpg?_nc_ht=instagram.fbdo1-1.fna.fbcdn.net&_nc_cat=107&_nc_ohc=V9JAOOMvFScAX_VbPZj&edm=ABZsPhsBAAAA&ccb=7-4&oh=00_AT-MQFEBk7kRv8pY8pn1ahglXB-4sKQI8kdBJw5Fb_F9Mg&oe=61EE44FB&_nc_sid=4efc9f"
    }
]
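
As one possible starting point for the "Select data that matters" item, the sketch below keeps only a few fields from each hashtag item. Which fields actually matter is still an open question in this PR, so the selection here (id, name, amount_of_posts) is an assumption, and `HashtagRecord` and `select_hashtag_fields` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HashtagRecord:
    """Assumed subset of the hashtag response; adjust once the fields are agreed on."""
    id: str
    name: str
    amount_of_posts: int


def select_hashtag_fields(item: dict) -> HashtagRecord:
    # Viewer-specific flags such as is_following are dropped in this sketch.
    return HashtagRecord(
        id=item["id"],
        name=item["name"],
        amount_of_posts=item["amount_of_posts"],
    )


# Trimmed copy of the response above (profile_pic_url omitted for brevity).
response = [
    {"allow_following": False, "amount_of_posts": 178563,
     "id": "17852487268191726", "is_following": False,
     "is_top_media_only": False, "name": "googlepixel3"},
    {"allow_following": False, "amount_of_posts": 11056646,
     "id": "17843843635029645", "is_following": False,
     "is_top_media_only": False, "name": "google"},
]
records = [select_hashtag_fields(item) for item in response]
```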

Posts Scraping:

Request:

[
    "https://www.instagram.com/p/CY3rhPQqig6/",
    "https://www.instagram.com/p/CY0lOS2vp9Y/"
]

Response:

[
    {
        "accessibility_caption": "Photo by Donatekart | India on January 18, 2022. May be an image of big cat and text that says 'DONATEKART A tribute to the Queen of Pench \"Collarwali\" Who passed away yesterday due to old age. She was 16 and one of the most important and legendary tigers in India who gave birth to 29 cubs.'.",
        "caption": "R.I.P \"Collarwali\" 🥺🙌🏻🙏🌱\n\nShe played a key role in maintaining the population of tiger reserve in India🐅\n.\n.\n#collarwali #tributepost #tigeress #cubs #tiger #forest",
        "caption_is_edited": true,
        "commenting_disabled_for_viewer": false,
        "comments": 72,
        "comments_disabled": false,
        "display_url": "https://instagram.fbdo1-1.fna.fbcdn.net/v/t51.2885-15/e35/271986206_140104905091452_6023192320528863036_n.jpg?_nc_ht=instagram.fbdo1-1.fna.fbcdn.net&_nc_cat=100&_nc_ohc=Svb-p5OLusAAX94Gh1N&edm=AABBvjUBAAAA&ccb=7-4&oh=00_AT9rrCMr8_CaTrHI_wLy400ol25ZaY38mOwHaneBan6ZcA&oe=61EF0E8B&_nc_sid=83d603",
        "fact_check_information": null,
        "fact_check_overall_rating": null,
        "full_name": "Donatekart | India",
        "gating_info": null,
        "has_audio":NaN,
        "has_ranked_comments": false,
        "hashtags": [
            "collarwali",
            "tributepost",
            "tigeress",
            "cubs",
            "tiger",
            "forest"
        ],
        "height": 1350,
        "id": "2753861097288771642",
        "is_video": false,
        "likes": 4555,
        "location":NaN,
        "media_overlay_info": null,
        "media_preview": null,
        "sensitivity_friction_info": null,
        "shortcode": "CY3rhPQqig6",
        "tagged_users": [],
        "timestamp": 1642505846,
        "tracking_token": "eyJ2ZXJzaW9uIjo1LCJwYXlsb2FkIjp7ImlzX2FuYWx5dGljc190cmFja2VkIjp0cnVlLCJ1dWlkIjoiZjAxOTMzODJhNjJkNDgyOGE4MGUzNDRkYzdiMWZjZTkyNzUzODYxMDk3Mjg4NzcxNjQyIn0sInNpZ25hdHVyZSI6IiJ9",
        "upload_date": "Tue, 18 Jan 2022 11:37:26 GMT",
        "username": "donatekart",
        "video_url":NaN,
        "video_view_count":NaN,
        "viewer_can_reshare": true,
        "viewer_has_liked": false,
        "viewer_has_saved": false,
        "viewer_has_saved_to_collection": false,
        "viewer_in_photo_of_you": false,
        "width": 1080
    },
    . . .
 ]
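
Note that the NaN values in the post response (has_audio, location, video_url, video_view_count) are not valid JSON, so strict parsers may reject the payload. Below is a minimal sketch, assuming the API is written in Python, of replacing NaN with null before serializing. The helper name `nan_to_none` is hypothetical, and the sample payload is a shortened copy of the response above.

```python
import json
import math


def nan_to_none(value):
    """Recursively replace float NaN with None so it serializes as JSON null."""
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, dict):
        return {key: nan_to_none(val) for key, val in value.items()}
    if isinstance(value, list):
        return [nan_to_none(val) for val in value]
    return value


# Python's json.loads accepts the NaN literal by default, unlike strict JSON parsers.
raw = '{"has_audio":NaN, "location":NaN, "is_video": false, "likes": 4555, "hashtags": ["tiger", "forest"]}'
post = json.loads(raw)

clean = nan_to_none(post)
# allow_nan=False makes json.dumps raise ValueError if any NaN slipped through.
strict = json.dumps(clean, allow_nan=False)
```

With this in place, missing values reach consumers as null, the same way fields like media_preview already do.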

@arifluthfi16 arifluthfi16 changed the title from "Features/insta scrapper" to "📷 Instagram Pipelines [DRAFT]" on Jan 22, 2022
@arifluthfi16 arifluthfi16 changed the title from "📷 Instagram Pipelines [DRAFT]" to "Instagram Pipelines [DRAFT]" on Jan 23, 2022
@arifluthfi16
Contributor Author

Some Updates:

You can use the scraper to collect data from Instagram, but it is still limited because a few locations are still blocked (possibly Instagram is blocking the /location endpoint).

I've been looking for an alternative, but so far I haven't found much information. Any information or pointers would be really appreciated!

@mattigrthr
Contributor

Hey @arifluthfi16, as discussed, we can merge this as an experimental pipeline. Basically identical to how we did it with the google-trends pipeline. You can include this Instagram pipeline already in the "main" README under "Third-party data connectors". Then in the README of your Instagram pipeline, clearly write the limitations of the current implementation.

We should open a separate issue with examples for figuring out which places/locations cannot be scraped at the moment.

@mattigrthr
Contributor

@arifluthfi16 please also add a workflow for linting and building the Docker images under .github/workflows.

@mattigrthr mattigrthr self-requested a review March 1, 2022 10:47
@mattigrthr mattigrthr added the enhancement New feature or request label Mar 1, 2022
@mattigrthr mattigrthr linked an issue Mar 1, 2022 that may be closed by this pull request
@mattigrthr mattigrthr removed the enhancement New feature or request label Mar 1, 2022
Successfully merging this pull request may close these issues.

New Pipeline: Instagram post scraper
2 participants