Builtwith API Project #44

Open
7 of 16 tasks
ryanmswan opened this issue Jul 2, 2021 · 49 comments

Comments

@ryanmswan
Contributor

ryanmswan commented Jul 2, 2021

Overview

Project: Open Community Survey

Volunteer Opportunity: Create a scraper to get information from builtwith.com on the technologies used by neighborhood council websites. Organize the data (create categories for the tech) and automate the scrape job to run periodically. Additionally, we want to display this information in a dashboard (see the Google Data Studio Dashboard linked below under "Project Output" for an example).

Contact: Ryan Swan (data science), Kaylani (open community survey), Bonnie

Action Items

  • Create a wiki page
  • Build a scraper that we can reuse to get the data on the NC site technologies
  • Add the scripts and other code to the data science repo or if another repo is required, let the leads know.
  • Create a Spreadsheet from the results of initial scrape
  • Create a set of categories in the spreadsheet
  • Rework the script to grab category as well as technology (it's available in the API)
  • Add a category to each technology so that the data can be grouped and analyzed (this will happen automatically once the prior item is done)
  • Assess code for current scraper to determine if it still functions properly
  • Perform additional analysis on Widget technology category: Which sites are using calendars? What are the calendars used for (events of the NC or local events)? Which sites use chatbots? Which sites have search functionality? How many sites use translation widgets?
  • Finish analysis of following technology categories: Content Management system (cms), Mobile, SSL, Payment, Framework, and Copyright
  • Fix directory issues with code. Currently, it's in the 311 directory but needs to be moved to the open community survey directory.
  • Create a reusable matching table of technology to category
  • Create a script to be able to create a new spreadsheet with the matching table so that the technologies are already categorized (except of course the ones that are new).
  • Create instructions for updating the matching table and running the scripts.
  • Make sure the wiki is updated.
  • Release dependency on: Conduct Analysis of 99 NC Features (open-community-survey#28)
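
As a rough illustration of the "reusable matching table" items above, here is a minimal Python sketch. The table contents, the `categorize` helper, and the sample technology names are all hypothetical placeholders; the real table would come from the OCS spreadsheet, and the actual scripts may be organized differently.

```python
# Hypothetical matching table: technology name -> category.
# The real table lives in the OCS spreadsheet; these entries are placeholders.
MATCHING_TABLE = {
    "WordPress": "cms",
    "Google Analytics": "analytics",
    "Stripe": "payment",
}


def categorize(technologies, matching_table):
    """Split technology names into (name, category) rows and unmatched names."""
    categorized = []
    uncategorized = []
    for tech in technologies:
        category = matching_table.get(tech)
        if category is None:
            # New technology: needs a manual entry in the matching table.
            uncategorized.append(tech)
        else:
            categorized.append((tech, category))
    return categorized, uncategorized


rows, new_techs = categorize(["WordPress", "Stripe", "SomeNewWidget"], MATCHING_TABLE)
print(rows)
print(new_techs)
```

Anything that lands in `new_techs` is a technology that still needs a category, which matches the "except of course the ones that are new" caveat in the action item.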

Resources/Instructions

External Tools

Tutorial

Project input (data)

Project output

Rajinder's code

Current presentation

OCS: Tech usage insights NCs
Analytics Analysis Workbook
Widgets Analysis Workbook

Related issues from OCS


Past Collaborators:
@akibrhast, @ava li, @Sarah Williams, @wendywilhelm10 @rajindermavi @ShikaZzz @JessicaFB @Poorvi Rao

@ExperimentsInHonesty
Member

ExperimentsInHonesty commented Jul 5, 2021

@RyanSwan @salice @JessicaFB @poorvi4 @akibrhast, @ava li, @sarah Williams, @wendywilhelm10 I forgot to mention that @mattyweb has a report tool that he has set up on our AWS, and it might be a good place to dump all this data (from the comparative analysis of features and then the technologies from the Builtwith API). Then we can define the types of reports we want to display, and it would allow people to look at a single NC as well as aggregate stats, etc. Think of it as a real-time data visualizer for end users once we have figured out what is worth looking at. At least that's my understanding of how it works. It would be good to ask him to come to the Data Science Community of Practice at some point to discuss it.

@akhaleghi added the labels epic, project: 311 Data, role: data analysis, size: missing, feature: guide, and feature: missing on Nov 2, 2021
@ExperimentsInHonesty
Member

@willa-mannering you said

Potential additional information I can collect from each tech type includes: subcategories (i.e. WordPress Theme), tech description, tech website link, number of sites currently using tech, and competing/similar techs.

OCS team said this in response

Pull all the additional info available with the script and then decide what information is needed in the future meeting

So to be clear, we are saying yes, please pull all the information you said you could pull.

@akhaleghi please add this as a recurring reporting item to our DS/Org agenda

@ExperimentsInHonesty
Member

ExperimentsInHonesty commented Jul 9, 2022

@willa-mannering It looks like this was discussed at a meeting but never noted on this issue: we only need the above subcategories for the items marked TRUE in the OCS: Builtwith tech_table, tech_categories

The different columns are for our own reference and have no significance for you. Just grab more info for any of the columns marked TRUE.
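
To make the "marked TRUE" rule concrete, here is a small Python sketch that reads a CSV export of a sheet and keeps every category with TRUE in any flag column. The column names (`category`, `flag_a`, `flag_b`) and the sample rows are assumptions for illustration only; the real tech_categories sheet may be laid out differently.

```python
import csv
import io

# Hypothetical CSV export of the tech_categories sheet.
# Column names and rows are placeholders, not the real sheet contents.
TECH_CATEGORIES_CSV = """category,flag_a,flag_b
cms,TRUE,FALSE
analytics,FALSE,FALSE
widgets,FALSE,TRUE
"""


def flagged_categories(csv_text, name_column="category"):
    """Return categories that have TRUE in at least one non-name column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    selected = []
    for row in reader:
        flags = [value for key, value in row.items() if key != name_column]
        if any((value or "").strip().upper() == "TRUE" for value in flags):
            selected.append(row[name_column])
    return selected


selected = flagged_categories(TECH_CATEGORIES_CSV)
print(selected)
```

Which column holds the TRUE does not matter here, in line with the note that the columns are only for the OCS team's reference.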

@willa-mannering
Member

Added new sheet to OCS table (more_info_tech) including the additional information for the designated categories via the tech_categories sheet.

@willa-mannering
Member

Need to pull additional data to determine which NC websites are using WordPress, then think of good questions to investigate further based on the additional data collected (e.g., how do NC websites use certain technologies?)

@willa-mannering
Member

Added new table ranking technologies by number of live sites to OCS Google sheet. Still working on pulling data to figure out which techs use WordPress.
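
For anyone reproducing the ranking table, a minimal Python sketch of the idea follows. The record layout (`name`, `live_sites`) and the numbers are made up for illustration and are not taken from the OCS sheet or from the scraper output.

```python
# Hypothetical records; names and counts are placeholders.
techs = [
    {"name": "Tech A", "live_sites": 1200},
    {"name": "Tech B", "live_sites": 56000},
    {"name": "Tech C", "live_sites": 340},
]


def rank_by_live_sites(records):
    """Return (rank, name, live_sites) tuples, highest live-site count first."""
    ordered = sorted(records, key=lambda rec: rec["live_sites"], reverse=True)
    return [
        (rank, rec["name"], rec["live_sites"])
        for rank, rec in enumerate(ordered, start=1)
    ]


ranking = rank_by_live_sites(techs)
for rank, name, live_sites in ranking:
    print(rank, name, live_sites)
```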

@willa-mannering
Member

Added new table to OCS Google sheet with info on which NC sites use WordPress. Some NC sites were no longer accessible (for example, svanc.org)

@kalyaniraman
Member

We need to figure out how to turn this into actionable information for the NCs.

  • NCs want to know where they stand against other NCs
  • NCs want to see where they stand against typical websites
  • Stakeholders want to know what the health of the network is

@kalyaniraman
Member

@willa-mannering Here is the OCS Google Template for the presentation -

@willa-mannering
Member

Updated OCS technology analysis presentation to use correct slide template. Finished Analytics portion, began working on Widgets analysis.

@akhaleghi
Contributor

Hey @willa-mannering are there any recent updates to this issue?

@willa-mannering
Member

Currently still working on the technology analysis presentation (which has been added as a link to this issue). I'm editing the analytics and widgets sections according to feedback. After that I will start analyzing the content management system section.

@ExperimentsInHonesty
Member

Categories from the spreadsheet of updated script results that we decided we wanted to break down further:

  • Content Management system (cms)
  • Mobile
  • SSL
  • Payment
  • Framework
  • Copyright

Further Widget analysis (which sites are using each widget type):

  • Calendar (what are the calendars used for: events of the NC or local events?)
  • Chatbot
  • Search
  • Translate widgets (how many of the websites have translation software enabled? 7% of all sites use it)
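
A share such as the translate-widget figure above could be computed along these lines. This is only a sketch: the site names, technology names, and the `widget_usage_share` helper are hypothetical, and the real analysis would run on the scraped results rather than on this toy data.

```python
# Hypothetical mapping of NC site -> set of detected technologies.
site_techs = {
    "nc-one.example": {"WordPress", "Google Translate Widget"},
    "nc-two.example": {"WordPress", "Google Calendar"},
    "nc-three.example": {"Wix"},
    "nc-four.example": {"Squarespace", "Google Calendar"},
}

# Hypothetical list of technologies counted as translate widgets.
TRANSLATE_WIDGETS = {"Google Translate Widget"}


def widget_usage_share(site_techs, widget_names):
    """Return the sites using any of the widgets and their share of all sites."""
    matching = sorted(
        site for site, techs in site_techs.items() if techs & widget_names
    )
    share = len(matching) / len(site_techs) if site_techs else 0.0
    return matching, share


sites, share = widget_usage_share(site_techs, TRANSLATE_WIDGETS)
print(sites, f"{share:.0%}")
```

The same helper works for the calendar, chatbot, and search questions by swapping in a different set of widget names.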

@ExperimentsInHonesty
Member

@akhaleghi it looks like Willa's update of the script is located in a 311 directory under DS, but it has nothing to do with 311. Let's sort that out.

@akhaleghi
Contributor

@willa-mannering are there any updates on this issue?

@akhaleghi
Contributor

Next steps:

  • Review code for scraper here to determine if the scraper still functions.
  • Add additional functionality to analyze widgets mentioned in Bonnie's comment above, dated 10/17/2022

@Rahul-Rut Rahul-Rut self-assigned this Apr 18, 2024
@Rahul-Rut
Member

Tasks done:
Tweaked the scraper to make it functional again

  • Added a shell script to handle docker functions (might not commit)
  • Updated the code to install chrome and chromedriver
  • Updated the code to properly scrape the website

@Rahul-Rut
Member

Tasks done:

  • Merged the two scraper scripts
  • Created a Jupyter notebook for preliminary analysis of tech; awaiting further inputs

@akhaleghi
Contributor

@Rahul-Rut Is this issue still being worked on? Is there anything we can do to provide input if you need it?

@Rahul-Rut
Member

@akhaleghi yes, I'll be providing updates soon; just need to wrap it up with a presentation, and I'll reach out in case I require any assistance, thanks!

@Rahul-Rut
Member

Tasks Done:
Updated code
Added search functionality to search tech through keywords in description and tech words
Used cron to schedule

Input required:
Upload files on GitHub
A way to publish the results
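
For readers who have not seen the code, the keyword search described in this update might look roughly like the Python sketch below. It is not the committed implementation: the record fields (`name`, `description`), the sample records, and the `search_techs` helper are assumptions made for illustration.

```python
# Hypothetical technology records; names and descriptions are placeholders.
records = [
    {"name": "Events Calendar", "description": "Plugin for listing events"},
    {"name": "Site Search Box", "description": "Adds search to a website"},
    {"name": "Chat Helper", "description": "Live chatbot for visitor questions"},
]


def search_techs(records, keywords):
    """Return names of records whose name or description contains any keyword."""
    lowered = [keyword.lower() for keyword in keywords]
    hits = []
    for rec in records:
        # Case-insensitive substring match over name plus description.
        haystack = f"{rec['name']} {rec.get('description', '')}".lower()
        if any(keyword in haystack for keyword in lowered):
            hits.append(rec["name"])
    return hits


matches = search_techs(records, ["calendar", "chatbot"])
print(matches)
```

Substring matching is the simplest choice and can over-match (for example, "chat" would also hit "chatter"), so a word-boundary match may be worth considering if the results look noisy.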

@akhaleghi
Contributor

@Rahul-Rut Is there more to be done on this issue?

Projects
Status: In progress (actively working)
Development

No branches or pull requests