This project aims to identify the underlying reasons for successes and failures in data science initiatives by extracting meaningful insights from unstructured text. The objective is to provide actionable recommendations to increase the success rates of data science capabilities. The project utilizes a collection of approximately 200K news articles (about 900 MB) focused on Data Science, Machine Learning, and Artificial Intelligence.
Project Steps Data Access: The data for the project can be accessed using one of the following methods:
Download the data from the provided link: News Final Project Data Use Pandas library to read the data directly from the URL: python Copy code import pandas as pd
df_news_final_project = pd.read_parquet('https://storage.googleapis.com/msca-bdp-data-open/news_final_project/news_final_project.parquet', engine='pyarrow') Data Cleaning: Perform data cleaning to eliminate noise and remove irrelevant text such as newlines, tabs, remnants of web crawls, etc.
Article Filtering: Discard irrelevant articles that are not related to the project's topics of interest (Data Science, Machine Learning, Artificial Intelligence).
Topic Detection: Apply techniques like topic modeling (e.g., LDA using gensim or ktrain) or BERTopic to detect major topics within the articles.
Identify Reasons for Failure: Extract insights from the data to identify the top reasons for failing data science initiatives. These reasons can be related to technology, people, data availability, etc. Analyze negative sentiment in the articles to uncover the key factors.
Recommend Corrective Actions: Based on the identified reasons for failure, suggest actionable recommendations to improve the success rates of data science initiatives. Provide specific actions that can address the identified issues.
Sentiment Analysis Timeline: Plot a timeline to illustrate how the sentiment regarding data science initiatives is changing over time. Visualize the fluctuations in sentiment to identify any patterns or trends.
Emerging Technologies: Identify new technologies and AI solutions that might be affecting the landscape of data science applications. Analyze the introduction and adoption of these technologies over time.
Improvement Strategies: Demonstrate what companies, academic institutions, and government entities can do to improve the outcomes of data science initiatives. Analyze positive sentiment and provide recommendations for enhancing the success of such projects.
Organization and Sentiment Analysis: Leverage appropriate NLP techniques to identify organizations, people, and locations mentioned in the articles. Apply targeted sentiment analysis to gain insights into the sentiments associated with different entities.
Benefits of Data Science Initiatives: Highlight why businesses should invest in data science initiatives by presenting success stories and showcasing the benefits of such projects.
Visualization of Recommendations: Create appropriate visualizations, such as word cloud charts or bubble charts, to summarize the recommendations and key findings from the analysis.