Table of contents
- Introduction
- Understanding Data Scraping
- Setting Up
- Data Cleaning with Pandas
- Getting Started with Pandas
- Integrating Google Drive API for Data Storage and Sharing
- Setting Up the Environment
- Authenticating with Google Drive
- Uploading Files to Google Drive
- Generating Downloadable Links
- Practical Example
- Conclusion
- Streamlit Web Interface
- Key Features
- Dynamic Configuration with 'config.json'
- The Role of config.json
- Implementation in Streamlit
- Advantages
- Demo
- Conclusion
- Acknowledgements
Introduction
This article is your gateway to understanding and utilizing news data effectively. Here, we walk you through a straightforward yet effective approach to gathering, organizing, analyzing, and using data from Moroccan news websites. We use easy-to-understand methods built on popular Python tools like Streamlit, Pandas, and Selenium, along with the Google Drive API (via PyDrive) and python-dotenv. Our aim is to make the journey from collecting data to applying it as smooth as possible for you.
Understanding Data Scraping
Data scraping is the process of collecting information from websites. This is commonly done to gather data from various sources on the internet for analysis, research, or storage.
Data scraping can be especially useful for keeping track of changes on websites, gathering large amounts of data quickly, and automating repetitive tasks of collecting information.
However, it's important to remember that while data scraping is a powerful tool, it should be used responsibly and ethically. This means respecting website terms of use, considering data privacy laws, and not overloading websites with too many requests at once.
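To make that last point concrete, here is a minimal sketch of polite scraping using the same Selenium setup this project relies on later. The URLs and the two-second pause are placeholder choices for illustration, not values taken from the project's own scrapers.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options as ChromeOptions
import time

options = ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

# Placeholder URLs, purely for illustration
urls = ["https://example.com/page1", "https://example.com/page2"]

for url in urls:
    driver.get(url)
    # ... extract what you need from driver.page_source here ...
    time.sleep(2)  # short pause between requests so the site isn't flooded

driver.quit()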
Setting Up
To get started with our project, you'll need to set up your computer with some specific tools. Here's a list of what you need and how to install them:
- Streamlit: This tool helps us build and share web applications easily. Install it using:
pip install streamlit
- Pandas: A library that makes handling data easier. Install it with:
pip install pandas
- Selenium (version 4.0.0 to less than 5.0.0): It's crucial for web scraping. Install the correct version using:
pip install selenium==4.*
- PyDrive: This tool will help us work with Google Drive files. Install it using:
pip install PyDrive
- python-dotenv: This library will help us manage environment variables securely. Install it with:
pip install python-dotenv
- WebDriver Manager: It helps in managing the browser drivers needed for Selenium. Install it using:
pip install webdriver_manager
- BeautifulSoup4: A library that makes it easy to scrape information from web pages. Install it with:
pip install beautifulsoup4
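If you prefer a single step, the same dependencies can go into a requirements.txt file along these lines (only Selenium is pinned, matching the version note above; adjust the other pins to taste):
streamlit
pandas
selenium==4.*
PyDrive
python-dotenv
webdriver_manager
beautifulsoup4
Then install everything at once with:
pip install -r requirements.txt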
With these tools installed, you'll be ready to start working on your data scraping and analysis project!
Data Cleaning with Pandas
After successfully scraping data from websites, the next crucial step is cleaning this data. This means making sure the data is organized, free from errors, and ready for analysis. For this, we use Pandas, a powerful Python library that simplifies the process of data manipulation.
Getting Started with Pandas
Loading Data: Begin by loading the scraped data into Pandas for easy manipulation.
Removing Duplicates: Identify and remove any duplicate entries to ensure data quality.
Handling Missing Values: Find and address any missing or incomplete information in the dataset.
Formatting Data: Adjust data formats as needed for consistency and easier analysis.
import pandas as pd

# Load the dataset
df = pd.read_csv('scraped_articles.csv')

# Remove duplicates
df.drop_duplicates(subset='article_title', inplace=True)

# Remove any rows with missing values
df.dropna(inplace=True)

# Save the cleaned dataset
df.to_csv('cleaned_articles.csv', index=False)
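The "Formatting Data" step depends on the columns your scraper produces. As a small, hedged example, assuming a date column called date_published (a hypothetical name, adjust it to your own schema), you could normalize dates and tidy titles like this:
import pandas as pd

df = pd.read_csv('cleaned_articles.csv')

# 'date_published' is a hypothetical column name; use whatever your scraper outputs
df['date_published'] = pd.to_datetime(df['date_published'], errors='coerce')

# Strip stray whitespace from the title column used for deduplication above
df['article_title'] = df['article_title'].str.strip()

df.to_csv('cleaned_articles.csv', index=False)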
Integrating Google Drive API for Data Storage and Sharing
In our project, we leverage the Google Drive API to efficiently store and share the data we've collected. This approach not only provides a secure and scalable storage solution but also simplifies the process of making our data accessible to others. Here's how we do it:
Setting Up the Environment
- Importing Libraries: We start by importing the necessary libraries, including pydrive.auth, GoogleDrive (from pydrive.drive), and oauth2client.client.
- Environment Variables: Using load_dotenv from the dotenv package, we load our Google API credentials from environment variables for security. These include CLIENT_ID, CLIENT_SECRET, and REFRESH_TOKEN.
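For reference, the matching .env file might look something like this. The values below are placeholders only; a real credentials file should never be committed to version control:
# .env (placeholder values, do not commit real credentials)
CLIENT_ID=your-client-id.apps.googleusercontent.com
CLIENT_SECRET=your-client-secret
REFRESH_TOKEN=your-refresh-token
REDIRECT_URIS=http://localhost:8080/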
Authenticating with Google Drive
- Authentication Function: We define a function, authenticate_google_drive, which sets up authentication using OAuth2 credentials. This ensures a secure connection to Google Drive.
Uploading Files to Google Drive
- Upload Function: The upload_file_to_drive function takes the drive object and a file path and uploads the file to Google Drive. It also handles the case where the file already exists on Google Drive, updating it instead of uploading a duplicate.
- Error Handling: The function includes error handling to manage any issues during the upload process.
Generating Downloadable Links
- Download Link Function: The get_drive_download_link function generates a direct download link for an uploaded file. It sets the necessary permissions on the file to make it accessible to anyone with the link.
Practical Example
from dotenv import load_dotenv
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from oauth2client.client import OAuth2Credentials
import os

load_dotenv()

CLIENT_ID = os.getenv('CLIENT_ID')
CLIENT_SECRET = os.getenv('CLIENT_SECRET')
REFRESH_TOKEN = os.getenv('REFRESH_TOKEN')
REDIRECT_URI = os.getenv('REDIRECT_URIS').split(',')[0]  # Access the first URI

def authenticate_google_drive():
    gauth = GoogleAuth()
    gauth.credentials = OAuth2Credentials(None, CLIENT_ID, CLIENT_SECRET, REFRESH_TOKEN, None,
                                          "https://accounts.google.com/o/oauth2/token", None, "web")
    drive = GoogleDrive(gauth)
    return drive

drive = authenticate_google_drive()

def upload_file_to_drive(drive, file_path, folder_id=None):
    if not os.path.exists(file_path):
        print(f"Cannot upload, file does not exist at path: {file_path}")
        return None
    try:
        file_metadata = {'title': os.path.basename(file_path)}
        if folder_id:
            file_metadata['parents'] = [{'id': folder_id}]
        upload_file = drive.CreateFile(file_metadata)
        # Check if the file already exists on Google Drive
        existing_files = drive.ListFile({'q': f"title='{upload_file['title']}'"}).GetList()
        if existing_files:
            # A file with the same name already exists, so update it instead
            upload_file = existing_files[0]
            print(f"File already exists on Drive. Updating file with ID: {upload_file['id']}")
        else:
            print("Uploading a new file to Drive.")
        upload_file.SetContentFile(file_path)
        upload_file.Upload()
        print(f"File uploaded successfully. File ID: {upload_file['id']}")
        return upload_file['id']
    except Exception as e:
        print(f"An error occurred during file upload: {e}")
        return None

def get_drive_download_link(drive, file_id):
    try:
        file = drive.CreateFile({'id': file_id})
        file.Upload()  # Make sure the file exists on Drive
        file.InsertPermission({
            'type': 'anyone',
            'value': 'anyone',
            'role': 'reader'})
        return "https://drive.google.com/uc?export=download&id=" + file_id
    except Exception as e:
        print(f"Error fetching download link: {e}")
        return None
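A minimal usage sketch, assuming the functions above and the cleaned_articles.csv file produced in the Pandas section:
# Upload the cleaned dataset and print a shareable link
file_id = upload_file_to_drive(drive, 'cleaned_articles.csv')
if file_id:
    download_link = get_drive_download_link(drive, file_id)
    print(f"Share this link: {download_link}")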
Conclusion
This integration of the Google Drive API provides a robust and secure method for storing and sharing large datasets. By automating the upload and link generation process, we significantly enhance the accessibility and usability of our data.
Streamlit Web Interface
Our Streamlit application serves as a central hub for aggregating news from various Moroccan websites. It's designed to be intuitive and user-friendly, enabling users to select news sources, languages, and categories for scraping.
Key Features
- Dynamic Configuration: Users can choose websites and specific categories from a dynamically loaded configuration (config.json). This allows for a customizable scraping experience.
- Language and Category Selection: For websites offering content in multiple languages, users can select their preferred language. Additionally, users can pick specific categories of news they are interested in.
- Control Over Data Collection: Through a simple interface, users can specify the number of articles to scrape.
- Initiating Scraping: A 'Start Scraping' button triggers the scraping process, with a progress bar indicating the ongoing operation.
- Real-time Updates and Data Display: As data is scraped and uploaded, users receive real-time updates. Each successful scrape results in a download link for the data and a display of the scraped data in a tabular format within the application.
- Google Drive Integration: Scraped data files are uploaded to Google Drive, and direct download links are provided within the Streamlit interface for easy access.
- Error Handling: The application includes error handling for issues like failed file uploads or unsuccessful scrapes, ensuring a smooth user experience.
import streamlit as st
import pandas as pd
import json
import importlib
from selenium import webdriver
from selenium.webdriver.chrome.options import Options as ChromeOptions
import google_drive_handle as gdrive
from dotenv import load_dotenv
import os

# Load config.json
with open('config.json') as f:
    config = json.load(f)

# Set up Chrome WebDriver with options
options = ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('log-level=3')

# Initialize the Chrome WebDriver
wd = webdriver.Chrome(options=options)

drive = gdrive.authenticate_google_drive()
processed_files = set()

st.markdown(
    """
    <style>
    .centered {
        display: flex;
        align-items: center;
        justify-content: center;
        text-align: center;
    }
    </style>
    """,
    unsafe_allow_html=True
)
st.markdown("<h1 class='centered'>Moroccan News Aggregator</h1>", unsafe_allow_html=True)

selected_websites = {}
selected_categories = {}

def save_file_id_mapping(file_id_mapping):
    with open("file_id_mapping.json", "w") as file:
        json.dump(file_id_mapping, file)

def load_file_id_mapping():
    try:
        with open("file_id_mapping.json", "r") as file:
            return json.load(file)
    except FileNotFoundError:
        return {}  # Return an empty dictionary if the file doesn't exist

file_id_mapping = load_file_id_mapping()

for website, details in config.items():
    if st.checkbox(website, key=website):
        # Language selection
        languages = details.get("languages", {})
        if languages and len(languages) > 1:
            language = st.selectbox(f'Choose language for {website}', list(languages.keys()), key=f'lang_{website}')
            selected_websites[website] = f"{website}_{language}"  # like: hespress_en
        else:
            language = next(iter(languages), None)  # single-language site: use its only language key, if any
            selected_websites[website] = website  # like: akhbarona
        # Category selection
        categories = languages.get(language, {})
        if categories:
            categories = st.multiselect(f'Select categories for {website}', list(categories.keys()), key=f'{website}_categories')
            selected_categories[website] = categories

# Number of articles input
num_articles = st.number_input('Number of Articles', min_value=1, max_value=10000, step=1)

# Start scraping button
if st.button('Start Scraping'):
    with st.spinner('Scraping in progress...'):
        progress_bar = st.progress(0)
        total_tasks = sum(len(categories) for categories in selected_categories.values())
        completed_tasks = 0
        for website, module_name in selected_websites.items():
            scraper_module = importlib.import_module(module_name)
            for category in selected_categories.get(website, []):
                category_url = config[website]['languages'][language][category]
                if 'category_name' in config[website]:
                    category_name = config[website]['category_name'].get(category, 'default_category_name')
                file_path = scraper_module.scrape_category(category_url, num_articles)
                if file_path:
                    if file_path not in file_id_mapping:
                        file_id = gdrive.upload_file_to_drive(drive, file_path)
                        print(f"Uploading file: {file_path}, File ID: {file_id}")
                        file_id_mapping[file_path] = file_id
                        save_file_id_mapping(file_id_mapping)
                    else:
                        file_id = file_id_mapping[file_path]
                        print(f"File already uploaded. Using existing File ID: {file_id}")
                    if file_id:
                        download_link = gdrive.get_drive_download_link(drive, file_id)
                        if download_link:
                            st.markdown(f"[Download {website} - {category} data]({download_link})", unsafe_allow_html=True)
                            df = pd.read_csv(file_path)
                            st.write(f"{website} - {category} Data:")
                            st.dataframe(df)
                        else:
                            st.error(f"Failed to retrieve download link for file ID: {file_id}")
                    else:
                        st.error(f"Failed to upload file for {website} - {category}")
                else:
                    st.error(f"File not created for {website} - {category}")
                # Advance the progress bar after each category is processed
                completed_tasks += 1
                progress_bar.progress(completed_tasks / total_tasks)
        st.success('Scraping Completed!')
This Streamlit web interface stands as a testament to the power of Python in creating efficient, user-friendly tools for data aggregation and management. It simplifies the complex process of data collection, storage, and sharing, making it accessible even to those with minimal technical background.
Dynamic Configuration with 'config.json'
Our Streamlit application is designed with agility and future expansion in mind. By incorporating a config.json file, we've created a flexible framework that allows for easy addition and modification of news sources.
The Role of config.json
- Flexible Source Management: The config.json file holds the details of the various news websites, their available languages, and their category URLs, as sketched below. This setup enables us to add new sources or modify existing ones without altering the application's core code.
- Language and Category Customization: For each news website, multiple languages and categories are defined. Users can select their preferred language and categories, making the data scraping process highly customizable.
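The exact schema is whatever your scrapers expect, but judging from how the Streamlit code reads it (config[website]['languages'][language][category] as the category URL, plus an optional category_name mapping), an entry could look roughly like the following sketch. The language codes, categories, and URLs here are purely illustrative assumptions:
{
  "hespress": {
    "languages": {
      "en": {
        "politics": "https://example.com/en/politics",
        "economy": "https://example.com/en/economy"
      },
      "fr": {
        "politique": "https://example.com/fr/politique"
      }
    },
    "category_name": {
      "politics": "Politics",
      "economy": "Economy"
    }
  }
}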
Implementation in Streamlit
- Loading Configuration: At the start of the application, config.json is loaded to dynamically populate the website choices, along with their respective languages and categories.
- User Interactions: Users interact with checkboxes and dropdowns generated based on the configuration. They can select websites, languages, and specific categories for scraping.
- Scalability: Adding a new website, language, or category is as simple as updating the config.json file, making the application scalable and easy to maintain.
Advantages
- Maintainability: Changes to the list of websites or their categories don't require code changes, reducing maintenance complexity.
- User Experience: Provides a user-friendly interface where options are dynamically generated, offering a seamless and intuitive experience.
Demo
Conclusion
Our journey through scraping, cleaning, and deploying data from Moroccan news websites has been a testament to the power and flexibility of Python and its libraries. By leveraging tools like Selenium, Pandas, Streamlit, and Google Drive API, we've demonstrated a streamlined process that transforms raw data into accessible and actionable insights. This project not only showcases the technical capabilities of these tools but also highlights the potential for data-driven strategies in understanding and disseminating information effectively.
Acknowledgements
First and foremost, a heartfelt thanks to my dedicated teammates: Tajeddine Bourhim, Yahya NPC, and @marwane khadrouf. Their expertise, creativity, and commitment were pivotal in turning this concept into reality. Their contributions across the project, from data scraping to interface design, have been invaluable.
We also extend our gratitude to Bahae Eddine Halim, the founder of MDS and the initiator of the DataStart initiative. This initiative not only kick-started our project but also inspired us to delve into the realm of data handling and analysis, contributing to a range of projects including this one.
This project stands as a collaborative effort, blending individual talents and a shared vision. It's a celebration of teamwork, innovation, and the endless possibilities that open-source technology and data science bring to our world.