<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Moroccan Data Scientists Blog]]></title><description><![CDATA[Join Moroccan Data Scientists (MDS) on a journey of innovation and discovery. Uncover the power of data science, artificial intelligence, and technology in a vi]]></description><link>https://blog.moroccands.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1705877245140/LsaDracdo.png</url><title>Moroccan Data Scientists Blog</title><link>https://blog.moroccands.com</link></image><generator>RSS for Node</generator><lastBuildDate>Mon, 20 Apr 2026 09:33:09 GMT</lastBuildDate><atom:link href="https://blog.moroccands.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[PMAE-Predictive-Maintenance-of-Aircraft-Engine]]></title><description><![CDATA[Predictive Maintenance in Aerospace: Leveraging Machine Learning for Enhanced Reliability
In the realm of aerospace engineering, ensuring the reliability and longevity of critical components is paramount. The ability to predict and prevent potential ...]]></description><link>https://blog.moroccands.com/pmae-predictive-maintenance-of-aircraft-engine</link><guid isPermaLink="true">https://blog.moroccands.com/pmae-predictive-maintenance-of-aircraft-engine</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[data analysis]]></category><category><![CDATA[data analytics]]></category><category><![CDATA[Artificial Intelligence]]></category><dc:creator><![CDATA[Adil Charahil]]></dc:creator><pubDate>Fri, 26 Apr 2024 23:01:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1713995340495/20abf25f-4e9e-4b2e-a9ba-bb1fc0f9c14c.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Predictive Maintenance in Aerospace: Leveraging Machine Learning for Enhanced Reliability</strong></p>
<p>In the realm of aerospace engineering, ensuring the reliability and longevity of critical components is paramount. The ability to predict and prevent potential failures in aircraft engines can not only enhance safety but also significantly reduce operational costs. To address these challenges, researchers and engineers have turned to advanced technologies such as machine learning to develop predictive maintenance solutions.</p>
<h3 id="heading-introduction-to-predictive-maintenance"><strong>Introduction to Predictive Maintenance</strong></h3>
<p>Predictive maintenance is a proactive approach to maintenance management that leverages data analytics and machine learning algorithms to predict equipment failures before they occur. By analyzing historical data and real-time sensor readings, predictive maintenance models can identify patterns and anomalies indicative of impending failures, allowing maintenance activities to be scheduled precisely when needed.</p>
]]></content:encoded></item><item><title><![CDATA[Management Systems for Moroccan Agriculture 🌱 
"Detect Pests And Diseased Leaves 🍃"]]></title><description><![CDATA[Table of Contents:

Introduction

Data Collection and Preparation

Importing Libraries and Loading Data
Data Preprocessing
Data Augmentation


Building the CNN Model

Model Architecture
Compiling the Model
Training the Model


Model Evaluation


Perf...]]></description><link>https://blog.moroccands.com/management-systems-for-moroccan-agriculture-detect-pests-and-diseased-leaves</link><guid isPermaLink="true">https://blog.moroccands.com/management-systems-for-moroccan-agriculture-detect-pests-and-diseased-leaves</guid><category><![CDATA[#DataFtour]]></category><category><![CDATA[MDS]]></category><category><![CDATA[agriculture]]></category><category><![CDATA[Morocco ]]></category><dc:creator><![CDATA[asmae el ghezzaz]]></dc:creator><pubDate>Fri, 26 Apr 2024 23:01:07 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1714156602580/ddfbed0d-fb76-4c2a-aecc-9e9ce52c681a.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Table of Contents:</strong></p>
<ul>
<li><p><strong>Introduction</strong></p>
</li>
<li><p>Data Collection and Preparation</p>
<blockquote>
<p><strong><em>Importing Libraries and Loading Data</em></strong></p>
<p><strong><em>Data Preprocessing</em></strong></p>
<p><strong><em>Data Augmentation</em></strong></p>
</blockquote>
</li>
<li><p>Building the CNN Model</p>
<blockquote>
<p><strong><em>Model Architecture</em></strong></p>
<p><strong><em>Compiling the Model</em></strong></p>
<p><strong><em>Training the Model</em></strong></p>
</blockquote>
</li>
<li><p>Model Evaluation</p>
<blockquote>
<p><strong><em>Performance Metrics</em></strong></p>
<p><strong><em>Visualizing Training History</em></strong></p>
<p><strong><em>Model Deployment</em></strong></p>
</blockquote>
</li>
<li><p>Conclusion</p>
</li>
</ul>
<h3 id="heading-introduction"><strong>INTRODUCTION:</strong></h3>
<p>Agricultural 🌱 practices play a crucial role in ensuring global food security. However, the health of crops can be significantly impacted by various factors, including the presence of diseases. One such crop of immense importance is the potato, a staple food for many populations around the world. Potato plants are susceptible to diseases such as Early Blight and Late Blight, or they can be in a healthy state.</p>
<p>In this context, leveraging advanced technologies becomes imperative to efficiently monitor and manage the health of potato crops. Deep learning, particularly Convolutional Neural Networks (CNNs), has shown remarkable success in image recognition tasks. This project focuses on utilizing CNNs for the detection and classification of diseases in potato leaves, specifically targeting Early Blight, Late Blight, and Healthy states.</p>
<blockquote>
<p><strong><em>We will build a web application to predict the diseases of potato plants.</em></strong></p>
<p><strong><em>This application will help farmers identify the diseases in potato plants so that they can apply appropriate treatments and get more yield.</em></strong></p>
</blockquote>
<h2 id="heading-image-classification"><strong>Image Classification</strong></h2>
<p><em>A classical computer vision problem, where the task is to predict the class of an image within a known set of possible classes.</em></p>
<h2 id="heading-problem-statement"><strong>Problem statement</strong></h2>
<ul>
<li><p>To classify the given potato leaf image as <strong>healthy</strong>, <strong>late blight</strong> or <strong>early blight</strong>.</p>
</li>
<li><p>It is a multi-class classification problem.</p>
</li>
</ul>
<h1 id="heading-data"><strong>Data</strong></h1>
<p>We will use a <a target="_blank" href="https://www.kaggle.com/arjuntejaswi/plant-village"><strong>Kaggle</strong></a> dataset for this project.<br />I created a subset of the original data, which includes only the diseases of potato plants; you can find the dataset used in this project at the link above.</p>
<p><strong><em>Late Blight:</em></strong> Late blight of potato is a disease caused by the oomycete (a fungus-like microorganism) <strong>Phytophthora infestans</strong>.</p>
<p><strong><em>Early Blight:</em></strong> Early blight of potato is a disease caused by the fungus <strong>Alternaria solani</strong>.</p>
<p><strong><em>Healthy:</em></strong> An uninfected, healthy plant.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1714149115223/75f13f34-7122-47b7-9e0e-987cc3ced26f.png?auto=compress,format&amp;format=webp" alt /></p>
<h3 id="heading-data-collection-and-preparation"><strong>Data Collection and Preparation</strong></h3>
<p><strong>Importing Libraries and Loading Data</strong></p>
<p>This step involves importing the necessary libraries for the project, such as TensorFlow and Matplotlib, and loading the dataset of potato images. It ensures that all required dependencies are available and accessible for further processing.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf
<span class="hljs-keyword">from</span> tensorflow.keras <span class="hljs-keyword">import</span> models, layers
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
</code></pre>
<p>This part imports the necessary libraries, including TensorFlow and Matplotlib, and loads the datasets of potato images using TensorFlow's <code>image_dataset_from_directory</code> function.</p>
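<p>The loading call itself isn't shown above; here is a minimal sketch, assuming the Kaggle images live in a local <code>PlantVillage</code> directory (the directory name is an assumption):</p>
<pre><code class="lang-python"># Hedged sketch: load the labeled image folders into a tf.data.Dataset
# ("PlantVillage" is an assumed path; adjust to where the Kaggle subset is stored)
dataset = tf.keras.preprocessing.image_dataset_from_directory(
    "PlantVillage",
    shuffle=True,
    image_size=(256, 256),
    batch_size=32
)
class_names = dataset.class_names  # the three potato classes
</code></pre>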
<h3 id="heading-data-preprocessing"><strong>Data Preprocessing</strong></h3>
<p>Data preprocessing involves preparing the dataset for model training by performing operations such as resizing, normalization, and batching. This section ensures that the data is in the appropriate format and structure for training the CNN model.</p>
<pre><code class="lang-python">IMAGE_SIZE = <span class="hljs-number">256</span>
BATCH_SIZE = <span class="hljs-number">32</span>
CHANNELS = <span class="hljs-number">3</span>
EPOCHS = <span class="hljs-number">50</span>
</code></pre>
<p>Defines constants for image size, batch size, number of color channels, and number of epochs.</p>
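<p>The model defined later also references a <code>resize_and_rescale</code> preprocessing block and <code>train_ds</code>/<code>val_ds</code>/<code>test_ds</code> splits that the snippets here don't define; a minimal sketch, assuming an 80/10/10 train/validation/test split:</p>
<pre><code class="lang-python"># Hedged sketch: split the loaded dataset 80/10/10 (the ratios are an assumption)
train_size = int(0.8 * len(dataset))
val_size = int(0.1 * len(dataset))
train_ds = dataset.take(train_size)
val_ds = dataset.skip(train_size).take(val_size)
test_ds = dataset.skip(train_size + val_size)

# Resize-and-rescale block used as the first layer of the model below
resize_and_rescale = tf.keras.Sequential([
    layers.experimental.preprocessing.Resizing(IMAGE_SIZE, IMAGE_SIZE),
    layers.experimental.preprocessing.Rescaling(1.0 / 255),
])
</code></pre>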
<h3 id="heading-data-augmentation"><strong>Data Augmentation</strong></h3>
<p>Data augmentation is a technique used to artificially increase the diversity of the training dataset by applying transformations such as rotation, flipping, and scaling to the images. This helps the model generalize better and improves its performance on unseen data.</p>
<pre><code class="lang-python">data_augmentation = tf.keras.Sequential([
    layers.experimental.preprocessing.RandomFlip(<span class="hljs-string">"horizontal_and_vertical"</span>),
    layers.experimental.preprocessing.RandomRotation(<span class="hljs-number">0.2</span>)
])
</code></pre>
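<p>Note that the <code>layers.experimental.preprocessing</code> namespace used here has since been promoted in TensorFlow; in recent releases the equivalent spelling is:</p>
<pre><code class="lang-python"># Equivalent augmentation block in recent TensorFlow versions
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal_and_vertical"),
    tf.keras.layers.RandomRotation(0.2)
])
</code></pre>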
<h3 id="heading-building-the-cnn-model"><strong>Building the CNN Model</strong></h3>
<h4 id="heading-model-architecture"><strong>Model Architecture</strong></h4>
<p>The model architecture defines the structure and configuration of the CNN model, including the number and type of layers, their activation functions, and their connectivity. It lays the foundation for the neural network's ability to learn and make predictions based on input data.</p>
<pre><code class="lang-python">input_shape = (BATCH_SIZE, IMAGE_SIZE, IMAGE_SIZE, CHANNELS)
n_classes = <span class="hljs-number">3</span>

model = models.Sequential([
    resize_and_rescale,
    data_augmentation,
    layers.Conv2D(<span class="hljs-number">32</span>, kernel_size=(<span class="hljs-number">3</span>,<span class="hljs-number">3</span>), activation=<span class="hljs-string">'relu'</span>, input_shape=input_shape),
    layers.MaxPooling2D((<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)),
    layers.Conv2D(<span class="hljs-number">64</span>,  kernel_size=(<span class="hljs-number">3</span>,<span class="hljs-number">3</span>), activation=<span class="hljs-string">'relu'</span>),
    layers.MaxPooling2D((<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)),
    layers.Conv2D(<span class="hljs-number">64</span>,  kernel_size=(<span class="hljs-number">3</span>,<span class="hljs-number">3</span>), activation=<span class="hljs-string">'relu'</span>),
    layers.MaxPooling2D((<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)),
    layers.Conv2D(<span class="hljs-number">64</span>, (<span class="hljs-number">3</span>, <span class="hljs-number">3</span>), activation=<span class="hljs-string">'relu'</span>),
    layers.MaxPooling2D((<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)),
    layers.Conv2D(<span class="hljs-number">64</span>, (<span class="hljs-number">3</span>, <span class="hljs-number">3</span>), activation=<span class="hljs-string">'relu'</span>),
    layers.MaxPooling2D((<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)),
    layers.Conv2D(<span class="hljs-number">64</span>, (<span class="hljs-number">3</span>, <span class="hljs-number">3</span>), activation=<span class="hljs-string">'relu'</span>),
    layers.MaxPooling2D((<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)),
    layers.Flatten(),
    layers.Dense(<span class="hljs-number">64</span>, activation=<span class="hljs-string">'relu'</span>),
    layers.Dense(n_classes, activation=<span class="hljs-string">'softmax'</span>),
])

model.build(input_shape=input_shape)
model.summary()
</code></pre>
<p>Defines the architecture of the CNN model using TensorFlow's Sequential API.</p>
<p><strong>The next step is to investigate the model architecture.</strong></p>
<p>Let’s have a look at a brief summary of our model:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1714152013997/4c3d7891-35c3-4f23-a559-c8e9b260b7e6.jpeg?auto=compress,format&amp;format=webp" alt /></p>
<h3 id="heading-compiling-the-model"><strong>Compiling the Model</strong></h3>
<p>Compiling the model involves configuring its learning process by specifying the optimizer, loss function, and evaluation metrics. This step prepares the model for training by defining how it should update its parameters to minimize the loss and improve performance.</p>
<pre><code class="lang-python">model.compile(
    optimizer=<span class="hljs-string">'adam'</span>,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=<span class="hljs-literal">False</span>),
    metrics=[<span class="hljs-string">'accuracy'</span>]
)
</code></pre>
<p>Compiles the model with the Adam optimizer, sparse categorical crossentropy loss function, and accuracy metric.</p>
<h4 id="heading-training-the-model"><strong>Training the Model</strong></h4>
<p>Training the model involves feeding the training dataset into the neural network and iteratively adjusting its parameters to minimize the loss function. This process allows the model to learn from the data and improve its ability to make accurate predictions.</p>
<pre><code class="lang-python">history = model.fit(
    train_ds,
    batch_size=BATCH_SIZE,
    validation_data=val_ds,
    verbose=<span class="hljs-number">1</span>,
    epochs=EPOCHS,
)
</code></pre>
<p>Trains the model on the training dataset and evaluates its performance on the validation dataset.</p>
<h3 id="heading-model-evaluation"><strong>Model Evaluation</strong></h3>
<p>Model evaluation involves assessing the performance of the trained model on separate validation and test datasets. It provides insights into how well the model generalizes to unseen data and helps identify areas for improvement.</p>
<pre><code class="lang-python">scores = model.evaluate(test_ds)
</code></pre>
<p>Evaluates the model on the test dataset; <code>model.evaluate</code> returns the evaluation scores (the test loss and accuracy, given the compile metrics).</p>
<h3 id="heading-performance-metrics"><strong>Performance Metrics</strong></h3>
<p>Performance metrics are quantitative measures used to evaluate the effectiveness of the trained model. Common metrics include accuracy, precision, recall, and F1 score, which provide information about the model's classification performance.</p>
<pre><code class="lang-python">history.history.keys()
</code></pre>
<p>Prints the keys available in the history object, which contains training and validation metrics.</p>
<h3 id="heading-visualizing-training-history"><strong>Visualizing Training History</strong></h3>
<p>Visualizing the training history involves plotting graphs of training and validation metrics, such as accuracy and loss, over the course of training epochs. This visualization helps identify trends, patterns, and potential issues in the model's learning process.</p>
<pre><code class="lang-python">plt.plot(range(EPOCHS), acc, label=<span class="hljs-string">'Training Accuracy'</span>)
plt.plot(range(EPOCHS), val_acc, label=<span class="hljs-string">'Validation Accuracy'</span>)
plt.title(<span class="hljs-string">'Training and Validation Accuracy'</span>)
plt.legend(loc=<span class="hljs-string">'lower right'</span>)
plt.show()
</code></pre>
<p>Plots the training and validation accuracy over epochs to visualize the model's learning progress.</p>
<p><strong>Let’s have a look at the history parameter.</strong></p>
<p>Actually, <code>history</code> is the object returned by Keras’s <code>fit</code> method (a <code>History</code> callback) that keeps the metric values for every epoch; let’s utilize it to plot some intriguing plots. Let’s start by putting all of these parameters into variables.</p>
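<p>A minimal sketch of that step (the key names follow the <code>accuracy</code> metric set at compile time):</p>
<pre><code class="lang-python"># Pull the per-epoch metrics out of history.history
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
</code></pre>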
<p><img src="https://editor.analyticsvidhya.com/uploads/11190download.png" alt="Potato Leaf Disease Prediction VALIDATION" /></p>
<p>This graph shows the accuracy of training vs validation. Epochs are on the x-axis, and accuracy and loss are on the y-axis.</p>
<pre><code class="lang-python"># Let's save our model
model.save('final_model.h5')
</code></pre>
<h3 id="heading-model-deployment"><strong>Model Deployment</strong></h3>
<h4 id="heading-saving-the-model"><strong>Saving the Model</strong></h4>
<p>Saving the model involves exporting its architecture and trained weights to a file for future use or deployment. This allows the model to be easily loaded and used in other applications without needing to retrain it from scratch.</p>
<pre><code class="lang-python">model.save(<span class="hljs-string">"potato_model.h5"</span>)
</code></pre>
<p>Saves the trained model to a file for future use or deployment.</p>
<h2 id="heading-streamlit-the-boom"><strong>Streamlit – The Boom!</strong></h2>
<p><strong>Streamlit</strong> is a free, open-source Python framework that allows us to quickly develop a web application without a backend server and without having to write HTML, CSS, or JavaScript; we can build a really good web application using our existing Python skills alone, and launch it locally with <code>streamlit run</code>. I’ve created a simple web application that accepts an image as input. The input image requires the same preprocessing steps that we applied to the training dataset, because saving the model only preserves its trained parameters; we must preprocess the input manually, which is something to keep in mind when building any web application around a pre-trained model.</p>
<h3 id="heading-web-app"><strong>Web App</strong></h3>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> streamlit <span class="hljs-keyword">as</span> st
<span class="hljs-keyword">from</span> PIL <span class="hljs-keyword">import</span> Image
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf
<span class="hljs-keyword">import</span> os

os.environ[<span class="hljs-string">"CUDA_VISIBLE_DEVICES"</span>] = <span class="hljs-string">"-1"</span>

class_names = [<span class="hljs-string">'Potato___Early_blight'</span>, <span class="hljs-string">'Potato___Late_blight'</span>, <span class="hljs-string">'Potato___healthy'</span>]

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict</span>(<span class="hljs-params">model, img</span>):</span>
    img_array = tf.keras.preprocessing.image.img_to_array(img)
    img_array = tf.expand_dims(img_array, <span class="hljs-number">0</span>)
    predictions = model.predict(img_array)
    predictions_arr = [round(<span class="hljs-number">100</span> * i, <span class="hljs-number">2</span>) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> predictions[<span class="hljs-number">0</span>]]
    predicted_class = class_names[np.argmax(predictions[<span class="hljs-number">0</span>])]
    <span class="hljs-keyword">return</span> predicted_class, predictions_arr

model = tf.keras.models.load_model(<span class="hljs-string">'potato_model.h5'</span>, compile=<span class="hljs-literal">False</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    st.set_page_config(page_title=<span class="hljs-string">"Potato Disease Classifier"</span>)
    st.sidebar.title(<span class="hljs-string">"Potato Disease Classifier"</span>)
    st.sidebar.info(<span class="hljs-string">"Upload an image of a potato leaf to detect early or late blight."</span>)
    st.title(<span class="hljs-string">"Potato Disease Detection"</span>)
    uploaded_file = st.file_uploader(<span class="hljs-string">"Upload a potato leaf image"</span>,type=[<span class="hljs-string">'jpg'</span>,<span class="hljs-string">'png'</span>,<span class="hljs-string">'jpeg'</span>])
    <span class="hljs-keyword">if</span> uploaded_file <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:
        image = Image.open(uploaded_file)
        st.image(image,caption=<span class="hljs-string">"Uploaded Image"</span>,use_column_width=<span class="hljs-literal">True</span>)
        image = image.resize((<span class="hljs-number">256</span>,<span class="hljs-number">256</span>))
        img_arr = np.array(image)
        predicted_class,predictions=predict(model,img_arr)

        response = {
            <span class="hljs-string">"predicted_class"</span>: predicted_class,
            <span class="hljs-string">"early"</span>: <span class="hljs-string">f"<span class="hljs-subst">{predictions[<span class="hljs-number">0</span>]:<span class="hljs-number">.2</span>f}</span>%"</span>,
            <span class="hljs-string">"late"</span>: <span class="hljs-string">f"<span class="hljs-subst">{predictions[<span class="hljs-number">1</span>]:<span class="hljs-number">.2</span>f}</span>%"</span>,
            <span class="hljs-string">"healthy"</span>: <span class="hljs-string">f"<span class="hljs-subst">{predictions[<span class="hljs-number">2</span>]:<span class="hljs-number">.2</span>f}</span>%"</span>
        }


        st.success(<span class="hljs-string">f"Predicted Class : <span class="hljs-subst">{response[<span class="hljs-string">'predicted_class'</span>]}</span>"</span>,icon=<span class="hljs-string">"✅"</span>)
        st.write(<span class="hljs-string">"Probabilities:"</span>)
        col1,col2,col3 = st.columns(<span class="hljs-number">3</span>)
        col1.metric(<span class="hljs-string">"Early Blight"</span> , <span class="hljs-string">f"<span class="hljs-subst">{response[<span class="hljs-string">'early'</span>]}</span>"</span>, <span class="hljs-string">f"<span class="hljs-subst">{response[<span class="hljs-string">'early'</span>]}</span>"</span>)
        col2.metric(<span class="hljs-string">"Late Blight"</span> , <span class="hljs-string">f"<span class="hljs-subst">{response[<span class="hljs-string">'late'</span>]}</span>"</span>, <span class="hljs-string">f"<span class="hljs-subst">{response[<span class="hljs-string">'late'</span>]}</span>"</span>)
        col3.metric(<span class="hljs-string">"Healthy"</span> , <span class="hljs-string">f"<span class="hljs-subst">{response[<span class="hljs-string">'healthy'</span>]}</span>"</span>, <span class="hljs-string">f"<span class="hljs-subst">{response[<span class="hljs-string">'healthy'</span>]}</span>"</span>)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    main()
</code></pre>
<p><strong>Output :</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1714150183694/179de6e2-c88a-4412-99a5-e382080a6ea5.png?auto=compress,format&amp;format=webp" alt /></p>
<p>Internally, the web app uses our previously developed deep learning model to detect potato leaf diseases.</p>
<p><strong><em>Potato early blight</em></strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1714150321630/d5a5465c-5f4d-487c-9e7a-7709f13b9045.png?auto=compress,format&amp;format=webp" alt /></p>
<p><strong><em>Potato healthy</em></strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1714150484146/5247d9f9-9d01-4018-988d-d322e356ff4d.png?auto=compress,format&amp;format=webp" alt /></p>
<p><strong><em>Potato late blight</em></strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1714150553146/63b97fd3-4926-40f8-9e4d-791e39037fd6.png?auto=compress,format&amp;format=webp" alt /></p>
<h3 id="heading-conclusion"><strong>Conclusion</strong></h3>
<p>As we conclude this chapter of the Moroccan Agriculture Management System project, we're reminded of the impactful results that collaborative efforts, like those facilitated by the MDS community, can achieve. By harnessing data and technology, we gain insights into societal trends and cultural phenomena, enabling informed decision-making, community engagement, and positive social change.</p>
<p>Moving forward, let's continue embodying the spirit of collaboration, curiosity, and inclusivity that defines the MDS community. Together, we'll persist in exploring, innovating, and inspiring, shaping a brighter future through data science and community-driven initiatives.</p>
<h1 id="heading-acknowledgments"><strong>Acknowledgments</strong></h1>
<p>I want to express my heartfelt appreciation to my dedicated team members Nizar Sahl, Idriss EL HOUARI, Farheen Akhter, Ben alla ismail, and Aicha Dessa. Their expertise and commitment have been invaluable to this project. Your hard work, collaboration, and enthusiasm have truly made a difference. As the team leader, I am incredibly proud to have worked alongside such talented individuals.</p>
<p>I also want to acknowledge <a target="_blank" href="https://www.linkedin.com/in/halimbahae/"><strong>Bahae Eddine HALIM</strong></a>, the founder of the Moroccan Data Science <a target="_blank" href="https://moroccands.com/"><strong>MDS Community</strong></a> for providing the platform for our project through the "DataFtour" second Edition initiative. His dedication to fostering a supportive environment for data enthusiasts in Morocco has been instrumental in our journey. Lastly, we thank the broader data science community for their support and encouragement, which have motivated us to push boundaries and continuously strive for excellence.</p>
<p><strong>Explore a preview of our project:</strong></p>
<blockquote>
<p><a target="_blank" href="https://huggingface.co/spaces/MoroccanDS/Moroccan-Agri-Leaf-Pest-Detection"><strong>HuggingFace</strong></a></p>
<p><a target="_blank" href="https://github.com/Moroccan-Data-Scientists/Management-System-for-Moroccan-Agriculture-"><strong>Github</strong></a></p>
</blockquote>
<p>You may connect with me on <a target="_blank" href="https://www.linkedin.com/in/asmae-el-ghezzaz/"><strong>LinkedIn</strong></a> and follow me there.</p>
<p><strong>Thank you</strong> ✨🧠🌱</p>
<p><a target="_blank" href="https://moroccands.com/"><strong>MDS Community</strong></a></p>
]]></content:encoded></item><item><title><![CDATA[Ramadan Social Media Sentiment Analysis in Morocco]]></title><description><![CDATA[Introduction
In the vibrant landscape of data science and community-driven initiatives, the MDS community stands as a beacon of innovation and collaboration. Founded by Bahae Eddine Halim , the MDS community has brought together a diverse group of pa...]]></description><link>https://blog.moroccands.com/ramadan-social-media-sentiment-analysis-in-morocco</link><guid isPermaLink="true">https://blog.moroccands.com/ramadan-social-media-sentiment-analysis-in-morocco</guid><dc:creator><![CDATA[Loubna Bouljadiane]]></dc:creator><pubDate>Sun, 21 Apr 2024 23:28:55 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1713461794578/a5add037-f2ea-4682-a9a6-e33bb4f4015e.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>In the vibrant landscape of data science and community-driven initiatives, the <a target="_blank" href="https://moroccands.com/">MDS</a> community stands as a beacon of innovation and collaboration. Founded by <a class="user-mention" href="https://hashnode.com/@bahae">Bahae Eddine Halim</a>, the <a target="_blank" href="https://moroccands.com/">MDS</a> community has brought together a diverse group of passionate individuals, united by their shared commitment to leveraging data science for societal impact.</p>
<p>In this article, we delve into one of the latest endeavors spearheaded by the <a target="_blank" href="https://moroccands.com/">MDS</a> community, a groundbreaking project aimed at analyzing Moroccan online discourse during Ramadan. With the collective efforts of <a class="user-mention" href="https://hashnode.com/@bahae">Bahae Eddine Halim</a>, <a class="user-mention" href="https://hashnode.com/@loubna264">Loubna Bouljadiane</a>, <a class="user-mention" href="https://hashnode.com/@soufianesejjari">Soufiane Sejjari</a>, <a class="user-mention" href="https://hashnode.com/@zinebelhz">Zineb El houz</a>, <a class="user-mention" href="https://hashnode.com/@ali55">Ali</a>, <a class="user-mention" href="https://hashnode.com/@hibalb21">Hiba Lbazry</a>, and <a class="user-mention" href="https://hashnode.com/@Imad1">Imad Nasri</a>, this project embarks on a journey to uncover insights from the vast sea of Moroccan Darija comments and tweets.</p>
<p>At its core, the main objective of this project is to harness the power of data science, machine learning, and deep learning to decode the sentiments, topics, and engagement patterns prevalent in Moroccan online conversations during the sacred month of Ramadan. Through meticulous analysis and cutting-edge methodologies, we aim to shed light on the nuanced perspectives and cultural dynamics shaping Moroccan society in the digital age.</p>
<p><strong>Explore a preview of our project:</strong> <a target="_blank" href="https://moroccansentimentsanalysis.netlify.app/">https://moroccansentimentsanalysis.netlify.app</a></p>
<blockquote>
<p>Explore the world of Ramadan comments in Moroccan Darija through EDA, uncovering patterns, trends, and insights from social media platforms like Facebook, Twitter, Hespress, and YouTube. Discover the prevalence of religious themes, temporal engagement patterns, language distribution, topic modeling results, and comment length analysis across different platforms. Gain valuable insights into online discourse during Ramadan and the significance of EDA in extracting meaningful insights from diverse datasets.</p>
</blockquote>
<h1 id="heading-data-scraping">Data Scraping</h1>
<p>As we know, in the world of AI the most important thing is the <strong>data</strong>; having lots of data is like having a big toolbox - the more tools you have, the easier it is to fix things.</p>
<p>So before we dive into how we scraped our data from different sources like YouTube and Twitter (X), let's understand scraping.</p>
<h3 id="heading-what-is-the-scraping"><strong>What is scraping?</strong></h3>
<p>"Scraping is like gathering ingredients for a recipe, but from websites instead of the grocery store."</p>
<p><img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRnz7cGhx3opD2lcwyPO-UbsUNZeLm7g_rNdfW2QuRcIw&amp;s" alt="Web scraping and data scraping: the real definitions" class="image--center mx-auto" /></p>
<p>Scraping is simply the process of extracting data from websites. It allows us to collect information for various purposes such as research or building datasets for analysis. In the context of our project, we're specifically interested in scraping comments written in Darija (Moroccan Arabic) from different sources.</p>
<h3 id="heading-how-do-we-get-data-from-the-websites"><strong>How do we get data from websites?</strong></h3>
<p>There are several methods to perform web scraping, each with its own advantages and limitations. Here are some common methods along with brief explanations:</p>
<ol>
<li><p><strong>Manual Scraping:</strong> This involves manually copying and pasting data from websites into a local file or database. While simple and straightforward, it is not practical for scraping large amounts of data and is highly inefficient.</p>
</li>
<li><p><strong>Beautiful Soup:</strong> Beautiful Soup is a Python library designed for quick and easy scraping of web pages. It provides a simple API for navigating and searching the HTML structure of a webpage, making it ideal for extracting specific information from websites.</p>
</li>
<li><p><strong>Selenium:</strong> Selenium is another popular tool for web scraping, particularly useful for scraping dynamic content generated by JavaScript. It allows automation of web browsers to interact with web pages, enabling scraping of content that is rendered dynamically.</p>
</li>
<li><p><strong>API Scraping:</strong> Many websites offer APIs (Application Programming Interfaces) that allow developers to access their data in a structured and legal manner. By interacting with these APIs, developers can retrieve data without having to parse HTML or deal with web scraping complexities.</p>
</li>
</ol>
<p>Now, let's relate these methods to our project:</p>
<p><strong>From Hespress:</strong></p>
<ul>
<li>We employed a combination of <strong>Beautiful Soup</strong> and <strong>Selenium</strong> techniques to extract data from the HESPRESS website. The data we scraped includes the article's title, published date, and comments associated with the article.</li>
</ul>
<p><strong><em>The process:</em></strong></p>
<ol>
<li><p><strong>Importing Libraries:</strong></p>
<ul>
<li><p>The <code>webdriver</code> module from Selenium serves a pivotal role in automating interactions with web browsers, enabling developers to programmatically navigate and interact with web pages. This powerful tool empowers users to simulate user actions such as clicking buttons, filling out forms...</p>
<p>  On the other hand, <code>BeautifulSoup</code>, imported from the <code>bs4</code> library, provides a versatile and user-friendly solution for parsing and extracting data from HTML and XML files.</p>
<pre><code class="lang-python">    <span class="hljs-keyword">import</span> os
    <span class="hljs-keyword">import</span> time
    <span class="hljs-keyword">from</span> selenium <span class="hljs-keyword">import</span> webdriver
    <span class="hljs-keyword">from</span> bs4 <span class="hljs-keyword">import</span> BeautifulSoup
</code></pre>
</li>
</ul>
</li>
<li><p><strong>Web Scraping Process:</strong></p>
<ul>
<li><p>The process begins by creating a <code>webdriver.Chrome</code> instance, which specifies the path to the Chrome WebDriver executable (chromedriver). This instance acts as a bridge between the Python script and the Chrome browser, enabling automated interactions with web pages.</p>
<p>  Next, the browser is directed to the specified URL using the <code>driver.get(url)</code> command.</p>
<p>  To ensure that the page is fully loaded before proceeding with further actions, a delay of 3 seconds is introduced using <code>time.sleep(3)</code>.</p>
<p>  Once the page is fully loaded, the HTML source code of the webpage is retrieved using <code>driver.page_source</code>.</p>
<p>  Finally, the retrieved HTML source code is passed to BeautifulSoup for parsing and analysis, using the <code>BeautifulSoup(src, 'lxml')</code> syntax to build a navigable parse tree with the lxml parser.</p>
<pre><code class="lang-python">    driver = webdriver.Chrome(<span class="hljs-string">'C:/Users/sejja/chromedriver'</span>)  <span class="hljs-comment"># Assuming valid path</span>
    driver.get(url)
    time.sleep(<span class="hljs-number">3</span>)

    src = driver.page_source
    soup = BeautifulSoup(src, <span class="hljs-string">'lxml'</span>)
</code></pre>
</li>
<li><p>Various elements of interest, such as the title, date, tags, and comments, are extracted from the parsed HTML using BeautifulSoup's <code>find</code> and <code>findAll</code> methods. Here is an example of how we extract the title and the comments:</p>
<pre><code class="lang-python"># This snippet sits inside the scraping function; date and tags are extracted similarly
titre = soup.find('h1', {'class': 'post-title'})
if titre:
    titre = titre.get_text().strip()
else:
    titre = "not available"

comments_area = soup.find('ul', {'class': 'comment-list hide-comments'})
comments = []
if comments_area:
    for comment in comments_area.findAll('li', {'class': 'comment even thread-even depth-1 not-reply'}):
        comment_date = comment.find('div', {'class': 'comment-date'})
        comment_content = comment.find('div', {'class': 'comment-text'})
        comment_react = comment.find('span', {'class': 'comment-recat-number'})
        if comment_date and comment_content and comment_react:
            comments.append({
                "comment_date": comment_date.get_text(),
                "comment_content": comment_content.get_text(),
                "comment_react": comment_react.get_text()
            })

return {'Date': date, 'Titre': titre, 'Tags': tags, 'Comments': comments}
</code></pre>
</li>
</ul>
</li>
</ol>
<p>And finally, the extracted data is returned in dictionary format:</p>
<pre><code class="lang-python">    <span class="hljs-keyword">return</span> {<span class="hljs-string">'Date'</span>: date, <span class="hljs-string">'Titre'</span>: titre, <span class="hljs-string">'Tags'</span>: tags, <span class="hljs-string">'Comments'</span>: comments}
</code></pre>
<ol start="3">
<li><p><strong>Cleanup:</strong></p>
<ul>
<li><code>driver.quit()</code> ensures that the WebDriver browser instance is closed, even if an exception occurs, preventing resource leaks.</li>
</ul>
</li>
</ol>
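<p>A hedged sketch of the try/finally pattern implied by that cleanup step (the driver path is the same assumption as above):</p>
<pre><code class="lang-python">driver = webdriver.Chrome('C:/Users/sejja/chromedriver')
try:
    driver.get(url)
    time.sleep(3)
    src = driver.page_source
    soup = BeautifulSoup(src, 'lxml')
finally:
    driver.quit()  # always release the browser, even if an exception occurs
</code></pre>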
<p><strong>From YouTube:</strong></p>
<p><strong>YouTube API:</strong> The YouTube API provides developers with programmatic access to YouTube's features, including retrieving video metadata, comments, and other relevant data.</p>
<p>The code first fetches details about the specified video, including the channel name and video title. Then, it retrieves comments associated with the video using the 'commentThreads' endpoint. To handle pagination, it iterates through multiple pages of comments, ensuring that all comments are captured.</p>
<p>The extracted data includes essential information such as the video title, channel name, comment date, content, likes, dislikes, author, and number of replies.</p>
<p><strong><em>The process:</em></strong></p>
<ul>
<li><p><strong>Get the YouTube API key:</strong></p>
<ol>
<li><p>To obtain your YouTube Data API key, you need to follow these steps:</p>
<p> <strong>1. Sign in to Google:</strong></p>
<p> - Go to the Google Developers Console at <a target="_blank" href="https://console.developers.google.com/"><strong>console.developers.google.com</strong></a></p>
<p> - Sign in with your Google account. If you don't have one, you'll need to create it.</p>
<p> <strong>2. Create a new project:</strong></p>
<p> - If you don't have any existing projects, you'll be prompted to create one. Click on the "Select a project" dropdown menu at the top and then click on the "New Project" button.</p>
<p> - Enter a name for your project and click on the "Create" button.</p>
<p> <strong>3. Enable the YouTube Data API:</strong></p>
<p> - In the Google Cloud Console, navigate to the "APIs &amp; Services" &gt; "Library" page using the menu on the left.</p>
<p> - Search for "YouTube Data API" in the search bar.</p>
<p> - Click on the "YouTube Data API v3" result.</p>
<p> - Click on the "Enable" button.</p>
<p> <strong>4. Create credentials:</strong></p>
<p> - After enabling the API, navigate to the "APIs &amp; Services" &gt; "Credentials" page using the menu on the left.</p>
<p> - Click on the "Create credentials" button and select "API key" from the dropdown menu.</p>
<p> - Your API key will be created. Copy it and securely store it.</p>
<p> <strong>5. Use your API key:</strong></p>
<p> - Now that you have your API key, you can use it in your applications to access the YouTube Data API.</p>
</li>
</ol>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> googleapiclient.discovery <span class="hljs-keyword">import</span> build
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># YouTube Data API key</span>
API_KEY = <span class="hljs-string">'your YouTube API key'</span>
</code></pre>
<ul>
<li><strong>Build the YouTube service object:</strong></li>
</ul>
<pre><code class="lang-python"> <span class="hljs-comment"># Build the YouTube service object. It requires specifying the API name ('youtube'), API version ('v3'), and the developer API key (API_KEY) obtained from the Google Developer Console.</span>
youtube_service = build(<span class="hljs-string">'youtube'</span>, <span class="hljs-string">'v3'</span>, developerKey=API_KEY)
</code></pre>
<ul>
<li><strong>Retrieve video details:</strong></li>
</ul>
<pre><code class="lang-python"> <span class="hljs-comment"># Retrieve video details using the 'videos' endpoint</span>
video_response = youtube_service.videos().list(
       part=<span class="hljs-string">'snippet'</span>,
       id=video_id
).execute()
</code></pre>
<ul>
<li><strong>Extract the essential information such as the video title, channel name, and comments:</strong></li>
</ul>
<p>Here we extracted the channel name and video title from the video_response:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Extract channel name and video title</span>
channel_name = video_response[<span class="hljs-string">'items'</span>][<span class="hljs-number">0</span>][<span class="hljs-string">'snippet'</span>][<span class="hljs-string">'channelTitle'</span>]
video_title = video_response[<span class="hljs-string">'items'</span>][<span class="hljs-number">0</span>][<span class="hljs-string">'snippet'</span>][<span class="hljs-string">'title'</span>]
</code></pre>
<p>And here we got the comments plus some details such as the comment's date, number of likes, the author, and so on (including the pagination):</p>
<pre><code class="lang-python">response = youtube_service.commentThreads().list(
          part=<span class="hljs-string">'snippet,replies'</span>,
          videoId=video_id,
          textFormat=<span class="hljs-string">'plainText'</span>,
          maxResults=<span class="hljs-number">100</span>,  <span class="hljs-comment"># The API allows at most 100 comment threads per page</span>
          pageToken=nextPageToken
     ).execute()

<span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> response[<span class="hljs-string">'items'</span>]:
        print(<span class="hljs-string">f"Comments are disabled for video: <span class="hljs-subst">{video_id}</span>"</span>)
        <span class="hljs-keyword">return</span> []

<span class="hljs-keyword">for</span> item <span class="hljs-keyword">in</span> response[<span class="hljs-string">'items'</span>]:
        snippet = item[<span class="hljs-string">'snippet'</span>][<span class="hljs-string">'topLevelComment'</span>][<span class="hljs-string">'snippet'</span>]
        comments.append({
                <span class="hljs-string">'Video Title'</span>: video_title,
                <span class="hljs-string">'Channel Name'</span>: channel_name,
                <span class="hljs-string">'Comment Date'</span>: snippet.get(<span class="hljs-string">'publishedAt'</span>, <span class="hljs-string">''</span>),
                <span class="hljs-string">'Comment'</span>: snippet.get(<span class="hljs-string">'textDisplay'</span>, <span class="hljs-string">''</span>),
                <span class="hljs-string">'Likes'</span>: snippet.get(<span class="hljs-string">'likeCount'</span>, <span class="hljs-number">0</span>),
                <span class="hljs-string">'Dislikes'</span>: snippet.get(<span class="hljs-string">'dislikeCount'</span>, <span class="hljs-number">0</span>),
                <span class="hljs-string">'Author'</span>: snippet.get(<span class="hljs-string">'authorDisplayName'</span>, <span class="hljs-string">''</span>),
                <span class="hljs-string">'Replies'</span>: item[<span class="hljs-string">'snippet'</span>][<span class="hljs-string">'totalReplyCount'</span>]
         })
</code></pre>
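<p>For completeness, a hedged sketch of the pagination loop around that request (variable names match the snippet above; <code>nextPageToken</code> starts as <code>None</code>, which the client library omits from the first request):</p>
<pre><code class="lang-python"># Hedged sketch: iterate over all comment pages using nextPageToken
comments = []
nextPageToken = None
while True:
    response = youtube_service.commentThreads().list(
        part='snippet,replies',
        videoId=video_id,
        textFormat='plainText',
        maxResults=100,
        pageToken=nextPageToken
    ).execute()
    # ... append items to comments as shown above ...
    nextPageToken = response.get('nextPageToken')
    if nextPageToken is None:
        break
</code></pre>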
<ul>
<li><strong>Search for YouTube videos based on a query string:</strong></li>
</ul>
<p>So we added a function that lets us search for YouTube videos based on a query string, with optional filters for region (we used MA, indicating Morocco) and publication dates (matching the month of Ramadan). It fetches video IDs matching the search criteria, which can then be used to retrieve additional details or perform other operations on the videos.</p>
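<p>A minimal sketch of such a search call (the exact query string and date window are illustrative assumptions):</p>
<pre><code class="lang-python"># Hedged sketch: search Moroccan-region videos published during Ramadan
search_response = youtube_service.search().list(
    part='id',
    q='رمضان',
    type='video',
    regionCode='MA',
    publishedAfter='2024-03-10T00:00:00Z',   # assumed window start
    publishedBefore='2024-04-10T00:00:00Z',  # assumed window end
    maxResults=50
).execute()
video_ids = [item['id']['videoId'] for item in search_response['items']]
</code></pre>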
<p><strong>For Twitter:</strong></p>
<ul>
<li><strong>Twscrape:</strong> It is a tool for scraping data from tweets. It collects data such as user profiles, follower and following lists, likes and retweets, as well as keyword searches.</li>
</ul>
<p>The extracted data includes essential information such as the username, content and comment date.</p>
<p><strong><em>The process:</em></strong></p>
<ul>
<li><p><strong>Fetching Tweets:</strong> We use an asynchronous function <code>gather_tweets()</code> to retrieve tweets from an asynchronous Twitter scraper. Similar to YouTube, we used a query string to retrieve the tweets related to the month of Ramadan. We also need the credentials of a Twitter (X) account (username, password, ...).</p>
<pre><code class="lang-python">    <span class="hljs-keyword">import</span> asyncio
    <span class="hljs-keyword">import</span> twscrape
    <span class="hljs-keyword">from</span> twscrape <span class="hljs-keyword">import</span> API, gather
    <span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

    <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">TwitterScraper</span>:</span>
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self</span>):</span>
            self.api = API()

        <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">gather_tweets</span>(<span class="hljs-params">self, query=<span class="hljs-string">"رمضان"</span>, limit=<span class="hljs-number">20</span></span>):</span>
            <span class="hljs-keyword">await</span> self.api.pool.add_account(<span class="hljs-string">"twitter account username"</span>, <span class="hljs-string">"password"</span>, <span class="hljs-string">"email"</span>,<span class="hljs-string">"mail_pass"</span>) <span class="hljs-comment"># mail password is optional; we can add more than one account</span>
            <span class="hljs-keyword">await</span> self.api.pool.login_all()

            tweets = <span class="hljs-keyword">await</span> gather(self.api.search(query, limit=limit))

            data = []
            <span class="hljs-keyword">for</span> tweet <span class="hljs-keyword">in</span> tweets:
                tweet_data = {
                    <span class="hljs-string">'ID'</span>: tweet.id,
                    <span class="hljs-string">'Username'</span>: tweet.user.username,
                    <span class="hljs-string">'Content'</span>: tweet.rawContent,
                    <span class="hljs-string">'Date'</span>: tweet.date
                }
                data.append(tweet_data)

                print(tweet.id, tweet.user.username, tweet.rawContent)

            df = pd.DataFrame(data)
            <span class="hljs-keyword">return</span> df
</code></pre>
</li>
<li><p><strong>Displaying Tweets:</strong> Tweets are displayed on the screen during the retrieval process.</p>
</li>
<li><p><strong>Saving Tweets:</strong> Tweet data is saved in either a CSV or JSON file, depending on your choice.</p>
</li>
</ul>
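<p>A hedged usage sketch tying these steps together (the output file name is an assumption):</p>
<pre><code class="lang-python"># Run the async scraper and save the tweets to CSV
scraper = TwitterScraper()
df = asyncio.run(scraper.gather_tweets(query="رمضان", limit=200))
df.to_csv("ramadan_tweets.csv", index=False)
</code></pre>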
<p><strong>For Facebook:</strong></p>
<ul>
<li>As noted before for the Hespress website, we also used a combination of <strong>Beautiful Soup</strong> and <strong>Selenium</strong> techniques to extract data from Facebook. The data we scraped includes the post's title, published date, and comments associated with the post.</li>
</ul>
<p><strong><em>The process:</em></strong></p>
<ol>
<li><p><strong>Import necessary libraries:</strong></p>
<pre><code class="lang-python">   <span class="hljs-keyword">from</span> bs4 <span class="hljs-keyword">import</span> BeautifulSoup
   <span class="hljs-keyword">from</span> selenium <span class="hljs-keyword">import</span> webdriver
   <span class="hljs-keyword">from</span> selenium.common.exceptions <span class="hljs-keyword">import</span> NoSuchElementException, TimeoutException
   <span class="hljs-keyword">from</span> selenium.webdriver.common.by <span class="hljs-keyword">import</span> By
   <span class="hljs-keyword">import</span> time
   <span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
</code></pre>
</li>
<li><p><strong>Get the posts from a Facebook page :</strong></p>
<p> Define the function <code>get_facebook_post_data</code> to get posts from a Facebook page. Inside this function we initialize a Chrome WebDriver (assuming the path to <code>chromedriver</code> is valid), navigate to the provided Facebook page URL, click the "See More Posts" button if present, and scroll down to load more posts based on the <code>scroll_count</code> (a sketch of this scroll loop appears after this list). We then use BeautifulSoup to parse the page source and extract post data (title, link, date, reactions, comments) using specified class names. Finally, we construct a DataFrame from the extracted data and return it.</p>
<p> <strong><em>Usage example:</em></strong></p>
<pre><code class="lang-python">   series = get_facebook_post_data(<span class="hljs-string">'https://web.facebook.com/alwa3d4'</span>, scroll_count=<span class="hljs-number">80</span>)
</code></pre>
</li>
<li><p><strong>Use the </strong><code>facebook_scraper</code><strong> library to extract data from Facebook posts:</strong></p>
<ol>
<li><p>Import required libraries:</p>
<pre><code class="lang-python">   <span class="hljs-keyword">import</span> facebook_scraper <span class="hljs-keyword">as</span> fs
   <span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
   <span class="hljs-keyword">from</span> facebook_scraper <span class="hljs-keyword">import</span> exceptions  <span class="hljs-comment"># Import specific exceptions</span>
</code></pre>
</li>
<li><p>Define the <code>FacebookScraper</code> class:</p>
<ul>
<li><p>A constructor that initializes the maximum number of comments to retrieve (<code>MAX_COMMENTS</code>).</p>
</li>
<li><p>There is also the method <code>getPostData</code>, which takes the parameter <code>post_url</code>, the URL of the Facebook post. This function extracts the post ID from the provided URL, attempts to retrieve post data using <code>facebook_scraper</code>, and handles potential errors such as missing comments or invalid URLs. If comments are found, it normalizes the JSON data into a DataFrame.</p>
<pre><code class="lang-python"># Inside getPostData
post_id = post_url.split("/")[-1].split("?")[0]  # Extract post ID
print(post_id)

# Attempt to get post data, handling potential errors
gen = fs.get_posts(post_urls=[post_id], options={"comments": self.MAX_COMMENTS, "progress": True})
post = next(gen)

# Handle missing 'comments_full' key
comments = post.get('comments_full', [])  # Use default empty list if missing

if comments:
    df = pd.json_normalize(comments, sep='_')
    return df
else:
    print(f"No comments found for post: {post_id}")
    return None  # Return None to indicate no comments
</code></pre>
</li>
</ul>
</li>
</ol>
</li>
</ol>
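<p>As referenced above, a hedged sketch of the scroll-to-load loop inside <code>get_facebook_post_data</code> (the scroll delay is an assumption):</p>
<pre><code class="lang-python"># Hedged sketch: scroll the page scroll_count times so more posts load
for _ in range(scroll_count):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the new posts time to load
soup = BeautifulSoup(driver.page_source, 'lxml')
</code></pre>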
<p><strong>In conclusion, data scraping is a crucial process in collecting data from various sources like websites and social media platforms. By utilizing tools like Beautiful Soup, Selenium, and APIs, developers can efficiently gather and analyze data for research and analysis purposes. The ability to extract valuable information from websites, YouTube, Twitter, and Facebook opens up opportunities for AI projects to utilize data effectively.</strong></p>
<h1 id="heading-data-cleaning">Data Cleaning</h1>
<p>After successfully scraping data from various sources such as Hespress, YouTube, Twitter, and Facebook, the next crucial step is cleaning this diverse dataset. Each source brings its own set of challenges and characteristics, making data cleaning essential to ensure accuracy and reliability in subsequent analysis. In our cleaning process we used various techniques to get the data ready for analysis.</p>
<h3 id="heading-hespress-dataset"><strong>Hespress Dataset</strong></h3>
<p><strong>Data structure:</strong> The Hespress dataset comprises columns including Date, Titre, Tags, and Comments. The Date column records the date and time of comment posting, presented in Arabic format (e.g., "الجمعة 29 مارس 2024 - 18:00"). Titre represents the title or headline of the corresponding content. Tags denote keywords or labels associated with the content. Comments, depicted as a string-format dictionary, encapsulate comment-related data, including comment_date, comment_content, and comment_react. Each comment_date entry denotes the timestamp of comment posting, while comment_content encapsulates the textual content of the comment. Lastly, comment_react records reactions or feedback received on the comment.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd 
hes_data1 = pd.read_csv(<span class="hljs-string">"/content/drive/MyDrive/DataSets sentiment analysis/hespressComments.csv"</span>)
hes_data2 = pd.read_csv(<span class="hljs-string">"/content/drive/MyDrive/DataSets sentiment analysis/hespressComments2.csv"</span>)
</code></pre>
<p>We focused solely on two variables: Date and Comments.</p>
<pre><code class="lang-python">hes_data2 = hes_data2[[<span class="hljs-string">'Date'</span>,<span class="hljs-string">'Comments'</span>]]
hes_data1 = hes_data1[[<span class="hljs-string">'Date'</span>,<span class="hljs-string">'Comments'</span>]]
</code></pre>
<p><strong>Data cleaning:</strong></p>
<p>The Comments variable, initially stored as a string containing a list of dictionaries, encapsulates the essential comment-related data: comment_date, comment_content, and comment_react. To facilitate further analysis, a crucial first step was converting this string into an actual list of dictionaries.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> ast
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">convert_str_to_list_of_dicts</span>(<span class="hljs-params">input_str</span>):</span>
    <span class="hljs-keyword">try</span>:
        result_list = ast.literal_eval(input_str)
        <span class="hljs-keyword">if</span> isinstance(result_list, list):    
            <span class="hljs-keyword">if</span> all(isinstance(item, dict) <span class="hljs-keyword">for</span> item <span class="hljs-keyword">in</span> result_list):
                <span class="hljs-keyword">return</span> result_list
    <span class="hljs-keyword">except</span> (SyntaxError, ValueError):
        <span class="hljs-keyword">pass</span>
    <span class="hljs-keyword">return</span> []
hes_data1[<span class="hljs-string">'Comments'</span>] = hes_data1[<span class="hljs-string">'Comments'</span>].apply(convert_str_to_list_of_dicts)
hes_data2[<span class="hljs-string">'Comments'</span>] = hes_data2[<span class="hljs-string">'Comments'</span>].apply(convert_str_to_list_of_dicts)
hes_data = pd.concat([hes_data1,hes_data2])
</code></pre>
<p>In addition to converting the Comments variable from a string into an actual list of dictionaries, another crucial step involved data validation. Some rows were found to contain empty comment lists, which could potentially skew the analysis, so we identified and removed them from the dataset. This validation step ensured the integrity and reliability of the dataset for subsequent sentiment analysis.</p>
<pre><code class="lang-python">hes_data = hes_data[hes_data[<span class="hljs-string">'Comments'</span>].apply(<span class="hljs-keyword">lambda</span> x: len(x) &gt; <span class="hljs-number">0</span>)]
hes_data.reset_index(drop=<span class="hljs-literal">True</span>, inplace=<span class="hljs-literal">True</span>)
</code></pre>
<p>We then transformed the Comments variable from a list of dictionaries into a structured DataFrame, organizing each comment's date and content into a single DataFrame by leveraging Python's Pandas library. This transformation facilitated easier access and interpretation of the comment data, laying the groundwork for subsequent sentiment analysis.</p>
<pre><code class="lang-python">comments_list = []
<span class="hljs-keyword">for</span> i, comment_data <span class="hljs-keyword">in</span> enumerate(hes_data[<span class="hljs-string">'Comments'</span>]):
    <span class="hljs-keyword">for</span> comment_dict <span class="hljs-keyword">in</span> comment_data:
        comment_date = comment_dict[<span class="hljs-string">'comment_date'</span>].strip()
        comment_content = comment_dict[<span class="hljs-string">'comment_content'</span>].strip()

        comments_list.append({<span class="hljs-string">'Date'</span>: comment_date, <span class="hljs-string">'Comment'</span>: comment_content})


hes_data_final = pd.DataFrame(comments_list)
</code></pre>
<p>The next step in the data cleaning process involved transforming the date format from its original Arabic format (e.g., 'الجمعة 29 مارس 2024 - 18:00') to a standardized format ('YYYY-MM-DD HH:MM:SS'). This conversion ensured consistency and compatibility with common date-time formats, facilitating easier manipulation and analysis of the data. By implementing this transformation, the dataset's date information was brought into a uniform structure, ready for further analysis and visualization.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">convert_date</span>(<span class="hljs-params">date_str</span>):</span>    
    parts = date_str.split()
    day = int(parts[<span class="hljs-number">1</span>])
    month = {
        <span class="hljs-string">'يناير'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'فبراير'</span>: <span class="hljs-number">2</span>, <span class="hljs-string">'مارس'</span>: <span class="hljs-number">3</span>, <span class="hljs-string">'أبريل'</span>: <span class="hljs-number">4</span>, <span class="hljs-string">'ماي'</span>: <span class="hljs-number">5</span>, <span class="hljs-string">'يونيو'</span>: <span class="hljs-number">6</span>,
        <span class="hljs-string">'يوليوز'</span>: <span class="hljs-number">7</span>, <span class="hljs-string">'غشت'</span>: <span class="hljs-number">8</span>, <span class="hljs-string">'شتنبر'</span>: <span class="hljs-number">9</span>, <span class="hljs-string">'أكتوبر'</span>: <span class="hljs-number">10</span>, <span class="hljs-string">'نونبر'</span>: <span class="hljs-number">11</span>, <span class="hljs-string">'دجنبر'</span>: <span class="hljs-number">12</span>
    }[parts[<span class="hljs-number">2</span>]]
    year = int(parts[<span class="hljs-number">3</span>])
    time = parts[<span class="hljs-number">-1</span>]
    date_time = datetime(year, month, day)
    time_parts = time.split(<span class="hljs-string">':'</span>)
    date_time = date_time.replace(hour=int(time_parts[<span class="hljs-number">0</span>]), minute=int(time_parts[<span class="hljs-number">1</span>]))
    <span class="hljs-keyword">return</span> date_time

hes_data_final[<span class="hljs-string">'Date'</span>] = comments_df[<span class="hljs-string">'Date'</span>].apply(convert_date)
</code></pre>
<p>In addition to transforming the date format, we added a new column named 'source' to the DataFrame and assigned it the value 'Hespress' for each row. This allowed for easy identification and categorization of the data based on its source. By including this metadata, the dataset became more informative and well-organized, facilitating subsequent analysis and interpretation.</p>
<pre><code class="lang-python">hes_data_final = hes_data_final.assign(source = <span class="hljs-string">'Hespress'</span>)
</code></pre>
<p>To enhance the dataset's comprehensiveness, the final step involved detecting the language of each comment. Utilizing the 'langdetect' library, we automatically identified the language of each comment's text. This enabled us to distinguish between comments in different languages, ensuring that subsequent analysis accurately reflected the linguistic diversity of the dataset. By incorporating language detection, we enriched the dataset with valuable metadata, facilitating more nuanced insights and interpretation.</p>
<pre><code class="lang-python">pip install langdetect
</code></pre>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> langdetect <span class="hljs-keyword">import</span> detect, DetectorFactory
DetectorFactory.seed = <span class="hljs-number">0</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">detect_language</span>(<span class="hljs-params">comment</span>):</span>
    <span class="hljs-keyword">try</span>:
        lang = detect(comment)
        <span class="hljs-keyword">return</span> lang
    <span class="hljs-keyword">except</span>:
        <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>
hes_data_final[<span class="hljs-string">'Language'</span>] = hes_data_final[<span class="hljs-string">'Comment'</span>].apply(detect_language)
</code></pre>
<h3 id="heading-youtube-dataset"><strong>YouTube Dataset</strong></h3>
<p><strong>Data structure:</strong> The YouTube dataset consists of various columns, each containing specific information about YouTube videos and their comments: the video title, the channel name, the comment date, the comment content, the numbers of likes and dislikes, the author's username, and the number of replies the comment has generated. Each row represents one comment posted on a YouTube video.</p>
<pre><code class="lang-python">youtube_datasets_all = []
youtube_dataset1 = pd.read_csv(<span class="hljs-string">"/content/drive/MyDrive/Datasets_Youtube/ramadan_morocco_comments_youtube (1).csv"</span>)
youtube_datasets_all.append(youtube_dataset1)
youtube_dataset2 = pd.read_csv(<span class="hljs-string">"/content/drive/MyDrive/Datasets_Youtube/ramadan_morocco_comments_youtube (2).csv"</span>)
youtube_datasets_all.append(youtube_dataset2)
youtube_dataset3 = pd.read_csv(<span class="hljs-string">"/content/drive/MyDrive/Datasets_Youtube/ramadan_morocco_comments_youtube (3).csv"</span>)
youtube_datasets_all.append(youtube_dataset3)
youtube_dataset4 = pd.read_csv(<span class="hljs-string">"/content/drive/MyDrive/Datasets_Youtube/ramadan_morocco_comments_youtube (4).csv"</span>)
youtube_datasets_all.append(youtube_dataset4)
youtube_dataset5 = pd.read_csv(<span class="hljs-string">"/content/drive/MyDrive/Datasets_Youtube/ramadan_morocco_comments_youtube (5).csv"</span>)
youtube_datasets_all.append(youtube_dataset5)
youtube_dataset6 = pd.read_csv(<span class="hljs-string">"/content/drive/MyDrive/Datasets_Youtube/ramadan_morocco_comments_youtube (6).csv"</span>)
youtube_datasets_all.append(youtube_dataset6)
youtube_dataset7 = pd.read_csv(<span class="hljs-string">"/content/drive/MyDrive/Datasets_Youtube/ramadan_morocco_comments_youtube (7).csv"</span>)
youtube_datasets_all.append(youtube_dataset7)
youtube_dataset8 = pd.read_csv(<span class="hljs-string">"/content/drive/MyDrive/Datasets_Youtube/ramadan_morocco_comments_youtube (8).csv"</span>)
youtube_datasets_all.append(youtube_dataset8)
youtube_dataset9 = pd.read_csv(<span class="hljs-string">"/content/drive/MyDrive/Datasets_Youtube/ramadan_morocco_comments_youtube (9).csv"</span>)
youtube_datasets_all.append(youtube_dataset9)
youtube_dataset10 = pd.read_csv(<span class="hljs-string">"/content/drive/MyDrive/Datasets_Youtube/ramadan_morocco_comments_youtube (10).csv"</span>)
youtube_datasets_all.append(youtube_dataset10)
youtube_dataset11 = pd.read_csv(<span class="hljs-string">"/content/drive/MyDrive/Datasets_Youtube/ramadan_morocco_comments_youtube (11).csv"</span>)
youtube_datasets_all.append(youtube_dataset11)
youtube_dataset12 = pd.read_csv(<span class="hljs-string">"/content/drive/MyDrive/Datasets_Youtube/ramadan_morocco_comments_youtube (12).csv"</span>)
youtube_datasets_all.append(youtube_dataset12)
youtube_dataset13 = pd.read_csv(<span class="hljs-string">"/content/drive/MyDrive/Datasets_Youtube/ramadan_morocco_comments_youtube (13).csv"</span>)
youtube_datasets_all.append(youtube_dataset13)
youtube_dataset14 = pd.read_csv(<span class="hljs-string">"/content/drive/MyDrive/Datasets_Youtube/ramadan_morocco_comments_youtube (14).csv"</span>)
youtube_datasets_all.append(youtube_dataset14)
youtube_dataset15 = pd.read_csv(<span class="hljs-string">"/content/drive/MyDrive/Datasets_Youtube/ramadan_morocco_comments_youtube.csv"</span>)
youtube_datasets_all.append(youtube_dataset15)

df_youtube = pd.concat(youtube_datasets_all, ignore_index=<span class="hljs-literal">True</span>)
</code></pre>
<p><strong>Data cleaning:</strong> In the initial phase of cleaning the YouTube dataset, we began by addressing missing values and duplicate rows. Missing values, if left unhandled, can introduce bias and inaccuracies into the analysis, so we examined the dataset for missing information and removed rows where essential data was absent. We also identified and eliminated duplicate rows to ensure the integrity and reliability of the dataset.</p>
<pre><code class="lang-python">df_youtube = df_youtube.dropna()
df_youtube = df_youtube.drop_duplicates()
</code></pre>
<p>After handling missing values and duplicates, we cleaned links and special characters from comments using regular expressions. This step aimed to enhance the quality of textual data by removing noise and irrelevant information, ensuring a clean and focused dataset for analysis.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> re

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">clean_comment</span>(<span class="hljs-params">comment</span>):</span>

    <span class="hljs-comment"># Regular expression pattern to match URLs</span>
    url_pattern = re.compile(<span class="hljs-string">r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&amp;+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'</span>)

    <span class="hljs-comment"># Replace URLs and special characters with an empty string</span>
    cleaned_comment = re.sub(url_pattern, <span class="hljs-string">''</span>, comment)

    <span class="hljs-keyword">return</span> cleaned_comment

<span class="hljs-comment"># Apply cleaning to comments</span>
df_youtube[<span class="hljs-string">'Comment'</span>] = df_youtube[<span class="hljs-string">'Comment'</span>].apply(clean_comment)
</code></pre>
<p>After cleaning links and special characters, we detected the language of comments using the langdetect library. This step allowed us to identify the language of each comment, enhancing our understanding of the dataset's linguistic diversity and enabling more accurate analysis.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> langdetect <span class="hljs-keyword">import</span> detect, DetectorFactory
DetectorFactory.seed = <span class="hljs-number">0</span>

<span class="hljs-comment"># Create a function to detect the language of a comment</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">detect_language</span>(<span class="hljs-params">comment</span>):</span>
    <span class="hljs-keyword">try</span>:
        lang = detect(comment)
        <span class="hljs-keyword">return</span> lang
    <span class="hljs-keyword">except</span>:
        <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

<span class="hljs-comment"># Apply language detection to each comment and create a new column for language</span>
df_youtube[<span class="hljs-string">'Language'</span>] = df_youtube[<span class="hljs-string">'Comment'</span>].apply(detect_language)
</code></pre>
<p>Next, we drop rows with None values in the 'Language' column.</p>
<pre><code class="lang-python">df_youtube.dropna(subset=[<span class="hljs-string">'Language'</span>], inplace=<span class="hljs-literal">True</span>)
</code></pre>
<pre><code class="lang-python"><span class="hljs-comment"># Drop comments in English, French, and Spanish</span>
df_youtube = df_youtube[(df_youtube[<span class="hljs-string">'Language'</span>] != <span class="hljs-string">'en'</span>) &amp; (df_youtube[<span class="hljs-string">'Language'</span>] != <span class="hljs-string">'fr'</span>) &amp; (df_youtube[<span class="hljs-string">'Language'</span>] != <span class="hljs-string">'es'</span>) &amp; (df_youtube[<span class="hljs-string">'Language'</span>] != <span class="hljs-string">'ro'</span>) &amp; (df_youtube[<span class="hljs-string">'Language'</span>] != <span class="hljs-string">'cy'</span>) &amp; (df_youtube[<span class="hljs-string">'Language'</span>] != <span class="hljs-string">'no'</span>) &amp; (df_youtube[<span class="hljs-string">'Language'</span>] != <span class="hljs-string">'de'</span>) &amp; (df_youtube[<span class="hljs-string">'Language'</span>] != <span class="hljs-string">'et'</span>)&amp; (df_youtube[<span class="hljs-string">'Language'</span>] != <span class="hljs-string">'si'</span>)&amp; (df_youtube[<span class="hljs-string">'Language'</span>] != <span class="hljs-string">'af'</span>)&amp; (df_youtube[<span class="hljs-string">'Language'</span>] != <span class="hljs-string">'fi'</span>)&amp; (df_youtube[<span class="hljs-string">'Language'</span>] != <span class="hljs-string">'pt'</span>)&amp; (df_youtube[<span class="hljs-string">'Language'</span>] != <span class="hljs-string">'pl'</span>)&amp; (df_youtube[<span class="hljs-string">'Language'</span>] != <span class="hljs-string">'vi'</span>)&amp; (df_youtube[<span class="hljs-string">'Language'</span>] != <span class="hljs-string">'sv'</span>)&amp; (df_youtube[<span class="hljs-string">'Language'</span>] != <span class="hljs-string">'ca'</span>)&amp; (df_youtube[<span class="hljs-string">'Language'</span>] != <span class="hljs-string">'ti'</span>)&amp; (df_youtube[<span class="hljs-string">'Language'</span>] != <span class="hljs-string">'lv'</span>)&amp; (df_youtube[<span class="hljs-string">'Language'</span>] != <span class="hljs-string">'nl'</span>)&amp; (df_youtube[<span class="hljs-string">'Language'</span>] != <span class="hljs-string">'cs'</span>)&amp; (df_youtube[<span class="hljs-string">'Language'</span>] != <span class="hljs-string">'pt'</span>)&amp; (df_youtube[<span class="hljs-string">'Language'</span>] != <span class="hljs-string">'th'</span>)]
</code></pre>
<p>After detecting the language of the comments, we added a new column called 'source', assigning the value 'YouTube' to each row.</p>
<pre><code class="lang-python">df_youtube[<span class="hljs-string">'source'</span>] = <span class="hljs-string">'Youtube'</span>
</code></pre>
<p>In the final step, we streamlined the dataset by keeping only the essential columns: Comment Date, Comment, Language, and source. This focused approach retains the information pertinent to the analysis while keeping the dataset clear and efficient.</p>
<pre><code class="lang-python">columns_to_drop = [<span class="hljs-string">'Video Title'</span>, <span class="hljs-string">'Channel Name'</span>, <span class="hljs-string">'Likes'</span>, <span class="hljs-string">'Dislikes'</span>, <span class="hljs-string">'Author'</span>, <span class="hljs-string">'Replies'</span>]
df_youtube.drop(columns=columns_to_drop, inplace=<span class="hljs-literal">True</span>)
</code></pre>
<h3 id="heading-facebook-dataset"><strong>Facebook Dataset</strong></h3>
<p><strong>Data structure:</strong> The Facebook dataset comprises several columns, including 'comment_text', 'comment_time', and 'Title'. The 'comment_text' column contains the textual content of comments posted on Facebook, the 'comment_time' column records the date and time of each comment's posting, and the 'Title' column denotes the title or headline of the associated Facebook post. Each row represents one comment, with its text, posting time, and associated title.</p>
<pre><code class="lang-python">facebook_df= pd.read_csv(<span class="hljs-string">"/content/facebook_comments.csv"</span>)
</code></pre>
<p><strong>Data cleaning:</strong> In the cleaning process, we determine the language of each comment using a language classification function. This step categorizes comments in the Facebook dataset as either Arabic or French based on the presence of script-specific characters. It organizes the data by language, facilitating subsequent analysis and processing tailored to each language category.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> re

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">classify_language</span>(<span class="hljs-params">comment</span>):</span>
    <span class="hljs-string">"""
    Classify the language of a comment as 'ar' (Darija-arabic) or 'fr' (darija-French).
    """</span>
    <span class="hljs-comment"># Regular expressions for Arabic and French characters</span>
    arabic_pattern = re.compile(<span class="hljs-string">r'[\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF]+'</span>)  <span class="hljs-comment"># Arabic Unicode range</span>
    english_pattern = re.compile(<span class="hljs-string">r'[\x00-\x7F\x80-\xFF]+'</span>)  <span class="hljs-comment"># English Unicode range</span>

    <span class="hljs-comment"># Check if the comment contains Arabic or French characters</span>
    <span class="hljs-keyword">if</span> arabic_pattern.search(comment):
        <span class="hljs-keyword">return</span> <span class="hljs-string">'ar'</span>
    <span class="hljs-keyword">elif</span> english_pattern.search(comment):
        <span class="hljs-keyword">return</span> <span class="hljs-string">'fr'</span>
    <span class="hljs-keyword">else</span>:
        <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>  <span class="hljs-comment"># Return None if no Arabic or French characters are found</span>

facebook_df[<span class="hljs-string">'Language'</span>] = facebook_df[<span class="hljs-string">'comment_text'</span>].apply(classify_language)
</code></pre>
<p>In the final step of cleaning, we add a new column labeled 'source' with the value 'Facebook'. Additionally, we streamline the dataset by selecting only the essential columns, excluding the 'Title' column. This focused approach ensures that the dataset retains pertinent information for analysis, enhancing clarity and efficiency.</p>
<pre><code class="lang-python">facebook_df[<span class="hljs-string">'source'</span>] = <span class="hljs-string">'Facebook'</span>
facebook_df = facebook_df.drop(columns=[<span class="hljs-string">'Title'</span>])
</code></pre>
<h2 id="heading-nlp-for-data-extraction-and-stop-words-removal">NLP for data extraction and stop words removal</h2>
<p>Machine Learning heavily relies on the quality of the data fed into it, and thus, data preprocessing plays a crucial role in ensuring the accuracy and efficiency of the model.</p>
<p><strong>Text pre-processing</strong> is the process of preparing text data so that machines can use the same to perform tasks like analysis, predictions, etc.</p>
<p>There are many different steps in text pre-processing but in this article, we'll delve into the intricacies of preprocessing Arabic text data for sentiment analysis. From importing datasets to cleaning and preparing text for analysis, we'll explore each step in detail.</p>
<h3 id="heading-importing-datasets"><strong>Importing Datasets:</strong></h3>
<p>To kickstart our preprocessing journey, we begin by importing our datasets. We've curated data from various sources including Facebook, Twitter, Hespress, and YouTube.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> os

<span class="hljs-comment">#import the datasets</span>
folder=<span class="hljs-string">"C:/Users/sejja/Downloads/Compressed/cleaned_data/cleaned_data/"</span>

df_facebook=pd.read_csv(folder+<span class="hljs-string">"facebook_clean.csv"</span>)

df_twitter=pd.read_csv(folder+<span class="hljs-string">"twitter_clean.csv"</span>)
df_hespress=pd.read_csv(folder+<span class="hljs-string">"hespress_clean.csv"</span>)
df_youtube=pd.read_csv(folder+<span class="hljs-string">"youtube_clean.csv"</span>)

df_facebook.columns,df_twitter.columns,df_hespress.columns,df_youtube.columns
</code></pre>
<h3 id="heading-normalisation"><strong>Normalisation:</strong></h3>
<p>Normalisation ensures consistency and coherence in our data. Here, we align column names and select relevant columns from each dataset for further processing.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Rename columns to match</span>
df_twitter = df_twitter.rename(columns={<span class="hljs-string">'Content'</span>: <span class="hljs-string">'Comment'</span>})
df_twitter = df_twitter.rename(columns={<span class="hljs-string">'language'</span>: <span class="hljs-string">'Language'</span>})
df_youtube = df_youtube.rename(columns={<span class="hljs-string">'Comment Date'</span>: <span class="hljs-string">'Date'</span>})

<span class="hljs-comment"># Select only the desired columns</span>
df_facebook = df_facebook[[<span class="hljs-string">'Comment'</span>, <span class="hljs-string">'Date'</span>, <span class="hljs-string">'Language'</span>, <span class="hljs-string">'source'</span>]]
df_twitter = df_twitter[[<span class="hljs-string">'Comment'</span>, <span class="hljs-string">'Date'</span>, <span class="hljs-string">'Language'</span>, <span class="hljs-string">'source'</span>]]
df_hespress = df_hespress[[<span class="hljs-string">'Comment'</span>, <span class="hljs-string">'Date'</span>, <span class="hljs-string">'Language'</span>, <span class="hljs-string">'source'</span>]]
df_youtube = df_youtube[[<span class="hljs-string">'Comment'</span>, <span class="hljs-string">'Date'</span>, <span class="hljs-string">'Language'</span>, <span class="hljs-string">'source'</span>]]


<span class="hljs-comment"># Concatenate the dataframes</span>
df_merge = pd.concat([df_facebook, df_twitter, df_hespress, df_youtube], ignore_index=<span class="hljs-literal">True</span>)
</code></pre>
<h3 id="heading-replacing-null-values"><strong>Replacing Null Values:</strong></h3>
<p>Handling missing values is crucial for robust analysis. Here, we replace null values in the 'Language' column with 'ar' (Arabic) and drop rows with null values in the 'Date' column.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Replace null values in the 'Language' column with 'ar'</span>
df_merge[<span class="hljs-string">'Language'</span>].fillna(<span class="hljs-string">'ar'</span>, inplace=<span class="hljs-literal">True</span>)

<span class="hljs-comment"># Drop rows with null values in the 'Date' column</span>
df_merge.dropna(subset=[<span class="hljs-string">'Date'</span>], inplace=<span class="hljs-literal">True</span>)
</code></pre>
<h3 id="heading-deleting-comments-with-less-than-2-words"><strong>Deleting comments with less than 2 words</strong></h3>
<p>To maintain data quality, we filter out comments with less than two words.</p>
<pre><code class="lang-python">df_merge = df_merge[df_merge[<span class="hljs-string">'Comment'</span>].str.split().str.len() &gt;= <span class="hljs-number">2</span>]
</code></pre>
<h3 id="heading-emoji-replacement"><strong>Emoji replacement</strong></h3>
<p>In our text processing steps, we recognize that emojis are important carriers of feeling and emotion, so we make sure to handle them properly. First, we build a dictionary that maps each emoji to its meaning in words. Then, we define a function that walks through the text and replaces each emoji with its word meaning using that dictionary. This helps us preserve the emotions a comment conveys in a purely textual form.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Create dictionary for emoji replacement</span>
emoji_dict = dict(zip(df_emoji[<span class="hljs-string">'emoji'</span>], df_emoji[<span class="hljs-string">'text'</span>]))

<span class="hljs-comment"># Function to replace emojis with their meanings</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">replace_emoji</span>(<span class="hljs-params">text</span>):</span>
    <span class="hljs-keyword">for</span> emoji, meaning <span class="hljs-keyword">in</span> emoji_dict.items():
        text = text.replace(emoji, meaning)
    <span class="hljs-keyword">return</span> text
</code></pre>
<h3 id="heading-saving-data"><strong>Saving Data:</strong></h3>
<p>We save the preprocessed data for future analysis, including separate datasets for Arabic and French comments, and a merged dataset for comprehensive analysis.</p>
<pre><code class="lang-python">datasetFr = df_merge[df_merge[<span class="hljs-string">'Language'</span>] == <span class="hljs-string">'fr'</span>]
datasetAr = df_merge[df_merge[<span class="hljs-string">'Language'</span>] == <span class="hljs-string">'ar'</span>]
datasetFr.to_csv(<span class="hljs-string">"dataset_fr.csv"</span>)
datasetAr.to_csv(<span class="hljs-string">"dataset_ar.csv"</span>)
df_merge.to_csv(<span class="hljs-string">'merge_data.csv'</span>)
</code></pre>
<h2 id="heading-preprocessing-and-cleaning"><strong>Preprocessing and Cleaning:</strong></h2>
<p>Now comes the heart of our preprocessing pipeline. We'll clean the text data by removing non-Arabic words, tokenizing, removing stopwords, and stemming.</p>
<p>So, let’s get started.</p>
<h3 id="heading-removing-non-arabic-words"><strong>Removing non-arabic words</strong></h3>
<p>To ensure that only Arabic text remains for analysis, we utilize a regular expression pattern that matches any characters outside the Unicode range for Arabic script (<code>\u0600</code> to <code>\u06FF</code>). This pattern effectively identifies and removes any non-Arabic characters, preserving the integrity of the text data for further processing.</p>
<h3 id="heading-removing-punctuation"><strong>Removing punctuation:</strong></h3>
<p>Punctuation marks, while essential for readability in natural language, often add noise to text analysis tasks. By employing another regular expression pattern that targets characters which are neither word characters (<code>\w</code>) nor whitespace (<code>\s</code>), we systematically eliminate all punctuation from the text, streamlining the subsequent tokenization process.</p>
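<p>A minimal sketch of these two patterns applied to a made-up comment (the sample string is purely illustrative):</p>
<pre><code class="lang-python">import re

sample = "رمضان كريم 2024! https://example.com"
# Keep only Arabic script and whitespace; digits, Latin letters, and URLs disappear
arabic_only = re.sub(r'[^\u0600-\u06FF\s]', '', sample)
# Strip anything that is neither a word character nor whitespace (punctuation)
no_punct = re.sub(r'[^\w\s]', '', arabic_only)
print(no_punct.split())  # ['رمضان', 'كريم']
</code></pre>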
<h3 id="heading-what-is-tokenization"><strong>What is tokenization:</strong></h3>
<p>Tokenization is the process of breaking down large blocks of text such as paragraphs and sentences into smaller, more manageable units.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:875/1*e1ofj-4i-e9AxYKVlnkV7Q.png" alt /></p>
<h3 id="heading-what-are-stop-words"><strong>What are stop words?</strong></h3>
<p>The words which are generally filtered out before processing a natural language are called <strong>stop words</strong>. These are the most common words in any language (articles, prepositions, pronouns, conjunctions, etc.) and do not add much information to the text. A few examples of stop words in Arabic are<br />"كل" "لم" "لن" "له" "من" "هو" "هي".</p>
<h3 id="heading-why-do-we-remove-stop-words"><strong>Why do we remove stop words?</strong></h3>
<p>Stop words are abundant in any human language. By removing them, we strip low-level information from the text in order to give more focus to the important information. In other words, removing such words has no negative consequences for the model we train for our task.</p>
<p>Removal of stop words also reduces the dataset size, and thus the training time, due to the smaller number of tokens involved in training.</p>
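<p>As a quick illustration (the full pipeline appears in the functions later in this section), here is stop-word filtering with a tiny sample list:</p>
<pre><code class="lang-python"># Tiny illustrative stop-word set (from the examples above); the actual
# pipeline loads a much fuller Darija/Arabic list into stopwords_darija
stopwords_sample = {"كل", "لم", "لن", "له", "من", "هو", "هي"}

tokens = "رمضان كريم من المغرب".split()
filtered = [w for w in tokens if w not in stopwords_sample]
print(filtered)  # ['رمضان', 'كريم', 'المغرب'] - the stop word "من" was removed
</code></pre>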
<h3 id="heading-what-is-stemming"><strong>What is stemming:</strong></h3>
<p>Stemming is the process of removing prefixes or suffixes from words to obtain their base form, known as the stem. For instance, words like “running,” “runner,” and “runs” share the same root “run.” Stemming helps consolidate words with similar meanings and reduces inflected words to a common form, aiding in tasks like text classification, sentiment analysis, and search engines.</p>
<p><strong>Common Stemming Technique:</strong></p>
<p>Snowball Stemmer: Also known as the Porter2 Stemmer, the Snowball Stemmer is an extension of the Porter algorithm with support for multiple languages. It employs a more systematic approach and can handle stemming tasks in languages beyond English, including French, German, and, in our case, Arabic.</p>
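<p>A minimal sketch of the Snowball Stemmer applied to Arabic, using the same NLTK class as the <code>preprocess_text</code> function below (the sample words are illustrative; the exact stems depend on the Snowball Arabic rules):</p>
<pre><code class="lang-python">from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('arabic')
for word in ['يكتبون', 'كاتب', 'مكتوب']:
    # Each inflected form is reduced toward a common stem
    print(word, stemmer.stem(word))
</code></pre>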
<h3 id="heading-functions-used-to-preprocess-data"><strong>Functions used to preprocess data</strong></h3>
<p>These functions serve as integral components of our data preprocessing pipeline, designed to clean and standardize textual data for subsequent analysis. By combining techniques such as removing non-Arabic words, eliminating punctuation, tokenizing text, removing stopwords, and applying stemming, we ensure that our data is refined and optimized for further analysis.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Function to clean and preprocess text without Stemming</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">clean_text</span>(<span class="hljs-params">text</span>):</span>
    <span class="hljs-comment"># Remove non-Arabic words</span>
    text = re.sub(<span class="hljs-string">r'[^\u0600-\u06FF\s]'</span>, <span class="hljs-string">''</span>, text)
    <span class="hljs-comment"># Remove punctuation</span>
    text = re.sub(<span class="hljs-string">r'[^\w\s]'</span>, <span class="hljs-string">''</span>, text)
    <span class="hljs-comment"># Tokenization</span>
    tokens = text.split()
    <span class="hljs-comment"># Remove stopwords</span>
    tokens = [word <span class="hljs-keyword">for</span> word <span class="hljs-keyword">in</span> tokens <span class="hljs-keyword">if</span> word <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> stopwords_darija]
    <span class="hljs-comment"># Join tokens back into text</span>
    clean_text = <span class="hljs-string">' '</span>.join(tokens)
    <span class="hljs-keyword">return</span> clean_text

<span class="hljs-comment"># with stemming</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">preprocess_text</span>(<span class="hljs-params">text</span>):</span>
    <span class="hljs-comment"># Remove non-Arabic words</span>
    text = re.sub(<span class="hljs-string">r'[^\u0600-\u06FF\s]'</span>, <span class="hljs-string">''</span>, text)
        <span class="hljs-comment"># Remove punctuation</span>
    text = re.sub(<span class="hljs-string">r'[^\w\s]'</span>, <span class="hljs-string">''</span>, text)
    <span class="hljs-comment"># Tokenization</span>
    tokens = word_tokenize(text.lower())
    <span class="hljs-comment"># Remove stopwords</span>
    tokens = [token <span class="hljs-keyword">for</span> token <span class="hljs-keyword">in</span> tokens <span class="hljs-keyword">if</span> token <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> stopwords_darija]
    <span class="hljs-comment"># Stemming - using SnowballStemmer for Arabic languages</span>
    stemmer = SnowballStemmer(<span class="hljs-string">'arabic'</span>)
    tokens = [stemmer.stem(token) <span class="hljs-keyword">for</span> token <span class="hljs-keyword">in</span> tokens]
    <span class="hljs-keyword">return</span> <span class="hljs-string">' '</span>.join(tokens)
</code></pre>
<p>Finally, we concatenate the preprocessed tokens back into a single string, using <code>' '.join(tokens)</code>, and return the resulting cleaned and normalized text from the function. This consolidated representation of the preprocessed text serves as the foundation for subsequent analysis and modeling tasks, enabling researchers and practitioners to derive meaningful insights from Arabic text data with confidence and accuracy.</p>
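<p>As a usage sketch, the cleaned column can then be produced in one line (the column name <code>clean_comment</code> matches what the EDA code below expects; <code>df_merge</code> is the merged dataset from the normalisation step):</p>
<pre><code class="lang-python"># Apply the preprocessing to every comment; 'clean_comment' is the column
# that the frequency plots and topic modeling below read from
df_merge['clean_comment'] = df_merge['Comment'].astype(str).apply(preprocess_text)
</code></pre>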
<ul>
<li>Below is a visual representation of the preprocessing in action, showcasing the functions discussed above. This image offers a glimpse of the transformation of raw textual data into a refined, standardized format, illustrating the effectiveness of the cleaning, tokenization, and stemming steps in preparing the data for analysis.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1713705538370/afc13ae1-12cd-4ad8-b391-b79539ee546e.png?auto=compress,format&amp;format=webp" alt /></p>
<h3 id="heading-in-summary"><strong>In Summary:</strong></h3>
<p><strong>The steps we've taken to prepare our text data are crucial for making sense of it in natural language processing. By cleaning, organizing, and simplifying the text through various techniques, we've made it more understandable and ready for analysis. This process helps us remove unnecessary information, standardize the text, and focus on what truly matters. With our data in better shape, we're now well-equipped to delve deeper into analysis tasks such as understanding sentiment, categorizing text, and finding relevant information. Overall, these preprocessing steps form a solid foundation for extracting valuable insights from text data, enhancing the accuracy and effectiveness of our analysis.</strong></p>
<h1 id="heading-unveiling-insights-exploratory-data-analysis-on-ramadan-comments-in-moroccan-darija">Unveiling Insights: Exploratory Data Analysis on Ramadan Comments in Moroccan Darija</h1>
<p>As the digital sphere continues to evolve, social media platforms serve as a rich source of public opinion and sentiment. In this blog, we delve into the world of Ramadan comments, examining the chatter before and after Iftar across popular platforms like Facebook, Twitter, Hespress, and YouTube. Our focus lies on deciphering patterns, trends, and unique characteristics within the comments, all in the vibrant language of Moroccan Darija.</p>
<h3 id="heading-in-this-article-you-will-learn-how-to"><strong>In this article, you will learn how to:</strong></h3>
<ul>
<li><p>Conduct <strong>word frequency analysis</strong> to identify predominant themes in Ramadan comments.</p>
</li>
<li><p>Explore <strong>temporal trends</strong>, revealing higher activity before Iftar.</p>
</li>
<li><p>Analyze <strong>language distribution</strong>, with Arabic comments prevailing.</p>
</li>
<li><p>Apply <strong>topic modeling</strong> to categorize comments into distinct themes.</p>
</li>
<li><p>Perform <strong>time series</strong> analysis to understand comment dynamics before and after Iftar.</p>
</li>
</ul>
<p>This simplified tutorial will guide you through each phase of EDA applications, with a special focus on interpreting and visualizing the results.</p>
<p><strong><em>You can find the entire code in the notebook below:</em></strong></p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://colab.research.google.com/drive/1flWaq-_qHfJR9L2v3GhvjAbaC1mRmB8N?usp=sharing">https://colab.research.google.com/drive/1flWaq-_qHfJR9L2v3GhvjAbaC1mRmB8N?usp=sharing</a></div>
<hr />
<h1 id="heading-get-ready"><strong><em>Get ready :)</em></strong></h1>
<hr />
<h1 id="heading-introduction-to-eda"><strong>Introduction to EDA</strong></h1>
<p>Exploratory Data Analysis (EDA) serves as the cornerstone of data exploration, offering a systematic approach to uncovering patterns, trends, and insights within datasets. In this section, we delve into the theoretical underpinnings of EDA, laying the groundwork for our journey through the Ramadan comments dataset in Moroccan Darija.</p>
<h2 id="heading-essence-of-eda"><strong>Essence of EDA</strong></h2>
<p>At its core, EDA embodies a philosophy of curiosity and discovery, empowering data practitioners to glean meaningful insights from raw datasets. Unlike formal statistical methods, which often require predefined hypotheses, EDA embraces a more flexible and intuitive approach, allowing analysts to let the data speak for itself.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1713691279453/3eb02fe8-4b09-441a-9dba-d23dd98c0648.jpeg?auto=compress,format&amp;format=webp" alt /></p>
<h2 id="heading-key-principles-of-eda"><strong>Key Principles of EDA:</strong></h2>
<ol>
<li><p><strong>Visualization:</strong> Visualization lies at the heart of EDA, enabling analysts to transform raw data into insightful plots, charts, and graphs. By visually inspecting the distribution, relationships, and anomalies within the data, analysts can uncover hidden patterns and outliers that may elude traditional statistical methods.</p>
</li>
<li><p><strong>Descriptive Statistics:</strong> Descriptive statistics provide a snapshot of the dataset's central tendencies, variability, and distribution. Metrics such as mean, median, standard deviation, and percentiles offer valuable insights into the shape and characteristics of the data, guiding subsequent analysis and interpretation.</p>
</li>
<li><p><strong>Data Cleaning and Preprocessing:</strong> Before embarking on exploratory analysis, it is essential to ensure the cleanliness and integrity of the dataset. Data cleaning involves identifying and addressing missing values, outliers, and inconsistencies that may distort the analysis. Additionally, preprocessing steps such as normalization and transformation may be employed to enhance the quality and interpretability of the data.</p>
</li>
<li><p><strong>Pattern Recognition:</strong> EDA involves the systematic identification of patterns, trends, and relationships within the dataset. By applying statistical techniques such as correlation analysis, clustering, and dimensionality reduction, analysts can uncover meaningful structures and associations that underlie the data.</p>
</li>
<li><p><strong>Iterative Refinement:</strong> EDA is an iterative process, wherein analysts continuously refine their analysis based on emerging insights and hypotheses. By iteratively exploring, visualizing, and interpreting the data, analysts gradually deepen their understanding of the dataset and extract richer insights.</p>
</li>
</ol>
<h1 id="heading-understanding-the-dataset"><strong>Understanding the Dataset</strong></h1>
<p>Before embarking on our exploratory journey, let's grasp the essence of our dataset. It comprises comments captured during Ramadan, spanning the moments preceding and following Iftar, the evening meal that breaks the fast. These comments emanate from diverse sources, reflecting the sentiments, emotions, and discussions prevalent during this sacred time.</p>
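<p>As a quick first look (a sketch, assuming the merged file saved during preprocessing is reloaded as <code>data</code>; the date parsing happens in the time-series step below):</p>
<pre><code class="lang-python">import pandas as pd

# Reload the merged dataset produced during preprocessing
data = pd.read_csv('merge_data.csv')
print(data.shape)
print(data[['Date', 'Comment', 'Language', 'source']].head())
print(data['source'].value_counts())
</code></pre>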
<h1 id="heading-exploring-patterns-and-trends"><strong>Exploring Patterns and Trends</strong></h1>
<h2 id="heading-word-frequency-analysis"><strong>Word Frequency Analysis</strong></h2>
<p>Word Frequency Analysis is a powerful technique used to extract meaningful insights from text data by quantifying the frequency of occurrence of individual words or phrases within a corpus. In this section, we apply Word Frequency Analysis to our Ramadan comments dataset in Moroccan Darija, aiming to uncover the most prevalent terms and themes within the discourse.</p>
<pre><code class="lang-python">
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">display_frequency_plot</span>(<span class="hljs-params">df, column, stopwords</span>):</span>
     reshaped_text = <span class="hljs-string">" "</span>.join(arabic_reshaper.reshape(t) <span class="hljs-keyword">for</span> t <span class="hljs-keyword">in</span> df[column].dropna())

    plt.figure(figsize=(<span class="hljs-number">20</span>, <span class="hljs-number">10</span>))

    counts = Counter(reshaped_text.split())
    counts = {get_display(k): v <span class="hljs-keyword">for</span> k, v <span class="hljs-keyword">in</span> counts.items() <span class="hljs-keyword">if</span> k <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> stopwords}
    counts = dict(sorted(counts.items(), key=<span class="hljs-keyword">lambda</span> x: x[<span class="hljs-number">1</span>], reverse=<span class="hljs-literal">True</span>)[:<span class="hljs-number">20</span>])
    palette = sns.color_palette(<span class="hljs-string">"crest_r"</span>, n_colors=len(counts))
    palette = dict(zip(counts.keys(), palette))
    sns.barplot(y=list(counts.keys()), x=list(counts.values()), palette=palette)

    plt.title(<span class="hljs-string">"Frequency Plot"</span>)
    plt.show()
</code></pre>
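<p>The plot below was produced by calling the function on the cleaned comments (a usage sketch; <code>stopwords_darija</code> is the stop-word list from the preprocessing step):</p>
<pre><code class="lang-python"># Plot the 20 most frequent words in the cleaned comments
display_frequency_plot(data, 'clean_comment', stopwords_darija)
</code></pre>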
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1713692105705/fbb9660a-48de-4992-a0e1-75ae05a4ed89.jpeg?auto=compress,format&amp;format=webp" alt /></p>
<p><strong><em>Interpretation of Results:</em></strong></p>
<p>The Word Frequency Analysis reveals that the most frequent words in our dataset are predominantly religious, which can be attributed to the sacred nature of Ramadan. Terms like "الله" (Allah), "رمضان" (Ramadan), and "اللهم" (O Allah) dominate the discourse, reflecting the deep significance of faith and spirituality during this holy month. This observation underscores the cultural and societal norms surrounding Ramadan, where discussions often revolve around religious observance and spiritual reflection, shaping the digital discourse in meaningful ways.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1713692515415/940644f0-9293-457b-b28c-448fcf2cbb93.jpeg?auto=compress,format&amp;format=webp" alt /></p>
<h2 id="heading-temporal-trends"><strong>Temporal Trends</strong></h2>
<p>In the realm of digital communication, understanding temporal patterns is crucial for deciphering trends and behaviors within online communities. In the context of Ramadan, a period marked by spiritual reflection, communal gatherings, and fasting, the temporal dynamics of online engagement hold particular significance.</p>
<p>Let's first analyse the temporal distribution of comments, which reveals interesting patterns, particularly in terms of peak activity during specific hours of the day.</p>
<pre><code class="lang-python">data[<span class="hljs-string">'Hour'</span>] = data[<span class="hljs-string">'Date'</span>].dt.hour

commentaires_par_heure = data.groupby(<span class="hljs-string">'Hour'</span>).size()

plt.figure(figsize=(<span class="hljs-number">10</span>, <span class="hljs-number">6</span>))
commentaires_par_heure.plot(kind=<span class="hljs-string">'bar'</span>, color=<span class="hljs-string">'blue'</span>)
plt.title(<span class="hljs-string">'Nombre de commentaires par heure de la journée'</span>)
plt.xlabel(<span class="hljs-string">'Heure de la journée'</span>)
plt.ylabel(<span class="hljs-string">'Nombre de commentaires'</span>)
plt.xticks(rotation=<span class="hljs-number">0</span>)
plt.grid(axis=<span class="hljs-string">'y'</span>, linestyle=<span class="hljs-string">'--'</span>, alpha=<span class="hljs-number">0.7</span>)
plt.tight_layout()
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1713693237204/bfa3a875-7adb-4b8f-b4ea-53bfcffdad31.jpeg?auto=compress,format&amp;format=webp" alt /></p>
<p>These findings highlight distinct peaks in commenting activity throughout the day, with the afternoon hours exhibiting the highest levels of engagement. Hour 15:00 emerges as the period of greatest activity, suggesting a concentration of discussions and interactions during this time frame. Conversely, late evening hours, such as 22:00 and 21:00, also witness significant participation, albeit to a slightly lesser extent.</p>
<h3 id="heading-time-series-analysis"><strong>Time Series Analysis</strong></h3>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns

data[<span class="hljs-string">'Date'</span>] = pd.to_datetime(data[<span class="hljs-string">'Date'</span>], format=<span class="hljs-string">'%Y-%m-%d %H:%M:%S'</span>, errors=<span class="hljs-string">'coerce'</span>)
data.dropna(subset=[<span class="hljs-string">'Date'</span>], inplace=<span class="hljs-literal">True</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">time_series_analysis_by_source</span>():</span>
    fig, axes = plt.subplots(len(data[<span class="hljs-string">'source'</span>].unique()), <span class="hljs-number">2</span>, figsize=(<span class="hljs-number">15</span>, <span class="hljs-number">6</span> * len(data[<span class="hljs-string">'source'</span>].unique())))

    <span class="hljs-keyword">for</span> i, source <span class="hljs-keyword">in</span> enumerate(data[<span class="hljs-string">'source'</span>].unique()):
        source_data = data[data['source'] == source].copy()  # .copy() avoids SettingWithCopyWarning
        source_data['Part_of_Day'] = pd.cut(source_data['Date'].dt.hour, bins=[0, 6, 12, 18, 24],
                                            labels=['Night', 'Morning', 'Afternoon', 'Evening'])
        sns.countplot(x=<span class="hljs-string">'Part_of_Day'</span>, data=source_data, ax=axes[i][<span class="hljs-number">0</span>])
        axes[i][<span class="hljs-number">0</span>].set_title(<span class="hljs-string">f'Distribution of Comments by Part of the Day for <span class="hljs-subst">{source}</span>'</span>)
        axes[i][<span class="hljs-number">0</span>].set_xlabel(<span class="hljs-string">'Part of the Day'</span>)
        axes[i][<span class="hljs-number">0</span>].set_ylabel(<span class="hljs-string">'Number of Comments'</span>)

        # Count this source's comments before and after Iftar (taken here as 18:30)
        before_iftar = source_data[(source_data['Date'].dt.hour &lt; 18) | ((source_data['Date'].dt.hour == 18) &amp; (source_data['Date'].dt.minute &lt; 30))].shape[0]
        after_iftar = source_data[(source_data['Date'].dt.hour &gt; 18) | ((source_data['Date'].dt.hour == 18) &amp; (source_data['Date'].dt.minute &gt;= 30))].shape[0]

        sns.barplot(x=[<span class="hljs-string">'Before Iftar'</span>, <span class="hljs-string">'After Iftar'</span>], y=[before_iftar, after_iftar], ax=axes[i][<span class="hljs-number">1</span>])
        axes[i][<span class="hljs-number">1</span>].set_title(<span class="hljs-string">f'Total Number of Comments before and after Iftar for <span class="hljs-subst">{source}</span>'</span>)
        axes[i][<span class="hljs-number">1</span>].set_xlabel(<span class="hljs-string">'Part of the Day'</span>)
        axes[i][<span class="hljs-number">1</span>].set_ylabel(<span class="hljs-string">'Total Number of Comments'</span>)

    plt.tight_layout()
    plt.show()

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">time_series_analysis_by_topic</span>():</span>
    fig, axes = plt.subplots(len(data[<span class="hljs-string">'topic'</span>].unique()), <span class="hljs-number">2</span>, figsize=(<span class="hljs-number">15</span>, <span class="hljs-number">6</span> * len(data[<span class="hljs-string">'topic'</span>].unique())))

    <span class="hljs-keyword">for</span> i, topic <span class="hljs-keyword">in</span> enumerate(data[<span class="hljs-string">'topic'</span>].unique()):
        topic_data = data[data['topic'] == topic].copy()  # .copy() avoids SettingWithCopyWarning
        topic_data['Part_of_Day'] = pd.cut(topic_data['Date'].dt.hour, bins=[0, 6, 12, 18, 24],
                                            labels=['Night', 'Morning', 'Afternoon', 'Evening'])
        sns.countplot(x=<span class="hljs-string">'Part_of_Day'</span>, data=topic_data, ax=axes[i][<span class="hljs-number">0</span>])
        axes[i][<span class="hljs-number">0</span>].set_title(<span class="hljs-string">f'Distribution of Comments by Part of the Day for Topic: <span class="hljs-subst">{topic}</span>'</span>)
        axes[i][<span class="hljs-number">0</span>].set_xlabel(<span class="hljs-string">'Part of the Day'</span>)
        axes[i][<span class="hljs-number">0</span>].set_ylabel(<span class="hljs-string">'Number of Comments'</span>)

        # Count this topic's comments before and after Iftar (taken here as 18:30)
        before_iftar = topic_data[(topic_data['Date'].dt.hour &lt; 18) | ((topic_data['Date'].dt.hour == 18) &amp; (topic_data['Date'].dt.minute &lt; 30))].shape[0]
        after_iftar = topic_data[(topic_data['Date'].dt.hour &gt; 18) | ((topic_data['Date'].dt.hour == 18) &amp; (topic_data['Date'].dt.minute &gt;= 30))].shape[0]

        sns.barplot(x=[<span class="hljs-string">'Before Iftar'</span>, <span class="hljs-string">'After Iftar'</span>], y=[before_iftar, after_iftar], ax=axes[i][<span class="hljs-number">1</span>])
        axes[i][<span class="hljs-number">1</span>].set_title(<span class="hljs-string">f'Total Number of Comments before and after Iftar for Topic: <span class="hljs-subst">{topic}</span>'</span>)
        axes[i][<span class="hljs-number">1</span>].set_xlabel(<span class="hljs-string">'Part of the Day'</span>)
        axes[i][<span class="hljs-number">1</span>].set_ylabel(<span class="hljs-string">'Total Number of Comments'</span>)

    plt.tight_layout()
    plt.show()
</code></pre>
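<p>Calling the two functions renders the per-source and per-topic figures discussed below (the second call assumes the <code>topic</code> column constructed in the Topic Modeling section):</p>
<pre><code class="lang-python">time_series_analysis_by_source()
time_series_analysis_by_topic()  # requires data['topic'] (see Topic Modeling below)
</code></pre>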
<p><strong>Increased Pre-Iftar Engagement:</strong> Before Iftar, there is a notable uptick in comment activity, indicating heightened engagement on social media platforms. This surge in activity can be attributed to various factors:</p>
<ul>
<li><p><strong>Anticipation and Preparation:</strong> Individuals actively discuss meal preparations, share recipes, and express excitement for the impending breaking of the fast. Additionally, conversations about Ramadan traditions and cultural practices contribute to the vibrant pre-Iftar discourse.</p>
</li>
<li><p><strong>Real-Time Sharing:</strong> The period preceding Iftar witnesses a surge in real-time sharing of fasting experiences and spiritual reflections. Users turn to social media to express personal anecdotes, seek communal support, and engage in collective reflection on the significance of Ramadan.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1713696947269/73257a80-4f28-474e-9092-ab555861d89c.jpeg?auto=compress,format&amp;format=webp" alt /></p>
<p>Conversely, there is a discernible decrease in comment activity after Iftar. Several factors may underlie this decline:</p>
<p><strong>Post-Iftar Social Engagement:</strong> Following the breaking of the fast, individuals prioritize spending time with family and friends, participating in communal prayers, and enjoying post-Iftar meals and social gatherings. Consequently, there is a natural diversion of attention away from social media platforms, leading to reduced comment volumes.</p>
<p><strong>Dynamic Nature of Online Behavior:</strong> The fluctuation in online engagement before and after Iftar underscores the dynamic interplay between cultural traditions, religious observance, and digital interaction during Ramadan. These patterns highlight the evolving nature of user behavior and preferences within the context of this sacred month.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1713696996533/5791ff6c-7cda-47f8-8c8b-1e246de16856.jpeg?auto=compress,format&amp;format=webp" alt /></p>
<h2 id="heading-language-distribution"><strong>Language Distribution</strong></h2>
<p>Examining the distribution of languages within the dataset reveals a notable disparity between Arabic and French comments, with Arabic comments significantly outnumbering French comments. This observation underscores the predominance of Arabic as the primary language of communication and expression within the context of Ramadan discussions.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1713693804910/8abcfd8a-8732-4eac-bd5e-ae0b3128ba7a.jpeg?auto=compress,format&amp;format=webp" alt /></p>
<p>The overwhelming presence of Arabic comments reflects the cultural and linguistic dynamics inherent in conversations surrounding Ramadan, particularly in regions where Arabic is the predominant language of communication. Arabic serves as the medium through which individuals articulate their thoughts, sentiments, and reflections on the significance of Ramadan, fostering a sense of shared cultural identity and belonging among participants.</p>
<p>While French comments may represent a minority within the dataset, their presence underscores the linguistic diversity and multiculturalism inherent in online discourse surrounding Ramadan. These comments may originate from individuals with varying linguistic backgrounds, contributing to the richness and diversity of perspectives within the digital conversation.</p>
<h1 id="heading-topic-modeling"><strong>Topic Modeling</strong></h1>
<p>Topic modeling is a powerful method for uncovering recurring themes and patterns within textual data. In our analysis of Ramadan comments, we use topic modeling to reveal the underlying topics prevalent in the discourse. By identifying these themes, we gain valuable insights into the diverse narratives and interests shaping conversations during this sacred month.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.feature_extraction.text <span class="hljs-keyword">import</span> CountVectorizer
<span class="hljs-keyword">from</span> sklearn.decomposition <span class="hljs-keyword">import</span> LatentDirichletAllocation

<span class="hljs-comment"># Convert comments to a list</span>
comments_list = data[<span class="hljs-string">'clean_comment'</span>].tolist()

<span class="hljs-comment"># Create a CountVectorizer to convert comments into a matrix of token counts</span>
vectorizer = CountVectorizer(max_df=<span class="hljs-number">0.95</span>, min_df=<span class="hljs-number">2</span>, stop_words=<span class="hljs-literal">None</span>)
X = vectorizer.fit_transform(comments_list)

<span class="hljs-comment"># Specify the number of topics</span>
n_topics = <span class="hljs-number">3</span>

<span class="hljs-comment"># Run LDA</span>
lda_model = LatentDirichletAllocation(n_components=n_topics, max_iter=<span class="hljs-number">10</span>, learning_method=<span class="hljs-string">'online'</span>, random_state=<span class="hljs-number">42</span>)
lda_Z = lda_model.fit_transform(X)

<span class="hljs-comment"># Get the feature names (words)</span>
feature_names = vectorizer.get_feature_names_out()

<span class="hljs-comment"># Create a dictionary to store the top words for each topic</span>
topics_dict = {}
n_top_words = <span class="hljs-number">20</span>
<span class="hljs-keyword">for</span> topic_idx, topic <span class="hljs-keyword">in</span> enumerate(lda_model.components_):
    topic_name = <span class="hljs-string">f"Topic_<span class="hljs-subst">{topic_idx}</span>"</span>
    topic_words = [feature_names[i] <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> topic.argsort()[:-n_top_words - <span class="hljs-number">1</span>:<span class="hljs-number">-1</span>]]
    topics_dict[topic_name] = topic_words

<span class="hljs-comment"># Print the topic names and their respective top words</span>
<span class="hljs-keyword">for</span> topic_name, topic_words <span class="hljs-keyword">in</span> topics_dict.items():
    print(<span class="hljs-string">f"<span class="hljs-subst">{topic_name}</span>: <span class="hljs-subst">{<span class="hljs-string">', '</span>.join(topic_words)}</span>"</span>)
</code></pre>
<p>Utilizing topic modeling techniques, we have identified three distinct topics within the dataset:</p>
<p>Topic_0: This topic revolves around religious themes, with keywords such as "الله" (Allah), "رمضان" (Ramadan), and "اللهم" (O Allah) indicating discussions centered on faith, supplication, and spiritual reflection. Other terms like "فلسطين" (Palestine) suggest engagement with social and humanitarian issues within a religious context.</p>
<p>Topic_1: The keywords in this topic pertain to entertainment and media content, with terms like "مسلسل" (series), "فيلم" (film), and "حلقة" (episode) indicating discussions related to television shows, movies, and popular culture. This topic reflects a divergence from religious discourse, focusing instead on leisure and entertainment activities.</p>
<p>Topic_2: This topic encompasses a mix of religious and cultural references, with keywords such as "المغرب" (Morocco), "محمد" (Mohammed), and "صلى" (prayer) suggesting discussions related to Islamic teachings, Moroccan culture, and expressions of gratitude and blessings.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1713694381031/da32530a-f496-4cad-865a-34b2c662ab52.jpeg?auto=compress,format&amp;format=webp" alt /></p>
<h3 id="heading-extracting-top-bigrams-for-each-topic-using-countvectorizer"><strong>Extracting Top Bigrams for Each Topic Using CountVectorizer</strong></h3>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.feature_extraction.text <span class="hljs-keyword">import</span> CountVectorizer

<span class="hljs-comment"># Define a function to extract top bigrams for each topic</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">extract_top_bigrams</span>(<span class="hljs-params">topic</span>):</span>
    <span class="hljs-comment"># Filter comments based on the topic</span>
    topic_comments = data[data[<span class="hljs-string">'topic'</span>] == topic][<span class="hljs-string">'clean_comment'</span>].tolist()

    <span class="hljs-comment"># Create a CountVectorizer to extract bigrams</span>
    vectorizer = CountVectorizer(ngram_range=(<span class="hljs-number">2</span>, <span class="hljs-number">5</span>), max_features=<span class="hljs-number">10000</span>)
    X = vectorizer.fit_transform(topic_comments)

    <span class="hljs-comment"># Get feature names (bigrams)</span>
    feature_names = vectorizer.get_feature_names_out()

    <span class="hljs-comment"># Get counts of each bigram</span>
    bigram_counts = X.sum(axis=<span class="hljs-number">0</span>)

    <span class="hljs-comment"># Create a dictionary to store bigrams and their counts</span>
    bigram_dict = {bigram: count <span class="hljs-keyword">for</span> bigram, count <span class="hljs-keyword">in</span> zip(feature_names, bigram_counts.A1)}

    <span class="hljs-comment"># Get the top 20 most common bigrams</span>
    top_20_bigrams = sorted(bigram_dict.items(), key=<span class="hljs-keyword">lambda</span> x: x[<span class="hljs-number">1</span>], reverse=<span class="hljs-literal">True</span>)[:<span class="hljs-number">20</span>]

    <span class="hljs-comment"># Print header</span>
    print(<span class="hljs-string">f"<span class="hljs-subst">{<span class="hljs-string">'='</span>*<span class="hljs-number">40</span>}</span>\nTop 20 most common bigrams for <span class="hljs-subst">{topic}</span>:\n<span class="hljs-subst">{<span class="hljs-string">'='</span>*<span class="hljs-number">40</span>}</span>"</span>)

    <span class="hljs-comment"># Print the top 20 most common bigrams</span>
    <span class="hljs-keyword">for</span> i, (bigram, count) <span class="hljs-keyword">in</span> enumerate(top_20_bigrams, <span class="hljs-number">1</span>):
        print(<span class="hljs-string">f"<span class="hljs-subst">{i}</span>. <span class="hljs-subst">{bigram}</span>: <span class="hljs-subst">{count}</span> occurrences"</span>)

<span class="hljs-comment"># Iterate over topics and extract top bigrams for each topic</span>
topics = data[<span class="hljs-string">'topic'</span>].unique()
<span class="hljs-keyword">for</span> topic <span class="hljs-keyword">in</span> topics:
    extract_top_bigrams(topic)
</code></pre>
<p>The bigrams extracted for the Entertainment topic predominantly consist of phrases related to television shows, movies, and online content. Phrases like "احسن مسلسل" (best series), "مسلسل زوين" (nice series), and "دنيا بوطازوت" (Dounia Boutazout, a Moroccan actress) indicate discussions about specific TV series and personalities. Additionally, expressions such as "شي لايكات" (some likes) and "لايكات نحسو" (likes we count) suggest engagement with social media metrics and audience interaction. The prevalence of laughter-related phrases like "ههههههههههه" (hahaha) underscores the informal and lighthearted nature of the discussions within this topic.</p>
<p>In contrast, the bigrams extracted for the Religion and Ramadan topic are predominantly religious expressions and blessings associated with Ramadan. Phrases like "تبارك الله" (blessings of Allah), "رمضان كريم" (blessed Ramadan), and "اللهم صل" (O Allah, bless) reflect the reverence and piety inherent in Ramadan-related discourse. Expressions such as "جزاكم الله" (may Allah reward you) and "ماشاء الله" (as Allah wills) convey expressions of gratitude and acknowledgment of divine blessings. The repetition of phrases like "الله الله الله" (Allah Allah Allah) emphasizes the centrality of God in the conversations, reflecting a deep spiritual connection and devotion among participants.</p>
<h3 id="heading-comment-length-analysis"><strong>Comment Length Analysis</strong></h3>
<p>In exploring the dynamics of online discourse, descriptive statistics offer valuable insights into the characteristics and patterns of communication within different thematic categories. In this context, we analyze descriptive statistics for comments categorized under two distinct topics: Entertainment and Religion/Ramadan. By examining mean comment length, mean word count, and mean number of characters, we gain a deeper understanding of the communication styles and content preferences within each topic.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">length_analysis</span>(<span class="hljs-params">topic, col</span>):</span>
    topic_comments = data[data[col] == topic][<span class="hljs-string">'clean_comment'</span>]
    comment_lengths = topic_comments.apply(<span class="hljs-keyword">lambda</span> x: len(x))
    word_counts = topic_comments.apply(<span class="hljs-keyword">lambda</span> x: len(x.split()))
    mean_length = comment_lengths.mean()
    mean_word_count = word_counts.mean()
    mean_characters = topic_comments.apply(len).mean()  <span class="hljs-comment"># identical to mean_length: comment length is measured in characters</span>

    <span class="hljs-comment"># Print descriptive statistics</span>
    print(<span class="hljs-string">f"\n<span class="hljs-subst">{<span class="hljs-string">'='</span>*<span class="hljs-number">40</span>}</span>\nDescriptive Statistics for <span class="hljs-subst">{topic}</span>:\n<span class="hljs-subst">{<span class="hljs-string">'='</span>*<span class="hljs-number">40</span>}</span>"</span>)
    print(<span class="hljs-string">f"Mean Comment Length: <span class="hljs-subst">{mean_length:<span class="hljs-number">.2</span>f}</span> characters"</span>)
    print(<span class="hljs-string">f"Mean Word Count: <span class="hljs-subst">{mean_word_count:<span class="hljs-number">.2</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Mean Number of Characters: <span class="hljs-subst">{mean_characters:<span class="hljs-number">.2</span>f}</span> characters"</span>)

    <span class="hljs-comment"># Create visualization</span>
    fig, axes = plt.subplots(<span class="hljs-number">1</span>, <span class="hljs-number">3</span>, figsize=(<span class="hljs-number">15</span>, <span class="hljs-number">5</span>))

    sns.histplot(comment_lengths, kde=<span class="hljs-literal">True</span>, ax=axes[<span class="hljs-number">0</span>])
    axes[<span class="hljs-number">0</span>].set_title(<span class="hljs-string">'Comment Length Distribution'</span>)
    axes[<span class="hljs-number">0</span>].set_xlabel(<span class="hljs-string">'Comment Length'</span>)
    axes[<span class="hljs-number">0</span>].set_ylabel(<span class="hljs-string">'Frequency'</span>)

    sns.histplot(word_counts, kde=<span class="hljs-literal">True</span>, ax=axes[<span class="hljs-number">1</span>])
    axes[<span class="hljs-number">1</span>].set_title(<span class="hljs-string">'Word Count Distribution'</span>)
    axes[<span class="hljs-number">1</span>].set_xlabel(<span class="hljs-string">'Word Count'</span>)
    axes[<span class="hljs-number">1</span>].set_ylabel(<span class="hljs-string">'Frequency'</span>)

    sns.histplot(topic_comments.apply(len), kde=<span class="hljs-literal">True</span>, ax=axes[<span class="hljs-number">2</span>])
    axes[<span class="hljs-number">2</span>].set_title(<span class="hljs-string">'Character Count Distribution'</span>)
    axes[<span class="hljs-number">2</span>].set_xlabel(<span class="hljs-string">'Number of Characters'</span>)
    axes[<span class="hljs-number">2</span>].set_ylabel(<span class="hljs-string">'Frequency'</span>)

    plt.tight_layout()
    plt.show()

<span class="hljs-comment"># Iterate over topics and perform length analysis for each topic</span>
topics = data[<span class="hljs-string">'topic'</span>].unique()
<span class="hljs-keyword">for</span> topic <span class="hljs-keyword">in</span> topics:
    length_analysis(topic, <span class="hljs-string">'topic'</span>)
</code></pre>
<p>For comments categorized under the Entertainment topic, the descriptive statistics show a mean comment length of 44.70 characters and a mean word count of approximately 7.72 words. Note that because comment length is measured in characters here, the "mean number of characters" is the same figure as the mean comment length. These short comments suggest that discussions related to entertainment tend to be succinct and to the point, reflecting the informal and casual nature of the conversations.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1713695455306/9f1300c2-ab36-4206-a500-7c0ca016e488.jpeg?auto=compress,format&amp;format=webp" alt /></p>
<p>In contrast, comments categorized under the Religion and Ramadan topic exhibit a longer mean comment length of 68.84 characters. On average, each comment in this category comprises approximately 11.53 words. The higher mean comment length and word count suggest a more elaborate and detailed style of expression within discussions related to religion and Ramadan. Participants in these discussions may engage in more in-depth reflections, prayers, and expressions of devotion, contributing to the longer and more nuanced comments observed in this category.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1713695473528/9872b96d-6842-4595-9e68-cf86a9678ad8.jpeg?auto=compress,format&amp;format=webp" alt /></p>
<h2 id="heading-analyzing-comment-length-and-word-count-by-source-platform"><strong>Analyzing Comment Length and Word Count by Source Platform</strong></h2>
<ul>
<li><p><strong>Facebook</strong></p>
<p>  Comments originating from Facebook have a mean length of 59.61 characters and a mean word count of 9.77. This suggests a moderately concise communication style, with comments typically comprising short sentences or phrases. The relatively lower mean comment length compared to other platforms may indicate a tendency for more succinct interactions on Facebook.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1713696194364/1bc5d5b0-4419-45df-b545-9bc0f06bcaed.jpeg?auto=compress,format&amp;format=webp" alt /></p>
</li>
<li><p><strong>Twitter</strong></p>
<p>  Twitter comments exhibit a slightly longer mean length of 63.06 characters and a mean word count of 10.52. While still concise, Twitter users appear to engage in slightly more extensive communication compared to Facebook. This may be attributed to Twitter's character limit and the need for users to convey their message within a constrained space.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1713696217059/5fa19489-9e46-4c09-89ce-d12c11c78edb.jpeg?auto=compress,format&amp;format=webp" alt /></p>
</li>
<li><p><strong>Hespress</strong></p>
<p>  In contrast, comments from Hespress display a significantly longer mean length of 156.21 characters and a mean word count of 24.26. These statistics indicate a more verbose and detailed communication style on Hespress, with users often expressing complex thoughts or opinions in their comments. The higher mean comment length suggests that discussions on Hespress may be more in-depth and comprehensive compared to other platforms.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1713696232293/e05a8059-0471-4580-964e-ea296cac6eed.jpeg?auto=compress,format&amp;format=webp" alt /></p>
</li>
<li><p><strong>YouTube</strong></p>
<p>  Comments on YouTube have a mean length of 54.66 characters and a mean word count of 9.32. Similar to Facebook, YouTube comments tend to be relatively concise, with users typically conveying their thoughts or reactions in short messages. The lower mean comment length may reflect the nature of interactions on YouTube, where comments are often brief and focused on immediate reactions to the content.</p>
</li>
</ul>
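<p>These per-platform figures can be reproduced with a simple pandas groupby. The sketch below assumes the dataframe carries a <code>source</code> column naming the platform (an assumption on our part) and reuses the cleaned comments:</p>
<pre><code class="lang-python"># Assumes a 'source' column with values like 'Facebook', 'Twitter', 'Hespress', 'YouTube'
platform_stats = (
    data.assign(
        comment_length=data['clean_comment'].str.len(),
        word_count=data['clean_comment'].str.split().str.len(),
    )
    .groupby('source')[['comment_length', 'word_count']]
    .mean()
    .round(2)
)
print(platform_stats)
</code></pre>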
<p><strong>In conclusion, this comprehensive exploration of comment data during Ramadan sheds light on various aspects of online discourse during this sacred month. Through techniques such as word frequency analysis, temporal trend exploration, language distribution analysis, topic modeling, and time series analysis, we've gained valuable insights into the patterns, themes, and dynamics of comments before and after Iftar. This analysis not only enhances our understanding of online engagement during Ramadan but also underscores the importance of Exploratory Data Analysis (EDA) in uncovering meaningful insights from complex datasets. As we continue to delve deeper into data analysis, it's crucial to leverage these techniques to derive actionable insights that can inform decision-making and deepen our understanding of societal trends and behaviors.</strong></p>
<h2 id="heading-machine-learning-and-deep-learning-models"><strong>Machine Learning and Deep Learning models</strong></h2>
<p>In our project, we harness the power of machine learning (ML) and deep learning (DL) to unlock valuable insights from Moroccan Darija comments and tweets. Our approach involves leveraging a variety of ML and DL models to tackle different aspects of the data analysis process.</p>
<p><strong>Clustering Models for Topic Segmentation</strong></p>
<p>To uncover underlying themes and topics within the dataset, we utilize clustering models. These models group similar comments and tweets together based on their content, allowing us to identify distinct topics of discussion within the Moroccan online community. By applying clustering techniques, we gain a deeper understanding of the diverse range of subjects that are relevant to Moroccan users.</p>
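<p>As an illustration of this step, and a minimal sketch rather than the project's exact pipeline, comments can be embedded as TF-IDF vectors and grouped with KMeans:</p>
<pre><code class="lang-python">from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# TF-IDF turns each comment into a sparse numeric vector
tfidf = TfidfVectorizer(max_df=0.95, min_df=2)
X_tfidf = tfidf.fit_transform(data['clean_comment'])

# Group comments into 3 clusters, mirroring the 3 LDA topics
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
data['cluster'] = kmeans.fit_predict(X_tfidf)

print(data['cluster'].value_counts())
</code></pre>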
<h3 id="heading-next-word-prediction-model"><strong>Next Word Prediction Model</strong></h3>
<p>Another crucial aspect of our project is the development of a next word prediction model. This model predicts the most probable word to follow a given sequence of words, taking into account the context of the conversation. By accurately predicting the next word, we enhance the coherence and fluency of generated text, improving the overall quality of our analysis and insights.</p>
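<p>To make the idea concrete, here is a deliberately simple sketch of next-word prediction based on bigram frequencies. The actual model linked below is far more sophisticated, but the principle is the same:</p>
<pre><code class="lang-python">from collections import defaultdict, Counter

# Build bigram counts: for each word, count which words follow it
next_word_counts = defaultdict(Counter)
for comment in data['clean_comment']:
    words = comment.split()
    for current_word, following_word in zip(words, words[1:]):
        next_word_counts[current_word][following_word] += 1

def predict_next_word(word, k=3):
    """Return the k words most frequently observed after `word`."""
    return [w for w, _ in next_word_counts[word].most_common(k)]

# Example: most likely words to follow "رمضان"
print(predict_next_word("رمضان"))
</code></pre>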
<p><strong>Models Available on Hugging Face via the Link Below</strong></p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://huggingface.co/spaces/Soufianesejjari/MDS_RamadanWordPrediction">https://huggingface.co/spaces/Soufianesejjari/MDS_RamadanWordPrediction</a></div>
<p> </p>
<h3 id="heading-darijabert-sentiment-prediction-model"><strong>DARIJABERT: Sentiment Prediction Model</strong></h3>
<p>One of the highlights of our project is the implementation of DARIJABERT, a specialized sentiment prediction model trained on Moroccan Darija data. Inspired by state-of-the-art language models like BERT, DARIJABERT excels in understanding and predicting sentiment in Moroccan Darija comments and tweets. By leveraging deep learning techniques, DARIJABERT enables us to accurately assess the sentiment expressed in online discussions, providing valuable insights into the attitudes and opinions of Moroccan users.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://colab.research.google.com/drive/149c_0NQJcLtKdVDZsZ2UiFidMwnpSPB6?usp=sharing">https://colab.research.google.com/drive/149c_0NQJcLtKdVDZsZ2UiFidMwnpSPB6?usp=sharing</a></div>
<p> </p>
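<p>For readers who want to experiment, inference with a model of this kind might look like the sketch below, using the Hugging Face <code>transformers</code> pipeline. The model identifier is a placeholder, not the project's actual checkpoint:</p>
<pre><code class="lang-python">from transformers import pipeline

# Placeholder model id: substitute the fine-tuned DarijaBERT sentiment checkpoint
sentiment = pipeline("sentiment-analysis", model="your-org/darijabert-sentiment")

comments = ["رمضان كريم", "ما عجبنيش هاد المسلسل"]
for comment in comments:
    result = sentiment(comment)[0]
    print(f"{comment} -> {result['label']} ({result['score']:.2f})")
</code></pre>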
<p><strong>Conclusion</strong></p>
<p>Through the application of ML and DL models, we are able to delve deep into the world of Moroccan online discourse, uncovering hidden patterns, sentiments, and topics of interest. These advanced techniques empower us to extract meaningful insights from vast amounts of data, ultimately enhancing our understanding of Moroccan culture, society, and online behavior.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>As we conclude this chapter, we are reminded of the transformative potential of collaborative endeavors like those championed by the <a target="_blank" href="https://moroccands.com/">MDS</a> community. By harnessing the power of data and technology, we not only gain insights into societal trends and cultural expressions but also pave the way for informed decision-making, community engagement, and positive social change.</p>
<p>As we look towards the future, let us continue to embrace the spirit of collaboration, curiosity, and inclusivity that defines the <a target="_blank" href="https://moroccands.com/">MDS</a> community. Together, we will continue to explore, innovate, and inspire, shaping a brighter tomorrow through the lens of data science and community-driven initiatives.</p>
<p><strong>Explore a preview of our project:</strong> <a target="_blank" href="https://moroccansentimentsanalysis.netlify.app/">https://moroccansentimentsanalysis.netlify.app/</a></p>
]]></content:encoded></item><item><title><![CDATA[Moroccan News Aggregator:]]></title><description><![CDATA[Introduction
This article is your gateway to understanding and utilizing news data effectively. Here, we will walk you through a straightforward yet effective approach to gather, organize, analyze, and use data from Moroccan news websites. We will us...]]></description><link>https://blog.moroccands.com/moroccan-news-aggregator</link><guid isPermaLink="true">https://blog.moroccands.com/moroccan-news-aggregator</guid><category><![CDATA[moroccan news]]></category><category><![CDATA[web scraping]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[streamlit]]></category><dc:creator><![CDATA[Abril Sanaa]]></dc:creator><pubDate>Tue, 05 Mar 2024 03:46:49 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1709541854890/0640a090-9bf4-4e8c-a19b-4014c0deb9a0.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-introduction">Introduction</h3>
<p>This article is your gateway to understanding and utilizing news data effectively. Here, we walk you through a straightforward yet effective approach to gathering, organizing, analyzing, and using data from Moroccan news websites. We will use easy-to-understand methods involving popular Python tools like Streamlit, Pandas, and Selenium, along with the Google Drive API (via PyDrive) and python-dotenv. Our aim is to make the journey from collecting data to applying it as smooth as possible for you.</p>
<h3 id="heading-understanding-data-scraping">Understanding Data Scraping</h3>
<p>Data scraping is the process of collecting information from websites. This is commonly done to gather data from various sources on the internet for analysis, research, or storage.</p>
<p>Data scraping can be especially useful for keeping track of changes on websites, gathering large amounts of data quickly, and automating repetitive tasks of collecting information.</p>
<p>However, it's important to remember that while data scraping is a powerful tool, it should be used responsibly and ethically. This means respecting website terms of use, considering data privacy laws, and not overloading websites with too many requests at once.</p>
<h3 id="heading-setting-up">Setting Up</h3>
<p>To get started with our project, you'll need to set up your computer with some specific tools. Here's a list of what you need and how to install them:</p>
<ol>
<li><p><strong>Streamlit</strong>: This tool helps us build and share web applications easily. Install it using the command: <code>pip install streamlit</code>.</p>
</li>
<li><p><strong>Pandas</strong>: A library that makes handling data easier. Install it with <code>pip install pandas</code>.</p>
</li>
<li><p><strong>Selenium (version 4.0.0 to less than 5.0.0)</strong>: It's crucial for web scraping. Install the correct version using <code>pip install selenium==4.*</code>.</p>
</li>
<li><p><strong>PyDrive</strong>: This tool will help us work with Google Drive files. Install it using <code>pip install PyDrive</code>.</p>
</li>
<li><p><strong>python-dotenv</strong>: This library will help us manage environment variables securely. Install it with <code>pip install python-dotenv</code>.</p>
</li>
<li><p><strong>WebDriver Manager</strong>: It helps in managing the browser drivers needed for Selenium. Install it using <code>pip install webdriver_manager</code>.</p>
</li>
<li><p><strong>BeautifulSoup4</strong>: A library that makes it easy to scrape information from web pages. Install it with <code>pip install beautifulsoup4</code>.</p>
</li>
</ol>
<p>With these tools installed, you'll be ready to start working on your data scraping and analysis project!</p>
<h3 id="heading-data-cleaning-with-pandas">Data Cleaning with Pandas</h3>
<p>After successfully scraping data from websites, the next crucial step is cleaning this data. This means making sure the data is organized, free from errors, and ready for analysis. For this, we use Pandas, a powerful Python library that simplifies the process of data manipulation.</p>
<h4 id="heading-getting-started-with-pandas">Getting Started with Pandas</h4>
<ol>
<li><p><strong>Loading Data</strong>: Begin by loading the scraped data into Pandas for easy manipulation.</p>
</li>
<li><p><strong>Removing Duplicates</strong>: Identify and remove any duplicate entries to ensure data quality.</p>
</li>
<li><p><strong>Handling Missing Values</strong>: Find and address any missing or incomplete information in the dataset.</p>
</li>
<li><p><strong>Formatting Data</strong>: Adjust data formats as needed for consistency and easier analysis.</p>
<pre><code class="lang-python"> <span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
 <span class="hljs-comment"># Load the dataset</span>
 df = pd.read_csv(<span class="hljs-string">'scraped_articles.csv'</span>)
 <span class="hljs-comment"># Remove duplicates</span>
 df.drop_duplicates(subset=<span class="hljs-string">'article_title'</span>, inplace=<span class="hljs-literal">True</span>)
 <span class="hljs-comment"># Remove any rows with missing values</span>
 df.dropna(inplace=<span class="hljs-literal">True</span>)
 df.to_csv(<span class="hljs-string">'cleaned_articles.csv'</span>, index=<span class="hljs-literal">False</span>)
</code></pre>
</li>
</ol>
<h3 id="heading-integrating-google-drive-api-for-data-storage-and-sharing"><strong>Integrating Google Drive API for Data Storage and Sharing:</strong></h3>
<p>In our project, we leverage the Google Drive API to efficiently store and share the data we've collected. This approach not only provides a secure and scalable storage solution but also simplifies the process of making our data accessible to others. Here's how we do it:</p>
<h4 id="heading-setting-up-the-environment">Setting Up the Environment</h4>
<ol>
<li><p><strong>Importing Libraries</strong>: We start by importing necessary libraries including <code>pydrive.auth</code>, <code>GoogleDrive</code>, and <code>oauth2client.client</code>.</p>
</li>
<li><p><strong>Environment Variables</strong>: Using <code>load_dotenv</code> from the <code>dotenv</code> package, we load our Google API credentials stored in environment variables for security. This includes <code>CLIENT_ID</code>, <code>CLIENT_SECRET</code>, and <code>REFRESH_TOKEN</code>.</p>
</li>
</ol>
<h4 id="heading-authenticating-with-google-drive">Authenticating with Google Drive</h4>
<ul>
<li><strong>Authentication Function</strong>: We define a function <code>authenticate_google_drive</code> which sets up the authentication using OAuth2 credentials. This ensures a secure connection to Google Drive.</li>
</ul>
<h4 id="heading-uploading-files-to-google-drive">Uploading Files to Google Drive</h4>
<ul>
<li><p><strong>Upload Function</strong>: The <code>upload_file_to_drive</code> function takes in the <code>drive</code> object and the file path to upload the file to Google Drive. We also handle the case where the file might already exist on Google Drive, updating it instead of uploading a duplicate.</p>
</li>
<li><p><strong>Error Handling</strong>: The function includes error handling to manage any issues during the upload process.</p>
</li>
</ul>
<h4 id="heading-generating-downloadable-links">Generating Downloadable Links</h4>
<ul>
<li><strong>Download Link Function</strong>: The <code>get_drive_download_link</code> function is used to generate a direct download link for the files uploaded. This function sets the necessary permissions on the file to make it accessible to anyone with the link.</li>
</ul>
<h4 id="heading-practical-example">Practical Example</h4>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
<span class="hljs-keyword">from</span> pydrive.auth <span class="hljs-keyword">import</span> GoogleAuth
<span class="hljs-keyword">from</span> pydrive.drive <span class="hljs-keyword">import</span> GoogleDrive
<span class="hljs-keyword">from</span> oauth2client.client <span class="hljs-keyword">import</span> OAuth2Credentials
<span class="hljs-keyword">import</span> os

load_dotenv()

CLIENT_ID = os.getenv(<span class="hljs-string">'CLIENT_ID'</span>)
CLIENT_SECRET = os.getenv(<span class="hljs-string">'CLIENT_SECRET'</span>)
REFRESH_TOKEN = os.getenv(<span class="hljs-string">'REFRESH_TOKEN'</span>)
REDIRECT_URI = os.getenv(<span class="hljs-string">'REDIRECT_URIS'</span>).split(<span class="hljs-string">','</span>)[<span class="hljs-number">0</span>]  <span class="hljs-comment"># Access the first URI</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">authenticate_google_drive</span>():</span>
        gauth = GoogleAuth()
        gauth.credentials = OAuth2Credentials(<span class="hljs-literal">None</span>, CLIENT_ID, CLIENT_SECRET,REFRESH_TOKEN, <span class="hljs-literal">None</span>,
                                             <span class="hljs-string">"https://accounts.google.com/o/oauth2/token"</span>, <span class="hljs-literal">None</span>, <span class="hljs-string">"web"</span>)
        drive = GoogleDrive(gauth)
        <span class="hljs-keyword">return</span> drive

    drive = authenticate_google_drive()

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">upload_file_to_drive</span>(<span class="hljs-params">drive, file_path, folder_id=None</span>):</span>
        <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> os.path.exists(file_path):
            print(<span class="hljs-string">f"Cannot upload, file does not exist at path: <span class="hljs-subst">{file_path}</span>"</span>)
            <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

        <span class="hljs-keyword">try</span>:
            file_metadata = {<span class="hljs-string">'title'</span>: os.path.basename(file_path)}
            <span class="hljs-keyword">if</span> folder_id:
                file_metadata[<span class="hljs-string">'parents'</span>] = [{<span class="hljs-string">'id'</span>: folder_id}]

            upload_file = drive.CreateFile(file_metadata)

            <span class="hljs-comment"># Check if the file already exists on Google Drive</span>
            existing_files = drive.ListFile({<span class="hljs-string">'q'</span>: <span class="hljs-string">f"title='<span class="hljs-subst">{upload_file[<span class="hljs-string">'title'</span>]}</span>'"</span>}).GetList()
            <span class="hljs-keyword">if</span> existing_files:
                <span class="hljs-comment"># File with the same name already exists, update the existing file</span>
                upload_file = existing_files[<span class="hljs-number">0</span>]
                print(<span class="hljs-string">f"File already exists on Drive. Updating file with ID: <span class="hljs-subst">{upload_file[<span class="hljs-string">'id'</span>]}</span>"</span>)
            <span class="hljs-keyword">else</span>:
                print(<span class="hljs-string">"Uploading a new file to Drive."</span>)

            upload_file.SetContentFile(file_path)
            upload_file.Upload()
            print(<span class="hljs-string">f"File uploaded successfully. File ID: <span class="hljs-subst">{upload_file[<span class="hljs-string">'id'</span>]}</span>"</span>)
            <span class="hljs-keyword">return</span> upload_file[<span class="hljs-string">'id'</span>]
        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            print(<span class="hljs-string">f"An error occurred during file upload: <span class="hljs-subst">{e}</span>"</span>)
            <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>


    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_drive_download_link</span>(<span class="hljs-params">drive, file_id</span>):</span>
        <span class="hljs-keyword">try</span>:
            file = drive.CreateFile({<span class="hljs-string">'id'</span>: file_id})
            file.Upload() <span class="hljs-comment"># Make sure the file exists on Drive</span>
            file.InsertPermission({
                <span class="hljs-string">'type'</span>: <span class="hljs-string">'anyone'</span>,
                <span class="hljs-string">'value'</span>: <span class="hljs-string">'anyone'</span>,
                <span class="hljs-string">'role'</span>: <span class="hljs-string">'reader'</span>})
            <span class="hljs-keyword">return</span> <span class="hljs-string">"https://drive.google.com/uc?export=download&amp;id="</span> + file_id
        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            print(<span class="hljs-string">f"Error fetching download link: <span class="hljs-subst">{e}</span>"</span>)
            <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>
</code></pre>
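<p>Putting these pieces together, a typical call sequence, using the <code>cleaned_articles.csv</code> file produced in the cleaning step above, would look like this:</p>
<pre><code class="lang-python"># Authenticate once, then upload the cleaned dataset and print a shareable link
drive = authenticate_google_drive()
file_id = upload_file_to_drive(drive, 'cleaned_articles.csv')
if file_id:
    link = get_drive_download_link(drive, file_id)
    print(f"Share this link: {link}")
</code></pre>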
<h4 id="heading-conclusion">Conclusion</h4>
<p>This integration of the Google Drive API provides a robust and secure method for storing and sharing large datasets. By automating the upload and link generation process, we significantly enhance the accessibility and usability of our data.</p>
<h3 id="heading-streamlit-web-interface">Streamlit Web Interface</h3>
<p>Our Streamlit application serves as a central hub for aggregating news from various Moroccan websites. It's designed to be intuitive and user-friendly, enabling users to select news sources, languages, and categories for scraping.</p>
<h4 id="heading-key-features">Key Features</h4>
<ol>
<li><p><strong>Dynamic Configuration</strong>: Users can choose websites and specific categories from a dynamically loaded configuration (<code>config.json</code>). This allows for a customizable scraping experience.</p>
</li>
<li><p><strong>Language and Category Selection</strong>: For websites offering content in multiple languages, users can select their preferred language. Additionally, users can pick specific categories of news they are interested in.</p>
</li>
<li><p><strong>Control Over Data Collection</strong>: Through a simple interface, users can specify the number of articles to scrape.</p>
</li>
<li><p><strong>Initiating Scraping</strong>: A 'Start Scraping' button triggers the scraping process, with a progress bar indicating the ongoing operation.</p>
</li>
<li><p><strong>Real-time Updates and Data Display</strong>: As data is scraped and uploaded, users receive real-time updates. Each successful scrape results in a download link for the data and a display of the scraped data in a tabular format within the application.</p>
</li>
<li><p><strong>Google Drive Integration</strong>: Scraped data files are uploaded to Google Drive, and direct download links are provided within the Streamlit interface for easy access.</p>
</li>
<li><p><strong>Error Handling</strong>: The application includes error handling for issues like failed file uploads or unsuccessful scrapes, ensuring a smooth user experience.</p>
<pre><code class="lang-python"> <span class="hljs-keyword">import</span> streamlit <span class="hljs-keyword">as</span> st
 <span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
 <span class="hljs-keyword">import</span> json
 <span class="hljs-keyword">import</span> importlib
 <span class="hljs-keyword">from</span> selenium <span class="hljs-keyword">import</span> webdriver
 <span class="hljs-keyword">from</span> selenium.webdriver.chrome.options <span class="hljs-keyword">import</span> Options <span class="hljs-keyword">as</span> ChromeOptions
 <span class="hljs-keyword">import</span> google_drive_handle <span class="hljs-keyword">as</span> gdrive
 <span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
 <span class="hljs-keyword">import</span> os

 <span class="hljs-comment"># Load config.json</span>
 <span class="hljs-keyword">with</span> open(<span class="hljs-string">'config.json'</span>) <span class="hljs-keyword">as</span> f:
     config = json.load(f)

 <span class="hljs-comment"># Set up Chrome WebDriver with options</span>
 options = ChromeOptions()
 options.add_argument(<span class="hljs-string">'--headless'</span>)
 options.add_argument(<span class="hljs-string">'--no-sandbox'</span>)
 options.add_argument(<span class="hljs-string">'--disable-dev-shm-usage'</span>)
 options.add_argument(<span class="hljs-string">'log-level=3'</span>)
 <span class="hljs-comment"># Initialize the Chrome WebDriver</span>
 wd = webdriver.Chrome(options=options)
 drive = gdrive.authenticate_google_drive()
 processed_files = set()
 st.markdown(
     <span class="hljs-string">"""
     &lt;style&gt;
         .centered {
             display: flex;
             align-items: center;
             justify-content: center;
             text-align: center;
         }
     &lt;/style&gt;
     """</span>,
     unsafe_allow_html=<span class="hljs-literal">True</span>
 )

 st.markdown(<span class="hljs-string">"&lt;h1 class='centered'&gt;Moroccan News Aggregator&lt;/h1&gt;"</span>, unsafe_allow_html=<span class="hljs-literal">True</span>)

 selected_websites = {}
 selected_categories = {}

 <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">save_file_id_mapping</span>(<span class="hljs-params">file_id_mapping</span>):</span>
     <span class="hljs-keyword">with</span> open(<span class="hljs-string">"file_id_mapping.json"</span>, <span class="hljs-string">"w"</span>) <span class="hljs-keyword">as</span> file:
         json.dump(file_id_mapping, file)

 <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_file_id_mapping</span>():</span>
     <span class="hljs-keyword">try</span>:
         <span class="hljs-keyword">with</span> open(<span class="hljs-string">"file_id_mapping.json"</span>, <span class="hljs-string">"r"</span>) <span class="hljs-keyword">as</span> file:
             <span class="hljs-keyword">return</span> json.load(file)
     <span class="hljs-keyword">except</span> FileNotFoundError:
         <span class="hljs-keyword">return</span> {}  <span class="hljs-comment"># Return an empty dictionary if the file doesn't exist</span>

 file_id_mapping = load_file_id_mapping()

 <span class="hljs-keyword">for</span> website, details <span class="hljs-keyword">in</span> config.items():
     <span class="hljs-keyword">if</span> st.checkbox(website, key=website):
         <span class="hljs-comment"># Language selection</span>
         languages = details.get(<span class="hljs-string">"languages"</span>, {})
         <span class="hljs-keyword">if</span> languages <span class="hljs-keyword">and</span> len(languages) &gt; <span class="hljs-number">1</span>:
             language = st.selectbox(<span class="hljs-string">f'Choose language for <span class="hljs-subst">{website}</span>'</span>, list(languages.keys()), key=<span class="hljs-string">f'lang_<span class="hljs-subst">{website}</span>'</span>)
             selected_websites[website] = <span class="hljs-string">f"<span class="hljs-subst">{website}</span>_<span class="hljs-subst">{language}</span>"</span>  <span class="hljs-comment"># like: hespress_en</span>
         <span class="hljs-keyword">else</span>:
             selected_websites[website] = website  <span class="hljs-comment"># like: akhbarona</span>

         <span class="hljs-comment"># Category selection</span>
         categories = languages.get(language, {})
         <span class="hljs-keyword">if</span> categories:
             categories = st.multiselect(<span class="hljs-string">f'Select categories for <span class="hljs-subst">{website}</span>'</span>, list(categories.keys()), key=<span class="hljs-string">f'<span class="hljs-subst">{website}</span>_categories'</span>)
             selected_categories[website] = categories

 <span class="hljs-comment"># Number of articles input</span>
 num_articles = st.number_input(<span class="hljs-string">'Number of Articles'</span>, min_value=<span class="hljs-number">1</span>, max_value=<span class="hljs-number">10000</span>, step=<span class="hljs-number">1</span>)

 <span class="hljs-comment"># Start scraping button</span>
 <span class="hljs-keyword">if</span> st.button(<span class="hljs-string">'Start Scraping'</span>):
     <span class="hljs-keyword">with</span> st.spinner(<span class="hljs-string">'Scraping in progress...'</span>):
         progress_bar = st.progress(<span class="hljs-number">0</span>)
         total_tasks = sum(len(categories) <span class="hljs-keyword">for</span> categories <span class="hljs-keyword">in</span> selected_categories.values())
         completed_tasks = <span class="hljs-number">0</span>
         <span class="hljs-keyword">for</span> website, module_name <span class="hljs-keyword">in</span> selected_websites.items():
             scraper_module = importlib.import_module(module_name)
             <span class="hljs-keyword">for</span> category <span class="hljs-keyword">in</span> selected_categories.get(website, []):
                 category_url = config[website][<span class="hljs-string">'languages'</span>][language][category]
                 <span class="hljs-keyword">if</span> <span class="hljs-string">'category_name'</span> <span class="hljs-keyword">in</span> config[website]:
                     category_name = config[website][<span class="hljs-string">'category_name'</span>].get(category, <span class="hljs-string">'default_category_name'</span>)
                 file_path = scraper_module.scrape_category(category_url, num_articles)

                 <span class="hljs-keyword">if</span> file_path:
                     <span class="hljs-keyword">if</span> file_path <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> file_id_mapping:
                         file_id = gdrive.upload_file_to_drive(drive, file_path)
                         print(<span class="hljs-string">f"Uploading file: <span class="hljs-subst">{file_path}</span>, File ID: <span class="hljs-subst">{file_id}</span>"</span>)
                         file_id_mapping[file_path] = file_id
                         save_file_id_mapping(file_id_mapping)
                     <span class="hljs-keyword">else</span>:
                         file_id = file_id_mapping[file_path]
                         print(<span class="hljs-string">f"File already uploaded. Using existing File ID: <span class="hljs-subst">{file_id}</span>"</span>)

                     <span class="hljs-keyword">if</span> file_id:
                         download_link = gdrive.get_drive_download_link(drive, file_id)
                         <span class="hljs-keyword">if</span> download_link:
                             st.markdown(<span class="hljs-string">f"[Download <span class="hljs-subst">{website}</span> - <span class="hljs-subst">{category}</span> data](<span class="hljs-subst">{download_link}</span>)"</span>, unsafe_allow_html=<span class="hljs-literal">True</span>)

                             df = pd.read_csv(file_path)
                             st.write(<span class="hljs-string">f"<span class="hljs-subst">{website}</span> - <span class="hljs-subst">{category}</span> Data:"</span>)
                             st.dataframe(df)
                         <span class="hljs-keyword">else</span>:
                             st.error(<span class="hljs-string">f"Failed to retrieve download link for file ID: <span class="hljs-subst">{file_id}</span>"</span>)
                     <span class="hljs-keyword">else</span>:
                         st.error(<span class="hljs-string">f"Failed to upload file for <span class="hljs-subst">{website}</span> - <span class="hljs-subst">{category}</span>"</span>)
                 <span class="hljs-keyword">else</span>:
                     st.error(<span class="hljs-string">f"File not created for <span class="hljs-subst">{website}</span> - <span class="hljs-subst">{category}</span>"</span>)

         st.success(<span class="hljs-string">'Scraping Completed!'</span>)
</code></pre>
</li>
</ol>
<p>This Streamlit web interface stands as a testament to the power of Python in creating efficient, user-friendly tools for data aggregation and management. It simplifies the complex process of data collection, storage, and sharing, making it accessible even to those with minimal technical background.</p>
<h3 id="heading-dynamic-configuration-with-configjson">Dynamic Configuration with 'config.json'</h3>
<p>Our Streamlit application is designed with agility and future expansion in mind. By incorporating a <code>config.json</code> file, we've created a flexible framework that allows for easy addition and modification of news sources.</p>
<h4 id="heading-the-role-of-configjson">The Role of <code>config.json</code></h4>
<ul>
<li><p><strong>Flexible Source Management</strong>: The <code>config.json</code> file holds the details of various news websites, their available languages, and specific category URLs. This setup enables us to easily add new sources or modify existing ones without altering the core code of the application.</p>
</li>
<li><p><strong>Language and Category Customization</strong>: For each news website, multiple languages and categories are defined. Users can select their preferred language and categories, making the data scraping process highly customizable.</p>
</li>
</ul>
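<p>Based on how the application indexes the configuration (<code>config[website]['languages'][language][category]</code>), a minimal <code>config.json</code> might look like the sketch below. The site names, language codes, and URLs are illustrative placeholders:</p>
<pre><code class="lang-json">{
  "hespress": {
    "languages": {
      "ar": {
        "politics": "https://www.hespress.com/politique",
        "sport": "https://www.hespress.com/sport"
      },
      "en": {
        "politics": "https://en.hespress.com/politics"
      }
    },
    "category_name": {
      "politics": "Politics",
      "sport": "Sport"
    }
  }
}
</code></pre>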
<h4 id="heading-implementation-in-streamlit">Implementation in Streamlit</h4>
<ul>
<li><p><strong>Loading Configuration</strong>: At the start of the application, <code>config.json</code> is loaded to dynamically populate the website choices, along with their respective languages and categories.</p>
</li>
<li><p><strong>User Interactions</strong>: Users interact with checkboxes and dropdowns generated based on the configuration. They can select websites, languages, and specific categories for scraping.</p>
</li>
<li><p><strong>Scalability</strong>: Adding a new website, language, or category is as simple as updating the <code>config.json</code> file, making the application scalable and easy to maintain.</p>
</li>
</ul>
<h4 id="heading-advantages">Advantages</h4>
<ul>
<li><p><strong>Maintainability</strong>: Changes to the list of websites or their categories don't require code changes, reducing maintenance complexity.</p>
</li>
<li><p><strong>User Experience</strong>: Provides a user-friendly interface where options are dynamically generated, offering a seamless and intuitive experience.</p>
</li>
</ul>
<h3 id="heading-demo">Demo</h3>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://huggingface.co/spaces/MoroccanDS/A8-Moroccan-News-Aggregator">https://huggingface.co/spaces/MoroccanDS/A8-Moroccan-News-Aggregator</a></div>
<p> </p>
<h2 id="heading-conclusion-1">Conclusion</h2>
<p>Our journey through scraping, cleaning, and deploying data from Moroccan news websites has been a testament to the power and flexibility of Python and its libraries. By leveraging tools like Selenium, Pandas, Streamlit, and Google Drive API, we've demonstrated a streamlined process that transforms raw data into accessible and actionable insights. This project not only showcases the technical capabilities of these tools but also highlights the potential for data-driven strategies in understanding and disseminating information effectively.</p>
<h2 id="heading-acknowledgements">Acknowledgements</h2>
<p>First and foremost, a heartfelt thanks to my dedicated teammates - <a class="user-mention" href="https://hashnode.com/@ScorpionTaj">Tajeddine Bourhim</a>, <a class="user-mention" href="https://hashnode.com/@Steevie">Yahya NPC</a> and @marwane khadrouf. Their expertise, creativity, and commitment were pivotal in turning this concept into reality. Their contributions in various aspects of the project, from data scraping to interface design, have been invaluable.</p>
<p>We also extend our gratitude to the founder of MDS and the initiator of the DataStart initiative, <a class="user-mention" href="https://hashnode.com/@bahae">Bahae Eddine Halim</a>. This initiative not only kick-started our project but also inspired us to delve into the realm of data handling and analysis, contributing to a range of projects including this one.</p>
<p>This project stands as a collaborative effort, blending individual talents and a shared vision. It's a celebration of teamwork, innovation, and the endless possibilities that open-source technology and data science bring to our world.</p>
]]></content:encoded></item><item><title><![CDATA[Python based currency converter web app : A simple approach]]></title><description><![CDATA[Introduction
In today's globalized world, the ability to swiftly convert currencies is a necessity for travelers, businesses, and finance enthusiasts alike. In this blog, we delve into the fascinating realm of currency conversion, showcasing a Python...]]></description><link>https://blog.moroccands.com/python-based-currency-converter-web-app-a-simple-approach</link><guid isPermaLink="true">https://blog.moroccands.com/python-based-currency-converter-web-app-a-simple-approach</guid><category><![CDATA[Morrocan]]></category><category><![CDATA[Python]]></category><category><![CDATA[Python 3]]></category><category><![CDATA[streamlit]]></category><category><![CDATA[BeautifulSoup]]></category><category><![CDATA[huggingface]]></category><category><![CDATA[Scraping]]></category><category><![CDATA[Web Development]]></category><category><![CDATA[UI Design]]></category><category><![CDATA[currency]]></category><category><![CDATA[currency-converter]]></category><category><![CDATA[Morocco ]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Matplotlib]]></category><category><![CDATA[APIs]]></category><dc:creator><![CDATA[Adnane Karmouch]]></dc:creator><pubDate>Fri, 01 Mar 2024 11:00:16 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/jgOkEjVw-KM/upload/a00f7122862ab57ad248a495b9db197b.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>In today's globalized world, the ability to swiftly convert currencies is a necessity for travelers, businesses, and finance enthusiasts alike. In this blog, we delve into the fascinating realm of currency conversion, showcasing a Python-based web application that brings real-time exchange rates to your fingertips.</p>
<p>Join us on this journey as we unveil the inner workings of our currency converter, crafted with the powerful combination of Python, Streamlit, and BeautifulSoup. We'll explore the challenges and triumphs of sourcing current exchange rates, and the seamless user experience delivered by our intuitive web interface.</p>
<p>So sit back, relax, and prepare to embark on a voyage through the dynamic landscape of currency conversion, guided by the ingenuity of Python's versatile ecosystem. Let's dive in! 🌐💱</p>
<h2 id="heading-what-is-streamlit">What is streamlit</h2>
<p>Streamlit is a cutting-edge Python library that empowers developers to create interactive web applications with remarkable ease and speed. What sets Streamlit apart is its simplicity and focus on the developer experience. With just a few lines of Python code, developers can transform their data scripts into polished, user-friendly web apps. Streamlit handles all the heavy lifting behind the scenes, from data visualization to user input handling, allowing developers to focus on crafting engaging experiences for their users. Whether you're a seasoned developer or just starting out, Streamlit offers a seamless and intuitive platform for bringing your ideas to life on the web.</p>
<h2 id="heading-what-is-beautifulsoup">What is beautifulSoup</h2>
<p>Beautiful Soup is a Python library renowned for its ability to parse HTML and XML documents, making web scraping and data extraction a breeze. It provides a powerful yet user-friendly interface for navigating and manipulating parsed documents, allowing developers to extract relevant information with ease. Beautiful Soup's intuitive syntax and robust functionality make it a go-to choice for extracting data from websites of all complexities. Whether you're scraping a simple webpage or traversing a complex hierarchy of elements, Beautiful Soup streamlines the process, enabling developers to focus on extracting insights from the data rather than wrestling with the intricacies of web parsing. With its versatility and reliability, Beautiful Soup continues to be a cornerstone tool for web scraping tasks across various domains and industries.</p>
<ul>
<li><h3 id="heading-code-snippet-main-page">Code snippet (main page)</h3>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> requests
<span class="hljs-keyword">from</span> bs4 <span class="hljs-keyword">import</span> BeautifulSoup
<span class="hljs-keyword">import</span> streamlit <span class="hljs-keyword">as</span> st


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">rate_parser</span>(<span class="hljs-params">input_curr, output_curr</span>):</span>
    url = <span class="hljs-string">f"https://www.xe.com/currencyconverter/convert/?Amount=1&amp;From=<span class="hljs-subst">{input_curr}</span>&amp;To=<span class="hljs-subst">{output_curr}</span>"</span>
    content = requests.get(url).text
    soup = BeautifulSoup(content, <span class="hljs-string">'html.parser'</span>)

    result_element = soup.find(<span class="hljs-string">"p"</span>, class_=<span class="hljs-string">"result__BigRate-sc-1bsijpp-1 dPdXSB"</span>)

    <span class="hljs-keyword">if</span> result_element:
        currency_text = result_element.get_text().replace(<span class="hljs-string">','</span>, <span class="hljs-string">''</span>)  <span class="hljs-comment"># Remove comma</span>
        rate = float(currency_text.split()[<span class="hljs-number">0</span>])
        <span class="hljs-keyword">return</span> rate
    <span class="hljs-keyword">else</span>:
        print(<span class="hljs-string">f"Element not found for <span class="hljs-subst">{input_curr}</span> to <span class="hljs-subst">{output_curr}</span>."</span>)
        <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">convert</span>(<span class="hljs-params">base,dest,amount</span>):</span>
    rate = rate_parser(base,dest)
    new_amount = rate * amount
    <span class="hljs-keyword">return</span> new_amount 


currencies = [<span class="hljs-string">"USD"</span>,<span class="hljs-string">"EUR"</span>,<span class="hljs-string">"CAD"</span>,<span class="hljs-string">"MAD"</span>,<span class="hljs-string">"GBP"</span>,<span class="hljs-string">"AUD"</span>,<span class="hljs-string">"JPY"</span>]
<span class="hljs-comment">#this is just a test list you can add any currency available at xe-currency</span>

st.write(<span class="hljs-string">"# Python based currency converter"</span>)

st.sidebar.write(<span class="hljs-string">"### Currency converter:"</span>)

base = st.sidebar.selectbox(<span class="hljs-string">"Enter a base currency:"</span>,currencies)

dest = st.sidebar.selectbox(<span class="hljs-string">"Enter a destination currency:"</span>,currencies)

amount = st.sidebar.number_input(<span class="hljs-string">"Enter an amount"</span>)

input = st.sidebar.button(<span class="hljs-string">"Convert"</span>)

<span class="hljs-keyword">if</span> input:
    current_rate = rate_parser(base,dest)
    output = convert(base,dest,amount)
    st.success(<span class="hljs-string">"Success"</span>)
    st.write(<span class="hljs-string">f"## Current exchange rate between <span class="hljs-subst">{base}</span> and <span class="hljs-subst">{dest}</span> "</span>)
    st.write(<span class="hljs-string">f"#### 1 <span class="hljs-subst">{base}</span> = "</span>)
    st.write(<span class="hljs-string">f" ## :red[<span class="hljs-subst">{current_rate}</span>] <span class="hljs-subst">{dest}</span>"</span>)

    st.write(<span class="hljs-string">"## Converted amount:"</span>)
    st.write(<span class="hljs-string">f" ### <span class="hljs-subst">{amount}</span> <span class="hljs-subst">{base}</span> = :red[<span class="hljs-subst">{output}</span>] <span class="hljs-subst">{dest}</span>"</span>)
</code></pre>
<h2 id="heading-why-we-chose-scraping-over-api">Why we chose scraping over API</h2>
<p>In our pursuit of building a Python-based currency converter web app, the decision to opt for web scraping over utilizing an API was rooted in the specific requirements of our project. While APIs offer a convenient way to access data, particularly for real-time exchange rates, we encountered limitations with free API plans, which typically impose restrictions on the number of requests allowed per day. This became a significant constraint when attempting to gather historical exchange rate data, as it necessitated a considerable number of requests.</p>
<p>In light of this, we shifted our approach to web scraping using BeautifulSoup. While scraping provided the flexibility needed for historical data retrieval, it introduced the trade-off of slower performance compared to APIs. The notable advantage, however, lay in the ability to extract comprehensive historical datasets without the constraints imposed by API rate limits. This strategic decision allowed us to balance our data acquisition requirements, ensuring the reliability and completeness of the information while acknowledging the potential trade-offs in speed associated with web scraping.</p>
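<p>One way to soften that speed penalty, sketched here as a possible extension rather than something the original app does, is to cache scraped results so repeated identical queries skip the network. Recent Streamlit versions ship a caching decorator for exactly this:</p>
<pre><code class="lang-python">import streamlit as st

# Hypothetical wrapper around rate_parser from the snippet above:
# identical (base, dest) lookups within an hour are served from cache.
@st.cache_data(ttl=3600)
def cached_rate(input_curr, output_curr):
    return rate_parser(input_curr, output_curr)
</code></pre>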
<ul>
<li><h3 id="heading-code-snippet-historical-data">Code snippet (historical data)</h3>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> requests
<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> timedelta, date
<span class="hljs-keyword">from</span> bs4 <span class="hljs-keyword">import</span> BeautifulSoup
<span class="hljs-keyword">import</span> streamlit <span class="hljs-keyword">as</span> st
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">from</span> matplotlib.ticker <span class="hljs-keyword">import</span> MaxNLocator

<span class="hljs-comment">#Create a function to return a list of each day between two dates</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">daterange</span>(<span class="hljs-params">start_date, end_date</span>):</span>
    days = []
    <span class="hljs-keyword">for</span> n <span class="hljs-keyword">in</span> range(int((end_date - start_date).days)):
        days.append(start_date + timedelta(n))
    <span class="hljs-keyword">return</span> days  <span class="hljs-comment"># avoid shadowing the built-in name 'list'</span>

<span class="hljs-comment">#Create a function that scrapes historical data nb(all apis didn't work well)</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">historical_data</span>(<span class="hljs-params">start_date,end_date,base,dest</span>):</span>
    df = pd.DataFrame()
    times = daterange(start_date, end_date)
    <span class="hljs-keyword">for</span> single_date <span class="hljs-keyword">in</span> times:
        dfs = pd.read_html(<span class="hljs-string">f'https://www.xe.com/currencytables/?from=<span class="hljs-subst">{base}</span>&amp;date=<span class="hljs-subst">{single_date.strftime(<span class="hljs-string">"%Y-%m-%d"</span>)}</span>'</span>)[<span class="hljs-number">0</span>]
        dfs[<span class="hljs-string">'Date'</span>] = single_date.strftime(<span class="hljs-string">"%Y-%m-%d"</span>)
        df = pd.concat([df, dfs], ignore_index=<span class="hljs-literal">True</span>)

    df_curr=df.loc[df[<span class="hljs-string">'Currency'</span>]==dest]
    df_curr = df_curr.reset_index(drop=<span class="hljs-literal">True</span>)
    df_curr.set_index(<span class="hljs-string">'Date'</span>,inplace = <span class="hljs-literal">True</span>)
    <span class="hljs-keyword">return</span> df_curr


<span class="hljs-comment">#Create a list of all supported currencies</span>

currencies = [<span class="hljs-string">"USD"</span>,<span class="hljs-string">"EUR"</span>,<span class="hljs-string">"CAD"</span>,<span class="hljs-string">"MAD"</span>,<span class="hljs-string">"GBP"</span>,<span class="hljs-string">"AUD"</span>,<span class="hljs-string">"JPY"</span>]
<span class="hljs-comment">#this is just a test list</span>
<span class="hljs-comment">#Expend the list as needed</span>

<span class="hljs-comment">#Create the UI using streamlit</span>

st.write(<span class="hljs-string">"# Historical Data:"</span>)
st.warning(<span class="hljs-string">"This is originally a scraping webapp so choosing a large duration might cause substancial running time"</span>)
st.warning(<span class="hljs-string">"Recommanded max duration : 1 Month"</span>)

base = st.sidebar.selectbox(<span class="hljs-string">"Enter a base currency:"</span>,currencies)

dest = st.sidebar.selectbox(<span class="hljs-string">"Enter a destination currency:"</span>,currencies)

start = st.sidebar.date_input(<span class="hljs-string">"Enter start date:"</span>)

finish = st.sidebar.date_input(<span class="hljs-string">"Enter finish date:"</span>)

input = st.sidebar.button(<span class="hljs-string">"Confirm"</span>)

<span class="hljs-keyword">if</span> input:
    <span class="hljs-keyword">if</span> start == finish:
        st.error(<span class="hljs-string">"Error: cannot process same date"</span>)
    <span class="hljs-keyword">with</span> st.spinner(<span class="hljs-string">'Wait for it...'</span>):
        data = historical_data(start,finish,base,dest)
    st.success(<span class="hljs-string">'Done!'</span>)
    st.table(data)
    st.write(<span class="hljs-string">"## Plotting"</span>)

    st.write(<span class="hljs-string">"### Static plot:"</span>)
    fig, ax = plt.subplots()
    ax.plot(data[<span class="hljs-string">f"<span class="hljs-subst">{base}</span> per unit"</span>])
    ax.set_title(<span class="hljs-string">f'<span class="hljs-subst">{base}</span> to <span class="hljs-subst">{dest}</span> over Time'</span>)
    ax.xaxis.set_major_locator(MaxNLocator(nbins=<span class="hljs-number">6</span>))
    st.pyplot(fig)
</code></pre>
<h3 id="heading-example">Example</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td>Currency</td><td>Name</td><td>Units per MAD</td><td>MAD per unit</td></tr>
</thead>
<tbody>
<tr>
<td>2024-02-01</td><td>USD</td><td>US Dollar</td><td>0.0999</td><td>10.0087</td></tr>
<tr>
<td>2024-02-02</td><td>USD</td><td>US Dollar</td><td>0.0996</td><td>10.0422</td></tr>
<tr>
<td>2024-02-03</td><td>USD</td><td>US Dollar</td><td>0.0996</td><td>10.0381</td></tr>
<tr>
<td>2024-02-04</td><td>USD</td><td>US Dollar</td><td>0.0996</td><td>10.0381</td></tr>
<tr>
<td>2024-02-05</td><td>USD</td><td>US Dollar</td><td>0.0992</td><td>10.0827</td></tr>
<tr>
<td>2024-02-06</td><td>USD</td><td>US Dollar</td><td>0.0993</td><td>10.0722</td></tr>
<tr>
<td>2024-02-07</td><td>USD</td><td>US Dollar</td><td>0.0994</td><td>10.0611</td></tr>
<tr>
<td>2024-02-08</td><td>USD</td><td>US Dollar</td><td>0.0995</td><td>10.0458</td></tr>
<tr>
<td>2024-02-09</td><td>USD</td><td>US Dollar</td><td>0.0997</td><td>10.0340</td></tr>
<tr>
<td>2024-02-10</td><td>USD</td><td>US Dollar</td><td>0.0997</td><td>10.0327</td></tr>
<tr>
<td>2024-02-11</td><td>USD</td><td>US Dollar</td><td>0.0997</td><td>10.0327</td></tr>
<tr>
<td>2024-02-12</td><td>USD</td><td>US Dollar</td><td>0.0996</td><td>10.0396</td></tr>
<tr>
<td>2024-02-13</td><td>USD</td><td>US Dollar</td><td>0.0993</td><td>10.0679</td></tr>
<tr>
<td>2024-02-14</td><td>USD</td><td>US Dollar</td><td>0.0993</td><td>10.0744</td></tr>
<tr>
<td>2024-02-15</td><td>USD</td><td>US Dollar</td><td>0.0994</td><td>10.0646</td></tr>
<tr>
<td>2024-02-16</td><td>USD</td><td>US Dollar</td><td>0.0994</td><td>10.0643</td></tr>
<tr>
<td>2024-02-17</td><td>USD</td><td>US Dollar</td><td>0.0994</td><td>10.0636</td></tr>
<tr>
<td>2024-02-18</td><td>USD</td><td>US Dollar</td><td>0.0994</td><td>10.0642</td></tr>
<tr>
<td>2024-02-19</td><td>USD</td><td>US Dollar</td><td>0.0991</td><td>10.0882</td></tr>
<tr>
<td>2024-02-20</td><td>USD</td><td>US Dollar</td><td>0.0992</td><td>10.0763</td></tr>
<tr>
<td>2024-02-21</td><td>USD</td><td>US Dollar</td><td>0.0992</td><td>10.0761</td></tr>
<tr>
<td>2024-02-22</td><td>USD</td><td>US Dollar</td><td>0.0994</td><td>10.0556</td></tr>
<tr>
<td>2024-02-23</td><td>USD</td><td>US Dollar</td><td>0.0994</td><td>10.0557</td></tr>
<tr>
<td>2024-02-24</td><td>USD</td><td>US Dollar</td><td>0.0994</td><td>10.0555</td></tr>
<tr>
<td>2024-02-25</td><td>USD</td><td>US Dollar</td><td>0.0995</td><td>10.0553</td></tr>
<tr>
<td>2024-02-26</td><td>USD</td><td>US Dollar</td><td>0.0995</td><td>10.0497</td></tr>
<tr>
<td>2024-02-27</td><td>USD</td><td>US Dollar</td><td>0.0993</td><td>10.0663</td></tr>
<tr>
<td>2024-02-28</td><td>USD</td><td>US Dollar</td><td>0.0990</td><td>10.1015</td></tr>
</tbody>
</table>
</div><h3 id="heading-deployment">Deployment</h3>
<p>We chose Hugging Face for deployment due to its reputation as a leading platform for deploying machine learning models with ease and efficiency. Hugging Face offers a comprehensive suite of tools and services that streamline the deployment process, enabling us to seamlessly deploy our web app.</p>
<p>One of the key reasons for selecting Hugging Face is its user-friendly interface and robust documentation, which simplifies the deployment workflow and reduces the learning curve for developers.</p>
<p>Furthermore, Hugging Face's scalability and reliability were critical factors in our decision-making process. The platform's infrastructure is designed to handle high volumes of traffic and deliver consistent performance, making it well-suited for deploying production-ready applications.</p>
<p>Moreover, Hugging Face offers built-in features for model versioning, monitoring, and management, which facilitate seamless updates and maintenance of deployed models.</p>
<p>Overall, Hugging Face emerged as the optimal choice for deployment based on its reputation for reliability, scalability, ease of use, and comprehensive feature set, enabling us to deploy our web app with confidence and efficiency.</p>
<p><a target="_blank" href="https://huggingface.co/spaces/MoroccanDS/B7-Pyhton-based-currency-converter">https://huggingface.co/spaces/MoroccanDS/B7-Pyhton-based-currency-converter</a></p>
<h3 id="heading-conclusion">Conclusion</h3>
<p>In conclusion, embarking on this project to develop a Python-based currency converter web app has proven to be a highly rewarding endeavor, not only for the usefulness of the application but also for the invaluable skills gained throughout the process. By leveraging technologies such as Streamlit and BeautifulSoup, we've honed our abilities in web development, data extraction, and user interface design, equipping us with valuable tools for future projects.</p>
]]></content:encoded></item><item><title><![CDATA[Exploring Moroccan Real Estate: Trends and Price Predictions]]></title><description><![CDATA[Table of Contents:

Introduction
Data Scraping
Data Cleaning
Exploratory Data Analysis (EDA)
Model Development
Model Deployment
Conclusion
Acknowledgments



Introduction
Hello there! Are you up for an exciting journey through Morocco's real estate s...]]></description><link>https://blog.moroccands.com/exploring-moroccan-real-estate-trends-and-price-predictions</link><guid isPermaLink="true">https://blog.moroccands.com/exploring-moroccan-real-estate-trends-and-price-predictions</guid><category><![CDATA[geospatial visualisation]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[data scraping]]></category><category><![CDATA[Real Estate]]></category><dc:creator><![CDATA[Zahra EL HATIME]]></dc:creator><pubDate>Fri, 16 Feb 2024 18:43:23 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1706875172477/8ed4b88d-c625-417f-8f13-f0d4ab66ef8d.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<details>
<summary>Table of Contents:</summary>
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#data-scraping">Data Scraping</a></li>
<li><a href="#data-cleaning">Data Cleaning</a></li>
<li><a href="#exploratory-data-analysis-eda">Exploratory Data Analysis (EDA)</a></li>
<li><a href="#model-development">Model Development</a></li>
<li><a href="#model-deployment">Model Deployment</a></li>
<li><a href="#conclusion">Conclusion</a></li>
<li><a href="#acknowledgments">Acknowledgments</a></li>
</ul>
</details>

<h1 id="heading-introductionhttpshashnodecomdraft65bccc56d821d9fd24722c81heading-introduction"><a target="_blank" href="https://hashnode.com/draft/65bccc56d821d9fd24722c81#heading-introduction">Introduction</a></h1>
<p>Hello there! Are you up for an exciting journey through Morocco's real estate scene? From the bustling markets to the hidden gems of Marrakech and Chefchaouen, we're diving into the heart of it all. Join us as we decode the secrets of Moroccan property prices and unveil the magic of predictive analytics. Get ready to explore, analyze, and uncover the possibilities in this thrilling adventure! 🌟🔍</p>
<h1 id="heading-data-scrapinghttpshashnodecomdraft65bccc56d821d9fd24722c81heading-datascraping"><a target="_blank" href="https://hashnode.com/draft/65bccc56d821d9fd24722c81#heading-datascraping">Data Scraping</a></h1>
<p>In the realm of real estate analytics, data is king. But how do we access this treasure trove of information scattered across the vast expanse of the internet? Enter web scraping – the digital prospector's tool of choice.</p>
<p><strong>What is Web Scraping?</strong></p>
<p>Web scraping is the art of extracting data from websites, allowing us to collect valuable information for analysis and insights. In the context of our project, web scraping enables us to gather real estate listings, economic indicators, and other pertinent data from online sources.</p>
<p><strong>Why is it Important?</strong></p>
<p>In the dynamic world of real estate, having access to up-to-date and comprehensive data is crucial for making informed decisions. Web scraping empowers us to gather vast amounts of data efficiently, giving us a competitive edge in predicting real estate prices and understanding market trends.</p>
<p><strong>Scraping Data with BeautifulSoup</strong></p>
<p>Our journey begins with a prominent real estate website in Morocco, brimming with valuable listings and insights. Using the Python library BeautifulSoup, we embark on our quest to extract data from it with precision and finesse.</p>
<p><strong>The Process</strong>:</p>
<p><em>Exploration</em>: We start by inspecting the structure of the website, identifying the key elements containing the data we seek – from property listings to pricing information.</p>
<p><em>Scripting</em>: Armed with our knowledge of HTML and CSS, we craft Python scripts leveraging BeautifulSoup to navigate through the website's code and extract the desired data.</p>
<p><em>Extraction</em>: With surgical precision, our scripts traverse through the pages of the website, capturing essential details such as property names, prices, locations, area, and more.</p>
<p><em>Aggregation:</em> As the data flows in, we gather and organize it into structured formats such as Excel spreadsheets, ready for further analysis and processing.</p>
<p><strong>Python Code Snippets:</strong></p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> requests
<span class="hljs-keyword">from</span> bs4 <span class="hljs-keyword">import</span> BeautifulSoup
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> urllib.parse <span class="hljs-keyword">import</span> urlparse, urljoin

<span class="hljs-comment"># Define the base URL</span>
baseurl = <span class="hljs-string">'https://www.mu.ma/'</span>

<span class="hljs-comment"># Set to store unique product links</span>
product_links = set()

<span class="hljs-comment"># Loop through multiple pages of apartment listings</span>
<span class="hljs-keyword">for</span> page_num <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, <span class="hljs-number">3</span>):
    <span class="hljs-comment"># Make a GET request to the page</span>
    r = requests.get(<span class="hljs-string">f'https://www.mu.ma/fr/sc/appartements-a-vendre:p:<span class="hljs-subst">{page_num}</span>'</span>)
    soup = BeautifulSoup(r.content, <span class="hljs-string">'html.parser'</span>)

   <span class="hljs-comment"># Extract listings from the page</span>
    listings = soup.find_all(class_=<span class="hljs-string">'listingBox w100'</span>)

    <span class="hljs-comment"># Loop through each listing and extract product links</span>
    <span class="hljs-keyword">for</span> listing <span class="hljs-keyword">in</span> listings:
        links = listing.find_all(<span class="hljs-string">'a'</span>, href=<span class="hljs-literal">True</span>)
        <span class="hljs-keyword">for</span> link <span class="hljs-keyword">in</span> links:
            product_link = link[<span class="hljs-string">'href'</span>]

            <span class="hljs-comment"># Check if the URL is valid</span>
            <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> urlparse(product_link).scheme:
                product_link = urljoin(baseurl, product_link)
                product_links.add(product_link)  <span class="hljs-comment"># Add link to the set</span>

<span class="hljs-comment"># Initialize list to store scraped data</span>
scraped_data = []

<span class="hljs-comment"># Loop through each product link and scrape data</span>
<span class="hljs-keyword">for</span> link <span class="hljs-keyword">in</span> product_links:
    <span class="hljs-keyword">try</span>:
        <span class="hljs-comment"># Make a GET request to the product page</span>
        r = requests.get(link)
        soup = BeautifulSoup(r.content, <span class="hljs-string">'html.parser'</span>)
        <span class="hljs-comment"># Extract relevant information from the page</span>
        <span class="hljs-comment"># (Code snippet continued from previous section...)</span>
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">'Error processing page:'</span>, link)
        print(e)
        <span class="hljs-keyword">continue</span>

<span class="hljs-comment"># Convert scraped data to DataFrame and save to Excel</span>
df = pd.DataFrame(scraped_data)
df.to_excel(<span class="hljs-string">'scraped_data.xlsx'</span>, index=<span class="hljs-literal">False</span>)
</code></pre>
<p>With web scraping as our trusty pickaxe, we delve into the digital mines of the website, extracting nuggets of real estate data to fuel our predictive models and illuminate the path to informed decision-making. Let the data adventure begin! 🏠💻</p>
<h1 id="heading-data-cleaninghttpshashnodecomdraft65bccc56d821d9fd24722c81heading-datascleaning"><a target="_blank" href="https://hashnode.com/draft/65bccc56d821d9fd24722c81#heading-datascleaning">Data Cleaning</a></h1>
<p><strong>Why Data Cleaning Matters:</strong></p>
<p>Before we can uncover the hidden patterns and insights within our data, we must first ensure its integrity and quality. Data cleaning plays a pivotal role in this process, serving as the foundation upon which our analyses and models are built. By removing inconsistencies, handling missing values, and standardizing formats, we pave the way for accurate and reliable results.</p>
<p><strong>Steps in Data Cleaning:</strong></p>
<p>Removing Duplicates: Duplicate entries can skew our analyses and lead to erroneous conclusions. By identifying and eliminating duplicates, we ensure that each observation contributes meaningfully to our insights.</p>
<p>Handling Missing Values: Missing data is a common challenge in real-world datasets. Whether due to human error or system limitations, missing values must be addressed to maintain the integrity of our analyses. Strategies such as imputation or removal can be employed based on the nature and context of the missing data.</p>
<p>Standardizing Data Types: Inconsistent data types can impede our ability to perform meaningful calculations and comparisons. Standardizing data types ensures uniformity and facilitates seamless data manipulation and analysis.</p>
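<p>In this project, coercing text fields such as "250 000 DH" or "65m²" into numbers was handled with VBA macros before loading, as noted in the snippet below. Purely as an illustration, the same step could be done in pandas with a small helper; the function name <code>to_number</code> is our own:</p>
<pre><code class="lang-python">import re
import pandas as pd

def to_number(text):
    """Extract the first number from a string like '250 000 DH' or '65m²'."""
    if pd.isna(text):
        return None
    match = re.search(r'\d[\d\s]*', str(text))
    return float(match.group().replace(' ', '')) if match else None

print(to_number('250 000 DH'))  # 250000.0
print(to_number('65m²'))        # 65.0
</code></pre>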
<p><strong>Python Code Snippets:</strong></p>
<pre><code class="lang-python">
<span class="hljs-comment"># Load the scraped data after some cleaning with VBA Macros</span>

df = pd.read_excel(<span class="hljs-string">'scraped_data.xlsx'</span>)

<span class="hljs-comment"># Remove duplicate rows</span>

df.drop_duplicates(inplace=<span class="hljs-literal">True</span>)

<span class="hljs-comment"># Drop rows with missing prices</span>

df.dropna(subset=[<span class="hljs-string">'price'</span>], inplace=<span class="hljs-literal">True</span>)

<span class="hljs-comment"># Fill missing values in 'city' with corresponding values from 'secteur'</span>

df[<span class="hljs-string">'city'</span>].fillna(df[<span class="hljs-string">'secteur'</span>], inplace=<span class="hljs-literal">True</span>)

<span class="hljs-comment"># Fill missing values in 'sdb' and 'chambres' with the mean</span>

mean_sdb = df[<span class="hljs-string">'sdb'</span>].mean()

mean_chambres = df[<span class="hljs-string">'chambres'</span>].mean()

mean_surface = df[<span class="hljs-string">'surface'</span>].mean()

df[<span class="hljs-string">'sdb'</span>].fillna(mean_sdb, inplace=<span class="hljs-literal">True</span>)

df[<span class="hljs-string">'chambres'</span>].fillna(mean_chambres, inplace=<span class="hljs-literal">True</span>)

df[<span class="hljs-string">'surface'</span>].fillna(mean_surface, inplace=<span class="hljs-literal">True</span>)

<span class="hljs-comment"># Fill missing values in 'pieces' with the sum of 'sdb' and 'chambres'</span>

df[<span class="hljs-string">'pieces'</span>].fillna(df[<span class="hljs-string">'sdb'</span>] + df[<span class="hljs-string">'chambres'</span>], inplace=<span class="hljs-literal">True</span>)

<span class="hljs-comment"># One-hot encode 'secteur' and 'city' columns</span>

df = pd.get_dummies(df, columns=[<span class="hljs-string">'secteur'</span>, <span class="hljs-string">'city'</span>])

<span class="hljs-comment"># Display the cleaned DataFrame</span>

print(df.head())
</code></pre>
<p><strong>Example:</strong></p>
<p>Let's say our original dataset contains the following entries:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>name</strong></td><td><strong>price</strong></td><td><strong>secteur</strong></td><td><strong>surface</strong></td><td><strong>pieces</strong></td><td><strong>chambres</strong></td><td><strong>sdb</strong></td><td><strong>etat</strong></td><td><strong>age</strong></td><td><strong>etage</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Appartement de 65m² en vente, Complexe Résidentiel Jnane Azzahrae</td><td>250 000 DH</td><td>Ain Atig àTemara</td><td>65m²</td><td>4 Pièces</td><td>3 Chambres</td><td>2 Salles de bains</td><td>Nouveau</td><td></td><td></td></tr>
<tr>
<td>Av grand appartement terrasse mohammedia centre</td><td>2 300 000 DH</td><td>Centre Ville àMohammedia</td><td>169m²</td><td>6 Pièces</td><td>3 Chambres</td><td>2 Salles de bains</td><td>Bon état</td><td>5-10 ans</td><td>5èmeétage</td></tr>
<tr>
<td>Appartement à vendre 269 m², 3 chambres Val Fleurie Casablanca</td><td>3 800 000 DH</td><td>Val Fleury àCasablanca</td><td>269m²</td><td>4 Pièces</td><td>3 Chambres</td><td>2 Salles de bains</td><td>Bon état</td><td>10-20 ans</td><td>1erétage</td></tr>
<tr>
<td>Appartement de 105 m² en vente, Rio Beach</td><td>9 900 DH</td><td>Sidi Rahal</td><td>105m²</td><td>3 Pièces</td><td>2 Chambres</td><td>2 Salles de bains</td><td>Nouveau</td><td>Moins d'un an</td><td></td></tr>
<tr>
<td>Studio Meublé moderne, à vendre</td><td>1 360 000 DH</td><td>Racine àCasablanca</td><td>57m²</td><td>2 Pièces</td><td>1 Chambre</td><td>1 Salle de bain</td><td>Nouveau</td><td></td><td></td></tr>
<tr>
<td>Appartement 99 m² a vendre M Océan</td><td>1 336 500 DH</td><td>Mannesmann àMohammedia</td><td>99m²</td><td>3 Pièces</td><td>2 Chambres</td><td>2 Salles de bains</td><td>Nouveau</td><td>1-5 ans</td><td></td></tr>
<tr>
<td>Appartement à vendre 88 m², 2 chambres Les princesses Casablanca</td><td>1 300 000 DH</td><td>Maârif Extension àCasablanca</td><td>88m²</td><td></td><td>2 Chambres</td><td>2 Salles de bains</td><td></td><td></td><td>4èmeétage</td></tr>
</tbody>
</table>
</div><p>After applying data cleaning techniques, our cleaned dataset might look like this:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>name</strong></td><td><strong>price</strong></td><td><strong>secteur</strong></td><td><strong>city</strong></td><td><strong>surface</strong></td><td><strong>pieces</strong></td><td><strong>chambres</strong></td><td><strong>sdb</strong></td><td><strong>age</strong></td><td><strong>etage</strong></td><td><strong>etat_Bon état</strong></td><td><strong>etat_Nouveau</strong></td><td><strong>etat_À rénover</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Appartement de 65m² en vente, Complexe Résidentiel Jnane Azzahrae</td><td>250000.00</td><td>Ain Atig</td><td>Temara</td><td>65</td><td>4</td><td>3</td><td>2</td><td>5.43647</td><td>1.771796</td><td>0</td><td>1</td><td>0</td></tr>
<tr>
<td>Vente appartement à quartier geliz marrakech</td><td>135000.00</td><td>Guéliz</td><td>Marrakech</td><td>104</td><td>6</td><td>3</td><td>1</td><td>5.43647</td><td>1.771796</td><td>1</td><td>0</td><td>0</td></tr>
<tr>
<td>Av grand appartement terrasse mohammedia centre</td><td>2300000.00</td><td>Centre Ville</td><td>Mohammedia</td><td>169</td><td>6</td><td>3</td><td>1</td><td>7.5</td><td>5</td><td>1</td><td>0</td><td>0</td></tr>
<tr>
<td>Appartement à vendre 269 m², 3 chambres Val Fleurie Casablanca</td><td>3800000.00</td><td>Val Fleury</td><td>Casablanca</td><td>269</td><td>4</td><td>3</td><td>1</td><td>15</td><td>1</td><td>1</td><td>0</td><td>0</td></tr>
<tr>
<td>Appartement de 105 m² en vente, Rio Beach</td><td>9900.00</td><td>Sidi Rahal</td><td>Marrakech</td><td>105</td><td>3</td><td>2</td><td>1</td><td>0.5</td><td>1.771796</td><td>0</td><td>1</td><td>0</td></tr>
<tr>
<td>Très joli appartement en vente meublé</td><td>2000000.00</td><td>Camp Al Ghoul</td><td>Marrakech</td><td>99</td><td>3</td><td>2</td><td>1</td><td>3.5</td><td>3</td><td>0</td><td>1</td><td>0</td></tr>
<tr>
<td>Appartement à vendre 259 m², 4 chambres Maârif Casablanca</td><td>4400000.00</td><td>Maârif</td><td>Casablanca</td><td>259</td><td>6</td><td>4</td><td>1</td><td>15</td><td>2</td><td>1</td><td>0</td><td>0</td></tr>
</tbody>
</table>
</div><p><strong>Conclusion:</strong></p>
<p>Data cleaning is the cornerstone of robust data analysis. By ensuring the cleanliness and consistency of our datasets, we lay the groundwork for accurate insights and informed decision-making. With our data now polished to perfection, we are ready to embark on the next stage of our real estate journey.</p>
<h1 id="heading-exploratory-data-analysis-edahttpshashnodecomdraft65bccc56d821d9fd24722c81heading-eda"><a target="_blank" href="https://hashnode.com/draft/65bccc56d821d9fd24722c81#heading-eda">Exploratory Data Analysis (EDA)</a></h1>
<p><strong>Understanding the Dataset:</strong></p>
<p>Exploratory Data Analysis (EDA) serves as our compass, guiding us through the vast landscape of real estate data. It empowers us to uncover hidden patterns, identify outliers, and gain valuable insights into the dynamics of the market.</p>
<p><strong>Significance of EDA:</strong></p>
<p>Understanding Real Estate Dynamics: EDA allows us to delve deep into the intricacies of the real estate dataset, unraveling the relationships between various factors such as property characteristics, location, area, and pricing.</p>
<p>Identifying Patterns and Trends: By analyzing descriptive statistics and visualizations, we can identify trends over time, seasonal fluctuations, and spatial disparities in property prices.</p>
<p>Informing Decision-Making: Insights gleaned from EDA serve as the cornerstone for informed decision-making, whether it be for investors, developers, or policymakers.</p>
<p><strong>Visualizations and Descriptive Statistics:</strong></p>
<p>Distribution of Real Estate Prices: Histograms and boxplots provide a snapshot of the distribution of property prices, highlighting central tendencies, variability, and potential outliers.</p>
<p>Relationships Between Variables: Scatter plots and correlation matrices help us explore the relationships between different variables, such as property size, number of rooms, and prices, shedding light on potential predictors of property value.</p>
<p>Temporal Trends: Time series plots allow us to visualize temporal trends in property prices, discerning patterns, seasonality, and long-term trends.</p>
<p><strong>Geospatial Visualizations:</strong></p>
<p>Interactive Maps: Utilizing geospatial data, we can create interactive maps to visualize property locations, hotspots, and regional disparities in prices. This allows stakeholders to explore the real estate landscape at a glance and identify areas of interest.</p>
<p>Heatmaps: Heatmaps offer a bird's-eye view of property density and price distribution, providing valuable insights into market saturation and demand hotspots.</p>
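<p>To make these plots concrete, here is a minimal matplotlib sketch, assuming the cleaned DataFrame <code>df</code> from the previous section with numeric <code>price</code> and <code>surface</code> columns:</p>
<pre><code class="lang-python">import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Distribution of prices: central tendency, spread, and outliers
ax1.hist(df['price'], bins=50)
ax1.set_xlabel('Price (MAD)')
ax1.set_ylabel('Number of listings')
ax1.set_title('Distribution of listing prices')

# Relationship between surface area and price
ax2.scatter(df['surface'], df['price'], alpha=0.3)
ax2.set_xlabel('Surface (m²)')
ax2.set_ylabel('Price (MAD)')
ax2.set_title('Surface vs. price')

plt.tight_layout()
plt.show()
</code></pre>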
<p><strong>Conclusion:</strong></p>
<p>Exploratory Data Analysis lights our way through the labyrinth of real estate data. By unraveling patterns, trends, and spatial dynamics, EDA equips us with the insights needed to navigate the complexities of the real estate market with confidence and clarity. With our eyes opened to this rich tapestry, we are poised to unlock its full potential and drive informed decision-making in the ever-evolving landscape of real estate. 📊🏠</p>
<h1 id="heading-model-developmenthttpshashnodecomdraft65bccc56d821d9fd24722c81heading-modeldev"><a target="_blank" href="https://hashnode.com/draft/65bccc56d821d9fd24722c81#heading-modeldev">Model Development</a></h1>
<p>Model development marks the culmination of our journey, where we harness the power of machine learning to predict real estate prices with precision and accuracy. This transformative process involves a series of meticulous steps, each contributing to the creation of robust predictive models.</p>
<p><strong>Model Development Process:</strong></p>
<p>Data Normalization: We embark on our journey by normalizing the data, ensuring that all features are on a consistent scale. This step prevents certain features from dominating the model training process and ensures optimal performance.</p>
<p>Feature Selection: With a myriad of features at our disposal, we carefully select the most influential ones to include in our predictive models. Through feature selection techniques, we prioritize variables that exhibit strong correlations with property prices and contribute meaningfully to the predictive power of our models.</p>
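<p>As a lightweight illustration of this step (not necessarily the exact technique used in the project), correlation with the target gives a quick first ranking of numeric candidate features:</p>
<pre><code class="lang-python"># Absolute correlation of every numeric feature with the price target
correlations = df.corr(numeric_only=True)['price'].abs().sort_values(ascending=False)
print(correlations.head(10))
</code></pre>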
<p>Model Training: Armed with a curated dataset and selected features, we embark on the model training phase. Leveraging powerful machine learning libraries such as TensorFlow, we train regression models to learn from historical data and discern intricate patterns in real estate pricing dynamics.</p>
<p>Evaluation Metrics: As stewards of data-driven decision-making, we rely on rigorous evaluation metrics to assess the performance of our models. Metrics such as Mean Squared Error (MSE) and Mean Absolute Error (MAE) serve as our compass, guiding us towards models that exhibit optimal predictive accuracy and generalization.</p>
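<p>Concretely, for <em>n</em> test properties with true prices <em>y<sub>i</sub></em> and predictions <em>ŷ<sub>i</sub></em>, MSE = (1/<em>n</em>) Σ (<em>y<sub>i</sub></em> − <em>ŷ<sub>i</sub></em>)² and MAE = (1/<em>n</em>) Σ |<em>y<sub>i</sub></em> − <em>ŷ<sub>i</sub></em>|. MSE penalizes large errors disproportionately, while MAE reads directly in the units of the target, here dirhams.</p>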
<p><strong>Harnessing TensorFlow and Beyond:</strong></p>
<p>TensorFlow: TensorFlow stands as our stalwart companion on the journey of model development, providing a versatile framework for building and training regression models. With its intuitive interface and powerful capabilities, TensorFlow empowers us to bring our predictive visions to life with elegance and efficiency.</p>
<p>Machine Learning Libraries: In addition to TensorFlow, we harness a diverse array of machine learning libraries such as scikit-learn and Keras to augment our model development efforts. These libraries offer a rich ecosystem of algorithms and tools, enabling us to experiment, iterate, and refine our predictive models with finesse.</p>
<p><strong>Python Code Snippet</strong></p>
<pre><code class="lang-python">
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split

<span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> StandardScaler

<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> mean_squared_error, mean_absolute_error

<span class="hljs-comment"># Load the dataset and preprocess features</span>

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=<span class="hljs-number">0.2</span>, random_state=<span class="hljs-number">42</span>)

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)

X_test_scaled = scaler.transform(X_test)

<span class="hljs-comment"># Define and train the TensorFlow regression model</span>

model = tf.keras.Sequential([

tf.keras.layers.Dense(<span class="hljs-number">64</span>, activation=<span class="hljs-string">'relu'</span>, input_shape=(X_train_scaled.shape[<span class="hljs-number">1</span>],)),

tf.keras.layers.Dense(<span class="hljs-number">64</span>, activation=<span class="hljs-string">'relu'</span>),

tf.keras.layers.Dense(<span class="hljs-number">1</span>)

])

model.compile(optimizer=<span class="hljs-string">'adam'</span>, loss=<span class="hljs-string">'mean_squared_error'</span>)

model.fit(X_train_scaled, y_train, epochs=<span class="hljs-number">100</span>, batch_size=<span class="hljs-number">32</span>, verbose=<span class="hljs-number">1</span>)

<span class="hljs-comment"># Evaluate the model</span>

y_pred = model.predict(X_test_scaled)

mse = mean_squared_error(y_test, y_pred)

mae = mean_absolute_error(y_test, y_pred)

print(<span class="hljs-string">f'Mean Squared Error: <span class="hljs-subst">{mse}</span>'</span>)

print(<span class="hljs-string">f'Mean Absolute Error: <span class="hljs-subst">{mae}</span>'</span>)
</code></pre>
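<p>To hand the trained network over to the deployment step described next, it has to be serialized and reloaded in the serving environment. A minimal sketch, assuming a recent TensorFlow version; the file name is our own choice:</p>
<pre><code class="lang-python"># Persist the trained model to disk...
model.save('real_estate_model.keras')

# ...and reload it later, e.g. inside the deployed app
loaded = tf.keras.models.load_model('real_estate_model.keras')
print(loaded.predict(X_test_scaled[:1]))
</code></pre>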
<p><strong>Conclusion:</strong></p>
<p>Model development serves as the cornerstone of our predictive journey, where we transform data into actionable insights and empower stakeholders to make informed decisions in the dynamic realm of real estate. Through meticulous attention to detail, rigorous experimentation, and the judicious application of machine learning techniques, we pave the way for predictive models that illuminate the path forward with clarity and confidence. 🤖🏠📈</p>
<h1 id="heading-model-deploymenthttpshashnodecomdraft65bccc56d821d9fd24722c81heading-deployment"><a target="_blank" href="https://hashnode.com/draft/65bccc56d821d9fd24722c81#heading-deployment">Model Deployment</a></h1>
<p><strong>Deployment Process:</strong></p>
<p>As we approach the culmination of our journey, it's time to unleash the predictive prowess of our meticulously crafted models into the real world. Model deployment represents the pivotal moment when theoretical concepts seamlessly transition into practical applications, offering actionable insights and informed decision-making capabilities to stakeholders.</p>
<p><strong>Deployment Landscape:</strong></p>
<p>Real-World Integration: Our trained models are seamlessly integrated into real-world environments, where they stand ready to analyze incoming data and provide valuable predictions on real estate prices. Whether it's assisting homebuyers in making informed purchase decisions or aiding industry professionals in strategic planning, our deployed models serve as beacons of predictive wisdom.</p>
<p>Deployment Overview: Our deployment offers a glimpse into the intuitive interface through which users can interact with the deployed model. By inputting relevant property features such as surface area, number of rooms, and location, users can harness the predictive power of our models to obtain accurate price predictions in a matter of seconds.</p>
<p>User Interaction: Interacting with our deployed model is as simple as entering the desired property features into the designated input fields. With just a few clicks, users gain access to personalized price predictions tailored to their specific requirements, empowering them to navigate the intricacies of the real estate market with confidence and clarity.</p>
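<p>Hugging Face Spaces typically serves such interfaces through frameworks like Gradio or Streamlit. Purely as an illustrative sketch, and not the actual code behind the linked Space, a Gradio prediction UI could look like this; it assumes a <code>model</code> and <code>scaler</code> trained on exactly these three features:</p>
<pre><code class="lang-python">import gradio as gr

def predict_price(surface, rooms, bedrooms):
    # Scale the raw inputs the same way the training data was scaled
    x = scaler.transform([[surface, rooms, bedrooms]])
    return float(model.predict(x)[0][0])

demo = gr.Interface(
    fn=predict_price,
    inputs=[gr.Number(label="Surface (m²)"),
            gr.Number(label="Rooms"),
            gr.Number(label="Bedrooms")],
    outputs=gr.Number(label="Predicted price (MAD)"),
    title="Real estate price prediction",
)
demo.launch()
</code></pre>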
<p>Ready to experience the magic of predictive analytics firsthand? Explore our deployment and witness the transformative potential of data-driven insights in action. Follow the link below to embark on a journey of discovery and unlock the secrets of real estate pricing dynamics.</p>
<p><a target="_blank" href="https://huggingface.co/spaces/saaara/real_estate_price_prediction">https://huggingface.co/spaces/saaara/real_estate_price_prediction</a></p>
<p><strong>Conclusion:</strong></p>
<p>Model deployment represents the culmination of our journey, where theoretical concepts are transformed into tangible solutions that empower individuals and organizations to make informed decisions in the ever-evolving landscape of real estate. Through seamless integration, intuitive interfaces, and the democratization of predictive analytics, we pave the way for a future where data-driven insights drive meaningful change and innovation.</p>
<h1 id="heading-conclusionhttpshashnodecomdraft65bccc56d821d9fd24722c81heading-conclusion"><a target="_blank" href="https://hashnode.com/draft/65bccc56d821d9fd24722c81#heading-conclusion">Conclusion</a></h1>
<p>We've made our data analysis and modeling process interactive and accessible by sharing our Google Colab notebook. Dive deeper into the intricacies of our real estate analysis, run code cells, visualize data, and even experiment with our predictive models here:</p>
<p><a target="_blank" href="https://colab.research.google.com/drive/1sWd5QhPXL0MpLsRsYBb7JuxK8uDTYCaq?authuser=1#scrollTo=iJF6DFW618jR">https://colab.research.google.com/drive/1sWd5QhPXL0MpLsRsYBb7JuxK8uDTYCaq?authuser=1#scrollTo=iJF6DFW618jR</a></p>
<p>As our journey through the labyrinth of Moroccan real estate draws to a close, we stand amidst a landscape adorned with insights, revelations, and transformative discoveries. Through the collective efforts of our dedicated team, we have unearthed invaluable treasures that illuminate the intricate dynamics of real estate pricing in Morocco. Our models exhibit remarkable accuracy in forecasting real estate prices, leveraging a nuanced understanding of location, amenities, and economic trends.</p>
<p>By harnessing the power of web scraping and exploratory data analysis, we have curated a comprehensive dataset, laying the foundation for informed decision-making. With the deployment of our models, we bridge the gap between theory and practice, offering intuitive interfaces to access predictive analytics and empowering users to navigate the complexities of the real estate market with confidence.</p>
<p>Looking ahead, we remain committed to continuous innovation, exploring novel methodologies and technologies to enhance the accuracy of our models. Reflecting on our journey, we also recognize areas for improvement, acknowledging the challenges faced and the lessons learned.</p>
<p><strong>A Call to Action:</strong></p>
<p>Embark on your own journey of exploration and discovery within the realm of Moroccan real estate. Whether you're a seasoned professional, an aspiring enthusiast, or a curious explorer, there's a wealth of insights and opportunities waiting to be uncovered. Connect with our team to learn more about our endeavors, collaborate on future projects, or simply indulge in the fascinating world of real estate analytics. Together, let's chart a course towards a future where data-driven insights illuminate the path to prosperity and growth. 🌟🔍🏠</p>
<h1 id="heading-acknowledgmentshttpshashnodecomdraft65bccc56d821d9fd24722c81heading-acknowledgments"><a target="_blank" href="https://hashnode.com/draft/65bccc56d821d9fd24722c81#heading-acknowledgments">Acknowledgments</a></h1>
<p>I would like to express my sincere appreciation to my dedicated team members – Sara M'HAMDI, Imane KARAM, and Asmae EL-GHEZZAZ – whose expertise and commitment have been invaluable throughout this project. Your hard work, collaboration, and enthusiasm have truly made a difference. As the team leader, I am incredibly proud to have worked alongside such talented individuals.</p>
<p>I also want to acknowledge <a target="_blank" href="https://www.linkedin.com/in/halimbahae/">Bahae Eddine Halim</a>, the founder of the Moroccan Data Science (MDS) community, whose initiative, DataStart First Edition, provided the platform for our project. His dedication to fostering a supportive environment for data enthusiasts in Morocco has been instrumental in our journey. Lastly, we thank the broader data science community for their support and encouragement. Your enthusiasm and engagement have motivated us to push boundaries and continuously strive for excellence in our endeavors.</p>
<p>To all those who have contributed to this project – mentors, team members, and supporters – we express our heartfelt thanks. Your collective efforts have been integral to the success of this endeavor, and we look forward to continued collaboration and growth in the future.</p>
]]></content:encoded></item><item><title><![CDATA[Moroccan Data Scientists: Pioneering Tech Innovation in the Heart of Africa]]></title><description><![CDATA[Introduction
In the heart of Morocco, a tech revolution is brewing, and at the forefront stands the Moroccan Data Scientists (MDS) community. MDS is not just a hub for technological advancement; it's a philosophy that nurtures inquisitive minds and e...]]></description><link>https://blog.moroccands.com/moroccan-data-scientists</link><guid isPermaLink="true">https://blog.moroccands.com/moroccan-data-scientists</guid><category><![CDATA[MDSInnovation ]]></category><category><![CDATA[maroc]]></category><category><![CDATA[MoroccanTech]]></category><category><![CDATA[Moroccan]]></category><category><![CDATA[MDS]]></category><category><![CDATA[Morocco ]]></category><category><![CDATA[tech ]]></category><category><![CDATA[technology]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Databases]]></category><category><![CDATA[data]]></category><category><![CDATA[Science ]]></category><category><![CDATA[datascience]]></category><category><![CDATA[techrevolution]]></category><category><![CDATA[community]]></category><dc:creator><![CDATA[Bahae Eddine Halim]]></dc:creator><pubDate>Sun, 21 Jan 2024 22:44:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1705876740214/27163e4f-307a-46cf-a8e6-be6a1f317b3f.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-introduction"><strong>Introduction</strong></h3>
<p>In the heart of Morocco, a tech revolution is brewing, and at the forefront stands the Moroccan Data Scientists (MDS) community. MDS is not just a hub for technological advancement; it's a philosophy that nurtures inquisitive minds and empowers aspiring data scientists. Let's delve into the unique initiatives and vibrant community that make MDS a beacon of innovation in the Moroccan tech landscape.</p>
<h3 id="heading-mds-philosophy"><strong>MDS Philosophy</strong></h3>
<p>At MDS, learning is not just about academic excellence; it's about fostering confident, creative thinkers. The community is dedicated to merging social-emotional development with technical proficiency, ensuring that its members not only excel in data science but also emerge as impactful leaders in their industries.</p>
<h3 id="heading-discover-more-about-mds"><strong>Discover More About MDS</strong></h3>
<p>MDS takes pride in its active members, qualified professionals, and a plethora of activities and collaborative projects. Founder Bahae Eddine Halim envisions inspiring members to dream big and become trailblazers in the realms of technology and data science.</p>
<h3 id="heading-explore-our-specialties"><strong>Explore our Specialties</strong></h3>
<p>The learning experience at MDS is diverse and comprehensive. From sophisticated data analysis techniques and machine learning to artificial intelligence engineering and data visualization, members are equipped with skills that transcend traditional boundaries.</p>
<h3 id="heading-our-innovative-initiatives-at-mds"><strong>Our "Innovative" Initiatives at MDS</strong></h3>
<p>MDS goes beyond data understanding; it embeds it in the rich tapestry of Moroccan culture. Initiatives like the <strong>#DataStart program</strong> kickstart data science projects addressing local challenges. Regular <strong>webinars</strong> featuring experts in data science and AI keep the community abreast of the latest trends. <strong>Data-driven publications</strong> by members showcase how data science is shaping various sectors in Morocco.</p>
<h3 id="heading-conclusion"><strong>Conclusion</strong></h3>
<p>As we navigate through the virtual corridors of Moroccan Data Scientists, it's evident that MDS is not just a community; it's a movement. A movement that empowers, inspires, and connects data enthusiasts in Morocco. With its unique blend of technology and cultural relevance, MDS is shaping the future of tech in Morocco, one data scientist at a time.</p>
]]></content:encoded></item></channel></rss>