In this post, I want to share a cool project I recently did as part of the Data Engineering module of my PDEng program. I will show how to do simple Twitter sentiment analysis in Python on data streamed from Twitter. The data is streamed into Apache Kafka, then stored in a MongoDB database, and finally, the results are presented in a dashboard made with Dash and Plotly. Hope you are excited. Let’s dive into it!
Caveat: This tutorial/demonstration is quite advanced, so if you are new to Python and Data Science/Data Engineering, it may be a bit tough to follow and replicate. I provide all the code on GitHub (additional instructions about the code can be found in the repository’s README) and try to explain the steps needed. However, some familiarity with Apache Kafka, Python, MongoDB, and Dash/Plotly will help you follow along and replicate what I did. Several technologies are involved, and I cannot give a detailed tutorial on all of them in a single blog post. 🙂
Twitter Sentiment Analysis Using Python
The point of the dashboard was to inform Dutch municipalities about how people feel about the energy transition in The Netherlands. The government wants to end gas drilling in Groningen and has asked the municipalities to make neighborhoods gas-free by installing solar panels. However, people may not agree with this decision. We started from the assumption that people share their opinions and feelings on social media platforms. For the purpose of this project, we focused on Twitter.
The figure below shows the data pipeline that we put in place.
Data Pipeline Overview
The central part of the data pipeline is the Apache Kafka cluster. The Apache Kafka cluster can be used for streaming data and also for integrating different data sources and different applications.
In our case, data from Twitter is pushed to the Apache Kafka cluster. Sentiment analysis is performed on the text of the tweets before the data is pushed to the cluster. In Kafka terminology, this part is called a producer; here it is the Twitter producer.
On the other side, a consumer subscribes to the Apache Kafka topic and stores the JSON objects of the tweets in a MongoDB database. Finally, the results are presented in an interactive dashboard, built with Dash and Plotly, that is updated live. A more detailed explanation of each part of the pipeline follows.
Apache Kafka Cluster
Things were kept simple here. The cluster has only 1 topic with 6 partitions and a replication factor of 1 (we only deployed locally so there was no need for distributed processing).
The Twitter Producer is written in Python and makes use of the Twitter Streaming API. Three primary Python modules were used, namely pykafka for the connection with the Apache Kafka cluster, tweepy for the connection with the Twitter Streaming API, and textblob for the sentiment analysis.
The producer fetches tweets based on a specified list of keywords. The keywords that were used for this project can be seen below. The commas between the keywords work as logical OR and the space in each keyword phrase as logical AND.
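To make the OR/AND rule concrete, here is a minimal sketch of how such a keyword filter behaves. It is a simplified approximation written for this post, not Twitter's actual matching logic (it only splits on whitespace and ignores punctuation), and the keywords in the example are placeholders rather than the project's list.

```python
def matches_track(text, track):
    """Return True if `text` matches a comma-separated keyword string.

    Commas separate phrases (logical OR). Every space-separated word
    inside one phrase must appear somewhere in the text (logical AND).
    """
    words = set(text.lower().split())
    for phrase in track.split(","):
        terms = phrase.lower().split()
        if terms and all(term in words for term in terms):
            return True
    return False


# Hypothetical keywords, for illustration only
track = "energy transition,gas free"

print(matches_track("The energy debate and the transition plan", track))  # True
print(matches_track("Solar panels are great", track))                     # False
```

Note that the words of a phrase do not have to be adjacent: "energy" and "transition" match even when other words sit between them.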
Moreover, the fields that the producer is fetching from the tweets’ JSON objects are presented in the following table.
Fields Fetched by the Twitter Producer
|Field|Description|
|---|---|
|creation_datetime|The date and time that the tweet was created|
|favorites|The total number of favorites the tweet has received|
|followers|The total number of followers the user who tweeted has|
|location|The location where the tweet happened|
|retweets|The total number of times that the tweet has been retweeted|
|senti_val|The sentiment score that was given by the TextBlob module algorithm (more on this later)|
|subjectivity|The subjectivity score that was given by the TextBlob module algorithm (more on this later)|
|text|The full text of the tweet|
|user_description|The description that the users provide about themselves on their profiles|
|username|The username of the user who tweeted|
Finally, the fetched JSON objects are published to the Apache Kafka cluster.
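As a rough illustration of this step, the sketch below builds a record with the fields from the table and publishes it as JSON. The mapping from Twitter's JSON keys (`created_at`, `favorite_count`, and so on) to the table's field names is an assumption, in particular taking `location` from the user profile, and the helper names `build_record` and `publish` are made up for this example; the code in the repository may differ.

```python
import json


def build_record(tweet, senti_val, subjectivity):
    """Pick the fields listed in the table out of a raw tweet dict.

    The source keys are an assumed mapping onto Twitter's tweet JSON.
    """
    user = tweet.get("user", {})
    return {
        "creation_datetime": tweet.get("created_at"),
        "favorites": tweet.get("favorite_count"),
        "followers": user.get("followers_count"),
        "location": user.get("location"),  # assumption: the profile location
        "retweets": tweet.get("retweet_count"),
        "senti_val": senti_val,
        "subjectivity": subjectivity,
        "text": tweet.get("text"),
        "user_description": user.get("description"),
        "username": user.get("screen_name"),
    }


def publish(producer, record):
    """Send one record to Kafka as UTF-8 encoded JSON bytes.

    `producer` is any object with a `produce(bytes)` method,
    such as a pykafka producer.
    """
    producer.produce(json.dumps(record).encode("utf-8"))


# With pykafka, a producer could be obtained roughly like this
# (host and topic name are placeholders):
#   from pykafka import KafkaClient
#   client = KafkaClient(hosts="localhost:9092")
#   producer = client.topics[b"tweets"].get_producer()
```

Keeping the producer as a parameter makes the two helpers easy to try out without a running Kafka cluster.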
The Sentiment Analysis is performed while the tweets are streaming from Twitter to the Apache Kafka cluster. The analysis is done using the textblob module in Python. Because the module does not work with the Dutch language, we used the following approach. First, we detect the language of the tweet. Next, if the language is Dutch we translate the tweet to English using the Google Translate API. Finally, we perform the sentiment analysis on the translated text. If the text is already in English we, of course, perform the sentiment analysis right away. The algorithm can be seen below.
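As a rough illustration of that flow (not the repository's exact code), the steps can be sketched as follows. Language detection and translation are passed in as functions, because the concrete client for the Google Translate API is not shown in this post; `analyse_sentiment` and `textblob_score` are names chosen for this example. The sketch also assumes that any tweet not detected as Dutch is scored as is.

```python
def analyse_sentiment(text, detect_language, translate_to_english, score):
    """Return (senti_val, subjectivity) for a tweet text.

    detect_language(text)       -> language code such as "nl" or "en"
    translate_to_english(text)  -> English translation of the text
    score(text)                 -> (polarity, subjectivity) of English text
    """
    if detect_language(text) == "nl":
        # Dutch tweets are translated before scoring
        text = translate_to_english(text)
    return score(text)


def textblob_score(text):
    """Score English text with TextBlob."""
    # Imported lazily so the rest of the sketch runs without textblob installed
    from textblob import TextBlob

    sentiment = TextBlob(text).sentiment
    return sentiment.polarity, sentiment.subjectivity
```

In the pipeline, `textblob_score` would be passed as the `score` argument, together with whatever detection and translation helpers are available.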
The sentiment analysis produces two fields: the sentiment score and the subjectivity score. The sentiment score ranges from -1 to +1. The closer the score is to -1, the more negative the tweet; the closer it is to +1, the more positive. The subjectivity score ranges from 0 to +1. The closer the score is to +1, the more subjective the tweet; the closer it is to 0, the more objective.
The MongoDB Consumer is also written in Python. The consumer uses the pymongo module to connect with the desired collection of the MongoDB database. At the same time, it subscribes to the desired Apache Kafka topic. After the connections are established, it stores the JSON objects from the Apache Kafka topic to the MongoDB database.
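The core loop of such a consumer can be sketched like this. It is a minimal version under a few assumptions: the messages carry UTF-8 encoded JSON, the function name `store_messages` is made up for this example, and the host, topic, database, and collection names in the comments are placeholders.

```python
import json


def store_messages(messages, collection):
    """Decode each Kafka message as JSON and insert it into MongoDB.

    `messages` is any iterable of objects with a bytes `.value`
    attribute (pykafka messages have one). `collection` is any object
    with an `insert_one` method, such as a pymongo collection.
    Returns the number of stored documents.
    """
    stored = 0
    for message in messages:
        if message is None:  # skip empty results defensively
            continue
        document = json.loads(message.value.decode("utf-8"))
        collection.insert_one(document)
        stored += 1
    return stored


# Wiring it up with pykafka and pymongo could look roughly like this:
#   from pykafka import KafkaClient
#   from pymongo import MongoClient
#   topic = KafkaClient(hosts="localhost:9092").topics[b"tweets"]
#   consumer = topic.get_simple_consumer()
#   collection = MongoClient("mongodb://localhost:27017")["twitter"]["tweets"]
#   store_messages(consumer, collection)
```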
Such a consumer is needed because Apache Kafka is not a database. Kafka keeps messages only for a configurable retention period, so the data is lost after some days. For that reason, we have to store it in the MongoDB database for future use. A NoSQL database such as MongoDB was chosen because it is made for working with JSON objects. Moreover, it gives us the flexibility to add or remove fields we would like to fetch from Twitter in the future.
Finally, the results are shown in an interactive dashboard made using the Dash-Plotly framework in Python. It makes use of Live Updates to update the data that is shown to the user every 5 seconds (the interval can be specified). The overall structure of the dashboard can be seen below.
There are four sections in the dashboard. Each one will be discussed in more detail.
Section 1 of the Sentiment Analysis Dashboard
Section 1 of the dashboard shows a table with the 5 most recent tweets about the energy transition. For each tweet, the table also shows the date and time, as well as the sentiment and subjectivity scores that were explained previously. The tweets are updated in real time as they are fetched from Twitter. At the moment, the data pipeline naturally also fetches some tweets that are not relevant.
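The live-updating table can be sketched as below. This is a minimal example under several assumptions: it uses the Dash 2.x import style (older versions import `dash_core_components` and `dash_html_components` separately), `creation_datetime` is stored in a sortable form such as an ISO 8601 string, and the names `latest_tweets`, `build_app`, and `load_records` are hypothetical. `load_records` stands for whatever function reads the tweets from MongoDB.

```python
COLUMNS = ["creation_datetime", "text", "senti_val", "subjectivity"]


def latest_tweets(records, n=5):
    """Return the n most recent records, newest first, as table rows."""
    newest = sorted(records, key=lambda r: r["creation_datetime"], reverse=True)[:n]
    return [{column: record.get(column) for column in COLUMNS} for record in newest]


def build_app(load_records, interval_seconds=5):
    """Build a Dash app whose table refreshes on a timer."""
    # Imported lazily so `latest_tweets` can be used without Dash installed
    from dash import Dash, Input, Output, dash_table, dcc, html

    app = Dash(__name__)
    app.layout = html.Div([
        dash_table.DataTable(
            id="recent-tweets",
            columns=[{"name": column, "id": column} for column in COLUMNS],
        ),
        # dcc.Interval fires every `interval` milliseconds
        dcc.Interval(id="refresh", interval=interval_seconds * 1000, n_intervals=0),
    ])

    @app.callback(Output("recent-tweets", "data"), Input("refresh", "n_intervals"))
    def refresh_table(_n_intervals):
        return latest_tweets(load_records())

    return app
```

Each tick of the interval component triggers the callback, which re-reads the data and replaces the rows of the table.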
Section 2 of the Dashboard
Section 2 showcases a vertical bar chart of the users who were most active in tweeting about the energy transition. If the dashboard user hovers over a bar, the description of that Twitter user pops up, provided it was available in the data.
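The data behind such a chart can be computed with a simple count per username. The sketch below assumes the records are dicts with the `username` and `user_description` fields from the table above; the function names and the default of 10 users are choices made for this example.

```python
from collections import Counter


def most_active_users(records, n=10):
    """Return [(username, tweet_count), ...] for the n most active users."""
    counts = Counter(record["username"] for record in records)
    return counts.most_common(n)


def user_descriptions(records):
    """Map each username to its profile description, if one is available."""
    descriptions = {}
    for record in records:
        description = record.get("user_description")
        if description:
            descriptions[record["username"]] = description
    return descriptions
```

The counts give the bar heights, and the description lookup supplies the hover text for users who filled in their profile.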
Section 3 of the Dashboard
Section 3 showcases a horizontal bar chart of the number of tweets per day related to the energy transition. Some bars are zero only because the dashboard was not running on our laptop on those days. To get complete data, the pipeline and dashboard should be deployed on a server and run 24/7 (Heroku and Google Cloud are two possible options).
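A minimal sketch of the per-day count, including the zero bars for days without data, could look like this. It assumes the creation dates have already been converted to `datetime.date` objects; the function name `tweets_per_day` is chosen for this example.

```python
from collections import Counter
from datetime import date, timedelta


def tweets_per_day(dates):
    """Count tweets per calendar day between the first and last date.

    Days without any tweets are included with a count of zero.
    """
    counts = Counter(dates)
    if not counts:
        return {}
    day, last_day = min(counts), max(counts)
    result = {}
    while day <= last_day:
        result[day] = counts.get(day, 0)
        day += timedelta(days=1)
    return result


example = [date(2019, 5, 1), date(2019, 5, 1), date(2019, 5, 3)]
print(tweets_per_day(example))
```

Filling the gaps explicitly keeps the time axis continuous, so missing days show up as empty bars instead of silently disappearing.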
Section 4 of the Dashboard
Finally, section 4 of the Dashboard showcases a donut chart that informs on the total percentages of positive, negative, and neutral tweets fetched by the data pipeline.
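The percentages can be derived from the sentiment scores as sketched below. The cut-offs are an assumption made for this example (a score above 0 counts as positive, below 0 as negative, and exactly 0 as neutral); the dashboard's actual thresholds may differ.

```python
def sentiment_shares(scores):
    """Return the percentage of positive, negative, and neutral scores.

    Assumed cut-offs: > 0 is positive, < 0 is negative, 0 is neutral.
    """
    counts = {"positive": 0, "negative": 0, "neutral": 0}
    for score in scores:
        if score > 0:
            counts["positive"] += 1
        elif score < 0:
            counts["negative"] += 1
        else:
            counts["neutral"] += 1
    total = len(scores)
    if total == 0:
        return {label: 0.0 for label in counts}
    return {label: 100 * count / total for label, count in counts.items()}


print(sentiment_shares([0.5, -0.2, 0.0, 0.1]))
# {'positive': 50.0, 'negative': 25.0, 'neutral': 25.0}
```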
Interesting Tweets Were Fetched
To conclude, while working with the dashboard we came upon several interesting tweets. Below you can see two of them. In the first, the user is wondering how the Dutch government will end the Groningen gas extraction. In the second, another user is wondering about the cost of making houses ‘carbon zero’. Thus, it seems that discussion about the energy transition in The Netherlands is indeed happening on Twitter.
Thanks for Reading
Thank you for following along, and I hope you found the project interesting. If you felt overwhelmed, that is completely fine: many different technologies are involved and this post is not really a detailed tutorial. If you would like to use the code, feel free to do so, though I would appreciate a star and fork on GitHub :). If you find any mistakes in the post and/or code, I would be very happy to hear about them. If you have any questions, feel free to leave them in the comment section below and/or send them via email to email@example.com.