{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "# **Lab 1** : downloading the data" ], "metadata": { "id": "z-tRpumlVnLA" } }, { "cell_type": "markdown", "source": [ "# **Recent search**\n", "\n", "The recent search endpoint allows you to programmatically access filtered public Tweets posted over the last week, and is available to all developers who have a developer account and are using keys and tokens from an App within a Project.\n", "\n", "You can authenticate your requests with OAuth 1.0a User Context, OAuth 2.0 App-Only, or OAuth 2.0 Authorization Code with PKCE. However, if you would like to receive private metrics, or a breakdown of organic and promoted metrics within your Tweet results, you will have to use OAuth 1.0a User Context or OAuth 2.0 Authorization Code with PKCE, and pass user Access Tokens that are associated with the user that published the given content. \n", "\n", "This endpoint can deliver up to 100 Tweets per request in reverse-chronological order, and pagination tokens are provided for paging through large sets of matching Tweets. \n", "\n", "When using a Project with Essential or Elevated access, you can use the basic set of operators and can make queries up to 512 characters long. When using a Project with Academic Research access or Enterprise access, you have access to additional operators.\n", "\n", "Rate limit: App rate limit (Application-only): 450 requests per 15-minute window shared among all users of your app" ], "metadata": { "id": "hRt9s5I1P2qS" } }, { "cell_type": "markdown", "source": [ "## Updates (especially important for people with academic research access):\n", "\n", "- counts endpoint \n", "- download of data by date\n", "- error logs" ], "metadata": { "id": "VYbfzTyeX7E5" } }, { "cell_type": "markdown", "metadata": { "id": "21NaYdhRr9i7" }, "source": [ "## Set up\n", "\n", "1. Tokens and keys should not be kept in colab notebooks. I advise you to \n", "delete them from the code after every use. Here we create an environmental variable and save the token in it. It will be deleted on restart of the runtime.\n", "\n", "2. In order to efficiently save your data it's best to connect the notebook with your Google Drive (n.b: this doesn't work with university's account - you need to use Gmail). This way you don't risk losing your data if the notebook malfunctions and you can later access it from a notebook again. \n", "\n", "3. Libraries that we need (for now) are *requests*, to handle our API requests, and *pandas* to easily view and store the results.\n", "\n", "4. 
Colab environment offers loads of pre-installed Python libraries (you can check which ones by executing `!pip freeze`); *however*, if your library is not on the list you need to install it manually.\n" ] },
{ "cell_type": "code", "metadata": { "id": "gNFOaZPIxyS2" }, "source": [ "import os\n", "# paste your own bearer token here - and remember to delete it before sharing the notebook\n", "os.environ['TOKEN'] = \"<YOUR_BEARER_TOKEN>\"" ], "execution_count": null, "outputs": [] },
{ "cell_type": "code", "metadata": { "id": "GjUQcJpfrecr" }, "source": [ "from google.colab import drive\n", "drive.mount('/content/drive')" ], "execution_count": null, "outputs": [] },
{ "cell_type": "code", "source": [ "path = \"/content/\"\n", "error_log_path = \"/content/\"" ], "metadata": { "id": "g4SEK7azWzqU" }, "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "metadata": { "id": "Nwy6MeGur9i-" }, "source": [ "### Import libraries" ] },
{ "cell_type": "code", "metadata": { "id": "8ge5EaEur9jA" }, "source": [ "import requests\n", "import pandas as pd\n", "import time" ], "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "metadata": { "id": "OrXv204vyooY" }, "source": [ "If you need to install a library, use the following code and just specify the name of the library you need (here we install the *emoji* library)." ] },
{ "cell_type": "code", "source": [ "!pip freeze" ], "metadata": { "id": "e1S6d4wMP24O" }, "execution_count": null, "outputs": [] },
{ "cell_type": "code", "metadata": { "id": "N6RYD_oCoJ3C" }, "source": [ "!pip install emoji" ], "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "source": [ "## Step 1: authenticate\n", "\n", "In order to authenticate our request, we need to create a request header and add an authorization field. You can authorize a request by using the bearer token or the API consumer/secret keys. Here we do it with the bearer token for the sake of simplicity.
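The resulting header is just a Python dictionary of the form `{'Authorization': 'Bearer <your token>'}` - the helper function below builds it for us.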
\n", "\n", "You can read more about it here: https://developer.twitter.com/en/docs/authentication/overview" ], "metadata": { "id": "9I8xu_aIQY9v" } },
{ "cell_type": "markdown", "metadata": { "id": "IYIP9Ta4r9jR" }, "source": [ "\n", "### Set up headers" ] },
{ "cell_type": "code", "metadata": { "id": "pKWoILn7AMzj" }, "source": [ "def create_headers(bearer_token):\n", "    headers = {\"Authorization\": \"Bearer {}\".format(bearer_token)}\n", "    return headers" ], "execution_count": null, "outputs": [] },
{ "cell_type": "code", "metadata": { "id": "W_buwaoAyPLM" }, "source": [ "headers = create_headers(os.environ['TOKEN'])" ], "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "source": [ "## Step 2: build a search query\n", "\n", "**Ingredients**: endpoint, parameters and operators\n", "\n", "For the endpoint we use: https://api.twitter.com/2/tweets/search/recent\n", "\n", "**Example parameters**:\n", "\n", "* query: the text of your search (required) - max 512 characters\n", "* end_time: the newest date for your tweets\n", "* start_time: the oldest date for your tweets (date format: YYYY-MM-DDTHH:mm:ssZ, ISO 8601/RFC 3339)\n", "* max_results: between 10 (default) and 100\n", "* tweet_fields: which fields to get (if empty, you only get id, text and edit history)\n", "* user_fields, place_fields, expansions\n", "* next_token: to get the next page of results\n", "\n", "\n", "**Example operators**: keyword (menstruation), exact phrase (\"sexual education\"), hashtag (\"#metoo\"), emoji (😬), logical operators (AND = a blank space, OR, NOT), from: or to: (tweets from a user or directed to a user), @ (tweets that mention the user, e.g. @NASA), is:retweet, is:reply, is:quote, lang: (\"en\")\n", "\n", "Grouping is done with parentheses, e.g. (#prolife abortion) OR (#prochoice abortion)\n", "\n", "See more here:\n", "\n", "Operators: https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query\n", "\n", "Parameters: https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-recent\n", "\n", "\n" ], "metadata": { "id": "Nt2jiBSDS-XQ" } },
{ "cell_type": "code", "metadata": { "id": "krWb9m_z_fgF" }, "source": [ "def create_url(query, start_time, end_time, max_results, expansions, tweet_fields, user_fields, place_fields, endpoint):\n", "\n", "    search_url = endpoint  #change to the endpoint you want to collect data from\n", "\n", "    #change the params based on the endpoint you are using\n", "    #we can also request different fields, e.g. ids of users ... \n",
\n", " query_params = {'query': query,\n", " 'end_time': end_time,\n", " 'start_time': start_time,\n", " 'max_results': max_results,\n", " 'expansions': expansions,\n", " 'tweet.fields': tweet_fields,\n", " 'user.fields': user_fields,\n", " 'place.fields': place_fields}\n", "\n", " return (search_url, query_params)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ef5513cb" }, "outputs": [], "source": [ "def connect_to_endpoint(url, headers, params, next_token = None):\n", " #only change the default value of next_token if it is a real value returned in the response\n", " if next_token is not None and next_token != '':\n", " params['next_token'] = next_token\n", " #create a \"GET\" request to the specified url, add headers and parameters\n", " response = requests.request(\"GET\", url, headers = headers, params = params)\n", " if response.status_code != 200:\n", " #if something goes wrong, we need to know\n", " raise Exception(response.status_code, response.text)\n", " #otherwise, we want the payload of our response, which contains our tweet(s)\n", " return response.json()" ] }, { "cell_type": "markdown", "source": [ "Improved error logging." ], "metadata": { "id": "cPWfTHNBY9fY" } }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "3024b32a" }, "outputs": [], "source": [ "def get_data(query, start_time, end_time, max_results, expansions, tweet_fields, user_fields, place_fields, endpoint, next_token=\"\"):\n", " \n", " results = []\n", "\n", "\n", " while next_token is not None:\n", " try: \n", " url = create_url(query, start_time, end_time, max_results, expansions, tweet_fields, user_fields, place_fields, endpoint)\n", " json_response = connect_to_endpoint(url[0], headers, url[1], next_token)\n", " #if we have results, they will be in the field 'data' of our response\n", " if \"data\" in json_response:\n", " results.extend(json_response[\"data\"])\n", " print(str(len(json_response[\"data\"])) + \" Tweets downloaded in this batch.\")\n", " #the next_token is added to the field 'meta' of our response\n", " if \"meta\" in json_response:\n", " if \"next_token\" in json_response[\"meta\"].keys():\n", " next_token = json_response[\"meta\"][\"next_token\"] \n", " else:\n", " next_token = None\n", " else:\n", " next_token = None\n", "\n", " \n", " #to control the rate limit we need to slow down our download\n", " time.sleep(3)\n", "\n", " except Exception as e:\n", " print(\"Error occured\", e)\n", " print(\"Next token value\", next_token)\n", " error_log = {\"Error\":e, \"Next token\":next_token, \"Day\":start_time, \n", " \"Downloaded\":len(results)}\n", " pd.DataFrame.from_dict(error_log, orient=\"index\").to_csv(error_log_path+query+\"_\"+start_time+\"_\"+next_token+\".csv\")\n", " return(results, next_token)\n", "\n", " print(\"Done\")\n", " \n", " return (results, next_token)" ] }, { "cell_type": "markdown", "source": [ "To anticipate the amount of data we'll need to download we can use the counts endpoint (https://developer.twitter.com/en/docs/twitter-api/tweets/counts/api-reference/get-tweets-counts-all).\n", "\n", "*ONLY AVAILABLE WITH ACADEMIC RESEARCH ACCESS*" ], "metadata": { "id": "4k426TcDYKjv" } }, { "cell_type": "code", "source": [ "import time\n", "from matplotlib import pyplot as plt\n", "from matplotlib.dates import MonthLocator, DateFormatter, DayLocator" ], "metadata": { "id": "3InaWYvAo5TH" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "def 
"def create_url_counts(keyword, start_date, end_date, endpoint):\n", "    search_url = endpoint\n", "    query_params = {'query': keyword,\n", "                    'start_time': start_date,\n", "                    'end_time': end_date,\n", "                    }\n", "    return (search_url, query_params)" ], "metadata": { "id": "BONGVt8-wuFR" }, "execution_count": null, "outputs": [] },
{ "cell_type": "code", "source": [ "def connect_to_endpoint_counts(url, headers, params, next_token = None):\n", "    #print(next_token)\n", "    if next_token is not None and next_token != '':\n", "        params['next_token'] = next_token\n", "    response = requests.request(\"GET\", url, headers = headers, params = params)\n", "    if response.status_code != 200:\n", "        raise Exception(response.status_code, response.text)\n", "    return response.json()" ], "metadata": { "id": "A81jHlznxIak" }, "execution_count": null, "outputs": [] },
{ "cell_type": "code", "source": [ "def get_data_counts(keyword, start_time, end_time, next_token, endpoint):\n", "    results = 0\n", "    result_count = {}\n", "\n", "    while next_token is not None:\n", "        try:\n", "            url = create_url_counts(keyword, start_time, end_time, endpoint)\n", "            json_response = connect_to_endpoint_counts(url[0], headers, url[1], next_token)\n", "            if \"data\" in json_response:\n", "                for date in json_response[\"data\"]:\n", "                    result_count[date[\"start\"]] = date[\"tweet_count\"]\n", "            if \"meta\" in json_response:\n", "                if \"next_token\" in json_response[\"meta\"].keys():\n", "                    next_token = json_response[\"meta\"][\"next_token\"]\n", "                else:\n", "                    next_token = None\n", "                if \"total_tweet_count\" in json_response[\"meta\"].keys():\n", "                    results += int(json_response[\"meta\"][\"total_tweet_count\"])\n", "            else:\n", "                next_token = None\n", "        except Exception as e:\n", "            print(\"Error occurred\", e)\n", "            #if a request fails, stop paging and return what we have so far instead of retrying the same token forever\n", "            return (results, next_token, result_count)\n", "\n", "    return (results, next_token, result_count)" ], "metadata": { "id": "ZKsBHCU_yyyJ" }, "execution_count": null, "outputs": [] },
{ "cell_type": "code", "source": [ "start_time = \"2012-12-01T00:00:00.000Z\"\n", "end_time = \"2012-12-15T00:00:00.000Z\"\n", "query_text = \"\"\n", "endpoint = \"\"\n", "path = \"\"\n", "max_results = 100\n", "no_days = 15" ], "metadata": { "id": "IV2S_dXYYhoU" }, "execution_count": null, "outputs": [] },
{ "cell_type": "code", "source": [ "results, _, count = get_data_counts(query_text, start_time, end_time, \"\", \"https://api.twitter.com/2/tweets/counts/all\")" ], "metadata": { "id": "aY4FONPz0CYu" }, "execution_count": null, "outputs": [] },
{ "cell_type": "code", "source": [ "results" ], "metadata": { "id": "wOndeEvT3eXe" }, "execution_count": null, "outputs": [] },
{ "cell_type": "code", "source": [ "count_df = pd.DataFrame.from_dict(count, orient=\"index\").reset_index().rename(columns={\"index\": \"date\", 0: \"count\"})" ], "metadata": { "id": "esko_Hlr1KpX" }, "execution_count": null, "outputs": [] },
{ "cell_type": "code", "source": [ "count_df = count_df.sort_values(by=\"date\").reset_index().drop(columns=\"index\")" ], "metadata": { "id": "y9K0C_WY0B9K" }, "execution_count": null, "outputs": [] },
{ "cell_type": "code", "source": [ "count_df.set_index(\"date\", inplace=True)\n", "ax = count_df.plot()\n", "\n", "ax.tick_params(axis=\"x\", labelrotation=90)\n", "\n", "plt.tight_layout()\n", "plt.show()" ], "metadata": { "id": "aGc3H8ml1wS9" }, "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "source": [ "## Step 3: download and save the data\n", "\n", "We call the function, filling in the desired parameters.
We convert the data into a pandas dataframe to easily manipulate it (view, edit, save). We save the data in the pickle format, so we can recover the exact data types later." ], "metadata": { "id": "efu38OoOgHk0" } },
{ "cell_type": "code", "source": [ "!mkdir /content/drive/MyDrive/[TwitterData]/" ], "metadata": { "id": "zVazx3wnL3kk" }, "execution_count": null, "outputs": [] },
{ "cell_type": "code", "source": [ "start_time = \"2022-11-25T13:00:00.000Z\"\n", "end_time = \"2022-11-25T13:00:20.000Z\"\n", "query_text = \"#Qatar2022\"\n", "endpoint = \"https://api.twitter.com/2/tweets/search/recent\"\n", "path = \"/content/\"\n", "max_results = 100\n", "no_days = 15" ], "metadata": { "id": "63iHzM0BbZMW" }, "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "source": [ "Simple search" ], "metadata": { "id": "cYfAdJYbacXO" } },
{ "cell_type": "code", "source": [ "tweets = get_data(query_text, start_time = start_time, end_time = end_time,\n", "                  max_results=max_results, expansions='author_id,in_reply_to_user_id,geo.place_id',\n", "                  tweet_fields='id,text,author_id,in_reply_to_user_id,geo,conversation_id,created_at,lang,public_metrics,referenced_tweets,reply_settings,source,entities',\n", "                  user_fields='id,name,username,created_at,description,public_metrics,verified',\n", "                  place_fields='full_name,id,country,country_code,geo,name,place_type',\n", "                  endpoint=endpoint)[0]\n", "tweets_df = pd.DataFrame(tweets)\n", "tweets_df.to_pickle(path+\"_tweets.pkl\")" ], "metadata": { "id": "zuGQZBpEaZDU" }, "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "source": [ "Download batch data from a list of dates" ], "metadata": { "id": "VUnVBTFkajnp" } },
{ "cell_type": "code", "source": [ "import datetime\n", "import dateutil.parser\n", "from datetime import datetime as dt\n", "from datetime import timedelta" ], "metadata": { "id": "1fENOkQEkUlr" }, "execution_count": null, "outputs": [] },
{ "cell_type": "code", "source": [ "dates = []\n", "\n", "for i in range(no_days+2):\n", "    print(i)\n", "    date = dateutil.parser.parse(start_time)\n", "    date = date + timedelta(days=i)\n", "    date = date.strftime(\"%Y-%m-%dT%H:%M:%S.000Z\")\n", "    dates.append(date)\n", "\n", "print(dates)\n", "print(len(dates))" ], "metadata": { "id": "SPGQgUHMlHiG" }, "execution_count": null, "outputs": [] },
{ "cell_type": "code", "source": [ "for i in range(no_days+1):\n", "    print(\"Downloading tweets from date \"+dates[i])\n", "    tweets = get_data(query_text, start_time = dates[i], end_time = dates[i+1],\n", "                      max_results=max_results, expansions='author_id,in_reply_to_user_id,geo.place_id',\n", "                      tweet_fields='id,text,author_id,in_reply_to_user_id,geo,conversation_id,created_at,lang,public_metrics,referenced_tweets,reply_settings,source,entities',\n", "                      user_fields='id,name,username,created_at,description,public_metrics,verified',\n", "                      place_fields='full_name,id,country,country_code,geo,name,place_type',\n", "                      endpoint=endpoint)[0]\n", "    tweets_df = pd.DataFrame(tweets)\n", "    tweets_df.to_pickle(path+str(i)+\"_tweets.pkl\")" ], "metadata": { "id": "TFaruTRUb0TW" }, "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "source": [ "## Step 4: individual work\n", "\n", "- think of a topic you'd like to read some tweets from\n", "- build a query (play around with the logic - can you get only tweets that are not a retweet?); there is an example sketched below\n", "- how many tweets did you get?\n", "- what if you changed the date range?"
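, "\n", "\n", "For example (just a sketch to get you started - adapt it to your own topic): a query like `(#prochoice OR #prolife) abortion lang:en` combines grouping, the OR operator and the lang: operator; the build-a-query page linked above explains how to negate operators such as is:retweet."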
], "metadata": { "id": "OxPOkg0Sgq0E" } }, { "cell_type": "markdown", "metadata": { "id": "nUGfn_G3qcLV" }, "source": [ "# **Lab 2: working with Twitter data**" ] }, { "cell_type": "markdown", "metadata": { "id": "H02ZoRcGtpaY" }, "source": [ "### Step 1: Loading the data" ] }, { "cell_type": "markdown", "metadata": { "id": "VWhc-_nNtsnT" }, "source": [ "If you want to load the results you have previously saved, simply execute the next code, specifying the path to the file.\n", "\n", "You will need to either upload it to the Colab workspace or copy the path to the file on Drive." ] }, { "cell_type": "code", "metadata": { "id": "1LlenT11t_Xp" }, "source": [ "tweets_df = pd.read_pickle(path+\"tweets.pkl\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "tweets_df.columns" ], "metadata": { "id": "LSfF5_lBbDix" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "RNgY35TclZC6" }, "source": [ "tweets_df" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "k9ItAgz7uga9" }, "source": [ "### Step 2: Preprocessing the data" ] }, { "cell_type": "markdown", "metadata": { "id": "WSvM1_xzu9Uc" }, "source": [ "In our dataframe we have the entire Tweet object. Some columns that might be of particular interest to us are: \n", "\n", "* created_at - date when Tweet was posted\n", "* id - unique Tweet identifier\n", "* text - the content of the Tweet\n", "* author_id - unique Tweet identifier\n", "* retweeted_status - information about the original Tweet\n", "* public metrics - quote/reply/retweet/favorite count\n", "* entities - hashtags, urls, annotations present in Tweet\n", "\n", "We can filter the dataframe and keep only columns we are interested in. You can pick which columns you'd like to keep and put them int the column_list below.\n", "\n" ] }, { "cell_type": "code", "metadata": { "id": "AZYiG40aumbk" }, "source": [ "tweets_filtered = tweets_df.copy() #it's a good idea to work on the copy of original dataframe, so we can always go back to it if we mess something up\n", "column_list = [\"id\",\"author_id\",\"created_at\", \"text\",\"entities\",\"public_metrics\", \"lang\"]\n", "tweets_filtered = tweets_filtered[column_list]" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "qQ90m-FpxM9N" }, "source": [ "tweets_filtered" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "hPgzBUJj0SZU" }, "source": [ "### Step 3: Extracting words/hashtags" ] }, { "cell_type": "markdown", "metadata": { "id": "CygaLHAS0nzP" }, "source": [ "There are many ways to build networks from the data we download from Twitter.\n", "\n", "One possibility is to have a bipartite network of Tweets and words/hashtags and then observe word, hashtag or word-hashtag projections." ] }, { "cell_type": "markdown", "metadata": { "id": "H1nTCbRc0-__" }, "source": [ "#### Extracting words" ] }, { "cell_type": "markdown", "metadata": { "id": "8pGnokgK1jma" }, "source": [ "In order to extract words, we first need to clean the Tweet text. This way we will remove punctuation, hashtags/mentions/urls (they are preserved in the entity column anyway). We will also turn all letters to lowercase.\n", "\n", "You can also consider removing stopwords, removing words that are not in the english language corpora, lematizing the words, etc. I suggest you research nltk library and its possibilities." 
] }, { "cell_type": "code", "metadata": { "id": "y5746Mq918dG" }, "source": [ "import re\n", "import string\n", "# NLTK tools\n", "import nltk\n", "nltk.download('words')\n", "words = set(nltk.corpus.words.words())\n", "nltk.download('stopwords')\n", "stop_words = nltk.corpus.stopwords.words(\"english\")\n", "nltk.download('wordnet')\n", "from nltk.corpus import wordnet as wn\n", "from nltk.stem.wordnet import WordNetLemmatizer\n", "from nltk import pos_tag\n", "nltk.download('omw-1.4')\n", "nltk.download('averaged_perceptron_tagger')\n", "from collections import defaultdict\n", "tag_map = defaultdict(lambda : wn.NOUN)\n", "tag_map['J'] = wn.ADJ\n", "tag_map['V'] = wn.VERB\n", "tag_map['R'] = wn.ADV" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "Z8Nrv5jv1e5W" }, "source": [ "def cleaner(tweet):\n", " tweet = re.sub(\"@[A-Za-z0-9]+\",\"\",tweet) # remove mentions\n", " tweet = re.sub(\"#[A-Za-z0-9]+\", \"\",tweet) # remove hashtags\n", " tweet = re.sub(r\"(?:\\@|http?\\://|https?\\://|www)\\S+\", \"\", tweet) # remove http links\n", " tweet = \" \".join(tweet.split())\n", " tweet = \" \".join(w for w in nltk.wordpunct_tokenize(tweet) if w.lower() in words and not w.lower() in stop_words)\n", " #remove stop words\n", " lemma_function = WordNetLemmatizer()\n", " tweet = \" \".join(lemma_function.lemmatize(token, tag_map[tag[0]]) for token, tag in nltk.pos_tag(nltk.wordpunct_tokenize(tweet))) #lemmatize\n", " tweet = str.lower(tweet) #to lowercase\n", " return tweet" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "fEhs-tyy2naZ" }, "source": [ "tweets_filtered[\"clean_text\"] = tweets_filtered[\"text\"].map(cleaner)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "tweets_filtered" ], "metadata": { "id": "TPcEWnQuco41" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "I1_lSQZ85rT_" }, "source": [ "We are going to loop through the dataframe and then through the words in the clean text. We are going to add the words as keys to dictionary and use their frequencies as values." ] }, { "cell_type": "code", "source": [ "tweets_filtered.loc[tweets_filtered[\"clean_text\"].isnull(),\"clean_text\"] = \"\"" ], "metadata": { "id": "d8paxTEfe1se" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "zlHlvVlua3bq" }, "source": [ "tweet_tokenizer = nltk.TweetTokenizer()\n", "\n", "#initialize an empty dict\n", "unique_words = {}\n", "\n", "for idx, row in tweets_filtered.iterrows():\n", " if row[\"clean_text\"] != \"\":\n", " for word in tweet_tokenizer.tokenize(row[\"clean_text\"]):\n", " unique_words.setdefault(word,0)\n", " unique_words[word] += 1" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "IgMjaVe-a3bt" }, "source": [ "uw_df = pd.DataFrame.from_dict(unique_words, orient='index').reset_index()\n", "uw_df.rename(columns = {'index':'Word', 0:'Count'}, inplace=True)\n", "uw_df.sort_values(by=['Count'], ascending=False, inplace=True)\n", "uw_df = uw_df.reset_index().drop(columns=[\"index\"])" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "kVYvDBQu57If" }, "source": [ "We can inspect the words as a dataframe. \n", "\n", "\n", "You can always save this dataframe as .csv for future reference." 
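, "\n", "\n", "If you later want to reload it, something like `pd.read_csv(path + \"words.csv\", index_col=0)` restores the same table (adjust the path to wherever you saved the file)."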
] }, { "cell_type": "code", "source": [ "uw_df" ], "metadata": { "id": "eZW8-9gYbHJk" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "uw_df.to_csv(path+\"words.csv\")" ], "metadata": { "id": "FJ-eBTJWbH-r" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "C4R-j-FN7Rgo" }, "source": [ "#### Extracting the hashtags" ] }, { "cell_type": "markdown", "metadata": { "id": "O8SfJwmf9GU2" }, "source": [ "We are going to loop through the dataframe and then through the hashtags in the entities. We are going to add the hashtags as keys to dictionary and use their frequencies as values. At the same time, we are going to save them in a list and add them to a separate column to facilitate our future work." ] }, { "cell_type": "code", "source": [ "tweets_filtered.loc[tweets_df[\"entities\"].isnull(), \"entities\"] = None" ], "metadata": { "id": "ph7grrfEX_kl" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "tweets_filtered[\"hashtags\"] = \"\"" ], "metadata": { "id": "gDTe7Idvi5Zo" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "9_PVYuyase5Y" }, "source": [ "unique_hashtags = {}\n", "index = 0\n", "\n", "for idx, row in tweets_filtered.iterrows():\n", " if row[\"entities\"] is not None and \"hashtags\" in row[\"entities\"]:\n", " hl = []\n", " for hashtag in row[\"entities\"][\"hashtags\"]:\n", " tag = \"#\"+hashtag[\"tag\"].lower()\n", " unique_hashtags.setdefault(tag, 0)\n", " unique_hashtags[tag] += 1\n", " hl.append(tag)\n", " \n", " tweets_filtered.at[idx,\"hashtags\"] = hl" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "WPMRJH8wX2rF" }, "source": [ "unique_hashtags = dict(sorted(unique_hashtags.items(), key=lambda item: item[1], reverse=True))" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "Tw1AqDdvX2rH" }, "source": [ "uh_df = pd.DataFrame.from_dict(unique_hashtags, orient='index').reset_index()\n", "uh_df.rename(columns = {'index':'Hashtag', 0:'Count'}, inplace=True)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "nx4klPFuX2rI" }, "source": [ "uh_df[0:50]" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "uCb7arJqUXOu" }, "source": [ "uh_df.to_csv(path+\"hashtags.csv\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "wPaVj1tj9Uw6" }, "source": [ "### Step 4: Building the network" ] }, { "cell_type": "markdown", "metadata": { "id": "vt2co2ep9YCd" }, "source": [ "We are going to use the networkx library, which is a Python library that enables network science analysis of the data.\n", "\n", "We are going to use it to create our network and extract edgelist from it, since we can easily import it to Gephi (a software we are going to see in visualization labs).\n", "\n", "However, it offers implemented algorithms for analysis (for example PageRank) that you can use out-of-box to analyze your network." ] }, { "cell_type": "markdown", "metadata": { "id": "gnd62ng6-GLW" }, "source": [ "But first, we will loop through our dataframe and connect words and hashtags if they appear together in the same Tweet." 
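, "\n", "\n", "Once the graph `G` is built below, running an out-of-the-box algorithm is a one-liner - for instance, a sketch for the tasks at the end: `nx.pagerank(G, weight=\"weight\")` returns a dictionary mapping each node to its PageRank score."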
] },
{ "cell_type": "code", "metadata": { "id": "BooMyc6-1JWa" }, "source": [ "import itertools\n", "import networkx as nx" ], "execution_count": null, "outputs": [] },
{ "cell_type": "code", "metadata": { "id": "adLbCz86M7SR" }, "source": [ "uh = unique_hashtags.keys()\n", "uw = unique_words.keys()" ], "execution_count": null, "outputs": [] },
{ "cell_type": "code", "metadata": { "id": "UHuQ3rRXOA5_" }, "source": [ "network = {}\n", "network_key = 0\n", "for index, row in tweets_filtered.iterrows():\n", "    combined_list = [hashtag for hashtag in row[\"hashtags\"]] + [word for word in str.split(row[\"clean_text\"], \" \") if word in uw]\n", "    #itertools.product creates the Cartesian product of the combined list with itself\n", "    for pair in itertools.product(combined_list, combined_list):\n", "        #exclude self-loops and count each pair only once, because our graph is undirected and we do not take self-loops into account\n", "        if pair[0] != pair[1] and not (pair[::-1] in network):\n", "            network.setdefault(pair, 0)\n", "            network[pair] += 1\n", "\n", "network_df = pd.DataFrame.from_dict(network, orient=\"index\")" ], "execution_count": null, "outputs": [] },
{ "cell_type": "code", "metadata": { "id": "8uThrYGHSdEe" }, "source": [ "network_df.reset_index(inplace=True)\n", "network_df.columns = [\"pair\",\"weight\"]\n", "network_df.sort_values(by=\"weight\", inplace=True, ascending=False)\n", "network_df" ], "execution_count": null, "outputs": [] },
{ "cell_type": "code", "metadata": { "scrolled": false, "id": "NJvNvzGXy8Kg" }, "source": [ "#to get a weighted graph we need a list of 3-element tuples (u,v,w) where u and v are nodes and w is a number representing the weight\n", "up_weighted = []\n", "for edge in network:\n", "    #we can filter edges by weight by uncommenting the next line and setting the desired weight threshold\n", "    #if network[edge] > 1:\n", "    up_weighted.append((edge[0], edge[1], network[edge]))\n", "\n", "G = nx.Graph()\n", "G.add_weighted_edges_from(up_weighted)" ], "execution_count": null, "outputs": [] },
{ "cell_type": "code", "metadata": { "id": "eSneLIqZNvt1" }, "source": [ "print(len(G.nodes()))\n", "print(len(G.edges()))" ], "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "metadata": { "id": "Mj3CwR5Cy8Kk" }, "source": [ "#### Save edgelist" ] },
{ "cell_type": "code", "metadata": { "id": "fFtpm869ONHg" }, "source": [ "filename = path+\"edgelist.csv\"" ], "execution_count": null, "outputs": [] },
{ "cell_type": "code", "metadata": { "id": "PTmGSBc3y8Kn" }, "source": [ "nx.write_weighted_edgelist(G, filename, delimiter=\",\")" ], "execution_count": null, "outputs": [] },
{ "cell_type": "code", "metadata": { "id": "GlboURoYy8Kp" }, "source": [ "#add a header with appropriate column names (works on Colab and Linux; macOS's sed may need a different syntax)\n", "!sed -i.bak 1i\"Source,Target,Weight\" ./edgelist.csv" ], "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "metadata": { "id": "e5oT2lSry8Kq" }, "source": [ "#### Create and save node list\n" ] },
{ "cell_type": "code", "metadata": { "id": "lpef5RKvUu_w" }, "source": [ "word_nodes = pd.DataFrame.from_dict(unique_words, orient=\"index\")\n", "word_nodes.reset_index(inplace=True)\n", "word_nodes[\"Label\"] = word_nodes[\"index\"]\n", "word_nodes.rename(columns={\"index\":\"Id\",0:\"delete\"}, inplace=True)\n", "word_nodes = word_nodes.drop(columns=['delete'])\n", "\n", "word_nodes" ], "execution_count": null, "outputs": [] },
{ "cell_type": "code", "metadata": { "id": "ZMdIcS4my8Ks" }, "source": [ "hashtag_nodes = uh_df.copy()\n",
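"# mirror the word_nodes format: the hashtag text becomes both the node Id and its Label, and the Count column is dropped\n",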
"hashtag_nodes[\"Label\"] = hashtag_nodes[\"Hashtag\"]\n", "hashtag_nodes.rename(columns={\"Hashtag\":\"Id\"},inplace=True)\n", "hashtag_nodes = hashtag_nodes.drop(columns=['Count'])\n", "hashtag_nodes" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "PpE0P0cIU2OD" }, "source": [ "nodelist = hashtag_nodes.append(word_nodes, ignore_index=True)\n", "\n", "nodelist.to_csv(\"nodelist.csv\",index=False)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "ryR3LiMVB-70" }, "source": [ "Tasks: \n", "\n", "* We created a network where nodes are mixed (both words and hashtags). Create network of words only and one of hashtags only.\n", "* Pick one of these network and rank the nodes using PageRank centrality. Extract information about top-20 rated nodes.\n", "* following the procedure for extracting hashtags, extract mentions and annotations\n", "* following the same procedure, extract the public metric counts for tweets\n", "\n", "\n", "\n" ] } ] }