# **Lab 1: downloading the data**

# **Recent search**

The recent search endpoint allows you to programmatically access filtered public Tweets posted over the last week, and is available to all developers who have a developer account and are using keys and tokens from an App within a Project.

You can authenticate your requests with OAuth 1.0a User Context, OAuth 2.0 App-Only, or OAuth 2.0 Authorization Code with PKCE. However, if you would like to receive private metrics, or a breakdown of organic and promoted metrics within your Tweet results, you will have to use OAuth 1.0a User Context or OAuth 2.0 Authorization Code with PKCE, and pass user Access Tokens that are associated with the user that published the given content. 

This endpoint can deliver up to 100 Tweets per request in reverse-chronological order, and pagination tokens are provided for paging through large sets of matching Tweets. 

When using a Project with Essential or Elevated access, you can use the basic set of operators and can make queries up to 512 characters long. When using a Project with Academic Research access or Enterprise access, you have access to additional operators.

Rate limit: App rate limit (Application-only): 450 requests per 15-minute window shared among all users of your app
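As a rough sketch of what this limit means in practice, the arithmetic below uses only the figures quoted above (450 requests per 15-minute window, up to 100 Tweets per request). The function and constant names are ours, chosen for illustration.

```python
# Back-of-the-envelope arithmetic for the recent search rate limit.
WINDOW_SECONDS = 15 * 60      # one rate-limit window
REQUESTS_PER_WINDOW = 450     # app-only limit quoted above
TWEETS_PER_REQUEST = 100      # maximum page size quoted above

def min_seconds_between_requests(requests_per_window=REQUESTS_PER_WINDOW,
                                 window_seconds=WINDOW_SECONDS):
    """Smallest constant delay between requests that never exceeds the limit."""
    return window_seconds / requests_per_window

print(min_seconds_between_requests())            # 2.0
print(REQUESTS_PER_WINDOW * TWEETS_PER_REQUEST)  # 45000
```

A constant pause of 3 seconds between requests, as used in the download function later in this lab, therefore stays below the limit with some margin.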

## Updates (especially important for people with academic research access):

- counts endpoint 
- download of data by date
- error logs

## Set up

1. Tokens and keys should not be kept in Colab notebooks. I advise you to delete them from the code after every use. Here we create an environment variable and save the token in it. It will be deleted when the runtime restarts.

2. To save your data safely, it's best to connect the notebook to your Google Drive (n.b.: this doesn't work with a university account; you need to use a Gmail account). This way you don't risk losing your data if the notebook malfunctions, and you can access it from a notebook again later.

3. Libraries that we need (for now) are *requests*, to handle our API requests, and *pandas* to easily view and store the results.

4. The Colab environment offers loads of pre-installed Python libraries (you can check which ones by executing `!pip freeze`); *however*, if your library is not on the list, you need to install it manually.
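For point 1 above, one possible way to avoid typing the token into a cell at all is to ask for it at run time. This is only a sketch: `load_token` is a hypothetical helper, not part of the lab code, and it assumes the variable name `TOKEN` used in this notebook.

```python
import os
from getpass import getpass

def load_token(env=os.environ, prompt=getpass, name="TOKEN"):
    """Return the token from the environment, asking for it only if it is missing."""
    if not env.get(name):
        # getpass hides the typed value, so it is never stored in the notebook
        env[name] = prompt("Bearer token: ")
    return env[name]

# In a notebook you would call: bearer_token = load_token()
```

The `env` and `prompt` arguments exist only so the helper can be tried with a plain dictionary and a stand-in prompt function.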


In [None]:
import os
os.environ['TOKEN'] = "<YOUR_BEARER_TOKEN>" #paste your own bearer token here and delete it after use

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
path = "/content/"
error_log_path = "/content/"

### Import libraries

In [None]:
import requests 
import pandas as pd 
import time

If you need a library that is not pre-installed, use the following code and specify the name of the library you need (here we install the emoji library).

In [None]:
!pip freeze

In [None]:
!pip install emoji

## Step 1: authenticate

In order to authenticate our request, we need to create a request header and add an authorization field. You can authorize a request by using the bearer token, or the API consumer/secret keys. Here we do it with bearer token for the sake of simplicity.

You can read more about it here: https://developer.twitter.com/en/docs/authentication/overview


### Set up headers

In [None]:
def create_headers(bearer_token):
 headers = {"Authorization": "Bearer {}".format(bearer_token)}
 return headers

In [None]:
headers = create_headers(os.environ['TOKEN'])

## Step 2: build a search query

**Ingredients**: endpoint, parameters and operators

For endpoint we use: https://api.twitter.com/2/tweets/search/recent

**Example parameters**: 

* query: the text of your search (required), max 512 characters
* end_time: the newest date for your tweets
* start_time: the oldest date for your tweets (date format for both: YYYY-MM-DDTHH:mm:ssZ, ISO 8601/RFC 3339)
* max_results: between 10 (default) and 100
* tweet.fields: which fields to get (if empty, you only get id, text and edit history)
* user.fields, place.fields, expansions
* next_token: to get the next page of results
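The date format for `start_time` and `end_time` can be produced from Python `datetime` objects. The helper below is a minimal sketch; `to_api_time` is a name we made up, and it assumes that datetimes without a time zone are already in UTC.

```python
from datetime import datetime, timedelta, timezone

def to_api_time(moment):
    """Format a datetime as YYYY-MM-DDTHH:mm:ssZ in UTC."""
    if moment.tzinfo is None:
        # Assumption: naive datetimes are already expressed in UTC.
        moment = moment.replace(tzinfo=timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

end = datetime(2022, 11, 25, 13, 0, 0, tzinfo=timezone.utc)
start = end - timedelta(hours=6)
print(to_api_time(start))  # 2022-11-25T07:00:00Z
print(to_api_time(end))    # 2022-11-25T13:00:00Z
```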


**Example operators**: keyword (menstruation), exact phrase ("sexual education"), hashtag (#metoo), emoji (😬), logical operators (AND = a blank space, OR, NOT = a `-` in front of the term), from: or to: (tweets from a user or directed to a user), @ (tweets that mention the user, e.g. @NASA), is:retweet, is:reply, is:quote, lang: (e.g. lang:en)

Grouping is done with parentheses, for example: (#prolife abortion) OR (#prochoice abortion)
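To see how these operators combine, here is a small sketch that assembles a query string from groups of terms. `build_query` is a hypothetical helper written for this example; the 512-character check assumes the Essential/Elevated limit mentioned earlier.

```python
MAX_QUERY_LENGTH = 512  # limit quoted above for Essential/Elevated access

def build_query(groups, exclude_retweets=False, lang=None):
    """Combine keyword groups with OR, then append optional filters.

    groups: list of lists; terms inside one group are ANDed (joined by spaces).
    """
    grouped = ["(" + " ".join(terms) + ")" for terms in groups]
    query = " OR ".join(grouped)
    if len(grouped) > 1:
        # Wrap the OR expression so the filters below apply to every group.
        query = "(" + query + ")"
    if exclude_retweets:
        query += " -is:retweet"  # a leading "-" negates the operator
    if lang:
        query += " lang:" + lang
    if len(query) > MAX_QUERY_LENGTH:
        raise ValueError("query is longer than 512 characters")
    return query

print(build_query([["#prolife", "abortion"], ["#prochoice", "abortion"]],
                  exclude_retweets=True, lang="en"))
# ((#prolife abortion) OR (#prochoice abortion)) -is:retweet lang:en
```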

See more here: 

Operators: https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query

Parameters: https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-recent




In [None]:
def create_url(query, start_time, end_time, max_results, expansions, tweet_fields, user_fields, place_fields, endpoint):
 
 search_url = endpoint #Change to the endpoint you want to collect data from

 #change params based on the endpoint you are using
 #also can request different fields, e.g ids of users ... 
 query_params = {'query': query,
 'end_time': end_time,
 'start_time': start_time,
 'max_results': max_results,
 'expansions': expansions,
 'tweet.fields': tweet_fields,
 'user.fields': user_fields,
 'place.fields': place_fields}

 return (search_url, query_params)

In [None]:
def connect_to_endpoint(url, headers, params, next_token = None):
    #only change the default value of next_token if it is a real value returned in the response
    if next_token is not None and next_token != '':
        params['next_token'] = next_token
    #create a "GET" request to the specified url, add headers and parameters
    response = requests.request("GET", url, headers = headers, params = params)
    if response.status_code != 200:
        #if something goes wrong, we need to know
        raise Exception(response.status_code, response.text)
    #otherwise, we want the payload of our response, which contains our tweet(s)
    return response.json()
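A possible refinement, not part of the lab code: when a request is rejected because of the rate limit, the waiting time can be derived from the response headers instead of guessed. The sketch below assumes the `x-rate-limit-reset` header (Unix time at which the window resets) described in the Twitter API documentation. The header name, the helper `seconds_until_reset` and the 60-second fallback should all be treated as assumptions to verify.

```python
import time

def seconds_until_reset(response_headers, now=None):
    """Seconds to wait before retrying, based on the x-rate-limit-reset header."""
    now = time.time() if now is None else now
    reset_at = response_headers.get("x-rate-limit-reset")
    if reset_at is None:
        return 60.0  # assumption: fall back to a fixed one-minute pause
    return max(0.0, float(reset_at) - now)

# Example with made-up values: the window resets 100 seconds from "now".
print(seconds_until_reset({"x-rate-limit-reset": "1000"}, now=900))  # 100.0
```

With the requests library you could pass `response.headers` directly, since its header lookup is case-insensitive.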

Improved error logging.

In [None]:
def get_data(query, start_time, end_time, max_results, expansions, tweet_fields, user_fields, place_fields, endpoint, next_token=""):

    results = []

    while next_token is not None:
        try:
            url = create_url(query, start_time, end_time, max_results, expansions, tweet_fields, user_fields, place_fields, endpoint)
            json_response = connect_to_endpoint(url[0], headers, url[1], next_token)
            #if we have results, they will be in the field 'data' of our response
            if "data" in json_response:
                results.extend(json_response["data"])
                print(str(len(json_response["data"])) + " Tweets downloaded in this batch.")
            #the next_token is added to the field 'meta' of our response
            if "meta" in json_response:
                if "next_token" in json_response["meta"].keys():
                    next_token = json_response["meta"]["next_token"]
                else:
                    next_token = None
            else:
                next_token = None

            #to control the rate limit we need to slow down our download
            time.sleep(3)

        except Exception as e:
            print("Error occurred", e)
            print("Next token value", next_token)
            #save what we know about the failure, so the download can be resumed from this token
            error_log = {"Error":e, "Next token":next_token, "Day":start_time,
                         "Downloaded":len(results)}
            pd.DataFrame.from_dict(error_log, orient="index").to_csv(error_log_path+query+"_"+start_time+"_"+str(next_token)+".csv")
            return (results, next_token)

    print("Done")

    return (results, next_token)
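The paging logic of `get_data` can be tried without network access. In the sketch below, `collect_all_pages` and `fake_pages` are made up for illustration: a dictionary of fake responses stands in for the API, and the loop follows `next_token` until a page no longer contains one.

```python
def collect_all_pages(fetch_page):
    """Call fetch_page(next_token) until the response has no next_token."""
    results = []
    next_token = None
    while True:
        page = fetch_page(next_token)
        results.extend(page.get("data", []))
        next_token = page.get("meta", {}).get("next_token")
        if next_token is None:
            return results

# Stand-in for the API: two fake pages, keyed by the token that requests them.
fake_pages = {
    None: {"data": [{"id": "1"}, {"id": "2"}], "meta": {"next_token": "abc"}},
    "abc": {"data": [{"id": "3"}], "meta": {}},
}
print(len(collect_all_pages(lambda token: fake_pages[token])))  # 3
```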

To anticipate the amount of data we'll need to download we can use the counts endpoint (https://developer.twitter.com/en/docs/twitter-api/tweets/counts/api-reference/get-tweets-counts-all).

*ONLY AVAILABLE WITH ACADEMIC RESEARCH ACCESS*

In [None]:
import time
from matplotlib import pyplot as plt
from matplotlib.dates import MonthLocator, DateFormatter, DayLocator

In [None]:
def create_url_counts(keyword, start_date, end_date, endpoint):
 search_url = endpoint
 query_params = {'query': keyword,
 'start_time': start_date,
 'end_time': end_date,
 }
 return (search_url, query_params) 

In [None]:
def connect_to_endpoint_counts(url, headers, params, next_token = None):
    #print(next_token)
    if next_token is not None and next_token != '':
        params['next_token'] = next_token
    response = requests.request("GET", url, headers = headers, params = params)
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()

In [None]:
def get_data_counts(keyword, start_time, end_time, next_token, endpoint):
    results = 0
    result_count = {}

    while next_token is not None:
        try:
            url = create_url_counts(keyword, start_time, end_time, endpoint)
            json_response = connect_to_endpoint_counts(url[0], headers, url[1], next_token)
            if "data" in json_response:
                for date in json_response["data"]:
                    result_count[date["start"]] = date["tweet_count"]
            if "meta" in json_response:
                if "next_token" in json_response["meta"].keys():
                    next_token = json_response["meta"]["next_token"]
                else:
                    next_token = None
                if "total_tweet_count" in json_response["meta"].keys():
                    results += int(json_response["meta"]["total_tweet_count"])
            else:
                next_token = None
        except Exception as e:
            print("Error occurred", e)
            #stop here: retrying the same failing request in a loop would never end
            break
    #print("Done")

    return (results, next_token, result_count)

In [None]:
start_time = "2012-12-01T00:00:00.000Z"
end_time = "2012-12-15T00:00:00.000Z"
query_text = ""
endpoint = ""
path = ""
max_results = 100
no_days = 15

In [None]:
results, _, count = get_data_counts(query_text, start_time, end_time, "", "https://api.twitter.com/2/tweets/counts/all")

In [None]:
results
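Once the total count is known, it can be turned into a rough estimate of the download effort. This sketch assumes 100 Tweets per request and the 3-second pause used in `get_data`. `estimate_download` is a hypothetical helper, and the estimate ignores the time the requests themselves take.

```python
import math

def estimate_download(total_tweets, page_size=100, pause_seconds=3):
    """Return (number of requests, minimum minutes spent pausing)."""
    n_requests = math.ceil(total_tweets / page_size)
    minutes = n_requests * pause_seconds / 60
    return n_requests, minutes

print(estimate_download(25_000))  # (250, 12.5)
```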

In [None]:
count_df = pd.DataFrame.from_dict(count, orient="index").reset_index().rename(columns={"index": "date", 0: "count"})

In [None]:
count_df=count_df.sort_values(by=("date")).reset_index().drop(columns="index")

In [None]:
count_df.set_index("date", inplace=True)
ax = count_df.plot()

ax.tick_params(axis="x", labelrotation= 90)

plt.tight_layout()
plt.show()

## Step 3: download and save the data

We call the function, filling in the desired parameters. We convert the data into a pandas dataframe to easily manipulate it (view, edit, save). We save the data in the pickle format, so we can recover the exact data types later.
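Why the data types matter can be seen in a small experiment. To stay self-contained, the sketch uses the standard-library `pickle` and `csv` modules on a single made-up row rather than pandas. The nested `public_metrics` dictionary survives a pickle round trip but comes back from CSV as plain text.

```python
import csv
import io
import pickle

row = {"id": "1", "public_metrics": {"retweet_count": 3, "like_count": 7}}

# Pickle round trip: the nested dictionary comes back as a dictionary.
restored = pickle.loads(pickle.dumps(row))
print(type(restored["public_metrics"]).__name__)  # dict

# CSV round trip: every value is written and read back as plain text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))
print(type(from_csv["public_metrics"]).__name__)  # str
```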

In [None]:
!mkdir /content/drive/MyDrive/[TwitterData]/

In [None]:
start_time = "2022-11-25T13:00:00.000Z"
end_time = "2022-11-25T13:00:20.000Z"
query_text = "#Qatar2022"
endpoint = "https://api.twitter.com/2/tweets/search/recent/"
path = "/content/"
max_results = 100
no_days = 15

Simple search

In [None]:
tweets = get_data(query_text, start_time = start_time, end_time = end_time, 
 max_results=max_results, expansions='author_id,in_reply_to_user_id,geo.place_id', 
 tweet_fields='id,text,author_id,in_reply_to_user_id,geo,conversation_id,created_at,lang,public_metrics,referenced_tweets,reply_settings,source,entities',
 user_fields='id,name,username,created_at,description,public_metrics,verified',
 place_fields='full_name,id,country,country_code,geo,name,place_type',
 endpoint=endpoint)[0] 
tweets_df = pd.DataFrame(tweets)
tweets_df.to_pickle(path+"_tweets.pkl")

Download batch data from list of dates

In [None]:
import datetime
import dateutil.parser
from datetime import datetime as dt
from datetime import timedelta

In [None]:
dates = []

for i in range(no_days+2):
 print(i)
 date = dateutil.parser.parse(start_time)
 date = date + timedelta(days=i)
 date = date.strftime("%Y-%m-%dT%H:%M:%S.000Z")
 dates.append(date)

print(dates)
print(len(dates))
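The same list of dates can also be expressed as explicit (start, end) pairs, which makes it easier to see which interval each download covers. This is a sketch using only the standard library; `daily_windows` is a hypothetical helper and it assumes timestamps in the `.000Z` format used above.

```python
from datetime import datetime, timedelta

API_FORMAT = "%Y-%m-%dT%H:%M:%S.000Z"

def daily_windows(start_time, n_days):
    """Return n_days consecutive (start, end) pairs, each 24 hours long."""
    first = datetime.strptime(start_time, API_FORMAT)
    edges = [first + timedelta(days=i) for i in range(n_days + 1)]
    return [(a.strftime(API_FORMAT), b.strftime(API_FORMAT))
            for a, b in zip(edges, edges[1:])]

for window in daily_windows("2022-11-25T13:00:00.000Z", 2):
    print(window)
# ('2022-11-25T13:00:00.000Z', '2022-11-26T13:00:00.000Z')
# ('2022-11-26T13:00:00.000Z', '2022-11-27T13:00:00.000Z')
```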

In [None]:
for i in range(no_days+1):
 print("Downloading tweets from date "+dates[i])
 tweets = get_data(query_text, start_time = dates[i], end_time = dates[i+1], 
 max_results=max_results, expansions='author_id,in_reply_to_user_id,geo.place_id', 
 tweet_fields='id,text,author_id,in_reply_to_user_id,geo,conversation_id,created_at,lang,public_metrics,referenced_tweets,reply_settings,source,entities',
 user_fields='id,name,username,created_at,description,public_metrics,verified',
 place_fields='full_name,id,country,country_code,geo,name,place_type',
 endpoint=endpoint)[0] 
 tweets_df = pd.DataFrame(tweets)
 tweets_df.to_pickle(path+str(i)+"_tweets.pkl") 
 

## Step 4: individual work

- think of a topic you'd like to read some tweets from
- build a query (play around with the logic - can you get only tweets that are not a retweet?)
- how many tweets did you get?
- what if you changed the date range?

# **Lab 2: working with Twitter data**

### Step 1: Loading the data

If you want to load the results you have previously saved, simply execute the next code, specifying the path to the file.

You will need to either upload it to the Colab workspace or copy the path to the file on Drive.

In [None]:
tweets_df = pd.read_pickle(path+"_tweets.pkl") #file name used when saving the simple search in Lab 1; change it to load another file

In [None]:
tweets_df.columns

In [None]:
tweets_df

### Step 2: Preprocessing the data

In our dataframe we have the entire Tweet object. Some columns that might be of particular interest to us are: 

* created_at - date when Tweet was posted
* id - unique Tweet identifier
* text - the content of the Tweet
* author_id - unique identifier of the Tweet's author
* referenced_tweets - the original Tweet(s) that this Tweet retweets, quotes or replies to
* public_metrics - retweet/reply/like/quote count
* entities - hashtags, urls, annotations present in Tweet

We can filter the dataframe and keep only the columns we are interested in. You can pick which columns you'd like to keep and put them in the column_list below.



In [None]:
tweets_filtered = tweets_df.copy() #it's a good idea to work on the copy of original dataframe, so we can always go back to it if we mess something up
column_list = ["id","author_id","created_at", "text","entities","public_metrics", "lang"]
tweets_filtered = tweets_filtered[column_list]

In [None]:
tweets_filtered

### Step 3: Extracting words/hashtags

There are many ways to build networks from the data we download from Twitter.

One possibility is to have a bipartite network of Tweets and words/hashtags and then observe word, hashtag or word-hashtag projections.

#### Extracting words

In order to extract words, we first need to clean the Tweet text. This way we will remove punctuation and hashtags/mentions/urls (they are preserved in the entities column anyway). We will also turn all letters to lowercase.

You can also consider removing stopwords, removing words that are not in the English language corpora, lemmatizing the words, etc. I suggest you research the nltk library and its possibilities.

In [None]:
import re
import string
# NLTK tools
import nltk
nltk.download('words')
words = set(nltk.corpus.words.words())
nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words("english")
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import pos_tag
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
from collections import defaultdict
tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

In [None]:
def cleaner(tweet):
    tweet = re.sub(r"@\w+", "", tweet) # remove mentions (usernames may contain underscores)
    tweet = re.sub(r"#\w+", "", tweet) # remove hashtags
    tweet = re.sub(r"(?:\@|http?\://|https?\://|www)\S+", "", tweet) # remove http links
    tweet = " ".join(tweet.split())
    #keep only English dictionary words that are not stop words
    tweet = " ".join(w for w in nltk.wordpunct_tokenize(tweet) if w.lower() in words and not w.lower() in stop_words)
    lemma_function = WordNetLemmatizer()
    tweet = " ".join(lemma_function.lemmatize(token, tag_map[tag[0]]) for token, tag in nltk.pos_tag(nltk.wordpunct_tokenize(tweet))) #lemmatize
    tweet = str.lower(tweet) #to lowercase
    return tweet

In [None]:
tweets_filtered["clean_text"] = tweets_filtered["text"].map(cleaner)

In [None]:
tweets_filtered

We are going to loop through the dataframe and then through the words in the clean text. We are going to add the words as keys to dictionary and use their frequencies as values.

In [None]:
tweets_filtered.loc[tweets_filtered["clean_text"].isnull(),"clean_text"] = ""

In [None]:
tweet_tokenizer = nltk.TweetTokenizer()

#initialize an empty dict
unique_words = {}

for idx, row in tweets_filtered.iterrows():
    if row["clean_text"] != "":
        for word in tweet_tokenizer.tokenize(row["clean_text"]):
            unique_words.setdefault(word,0)
            unique_words[word] += 1
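The same word-frequency idea can be written more compactly with `collections.Counter`. The sketch below runs on a few made-up cleaned texts and splits on whitespace instead of using the Tweet tokenizer, so it is an illustration of the counting step only.

```python
from collections import Counter

clean_texts = ["world cup final", "", "cup final goal"]  # made-up examples

word_counts = Counter()
for text in clean_texts:
    word_counts.update(text.split())  # an empty string adds nothing

print(word_counts["cup"], word_counts["world"])  # 2 1
```

`dict(word_counts)` gives a plain dictionary with the same layout as `unique_words`.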

In [None]:
uw_df = pd.DataFrame.from_dict(unique_words, orient='index').reset_index()
uw_df.rename(columns = {'index':'Word', 0:'Count'}, inplace=True)
uw_df.sort_values(by=['Count'], ascending=False, inplace=True)
uw_df = uw_df.reset_index().drop(columns=["index"])

We can inspect the words as a dataframe. 


You can always save this dataframe as .csv for future reference.

In [None]:
uw_df

In [None]:
uw_df.to_csv(path+"words.csv")

#### Extracting the hashtags

We are going to loop through the dataframe and then through the hashtags in the entities. We are going to add the hashtags as keys to dictionary and use their frequencies as values. At the same time, we are going to save them in a list and add them to a separate column to facilitate our future work.

In [None]:
tweets_filtered.loc[tweets_df["entities"].isnull(), "entities"] = None

In [None]:
tweets_filtered["hashtags"] = ""

In [None]:
unique_hashtags = {}
index = 0

for idx, row in tweets_filtered.iterrows():
    #Tweets without entities hold None/NaN instead of a dictionary
    if isinstance(row["entities"], dict) and "hashtags" in row["entities"]:
        hl = []
        for hashtag in row["entities"]["hashtags"]:
            tag = "#"+hashtag["tag"].lower()
            unique_hashtags.setdefault(tag, 0)
            unique_hashtags[tag] += 1
            hl.append(tag)

        tweets_filtered.at[idx,"hashtags"] = hl
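To check the extraction logic without a downloaded dataset, it can be run on a few hand-written values shaped like the `entities` field. The sample values and the helper `hashtags_of` are made up for this sketch.

```python
from collections import Counter

def hashtags_of(entities):
    """Return the lowercased hashtags of one Tweet, or [] if it has none."""
    if not isinstance(entities, dict):
        return []  # covers None and NaN for Tweets without entities
    return ["#" + item["tag"].lower() for item in entities.get("hashtags", [])]

sample_entities = [  # made-up values in the shape of the entities field
    {"hashtags": [{"start": 0, "end": 10, "tag": "Qatar2022"}]},
    None,
    {"hashtags": [{"tag": "qatar2022"}, {"tag": "WorldCup"}], "urls": []},
]
hashtag_counts = Counter(tag for e in sample_entities for tag in hashtags_of(e))
print(hashtag_counts["#qatar2022"], hashtag_counts["#worldcup"])  # 2 1
```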

In [None]:
unique_hashtags = dict(sorted(unique_hashtags.items(), key=lambda item: item[1], reverse=True))

In [None]:
uh_df = pd.DataFrame.from_dict(unique_hashtags, orient='index').reset_index()
uh_df.rename(columns = {'index':'Hashtag', 0:'Count'}, inplace=True)

In [None]:
uh_df[0:50]

In [None]:
uh_df.to_csv(path+"hashtags.csv")

### Step 4: Building the network

We are going to use the networkx library, which is a Python library that enables network science analysis of the data.

We are going to use it to create our network and extract edgelist from it, since we can easily import it to Gephi (a software we are going to see in visualization labs).

It also offers ready-made algorithms for analysis (for example PageRank) that you can use out of the box to analyze your network.

But first, we will loop through our dataframe and connect words and hashtags if they appear together in the same Tweet.

In [None]:
import itertools
import networkx as nx

In [None]:
uh = unique_hashtags.keys()
uw = unique_words.keys() 

In [None]:
network = {}
network_key = 0
for index, row in tweets_filtered.iterrows():
    combined_list = [hashtag for hashtag in row["hashtags"]] + [word for word in str.split(row["clean_text"], " ") if word in uw]
    #itertools.product creates the Cartesian product of the combined list with itself
    for pair in itertools.product(combined_list, combined_list):
        #exclude self-loops and count each pair only once, because our graph is undirected and we do not take self-loops into account
        if pair[0]!=pair[1] and not(pair[::-1] in network):
            network.setdefault(pair,0)
            network[pair] += 1

network_df = pd.DataFrame.from_dict(network, orient="index")
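An alternative way to count co-occurrences, shown here only as a sketch on made-up token lists, is `itertools.combinations` over the sorted set of tokens in each Tweet. Unlike the loop above, this variant counts a pair at most once per Tweet even if a token is repeated, and it always stores a pair in alphabetical order. `cooccurrence_counts` is a hypothetical helper.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(token_lists):
    """Count how many Tweets contain each unordered pair of tokens."""
    pair_counts = Counter()
    for tokens in token_lists:
        unique_tokens = sorted(set(tokens))  # drop repeats, fix the order
        pair_counts.update(combinations(unique_tokens, 2))
    return pair_counts

tweets_tokens = [["#qatar2022", "goal", "final"], ["final", "goal", "goal"]]
counts = cooccurrence_counts(tweets_tokens)
print(counts[("final", "goal")])  # 2
```

Each key is a `(u, v)` tuple and each value a weight, so the result can be turned into weighted edges in the same way as the `network` dictionary.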

In [None]:
network_df.reset_index(inplace=True)
network_df.columns = ["pair","weight"]
network_df.sort_values(by="weight",inplace=True, ascending=False)
network_df

In [None]:
#to get a weighted graph we need a list of 3-element tuples (u,v,w) where u and v are nodes and w is a number representing the weight
up_weighted = []
for edge in network:
 #we can filter edges by weight by uncommenting the next line and setting desired weight threshold
 #if(network[edge])>1:
 up_weighted.append((edge[0],edge[1],network[edge]))

G = nx.Graph()
G.add_weighted_edges_from(up_weighted)

In [None]:
print(len(G.nodes()))
print(len(G.edges()))

#### Save edgelist

In [None]:
filename = path+"/edgelist.csv"

In [None]:
nx.write_weighted_edgelist(G, filename, delimiter=",")

In [None]:
#add header with appropriate column names (works on Colab and Linux/Mac(?))
!sed -i.bak 1i"Source,Target,Weight" ./edgelist.csv

#### Create and save node list


In [None]:
word_nodes = pd.DataFrame.from_dict(unique_words,orient="index")
word_nodes.reset_index(inplace=True)
word_nodes["Label"] = word_nodes["index"]
word_nodes.rename(columns={"index":"Id",0:"delete"},inplace=True)
word_nodes = word_nodes.drop(columns=['delete'])

word_nodes

In [None]:
hashtag_nodes = uh_df.copy()
hashtag_nodes["Label"] = hashtag_nodes["Hashtag"]
hashtag_nodes.rename(columns={"Hashtag":"Id"},inplace=True)
hashtag_nodes = hashtag_nodes.drop(columns=['Count'])
hashtag_nodes

In [None]:
nodelist = pd.concat([hashtag_nodes, word_nodes], ignore_index=True) #DataFrame.append was removed in pandas 2.0

nodelist.to_csv("nodelist.csv",index=False)

Tasks: 

* We created a network where nodes are mixed (both words and hashtags). Create a network of words only and one of hashtags only.
* Pick one of these networks and rank the nodes using PageRank centrality. Extract information about the top-20 ranked nodes.
* Following the procedure for extracting hashtags, extract mentions and annotations.
* Following the same procedure, extract the public metric counts for tweets.



