{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "# **Recent search**\n", "\n", "The recent search endpoint allows you to programmatically access filtered public Tweets posted over the last week, and is available to all developers who have a developer account and are using keys and tokens from an App within a Project.\n", "\n", "You can authenticate your requests with OAuth 1.0a User Context, OAuth 2.0 App-Only, or OAuth 2.0 Authorization Code with PKCE. However, if you would like to receive private metrics, or a breakdown of organic and promoted metrics within your Tweet results, you will have to use OAuth 1.0a User Context or OAuth 2.0 Authorization Code with PKCE, and pass user Access Tokens that are associated with the user that published the given content. \n", "\n", "This endpoint can deliver up to 100 Tweets per request in reverse-chronological order, and pagination tokens are provided for paging through large sets of matching Tweets. \n", "\n", "When using a Project with Essential or Elevated access, you can use the basic set of operators and can make queries up to 512 characters long. When using a Project with Academic Research access or Enterprise access, you have access to additional operators.\n", "\n", "Rate limit: App rate limit (Application-only): 450 requests per 15-minute window shared among all users of your app" ], "metadata": { "id": "hRt9s5I1P2qS" } }, { "cell_type": "markdown", "metadata": { "id": "21NaYdhRr9i7" }, "source": [ "## Set up\n", "\n", "1. Tokens and keys should not be kept in Colab notebooks. I advise you to \n", "delete them from the code after every use. Here we create an environment variable and save the token in it. It will be deleted when the runtime restarts.\n", "\n", "2. 
To save your data reliably, it's best to connect the notebook to your Google Drive (n.b.: this doesn't work with a university account; you need to use a Gmail account). This way you don't risk losing your data if the notebook malfunctions, and you can access it from a notebook again later. \n", "\n", "3. The libraries we need (for now) are *requests*, to handle our API requests, and *pandas*, to easily view and store the results.\n", "\n", "4. The Colab environment offers many pre-installed Python libraries (you can check which ones by executing `!pip freeze`); *however*, if your library is not on the list, you need to install it manually.\n" ] }, { "cell_type": "code", "metadata": { "id": "gNFOaZPIxyS2" }, "source": [ "import os\n", "os.environ['TOKEN'] = \"AAAAAAAAAAAAAAAAAAAAAIi0jQEAAAAAk2nnTNI41oHHdehJLOXXv36J6%2F0%3DnnMdA329Tb9YGoAo6vgGLjvEVG0m8WwCs3ZHQNqGHLC2fUdz3v\"" ], "execution_count": 3, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "GjUQcJpfrecr" }, "source": [ "from google.colab import drive\n", "drive.mount('/content/drive')" ], "execution_count": 14, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "Nwy6MeGur9i-" }, "source": [ "### Import libraries" ] }, { "cell_type": "code", "metadata": { "id": "8ge5EaEur9jA" }, "source": [ "import requests \n", "import pandas as pd \n", "import time" ], "execution_count": 15, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "OrXv204vyooY" }, "source": [ "If you need to install a library, use the following code and specify the name of the library you need (here we install the emoji library)." ] }, { "cell_type": "code", "metadata": { "id": "N6RYD_oCoJ3C" }, "source": [ "!pip install emoji" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Step 1: authenticate\n", "\n", "To authenticate our request, we need to create a request header and add an authorization field. 
You can authorize a request by using the bearer token, or the API consumer/secret keys. Here we use the bearer token for the sake of simplicity.\n", "\n", "You can read more about it here: https://developer.twitter.com/en/docs/authentication/overview" ], "metadata": { "id": "9I8xu_aIQY9v" } }, { "cell_type": "markdown", "metadata": { "id": "IYIP9Ta4r9jR" }, "source": [ "\n", "### Set up headers" ] }, { "cell_type": "code", "metadata": { "id": "pKWoILn7AMzj" }, "source": [ "def create_headers(bearer_token):\n", "    headers = {\"Authorization\": \"Bearer {}\".format(bearer_token)}\n", "    return headers" ], "execution_count": 1, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "W_buwaoAyPLM" }, "source": [ "headers = create_headers(os.environ['TOKEN'])" ], "execution_count": 6, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Step 2: build a search query\n", "\n", "**Ingredients**: endpoint, parameters and operators\n", "\n", "As the endpoint we use: https://api.twitter.com/2/tweets/search/recent\n", "\n", "**Example parameters**: \n", "\n", "* query: the text of your search (required) - max 512 chars\n", "* end_time: the newest date for your tweets\n", "* start_time: the oldest date for your tweets\n", "(format for date: YYYY-MM-DDTHH:mm:ssZ (ISO 8601/RFC 3339))\n", "* max_results: between 10 (default) and 100\n", "* tweet_fields: which fields to get (if empty, you only get id, text and edit \n", "history)\n", "* user_fields, place_fields, expansions\n", "* next_token: to get the next page of results \n", "\n", "\n", "**Example operators**: keyword (menstruation), exact phrase (\"sexual education\"), hashtag (#metoo), emoji (😬), logical operators (AND = a blank space, OR, NOT = a `-` prefix), from: or to: (tweets from a user or directed to a user), @ (tweets that mention a user, e.g. @NASA), is:retweet, is:reply, is:quote, lang: (e.g. lang:en)\n", "\n", "Grouping is done with parentheses. 
E.g. (#prolife abortion) OR (#prochoice abortion)\n", "\n", "See more here: \n", "\n", "Operators: https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query\n", "\n", "Parameters: https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-recent\n", "\n", "\n" ], "metadata": { "id": "Nt2jiBSDS-XQ" } }, { "cell_type": "code", "metadata": { "id": "krWb9m_z_fgF" }, "source": [ "def create_url(query, start_time, end_time, max_results, expansions, tweet_fields, user_fields, place_fields, endpoint):\n", "    \n", "    search_url = endpoint #Change to the endpoint you want to collect data from\n", "\n", "    #change params based on the endpoint you are using\n", "    #also can request different fields, e.g. ids of users ... \n", "    query_params = {'query': query,\n", "                    'end_time': end_time,\n", "                    'start_time': start_time,\n", "                    'max_results': max_results,\n", "                    'expansions': expansions,\n", "                    'tweet.fields': tweet_fields,\n", "                    'user.fields': user_fields,\n", "                    'place.fields': place_fields}\n", "\n", "    return (search_url, query_params)" ], "execution_count": 41, "outputs": [] }, { "cell_type": "code", "execution_count": 40, "metadata": { "id": "ef5513cb" }, "outputs": [], "source": [ "def connect_to_endpoint(url, headers, params, next_token = None):\n", "    #only change the default value of next_token if it is a real value returned in the response\n", "    if next_token is not None and next_token != '':\n", "        params['next_token'] = next_token\n", "    #create a \"GET\" request to the specified url, add headers and parameters\n", "    response = requests.request(\"GET\", url, headers = headers, params = params)\n", "    if response.status_code != 200:\n", "        #if something goes wrong, we need to know\n", "        raise Exception(response.status_code, response.text)\n", "    #otherwise, we want the payload of our response, which contains our tweet(s)\n", "    return response.json()" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { 
"id": "3024b32a" }, "outputs": [], "source": [ "def get_data(query, start_time, end_time, max_results, expansions, tweet_fields, user_fields, place_fields, endpoint, next_token=\"\"):\n", "    \n", "    results = []\n", "\n", "    while next_token is not None:\n", "        try: \n", "            url = create_url(query, start_time, end_time, max_results, expansions, tweet_fields, user_fields, place_fields, endpoint)\n", "            json_response = connect_to_endpoint(url[0], headers, url[1], next_token)\n", "            #if we have results, they will be in the field 'data' of our response\n", "            if \"data\" in json_response:\n", "                results.extend(json_response[\"data\"])\n", "                print(str(len(json_response[\"data\"])) + \" Tweets downloaded in this batch.\")\n", "            #the next_token is added to the field 'meta' of our response\n", "            if \"meta\" in json_response:\n", "                if \"next_token\" in json_response[\"meta\"].keys():\n", "                    next_token = json_response[\"meta\"][\"next_token\"] \n", "                else:\n", "                    next_token = None\n", "            else:\n", "                next_token = None\n", "\n", "            \n", "            #to control the rate limit we need to slow down our download\n", "            time.sleep(3)\n", "\n", "        except Exception as e:\n", "            print(\"Error occurred\", e)\n", "            print(\"Next token value\", next_token)\n", "            #on error we return a tuple: the partial results and the last token, so the download can be resumed\n", "            return(results, next_token)\n", "\n", "    print(\"Done\")\n", "    \n", "    return results" ] }, { "cell_type": "markdown", "source": [ "## Step 3: download and save the data\n", "\n", "We call the function, filling in the desired parameters. We convert the data into a pandas DataFrame to easily manipulate it (view, edit, save). We save the data in the pickle format, so we can recover the exact data types later." 
], "metadata": { "id": "efu38OoOgHk0" } }, { "cell_type": "code", "source": [ "tweets = get_data(\"#UNIPD\", start_time = \"2022-11-18T00:00:00Z\", end_time = \"2022-11-19T00:00:00Z\", \n", " max_results=100, expansions='author_id,in_reply_to_user_id,geo.place_id', \n", " tweet_fields='id,text,author_id,in_reply_to_user_id,geo,conversation_id,created_at,lang,public_metrics,referenced_tweets,reply_settings,source,context_annotations,entities',\n", " user_fields='id,name,username,created_at,description,public_metrics,verified',\n", " place_fields='full_name,id,country,country_code,geo,name,place_type',\n", " endpoint=\"https://api.twitter.com/2/tweets/search/recent\")\n", " " ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "TFaruTRUb0TW", "outputId": "0d89b1fc-0175-4f3b-fd2b-8f13966f2434" }, "execution_count": 64, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "5 Tweets downloaded in this batch.\n", "Done\n" ] } ] }, { "cell_type": "code", "source": [ "tweets_df = pd.DataFrame(tweets)" ], "metadata": { "id": "ezdjjfmYe4e2" }, "execution_count": 65, "outputs": [] }, { "cell_type": "code", "source": [ "display(tweets_df)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 548 }, "id": "DRrthNC_g1vE", "outputId": "51ac9018-5caa-4470-c4a9-d51e6cb08e5a" }, "execution_count": 69, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ " entities id \\\n", "0 {'hashtags': [{'start': 118, 'end': 124, 'tag'... 1593661416862584836 \n", "1 {'urls': [{'start': 54, 'end': 77, 'url': 'htt... 1593635572626726916 \n", "2 {'urls': [{'start': 69, 'end': 92, 'url': 'htt... 1593593404281208835 \n", "3 {'urls': [{'start': 281, 'end': 304, 'url': 'h... 1593531604886032387 \n", "4 {'hashtags': [{'start': 102, 'end': 108, 'tag'... 
1593509073647030272 \n", "\n", " conversation_id lang \\\n", "0 1593661416862584836 en \n", "1 1593626010880090114 it \n", "2 1593593404281208835 en \n", "3 1593531604886032387 it \n", "4 1593509073647030272 it \n", "\n", " public_metrics author_id \\\n", "0 {'retweet_count': 3, 'reply_count': 0, 'like_c... 1598181734 \n", "1 {'retweet_count': 0, 'reply_count': 0, 'like_c... 1402397995321147401 \n", "2 {'retweet_count': 0, 'reply_count': 0, 'like_c... 723784143449014273 \n", "3 {'retweet_count': 0, 'reply_count': 0, 'like_c... 1272924933878886407 \n", "4 {'retweet_count': 1, 'reply_count': 0, 'like_c... 108979965 \n", "\n", " text \\\n", "0 RT @UniPadova: 🇪🇺 Highly Cited Researchers @Cl... \n", "1 @CarloCalenda Tanto per aggiornarle gli studi\\... \n", "2 #Christmas_tree is ready in our pharmacy lab \\... \n", "3 Se vi interessa il Giappone e il patrimonio ar... \n", "4 RT @UniPadova: 👩‍⚖️ Trasparenza e legalità per... \n", "\n", " created_at \\\n", "0 2022-11-18T17:44:25.000Z \n", "1 2022-11-18T16:01:43.000Z \n", "2 2022-11-18T13:14:09.000Z \n", "3 2022-11-18T09:08:35.000Z \n", "4 2022-11-18T07:39:03.000Z \n", "\n", " referenced_tweets reply_settings \\\n", "0 [{'type': 'retweeted', 'id': '1592880314723999... everyone \n", "1 [{'type': 'replied_to', 'id': '159363331950533... everyone \n", "2 NaN everyone \n", "3 NaN everyone \n", "4 [{'type': 'retweeted', 'id': '1593213659328585... everyone \n", "\n", " source edit_history_tweet_ids in_reply_to_user_id \\\n", "0 Twitter for iPhone [1593661416862584836] NaN \n", "1 Twitter for Android [1593635572626726916] 1402397995321147401 \n", "2 Twitter for iPhone [1593593404281208835] NaN \n", "3 Twitter Web App [1593531604886032387] NaN \n", "4 Twitter Web App [1593509073647030272] NaN \n", "\n", " geo \n", "0 NaN \n", "1 NaN \n", "2 {'place_id': 'fdb8acc1584c75bc'} \n", "3 NaN \n", "4 NaN " ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {} } ] }, { "cell_type": "code", "source": [ "for idx, row in tweets_df.iterrows():\n", " print(row[\"id\"],row[\"text\"])" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "95ntVDmmg5uu", "outputId": "b68ab832-48c6-4263-ce66-f5f5aa551341" }, "execution_count": 71, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "1593661416862584836 RT @UniPadova: 🇪🇺 Highly Cited Researchers @Clarivate: Unipd hosts 8 of the most highly cited researchers worldwide.\n", "\n", "#Unipd placed first a…\n", "1593635572626726916 @CarloCalenda Tanto per aggiornarle gli studi\n", "#UniPD \n", "https://t.co/WZDd6mDIOq\n", "1593593404281208835 #Christmas_tree is ready in our pharmacy lab \n", "#UniPD #RMIT #PhD_life https://t.co/3ODcGCsYkJ\n", "1593531604886032387 Se vi interessa il Giappone e il patrimonio artistico e culturale di mondi lontani,non potete perdervi l’evento “La restituzione della collezione di cartoni giapponesi di Villa Revedin Bolasco”.\n", "26/11/22 ore 10.30 \n", "Presso Villa Revedin Bolasco a Castelfranco Veneto.\n", "#unipd #tesaf https://t.co/413dug90cp\n", "1593509073647030272 RT @UniPadova: 👩‍⚖️ Trasparenza e legalità per l'Università del futuro\n", "\n", "@iuav @UniVerona @CaFoscari a #Unipd per fare un punto sullo stato…\n" ] } ] }, { "cell_type": "code", "source": [ "tweets_df.to_pickle(\"/content/tweets.pkl\")" ], "metadata": { "id": "a-4cmhGCf_hO" }, "execution_count": 68, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Step 4: individual work\n", "\n", "- think of a topic you'd like to read some tweets from\n", "- build a query (play around with the logic - can you get only tweets that are not a retweet?)\n", "- how many tweets did you get?\n", "- what if you changed the date range?" ], "metadata": { "id": "OxPOkg0Sgq0E" } } ] }