this dataset of 170,000 geotagged tweets from the UK. All tweets were collected between April 14 and 21, 2016. Filtering and preprocessing reduced the dataset to approximately 40,000 tweets.
I used R to create basic visualizations of the spatial and temporal distribution of the tweets in my dataset. For example, a histogram of tweet times reveals a diurnal pattern: tweet frequency peaks at midday and falls during the evening. While some days reach a higher peak than others, all days share a similar within-day distribution.
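To make the hourly histogram concrete, here is a minimal sketch in Python (not the R code I actually used). It assumes timestamps are available as ISO 8601 strings; the function name `hourly_counts` and the sample timestamps are hypothetical and only illustrate the idea.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(timestamps):
    """Count tweets per hour of day (0-23) from ISO 8601 timestamp strings."""
    counts = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


# Hypothetical sample timestamps, not taken from the real dataset.
sample = [
    "2016-04-14T08:15:00",
    "2016-04-14T12:05:00",
    "2016-04-14T12:40:00",
    "2016-04-15T12:10:00",
    "2016-04-15T21:30:00",
]

counts = hourly_counts(sample)
peak_hour = max(range(24), key=lambda hour: counts[hour])

# Print a rough text histogram, skipping empty hours.
for hour, n in enumerate(counts):
    if n:
        print(f"{hour:02d}:00 {'#' * n}")
```

Pooling all days into one 24-bin histogram shows the shared diurnal shape; grouping by date as well would show how peak height varies from day to day.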
I then detected spatio-temporal clusters using space-time scan statistics [5], implemented with the freely available SaTScan software. I selected the 100 most statistically significant clusters and analyzed the text of the tweets in each cluster to determine whether it corresponded to a real-world event. Overall, I found 18 clusters that corresponded to real events. The first figure in this post shows how each of these event clusters occupies a distinct spatio-temporal region.
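For readers unfamiliar with scan statistics, the sketch below scores a single candidate space-time cylinder (a circle in space extended over a time window). It is a simplified illustration in Python, not SaTScan's implementation. It assumes the space-time permutation form of the statistic, where the expected count comes from the data alone; the post does not depend on that choice, and the function name `cylinder_llr`, the projected kilometre coordinates, and the toy events are all hypothetical.

```python
import math


def cylinder_llr(events, center, radius_km, t_start, t_end):
    """Log-likelihood ratio for one space-time cylinder.

    events: list of (x_km, y_km, t) tuples in a projected coordinate system.
    Assumed expected count (space-time permutation idea):
    (events in circle, any time) * (events in window, anywhere) / total.
    """
    total = len(events)
    in_circle = [
        math.hypot(x - center[0], y - center[1]) <= radius_km
        for x, y, _ in events
    ]
    in_window = [t_start <= t <= t_end for _, _, t in events]
    observed = sum(s and w for s, w in zip(in_circle, in_window))
    expected = sum(in_circle) * sum(in_window) / total

    # Only cylinders with more events than expected count as clusters.
    if observed <= expected:
        return 0.0
    llr = observed * math.log(observed / expected)
    if total > observed:
        llr += (total - observed) * math.log(
            (total - observed) / (total - expected)
        )
    return llr


# Toy data: one background event per (x, t) cell, plus a burst of 5 at (2, 0, 2).
background = [(x, 0, t) for x in range(5) for t in range(5)]
burst = [(2, 0, 2)] * 5
events = background + burst

burst_llr = cylinder_llr(events, center=(2, 0), radius_km=0.5, t_start=2, t_end=2)
quiet_llr = cylinder_llr(events, center=(0, 0), radius_km=0.5, t_start=0, t_end=0)
print(burst_llr > quiet_llr)
```

A full scan evaluates many cylinders of varying centers, radii, and durations, then ranks them by this score. SaTScan assesses significance through Monte Carlo replication, which this sketch omits.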
These findings likely reflect how events concentrate people in one place: many attendees tweet from the same location within the same time period, producing a local spike in tweet density.