Predicting Prevalence of Influenza-Like Illness From Geo-Tagged Tweets

Modeling disease spread and distribution using social media data has become an increasingly popular research area. While Twitter data has recently been investigated for estimating disease spread, the extent to which it is representative of disease spread and distribution in a macro perspective is still an open question. In this paper, we focus on macro-scale modeling of influenza-like illnesses (ILI) using a large dataset containing 8,961,932 tweets from Australia collected in 2015. We first propose modifications of the state-of-the-art ILI-related tweet detection approaches to acquire a more refined dataset. We normalize the number of detected ILI-related tweets with Internet access and Twitter penetration rates in each state. Then, we establish a state-level linear regression model between the number of ILI-related tweets and the number of real influenza notifications. The Pearson correlation coefficient of the model is 0.93. Our results indicate that: 1) a strong positive linear correlation exists between the number of ILI-related tweets and the number of recorded influenza notifications at state scale; 2) Twitter data has promising ability in helping detect influenza outbreaks; 3) taking into account the population, Internet access and Twitter penetration rates in each state enhances the prevalence modeling analysis.


INTRODUCTION
Public health surveillance is an essential mission of every government. In the current era of big data, data-driven epidemic modeling and surveillance systems have drawn unprecedented attention.
In Australia, epidemics of seasonal influenza are one of the major public health concerns. Circulation of seasonal influenza strains peaks each winter. During the first half of 2015, more than 30,000 influenza cases were notified [5], the highest number on record for that period of the year. Moreover, public health data are traditionally collected via surveys and by aggregating statistics obtained from healthcare institutions. Such data collection processes are usually costly, slow, and retrospective.
Recently, analyzing data collected from Twitter, a microblogging social network, has shown promise in assessing the prevalence of flu [9]. However, modeling disease spread and distribution with Twitter data involves several challenging tasks. First, detecting tweets that express disease symptoms requires natural language processing (NLP), which is an active research field with plenty of open challenges [12]. Moreover, health-related tweets are relatively scarce [9], which makes their detection within a large corpus of tweets a highly unbalanced classification problem. Zuccon et al. [21] investigated the suitability of statistical machine learning approaches for detecting ILI-related tweets automatically. Their results show that the best f-score, which is the harmonic mean of precision and recall, reaches only 0.736 among most of the state-of-the-art approaches. Considering the limited likelihood of users mentioning their health condition on Twitter, relying only on classification techniques to obtain ILI-related tweets can induce large errors and lead to a biased epidemic model.
In this paper, we analyze a large database of 8,961,932 tweets from Australia collected in 2015 to study the spread and distribution of influenza-like illness epidemics. We propose modifications to the algorithm proposed in [16] to improve the classification performance for ILI-related tweets. We also take into account the Internet and Twitter penetration rates in each state to normalize the results. Afterwards, we establish a state-level model between the Twitter data and the true influenza notification data and perform temporal and spatial analysis to explore how well Twitter data can capture the features of disease spread and distribution. Furthermore, we identify the limitations of our study as well as opportunities for further study on utilizing Twitter data for public health surveillance.
The remainder of the paper is organized as follows. Section 2 presents related work. Section 3 gives some general statistics about the dataset we use and provides the methodology of the experiment design. Section 4 presents the experiment results and discussions. Section 5 elaborates on the limitations of the work. Section 6 provides conclusions and ideas for future work.

RELATED WORK
In the area of social media data mining, Twitter data have been used in many studies and provided valuable insights into various research fields including demographics estimation, public opinion reflection, real-time event monitoring, and public health surveillance. For example, Sakaki et al. proposed an algorithm to monitor earthquakes on the basis of tweet text features [17]. Tumasjan et al. showed the feasibility of tracking public political opinion and predicting the election results by analyzing the relevant tweets [20].
Culotta et al.'s work identifies a similar correlation between Twitter data and Google Flu Trends after experiments on tweet keyword generation and selection, document filtering methods, and a comparison of regression methods [9]. Prieto et al.'s work focuses on using Spanish and Portuguese tweets to estimate community health with respect to various maladies such as flu, depression, and eating disorders [14]. Moreover, Paul and Dredze apply an Ailment Topic Aspect Model (ATAM) to a large number of tweets to discover mentions of various ailments, such as allergies, depression, and cancer, for syndromic surveillance [13].
Sadilek et al. model disease epidemics by analyzing the interactions of online user activity and human mobility patterns using geo-tagged tweets [16]. They propose a semi-supervised cascade-based approach for detecting ILI-related tweets. They then model the spread of influenza by analyzing the co-location of users who post "sick" tweets and the Twitter users around them. Our work proposes modifications to the ILI-related tweet detection part of Sadilek et al.'s work, which is an iterative labeling and training approach with validation of the classification results, to improve the performance of the classification algorithm.
In Jurdak et al.'s work [15], the authors demonstrate that the Twitter data can be considered as a reliable source for studying the human mobility patterns. Their research also provides insights into the potential of using the Twitter data for public health studies.

METHODOLOGY
In this section, we first describe the dataset we use. Then, we discuss how we modify the classification approach to achieve better performance. Finally, we describe the methodology for the temporal and spatial mapping of ILI-related tweets in Australia and the regression model for estimating flu notifications from ILI-related tweets.

The Data
Twitter posts, also known as tweets, which can be up to 140 characters long, form the basis of our work. Within each tweet, users can add the hash-tag symbol (#) before a relevant keyword or phrase to categorize their tweets and use emojis to express their emotions. According to recent Twitter statistics, there are approximately 320 million Twitter users all over the world [7], 2.8 million of them being from Australia [6].
A collection of tweets obtained by CSIRO is our major data source. Using the Twitter Streaming API (https://dev.twitter.com/streaming/public), a large dataset of geo-tagged tweets within Australia was collected over the entire year of 2015. The data is stored in MongoDB [2], a cross-platform document-oriented NoSQL database. MongoDB features include big data storage, index support, straightforward queries, and higher speed than traditional relational databases [11], which make interaction with the data easier and more efficient.
All collected tweets are represented in JSON format. In our work, we only consider five particular fields, as listed in Table 1. Table 1 provides a concise description of the required JSON fields using a real tweet example.
After some basic data cleaning, the database contains 8,961,932 tweets posted by 225,641 unique Twitter users. Among all tweets, 3,469,190 are posted with precise location coordinates. Nearly every tweet is associated with a "place" field, which holds location information that already exists in the Twitter server database. This field provides coarse location information and can either be assigned automatically or selected manually by the user. Our work uses this field to complement the precise coordinates of the geo-enabled tweets.
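As a minimal sketch of how the precise and coarse location fields might be combined, the function below prefers the "geo" coordinates and falls back to the centre of the "place" bounding box. It assumes the tweet field layout of the Twitter API at the time ("geo" as a [latitude, longitude] point, "place" with a "bounding_box" of [longitude, latitude] corners); the function name and the sample tweet, including its bounding-box values, are hypothetical.

```python
import json

def tweet_location(tweet):
    """Return (lat, lon, source) for a tweet dict, or None if no location."""
    geo = tweet.get("geo")
    if geo and geo.get("coordinates"):
        # Assumed layout: "geo" stores [latitude, longitude].
        lat, lon = geo["coordinates"]
        return lat, lon, "geo"
    place = tweet.get("place")
    if place and place.get("bounding_box"):
        # Assumed layout: the first ring lists [longitude, latitude] corners.
        ring = place["bounding_box"]["coordinates"][0]
        lon = sum(p[0] for p in ring) / len(ring)
        lat = sum(p[1] for p in ring) / len(ring)
        return lat, lon, "place"
    return None

# Hypothetical tweet with no precise coordinates, only a coarse place.
raw = (
    '{"text": "I got the flu", "geo": null, '
    '"place": {"full_name": "Brisbane, Queensland", '
    '"bounding_box": {"type": "Polygon", "coordinates": '
    '[[[152.0, -28.0], [154.0, -28.0], [154.0, -27.0], [152.0, -27.0]]]}}}'
)
print(tweet_location(json.loads(raw)))
```

Averaging the box corners is only one possible choice for a coarse location; a state-level analysis needs no more precision than this.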

Detecting Illness-Related Tweets
Our primary task is to identify tweets that indicate the authors are infected at the time of posting. Based on the findings of related works [9], [16], detecting illness-related tweets is expected to be an unbalanced classification problem with scarce data points. In our work, we propose modifications to the classification algorithm in [16] and apply a semi-supervised cascade learning approach to learn Support Vector Machine (SVM) [8] classifiers with a large area under the precision-recall (PR) curve. It is worth mentioning that the area under the PR curve is a more informative evaluation measure in our scenario, as the class imbalance yields a consistently large area under the receiver operating characteristic (ROC) curve. The classifiers are trained to distinguish "sick" tweets (ILI-related tweets) from "other" tweets (non-ILI-related tweets) in the tweet database.
The prerequisite for learning such classifiers is a high-quality set of labeled training data. We employ an iterative process to obtain it. The training process is shown in Figure 1 and the classification process is shown in Figure 2. Within the mechanism, two different SVM classifiers, denoted by Cs and Co, are trained using the scikit-learn Python library. Each labels a tweet as belonging to either the class "sick" or the class "other". The classifier Cs is highly penalized for false positives (mistakenly labeling an "other" tweet as a "sick" one) and the classifier Co is highly penalized for false negatives (classifying a "sick" tweet as "other").
In each training iteration, two parameters that influence the performance of the classifiers, the class weight and the C parameter, are carefully selected through experiments. We fix one parameter and vary the other within a wide range of values to observe the changes in precision, recall, false-positive error rate, and false-negative error rate. The parameters leading to the highest precision and lowest false-positive error rate are chosen for Cs, while the parameters that give the highest recall and the lowest false-negative error rate are chosen for Co. Manual validation checks are included in both the training stage and the classification stage because they are essential for classifying the ILI-related tweets accurately.
Step by step instructions for the training and classification processes are discussed in the next paragraph and shown in Figures 1 and 2.
Initially, a small set of around 2000 tweets is labeled manually, resulting in 36 ILI-related tweets and 1974 non-ILI-related tweets (1). With the labeled dataset, Cs and Co are trained (2) and examined over a wide range of parameter values. The parameters that result in the best classification performance are selected (3). Then, a larger tweet corpus is introduced and labeled using Cs and Co (4). We manually check the labels assigned by the trained classifiers and add the checked tweets to the previously labeled corpus, which forms the training data for the next round of classifier training (5). After the training of Cs and Co is finalized, both classifiers are used to label the entire tweet database (6). A tweet may receive the same label, "sick" or "other", from both classifiers, or a different label from each. Therefore, in the final step (7), we manually check the tweets that the two classifiers label differently, which are represented by the "not known" part in Figure 2.
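The final two steps can be illustrated with a short sketch of how the outputs of the two classifiers might be merged. It is not the authors' implementation: the function names are hypothetical, and the classifier outputs are assumed to be plain "sick"/"other" strings, one per tweet.

```python
def combine_labels(label_cs, label_co):
    """Merge the labels that Cs and Co assign to one tweet."""
    if label_cs == label_co:
        return label_cs          # both classifiers agree
    return "not known"           # disagreement: needs a manual check

def split_for_review(tweets, labels_cs, labels_co):
    """Separate tweets with agreed labels from those needing manual review."""
    accepted, review = [], []
    for tweet, a, b in zip(tweets, labels_cs, labels_co):
        final = combine_labels(a, b)
        if final == "not known":
            review.append(tweet)
        else:
            accepted.append((tweet, final))
    return accepted, review

accepted, review = split_for_review(
    ["t1", "t2", "t3"],
    ["sick", "other", "other"],   # hypothetical output of Cs
    ["sick", "other", "sick"],    # hypothetical output of Co
)
print(accepted, review)
```

Because Cs is tuned against false positives and Co against false negatives, the tweets they disagree on are the ambiguous ones, which keeps the manual workload confined to that subset.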

Figure 2: Classification stage
For features, all unigram, bigram, and trigram word tokens are considered in our work. For instance, a tweet message "I got the flu" is represented by the following feature vector: (i, got, the, flu, i got, got the, the flu, i got the, got the flu). Before tokenization, all texts are converted into lowercase, and punctuation and stopwords are stripped. However, hash-tags and emojis are retained as they may reflect the author's health condition. We use term frequency-inverse document frequency (TF-IDF) [18] features to represent tweet data with the help of the tokenization package from CMU and the scikit-learn library. TF-IDF represents each term numerically by its number of appearances, offset by how frequently the term occurs in the corpus.
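A minimal sketch of the n-gram extraction, written in plain Python rather than with the tokenization package or scikit-learn used in this work. The function name and the regular expression are assumptions for illustration; stopword removal, emoji handling, and the TF-IDF weighting are omitted.

```python
import re

def ngram_tokens(text, max_n=3):
    """Return all word n-grams of length 1..max_n from a tweet."""
    # Lowercase, drop punctuation, but keep a leading '#' so hash-tags survive.
    words = re.findall(r"#?\w+", text.lower())
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams

print(ngram_tokens("I got the flu"))
```

Each distinct n-gram then becomes one dimension of the feature space, which is why the resulting space is high-dimensional even for short tweets.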
Our approach employs SVMs with the linear kernel to solve the associated high-dimensional feature space problem, which has been shown to perform well under such circumstances [13]. To overcome the class imbalance problem, where the ILI-related tweets are much fewer than the non-ILI-related tweets, the experiments are designed to optimize the area under the PR curve, which is demonstrated to be more meaningful when dealing with such unbalanced scenarios [10] compared to ROC curve.
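To make the effect of class imbalance concrete, the sketch below computes accuracy, precision, recall, and f-score from confusion-matrix counts. The function names are hypothetical; the example uses the sizes of the initial labeled set (36 "sick" and 1974 "other" tweets) with a trivial classifier that labels every tweet as "other".

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and their harmonic mean (f-score)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

def accuracy(tp, fp, fn, tn):
    """Fraction of all tweets that are labeled correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

# Trivial baseline: every tweet is labeled "other".
print(accuracy(tp=0, fp=0, fn=36, tn=1974))   # above 0.98
print(precision_recall_f(tp=0, fp=0, fn=36))  # all zeros
```

The baseline is more than 98% accurate while detecting no "sick" tweet at all, which is why precision and recall, rather than accuracy, guide the parameter selection.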

Analysis
Before modeling, we aim to understand to what extent Twitter data can capture the key features of state-level influenza prevalence both on spatial and temporal dimensions. With this objective, we design some experiments with the true influenza notifications data, which is obtained from Influenza Specialist Group (ISG) [1] and Queensland government health department websites [3], as a benchmark.

Spatial Analysis
In the spatial analysis, we first assign all ILI-related tweets to their respective locations based on the "geo" and "place" fields obtained from the JSON-format tweets using the geopy Python library. A heat map generated from all ILI-related tweets in Australia is shown in Figure 3. It is evident that most of the sick users are located in areas along the east coast with high population density. Meanwhile, the number of such users located in state capitals, such as Perth and Adelaide, is much larger than in other areas. In the state-level analysis, we sort the "sick" tweet numbers and the number of flu notifications in each state according to population and calculate the associated Pearson correlation coefficient to evaluate the linear relationship between the two examined values, "sick" tweet numbers and true notification numbers.
We also perform a regional-level analysis. We choose the Twitter data and true notification data from the state of Queensland (QLD) and locate each tweet within its corresponding hospital and health service (HHS) region, as shown in Figure 4. Similar to the state-level case, we are interested in the correlation between the tweet data and the true flu notification data, so we sort both by population and calculate the Pearson correlation coefficient.

Temporal Analysis
Temporal analysis is conducted by comparing the number of ILI-related tweets and true notifications at a monthly level.
A bout of flu typically lasts one to two weeks, and flu symptoms usually start within one to four days after infection [19]. In order to identify the infected individuals precisely, multiple sick tweets posted by the same user within one week are seen as duplicate tweets and only counted once in the analysis.
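The de-duplication rule can be sketched as follows. One interpretation is assumed here: a tweet is a duplicate if it is posted less than seven days after the last counted "sick" tweet of the same user. The function name and the input format, (user id, timestamp) pairs, are hypothetical.

```python
from datetime import datetime, timedelta

def drop_weekly_duplicates(tweets, window=timedelta(days=7)):
    """Keep one 'sick' tweet per user per window.

    tweets: iterable of (user_id, datetime) pairs in any order.
    """
    last_kept = {}   # user_id -> time of the last counted tweet
    kept = []
    for user, when in sorted(tweets, key=lambda t: t[1]):
        previous = last_kept.get(user)
        if previous is None or when - previous >= window:
            kept.append((user, when))
            last_kept[user] = when
    return kept

sample = [
    ("a", datetime(2015, 8, 1)),
    ("a", datetime(2015, 8, 3)),   # within a week of the first: duplicate
    ("a", datetime(2015, 8, 9)),   # eight days after the first: counted
    ("b", datetime(2015, 8, 2)),
]
print(len(drop_weekly_duplicates(sample)))
```

Measuring the window from the last counted tweet, rather than from fixed calendar weeks, avoids splitting one bout of illness across a week boundary.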
Internet access and social media usage rates differ among the states and territories. For example, residents of the Australian Capital Territory and Victoria are more likely to have access to the Internet than those living in the Northern Territory or Queensland. In order to reduce the potential bias induced by these disparities, we adjust our "sick" tweet numbers by weighting them according to the Internet access rate and the Twitter penetration rate of each state and territory. We obtain the usage rate information from the Australian Sensis Social Media Report [4].

Modeling Influenza-Like Illness Prevalence
In order to establish a state-level model, a linear regression model is fitted with the number of annual ILI-related tweets as the independent variable and the true illness laboratory notifications as the dependent variable. The number of influenza notifications in each state is estimated by
$$\hat{y} = B_0 + B_1 x,$$
where $B_0$ is the intercept, $B_1$ is the regression slope coefficient, $x$ is the number of ILI-related tweets, and $\hat{y}$ is the estimated number of influenza patients.
Internet access and Twitter penetration rate parameters are then introduced to reduce the bias caused by the different Internet and social media usage rates in each state. Accordingly, the independent variable x is calculated from the raw number of ILI-related tweets, weighted by i, the Internet access rate, and t, the Twitter penetration rate.
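One plausible form of this weighting is sketched below. It assumes that the raw tweet count is divided by the product of the two rates, so that states with lower Internet access or Twitter penetration are scaled up; this form is an assumption made for illustration, and the function name and the example rates are hypothetical.

```python
def weighted_tweet_count(raw_count, internet_rate, twitter_rate):
    """Scale a raw 'sick' tweet count by usage rates (assumed form: n / (i * t))."""
    if not (0 < internet_rate <= 1 and 0 < twitter_rate <= 1):
        raise ValueError("rates must be fractions in (0, 1]")
    return raw_count / (internet_rate * twitter_rate)

# Hypothetical state: 100 detected tweets, i = 0.5, t = 0.25.
print(weighted_tweet_count(100, internet_rate=0.5, twitter_rate=0.25))
```

Under this assumption, two states with the same raw count but different usage rates receive different weighted counts, which is the effect the normalization is meant to achieve.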
To better evaluate the regression model, the Pearson correlation coefficient analysis and a t-test are carried out. The t-test is a hypothesis test that determines whether there is a linear relationship between the independent variable and the dependent variable. In the t-test, the null hypothesis ($H_0$) is that the slope is equal to zero, and the alternative hypothesis ($H_1$) states that the slope is not equal to zero:
$$H_0: B_1 = 0, \qquad H_1: B_1 \neq 0.$$
The associated p-value tests the null hypothesis. If the p-value is lower than a given significance level (normally 0.05), the null hypothesis can be rejected with high confidence.
We also carry out a confidence interval analysis, which can help identify the probable area where the best-fit regression line lies.
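The quantities used in this evaluation can be computed with ordinary least squares, as in the sketch below. It is written in plain Python for illustration; the function names are hypothetical, the data points are made up and are not the state-level data of this work, and the critical value of the t-distribution is supplied by the caller rather than computed.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit y = b0 + b1 * x with Pearson r and slope t-statistic."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx                      # slope
    b0 = mean_y - b1 * mean_x           # intercept
    r = sxy / math.sqrt(sxx * syy)      # Pearson correlation coefficient
    # Standard error of the slope, using n - 2 degrees of freedom.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    se_b1 = math.sqrt(sse / (n - 2) / sxx)
    t_stat = b1 / se_b1 if se_b1 > 0 else math.inf
    return {"b0": b0, "b1": b1, "r": r, "se_b1": se_b1, "t": t_stat}

def slope_interval(fit, t_crit):
    """Confidence interval for the slope, given a t critical value."""
    half_width = t_crit * fit["se_b1"]
    return fit["b1"] - half_width, fit["b1"] + half_width

# Made-up data with five points, so the t-test has 3 degrees of freedom.
fit = fit_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(fit)
# 3.182 is the two-sided 95% critical value for 3 degrees of freedom.
print(slope_interval(fit, t_crit=3.182))
```

The null hypothesis of a zero slope is rejected at the chosen level when the absolute t-statistic exceeds the critical value, or equivalently when the slope interval excludes zero. For this made-up data neither holds, so the slope would not be considered significant.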

PERFORMANCE EVALUATION
In this section, experimental results for each stage of our work are displayed and elaborated along with analysis and discussions.

Classification Results
In the training stage, we fix the parameters of the classifiers after five training iterations with 1,585,918 tweets, as the classifiers do not perform better with more training iterations. The average 10-fold cross-validation performance of the SVM classifiers Cs, Co, and Cf is presented in detail in Table 2.
In our work, the number of ILI-related tweets is expected to be limited. Therefore, a 74% precision for classifier Cf can induce a large error in the dataset. From Table 2 we can observe that the accuracy is high for all three classifiers because of the large number of non-health-related tweets (true negatives). However, in our experiments the accuracy and precision of Cf decline while the recall improves. A relatively large false positive rate shows that Cf has mistakenly labeled many non-health-related tweets as "sick" tweets. In order to obtain a more precise ILI-related tweet dataset, we employ both classifiers Cs and Co for tweet labeling and manually check the labels of tweets that are given different labels by the two classifiers.
After labeling and manual checking, 1167 tweets posted by 896 unique users are found to be ILI-related. We then remove the duplicate tweets posted within a week by the same user. This leaves us with 1027 ILI-related tweets from Australia. Compared to the size of the entire 2015 tweet database, the number of sick tweet authors is relatively small. Assuming that the data obtained from ISG covers all individuals in Australia, and considering 100,586 laboratory-confirmed influenza cases in 2015 with an Australian population of 24 million, the ratio of the influenza-infected population within a year is around 0.0042. If we apply this ratio to 225,641, the number of unique users in the entire tweet database, the result is around 944, which is close to the detected number of sick users.

Influenza Outbreak
From the temporal analysis, Figure 5 shows that both the ILI-related tweet data and the true influenza notification data peak in August, which is during the high flu season in Australia. This indicates that Twitter data can potentially help detect an influenza outbreak in the time series. However, despite a rapid increase in the number of flu notifications from June to August, the number of sick tweets increases only moderately within the same period. There are around 40,000 notifications in August and fewer than 5,000 in May, whereas the Twitter data shows 150 ILI-related tweets in August and around 100 in May. Taking the true notifications as a benchmark, we would expect the number of tweets in August to be around 8 times that in May rather than only 1.5 times. The discrepancies between the ILI-related tweet data and the true influenza notification data may result from the limited prevalence of mentioning health conditions on Twitter. This result also shows that it is hard to reveal the severity of the influenza spread in the temporal dimension.

State-Level Linear Correlation
We sort the "sick" tweets and true notifications according to the populations of the states and normalize the tweet data with Internet access and Twitter penetration rates as shown in Figure 6. Twitter data appears to have similar variation trends to true notification data. For instance, although there is a high population density, Internet access rate, and Twitter penetration rate in Victoria compared to Queensland, the Twitter data correctly identifies more influenza infections in Queensland. Statistically, there is also a high correlation coefficient between Twitter data and true notification data, which is around 0.94. This indicates that the Twitter data can capture the key features of state-level influenza prevalence on an annual level with a linear relationship.

Regional Analysis
At the regional level, we allocate tweets to the hospital and health service (HHS) region in Queensland that encloses them and sort the number of ILI-related tweets and the true notification data in each HHS area by population, as shown in Figure 7 (a). As the region names from left to right are in ascending population order, we can see that there are no ILI-related tweets in Central West, Torres and Cape, South West, and North West. This reflects the small population in those areas, where the number of Twitter users who tweet regularly is also small. However, there is a relatively large number of ILI-related tweets in Wide Bay and Darling Downs given a small number of true influenza notifications. After further analysis, we find that, as there is a limited number of ILI-related tweets in those regions, the Twitter data can be easily influenced by a few unwell Twitter users who post frequently.
Interestingly, in regions with higher populations such as Cairns, Sunshine Coast, Gold Coast, Townsville, and Brisbane Metro, the Twitter data shows variation trends similar to those of the influenza notifications. Based on these observations, we limit our study to the regions around Brisbane city, as shown in Figure 7 (b). The number of ILI-related tweets and true influenza notifications shows a reasonable linear relationship with a correlation coefficient of 0.835. However, the Twitter data in Gold Coast seems to overestimate the influenza cases. This may be because Gold Coast is a famous tourist destination and has a younger population, which increases Twitter usage.
This analysis shows that we may need to take the nature of cities into account regarding Twitter usage behavior when studying regional disease distribution. However, owing to the limited Twitter usage and the low likelihood of mentioning health conditions in tweets, the number of detected ILI-related tweets may not be sufficient to support regional analysis in Australia.

Regression Analysis
Finally, we fit a linear regression model to estimate influenza prevalence using the generated Twitter dataset. As shown in Figure 8, the fitted linear regression model has a slope of 83.88, a Pearson correlation coefficient of 0.875, and a p-value of 0.011.

Figure 8: Linear regression with original sick tweet data amount
After taking Internet access and Twitter penetration rates into consideration as weighting parameters, a better-fitted model is obtained with a slope of 12.55. A higher correlation coefficient of 0.93 and a p-value of 0.017 suggest a state-level linear relationship between the number of ILI-related tweets and true influenza notifications, as seen in Figure 9, which shows the promise of estimating influenza prevalence using Twitter data.
In Figure 10, the confidence band generated from the sample data points indicates the region in which the true best-fit regression line lies with 95% probability. The prediction interval indicates that, for any specific number of ILI-related tweets (X), weighted by Internet access and Twitter penetration rates, there is a 95% probability that the real value of Y (the number of true influenza notifications) falls within the interval. The corresponding interval for the slope ranges from 6.02 to 22.19. Because this slope interval is entirely positive, it indicates a positive linear correlation between the two variables.

Influence of Population, Internet Access, and Twitter Penetration Rates
The improvement between the linear regression models depicted in Figures 8 and 9 shows that Internet access and Twitter penetration rates are important factors in the modeling. During the experiments, we also discover that the number of ILI-related tweets has a strong linear correlation with the population of each state. Although the number of tweets is limited, the Pearson correlation coefficient is 0.99. We then present the ratios between the number of ILI-related tweets and the population of each state in Figure 11. Excluding the data point representing the Northern Territory (NT) as an outlier, we find that although the states differ in Twitter user behavior and population size, they have similar ratios between tweet data and population. The average ratio of the other seven states is around $4.2 \times 10^{-5}$, which means that, given the population of a state, the number of "sick" Twitter users is around $4.2 \times 10^{-5}$ times the population.

LIMITATIONS
This work is mainly limited by the scarcity of tweets, especially illness-related ones, which may have three main causes. First, according to the Sensis social media report 2015 [4], only 17% of Australians use Twitter, which ranks as the 5th most used social media platform in Australia. Meanwhile, the likelihood of users commenting on their health condition on social media is relatively low. Second, users' online behavior may change during an adverse health condition. For example, some users may not want to tweet when they are suffering from illness while others might. People may also be more interested in talking about politics, sports, and everyday life via Twitter. Third, the considered tweet database only contains geo-tagged tweets, which are a small portion of all tweets in Australia.
The laboratory-confirmed influenza notifications are also incomplete, as many patients may not seek medical treatment when they fall ill. Meanwhile, the linear regression model is relatively simple for state-level influenza modeling. However, given the scale of our work, where there are only two variables, Twitter data and true notification data, linear regression is a suitable model for this study.
In addition, our work assumes a similar likelihood and frequency of tweeting by people of different ages and socioeconomic backgrounds. However, Twitter is currently more popular among younger generations, which means the presented results and models are specific to younger generations.
With respect to our approach to detecting the ILI-related tweets, manual checking steps may restrict the scalability of our learning method when applied to larger datasets.

CONCLUSIONS AND FUTURE WORK
Our work proposes effective modifications to the state-of-the-art approach for detecting illness-related tweets with the purpose of reducing the errors of its classifiers. Along with iterative manual checking for validation, we introduce Internet access and Twitter penetration rates in our modeling to compensate for their discrepancies among the states. We conduct state-level and regional-level analyses and show that, although the number of tweets is limited, Twitter data is useful in spatial and temporal disease prevalence modeling.
Our analysis results show that Twitter data is a reasonable proxy for detecting disease outbreaks and possesses a strong linear correlation with real-world influenza notification data. Finally, a linear regression model is established with a correlation coefficient of 0.93 and a p-value of 0.017. This strong positive linear relationship suggests that Twitter data can capture the key features of state-level influenza prevalence and has good potential in disease spread modeling.
In future work, we will consider introducing other data sources such as public transportation data, Twitter follower relationships, and tweet geo-location changes as features to model influenza prevalence and spread. At the same time, we will attempt to identify the effects of user connections and human movement on disease spread using data from Twitter and other social media. Meanwhile, we will also focus on temporal modeling to identify data correlations during various time spans such as different months and seasons.