(This is an English translation, with some slight modifications, of this article)
The availability of Google search data has stimulated considerable interest in various disciplines. Among these, political forecasting is probably the most recent, while important results have already been achieved in the economic, financial and medical fields.
The main factor that has so far limited development in the field of political forecasting is the so-called self-selection bias: the individuals included in the analyzed sample are not chosen randomly, but instead select themselves into the sample, which is therefore biased and not representative of the entire population. Unfortunately, this is exactly what happens with GTI, which reports data only for those individuals who actively searched on Google.
For instance, GTI contains only a small number of politically related searches by people over 60 years old, who use the internet much less than younger people; unfortunately, this is the group that is most active when it comes to voting. Similarly, if the internet penetration rate in the examined population is not very high, the self-selection bias will be fairly large. See the links below for more details and references:
This issue is well known to people who conduct online surveys using the technique known as Computer-Assisted Web Interviewing (CAWI):
In the case of online surveys, however, it is possible to know (at least approximately) which groups of individuals are under-represented and which are over-represented, by comparing the qualitative characteristics of the online survey sample with those of the population. Appropriate sample weights can then be created to correct the bias present in the original online sample.
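As a rough illustration of this kind of re-weighting, here is a minimal Python sketch of simple post-stratification on a single characteristic. The age groups, population shares, survey answers and function names are all made up for the example and do not come from any real survey; actual CAWI weighting schemes usually involve several characteristics at once.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Weight for each group = population share / sample share."""
    counts = Counter(sample_groups)
    n = len(sample_groups)
    return {g: population_shares[g] / (counts[g] / n) for g in counts}


def weighted_share(answers, groups, weights, candidate):
    """Weighted fraction of respondents who chose `candidate`."""
    total = sum(weights[g] for g in groups)
    support = sum(weights[g] for a, g in zip(answers, groups) if a == candidate)
    return support / total


# Hypothetical online sample: 8 respondents under 60, 2 respondents over 60.
groups = ["under60"] * 8 + ["over60"] * 2
answers = ["A"] * 6 + ["B"] * 2 + ["B"] * 2

# Hypothetical population: 60% under 60, 40% over 60.
population_shares = {"under60": 0.6, "over60": 0.4}

weights = poststratification_weights(groups, population_shares)

# Unweighted share of candidate A in the raw online sample.
raw_share = answers.count("A") / len(answers)

# Share of candidate A after re-weighting to match the population.
adj_share = weighted_share(answers, groups, weights, "A")

print(f"raw share of A:      {raw_share:.2f}")
print(f"weighted share of A: {adj_share:.2f}")
```

In this toy sample the over-60 group is under-represented, so its answers are weighted up and the estimated support for candidate A falls from 0.60 to about 0.45. The correction is only possible because the age group of each respondent is known.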
Unfortunately, when working with GTI data, you do NOT have access to this kind of qualitative information, so it is not possible to re-balance the search data for a candidate or a political party to obtain a representative sample of the population.
That said, we then have four possibilities:
1) Use the simple GTI data with no corrections: this option is currently viable only for a limited number of elections, exclusively at the national level, with a very large turnout (several million voters) and a very high internet penetration rate. One such case is the U.S. presidential election: