globaltrends

# install package --------------------------------------------------------------
# current cran version
install.packages("globaltrends")
# current dev version
devtools::install_github("ha-pu/globaltrends", build_vignettes = TRUE)

# load package -----------------------------------------------------------------
library(globaltrends)

# package version --------------------------------------------------------------
packageVersion("globaltrends")
#> [1] '0.1.0'

Case study: Analyzing firm internationalization

We demonstrate the functionality of the globaltrends package based on a sample of six large U.S. firms. Measuring the degree of internationalization for firms is an essential empirical task in international business research. Yet the proposed methodology can be generalized to other applications. In this brief case study, we analyze the degree of internationalization of Alaska Air Group Inc., Coca-Cola Company, Facebook Inc., Illinois Tool Works Inc., J.M. Smucker Company, and Microsoft Corporation. The workflow proceeds in four major steps:

  1. Set up and start the database
  2. Download data from Google Trends
  3. Compute search scores and internationalization
  4. Exports and plots

Set up and start the database

Research projects that use Google Trends generate a substantial amount of data. To handle this data efficiently, the globaltrends package stores all data in an SQLite database. This ensures efficiency and portability and integrates seamlessly with functions from the DBI and dplyr packages.

Users create the underlying database through the initialize_db command. The command creates a folder named db within the current working directory and creates an SQLite database file named globaltrends_db.sqlite within this folder. The command also creates all necessary tables within the database. For more information on database tables, please refer to their built-in documentation, e.g., ?globaltrends::data_score. The database initialization is necessary only for the first usage of the globaltrends package.

# initialize_db ----------------------------------------------------------------
setwd("your/globaltrends/folder")
initialize_db()
#> Database has been created.
#> Table 'batch_keywords' has been created.
#> ...
#> Table 'data_global' has been created.
#> Successfully disconnected.

After initialization or when resuming work on an existing database, it is sufficient to call start_db from the respective working directory. This command connects to the globaltrends_db.sqlite database in the folder db and creates connections to all tables in the database.

# start_db ---------------------------------------------------------------------
setwd("your/globaltrends/folder")
start_db()
#> Successfully connected to database.
#> Successfully exported all objects to .GlobalEnv.
print(ls())
#>  [1] "batch_keywords"   "batch_time"       "countries"        "data_control"
#>  [5] "data_doi"         "data_global"      "data_locations"   "data_mapping"
#>  [9] "data_object"      "data_score"       "dir_current"      "dir_wd"
#> [13] "globaltrends_db"  "keyword_synonyms" "keywords_control" "keywords_object"
#> [17] "time_control"     "time_object"      "us_states"

After work with the globaltrends package is complete, the user disconnects from the database with the command disconnect_db.

# disconnect_db ----------------------------------------------------------------
disconnect_db()
#> Successfully disconnected.

Compute search scores and internationalization

Once the user has completed all control and object downloads, globaltrends computes search scores for each keyword-time-location combination and at a global level (volume of internationalization). Next, the package uses the across-country distribution of these search scores to measure the degree of internationalization of an object keyword.

Compute country search scores and volume of internationalization

The function compute_score divides the search volumes for an object keyword by the sum of search volumes for the keywords in the respective control batch. The search score computation proceeds in three steps. First, the function aggregates all search volumes into monthly data. Next, it follows the procedure proposed by Castelnuovo and Tran (2017, pp. A1-A2) and outlined in Appendix B to map control and object data. Finally, object search volumes are divided by the sum of control search volumes in the respective control batch. We use the sum of search volumes for a set of control keywords, rather than the search volumes for a single control keyword, to smooth out variation in the underlying control data. Because of this division, it is essential to define a set of control keywords that mirrors “standard” Google usage for the given research setting.

# compute_score ----------------------------------------------------------------
compute_score(control = new_control[[1]], object = new_object, locations = countries)
#> Successfully computed search score | control: 1 | object: 1 | location: US [1/66]
#> ...
#> Successfully computed search score | control: 1 | object: 2 | location: DO [66/66]

A message indicates each successful computation of search scores. The data is written directly to the table data_score in the database. The computation of the volume of internationalization follows the same principles. Instead of search volumes of control and object keywords at the country level, the function compute_voi compares control and object search volumes at the global level.

# compute_voi ------------------------------------------------------------------
compute_voi(control = new_control[[1]], object = new_object)
#> Successfully computed search score | control: 1 | object: 1 | location: world [1/1]
#> Successfully computed search score | control: 1 | object: 2 | location: world [1/1]
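
The aggregation and division described above can be illustrated with a minimal base R sketch. All numbers and column names (object, control_a, control_b) are made up for illustration, and the mapping step is skipped, so this is a toy version of the idea rather than the package's implementation of compute_score.

```r
# Toy daily search volumes for one location (all numbers are made up).
set.seed(1)
dates <- seq(as.Date("2020-01-01"), as.Date("2020-02-29"), by = "day")
volumes <- data.frame(
  date = dates,
  object = runif(length(dates), 1, 5),       # object keyword
  control_a = runif(length(dates), 40, 60),  # first control keyword
  control_b = runif(length(dates), 20, 30)   # second control keyword
)

# First step: aggregate daily search volumes into monthly data.
volumes$month <- format(volumes$date, "%Y-%m")
monthly <- aggregate(
  cbind(object, control_a, control_b) ~ month,
  data = volumes,
  FUN = sum
)

# Final step: divide object volumes by the sum of control volumes.
monthly$score <- monthly$object / (monthly$control_a + monthly$control_b)
monthly
```

Summing over several control keywords means that a spike in any single control keyword moves the denominator less than it would with a single control keyword.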

Compute degree of internationalization

The globaltrends package uses the distribution of search scores across countries to compute the degree of internationalization for objects of interest. The function compute_doi uses an inverted Gini coefficient as a measure of the degree of internationalization. The more uniform the distribution of search scores across all countries, the higher the inverted Gini coefficient and the greater the degree of internationalization. In addition to the Gini coefficient, the package uses the inverted Herfindahl index and entropy as measures of internationalization (details below).

# compute_doi ------------------------------------------------------------------
compute_doi(control = new_control[[1]], object = new_object, locations = "countries")
#> Successfully computed DOI | control: 1 | object: 1 [1/2]
#> Successfully computed DOI | control: 1 | object: 2 [2/2]

A message indicates each successful computation. The data is written directly to the table data_doi in the database.
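
To make the inverted Gini coefficient concrete, the following base R sketch uses the standard rank-based Gini formula and subtracts it from one. The helper names gini and inverted_gini and the toy vectors are hypothetical, and the exact formula used inside compute_doi may differ.

```r
# Gini coefficient of a non-negative vector (standard rank-based formula).
gini <- function(x) {
  x <- sort(x)
  n <- length(x)
  2 * sum(seq_len(n) * x) / (n * sum(x)) - (n + 1) / n
}

# Inverted Gini: 1 for a perfectly uniform distribution, lower when concentrated.
inverted_gini <- function(x) 1 - gini(x)

uniform <- rep(1, 4)           # equal search scores in four countries
concentrated <- c(0, 0, 0, 1)  # all search interest in a single country

inverted_gini(uniform)
#> [1] 1
inverted_gini(concentrated)
#> [1] 0.25
```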

Exports and plots

Functions in globaltrends write all data directly to tables in the database. With the help of functions from the dplyr package and connections exported from start_db, users can access database tables and prepare their own analysis.

# manual exports ---------------------------------------------------------------
library(dplyr)
data_score %>%
  filter(keyword == "coca cola") %>%
  collect()
#> # A tibble: 8,040 x 6
#>    location keyword    date   score     batch_c  batch_o
#>    <chr>    <chr>     <int>     <dbl>   <int>    <int>
#>  1 US       coca cola 14610   0.00362   1        1
#>  ...
#> 10 US       coca cola 14883   0.00347   1        1
#> # ... with 8,030 more rows

To enhance usability, the globaltrends package includes a set of export functions that offer filters and return data as a tibble. By default, the batch and keyword filters of the export_xxx functions are NULL, in which case all values from the database are exported. Alternatively, users can specify filters (e.g., keywords, batches, locations) individually, as a vector, or as a list.

# export_control ---------------------------------------------------------------
export_control(control = 1)
#> # A tibble: 39,600 x 5
#>    location keyword date        hits control
#>    <chr>    <chr>   <date>     <dbl>   <int>
#>  1 US       gmail   2010-01-01    22       1
#>  ...
#> 10 US       gmail   2010-10-01    27       1
#> # ... with 39,590 more rows

# export_score -----------------------------------------------------------------
export_score(object = 1, control = 1)
#> # A tibble: 23,760 x 6
#>    location keyword   date       score    control  object
#>    <chr>    <chr>     <date>     <dbl>    <int>    <int>
#>  1 US       coca cola 2010-01-01 0.00362  1        1
#>  ...
#> 10 US       coca cola 2010-10-01 0.00347  1        1
#> # ... with 23,750 more rows

# export_doi and purrr interaction ---------------------------------------------
purrr::map_dfr(c("coca cola", "microsoft"), export_doi, control = 1)
#> # A tibble: 240 x 8
#>    keyword   date       gini   hhi entropy control object locations
#>    <chr>     <date>     <dbl> <dbl>   <dbl>   <int>  <int> <chr>
#>  1 coca cola 2010-01-01 0.397 0.874  -0.938       1     1 countries
#>  ...
#> 10 coca cola 2010-10-01 0.574 0.968  -0.303       1     1 countries
#> # ... with 230 more rows

The export functions from globaltrends also allow direct interaction with dplyr or other packages for further analysis.

# export and dplyr interaction -------------------------------------------------
library(dplyr)
export_doi(object = 1, control = 1) %>%
  filter(lubridate::year(date) == 2019) %>%
  summarise(gini = mean(gini), .by = keyword)
#> # A tibble: 3 x 2
#>   keyword    gini
#>   <chr>     <dbl>
#> 1 coca cola 0.615
#> 2 facebook  0.707
#> 3 microsoft 0.682

Additional options

The globaltrends package offers several options that adjust the default computations. Users can use measures other than the inverted Gini coefficient or change the set of locations.

Alternative dispersion measures

The globaltrends package computes the degree of internationalization based on the across-location distribution of search scores. By default, the package uses an inverted Gini coefficient. In addition, the package provides the inverted Herfindahl index and entropy as robustness checks. In general, outcomes for all three dispersion measures are similar.
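
The two alternative measures can be sketched in base R as follows. This sketch assumes that the inverted Herfindahl index is one minus the sum of squared shares and that the inverted entropy is the negative of Theil's entropy index. Both assumptions match the value ranges in the sample output above (hhi between 0 and 1, entropy negative) but are not confirmed by this document. The helper names are hypothetical.

```r
# Inverted Herfindahl index: 1 minus the sum of squared shares (assumed form).
inverted_hhi <- function(x) {
  shares <- x / sum(x)
  1 - sum(shares^2)
}

# Inverted entropy: negative Theil index (assumed form); 0 is most dispersed.
inverted_entropy <- function(x) {
  ratio <- x / mean(x)
  terms <- ifelse(ratio > 0, ratio * log(ratio), 0)  # treat 0 * log(0) as 0
  -mean(terms)
}

uniform <- rep(1, 4)           # equal search scores in four countries
concentrated <- c(0, 0, 0, 1)  # all search interest in a single country

inverted_hhi(uniform)
#> [1] 0.75
inverted_hhi(concentrated)
#> [1] 0
inverted_entropy(uniform)
#> [1] 0
inverted_entropy(concentrated)
#> [1] -1.386294
```

Under these assumptions, both measures take their largest value for the uniform distribution, in line with the inverted Gini coefficient.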

Alternative sets of locations

By default, globaltrends makes all downloads and computations for the countries set of locations. The countries set covers all countries that generated at least 0.1% of world GDP in 2018. By changing the input locations to us_states, the package uses US states and Washington, D.C. as a basis for downloads and computations instead. Apart from compute_doi, all functions use the name of the variable that contains the location vector as inputs for locations (e.g., countries, us_states). The function start_db exports these vectors of ISO2 codes to the global environment. Function compute_doi, however, does not directly refer to these objects, but to their names (e.g., “countries”, “us_states”). Using state or district-level locations allows users to analyze the within-country dispersion of firms.

# change locations -------------------------------------------------------------
download_control(control = 1, locations = us_states)
download_object(object = 1:2, locations = us_states)
compute_score(control = 1, object = 2, locations = us_states)
compute_doi(control = 1, object = 1:2, locations = "us_states")

Users can add their own sets of locations through the function add_locations. The argument locations takes the location codes (e.g., “AT”, “CH”, “DE”), and the argument type takes the name of the location set (e.g., “dach”). The new location set can be used in all functions. Since all functions check whether data on a location already exists, globaltrends does not duplicate data for new location sets.

add_locations(c("AT", "CH", "DE"), type = "dach")
#> Successfully created new location set dach (AT, CH, DE).
data <- export_score(keyword = "coca cola", locations = dach)
dplyr::count(data, location)
#> # A tibble: 3 x 2
#>   location     n
#>   <chr>    <int>
#> 1 AT         127
#> 2 CH         127
#> 3 DE         127

Search topics vs search terms

Results for individual keywords as search terms (e.g., weather, apple, coca cola) might be distorted by translation issues (i.e., keywords are searched for in different languages), keyword contamination (i.e., keywords relate to different queries: apple vs. Apple Inc.), and keyword dilution (i.e., multiple keywords relate to the same query: election, vote). Search topics help users partly overcome these issues. Google defines a search topic as “a group of terms that share the same concept in any language.” Queries that use search topics are therefore language-independent, cover different terms, and differentiate between them.

Users can identify the codes of search topics on the Google Trends portal by selecting the respective topic, rather than a search term (see the screenshot below).

After selecting the relevant search topics, users can identify the topic codes in the query’s URL. For example, based on the URL https://trends.google.com/trends/explore?q=%2Fm%2F03phgz&geo=AT, the topic code for The Coca-Cola Company is %2Fm%2F03phgz. If downloads are made through the research API, special characters must be added as “literals.” That is, %2Fm%2F03phgz must be changed to /m/03phgz when using the research API and remains %2Fm%2F03phgz when working with gtrendsR::gtrends. Users can use these topic codes as keywords instead of single search terms. We point users to Kupfer and Zorn (2020, pp. 1169-1170) for a detailed comparison of search topics and search terms.
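
The conversion between the two forms of a topic code can be done with base R. The sketch below extracts the q parameter from the example URL and converts it with utils::URLdecode and utils::URLencode. The variable names are illustrative only.

```r
# URL copied from the Google Trends portal (the example from the text).
url <- "https://trends.google.com/trends/explore?q=%2Fm%2F03phgz&geo=AT"

# Extract the value of the q parameter: the percent-encoded topic code.
topic_encoded <- sub(".*[?&]q=([^&]*).*", "\\1", url)
topic_encoded
#> [1] "%2Fm%2F03phgz"

# Decode the percent-encoding to obtain the literal form.
topic_literal <- utils::URLdecode(topic_encoded)
topic_literal
#> [1] "/m/03phgz"

# Re-encode the literal form; reserved = TRUE also encodes "/".
utils::URLencode(topic_literal, reserved = TRUE)
#> [1] "%2Fm%2F03phgz"
```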

Important: We recommend that search topics for control keywords are used in combination with search topics for object keywords and vice versa.

Further applications

To measure degree of internationalization, globaltrends offers a wide array of empirical possibilities (Puhr & Müllner, 2021). It allows researchers to compare degree of internationalization for various organizations on a unified scale (e.g., Coca-Cola Company, Facebook Inc., Real Madrid, and Manchester United). In addition, the time-series nature of Google Trends allows for historical analysis of internationalization patterns and speed within organizations.

The enormous detail of the data opens additional research applications that are not possible with traditional measures of internationalization. For instance, using globaltrends at the subnational level (e.g., locations = us_states) allows researchers to study proliferation within a country and, for example, trace a particular market entry. In addition, globaltrends offers applications beyond corporate internationalization, including data on global interest in products, people, events, social trends, or scandals.



References

  • Castelnuovo, E. & Tran, T. D. 2017. Google It Up! A Google Trends-based uncertainty index for the United States and Australia. Economics Letters, 161: 149-153.
  • Costola, M., Iacopini, M., & Santagiustina, C. R. M. A. 2021. Google search volumes and the financial markets during the COVID-19 outbreak. Finance Research Letters, 42: 101884.
  • Kupfer, A. & Zorn, J. 2020. A language-independent measurement of economic policy uncertainty in Eastern European countries. Emerging Markets Finance and Trade, 56(5): 1166-1180.
  • MacKinlay, A. C. 1997. Event studies in economics and finance. Journal of Economic Literature, 35(1): 13-39.
  • McWilliams, A. & Siegel, D. 1997. Event studies in management research: Theoretical and empirical issues. Academy of Management Journal, 40(3): 626-657.
  • Puhr, H. & Müllner, J. 2021. Let me Google that for you: Capturing globalization using Google Trends. SSRN Working Paper 3969013. Available at https://ssrn.com/abstract=3969013/.
  • Puhr, H. & Müllner, J. 2022. Foreign to all but fluent in many: The effect of multinationality on shock resilience. Journal of World Business, 57(6): 101370.

Appendix A

Google Trends does not query the total population of search queries on Google—an impossible task given the massive volume of data involved. Users specify which keyword \(ko\) they want to query for location \(l\) within timeframe \(T\). We follow Costola, Iacopini, and Santagiustina (2021) to illustrate the data preparation steps applied by Google below.

Google filters the total population of search queries on its platform to those queries that fit the user-specified location \(l\) and time period \(T\). This sample (Panel A) includes all relevant search queries: those that relate to keyword \(ko\) (in red) and those that do not (in green). To limit computational requirements, Google takes a random sample of the relevant search queries (Panel B) to compute the Google Trends search volume \(SV_{ko,l,t}\). Although the sub-sample includes a substantially lower number of queries, the relation between queries that relate to \(ko\) and those that do not remains the same. Next, Google compares the number of queries that relate to \(ko\) for each day \(t \in T\) to compute a relative search score (Panel C). To compute the Google Trends search volume \(SV_{ko,l,t}\), Google normalizes the relative search score to a value between 0 and 100, where 100 is the maximum search score in the analyzed combination of \(ko\), \(l\), and \(T\).
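
The sampling and normalization steps can be mimicked with a small simulation in base R. All quantities below (population size, daily shares, sample size) are invented for illustration. The sketch only mirrors the description above and makes no claim about Google's actual implementation.

```r
set.seed(42)

# Toy population of queries for one location over five days (made-up numbers).
# Each row is one query; about_ko flags queries that relate to keyword ko.
days <- rep(1:5, each = 20000)
share_ko <- c(0.010, 0.020, 0.040, 0.030, 0.015)  # true daily share of ko
population <- data.frame(
  day = days,
  about_ko = runif(length(days)) < share_ko[days]
)

# Panel B: draw a random sub-sample of the relevant queries.
sampled <- population[sample(nrow(population), 20000), ]

# Panel C: relative search score = share of sampled queries about ko per day.
relative <- tapply(sampled$about_ko, sampled$day, mean)

# Normalize so that the maximum within the time frame equals 100.
sv <- round(relative / max(relative) * 100)
sv
```

Because the sub-sample is random, repeated draws give slightly different values of sv, while the day with the highest share always receives the value 100.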

Appendix B

Releveling of search volumes

Google Trends does not provide raw search queries for downloads. Instead, Google Trends expresses the number of search queries as search volumes relative to the total number of search queries and then normalizes this data. To use Google Trends data, we first have to bring all search volumes to the same level.
For object keyword \(ko\), included in object batch \(bo\), Google Trends observes \(SQ_{ko,bo,l,t}\) search queries for location \(l\) at time \(t\). The number of raw search queries is transformed to search volumes \(SV_{ko,bo,l,t}\) by dividing it by the total number of search queries for the given location-time pair \(l,t\):

\[\begin{equation} SV_{ko,bo,l,t}=\frac{SQ_{ko,bo,l,t}}{\sum SQ_{l,t}}. \tag{1} \end{equation}\]

Next, Google Trends divides search volumes \(SV_{ko,bo,l,t}\) by the maximum search value within object batch \(bo\) at location \(l\) to normalize search volumes to \(\tilde{SV}_{ko,bo,l,t}\):

\[\begin{equation} \tilde{SV}_{ko,bo,l,t}=\frac{SV_{ko,bo,l,t}}{max(SV_{bo,l})*100}. \tag{2} \end{equation}\]

Since this normalization step is contingent on the maximum search volume within object batch \(bo\), normalized search volumes \(\tilde{SV}\) depend on the other keywords included in the object batch, the choice of location, and the time span \(T\) (\(t \in T\)) for which data is obtained. To prepare normalized search volumes \(\tilde{SV}\) for further use, the globaltrends package follows Castelnuovo and Tran (2017, pp. A1-A2) and relevels \(\tilde{SV}\) by mapping it to a benchmark. To this end, we map all \(\tilde{SV}\) values in object batch \(bo\) to the same level as \(\tilde{SV}\) values in control batch \(bc\). The function download_object automatically adds a control keyword \(kc\) to all object batches \(bo\). In functions compute_score and compute_voi, \(\tilde{SV}_{kc,bc,l,t}\) of control keyword \(kc\) in control batch \(bc\) is divided by \(\tilde{SV}_{kc,bo,l,t}\) of the same keyword in object batch \(bo\). By multiplying the result of this division by the normalized search volumes \(\tilde{SV}_{ko,bo,l,t}\), we get releveled search volumes \(\tilde{SV}_{ko,bc,l,t}\) for object keyword \(ko\), at location \(l\), at time \(t\):

\[\begin{equation} \tilde{SV}_{ko,bc,l,t}=\tilde{SV}_{ko,bo,l,t}*\frac{\tilde{SV}_{kc,bc,l,t}}{\tilde{SV}_{kc,bo,l,t}}. \tag{3} \end{equation}\]

After the releveling, search volumes from all object batches use control batch \(bc\) as basis for normalization.
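
A small numeric illustration of equation (3) in base R, using invented normalized search volumes for three points in time. The variable names follow the subscripts in the text and are otherwise hypothetical.

```r
# Normalized volumes downloaded in control batch bc (toy numbers).
sv_kc_bc <- c(80, 100, 90)  # control keyword kc in control batch bc

# Normalized volumes downloaded in object batch bo, which also contains kc.
sv_kc_bo <- c(40, 50, 45)   # the same control keyword kc in object batch bo
sv_ko_bo <- c(10, 20, 100)  # object keyword ko in object batch bo

# Equation (3): scale object volumes by the ratio of kc across the two batches.
sv_ko_bc <- sv_ko_bo * sv_kc_bc / sv_kc_bo
sv_ko_bc
#> [1]  20  40 200
```

In this example the control keyword scores twice as high in the control batch as in the object batch, so the object volumes are doubled. Releveled values can therefore exceed 100.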

Computing search scores

Releveling does not de-normalize search volumes; it only brings them to the level of control batch \(bc\). This means that \(\tilde{SV}_{ko,bc,l,t}\) may still be distorted by \(max(SV_{bc,l})\). To overcome such distortion, functions compute_score and compute_voi divide \(\tilde{SV}_{ko,bc,l,t}\) by search volumes for a set of control keywords \(KC\). Since gmail, maps, translate, wikipedia, and youtube approximate “standard” (baseline) search traffic on Google, we propose them as control keywords for global trend analysis. For specific research settings, we suggest adapting control keywords to the respective setting and testing them on the Google Trends portal beforehand. To compute search score \(SC_{ko,l,t}\), we divide search volumes for object keywords by the sum of search volumes for control keywords \(kc \in KC\):

\[\begin{equation} SC_{ko,l,t}=\frac{\tilde{SV}_{ko,bc,l,t}}{\sum_{kc \in KC} \tilde{SV}_{kc,bc,l,t}}. \tag{4} \end{equation}\]

Using equation (3) for releveling from above, we can rewrite equation (4) for \(SC\) as follows:

\[\begin{equation} SC_{ko,l,t}=\frac{\tilde{SV}_{ko,bo,l,t}*\frac{\tilde{SV}_{kc,bc,l,t}}{\tilde{SV}_{kc,bo,l,t}}}{\sum_{kc \in KC} \tilde{SV}_{kc,bc,l,t}} \tag{5} \end{equation}\]

\[\begin{equation} SC_{ko,l,t}=\frac{\frac{SV_{ko,bo,l,t}}{max(SV_{bo,l})*100}*\frac{\frac{SV_{kc,bc,l,t}}{max(SV_{bc,l})*100}}{\frac{SV_{kc,bo,l,t}}{max(SV_{bo,l})*100}}}{\sum_{kc \in KC} \frac{SV_{kc,bc,l,t}}{max(SV_{bc,l})*100}} \tag{6} \end{equation}\]

\[\begin{equation} SC_{ko,l,t}=\frac{SV_{ko,bo,l,t}*\frac{SV_{kc,bc,l,t}}{SV_{kc,bo,l,t}}}{\sum_{kc \in KC} SV_{kc,bc,l,t}} \tag{7} \end{equation}\]

\[\begin{equation} SC_{ko,l,t}=\frac{SV_{ko,bc,l,t}}{\sum_{kc \in KC} SV_{kc,bc,l,t}}. \tag{8} \end{equation}\]

Using equation (1) for \(SV\) from above, we can reformulate \(SC\) as:

\[\begin{equation} SC_{ko,l,t}=\frac{\frac{SQ_{ko,l,t}}{\sum SQ_{l,t}}} {\sum_{kc \in KC} \frac{SQ_{kc,l,t}}{\sum SQ_{l,t}}}. \tag{9} \end{equation}\]

\[\begin{equation} SC_{ko,l,t}=\frac{SQ_{ko,l,t}}{\sum_{kc \in KC} SQ_{kc,l,t}}. \tag{10} \end{equation}\]

Based on these transformations, we can interpret search score \(SC\) as the ratio of search queries \(SQ_{ko,l,t}\) for object keyword \(ko\) to the sum of search queries \(SQ_{kc,l,t}\) for control keywords \(kc \in KC\) at location \(l\) for time \(t\). Since \(SC\) is independent of any keyword batch \(bo\) or \(bc\), search scores allow comparison across objects of interest, time, and countries.
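
The derivation can be checked numerically. The base R sketch below starts from invented raw query counts, applies the normalization, releveling, and division steps of equations (1) to (4), and compares the result with the ratio in equation (10). All numbers are made up, the keyword sq_other is a hypothetical extra member of the object batch, and the batch maxima are scaled to 100 (a constant that cancels in the score).

```r
# Raw quantities that Google does not expose (toy numbers, three months).
total <- c(1e6, 1.2e6, 0.9e6)   # all queries at location l
sq_ko <- c(500, 900, 450)       # queries for object keyword ko
sq_kc <- rbind(                 # queries for the control keywords in KC
  gmail = c(20000, 30000, 18000),
  maps  = c(10000, 12000, 9000)
)
sq_other <- c(5000, 2000, 8000) # hypothetical second keyword in batch bo

# Equation (1): search volumes relative to all queries.
sv_ko <- sq_ko / total
sv_kc <- sweep(sq_kc, 2, total, "/")
sv_other <- sq_other / total

# Normalization: divide by each batch's maximum and scale to 100.
max_bc <- max(sv_kc)                              # control batch: gmail, maps
max_bo <- max(sv_ko, sv_other, sv_kc["gmail", ])  # object batch incl. gmail
norm_kc_bc <- sv_kc / max_bc * 100
norm_ko_bo <- sv_ko / max_bo * 100
norm_kc_bo <- sv_kc["gmail", ] / max_bo * 100

# Equation (3): relevel. Equation (4): divide by the sum of control keywords.
norm_ko_bc <- norm_ko_bo * norm_kc_bc["gmail", ] / norm_kc_bo
score <- norm_ko_bc / colSums(norm_kc_bc)

# Equation (10): the same score computed directly from raw queries.
all.equal(unname(score), sq_ko / colSums(sq_kc))
#> [1] TRUE
```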