For those more interested in the contents of the package rather than analysis of the actual data at hand, here I detail some of the key steps.
The main steps for preparing the data found in the Shiny app were to:

- query The New York Times Article Search API for every article about India
- combine and clean the raw results
- consolidate values that refer to the same entity with lookup tables
- unnest and clean the article keywords
- geocode the location keywords and flag which ones are countries
- add a history of Indian governments

I provide more detail on each of these steps in the sections below.
This project started with querying The New York Times Article Search API.
My goal was to collect every article in The New York Times “about India” dating back as far as possible. Accordingly, the query string required that all results include:
After reading a few tutorials on querying The New York Times’ API, like this one, it was not too difficult to get the results I wanted.
My code in R/01-query-nyt-api.R is probably not the most robust script, but the main learning point was to iterate over small chunks of time.
My query script essentially has two loops: an outer loop over small chunks of time (ultimately one month each) and an inner loop over the pages of results within each chunk.
The API returns 10 articles at a time, and you won’t get far querying the full time range at once because the response time slows as you paginate deeper into the results. When trying to query the results for one year, I wasn’t close to reaching the daily limit of 4,000 requests (40,000 articles), but my query was timing out.
Smaller time chunks mean fewer pages of results for each chunk. Once I switched to monthly increments, I was able to retrieve all of the results in one job.
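The structure is roughly the sketch below. This is not the exact code in R/01-query-nyt-api.R; the query parameters, sleep time, and file paths are illustrative placeholders.

```r
library(httr)

months <- seq(as.Date("1947-01-01"), as.Date("2021-12-01"), by = "month")

for (m in months) {
  m <- as.Date(m, origin = "1970-01-01")   # `for` drops the Date class
  begin <- format(m, "%Y%m%d")
  end <- format(seq(m, by = "month", length.out = 2)[2] - 1, "%Y%m%d")

  page <- 0
  docs <- list()
  repeat {
    resp <- GET(
      "https://api.nytimes.com/svc/search/v2/articlesearch.json",
      query = list(
        q = "India",                        # placeholder query
        begin_date = begin,
        end_date = end,
        page = page,
        `api-key` = Sys.getenv("NYT_KEY")
      )
    )
    new_docs <- content(resp, as = "parsed")$response$docs
    if (length(new_docs) == 0) break        # no more pages for this month
    docs <- c(docs, new_docs)
    page <- page + 1
    Sys.sleep(6)                            # stay well under the per-minute rate limit
  }

  saveRDS(docs, file.path("data-raw/nyt", paste0(format(m, "%Y-%m"), ".rds")))
}
```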
My API query produced a folder with one file for every year-month combination. The next step was to combine those files into one data frame, remove unnecessary columns, and begin cleaning the data (R/02-prepare-nested.R).
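Assuming one .rds file of parsed results per month, as above, the combining step can be sketched like this; the columns kept here are just a subset of the real ones.

```r
library(purrr)
library(dplyr)

files <- list.files("data-raw/nyt", pattern = "\\.rds$", full.names = TRUE)

articles <- map_dfr(files, function(f) {
  docs <- readRDS(f)
  map_dfr(docs, function(d) {
    tibble(
      url       = d$web_url,
      pub_date  = as.Date(substr(d$pub_date, 1, 10)),
      headline  = d$headline$main,
      news_desk = d$news_desk %||% NA_character_,
      abstract  = d$abstract %||% NA_character_,
      keywords  = list(bind_rows(d$keywords))   # keep keywords as a nested data frame
    )
  })
})
```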
Many of the columns in the dataset contained multiple values that represent the same or nearly the same entity.
For example, the raw values for a column like news_desk included “Business/Finance”, “Money and Business/Financial”, “Financial”, “Business”, “Business/World Business”, and “Business / World Business”.
The same problem presented itself with a column like keyword values. To give one example, there were nine different variants all referring to the same person (“Gandhi, Mohandas Karamchand”).
I needed to consolidate values referring to the same entity into a single value. I initially approached this problem with a very long case_when() statement, but that quickly became untenable as the extent of the problem became apparent.
A more automated solution like fuzzy matching, or approximate string matching, was also not viable as many of the correct matches are not actually that close in terms of string distance. On the other hand, values like “Gandhi, Rahul” and “Gandhi, Rajiv” are close in terms of string distance, but are two different people. Accordingly, all renaming required human review.
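To illustrate with the {stringdist} package (just for the example; it isn’t necessarily the distance used in the app):

```r
library(stringdist)

# Two different people, only a few characters apart:
stringdist("Gandhi, Rahul", "Gandhi, Rajiv")

# The same person, but far apart as strings:
stringdist("Gandhi, Mohandas Karamchand", "Mahatma Gandhi")
```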
My solution was to write a small shiny app that would make it easy to create lookup tables.
I would first create a file with counts of distinct names for a column like news desk or keyword values. I would choose the name I wanted to keep. The app would then sort all of the other values in the column by string distance. Between the similarity score and searching in the DataTable, I was able to find the most likely values to be consolidated under the chosen name.
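Conceptually, for each chosen name the app does something like the sketch below; distinct_names and the lookup format are simplified stand-ins for what the real app produces.

```r
library(dplyr)
library(stringdist)

# distinct_names: distinct raw values of a column and their counts,
# e.g. count(articles, news_desk, sort = TRUE)
chosen <- "Business"

candidates <- distinct_names %>%
  mutate(similarity = stringsim(news_desk, chosen)) %>%
  arrange(desc(similarity))

# After reviewing `candidates`, the values that belong under "Business"
# go into a lookup table, which is later applied with a join:
lookup <- tibble(
  raw   = c("Business/Finance", "Financial", "Business / World Business"),
  clean = chosen
)

articles <- articles %>%
  left_join(lookup, by = c("news_desk" = "raw")) %>%
  mutate(news_desk = coalesce(clean, news_desk)) %>%
  select(-clean)
```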
Screenshot of Lookup Table shiny app
Another challenge in preparing the data was handling nested keywords. After combining all of the raw files into one data frame, every record represents one article. (Think of the Table tab of the nyt_india_app
). The keywords for every article are structured as a nested data frame. You can see this in the final nested data frame below.
#> Rows: 33,646
#> Columns: 15
#> $ url <chr> "https://www.nytimes.com/2021/04/30/opinion/india-covid…
#> $ pub_date <date> 2021-04-30, 2021-04-30, 2021-04-30, 2021-04-30, 2021-0…
#> $ headline <chr> "‘Death Is the Only Truth.’ Watching India’s Funeral Py…
#> $ news_desk <chr> "OpEd", "Washington", "Washington", "Foreign", "Foreign…
#> $ section <chr> "Opinion", "U.S.", "U.S.", "World", "World", "World", "…
#> $ material <chr> "Op-Ed", "News", "News", "News", "News", "News", "Lette…
#> $ byline <chr> "Aman Sethi", "Noah Weiland and Jim Tankersley", "Noah …
#> $ abstract <chr> "Graphic images of mass cremations cut through the Indi…
#> $ lead_paragraph <chr> "NEW DELHI — The first 36 corpses were placed in the de…
#> $ front_page <lgl> NA, TRUE, NA, NA, NA, NA, FALSE, FALSE, NA, NA, NA, NA,…
#> $ printed <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FA…
#> $ keywords <list> [<tbl_df[8 x 6]>], [<tbl_df[8 x 6]>], [<tbl_df[5 x 6]>…
#> $ india_rank <dbl> 8, 4, 3, 6, 3, 12, 2, 1, 4, 1, 2, 2, 1, 4, 7, 6, 2, 1, …
#> $ max_kword <dbl> 8, 8, 5, 6, 4, 15, 4, 6, 6, 9, 4, 4, 7, 7, 8, 6, 9, 6, …
#> $ in_of_n_kword <glue> "8 of 8", "4 of 8", "3 of 5", "6 of 6", "3 of 4", "12 …
The nested keywords data frame from the Article Search API includes not only the keyword value, but also a category (“subject”, “persons”, “glocations”, “organizations”, or “creative_works”) and its rank of importance in the article. The lat, lon, and country columns are added in the geocoding steps below. Here’s one example of a single nested keyword data frame:
#> # A tibble: 8 x 6
#> name value rank lat lon country
#> <chr> <chr> <int> <dbl> <dbl> <lgl>
#> 1 subject funerals and memorials 1 NA NA FALSE
#> 2 subject coronavirus (2019-ncov) 2 NA NA FALSE
#> 3 subject cremation 3 NA NA FALSE
#> 4 subject death and dying 4 NA NA FALSE
#> 5 subject deaths (fatalities) 5 NA NA FALSE
#> 6 subject disease rates 6 NA NA FALSE
#> 7 glocations new delhi (india) 7 28.6 77.2 FALSE
#> 8 glocations india 8 27.1 80.3 TRUE
In preparing the data, I used {tidyr}’s nest()
and unnest()
functions for this task. In the Shiny app itself, however, it’s faster to go from the unnested version back to the nested one by taking the distinct URLs from the unnested data frame and using them to filter the nested data frame.
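In {tidyr} terms, the two versions are related roughly like this, where nested stands for the data frame in the glimpse above:

```r
library(tidyr)

# One row per keyword, with the article-level columns repeated:
unnested <- unnest(nested, keywords)

# And back again, collapsing the keyword columns into a list-column:
renested <- nest(unnested, keywords = c(name, value, rank, lat, lon, country))
```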
Once keyword values were unnested and cleaned, I wanted to join in geographic coordinates in order to visualize the location keywords on a map.
There are a number of free geocoding APIs available, many with R wrappers. I chose to query the MapQuest Geocoding API to return longitude and latitude coordinates for every unique location keyword. I expect there are a few errors, but overall it did a good job.
Again, my script (R/07-query-mapquest.R) is likely not the most robust, but, as there were a relatively small number of locations, it wasn’t a difficult task.
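A minimal sketch of that kind of request, assuming an API key in a MAPQUEST_KEY environment variable (the actual response handling in R/07-query-mapquest.R may differ):

```r
library(httr)

geocode_keyword <- function(location) {
  resp <- GET(
    "https://www.mapquestapi.com/geocoding/v1/address",
    query = list(key = Sys.getenv("MAPQUEST_KEY"), location = location)
  )
  # Take the first (best) match returned for the location string
  latlng <- content(resp, as = "parsed")$results[[1]]$locations[[1]]$latLng
  c(lat = latlng$lat, lon = latlng$lng)
}

geocode_keyword("new delhi (india)")
```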
Location keywords include both cities and countries. They should really be distinguished in some way before plotting as bubbles on the same map. To achieve this, I joined in a list of country name data in order to identify which keywords are countries. Once again, likely not a perfect list, but it did the job. See R/08-add-coords-countries.R for details.
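Conceptually, the country flag is just a membership test against a list of country names; country_names below is a tiny illustrative stand-in for the list used in R/08-add-coords-countries.R.

```r
library(dplyr)

country_names <- c("india", "pakistan", "china", "united states")  # illustrative only

keywords <- keywords %>%
  mutate(country = name == "glocations" & value %in% country_names)
```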
One last piece of data needed was a history of Indian governments. This enabled filtering the dataset by a specific party or prime minister instead of a single date range. It also provided a helpful background for the timeline.
This data came from Wikipedia. Instead of scraping it directly, copy-pasting into a Google sheet and then using the {datapasta} package was the easiest way to get the data into R. See data-raw/govt.R for more details.
#> Rows: 28
#> Columns: 7
#> $ pm <chr> "Viceroy", "Jawaharlal Nehru", "Jawaharlal Nehru", "Jawaharl…
#> $ party <chr> "British Raj", "Indian National Congress", "Indian National …
#> $ from <date> 1855-01-01, 1947-08-15, 1952-04-15, 1957-04-17, 1962-04-02,…
#> $ to <date> 1947-08-14, 1952-04-14, 1957-04-16, 1962-04-01, 1964-05-26,…
#> $ color <chr> "#fb9a99", "#a6cee3", "#a6cee3", "#a6cee3", "#a6cee3", "#a6c…
#> $ abb <chr> "GB", "INC", "INC", "INC", "INC", "INC", "INC", "INC", "INC"…
#> $ govt_name <glue> "1855-01-01 / 1947-08-14: Viceroy (British Raj)", "1947-08-…
The key points for building the shiny app were:

- reducing the amount of code in ui.R with UI functions
- moving non-reactive code out of server.R and into functions kept in global.R

The UI for this Shiny app is actually quite simple.
UI functions, as described in the Mastering Shiny book, were a great help to reduce the code in ui.R
. There are about 15 inputs on the Filters tab alone, and most are very similar.
After writing a myInput()
function, I was able to store most of the input information in one table and generate the inputs programmatically. Doing so increased clarity and reduced code duplication.
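A sketch of that pattern; the input table, choices, and myInput() here are simplified stand-ins for the real ones.

```r
library(shiny)
library(purrr)

# One row per input: its id, label, and choices
input_tbl <- tibble::tribble(
  ~id,         ~label,      ~choices,
  "news_desk", "News desk", c("Foreign", "OpEd", "Business"),
  "section",   "Section",   c("World", "Opinion", "U.S."),
  "material",  "Material",  c("News", "Op-Ed", "Letter")
)

myInput <- function(id, label, choices) {
  selectInput(id, label, choices = choices, multiple = TRUE)
}

# In ui.R, the filter inputs can then all be generated in one call:
filter_inputs <- pmap(input_tbl, myInput)
```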
The most important function in global.R
is filter_unnested()
. It filters the complete unnested data frame (where each row represents a keyword belonging to an article) based on about 15 different user inputs.
Filtering on some of these inputs, like news_desk, section, or material, is quite simple. Others, such as keyword filtering, require more complicated logic.
When filtering by keywords, you first need to obtain the correct set of article URLs. This is because for visualizations like the keyword pairs or even the bar plot of counts, you want to retain the other keywords in articles matching the filter.
For example, imagine the user wants to filter for all articles with a keyword “pakistan”. We need to find the article URLs with the matching keyword, and then use those URLs to filter the dataset. Otherwise, the other keywords in articles with a “pakistan” keyword would be lost, and visualizations like the bar plot or keyword pairs would be empty.
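In simplified form, the keyword part of filter_unnested() works like this; unnested and keep_keywords are stand-ins for the full data frame and the user’s selection.

```r
library(dplyr)

# Find the articles that contain any of the selected keywords...
matching_urls <- unnested %>%
  filter(value %in% keep_keywords) %>%
  distinct(url)

# ...then filter by URL rather than by keyword, so the *other* keywords of
# matching articles survive for the bar plot and keyword-pair visualizations.
filtered_unnested <- semi_join(unnested, matching_urls, by = "url")
```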
Once you have the correct unnested data frame, you also need a nested version (where each row is an article) that can be used in visualizations like the Table or Timeline. Instead of going to the trouble of actually nesting it each time though, the nest_df()
function just gets the distinct URLs in the unnested data frame and then performs a join on the full nested data frame.
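Something along these lines (a simplified stand-in for the package’s nest_df()):

```r
library(dplyr)

nest_df <- function(filtered_unnested, nested_full) {
  # Each article appears exactly once in the nested data frame, so a
  # semi_join on url gives the same result as re-nesting, much faster.
  semi_join(nested_full, distinct(filtered_unnested, url), by = "url")
}
```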
Once the correct reactive (nested or unnested) data frame is available, creating the actual visualizations is fairly simple.
As suggested in Mastering Shiny, I tried to keep all of the non-reactive code outside of server.R
.
Aside from calculating the outputs, most of the code in server.R
is for filtering the data by interacting with the visualizations themselves. Using Shiny with {DT}, {ggiraph}, {dygraphs}, and {leaflet} follows a similar pattern of saving a click event with an eventReactive().
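For example, with {ggiraph} the pattern looks roughly like the sketch below, which lives inside the server function; keyword_plot is a hypothetical output id, and filtered_unnested() stands for the reactive produced by the filtering step above.

```r
# server.R (sketch): save the keyword clicked in the {ggiraph} bar plot...
clicked_keyword <- eventReactive(input$keyword_plot_selected, {
  input$keyword_plot_selected
})

# ...and use it as an additional filter on the current data
clicked_data <- reactive({
  req(clicked_keyword())
  dplyr::filter(filtered_unnested(), value %in% clicked_keyword())
})
```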
This project was originally a typical data analysis project: a series of scripts that produced an output dataset, combined with a shiny app for visualization.
I decided to take it one step further and restructure it as a package. Doing so makes it easier to update the data every month and was a great learning opportunity.