Question: Why should you go through all these efforts to build an API request and download the data in JSON
file format instead of just downloading a .csv
file?
APIs and urban data
Aims of the practical
Understand the technicalities and the breadth of the urban data landscape
Understand the messiness of real world (urban) data
Access programmatically urban data and understand the benefits of such data access practise.
APIs
Application Programming Interface (APIs) represent a ‘gate’ or otherwise a platform that enables a client (that is you) to interact with a server (for example opendata.bristol.gov.uk. According to AWS (2022):
In the context of APIs, the word Application refers to any software with a distinct function. Interface can be thought of as a contract of service between two applications. This contract defines how the two communicate with each other using requests and responses. Their API documentation contains information on how developers are to structure those requests and responses.
The client’s software (this might be R for example) sends a request to the server requesting specific data. The response is the data the client asked.
More commonly, the client might be a mobile phone app (e.g. train network status app) and the server is the network operator’s server.
APIs can be private or public types. For more inthe description from AWS (2022) here
JSON file format
Although some urban data can be downloaded as a tidy1 table in a .csv
format, most APIs will ship data to you in a JSON
format.
According to the official JSON website (JSON), JSON
is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language Standard ECMA-262 3rd Edition - December 1999. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.
It adopts a hierarchical structure such as attribute:value
instead of the fixed structure of tidy data. Following Singleton, Spielman, and Folch (2017), such attribute:value
pairs might be:
“mean income”:“50000”
“internet speed”:“35”
“jobs”:“15000”
These attributes might be organised hierarchically:
(“city”:“Bristol,”mean income”: “27000”, “internet speed”: “55”, “jobs”:(“2019”:“15000”, “2020”:“14000”)), (“city”:“Newcastle,”mean income”: “25000”, “internet speed”: “45”, “jobs”:(“2019”:“11000”, “2020”:“10000”))
Although this format is more verbose (attribute names are repeated) it is more flexible (not all observations need to have all attributes) (Singleton, Spielman, and Folch 2017)
Practical example with API
opendata.bristol.gov.uk offers a lot of data about Bristol. For this example, let’s try to work with Jobs by Ward (by industry) in Bristol 2020. As you can see, this is a small data set with \(35\) observations representing Bristol’s \(35\) Wards. You can visualise these data with the Table
tab and even download a .csv
with these data. To begin with, do both of these things.
Then, go to the API
tab. Change the rows
you want to include in you API request from 10, which is the default here to 35, which is the maximum as we already know. Then clink on the Submit
button and below that you will see a new link. This URL is your API request. If you carefully read the URL from right to left you can easily recognise the fields you saw on the previous screen. The weird output on your screen is the data you requested in JSON
format.
So, let’s try to access these data programmatically. First, we need to load the relevant R
packages
library(tidyverse)
library(jsonlite)
To access the data, we are going to use the function fromJSON()
from the jsonlite
package. Please check the syntax of this function as well as the other function of this package.
<- jsonlite::fromJSON("https://opendata.bristol.gov.uk/api/records/1.0/search/?dataset=jobs-by-ward-by-industry-in-bristol-2020&q=&rows=35&facet=ward", flatten = TRUE) dat
We now have an R
object called dat
. Let’s try to see how it looks like
typeof(dat)
[1] "list"
So, this is a list. If you need a reminder for what a list is, see here. Now the difficult part. Let’s try to figure out the structure of this JSON
formated data.
str(dat)
List of 4
$ nhits : int 35
$ parameters :List of 6
..$ dataset : chr "jobs-by-ward-by-industry-in-bristol-2020"
..$ rows : int 35
..$ start : int 0
..$ facet : chr "ward"
..$ format : chr "json"
..$ timezone: chr "UTC"
$ records :'data.frame': 35 obs. of 24 variables:
..$ datasetid : chr [1:35] "jobs-by-ward-by-industry-in-bristol-2020" "jobs-by-ward-by-industry-in-bristol-2020" "jobs-by-ward-by-industry-in-bristol-2020" "jobs-by-ward-by-industry-in-bristol-2020" ...
..$ recordid : chr [1:35] "8f1ee37587a2233922bfa38b048879368059502e" "464799297e1d1af8bb980450d062c7eedb787042" "4cf81b7e8c57fc61c73eb40c4d4556446d20896c" "e034ff06e7c7d67f6f716c41e8ccd940a36d41fd" ...
..$ record_timestamp : chr [1:35] "2021-12-22T11:29:00.712Z" "2021-12-22T11:29:00.712Z" "2021-12-22T11:29:00.712Z" "2021-12-22T11:29:00.712Z" ...
..$ fields.mining_quarrying_utilities : int [1:35] 10 0 0 0 0 10 0 30 2250 0 ...
..$ fields.transport_storage_inc_postal : int [1:35] 20 0 350 15 200 50 100 600 1750 300 ...
..$ fields.accommodation_food_services : int [1:35] 100 150 175 800 700 175 400 400 6000 300 ...
..$ fields.public_administration_defence : int [1:35] 0 0 0 10 50 0 0 400 9000 200 ...
..$ fields.agriculture_forestry_fishing : int [1:35] 0 0 0 0 0 0 0 0 0 0 ...
..$ fields.business_administration_support_services : int [1:35] 125 75 1250 1750 450 225 500 450 11000 125 ...
..$ fields.retail : int [1:35] 700 100 200 700 900 75 800 1500 4500 500 ...
..$ fields.property : int [1:35] 10 15 20 450 175 75 200 30 1750 30 ...
..$ fields.health : int [1:35] 150 125 125 450 500 500 2000 900 12000 1500 ...
..$ fields.motor_trades : int [1:35] 125 30 300 30 350 20 40 600 200 40 ...
..$ fields.manufacturing : int [1:35] 150 25 900 25 400 5 75 500 500 50 ...
..$ fields.information_communication : int [1:35] 225 200 400 800 350 175 450 700 10000 50 ...
..$ fields.education : int [1:35] 350 250 400 200 350 175 1000 500 9000 800 ...
..$ fields.ward_code : chr [1:35] "E05010889" "E05010918" "E05010890" "E05010894" ...
..$ fields.financial_insurance : int [1:35] 1500 10 10 600 50 10 75 15 8000 35 ...
..$ fields.ward : chr [1:35] "Bishopsworth" "Windmill Hill" "Brislington East" "Clifton Down" ...
..$ fields.period : chr [1:35] "2020" "2020" "2020" "2020" ...
..$ fields.professional_scientific_technical : int [1:35] 75 125 200 1000 600 250 800 600 22000 125 ...
..$ fields.construction : int [1:35] 500 125 500 300 400 100 350 175 1250 125 ...
..$ fields.arts_entertainment_recreation_other_services: int [1:35] 150 75 75 350 500 35 250 125 1750 100 ...
..$ fields.wholesale : int [1:35] 100 20 450 35 400 15 50 150 900 20 ...
$ facet_groups:'data.frame': 1 obs. of 2 variables:
..$ name : chr "ward"
..$ facets:List of 1
.. ..$ :'data.frame': 35 obs. of 4 variables:
.. .. ..$ name : chr [1:35] "Ashley" "Avonmouth and Lawrence Weston" "Bedminster" "Bishopston and Ashley Down" ...
.. .. ..$ count: int [1:35] 1 1 1 1 1 1 1 1 1 1 ...
.. .. ..$ state: chr [1:35] "displayed" "displayed" "displayed" "displayed" ...
.. .. ..$ path : chr [1:35] "Ashley" "Avonmouth and Lawrence Weston" "Bedminster" "Bishopston and Ashley Down" ...
Well, it’s complicated. Looking on the above output of list()
the object that looks most useful to us is the dat$records
.
Question: Based on the output of list()
, why do you think is dat$records
the most useful part of dat
for us?
And if you want to see how the dataframe dat$records
looks like, the below function will show the first five rows.
head(dat$records)
datasetid
1 jobs-by-ward-by-industry-in-bristol-2020
2 jobs-by-ward-by-industry-in-bristol-2020
3 jobs-by-ward-by-industry-in-bristol-2020
4 jobs-by-ward-by-industry-in-bristol-2020
5 jobs-by-ward-by-industry-in-bristol-2020
6 jobs-by-ward-by-industry-in-bristol-2020
recordid record_timestamp
1 8f1ee37587a2233922bfa38b048879368059502e 2021-12-22T11:29:00.712Z
2 464799297e1d1af8bb980450d062c7eedb787042 2021-12-22T11:29:00.712Z
3 4cf81b7e8c57fc61c73eb40c4d4556446d20896c 2021-12-22T11:29:00.712Z
4 e034ff06e7c7d67f6f716c41e8ccd940a36d41fd 2021-12-22T11:29:00.712Z
5 88ce4a95ea29ef24f38ecd46b3a7e021d325991c 2021-12-22T11:29:00.712Z
6 0576d71991a12a6102330e89a595d26fcc92fd32 2021-12-22T11:29:00.712Z
fields.mining_quarrying_utilities fields.transport_storage_inc_postal
1 10 20
2 0 0
3 0 350
4 0 15
5 0 200
6 10 50
fields.accommodation_food_services fields.public_administration_defence
1 100 0
2 150 0
3 175 0
4 800 10
5 700 50
6 175 0
fields.agriculture_forestry_fishing
1 0
2 0
3 0
4 0
5 0
6 0
fields.business_administration_support_services fields.retail fields.property
1 125 700 10
2 75 100 15
3 1250 200 20
4 1750 700 450
5 450 900 175
6 225 75 75
fields.health fields.motor_trades fields.manufacturing
1 150 125 150
2 125 30 25
3 125 300 900
4 450 30 25
5 500 350 400
6 500 20 5
fields.information_communication fields.education fields.ward_code
1 225 350 E05010889
2 200 250 E05010918
3 400 400 E05010890
4 800 200 E05010894
5 350 350 E05010914
6 175 175 E05010916
fields.financial_insurance fields.ward fields.period
1 1500 Bishopsworth 2020
2 10 Windmill Hill 2020
3 10 Brislington East 2020
4 600 Clifton Down 2020
5 50 Southville 2020
6 10 Stoke Bishop 2020
fields.professional_scientific_technical fields.construction
1 75 500
2 125 125
3 200 500
4 1000 300
5 600 400
6 250 100
fields.arts_entertainment_recreation_other_services fields.wholesale
1 150 100
2 75 20
3 75 450
4 350 35
5 500 400
6 35 15
Looking into the above, I can think of the the following necessary data cleaning and wrangling steps:
remove the first three columns as they don’t include any useful data
remove the string “fields.” from the column names
although not necessary, it is common practice for the first columns to contain identification data such as the ward names and codes.
To do all the above, we will use the below functions from the dplyr
package, which is part of the tidyverse
ecosystem.
<- dat$records %>%
dat select(-datasetid, -recordid, -record_timestamp) %>% # remove useless columns
rename_with(~sub("fields.", "", .)) %>% # ~ introduces a formula: function `sub` substitutes (in other words replaces) the string "fields." with an empty sting "" for all columns (symbol .)
relocate(ward, ward_code, period) %>% # looks better
glimpse() # a glimpse of the output
Rows: 35
Columns: 21
$ ward <chr> "Bishopsworth", "Windmill~
$ ward_code <chr> "E05010889", "E05010918",~
$ period <chr> "2020", "2020", "2020", "~
$ mining_quarrying_utilities <int> 10, 0, 0, 0, 0, 10, 0, 30~
$ transport_storage_inc_postal <int> 20, 0, 350, 15, 200, 50, ~
$ accommodation_food_services <int> 100, 150, 175, 800, 700, ~
$ public_administration_defence <int> 0, 0, 0, 10, 50, 0, 0, 40~
$ agriculture_forestry_fishing <int> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ business_administration_support_services <int> 125, 75, 1250, 1750, 450,~
$ retail <int> 700, 100, 200, 700, 900, ~
$ property <int> 10, 15, 20, 450, 175, 75,~
$ health <int> 150, 125, 125, 450, 500, ~
$ motor_trades <int> 125, 30, 300, 30, 350, 20~
$ manufacturing <int> 150, 25, 900, 25, 400, 5,~
$ information_communication <int> 225, 200, 400, 800, 350, ~
$ education <int> 350, 250, 400, 200, 350, ~
$ financial_insurance <int> 1500, 10, 10, 600, 50, 10~
$ professional_scientific_technical <int> 75, 125, 200, 1000, 600, ~
$ construction <int> 500, 125, 500, 300, 400, ~
$ arts_entertainment_recreation_other_services <int> 150, 75, 75, 350, 500, 35~
$ wholesale <int> 100, 20, 450, 35, 400, 15~
Let’s try now to do some analysis based on these data and create some outputs.
%>% mutate(total = rowSums(across(c(4:21))), # total jobs in a ward
dat across(c(4:21), # select the columns you are interested in
.names = "{col}_share") / total) %>% # divide these columns by total to create industry share within ward
# and name the new columns as oldname_share
select(1:2, 23:40) %>%
pivot_longer(!c(ward, ward_code), # These are the columns I am 'pivoting'
names_to = "industry", # This is how I will name the column with all the industries ...
values_to = "jobs_share") %>% # ... and the column with the jobs percentage
# Look https://tidyr.tidyverse.org/articles/pivot.html to
# understand the long to wide / wide to long transformations.
filter(ward == "Central") %>% # Let's focus on Central Bristol
ggplot(aes(x=industry, y=jobs_share)) + # ... and create a bar chart for it jobs distribution
geom_bar(stat = "identity") +
# to understand the "identity part go on the Details section of the geom_bar documentation:
# https://www.rdocumentation.org/packages/ggplot2/versions/1.0.1/topics/geom_bar
scale_x_discrete(guide = guide_axis(angle = 90)) + # This line rotates the x labels,
ylab("jobs %") # and here we rename the y axis.
Now, let’s do something slightly more interesting. Let’s create two plots to compare the job distribution between two Bristol wards.
%>% mutate(total = rowSums(across(c(4:21))), # total jobs in a ward
dat across(c(4:21), # select the columns you are interested in
.names = "{col}_share") / total) %>% # divide these columns by total to create industry share within ward
# and name the new columns as oldname_share
select(1:2, 23:40) %>%
pivot_longer(!c(ward, ward_code), # These are the columns I am 'pivoting'
names_to = "industry", # This is how I will name the column with all the industries ...
values_to = "jobs_share") %>% # ... and the column with the jobs percentage
# Look https://tidyr.tidyverse.org/articles/pivot.html to
# understand the long to wide / wide to long transformations.
filter(ward == "Central" |
== "Easton") %>% # Let's focus on Easton and Central Bristol
ward ggplot(aes(x=industry, y=jobs_share)) + # ... and create a bar chart for it jobs distribution
geom_bar(stat = "identity") +
scale_x_discrete(guide = guide_axis(angle = 90)) + # This line rotates the x labels,
ylab("jobs %") + # and here we rename the y axis.
facet_wrap(vars(ward), ncol = 2) # facet_wrap creates two graphs based on `ward` and puts one next to each other
Well done! You’ve just utilised Bristol’s open data and became part of the urban open data revolution 😃.
Obviously the x
labels are rather ugly.
Task: using and tweaking the code provided here, can you rename these labels?
Next task: Search for some other interesting data from opendata.bristol.gov.uk, access it programmatically, do the necessary data wrangling and produce some graphs in a new .Rmd
document. Make sure to knit it as .html
.
If this is to easy for you, why don’t you check the new Quarto format.
Accessing APIs using existing R
packages
Now you might be lucky and somebody else has already done all the infrastructural / back-end work in order for you to easily interrogate an API and access their data. How would this look like? Like another R
package. Here is a list with a few examples:
nomisr
is a package that allows you to access Nomis data withR
. Find here an introduction.tidysencus
is part of thetidyverse
ecosystem and offers access to the US Census. This is the link. Pay extra attention to the spatial data option.The city of Toronto offers an
R
package to access City of Toronto Open Data Portal.See a relevant book by Bauer, Landesvatter, and Behrens (2022) available here.
Another task: On your own time, play with these packages and access some of the (urban) data they offer.
References
Footnotes
Every column is a variable; every row is an observation; every cell is a single value. See [here])(https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) for more info↩︎