APIs and urban data

Author

Emmanouil Tranos

Published

November 11, 2022

Aims of the practical

  • Understand the technicalities and the breadth of the urban data landscape

  • Understand the messiness of real world (urban) data

  • Access programmatically urban data and understand the benefits of such data access practise.

APIs

Application Programming Interface (APIs) represent a ‘gate’ or otherwise a platform that enables a client (that is you) to interact with a server (for example opendata.bristol.gov.uk. According to AWS (2022):

In the context of APIs, the word Application refers to any software with a distinct function. Interface can be thought of as a contract of service between two applications. This contract defines how the two communicate with each other using requests and responses. Their API documentation contains information on how developers are to structure those requests and responses.

The client’s software (this might be R for example) sends a request to the server requesting specific data. The response is the data the client asked.

More commonly, the client might be a mobile phone app (e.g. train network status app) and the server is the network operator’s server.

APIs can be private or public types. For more inthe description from AWS (2022) here

JSON file format

Although some urban data can be downloaded as a tidy1 table in a .csv format, most APIs will ship data to you in a JSON format.

According to the official JSON website (JSON), JSON

is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language Standard ECMA-262 3rd Edition - December 1999. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.

It adopts a hierarchical structure such as attribute:value instead of the fixed structure of tidy data. Following Singleton, Spielman, and Folch (2017), such attribute:value pairs might be:

  • “mean income”:“50000”

  • “internet speed”:“35”

  • “jobs”:“15000”

These attributes might be organised hierarchically:

(“city”:“Bristol,”mean income”: “27000”, “internet speed”: “55”, “jobs”:(“2019”:“15000”, “2020”:“14000”)), (“city”:“Newcastle,”mean income”: “25000”, “internet speed”: “45”, “jobs”:(“2019”:“11000”, “2020”:“10000”))

Although this format is more verbose (attribute names are repeated) it is more flexible (not all observations need to have all attributes) (Singleton, Spielman, and Folch 2017)

Practical example with API

opendata.bristol.gov.uk offers a lot of data about Bristol. For this example, let’s try to work with Jobs by Ward (by industry) in Bristol 2020. As you can see, this is a small data set with \(35\) observations representing Bristol’s \(35\) Wards. You can visualise these data with the Table tab and even download a .csv with these data. To begin with, do both of these things.

Then, go to the API tab. Change the rows you want to include in you API request from 10, which is the default here to 35, which is the maximum as we already know. Then clink on the Submit button and below that you will see a new link. This URL is your API request. If you carefully read the URL from right to left you can easily recognise the fields you saw on the previous screen. The weird output on your screen is the data you requested in JSON format.

Question: Why should you go through all these efforts to build an API request and download the data in JSON file format instead of just downloading a .csv file?

So, let’s try to access these data programmatically. First, we need to load the relevant R packages

library(tidyverse)
library(jsonlite)

To access the data, we are going to use the function fromJSON() from the jsonlite package. Please check the syntax of this function as well as the other function of this package.

dat <- jsonlite::fromJSON("https://opendata.bristol.gov.uk/api/records/1.0/search/?dataset=jobs-by-ward-by-industry-in-bristol-2020&q=&rows=35&facet=ward", flatten = TRUE)

We now have an R object called dat. Let’s try to see how it looks like

typeof(dat)
[1] "list"

So, this is a list. If you need a reminder for what a list is, see here. Now the difficult part. Let’s try to figure out the structure of this JSON formated data.

str(dat)
List of 4
 $ nhits       : int 35
 $ parameters  :List of 6
  ..$ dataset : chr "jobs-by-ward-by-industry-in-bristol-2020"
  ..$ rows    : int 35
  ..$ start   : int 0
  ..$ facet   : chr "ward"
  ..$ format  : chr "json"
  ..$ timezone: chr "UTC"
 $ records     :'data.frame':   35 obs. of  24 variables:
  ..$ datasetid                                          : chr [1:35] "jobs-by-ward-by-industry-in-bristol-2020" "jobs-by-ward-by-industry-in-bristol-2020" "jobs-by-ward-by-industry-in-bristol-2020" "jobs-by-ward-by-industry-in-bristol-2020" ...
  ..$ recordid                                           : chr [1:35] "8f1ee37587a2233922bfa38b048879368059502e" "464799297e1d1af8bb980450d062c7eedb787042" "4cf81b7e8c57fc61c73eb40c4d4556446d20896c" "e034ff06e7c7d67f6f716c41e8ccd940a36d41fd" ...
  ..$ record_timestamp                                   : chr [1:35] "2021-12-22T11:29:00.712Z" "2021-12-22T11:29:00.712Z" "2021-12-22T11:29:00.712Z" "2021-12-22T11:29:00.712Z" ...
  ..$ fields.mining_quarrying_utilities                  : int [1:35] 10 0 0 0 0 10 0 30 2250 0 ...
  ..$ fields.transport_storage_inc_postal                : int [1:35] 20 0 350 15 200 50 100 600 1750 300 ...
  ..$ fields.accommodation_food_services                 : int [1:35] 100 150 175 800 700 175 400 400 6000 300 ...
  ..$ fields.public_administration_defence               : int [1:35] 0 0 0 10 50 0 0 400 9000 200 ...
  ..$ fields.agriculture_forestry_fishing                : int [1:35] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ fields.business_administration_support_services    : int [1:35] 125 75 1250 1750 450 225 500 450 11000 125 ...
  ..$ fields.retail                                      : int [1:35] 700 100 200 700 900 75 800 1500 4500 500 ...
  ..$ fields.property                                    : int [1:35] 10 15 20 450 175 75 200 30 1750 30 ...
  ..$ fields.health                                      : int [1:35] 150 125 125 450 500 500 2000 900 12000 1500 ...
  ..$ fields.motor_trades                                : int [1:35] 125 30 300 30 350 20 40 600 200 40 ...
  ..$ fields.manufacturing                               : int [1:35] 150 25 900 25 400 5 75 500 500 50 ...
  ..$ fields.information_communication                   : int [1:35] 225 200 400 800 350 175 450 700 10000 50 ...
  ..$ fields.education                                   : int [1:35] 350 250 400 200 350 175 1000 500 9000 800 ...
  ..$ fields.ward_code                                   : chr [1:35] "E05010889" "E05010918" "E05010890" "E05010894" ...
  ..$ fields.financial_insurance                         : int [1:35] 1500 10 10 600 50 10 75 15 8000 35 ...
  ..$ fields.ward                                        : chr [1:35] "Bishopsworth" "Windmill Hill" "Brislington East" "Clifton Down" ...
  ..$ fields.period                                      : chr [1:35] "2020" "2020" "2020" "2020" ...
  ..$ fields.professional_scientific_technical           : int [1:35] 75 125 200 1000 600 250 800 600 22000 125 ...
  ..$ fields.construction                                : int [1:35] 500 125 500 300 400 100 350 175 1250 125 ...
  ..$ fields.arts_entertainment_recreation_other_services: int [1:35] 150 75 75 350 500 35 250 125 1750 100 ...
  ..$ fields.wholesale                                   : int [1:35] 100 20 450 35 400 15 50 150 900 20 ...
 $ facet_groups:'data.frame':   1 obs. of  2 variables:
  ..$ name  : chr "ward"
  ..$ facets:List of 1
  .. ..$ :'data.frame': 35 obs. of  4 variables:
  .. .. ..$ name : chr [1:35] "Ashley" "Avonmouth and Lawrence Weston" "Bedminster" "Bishopston and Ashley Down" ...
  .. .. ..$ count: int [1:35] 1 1 1 1 1 1 1 1 1 1 ...
  .. .. ..$ state: chr [1:35] "displayed" "displayed" "displayed" "displayed" ...
  .. .. ..$ path : chr [1:35] "Ashley" "Avonmouth and Lawrence Weston" "Bedminster" "Bishopston and Ashley Down" ...

Well, it’s complicated. Looking on the above output of list() the object that looks most useful to us is the dat$records.

Question: Based on the output of list(), why do you think is dat$records the most useful part of dat for us?

And if you want to see how the dataframe dat$records looks like, the below function will show the first five rows.

head(dat$records)
                                 datasetid
1 jobs-by-ward-by-industry-in-bristol-2020
2 jobs-by-ward-by-industry-in-bristol-2020
3 jobs-by-ward-by-industry-in-bristol-2020
4 jobs-by-ward-by-industry-in-bristol-2020
5 jobs-by-ward-by-industry-in-bristol-2020
6 jobs-by-ward-by-industry-in-bristol-2020
                                  recordid         record_timestamp
1 8f1ee37587a2233922bfa38b048879368059502e 2021-12-22T11:29:00.712Z
2 464799297e1d1af8bb980450d062c7eedb787042 2021-12-22T11:29:00.712Z
3 4cf81b7e8c57fc61c73eb40c4d4556446d20896c 2021-12-22T11:29:00.712Z
4 e034ff06e7c7d67f6f716c41e8ccd940a36d41fd 2021-12-22T11:29:00.712Z
5 88ce4a95ea29ef24f38ecd46b3a7e021d325991c 2021-12-22T11:29:00.712Z
6 0576d71991a12a6102330e89a595d26fcc92fd32 2021-12-22T11:29:00.712Z
  fields.mining_quarrying_utilities fields.transport_storage_inc_postal
1                                10                                  20
2                                 0                                   0
3                                 0                                 350
4                                 0                                  15
5                                 0                                 200
6                                10                                  50
  fields.accommodation_food_services fields.public_administration_defence
1                                100                                    0
2                                150                                    0
3                                175                                    0
4                                800                                   10
5                                700                                   50
6                                175                                    0
  fields.agriculture_forestry_fishing
1                                   0
2                                   0
3                                   0
4                                   0
5                                   0
6                                   0
  fields.business_administration_support_services fields.retail fields.property
1                                             125           700              10
2                                              75           100              15
3                                            1250           200              20
4                                            1750           700             450
5                                             450           900             175
6                                             225            75              75
  fields.health fields.motor_trades fields.manufacturing
1           150                 125                  150
2           125                  30                   25
3           125                 300                  900
4           450                  30                   25
5           500                 350                  400
6           500                  20                    5
  fields.information_communication fields.education fields.ward_code
1                              225              350        E05010889
2                              200              250        E05010918
3                              400              400        E05010890
4                              800              200        E05010894
5                              350              350        E05010914
6                              175              175        E05010916
  fields.financial_insurance      fields.ward fields.period
1                       1500     Bishopsworth          2020
2                         10    Windmill Hill          2020
3                         10 Brislington East          2020
4                        600     Clifton Down          2020
5                         50       Southville          2020
6                         10     Stoke Bishop          2020
  fields.professional_scientific_technical fields.construction
1                                       75                 500
2                                      125                 125
3                                      200                 500
4                                     1000                 300
5                                      600                 400
6                                      250                 100
  fields.arts_entertainment_recreation_other_services fields.wholesale
1                                                 150              100
2                                                  75               20
3                                                  75              450
4                                                 350               35
5                                                 500              400
6                                                  35               15

Looking into the above, I can think of the the following necessary data cleaning and wrangling steps:

  1. remove the first three columns as they don’t include any useful data

  2. remove the string “fields.” from the column names

  3. although not necessary, it is common practice for the first columns to contain identification data such as the ward names and codes.

To do all the above, we will use the below functions from the dplyr package, which is part of the tidyverse ecosystem.

dat <- dat$records %>% 
  select(-datasetid, -recordid, -record_timestamp) %>% # remove useless columns
  rename_with(~sub("fields.", "", .)) %>%              # ~ introduces a formula: function `sub` substitutes (in other words replaces) the string "fields." with an empty sting "" for all columns (symbol .) 
  relocate(ward, ward_code, period) %>%                # looks better
  glimpse()                                            # a glimpse of the output
Rows: 35
Columns: 21
$ ward                                         <chr> "Bishopsworth", "Windmill~
$ ward_code                                    <chr> "E05010889", "E05010918",~
$ period                                       <chr> "2020", "2020", "2020", "~
$ mining_quarrying_utilities                   <int> 10, 0, 0, 0, 0, 10, 0, 30~
$ transport_storage_inc_postal                 <int> 20, 0, 350, 15, 200, 50, ~
$ accommodation_food_services                  <int> 100, 150, 175, 800, 700, ~
$ public_administration_defence                <int> 0, 0, 0, 10, 50, 0, 0, 40~
$ agriculture_forestry_fishing                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ business_administration_support_services     <int> 125, 75, 1250, 1750, 450,~
$ retail                                       <int> 700, 100, 200, 700, 900, ~
$ property                                     <int> 10, 15, 20, 450, 175, 75,~
$ health                                       <int> 150, 125, 125, 450, 500, ~
$ motor_trades                                 <int> 125, 30, 300, 30, 350, 20~
$ manufacturing                                <int> 150, 25, 900, 25, 400, 5,~
$ information_communication                    <int> 225, 200, 400, 800, 350, ~
$ education                                    <int> 350, 250, 400, 200, 350, ~
$ financial_insurance                          <int> 1500, 10, 10, 600, 50, 10~
$ professional_scientific_technical            <int> 75, 125, 200, 1000, 600, ~
$ construction                                 <int> 500, 125, 500, 300, 400, ~
$ arts_entertainment_recreation_other_services <int> 150, 75, 75, 350, 500, 35~
$ wholesale                                    <int> 100, 20, 450, 35, 400, 15~

Let’s try now to do some analysis based on these data and create some outputs.

dat %>% mutate(total = rowSums(across(c(4:21))),           # total jobs in a ward
               across(c(4:21),                             # select the columns you are interested in
                      .names = "{col}_share") / total) %>% # divide these columns by total to create industry share within ward  
                                                           # and name the new columns as oldname_share
  select(1:2, 23:40) %>%
  pivot_longer(!c(ward, ward_code),                        # These are the columns I am 'pivoting'
               names_to = "industry",                      # This is how I will name the column with all the industries ...
               values_to = "jobs_share") %>%               # ... and the column with the jobs percentage
                                                           # Look https://tidyr.tidyverse.org/articles/pivot.html to 
                                                           # understand the long to wide / wide to long transformations.
  filter(ward == "Central") %>%                            # Let's focus on Central Bristol
  ggplot(aes(x=industry, y=jobs_share)) +                  # ... and create a bar chart for it jobs distribution
    geom_bar(stat = "identity") +
# to understand the "identity part go on the Details section of the geom_bar documentation:
# https://www.rdocumentation.org/packages/ggplot2/versions/1.0.1/topics/geom_bar
    
    scale_x_discrete(guide = guide_axis(angle = 90)) +     # This line rotates the x labels,
    ylab("jobs %")                                         # and here we rename the y axis. 

Now, let’s do something slightly more interesting. Let’s create two plots to compare the job distribution between two Bristol wards.

dat %>% mutate(total = rowSums(across(c(4:21))),           # total jobs in a ward
               across(c(4:21),                             # select the columns you are interested in
                      .names = "{col}_share") / total) %>% # divide these columns by total to create industry share within ward  
                                                           # and name the new columns as oldname_share
  select(1:2, 23:40) %>%
  pivot_longer(!c(ward, ward_code),                        # These are the columns I am 'pivoting'
               names_to = "industry",                      # This is how I will name the column with all the industries ...
               values_to = "jobs_share") %>%               # ... and the column with the jobs percentage
                                                           # Look https://tidyr.tidyverse.org/articles/pivot.html to 
                                                           # understand the long to wide / wide to long transformations.
  filter(ward == "Central" |
         ward == "Easton") %>%                             # Let's focus on Easton and Central Bristol
  ggplot(aes(x=industry, y=jobs_share)) +                  # ... and create a bar chart for it jobs distribution
    geom_bar(stat = "identity") +
    scale_x_discrete(guide = guide_axis(angle = 90)) +     # This line rotates the x labels,
    ylab("jobs %") +                                       # and here we rename the y axis. 
    facet_wrap(vars(ward), ncol = 2)                       # facet_wrap creates two graphs based on `ward` and puts one next to each other

Well done! You’ve just utilised Bristol’s open data and became part of the urban open data revolution 😃.

Obviously the x labels are rather ugly.

Task: using and tweaking the code provided here, can you rename these labels?

Next task: Search for some other interesting data from opendata.bristol.gov.uk, access it programmatically, do the necessary data wrangling and produce some graphs in a new .Rmd document. Make sure to knit it as .html.

If this is to easy for you, why don’t you check the new Quarto format.

Accessing APIs using existing R packages

Now you might be lucky and somebody else has already done all the infrastructural / back-end work in order for you to easily interrogate an API and access their data. How would this look like? Like another R package. Here is a list with a few examples:

  • nomisr is a package that allows you to access Nomis data with R. Find here an introduction.

  • tidysencus is part of the tidyverse ecosystem and offers access to the US Census. This is the link. Pay extra attention to the spatial data option.

  • The city of Toronto offers an R package to access City of Toronto Open Data Portal.

  • See a relevant book by Bauer, Landesvatter, and Behrens (2022) available here.

Another task: On your own time, play with these packages and access some of the (urban) data they offer.

References

AWS, Amazon. 2022. What Is an API? - API Beginner’s Guide - AWS — Aws.amazon.com.” https://aws.amazon.com/what-is/api/#:~:text=API%20stands%20for%20Application%20Programming,other%20using%20requests%20and%20responses.
Bauer, P. C., Landesvatter C., and L. Behrens. 2022. “APIs for Social Scientists: A Collaborative Review V1.0.” https://doi.org/https://doi.org/10.5281/zenodo.6798690.
JSON. JSON — Json.org.” https://www.json.org/json-en.html.
Singleton, Alex D, Seth Spielman, and David Folch. 2017. Urban Analytics. Sage.

Footnotes

  1. Every column is a variable; every row is an observation; every cell is a single value. See [here])(https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) for more info↩︎