Economic diversity

  • Production, i.e. firms

  • Consumption, i.e. product variety

  • Labour pool, i.e. skills in labour market

In general, diversity is a good thing for:

  • urban economies

  • productivity

  • urban and industrial agglomeration

Opposing forces

  • Within-sector or Marshall–Arrow–Romer (MAR) spillovers

  • Between-sector or Jacobs spillovers

  • Large empirical literature trying to identify the optimal ratio, e.g. Saviotti and Frenken (2008) and Caragliu, Dominicis, and Groot (2016)

  • MAR externalities (or spillovers): good for productivity and short-term growth

  • Jacobs externalities: good for innovation and long-term growth

Opposing forces

Using clearer economics terminology (Fujita et al. 1989):

  • Diverse cities (heterogeneous agglomerations) enjoy economies of scope

  • Homogeneous agglomerations enjoy increasing returns from economies of scale

On the ground

  • Ambiguous concepts

  • Variety, diversity, difference: a relative concept of agglomeration and the clustering of activities

  • Not only higher ‘abundance’, ‘difference’ or ‘number’, but also the degrees of ‘richness’, ‘concentration’ or ‘evenness’ (Yuo and Tseng 2021)

  • Different ways to measure (Bettencourt 2021)

Species richness…

  • … aka variety

  • \(D = \sum_{i}^n p_{i}^0\)

  • \(p_i\) is the proportion of data points in the \(i\)th category

  • \(n\) is the number of total categories

  • A count of different species / categories / …


  • Plurality

  • Availability of options
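A minimal sketch of the richness formula in base R, using made-up firm counts (the sector names and numbers are hypothetical). Note that R evaluates 0^0 as 1, so absent categories have to be dropped before applying the formula.

```r
# Toy data: hypothetical firm counts by sector in one area
counts <- c(retail = 40, finance = 25, construction = 0, hospitality = 35)

# Proportions p_i, keeping only the categories that are actually present
# (in R, 0^0 is 1, so zero-count categories would otherwise be counted)
p <- counts[counts > 0] / sum(counts)

# Richness: the sum of p_i^0, i.e. a count of the categories present
richness <- sum(p^0)
richness
```

Richness ignores how firms are spread across sectors: an area with one dominant sector and an area with evenly sized sectors get the same value if they host the same number of sectors.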

Shannon entropy

  • \(H = -\sum_{i}^n p_{i} \ln{p_{i}}\)

  • \(n\) is the number of total categories

  • \(p_i\) is the proportion of data points in the \(i\)th category

  • Probably the most common diversity index.

  • Interpretation:

    • If one category dominates ➔ less surprise ➔ low entropy

    • No category dominates ➔ more surprise ➔ high entropy
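The interpretation above can be checked with a short base R sketch. The function name `shannon` and the two sets of shares are assumptions made for illustration only.

```r
# Shannon entropy with natural logarithms; p is a vector of proportions
shannon <- function(p) {
  p <- p[p > 0]        # treat 0 * ln(0) as 0 by dropping empty categories
  -sum(p * log(p))
}

p_dominated <- c(0.80, 0.05, 0.05, 0.10)  # one category dominates
p_even      <- rep(0.25, 4)               # no category dominates

H_dominated <- shannon(p_dominated)
H_even      <- shannon(p_even)

# The even distribution reaches the maximum possible entropy, ln(n)
c(dominated = H_dominated, even = H_even, maximum = log(4))
```

With four categories the entropy can never exceed ln(4), which is reached only when all shares are equal.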

Herfindahl-Hirschman index

  • \(HHI = \sum_{i}^{n}(p_{i}^2)\)

  • \(p_i\) is the proportion of data points in the \(i\)th category

  • Concentration of the market.

  • Interpretation:

    • \(1/n \leq HHI \leq 1\)

    • Two scenarios:

HHI_1 = .8^2 + .05^2 + .05^2 + .1^2
[1] 0.655
HHI_2 = .25^2 + .25^2 + .25^2 + .25^2
[1] 0.25

Herfindahl-Hirschman index

  • Caution: alternative specification

  • \(HHI = 1- \sum_{i}^{n}(p_{i}^2)\)
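A small base R sketch comparing the two specifications on the same two scenarios. The helper name `hhi` is an assumption for illustration; it is not a function from REAT.

```r
# Herfindahl-Hirschman index from a vector of proportions
hhi <- function(p) sum(p^2)

p1 <- c(0.80, 0.05, 0.05, 0.10)  # concentrated scenario
p2 <- rep(0.25, 4)               # evenly spread scenario

concentration <- c(hhi(p1), hhi(p2))          # higher = more concentrated
diversity     <- c(1 - hhi(p1), 1 - hhi(p2))  # higher = more diverse

rbind(concentration, diversity)
```

The two versions rank places in opposite order, so always check which specification a paper or a package reports before comparing values.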



  • Relatedness spans the continuum between MAR and Jacobs (Hidalgo 2021)

  • Related activities are neither exactly the same nor completely different (Frenken, Van Oort, and Verburg 2007; Boschma et al. 2012)

  • Why? Because:

    • identical activities compete for customers and resources,

    • very dissimilar economic activities have little to learn from each other


  • Absorptive capacity: a firm’s capacity to absorb new knowledge depends on its prior level of related knowledge (Cohen and Levinthal 1990)

Economic complexity

  • Large-scale, fine-grained data on economic activities

  • Learn about abstract factors of production and the way they combine into outputs

  • Dimensionality reduction techniques applied to data on the geography of activities, e.g. employment by industry or patents by technology

  • Machine learning and network techniques to predict and explain the economic trajectories of countries, cities and regions

For a review, check Hidalgo (2021) and Balland et al. (2022).

Measuring diversity

Source: Companies House

Measuring diversity

  • Go to data.london.gov.uk

  • Download Businesses-in-London.csv and save it locally

  • Make sure you know the file location!

  • We will use the REAT and entropy packages. Check what these packages do here and here.

  • Install them if needed with install.packages("packagename")

Measuring diversity

library(tidyverse)  # for data wrangling
library(rprojroot)  # for relative paths
library(REAT)       # for diversity measures
library(entropy)    # for entropy
library(cluster)    # for cluster analysis
library(factoextra) # help functions for clustering 
library(kableExtra) # for nice html tables
library(dbscan)     # for HDBSCAN
library(sf)         # for mapping

# This is the project path
path <- find_rstudio_root_file()
path.data <- paste0(path, "/data/businesses-in-london.csv")

london.firms <- read_csv(path.data) 

london.firms.sum <- london.firms %>% 
  filter(SICCode.SicText_1 != "None Supplied") %>%    # drop firms without a declared SIC code
  group_by(oslaua, SICCode.SicText_1) %>%             # group by Local Authority and SIC code
  summarise(n = n()) %>%                              # n: number of firms per Local Authority and SIC code
  mutate(total = sum(n),                              # total: all firms in the Local Authority
         freq = n / total) %>%                        # freq: share of each SIC code within the Local Authority
  group_by(oslaua) %>%                                # group again, by Local Authority only
  summarise(richness = n_distinct(SICCode.SicText_1), # number of distinct SIC codes per Local Authority
            entropy = entropy(freq, method = "ML"),   # entropy per Local Authority; freq was computed above for this purpose
            herf = herf(n)) %>%                       # HHI per Local Authority
  arrange(-herf)                                      # sort by HHI (descending)

london.firms.sum %>% kbl() %>%
  kable_styling(full_width = F) %>%                   # A nice(r) table
  scroll_box(width = "800px", height = "300px")

Measuring diversity


You don’t know what local authorities these codes refer to. You should download the codes and names and join them with your data from here.


Discuss what we can learn from this exercise.

Can you think of a way to understand how different these indices are among London’s Local Authorities?

Mapping diversity

path.shape <- paste0(path, "/data/Local_Authority_Districts_(May_2021)_UK_BFE.geojson")

london <- st_read(path.shape, quiet = T) %>%
  dplyr::filter(LAD21CD %in% (london.firms$oslaua))

london <- merge(london, london.firms.sum, by.x = "LAD21CD", by.y = "oslaua" )

ggplot() +
  geom_sf(data = london, aes(fill = entropy), color = NA) +
  labs(
    title = "Business diversity in London's Local Authorities",
    fill = "Entropy") +
  scale_fill_viridis_c() +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5)) # centres the title


  • Reducing the dimensions of the observation space

  • Classification of observations into (exclusive) groups

  • Distance or (dis)similarity between each pair of observations is used to create a distance (or dissimilarity) matrix

  • Observations within the same group are as similar as possible

  • Based on Boehmke and Greenwell (2019) available here

  • Plenty of other resources online and in textbooks

Source: medium.com


  1. k-means

  2. Hierarchical clustering


  1. k is the number of clusters and is pre-defined

  2. The algorithm selects k random observations (starting centres)

  3. The remaining observations are assigned to the nearest centre

  4. Recalculates the new centres

  5. Re-check cluster assignment

  6. Iterative process to minimise within-cluster variation until convergence

\(SS_{within} = \sum_{k=1}^{K} W(C_{k}) = \sum_{k=1}^{K} \sum_{x_i\in C_k}(x_i-\mu_k)^2\)
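The steps above can be sketched in a few lines of base R. This is an illustrative implementation on made-up data, not the algorithm used internally by `kmeans()` (whose default is Hartigan-Wong); the toy data and all object names are assumptions.

```r
set.seed(1)
# Toy data: two hypothetical, well-separated groups of 20 points in 2D
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
k <- 2                                             # step 1: k is pre-defined

# Step 2: pick k random observations as starting centres
centres <- x[sample(nrow(x), k), , drop = FALSE]

for (iter in 1:100) {
  # Step 3: squared distance of every observation to every centre,
  # then assign each observation to its nearest centre
  d2 <- sapply(1:k, function(j) rowSums(sweep(x, 2, centres[j, ])^2))
  cluster <- max.col(-d2, ties.method = "first")
  # Step 4: recalculate the centres as the cluster means
  # (this sketch does not handle the rare case of an empty cluster)
  new_centres <- t(sapply(1:k, function(j) {
    colMeans(x[cluster == j, , drop = FALSE])
  }))
  # Steps 5-6: repeat until the centres stop moving
  if (all(abs(new_centres - centres) < 1e-9)) break
  centres <- new_centres
}

# Within-cluster sum of squares: squared deviations from each cluster centre
ss_within <- sum(sapply(1:k, function(j) {
  sum(sweep(x[cluster == j, , drop = FALSE], 2, centres[j, ])^2)
}))
ss_within
```

In practice `kmeans()` with several random starts (`nstart`) is preferable, because a single run can stop at a local minimum.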


First, create an appropriate data frame

la.sic <- london.firms %>% 
  filter(SICCode.SicText_1!="None Supplied") %>% # Drop firms which haven't declared SIC code
  group_by(oslaua, SICCode.SicText_1) %>%        # Group by Local Authorities and SIC code
  summarise(n = n()) %>%                         # Summarise; n = number of observations
  mutate(total = sum(n),                         # New column: total number of observations
         freq = n / total) %>%                   # New column: frequency
  arrange(oslaua,-n) %>%                         # Just arrange by Local Authority and descending order of n
  select(-n, -total) %>%                         # Drop n and total, we don't need them any more.
  pivot_wider(names_from = SICCode.SicText_1, values_from = freq) %>% # Data transformation: from long to wide. Have a look: https://tidyr.tidyverse.org/reference/pivot_wider.html
  replace(is.na(.), 0)                          # Replace missing values with 0, as a missing value represents a SIC code with 0 frequency

la.sic %>%  
  select(1:20) %>%  # Select the first 20 columns, as there are 1037 in total
  kbl() %>%
  kable_styling()   # Nice(r) table

Example output row (Local Authority code followed by SIC code frequencies):
E09000033 0.0871492 0.0369990 0.0475242 0.0180250 0.0727706 0.0228702 0.0137720 0.0385974 0.0192951 0.0375056 0.0004424 0.0307623 0.0292781 0.0150850 0.0143358 0.0047239 0.0074997 0.0153490 0.0001570


kclust = kmeans(la.sic[,-1], centers = 10, nstart = 10) # be aware of the [,-1]: it drops the Local Authority code column

The centers component of the result is a 10 x 1036 matrix: 10 clusters by 1036 SIC codes.

Choosing k

  1. Rule of thumb: \(k = \sqrt{n/2}\)

  2. The elbow method

    • Compute k-means clustering for different values of k

    • Calculate \(SS_{within}\)

    • Plot and spot the location of a bend
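Both approaches can be sketched with base R only. The toy data below are made up, and the range of k values is an arbitrary choice for illustration.

```r
set.seed(42)
# Toy data: three hypothetical groups of 30 points in 2D
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 4), ncol = 2),
           matrix(rnorm(60, mean = 8), ncol = 2))

# Rule of thumb: k = sqrt(n / 2), where n is the number of observations
k_rule <- sqrt(nrow(x) / 2)

# Elbow method: run k-means for several values of k and
# store the total within-cluster sum of squares each time
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})

# Plot and look for the k after which the curve flattens
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The rule of thumb depends only on the number of observations, while the elbow depends on the structure of the data, so the two can disagree.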

Choosing k

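fviz_nbclust(la.sic[,-1], kmeans, # assumed call: the first line was lost, only the two arguments below survive
  k.max = 20,
  method = "wss"
)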

Hierarchical clustering

  1. Agglomerative clustering (AGNES – AGglomerative NESting)

  2. Divisive hierarchical clustering (DIANA – DIvisive ANAlysis)

Dissimilarity (distance) of observations
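A minimal sketch of the two approaches on made-up data, assuming the cluster package is installed (it provides `diana()`). Both start from the same dissimilarity matrix; the toy data and object names are illustrative only.

```r
library(cluster)  # for diana()

set.seed(1)
# Toy data: two hypothetical, well-separated groups of 10 points in 2D
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 10), ncol = 2))

d <- dist(x)                            # Euclidean distances by default

agg <- hclust(d, method = "complete")   # agglomerative: merges bottom-up
div <- diana(d)                         # divisive: splits top-down

# Cut both trees into two clusters and compare group sizes
groups_agg <- cutree(agg, k = 2)
groups_div <- cutree(as.hclust(div), k = 2)
table(groups_agg)
table(groups_div)
```

On clearly separated data both approaches tend to recover the same groups; they can differ when the structure is less obvious.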

Hierarchical clustering

# distances between observations (drop the first column, which holds the Local Authority codes)
d <- dist(la.sic[,-1])

# creates labels for the dendrogram
l <- london.firms %>% distinct(oslaua) %>% arrange(oslaua)

hclust = hclust(d)

plot(hclust, hang=-1, labels=l$oslaua, main='Default from hclust') 
# hang: the fraction of the plot height by which labels should hang below the rest of the plot.
# A negative value will cause the labels to hang down from 0.

Optimal number of clusters


Explore what the two-cluster solution tells us about London.

Clusters in space

  • Create a SIC frequency table
# This will build a SIC frequency table
london.firms %>% 
  group_by(SICCode.SicText_1) %>% 
  summarise(n=n()) %>% 
  arrange(-n)

Clusters in space

  • Focus on, let’s say “70221 - Financial management”
london.firms.sample <- london.firms %>% 
  filter(SICCode.SicText_1=="70221 - Financial management") %>% 
  select(oseast1m, osnrth1m)

Financial management in London


Clusters in space, k-means

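fviz_nbclust(london.firms.sample, kmeans, # assumed call: the first line was lost, only the two arguments below survive
  k.max = 10,
  method = "wss"
)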

Clusters in space, k-means

sp.cluster = kmeans(london.firms.sample, 6) 

plot(london.firms.sample, col = sp.cluster$cluster)

Clusters in space, hdbscan

  1. Transform the space according to the density/sparsity

  2. Build the minimum spanning tree of the distance weighted graph

  3. Construct a cluster hierarchy of connected components

  4. Condense the cluster hierarchy based on minimum cluster size

  5. Extract the stable clusters from the condensed tree.

Resources: SciKit-learn docs and dbscan package
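Step 1 can be illustrated in base R through the mutual reachability distance, which stretches distances in sparse regions and leaves dense regions largely unchanged. This is a sketch on made-up data. The convention used for the core distance (the distance to the min_pts-th nearest neighbour, not counting the point itself) is an assumption, and implementations differ on it.

```r
set.seed(1)
# Toy data: a dense hypothetical cluster plus sparse background points
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
           matrix(runif(20, min = -5, max = 5), ncol = 2))
min_pts <- 5

d <- as.matrix(dist(x))   # ordinary Euclidean distances

# Core distance: distance to the min_pts-th nearest neighbour
# (position 1 of each sorted row is the point itself, at distance 0)
core <- apply(d, 1, function(row) sort(row)[min_pts + 1])

# Mutual reachability distance: max(core(a), core(b), d(a, b))
mreach <- pmax(outer(core, core, pmax), d)
diag(mreach) <- 0

# Points in sparse areas have larger core distances than points in dense areas
summary(core[1:20])    # dense cluster
summary(core[21:30])   # sparse background
```

Step 2 would then build the minimum spanning tree on `mreach` instead of on `d`.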

Clusters in space, hdbscan

cl <- hdbscan(london.firms.sample, 
              minPts = 10)         #minimum size of clusters

plot(london.firms.sample, col=cl$cluster+1, pch=20)


