Using the web to predict regional trade flows: material and immaterial regional interdependencies


Emmanouil Tranos, Andre Carrascal Incera & George Willis

University of Bristol, Alan Turing Institute
@EmmanouilTranos, etranos.info

Contents

  • Introduction
  • Web data and spatial research
  • Empirical strategy
  • Descriptive statistics
  • Results
  • Conclusions

    etranos.info/post/sad2021

Introduction

Regional trade flows

  • Bilateral trade is a complex phenomenon (Serrano and Boguñá 2003)
  • Its complexity increases when it is approached from a spatially disaggregated perspective
  • Regions are more specialised than countries
  • Regions are more open to trade with other regions than national economies are with each other
  • Important external trade dependencies
  • Regions vary a lot in terms of their specialisation patterns, trade relationships and openness

Regional trade flows

  • Knowing and predicting regional trade helps to understand:
    • regional economic performance
    • exposure to external shocks
    • place-based development
  • Employment vulnerability and the transmission of internal and external shocks differ across regions

Regional trade flows: hardly any data

  • Big caveat: interregional trade data are scarce
  • Europe: spatially disaggregated input-output (IO) data for NUTS2 regions (Thissen, Diodato, and Van Oort 2013b, 2013a)
  • Producing such data is a costly, difficult exercise

Our contribution

  • Utilise the digital traces that interregional trade leaves behind
  • Model and predict trade flows for the UK NUTS2 regions
  • Scrape open web data
  • Hyperlinks between commercial websites
  • ML techniques for predictions of unseen interregional trade flows
  • Spatially disaggregated trade data
  • Hypothesis: such hyperlinks reflect business and trade relations

Web data and spatial research

Web data and business studies

  • Businesses may not expose all of their strategies on their websites, but neither do they in surveys (Arora et al. 2013)
  • Business websites:
    • spreading information
    • establishing a public image
    • supporting online transactions
    • sharing opinions

Empirical strategy

Web data: The Internet Archive

  • The largest archive of webpages in the world
  • 273 billion webpages from over 361 million websites, 15 petabytes of storage (1996 onwards)
  • A web crawler starts with a list of URLs (a seed list) to crawl and downloads a copy of their content
  • Using the hyperlinks included in the crawled URLs, new URLs are identified and crawled (snowball sampling)
  • Time-stamp
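
The crawling logic described above can be sketched in a few lines. This is a minimal illustration, not the Internet Archive's crawler: the toy web graph, the `crawl` function and the `fetch_links` callback are all hypothetical names, and no network access is involved.

```python
from collections import deque

# Hypothetical toy web: each URL maps to the hyperlinks found on that page.
TOY_WEB = {
    "http://a.example.co.uk": ["http://b.example.co.uk", "http://c.example.co.uk"],
    "http://b.example.co.uk": ["http://c.example.co.uk"],
    "http://c.example.co.uk": ["http://a.example.co.uk", "http://d.example.co.uk"],
    "http://d.example.co.uk": [],
}


def crawl(seed_list, fetch_links, max_pages=100):
    """Visit the seed URLs first, then any URL discovered through hyperlinks."""
    queue = deque(seed_list)
    seen = set(seed_list)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)  # a real crawler would store a time-stamped copy here
        for link in fetch_links(url):
            if link not in seen:  # only queue URLs not met before
                seen.add(link)
                queue.append(link)
    return visited


# Start from a one-element seed list; the remaining pages are found via links.
crawled = crawl(["http://a.example.co.uk"], lambda url: TOY_WEB.get(url, []))
```

Starting from a single seed, the sketch reaches all four toy pages, which is the snowball property the slide refers to.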

Our web data

  • JISC UK Web Domain Dataset: all archived webpages from the .uk domain 1996-2010
  • Curated by the British Library
  • Tranos, E., and C. Stich. 2020. Individual internet usage and the availability of online content of local interest: A multilevel approach. Computers, Environment and Urban Systems, 79:101371
  • Tranos, E., T. Kitsos, and R. Ortega-Argilés. 2020. Digital economy in the UK: Regional productivity effects of early adoption. Regional Studies, in press

Our web data

  1. All .uk archived webpages which contain a UK postcode in the web text

    - circa 0.5 billion URLs with valid UK postcodes

    - 20080509162138/http://www.example_website_1.co.uk/contact_us IG8 8HD

  2. Hyperlinks

    - http://www.example_website_1.co.uk | http://www.example_website_2.co.uk | 3

    - much larger pool, only part is geolocated
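
A rough sketch of how the two record types above could be combined into region-to-region hyperlink counts. Several things here are assumptions rather than the authors' pipeline: the field layout is inferred from the two example lines, the last field of a hyperlink record is read as a link count, the second postcode record and the `POSTCODE_TO_REGION` lookup are invented for illustration, and the region labels are placeholders.

```python
from collections import Counter
from urllib.parse import urlparse


def parse_postcode_line(line):
    """Split 'timestamp/url postcode' into year, host and postcode."""
    stamp, rest = line.split("/", 1)
    url, postcode = rest.split(" ", 1)
    return {"year": int(stamp[:4]), "host": urlparse(url).hostname,
            "postcode": postcode.strip()}


def parse_link_line(line):
    """Split 'source | target | count' (count assumed to be a link count)."""
    source, target, count = (part.strip() for part in line.split("|"))
    return urlparse(source).hostname, urlparse(target).hostname, int(count)


# First record is the example from the slide; the second is hypothetical.
postcode_lines = [
    "20080509162138/http://www.example_website_1.co.uk/contact_us IG8 8HD",
    "20080611101500/http://www.example_website_2.co.uk/about BS8 1TH",
]
link_lines = [
    "http://www.example_website_1.co.uk | http://www.example_website_2.co.uk | 3",
    "http://www.example_website_1.co.uk | http://www.example_website_3.co.uk | 5",
]
# Hypothetical postcode-to-region lookup with placeholder region labels.
POSTCODE_TO_REGION = {"IG8 8HD": "region_A", "BS8 1TH": "region_B"}

host_region = {}
for record in map(parse_postcode_line, postcode_lines):
    host_region[record["host"]] = POSTCODE_TO_REGION[record["postcode"]]

flows = Counter()
for source, target, count in map(parse_link_line, link_lines):
    # Hyperlinks with a non-geolocated end are dropped.
    if source in host_region and target in host_region:
        flows[(host_region[source], host_region[target])] += count
```

The second hyperlink points to a website with no postcode record, so it is excluded, mirroring the point that only part of the hyperlink pool is geolocated.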

Modelling strategy

\[trade_{ij,t} \sim hyperlinks_{ij,t} + distance_{ij} + \\ pop.density_{i,t} + pop.density_{j,t} + empl_{i,t} + empl_{j,t} \]

  • Predict inter-regional trade flows using Random Forests (RF)
  • Trade flow data from Thissen, Diodato, and Van Oort (2013b) and Thissen, Diodato, and Van Oort (2013a)
  • RF: tree-based ensemble learning method (Breiman 2001)
  • Classification and regression problems
  • RF draw random samples of the training data and grow one regression tree per sample to predict the dependent variable
  • Decision trees are trained in parallel
  • To make predictions for regression problems, RF average the predictions of all decision trees
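
To make the mechanics concrete, here is a deliberately small, pure-Python sketch of the idea: bootstrap samples, one shallow regression tree per sample with a random subset of features tried at each split, and averaging at prediction time. It is not the implementation behind the results in this talk, and the toy data (hyperlinks, distance, employment, trade) are invented so that trade rises with hyperlinks. All function names are illustrative.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, targets, features):
    """Return the (feature, threshold) with the lowest squared error, or None."""
    best, best_cost = None, float("inf")
    for f in features:
        for threshold in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, targets) if r[f] <= threshold]
            right = [y for r, y in zip(rows, targets) if r[f] > threshold]
            cost = sse(left) + sse(right)
            if cost < best_cost:
                best, best_cost = (f, threshold), cost
    return best


def grow_tree(rows, targets, rng, depth=0, max_depth=3, n_try=2):
    """Grow a regression tree; leaves are numbers, nodes are 4-tuples."""
    if depth == max_depth or len(set(targets)) == 1:
        return mean(targets)
    features = rng.sample(range(len(rows[0])), n_try)  # random feature subset
    split = best_split(rows, targets, features)
    if split is None:
        return mean(targets)
    f, t = split
    left = [(r, y) for r, y in zip(rows, targets) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, targets) if r[f] > t]
    return (f, t,
            grow_tree([r for r, _ in left], [y for _, y in left],
                      rng, depth + 1, max_depth, n_try),
            grow_tree([r for r, _ in right], [y for _, y in right],
                      rng, depth + 1, max_depth, n_try))


def predict_tree(tree, row):
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def fit_forest(rows, targets, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        forest.append(grow_tree([rows[i] for i in idx],
                                [targets[i] for i in idx], rng))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return mean(predict_tree(tree, row) for tree in forest)


# Invented toy data: (hyperlinks, distance, employment) -> trade.
rows = [(h, 100 - 10 * h, 50 + h) for h in range(1, 9)]
targets = [10.0 * h for h in range(1, 9)]
forest = fit_forest(rows, targets)
pred_low = predict_forest(forest, rows[0])
pred_high = predict_forest(forest, rows[-1])
```

Because each tree is grown independently of the others, the loop in `fit_forest` could be run in parallel, which is the property noted in the bullets above.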

Modelling strategy: Random Forests

  • Can handle skewed distributions and outliers
  • Avoid overfitting
  • Effectively model non-linear relationships
  • Small number of hyperparameters that need to be tuned, low sensitivity
  • Short training time
  • Current economic thinking advocates the use of ML algorithms such as RF
  • RF outperform OLS in out-of-sample predictions, even with moderately sized training datasets and a limited number of predictors

Modelling strategy: rolling forecasting

  • Train RF models on data from years \(t\) and \(t + 1\) to increase the size of the training dataset
  • 10-fold cross validation
  • Predict unseen data from year \(t + 2\)
  • No data pooling across all years, to preserve the temporal structure of the data for both methodological and conceptual reasons
  • No data leakage
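
The rolling scheme above can be written down as a small generator. This is a sketch under the assumption that the years are consecutive; `rolling_splits` is an illustrative name, not a function from the authors' code.

```python
def rolling_splits(years):
    """Yield ((t, t + 1), t + 2): two training years and the next test year."""
    years = sorted(years)
    for t, t_plus_1, t_plus_2 in zip(years, years[1:], years[2:]):
        yield (t, t_plus_1), t_plus_2


# The web data cover 2000-2010, so the first test year is 2002.
splits = list(rolling_splits(range(2000, 2011)))
for train_years, test_year in splits:
    # The test year is always later than both training years: no leakage.
    assert test_year > max(train_years)
```

With data for 2000 to 2010 this yields nine train/test pairs, one for each test year from 2002 to 2010.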

Modelling strategy: predictive performance

\[R^2 = 1 - \frac{\sum_{k = 1}^{N} (y_{k} - \hat{y}_{k})^2} {\sum_{k = 1}^{N} (y_{k} - \overline{y})^2}\]

\[MAE = \frac{1}{N} \sum_{k = 1}^{N} |\hat{y}_{k} - y_{k}|\]

\[RMSE = \sqrt{\frac{\sum_{k = 1}^{N} (\hat{y}_{k} - y_{k})^2} {N}}\]

  • \(RMSE\) squares the errors before averaging, so larger errors carry more weight than in \(MAE\)
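
A small worked example of the three metrics, using invented numbers, shows how a single large error affects \(RMSE\) more than \(MAE\). The function names are illustrative.

```python
from math import sqrt


def r_squared(y, y_hat):
    y_bar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1 - ss_res / ss_tot


def mae(y, y_hat):
    return sum(abs(b - a) for a, b in zip(y, y_hat)) / len(y)


def rmse(y, y_hat):
    return sqrt(sum((b - a) ** 2 for a, b in zip(y, y_hat)) / len(y))


# Invented observations and predictions: three exact hits and one error of 4.
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.0, 2.0, 3.0, 8.0]

print(mae(y, y_hat))   # 1.0
print(rmse(y, y_hat))  # 2.0
```

The single error of 4 gives an \(MAE\) of 1 but an \(RMSE\) of 2. \(R^2\) is negative for this example, which can happen when the predictions fit worse than the mean of the observations.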

Data cleaning

Unique postcodes frequencies, 2000

postcodes_per_website websites share cumulative_websites cumulative_share
(0,1] 41596 0.718 41596 0.718
(1,2] 6451 0.111 48047 0.830
(2,10] 6163 0.106 54210 0.936
(10,100] 2975 0.051 57185 0.988
(100,1000] 646 0.011 57831 0.999
(1000,10000] 62 0.001 57893 1.000
(10000,100000] 4 0.000 57897 1.000
  • Websites with a large number of postcodes: e.g. directories, real estate websites
  • Two samples: websites with exactly \(1\) unique postcode vs. websites with up to \(10\) unique postcodes
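
The two samples can be derived by counting unique postcodes per website and applying a threshold. The sketch below uses invented records with placeholder postcodes (only `IG8 8HD` appears in the slides), and the function names are illustrative.

```python
from collections import defaultdict

# Hypothetical (website, postcode) records.
records = [
    ("example_website_1.co.uk", "IG8 8HD"),
    ("example_website_1.co.uk", "IG8 8HD"),  # a repeated postcode counts once
    ("example_chain.co.uk", "PC-A"),
    ("example_chain.co.uk", "PC-B"),
    ("example_chain.co.uk", "PC-C"),
]
# A directory-style website listing 12 different placeholder postcodes.
records += [("example_directory.co.uk", f"PC-{i:02d}") for i in range(12)]


def unique_postcodes_per_site(records):
    """Count the distinct postcodes observed on each website."""
    postcodes = defaultdict(set)
    for site, postcode in records:
        postcodes[site].add(postcode)
    return {site: len(found) for site, found in postcodes.items()}


def select_sites(counts, max_postcodes):
    """Keep websites with at most max_postcodes unique postcodes."""
    return sorted(site for site, n in counts.items() if n <= max_postcodes)


counts = unique_postcodes_per_site(records)
one_postcode = select_sites(counts, 1)
up_to_ten = select_sites(counts, 10)
```

The directory-style website is excluded from both samples, while the three-postcode website only enters the broader one.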

Directory website with a lot of postcodes

Website with a unique postcode in London

Descriptive statistics

Interregional trade flows

Correlations with interregional trade

year hyperlinks distance
2000 0.539 -0.219
2001 0.578 -0.221
2002 0.793 -0.221
2003 0.483 -0.220
2004 0.807 -0.223
2005 0.643 -0.219
2006 0.585 -0.219
2007 0.598 -0.214
2008 0.491 -0.205
2009 0.922 -0.207
2010 0.674 -0.205

Results

Modelling strategy

\[trade_{ij,t} \sim hyperlinks_{ij,t} + distance_{ij} + \\ pop.density_{i,t} + pop.density_{j,t} + empl_{i,t} + empl_{j,t} \]

  • Rolling forecasting
  • Train RF models on data from years \(t\) and \(t + 1\)
  • 10-fold cross validation
  • Predict unseen data from year \(t + 2\)

Train on year t and t + 1

Feature importance

Test on t + 2

year RMSE Rsquared MAE
2002 937.93 0.96 159.87
2003 1360.28 0.94 244.75
2004 1014.83 0.95 179.15
2005 1790.07 0.89 304.86
2006 1706.73 0.92 309.16
2007 1920.11 0.91 210.23
2008 1558.92 0.92 233.35
2009 1353.12 0.93 202.70
2010 3170.16 0.63 303.68

Sectoral decomposition

Code Industry name
s1 Agriculture
s2 Mining, quarrying and energy supply
s3 Food, beverages and tobacco
s4 Textiles and leather etc.
s5 Coke, refined petroleum, nuclear fuel and chemicals etc.
s6 Electrical and optical equipment and transport equipment
s8 Other manufacturing
s9 Construction
s10 Distribution
s11 Hotels and restaurants
s12 Transport, storage and communication
s13 Financial intermediation
s14 Real estate, renting and business activities
s15 Non-Market Services

Sectoral decomposition

  • Higher accuracy in trade of goods (\(s1\)-\(s8\)) than services (\(s10\)-\(s15\))
  • Drop in prediction accuracy in \(2010\) for the services sectors (\(s10\)-\(s15\)) due to the financial crisis and its knock-on effects
  • The decrease in interregional trade volume makes trade flows more difficult to predict
  • Hotels and Restaurants (\(s11\)): the most difficult sector to predict because of strong intraregional trade dependencies

Alternative specifications

  • Distance plays the most important role in predicting interregional trade flows
  • The gap in prediction accuracy between the models with and without distance narrows over time
  • As the adoption of web technologies increased over time, interregional trade flows left more digital breadcrumbs behind

Robustness check: websites with up to 10 postcodes

year RMSE Rsquared MAE
2002 1181.91 0.94 244.27
2003 1428.99 0.93 282.77
2004 1011.14 0.95 173.31
2005 1414.77 0.94 232.25
2006 1433.92 0.94 208.32
2007 1894.59 0.91 227.77
2008 1206.30 0.95 249.66
2009 2008.83 0.81 238.38
2010 2500.10 0.78 298.27

Robustness check: websites with up to 10 postcodes

Figure: Predicted vs. observed interregional trade by year, websites with up to 10 postcodes

A local level application

  • From NUTS2 regions to Local Authorities, the main subnational administrative division in the UK
  • No such spatially disaggregated trade data exist
  • Model trained on 2008 and 2009 data, predictions made for 2010
  • Predictions cannot be validated for Local Authorities