Using the web to predict regional trade flows: material and immaterial regional interdependencies

Emmanouil Tranos, Andre Carrascal Incera & George Willis

University of Bristol, Alan Turing Institute
e.tranos@bristol.ac.uk, @EmmanouilTranos, etranos.info

Introduction
Web data and spatial research
Empirical strategy
Descriptive statistics
Results
Conclusions

etranos.info/post/sad2021

Introduction

Regional trade flows

Bilateral trade is a complex phenomenon (Serrano and Boguñá 2003)
Its complexity increases when it is approached from a spatially disaggregated perspective
Regions are more specialised and open than countries
Regions are more open to trade with other regions in comparison to national economies
Important external trade dependencies
Regions vary a lot in terms of their specialisation patterns, trade relationships and openness

Regional trade flows

Knowing and predicting regional trade helps to understand:
- regional economic performance
- exposure to external shocks
- place-based development
Employment vulnerability and the transmission of internal and external shocks differ across regions.

Regional trade flows: hardly any data

Big caveat: interregional trade data
Europe: spatially disaggregated IO for NUTS2 regions (Thissen, Diodato, and Van Oort 2013b, 2013a)
Costly, difficult exercise

Our contribution

Utilise the digital traces that interregional trade leaves behind
Model and predict trade flows for the UK NUTS2 regions
Scrape open web data
Hyperlinks between commercial websites
ML techniques for predictions of unseen interregional trade flows
Spatially disaggregated trade data
Hypothesis: such hyperlinks reflect business and trade relations

Web data and spatial research

Spatial studies using hyperlinks

Hyperlinks tend to follow national borders and gravitate towards the US (Halavais 2000)
Keßler (2017) used the hyperlinks between German Wikipedia webpages to represent the hierarchy of urban centres in Germany
Salvini and Fabrikant (2016) used the English Wikipedia to build a graph of world cities
Hyperlinks between and to administrative websites to study spatial relationships and structure (Holmberg and Thelwall 2009; Holmberg 2010; Janc 2015)

Spatial studies using hyperlinks

Lin, Halavais, and Zhang (2007) used weblog hyperlinks to analyse the spatial reflections of the blogosphere
Jones, Spigel, and Malecki (2010) focused on the New York City theater scene to investigate the existence and role of a ‘virtual buzz’

Web data and business studies

Businesses may not expose all of their strategies on their websites, but neither do they during surveys (Arora et al. 2013)
Business websites:
- spreading information
- establishing a public image
- supporting online transactions
- sharing opinions

Business studies using hyperlinks

Hyperlinks to business websites reflect business motivations and contain useful business information (Vaughan, Gao, and Kipp 2006)
Significant correlations between the number of incoming links and business performance (Vaughan 2004; Vaughan and Wu 2004)
Krüger et al. (2020) used hyperlinks between business websites in Germany to test the role of different proximity frameworks
Innovative businesses share more hyperlinks with other businesses, which also tend to be innovative

Empirical strategy

Web data: The Internet Archive

The largest archive of webpages in the world
273 billion webpages from over 361 million websites, 15 petabytes of storage (1996 -)
A web crawler starts with a list of URLs (a seed list) to crawl and downloads a copy of their content
Using the hyperlinks included in the crawled URLs, new URLs are identified and crawled (snowball sampling)
Time-stamp

Web data: The Internet Archive

Our web data

JISC UK Web Domain Dataset: all archived webpages from the .uk domain 1996-2010
Curated by the British Library
Tranos, E., and C. Stich. 2020. Individual internet usage and the availability of online content of local interest: A multilevel approach. Computers, Environment and Urban Systems, 79:101371
Tranos, E., T. Kitsos, and R. Ortega-Argilés. 2020. Digital economy in the UK: Regional productivity effects of early adoption. Regional Studies, in press

Our web data

All .uk archived webpages which contain a UK postcode in the web text

- circa 0.5 billion URLs with valid UK postcodes

- 20080509162138/http://www.example_website_1.co.uk/contact_us IG8 8HD
Hyperlinks

- http://www.example_website_1.co.uk | http://www.example_website_2.co.uk | 3

- much larger pool, only part is geolocated
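A minimal sketch of how website-level hyperlinks in the `source | target | count` form shown above could be summed into region-to-region counts. The `site_region` lookup, the region labels, and the third edge are hypothetical placeholders, not part of the original dataset.

```python
from collections import Counter

def parse_edge(line: str):
    """Split 'source | target | count' into (source, target, count)."""
    source, target, count = (part.strip() for part in line.split("|"))
    return source, target, int(count)

def regional_hyperlinks(edge_lines, site_region):
    """Sum hyperlink counts by (origin region, destination region).

    Edges whose source or target website has no known region are skipped,
    since only part of the hyperlink pool is geolocated.
    """
    flows = Counter()
    for line in edge_lines:
        source, target, count = parse_edge(line)
        if source in site_region and target in site_region:
            flows[(site_region[source], site_region[target])] += count
    return flows

# Hypothetical website -> region lookup (region labels are placeholders).
site_region = {
    "http://www.example_website_1.co.uk": "region_A",
    "http://www.example_website_2.co.uk": "region_B",
}
edges = [
    "http://www.example_website_1.co.uk | http://www.example_website_2.co.uk | 3",
    "http://www.example_website_2.co.uk | http://www.example_website_1.co.uk | 1",
    "http://www.example_website_1.co.uk | http://www.not_geolocated.co.uk | 7",
]
flows = regional_hyperlinks(edges, site_region)
```

The counts stay directed here (A to B is kept apart from B to A), which matches the $ij$ indexing of the trade flows.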

Modelling strategy

\[trade_{ij,t} \sim hyperlinks_{ij,t} + distance_{ij} + \\ pop.density_{i,t} + pop.density_{j,t} + empl_{i,t} + empl_{j,t} \]

Predict inter-regional trade flows using Random Forests (RF)
Trade flow data from Thissen, Diodato, and Van Oort (2013b) and Thissen, Diodato, and Van Oort (2013a)
RF: tree-based ensemble learning method (Breiman 2001)
Classification and regression problems
RF draw random (bootstrap) samples of the training data and grow one regression tree per sample to predict the dependent variable
Decision trees are trained independently, so they can be trained in parallel
To make a prediction for regression problems, RF average the predictions of all decision trees
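The bootstrap-and-average idea can be sketched in a few lines of plain Python. This is a deliberately simplified illustration: it bags one-split trees (stumps) on a single feature and omits the random feature subsets and deep trees of a full Random Forest. The function names and toy data are assumptions made for the example; it is not the model used in this study.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree (a stump) on a single feature."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x in this sample is identical: predict the mean
        overall = mean(ys)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_bagged_stumps(xs, ys, n_trees=50, seed=0):
    """Grow one stump per bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean(tree(x) for tree in trees)

# Toy step-shaped data: low values for x <= 4, high values above.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [10.0, 10.0, 10.0, 10.0, 50.0, 50.0, 50.0, 50.0]
model = fit_bagged_stumps(xs, ys)
low_prediction, high_prediction = model(1), model(8)
```

Each stump sees a different resample, so individual trees disagree near the step; averaging smooths those disagreements, which is the mechanism behind the ensemble's stability.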

Modelling strategy: Random Forests

Can handle skewed distributions and outliers
Avoid overfitting
Effectively model non-linear relationships
Small number of hyperparameters that need to be tuned, low sensitivity
Short training time
Current economic thinking advocates the use of ML algorithms such as RF
Outperform OLS in out-of-sample predictions even when using moderate size training datasets and limited number of predictors

Modelling strategy: rolling forecasting

Train RF models on data from years $t$ and $t + 1$ to increase the size of the training dataset
10-fold cross validation
Predict unseen data from year $t + 2$
No pooling of data across all years, to preserve the temporal structure of the data for both methodological and conceptual reasons
No data leakage
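A minimal sketch of the rolling windows described above, assuming annual data for 2000-2010. The function name and the `train_span` parameter are illustrative choices.

```python
def rolling_windows(years, train_span=2):
    """Yield (training years, test year) pairs for rolling forecasting."""
    years = sorted(years)
    for start in range(len(years) - train_span):
        train = years[start:start + train_span]
        test = years[start + train_span]
        yield train, test

windows = list(rolling_windows(range(2000, 2011)))
# The test year always lies after the training years, so no future
# information leaks into training.
no_leakage = all(test > max(train) for train, test in windows)
```

With eleven years of data this gives nine windows, with test years running from 2002 to 2010.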

Modelling strategy: predictive performance

\[\begin{align} R^2 = 1 - \frac{\sum_{k} (y_{k} - \hat{y_{k}})^2} {\sum_{k} (y_{k} - \overline{y})^2} \label{eq:rsquared} \end{align}\]

\[\begin{align} MAE = \frac{1}{N} \sum_{k = 1}^{N} |\hat{y_{k}} - y_{k}| \label{eq:mae} \end{align}\]

\[\begin{align} RMSE = \sqrt{\frac{\sum_{k = 1}^{N} (\hat{y_{k}} - y_{k})^2} {N}} \label{eq:rmse} \end{align}\]

Larger errors carry more weight for $RMSE$
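The three metrics can be written directly from the formulas above. The sketch below uses made-up numbers to show why larger errors weigh more for $RMSE$: both prediction sets have the same $MAE$, but the one with a single large error has a higher $RMSE$.

```python
import math

def r_squared(y, y_hat):
    """1 minus residual sum of squares over total sum of squares."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot

def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(p - a) for a, p in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Root mean squared error."""
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(y, y_hat)) / len(y))

# Illustrative values only, not trade data.
observed = [10.0, 20.0, 30.0, 40.0]
small_errors = [12.0, 18.0, 32.0, 38.0]   # four errors of size 2
one_big_error = [10.0, 20.0, 30.0, 48.0]  # one error of size 8

mae_small, mae_big = mae(observed, small_errors), mae(observed, one_big_error)
rmse_small, rmse_big = rmse(observed, small_errors), rmse(observed, one_big_error)
r2_small, r2_big = r_squared(observed, small_errors), r_squared(observed, one_big_error)
```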

Data cleaning

All the archived .uk webpages
Archived during 2000-2010
Commercial webpages (.co.uk)
From webpages to websites:

- http://www.website1.co.uk/webpage1 and

- http://www.website1.co.uk/webpage2 are part of the

- http://www.website1.co.uk
1 vs. multiple postcodes in a website
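The step from webpages to websites can be sketched as follows, assuming each record follows the `timestamp/URL postcode` form shown earlier. The function names, timestamps and postcodes in the sample records are illustrative.

```python
from urllib.parse import urlparse

def parse_archive_record(record: str):
    """Split '<timestamp>/<url> <postcode>' into its three parts."""
    stamped_url, postcode = record.split(" ", 1)
    timestamp, url = stamped_url.split("/", 1)
    return timestamp, url, postcode.strip()

def website_of(url: str) -> str:
    """Reduce a webpage URL to its website (scheme plus host)."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc.lower()}"

def postcodes_per_website(records):
    """Map each website to the set of unique postcodes found on its pages."""
    sites = {}
    for record in records:
        _, url, postcode = parse_archive_record(record)
        sites.setdefault(website_of(url), set()).add(postcode)
    return sites

# Illustrative records, not taken from the archive.
records = [
    "20080509162138/http://www.website1.co.uk/webpage1 IG8 8HD",
    "20080611101500/http://www.website1.co.uk/webpage2 IG8 8HD",
    "20080611101502/http://www.website2.co.uk/contact_us BS8 1TH",
]
sites = postcodes_per_website(records)
```

Counting the size of each postcode set then separates websites with one postcode from those with several.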

Unique postcodes frequencies, 2000

unique postcodes per website	websites	share	cumulative websites	cumulative share
(0,1]	41596	0.718	41596	0.718
(1,2]	6451	0.111	48047	0.830
(2,10]	6163	0.106	54210	0.936
(10,100]	2975	0.051	57185	0.988
(100,1000]	646	0.011	57831	0.999
(1000,10000]	62	0.001	57893	1.000
(10000,100000]	4	0.000	57897	1.000

Websites with a large number of postcodes: e.g. directories, real estate websites
$2$ samples: Websites with $1$ vs. up to $10$ unique postcodes
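A sketch of how such a frequency table and the two samples could be derived from a website-to-postcode-count mapping. The bin edges follow the table above; the `counts` dictionary and site names are made up for illustration, and counts above 100000 are assumed not to occur.

```python
from bisect import bisect_left
from collections import Counter

BIN_EDGES = [1, 2, 10, 100, 1000, 10000, 100000]  # right-closed upper bounds
BIN_LABELS = ["(0,1]", "(1,2]", "(2,10]", "(10,100]",
              "(100,1000]", "(1000,10000]", "(10000,100000]"]

def bin_label(n_postcodes: int) -> str:
    """Return the right-closed interval containing n_postcodes."""
    return BIN_LABELS[bisect_left(BIN_EDGES, n_postcodes)]

def frequency_table(counts):
    """Rows of (interval, frequency, share, cumulative frequency, cumulative share)."""
    freq = Counter(bin_label(n) for n in counts.values())
    total = sum(freq.values())
    rows, running = [], 0
    for label in BIN_LABELS:
        running += freq[label]
        rows.append((label, freq[label], freq[label] / total,
                     running, running / total))
    return rows

def select_sample(counts, max_postcodes):
    """Websites with at most max_postcodes unique postcodes."""
    return {site for site, n in counts.items() if n <= max_postcodes}

# Made-up counts of unique postcodes per website.
counts = {"site_a": 1, "site_b": 1, "site_c": 2, "site_d": 7, "site_e": 350}
table = frequency_table(counts)
single = select_sample(counts, 1)       # websites with 1 postcode
up_to_ten = select_sample(counts, 10)   # websites with up to 10 postcodes
```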

Directory website with a lot of postcodes

Website with a unique postcode in London

Descriptive statistics

Interregional trade flows

Interregional hyperlinks

Scatter plots of trade vs. hyperlinks

Correlations with interregional trade

year	hyperlinks	distance
2000	0.539	-0.219
2001	0.578	-0.221
2002	0.793	-0.221
2003	0.483	-0.220
2004	0.807	-0.223
2005	0.643	-0.219
2006	0.585	-0.219
2007	0.598	-0.214
2008	0.491	-0.205
2009	0.922	-0.207
2010	0.674	-0.205

Results

Modelling strategy

\[trade_{ij,t} \sim hyperlinks_{ij,t} + distance_{ij} + \\ pop.density_{i,t} + pop.density_{j,t} + empl_{i,t} + empl_{j,t} \]

Rolling forecasting
Train RF models on data from years $t$ and $t + 1$
10-fold cross validation
Predict unseen data from year $t + 2$

Train on year t and t + 1

Feature importance

Test on t + 2

year	RMSE	Rsquared	MAE
2002	937.93	0.96	159.87
2003	1360.28	0.94	244.75
2004	1014.83	0.95	179.15
2005	1790.07	0.89	304.86
2006	1706.73	0.92	309.16
2007	1920.11	0.91	210.23
2008	1558.92	0.92	233.35
2009	1353.12	0.93	202.70
2010	3170.16	0.63	303.68

Test on t + 2

Sectoral decomposition

Code	Industry name
s1	Agriculture
s2	Mining, quarrying and energy supply
s3	Food beverages and tobacco
s4	Textiles and leather etc.
s5	Coke, refined petroleum, nuclear fuel and chemicals etc.
s6	Electrical and optical equipment and transport equipment
s8	Other manufacturing
s9	Construction
s10	Distribution
s11	Hotels and restaurants
s12	Transport storage and communication
s13	Financial intermediation
s14	Real estate renting and business activities
s15	Non-Market Services

Sectoral decomposition

Higher accuracy in trade of goods ($s1$-$s8$) than services ($s10$-$s15$)
Drop in prediction accuracy in $2010$ for service sectors ($s10$-$s15$) due to the financial crisis and its knock-on effects
The decrease of interregional trade volume makes it more difficult to predict
Hotels and Restaurants ($s11$): the most difficult sector to predict because of strong intraregional trade dependencies

Alternative specifications

Distance plays the most important role in predicting interregional trade flows
The difference in prediction accuracy between the models with and without distance decreases over time
Over time, as the adoption rate of web technologies increased, interregional trade flows left more digital breadcrumbs behind

Robustness check: websites with up to 10 postcodes

year	RMSE	Rsquared	MAE
2002	1181.91	0.94	244.27
2003	1428.99	0.93	282.77
2004	1011.14	0.95	173.31
2005	1414.77	0.94	232.25
2006	1433.92	0.94	208.32
2007	1894.59	0.91	227.77
2008	1206.30	0.95	249.66
2009	2008.83	0.81	238.38
2010	2500.10	0.78	298.27

Robustness check: websites with up to 10 postcodes

Figure: Predicted vs. observed interregional trade by year for multiple postcodes

A local level application

From NUTS2 to Local Authorities
No such spatially disaggregated trade data
The main subnational administrative division in the UK
Trained in 2008 and 2009, tested for 2010
Can’t validate for Local Authorities

A local level application

Both of these examples illustrate the importance of distance in trade flows
Camden appears to have more light colour links not only with adjacent LADs, but also with more distant ones
Not surprisingly, Camden’s reach appears to be more extended than Birmingham’s.
Illustration of the capacity of our research framework for spatially disaggregated analysis of trade flows

Conclusions

Interregional trade is important to know about…
… but very difficult to capture
Current state of the art: distance decay
Interregional trade increasingly leaves behind a digital paper trail
Highly accurate prediction framework
Sectorally disaggregated
Opportunity for more spatially disaggregated trade studies
Wide availability of current web archives: nowcasting in different geographical contexts

References

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32.

Halavais, Alexander. 2000. “National Borders on the World Wide Web.” New Media & Society 2 (1): 7–28.

Holmberg, Kim. 2010. “Co-Inlinking to a Municipal Web Space: A Webometric and Content Analysis.” Scientometrics 83 (3): 851–62.

Holmberg, Kim, and Mike Thelwall. 2009. “Local Government Web Sites in Finland: A Geographic and Webometric Analysis.” Scientometrics 79 (1): 157–69.

Janc, Krzysztof. 2015. “Geography of Hyperlinks—Spatial Dimensions of Local Government Websites.” European Planning Studies 23 (5): 1019–37.

Jones, Brant W, Ben Spigel, and Edward J Malecki. 2010. “Blog Links as Pipelines to Buzz Elsewhere: The Case of New York Theater Blogs.” Environment and Planning B: Planning and Design 37 (1): 99–111.

Keßler, Carsten. 2017. “Extracting Central Places from the Link Structure in Wikipedia.” Transactions in GIS 21 (3): 488–502.

Krüger, Miriam, Jan Kinne, David Lenz, and Bernd Resch. 2020. “The Digital Layer: How Innovative Firms Relate on the Web.” ZEW-Centre for European Economic Research Discussion Paper, no. 20-003.

Lin, Jia, Alexander Halavais, and Bin Zhang. 2007. “The Blog Network in America: Blogs as Indicators of Relationships Among US Cities.” Connections 27 (2): 15–23.

Salvini, Marco M, and Sara I Fabrikant. 2016. “Spatialization of User-Generated Content to Uncover the Multirelational World City Network.” Environment and Planning B: Planning and Design 43 (1): 228–48.

Serrano, Ma Ángeles, and Marián Boguñá. 2003. “Topology of the World Trade Web.” Phys. Rev. E 68 (July): 015101. https://doi.org/10.1103/PhysRevE.68.015101.

Thissen, M, D Diodato, and F Van Oort. 2013a. “European Regional Trade Flows: An Update for 2000–2010.” PBL Netherlands Environmental Assessment Agency, The Hague.

———. 2013b. “Integrated Regional Europe: European Regional Trade Flows in 2000.” PBL Netherlands Environmental Assessment Agency, The Hague.

Vaughan, Liwen. 2004. “Exploring Website Features for Business Information.” Scientometrics 61 (3): 467–77.

Vaughan, Liwen, Yijun Gao, and Margaret Kipp. 2006. “Why Are Hyperlinks to Business Websites Created? A Content Analysis.” Scientometrics 67 (2): 291–300.

Vaughan, Liwen, and Guozhu Wu. 2004. “Links to Commercial Websites as a Source of Business Information.” Scientometrics 60 (3): 487–96.
