5 Processing pipeline

5.1 Purpose

The City Digital Twin pipeline is an ETL (extract, transform, load) software system developed by Digital twin LLC. It is designed to:

  • collect the initial data on indicators of socio-economic and natural-anthropogenic development of a city from available data sources
  • update the parameters for mathematical models representing the life cycles of settlements, territories, and areas of activity
  • prepare forecast scenarios for the development of the territory
  • estimate general indicators of the condition of the territory and of the achievement of the target outcome
  • conduct a bottleneck analysis
  • conduct a comparative analysis between the territories
  • prepare parameters for management decisions that ensure the achievement of the target state

Pipeline features:

  • functional connectivity of data sets at each stage of pipeline processing (transparency of transformation)
  • traceability of data sets from the source to the recipient of the information
  • integrity, completeness, and consistency of output indicator values
  • satisfaction of balance ratios (output indicators satisfy a system of equations set by a unified model of the socio-economic situation and development of territories)

5.2 Underlying technology

The pipeline is built on targets, an open-source function library (package) for the open-source R platform.

With targets, the pipeline is declared and executed as a set of computational nodes (tar_target) connected into a unified network (tar_repository) by their input and output data sets (tar_objects). The network records changes in the state of upstream nodes and in the content of the calculation functions, and uses them to determine whether the result at each node is still up to date. This provides tracing and control over the relevance of data at every processing node.
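The idea of tracking relevance at each node can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the targets API: the Pipeline class, its target and make methods, and the fingerprinting scheme are all hypothetical. It assumes nodes are declared in dependency order and that a node is outdated when its function code or the values of its upstream nodes have changed.

```python
import hashlib


class Pipeline:
    """Toy dependency network: each node is a function of upstream nodes."""

    def __init__(self):
        self.nodes = {}   # name -> (function, list of upstream node names)
        self.values = {}  # name -> last computed value
        self.stamps = {}  # name -> fingerprint recorded at last computation

    def target(self, name, func, deps=()):
        """Declare (or redeclare) a node; assumes upstream nodes are declared first."""
        self.nodes[name] = (func, list(deps))

    def _fingerprint(self, name):
        """Hash the node's function code together with its upstream values."""
        func, deps = self.nodes[name]
        digest = hashlib.sha256()
        digest.update(func.__code__.co_code)
        digest.update(repr(func.__code__.co_consts).encode())
        for dep in deps:
            digest.update(repr(self.values[dep]).encode())
        return digest.hexdigest()

    def make(self):
        """Recompute only outdated nodes; return the names that were recomputed."""
        rebuilt = []
        for name, (func, deps) in self.nodes.items():
            stamp = self._fingerprint(name)
            if self.stamps.get(name) != stamp:
                self.values[name] = func(*(self.values[d] for d in deps))
                self.stamps[name] = stamp
                rebuilt.append(name)
        return rebuilt


pipe = Pipeline()
pipe.target("raw", lambda: [3, 1, 2])
pipe.target("clean", lambda raw: sorted(raw), deps=["raw"])

first = pipe.make()    # both nodes are computed
second = pipe.make()   # nothing changed, so nothing is recomputed
pipe.target("raw", lambda: [3, 1, 2, 0])  # the source node changes
third = pipe.make()    # "raw" is rebuilt, and "clean" follows because its input changed
```

The second call to make does no work, which is the property that makes tracing and selective recalculation possible in a large network of nodes.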

5.3 Pipeline objectives

5.3.1 Data preparation

  • obtaining programmatic (API) access to external data sources with indicators of socio-economic and spatial development
  • downloading and parsing raw data
  • confirming the completeness of primary data from sources (open data, official statistics, Federal Tax & Customs services, budget systems)
  • maintaining directories (indicators, analytic dimensions, scenarios, versions, stages of processing), data models of sources and recipients, as well as correspondence tables, taking into account changes over time
  • collecting data on investment projects from external sources

5.3.2 Data processing

  • elimination of technical errors in primary data formats, including field naming, value formats, shifts in data series, gaps in analytical data slices
  • aggregation of downloaded primary data and reduction to a reference structure
  • detection of changes in primary data, including retrospective revisions, in the data structure, directories, and actual values
  • recovery of gaps, elimination of data duplication
  • validation and proposals for adjusting the values of indicators by econometric methods according to a set of rules, including:
    • meeting balance ratios
    • being in the range of acceptable intervals
    • satisfying the ratios of the principal components (eigenvectors)
  • normalization of the database of projects by implementing a unified system of indicators and characteristics
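Two of the validation rules listed above, balance ratios and acceptable intervals, can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the pipeline's rule library: the validate function, the indicator names, and the tolerance are hypothetical, and a balance is assumed to mean "a total equals the sum of its parts".

```python
def validate(record, balances, ranges, tolerance=1e-6):
    """Return a list of human-readable rule violations for one record."""
    problems = []
    # Balance ratios: each total must equal the sum of its components.
    for total, parts in balances.items():
        expected = sum(record[p] for p in parts)
        if abs(record[total] - expected) > tolerance * max(1.0, abs(expected)):
            problems.append(f"balance: {total} != sum of {parts}")
    # Acceptable intervals: each value must lie within [low, high].
    for name, (low, high) in ranges.items():
        if not low <= record[name] <= high:
            problems.append(f"range: {name} outside [{low}, {high}]")
    return problems


# Hypothetical record in which urban + rural does not add up to the total.
record = {"population": 1000, "urban": 700, "rural": 250, "unemployment_rate": 0.04}
balances = {"population": ["urban", "rural"]}
ranges = {"unemployment_rate": (0.0, 1.0)}

issues = validate(record, balances, ranges)
corrected = validate({**record, "rural": 300}, balances, ranges)
```

Here issues contains one balance violation, while the corrected record passes both rules.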

5.3.3 Model calibration

  • calibration of models of the dynamics of macroeconomic indicators of the territory
  • calculation of eigenvectors (7 items) of socio-economic development for the territory
  • calculation of transition matrices of initial and target macroeconomic data to eigenvectors
  • calculation of correlation matrices of influence (sensitivity of changes in indicators to each other and to external factors)
  • calculation of intersectoral balance models (multiplier matrices) by detailing macroeconomic data at the industry level
  • calibration of interterritorial balance models (stress matrices) by estimating commuting, passenger traffic, and cargo traffic
  • calibration of supply and demand models across a range of products

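As an illustration of the multiplier matrices mentioned above, the sketch below computes the Leontief inverse (I - A)^-1 from a matrix of direct production coefficients A. The source does not state which formulation the pipeline uses, so the Leontief model is an assumption here, and the two-sector coefficients and function names are hypothetical.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [M | I].
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    # The right half now holds the inverse.
    return [row[n:] for row in aug]


def multipliers(direct_coefficients):
    """Leontief inverse (I - A)^-1 for a matrix A of direct production coefficients."""
    n = len(direct_coefficients)
    identity_minus_a = [[(1.0 if i == j else 0.0) - direct_coefficients[i][j]
                         for j in range(n)] for i in range(n)]
    return invert(identity_minus_a)


# Hypothetical two-sector economy.
A = [[0.2, 0.3],
     [0.4, 0.1]]
L = multipliers(A)

# Total output required to satisfy a given final demand: x = L * d.
final_demand = [100.0, 50.0]
output = [sum(L[i][j] * final_demand[j] for j in range(2)) for i in range(2)]
```

Each column of L shows how much total output in every sector is needed per unit of final demand in one sector, which is why such matrices are used for impact estimates.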
5.3.4 Analysis, evaluation, and forecasting

  • scenario forecasting for macroeconomic indicators of territories
  • sectoral forecasting by territory
  • calculation of generalizing indicators (efficiency, reliability, safety, sustainability) for individual territories and median values for a sample of territories
  • calculation of the development potential of the territory
  • calculation of the deviation of the actual dynamics of socio-economic development indicators from the target dynamics (determined by strategies and national goals)
  • calculation of the impact of the database of projects on the socio-economic development of territories
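Two of the calculations above are simple enough to sketch directly: the deviation of actual dynamics from target dynamics, and the median of a generalizing indicator over a sample of territories. The Python sketch below assumes a relative deviation per period; the deviations function and all figures are hypothetical.

```python
from statistics import median


def deviations(actual, target):
    """Relative deviation of actual from target values, per period present in both."""
    return {year: (actual[year] - target[year]) / target[year]
            for year in actual if year in target}


# Hypothetical indicator series by year.
actual = {2020: 95.0, 2021: 102.0, 2022: 99.0}
target = {2020: 100.0, 2021: 100.0, 2022: 110.0}
gap = deviations(actual, target)

# Hypothetical generalizing indicator for three territories.
scores = {"A": 0.62, "B": 0.71, "C": 0.55}
sample_median = median(scores.values())
```

A negative value in gap means the indicator is below its target trajectory for that period.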

5.3.5 Planning

  • determining the magnitude and rhythm of the necessary impact to achieve target indicators for a given vector of regulated indicators
  • determining the impact of the program on the territory to achieve the target trajectory, taking into account the specified restrictions
  • calculation of a comprehensive plan (by industries, spheres, and territories), taking into account the size of the reproduced resource

5.3.6 Data provision

  • exporting data sets and master data in CSV, XLSX, Parquet, qs, and fst formats
  • loading data into the data warehouse for access via API
  • providing an interactive dashboard for indicator sets
  • providing matrices for evaluating effectiveness

5.3.7 Documentation of data collection, processing, and provision

  • providing scenarios (actual data, assessment, forecast, plan, scenario, and goal), stages (initial, amended, corrected, stage number), versions, and methods of accounting indicator values, which is necessary for the correct interpretation and tracing of indicator values
  • maintaining a library of verification methods, rules for verification and validation of indicator values
  • preparing interactive reports on data volume, completeness, and identified errors
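One way to picture why scenario, stage, and version must accompany every indicator value is a record type that carries them explicitly. The Python sketch below is an assumption for illustration: the IndicatorValue fields and the latest helper are hypothetical and do not describe the pipeline's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndicatorValue:
    indicator: str
    territory: str
    period: int
    value: float
    scenario: str  # e.g. "actual", "forecast", "plan"
    stage: str     # e.g. "initial", "amended", "corrected"
    version: int


def latest(values):
    """Keep, for each (indicator, territory, period, scenario), the highest version."""
    best = {}
    for v in values:
        key = (v.indicator, v.territory, v.period, v.scenario)
        if key not in best or v.version > best[key].version:
            best[key] = v
    return list(best.values())


# Hypothetical values: an actual figure that was later corrected, plus a forecast.
values = [
    IndicatorValue("grp", "city_x", 2021, 510.0, "actual", "initial", 1),
    IndicatorValue("grp", "city_x", 2021, 498.0, "actual", "corrected", 2),
    IndicatorValue("grp", "city_x", 2021, 505.0, "forecast", "initial", 1),
]
current = latest(values)
```

Without the scenario and version fields, the three numbers for the same indicator, territory, and period would be indistinguishable.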

5.4 Current state of the pipeline

At the moment, the pipeline consists of:

  • master data sets
  • official and government statistics by countries, regions, and cities
  • reconciled and verified data sets
  • 2 estimated projection scenarios to 2050 for more than 300 KPIs
  • model parameters in matrices
  • data warehouse
  • data marts for different presentations
  • dashboards
  • chatbots
  • flexible reports

The pipeline scheme is presented in the form of an interactive graph.

5.5 Subject-matter composition of the pipeline

The pipeline includes stream processing and data integration from three main domains:

  • basic indicators of socio-economic development of the country, regions, municipalities, and settlements
  • basic indicators of accounting reports on the activities of most enterprises
  • basic indicators and attributes of investment projects (projects of management decisions)

Basic indicators, presented as time series, are used to build a multi-level standard model of the city (along the territory/industry/object/time dimensions) that characterizes the dynamics of the city's state and the structure of its economic activity.

The parameters of the city model are stored in the matrices of stable parameters of the city:

  • matrix of eigenvectors (principal components)
  • matrix of dynamic coefficients
  • correlation matrix
  • matrix of multipliers and specific indicators
  • stress matrix

The matrices of stable parameters are applied to the following management problems:

  • ranking and comparative analysis of territories and industries
  • scenario forecasting
  • identification of “bottlenecks”
  • sensitivity estimates
  • impact assessments
  • estimation of the required managerial impact
  • estimation of optimal development plans and programs

5.6 Technical modules of the pipeline

Quantitative pipeline characteristics

No.  Field                                  Value
1    Number of nodes with datasets, pcs.    189
2    Data volume, GB                        11
3    Number of functions, pcs.              148

5.6.1 Data collection module

The data collection module consists of:

  • connection via REST API to the following data sources:
    • official statistics (municipal, regional, and federal)
    • Federal Tax Services (balance sheets from 2012 to 2021)
  • processing hierarchical data obtained from external sources:
    • official statistics (input-output matrices, information on household income, consumer price index)
    • Federal Customs Services (volumes of import and export)
    • Central Banks (refinancing rate, balance of exports and imports, US Dollar exchange rate)
    • information about registered enterprises
    • information about small and medium-sized businesses
    • state, regional, and municipal budgets
    • investment program

5.6.2 Data normalization module

The data normalization module consists of the following nodes:

  • mapping data to the directories of the Digital Twin database
  • eliminating duplicates
  • eliminating gaps (recovery) in data series
  • bringing data sets to a unified format sufficient for further estimates
  • normalizing data on investment projects

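The duplicate-elimination and gap-recovery nodes above can be sketched as follows. The source does not specify the recovery method, so linear interpolation between known points is an assumption, and the function names and sample series are hypothetical. The sketch is in Python for illustration only.

```python
def deduplicate(rows):
    """Drop repeated (period, value) rows, keeping the first occurrence in order."""
    seen, unique = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def fill_gaps(series):
    """Linearly interpolate None values that lie between two known points."""
    filled = list(series)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


# Hypothetical yearly series with a duplicated row and missing values.
rows = [(2019, 10.0), (2020, None), (2020, None), (2021, 14.0), (2022, None)]
unique = deduplicate(rows)
values = fill_gaps([v for _, v in unique])
```

Gaps at the edges of the series (here 2022) are left as None, because interpolation needs a known point on both sides; extrapolating them would be a modeling decision rather than a normalization step.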
5.6.3 Modeling and forecasting module

The modeling and forecasting module consists of the following nodes:

  • calibrating demographic model parameters based on actual data
  • building demographic forecasts
  • calibrating the parameters of the macroeconomic dynamics model
  • building inertial forecasts of macroeconomic indicators
  • building transition matrices from observed indicators to eigenvectors (principal components) and vice versa
  • building correlation matrices linking changes in regulated indicators with changes in target indicators over time
  • assessing generalizing and relative indicators of socio-economic development
  • calculating matrices of intersectoral balances, including matrices of direct production coefficients, matrices of multipliers, specific household consumption, and budget expenditures
  • detailing and calibrating intersectoral balance sheets, taking into account actual accounting reports
  • estimating projection-based intersectoral balances based on forecasts of macroeconomic indicators, matrices of multipliers, and specific household consumption and budget expenditures
  • estimating interterritorial balances, including indicators of passenger and cargo flows
  • estimating financial models for investment projects (preparation of project passports)
  • estimating the impact of projects, portfolios, and programs on indicators of socio-economic development
  • estimating the investment development of territories (taking into account the implementation of investment projects and other management decisions)

5.6.4 Results preparation modules

The results preparation modules consist of the following nodes:

  • transformation and conversion of data sets to the format of the data consumer
  • preparation of separate data segments
  • preparation of dashboards
  • preparation of analytical reports on socio-economic development
  • preparation of technical reports on completeness and inconsistencies in data

All rights reserved Digital twin LLC