Applying Machine Learning to forecast the Unemployment Rate in Australia as a result of Covid-19

Predicting the impact of a White Swan event is not an easy job. By applying supervised machine learning techniques to inputs collected from open sources, I aim to analyze the indicators that influence the unemployment rate in Australia and to forecast the unemployment rate for 1Q2021 in the wake of Covid-19.

Machine learning (supervised) techniques: Time-series forecasting techniques and regression

Programming language: R

The detailed thought process and source code:

Input resources:

  • Monthly unemployment rate (cat.no.6202.0 ; Australian Bureau of Statistics)
  • Quarterly economy data (cat.no.5206.0 ; Australian Bureau of Statistics)
  • Quarterly unemployment rate (https://www.ilo.org/shinyapps/bulkexplorer15/ ; International Labour Organization, ILOSTAT)
  • Quarterly Wages & Salaries by State (cat.no.5676.0, table 20 ; Australian Bureau of Statistics)

Agenda:

Part 1. Exploratory analysis: data cleaning & understanding characteristics of the dataset

Part 2. Time-series forecast: Random Walk, Simple Exponential Smoothing, Holt’s trend, and ARIMA (Autoregressive Integrated Moving Average)

Part 3. Study the influencing factors using a machine learning technique: Multiple Linear Regression

Part 1. Exploratory analysis: data cleaning & understanding characteristics of the dataset

The CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology is used to explore and conduct the analysis (Graph 1). Business understanding is reviewed first to provide the big picture, followed by pre-processing the data through exploratory analysis, forming the analysis direction, and selecting the appropriate datasets to study.

To begin with, labour market theory is reviewed to define the direction of the analysis. There are thirteen causes that influence unemployment, categorized into (1) supply-side causes, (2) demand-side causes, and (3) real-wage unemployment (Table 1). Frictional unemployment is defined as the time people take to move between jobs; structural unemployment is defined as a mismatch of skills; and real-wage unemployment is defined as a fall in the demand for labour because firms have less incentive to employ workers at a higher wage rate.

First, for supply-side causes, there are two scenarios: (1) an increase in labour supply and (2) a decrease in labour supply. In the first scenario, the hypothesis is that the number of people entering the labour market increased dramatically as a result of the pandemic. For instance, students living in Australia who were eligible to work domestically, but had not been looking for jobs because they could rely on family support, decided to seek work to help their families through the recession. Line S would then shift to S1 (Graph 2), moving LS to LS1 faster than the WS line shifts (from WS to WS1) (Graph 3). As a result, the unemployment rate would increase (U < U1) (Graph 3).

In the second scenario, the hypothesis is that the number of people in the labour market decreased significantly as a result of the lockdown. The potential explanation is that a large number of fresh graduates and workers left the labour force and/or the number of deaths exceeded expectations. This caused a significant decrease in supply, moving S to S2, LS to LS2, and WS to WS2. In addition, WS2 moved faster than LS2 because enterprises had a narrower choice of qualified candidates.

Graph 2 & 3: Impacts of Supply Side To Unemployment

Graph 4 & 5: Australia Labour Market 2019–20

Next, historical data on the Australian labour market for 2019–20 was explored to test the above hypotheses. Graphs 4 & 5 show that the population aged 15 years and over in 2018–20 was not affected much. However, the Australian labour force dropped dramatically while the unemployment rate rocketed. Hence, the second scenario occurred. As the second hypothesis has two potential sub-cases, data on people not in the labour market in 2018–20 and on the population aged 15 years and over in 2018–20 were evaluated for further insight. According to Table 2, the main reason for not joining the labour force is that people did not actively look for work, and the sharp upward shift of the line was also driven by the 82% increase in that category (Australian Bureau of Statistics, 2020). Hence, the lockdown has affected Australian sentiment and job search demand.

Table 2: Reasons for Not In Labour Market

Second, for real-wage unemployment, it is possible that the 1.75% wage increase implemented in July 2020 (Fair Work Ombudsman, 2020) affected the unemployment rate. Under normal conditions, lifting the wage-setting curve can help the labour market by shifting E to E1 and U to U1. However, some companies may be unable to bear the new wage setting, which shifts quantity from Q1 to Q2, shifts the labour supply line towards L1, and moves U to U1 (Graphs 6 and 7, respectively). As this order will only be fully effective in February 2021, and real-wage unemployment responds with a certain delay, more data needs to be collected and explored at a later stage.

Graph 6 & 7: Real-Wage Unemployment

Lastly, for demand-side causes, when the long-run aggregate supply (LRAS) line shifts backward, the wage-setting curve is affected and shifts to the left, and demand-deficient unemployment results (Graph 8 & 9). Graphs 10 & 11 provide clear evidence that demand-deficient unemployment is likely to be the case here.

Graph 8 & 9: Impacts of Demand Side To Unemployment
Graph 10 & 11: Historical data 2019–20

In summary, three scenarios have been reviewed to give an overview of the potentially related factors to select and analyze. For the forecast itself, time-series analysis and regression analysis are the techniques chosen.

Table 3: Applied Supervised Learning Techniques

Part 2. Time-series forecast: Random Walk, Simple Exponential Smoothing, Holt’s trend, and ARIMA (Autoregressive Integrated Moving Average)

Time-series forecasting:

predicted value = intercept + lagged values + lagged errors
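
In symbols, the word equation above matches the standard ARMA(p, q) form. The notation here is an assumption added for illustration and is not taken from the original post: y_t is the unemployment rate at time t, c is the intercept, the phi terms weight the p lagged values, and the theta terms weight the q lagged errors.

```latex
y_t = c + \sum_{i=1}^{p} \phi_i \, y_{t-i} + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j} + \varepsilon_t
```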

Before moving to the next step, it is necessary to check whether the dataset is white noise. The white-noise criteria are (1) the mean equals 0, (2) the standard deviation is constant, and (3) the correlation between lags is 0.
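
The three criteria can be checked roughly in base R. The sketch below runs on synthetic data rather than the ABS series, and the helper name `check_white_noise` is hypothetical. It compares the standard deviation of the two halves of the series as a crude constancy check and uses the Ljung-Box test from the stats package for the autocorrelation criterion.

```r
# Synthetic data only: 'wn' is pure noise, 'rw' is a random walk built from it.
set.seed(123)
wn <- rnorm(200)
rw <- cumsum(wn)

# Hypothetical helper: one rough number per white-noise criterion.
check_white_noise <- function(x, lag = 12) {
  half <- seq_len(length(x) %/% 2)
  list(
    mean        = mean(x),                       # criterion 1: close to 0
    sd_halves   = c(sd(x[half]), sd(x[-half])),  # criterion 2: roughly equal
    ljung_box_p = Box.test(x, lag = lag, type = "Ljung-Box")$p.value  # criterion 3
  )
}

wn_check <- check_white_noise(wn)
rw_check <- check_white_noise(rw)

# A very small p-value rejects the hypothesis of no autocorrelation.
print(rw_check$ljung_box_p)
```

A series that fails any of the three checks, as the random walk does here, is not white noise and still contains structure that a forecasting model can use.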

Graph 12: Monthly Unemployment Rate in Australia Feb 1978 — Nov 2020

First, Graph 12 shows that the mean of the dataset was not equal to 0. In addition, the standard deviation of the dataset was not constant. The “Acf” function from the forecast package in R was used to test the correlation between lags, and the result was not equal to 0 (Graph 13). Note that because this autocorrelation function comes from the forecast package rather than the stats package, the horizontal axis shows lags in time units (months) rather than seasonal units.
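
The autocorrelation check can be sketched with `acf` from the stats package so that it runs without extra packages. The series is synthetic, and the confidence band is the usual large-sample approximation. For a monthly `ts` object, stats reports lags in seasonal units (years), which is the difference from the forecast package noted above.

```r
set.seed(2020)
# Synthetic monthly series with strong persistence (NOT the ABS data).
rw <- ts(cumsum(rnorm(300)), frequency = 12)

ac  <- acf(rw, lag.max = 24, plot = FALSE)
rho <- as.numeric(ac$acf)          # rho[1] is lag 0 and always equals 1

# Approximate 95% band for a white-noise series of this length.
band <- qnorm(0.975) / sqrt(length(rw))
n_significant <- sum(abs(rho[-1]) > band)

# With frequency = 12, stats::acf labels the first lag as 1/12 of a year.
first_lag <- as.numeric(ac$lag)[2]
print(c(lag1 = rho[2], significant_lags = n_significant, first_lag = first_lag))
```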

Graph 13: Autocorrelation Test

Furthermore, the Augmented Dickey-Fuller test was conducted to check whether the dataset is stationary. The result indicates that the time series is not stationary (Graph 14). Detailed information about the null hypothesis testing can be found in the GitHub file. A first-difference transformation is applied to make the dataset stationary for the autoregressive integrated moving average (ARIMA) analysis at a later stage.
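
A minimal sketch of the first-difference step on a synthetic random walk, using base R only. A formal unit-root test (for example `adf.test` from the tseries package) would normally be run before and after differencing; it is left out here to avoid an extra dependency, and the lag-1 autocorrelation is used as a rough stand-in.

```r
set.seed(11)
# Synthetic non-stationary series (NOT the ABS data): level 5 plus accumulated shocks.
shocks <- rnorm(250, sd = 0.1)
y <- ts(5 + cumsum(shocks), start = c(2000, 1), frequency = 12)

# First difference: each value is the change from the previous month.
dy <- diff(y)

# Lag-1 autocorrelation before and after differencing.
lag1 <- function(x) as.numeric(acf(x, lag.max = 1, plot = FALSE)$acf)[2]
lag1_level <- lag1(y)
lag1_diff  <- lag1(dy)

# Differencing is reversible: the first value plus cumulated changes rebuilds y.
rebuilt <- cumsum(c(y[1], dy))
print(c(level = lag1_level, differenced = lag1_diff))
```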

Graph 14: Augmented Dickey-Fuller Test

Then, the dataset was split into training and test sets. The training set contains data from February 1978 to June 2020, and the test set contains data from July 2020 to November 2020. Four models were implemented for forecasting: (1) Random Walk Forecast, (2) Simple Exponential Smoothing (SES), (3) Holt’s trend, and (4) Seasonal Autoregressive Integrated Moving Average, SARIMA(2,0,2)(0,0,2)(12). Model (2) outperformed the others, with a mean absolute percentage error (MAPE) of 1.438719% (Table 4). A detailed explanation of the model-building methods can be found in the GitHub file.
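
The split-and-compare workflow can be sketched in base R as follows. This is not the author's code: the data is synthetic, `HoltWinters` from the stats package stands in for whichever functions were actually used, and the Holt smoothing parameters are illustrative values rather than tuned ones. The SARIMA model is omitted to keep the sketch short.

```r
# Synthetic monthly series standing in for the unemployment rate (NOT the ABS data).
set.seed(42)
y <- ts(6 + cumsum(rnorm(120, sd = 0.1)), start = c(2011, 1), frequency = 12)

# Hold out the last 5 months as the test set, mirroring a July-to-November split.
h <- 5
n <- length(y)
train <- ts(y[1:(n - h)], start = start(y), frequency = 12)
test  <- as.numeric(y[(n - h + 1):n])

# Mean absolute percentage error, in percent.
mape <- function(actual, predicted) mean(abs((actual - predicted) / actual)) * 100

# (1) Random walk: every future value equals the last observed value.
rw_fc <- rep(train[length(train)], h)

# (2) Simple exponential smoothing: level only, alpha estimated from the data.
ses_fit <- HoltWinters(train, beta = FALSE, gamma = FALSE)
ses_fc  <- as.numeric(predict(ses_fit, n.ahead = h))

# (3) Holt's trend: level plus trend. alpha and beta are assumed values,
#     fixed here for reproducibility; omit them to let R estimate both.
holt_fit <- HoltWinters(train, alpha = 0.8, beta = 0.1, gamma = FALSE)
holt_fc  <- as.numeric(predict(holt_fit, n.ahead = h))

results <- c(random_walk = mape(test, rw_fc),
             ses         = mape(test, ses_fc),
             holt        = mape(test, holt_fc))
print(round(results, 3))
```

The model with the lowest test-set MAPE is preferred, which is the basis on which the article selects SES.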

Table 4: Time-Series Forecasting Methods & Results

The four-month prediction was conducted and the results are plotted in Graph 15. While Holt’s trend method indicates that the unemployment rate will increase continuously, the SARIMA(2,0,2)(0,0,2)(12) model indicates that the unemployment rate could decrease, reaching 6.8% in March 2021 (Table 5).
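
A four-month-ahead forecast with approximate 95% intervals can be sketched with `arima` and `predict` from the stats package. The series is synthetic and the AR(1) order is a simplification assumed for illustration; it is not the SARIMA specification reported above, and the numbers it produces are unrelated to Table 5.

```r
set.seed(7)
# Synthetic AR(1) series around 6 (NOT the ABS data), monthly, ending Nov 2020.
y <- ts(6 + as.numeric(arima.sim(list(ar = 0.8), n = 240, sd = 0.2)),
        start = c(2000, 12), frequency = 12)

# Simplified order, assumed for illustration only.
fit <- arima(y, order = c(1, 0, 0))

# Point forecasts and standard errors for the next four months.
fc <- predict(fit, n.ahead = 4)
z  <- qnorm(0.975)

forecast_table <- data.frame(
  month = c("Dec 2020", "Jan 2021", "Feb 2021", "Mar 2021"),
  point = as.numeric(fc$pred),
  lower = as.numeric(fc$pred - z * fc$se),
  upper = as.numeric(fc$pred + z * fc$se)
)
print(forecast_table)
```

The interval widens with each step ahead, which is worth keeping in mind when reading a single point forecast for the last month of the horizon.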

Graph 15: Forecast Monthly Unemployment Rate Dec 2020 — Mar 2021 & Table 5: Forecast in details

Part 3. Study the influencing factors using a machine learning technique: Multiple Linear Regression

Graph 16: Correlation Plot

From Graph 16, the variables with a strong statistical relationship with the unemployment rate are GDPchange, GDPprice, Wage.Victoria, Wage.total, Wage.alter, GDPindex, Domesticdemand, and Wage.WAustralia. However, some multicollinearity is to be expected among them. Multicollinearity was therefore diagnosed using the vif function from the car package in R. The criteria for selecting variables and building the model are (1) p-value < 0.05, (2) VIF < 10, and (3) R-squared >= 0.3. According to Moore, Notz, and Fligner (2013), an R-squared lower than 0.3 indicates no effect or a very weak effect size, and, according to James et al. (2014), variables with a VIF above 5 or 10 should be removed from the model.
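
To show what the VIF measures, the sketch below computes it by hand in base R instead of calling `car::vif`. Each predictor is regressed on the others and VIF = 1 / (1 - R-squared). The data and the helper name `vif_manual` are hypothetical.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly collinear with x1 by construction
x3 <- rnorm(n)                  # independent of the other two
X  <- data.frame(x1, x2, x3)

# Hypothetical helper: regress each column on the rest and convert R-squared to VIF.
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    r2 <- summary(lm(reformulate(others, response = v), data = df))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(X)
print(round(vifs, 2))

# Apply the article's VIF < 10 rule to flag candidates for removal.
flagged <- names(vifs)[vifs >= 10]
print(flagged)
```

In practice only one variable from a collinear pair is dropped at a time, and the VIFs are recomputed before the next decision.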

The dataset was split into training and test sets at a ratio of 80:20. Detailed information on the hypothesis testing, variable selection process, and thought process can be found in the GitHub file.

Prediction model:

Quarterly Unemployment Rate (%) = 6.16691 - 0.92121 * [Domestic final demand: Index - Percentage change] - 0.17724 * [Wages quarterly percentage change ; Total (State); Total (Industry)]
Standard deviation of residuals: 0.7092026
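
The fitted equation can be applied directly as a small R function. The coefficients are the ones reported above; the function name and the three input scenarios are hypothetical and are not the assumptions used in Table 6.

```r
# Coefficients copied from the fitted model reported in the article.
predict_unemployment <- function(demand_change, wage_change) {
  6.16691 - 0.92121 * demand_change - 0.17724 * wage_change
}

# Hypothetical inputs for illustration only (NOT the Table 6 assumptions).
scenarios <- data.frame(
  case          = c("flat", "mild recovery", "contraction"),
  demand_change = c(0, 1, -1),   # quarterly % change in domestic final demand
  wage_change   = c(0, 1, 0)     # quarterly % change in total wages
)
scenarios$unemployment <- with(scenarios,
                               predict_unemployment(demand_change, wage_change))
print(scenarios)
```

Both coefficients are negative, so in this model stronger demand growth or wage growth lowers the predicted unemployment rate, with demand having about five times the effect per percentage point.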

Three cases were created to forecast unemployment in 1Q2021 under the assumptions stated in Table 6.

Table 6: Predictive Results & Assumptions

The shortcoming of this regression model is that its R-squared (0.3794) is lower than 0.5. R-squared can be raised simply by adding more variables, since it never decreases when a variable is added. To ensure that added variables are genuinely useful, adjusted R-squared should also be considered when evaluating the model’s performance. Quarterly datasets from different sources were used because of the shortage of data. There is therefore a risk that data collection, pre-processing, and terminology definitions differ between inputs. Consequently, potential improvements are: (1) collect real-time data at a higher frequency, such as monthly instead of quarterly, and (2) increase model complexity by including variables such as government spending on the JobKeeper package and other support measures aimed at economic recovery.
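
The difference between R-squared and adjusted R-squared can be seen on synthetic data. In this sketch a predictor that is pure noise is added to a simple regression: R-squared cannot go down, while adjusted R-squared applies a penalty for the extra term. The variable names and data are hypothetical.

```r
set.seed(99)
n     <- 60
x1    <- rnorm(n)
y     <- 2 - 0.9 * x1 + rnorm(n)
noise <- rnorm(n)               # unrelated to y by construction

small <- summary(lm(y ~ x1))
big   <- summary(lm(y ~ x1 + noise))

# Adjusted R-squared for n observations and p predictors (intercept excluded).
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

comparison <- data.frame(
  model  = c("x1 only", "x1 + noise"),
  r2     = c(small$r.squared, big$r.squared),
  adj_r2 = c(adj_r2(small$r.squared, n, 1), adj_r2(big$r.squared, n, 2))
)
print(comparison)
```

If adjusted R-squared does not rise when a variable is added, that variable is a weak candidate for the model.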

References:

Australian Bureau of Statistics. 2020. Reasons people are not in the labour force. [online] Available at: <https://www.abs.gov.au/articles/reasons-people-are-not-labour-force> [Accessed 3 February 2021].

Fair Work Ombudsman. 2020. Welcome to the Fair Work Ombudsman website. [online] Available at: <https://www.fairwork.gov.au/about-us/news-and-media-releases/website-news/the-commission-has-announced-a-1-75-increase-to-minimum-wages> [Accessed 3 February 2021].

James, G., Witten, D., Hastie, T. & Tibshirani, R., 2014. An Introduction to Statistical Learning: With Applications in R.

Moore, D.S., Notz, W.I. & Fligner, M.A., 2013. The Basic Practice of Statistics (6th ed.). New York, NY: W.H. Freeman and Company. Page 138.


Postgraduate at University of New South Wales (Australia) ; studying Data Mining and Analysis at Stanford University ; Opening to collab or work