Market study of data science jobs
The aim of this project is to investigate the trend in Jobs related to Data Science, Data Analysts, Business Intelligence and Business Analysts.
Project Outline
Problem Statement
- Determine the industry factors that are most important in predicting the salary amounts for these data.
- Determine the factors that distinguish job categories and titles from each other. For example, can required skills accurately predict job title?
Approach
This is a data science problem that required data extraction, data cleaning and data modelling. Each of the 3 sections are explained below:
1. Data extraction
The Job postings were obtained using webscraping code written using Python, Scrapy and Selector, Xpath.
Source and output details:
- Job aggregator: Seek
- Locations of interest: NSW, Victoria and Queensland were scraped.
- No of jobs obtained: 2831
- Output format - csv, SQL database
- Click here to view the WebScraping code on github
Scraper Design:
- Create search url -
job_desc_i_want = ['Data-Science', 'Data-Analyst', 'Business-Intelligence', 'Research-Scientist', 'Business-Analyst'] where_is_the_job = ['-jobs/in-All-Sydney-NSW', '-jobs/in-All-Melbourne-VIC', '-jobs/in-All-Brisbane-QLD'] # generate url string 'https://www.seek.com.au/' + job_desc_i_want + where_is_the_job
- Get search result pages url -
# Get the html response from the first page url using Requests and Selector response = requests.get(url) # get the link to the next page next_page_anchor = Selector(text=response.text).xpath("//a[@data-automation='page-next']/@href").extract() # Save every search result page to a pandas dataframe
- Get individual Job posting links from each search page
- Get Job details of each job and save to CSV and SQL database.
Title, region, Location, Salary, Worktype, Advertiser, Classification, Job Description
2. Data Cleaning
The dataset consists of many categorical variables. One of the required feature is Salary. The salary was in an inconsistent format with either textual description, numerical ranges or values.
First look at the dataset:
- No of datapoints - 2831
- Continuous feature - Salary(eg: 100000 - 120000 p.a. with 9.5% Super)
- The digits had to be extracted from the inconsistent salary format. Regular expressions were very helpful.
- Removed Super and Bonus information.
- Averaged the salary range and standardized the salaries at a yearly rate.
- Removed Outliers. Boxplot was helpful in checking this.
- Textual Information - Job description
- Job description - Some text manipulation was required here to remove special charachters and noise that was extracted.
- Categorical features - Region, Location, Worktype, Classification, Advertoser
- Reducing the number of distinct values in categorical features using logical assumptions was helpful.
- Explore the data
- Lot of jobs in Sydney compared to Melbourne and Brisbane
- In every city the CBD has more jobs compared to suburbs
- Junior postions are in demand
- Median salary is higher in sydney except for a Manager position. Interesting!
- Among Data analyst, Data scientist, Research scientist, Business analysts - Business analysts are the highest paid !
- Plotting a heat map was not useful in this case.
- Click here to view the Data cleaning and EDA notebook
3. Data Modelling
3A. What are the factors that impact salary and can be used to predict it?
Jobs available - 2830
Jobs with valid salary - 783
Approach
a. There are 783 jobs with valid salary. Firstly, I will use this subset to see if a salary prediction model provides good results. b. Secondly, Using these 783 jobs, I will check if it is possible to predict the salary range for a given job. This being a binary classification problem will use logistic regression and other classification models.
Risk
Small dataset and Unequal distribution of the valid salaries
Salary Prediction model:
- Stats model with all features - R2 0.244 - The predictors explains 24.7% of the variance, which is not good.
- Test train splits tried - 70/30,20/80
- K-fold Cross validation - A negative R2!
- Regularization using Ridge, Lasso - No improvement
- Feature Selection - kbest
- Gradient descent regressor model - Score - 0.91
Inference:
- The Linear Regression is not a good model for predicting the salary. The small dataset and the unequal distribution are not good enough to create a regression model.
Salary Range Prediction model:
- Salary Median - 120000
- Create a new feature High Paying job - Salary > 120000 (1)
- Logistic regression model:
- Target - High_Pay
- Features - Job Title, Job Location, Job Classification, Work Type
- Score: 0.70861
Inference:
- Even with less no of data points, the regression model performed well. This was because of equal distribution of target variables.
3B. What components of a job posting distinguish data scientists from other data jobs?
The Job descriptions contained the keywords. Following NLP techniques were performed:
Count Vectorizer with word analyzer and english stop words were used to get the 1000 2-grams and 3-grams with largest frequency in the JD.
Create these n-grams as features for the jobs
Data points - 2830 (All the jobs scraped were used here)
Most predictive features found using GridSearchCV
Results
- Skills that distinguish Data Scientists and Analysts from other jobs are:
- ‘data science’, ‘data analyst’, ‘business intelligence’, ‘data analytics’, ‘data reporting’, ‘data driven’, ‘best practices’
- What are the key factors/skills/keywords for Data Science jobs?
- Skills: ‘advanced excel’, ‘complex data’,’data science’,’machine learning’, ‘big data’,’power bi’,’data integration’, ‘excellent communication skills’, ‘market leader’,’problem solving’, ‘data management’,’management skills’, ‘development implementation’, ‘written communication’,tertiary qualifications’, ‘requirements gathering’,’business needs’,’stakeholder management’,business requirements’,
-
Key job positions: ‘data analyst’,’data scientist’,’data modelling’, ‘data warehouse’,’data analysis’,’business analysts’,high performing’, ‘business intelligence’,
- Experience: ‘previous experience’, ‘seeking experienced’, ‘best practice’, ‘proven experience’, ‘strong experience’,’experience working’, ‘demonstrated experience’,’extensive experience’, ‘extensive experience’
Click here to view the modelling steps on github
Conclusion
Data Science is a very new industry so direct predictions could not be made for the salary. However, the skills and salary ranges for the Data Science/Analysts in the industry are quite distinct. The following improvements to the study can provide better insights.
-
The lack of valid salary data was a setback for the Salary prediction. It is expected since this is an upcoming industry and the Salaries have not been standardised. As a next step, more job posts with valid salaries can be gathered over time to analyse in the next phase.
-
This phase concentrated on extracting the keywords from various Job descriptions by advertisers. Though key words have been extracted and accuracy checked, identified and known keywords need to be checked against the various facors like salary, seniority, industry.