At OrigoMigo Inc., recent layoffs have raised concerns among employees about potential biases in the selection process. This analysis seeks to explore the HR data to identify patterns or biases, address specific accusations, and provide actionable insights. The ultimate goal is to reassure stakeholders and help OrigoMigo refine its HR practices.
This analysis uses the fictional IBM HR Attrition dataset from Kaggle (https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset). The dataset was created by IBM data scientists to mimic real-world HR scenarios. For this analysis, attrition is interpreted as employees being laid off.
# Load packages used for the plots below
library(ggplot2)
library(corrplot)
# Load dataset
hrdata <- read.csv("HR-Employee-Attrition.csv")
# Preview data
head(hrdata)
## Age Attrition BusinessTravel DailyRate Department
## 1 41 Yes Travel_Rarely 1102 Sales
## 2 49 No Travel_Frequently 279 Research & Development
## 3 37 Yes Travel_Rarely 1373 Research & Development
## 4 33 No Travel_Frequently 1392 Research & Development
## 5 27 No Travel_Rarely 591 Research & Development
## 6 32 No Travel_Frequently 1005 Research & Development
## DistanceFromHome Education EducationField EmployeeCount EmployeeNumber
## 1 1 2 Life Sciences 1 1
## 2 8 1 Life Sciences 1 2
## 3 2 2 Other 1 4
## 4 3 4 Life Sciences 1 5
## 5 2 1 Medical 1 7
## 6 2 2 Life Sciences 1 8
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1 2 Female 94 3 2
## 2 3 Male 61 2 2
## 3 4 Male 92 2 1
## 4 4 Female 56 3 1
## 5 1 Male 40 3 1
## 6 4 Male 79 3 1
## JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
## 1 Sales Executive 4 Single 5993 19479
## 2 Research Scientist 2 Married 5130 24907
## 3 Laboratory Technician 3 Single 2090 2396
## 4 Research Scientist 3 Married 2909 23159
## 5 Laboratory Technician 2 Married 3468 16632
## 6 Laboratory Technician 4 Single 3068 11864
## NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating
## 1 8 Y Yes 11 3
## 2 1 Y No 23 4
## 3 6 Y Yes 15 3
## 4 1 Y Yes 11 3
## 5 9 Y No 12 3
## 6 0 Y No 13 3
## RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
## 1 1 80 0 8
## 2 4 80 1 10
## 3 2 80 0 7
## 4 3 80 0 8
## 5 4 80 1 6
## 6 3 80 0 8
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 1 0 1 6 4
## 2 3 3 10 7
## 3 3 3 0 0
## 4 3 3 8 7
## 5 3 3 2 2
## 6 2 2 7 7
## YearsSinceLastPromotion YearsWithCurrManager
## 1 0 5
## 2 1 7
## 3 0 0
## 4 3 0
## 5 2 2
## 6 3 6
summary(hrdata)
## Age Attrition BusinessTravel DailyRate
## Min. :18.00 Length:1470 Length:1470 Min. : 102.0
## 1st Qu.:30.00 Class :character Class :character 1st Qu.: 465.0
## Median :36.00 Mode :character Mode :character Median : 802.0
## Mean :36.92 Mean : 802.5
## 3rd Qu.:43.00 3rd Qu.:1157.0
## Max. :60.00 Max. :1499.0
## Department DistanceFromHome Education EducationField
## Length:1470 Min. : 1.000 Min. :1.000 Length:1470
## Class :character 1st Qu.: 2.000 1st Qu.:2.000 Class :character
## Mode :character Median : 7.000 Median :3.000 Mode :character
## Mean : 9.193 Mean :2.913
## 3rd Qu.:14.000 3rd Qu.:4.000
## Max. :29.000 Max. :5.000
## EmployeeCount EmployeeNumber EnvironmentSatisfaction Gender
## Min. :1 Min. : 1.0 Min. :1.000 Length:1470
## 1st Qu.:1 1st Qu.: 491.2 1st Qu.:2.000 Class :character
## Median :1 Median :1020.5 Median :3.000 Mode :character
## Mean :1 Mean :1024.9 Mean :2.722
## 3rd Qu.:1 3rd Qu.:1555.8 3rd Qu.:4.000
## Max. :1 Max. :2068.0 Max. :4.000
## HourlyRate JobInvolvement JobLevel JobRole
## Min. : 30.00 Min. :1.00 Min. :1.000 Length:1470
## 1st Qu.: 48.00 1st Qu.:2.00 1st Qu.:1.000 Class :character
## Median : 66.00 Median :3.00 Median :2.000 Mode :character
## Mean : 65.89 Mean :2.73 Mean :2.064
## 3rd Qu.: 83.75 3rd Qu.:3.00 3rd Qu.:3.000
## Max. :100.00 Max. :4.00 Max. :5.000
## JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
## Min. :1.000 Length:1470 Min. : 1009 Min. : 2094
## 1st Qu.:2.000 Class :character 1st Qu.: 2911 1st Qu.: 8047
## Median :3.000 Mode :character Median : 4919 Median :14236
## Mean :2.729 Mean : 6503 Mean :14313
## 3rd Qu.:4.000 3rd Qu.: 8379 3rd Qu.:20462
## Max. :4.000 Max. :19999 Max. :26999
## NumCompaniesWorked Over18 OverTime PercentSalaryHike
## Min. :0.000 Length:1470 Length:1470 Min. :11.00
## 1st Qu.:1.000 Class :character Class :character 1st Qu.:12.00
## Median :2.000 Mode :character Mode :character Median :14.00
## Mean :2.693 Mean :15.21
## 3rd Qu.:4.000 3rd Qu.:18.00
## Max. :9.000 Max. :25.00
## PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
## Min. :3.000 Min. :1.000 Min. :80 Min. :0.0000
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:80 1st Qu.:0.0000
## Median :3.000 Median :3.000 Median :80 Median :1.0000
## Mean :3.154 Mean :2.712 Mean :80 Mean :0.7939
## 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:80 3rd Qu.:1.0000
## Max. :4.000 Max. :4.000 Max. :80 Max. :3.0000
## TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## Min. : 0.00 Min. :0.000 Min. :1.000 Min. : 0.000
## 1st Qu.: 6.00 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 3.000
## Median :10.00 Median :3.000 Median :3.000 Median : 5.000
## Mean :11.28 Mean :2.799 Mean :2.761 Mean : 7.008
## 3rd Qu.:15.00 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.: 9.000
## Max. :40.00 Max. :6.000 Max. :4.000 Max. :40.000
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 3.000 Median : 1.000 Median : 3.000
## Mean : 4.229 Mean : 2.188 Mean : 4.123
## 3rd Qu.: 7.000 3rd Qu.: 3.000 3rd Qu.: 7.000
## Max. :18.000 Max. :15.000 Max. :17.000
The first step in understanding potential biases is to explore the data distribution. For OrigoMigo, knowing the demographic and numerical trends in the data is crucial for identifying areas of concern.
Understanding attrition rates is key to assessing whether layoffs were disproportionately impacting certain groups.
ggplot(hrdata, aes(x = Attrition, fill = Attrition)) +
geom_bar() +
ggtitle("Attrition Distribution") +
xlab("Attrition") +
ylab("Count") +
theme_minimal()
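The bar chart shows raw counts; the share of employees in each group is easier to read from a proportion table. The following is a minimal sketch, assuming a data frame with a character `Attrition` column like `hrdata`. The helper name `attrition_rate` and the `toy` data frame are made up for illustration and are not part of the original analysis.

```r
# Hypothetical helper: share of rows in each Attrition category.
attrition_rate <- function(df) {
  prop.table(table(df$Attrition))
}

# Made-up toy data standing in for hrdata (illustration only).
toy <- data.frame(
  Attrition = c("Yes", "No", "No", "No", "Yes", "No", "No", "No")
)

rates <- attrition_rate(toy)
print(round(rates, 3))
```

With the real data, the same helper would be called as `attrition_rate(hrdata)`.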
Analyzing the age distribution of employees who left versus those who stayed helps assess claims of ageism.
ggplot(hrdata, aes(x = Age, fill = Attrition)) +
geom_histogram(position = "dodge", bins = 20) +
ggtitle("Age Distribution by Attrition") +
xlab("Age") +
ylab("Count") +
theme_minimal()
At OrigoMigo, understanding how key variables relate to one another can provide insights into systemic patterns. For instance, does experience correlate with higher income, and do these relationships influence layoffs?
# Select relevant columns
correlation_columns <- c("Age", "DailyRate", "DistanceFromHome", "Education",
"HourlyRate", "MonthlyIncome", "MonthlyRate",
"NumCompaniesWorked", "TotalWorkingYears", "TrainingTimesLastYear")
# Compute correlation matrix
cor_matrix <- cor(hrdata[, correlation_columns], use = "complete.obs")
# Visualize correlation matrix
corrplot(cor_matrix, method = "circle", type = "upper", title = "Correlation Matrix",
tl.col = "black", tl.cex = 0.8, mar = c(0, 0, 2, 0)) # extra top margin so the title is not clipped
A specific claim from a former employee is that layoffs were biased against older employees. This analysis tests whether there is a statistically significant difference in ages between employees who were laid off and those who were not.
H₀: There is no significant difference in the ages of employees who were laid off versus those who were not.
ggplot(hrdata, aes(x = Attrition, y = Age, fill = Attrition)) +
geom_boxplot() +
ggtitle("Age Distribution by Attrition") +
xlab("Attrition") +
ylab("Age") +
theme_minimal()
Interpretation: The boxplot shows that the median age of employees who were laid off is lower than that of those who stayed. Additionally, the range of ages is smaller for those who left, with a notable absence of older employees in this group compared to those who stayed.
# Split data into groups
yes_group <- hrdata[(hrdata$Attrition == "Yes"), "Age"]
no_group <- hrdata[(hrdata$Attrition == "No"), "Age"]
# Perform Welch Two Sample T-Test
t_test_age <- t.test(yes_group, no_group)
print(t_test_age)
##
## Welch Two Sample t-test
##
## data: yes_group and no_group
## t = -5.828, df = 316.93, p-value = 1.38e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.288346 -2.618930
## sample estimates:
## mean of x mean of y
## 33.60759 37.56123
Interpretation: With a p-value of approximately 1.38 × 10^-8, well below the 0.05 threshold, we reject the null hypothesis. The difference in ages between the two groups is statistically significant: laid-off employees averaged 33.6 years, versus 37.6 years for those who stayed. The evidence suggests younger employees were more likely to be laid off, which runs counter to the claim of age bias against older employees.
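A small p-value says the difference is detectable, not how large it is. The t-test output puts the gap in mean age at roughly 4 years (95% confidence interval 2.6 to 5.3 years). A standardized effect size such as Cohen's d expresses that gap relative to the spread of ages. The sketch below is one common way to compute it, using the pooled standard deviation; `cohens_d` is a hypothetical helper and the ages are made up for illustration, so the printed value says nothing about the real data.

```r
# Hypothetical helper: Cohen's d using the pooled standard deviation.
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Made-up ages for illustration only.
left_ages   <- c(24, 29, 31, 35, 28, 33)
stayed_ages <- c(30, 36, 41, 38, 45, 34)

d <- cohens_d(left_ages, stayed_ages)
print(d)
```

With the real data, the call would be `cohens_d(yes_group, no_group)`; a negative value means the first group is younger on average.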
Another accusation is that newer employees were targeted. Assuming employee numbers are assigned sequentially at hiring, newer employees have higher employee numbers, so this test uses EmployeeNumber as a proxy for tenure.
H₀: There is no significant difference in the employee numbers of those who were laid off versus those who were not.
ggplot(hrdata, aes(x = Attrition, y = EmployeeNumber, fill = Attrition)) +
geom_boxplot() +
ggtitle("Employee Number Distribution by Attrition") +
xlab("Attrition") +
ylab("Employee Number") +
theme_minimal()
Interpretation: The boxplot shows that the distribution of employee numbers for both groups is relatively similar. While there are slight differences in the range and median, no dramatic disparities are evident.
# Split data
yes_group_enum <- hrdata[(hrdata$Attrition == "Yes"), "EmployeeNumber"]
no_group_enum <- hrdata[(hrdata$Attrition == "No"), "EmployeeNumber"]
# Perform T-Test
t_test_enum <- t.test(yes_group_enum, no_group_enum)
print(t_test_enum)
##
## Welch Two Sample t-test
##
## data: yes_group_enum and no_group_enum
## t = -0.41725, df = 342.33, p-value = 0.6768
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -98.91087 64.29061
## sample estimates:
## mean of x mean of y
## 1010.346 1027.656
Interpretation: The p-value of 0.677 exceeds 0.05, so we fail to reject the null hypothesis. The data show no statistically significant difference in employee numbers between the two groups and therefore provide no support for the claim that newer employees were disproportionately targeted.
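EmployeeNumber is only an indirect stand-in for tenure. The dataset also contains a YearsAtCompany column, which measures tenure directly, so the same Welch test could be repeated on it as a cross-check. The sketch below assumes a data frame with `Attrition` and `YearsAtCompany` columns; `tenure_test` is a hypothetical helper and the toy values are made up for illustration.

```r
# Hypothetical helper: Welch t-test of a tenure column between Attrition groups.
tenure_test <- function(df, tenure_col = "YearsAtCompany") {
  left   <- df[df$Attrition == "Yes", tenure_col]
  stayed <- df[df$Attrition == "No", tenure_col]
  t.test(left, stayed)  # Welch test by default (var.equal = FALSE)
}

# Made-up toy data standing in for hrdata (illustration only).
toy <- data.frame(
  Attrition      = rep(c("Yes", "No"), each = 6),
  YearsAtCompany = c(1, 2, 0, 3, 5, 2, 4, 7, 10, 6, 3, 8)
)

result <- tenure_test(toy)
print(result)
```

With the real data, the call would be `tenure_test(hrdata)`.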
OrigoMigo wants to better understand salary dynamics to attract and retain talent. This section uses regression models to explore predictors of MonthlyIncome.
age_model <- lm(MonthlyIncome ~ Age, data = hrdata)
simple_model_summary <- summary(age_model)
simple_model_summary
##
## Call:
## lm(formula = MonthlyIncome ~ Age, data = hrdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9990.1 -2592.7 -677.9 1810.5 12540.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2970.67 443.70 -6.695 3.06e-11 ***
## Age 256.57 11.67 21.995 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4084 on 1468 degrees of freedom
## Multiple R-squared: 0.2479, Adjusted R-squared: 0.2473
## F-statistic: 483.8 on 1 and 1468 DF, p-value: < 2.2e-16
Interpretation: The p-value for the Age predictor is approximately 6.67 × 10^-93, far below 0.05, indicating that Age is significantly associated with MonthlyIncome. However, the R-squared value of 0.248 means that only about 25% of the variance in MonthlyIncome is explained by Age alone, which limits the model's predictive power.
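To make the fitted line concrete, the coefficients from the Age-only model output can be plugged into the equation MonthlyIncome ≈ -2970.67 + 256.57 × Age. The sketch below does that arithmetic; `predict_income` is a hypothetical helper written for illustration, and in practice `predict(age_model, newdata = ...)` would be the usual route.

```r
# Coefficients copied from the Age-only model output.
intercept <- -2970.67
age_slope <- 256.57

# Hypothetical helper: predicted MonthlyIncome for a given age.
predict_income <- function(age) {
  intercept + age_slope * age
}

print(predict_income(40))                       # 7292.13
print(predict_income(41) - predict_income(40))  # one extra year adds the slope, 256.57
```

Given the residual standard error of 4084 reported for this model, any single prediction of this kind is rough.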
age_workingyears_model <- lm(MonthlyIncome ~ TotalWorkingYears + Age, data = hrdata)
multi_model_summary <- summary(age_workingyears_model)
multi_model_summary
##
## Call:
## lm(formula = MonthlyIncome ~ TotalWorkingYears + Age, data = hrdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11310.8 -1690.8 -91.4 1428.3 11461.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1978.08 352.36 5.614 2.36e-08 ***
## TotalWorkingYears 489.13 13.65 35.824 < 2e-16 ***
## Age -26.87 11.63 -2.311 0.021 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2984 on 1467 degrees of freedom
## Multiple R-squared: 0.5988, Adjusted R-squared: 0.5983
## F-statistic: 1095 on 2 and 1467 DF, p-value: < 2.2e-16
Interpretation: The p-values for both TotalWorkingYears (approximately 1.84 × 10^-202) and Age (0.021) are below 0.05, so both are statistically significant in the model. The R-squared value of 0.599 shows that this model explains about 60% of the variance in MonthlyIncome, making it substantially more predictive than the Age-only model. Notably, TotalWorkingYears emerges as a much stronger predictor than Age: once experience is accounted for, the Age coefficient shrinks to -26.87 and turns slightly negative.
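Age and TotalWorkingYears are likely to overlap heavily, so it is worth asking how much Age adds once experience is already in the model. One way to check is an F-test between the nested models with `anova()`, alongside the correlation between the two predictors. The sketch below runs on simulated data (all values are made up, and the simulation parameters are arbitrary assumptions), so its output says nothing about the real dataset.

```r
set.seed(1)  # simulated data for illustration only
n <- 200
toy <- data.frame(Age = round(runif(n, 18, 60)))
toy$TotalWorkingYears <- pmax(0, toy$Age - 22 + round(rnorm(n, 0, 3)))
toy$MonthlyIncome <- 1200 + 470 * toy$TotalWorkingYears + rnorm(n, 0, 3000)

reduced <- lm(MonthlyIncome ~ TotalWorkingYears, data = toy)
full    <- lm(MonthlyIncome ~ TotalWorkingYears + Age, data = toy)

# F-test: does adding Age reduce residual variance beyond chance?
comparison <- anova(reduced, full)
print(comparison)

# How strongly the two predictors overlap.
print(cor(toy$Age, toy$TotalWorkingYears))
```

On the real data, the equivalent comparison would use a TotalWorkingYears-only model as `reduced` and `age_workingyears_model` as `full`.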
total_workingyears_model <- lm(MonthlyIncome ~ TotalWorkingYears, data = hrdata)
workingyears_model_summary <- summary(total_workingyears_model)
workingyears_model_summary
##
## Call:
## lm(formula = MonthlyIncome ~ TotalWorkingYears, data = hrdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11271.3 -1750.8 -87.5 1398.6 11539.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1227.94 137.30 8.944 <2e-16 ***
## TotalWorkingYears 467.66 10.02 46.669 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2988 on 1468 degrees of freedom
## Multiple R-squared: 0.5974, Adjusted R-squared: 0.5971
## F-statistic: 2178 on 1 and 1468 DF, p-value: < 2.2e-16
Interpretation: With a p-value of approximately 2.73 × 10^-292 and an R-squared value of 0.597, TotalWorkingYears alone accounts for about 60% of the variance in MonthlyIncome, nearly as much as the two-predictor model (0.599). Of the predictors examined here, it is the strongest individual predictor.
Recommendations:
- Communicate these findings to the HR team to address employee concerns transparently.
- Use TotalWorkingYears as a key feature in salary-related decisions.
- Investigate other factors like job role or satisfaction to refine understanding of attrition trends.
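As a possible starting point for the last recommendation, a categorical factor such as Department or JobRole could be compared against Attrition with a chi-squared test of independence. The sketch below assumes a data frame with `Department` and `Attrition` columns; the counts are made up for illustration and do not reflect the real dataset.

```r
# Made-up counts for illustration only.
toy <- data.frame(
  Department = rep(c("Sales", "Research & Development", "Human Resources"),
                   times = c(60, 100, 40)),
  Attrition  = c(rep(c("Yes", "No"), times = c(15, 45)),
                 rep(c("Yes", "No"), times = c(15, 85)),
                 rep(c("Yes", "No"), times = c(8, 32)))
)

counts <- table(toy$Department, toy$Attrition)
print(prop.table(counts, margin = 1))  # attrition rate within each department

# Chi-squared test of independence between Department and Attrition.
result <- chisq.test(counts)
print(result$p.value)
```

With the real data, the table would be built as `table(hrdata$Department, hrdata$Attrition)`.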