Introduction

At OrigoMigo Inc., recent layoffs have raised concerns among employees about potential biases in the selection process. This analysis seeks to explore the HR data to identify patterns or biases, address specific accusations, and provide actionable insights. The ultimate goal is to reassure stakeholders and help OrigoMigo refine its HR practices.

This analysis uses the fictional IBM HR Attrition dataset from Kaggle (https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset). The dataset was created by IBM data scientists to mimic real-world HR scenarios. For this analysis, attrition is interpreted as employees being laid off.

Load Data

# Load packages used by the plots below
library(ggplot2)
library(corrplot)

# Load dataset
hrdata <- read.csv("HR-Employee-Attrition.csv")

# Preview data
head(hrdata)

##   Age Attrition    BusinessTravel DailyRate             Department
## 1  41       Yes     Travel_Rarely      1102                  Sales
## 2  49        No Travel_Frequently       279 Research & Development
## 3  37       Yes     Travel_Rarely      1373 Research & Development
## 4  33        No Travel_Frequently      1392 Research & Development
## 5  27        No     Travel_Rarely       591 Research & Development
## 6  32        No Travel_Frequently      1005 Research & Development
##   DistanceFromHome Education EducationField EmployeeCount EmployeeNumber
## 1                1         2  Life Sciences             1              1
## 2                8         1  Life Sciences             1              2
## 3                2         2          Other             1              4
## 4                3         4  Life Sciences             1              5
## 5                2         1        Medical             1              7
## 6                2         2  Life Sciences             1              8
##   EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1                       2 Female         94              3        2
## 2                       3   Male         61              2        2
## 3                       4   Male         92              2        1
## 4                       4 Female         56              3        1
## 5                       1   Male         40              3        1
## 6                       4   Male         79              3        1
##                 JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
## 1       Sales Executive               4        Single          5993       19479
## 2    Research Scientist               2       Married          5130       24907
## 3 Laboratory Technician               3        Single          2090        2396
## 4    Research Scientist               3       Married          2909       23159
## 5 Laboratory Technician               2       Married          3468       16632
## 6 Laboratory Technician               4        Single          3068       11864
##   NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating
## 1                  8      Y      Yes                11                 3
## 2                  1      Y       No                23                 4
## 3                  6      Y      Yes                15                 3
## 4                  1      Y      Yes                11                 3
## 5                  9      Y       No                12                 3
## 6                  0      Y       No                13                 3
##   RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
## 1                        1            80                0                 8
## 2                        4            80                1                10
## 3                        2            80                0                 7
## 4                        3            80                0                 8
## 5                        4            80                1                 6
## 6                        3            80                0                 8
##   TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 1                     0               1              6                  4
## 2                     3               3             10                  7
## 3                     3               3              0                  0
## 4                     3               3              8                  7
## 5                     3               3              2                  2
## 6                     2               2              7                  7
##   YearsSinceLastPromotion YearsWithCurrManager
## 1                       0                    5
## 2                       1                    7
## 3                       0                    0
## 4                       3                    0
## 5                       2                    2
## 6                       3                    6

summary(hrdata)

##       Age         Attrition         BusinessTravel       DailyRate     
##  Min.   :18.00   Length:1470        Length:1470        Min.   : 102.0  
##  1st Qu.:30.00   Class :character   Class :character   1st Qu.: 465.0  
##  Median :36.00   Mode  :character   Mode  :character   Median : 802.0  
##  Mean   :36.92                                         Mean   : 802.5  
##  3rd Qu.:43.00                                         3rd Qu.:1157.0  
##  Max.   :60.00                                         Max.   :1499.0  
##   Department        DistanceFromHome   Education     EducationField    
##  Length:1470        Min.   : 1.000   Min.   :1.000   Length:1470       
##  Class :character   1st Qu.: 2.000   1st Qu.:2.000   Class :character  
##  Mode  :character   Median : 7.000   Median :3.000   Mode  :character  
##                     Mean   : 9.193   Mean   :2.913                     
##                     3rd Qu.:14.000   3rd Qu.:4.000                     
##                     Max.   :29.000   Max.   :5.000                     
##  EmployeeCount EmployeeNumber   EnvironmentSatisfaction    Gender         
##  Min.   :1     Min.   :   1.0   Min.   :1.000           Length:1470       
##  1st Qu.:1     1st Qu.: 491.2   1st Qu.:2.000           Class :character  
##  Median :1     Median :1020.5   Median :3.000           Mode  :character  
##  Mean   :1     Mean   :1024.9   Mean   :2.722                             
##  3rd Qu.:1     3rd Qu.:1555.8   3rd Qu.:4.000                             
##  Max.   :1     Max.   :2068.0   Max.   :4.000                             
##    HourlyRate     JobInvolvement    JobLevel       JobRole         
##  Min.   : 30.00   Min.   :1.00   Min.   :1.000   Length:1470       
##  1st Qu.: 48.00   1st Qu.:2.00   1st Qu.:1.000   Class :character  
##  Median : 66.00   Median :3.00   Median :2.000   Mode  :character  
##  Mean   : 65.89   Mean   :2.73   Mean   :2.064                     
##  3rd Qu.: 83.75   3rd Qu.:3.00   3rd Qu.:3.000                     
##  Max.   :100.00   Max.   :4.00   Max.   :5.000                     
##  JobSatisfaction MaritalStatus      MonthlyIncome    MonthlyRate   
##  Min.   :1.000   Length:1470        Min.   : 1009   Min.   : 2094  
##  1st Qu.:2.000   Class :character   1st Qu.: 2911   1st Qu.: 8047  
##  Median :3.000   Mode  :character   Median : 4919   Median :14236  
##  Mean   :2.729                      Mean   : 6503   Mean   :14313  
##  3rd Qu.:4.000                      3rd Qu.: 8379   3rd Qu.:20462  
##  Max.   :4.000                      Max.   :19999   Max.   :26999  
##  NumCompaniesWorked    Over18            OverTime         PercentSalaryHike
##  Min.   :0.000      Length:1470        Length:1470        Min.   :11.00    
##  1st Qu.:1.000      Class :character   Class :character   1st Qu.:12.00    
##  Median :2.000      Mode  :character   Mode  :character   Median :14.00    
##  Mean   :2.693                                            Mean   :15.21    
##  3rd Qu.:4.000                                            3rd Qu.:18.00    
##  Max.   :9.000                                            Max.   :25.00    
##  PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
##  Min.   :3.000     Min.   :1.000            Min.   :80    Min.   :0.0000  
##  1st Qu.:3.000     1st Qu.:2.000            1st Qu.:80    1st Qu.:0.0000  
##  Median :3.000     Median :3.000            Median :80    Median :1.0000  
##  Mean   :3.154     Mean   :2.712            Mean   :80    Mean   :0.7939  
##  3rd Qu.:3.000     3rd Qu.:4.000            3rd Qu.:80    3rd Qu.:1.0000  
##  Max.   :4.000     Max.   :4.000            Max.   :80    Max.   :3.0000  
##  TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany  
##  Min.   : 0.00     Min.   :0.000         Min.   :1.000   Min.   : 0.000  
##  1st Qu.: 6.00     1st Qu.:2.000         1st Qu.:2.000   1st Qu.: 3.000  
##  Median :10.00     Median :3.000         Median :3.000   Median : 5.000  
##  Mean   :11.28     Mean   :2.799         Mean   :2.761   Mean   : 7.008  
##  3rd Qu.:15.00     3rd Qu.:3.000         3rd Qu.:3.000   3rd Qu.: 9.000  
##  Max.   :40.00     Max.   :6.000         Max.   :4.000   Max.   :40.000  
##  YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
##  Min.   : 0.000     Min.   : 0.000          Min.   : 0.000      
##  1st Qu.: 2.000     1st Qu.: 0.000          1st Qu.: 2.000      
##  Median : 3.000     Median : 1.000          Median : 3.000      
##  Mean   : 4.229     Mean   : 2.188          Mean   : 4.123      
##  3rd Qu.: 7.000     3rd Qu.: 3.000          3rd Qu.: 7.000      
##  Max.   :18.000     Max.   :15.000          Max.   :17.000

Exploratory Data Analysis (EDA)

Business Context

The first step in understanding potential biases is to explore the data distribution. For OrigoMigo, knowing the demographic and numerical trends in the data is crucial for identifying areas of concern.

Attrition Distribution

Understanding the overall attrition rate is the baseline for assessing whether layoffs disproportionately affected certain groups.

ggplot(hrdata, aes(x = Attrition, fill = Attrition)) +
  geom_bar() +
  ggtitle("Attrition Distribution") +
  xlab("Attrition") +
  ylab("Count") +
  theme_minimal()
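
The bar chart shows raw counts; a rate is often easier to compare across groups. The sketch below shows one way to get both from a column like Attrition. The `toy` data frame is a hypothetical stand-in, not the HR dataset, so the numbers it prints say nothing about OrigoMigo.

```r
# Hypothetical toy data standing in for hrdata$Attrition
toy <- data.frame(
  Attrition = c("Yes", "No", "No", "No", "Yes", "No", "No", "No", "No", "No")
)

# Counts per group, then the same counts as proportions
attrition_counts <- table(toy$Attrition)
attrition_rates <- prop.table(attrition_counts)

print(attrition_counts)
print(round(attrition_rates, 3))
```

Replacing `toy` with `hrdata` should give the corresponding figures for the full dataset.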

Age Distribution by Attrition

Analyzing the age distribution of employees who left versus those who stayed helps assess claims of ageism.

ggplot(hrdata, aes(x = Age, fill = Attrition)) +
  geom_histogram(position = "dodge", bins = 20) +
  ggtitle("Age Distribution by Attrition") +
  xlab("Age") +
  ylab("Count") +
  theme_minimal()
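
Because the two groups differ greatly in size, side-by-side counts can be hard to read. One possible complement is the attrition rate within age bands. The sketch below uses a hypothetical `toy` data frame, and the band boundaries are an illustrative assumption rather than anything defined in the dataset.

```r
# Hypothetical toy data; column names mirror hrdata
toy <- data.frame(
  Age = c(22, 27, 29, 33, 38, 41, 45, 52, 58, 60),
  Attrition = c("Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "Yes", "No")
)

# Bin ages into bands; the break points are an illustrative choice
toy$AgeBand <- cut(toy$Age, breaks = c(17, 30, 45, 60),
                   labels = c("18-30", "31-45", "46-60"))

# Share of each band with Attrition == "Yes"
rate_by_band <- tapply(toy$Attrition == "Yes", toy$AgeBand, mean)
print(round(rate_by_band, 2))
```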

Correlation Analysis

Business Context

At OrigoMigo, understanding how key variables relate to one another can provide insights into systemic patterns. For instance, does experience correlate with higher income, and do these relationships influence layoffs?

# Select relevant columns
correlation_columns <- c("Age", "DailyRate", "DistanceFromHome", "Education", 
                         "HourlyRate", "MonthlyIncome", "MonthlyRate", 
                         "NumCompaniesWorked", "TotalWorkingYears", "TrainingTimesLastYear")

# Compute correlation matrix
cor_matrix <- cor(hrdata[, correlation_columns], use = "complete.obs")

# Visualize correlation matrix (mar leaves room so the title is not clipped)
corrplot(cor_matrix, method = "circle", type = "upper", title = "Correlation Matrix",
         tl.col = "black", tl.cex = 0.8, mar = c(0, 0, 2, 0))
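
A correlation plot is good for a first look, but a ranked list of pairs can make the strongest relationships easier to report. The following is a minimal sketch on a hypothetical `toy` data frame (values invented for illustration); the same steps should apply to `cor_matrix`.

```r
# Hypothetical toy data; column names mirror hrdata
toy <- data.frame(
  Age               = c(25, 32, 41, 48, 55, 29),
  TotalWorkingYears = c(3, 9, 18, 25, 30, 6),
  DistanceFromHome  = c(10, 2, 7, 1, 12, 5)
)

cor_toy <- cor(toy)

# Keep each pair once by blanking the diagonal and lower triangle
cor_toy[lower.tri(cor_toy, diag = TRUE)] <- NA

# Reshape to long form and sort by absolute correlation
cor_pairs <- as.data.frame(as.table(cor_toy), stringsAsFactors = FALSE)
names(cor_pairs) <- c("var1", "var2", "r")
cor_pairs <- cor_pairs[!is.na(cor_pairs$r), ]
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs)
```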

Statistical Testing

Testing Age and Attrition

Business Context

A specific claim from a former employee is that layoffs were biased against older employees. This analysis tests whether there is a statistically significant difference in ages between employees who were laid off and those who were not.

Null Hypothesis

H₀: The mean age of employees who were laid off is equal to the mean age of those who were not.

Boxplot

ggplot(hrdata, aes(x = Attrition, y = Age, fill = Attrition)) +
  geom_boxplot() +
  ggtitle("Age Distribution by Attrition") +
  xlab("Attrition") +
  ylab("Age") +
  theme_minimal()

Interpretation: The boxplot shows that the median age of employees who were laid off is lower than that of those who stayed. Additionally, the range of ages is smaller for those who left, with a notable absence of older employees in this group compared to those who stayed.

T-Test

# Split data into groups
yes_group <- hrdata[(hrdata$Attrition == "Yes"), "Age"]
no_group <- hrdata[(hrdata$Attrition == "No"), "Age"]

# Perform Welch Two Sample T-Test
t_test_age <- t.test(yes_group, no_group)
print(t_test_age)

## 
##  Welch Two Sample t-test
## 
## data:  yes_group and no_group
## t = -5.828, df = 316.93, p-value = 1.38e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.288346 -2.618930
## sample estimates:
## mean of x mean of y 
##  33.60759  37.56123

Interpretation: With a p-value of 1.38 × 10⁻⁸, far below the 0.05 threshold, we reject the null hypothesis. There is a statistically significant difference in mean age between the two groups: employees who were laid off averaged 33.6 years, compared with 37.6 years for those who stayed. Younger employees were therefore more likely to be laid off, which runs counter to the claim of age bias against older employees.
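
To make the test less of a black box, the sketch below recomputes the Welch statistic and its degrees of freedom by hand and compares them with `t.test()`. The two age vectors are hypothetical toy values, not drawn from the HR data.

```r
# Hypothetical toy ages for the two groups
yes_toy <- c(24, 29, 31, 35, 40)
no_toy  <- c(30, 36, 38, 41, 45, 52)

# Welch statistic: difference in means over the unpooled standard error
se_sq_yes <- var(yes_toy) / length(yes_toy)
se_sq_no  <- var(no_toy) / length(no_toy)
t_manual  <- (mean(yes_toy) - mean(no_toy)) / sqrt(se_sq_yes + se_sq_no)

# Welch-Satterthwaite degrees of freedom
df_manual <- (se_sq_yes + se_sq_no)^2 /
  (se_sq_yes^2 / (length(yes_toy) - 1) + se_sq_no^2 / (length(no_toy) - 1))

# Built-in test for comparison (Welch is the default for t.test)
welch <- t.test(yes_toy, no_toy)
print(c(t_manual = t_manual, t_builtin = unname(welch$statistic)))
print(c(df_manual = df_manual, df_builtin = unname(welch$parameter)))
```

The non-integer degrees of freedom in the output above (316.93) come from this same approximation, which does not assume equal variances in the two groups.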

Testing Employee Number and Attrition

Business Context

Another accusation is that newer employees (indicated by higher employee numbers, assuming numbers are assigned sequentially at hiring) were targeted. This test uses employee number as a proxy for tenure to examine whether it was a factor in layoffs.

Null Hypothesis

H₀: The mean employee number of those who were laid off is equal to the mean employee number of those who were not.

Boxplot

ggplot(hrdata, aes(x = Attrition, y = EmployeeNumber, fill = Attrition)) +
  geom_boxplot() +
  ggtitle("Employee Number Distribution by Attrition") +
  xlab("Attrition") +
  ylab("Employee Number") +
  theme_minimal()

Interpretation: The boxplot shows that the distribution of employee numbers for both groups is relatively similar. While there are slight differences in the range and median, no dramatic disparities are evident.

T-Test

# Split data
yes_group_enum <- hrdata[(hrdata$Attrition == "Yes"), "EmployeeNumber"]
no_group_enum <- hrdata[(hrdata$Attrition == "No"), "EmployeeNumber"]

# Perform T-Test
t_test_enum <- t.test(yes_group_enum, no_group_enum)
print(t_test_enum)

## 
##  Welch Two Sample t-test
## 
## data:  yes_group_enum and no_group_enum
## t = -0.41725, df = 342.33, p-value = 0.6768
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -98.91087  64.29061
## sample estimates:
## mean of x mean of y 
##  1010.346  1027.656

Interpretation: The p-value of 0.6768 exceeds 0.05, so we fail to reject the null hypothesis. There is no statistically significant difference in employee numbers between the two groups, so the data provide no evidence that newer employees were disproportionately targeted.
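
EmployeeNumber is at best an indirect proxy for tenure. The dataset also includes a YearsAtCompany column (visible in the preview at the top), which measures tenure directly, so the same test could plausibly be repeated on it. The sketch below shows the shape of that check using the formula interface of `t.test()`. The `toy` data frame is hypothetical, and no result for the real data is implied.

```r
# Hypothetical toy data; YearsAtCompany and Attrition are columns in hrdata
toy <- data.frame(
  Attrition      = c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"),
  YearsAtCompany = c(1, 2, 5, 10, 3, 4, 6, 8, 9, 15)
)

# Formula interface: compare mean tenure across the two Attrition groups
tenure_test <- t.test(YearsAtCompany ~ Attrition, data = toy)
print(tenure_test)
```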

Predictive Modeling

Business Context

OrigoMigo wants to better understand salary dynamics to attract and retain talent. This section uses regression models to explore predictors of MonthlyIncome.

Simple Linear Regression: Age and Monthly Income

age_model <- lm(MonthlyIncome ~ Age, data = hrdata)
simple_model_summary <- summary(age_model)
simple_model_summary

## 
## Call:
## lm(formula = MonthlyIncome ~ Age, data = hrdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9990.1 -2592.7  -677.9  1810.5 12540.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2970.67     443.70  -6.695 3.06e-11 ***
## Age           256.57      11.67  21.995  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4084 on 1468 degrees of freedom
## Multiple R-squared:  0.2479, Adjusted R-squared:  0.2473 
## F-statistic: 483.8 on 1 and 1468 DF,  p-value: < 2.2e-16

Interpretation: The p-value for the Age predictor is 6.67 × 10⁻⁹³, far below 0.05, indicating that Age is significantly associated with MonthlyIncome: each additional year of age corresponds to about 257 more in monthly income on average. However, the R-squared value of 0.2479 means that only about 25% of the variance in MonthlyIncome is explained by Age alone, which limits the model's predictive power.
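
As a worked example of what the fitted line implies, the sketch below plugs a few ages into the equation using the coefficients reported in the summary above. The ages are chosen purely for illustration.

```r
# Coefficients copied from the lm() summary above
intercept <- -2970.67
slope_age <- 256.57

# Ages chosen only for illustration
ages <- c(30, 40, 50)
predicted_income <- intercept + slope_age * ages

print(data.frame(Age = ages, PredictedMonthlyIncome = predicted_income))
```

With a residual standard error of 4084, individual incomes can sit far from these fitted values, which is consistent with the low R-squared.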

Multiple Linear Regression: Age and TotalWorkingYears

age_workingyears_model <- lm(MonthlyIncome ~ TotalWorkingYears + Age, data = hrdata)
multi_model_summary <- summary(age_workingyears_model)
multi_model_summary

## 
## Call:
## lm(formula = MonthlyIncome ~ TotalWorkingYears + Age, data = hrdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11310.8  -1690.8    -91.4   1428.3  11461.5 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1978.08     352.36   5.614 2.36e-08 ***
## TotalWorkingYears   489.13      13.65  35.824  < 2e-16 ***
## Age                 -26.87      11.63  -2.311    0.021 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2984 on 1467 degrees of freedom
## Multiple R-squared:  0.5988, Adjusted R-squared:  0.5983 
## F-statistic:  1095 on 2 and 1467 DF,  p-value: < 2.2e-16

Interpretation: The p-values for both TotalWorkingYears (1.84 × 10⁻²⁰²) and Age (0.021) are below 0.05, so both terms are statistically significant. The R-squared value of 0.5988 shows that this model explains about 60% of the variance in MonthlyIncome, far more than the 25% explained by the Age-only model. TotalWorkingYears is clearly the stronger predictor: once it is in the model, the Age coefficient shrinks to -26.87 and turns negative, which suggests that most of the apparent effect of Age in the simple model reflected accumulated experience.
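
One way to judge whether a second predictor earns its place is to compare nested models with `anova()` and to check how strongly the predictors correlate with each other. The sketch below does this on simulated data. The data-generating values are assumptions made up for illustration, so the printed numbers are not results for the HR dataset.

```r
set.seed(1)  # simulated data, not the HR dataset

n <- 200
age <- round(runif(n, 18, 60))
# Working years grow with age, so the two predictors are correlated
working_years <- pmax(0, age - 18 - rpois(n, 4))
income <- 1200 + 450 * working_years + rnorm(n, sd = 2500)
sim <- data.frame(Age = age, TotalWorkingYears = working_years,
                  MonthlyIncome = income)

small_model <- lm(MonthlyIncome ~ TotalWorkingYears, data = sim)
full_model  <- lm(MonthlyIncome ~ TotalWorkingYears + Age, data = sim)

# F-test for whether adding Age improves on the smaller model
comparison <- anova(small_model, full_model)
print(comparison)

# Correlation between the predictors; high values make coefficients unstable
predictor_cor <- cor(sim$Age, sim$TotalWorkingYears)
print(predictor_cor)
```

When two predictors are this closely related, the sign and size of the weaker one's coefficient can shift considerably, so it is safer to read the pair together than to interpret each in isolation.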

Single Predictor: TotalWorkingYears

total_workingyears_model <- lm(MonthlyIncome ~ TotalWorkingYears, data = hrdata)
workingyears_model_summary <- summary(total_workingyears_model)
workingyears_model_summary

## 
## Call:
## lm(formula = MonthlyIncome ~ TotalWorkingYears, data = hrdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11271.3  -1750.8    -87.5   1398.6  11539.5 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1227.94     137.30   8.944   <2e-16 ***
## TotalWorkingYears   467.66      10.02  46.669   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2988 on 1468 degrees of freedom
## Multiple R-squared:  0.5974, Adjusted R-squared:  0.5971 
## F-statistic:  2178 on 1 and 1468 DF,  p-value: < 2.2e-16

Interpretation: With a p-value of 2.73 × 10⁻²⁹² and an R-squared value of 0.5974, TotalWorkingYears alone accounts for about 60% of the variance in MonthlyIncome, nearly as much as the two-predictor model (0.5988). Of the two predictors examined here, it is by far the stronger.
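
The three R-squared values can be compared directly to see how much each predictor adds on top of the other. The short calculation below uses only the figures reported in the summaries above.

```r
# R-squared values copied from the model summaries above
r2_age_only      <- 0.2478592
r2_working_years <- 0.597364
r2_both          <- 0.5988244

# Extra variance explained by adding Age to TotalWorkingYears
gain_from_age <- r2_both - r2_working_years
# Extra variance explained by adding TotalWorkingYears to Age
gain_from_working_years <- r2_both - r2_age_only

print(round(c(gain_from_age = gain_from_age,
              gain_from_working_years = gain_from_working_years), 4))
```

Adding Age raises R-squared by roughly 0.0015, while adding TotalWorkingYears raises it by roughly 0.35.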

Conclusions and Recommendations

Age and Attrition: Employees who were laid off were significantly younger on average (33.6 vs. 37.6 years), so the data do not support claims of ageism against older employees.
Employee Tenure: No significant difference in employee numbers was observed between groups, so the data provide no evidence that newer employees were targeted.
Predictive Insights: TotalWorkingYears is by far the stronger predictor of MonthlyIncome; Age adds very little once TotalWorkingYears is in the model.

Recommendations:
- Communicate these findings to the HR team to address employee concerns transparently.
- Use TotalWorkingYears as a key feature in salary-related decisions.
- Investigate other factors like job role or satisfaction to refine understanding of attrition trends.

Caveats and Assumptions

Attrition Definition: Attrition is treated as equivalent to employees being laid off. In reality, attrition may also include voluntary resignations.
Fictional Dataset: The dataset is not real and was created for educational purposes. Thus, findings should not be generalized to real-world companies.
Legal Disclaimer: This analysis is purely academic and does not account for legal frameworks surrounding layoffs, such as age discrimination laws.
Simplified Relationships: The analysis assumes linear relationships between variables for modeling purposes, which may oversimplify real-world complexities.

Future Work

Analyze trends over time to monitor attrition patterns.
Use clustering to group employees based on satisfaction and performance metrics.
Explore more advanced machine learning models for attrition prediction.

The Numbers Behind Layoffs: An HR Attrition Analysis
