The lecture introduces feature engineering as a manual, expert-driven process that transforms raw data into informative variables to enhance model performance. It covers common transformations such as scaling, binning, weight-of-evidence coding, and trend features, highlighting their role in increasing predictive power. Feature selection is discussed as a complementary task aimed at reducing dimensionality, improving interpretability, and preventing overfitting, using strategies like filters, wrappers, and embedded methods. The lecture emphasizes the practical challenges of redundancy, feature interaction, and model complexity, and advocates hybrid approaches for efficient variable selection.
Feature engineering and selection are central to the data preparation phase in the analytics process. They significantly influence downstream modeling and model performance. While feature engineering involves crafting informative input variables, feature selection aims to identify the most relevant subset of features to avoid overfitting, reduce computational burden, and improve interpretability.
Feature engineering is defined as the manual creation or transformation of variables based on domain knowledge, with the goal of improving a model's predictive power. It is distinct from representation learning, where features are learned automatically (e.g., in deep learning).
Variable Transformation: Apply functions (e.g., log, Box-Cox, Yeo-Johnson) to stabilize variance and approximate normality.
Scaling and Normalization: Includes z-transformation and min-max scaling.
Outlier Treatment: Use z-scores or interquartile ranges to truncate extreme values.
Binning and Categorization: For example, equal-width or equal-frequency binning, or supervised discretization using decision trees.
Aggregations and Trends: Construct features by summarizing time-series data (e.g., min, max, average, trend).
Domain-Specific Ratios: Widely used in finance (e.g., debt-to-income ratio, loan-to-value). A short code sketch illustrating several of these transformations follows this list.
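The following Python sketch illustrates several of these transformations on a toy pandas DataFrame; the column names (income, balance_m1 … balance_m3) are purely illustrative assumptions, not data from the lecture.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: income plus three monthly balances (illustrative columns).
df = pd.DataFrame({
    "income": [2100, 3500, 800, 12500, 4200],
    "balance_m1": [300, 150, 40, 900, 410],
    "balance_m2": [320, 180, 35, 950, 400],
    "balance_m3": [360, 210, 30, 1020, 430],
})

# Variable transformation: log to stabilize variance of a skewed variable.
df["log_income"] = np.log1p(df["income"])

# Scaling: z-transformation and min-max scaling.
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()
df["income_minmax"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

# Outlier treatment: cap values beyond three standard deviations.
df["income_capped"] = df["income"].clip(upper=df["income"].mean() + 3 * df["income"].std())

# Binning: equal-frequency discretization into quartiles.
df["income_bin"] = pd.qcut(df["income"], q=4, labels=False, duplicates="drop")

# Aggregations and a simple trend feature over the monthly balances.
bal = df[["balance_m1", "balance_m2", "balance_m3"]]
df["balance_mean"] = bal.mean(axis=1)
df["balance_trend"] = df["balance_m3"] - df["balance_m1"]  # crude slope proxy
```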
Weight-of-evidence (WOE) coding converts categories into numeric scores based on their relationship with the target variable.
Particularly useful in credit scoring and binary classification tasks.
Advantages: No increase in dimensionality; encodes information in a supervised way.
Caveats: Novel or sparse category levels need to be handled carefully (e.g., via smoothing or a neutral fallback score).
Tree-based projections group categories based on decision tree splits, using leaf assignments as transformed features.
Both WOE and tree-based projections are forms of target encoding, which leverages the target variable to guide the encoding, making them applicable only in supervised settings; a small WOE sketch follows.
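A minimal sketch of WOE coding, assuming a binary target and a single categorical feature; the woe_encode helper and its smoothing term are illustrative choices for handling sparse or novel levels, not a reference implementation from the lecture.

```python
import numpy as np
import pandas as pd

def woe_encode(x: pd.Series, y: pd.Series, smoothing: float = 0.5) -> pd.Series:
    """Map each category of x to its weight of evidence w.r.t. a binary target y.

    One common convention: WOE(cat) = ln( P(cat | y=1) / P(cat | y=0) ),
    with a small smoothing term so sparse levels do not yield infinite values.
    """
    data = pd.DataFrame({"x": x, "y": y})
    stats = data.groupby("x")["y"].agg(events="sum", total="count")
    stats["non_events"] = stats["total"] - stats["events"]
    dist_events = (stats["events"] + smoothing) / (data["y"].sum() + smoothing)
    dist_non_events = (stats["non_events"] + smoothing) / ((data["y"] == 0).sum() + smoothing)
    woe = np.log(dist_events / dist_non_events)
    return x.map(woe).fillna(0.0)  # unseen levels fall back to a neutral score

# Toy example: hypothetical housing-status feature and binary default flag.
x = pd.Series(["rent", "own", "rent", "mortgage", "own", "rent"])
y = pd.Series([1, 0, 1, 0, 0, 1])
print(woe_encode(x, y))
```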
Feature selection reduces the input space by identifying and retaining only informative variables. Its main goals are to:
Simplify models for interpretability.
Improve model performance by removing noise, collinearity, and irrelevant features.
Reduce cost and time of data acquisition and model training.
The task is challenging for several reasons:
Combinatorial nature: Selecting the best subset out of the 2ⁿ possible subsets of n candidate features.
Redundancy: Some features may be relevant but provide overlapping information.
Interaction effects: Some variables are only informative in combination.
Filter methods are univariate and model-agnostic, based on statistical measures.
Suitable for initial screening.
Common filter measures include (a short sketch follows this list):
Pearson correlation (continuous variables)
Chi-square and Cramer's V (categorical variables)
Fisher Score (categorical vs. continuous)
Information Value (IV) derived from WOE coding
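A short sketch of filter-style screening on synthetic data, using scikit-learn's univariate scores as stand-ins for the measures listed above (correlation, chi-square, F/Fisher score); the synthetic data set and ranking rule are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2, f_classif
from sklearn.preprocessing import MinMaxScaler

# Synthetic data standing in for a real scoring data set.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)

# Pearson correlation of each feature with the binary target.
pearson = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# Chi-square requires non-negative inputs, hence the min-max scaling.
chi_scores, _ = chi2(MinMaxScaler().fit_transform(X), y)

# ANOVA F-score, closely related to the Fisher score for a binary target.
f_scores, _ = f_classif(X, y)

# Rank features by absolute correlation for an initial screening.
print("Ranking by |correlation|:", np.argsort(-np.abs(pearson)))
```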
Wrapper methods are model-dependent and assess subsets of features by training candidate models.
Typical strategies are forward selection, backward elimination, and stepwise selection, often guided by validation performance (e.g., AUC, MSE).
Computationally expensive but often more accurate than filters.
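A forward-selection sketch using scikit-learn's SequentialFeatureSelector with cross-validated AUC as the guiding metric; the number of features to keep (5) is an arbitrary illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=15, n_informative=5, random_state=0)

# Forward selection: greedily add the feature that most improves cross-validated AUC.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,
    direction="forward",
    scoring="roc_auc",
    cv=5,
)
selector.fit(X, y)
print("Selected feature indices:", selector.get_support(indices=True))
```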
Embedded methods integrate feature selection into model training.
Regularization techniques such as LASSO (L1), ridge (L2), and elastic net shrink coefficients; the L1 penalty can drive some of them exactly to zero, effectively removing features.
Tree-based models and boosting methods (e.g., XGBoost) offer built-in importance metrics.
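A brief sketch of two embedded approaches on synthetic data: an L1-penalized logistic regression (LASSO-style) and a gradient boosting model with built-in importance scores. The hyperparameters (e.g., C=0.1) are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=15, n_informative=5, random_state=0)
X = StandardScaler().fit_transform(X)  # scaling matters for penalized coefficients

# L1-penalized logistic regression: irrelevant coefficients are driven to zero.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("Features kept by L1 penalty:", np.flatnonzero(lasso.coef_[0]))

# Tree-based ensemble: built-in impurity-based importance scores.
gbm = GradientBoostingClassifier(random_state=0).fit(X, y)
print("Top features by importance:", np.argsort(-gbm.feature_importances_)[:5])
```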
Hybrid approaches are advised: begin with filters to eliminate clearly irrelevant variables, then apply wrapper methods to refine the selection.
Selection thresholds should be chosen using visual diagnostics (e.g., elbow method) or performance metrics.
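One way to operationalize this hybrid advice is a scikit-learn Pipeline that chains a cheap filter with a wrapper; the cut-offs used here (15 and then 5 features) are placeholders that would in practice be set via the diagnostics mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, SequentialFeatureSelector, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, n_features=30, n_informative=5, random_state=0)

# Stage 1: a cheap filter keeps the 15 highest-scoring features.
# Stage 2: a wrapper (forward selection with AUC) refines the shortlist to 5.
pipeline = Pipeline([
    ("filter", SelectKBest(f_classif, k=15)),
    ("wrapper", SequentialFeatureSelector(
        LogisticRegression(max_iter=1000),
        n_features_to_select=5, direction="forward", scoring="roc_auc", cv=5)),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)
print("Shortlisted by filter:", pipeline.named_steps["filter"].get_support(indices=True))
```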
The lecture emphasizes that "better features beat better models"—strong feature engineering and prudent feature selection often yield greater improvements than tuning model hyperparameters. As such, these tasks are essential skills in applied data science and machine learning workflows.
Stefan received a PhD from the University of Hamburg in 2007, where he also completed his habilitation on decision analysis and support using ensemble forecasting models in 2012. He joined Humboldt University of Berlin in 2014, where he heads the Chair of Information Systems at the School of Business and Economics. He serves as an associate editor for the International Journal of Business Analytics, Digital Finance, and the International Journal of Forecasting, and as department editor of Business & Information Systems Engineering (BISE). Stefan has secured substantial amounts of research funding and published several papers in leading international journals and conferences. His research concerns the support of managerial decision-making using quantitative empirical methods. He specializes in applications of (deep) machine learning techniques in the broad scope of marketing and risk analytics. Stefan actively participates in knowledge transfer and consulting projects with industry partners, from start-up companies to global players and not-for-profit organizations.