The lecture introduces feature engineering as a manual, expert-driven process that transforms raw data into informative variables to enhance model performance. It covers common transformations such as scaling, binning, weight-of-evidence coding, and trend features, highlighting their role in increasing predictive power. Feature selection is discussed as a complementary task aimed at reducing dimensionality, improving interpretability, and preventing overfitting, using strategies like filters, wrappers, and embedded methods. The lecture emphasizes the practical challenges of redundancy, feature interaction, and model complexity, and advocates hybrid approaches for efficient variable selection.
Feature engineering and selection are central to the data preparation phase in the analytics process. They significantly influence downstream modeling and model performance. While feature engineering involves crafting informative input variables, feature selection aims to identify the most relevant subset of features to avoid overfitting, reduce computational burden, and improve interpretability.
Feature engineering is defined as the manual creation or transformation of variables based on domain knowledge, with the goal of improving a model's predictive power. It is distinct from representation learning, where features are learned automatically (e.g., in deep learning).
Variable Transformation: Apply functions (e.g., log, Box-Cox, Yeo-Johnson) to stabilize variance and approximate normality.
Scaling and Normalization: Includes z-transformation and min-max scaling.
Outlier Treatment: Use z-scores or interquartile ranges to truncate extreme values.
Binning and Categorization: For example, equal-width or equal-frequency binning, or supervised discretization using decision trees.
Aggregations and Trends: Construct features by summarizing time-series data (e.g., min, max, average, trend).
Domain-Specific Ratios: Widely used in finance (e.g., debt-to-income ratio, loan-to-value). A short code sketch illustrating several of these transformations follows this list.
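The following Python sketch illustrates several of these transformations on a toy pandas DataFrame; the column names (income, balance_m1 … balance_m3) are purely illustrative assumptions, not data from the lecture.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: income plus three monthly balances (illustrative columns).
df = pd.DataFrame({
    "income": [2100, 3500, 800, 12500, 4200],
    "balance_m1": [300, 150, 40, 900, 410],
    "balance_m2": [320, 180, 35, 950, 400],
    "balance_m3": [360, 210, 30, 1020, 430],
})

# Variable transformation: log to stabilize variance of a skewed variable.
df["log_income"] = np.log1p(df["income"])

# Scaling: z-transformation and min-max scaling.
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()
df["income_minmax"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

# Outlier treatment: cap values beyond three standard deviations.
df["income_capped"] = df["income"].clip(upper=df["income"].mean() + 3 * df["income"].std())

# Binning: equal-frequency discretization into quartiles.
df["income_bin"] = pd.qcut(df["income"], q=4, labels=False, duplicates="drop")

# Aggregations and a simple trend feature over the monthly balances.
bal = df[["balance_m1", "balance_m2", "balance_m3"]]
df["balance_mean"] = bal.mean(axis=1)
df["balance_trend"] = df["balance_m3"] - df["balance_m1"]  # crude slope proxy
```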
Weight-of-evidence (WOE) coding converts categories into numeric scores based on their relationship with the target variable.
Particularly useful in credit scoring and binary classification tasks.
Advantages: No increase in dimensionality; encodes information in a supervised way.
Caveats: Novel or sparse category levels need to be handled carefully (e.g., via smoothing or a neutral fallback score).
Tree-based projections group categories based on decision tree splits, using leaf assignments as transformed features.
Both WOE and tree-based projections are forms of target encoding, which leverages the target variable to guide the encoding, making them applicable only in supervised settings; a small WOE sketch follows.
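A minimal sketch of WOE coding, assuming a binary target and a single categorical feature; the woe_encode helper and its smoothing term are illustrative choices for handling sparse or novel levels, not a reference implementation from the lecture.

```python
import numpy as np
import pandas as pd

def woe_encode(x: pd.Series, y: pd.Series, smoothing: float = 0.5) -> pd.Series:
    """Map each category of x to its weight of evidence w.r.t. a binary target y.

    One common convention: WOE(cat) = ln( P(cat | y=1) / P(cat | y=0) ),
    with a small smoothing term so sparse levels do not yield infinite values.
    """
    data = pd.DataFrame({"x": x, "y": y})
    stats = data.groupby("x")["y"].agg(events="sum", total="count")
    stats["non_events"] = stats["total"] - stats["events"]
    dist_events = (stats["events"] + smoothing) / (data["y"].sum() + smoothing)
    dist_non_events = (stats["non_events"] + smoothing) / ((data["y"] == 0).sum() + smoothing)
    woe = np.log(dist_events / dist_non_events)
    return x.map(woe).fillna(0.0)  # unseen levels fall back to a neutral score

# Toy example: hypothetical housing-status feature and binary default flag.
x = pd.Series(["rent", "own", "rent", "mortgage", "own", "rent"])
y = pd.Series([1, 0, 1, 0, 0, 1])
print(woe_encode(x, y))
```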
Feature selection reduces the input space by identifying and retaining only informative variables. Its main goals are to:
Simplify models for interpretability.
Improve model performance by removing noise, collinearity, and irrelevant features.
Reduce cost and time of data acquisition and model training.
The task is challenging for several reasons:
Combinatorial nature: Selecting the best subset out of the 2ⁿ possible subsets of n candidate features.
Redundancy: Some features may be relevant but provide overlapping information.
Interaction effects: Some variables are only informative in combination.
Filter methods are univariate and model-agnostic, based on statistical measures.
Suitable for initial screening.
Common filter measures include (a short sketch follows this list):
Pearson correlation (continuous variables)
Chi-square and Cramer's V (categorical variables)
Fisher Score (categorical vs. continuous)
Information Value (IV) derived from WOE coding
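A short sketch of filter-style screening on synthetic data, using scikit-learn's univariate scores as stand-ins for the measures listed above (correlation, chi-square, F/Fisher score); the synthetic data set and ranking rule are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2, f_classif
from sklearn.preprocessing import MinMaxScaler

# Synthetic data standing in for a real scoring data set.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)

# Pearson correlation of each feature with the binary target.
pearson = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# Chi-square requires non-negative inputs, hence the min-max scaling.
chi_scores, _ = chi2(MinMaxScaler().fit_transform(X), y)

# ANOVA F-score, closely related to the Fisher score for a binary target.
f_scores, _ = f_classif(X, y)

# Rank features by absolute correlation for an initial screening.
print("Ranking by |correlation|:", np.argsort(-np.abs(pearson)))
```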
Wrapper methods are model-dependent and assess subsets of features by training candidate models.
Typical strategies are forward selection, backward elimination, and stepwise selection, often guided by validation performance (e.g., AUC, MSE).
Computationally expensive but often more accurate than filters.
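A forward-selection sketch using scikit-learn's SequentialFeatureSelector with cross-validated AUC as the guiding metric; the number of features to keep (5) is an arbitrary illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=15, n_informative=5, random_state=0)

# Forward selection: greedily add the feature that most improves cross-validated AUC.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,
    direction="forward",
    scoring="roc_auc",
    cv=5,
)
selector.fit(X, y)
print("Selected feature indices:", selector.get_support(indices=True))
```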
Embedded methods integrate feature selection into model training.
Regularization techniques such as LASSO (L1), ridge (L2), and elastic net shrink coefficients; the L1 penalty can drive some of them exactly to zero, effectively removing features.
Tree-based models and boosting methods (e.g., XGBoost) offer built-in importance metrics.
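A brief sketch of two embedded approaches on synthetic data: an L1-penalized logistic regression (LASSO-style) and a gradient boosting model with built-in importance scores. The hyperparameters (e.g., C=0.1) are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=15, n_informative=5, random_state=0)
X = StandardScaler().fit_transform(X)  # scaling matters for penalized coefficients

# L1-penalized logistic regression: irrelevant coefficients are driven to zero.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("Features kept by L1 penalty:", np.flatnonzero(lasso.coef_[0]))

# Tree-based ensemble: built-in impurity-based importance scores.
gbm = GradientBoostingClassifier(random_state=0).fit(X, y)
print("Top features by importance:", np.argsort(-gbm.feature_importances_)[:5])
```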
Hybrid approaches are advised: begin with filters to eliminate clearly irrelevant variables, then apply wrapper methods to refine the selection.
Selection thresholds should be chosen using visual diagnostics (e.g., elbow method) or performance metrics.
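One way to operationalize this hybrid advice is a scikit-learn Pipeline that chains a cheap filter with a wrapper; the cut-offs used here (15 and then 5 features) are placeholders that would in practice be set via the diagnostics mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, SequentialFeatureSelector, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, n_features=30, n_informative=5, random_state=0)

# Stage 1: a cheap filter keeps the 15 highest-scoring features.
# Stage 2: a wrapper (forward selection with AUC) refines the shortlist to 5.
pipeline = Pipeline([
    ("filter", SelectKBest(f_classif, k=15)),
    ("wrapper", SequentialFeatureSelector(
        LogisticRegression(max_iter=1000),
        n_features_to_select=5, direction="forward", scoring="roc_auc", cv=5)),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)
print("Shortlisted by filter:", pipeline.named_steps["filter"].get_support(indices=True))
```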
The lecture emphasizes that "better features beat better models"—strong feature engineering and prudent feature selection often yield greater improvements than tuning model hyperparameters. As such, these tasks are essential skills in applied data science and machine learning workflows.
Stefan received a PhD from the University of Hamburg in 2007, where he also completed his habilitation on decision analysis and support using ensemble forecasting models in 2012. He joined Humboldt University of Berlin in 2014, where he heads the Chair of Information Systems at the School of Business and Economics. He serves as an associate editor for the International Journal of Business Analytics, Digital Finance, and the International Journal of Forecasting, and as department editor of Business & Information Systems Engineering (BISE). Stefan has secured substantial amounts of research funding and published several papers in leading international journals and conferences. His research concerns the support of managerial decision-making using quantitative empirical methods. He specializes in applications of (deep) machine learning techniques in the broad scope of marketing and risk analytics. Stefan actively participates in knowledge transfer and consulting projects with industry partners, from start-up companies to global players and not-for-profit organizations.