Predicting Top-10 Songs: A Logistic + LASSO Walkthrough in Python

From music features to hit probabilities

predictive analytics
music
classification
ROC
Python
Author

A. Srikanth

Published

December 5, 2025

Project Spotlight

Context

In this project, I treated “hit prediction” very literally: each row in the dataset is a single commercial release, with a binary flag (1 if it ever reached the Top 10 of the charts, 0 otherwise). Those Top 10 songs are the hits; everything else is a non-hit. The modeling problem is to take the information we have at the song level and estimate the probability that a new track would end up on the hits side of that line.

The inputs are a mix of metadata and audio features that you could imagine pulling from a streaming or analysis service: year (year), artist (artistname), tempo (tempo, with corresponding confidence tempo_confidence), loudness (loudness), time signature (timesignature, with corresponding confidence timesignature_confidence), key (key, with corresponding confidence key_confidence), measures of energy (energy), pitch (pitch), and 12 timbre components, each summarized by its minimum and maximum, that capture the texture of the audio signal (timbre_0_min, timbre_0_max, timbre_1_min, timbre_1_max, …, timbre_11_min, timbre_11_max). Given that set of predictors, the goal here is not to write the perfect algorithm for Artists & Repertoire (A&R), but rather to show how a clean logistic pipeline, a regularized variant, and a careful evaluation on a held-out test set can get you to a useful ranking of songs by “hit likelihood” with transparent coefficients we can actually talk about.

Objectives

The work ahead has four main parts. The first is to set up a clean baseline logistic regression that uses all the available predictors after basic cleaning. The second is to judge that model on how well it ranks songs rather than how often it guesses correctly, with ROC curves and test-set AUC as the main yardsticks. The third is to try an L1-regularized logistic model and see if it can lift out-of-sample performance while shrinking weak coefficients toward zero so the final model is easier to read. The last is to pull out the strongest coefficients from the regularized model and turn them into odds multipliers that actually tell a story about what separates hits from non-hits.

Data Sources

The assignment data are from a 7,574-row fictitious CSV file of historical music records. As a reminder: each row in this file represents one song with fields for the release year, song title, artist name, numeric IDs, and a bundle of acoustic features such as time signature, tempo, loudness, key, energy, pitch, and a set of timbre components and their minima and maxima. The target variable Top10 flags whether the song reached the Top 10. For modeling, the ID columns and free-text fields like songID, artistID, songtitle, and artistname were treated as non-predictive metadata and removed from the feature matrix. The entire dataset was read once into memory, and all subsequent steps worked off that in-memory table.

Analysis

To prepare the features and target, the analysis defined y as the Top10 indicator and X as the remaining columns after dropping IDs and titles. Categorical predictors were identified by type and all others were treated as numeric. A stratified 70/30 train/test split preserved the proportion of Top 10 songs in both sets, which matters for logistic regression when the positive class is relatively rare. The training set had 5,301 rows and 34 predictors, and the test set had 2,273 rows with the same columns.

Note: The rationale for a 70/30 split was that it strikes a practical balance between training and evaluation. Allocating 70% of the data to training ensures the model has enough observations to estimate coefficients reliably, while reserving 30% for testing provides a sufficiently large and independent sample to evaluate performance.
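
As a quick sanity check, stratification should leave the hit rate nearly identical in both splits. A minimal sketch, assuming the y_train and y_test objects produced by the train_test_split call in the code section below:

Code
# With stratify=y, the Top10 share should match across splits
# (roughly 0.148, the rate implied by the test support of 336 / 2273).
print(f"Train hit rate: {y_train.mean():0.3f}")
print(f"Test hit rate:  {y_test.mean():0.3f}")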

The baseline pipeline standardized the numeric variables and one-hot encoded the categorical ones, then fit a logistic regression with the saga solver. On the held-out test set this model achieved a ROC AUC of about 0.812. At a default threshold of 0.50, the confusion matrix and classification report showed accuracy of 0.865. For non-Top 10 songs the model reached precision of 0.874, recall of 0.983, and an F1 score of 0.925, which reflects how easy it is to identify the majority class. For Top 10 songs the precision was 0.656, recall was 0.182, and F1 was 0.284, which highlights the recall problem that often appears when the positive class is rare and we use a symmetric threshold.
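
That recall problem is a property of the 0.50 cutoff rather than of the ranking itself. A hedged sketch of how precision and recall trade off as the threshold moves, assuming the baseline test-set probabilities y_proba_base and labels y_test from the code section below:

Code
# Sweep a few cutoffs to see the precision/recall trade-off for the rare class.
from sklearn.metrics import precision_score, recall_score

for t in (0.50, 0.35, 0.20):
    pred = (y_proba_base >= t).astype(int)
    p = precision_score(y_test, pred, zero_division=0)
    r = recall_score(y_test, pred)
    print(f"threshold={t:0.2f}  precision={p:0.3f}  recall={r:0.3f}")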

The ROC curve for this baseline model traced out the familiar concave shape above the 45-degree chance line, and the AUC of 0.812 summarized its ability to rank true hits ahead of non-hits without committing to a particular cutoff.
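
One way to make the “ranking” reading of AUC concrete: it equals the probability that a randomly chosen hit scores higher than a randomly chosen non-hit. A minimal sketch that verifies this by brute-force pairwise comparison, assuming y_test and y_proba_base from the code section below:

Code
import numpy as np

# Split the test-set scores by true class.
pos = y_proba_base[y_test.values == 1]   # scores for true hits
neg = y_proba_base[y_test.values == 0]   # scores for true non-hits

# Fraction of (hit, non-hit) pairs where the hit outranks the non-hit,
# counting ties as half a win.
wins = (pos[:, None] > neg[None, :]).mean()
ties = (pos[:, None] == neg[None, :]).mean()
print(f"Pairwise concordance: {wins + 0.5 * ties:0.3f}")  # should reproduce the ~0.812 AUC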

Methodology

To improve generalization and perform embedded variable selection, the same preprocessing pipeline was plugged into an L1-regularized logistic regression with cross-validation. LogisticRegressionCV with an L1 penalty, the saga solver, a grid of five candidate C values, and three-fold cross-validation chose the regularization strength by maximizing ROC AUC on the training folds. Using a pipeline meant that both the baseline and improved models saw exactly the same standardized and encoded features, so any difference in performance could be attributed to the penalty rather than to preprocessing choices.
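
After fitting, the chosen penalty strength and the resulting sparsity are easy to inspect. A minimal sketch, assuming the fitted pipeline improved from the code section below:

Code
# Inspect the cross-validated regularization choice and how many weights
# the L1 penalty zeroed out.
clf = improved.named_steps["clf"]
print("Chosen C:", clf.C_[0])
n_zero = int((clf.coef_.ravel() == 0).sum())
print(f"Zeroed coefficients: {n_zero} of {clf.coef_.size}")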

After fitting the L1 model on the training set, predicted probabilities and class labels were computed on the same test set as before. The improved model delivered a ROC AUC of about 0.816. At the 0.50 threshold, precision for non-Top 10 songs was 0.877 and recall was 0.977, with an F1 of 0.924. For Top 10 songs, precision was 0.614, recall was 0.208, and F1 was 0.311. Overall accuracy was 0.864, very similar to the baseline. The gain is subtle numerically but real in terms of ranking quality, with a small lift in recall for the positive class.

Once the L1 model was fit, the analysis pulled out the feature names after preprocessing and paired them with the fitted coefficients. The odds multipliers, computed as exp(beta), translate model weights into more intuitive statements. The maximum of timbre component 0 had a coefficient of about −1.07 with an odds multiplier around 0.34, which means a one standard deviation increase in that feature cuts the odds of reaching the Top 10 to about one third, holding other variables fixed. Loudness had a coefficient of roughly +0.90 with an odds multiplier of about 2.47, suggesting that louder tracks are associated with almost two and a half times the odds of charting, again conditional on other features. Pitch carried a coefficient of about −0.62 with an odds multiplier near 0.54, so higher pitch values corresponded to roughly half the odds of Top 10 success. The minimum of timbre component 6 had a coefficient around −0.42 with an odds multiplier of 0.66, and the minimum of timbre component 1 had a coefficient of about +0.39 with an odds multiplier around 1.48, pointing to a positive association.
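
The arithmetic behind those statements is just exponentiation of the standardized coefficients. A small worked example using the two largest weights from the coefficient table in the code section below:

Code
import numpy as np

# exp(beta) turns a log-odds coefficient into an odds multiplier per
# one-standard-deviation increase (features were scaled by their std).
print(np.exp(-1.066))  # timbre_0_max: ~0.344, odds cut to about a third
print(np.exp(0.902))   # loudness:     ~2.464, odds roughly 2.5x higher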

Results & Next Steps

The main story is that a straightforward logistic regression with sensible preprocessing already does a solid job of ranking songs by their likelihood of reaching the Top 10, with test-set ROC AUC just over 0.81. Introducing L1 regularization and tuning the penalty with cross-validation nudges that AUC up to about 0.816, improves recall for the Top 10 class from 0.182 to 0.208 at the default threshold, and delivers a sparser model that highlights a handful of musically interpretable features. The coefficients for timbre maxima and minima, loudness, and pitch give concrete levers for thinking about how audio characteristics relate to chart performance, even if we are careful not to overstate causality.

If this were a live product rather than a homework piece, the next steps would be to tune the decision threshold explicitly for the business problem, for example by trading off precision and recall depending on how expensive missed hits are relative to false alarms. It would also be worth exploring calibration of predicted probabilities, simple ensemble methods that respect the same preprocessing pipeline, and robustness checks across time slices or genres. Finally, a stripped down version of the model with only the strongest features could be deployed as a quick ranking tool for A&R or playlist teams, while the full pipeline remains in the background as a benchmark they can revisit when new data arrive.
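
As a concrete starting point for the threshold work, one could sweep cutoffs and optimize F1 or a cost-weighted criterion. A hedged sketch (in a live setting this would be tuned on a validation split, not the test set), assuming y_proba_imp and y_test from the code section below:

Code
import numpy as np
from sklearn.metrics import f1_score

# Sweep candidate cutoffs and keep the one with the best F1 for the Top 10 class.
# Note: done on the test set here purely for illustration; use a validation
# split in practice to avoid leakage.
thresholds = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y_test, (y_proba_imp >= t).astype(int), zero_division=0) for t in thresholds]
best = thresholds[int(np.argmax(f1s))]
print(f"Best threshold by F1: {best:0.2f} (F1 = {max(f1s):0.3f})")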

Code
from pathlib import Path
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import roc_auc_score, roc_curve, classification_report, confusion_matrix

from packaging import version
import sklearn

FIG_W = 9
FIG_H = 5
FIG_DPI = 110

color_blue = "#033c73"
color_indigo = "#6610f2"
color_purple = "#6f42c1"

try:
    fm.findfont("Ramabhadra", fallback_to_default=False)
    base_font = "Ramabhadra"
except Exception:
    base_font = "DejaVu Sans"

warnings.filterwarnings("ignore", category=UserWarning, module="matplotlib")

plt.rcParams.update(
    {
        "figure.figsize": (FIG_W, FIG_H),
        "figure.dpi": FIG_DPI,
        "font.family": base_font,
        "font.weight": "bold",
        "axes.titlesize": 14,
        "axes.titleweight": "bold",
        "axes.labelsize": 12,
        "axes.labelweight": "bold",
        "xtick.labelsize": 10,
        "ytick.labelsize": 10,
        "legend.fontsize": 10,
        "axes.grid": True,
        "grid.alpha": 0.25,
    }
)

pd.set_option("display.width", 100)
pd.set_option("display.max_columns", 10)
pd.set_option("display.float_format", lambda x: f"{x:0.3f}")


def make_ohe() -> OneHotEncoder:
    # scikit-learn 1.2 renamed the encoder's sparse flag to sparse_output.
    if version.parse(sklearn.__version__) >= version.parse("1.2"):
        return OneHotEncoder(handle_unknown="ignore", sparse_output=True)
    else:
        return OneHotEncoder(handle_unknown="ignore", sparse=True)


def section(title: str):
    bar = "-" * len(title)
    print(f"\n{title}\n{bar}")


def fit_and_report(name, pipeline, X_train, y_train, X_test, y_test):
    section(f"{name} – fit")
    pipeline.fit(X_train, y_train)
    y_proba = pipeline.predict_proba(X_test)[:, 1]
    y_pred = (y_proba >= 0.5).astype(int)
    auc = roc_auc_score(y_test, y_proba)
    print(f"ROC AUC: {auc:0.3f}")

    cr = classification_report(y_test, y_pred, output_dict=True, zero_division=0)
    cr_df = pd.DataFrame(cr).T[["precision", "recall", "f1-score", "support"]]
    section(f"{name} – classification report")
    print(cr_df.round(3))

    cm = confusion_matrix(y_test, y_pred)
    cm_df = pd.DataFrame(cm, index=["true_0", "true_1"], columns=["pred_0", "pred_1"])
    section(f"{name} – confusion matrix")
    print(cm_df)

    fpr, tpr, _ = roc_curve(y_test, y_proba)
    return auc, fpr, tpr, y_proba, y_pred


data_path = Path("data") / "Music_Records.csv"
df = pd.read_csv(data_path, encoding="ISO-8859-1")

print("Shape:", df.shape)
Shape: (7574, 39)
Code
print("Columns (first 15):", list(df.columns)[:15])
Columns (first 15): ['year', 'songtitle', 'artistname', 'songID', 'artistID', 'timesignature', 'timesignature_confidence', 'loudness', 'tempo', 'tempo_confidence', 'key', 'key_confidence', 'energy', 'pitch', 'timbre_0_min']
Code
df.head(3)
   year                           songtitle         artistname              songID  \
0  2010  This Is the House That Doubt Built  A Day to Remember  SOBGGAB12C5664F054   
1  2010                     Sticks & Bricks  A Day to Remember  SOPAQHU1315CD47F31   
2  2010                          All I Want  A Day to Remember  SOOIZOU1376E7C6386   

             artistID  ...  timbre_10_min  timbre_10_max  timbre_11_min  timbre_11_max  Top10  
0  AROBSHL1187B9AFB01  ...       -126.440         18.658        -44.770         25.989      0  
1  AROBSHL1187B9AFB01  ...       -103.808        121.935        -38.892         22.513      0  
2  AROBSHL1187B9AFB01  ...       -108.313         33.300        -43.733         25.744      0  

[3 rows x 39 columns]
Code
drop_cols = ["songID", "artistID", "songtitle", "artistname"]
X = df.drop(columns=["Top10"] + [c for c in drop_cols if c in df.columns], errors="ignore")
y = df["Top10"].astype(int)

cat_cols = [c for c in X.columns if X[c].dtype == "object"]
num_cols = [c for c in X.columns if c not in cat_cols]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("Train shape:", X_train.shape, "| Test shape:", X_test.shape)
Train shape: (5301, 34) | Test shape: (2273, 34)
Code
preprocess = ColumnTransformer(
    transformers=[
        # Scale-only standardization (no centering) keeps the design matrix sparse-friendly.
        ("num", StandardScaler(with_mean=False), num_cols),
        ("cat", make_ohe(), cat_cols),
    ],
    sparse_threshold=1.0,
)

baseline = Pipeline(
    steps=[
        ("prep", preprocess),
        ("clf", LogisticRegression(max_iter=1000, solver="saga")),
    ]
)

auc_base, fpr_b, tpr_b, y_proba_base, y_pred_base = fit_and_report(
    "Baseline logistic regression",
    baseline,
    X_train,
    y_train,
    X_test,
    y_test,
)

Baseline logistic regression – fit
----------------------------------
ROC AUC: 0.812

Baseline logistic regression – classification report
----------------------------------------------------
              precision  recall  f1-score  support
0                 0.874   0.983     0.925 1937.000
1                 0.656   0.182     0.284  336.000
accuracy          0.865   0.865     0.865    0.865
macro avg         0.765   0.583     0.605 2273.000
weighted avg      0.842   0.865     0.831 2273.000

Baseline logistic regression – confusion matrix
-----------------------------------------------
        pred_0  pred_1
true_0    1905      32
true_1     275      61
Code
plt.figure()
plt.plot(
    fpr_b,
    tpr_b,
    label=f"BASELINE AUC = {auc_base:0.3f}",
    color=color_blue,
    linewidth=2.0,
)
plt.plot(
    [0, 1],
    [0, 1],
    "--",
    color=color_indigo,
    linewidth=1.5,
    label="CHANCE",
)
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.xlabel("FALSE POSITIVE RATE")
plt.ylabel("TRUE POSITIVE RATE")
plt.title("ROC CURVE – BASELINE LOGISTIC REGRESSION")
plt.legend(loc="lower right")
plt.tight_layout()
plt.show()

Code
logit_cv = LogisticRegressionCV(
    Cs=5,               # grid of five candidate inverse-regularization strengths
    cv=3,               # three-fold cross-validation on the training set
    penalty="l1",       # LASSO penalty: shrinks weak coefficients to exactly zero
    solver="saga",      # saga supports the L1 penalty
    scoring="roc_auc",  # choose C by ranking quality, not accuracy
    max_iter=1500,
    refit=True,
    n_jobs=-1,
)

improved = Pipeline(
    steps=[
        ("prep", preprocess),
        ("clf", logit_cv),
    ]
)

auc_imp, fpr_i, tpr_i, y_proba_imp, y_pred_imp = fit_and_report(
    "Improved L1-regularized logistic regression",
    improved,
    X_train,
    y_train,
    X_test,
    y_test,
)

Improved L1-regularized logistic regression – fit
-------------------------------------------------
ROC AUC: 0.816

Improved L1-regularized logistic regression – classification report
-------------------------------------------------------------------
              precision  recall  f1-score  support
0                 0.877   0.977     0.924 1937.000
1                 0.614   0.208     0.311  336.000
accuracy          0.864   0.864     0.864    0.864
macro avg         0.745   0.593     0.618 2273.000
weighted avg      0.838   0.864     0.834 2273.000

Improved L1-regularized logistic regression – confusion matrix
--------------------------------------------------------------
        pred_0  pred_1
true_0    1893      44
true_1     266      70
Code
plt.figure()
plt.plot(
    fpr_b,
    tpr_b,
    label=f"BASELINE AUC = {auc_base:0.3f}",
    color=color_blue,
    linewidth=2.0,
)
plt.plot(
    fpr_i,
    tpr_i,
    label=f"IMPROVED AUC = {auc_imp:0.3f}",
    color=color_indigo,
    linewidth=2.0,
)
plt.plot(
    [0, 1],
    [0, 1],
    "--",
    color=color_purple,
    linewidth=1.5,
    label="CHANCE",
)
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.xlabel("FALSE POSITIVE RATE")
plt.ylabel("TRUE POSITIVE RATE")
plt.title("ROC CURVE – BASELINE VS IMPROVED")
plt.legend(loc="lower right")
plt.tight_layout()
plt.show()

Code
prep = improved.named_steps["prep"]
cat_names = []

if cat_cols:
    enc = prep.named_transformers_["cat"]
    try:
        cat_names = list(enc.get_feature_names_out(cat_cols))
    except AttributeError:
        # Older scikit-learn returns an array here; coerce to a list so the
        # concatenation with num_cols below works.
        cat_names = list(enc.get_feature_names(cat_cols))

features = num_cols + cat_names
coefs = improved.named_steps["clf"].coef_.ravel()

coef_table = pd.DataFrame(
    {
        "feature": features,
        "beta": coefs,
        "odds_multiplier": np.exp(coefs),
    }
)

coef_table = coef_table.reindex(
    coef_table["beta"].abs().sort_values(ascending=False).index
)

print(coef_table.head(15).round({"beta": 3, "odds_multiplier": 3}))
                     feature   beta  odds_multiplier
11              timbre_0_max -1.066            0.344
3                   loudness  0.902            2.464
9                      pitch -0.621            0.537
22              timbre_6_min -0.422            0.656
12              timbre_1_min  0.393            1.481
17              timbre_3_max -0.356            0.700
32             timbre_11_min -0.293            0.746
10              timbre_0_min  0.260            1.297
33             timbre_11_max  0.248            1.282
2   timesignature_confidence  0.229            1.258
18              timbre_4_min  0.212            1.236
20              timbre_5_min -0.180            0.836
19              timbre_4_max  0.169            1.185
8                     energy -0.160            0.852
5           tempo_confidence  0.150            1.161