Predicting Top-10 Songs: A Logistic + LASSO Walkthrough in Python
From music features to hit probabilities
predictive analytics
music
classification
ROC
Python
Author
A. Srikanth
Published
December 5, 2025
Project Spotlight
Context
Within this project, I treated “hit prediction” in a very literal way: each row in the dataset is a single commercial release, with a binary flag (1 or 0) for whether it ever reached the Top 10 of the charts. Those Top 10 songs are the hits; everything else is a non-hit. The modeling problem is to take the information we have at the song level and estimate the probability that a new track would end up on the hits side of that line.
The inputs are a mix of metadata and audio features that you could imagine pulling from a streaming or analysis service: year (year), artist (artistname), tempo (tempo, with corresponding confidence tempo_confidence), loudness (loudness), time signature (timesignature, with corresponding confidence timesignature_confidence), key (key, with corresponding confidence key_confidence), measures of energy (energy), pitch (pitch), and 12 timbre components summarized by their minima and maxima, which capture the texture of the audio signal (timbre_0_min, timbre_0_max, timbre_1_min, timbre_1_max, …, timbre_11_min, timbre_11_max). Given that set of predictors, the goal here is not to write the perfect algorithm for Artists & Repertoire (A&R), but rather to show how a clean logistic pipeline, a regularized variant, and a careful evaluation on a held-out test set can get you to a useful ranking of songs by “hit likelihood” with transparent coefficients we can actually talk about.
Objectives
The work ahead of us for this project has four main parts. The first is to set up a clean baseline logistic regression that uses all the available predictors after basic cleaning. The second is to judge that model on how well it ranks songs rather than how often it guesses correctly, with ROC curves and test-set AUC as the main yardsticks. The third is to try an L1-regularized logistic model and see if it can lift out-of-sample performance while shrinking weak coefficients toward zero so the final model is easier to read. The last job is to pull out the strongest coefficients from the regularized model and turn them into odds multipliers that actually tell a story about what separates hits from non-hits.
Data Sources
The assignment data are from a 7,574-row fictitious CSV file of historical music records. As a reminder: each row in this file represents one song, with fields for the release year, song title, artist name, numeric IDs, and a bundle of acoustic features such as time signature, tempo, loudness, key, energy, pitch, and a set of timbre components with their minima and maxima. The target variable Top10 flags whether the song reached the Top 10. For modeling, the ID columns and free-text fields (songID, artistID, songtitle, and artistname) were treated as non-predictive metadata and removed from the feature matrix. The entire dataset was read once into memory, and all subsequent steps worked off that in-memory table.
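Loading that table is a one-liner; here is a minimal sketch, with the caveat that the filename songs.csv is a placeholder rather than the actual path used in the assignment.

import pandas as pd

# Read the song records once; everything downstream works off this frame.
df = pd.read_csv("songs.csv")  # placeholder filename
print(df.shape)                                  # expect (7574, 39)
print(df["Top10"].value_counts(normalize=True))  # share of hits vs. non-hits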
Analysis
To prepare the features and target, the analysis defined y as the Top10 indicator and X as the remaining columns after dropping IDs and titles. Categorical predictors were identified by type and all others were treated as numeric. A stratified 70/30 train-test split preserved the proportion of Top 10 songs in both sets, which matters for logistic regression when the positive class is relatively rare. The training set had 5,301 rows and 34 predictors, and the test set had 2,273 rows with the same columns.
Note: The rationale for a 70/30 split was that it strikes a practical balance between training and evaluation. Allocating 70% of the data to training ensures the model has enough observations to estimate coefficients reliably, while reserving 30% for testing provides a sufficiently large and independent sample to evaluate performance.
The baseline pipeline standardized the numeric variables and one-hot encoded the categorical ones, then fit a logistic regression with the saga solver. On the held-out test set this model achieved a ROC AUC of about 0.814. At a default threshold of 0.50, the confusion matrix and classification report showed accuracy of 0.865. For non-Top 10 songs the model reached precision of 0.875, recall of 0.981, and an F1 score of 0.925, which reflects how easy it is to identify the majority class. For Top 10 songs the precision was 0.644, recall was 0.193, and F1 was 0.297, which highlights the recall problem that often appears when the positive class is rare and we use a symmetric threshold.
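A sketch of what that baseline looks like in scikit-learn, assuming the num_cols, cat_cols, and train/test splits from the preparation code shown at the end of this post; solver settings such as max_iter are illustrative, not values quoted from the original analysis.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, classification_report

# Standardize numeric features and one-hot encode categoricals in one transformer.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

baseline = Pipeline([
    ("prep", preprocess),
    ("logit", LogisticRegression(solver="saga", max_iter=5000)),
])
baseline.fit(X_train, y_train)

# Threshold-free ranking quality, plus the default-0.50 classification report.
proba = baseline.predict_proba(X_test)[:, 1]
print("Test ROC AUC:", roc_auc_score(y_test, proba))
print(classification_report(y_test, baseline.predict(X_test)))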
The ROC curve for this baseline model traced out the familiar concave shape above the 45-degree chance line, and the AUC of 0.814 summarized its ability to rank true hits ahead of non-hits without committing to a particular cutoff.
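Plotting that curve only takes a few lines; a sketch, assuming the proba scores from the baseline snippet above.

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# ROC curve for the baseline model against the 45-degree chance line.
fpr, tpr, _ = roc_curve(y_test, proba)
plt.plot(fpr, tpr, label="Baseline logistic regression")
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()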
Methodology
To improve generalization and perform embedded variable selection, the same preprocessing pipeline was plugged into an L1-regularized logistic regression with cross-validation. LogisticRegressionCV with an L1 penalty, the saga solver, a grid of 10 candidate C values, and five-fold cross-validation chose the regularization strength by maximizing ROC AUC on the training folds. Using a pipeline meant that both the baseline and improved models saw exactly the same standardized and encoded features, so any difference in performance could be attributed to the penalty rather than to preprocessing choices.
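A minimal sketch of that swap, reusing the same preprocess transformer from the baseline snippet; max_iter and random_state are illustrative settings, not values quoted from the original analysis.

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score

# Same preprocessing, but the classifier is an L1-penalized logistic regression
# whose C is chosen by 5-fold cross-validation over 10 candidates, scored by ROC AUC.
lasso_logit = Pipeline([
    ("prep", preprocess),
    ("logit", LogisticRegressionCV(
        Cs=10, cv=5, penalty="l1", solver="saga",
        scoring="roc_auc", max_iter=5000, random_state=42,
    )),
])
lasso_logit.fit(X_train, y_train)

# Score the held-out test set exactly as the baseline was scored.
proba_l1 = lasso_logit.predict_proba(X_test)[:, 1]
print("Test ROC AUC (L1):", roc_auc_score(y_test, proba_l1))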
After fitting the L1 model on the training set, predicted probabilities and class labels were computed on the same test set as before. The improved model delivered a ROC AUC of about 0.816. At the 0.50 threshold, precision for non-Top 10 songs was 0.877 and recall was 0.977, with an F1 of 0.924. For Top 10 songs, precision was 0.614, recall was 0.208, and F1 was 0.311. Overall accuracy was 0.864, very similar to the baseline. The gain is subtle numerically but real in terms of ranking quality, with a small lift in recall for the positive class.
Once the L1 model was fit, the analysis pulled out the feature names after preprocessing and paired them with the fitted coefficients. The odds multipliers, computed as exp(beta), translate model weights into more intuitive statements. The maximum of timbre component 0 had a coefficient of about −1.07 with an odds multiplier around 0.34, which means a one standard deviation increase in that feature cuts the odds of reaching the Top 10 to about one third, holding other variables fixed. Loudness had a coefficient of roughly +0.90 with an odds multiplier of about 2.47, suggesting that louder tracks are associated with almost two and a half times the odds of charting, again conditional on other features. Pitch carried a coefficient of about −0.62 with an odds multiplier near 0.54, so higher pitch values corresponded to roughly half the odds of Top 10 success. The minimum of timbre component 6 had a coefficient around −0.42 with an odds multiplier of 0.66, and the minimum of timbre component 1 had a coefficient of about +0.39 with an odds multiplier around 1.48, pointing to a positive association.
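A sketch of that extraction step, assuming the fitted lasso_logit pipeline from the previous snippet and a recent scikit-learn where get_feature_names_out is available.

import numpy as np
import pandas as pd

# Pair post-preprocessing feature names with the fitted L1 coefficients and
# convert each coefficient into an odds multiplier via exp(beta).
feature_names = lasso_logit.named_steps["prep"].get_feature_names_out()
coefs = lasso_logit.named_steps["logit"].coef_.ravel()

odds = (
    pd.DataFrame({"feature": feature_names, "coef": coefs})
    .assign(odds_multiplier=lambda d: np.exp(d["coef"]))
    .query("coef != 0")                       # LASSO zeroes out weak features
    .sort_values("coef", key=abs, ascending=False)
)
print(odds.head(10))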
Results & Next Steps
The main story is that a straightforward logistic regression with sensible preprocessing already does a solid job of ranking songs by their likelihood of reaching the Top 10, with test set ROC AUC just over 0.81. Introducing L1 regularization and tuning the penalty with cross validation nudges that AUC up to about 0.816, improves recall for the Top 10 class from 0.193 to 0.208 at the default threshold, and delivers a sparser model that highlights a handful of musically interpretable features. The coefficients for timbre maxima and minima, loudness, and pitch give concrete levers for thinking about how audio characteristics relate to chart performance, even if we are careful not to overstate causality.
If this were a live product rather than a homework piece, the next steps would be to tune the decision threshold explicitly for the business problem, for example by trading off precision and recall depending on how expensive missed hits are relative to false alarms. It would also be worth exploring calibration of predicted probabilities, simple ensemble methods that respect the same preprocessing pipeline, and robustness checks across time slices or genres. Finally, a stripped down version of the model with only the strongest features could be deployed as a quick ranking tool for A&R or playlist teams, while the full pipeline remains in the background as a benchmark they can revisit when new data arrive.
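As one hedged example of what explicit threshold tuning could look like, the sketch below picks the cutoff that maximizes F1 for the Top 10 class, using proba_l1 from the regularized model above; a real deployment would instead weight missed hits and false alarms by their actual costs.

from sklearn.metrics import precision_recall_curve

# Scan candidate thresholds and report the one with the best F1 for the hit class.
precision, recall, thresholds = precision_recall_curve(y_test, proba_l1)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = f1[:-1].argmax()   # the last precision/recall pair has no matching threshold
print(f"Best threshold ~ {thresholds[best]:.2f} with F1 ~ {f1[best]:.3f}")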
For reference, the first three rows of the raw dataset:

   year                            songtitle         artistname              songID            artistID  ...  timbre_10_min  timbre_10_max  timbre_11_min  timbre_11_max  Top10
0  2010  This Is the House That Doubt Built  A Day to Remember  SOBGGAB12C5664F054  AROBSHL1187B9AFB01  ...       -126.440         18.658        -44.770         25.989      0
1  2010                      Sticks & Bricks  A Day to Remember  SOPAQHU1315CD47F31  AROBSHL1187B9AFB01  ...       -103.808        121.935        -38.892         22.513      0
2  2010                           All I Want  A Day to Remember  SOOIZOU1376E7C6386  AROBSHL1187B9AFB01  ...       -108.313         33.300        -43.733         25.744      0

[3 rows x 39 columns]
Code
from sklearn.model_selection import train_test_split

# Drop identifiers and free-text fields; keep Top10 as the target.
drop_cols = ["songID", "artistID", "songtitle", "artistname"]
X = df.drop(columns=["Top10"] + [c for c in drop_cols if c in df.columns], errors="ignore")
y = df["Top10"].astype(int)

# Split predictors by type for the preprocessing pipeline.
cat_cols = [c for c in X.columns if X[c].dtype == "object"]
num_cols = [c for c in X.columns if c not in cat_cols]

# Stratified 70/30 split preserves the share of Top 10 songs in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print("Train shape:", X_train.shape, "| Test shape:", X_test.shape)