Spotify Audio Analysis: How to Impress (or Disappoint) Pitchfork

Scott Duda
Oct 16, 2020 · 21 min read
Photo by Lee Campbell on Unsplash

Pitchfork is an online music review website that has been actively reviewing albums and individual songs since the mid-90's. The site started as a platform for reviewing independent, lo-fi, and underground artists that typically did not receive attention from mainstream music review publications. Over time, it gradually expanded to include reviews of more mainstream releases as well as classic album reissues. It still maintains a reputation as both a tastemaker and a bastion of musical snobbery, although it now more closely resembles a traditional music review platform than it did in its early days.

The site gained popularity in the early 2000’s due to the unique and polarizing writing style presented in its album reviews. Reviews are often strongly opinionated, heaping glowing praise and flowery language on albums the writers liked and tearing apart albums they disliked. As readership grew, the site gained enough clout to help launch the careers of a large number of indie artists, including Arcade Fire, Sufjan Stevens, Bon Iver, The Decemberists, and countless others.

Over its lifespan, the site has reviewed thousands of albums spanning multiple genres. Each album is given a score of 0.0–10.0, with 0.0 being the worst possible score and 10.0 representing a “perfect” album.

Analysis Goals

The goal of this analysis is to compare audio features from songs appearing in the top 10% and bottom 10% of albums reviewed by Pitchfork. In doing so, I hope to identify some specific audio preferences among Pitchfork reviewers, as well as some audio characteristics they find distasteful or less appealing.

This analysis will be performed by analyzing audio features for the individual songs making up the albums in each subset (top 10% vs. bottom 10%). Individual songs will be viewed collectively as opposed to within the context of their source album. I plan on performing a similar analysis using aggregated album data at a later date. Additionally, I will compare audio features for the two subsets grouped by album genre to try and identify genre-specific reviewer preferences.

To keep this article at a readable length, some of the less exciting results from the analysis have been omitted. The full analysis and results can be found in a notebook saved in the Github repository for this project.

Genre Classification

Albums reviewed by Pitchfork are classified using the genre descriptors shown below:

Pitchfork album genre classifications.

Some albums are classified as belonging to multiple genres. For example, Cluster & Eno, an album collaboration between Brian Eno and Cluster, is classified as electronic, experimental, global, and rock. The rock classification on that one is a bit of a stretch, but you can judge for yourself.

Some albums were not assigned a genre classification. These albums were grouped together in a genre labeled “unknown.”

Dataset Description

Reviews

The dataset I will be using contains 18,393 Pitchfork album reviews from 1999–2017. All data was scraped from pitchfork.com, and the full scraped dataset is available on Kaggle. Each review is assigned a unique ID (reviewid), and the dataset includes several different pieces of information for each review in addition to its text content, such as:

  • Artist
  • Album Title
  • Genre(s)
  • Score

For this analysis, I will be focusing on the attributes listed above as a starting point. The dataset also includes additional information about each album review that will not be included in the analysis, such as the review author name and role at Pitchfork (contributor, associate staff writer, etc.), review publication date, album release year, etc.

Several natural language processing (NLP) analyses have been completed using the content of Pitchfork reviews as source material, and there are dozens of exploratory data analyses (EDA) that have been performed to identify score trends based on genre, artist performance over time, review author score distributions, etc. For this analysis, I will be looking at the audio features of the songs that make up each album in the top and bottom 10% of all reviewed albums sorted by score to try and identify general Pitchfork reviewer preferences.

Scraping Audio Features from Spotify

Spotify is an audio streaming platform that includes millions of songs and podcasts. In addition to allowing customers to stream music, the Spotify Web API may be used to extract a number of different audio features for each individual track hosted on the platform. This data may be easily accessed for data analysis using Spotipy, a Python library that works with the Spotify Web API.

User authorization is required to make use of the Spotify Web API. To gain authorization, you will need to register an app with the Spotify Web API at My Dashboard to generate the necessary credentials (a client ID and a client secret). Directions for getting started with Spotipy can be found in the library documentation.

The Spotify Web API allows a user to perform either a general search query or a target-specific query. Examples of target-specific queries include album, artist, track, and others. Spotify uses Uniform Resource Identifiers (URIs) to identify pieces of information. URIs are assigned to albums, tracks, artists, and other data stored on the platform’s servers. Fortunately, URIs can easily be extracted from search query results and used to pull additional information quickly.
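As an illustration of this workflow, here is a minimal sketch using Spotipy; the credentials and the album query are placeholders, and this is not the exact script used for the project:

```python
# Minimal Spotipy sketch: authenticate, run a target-specific album search,
# pull the album's track IDs, and request their audio features in bulk.
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

sp = spotipy.Spotify(
    auth_manager=SpotifyClientCredentials(
        client_id="YOUR_CLIENT_ID",          # from your registered app
        client_secret="YOUR_CLIENT_SECRET",
    )
)

# Target-specific query: restrict the search to album and artist fields.
result = sp.search(q="album:Cluster & Eno artist:Cluster", type="album", limit=1)
album_uri = result["albums"]["items"][0]["uri"]

# The album URI can then be used to pull track IDs and their audio features.
track_ids = [t["id"] for t in sp.album_tracks(album_uri)["items"]]
features = sp.audio_features(track_ids)      # one dict of audio features per track
print(features[0]["danceability"], features[0]["energy"])
```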

You can see the script I used to acquire audio data using Spotipy and the Spotify Web API in the Github repository for this project. This script also includes code that can be used to scrape song lyrics from Genius using the BeautifulSoup library. I am still fine-tuning this script, and I plan to perform a lyrical analysis of songs from Pitchfork-reviewed albums at a later date.

Audio Features

Audio features for each song included in the analysis.

Categorical Features

Categorical features included in this analysis and their respective descriptions from the Spotify Web API documentation are as follows:

  • Key: The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
  • Explicit Lyric Status: Whether or not the track has explicit lyrics (True = yes it does; False = no it does not OR unknown).
  • Mode: Indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

The song key feature was used to identify the root note of the song’s base scale, and this information was combined with the mode feature to obtain a song scale feature that includes modality. This way we can get a better idea of the particular scale each song is based on (e.g. A minor or C major).
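As a small sketch of how this combination might be implemented (the DataFrame and column names below are illustrative, not the project's actual code):

```python
# Combine Spotify's key (pitch class integer) and mode (1 = major, 0 = minor)
# into a single scale label such as "A minor" or "C major".
import pandas as pd

PITCH_CLASSES = ["C", "C#/Db", "D", "D#/Eb", "E", "F",
                 "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B"]

def song_scale(key: int, mode: int) -> str:
    if key == -1:                         # Spotify uses -1 when no key is detected
        return "unknown"
    return f"{PITCH_CLASSES[key]} {'major' if mode == 1 else 'minor'}"

songs = pd.DataFrame({"key": [9, 0, -1], "mode": [0, 1, 1]})
songs["scale"] = [song_scale(k, m) for k, m in zip(songs["key"], songs["mode"])]
print(songs["scale"].tolist())            # ['A minor', 'C major', 'unknown']
```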

Numerical Features

Spotify also provides a number of numerical audio features for each song. Numerical audio features we will be focusing on in this analysis are listed below along with their descriptions (source: Spotify Web API documentation) and some examples of extreme values:

  • Song Duration (ms): Length of the track in ms.
  • Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. The closer this value is to 1.0, the greater the confidence that the track is acoustic. An example song with high acousticness is Offing by Julianna Barwick (acousticness = 0.996), while an example song with a low acousticness is Memory Machine by Dismemberment Plan (acousticness = 0.003320).
  • Danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. An example song with a high danceability is Leave Me Now by Herbert (danceability = 0.985), while an example of a song with low danceability is Surf by Fennesz (danceability = 0.0636).
  • Energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. An example of a song with high energy is When Doves Cry by Prince (energy = 0.989), while an example of a song with low energy is Zawinul/Lava by Brian Eno (energy = 0.010600).
  • Instrumentalness: Predicts whether a track contains no vocals. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
  • Liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
  • Speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value.
  • Tempo (BPM): The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
  • Valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). An example of a song with high valence is Don’t Laugh (I Love You) by Ween (valence = 0.983), while an example of a song with low valence is Hedphelym by Aphex Twin (valence = 0.0188).
  • Popularity: The popularity of a track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are.

These descriptions are taken directly from the Spotify Web API documentation. In some cases, such as song duration and tempo, the logic behind how these features are derived from raw audio data is intuitive and well established. However, the methodologies for generating more complex features such as danceability and valence are not as obvious. Assumptions about how these classifications were made can be drawn from a literature review of subjective audio classification algorithms, but the specific methods used by Spotify are not publicly documented.

Top 10% vs. Bottom 10%

Album reviews were sorted by review score, and subsets were taken from the top and bottom 10% of the review collection for comparison. Each subset initially contained 1,839 albums prior to scraping on Spotify. Some albums were removed from each set after it was observed that they triggered incorrect search results that the scraping script treated as valid matches; the absence of these albums from Spotify was confirmed before they were removed.
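A rough sketch of this subsetting step in pandas might look like the following (the toy review table and column names are stand-ins for the scraped dataset):

```python
# Sort reviews by score and slice off the top and bottom 10% of albums.
import pandas as pd

# Illustrative stand-in for the review table (one row per reviewed album).
reviews = pd.DataFrame({
    "reviewid": range(1, 11),
    "score": [9.1, 2.3, 7.5, 8.8, 4.0, 6.2, 9.6, 3.1, 5.5, 7.0],
})

subset_size = max(1, int(len(reviews) * 0.10))    # 10% of all reviewed albums
ranked = reviews.sort_values("score", ascending=False)

top_10 = ranked.head(subset_size)                  # highest-scored albums
bottom_10 = ranked.tail(subset_size)               # lowest-scored albums
```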

While Spotify provides an enormous amount of song and album information and features a wide range of artists, not all albums that Pitchfork has reviewed can be found on the platform. Song data was recovered via the Spotify Web API for 1,602 albums from the top 10% (88.0%) and 1,572 albums from the bottom 10% (88.4%). Reviewed album types that were consistently difficult to find on Spotify included compilations, box sets, and greatest hits collections.

A total of 45,765 songs (25,270 from the top 10% and 20,495 from the bottom 10%) were included in this analysis. The average review score of the top 10% of albums for which song information could be obtained was 8.8 with a standard deviation of 0.43, while the average score of the bottom 10% was 4.2 with a standard deviation of 1.06.

Choosing a Statistical Test for Comparison

Several statistical tests are available for comparing the means of two independent distributions and determining whether or not there is a significant difference between the two. A two-sample independent t-test is a common test for achieving this goal. However, one of the core assumptions of a two-sample t-test is that the data from each distribution is normal.

We will need to look at the normality of each feature to determine if this test is appropriate. To do this, we can look at histograms of the data for each numerical audio feature. Some features, such as danceability, demonstrate some evidence of a normal distribution:

Other features, such as instrumentalness, show evidence of a skewed distribution:

Similar distribution shapes are observed when reviewing the distributions for each feature for all songs on Spotify provided in the Spotify Web API documentation. Since few of these numerical feature distributions appear to be normal, we will need to use a nonparametric test for our distributions.
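For a quick screen, the histograms can be paired with a formal normality test; the sketch below uses synthetic stand-in data and SciPy's D'Agostino-Pearson test (one of several reasonable choices, not necessarily the exact check used here):

```python
# Plot a histogram and run a normality test for each numerical audio feature.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
songs = pd.DataFrame({
    "danceability": rng.normal(0.55, 0.15, 1000).clip(0, 1),   # roughly bell-shaped
    "instrumentalness": rng.beta(0.3, 1.5, 1000),              # heavily skewed
})

for feature in songs.columns:
    stat, p = stats.normaltest(songs[feature])    # D'Agostino-Pearson K^2 test
    print(f"{feature}: p = {p:.3g}")              # small p -> reject normality
    songs[feature].hist(bins=50)
    plt.title(feature)
    plt.show()
```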

Mann-Whitney U Test

The Mann-Whitney U test (also known as the Mann-Whitney-Wilcoxon test, the Wilcoxon-Mann-Whitney test, or the Wilcoxon rank-sum test) is a nonparametric test that can be used to compare non-normal distributions. This test requires that the observations from each distribution be independent and that the numerical data are ordinal (i.e. when looking at two values one can clearly be said to be greater than the other). Under this test, the null hypothesis is that the two distributions are equal, while the alternative hypothesis is that the two distributions are not equal.

The Mann-Whitney U test first sorts all elements of a distribution and assigns a rank to each one, beginning with a rank of 1 for the smallest value. Equal ranks are adjusted by assigning a rank equal to the midpoint of the unadjusted rankings. The sum of all rankings (R) is taken for each distribution, and the U statistic is calculated based on the following formula:
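U = R − N(N + 1) / 2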

In this equation, N is the number of elements in the sample distribution. The U statistic is calculated for each of the two distributions being compared, and the lower of the two values is maintained for calculation of significance. The significance of a Mann-Whitney U test is reported as a p-value. Significance was determined using a p-value threshold of 0.05.

Common Language Effect Size

When reporting the results of an inferential test such as the Mann-Whitney U test, an effect size should also be reported as a way of quantifying the magnitude of the test results. There are a number of different effect size statistics that can be calculated, but we will be using the common language effect size (CL). The CL for each Mann-Whitney U test can be calculated using the following equation:
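CL = U / (N1 × N2)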

In this equation, the U statistic calculated as part of the Mann-Whitney U test is divided by the product of the sample sizes from each distribution being compared (N1 and N2). This value is also equal to the area-under-curve (AUC) statistic for the receiver operating characteristic (ROC) that can be calculated for the distributions.

The CL value can be interpreted as the probability that a randomly selected observation from one group will be greater than a randomly selected observation from the other group. Because the lower of the two U statistics is retained in the calculation, CL values here range from 0 to 0.5. Two identical distributions would produce a CL value of 0.5, and the closer the CL value is to zero, the greater the difference between the two distributions.
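Putting the test and the effect size together, a sketch of the comparison for a single feature might look like this (the data are synthetic stand-ins; depending on the SciPy version, mannwhitneyu may return either sample's U, so the smaller of the two possible values is recovered explicitly):

```python
# Compare one audio feature between the two subsets and report U, p, and CL.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)
top_energy = rng.beta(4, 3, 500)       # stand-in for songs from the top 10%
bottom_energy = rng.beta(5, 3, 500)    # stand-in for songs from the bottom 10%

n1, n2 = len(top_energy), len(bottom_energy)
u_returned, p_value = mannwhitneyu(top_energy, bottom_energy, alternative="two-sided")
u = min(u_returned, n1 * n2 - u_returned)   # keep the lower of the two U values

cl = u / (n1 * n2)                          # common language effect size (0 to 0.5 here)
print(f"U = {u:.0f}, p = {p_value:.4f}, CL = {cl:.3f}")
```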

Analysis Results

First, let’s take a look at the number of data points included in the analysis:

There are far fewer global albums / songs than any other genre, limiting the conclusions we can draw using this genre classification. We will still look at global album data out of curiosity, but we will need to take the results with a grain of salt.

Song Mode

Major scales dominated songs across all genres in both the top and bottom 10% of albums:

One of the more interesting findings is the greater prevalence of major scales in songs from the top 10% of folk/country and jazz albums compared to the bottom 10%:

For both of these genres, the percentage of songs utilizing a major scale was nearly 10% greater for albums in the top 10% than in the bottom 10%.

Song Scale

With 12 different keys and 2 different modes available in western music, there are a total of 24 different scales that can serve as the foundation of a song. Each of the 24 available scales was represented in both data subsets (top 10% and bottom 10%).

The most popular scales were C, G, and D major, with each scale being used in approximately 10% of all of the songs in the dataset (top 10% + bottom 10%). Together, these three scales make up over 30% of the dataset. The least popular scale was D#/Eb minor, appearing in only 0.9% of the songs in the combined dataset.

The figure below shows the proportions of each scale in both the top and bottom 10%. Please note that the proportions shown below are relative to the individual subset size (top 10% or bottom 10%). For example, the figure below shows that D major scales showed up in just over 10% of songs on albums in the top 10% and just under 10% of songs on albums in the bottom 10%.
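A small sketch of how these per-subset proportions might be computed with pandas (toy data; the subset and scale labels are illustrative):

```python
# Compute scale proportions relative to each subset's own size.
import pandas as pd

songs = pd.DataFrame({
    "subset": ["top 10%", "top 10%", "top 10%", "bottom 10%", "bottom 10%"],
    "scale":  ["D major", "C major", "D major", "D major", "G major"],
})

# value_counts(normalize=True) within each subset yields per-subset proportions.
scale_props = (
    songs.groupby("subset")["scale"]
         .value_counts(normalize=True)
         .rename("proportion")
         .reset_index()
)
print(scale_props)
```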

Grouping the data by genre revealed some interesting trends in terms of scale preferences among Pitchfork reviewers. Songs in G major tended to be more prevalent in the bottom 10% than in the top 10% across all genres except rock.

Jazz Scales: Stay Away From D Major, B Minor, and E Minor

Analysis of jazz song scales revealed some of the widest discrepancies between songs on the top and bottom 10% of albums. The percentages of jazz songs in A#/Bb major, G#/Ab major, F major, D#/Eb major, and F minor were all >2% higher in the top 10% than in the bottom 10%. On the other hand, the percentages of songs in D major, B minor, and E minor were all >4% higher in the bottom 10% than in the top 10%.

Folk/Country Scales

As demonstrated in the preceding sections, folk/country songs in the top 10% of albums were characterized by a greater prevalence of major scales than songs in the bottom 10%. Specifically, the following five major scales appeared with >2% greater frequency in the top 10% than the bottom 10%: A, B, G#/Ab, F#/Gb, and F. Alternatively, songs in G and C major as well as songs in A and E minor appeared with >2% greater frequency in the bottom 10% than in the top 10%.

Song Duration: More Long Songs

Song duration distributions across the top and bottom 10% of albums were comparable, with median values of 221,787 ms (3:41) and 225,520 ms (3:45), respectively. Both distributions were characterized by a large number of outliers. Outliers were defined as having a Z score of 3 or greater (a duration 3 or more standard deviations from the mean). The top 10% of albums included 366 outlier songs (song length > 12:47) while the bottom 10% of albums included 217 outliers (song length > 9:44).
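For reference, a minimal sketch of the Z-score outlier flagging described above (synthetic durations as a stand-in for one subset's songs):

```python
# Flag unusually long tracks: Z score of 3 or greater on song duration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
songs = pd.DataFrame({"duration_ms": rng.normal(220_000, 60_000, 1000)})

z = (songs["duration_ms"] - songs["duration_ms"].mean()) / songs["duration_ms"].std()
outliers = songs[z >= 3]      # durations 3+ standard deviations above the mean
print(f"{len(outliers)} outlier tracks, threshold ≈ "
      f"{songs['duration_ms'].mean() + 3 * songs['duration_ms'].std():.0f} ms")
```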

These findings indicate that the top 10% albums are more likely to contain unusually long tracks. This held true across all genres except for metal and pop/r&b. For each of those genres, more song_duration outliers were found in the bottom 10% of albums than the top 10%.

Song Duration: Longer Metal Songs, Shorter Folk/Country Songs

Among the different album genres, metal and folk/country songs demonstrated the most significant differences in terms of song duration distributions (CL = 0.395 and 0.431, respectively). Folk/country songs on the top 10% albums tended to be shorter, with a median time of 188,426 ms (3:08) compared to a median time of 215,040 ms (3:35) for bottom 10%. Metal songs, on the other hand, had a top 10% median song duration of 257,000 ms (4:17) compared to a median duration of 217,193 ms (3:37) for the bottom 10%.

Acousticness: More Acousticness is Better (Except For Metal)

Higher levels of acousticness were favored in the top 10% albums in general:

This trend was observable with significance for all genres except for one (metal). The preference for acousticness was particularly pronounced in folk/country, jazz, and pop/r&b albums, with jazz demonstrating the highest level of preference:

The highest observed value of acousticness in any of the songs included in the analysis was 0.996. This value was observed in 13 separate tracks, including Bird of Paradise by The Appleseed Cast.

Danceability: Less Danceable is Better

It appears that Pitchfork reviewers typically prefer slightly lower levels of danceability in their songs, as was demonstrated by the example danceability histogram presented previously. The difference is more pronounced in some genres, especially metal and jazz:

Folk/country and rap albums went against this trend, with songs from albums in the top 10% demonstrating slightly higher levels of danceability than the bottom 10%:

Rap and global songs demonstrated the highest median levels of danceability, while metal and experimental songs demonstrated the lowest.

Energy: Lower is Better (Except for Metal)

The distribution of energy levels for songs on the top 10% of albums had a median of 0.633, while the median energy from the bottom 10% was 0.691. These distributions demonstrated a significant difference (CL=0.447), indicating a slight preference for lower energy songs among Pitchfork reviewers.

This effect was more pronounced for jazz (CL=0.333), pop/r&b (CL=0.373), and electronic (CL=0.412) songs than for songs from other genres:

The only genre for which this trend was reversed was metal, although the difference between the two distributions was relatively small (CL=0.470). This indicates a preference for higher energy metal songs.

The highest energy values were observed in metal songs, with median values > 0.85 for songs from both the top and bottom 10% of reviewed albums. Folk/country songs demonstrated the lowest energy values, with median values < 0.34 for songs from both the top and bottom 10% of reviewed albums.

Instrumentalness: Leave the Vocals Out of Metal and Electronic

Higher levels of instrumentalness were observed in songs on the top 10% of albums when compared to songs on the bottom 10% of albums (CL=0.454), although the difference in medians between the two groups was small (0.0131).

Metal (CL=0.345) and electronic (CL=0.400) songs demonstrated the highest median levels of instrumentalness and the most pronounced difference between songs in the top 10% of albums vs. songs in the bottom 10%:

Jazz, on the other hand, followed the reverse trend, with a significantly higher median value of instrumentalness observed in the bottom 10% of songs compared to the top 10% (CL = 0.388).

Liveness: It’s All the Same

The distributions of liveness levels for songs from the top and bottom 10% were nearly identical:

The differences between the distributions for each genre all demonstrated significance, but the difference between the top and bottom 10% medians was ≤ 0.02 for all genres, indicating that the liveness of the songs on an album has very little impact on whether or not Pitchfork gives the album a good score.

Speechiness: Rap is King

The distributions of speechiness levels for songs from the top and bottom 10% of albums demonstrated a statistically significant difference, but the difference in median values between the two sets of songs was negligible (0.0015).

Songs from rap albums demonstrated the highest levels of speechiness among both the top and bottom 10% albums. Songs from rap albums in the top 10% demonstrated a median speechiness of 0.26, while songs from rap albums in the bottom 10% demonstrated a median speechiness of 0.24. The difference between these two distributions was statistically significant (CL=0.442). Speechiness levels were much lower across all other genres, with no other genre subset demonstrating a median value higher than 0.07.

Tempo: Slower is Better, Except for Global Music

Pitchfork reviewers seem to prefer slightly slower tempos. The difference between the distributions of tempos for songs in the top and bottom 10% of albums was statistically significant (CL=0.488), but the difference in medians was only 1.1 BPM.

Only one genre (global) did not follow this trend: the median tempo of songs from global albums in the top 10% was 121.88 BPM, while the median tempo of songs from global albums in the bottom 10% was 113.95 BPM. However, the tempo distributions for global albums did not demonstrate a statistically significant difference (p = 0.053).

Statistically significant differences were observed for electronic, pop/r&b, and rock albums, with slightly lower tempos being favored across each of these genres. Songs from rap albums demonstrated much lower tempos than songs from all other genres across both the top and bottom 10% of albums. For songs from both the top and bottom 10% of rap albums, the median tempo was > 10 BPM lower than songs from the next lowest tempo genres (jazz and folk/country).

Valence: Happier Folk/Country, Angrier Metal

Pitchfork reviewers tended to prefer happier songs (higher valence). Songs from the top 10% of albums demonstrated a median valence of 0.452, while songs from the bottom 10% of albums demonstrated a median valence of 0.428. The difference in valence distributions between the two album subsets was statistically significant (CL=0.483).

The difference in valence levels between songs from the top and bottom 10% of albums was most pronounced in the folk/country and metal genres. Pitchfork reviewers tended to favor happier folk/country (higher valence). Songs from folk/country albums in the top 10% demonstrated a median valence of 0.50, while songs from folk/country albums in the bottom 10% demonstrated a median valence of 0.31. The difference between the distributions of valence values for the two subsets was statistically significant (CL=0.345).

Once again, the metal genre defied the trends observed in the compiled dataset. Pitchfork reviewers prefer angrier metal (lower valence). Songs from albums in the top 10% demonstrated a median valence of 0.20, while songs from albums in the bottom 10% demonstrated a median valence of 0.37. The difference between the valence distributions for these two subsets was statistically significant (CL=0.349).

Songs from metal and experimental albums demonstrated lower median valence levels than other genres.

Popularity: Sticking With the Trends, Except for Jazz

Despite its reputation as a trendsetter, Pitchfork’s reviews tend to follow general trends in popularity. The median popularity of songs from albums in the top 10% was 17, while the median popularity of songs from albums in the bottom 10% was 10. This trend was statistically significant (CL=0.399).

This trend was observed for all genres except for one: jazz. The median popularity of songs from jazz albums in the top 10% was 9, while the popularity of songs from jazz albums in the bottom 10% was 15.5. The difference between the popularity distributions for jazz songs was statistically significant (CL=0.410).

Songs from the top and bottom 10% of albums reviewed by Pitchfork with the highest observed popularity values were from the rap and pop/r&b genres.

Room For Improvement

There are quite a few opportunities to build upon this analysis, including:

  • Improved Spotify Web API search script — the script I am using can find a majority of the available albums on Spotify, but not 100% of them. There is likely still some tweaking that can be done to the script to improve its accuracy and overall hit ratio.
  • Lyrical analysis — lyrical data can be scraped from Genius without too much additional work. The code contained in the Github repository for this project includes a script for scraping lyrical data, but it requires additional refinement to improve its search capacity and speed.
  • Complete the analysis for the full Pitchfork dataset and look for correlations.
  • Compare the results of this analysis to observations for the most popular songs on Spotify to identify similarities/differences.
  • Evaluate aggregate album data — the data in this analysis was kept at a song level. It would be interesting to determine if the trends observed here match the trends that are observed on aggregated album data. It would also be interesting to look at variable variances across albums to determine how much Pitchfork reviewers prefer consistency across an album.

Thanks for Reading!

Be sure to give this article some claps if you enjoyed it!
