What makes a K-Pop song popular worldwide?

May 2, 2021

by Kevin Liman

BTS performing on Saturday Night Live in 2020. Source: MTV

K-Pop has seen a meteoric rise in the international music scene over the past 10 years. What began in the early ’90s as a largely domestic and Asia-specific phenomenon has grown into a mainstream international spectacle, with K-Pop music regularly featuring on the top charts of countries all over the world. In the US alone, 37 K-Pop songs have appeared on the Billboard Hot 100 chart since 2009, and 19 of these were released in 2020 alone. Notable K-Pop boy bands like BTS have gone on to perform at the American Music Awards and even guest-star on SNL, while other groups like Blackpink have collaborated with Western artists like Selena Gomez, Lady Gaga, and Dua Lipa. This trend is also reflected in K-Pop’s growing audience on streaming platforms over the past two years: Spotify reported 65% growth in the number of K-Pop listeners from 2019 to 2020, whereas Apple Music reported 86% growth.

This raises the question: what attributes make a K-Pop song more or less likely to be popular on an international scale, and can we use this analysis to predict a K-Pop song’s likelihood of success?

Seeing as Spotify is a streaming platform that’s used worldwide (and gathers data from its users worldwide), I will be using a song’s popularity on Spotify as a yardstick for its international success. I will also be drawing on text data from Genius, Google’s Compact Language Detector (CLD3), and Syuzhet as part of my analysis.

Part 1: Musical Attribute Analysis — What attributes do popular K-Pop songs tend to have?

First, I was interested in finding out what musical attributes might contribute to a K-Pop song’s international success. Fortunately, Spotify has fairly detailed metadata on songs published on its platform — including metrics like “danceability”, “energy”, and “speechiness”.

I began by using the Spotify API to download statistics and metadata on all 2,309 K-Pop titles published on Spotify in 2020. I ranked the data by “popularity” and pulled the Top 100 most popular K-Pop songs released in 2020. (The Spotify API doesn’t give access to the “play count” of each track; it only gives a track’s “popularity” index, which is algorithmically calculated based on the track’s play count and how recent those plays are.) I then created density charts for 6 of these metrics: Danceability (0 to 1), Energy (0 to 1), Loudness (normally between -20 dB and +5 dB), Speechiness (0 to 1), Acousticness (0 to 1), and Instrumentalness (0 to 1).
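As a rough illustration of the ranking and density-chart step, here is a small Python sketch (the analysis itself was done in R, so this is a stand-in, not the original code). It assumes each track is a dict with hypothetical keys `popularity` and `danceability`, and it uses a plain Gaussian kernel density estimate with Silverman's rule-of-thumb bandwidth, which may differ from the smoothing used for Exhibit 1. The rows are made up.

```python
import math
import statistics


def top_n_by_popularity(tracks, n=100):
    """Return the n tracks with the highest Spotify popularity index."""
    return sorted(tracks, key=lambda t: t["popularity"], reverse=True)[:n]


def gaussian_kde(values, grid, bandwidth=None):
    """Estimate a density at each grid point with a Gaussian kernel."""
    n = len(values)
    if bandwidth is None:
        # Silverman's rule of thumb; an assumption, not the article's setting.
        bandwidth = 1.06 * statistics.stdev(values) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]


# Made-up example rows; real rows would come from the Spotify API.
tracks = [
    {"name": "a", "popularity": 81, "danceability": 0.72},
    {"name": "b", "popularity": 35, "danceability": 0.41},
    {"name": "c", "popularity": 77, "danceability": 0.68},
    {"name": "d", "popularity": 64, "danceability": 0.75},
]
top = top_n_by_popularity(tracks, n=3)
grid = [i / 20 for i in range(21)]  # danceability runs from 0 to 1
density = gaussian_kde([t["danceability"] for t in top], grid)
peak = grid[density.index(max(density))]  # grid point where the curve is highest
```

The `peak` value is the kind of "clustered around" figure that can be read off a density chart.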

Exhibit 1: Density Charts of Top-100 most popular K-Pop Songs on Spotify

As seen in the charts above, the Top 100 K-Pop songs tend to be highly danceable (clustered around 0.7), highly energetic (clustered around 0.8), and moderately loud (clustered around -5 dB). This is not surprising, considering the high-energy live dance performances that K-Pop bands are so well known for.

On the other hand, highly popular K-Pop songs tend to be low in “speechiness” (clustered around 0.1), low in “acousticness” (clustered around 0.1), and low in “instrumentalness” (clustered around 0.000). The low speechiness score reflects the fact that K-Pop songs tend to be highly melody-driven and less reliant on rap; although rap verses do appear in K-Pop songs, they are usually delivered in a “sing-rap” style. The low “acousticness” and “instrumentalness” scores reflect the fact that the vast majority of K-Pop songs are heavily electronically produced, and that very few of them are purely instrumental tracks.

Part 2: Language Analysis — Do more English lyrics make a K-Pop song more popular?

Next, I was curious to find out whether the language that K-Pop songs are written and performed in has a bearing on their level of international popularity. Korean is no doubt a very popular language: it is the 14th most popular language in the world, with ~77 million speakers. However, it does not compare in popularity to English, which has ~360 million speakers worldwide. Thus, I hypothesized that the greater the percentage of English lyrics in a K-Pop song, the more accessible it should be to an international audience, and thus the more popular it should be.

To test this, I used the Genius API to extract song lyrics for the ~2,000 K-Pop songs that were published to Spotify in 2020. (Unfortunately, I was only able to scrape lyrics for ~1,300 of these songs, because Genius did not have lyrics for songs released by less-popular or independent artists.) I then used the CLD3 package — Google’s compact language detector — to find the % of English lyrics in each of these songs.
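To give a feel for what a "percentage of English lyrics" measures, here is a minimal Python sketch. It is not CLD3 and does no real language detection: it swaps in a much cruder proxy that counts basic Latin letters against Hangul syllables, so any Latin-script text counts as "English". The function name `english_share` is made up for this example.

```python
def english_share(lyrics):
    """Fraction of letters that are basic Latin, as a rough proxy for English.

    This is NOT language detection: it only compares character counts for
    ASCII letters and the Unicode Hangul Syllables block.
    """
    latin = 0
    hangul = 0
    for ch in lyrics:
        if ch.isascii() and ch.isalpha():
            latin += 1
        elif "\uac00" <= ch <= "\ud7a3":  # Hangul Syllables block
            hangul += 1
    total = latin + hangul
    return latin / total if total else 0.0


# Made-up lyric line: 3 Hangul syllables and 4 Latin letters.
share = english_share("사랑해 baby")
```

Because one Hangul syllable usually packs in more sound than one Latin letter, a character-based share like this will tend to overstate the English portion compared with a word- or segment-based detector.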

Exhibit 2: Pie Charts showing language breakdown of K-Pop Songs

As seen in the pie charts above, 77.2% of all K-Pop songs contain some form of English lyrics. This percentage is even higher in the Top 100 most popular K-Pop songs: 83% of them feature some form of English lyrics. These results are consistent with the fact that many K-Pop songs incorporate English lyrics into their choruses or refrains in order to maximize their catchiness.

Exhibit 3: Dot plot showing the percentage of English lyrics in a K-Pop song vs its popularity on Spotify

To more concretely determine whether there is a relationship between % English lyrics and popularity, I calculated the correlation between the Percentage of English lyrics and Track Popularity, and created a dot plot (see above). The two variables do indeed have a small, yet statistically significant positive correlation of 0.129, with a P-Value of 0.00001257.
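For readers who want to see how a correlation and its P-Value relate, here is a hedged Python sketch. It computes Pearson's r by hand and derives a two-sided p-value from the t statistic t = r * sqrt((n - 2) / (1 - r^2)), using a normal approximation to the t distribution. That approximation is an assumption of this sketch (it is reasonable for samples in the hundreds or thousands), so it will not exactly match the output of a statistics package. The toy data is made up.

```python
import math
from statistics import NormalDist


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_p_value(r, n):
    """Approximate two-sided p-value for a correlation r over n pairs.

    Treats the t statistic as standard normal, which is only a good
    approximation when n is large.
    """
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return 2 * (1 - NormalDist().cdf(abs(t)))


# Made-up toy data: share of English lyrics vs. popularity for five songs.
pct_english = [0.0, 0.2, 0.4, 0.6, 1.0]
popularity = [30, 28, 41, 39, 52]
r = pearson_r(pct_english, popularity)
p = correlation_p_value(r, len(pct_english))  # n is far too small in practice
```

The formula also shows why a weak correlation can still be statistically significant: with r fixed, the t statistic grows with the square root of the sample size, so `correlation_p_value(0.129, 1300)` comes out far below 0.05 while the same r over 100 songs would not.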

Based on this analysis, it would appear that more English lyrics do make a K-Pop song slightly more popular.

Part 3: Sentiment Analysis — Are happier K-Pop songs more popular?

Next, I wanted to find out whether sentiment contributes to a K-Pop song’s international popularity. I used the Syuzhet package to calculate the sentiment of each song’s lyrics, making sure to exclude any songs with 0 English characters, since Syuzhet does not recognize Korean characters. I then created a sentiment density chart for the Top 100 most popular K-Pop songs released in 2020, as well as a dot plot of sentiment score against popularity.
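The general idea behind lexicon-based sentiment scoring can be sketched in a few lines of Python. This is only an illustration of the technique: the word list and its values below are invented, and Syuzhet's own lexicon, tokenization, and scoring details are not reproduced here.

```python
import re

# Tiny made-up lexicon for illustration; a real lexicon is far larger.
TOY_LEXICON = {
    "love": 0.75, "happy": 0.75, "shine": 0.5,
    "cry": -0.75, "lonely": -0.75, "lost": -0.5,
}


def english_words(lyrics):
    """Lowercased runs of ASCII letters; Hangul and punctuation are skipped."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", lyrics.lower())


def sentiment_score(lyrics, lexicon=TOY_LEXICON):
    """Sum the lexicon value of every English word; None if there are none."""
    words = english_words(lyrics)
    if not words:
        return None  # mirrors dropping songs that have no English text
    return sum(lexicon.get(w, 0.0) for w in words)


# Made-up lyric line mixing English and Korean.
score = sentiment_score("I cry, so lonely 외로워")
```

A score near 0 can mean either genuinely neutral lyrics or lyrics whose words simply are not in the lexicon, which is worth keeping in mind when reading a density chart of such scores.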

Exhibit 4: Sentiment score density plot of the Top 100 most popular K-Pop songs on Spotify

The density chart above shows that most of the songs are clustered around 0, indicating “neutral” sentiment, with a long tail extending towards the negative side. This suggests that the English lyrics featured in popular K-Pop songs tend to be either neutral or slightly “sad” or “negative”.

Exhibit 5: Dot plot showing Syuzhet-generated sentiment score vs popularity on Spotify

The dot plot of sentiment vs track popularity appears to confirm a slight negative relationship between the two variables. I found the two variables to have a very small, but statistically significant, negative correlation of -0.08 (P-Value 0.01757).

Based on this analysis, it would appear that K-pop songs with sad(der) English lyrics do have a slightly higher chance of being popular. However, I noticed that there were several shortcomings with this Syuzhet-based sentiment analysis. First, it only takes into account the English lyrics of the song, which may not be an accurate reflection of the sentiment of the whole song. Second, lyrics aren’t the only determinants of “sentiment” or “emotion” within a song: for instance, the key and tonality of a song can significantly affect the mood of a song.

Fortunately, Spotify actually has a metric called “Valence”, which measures the mood or “sentiment” of a song — primarily based on the key, tonality, and tempo. I thought it would be interesting to compare the results for the Syuzhet-based sentiment analysis above, with an analysis of Valence as measured by Spotify.

First, I replicated the sentiment density chart, using “Valence” as the X variable instead of “Sentiment”. Note that Valence scores range between 0 and 1, with 0 being very sad and 1 being very happy. As you can see in the chart below, it has a very similar shape to the sentiment density chart: most songs are clustered around 0.6 (slightly above neutral), with a long tail extending towards the lower valences.

Exhibit 5b: Valence density plot of the Top 100 most popular K-Pop songs on Spotify

Next, I constructed a dot plot showing valence vs track popularity, which also hints at a slight negative relationship between the two variables. The two variables have a very small negative correlation of -0.035, which is not statistically significant (P-Value 0.237).

Exhibit 6: Dot plot showing Valence score vs popularity on Spotify

Looking at these two analyses in aggregate, we can conclude that there doesn’t appear to be a particularly strong relationship between a K-Pop song’s sentiment and its popularity: both happy and sad songs can be popular! However, there is a very small correlation between having sad English lyrics and the popularity of the song.

Part 4: Collaboration Analysis — Do collaborations make a K-Pop song more popular?

Next, I wanted to examine whether collaborations (songs featuring more than one artist) are more popular than individually-released songs. K-Pop artists have been engaging in more and more collaborations in recent years — either with other well-established K-Pop artists, or with internationally-recognized Western artists.

First, I wanted to find out what percentage of K-Pop songs were collaborative — both within the entire data set, and within the Top 100 most popular K-Pop songs. As seen in the pie charts below, 5.8% of all K-Pop songs released in 2020 were collaborations. On the other hand, collaborative songs have slightly higher representation in the Top 100: making up 8% of songs.

Exhibit 7: Pie Charts showing percentage breakdown of release types

Next, I created side-by-side box plots of the popularity by release type of all K-Pop songs released in 2020. As seen below, both Collaborations and Individual Releases have extremely similar popularity distributions. The only notable differences are that Collaborations have a slightly higher mean and median than Individual Releases (~37.5 vs ~35), and Individual Releases have 2 extremely popular outliers. However, these differences don’t appear significant at all, and the distributions are highly similar.
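The per-group figures behind a comparison like this can be sketched in Python as follows. The sketch assumes each track is a dict with a hypothetical `artists` list and a `popularity` value, and it treats any track with more than one artist as a collaboration, following the definition used in this part. The sample rows are made up.

```python
from collections import defaultdict
from statistics import mean, median


def popularity_by_release_type(tracks):
    """Group tracks into Collaboration / Individual and summarize popularity."""
    groups = defaultdict(list)
    for t in tracks:
        kind = "Collaboration" if len(t["artists"]) > 1 else "Individual"
        groups[kind].append(t["popularity"])
    return {
        kind: {"n": len(vals), "mean": mean(vals), "median": median(vals)}
        for kind, vals in groups.items()
    }


# Made-up example rows.
sample = [
    {"artists": ["A"], "popularity": 30},
    {"artists": ["A", "B"], "popularity": 40},
    {"artists": ["C"], "popularity": 50},
    {"artists": ["D", "E"], "popularity": 35},
]
summary = popularity_by_release_type(sample)
```

Reporting the group size `n` alongside the mean and median matters here, since collaborations are a small minority of releases and their summary statistics are correspondingly noisier.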

Exhibit 8: Side-by-side box plots showing distribution of popularity by release type

This was confirmed by calculating the correlation of collaboration type and popularity: the two variables have a small positive correlation of 0.02491143, which is not statistically significant (P-Value 0.2517).

Part 5: Predictive Model

Finally, I wanted to see if I could make a linear regression model using some of the metrics I created in earlier parts (e.g., % of English Lyrics, Sentiment), as well as the musical attribute data from Spotify, to predict the popularity of a song.

Exhibit 9: Correlation Matrix of Variables

I first made a correlation matrix of the variables I wanted to use, in order to identify any interaction terms I might want to incorporate into the model. From the correlation matrix, I noticed two main things. First, there are some variables with strong correlations — for example, Acousticness has strong correlations with Energy, Loudness, Danceability, and Valence; Energy has strong correlation with Loudness and Valence; Danceability has strong correlation with Valence. Second, I was concerned to see that Popularity doesn’t appear to be strongly correlated with any of the variables — it only has a weak negative correlation with Acousticness, and a weak positive correlation with Percent English. This was the first “warning sign” that a predictive model might not be very effective using the variables I had in the dataset.

Below is my first iteration of the model: (Fit 1)

Exhibit 10: Fit 1

There are a number of variables that are statistically significant: notably, loudness, acousticness, valence, and percent.english. While the model itself is statistically significant (P-Value 3.345e-12), it has a very small adjusted R-Squared of 0.07466.

For the next iteration (Fit 2), I tried adding different types of interactions, but the vast majority of them ultimately caused the adjusted R-Squared to decrease instead of increase. The only interaction that added to the model meaningfully was (Acousticness * Energy), which increased the adjusted R-Squared slightly to 0.07557, with a model P-Value of 4.333e-12.
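To make the idea of an interaction term and the adjusted R-Squared concrete, here is a self-contained Python sketch. It is not the model from this article: it fits ordinary least squares through the normal equations on made-up, noise-free data with only two predictors plus their product, mirroring how an (Acousticness * Energy) term expands into both main effects and the interaction. All function names are invented for the example.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    x = [[1.0] + list(r) for r in rows]  # prepend an intercept column
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(coefs, row):
    """Intercept plus the weighted sum of the features."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


def adjusted_r2(y, y_hat, n_predictors):
    """R-squared penalized for the number of predictors in the model."""
    n = len(y)
    mean_y = sum(y) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)


def with_interaction(acousticness, energy):
    """Features for the terms acousticness, energy, acousticness * energy."""
    return [acousticness, energy, acousticness * energy]


# Made-up, noise-free data generated from y = 10 + 2a + 3e + 4ae.
points = [(0.1, 0.9), (0.2, 0.8), (0.5, 0.5), (0.7, 0.6), (0.9, 0.2), (0.3, 0.4)]
rows = [with_interaction(a, e) for a, e in points]
y = [10 + 2 * a + 3 * e + 4 * a * e for a, e in points]
coefs = fit_ols(rows, y)
fitted = [predict(coefs, r) for r in rows]
adj = adjusted_r2(y, fitted, n_predictors=3)
```

The penalty in `adjusted_r2` is why adding an interaction can lower the score: each extra term has to reduce the residual error by enough to pay for the degree of freedom it uses up.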

Exhibit 11: Fit 2

Comparing the RMSEs of Fit 1 and Fit 2, I was able to confirm that Fit 2 has a slightly lower RMSE of 15.297 (vs 15.315) when tested on the Training data set.

I then used Fit 2 to predict the popularity of songs in the Test Data Set. Fit 2 was able to predict the Test Data Set’s popularity with an RMSE of 16.175, a similar but slightly higher error than on the Training Data Set.
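For reference, the RMSE and a simple train/test split can be sketched in Python like this. The 80/20 split fraction and the seed are assumptions made for the example; the split actually used for the Training and Test Data Sets is not stated here.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def train_test_split(rows, test_fraction=0.2, seed=245):
    """Shuffle a copy of rows and split it; fraction and seed are assumptions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


# Made-up example values.
error = rmse([3, 5], [1, 9])  # sqrt((4 + 16) / 2)
train, test = train_test_split(range(10))
```

RMSE is expressed in the same units as the target, so an RMSE of about 16 means predictions are typically off by roughly 16 points on Spotify's 0 to 100 popularity index.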

Exhibit 12: RMSE Calculation of Fit 2 when applied to the Test Data

To conclude: while the linear regression model was statistically significant when applied to the training data set, it had a rather high RMSE of ~16 when applied to the test data set. Thus, it was not particularly accurate or successful as a predictive model. If I were to iterate on this model further, I would be interested in introducing other, non-musical variables to the analysis, for example the number of times the song was performed on prime-time TV, or perhaps the number of views its music video has on YouTube.

Conclusion

From the above analysis, it would appear that a number of attributes are associated with the popularity of K-Pop songs on the international music scene. First, according to the Spotify musical attribute data, Top-100 K-Pop songs tend to be highly danceable, energetic, and loud, but have low speechiness, acousticness, and instrumentalness. Furthermore, popular K-Pop songs tend to have a slightly higher percentage of English lyrics, and a slight tendency towards sadder, more negative English lyrics. Finally, while collaborations between artists are often loved by fans, there appears to be no statistically significant correlation between collaboration and the popularity of a track. Unfortunately, a predictive model based purely on musical and lyrical attributes was not particularly successful in predicting a K-Pop song’s popularity. Such a model would probably require additional non-musical variables in order to be refined, since other factors, such as the amount of media promotion and advertising spend a song receives, are also likely to have a major impact on its popularity.

Data Sources

  • Spotify: All music attributes
  • Genius: All lyric data

All data cleaned and manipulated in R.

Created for OIDD245 at the University of Pennsylvania, Professor Sonny Tambe.
