Billboard, one of the best-known names in music journalism, has come a long way from the weekly magazine founded in 1894 by William Donaldson and James Hennegan as a trade publication for bill posters.
Alongside articles covering news, video, opinion, reviews, events, and style trends in the music industry, Billboard is best known for its music charts, which include the Hot 100, the Billboard 200, and the Global 200 and track the most popular songs and albums across different genres. It also hosts events, owns a publishing firm, and operates several TV shows.
It is intriguing to look at how the chart data is managed and computed. Over the past couple of decades, music sales have shifted significantly from physical albums to digital copies, driven by the rise of online streaming platforms such as Spotify, Apple Music, and Pandora. The popularity of a song no longer depends only on physical sales, even though these remain a very influential factor. A majority of the data that determines whether a song or album tops the charts now comes from the digital streams it receives on various platforms, whether video streaming platforms like YouTube or the audio streaming platforms mentioned above. In this blog, we will take a look at how this vast data is managed and how it contributes to the chart results.
2 Data Sources
Three sources of information are used for this project. The first is a structured dataset with the Billboard 200 charts, which comes in a SQL database. The other two are websites, MetroLyrics and Last.fm, which provide unstructured data. New datasets are built from this information. The datasets used for this project are presented next.
2.1 Billboard 200 Datasets
The main dataset used is a database provided by Components One (components.one). This database is free to use. Two tables of this database are used, the Albums table and the Acoustics table. The Albums table provides information about the albums that appeared on the Billboard 200 ranking: the date each album entered the ranking, its name, its artist, and its length. The Acoustics table provides information about the songs of the albums on the Billboard 200, including the name of the song, the album, the artist, and a number of technical features of each song, such as acousticness, danceability, duration, energy, instrumentalness, key, liveness, loudness, mode, tempo, time_signature, and valence. These technical features will not be used, since we consider that they do not add valuable information. In total, these 2 tables cover 33011 albums, 9675 artists, and 339854 songs.
2.2 Last.fm
Last.fm is a music website founded in the United Kingdom in 2002. The website provides a wide range of information about music, which is used here to complement the data on albums, songs, and artists; this information can be used for personal and non-commercial purposes. The process used to extract and process this information is described in section 3.2. The data obtained is stored in JavaScript Object Notation (.json) files.
2.3 MetroLyrics
MetroLyrics is a lyrics website founded in December 2002. It is used to obtain the lyrics of the songs from the albums on the Billboard 200. The data obtained is stored in JavaScript Object Notation (.json) files and can be used for personal and non-commercial purposes. More about the extraction process can be found in section 3.2.
3 Data Preparation
The main dataset is well organized and needs minimal work to extract the relevant data; it also comes in a SQL database, which makes it easy to query. Likewise, the information retrieved via web scraping is well organized and easy to obtain and process. The pipeline used to extract and enrich the data is outlined in Figure 1. This pipeline has two main purposes: to clean the data and to enrich it. These two operations are described in more detail in the next sections.
3.1 Data Cleaning
The first stage of the pipeline in Figure 1 is data cleaning, accomplished with the tool OpenRefine. This tool is used to remove empty entries from the database and to remove entries with extra characters. This operation results in two files:
• Albums: A file with all the albums and their ranks in the Billboard 200 from 1963 to 2019;
• Tracks: A file with the songs and artists that match the albums file.
In this dataset, one album can have hundreds of entries, because it can be featured in the Billboard 200 for several weeks, months, or even years. This poses a problem in the next phase of the pipeline, the data scraping: an album would be processed several times, making the pipeline extremely inefficient and slow. To overcome this problem, and since no entry can be discarded from the previous files (the rank information would be lost), the pipeline generates auxiliary files with duplicates removed. This is done in steps 1 and 2 marked in the pipeline. In these steps, all the characters are also escaped to HTML format so they can be used to generate the URL addresses for MetroLyrics and Last.fm.
Figure 1: Process of Data Cleaning
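The cleaning and de-duplication steps can be illustrated with a minimal sketch in plain Python. This is not the OpenRefine workflow used in the project; it only mimics its effect on a tiny sample. The column names and the sample rows are assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample of the chart export; the real column names may differ.
RAW_CSV = """date,rank,album,artist
2019-01-19,1,  Album A ,Artist X
2019-01-12,2,Album A,Artist X
2019-01-12,5,,Artist Y
2019-01-12,7,Greatest Hits,Artist Z
2019-01-12,9,Greatest Hits,Artist W
"""


def clean_rows(rows):
    """Strip stray whitespace and drop rows that have an empty field."""
    cleaned = []
    for row in rows:
        row = {key: value.strip() for key, value in row.items()}
        if all(row.values()):
            cleaned.append(row)
    return cleaned


def unique_albums(rows):
    """Return each (album, artist) pair once, in first-seen order."""
    seen = set()
    pairs = []
    for row in rows:
        pair = (row["album"], row["artist"])
        if pair not in seen:
            seen.add(pair)
            pairs.append(pair)
    return pairs


# The cleaned rows keep every rank entry; the auxiliary list is only
# used to decide which albums need to be scraped.
rows = clean_rows(csv.DictReader(io.StringIO(RAW_CSV)))
albums_to_scrape = unique_albums(rows)
print(len(rows), "rank entries kept;", len(albums_to_scrape), "albums to scrape")
```

Keeping the full list of rows next to the de-duplicated list preserves the rank history while ensuring each album is requested only once.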
3.2 Data Enrichment
The second stage of the pipeline is data enrichment. To enrich the datasets already obtained from the Billboard 200, the websites Last.fm and MetroLyrics were crawled using the Scrapy framework. The crawler takes the cleaned files described in the previous section and loads them into 4 data frames using Pandas, an open-source data analysis and manipulation library for Python:
• ranks: Table loaded from albums.csv with information on the album’s positions on the Billboard 200 charts on different dates.
• albums: Table with album information obtained from albums.csv by separating the album columns and removing duplicates.
• tracks: Table loaded from tracks.csv.
• artists: Table with artist information obtained from tracks.csv by separating the artist columns and removing duplicates.
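As a rough illustration of how the four data frames listed above might be derived, the following sketch uses small in-memory CSV samples. The column names (date, rank, album, artist, song) are assumptions and not the real schema of albums.csv and tracks.csv.

```python
import io

import pandas as pd

# Hypothetical stand-ins for albums.csv and tracks.csv; real columns may differ.
ALBUMS_CSV = """date,rank,album,artist
2019-01-19,1,Album A,Artist X
2019-01-12,2,Album A,Artist X
2019-01-12,7,Greatest Hits,Artist Z
"""
TRACKS_CSV = """song,album,artist
Song 1,Album A,Artist X
Song 2,Album A,Artist X
Song 3,Greatest Hits,Artist Z
"""

# ranks: every chart entry, one row per album per chart date.
ranks = pd.read_csv(io.StringIO(ALBUMS_CSV))

# albums: the album columns of the same file, with repeated weeks removed.
albums = ranks[["album", "artist"]].drop_duplicates().reset_index(drop=True)

# tracks: loaded as-is.
tracks = pd.read_csv(io.StringIO(TRACKS_CSV))

# artists: the artist column of the tracks file, with duplicates removed.
artists = tracks[["artist"]].drop_duplicates().reset_index(drop=True)

print(len(ranks), len(albums), len(tracks), len(artists))
```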
Both Last.fm and MetroLyrics have user-friendly links that can be built following a common structure, using the information from the columns of the data frames above:
LASTFM_URL = https://www.last.fm/music
ML_URL = https://www.metrolyrics.com
• LASTFM_URL/{artist}/{subset}: Links to pages with information about an artist;
• LASTFM_URL/{artist}/_/{song}/{subset}: Links to pages with information about a song;
• LASTFM_URL/{artist}/{album}/{subset}: Links to pages with information about an album;
• ML_URL/{song}-lyrics-{artist}.html: Links to pages with the lyrics of a song.
The {artist}, {song}, and {album} fields can be obtained from the data frames to search for specific albums, artists, and tracks. The {subset} section indicates what kind of information we want to obtain for that album, artist, or song, which can be either “tags”, “wiki”, or even empty if we want an overview. Each spider searches for a subset of all artists, albums, or tracks and exports that information to a JSON file. As a result, the crawler obtains 8 JSON files with scraped data:
• albums_overview: Number of listeners and release date of the albums.
• albums_tags: List of tags for each album.
• tracks_lyrics: Lyrics of the tracks.
• tracks_overview: Number of listeners and duration of the tracks.
• tracks_tags: List of tags for each track.
• artists_overview: Number of listeners of the artists.
• artists_tags: List of tags for each artist.
• artists_wiki: The biography and number of listeners of all artists. If the artist is an individual (Solo), it also contains their birth date and birth location. If the artist is a group of individuals (Band), it also contains the location of the foundation, years of activity, and a list of its members.
The next step is to complement the data frames above with the scraped data that was stored in the JSON files. Each JSON file has the keys of the albums, artists, or tracks to which the scraped data corresponds, so the information can be aggregated. This aggregation results in 4 main tables: ranks, albums, tracks, and artists (solo and band).
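The two mechanical parts of this stage, building URLs from the patterns given in this section and merging scraped JSON back by key, can be sketched as follows. This is a simplified illustration without any network access or Scrapy code. The escaping rules (plus signs for spaces on Last.fm, lowercase hyphenated slugs on MetroLyrics), the helper names, and the JSON layout are all assumptions.

```python
import json
from urllib.parse import quote_plus

LASTFM_URL = "https://www.last.fm/music"
ML_URL = "https://www.metrolyrics.com"


def join_url(*parts):
    """Join URL parts, skipping an empty subset (overview pages)."""
    return "/".join(part for part in parts if part)


def slug(text):
    """Assumed MetroLyrics form: lowercase words joined by hyphens."""
    words = "".join(ch if ch.isalnum() else " " for ch in text.lower()).split()
    return "-".join(words)


def lastfm_artist_url(artist, subset=""):
    return join_url(LASTFM_URL, quote_plus(artist), subset)


def lastfm_song_url(artist, song, subset=""):
    return join_url(LASTFM_URL, quote_plus(artist), "_", quote_plus(song), subset)


def lastfm_album_url(artist, album, subset=""):
    return join_url(LASTFM_URL, quote_plus(artist), quote_plus(album), subset)


def metrolyrics_url(artist, song):
    return f"{ML_URL}/{slug(song)}-lyrics-{slug(artist)}.html"


# Hypothetical contents of two scraped files, keyed by artist name.
ARTISTS_OVERVIEW = json.loads('[{"artist": "Artist X", "listeners": 1200}]')
ARTISTS_TAGS = json.loads('[{"artist": "Artist X", "tags": ["rock", "70s"]}]')


def aggregate(base_rows, key, *scraped_files):
    """Copy scraped fields onto the base rows that share the same key."""
    merged = {row[key]: dict(row) for row in base_rows}
    for scraped in scraped_files:
        for item in scraped:
            if item[key] in merged:
                merged[item[key]].update(item)
    return list(merged.values())


artists = aggregate(
    [{"artist": "Artist X"}, {"artist": "Artist Y"}],
    "artist",
    ARTISTS_OVERVIEW,
    ARTISTS_TAGS,
)
print(lastfm_artist_url("Artist X", "tags"))
print(artists)
```

Rows without scraped data, such as the second artist in this sample, are kept unchanged, so a failed or missing page does not remove an entry from the final table.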
4 Data Model
The Data Model consists of the following elements:
• Rank: List of top albums on the Billboard 200 chart on a specific date. An album can appear on the chart on different dates, so it can be associated with several ranks, at a different position in each.
• Album: Saves all the information of an album, such as the name of the album, name of the artist, release date, number of tracks, total duration, and number of listeners.
• Artist: Saves general information about an Artist, such as the name of the artist, number of listeners, and its biography (textual information). An Artist can also be one of two subclasses that hold more specific information, depending on whether it is a Solo (referring to an individual) or a Band (group of individuals).
• Solo: Saves specific information of an individual artist, such as the birth date and birth location.
• Band: Saves information about a group of artists that perform together, such as foundation location, years of activity, and the list of members (can be Solo artists, if their information is provided by the datasets).
• Tag: A tag contains a string that can label many albums, artists, or songs, giving important information about them, such as genre.
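One possible way to express these elements in code is with Python dataclasses, as in the sketch below. The field names and types are assumptions derived from the descriptions above, not the project's actual schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Union


@dataclass(frozen=True)
class Tag:
    name: str  # e.g. a genre label


@dataclass
class Artist:
    name: str
    listeners: int = 0
    biography: str = ""
    tags: List[Tag] = field(default_factory=list)


@dataclass
class Solo(Artist):
    birth_date: Optional[date] = None
    birth_location: str = ""


@dataclass
class Band(Artist):
    foundation_location: str = ""
    years_active: str = ""
    # Members are Solo artists when their data is available, plain names otherwise.
    members: List[Union[str, Solo]] = field(default_factory=list)


@dataclass
class Album:
    name: str
    artist: Artist
    release_date: Optional[date] = None
    num_tracks: int = 0
    total_duration_s: int = 0
    listeners: int = 0
    tags: List[Tag] = field(default_factory=list)


@dataclass
class Rank:
    chart_date: date
    position: int
    album: Album


# One album associated with two ranks on different dates.
band = Band(name="Artist X", members=["Member 1", Solo(name="Member 2")])
album = Album(name="Album A", artist=band, tags=[Tag("rock")])
ranks = [Rank(date(2019, 1, 12), 2, album), Rank(date(2019, 1, 19), 1, album)]
print([rank.position for rank in ranks])
```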
5 Data Characterization
The dataset under analysis is very large, with 574000 entries. Given the nature of the data, the same album can have hundreds of entries in the database, because really popular albums are featured in the Billboard 200 for several months or even years. Since this dataset is in a SQL database, we can characterize the data with relative ease. In this dataset, the album with the most entries is The Dark Side of the Moon, with 942 entries. Another particularity that increases the complexity of the analysis is the number of albums named Greatest Hits by different artists, which together represent 5905 entries. To overcome this, the scripts always consider the pair of album and artist.
The number of albums per year remains fairly constant from one year to the next, as can be seen in Section A, Figure 2. The year 2019 only has data for the first month. Looking at the number of songs per year (Section A, Figure 3), we can see that the number of songs included in the albums has increased over the years, with 2014 being the year with the most songs on the Billboard 200. Similar conclusions can be drawn about the number of artists with albums in the ranking (Section A, Figure 4); this may be a consequence of the increased number of distribution media.
Figure 2: Albums per year in Billboard Top 200
Figure 3: Songs per year in Billboard Top 200
Figure 4: Artists per year in Billboard Top 200
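The kind of SQL used for this characterization can be sketched with an in-memory SQLite database. The table name, column names, and rows below are hypothetical; the sketch only shows why grouping by title alone is misleading for names like Greatest Hits, and how grouping by the album and artist pair avoids the problem.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the real table and column names may differ.
conn.execute(
    "CREATE TABLE albums (chart_date TEXT, position INTEGER, album TEXT, artist TEXT)"
)
conn.executemany(
    "INSERT INTO albums VALUES (?, ?, ?, ?)",
    [
        ("2018-12-29", 3, "Greatest Hits", "Artist Z"),
        ("2019-01-05", 4, "Greatest Hits", "Artist Z"),
        ("2019-01-05", 9, "Greatest Hits", "Artist W"),
        ("2019-01-05", 1, "Album A", "Artist X"),
    ],
)

# Grouping by title alone merges different albums that share a name.
by_title = conn.execute(
    "SELECT album, COUNT(*) AS n FROM albums GROUP BY album ORDER BY n DESC, album"
).fetchall()

# Grouping by the (album, artist) pair keeps them apart.
by_pair = conn.execute(
    "SELECT album, artist, COUNT(*) AS n FROM albums "
    "GROUP BY album, artist ORDER BY n DESC, album, artist"
).fetchall()

# Distinct albums per year, using the first four characters of the date.
per_year = conn.execute(
    "SELECT substr(chart_date, 1, 4) AS year, "
    "COUNT(DISTINCT album || '|' || artist) "
    "FROM albums GROUP BY year ORDER BY year"
).fetchall()

print(by_title)
print(by_pair)
print(per_year)
```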
6 Conclusions
The first milestone proposed is complete. The data was characterized and its usage defined. The data chosen is very complete and will provide a lot of information about music since 1963. The Billboard 200 database proved to be an excellent starting point for extracting information, since it gives an exhaustive list of albums, artists, and songs since 1963. Even though the database has a huge list of albums, it only has information about albums that appeared on the Billboard 200 and lacks information about less popular albums. This will render the final datasets incomplete; on the other hand, since we are considering information from 1963 to 2019, covering all albums produced in this period would yield datasets of giant proportions, with challenges that are outside the scope of this analysis. The sources used to enrich this data also proved very satisfactory, providing high-quality information and leading to very complete groups of datasets. Overall, the characterization, data cleaning, and enrichment produced good-quality documents that will allow the use of information retrieval tools and the return of satisfactory results.
References:
[1] Acoustic and meta features of albums and songs on the Billboard 200. url: https://components.one/datasets/billboard-200.
[2] Billboard 200. url: https://www.billboard.com/charts/billboard-200.
[3] Last.fm. url: https://www.last.fm/.
[4] Last.fm on Wikipedia. url: https://en.wikipedia.org/wiki/Last.fm.
[5] MetroLyrics. url: https://www.metrolyrics.com/.
[6] MetroLyrics on Wikipedia. url: https://en.wikipedia.org/wiki/MetroLyrics.
[7] OpenRefine Documentation. url: https://openrefine.org.
[8] Pandas, Python Data Analysis Library. url: https://pandas.pydata.org/.