The Tortured Analysts Department:
The Anthology
A prediction analysis of Taylor Swift’s discography.
Summary
Like its namesake album, this is a double project. The second half of the project aims to train a machine-learning model with Swift’s entire discography to predict a Taylor-esque song.
Introduction
I still remember when Taylor Swift was a small artist and only a few of us knew her songs. A couple of friends and I danced to “Our Song” in the elementary school talent show (Taylor Swift, 2006). It was also the only song I learned to master on the guitar. Then Fearless (2008) was released, and “You Belong With Me” and “Love Story” swept across the fifth-grade class. When Speak Now (2010) was released, “Dear John” was just a song with great writing and what sounded like great vocals to my untrained middle-school ears, not yet a weaponized anthem against John Mayer. I have grown up with Taylor Swift for the majority of my life. Now, I will fulfill my childhood dream of being the next Taylor Swift the only way I know how: by coding.
To predict the lyrics of a Swift song, an LSTM model is built. Long short-term memory (LSTM) is a type of recurrent neural network (RNN). RNNs remember previous information and use it to process the current input; however, RNNs suffer from the vanishing gradient problem, so they struggle to remember long-term dependencies. LSTMs are designed to avoid this issue.
A word-level approach is taken, as opposed to a character-level one. The words are treated as unique units, and the model attempts to predict the next word. This helps keep the output comprehensible, since the model cannot invent nonsense words the way a character-level model can. On the other hand, it requires a lot of memory to hold an entire vocabulary of words. My little laptop must stay strong.
Data Collection and Preprocessing
The data, lyrics from every Taylor Swift song, is collected from the Genius API. The process is discussed in Part One. The ‘Lyrics’ column is joined into a single string.
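The joining step can be sketched in a few lines, assuming the scraped songs live in a pandas DataFrame with a ‘Lyrics’ column (the titles and snippets below are stand-ins for the real Genius data):

```python
import pandas as pd

# Hypothetical two-song frame standing in for the full Genius scrape;
# the real collection step is covered in Part One.
songs = pd.DataFrame({
    "Title": ["Our Song", "Love Story"],
    "Lyrics": ["i was riding shotgun with my hair undone",
               "we were both young when i first saw you"],
})

# Join every entry of the 'Lyrics' column into a single string.
lyrics = " ".join(songs["Lyrics"])
```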
Preparing the Data
Every word in the lyrics string is identified and separated through tokenization. The vocabulary is created by finding all the unique words in the lyrics. Input sequences are created from the text data, the ‘Lyrics’ column, using a for loop that cycles through each song. The uncleaned text is used so that the generated song is as genuine, or as close to genuine, as possible. The words are converted into their number codes according to the vocabulary. A nested for loop then creates n-gram sequences from the number codes: for each song, it adds one word at a time to build sequences of increasing length.
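The tokenization and n-gram steps above can be sketched in plain Python. (The project itself would typically use a library tokenizer such as Keras’s; the two short lines below are stand-ins for the real lyrics.)

```python
# Stand-in songs for the real lyrics corpus.
songs = [
    "we were both young when i first saw you",
    "you belong with me",
]

# Build the vocabulary: every unique word mapped to an integer code.
# Index 0 is left reserved for padding.
vocab = {}
for song in songs:
    for word in song.split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1

# For each song, turn words into codes, then grow n-gram sequences
# one word at a time: [w1, w2], [w1, w2, w3], ...
input_sequences = []
for song in songs:
    codes = [vocab[w] for w in song.split()]
    for i in range(2, len(codes) + 1):
        input_sequences.append(codes[:i])
```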
The sequences are shaped to fit the LSTM network by padding, which makes them all the same length, since LSTM networks require fixed-length inputs.
The sequences are divided into predictors and labels: the predictors include every token except the last, which is the label. The label integers are converted into one-hot encoded format, transforming each integer into a vector of zeros with a one at the integer’s position, which makes the labels appropriate for the model. Finally, the data is split into 75% training and 25% test data.
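A minimal NumPy sketch of the padding, predictor/label split, one-hot encoding, and train/test split, using toy sequences and an assumed vocabulary size:

```python
import numpy as np

# Toy n-gram sequences standing in for the real ones.
input_sequences = [[1, 2], [1, 2, 3], [1, 2, 3, 4], [5, 6], [5, 6, 7]]
vocab_size = 8  # assumed vocabulary size (codes 1..7 plus padding index 0)

# Pre-pad every sequence with zeros to a common length,
# mirroring what Keras's pad_sequences does by default.
max_len = max(len(s) for s in input_sequences)
padded = np.array([[0] * (max_len - len(s)) + s for s in input_sequences])

# Predictors are every token but the last; the last token is the label.
X, labels = padded[:, :-1], padded[:, -1]

# One-hot encode the labels: a vector of zeros with a one at the label's index.
y = np.zeros((len(labels), vocab_size))
y[np.arange(len(labels)), labels] = 1

# 75/25 train/test split (a library splitter would shuffle as well).
split = int(0.75 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
```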
First Model
Training the Model
The first model is created with the following layers:
Input or Embedding Layer: transforms each input token into a dense vector of a fixed size (50 dimensions)
LSTM Layer: 100 LSTM units learn the sequence and context of the words
Dropout Layer: randomly skips some neurons during training, making the model less sensitive to the weights of individual neurons and thus helping avoid overfitting; the dropout rate is set to 0.1
Output or Dense Layer: has as many neurons as there are words in the vocabulary; outputs the probability of each word being the next one
Since this is a multi-class classification problem with one-hot encoded labels, the loss function is categorical_crossentropy. Early stopping is implemented to avoid over-training by halting the training process once the model stops improving. The accuracy and loss are graphed as the model is trained.
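The architecture above can be sketched in Keras. The vocabulary size, optimizer, and early-stopping patience are assumptions not stated in the text; only the layer sizes and dropout rate come from the description:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense
from tensorflow.keras.callbacks import EarlyStopping

vocab_size = 5000  # assumed; the real value comes from the tokenizer

model = Sequential([
    Embedding(vocab_size, 50),                # 50-dimensional embedding vectors
    LSTM(100),                                # 100 LSTM units
    Dropout(0.1),                             # 0.1 dropout rate
    Dense(vocab_size, activation="softmax"),  # one neuron per vocabulary word
])

# One-hot labels pair with categorical_crossentropy.
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])

# Stop training once validation loss stops improving (patience is assumed).
early_stop = EarlyStopping(monitor="val_loss", patience=3,
                           restore_best_weights=True)
```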
Predicting the Next Word
A function is defined to predict the next word given two arguments: the model and the seed text. The seed text is the initial text given to the model, which it uses to guess the next word. The words “I am” are used as the seed text. In a for loop, the seed text is prepared for the model by tokenizing and padding it.
The model assigns a probability to every word in the vocabulary. The word with the highest probability is chosen as the next word and appended to the seed text. The updated seed text then goes back into the model for the next prediction. The process repeats until all the lyrics are predicted, for example, 150 words.
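A runnable sketch of this generation loop, with a stub prediction function standing in for the trained model and a tiny hypothetical vocabulary (padding is omitted for brevity):

```python
import numpy as np

# Tiny stand-in vocabulary; the real one covers the whole discography.
index_to_word = {1: "i", 2: "am", 3: "the", 4: "mastermind"}
word_to_index = {w: i for i, w in index_to_word.items()}
vocab_size = len(index_to_word) + 1  # index 0 reserved for padding

def fake_predict(token_list):
    # Stub for model.predict: deterministically cycles through the
    # vocabulary so the loop below is runnable without a trained model.
    probs = np.zeros(vocab_size)
    probs[(token_list[-1] % (vocab_size - 1)) + 1] = 1.0
    return probs

def generate(seed_text, next_words):
    for _ in range(next_words):
        # Tokenize the current seed text.
        token_list = [word_to_index[w] for w in seed_text.split()]
        probs = fake_predict(token_list)
        predicted = int(np.argmax(probs))            # highest-probability word
        seed_text += " " + index_to_word[predicted]  # append and repeat
    return seed_text

lyrics = generate("i am", 3)
```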
This baseline model generates the following:
'I am a whole times but i do i wish i wish he was bought it just right yeah you would have to break for me and you have to wrapped buried with you scarred and its under her eyes but its a sunshine honey youre taking a might feelin you just hear my hand off your version of your moment foolish one day well boarded up on your grave and its laughin at our house behind never never turn in paris oh she was both gone was the hospital first wonderland home now it was right tonight with we went around the stairs but i seem really will closed you makes you want to call right but now we know you think theres a mastermind you make colder and now youre ill worse ive been with you ooh if i wanna keep you all no twinkling world but i never listen'
The model is far from creating a top hit, or even something that makes sense. However, many of the words and phrases in the output are actual lyrics from Swift’s songs.
Second Model: Improved LSTM
To improve upon the baseline, a second model was trained with adjustments to better capture structure and reduce repetitive outputs. The goal was not only to generate coherent text, but to move closer to something that resembles natural lyrical flow.
Using the same seed text, “I am”, the updated model generated:
'I am just act down just me in the meaning of gray i reached look and the best at your eye was his lost of black of her’d to watch your life uh but you search hoping bad over the babys blue and they was lettin in his dream long every way instead that clandestine rides and wearing the song i see it all mine but it had and nothin i comb hold to you even well close too the last love id ever go i say that ooh ah ha ah we didnt even ever dance in me that you dont and about me that fuck you is an twenty night and i move better with flames when i felt about silence things in the hand of feelin off you didnt got to the town above the lot of me because give me so its on the film while you can'
Key Observations
While neither model produces a gorgeous lyrical hit, there are clear differences:
Reduced repetition: The improved model shows more varied word usage.
More diverse structure: The second model produces longer, less repetitive sequences that resemble sentence-like patterns.
Increased creativity (with tradeoffs): The improved model introduces more variation, but at the cost of grammatical consistency and clarity.
Persistent limitations: Both models struggle with:
long-term coherence
grammatical structure
maintaining a consistent narrative
Interpretation
The model is successfully learning statistical patterns in Taylor Swift’s lyrics, particularly word associations and stylistic fragments. However, it lacks a deeper understanding of language structure and meaning.
This reflects the great war of LSTM-based text generation: while LSTMs are effective at capturing short-term dependencies, they often struggle with long-range coherence and semantic consistency.
Future Improvements
To further improve performance, several approaches could be explored:
Using pretrained embeddings (e.g. Word2Vec or GloVe)
Transitioning to transformer-based models (e.g. BERT or GPT-style architectures)
Increasing dataset size or augmenting with additional lyrics
Implementing temperature sampling to balance creativity vs. coherence
Fine-tuning hyperparameters (sequence length, embedding size, LSTM units)
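Of these, temperature sampling is the cheapest to try. A minimal sketch: instead of always taking the argmax, the model’s probabilities are rescaled by a temperature before sampling, so low temperatures stay conservative and high temperatures get more adventurous (the toy distribution below is hypothetical):

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    # Rescale log-probabilities by the temperature, re-normalize,
    # then draw a word index from the adjusted distribution.
    if rng is None:
        rng = np.random.default_rng(0)
    logits = np.log(np.asarray(probs) + 1e-9) / temperature
    scaled = np.exp(logits - logits.max())
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

probs = [0.05, 0.7, 0.25]  # toy next-word distribution
greedy = sample_with_temperature(probs, temperature=0.01)  # ≈ argmax
```

As temperature rises above 1, the distribution flattens and rarer words get picked more often, trading coherence for variety.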
My personal laptop’s memory is my albatross, though, so perhaps these improvements will have to wait for another day.
Conclusion
This project demonstrates that even relatively simple neural networks can capture recognizable stylistic elements of an artist’s work. However, generating truly coherent and creative lyrics remains a challenging task, emphasizing the gap between pattern recognition and genuine language understanding.