The Tortured Analysts Department:

The Anthology

Training an LSTM model on Taylor Swift's discography to generate Swift-esque lyrics.

Summary

Like its namesake album, this is a double project. This second part builds on the lyrical dataset from Part One to train a word-level LSTM model that generates new lyrics in Taylor Swift's style. Two models are trained and compared: a baseline and an improved version with architectural adjustments to reduce repetition and improve sequence diversity.

Skills Demonstrated

LSTM architecture design, RNN theory (vanishing gradient problem, LSTM vs. vanilla RNN trade-offs), sequence modeling, n-gram sequence generation, one-hot encoding, embedding layers, dropout regularization, early stopping, model iteration and comparative evaluation, multi-class classification

Tools & Libraries

Python, TensorFlow/Keras, pandas, Genius API, GitHub

What Worked

Both models successfully learn short-range word associations and stylistic fragments from the corpus: recognizable phrases and lyrical patterns surface in the output. The improved model shows less repetition and more varied sentence-like structure.

What Did Not Work

Neither model produces coherent lyrics, which is expected. Word-level LSTMs on a ~250-song corpus struggle with long-range coherence, grammatical structure, and narrative consistency. This is a known limitation of the architecture, not a data problem.

Potential Next Steps

Pretrained embeddings (Word2Vec, GloVe), temperature sampling to tune the creativity/coherence trade-off, and a transformer-based approach (GPT-style fine-tuning) would all meaningfully improve output quality. The honest constraint was compute; my laptop's memory set the ceiling on feasibility.

Introduction

To predict the lyrics of a Swift song, an LSTM model is built. Long short-term memory (LSTM) is a type of recurrent neural network (RNN). RNNs carry information from earlier inputs forward and use it to process the current input; however, vanilla RNNs suffer from the vanishing gradient problem, in which gradients shrink as they are propagated back through many time steps, so the network struggles to learn long-term dependencies. LSTMs add gated memory cells designed to mitigate this problem.

A word-level approach is taken, as opposed to a character-level one. Each word is treated as a single unit, and the model attempts to predict the next word. This keeps the output readable, since the model can only emit words from its vocabulary and never invents random strings of characters. On the other hand, it requires a lot of memory, because the output layer must cover the entire vocabulary. My little laptop must stay strong.
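
The vanishing gradient can be made concrete with the standard textbook formulation below. This is a sketch of the general argument, not something taken from the project code; the symbols W, U, b, h, c, f, and i are the usual conventions and are assumptions here.

```latex
% Vanilla RNN hidden state (standard form; symbols are illustrative)
h_t = \tanh\left(W h_{t-1} + U x_t + b\right)

% Gradient of a late state with respect to an earlier one:
% a product of t-k Jacobians, one per time step
\frac{\partial h_t}{\partial h_k}
  = \prod_{i=k+1}^{t} \operatorname{diag}\left(1 - h_i^{2}\right) W

% Each diagonal factor has norm at most 1, so the product
% shrinks exponentially in t-k whenever \|W\| < 1
\left\lVert \frac{\partial h_t}{\partial h_k} \right\rVert
  \le \lVert W \rVert^{\,t-k}

% LSTM cell state: an additive update controlled by the forget gate f_t,
% which gives gradients a path that is not repeatedly squashed by W
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
```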

Skills Demonstrated

LSTM architecture design, RNN theory (vanishing gradient, LSTM vs. RNN trade-offs), sequence modeling, one-hot encoding, embedding layers, dropout regularization, early stopping, model iteration and comparison, TensorFlow/Keras

Key Observations

While neither model produces a gorgeous lyrical hit, there are clear differences:

  • Reduced repetition: The improved model shows more varied word usage.

  • More diverse structure: The second model produces longer, less repetitive sequences that resemble sentence-like patterns.

  • Increased creativity (with tradeoffs): The improved model introduces more variation, but at the cost of grammatical consistency and clarity.

  • Relentless limitations: Both models struggle with:

    • long-term coherence

    • grammatical structure

    • maintaining a consistent narrative

Future Improvements

To further improve performance, several approaches could be explored:

  • Using pretrained embeddings (e.g. Word2Vec or GloVe)

  • Transitioning to transformer-based models (e.g. fine-tuning a GPT-style architecture, which is built for text generation)

  • Increasing dataset size or augmenting with additional lyrics

  • Implementing temperature sampling to balance creativity vs. coherence

  • Fine-tuning hyperparameters (sequence length, embedding size, LSTM units)

My personal laptop’s memory is my albatross, though, so perhaps these improvements will have to wait for another day.
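
Of the ideas above, temperature sampling is the cheapest to try, since it only changes how the next word is picked from the model's output probabilities. The following is a minimal sketch using only the standard library; the function names and the example probabilities are hypothetical and not part of the project code.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale a probability distribution by a temperature.

    temperature < 1 sharpens the distribution (safer, more repetitive);
    temperature > 1 flattens it (more varied, less coherent).
    """
    # Work in log space, then renormalize with a numerically stable softmax.
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_with_temperature(probs, temperature, rng):
    """Draw one index at random from the temperature-scaled distribution."""
    scaled = apply_temperature(probs, temperature)
    return rng.choices(range(len(scaled)), weights=scaled, k=1)[0]


# Made-up next-word probabilities for a three-word vocabulary.
probs = [0.7, 0.2, 0.1]
cold = apply_temperature(probs, 0.5)   # most likely word becomes more dominant
hot = apply_temperature(probs, 2.0)    # probabilities move closer together
choice = sample_with_temperature(probs, 1.0, random.Random(0))
```

In a generation loop, a sampled index like `choice` would replace the highest-probability pick, with the temperature tuned by reading the output.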

Data Collection and Preprocessing

The data, lyrics from every Taylor Swift song, is collected from the Genius API. The process is discussed in Part One. The ‘Lyrics’ column is joined into a single string.

Preparing the Data

The lyrics string is tokenized, which splits it into individual words, and the vocabulary is built from the unique words, each assigned an integer code. The uncleaned text is used so that the output reads like a genuine song, or as close to one as possible. Input sequences are then created from the ‘Lyrics’ column with a for loop that cycles through each song and converts its words into their integer codes. A nested for loop builds n-gram sequences from those codes by adding one word at a time, so each song yields a set of sequences of increasing length.

The sequences are then padded to a common length, since the network is trained on batches of fixed-length inputs.

The sequences are divided into predictors and labels: the predictors are every token except the last, and the last token is the label. The label integers are converted to one-hot encoded format, which turns each integer into a vector of zeros with a single one at the position of that integer, so that the labels match the shape of the model's output. Finally, the data is split into 75% training and 25% test data.
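
The nested-loop construction described above can be sketched in plain Python. This is an illustration under assumptions, not the project's code: the tokenizing rule, the function names, and the toy "songs" are all hypothetical, and index 0 is left unused on the assumption that it is reserved for padding.

```python
import re


def tokenize(text):
    # Assumed rule: lowercase, keep runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocab(songs):
    """Assign each unique word an integer code, starting at 1."""
    word_index = {}
    for song in songs:
        for word in tokenize(song):
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index


def ngram_sequences(songs, word_index):
    """Emit every prefix of at least two words from each song."""
    sequences = []
    for song in songs:                          # outer loop: one song at a time
        codes = [word_index[w] for w in tokenize(song)]
        for end in range(2, len(codes) + 1):    # inner loop: add one word
            sequences.append(codes[:end])
    return sequences


songs = ["I am the one", "I am here"]  # toy stand-ins, not real lyrics
word_index = build_vocab(songs)
sequences = ngram_sequences(songs, word_index)
```

Each sequence later becomes one training example: all but its last code is the input, and the last code is the word to predict.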
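
A small sketch of the padding step follows. It pads with zeros at the front of each sequence, which is also the default behavior of Keras's `pad_sequences`; whether the project pads at the front or the back is not stated, so that choice and the function name are assumptions.

```python
def pad_sequences_pre(sequences, pad_value=0):
    """Left-pad every sequence with pad_value up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [[pad_value] * (max_len - len(seq)) + list(seq) for seq in sequences]


# Three n-gram sequences of different lengths become one rectangular batch.
padded = pad_sequences_pre([[1, 2], [1, 2, 3], [1, 2, 3, 4]])
```

Padding at the front keeps the most recent words next to the position being predicted, which is why it is a common choice for next-word models.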
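
The predictor/label split, the one-hot encoding, and the 75/25 split can be sketched as follows. The helper names, the random seed, and the toy data are hypothetical; the project may well use library helpers for these steps instead.

```python
import random


def split_predictors_labels(padded):
    """Everything but the last token is the input; the last token is the label."""
    predictors = [seq[:-1] for seq in padded]
    labels = [seq[-1] for seq in padded]
    return predictors, labels


def one_hot(label, num_classes):
    """Vector of zeros with a single one at position `label`."""
    vector = [0] * num_classes
    vector[label] = 1
    return vector


def train_test_split(items, test_fraction=0.25, seed=13):
    """Randomly hold out a fraction of the items as test data."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    test_idx = set(indices[:round(len(items) * test_fraction)])
    train = [item for i, item in enumerate(items) if i not in test_idx]
    test = [item for i, item in enumerate(items) if i in test_idx]
    return train, test


padded = [[0, 0, 1, 2], [0, 1, 2, 3], [1, 2, 3, 4], [0, 0, 1, 2]]
vocab_size = 5  # four toy words plus index 0, assumed reserved for padding
predictors, labels = split_predictors_labels(padded)
one_hot_labels = [one_hot(label, vocab_size) for label in labels]
train_pairs, test_pairs = train_test_split(list(zip(predictors, one_hot_labels)))
```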

First Model

Training the Model

The first model is created with the following layers:

  • Input or Embedding Layer: maps each word index to a dense 50-dimensional vector

  • LSTM Layer: 100 LSTM units process the sequence and capture the context of the words

  • Dropout Layer: randomly drops some neurons during training, making the model less reliant on any individual neuron and thus reducing overfitting; the dropout rate is set to 0.1

  • Output or Dense Layer: has one neuron per word in the vocabulary; prepares the model to choose the next word

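Since this is a multi-class classification problem, the loss function is sparse_categorical_crossentropy. In Keras this loss expects integer class labels, whereas one-hot encoded labels pair with categorical_crossentropy, so the label format and the loss must agree. Early stopping is implemented to avoid over-training by halting the training process once the model stops improving. The accuracy and loss are graphed as the model is trained.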
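
The layer stack and training setup described above might look roughly like this in Keras. The layer sizes and dropout rate come from the description; everything else is an assumption made for illustration: the softmax activation, the Adam optimizer, the early-stopping settings, the function name, and the example vocabulary size and input length.

```python
import tensorflow as tf


def build_baseline_model(vocab_size, input_length):
    """Embedding -> LSTM -> Dropout -> Dense, as listed in the write-up."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_length,)),
        # Each word index becomes a dense 50-dimensional vector.
        tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=50),
        tf.keras.layers.LSTM(100),
        tf.keras.layers.Dropout(0.1),
        # One output per vocabulary word; softmax is assumed here.
        tf.keras.layers.Dense(vocab_size, activation="softmax"),
    ])
    model.compile(
        # Expects integer labels; one-hot labels would need
        # "categorical_crossentropy" instead.
        loss="sparse_categorical_crossentropy",
        optimizer="adam",  # assumed optimizer
        metrics=["accuracy"],
    )
    return model


# Stop when validation loss stalls; patience of 3 epochs is an assumed value.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True
)

model = build_baseline_model(vocab_size=5000, input_length=19)
```

Training would then pass `callbacks=[early_stopping]` to `model.fit` along with validation data, and the returned history object holds the accuracy and loss values for plotting.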

Predicting the Next Word

A function is defined to predict the next word given two arguments, the model and the seed text. The seed text is the opening phrase given to the model, which it uses to guess the next word. The words “I am” are used as the seed text. In a for loop, the seed text is prepared for the model by tokenizing and padding.

The model outputs a probability for every word in the vocabulary. The word with the highest probability is chosen as the next word and appended to the seed text. The updated seed text is then fed back into the model for the next prediction. The process is repeated until the desired number of words, for example 150, has been generated.
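
The generation loop can be sketched without a trained network by passing in any function that returns next-word probabilities. Here `toy_predict_probs` is a hypothetical stand-in for the model's prediction call, and the function names, vocabulary, and lengths are all made up for the example.

```python
def generate_lyrics(predict_probs, word_index, seed_text, num_words, input_length):
    """Greedy generation: always append the highest-probability word."""
    index_word = {code: word for word, code in word_index.items()}
    text = seed_text
    for _ in range(num_words):
        # Tokenize the running text; words outside the vocabulary are skipped.
        codes = [word_index[w] for w in text.lower().split() if w in word_index]
        # Keep the most recent words and left-pad with zeros to a fixed length.
        codes = codes[-input_length:]
        padded = [0] * (input_length - len(codes)) + codes
        probs = predict_probs(padded)
        # Pick the most likely word, ignoring index 0 (padding).
        next_code = max(range(1, len(probs)), key=probs.__getitem__)
        text += " " + index_word[next_code]
    return text


def toy_predict_probs(padded):
    # Stand-in for a trained model: puts all probability on the code that
    # follows the last input code, wrapping around after code 4.
    probs = [0.0] * 5
    probs[padded[-1] % 4 + 1] = 1.0
    return probs


word_index = {"i": 1, "am": 2, "the": 3, "one": 4}
lyrics = generate_lyrics(toy_predict_probs, word_index, "I am", 3, 3)
```

Because the pick is always the single most likely word, this style of decoding tends to fall into loops, which is one source of the repetition seen in generated text.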

This baseline model generates the following:

'I am a whole times but i do i wish i wish he was bought it just right yeah you would have to break for me and you have to wrapped buried with you scarred and its under her eyes but its a sunshine honey youre taking a might feelin you just hear my hand off your version of your moment foolish one day well boarded up on your grave and its laughin at our house behind never never turn in paris oh she was both gone was the hospital first wonderland home now it was right tonight with we went around the stairs but i seem really will closed you makes you want to call right but now we know you think theres a mastermind you make colder and now youre ill worse ive been with you ooh if i wanna keep you all no twinkling world but i never listen'

The model is far from creating a top hit, or even something that makes sense. However, many of the words and phrases in the output are actual lyrics from Swift’s songs.

Second Model: Improved LSTM

To improve upon the baseline model, a second model was trained with adjustments to better capture structure and reduce repetitive outputs. The goal was not only to generate coherent text, but to move closer to something that resembles natural lyrical flow.

Using the same seed text, “I am”, the updated model generated:

'I am just act down just me in the meaning of gray i reached look and the best at your eye was his lost of black of her’d to watch your life uh but you search hoping bad over the babys blue and they was lettin in his dream long every way instead that clandestine rides and wearing the song i see it all mine but it had and nothin i comb hold to you even well close too the last love id ever go i say that ooh ah ha ah we didnt even ever dance in me that you dont and about me that fuck you is an twenty night and i move better with flames when i felt about silence things in the hand of feelin off you didnt got to the town above the lot of me because give me so its on the film while you can'

Interpretation

The model is successfully learning statistical patterns in Taylor Swift’s lyrics, particularly word associations and stylistic fragments. However, it lacks a deeper understanding of language structure and meaning.

This reflects the great war of LSTM-based text generation: while these models are effective at capturing short-term dependencies, they often struggle with long-range coherence and semantic consistency.

Conclusion

This project demonstrates that even relatively simple neural networks can capture recognizable stylistic elements of an artist’s work. However, generating truly coherent and creative lyrics remains a challenging task, emphasizing the gap between pattern recognition and genuine language understanding.