- Built with Python 3 and Flask
- Scripts from the first season of AMC's Breaking Bad, from springfieldspringfield.co.uk
- NLTK for data cleaning
- sklearn for tf-idf and cosine similarity
- NLTK: Natural Language Toolkit. A software suite used to perform natural language processing.
- Lemmatize: This is the act of reducing words to their base forms. These base forms are known as 'lemmas' or 'dictionary forms'. For example, 'walk' is the base form of 'walking', 'walked', and 'walks'.
- tf-idf: Term frequency-inverse document frequency. This is a number that measures how important a word is to one document within a collection of documents. A higher score means that the word appears often in that document but rarely in the rest of the collection.
- Cosine similarity: this measures how closely two vectors point in the same direction. It is the same cosine we learned in trigonometry class, applied to the angle between two vectors in k-dimensional space. A score of 1 means the vectors point the same way; a score of 0 means they share nothing.
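
To make the last two definitions concrete, here is a minimal sketch in plain Python. It assumes the textbook weighting tf × log(N / df), which is not exactly the smoothed formula sklearn uses, and the function names `tf_idf_vectors` and `cosine_similarity` are our own illustrative helpers, not library calls.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return (vocabulary, tf-idf vectors) for a list of tokenized documents.

    Assumption: textbook weighting tf * log(N / df), without smoothing.
    """
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # Document frequency: how many documents contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = max(len(doc), 1)  # avoid dividing by zero on an empty document
        vectors.append(
            [(counts[w] / length) * math.log(n_docs / df[w]) for w in vocab]
        )
    return vocab, vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical toy documents, already split into words.
docs = [
    ["cook", "the", "batch"],
    ["the", "lab", "the", "van"],
    ["cook", "in", "the", "lab"],
]
vocab, vectors = tf_idf_vectors(docs)
```

In this toy collection, 'the' appears in every document, so its weight is zero everywhere, while 'batch' appears in only one document and gets the largest weight there.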
We first pulled scripts for the first season of AMC's Breaking Bad from springfieldspringfield, then did a quick pass to scrub out obscenities and such since this is a family website. (NB: We believe our use here constitutes fair use of any potentially copyrighted material, as it is solely for educational/research purposes.)
Using NLTK, we extract words, eliminate English stop words, and lemmatize with WordNetLemmatizer. Using sklearn's TfidfVectorizer, we build the tf-idf-weighted document-term matrix, treating each line of the scripts as a document. We then use sklearn's cosine similarity to find the line whose vector is most similar to the sentence typed in by the user, and we return the next line of the script.
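
The retrieval step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: `SCRIPT_LINES`, `build_index`, and `reply` are hypothetical names, the sample lines are made up, and the NLTK lemmatization step is left out, with sklearn's built-in English stop word list standing in for the NLTK cleaning.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in lines; the real project uses the cleaned scripts.
SCRIPT_LINES = [
    "Where did you put the money?",
    "The money is in the storage unit.",
    "Did you finish cooking the batch?",
    "The batch is done and it is pure.",
]


def build_index(lines):
    """Fit a tf-idf vectorizer, treating each script line as one document."""
    # Assumption: sklearn's stop word list replaces the NLTK cleaning step.
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(lines)
    return vectorizer, matrix


def reply(user_text, lines, vectorizer, matrix):
    """Return the line that follows the script line most similar to user_text."""
    query = vectorizer.transform([user_text])
    # One row of similarity scores: the query against every script line.
    scores = cosine_similarity(query, matrix)[0]
    best = int(scores.argmax())
    # If the best match is the final line, there is no next line,
    # so this sketch falls back to the match itself.
    next_index = min(best + 1, len(lines) - 1)
    return lines[next_index]


vectorizer, matrix = build_index(SCRIPT_LINES)
print(reply("where is the money", SCRIPT_LINES, vectorizer, matrix))
```

Building the index once and reusing `vectorizer` and `matrix` for every query keeps each reply down to one transform and one similarity computation.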
It's not quite at the level of a real chatbot, but it is fairly entertaining for just a few lines of Python! Most of our effort was spent trying to figure out if it was a good idea to try to force Flask to keep the matrix in memory to speed up performance. It...was not. In a production environment, it is likely best to serve this sort of thing as an API on a separate server.
This Helps Find Financial Crimes?
Natural language processing has so many applications! With good NLP, you can automate negative news searches, comb through legal documents to identify beneficial ownership, pre-populate suspicious activity reports -- basically, it can help ease the aggregation and comprehension of unstructured language documents, which is usually a very labor-intensive job.