Building an ML pipeline with ElasticSearch - Part 2
In part one of this tutorial, we successfully pushed course information such as ID, title and description from PeopleSoft into ElasticSearch and then retrieved it with a Jupyter notebook. In this part, let's examine how we can process the unstructured course descriptions to extract keywords. The extracted keywords can be used in a variety of ways: for example, to build a simple word cloud, or as a starting point for creating digital badges for a blockchain solution.
The first step is to read the data from the ElasticSearch instance and load it into a pandas DataFrame for further manipulation.
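The snippet below is a minimal sketch of that step, assuming the index is named courses and the fields are COURSE_ID, COURSE_TITLE and DESCR (adjust these to match whatever mapping you created in part one):

```python
from elasticsearch import Elasticsearch
import pandas as pd

es = Elasticsearch(["http://localhost:9200"])

# Pull back all course documents (10,000 is Elasticsearch's default
# result window, which is plenty for a few thousand courses).
response = es.search(index="courses",
                     body={"query": {"match_all": {}}, "size": 10000})

docs = [hit["_source"] for hit in response["hits"]["hits"]]
df = pd.DataFrame(docs)
df.head()
```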
The next step is to review and clean the data. After converting the DataFrame columns to strings, we can see that there are 2868 courses available:
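Something along these lines, with the same assumed column names:

```python
# Cast all columns to plain strings so the text is easy to clean,
# then check how many courses were loaded.
df = df.astype(str)
print(df.dtypes)
print(len(df))   # 2868 in my case
```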
Not all of the courses contain a description that can be analyzed. There are plenty of ways to fix that: we could generate some text, perhaps from the course title, or replace the blanks with a default value. To keep things simple, I will just discard the courses with a blank description.
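A sketch of that filtering step, again assuming the description lives in a DESCR column:

```python
# Drop courses with a blank description. After the string conversion,
# missing values show up as "" or the literal strings "nan"/"None".
df["DESCR"] = df["DESCR"].str.strip()
df = df[~df["DESCR"].isin(["", "nan", "None"])]
print(len(df))
```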
Now we are ready to start with keyword extraction. There are several approaches we can take in Python:
- Use natural language processing tools like NLTK and Rake (the approach taken below)
- Run statistical models to analyze term frequency
- Use linguistic and/or graph-based algorithms
- Build custom ML models
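I went with Rake. A minimal first pass, assuming the rake-nltk package and the column names used in the earlier sketches, could look like this:

```python
# First pass: Rake with its default NLTK English stopwords.
# Requires: pip install rake-nltk, plus nltk.download("stopwords")
# and nltk.download("punkt") on first use.
from rake_nltk import Rake

r = Rake()

def extract_keywords(text):
    r.extract_keywords_from_text(text)
    return r.get_ranked_phrases()[:10]   # keep the top 10 phrases per course

df["KEYWORDS"] = df["DESCR"].apply(extract_keywords)
df[["COURSE_TITLE", "KEYWORDS"]].head()
```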
This looks all right, considering how little effort it took, but I think we can do a lot better with just a bit more work. Without going into part-of-speech analysis, stemming and lemmatization, we can build a list of additional stop words for Rake to remove before keyword extraction. To get those stop words, I manually reviewed the most frequent terms in the corpus and added the applicable ones to the default English stopword list provided by NLTK.
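A hedged sketch of that review, with placeholder extra stop words rather than my actual list:

```python
# Review the most frequent terms in the corpus to spot domain noise,
# then extend NLTK's English stopwords with the terms that turn up.
import re
from collections import Counter
from nltk.corpus import stopwords

term_counts = Counter()
for text in df["DESCR"]:
    term_counts.update(re.findall(r"[a-z']+", text.lower()))
print(term_counts.most_common(50))

# The extra terms below are placeholders, not the actual list I ended up with.
extra_stopwords = ["course", "courses", "student", "students", "topics", "introduction"]
custom_stopwords = stopwords.words("english") + extra_stopwords
```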
Now, let's re-run the extraction:
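Re-running with the extended stopword list could look like this:

```python
# Same extraction as before, but with the extended stopword list.
from rake_nltk import Rake

r = Rake(stopwords=custom_stopwords)

def extract_keywords(text):
    r.extract_keywords_from_text(text)
    return r.get_ranked_phrases()[:10]

df["KEYWORDS"] = df["DESCR"].apply(extract_keywords)
df[["COURSE_TITLE", "KEYWORDS"]].head()
```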
With this small tweak, we were able to get much better results. We should now be ready to generate a word cloud:
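A minimal word cloud sketch, using the wordcloud package (an assumption on my part; any plotting approach would do):

```python
# Join the extracted phrases into one big string and render a word cloud.
from wordcloud import WordCloud
import matplotlib.pyplot as plt

all_keywords = " ".join(" ".join(phrases) for phrases in df["KEYWORDS"])

wc = WordCloud(width=1200, height=600, background_color="white").generate(all_keywords)

plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```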
Obviously, this is still far from production ready, yet it is exciting to see how much we can accomplish with just a few lines of code!
The full source of the Jupyter notebook can be found in my GitHub repo: https://github.com/nvg/ml_sandbox/blob/master/projects/misc/pses-integration.ipynb
Link to Part 1 - http://blog.psdev.ca/2020/07/building-ml-pipline-with-elasticsearch.html