Building an ML pipeline with ElasticSearch - Part 2

In part one of this tutorial, we pushed course information (ID, title, and description) from PeopleSoft into ElasticSearch and then retrieved it with a Jupyter notebook. In this part, let's examine how we can process the unstructured course descriptions to extract keywords. Extracted keywords are useful in a variety of cases. For example, we can build a simple word cloud, or use them as a starting point for creating digital badges for a blockchain solution.

The first step is to read the data from the ElasticSearch instance and then create a pandas DataFrame object to hold the loaded content for further manipulation.
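
A minimal sketch of the DataFrame part of this step, assuming the search response has the usual Elasticsearch shape with each document under `hits.hits[]._source`. The field names `course_id`, `title` and `description` and the sample response are hypothetical stand-ins for whatever was indexed in part one, and the live query itself is left out.

```python
import pandas as pd


def hits_to_frame(response):
    """Flatten an Elasticsearch-style search response into a DataFrame.

    Assumes each hit carries its document under the "_source" key.
    """
    rows = [hit["_source"] for hit in response["hits"]["hits"]]
    return pd.DataFrame(rows)


# Hypothetical response, shaped like the result of a search call.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "1", "_source": {"course_id": "ACCT101",
                                     "title": "Accounting I",
                                     "description": "Introduction to financial accounting."}},
            {"_id": "2", "_source": {"course_id": "STAT200",
                                     "title": "Statistics",
                                     "description": ""}},
        ]
    }
}

courses = hits_to_frame(sample_response)
print(courses.shape)
```

Keeping the conversion in its own function means the same code works whether the response comes from a live cluster or from a saved sample.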

The next step is to review and clean the data. After converting the DataFrame column types to string, we can see that there are 2868 courses available.
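
As a rough illustration of the type conversion and the row count, here is a small sketch on a made-up three-row frame. The column names and values are assumptions; the real frame holds the 2868 courses loaded from ElasticSearch.

```python
import pandas as pd

# Hypothetical sample; the real frame is loaded from ElasticSearch.
courses = pd.DataFrame(
    {
        "course_id": [1001, 1002, 1003],
        "title": ["Accounting I", "Statistics", "Databases"],
        "description": ["Intro to accounting.", "", "Relational databases."],
    }
)

courses = courses.astype(str)  # cast every column to string
course_count = len(courses)    # one row per course

print(courses.dtypes)
print(f"{course_count} courses loaded")
```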

Not all of the courses contain a description that can be analyzed. There are plenty of options to fix that. We could generate some data, perhaps by using the course title, or replace blanks with a default value. To keep things simple, I will just discard the courses with a blank description.
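
One way to discard those rows, sketched on a made-up frame. The `description` column name is an assumption, and "blank" is taken here to mean missing, empty, or whitespace-only.

```python
import pandas as pd


def drop_blank_descriptions(frame, column="description"):
    """Return only the rows whose description has non-whitespace text."""
    text = frame[column].fillna("").astype(str).str.strip()
    return frame[text != ""].reset_index(drop=True)


# Hypothetical sample with an empty, a whitespace-only and a missing description.
courses = pd.DataFrame(
    {
        "title": ["Accounting I", "Statistics", "Databases", "Ethics"],
        "description": ["Intro to accounting.", "", "   ", None],
    }
)

usable = drop_blank_descriptions(courses)
print(f"kept {len(usable)} of {len(courses)} courses")
```

If the frame was cast to string beforehand, missing values can already have turned into literal strings such as "nan" or "None", depending on the pandas version. This sketch does not try to detect those.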

Now we are ready to start with the keyword extraction. There are several ways to do that in Python:

  • Use natural language processing tools like NLTK and Rake
  • Run statistical models to analyze term frequency
  • Use linguistic and/or graph-based algorithms
  • Build custom ML models

To keep things simple, I will try it with Rake. First, let's just feed the course descriptions "as is" into Rake and see what we get.
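
To make it concrete what Rake does with a description, here is a simplified, self-contained sketch of the RAKE idea: split the text into candidate phrases at stop words and punctuation, score each word by its degree divided by its frequency, and sum the word scores per phrase. It is an illustration, not the code behind the library used in the notebook, and the tiny stop word set and the sample sentence are made up.

```python
import re
from collections import defaultdict

# Tiny hypothetical stop word set; a real run would use a full English list.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "this", "will", "with"}


def candidate_phrases(text, stopwords):
    """Split text into runs of consecutive non-stop words."""
    phrases = []
    for fragment in re.split(r"[^\w\s-]+", text.lower()):  # break at punctuation
        current = []
        for word in fragment.split():
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text, stopwords=STOPWORDS):
    """Rank candidate phrases by the sum of degree/frequency word scores."""
    phrases = candidate_phrases(text, stopwords)
    freq = defaultdict(int)    # how often each word occurs
    degree = defaultdict(int)  # total length of the phrases it occurs in
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {word: degree[word] / freq[word] for word in freq}
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


description = ("This course is an introduction to financial accounting. "
               "Students will learn financial reporting.")
ranked = rake_scores(description)
for phrase, score in ranked:
    print(f"{score:4.1f}  {phrase}")
```

On this sample the top phrase is "learn financial reporting": a generic verb such as "learn" rides along with the useful words because it is not in the stop word set. That is the kind of noise a corpus-specific stop word list is meant to remove.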

This looks all right, considering how little effort it took. I hope we can do a lot better with just a bit more effort. Without going into part-of-speech analysis, stemming, or lemmatization, we can build a list of additional stop words for Rake to filter out before keyword extraction. To get those stop words, I manually reviewed the most frequent terms in the corpus and added the applicable ones to the default English stop word list provided by NLTK.
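
The review of frequent terms can be sketched with a plain word count. The sample descriptions, the small base stop word set, and the hand-picked additions below are all hypothetical; in the article's setting the base would be NLTK's English list (for example `stopwords.words("english")`), and the manual review decides which frequent terms are noise.

```python
import re
from collections import Counter


def most_frequent_terms(descriptions, top_n=10):
    """Count lowercase word occurrences across all descriptions."""
    counts = Counter()
    for text in descriptions:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts.most_common(top_n)


# Hypothetical corpus standing in for the course descriptions.
descriptions = [
    "This course covers the principles of accounting.",
    "This course introduces students to statistics.",
    "Students will study the course material in depth.",
]

# Small stand-in for the default English stop word list.
BASE_STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "will", "this"}

top_terms = most_frequent_terms(descriptions, top_n=5)
print(top_terms)

# Picked by hand after looking at the frequent terms above.
custom_stopwords = {"course", "students"}
all_stopwords = BASE_STOPWORDS | custom_stopwords
```

The merged set can then be handed to the keyword extractor. If the rake-nltk package is the one in use, its `Rake` constructor accepts a `stopwords` argument for this purpose.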

Now, let's re-run the extraction.

With this small tweak, we were able to get much better results. We are now ready to generate the word cloud.
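
A possible shape for this last step, assuming the third-party `wordcloud` package is used for rendering. The sketch counts how many courses each keyword phrase appears in and passes the counts to `WordCloud.generate_from_frequencies`. The sample keywords and the output file name are made up, and the rendering call is left commented out so that the counting part runs on its own.

```python
from collections import Counter


def phrase_frequencies(keywords_per_course):
    """Count how many courses each keyword phrase appears in."""
    counts = Counter()
    for phrases in keywords_per_course:
        counts.update(set(phrases))  # count each phrase once per course
    return dict(counts)


def render_cloud(frequencies, path):
    """Render the frequencies to an image file; needs the wordcloud package."""
    from wordcloud import WordCloud  # imported here so the counting code runs without it

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies).to_file(path)


# Hypothetical keyword lists, one per course.
keywords_per_course = [
    ["financial accounting", "financial reporting"],
    ["financial accounting", "cost analysis"],
    ["relational databases"],
]

frequencies = phrase_frequencies(keywords_per_course)
print(frequencies)
# render_cloud(frequencies, "course_keywords.png")  # uncomment with wordcloud installed
```

Counting per course rather than per occurrence keeps one long description from dominating the cloud. Weighting by the Rake score instead would be an equally reasonable choice.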

Obviously, this is far from production-ready, yet it is still extremely exciting to see how much we can accomplish with just a few lines of code!

The full source of the Jupyter notebook can be found in my GitHub repo: https://github.com/nvg/ml_sandbox/blob/master/projects/misc/pses-integration.ipynb

Link to Part 1 - http://blog.psdev.ca/2020/07/building-ml-pipline-with-elasticsearch.html
