Building an ML pipeline with ElasticSearch - Part 1

ElasticSearch (ES) makes it easier than ever to explore PeopleSoft data. Prior to ES, the two ways to access the data were integration and direct SQL access. Both come with downsides. Integration is always hard: it requires developing an App Engine program or an Integration Broker service. SQL access poses security risks because it requires a direct connection to the underlying database. ElasticSearch is a great solution to this problem.


In this tutorial series, the plan is to build a Machine Learning pipeline that pushes PeopleSoft Campus Solutions course info such as ID, title and description into an ElasticSearch server. The next step is to run a keyword extraction algorithm on this data. This post focuses on the first part: getting the data ready. Let's start!

Building a PSQuery

We begin by building a PSQuery to extract the data. To keep things simple, I will ignore effective date logic and retrieve all records from the course catalogue:


In the query, make sure there is a timestamp field configured as a prompt value. This is crucial for building search index increments. Another very important part of the query is the drilling URL. This URL is used as the document ID in the index and therefore must be unique for each document:


This is what the query looks like:



To test it, you can use '01/01/1900 0:0' in the prompt.
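To illustrate why the timestamp prompt matters for incremental index builds, here is a minimal Python sketch that mimics the filtering outside of PeopleSoft. The rows, the `LAST_UPDATED` field name, the date format and the "on or after" comparison are all assumptions made for illustration; the real behaviour is defined by the query and the search framework.

```python
from datetime import datetime

# Assumed prompt format (month/day/year hour:minute), guessed from the
# test value '01/01/1900 0:0'. The real format may differ per installation.
PROMPT_FORMAT = "%m/%d/%Y %H:%M"


def rows_changed_since(rows, prompt_value):
    """Return rows whose LAST_UPDATED is on or after the prompt timestamp."""
    cutoff = datetime.strptime(prompt_value, PROMPT_FORMAT)
    return [row for row in rows if row["LAST_UPDATED"] >= cutoff]


# Hypothetical rows standing in for the query result.
rows = [
    {"CRSE_ID": "000001", "LAST_UPDATED": datetime(2019, 5, 1, 9, 30)},
    {"CRSE_ID": "000002", "LAST_UPDATED": datetime(2020, 6, 15, 14, 0)},
]

# A very old timestamp selects everything (comparable to a full build).
full_build = rows_changed_since(rows, "01/01/1900 0:0")

# A recent timestamp selects only rows changed since then (an increment).
increment = rows_changed_since(rows, "01/01/2020 0:0")
```

With a prompt value far in the past, every row qualifies, which is why such a value is convenient for testing.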

Building Search Index

Once the query is ready, we need to push the results to ElasticSearch. For that, a search index definition needs to be created and exported. The steps to accomplish that are described in this post: http://blog.psdev.ca/2020/06/kibana-visualization-cheat-sheet.html.


Fingers crossed, the search index should now be available in ElasticSearch:


In the index, the drilling URL is used as the document ID, so it's crucial that it is distinct for each document. Otherwise, documents will be overwritten by ElasticSearch.
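A quick way to catch this problem before indexing is to check the query output for repeated drilling URLs. The sketch below is only an illustration: the `DRILL_URL` field name and the sample rows are hypothetical, and the dictionary merely imitates how a document indexed under an existing ID replaces the earlier one.

```python
from collections import Counter


def find_duplicate_ids(rows, id_field="DRILL_URL"):
    """Return the ID values that occur more than once, sorted."""
    counts = Counter(row[id_field] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def simulate_index(rows, id_field="DRILL_URL"):
    """Imitate indexing by ID: a later row with the same ID replaces the earlier one."""
    index = {}
    for row in rows:
        index[row[id_field]] = row
    return index


# Hypothetical query rows; the first two share the same drilling URL.
rows = [
    {"DRILL_URL": "c/COURSE?CRSE_ID=000001", "TITLE": "Algebra I"},
    {"DRILL_URL": "c/COURSE?CRSE_ID=000001", "TITLE": "Algebra I (old offering)"},
    {"DRILL_URL": "c/COURSE?CRSE_ID=000002", "TITLE": "Biology"},
]

duplicates = find_duplicate_ids(rows)
indexed = simulate_index(rows)
```

Here three rows collapse into two indexed documents, so one course silently disappears from the index.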


Once the content is available in ES, we can consume it in a Jupyter notebook using the appropriate Python library:


The ES Python library is a wrapper over the ElasticSearch REST API. PeopleSoft adds a few customizations to ES, so it's necessary to provide additional security information in the headers when making search calls. Otherwise, ES returns an HTTP 500 response.
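As a rough illustration of what such a call carries, the sketch below builds the search request with Python's standard library instead of the ES client. The `SearchUser` header, host, index name and credentials are the sample values from the curl command in the Misc Info section; `build_search_request` is a hypothetical helper name, and the exact headers your environment expects may differ.

```python
import base64
import json
import urllib.request


def build_search_request(base_url, index, es_user, es_password, search_user, query=None):
    """Build (but do not send) a _search request with the extra SearchUser header."""
    body = json.dumps(query or {"query": {"match_all": {}}}).encode("utf-8")
    token = base64.b64encode(f"{es_user}:{es_password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=body,
        method="POST",  # _search accepts POST as well as GET with a body
        headers={
            "Authorization": f"Basic {token}",
            "SearchUser": search_user,  # PeopleSoft user, as in the curl example
            "Content-Type": "application/json",
        },
    )


request = build_search_request(
    "http://mypum.oraclevcn.com:9200",
    "psd_course_catalog_cs2hq2",
    "admin",
    "pass",
    "PS",
)

# To actually send it (requires a reachable server):
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Note that `urllib` normalizes the capitalization of header names (for example `Searchuser`); HTTP header names are case-insensitive, so this should not matter to the server.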

After all the steps are complete, course information should be available for text analysis!
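For text analysis, it is usually convenient to flatten the search response into plain records. A minimal sketch follows, assuming the standard ES response layout (`hits.hits[]` with `_id` and `_source`). The field names inside `_source` are hypothetical placeholders, since the real ones depend on the search definition.

```python
def hits_to_records(response):
    """Flatten an ES search response into a list of dicts, one per document."""
    hits = response.get("hits", {}).get("hits", [])
    return [{"id": hit["_id"], **hit.get("_source", {})} for hit in hits]


# Hand-written sample in the shape of an ES search response.
# TITLE and DESCRLONG are hypothetical field names.
sample_response = {
    "hits": {
        "total": {"value": 2, "relation": "eq"},
        "hits": [
            {"_id": "url-1", "_source": {"TITLE": "Algebra I", "DESCRLONG": "Intro to algebra."}},
            {"_id": "url-2", "_source": {"TITLE": "Biology", "DESCRLONG": "Cells and organisms."}},
        ],
    }
}

records = hits_to_records(sample_response)
```

A list of dictionaries like this can be passed directly to `pandas.DataFrame(records)` in the notebook if pandas is available.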

Misc Info

In the vanilla PUM image, ElasticSearch and Kibana sometimes do not start after a VM restart. To start them, you can run the appropriate scripts in:

/opt/oracle/psft/pt/ES/pt


cd /opt/oracle/psft/pt/ES/pt/elasticsearch7.0.0/
./bin/elasticsearch -d -p pid
cd /opt/oracle/psft/pt/ES/pt/Kibana7.0.0/bin
nohup ./kibana &

To test ES, you can use curl:

curl -u admin:pass -H "SearchUser: PS" -XGET "http://mypum.oraclevcn.com:9200/psd_course_catalog_cs2hq2/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_all": {}
  }
}'


Link to Part 2 - http://blog.psdev.ca/2020/07/building-ml-pipline-with-elasticsearch.html
