I’ve been very fortunate to have had the opportunity to intern at PeerIslands. It has been a priceless experience working alongside a team of cross-functional engineers, experts in their respective fields. Over the course of my internship, I’ve been able to lean on and learn from the team’s countless years of collective experience, and make meaningful contributions to work that is helping the industry grow, here in our beautiful Cayman Islands.
My task during the internship
I was tasked with building a Natural Language Processing (NLP) engine that extracts useful information from unstructured text in order to auto-fill a form. Many general-purpose NLP models are available on open-source platforms, trained on everyday English. Initially, we considered Azure Cognitive Services; however, it was too limited when it came to customizing it for our needs.
In order to fit the requirements for our NLP project, two factors had to be considered:
- the effort needed to train a model from scratch
- the complexity of adjusting an existing model to fit our specific needs
Training a model from scratch was not the most efficient way to complete the task. Adjusting the entity types of an existing model, however, began to confuse the statistical system: while learning new entity types, these models can forget phrases they previously recognized as entities.
Our approach
spaCy is an open-source library for advanced NLP in Python. It provides pretrained models that can recognize a range of named and numerical entities. Instead of training a brand new model, I decided to use one of these existing models as a base. On top of it, phrase-matching pipeline components were put in place. A terminology list of phrases for the desired entities was built by scraping open sources like LinkedIn and Google Jobs for relevant information.
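To make the phrase-matching idea concrete, here is a minimal sketch using spaCy's `PhraseMatcher` (assuming spaCy v3). It is not the code from the project: the `JOB_TITLE` label, the three example terms, and the `tag_job_titles` helper are hypothetical, and a blank English pipeline stands in for the pretrained base model so the example runs without downloading model weights.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

# Assumption: a blank English pipeline stands in for the pretrained base model.
nlp = spacy.blank("en")

# Hypothetical terminology list; the real one was built by scraping job listings.
JOB_TITLES = ["software engineer", "data analyst", "project manager"]

# Compare lowercase token text so "Software Engineer" also matches.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("JOB_TITLE", [nlp.make_doc(term) for term in JOB_TITLES])


def tag_job_titles(text):
    """Run the pipeline, then add JOB_TITLE entities found by the phrase matcher."""
    doc = nlp(text)
    matched = [
        Span(doc, start, end, label="JOB_TITLE")
        for _, start, end in matcher(doc)
    ]
    # filter_spans removes overlapping spans, preferring the longest one.
    doc.ents = filter_spans(list(doc.ents) + matched)
    return doc


doc = tag_job_titles("We are hiring a Software Engineer in George Town.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

With a pretrained pipeline in place of the blank one, the matched phrases would sit alongside the entities the statistical model already predicts, without retraining it.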
Along with these pipeline components, token-based rules were added to influence the existing model. This made it possible to manipulate specific entities and to create our own. To illustrate: we looked at the words surrounding a “Date” entity. This helped us understand the context in which the “Date” was being used, making it easier to distinguish between a “Start Date” and a “Duration”.
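The “Date” example can be sketched as a small context rule. This is an illustration under stated assumptions rather than the project's actual rules: the cue words, the two-token window, the `START_DATE` and `DURATION` labels, and the `refine_dates` helper are all hypothetical. The `DATE` entities are set by hand here, standing in for what a pretrained model would predict.

```python
import spacy
from spacy.tokens import Span

# Hypothetical cue words that hint at how a date is being used.
START_CUES = {"start", "starts", "starting", "begins", "beginning", "commencing"}
DURATION_CUES = {"for", "lasting", "duration"}


def refine_dates(doc, window=2):
    """Relabel DATE entities based on the tokens just before them."""
    refined = []
    for ent in doc.ents:
        label = ent.label_
        if label == "DATE":
            # Lowercased text of up to `window` tokens preceding the entity.
            context = {tok.lower_ for tok in doc[max(ent.start - window, 0):ent.start]}
            if context & START_CUES:
                label = "START_DATE"
            elif context & DURATION_CUES:
                label = "DURATION"
        refined.append(Span(doc, ent.start, ent.end, label=label))
    doc.ents = refined
    return doc


# Assumption: a blank pipeline plus hand-set DATE spans replaces a pretrained model.
nlp = spacy.blank("en")
doc = nlp("The contract starts on 1 March 2021 and runs for six months.")
doc.ents = [Span(doc, 4, 7, label="DATE"), Span(doc, 10, 12, label="DATE")]

refine_dates(doc)
print([(ent.text, ent.label_) for ent in doc.ents])
```

A rule like this only inspects token text near the entity, so it leaves the statistical model's weights untouched and cannot cause it to forget anything.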
The distance between certain general entity types was also taken into consideration. This provided further insight into what these entities were and how they related to the overall intent of the text, increasing the probability that the statistical model would correctly evaluate what the given words might mean.
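One way to picture the distance idea is to pair each date with the closest entity of another type, measured in tokens. The sketch below is an assumption about how such a heuristic could look, not the project's implementation: the `token_gap` and `link_dates_to_titles` helpers, the `JOB_TITLE` label, and the five-token cutoff are hypothetical, and the entities are set by hand in place of model predictions.

```python
import spacy
from spacy.tokens import Span


def token_gap(a, b):
    """Number of tokens separating two spans (0 if they touch or overlap)."""
    if a.end <= b.start:
        return b.start - a.end
    if b.end <= a.start:
        return a.start - b.end
    return 0


def link_dates_to_titles(doc, max_gap=5):
    """Pair each DATE with the nearest JOB_TITLE within max_gap tokens."""
    titles = [ent for ent in doc.ents if ent.label_ == "JOB_TITLE"]
    links = []
    if not titles:
        return links
    for date in (ent for ent in doc.ents if ent.label_ == "DATE"):
        nearest = min(titles, key=lambda title: token_gap(date, title))
        if token_gap(date, nearest) <= max_gap:
            links.append((date.text, nearest.text))
    return links


# Assumption: a blank pipeline with hand-set entities replaces a pretrained model.
nlp = spacy.blank("en")
doc = nlp("Worked as Data Analyst in 2019 and later as Project Manager in 2021")
doc.ents = [
    Span(doc, 2, 4, label="JOB_TITLE"),
    Span(doc, 5, 6, label="DATE"),
    Span(doc, 9, 11, label="JOB_TITLE"),
    Span(doc, 12, 13, label="DATE"),
]

links = link_dates_to_titles(doc)
print(links)
```

The cutoff keeps a date from being attached to an entity far away in the text, where the two are less likely to be related.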
In my short time here I’ve learned a great many innovative techniques, from building a functional application to deploying it to production. This has been a great learning experience for me, immersed in a working culture that thrives on innovation and the sharing of knowledge. PeerIslands is the perfect hub for someone willing to continuously learn!
