As alternative data becomes increasingly important in investment strategies, natural language processing is the technology that will supercharge it. However, getting NLP right is no easy feat, as Saeed Amen explains. Amen explores the intricacies of linguistics and looks at the techniques that can help analyse language.
Humans are pretty flexible when it comes to interpreting the world. We are constantly ingesting both sound and vision, and this complicated dataset is structured into usable information. Sound waves are immediately converted into language, as are the outlines of words on a page which we see. We can even do the reverse transformation, converting words into sound through speech and into text through writing. Despite the complexity of all these processes, even a toddler can master many of these tasks.
How can we get a computer both to generate and to understand language as a human does? This is what the field of natural language processing seeks to do.
We can ingest massive amounts of text and get a computer to number crunch it. For traders, there is simply too much text being generated to read it all themselves, whether it’s text from the web, social media, or newswires. Hence, being able to automate the reading (and the understanding) of this text would be very helpful.
The building blocks of language
If we take a step back and think about linguistics, there are several different ways to look at language. The first stage is phonetics, relating to the sounds humans can make. Then we have phonology, relating to the sounds of specific languages, which can vary considerably between languages. Then we have morphology, which governs how words are constructed and how they can be broken down into different parts. In certain languages, morphology can be particularly important.
Syntax, meanwhile, dictates how words can be combined to make grammatically correct sentences. We again see significant differences in how syntax is applied. In some languages, word order is less important and instead the word forms dictate how words relate to one another. In others, such as English, word order is very important. Indeed, in English, syntax dictates that words follow a subject-verb-object ordering. Semantics is about the meaning of language, answering questions such as who, what, where, and when? And finally, we have pragmatics, in which we seek to understand language in context – the information which is not necessarily available in the text.
Natural language processing encompasses many of the tasks and functions that the above stages fulfil.
How to kick-start your natural language processing model?
In order to perform many higher-level NLP tasks, we first need to break down the text in a process called normalisation. In particular, we need to identify what parts of the text are words, using tokenisation or word segmentation. How this is done depends on the language. As an example, word segmentation algorithms in Chinese are very different to those in English. It is also necessary to remove stop words, such as "the", which add little to the meaning.
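As a minimal sketch in pure Python, with an invented toy stop-word list (real pipelines would use a fuller list, such as those shipped with NLTK or spaCy), tokenisation and stop-word removal for English might look like:

```python
import re

# Toy stop-word list for illustration only; real pipelines use much
# fuller lists from libraries such as NLTK or spaCy.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def tokenise(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def normalise(text):
    """Tokenise, then drop stop words that add little meaning."""
    return [tok for tok in tokenise(text) if tok not in STOP_WORDS]

tokens = normalise("The central bank is likely to raise rates in June")
print(tokens)  # ['central', 'bank', 'likely', 'raise', 'rates', 'june']
```

For a language like Chinese, where words are not delimited by spaces, the simple regular expression above would not work at all, which is why word segmentation there requires very different algorithms.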
Computers like numbers! Hence, we need to create a numerical representation of the text to do more complex analysis. A word embedding is a vectorised representation of our text. There are many different types of word embedding algorithms. How these word embeddings are constructed will impact the accuracy of higher level NLP tasks.
Bag-of-words is a simple word embedding. It involves counting the number of times each word appears in a text; the words are represented as an unordered "bag". We can then use this bag-of-words for a higher-level task such as sentiment analysis. Using a lexicon that assigns scores to words according to how positive or negative they are, we can calculate the sentiment of a text: we simply multiply the frequency of each word by its positive/negative score and take an average. Obviously, this word embedding ignores things like word order, which can impact meaning. TF-IDF (term frequency-inverse document frequency) weights words according to how frequently they appear across a corpus. We can also measure the co-occurrence of words within sentences, which would extend our vector to a matrix. The problem is that many words are unlikely to occur in the same sentence, resulting in a very sparse matrix.
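To make this concrete, here is a small sketch of lexicon-based sentiment scoring over a bag-of-words, followed by a TF-IDF weight calculation. The lexicon and the documents are invented for illustration; production systems would use an established sentiment lexicon and a library implementation of TF-IDF (e.g. scikit-learn's).

```python
import math
from collections import Counter

# Hypothetical sentiment lexicon: +1 for positive words, -1 for negative.
LEXICON = {"beat": 1, "strong": 1, "upgrade": 1,
           "miss": -1, "weak": -1, "downgrade": -1}

def sentiment(tokens):
    """Average the lexicon scores of the words in the bag."""
    bag = Counter(tokens)
    scores = [LEXICON[word] * count for word, count in bag.items()
              if word in LEXICON]
    return sum(scores) / len(tokens)

print(sentiment(["earnings", "beat", "strong", "guidance"]))  # 0.5

# TF-IDF: weight a term by its frequency within a document, discounted
# by how many documents in the corpus contain it.
def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)
    return tf * idf

corpus = [["rates", "rise"], ["rates", "fall"], ["earnings", "beat"]]
# "rates" appears in 2 of the 3 documents, so its weight is discounted.
print(tf_idf("rates", corpus[0], corpus))
```

A word that appears in every document gets an IDF of log(1) = 0, so ubiquitous words are weighted down to nothing, which is the point of the scheme.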
So far, we’ve discussed hand-crafted features, a rules-based approach. However, what if we tried using machine learning to create dense word embeddings with lower dimensionality? word2vec converts words to vectors (as the name suggests!). It computes the probability of words appearing near one another; in other words, it is a probabilistic approach rather than a deterministic one. There are many other, more advanced word embeddings, such as BERT, which can also incorporate context.
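To illustrate the idea behind word2vec's skip-gram variant (without the neural network itself), the training data consists of (target, context) pairs of words that appear near one another; the model then learns vectors that assign high probability to the observed pairs. A sketch of generating those pairs, with an illustrative window size:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs: for each word, every
    other word within `window` positions is treated as its context."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs(["rates", "will", "rise", "soon"], window=1)
print(pairs)
# [('rates', 'will'), ('will', 'rates'), ('will', 'rise'),
#  ('rise', 'will'), ('rise', 'soon'), ('soon', 'rise')]
```

Libraries such as gensim implement the full word2vec training loop, so in practice you would feed your tokenised corpus to an off-the-shelf implementation rather than build it yourself.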
If you want to do your own NLP on your own text corpus, Python has many libraries for generating these various word embeddings (often with trained models) and for doing NLP tasks more broadly. I’ll be demonstrating a few of these Python libraries during my talk. Hope to see you in Hamburg at QuantMinds!