I have been a regular user of Spacy and have used it to solve interesting problems in the past. One of the basic problems in NLP is finding similarities between words or phrases. Many NLP libraries can tell you how similar two words or phrases are through a cosine similarity score.
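To make the cosine similarity score concrete, here is a minimal sketch in plain Python. The three vectors are made-up toy values, not real word embeddings, and the function name is only illustrative; libraries such as Spacy compute the same quantity over their pretrained word vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "word vectors" (made-up numbers, for illustration only).
apple = [1.0, 2.0, 0.0]
pineapple = [2.0, 4.0, 0.0]  # points in the same direction as apple
car = [0.0, 0.0, 3.0]        # orthogonal to apple

print(cosine_similarity(apple, pineapple))  # very close to 1: same direction
print(cosine_similarity(apple, car))        # 0.0: nothing in common
```

A score near 1 means the two vectors point the same way, and a score near 0 means they are unrelated.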
Finding the similarity between 2 words is easy. But what if you had, say, 200k words and wanted to check the similarity of each one against a table containing 10k words? On top of that, you want to store the 3 most similar words for each word and write them to a csv file or your DB.
Your normal laptop probably can’t handle such a memory-intensive job.
Even if it could, it would be painstakingly slow!!
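A quick back-of-the-envelope sketch of the workload, using the sizes quoted above, plus one possible way to keep only the 3 best matches per word with `heapq.nlargest`. The helper name and the scores in the usage lines are made up for illustration.

```python
import heapq

BIG_LIST_SIZE = 200_000   # words to look up (figure from the text)
REFERENCE_SIZE = 10_000   # words in the reference table (figure from the text)

# Every word in the big list is compared with every reference word.
total_comparisons = BIG_LIST_SIZE * REFERENCE_SIZE
print(f"{total_comparisons:,} similarity computations")  # 2,000,000,000

def top_matches(scored, k=3):
    """Return the k (candidate, score) pairs with the highest score."""
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Made-up scores for a single word, for illustration only.
scores_for_apple = [("pineapple", 0.7), ("car", 0.1), ("pear", 0.6), ("fruit", 0.65)]
print(top_matches(scores_for_apple))
# [('pineapple', 0.7), ('fruit', 0.65), ('pear', 0.6)]
```

Two billion comparisons is the reason a single laptop process struggles here.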
What is Redis
Redis stands for Remote Dictionary Server. It is a fast, open-source, in-memory key-value data store for use as a database, cache, message broker, and queue. Redis was developed by Salvatore Sanfilippo.
Why use Redis?
Because it is insanely fast!!
Now the next obvious question that you might have is, what makes it so fast?
Well, the answer is that all Redis data resides in memory, in contrast to databases that store data on disk or SSDs. By eliminating the need to access disks, in-memory data stores such as Redis avoid seek-time delays and can access data in microseconds.
Back to the problem
Okay then, back to our problem. Since my laptop did not have adequate computing power and storage, I decided to spin up an EC2 instance on AWS.
You can refer to this article of mine and follow steps 1 to 3 on how to create an EC2 instance and choose an appropriate machine image. For this task I would recommend a larger instance type with at least 64 GB of RAM.
Once you have created an EC2 instance, you will have to do the following steps:
Install Python (3.5+ recommended; you might also have to install other packages like numpy and pandas)
Install Tmux (I will get to why we need this later)
Assuming you have installed all of the above correctly, let's get to the code.
Step 1: Start the redis server and access it
The initial part is self-explanatory: we import the required packages. Then we access the redis server.
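As a minimal sketch of this step, the snippet below connects with the `redis` Python package (redis-py). It assumes a Redis server has already been started on the same machine (for example with `redis-server`) and listens on the default port 6379; the helper names `make_client` and `server_is_up` are hypothetical and not taken from the original code.

```python
def make_client(host="localhost", port=6379, db=0):
    """Create a client for a Redis server assumed to run on this machine."""
    import redis  # third-party package: pip install redis
    # decode_responses=True makes the client return str instead of bytes.
    return redis.Redis(host=host, port=port, db=db, decode_responses=True)

def server_is_up(client):
    """Return True if the server answers PING."""
    return bool(client.ping())

# Usage (needs a running server, so it is left commented out):
# client = make_client()
# print(server_is_up(client))
```

If no server is reachable, redis-py raises a connection error from `ping()` rather than returning False, so a failed check is hard to miss.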
Step 2: Read the words from the big list.
Redis supports many data structures. One of the data structures which we will cover extensively in this article is a list (in context of redis). A list is a series of ordered values. Some of the important commands for interacting with lists are RPUSH, LPUSH, LLEN, LRANGE, LPOP, and RPOP. To learn more, you can visit this wonderful interactive tutorial from redis.
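To show what these commands do, here is a toy in-process model of a single Redis list built on `collections.deque`. It is not a Redis client and the class name is made up; real commands also take the key of the list as their first argument, and the `lrange` here is simplified.

```python
from collections import deque

class ListModel:
    """Toy model of one Redis list; it only mimics the command semantics."""

    def __init__(self):
        self._items = deque()

    def rpush(self, value):          # RPUSH: append at the tail
        self._items.append(value)
        return len(self._items)

    def lpush(self, value):          # LPUSH: insert at the head
        self._items.appendleft(value)
        return len(self._items)

    def llen(self):                  # LLEN: number of elements
        return len(self._items)

    def lrange(self, start, stop):   # LRANGE: inclusive range, -1 means last
        items = list(self._items)
        stop = len(items) + stop if stop < 0 else stop
        return items[start:stop + 1]

    def lpop(self):                  # LPOP: remove and return the head
        return self._items.popleft() if self._items else None

    def rpop(self):                  # RPOP: remove and return the tail
        return self._items.pop() if self._items else None

words = ListModel()
words.rpush("apple")          # list: apple
words.rpush("banana")         # list: apple, banana
words.lpush("cherry")         # list: cherry, apple, banana
print(words.llen())           # 3
print(words.lrange(0, -1))    # ['cherry', 'apple', 'banana']
print(words.lpop())           # cherry
print(words.rpop())           # banana
```

Pushing on one side and popping from the other gives a queue; pushing and popping on the same side gives a stack.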
In the below code snippet, we define a function to read the 200k words. The words are stored in a column called ‘keyword’ in the csv file ‘big_Keywords’. They are read one by one and stored in a list under the key “big_words”.
You could plug in your file containing a large corpus of words here instead of the file ‘big_keywords’.
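A hedged sketch of what such a reader could look like. It uses the standard `csv` module instead of pandas to stay self-contained, and it takes the Redis client as a parameter; the function name `load_words`, the `.csv` file extension, and the choice of LPUSH over RPUSH are assumptions, not the original code.

```python
import csv

def load_words(client, lines, key="big_words", column="keyword"):
    """Push every value of `column` onto the Redis list `key`; return the count.

    `lines` is an open csv file or any iterable of csv lines.
    """
    count = 0
    for row in csv.DictReader(lines):
        word = (row.get(column) or "").strip()
        if word:                      # skip blank cells
            client.lpush(key, word)   # LPUSH inserts at the head of the list
            count += 1
    return count

# Usage, assuming a redis-py client created with decode_responses=True:
# with open("big_Keywords.csv", newline="", encoding="utf-8") as f:
#     print(load_words(client, f))
```

Passing the client in keeps the function easy to try out with a stand-in object before pointing it at a real server.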
Step 3: Word similarity through Spacy.
So let's unpack the above code. First we download the pretrained spacy model “en_vectors_web_lg”. Then we read the reference file containing 10k words, ‘small_Topics.csv’ (you could plug in your own file here).
Next we use the command LPOP (lpop) to remove the elements from the bigger list ‘big_keywords’ one by one. (Remember, we created the list ‘big_keywords’ earlier and put the words into it through LPUSH.)
Then we have the standard for loop where we do the core work of finding the similarity between words from the two lists. We then store the results in the format (big_word, topic, score) in a sorted set through the command ZADD.
For example, (Apple, pineapple, 0.7), where Apple is a word from the big list and pineapple is its close match from the smaller list ‘topic’. The score of 0.7 is the cosine similarity score.
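The scoring step could be sketched roughly as follows. This is an assumption about the layout, not the original code: one scratch sorted set named "scores" holds the topics as members and their similarity to the current word as scores. `nlp` stands for a loaded Spacy pipeline (or any callable returning objects with a `.similarity()` method), and `score_word` and `topic_docs` are hypothetical names.

```python
def score_word(client, nlp, word, topic_docs, scores_key="scores"):
    """Score `word` against every topic and store the scores in a sorted set.

    `topic_docs` maps each topic string to nlp(topic), parsed once up front
    so the 10k topics are not re-parsed for every word.
    """
    word_doc = nlp(word)
    # Sorted-set member = topic, score = similarity of that topic to `word`.
    mapping = {
        topic: float(word_doc.similarity(topic_doc))
        for topic, topic_doc in topic_docs.items()
    }
    if mapping:                           # ZADD needs at least one member
        client.zadd(scores_key, mapping)  # redis-py 3+: zadd(name, mapping)
    return mapping

# Usage, assuming Spacy, the model named in the text, and a redis-py client:
# import spacy
# nlp = spacy.load("en_vectors_web_lg")
# topic_docs = {t: nlp(t) for t in ["pineapple", "car"]}
# score_word(client, nlp, "apple", topic_docs)
```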
Now there is a small issue here. A sorted set keeps its members in ascending order of score, but we want them in descending order, because a cosine score of 1.0 implies an exact match and we are looking for exact word matches or at least the closest ones. To get that order, we use the command zrevrangebyscore.
Now that we have the matches sorted the way we wanted, we put them in the list titled ‘results’ through lpush.
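A possible sketch of this step with redis-py: read the 3 highest-scoring members with `zrevrangebyscore`, then push them onto the ‘results’ list. The scratch set name "scores", the JSON encoding of each result, and the function name are assumptions made for illustration.

```python
import json

def store_top_matches(client, word, scores_key="scores", results_key="results", k=3):
    """Move the k best (topic, score) pairs for `word` into the results list."""
    # Highest scores first; start/num limit the reply to the first k members.
    best = client.zrevrangebyscore(
        scores_key, "+inf", "-inf", start=0, num=k, withscores=True
    )
    for topic, score in best:
        # Each entry is stored as a JSON string: [big_word, topic, score].
        client.lpush(results_key, json.dumps([word, topic, score]))
    client.delete(scores_key)  # clear the scratch set before the next word
    return best
```

Deleting the scratch set after each word keeps one word's scores from leaking into the next word's ranking.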
Step 4: Calling the functions.
We have finally completed all the logical work and now all that remains is to put the results into a new csv file. After this, we write a small snippet of code to call each function.
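Writing the results out could look roughly like this. The sketch assumes each entry in the ‘results’ list is a JSON string of the form [big_word, topic, score]; that encoding, the header names, and the function name are assumptions rather than the original code.

```python
import csv
import json

def write_results(client, out_file, results_key="results"):
    """Write every entry of the Redis list `results_key` to `out_file` as csv rows."""
    writer = csv.writer(out_file)
    writer.writerow(["big_word", "topic", "score"])
    rows = client.lrange(results_key, 0, -1)  # 0..-1 selects the whole list
    for raw in rows:
        writer.writerow(json.loads(raw))
    return len(rows)

# Usage, assuming a redis-py client created with decode_responses=True:
# with open("results.csv", "w", newline="", encoding="utf-8") as f:
#     write_results(client, f)
```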
Show me the magic
I had titled the article Spacy + Redis = Magic. Now it is time to reveal what is so magical about the combination of Spacy and Redis.
First we make sure that our redis server is up and running. You can check that by entering the command ‘redis-cli’ followed by ‘ping’. If redis is installed properly and the server is up and running, you should get ‘PONG’ in reply.
Now comes the magical part: we read the 200k+ words.
The magical part was that the 200k+ words were read in less than 15 seconds!!
The words were read so quickly that at first I thought they had not been read at all. I verified that they had been read through the command ‘llen’.
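The same check can be scripted. The sketch below times a loading function and then asks Redis for the list length with LLEN; `timed_load` and the `load` callable are hypothetical names, and the key "big_words" is the one mentioned earlier in the text.

```python
import time

def timed_load(client, load, key="big_words"):
    """Run `load()`, then return (elapsed_seconds, length of the list `key`)."""
    start = time.perf_counter()
    load()                               # e.g. the function that pushes the words
    elapsed = time.perf_counter() - start
    return elapsed, client.llen(key)     # LLEN reports how many items were stored
```

If the reported length matches the number of rows in the csv file, the words really were read.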
I had told you earlier that we will require Tmux. So what is Tmux?
Tmux stands for terminal multiplexer and it is useful because it lets you tile window panes in a command-line environment. This in turn allows you to run, or keep an eye on, multiple programs within one terminal.
There are various shortcuts for working with panes, like:
ctrl+b %: splits the screen vertically (left and right panes)
ctrl+b ": splits the screen horizontally (top and bottom panes)
ctrl+b c: creates a new window
ctrl+b arrow keys: lets you navigate from one pane to another
How did I use Tmux in my case
Well, if you remember, I told you that this whole process of finding similar words is very memory intensive, given the huge list of words. We have a remote machine that has many cores. It would be ideal to use as many of them as possible, since that will speed up our process.
Through Tmux, we can sync the screens and run our process simultaneously like below.
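One plausible way to structure the script that runs in each pane is a worker loop that keeps popping words from the shared Redis list until it is empty. Redis executes each LPOP atomically, so several copies of the script can drain the same list without two of them receiving the same word. The function name and the `handle` callback are hypothetical.

```python
def run_worker(client, handle, queue_key="big_words"):
    """Pop words until the shared list is empty; return how many were handled."""
    handled = 0
    while True:
        word = client.lpop(queue_key)  # atomic: each word goes to one worker only
        if word is None:               # list is empty, so this worker is done
            break
        handle(word)                   # e.g. score the word and store its top matches
        handled += 1
    return handled

# Usage: start the same script in every Tmux pane; each copy calls
# run_worker(client, handle) against the same Redis server.
```

Because the work queue lives in Redis rather than inside one Python process, adding another pane simply adds another worker.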