Easily Access Pre-trained Word Embeddings with Gensim

What are pre-trained embeddings and why use them? Pre-trained word embeddings are vector representations of words trained on a large dataset. With pre-trained embeddings, you are essentially using the weights and vocabulary that result from a training process done by… someone else! (It could also be you.) One benefit of using pre-trained embeddings is that you can hit the ground running without the need for finding a large text corpus which you will have to preprocess and train with the…

How to Use Tfidftransformer & Tfidfvectorizer?

Scikit-learn’s TfidfTransformer and TfidfVectorizer aim to do the same thing, which is to convert a collection of raw documents to a matrix of TF-IDF features. The differences between the two classes can be quite confusing, and it’s hard to know when to use which. This article shows you how to use each one correctly, how the two differ, and some guidelines on what to use when. The full working notebook for this tutorial can be found in my repo…

Building scalable, production ready NLP solutions

Having spent a big part of my career as a graduate student researcher and now as a Data Scientist in industry, I have come to realize that a vast majority of solutions proposed both in academic research papers and in the workplace are just not meant to ship: they just don’t scale! And when I say scale, I mean handling real-world use cases, the ability to handle large amounts of data, and ease of deployment in a production…

Word Cloud in Python for Jupyter Notebooks and Web Apps

About a year ago, I looked high and low for a Python word cloud library that I could use from within my Jupyter notebook and that was flexible enough to use counts or TF-IDF when needed, or just accept a set of words and corresponding weights. I was a bit surprised that something like that did not already exist within libraries like plotly. All I wanted to do was to get a quick understanding of my text data and word vectors…

Tutorial: Extracting Keywords with TF-IDF and Python’s Scikit-Learn

In this era of using deep learning for everything, you may be wondering why you would even use TF-IDF for any task at all. The truth is that TF-IDF is easy to understand, easy to compute, and one of the most versatile statistics for showing the relative importance of a word or phrase in a document or a set of documents in comparison to the rest of your corpus. Keywords are descriptive words or phrases that characterize your documents…
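To make the idea of "relative importance" concrete, here is a minimal sketch using only the Python standard library. It assumes whitespace tokenization and the textbook weighting tf × log(N / df); scikit-learn's defaults use a smoothed variant, so its scores would differ. The function name `tfidf_keywords` and the toy corpus are illustrative, not taken from the tutorial.

```python
import math
from collections import Counter

def tfidf_keywords(docs, doc_index, top_n=3):
    """Return the top_n (word, score) pairs for docs[doc_index]."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    # Term frequency within the target document.
    tf = Counter(tokenized[doc_index])
    total = len(tokenized[doc_index])
    scores = {
        word: (count / total) * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
keywords = tfidf_keywords(docs, 0, top_n=1)
all_scores = dict(tfidf_keywords(docs, 0, top_n=10))
```

In this toy corpus, "the" appears in every document, so its score is zero, while "mat" appears only in the first document and ranks highest there.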

Tips for Constructing Custom Stop Word Lists

Stop words are a set of commonly used words in any language. For example, in English, “the”, “is” and “and” would easily qualify as stop words. In NLP and text mining applications, stop word lists are used to filter out unimportant words, allowing applications to focus on the important words instead. While it is fairly easy to use a published set of stop words, in many cases such a list is insufficient for the application at hand. For example, in clinical texts,…
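As a rough illustration of the difference between a published list and a custom one, the sketch below filters the same sentence twice. The base list holds only the three example words mentioned above, and `DOMAIN_STOP_WORDS` is a hypothetical domain-specific extension invented for this example; it is not a list from the article.

```python
# The three English examples named in the text above.
BASE_STOP_WORDS = {"the", "is", "and"}
# Hypothetical domain-specific additions, chosen only for illustration.
DOMAIN_STOP_WORDS = {"patient", "history"}

def remove_stop_words(text, stop_words):
    """Lowercase, split on whitespace, and drop any token in stop_words."""
    return [token for token in text.lower().split() if token not in stop_words]

text = "The patient is stable and the history is unremarkable"
generic = remove_stop_words(text, BASE_STOP_WORDS)
custom = remove_stop_words(text, BASE_STOP_WORDS | DOMAIN_STOP_WORDS)
```

With only the generic list, words that occur in nearly every document of the domain survive the filter; the combined list removes them as well.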

How to incorporate phrases into Word2Vec – a text mining approach

Training a Word2Vec model with phrases is very similar to training a Word2Vec model with single words. The difference: you need to add a layer of intelligence in processing your text data to pre-discover phrases. In this tutorial, you will learn how to create embeddings with phrases without explicitly specifying the number of words that should make up a phrase (i.e. the n-gram size). This means that you could have phrases with 2 words, 3 words and in some rare…

Gensim Word2Vec Tutorial – Full Working Example

The idea behind Word2Vec is pretty simple. We’re making an assumption that the meaning of a word can be inferred by the company it keeps. This is analogous to the saying, “show me your friends, and I’ll tell you who you are”. If you have two words that have very similar neighbors (meaning: the contexts in which they are used are about the same), then these words are probably quite similar in meaning or are at least related. For example, the words…
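The "company it keeps" assumption can be illustrated without training anything. The sketch below is not Word2Vec: it simply collects each word's neighbors within a fixed window and compares two words by the Jaccard overlap of their neighbor sets. The function names, the window size, and the toy sentences are assumptions made for this example.

```python
from collections import defaultdict

def context_sets(sentences, window=2):
    """Map each word to the set of words seen within `window` positions of it."""
    contexts = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            # Neighbors on both sides, excluding the word itself.
            contexts[word].update(tokens[lo:i] + tokens[i + 1:hi])
    return contexts

def jaccard(a, b):
    """Size of the intersection over size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

sentences = [
    "i drink hot coffee every morning",
    "i drink hot tea every morning",
    "the car needs new tires",
]
ctx = context_sets(sentences)
sim_coffee_tea = jaccard(ctx["coffee"], ctx["tea"])
sim_coffee_car = jaccard(ctx["coffee"], ctx["car"])
```

Here "coffee" and "tea" share all of their neighbors, while "coffee" and "car" share none. Word2Vec learns dense vectors from this kind of co-occurrence signal rather than comparing raw neighbor sets.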

How to read CSV & JSON files in Spark – word count example

One of the really nice things about Spark is the ability to read input files of different formats right out of the box. Though this is a nice-to-have feature, reading files in Spark is not always consistent and seems to keep changing with different Spark releases. This article will show you how to read CSV and JSON files to compute word counts on selected fields. This example assumes that you would be using Spark 2.0+ with Python…
