I am a Data Scientist at GitHub working on problems relating to large-scale text analytics, including topic modeling and extraction, recommendation engines, and domain-specific dictionary generation. I recently developed the algorithm for suggesting topics for the GitHub Topics feature. I also write articles related to text analytics and actively contribute to RxNLP’s text mining tools.
I received my Ph.D. from the Department of Computer Science at the University of Illinois at Urbana-Champaign. My advisor was Dr. ChengXiang Zhai. My interests and work span several hybrid areas in text and data mining, machine learning, information retrieval and natural language processing. I am specifically interested in building general and scalable solutions for text data problems in different application domains.
Data-Driven Decision Support Systems
Opinion-Driven Decision Support System
Opinions are essential for decision making. For example, when visiting a new city, we use online opinions to decide which hotel to stay at or what attractions to visit. Without online opinions, decision making becomes much more difficult, as we only have limited information (e.g. price and location) to base our decisions on. My thesis focuses on leveraging large amounts of opinions on the Web to facilitate opinion-driven decision support. The idea is to combine the strengths of search technologies with opinion analysis and mining tools to provide a powerful decision making platform. This platform is called an Opinion-Driven Decision Support System (ODSS), which encompasses problems related to opinion acquisition, opinion-based search, opinion analysis tools and presentation of opinions.
Opinion-Based Entity Ranking
In this work, we propose a novel way of leveraging opinionated content (e.g. reviews) by directly ranking entities based on a user’s preferences. The idea is to represent each entity with the text of all the reviews of that entity. Given a user’s keyword query that expresses the desired features of an entity, we can then rank all the candidate entities based on how well opinions on these entities match the user’s preferences. We study several methods for solving this problem, including both standard text retrieval models and some extensions of these models.
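As a rough illustration of the setup described above, the sketch below treats the concatenated reviews of each entity as one document and ranks entities against a keyword query. It uses BM25 as one example of a standard text retrieval model; this is a choice made for illustration, not necessarily one of the models or extensions studied in the paper. The function names, the whitespace tokenizer and the parameter defaults (`k1`, `b`) are all assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; a real system would also
    # strip punctuation and possibly stem words).
    return text.lower().split()


def rank_entities(reviews_by_entity, query, k1=1.2, b=0.75):
    """Rank entities by BM25 score of the query against their pooled reviews."""
    # One "document" per entity: the text of all reviews of that entity.
    docs = {e: Counter(tokenize(" ".join(reviews)))
            for e, reviews in reviews_by_entity.items()}
    lengths = {e: sum(tf.values()) for e, tf in docs.items()}
    avg_len = sum(lengths.values()) / len(docs)
    n_docs = len(docs)

    scores = {}
    for entity, tf in docs.items():
        score = 0.0
        for term in tokenize(query):
            df = sum(1 for other in docs.values() if term in other)
            if df == 0:
                continue  # term never appears in any review
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            f = tf[term]
            norm = f + k1 * (1 - b + b * lengths[entity] / avg_len)
            score += idf * f * (k1 + 1) / norm
        scores[entity] = score
    # Highest-scoring entity first.
    return sorted(scores, key=scores.get, reverse=True)
```

Calling `rank_entities` with a dictionary that maps entity names to lists of review strings and a query such as "clean quiet" returns the entity names ordered by how well their reviews match the query.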
Scalable & Lightweight Text Summarization for Speedy Decision Making
Opinosis: A Graph Based Approach to Abstractive Summarization of Highly Redundant Opinions
This work presents a flexible framework for generating very short abstractive summaries. The key idea is to use a word graph data structure referred to as the Opinosis-Graph to represent the text to be summarized. Then, we repeatedly find paths through this graph to produce concise summaries. We consider Opinosis a "shallow" abstractive summarizer as it uses the original text itself to generate summaries. This is unlike a true abstractive summarizer that would need a deeper level of natural language understanding.
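The word-graph idea can be sketched in a few lines. The version below is a heavily simplified stand-in for the Opinosis-Graph: nodes are plain lowercased words, edge weights count how many times one word directly follows another, and a path is scored by its average edge weight. The scoring rule, the start and end markers, and the length limits are assumptions made for illustration and do not reproduce the published algorithm.

```python
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence boundary markers


def build_word_graph(sentences):
    """Build a directed word graph; edge weight = number of times a follows b."""
    graph = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = [START] + sentence.lower().split() + [END]
        for a, b in zip(tokens, tokens[1:]):
            graph[a][b] += 1
    return graph


def best_path(graph, min_words=3, max_words=8):
    """Return the START-to-END path with the highest average edge weight."""
    best_score, best_words = 0.0, []
    stack = [(START, [], 0)]  # (current node, words so far, summed edge weight)
    while stack:
        node, words, weight = stack.pop()
        for nxt, w in graph.get(node, {}).items():
            if nxt == END:
                if len(words) >= min_words:
                    # A path with k words has k + 1 edges including the markers.
                    score = (weight + w) / (len(words) + 1)
                    if score > best_score:
                        best_score, best_words = score, words
            elif nxt not in words and len(words) < max_words:
                # Skip words already on the path to avoid cycles.
                stack.append((nxt, words + [nxt], weight + w))
    return " ".join(best_words)
```

Because paths can pass through words shared by different sentences, the graph can also produce word sequences that never appeared verbatim in the input, which is what makes this style of summarization "shallow" abstractive rather than purely extractive.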
Micropinion Generation: An Unsupervised Approach to Generating Ultra-Concise Summaries of Opinions
This paper presents a new unsupervised approach to generating ultra-concise summaries of opinions. We formulate the problem of generating such a micropinion summary as an optimization problem, where we seek a set of concise and non-redundant phrases that are readable and represent key opinions in text. We measure representativeness based on a modified mutual information function and model readability with an n-gram language model. We propose some heuristic algorithms to efficiently solve this optimization problem. Evaluation results show that our unsupervised approach outperforms other state-of-the-art summarization methods and that the generated summaries are informative and readable.
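To make the two scoring components concrete, here is a minimal sketch that scores a single candidate phrase. It uses plain pointwise mutual information between adjacent words as a stand-in for the modified mutual information function, and an add-one smoothed bigram model as the n-gram language model. Both choices, the way the two scores are combined, and all function names are assumptions. The sketch also leaves out the selection of a non-redundant set of phrases.

```python
import math
from collections import Counter


def count_ngrams(sentences):
    """Collect unigram and bigram counts from whitespace-tokenized text."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def representativeness(words, unigrams, bigrams):
    # Average PMI of adjacent word pairs; unseen pairs contribute 0.
    total = sum(unigrams.values())
    pmis = []
    for a, b in zip(words, words[1:]):
        joint = bigrams[(a, b)]
        if joint:
            pmis.append(math.log2(joint * total / (unigrams[a] * unigrams[b])))
        else:
            pmis.append(0.0)
    return sum(pmis) / len(pmis)


def readability(words, unigrams, bigrams):
    # Average log-probability under a bigram model with add-one smoothing.
    vocab = len(unigrams)
    logps = [math.log2((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
             for a, b in zip(words, words[1:])]
    return sum(logps) / len(logps)


def score_phrase(phrase, unigrams, bigrams):
    words = phrase.lower().split()
    if len(words) < 2:
        raise ValueError("phrase must contain at least two words")
    # Unweighted sum of the two components (an arbitrary choice for this sketch).
    return (representativeness(words, unigrams, bigrams)
            + readability(words, unigrams, bigrams))
```

With counts collected from a set of reviews, a frequently occurring and well-ordered phrase such as "battery life" receives a higher score than a scrambled candidate such as "life battery".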
Mining Tag Clouds and Emoticons Behind Community Feedback
This work is related to a data mining system which automatically mines tags from feedback text in an eCommerce scenario. It renders these tags in a visually appealing manner. Further, emoticons are attached to mined tags to add sentiment to the visual aspect.
Adding Semantics and Structure to Unstructured Text
A General Supervised Approach to Segmentation of Clinical Texts
Proceedings of IEEE Conference on Big Data 2014
Segmentation of clinical texts is critical for many tasks, such as medical coding for billing, auto-drafting of discharge summaries and patient problem list generation. Most methods tend to rely on basic regular expressions or document-specific segmentation methods. This work presents a highly generalized statistical model for segmenting clinical texts, based on a set of line-wise predictions by a classifier with constraints imposing coherence of predictions. Evaluation results on five independent test sets show that the approach can work on a wide range of document types and performs consistently across hospitals.
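One way to picture line-wise predictions combined with a coherence constraint is a Viterbi-style decoding over per-line classifier scores. The sketch below assumes the classifier has already produced a log-score for every label on every line, and it models coherence as a fixed penalty for switching labels between adjacent lines. The penalty, the label names and the function name are assumptions; the constraints used in the paper may differ.

```python
def segment(line_scores, switch_penalty=1.0):
    """Pick one label per line, trading classifier scores against label switches.

    line_scores: list of dicts mapping label -> log-score, one dict per line.
    """
    labels = list(line_scores[0])
    # best[label] = (score of best labeling ending in label, that labeling)
    best = {lab: (line_scores[0][lab], [lab]) for lab in labels}

    for scores in line_scores[1:]:
        new_best = {}
        for lab in labels:
            # Choose the best previous label, charging a penalty for a switch.
            prev_lab, (prev_score, prev_path) = max(
                best.items(),
                key=lambda kv: kv[1][0] - (switch_penalty if kv[0] != lab else 0.0),
            )
            penalty = switch_penalty if prev_lab != lab else 0.0
            new_best[lab] = (prev_score - penalty + scores[lab], prev_path + [lab])
        best = new_best

    return max(best.values(), key=lambda item: item[0])[1]
```

In this setup, a single line whose classifier score weakly favors a different section is pulled back into the surrounding section, while a sustained change in scores still produces a section boundary.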
Automated Story Capture from Conversational Speech
Proceedings of K-CAP’05
While storytelling has long been recognized as an important part of effective knowledge management in organizations, knowledge management technologies have generally not distinguished between stories and other types of discourse. In this work we describe a new type of technological support for storytelling that involves automatically capturing the stories that people tell to each other in conversations. We describe our first attempt at constructing an automated story extraction system using statistical text classification and a simple voting scheme.
Large Scale Data Management and Collection
OpinoFetch: A Practical Approach to Collecting Opinions on Arbitrary Entities
In this work, we propose a lightweight approach to collecting opinion-containing pages, namely review pages, on the Web for arbitrary entities. We leverage existing Web search engines and use a novel information network called the FetchGraph to efficiently obtain review pages for entities of interest. Experiments in three different domains show that OpinoFetch is more effective than plain search engine results and is able to collect entity-specific review pages efficiently with reasonable precision and accuracy.