NLP-powered ‘Voice of Patient’

A case study on how data engineering and NLP techniques were used to analyze cancer-related posts shared by patients and care-givers on cancer support forums


The Advanced Analytics team of a leading Biopharma firm wanted to analyze activity on social media and patient forums to generate insights that would help in designing physician- and patient-level tactics. The client also wanted the solution to be scalable so that it could later be extended to other therapeutic areas.


The oncology therapeutic area was chosen for this initiative given the large number of unmet needs in the market. We followed this approach:

Web-Scraping Modules

  • Concepts of ‘Unified Data Models’ were used in standardizing and automating data extraction from patient forums.
  • Custom medical taxonomies were leveraged to standardize keywords, which supported high model accuracy.
  • Extractor modules were developed to run the scraping periodically and keep the data current.
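
To illustrate the first two points, here is a minimal Python sketch of how posts from differently structured forums might be mapped into one unified record and how forum shorthand might be standardized against a taxonomy. Everything named here is an assumption made for illustration: the `ForumPost` schema, the `from_forum_a` and `from_forum_b` adapters, the raw field names, and the three-entry `TAXONOMY` are hypothetical and do not come from the actual solution.

```python
import re
from dataclasses import dataclass


@dataclass
class ForumPost:
    """Unified record shared by all forum extractors (hypothetical schema)."""
    source: str
    post_id: str
    author_role: str  # e.g. "patient" or "caregiver"
    text: str


# Hypothetical taxonomy: forum shorthand -> canonical medical term.
TAXONOMY = {
    "chemo": "chemotherapy",
    "rads": "radiation therapy",
    "mets": "metastases",
}

# Longest variants first so longer terms win over their prefixes.
_PATTERN = re.compile(
    r"\b(" + "|".join(sorted(map(re.escape, TAXONOMY), key=len, reverse=True)) + r")\b",
    flags=re.IGNORECASE,
)


def from_forum_a(raw: dict) -> ForumPost:
    # Assumed raw layout of a first forum: {"id", "role", "body"}.
    return ForumPost("forum_a", str(raw["id"]), raw["role"].lower(), raw["body"])


def from_forum_b(raw: dict) -> ForumPost:
    # Assumed raw layout of a second forum: {"postId", "userType", "content"}.
    return ForumPost("forum_b", raw["postId"], raw["userType"].lower(), raw["content"])


def standardize_terms(text: str) -> str:
    """Replace every taxonomy variant in the text with its canonical term."""
    return _PATTERN.sub(lambda m: TAXONOMY[m.group(0).lower()], text)


if __name__ == "__main__":
    post = from_forum_b(
        {"postId": "b-7", "userType": "Caregiver", "content": "Mom started Chemo last week"}
    )
    print(post.author_role, "|", standardize_terms(post.text))
```

Because each forum only needs its own small adapter, downstream text mining can work against a single schema regardless of where a post was scraped.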

Text Mining & NLP Modules

  • A combination of supervised and unsupervised algorithms was leveraged to build a text mining pipeline running from data extraction to inference publishing.
  • Named Entity Recognition models extracted domain-relevant information that gave a deeper understanding of patient experiences.
  • Sentiment analysis was customized to extract not only the tone of the overall post but also the sentiment towards individual entities.
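
The difference between post-level and entity-level sentiment can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the models used in the solution: entities are found by dictionary lookup rather than a trained NER model, and sentiment comes from a tiny word lexicon scored per sentence. The `ENTITIES`, `POSITIVE`, and `NEGATIVE` tables and the `analyze` function are hypothetical.

```python
import re

# Hypothetical entity dictionary: term -> entity type.
ENTITIES = {"chemotherapy": "TREATMENT", "nausea": "SYMPTOM", "oncologist": "PROVIDER"}

# Hypothetical sentiment lexicon.
POSITIVE = {"great", "helpful", "better", "supportive"}
NEGATIVE = {"awful", "terrible", "worse", "exhausting"}


def score(tokens):
    """Count of positive words minus count of negative words."""
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)


def label(value):
    return "positive" if value > 0 else "negative" if value < 0 else "neutral"


def analyze(post):
    """Return the overall tone of a post and the tone towards each entity.

    An entity inherits the score of every sentence in which it appears.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", post.strip()) if s]
    overall = 0
    by_entity = {}
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        sentence_score = score(tokens)
        overall += sentence_score
        for token in tokens:
            if token in ENTITIES:
                by_entity[token] = by_entity.get(token, 0) + sentence_score
    return {
        "overall": label(overall),
        "entities": {e: (ENTITIES[e], label(v)) for e, v in by_entity.items()},
    }


if __name__ == "__main__":
    text = ("The chemotherapy was exhausting and the nausea was awful. "
            "My oncologist was great.")
    print(analyze(text))
```

In the sample post the overall tone is negative, yet the sentiment towards the oncologist is positive, which is the kind of distinction a single post-level score would hide.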


D Cube provided an end-to-end solution that:

  1. Systematically sources posts from patients or care-givers from various shortlisted patient forums.
  2. Provides basic structuring and required metadata for the extracted posts to enable further processing.
  3. Leverages best-in-class NLP techniques, together with domain-specific taxonomies and dictionaries, to achieve high classification accuracy.

Solution Results (sample outputs)