Second Data Science Symposium of 2021 Updates Firm on Recent Applications of Models and Tools to Analyze Text Data
July 6, 2021
In June 2021, several data scientists and collaborating consultants hosted the second Data Science Symposium of the year to share updates on the firm’s latest tools and innovations applicable to case work. This installment covered tools and methodologies for working with massive amounts of text.
- Analysis Group’s eSearch platform leverages machine learning (ML) techniques to comb through large sets of heterogeneous documents. In health care cases, the platform can identify discussions relevant to patient-reported outcomes (PROs). PROs directly measure patient health status without interpretation by a clinician and are increasingly recognized by regulatory institutions as useful clinical trial endpoints. To streamline the identification of PROs used across disease areas, Analysis Group constructed a health care search platform to provide a unified search interface that works across regulatory and health technology assessment sources. The platform applies ML algorithms that can be customized to find information relevant to different health care concepts and can scale up to search billions of documents. Beyond health care, the platform and ML functionality can be customized across client engagements to identify concepts relevant to other industries, and they have played key roles in multiple antitrust and finance litigation cases at Analysis Group.
- To measure the degree of similarity between many documents in a copyright dispute, an Analysis Group team used natural language processing (NLP) to turn text into mathematical objects that could be easily and efficiently compared via algorithms. In this case, the defendants claimed they had copied documents from non-copyrighted, pre-existing materials, rather than from the plaintiffs. By identifying similar word groupings, the Analysis Group team provided quantitative analysis showing that the defendants’ and plaintiffs’ documents were far more similar to each other than to the defendants’ alleged pre-existing source materials.
- Supervised document clustering significantly reduces the time needed to compare large amounts of text. In a recent case, Analysis Group employed this methodology to evaluate thousands of patents for insight into the competitive proximity of the parties in a proposed merger. The case team leveraged a labelled subset of patents to develop ML models to assess approximately 5,000 patent families for relevance and similarity. This approach provided deeper insight into competition in innovation markets than traditional market share analysis alone. This methodology is applicable to a wide array of patent cases, including those involving questions of essentiality, portfolio valuation and patent infringement, and can be applied more broadly to other cases involving the segmentation of complex datasets.
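As a rough illustration of the document-comparison idea in the copyright example above, the sketch below turns each text into a set of overlapping word groupings (three-word "shingles") and scores two texts by the share of groupings they have in common (Jaccard similarity). This is only one common way to do it and is not a description of the method the case team used; the function names, the shingle size, and the three sample strings are all hypothetical and invented for illustration.

```python
import re


def word_shingles(text, n=3):
    """Return the set of overlapping n-word sequences found in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard_similarity(text_a, text_b, n=3):
    """Share of word groupings the two texts have in common (0.0 to 1.0)."""
    a = word_shingles(text_a, n)
    b = word_shingles(text_b, n)
    if not a and not b:
        return 0.0  # nothing to compare
    return len(a & b) / len(a | b)


# Toy strings (assumptions for illustration only, not case data).
plaintiff = "The valve must be inspected every six months by a certified technician."
defendant = "The valve must be inspected every six months by a licensed technician."
prior_source = "Technicians should check valves on a regular schedule."

sim_to_plaintiff = jaccard_similarity(defendant, plaintiff)
sim_to_source = jaccard_similarity(defendant, prior_source)

print(f"defendant vs. plaintiff:    {sim_to_plaintiff:.2f}")
print(f"defendant vs. prior source: {sim_to_source:.2f}")
```

In a comparison like this, a defendant document that scores much higher against the plaintiff's text than against the claimed source material is the kind of pattern the bullet describes. A real analysis would likely also need to account for boilerplate language, document length, and formatting differences.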
At the next symposium, the firm’s data scientists will discuss web applications and platforms Analysis Group has developed to enhance interaction with clients, particularly regarding data collection and dynamic data exploration.