Multifeature fusion for claim scope-aware litigation risk prediction for patent drafts

Chitrakala Sakthivel; Jinesh Jose

doi:10.7717/peerj-cs.3069

Multifeature fusion for claim scope-aware litigation risk prediction for patent drafts

Department of Computer Science and Engineering, College of Engineering Guindy Campus, Anna University, Chennai, Tamil Nadu, India

DOI: 10.7717/peerj-cs.3069

Published: 2025-08-05
Accepted: 2025-07-03
Received: 2025-01-24

Academic Editor: Xiangjie Kong

Subject Areas: Artificial Intelligence, Data Mining and Machine Learning, Data Science, Natural Language and Speech, Text Mining
Keywords: Claim scope indicator, Patent analytics, Patent litigation prediction, Deep learning, Hyponym analysis, Multifeature fusion model

Copyright: © 2025 Sakthivel and Jose
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.

Cite this article: Sakthivel C, Jose J. 2025. Multifeature fusion for claim scope-aware litigation risk prediction for patent drafts. PeerJ Computer Science 11:e3069 https://doi.org/10.7717/peerj-cs.3069

The authors have chosen to make the review history of this article public.

Abstract

The ‘claim scope’, or the ‘legal boundaries’ defined by patent claims, has been considered crucial for determining a patent’s value and its associated litigation risk. However, no direct claim semantics-based indicators currently exist to quantify patent claim scope, and existing scope measures are primarily indirect, which limits their ability to capture the semantic nuances of claim text. Additionally, the reliance on post-grant features restricts the applicability of existing litigation prediction models to patent drafts. These limitations complicate the patent drafting process, during which claims are formulated without feedback on scope and litigation risk. This often leads to suboptimal claim articulation, resulting in inadequate protection, increased legal vulnerabilities, or reduced patent grant probability. To address this gap, the hyponym tree score (HTS) is proposed as a novel indicator for quantifying claim scope by analysing hyponym counts, sentence structure, and dependency relations within patent claims. Building on this, early-stage litigation risk prediction has been achieved using a new deep learning model, the Multifeature BERT-Powered Fusion for Author-level Patent Litigation Risk Analysis (MAPRA). The MAPRA model restricts its input features to those available at early stages, such as indicators derived from claim text, inventor information, assignee details, and HTS, ensuring applicability to both draft-stage and granted patents. Despite excluding all post-grant or acquired data, MAPRA achieves a superior area under the receiver operating characteristic curve (AUC) of 0.878, outperforming the most comparable prior study, which reports an AUC of 0.822 using both early-stage and immediate post-grant features. By quantifying claim scope and enabling early-stage litigation risk prediction, this research offers a valuable screening tool for patent drafters, examiners, attorneys, and innovators. It supports informed decision-making during drafting and helps mitigate potential litigation risks. Furthermore, it lays a foundation for future research on claim scope modeling and the development of predictive tools for intellectual property litigation management.

Introduction

The scope or coverage of a patent is defined by its claims, which establishes the boundaries of legal protection and serves as a critical determinant of the patent’s enforceability, value, and commercial significance. Broader claim scope increases legal coverage and enhances the patent’s market value but also raises the risk of conflicts with existing patents, thereby increasing the likelihood of litigation (Merges & Nelson, 1994; Arinas, 2012; Marco, Sarnoff & Charles, 2019). Conversely, narrower claim scope minimizes conflicts and improves the probability of a patent grant, but it may reduce the patent’s legal coverage and economic potential (Cotropia, 2005; Marco, Sarnoff & Charles, 2019). Therefore, during the drafting of the patent claims, achieving an optimal balance in claim scope is essential to ensure robust legal protection, maximize patent value, and minimize litigation risks (Tekic & Kukolj, 2013).

Despite its importance, drafting patent claims remains complex and challenging, largely due to the absence of well-established, semantically rooted indicators for quantifying claim scope. Existing scope indicators often rely on bibliographic or numerical data and fail to incorporate the semantics of the claim text. This omission leaves patent drafters without clear guidance, leading to suboptimal claim articulation that may result in inadequate protection, heightened legal vulnerabilities, or unnecessary litigation risks. Addressing this challenge necessitates a robust, semantics-based metric that can quantify claim scope and aid drafters in achieving an optimal balance between legal coverage and litigation risk.

To fill this gap, the study introduces the hyponym tree score (HTS), a novel semantics-based indicator for quantifying the scope of patent claims. HTS leverages semantic relationships within the claim text, including hyponyms, sentence structures, and interdependencies between claims, to provide a meaningful and quantifiable measure of claim scope. By incorporating text semantics, HTS offers patent drafters actionable insights to optimize claim articulation, enhance legal protection, and mitigate litigation risks.

Patent litigation, which involves resolving disputes over patent infringement, validity, or enforcement, is critical in determining a patent’s enforceability and commercial value. Litigation significantly influences market competition and potential revenue streams, underscoring its importance in the intellectual property landscape (Helmers, 2018). Predicting the likelihood of litigation is a key priority for stakeholders such as portfolio managers, insurers, patent valuators and patent drafters, as it enables strategic planning and effective risk management. Moreover, the articulation of claim scope is intricately linked to litigation risk, as broader claims are more likely to conflict with existing patents. In contrast, narrower claims may limit legal coverage (Merges & Nelson, 1990; Marco, Sarnoff & Charles, 2019).

Existing approaches to litigation prediction, however, face notable limitations. Prior studies have predominantly relied on post-grant event data and externally compiled features, such as the organisation for economic co-operation and development (OECD) patent quality indicators (PQI) (Squicciarini, Dernis & Criscuolo, 2013), available only for granted patents. These models are unsuitable for draft-stage patent documents, where such features are unavailable. Furthermore, many of these models neglect the semantic content of patent claims despite their critical importance in understanding the boundaries of patent protection and accurately predicting litigation risk.

This study proposes a novel multifeature fusion deep learning model for litigation prediction to overcome these limitations. Unlike existing models, this approach integrates HTS with other features available at the drafting stage, making it applicable to both draft and granted patents. By relying exclusively on pre-grant features, the proposed model broadens the applicability of litigation prediction to include early-stage patent documents, empowering stakeholders to assess litigation risks at any stage of the patenting process.

This study significantly contributes to patent scope analysis and litigation prediction. First, it introduces the HTS, a semantics-based metric for quantifying claim scope, providing patent drafters with a valuable indicator for optimizing claim articulation. Second, it develops a self-sufficient multifeature fusion deep learning model for litigation prediction, designed to work with features available during the draft stage, thus addressing the limitations of existing litigation prediction models that rely on post-grant data. By bridging critical gaps in patent drafting and litigation prediction, this work represents a significant step forward in improving claim drafting, enhancing decision-making, improving strategic planning, and optimizing outcomes in the intellectual property domain. This research represents the first effort dedicated to predicting the litigation risk of the early-stage patent document.

Overview of the article structure

The structure of this article is organized as follows: “Background” gives an overview of the context of this work. “Literature Review” reviews the relevant literature and identifies gaps this study aims to address. “Methodology” details the methodology, including data collection (“Dataset”), the development of the hyponym-based indicator (“Claim Scope Indicator Development”), and the development of the new deep learning model for litigation prediction (“Litigation Prediction Model Development”). “Results” presents the results of this study, including the performance of the new litigation prediction model and the relevance of the hyponym-based claim scope indicator. “Discussion” discusses the findings, implications, and potential limitations. Finally, “Conclusion” concludes the article with a summary of key insights and suggestions for future research.

Background

The research originates from an ongoing investigation into developing robust models for patent valuation. A notable trend was observed during the investigation: high-value patents are more likely to face legal events and litigation proceedings (Tekic & Kukolj, 2013). This finding raised interest in predicting patent litigation, particularly for early-stage patent documents, by leveraging machine learning techniques to forecast the likelihood of legal disputes. Even though claim text semantics play a pivotal role in defining the scope or coverage of a patent, the absence of a measure to quantify the claim scope impedes the drafters from optimally regulating the claim scope during the claim drafting. Additionally, understanding the litigation risk of a patent during the drafting stage allows professionals to regulate claim scope effectively by choosing appropriate wording. Developing a litigation prediction model that relies solely on pre-grant patent features can enable litigation risk prediction for both granted and early-stage patent documents.

Literature review

This subsection presents a comprehensive review of the relevant literature, organized into two main areas: (1) Indicators of patent scope and (2) patent litigation prediction models. Each group is critically analysed to identify existing limitations and to highlight how this study fills the identified gaps.

Patent scope indicators

Quantifying the scope of a patent is a longstanding challenge in intellectual property research. Numerous indicators have been proposed to estimate the breadth of legal protection and technological applicability offered by patents. These can be categorized into the following groups:

Citation-based indicators: Citation analysis has been extensively utilized in patent research, primarily through backward and forward citation metrics. The number of forward citations, originally proposed by Trajtenberg (1990), is widely used to assess a patent’s technological impact, with a higher number generally interpreted as reflecting broader scope. The number of backward citations indicates the extent of prior art reviewed, suggesting a wide technological foundation (Packalen & Bhattacharya, 2012). In addition, non-patent literature (NPL) citations indicate a broader research base supporting the invention, as noted by Narin, Hamilton & Olivastro (1997). However, forward citations are not available for early-stage or draft patents, limiting their practical utility during the drafting phase.

Patent classification-based indicators: Classification-based indicators assess technological breadth based on the number of categories assigned to a patent. Studies by Lerner (1994) and Harhoff, Scherer & Vopel (2003) have demonstrated that patents with a greater number of subclasses tend to span a wider array of technological fields, reflecting broader applications and scope.

Claim-based indicators: Claim-based indicators are among the most direct measures of patent scope and can be further divided into two subgroups: indicators based on claim quantity and those based on claim structure.

Claim quantity-related indicators focus on the number and types of claims included in the patent. The number of claims is widely recognized as a measure of scope, with a greater quantity generally suggesting broader protection (Lanjouw & Schankerman, 1997, 2001, 2004). Similarly, the number of independent claims is interpreted as reflecting wider coverage, since each independent claim typically represents a distinct technological aspect (Marco, Sarnoff & Charles, 2019; Graham & Mowery, 2003). In contrast, dependent claims, although providing specificity and detail, do not significantly contribute to a broader scope (Graham & Mowery, 2003).

Claim structure-related indicators evaluate the linguistic, syntactic, and logical organization of individual claims. Commonly used metrics include words per claim (Lerner, 1994; Osenga, 2011; Harhoff, 2016), independent claim length (Malackowski & Barney, 2008; Marco, Sarnoff & Charles, 2019), and first claim length (Harhoff, 2016; Wittfoth, 2019). These studies have suggested that shorter claims are generally broader in scope due to fewer embedded limitations. Okada, Naito & Nagaoka (2016) introduced character count as an alternative metric, particularly useful in languages without word spacing, arguing that longer character sequences correlate with narrower, more detailed claims. Additionally, claim dependency structure, as explored by Wittfoth (2019), plays a role in defining the hierarchical and interpretive relationship between independent and dependent claims, impacting how broadly a claim set may be interpreted.

Semantics-based indicators: In response to limitations of numeric and bibliographic features, recent studies have introduced semantics-driven approaches. Tanaka, Nakashio & Kajikawa (2018) proposed the use of semantic range of words to measure vocabulary diversity, enabling scope visualization through semantic hierarchies. Ragot (2023) introduced a novel textual metric called self-information, which quantifies the informativeness of individual claims. Their findings suggest that higher self-information scores correlate with broader conceptual scope.

The number of inventors has also been interpreted as an indirect scope metric. Chan, Mihm & Sosa (2021), highlighted that a higher number of inventors reflects greater collaboration and the non-decomposability of the invention. The scope tends to decrease with the number of inventors.

Table 1 summarizes existing scope indicators, outlining their theoretical bases and known limitations. While these metrics span a range of approaches, they predominantly rely on bibliographic data, numeric heuristics, or surface-level linguistic cues. Notably absent are robust, semantically informed indicators capable of evaluating the breadth of a patent claim based on its underlying meaning and hierarchical structure. Currently, no widely adopted method allows authors to quantify whether a claim is semantically broad or narrow. This gap hinders precise calibration of claim scope and increases the risk of either under-protecting the invention or inviting legal challenges due to overly broad claims.

Table 1:

Summary of the patent scope indicators.

Scope indicator	Literature	Remarks
Number of forward citations	Trajtenberg (1990)	More forward citations reflect greater impact and scope.
Number of claims	Lanjouw & Schankerman (1997, 2001, 2004)	More claims suggest broader scope.
Number of NPL citations	Narin, Hamilton & Olivastro (1997)	More citations to non-patent literature imply a broader research base.
Words per claim	Lerner (1994), Osenga (2011), Harhoff (2016)	Shorter claims indicate broader coverage.
Number of sub classes	Lerner (1994), Harhoff, Scherer & Vopel (2003)	More subclasses indicate technological diversity.
Number of independent claims	Marco, Sarnoff & Charles, 2019, Graham & Mowery (2003)	More independent claims mean broader scope.
Number of dependent claims	Graham & Mowery (2003)	More dependent claims provide detailed extensions of the main invention.
Independent claim length	Malackowski & Barney (2008), Marco, Sarnoff & Charles (2019)	Shorter independent claims indicate broader coverage.
Number of backward citations	Packalen & Bhattacharya (2012)	More backward citations reflect wider prior art.
First claim length	Harhoff (2016), Wittfoth (2019)	Shorter first claims are broader.
Claim’s character count	Okada, Naito & Nagaoka (2016)	More characters suggest a narrower scope.
Semantic range of words	Tanaka, Nakashio & Kajikawa (2018)	Reciprocal of the number of hierarchies is considered
Based on dependencies of independent and dependent claims	Wittfoth (2019)	Dependency structure affects the scope of the claims.
Number of inventors	Chan, Mihm & Sosa (2021)	More inventors indicate higher collaboration and non-decomposible invention.
Self-information	Ragot (2023)	Quantifies unique information each claim provides.

DOI: 10.7717/peerj-cs.3069/table-1

Patent litigation prediction models

The prediction of patent litigation has evolved substantially, transitioning from traditional statistical models to sophisticated machine learning (ML) and deep learning (DL) frameworks. The existing literature can be organized into the following categories:

Classical machine learning approaches: Early work in this area focused on regression-based and tree-based models. Chien (2011) employed logistic regression (LR) to analyze how specific intrinsic and acquired patent traits influence the likelihood of litigation. Juranek & Otneim (2021) used the XGBoost algorithm with features drawn from united states patent and trademark office (USPTO) datasets, OECD patent quality indicators (PQI), and USPTO patent litigation docket reports, achieving high predictive performance (AUC up to 0.818). They identified that indicators related to patent value, internationality, and patent owner characteristics hold higher predictive power. However, their model’s reliance on post-grant information limits its applicability during the drafting phase. Similarly, Follesø & Kaminski (2020) utilized random forest (RF) classifiers trained on PQI-derived features to assess litigation risk.

Semantic and similarity-based approaches: Several researchers have explored textual content to infer litigation potential. Park, Yoon & Kim (2012) applied semantic similarity analysis based on Subject-Action-Object (SAO) patterns and clustering to identify potential infringement scenarios. Lee, Song & Park (2013) evaluated claim text similarity using keyword vector models and analysed inter-claim dependencies. While these methods incorporate both linguistic and structural elements, they often face limitations in scalability and generalizability, particularly across large or heterogeneous patent datasets. Although effective in identifying potential overlaps between pairs of patents, extending such analysis to all patent pairs for litigation prediction poses significant computational challenges.

Unsupervised and ensemble techniques: Several studies have integrated unsupervised learning and ensemble methods to enhance prediction accuracy. Wongchaisuwat, Klabjan & McGinnis (2017) combined K-means clustering with ensemble classification models to estimate the likelihood and timing of litigation jointly. Kim et al. (2022) applied principal component analysis (PCA) for dimensionality reduction and used Autoencoders in combination with K-nearest neighbors (K-NN) for classification, improving predictive performance by emphasizing the most informative features. Chen & Lai (2023) implemented an ensemble machine learning classifier leveraging USPTO examination and assignment data, achieving 79% accuracy and demonstrating the viability of ensemble methods for litigation risk assessment.

Deep learning models: Recent advances in deep learning have enabled the modeling of complex, multi-dimensional relationships present in patent litigation data. Liu et al. (2018) proposed a convolutional tensor factorization framework to identify high-risk patents based on textual and collaboration features. Wu et al. (2024) introduced the multi-aspect neural tensor factorization (MANTF) model to predict plaintiffs, defendants, and target patents jointly. Convolutional neural networks (CNNs) have also been utilized for one-to-many infringement detection (Liu & Pei, 2023), while Kim et al. (2021) employed random survival forests to model litigation risk over time.

The most recent and closely related work to the objectives of this study is by Juranek & Otneim (2024), who refined their XGBoost model to handle newly granted patents by minimizing reliance on post-grant features that are not available at the time of grant. In their study, the XGBoost algorithm was used for litigation prediction and achieved an AUC score of up to 0.822. However, this approach remains inapplicable to draft-stage documents due to its dependence on post-grant data.

Table 2 summarizes the prominent litigation prediction models and related studies, outlining their methodological foundations and known limitations. While these approaches span a range of machine learning and deep learning techniques, the majority rely on post-grant features such as forward citations, patent family size, assignment records, and other patent quality indicators. Models that assess litigation risk using only information available at the drafting stage are notably absent from the existing literature. In particular, semantic features embedded within patent claims, despite being central to legal interpretation and enforceability, remain largely underutilized in current predictive frameworks. Although some studies have applied semantic similarity analysis to identify potential overlaps or infringement between individual patent pairs, scaling such analyses across large patent datasets introduces significant computational challenges. Furthermore, no existing model provides a structured framework for predicting litigation risk at the draft stage using claim-level semantic features. This gap restricts the ability to conduct early-stage risk assessment and reduces the practical value of these models for inventors, legal professionals, and innovation strategists. The literature survey indicates that the proposed work is a pioneering effort for litigation prediction in patent drafts, and no comparable work for a one-to-one comparison is available.

Table 2:

Summary of the litigation prediction works.

Authors	Recommended method	Remarks
Chien (2011)	Logistic regression	Analyses the impact of intrinsic and acquired traits of patents in litigation
Park, Yoon & Kim (2012)	SAO-based semantic similarity measurement and clustering	SOA-based semantic technological similarity are computed between each patent, and clustering is applied to identify the clusters of patents with possible infringements.
Lee, Song & Park (2013)	Statistical methods (t-statistics, critical mean value) and hit ratios.	Similarity between all the patents are calculated based on keyword vectors and claim interdependence
Wongchaisuwat, Klabjan & McGinnis (2017)	K-means clustering and ensemble classification.	Predicts the litigation likelihood and the expected time to litigation
Liu et al. (2018)	Convolutional Tensor Factorization	Helps to identify the risky patents using their content and collaborative information
Follesø & Kaminski (2020)	Random forest	Litigation Prediction using OECD PQI features
Kim et al. (2021)	Clustering and random survival forest	Predicts patent litigation risk over time and considers the censored data
Juranek & Otneim (2021)	XGBoost	Features from different data sets provided by the USPTO, Patent Litigation Docket Reports Data & OECD PQI are used. 0.818 AUC reported with XGBoost.
Kim et al. (2022)	K-NN and autoencoder	PCA based feature extraction on quantitative features
Wu et al. (2024)	Multi-aspect neural tensor factorization	Can predict potential plaintiffs, defendants and patents
Chen & Lai (2023)	Ensemble machine learning classifier	Uses examination and assignment data and reported 79% accuracy
Liu & Pei (2023)	CNN	One to many infringement detection
Juranek & Otneim (2024)	XGBoost	Restricted to the features available at the time of grant. Indicators related to value, inter-nationality and patent-owners have higher predictive power. 0.822 AUC Reported with XGBoost.

DOI: 10.7717/peerj-cs.3069/table-2

Research gaps

Current scope indicators for patents can be broadly categorized into pre-grant and post-grant indicators based on their availability. For instance, indicators like the ‘number of claims’ and ‘backward citations’ are accessible during the pre-grant stage. In contrast, indicators such as ‘forward citations’ and ‘grant lag’ become available only after a patent is granted. Relying on indicators available at the pre-grant stage is crucial for assessing the scope of early-stage patent documents. As depicted in Table 1, established scope indicators do not focus on the semantics of the claim text when determining the patent scope. The lack of well-established claim scope indicators rooted in claim text semantics complicates the drafting process, frequently leading to suboptimal articulation of claim scope. This deficiency may lead to future financial losses due to insufficient protection or excessive legal costs associated with overly broad claims.

The current research on patent litigation prediction predominantly relies on externally compiled or post-grant features, such as international patent classification (IPC) details, forward citations, etc., which are only available for granted patents. Such feature requirements make them unsuitable for performing the litigation prediction on draft stage documents for which such features are unavailable. Another notable observation is that current works predominantly neglect claim semantics, which define the legal boundaries. To expand the applicability of litigation prediction models to a broader range of patent documents, including those in the pre-grant stage, it is imperative to develop methods that use only the features available at the early stage.

Research objectives

This study seeks to address the aforementioned gaps and advance the field of patent litigation prediction through the following objectives:

1.

To develop the HTS, a novel metric to quantify the scope of patent claims by analyzing semantic relationships in claim text, leveraging hyponyms, sentence structures, and interdependencies among claims.
2.

To design a multifeature fusion deep learning litigation prediction model that relies on claim text semantics and uses only early-stage features, ensuring applicability to both granted and draft-stage patent documents.

Research questions

This study aims to address the following research questions:

RQ1

How can a semantics-based indicator be developed to quantify the scope of patent claim text?

RQ2

What is the impact of incorporating the new claim scope indicator on patent litigation prediction tasks?

RQ3

How can a high-performance litigation prediction model be developed to predict the litigation risk of draft-stage patent documents?

Methodology

The development of a new indicator to quantify the patent claim scope and its evaluation using a litigation prediction task is presented in the first part of this work. The HTS is the proposed indicator. A litigation prediction model for draft-stage patent documents is developed in the second part. The proposed litigation prediction model is named Multifeature BERT-Powered Fusion for Author-level Patent Litigation Risk Analysis (MAPRA).

Dataset

This study is based on four primary datasets, each contributing essential information for patent scope analysis and litigation prediction. The USPTO PatentsView dataset (U.S. Patent and Trademark Office, 2024a; Toole, Jones & Madhavan, 2021) serves as the primary source of patent data, offering information on classification codes, inventors, assignees, and claim text. The 2024 update of this dataset is utilized in the present work. Complementing this, the OECD PQI database, January 2024 version (Organisation for Economic Co-operation and Development (OECD), 2024; Squicciarini, Dernis & Criscuolo, 2013), provides quantitative indicators capturing various dimensions of patent quality, such as technological relevance and potential economic value. Although only pre-grant features are incorporated into the prediction models, select PQI indicators are employed to evaluate the HTS.

Litigation data are obtained from the USPTO Patent Litigation Dataset (U.S. Patent and Trademark Office, 2024b; Toole, Miller & Sichelman, 2024), which records U.S. district court cases involving patent disputes filed between 1963 and 2020. This dataset includes 56,488 unique litigated patents. After applying a series of preprocessing operations, including merging and filtering, the final set comprises 40,897 unique litigated patents, each linked to its claim text, IPC classifications, and other relevant features. Patents not listed in the litigation dataset are treated as non-litigated as of 2020. However, to mitigate potential mislabeling due to delayed litigation, the sampling of non-litigated patents is restricted to those filed on or before 2010. This criterion ensures that most patents would have been granted by 2015, allowing for at least five years of post-grant observation within the litigation data collection window. Following established methodologies in the literature (Juranek & Otneim, 2024; Liu, Li & Liu, 2024), a total of 40,897 non-litigated patents are sampled to serve as the negative class. Patent litigation is a relatively rare event and affects fewer than 2% of all granted patents (Chien, 2011; Wongchaisuwat, Klabjan & McGinnis, 2017; Juranek & Otneim, 2021). Including all non-litigated patents would reflect real-world distributions but would also introduce substantial computational burdens, particularly for transformer-based models such as bidirectional encoder representations from transformers (BERT). To address this challenge, a 1:1 matched sampling strategy is employed.

As shown in Fig. 1, each litigated patent is paired with a non-litigated patent, resulting in a balanced dataset for training and evaluation. This approach aligns with the methodology adopted by Park, Bhardwaj & Hsu (2023), who implemented matched sampling based on filing year and cooperative patent classification (CPC) subclass code in the context of robustly optimized BERT pretraining approach (RoBERTa) based litigation prediction (Park, Bhardwaj & Hsu, 2023). In the present work, non-litigated patents filed on or before 2010 are sampled to mirror the distribution of IPC sections found in the litigated patent set. The final dataset consists of 81,794 records, with an equal number of litigated and non-litigated patents. A detailed description of all variables, their sources, and their intended roles in the analysis is provided in Table 3.

Figure 1: Number of IPC sections in the sampled dataset.

Download full-size image

DOI: 10.7717/peerj-cs.3069/fig-1

Table 3:

Details of the features used in this study.

Feature	Data source	Description
bwd_cits	OECD PQI	Number of backward citations
npl_cits	OECD PQI	Number of non-patent literature backward citations
claims_x	OECD PQI	Number of claims
filing	OECD PQI	Year of filing
dependent_claims	PatentsView	Number of dependent claims, calculated from claim text
independent_claims	PatentsView	Number of independent claims, calculated from claim text
claim_text	PatentsView	Text containing all the patent claims
assignee_pcount	PatentsView	Number of patents owned by the assignee, calculated from Assignee data
num_inventors	PatentsView	Number of inventors
avg_claim_length	PatentsView	Average claim length, calculated value
fc_word_count	PatentsView	Number of words in the first claim, calculated from claim text
hts_spacy	Generated	Generated feature, not used in the final modal
hts_spacy_wtd	Generated	Generated feature, used in the final modal
hts_stanza	Generated	Generated feature, not used in the final modal
hts_stanza_wtd	Generated	Generated feature, not used in the final modal
hts_avg	Generated	Generated feature, not used in the final modal
hts_avg_wtd	Generated	Generated feature, not used in the final modal
litigation_label	Litigation Docket	Binary litigation status calculated using USPTO Litigation Docket Data

DOI: 10.7717/peerj-cs.3069/table-3

Part 1: claim scope indicator development

This part of the study aims to derive a new indicator to quantify the claim scope based on the claim text semantics. Figure 2 represents the high-level view of the work. Before determining an appropriate methodology for claim scope quantification, understanding the nature of the patent claim text is essential.

Figure 2: Overview of the HTS feature generation and evaluation.

Download full-size image

DOI: 10.7717/peerj-cs.3069/fig-2

The claim text is a semi-structured text corpus with numbered claims, where each claim may explicitly reference other claim numbers to represent interconnections. Patent claims are broadly categorized into two types based on interdependency: independent and dependent claims. Independent claims are self-contained and provide a broad, comprehensive description of the invention, outlining its essential features without relying on other claims. These claims establish the widest boundaries of the patent’s protection. Conversely, dependent claims refer back to independent claims and add specific features or details, resulting in a narrower scope of protection. Dependent claims serve as fallback positions if the independent claim is deemed invalid, ensuring that specific embodiments or variations of the invention remain safeguarded. A typical claim consists of three main components: the preamble, the transitional phrase, and the body. The preamble establishes the context of the claim by identifying the invention’s category, such as a device, method, composition, or apparatus. The preamble aligns with the title of the invention and may include its objective or purpose. The transitional phrase links the preamble to the body, defining the claim’s scope. Transitional phrases are categorized into open-ended, such as “comprising”, which allows additional elements not explicitly mentioned in the claim, thereby broadening its scope, and closed-ended, such as “consisting of”, which limits the claim strictly to the listed elements. The body of the claim is the most critical part, detailing the elements and limitations of the invention and describing their meaningful interconnections. The body provides an in-depth explanation of how the components interact to realize the invention, ensuring clarity and precision in defining the scope of protection.

Hyponym tree score calculation

In natural language processing (NLP), hyponyms and hypernyms represent hierarchical relationships between words, which are crucial for understanding semantics and building structured knowledge. A hypernym refers to a broader, more general term, while a hyponym refers to a narrower, more specific term that falls under the hypernym. For example, in a taxonomy, ‘vehicle’ represents a hypernym of ‘car’, and ‘car’ serves as a hyponym of ‘vehicle’. Similarly, the hypernym ‘fruit’ encompasses hyponyms such as ‘berry’, ‘banana’, and ‘mango’. These relationships are often modelled in NLP using resources like WordNet (Fellbaum, 1998), where hypernym-hyponym hierarchies are explicitly defined. Understanding such relationships enables NLP systems to infer broader or narrower meanings, which is essential for analyzing the scope of patent claim texts. Words with more hyponyms in a patent claim indicate the potential to create multiple restrictive versions of claims, which can lead to overlaps in scope representing potential infringement cases and litigation risks. Consequently, studying hyponyms within patent claim text is pivotal in devising a new scope indicator for patents.

Patent claims often employ varying levels of specificity, where claims with broader scope support a larger interchangeability of terms to protect a more extensive set of derived ideas (Cohen & Lemley, 2001). However, when a claim employs overly generic language, the claim scope increases drastically, potentially clashing with more specific claims in other patents, leading to increased litigation risks and legal uncertainties. Conversely, highly specific claims may reduce infringement risks but face challenges in enforcing their rights against variations and derivative innovations. Analyzing hypernym and hyponym characteristics within patent claim texts (Andersson et al., 2014) can potentially play a crucial role in claim scope quantification. In this context, a new HTS indicator is developed to represent the patent scope. The HTS indicator is derived by considering the hyponym count in claim sentences and their structural composition. When words in a patent claim text have more hyponyms, the possibility of interchangeability increases, broadening the claim scope. The new scope indicator will be validated by assessing its effectiveness in predicting patent litigation likelihood.

Mathematically, let the patent claim text be represented as a hyponym dependency tree, $T = (V, E)$ , where $V = {v_{1}, v_{2}, \dots, v_{n}}$ is the set of nodes corresponding to the words in the claim, and E is the set of directed edges that denote the syntactic or semantic dependency relations between these words. Each node $v_{i} \in V$ is associated with a degree $d e g (v_{i})$ , representing the number of hyponyms (i.e., more specific terms) that can replace the corresponding word. The degree $d e g (v_{i})$ reflects the flexibility of the word within the claim text, where higher values represent greater possibilities for creating restrictive variations of the claim.

Given the tree structure, a cumulative score C for the entire set of claims as follows:

$C (T) = \prod_{v_{i} \in V} (d e g (v_{i}) + 1),$ where $d e g (v_{i})$ is the degree of node $v_{i}$ , representing the number of hyponyms for the word corresponding to node $v_{i}$ and the term $(d e g (v_{i}) + 1)$ accounts for the word itself (original term) and its associated hyponyms.

This cumulative score reflects the maximum number of specific or restrictive versions of the claims that could be generated from the given claim text. Each restricted version is a modified claim with a smaller scope, offering different legal interpretations and enforcement potentials. The cumulative score $C (T)$ indicates the scope of the original patent claims. A larger claim scope increases the likelihood of overlapping with other patents, a primary cause of litigation. Patents with higher cumulative scores are more prone to infringement due to the more significant number of possible interpretations and restrictive variations that could overlap with existing claims. Thus, $C (T)$ can quantify the scope or coverage of the patent claim and provide a theoretical foundation for predicting patent litigation risk based on hyponym analysis. Multiplicative $C (T)$ , which calculates the number of sub-trees possible from the original tree, has a problem with the lengthy claims producing very large values, and the effect of smaller claims goes unnoticed, hence discarded.

Three tasks were carried out to calculate the HTS of claim text: claim dependency tree generation, dependency tree generation for each sentence in the claim, and hyponym extraction of the words in each sentence. Algorithm 1, the Text2Scope, was developed to compute the HTS value from the patent claim text. Claims are represented as a graph with individual claims as the nodes and the dependency among them as the edges. A claims text corpus is processed using Algorithm 1 (Text2Scope), and a tree structure of the claims is generated initially. In the tree, each node contains the text corresponding to a numbered claim. Algorithm 2 (Claim2Scope) is invoked from Text2Scope to calculate the score of a given patent claim. Claim2Scope invokes Algorithm 3 (Sentence2Scope) to calculate the score of each sentence of the given claim. Sentence2Scope algorithm involves dependency tree generation to extract the sentence structure and node weight assignment using the hyponym counts of each node(word). Then it computes the cumulative score calculation for that sentence. These algorithms return hyponym tree scores and weighted hyponym tree scores. Equation (1) represents the sentence level non-weighted score calculation. Equation (2) is used for weighted score calculation.

Algorithm 1 :

Text2Scope: calculate the scope score for an entire patent claim tree.

1: Input: Text of multiple claims

2: Output:

3: Cumulative scores for the entire claim tree:

h t s_s p a c y

h t s_s p a c y_w t d

5: Parse the text to extract individual claims, each with a claim number and text.

6: Initialize a directed graph G where:

7: Nodes represent claims, and edges represent references between claims.

8: Initialize variables:

h t s_s p a c y \leftarrow 0

h t s_s p a c y_w t d \leftarrow 0

c o n n e c t e d_c o m p o n e n t s \leftarrow 0

10: for each claim do

11: Identify references to other claims.

12: Add the claim as a node in G.

13: Add edges from the claim to referenced claims.

14: end for

15: Find connected components in G.

16: for each connected component C in G do

17: Initialize

c o m p o n e n t_h t s_s p a c y \leftarrow 0

c o m p o n e n t_h t s_s p a c y_w t d \leftarrow 0

g r o u p_s i z e \leftarrow 0

18: for each claim in C do

19: Apply Claim2Scope on the claim text to compute individual scores:

20: Obtain

c l a i m_h t s_s p a c y

and

c l a i m_h t s_s p a c y_w t d

21: Update

c o m p o n e n t_h t s_s p a c y \leftarrow c o m p o n e n t_h t s_s p a c y + c l a i m_h t s_s p a c y

22: Update

c o m p o n e n t_h t s_s p a c y_w t d \leftarrow c o m p o n e n t_h t s_s p a c y_w t d + c l a i m_h t s_s p a c y_w t d

23: Increment

g r o u p_s i z e \leftarrow g r o u p_s i z e + 1

24: end for

25: Update

h t s_s p a c y \leftarrow h t s_s p a c y + \frac{c o m p o n e n t_h t s_s p a c y}{max (1, g r o u p_s i z e)}

26: Update

h t s_s p a c y_w t d \leftarrow h t s_s p a c y_w t d + \frac{c o m p o n e n t_h t s_s p a c y_w t d}{max (1, g r o u p_s i z e)}

27: end for

28: return

h t s_s p a c y

h t s_s p a c y_w t d

DOI: 10.7717/peerj-cs.3069/table-11

Algorithm 2 :

Claim2Scope: calculate the scope score for a claim.

1: Input: Claim text p

2: Output:

3: Claim score components:

h t s_s p a c y

h t s_s p a c y_w t d

a v g_t r e e_h e i g h t

t o t a l_h_c o u n t

t o t a l_h_s u m

t o t a l_n o d e_c o u n t

n u m b e r_o f_s e n t e n c e s

6: Split p into individual sentences.

7: Initialize variables to accumulate scores and counts across sentences:

h t s_s p a c y \leftarrow 0

h t s_s p a c y_w t d \leftarrow 0

t o t a l_h_c o u n t \leftarrow 0

t o t a l_h_s u m \leftarrow 0

t o t a l_n o d e_c o u n t \leftarrow 0

10:

t o t a l_t r e e_h e i g h t \leftarrow 0

n u m b e r_o f_s e n t e n c e s \leftarrow 0

11: for each sentence s in p do

12: Apply Sentence2Scope on s to obtain:

13:

h t s_s p a c y

h t s_s p a c y_w t d

t r e e_h e i g h t

c_n o d e

h_c o u n t

h_s u m

w t d_h_s u m

14: Update

h t s_s p a c y \leftarrow h t s_s p a c y + h t s_s p a c y

15: Update

h t s_s p a c y_w t d \leftarrow h t s_s p a c y_w t d + h t s_s p a c y_w t d

16: Update

t o t a l_h_c o u n t \leftarrow t o t a l_h_c o u n t + h_c o u n t

17: Update

t o t a l_h_s u m \leftarrow t o t a l_h_s u m + h_s u m

18: Update

t o t a l_n o d e_c o u n t \leftarrow t o t a l_n o d e_c o u n t + c_n o d e

19: Update

t o t a l_t r e e_h e i g h t \leftarrow t o t a l_t r e e_h e i g h t + t r e e_h e i g h t

20: Increment

n u m b e r_o f_s e n t e n c e s \leftarrow n u m b e r_o f_s e n t e n c e s + 1

21: end for

22: Calculate

a v g_t r e e_h e i g h t = \frac{t o t a l_t r e e_h e i g h t}{max (1, n u m b e r_o f_s e n t e n c e s)}

23: return

h t s_s p a c y

h t s_s p a c y_w t d

a v g_t r e e_h e i g h t

24: return

t o t a l_h_c o u n t

t o t a l_h_s u m

t o t a l_n o d e_c o u n t

n u m b e r_o f_s e n t e n c e s

DOI: 10.7717/peerj-cs.3069/table-12

Algorithm 3 :

Sentence2Scope: calculate the hyponym tree score for a sentence.

1: Input: Sentence s

2: Output: Hyponym Tree Score

h t s_s p a c y

, Weighted Hyponym Tree Score

h t s_s p a c y_w t d

, Tree Height

t r e e_h e i g h t

, Node Count

c_n o d e

, Hyponym Count

h_c o u n t

, Hyponym Sum

h_s u m

, Weighted Hyponym Sum

w t d_h_s u m

3: Initialize directed graph G, root node root_word as None, and other variables.

4: Process the sentence s to extract tokens using spaCy.

5: for each word w in s do

6: if w is not a stop word then

7: Compute hyponym count

h y p o n y m s_c o u n t

for w.

8: Update

h_c o u n t

and

h_s u m

9: Add w as a node in G with attributes (label, hyponyms_count).

10: end if

11: Add dependency relationships between tokens in G.

12: if w is the root of the dependency parse tree then

13: Set root_word

\leftarrow w

14: end if

15: end for

16: if

r o o t_w o r d

is None then

17: return default values.

18: end if

19: Assign levels and weights to nodes in G using a BFS traversal starting from root_word.

20: Compute

w t d_h_s u m

as the weighted sum of hyponym counts based on node levels.

21: Compute

t r e e_h e i g h t

as the maximum depth of G.

22: Compute

c_n o d e

as the total number of nodes in G.

23: Compute

n o r m a l i z e r = (c_n o d e \times t r e e_h e i g h t)

24: Compute

h t s_s p a c y = \frac{h_s u m}{n o r m a l i z e r}

25: Compute

h t s_s p a c y_w t d = \frac{w t d_h_s u m}{n o r m a l i z e r}

26: return

h t s_s p a c y

h t s_s p a c y_w t d

t r e e_h e i g h t

c_n o d e

h_c o u n t

h_s u m

w t d_h_s u m

DOI: 10.7717/peerj-cs.3069/table-13

(1) $H y p o n y m T r e e S c o r e = \frac{\sum_{i = 1}^{c_n o d e} H y p o n y m C o u n t (w_{i})}{c_n o d e \times t r e e_h e i g h t}$

where:

$H y p o n y m C o u n t (w_{i})$ : The number of hyponyms for the $i$ -th word in the dependency tree.
$c_n o d e$ : The total number of nodes in the dependency tree.
$t r e e_h e i g h t$ : The maximum depth of the dependency tree.

(2) $H y p o n y m T r e e S c o r e_{w e i g h t e d} = \frac{\sum_{i = 1}^{c_{n} o d e} H y p o n y m C o u n t (w_{i}) \cdot N o d e H e i g h t (w_{i})}{c_n o d e \times t r e e_h e i g h t}$

where:

$H y p o n y m C o u n t (w_{i})$ : The number of hyponyms for the $i$ -th word.
$N o d e H e i g h t (w_{i})$ : The height of the $i$ -th word in the dependency tree.
$c_n o d e$ : The total number of nodes in the dependency tree.
$t r e e_h e i g h t$ : The maximum depth of the dependency tree.

When considering the implementation options, the dependency tree of a sentence can be created using two popular NLP libraries, namely, Stanza (Qi et al., 2020) and SpaCy (Honnibal et al., 2020). It has been observed that the dependency tree representation for the same sentence differs between SpaCy and Stanza. Figures 3 and 4 shows the dependency trees created for a sample sentence using Stanza and SpaCy, respectively. The hyponym tree score is dependent upon the dependency tree structure. For this reason, the score calculation was evaluated using both libraries, and a third option was created by averaging the sentence-level scores generated by both libraries. Thus, six candidate hyponym tree scores were generated for further evaluation: three based on SpaCy, Stanza, and averaging, and the weighted versions of all three. Table 4 summarises the HTS candidates generated for evaluation. The difference between the options is primarily based on two factors: the NLP library used to generate the dependency structure of a sentence and whether the node-level score (hyponym count of that word) is multiplied by a weight. The weight corresponds to the height-based level value, where the leaf node is assigned to level 1, and the root node is assigned level N for a tree with N levels.

Figure 3: Hyponym tagged dependency tree with Stanza.

Download full-size image

DOI: 10.7717/peerj-cs.3069/fig-3

Figure 4: Hyponym-tagged dependency tree with SpaCy.

Download full-size image

DOI: 10.7717/peerj-cs.3069/fig-4

Table 4:

HTS candidates evaluated.

HTS	Description
hts_spacy	Dependency tree generated using Spacy and nodes are not weighted
hts_spacy_wtd	Dependency tree generated using Spacy and nodes are weighted
hts_stanza	Dependency tree generated using Stanza and nodes are not weighted
hts_stanza_wtd	Dependency tree generated using Stanza and nodes are weighted
hts_avg	Sentence level average of hts_spacy and hts_stanza
hts_avg_wtd	Sentence level average of hts_spacy_wtd and hts_stanza_wtd

DOI: 10.7717/peerj-cs.3069/table-4

Hyponym tree score validation

Length-based indicators like ‘first claim length’ blindly treat lengthy claims as specific and short claims as broader, irrespective of the semantics. Ragot (2023) presented a set of fixed-length representative claims with varying scopes in section C1 of their work to study the claim scope. The same set of claims is used in this work to study the ability of the newly calculated HTS candidate values. The HTS candidate values are calculated with all the sample claims and presented in Table 5. The sentences are arranged in the descending order of their scope. Figure 5 is a normalized plot of the scope values generated for sample sentences using all the six HTS candidates under evaluation. As per the results, HTS candidates can show scope reduction, whereas the word count fails to represent any scope change. However, due to the close similarity between the results from all the HTS candidates under evaluation, a decision is made to generate all six HTS candidate scores for the entire dataset and to make the final HTS candidate selection only after a complete evaluation with the entire dataset.

Table 5:

HTS candidate values generated for fixed-length sample claims.

Sentence Ref.	hts_spacy	hts_spacy_wtd	hts_stanza	hts_stanza_wtd	hts_avg	hts_avg_wtd	word_count
C1.1	304.94	1,369.15	406.89	1,566.00	355.92	1,467.58	25
C1.2	279.93	1,256.26	378.24	1,454.97	329.09	1,355.62	25
C1.3	144.73	976.87	260.65	1,373.70	202.69	1,175.29	25
C1.4	145.38	883.62	252.33	1,240.00	198.86	1,061.81	25
C1.5	117.92	828.69	218.33	1,172.00	168.13	1,000.35	25
C1.6	87.80	664.80	187.14	1,190.57	137.47	927.69	25
C1.7	99.00	694.80	168.69	973.80	133.84	834.30	25
C1.8	103.99	695.88	170.92	964.77	137.46	830.33	25

DOI: 10.7717/peerj-cs.3069/table-5

Figure 5: Claim scope representation using word count and HTS candidates.

Download full-size image

DOI: 10.7717/peerj-cs.3069/fig-5

Connecting HTS with litigation risk and claim scope

The preliminary validation of the relationship between HTS and claim scope (CS) is demonstrated in “Hyponym Tree Score Validation”. The results indicate that higher HTS values correspond to broader CS. Previous studies (Merges & Nelson, 1994; Arinas, 2012; Marco, Sarnoff & Charles, 2019) have established that broader claim scope increases the likelihood of litigation and legal events. By transitive reasoning, the relationship between HTS and patent litigation probability ( $P_{l i t}$ ) can be considered valid. However, not all patents with broad claim scope result in litigation, as litigation requires legal action to be pursued. This observation may weaken the relationship between HTS and litigation probability.

The evaluation model to study the connection between the HTS and CS is summarized as follows:

Patents with high HTS are likely to have a broader CS.
Broader CS increases the probability of litigation ( $P_{l i t}$ ).
Patents with high HTS and high $P_{l i t}$ are indicative of broader CS.
Observation: Not all patents with broad Claim Scope (CS) will result in litigation ( $P_{l i t}$ ).

Objective: A high HTS strongly predicts $P_{l i t}$ , which indicates a broader CS. This provides a scientific basis for using HTS as a quantification method for claim scope and offers a robust framework for patent strategy formulation and risk assessment.

Predicate logic:

Let HTS(x): Patent x has a high Hyponym Tree Score.
Let CS(x): Patent x has a broad Claim Scope.
Let P_lit(x): Patent x has a high probability of litigation.

Statements:

1.

$\forall x (H T S (x) \to C S (x))$ : High HTS implies broad CS.
2.

$\forall x (C S (x) \to P_{l i t} (x))$ : Broad CS implies a high probability of litigation.
3.

$\forall x ((P_{l i t} (x) \land H T S (x)) \to C S (x))$ : High $P_{l i t}$ and HTS imply broad CS.
Observation $\exists x (C S (x) \land \neg P_{l i t} (x))$ : Not all patents with broad CS result in litigation.

Proof: Hypothesis: $\forall x (H T S (x) \to C S (x))$

Proof. 1. Assume $H T S (x)$ for an arbitrary patent $x$ .
2. From statement 1, $H T S (x) \to C S (x)$ , so $C S (x)$ holds.
3. From statement 2, $C S (x) \to P_{l i t} (x)$ , thus $P_{l i t} (x)$ holds.
4. From statement 3, $(P_{l i t} (x) \land H T S (x)) \to C S (x)$ .
5. Given $P_{l i t} (x)$ and $H T S (x)$ , concludes $C S (x)$ .

Therefore, high HTS implies broad CS.

Selecting the HTS best candidate

The values of all the HTS candidates are calculated for the entire dataset. Table 6 documents the statistics of the different HTS candidates under evaluation. The distribution of the IPC sections in the dataset is shown in Fig. 1. Section G has the most samples in the dataset. Figure 6 shows the average HTS values for non-litigated and litigated patents belonging to each section. This justifies the Proof “Connecting HTS with Litigation Risk and Claim Scope”, on the IPC section level, litigated patents have a higher HTS value than the non-litigated patents and supports the connection between the Litigation probability and HTS value.

Table 6:

Statistics of all the features.

Feature	Count	Mean	Std Dev	Min	25%	50%	75%	Max
bwd_cits	81,794	26.202	70.963	0.000	5.000	11.000	24.000	6,732.000
npl_cits	81,794	8.950	35.334	0.000	0.000	0.000	4.000	2,128.000
claims_x	81,794	18.313	18.481	1.000	8.000	15.000	22.000	887.000
avg_claim_length	81,794	41.688	33.095	1.000	23.762	33.714	49.000	3,198.000
num_dependent_claims	81,794	15.043	16.933	0.000	6.000	12.000	19.000	886.000
num_independent_claims	81,794	3.326	3.365	0.000	2.000	3.000	4.000	155.000
assignee_pcount	81,794	7,246.173	21,565.206	1.000	15.000	176.000	2,656.000	156,703.000
num_inventors	81,794	2.378	1.749	1.000	1.000	2.000	3.000	31.000
fc_word_count	81,794	167.448	114.100	2.000	100.250	147.000	209.000	7,711.000
hts_spacy	81,794	1,356.222	1,621.952	1.000	470.201	923.674	1,683.081	62,917.851
hts_spacy_wtd	81,794	10,839.710	14,770.509	1.500	3,320.736	6,830.528	13,207.373	663,865.829
hts_stanza	81,794	1,418.662	1,824.045	0.000	486.020	951.797	1,735.294	73,828.074
hts_stanza_wtd	81,794	10,250.467	13,744.568	0.000	3,191.651	6,489.478	12,493.479	650,009.383
hts_avg	81,794	1,387.442	1,668.136	0.700	485.400	943.672	1,715.445	68,372.963
hts_avg_wtd	81,794	10,545.088	14,217.464	1.500	3,267.236	6,674.460	12,860.270	656,937.606

DOI: 10.7717/peerj-cs.3069/table-6

Figure 6: Average HTS for each IPC section.

Download full-size image

DOI: 10.7717/peerj-cs.3069/fig-6

The most suitable candidate to represent the CS has to be selected from the six HTS candidates. This section presents seven experiments designed to assess the relative merit of the HTS candidate in litigation prediction. The difference between the experiments is only in the features used for the classification. Set of standard pre-grant features, termed as the baseline features, include ‘bwd_cits’, ‘npl_cits’, ‘claims_x’, ‘num_dependent_claims’, ‘num_independent_claims’, ‘assignee_pcount’, fc_word_count, avg_claim_length and ‘num_inventors’. Details of these features are documented in Table 3. Each experiment used random forest, XGBoost, support vector classifier (SVC) and balanced random forest (BRF) models to perform litigation prediction to assess the impact of including the HTS candidate feature with the baseline features.

Figure 7 presents the changes in the accuracy of the litigation prediction during each experiment with different prediction models. Experiment A (Exp-A) performs the prediction by using only the baseline features. Experiments B to G added each HTS candidate along with the baseline features. In all the experiments, XGBoost resulted in the best prediction results. Table 7 presented the features used in each experiment and the best prediction performance achieved with XGBoost. Table 8 shows the correlation between the HTS candidates and other existing patent scope or value indicators. A larger value of HTS indicated a higher litigation probability.

Figure 7: Litigation prediction accuracy for all ML models during each experiment.

Download full-size image

DOI: 10.7717/peerj-cs.3069/fig-7

Table 7:

Performance metrics for experiments using XGBoost.

Experiment	Features	Accuracy	Precision	Recall	F1-score	AUC
Exp-A	Baseline features	0.770	0.788	0.738	0.762	0.770
Exp-B	Baseline features + hts_spacy	0.770	0.785	0.743	0.764	0.770
Exp-C	Baseline features + hts_spacy_wtd	0.772	0.789	0.743	0.765	0.772
Exp-D	Baseline features + hts_stanza	0.771	0.787	0.742	0.764	0.771
Exp-E	Baseline features + hts_stanza_wtd	0.770	0.786	0.743	0.764	0.770
Exp-F	Baseline features + hts_avg	0.772	0.787	0.746	0.766	0.772
Exp-G	Baseline features + hts_avg_wtd	0.772	0.789	0.744	0.766	0.772

DOI: 10.7717/peerj-cs.3069/table-7

Table 8:

Correlation between the HTS candidates and other indicators.

Feature	fwd_cits5	PQI6	family_size	grant_lag	fc_word_count	litigation_label
hts_spacy	0.092	0.328	0.007	0.116	0.119	0.205
hts_spacy_wtd	0.071	0.241	−0.033	0.103	0.192	0.152
hts_stanza	0.085	0.300	0.020	0.105	0.125	0.185
hts_stanza_wtd	0.068	0.240	−0.025	0.099	0.201	0.148
hts_avg	0.091	0.323	0.014	0.114	0.126	0.201
hts_avg_wtd	0.070	0.241	−0.029	0.101	0.197	0.150

DOI: 10.7717/peerj-cs.3069/table-8

Figure 8 shows the variation of the metric from the experiment A results during each experiment. Accordingly, prediction accuracy improved marginally with the introduction of the HTS candidate features. Whenever the positive correlation between the HTS and Litigation probability is valid, the positive correlation between the HTS and Claim Scope is also valid as per the proof “Connecting HTS with Litigation Risk and Claim Scope”. Thus, the relationship between the HTS and Patent Scope is reconfirmed.

Figure 8: Percentage variation of XGBoost prediction performance compared to Exp-A.

Download full-size image

DOI: 10.7717/peerj-cs.3069/fig-8

From the evaluation results, hts_stanza and hts_stanza_wtd are unambiguously ruled out. The hts_avg_wtd produced slightly better prediction results compared to hts_spacy and hts_spacy_wtd. During the experiments, it was observed that the creation of a stanza-based dependency graph failed for several sentences. Calculating the average scores requires both stanza and spacy-based dependency tree creation. The average scores were also discarded to avoid the dependency tree creation issues observed with Stanza. The remaining candidates are hts_spacy and hts_spacy_wtd, and experiments B and C evaluate the prediction with these features. When comparing the results between experiments B and C, hts_spacy_wtd produces slightly better results. Figure 9 represents the information gain of the candidate HTS features. As per the information gain of the candidates, hts_spacy shall be the candidate.

Figure 9: Information gain of the features used in litigation prediction.

Download full-size image

DOI: 10.7717/peerj-cs.3069/fig-9

To overcome the ambiguity, a feature-based extremes evaluation was conducted to investigate the relationship between the candidate features and the litigation label and the results are presented in Fig. 10. Specifically, the top 100 and bottom 100 records based on the candidate feature’s values were identified, and their litigation labels were analyzed. The objective was to determine whether the top 100 records predominantly correspond to the positive class (litigated) and the bottom 100 records to the negative class (non-litigated). This analysis provided insights into the HTS candidate feature’s discriminative power, thereby offering a supplementary validation of the feature’s relevance in the litigation prediction task. Extremes analysis and correlation analysis resulted in favour of hts_spacy against hts_spacy_wtd. In these circumstances, hts_spacy was selected as the final candidate for the HTS. All further references to HTS will indicate the usage of hts_spacy as the indicator for the claim scope representation.

Figure 10: Extremes analysis for predictive label validation.

Download full-size image

DOI: 10.7717/peerj-cs.3069/fig-10

Part 2: litigation prediction model development

The primary objective of this phase was to develop a litigation prediction model that incorporates claim scope understanding while remaining independent of post-grant features, thereby ensuring its applicability to patent drafts. Transformer-based, pre-trained models have garnered considerable attention for text-based analyses, owing to their exceptional ability to capture contextual nuances (Gasparetto et al., 2022). Among these, BERT models are widely recognized for their effectiveness in language comprehension tasks, making them a suitable choice for litigation prediction using claim text. Initial experiments employing BERT demonstrated superior predictive performance compared to traditional machine learning models, validating the effectiveness of transformer-based approaches in this domain.

The results obtained using the BERT (bert-base-uncased) model were based on processing only the claim text, limited to the first 512 words, which may result in incomplete comprehension of lengthy patent claims. In this context, instead of the usual chunking approach, a multifeature fusion approach was devised, integrating both claim text and numerical features to enhance prediction accuracy. “Part 1: Claim Scope Indicator Development” identified HTS as a potential indicator of claim scope, with the hts_spacy variant chosen as the optimal feature. To augment the scope-awareness of the model, HTS was incorporated alongside the baseline numerical features described in “Selecting the HTS Best Candidate”. As a result, a Multifeature Fusion Deep Learning Model was proposed and evaluated, leveraging both textual and numerical modalities to predict litigation risks effectively.

Proposed litigation prediction model

The proposed model, referred to as MAPRA (Multifeature BERT-Powered Fusion for Author-level Patent Litigation Risk Analysis), is specifically designed to assess litigation risks in patent drafts. The model’s architecture aims to capture both the semantic intricacies of claim text and the scope awareness conveyed by numerical features. The textual component is processed through a BERT-based encoder, which extracts semantic information essential for identifying litigation-prone claims. Concurrently, the numerical features provide supplementary insights, such as claim scope and other litigation-relevant factors, resulting in a holistic view of the data. Additionally, author and assignee details, often absent in the claim text, are incorporated as numerical features to enhance prediction accuracy. The ten numeric features used in this analysis are bwd_cits, npl_cits, claims_x, avg_claim_length, num_dependent_claims, num_independent_claims, assignee_pcount, num_inventors, hts_spacy, and fc_word_count.

Data preprocessing plays a pivotal role in ensuring the robustness of the model. Textual claims are tokenized and encoded using the BERT tokenizer, which standardizes the input by padding sequences to a fixed length. This ensures compatibility with the BERT encoder while preserving consistency in input dimensions. Simultaneously, numerical features are imputed, trimmed, normalized, and scaled to maintain uniformity across the dataset. Despite the 2% real-world prevalence of the positive class, coverage of positive samples is ensured through a 10% oversampling scheme, thereby guaranteeing that positive examples are included in each training epoch. The processed data are stratified into training (80%), validation (10%), and test (10%) sets, with both balanced (1:1) and realistic imbalanced ( $\approx 2 %$ positives) splits created for evaluation. This separation is critical to mitigating overfitting and validating the model’s generalizability to unseen data.

Model architecture and workflow

The proposed MAPRA model integrates textual and numerical data modalities to predict a binary litigation outcome (litigated/not litigated). The MAPRA architecture is specifically designed to assess litigation risks, combining semantic intricacies from claim text with numeric indicators representing claim scope, inventor, and assignee attributes. Figure 11 illustrates the complete workflow of the model.

1.

Input representation
Let $x$ represent tokenized patent claim text and $n = (n_{1}, n_{2}, \dots, n_{10})$ represent the vector of associated numeric features.
2.
Tokenization and embedding
- Tokenization: Claim texts are tokenized and padded to length 512 offline using BERT tokenizer, yielding cached $i n p u t_i d s$ and $a t t e n t i o n_m a s k$ .
- BERT embedding: Text embeddings are generated by fine-tuning layers 8–11 of BERT while keeping layers 0–7 frozen. The output embeddings are: $E = (e_{1}, \dots, e_{T}), e_{i} \in R^{d}, d = 768.$
3.

CLS token representation
The embedding of the [CLS] token is extracted as a semantic representation of the claim: $e_{C L S} = E_{C L S} \in R^{768} .$
4.
Combining text and numeric features
- Numeric features are subjected to median imputation, trimming, and Min-Max normalization to the range $[0, 1]$ .
- Numeric embeddings ( $e_{N u m} \in R^{48}$ ) are generated through a multilayer perceptron (MLP): $L i n e a r (10 \to 64) \to R e L U \to D r o p o u t (0.4) \to L i n e a r (64 \to 48) \to R e L U .$
- A learnable modality weighting vector $p \in R^{2}$ , clamped and normalized via softmax, yields weights $w_{0}, w_{1}$ , resulting in: $f = [w_{0} e_{C L S}; w_{1} e_{N u m}] \in R^{816} .$
5.
Classification layer
- The combined feature vector $f$ is passed through a feedforward neural network.
- Dropout with probability 0.33 is used.
- Fully connected layer, batch normalization (BN), and ReLU activation: $h_{1} = R e L U (B N (W_{1} f + b_{1})), W_{1} \in R^{256 \times 816} .$
- The logits for binary classification are computed as: $z = W_{2} h_{1} + b_{2}, W_{2} \in R^{2 \times 256} .$
- The softmax function is applied to obtain the predicted probabilities: $\hat{y} = s o f t m a x (z) \in R^{2} .$ where $\hat{y} \in R^{2}$ represents the probabilities for the two classes.
6.

Loss function
To effectively handle class imbalance during training, the MAPRA model employs a cost-sensitive variant of the binary cross-entropy loss known as the focal loss. The standard binary cross-entropy (CE) loss for binary classification, with true label $y \in {0, 1}$ and predicted probability $\hat{p}$ , is given by: $L_{C E} = - [y \log (\hat{p}) + (1 - y) \log (1 - \hat{p})] .$
The focal loss extends this by emphasizing difficult-to-classify examples, and is defined as: $L_{F L} = α (1 - p_{t})^{γ} L_{C E},$
where $p_{t}$ is the model’s predicted probability for the true class, $γ$ is the focusing parameter (set to $2$ ), and $α$ balances the class weights (set to $1$ ). To further account for severe class imbalance, class-specific weights are employed as follows: $w_{+} = \sqrt{\frac{1 - π_{+}}{π_{+}}}, w_{-} = 1, π_{+} = 0.02.$
Thus, the final weighted focal loss for the dataset of size N is expressed as: $L = - \frac{1}{N} \sum_{i = 1}^{N} w_{y_{i}} [α (1 - p_{t, i})^{γ} (y_{i} \log ({\hat{p}}_{i}) + (1 - y_{i}) \log (1 - {\hat{p}}_{i}))] .$
This combined loss function enhances the model’s sensitivity toward the minority class, significantly improving recall performance on rare litigated patents.
7.
Training objective
- The model parameters $Θ$ are optimized by minimizing the focal loss ( $L_{F L}$ ) across the training dataset using the AdamW optimizer (learning rate $1.2 \times 10^{- 5}$ , weight decay $0.01$ ), cosine scheduler with a 5% warmup phase, gradient clipping (maximum norm of $1.0$ ), and early stopping based on validation area under the precision-recall curve (AUPRC): $min_{Θ} \frac{1}{N} \sum_{i = 1}^{N} L_{F L} (y_{i}, {\hat{y}}_{i}),$ where N denotes the total number of training samples, $y_{i}$ is the true label, and ${\hat{y}}_{i}$ is the predicted probability vector for the $i$ -th training example.
8.
Evaluation under true class imbalance

Given the true positive prevalence $π_{+} = \frac{N_{+}}{N} = 0.02,$ where $N_{+}$ is the number of litigated patents in a test set of size N, the classifier produces scores ${\hat{p}}_{i} = P (y_{i} = 1 ∣ x_{i})$ . Instead of relying solely on the default threshold $0.5$ , the decision threshold is calibrated using multiple criteria on a 2%-positive validation set:
- F1-optimal threshold: $t_{F 1} = \arg max_{t} F_{1} (t), F_{1} (t) = \frac{2 P (t) R (t)}{P (t) + R (t)},$ where precision (P) and recall (R) at threshold $t$ are defined as: $P (t) = \frac{\sum_{i = 1}^{N} I [{\hat{p}}_{i} \geq t] y_{i}}{\sum_{i = 1}^{N} I [{\hat{p}}_{i} \geq t]}, R (t) = \frac{\sum_{i = 1}^{N} I [{\hat{p}}_{i} \geq t] y_{i}}{\sum_{i = 1}^{N} y_{i}} .$
- Accuracy-optimal threshold: $t_{a c c} = \arg max_{t} A c c (t) .$
- Fixed 2% flag-rate threshold ( $98^{t h}$ percentile): $t_{2 %} s u c h t h a t P [{\hat{p}}_{i} \geq t_{2 %}] = 0.02.$ Ranking quality under extreme class imbalance is further evaluated using Precision@K and Recall@K: $P @ K = \frac{1}{K} \sum_{i = 1}^{K} y_{(i)}, R @ K = \frac{1}{\sum_{i} y_{i}} \sum_{i = 1}^{K} y_{(i)},$ where $y_{(i)}$ denotes the ground-truth label of the $i$ -th highest-scoring example. Additionally, the Area Under the Precision–Recall Curve (AUPRC), $A U P R C = \int_{0}^{1} P (R) d R,$ is monitored and optimized, given its superior informativeness over area under the receiver operating characteristic curve (ROC-AUC) in highly imbalanced contexts ( $π_{+} ≪ 0.5$ ).
  
  The model training explicitly uses a class-weighted focal loss: $L_{F L} = - \frac{1}{N} \sum_{i = 1}^{N} w_{y_{i}} α (1 - p_{t, i})^{γ} [y_{i} \log {\hat{p}}_{i} + (1 - y_{i}) \log (1 - {\hat{p}}_{i})],$ with class weights $w_{+} = \sqrt{\frac{1 - π_{+}}{π_{+}}} \approx 7, w_{-} = 1, α = 1, γ = 2.$ These class weights imply that false negatives incur significantly higher penalties due to the extreme class imbalance ( $π_{+} = 0.02$ ), effectively enhancing recall for rare litigation-positive cases. Combined with offline positive-class oversampling (10%), comprehensive numeric preprocessing (imputation, trimming, scaling), multi-threshold calibration, and rigorous ranking metrics, this approach effectively maximizes recall for rare litigation events while controlling false-positive predictions.
9.
Model inference
- During inference, the class label is predicted by selecting the class with the highest probability: $\hat{y} = \arg max_{c} {\hat{y}}_{c} .$ Optionally, calibrated thresholds (e.g., F1-optimal, accuracy-optimal, or fixed flag-rate thresholds) may be applied to enhance inference quality under class imbalance.

Figure 11: Litigation prediction using MAPRA model.

Download full-size image

DOI: 10.7717/peerj-cs.3069/fig-11

Summary of the workflow

1.

Offline tokenize and pad claim texts (length 512) using the BERT tokenizer.
2.

Extract the [CLS] token embedding from fine-tuned BERT as the textual representation.
3.

Preprocess numeric features (imputation, trimming, scaling) and encode them using a dedicated numeric MLP.
4.

Combine textual and numeric embeddings using learnable modality clamped-weighting and concatenation.
5.

Pass the weighted combined embedding through a feedforward neural network with dropout and batch normalization.
6.

Compute class probabilities using softmax and minimize the focal loss during training.
7.

Calibrate optimal decision thresholds on a validation set.
8.

Predict class labels based on calibrated probabilities during inference.

Training and evaluation

Training involves minimizing the focal loss using the AdamW optimizer, gradient clipping, dropout regularization, and early stopping based on validation performance (AUPRC). Final evaluation on balanced and realistic imbalanced splits involves comprehensive metric computation and visualization of receiver operating characteristic curves, precision-recall curves, and confusion matrices. This architecture, which combines NLP techniques with numeric feature integration, provides a robust framework for predicting litigation risk in legal analytics.

Results

The first part of the work developed an indicator named ‘Hyponym Tree Score’ for patent claim text scope quantification. To generate the proposed score from the patent claim text input, algorithms were designed and implemented using two popular NLP libraries, Spacy and Stanza. Thus, six candidate hyponym scores were generated for further evaluation. The validity of the HTS candidates was initially evaluated using the sample claims to verify their ability to distinguish the scope variation from a set of fixed-length claims. Based on the positive results from that experiment, seven experiments were conducted to perform the litigation prediction; the results were consolidated in Table 7. In addition to the prediction results, extremes analysis, correlation and information gain results are also considered to identify the final candidate for HTS. Based on observations, the non-weighted HTS generated using the Spacy library was selected as the final candidate for the proposed HTS to quantify the patent claim scope.

The second part of this work aimed to develop a litigation prediction model that relies solely on early-stage features, ensuring its applicability to both patent drafts and granted patents. The key direction adopted in the development of the proposed ‘MAPRA’ model was utilising the BERT Model for claim text understanding and augmenting the text information with additional numerical features by designing a Multifeature Fusion approach. Among the BERT options, the BERT base (bert-base-uncased) model was used for text understanding purposes. As a preliminary step, a baseline BERT model using only claim text was implemented to validate its capability in litigation prediction. The results were comparable to those achieved in the first part of the study. Building on this validation, the MAPRA model was developed and tested using both the claim text and the ten numerical features from Part 1, Experiment B. Table 9 presents the prediction metrics for the BERT-based experiments. Figure 12 shows the accuracy improvement in litigation prediction with different models.

Table 9:

Performance metrics for BERT-based models.

Experiment	Features	Accuracy	Precision	Recall	F1-score	AUC
BERT	Only claim text	0.8099	0.8099	0.8099	0.8099	0.8099
MAPRA	Claim text + Features from Exp-B	0.8005	0.7779	0.8411	0.8083	0.8776

DOI: 10.7717/peerj-cs.3069/table-9

Figure 12: Comparison of the performance metrics with different experiments.

Download full-size image

DOI: 10.7717/peerj-cs.3069/fig-12

To enhance model interpretability and build confidence in its predictive behavior, SHAP (SHapley Additive exPlanations) values were employed to quantify the contribution of individual input features to the model’s output. Figure 13 presents a SHAP summary plot ranking features based on their impact on model predictions. The most influential inputs were num_inventors, hts_spacy, and num_independent_claims. Notably, num_inventors, despite its high ranking, represents external metadata unrelated to the content of patent claims. In contrast, hts_spacy, the proposed semantic scope indicator derived from claim text, was the most impactful claim-related feature. Its high SHAP values indicate that the semantic structure of claims plays a substantial role in predicting litigation risk. These results validate the relevance of traditional bibliometric indicators while empirically demonstrating the added value of incorporating HTS, the proposed claim scope indicator. Overall, the findings reinforce the interpretability of the MAPRA model and underscore the potential of hts_spacy as a meaningful early-stage indicator of litigation risk.

Figure 13: SHAP summary plot showing feature importance based on average impact on model output.

Download full-size image

DOI: 10.7717/peerj-cs.3069/fig-13

The primary objective of the proposed model is to serve as an early-stage risk assessment tool during the patent drafting process. In this context, the model functions as a screening mechanism, where false negatives (i.e., failing to identify potentially litigated patents) pose greater strategic risk than false positives. As such, achieving high recall is essential. The MAPRA model demonstrates superior recall and ROC-AUC compared to all baseline configurations when evaluated using the F1-optimal threshold, which yields its best overall performance and underscores its effectiveness in minimizing missed high-risk cases. These results support its viability as a decision-support tool capable of providing meaningful, actionable insights to patent authors at the draft stage. These early-stage insights enable authors to manage the legal scope of their claims and assess the potential litigation risk of their patents. Figure 14 illustrates improvements in key prediction metrics across different experiments. Compared to the results from Experiment A, the MAPRA model achieved a 4% improvement in prediction accuracy and a 14% improvement in recall. As a pioneering effort in predicting litigation risk for patent drafts, MAPRA cannot be directly compared to prior models, as no published work to date has addressed this problem at the draft stage. The most comparable existing study (Juranek & Otneim, 2024) focuses on litigation prediction using features available immediately after patent grant and reports an AUC of 0.822. In contrast, the MAPRA model achieves a higher AUC of 0.878 while relying exclusively on early-stage features. A key advantage of MAPRA is that it does not depend on post-grant or acquired information, yet it demonstrates superior predictive performance. These results highlight MAPRA’s capability to assess litigation risk effectively at both the draft and post-grant stages. To the best of the authors’ knowledge, this represents the first published approach explicitly designed for litigation risk prediction during the patent drafting stage.

Figure 14: Percentage improvement in key prediction metrics relative to Exp-A.

Download full-size image

DOI: 10.7717/peerj-cs.3069/fig-14

Although the model was trained on a balanced dataset, its evaluation on a realistically imbalanced test set, comprising approximately 2% litigated and 98% non-litigated patents, demonstrates its effectiveness in identifying high-risk cases. The test set contains 4,173 samples, including only 83 litigated instances. The model achieves a recall of 85.54%, successfully capturing the majority of truly litigated patents. Despite the expected trade-off in such imbalanced settings, it attains a precision of 6.74% and an F1-score of 0.1250. Notably, its Precision@200 is 16%, representing an eightfold improvement over random selection. Additional performance metrics include an accuracy of 76.18%, a ROC-AUC of 0.8786, and an Average Precision (AP) of 0.1909, highlighting the model’s strong ranking performance. These results suggest that the model is well suited for prioritization tasks in large-scale patent portfolios (Saito & Rehmsmeier, 2015). Moreover, its performance is expected to improve further when trained on a larger dataset that reflects the true class distribution (Buda, Maki & Mazurowski, 2018).

Discussion

The importance of the patent claim scope and its significance to different stakeholders triggered the study. Prominent indicators used to represent the patent scope are studied in this work. The first part of this work addresses the identified research gap regarding the underutilization of patent claim text semantics in assessing patent scope by proposing HTS, a new claim scope indicator. The impact of including HTS in litigation prediction using different machine learning models was evaluated to identify the most suitable candidate for HTS.

As depicted in Fig. 8, the relative performance improvement of each experiment compared to the baseline experiment is smaller. This is not an unexpected situation because scope capturing is based on identifying the hyponyms for each word. The quality of HTS depends on two factors: how the dependency graph for the sentence is generated and how the hyponym counts for the words in the claim sentences are identified. To generate the dependency tree, the Stanza and Spacy libraries were evaluated, and it was observed that Spacy produces overall better results. The hyponym count is calculated using WordNet, but upon analysis, it was observed that WordNet cannot provide the hyponyms of techno-legal terms, which are key constituents of the patent claim text. It cannot resolve the context-related ambiguity; for example, the word ‘tree’ in computer science refers to a data structure, whereas in the context of environmental science, it refers to a natural tree. Figure 15 presents the fact that the ability of the HTS to predict the litigation risk potentially varies with the patent’s IPC section, which internally refers to the linguistic diversity supported by the hyponym counting mechanism. For WordNet, only the presence or absence of the word is checked, ignoring its context. Another observation is that for certain words, WordNet provides hundreds of hyponyms. This can potentially nullify the significance of all other words in the sentence. To avoid such outliers, in this work, the maximum value of any word’s hyponym count is limited to 259, which corresponds to the hyponym count for the top 98.5 percentile. When a word is not a stopword, and there is no hyponym for that word, the original word is considered, and a minimum hyponym count of one is assigned. Figure 16 shows the hyponym count plotted against the percentage of unique words or tokens present in the patent claim text corpus of the dataset used in this work. As indicated by this, it is clear that no hyponyms exist in WordNet for most of the words extracted from the patent claims. Although WordNet is the most popular hyponym corpus, HTS calculation requires a new corpus that includes all scientific, legal, and technical terms to yield better results. Currently, there is no WordNet replacement with context awareness and inclusion of domain-specific terms available in the public domain. To achieve the full potential of the HTS, the development of a new hyponym corpus and the re-computation of HTS values using this new corpus are recommended. This pioneering study aims to trigger researchers’ interest in quantifying the claim text scope based on hyponym count and sentence structure.

Figure 15: Litigation label counts for top and bottom 100 hts_spacy records.

Download full-size image

DOI: 10.7717/peerj-cs.3069/fig-15

Figure 16: Log-scale distribution of hyponym counts and word percentages.

Download full-size image

DOI: 10.7717/peerj-cs.3069/fig-16

This study acknowledges that the connection between patent scope, value, and litigation probability has been reconfirmed. Prior studies have documented the correlation between patent scope and its value and the connection between scope and litigation tendency. No attempts were made in this study to promote or encourage patents with a very broad scope due to the well-known conflicting views on such patents (Kitch, 1977; Klemperer, 1990; Gilbert & Shapiro, 1990; Merges & Nelson, 1994; Chang, 1995). Such patents are often used as tools to suppress competition. The focus is solely on quantifying patent scope and enabling patent drafters to determine whether the articulated claim scope is higher or lower.

The proposed MAPRA model is specifically designed for litigation prediction at the patent drafting stage. Unlike the prior work focused on granted patents, such as (Juranek & Otneim, 2024), which reported an AUC of 0.822, the MAPRA model achieves a higher AUC of 0.8776 while using only pre-grant features. This result demonstrates that strong predictive performance can be achieved without relying on post-grant event data. The use of exclusively draft-stage information makes MAPRA well-suited for early-stage litigation risk assessment. To demonstrate the significance of post-grant features in litigation prediction, Fig. 17 shows the information gain of popular post-grant features such as PQI6, forward citation, grant lag, and family size. In this combination, the contribution of post-grant features is very high. Additionally, it is important to note that HTS candidate features outperform two well-known post-grant features: family size and grant lag. This reconfirms the significance of HTS in litigation prediction for early-stage documents for which the post-grant features are unavailable. Considering that the performance difference between the closest litigation prediction work is just 0.3%, the MAPRA model can also be effectively used for granted patents. By using HTS for claim scope identification and MAPRA model for litigation prediction, patent authors can iteratively modify the patent claim text to obtain an optimal claim scope that balances higher value and improved grant probability.

Figure 17: Information gain of post-grant features and HTS candidates.

Download full-size image

DOI: 10.7717/peerj-cs.3069/fig-17

To assess the model’s real-world applicability, further analysis was conducted on an imbalanced test set reflecting the true distribution of litigated patents (approximately 2% positive cases). While the baseline results, reported earlier, indicate strong performance in terms of recall and ROC-AUC, precision and F1-score were comparatively lower. This behavior is expected in rare-event settings, where even a small number of false positives can significantly affect threshold-sensitive metrics like precision and F1. To better understand the impact of training class distribution on model generalization under deployment-like conditions, additional experiments were conducted using three training configurations, each incorporating 10,000 litigated (positive class) patents. The models were trained with positive-to-negative sampling ratios of 1:1, 1:2, and 1:3, respectively. Each model was evaluated on the 2:98 imbalanced test set. The results, presented in Table 10, show that as the training distribution progressively approximates the true class imbalance, the model exhibits notable gains in several key metrics, including precision, F1-score, and average precision (AP). Precision@200 improves from 0.080 (1:1) to 0.135 (1:2 and 1:3), suggesting enhanced ability to prioritize truly litigated patents in ranked outputs.

Table 10:

Performance on imbalanced test data with varying training ratios.

Training ratio	Precision	Recall	F1-score	AP	P@200
1:1	0.0533	0.8500	0.1003	0.2116	0.080
1:2	0.0701	0.8250	0.1292	0.2348	0.135
1:3	0.0817	0.7049	0.1465	0.1842	0.135

DOI: 10.7717/peerj-cs.3069/table-10

These empirical trends shown in Fig. 18 are consistent with theoretical expectations from probability calibration and statistical learning theory. When a model is trained on a balanced dataset, it implicitly assumes a uniform class prior (i.e., $P (y = 1) = 0.5$ ), which deviates significantly from the true prior observed in deployment scenarios. As a result, the model’s estimates of the posterior probability $P (y = 1 ∣ x)$ may become miscalibrated. According to Bayes’ theorem, the true posterior is given by:

$P (y = 1 ∣ x) = \frac{P (x ∣ y = 1) \cdot P (y = 1)}{P (x)}$ where $P (y = 1)$ is the prior probability of litigation, $P (x ∣ y = 1)$ is the likelihood of observing features $x$ given a litigated patent, and $P (x)$ is the marginal probability of observing $x$ . When the model is trained using the correct class prior (e.g., $P (y = 1) \approx 0.02$ ), the posterior probability estimation becomes more accurate, improving probability calibration and reducing the number of false positives.

Figure 18: Impact of training sampling ratios on precision and F1-score with an Imbalanced (2:98) test set.

Download full-size image

DOI: 10.7717/peerj-cs.3069/fig-18

From a risk minimization perspective, the goal is to minimize the expected loss over the true data distribution. The population risk is defined as:

$R (f) = E_{(x, y) \sim P (x, y)} [L (f (x), y)]$ where $f (x)$ is the model’s prediction for input $x$ , $y \in {0, 1}$ is the true class label, and $L (f (x), y)$ is a loss function that penalizes incorrect predictions. When training on artificially balanced datasets, the empirical risk diverges from the population risk, resulting in a biased optimization objective. As the training distribution aligns more closely with the true class distribution, the empirical risk becomes a better approximation of the true risk, yielding more generalizable models.

Improvements in threshold-independent metrics such as AP and Precision@200 further support the model’s improved ranking capability. These metrics are particularly relevant in practical scenarios such as triaging or screening large patent portfolios, where ranking high-risk cases is more actionable than producing binary predictions.

Taken together, these findings underscore the value of aligning training data distributions with real-world class priors. Empirical results and theoretical insights demonstrate that aligning the training distribution with the true class priors leads to more accurate, calibrated, and useful predictions for downstream litigation risk assessment. These findings motivate future work on cost-sensitive training and dynamic class-weighting to further improve model robustness under deployment conditions.

Conclusion

The scarcity of patent scope indicators based on the semantics of patent claim text is addressed in the first part of this study through the development of a new claim scope indicator, hyponym tree score (HTS). HTS utilizes the number of hyponyms of the words in a patent claim sentence, the sentence structure, and the inter-dependency among the patent claims in its calculation, as depicted in Algorithm 1. The final candidate for the HTS is selected from six computational options following a series of experiments that evaluate the performance improvements resulting from the inclusion of each HTS candidate in the litigation prediction task, as well as the results of the extremes study, information gain, and feature correlation. A higher HTS value indicates a broader claim scope, hinting at higher legal coverage, increased value, increased litigation probability, and decreased patent grant probability. The second part of this study focuses on the development of a high-performing litigation prediction model suitable for predicting the litigation risk of patent drafts. A multifeature fusion approach is adopted to design the proposed MAPRA model, ensuring claim text understanding through a pre-trained model and augmenting it with additional numerical features. In the MAPRA model design, a BERT model is used for capturing claim text semantics, while numerical features such as HTS and other early-stage indicators are concatenated with the BERT output to improve litigation prediction. The MAPRA model achieves an AUC score of 0.878, surpassing the closest existing litigation prediction model, which is designed for granted patents and reports an AUC of 0.822. Given that MAPRA relies solely on pre-grant features available at the draft stage, this superior performance highlights its effectiveness and suitability for predicting litigation risk in both patent drafts and granted patents. It is suggested that patent authors can strategically manage the scope of claims during the drafting stage by leveraging HTS and MAPRA. The utilization of HTS and MAPRA enables authors to define claim boundaries precisely, thereby assisting patent examiners in efficiently identifying overly broad applications. For patent portfolio managers, HTS and MAPRA provide valuable insights for accurately assessing portfolio value and potential litigation risks. Furthermore, this model supports insurance companies in evaluating the litigation risks associated with newly granted patents, contributing to a more efficient, transparent, and well-regulated patent ecosystem.

In this study, hyponym counting of words relies on WordNet, which does not cover most scientific or domain-specific terms. Developing a context-aware hyponym corpus that includes technical and domain-specific terminology remains an important direction for future research. Recent work on LLM-based hyponym generation (Yun et al., 2023) provides promising insights for advancing such corpus development.

Patents are granted across a wide range of domains, each falling under different sections. The linguistic diversity inherent in documenting innovations from these varied fields necessitates a claim scope evaluation that is specific to each patent section. Such section-specific analysis could improve the quality of HTS and enhance litigation prediction performance for particular domains. This study refrains from recommending or defining specific HTS value ranges that may indicate claim scope boundaries. Establishing such recommendations would require section-specific analyses based on significantly larger datasets that reflect the true class distribution. Increasing dataset size with realistic class distribution and conducting patent section-specific evaluations represent key areas for future work. Additionally, this study does not account for temporal changes in patent litigation risk, which is another area for improvement (Kim et al., 2021). As there are currently no definitive or final litigation prediction models, there remains substantial scope for developing improved models, particularly those that incorporate enhanced hyponym corpora.

Supplemental Information

Source code for Indicator Generation and Litigation Prediction.

DOI: 10.7717/peerj-cs.3069/supp-1

Download

Dataset with all the features required for the Litigation Prediction.

DOI: 10.7717/peerj-cs.3069/supp-2

Download

[1] Andersson L, Lupu M, Palotti JR, Piroi F, Hanbury A, Rauber A. 2014. Insight to hyponymy lexical relation extraction in the patent genre versus other text genres. In: Proceedings of the First International Workshop on Patent Mining and Its Applications (IPaMin-2014). 1292 CEUR-WS.org

[2] Arinas I. 2012. How vague can your patent be? vagueness strategies in us patents. HERMES-Journal of Language and Communication in Business 48(48):55-74

[3] Buda M, Maki A, Mazurowski MA. 2018. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks 106(7):249-259

[4] Chan TH, Mihm J, Sosa M. 2021. Revisiting the role of collaboration in creating breakthrough inventions. Manufacturing & Service Operations Management 23(5):1005-1024

[5] Chang HF. 1995. Patent scope, antitrust policy, and cumulative innovation. The RAND Journal of Economics 26(1):34-57

[6] Chen S-H, Lai C-Y. 2023. Patent litigation prediction using machine learning approaches.

[7] Chien CV. 2011. Predicting patent litigation. Texas Law Review 90:283

[8] Cohen JE, Lemley MA. 2001. Patent scope and innovation in the software industry. California Law Review 89:1

[9] Cotropia CA. 2005. Patent claim interpretation methodologies and their claim scope paradigms. William & Mary Law Review 47:49

[10] Fellbaum C. 1998. WordNet: an electronic lexical database The MIT Press

[11] Follesø HS, Kaminski M. 2020. Predicting patent litigation: a comprehensive comparison of machine learning algorithm performance in predicting patent litigation. Master’s thesis, Norwegian School of Economics.

[12] Gasparetto A, Marcuzzo M, Zangari A, Albarelli A. 2022. A survey on text classification algorithms: from text to predictions. Information 13(2):83

[13] Gilbert R, Shapiro C. 1990. Optimal patent length and breadth. The RAND Journal of Economics 21(1):106-112

[14] Graham SJ, Mowery DC. 2003. Intellectual property protection in the us software industry. Patents in the Knowledge-based Economy 219:231

[15] Harhoff D. 2016. Patent quality and examination in europe. American Economic Review 106(5):193-197

[16] Harhoff D, Scherer FM, Vopel K. 2003. Citations, family size, opposition and the value of patent rights. Research policy 32(8):1343-1363

[17] Helmers C. 2018. The economic analysis of patent litigation data. WIPO Economic Research Working Papers 48, World Intellectual Property Organization—Economics and Statistics Division.

[18] Honnibal M, Montani I, Van Landeghem S, Boyd A. 2020. spaCy: industrial-strength natural language processing in python. Honolulu, HI, USA: Zenodo.

[19] Juranek S, Otneim H. 2021. Using machine learning to predict patent lawsuits. Helleveien, Bergen, Norway: NHH Norwegian School of Economics, Department of Business and Management Science. Discussion Paper 2021/6

[20] Juranek S, Otneim H. 2024. Predicting patent lawsuits with machine learning. International Review of Law and Economics 80:106228

[21] Kim Y, Lee J, Kang J, Lee J, Jang D, Park S. 2022. A study on the comparison of feature extraction methods for classification of patent litigation.

[22] Kim Y, Park S, Lee J, Jang D, Kang J. 2021. Integrated survival model for predicting patent litigation hazard. Sustainability 13(4):1763

[23] Kitch EW. 1977. The nature and function of the patent system. The Journal of Law and Economics 20(2):265-290

[24] Klemperer P. 1990. How broad should the scope of patent protection be? The RAND Journal of Economics 21(1):113-130

[25] Lanjouw J, Schankerman M. 1997. Stylized facts of patent litigation: value, scope and ownership. Cambridge, Massachusetts, USA: National Bureau of Economic Research. Working Paper

[26] Lanjouw JO, Schankerman M. 2001. Characteristics of patent litigation: a window on competition. RAND Journal of Economics 32(1):129-151

[27] Lanjouw JO, Schankerman M. 2004. Patent quality and research productivity: measuring innovation with multiple indicators. The Economic Journal 114(495):441-465

[28] Lee C, Song B, Park Y. 2013. How to assess patent infringement risks: a semantic patent claim analysis using dependency relationships. Technology Analysis & Strategic Management 25(1):23-38

[29] Lerner J. 1994. The importance of patent scope: an empirical analysis. The RAND Journal of Economics 25(2):319-333

[30] Liu J, Li P, Liu X. 2024. Patent lifetime prediction using lightgbm with a customized loss. PeerJ Computer Science 10(7–8):e2044

[31] Liu W, Pei S. 2023. Convolution neural network based patent infringement detection method.

[32] Liu Q, Wu H, Ye Y, Zhao H, Liu C, Du D. 2018. Patent litigation prediction: a convolutional tensor factorization approach.

[33] Malackowski JE, Barney JA. 2008. What is patent quality? A merchant banc’s perspective. Nouvelles-Journal of the Licensing Executives Society 43(2):123

[34] Marco AC, Sarnoff JD, Charles A. 2019. Patent claims and patent scope. Research Policy 48(9):103790

[35] Merges RP, Nelson RR. 1990. On the complex economics of patent scope. Columbia Law Review 90(4):839-916

[36] Merges RP, Nelson RR. 1994. On limiting or encouraging rivalry in technical progress: the effect of patent scope decisions. Journal of Economic Behavior & Organization 25(1):1-24

[37] Narin F, Hamilton KS, Olivastro D. 1997. The increasing linkage between us technology and public science. Research Policy 26(3):317-330

[38] Okada Y, Naito Y, Nagaoka S. 2016. Contribution of patent examination to making the patent scope consistent with the invention: evidence from Japan. Japan: Research Institute of Economy, Trade and Industry (RIETI). Discussion Paper

[39] Organisation for Economic Co-operation and Development (OECD). 2024. Oecd patent quality indicators database. (accessed 2 March 2024)

[40] Osenga K. 2011. The shape of things to come: what we can learn from patent claim length. Santa Clara Computer & High Technology Law Journal 28(3):617-629

[41] Packalen M, Bhattacharya J. 2012. Words in patents: research inputs and the value of innovativeness in invention. Working Paper 18494, National Bureau of Economic Research.

[42] Park B, Bhardwaj N, Hsu SY-Y. 2023. Predicting patent litigation risk using roberta and metadata augmentation techniques. (accessed 3 March 2024)

[43] Park H, Yoon J, Kim K. 2012. Identifying patent infringement using SAO based semantic technological similarities. Scientometrics 90(2):515-529

[44] Qi P, Zhang Y, Zhang Y, Bolton J, Manning CD. 2020. Stanza: a python natural language processing toolkit for many human languages. ArXiv

[45] Ragot S. 2023. A novel approach to measuring patent claim scope based on probabilities obtained from (large) language models. ArXiv

[46] Saito T, Rehmsmeier M. 2015. The precision-recall plot is more informative than the roc plot when evaluating binary classifiers on imbalanced datasets. PLOS ONE 10(3):e0118432

[47] Squicciarini M, Dernis H, Criscuolo C. 2013. Measuring patent quality: indicators of technological and economic value. OECD Science, Technology and Industry Working Papers. 3

[48] Tanaka H, Nakashio Y, Kajikawa Y. 2018. Evaluation method of patent scope based on semantic information of words and dependency structure of patent claims.

[49] Tekic Z, Kukolj D. 2013. Threat of litigation and patent value: what technology managers should know. Research-Technology Management 56(2):18-25

[50] Toole A, Jones C, Madhavan S. 2021. Patentsview: an open data platform to advance science and technology policy. In: USPTO Economic Working Paper. Alexandria, VA, USA: United States Patent and Trademark Office.

[51] Toole AA, Miller R, Sichelman TM. 2024. Technical documentation for patent litigation docket reports data, 1963–2020. Technical Report 2024-1, U.S. Patent and Trademark Office.

[52] Trajtenberg M. 1990. A penny for your quotes: patent citations and the value of innovations. The Rand Journal of Economics 21(1):172-187

[53] U.S. Patent and Trademark Office. 2024a. Data download tables. (accessed 5 January 2024)

[54] U.S. Patent and Trademark Office. 2024b. Patent litigation docket reports data. (accessed 5 October 2024)

[55] Wittfoth S. 2019. Measuring technological patent scope by semantic analysis of patent claims–an indicator for valuating patents. World Patent Information 58:101906