Comparing Similarity of Patent Textual Data Through the Application of Machine Learning

Immordino, Salvatore C

doi:10.25417/uic.12481325.v1

IMMORDINO-DISSERTATION-2019.pdf (10.47 MB)

thesis

posted on 2019-08-01, 00:00 authored by Salvatore C Immordino

This research presents the use of machine learning (ML) computational techniques to convert unstructured patent textual data into structured, actionable knowledge, and in doing so lays the groundwork to study inventiveness and inventive knowledge flow with less dependence on numerical metrics such as citation analysis. Citation analysis is the most popular method of studying inventiveness today. It is used to form theories about inventive knowledge flow between inventors based on how they cite each other's patents, or about invention importance based on how frequently a patent is cited. Historically, inventors would directly cite and build upon each other's work to demonstrate how their inventions were novel over others. Today, most citations are added after an invention has been conceived, either by the patent attorney handling the filing or by the patent examiner reviewing the application. This change in citing behavior, while not formal, undermines the premise that the number of citations reflects knowledge flow. There is therefore a need for improved methods to study invention.

This research begins by first extracting unstructured abstract, title, and claim textual data from a subset of patent documents taken from a group of competitive building product companies. This includes a novel method for removing common legal terms, which I call patent jargon. Second, I use natural language processing techniques to convert the unstructured textual data into a Vector Space Model (VSM) using term frequency-inverse document frequency (TF-IDF). Third, I measure the cosine of the angle between patent vectors within the high-dimensional vector space to assess inventive similarity. Fourth, I establish a method to determine a minimum cosine similarity threshold value that can be used to select related patents. Fifth, to visualize the data set, I reduce the higher-dimensional data using both principal component analysis and t-distributed stochastic neighbor embedding. Lastly, I compare invention impact and inventive knowledge flow by visualizing patent citation data alongside my cosine relatedness method to reveal both similar and dissimilar inventive patterns.
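
The TF-IDF, cosine similarity, and threshold steps described in the abstract can be illustrated with a minimal sketch. This is not the dissertation's implementation: the `PATENT_JARGON` stop list, the smoothed IDF formula, the sample texts, and the threshold value of 0.1 are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical stop list standing in for "patent jargon" and common words.
# The actual jargon list used in the thesis is not given in this record.
PATENT_JARGON = {"said", "wherein", "comprising", "claim", "thereof",
                 "the", "a", "of", "and"}


def tokenize(text):
    """Lowercase, strip simple punctuation, and drop jargon terms."""
    words = (w.strip(".,;:()").lower() for w in text.split())
    return [w for w in words if w and w not in PATENT_JARGON]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Assumed IDF variant: log(N / df) + 1, so shared terms keep some weight.
    idf = {term: math.log(n_docs / count) + 1.0 for term, count in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (c / len(toks)) * idf[t] for t, c in tf.items()})
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_pairs(docs, threshold):
    """List (i, j, similarity) for document pairs at or above the threshold."""
    vecs = tfidf_vectors(docs)
    pairs = []
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            sim = cosine_similarity(vecs[i], vecs[j])
            if sim >= threshold:
                pairs.append((i, j, sim))
    return pairs


# Made-up example texts, not taken from any real patent.
docs = [
    "A gypsum wallboard comprising a gypsum core and paper facing",
    "Gypsum board with a lightweight gypsum core",
    "A roofing shingle comprising asphalt and granules",
]
pairs = related_pairs(docs, threshold=0.1)
```

In this toy data set only the two gypsum documents share vocabulary after jargon removal, so they are the only pair that clears the threshold. A library such as scikit-learn would likely be a better fit for a corpus of real size.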

History

Advisor

Scott, Michael J

Chair

Scott, Michael J

Department

Mechanical & Industrial Engineering

Degree Grantor

University of Illinois at Chicago

Degree Level

Doctoral

Degree name

PhD, Doctor of Philosophy

Committee Member

Darabi, Houshang; Derrible, Sybil; Hu, Mengqi; Spanjol, Jelena

Submitted date

August 2019

Thesis type

application/pdf

Language

en

Issue date

2019-06-14

Keywords

Patent Analysis; Machine Learning; Natural Language Processing; Intellectual Property; Patent Indicators; Patent Data Mining; Patent Metrics; Citation Analysis; Cosine Similarity

Licence

In Copyright
