Comparing Similarity of Patent Textual Data Through the Application of Machine Learning
Thesis posted on 01.08.2019, 00:00 by Salvatore C Immordino
This research presents the use of machine learning (ML) computational techniques to convert unstructured patent textual data into structured, actionable knowledge, and in doing so lays the groundwork to study inventiveness and inventive knowledge flow with less dependence on numerical metrics such as citation analysis. Citation analysis is the most popular method of studying inventiveness today. It is used to form theories about inventive knowledge flow between inventors, based on how they cite each other’s patents, and about invention importance, based on how frequently a patent is cited. Historically, inventors would directly cite and build upon each other’s work to demonstrate how their inventions were novel over others. Today, most citations are added after an invention has been conceived, either by the patent attorney handling the filing or by the patent examiner reviewing the application. This change in citing behavior, though informal, undermines the premise that the number of citations reflects knowledge flow, so there is a need for improved methods to study invention. This research begins by extracting unstructured abstract, title, and claim textual data from a subset of patent documents taken from a group of competing building product companies. This includes a novel method for removing common legal terms, which I call patent jargon. Second, I use natural language processing techniques to convert the unstructured textual data into a Vector Space Model (VSM) using term frequency-inverse document frequency (TF-IDF). Third, I measure the cosine of the angle between patent vectors within the high-dimensional vector space to assess inventive similarity. Fourth, I establish a method to determine a minimum cosine similarity threshold value that can be used to select related patents. Fifth, to visualize the data set, I reduce the high-dimensional data using both principal component analysis and t-distributed stochastic neighbor embedding.
Lastly, I compare invention impact and inventive knowledge flow by visualizing patent citation data alongside my cosine relatedness method to reveal both similar and dissimilar inventive patterns.
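As a rough illustration of the middle steps described above (jargon removal, TF-IDF weighting, cosine similarity, and threshold selection), the following minimal Python sketch uses only the standard library. It is not the implementation used in the thesis. The jargon list, the stop-word list, the smoothed IDF formula, the threshold of 0.2, and the three sample texts are all assumptions made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical word lists: the thesis's actual patent-jargon list is not given here.
PATENT_JARGON = {"comprising", "wherein", "said", "thereof", "claim", "embodiment"}
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "for", "with", "is"}


def tokenize(text):
    """Lowercase the text, keep alphabetic words, drop stop words and jargon."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS and w not in PATENT_JARGON]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            # Smoothed IDF (an assumed variant): ln((1 + N) / (1 + df)) + 1
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_pairs(docs, threshold):
    """Return (i, j, score) for every document pair scoring at or above threshold."""
    vectors = tfidf_vectors(docs)
    pairs = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            score = cosine_similarity(vectors[i], vectors[j])
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


# Invented sample texts, not drawn from any real patent.
patents = [
    "A gypsum wallboard comprising a gypsum core and a paper facing.",
    "A gypsum panel wherein said gypsum core includes glass fibers.",
    "A roofing shingle comprising asphalt and mineral granules.",
]
pairs = related_pairs(patents, threshold=0.2)  # 0.2 is an arbitrary example value
for i, j, score in pairs:
    print(f"patents {i} and {j} are related (cosine similarity {score:.2f})")
```

With these sample texts, only the two gypsum documents share any terms after filtering, so they form the single pair above the threshold, while the roofing document has a similarity of zero with both. A real corpus would call for a principled threshold, which is what the fourth step of the research addresses.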