Analysis of Clusters With Indian Patent Data Using Different Word Embedding Techniques

Pankaj Beldar; Mohansingh Pardeshi; Rahul Rakhade; Shilpa Mene

doi:10.53555/sfs.v10i3.2110

Keywords:

K-means, Agglomerative clustering, Word embedding, Patents, Silhouette Score

Abstract

This study applies unsupervised machine learning (UML) techniques, namely K-means and Agglomerative clustering, to descriptive Indian patent data. The optimal number of clusters is determined using silhouette score evaluation, the elbow method, and dendrogram analysis. Several text vectorization and word embedding methods (TF-IDF, Word2Vec, and CountVectorizer) are explored in combination with rigorous text preprocessing. Testing of categorical and numerical features yields a high silhouette score of 0.8965 for 2 clusters, demonstrating the effectiveness of Agglomerative clustering. The research emphasizes the role of UML techniques, word embedding methods, and comprehensive text processing in revealing complex structures within Indian patent data. Besides advancing unsupervised learning methodologies, this work helps scholars, practitioners, and policymakers understand the Indian patent landscape, fostering innovation and technological progress.
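To illustrate how a silhouette score can be used to compare candidate cluster counts, the following is a minimal sketch in plain Python. It is not the authors' code. The four 2-D points and the two candidate labelings are hypothetical toy values standing in for embedded patent descriptions, and the function name `silhouette_score` is chosen here for illustration (scikit-learn offers a function of the same name in `sklearn.metrics`). The sketch does not reproduce the reported score of 0.8965.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            # Common convention: a singleton cluster contributes 0.
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Hypothetical toy vectors: two tight, well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

# Hypothetical labelings, as a clustering algorithm might return for k=2 and k=3.
candidate_labels = {
    2: [0, 0, 1, 1],
    3: [0, 0, 1, 2],
}

# Score each candidate and keep the cluster count with the highest silhouette.
scores = {k: silhouette_score(points, labels) for k, labels in candidate_labels.items()}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

A score close to 1 indicates compact, well-separated clusters, which is why a higher mean silhouette is taken as evidence for a given cluster count. In practice the labelings would come from K-means or Agglomerative clustering applied to the vectorized patent text rather than being written by hand.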