STSM – Visit to the HUB Institute for Library Sciences “Same data, different results” – report of Rob Koopman

Period: 2016-04-24 to 2016-04-30
STSM Applicant: Mr Rob Koopman, OCLC EMEA BV, Leiden (NL)
Host: Frank Havemann, HUB (Humboldt-Universität zu Berlin)

The STSM of Rob Koopman was devoted to discussing the problem of defining topics using bibliometric traces. Although this problem is by no means new, there are still many competing approaches, whose range and validity are often not communicated and sometimes not even clear.
This visit was devoted to exploring alternative ways to cluster documents based on citation links and semantic indexing. In general, all elements analysed in bibliometrics (papers, journals, authors, etc.) are thematically heterogeneous. Citation links are the least heterogeneous of these elements: in most cases a source is cited in a paper for a single knowledge claim, and if it is cited for two or more knowledge claims, these claims are likely to be thematically close to each other. It therefore seems reasonable to cluster citation links in order to reconstruct topics in networks of papers.

Many clustering methods are based on measures of similarity between the elements to be clustered. In network analysis (and documents form networks whose links can be citations, co-authorships, or other shared elements), Ahn et al. (2010) estimate the similarity of two links by comparing their sets of neighbouring nodes. This is not very appropriate for citation links, because we would then estimate the thematic similarity of nearly homogeneous elements (citation links) via sets of very inhomogeneous elements (papers, cited sources).

Having said this, experts who have read a citing paper and a cited source would be able to describe the theme of the citation link with terms, and we assume that such terms are very likely to appear in both texts, i.e. in the citing paper and in the cited source. If we simply used the intersection of the word sets extracted from the two documents, we would get a lot of noise (stop words, very common words) and would miss part of the signal (due to synonyms and different grammatical forms of words). Therefore, the first steps are stemming, exclusion of stop words, and a TF-IDF weighting of terms. To include synonyms and near-synonyms, we use the results of a co-word analysis of the corpus we want to cluster. These preparation steps are part of the semantic indexing method.
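The preparation steps named above (stop-word exclusion and TF-IDF weighting; stemming is omitted here for brevity) can be sketched in a few lines of Python. This is only an illustrative sketch: the stop-word list is a tiny ad-hoc subset, and the tokenizer and weighting scheme are assumptions, not the actual Ariadne implementation.

```python
import math
from collections import Counter

# tiny illustrative stop-word list, not a real one
STOP_WORDS = {"the", "a", "of", "and", "in", "is", "to", "for"}

def tokenize(text):
    """Lowercase, split on whitespace, drop stop words (stemming omitted)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf(corpus):
    """Return one term -> weight dict per document using TF-IDF weighting."""
    docs = [Counter(tokenize(d)) for d in corpus]
    n = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(t for doc in docs for t in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            t: (c / total) * math.log(n / df[t])  # TF * IDF
            for t, c in doc.items()
        })
    return vectors

corpus = [
    "clustering of citation links in networks of papers",
    "semantic indexing of papers and cited sources",
]
vecs = tf_idf(corpus)
# a term occurring in every document (here "papers") gets IDF log(2/2) = 0,
# so common words are down-weighted exactly as the text requires
```

Note how the IDF factor suppresses terms shared by all documents, which is the mechanism the report relies on to remove the "noise" from intersecting word sets.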
In particular, we use one-word and two-word terms from the title, abstract and keywords of the documents. Each document is represented as a vector in the space of terms. Because terms are not always orthogonal to each other (there are many similar terms), one can use singular-value decomposition (SVD) to obtain latent semantic dimensions of the corpus and, at the same time, reduce the number of dimensions by neglecting dimensions with low weights. We apply a faster but less precise alternative to SVD, namely random projection. From these reduced document vectors we then construct the vector representing a citation link.

We agreed to implement the new measure of similarity between citation links and to test its validity. If we succeed, the results will be published in a joint paper. The STSM was also used to give a lecture in German to 30 students of the iSchool about OCLC and semantic indexing, presenting a new implementation of Ariadne's thread (http://thoth.pica.nl/ml314/relate).
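Random projection as a cheap stand-in for SVD can be sketched as follows. The report does not spell out how the citation-link vector is built from the document vectors, so the averaging of the citing and cited vectors at the end is an illustrative assumption, not necessarily the authors' construction; the matrix sizes are likewise arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_projection(X, k):
    """Project the rows of X (documents x terms) into k dimensions using a
    random Gaussian matrix; by the Johnson-Lindenstrauss lemma, pairwise
    distances are approximately preserved."""
    d = X.shape[1]
    R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(d, k))
    return X @ R

def cosine(u, v):
    """Cosine similarity between two reduced vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# toy term-document matrix: 4 documents over 1000 terms
X = rng.random((4, 1000))
Y = random_projection(X, 100)  # 1000 dims -> 100 latent dims

# hypothetical citation link between a citing paper (row 0) and a cited
# source (row 1): average their reduced vectors -- an assumed construction
link_vec = (Y[0] + Y[1]) / 2
```

Unlike SVD, the projection directions here are random rather than the corpus's principal directions, which is exactly the speed/precision trade-off the report mentions: no decomposition of the term-document matrix is needed, at the cost of a noisier embedding.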

May 3, 2016

Rob Koopman