Deep-Learning Bibliographic Reference Strings with a 1-Billion-Instances Dataset

Problem/Background

Effective citation parsing is crucial for academic search engines, patent databases, and many other applications in academia, law, and intellectual property protection. It helps to identify related documents and to calculate the impact of researchers and journals (e.g. the h-index). “Citation parsing” refers to identifying citation markers such as “[4]” in the full text and extracting author names, journals, publication years, etc. from the corresponding bibliography entries. For instance, in the following example, a citation parser would have to identify the citation markers [1], [2], [3], and [4], and then extract from the bibliography the information that “K. Balog”, “N. Takhirov”, etc. are the authors of reference [1].

1 Introduction
Retrieving a list of ‘related documents’ for a given source document – e.g. a web page, patent, or research article – is a common feature of many applications, including recommender systems and search engines (Figure 1). Document relatedness is typically calculated based on documents’ text (title, abstract, full-text) [1] and metadata (authors, journal, …) [2], or based on citations/hyperlinks [3], [4].
…
6 Bibliography
[1] K. Balog, N. Takhirov, H. Ramampiaro, and K. Nørvåg, “Multi-step Classification Approaches to Cumulative Citation Recommendation,” in Proceedings of the OAIR’13, 2013.
[2] D. Aumueller, “Retrieving metadata for your local scholarly papers,” 2009.
[3] B. Gipp and J. Beel, “Citation Proximity Analysis (CPA) – A new approach for identifying related work based on Co-Citation Analysis,” in Proceedings of the 12th International Conference on Scientometrics and Informetrics (ISSI’09), 2009, vol. 2, pp. 571–575.
[4] S. Liu and C. Chen, “The Effects of Co-citation Proximity on Co-citation Analysis,” in Proceedings of the Conference of the International Society for Scientometrics and Informetrics, 2011.
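
To make the two steps concrete, here is a minimal Python sketch that runs on the example above. It finds the numeric citation markers in the full text, looks up the matching bibliography entries, and pulls out author names with a hand-written rule. The function names (find_markers, index_bibliography, naive_authors) are hypothetical, and the author rule assumes this one reference style, where the author list ends at the opening quotation mark of the title. It is an illustration of the task, not a proposed solution.

```python
import re

FULL_TEXT = (
    "Document relatedness is typically calculated based on documents’ text "
    "(title, abstract, full-text) [1] and metadata (authors, journal, …) [2], "
    "or based on citations/hyperlinks [3], [4]."
)

BIBLIOGRAPHY = [
    "[1] K. Balog, N. Takhirov, H. Ramampiaro, and K. Nørvåg, “Multi-step "
    "Classification Approaches to Cumulative Citation Recommendation,” in "
    "Proceedings of the OAIR’13, 2013.",
    "[2] D. Aumueller, “Retrieving metadata for your local scholarly papers,” 2009.",
    "[3] B. Gipp and J. Beel, “Citation Proximity Analysis (CPA) – A new approach "
    "for identifying related work based on Co-Citation Analysis,” in Proceedings "
    "of the 12th International Conference on Scientometrics and Informetrics "
    "(ISSI’09), 2009, vol. 2, pp. 571–575.",
    "[4] S. Liu and C. Chen, “The Effects of Co-citation Proximity on Co-citation "
    "Analysis,” in Proceedings of the Conference of the International Society "
    "for Scientometrics and Informetrics, 2011.",
]


def find_markers(text):
    """Return the numbers of all bracketed citation markers, in order."""
    return [int(n) for n in re.findall(r"\[(\d+)\]", text)]


def index_bibliography(entries):
    """Map each reference number to the entry text that follows its marker."""
    index = {}
    for entry in entries:
        match = re.match(r"\[(\d+)\]\s*(.+)", entry)
        if match:
            index[int(match.group(1))] = match.group(2)
    return index


def naive_authors(entry_body):
    """Assumption: the author list is everything before the opening quote."""
    author_part = entry_body.split("\u201c", 1)[0].strip(" ,")
    # Split on commas (optionally followed by "and") or on a bare "and".
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", author_part)
    return [p for p in parts if p]


bib = index_bibliography(BIBLIOGRAPHY)
for marker in find_markers(FULL_TEXT):
    print(marker, naive_authors(bib[marker]))
```

A rule like this stops working as soon as the reference style changes, for example when titles are not quoted or authors are written as “Balog, K.”.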

Over the years, many approaches to reference parsing have been proposed, including regular expressions, knowledge-based approaches, and supervised machine learning. Machine learning-based solutions, in particular those that fall into the category of supervised sequence tagging, are considered the state of the art for reference parsing. Unfortunately, they still suffer from two issues: the lack of sufficiently large and diverse datasets, and poor generalization to unseen reference formats.
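
As an illustration of what “sequence tagging” means here, the sketch below turns an annotated reference string into token/label pairs, using reference [2] from the example above. The BIO tagging scheme (B- for the first token of a field, I- for the following tokens, O for anything outside a field), the label names, and the whitespace tokenization are all assumptions made for this example. They are not necessarily the annotation format of any existing dataset or tool.

```python
def spans_to_bio(fields):
    """Convert ordered (label, text) fields into (token, BIO tag) pairs.

    Assumption: tokens are split on whitespace, so punctuation stays
    attached to the neighbouring word.
    """
    tagged = []
    for label, text in fields:
        for position, token in enumerate(text.split()):
            if label == "other":
                tag = "O"
            elif position == 0:
                tag = "B-" + label
            else:
                tag = "I-" + label
            tagged.append((token, tag))
    return tagged


# Hypothetical annotation of reference [2] from the example above.
annotated_reference = [
    ("author", "D. Aumueller,"),
    ("title", "“Retrieving metadata for your local scholarly papers,”"),
    ("year", "2009."),
]

for token, tag in spans_to_bio(annotated_reference):
    print(f"{token}\t{tag}")
```

A sequence tagger is trained to predict the second column from the first. The parsed fields are then recovered by merging consecutive tokens that share a label.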

We have recently created a synthetic dataset of 1 billion annotated bibliographic reference strings. This dataset is unique in its size, as other datasets typically consist of only a few thousand entries.

Your task would be to use this 1-billion-instance dataset (or a large sample of it) to train a deep learning model that parses reference strings. Potentially, your approach will achieve groundbreaking results and set a new benchmark in the community.
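
If you work with a sample rather than the full dataset, one possible way to draw a uniform random sample in a single pass, without loading everything into memory, is reservoir sampling. The sketch below is only an illustration: the function name is hypothetical, and it assumes the dataset can be read as a stream of records (for example, one annotated reference per line), which may not match the actual storage format.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Return k records drawn uniformly at random from an iterable.

    Reads the input once and keeps at most k records in memory.
    If the input has fewer than k records, all of them are returned.
    """
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = record
    return sample


# Usage with a stand-in stream; a real run would iterate over the dataset file.
stream = (f"reference {n}" for n in range(100_000))
subset = reservoir_sample(stream, k=1_000, seed=42)
print(len(subset))
```

Fixing the seed makes the sample reproducible, which helps when comparing models trained on the same subset.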