A Comprehensive Dataset of Citations with Identifiers from English Wikipedia
The dataset is composed of 3 parts:
1. The dataset of 23.8 million citations from 35 different citation templates, of which 3.14 million citations already contained identifiers and approximately 2.15 million citations were equipped with identifiers retrieved from Crossref. This is under the filename: citations_from_wikipedia.zip
2. An example subset with the features for the classifier. This is under the filename: subset_of_citations_features.zip
3. Citations classified as a journal and their corresponding metadata/identifier extracted from Crossref to make the dataset more complete. This is under the filename: lookup_data.zip. This zip file contains a file: lookup_table.gzip (a Parquet file, despite the .gzip extension, containing all citations classified as a journal) and a folder: metadata_extracted (a folder containing the metadata from Crossref for all the citations mentioned in the table)
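As a rough illustration of how part (3) might be opened, the following Python sketch lists the contents of lookup_data.zip and separates the lookup table from the Crossref metadata files. The helper names split_lookup_archive and load_lookup_table are hypothetical, and the internal layout of the archive (entry names matching lookup_table.gzip and metadata_extracted) is an assumption based only on the description above. Loading the table assumes pandas with a Parquet engine such as pyarrow is installed, and that the table is stored as a single Parquet file rather than a directory of part files.

```python
import io
import zipfile


def split_lookup_archive(archive):
    """Return (table_members, metadata_members) for a lookup_data.zip-style archive.

    The entry names matched here are assumptions taken from the dataset
    description, not a verified listing of the archive.
    """
    with zipfile.ZipFile(archive) as zf:
        # Skip pure directory entries, which end with a slash.
        names = [n for n in zf.namelist() if not n.endswith("/")]
    tables = [n for n in names if "lookup_table.gzip" in n.split("/")]
    metadata = [n for n in names if "metadata_extracted" in n.split("/")]
    return tables, metadata


def load_lookup_table(archive, member):
    """Read one Parquet member of the archive into a pandas DataFrame."""
    # Imported lazily so the listing helper works without pandas installed.
    import pandas as pd

    with zipfile.ZipFile(archive) as zf:
        return pd.read_parquet(io.BytesIO(zf.read(member)))
```

A possible usage would be to call split_lookup_archive("lookup_data.zip") first, inspect the returned names, and then pass one of the table members to load_lookup_table. If the table turns out to be a directory of part files, each part would need to be read separately.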
The data was parsed from the Wikipedia XML content dumps published in October 2018.
The source code for extracting the data, along with instructions for getting started with the pipeline, can be found here: https://github.com/Harshdeep1996/cite-classifications-wiki
The taxonomy of the dataset in (1) can be found here: https://github.com/Harshdeep1996/cite-classifications-wiki/wiki/Taxonomy-of-the-parent-dataset