Overview

This data set is generated by linking two large academic graphs, Microsoft Academic Graph (MAG) and AMiner, and is for research purposes only. This version includes 166,192,182 papers from MAG and 154,771,162 papers from AMiner, with 64,639,608 linking (matching) relations between the two graphs. More linking results, such as author links, will be published in the future. The data set can be used as a unified large academic graph for studying citation networks, paper content, and related topics, and also for studying the integration of multiple academic graphs.

The overall data set includes three parts, which are described in the table below:

| Data Set | #Paper | #File | Total Size | Date |
|---|---|---|---|---|
| Linking relations | 64,639,608 | 1 | 1.6GB | 2017-06-22 |
| MAG papers | 166,192,182 | 9 | 104GB | 2017-06-09 |
| AMiner papers | 154,771,162 | 3 | 39GB | 2017-03-22 |

Data Description

This section describes the data in detail.
For the linking relations, each record is an “ID to ID” pair. Its JSON schema is:

{
  "mid": "xxxx",
  "aid": "yyyy"
}

where “mid” is the MAG paper ID and “aid” is the AMiner paper ID.
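
As a rough illustration, the linking pairs could be loaded into a lookup table as sketched below. This assumes the linking file stores one JSON object per line, which the description above does not state; `load_links` is a hypothetical helper, and the IDs are placeholders rather than real records.

```python
import io
import json


def load_links(lines):
    """Build a MAG-ID -> AMiner-ID dict from an iterable of JSON strings.

    Assumption: one linking pair per line (JSON Lines layout).
    """
    mag_to_aminer = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        pair = json.loads(line)
        mag_to_aminer[pair["mid"]] = pair["aid"]
    return mag_to_aminer


# Stand-in for an open file handle; these IDs are placeholders.
sample = io.StringIO('{"mid": "xxxx", "aid": "yyyy"}\n{"mid": "m2", "aid": "a2"}\n')
links = load_links(sample)
print(links["xxxx"])  # prints yyyy
```

With 64,639,608 pairs, a plain dict may need several gigabytes of memory, so an on-disk key-value store may be the more practical choice for the full file.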

For the MAG papers and AMiner papers data sets, each paper is a JSON object with the following schema:

| Field Name | Field Type | Description | Example |
|---|---|---|---|
| id | string | MAG or AMiner ID | 53e9ab9eb7602d970354a97e |
| title | string | paper title | Data mining: concepts and techniques |
| authors.name | string | author name | Jiawei Han |
| authors.org | string | author affiliation | department of computer science university of illinois at urbana champaign |
| venue | string | paper venue | Inteligencia Artificial, Revista Iberoamericana de Inteligencia Artificial |
| year | int | published year | 2000 |
| keywords | list of strings | keywords | ["data mining", "structured data", "world wide web", "social network", "relational data"] |
| fos | list of strings | fields of study | ["relational database", "data model", "social network"] |
| n_citation | int | number of citations | 29790 |
| references | list of strings | IDs of the papers this paper cites | ["53e99ef4b7602d97027c2346", "53e9aa23b7602d970338fb5e", "53e99cf5b7602d97025aac75"] |
| page_start | string | start page | 11 |
| page_end | string | end page | 18 |
| doc_type | string | paper type: journal, book title… | book |
| lang | string | detected language | en |
| publisher | string | publisher | Elsevier |
| volume | string | volume | 10 |
| issue | string | issue | 29 |
| issn | string | ISSN | 0020-7136 |
| isbn | string | ISBN | 1-55860-489-8 |
| doi | string | DOI | 10.4114/ia.v10i29.873 |
| pdf | string | PDF URL | //static.aminer.org/upload/pdf/1254/370/239/53e9ab9eb7602d970354a97e.pdf |
| url | list | external links | ["http://dx.doi.org/10.4114/ia.v10i29.873", "http://polar.lsi.uned.es/revista/index.php/ia/article/view/479"] |
| abstract | string | abstract | Our ability to generate… |

For example:

{
  "id": "53e9ab9eb7602d970354a97e",
  "title": "Data mining: concepts and techniques",
  "authors": [
    {
      "name": "jiawei han",
      "org": "department of computer science university of illinois at urbana champaign"
    },
    {
      "name": "micheline kamber",
      "org": "department of computer science university of illinois at urbana champaign"
    },
    {
      "name": "jian pei",
      "org": "department of computer science university of illinois at urbana champaign"
    }
  ],
  "year": 2000,
  "keywords": [
    "data mining",
    "structured data",
    "world wide web",
    "social network",
    "relational data"
  ],
  "fos": [
    "relational database",
    "data model",
    "social network"
  ],
  "n_citation": 29790,
  "references": [
    "53e99ef4b7602d97027c2346",
    "53e9aa23b7602d970338fb5e",
    "53e99cf5b7602d97025aac75"
  ],
  "doc_type": "book",
  "lang": "en",
  "publisher": "Elsevier",
  "isbn": "1-55860-489-8",
  "doi": "10.4114/ia.v10i29.873",
  "pdf": "//static.aminer.org/upload/pdf/1254/370/239/53e9ab9eb7602d970354a97e.pdf",
  "url": [
    "http://dx.doi.org/10.4114/ia.v10i29.873",
    "http://polar.lsi.uned.es/revista/index.php/ia/article/view/479"
  ],
  "abstract": "Our ability to generate and collect data has been increasing rapidly. Not only are all of our business, scientific, and government transactions now computerized, but the widespread use of digital cameras, publication tools, and bar codes also generate data. On the collection side, scanned text and image platforms, satellite remote sensing systems, and the World Wide Web have flooded us with a tremendous amount of data. This explosive growth has generated an even more urgent need for new techniques and automated tools that can help us transform this data into useful information and knowledge. Like the first edition, voted the most popular data mining book by KD Nuggets readers, this book explores concepts and techniques for the discovery of patterns hidden in large data sets, focusing on issues relating to their feasibility, usefulness, effectiveness, and scalability. However, since the publication of the first edition, great progress has been made in the development of new data mining methods, systems, and applications. This new edition substantially enhances the first edition, and new chapters have been added to address recent developments on mining complex types of data, including stream data, sequence data, graph structured data, social network data, and multi-relational data."
}
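
The example record above omits several schema fields (for instance venue, volume, and the page fields), so code that reads paper records probably should not assume every field is present. The sketch below shows one defensive way to read a record; `summarize_paper` is a hypothetical helper, and the record is an abbreviated copy of the example.

```python
import json


def summarize_paper(record):
    """Return a one-line summary, tolerating missing optional fields."""
    authors = [a.get("name", "") for a in record.get("authors", [])]
    return "{title} ({year}), {n} author(s), {refs} reference(s)".format(
        title=record.get("title", "<no title>"),
        year=record.get("year", "n.d."),
        n=len(authors),
        refs=len(record.get("references", [])),
    )


# Abbreviated copy of the example record shown above.
raw = """{
  "id": "53e9ab9eb7602d970354a97e",
  "title": "Data mining: concepts and techniques",
  "authors": [{"name": "jiawei han"}, {"name": "micheline kamber"}, {"name": "jian pei"}],
  "year": 2000,
  "references": ["53e99ef4b7602d97027c2346", "53e9aa23b7602d970338fb5e", "53e99cf5b7602d97025aac75"]
}"""
paper = json.loads(raw)
print(summarize_paper(paper))
# prints: Data mining: concepts and techniques (2000), 3 author(s), 3 reference(s)
```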

Method and Evaluation

Method

We obtain the linking relations between the two publication graphs in two steps:

  1. Use the Microsoft Graph Search API to query each AMiner paper’s title and obtain candidate matching papers for that AMiner paper.
  2. Match two papers if they have
    • very similar titles,
    • similar author names, and
    • the same published year.
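
The matching rule in step 2 can be sketched roughly as follows. This is not the implementation used to build the data set: the similarity measure (`difflib.SequenceMatcher` from the Python standard library) and both thresholds are assumptions chosen for illustration, and `is_match` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(p, q, title_min=0.95, author_min=0.8):
    """Apply the three criteria; both thresholds are illustrative assumptions."""
    # Criterion 3: same published year (a missing year never matches).
    if p.get("year") is None or p.get("year") != q.get("year"):
        return False
    # Criterion 1: very similar titles.
    if similarity(p.get("title", ""), q.get("title", "")) < title_min:
        return False
    # Criterion 2: similar author names, compared as sorted, joined strings.
    names_p = sorted(a.get("name", "").lower() for a in p.get("authors", []))
    names_q = sorted(a.get("name", "").lower() for a in q.get("authors", []))
    return similarity(" ".join(names_p), " ".join(names_q)) >= author_min


aminer_paper = {
    "title": "Data mining: concepts and techniques",
    "year": 2000,
    "authors": [{"name": "jiawei han"}, {"name": "micheline kamber"}, {"name": "jian pei"}],
}
# Hypothetical MAG candidate that differs only in capitalization.
mag_candidate = {
    "title": "Data Mining: Concepts and Techniques",
    "year": 2000,
    "authors": [{"name": "Jiawei Han"}, {"name": "Micheline Kamber"}, {"name": "Jian Pei"}],
}
print(is_match(aminer_paper, mag_candidate))  # prints True
```

Checking the year first is a cheap exact comparison that avoids computing string similarities for candidates that cannot match.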

Evaluation

We randomly sampled 100,000 linking pairs and evaluated the matching accuracy. Of these, 99,699 pairs are true matches, giving a matching accuracy of 99.70%.
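
The reported accuracy follows directly from the two counts, as the short calculation below shows. The last step, which scales the sample error rate to all 64,639,608 links, is only a rough extrapolation that assumes the sample is representative; it is not a figure reported for the data set.

```python
sampled = 100_000          # linking pairs checked
correct = 99_699           # pairs judged to be true matches
total_links = 64_639_608   # all linking relations in this release

accuracy = correct / sampled
print(f"accuracy = {accuracy:.2%}")  # prints accuracy = 99.70%

# Rough estimate of wrong links overall, assuming a representative sample.
est_wrong = round(total_links * (sampled - correct) / sampled)
print(est_wrong)  # prints 194565
```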

Reference

We kindly request that any published research that makes use of this data cite the following papers.

  • Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. ArnetMiner: Extraction and Mining of Academic Social Networks. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD’2008). pp.990-998.
  • Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. 2015. An Overview of Microsoft Academic Service (MAS) and Applications. In Proceedings of the 24th International Conference on World Wide Web (WWW ’15 Companion). ACM, New York, NY, USA, 243-246.