GraphMLDatasets

Usage

using GraphMLDatasets

graph = graphdata(Planetoid(), :cora)
train_X, train_y = traindata(Planetoid(), :cora)
test_X, test_y = testdata(Planetoid(), :cora)

# OGB datasets
graph = graphdata(OGBNProteins())
ef = edge_features(OGBNProteins())
nl = node_labels(OGBNProteins())

APIs

Available datasets

Planetoid dataset

Cora dataset

PPI dataset

GraphMLDatasets.PPI (Type)
PPI()

The PPI dataset contains protein-protein interaction networks. Nodes represent proteins, and edges indicate that two proteins interact. Positional gene sets, motif gene sets, and immunological signatures serve as node features (50 in total), and gene ontology sets serve as labels (121 in total).

Implements: traindata, validdata, testdata

Reddit dataset

GraphMLDatasets.Reddit (Type)
Reddit()

The Reddit dataset contains a network of Reddit posts. Reddit is a large online discussion forum where users post and comment in 50 communities, and each post belongs to one of them. Nodes represent posts, and an edge connects two posts if the same user comments on both. The task is to predict the community that each post belongs to.

Implements: graphdata, alldata, rawdata, metadata

QM7b dataset

GraphMLDatasets.QM7b (Type)
QM7b()

The QM7b dataset contains molecular structure graphs and is a subset of the GDB-13 database, which contains stable and synthetically accessible organic molecules. Nodes represent atoms in a molecule, and edges represent chemical bonds between atoms. The 3D Cartesian coordinates of the stable conformation are given as features. The task is to predict the electronic properties of each molecule. The dataset contains 7,211 molecules with 14 regression targets.

Implements: rawdata

OGB Node Property Prediction

OGBNProteins dataset

GraphMLDatasets.OGBNProteins (Type)
OGBNProteins()

The OGBNProteins dataset contains a protein-protein interaction network. The task is to predict the presence of protein functions, framed as multi-label binary classification. Training/validation/test splits are given by node indices.

Description

  • Graph: undirected, weighted, and typed (according to species) graph.
  • Node: proteins.
  • Edge: different types of biologically meaningful associations between proteins, e.g., physical interactions, co-expression or homology.

References

  1. Damian Szklarczyk, Annika L Gable, David Lyon, Alexander Junge, Stefan Wyder, Jaime Huerta-Cepas, Milan Simonovic, Nadezhda T Doncheva, John H Morris, Peer Bork, et al. STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Research, 47(D1):D607–D613, 2019.
  2. Gene Ontology Consortium. The gene ontology resource: 20 years and still going strong. Nucleic Acids Research, 47(D1):D330–D338, 2018.

Implements: graphdata, train_indices, valid_indices, test_indices, edge_features, node_labels
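As a rough illustration of what "splits given by node indices" means in practice, the sketch below selects label columns by index. It is a minimal, self-contained example on made-up data: the label matrix, its layout (one column per node), and the index vectors are all assumptions standing in for whatever node_labels, train_indices, valid_indices, and test_indices return, and they are not taken from the package.

```julia
# Toy multi-label matrix (hypothetical values): 3 binary labels × 6 nodes.
# Assumed layout: one column per node.
num_nodes = 6
labels = Bool[1 0 1 0 1 0;
              0 0 1 1 0 1;
              1 1 0 0 0 1]

# Hypothetical split indices, standing in for the output of
# train_indices, valid_indices, and test_indices.
train_idx = [1, 2, 3]
valid_idx = [4]
test_idx = [5, 6]

# Selecting columns by index yields the labels of each split.
train_y = labels[:, train_idx]
valid_y = labels[:, valid_idx]
test_y = labels[:, test_idx]

# Sanity check: the three splits cover every node exactly once.
all_idx = vcat(train_idx, valid_idx, test_idx)
@assert sort(all_idx) == collect(1:num_nodes)

# Fraction of training nodes that carry each label.
positive_rate = vec(sum(train_y; dims=2)) ./ length(train_idx)
```

The same column-indexing pattern would apply to a node feature matrix for datasets that provide one, again assuming a one-column-per-node layout.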

OGBNProducts dataset

GraphMLDatasets.OGBNProducts (Type)
OGBNProducts()

The OGBNProducts dataset contains an Amazon product co-purchasing network. The task is to predict the category of each product, framed as multi-class classification. Training/validation/test splits are given by node indices.

Description

  • Graph: undirected and unweighted graph.
  • Node: products sold on Amazon.
  • Edge: the two products are purchased together.

References

  1. http://manikvarma.org/downloads/XC/XMLRepository.html

Implements: graphdata, train_indices, valid_indices, test_indices, node_features, node_labels

OGBNArxiv dataset

GraphMLDatasets.OGBNArxiv (Type)
OGBNArxiv()

The OGBNArxiv dataset contains the citation network between all Computer Science (CS) arXiv papers indexed by MAG. The task is to predict the primary category of each arXiv paper from 40 subject areas, framed as multi-class classification. Training/validation/test splits are given by node indices.

Description

  • Graph: directed graph.
  • Node: arXiv paper.
  • Edge: each directed edge indicates that one paper cites another one.

References

  1. Kuansan Wang, Zhihong Shen, Chiyuan Huang, Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia. Microsoft academic graph: When experts are not enough. Quantitative Science Studies, 1(1):396–413, 2020.
  2. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3111–3119, 2013.

Implements: graphdata, train_indices, valid_indices, test_indices, node_features, node_labels
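To make the directed-edge convention concrete, here is a small sketch that builds a sparse adjacency matrix from a citation-style edge list and reads off in- and out-degrees. The edge list is made up for illustration, and the representation (a list of (source, target) pairs turned into a SparseMatrixCSC) is an assumption; it is not necessarily how graphdata returns the graph.

```julia
using SparseArrays

# Hypothetical edge list: (src, dst) means paper `src` cites paper `dst`.
edges = [(1, 2), (1, 3), (2, 3), (4, 3)]
num_papers = 4

src = [e[1] for e in edges]
dst = [e[2] for e in edges]

# A[i, j] == 1 when paper i cites paper j. Because the graph is directed,
# A is generally not symmetric.
A = sparse(src, dst, ones(Int, length(edges)), num_papers, num_papers)

# Row sums count the references each paper makes;
# column sums count the citations each paper receives.
out_degree = vec(sum(A; dims=2))
in_degree = vec(sum(A; dims=1))
```

For an undirected graph such as a co-purchasing network, the analogous matrix would be symmetric, so in-degree and out-degree would coincide.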

OGB Graph Property Prediction