Nebula LiveJournal: Import the LiveJournal Dataset into Nebula Graph and Run Nebula Algorithm

A walkthrough of importing the LiveJournal dataset into the Nebula Graph database and running graph algorithms with Nebula Algorithm.
Related GitHub Repo: https://github.com/wey-gu/nebula-LiveJournal
The LiveJournal dataset is a social network dataset shipped as a single file with two columns (FromNodeId, ToNodeId).
It can be accessed at https://snap.stanford.edu/data/soc-LiveJournal1.html.
Dataset statistics | Value |
---|---|
Nodes | 4847571 |
Edges | 68993773 |
Nodes in largest WCC | 4843953 (0.999) |
Edges in largest WCC | 68983820 (1.000) |
Nodes in largest SCC | 3828682 (0.790) |
Edges in largest SCC | 65825429 (0.954) |
Average clustering coefficient | 0.2742 |
Number of triangles | 285730264 |
Fraction of closed triangles | 0.04266 |
Diameter (longest shortest path) | 16 |
90-percentile effective diameter | 6.5 |
1 Dataset Download and Preprocessing
1.1 Download
It is accessible from the official web page linked above.
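A sketch of the download step; I assume the archive linked from that page is soc-LiveJournal1.txt.gz:

```bash
# Fetch the gzipped edge list from SNAP and unpack it
wget https://snap.stanford.edu/data/soc-LiveJournal1.txt.gz
gzip -d soc-LiveJournal1.txt.gz
```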
1.2 Preprocessing
Comments in the data file should be removed so that the data import tool can parse it.
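A minimal sketch of the cleanup, assuming the downloaded file is named soc-LiveJournal1.txt and only the leading comment lines (starting with `#`) need to go:

```bash
# Strip the '#' comment lines so only the two tab-separated
# columns (FromNodeId, ToNodeId) remain for the importer.
sed -i '/^#/d' soc-LiveJournal1.txt
```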
2 Import the Dataset into Nebula Graph
2.1 With Nebula Importer
Nebula-Importer is a headless import tool for Nebula Graph written in Golang.
You may need to edit the config file nebula-importer/importer.yaml to set the Nebula Graph address and credentials.
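For reference, the connection-related part of a Nebula-Importer v2 config looks roughly like the sketch below; the space name, address, and credentials are placeholders rather than the repo's actual values:

```yaml
version: v2
clientSettings:
  concurrency: 10
  channelBufferSize: 128
  space: livejournal        # placeholder space name, change to yours
  connection:
    user: root              # placeholder credentials
    password: nebula
    address: graphd:9669    # Nebula Graph graphd address
```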
Then, Nebula-Importer can be called in Docker.
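A sketch of the Docker invocation, assuming the repo directory is the current working directory and the vesoft/nebula-importer image is used (the image tag and mount path are assumptions):

```bash
docker run --rm -ti \
    --network=nebula-docker-compose_nebula-net \
    -v ${PWD}:/root \
    vesoft/nebula-importer:v2 \
    --config /root/nebula-importer/importer.yaml
```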
Or you can run the nebula-importer binary locally.
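For instance, assuming the binary sits next to the repo (the binary location and config path are assumptions):

```bash
./nebula-importer --config nebula-importer/importer.yaml
```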
2.2 With Nebula Exchange
Nebula-Exchange is a Spark application that enables batch and streaming data import from multiple data sources into Nebula Graph.
To be done. (You can refer to https://siwei.io/nebula-exchange-sst-2.x/)
3 Run Algorithms with Nebula Graph
Nebula-Algorithm is a Spark/GraphX application that runs graph algorithms on data consumed from files or from a Nebula Graph cluster.
Supported algorithms so far:
Name | Use Case |
---|---|
PageRank | page ranking, important node mining |
Louvain | community mining, hierarchical clustering |
KCore | community detection, financial risk control |
LabelPropagation | community detection, information propagation, advertising recommendation |
ConnectedComponent | community detection, isolated island detection |
StronglyConnectedComponent | community detection |
ShortestPath | path planning, network planning |
TriangleCount | network structure analysis |
BetweennessCentrality | important node mining, node influence calculation |
DegreeStatic | graph structure analysis |
3.1 Ad-hoc Spark Environment Setup
Here I assume Nebula Graph was bootstrapped with Nebula-Up, so Nebula is running in a Docker network named nebula-docker-compose_nebula-net.
Then let's start a single-node Spark instance.
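A sketch of bringing up a single Spark master container attached to the same Docker network; the bde2020/spark-master image and its tag are assumptions:

```bash
docker run --name spark-master \
    --network nebula-docker-compose_nebula-net \
    -h spark-master \
    -e ENABLE_INIT_DAEMON=false \
    -d bde2020/spark-master:2.4.5-hadoop2.7
```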
Then we can submit Spark applications from inside this container.
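For example, to get a shell inside the container (the container name spark-master matches the sketch above):

```bash
docker exec -it spark-master bash
```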
3.2 Run Algorithms
Nebula-Algorithm supports many algorithms; example configuration files for some of them are provided under nebula-algorithm.
Before using them, please edit the Nebula Graph cluster addresses and credentials first.
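The fields in question live in each algorithm's conf file; a rough sketch of the relevant block is shown below, where the addresses, space name, edge label, and credentials are placeholders, not the repo's actual values:

```
nebula: {
  read: {
    metaAddress: "metad0:9559"   # Nebula Graph metad address
    space: livejournal           # placeholder space name
    labels: ["follow"]           # placeholder edge type to read
  }
  write: {
    graphAddress: "graphd:9669"  # graphd address for writing results
    metaAddress: "metad0:9559"
    user: root                   # placeholder credentials
    pswd: nebula
    space: livejournal
  }
}
```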
Then we can enter the Spark container and call the corresponding algorithms as follows.
Please adjust --driver-memory to suit your machine.
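For example, a PageRank run might look roughly like the sketch below; the jar file name/version, the conf file name, and the spark-submit path inside the container are assumptions:

```bash
/spark/bin/spark-submit --master "local" \
    --class com.vesoft.nebula.algorithm.Main \
    --driver-memory 16g \
    nebula-algorithm-<version>.jar \
    -p nebula-algorithm/pagerank.conf
```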
After the algorithm finishes, the output will be under the path inside the container defined in the conf file.
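For example, assuming the result path in the conf file was set to /output (an assumption; use whatever your conf defines):

```bash
ls -l /output/
head /output/part-*.csv
```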
Cover image credit: @sigmund