Nebula LiveJournal: Importing the LiveJournal Dataset into Nebula Graph and Running Graph Algorithms
A walkthrough of importing the LiveJournal dataset into the Nebula Graph database and running graph algorithms with Nebula Algorithm.
Related GitHub Repo: https://github.com/wey-gu/nebula-LiveJournal
The LiveJournal dataset is a social network dataset distributed as a single file with two columns (FromNodeId, ToNodeId).
$ head soc-LiveJournal1.txt
# Directed graph (each unordered pair of nodes is saved once): soc-LiveJournal1.txt
# Directed LiveJournal friendship social network
# Nodes: 4847571 Edges: 68993773
# FromNodeId ToNodeId
0 1
0 2
0 3
0 4
0 5
0 6
It can be accessed at https://snap.stanford.edu/data/soc-LiveJournal1.html.
Dataset statistics | Value |
---|---|
Nodes | 4847571 |
Edges | 68993773 |
Nodes in largest WCC | 4843953 (0.999) |
Edges in largest WCC | 68983820 (1.000) |
Nodes in largest SCC | 3828682 (0.790) |
Edges in largest SCC | 65825429 (0.954) |
Average clustering coefficient | 0.2742 |
Number of triangles | 285730264 |
Fraction of closed triangles | 0.04266 |
Diameter (longest shortest path) | 16 |
90-percentile effective diameter | 6.5 |
1 Dataset Download and Preprocessing
1.1 Download
It can be downloaded from the official web page:
$ cd nebula-livejournal/data
$ wget https://snap.stanford.edu/data/soc-LiveJournal1.txt.gz
1.2 Preprocessing
The comment lines at the top of the data file need to be removed so that the import tool can parse it:
$ gzip -d soc-LiveJournal1.txt.gz
$ sed -i '1,4d' soc-LiveJournal1.txt
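The sed call strips the four leading comment lines shown above. As a quick check (this verification step is my suggestion, not from the original repo), the file should now start with plain edge rows:
$ head -3 soc-LiveJournal1.txt
0 1
0 2
0 3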
2 Import dataset to Nebula Graph
2.1 With Nebula Importer
Nebula-Importer is a headless import tool for Nebula Graph, written in Golang.
You may need to edit the config file nebula-importer/importer.yaml to set the Nebula Graph address and credentials.
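For reference, in a Nebula-Importer v2 config these settings live under clientSettings; below is an illustrative excerpt (the space name livejournal and the address graphd:9669 are assumptions, check the actual importer.yaml in the repo for the real values):
version: v2
clientSettings:
  concurrency: 10
  space: livejournal        # target graph space (assumed name)
  connection:
    user: root              # default credentials on a fresh cluster
    password: nebula
    address: graphd:9669    # graphd endpoint reachable from the importer container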
Then Nebula-Importer can be invoked via Docker as follows:
$ cd nebula-livejournal
$ docker run --rm -ti \
--network=nebula-net \
-v ${PWD}/nebula-importer/importer.yaml:/root/importer.yaml \
-v ${PWD}/data/:/root \
vesoft/nebula-importer:v2 \
--config /root/importer.yaml
Or, if you have the nebula-importer binary available locally:
$ cd data
$ <path_to_nebula-importer_binary> --config ../nebula-importer/importer.yaml
2.2 With Nebula Exchange
Nebula-Exchange is a Spark application that enables batch and streaming data import from multiple data sources into Nebula Graph.
To be done. (You can refer to https://siwei.io/nebula-exchange-sst-2.x/)
3 Run Algorithms with Nebula Graph
Nebula-Algorithm is a Spark (GraphX) application that runs graph algorithms on data read from files or from a Nebula Graph cluster.
Algorithms supported so far:
Name | Use Case |
---|---|
PageRank | page ranking, identifying important nodes |
Louvain | community mining, hierarchical clustering |
KCore | community detection, financial risk control |
LabelPropagation | community detection, information propagation, advertising recommendation |
ConnectedComponent | community detection, isolated-island detection |
StronglyConnectedComponent | community detection |
ShortestPath | path planning, network planning |
TriangleCount | network structure analysis |
BetweennessCentrality | identifying important nodes, measuring node influence |
DegreeStatic | graph structure analysis |
3.1 Ad-hoc Spark environment setup
Here I assume Nebula Graph was bootstrapped with Nebula-Up, so it is running in a Docker network named nebula-docker-compose_nebula-net.
Then let's start a single-node Spark master:
docker run --name spark-master --network nebula-docker-compose_nebula-net \
-h spark-master -e ENABLE_INIT_DAEMON=false -d \
-v ${PWD}/nebula-algorithm/:/root \
bde2020/spark-master:2.4.5-hadoop2.7
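Because the container joins the Nebula cluster's Docker network, the graph services are reachable by their service names. An optional sanity check (the hostname graphd is an assumption based on the default nebula-docker-compose service naming):
$ docker exec spark-master ping -c 1 graphd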
Now we can submit Spark applications from inside this container:
docker exec -it spark-master bash
cd /root/
# download the Nebula-Algorithm JAR package, version 2.0.0 for example; for other versions, refer to the nebula-algorithm GitHub repo and documentation
wget https://repo1.maven.org/maven2/com/vesoft/nebula-algorithm/2.0.0/nebula-algorithm-2.0.0.jar
3.2 Run Algorithms
Nebula-Algorithm supports many algorithms; example configuration files for some of them are provided under nebula-algorithm/ in this repo.
Before using them, first edit the Nebula Graph cluster addresses and credentials:
vim nebula-algorithm/algo-pagerank.conf
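For orientation, the conf files are in HOCON format; the excerpt below sketches the fields that usually need editing, following the layout of the nebula-algorithm 2.0.0 example config (the metad address, space name livejournal, and edge label follow are assumptions; substitute your cluster's values):
nebula: {
  read: {
    metaAddress: "metad0:9559"   # metad endpoint of the cluster (assumed)
    space: livejournal           # graph space holding the edges (assumed)
    labels: ["follow"]           # edge type(s) to read as graph edges (assumed)
  }
}
algorithm: {
  executeAlgo: pagerank          # algorithm to run
  pagerank: {
    maxIter: 10                  # number of iterations
    resetProb: 0.15              # reset (damping) probability
  }
}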
Then we can enter the Spark container and invoke the corresponding algorithm as follows.
Please adjust --driver-memory accordingly. For example, to run the PageRank algorithm:
/spark/bin/spark-submit --master "local" --conf spark.rpc.askTimeout=6000s \
--class com.vesoft.nebula.algorithm.Main \
--driver-memory 16g nebula-algorithm-2.0.0.jar \
-p algo-pagerank.conf
After the algorithm finishes, the output will be written to the path inside the container defined in the conf file:
write:{
resultPath:/output/
}
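Assuming the csv sink configured in the repo's conf files, Spark writes the result as part files under that directory inside the container. A sketch of how to inspect and copy them out (the target folder name is arbitrary):
$ docker exec spark-master ls /output/
$ docker cp spark-master:/output ./pagerank-result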
Cover image credit: @sigmund