Nebula LiveJournal: Getting Started with Importing the LiveJournal Dataset into Nebula Graph and Running Graph Algorithms

By Wey Gu

2021-08-24 2021-09-18, about 556 words, estimated reading time 3 minutes

Import LiveJournal Dataset into Nebula Graph and Run Nebula Algorithm

A walkthrough of importing the LiveJournal dataset into the Nebula Graph database and running Nebula Algorithm graph algorithms on it.

Related GitHub Repo: https://github.com/wey-gu/nebula-LiveJournal

nebula-LiveJournal

The LiveJournal dataset is a social network dataset stored in a single file with two columns (FromNodeId, ToNodeId).

bash

$ head soc-LiveJournal1.txt
# Directed graph (each unordered pair of nodes is saved once): soc-LiveJournal1.txt
# Directed LiveJournal friednship social network
# Nodes: 4847571 Edges: 68993773
# FromNodeId	ToNodeId
0	1
0	2
0	3
0	4
0	5
0	6

It is available at https://snap.stanford.edu/data/soc-LiveJournal1.html.

Dataset statistics
Nodes	4847571
Edges	68993773
Nodes in largest WCC	4843953 (0.999)
Edges in largest WCC	68983820 (1.000)
Nodes in largest SCC	3828682 (0.790)
Edges in largest SCC	65825429 (0.954)
Average clustering coefficient	0.2742
Number of triangles	285730264
Fraction of closed triangles	0.04266
Diameter (longest shortest path)	16
90-percentile effective diameter	6.5
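The Nodes and Edges rows above can be cross-checked against a downloaded copy of the file. The following is a minimal sketch, assuming the file layout shown in the head output (comment lines start with #, data lines hold two whitespace-separated ids). The function names count_edges and count_nodes are hypothetical helpers, not part of any Nebula tool.

```bash
# Count edges: every line that is not a "#" comment is one directed edge.
count_edges() {
    grep -vc '^#' "$1"
}

# Count distinct node ids appearing in either column.
# Every distinct id is held in memory, so the full dataset needs some RAM.
count_nodes() {
    awk '!/^#/ {
        if (!($1 in seen)) { seen[$1] = 1; n++ }
        if (!($2 in seen)) { seen[$2] = 1; n++ }
    } END { print n + 0 }' "$1"
}
```

Running count_edges soc-LiveJournal1.txt and count_nodes soc-LiveJournal1.txt prints two numbers that can be compared with the statistics table.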

1 Dataset Download and Preprocessing

1.1 Download

It can be downloaded from the official web page:

bash

$ cd nebula-livejournal/data
$ wget https://snap.stanford.edu/data/soc-LiveJournal1.txt.gz

The comment lines at the top of the data file should be removed before import, since the import tool expects only data rows.

1.2 Preprocessing

bash

$ gzip -d soc-LiveJournal1.txt.gz
$ sed -i '1,4d' soc-LiveJournal1.txt
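The sed command above relies on the header being exactly four lines long. As a hedged alternative, the sketch below removes comment lines by pattern and then checks that every remaining line is two tab-separated integer ids. The function names strip_comments and count_bad_lines are hypothetical and exist only for this illustration.

```bash
# Write a copy of the edge list without "#" comment lines.
strip_comments() {
    grep -v '^#' "$1" > "$2"
}

# Print the number of lines that are NOT "<integer><TAB><integer>".
# A result of 0 means the file contains only well-formed edge rows.
count_bad_lines() {
    awk -F '\t' '
        NF != 2 || $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/ { bad++ }
        END { print bad + 0 }
    ' "$1"
}
```

For example, strip_comments soc-LiveJournal1.txt edges.tsv followed by count_bad_lines edges.tsv reports how many rows would still need attention before import. Unlike sed -i, this keeps the original file untouched.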

2 Import dataset to Nebula Graph

2.1 With Nebula Importer

Nebula-Importer is a headless import tool for Nebula Graph, written in Go.

You may need to edit the config file nebula-importer/importer.yaml to set your Nebula Graph address and credentials.

Then, Nebula-Importer can be run in Docker as follows:

bash

$ cd nebula-livejournal

$ # Bind mounts need absolute host paths, hence $(pwd).
$ docker run --rm -ti \
    --network=nebula-net \
    -v "$(pwd)/nebula-importer/importer.yaml":/root/importer.yaml \
    -v "$(pwd)/data/":/root \
    vesoft/nebula-importer:v2 \
    --config /root/importer.yaml

Or if you have the binary nebula-importer locally:

bash

$ cd data
$ <path_to_nebula-importer_binary> --config ../nebula-importer/importer.yaml

2.2 With Nebula Exchange

Nebula-Exchange is a Spark Application to enable batch and streaming data import from multiple data sources to Nebula Graph.

To be done. (You can refer to https://siwei.io/nebula-exchange-sst-2.x/)

3 Run Algorithms with Nebula Graph

Nebula-Algorithm is a Spark/GraphX Application to run Graph Algorithms with data consumed from files or a Nebula Graph Cluster.

Supported Algorithms for now:

Name	Use Case
PageRank	page ranking, key node identification
Louvain	community detection, hierarchical clustering
KCore	community detection, financial risk control
LabelPropagation	community detection, information propagation, advertising recommendation
ConnectedComponent	community detection, isolated island detection
StronglyConnectedComponent	community detection
ShortestPath	path planning, network planning
TriangleCount	network structure analysis
BetweennessCentrality	key node identification, node influence calculation
DegreeStatic	graph structure analysis
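To give a feel for the simplest entry in the table, the sketch below computes per-node out-degree and in-degree straight from the edge list with awk. It only illustrates the kind of quantity a degree statistic describes. It is not Nebula-Algorithm's Spark implementation, and degree_table is a hypothetical helper name.

```bash
# Print "node<TAB>out_degree<TAB>in_degree" for every node, sorted by node id.
degree_table() {
    awk -F '\t' '!/^#/ {
        outdeg[$1]++          # edge leaves the node in column 1
        indeg[$2]++           # edge enters the node in column 2
        nodes[$1]; nodes[$2]  # remember every node id seen
    } END {
        for (n in nodes) printf "%s\t%d\t%d\n", n, outdeg[n], indeg[n]
    }' "$1" | sort -n
}
```

On a three-edge file containing 0→1, 0→2 and 1→2, degree_table reports node 0 with out-degree 2 and in-degree 0, node 1 with 1 and 1, and node 2 with 0 and 2.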

3.1 Ad-hoc Spark Env setup

Here I assume Nebula Graph was bootstrapped with Nebula-Up, so it is running in a Docker network named nebula-docker-compose_nebula-net.

Then let’s start a single-node Spark server:

bash

# Bind mounts need absolute host paths, hence $(pwd).
docker run --name spark-master --network nebula-docker-compose_nebula-net \
    -h spark-master -e ENABLE_INIT_DAEMON=false -d \
    -v "$(pwd)/nebula-algorithm/":/root \
    bde2020/spark-master:2.4.5-hadoop2.7

Now we can submit Spark applications from inside this container:

bash

docker exec -it spark-master bash
cd /root/
# Download the Nebula-Algorithm jar package (2.0.0 here as an example); for other versions, refer to the nebula-algorithm GitHub repo and documentation.
wget https://repo1.maven.org/maven2/com/vesoft/nebula-algorithm/2.0.0/nebula-algorithm-2.0.0.jar
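Since the jar is fetched over the network, it can be worth checking its digest before submitting jobs. The sketch below compares a file's SHA-1 with an expected value. It assumes sha1sum is available in the container, and that a matching .sha1 file sits next to the jar on Maven Central (the usual convention there, but not confirmed by this post). verify_sha1 is a hypothetical helper name.

```bash
# Succeed (exit status 0) only if the file's SHA-1 digest equals the expected hex string.
verify_sha1() {
    local file="$1" expected="$2" actual
    actual=$(sha1sum "$file" | awk '{ print $1 }')
    [ "$actual" = "$expected" ]
}
```

For example, after downloading nebula-algorithm-2.0.0.jar.sha1 from the same directory, verify_sha1 nebula-algorithm-2.0.0.jar "$(awk '{ print $1 }' nebula-algorithm-2.0.0.jar.sha1)" && echo OK prints OK for an intact download.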

3.2 Run Algorithms

Nebula-Algorithm supports many algorithms; configuration files for some of them are provided under nebula-algorithm as examples.

Before using them, edit the Nebula Graph cluster addresses and credentials to match your environment.

bash

vim nebula-algorithm/algo-pagerank.conf

Then we can enter the Spark container and call the corresponding algorithm as follows.

Please adjust --driver-memory to suit your machine. For example, for the PageRank algorithm:

bash

/spark/bin/spark-submit --master "local" --conf spark.rpc.askTimeout=6000s \
    --class com.vesoft.nebula.algorithm.Main \
    --driver-memory 16g nebula-algorithm-2.0.0.jar \
    -p algo-pagerank.conf

After the algorithm finishes, the output is written inside the container, under the path defined in the conf file:

toml

    write:{
        resultPath:/output/
    }
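Once results are written, a quick look at the highest-scoring nodes is often the first thing to check. The sketch below assumes the output consists of CSV part files, each starting with a header line followed by id,score rows. That layout is an assumption and should be verified against the actual files under the result path. top_scores is a hypothetical helper name.

```bash
# Print the N highest-scoring rows from "id,score" CSV files that each start with a header.
top_scores() {
    local n="$1"
    shift
    # FNR > 1 drops the first (header) line of every input file;
    # sort -g also handles scores written in scientific notation.
    awk 'FNR > 1' "$@" | LC_ALL=C sort -t, -k2,2 -g -r | head -n "$n"
}
```

For example, top_scores 10 /output/*.csv would list ten rows, with the glob adjusted to wherever the part files actually land.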

Cover image credit: @sigmund
