
Environment Setup


With Nebula-UP (quick start)

Installation

curl -fsSL nebula-up.siwei.io/all-in-one.sh | bash -s -- v3 spark

See Nebula-UP for more details.

Then load the basketballplayer dataset:

~/.nebula-up/load-basketballplayer-dataset.sh
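Optionally, verify the dataset from Python. This is a minimal sketch using the nebula3-python client (not part of ng_ai; the 127.0.0.1:9669 endpoint assumes the Nebula-UP default port mapping):

from nebula3.Config import Config
from nebula3.gclient.net import ConnectionPool

# connect to the graphd service exposed by Nebula-UP
pool = ConnectionPool()
pool.init([("127.0.0.1", 9669)], Config())
with pool.session_context("root", "nebula") as session:
    # list the tags of the freshly loaded space
    result = session.execute("USE basketballplayer; SHOW TAGS;")
    print(result)
pool.close()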

Access the PySpark Jupyter Notebook

Just visit http://localhost:8888 in your browser.

The default password is nebula.

Open data_intelligence_suite_demo.ipynb and run the first cell to install ng_ai; then you can run the remaining cells.
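The install cell amounts to something like this sketch (%pip is the Jupyter magic that installs into the notebook's own kernel):

%pip install ng_ai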

Access NebulaGraph Studio

Just visit http://localhost:7001 in your browser and log in with:

  • host: graphd:9669
  • user: root
  • password: nebula

Run in Production

Run on PySpark Jupyter Notebook

Assuming we have put nebula-spark-connector.jar and nebula-algo.jar in /opt/nebulagraph/ng_ai/package/:

export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip=0.0.0.0 --port=8888 --no-browser"

# --driver-class-path entries are colon-separated; --jars entries are
# comma-separated (repeating either flag would override the earlier value)
pyspark --driver-class-path /opt/nebulagraph/ng_ai/package/nebula-spark-connector.jar:/opt/nebulagraph/ng_ai/package/nebula-algo.jar \
    --jars /opt/nebulagraph/ng_ai/package/nebula-spark-connector.jar,/opt/nebulagraph/ng_ai/package/nebula-algo.jar

Then we can access Jupyter Notebook with PySpark and follow examples/spark_engine.ipynb.
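For a quick smoke test of the setup, a minimal cell in the spirit of that notebook could look like the following sketch (scan mode over the follow edges; the hosts and credentials are the quick-start defaults and may differ in your deployment):

from ng_ai import NebulaGraphConfig
from ng_ai import NebulaReader

# connection details; adjust hosts to your deployment
config = NebulaGraphConfig(
    graphd_hosts="graphd:9669",
    metad_hosts="metad0:9669,metad1:9669,metad2:9669",
    user="root",
    password="nebula",
    space="basketballplayer",
)

# scan mode: read follow edges together with their degree property
reader = NebulaReader(engine="spark", config=config)
reader.scan(edge="follow", props="degree")
df = reader.read()
df.show(10)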

Submit an Algorithm Job to the Spark Cluster

Assuming we have put nebula-spark-connector.jar, nebula-algo.jar, and ng_ai-py3-env.zip in /opt/nebulagraph/ng_ai/package/, and we have the following algorithm job in pagerank.py:

from ng_ai import NebulaGraphConfig
from ng_ai import NebulaReader

# set NebulaGraph config
config_dict = {
    "graphd_hosts": "graphd:9669",
    "metad_hosts": "metad0:9669,metad1:9669,metad2:9669",
    "user": "root",
    "password": "nebula",
    "space": "basketballplayer",
}
config = NebulaGraphConfig(**config_dict)

# read data with spark engine, query mode; pass the config so the
# reader knows how to reach the cluster
reader = NebulaReader(engine="spark", config=config)
query = """
    MATCH ()-[e:follow]->()
    RETURN e LIMIT 100000
"""
reader.query(query=query, edge="follow", props="degree")
df = reader.read()

# run pagerank algorithm
pr_result = df.algo.pagerank(reset_prob=0.15, max_iter=10)
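To persist the scores back into NebulaGraph, ng_ai also provides a NebulaWriter; here is a hedged sketch (the pagerank tag and its pagerank property are assumptions and must already exist in the space):

from ng_ai import NebulaWriter

# map the result column to the target property: {result_column: property_name}
properties = {"pagerank": "pagerank"}

writer = NebulaWriter(
    data=pr_result, sink="nebulagraph_vertex", config=config, engine="spark"
)
writer.set_options(
    tag="pagerank",  # assumed tag; create it in the space beforehand
    vid_field="_id",
    properties=properties,
    batch_size=256,
    write_mode="insert",
)
writer.write()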

Note: in production this could be scheduled by Airflow or another job scheduler (see the sketch below).

Then we can submit the job to the Spark cluster:

spark-submit --master spark://sparkmaster:7077 \
    --driver-class-path <hdfs_or_local_path_to>/nebula-spark-connector.jar:<hdfs_or_local_path_to>/nebula-algo.jar \
    --jars <hdfs_or_local_path_to>/nebula-spark-connector.jar,<hdfs_or_local_path_to>/nebula-algo.jar \
    --py-files <hdfs_or_local_path_to>/ng_ai-py3-env.zip \
    pagerank.py
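As noted above, this submission can be scheduled; here is a minimal sketch with Airflow's BashOperator (the DAG id, schedule, and paths are illustrative assumptions, not part of ng_ai):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="ng_ai_pagerank",  # illustrative name
    start_date=datetime(2023, 3, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    submit_pagerank = BashOperator(
        task_id="submit_pagerank",
        # same spark-submit invocation as above, with local paths assumed
        bash_command=(
            "spark-submit --master spark://sparkmaster:7077 "
            "--driver-class-path /opt/nebulagraph/ng_ai/package/nebula-spark-connector.jar:"
            "/opt/nebulagraph/ng_ai/package/nebula-algo.jar "
            "--jars /opt/nebulagraph/ng_ai/package/nebula-spark-connector.jar,"
            "/opt/nebulagraph/ng_ai/package/nebula-algo.jar "
            "--py-files /opt/nebulagraph/ng_ai/package/ng_ai-py3-env.zip "
            "/opt/nebulagraph/ng_ai/package/pagerank.py"
        ),
    )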

Prepare the py-files

pip install pdm
# export the dependency list from the ng_ai codebase
pdm export -o dist/requirements.txt --without-hashes
# build a wheel for ng_ai
pdm build
# install the dependencies and ng_ai itself into dist/dependencies
pip install -r dist/requirements.txt --target dist/dependencies
pip install . --target dist/dependencies
# zip the dependencies directory (which now includes ng_ai)
cd dist
zip -r ng_ai-py3-env.zip dependencies
# copy ng_ai-py3-env.zip to HDFS
hdfs dfs -put ng_ai-py3-env.zip /
# check that it is there
hdfs dfs -ls /
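Optionally, a quick local sanity check that the zip actually contains ng_ai (a sketch; the dist/ path matches the layout produced above):

import zipfile

# list the archive and confirm the ng_ai package made it in
names = zipfile.ZipFile("dist/ng_ai-py3-env.zip").namelist()
assert any(n.startswith("dependencies/ng_ai/") for n in names), "ng_ai missing from zip"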

Now we have all files ready:

# hdfs dfs -ls /
Found 4 items
-rw-r--r--   3 root supergroup  167042166 2023-03-17 03:54 /nebula-algo.jar
-rw-r--r--   3 root supergroup  165992037 2023-03-17 03:54 /nebula-spark-connector.jar
-rw-r--r--   3 root supergroup    5068513 2023-03-17 03:52 /ng_ai-py3-env.zip

Run the ng_ai algorithm PySpark job from a Python script

With everything ready as above, including pagerank.py, we can also submit the job from a Python script:

import subprocess

# submit the job; check=True raises if spark-submit exits non-zero
subprocess.run(
    [
        "spark-submit", "--master", "spark://master:7077",
        "--driver-class-path",
        "/opt/nebulagraph/ng_ai/package/nebula-spark-connector.jar:"
        "/opt/nebulagraph/ng_ai/package/nebula-algo.jar",
        "--jars",
        "/opt/nebulagraph/ng_ai/package/nebula-spark-connector.jar,"
        "/opt/nebulagraph/ng_ai/package/nebula-algo.jar",
        "--py-files", "/opt/nebulagraph/ng_ai/package/ng_ai-py3-env.zip",
        "pagerank.py",
    ],
    check=True,
)

Run on a single machine with the NebulaGraph engine

Assuming we have a NebulaGraph cluster up and running, we create the following algorithm job in pagerank_nebula_engine.py.

This file is the same as pagerank.py except for the following line:

- reader = NebulaReader(engine="spark")
+ reader = NebulaReader(engine="nebula")

Then we can run the job on a single machine:

python3 pagerank_nebula_engine.py