release-0.2 #17
release: 0.2 polished docs and readme
wey-gu committed Mar 1, 2023
2 parents efda3b4 + 12a2959 commit 7fadac9
Showing 4 changed files with 229 additions and 176 deletions.
281 changes: 107 additions & 174 deletions README.md
@@ -1,230 +1,163 @@
# NebulaGraph Data Intelligence(ngdi) Suite

![image](https://user-images.githubusercontent.com/1651790/221876073-61ef4edb-adcd-4f10-b3fc-8ddc24918ea1.png)

[![pdm-managed](https://img.shields.io/badge/pdm-managed-blueviolet)](https://pdm.fming.dev) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE) [![PyPI version](https://badge.fury.io/py/ngdi.svg)](https://badge.fury.io/py/ngdi) [![Python](https://img.shields.io/badge/python-3.6%2B-blue.svg)](https://www.python.org/downloads/release/python-360/)
<p align="center">
<em>Data Intelligence Suite: run graph algorithms on NebulaGraph with 4 lines of code</em>
</p>

NebulaGraph Data Intelligence Suite for Python (ngdi) is a powerful Python library that offers a range of APIs for data scientists to effectively read, write, analyze, and compute data in NebulaGraph. This library allows data scientists to perform these operations on a single machine using NetworkX, or in a distributed computing environment using Spark, through a unified and intuitive API. With ngdi, data scientists can easily access and process data in NebulaGraph, enabling them to perform advanced analytics and gain valuable insights.
<p align="center">
<a href="LICENSE" target="_blank">
<img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg" alt="License">
</a>

```
          ┌───────────────────────────────────────────────────┐
          │ Spark Cluster                                     │
          │    .─────.    .─────.    .─────.    .─────.       │
       ┌─▶│   :       ;  :       ;  :       ;  :       ;      │
       │  │     `───'      `───'      `───'      `───'        │
  Algorithm                                                   │
    Spark └───────────────────────────────────────────────────┘
   Engine ┌────────────────────────────────────────────────────────────────┐
       └──┤                                                                │
          │   NebulaGraph Data Intelligence Suite(ngdi)                    │
          │     ┌────────┐    ┌──────┐    ┌────────┐   ┌─────┐             │
          │     │ Reader │    │ Algo │    │ Writer │   │ GNN │             │
          │     └────────┘    └──────┘    └────────┘   └─────┘             │
          │          ├────────────┴───┬────────┴─────┐    └──────┐         │
          │          ▼                ▼              ▼           ▼         │
          │   ┌─────────────┐ ┌──────────────┐ ┌──────────┐┌───────────┐   │
       ┌──┤   │ SparkEngine │ │ NebulaEngine │ │ NetworkX ││ DGLEngine │   │
       │  │   └─────────────┘ └──────────────┘ └──────────┘└───────────┘   │
       │  └──────────┬─────────────────────────────────────────────────────┘
       │             │        Spark
       │             └────────Reader ────────────┐
  Spark Reader              Query Mode           │
  Scan Mode                                      ▼
       │  ┌───────────────────────────────────────────────────┐
       │  │  NebulaGraph Graph Engine         Nebula-GraphD   │
       │  ├──────────────────────────────┬────────────────────┤
       │  │  NebulaGraph Storage Engine  │                    │
       └─▶│  Nebula-StorageD             │    Nebula-Metad    │
          └──────────────────────────────┴────────────────────┘
```
<a href="https://badge.fury.io/py/ngdi" target="_blank">
<img src="https://badge.fury.io/py/ngdi.svg" alt="PyPI version">
</a>

<a href="https://www.python.org/downloads/release/python-360/" target="_blank">
<img src="https://img.shields.io/badge/python-3.6%2B-blue.svg" alt="Python">
</a>

<a href="https://pdm.fming.dev" target="_blank">
<img src="https://img.shields.io/badge/pdm-managed-blueviolet" alt="pdm-managed">
</a>

</p>

---

**Documentation**: <a href="https://github.com/wey-gu/nebulagraph-di#documentation" target="_blank">https://github.com/wey-gu/nebulagraph-di#documentation</a>

**Source Code**: <a href="https://github.com/wey-gu/nebulagraph-di" target="_blank">https://github.com/wey-gu/nebulagraph-di</a>

---


NebulaGraph Data Intelligence Suite for Python (ngdi) is a powerful Python library that offers APIs for data scientists to effectively read, write, analyze, and compute data in NebulaGraph.

With a single-machine engine (NetworkX) or a distributed computing environment (Spark), we can run graph analysis and algorithms on top of NebulaGraph in fewer than 10 lines of code, through a unified and intuitive API.
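
To make the "unified API" idea concrete, here is a minimal sketch of how one reader class could dispatch to different backends by engine name. Everything in it (`ToyReader`, `register_engine`, the stand-in backends) is hypothetical and for illustration only; it is not ngdi's actual implementation.

```python
from typing import Callable, Dict, List, Tuple

Edge = Tuple[str, str]

# Hypothetical registry: engine name -> function that "reads" edges.
_ENGINES: Dict[str, Callable[[List[Edge]], List[Edge]]] = {}


def register_engine(name: str):
    """Decorator that registers a reader backend under an engine name."""
    def wrap(fn):
        _ENGINES[name] = fn
        return fn
    return wrap


@register_engine("spark")
def _read_distributed(edges: List[Edge]) -> List[Edge]:
    # Stand-in for a distributed scan; here it only sorts the edges.
    return sorted(edges)


@register_engine("nebula")
def _read_single_machine(edges: List[Edge]) -> List[Edge]:
    # Stand-in for a single-machine query; same output contract as above.
    return sorted(edges)


class ToyReader:
    """Illustrative reader; NOT ngdi's NebulaReader."""

    def __init__(self, engine: str):
        if engine not in _ENGINES:
            raise ValueError(f"unknown engine {engine!r}; choose from {sorted(_ENGINES)}")
        self._read = _ENGINES[engine]

    def read(self, edges: List[Edge]) -> List[Edge]:
        return self._read(edges)
```

With this shape, switching engines is a one-argument change for the caller, and adding a new backend only means registering one more function.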

## Quick Start in 5 Minutes

- Set up the environment with Nebula-Up following [this guide](https://github.com/wey-gu/nebulagraph-di/blob/main/docs/Environment_Setup.md).
- Install ngdi with pip from the Jupyter Notebook at http://localhost:8888 (password: `nebula`).
- Open the demo notebook and run cells with `Shift+Enter` or `Ctrl+Enter`.
- Open the demo notebook and run cells one by one.
- Check the [API Reference](https://github.com/wey-gu/nebulagraph-di/blob/main/docs/API.md).

## Installation

```bash
pip install ngdi
```

### Spark Engine Prerequisites
- Spark 2.4, 3.0(not yet tested)
- [NebulaGraph 3.4+](https://github.com/vesoft-inc/nebula)
- [NebulaGraph Spark Connector 3.4+](https://repo1.maven.org/maven2/com/vesoft/nebula-spark-connector/)
- [NebulaGraph Algorithm 3.1+](https://repo1.maven.org/maven2/com/vesoft/nebula-algorithm/)

### NebulaGraph Engine Prerequisites
- [NebulaGraph 3.4+](https://github.com/vesoft-inc/nebula)
- [NebulaGraph Python Client 3.4+](https://github.com/vesoft-inc/nebula-python)
- [NetworkX](https://networkx.org/)

## Run on PySpark Jupyter Notebook(Spark Engine)

This assumes `nebula-spark-connector.jar` and `nebula-algo.jar` are in `/opt/nebulagraph/ngdi/package/`.
## Usage

```bash
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip=0.0.0.0 --port=8888 --no-browser"

pyspark --driver-class-path /opt/nebulagraph/ngdi/package/nebula-spark-connector.jar:/opt/nebulagraph/ngdi/package/nebula-algo.jar \
    --jars /opt/nebulagraph/ngdi/package/nebula-spark-connector.jar,/opt/nebulagraph/ngdi/package/nebula-algo.jar
```
### Spark Engine Examples

We can then access Jupyter Notebook with PySpark and refer to [examples/spark_engine.ipynb](https://github.com/wey-gu/nebulagraph-di/blob/main/examples/spark_engine.ipynb).
See also: [examples/spark_engine.ipynb](https://github.com/wey-gu/nebulagraph-di/blob/main/examples/spark_engine.ipynb)

## Submit Algorithm job to Spark Cluster(Spark Engine)
Run Algorithm on top of NebulaGraph:

This assumes `nebula-spark-connector.jar`, `nebula-algo.jar`, and `ngdi-py3-env.zip` are all in `/opt/nebulagraph/ngdi/package/`,
and that the following algorithm job is saved as `pagerank.py`:
> Note: there is also a query mode; refer to the [examples](https://github.com/wey-gu/nebulagraph-di/blob/main/examples/spark_engine.ipynb) or [docs](https://github.com/wey-gu/nebulagraph-di/blob/main/docs/API.md) for more details.
```python
from ngdi import NebulaGraphConfig
from ngdi import NebulaReader

# set NebulaGraph config (graphd listens on 9669, metad on 9559 by default)
config_dict = {
    "graphd_hosts": "graphd:9669",
    "metad_hosts": "metad0:9559,metad1:9559,metad2:9559",
    "user": "root",
    "password": "nebula",
    "space": "basketballplayer",
}
config = NebulaGraphConfig(**config_dict)

# read data with spark engine, query mode
# read data with spark engine, scan mode
reader = NebulaReader(engine="spark")
query = """
MATCH ()-[e:follow]->()
RETURN e LIMIT 100000
"""
reader.query(query=query, edge="follow", props="degree")
reader.scan(edge="follow", props="degree")
df = reader.read()

# run pagerank algorithm
pr_result = df.algo.pagerank(reset_prob=0.15, max_iter=10)
```
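
To show what `reset_prob` and `max_iter` control, here is a small pure-Python PageRank by power iteration. It is only a sketch under our own assumptions (normalized ranks, dangling-node rank spread evenly); the Spark engine delegates to its own algorithm library, and the function below is not that implementation.

```python
from collections import defaultdict


def pagerank(edges, reset_prob=0.15, max_iter=10):
    """Power-iteration PageRank over a list of (src, dst) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        # Every node starts from the reset share plus its part of the dangling rank.
        base = (reset_prob + (1.0 - reset_prob) * dangling) / n
        new_rank = {node: base for node in nodes}
        for src in nodes:
            targets = out_links[src]
            if targets:
                # The rest of a node's rank is split evenly over its out-edges.
                share = (1.0 - reset_prob) * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank
```

A larger `reset_prob` pulls all ranks toward the uniform value `1/n`, while `max_iter` bounds how many rounds of propagation are run; on small graphs, ten iterations may not have fully converged.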

> Note: in production, this could be done by Airflow or another job scheduler.
Then we can submit the job to the Spark cluster:
Write back to NebulaGraph:

```bash
spark-submit --master spark://master:7077 \
    --driver-class-path /opt/nebulagraph/ngdi/package/nebula-spark-connector.jar:/opt/nebulagraph/ngdi/package/nebula-algo.jar \
    --jars /opt/nebulagraph/ngdi/package/nebula-spark-connector.jar,/opt/nebulagraph/ngdi/package/nebula-algo.jar \
    --py-files /opt/nebulagraph/ngdi/package/ngdi-py3-env.zip \
    pagerank.py
```
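
When the submit command is generated from a script or a scheduler task, it helps to build the argument list in one place. To our understanding, `spark-submit` keeps a single value per option, so each option is best passed once: `--jars` as a comma-separated list and `--driver-class-path` as a classpath string. The helper below is a hypothetical sketch (`build_spark_submit` is not part of ngdi or Spark) and assumes a Linux-style `:` classpath separator.

```python
import shlex


def build_spark_submit(master, jars, script, py_files=()):
    """Build a spark-submit argument list with each option given exactly once."""
    cmd = ["spark-submit", "--master", master]
    if jars:
        # Classpath entries are joined with ":" (assumption: Linux/macOS driver).
        cmd += ["--driver-class-path", ":".join(jars)]
        # --jars expects a comma-separated list.
        cmd += ["--jars", ",".join(jars)]
    if py_files:
        cmd += ["--py-files", ",".join(py_files)]
    cmd.append(script)
    return cmd


package_dir = "/opt/nebulagraph/ngdi/package"
cmd = build_spark_submit(
    master="spark://master:7077",
    jars=[
        f"{package_dir}/nebula-spark-connector.jar",
        f"{package_dir}/nebula-algo.jar",
    ],
    script="pagerank.py",
    py_files=[f"{package_dir}/ngdi-py3-env.zip"],
)
# Print the shell form for inspection (shlex.join needs Python 3.8+).
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` so that a failed job raises instead of passing silently.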
```python
from ngdi import NebulaWriter
from ngdi.config import NebulaGraphConfig

## Run ngdi algorithm job from python script(Spark Engine)
config = NebulaGraphConfig()

We have everything ready as above, including the `pagerank.py`.
properties = {"louvain": "cluster_id"}

```python
import subprocess

subprocess.run(["spark-submit", "--master", "spark://master:7077",
    "--driver-class-path", "/opt/nebulagraph/ngdi/package/nebula-spark-connector.jar:/opt/nebulagraph/ngdi/package/nebula-algo.jar",
    "--jars", "/opt/nebulagraph/ngdi/package/nebula-spark-connector.jar,/opt/nebulagraph/ngdi/package/nebula-algo.jar",
    "--py-files", "/opt/nebulagraph/ngdi/package/ngdi-py3-env.zip",
    "pagerank.py"])
writer = NebulaWriter(
    data=df_result, sink="nebulagraph_vertex", config=config, engine="spark")
writer.set_options(
    tag="louvain", vid_field="_id", properties=properties,
    batch_size=256, write_mode="insert",)
writer.write()
```

## Run on single machine(NebulaGraph Engine)
Then we could query the result in NebulaGraph:

This assumes a NebulaGraph cluster is up and running, and that the following algorithm job is saved as `pagerank_nebula_engine.py`:
```cypher
MATCH (v:louvain)
RETURN id(v), v.louvain.cluster_id LIMIT 10;
```

This file is the same as `pagerank.py` except for the following line:
### NebulaGraph Engine Examples(not yet implemented)

Basically the same as Spark Engine, but with `engine="nebula"`.

```diff
- reader = NebulaReader(engine="spark")
+ reader = NebulaReader(engine="nebula")
```

Then we can run the job on a single machine:

```bash
python3 pagerank_nebula_engine.py
```

## Documentation

[API Reference](https://github.com/wey-gu/nebulagraph-di/blob/main/docs/API.md)

## Usage

### Spark Engine Examples

See also: [examples/spark_engine.ipynb](https://github.com/wey-gu/nebulagraph-di/blob/main/examples/spark_engine.ipynb)

```python
from ngdi import NebulaReader

# read data with spark engine, query mode
reader = NebulaReader(engine="spark")
query = """
MATCH ()-[e:follow]->()
RETURN e LIMIT 100000
"""
reader.query(query=query, edge="follow", props="degree")
df = reader.read() # this will take some time
df.show(10)
[Environment Setup](https://github.com/wey-gu/nebulagraph-di/blob/main/docs/Environment_Setup.md)

# read data with spark engine, scan mode
reader = NebulaReader(engine="spark")
reader.scan(edge="follow", props="degree")
df = reader.read() # this will take some time
df.show(10)
[API Reference](https://github.com/wey-gu/nebulagraph-di/blob/main/docs/API.md)

# read data with spark engine, load mode (not yet implemented)
reader = NebulaReader(engine="spark")
reader.load(source="hdfs://path/to/edge.csv", format="csv", header=True, schema="src: string, dst: string, rank: int")
df = reader.read() # this will take some time
df.show(10)
## How it works

# run pagerank algorithm
pr_result = df.algo.pagerank(reset_prob=0.15, max_iter=10) # this will take some time
ngdi is a unified abstraction layer over different engines. The current implementation is based on Spark, NetworkX, DGL and NebulaGraph, but it is easy to extend to other engines such as Flink, GraphScope, and PyG.

# convert dataframe to NebulaGraphObject
graph = reader.to_graphx() # not yet implemented
```
          ┌───────────────────────────────────────────────────┐
          │ Spark Cluster                                     │
          │    .─────.    .─────.    .─────.    .─────.       │
       ┌─▶│   :       ;  :       ;  :       ;  :       ;      │
       │  │     `───'      `───'      `───'      `───'        │
  Algorithm                                                   │
    Spark └───────────────────────────────────────────────────┘
   Engine ┌────────────────────────────────────────────────────────────────┐
       └──┤                                                                │
          │   NebulaGraph Data Intelligence Suite(ngdi)                    │
          │     ┌────────┐    ┌──────┐    ┌────────┐   ┌─────┐             │
          │     │ Reader │    │ Algo │    │ Writer │   │ GNN │             │
          │     └────────┘    └──────┘    └────────┘   └─────┘             │
          │          ├────────────┴───┬────────┴─────┐    └──────┐         │
          │          ▼                ▼              ▼           ▼         │
          │   ┌─────────────┐ ┌──────────────┐ ┌──────────┐┌───────────┐   │
       ┌──┤   │ SparkEngine │ │ NebulaEngine │ │ NetworkX ││ DGLEngine │   │
       │  │   └─────────────┘ └──────────────┘ └──────────┘└───────────┘   │
       │  └──────────┬─────────────────────────────────────────────────────┘
       │             │        Spark
       │             └────────Reader ────────────┐
  Spark Reader              Query Mode           │
  Scan Mode                                      ▼
       │  ┌───────────────────────────────────────────────────┐
       │  │  NebulaGraph Graph Engine         Nebula-GraphD   │
       │  ├──────────────────────────────┬────────────────────┤
       │  │  NebulaGraph Storage Engine  │                    │
       └─▶│  Nebula-StorageD             │    Nebula-Metad    │
          └──────────────────────────────┴────────────────────┘
```

### NebulaGraph Engine Examples(not yet implemented)
### Spark Engine Prerequisites
- Spark 2.4, 3.0(not yet tested)
- [NebulaGraph 3.4+](https://github.com/vesoft-inc/nebula)
- [NebulaGraph Spark Connector 3.4+](https://repo1.maven.org/maven2/com/vesoft/nebula-spark-connector/)
- [NebulaGraph Algorithm 3.1+](https://repo1.maven.org/maven2/com/vesoft/nebula-algorithm/)

```python
from ngdi import NebulaReader
### NebulaGraph Engine Prerequisites
- [NebulaGraph 3.4+](https://github.com/vesoft-inc/nebula)
- [NebulaGraph Python Client 3.4+](https://github.com/vesoft-inc/nebula-python)
- [NetworkX](https://networkx.org/)

# read data with nebula engine, query mode
reader = NebulaReader(engine="nebula")
reader.query("""
MATCH ()-[e:follow]->()
RETURN e.src, e.dst, e.degree LIMIT 100000
""")
df = reader.read() # this will take some time
df.show(10)

# read data with nebula engine, scan mode
reader = NebulaReader(engine="nebula")
reader.scan(edge_types=["follow"])
df = reader.read() # this will take some time
df.show(10)

# convert dataframe to NebulaGraphObject
graph = reader.to_graph() # this will take some time
graph.nodes.show(10)
graph.edges.show(10)
## License

# run pagerank algorithm
pr_result = graph.algo.pagerank(reset_prob=0.15, max_iter=10) # this will take some time
```
This project is licensed under the terms of the Apache License 2.0.
12 changes: 12 additions & 0 deletions docs/API.md
@@ -69,6 +69,18 @@ reader.query(query=query, edge="follow", props="degree")
df = reader.read()
```

- Load mode

> not yet implemented
```python
# read data with spark engine, load mode (not yet implemented)
reader = NebulaReader(engine="spark")
reader.load(source="hdfs://path/to/edge.csv", format="csv", header=True, schema="src: string, dst: string, rank: int")
df = reader.read() # this will take some time
df.show(10)
```

## engines

- `ngdi.engines.SparkEngine` is the Spark Engine for `ngdi.NebulaReader`, `ngdi.NebulaWriter` and `ngdi.NebulaAlgorithm`.