chandanmishra-03/Solar_Data_modeling
Building a data pipeline for highly unstructured renewable energy data.

Collecting Data & Data Pre-Processing

This stage covers collecting the data and preprocessing it; finally, the cleaned data is published to a queue (or DB) for the next task.

  • How to read the data.
import pandas as pd

# Read the raw spreadsheet (read_excel handles .xls/.xlsx files).
solar_data = pd.read_excel("solar_data.xls")
Output:

A truncated preview (5 rows × 24 columns). Columns include "What Do You Want To Do?", "What Kind of Renewable Energy are you loooking for?", the qualifying questions, "What are you looking to power?", "What kind of property is it?", "Products", and nested "Category / Subcategory / Minor category" groups 1–4; most cells are NaN. For example, the first row reads "Reivew Renewable Energy Options" → "Solar Power for my…." → "Home" → "On Grid profile questions - hold for next release" → "Resi Grid Tie Packages" → "Solar".

import numpy as np

# Remove the NaN values; otherwise they cause problems when inserting into DBs.
solar_data = solar_data.replace(np.nan, "")

# Apply text processing to every cell (process_text is a user-defined cleaning function).
solar_data = solar_data.apply(lambda col: col.map(process_text))

# For values like "Engineering/Design", we can split on the separator
# and convert them to a list (or dict).
text = "Engineering/Design"
text = text.split("/")  # ["Engineering", "Design"]

# Visualize and analyse the data by grouping/aggregating.
solar_data.groupby("Category 1")
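The process_text step above is user-defined; here is a minimal sketch of what such a cleaning function might do (the exact cleaning rules are an assumption, adapt them to the real data):

```python
import re

def process_text(value):
    """Minimal text cleanup: trim, collapse whitespace, drop stray punctuation.
    These rules are an assumption; adjust to the real data."""
    if not isinstance(value, str):
        return value  # leave non-string cells untouched
    value = value.strip()
    value = re.sub(r"\s+", " ", value)          # collapse runs of whitespace
    value = re.sub(r"[^\w\s/.\-?]", "", value)  # keep word chars, /, ., -, ?
    return value
```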
  • After pre-processing, send the data to a queue/DB.

Here I'm using Kafka. Kafka organizes messages into topics, and each topic can have multiple subscribers. For the same service we can run multiple consumers inside a consumer group to distribute the workload.

Creating a topic.

from kafka.admin import KafkaAdminClient, NewTopic

admin_client = KafkaAdminClient(bootstrap_servers=CONFIG.kafka_server)

topic_list = [NewTopic(name=<topic_name>, num_partitions=<no_of_partition>, replication_factor=<no_of_replica>)]
admin_client.create_topics(new_topics=topic_list, validate_only=False)

Publish to Kafka.

from json import dumps
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=[<kafka_server>],
    value_serializer=lambda x: dumps(x).encode('utf-8'),
)
data = {
       ----
    }

producer.send(<topic_name>, value=data)
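The consumer side of the consumer group mentioned above might look like this (a sketch, assuming the kafka-python client; the group id is a hypothetical name):

```python
from json import loads

def deserialize(raw: bytes) -> dict:
    """Mirror of the producer's value_serializer: UTF-8 JSON bytes -> dict."""
    return loads(raw.decode("utf-8"))

def make_consumer(topic, servers, group_id="solar-preprocessing"):
    # kafka-python is imported lazily so deserialize() stays usable without a broker.
    from kafka import KafkaConsumer
    return KafkaConsumer(
        topic,
        bootstrap_servers=servers,
        group_id=group_id,             # consumers sharing this id split the partitions
        value_deserializer=deserialize,
        auto_offset_reset="earliest",  # start from the beginning if no offset is saved
    )
```

Running several processes with the same group_id spreads the topic's partitions across them for distributed workload.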

Applying ML Algorithms

  • We can mine the data to extract keyword information.

    • We can apply a named entity recognition (NER) model to identify attributes in the text. e.g.: if the data contains: "What Kind of Renewable Energy are you looking for?" Ans: "I am looking for Home"

    The model will help to identify that

    HOME is actually the installation location.

    How can we do this? We have all the text and the installation location names, so we can train a NER model on them.

    REFERENCE: https://arxiv.org/pdf/1909.10148v1.pdf
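Before training a full NER model, a rule-based lookup can illustrate the target behaviour (the gazetteer entries below are hypothetical; a trained model would learn them from annotated text instead):

```python
# Hypothetical gazetteer of installation locations; a trained NER model
# would learn these spans from annotated text instead of a fixed list.
LOCATIONS = {"home", "guest/pool house", "small business"}

def extract_installation_location(answer: str):
    """Return the first known installation location mentioned in the answer."""
    text = answer.lower()
    # Check longer names first so "guest/pool house" wins over "home"-like substrings.
    for loc in sorted(LOCATIONS, key=len, reverse=True):
        if loc in text:
            return loc
    return None
```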

  • There are multiple levels of classification. As an example: "What Kind of Renewable Energy are you looking for?" = "solar in my home....", "What are you looking to power?" = "home", "What kind of property is it?" = "On Grid", "Products" = "Resi Grid Tie Packages"

We can take these data and build a multi-level classification model that predicts a hierarchical classification tree, e.g.: classification_l1 = "Solar", classification_l2 = "Solar Packages"
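A toy sketch of such level-wise prediction, where the level-2 label is predicted conditioned on the level-1 result; keyword rules stand in for the trained classifiers, and all keywords and labels here are illustrative assumptions:

```python
# Keyword rules standing in for trained classifiers; labels/keywords are illustrative.
L1_KEYWORDS = {"Solar": ["solar"], "Wind": ["wind", "turbine"]}
L2_KEYWORDS = {
    "Solar": {
        "Resi Grid Tie Packages": ["on grid", "home"],
        "Resi Off Grid Packages": ["off grid"],
    },
}

def classify(text):
    """Predict a two-level label path; level 2 is conditioned on level 1."""
    t = text.lower()
    l1 = next((label for label, kws in L1_KEYWORDS.items()
               if any(k in t for k in kws)), None)
    l2 = None
    if l1 in L2_KEYWORDS:
        l2 = next((label for label, kws in L2_KEYWORDS[l1].items()
                   if any(k in t for k in kws)), None)
    return {"classification_l1": l1, "classification_l2": l2}
```

In a real pipeline each level would be a trained model (or one hierarchical model), but the conditioning structure is the same.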

The model can be served in several ways; for example, I can use Flask/Falcon to expose the ML service as an API and deploy it using containers.

How to save data in MongoDB.

I use pymongo in Python.

import pymongo

MONGODB_URI = <MONGO_URI>
MONGODB_DATABASE = <DB_NAME>
MONGODB_COLLECTION = <COLLECTION_NAME>

client = pymongo.MongoClient(
    MONGODB_URI,
    ssl=False
)
collection = client[MONGODB_DATABASE][MONGODB_COLLECTION]
items = [{}, {}....]  # list of documents
collection.insert_many(items)

Sync the data to Elasticsearch for querying.

You can use the Elasticsearch Python client or the Node.js library to move the data.

In Elasticsearch you can run several filter and aggregation queries for visualization.

For text matching — exact match: term, or match (with boost 1); phrase matching: match_phrase; wildcard matching: wildcard; for filtering: filter clauses, including scripted filters, and so on.

We can run several aggregations such as count, range, cardinality, histogram, etc.
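As a sketch, the request bodies for a few of these queries could look like this (the index field names, such as "Products" and "Category 1", are assumptions about what the synced index contains):

```python
# Example Elasticsearch request bodies; field names are assumptions
# about the synced index, so adjust them to the real mapping.
exact_match = {"query": {"term": {"Products.keyword": "Resi Grid Tie Packages"}}}
phrase_match = {"query": {"match_phrase": {"question": "renewable energy"}}}
wildcard_match = {"query": {"wildcard": {"Products.keyword": "Resi*"}}}

# Aggregation: count documents per level-1 category, no hits returned.
count_by_category = {
    "size": 0,
    "aggs": {"per_category": {"terms": {"field": "Category 1.keyword"}}},
}
```

Each body can be passed to the client's search call against the target index.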

We can store the same data in Neo4j,

where products and classifications are nodes and we connect products to classifications with a relationship:

(:Products)-[:has]->(:Classification)

We can write:

CREATE (p:Products {name: "", setup_location: "", ....})
CREATE (c:Classification {l1: "", l2: "", ....})

CREATE (p)-[h:has]->(c)
RETURN p, h, c

We can do filtering and aggregation on top of this.
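For example, a count aggregation over that graph could be issued from Python (a sketch assuming the official neo4j driver; the URI, credentials, and property names are placeholders):

```python
# Cypher: count products per level-1 classification over the graph above.
COUNT_QUERY = (
    "MATCH (p:Products)-[:has]->(c:Classification) "
    "RETURN c.l1 AS l1, count(p) AS products "
    "ORDER BY products DESC"
)

def count_products_per_l1(uri, user, password):
    # The neo4j driver is imported lazily; connection details are placeholders.
    from neo4j import GraphDatabase
    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        records, _, _ = driver.execute_query(COUNT_QUERY)
        return [(r["l1"], r["products"]) for r in records]
```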
