chandanmishra-03/Solar_Data_modeling
Building a data pipeline for highly unstructured renewable energy data.

Collecting Data & Data Pre-Processing

This stage covers collecting the data and preprocessing it; finally, the cleaned data is published to a queue (or DB) for the next task.

  • How to read the data.
import pandas as pd

# Read the raw spreadsheet (read_excel handles .xls/.xlsx files).
solar_data = pd.read_excel("solar_data.xls")
Output:

A truncated preview (5 rows × 24 columns). Columns include "What Do You Want To Do?", "What Kind of Renewable Energy are you loooking for?", the qualifying questions, "What are you looking to power?", "What kind of property is it?", "Products", and nested "Category / Subcategory / Minor category" groups 1–4; most cells are NaN. For example, the first row reads "Reivew Renewable Energy Options" → "Solar Power for my…." → "Home" → "On Grid profile questions - hold for next release" → "Resi Grid Tie Packages" → "Solar".

import numpy as np

# Remove the NaN values; otherwise they cause problems when inserting into DBs.
solar_data = solar_data.replace(np.nan, "")

# Apply text processing to every cell (process_text is a user-defined cleaning function).
solar_data = solar_data.apply(lambda col: col.map(process_text))

# For values like "Engineering/Design", we can split on the separator
# and convert them to a list (or dict).
text = "Engineering/Design"
text = text.split("/")  # ["Engineering", "Design"]

# Visualize and analyse the data by grouping/aggregating.
solar_data.groupby("Category 1")
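The process_text step above is user-defined; here is a minimal sketch of what such a cleaning function might do (the exact cleaning rules are an assumption, adapt them to the real data):

```python
import re

def process_text(value):
    """Minimal text cleanup: trim, collapse whitespace, drop stray punctuation.
    These rules are an assumption; adjust to the real data."""
    if not isinstance(value, str):
        return value  # leave non-string cells untouched
    value = value.strip()
    value = re.sub(r"\s+", " ", value)          # collapse runs of whitespace
    value = re.sub(r"[^\w\s/.\-?]", "", value)  # keep word chars, /, ., -, ?
    return value
```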
  • After pre-processing, send the data to a queue/DB.

Here I'm using Kafka. Kafka organizes messages into topics, and each topic can have multiple subscribers. For the same service we can run multiple consumers inside a consumer group to distribute the workload.

Creating a topic.

from kafka.admin import KafkaAdminClient, NewTopic

admin_client = KafkaAdminClient(bootstrap_servers=CONFIG.kafka_server)

topic_list = [NewTopic(name=<topic_name>, num_partitions=<no_of_partition>, replication_factor=<no_of_replica>)]
admin_client.create_topics(new_topics=topic_list, validate_only=False)

Publish to Kafka.

from json import dumps
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=[<kafka_server>],
    value_serializer=lambda x: dumps(x).encode('utf-8'),
)
data = {
       ----
    }

producer.send(<topic_name>, value=data)
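The consumer side of the consumer group mentioned above might look like this (a sketch, assuming the kafka-python client; the group id is a hypothetical name):

```python
from json import loads

def deserialize(raw: bytes) -> dict:
    """Mirror of the producer's value_serializer: UTF-8 JSON bytes -> dict."""
    return loads(raw.decode("utf-8"))

def make_consumer(topic, servers, group_id="solar-preprocessing"):
    # kafka-python is imported lazily so deserialize() stays usable without a broker.
    from kafka import KafkaConsumer
    return KafkaConsumer(
        topic,
        bootstrap_servers=servers,
        group_id=group_id,             # consumers sharing this id split the partitions
        value_deserializer=deserialize,
        auto_offset_reset="earliest",  # start from the beginning if no offset is saved
    )
```

Running several processes with the same group_id spreads the topic's partitions across them for distributed workload.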

Applying ML Algorithms

  • We can mine the data to extract keyword information.

    • We can apply a named entity recognition (NER) model to identify attributes in the text. e.g.: if the data contains: "What Kind of Renewable Energy are you looking for?" Ans: "I am looking for Home"

    The model will help to identify that

    HOME is actually the installation location.

    How can we do this? We have all the text and the installation location names, so we can train a NER model on them.

    REFERENCE: https://arxiv.org/pdf/1909.10148v1.pdf
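Before training a full NER model, a rule-based lookup can illustrate the target behaviour (the gazetteer entries below are hypothetical; a trained model would learn them from annotated text instead):

```python
# Hypothetical gazetteer of installation locations; a trained NER model
# would learn these spans from annotated text instead of a fixed list.
LOCATIONS = {"home", "guest/pool house", "small business"}

def extract_installation_location(answer: str):
    """Return the first known installation location mentioned in the answer."""
    text = answer.lower()
    # Check longer names first so "guest/pool house" wins over "home"-like substrings.
    for loc in sorted(LOCATIONS, key=len, reverse=True):
        if loc in text:
            return loc
    return None
```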

  • There are multiple levels of classification. As an example: "What Kind of Renewable Energy are you looking for?" = "solar in my home....", "What are you looking to power?" = "home", "What kind of property is it?" = "On Grid", "Products" = "Resi Grid Tie Packages"

We can take these data and build a multi-level classification model that predicts a hierarchical classification tree, e.g.: classification_l1 = "Solar", classification_l2 = "Solar Packages"
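A toy sketch of such level-wise prediction, where the level-2 label is predicted conditioned on the level-1 result; keyword rules stand in for the trained classifiers, and all keywords and labels here are illustrative assumptions:

```python
# Keyword rules standing in for trained classifiers; labels/keywords are illustrative.
L1_KEYWORDS = {"Solar": ["solar"], "Wind": ["wind", "turbine"]}
L2_KEYWORDS = {
    "Solar": {
        "Resi Grid Tie Packages": ["on grid", "home"],
        "Resi Off Grid Packages": ["off grid"],
    },
}

def classify(text):
    """Predict a two-level label path; level 2 is conditioned on level 1."""
    t = text.lower()
    l1 = next((label for label, kws in L1_KEYWORDS.items()
               if any(k in t for k in kws)), None)
    l2 = None
    if l1 in L2_KEYWORDS:
        l2 = next((label for label, kws in L2_KEYWORDS[l1].items()
                   if any(k in t for k in kws)), None)
    return {"classification_l1": l1, "classification_l2": l2}
```

In a real pipeline each level would be a trained model (or one hierarchical model), but the conditioning structure is the same.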

The model can be served in several ways; for example, I can use Flask/Falcon to expose the ML service as an API and deploy it using containers.

How to save data in MongoDB.

I use pymongo in Python.

import pymongo

MONGODB_URI = <MONGO_URI>
MONGODB_DATABASE = <DB_NAME>
MONGODB_COLLECTION = <COLLECTION_NAME>

client = pymongo.MongoClient(
    MONGODB_URI,
    ssl=False
)
collection = client[MONGODB_DATABASE][MONGODB_COLLECTION]
items = [{}, {}....]  # list of documents
collection.insert_many(items)

Sync the data to Elasticsearch for querying.

You can use the Elasticsearch Python client or the Node.js library to move the data.

In Elasticsearch you can run several filter and aggregation queries for visualization.

For text matching — exact match: term, or match (with boost 1); phrase matching: match_phrase; wildcard matching: wildcard; for filtering: filter clauses, including scripted filters, and so on.

We can run several aggregations such as count, range, cardinality, histogram, etc.
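As a sketch, the request bodies for a few of these queries could look like this (the index field names, such as "Products" and "Category 1", are assumptions about what the synced index contains):

```python
# Example Elasticsearch request bodies; field names are assumptions
# about the synced index, so adjust them to the real mapping.
exact_match = {"query": {"term": {"Products.keyword": "Resi Grid Tie Packages"}}}
phrase_match = {"query": {"match_phrase": {"question": "renewable energy"}}}
wildcard_match = {"query": {"wildcard": {"Products.keyword": "Resi*"}}}

# Aggregation: count documents per level-1 category, no hits returned.
count_by_category = {
    "size": 0,
    "aggs": {"per_category": {"terms": {"field": "Category 1.keyword"}}},
}
```

Each body can be passed to the client's search call against the target index.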

We can store the same data in Neo4j,

where products and classifications are nodes and we connect products to classifications with a relationship:

(:Products)-[:has]->(:Classification)

We can write:

CREATE (p:Products {name: "", setup_location: "", ....})
CREATE (c:Classification {l1: "", l2: "", ....})

CREATE (p)-[h:has]->(c)
RETURN p, h, c

We can do filtering and aggregation on top of this.
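For example, a count aggregation over that graph could be issued from Python (a sketch assuming the official neo4j driver; the URI, credentials, and property names are placeholders):

```python
# Cypher: count products per level-1 classification over the graph above.
COUNT_QUERY = (
    "MATCH (p:Products)-[:has]->(c:Classification) "
    "RETURN c.l1 AS l1, count(p) AS products "
    "ORDER BY products DESC"
)

def count_products_per_l1(uri, user, password):
    # The neo4j driver is imported lazily; connection details are placeholders.
    from neo4j import GraphDatabase
    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        records, _, _ = driver.execute_query(COUNT_QUERY)
        return [(r["l1"], r["products"]) for r in records]
```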
