This process covers collecting the data and preprocessing it, then publishing the preprocessed data to a queue/DB for the next task.
- How to read the data.
import pandas as pd
solar_data = pd.read_excel("solar_data.xls")  # in case of Excel data
Output:
What Do You Want To Do? | What Kind of Renewable Energy are you loooking for? | Qualifying Questions | What are you looking to power? | What kind of property is it? | Qualifying Q | Qualifying Questions.1 | Products | Unnamed: 8 | Category 1 | ... | Minor category 1.4 | Category 2 | Subcategory 2 | Minor category 2 | Category 3 | Subcategory 3 | Minor category 3 | Category 4 | Subcategory 4 | Minor category 4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Reivew Renewable Energy Options | Solar Power for my…. | NaN | Home | On Grid | profile questions - hold for next release | NaN | Resi Grid Tie Packages | NaN | Solar | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | NaN | NaN | NaN | NaN | Off Grid | profile questions - hold for next release | NaN | Resi Off Grid Packages | NaN | Solar | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | NaN | NaN | NaN | Small Business | Own building/property | NaN | NaN | Commercial Solar Pkgs | NaN | Engineering/Design | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | NaN | NaN | NaN | Guest/Pool House | On Grid | profile questions - hold for next release | NaN | Resi Grid Tie Packages | NaN | Solar | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | NaN | NaN | NaN | NaN | Off Grid | profile questions - hold for next release | NaN | Resi Off Grid Packages | NaN | Solar | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 24 columns
import numpy as np

# Replace NaN values; otherwise they will cause problems when inserting into DBs.
solar_data = solar_data.replace(np.nan, "")

# Apply text preprocessing to every cell (process_text is a custom cleaning function).
solar_data = solar_data.applymap(process_text)
# For values like *Engineering/Design*, we can split on the separator
# and convert them to a list or dict. For example:
text = "Engineering/Design"
# convert it to a list
text = text.split("/")  # ["Engineering", "Design"]
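The same split can be applied to a whole column at once with pandas' vectorized string methods; a minimal sketch (the column name follows the sample data above):

```python
import pandas as pd

# Sample column containing separator-delimited categories.
df = pd.DataFrame({"Category 1": ["Engineering/Design", "Solar"]})

# Vectorized split: each cell becomes a list of parts.
df["Category 1 parts"] = df["Category 1"].str.split("/")

print(df["Category 1 parts"].tolist())  # → [['Engineering', 'Design'], ['Solar']]
```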
# Visualize and analyze the data by grouping/aggregating.
solar_data.groupby("Category 1").size()
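A runnable sketch of the grouping step, with toy rows standing in for the spreadsheet (column names follow the sample output above):

```python
import pandas as pd

# Toy rows mimicking the spreadsheet's "Category 1" / "Products" columns.
solar = pd.DataFrame({
    "Category 1": ["Solar", "Solar", "Engineering/Design"],
    "Products": ["Resi Grid Tie Packages", "Resi Off Grid Packages", "Commercial Solar Pkgs"],
})

# Count rows per top-level category.
counts = solar.groupby("Category 1").size()
print(counts["Solar"])  # → 2
```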
- After pre-processing, send the data to a queue/DB.
Here I am taking Kafka as an example. Kafka has topics, and each topic can have multiple subscribers and consumers. For the same service we can run multiple consumers inside one consumer group for a distributed workload.
Creating a topic.
from kafka.admin import KafkaAdminClient, NewTopic
admin_client = KafkaAdminClient(bootstrap_servers=CONFIG.kafka_server)
topic_list = [NewTopic(name=<topic_name>, num_partitions=<no_of_partition>, replication_factor=<no_of_replica>)]
admin_client.create_topics(new_topics=topic_list, validate_only=False)
Publish to Kafka.
from json import dumps
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=[<kafka_server>],
    value_serializer=lambda x: dumps(x).encode('utf-8')
)
data = {
----
}
producer.send(<topic_name>, value=data)
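The value_serializer can be checked in isolation before wiring it to a broker; this sketch round-trips a record through the same dumps/encode step (no Kafka connection needed, record contents are illustrative):

```python
from json import dumps, loads

# Same serializer shape as passed to KafkaProducer above.
serialize = lambda x: dumps(x).encode("utf-8")

record = {"Products": "Resi Grid Tie Packages", "Category 1": "Solar"}
payload = serialize(record)  # bytes, ready for producer.send(topic, value=record)
print(loads(payload.decode("utf-8")) == record)  # → True
```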
- We can mine the data to extract keyword information.
- We can apply a named entity recognition (NER) model to identify attributes in the text. e.g.: if the data contains: "What Kind of Renewable Energy are you loooking for?" Ans: "I am looking for Home", the model will help identify that HOME is actually the installation location.
How can we do this? We have all the text and the installation-location names, so we can train a NER model on them.
REFERENCE: https://arxiv.org/pdf/1909.10148v1.pdf
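Training data for such a NER model is usually (text, entity-span) pairs; a minimal sketch of deriving the character offsets (the label name INSTALLATION_LOCATION is my own choice, not from the source):

```python
# Build (text, annotations) pairs in a spaCy-style character-offset format.
def make_example(text, entity, label):
    start = text.find(entity)
    assert start != -1, "entity must occur in the text"
    return (text, {"entities": [(start, start + len(entity), label)]})

example = make_example("I am looking for Home", "Home", "INSTALLATION_LOCATION")
print(example)
# → ('I am looking for Home', {'entities': [(17, 21, 'INSTALLATION_LOCATION')]})
```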
- There are multiple levels of classification. As an example: "What Kind of Renewable Energy are you looking for?" = "solar in my home....", "What are you looking to power?" = "home", "What kind of property is it?" = "On Grid", "Products" = "Resi Grid Tie Packages".
We can take these data and build a multilevel classification model that predicts a hierarchical classification tree, e.g.: classification_l1 = "Solar", classification_l2 = "Solar Packages".
This can be done in two ways:
- Apply sequence modeling.
- Apply multiple softmax layers on top of an extracted global feature, one per classification level.
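A minimal numpy sketch of the second option: one shared feature vector feeding an independent softmax head per hierarchy level (the weights here are random placeholders, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

feature = rng.normal(size=16)        # global feature from a shared encoder
heads = {
    "l1": rng.normal(size=(3, 16)),  # 3 level-1 classes
    "l2": rng.normal(size=(5, 16)),  # 5 level-2 classes
}

# One independent softmax prediction per classification level.
preds = {level: softmax(W @ feature) for level, W in heads.items()}
print(preds["l1"].shape, preds["l2"].shape)  # → (3,) (5,)
```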
- I have good experience working with Keras and TensorFlow. Here I have attached one of my repos: https://github.com/IIITian-Chandan/Product-Image-Grouping
I can use Flask/Falcon to expose ML services as APIs, and deploy them using containers.
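A minimal Flask sketch of such a service (the /classify endpoint and the stubbed response are illustrative, not a real model):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/classify", methods=["POST"])
def classify():
    payload = request.get_json(force=True)
    # A real service would run the trained model here; we return a stub prediction.
    return jsonify({"text": payload.get("text", ""), "classification_l1": "Solar"})

# Exercise the endpoint without a running server, via Flask's test client.
with app.test_client() as client:
    resp = client.post("/classify", json={"text": "solar in my home"})
    print(resp.get_json())
```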
I use pymongo in Python.
from pymongo import MongoClient

MONGODB_URI = <MONGO_URI>
MONGODB_DATABASE = <DB_NAME>
MONGODB_COLLECTION = <COLLECTION_NAME>

client = MongoClient(MONGODB_URI, ssl=False)
db = client[MONGODB_DATABASE][MONGODB_COLLECTION]
items = [{}, {}....] # list of objects
db.insert_many(items)
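The items list can be produced directly from the preprocessed dataframe; a minimal sketch with toy columns (no MongoDB connection needed to build the documents):

```python
import pandas as pd

solar = pd.DataFrame({
    "Products": ["Resi Grid Tie Packages", "Commercial Solar Pkgs"],
    "Category 1": ["Solar", "Engineering/Design"],
})

# One MongoDB document per dataframe row.
items = solar.to_dict("records")
print(items[0])  # → {'Products': 'Resi Grid Tie Packages', 'Category 1': 'Solar'}
```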
You can use the Elasticsearch Python client or the Node.js library for moving data.
In Elasticsearch you can run several filter and aggregation queries for visualization.
For text matching: exact match with term or match (with boost 1); phrase matching with match_phrase; wildcard matching with wildcard; filtering with filter; scripted filters and so on are also available.
We can run several aggregations such as count, range, cardinality, histogram, etc.
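As a sketch, the filter and aggregation queries are plain dicts in the Python client; the body below combines a phrase match, a term filter, and a terms aggregation (field names are my own illustration, and "category" would need a keyword mapping):

```python
# Query body as would be passed to elasticsearch-py's es.search(index=..., body=body).
body = {
    "query": {
        "bool": {
            "must": [{"match_phrase": {"products": "grid tie"}}],   # phrase matching
            "filter": [{"term": {"category": "Solar"}}],            # exact-match filter
        }
    },
    "aggs": {
        "per_category": {"terms": {"field": "category"}},           # bucket aggregation
    },
}

print(sorted(body))  # → ['aggs', 'query']
```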
Products and classifications can be nodes, and I can connect products to classifications with a relationship:
(:Products)-[:has]->(:Classification)
We can write:
CREATE (p:Products {name: "", setup_location: "", ....})
CREATE (c:Classification {l1: "", l2: "", ....})
CREATE (p)-[h:has]->(c)
RETURN p, h, c
We can do filtering and aggregation on top of this graph.