This repository has been archived by the owner on Oct 8, 2019. It is now read-only.
Support feature selection #338
@amaya382 It depends on the input format of each algorithm. If we can use the same format as the input, it would be better to provide a … Also, how to apply distributed processing is another issue. mRMR is processed in parallel by MapReduce in some implementations.
@amaya382 we can do better as follows:
create table input (
X array<double>, -- features
Y array<int> -- binarized label
);
WITH stats as (
select
-- UDAF transpose_and_dot(Y::array<double>, X::array<double>)::array<array<double>>
transpose_and_dot(Y, X) as observed, -- array<array<double>> # n_classes * n_features matrix
array_sum(X) as feature_count, -- n_features col vector # array<double>
array_avg(Y) as class_prob -- n_class col vector # array<double>
from
input
),
test as (
select
observed,
-- UDAF transpose_and_dot(class_prob::array<double>, feature_count::array<double>)::array<array<double>>
transpose_and_dot(class_prob, feature_count) as expected -- array<array<double>> # n_classes * n_features matrix
from
stats
)
select
-- UDF chi2(observed::array<array<double>>, expected::array<array<double>>)::struct<array<double>,array<double>>
chi2(observed, expected) as (chi2, pval) -- struct<array<double>,array<double>> # n_features
from
test;
select
select_k_best(X, T.chi2, ${k}) as X, -- feature selection based on chi2 score
Y
from
input
CROSS JOIN chi2 T
;

What you have to develop is just …
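The pipeline above can be sketched in NumPy for validation. This is a minimal sketch under my own naming, not Hivemall's implementation: `transpose_and_dot(Y, X)` corresponds to `Y.T @ X`, `array_sum`/`array_avg` to column sums/means, and `expected` to the outer product of `class_prob` and `feature_count`.

```python
import numpy as np

def chi2_scores(X, Y):
    """Per-feature chi-square scores for one-hot labels Y and non-negative features X.

    X: (n_samples, n_features); Y: (n_samples, n_classes), one-hot.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    observed = Y.T @ X                        # transpose_and_dot(Y, X): (n_classes, n_features)
    feature_count = X.sum(axis=0)             # array_sum(X): (n_features,)
    class_prob = Y.mean(axis=0)               # array_avg(Y): (n_classes,)
    expected = np.outer(class_prob, feature_count)
    # chi-square statistic, summed over classes for each feature
    return ((observed - expected) ** 2 / expected).sum(axis=0)
```

On a toy example with two samples per class, this matches the statistic computed by scikit-learn's `chi2` scorer.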
@myui Okay, I'll try this strategy.
@amaya382 Could you validate your implementation against another chi2 implementation (e.g., scikit-learn) in the unit and system tests?
chi2 (Iris dataset): results by Hivemall with systemtest, which actually executes the query
results by scikit-learn
SNR (Iris dataset): results by Hivemall with systemtest, which actually executes the query (incremental algorithm)
results by python-numpy (batch algorithm)
Also, already tested on EMR; it worked properly.
Feature selection
Feature selection is the process of selecting a subset of influential features from the full feature set. It is an important technique to enhance results, shorten training time, and make features human-understandable.
Currently, the following is a temporary interface (I/F).
Candidates for internal selection methods: mRMR

Common
[UDAF] transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>>
- Input: two array columns, X and Y
- Output: dot(X.T, Y), shape = (X.#cols, Y.#cols)

[UDF] select_k_best(X::array<number>, importance_list::array<int>, k::int)::array<double>
- Input: feature vector X, per-feature importance scores, and k
- Output: the k elements of X with the highest importance, as array<double>

/***********************************************************************
Note: importance_list and k are assumed to be equal (constant) across
all rows. It may confuse us.
***********************************************************************/
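The intended behavior can be sketched in NumPy. This is my assumption of the semantics (keep the k entries with the highest importance, preserving the original feature order), not the actual Hivemall implementation:

```python
import numpy as np

def select_k_best(x, importance_list, k):
    # indices of the k largest importance scores
    idx = np.argsort(importance_list)[::-1][:k]
    # preserve the original order of the surviving features (my assumption)
    return [x[i] for i in sorted(idx)]
```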
chi2

[UDF] chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>
- Input: both observed and expected, shape = (#classes, #features); expected = dot(class_prob.T, feature_count)
- Output: struct of chi2 scores and p-values, each an array of length #features
Example - chi2
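As a rough stand-in for the example, a NumPy sketch of the score half of the UDF (my own code, not Hivemall's). The p-value half, omitted here to keep the sketch dependency-free, would come from the chi-square survival function with #classes - 1 degrees of freedom, as scikit-learn does:

```python
import numpy as np

def chi2_udf(observed, expected):
    # both inputs: (#classes, #features) matrices; returns per-feature chi2 scores
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return ((observed - expected) ** 2 / expected).sum(axis=0)
```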
SNR

[UDAF] snr(X::array<number>, Y::array<int>)::array<double>
- Input: feature vector X and one-hot class label Y
- Output: array of SNR scores, length = #features
- Note: computed as an incremental algorithm (validated against a batch python-numpy implementation, see above)
Example - snr
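For reference, a batch NumPy sketch of the classic (Golub-style) signal-to-noise ratio for the two-class case; this is my own illustrative code, and the multi-class handling of the actual UDAF is not covered here:

```python
import numpy as np

def snr_scores(X, Y):
    """Per-feature SNR for two classes: |mean_a - mean_b| / (std_a + std_b).

    X: (n_samples, n_features); Y: (n_samples, 2), one-hot labels.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=int)
    a = X[Y[:, 0] == 1]          # samples of class 0
    b = X[Y[:, 1] == 1]          # samples of class 1
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / (a.std(axis=0) + b.std(axis=0))
```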