-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Flesh out WeaveFeaturization #128
base: main
Are you sure you want to change the base?
Conversation
The result should still be a |
@@ -48,6 +48,7 @@ function SpeciesFeatureDescriptor(name::String) | |||
else | |||
# TODO: figure out default binning situation for continuous-valued SFD's | |||
#codec = OneHotOneCold(false, ) | |||
codec = DirectCodec(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this actually make sense as a default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I doubt it, but I needed something there for the time being. Ideally we would have have to get a number of bins from the user or guess. I don't think OneHotOneCold would work since that implies categorical variable, which continuous variables are not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OneHotOneCold
works for categorical variables, you just have to specify bins. That's what we've been doing all along since the DirectCodec
was just added...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. DataFrames does some kind of trickery to inspect whether a column is categorical or notz but I don't think we need to be doing anything too sophisticated at this stage. I was thinking assuming something is a continuous variable as the default would be reasonable and if a feature is categorical that would need to be described explicitly.
default_bond_feature_list = ["bondorder", "isaromaticbond", "isringbond"]) | ||
species_feats = intersect(keys(Utils.SpeciesFeatureUtils.sfd_names_props), default_atom_feature_list) | ||
|
||
atoms = SpeciesFeatureDescriptor.(species_feats) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit of a placeholder, since we would want to get both the species features and the atom features from different arrays. Would it be fine to split those out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you mean how this is clunky. What do you mean by "split them out," though? Just split the species features and element features into separate lists?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right, explicitly have different lists for different feature descriptors. We can concatenate the emergent encodings at either the encode
step or the featurize
step.
Codecov Report
@@ Coverage Diff @@
## main #128 +/- ##
==========================================
- Coverage 80.73% 77.42% -3.32%
==========================================
Files 20 20
Lines 571 598 +27
==========================================
+ Hits 461 463 +2
- Misses 110 135 +25
Continue to review full report at Codecov.
|
If we have |
Where are the missings used ultimately? |
The |
function WeaveFeaturization(element_feature_list = ["Atomic no"], | ||
species_feature_list = ["degree", | ||
"implicithconnected", | ||
"charge", | ||
"radical_electrons", | ||
"hybridization", | ||
"isaromatic", | ||
"hydrogenconnected"], | ||
bond_feature_list = ["bondorder", "isaromaticbond", "isringbond"]) | ||
elements = ElementFeatureDescriptor.(element_feature_list) | ||
species = SpeciesFeatureDescriptor.(species_feature_list) | ||
bonds = BondFeatureDescriptor.(bond_feature_list) | ||
# pairs = PairFeatureDescriptor.(default_atom_feature_list) | ||
WeaveFeaturization(elements, species, bonds, bonds) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[JuliaFormatter] reported by reviewdog 🐶
function WeaveFeaturization(element_feature_list = ["Atomic no"], | |
species_feature_list = ["degree", | |
"implicithconnected", | |
"charge", | |
"radical_electrons", | |
"hybridization", | |
"isaromatic", | |
"hydrogenconnected"], | |
bond_feature_list = ["bondorder", "isaromaticbond", "isringbond"]) | |
elements = ElementFeatureDescriptor.(element_feature_list) | |
species = SpeciesFeatureDescriptor.(species_feature_list) | |
bonds = BondFeatureDescriptor.(bond_feature_list) | |
# pairs = PairFeatureDescriptor.(default_atom_feature_list) | |
WeaveFeaturization(elements, species, bonds, bonds) | |
function WeaveFeaturization( | |
element_feature_list = ["Atomic no"], | |
species_feature_list = [ | |
"degree", | |
"implicithconnected", | |
"charge", | |
"radical_electrons", | |
"hybridization", | |
"isaromatic", | |
"hydrogenconnected", | |
], | |
bond_feature_list = ["bondorder", "isaromaticbond", "isringbond"], | |
) | |
elements = ElementFeatureDescriptor.(element_feature_list) | |
species = SpeciesFeatureDescriptor.(species_feature_list) | |
bonds = BondFeatureDescriptor.(bond_feature_list) | |
# pairs = PairFeatureDescriptor.(default_atom_feature_list) | |
WeaveFeaturization(elements, species, bonds, bonds) |
end | ||
|
||
function atom_features(feat, mol; kw...) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[JuliaFormatter] reported by reviewdog 🐶
"C", | ||
"N", | ||
"O", | ||
"S", | ||
"F", | ||
"Si", | ||
"P", | ||
"Cl", | ||
"Br", | ||
"Mg", | ||
"Na", | ||
"Ca", | ||
"Fe", | ||
"As", | ||
"Al", | ||
"I", | ||
"B", | ||
"V", | ||
"K", | ||
"Tl", | ||
"Yb", | ||
"Sb", | ||
"Sn", | ||
"Ag", | ||
"Pd", | ||
"Co", | ||
"Se", | ||
"Ti", | ||
"Zn", | ||
"H", # H? | ||
"Li", | ||
"Ge", | ||
"Cu", | ||
"Au", | ||
"Ni", | ||
"Cd", | ||
"In", | ||
"Mn", | ||
"Zr", | ||
"Cr", | ||
"Pt", | ||
"Hg", | ||
"Pb", | ||
"Unknown" | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[JuliaFormatter] reported by reviewdog 🐶
"C", | |
"N", | |
"O", | |
"S", | |
"F", | |
"Si", | |
"P", | |
"Cl", | |
"Br", | |
"Mg", | |
"Na", | |
"Ca", | |
"Fe", | |
"As", | |
"Al", | |
"I", | |
"B", | |
"V", | |
"K", | |
"Tl", | |
"Yb", | |
"Sb", | |
"Sn", | |
"Ag", | |
"Pd", | |
"Co", | |
"Se", | |
"Ti", | |
"Zn", | |
"H", # H? | |
"Li", | |
"Ge", | |
"Cu", | |
"Au", | |
"Ni", | |
"Cd", | |
"In", | |
"Mn", | |
"Zr", | |
"Cr", | |
"Pt", | |
"Hg", | |
"Pb", | |
"Unknown" | |
] | |
"C", | |
"N", | |
"O", | |
"S", | |
"F", | |
"Si", | |
"P", | |
"Cl", | |
"Br", | |
"Mg", | |
"Na", | |
"Ca", | |
"Fe", | |
"As", | |
"Al", | |
"I", | |
"B", | |
"V", | |
"K", | |
"Tl", | |
"Yb", | |
"Sb", | |
"Sn", | |
"Ag", | |
"Pd", | |
"Co", | |
"Se", | |
"Ti", | |
"Zn", | |
"H", # H? | |
"Li", | |
"Ge", | |
"Cu", | |
"Au", | |
"Ni", | |
"Cd", | |
"In", | |
"Mn", | |
"Zr", | |
"Cr", | |
"Pt", | |
"Hg", | |
"Pb", | |
"Unknown", | |
] |
atom_features | ||
bond_features | ||
pair_features |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[JuliaFormatter] reported by reviewdog 🐶
atom_features | |
bond_features | |
pair_features | |
atom_features::Any | |
bond_features::Any | |
pair_features::Any |
function encode(fzn::WeaveFeaturization, ag::AtomGraph; atom_feature_kwargs = (;), | ||
bond_feature_kwargs = (;), | ||
pair_feature_kwargs = (;)) | ||
sf = mapreduce(x -> encode(x, ag, atom_feature_kwargs...), vcat, fzn.atom_features) | ||
ef = mapreduce(x -> encode(x, ag, atom_feature_kwargs...), vcat, fzn.element_features) | ||
atom_and_elements = vcat(sf, ef) | ||
bf = cat(map(x -> encode(x, ag, bond_feature_kwargs...), fzn.bond_features)..., dims = 3) | ||
pf = cat(map(x -> encode(x, ag, pair_feature_kwargs...), fzn.pair_features)..., dims = 3) | ||
# Return FeaturizedAtoms here | ||
atom_and_elements, vcat(bf, pf) | ||
# FeaturizedWeave(atom_and_elements, bf, pf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[JuliaFormatter] reported by reviewdog 🐶
function encode(fzn::WeaveFeaturization, ag::AtomGraph; atom_feature_kwargs = (;), | |
bond_feature_kwargs = (;), | |
pair_feature_kwargs = (;)) | |
sf = mapreduce(x -> encode(x, ag, atom_feature_kwargs...), vcat, fzn.atom_features) | |
ef = mapreduce(x -> encode(x, ag, atom_feature_kwargs...), vcat, fzn.element_features) | |
atom_and_elements = vcat(sf, ef) | |
bf = cat(map(x -> encode(x, ag, bond_feature_kwargs...), fzn.bond_features)..., dims = 3) | |
pf = cat(map(x -> encode(x, ag, pair_feature_kwargs...), fzn.pair_features)..., dims = 3) | |
# Return FeaturizedAtoms here | |
atom_and_elements, vcat(bf, pf) | |
# FeaturizedWeave(atom_and_elements, bf, pf) | |
function encode( | |
fzn::WeaveFeaturization, | |
ag::AtomGraph; | |
atom_feature_kwargs = (;), | |
bond_feature_kwargs = (;), | |
pair_feature_kwargs = (;), | |
) | |
sf = mapreduce(x -> encode(x, ag, atom_feature_kwargs...), vcat, fzn.atom_features) | |
ef = mapreduce(x -> encode(x, ag, atom_feature_kwargs...), vcat, fzn.element_features) | |
atom_and_elements = vcat(sf, ef) | |
bf = | |
cat(map(x -> encode(x, ag, bond_feature_kwargs...), fzn.bond_features)..., dims = 3) | |
pf = | |
cat(map(x -> encode(x, ag, pair_feature_kwargs...), fzn.pair_features)..., dims = 3) | |
# Return FeaturizedAtoms here | |
atom_and_elements, vcat(bf, pf) | |
# FeaturizedWeave(atom_and_elements, bf, pf) |
println(io, "WeaveFeaturization(") | ||
println(io) | ||
println(io, " Species Features: $(map(x -> x.name, fzn.element_features))") | ||
println(io, " Atom Features: $(map(x -> x.name, fzn.atom_features))") | ||
println(io, " Bond Features: $(map(x -> x.name, fzn.bond_features))") | ||
println(io, ")") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[JuliaFormatter] reported by reviewdog 🐶
println(io, "WeaveFeaturization(") | |
println(io) | |
println(io, " Species Features: $(map(x -> x.name, fzn.element_features))") | |
println(io, " Atom Features: $(map(x -> x.name, fzn.atom_features))") | |
println(io, " Bond Features: $(map(x -> x.name, fzn.bond_features))") | |
println(io, ")") | |
println(io, "WeaveFeaturization(") | |
println(io) | |
println(io, " Species Features: $(map(x -> x.name, fzn.element_features))") | |
println(io, " Atom Features: $(map(x -> x.name, fzn.atom_features))") | |
println(io, " Bond Features: $(map(x -> x.name, fzn.bond_features))") | |
println(io, ")") |
end | ||
|
||
function bond_feature(bond, mol; kw...) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[JuliaFormatter] reported by reviewdog 🐶
println(io, "WeaveFeaturization(") | ||
println(io, " Species Features: $(map(x -> x.name, fzn.element_features))") | ||
println(io, " Atom Features: $(map(x -> x.name, fzn.atom_features))") | ||
println(io, " Bond Features: $(map(x -> x.name, fzn.bond_features))") | ||
println(io, ")") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[JuliaFormatter] reported by reviewdog 🐶
println(io, "WeaveFeaturization(") | |
println(io, " Species Features: $(map(x -> x.name, fzn.element_features))") | |
println(io, " Atom Features: $(map(x -> x.name, fzn.atom_features))") | |
println(io, " Bond Features: $(map(x -> x.name, fzn.bond_features))") | |
println(io, ")") | |
println(io, "WeaveFeaturization(") | |
println(io, " Species Features: $(map(x -> x.name, fzn.element_features))") | |
println(io, " Atom Features: $(map(x -> x.name, fzn.atom_features))") | |
println(io, " Bond Features: $(map(x -> x.name, fzn.bond_features))") | |
println(io, ")") |
JFYI, this should eventually be part of WeaveModel under the new organization, so if you want to make it into a PR over there now feel free, or if you'd rather wait until the reorg is actually registered, that's fine too. |
dea0c7a
to
2b41b38
Compare
This is based on the weve featurization from the weave implementation in deepchem, but not exactly a direct port.
I'm not sure I love the way we map from species feature names to the MolecularGraph functions here, seems like that would make it very manual to "just" use some feature provided by MolecularGraph.
There are still a couple of TODOs, one of them being fleshing out both
atom_features
as well aspair_features
, but before I implement them, I wanted to ask whether we would be fine with something like aFeaturizedMol
(similar toFeaturizedAtom
, it can hold featurized atom, bond and pair info), or is there a different method I should look into?