Duplicate data source in feature registry #2581

Closed
tkch3n opened this issue Apr 20, 2022 · 4 comments · Fixed by #2603

Comments

tkch3n commented Apr 20, 2022

Expected Behavior

Registry protobuf should not have duplicate data source definitions.

Current Behavior

Every time we run feast apply, we notice that a data source with the exact same details is appended to the registry protobuf.

from feast.protos.feast.core.Registry_pb2 import Registry as RegistryProto

registry_proto = RegistryProto()

# Read the serialized registry (pulled down locally from S3) and parse it.
with open('registry.db', 'rb') as f:
    registry_proto.ParseFromString(f.read())

print(registry_proto)
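(For what it's worth, depending on the Feast version, feast registry-dump from the CLI gives a similar view of the registry contents without this script.)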

This is what the protobuf looks like after running feast apply 3 times in a row without making any changes to the definition.

entities {
  spec {
    name: "__dummy"
    value_type: STRING
    join_key: "__dummy_id"
    project: "myproject"
  }
  meta {
    created_timestamp {
      seconds: 1650485581
      nanos: 37674000
    }
    last_updated_timestamp {
      seconds: 1650485581
      nanos: 37723000
    }
  }
}
registry_schema_version: "1"
version_id: "743b0630-373b-405f-861d-3de738962d7c"
last_updated {
  seconds: 1650485581
  nanos: 37831000
}
feature_views {
  spec {
    name: "IRIS"
    project: "myproject"
    entities: "__dummy"
    features {
      name: "PETAL_LENGTH"
      value_type: FLOAT
    }
    features {
      name: "PETAL_WIDTH"
      value_type: FLOAT
    }
    features {
      name: "SEPAL_LENGTH"
      value_type: FLOAT
    }
    features {
      name: "SEPAL_WIDTH"
      value_type: FLOAT
    }
    features {
      name: "SPECIES"
      value_type: INT64
    }
    ttl {
      seconds: 31449600
    }
    batch_source {
      type: BATCH_SNOWFLAKE
      timestamp_field: "EVENT_TIMESTAMP"
      created_timestamp_column: "CREATE_TIMESTAMP"
      data_source_class_type: "feast.infra.offline_stores.snowflake_source.SnowflakeSource"
      snowflake_options {
        table: "IRIS"
        schema: "MY_SCHEMA"
        database: "MY_DATABASE"
      }
    }
    online: true
    description: "A sample feature view containing the Iris dataset."
  }
  meta {
    created_timestamp {
      seconds: 1650485581
      nanos: 37333000
    }
    last_updated_timestamp {
      seconds: 1650485581
      nanos: 37333000
    }
  }
}
data_sources {
  type: BATCH_SNOWFLAKE
  timestamp_field: "EVENT_TIMESTAMP"
  created_timestamp_column: "CREATE_TIMESTAMP"
  data_source_class_type: "feast.infra.offline_stores.snowflake_source.SnowflakeSource"
  snowflake_options {
    table: "IRIS"
    schema: "MY_SCHEMA"
    database: "MY_DATABASE"
  }
  project: "myproject"
}
data_sources {
  type: BATCH_SNOWFLAKE
  timestamp_field: "EVENT_TIMESTAMP"
  created_timestamp_column: "CREATE_TIMESTAMP"
  data_source_class_type: "feast.infra.offline_stores.snowflake_source.SnowflakeSource"
  snowflake_options {
    table: "IRIS"
    schema: "MY_SCHEMA"
    database: "MY_DATABASE"
  }
  project: "myproject"
}
data_sources {
  type: BATCH_SNOWFLAKE
  timestamp_field: "EVENT_TIMESTAMP"
  created_timestamp_column: "CREATE_TIMESTAMP"
  data_source_class_type: "feast.infra.offline_stores.snowflake_source.SnowflakeSource"
  snowflake_options {
    table: "IRIS"
    schema: "MY_SCHEMA"
    database: "MY_DATABASE"
  }
  project: "myproject"
}
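
Note that the three data_sources entries are identical and carry no name field. As a rough check (reusing registry_proto from the snippet above; the grouping key here is just illustrative):

from collections import Counter

# Key each serialized source by its Snowflake coordinates, since the protos
# above have no name field to tell them apart.
counts = Counter(
    (ds.snowflake_options.database, ds.snowflake_options.schema, ds.snowflake_options.table)
    for ds in registry_proto.data_sources
)
print(counts)  # Counter({('MY_DATABASE', 'MY_SCHEMA', 'IRIS'): 3}) after three applies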

Steps to reproduce

Here's the feature_store.yaml:

project: myproject
registry: s3://mybucket/registry.db
provider: local
offline_store:
  type: snowflake.offline
  account: myaccount
  user: myuser
  password: mypassword
  role: myrole
  warehouse: mywarehouse
  database: mydatabase
online_store:
  type: redis
  connection_string: myredisconnectionstring

And here's the feature definition feature_definitions/iris.py:

from datetime import timedelta
from pathlib import Path

from feast import Feature, FeatureView, SnowflakeSource, ValueType

name = Path(__file__).stem

iris_source = SnowflakeSource(
    schema="MY_SCHEMA",
    database="MY_DATABASE",
    table="IRIS",
    timestamp_field="EVENT_TIMESTAMP",
    created_timestamp_column="CREATE_TIMESTAMP"
)

iris = FeatureView(
    name=name.upper(),
    entities=[],
    ttl=timedelta(weeks=52),
    schema=[
        Feature(name="PETAL_LENGTHS", dtype=ValueType.FLOAT),
        Feature(name="PETAL_WIDTH", dtype=ValueType.FLOAT),
        Feature(name="SEPAL_LENGTH", dtype=ValueType.FLOAT),
        Feature(name="SEPAL_WIDTH", dtype=ValueType.FLOAT),
        Feature(name="SPECIES", dtype=ValueType.INT64),
    ],
    online=True,
    source=iris_source,
    description="A sample feature view containing the Iris dataset."
)

Specifications

  • Version: 0.2.0
  • Platform: AWS EC2 (Linux2 AMI)
  • Subsystem:


achals (Member) commented Apr 20, 2022

This is probably fixed by #2506. @felixwang9817, what do you think?

felixwang9817 self-assigned this Apr 20, 2022
felixwang9817 (Collaborator) commented

@achals actually the root cause isn't #2506; there's a bug with how we're handling data sources as FCOs. I'll send out a PR later today.

felixwang9817 (Collaborator) commented

Quick note here: the root cause was that when converting a SnowflakeSource to a proto, we didn't add the name field. This meant that the registry never deduplicated correctly, so it kept adding more protos.
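
A minimal sketch of the failure mode (illustrative only, not Feast's actual implementation; the helper below is hypothetical):

def apply_data_source(registry_proto, data_source, project):
    # Upsert a data source into the registry, keyed on (project, name).
    new_proto = data_source.to_proto()  # the bug: to_proto() never set new_proto.name
    for idx, existing in enumerate(registry_proto.data_sources):
        if existing.project == project and existing.name == data_source.name:
            registry_proto.data_sources[idx].CopyFrom(new_proto)  # update in place
            return
    registry_proto.data_sources.append(new_proto)  # no match found: append a new entry

Because the serialized protos all had name == "", the equality check against the name on the in-memory source could never match, so every feast apply appended another identical copy instead of updating the existing one.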

tkch3n (Author) commented Apr 27, 2022

> Quick note here: the root cause was that when converting a SnowflakeSource to a proto, we didn't add the name field. This meant that the registry never deduplicated correctly, so it kept adding more protos.

@felixwang9817 Thanks for the quick fix!
