[BUG] SAR Model Fails When userCol and itemCol are of Integer Type #2274

Open
3 of 19 tasks
SGA-daiki-kimura opened this issue Aug 30, 2024 · 0 comments
SGA-daiki-kimura commented Aug 30, 2024

SynapseML version

1.0.5

System information

  • Language version (e.g. python 3.8, scala 2.12): Python 3.11.0rc1
  • Spark Version (e.g. 3.2.3): 3.5.0
  • Spark Platform (e.g. Synapse, Databricks): Databricks

Describe the problem

Expected Behavior:
The SAR model should accept userId and itemId columns of an integer type, as specified in the documentation.

Actual Behavior:
The SAR model works only when userId and itemId are cast to DoubleType. With LongType columns, fit raises the ClassCastException shown in the logs below. This contradicts the documentation, which states that userId should be within the integer value range.
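
The DoubleType workaround described above could be sketched as follows. This is a minimal sketch, not part of SynapseML: the helper names `columns_needing_double_cast` and `cast_ids_to_double` are hypothetical, and the default column names are taken from the reproduction code. Note that a double represents integers exactly only up to 2^53, so very large IDs would lose precision.

```python
# Spark SQL type names, as reported by DataFrame.dtypes, that denote integer columns.
INTEGER_TYPE_NAMES = {"tinyint", "smallint", "int", "bigint"}


def columns_needing_double_cast(dtypes, id_cols):
    """Return the ID columns whose type is an integer type.

    `dtypes` is a list of (column name, type name) pairs in the shape
    returned by DataFrame.dtypes, e.g. [("userId", "bigint"), ...].
    """
    return [
        name
        for name, dtype in dtypes
        if name in id_cols and dtype in INTEGER_TYPE_NAMES
    ]


def cast_ids_to_double(df, id_cols=("userId", "movieId")):
    """Hypothetical helper: cast integer ID columns of a Spark DataFrame to double."""
    for name in columns_needing_double_cast(df.dtypes, id_cols):
        df = df.withColumn(name, df[name].cast("double"))
    return df
```

Under these assumptions, calling `sar.fit(cast_ids_to_double(spark_df_ratings))` would pass double-typed IDs to the model, which is the configuration reported to work.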

Code to reproduce issue

import requests
import zipfile
import io
import pandas as pd
from pyspark.sql.types import DoubleType, LongType
from synapse.ml.recommendation import SAR

url = "http://files.grouplens.org/datasets/movielens/ml-25m.zip"
response = requests.get(url)

with zipfile.ZipFile(io.BytesIO(response.content)) as z:
    with z.open('ml-25m/ratings.csv') as csvfile:
        pdf_ratings = pd.read_csv(csvfile)

pdf_ratings["rating"] = 1.0
spark_df_ratings = spark.createDataFrame(pdf_ratings)

print("Before casting:")
spark_df_ratings.printSchema()

spark_df_ratings = spark_df_ratings.withColumn("userId", spark_df_ratings["userId"].cast(LongType()))
spark_df_ratings = spark_df_ratings.withColumn("movieId", spark_df_ratings["movieId"].cast(LongType()))
spark_df_ratings = spark_df_ratings.withColumn("rating", spark_df_ratings["rating"].cast(DoubleType()))
spark_df_ratings = spark_df_ratings.withColumn("timestamp", spark_df_ratings["timestamp"].cast(LongType()))

print("After casting:")
spark_df_ratings.printSchema()

sar = SAR(
    userCol="userId",
    itemCol="movieId",
    ratingCol="rating",
    timeCol="timestamp",
    implicitPrefs=True,
    activityTimeFormat="epoch"
)

model = sar.fit(spark_df_ratings)
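
A smaller reproduction that avoids downloading the MovieLens archive might look like the sketch below. It has not been run against SynapseML 1.0.5; the expectation that it fails with the same ClassCastException is an assumption based on the stack trace, which points at the column type rather than the data. The function names `make_ratings_rows` and `fit_sar_on_long_ids` are hypothetical, and the SAR parameters are copied from the code above.

```python
from datetime import datetime, timezone


def make_ratings_rows(n_users=3, n_items=4):
    """Build (userId, movieId, rating, timestamp) tuples with integer IDs."""
    base_ts = int(datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp())
    return [
        (user, item, 1.0, base_ts + user * n_items + item)
        for user in range(1, n_users + 1)
        for item in range(1, n_items + 1)
    ]


def fit_sar_on_long_ids(spark):
    """Fit SAR on LongType ID columns.

    Assumption: on affected versions this raises the ClassCastException
    shown in this report instead of returning a model.
    """
    # Imported here so the row builder above works without SynapseML installed.
    from synapse.ml.recommendation import SAR

    df = spark.createDataFrame(
        make_ratings_rows(),
        schema="userId long, movieId long, rating double, timestamp long",
    )
    sar = SAR(
        userCol="userId",
        itemCol="movieId",
        ratingCol="rating",
        timeCol="timestamp",
        implicitPrefs=True,
        activityTimeFormat="epoch",
    )
    return sar.fit(df)
```

On a Databricks notebook, `fit_sar_on_long_ids(spark)` would use the session that the platform provides.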

Other info / logs

Py4JJavaError: An error occurred while calling o629.fit.
: java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Double
    at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:116)
    at org.apache.spark.sql.Row.getDouble(Row.scala:275)
    at org.apache.spark.sql.Row.getDouble$(Row.scala:275)
    at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:28)
    at com.microsoft.azure.synapse.ml.recommendation.SAR.calculateUserItemAffinities(SAR.scala:99)
    at com.microsoft.azure.synapse.ml.recommendation.SAR.$anonfun$fit$1(SAR.scala:69)
    at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logVerb(SynapseMLLogging.scala:163)
    at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logVerb$(SynapseMLLogging.scala:160)
    at com.microsoft.azure.synapse.ml.recommendation.SAR.logVerb(SAR.scala:36)
    at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logFit(SynapseMLLogging.scala:153)
    at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logFit$(SynapseMLLogging.scala:152)
    at com.microsoft.azure.synapse.ml.recommendation.SAR.logFit(SAR.scala:36)
    at com.microsoft.azure.synapse.ml.recommendation.SAR.fit(SAR.scala:75)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
    at py4j.Gateway.invoke(Gateway.java:306)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:199)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:119)
    at java.lang.Thread.run(Thread.java:750)

What component(s) does this bug affect?

  • area/cognitive: Cognitive project
  • area/core: Core project
  • area/deep-learning: DeepLearning project
  • area/lightgbm: Lightgbm project
  • area/opencv: Opencv project
  • area/vw: VW project
  • area/website: Website
  • area/build: Project build system
  • area/notebooks: Samples under notebooks folder
  • area/docker: Docker usage
  • area/models: models related issue

What language(s) does this bug affect?

  • language/scala: Scala source code
  • language/python: Pyspark APIs
  • language/r: R APIs
  • language/csharp: .NET APIs
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/synapse: Azure Synapse integrations
  • integrations/azureml: Azure ML integrations
  • integrations/databricks: Databricks integrations

SGA-daiki-kimura changed the title from "[BUG]" to "[BUG] SAR Model Fails When userCol and itemCol are of Integer Type" on Aug 30, 2024
dciborow self-assigned this on Sep 7, 2024
dciborow added a commit that referenced this issue on Sep 7, 2024
Fixes #2274

Update SAR model to accept `userId` and `itemId` as integer types (`LongType`).

* **SAR.scala**
  - Update `calculateUserItemAffinities` method to handle `userId` and `itemId` as `LongType`.
  - Update `calculateItemItemSimilarity` method to handle `userId` and `itemId` as `LongType`.

* **test_ranking.py**
  - Add test case `test_adapter_evaluator_sar_with_long` to verify `userId` and `itemId` as `LongType`.

* **Smart Adaptive Recommendations.md**
  - Update documentation to reflect that `userId` and `itemId` can be of `LongType`.

---

For more details, open the [Copilot Workspace session](https://copilot-workspace.githubnext.com/microsoft/SynapseML/issues/2274?shareId=XXXX-XXXX-XXXX-XXXX).