We currently use the code in predict_hf.py, which loads the full model and quantizes it at load time. This takes about 6 minutes.
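For reference, load-time 4-bit quantization with Hugging Face transformers + bitsandbytes typically looks like the sketch below. The contents of predict_hf.py are not shown in this issue, so the model path and quantization settings here are assumptions, not the actual code:

```python
# Hedged sketch of load-time 4-bit quantization; predict_hf.py is not shown
# in this issue, so the model path and settings below are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig


def load_quantized(model_path: str = "./base_models/llama-2-7b-chat-hf"):
    """Load the full-precision checkpoint and quantize it to 4-bit on the fly.

    Reading the fp16 weights from disk and quantizing them at load time is
    the slow step (about 6 minutes in our workspace).
    """
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # quantize weights to 4-bit
        bnb_4bit_quant_type="nf4",              # assumed quantization type
        bnb_4bit_compute_dtype=torch.bfloat16,  # assumed compute dtype
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        quantization_config=bnb_config,
        device_map="auto",
    )
    return model, tokenizer
```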
If we save a copy of the already-quantized model under a new name, e.g. llama-2-7b-chat-hf-4bit, the loading time should decrease significantly.
The easiest way to do this is probably a notebook that runs in the workspace, along these lines:
```python
from azureml.core import Workspace, Datastore

new_name = "llama-2-7b-chat-hf-4bit"
model = ...  # load and quantize the model as in the predict_hf.py code above

# save it locally
model.save_pretrained(f"./models/{new_name}")

# double-check that the quantized model can be loaded too ...

# If all goes well, upload to blob storage:
workspace = Workspace.from_config()
ds = workspace.get_default_datastore()
ds.upload(f"./models/{new_name}", f"./base_models/{new_name}", show_progress=True, overwrite=True)

# verify the model can be loaded from blob storage by submitting a new
# prediction job with the new model. See README.md
```
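The "double check" step could look like the reload sketch below. This assumes a transformers/bitsandbytes version recent enough to serialize 4-bit weights with save_pretrained; the path mirrors the save step above:

```python
# Hedged sketch: reload the locally saved 4-bit checkpoint before uploading.
from transformers import AutoModelForCausalLM


def load_saved_4bit(new_name: str = "llama-2-7b-chat-hf-4bit"):
    # The quantization settings are stored in the checkpoint's config.json,
    # so no BitsAndBytesConfig should be needed on reload. This relies on a
    # transformers/bitsandbytes version that supports 4-bit serialization.
    return AutoModelForCausalLM.from_pretrained(
        f"./models/{new_name}",
        device_map="auto",
    )
```

Reloading pre-quantized weights skips the read-fp16-then-quantize step, which is where the expected speedup comes from.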
For context, we currently load the model from blob storage in the ML workspace: https://autoraml3241530052.blob.core.windows.net/azureml-blobstore-b7ef477b-ca4a-44e3-a029-0e0542bdcd47/base_models/llama-2-7b-chat-hf/