A multi-stage, hierarchical, transformer-based encoder-decoder model that sequentially predicts the ingredients of a recipe, given an image and a title. It introduces a co-attention module and a batch contrastive triplet loss to maximize cross-modal fusion.
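For intuition, here is a rough sketch of a batch contrastive triplet loss between image and recipe-text embeddings. The function name, margin value, and batch-negative scheme are assumptions for illustration, not the exact implementation used here:

```python
import torch
import torch.nn.functional as F

def batch_triplet_loss(img_emb, txt_emb, margin=0.3):
    """Sketch: each image's paired text is its positive; every other
    text in the batch serves as a negative (and vice versa)."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    sim = img_emb @ txt_emb.t()              # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)            # matched image-text pairs
    # Hinge on every in-batch negative, for both retrieval directions.
    mask = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    loss_i2t = F.relu(margin - pos + sim)[mask]        # image -> text
    loss_t2i = F.relu(margin - pos.t() + sim)[mask]    # text -> image
    return (loss_i2t.mean() + loss_t2i.mean()) / 2

# Example with random embeddings for a batch of 8 image-recipe pairs.
loss = batch_triplet_loss(torch.randn(8, 512), torch.randn(8, 512))
```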
- Create a folder `recipe1M_layers` outside the codebase folder
- Add the data (`val`/`test`) folder outside the codebase folder (see the layout sketch after this list)
- Download and add the `cleaned_ingredients.json` and `cleaned_layers.json` files inside the `recipe1M_layers` folder
- Run and use the dataloader as before: the input tuple has been expanded to load the `title`, `ingredients`, and `instructions` as text along with the other fields already present (see the usage sketch below)
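A possible layout, assuming the repository lives in a folder called `codebase` and the data folder is named `data` (both names are placeholders for whatever you use):

```
parent_folder/
├── codebase/             # this repository
├── data/                 # val/test data; folder name is a placeholder
│   ├── val/
│   └── test/
└── recipe1M_layers/
    ├── cleaned_ingredients.json
    └── cleaned_layers.json
```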
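A minimal sketch of consuming the expanded tuple. `get_loader` is a hypothetical stand-in for however the dataloader is constructed in the codebase, and the position of the pre-existing fields is an assumption; only the three new text fields are documented above:

```python
# Hypothetical usage of the expanded dataloader output. Construct the
# loader exactly as before; `get_loader` is a stand-in name, and the
# ordering of the pre-existing fields is an assumption.
loader = get_loader(partition="val", batch_size=32)  # hypothetical helper

for *existing_fields, title, ingredients, instructions in loader:
    # The three new fields arrive as raw text alongside the fields
    # (image tensors, labels, etc.) that were already present.
    print(title[0])         # recipe title
    print(ingredients[0])   # ingredient list as text
    print(instructions[0])  # instruction steps as text
```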
Files can be found here: https://drive.google.com/drive/folders/14brtR12WlZ8fqvRttcv43wXkOfusSVUo?usp=sharing

Link to the Base Paper's GitHub page: https://github.com/torralba-lab/im2recipe-Pytorch

Base paper link: http://pic2recipe.csail.mit.edu/