Visual Question Generation task adapted from Microsoft ->

Work began -> Summer 2018 (at the time, the code and exact technical specs for the task were unavailable)

  • Danish Farid (currently @ U Waterloo) - LUMS (SBASSE)
  • Rabeez Riaz (currently @ McKinsey & Co.) - LUMS (SBASSE)

Work conducted:

1. Downloading dataset from ->
    - Custom script used to skip images behind dead links (each link checked with an HTTP GET request before downloading)
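
A minimal sketch of such a dead-link check, using only the Python standard library. The original script is not reproduced here; the names `is_alive` and `live_rows`, the `(image_url, question)` row layout, and the "HTTP 200 means alive" rule are all assumptions made for illustration.

```python
import urllib.error
import urllib.request


def is_alive(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if a GET request to `url` answers with HTTP 200.

    `opener` is injectable so the check can be exercised without network access.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, ValueError, OSError):
        # Dead host, HTTP error status, malformed URL, or timeout.
        return False


def live_rows(rows, check=is_alive):
    """Keep only (image_url, question) rows whose image link still resolves."""
    return [row for row in rows if check(row[0])]
```

Filtering the rows before downloading means that later steps only ever see questions whose image is actually available.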

2. Data Cleaning along standard methods: 
    - Running a final check for rows with incorrect question data -> found a few rows without correct questions
    - Removing quotation marks and excess formatting
    - Normalizing letter-case
    - Adding start and end tokens
    - Building a vocabulary dictionary for [word -> integer] mapping (numerical data)
    - Converting every list of strings to a list of integers using the vocab dictionary 
    - Converting raw (variable length) sentences into constant length slices - (length determined by the max length of the sentences)
        - An RNN is trained on a single sample at multiple time slices 
        - so we slowly "revealed" our training example sentences one token at a time, padding the 
        remaining positions with Stop Tokens:
        Example dataset
        Max Length: 5 (including start token); each row below holds 6 tokens so that the full sentence still ends in a stop token
        __image: img_1.jpeg (one image)__ 
            "What is that man?"
        After (all tokens replaced with their integer representation):
            0/ <START> <STOP> <STOP> <STOP> <STOP> <STOP>
            1/ <START> what <STOP> <STOP> <STOP> <STOP>
            2/ <START> what is <STOP> <STOP> <STOP>
            3/ <START> what is that <STOP> <STOP>
            4/ <START> what is that man <STOP>
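
The cleaning and "reveal" steps above can be sketched roughly as follows. This is an illustration rather than the notebook code: the helper names (`clean`, `build_vocab`, `reveal`), the punctuation-stripping regex, and the choice of integer ids 0 and 1 for the start and stop tokens are assumptions.

```python
import re

START, STOP = "<START>", "<STOP>"


def clean(question):
    """Lower-case the question, drop quotation marks and punctuation, split into tokens."""
    text = re.sub(r"[^a-z0-9\s]", " ", question.lower())
    return text.split()


def build_vocab(token_lists):
    """Map every word to an integer; the special tokens get the first two ids."""
    vocab = {START: 0, STOP: 1}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def reveal(tokens, max_len):
    """Yield (row, next_token) pairs, revealing one more token per row.

    `max_len` counts the start token, and every row is padded with stop
    tokens to max_len + 1 positions, as in the example above.
    """
    tokens = tokens[: max_len - 1]
    width = max_len + 1
    for k in range(len(tokens) + 1):
        row = [START] + tokens[:k]
        row += [STOP] * (width - len(row))
        target = tokens[k] if k < len(tokens) else STOP
        yield row, target


tokens = clean('"What is that man?"')
vocab = build_vocab([tokens])
# Integer-encoded rows paired with the integer id of the next word.
encoded = [([vocab[t] for t in row], vocab[target])
           for row, target in reveal(tokens, max_len=5)]
```

Each row is paired with the token that follows the revealed prefix, which corresponds to the "next word" target described in the model definition.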

3. Rebuilding the Microsoft model
    - Piecing together the Microsoft VQG model from the high-level description in the MS paper, using the Keras 
       documentation to find ways to implement its features
    - Tried several methods for feeding auxiliary picture data to the model
        -> including embedding the data into each row of the image data
    - Found a way to include the data as auxiliary information

4. Model definition 
    Overall model hierarchy
        1. Image and Sentence information 
        2. VGG-16 with its last layer removed (INPUT: image) - (OUTPUT: feature vector of length 500)
        3. RNN layer (LSTM/GRU units) -> 
            (X data: sentence pushed through an embedding), (Y data: next word), (Side information: picture features)
    For every training sample:
    -> The image is read and run through a VGG-16 with its last layer removed, so every image yields 
    a high-level feature vector (length = 500). 
    -> This feature vector is passed to the RNN as "side information" and is the value that gets recursively modified on each recurrent step
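
One way to read "the recursively modified data value" is that the image feature vector serves as the initial hidden state of the RNN. The sketch below illustrates that reading with a plain tanh recurrence in pure Python. It is not the Keras LSTM/GRU model from the notebook; the function names, weight matrices, and the tiny vector sizes (3 instead of 500) are assumptions chosen for illustration.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_step(hidden, token_embedding, w_hidden, w_input):
    """One recurrent step: combine the previous state with the current token."""
    from_hidden = matvec(w_hidden, hidden)
    from_input = matvec(w_input, token_embedding)
    return [math.tanh(a + b) for a, b in zip(from_hidden, from_input)]


def encode_prefix(image_features, embedded_prefix, w_hidden, w_input):
    """Start from the image features and update the state once per revealed token."""
    hidden = list(image_features)  # side information used as the initial state
    for token_embedding in embedded_prefix:
        hidden = rnn_step(hidden, token_embedding, w_hidden, w_input)
    return hidden


# Toy sizes: 3-dimensional image features and 2-dimensional word embeddings.
features = [0.0, 0.5, -0.5]
w_hidden = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w_input = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
state = encode_prefix(features, [[1.0, 2.0]], w_hidden, w_input)
```

In a full model, the final state would feed a softmax layer over the vocabulary to predict the next word.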
5. Testing
    -> Model trained on a 1080Ti for 25 epochs -> details and results in notebook
    -> Other details available in all reports

6. Other work conducted
    -> We noticed that questions are not fully open-ended sentences, so there should be a strong pattern to exploit (an expected form for questions). A question starting with the word "Watermelon" is unlikely, so the model need not account for it. 

    A word map in the report -> {} visualizes our question vocabulary and shows that the first position of a question was filled by only a few different words. 

    We tried to exploit this property by also providing a 1-dim "prevalence" vector to our model as auxiliary information, expecting that this might speed up training and guide the optimization algorithm to a better minimum sooner.

    Due to limited compute resources we could not test this extensively but, after a few tests, our models reached a lower loss value either faster (approx. 15% sooner) or at the same rate as without this "prior" vector, depending on how the prior information was provided.

7. Possible future work -> 
    -> A heavy emphasis on this 'prior' dataset information could speed up models trained on a particular type of general data, for instance "questions" as a subset or 'type' of sentence.
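
As a rough illustration of the "prior" idea, the sketch below builds a one-dimensional vector holding the relative frequency of each vocabulary word at the first position of a question. How the original vector was constructed is not specified in this README, so the function name `first_word_prior` and the choice of first-word relative frequency are assumptions.

```python
from collections import Counter


def first_word_prior(questions, vocab):
    """Return a list the size of `vocab` with each word's first-position frequency."""
    counts = Counter(tokens[0] for tokens in questions if tokens)
    total = sum(counts.values())
    prior = [0.0] * len(vocab)
    if total == 0:
        return prior
    for word, count in counts.items():
        if word in vocab:
            prior[vocab[word]] = count / total
    return prior


questions = [["what", "is"], ["what", "color"], ["is", "it"], ["how", "many"]]
vocab = {"what": 0, "is": 1, "how": 2, "color": 3, "it": 4, "many": 5}
prior = first_word_prior(questions, vocab)
```

Words that never open a question (such as "color" here) get a prior of zero, which is the kind of signal the "Watermelon" example points at.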