should have a script that scrapes the bagnowka images and writes them as photos/photoUnits to BH dbs #132

Open
OriHoch opened this issue Mar 22, 2017 · 6 comments

OriHoch commented Mar 22, 2017

preconditions

reproduction steps

  • run the script (doesn't exist yet..):
(dbs-back) $ PYTHONPATH=. scripts/migrate_bagnowka_images.py
expected
  • all the bagnowka images are in the BH dbs, in the correct photos / photoUnits format
  • all the images are uploaded to google storage / S3 and available via public URLs
  • this allows us to use the images in our dbs website, as with all other websites
actual
  • no bagnowka images in our DB

related issues

@Libisch Libisch self-assigned this Mar 22, 2017

Libisch commented Mar 22, 2017

@OriHoch I'll work on the script to convert scraping output into DBS-API structure.

Libisch commented Mar 26, 2017

@OriHoch
I need to know if any fields are missing. Other than that, the script is nearly ready and returns this (example for one image):

[
    {
        "NegativeNumbers": "",
        "IsLandscape": "",
        "PrevPictureUnitsId": "",
        "ExhibitionIsPreview": "",
        "LocationInMuseum": "",
        "Pictures": [
            {
                "IsLandscape": "",
                "ForDisplay": "1",
                "ToScan": "0",
                "IsPreview": "0",
                "PictureTypeDesc": {
                    "En": "Picture",
                    "He": "תצלום"
                },
                "PicId": "",
                "PictureId": "95eb0e2ccf2a62ad46939994718ba150",
                "LocationCode": ""
            },
            {
                "PictureTypeDesc": {}
            }
        ],
        "ToScan": "||",
        "PeriodDateTypeDesc": {
            "En": "Year|",
            "He": "שנים|"
        },
        "related": [
            "palce_Latvia"
        ],
        "ExhibitionId": "",
        "UpdateDate": "",
        "OldPictureNumbers": "|||",
        "OldUnitId": "",
        "id": "",
        "UpdateUser": "BH Online",
        "PictureLocations": "",
        "UnitDisplayStatus": 3,
        "main_image_url": "full/58d8b7015e3883d243d4dd46e85fd6f74d575c41.jpg",
        "UnitPeriod": [
            {
                "PeriodDateTypeDesc": {
                    "En": "Year",
                    "He": "שנים"
                },
                "PeriodEndDate": "",
                "PeriodNum": "0",
                "PeriodTypeDesc": {
                    "En": "Period",
                    "He": "תקופת"
                },
                "PeriodDesc": {
                    "En": "1930",
                    "He": "1930"
                },
                "PeriodStartDate": "",
                "PeriodTypeCode": "",
                "PeriodDateTypeCode": ""
            },
            {
                "PeriodTypeDesc": {}
            }
        ],
        "PrevPicturePaths": "",
        "TS": "",
        "PrevPictureId": "",
        "PictureSources": "",
        "UnitPersonalities": [
            {
                "OrderBy": "1"
            }
        ],
        "UnitTypeDesc": "Photo",
        "PeriodStartDate": "",
        "Slug": {
            "En": "",
            "He": ""
        },
        "PeriodDateTypeCode": "1|",
        "RightsDesc": "Full",
        "OrderBy": "1|",
        "Bibiliography": {
            "En": "null",
            "He": ""
        },
        "UnitText1": {
            "En": "Today: Latvia, Pre 1914 Russia, pre 1795 Poland.\nCourtesy of www.bagnowka.pl ",
            "He": ""
        },
        "UnitText2": {
            "En": "",
            "He": ""
        },
        "UnitPlaces": [
            {
                "PlaceIds": ""
            }
        ],
        "PictureFileNames": "",
        "UnitStatus": 3,
        "PersonalityIds": "",
        "PeriodNum": "0|",
        "PeriodDesc": {
            "En": "1930",
            "He": "1930"
        },
        "UnitType": 1,
        "UnitHeaderDMSoundex": {
            "En": "",
            "He": ""
        },
        "PrevPictureFileNames": "",
        "RightsCode": 1,
        "PeriodTypeCode": "",
        "UnitId": "",
        "PicId": "",
        "IsValueUnit": "true",
        "StatusDesc": "Completed",
        "PreviewPics": [
            {
                "PrevPictureId": ""
            }
        ],
        "PicturePaths": "",
        "Attachments": [],
        "UserLexicon": "",
        "EditorRemarks": "",
        "Exhibitions": [],
        "ForDisplay": "|||",
        "DisplayStatusDesc": "Museum and Internet",
        "PeriodEndDate": "",
        "PictureTypeCodes": "",
        "PictureTypeDesc": {
            "En": "Picture|None|Picture|",
            "He": "תצלום - ש/ל|לא מוגדר|תצלום - ש/ל|"
        },
        "PIctureReceivedIds": "",
        "Header": {
            "En": "Auce, [None]. 1930"
        },
        "PeriodTypeDesc": {
            "En": "Period|",
            "He": "תקופת צילום|"
        },
        "thumbnail_url": "thumbs/small/95eb0e2ccf2a62ad46939994718ba150",
        "ForPreview": "false",
        "Resolutions": "1",
        "_id": "",
        "LocationCode": ""
    }
]
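One way to approach the missing-fields question is to compare the keys of a generated document against an existing photoUnit taken from the database. The sketch below is a minimal illustration, not the project's actual tooling: `missing_fields`, `reference_unit` and `scraped_unit` are hypothetical names, and the two documents are trimmed stand-ins rather than real records.

```python
def missing_fields(reference, candidate, prefix=""):
    """Return dotted paths of keys found in `reference` but absent from `candidate`."""
    missing = []
    for key, ref_value in reference.items():
        path = prefix + key
        if key not in candidate:
            missing.append(path)
        elif isinstance(ref_value, dict) and isinstance(candidate[key], dict):
            # Descend into nested objects such as {"En": ..., "He": ...}.
            missing.extend(missing_fields(ref_value, candidate[key], path + "."))
    return sorted(missing)


# Hypothetical, heavily trimmed documents for illustration only.
reference_unit = {
    "Header": {"En": "", "He": ""},
    "UnitText1": {"En": "", "He": ""},
    "UnitType": 1,
    "Slug": {"En": "", "He": ""},
}
scraped_unit = {
    "Header": {"En": "Auce, [None]. 1930"},
    "UnitText1": {"En": "Courtesy of www.bagnowka.pl", "He": ""},
    "UnitType": 1,
}

print(missing_fields(reference_unit, scraped_unit))
# → ['Header.He', 'Slug']
```

The check only looks at key presence, so empty strings such as `"UnitId": ""` still count as present; whether those values are acceptable is a separate question.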

Libisch commented Apr 26, 2017

@OriHoch
Done. The scraper and the added files are in dbs-bagnowka-scrape. I'm opening new issues for back and front to identify bagnowka items and point all image_urls to AWS.

@Libisch Libisch closed this as completed Apr 26, 2017

OriHoch commented May 3, 2017

@Libisch
A couple of things are missing (maybe they are done somewhere else):

  • where is the code that adds the bagnowka items to our DB?
  • how does this code run? where will it run? who will run it and when? (AKA deployment plan)

Libisch commented May 3, 2017

@OriHoch

  • Short instructions are here.
  • I've created a "dump_to_mongo.py" file but it dumps to local db:
import json
from pymongo import MongoClient

# Connects to a local MongoDB instance only.
client = MongoClient('localhost', 27017)
db = client['bhdata']
photoUnits = db['photoUnits']

# bagnowka_all.json maps each slug to its photoUnit document.
with open("bagnowka_all.json", "r") as f:
    data = json.load(f)

count = 0
for slug in data:
    # Placeholder for the missing Hebrew header.
    data[slug]["Header"]["He"] = "null"

    photoUnits.insert_one(data[slug])
    count += 1
    print("1 item was added to photoUnits")

print("{} items were inserted to photoUnits.".format(count))

I guess there's a better way to do this, please let me know how it should be handled and I'll do it as soon as I return.
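As one possible direction, the preparation of the documents can be separated from the database write, so the same code works against any collection it is handed. This is a sketch under assumptions, not the script that was eventually used: `prepare_units` and `dump_units` are hypothetical names, the `"bagnowka": "True"` marker is assumed from the verification query in the deployment plan, and `collection` is expected to be a pymongo `Collection` (only its `insert_many` method is used).

```python
import copy


def prepare_units(data):
    """Return a list of photoUnit documents ready for insertion.

    `data` is assumed to map slug -> document, as in bagnowka_all.json.
    """
    units = []
    for slug in sorted(data):
        unit = copy.deepcopy(data[slug])  # leave the loaded JSON untouched
        unit.setdefault("Header", {})["He"] = "null"  # placeholder Hebrew header
        unit["bagnowka"] = "True"  # assumed marker so the items can be found later
        units.append(unit)
    return units


def dump_units(collection, units, batch_size=500):
    """Insert documents in batches and return how many were sent."""
    inserted = 0
    for start in range(0, len(units), batch_size):
        batch = units[start:start + batch_size]
        collection.insert_many(batch)
        inserted += len(batch)
    return inserted
```

With pymongo installed, this could be called as `dump_units(MongoClient(host, port)[db_name]['photoUnits'], prepare_units(data))`. Batching replaces the per-item `insert_one` calls, which mostly matters when the target server is remote.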

@Libisch Libisch reopened this May 3, 2017

OriHoch commented May 4, 2017

deployment plan

  • ssh into relevant mongo server
    • gcloud compute ssh mongo1
  • download the latest version of the bagnowka_dump_to_mongo.py script
    • curl https://gist.githubusercontent.com/OriHoch/978c7a62323ba380b9d06df56721cabe/raw/ab6abbd5f5842320fd0831b87b7200f69d17c510/bagnowka_dump_to_mongo.py > bagnowka_dump_to_mongo.py
  • make the script executable
    • chmod +x bagnowka_dump_to_mongo.py
  • set environment variables for the mongo instance you want to use
    • export MONGO_DB=mojp-dev
    • the script defaults to mongo on localhost:27017; set MONGO_HOST and MONGO_PORT to change this
  • run the script
    • ./bagnowka_dump_to_mongo.py
  • verify that the items were loaded correctly to mongo
    • mongo $MONGO_DB
    • > db.photoUnits.find({"bagnowka": "True"}).count()
    • 1987
  • exit mongo1
  • run ensure_required_metadata script for photoUnits - to sync to elasticsearch
    • ssh into dev server:
      • gcloud compute ssh bhs-dev
      • sudo su -l bhs
      • cd api
      • . env/bin/activate
      • export PYTHONPATH=.
      • scripts/ensure_required_metadata.py --collection photoUnits --add
  • verify the items are in elasticsearch
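
The gist script itself is not reproduced in this thread. A minimal sketch of how such a script might read the environment variables named in the plan is shown below; `mongo_settings` is a hypothetical helper, and treating MONGO_DB as required (no default) is an assumption.

```python
import os


def mongo_settings(env=None):
    """Read connection settings, falling back to localhost:27017."""
    env = os.environ if env is None else env
    return {
        "host": env.get("MONGO_HOST", "localhost"),
        "port": int(env.get("MONGO_PORT", "27017")),
        "db": env["MONGO_DB"],  # assumption: required, raises KeyError if unset
    }


print(mongo_settings({"MONGO_DB": "mojp-dev"}))
# → {'host': 'localhost', 'port': 27017, 'db': 'mojp-dev'}
```

With pymongo, the result could be used as `MongoClient(s["host"], s["port"])[s["db"]]`, where `s` is the returned dictionary.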
