Thredup implemented Cloudflare to block all bots. I may look into libraries to bypass it, but for now this project has been tabled.
Thredup Scraper API is a command-line, Python-based web scraper that uses Beautiful Soup to extract clothing information into a CSV file. Later, the project will be migrated to a back-end framework to serve as an API.
There are two ways to reduce your carbon footprint when it comes to clothing:
- buying used whenever possible
- choosing natural fabrics (wool, silk, cotton, etc.) over man-made fabrics
This project is an attempt to combine the two ways by scraping for sustainable fabrics from the world's largest online consignment store.
But clothing is environmentally damaging even AFTER it's been purchased. For example, washing polyester or any other plastic-based clothing in a washing machine releases microplastics that end up in the ocean. Of course, the environmental damage also depends on the company, the manufacturer, how the materials are processed (e.g., recycled polyester), quality, etc. In addition, non-natural fibers are less comfortable and less breathable, and they fall apart more quickly than stronger fabrics made of linen, wool, silk, etc.
The basic rule to follow is: used natural-fabric clothing > new clothing, for the following reasons:
- less environmental damage
- less waste (billions of dollars' worth of usable clothing is thrown away each year)
- less shipping involved. New items are shipped from location to location to create the final product, whereas a used item is shipped once to a new owner
- more available styles since vintage pieces are hard to buy new
- many more benefits that you can read here:
This program makes the following distinctions between good and bad fabrics:
Good fabrics: | Bad fabrics: |
---|---|
cotton | polyester |
silk | polyamide |
wool | acrylic |
merino wool | fabric not found* |
alpaca | No Fabric Content* |
linen | |
hemp | |
bamboo | |
tencel | |
*many items on the site don't have fabric information, so we will assume worst case scenario
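The table above can be expressed as a small lookup. This is a minimal sketch, not the project's actual code: the function name `classify_materials` and the "unknown" result for fabrics that appear in neither column (for example viscose) are assumptions made for illustration.

```python
# Fabric keywords taken from the table above, lowercased for matching.
GOOD_FABRICS = {"cotton", "silk", "wool", "merino wool", "alpaca",
                "linen", "hemp", "bamboo", "tencel"}
BAD_FABRICS = {"polyester", "polyamide", "acrylic",
               "fabric not found", "no fabric content"}


def classify_materials(materials):
    """Return "bad", "good", or "unknown" for a materials string.

    Missing fabric information is treated as the worst case ("bad"),
    and any bad keyword outweighs a good one in a blend.
    """
    text = (materials or "").strip().lower()
    if not text:
        return "bad"
    if any(word in text for word in BAD_FABRICS):
        return "bad"
    if any(word in text for word in GOOD_FABRICS):
        return "good"
    return "unknown"  # assumption: fabrics outside the table are left undecided
```

A blend such as "60% Cotton, 40% Polyester" is classified as bad under this sketch, because a single banned keyword is enough to reject the item.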
- Clone the project by running
git clone https://github.com/tas09009/thredup-scraper-api.git
in your terminal, or fork the project in order to contribute later (see Contributing below).
- Set up your Python virtual environment by running
python3 -m venv venv
in that directory, then run
source venv/bin/activate
to activate it. Alternatively, create a conda environment.
- Install the Python requirements with
pip install -r requirements.txt
Run the program by typing
python code/thredup_fullscrape.py
The terminal will then ask for the following three inputs:
- url of thredup
- number of pages to be scraped
- file name and location to save csv output
Beautiful Soup pulls all product links from a search page (50 per page) and then parses each product link to pull the following information:
Information to be extracted | Function | Example |
---|---|---|
Link | url of each item on a page | Item_Link |
Category Type | clothing type | Tops |
Image_Link | front picture of item | Picture_Link |
Description | distinct features | 'Crew neckline', 'Color blocked detail', 'Long sleeve', 'Blue' |
Materials | fabric content and its percentage | 100% Cotton |
Size | item size | Size XS |
Measurement | measurements depend on item itself | 28" Chest, 22" Length |
Price | price | 3.99 |
Brand | name brand | Tommy Hilfiger |
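As an illustration of how the "Materials" column could be turned into structured data, the sketch below splits a string such as "100% Cotton" into fabric names and percentages. The function name `parse_materials` and the regular expression are assumptions for this example, not part of the scraper itself.

```python
import re

# A percentage followed by a fabric name, e.g. "60% Cotton".
_MATERIAL_RE = re.compile(r"(\d+(?:\.\d+)?)\s*%\s*([A-Za-z][A-Za-z \-]*)")


def parse_materials(text):
    """Return a list of (fabric, percentage) pairs found in the text.

    Strings without any percentage (for example "No Fabric Content")
    produce an empty list.
    """
    return [(name.strip().lower(), float(pct))
            for pct, name in _MATERIAL_RE.findall(text)]
```

Returning an empty list for items without fabric details keeps the "assume the worst case" decision in one place, wherever the parsed result is later classified.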
Picture of what the data export looks like. You can also look in "datasets/test_runs" to see more CSV examples.
FYIs:
- This project uses neither rotating proxies nor HTTP headers, due to time and money constraints. Instead, the code waits 5-10 seconds before each request.
- Scraping one page (50 items) takes 6 to 8 minutes.
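The delay described above could look roughly like the sketch below. The helper names `polite_delay` and `estimate_minutes` are hypothetical, and the `sleep` parameter exists only so the wait can be replaced in tests.

```python
import random
import time


def polite_delay(min_seconds=5.0, max_seconds=10.0, sleep=time.sleep):
    """Wait a random time between min_seconds and max_seconds; return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def estimate_minutes(items, min_seconds=5.0, max_seconds=10.0):
    """Return the (lowest, highest) total waiting time in minutes."""
    return items * min_seconds / 60, items * max_seconds / 60
```

For 50 items the waiting alone adds up to roughly 4 to 8 minutes, which is in line with the 6 to 8 minutes per page noted above once request and parsing time are included.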
This project contains other libraries/python programs separate from the database project within the '/code/additional_modules' directory. They are:
Scrapes a given number of items within a search page to filter out clothing by the following "Materials":
- Polyester
- Acrylic
- Fabric details not available
- No Fabric Content

All results (that don't contain the forbidden words) are opened in a new tab for viewing.
Usage
- input: url of current page
- output: new Chrome tabs open one by one, showing only items whose fabrics don't contain any of the banned words, with a 3-second delay per tab
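A minimal sketch of this filter-then-open behaviour, using only the standard library, is shown below. The function names and the `(link, materials)` input shape are assumptions; note also that `webbrowser.open_new_tab` uses the system's default browser, which may or may not be Chrome.

```python
import time
import webbrowser

# The banned "Materials" values listed above, lowercased for matching.
BANNED_WORDS = ("polyester", "acrylic",
                "fabric details not available", "no fabric content")


def allowed_links(items, banned=BANNED_WORDS):
    """Keep the links whose materials text contains no banned word.

    items is an iterable of (link, materials) pairs.
    """
    return [link for link, materials in items
            if not any(word in materials.lower() for word in banned)]


def open_in_tabs(links, delay_seconds=3,
                 opener=webbrowser.open_new_tab, sleep=time.sleep):
    """Open each link in a new browser tab, pausing between tabs."""
    for link in links:
        opener(link)
        sleep(delay_seconds)
```

Passing `opener` and `sleep` as parameters is a design choice for this sketch only: it allows the function to be exercised without actually launching a browser.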
Removes all "sold" items from favorites list
Usage
- input: url of "favorites" page
- output: CLI notifying when items have been removed. Refresh page to see updates.
More to be added
Please follow along with this excellent step-by-step guide to learn how to contribute to an open-source project.
The web scraping code can be made more efficient, for example by scraping multiple elements from one CSS tag rather than the whole page. Right now, it's been built to work. The code will be updated at a later time. See the Projects Board for the latest status.
The following sections include further research, plans, etc.
- Future Goals
- Questions to answer
- Website inconsistencies
- Lessons Learned
- Clothing Sustainability Issues & Ideas
- Resources
Make it as easy as possible to buy second hand clothing
- Python library to include for scraping:
- thredup
- poshmark
- ebay
- heroine
- etsy
- The Real Real (luxury)
- Vestiaire Collective (luxury)
- local thrift stores: how to get them online?
- expand to men's clothing. Ex: grailed
- include a WHY section
- If a company has a store (ex: amour vert, reformation, etc.) then try on their clothes and remember their sizes
- order an item or two from them, then buy the used version online
- clothing websites should have a "used section" where you can sell items back to them. Eileen Fisher now has this
- what percentage of clothing is considered "environmentally damaging" i.e. made of "banned" products
- how many items are correctly sorted in their category?
- Ex: clicked on casual dresses and many formal work dresses showed up
- how many items are missing categories such as "accents" and "pattern"
- how many have a tag such as "3/4 sleeve" but don't belong to any category
- sizes vary per clothing item
- Ex: size 00 and 0 for top but 2 for bottoms. But website cannot differentiate
- data may need to be cleaned up prior to putting into database?
- links will need to be made beforehand
- Machine Learning
- Classify sweaters as actual sweaters?
- Pick clothing based on fashion styles. Ex: boho, chic, grunge, etc.
- where does Viscose actually fall into place?
- Some items sold are using 'recycled polyester' such as this Eileen Fisher Trenchcoat
- how much of the clothing is fast fashion? obviously only in the petite category
- Other thredup projects:
- Thredup: a project to extract data from the website and do statistical calculations on it
- Thredup-Cart-Refresher: refreshes items inside the Thredup account's cart
- WebCrawler-ThredUp: "I created this web crawler to scrape data from ThredUp products into a database"
- build a seasonal wardrobe with 5 items under $100 or $200? Use Vetta for ideas
- limited filters within the "petite" category such as
- not able to search by fabrics
- Ex: linen/cotton combination
- Ex: 100% wool
- not able to filter out fabrics
- Ex: no polyester
- Ex: no polyester or acrylic
- shop by style on thredup's website. All the links go to the same link for all womens clothing
- Search shorts: rompers are also displayed and identified as dresses
- two tops are exactly the same but have different descriptions. Here and here
- when jumping between different categories, the "sort by" method changes to "Recently Discounted" by default
- Only product filters all clothing items have in common are:
- color
- pattern
- accents
- This project will be helpful to only those who are petite but eventually should expand to the others as well
- Catch microplastics in washing machine (if you have to buy polyester) with:
- Express casual pants - amour vert knockoff
- Thredup's classes, ids, and div tags all have unintuitive names. Other websites' labels make much more sense
- thredup.com/robots.txt
- Tutorial: Web Scraping and BeautifulSoup exactly what I'm doing
- Integrate IP addresses Web scraping with Python - 3 medium articles
- robots.txt: rules of scraping such as frequency and specific pages
- Thredup doesn't have an API, not for Python at least
- Thredup should have a database of the top 10 brands and their measurements and it should automatically pull from that when a brand is matched
- item picture - high resolution only
- website link
- very difficult to pull, none of the links would appear. Realized that the search results display in order of "Recently Discounted" with no account login. As opposed to how I was searching "Newest First" with account logged in
- Organized by Newest First, which makes re-running the code much easier: the database can be updated by web scraping until the first item already in the database is found
- All item details
- Description: dictionary with 6 to 8 keywords. These are values only. Need keys from search results link (left column). All values match a key to the columns on the left
- Pull all keys from the columns first, then match their values based on the item description
- Search by petite first, then sort. Rather than search all and then filter by petite. In the case of searching by a specific fabric (Ex: 100% merino wool), it's easier to search within the petite clothing and then filter out by fabric.
- difficult to loop through different clothing types and multiple pages within a clothing type. Easier for now to search for one clothing type at a time.
- filter out clothing by fabrics (polyester, polyamide, etc.)
- second layer of filter for rayon, nylon, etc.
- sort out clothing specifically by fabrics (wool, linen, etc.)
- importing functions: caused circular dependencies
- Can't use the Beautiful Soup HTML parser for Thredup because I cannot extract all the hrefs from the site for all the items. I have no idea where they are then!
- XML will be the way to go, all items are in a grid with the 2nd to last number increasing for each item.
- add item per row, rather than at the end of the list: writing each row individually would require too much memory and time compared with writing 50 rows at a time
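The note about writing 50 rows at a time can be sketched with the standard `csv` module. This is an illustration under assumptions, not the project's code: the function name `write_in_batches` and its parameters are made up for the example.

```python
import csv


def write_in_batches(rows, path, fieldnames, batch_size=50):
    """Write dict rows to a CSV file one batch at a time.

    Returns the number of rows written. Only one batch is held in
    memory, so rows can come from a generator that scrapes lazily.
    """
    written = 0
    batch = []
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            batch.append(row)
            if len(batch) == batch_size:
                writer.writerows(batch)
                written += len(batch)
                batch.clear()
        if batch:  # write whatever is left after the last full batch
            writer.writerows(batch)
            written += len(batch)
    return written
```

With a batch size of 50, one call to `writerows` corresponds to one search page of results.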
- detailed email: all that is inconvenient + link to my thredup library. Would love to recommend the website to friends and others if these issues are fixed!
- sizing and fabrics are usually incorrect, which is problematic since my main filter is by fabric. I have returned a few items in the past but there were some I later saw the discrepancy and it was too late to return
- retail value incorrect. One blog mentioned this
- need more feedback loops from customers
- When I switch to Petite, all filters are reset
- Cannot search by material
- cannot search by eco-friendly materials either
- email: not recommendations based on style and fabrics
- Ask for access to API - read the docs
- read their engineering blog
- ML to create goody box
- don't like how I don't know what I'm getting
- display items within hours
- choose what you like, or get similar recommendations
- Suggested item: red turtleneck sweater. Suggested alternatives: different color, fabric, mockneck, etc.
- thredup monthly renting? similar to rent the runway? already do Goody Boxes and Rescues
- How happy are people with the Goody Boxes and Rescues? Online research
- Thredup must have this data in their yearly report?
- see ML info above
- ML to take monthly feedback to learn how to improve next time
- better sizing
- style (if people preferred blazers over sweaters)
- create an ML model and test it (buy all the items as a "goody box" bundle)
- return and give feedback to ML model. Test again with another order
- verify: sizing, color matching to original picture, fabric, original retail price estimation, cut accuracy
- Blog post about using Goody Box. Send her two shipments and give some feedback to ML model?
- Huge Rescue Mystery Box prefers Free People. Preferences can be prioritized?
- competitor to Prime wardrobe but better
- possibility of adding men's clothing
- phase out bad fabrics. Over time when people donate and their items are logged into their accounts, a warning sign should come up saying we will not take polyester after this point
- get local thrift stores online as well. With thredup's help?
- Blog post: 11 ETHICAL OR SUSTAINABLE CLOTHING BRANDS LIKE EVERLANE is wildly inaccurate. write a comment on the site
- Bad: Everlane, Madewell, Uniqlo: misrepresenting what "ethical" fashion means. The Good On You website proves this
- Blog: IS EVERLANE ETHICAL? WHY GOOD ON YOU’S RATING IS NOT QUITE ACCURATE - oh crap. Who am I supposed to trust? This is getting exhausting
- Good: Reformation
- Don't know: rest of the brands Research
- Blog post: 6 MUST-WATCH DOCUMENTARIES TO LEARN ABOUT SUSTAINABLE FASHION. True Cost mentions Uniqlo as "fast fashion", which it is. But the blog post above recommends it as ethical and sustainable clothing
- What are the best channels to reach out to Thredup?
- How trustworthy is Good On You? Are there other websites that have more brand ratings? Tried looking up several brands on Good On You and they weren't listed
- What if a local thrift store scanned the clothing tag, and it matched with the brand and page of the item with the picture?
- Build a database from web-scraping and save all important info:
- picture
- link to item
- details: material
- 4 Fabrics That Are Harming Our Planet + What To Look For Instead
- The Most Harmful Fabrics in Fashion (and A Personal Challenge)
- banned fabric keywords: Polyester, Polyamide, Acrylic, No Fabric Content
- Next level to block: nylon, rayon, viscose
- Good fabrics: organic cotton, wool, silk, hemp, linen, cupro, ramie, tencel (used only)
- Iffy fabrics:
- modal
- Good: closed-loop system, fewer harmful byproducts
- Bad: semi-synthetic
- tencel
- Good: closed-loop process. Depending on chemicals - biodegradable
- Bad: man-made fabric. Heavy use of chemicals
- Acetate and triacetate
- Good: wood pulp
- Bad: man-made fibre
- Web scraping with Python — A to Z
- Automatic ticket classification - thredup automated tickets
- robots.txt doesn't seem to mind scraping petite items. No crawl rate mentioned either
- Use baserow as an online database to host the extension?