In this project, I create a scraper that builds a CSV file containing data for 10,000 houses from every region of Belgium, scraped from Immoweb. The project uses an OOP approach, the asyncio module for asynchronous link processing, and BeautifulSoup for extracting data from the HTML pages.
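To illustrate the asynchronous approach, here is a minimal sketch of how links can be processed concurrently with asyncio. The `fetch_house` stub stands in for a real HTTP request plus BeautifulSoup parsing; the function names and the concurrency limit are illustrative assumptions, not the project's actual code.

```python
import asyncio

async def fetch_house(url: str) -> dict:
    # Placeholder for a real HTTP request and BeautifulSoup parsing;
    # here we only simulate the I/O wait.
    await asyncio.sleep(0)
    return {"url": url, "price": None}

async def scrape_all(urls: list[str], limit: int = 10) -> list[dict]:
    # A semaphore caps the number of concurrent requests so the
    # target site is not overloaded.
    sem = asyncio.Semaphore(limit)

    async def bounded(url: str) -> dict:
        async with sem:
            return await fetch_house(url)

    # gather preserves input order in its result list
    return await asyncio.gather(*(bounded(u) for u in urls))

urls = [f"https://example.com/house/{i}" for i in range(5)]
results = asyncio.run(scrape_all(urls))
print(len(results))
```

In the real scraper, `fetch_house` would download a listing page and pull fields out of the parsed HTML.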
```
.
├── data/
│   ├── links/
│   │   └── houselinks_for_postcode.json
│   └── raw_data_houses.json
├── scraper/
│   └── scraper.py
├── src/
│   ├── link_creator.py
│   └── pipeline.py
├── .gitignore
├── main.py
├── maintest.csv
├── postal-codes.json
└── README.md
```
1. Clone the repository to your local machine.
2. Run the script by executing `main.py` from your command line:
```
python main.py
```
- The script creates an instance of the `Pipeline` class and builds a CSV file with data for 10,000 houses in Belgium. The resulting file is saved to the file path you choose in the project root. One run of the script takes approximately 15 minutes.
The `main.py` entry point looks like this:

```python
from src.pipeline import Pipeline
import time

# Record and print the start time
start = time.strftime("%H:%M:%S", time.localtime())
print(start)

# Run the pipeline; the user supplies the name of the output CSV file
pipeline = Pipeline()
pipeline.run(input("Enter the name of the CSV file you want to save: "))

# Record and print the finish time
finish = time.strftime("%H:%M:%S", time.localtime())
print(finish)
```
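The pipeline's final step writes the scraped records to a CSV file. A hedged sketch of what that step might look like is below; the column names and the use of `csv.DictWriter` are illustrative assumptions, not the project's actual code.

```python
import csv

# Hypothetical rows as the pipeline might collect them; the real
# column names come from the Immoweb data and may differ.
houses = [
    {"postcode": "1000", "price": 350000, "bedrooms": 3},
    {"postcode": "9000", "price": 275000, "bedrooms": 2},
]

# DictWriter maps each dict to a CSV row using the given field names
with open("houses.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=houses[0].keys())
    writer.writeheader()
    writer.writerows(houses)
```

`newline=""` is the documented way to open CSV files for writing on all platforms, and an explicit `encoding` avoids locale-dependent output.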
This project took three days to complete.
I used the "Postal codes - Belgium" dataset from the Opendatasoft website.
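The postal-codes dataset presumably feeds the link-creation step, which produces `houselinks_for_postcode.json`. A hedged sketch of that idea is below; the sample postcodes, the search-URL format, and the file layout are all assumptions for illustration, not the project's actual code.

```python
import json

# Sample Belgian postcodes; the real list would come from postal-codes.json,
# whose exact schema is not shown here.
postcodes = ["1000", "2000", "9000"]

# One bucket of search links per postcode; the URL pattern is hypothetical.
links_for_postcode = {
    pc: [f"https://www.example-search.be/houses?postcode={pc}"]
    for pc in postcodes
}

with open("houselinks_for_postcode.json", "w", encoding="utf-8") as f:
    json.dump(links_for_postcode, f, indent=2)

print(sorted(links_for_postcode))
```

Keeping links grouped by postcode makes it easy to resume or re-scrape a single region without redoing the whole country.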