Date: 7 July 2020
Housing remains one of the greatest concerns for people in Hong Kong, so I thought it would be fun to build a model that predicts house prices there.
Unlike in a Kaggle competition, the data is not readily available. So the first step, which I am going to share in this post, is scraping the housing data.
I am going to use transactional data rather than listing data. Transactional data is recorded by the Land Registry in Hong Kong and covers the actual transactions between sellers and buyers. Listing data comes from the listings on property websites such as Centaline and 28House, and it shows quoted prices only. I think transactional data is more reliable, since listed prices may be manipulated to attract buyers.
The tables are arranged in a nice format. Scraping them should be easy.
- Find the pattern in the URL, or find the request form
- Use Python Requests package to loop through the links built from the pattern
- Use async programming to make it more efficient (optional)
- Make the program more robust on error handling and blocking (optional)
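The first, second, and last steps can be sketched roughly as follows. This is only a minimal illustration, not the code from my repo: the URL, the `district` and `page` field names, and the helper names are all hypothetical placeholders, and the real ones have to be read off the site's request form. The `session` argument is assumed to behave like a `requests.Session`.

```python
import time
from itertools import product

# Hypothetical endpoint: replace it with the URL seen in the browser's
# network tab when the real request form is submitted.
FORM_URL = "https://example.com/transactions/search"


def build_payloads(districts, pages):
    """Return one form payload per (district, page) combination.

    The field names "district" and "page" are assumptions for illustration.
    """
    return [{"district": d, "page": p} for d, p in product(districts, pages)]


def fetch_page(session, payload, retries=3, backoff=1.0):
    """POST one payload and return the response body, retrying on failure.

    `session` is expected to behave like requests.Session.
    """
    for attempt in range(1, retries + 1):
        try:
            response = session.post(FORM_URL, data=payload, timeout=30)
            response.raise_for_status()
            return response.text
        except Exception:
            # With Requests, narrowing this to requests.RequestException
            # would be cleaner; Exception keeps the sketch dependency-free.
            if attempt == retries:
                raise
            time.sleep(backoff * attempt)  # wait a little longer each time


def scrape(session, districts, pages, delay=0.0):
    """Loop over every payload and collect the raw pages in order."""
    results = []
    for payload in build_payloads(districts, pages):
        results.append(fetch_page(session, payload))
        time.sleep(delay)  # a polite pause lowers the risk of being blocked
    return results
```

With Requests installed, a call would look like `scrape(requests.Session(), ["Sham Shui Po"], range(1, 4), delay=1.0)`. Passing the session in, rather than creating it inside the functions, keeps the connection reused across requests and makes the loop easy to try out with a stand-in object.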
If you look at the request form, it is easy to discover the HTTP requests and the payload:
Once we have figured this out, the scraping task becomes clear. You may check the code in the GitHub repo. I used aiohttp to make the requests asynchronous, so the program does not sit idle while waiting for each response. You can also find all the scraped data in the CSV file (I only scraped the Cheung Sha Wan and Sham Shui Po districts, since I want to build a pricing model for that region first).
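To give a feel for the async part, here is a minimal sketch of how such requests might be run concurrently with a cap on how many are in flight at once. It is not the code from the repo: the URL and the function names are hypothetical, and the limit of 5 is an arbitrary assumption. `fetch_with_aiohttp` expects an `aiohttp.ClientSession`, while `fetch_all` only relies on `asyncio`.

```python
import asyncio

# Hypothetical endpoint; the real URL and form fields come from the request form.
FORM_URL = "https://example.com/transactions/search"


async def fetch_with_aiohttp(session, payload):
    """POST one payload with an aiohttp.ClientSession and return the body."""
    async with session.post(FORM_URL, data=payload) as response:
        response.raise_for_status()
        return await response.text()


async def fetch_all(fetch_one, payloads, limit=5):
    """Run fetch_one(payload) for every payload, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(payload):
        # The semaphore blocks here once `limit` requests are in flight.
        async with semaphore:
            return await fetch_one(payload)

    # gather returns results in the same order as `payloads`.
    return await asyncio.gather(*(bounded(p) for p in payloads))


async def main(payloads):
    # aiohttp is third-party; importing it here lets the helpers above
    # be used with only the standard library installed.
    import aiohttp

    async with aiohttp.ClientSession() as session:
        return await fetch_all(lambda p: fetch_with_aiohttp(session, p), payloads)
```

Running `asyncio.run(main(payloads))` would then fetch every page. The semaphore is there for the "blocking" concern in the list above: firing hundreds of requests at once is fast, but it is also a good way to get an IP address blocked.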
Next, I will look at the data to get a feel for the missing values, distributions, correlations, and so on.