In #4, I talked about how to find the hidden APIs of the Centaline Android app. In this article, I am going to show you how to write a scraper to collect the data.
You can find the Scrapy Crawler in this Git directory.
3 Levels of Data
The scraping was done at three levels. The first is the transaction level: the transaction API contains basic information about each transaction. The second is the transaction detail level: the transaction detail API contains more information on the transaction. The last is the building information level, which contains even more information about the building. For more details on the data fields, please refer to this wiki (the documentation is not complete yet).
So the crawler first gets all transactions. Next, based on the returned transaction_id, it gets all transaction details. After that, based on the building_id obtained from the transaction details, it gets all building information.
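To make the chaining concrete, here is a minimal plain-Python sketch of the same idea. It does not call the real APIs: the three in-memory tables are hypothetical stand-ins, and the only field names taken from this article are transaction_id and building_id. Skipping a building that was already fetched is my assumption about what you would want (Scrapy's default duplicate filter does something similar for repeated requests).

```python
# Hypothetical in-memory stand-ins for the three APIs (the real ones are HTTP endpoints).
TRANSACTIONS = [{"transaction_id": "T1"}, {"transaction_id": "T2"}]
DETAILS = {
    "T1": {"transaction_id": "T1", "building_id": "B1"},
    "T2": {"transaction_id": "T2", "building_id": "B1"},
}
BUILDINGS = {"B1": {"building_id": "B1"}}


def crawl():
    """Yield (level, record) pairs in the order the three levels are visited."""
    seen_buildings = set()
    for tx in TRANSACTIONS:
        yield 1, tx                              # level 1: the transaction itself
        detail = DETAILS[tx["transaction_id"]]   # level 2: looked up by transaction_id
        yield 2, detail
        building_id = detail["building_id"]
        if building_id not in seen_buildings:    # level 3: looked up by building_id, once
            seen_buildings.add(building_id)
            yield 3, BUILDINGS[building_id]


for level, record in crawl():
    print(level, record)
```

Both transactions point at the same building here, so the building record is fetched only once.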
My First Scrapy Project
I would say I am quite experienced in scraping data (LOL), but I had never used Scrapy before. I mainly used the requests module and async programming. After using Scrapy, I don't think I will write my own scraper from scratch again. It is async by default, and by sticking to its structure, my code is much more organised than when I wrote my own scrapers.
The folder structure is created by default. Scrapy helps you organize your code!
    centaline/
        centaline/
            spiders/
                transaction_spider.py
            items.py
            pipelines.py
            settings.py
The following skeleton shows the basic structure of transaction_spider.py. Following our plan, I first request the first URL in start_requests using the yield keyword. The scrapy.Request also specifies the callback function, parse, for parsing the returned data. In parse, I parse the level 1 data and also yield the requests for level 2 data. In parse_2, I parse the level 2 data and yield the requests for level 3 data. Finally, in parse_3, I parse the level 3 data.
    import scrapy


    # Skeleton only: "sth" stands for arguments that are left out here.
    class TransactionSpider(scrapy.Spider):
        # some constants (such as urls, headers...)

        def start_requests(self):
            # some constants
            yield scrapy.Request(sth, sth, callback=self.parse, sth...)

        # level 1 data
        def parse(self, response):
            # get the response into Item
            yield item
            yield scrapy.Request(sth, sth, callback=self.parse_2, sth...)

        # level 2 data
        def parse_2(self, response):
            # get the response into Item
            yield item
            yield scrapy.Request(sth, sth, callback=self.parse_3, sth...)

        # level 3 data
        def parse_3(self, response):
            # get the response into Item
            yield item
You may be wondering what an item is. An item is simply a temporary container for the scraped data, and you can also do some transformation in it. For example, for the level 1 transaction data, my CentalineTransactionsItem, stored in items.py (please refer to the folder structure), looks like the following:
    class CentalineTransactionsItem(scrapy.Item):
        # define the fields for your item here like:
        TransactionID = scrapy.Field()
        IsRelated = scrapy.Field()
        RegDateString = scrapy.Field()
        CblgCode = scrapy.Field()
        CestCode = scrapy.Field()
        Data_Source = scrapy.Field()
        Memorial = scrapy.Field()
        RegDate = scrapy.Field()
        InsDate = scrapy.Field()
        PostType = scrapy.Field()
        Price = scrapy.Field()
        Rental = scrapy.Field()
        RFT_NArea = scrapy.Field()
        RFT_UPrice = scrapy.Field()
        INT_GArea = scrapy.Field()
        INT_UPrice = scrapy.Field()
        CX = scrapy.Field()
        CY = scrapy.Field()
        c_estate = scrapy.Field()
        c_phase = scrapy.Field()
        c_property = scrapy.Field()
        scp_c = scrapy.Field()
        scp_mkt = scrapy.Field()
        pc_addr = scrapy.Field()
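One practical detail: a scrapy.Item raises a KeyError if you assign a field that was not declared, so it helps to filter each API record down to the declared field names before building the item. The sketch below shows that step in plain Python. It assumes the API returns JSON, and the response body, its values, and the helper name to_item_kwargs are all hypothetical; only the three field names come from the item above.

```python
import json

# A few of the field names declared on CentalineTransactionsItem above.
ITEM_FIELDS = ("TransactionID", "RegDate", "Price")


def to_item_kwargs(record, fields=ITEM_FIELDS):
    """Keep only the declared fields; fields absent from the record become None."""
    return {name: record.get(name) for name in fields}


# Hypothetical response body; the real payload layout is not shown in this article.
body = '{"TransactionID": "T1", "Price": 5000000, "Unexpected": "x"}'
kwargs = to_item_kwargs(json.loads(body))
print(kwargs)
```

Inside a spider callback, the result could then be passed on as CentalineTransactionsItem(**kwargs).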
If you are interested in learning more about Scrapy, I recommend the following short video series on YouTube.
The above skeleton is a very simplified one. After some trial and error, I finally got all the transaction data (3 levels) for the last 360 days. The resulting CentalineTransactionItem.csv can be found in the staging_area folder.
Writing the code was quite straightforward, but I got stuck on outputting three separate files, one for each level of data. By default, Scrapy outputs everything into one file.
Luckily, I found a solution on Stack Overflow. It turns out I needed to write a custom pipeline to do the job. I also added my own answer to the Stack Overflow question, since the original answer is quite old and out of date.
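For readers who want a starting point, here is a minimal sketch of a pipeline that writes each item class to its own CSV file. It is neither my actual pipeline nor the Stack Overflow answer: the class name PerItemTypeCsvPipeline and the default output folder are assumptions, and it uses only the standard csv module instead of Scrapy's exporters. It relies on the three methods Scrapy calls on a pipeline (open_spider, close_spider and process_item).

```python
import csv
import os


class PerItemTypeCsvPipeline:
    """Write each item class to <output_dir>/<ClassName>.csv (hypothetical sketch)."""

    def __init__(self, output_dir="staging_area"):
        self.output_dir = output_dir
        self.files = {}    # item class name -> open file handle
        self.writers = {}  # item class name -> csv.DictWriter

    def open_spider(self, spider):
        os.makedirs(self.output_dir, exist_ok=True)

    def close_spider(self, spider):
        for f in self.files.values():
            f.close()
        self.files.clear()
        self.writers.clear()

    def process_item(self, item, spider):
        name = type(item).__name__
        row = dict(item)
        if name not in self.writers:
            # First item of this class: open its file and write the header.
            # scrapy.Item exposes declared fields via .fields; plain dicts fall back to their keys.
            fieldnames = list(getattr(item, "fields", None) or row)
            path = os.path.join(self.output_dir, name + ".csv")
            f = open(path, "w", newline="", encoding="utf-8")
            writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
            writer.writeheader()
            self.files[name] = f
            self.writers[name] = writer
        self.writers[name].writerow(row)  # missing fields are written as empty cells
        return item  # pass the item on to any later pipeline
```

To use something like this, the class would go in pipelines.py and be enabled through the ITEM_PIPELINES setting in settings.py, for example with the key centaline.pipelines.PerItemTypeCsvPipeline (the dotted path is an assumption based on the folder structure above).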
The next step will be doing EDA (Exploratory Data Analysis) again. I will check whether the missing data problem has improved and look for patterns in the data.