Very often, data on websites is in presentation format instead of a nicely formatted table. This makes analysis difficult. Therefore, the first step of doing any real world data analysis is to gather data in a nice format (usually table).
If you don’t know any programming, the task will be difficult. But, Google Sheet may make your life a little easier since we may be already familiar with Excel and typing formulas.
The Google Sheet Function I am going to use to scrape Yahoo Finance data is
IMPORTXML. The function will download the HTML text from Yahoo Finance and find the data using XPATH.
Wait…what is XPATH? XPATH is just a location of the element (e.g. text) on the webpage. Most of the time, you can right-click on the data you want to scrape, then click inspect, then right-click and copy the XPATH.
You can refer to this link for more details on how to locate elements using XPATH. One tip is to try not to locate the exact location. For example, try to locate the table containing the data instead. (Just experience)
After locating the table containing the data, we can use
INDEX function to locate the data. For example:
=index(IMPORTXML("https://finance.yahoo.com/quote/"&A3&"/key-statistics?p="&A3,"//td[@class='Fw(500) Ta(end) Pstart(10px) Miw(60px)']"),35)
AAPL. I first locate the table. Then the data is located at index 35.
Yeah! We are done. Now we can use the same concept to scrape many webpages.