Recently I had a random thought about comparing MPF performance with index (SPX, HSI) performance. Many agents claim that their fund managers can beat the market. Is that true? Would I be better off investing in low-fee, passive funds?
I am not an expert in MPF; I simply tried to find data to verify such claims and answer the questions above. Please correct me if I am wrong. To do the analysis, let's first scrape some data from AAStocks.
First of all, the MPF data can be found at this link. A simple right-click and Inspect reveals that the data is stored in a table:

A simple `pd.read_html` can extract the data in the table. However, `pandas` assumes you are interested in the text, not the hyperlinks. To get the risk level, we need to follow each fund's hyperlink and scrape the risk level from its detail page.
To do so, we leverage the `beautifulsoup4` package: simply locate the `a` tags and read the hyperlinks. After getting a hyperlink, we can fetch the page with `requests`. Finally, to extract the numbers from the table's text, we use regex. Now, let's look at the code step by step.
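To make the hyperlink point concrete, here is a minimal sketch on a made-up table in the same shape as the fund list (the fund name, return, and href are invented; `io.StringIO` wraps the literal HTML, as newer pandas versions expect):

```python
import io
import pandas as pd
import bs4 as bs

# A tiny stand-in for the fund table (hypothetical values).
html = """
<table class="tblM s2 mpfDL">
  <tr><th>Fund</th><th>1-Y Chg</th></tr>
  <tr><td><a href="compare.aspx?comp1=(12345)">Fund A</a></td><td>+5.43%</td></tr>
</table>
"""

# pd.read_html keeps only the cell text...
df = pd.read_html(io.StringIO(html), header=0)[0]
print(df["Fund"].iloc[0])        # Fund A

# ...while BeautifulSoup still exposes the hyperlink.
soup = bs.BeautifulSoup(html, "html.parser")
print(soup.find("a")["href"])    # compare.aspx?comp1=(12345)
```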
First, we import the required packages.
import pandas as pd
import numpy as np
import bs4 as bs
import requests
Next, we define a function to detect rows that repeat the same value in every column; those are header rows.

def checkEqual(lst):
    """Check whether all elements in a list are identical."""
    return lst[1:] == lst[:-1]
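A quick sanity check of the helper, with the function redefined so the snippet runs on its own (the sample values are made up):

```python
# checkEqual flags lists whose items are all the same -- comparing the
# list against itself shifted by one position.
def checkEqual(lst):
    """Check whether all elements in a list are identical."""
    return lst[1:] == lst[:-1]

print(checkEqual(["HSBC", "HSBC", "HSBC"]))   # True
print(checkEqual(["Fund A", "+5.43%", "3"]))  # False
```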
Then we define the URL patterns and use `BeautifulSoup` to extract the table from the webpage.
url = "http://www.aastocks.com/en/mpf/search.aspx?tab=1&sp="
url_details = "http://www.aastocks.com/en/mpf/compare.aspx?comp1={}"
page = requests.get(url)
soup = bs.BeautifulSoup(page.text, 'lxml')
parsed_table = soup.find_all('table', {"class":"tblM s2 mpfDL"})[0]
Next, we use `pandas` to convert the HTML into a nice table format and remove the rows that repeat the same value across all columns.
df = pd.read_html(str(parsed_table),encoding='utf-8', header=0)[0]
df = df.rename(columns={'Unnamed: 0':'provider'})
df['provider'] = df['provider'].ffill()
# remove rows with same element
equal_rows = []
for index, row in df.iterrows():
    if checkEqual(row.to_list()):
        equal_rows.append(index)
df = df[~df.index.isin(equal_rows)]
df = df[:-1]
df = df.reset_index(drop=True)
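As an aside, the same header-row sweep can be done without an explicit loop. A sketch using `nunique` on a hypothetical frame in the shape of the scraped table (the middle row is a repeated provider header):

```python
import pandas as pd

# Hypothetical data: the second row repeats "Manulife" in every column,
# mimicking the provider header rows in the scraped table.
df = pd.DataFrame({
    "provider": ["HSBC", "Manulife", "Manulife"],
    "Constitutent Fund": ["Fund A", "Manulife", "Fund B"],
    "1-Y Chg": ["+5.43%", "Manulife", "-1.20%"],
})

# A row whose cells are all identical has exactly one unique value.
df = df[df.nunique(axis=1) > 1].reset_index(drop=True)
print(df["Constitutent Fund"].tolist())  # ['Fund A', 'Fund B']
```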
Next, we extract the MPF code using regex so that we can visit each fund's detail page and pull its risk level.
# extract the detail code for each fund
df['link'] = [np.where(tag.has_attr('href'),tag.get('href'),"no link") for tag in parsed_table.find_all('a')]
df['link'] = df['link'].astype(str)
df['code'] = df['link'].str.extract(r"\((\d*)\)", expand=True)
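The pattern `\((\d*)\)` simply captures the digits between parentheses. A quick check on made-up hrefs (the real link format on the site may differ):

```python
import pandas as pd

# Hypothetical links in the shape the pattern expects: digits in parentheses.
links = pd.Series(["fundDetail(12345)", "fundDetail(67890)"])
codes = links.str.extract(r"\((\d*)\)", expand=False)
print(codes.tolist())  # ['12345', '67890']
```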
# get the risk level
selector = '#compare_table > table > tr:nth-child(16) > td.center.cls'
risk = []
for index, row in df.iterrows():
    details = requests.get(url_details.format(row['code']))
    soup = bs.BeautifulSoup(details.text, "html.parser")
    risk.append(soup.select_one(selector).text)
df['risk'] = risk
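The CSS selector picks a fixed row of the comparison table. A toy version of the same idea, on an invented detail page (row position, cell classes, and values are all assumptions):

```python
import bs4 as bs

# A toy detail page: select the nth table row, then the cell carrying
# the given classes -- the same logic as the selector above.
html = """
<table>
  <tr><td>Fund Name</td><td class="center cls">Fund A</td></tr>
  <tr><td>Risk Level</td><td class="center cls">12.34</td></tr>
</table>
"""
soup = bs.BeautifulSoup(html, "html.parser")
risk = soup.select_one("table > tr:nth-child(2) > td.center.cls").text
print(risk)  # 12.34
```

Note that `html.parser` leaves the `tr` elements as direct children of `table`; `lxml` would insert a `tbody`, which is why the real selector above includes extra levels.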
Finally, we use regex to extract the numeric data from the string columns. (Note the character class is `[+-]` for an optional sign, and the decimal point is escaped as `\.`.)
df['1 year return'] = df['1-Y Chg'].str.extract(r"([+-]?\d{1,3}\.\d{1,2})%", expand=True).astype(float)
df['6 month return'] = df['6-M Chg'].str.extract(r"([+-]?\d{1,3}\.\d{1,2})%", expand=True).astype(float)
df['3 month return'] = df['3-M Chg'].str.extract(r"([+-]?\d{1,3}\.\d{1,2})%", expand=True).astype(float)
df['1 month return'] = df['1-M Chg'].str.extract(r"([+-]?\d{1,3}\.\d{1,2})%", expand=True).astype(float)
df['year to date return'] = df['YTD Chg'].str.extract(r"([+-]?\d{1,3}\.\d{1,2})%", expand=True).astype(float)
df['risk'] = df['risk'].str.extract(r"([+-]?\d{1,3}\.\d{1,2})%", expand=True).astype(float)
df_selected = df[['provider','Constitutent Fund', 'Type', 'Price', '1 year return', '6 month return', '3 month return',
'1 month return', 'year to date return', 'risk']]
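The percent extraction can be sanity-checked on its own. A sketch with made-up return strings, including a non-numeric cell that cleanly becomes `NaN`:

```python
import pandas as pd

# Sample strings in the site's display format (values invented).
chg = pd.Series(["+5.43%", "-0.72%", "N/A"])
ret = chg.str.extract(r"([+-]?\d{1,3}\.\d{1,2})%", expand=False).astype(float)
print(ret.tolist())  # [5.43, -0.72, nan]
```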
You can find the code snippet here and run it on your own:
https://gist.github.com/kychanbp/5c0cdc375bf8637a8d61164860678987
Here is the scraped data:
https://docs.google.com/spreadsheets/d/1Df6xR0FEbbi77qcyuRAujhm9J_mu4unw/edit#gid=1617969516