Recently I had a random thought about comparing MPF performance with index (SPX, HSI) performance. Many agents claim that their fund managers can beat the market! Is that true? Would I be better off investing in low-fee, passive funds?

I am not an expert in MPF. I simply tried to find data to verify the claim and answer the questions above. Please correct me if I am wrong. To do the analysis, let's first scrape the data from AAStocks.

First of all, the MPF data can be found at this link. A quick right-click and Inspect reveals that the data is stored in an HTML table:

[Screenshot: Inspect in Google Chrome]

A simple pd.read_html call can extract the data from the table. However, pandas assumes you are interested in the text, not the hyperlinks. To get the risk level, we need to follow each fund's hyperlink and scrape the risk level from its detail page.
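
To see the limitation concretely, here is a minimal, self-contained sketch (the table and URL are made up for illustration): pd.read_html keeps only the cell text and silently drops the href.

from io import StringIO
import pandas as pd

# toy table whose only cell contains a hyperlink
html = """<table>
<tr><th>Fund</th></tr>
<tr><td><a href="compare.aspx?comp1=12345">Some Fund</a></td></tr>
</table>"""

print(pd.read_html(StringIO(html))[0])
# the cell reads "Some Fund" only -- the href is gone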

To do so, we leverage the BeautifulSoup package: locate each a tag and read its hyperlink. With the hyperlink in hand, we can fetch the detail page using requests.

Finally, to extract the data from the table, we use regex. Now, let's walk through the code step by step.

First, we import the required packages.

import pandas as pd
import bs4 as bs
import requests

Next, we define a function to detect rows that contain the same value in every column. Those are header rows, e.g. a row where every cell reads "AMTD MPF Scheme".

def checkEqual(lst):
    "check whether all elements in the list are equal"
    return lst[1:] == lst[:-1]
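
A quick sanity check of the helper:

print(checkEqual(['AMTD MPF Scheme'] * 4))       # True  -> header row
print(checkEqual(['Fund A', '1.23', '+4.56%']))  # False -> data row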

Then we define the URL patterns and use BeautifulSoup to extract the table from the webpage.

url = "http://www.aastocks.com/en/mpf/search.aspx?tab=1&sp="
url_details = "http://www.aastocks.com/en/mpf/compare.aspx?comp1={}"

page = requests.get(url)
soup = bs.BeautifulSoup(page.text, 'lxml')
parsed_table = soup.find_all('table', {"class":"tblM s2 mpfDL"})[0]

Next, we use pandas to convert the HTML into a proper DataFrame and remove the rows with the same value across all columns.

df = pd.read_html(str(parsed_table),encoding='utf-8', header=0)[0]
df = df.rename(columns={'Unnamed: 0':'provider'})
df['provider'] = df['provider'].ffill()

# remove rows with same element
equal_rows = []
for index, row in df.iterrows():
    if checkEqual(row.to_list()):
        equal_rows.append(index)

df = df[~df.index.isin(equal_rows)]
df = df[:-1]  # drop the last row, which is not a fund row
df = df.reset_index(drop=True)
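
As an aside, the loop above can be replaced by a vectorized one-liner using pandas' nunique; a minimal equivalent sketch (it treats all-NaN rows slightly differently, but for this table the result is the same):

# keep only rows with more than one distinct value across columns
df = df[df.nunique(axis=1) > 1].reset_index(drop=True)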

Next, we extract each MPF code using regex so that we can visit each fund's webpage and extract its risk level.

# extract each fund's detail code from its hyperlink
df['link'] = [tag.get('href', 'no link') for tag in parsed_table.find_all('a')]
df['link'] = df['link'].astype(str)
df['code'] = df['link'].str.extract(r"\((\d*)\)", expand=True)

# get the risk level from each fund's detail page
# (the risk value sits in row 16 of the comparison table; brittle if the layout changes)
selector = '#compare_table > table > tr:nth-child(16) > td.center.cls'
risk = []
for index, row in df.iterrows():
    details = requests.get(url_details.format(row['code']))
    soup = bs.BeautifulSoup(details.text, "html.parser")
    risk.append(soup.select_one(selector).text)

df['risk'] = risk
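
One practical caveat: this loop fires one request per fund in quick succession, and select_one returns None if the page layout ever changes. A slightly more defensive variant of the same loop (the 1-second pause is an arbitrary choice):

import time

risk = []
for code in df['code']:
    details = requests.get(url_details.format(code))
    soup = bs.BeautifulSoup(details.text, "html.parser")
    cell = soup.select_one(selector)
    risk.append(cell.text if cell is not None else None)  # tolerate missing cells
    time.sleep(1)  # pause between requests to avoid hammering the server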

Finally, we use regex to extract the numeric values from the string columns.

df['1 year return'] = df['1-Y Chg'].str.extract(r"([+-]?\d{1,3}\.\d{1,2})%", expand=True).astype(float)
df['6 month return'] = df['6-M Chg'].str.extract(r"([+-]?\d{1,3}\.\d{1,2})%", expand=True).astype(float)
df['3 month return'] = df['3-M Chg'].str.extract(r"([+-]?\d{1,3}\.\d{1,2})%", expand=True).astype(float)
df['1 month return'] = df['1-M Chg'].str.extract(r"([+-]?\d{1,3}\.\d{1,2})%", expand=True).astype(float)
df['year to date return'] = df['YTD Chg'].str.extract(r"([+-]?\d{1,3}\.\d{1,2})%", expand=True).astype(float)
df['risk'] = df['risk'].str.extract(r"([+-]?\d{1,3}\.\d{1,2})%", expand=True).astype(float)

df_selected = df[['provider','Constitutent Fund', 'Type', 'Price', '1 year return', '6 month return', '3 month return',
                 '1 month return', 'year to date return', 'risk']]
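
To sanity-check the pattern, here is a quick test on made-up strings in the same format:

sample = pd.Series(['+12.34%', '-0.56%', 'N/A'])
print(sample.str.extract(r"([+-]?\d{1,3}\.\d{1,2})%", expand=True).astype(float))
#        0
# 0  12.34
# 1  -0.56
# 2    NaN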

You can find the code snippet here and run it on your own:
https://gist.github.com/kychanbp/5c0cdc375bf8637a8d61164860678987

Here is the scraped data:

https://docs.google.com/spreadsheets/d/1Df6xR0FEbbi77qcyuRAujhm9J_mu4unw/edit#gid=1617969516
