There are some good stock scanners, like Tradeview, that can filter stocks based on financial ratios, technical indicators, and fundamentals. What if you want a more tailor-made scanner? How about filtering all the stocks with a median 5-year Return on Equity (ROE) greater than 5%? It is not an easy task if you use online stock scanners. Neither does building your own (But at least possible).
In the United States, the U.S. SECURITIES AND EXCHANGE COMMISSION will upload financial statements data every quarter, thanks to the XBRL standard. You can find 10 years of data in this link. The only problem is it is quite difficult to understand the data structure and the meaning of those tags.
Let’s try to do an exercise to filter all stocks with a median 5-year return on equity (ROE) greater than 5%.
Step 1: Find the Tag for Net Income and Shareholders’ Equity
Instead of reading the messy taxonomy file, you can use Yeti Taxonomy Viewer to find the tag easily. Using the viewer, net income is NetIncomeLoss
and shareholders’ equity is StockholdersEquity
. Please also pay attention to the data type. Since we are interested in numeric data, the data type should be
. Also, for some items, the tag name may be different in different versions of the accounting standard. But from my experience, for some big items, they are consistent across different versions.
xbrli:monetaryItemType
Using the Yeti Taxonomy Viewer is quite intuitive. First, you need to register an account and login. Second, select the accounting standard. Third, just browse to the statement you normally would find the information and check the property:name
.
Step 2: Calculate the Median ROE Group by Stocks
I will now use SQL to query the data downloaded directly. As for how to do that, I will show you in Part 2. Basically, AWS S3, AWS Glue, and AWS Athena are used to make this possible.
Before showing you the SQL code, let me first show you an overview of the information contains in each table.
Table Name | What is inside |
SUB (Submissions) [I name it as sub_tsv] | Summary of submission. Basic information on the company. |
TAG (Tags) [I name it as tag_tsv] | All standard taxonomy tags. |
NUM (Numbers) [I name it as num_tsv] | All numeric data associated with submissions and tags. |
For details on other tables, please refer to the documentation.
Getting 5-year NetIncomeLoss
Example Results:
adsh | tag | version | ddate | qtrs | uom | dimh | iprx | value | footnote | footlen | dimn | coreg | durp | datp | dcml | partition_0 |
0001640334-16-001065 | NetIncomeLoss | us-gaap/2015 | 20160131 | 4 | USD | 0x00000000 | 0 | -2672412 |
| 0 | 0 |
| 0.01095891 | 0 | 0 | 2016q2 |
0001056285-19-000017 | NetIncomeLoss | us-gaap/2018 | 20190131 | 4 | USD | 0x00000000 | 0 | 3780000 |
| 0 | 0 |
| 0.021918058 | -2 | -3 | 2019q1 |
0001574135-19-000009 | NetIncomeLoss | us-gaap/2018 | 20190131 | 4 | USD | 0x00000000 | 0 | -1.41903E+08 |
| 0 | 0 |
| 0.021918058 | -1 | -3 | 2019q1 |
0000751652-16-000410 | NetIncomeLoss | us-gaap/2015 | 20160630 | 4 | USD | 0x00000000 | 0 | 1.951E+08 |
| 0 | 0 |
| 0 | 0 | -5 | 2016q3 |
0000109563-16-000283 | NetIncomeLoss | us-gaap/2015 | 20160630 | 4 | USD | 0x00000000 | 0 | 2.9577E+07 |
| 0 | 0 |
| 0 | 0 | -3 | 2016q3 |
SQL Code
SELECT *
FROM num_tsv
WHERE regexp_extract(cast(ddate AS varchar), '^([0-9]{4})', 1) IN ('2015','2016','2017','2018','2019')
AND regexp_extract(adsh, '^[0-9]{10}-([0-9]{2})-[0-9]{6}', 1) = regexp_extract(cast(ddate AS varchar), '^20([0-9]{2})', 1)
AND tag = 'NetIncomeLoss'
AND dimh = '0x00000000'
AND qtrs = 4
AND iprx = 0
- The
ddate
field contains the end date for the data value. I extract theYYYY
value from the date and make sure it is within the year 2015 to 2019. - I extract the year value from
adsh
using Regex (The middle 2 digit number). Then make sure it is equal to the year value inddate
. The reason for doing so is that one report may contain more than 1 year’s data. - I use
0x0000000
dimension data since otherdimh
values are used to link to other data points. - I also make sure that
qtrs
, the count of the number of quarters represented by the data value is 4, which is yearly data. - At last,
iprx
is set to0
because there may be duplicated value of the same tag in one report and0
represents the highest priority and precision.
Getting 5-year StockholdersEquity
The SQL code is similar to the above. Only qtrs
is set to 0
because StockholderEquity
is a point-in-time value. Please also noted that the values return contain both quarterly and yearly data. We will filter that out in the next step using sub_tsv
table.
Example Results
adsh | tag | version | ddate | qtrs | uom | dimh | iprx | value | footnote | footlen | dimn | coreg | durp | datp | dcml | partition_0 |
0001558891-17-000024 | StockholdersEquity | us-gaap/2016 | 20170331 | 0 | USD | 0x00000000 | 0 | 17403 |
| 0 | 0 |
| 0 | 0 | 0 | 2017q2 |
0000925645-17-000013 | StockholdersEquity | us-gaap/2015 | 20170331 | 0 | USD | 0x00000000 | 0 | -1.16707E+08 |
| 0 | 0 |
| 0 | 0 | -3 | 2017q2 |
0001564590-17-007150 | StockholdersEquity | us-gaap/2016 | 20170331 | 0 | USD | 0x00000000 | 0 | 6.93583E+08 |
| 0 | 0 |
| 0 | 0 | -3 | 2017q2 |
0001562762-17-000048 | StockholdersEquity | us-gaap/2016 | 20170331 | 0 | USD | 0x00000000 | 0 | 2.78145E+08 |
| 0 | 0 |
| 0 | 0 | -3 | 2017q2 |
0001171843-17-002300 | StockholdersEquity | us-gaap/2017 | 20170228 | 0 | USD | 0x00000000 | 0 | 2.719277E+09 |
| 0 | 0 |
| 0 | 3 | -3 | 2017q2 |
SQL Code
SELECT *
FROM num_tsv
WHERE regexp_extract(cast(ddate AS varchar), '^([0-9]{4})', 1) IN ('2015','2016','2017','2018','2019')
AND regexp_extract(adsh, '^[0-9]{10}-([0-9]{2})-[0-9]{6}', 1) = regexp_extract(cast(ddate AS varchar), '^20([0-9]{2})', 1)
AND tag = 'StockholdersEquity'
AND dimh = '0x00000000'
AND qtrs = 0
AND iprx = 0
Step 3: Puting It Together
I used appl
data as a sample. There are 5-year StockholdersEquity
and NetIncomeLoss
.
adsh | tag | version | ddate | qtrs | uom | dimh | iprx | value | footnote | footlen | dimn | coreg | durp | datp | dcml | partition_0 | adsh | cik | name | fp |
0000320193-19-000119 | StockholdersEquity | us-gaap/2019 | 20190930 | 0 | USD | 0x00000000 | 0 | 9.0488E+10 |
| 0 | 0 |
| 0 | 2 | -6 | 2019q4 | 0000320193-19-000119 | 320193 | APPLE INC | FY |
0001193125-15-356351 | StockholdersEquity | us-gaap/2015 | 20150930 | 0 | USD | 0x00000000 | 0 | 1.19355E+11 |
| 0 | 0 |
| 0 | 4 | -6 | 2015q4 | 0001193125-15-356351 | 320193 | APPLE INC | FY |
0000320193-19-000119 | NetIncomeLoss | us-gaap/2019 | 20190930 | 4 | USD | 0x00000000 | 0 | 5.5256E+10 |
| 0 | 0 |
| 0.021918058 | 2 | -6 | 2019q4 | 0000320193-19-000119 | 320193 | APPLE INC | FY |
0001193125-15-356351 | NetIncomeLoss | us-gaap/2015 | 20150930 | 4 | USD | 0x00000000 | 0 | 5.3394E+10 |
| 0 | 0 |
| 0.021918058 | 4 | -6 | 2015q4 | 0001193125-15-356351 | 320193 | APPLE INC | FY |
0000320193-17-000070 | StockholdersEquity | us-gaap/2017 | 20170930 | 0 | USD | 0x00000000 | 0 | 1.34047E+11 |
| 0 | 0 |
| 0 | 0 | -6 | 2017q4 | 0000320193-17-000070 | 320193 | APPLE INC | FY |
0000320193-17-000070 | NetIncomeLoss | us-gaap/2017 | 20170930 | 4 | USD | 0x00000000 | 0 | 4.8351E+10 |
| 0 | 0 |
| -0.054793835 | 0 | -6 | 2017q4 | 0000320193-17-000070 | 320193 | APPLE INC | FY |
0000320193-18-000145 | NetIncomeLoss | us-gaap/2018 | 20180930 | 4 | USD | 0x00000000 | 0 | 5.9531E+10 |
| 0 | 0 |
| 0.021918058 | 1 | -6 | 2018q4 | 0000320193-18-000145 | 320193 | APPLE INC | FY |
0000320193-18-000145 | StockholdersEquity | us-gaap/2018 | 20180930 | 0 | USD | 0x00000000 | 0 | 1.07147E+11 |
| 0 | 0 |
| 0 | 1 | -6 | 2018q4 | 0000320193-18-000145 | 320193 | APPLE INC | FY |
0001628280-16-020309 | StockholdersEquity | us-gaap/2015 | 20160930 | 0 | USD | 0x00000000 | 0 | 1.28249E+11 |
| 0 | 0 |
| 0 | 6 | -6 | 2016q4 | 0001628280-16-020309 | 320193 | APPLE INC | FY |
0001628280-16-020309 | NetIncomeLoss | us-gaap/2015 | 20160930 | 4 | USD | 0x00000000 | 0 | 4.5687E+10 |
| 0 | 0 |
| 0.021918058 | 6 | -6 | 2016q4 | 0001628280-16-020309 | 320193 | APPLE INC | FY |
SQL Code
SELECT *
FROM
(SELECT *
FROM num_tsv
WHERE regexp_extract(cast(ddate AS varchar), '^([0-9]{4})', 1) IN ('2015','2016','2017','2018','2019')
AND regexp_extract(adsh, '^[0-9]{10}-([0-9]{2})-[0-9]{6}', 1) = regexp_extract(cast(ddate AS varchar), '^20([0-9]{2})', 1)
AND tag = 'NetIncomeLoss'
AND dimh = '0x00000000'
AND qtrs = 4
AND iprx = 0
UNION
SELECT *
FROM num_tsv
WHERE regexp_extract(cast(ddate AS varchar), '^([0-9]{4})', 1) IN ('2015','2016','2017','2018','2019')
AND regexp_extract(adsh, '^[0-9]{10}-([0-9]{2})-[0-9]{6}', 1) = regexp_extract(cast(ddate AS varchar), '^20([0-9]{2})', 1)
AND tag = 'StockholdersEquity'
AND dimh = '0x00000000'
AND qtrs = 0
AND iprx = 0 ) AS "data"
JOIN
(SELECT adsh,
cik,
name,
fp
FROM sub_tsv ) AS "basic_info"
ON "data"."adsh" = "basic_info"."adsh"
WHERE "basic_info"."fp" = 'FY'
- I use
union
to combine two table fromNetIncomeLoss
andStockholdersEquity
together. - I
join
thesub_tsv
table onadsh
code to obtain information like companycik
code andname
. - I filter
fp
byFY
(financial year) so that all records are yearly. This solves the problem inStockholdersEquity
that contaminated by quarterly values.
Step 3: Filter Stocks with Median 5-year ROE greater than 5%
Finally, I come up with this insane SQL
. The majority of the codes are duplicated from the above steps. Don’t scare by it.
What is does are
- Create a pivot table with rows as
cik
andddate
and columns asNetIncomeLoss
andStockholdersEquity
- Then calculate the ROE for each year by dividing
NetIncomeLoss
withStockholdersEquity
. - Then calculate the median ROE group by
cik
which is unique for each company
SELECT "Return"."cik",
approx_percentile("Return"."Return",
0.5) as "ROE"
FROM
(SELECT *,
"pivot"."NetIncomeLoss"/"pivot"."StockholdersEquity" AS "Return"
FROM
(SELECT "basic_info"."cik",
"data"."ddate",
max(case
WHEN "data"."tag" = 'NetIncomeLoss' THEN
"data"."value"
ELSE 0 end) AS "NetIncomeLoss", max(case
WHEN "data"."tag" = 'StockholdersEquity' THEN
"data"."value"
ELSE 0 end) AS "StockholdersEquity"
FROM
(SELECT *
FROM num_tsv
WHERE regexp_extract(cast(ddate AS varchar), '^([0-9]{4})', 1) IN ('2015','2016','2017','2018','2019')
AND regexp_extract(adsh, '^[0-9]{10}-([0-9]{2})-[0-9]{6}', 1) = regexp_extract(cast(ddate AS varchar), '^20([0-9]{2})', 1)
AND tag = 'NetIncomeLoss'
AND dimh = '0x00000000'
AND qtrs = 4
AND iprx = 0
UNION
SELECT *
FROM num_tsv
WHERE regexp_extract(cast(ddate AS varchar), '^([0-9]{4})', 1) IN ('2015','2016','2017','2018','2019')
AND regexp_extract(adsh, '^[0-9]{10}-([0-9]{2})-[0-9]{6}', 1) = regexp_extract(cast(ddate AS varchar), '^20([0-9]{2})', 1)
AND tag = 'StockholdersEquity'
AND dimh = '0x00000000'
AND qtrs = 0
AND iprx = 0 ) AS "data"
JOIN
(SELECT adsh,
cik,
name,
fp
FROM sub_tsv ) AS "basic_info"
ON "data"."adsh" = "basic_info"."adsh"
WHERE "basic_info"."fp" = 'FY'
GROUP BY "basic_info"."cik", "data"."ddate") AS "pivot" ) AS "Return"
GROUP BY "Return"."cik"
If you don’t follow the SQL code, it is perfectly normal since even myself will get loss. We can do this in Pandas
or create Views
as intermediate steps. That would be easier to understand.
Final results:
cik | ROE | |
---|---|---|
1 | 1350593 | 0.1527611761893378 |
2 | 1002047 | 0.16399636800328052 |
3 | 874841 | NaN |
4 | 6845 | 0.15544582530727186 |
5 | 64472 | 0.06529732077754852 |
6 | 357294 | NaN |
7 | 1010470 | 0.0406067222958746 |
8 | 856135 | -Infinity |
9 | 1577603 | 0.0 |
10 | 745732 | 0.41289025728653544 |
Next, you can filter any stock with ROE
> 5%.
Next
In the following posts, I will show you how to scrap the fundamental data from SEC using Python!