There are some good stock scanners, like Tradeview, that can filter stocks based on financial ratios, technical indicators, and fundamentals. What if you want a more tailor-made scanner? How about filtering all the stocks with a median 5-year Return on Equity (ROE) greater than 5%? It is not an easy task if you use online stock scanners. Neither does building your own (But at least possible).

In the United States, the U.S. SECURITIES AND EXCHANGE COMMISSION will upload financial statements data every quarter, thanks to the XBRL standard. You can find 10 years of data in this link. The only problem is it is quite difficult to understand the data structure and the meaning of those tags.

Let’s try to do an exercise to filter all stocks with a median 5-year return on equity (ROE) greater than 5%.

Step 1: Find the Tag for Net Income and Shareholders’ Equity

Instead of reading the messy taxonomy file, you can use Yeti Taxonomy Viewer to find the tag easily. Using the viewer, net income is NetIncomeLoss and shareholders’ equity is StockholdersEquity. Please also pay attention to the data type. Since we are interested in numeric data, the data type should be
xbrli:monetaryItemType
. Also, for some items, the tag name may be different in different versions of the accounting standard. But from my experience, for some big items, they are consistent across different versions.

Using the Yeti Taxonomy Viewer is quite intuitive. First, you need to register an account and login. Second, select the accounting standard. Third, just browse to the statement you normally would find the information and check the property:name.

Step 2: Calculate the Median ROE Group by Stocks

I will now use SQL to query the data downloaded directly. As for how to do that, I will show you in Part 2. Basically, AWS S3, AWS Glue, and AWS Athena are used to make this possible.

Before showing you the SQL code, let me first show you an overview of the information contains in each table.

Table NameWhat is inside
SUB (Submissions) [I name it as sub_tsv]Summary of submission. Basic information on the company.
TAG (Tags) [I name it as tag_tsv]All standard taxonomy tags.
NUM (Numbers) [I name it as num_tsv]All numeric data associated with submissions and tags.

For details on other tables, please refer to the documentation.

Getting 5-year NetIncomeLoss

Example Results:

adsh tag version ddate qtrs uom dimh iprx value footnote footlen dimn coreg durp datp dcml partition_0
0001640334-16-001065 NetIncomeLoss us-gaap/2015 20160131 4 USD 0x00000000 0 -2672412
0 0
0.01095891 0 0 2016q2
0001056285-19-000017 NetIncomeLoss us-gaap/2018 20190131 4 USD 0x00000000 0 3780000
0 0
0.021918058 -2 -3 2019q1
0001574135-19-000009 NetIncomeLoss us-gaap/2018 20190131 4 USD 0x00000000 0 -1.41903E+08
0 0
0.021918058 -1 -3 2019q1
0000751652-16-000410 NetIncomeLoss us-gaap/2015 20160630 4 USD 0x00000000 0 1.951E+08
0 0
0 0 -5 2016q3
0000109563-16-000283 NetIncomeLoss us-gaap/2015 20160630 4 USD 0x00000000 0 2.9577E+07
0 0
0 0 -3 2016q3

SQL Code

SELECT *
FROM num_tsv
WHERE regexp_extract(cast(ddate AS varchar), '^([0-9]{4})', 1) IN ('2015','2016','2017','2018','2019')
        AND regexp_extract(adsh, '^[0-9]{10}-([0-9]{2})-[0-9]{6}', 1) = regexp_extract(cast(ddate AS varchar), '^20([0-9]{2})', 1)
        AND tag = 'NetIncomeLoss'
        AND dimh = '0x00000000'
        AND qtrs = 4
        AND iprx = 0
  • The ddate field contains the end date for the data value. I extract the YYYY value from the date and make sure it is within the year 2015 to 2019.
  • I extract the year value from adsh using Regex (The middle 2 digit number). Then make sure it is equal to the year value in ddate. The reason for doing so is that one report may contain more than 1 year’s data.
  • I use 0x0000000 dimension data since other dimh values are used to link to other data points.
  • I also make sure that qtrs, the count of the number of quarters represented by the data value is 4, which is yearly data.
  • At last, iprx is set to 0 because there may be duplicated value of the same tag in one report and 0 represents the highest priority and precision.

Getting 5-year StockholdersEquity

The SQL code is similar to the above. Only qtrs is set to 0 because StockholderEquity is a point-in-time value. Please also noted that the values return contain both quarterly and yearly data. We will filter that out in the next step using sub_tsv table.

Example Results

adsh tag version ddate qtrs uom dimh iprx value footnote footlen dimn coreg durp datp dcml partition_0
0001558891-17-000024 StockholdersEquity us-gaap/2016 20170331 0 USD 0x00000000 0 17403
0 0
0 0 0 2017q2
0000925645-17-000013 StockholdersEquity us-gaap/2015 20170331 0 USD 0x00000000 0 -1.16707E+08
0 0
0 0 -3 2017q2
0001564590-17-007150 StockholdersEquity us-gaap/2016 20170331 0 USD 0x00000000 0 6.93583E+08
0 0
0 0 -3 2017q2
0001562762-17-000048 StockholdersEquity us-gaap/2016 20170331 0 USD 0x00000000 0 2.78145E+08
0 0
0 0 -3 2017q2
0001171843-17-002300 StockholdersEquity us-gaap/2017 20170228 0 USD 0x00000000 0 2.719277E+09
0 0
0 3 -3 2017q2

SQL Code

SELECT *
FROM num_tsv
WHERE regexp_extract(cast(ddate AS varchar), '^([0-9]{4})', 1) IN ('2015','2016','2017','2018','2019')
        AND regexp_extract(adsh, '^[0-9]{10}-([0-9]{2})-[0-9]{6}', 1) = regexp_extract(cast(ddate AS varchar), '^20([0-9]{2})', 1)
        AND tag = 'StockholdersEquity'
        AND dimh = '0x00000000'
        AND qtrs = 0
        AND iprx = 0

Step 3: Puting It Together

I used appl data as a sample. There are 5-year StockholdersEquity and NetIncomeLoss.

adsh tag version ddate qtrs uom dimh iprx value footnote footlen dimn coreg durp datp dcml partition_0 adsh cik name fp
0000320193-19-000119 StockholdersEquity us-gaap/2019 20190930 0 USD 0x00000000 0 9.0488E+10
0 0
0 2 -6 2019q4 0000320193-19-000119 320193 APPLE INC FY
0001193125-15-356351 StockholdersEquity us-gaap/2015 20150930 0 USD 0x00000000 0 1.19355E+11
0 0
0 4 -6 2015q4 0001193125-15-356351 320193 APPLE INC FY
0000320193-19-000119 NetIncomeLoss us-gaap/2019 20190930 4 USD 0x00000000 0 5.5256E+10
0 0
0.021918058 2 -6 2019q4 0000320193-19-000119 320193 APPLE INC FY
0001193125-15-356351 NetIncomeLoss us-gaap/2015 20150930 4 USD 0x00000000 0 5.3394E+10
0 0
0.021918058 4 -6 2015q4 0001193125-15-356351 320193 APPLE INC FY
0000320193-17-000070 StockholdersEquity us-gaap/2017 20170930 0 USD 0x00000000 0 1.34047E+11
0 0
0 0 -6 2017q4 0000320193-17-000070 320193 APPLE INC FY
0000320193-17-000070 NetIncomeLoss us-gaap/2017 20170930 4 USD 0x00000000 0 4.8351E+10
0 0
-0.054793835 0 -6 2017q4 0000320193-17-000070 320193 APPLE INC FY
0000320193-18-000145 NetIncomeLoss us-gaap/2018 20180930 4 USD 0x00000000 0 5.9531E+10
0 0
0.021918058 1 -6 2018q4 0000320193-18-000145 320193 APPLE INC FY
0000320193-18-000145 StockholdersEquity us-gaap/2018 20180930 0 USD 0x00000000 0 1.07147E+11
0 0
0 1 -6 2018q4 0000320193-18-000145 320193 APPLE INC FY
0001628280-16-020309 StockholdersEquity us-gaap/2015 20160930 0 USD 0x00000000 0 1.28249E+11
0 0
0 6 -6 2016q4 0001628280-16-020309 320193 APPLE INC FY
0001628280-16-020309 NetIncomeLoss us-gaap/2015 20160930 4 USD 0x00000000 0 4.5687E+10
0 0
0.021918058 6 -6 2016q4 0001628280-16-020309 320193 APPLE INC FY

SQL Code

SELECT *
FROM 
    (SELECT *
    FROM num_tsv
    WHERE regexp_extract(cast(ddate AS varchar), '^([0-9]{4})', 1) IN ('2015','2016','2017','2018','2019')
            AND regexp_extract(adsh, '^[0-9]{10}-([0-9]{2})-[0-9]{6}', 1) = regexp_extract(cast(ddate AS varchar), '^20([0-9]{2})', 1)
            AND tag = 'NetIncomeLoss'
            AND dimh = '0x00000000'
            AND qtrs = 4
            AND iprx = 0
    UNION
    SELECT *
    FROM num_tsv
    WHERE regexp_extract(cast(ddate AS varchar), '^([0-9]{4})', 1) IN ('2015','2016','2017','2018','2019')
            AND regexp_extract(adsh, '^[0-9]{10}-([0-9]{2})-[0-9]{6}', 1) = regexp_extract(cast(ddate AS varchar), '^20([0-9]{2})', 1)
            AND tag = 'StockholdersEquity'
            AND dimh = '0x00000000'
            AND qtrs = 0
            AND iprx = 0 ) AS "data"
JOIN 
    (SELECT adsh,
         cik,
         name,
         fp
    FROM sub_tsv ) AS "basic_info"
    ON "data"."adsh" = "basic_info"."adsh"
WHERE "basic_info"."fp" = 'FY'
  • I use union to combine two table from NetIncomeLoss and StockholdersEquity together.
  • I join the sub_tsv table on adsh code to obtain information like company cik code and name.
  • I filter fp by FY (financial year) so that all records are yearly. This solves the problem in StockholdersEquity that contaminated by quarterly values.

Step 3: Filter Stocks with Median 5-year ROE greater than 5%

Finally, I come up with this insane SQL. The majority of the codes are duplicated from the above steps. Don’t scare by it.

What is does are

  • Create a pivot table with rows as cik and ddate and columns as NetIncomeLoss and StockholdersEquity
  • Then calculate the ROE for each year by dividing NetIncomeLoss with StockholdersEquity.
  • Then calculate the median ROE group by cik which is unique for each company
SELECT "Return"."cik",
         approx_percentile("Return"."Return",
         0.5) as "ROE"
FROM 
    (SELECT *,
         "pivot"."NetIncomeLoss"/"pivot"."StockholdersEquity" AS "Return"
    FROM 
        (SELECT "basic_info"."cik",
         "data"."ddate",
         max(case
            WHEN "data"."tag" = 'NetIncomeLoss' THEN
            "data"."value"
            ELSE 0 end) AS "NetIncomeLoss", max(case
            WHEN "data"."tag" = 'StockholdersEquity' THEN
            "data"."value"
            ELSE 0 end) AS "StockholdersEquity"
        FROM 
            (SELECT *
            FROM num_tsv
            WHERE regexp_extract(cast(ddate AS varchar), '^([0-9]{4})', 1) IN ('2015','2016','2017','2018','2019')
                    AND regexp_extract(adsh, '^[0-9]{10}-([0-9]{2})-[0-9]{6}', 1) = regexp_extract(cast(ddate AS varchar), '^20([0-9]{2})', 1)
                    AND tag = 'NetIncomeLoss'
                    AND dimh = '0x00000000'
                    AND qtrs = 4
                    AND iprx = 0
            UNION
            SELECT *
            FROM num_tsv
            WHERE regexp_extract(cast(ddate AS varchar), '^([0-9]{4})', 1) IN ('2015','2016','2017','2018','2019')
                    AND regexp_extract(adsh, '^[0-9]{10}-([0-9]{2})-[0-9]{6}', 1) = regexp_extract(cast(ddate AS varchar), '^20([0-9]{2})', 1)
                    AND tag = 'StockholdersEquity'
                    AND dimh = '0x00000000'
                    AND qtrs = 0
                    AND iprx = 0 ) AS "data"
            JOIN 
                (SELECT adsh,
         cik,
         name,
         fp
                FROM sub_tsv ) AS "basic_info"
                    ON "data"."adsh" = "basic_info"."adsh"
                WHERE "basic_info"."fp" = 'FY'
                GROUP BY  "basic_info"."cik", "data"."ddate") AS "pivot" ) AS "Return"
            GROUP BY  "Return"."cik"

If you don’t follow the SQL code, it is perfectly normal since even myself will get loss. We can do this in Pandas or create Views as intermediate steps. That would be easier to understand.

Final results:

 cikROE
113505930.1527611761893378
210020470.16399636800328052
3874841NaN
468450.15544582530727186
5644720.06529732077754852
6357294NaN
710104700.0406067222958746
8856135-Infinity
915776030.0
107457320.41289025728653544

Next, you can filter any stock with ROE > 5%.

Next

In the following posts, I will show you how to scrap the fundamental data from SEC using Python!

Subscribe and Get Updates!

Leave a Reply

%d