A few weeks ago, I shared with you where to find 10 years of financial statement data and how to use AWS Athena to query the files stored in AWS S3. In this post, I am going to show you how to write a program to download the data and store it in S3.

Downloading the files from the SEC website using Python is not that difficult:

  1. Find the pattern in the download links
  2. Use requests package to get the files
  3. Unzip the files and upload to S3

Those are the basic steps, but there are some details we need to handle. For example, how do we keep track of the files we have already processed? We don't want to keep duplicate copies of the data in S3. And how can we select only certain file types to upload?

Without further ado, let me show you the code:

It may look complicated, but the main part is the for loop: I loop through all the URLs found, create a folder named `zipdate` if it doesn't exist yet, extract the downloaded zip file into that folder, and finally upload the contents to the S3 path.
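Since the embedded code isn't reproduced here, a minimal sketch of that loop might look like the following. The helper names (`folder_name_from_url`, `download_and_extract`, `upload_dir_s3`) and the exact folder/key layout are my assumptions, not necessarily the original code:

```python
import io
import os
import zipfile

import requests  # third-party: pip install requests


def folder_name_from_url(url):
    # ".../2020q1.zip" -> "2020q1"; used as the local extraction folder name
    return os.path.splitext(os.path.basename(url))[0]


def download_and_extract(url, dest_root="."):
    """Download one zip file and extract it into a folder named after the file."""
    zipdate = folder_name_from_url(url)
    folder = os.path.join(dest_root, zipdate)
    os.makedirs(folder, exist_ok=True)  # create the folder only if it doesn't exist yet
    # The SEC asks for an identifying User-Agent; the value here is a placeholder.
    resp = requests.get(url, headers={"User-Agent": "your-name your-email@example.com"})
    resp.raise_for_status()
    with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
        zf.extractall(folder)
    return folder


def upload_dir_s3(folder, bucket):
    """Upload every file in folder to s3://bucket/folder/ (hypothetical helper)."""
    import boto3  # third-party; imported lazily so the download part works without it

    s3 = boto3.client("s3")
    for fname in os.listdir(folder):
        s3.upload_file(os.path.join(folder, fname), bucket, f"{folder}/{fname}")
```

The main loop is then just `for url in urls: upload_dir_s3(download_and_extract(url), bucket)`.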

To get the URLs for all the financial statement data sets, I used BeautifulSoup to find all href links on the webpage, then used a regex to filter only the links with a zip extension.
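That scraping step can be sketched roughly like this (the function name and the assumption that relative hrefs need the `sec.gov` prefix are mine):

```python
import re

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def get_zip_links(html, base="https://www.sec.gov"):
    """Return absolute URLs for every link on the page that ends in .zip."""
    soup = BeautifulSoup(html, "html.parser")
    hrefs = [a["href"] for a in soup.find_all("a", href=True)]
    zips = [h for h in hrefs if re.search(r"\.zip$", h)]
    # relative hrefs need the site prefix prepended
    return [h if h.startswith("http") else base + h for h in zips]
```

You would feed it the HTML of the SEC's financial statement data sets page, fetched with `requests.get(...).text`.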

Next, to maintain a record of processed URLs and prevent duplication, I used the logging package to create a log file containing the URLs already processed. Then I defined a function called `find_url_in_file` to retrieve those processed URLs and used a set operation to get the symmetric difference. Finally, I also check whether the files already exist in S3 using the `upload_dir_s3` function.
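A sketch of that bookkeeping, assuming the log file simply holds one processed URL per line (the logger setup and `urls_to_process` helper are my additions):

```python
import logging
import os


def make_logger(log_path):
    """A logger that appends one processed URL per line to log_path."""
    logger = logging.getLogger("processed_urls")
    logger.setLevel(logging.INFO)
    if not logger.handlers:
        handler = logging.FileHandler(log_path)
        handler.setFormatter(logging.Formatter("%(message)s"))  # bare message only
        logger.addHandler(handler)
    return logger


def find_url_in_file(log_path):
    """Retrieve the URLs already recorded in the log file."""
    if not os.path.exists(log_path):
        return set()
    with open(log_path) as f:
        return {line.strip() for line in f if line.strip()}


def urls_to_process(all_urls, log_path):
    # Symmetric difference: entries in exactly one of the two sets. Since the
    # processed URLs are a subset of all_urls, this yields the unprocessed ones.
    return set(all_urls) ^ find_url_in_file(log_path)
```

After each successful upload you would call `logger.info(url)` so the next run skips it.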
