Accessing Historical Financial News Headlines with Python

Get up to 1 year of headlines for North American companies with the free version of the Finnhub API.

Philippe Ostiguy, M. Sc.
Python in Plain English

--

Historical financial headlines can be useful for performing sentiment analysis on the financial markets. This article and this one show how sentiment analysis is used to predict the stock markets. To check whether headlines have predictive power for stock price movements, we need to backtest the strategy. One of the main challenges is obtaining historical data. More often than not, you’ll have to pay for a good data set… and paying doesn’t guarantee its quality!

It’s possible to web-scrape historical financial headlines for individual stocks from Morningstar, Seeking Alpha, or The Business Times. The main issue with these sources is that there are generally at most 2 or 3 headlines per day for a given stock. That’s not enough to test the correlation between financial news headlines and stock market movements: any sample size below 30 is considered small, and the smaller the data set, the weaker the predictive power of headline sentiment on stock market movements that we can establish.

FinViz provides between 20 and 30 headlines per stock per day, and this article shows how to retrieve them. However, it’s not possible to get historical news headlines.

The paid version of the Tiingo API ($10/month at the time of writing) allows three months of queryable history, which is already an upgrade.

After extensive research and workarounds, I found that the best option for this need was the free version of the Finnhub API. I created an algorithm that efficiently retrieves the historical financial news headlines for North American companies, up to 1 year back. For any given stock, it provides on average 70 headlines per day.

Photo by Markus Spiske on Unsplash

Finnhub’s free version lets you test your ideas, whereas the premium version gives you access to more advanced features. It’s one of the most powerful and complete stock market APIs, covering everything from stock fundamentals and technical analysis to alternative data. Whether you need real-time market data for your algorithmic trading project or historical data to backtest a strategy, the Finnhub API is worth checking out.

We provide 30+ years of financial statements, 20 years of earnings call transcripts, 30 years of historical market data, 25+ years of analysts’ estimate for 65,000 global companies and much more.

Before we begin, make sure you comply with the Finnhub Stock API’s terms of service. The free version is meant for personal use only (testing your ideas), and there is a maximum of 60 API calls per minute (which the code accounts for). With that being said, let’s begin!

1. Accessing the Finnhub API

The first thing you need to do is to get your unique key here. You can use the premium version if you want, but this code works fine with the free version.

Screenshot from Finnhub.io’s main page

After signing up, you’ll have an API key. You can easily copy it by clicking on the button:

Screenshot from Finnhub.io’s dashboard

A good practice with an API key is to store it as an environment variable (there’s a good article here on how to do it). In this code, the key is stored as an environment variable in the .env file, like this:
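For example, a minimal sketch (the variable name API_KEY is my choice; use whatever name you put in your .env file):

# content of the .env file (keep this file out of version control)
API_KEY=your_finnhub_key_here

And reading it in Python:

from decouple import config

# decouple reads the value from the .env file at runtime
api_key = config('API_KEY')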

2. Import Libraries

First, we import the modules we need to parse the data. os sets the directory paths. requests gets the data from the Finnhub API. datetime allows manipulating dates. dateutil calculates the difference between dates (an extension of the datetime module). pathlib creates new paths if necessary. The json library encodes and decodes JSON data. decouple reads local environment variables such as the API key. langdetect detects whether a Python string is in English. sqlite3 is the API interface for SQLite databases. The time library lets the program pause. bs4 pulls data out of HTML and XML files.
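Put together, the import block looks roughly like this (the exact submodules, such as relativedelta, are my assumptions):

import os
import json
import sqlite3
import time
from datetime import datetime, timedelta
from pathlib import Path

import requests
from bs4 import BeautifulSoup
from dateutil.relativedelta import relativedelta
from decouple import config
from langdetect import detect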

3. Initialize the Attributes

Then, we need to initialize our attributes. self.start_date is the date from which you want to start obtaining data. Keep in mind that the free version allows up to one year of company news from now.

The script raises an error if the start date is after the end date or if the start date is more than one year before the current date.
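Here is a minimal sketch of what such an Init class can look like; the attribute names match the ones used later in the article, but the defaults and validation messages are my assumptions:

from datetime import datetime
from dateutil.relativedelta import relativedelta

class Init:
    def __init__(self, start_date, end_date=None,
                 db_name='financial_news.db', time_sleep=60, max_call=60):
        self.start_date = datetime.strptime(start_date, '%Y-%m-%d')
        self.end_date = (datetime.strptime(end_date, '%Y-%m-%d')
                         if end_date else datetime.now())
        self.db_name = db_name        # SQLite database file name
        self.time_sleep = time_sleep  # pause length when the call limit is hit
        self.max_call = max_call      # maximum API calls per minute

        if self.start_date > self.end_date:
            raise ValueError('start_date must be before end_date')
        if self.start_date < datetime.now() - relativedelta(years=1):
            raise ValueError('start_date cannot be more than one year ago')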

4. Create a Class to Make API Calls to Finnhub

To make the module more flexible for future uses, I created a class that makes the API calls to Finnhub. In other words, if you later want to get data from another source, you only need to create another class in the module that takes its attributes from the Init class (step 3).

We will see these four methods in detail in the next steps:

# call the methods to retrieve historical financial headlines
self.req_new()
self.create_table()
self.clean_table()
self.lang_review()

Note that the function get_tickers() is optional. It gets the symbols of the companies listed in the S&P 500, which means the algorithm will retrieve the historical news headlines for every company in the index. If you prefer, you can replace the variable ticker (in the code above) with any list of tickers for which you want to retrieve historical data.
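As a sketch, get_tickers() can scrape the constituents table from Wikipedia with bs4 (the source page is my assumption; any ticker list works):

import requests
from bs4 import BeautifulSoup

def get_tickers():
    """Return the list of S&P 500 tickers scraped from Wikipedia."""
    url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    table = soup.find('table', {'id': 'constituents'})
    # the ticker symbol is the first cell of each data row
    return [row.find('td').text.strip() for row in table.find_all('tr')[1:]]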

5. Make API Calls to Finnhub

The method req_new() gets the data. The decorator iterate_day() calls req_new() for each day between self.start_date and self.end_date, defined in the Init class (step 3). We make one request per day because the API doesn’t return all the items when a request spans more than one day (at least in the free version).

The decorator iterate_day() also pauses the program for self.time_sleep seconds (60 seconds by default) whenever the maximum number of API calls, self.max_call, is reached (60 API calls per minute).
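Here is a sketch of how the decorator and the method can fit together. The company-news endpoint is Finnhub’s documented URL; the class name FinnhubNews, the symbol attribute, and the news list are my reconstructions:

import time
import requests
from datetime import timedelta
from functools import wraps
from decouple import config

def iterate_day(func):
    """Call the wrapped method once per day between start_date and end_date."""
    @wraps(func)
    def wrapper(self):
        date, nb_calls = self.start_date, 0
        while date <= self.end_date:
            if nb_calls >= self.max_call:
                time.sleep(self.time_sleep)  # respect the per-minute limit
                nb_calls = 0
            func(self, date.strftime('%Y-%m-%d'))
            nb_calls += 1
            date += timedelta(days=1)
    return wrapper

class FinnhubNews(Init):
    def __init__(self, symbol, **kwargs):
        super().__init__(**kwargs)
        self.symbol = symbol  # raw ticker, e.g. 'AAPL'
        self.news = []        # accumulated headline items

    @iterate_day
    def req_new(self, day):
        # one request per day: the free API does not return every item
        # when the requested range spans more than one day
        response = requests.get(
            'https://finnhub.io/api/v1/company-news',
            params={'symbol': self.symbol, 'from': day, 'to': day,
                    'token': config('API_KEY')})
        self.news.extend(response.json())

With this structure, FinnhubNews('AAPL', start_date='2021-01-01').req_new() would fetch a year of headlines for Apple, one day at a time.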

6. Creating Tables in the SQLite Database

We create one table per symbol in the self.db_name SQLite3 database. Each table is named with the ticker followed by ‘_’ to avoid SQL errors with tickers that collide with SQL keywords, such as Allstate’s ticker, ‘ALL’.

self.ticker = ticker + '_'

The wrapper init_sql() is used to open and save the SQL database.
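A sketch of the pattern, continuing the class above (the table layout and the insertion of the fetched items are my assumptions):

import sqlite3
from functools import wraps

def init_sql(func):
    """Open the SQLite connection, run the wrapped method, commit and close."""
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        self.conn = sqlite3.connect(self.db_name)
        self.cur = self.conn.cursor()
        result = func(self, *args, **kwargs)
        self.conn.commit()
        self.conn.close()
        return result
    return wrapper

# inside the FinnhubNews class
@init_sql
def create_table(self):
    self.ticker = self.symbol + '_'  # 'ALL' becomes the table name 'ALL_'
    self.cur.execute(
        f'CREATE TABLE IF NOT EXISTS {self.ticker} '
        '(datetime INTEGER, headline TEXT)')
    self.cur.executemany(
        f'INSERT INTO {self.ticker} VALUES (?, ?)',
        [(item.get('datetime'), item.get('headline')) for item in self.news])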

7. Cleaning the Data

As the expression says:

Garbage in, garbage out

Photo by Henry & Co. on Unsplash

For this reason, having a good data set is an essential part of data analysis, if not the most crucial part. Here, the algorithm removes entries with missing (N/A) headline or datetime values. It also removes duplicate entries.
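In SQL, the clean-up can look like this (my reconstruction of the behaviour described above):

# inside the FinnhubNews class
@init_sql
def clean_table(self):
    # remove rows with a missing headline or datetime
    self.cur.execute(
        f"DELETE FROM {self.ticker} WHERE headline IS NULL "
        "OR headline = '' OR datetime IS NULL")
    # remove duplicates, keeping the first occurrence of each pair
    self.cur.execute(
        f'DELETE FROM {self.ticker} WHERE rowid NOT IN '
        f'(SELECT MIN(rowid) FROM {self.ticker} '
        'GROUP BY datetime, headline)')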

8. Deleting Non-English Headlines

To make the headlines easier to use for sentiment analysis, the algorithm deletes the non-English ones. The reason is that most Natural Language Processing (NLP) algorithms work best with English text.
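A sketch of the filter (my reconstruction; langdetect raises an exception on strings with no detectable features, which are treated here as non-English):

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

# inside the FinnhubNews class
@init_sql
def lang_review(self):
    self.cur.execute(f'SELECT rowid, headline FROM {self.ticker}')
    for rowid, headline in self.cur.fetchall():
        try:
            keep = detect(headline) == 'en'
        except LangDetectException:
            keep = False  # e.g. a headline made of tickers and numbers only
        if not keep:
            self.cur.execute(
                f'DELETE FROM {self.ticker} WHERE rowid = ?', (rowid,))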

Special consideration

Photo by Waranont (Joe) on Unsplash

The package langdetect is generally good at detecting the right language, but it’s not perfect. So keep in mind that some of the news headlines obtained from Finnhub may not be in English, even after this clean-up.

I hope you find this useful. Please leave comments, feedback, and ideas if you have any. I would appreciate it.

The link to the project repo is here.

Discuss further with me on LinkedIn or via ostiguyphilippe@gmail.com!
