Sentiment Analysis of Financial News: Insights and Actions


Leveraging Natural Language Processing to understand market sentiment and financial trends.


Marco A. García - 12/01/2025

You can find this project in the GitHub repository.

Introduction

In this project, we analyze a dataset of financial news to gain insights into the sentiment and trends present in financial media. This involves several key steps: preparing the data and building a corpus, tokenizing the texts, removing stopwords, stemming the tokens, and finally visualizing token frequencies, sentiment, and word clouds.

Our goal is to uncover patterns in financial sentiment and understand how news impacts financial markets.

The dataset contains several fields; the one we will focus on is the “text” field, which holds the content of each news item.

Let’s dive into the analysis!

Packages to Use

In this project, we will use the following Python packages to perform the text analysis and visualization tasks: pandas for loading and manipulating the dataset, NLTK for tokenization, stopword removal, stemming, and VADER sentiment analysis, matplotlib for plotting, and wordcloud for generating word clouds.

These packages provide a comprehensive set of tools to handle the entire pipeline of text analysis, from data preparation to visualization.

import pandas as pd

import nltk
from nltk import word_tokenize
nltk.download('punkt_tab')

from nltk.corpus import stopwords
nltk.download("stopwords")

from nltk.stem.porter import PorterStemmer

import matplotlib.pyplot as plt

from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download("vader_lexicon")

from wordcloud import WordCloud
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...

1. Preparing and Creating a Corpus

The first step in our analysis is to prepare the dataset and create a corpus. This involves loading the data, cleaning it, and formatting it into a structure that can be analyzed effectively.

df = pd.read_csv("financial-news.csv")
df.shape
(28276, 8)

As we can see, we have a total of 28,276 records and 8 features.
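Before deciding which fields to keep, it helps to take a quick look at the data itself. A minimal inspection sketch (no specific column names are assumed here):

# Inspect the available columns and a few sample rows
print(df.columns.tolist())
df.head()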

After reviewing the dataset, I will preserve only the “text” feature for the corpus. This is because it is the only feature that contains meaningful text data relevant for our analysis. Other features, while informative, are not directly related to the sentiment and content of the financial news.

corpus = df["text"]

After creating the corpus, it is essential to ensure that it does not contain any missing values (NaN). Missing values can interfere with text processing and lead to errors in subsequent analysis.

corpus[corpus.isna()]
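The check above returns no rows for this dataset, but if missing values had appeared, dropping them would be a straightforward fix; a minimal sketch:

# Drop any missing entries so they cannot break later text processing
corpus = corpus.dropna()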

With a clean, non-NaN corpus, the next step is to check for duplicate entries. Duplicate values can skew the analysis by over-representing certain text entries, so it is important to identify and handle them appropriately.

corpus[corpus.duplicated()]

2584 rows × 1 columns

After verifying for duplicates, we found that the corpus contains 2,584 duplicated values. To ensure the quality and diversity of the data, we will remove these duplicates.

corpus = corpus.drop_duplicates(keep="first")
corpus.shape
(25692,)

After removing duplicated values, we reduced the total number of entries in the corpus from 28,276 to 25,692. This remains a substantial amount of data for meaningful analysis.

2. Frequency of Text Lengths

Analyzing the frequency of text lengths in the corpus provides insights into the distribution of the data and helps identify patterns in the length of financial news articles or posts.

Before adding new features, it is essential to confirm the type of the corpus object to ensure compatibility with the operations we want to perform.

print(type(corpus))
<class 'pandas.core.series.Series'>

Since the corpus is currently a pandas.Series, we need to convert it into a pandas.DataFrame to add additional columns. We will use the apply method to calculate the length of each text entry and include it as a new column called doc_len.

corpus = pd.DataFrame({
    "text": corpus,
    "doc_len": corpus.apply(lambda x: len(x))
})

corpus.head(3)

25692 rows × 2 columns

With the doc_len column added to the corpus DataFrame, we can now explore the minimum and maximum lengths of the texts. This helps us understand the range of text lengths and identify outliers or unusual entries.

print("Min:", corpus["doc_len"].min())
print("Max:", corpus["doc_len"].max())
Min: 2
Max: 71521
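A histogram makes this spread easier to interpret than the raw minimum and maximum alone. A quick sketch with matplotlib (the bin count and the log scale are arbitrary choices, not part of the original analysis):

# Distribution of document lengths; a log y-scale helps with the long tail
fig, ax = plt.subplots(figsize=(15, 5))
ax.hist(corpus["doc_len"], bins=100)
ax.set_yscale("log")
ax.set_xlabel("Document length (characters)")
ax.set_ylabel("Number of documents")
plt.show()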

Identifying texts with lengths below 20 characters helps us examine potentially anomalous or overly concise entries. These texts might not carry meaningful information and may require additional consideration during analysis.

corpus[corpus["doc_len"] < 20]

129 rows × 2 columns

After reviewing the corpus, we found that the 129 entries shorter than 20 characters lack meaningful information. To maintain the quality of the dataset, we will keep only the texts longer than 20 characters.

corpus = corpus[corpus["doc_len"] > 20]
corpus.shape
(25557, 2)

3. Tokenization

Tokenization is the process of transforming text into meaningful lexical tokens. This step is essential because it breaks each document into units that can be counted, filtered for stopwords, stemmed, and scored for sentiment.

To perform tokenization, we will use NLTK (Natural Language Toolkit), a Python package designed for various NLP tasks, including tokenization.

corpus["tokens"] = corpus["text"].apply(lambda x: word_tokenize(x.lower()))
corpus.head(3)

25557 rows × 3 columns

Now that we have the tokenized representation of each text, we can calculate the number of tokens for each entry. This information will help us understand the token distribution across the corpus and identify any anomalies before proceeding with stopword removal.

corpus["number_tokens"] = corpus["tokens"].apply(lambda x: len(x))
corpus.head(3)

25557 rows × 4 columns

4. Cleaning Stopwords

Stopwords are common words (e.g., “the”, “and”, “is”) that do not contribute meaningful information to the analysis. Removing stopwords reduces noise in the dataset and improves the quality of insights derived from the text.

To streamline the process of removing stopwords, we will create a custom function that loads the NLTK stopword list for a given language and keeps only the tokens that do not appear in it. This function will then be applied to each row of the corpus.

def remove_stopwords(tokens, lang="english"):
  # Build a set for faster membership checks than a plain list
  stopwords_list = set(stopwords.words(lang))
  return [token for token in tokens if token not in stopwords_list]
corpus["tokens"] = corpus["tokens"].apply(lambda x: remove_stopwords(x))
corpus["number_tokens"] = corpus["tokens"].apply(lambda x: len(x))
corpus.head(3)

25557 rows × 4 columns

5. Stemming

Stemming is the process of reducing words to their base or root form. For example, “running” becomes “run” and “studies” becomes “studi”. Stemming helps normalize the text by grouping words with similar meanings under a common root.
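As a quick sanity check of the examples above (the two words are just illustrative inputs):

# Verify the stemming examples mentioned above
demo_stemmer = PorterStemmer()
print(demo_stemmer.stem("running"))  # run
print(demo_stemmer.stem("studies"))  # studi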

porter_stemmer = PorterStemmer()

def stem_tokens(tokens):
  return [porter_stemmer.stem(token) for token in tokens]
corpus["tokens"] = corpus["tokens"].apply(lambda x: stem_tokens(x))
corpus.head(3)

25557 rows × 4 columns

6. Data Visualization

6.1 Frequency Plot

With the text data fully processed, we can now explore the corpus using various visualizations. These visualizations help uncover patterns, trends, and key insights in the dataset.

To understand the most frequently occurring words in the corpus, we will analyze the 20 most repeated tokens. This helps identify key themes and topics in the dataset.

corpus_tokens = list(corpus["tokens"])
tokens_count = {}

for document_tokens in corpus_tokens:
  for token in document_tokens:
    if token in tokens_count:
      tokens_count[token] = tokens_count[token] + 1
    else:
      tokens_count[token] = 1

ordered_tokens_count = sorted(tokens_count.items(), key=lambda x: x[1], reverse=True)
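# Side note (not part of the original pipeline): collections.Counter would produce
# the same ordered (token, count) pairs more concisely, e.g.:
#   from collections import Counter
#   ordered_tokens_count = Counter(
#       token for document_tokens in corpus_tokens for token in document_tokens
#   ).most_common()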
def plot_tokens_frequency(tokens):
  # Bar chart of (token, count) pairs
  fig, ax = plt.subplots(figsize=(15, 5))

  labels = [item[0] for item in tokens]
  counts = [item[1] for item in tokens]

  ax.bar(labels, counts)
  plt.show()
plot_tokens_frequency(ordered_tokens_count[0:20])

Based on the bar chart:

Dollar Sign ($): the dollar sign ranks among the most frequent tokens, which is expected in financial posts that quote prices and tickers.

http: the high frequency of “http” shows that many entries embed links, which adds noise rather than meaning.

Punctuation and Special Characters: word_tokenize keeps punctuation as tokens, so several of the top entries are punctuation marks and special characters rather than words.

Financial Lexical Tokens: beyond this noise, market-related vocabulary already appears among the most frequent tokens.

To gain a deeper understanding of the corpus, we will analyze the next 20 most repeated tokens, excluding the first 20 that we already examined. This will help uncover additional patterns and frequent terms that may not have been immediately apparent.

plot_tokens_frequency(ordered_tokens_count[20:40])

From the new chart, we observe tokens such as “hold”, “buy”, and “high”, which suggest an inclination toward upward financial trends. In contrast, “short” could represent downward trends or a bearish sentiment. To validate this hypothesis, we will apply sentiment analysis to the corpus.

6.2 Sentiment Analysis

Sentiment analysis helps us determine the emotional tone or polarity of the text data. This step is particularly useful for financial datasets to understand whether the overall sentiment is positive, negative, or neutral. It also helps uncover sentiment trends associated with specific tokens like “buy”, “hold”, or “short”.

To enhance the analysis, we will classify each text entry in the corpus as Positive, Negative, or Neutral based on its sentiment polarity score. For this, we will define a function that uses the VADER sentiment analyzer to calculate the sentiment score and assign a label accordingly.

sentiment_analyzer = SentimentIntensityAnalyzer()

def sentiment_classification(text):
  result = sentiment_analyzer.polarity_scores(text)["compound"]
  if result >= 0.5:
    return "Positive"
  elif result <= -0.5:
    return "Negative"
  else:
    return "Neutral"
corpus["sentiment"] = corpus["tokens"].apply(lambda x: sentiment_classification(" ".join(x)))
corpus.head(3)

25557 rows × 5 columns
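For reference, this is what the raw VADER output looks like for a single string. The headline below is a made-up example; polarity_scores returns a dictionary with 'neg', 'neu', 'pos', and 'compound' entries:

# Raw VADER scores for a single (hypothetical) headline
example = "company shares surge after strong earnings report"
print(sentiment_analyzer.polarity_scores(example))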

With the sentiment classification added to the corpus, we can now create a visualization to see the distribution of sentiments (Positive, Negative, Neutral) across the dataset. This helps us understand the overall emotional tone of the financial news corpus.

fig, ax = plt.subplots(figsize=(15, 5))

sentiments = corpus["sentiment"].value_counts()

labels = sentiments.index
sizes = sentiments.values

ax.pie(sizes, labels=labels, autopct="%1.1f%%")

plt.show()

The pie chart reveals that the predominant sentiment in the dataset is Neutral, followed by Positive, with Negative being the least frequent. This distribution offers significant insights:

Neutral Sentiment (88.9%): the large majority of entries are neutral, consistent with the factual, report-like tone of much financial news.

Positive Sentiment (8.9%): a smaller share of entries carries clearly positive language, in line with frequent tokens such as “buy” and “high”.

Negative Sentiment (2.1%): negative entries are the least common, suggesting that explicitly bearish language is relatively rare in this corpus.

6.3 Word Cloud

A Word Cloud is a visually engaging way to represent the frequency of words in a text corpus. Words that appear more frequently are displayed in larger fonts, allowing us to quickly identify dominant themes or topics.

def display_word_cloud(corpus, stopwords=[], color="white"):
  stopwords = set(stopwords)
  wordcloud = WordCloud(
      background_color=color,
      stopwords=stopwords,
      max_words=100,
      max_font_size=50,
      random_state=1,
      collocations=False
  )

  wordcloud = wordcloud.generate(str(corpus))

  fig, ax = plt.subplots(figsize=(15, 15))
  plt.imshow(wordcloud)
  plt.axis("on")
  plt.show()
corpus_string = " ".join(token for vector_tokens in corpus["tokens"] for token in vector_tokens)
display_word_cloud(corpus_string)

The Word Cloud provides valuable insights, but it also contains noise, such as tokens like “http” and other terms that may not be relevant to our analysis. To refine the visualization, we can pass a custom list of stopwords to exclude these terms.

display_word_cloud(corpus_string, ["http", "co", "com", "t", "inc", "s", "u"])

The refined Word Cloud highlights a cleaner representation of the most frequently used words in the financial news corpus. By removing irrelevant tokens like “http”, “com”, and others, the visualization provides more meaningful insights.

Key Observations:

Dominant Words: the largest words in the cloud correspond to the high-frequency tokens identified earlier, with market- and trading-related terms standing out.

Emerging Themes: the vocabulary centers on companies, markets, and trading activity, confirming the financial focus of the corpus.

Directional Sentiments: terms like “buy”, “hold”, and “high” appear alongside “short”, echoing the mix of bullish and bearish language observed in the frequency analysis.

7. Conclusions

The analysis of the financial news dataset provided valuable insights into the sentiment and thematic composition of the data. Here are the main conclusions and potential actions one can take based on these findings:

Key Findings:

Predominantly Neutral Sentiment: with 88.9% of entries classified as neutral, most financial news in this dataset is written in a factual, non-emotional tone.

Presence of Positive and Negative Sentiment: positive entries (8.9%) clearly outnumber negative ones (2.1%), pointing to a mildly optimistic overall tone.

Frequent Financial Lexicon: tokens such as “buy”, “hold”, “high”, and “short” dominate the vocabulary, reflecting trading-oriented language.

Potential Actions:

Market Strategy Development: sentiment signals extracted from news can be combined with other indicators to inform trading or investment strategies.

Trend Monitoring: tracking how the sentiment distribution shifts over time can help detect changes in market mood early.

Content Creation and Communication: understanding the dominant lexicon and tone helps tailor financial communication and content to the audience.

Data-Driven Decision Making: grounding decisions in measurable sentiment and frequency data reduces reliance on intuition alone.

Final Takeaway:

This analysis demonstrates how businesses, investors, and market participants can leverage sentiment analysis and financial lexicon insights to make informed decisions. Whether it’s identifying growth opportunities, mitigating risks, or understanding market trends, the processed data provides actionable intelligence to stay competitive in dynamic financial markets.

Contact

Feel free to reach out through my social media.