Data Extraction from Financial Documents

Accelex’s information extraction engine converts unstructured documents into structured data that can be used to create actionable insights for our clients. A portfolio’s exposure by sector is a key data point for much of the downstream analysis performed in a portfolio management or analytics system.

When reporting investment performance, one asset manager might describe the sector of an investment as ‘Software Services’, while another may use the term ‘Information Technology’. From an investor’s perspective, these refer to the same thing. It therefore becomes necessary to map these free-text descriptions to the same categorical representation in order to, for example, calculate the exposure-weighted multiple on invested capital (MOIC) for all portfolio assets in the technology sector. Performing this classification manually is costly, time-consuming and subjective. Combining advanced data science techniques with deep domain knowledge to automate and interpret private markets data is at the heart of what Accelex does. It was therefore a natural step to add automatic sector classification to our data-extraction workflow, helping our clients obtain clean, structured data straight from their investment documents.
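
To make the motivating calculation concrete, here is a minimal sketch of an exposure-weighted MOIC per sector once free-text sectors have been mapped to a common category. The holdings, field names and the function `exposure_weighted_moic` are hypothetical, and weighting each asset's MOIC by its exposure is an assumption about how the metric is defined; this is not Accelex's implementation.

```python
from collections import defaultdict

# Hypothetical holdings: the free-text sector as reported by the asset manager,
# the category it was mapped to, the exposure, and the MOIC of the asset.
holdings = [
    {"reported": "Software Services", "category": "Information Technology",
     "exposure": 40.0, "moic": 2.0},
    {"reported": "Information Technology", "category": "Information Technology",
     "exposure": 60.0, "moic": 1.5},
    {"reported": "Healthcare", "category": "Health Care",
     "exposure": 50.0, "moic": 1.2},
]


def exposure_weighted_moic(rows):
    """Return {category: sum(exposure * moic) / sum(exposure)} (assumed definition)."""
    weighted = defaultdict(float)
    exposure = defaultdict(float)
    for row in rows:
        weighted[row["category"]] += row["exposure"] * row["moic"]
        exposure[row["category"]] += row["exposure"]
    # Skip categories with zero total exposure to avoid dividing by zero.
    return {cat: weighted[cat] / exposure[cat] for cat in weighted if exposure[cat] > 0}


result = exposure_weighted_moic(holdings)
for category, moic in sorted(result.items()):
    print(f"{category}: {moic:.2f}x")
```

Without the shared category, the two technology holdings would be aggregated separately under their differing free-text labels, which is the problem the mapping is meant to avoid.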

The following sections present the approach Accelex uses to map extracted sectors to the Global Industry Classification Standard (GICS).

Global Industry Classification Standard

The Global Industry Classification Standard, created and maintained by Morgan Stanley Capital International (MSCI) and Standard & Poor’s (S&P), is the most widely used sector classification standard in the finance industry. 

The GICS nomenclature is organized into four hierarchical levels, each represented by unique numerical codes. As of the 2018 structure [1], it includes:

  • 11 sectors: 2-digit codes
  • 24 industry groups: 4-digit codes
  • 69 industries: 6-digit codes
  • 158 sub-industries: 8-digit codes

Figure: Example of GICS hierarchical representation [1]

This standard is updated regularly to ensure it accurately reflects the composition of global markets. The classification of a company may also change from time to time. Typically, this occurs when the main activity of a company changes, following a corporate action, for example.  
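
The code lengths listed above are enough to tell which level a GICS code belongs to, and each code extends its parent's code by two digits. A minimal sketch of that structure follows. The function names `gics_level` and `parent_code` are illustrative, and the check is deliberately limited to the shape of the code: because the standard is updated regularly, it does not verify that a code exists in any particular GICS version.

```python
# Number of digits at each level of the hierarchy.
LEVELS = {2: "sector", 4: "industry group", 6: "industry", 8: "sub-industry"}


def gics_level(code: str) -> str:
    """Return the hierarchy level implied by the length of a GICS code."""
    if not (code.isascii() and code.isdigit()) or len(code) not in LEVELS:
        raise ValueError(f"not a well-formed GICS code: {code!r}")
    return LEVELS[len(code)]


def parent_code(code: str):
    """Return the code one level up, or None if the code is already a sector."""
    gics_level(code)  # validates the shape of the code
    return code[:-2] if len(code) > 2 else None


print(gics_level("451020"))     # level of a 6-digit code
print(parent_code("45102010"))  # 8-digit sub-industry code -> its industry
```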

Sector Classification

Accelex’s sector classification approach consists of automatically mapping the free-text sectors extracted from a given document to the relevant GICS category. This mapping is performed using a state-of-the-art NLP model trained on millions of sentence pairs. The model treats each sector description as a semantically meaningful piece of text, so that differently worded descriptions of the same sector can be mapped to the same category, and it does so with considerable computational efficiency.
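
The general idea of mapping free text to the closest category can be sketched as a nearest-label search: turn the extracted text and each candidate label into a vector, then pick the label with the highest cosine similarity. The sketch below is not the Accelex model. As a loudly stated assumption, character-trigram counts stand in for the learned sentence representations so that the example runs with the standard library alone, and the candidate list is a tiny hand-picked subset of GICS industries. All function names are hypothetical.

```python
import math
from collections import Counter

# Tiny hand-picked subset of GICS industries, for illustration only.
GICS_INDUSTRIES = {
    "351020": "Health Care Providers & Services",
    "451020": "IT Services",
    "451030": "Software",
}


def embed(text: str) -> Counter:
    """Stand-in for a learned embedding: character-trigram counts of normalized text."""
    normalized = " ".join(text.lower().replace("&", " and ").split())
    padded = f" {normalized} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def classify(free_text: str):
    """Return (code, name, similarity) of the most similar candidate industry."""
    query = embed(free_text)
    scores = {code: cosine(query, embed(name)) for code, name in GICS_INDUSTRIES.items()}
    best = max(scores, key=scores.get)
    return best, GICS_INDUSTRIES[best], scores[best]


print(classify("healthcare providers & servicers"))
```

A trigram stand-in only captures surface overlap, so it would not relate ‘Software Services’ to ‘Information Technology’; handling such cases is what a model trained on sentence pairs is for.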

The model we use classifies the extracted sectors at the GICS industry level. Because GICS is hierarchical, each industry belongs to exactly one industry group and one sector, so the classification at these higher levels follows directly from the predicted industry.
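
As a minimal sketch of this roll-up, the industry group and sector can be read off the first four and first two digits of the six-digit industry code. The name table below covers only the two industries used as examples in this article, with names taken from the public GICS structure; the function `roll_up` is hypothetical and not part of the Accelex platform.

```python
# Partial lookup table: only the codes needed for the two example industries.
GICS_NAMES = {
    "35": "Health Care",
    "3510": "Health Care Equipment & Services",
    "351020": "Health Care Providers & Services",
    "45": "Information Technology",
    "4510": "Software & Services",
    "451020": "IT Services",
}


def roll_up(industry_code: str) -> dict:
    """Map a 6-digit industry code to its industry, industry group and sector."""
    if len(industry_code) != 6 or not industry_code.isdigit():
        raise ValueError(f"expected a 6-digit industry code, got {industry_code!r}")
    codes = {
        "industry": industry_code,
        "industry group": industry_code[:4],  # first four digits
        "sector": industry_code[:2],          # first two digits
    }
    # A KeyError here means the code is missing from the partial table above.
    return {level: (code, GICS_NAMES[code]) for level, code in codes.items()}


for level, (code, name) in roll_up("451020").items():
    print(f"{level}: {code} {name}")
```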

Extracted Sector | GICS Industry | GICS Industry Group | GICS Sector
healthcare providers & servicers | … | healthcare equipment & services | …
IT Services | 451020 IT Services | software and services | information technology
Example of sector classification results


Diversity in investment reporting makes it necessary to classify sectors according to a well-defined standard. To this end, Accelex’s data acquisition platform now automatically maps extracted sectors to a GICS category using machine learning techniques. Using the Accelex platform helps our clients derive greater value from their investment data by building a standardized dataset from investment documents, which can act as the single source of truth for investment decisions and downstream reporting.


[1] S&P Global Market Intelligence | MSCI. (2018). Global Industry Classification Standard. 


Written by Dr. Jihene Younes.
