Researchers frequently need to retrieve data on thousands of companies from several sources and combine the results. Each business database that lists securities (companies, bonds, options, etc.) uses their own identifier code. For example, Datastream uses DS Code, Compustat uses GVKEY. They record the security name (and some kind of ticker symbol), but this is often slightly different between each supplier, as shown in the following table.
|Database||Name of Amazon.com||Ticker/mnemonic symbol|
|Bloomberg||Amazon.com Inc||AMZN US Equity|
|Capital IQ||Amazon.com, Inc.||NasdaqGS:AMZN|
The names can vary many ways, such as punctuation, the presence of Inc/Plc, and whether numbers are written as letters. (We do not care about case differences.) The ticker symbol can use different conventions for the exchange code (before, after, different punctuation and so on).
If the names do not match exactly, and we don’t want to go to the trouble of manual matching or automated fuzzy matching, then we can try to cross reference using different IDs.
There is some overlap between the identifiers across different databases. Sometimes you can choose to output a selection of identifiers for a given list of securities, one of which will be used by the second database you wish to retrieve data from. Sometimes you can read from a wider selection of identifiers than a database’s user interface first suggests.
I am working on a “living document” which is mainly a table of business databases against a list of popular identifier types. A sample of this is shown in the table below.
The key is as follows.
- Green: easy to use, often good coverage (but not always)
- Amber: limited coverage or difficult to use, may require additional work
- Red: not available, or coverage too limited to be of use
- Proprietary: a code generated by that database provider, unique (but may change with time)
- Ticker style: Many databases have an ID type called ‘ticker’ or ‘something ticker’; in the case of the former, I named it myself in brackets, e.g. (CIQ) Ticker
- * may give more than one result (for example, multiple bonds or share classes for a company) (incomplete)
- † values are prefixed with other characters (e.g. a GVKEY value in Capital IQ begins “GV_” before the number)
- (#) number in brackets denotes number of digits in the identifier
Some other concerns
A table like this could never represent all the detail. Some other factors to consider are time, uniqueness and security type.
Time: names of companies can change with time and so can some other identifiers (tickers and CUSIP, to name two). It is better to use an identifier that does not change with time, such as ISIN.
Uniqueness: an identifier is not always unique within one platform; also it could be recycled to a different company at a different date.
Private firms and governments: different identifiers may be required when looking at different types of security, such as quoted and private companies.
People: company directors data such as remuneration and employment history are available from several sources such as Bloomberg, BoardEx, Capital IQ, ExecuComp and OptionMetrics. I have started to develop a comparison table for this, supplementing the company table.
Transactions: mergers and acquisitions, IPOs and other deals are identified separately too, in sources such as Bloomberg, Capital IQ, SDC Platinum and ThomsonONE.com. I have started to develop a table for comparing these identifiers, although this looks much more difficult.
Conversion tools: there are tools that exist accompanying some of the database platforms to convert between various pairs of identifiers and I have built prototypes of my own (see post on using Market CDS for one example). I am compiling a list of these.
Fuzzy matching: this major alternative to identifier matching has been used extensively in research and I have already begun investigating it, with help from other researchers. More on this in later posts.