Data Management
Fire provides a user-friendly interface for downloading and managing financial data. By leveraging the pre-cleaned and processed data pipeline from the Fire Institute, you can focus more on research and modeling rather than data preparation.
Currently, Fire only provides data for the Chinese A-share market. We will provide more data in the future.
Download Data
We provide a simple command-line interface to download the data. You can use the following command to download the data:
firefin download
This command downloads the latest data from the Fire Institute and stores it in the ~/.fire/data/raw directory. All data is stored in Feather format: because the data is not updated frequently, we chose Feather for its fast read/write speed. We may consider other databases or key-value stores in the future.
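Once the download has finished, a dataset can be read back with pandas. The sketch below is a minimal example, assuming the files sit directly under ~/.fire/data/raw and use the data_name.feather naming; the helpers raw_path and read_raw are hypothetical and are not part of the Firefin API. Reading Feather files with pandas requires pyarrow to be installed.

```python
from pathlib import Path

import pandas as pd

# Default location used by `firefin download`, per the text above.
RAW_DIR = Path.home() / ".fire" / "data" / "raw"


def raw_path(data_name: str, root: Path = RAW_DIR) -> Path:
    """Build the path of a dataset file (hypothetical helper)."""
    return root / f"{data_name}.feather"


def read_raw(data_name: str, root: Path = RAW_DIR) -> pd.DataFrame:
    """Read one dataset into a DataFrame (hypothetical helper)."""
    path = raw_path(data_name, root)
    if not path.exists():
        raise FileNotFoundError(f"{path} not found; run `firefin download` first")
    # pandas delegates Feather I/O to pyarrow.
    return pd.read_feather(path)
```

Calling read_raw with the name of a downloaded dataset returns the full table in one read, which is where Feather's read speed pays off.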
Load Data
If you have downloaded the data manually or received it from another source, you can use the following command to load it into the Firefin system:
firefin load <file_path>
This command will extract the contents of the provided tar file and place them in the appropriate directory within the Firefin system.
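Before loading an archive from another source, it can be useful to check what it contains. The following is a minimal sketch using Python's standard tarfile module; list_archive is a hypothetical helper, not part of Firefin, and it makes no assumption about the layout Firefin expects inside the archive.

```python
from __future__ import annotations

import tarfile
from pathlib import Path


def list_archive(file_path: str | Path) -> list[str]:
    """Return the member names of a tar archive without extracting it."""
    # Mode "r:*" lets tarfile detect the compression (none, gz, bz2, xz).
    with tarfile.open(file_path, mode="r:*") as archive:
        return archive.getnames()
```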
Data Structure
The data is organized in a structured format to facilitate easy access and manipulation. Here is an overview of the data structure:
Date | security1 | security2 | security3 | … | securityN |
---|---|---|---|---|---|
2023-01-01 | 10.5 | 10.7 | 10.8 | … | 10.9 |
2023-01-02 | 10.6 | 10.8 | 10.9 | … | 11.1 |
… | … | … | … | … | … |
2023-12-31 | 11.0 | 11.2 | 11.3 | … | 11.4 |
- All data is stored in a single Feather file named data_name.feather.
- Each row represents a date.
- Each column represents a security, identified by its ticker symbol.
- The values in the cells represent the closing prices of the securities on the corresponding dates.
- The index (dates) and columns (securities) are exactly the same across all datasets of a given market, for example the Chinese A-share market.
With the above structure, you can easily perform time-series analysis, portfolio optimization, and other financial analyses without having to think about data alignment.
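To illustrate why the shared index and columns matter, here is a small sketch with two toy frames. The dataset names close and volume and all values are made up for illustration; they are not taken from the Fire data.

```python
import pandas as pd

dates = pd.to_datetime(["2023-01-03", "2023-01-04", "2023-01-05"])
tickers = ["security1", "security2"]

# Two toy datasets with identical index (dates) and columns (securities).
close = pd.DataFrame(
    [[10.0, 20.0], [11.0, 19.0], [12.1, 19.0]], index=dates, columns=tickers
)
volume = pd.DataFrame(
    [[100, 200], [150, 100], [120, 300]], index=dates, columns=tickers
)

# Because the labels match, element-wise arithmetic needs no reindexing or joins.
turnover = close * volume

# Time-series operations run down each column (one security at a time).
returns = close.pct_change()

# Cross-sectional operations run across each row (one date at a time).
equal_weight_return = returns.mean(axis=1)
```

If the frames did not share labels, pandas would align them on the union of the indexes and fill the gaps with NaN, which is the alignment problem this layout avoids.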
Current Data Structure Limitations
- Inclusion of Delisted Securities: The dataset retains entries for delisted securities, though their corresponding fields are populated with zeroes. While this poses limited disruption due to the low incidence of delistings in the A-share market, it may introduce unnecessary complexity for users unaware of this convention.
- Unstructured Security Sequencing: The “Securities” column does not follow a predetermined order (e.g., listing date or ticker sequence). This lack of inherent sorting logic necessitates external reference to the ListingData index for proper chronological organization.
- Ambiguous Naming Conventions: The dataset’s frame naming relies on user-defined labels, which can lead to confusion when shared across teams or workflows. Standardized naming protocols would enhance interpretability and maintain consistency with industry conventions.
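The first two limitations can be worked around on the user side. The sketch below uses a toy price frame; the tickers, values, and the listing_date Series are hypothetical, with listing_date standing in for whatever ListingData actually provides. Replacing zeros with NaN assumes that zero is never a legitimate value in the dataset, which is reasonable for prices but not for every variable.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2023-01-03", "2023-01-04", "2023-01-05"])

# Toy price frame: "C" is delisted after the first day, so its later
# values are zero, and the columns are in no particular order.
prices = pd.DataFrame(
    {"C": [5.0, 0.0, 0.0], "A": [10.0, 10.5, 10.4], "B": [20.0, 19.8, 20.1]},
    index=dates,
)

# Treat zeros as missing so they do not distort means or returns.
# Assumption: zero is never a valid value here (true for prices).
clean = prices.replace(0.0, np.nan)

# Give the columns a deterministic order by sorting on ticker.
clean = clean.sort_index(axis=1)

# Hypothetical listing dates, used to order securities chronologically.
listing_date = pd.Series(
    pd.to_datetime(["2001-03-01", "2000-06-01", "1999-01-01"]),
    index=["A", "B", "C"],
)
by_listing = clean[listing_date.sort_values().index]
```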
More On Data Structure
Currently, we have two types of data structure:
- pd.DataFrame stands for (Time × Stock) data.
- pd.Series stands for the time series of a market variable.

These two are enough for most cases. For transaction or order-by-order data, however, we have no choice but to use a pd.DataFrame whose index is (Time × Stock) and whose columns are the variable names.
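The three layouts described above can be sketched with plain pandas objects. All names and values below are illustrative, and the use of a MultiIndex with levels named time and stock is an assumption about how the (Time × Stock) index would be represented.

```python
import pandas as pd

dates = pd.to_datetime(["2023-01-03", "2023-01-04"])
tickers = ["security1", "security2"]

# 1) (Time x Stock) DataFrame: one value per date and security.
close = pd.DataFrame([[10.5, 10.7], [10.6, 10.8]], index=dates, columns=tickers)

# 2) Series: one value per date for a market-wide variable (toy values).
market_return = pd.Series([0.001, -0.002], index=dates, name="market_return")

# 3) Long format for transaction-like data: a (time, stock) MultiIndex on the
#    rows and one column per variable. A security can have many rows per day.
trades = pd.DataFrame(
    {"price": [10.5, 10.6, 10.7], "volume": [100, 200, 300]},
    index=pd.MultiIndex.from_tuples(
        [
            (pd.Timestamp("2023-01-03 09:30:00"), "security1"),
            (pd.Timestamp("2023-01-03 09:30:01"), "security1"),
            (pd.Timestamp("2023-01-03 09:30:00"), "security2"),
        ],
        names=["time", "stock"],
    ),
)

# Select every trade for one security from the long frame.
security1_trades = trades.xs("security1", level="stock")

# Aggregate a variable per security.
volume_by_stock = trades.groupby(level="stock")["volume"].sum()
```

The long layout trades the automatic alignment of the wide frames for the ability to hold irregular timestamps and several variables per observation.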
If the data is periodic in time but its values are not float/int/bool, we may need to design a new data structure, such as an n-dimensional array, to store it.