Datasets

Available Datasets

| Name | Domain | Granularity | Variates | Clients (max) | Samples | CV (mean±std) | URL |
|---|---|---|---|---|---|---|---|
| BaseStation5G | Communication | 2 minutes | 11 | 3 | 9_004±5_018 | | Github |
| BeijingAirQuality | Environment | 1 hour | 11 | 12 | 31_847±981 | 0.93±0.03 | UCI |
| CitiesILI | Healthcare | 1 week | 1 | 122 | | | Github |
| COVID19Cases | Healthcare | 1 day | 10 | 55 | | | Github |
| CryptoDataDownloadDay | Economic | 1 day | 4 | | | | CDD |
| CryptoDataDownloadHour | Economic | 1 hour | 4 | | | | CDD |
| CryptoDataDownloadMinute | Economic | 1 minute | 4 | | | | CDD |
| ETTh1 | Energy | 1 hour | 7 | 1 | 14_400±0 | 0.74±0.29 | Github |
| ETTh2 | Energy | 1 hour | 7 | 1 | 14_400±0 | 0.74±0.29 | Github |
| ETDatasetHour | Energy | 1 hour | 7 | 2 | 14_400±0 | 0.74±0.29 | Github |
| ETTm1 | Energy | 15 minutes | 7 | 1 | 57_600±0 | | Github |
| ETTm2 | Energy | 15 minutes | 7 | 1 | 57_600±0 | | Github |
| ETDatasetMinute | Energy | 15 minutes | 7 | 2 | 57_600±0 | | Github |
| Electricity | Energy | 15 minutes | 1 | 321 | 26_304±0 | 0.41±0.28 | Github |
| ElectricityLoadDiagrams | Energy | 15 minutes | 1 | 370 | 140_256±0 | | UCI |
| ExchangeRate | Economic | 1 day | 1 | 8 | 7_588±0 | | Github |
| METRLA | Traffic | 5 minutes | 1 | 207 | 34_272±0 | | Github |
| MekongSalinity | Environment | 1 day | 1 | 38 | 1_500±953 | 0.90±0.40 | Springer |
| PeMS03 | Traffic | 5 minutes | 1 | 358 | 26_208±0 | | Github |
| PeMS04 | Traffic | 5 minutes | 1 | 307 | 16_992±0 | | Github |
| PeMS07 | Traffic | 5 minutes | 1 | 883 | 28_224±0 | | Github |
| PeMS08 | Traffic | 5 minutes | 3 | 170 | 17_856±0 | | Github |
| PeMSBAY | Traffic | 5 minutes | 1 | 325 | 52_116±0 | | Github |
| PeMSSF | Traffic | 10 minutes | 1 | 963 | 63_345±0 | | UCI |
| SolarCSGREGFC | Energy | 15 minutes | 5 | 8 | 63_852±16_443 | | Github |
| SolarEnergy | Energy | 1 hour | 1 | 137 | 52_560±0 | 1.46±0.04 | Github |
| StatesILI | Healthcare | 1 week | 1 | 37 | | | Github |
| TetouanPowerConsumption | Energy | 10 minutes | 1 | 3 | 52_416±0 | | UCI |
| Traffic | Traffic | 1 hour | 1 | 862 | 17_544±0 | 0.81±0.22 | Github |
| TinyWeather5K | Environment | 1 hour | 5 | 200 | 87_648±0 | 0.57±0.22 | Github |
| Weather5K | Environment | 1 hour | 5 | 5_672 | | | Github |
| WindCSGREGFC | Energy | 15 minutes | 10 | 6 | 70_146±66 | | Github |

Note: The actual number of clients is determined after the data is split, because clients with insufficient data (those that cannot form at least 10 samples) are discarded. Clients (max) is the upper bound on the number of clients.
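The filtering rule in the note can be sketched as follows. This is a minimal illustration, not the repository's code: it assumes that a "sample" is one sliding window of input_len + output_len consecutive steps taken with stride 1, and the names count_windows and keep_clients are hypothetical.

```python
def count_windows(series_len: int, input_len: int, output_len: int) -> int:
    """Number of (input, target) windows a series of this length can form.

    Assumes stride-1 sliding windows; the real pipeline may differ.
    """
    return max(0, series_len - input_len - output_len + 1)


def keep_clients(series_lengths: dict, input_len: int, output_len: int,
                 min_samples: int = 10) -> list:
    """Return the ids of clients that can form at least `min_samples` windows."""
    return [
        client_id
        for client_id, length in series_lengths.items()
        if count_windows(length, input_len, output_len) >= min_samples
    ]


# Hypothetical per-client series lengths after splitting.
lengths = {"client_0": 500, "client_1": 120, "client_2": 129}
kept = keep_clients(lengths, input_len=96, output_len=24)
print(kept)  # ['client_0', 'client_2']
```

Here client_1 can form only one window (120 - 96 - 24 + 1), so it is dropped, while client_2 forms exactly 10 and is kept.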


Usage

1. Single Dataset with Single Configuration

To use a dataset in your experiment (default scenario), specify the dataset name with the --dataset argument when running your training or analysis scripts.

Example:

python main.py --dataset=ETTh1

You can also set other related arguments such as --input_len, --output_len, and --batch_size to control the window size, forecast horizon, and batch size for your experiment.

Example:

python main.py --dataset=SolarEnergy --input_len=168 --output_len=24 --batch_size=64

All clients will use the same configuration as specified above.
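To make the roles of --input_len and --output_len concrete, the sketch below cuts a series into input and target windows. It is an illustration under assumptions: make_windows is a hypothetical helper, the array layout (time, variates) and the stride of 1 are assumed, and the project's own preprocessing may differ.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(series: np.ndarray, input_len: int, output_len: int):
    """Split a (time, variates) array into input and target windows."""
    total = input_len + output_len
    # sliding_window_view appends the window axis last: (n_windows, variates, total)
    windows = sliding_window_view(series, window_shape=total, axis=0)
    windows = windows.transpose(0, 2, 1)  # -> (n_windows, total, variates)
    x = windows[:, :input_len, :]   # history the model sees
    y = windows[:, input_len:, :]   # horizon the model must forecast
    return x, y


series = np.arange(20, dtype=np.float32).reshape(10, 2)  # 10 steps, 2 variates
x, y = make_windows(series, input_len=4, output_len=2)
print(x.shape, y.shape)  # (5, 4, 2) (5, 2, 2)
```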

Refer to the table above for available dataset names and their details.


2. Single Dataset with Multiple Configurations

Different clients from the same dataset may have different configurations (e.g., different output lengths or channels).

Examples:

- PeMS08OutVar1: 75% of clients have output_len=96, 25% have output_len=720.

python main.py --dataset=PeMS08OutVar1
See: data_factory/PeMS08.py/PeMS08OutVar1

- PeMS08OutVar2: 50% of clients have output_len=96, 50% have output_len=720.

python main.py --dataset=PeMS08OutVar2
See: data_factory/PeMS08.py/PeMS08OutVar2

- PeMS08OutVar3: 25% of clients have output_len=96, 75% have output_len=720.

python main.py --dataset=PeMS08OutVar3
See: data_factory/PeMS08.py/PeMS08OutVar3

- Customized2: 50% of clients have 1 output channel and 1 input channel, 50% have 7 output channels and 7 input channels.

python main.py --dataset=Customized2
See: data_factory/Customized.py/Customized2
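One way to picture a proportional split such as PeMS08OutVar1 is shown below. This is a sketch under assumptions, not the code in data_factory/PeMS08.py: assign_output_lens is a hypothetical helper, and how the repository rounds fractions or orders clients is not stated in this document.

```python
def assign_output_lens(num_clients: int, shares: list) -> list:
    """Assign an output_len to each client from (fraction, output_len) pairs.

    Every group except the last gets floor(fraction * num_clients) clients;
    the last group takes the remainder so the total always matches.
    """
    assigned = []
    for fraction, output_len in shares[:-1]:
        assigned += [output_len] * int(fraction * num_clients)
    assigned += [shares[-1][1]] * (num_clients - len(assigned))
    return assigned


# PeMS08 lists 170 clients (max) in the dataset table.
lens = assign_output_lens(170, [(0.75, 96), (0.25, 720)])
print(lens.count(96), lens.count(720))  # 127 43
```

Because 75% of 170 is not a whole number, some rounding rule is unavoidable; the floor-plus-remainder rule here is only one possible choice.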


3. Multi-task / Multi-dataset

Merge multiple datasets so that each client belongs to exactly one dataset. This is useful for multi-task learning or federated learning across different domains.

Example:
- Customized1: Merges ETDatasetHour (2 clients), TetouanPowerConsumption (3 clients), SolarEnergy (137 clients), Electricity (321 clients) for a total of 463 clients.

python main.py --dataset=Customized1
See: data_factory/Customized.py/Customized1
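The merge can be thought of as flattening per-dataset client lists into one global list. The sketch below only illustrates that idea; the Client class and merge_datasets function are hypothetical and are not taken from data_factory/Customized.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Client:
    dataset: str  # the single dataset this client belongs to
    index: int    # client index within that dataset


def merge_datasets(client_counts: dict) -> list:
    """Flatten per-dataset client counts into one global client list."""
    return [
        Client(dataset=name, index=i)
        for name, count in client_counts.items()
        for i in range(count)
    ]


# Client counts taken from the Customized1 description.
clients = merge_datasets({
    "ETDatasetHour": 2,
    "TetouanPowerConsumption": 3,
    "SolarEnergy": 137,
    "Electricity": 321,
})
print(len(clients))  # 463
```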


4. Real / Customized (Heterogeneous Configurations)

Merge multiple datasets, each with potentially different configurations per dataset or client.

Example:
- Customized3: Merges ETDatasetHour (2 clients, output_len=96), TetouanPowerConsumption (3 clients, output_len=192).

python main.py --dataset=Customized3
See: data_factory/Customized.py/Customized3
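A heterogeneous setup like Customized3 can be described as shared defaults plus per-dataset overrides. The following is a sketch under assumptions: the DEFAULTS values, the dictionary layout, and build_client_configs are hypothetical, and only the client counts and output_len values come from the description.

```python
# Hypothetical shared defaults; the project's real defaults are not stated here.
DEFAULTS = {"input_len": 96, "batch_size": 32}

# Per-dataset settings taken from the Customized3 description.
DATASETS = {
    "ETDatasetHour": {"clients": 2, "output_len": 96},
    "TetouanPowerConsumption": {"clients": 3, "output_len": 192},
}


def build_client_configs(datasets: dict, defaults: dict) -> list:
    """Expand per-dataset specs into one config dict per client."""
    configs = []
    for name, spec in datasets.items():
        overrides = {k: v for k, v in spec.items() if k != "clients"}
        for i in range(spec["clients"]):
            # Later keys win, so dataset-level overrides replace defaults.
            configs.append({**defaults, **overrides, "dataset": name, "client": i})
    return configs


configs = build_client_configs(DATASETS, DEFAULTS)
print([c["output_len"] for c in configs])  # [96, 96, 192, 192, 192]
```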


Note:
- For all scenarios, you can further control client configuration using arguments like --input_len, --output_len, and --batch_size.
- Refer to the dataset table above for available dataset names and their details.


Time Marks

Processed datasets (.npz files) automatically include time mark features (x_mark, y_mark) alongside the input (x) and target (y) arrays. These are integer-valued calendar features extracted from the date column. For minute-level data they are ordered as [month, day_of_month, day_of_week, hour, minute]; other granularities use a different set of columns, as listed in the table below.
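The layout of such a file can be sketched with NumPy. Only the key names x, y, x_mark, and y_mark come from this document; the array shapes and dtypes below are assumptions for illustration, and the data is written to an in-memory buffer instead of a real processed file.

```python
import io

import numpy as np

# Assumed shapes: 8 windows, input_len=4, output_len=2, 1 variate,
# and 4 mark columns (hourly data: month, day, weekday, hour).
x = np.zeros((8, 4, 1), dtype=np.float32)
y = np.zeros((8, 2, 1), dtype=np.float32)
x_mark = np.zeros((8, 4, 4), dtype=np.int64)
y_mark = np.zeros((8, 2, 4), dtype=np.int64)

# Write to an in-memory buffer so the sketch does not touch the file system.
buffer = io.BytesIO()
np.savez(buffer, x=x, y=y, x_mark=x_mark, y_mark=y_mark)
buffer.seek(0)

# Reading back: the marks sit next to the input and target arrays.
with np.load(buffer) as data:
    keys = sorted(data.files)
    mark_shape = data["x_mark"].shape

print(keys)        # ['x', 'x_mark', 'y', 'y_mark']
print(mark_shape)  # (8, 4, 4)
```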

Models that accept time marks (e.g., Transformer) use them automatically. Models that do not accept marks are unaffected: the training pipeline passes marks only when the model's forward signature accepts x_mark/y_mark keyword arguments.
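A signature check of this kind can be written with the standard inspect module. The sketch below is an illustration, not the project's training loop: the two model classes and call_forward are hypothetical stand-ins, and plain Python classes are used in place of real model modules.

```python
import inspect


class MarkAwareModel:
    """Stand-in for a model whose forward() accepts time marks."""

    def forward(self, x, x_mark=None, y_mark=None):
        return "used marks" if x_mark is not None else "no marks"


class PlainModel:
    """Stand-in for a model whose forward() takes only the input."""

    def forward(self, x):
        return "no marks"


def call_forward(model, x, x_mark, y_mark):
    """Pass time marks only if forward() declares matching keyword arguments."""
    params = inspect.signature(model.forward).parameters
    marks = {"x_mark": x_mark, "y_mark": y_mark}
    kwargs = {name: value for name, value in marks.items() if name in params}
    return model.forward(x, **kwargs)


result_aware = call_forward(MarkAwareModel(), [1.0], [[1, 1, 0, 0]], [[1, 1, 0, 1]])
result_plain = call_forward(PlainModel(), [1.0], [[1, 1, 0, 0]], [[1, 1, 0, 1]])
print(result_aware)  # used marks
print(result_plain)  # no marks
```

Checking the signature once per model keeps mark-unaware models working without any changes to their code.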

The number of mark columns depends on the dataset granularity:

| Granularity | Columns | Count |
|---|---|---|
| s (second) | month, day, weekday, hour, minute, second | 6 |
| t (minute) | month, day, weekday, hour, minute | 5 |
| h (hour) | month, day, weekday, hour | 4 |
| d (day) | month, day, weekday | 3 |
| w (week) | month, day, week_of_year | 3 |
| mo (month) | month | 1 |
| q (quarter) | month | 1 |
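The table can be expressed as a lookup from granularity code to column names. The sketch below uses only the standard library and rests on assumptions: time_marks is a hypothetical helper, weekday is taken as Monday=0, and week_of_year is taken as the ISO week number. The project's own encoding may use different conventions.

```python
from datetime import datetime

# Columns per granularity code, copied from the table above.
MARK_COLUMNS = {
    "s": ("month", "day", "weekday", "hour", "minute", "second"),
    "t": ("month", "day", "weekday", "hour", "minute"),
    "h": ("month", "day", "weekday", "hour"),
    "d": ("month", "day", "weekday"),
    "w": ("month", "day", "week_of_year"),
    "mo": ("month",),
    "q": ("month",),
}

EXTRACTORS = {
    "month": lambda ts: ts.month,
    "day": lambda ts: ts.day,
    "weekday": lambda ts: ts.weekday(),               # Monday=0 (assumed convention)
    "hour": lambda ts: ts.hour,
    "minute": lambda ts: ts.minute,
    "second": lambda ts: ts.second,
    "week_of_year": lambda ts: ts.isocalendar()[1],   # ISO week number (assumed)
}


def time_marks(ts: datetime, granularity: str) -> list:
    """Return the integer calendar features for one timestamp."""
    return [EXTRACTORS[name](ts) for name in MARK_COLUMNS[granularity]]


ts = datetime(2024, 3, 15, 13, 45, 30)  # a Friday
print(time_marks(ts, "h"))  # [3, 15, 4, 13]
print(time_marks(ts, "w"))  # [3, 15, 11]
```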