Datasets
Available Datasets
| Name | Domain | Granularity | Variates | Clients (max) | Samples | CV (mean±std) | URL |
|---|---|---|---|---|---|---|---|
| BaseStation5G | Communication | 2 minutes | 11 | 3 | 9_004±5_018 | | Github |
| BeijingAirQuality | Environment | 1 hour | 11 | 12 | 31_847±981 | 0.93±0.03 | UCI |
| CitiesILI | Healthcare | 1 week | 1 | 122 | | | Github |
| COVID19Cases | Healthcare | 1 day | 10 | 55 | | | Github |
| CryptoDataDownloadDay | Economic | 1 day | 4 | | | | CDD |
| CryptoDataDownloadHour | Economic | 1 hour | 4 | | | | CDD |
| CryptoDataDownloadMinute | Economic | 1 minute | 4 | | | | CDD |
| ETTh1 | Energy | 1 hour | 7 | 1 | 14_400±0 | 0.74±0.29 | Github |
| ETTh2 | Energy | 1 hour | 7 | 1 | 14_400±0 | 0.74±0.29 | Github |
| ETDatasetHour | Energy | 1 hour | 7 | 2 | 14_400±0 | 0.74±0.29 | Github |
| ETTm1 | Energy | 15 minutes | 7 | 1 | 57_600±0 | | Github |
| ETTm2 | Energy | 15 minutes | 7 | 1 | 57_600±0 | | Github |
| ETDatasetMinute | Energy | 15 minutes | 7 | 2 | 57_600±0 | | Github |
| Electricity | Energy | 15 minutes | 1 | 321 | 26_304±0 | 0.41±0.28 | Github |
| ElectricityLoadDiagrams | Energy | 15 minutes | 1 | 370 | 140_256±0 | | UCI |
| ExchangeRate | Economic | 1 day | 1 | 8 | 7_588±0 | | Github |
| METRLA | Traffic | 5 minutes | 1 | 207 | 34_272±0 | | Github |
| MekongSalinity | Environment | 1 day | 1 | 38 | 1_500±953 | 0.90±0.40 | Springer |
| PeMS03 | Traffic | 5 minutes | 1 | 358 | 26_208±0 | | Github |
| PeMS04 | Traffic | 5 minutes | 1 | 307 | 16_992±0 | | Github |
| PeMS07 | Traffic | 5 minutes | 1 | 883 | 28_224±0 | | Github |
| PeMS08 | Traffic | 5 minutes | 3 | 170 | 17_856±0 | | Github |
| PeMSBAY | Traffic | 5 minutes | 1 | 325 | 52_116±0 | | Github |
| PeMSSF | Traffic | 10 minutes | 1 | 963 | 63_345±0 | | UCI |
| SolarCSGREGFC | Energy | 15 minutes | 5 | 8 | 63_852±16_443 | | Github |
| SolarEnergy | Energy | 1 hour | 1 | 137 | 52_560±0 | 1.46±0.04 | Github |
| StatesILI | Healthcare | 1 week | 1 | 37 | | | Github |
| TetouanPowerConsumption | Energy | 10 minutes | 1 | 3 | 52_416±0 | | UCI |
| Traffic | Traffic | 1 hour | 1 | 862 | 17_544±0 | 0.81±0.22 | Github |
| TinyWeather5K | Environment | 1 hour | 5 | 200 | 87_648±0 | 0.57±0.22 | Github |
| Weather5K | Environment | 1 hour | 5 | 5_672 | | | Github |
| WindCSGREGFC | Energy | 15 minutes | 10 | 6 | 70_146±66 | | Github |
Note: The actual number of clients is determined after the data is split, because clients with insufficient data (those that cannot form at least 10 samples) are discarded. Clients (max) is the maximum possible number of clients.
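The filtering rule in the note can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes that one "sample" is one sliding window of `input_len + output_len` time steps taken with a stride of 1, and the helper names `count_windows` and `keep_clients` are hypothetical.

```python
def count_windows(series_len: int, input_len: int, output_len: int, stride: int = 1) -> int:
    """Number of (input, target) windows a series of the given length can form."""
    usable = series_len - input_len - output_len
    if usable < 0:
        return 0  # the series is shorter than a single window
    return usable // stride + 1


def keep_clients(series_lengths: dict, input_len: int, output_len: int,
                 min_samples: int = 10) -> list:
    """Names of clients that can form at least min_samples windows."""
    return [
        name
        for name, length in series_lengths.items()
        if count_windows(length, input_len, output_len) >= min_samples
    ]


# Hypothetical per-client series lengths after splitting.
lengths = {"client_a": 1_500, "client_b": 120, "client_c": 129}
kept = keep_clients(lengths, input_len=96, output_len=24)
print(kept)  # ['client_a', 'client_c']
```

Here `client_b` forms only one window and is dropped, while `client_c` forms exactly 10 and is kept.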
Usage
1. Single Dataset with Single Configuration
To use a dataset in your experiment (default scenario), specify the dataset name with the --dataset argument when running your training or analysis scripts.
Example:
python main.py --dataset=ETTh1
You can also set other related arguments such as --input_len, --output_len, and --batch_size to control the window size, forecast horizon, and batch size for your experiment.
Example:
python main.py --dataset=SolarEnergy --input_len=168 --output_len=24 --batch_size=64
All clients will use the same configuration as specified above.
Refer to the table above for available dataset names and their details.
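To make the roles of --input_len, --output_len, and --batch_size concrete, here is a small sketch of how a series could be cut into input/target pairs and grouped into batches. It assumes a stride of one step and plain Python lists; `make_windows` is a hypothetical helper, and the project's own loader may differ.

```python
def make_windows(series, input_len, output_len):
    """Split a sequence into (x, y) pairs: input_len steps in, output_len steps out."""
    pairs = []
    last_start = len(series) - input_len - output_len
    for start in range(last_start + 1):
        x = series[start : start + input_len]
        y = series[start + input_len : start + input_len + output_len]
        pairs.append((x, y))
    return pairs


series = list(range(10))  # toy series with 10 time steps
pairs = make_windows(series, input_len=4, output_len=2)

# Group consecutive pairs into batches; the last batch may be smaller.
batch_size = 2
batches = [pairs[i : i + batch_size] for i in range(0, len(pairs), batch_size)]

print(len(pairs))                 # 5
print(pairs[0])                   # ([0, 1, 2, 3], [4, 5])
print([len(b) for b in batches])  # [2, 2, 1]
```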
2. Single Dataset with Multiple Configurations
Different clients from the same dataset may have different configurations (e.g., different output lengths or channels).
Examples:
- PeMS08OutVar1: 75% of clients have output_len=96, 25% have output_len=720.
python main.py --dataset=PeMS08OutVar1
See: data_factory/PeMS08.py/PeMS08OutVar1
- PeMS08OutVar2: 50% of clients have output_len=96, 50% have output_len=720.
python main.py --dataset=PeMS08OutVar2
See: data_factory/PeMS08.py/PeMS08OutVar2
- PeMS08OutVar3: 25% of clients have output_len=96, 75% have output_len=720.
python main.py --dataset=PeMS08OutVar3
See: data_factory/PeMS08.py/PeMS08OutVar3
- Customized2: 50% of clients have 1 output channel and 1 input channel, 50% have 7 output channels and 7 input channels.
python main.py --dataset=Customized2
See: data_factory/Customized.py/Customized2
See: data_factory/Customized.py/Customized2
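One way to picture a percentage split such as the one in PeMS08OutVar1 is the sketch below. It is only an assumption about how such a split could be made: clients are assigned in list order, and `assign_output_lens` is a hypothetical helper, not a function from data_factory.

```python
def assign_output_lens(client_ids, shares):
    """Map each client to an output_len given (output_len, fraction) shares.

    Clients are assigned in list order; the final share absorbs any rounding remainder.
    """
    assignments = {}
    n = len(client_ids)
    start = 0
    cumulative = 0.0
    for i, (output_len, fraction) in enumerate(shares):
        cumulative += fraction
        end = n if i == len(shares) - 1 else round(cumulative * n)
        for client_id in client_ids[start:end]:
            assignments[client_id] = output_len
        start = end
    return assignments


clients = [f"client_{i}" for i in range(8)]
# 75% of clients forecast 96 steps, 25% forecast 720 steps.
config = assign_output_lens(clients, [(96, 0.75), (720, 0.25)])
print(config["client_0"], config["client_7"])  # 96 720
```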
3. Multi-task / Multi-dataset
Merge multiple datasets so that each client belongs to one dataset. This is useful for multi-task learning or federated learning across different domains.
Example:
- Customized1: Merges ETDatasetHour (2 clients), TetouanPowerConsumption (3 clients), SolarEnergy (137 clients), and Electricity (321 clients) for a total of 463 clients.
python main.py --dataset=Customized1
See: data_factory/Customized.py/Customized1
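The client total for Customized1 follows directly from the per-dataset counts. The sketch below builds a flat client list in which every client is tagged with its source dataset; `merge_datasets` and the record layout are illustrative assumptions, and the counts are the Clients (max) values from the table, so the real number may be lower once clients with insufficient data are discarded.

```python
def merge_datasets(client_counts):
    """Build a flat client list where each client records its source dataset."""
    clients = []
    for dataset, count in client_counts.items():
        for local_id in range(count):
            clients.append({"dataset": dataset, "local_id": local_id})
    return clients


counts = {
    "ETDatasetHour": 2,
    "TetouanPowerConsumption": 3,
    "SolarEnergy": 137,
    "Electricity": 321,
}
merged = merge_datasets(counts)
print(len(merged))  # 463
```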
4. Real / Customized (Heterogeneous Configurations)
Merge multiple datasets, each with potentially different configurations per dataset or client.
Example:
- Customized3: Merges ETDatasetHour (2 clients, output_len=96) and TetouanPowerConsumption (3 clients, output_len=192).
python main.py --dataset=Customized3
See: data_factory/Customized.py/Customized3
Note:
- For all scenarios, you can further control client configuration using arguments like --input_len, --output_len, and --batch_size.
- Refer to the dataset table above for available dataset names and their details.
Time Marks
Processed datasets (.npz files) automatically include time mark features (x_mark, y_mark) alongside the input (x) and target (y) arrays. These are integer-valued calendar features extracted from the date column, ordered from coarsest to finest, for example [month, day_of_month, day_of_week, hour, minute] for minute-level data. The set of columns depends on the dataset granularity.
Models that accept time marks (e.g., Transformer) use them automatically. Models that do not accept marks are unaffected: the training pipeline passes marks only when the model's forward signature accepts x_mark/y_mark keyword arguments.
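The signature check described above can be sketched with the standard library's `inspect.signature`. This is an illustration of the idea, not the pipeline's actual code: `call_model` and the two toy forward functions are hypothetical.

```python
import inspect


def call_model(forward, x, x_mark=None, y_mark=None):
    """Call forward(x), adding time marks only if its signature names them."""
    params = inspect.signature(forward).parameters
    kwargs = {}
    if "x_mark" in params:
        kwargs["x_mark"] = x_mark
    if "y_mark" in params:
        kwargs["y_mark"] = y_mark
    return forward(x, **kwargs)


def forward_with_marks(x, x_mark=None, y_mark=None):
    """Toy forward that reports whether it received input marks."""
    return ("with_marks", x_mark is not None)


def forward_plain(x):
    """Toy forward that knows nothing about marks."""
    return ("plain", x)


marks = [[1, 15, 2, 8]]
print(call_model(forward_with_marks, [1.0], x_mark=marks))  # ('with_marks', True)
print(call_model(forward_plain, [1.0], x_mark=marks))       # ('plain', [1.0])
```

With this pattern, a model without mark parameters never sees the extra keyword arguments, so no `TypeError` is raised.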
The number of mark columns depends on the dataset granularity:
| Granularity | Columns | Count |
|---|---|---|
| s (second) | month, day, weekday, hour, minute, second | 6 |
| t (minute) | month, day, weekday, hour, minute | 5 |
| h (hour) | month, day, weekday, hour | 4 |
| d (day) | month, day, weekday | 3 |
| w (week) | month, day, week_of_year | 3 |
| mo (month) | month | 1 |
| q (quarter) | month | 1 |
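The mapping in the table can be expressed as a small lookup. The sketch below uses only the standard library and makes two assumptions that the table does not state: weekdays are numbered with Monday as 0, and week_of_year is the ISO week number. `MARK_COLUMNS` and `time_marks` are hypothetical names.

```python
from datetime import datetime

# Columns per granularity code, copied from the table above.
MARK_COLUMNS = {
    "s": ["month", "day", "weekday", "hour", "minute", "second"],
    "t": ["month", "day", "weekday", "hour", "minute"],
    "h": ["month", "day", "weekday", "hour"],
    "d": ["month", "day", "weekday"],
    "w": ["month", "day", "week_of_year"],
    "mo": ["month"],
    "q": ["month"],
}


def time_marks(ts: datetime, granularity: str) -> list:
    """Integer calendar features for one timestamp at the given granularity."""
    values = {
        "month": ts.month,
        "day": ts.day,
        "weekday": ts.weekday(),              # assumption: Monday == 0
        "hour": ts.hour,
        "minute": ts.minute,
        "second": ts.second,
        "week_of_year": ts.isocalendar()[1],  # assumption: ISO week number
    }
    return [values[name] for name in MARK_COLUMNS[granularity]]


ts = datetime(2024, 3, 15, 8, 30, 0)  # a Friday
print(time_marks(ts, "h"))  # [3, 15, 4, 8]
print(time_marks(ts, "w"))  # [3, 15, 11]
```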