Datasets¶

Available Datasets¶

Name	Domain	Granularity	Variates	Clients (max)	Samples (mean±std)	CV (mean±std)	License	URL
BaseStation5G	Communication	2 minutes	11	3	9_004±5_018			Github
BeijingAirQuality	Environment	1 hour	11	12	31_847±981	0.93±0.03	CC BY 4.0	UCI
CitiesILI	Healthcare	1 week	1	122				Github
COVID19Cases	Healthcare	1 day	10	55			Apache-2.0	Github
CryptoDataDownloadDay	Economic	1 day	4					CDD
CryptoDataDownloadHour	Economic	1 hour	4					CDD
CryptoDataDownloadMinute	Economic	1 minute	4					CDD
ETTh1	Energy	1 hour	7	1	14_400±0	0.74±0.29	CC BY-ND 4.0	Github
ETTh2	Energy	1 hour	7	1	14_400±0	0.74±0.29	CC BY-ND 4.0	Github
ETDatasetHour	Energy	1 hour	7	2	14_400±0	0.74±0.29	CC BY-ND 4.0	Github
ETTm1	Energy	15 minutes	7	1	57_600±0		CC BY-ND 4.0	Github
ETTm2	Energy	15 minutes	7	1	57_600±0		CC BY-ND 4.0	Github
ETDatasetMinute	Energy	15 minutes	7	2	57_600±0		CC BY-ND 4.0	Github
Electricity	Energy	15 minutes	1	321	26_304±0	0.41±0.28		Github
ElectricityLoadDiagrams	Energy	15 minutes	1	370	140_256±0		CC BY 4.0	UCI
ExchangeRate	Economic	1 day	1	8	7_588±0			Github
M4	Mixed	Mixed	1	100_000	253±593	0.35±0.25		Github
M4Daily	Mixed	1 day	1	4_227	2_371±1_756	0.28±0.22		Github
M4Hourly	Mixed	1 hour	1	414	902±128	0.39±0.30		Github
M4Monthly	Mixed	1 month	1	48_000	234±137	0.31±0.22		Github
M4Quarterly	Mixed	1 quarter	1	24_000	100±51	0.36±0.25		Github
M4Weekly	Mixed	1 week	1	359	1_035±706	0.44±0.27		Github
M4Yearly	Mixed	1 year	1	23_000	37±25	0.45±0.28		Github
METRLA	Traffic	5 minutes	1	207	34_272±0		MIT	Github
MekongSalinity	Environment	1 day	1	38	1_500±953	0.90±0.40		Springer
PeMS03	Traffic	5 minutes	1	358	26_208±0			Github
PeMS04	Traffic	5 minutes	1	307	16_992±0			Github
PeMS07	Traffic	5 minutes	1	883	28_224±0			Github
PeMS08	Traffic	5 minutes	3	170	17_856±0			Github
PeMSBAY	Traffic	5 minutes	1	325	52_116±0		MIT	Github
PeMSSF	Traffic	10 minutes	1	963	63_345±0		CC BY 4.0	UCI
SolarCSGREGFC	Energy	15 minutes	5	8	63_852±16_443			Github
SolarEnergy	Energy	1 hour	1	137	52_560±0	1.46±0.04		Github
StatesILI	Healthcare	1 week	1	37				Github
TetouanPowerConsumption	Energy	10 minutes	1	3	52_416±0		CC BY 4.0	UCI
ThreeW	Energy	1 second	5	1_314	37_658±40_010	0.09±0.13	CC BY 4.0	Github
ThreeWReal	Energy	1 second	5	440	33_042±63_970	0.03±0.12	CC BY 4.0	Github
ThreeWSimulated	Energy	1 second	5	874	39_982±18_178	0.12±0.13	CC BY 4.0	Github
Traffic	Traffic	1 hour	1	862	17_544±0	0.81±0.22		Github
TinyWeather5K	Environment	1 hour	5	200	87_648±0	0.57±0.22	MIT	Github
Weather5K	Environment	1 hour	5	5_672			MIT	Github
WindCSGREGFC	Energy	15 minutes	10	6	70_146±66			Github

Note: The actual number of clients is determined after the data is split, because clients with insufficient data (too little to form at least 10 samples) are discarded. Clients (max) is the maximum possible number of clients.
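The discard rule can be illustrated with a short sketch. This is not the library's implementation: it assumes one sample is a sliding window of input_len + output_len consecutive steps taken with stride 1, it ignores the train/validation/test split, and the names count_windows, keep_clients, and MIN_SAMPLES are hypothetical.

```python
MIN_SAMPLES = 10  # threshold stated in the note above


def count_windows(series_len, input_len, output_len, stride=1):
    """Number of (input, target) windows a series of this length can form."""
    usable = series_len - (input_len + output_len)
    if usable < 0:
        return 0
    return usable // stride + 1


def keep_clients(client_lengths, input_len, output_len):
    """Return the ids of clients that can form at least MIN_SAMPLES windows."""
    return [
        cid
        for cid, n in client_lengths.items()
        if count_windows(n, input_len, output_len) >= MIN_SAMPLES
    ]


# Hypothetical series lengths for three clients.
lengths = {"a": 500, "b": 120, "c": 129}
kept = keep_clients(lengths, input_len=96, output_len=24)
print(kept)  # ['a', 'c']
```

Under these assumptions, client b forms only one window and is dropped, which is why the final client count can be lower than Clients (max).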

Usage¶

1. Single Dataset with Single Configuration¶

To use a dataset in your experiment (default scenario), specify the dataset name with the --dataset argument when running your training or analysis scripts.

Example:

python main.py --dataset=ETTh1

You can also set other related arguments such as --input_len, --output_len, and --batch_size to control the window size, forecast horizon, and batch size for your experiment.

Example:

python main.py --dataset=SolarEnergy --input_len=168 --output_len=24 --batch_size=64

All clients will use the same configuration as specified above.

Refer to the table above for available dataset names and their details.
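To make the roles of --input_len, --output_len, and --batch_size concrete, here is a minimal sketch of how a series might be cut into input/target windows and grouped into batches. It assumes stride-1 sliding windows over a single variate; make_windows and make_batches are hypothetical helpers, not functions from this project.

```python
def make_windows(series, input_len, output_len):
    """Slice a sequence into (x, y) pairs using a stride-1 sliding window."""
    pairs = []
    last_start = len(series) - input_len - output_len
    for start in range(last_start + 1):
        x = series[start:start + input_len]
        y = series[start + input_len:start + input_len + output_len]
        pairs.append((x, y))
    return pairs


def make_batches(pairs, batch_size):
    """Group consecutive pairs into batches; the last batch may be smaller."""
    return [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]


series = list(range(10))
pairs = make_windows(series, input_len=4, output_len=2)
batches = make_batches(pairs, batch_size=2)

print(pairs[0])      # ([0, 1, 2, 3], [4, 5])
print(len(batches))  # 3
```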

2. Single Dataset with Multiple Configurations¶

Different clients from the same dataset may have different configurations (e.g., different output lengths or channels).

Examples:
- PeMS08OutVar1: 75% of clients have output_len=96, 25% have output_len=720.

python main.py --dataset=PeMS08OutVar1

See: data_factory/PeMS08.py/PeMS08OutVar1

- PeMS08OutVar2: 50% of clients have output_len=96, 50% have output_len=720.

python main.py --dataset=PeMS08OutVar2

See: data_factory/PeMS08.py/PeMS08OutVar2

- PeMS08OutVar3: 25% of clients have output_len=96, 75% have output_len=720.

python main.py --dataset=PeMS08OutVar3

See: data_factory/PeMS08.py/PeMS08OutVar3

- Customized2: 50% of clients have 1 output channel and 1 input channel, 50% have 7 output channels and 7 input channels.

python main.py --dataset=Customized2

See: data_factory/Customized.py/Customized2
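A percentage split such as the 75%/25% one in PeMS08OutVar1 could be expressed roughly as below. This is only a sketch: it assigns configurations deterministically in client order, whereas the actual classes in data_factory may select clients differently, and assign_output_lens is a hypothetical name.

```python
def assign_output_lens(client_ids, shares):
    """Assign an output_len to each client according to fractional shares.

    shares is a list of (output_len, fraction) pairs whose fractions sum to 1.
    """
    n = len(client_ids)
    assignment = {}
    start = 0
    cumulative = 0.0
    for output_len, fraction in shares:
        cumulative += fraction
        end = round(cumulative * n)  # boundary index for this share
        for cid in client_ids[start:end]:
            assignment[cid] = output_len
        start = end
    return assignment


clients = list(range(8))  # eight hypothetical client ids
plan = assign_output_lens(clients, [(96, 0.75), (720, 0.25)])
print(plan)
```

With eight clients, the first six receive output_len=96 and the last two receive output_len=720.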

3. Multi-task / Multi-dataset¶

Merge multiple datasets so that each client belongs to one dataset. This is useful for multi-task learning or federated learning across different domains.

Example:
- Customized1: Merges ETDatasetHour (2 clients), TetouanPowerConsumption (3 clients), SolarEnergy (137 clients), Electricity (321 clients) for a total of 463 clients.

python main.py --dataset=Customized1

See: data_factory/Customized.py/Customized1
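The client arithmetic behind Customized1 can be checked with a small sketch that builds a flat client list, tagging each client with its source dataset. The merge_datasets helper and the dictionary layout are assumptions for illustration; the counts are the Clients (max) values from the table.

```python
def merge_datasets(client_counts):
    """Build a flat client list where each client is tagged with its dataset."""
    clients = []
    for dataset, count in client_counts.items():
        for local_id in range(count):
            clients.append({"dataset": dataset, "local_id": local_id})
    return clients


counts = {
    "ETDatasetHour": 2,
    "TetouanPowerConsumption": 3,
    "SolarEnergy": 137,
    "Electricity": 321,
}
merged = merge_datasets(counts)
print(len(merged))  # 463
```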

4. Real / Customized (Heterogeneous Configurations)¶

Merge multiple datasets, each with potentially different configurations per dataset or client.

Example:
- Customized3: Merges ETDatasetHour (2 clients, output_len=96), TetouanPowerConsumption (3 clients, output_len=192).

python main.py --dataset=Customized3

See: data_factory/Customized.py/Customized3

Note:
- For all scenarios, you can further control client configuration using arguments like --input_len, --output_len, and --batch_size.
- Refer to the dataset table above for available dataset names and their details.

Time Marks¶

Processed datasets (.npz files) automatically include time mark features (x_mark, y_mark) alongside the input (x) and target (y) arrays. These are integer-valued calendar features extracted from the date column and ordered from coarsest to finest, for example [month, day_of_month, day_of_week, hour, minute] for minute-level data (not all granularities use all columns).

Models that accept time marks (e.g., Transformer) use them automatically. Models that do not accept marks never receive them: the training pipeline passes marks only when the model's forward signature accepts x_mark/y_mark keyword arguments.
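One way such a signature check could work is sketched below with the standard inspect module. The helper names accepts_marks and call_model and the two toy classes are hypothetical, and plain Python classes stand in for real models; the project's pipeline may perform the check differently.

```python
import inspect


def accepts_marks(forward):
    """Return True if the callable declares both x_mark and y_mark parameters."""
    params = inspect.signature(forward).parameters
    return "x_mark" in params and "y_mark" in params


def call_model(model, x, x_mark=None, y_mark=None):
    """Pass time marks only to models whose forward signature accepts them."""
    if accepts_marks(model.forward):
        return model.forward(x, x_mark=x_mark, y_mark=y_mark)
    return model.forward(x)


class WithMarks:
    def forward(self, x, x_mark=None, y_mark=None):
        return ("marks", x_mark is not None)


class WithoutMarks:
    def forward(self, x):
        return ("plain", False)


print(call_model(WithMarks(), [1.0], x_mark=[3], y_mark=[3]))     # ('marks', True)
print(call_model(WithoutMarks(), [1.0], x_mark=[3], y_mark=[3]))  # ('plain', False)
```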

The number of mark columns depends on the dataset granularity:

Granularity	Columns	Count
`s` (second)	month, day, weekday, hour, minute, second	6
`t` (minute)	month, day, weekday, hour, minute	5
`h` (hour)	month, day, weekday, hour	4
`d` (day)	month, day, weekday	3
`w` (week)	month, day, week_of_year	3
`mo` (month)	month	1
`q` (quarter)	month	1
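The table can be read as a mapping from granularity code to calendar fields. The sketch below derives the marks for one timestamp using only the standard library. It is an illustration, not the project's preprocessing code: MARK_FIELDS and time_marks are hypothetical names, and the weekday numbering (Monday=0) and ISO week numbering are assumptions that the source does not specify.

```python
from datetime import datetime

# Fields per granularity code, copied from the table above.
MARK_FIELDS = {
    "s": ("month", "day", "weekday", "hour", "minute", "second"),
    "t": ("month", "day", "weekday", "hour", "minute"),
    "h": ("month", "day", "weekday", "hour"),
    "d": ("month", "day", "weekday"),
    "w": ("month", "day", "week_of_year"),
    "mo": ("month",),
    "q": ("month",),
}


def time_marks(ts, granularity):
    """Return the integer calendar features of ts for the given granularity."""
    values = {
        "month": ts.month,
        "day": ts.day,
        "weekday": ts.weekday(),              # assumption: Monday=0
        "hour": ts.hour,
        "minute": ts.minute,
        "second": ts.second,
        "week_of_year": ts.isocalendar()[1],  # assumption: ISO week number
    }
    return [values[name] for name in MARK_FIELDS[granularity]]


ts = datetime(2024, 3, 15, 13, 45, 30)  # a Friday
print(time_marks(ts, "h"))  # [3, 15, 4, 13]
print(time_marks(ts, "w"))  # [3, 15, 11]
```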