Dataset Converter

Requirements and recommendations for writing parsers that convert raw data into the AHORN format.

Each dataset in this project is accompanied by a parser that takes the raw dataset and converts it into the AHORN format. The parsers are implemented in Python.

Output File

Converters SHOULD produce a plain .txt file when the resulting dataset is small and remains convenient to inspect directly in a text editor. Converters SHOULD produce a .txt.gz file when the dataset is large enough that compression materially reduces storage or download size. The file content is identical in both cases; gzip only affects how the dataset is stored and distributed, not the AHORN format itself.
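As an illustration of the rule above, here is a minimal sketch of how a converter might pick between the two outputs. The helper names write_dataset and read_dataset and the 10 MiB cutoff are assumptions made for this example; the guidelines do not fix a threshold, and the lines passed in are placeholders rather than real AHORN syntax.

```python
import gzip
from pathlib import Path

# Hypothetical cutoff: the guidelines do not specify a number.
SIZE_THRESHOLD_BYTES = 10 * 1024 * 1024


def write_dataset(lines, out_stem, threshold=SIZE_THRESHOLD_BYTES):
    """Write lines to <out_stem>.txt, or to <out_stem>.txt.gz when the
    UTF-8 encoded content is at least `threshold` bytes long."""
    data = "".join(line + "\n" for line in lines).encode("utf-8")
    out_stem = Path(out_stem)
    if len(data) >= threshold:
        # Large output: same bytes, stored gzip-compressed.
        path = out_stem.with_name(out_stem.name + ".txt.gz")
        with gzip.open(path, "wb") as fh:
            fh.write(data)
    else:
        # Small output: plain text that is easy to inspect in an editor.
        path = out_stem.with_name(out_stem.name + ".txt")
        path.write_bytes(data)
    return path


def read_dataset(path):
    """Return the text content of a .txt or .txt.gz dataset file."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as fh:
        return fh.read()
```

Because both branches write the same bytes, read_dataset returns the same text for either file, which mirrors the point that gzip changes only storage and distribution.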

Validation

We provide a validation script that checks whether a converted dataset conforms to the expected AHORN format and catches common structural issues before a dataset is submitted. The validation script is part of the ahorn-loader package. Any new dataset MUST pass the validation script before it can be included in the repository.

uvx ahorn-loader validate PATH_TO_DATASET
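
Since the parsers are written in Python, a converter could also run this command as its final step. The following is a sketch only: the helper names build_validate_command and passes_validation are hypothetical, and it assumes (not confirmed by this page) that the validator signals failure through a non-zero exit status.

```python
import subprocess


def build_validate_command(dataset_path):
    """Return the documented validation command as an argument list."""
    return ["uvx", "ahorn-loader", "validate", str(dataset_path)]


def passes_validation(dataset_path, run=subprocess.run):
    """Run the validator and report whether it exited with status 0.

    Assumption: a non-zero exit status means the dataset failed validation.
    The `run` parameter exists only so the call can be replaced in tests.
    """
    result = run(build_validate_command(dataset_path))
    return result.returncode == 0
```

A converter might call passes_validation on the file it just wrote and stop with an error if the result is False.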