Dataset Converter
Requirements and recommendations for writing parsers that convert raw data into the AHORN format.
Each dataset in this project is accompanied by a parser that takes the raw dataset and converts it into the AHORN format. The parsers are implemented in Python.
Output File
Converters SHOULD produce a plain .txt file when the resulting dataset is small and remains convenient to inspect directly in a text editor.
Converters SHOULD produce a .txt.gz file when the dataset is large enough that compression materially reduces storage or download size.
The file content is identical in both cases; gzip only affects how the dataset is stored and distributed, not the AHORN format itself.
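As a rough illustration of the two output options, a converter could pick the file extension from the size of the converted content. This is only a sketch: the 10 MiB cutoff and the helper names `write_dataset` and `read_dataset` are hypothetical and are not defined by this project; choose whatever threshold suits the dataset.

```python
import gzip
from pathlib import Path

# Hypothetical cutoff: the guidelines do not define what counts as "large".
SIZE_THRESHOLD_BYTES = 10 * 1024 * 1024


def write_dataset(content: str, stem: Path, threshold: int = SIZE_THRESHOLD_BYTES) -> Path:
    """Write already-converted dataset text as .txt or .txt.gz and return the path.

    `stem` is the output path without an extension, e.g. Path("out/my-dataset").
    """
    data = content.encode("utf-8")
    if len(data) < threshold:
        # Small dataset: keep it directly inspectable in a text editor.
        path = stem.with_name(stem.name + ".txt")
        path.write_bytes(data)
    else:
        # Large dataset: same bytes, stored gzip-compressed.
        path = stem.with_name(stem.name + ".txt.gz")
        with gzip.open(path, "wb") as fh:
            fh.write(data)
    return path


def read_dataset(path: Path) -> str:
    """Read the dataset text back, decompressing when the file ends in .gz."""
    if path.suffix == ".gz":
        with gzip.open(path, "rb") as fh:
            return fh.read().decode("utf-8")
    return path.read_bytes().decode("utf-8")
```

Reading either file back yields the same text, which reflects the point above that compression changes only storage and distribution, not the content.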
Validation
We provide a validation script that checks whether a converted dataset conforms to the expected AHORN format and catches common structural issues before a dataset is submitted.
The validation script is part of the ahorn-loader package.
Any new dataset MUST pass the validation script before it can be included in the repository.
uvx ahorn-loader validate PATH_TO_DATASET
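A converter script could also run this check automatically after writing its output. The sketch below simply shells out to the command shown above; the helper names are hypothetical, and it assumes (not confirmed here) that the validator signals failure through a non-zero exit code.

```python
import subprocess
from pathlib import Path


def build_validate_command(dataset_path: Path) -> list[str]:
    """Return the documented validation command as an argument list."""
    return ["uvx", "ahorn-loader", "validate", str(dataset_path)]


def validate_dataset(dataset_path: Path) -> bool:
    """Run the validator and report success.

    Assumption: a non-zero exit code means the dataset failed validation.
    Requires `uvx` to be available on PATH.
    """
    result = subprocess.run(build_validate_command(dataset_path), check=False)
    return result.returncode == 0
```

Calling `validate_dataset(Path("my-dataset.txt.gz"))` at the end of a parser would then surface format problems before the dataset is submitted.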