About
ahorn-loader
Command-line and Python tooling for downloading, reading, and validating datasets from AHORN.
ahorn-loader is both a command-line tool and a Python package designed to interact with the AHORN dataset repository.
It allows users to easily download datasets and load them into Python for analysis and experimentation.
It also hosts our validation script, which checks the correctness of a dataset before it is published.
Command-line Usage
To run ahorn-loader from the command line, use uvx, which installs the tool on demand:
uvx ahorn-loader [command] [args]
Commands include:
ls: List available datasets in AHORN.
download: Download a dataset from AHORN.
validate: Validate a specific dataset file (e.g., before adding it to AHORN).
For a full list of available commands and options, run ahorn-loader --help.
Python Package Usage
You can install ahorn-loader as a Python package from PyPI via pip (or another package manager of your choice):
pip install ahorn-loader
Then, you can use it in your Python scripts:
import ahorn_loader
# Download the latest revision of a dataset:
ahorn_loader.download_dataset("dataset_name", "target_path")
# Download a specific revision of a dataset:
ahorn_loader.download_dataset("dataset_name", "target_path", revision=3)
# Download and read a dataset:
# The dataset will be stored in your system's cache. For a more permanent storage
# location, use `ahorn_loader.download_dataset` instead.
with ahorn_loader.read_dataset("dataset_name") as dataset:
    for line in dataset:
        ...
# Validate a specific dataset (e.g., before adding it to AHORN):
ahorn_loader.validate_dataset("path_to_dataset_file")
ahorn-loader also provides an asynchronous API, which you can use for non-blocking contexts.
Asynchronous functions are suffixed with _async and available for all operations.
import asyncio
import ahorn_loader

async def main() -> None:
    await ahorn_loader.download_dataset_async("dataset_name", "target_path")

    async with ahorn_loader.read_dataset_async("dataset_name") as dataset:
        for line in dataset:
            ...

asyncio.run(main())
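One place where the asynchronous API may pay off is fetching several datasets at once. The following is a minimal sketch, not part of ahorn-loader itself: `download_many` is a hypothetical helper, and it assumes that the function passed as `download` can be awaited when called as `download(name, target_path)`, which is how `ahorn_loader.download_dataset_async` is used above.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def download_many(
    names: list[str],
    target_path: str,
    download: Callable[[str, str], Awaitable[object]],
) -> list[object]:
    """Download several datasets concurrently into the same target path.

    `download` is assumed to behave like ahorn_loader.download_dataset_async,
    i.e. it returns an awaitable when called as download(name, target_path).
    """
    # Create one awaitable per dataset; nothing runs until they are awaited.
    pending = [download(name, target_path) for name in names]
    # gather() awaits all of them concurrently and returns the results
    # in the same order as `names`.
    return await asyncio.gather(*pending)
```

With ahorn-loader installed, this could be called as `asyncio.run(download_many(["dataset_a", "dataset_b"], "target_path", ahorn_loader.download_dataset_async))`, where the dataset names are placeholders. Passing the download function as an argument keeps the helper easy to exercise without network access.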
ahorn-loader loads the latest revision of a dataset by default.
To ensure future reproducibility, it is recommended to pin the revision number of the datasets you use.
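One way to keep pinned revisions in a single place is a small lookup table that every download goes through. This is only a sketch under stated assumptions: `PINNED_REVISIONS` and `download_pinned` are hypothetical names, the dataset name and revision number are placeholders, and the `download` argument is assumed to accept a `revision` keyword in the same way as `ahorn_loader.download_dataset` in the examples above.

```python
from collections.abc import Callable

# Hypothetical table of the dataset revisions this project depends on.
# The entry below is a placeholder, not a real AHORN dataset.
PINNED_REVISIONS: dict[str, int] = {"dataset_name": 3}


def download_pinned(
    name: str,
    target_path: str,
    download: Callable[..., object],
) -> object:
    """Download `name` at its pinned revision, refusing unpinned datasets.

    `download` is assumed to have the same call shape as
    ahorn_loader.download_dataset(name, target_path, revision=...).
    """
    if name not in PINNED_REVISIONS:
        # Fail loudly instead of silently falling back to the latest revision.
        raise KeyError(f"no pinned revision recorded for dataset {name!r}")
    return download(name, target_path, revision=PINNED_REVISIONS[name])
```

A script would then call `download_pinned("dataset_name", "target_path", ahorn_loader.download_dataset)`. Raising an error for unpinned datasets is a design choice: it prevents an analysis from quietly depending on whatever revision happens to be the latest.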