v0.5.0

What's Changed

build(deps): bump actions/cache from 4.1.0 to 4.1.1 by @dependabot in #300
build(deps): bump rojopolis/spellcheck-github-actions from 0.42.0 to 0.43.0 by @dependabot in #299
build(deps): bump actions/checkout from 4.2.0 to 4.2.1 by @dependabot in #298
chore: rename 'basic-workflow-tests' to 'e2e-custom' by @nathan-weinberg in #306
fix: change "group" to "tag" for mmlu_branch task config by @alimaredia in #305
fix: remove stop token from mixtral by @cdoern in #310
ci: update small E2E job to align with CLI and Training by @nathan-weinberg in #317
ci: update medium job to run as PR check by @nathan-weinberg in #318
build(deps): bump rojopolis/spellcheck-github-actions from 0.43.0 to 0.43.1 by @dependabot in #314
fix: medium E2E CI job was missing HF_TOKEN by @nathan-weinberg in #319
build(deps): bump actions/cache from 4.1.1 to 4.1.2 by @dependabot in #320
ci: use org variable for AWS EC2 AMI in E2E CI jobs by @nathan-weinberg in #322
ci: convert med E2E CI job to L4 GPU by @nathan-weinberg in #325
build(deps): bump rojopolis/spellcheck-github-actions from 0.43.1 to 0.44.0 by @dependabot in #326
build(deps): bump actions/setup-python from 5.2.0 to 5.3.0 by @dependabot in #323
build(deps): bump pypa/gh-action-pypi-publish from 1.10.3 to 1.11.0 by @dependabot in #327
build(deps): bump actions/checkout from 4.2.1 to 4.2.2 by @dependabot in #321
build(deps): bump machulav/ec2-github-runner from 2.3.6 to 2.3.7 by @dependabot in #328
build(deps): bump hynek/build-and-inspect-python-package from 2.9.0 to 2.10.0 by @dependabot in #329
build(deps): bump rhysd/actionlint from 1.7.3 to 1.7.4 in /.github/workflows by @dependabot in #332
build(deps): bump pypa/gh-action-pypi-publish from 1.11.0 to 1.12.0 by @dependabot in #337
build(deps): bump rojopolis/spellcheck-github-actions from 0.44.0 to 0.45.0 by @dependabot in #338
build(deps): bump pypa/gh-action-pypi-publish from 1.12.0 to 1.12.2 by @dependabot in #342
Integrate Context-Aware Chunking and PDF Support by @khaledsulayman in #284
feat: parametrize system prompt by @jaideepr97 in #339
feat: support converting messages datasets into multiple pre-training formats by @jaideepr97 in #341
Move to Docling v2 APIs by @bbrowning in #347
feat: expose max_num_tokens as configurable by @cdoern in #340
Remove unnecessary requirement for qna.yaml in ContextAwareChunker by @khaledsulayman in #351
Upgrade docling, expand chunking testing by @bbrowning in #349
Don't attempt batching with InstructLab's llama-cpp-python by @bbrowning in #358
Consolidate test sample documents into one subdir by @bbrowning in #356
Move a spurious print to a debug log message by @bbrowning in #359
Only use CPU for the docling OCR models by @bbrowning in #361
Data mix fix by @aakankshaduggal in #366

New Contributors

@alimaredia made their first contribution in #305

Full Changelog: v0.4.2...v0.5.0

