Yamd 0.15.0 release notes
There is still no YAMD-dedicated website, so here is a changelog.
#Fuzz is now a part of CI
For some reason, it was not as straightforward as I hoped: it took me six commits to make it work. Is there a better way to test GitHub Actions than pushing your changes?
Here is what works for me:
```yaml
permissions:
  contents: read
on:
  push:
    branches:
  pull_request:
concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true
name: fuzz
jobs:
  required:
    runs-on: ubuntu-latest
    name: ubuntu / nightly
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: true
      - name: Install nightly
        uses: dtolnay/rust-toolchain@master
        with:
          toolchain: nightly
      - name: cargo install cargo-fuzz
        uses: taiki-e/install-action@v2
        with:
          tool: cargo-fuzz
      - name: cargo fuzz run --target x86_64-unknown-linux-gnu deserialize -- -max_total_time=10
        run: cargo fuzz run --target x86_64-unknown-linux-gnu deserialize -- -max_total_time=10
```
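For context, the `deserialize` target that this workflow runs is essentially a thin wrapper around the parser. A minimal sketch of such a target (assuming the crate's public `deserialize` function; the actual target in the repo may differ):

```rust
// fuzz/fuzz_targets/deserialize.rs (standard cargo-fuzz layout; sketch only)
#![no_main]
use libfuzzer_sys::fuzz_target;

fuzz_target!(|data: &[u8]| {
    // Feed arbitrary UTF-8 input to the parser; the only property we
    // check is that parsing never panics.
    if let Ok(input) = std::str::from_utf8(data) {
        let _ = yamd::deserialize(input);
    }
});
```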
I don’t know why, but I am extremely happy that Fuzz is now part of CI.
#semver-checks are now part of CI
semver-checks is a neat project that can tell you whether your PR contains a breaking change and fail CI if the minor version was not bumped. It does not cover every possible way to violate semver, but it comes close.
Integrating it into CI was very straightforward.
#Benchmarks
I want to measure execution time and throughput. Both make sense, but I think throughput is a more relevant measurement for parsers.
The first benchmark feeds the parser all the YAMD files I could find (that was easy because I searched only my hard drive), concatenated together. This benchmark is supposed to show how much time the parser spends on human-generated input.
That’s the most important benchmark, but it does not show the whole picture. Human input primarily consists of Literal and Space tokens, so the parser rarely backtracks. A bench with random tokens should give insight into backtracking performance.
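For illustration, here is roughly what a throughput benchmark like that looks like with criterion. This is only a sketch: the file path, the input loading, and the `yamd::deserialize` call are assumptions, not the crate's actual bench code.

```rust
use criterion::{criterion_group, criterion_main, Criterion, Throughput};

fn human_input_bench(c: &mut Criterion) {
    // Hypothetical path: the concatenated human-written YAMD files.
    let input = std::fs::read_to_string("benches/human_input.yamd").unwrap();

    let mut group = c.benchmark_group("deserialize");
    // Report bytes per second so differently sized inputs stay comparable.
    group.throughput(Throughput::Bytes(input.len() as u64));
    group.bench_function("human input", |b| b.iter(|| yamd::deserialize(&input)));
    group.finish();
}

criterion_group!(benches, human_input_bench);
criterion_main!(benches);
```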
I expected the benchmark with random tokens as input to be less performant, but it shows a ~7x increase in throughput. The input for this benchmark contains literals up to 100 characters long, which means the lexer produces output with lower token density (fewer tokens for the same input size).
It is ~7x faster because throughput depends on token density: more tokens means more work for the parser.
Finally, it makes sense.
Integrating the benchmark into the CI pipeline required some digging, because the benchmarking action was not able to comment on the PR with the results. If you run into the same trouble, check the issue "Not working on GitHub enterprise" (TL;DR: it has nothing to do with enterprise; the workflow simply lacks a permission).
#~70% throughput increase
While working on benches and thinking about token density, I found a way to dramatically decrease token density for human input. Consider the input `Hello world`. The lexer would emit `["Hello", " ", "world"]`, or `[Literal, Space, Literal]`. Since no parsing rule involves a Space token after a Literal token, the lexer can compress them into one Literal token without losing any information. For our `Hello world` example, after compression, the lexer would emit `["Hello world"]`, or `[Literal]`.
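Here is a sketch of the compression pass with simplified types (this is an illustration, not yamd's actual lexer code):

```rust
#[derive(Debug, Clone, PartialEq)]
enum TokenKind {
    Literal,
    Space,
    // ...other kinds omitted
}

#[derive(Debug, Clone, PartialEq)]
struct Token {
    kind: TokenKind,
    slice: String,
}

/// Fold Literal and Space tokens that follow a Literal into that Literal.
/// No parsing rule cares about a Space after a Literal, so nothing is lost,
/// but the parser sees far fewer tokens for the same input.
fn compress(tokens: Vec<Token>) -> Vec<Token> {
    let mut out: Vec<Token> = Vec::new();
    for token in tokens {
        let fold = matches!(token.kind, TokenKind::Literal | TokenKind::Space)
            && matches!(out.last(), Some(prev) if prev.kind == TokenKind::Literal);
        if fold {
            // `fold` guarantees `out` ends with a Literal token.
            out.last_mut().unwrap().slice.push_str(&token.slice);
        } else {
            out.push(token);
        }
    }
    out
}
```

With a pass like this, `["Hello", " ", "world"]` collapses into `["Hello world"]`, while a Space that does not follow a Literal is left untouched.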
Since human input consists primarily of words separated by spaces, token density should significantly improve after compression.
That's a 4x improvement in token density for human input, which translates into a nice ~70% throughput increase and a ~40% execution time decrease, with no penalties 🎉
#Escape marker
When converting a Token back into a string, we must know whether it was escaped. Even though it makes sense to have only one Literal kind of token, an `escaped` property on the `Token` struct is much easier to use.
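Sketched with simplified types (not the crate's exact definitions), the idea is just an extra flag that the code turning tokens back into text can consult:

```rust
#[derive(Debug, Clone, PartialEq)]
enum TokenKind {
    Literal,
    Star,
    // ...other kinds omitted
}

#[derive(Debug, Clone, PartialEq)]
struct Token {
    kind: TokenKind,
    slice: String,
    /// Set when the token came from an escaped sequence (e.g. `\*`),
    /// so converting it back to a string re-emits the backslash.
    escaped: bool,
}

impl Token {
    fn to_source(&self) -> String {
        if self.escaped {
            format!("\\{}", self.slice)
        } else {
            self.slice.clone()
        }
    }
}
```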
#Utils
A random token generator and a token statistics gatherer are now part of the repo. I am thinking of adding `main` to YAMD itself, but it is a separate project for now.
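For illustration, a toy generator in that spirit could look like the following. This is a sketch, not the repo's actual util; the `rand` dependency and every detail here are assumptions.

```rust
use rand::Rng; // assumed dev-dependency, e.g. rand = "0.8"

/// Toy generator: produce roughly `len` bytes of input that mixes literal
/// words with marker characters, so the parser backtracks far more often
/// than it does on human-written text.
fn random_input(len: usize) -> String {
    const MARKERS: &[char] = &['*', '_', '#', '-', ' ', '\n'];
    let mut rng = rand::thread_rng();
    let mut out = String::with_capacity(len);
    while out.len() < len {
        if rng.gen_bool(0.5) {
            // A literal up to 100 characters long, mirroring the
            // random-token benchmark input described above.
            for _ in 0..rng.gen_range(1..=100) {
                out.push(rng.gen_range(b'a'..=b'z') as char);
            }
        } else {
            out.push(MARKERS[rng.gen_range(0..MARKERS.len())]);
        }
    }
    out
}
```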