-
-
Notifications
You must be signed in to change notification settings - Fork 905
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bug: v1.13.7 XML::Reader may segfault #2598
Comments
Summary and DiagnosisThis comment is an explanation of the problem. I'll follow up with a few suggestions for approaches to fix it. The SymptomsUsers of A reproduction (provided by @bf4) captured in valgrind showed the following memory error:
Nokogiri v1.13.7 introduced some changes in how Nodes behave during garbage collection, so this stack isn't terribly surprising. But the changes were relatively straightforward and this is rather unexpected. It also happens only rarely, so it's challenging to reproduce and didn't show up in any of the valgrind jobs in our CI pipeline. How Reader's memory lifecycle worksNodes associated with an To demonstrate, if we add some debugging output to --- a/ext/nokogiri/xml_node.c
+++ b/ext/nokogiri/xml_node.c
@@ -2069,6 +2069,18 @@ noko_xml_node_wrap(VALUE rb_class, xmlNodePtr c_node)
rb_node = TypedData_Wrap_Struct(rb_class, &nokogiri_node_type, c_node) ;
c_node->_private = (void *)rb_node;
+ {
+ VALUE message = rb_sprintf(
+ "MIKE: %"PRIsVALUE" name=%s, c=%p | document: c=%p, ruby=%s\n",
+ rb_funcall(rb_class, rb_intern("to_s"), 0),
+ c_node->name,
+ c_node,
+ c_node->doc,
+ node_has_a_document ? "yes" : "no"
+ );
+ fprintf(stderr, "%s", StringValueCStr(message));
+ }
+
if (node_has_a_document) {
rb_document = DOC_RUBY_OBJECT(c_doc);
rb_node_cache = DOC_NODE_CACHE(c_doc); Then traversing a document with Reader (specifically, inspecting node attributes) will emit output like:
Note that these are all attribute nodes returned by (We can verify this statement by adding these lines to if (!node_has_a_document && rb_class != cNokogiriXmlAttr) {
abort();
} and seeing that the Nokogiri test suite runs without hitting this abort statement.) There has always been special handling in
We can see that the C structs are getting reused: search for
This behavior is dangerous, and it makes the Reader API unsafe (as mentioned in But really, this is fine! It's a wonky memory lifecycle, but the code in v1.13.6 and earlier handled it (since v1.4.5 in 2011! see The changes in Nokogiri v1.13.7In v1.13.7 we introduced some code to make sure that nodes would behave well when compacted by Ruby's new compacting garbage collector. As part of this code, the Ruby object wrapping went from this: static void
_xml_node_mark(xmlNodePtr node)
{
xmlDocPtr doc = node->doc;
if (doc->type == XML_DOCUMENT_NODE || doc->type == XML_HTML_DOCUMENT_NODE) {
if (DOC_RUBY_OBJECT_TEST(doc)) {
rb_gc_mark(DOC_RUBY_OBJECT(doc));
}
} else if (node->doc->_private) {
rb_gc_mark((VALUE)doc->_private);
}
}
VALUE
noko_xml_node_wrap(VALUE rb_class, xmlNodePtr c_node)
{
VALUE rb_document, rb_node_cache, rb_node;
nokogiriTuplePtr node_has_a_document;
xmlDocPtr c_doc;
void (*f_mark)(xmlNodePtr) = NULL ;
/* ... */
node_has_a_document = DOC_RUBY_OBJECT_TEST(c_doc);
/* ... */
f_mark = node_has_a_document ? _xml_node_mark : NULL ;
rb_node = Data_Wrap_Struct(rb_class, f_mark, _xml_node_dealloc, c_node) ;
/* ... */
} to (simplified slightly): static void
_xml_node_mark(xmlNodePtr node)
{
if (!DOC_RUBY_OBJECT_TEST(node->doc)) {
return;
}
xmlDocPtr doc = node->doc;
if (doc->type == XML_DOCUMENT_NODE || doc->type == XML_HTML_DOCUMENT_NODE) {
if (DOC_RUBY_OBJECT_TEST(doc)) {
rb_gc_mark(DOC_RUBY_OBJECT(doc));
}
} else if (node->doc->_private) {
rb_gc_mark((VALUE)doc->_private);
}
}
static const rb_data_type_t nokogiri_node_type = {
"Nokogiri/XMLNode",
{
(gc_callback_t)_xml_node_mark, (gc_callback_t)_xml_node_dealloc, 0, (gc_callback_t)_xml_node_update_references
},
0, 0, RUBY_TYPED_FREE_IMMEDIATELY,
};
VALUE
noko_xml_node_wrap(VALUE rb_class, xmlNodePtr c_node)
{
VALUE rb_document, rb_node_cache, rb_node;
nokogiriTuplePtr node_has_a_document;
xmlDocPtr c_doc;
/* ... */
rb_node = TypedData_Wrap_Struct(rb_class, &nokogiri_node_type, c_node) ;
/* ... */
} In summary: the check for the ruby-wrapped-ness of the node's associated document was moved from "wrap time" to "mark time". From this: f_mark = DOC_RUBY_OBJECT_TEST(c_doc) ? _xml_node_mark : NULL ;
rb_node = Data_Wrap_Struct(rb_class, f_mark, _xml_node_dealloc, c_node) ; to this: static void
_xml_node_mark(xmlNodePtr node)
{
if (!DOC_RUBY_OBJECT_TEST(node->doc)) {
return;
}
/* ... */
} 99.9999% of the time, this is fine. Except when it isn'tBetween the time the node is wrapped and the time the mark function gets called, libxml2 may have decided to free (and possibly re-use) the C struct (or the memory). We can see this by applying this patch: --- a/ext/nokogiri/xml_node.c
+++ b/ext/nokogiri/xml_node.c
@@ -21,6 +21,8 @@ _xml_node_dealloc(xmlNodePtr x)
static void
_xml_node_mark(xmlNodePtr node)
{
+ fprintf(stderr, "MIKE: xml_node.c:mark: node=%p, document=%p\n", node, node->doc);
+
if (!DOC_RUBY_OBJECT_TEST(node->doc)) {
return;
}
@@ -2069,6 +2071,10 @@ noko_xml_node_wrap(VALUE rb_class, xmlNodePtr c_node)
rb_node = TypedData_Wrap_Struct(rb_class, &nokogiri_node_type, c_node) ;
c_node->_private = (void *)rb_node;
+ if (!node_has_a_document) {
+ fprintf(stderr, "MIKE: xml_node.c:wrap: node=%p, document=%p\n", c_node, c_doc);
+ }
+
if (node_has_a_document) {
rb_document = DOC_RUBY_OBJECT(c_doc);
rb_node_cache = DOC_NODE_CACHE(c_doc); and letting the repro script run. In one particular case, the last lines emitted were:
Look backwards in the debug log for the node pointer
Whoa! The node was wrapped with a valid document pointer. But between the two Let's apply this patch to libxml2's diff --git a/xmlreader.c b/xmlreader.c
index ba95813..8759f14 100644
--- a/xmlreader.c
+++ b/xmlreader.c
@@ -413,7 +413,16 @@ xmlTextReaderFreeNode(xmlTextReaderPtr reader, xmlNodePtr cur) {
(cur->type == XML_XINCLUDE_START) ||
(cur->type == XML_XINCLUDE_END)) &&
(cur->properties != NULL))
+ {
+ fprintf(stderr, "MIKE: freeing properties:");
+ xmlAttrPtr x = cur->properties;
+ while (x != NULL) {
+ fprintf(stderr, " %p", x);
+ x = x->next;
+ }
+ fprintf(stderr, "\n");
xmlTextReaderFreePropList(reader, cur->properties);
+ }
if ((cur->content != (xmlChar *) &(cur->properties)) &&
(cur->type != XML_ELEMENT_NODE) &&
(cur->type != XML_XINCLUDE_START) && This patch dumps the addresses of all the node attribute structs that are getting freed, like this:
Now, when we repro and segfault during the mark phase, we can see the node being marked, and we can also see if it's been previously freed. So in this case when we see:
we can grep for
We can totally see that this struct is being freed -- twice! -- between being wrapped and being marked. Being freed indicates that the memory could have been re-used, which would explain why the value of the document pointer has changed. To prove our use-after-free theory, we can ask valgrind to fill memory when it's freed: VALGRIND="valgrind --free-fill=66"
bundle exec $VALGRIND $(rbenv which ruby) ./repro.rb and we see
🎉 OK, now I fully understand the problem, and hopefully you do, too. What can we do about this?See my next comment for some ideas. |
How is |
|
How to move forwardSome thoughts on what we might do. The naive solutionA reasonable idea might be to move the ruby-wrapped-ness test out of the mark function and back into the wrap function. This would avoid this particular code path, and would certainly segfault less. We could do that by making a second static const rb_data_type_t nokogiri_node_type_nowrap = {
"Nokogiri/XMLNode",
{
0, (gc_callback_t)_xml_node_dealloc, 0,
#ifdef HAVE_RB_GC_LOCATION
(gc_callback_t)_xml_node_update_references
#endif
},
0, 0,
#ifdef RUBY_TYPED_FREE_IMMEDIATELY
RUBY_TYPED_FREE_IMMEDIATELY,
#endif
}; and use this at wrap time: if (node_has_a_document) {
rb_node = TypedData_Wrap_Struct(rb_class, &nokogiri_node_type, c_node) ;
} else {
rb_node = TypedData_Wrap_Struct(rb_class, &nokogiri_node_type_nowrap, c_node) ;
} "It's complicated"But if the node's C pointer is unsafe to dereference at "mark time", then it's also going to be unsafe to dereference at "update references time". So we will have made this crash less often, but the path to segfaulting still exists, and it's:
So I think the real culprit here is the fact that Nokogiri is wrapping Ruby objects around C structs that are ephemeral and can't be relied upon to last the entire lifetime of the Ruby object. So: I think that Option 1: Copy the C structWe could, in Although this option seems simple, there's a lot not to like about it. First, we still have to maintain a separate object lifecycle for this particular kind of node, and I don't like that complication in Nokogiri. Over the years I've had to spend quite a bit of time debugging Reader memory problems, and I'd prefer to use this as an opportunity to simplify the code. Second, making a copy of a single C struct doesn't solve the deeper problem, which is that libxml2 structs represent a graph of nodes. The struct _xmlAttribute {
void *_private; /* application data */
xmlElementType type; /* XML_ATTRIBUTE_DECL, must be second ! */
const xmlChar *name; /* Attribute name */
struct _xmlNode *children; /* NULL */
struct _xmlNode *last; /* NULL */
struct _xmlDtd *parent; /* -> DTD */
struct _xmlNode *next; /* next sibling link */
struct _xmlNode *prev; /* previous sibling link */
struct _xmlDoc *doc; /* the containing document */
struct _xmlAttribute *nexth; /* next in hash table */
xmlAttributeType atype; /* The attribute type */
xmlAttributeDefault def; /* the default */
const xmlChar *defaultValue; /* or the default value */
xmlEnumerationPtr tree; /* or the enumeration tree if any */
const xmlChar *prefix; /* the namespace prefix if any */
const xmlChar *elem; /* Element holding the attribute */
}; Which of these members should we NULL out (possibly breaking current functionality), and which should we (gulp) also make copies of and garbage collect? The idea of managing our own separate DOM graph makes me really nervous, and this feels like the risk/reward ratio is too high. Option 2: Deprecate
|
I haven't read this closely enough to have an opinion, but did you mean struct _xmlAttr {
void *_private; /* application data */
xmlElementType type; /* XML_ATTRIBUTE_NODE, must be second ! */
const xmlChar *name; /* the name of the property */
struct _xmlNode *children; /* the value of the property */
struct _xmlNode *last; /* NULL */
struct _xmlNode *parent; /* child->parent link */
struct _xmlAttr *next; /* next sibling link */
struct _xmlAttr *prev; /* previous sibling link */
struct _xmlDoc *doc; /* the containing document */
xmlNs *ns; /* pointer to the associated namespace */
xmlAttributeType atype; /* the attribute type if validating */
void *psvi; /* for type/PSVI information */
}; Either way, maintaining copies seems tricky at best. |
@stevecheckoway Ah, thank you for catching that! It's been a long week. I absolutely meant |
There's another option that @tenderlove suggested in a chat today ... Option 3: ask the Reader to preserve nodes
I'm concerned, though, about automatically asking libxml2 to preserve any node on which we've called Side note: the JRuby implementation of Nokogiri::XML::Reader suffers from this problem, and we have a long history of complaints about it:
But if we assume that we could call Additionally we'd be trusting libxml2 to not be buggy on a little-used and untested code path, which ... well, we've been burned before. I honestly think the better thing to do is to plan to avoid instantiating wrapped objects related to Reader. Let's move towards a simpler implementation rather an a more complex implementation. Avoiding this option embraces the spirit of Reader being a cursor through the document, allows us to be very conservative with memory usage, and the worst-case scenario is we add more methods to Reader to allow users to get only the information they need. I'm open to hearing other opinions, but this doesn't seem like a better option to me than "Option 2" above. |
Trying to get to a decision here ... I've written a PR that implements Option 2, and I like it: #2599 I'm not going to try to implement Option 3, though I'm open to considering it if someone else wants to try! |
…de-gc fix: XML::Reader XML::Attr garbage collection --- **What problem is this PR intended to solve?** This is a proposed fix for #2598, see that issue for an extended explanation of the problem. This PR implements "option 2" from that issue's proposed solutions: - introduce a new `Reader#attribute_hash` that will return a `Hash<String ⇒ String>` (instead of an `Array<XML::Attr>`) - deprecate `Reader#attribute_nodes` with a plan to remove it entirely in a future release - re-implement `Reader#attributes` to use `#attribute_hash` (instead of `#attribute_nodes`) After this change, only applications calling `Reader#attribute_nodes` directly will be running the unsafe code. These users will see a deprecation warning and may use `#attribute_hash` as a replacement. I think it's very possible that `Reader#attribute_hash` won't meet the needs of people who are working with namespaced attributes and are using `#attribute_nodes` for this purpose. However, I'm intentionally deferring any attempt to solve that DX problem until someone who needs this functionality asks for it. **Have you included adequate test coverage?** I tried and failed to add test coverage to the suite that would reproduce the underlying GC bug. However, existing test coverage of `Reader#attributes` is sufficient for now. **Does this change affect the behavior of either the C or the Java implementations?** This PR modifies both the C and Java implementations to behave the same. Notably, the Java implementation contains a small bugfix which is that `Reader#namespaces` now returns an empty hash when there are no namespaces (it previously returned `nil`).
…de-gc_backport-v1.13.x fix: XML::Reader XML::Attr garbage collection (backport to v1.13.x) --- **What problem is this PR intended to solve?** This is a proposed fix for #2598, see that issue for an extended explanation of the problem. This PR implements "option 2" from that issue's proposed solutions: - introduce a new `Reader#attribute_hash` that will return a `Hash<String ⇒ String>` (instead of an `Array<XML::Attr>`) - deprecate `Reader#attribute_nodes` with a plan to remove it entirely in a future release - re-implement `Reader#attributes` to use `#attribute_hash` (instead of `#attribute_nodes`) After this change, only applications calling `Reader#attribute_nodes` directly will be running the unsafe code. These users will see a deprecation warning and may use `#attribute_hash` as a replacement. I think it's very possible that `Reader#attribute_hash` won't meet the needs of people who are working with namespaced attributes and are using `#attribute_nodes` for this purpose. However, I'm intentionally deferring any attempt to solve that DX problem until someone who needs this functionality asks for it. **Have you included adequate test coverage?** I tried and failed to add test coverage to the suite that would reproduce the underlying GC bug. However, existing test coverage of `Reader#attributes` is sufficient for now. **Does this change affect the behavior of either the C or the Java implementations?** This PR modifies both the C and Java implementations to behave the same. Notably, the Java implementation contains a small bugfix which is that `Reader#namespaces` now returns an empty hash when there are no namespaces (it previously returned `nil`).
Cutting v1.13.8 shortly to fix this. |
v1.13.8 has been released: https://github.com/sparklemotion/nokogiri/releases/tag/v1.13.8 |
**What problem is this PR intended to solve?** Before a minor release, I generally review deprecations and look for things we can remove. * Removed `Nokogiri::HTML5.get` which was deprecated in v1.12.0. [#2278] (@flavorjones) * Removed the CSS-to-XPath utility modules `XPathVisitorAlwaysUseBuiltins` and `XPathVisitorOptimallyUseBuiltins`, which were deprecated in v1.13.0 in favor of `XPathVisitor` constructor args. [#2403] (@flavorjones) * Removed `XML::Reader#attribute_nodes` which was deprecated in v1.13.8 in favor of `#attribute_hash`. [#2598, #2599] (@flavorjones) Also we're now specifying version numbers in remaining deprecation warnings. **Have you included adequate test coverage?** Tests have been removed, otherwise no new coverage needed. **Does this change affect the behavior of either the C or the Java implementations?** As documented above.
Originally reported downstream in pythonicrubyist/creek#105
Special thanks to @bf4 who helped reliably reproduce this.
More details coming shortly.
The text was updated successfully, but these errors were encountered: