set_textType and friends need better markdown support #298

cboettig · 2020-03-07T22:21:15Z

I don't think we have a very good way for users to supply markdown input. (The current examples showing support for markdown files still use the 2.1.1 strategy of converting to docbook via pandoc, which is error-prone as a EML doesn't support all docbook terms).

We can get valid 2.2.0 markdown 'manually', e.g. something like:

methods <- list(methodStep = 
                           list(description =
                                 list(markdown = paste(readLines("methods.md"), collapse = "\n"))))

but it would be much better to support something closer to the current sytnax in the example, e.g. maybe:

abstract <- set_TextType(markdown = "abstract.md")
methods <- set_methods(markdown = "methods.md")

(Functions would also need a check to make sure we were in 2.2.0 mode).

Anyone generating EML with markdown sections currently? Do you do something like the above? Any takers on a PR on this one?

amoeba · 2020-03-08T00:37:45Z

Good catch. Agree on the current implementation being a bit error-prone. EML 2.2.0 got Markdown support in part because of its popularity and ubiquity but also because the pre-2.2.0 TextType module was really hard to use. I think defaulting to a Markdown-first experience from here on out is a good idea.

Another API it'd be nice to support would be

list(methodStep = list( # or set_methods(markdown = ) instead
  list(description = list(
    markdown = "# Header

Example text.
"))))

so people can keep everything in the same R/Rmd.

@jeanetteclark are you or others on the Data Team using markdown for your TextType elements? @clnsmth, @earnaud?

jeanetteclark · 2020-03-09T16:45:09Z

@amoeba not usually

clnsmth · 2020-03-09T19:03:16Z

Thanks for reaching out @amoeba @cboettig. I'm mostly using .docx because that's the preferred file type our data contributors submit their abstract and methods to us. Sadly, I'm using it to remove most of the text formatting before sending it to set_TextType() to bypass the time trouble shooting .docx > pandoc > EML translation issues.

I very much welcome improved support for .md > EML (.xml) and many thanks in advance to whoever takes this on!

cboettig · 2020-03-10T17:12:03Z

@clnsmth Thanks! that's a great point. I think we should probably replace the existing mechanism which uses pandoc to convert docx and .md into docbook (since too much of modern docbook is just not EML compatible anyway). We can have it instead use pandoc to render .docx files into markdown, and pass-through .md markdown (at least so long as we are in eml-2.2.0 mode). That way, users can continue to rely on .docx as their interface but we'd avoid the myriad docbook translation issues.

I don't have any good solution for what to do about this in eml-2.1.1 mode though where we can't generate <markdown> sections. We could either leave it as is, or perhaps convert to markdown but encode it if it were just a plain text block. This would at least avoid the validation errors. (and we could encourage folks to adopt 2.2.0!)

clnsmth · 2020-03-10T19:12:01Z

Thanks @cboettig. I think I’m on board with this but have a clarifying question.

In the proposed changes, will at least the current level of support for .txt, .docx, .md > .xml be maintained when EML is operating in eml-2.2.0 mode? This is particularly important for tooling I’ve developed for the non-technical user and would be sad to see it go away : (. Additionally, continued support for .xml when operating in eml-2.2.0 mode supports data repositories and other metadata consumers using EML 2.2.0 but not ready/wanting to handle <makdown> sections.

cboettig · 2020-03-12T04:00:11Z

Thanks @clnsmth for the feedback, that's good to know. I'm a bit wary of the transforms into docbook xml because they are currently not guaranteed to produce valid EML (basically depending on what kind of markup the user uses), but if they are useful we could certainly make the migration in a non-breaking way instead. It is mostly a matter of figuring out what syntax to use. e.g. perhaps we need a toggle-parameter like:

set_TextType("myfile.docx", as = "markdown")
set_methods("mymethods.docx", as = "markdown")

etc to encode in markdown, and maybe:

set_TextType("myfile.docx", as = "xml")
set_methods("mymethods.docx", as = "xml")

to encode as docbook. we could haggle over whether markdown or xml would be the default. (Making xml the default would obviously be the backwards-compatible option, but I think it also seems to be creating a somewhat unnecessary gotcha for new users).

Thoughts?

Thanks again for the feedback!

cboettig · 2020-03-12T04:03:16Z

p.s. accidentally posted the above half-way through writing it, so if you're following via email, see updated thread on GitHub. apologies.

amoeba · 2020-03-12T07:47:18Z

Good point on backwards compatibility, @clnsmth. I think the ideas here are to maintain the ability to bring .txt, .docx, and .md docs into EML via this package but change what happens. The difference is what happens depending on the EML version (emld_db):

emld_db = 2.1.1 or <: Convert txt, md, docx, etc. to DocBook sections. Current behavior.
emld_db = 2.2.0 or >: Convert txt, md, docx, etc. to Markdown and place into <markdown> elements by default.

The current function signature of set_TextType is:

function(file = NULL, text = NULL)

and I think we can get all of this working with a change to

function(file = NULL, text = NULL, markdown = TRUE)

The function will retain the first part of its signature and the added markdown argument controls whether the function makes use of the EML 2.2.0 markdown element, defaulting to TRUE. Here's a confusion matrix to get an idea:

For any file= input, docx, md, or otherwise:

`markdown=`	`emld_db`	Result
`TRUE`	`<=2.1.1`	DocBook w/ warning
`FALSE`	`<=2.1.1`	DocBook
`TRUE`	`>=2.2.0`	`<markdown>`
`FALSE`	`>=2.2.0`	DocBook

Note: When using the text argument instead of file, I think you'd still always get the current behavior of setting a simple text type element instead of DocBook-esque or Markdown.

Does this sound workable?

PS: @clnsmth , you said:

Additionally, continued support for .xml when operating in eml-2.2.0 mode supports data repositories and other metadata consumers using EML 2.2.0 but not ready/wanting to handle sections.

Did you mean "continued support for our mini-DocBook, aka section, and para"? If the default behavior of set_TextType("foo") changed such that it output an <markdown by default, would that be a deal-breaker?

clnsmth · 2020-03-12T17:56:26Z

Thanks @cboettig @amoeba for considering my use case and these clarifying explanations. I'm completely on-board with changing the default behavior to markdown and retaining backwards compatibility through some permutation of arguments while emld_db = 2.2.0.

mbjones added this to the v2.1.0: Supporitng multiple versions of EML milestone Jul 2, 2020

clnsmth mentioned this issue Mar 18, 2021

Better support for markdown and LaTeX EDIorg/EMLassemblyline#85

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

set_textType and friends need better markdown support #298

set_textType and friends need better markdown support #298

cboettig commented Mar 7, 2020

amoeba commented Mar 8, 2020

jeanetteclark commented Mar 9, 2020

clnsmth commented Mar 9, 2020

cboettig commented Mar 10, 2020

clnsmth commented Mar 10, 2020

cboettig commented Mar 12, 2020 •

edited

Loading

cboettig commented Mar 12, 2020

amoeba commented Mar 12, 2020

clnsmth commented Mar 12, 2020

set_textType and friends need better markdown support #298

set_textType and friends need better markdown support #298

Comments

cboettig commented Mar 7, 2020

amoeba commented Mar 8, 2020

jeanetteclark commented Mar 9, 2020

clnsmth commented Mar 9, 2020

cboettig commented Mar 10, 2020

clnsmth commented Mar 10, 2020

cboettig commented Mar 12, 2020 • edited Loading

cboettig commented Mar 12, 2020

amoeba commented Mar 12, 2020

clnsmth commented Mar 12, 2020

cboettig commented Mar 12, 2020 •

edited

Loading