Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

set_textType and friends need better markdown support #298

Open
cboettig opened this issue Mar 7, 2020 · 9 comments
Open

set_textType and friends need better markdown support #298

cboettig opened this issue Mar 7, 2020 · 9 comments

Comments

@cboettig
Copy link
Member

cboettig commented Mar 7, 2020

I don't think we have a very good way for users to supply markdown input. (The current examples showing support for markdown files still use the 2.1.1 strategy of converting to docbook via pandoc, which is error-prone as a EML doesn't support all docbook terms).

We can get valid 2.2.0 markdown 'manually', e.g. something like:

methods <- list(methodStep = 
                           list(description =
                                 list(markdown = paste(readLines("methods.md"), collapse = "\n"))))

but it would be much better to support something closer to the current sytnax in the example, e.g. maybe:

abstract <- set_TextType(markdown = "abstract.md")
methods <- set_methods(markdown = "methods.md")

(Functions would also need a check to make sure we were in 2.2.0 mode).

Anyone generating EML with markdown sections currently? Do you do something like the above? Any takers on a PR on this one?

@amoeba
Copy link
Collaborator

amoeba commented Mar 8, 2020

Good catch. Agree on the current implementation being a bit error-prone. EML 2.2.0 got Markdown support in part because of its popularity and ubiquity but also because the pre-2.2.0 TextType module was really hard to use. I think defaulting to a Markdown-first experience from here on out is a good idea.

Another API it'd be nice to support would be

list(methodStep = list( # or set_methods(markdown = ) instead
  list(description = list(
    markdown = "# Header

Example text.
"))))

so people can keep everything in the same R/Rmd.

@jeanetteclark are you or others on the Data Team using markdown for your TextType elements? @clnsmth, @earnaud?

@jeanetteclark
Copy link
Contributor

@amoeba not usually

@clnsmth
Copy link
Contributor

clnsmth commented Mar 9, 2020

Thanks for reaching out @amoeba @cboettig. I'm mostly using .docx because that's the preferred file type our data contributors submit their abstract and methods to us. Sadly, I'm using it to remove most of the text formatting before sending it to set_TextType() to bypass the time trouble shooting .docx > pandoc > EML translation issues.

I very much welcome improved support for .md > EML (.xml) and many thanks in advance to whoever takes this on!

@cboettig
Copy link
Member Author

@clnsmth Thanks! that's a great point. I think we should probably replace the existing mechanism which uses pandoc to convert docx and .md into docbook (since too much of modern docbook is just not EML compatible anyway). We can have it instead use pandoc to render .docx files into markdown, and pass-through .md markdown (at least so long as we are in eml-2.2.0 mode). That way, users can continue to rely on .docx as their interface but we'd avoid the myriad docbook translation issues.

I don't have any good solution for what to do about this in eml-2.1.1 mode though where we can't generate <markdown> sections. We could either leave it as is, or perhaps convert to markdown but encode it if it were just a plain text block. This would at least avoid the validation errors. (and we could encourage folks to adopt 2.2.0!)

@clnsmth
Copy link
Contributor

clnsmth commented Mar 10, 2020

Thanks @cboettig. I think I’m on board with this but have a clarifying question.

In the proposed changes, will at least the current level of support for .txt, .docx, .md > .xml be maintained when EML is operating in eml-2.2.0 mode? This is particularly important for tooling I’ve developed for the non-technical user and would be sad to see it go away : (. Additionally, continued support for .xml when operating in eml-2.2.0 mode supports data repositories and other metadata consumers using EML 2.2.0 but not ready/wanting to handle <makdown> sections.

@cboettig
Copy link
Member Author

cboettig commented Mar 12, 2020

Thanks @clnsmth for the feedback, that's good to know. I'm a bit wary of the transforms into docbook xml because they are currently not guaranteed to produce valid EML (basically depending on what kind of markup the user uses), but if they are useful we could certainly make the migration in a non-breaking way instead. It is mostly a matter of figuring out what syntax to use. e.g. perhaps we need a toggle-parameter like:

set_TextType("myfile.docx", as = "markdown")
set_methods("mymethods.docx", as = "markdown")

etc to encode in markdown, and maybe:

set_TextType("myfile.docx", as = "xml")
set_methods("mymethods.docx", as = "xml")

to encode as docbook. we could haggle over whether markdown or xml would be the default. (Making xml the default would obviously be the backwards-compatible option, but I think it also seems to be creating a somewhat unnecessary gotcha for new users).

Thoughts?

Thanks again for the feedback!

@cboettig
Copy link
Member Author

p.s. accidentally posted the above half-way through writing it, so if you're following via email, see updated thread on GitHub. apologies.

@amoeba
Copy link
Collaborator

amoeba commented Mar 12, 2020

Good point on backwards compatibility, @clnsmth. I think the ideas here are to maintain the ability to bring .txt, .docx, and .md docs into EML via this package but change what happens. The difference is what happens depending on the EML version (emld_db):

  1. emld_db = 2.1.1 or <: Convert txt, md, docx, etc. to DocBook sections. Current behavior.
  2. emld_db = 2.2.0 or >: Convert txt, md, docx, etc. to Markdown and place into <markdown> elements by default.

The current function signature of set_TextType is:

function(file = NULL, text = NULL)

and I think we can get all of this working with a change to

function(file = NULL, text = NULL, markdown = TRUE)

The function will retain the first part of its signature and the added markdown argument controls whether the function makes use of the EML 2.2.0 markdown element, defaulting to TRUE. Here's a confusion matrix to get an idea:

For any file= input, docx, md, or otherwise:

markdown= emld_db Result
TRUE <=2.1.1 DocBook w/ warning
FALSE <=2.1.1 DocBook
TRUE >=2.2.0 <markdown>
FALSE >=2.2.0 DocBook

Note: When using the text argument instead of file, I think you'd still always get the current behavior of setting a simple text type element instead of DocBook-esque or Markdown.

Does this sound workable?

PS: @clnsmth , you said:

Additionally, continued support for .xml when operating in eml-2.2.0 mode supports data repositories and other metadata consumers using EML 2.2.0 but not ready/wanting to handle sections.

Did you mean "continued support for our mini-DocBook, aka section, and para"? If the default behavior of set_TextType("foo") changed such that it output an <markdown by default, would that be a deal-breaker?

@clnsmth
Copy link
Contributor

clnsmth commented Mar 12, 2020

Thanks @cboettig @amoeba for considering my use case and these clarifying explanations. I'm completely on-board with changing the default behavior to markdown and retaining backwards compatibility through some permutation of arguments while emld_db = 2.2.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants