The columns contactPoint, publisher, keyword, and theme can be pretty easily converted to vectors, but the distribution column is more complicated. I've outlined some approaches to handling the distribution column below.
The contactPoint and publisher columns are easy to flatten into a single row with two columns each. So, contactPoint could become contactPoint:<column1>, contactPoint:<column2>. In the past I've used the following approach:
The keyword and theme columns are currently lists, but we could easily flatten them by converting them to JSON. I have been using this approach:
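The snippet referred to above is not shown here. As a rough sketch of what this kind of flattening could look like, assuming contactPoint arrives as a nested data.frame column (which is what jsonlite::fromJSON() tends to produce for nested objects), something like the following might work. The toy data, the field names fn and hasEmail, and the helper flatten_nested_col are all illustrative assumptions, not the original code.

```r
# Toy stand-in for the parsed catalog. The nested data.frame column mimics
# the assumed shape of contactPoint after parsing; the values are made up.
data_df <- data.frame(title = c("A", "B"), stringsAsFactors = FALSE)
data_df$contactPoint <- data.frame(
  fn = c("Alice", "Bob"),
  hasEmail = c("mailto:a@example.org", "mailto:b@example.org"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace one nested data.frame column with flat
# columns named "<col>:<field>".
flatten_nested_col <- function(df, col) {
  nested <- df[[col]]
  names(nested) <- paste(col, names(nested), sep = ":")
  df[[col]] <- NULL
  cbind(df, nested)  # the data.frame method of cbind() keeps the ":" in names
}

flat_df <- flatten_nested_col(data_df, "contactPoint")
names(flat_df)
```

The same helper could presumably be applied to publisher, as long as it has the same nested shape.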
# clean up keywords
keywords <- data_df$keyword
data_df$keyword <- NA_character_
# seq_len() avoids looping over c(1, 0) when data_df has no rows
for (i in seq_len(nrow(data_df))) {
  if (!is.null(keywords[[i]])) {
    data_df$keyword[i] <- jsonlite::toJSON(keywords[[i]])
  }
}
The distribution column is a list that contains irregular and somewhat large data.frames. As such, it would be hard to flatten each element of distribution into a single row. However, distribution can be pretty easily normalized and put into a separate data.frame. Here's what I did to separate it out:
distro <- df[, "distribution"]
df[, "distribution"] <- NULL
## add landing page as a key
distro <- lapply(seq_along(distro), function(i)
  cbind(landingPage = df$landingPage[i], distro[[i]]))
distro <- do.call(rbind, distro)
This is nice, but now we would have df_data and distro to return to the user, which introduces design questions.
Here are some approaches that I would propose for handling the distribution return value.
Scenario 1
At first I thought this would require a two-part return value (table + distribution), but then I realized that we could put the distribution table into the attributes. It might sound crazy to put a data.frame into the attributes, but there are other functions that return extra data this way, such as the regex functions (regexpr(), for example, returns match.length as an attribute).
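A minimal sketch of the attribute idea, using made-up toy tables rather than real catalog output. The attribute name "distribution" is an assumption for illustration.

```r
# Toy main table and toy normalized distribution table (values are made up).
df_data <- data.frame(
  landingPage = c("https://example.org/a", "https://example.org/b"),
  title = c("A", "B"),
  stringsAsFactors = FALSE
)
distro <- data.frame(
  landingPage = c("https://example.org/a", "https://example.org/a",
                  "https://example.org/b"),
  mediaType = c("text/csv", "application/json", "text/csv"),
  stringsAsFactors = FALSE
)

# Attach the distribution table to the main table as an attribute.
attr(df_data, "distribution") <- distro

# The main table still looks like an ordinary two-column data.frame...
ncol(df_data)
# ...and the distribution table can be pulled back out on request.
attr(df_data, "distribution")
```

One thing to check with this design is that operations which build a new data.frame, such as merge(), are not guaranteed to carry custom attributes over.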
Scenario 2
We could start returning a list with each element listed separately. In a lot of ways this is the cleanest, most flexible approach. However, it's kind of annoying.
df_data : the main table that we already have, but flattened into a proper data.frame. This would keep the attributes as proposed.
distribution as a separate flattened table
@context
@id
@type
conformsTo
describedBy
It would be less annoying with two return values (df_data, and the distribution), and we could continue to attach the other elements as attributes to df_data. This is a new kind of annoying, with attributes nested within a return value.
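As a sketch of the two-value variant, the return value might be shaped roughly like this. The toy tables and the attribute values are placeholders, not real catalog output.

```r
# Toy tables standing in for the flattened main table and the distribution table.
df_data <- data.frame(
  landingPage = c("https://example.org/a", "https://example.org/b"),
  title = c("A", "B"),
  stringsAsFactors = FALSE
)
distro <- data.frame(
  landingPage = c("https://example.org/a", "https://example.org/b"),
  mediaType = c("text/csv", "text/csv"),
  stringsAsFactors = FALSE
)

# Catalog-level fields stay on df_data as attributes (placeholder values).
attr(df_data, "@type") <- "dcat:Catalog"
attr(df_data, "conformsTo") <- "https://project-open-data.cio.gov/v1.1/schema"

# Two return values bundled in a list.
result <- list(df_data = df_data, distribution = distro)

# The caller then has to reach through the list to get at the attributes.
attr(result$df_data, "@type")
```

The last line shows the nesting mentioned above: the attributes live one level down, inside an element of the return value.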
Scenario 3
We could make it so that the distribution information isn't returned by default, and create a separate function that would retrieve that information if the user wanted it (or it could be an option in ls.socrata).
In that scenario ls.socrata would return
df_data : the main table that we already have, but flattened into a proper data.frame. This would keep the attributes as proposed. There would be no distribution column.
Something else like ls.socrata.distros would return the distro data.frame.
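A rough sketch of how that split could work. To keep it self-contained it operates on an already-parsed toy catalog instead of calling the API, and both helper names (drop_distros, extract_distros) are hypothetical stand-ins for what ls.socrata and ls.socrata.distros might do internally.

```r
# Toy catalog with a list column of per-dataset distribution tables (made up).
catalog <- data.frame(
  landingPage = c("https://example.org/a", "https://example.org/b"),
  stringsAsFactors = FALSE
)
catalog$distribution <- list(
  data.frame(mediaType = c("text/csv", "application/json"),
             stringsAsFactors = FALSE),
  data.frame(mediaType = "text/csv", stringsAsFactors = FALSE)
)

# Hypothetical: what the main function would return, without distribution.
drop_distros <- function(catalog) {
  catalog$distribution <- NULL
  catalog
}

# Hypothetical: what the separate function would return, keyed by landingPage.
extract_distros <- function(catalog) {
  pieces <- lapply(seq_len(nrow(catalog)), function(i) {
    d <- catalog$distribution[[i]]
    if (is.null(d) || nrow(d) == 0) return(NULL)
    cbind(landingPage = catalog$landingPage[i], d, stringsAsFactors = FALSE)
  })
  # Skip datasets with no distribution before stacking.
  do.call(rbind, Filter(Negate(is.null), pieces))
}

df_data <- drop_distros(catalog)
distros <- extract_distros(catalog)
```

Note that rbind() assumes every distribution table has the same columns. If they really are irregular, something that fills missing columns with NA, such as dplyr::bind_rows(), would probably be needed instead.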
We discussed starting a branch right away, but on second thought I'll wait until we're ready to do the work to avoid the need to rebase later.
@tomschenkjr
I've worked out an example of the code for option 1.
https://gist.github.com/geneorama/c03a2db1463b32622f07e8779a8cb712