HBase Schema Brainstorming

asutherland edited this page Mar 10, 2011 · 5 revisions

Braining

Situation Categorization

General Idioms

We expect to store a small, local set of data on the device that corresponds to recent messages/conversations and (recently) important/interesting data. We expect this to be a subset of a much greater set of data stored authoritatively and resiliently elsewhere (cloud/user device/whatever).

For the data the device is paying attention to, we want:

  • small, efficient, delta-style notifications in a timely fashion. We do not need to retrieve the whole conversation each time a new message shows up.

When we go swimming in the greater sea of data, we want:

  • blobs of data sized to our presentation or filtering needs. We do not want to have to replay a large journal over a corpus much larger than what we want to present or filter.

User Broad Access Cases

  • Currently new / interesting.
  • User action log / history...
    • Messages the user has sent
    • Actions the user took in a specific (semi-recent) time frame.
  • Message spelunking...
    • Previously deemed interesting.
    • Blue sky search.

Device Coordination

  • Changes to message meta-data:
    • made to currently new / interesting messages.
    • made to messages not actively known to the device.
  • Changes to contact (or other atemporal non-message) data.

Incoming Message Cases

  • Starts a new conversation.
  • Addition to a recent conversation.
  • Addition to an old conversation.

Temporal Locality as Disk Locality

Problem

Nutshell:

  • Messages have an inherent time attribute and so providing a persistent time-based identifier is easy.
  • Conversations are aggregates; the only way to provide a persistent identifier is to use the starting message. But the access pattern of a conversation is dependent on the most recently added message.

Broad Stroke Possible Strategies

  • Treat high-churn data (recent messages) as fundamentally different from low-churn data (old messages).
    • Favor a journaling style of implementation for high-churn data.
    • Favor a fully materialized style of implementation for low-churn data.
    • Migrate data from high-churn storage to low-churn storage as it ages out, as it were.
    • Could be a single implementation that just varies the maximum acceptable journal size by churn level...
  • Use persistent naming in conjunction with first-class tombstones.
    • Conversations would always be (location) named based on their most recent message. The previously current conversation would be tombstoned pointing to the new location.
    • High-churn situations would be mitigated by HBase's inherent generational compaction strategy and the locality we have going on.
  • Use windowed-time interval (sharding) and migration/tombstones.
    • Like persistent naming but with reduced migration because we operate at a looser time granularity.
    • Seems most beneficial if we are sharding on other factors already; assumes that other things in the same time window are likely of the same level of interest (or the wasted effort from them coming up from disk at the same time is not a concern).

Journaling

  • Primary data / message events:
    • New message, possibly causing a new conversation.
    • Permanently deleted message, possibly deleting the conversation.
  • Meta-data events:
    • Explicit user actions: changes to tags, starred state, read status, or marked-deleted state
    • Inferred user interest: Looked at the message some number of times for some duration, appeared to read the whole message, etc.
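As a rough sketch of how such a journal could be replayed, and folded into a materialized snapshot once it exceeds a maximum acceptable size: the `Event` shape, the event kind strings, and the `replay` / `compact` helpers are all assumptions made for illustration, and only the explicit tag-style meta-data events are modeled.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    kind: str         # "new_message", "delete_message", "tag", or "untag"
    message_id: str
    value: str = ""   # body for new_message, tag name for tag/untag


State = Dict[str, dict]


def replay(journal: List[Event], base: Optional[State] = None) -> State:
    """Apply journal events in order on top of an optional snapshot."""
    # Copy the snapshot so replaying never mutates the caller's data.
    state: State = {
        mid: {"body": m["body"], "tags": set(m["tags"])}
        for mid, m in (base or {}).items()
    }
    for ev in journal:
        if ev.kind == "new_message":
            state[ev.message_id] = {"body": ev.value, "tags": set()}
        elif ev.kind == "delete_message":
            state.pop(ev.message_id, None)
        elif ev.kind == "tag" and ev.message_id in state:
            state[ev.message_id]["tags"].add(ev.value)
        elif ev.kind == "untag" and ev.message_id in state:
            state[ev.message_id]["tags"].discard(ev.value)
    return state


def compact(snapshot: State, journal: List[Event],
            max_journal: int) -> Tuple[State, List[Event]]:
    """Fold the journal into the snapshot once it grows past max_journal."""
    if len(journal) <= max_journal:
        return snapshot, journal
    return replay(journal, snapshot), []


journal = [
    Event("new_message", "m1", "hello"),
    Event("tag", "m1", "starred"),
    Event("new_message", "m2", "spam"),
    Event("delete_message", "m2"),
    Event("tag", "m1", "read"),
    Event("untag", "m1", "starred"),
]
state = replay(journal)
```

Under this sketch, high-churn data would use a generous `max_journal` and low-churn data a `max_journal` of zero, which amounts to always keeping it fully materialized.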

Atemporal-ish Data (People)

The number of people involved is small enough that, as long as we perform some reasonable degree of thresholding / sharding to separate out random spammers/etc. (or do not store them in the first place), we could be fine.

How can people be temporal:

  • Last communication initiated by us with them, or by them with us.
  • Last time the address book card for them was consulted.

Candidate Schemas

Churn differentiating

table high-churn-message-spools, key: [user id, time window id, spool id, increasing value].

All new messages go in here.
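One way the composite key could be laid out, as a sketch: HBase orders rows by comparing row key bytes lexicographically, so packing each component as a fixed-width big-endian unsigned integer keeps rows for the same user, window, and spool adjacent and in increasing order. The field widths, the one-week window size, and the helper names below are assumptions, not part of the schema above.

```python
import struct

WINDOW_SECONDS = 7 * 24 * 3600  # assumed window size: one week


def time_window_id(epoch_seconds: int) -> int:
    """Map a message timestamp to its (assumed fixed-size) time window."""
    return epoch_seconds // WINDOW_SECONDS


def spool_row_key(user_id: int, window_id: int, spool_id: int, seq: int) -> bytes:
    """Pack [user id, time window id, spool id, increasing value] into 24 bytes.

    Big-endian unsigned fields compare bytewise in the same order as the
    numbers they encode, so byte order matches tuple order.
    """
    return struct.pack(">QIIQ", user_id, window_id, spool_id, seq)


tuples = [(1, 5, 0, 2), (1, 5, 0, 10), (1, 4, 9, 300), (2, 0, 0, 0), (1, 5, 1, 0)]
keys = [spool_row_key(*t) for t in tuples]

# A scan over one spool is a prefix scan on the first 16 bytes of the key.
spool_prefix = spool_row_key(1, 5, 0, 0)[:16]
spool_rows = [k for k in sorted(keys) if k.startswith(spool_prefix)]
```

Leading with the user id keeps a user's data together; if write hot-spotting turned out to be a concern, the leading component could be hashed or salted instead, at the cost of the straightforward per-user scan.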

table high-churn-metadata, key: [user id, time window id, spool id].