hadoopy backend? #26

Open
dgleich opened this issue Feb 8, 2011 · 4 comments

Comments

@dgleich

dgleich commented Feb 8, 2011

Hey, I really love the job management stuff in dumbo. However, the inner core of hadoopy seems to be more highly optimized (I get a factor of 2 better performance in my tests), so it seems to me that the right way to combine the two is to write a hadoopy backend for dumbo. Would this be something you'd be interested in adding to dumbo? I'm happy to work on it in some capacity if there is interest.

@bwhite

bwhite commented Feb 8, 2011

I would help with this if there is interest. The purpose of Hadoopy isn't to recreate this functionality; it is to provide a thin core Python interface for streaming. I use whirr and oozie for cluster and job management, respectively (Hadoopy is designed to be compatible with these tools). I can see more casual users not wanting to use these more powerful but complex tools and opting for a more integrated approach instead.

There are a few things we need to take into account.

  1. Practically, I'd need to relicense my code so that it is compatible (David and Andrew are the only other contributors). This shouldn't be a problem and I'd be willing to do that (I'd most likely dual license it).
  2. Should it be part of dumbo, optional, or a separate fork? I think the cleanest solution is that dumbo can optionally use Hadoopy as a backend if it is available.
  3. Backwards compatibility is going to be an important focus. I'd want to find a diverse set of Dumbo users to work with us running legacy code. Unit tests can help here.
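
To illustrate point 2, here is a minimal sketch of how an optional backend could be detected at runtime. This is only an assumption about how such a hook might look, not dumbo's or Hadoopy's actual code: the function name `pick_backend` and the `"builtin"` label are hypothetical, and the check uses only the standard library.

```python
import importlib.util


def pick_backend(preferred="hadoopy", fallback="builtin"):
    """Return the name of the backend to use.

    `preferred` is chosen only when a top-level module of that name is
    importable; otherwise the hypothetical `fallback` label is returned.
    """
    # find_spec returns None when no top-level module with this name exists.
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback


if __name__ == "__main__":
    # Prints "hadoopy" only if that package is installed in this environment.
    print(pick_backend())
```

With this shape, existing jobs keep running on the default code path when the optional package is missing, which fits the backwards-compatibility concern in point 3.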

@klbostee
Owner

klbostee commented Feb 8, 2011

I'd definitely be interested, and I'd be happy to review code or help figure out how to hook things up. I'm pretty busy these days, so I probably won't be able to help with the actual coding, but it looks like we might already have enough manpower to get something done. So bring on the code -- I look forward to having a look at it and trying it out. :)

@dgleich
Author

dgleich commented Feb 11, 2011

Okay, this sounds like something worth pursuing. (At least, I would really like it. I recently had to switch back to dumbo for some last-minute tests for a paper because I needed some of the libegg/libjar/etc. features.)

One question: would you need to dual license it if dumbo just used it as a black-box backend? (I am not up to speed on how Python's "import" interacts with licenses.) I agree that this is the cleanest approach.

@klbostee
Owner

Not sure about the licensing either, but surely we could figure something out...
