This repository has been archived by the owner on Mar 11, 2021. It is now read-only.

"sliding window" bigtable training mode #713

Open: wants to merge 17 commits into master

Conversation

@amj (Contributor) commented Feb 15, 2019

Instead of repeatedly calling train, create repeated datasets by incrementing along the dataset window.

@gitosaurus this is kind of a first cut -- it has to do all the full key retrieval and shuffle before it can start, and it'd be great if I could make that lazy somehow.
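
A minimal sketch of what this looks like end to end, reconstructed from the diff fragments quoted in this review. `games.moves_from_games`, `bigtable_input.TFEXAMPLE`, and the window arithmetic come from the diff; the default window values, the import, and the elided batching are assumptions.

```python
import tensorflow as tf  # TF 1.x-era API, matching the TPUEstimator usage here
import bigtable_input    # minigo module; assumed importable in this context

def get_many_tpu_bt_input_tensors(games, games_nr, batch_size,
                                  start_at, num_datasets,
                                  window_size=500000,       # assumed default
                                  window_increment=25000):  # assumed default
    """Slide a fixed-size window along the Bigtable of games, build one
    dataset per window position, and chain them into a single pipeline.
    (Batching/parsing elided; games_nr and batch_size unused in this sketch.)"""
    dataset = None
    for i in range(num_datasets):
        lo = start_at + i * window_increment
        ds = games.moves_from_games(lo, lo + window_size,
                                    shuffle=True,
                                    column_family=bigtable_input.TFEXAMPLE,
                                    column='example')
        ds = ds.map(lambda row_name, s: s)  # drop the row key, keep the example
        dataset = dataset.concatenate(ds) if dataset is not None else ds
    return dataset
```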

preprocessing.py Outdated
@@ -261,6 +261,40 @@ def get_tpu_bt_input_tensors(games, games_nr, batch_size, num_repeats=1,
    return dataset


def get_many_tpu_bt_input_tensors(games, games_nr, batch_size,
                                  start_at, num_datasets,

Contributor:

nit: indenting is wrong

@amj (Author):

Done.

preprocessing.py Outdated
# is proportionally along compared to last_game_number? comparing
# timestamps?)
ds = games.moves_from_games(start_at + (i * window_increment),
                            start_at + (i * window_increment) + window_size,

Contributor:

nit: indenting

@amj (Author):

Done.

preprocessing.py Outdated
                            shuffle=True,
                            column_family=bigtable_input.TFEXAMPLE,
                            column='example')
ds = ds.repeat(1)

Contributor:

You can probably move the repeat and map out of this loop.
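
A sketch of that suggestion, under the same assumptions as the sketch above: build only the raw per-window datasets inside the loop, then apply the map and the repeat once to the concatenated result.

```python
dataset = None
for i in range(num_datasets):
    lo = start_at + i * window_increment
    ds = games.moves_from_games(lo, lo + window_size,
                                shuffle=True,
                                column_family=bigtable_input.TFEXAMPLE,
                                column='example')
    dataset = dataset.concatenate(ds) if dataset is not None else ds
# Hoisted out of the loop: one map and one repeat over the whole chain.
dataset = dataset.map(lambda row_name, s: s)
dataset = dataset.repeat(1)
```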

@amj (Author):

Done.

                            column='example')
ds = ds.repeat(1)
ds = ds.map(lambda row_name, s: s)
dataset = dataset.concatenate(ds) if dataset else ds

Contributor:

Regarding the general approach: if the training loop does multiple scans, I would expect to create a new dataset for each pass, rather than try to create a single enormous dataset, which I imagine would be harder to debug, inspect, etc.

@amj (Author):

Yes, but multiple calls to TPUEstimator.train will create new graphs :( I'm not sure what a good solution for lazily evaluating these Datasets would be. As it is, it takes a really long time to build the datasets before training even starts -- I suspect the concatenate is doing something bad, since things get slower and slower.
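
One possible mitigation for the slowdown (a sketch, not from this PR): chaining concatenate left to right gives the Nth window a graph roughly N ops deep, so merging the windows as a balanced tree keeps the depth at O(log n). Whether this is actually the bottleneck observed here is an assumption.

```python
def balanced_concat(datasets):
    """Merge a list of tf.data.Datasets pairwise rather than left-to-right,
    so the resulting concatenate tree is O(log n) deep instead of O(n)."""
    assert datasets, "need at least one dataset"
    while len(datasets) > 1:
        merged = [a.concatenate(b)
                  for a, b in zip(datasets[::2], datasets[1::2])]
        if len(datasets) % 2:  # odd count: carry the leftover forward
            merged.append(datasets[-1])
        datasets = merged
    return datasets[0]
```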

@amj (Author) left a comment:

Thanks, PTAL :)

train.py Outdated
@@ -132,6 +132,36 @@ def after_run(self, run_context, run_values):
        self.before_weights = None


def train_many(start_at=1000000, num_datasets=3):

Contributor:

Can you expose moves here also?

@amj (Author):

What do you mean? Number of steps?
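
For context on the question, a sketch of how train_many could drive a single TPUEstimator.train call over the combined dataset; the estimator argument, the step budget, and the surrounding globals are assumptions, not this PR's actual code.

```python
def train_many(estimator, start_at=1000000, num_datasets=3,
               steps_per_window=10000):  # hypothetical step budget
    """One estimator.train call over the whole sliding-window dataset,
    instead of one call (and one freshly built graph) per window."""
    def input_fn(params):
        # games and games_nr are assumed to be configured elsewhere in minigo.
        return get_many_tpu_bt_input_tensors(
            games, games_nr, params['batch_size'],
            start_at=start_at, num_datasets=num_datasets)
    estimator.train(input_fn=input_fn,
                    max_steps=steps_per_window * num_datasets)
```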

@k8s-ci-robot commented:

@amj: The following test failed. Say /retest to rerun it:

Test name            Commit   Details  Rerun command
tf-minigo-presubmit  cdfe391  link     /test tf-minigo-presubmit

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
