
We will never merge this (it will go into actor-math): 55 make polynomial operations parallel #58

Closed
wants to merge 9 commits into from

Conversation

martun
Contributor

@martun martun commented Sep 8, 2023

No description provided.

@martun martun self-assigned this Sep 8, 2023
@martun martun linked an issue Sep 8, 2023 that may be closed by this pull request
@martun martun marked this pull request as draft September 8, 2023 13:50
@martun martun changed the title 55 make polynomial operations parallel We will never merge this: 55 make polynomial operations parallel Oct 13, 2023
@martun martun force-pushed the 55-make-polynomial-operations-parallel branch from c3ab96b to a5a65ea Compare January 25, 2024 11:17

github-actions bot commented Jan 25, 2024

Test Results

36 files, 36 suites, took 4s ⏱️
99 tests: 95 passed ✔️, 4 skipped 💤, 0 failed
496 runs: 480 passed ✔️, 16 skipped 💤, 0 failed

Results for commit f56af41.

♻️ This comment has been updated with latest results.

@martun martun force-pushed the 55-make-polynomial-operations-parallel branch from 8158bb0 to fce0696 Compare January 29, 2024 08:04
@martun martun force-pushed the 55-make-polynomial-operations-parallel branch 3 times, most recently from eb786ab to 4322bde Compare January 29, 2024 15:30
@martun martun force-pushed the 55-make-polynomial-operations-parallel branch from 4322bde to 07bf357 Compare January 29, 2024 15:39
@martun martun changed the title We will never merge this: 55 make polynomial operations parallel We will never merge this (it will go into actor-math): 55 make polynomial operations parallel Jan 29, 2024
CMakeLists.txt Outdated
@@ -68,7 +68,7 @@ target_include_directories(${CMAKE_WORKSPACE_NAME}_${CURRENT_PROJECT_NAME} INTER
target_link_libraries(${CMAKE_WORKSPACE_NAME}_${CURRENT_PROJECT_NAME} INTERFACE
${CMAKE_WORKSPACE_NAME}::algebra
${CMAKE_WORKSPACE_NAME}::multiprecision

pthread

Per https://cmake.org/cmake/help/latest/module/FindThreads.html, it would be more idiomatic to use find_package(Threads REQUIRED) previously in the file, and then link against Threads::Threads here instead of pthread.

Contributor Author

Removed it, looks like it was not needed any more.

Comment on lines 70 to 71
// We need the lambda to be mutable, to be able to modify iterators captured by value.
[this](std::size_t begin, std::size_t end) {

Not sure what the comment is trying to say, but there is no mutable specifier :-).

Contributor Author

Yes, this comment was a leftover from an older version of code.

for (std::size_t i = 0; i < a.size(); ++i) {
a[i] = a[i] * sconst;
}
nil::crypto3::parallel_foreach(a.begin(), a.end(), [&sconst](value_type& v){v *= sconst.data;});

The convention of calling the data element a_i really helps make things clearer in other expressions, so I prefer it to just v. Also, consider leaving the loop body on a line by itself — I think it increases readability.

Contributor Author

Done.
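For reference, `parallel_foreach` here is nil::crypto3's own helper. A minimal stand-in built directly on `std::thread` (ignoring the PR's thread-pool plumbing) might look like the sketch below; the chunking policy is an assumption, not the library's actual code, which submits chunks to its pools instead of spawning threads per call.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <thread>
#include <vector>

// Hypothetical stand-in for nil::crypto3::parallel_foreach: split
// [first, last) into one contiguous chunk per worker and apply f to
// every element of that chunk on its own thread.
template<class RandomIt, class UnaryFunc>
void parallel_foreach(RandomIt first, RandomIt last, UnaryFunc f,
                      std::size_t workers = std::thread::hardware_concurrency()) {
    const std::size_t n = static_cast<std::size_t>(std::distance(first, last));
    if (n == 0) return;
    if (workers == 0) workers = 1;
    if (workers > n) workers = n;
    const std::size_t chunk = (n + workers - 1) / workers;  // ceiling division

    std::vector<std::thread> threads;
    for (std::size_t begin = 0; begin < n; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, n);
        threads.emplace_back([first, begin, end, &f] {
            for (std::size_t i = begin; i < end; ++i)
                f(first[i]);
        });
    }
    for (auto& t : threads) t.join();
}
```

With a helper of this shape, the scalar multiplication above reads naturally as `parallel_foreach(a.begin(), a.end(), [&sconst](value_type& a_i) { a_i *= sconst.data; });`, matching the reviewer's preferred `a_i` naming.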


std::size_t m = 1; // invariant: m = 2^{s-1}
field_value_type w_m;

Why do we move the declaration of w_m outside of the loop?

Contributor Author

Moved back.

Comment on lines 46 to 47
static ThreadPool instance0(0, pool_size);
static ThreadPool instance1(1, pool_size);

Both thread pools get initialized simultaneously upon first execution of get_instance, but we may only ever need one of them. Is that a problem?

Contributor Author

I was thinking about creating 2 separate functions, but I was afraid that the function for instance1 could be called simultaneously from the threads of instance0, and I'm not sure how safe that is during the initialization of instance1.

In practice (except the tests), 2 pools will always be used.
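For context, the concern here is about initializing function-local statics from multiple threads. Since C++11, initialization of a function-local static is itself thread-safe (the compiler emits the necessary synchronization), so two separate getters would also be safe, each pool being constructed lazily on its own first use. A minimal sketch with a trivial stand-in class and hypothetical getter names:

```cpp
#include <cassert>
#include <cstddef>

// Trivial stand-in for the PR's ThreadPool; only the singleton
// pattern is illustrated here.
class ThreadPool {
public:
    // C++11 "magic statics": each pool is constructed exactly once, in a
    // thread-safe way, the first time its getter runs -- even if that
    // call comes from a worker thread of the other pool.
    static ThreadPool& low_level()  { static ThreadPool pool(0); return pool; }
    static ThreadPool& high_level() { static ThreadPool pool(1); return pool; }

    std::size_t number() const { return pool_number; }

private:
    explicit ThreadPool(std::size_t number) : pool_number(number) {}
    std::size_t pool_number;
};
```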


// Pool #0 will take care of the lowest level of operations, like polynomial operations.
// We want the minimal size of element_per_cpu to be 65536, otherwise the cores are not loaded.
if (pool_number == 0 && element_per_cpu < 65536) {

Since pool numbers bear special significance, I suggest using an enum instead of the magic constants 0 and 1.

Contributor Author

Done.

std::size_t element_per_cpu = elements_count / cpu_usage;

// Pool #0 will take care of the lowest level of operations, like polynomial operations.
// We want the minimal size of element_per_cpu to be 65536, otherwise the cores are not loaded.

Please define a constant for the magic number.

Contributor Author

Done.
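A sketch of how the two suggestions combine (an enum instead of the magic 0/1, and a named constant instead of 65536); the identifiers below are illustrative, not necessarily the PR's final ones, and `max_cpus` is assumed to be at least 1:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Illustrative names, not the PR's final identifiers.
enum class PoolLevel { LOW, HIGH };

// For the low-level pool, chunks smaller than this reportedly fail to
// load the cores, so fewer cores are used instead.
constexpr std::size_t kMinChunkSize = 65536;

inline std::size_t choose_cpu_usage(PoolLevel level,
                                    std::size_t elements_count,
                                    std::size_t max_cpus) {
    if (level == PoolLevel::LOW && elements_count / max_cpus < kMinChunkSize) {
        // Shrink the core count so each core gets at least kMinChunkSize
        // elements (but always use at least one core).
        return std::max<std::size_t>(
            1, std::min(max_cpus, elements_count / kMinChunkSize));
    }
    return max_cpus;
}
```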

boost::asio::thread_pool pool;
std::size_t pool_size;

// Each pool with know it's number.

typo: "will know its number"

Contributor Author

Changed.

@@ -1293,4 +1296,75 @@ BOOST_AUTO_TEST_CASE(polynomial_dfs_zero_one_test) {
BOOST_CHECK((small_poly - one * small_poly).is_zero());
}

BOOST_AUTO_TEST_CASE(polynomial_dfs_addition_perf_test, *boost::unit_test::disabled()) {

I suppose we need some basic functional tests that are not disabled.

Contributor Author
@martun martun Jan 30, 2024

Everything is already tested, there are lots of tests in this file that cover the new code.

These disabled tests will probably be moved to a separate "benchmark" sometime later.

I ended up adding one to check the pools. We will add more tests in the future, once everything works together.

Comment on lines 84 to 85
// Here we can parallelize on the both cycles with 'k' and 'm', because for each value of k and m
// the ranges of array 'a' used do not intersect. Think of these 2 cycles as 1.

"Cycle" usually refers to something like CPU cycles, so to avoid confusion, let's call these "loops".

Contributor Author

Done.

@martun martun requested a review from avm January 30, 2024 07:51
@martun martun force-pushed the 55-make-polynomial-operations-parallel branch from 51f3225 to cdea4d6 Compare January 30, 2024 09:51
@martun martun force-pushed the 55-make-polynomial-operations-parallel branch from cdea4d6 to 63fae20 Compare January 30, 2024 10:03
@avm avm left a comment

LGTM, I left some non-essential suggestions.

Comment on lines 89 to 95
std::size_t element_per_cpu = elements_count / cpu_usage;

// Pool #0 will take care of the lowest level of operations, like polynomial operations.
// We want the minimal size of element_per_cpu to be 65536, otherwise the cores are not loaded.
if (pool_id == PoolID::LOW_LEVEL_POOL_ID && element_per_cpu < POOL_0_MIN_CHUNK_SIZE) {
cpu_usage = elements_count / POOL_0_MIN_CHUNK_SIZE + elements_count % POOL_0_MIN_CHUNK_SIZE ? 1 : 0;
element_per_cpu = elements_count / cpu_usage;

element_per_cpu could be declared const and initialized once after the if.
(Also, elements_per_cpu might be a better name.)

Contributor Author

Done.
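One detail worth noting in the quoted diff: because `?:` binds looser than `+`, the expression `elements_count / X + elements_count % X ? 1 : 0` parses as `(elements_count / X + elements_count % X) ? 1 : 0`, which yields 1 for any nonzero count rather than the intended ceiling division. A small helper with the parentheses in the right place (the helper name is mine, not the PR's):

```cpp
#include <cassert>
#include <cstddef>

// Ceiling division with the parentheses the quoted line is missing:
// without them, `a / b + a % b ? 1 : 0` is `(a / b + a % b) ? 1 : 0`.
// Assumes b > 0.
inline std::size_t ceil_div(std::size_t a, std::size_t b) {
    return a / b + (a % b ? 1 : 0);
}
```

With this helper, the reviewer's single-initialization suggestion becomes straightforward: compute `cpu_usage = ceil_div(elements_count, POOL_0_MIN_CHUNK_SIZE);` inside the `if`, then initialize `const std::size_t elements_per_cpu = elements_count / cpu_usage;` exactly once after it.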


template<class ReturnType>
inline std::future<ReturnType> post(std::function<ReturnType()> task) {
auto packaged_task = std::make_shared<std::packaged_task<ReturnType()>>(std::move(task));

Could we use a unique_ptr here?
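For context on why `shared_ptr` shows up in this pattern: `std::packaged_task` is move-only, so a lambda capturing it by value is move-only too, and any path that stores the handler in a copy-requiring wrapper such as `std::function` rejects it. `shared_ptr` keeps the lambda copyable; with C++23's `std::move_only_function`, or an executor that accepts move-only handlers (recent Boost.Asio versions do), a `unique_ptr` capture can work. A self-contained sketch with a hypothetical copy-requiring executor type:

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <memory>
#include <utility>

// Hypothetical executor type that, like std::function, requires its
// handlers to be copyable.
using Executor = std::function<void(std::function<void()>)>;

template<class ReturnType>
std::future<ReturnType> post_task(Executor& ex, std::function<ReturnType()> task) {
    // shared_ptr keeps the capturing lambda copyable; capturing a
    // unique_ptr instead would make it move-only, and the conversion to
    // std::function<void()> below would not compile.
    auto packaged = std::make_shared<std::packaged_task<ReturnType()>>(std::move(task));
    std::future<ReturnType> fut = packaged->get_future();
    ex([packaged] { (*packaged)(); });
    return fut;
}
```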

element_per_cpu = elements_count / cpu_usage;
}

std::size_t begin = 0;

Is this variable unused?

Contributor Author

Changed this code; it was supposed to be used.

}

std::size_t begin = 0;
for (int i = 0; i < cpu_usage; i++) {

Not a problem, but it may be more logical to declare i as size_t.

Contributor Author

Done.

Comment on lines 117 to 123
std::size_t pool_size;

PoolID pool_id;

// For pool #0 we have experimentally found that operations over chunks of <65536 elements
// do not load the cores. In case we have smaller chunks, it's better to load less cores.
const std::size_t POOL_0_MIN_CHUNK_SIZE = 65536;

pool_size and pool_id can probably be const, and the constant can probably be constexpr.

Contributor Author

Done.

Comment on lines 46 to 47
LOW_LEVEL_POOL_ID,
HIGH_LEVEL_POOL_ID

Since PoolID is the name of the enum, the _POOL_ID suffix seems redundant (and maybe we can call the enum PoolLevel or PoolTier and just have two values LOW and HIGH).

Contributor Author

Changed.

};

/** Returns a thread pool, based on the pool_id. pool with LOW_LEVEL_POOL_ID is normally used for low-level operations, like polynomial
* operations and fft. Any code that uses these operations and needs to be parallel will submit it's tasks to pool with HIGH_LEVEL_POOL_ID.

typo: "its tasks"

Contributor Author

Changed.


/** Returns a thread pool, based on the pool_id. pool with LOW_LEVEL_POOL_ID is normally used for low-level operations, like polynomial
* operations and fft. Any code that uses these operations and needs to be parallel will submit it's tasks to pool with HIGH_LEVEL_POOL_ID.
* Submission of higher level tasks to low level pool will immediately result to a deadlock.

typo: "result in a deadlock"

Contributor Author

Changed.

}

// Waits for all the tasks to complete.
inline void join() {

I suspect those methods are automatically inline by virtue of being defined inside the class definition (but no harm in restating that, of course).

Contributor Author

I need to write it, to keep the library header-only...
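To unpack this exchange: member functions defined inside the class body are implicitly inline, so the keyword is redundant (though harmless) there; it is free functions defined in a header that genuinely need `inline` to stay header-only without one-definition-rule violations. A small illustration with hypothetical names:

```cpp
#include <cassert>

struct Pool {
    // Implicitly inline: defined inside the class definition, so this is
    // already safe to place in a header without the keyword.
    int join() { return 0; }

    // The explicit keyword here is redundant, but not wrong.
    inline int stop() { return 1; }
};

// A free function defined in a header DOES need `inline`; otherwise,
// including the header from two .cpp files breaks the one-definition rule.
inline int join_all() { return Pool{}.join(); }
```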

template< class InputIt, class OutputIt, class UnaryOperation >
void parallel_transform(InputIt first1, InputIt last1,
OutputIt d_first, UnaryOperation unary_op,
ThreadPool::PoolID pool_id = ThreadPool::PoolID::LOW_LEVEL_POOL_ID) {

Since it is very important to choose the right pool, we might not want a default value here, so the choice of the pool is always explicit.

Contributor Author

Nope, let's keep these defaults. We almost always want the lower pool when doing transforms, and if someone makes the wrong choice, they will immediately get a deadlock and notice that something's wrong anyway.

@martun martun force-pushed the 55-make-polynomial-operations-parallel branch from 8828e2d to 270f7e0 Compare January 30, 2024 11:20
@martun martun force-pushed the 55-make-polynomial-operations-parallel branch from 270f7e0 to f56af41 Compare January 30, 2024 11:21
@martun martun closed this Feb 27, 2024

Successfully merging this pull request may close these issues.

Make polynomial operations parallel