Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] arithmetic combination (addition, multiplication, etc) of sequences #47

Merged
merged 6 commits into from
Apr 17, 2024

Conversation

fkiraly
Copy link

@fkiraly fkiraly commented Apr 6, 2024

Fixes #21, implements a class Arith which allows to create symbolic arithmetic combinations of sequences. Evaluation methods carry out the eager computation.

Also implements dunders __add__, __mul__ in the base class.

Does not implement __len__, or other sub-class specific methods for now - my intuition would be to make some of the subclasses property tags instead (using scikit-base, see #45).

I'm also a bit puzzled about the test framework, so I did not add any tests, but I reckon the class has to be added to a config somewhere. Advice appreciated.

@fkiraly
Copy link
Author

fkiraly commented Apr 6, 2024

Related: https://github.com/sktime/sktime/blob/main/sktime/dists_kernels/algebra.py - in case you are interested, there are some work items around symbolic combinations of distances.

@fkiraly
Copy link
Author

fkiraly commented Apr 6, 2024

One question might be how you would like to handle the numpy dependency, I realize it is currently not a dep of the package.

Should that not be a dependency, given that your scope seems to involve sequence calculations as well, rather than just pre-defined sequences?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This module should not be in sequence.core ? Or it should be not a script in sequence.core? Perhaps arithmetic_operations.py or just operations.py?

Because the classes are helping to create new sequence but are not sequence their-self.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, they are both, aren't they?

Instances of this class will be full sequences, first-class citizens in the interface. In sklearn, sktime, etc, such classes sit in "compose"-like modules, where they can be used by the user.

Whereas, core/base, are usually base classes that users won't see, and implementers wlil inherit from but not interface.

Of course, your package, your rules.

import numpy as np


class Arith(Sequence):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps ArithmeticOperations or IntegerSequenceOperations ?

In fact, the method is_finite is correct just for integer sequences.

For example, take the two non-integer sequences (-1)^n and (-1)^(n+1). Both are infinite, but their sum is finite, as it is the constant sequence 0

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, the method is_finite is correct just for integer sequences.

Do we have the same definition of "finite"? Would appreciate if you could link the one you are using.

I thought this property meant that the sequence is of finite length. "finite sequence" afaik is a standing term with this meaning. So, the constant sequence with a countably infinite number of terms, according to this definition, would not be finite, but infinite.

https://en.wikipedia.org/wiki/Sequence#Finite_and_infinite

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re name, I prefer short ones, otherwise object string identifiers become long and unreadable quickly... how about ArithOp?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have the same definition of "finite"? Would appreciate if you could link the one you are using.

I thought this property meant that the sequence is of finite length. "finite sequence" afaik is a standing term with this meaning. So, the constant

I see your point, and I understand the misunderstanding between us. Perhaps, I expressed myself bad. Let re-explain the situation:
Suppose we have two sequences a_n and b_n, and we want to add them. We have 3 cases:

  1. both are infinite
  2. one is finite and the other is infinite
  3. both are finite

In the first case, no problem.
In the second case, how should be defined the sum of a finite sequence with an infinite one?
in the third case we have two subclasses:

  • if the two sequences have the same length -> no problem
  • if the two sequences have different length -> how it is defined the addition?

Possible solutions:

  • you can just add infinite sequences or sequences with the same length
  • we consider the finite sequence as the sub-vector space of all sequence which have compact support in the vector space of all sequences ( REF)

Both solution have pro and contro.

What do you think?

Copy link
Author

@fkiraly fkiraly Apr 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the best option is to take the minimum of lengths, for the terms where the combination is defined.

(that operation is also associative)

----------
sequences : List[Sequence]
List of sequences to combine.
operation : None, str, function, or numpy ufunc, optional, default = None = mean
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is default = None = mean? Actually operation as no default value, right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, that is an oversight. Also, perhaps "addition" is a more sensible default.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or no default? As the class should not be used directly by the final user or?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not? A custom operation can be provided. Of course dunders are more convenient for most use cases.

sequences : List[Sequence]
List of sequences to combine.
operation : None, str, function, or numpy ufunc, optional, default = None = mean
if str, must be one of "mean", "+" (add), "*" (multiply), "max", "min"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you take the vector space of all integer sequences, you have that it is close by taking addition, multiplication, min and max (actually is an abstract Riesz space). But it is not close by taking mean. If we restrict the class just to consider integer sequence, this can be a problem.

Moreover, what do you mean for mean? Is the pointiwise mean? i.e., if a_n and b_n are two sequences, then the mean(a_n, b_n) is the sequence (a_n + b_n )/ 2 ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, pointwise mean. I copied this coding without much thinking from sktime distances, where mean is a sensible default.

There is of course also a question about scope here: are you interested only in integer sequences? Seems like an unnecessary restriction.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is of course also a question about scope here: are you interested only in integer sequences? Seems like an unnecessary restriction.

Actually I'm interested in all kind of sequences but I started with the integer ones which are already very interesting and in some sense easy to work with as you know their image.

The main ideas was to code as much as possible sequences from https://oeis.org

Copy link
Author

@fkiraly fkiraly Apr 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main ideas was to code as much as possible sequences from https://oeis.org/

Yes, I get that ofc.
How about covering all of them with a parametric object OEIS(id), via https://oeis.org/wiki/JSON_Format,_Compressed_Files ?

Copy link
Author

@fkiraly fkiraly Apr 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

symbolic resolution would also be neat, e.g., asking, is OEIS(xyz) + OEIS(abc) = OEIS(def), as integer series

(I would like that for skpro distributions)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually a good idea.

I have to admit that, I see the philosophy behind but I don't see how to implement it :-)

From this PR are coming so many good point of inspiration. I'm thinking to open a GitHub-project (or using the one already done) to clarify every point and decide how to proceed :-)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, absolutely!

How about meeting on discord (e.g., one of the sktime channels), and discussing a mini-roadmap for sequentium, and the "sets" project?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes... I will contact you... good idea

return Arith(sequences=[*self.sequences, *other.sequences], operation="+")

def __mul__(self, other):
from sequence.sequences.compose.arith import Arith
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you importing Arith here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy-paste accident, from the base class (where this is indeed needed)

self._op = self._resolve_operation(operation)

def __add__(self, other):
if not isinstance(other, Arith) or self._op != other._op:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we have self._op != other._op?

Copy link
Author

@fkiraly fkiraly Apr 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line is about dealing with the situation where you have Arith([a, b, c], sym) + Arith([d, e, f], sym2).

If sym == sym2, then you'd want this to become Arith([a,b,c,d,e,f], sym) rather than Arith(Arith([a,b,c],sym), Arith([d,e,f],sym)), because all sym here are associative. Well, except the mean, but that's an oversight which we probably need to fix in sktime as well.

The "or" is for checking that indeed sym==sym2. We could do so by the string, but the same operation has multiple string aliases ("+" or "add"), so we check _op.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah... good point...

But will the final user be able to access this class? Because if the answer is no, we don't have to have different alias for the same operation, not?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought of it as user facing, potentially. Even if not, it makes sense to minimize the size of the parse tree.

def _as_generator(self):
op = self._op
while True:
yield op([next(sequence) for sequence in self.sequences])
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is gonna to fail if you have the sum of a finite sequence with an infinite one, or not?

and how it is defined the product of two sequences, one finite and the other infinite?

In general, we say that a sequence a_n is finite if there is N > 0 such that a_n = 0 for every n > N. Therefore, the product of a finite sequence with an infinite one should be finite. Is this respected in this implementation?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I see.

I didn't fully understand your model. I thought finite sequence indeed had finite length, I would expect tee generator to be finite too, in this case.

Given this explanation, I now have a strong opinion that that is how it should be, i.e., next should not give 0, but a generator-end.

Because otherwise, you cannot symbolically distinguish infinite sequences that have 0s, and finite sequencese; and, you cannot distinguish finite sequences with trailing 0s, from the same sequence but shorter.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I explain myself bad. I hope my explanation above is better :-)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which explanation, is there a new one?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about that: #47 (comment)

In fact, I was confusing what should be a finite sequence: a sequence with finite length or one which is eventually always 0

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, what should our definition of finite be for the package?
(I would prefer agreeing with the one on wikipedia - finite length, rather than "becomes constant zero")

lists = [sequence._as_list(stop, start, step) for sequence in self.sequences]
return [op(z) for z in zip(*lists)]

def _at(self, index: int) -> int:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the return type is wrong. if you take the mean of the constant sequence 1 with itself, then the resulting sequence is the constant sequence 1/2. Therefore, if you apply the method _at, the return type will be 1/2.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you take the mean of the constant sequence 1 with itself, then the resulting sequence is the constant sequence 1/2.

I don't see that, and I must be stupid for not seeing that.

Can you please carry out the computation?

Copy link
Author

@fkiraly fkiraly Apr 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here's my computation:

we consider the pointwise mean, so each value in the constant sequence ArithOp([one, one], "mean") will have the value that is the mean of two ones.

Substituting definitions for the mean $\mu$ of two numbers $x_1=1$ and $x_2=1$, we have $$\mu = \frac{1}{n}\sum_{i=1}^n x_i,$$ where $n=2$, so, $\mu = \frac{1}{2} (1 + 1) = 2/2 = 1$. Not 1/2.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haha sorry :-) I'm the stupid one...

You are right. Let's find a better example:
Suppose a_n is the constant sequence = 1 and b_n is the constant sequence = 2. Then 1/2*(1+2) = 3/2

But this is always the question if we should restrict ourself to integer sequences or not

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any scope explosion from allowing real sequences too (in terms of software complexity) - so, why not?

I agree of course that the mean is not a good choice for a default, or an option at all.

@VascoSch92
Copy link
Owner

One question might be how you would like to handle the numpy dependency, I realize it is currently not a dep of the package.

Should that not be a dependency, given that your scope seems to involve sequence calculations as well, rather than just pre-defined sequences?

For instance, I was just giving a set of finite sequence already coded that the final users could use. But it was an idea to create a method which the final user can use to create its own sequence. Something like, instantiate_sequence(type="explicit", formula="...") and it returns a Sequence object as wished.

I didn't want to use external libraries for this project, but Numpy seems really interesting to solve the problem of arithmetic operation between sequences.

I will probably add it

@VascoSch92
Copy link
Owner

Hey @fkiraly
thank you very much for your time. It is really appreciated.

We have some thing to discuss about, but it seems really interesting how it is going ;-)

@fkiraly
Copy link
Author

fkiraly commented Apr 6, 2024

but Numpy seems really interesting to solve the problem of arithmetic operation between sequences.

The two dependencies I'd find plausible for this project are numpy and, possibly, scipy. Of course, dependency minimization is a good goal to have in mind, so I think it's worth thinking about whether we really need numpy. The case for it, above onboard min, max, etc is whether one would expect users handle larger sequences, or whether users would prefer numpy output above lists or generators.

@VascoSch92
Copy link
Owner

but Numpy seems really interesting to solve the problem of arithmetic operation between sequences.

The two dependencies I'd find plausible for this project are numpy and, possibly, scipy. Of course, dependency minimization is a good goal to have in mind, so I think it's worth thinking about whether we really need numpy. The case for it, above onboard min, max, etc is whether one would expect users handle larger sequences, or whether users would prefer numpy output above lists or generators.

Yes exact. I think it is possible to rewrite your class using the math operation of the math module of Python. As you mentioned, for a small sets of sequences we should not see a difference but for big ones yes

@fkiraly
Copy link
Author

fkiraly commented Apr 7, 2024

ok!

Could you list the changes you'd like me to make? We've discussed a lot but should align on the end state of this PR.

@fkiraly
Copy link
Author

fkiraly commented Apr 11, 2024

?

@VascoSch92
Copy link
Owner

ok!

Could you list the changes you'd like me to make? We've discussed a lot but should align on the end state of this PR.

Yes, sorry. With work I don't have so much time. I will take the time to list the changes this weekend for sure

@fkiraly
Copy link
Author

fkiraly commented Apr 12, 2024

no worries, take your time. Just wanted to make sure this didn't get buried in an inbox or similar.

@VascoSch92
Copy link
Owner

Summary:

  • for instance, the package support just integer sequence. To support other type of sequences we will need some structural work. But this is a project for the future :-)
  • for the moment we don't use numpy and the operation supported are just *, and + (in this way we are sure we are not going out from the integer world)
  • finite sequences are the ones of finite length. Therefore, your code should be correct.
  • I saw now that you are targeting main. Could you close the PR and open one that target staging? the development is made on staging and when we have enough changes I open a PR to main and when the PR is closed the package is automatically released. I don't like this way of working. It was just an idea but it is not working as expected and create just confusion (as in this case). After your PR is closed I'm planning to make possible to develop and release directly from main. Sorry for th inconvenience
  • about the tests: you can merge in staging without tests. I will take care of adding tests for your code ;-). Same for the linting and the formatting, I should explain better how to lint and to format the code before opening a PR and add a pre-commit-hook (this is also planned).
  • update the changelog ;-)

In general, I'm not happy about the CI/CD of the package. Until I work alone it is working fine, but just because I know perfectly how it is made. After your changed are released (in version 0.1.0 probably) I will take time to improve the whole system. Also planning to add a clear documentation.

Finally, thank you very much for the contribution, the effort and, overall, the ideas that this PR created :-)

@fkiraly fkiraly changed the base branch from main to staging April 16, 2024 21:52
@fkiraly
Copy link
Author

fkiraly commented Apr 16, 2024

Addressed changes:

for the moment we don't use numpy and the operation supported are just *, and +

I removed numpy and operations that leave the integers. However, I've added other interesting associative operations that stay in the integers - gcd, lcm.

finite sequences are the ones of finite length.

Ok, also added a length dunder.

Could you close the PR and open one that target staging?

No need to close for that. You can change the target with "edit" (top of GitHub GUI)

the development is made on staging and when we have enough changes I open a PR to main and when the PR is closed the package is automatically released. I don't like this way of working.

My preference is merge to main, and manual release trigger. GitHub allows creating releases manually, and GHA allows to set these as release action trigger, see sktime.

about the tests: you can merge in staging without tests.

I think that isn't good practice - should the tests not run still even if in staging? It is good that they run on the PR as well, to check whether anything gets broken.

update the changelog ;-)

Sure - you know you can also have tihs auto-generated from the commits?

If you (a) squash PRs, and (b) in settings, ensure the commit name defaults to PR title, and commit text to PR first post, the GitHub release interface creates a very readable changelog for you.

@fkiraly
Copy link
Author

fkiraly commented Apr 16, 2024

After your changes are released (in version 0.1.0 probably) I will take time to improve the whole system. Also planning to add a clear documentation.

I would recommend to open an issue to discuss end stage of CI and testing framework?

I would suggest sth like "collect all sequences and run the same set of interface tests on all of them".

An option is doing this with skbase (minimal work for me), but that's not the only way ofc.
Another example with homogenous interface tests: https://github.com/astrogilda/tsbootstrap/blob/main/src/tsbootstrap/tests/test_all_bootstraps.py

@VascoSch92 VascoSch92 merged commit 43da218 into VascoSch92:staging Apr 17, 2024
1 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

implement addition between sequences __add__
2 participants