Add support for Python3 #270

Merged
merged 25 commits into from
May 31, 2019

Conversation

gijzelaerr
Member

@gijzelaerr gijzelaerr commented May 16, 2019

I think this is mostly done now.

python2 with casacore 3.0 and python-casacore 3.0 works.

python3 with casacore 3.0 and python-casacore 3.0 gives an error:

  File "/home/gijs/Work/CubiCal/cubical/database/casa_db_adaptor.py", line 60, in init_empty
    t.putcol("TYPE", np.array(db.anttype)[antorder])
  File "/home/gijs/Work/CubiCal/.venv3_no_unicode/lib/python3.6/site-packages/casacore/tables/table.py", line 1157, in putcol
    self._putcol(columnname, startrow, nrow, rowincr, value)
RuntimeError: PycArray: unknown python array data type

Which is reported here:
casacore/python-casacore#138

python2 and python3 with the latest casacore and python-casacore give an error, which is reported here: #269

Both of them are in the same area of code, so I guess they are related.

This has been quite some work (again), so I hope this gets merged soon.

Oleg, I'm not sure what you want with your printing voodoo, but this is the result of the automatic 2to3 conversion.

@ratt-priv-ci
Collaborator

Can one of the admins verify this patch?

@bennahugo
Collaborator

ok to test

Collaborator

@bennahugo bennahugo left a comment

Looks mostly good. Main concern is the following issues:

  • range vs. xrange: these, especially the ones in nested loops, have a high performance and memory impact when running in Python 2.7
  • iteritems vs. items: same here, please use conditionals and six to check which to call depending on the environment
  • pickle is implemented in pure Python on Python 2.7 and the default protocol levels are different. This has a performance impact when running in Python 2.7. Use a conditional and six when importing
  • builtins doesn't exist in Python 2.7; the test coverage probably excludes these files. As it stands, those are not backwards compatible with Python 2.7
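
A minimal sketch of the conditional pattern requested in the list above, written with plain `sys.version_info` checks so that it runs without extra dependencies. The names `PY2`, `lazy_range` and `iteritems` are hypothetical helpers for illustration and are not taken from the CubiCal code; the six package offers `six.moves.range`, `six.iteritems` and `six.moves.cPickle` for the same purpose.

```python
import sys

PY2 = sys.version_info[0] == 2

if PY2:
    # Python 2: prefer the C pickle implementation and the lazy xrange.
    import cPickle as pickle
    lazy_range = xrange  # noqa: F821 (only evaluated on Python 2)
else:
    # Python 3: pickle is already C-accelerated and range is already lazy.
    import pickle
    lazy_range = range


def iteritems(mapping):
    """Iterate over (key, value) pairs without building a list on Python 2."""
    if PY2:
        return mapping.iteritems()
    return iter(mapping.items())


# Nested loops use the lazy range on both interpreters.
total = 0
for i in lazy_range(3):
    for j in lazy_range(3):
        total += i * j

# An explicit protocol keeps pickling behaviour the same on both interpreters.
payload = pickle.dumps({"a": 1}, protocol=2)
restored = pickle.loads(payload)
```

Passing an explicit protocol avoids relying on the interpreter default, which differs between Python 2 and Python 3.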

cubical/data_handler/MBTiggerSim.py
cubical/data_handler/ms_data_handler.py
cubical/data_handler/ms_data_handler.py
cubical/data_handler/ms_data_handler.py
cubical/data_handler/ms_data_handler.py
cubical/solver.py
cubical/solver.py
cubical/statistics.py (outdated)
cubical/tools/NpShared.py
cubical/tools/shared_dict.py
@bennahugo
Collaborator

 - 12:41:53 - param_db           [io] [0.1/0.5 2.3/2.8 0.5Gb]   loading G:gain.err, shape 1x115x64x28x2x2
 - 12:41:54 - casa_db_adaptor    [io] [0.1/0.5 2.4/2.8 0.5Gb] Exporting to CASA gaintables
 - 12:41:54 - main               [io] [0.2/0.5 2.4/2.8 0.5Gb] I/O handler for load None save -1 failed with exception: Python argument types in
    Table.__init__(table, list, list, int, int, int)
did not match C++ signature:
    __init__(_object*, casa::String, casa::String, casa::String, bool, casa::IPosition, casa::String, casa::String, int, int, casa::Vector<casa::String>, casa::Vector<casa::String>)
    __init__(_object*, casa::String, casa::Record, casa::String, casa::String, int, casa::Record, casa::Record)
    __init__(_object*, std::vector<casa::TableProxy, std::allocator<casa::TableProxy> >, casa::Vector<casa::String>, int, int, int)
    __init__(_object*, casa::Vector<casa::String>, casa::Vector<casa::String>, casa::Record, int)
    __init__(_object*, casa::String, casa::Record, int)
    __init__(_object*, casa::String, std::vector<casa::TableProxy, std::allocator<casa::TableProxy> >)
    __init__(_object*, casa::TableProxy)
    __init__(_object*)
 - 12:41:54 - main               [io] [0.2/0.5 2.4/2.8 0.5Gb] Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/cubical/workers.py", line 444, in _io_handler
    solver.gm_factory.close()
  File "/usr/local/lib/python2.7/dist-packages/cubical/machines/abstract_machine.py", line 862, in close
    db.close()
  File "/usr/local/lib/python2.7/dist-packages/cubical/database/casa_db_adaptor.py", line 467, in close
    self.__export()
  File "/usr/local/lib/python2.7/dist-packages/cubical/database/casa_db_adaptor.py", line 449, in __export
    casa_caltable_factory.create_G_table(self, "G:phase")
  File "/usr/local/lib/python2.7/dist-packages/cubical/database/casa_db_adaptor.py", line 170, in create_G_table
    with tbl(db.filename + ".%s.casa" % outname, ack=False, readonly=False) as t:
  File "/usr/local/lib/python2.7/dist-packages/casacore/tables/table.py", line 394, in __init__
    Table.__init__(self, tabname, concatsubtables, 0, 0, 0)
ArgumentError: Python argument types in
    Table.__init__(table, list, list, int, int, int)
did not match C++ signature:
    __init__(_object*, casa::String, casa::String, casa::String, bool, casa::IPosition, casa::String, casa::String, int, int, casa::Vector<casa::String>, casa::Vector<casa::String>)
    __init__(_object*, casa::String, casa::Record, casa::String, casa::String, int, casa::Record, casa::Record)
    __init__(_object*, std::vector<casa::TableProxy, std::allocator<casa::TableProxy> >, casa::Vector<casa::String>, int, int, int)
    __init__(_object*, casa::Vector<casa::String>, casa::Vector<casa::String>, casa::Record, int)
    __init__(_object*, casa::String, casa::Record, int)
    __init__(_object*, casa::String, std::vector<casa::TableProxy, std::allocator<casa::TableProxy> >)
    __init__(_object*, casa::TableProxy)
    __init__(_object*)
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
    send(obj)
PicklingError: Can't pickle <class 'Boost.Python.ArgumentError'>: import of module Boost.Python failed

Process leaked file descriptors. See https://jenkins.io/redirect/troubleshooting/process-leaked-file-descriptors for more information
Build step 'Execute shell' marked build as failure

Recording test results
ERROR: Step ‘Publish JUnit test result report’ failed: No test report files were found. Configuration error?
Adding one-line test results to commit status...
Setting status of 747e1c6d1d2e457946fde892852ec4bf55fafe08 to FAILURE with url https://jenkins.meqtrees.net/job/cubical-pr/70/ and message: 'Build finished. No test results found.'

Finished: FAILURE

Smells like a python-casacore bug

@gijzelaerr
Member Author

Which Python and which python-casacore versions are you using?

@ratt-priv-ci
Collaborator

ratt-priv-ci commented May 17, 2019 via email

@gijzelaerr
Member Author

This is due to unicode ending up in the python-casacore code. Only the current (python-)casacore has proper Python 2 support for unicode. If you encounter errors like this, the casacore call containing the string needs a str() wrapper, or at some point a newer (python-)casacore needs to be used.

Unfortunately I'm now out of time to work on this.
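
A minimal sketch of the str() wrapper idea described above. The helper name `native_str` is hypothetical and not part of CubiCal or python-casacore; it only shows how a text value could be coerced to the interpreter's native `str` type before it is handed to a binding that rejects Python 2 `unicode`.

```python
import sys


def native_str(value):
    """Return a string-like value as the interpreter's native str type."""
    if isinstance(value, str):
        return value
    if sys.version_info[0] == 2:
        # Python 2: unicode -> byte string (str)
        return value.encode("utf-8")
    # Python 3: bytes -> text string (str)
    return value.decode("utf-8")


# Hypothetical usage at a call site that passes a table name to casacore:
#     tbl(native_str(table_name), ack=False, readonly=False)
table_name = native_str(u"gains.G:phase.casa")
```

Wrapping at the call site keeps the rest of the code free to use whichever string type it already has, which is one way to stay compatible with both older and newer python-casacore releases.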

@bennahugo
Collaborator

ok I will change the build system to check the latest revision from casacore and build from there. Unfortunately this can only happen next week as I'm busy packaging killMS at the moment.

@gijzelaerr
Member Author

Ideally it works with both the old and new python-casacore, it is probably an easy fix to insert some str() here and there.

@bennahugo
Collaborator

hmm ok. I will debug this further next week

@gijzelaerr
Member Author

Where do you run this container, and with what? I can't replicate your issue.

@bennahugo
Collaborator

On the Jenkins CI with the following command line

WORKSPACE_ROOT="$WORKSPACE/$BUILD_NUMBER"
TEST_OUTPUT_DIR="$WORKSPACE_ROOT/test-output"
TEST_DATA_DIR="$WORKSPACE/../../../test-data"
mkdir $TEST_OUTPUT_DIR

# build and testrun
docker build -t cubical:${BUILD_NUMBER} ${WORKSPACE_ROOT}/projects/Cubical/
docker run --rm cubical:${BUILD_NUMBER}

#run tests
docker run --rm -m 100g --cap-add sys_ptrace \
				   --memory-swap=-1 \
                   --shm-size=150g \
                   --rm=true \
                   --name=cubical$BUILD_NUMBER \
                   -v ${TEST_OUTPUT_DIR}:/workspace \
                   -v ${TEST_OUTPUT_DIR}:/root/tmp \
                   --entrypoint /bin/bash \
                   cubical:${BUILD_NUMBER} \
                   -c "cd /src/cubical && apt-get install -y git && pip install -r requirements.test.txt && nosetests --with-xunit --xunit-file /workspace/nosetests.xml test"

@gijzelaerr
Member Author

For me the tests are running. Which specific test fails?

@bennahugo
Collaborator

It's the main acceptance test on 147, see https://jenkins.meqtrees.net/job/cubical-pr/70/console

The latest commits don't build successfully
https://jenkins.meqtrees.net/job/cubical-pr/72/console

@gijzelaerr
Member Author

Ok, my guess is that this is because you use KERN-3, which has casacore 2.4.1 in it.

@bennahugo
Collaborator

ok before this is merged we need to work on a fix for casacore 2. Some of the lofar packages do not work with the new casacore and we do need to be able to run ddfacet, killms and cubical on the same installation, otherwise it just becomes too messy for the users.

@bennahugo
Collaborator

Also we need to keep long term support for Ubuntu 16.04

@bennahugo
Collaborator

See issue: casacore/python-casacore#174

@o-smirnov
Collaborator

Hmm not sure I quite agree - without basic running tests in place how do we know we don't break our hard labour that went into the python 3 mode?

Well we're not breaking it by merging surely. The codebase has been made py3-compatible, we merge it in and carry on the revolution on another branch?

@bennahugo
Collaborator

bennahugo commented May 28, 2019

Ok, this version runs through on 16.04 py2 and 18.04 py2 and py3. However, I had to make changes to the way you compute time indices, since python3 does not accept floating-point arrays as index arrays; see the ms_tile provider. I also removed the SIN projection since the new montblanc does this internally.

Since we made so many changes to montblanc itself I've tested it using my small DDFacet use case and it subtracts very well as you can see on the pull request, so I doubt this is a montblanc py3 porting issue.

The variation between 16.04 and 18.04 is at the 1e-3 level, but both now fail. Whether or not this is a substantial difference I don't know. This needs further investigation, but I suggest @o-smirnov and @JSKenyon quote the relative error instead of the absolute difference so we can better understand the significance of the difference.
16.04 and 18.04 difference:

 - 17:08:59 - main               [0.2/0.2 1.1/1.1 0.6Gb] completed successfully
*** max diff between CORRECTED_DATA and DE_DATA is 0.00152961199638
E
======================================================================

and

*** max diff between CORRECTED_DATA and DE_DATA is 0.002001586603
E
======================================================================

@bennahugo
Collaborator

An even better idea is to write out the model and compute the chi^2 between the model and the de-corrected residuals.

For now I've implemented the mean relative error check in decibels. I've set the threshold to -30 dB on the mean relative error and -25 dB on the 95th percentile. This gives us some leeway in the tests. Currently on the Ubuntu 18.04 system this stands at:

 - 19:56:10 - main               [0.1/0.2 1.2/1.3 0.5Gb] completed successfully
*** mean relative diff between CORRECTED_DATA and DE_DATA is -35.98283290863037 dB
*** ninety fifth percentile relative diff between CORRECTED_DATA and DE_DATA is -31.854880104464907 dB
.
----------------------------------------------------------------------
Ran 1 test in 135.944s

OK

So I think we can merge this bastard @o-smirnov
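
A minimal sketch of a relative-difference check in decibels along the lines described above. The function name `relative_diff_db` and the synthetic arrays are hypothetical, and the conversion `10 * log10(...)` is an assumption, since the thread does not state which decibel convention the test uses. Only the two thresholds (-30 dB mean, -25 dB 95th percentile) come from the comment.

```python
import numpy as np


def relative_diff_db(test, reference):
    """Return (mean, 95th percentile) of the relative difference in dB."""
    test = np.asarray(test)
    reference = np.asarray(reference)
    # Skip zero-valued reference samples to avoid dividing by zero.
    valid = np.abs(reference) > 0
    rel = np.abs(test[valid] - reference[valid]) / np.abs(reference[valid])
    # Assumption: decibels taken as 10*log10 of the relative difference.
    mean_db = 10.0 * np.log10(np.mean(rel))
    p95_db = 10.0 * np.log10(np.percentile(rel, 95))
    return mean_db, p95_db


MEAN_THRESHOLD_DB = -30.0  # threshold on the mean relative error
P95_THRESHOLD_DB = -25.0   # threshold on the 95th percentile

# Synthetic stand-ins for two visibility columns (not real data).
reference = np.ones(100, dtype=complex)
test = reference * (1 + 1e-4)

mean_db, p95_db = relative_diff_db(test, reference)
passed = mean_db < MEAN_THRESHOLD_DB and p95_db < P95_THRESHOLD_DB
```

A relative measure of this kind does not depend on the flux scale of the data, which makes the threshold easier to interpret than an absolute maximum difference.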

@bennahugo
Collaborator

@JSKenyon I'll make 2 images to triple check that this didn't break anything, but -30 dB is well within instantaneous instrumental noise, so I don't think these gain differences are substantial enough to worry about.

tindx = np.add.accumulate(tindx)
return tindx

self._row_identifiers = ddid_index * n_bl * ntime + timestep_index(self.times) * n_bl + \
Collaborator

I am not convinced by this - in my opinion we are trying to fix something which should not be broken. We need to find the root cause, otherwise we may have unexpected behaviour elsewhere. Doing a search for times only yields 25 results and I cannot see any reason for self.times to become non-integer. For reference, self.time_col contains float values from the TIME column of the MS. self.times is of the same shape but instead contains integer indices associating each time with the timeslot/integration to which it belongs. These indices are generated by uniquify, which is defined in __init__.py in the data_handler folder.

Collaborator

It must be broken somewhere upstream from this method. I can revert the change, and if you run it with py3 it will break.

Collaborator

I think the safest thing is to sprinkle asserts for the data types into the codebase, as they can always be disabled with python -O if you want full performance.
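
A minimal sketch of such a data-type assert, assuming NumPy index arrays. The helper `check_index_array` and the sample arrays are hypothetical and not taken from CubiCal. Running the interpreter with `python -O` strips `assert` statements, so the check costs nothing in optimised runs.

```python
import numpy as np


def check_index_array(name, arr):
    """Assert that arr has an integer dtype before it is used as an index."""
    arr = np.asarray(arr)
    assert np.issubdtype(arr.dtype, np.integer), \
        "%s must be an integer index array, got dtype %s" % (name, arr.dtype)
    return arr


# Hypothetical timeslot indices: one integer slot number per row.
timeslot_index = check_index_array("timeslot_index", np.array([0, 0, 1, 2]))

# Integer index arrays can be used for fancy indexing.
data = np.arange(10.0, 14.0)
selected = data[timeslot_index]
```

Failing early with a named array makes it easier to trace where a float crept in than a later indexing error deep inside the solver.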

Collaborator

Yeah there's no way times should be float. I vote to rename it though, as from the variable name it's not obvious at all that this means "timeslot index" (anyone new to the code base will just take it to mean the plural of "time").

Collaborator

Ok, I found the difference: it was stemming from your n_bl calculation, which I already fixed with //

The relative difference is still

*** mean relative diff between CORRECTED_DATA and DE_DATA is -35.98283290863037 dB
*** ninety fifth percentile relative diff between CORRECTED_DATA and DE_DATA is -31.854880104464907 dB

which points to this being the sin projection you previously employed
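
A minimal sketch of why `//` matters here, assuming Python 3 and assuming that n_bl is a baseline count derived from the number of antennas. The formula and the value of `n_ant` are illustrative assumptions and are not copied from the CubiCal source.

```python
# Hypothetical antenna count, for illustration only.
n_ant = 4

# True division returns a float on Python 3 (it returned an int on Python 2).
n_bl_float = n_ant * (n_ant - 1) / 2

# Floor division returns an int on both Python 2 and Python 3.
n_bl = n_ant * (n_ant - 1) // 2

# A float cannot be used as an index on Python 3, even if it is whole-valued.
rows = list(range(10))
try:
    rows[n_bl_float]
    float_index_ok = True
except TypeError:
    float_index_ok = False
```

Any index built by multiplying with a float count becomes float as well, which would explain how an integer timeslot array can turn into a float index array further downstream.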

@gijzelaerr
Member Author

gijzelaerr commented May 29, 2019 via email

@o-smirnov
Collaborator

However, I had to make changes to the way you compute time indices, since python3 does not accept floating-point arrays as index arrays; see the ms_tile provider.

As discussed above, this looks to be a manifestation of another error.

I also removed the SIN projection since the new montblanc does this internally.

Since the reference data was generated with the old code, the DE_DATA test (the only one using Montblanc) can be expected to fail, since the effective model source positions will have shifted slightly. If you're right, the old code uses slightly incorrect positions, and the new code will subtract "better".

The way to verify that the new code is "better" (or at least "as correct") is to run the test with a full MS (in C and D config), and eyeball the residual images.

@bennahugo
Collaborator

@o-smirnov I've already verified that montblanc subtracts correctly within MeerKAT resolution, which is comparable to VLA C

@bennahugo
Collaborator

See ratt-ru/montblanc#244

@bennahugo
Collaborator

I still maintain that the differences in projection are well within visibility error bars; however, I'm going to merge montblanc master into the branch and redo the DDF MK deep2 cleaning test, which is an order of magnitude higher resolution than VLA D configuration.

@bennahugo
Collaborator

@JSKenyon this last commit fixes your cythonization upon wheel building. It also gets invoked upon a standard python setup.py install, which ensures wheels can be built directly from the raw release code. I prefer this method because it means you can call pip install git+.... or pip install directly from source on pypi. It also means that your source distribution on pypi can have a direct corresponding tarball on github without the need to include c/cpp files in the one and not the other.

@JSKenyon
Collaborator

Ok, I am sort of convinced that this is in a working state for both 2.7 and 3.6. However, post-merge there are several additional fixes which need to be made. I will make issues for them, but for the sake of my memory I will also mention them here. @o-smirnov currently postmortem flagging breaks in flag3_to_col, and we need to verify flagging behaviour, as my current MWE ends up with unflagged bad data.

@o-smirnov
Collaborator

o-smirnov commented May 31, 2019 via email

@JSKenyon
Collaborator

@o-smirnov that's just it - I did have those set. Perhaps my parset is outdated and there is some other setting I have missed. I made a local change to the postmortem flagging code to make it run, and that produced sensible results. I am just a little concerned that flags are not propagating as expected. But it is not needed in this PR - we can look at fixing up all the little issues after the merge.

@bennahugo
Collaborator

@JSKenyon can you submit all your changes so I can do one final check before we press the (red) merge button

@JSKenyon
Collaborator

@bennahugo I have already pushed the minor changes I made. See the last two commits. So feel free to go ahead.

@bennahugo bennahugo merged commit 716cdb7 into master May 31, 2019
@bennahugo bennahugo deleted the py3_v2 branch May 31, 2019 16:56