Skip to content

BLD: GH30873 Workaround parallel build failures #40285

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

mgorny
Copy link
Contributor

@mgorny mgorny commented Mar 7, 2021

Fix parallel build race conditions due to the same source files being
used in multiple targets. To workaround the distutils bug, create
unique .c files for each target that just #include the original file.

Fix parallel build race conditions due to the same source files being
used in multiple targets.  To workaround the distutils bug, create
unique .c files for each target that just #include the original file.
@mgorny
Copy link
Contributor Author

mgorny commented Mar 7, 2021

I know that this solution somewhat sucks but I don't know of a better one.

Before the patch, the breakage can be proven by grepping the build log:

$ grep -o -- '-o [^ ]*' build.log  | sort | uniq -c | grep -v 1
      3 -o build/temp.linux-x86_64-3.9/pandas/_libs/src/parser/tokenizer.o
      4 -o build/temp.linux-x86_64-3.9/pandas/_libs/tslibs/src/datetime/np_datetime.o
      2 -o build/temp.linux-x86_64-3.9/pandas/_libs/tslibs/src/datetime/np_datetime_strings.o

Since all commands are run in parallel, we're dealing with 2-4 commands writing to the same file simultaneously, and racing with linking commands that use these files.

@mgorny mgorny changed the title BUG: GH30873 Workaround parallel build failures BLD: GH30873 Workaround parallel build failures Mar 7, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is this just coming up now? exactly how are you compiling?

@mgorny
Copy link
Contributor Author

mgorny commented Mar 7, 2021

We've had mysterious bug for a long time now, and we've finally figured out the root cause. For example, in dask:

_____________ ERROR collecting dask/array/tests/test_array_core.py _____________
ImportError while importing test module '/var/tmp/portage/dev-python/dask-2021.3.0/work/dask-2021.3.0/dask/array/tests/test_array_core.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/lib/python3.8/site-packages/pandas/__init__.py:29: in <module>
    from pandas._libs import hashtable as _hashtable, lib as _lib, tslib as _tslib
/usr/lib/python3.8/site-packages/pandas/_libs/__init__.py:13: in <module>
    from pandas._libs.interval import Interval
pandas/_libs/interval.pyx:1: in init pandas._libs.interval
    ???
pandas/_libs/hashtable.pyx:1: in init pandas._libs.hashtable
    ???
pandas/_libs/missing.pyx:1: in init pandas._libs.missing
    ???
/usr/lib/python3.8/site-packages/pandas/_libs/tslibs/__init__.py:30: in <module>
    from .conversion import OutOfBoundsTimedelta, localize_pydatetime
E   ImportError: /usr/lib/python3.8/site-packages/pandas/_libs/tslibs/conversion.cpython-38-x86_64-linux-gnu.so: undefined symbol: pandas_datetime_to_datetimestruct

We're building via python setup.py build -j 36.

@jreback
Copy link
Contributor

jreback commented Mar 7, 2021

can u point to a bug in disutils itself or a bug report?

@mgorny
Copy link
Contributor Author

mgorny commented Mar 7, 2021

No, I don't have any bug link handy. And I don't think there's any point in wasting my time filing one given that Python upstream has abandoned distutils.

@jreback
Copy link
Contributor

jreback commented Mar 7, 2021

No, I don't have any bug link handy. And I don't think there's any point in wasting my time filing one given that Python upstream has abandoned distutils.

and then how do you expect us to merge this?

this is a very hacky solution - if you said the dependencies are not specified correctly then that’s something - but this appears to be a bandaid that may or may not work and is adding a lot of complexity

@mgorny
Copy link
Contributor Author

mgorny commented Mar 7, 2021

I'm doing the best with the very little time I have. If you don't believe me, just run setup.py build -j100 or alike and see for yourself that multiple compiler instances with the same output file are started simultaneously and don't order correctly. The likeliness that Python upstream fixes this is close to zero and even if they do, all the old Python versions that do not receive bugfixes will still be affected.

@jreback
Copy link
Contributor

jreback commented Mar 7, 2021

why is j more than like 4 actually useful? in any event we need a solution that follows c standards
duplicating files is not it

@mgorny
Copy link
Contributor Author

mgorny commented Mar 7, 2021

It's pretty normal for people to set -j to the number of their processors/cores/threads globally.

@SoapGentoo
Copy link
Contributor

why is j more than like 4 actually useful? in any event we need a solution that follows c standards
duplicating files is not it

Have you ever compiled stuff on a 16-core machine? -j16 is essential to scaling build systems to many-core machines

@SoapGentoo
Copy link
Contributor

Here's the numpy issue: numpy/numpy#15957

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Mar 8, 2021

why is j more than like 4 actually useful? in any event we need a solution that follows c standards
duplicating files is not it

@jreback I initially filed this issue in #30873, because I use -j 8 and always get these failures. And it does speed up the build compared to -j 4 especially when you are merging with master and everything has to get rebuilt.

@jreback
Copy link
Contributor

jreback commented Mar 8, 2021

@Dr-Irv I am not averse to solving this. why is this: https://stackoverflow.com/questions/11013851/speeding-up-build-process-with-distutils not the patch.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Mar 8, 2021

@Dr-Irv I am not averse to solving this. why is this: https://stackoverflow.com/questions/11013851/speeding-up-build-process-with-distutils not the patch.

@jreback I don't think that solution will work here. What it does is spawn off separate processes to do the compilation. I think if we did that we'd still have the same issue. What is happening with pandas is that we compile the same file multiple times, and the compilations interfere with each other when done in parallel since they are writing to the same file. So the workaround proposed here in the PR is to ensure that we don't have those collisions.

@simonjayhawkins simonjayhawkins added Build Library building on various platforms Windows Windows OS labels Mar 14, 2021
@WillAyd
Copy link
Member

WillAyd commented Mar 26, 2021

I am not an expert but I have never seen anyone ever suggesting including a C file directly in another to be a good thing. If anything, I only see dire warnings against doing something like this:

https://stackoverflow.com/questions/10448047/include-c-file-in-another

I am generally -1 on adding complexity to pandas for this, especially if we haven't checked upstream where this should really be solved

@jbrockmendel
Copy link
Member

#30873 seems worth fixing, but this doesn't seem like an approach we want

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Build Library building on various platforms
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parallel build_ext might fail due to file collisions
7 participants