Skip to content

Inline library functions based on constant-ness of arguments #352

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
CAFxX opened this issue Jun 24, 2020 · 8 comments
Open

Inline library functions based on constant-ness of arguments #352

CAFxX opened this issue Jun 24, 2020 · 8 comments

Comments

@CAFxX
Copy link

CAFxX commented Jun 24, 2020

Many fundamental functions could be effectively inlined without causing significant binary size increase if they could be inlined only if certain arguments were constants, as in such a case significant parts of the function body would turn into dead code, and could be therefore eliminated.

Samples of this kind of functions include: digitalWrite, digitalRead, analogWrite, analogRead, pinMode, turnOffPWM and so on.

GCC apparently does not do this automatically, but it can be done. One way is as follows (let's take analogWrite as an example):

void analogWrite(uint8_t pin, int val)
{
	// body...
}

If we accept to turn analogWrite into a macro, we can rewrite it as follows:

#define analogWrite(pin, val) \
	( __builtin_constant_p(pin) ? _analogWrite_inline((pin), (val)) : _analogWrite((pin), (val)) )

void _analogWrite(uint8_t pin, int val) {
  _analogWrite_inline(pin, val); 
}

inline __attribute__((__always_inline__)) _analogWrite_inline(uint8_t pin, int val) {
	// body...
}

In this way, callsites that have a non-constant pin argument will get the regular _analogWrite function as they do today without causing any increase in binary size, whereas callsites that have a constant pin argument will get an inlined version, and constant propagation will take care of eliminating dead code. If done consistently, this would allow GCC to turn a call to e.g. analogWrite(PIN, HIGH) into the following:

	uint8_t oldSREG = SREG;
    cli();
	*reg |= regbit; // reg, regbit are constant
	SREG = oldSREG;
	uint8_t oldSREG = SREG;
	cli();
	*out |= outbit; // out, outbit are constant
	SREG = oldSREG;

(that does not take much more space than the original call and that, potentially, opens up the potential, if all callsites use constant args, for _analogWrite and related functions to be completely omitted from the final binary)

@matthijskooijman
Copy link
Collaborator

Yeah, this could indeed help a lot. I think the approach you describe is already used in some places (I think SPISettings, maybe others too).

GCC apparently does not do this automatically,

With LTO, GCC might be able to do aggresive inlining automatically, but I'm not sure how well it can predict what the effects will be.

One complication is that some functions involve PROGMEM arrays, an pgm_load_* cannot be resolved at compiletime, but I think those can also be specialcases with __builtin_constant_p (skipping the pgm_load for constants). Maybe just this would allow LTO to already automatically inline a lot. I've been meaning to look at this, but never gotten around to this.

@matthijskooijman
Copy link
Collaborator

Also see this somewhat-related discussion from a long time ago: https://groups.google.com/a/arduino.cc/forum/#!topic/developers/OGAGuF-HQJQ

@CAFxX
Copy link
Author

CAFxX commented Jun 26, 2020

Regarding the progmem arrays, not sure if this has been suggested before but a potential way to sidestep the issue could be to turn them into switches, that hopefully (needs to be tested) gcc will compile as jumptables in text anyway for the non-constant argument case (sure, they would likely occupy a few bytes more)

Once they're switches, we can force inlining on them and this should make everything work regardless of LTO.

@matthijskooijman
Copy link
Collaborator

Regarding the progmem arrays, not sure if this has been suggested before but a potential way to sidestep the issue could be to turn them into switches,

I would be hesistant of this approach, since I'm afraid that even if they become proper jumptables, the flash size of this is significantly higher (I expect the jump table will be 2-bytes per case, against 1 or 2 bytes per case for the current table, and then the target of each jump will need 4 more bytes for a ldi and a rjmp probably, so that means the flash size of these tables will triple or more).

I guess that a switch (which would be inside an inline function declared in the header, I suppose) is an interesting way to sneak a constant table into a header (which is not otherwise possible by duplicating it by making it static, I think, since inline variables are not introduced until C++17 IIRC).

Also, I wonder if this is really needed: if we can write things so that not having LTO will just not optimize the constant case, or maybe raise an error (but not silently break), it would be ok to only support the LTO-case with this optimization?

@CAFxX
Copy link
Author

CAFxX commented Jun 27, 2020

it would be ok to only support the LTO-case with this optimization?

It would, at least as far as I'm concerned. The non-LTO case was only mentioned in the link you posted.

@nerdralph
Copy link

With LTO, gcc will evaluate progmem arrays at compile time. It's a trick I use in my ArduinoShrink library.
https://github.com/nerdralph/ArduinoShrink

@ArminJo
Copy link

ArminJo commented Feb 6, 2021

For the digitalRead and Write you can just use the https://github.com/watterott/Arduino-Libs/tree/master/digitalWriteFast library

@matthijskooijman
Copy link
Collaborator

Also see #264, which discusses a similar idea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants