test(all): switch to use GitHub strategy matrix and fix flaky tests #828

Merged: 22 commits (May 10, 2022)

Conversation

Contributor

@ijemmy ijemmy commented May 9, 2022

Description of your changes

The e2e tests have been flaky for the following reasons:

  1. Some tests (mostly Logger) time out. Cause: the tracing refactor increased the number of concurrent deployments. This slows down the overall run and makes us more likely to hit the timeout (GitHub runners have only 2 cores).
  2. Tracer tests sometimes fail because a particular field (error) is missing. Cause: we removed the explicit wait on Tracer during refactoring and now rely on polling until we receive the expected number of traces. However, not all subsegments may be available in those traces yet.

This PR addresses these issues by:

  1. Increasing the timeouts
  2. Using a matrix strategy to run the 6 combinations (2 Lambda runtimes x 3 packages) on 6 different runners
  3. Adding retry logic to the polling so that it retries when the subsegments are not available yet (see the sketch below)
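
For illustration, here is a minimal sketch of the retry-on-polling idea in TypeScript. The fetchInvocationSubsegment parameter and the retry/delay values are placeholders, not the exact implementation in this PR:

type Subsegment = { in_progress?: boolean };

const wait = (ms: number): Promise<void> => new Promise((resolve) => setTimeout(resolve, ms));

const pollUntilComplete = async (
  fetchInvocationSubsegment: () => Promise<Subsegment | undefined>, // placeholder for the real X-Ray polling helper
  retries = 3,
  delayMs = 5_000
): Promise<Subsegment> => {
  for (let attempt = 1; attempt <= retries; attempt++) {
    const subsegment = await fetchInvocationSubsegment();
    // Accept the result only once the subsegment exists and X-Ray no longer marks it as in progress
    if (subsegment !== undefined && !subsegment.in_progress) {
      return subsegment;
    }
    await wait(delayMs);
  }
  throw new Error(`Subsegment still incomplete after ${retries} attempts`);
};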

How to verify this change

You can trigger the e2e test workflow manually.

I've run this 5 times without any failure so far: https://github.com/awslabs/aws-lambda-powertools-typescript/actions/runs/2295223762

Related issues, RFCs

#825 <-- the logs will be more descriptive with the matrix strategy

PR status

Is this ready for review?: YES
Is it a breaking change?: NO

Checklist

  • My changes meet the tenets criteria
  • I have performed a self-review of my own code
  • I have commented my code where necessary, particularly in areas that should be flagged with a TODO, or hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the examples
  • My changes generate no new warnings
  • The code coverage hasn't decreased
  • I have added tests that prove my change is effective and works
  • New and existing unit tests pass locally and in GitHub Actions
  • Any dependent changes have been merged and published in downstream modules
  • The PR title follows the conventional commit semantics

Breaking change checklist

  • I have documented the migration process
  • I have added, implemented necessary warnings (if it can live side by side)

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@@ -57,7 +57,7 @@ export const createStackWithLambdaFunction = (params: StackWithLambdaFunctionOpt
};

export const generateUniqueName = (name_prefix: string, uuid: string, runtime: string, testName: string): string =>
-  `${name_prefix}-${runtime}-${testName}-${uuid}`.substring(0, 64);
+  `${name_prefix}-${runtime}-${uuid.substring(0,5)}-${testName}`.substring(0, 64);
Contributor Author


When the test name is really long, the uuid is truncated on some resources. This results in name clashes when the same test case is run at the same time.
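
A quick illustration of the clash with made-up values (the helpers below just mirror the old and new template strings, they are not part of the codebase):

const oldName = (prefix: string, uuid: string, runtime: string, testName: string): string =>
  `${prefix}-${runtime}-${testName}-${uuid}`.substring(0, 64);

const newName = (prefix: string, uuid: string, runtime: string, testName: string): string =>
  `${prefix}-${runtime}-${uuid.substring(0, 5)}-${testName}`.substring(0, 64);

const uuid = 'd4b2c1a0-1111-2222-3333-444455556666';
const testName = 'AllFeatures-Decorator-CaptureResponseAndErrors';

// Old ordering: the uuid sits at the end, so the 64-character cut drops it entirely
// and two concurrent runs of the same test collide on the resource name.
console.log(oldName('Tracer', uuid, 'nodejs14x', testName));

// New ordering: a 5-character slice of the uuid comes before the (possibly truncated)
// test name, so the names stay unique per run.
console.log(newName('Tracer', uuid, 'nodejs14x', testName));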

@ijemmy ijemmy marked this pull request as ready for review May 9, 2022 19:06
@dreamorosi dreamorosi added the internal label (PRs that introduce changes in governance, tech debt and chores: linting setup, baseline, etc.) May 9, 2022
@dreamorosi dreamorosi added this to the production-ready-release milestone May 9, 2022
dreamorosi previously approved these changes May 9, 2022
Contributor

@dreamorosi dreamorosi left a comment


Already looks good and passing!

My only note: I'd recommend removing concurrently from the package-lock.json. Since we are no longer using it, we can avoid having to maintain future updates for it.

Contributor

@saragerion saragerion left a comment


Thanks for the PR! I left some comments/questions, but it's already looking good :-)

strategy:
matrix:
version: [12, 14]
package: [logger, metrics, tracing]
Contributor


Outside of the scope of this PR, but we should rename the tracing folder to "tracer" for consistency. Created an issue:
#829


export const ONE_MINUTE = 60 * 1000;
export const TEST_CASE_TIMEOUT = ONE_MINUTE;
export const SETUP_TIMEOUT = 5 * ONE_MINUTE;
export const TEARDOWN_TIMEOUT = 5 * ONE_MINUTE;
Contributor


I like this syntax. More readable!
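
For context, these constants are typically consumed as Jest timeouts. A minimal sketch of how a suite might use them (the import path and the describe/test bodies are illustrative, not the actual e2e tests):

import { SETUP_TIMEOUT, TEARDOWN_TIMEOUT, TEST_CASE_TIMEOUT } from '../e2e/constants';

describe('Logger E2E tests', () => {
  beforeAll(async () => {
    // deploy the test stack
  }, SETUP_TIMEOUT);

  it('produces the expected log output', async () => {
    // invoke the function and assert on its logs
  }, TEST_CASE_TIMEOUT);

  afterAll(async () => {
    // destroy the test stack
  }, TEARDOWN_TIMEOUT);
});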

@@ -14,11 +14,16 @@ import {
expectedCustomResponseValue,
expectedCustomErrorMessage,
} from '../e2e/constants';
import { FunctionSegmentNotDefinedError } from './FunctionSegmentNotDefinedError';
Contributor


Question: what do you mean here by "segment not defined"?

Contributor Author


If X-Ray hasn't fully processed all segments, some of them may be missing.
In our case, we may try to get the Function (AWS::Lambda::Function) segment too early. I use this custom error to distinguish this type of error from others.
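
For reference, a minimal sketch of what such a custom error and its use could look like (the getFunctionSegment helper and the segment shape are illustrative; only the error class name comes from this PR):

class FunctionSegmentNotDefinedError extends Error {
  public constructor(message: string) {
    super(message);
    this.name = 'FunctionSegmentNotDefinedError';
  }
}

const getFunctionSegment = (segments: Array<{ origin?: string }>): { origin?: string } => {
  const functionSegment = segments.find((segment) => segment.origin === 'AWS::Lambda::Function');
  if (functionSegment === undefined) {
    // X-Ray hasn't produced the Function segment yet; the caller can catch this and retry
    throw new FunctionSegmentNotDefinedError('Function segment is not defined in the trace yet');
  }

  return functionSegment;
};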

// This flag may be set if the segment hasn't been fully processed
// The trace may have already appeared in the `getTraceSummaries` response
// but a segment may still be in_progress
in_progress?: boolean
Contributor


Looking at the documentation, the end_time key might also not be defined, correct?

end_time – number that is the time the segment was closed. For example, 1480615200.090 or 1.480615200090E9. Specify either an end_time or in_progress.

https://docs.aws.amazon.com/xray/latest/devguide/xray-api-segmentdocuments.html#api-segmentdocuments-fields

Contributor Author


Yes. In my experiments, the end_time field doesn't exist. It should be optional.

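
In other words, the partial segment type used by the helpers could model both keys as optional, something like this (the interface name is illustrative):

interface XRaySegmentStatus {
  // Epoch seconds, e.g. 1480615200.090; only present once the segment is closed
  end_time?: number
  // Set while the segment hasn't been fully processed
  in_progress?: boolean
}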

}
}

retryFlag = retryFlag || (!!invocationSubsegment.in_progress);
Contributor


Question: essentially here we are looking at the in_progress key. If it's set, we retry. Correct?

Contributor Author


That's right for this line. But there is also a case where the whole segment isn't defined at all; that's handled by the lines above (see the sketch below combining both cases).
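
Putting the two retry triggers together, a sketch of the decision (getInvocationSubsegment is a placeholder for the actual helper; FunctionSegmentNotDefinedError is the custom error imported earlier in this PR):

const shouldRetry = (getInvocationSubsegment: () => { in_progress?: boolean }): boolean => {
  try {
    const invocationSubsegment = getInvocationSubsegment();

    // Case 1: the segment exists but X-Ray still marks it as in progress
    return !!invocationSubsegment.in_progress;
  } catch (error) {
    // Case 2: the whole segment isn't available yet
    if (error instanceof FunctionSegmentNotDefinedError) {
      return true;
    }
    // Anything else is a genuine failure and should surface
    throw error;
  }
};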

Contributor


I see! Thanks for clarifying.
