test(all): switch to use GitHub strategy matrix and fix flaky tests #828

Merged: 22 commits (May 10, 2022)

Conversation

Contributor

@ijemmy ijemmy commented May 9, 2022

Description of your changes

The e2e tests have been flaky for the following reasons:

  1. Some tests (mostly Logger) time out. Cause: the tracing refactor increased the number of concurrent deployments. This slows down the overall run and makes us more likely to hit the timeout (GitHub runners have only 2 cores).
  2. Tracer tests sometimes fail because a particular field (error) is missing. Cause: we removed the explicit wait on Tracer during refactoring and now rely on polling until we receive the expected number of traces. However, not all subsegments may be available in those traces yet.

This PR addresses these issues by:

  1. Increasing the timeouts
  2. Using a matrix strategy to run the 6 combinations (2 Lambda runtimes x 3 packages) on 6 different runners
  3. Adding retry logic to the polling so that it retries when the subsegments are not available yet (see the sketch below)
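
For illustration, here is a minimal sketch of the retry-on-polling idea in TypeScript. The fetchInvocationSubsegment parameter and the retry/delay values are placeholders, not the exact implementation in this PR:

type Subsegment = { in_progress?: boolean };

const wait = (ms: number): Promise<void> => new Promise((resolve) => setTimeout(resolve, ms));

const pollUntilComplete = async (
  fetchInvocationSubsegment: () => Promise<Subsegment | undefined>, // placeholder for the real X-Ray polling helper
  retries = 3,
  delayMs = 5_000
): Promise<Subsegment> => {
  for (let attempt = 1; attempt <= retries; attempt++) {
    const subsegment = await fetchInvocationSubsegment();
    // Accept the result only once the subsegment exists and X-Ray no longer marks it as in progress
    if (subsegment !== undefined && !subsegment.in_progress) {
      return subsegment;
    }
    await wait(delayMs);
  }
  throw new Error(`Subsegment still incomplete after ${retries} attempts`);
};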

How to verify this change

You can trigger the e2e test workflow manually.

I've run this 5 times without any failure so far: https://github.com/awslabs/aws-lambda-powertools-typescript/actions/runs/2295223762

Related issues, RFCs

#825 <-- the logs will be more descriptive with the matrix strategy

PR status

Is this ready for review?: YES
Is it a breaking change?: NO

Checklist

  • My changes meet the tenets criteria
  • I have performed a self-review of my own code
  • I have commented my code where necessary, particularly in areas that should be flagged with a TODO, or hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the examples
  • My changes generate no new warnings
  • The code coverage hasn't decreased
  • I have added tests that prove my change is effective and works
  • New and existing unit tests pass locally and in GitHub Actions
  • Any dependent changes have been merged and published in downstream modules
  • The PR title follows the conventional commit semantics

Breaking change checklist

  • I have documented the migration process
  • I have added, implemented necessary warnings (if it can live side by side)

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@@ -57,7 +57,7 @@ export const createStackWithLambdaFunction = (params: StackWithLambdaFunctionOpt
};

export const generateUniqueName = (name_prefix: string, uuid: string, runtime: string, testName: string): string =>
-  `${name_prefix}-${runtime}-${testName}-${uuid}`.substring(0, 64);
+  `${name_prefix}-${runtime}-${uuid.substring(0,5)}-${testName}`.substring(0, 64);
Contributor Author


When the test name is really long, the uuid is truncated on some resources. This results in name clashes when the same test case is run at the same time.
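
A quick illustration of the clash with made-up values (the helpers below just mirror the old and new template strings, they are not part of the codebase):

const oldName = (prefix: string, uuid: string, runtime: string, testName: string): string =>
  `${prefix}-${runtime}-${testName}-${uuid}`.substring(0, 64);

const newName = (prefix: string, uuid: string, runtime: string, testName: string): string =>
  `${prefix}-${runtime}-${uuid.substring(0, 5)}-${testName}`.substring(0, 64);

const uuid = 'd4b2c1a0-1111-2222-3333-444455556666';
const testName = 'AllFeatures-Decorator-CaptureResponseAndErrors';

// Old ordering: the uuid sits at the end, so the 64-character cut drops it entirely
// and two concurrent runs of the same test collide on the resource name.
console.log(oldName('Tracer', uuid, 'nodejs14x', testName));

// New ordering: a 5-character slice of the uuid comes before the (possibly truncated)
// test name, so the names stay unique per run.
console.log(newName('Tracer', uuid, 'nodejs14x', testName));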

@ijemmy ijemmy marked this pull request as ready for review May 9, 2022 19:06
@dreamorosi dreamorosi added the internal label (PRs that introduce changes in governance, tech debt and chores: linting setup, baseline, etc.) May 9, 2022
@dreamorosi dreamorosi added this to the production-ready-release milestone May 9, 2022
dreamorosi previously approved these changes May 9, 2022
Contributor

@dreamorosi dreamorosi left a comment


Already looks good and passing!

My only note: I'd recommend removing concurrently from the package-lock.json. Since we are no longer using it, we can avoid having to maintain future updates for it.

Contributor

@saragerion saragerion left a comment


Thanks for the PR! I left some comments/questions, but it's already looking good :-)

strategy:
matrix:
version: [12, 14]
package: [logger, metrics, tracing]
Contributor


Outside of the scope of this PR, but we should rename the tracing folder to "tracer" for consistency. Created an issue:
#829


export const ONE_MINUTE = 60 * 1000;
export const TEST_CASE_TIMEOUT = ONE_MINUTE;
export const SETUP_TIMEOUT = 5 * ONE_MINUTE;
export const TEARDOWN_TIMEOUT = 5 * ONE_MINUTE;
Contributor


I like this syntax. More readable!
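
For context, these constants are typically consumed as Jest timeouts. A minimal sketch of how a suite might use them (the import path and the describe/test bodies are illustrative, not the actual e2e tests):

import { SETUP_TIMEOUT, TEARDOWN_TIMEOUT, TEST_CASE_TIMEOUT } from '../e2e/constants';

describe('Logger E2E tests', () => {
  beforeAll(async () => {
    // deploy the test stack
  }, SETUP_TIMEOUT);

  it('produces the expected log output', async () => {
    // invoke the function and assert on its logs
  }, TEST_CASE_TIMEOUT);

  afterAll(async () => {
    // destroy the test stack
  }, TEARDOWN_TIMEOUT);
});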

@@ -14,11 +14,16 @@ import {
expectedCustomResponseValue,
expectedCustomErrorMessage,
} from '../e2e/constants';
import { FunctionSegmentNotDefinedError } from './FunctionSegmentNotDefinedError';
Contributor


Question: what do you mean here by "segment not defined"?

Contributor Author


If X-Ray hasn't fully processed all segments, some of them may be missing.
In our case, we may try to get the Function (AWS::Lambda::Function) segment too early. I use this custom error to distinguish this type of error from others.
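
For reference, a minimal sketch of what such a custom error and its use could look like (the getFunctionSegment helper and the segment shape are illustrative; only the error class name comes from this PR):

class FunctionSegmentNotDefinedError extends Error {
  public constructor(message: string) {
    super(message);
    this.name = 'FunctionSegmentNotDefinedError';
  }
}

const getFunctionSegment = (segments: Array<{ origin?: string }>): { origin?: string } => {
  const functionSegment = segments.find((segment) => segment.origin === 'AWS::Lambda::Function');
  if (functionSegment === undefined) {
    // X-Ray hasn't produced the Function segment yet; the caller can catch this and retry
    throw new FunctionSegmentNotDefinedError('Function segment is not defined in the trace yet');
  }

  return functionSegment;
};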

// This flag may be set if the segment hasn't been fully processed
// The trace may have already appeared in the `getTraceSummaries` response
// but a segment may still be in_progress
in_progress?: boolean
Contributor


Looking at the documentation, the end_time key might also not be defined, correct?

end_time – number that is the time the segment was closed. For example, 1480615200.090 or 1.480615200090E9. Specify either an end_time or in_progress.

https://docs.aws.amazon.com/xray/latest/devguide/xray-api-segmentdocuments.html#api-segmentdocuments-fields

Contributor Author


Yes. In my experiments, the end_time field doesn't exist. It should be optional.

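
In other words, the partial segment type used by the helpers could model both keys as optional, something like this (the interface name is illustrative):

interface XRaySegmentStatus {
  // Epoch seconds, e.g. 1480615200.090; only present once the segment is closed
  end_time?: number
  // Set while the segment hasn't been fully processed
  in_progress?: boolean
}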

}
}

retryFlag = retryFlag || (!!invocationSubsegment.in_progress);
Contributor


Question: essentially here we are looking at the in_progress key. If it's set, we retry. Correct?

Contributor Author


That's right for this line. But there is also a case where the whole segment isn't defined at all; that's handled by the lines above (see the sketch below combining both cases).
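
Putting the two retry triggers together, a sketch of the decision (getInvocationSubsegment is a placeholder for the actual helper; FunctionSegmentNotDefinedError is the custom error imported earlier in this PR):

const shouldRetry = (getInvocationSubsegment: () => { in_progress?: boolean }): boolean => {
  try {
    const invocationSubsegment = getInvocationSubsegment();

    // Case 1: the segment exists but X-Ray still marks it as in progress
    return !!invocationSubsegment.in_progress;
  } catch (error) {
    // Case 2: the whole segment isn't available yet
    if (error instanceof FunctionSegmentNotDefinedError) {
      return true;
    }
    // Anything else is a genuine failure and should surface
    throw error;
  }
};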

Contributor


I see! Thanks for clarifying.
