
Can ChatGPT effectively generate test cases?

12 May 2023 (updated 6 August 2024)

Our experiment with GPT

We tried an experiment with ChatGPT to see how well it understood a statement written as a “business rule” and whether it could analyse it to create a set of valid test cases.

The business rule we gave ChatGPT to analyse was:


“What test cases do I need to execute to verify that a person between the ages of 10 and 31 is eligible for a gift card, but if they live in Kansas, they can only get the card between the ages of 10 and 25.”

As you can see, there is some ambiguity in the business rule, like:

  • What happens if the person is from another country, or does that not matter?
  • Is Kansas the only state that matters in the USA?
  • Does “between” mean greater than or equal to 10, and less than or equal to 25 or 31?

Real-world requirements are often no better than this, and we wanted to know how ChatGPT would handle the ambiguity, especially around the ages.

Our analysis works off the following qualification concerning the ages: the word “between” is inclusive, i.e. it means “greater than or equal to” at the lower end and “less than or equal to” at the upper end.

The equivalence partitions are:

  • States = Kansas, or anything else
  • Country = We assume it is irrelevant to the rule.
  • Ages are:
    • >=10 – from Kansas and anywhere else
    • <=25 – only if from Kansas
    • <= 31 – if from anywhere else but Kansas
  • Gift Card = Yes or no
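To make that reading concrete, here is a minimal sketch in Python of the eligibility check implied by those partitions. The function name and the normalisation of the state value are our own illustration; the inclusive bounds reflect the qualification above, not anything explicitly stated in the requirement.

```python
def eligible_for_gift_card(age: int, state: str) -> bool:
    """Return True if a person qualifies for the gift card.

    Assumptions (our reading of the business rule):
    - "between" is inclusive at both ends
    - Kansas is the only state with the narrower age range (10-25)
    - everywhere else the range is 10-31, and country is ignored
    """
    if state.strip().lower() == "kansas":
        return 10 <= age <= 25
    return 10 <= age <= 31
```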

The boundary values would be:

  • 9, 10, 11 – from Kansas and anywhere else.
  • 24, 25, 26 – only if from Kansas.
  • 30, 31, 32 – if from anywhere else but Kansas.
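As a sketch of how those boundary values could be exercised, here they are as a pytest parametrisation against the hypothetical eligible_for_gift_card function above. California simply stands in for “anywhere else but Kansas”, and the expected results follow from our inclusive reading of “between”.

```python
import pytest

# Assumed to live in gift_card.py (a hypothetical module name for the sketch above).
from gift_card import eligible_for_gift_card

# Boundary values from the analysis above: (age, state, expected eligibility).
BOUNDARY_CASES = [
    # 9, 10, 11 - lower boundary, applies everywhere
    (9, "California", False), (10, "California", True), (11, "California", True),
    (9, "Kansas", False), (10, "Kansas", True), (11, "Kansas", True),
    # 24, 25, 26 - upper boundary when the person lives in Kansas
    (24, "Kansas", True), (25, "Kansas", True), (26, "Kansas", False),
    # 30, 31, 32 - upper boundary anywhere else but Kansas
    (30, "California", True), (31, "California", True), (32, "California", False),
]

@pytest.mark.parametrize("age, state, expected", BOUNDARY_CASES)
def test_gift_card_boundaries(age, state, expected):
    assert eligible_for_gift_card(age, state) == expected
```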

A tester would apply equivalence partitioning logic and trim the boundary table by removing the following:

  • Eleven could be eliminated for “greater than or equal to 10”, as 10 and 11 are equivalent.
  • Twenty-four could be eliminated for “less than or equal to 25”, as 25 is the boundary; 24 and 25 are therefore equivalent under that condition.
  • Thirty could be eliminated for “less than or equal to 31”, as 31 is the boundary; 30 and 31 are therefore equivalent under that condition.
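Applying those eliminations to the parametrised list sketched above leaves a leaner boundary set, for example:

```python
# After the eliminations, ages 11, 24 and 30 drop out of the boundary set:
REDUCED_BOUNDARY_CASES = [
    (9, "California", False), (10, "California", True),   # lower boundary, not Kansas
    (9, "Kansas", False), (10, "Kansas", True),            # lower boundary, Kansas
    (25, "Kansas", True), (26, "Kansas", False),           # Kansas upper boundary
    (31, "California", True), (32, "California", False),   # non-Kansas upper boundary
]
```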

The decision table of test cases from our analysis

Using a decision table, these are the test cases we came up with. As noted, there are various ways to reduce this list of test cases by equivalence partitioning, but we left them all in our decision table to see what ChatGPT would return.

[Image: Decision table of test cases]
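The full table is only shown in the image above, but for readers following along in text, here is a partial reconstruction of it as data, limited to the rows whose ages and states are confirmed by the mappings later in this article; the expected outcomes follow from our inclusive reading of the rule.

```python
# Partial reconstruction of the decision table: test ID -> (age, state, gift card expected).
PARTIAL_DECISION_TABLE = {
    "TC001": (9,  "California", False),  # below the lower boundary, not Kansas
    "TC002": (10, "California", True),   # at the lower boundary, not Kansas
    "TC003": (11, "California", True),   # just above the lower boundary, not Kansas
    "TC014": (25, "Kansas",     True),   # at the Kansas upper boundary
    "TC015": (26, "Kansas",     False),  # above the Kansas upper boundary
    "TC018": (32, "California", False),  # above the non-Kansas upper boundary
}
```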

ChatGPT's first pass

This was ChatGPT’s first response to the question:

[Image: copy of the text from ChatGPT]

Mapping the test cases

We mapped ChatGPT’s test cases against the test cases in our decision table:

  • CGPT Test Case 1 doesn’t map to a boundary test case for an age equal to 24; it used age = 18 and state = California.
  • CGPT Test Case 2 doesn’t map to a boundary test case for an age equal to 24; it used age = 18 and state = Kansas.
  • CGPT Test Case 3 doesn’t map to a boundary test case for an age equal to 9; it used age = 8 and state = Kansas.
  • CGPT Test Case 4 doesn’t map to a boundary test case for an age equal to 31; it used age = 35 and state = Kansas.
  • CGPT Test Case 5 doesn’t map to a boundary test case for an age equal to 26; it used age = 28 and state = Kansas.
  • CGPT Test Case 6 doesn’t map to a boundary test case for an age equal to 25; it used age = 20 and state = Kansas.

The question it poses is, “Do these test cases provide enough coverage of the requirement?”

I hope the answer you gave was “NO”!

Test cases 1, 2, and 6 only changed the state, and the ages used were of no significance with respect to the boundary values, so these test cases duplicated test effort.

There is no boundary test for an age of less than 10 outside Kansas (which should result in no gift card), although there is one for an age of less than ten in Kansas, which also results in “no gift card”. One could argue that Kansas is the only state that matters, so do we need to execute another test with a state that isn’t Kansas?

It also didn’t create a test for those older than 31 and not from Kansas, but it did make one for those older than 31 and in Kansas. Both tests result in “no gift card”, so again, do we need the second test case?

We decided to get more specific and ask ChatGPT to analyse the requirement using Boundary Value Analysis (BVA). The outcome is very interesting indeed!

In Part 2, we’ll give you the test cases that ChatGPT created and our analysis of how well ChatGPT did at creating a valid set of test cases to achieve an appropriate level of coverage.

PART 2

We asked ChatGPT to apply Boundary Value Analysis

If we ask ChatGPT to specifically focus on the boundary values, what test cases will it create this time? It showed it understood boundary value analysis, but what results will it provide?

[Image: copy of the results from ChatGPT]

Mapping the test cases

We mapped ChatGPT’s test cases to the following test cases in our decision table:

  • CGPT Test Case 1 maps to TC001: the invalid age below the boundary is 9, and the state is California.
  • CGPT Test Case 2 maps to TC002: the valid age at the boundary is 10, and the state is California.
  • CGPT Test Case 3 maps to TC003: the valid age above the boundary is 11, and the state is California.
  • CGPT Test Case 4 maps to TC014: the age of 25 is at the age boundary of 25, and the state is Kansas.
  • CGPT Test Case 5 maps to TC014: the age of 25 is at the age boundary of 25, and the state is Kansas.
  • CGPT Test Case 6 maps to TC015: the age of 26 is above the age boundary of 25, and the state is Kansas.
  • CGPT Test Case 7 maps to TC018: the age of 32 is above the age boundary of 31, and the state is California.

ChatGPT’s test cases 4 and 5 are duplicates. It didn’t understand the boundary condition clearly: test case 4 should have picked an age of 24 if its intent was “Age is below the lower limit for Kansas residents.”

This time it focused the boundary tests on the lower boundary (greater than or equal to the age of 10) and on not being in Kansas. It also didn’t create any test cases on the middle boundary (less than or equal to an age of 25, for Kansas only), and on the upper boundary it only focused on coming from Kansas.

So, do you need a test case for the age of 26 and coming from anywhere other than Kansas? Can we assume that if we get a “no gift card” result when testing Kansas and the age of 26, then if the location were anywhere other than Kansas, it would return a valid gift card?
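Against the sketch function from earlier, the two locations do produce different results for the same age, which is exactly what an extra non-Kansas test would confirm:

```python
# Under the assumed rule, age 26 is ineligible in Kansas but still eligible elsewhere,
# so the non-Kansas case exercises a different outcome rather than repeating one.
assert not eligible_for_gift_card(26, "Kansas")
assert eligible_for_gift_card(26, "California")
```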

We asked ChatGPT to regenerate the answer a second & third time

The algorithm is meant to learn, so we hoped for improved outcomes when regenerating the result. Its second response added one test case; the boundaries it identified remained unchanged, but the boundary values it used in the test cases did change.

The second set of test cases it created increased the coverage of the boundaries and is a significantly better set to use than the first set of test cases, where we asked it specifically to consider the boundaries.

The third attempt created a very different set of answers in a very different layout from the previous two responses. Again, the test case total increased by one, this time to 10 test cases.

[Image: ChatGPT results]

Mapping the test cases to our decision table

Here is the mapping of the ten test cases from this differently formatted ChatGPT response to our decision table:

[Image: second decision table]

This answer was better than the second set of solutions. Its grouping of test cases under scenarios implies it was getting better at understanding what we were asking it to create.

It identified scenarios for TC008, TC009, TC017, and TC018, but there should have been a test case for TC006 with a state of “Any”. Many of the others highlighted with “Should this be covered?” can be eliminated by applying equivalence partitioning.

We regenerated the answer a fourth time, but ChatGPT dropped two test cases from the response and did not achieve the same coverage as the third attempt.

Will ChatGPT replace software testers?

We’ve used ChatGPT for several things, from writing the outlines of articles to asking questions on general testing and coding topics similar to this one. It always comes back with reasonably good ‘high-level’ answers, but the ‘devil is in the detail’. ChatGPT may, in the future, be able to write these test cases for a set of requirements, but we believe a tester’s critical thinking and analysis skills, deciding what needs to be tested and what doesn’t, will still be required for several more years.

It does raise a simple question: “How correct is it, and should I believe the answer it provides?” This is the more significant issue facing AI: people believe what the internet tells them without questioning whether the response is correct and factual!

In an interview on ABC, Bill Gates raised the same issue:

“The problem is, people generally aren’t great at distinguishing whether something on the internet is from more of a trusted source. Instead, a lot of people just look for whatever reinforces what they already believe. That’s why misinformation online is so powerful. If you’re predisposed to believe a thing, and then you see a photo of that thing, for example, it becomes very hard to filter through what is true and what isn’t.”

Can ChatGPT help?

Definitely! But if you are new to testing, or looking for a shortcut by using something like ChatGPT to help analyse requirements, you could unwittingly add significant risk to the project because the response’s level of coverage isn’t perfect. You still need to question the answer, and you still need to apply your testing smarts to understand what it produced in the context of the question.

The best test coverage for this specific problem was achieved with the third response. We only knew it was the better response because we analysed each response against our initial analysis to see which test cases matched and which didn’t. This forced us to verify the answers ChatGPT provided and, at the same time, validate our initial decision-table analysis.

We then applied additional analysis to extend the third set of generated test cases into the minimum set we decided needed to be executed. That set gave us a level of risk mitigation we believed was appropriate to verify that the requirement had been implemented correctly.

So for the time being, we recommend you use an AI product such as ChatGPT with an enquiring mind by asking yourself, “Is the answer correct?!”

To discuss the pros and cons of using AI for testing your software, reach out to our testing team today!
