GPTZero vs ChatGPT — A Gray Story.
Can GPTZero really catch AI-generated text?
We have seen an explosion in reporting on ChatGPT and how it simplifies our whole existence. The other day I asked ChatGPT to solve all my problems and it did; I’m happily retired now.
In all seriousness, ChatGPT is accessible to everyone, and that includes students with writing assignments, quizzes, virtual tests, etc. Teachers were struggling to tell whether something was really written by a student, and in comes our hero GPTZero to the rescue. GPTZero claims to solve this problem with a measure called perplexity, as advertised on the testing tool:
https://etedward-gptzero-main-zqgfwb.streamlit.app/
Perplexity, i.e. the randomness of the text, is a measurement of how well a language model like ChatGPT can predict a sample text. Simply put, it measures how much the computer model likes the text.
Your text perplexity evaluated on gpt2 (345M parameters) is 40.
Texts with higher perplexities are more likely to be written by humans.
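GPTZero’s exact implementation is not public, but the standard way to get a perplexity number like the one quoted above is to run the text through a language model and exponentiate the average per-token cross-entropy. Here is a minimal sketch with Hugging Face’s transformers library, assuming the “345M parameters” note refers to the gpt2-medium checkpoint:

```python
# Illustrative sketch only: GPTZero's exact pipeline is not public.
# This scores a text under GPT-2 (assumed to be the ~345M-parameter
# "gpt2-medium" checkpoint) and reports exp(mean cross-entropy),
# the usual definition of perplexity.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

MODEL_NAME = "gpt2-medium"  # ~345M parameters, matching the tool's note
tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_NAME)
model = GPT2LMHeadModel.from_pretrained(MODEL_NAME)
model.eval()

def perplexity(text: str) -> float:
    """Return exp(mean token-level cross-entropy) of `text` under GPT-2."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels makes the model compute the mean cross-entropy loss
        # over the (shifted) tokens of the same text.
        loss = model(input_ids, labels=input_ids).loss
    return torch.exp(loss).item()

print(perplexity("The quick brown fox jumps over the lazy dog."))
```

The lower the number, the more the text looks like something the model itself would have produced; the higher the number, the harder the model found it to predict.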
I decided to test it with text from my blog posts, a couple of which were generated by ChatGPT and the rest of which I actually wrote.
In summary, higher perplexity means the text is probably not AI-written, and lower perplexity means the text was likely generated by AI. From my perspective, there is a problem here: if the range of perplexity is, in theory, [1, ∞), what constitutes low and high? Compared to what? This needs some kind of bounding or context, without which the score may not be as useful. Let me make my point with some examples. I ran tests on content I created and on content created using ChatGPT; the screenshots are below.
ME: Two examples written by me, taken from my blog posts
Here I get total perplexity values of 69 and 40, with additional insights on the average sentence perplexity, the maximum sentence perplexity, and the sentence that generated the maximum value.
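For readers curious how such a breakdown could be produced, here is a naive sketch that reuses the perplexity() helper from the earlier snippet; splitting on periods is only a stand-in, since GPTZero’s actual sentence segmentation is not documented:

```python
# Hypothetical sentence-level breakdown, reusing perplexity() from above.
# The period split is a crude assumption, not GPTZero's real segmentation.
def sentence_breakdown(text: str):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    scores = [perplexity(s) for s in sentences]
    avg_score = sum(scores) / len(scores)
    max_score, max_sentence = max(zip(scores, sentences))
    return avg_score, max_score, max_sentence
```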
ChatGPT generated
Starting with a section containing two paragraphs of ChatGPT-generated text comparing Bloom vs GPT3, we get a total perplexity score of 19, with additional insights on the average sentence score, a score for each sentence, and the sentence that generated the maximum perplexity score.
With another section from the same article, the total perplexity score is 16, with sentence-level scores ranging from just over 20 up to a maximum of 109.
In the next set of images, I used the children’s story text I generated using ChatGPT; you can follow the total score and the other metrics as noted above. The total perplexity score is 15.
Another section from the same story with a total perplexity score of 13.
Testing Other Authors and News Content
Since the dragon story is a generic children’s story, I tried to find a story similar in theme on Project Gutenberg and tested it; results below. It gets a total perplexity score of 20. Hmmm… not sure how that author would feel about this assessment.
Title: My Father’s Dragon
Author: Ruth Stiles Gannett
Illustrator: Ruth Chrisman Gannett
Release Date: September 18, 2009 [EBook #30017]
Language: English
Credits: Produced by Sankar Viswanathan, Greg Weeks, and the Online Distributed Proofreading Team at http://www.pgdp.net
A section of text from a CNN article generated a total perplexity score of 29, with the highest sentence value being 464 for “its his most famous speech”.
Conclusion
GPTZero was very good at producing scores in the range of 11 to 19 for content generated by AI, in this case ChatGPT, and much higher scores, above 50, for the content I wrote. If we stopped there, this would be a clear black-and-white story. However, Ruth Stiles Gannett’s text from My Father’s Dragon, courtesy of Project Gutenberg, generates a score of 20, and a CNN article on MLK Day generates a score of 29. So the question remains: where is the cutoff? Under 10, under 20, or under 50? Unless the bounds are clearly and contextually defined, GPTZero is a filtering tool that needs to be used cautiously, with supporting evidence, before someone or some content is written off.