Can LLMs Produce Original Astronomy Research in a Semester? A Graduate Class Experiment

This paper reports on a Fall 2025 graduate astronomy class in which students used large language models to conduct original research. The students found that while LLMs helped with drafting papers, the models frequently failed to provide accurate insights, generate complex code, or avoid hallucinations, leaving students divided on the tools' overall value and their impact on creativity.

Original authors: Ann Zabludoff, Chen-Yu Chuang, Parker Thomas Johnson, Yichen Liu, Brina Bianca Martinez, Neev Shah, Lucille Steffes, Gabriel Glen Weible

Published 2026-03-30

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine a group of graduate students in a university astronomy class. Their assignment is ambitious: by the end of the semester, they must write a brand-new scientific paper about galaxies that is good enough to be published in a real journal.

Usually, this is like trying to build a house from scratch without a blueprint, using only a hammer and a lot of patience. But for this specific semester, the professor gave them a super-powerful, all-knowing robot assistant (an AI called a Large Language Model, or LLM) and said, "Use this to help you build your house."

The paper you read is the students' report card on how well that robot actually helped. Here is the story of their experiment, told in simple terms.

The Setup: The "Super-Intern"

The students treated the AI like a super-fast, super-read intern. They asked it to do the heavy lifting: find old research papers, write computer code to analyze data, and even suggest what new discoveries they could make.

The Good News (The Magic Moments):

  • The Speed Reader: If you asked a human to read 500 papers on galaxy bars, it would take weeks. The AI read them in seconds and gave a summary. It was like having a librarian who could instantly pull out the exact page you needed from a library of a million books.
  • The Code Fixer: Sometimes, the students wrote a computer program that crashed. The AI could look at the error message and say, "Oh, you forgot a comma here," or "You used the wrong formula." It was like having a mechanic who could hear a car engine and instantly know which part was broken.
  • The Brainstormer: When students were stuck and didn't know what to study next, the AI suggested ideas. It helped them narrow down a huge, scary topic into a manageable project.

The Bad News (The Hallucinations):
However, the AI wasn't perfect. In fact, it had some very dangerous habits:

  • The Fake Librarian: About 20% of the time, the AI would confidently hand a student a "reference" to a scientific paper that didn't exist. It would give a fake title, a fake author, and a fake link. It was like a tour guide pointing to a building and saying, "That's the Museum of Art," when it was actually a gas station. The students had to check every single link manually.
  • The Stubborn Chef: If a student told the AI, "This code is wrong," and the AI said, "No, it's right," and the student said, "No, look, it crashes," the AI would often double down and insist it was right. It was like a chef who keeps adding salt to a soup even after you tell him it's too salty, insisting, "No, it needs more salt."
  • The Blind Navigator: The AI was great at saying, "Go to this website to get data," but terrible at actually going there. It couldn't log into the database, download the files, or understand the complex rules of the website. It was like a GPS that could tell you the address of a store but couldn't actually drive the car to get there.
  • The Guessing Game: When asked for specific numbers (like the uncertainty in a measurement), the AI would sometimes just make up numbers if it didn't know the answer. It preferred to lie confidently rather than say, "I don't know."

The Verdict: Did it save time?

The answer is a split vote, like a coin toss.

  • Team "Yes": About half the students said, "Yes, it saved me time." For them, the AI was a great tool to get started, to learn the basics of a new topic, and to fix small coding errors. It was like having a co-pilot for the boring parts of the flight.
  • Team "No": The other half said, "No, it wasted my time." They spent hours fixing the AI's fake citations, debugging its broken code, and arguing with it when it was wrong. They felt that if they had just done the work themselves, they would have been faster and more accurate.

The Big Lesson: The "Human Touch"

The most important part of the paper isn't about the technology; it's about the students' feelings.

Many students worried that relying too much on the AI would make them lazy thinkers. They felt that if the AI does all the thinking, they lose the "spark" of discovery. One student said, "If the AI tells me what to do next, where is the fun in being a scientist?"

They realized that while the AI is a powerful tool, it lacks scientific taste. It doesn't have the intuition that comes from years of studying the universe. It can't tell the difference between a boring, solved problem and a thrilling, unsolved mystery.

The Conclusion

The students concluded that AI is like a very fast, very confident, but occasionally hallucinating intern.

  • Do use it for: Summarizing old papers, fixing small typos in code, and brainstorming ideas.
  • Don't trust it for: Writing the final paper, retrieving real data, or stating facts without verification.

The paper ends with a warning and a hope: As AI gets smarter, it will become a better intern. But until then, scientists must be the "managers" who check every single thing the intern does. If you don't check the work, the intern might build your house on a foundation of sand.

In short: The AI is a fantastic tool to speed up the work, but it cannot replace the human mind that asks the questions, checks the answers, and decides what is truly important.
