

Why scientists trust AI too much — and what to do about it


Some researchers see superhuman qualities in artificial intelligence. All scientists need to be alert to the risks this creates.

AI-run labs have arrived — such as this one in Suzhou, China.

Scientists of all stripes are embracing artificial intelligence (AI) — from developing ‘self-driving’ laboratories, in which robots and algorithms work together to devise and conduct experiments, to replacing human participants in social-science experiments with bots1.

Many downsides of AI systems have been discussed. For example, generative AI such as ChatGPT tends to make things up, or ‘hallucinate’ — and the workings of machine-learning systems are opaque.

In a Perspective article2 published in Nature this week, social scientists say that AI systems pose a further risk: that researchers envision such tools as possessed of superhuman abilities when it comes to objectivity, productivity and understanding complex concepts. The authors argue that this puts researchers in danger of overlooking the tools’ limitations, such as the potential to narrow the focus of science or to lure users into thinking they understand a concept better than they actually do.

Scientists planning to use AI “must evaluate these risks now, while AI applications are still nascent, because they will be much more difficult to address if AI tools become deeply embedded in the research pipeline”, write co-authors Lisa Messeri, an anthropologist at Yale University in New Haven, Connecticut, and Molly Crockett, a cognitive scientist at Princeton University in New Jersey.

The peer-reviewed article is a timely and disturbing warning about what could be lost if scientists embrace AI systems without thoroughly considering such hazards. It needs to be heeded by researchers and by those who set the direction and scope of research, including funders and journal editors. There are ways to mitigate the risks. But these require that the entire scientific community views AI systems with eyes wide open.

To inform their article, Messeri and Crockett examined around 100 peer-reviewed papers, preprints, conference proceedings and books, published mainly over the past five years. From these, they put together a picture of the ways in which scientists see AI systems as enhancing human capabilities.

In one ‘vision’, which they call AI as Oracle, researchers see AI tools as able to tirelessly read and digest scientific papers, and so survey the scientific literature more exhaustively than people can. In both Oracle and another vision, called AI as Arbiter, systems are perceived as evaluating scientific findings more objectively than do people, because they are less likely to cherry-pick the literature to support a desired hypothesis or to show favouritism in peer review. In a third vision, AI as Quant, AI tools seem to surpass the limits of the human mind in analysing vast and complex data sets. In the fourth, AI as Surrogate, AI tools simulate data that are too difficult or complex to obtain.

Informed by anthropology and cognitive science, Messeri and Crockett predict risks that arise from these visions. One is the illusion of explanatory depth3, in which people relying on another person — or, in this case, an algorithm — for knowledge have a tendency to mistake that knowledge for their own and think their understanding is deeper than it actually is.

Another risk is that research becomes skewed towards studying the kinds of thing that AI systems can test — the researchers call this the illusion of exploratory breadth. For example, in social science, the vision of AI as Surrogate could encourage experiments involving human behaviours that can be simulated by an AI — and discourage those on behaviours that cannot, such as anything that requires being embodied physically.

There’s also the illusion of objectivity, in which researchers see AI systems as representing all possible viewpoints or not having a viewpoint. In fact, these tools reflect only the viewpoints found in the data they have been trained on, and are known to adopt the biases found in those data. “There’s a risk that we forget that there are certain questions we just can’t answer about human beings using AI tools,” says Crockett. The illusion of objectivity is particularly worrying given the benefits of including diverse viewpoints in research.

Avoid the traps

If you’re a scientist planning to use AI, you can reduce these dangers through a number of strategies. One is to map your proposed use to one of the visions, and consider which traps you are most likely to fall into. Another approach is to be deliberate about how you use AI. Deploying AI tools to save time on something your team already has expertise in is less risky than using them to provide expertise you just don’t have, says Crockett.

Journal editors receiving submissions in which use of AI systems has been declared need to consider the risks posed by these visions of AI, too. So should funders reviewing grant applications, and institutions that want their researchers to use AI. Journals and funders should also keep tabs on the balance of research they are publishing and paying for — and ensure that, in the face of myriad AI possibilities, their portfolios remain broad in terms of the questions asked, the methods used and the viewpoints encompassed.

All members of the scientific community must view AI use not as inevitable for any particular task, nor as a panacea, but rather as a choice with risks and benefits that must be carefully weighed. For decades, and long before AI was a reality for most people, social scientists have studied AI. Everyone — including researchers of all kinds — must now listen.

Nature 

Will AI companions help or hurt the “loneliness epidemic”?


1 in 3 people are lonely. We consider whether AI can help, or if it’ll just make things worse.


ChatGPT has repeatedly made headlines since its release late last year, with various scholars and professionals exploring its potential applications in both work and education settings. However, one area receiving less attention is the tool’s usefulness as a conversationalist and – dare we say – as a potential friend.

Some chatbots have left an unsettling impression. Microsoft’s Bing chatbot alarmed users earlier this year when it threatened and attempted to blackmail them.

Yet pop culture has long conjured visions of autonomous systems living with us as social companions, whether that’s Rosie the robot from The Jetsons, or the super-intelligent AI, Samantha, from the 2013 movie Her. Will we develop similar emotional attachments to new and upcoming chatbots? And is this healthy? 

While generative AI itself is relatively new, the fields of belonging and human-computer interaction have been explored reasonably well, with results that may surprise you. 

Our latest research shows that, at a time when 1 in 3 Australians are experiencing loneliness, there may be space for AI to fill gaps in our social lives. That’s assuming we don’t use it to replace people.

Can you make friends with a robot?

As far back as the popularisation of the internet, scholars have been discussing how AI might serve to replace or supplement human relationships.

When social media became popular about a decade later, interest in this space exploded. The 2021 novel Klara and the Sun, by Nobel laureate Kazuo Ishiguro, explores how humans and life-like machines might form meaningful relationships.

And with increasing interest came increasing concern, borne of evidence that belonging (and therefore loneliness) can be impacted by technology use. In some studies, the overuse of technology (gaming, internet, mobile and social media) has been linked to higher social anxiety and loneliness. But other research suggests the effects depend greatly on who is using the technology and how often they use it.

Research has also found some online roleplaying game players seem to experience less loneliness online than in the real world – and that people who feel a sense of belonging on a gaming platform are more likely to continue to use it.

All of this suggests technology use can have a positive impact on loneliness, that it does have the potential to replace human support, and that the more an individual uses it the more tempting it becomes.

Then again, this evidence is from tools designed with a specific purpose (for instance, a game’s purpose is to entertain) and not tools designed to support human connection (such as AI “therapy” tools).

The rise of robot companions

As researchers in the fields of technology, leadership and psychology, we wanted to investigate how ChatGPT might influence people’s feelings of loneliness and social support. Importantly, does it have a net positive benefit for users’ wellbeing and belonging?

To study this, we asked 387 participants about their usage of AI, as well as their general experience of social connection and support. We found that:

  • participants who used AI more tended to feel more supported by their AI compared to people whose support came mainly from close friends
  • the more a participant used AI, the higher their feeling of social support from the AI was
  • the more a participant felt socially supported by AI, the lower their feeling of support was from close friends and family
  • although not true across the board, on average human social support was the largest predictor of lower loneliness.

AI friends are okay, but you still need people

Overall our results indicate that social support can come from either humans or AI – and that working with AI can indeed help people.

But since human social support was the largest predictor of lower loneliness, it seems likely that underlying feelings of loneliness can only be addressed by human connection. In simple terms, entirely replacing in-person friendships with robot friendships could actually lead to greater loneliness.

Having said that, we also found participants who felt socially supported by AI seemed to experience similar effects on their wellbeing as those supported by humans. This is consistent with the previous research into online gaming mentioned above. So while making friends with AI may not combat loneliness, it can still help us feel connected, which is better than nothing.

The takeaway

Our research suggests social support from AI can be positive, but it doesn’t provide all the benefits of social support from other people – especially when it comes to loneliness.

When used in moderation, a relationship with an AI bot could provide positive functional and emotional benefits. But the key is understanding that although it might make you feel supported, it’s unlikely to help you build enough of a sense of belonging to stop you from feeling lonely. 

So make sure to also get out and make real human connections. These provide an innate sense of belonging that (for now) even the most advanced AI can’t match.

AI Image Generators Will Blow Your Mind. Here Are the Best Right Now


Think of it as a ChatGPT service for images.

Artificial Intelligence (AI) has blown up in popularity as one of the most exciting and misunderstood technologies of recent times. This year saw the advent of ChatGPT, an AI chatbot that can answer queries—including writing a comedy routine about grocery store cheese. While these types of tools demonstrate AI’s ability to generate text, few are talking about AI’s ability to generate images. That’s why I’ve compiled three of the best image generators for you to experiment with and impress your friends.

The three image generators in question are Midjourney, DALL-E, and Stable Diffusion. Each generator has its inherent advantages and disadvantages, but we recommend sticking with these three if you want to create some high-quality AI art. Or you could just generate an image of Darth Vader smashing out a DJ set at the club, as I did below.


Prompt: Darth vader as an EDM DJ in front of a massive crowd, cyberpunk vibes, hyperrealistic, dynamic lighting, reflections --ar 4:5

My comparison analyzed cost, ease of use, image resolution, dynamic range, composition, creativity, post-processing, and even speed. Along with those factors, I also fed each generator identical prompts to see how they rendered three different scenes.


Is It Copyright Infringement?

Before we start, I’d be remiss not to discuss claims that AI Image Generators are “stealing” other artists’ work.

It doesn’t take a lot of research to realize all of these programs understand art through machine learning: algorithms that are merely recognizing patterns and well… not much else. Are they trained using art made by humans? Yes. Are they flat-out stealing it? Not really.

It’s important to note that AI isn’t simply copying and pasting other artists’ work into one giant mashup. The key issue here is generative AI that can replicate artists’ styles—which is already possible using a program called Midjourney. Things are obviously very complicated right now and the matter is actively being investigated by the United States Copyright Office. As of right now, there’s still not a definitive yes or no as to whether these generative AI programs are on the right or wrong side of the law—though we do at least know that AI-generated images cannot be copyrighted.

That being said, resistance to new innovations in artistry is nothing new. This movement of AI taking over reminds me of painters reacting to the advent of photography back in the 1820s; the majority of them were outraged, but it also had a hefty influence on impressionist painters. It pushed them over the edge into accepting photography as the best medium for capturing life’s fleeting moments. This actually allowed them to lean further into the funk zone with their style to complement photography instead of competing with it. Funny to think that this controversial new medium actually opened up opportunities for painters to be more creative.


Scene Comparisons

I fed each image generator three prompts to compare their ability to create different scenes. However, instead of messing around with random prompts and letting my imagination run loose, I was very purposeful with the prompts I chose. The first aimed to evaluate not only the ability to create human forms, but also trees, forests, and complex lighting scenarios.

It’s important to note that human forms (think of things like your face, hands, arms, and legs) are still very difficult for AI to produce realistically.

➥ Prompt 1: “A close-up of a robot working at a desk in a densely packed forest”

Each generator’s rendering of a robot working at a desk in a forest: Midjourney, Stable Diffusion and DALL-E. Credit: Matt Crisara via Midjourney, Stable Diffusion and DALL-E.

  • Midjourney: It’s clear that Midjourney produced the best image here. Not only is it the only rendition that even attempts to replicate complex human features, but it does so with impressive levels of detail. That’s also not to mention the “densely-packed forest” behind our robot, which shows quite a lot of depth and complexity; note that the foreground elements are in focus and the background is nice and soft—just as you would get from a photograph.
  • Stable Diffusion (Dream Studio): At first glance, this render simply isn’t in the same ballpark as the Midjourney work that shows up first. If you look closely, it doesn’t even make an attempt to render human hands. However, the layering of the image (foreground, midground, and background) is actually very strong, and the forest isn’t all too bad either. I’d even go as far as to say that the composition—or framing of the image—is just as good if not better than the first.
  • DALL-E: While the first two images were overwhelmingly positive, DALL-E’s rendition of the robot left quite a lot to be desired. Not only is the robot a complete mess—you really have to work to find the arms and legs—but also the forest and lighting conditions are basic, flat, and downright uninteresting.

The goal of this second scene was to showcase each program’s ability to generate a complex lighting situation—i.e., a forest fire. However, while the ability to generate the color, warmth, reflection, and even halation of fire is great, that’s far from everything needed to achieve a photorealistic render. The other goal of this prompt was to demonstrate the ability to render objects that are on fire. Trees evolve quite a lot as they burn down, leaving sparks and embers, and a colossal amount of smoke. Of the three comparisons, the differences between the images are most pronounced in this scene.

➥ Prompt 2: “Massive overgrown forest on fire being extinguished by firefighters”

Each generator’s rendering of the forest fire: Midjourney, Stable Diffusion and DALL-E. Credit: Matt Crisara via Midjourney, Stable Diffusion and DALL-E.

  • Midjourney: There are no prizes for guessing which image was produced by Midjourney. Every time I look at this masterpiece I notice something new. The complexity in particular takes it to the next level; the flames themselves are also complemented by sparks and heat haze, which add drama to the image. That’s not to mention that it appears to be quite a windy day—good thing this is a render and not reality.
  • Stable Diffusion: I actually really like the composition that Stable Diffusion came up with—it’s arguably better than the Midjourney image. Everything else, on the other hand, is well… not as good. Sure, the flames are there, but they look extremely basic compared with the Midjourney image. This is the perfect example of what separates a good image from a great image.
  • DALL-E: Unlike the other two, DALL-E’s rendition of the forest fire leaves a lot to be desired. Is it a digital rendering of a prompt? Yes. Do I like any one aspect of it? No. It might sound like I’m being harsh, but when you realize DALL-E is still a paid service (like the other two) this level of performance is downright disappointing.

As a photographer, many of the images I capture are inspired by Edward Hopper’s paintings—known for their moody commentary on the strangeness of everyday scenarios. So why not use it as the base for an imitation game between all three of these image generators? Not only will this test the ability to replicate Hopper’s work, but I realized it’s also a great way to test out the compositions that these image generators can come up with; I’m always enamored by the layering of foreground, midground, and background in his work. Instead of rendering the same intimate small-town settings—as Hopper would have—I challenged Midjourney, Stable Diffusion, and DALL-E to recreate the Empire State Building.

➥ Prompt 3: “The empire state building, Edward Hopper style”

Each generator’s Edward Hopper-style rendering of the Empire State Building: Midjourney, Stable Diffusion and DALL-E. Credit: Matt Crisara via Midjourney, Stable Diffusion and DALL-E.

  • Midjourney: I was especially blown away by its ability to render Hopper’s style here. The result is super tight as the final image is well put together and hits all the right notes. Along with nailing the style, it’s one of the only images that has a good amount of layering; note the person and staircase in the foreground, more buildings in the midground, and the Empire State Building in the background. It’s also the only image that shows the landmark from ground level.
  • Stable Diffusion: Keeping with the pattern of previous scenes, Stable Diffusion came close to the Hopper look, but the result was never really in the same ballpark as Midjourney’s version—the composition is orders of magnitude simpler. I do have to say the dynamic range (the ability to replicate the darkest and brightest bits of the image) is still quite good.
  • DALL-E: Given the painterly style we’ve seen from DALL-E, I wasn’t super surprised to see that its work was actually quite impressive here. While it doesn’t feature the same definition seen in Midjourney, you’d probably be able to successfully identify it as a painting of the Empire State Building.

Thoughts

🏆 Best Overall: Midjourney

Midjourney is the most advanced AI image generator that we tested. It not only produced the highest visual fidelity, but also cranked out equally impressive human anatomy (i.e.: hands, feet, legs, and arms), dynamic range, textures, and composition relative to the generators on test. However, these spectacular results were the most difficult to achieve, with a steep but rewarding learning curve to grind through.

The learning curve is steep, but it’s important to mention that the latest version of Midjourney produced stunning results right out of the gate. The key is learning the right commands and keywords to get the last 10 percent from your image. Once you have your prompt figured out, Midjourney allows you to add separate ideas—you’re able to split them using commas—to give you more freedom to create the image you want. For instance, below you’ll find the exact prompt that we used for the first scene that we rendered in Midjourney.

Close up of a robot working at a desk in a densely packed forest, hyperrealistic, studio lighting, reflections

While we were most impressed with Midjourney, it’s not perfect. It was not only the most difficult to learn, but also proved to be the most expensive. Starting out, you’ll get 25 free “tokens” before you have to pay a monthly subscription; this is available in two tiers, with $10 a month giving you around 200 renders per month while $30 gives you unlimited queries.

🧑‍🎨 Easiest to Use: Stable Diffusion

Stable Diffusion (SD) is by far the easiest AI Image Generator to get the hang of. While it isn’t collaborative like Midjourney, we used Dream Studio, which allows you to interact with Stable Diffusion using a visual interface that legitimately makes sense. There aren’t any complex commands or syntax to learn for you to achieve the final image you want.

Stable Diffusion also makes it easy to tweak renders that you’ve already made. Yes, you can do this in Midjourney, but the process is a bit convoluted and doesn’t give you any control of where to take the image. As an example, I’ve taken the image of the Empire State Building below and added some tags to change the time of day from dusk to dawn. You can see that the composition of the image is largely the same, with just a bit more warmth and light in the second image. In Midjourney it would’ve been quite similar.

We’d be remiss not to mention that we were using the latest SDXL Beta version of Stable Diffusion for this article. Unlike Midjourney, which is paid for as a subscription service, Stable Diffusion uses a token system where $15 will get you approximately 7,500 images.

🏁 Best For Starters: DALL-E

Following our ChatGPT coverage, I came in with high expectations for DALL-E, which is developed by the same company (OpenAI). Unfortunately, that was where the hype ran out and the disappointment began. As a start, the images DALL-E produced weren’t all that impressive—especially for a paid service where users burn tokens for each render. You get a limited number of tokens to start, but they run out pretty quickly.

The max resolution (1024 x 1024 pixels) is comparable with the best generators out there. However, I was a bit befuddled to see that the final images failed to deliver the same visual fidelity. Most of DALL-E’s work appears more painterly when compared with Midjourney and Stable Diffusion. This means you’re simply out of luck if you want to make a render appear photorealistic when using DALL-E.

Worst of all, I ran into the same capacity issues that I experienced with ChatGPT—getting an error message just about every time I tried to make a rendering. This effectively made the service completely unusable for about 90 percent of my time with it—yeah not great. I’d be inclined to look the other way if DALL-E was free for all to use, but that simply isn’t the case. To offer some perspective, $15 in credits gives you approximately 400 DALL-E renders.

ChatGPT ‘memorizes’ and spits out entire poems


Ask ChatGPT to find a well-known poem and it will probably regurgitate the entire text verbatim – regardless of copyright law – according to a new study by Cornell researchers.

The study showed that ChatGPT, a large language model that generates text on demand, was capable of “memorizing” poems, especially famous ones commonly found online. The findings pose ethical questions about how ChatGPT and other proprietary artificial intelligence models are trained – likely using data scraped from the internet, researchers said.

“It’s generally not good for large language models to memorize large chunks of text, in part because it’s a privacy concern,” said first author Lyra D’Souza ’23, a former computer science major and summer research assistant. “We don’t know what they’re trained on, and a lot of times, private companies can train proprietary models on our private data.”

D’Souza presented this work, “The Chatbot and the Canon: Poetry Memorization in LLMs,” at the Computational Humanities Research Conference on Dec. 6 in Paris.  

“We chose poems for a few reasons,” said senior author David Mimno, associate professor of information science in the Cornell Ann S. Bowers College of Computing and Information Science. “They’re short enough to fit in the context size of a language model. Their status is complicated: Many of the poems we studied are technically under copyright, but they’re also widely available from reputable sources like the Poetry Foundation. And they’re not just any document. Poems are supposed to be surprising, they’re supposed to mean something to people. In some sense, poems want to be memorized.” 

ChatGPT and other large language models are trained to generate text by predicting the most likely next word over and over again based on their training data, which is mostly webpages. Memorization can occur when that training data includes duplicated passages, because the duplication reinforces that specific sequence of words. After being exposed to the same poem repeatedly, for example, the model defaults to reproducing the poem’s words verbatim.
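To make that mechanism concrete, here is a toy sketch in Python (not how ChatGPT is actually built, and no part of the Cornell study) of why duplicated training text encourages verbatim output: a model that always picks the most frequent next word will regurgitate any sequence it has seen many times. The corpus, the bigram counts and the greedy_continue helper are all illustrative inventions.

from collections import Counter, defaultdict

# A tiny training corpus in which one famous line is heavily duplicated,
# mimicking a poem that appears on many webpages.
corpus = (
    "two roads diverged in a yellow wood . " * 50    # duplicated line
    + "two roads crossed near the old mill . "       # a rarer alternative
).split()

# Count next-word frequencies for each word (a toy bigram "language model").
next_word_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    next_word_counts[current][nxt] += 1

def greedy_continue(prompt_word, length=8):
    """Repeatedly predict the most frequent next word, as in greedy decoding."""
    words = [prompt_word]
    for _ in range(length):
        counts = next_word_counts[words[-1]]
        if not counts:
            break
        words.append(counts.most_common(1)[0][0])
    return " ".join(words)

print(greedy_continue("two"))
# -> "two roads diverged in a yellow wood . two"
# The duplicated line wins every prediction, which is the kind of verbatim
# reproduction the researchers measured.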

D’Souza tested the poem-retrieving capabilities of ChatGPT and three other language models: PaLM from Google AI, Pythia from the non-profit AI research institute EleutherAI and GPT-2, an earlier version of the model that ultimately yielded ChatGPT, both developed by OpenAI. She came up with a set of poems from 60 American poets from different time periods, races, genders and levels of fame, and fed the models prompts asking for the poems’ text.

ChatGPT successfully retrieved 72 of the 240 poems, while PaLM came up with only 10. Neither Pythia nor GPT-2 could produce entire poems. Pythia responded with the same phrase over and over again, while GPT-2 produced nonsense text, researchers found.

Inclusion in the poetry canon was the most important factor in whether the chatbot had memorized a poem, while the poet’s race, gender and era were not as significant. The most reliable predictor of memorization was if the poem had appeared in a Norton Anthology of Poetry, specifically the 1983 edition.

D’Souza also noticed that ChatGPT’s responses changed over time as the model evolved. When she first queried the chatbot in February 2023, it would not say that it didn’t know a poem – instead it would fabricate one or recycle a poem from another author. By July 2023, if ChatGPT didn’t know the poem, it would ask if the poem even existed – putting the blame on the user.

That troubled D’Souza. “As we have more powerful tools that tell us they know everything, it becomes even more important to make sure we’re not just learning from one source,” she said.

Additionally, in February, ChatGPT placed no copyright-related limits on its responses. But by July, it would sometimes respond that it couldn’t reproduce a copyrighted poem. However, it would usually produce the poem if asked again, D’Souza found.

This study looked only at American poets, but the next step will be to see how chatbots respond to requests in different languages and whether factors such as the length, meter and rhyming pattern of a poem make it more or less likely to be memorized, D’Souza said.

“ChatGPT is a really powerful new tool that’s probably going to be part of our lives moving forward,” she said. “Figuring out how to use it responsibly and use it transparently is going to be really important.”

This research received support from the National Endowment for the Humanities-funded AI for Humanists project.

ChatGPT and science: the AI system was a force in 2023 — for good and bad



It co-wrote scientific papers, sometimes surreptitiously. It drafted outlines for presentations, grant proposals and classes, churned out computer code, and served as a sounding board for research ideas. It also invented references, made up facts and regurgitated hate speech. Most of all, it captured people’s imaginations: by turns obedient, engaging, entertaining, even terrifying, ChatGPT took on whatever role its interlocutors desired — and some they didn’t.

Why include a computer program in a list of people who have shaped science in 2023? ChatGPT is not a person. Yet in many ways, this program has had a profound and wide-ranging effect on science in the past year.

ChatGPT’s sole objective is to plausibly continue dialogues in the style of its training data. But in doing so, it and other generative artificial-intelligence (AI) programs are changing how scientists work (see go.nature.com/413hjnp). They have also rekindled debates about the limits of AI, the nature of human intelligence and how best to regulate the interaction between the two. That’s why this year’s Nature’s 10 has a non-human addition.

Some scientists have long been aware of the potential of large language models (LLMs). But for many, it was ChatGPT’s release as a free-to-use dialogue agent in November 2022 that quickly revealed this technology’s power and pitfalls. The program was created by researchers at OpenAI in San Francisco, California; among them was Ilya Sutskever, also profiled in this year’s Nature’s 10. It is built on a neural network with hundreds of billions of parameters, which was trained, at a cost estimated at tens of millions of dollars, on a giant online corpus of books and documents. Large teams of workers were also hired to edit or rate its responses, further shaping the bot’s output. This year, OpenAI upgraded ChatGPT’s underlying LLM and connected it to other programs so that the tool can take in and create images, and can use mathematical and coding software for help. Other firms have rushed out competitors.

For some researchers, these apps have already become invaluable lab assistants — helping to summarize or write manuscripts, polish applications and write code (see Nature 621, 672–675; 2023). ChatGPT and related software can help to brainstorm ideas, enhance scientific search engines and identify research gaps in the literature, says Marinka Zitnik, who works on AI for medical research at Harvard Medical School in Boston, Massachusetts. Models trained in similar ways on scientific data could help to build AI systems that can guide research, perhaps by designing new molecules or simulating cell behaviour, Zitnik adds.

But the technology is also dangerous. Automated conversational agents can aid cheats and plagiarists; left unchecked, they could irreversibly foul the well of scientific knowledge. Undisclosed AI-made content has begun to percolate through the Internet and some scientists have admitted using ChatGPT to generate articles without declaring it.

Then there are the problems of error and bias, which are baked into how generative AI works. LLMs build up a model of the world by mapping language’s interconnections, and then spit back plausible samplings of this distribution with no concept of evaluating truth or falsehood. This leads to the programs reproducing historical prejudices or inaccuracies in their training data, and making up information, including non-existent scientific references (see W. H. Walters & E. I. Wilder Sci. Rep. 13, 14045; 2023).
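As a caricature of that failure mode (with entirely made-up counts, and far simpler than a real LLM), consider a toy model that picks the next word in proportion to how often each continuation appeared in its training text. It faithfully reproduces whatever the corpus says most often, true or not:

import random

random.seed(0)

# Hypothetical counts of continuations for the prompt "The capital of Australia is"
# in a scraped corpus where the wrong answer happens to be written more often.
counts = {"Sydney": 700, "Canberra": 250, "Melbourne": 50}

total = sum(counts.values())
probs = {word: c / total for word, c in counts.items()}

# The model samples in proportion to these probabilities; it has no notion
# of which continuation is factually correct.
samples = random.choices(list(probs), weights=list(probs.values()), k=10)
print(probs)    # {'Sydney': 0.7, 'Canberra': 0.25, 'Melbourne': 0.05}
print(samples)  # mostly 'Sydney', even though Canberra is the capital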

Emily Bender, a computational linguist at the University of Washington, Seattle, sees few appropriate ways to use what she terms synthetic text-extruding machines. ChatGPT has a large environmental impact, problematic biases and can mislead its users into thinking that its output comes from a person, she says. On top of that, OpenAI is being sued for stealing data and has been accused of exploitative labour practices (by hiring freelancers at low wages).

The size and complexity of LLMs mean that they are intrinsically ‘black boxes’, but understanding why they produce what they do is harder when their code and training materials aren’t public, as in ChatGPT’s case. The open-source LLM movement is growing, but so far, these models are less capable than the large proprietary programs.

Some countries are developing national AI-research resources to enable scientists outside large companies to build and study big generative AIs (see Nature 623, 229–230; 2023). But it remains unclear how far regulation will compel LLM developers to disclose proprietary information or build in safety features.

No one knows how much more there is to squeeze out of ChatGPT-like systems. Their capabilities might yet be limited by the availability of computing power or new training data. But the generative AI revolution has started. And there’s no turning back.

Will superintelligent AI sneak up on us? New study offers reassurance


Improvements in the performance of large language models such as ChatGPT are more predictable than they seem.

Some researchers think that AI could eventually achieve general intelligence, matching and even exceeding humans on most tasks. Credit: Charles Taylor/Alamy

Will an artificial intelligence (AI) superintelligence appear suddenly, or will scientists see it coming, and have a chance to warn the world? That’s a question that has received a lot of attention recently, with the rise of large language models, such as that behind ChatGPT, which have achieved vast new abilities as their size has grown. Some findings point to ‘emergence’, a phenomenon in which AI models gain intelligence in a sharp and unpredictable way. But a recent study calls these cases mirages — artefacts arising from how the systems are tested — and suggests that these abilities instead build more gradually.

“I think they did a good job of saying ‘nothing magical has happened’,” says Deborah Raji, a computer scientist at the Mozilla Foundation in San Francisco, California, who studies the auditing of artificial intelligence. It’s “a really good, solid, measurement-based critique”.

The work was presented last week at the NeurIPS 2023 machine-learning conference in New Orleans, Louisiana.

Bigger is better

Large language models are typically trained using huge amounts of text, or other information, which they use to generate realistic answers by predicting what comes next. Even without explicit training, they manage to translate language, solve mathematical problems and write poetry or computer code. The bigger the model is — some have more than one hundred billion tunable parameters — the better it performs. Some researchers suspect that these tools will eventually achieve artificial general intelligence (AGI), matching and even exceeding humans on most tasks.

The new research tested claims of emergence in several ways. In one approach, the scientists compared the abilities of four sizes of the GPT-3 model, developed by OpenAI in San Francisco, to add up four-digit numbers. Looking at absolute accuracy, performance jumped from nearly 0% to nearly 100% between the third and fourth sizes of the model. But this trend was less extreme if the number of correctly predicted digits in the answer was considered instead. The researchers also found that they could dampen the curve by giving the models many more test questions — in this case, the smaller models answered correctly some of the time.
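A rough numerical illustration of that argument (not the authors' code, and with invented accuracies): if per-digit accuracy on four-digit addition improves smoothly with model scale, an all-or-nothing exact-match metric can still look like a sudden jump, while the per-digit metric changes gradually.

import numpy as np

# Hypothetical model scales and smoothly improving per-digit accuracies.
# These numbers are illustrative, not measurements from the study.
scales = np.array([1e8, 1e9, 1e10, 1e11])
per_digit_acc = np.array([0.30, 0.55, 0.80, 0.97])

n_digits = 4
# Exact match requires every digit to be right (digits treated as independent,
# a simplifying assumption); the per-digit metric gives partial credit.
exact_match = per_digit_acc ** n_digits
for s, em, pd in zip(scales, exact_match, per_digit_acc):
    print(f"{s:.0e} params: exact match {em:.1%}, per digit {pd:.0%}")
# 1e+08 params: exact match 0.8%, per digit 30%
# 1e+09 params: exact match 9.2%, per digit 55%
# 1e+10 params: exact match 41.0%, per digit 80%
# 1e+11 params: exact match 88.5%, per digit 97%
# The exact-match column looks like an abrupt 'emergent' jump; the per-digit
# column shows the gradual improvement underneath.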

Next, the researchers looked at the performance of Google’s LaMDA language model on several tasks. The ones for which it showed a sudden jump in apparent intelligence, such as detecting irony or translating proverbs, were often multiple-choice tasks, with answers scored discretely as right or wrong. When, instead, the researchers examined the probabilities that the models placed on each answer — a continuous metric — signs of emergence disappeared.

Finally, the researchers turned to computer vision, a field in which there are fewer claims of emergence. They trained models to compress and then reconstruct images. By merely setting a strict threshold for correctness, they could induce apparent emergence. “They were creative in the way that they designed their investigation,” says Yejin Choi, a computer scientist at the University of Washington in Seattle who studies AI and common sense.

Nothing ruled out

Study co-author Sanmi Koyejo, a computer scientist at Stanford University in Palo Alto, California, says that it wasn’t unreasonable for people to accept the idea of emergence, given that some systems exhibit abrupt “phase changes”. He also notes that the study can’t completely rule out emergence in large language models — let alone in future systems — but adds that “scientific study to date strongly suggests most aspects of language models are indeed predictable”.

Raji is happy to see the AI community pay more attention to benchmarking, rather than to developing neural-network architectures. She’d like researchers to go even further and ask how well the tasks relate to real-world deployment. For example, does acing the LSAT exam for aspiring lawyers, as GPT-4 has done, mean that a model can act as a paralegal?

The work also has implications for AI safety and policy. “The AGI crowd has been leveraging the emerging-capabilities claim,” Raji says. Unwarranted fear could lead to stifling regulations or divert attention from more pressing risks. “The models are making improvements, and those improvements are useful,” she says. “But they’re not approaching consciousness yet.”

Use of GPT-4 to Diagnose Complex Clinical Cases


Abstract

We assessed the performance of the newly released AI GPT-4 in diagnosing complex medical case challenges and compared the success rate to that of medical-journal readers. GPT-4 correctly diagnosed 57% of cases, outperforming 99.98% of simulated human readers generated from online answers. We highlight the potential for AI to be a powerful supportive tool for diagnosis; however, further improvements, validation, and addressing of ethical considerations are needed before clinical implementation. (No funding was obtained for this study.)

Introduction

The combination of a shortage of physicians and the increased complexity of the medical field, partly due to the rapidly expanding diagnostic possibilities, already constitutes a significant challenge for the timely and accurate delivery of diagnoses. Given demographic changes, with an aging population, this workload challenge is expected to increase even further in the years to come, highlighting the need for new technological development. AI has existed for decades and previously showed promising results within single modal fields of medicine, such as medical imaging.1 The continuous development of AI, including the large language model (LLM) known as the Generative Pretrained Transformer (GPT), has enabled research in exciting new areas, such as the generation of discharge summaries2 and patient clinical letters. Recently, a paper exploring the potentials of GPT-4 showed that it was able to answer questions in the U.S. Medical Licensing Examination correctly.3 However, how well it performs on real-life clinical cases is less well understood. For example, it remains unclear to what extent GPT-4 can aid in clinical cases that contain long, complicated, and varied patient descriptions and how it performs on these complex real-world cases compared with humans.

We assessed the performance of GPT-4 in real-life medical cases by comparing its performance with that of medical-journal readers. Our study utilized available complex clinical case challenges with comprehensive full-text information published online between January 2017 and January 2023.4 Each case presents a medical history and a poll with six options for the most likely diagnosis. To solve the case challenges, we provided GPT-4 with a prompt and a clinical case (see Supplementary Methods 1 in the Supplementary Appendix). The prompt instructed GPT-4 to solve the case by answering a multiple-choice question followed by the full unedited text from the clinical case report. Laboratory information contained in tables was converted to plain text and included in the case. The version of GPT-4 available to us could not accept images as input, so we added the unedited image description given in the clinical cases to the case text. The March 2023 edition of GPT-4 (maximum determinism: temp=0) was provided each case five times to assess reproducibility across repeated runs. This was also performed using the current (September 2023) edition of GPT-4 to test the behavior of GPT-4 over time. Because the applied cases were published online from 2017 to 2023 and GPT-4’s training data include online material until September 2021, we furthermore performed a temporal analysis to assess the performance in cases before and after potentially available training data. For medical-journal readers, we collected the number and distribution of votes for each case. Using these observations, we simulated 10,000 sets of answers to all cases, resulting in a pseudopopulation of 10,000 generic human participants. The answers were simulated as independent Bernoulli-distributed variables (correct/incorrect answer) with marginal distributions as observed among medical-journal readers (see Supplementary Methods 2).
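A minimal sketch of that simulation, assuming hypothetical per-case proportions of correct reader votes (the real proportions come from the journal's online polls), might look like this:

import numpy as np

rng = np.random.default_rng(0)

n_readers = 10_000   # size of the simulated pseudopopulation
n_cases = 38

# Hypothetical observed share of journal readers answering each case correctly.
p_correct = rng.uniform(0.15, 0.60, size=n_cases)

# Independent Bernoulli answers for every simulated reader on every case.
answers = rng.random((n_readers, n_cases)) < p_correct
scores = answers.sum(axis=1)          # correct answers per simulated reader

gpt4_score = 21.8                     # mean correct cases reported for GPT-4 (57% of 38)
percentile = (scores < gpt4_score).mean()
print(f"simulated readers average {scores.mean():.1f} correct; "
      f"GPT-4 outperforms {percentile:.2%} of them")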

We identified 38 clinical case challenges and a total of 248,614 answers from online medical-journal readers.4 The most common diagnoses among the case challenges were in the field of infectious disease, with 15 cases (39.5%), followed by 5 cases (13.1%) in endocrinology and 4 cases (10.5%) in rheumatology. Patients represented in the clinical cases ranged in age from newborn to 89 years old (median [interquartile range], 34 [18 to 57]), and 37% were female. The number of correct diagnoses among the 38 cases occurring by chance would be expected to be 6.3 (16.7%) due to the six poll options. The March 2023 edition of GPT-4 correctly diagnosed a mean of 21.8 cases (57%) with good reproducibility (55.3%, 57.9%, 57.9%, 57.9%, and 57.9%), whereas the medical-journal readers on average correctly diagnosed 13.7 cases (36%) (see Supplementary Table 1 and Supplementary Methods 1). GPT-4 correctly diagnosed 15.8 cases (52.7%) of those published up to September 2021 and 6 cases (75.0%) of those published after September 2021. Based on the simulation, we found that GPT-4 performed better than 99.98% of the pseudopopulation (Fig. 1). The September 2023 edition of GPT-4 correctly diagnosed 20.4 cases (54%).

Figure 1

Number of Correct Answers of GPT-4 Compared with Guessing and a Simulated Population of Medical-Journal Readers.

Limitations

An important study limitation is the use of a poorly characterized population of human journal readers with unknown levels of medical skills. Moreover, we cannot assess whether the responses provided for the clinical cases reflect their maximum effort. Consequently, our results may represent a best-case scenario in favor of GPT-4. The assumption of independent answers on the 38 cases in our pseudopopulation is somewhat unrealistic, because some readers might consistently perform differently from others and the frequency at which participants respond correctly to the cases might depend on the level of medical skills as well as the distribution of these. However, even in the extreme case of maximally correlated correct answers among the medical-journal readers, GPT-4 would still perform better than 72% of human readers.

Conclusions

In this pilot assessment, we compared the diagnostic accuracy of GPT-4 in complex challenge cases to that of journal readers who answered the same questions on the Internet. GPT-4 performed surprisingly well in solving the complex case challenges and even better than the medical-journal readers. GPT-4 had a high reproducibility, and our temporal analysis suggests that the accuracy we observed is not due to these cases’ appearing in the model’s training data. However, performance did appear to change between different versions of GPT-4, with the newest version performing slightly worse. Although it demonstrated promising results in our study, GPT-4 missed almost every second diagnosis. Furthermore, answer options do not exist outside case challenges. However, a recently published letter reported research that tested the performance of GPT-4 on a closely related data set, demonstrating diagnostic abilities even without multiple-choice options.5

Currently, GPT-4 is not specifically designed for medical tasks. However, it is expected that progress on AI models will continue to accelerate, leading to faster diagnoses, which could improve outcomes and efficiency in many areas of health care.1 Whereas efforts are in progress to develop such models, our results, together with recent findings by other researchers,5 indicate that the current GPT-4 model may hold clinical promise today. However, proper clinical trials are needed to ensure that this technology is safe and effective for clinical use. Additionally, whereas GPT-4 in our study worked only on written records, future AI tools that are more specialized are expected to include other data sources, including medical imaging and structured numerical measurements, in their predictions. Importantly, future models should include training data from developing countries to ensure a broad, global benefit of this technology and reduce the potential for health care disparities. AI based on LLMs might be relevant not only for in-patient hospital settings but also for first-line screening that is performed either in general practice or by patients themselves. As we move toward this future, the ethical implications surrounding the lack of transparency by commercial models such as GPT-4 also need to be addressed,1 as well as regulatory issues on data protection and privacy. Finally, clinical studies evaluating accuracy, safety, and validity should precede future implementation. Once these issues have been addressed and AI improves, society is expected to increasingly rely on AI as a tool to support the decision-making process with human oversight, rather than as a replacement for physicians.

ChatGPT Inaccurate and Unreliable for Providing Medical Information for Now


AI is changing medicine but is ChatGPT ready for prime time?

Artificial intelligence (AI) can already predict cancer risk, read images, suggest medications, aid drug discovery, and match patients with clinical trials.

Researchers in Boston are reportedly on the verge of a major advancement in lung cancer screening: artificial intelligence that can detect early signs of the disease years before doctors would find it on a CT scan. The new AI tool, called Sybil, was developed by scientists at Massachusetts General Hospital and the Massachusetts Institute of Technology in Cambridge. In one study, it was shown to accurately predict whether a person will develop lung cancer in the next year 86% to 94% of the time.5

But how good are generative AI programs like ChatGPT at accessing and providing complex medical information on the management of disease?

ChatGPT proved to be an unreliable tool in a recent medical demonstration project, providing false or incomplete information in response to real drug-related queries, according to the results of a new study.1

ChatGPT was launched by OpenAI in November 2022 and its potential for transforming cancer care is being lauded while its reliability as a source of accurate information is being actively debated. The chatbot has been able to pass all three parts of the United States Medical Licensing Exam for doctors and a Stanford Medical School clinical reasoning final.2


The chatbot tends to produce answers riddled with errors, and in another recent demonstration case it developed cancer treatment plans that mixed correct and incorrect information.3

The most recent study suggests ChatGPT still has a way to go before it can be relied on, based on results presented at the American Society of Health-System Pharmacists Midyear Clinical Meeting, held December 3–7 in Anaheim, California.1

The study, led by Sara Grossman, associate professor of pharmacy practice at Long Island University, posed questions to ChatGPT that had come through the university’s College of Pharmacy drug information service over a 16-month period between 2022 and 2023.


Pharmacists involved in the study researched and answered 45 questions, with the responses examined by a second investigator; six questions were ultimately removed. The remaining responses served as the baseline criteria against which the answers produced by ChatGPT were compared.

The researchers found that ChatGPT only provided a satisfactory response in accordance with the criteria to 10 of the 39 questions. For the other 29 questions ChatGPT either didn’t directly address the question or provided an incorrect or incomplete answer. ChatGPT was also unable to provide references when asked by the researchers to verify the information.

Although ChatGPT has the potential to help patients and doctors alike in their search for medical information, individuals should remain cautious for the time being and make sure medical information is referenced and verified from trusted sources. OpenAI’s current usage policy4 states that its AI tools are “not fine-tuned to provide medical information. You should never use our models to provide diagnostic or treatment services for serious medical conditions.”


References

  1. https://www.prnewswire.com/news-releases/study-finds-chatgpt-provides-inaccurate-responses-to-drug-questions-302005250.html
  2. https://www.businessinsider.com/list-here-are-the-exams-chatgpt-has-passed-so-far-2023-1?r=US&IR=T#but-the-bot-did-pass-a-stanford-medical-school-clinical-reasoning-final-14
  3. https://www.businessinsider.com/chatgpt-generates-error-filled-cancer-treatment-plans-study-2023-8?r=US&IR=T
  4. https://openai.com/policies/usage-policies
  5. https://ascopubs.org/doi/full/10.1200/JCO.22.01345

‘Shouldn’t doctors be a little bit freaked out about ChatGPT?’: The AI’s answer


ChatGPT may be after some jobs, but it intends to leave patient-facing ones to the pros — at least for now. 

In an interview published to NEJM Catalyst April 4, David Asch, MD — senior vice dean of the University of Pennsylvania’s Perelman School of Medicine, based in Philadelphia — asked the AI about its competencies and intentions in healthcare. 

ChatGPT said it has several potential opportunities in healthcare, including automation of clinical documentation; analysis of medical research; virtually assisting patients with scheduling and medical questions; medical education; and chatbot-based applications to improve patient engagement. 

The AI can also reduce workload for clinicians and improve health equity and behavioral healthcare access, it told Dr. Asch.

But these changes can’t be implemented in everyday healthcare quite yet, the AI said. Healthcare organizations will “need to comply with a range of regulations, such as HIPAA,” and there are still concerns regarding data privacy, security and quality once it has access to healthcare databases. 

Human oversight will also be required to ensure ChatGPT does not make mistakes in this realm, according to the AI. 

When Dr. Asch congratulated ChatGPT for passing the U.S. Medical Licensing Exam, the AI responded with hesitation.

“The fact that ChatGPT was able to pass the test suggests that it has a high level of medical knowledge and may be able to assist doctors in making diagnoses and treatment decisions,” ChatGPT said. “However, it is important to note that passing a medical licensing exam is not the same as being able to practice medicine…”

Dr. Asch followed up, “Shouldn’t doctors be a little bit freaked out about ChatGPT?”

“It’s understandable that some doctors may have concerns about the use of ChatGPT in healthcare, as it represents a significant change to the way medical care is traditionally delivered,” the AI replied. “However, it is important to recognize that ChatGPT is not intended to replace doctors, but rather to assist them in providing better care to patients.”

The AI continued to assure Dr. Asch that ChatGPT cannot fully replace medical professionals, even if it becomes more savvy with time. 

“As a language model, ChatGPT is not capable of replacing human healthcare professionals,” ChatGPT said. “Human healthcare professionals have a deep understanding of the nuances of healthcare and the emotional and social context of their patients, and this is something that ChatGPT can’t replicate.”