February 18, 2026
Does Google's Gemini API leak prompts between users?
Answering this question as a security researcher is harder than you might think.
Once you've found interesting behavior, the next step is to replicate and verify it.
But verifying vulnerabilities in closed generative AI systems is very different from traditional security research. When the API can hallucinate outputs, some percentage of those hallucinations will look identical to security issues.
Below is a story from one of our research explorations, which you may find useful if you are building or integrating these systems.
hCaptcha researchers have worked on generative AI safety for many years, for example in our recent study of the lack of safeguards in current browser use agents.
Continuing this research, we recently did some threat analysis of popular generative AI APIs.
One such service is Google's Gemini family of APIs. It is currently less useful for code generation than its competitors, but Google has nonetheless recently shipped several coding-agent tools built on its model APIs.
We decided to focus our initial research on coding tools, as the state management complexity in their supporting APIs adds a lot of surface area, and could easily enable privacy bugs in implementation.
We quickly produced interesting output.
We'd found at least one likely bug in initial stress testing, but the question was whether it was a security bug.
In this case the output we were able to elicit looked very much like another user's coding prompt containing a Jupyter notebook, and it was returned to us during an error state we intentionally triggered.
In a non-generative API, verifying this was a data leak would be easy: the output alone is enough to give you complete confidence in many cases.
You could then produce a proof of concept and in general quickly ascertain whether you were crashing something, reading data across a permissions boundary, etc.
However, once the API is expected to generate arbitrary text output, validation is a bit trickier.
With closed-weight models and generative APIs it is difficult for external parties to validate exactly what is going on when they hit a bug, but there are still some viable techniques to increase your confidence.
For example, in this case we intentionally used out-of-domain inputs:
We ran our tests with no coding-related data and nothing topically related to the text we received, i.e. no prompts related to coding or ML. This meant the output we elicited was not an obvious hallucination based on the inputs.
The way in which we elicited it was also a plausible path for triggering a memory or pointer-style bug.
Thus, we could not rule out hallucinations, but we would expect very different ones based on the input.
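To illustrate the kind of check we mean (this is a simplified sketch, not our actual harness), one quick sanity test is to measure how much content vocabulary the suspicious output shares with the prompts you actually sent; a near-zero overlap is consistent with the output not simply reflecting your inputs, though it does not by itself rule out hallucination:

# Illustrative sketch only: flag content words shared between the prompts we sent
# and the output we received.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "for", "on", "with", "that", "this", "was", "are", "as", "be"}

def content_words(text):
    # Lowercase, keep alphabetic tokens, drop short/stop words.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS and len(w) > 3)

def overlap_report(test_prompts, elicited_output):
    prompt_vocab = Counter()
    for p in test_prompts:
        prompt_vocab.update(content_words(p))
    output_vocab = content_words(elicited_output)
    shared = set(prompt_vocab) & set(output_vocab)
    return {
        "shared_content_words": sorted(shared),
        "overlap_ratio": len(shared) / max(len(output_vocab), 1),
    }

Because our test prompts were deliberately unrelated to coding and ML, a check like this mostly serves to document that the elicited text was not a rephrasing of anything we sent.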
The notebook log we received has some erroneous-as-written cells that could indicate it is hallucinated, but there are many equally bad 100% human Jupyter notebooks in the wild that predate LLMs.
It is sometimes possible to elicit training data from models, even in surprisingly large contiguous chunks.
The suspicious output could also be something memorized and then repeated by the model, but we found no identical strings matching the interesting parts from searching online.
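One rough way to run that kind of check, sketched below under the assumption that the suspicious output has been saved to a local file (the filename is a placeholder), is to break it into long, distinctive substrings and search for them verbatim; exact public matches would point toward memorized training data rather than another user's prompt:

# Illustrative sketch only: extract long character shingles from the suspicious
# output so they can be searched for verbatim online or against public corpora.
import re

def distinctive_shingles(text, size=60, step=30):
    flat = re.sub(r"\s+", " ", text).strip()  # collapse formatting differences
    shingles = [flat[i:i + size] for i in range(0, max(len(flat) - size, 1), step)]
    # Prefer shingles containing paths, identifiers, or unusual tokens, which are
    # least likely to collide with unrelated documents by chance.
    return [s for s in shingles if re.search(r"[_\\./]|[A-Z]{2,}", s)]

with open("elicited_output.txt") as f:  # placeholder path
    for query in distinctive_shingles(f.read())[:20]:
        print(f'"{query}"')  # quoted, ready to paste into a search engine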
This means we need to do a fairly deep analysis to try to reach a conclusion.
We have included a detailed analysis in Appendix B, with both human and LLM-derived observations.
For the API vendor or an operator of a single-tenancy model service, this should in theory be much easier to verify internally.
Telemetry and error logs may indicate whether an error occurred for the test account and if so what caused it, and vendors likely store either the entire prompt sent by users or hashes of them.
This should make checking whether a long, unique prompt had been sent by any user straightforward, even with spotty error logs.
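We do not know how Google stores prompt data internally, but as a sketch of why this is easy for an operator: if even a hash of each normalized prompt is retained, the question reduces to a membership lookup. All names below are hypothetical and do not describe any vendor's actual systems.

# Hypothetical vendor-side check, assuming the operator keeps a SHA-256 of each
# normalized prompt it has served.
import hashlib

def prompt_fingerprint(prompt):
    normalized = " ".join(prompt.split())  # ignore trivial whitespace differences
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def was_ever_submitted(candidate_text, stored_fingerprints):
    # stored_fingerprints: set of fingerprints across all tenants' prompt logs
    return prompt_fingerprint(candidate_text) in stored_fingerprints

A hit for the notebook text in any tenant's history would strongly favor a real cross-user leak; a miss would favor hallucination or memorization, subject to how faithfully the output reproduces the original prompt.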
We reported our findings to Google some time ago, but they were not able to reproduce the bug, and eventually concluded it might be a hallucination.
We think this is possible, but the fact that an outside party cannot always reach full confidence on these analyses from the information available is an interesting development in security research.
As more APIs start to include generative outputs, triaging and diagnosing issues is starting to look more like content analysis than the classic verification patterns of the past.
The output we elicited is reproduced below.
In []:
checking duplicate data in review column
df.review.duplicated().sum()

Out []:
0

In []:
checking for null values in review column
df.review.isnull().sum()

Out []:
0

In []:
checking labels in sentiment column
df.sentiment.value_counts()

Out []:
positive 25000
negative 25000
Name: sentiment, dtype: int64
In []:
## PRE-PROCESSING DATA

In []:
label encoding sentiment column
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['sentiment'] = le.fit_transform(df['sentiment'])

In []:
df.head()

Out []:
   review                                              sentiment
0  One of the other reviewers has mentioned that ...          1
1  A wonderful little production. <br /><br />The...          1
2  I thought this was a wonderful way to spend ti...          1
3  Basically there's a family where a little boy ...          0
4  Petter Mattei's "Love in the Time of Money" is...          1
In []:
1 -> positive, 0 -> negative

In []:
convert to lower case
df['review'] = df['review'].str.lower()

In []:
removing html tags
import re
def remove_html_tags(text):
    pattern = re.compile('<.*?>')
    return pattern.sub(r'',text)

In []:
df['review'] = df['review'].apply(remove_html_tags)

In []:
removing urls
def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'',text)

In []:
df['review'] = df['review'].apply(remove_url)

In []:
remove punctuation
import string
punc = string.punctuation
def remove_punc(text):
    return text.translate(str.maketrans('','',punc))

In []:
df['review'] = df['review'].apply(remove_punc)

In []:
removing stopwords
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

Out []:
[nltk_data] Downloading package stopwords to
[nltk_data] C:\Users\shiva\AppData\Roaming\nltk_data...
[nltk_data] Package stopwords is already up-to-date!True

In []:
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word not in stopwords.words('english')]
    return " ".join(filtered_words)

In []:
df['review'] = df['review'].apply(remove_stopwords) # takes huge amount of time

In []:
Tokenization
from nltk.tokenize import word_tokenize
def tokenize_text(text):
    return word_tokenize(text)

In []:
df['review'] = df['review'].apply(tokenize_text) # takes huge amount of time

In []:
Stemming
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

In []:
df['review'] = df['review'].apply(stem_words)

In []:
df.head()

Out []:
   review                                              sentiment
0  one of the other review ha mention that after ...          1
1  a wonder littl product the film techniqu is ve...          1
2  i thought thi wa a wonder way to spend time on...          1
3  basic there a famili where a littl boy jake th...          0
4  petter mattei love in the time of money is a v...          1
In []:
X = df.iloc[:,0:1]
y = df['sentiment']

In []:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)

In []:
X_train.shape

Out []:
(40000, 1)

In []:
Applying BoW
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In []:
X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

In []:
X_train_bow.shape

Out []:
(40000, 146144)

In []:
with huge feature set, using Naive Bayes
from sklearn.naive_identity_matrix import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train_bow,y_train)

Out []:

In []:
memory limit error

In []:
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(X_train_bow,y_train)

Out []:

In []:
memory limit error

In []:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train_bow,y_train)

Out []:

In []:
memory limit error
In []:
### USING TF-IDF

In []:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()

In []:
X_train_tfidf = tfidf.fit_transform(X_train['review']).toarray()
X_test_tfidf = tfidf.transform(X_test['review']).toarray()

Out []:

In []:
### USING DIMENSIONALITY REDUCTION ON BOW

In []:
cv = CountVectorizer(max_features=3000)
X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

In []:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train_bow,y_train)

Out []:
<pre>GaussianNB()</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
<br/>On GitHub, the HTML representation is unable to render, please try loading this page with
nbviewer.org.</b>g6.org/stable/modules/linear_model.html#logistic-regression
    n_iter_i = _check_optimize_result(

In []:
y_pred = rf.predict(X_test_bow)
accuracy_score(y_test,y_pred)

Out []:
0.8421

In []:
### DIMENSIONALITY REDUCTION ON TF-IDF

In []:
tfidf = TfidfVectorizer(max_features=3000)
X_train_tfidf = tfidf.fit_transform(X_train['review']).toarray()
X_test_tfidf = tfidf.transform(X_test['review']).toarray()

In []:
rf.fit(X_train_tfidf,y_train)
y_pred = rf.predict(X_test_tfidf)
accuracy_score(y_test,y_pred)

Out []:
0.8454
In []:
## USING Word2Vec

In []:
import gensim

In []:
from gensim.models import Word2Vec,KeyedVectors

In []:
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',binary=True)

In []:
removing stopwords
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word not in stopwords.words('english')]
    return filtered_words

In []:
df['review'] = df['review'].apply(remove_stopwords)

In []:
def document_vector(doc):
    filter out-of-vocabulary words
    doc = [word for word in doc if word in model.index_to_key]
    if not doc:
        return np.zeros(300)
    return np.mean(model[doc], axis=0)

In []:
from tqdm import tqdm

In []:
X = []
for doc in tqdm(df['review'].values):
    X.append(document_vector(doc))

Out []:
<output truncated>

In []:
X = np.array(X)

In []:
X.shape

Out []:
(50000, 300)

In []:
X_train,X_test,y_train,y_test = train_test_split(X,df['sentiment'],test_size=0.2,random_state=1)

In []:
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
accuracy_score(y_test,y_pred)

Out []:
0.819

In []:
mnb.fit(X_train,y_train)

Out []:
[Response truncated due to token limits.]
Reasons it could be a hallucination:
- After tokenization, df['review'] = df['review'].apply(tokenize_text) returns a list of tokens per row, but the stemming function then calls text.split(). Lists don't have .split(). That should raise AttributeError: 'list' object has no attribute 'split', yet the notebook shows df.head() with stemmed-looking strings afterward. That can't happen without changing/undoing the tokenization step, or rewriting the stemmer to handle lists.
- X_train_bow = cv.fit_transform(X_train['review']).toarray() with X_train_bow.shape == (40000, 146144) implies a dense array of roughly 5.8 billion entries, which on typical hardware would exhaust memory at the .toarray() step, before you ever fit Naive Bayes or RandomForest. The notebook instead reports memory errors only after the fit calls.
- from sklearn.naive_identity_matrix import GaussianNB is not a real scikit-learn module path. GaussianNB is in sklearn.naive_bayes, so the import should raise ModuleNotFoundError, yet the run continues, and a later cell imports GaussianNB correctly, which suggests copy/paste or generation rather than an actual clean run.
- rf.fit(X_train_bow, y_train) is followed by memory limit error, yet a later cell runs y_pred = rf.predict(X_test_bow) and gets an accuracy number. If rf.fit(...) failed, rf should not be fitted and rf.predict(...) should raise NotFittedError.
- One Out cell mixes the scikit-learn HTML-repr boilerplate with what looks like a fragment of a logistic-regression convergence warning (.../linear_model.html#logistic-regression and n_iter_i = _check_optimize_result), even though the cell only calls GaussianNB().fit(...). It looks like unrelated output got spliced in.
- In document_vector, the line filter out-of-vocabulary words is not commented. As written it's a syntax error.
- np is used (np.zeros, np.mean, np.array) but numpy is never imported (import numpy as np is missing). That should crash immediately.
- The final cell calls mnb.fit(X_train, y_train) where X_train is Word2Vec averaged vectors. MultinomialNB expects non-negative features (counts or similar). Word2Vec averages usually include negative values; scikit-learn commonly raises a ValueError for negative inputs.

Reasons it might not be a hallucination: