[ad_1]
Companies allowing their users to ask for their personal data make them comply with the aforementioned GDPR regulation. Nevertheless, there is a catch: the file format can make the data unreadable for most of the population. In this case, we got both html
and json
files. While html
can be read directly, json
files can be more difficult to interpret. I personally think that new regulations should also enforce a readable format of the data. But for the time being…
Let’s explore the files one by one to get the most out of this new feature!
The first file is chat.html
which contains my entire chat history with ChatGPT. Conversations are stored with their corresponding title. The user’s questions and ChatGPT’s answers are labeled as assistant
and user
, respectively.
If you have ever trained an AI model yourself, this labeling system will sound familiar to you.
Let’s observe a sample conversation from my history:
Have you ever seen the thumbs-up, thumbs-down icons (👍👎) next to any ChatGPT answer?
This information is seen by ChatGPT as the feedback for a given answer, which will then help in the chatbot training.
This information is stored in the message_feedback.json
file containing any feedback you provided to ChatGPT using the thumbs icons. Information is stored in the following format:
["message_id": <MESSAGE ID>, "conversation_id": <CONVERSATION ID>, "user_id": <USER ID>, "rating": "thumbsDown", "content": "\"tags\": [\"not-helpful\"]"]
The thumbsDown
rating accounts for wrongly-generated answers while the thumbsUp
accounts for the correctly-generated ones.
There is also a file (user.json
) containing the following personal data from the user:
"id": <USER ID>, "email": <USER EMAIL>, "chatgpt_plus_user": [true
Some platforms are known for creating a model of the user based on their usage of the platform. For example, if the Google searches of a user are mostly about programming, Google is likely to infer that the user is a programmer and use this information to show personalized advertisements.
ChatGPT could do the same with the information from the conversations, but they are currently obliged to include this inferred information in the exported data.
⚠️ FYI, One can access What Google knows about them from Gmail by clicking on Account >> Data & Privacy >> Personalized Ads >> My Ad Center.
There is another file containing the conversation history, and also including some metadata. This file is named conversations.json
and includes information such as the creation time, several identifiers, and the model behind ChatGPT, among others.
⚠️ The metadata provides information about the main data. It may include information such as the origin of the data, its meaning, its location, its ownership, and its creation. Metadata accounts for information related to the main data, but it is not part of it.
Let’s explore the same conversation about the A320 Hydraulic System Failure exposed in the first example in this json
format. The conversation itself consists of the following Q&A:
From this simple conversation, OpenAI keeps quite some information. Let’s review the stored information:
- The main fields of the
json
file contain the following information:
The field moderation_results
is empty since no feedback was provided to ChatGPT in this concrete case. In addition, the [+]
symbol in the mapping
field means that more information is available.
- In fact, the
mapping
field contains all the information about the conversation itself. Since the conversation has four interactions, the mapping stores onechildren
entry per interaction.
Again, the [+]
symbol indicates that more information is available. Let’s review the different entries!
mapping_id
: It contains anid
for the conversation as well as information about the creation time and the type of content, among others. As far as one can infer, it also creates aparent_id
for the conversation and achildren_id
that corresponds to the following interaction of the user with ChatGPT. Here is an example:
children_idX
: A newchildren
entry is created for each interaction either from the user or from the assistant. Since the conversation has four interactions, thejson
file displays fourchildren
entries. Eachchildren
entry has the following structure:
The first children
entry is nested within the conversation by having the mapping_id
as a parent and the second interaction — the answer from ChatGP — as a second child.
Children
that correspond to a ChatGPT answer contain additional fields. For example, for the second interaction:
In the case of a ChatGPT answer, we get information about the model behind ChatGPT and the stopping words. It also shows the first children
as it parent
and the third children
as the following interaction.
The full file can be found in this GitHub gist.
Have you ever used the “Regenerate response” button when you were not fully convinced by the response provided by ChatGPT?
This feedback information is also stored!
There is a last file named model_comparisons.json
that contains snippets of the conversations and the consecutive attempts anytime ChatGPT regenerated the response. The information contains only the text without the title but including some other metadata. Here is the basic structure of this file:
"id":"<id>",
"user_id":"<user_id>",
"input":[+],
"output":[+],
"metadata":[+],
"create_time": "<time>"
The metadata
field contains some important information such as the country and continent where the conversation took place, and information about the https
access schema, among others. The interesting part of this file comes in the input
/output
entries:
Input
The input
contains a collection of messages from the original conversation. Interactions are labeled depending on the author and, as in the previous cases, some additional information is also stored. Let’s observe the messages stored for our sample conversation:
User
/Assistant
entries are expected, but I am sure at this point we are all wondering why is there a system
label?
And moreover, why do they feed an initial statement like this at the beginning of each conversation?
Is ChatGPT pre-feed with the current date in any new conversation?
Yes, those entries are the so-called system messages.
System Messages
System messages give overall instructions to the assistant. They help to set the behavior of the assistant. In the web interface, system messages are transparent to the user, which is why we do not see them directly.
The benefit of the system message is that it allows the developer to tune the assistant without making the request itself part of the conversation. System messages can be fed by using the API. For example, if you are building a car sales assistant, one possible system message could be “You are a car sales assistant. Use a friendly tone and ask questions to the users until you understand their necessity. Then, explain the available cars that match their preferences”. You could even feed the list of vehicles, specifications, and prices so that the assistant can give this information too.
[ad_2]
Source link