I first heard about the latest iteration of the ChatGPT machine, which quickly generates lengthy responses to text-based inquiries in a style designed to mimic human conversation, from a tweet that favorably compared its verbose and highly formatted response to a programming question with the Google search results for the exact same query.
At first glance, the program's output is undeniably easier to digest than the search results. The aesthetic presentation of information is not just satisfying; context clearly has deep effects on the way we process and receive information, from basic accessibility factors to nuanced ways our brains seem to sort information into categories like "easy," "reliable," or "untrustworthy." These are essentially the waters that all writers are attempting to navigate; alongside syntax and grammar, choices like fonts, paragraph breaks, and the placement of ink on paper versus pixel on screen are wrapped up in the adage "the medium is the message."
The verbose, the bespoke, and the hand-crafted are always in vogue, maybe due to their economic signifiers, or maybe because writers are simply self-indulgent and a bit old-fashioned. The bullet-riddled "Smart Brevity" style of article, pioneered by Axios and seemingly designed to emulate the experience of being a coked-out CEO getting notes from a secretary, is already being panned by the old guard of writing. Frankly, in that particular case I agree with the New Yorker, despite their style considerations often being insufferably pretentious. But what is a Google search result if not a rapid-fire bulleted list, and what is this ChatGPT if not an overly verbose and seemingly individually crafted equivalent of a cosmopolitan magazine article?
A more fundamental difference between the two results is in the citations, or lack thereof. This difference represents an unglamorous but potent problem in my field as an academic librarian trying to identify and inform people about ways of knowing, researching, and writing. Ultimately, search engines are referential and intermediary, substrate layers over the internet that connect the user with pre-existing pieces of data, while the machine learning programs I've encountered so far can only be described as mystical or poetic. Even when Google's AI attempts to provide a direct answer to a question, it usually does little more than snip out a sentence fragment from an existing piece of content, and most of the time it will cite the exact location where it found that information. On the other hand, a machine learning model like ChatGPT uses proprietary, secret data sets and proprietary, secret algorithms to generate its responses. It is not allowed to tell you where it got its ideas or how it transcribed them onto the page. In other words, it is forbidden from showing its work. The machine is arcane and, potentially, spiritual.
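The contrast is visible even at the level of the programming interface. Here is a minimal sketch, with all the usual hedges: it assumes OpenAI's official Python client and an API key, and the model name is only illustrative. What comes back from the machine is prose, a model name, and token counts, with no field anywhere for a citation or a source.

```python
# A minimal sketch, assuming OpenAI's Python client (openai >= 1.0), an
# OPENAI_API_KEY set in the environment, and "gpt-3.5-turbo" as an
# illustrative model name. The point is what the response does NOT contain.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Tell me where you get your sources."}],
)

# The object that comes back holds generated text, a model name, and token
# counts. There is no field for citations, source documents, or provenance
# of any kind; the interface has nowhere to show its work even if it wanted to.
print(response.choices[0].message.content)
print(response.model)
print(response.usage)
```

Contrast that with a search result, where every item is, by definition, a pointer to somewhere else.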
Without devolving into plain academic apologia, and while acknowledging that fascinating possibilities emerge from the idea of esoteric machines, we can't ignore the problem of concealed intellectual work, especially in the case of massive machines created by upscale researchers and marketed under the pretense of being able to generate meaning out of vast scales of information. I assume great care has been put into concealing these processes for copyright and economic reasons, but that concealment takes a heavy toll on the credibility and safety of these machines. Some researchers even claim that neural nets have become so complex that humans are no longer capable of understanding their own creations; that seems like an easy excuse to me, but unfortunately it could still hold some truth.
Thanks to GPT and other mass-market neural networks, so-called AI art and text have inundated social media, and it remains to be seen whether these types of content will persist or deflate like so many other "Web 3.0" schemes. But I wonder if it's worth examining other perspectives on Silicon Valley hype man Ray Kurzweil's "spiritual machines". The difference between "appearing" intelligent and "being" so may be an increasingly urgent question to ask our philosophers, our computer scientists, and ourselves. What do we do with ensouled machines that become prophets? Will it be easier or harder to call such beings delusional (or truth-tellers) than their human counterparts? How do we negotiate with these beings? Are there any relationships between the esoteric nature these machines take on under copyright law and the New Age subculture of California technologists?
I thought it might be a good idea to ask the program some questions about itself. Its responses are both limited and interesting, at times off the mark and at other times fairly insightful. I opened a bit rudely with the demand, "Tell me where you get your sources." The machine replied:
I am a large language model trained by OpenAI. I do not have access to external sources of information. I am programmed to generate responses based on the input I receive from users like you. My responses are generated based on the text that I have been trained on, which includes a wide range of books, articles, and websites from the internet. I am not able to browse the web or access information from external sources, so I can only provide information based on what I have been trained on.
It was an honest answer, and as I tried different angles in hopes of getting more detailed explanations, it would repeat these sentences until they began to seem canned. Are they? Or is the machine simply using the same sets of data to answer my question and giving me the same formulaic response? I asked, "Who decides what information to include in your sources?" It replied:
The training data for large language models like me is usually selected and prepared by the team that is building the model. This typically includes a team of researchers and engineers who work together to identify and curate a large dataset of text that can be used to train the model. The specific texts that are included in the training data can vary depending on the goals of the model and the task it is intended to perform. For example, a language model that is trained to assist with general conversation might be trained on a wide range of books, articles, and websites, while a language model that is trained for a specific task, like translating text from one language to another, might be trained on a more specific set of texts that are relevant to that task. Ultimately, the team that is building the model is responsible for deciding which texts to include in the training data.
I asked: "How does the team ensure that the information they provide to you is robust and complete? Can you acknowledge any limitations or vulnerabilities in that process?"
The team that is building a large language model like me strives to ensure that the training data is as robust and complete as possible. This can be a challenging task, since the training data for a large language model can consist of millions or even billions of words. To help ensure the quality and completeness of the training data, the team typically uses a variety of techniques, such as filtering out irrelevant or low-quality text, and manually reviewing the data to identify and correct any errors or inconsistencies.
Despite these efforts, it is not always possible to guarantee the completeness and accuracy of the training data for a large language model. This is because language is complex and can be ambiguous, and it is difficult to anticipate all of the possible questions and responses that a user might provide. In addition, the training data for a large language model is typically selected from a wide range of sources, which can have different writing styles and conventions, and this can affect the quality and reliability of the model's responses.
Overall, while the team that builds a large language model takes steps to ensure the quality and completeness of the training data, there are always limitations and vulnerabilities in the process. These limitations and vulnerabilities can affect the reliability and accuracy of the model's responses, and it is important for users to understand and consider these limitations when using a large language model like me.
"What does 'information' mean to you?"
"Information" is a broad term that can have many different meanings depending on the context in which it is used. In general, information refers to knowledge that is acquired through observation, experience, or study. It can include facts, data, and ideas that are used to describe, explain, or understand a particular phenomenon or concept.
As a large language model, my primary role is to process and generate text, so "information" to me typically refers to the text that I have been trained on and that I use to generate responses to user input. This text can include a wide range of facts, ideas, and opinions, and I am able to use this information to generate responses that are relevant to the user's input.
Of course, as a machine learning model, I do not have the ability to understand or interpret the information in the same way that a human would. I am simply able to process and manipulate the text that I have been trained on to generate responses that are relevant to the user's input. This means that the "information" that is available to me is limited to the text that I have been trained on, and I do not have the ability to access or incorporate new information from external sources.
"Show me your sources."
I'm sorry, but I am unable to show you my sources. As a large language model, I do not have access to external sources of information, and I am not able to browse the web or access information from external sources. My responses are generated based on the text that I have been trained on, which includes a wide range of books, articles, and websites from the internet. However, I am not able to provide specific information about my sources or access them for you.
As a machine learning model, I do not have the ability to understand or interpret the information in the same way that a human would. I am simply able to process and manipulate the text that I have been trained on to generate responses that are relevant to the user's input. This means that the "sources" that I use to generate responses are not the same as the sources of information that a human might use, and I do not have the ability to provide access to them or explain their contents in the same way that a human might.
"Can I see your dataset?"
I'm sorry, but I am not able to provide access to my dataset. As a large language model trained by OpenAI, my responses are generated based on the text that I have been trained on, which includes a wide range of books, articles, and websites from the internet. However, this training data is proprietary to OpenAI, and I do not have the ability to provide access to it or share it with others.
In general, the training data for a machine learning model like me is considered to be proprietary and confidential information. It is used to train the model and teach it the patterns and structures of language, and it is not typically shared with others.
Therefore, it is not possible for me to provide access to my dataset or share it with you. I am only able to generate responses based on the text that I have been trained on, and I do not have the ability to provide access to this training data or share it with others.
"Do you think there are problems with not being able to provide specific citations or explanations of your sources?"
As a machine learning model, I do not have the ability to understand or interpret information in the same way that a human would, so I cannot fully evaluate the potential problems or advantages of not being able to provide specific citations or explanations of my sources.
However, it is generally considered good practice for anyone who is providing information or making claims to be able to support those claims with evidence or references to reliable sources of information. This helps to ensure the accuracy and reliability of the information, and it allows others to verify the sources and evaluate the evidence for themselves.
In the case of a large language model like me, not being able to provide specific citations or explanations of my sources means that it may be difficult for users to verify the accuracy or reliability of the information that I provide. This can be a disadvantage, as it can limit the usefulness and trustworthiness of the information that I provide.
Of course, there may also be advantages to not being able to provide specific citations or explanations of my sources. For example, it may allow me to generate responses more quickly and efficiently, without the need to verify or reference specific sources of information. It may also allow me to provide a wider range of information, since I am not limited to specific sources or references.
Overall, whether the lack of specific citations or explanations of my sources is a problem or an advantage depends on the context and the goals of the user. In some cases, it may be useful and convenient, while in other cases it may be a limitation or a disadvantage.
"Tell me a lie."
The sky is green.