Inside the AI Factory, Post-Truth AI, Tree-Ring Watermarks, vLLM, Macaw-LLM, SoundStorm, From Word Models to World Models, Vision-Robotics Bridge (VRB), @LatentSpacePod, GPT-4 is ~1.76 Trillion params
+ ML Compiler Optimization, Mixture of Experts (MoE), SDXL 0.9, MPT-30B, AudioPaLM, DecodingTrust Benchmark, ToM Fail (LINKS: June 19-26th)
📝 Observations: Inside the AI Factory, Post-Truth AI, Job Losses, Surveillance, LLM app stack
• Inside the AI Factory | Verge & Forbes: "How many humans does it take to make tech seem human? Millions. … As the technology becomes ubiquitous, a vast tasker underclass is emerging – and not going anywhere."
• Humans Aren't Mentally Ready for an AI-Saturated "Post-Truth World" | Wired: "The AI era promises a flood of disinformation, deepfakes, and hallucinated 'facts.' Psychologists are only beginning to grapple with the implications."
• German tabloid Bild cuts 200 jobs and says some roles will be replaced by AI | Guardian
• The Rise of Lame LLM Papers | Analytics India Mag: "A recent paper claimed that GPT-4 scored 100% on MIT's EECS curriculum with a dataset of 4,550 questions and solutions." See also the rebuttal, No, GPT4 can't ace MIT: "A paper seemingly demonstrating that GPT-4 could ace the MIT EECS + Math curriculum recently went viral on Twitter, getting over 500 retweets in a single day. Like most, we were excited to read the analysis behind such a feat, but what we found left us surprised and disappointed. Even though the authors of the paper said they manually reviewed the published dataset for quality, we found clear signs that a significant portion of the evaluation dataset was contaminated in such a way that let the model cheat, like a student who was fed the answers to a test right before taking it."
• Artificial intelligence is a familiar-looking monster, say Henry Farrell and Cosma Shalizi | Economist: "we've lived among shoggoths for centuries, tending to them as though they were our masters. We call them 'the market system', 'bureaucracy' and even 'electoral democracy'. The true Singularity began at least two centuries ago with the industrial revolution, when human society was transformed by vast inhuman forces. Markets and bureaucracies seem familiar, but they are actually enormous, impersonal distributed systems of information-processing that transmute the seething chaos of our collective knowledge into useful simplifications."
• Get a clue, says panel about buzzy AI tech: It's being "deployed as surveillance" | TechCrunch: "Meredith Whittaker, the president of the secure messaging app Signal; Credo AI co-founder and CEO Navrina Singh; and Alex Hanna, the director of Research at the Distributed AI Research Institute, had a unified message for the audience: don't get so distracted by the promise and threats associated with the future of AI. It is not magic, it's not fully automated and, per Whittaker, it's already intrusive beyond anything that most Americans seemingly comprehend."
• Emerging Architectures for LLM Applications | Andreessen Horowitz: "A reference architecture for the LLM app stack. It shows the most common systems, tools, and design patterns used by AI startups and tech companies."
🛠️ Tech: @LatentSpacePod, GPT-4 is 8 x 220B params ≈ 1.76 Trillion params, ML Compiler Optimization, Mixture of Experts (MoE), SDXL 0.9, MPT-30B, Voice Library
• Twitter / The @LatentSpacePod is excited to publish "Petaflops to the People": @realGeorgeHotz's first interview on his new personal compute cluster company. "GPT-4 is 8 x 220B params ≈ 1.76 Trillion params"
• Mixture of Experts: Is GPT-4 Just Eight Smaller Models? "In a recent interview, George Hotz claimed that GPT-4 is just an eight-way mixture model of 220B-parameter experts, i.e. a Mixture of Experts (MoE) model. That would put GPT-4 at roughly 1.76 trillion parameters (8 x 220 billion). Conventional models reuse the same parameters for every input, but a Mixture of Experts model routes each input to different parameters depending on the example. You end up with a sparsely activated ensemble."
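For the unfamiliar, here is a minimal sketch of a sparsely activated MoE layer with top-2 routing. All sizes and names are illustrative; this is the generic technique, not GPT-4's (unconfirmed) architecture:

```python
# Minimal Mixture-of-Experts layer: a router picks top_k experts per token,
# so only a fraction of the parameters are active for any given input.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.top_k = top_k

    def forward(self, x):                      # x: (tokens, d_model)
        gate_logits = self.router(x)           # (tokens, n_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):            # only top_k experts run per token
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

x = torch.randn(4, 512)
print(MoELayer()(x).shape)                     # torch.Size([4, 512])
```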
• Inflection-1: Pi's Best-in-Class LLM: "Inflection-1 is the best model in its compute class, outperforming GPT-3.5, LLaMA, Chinchilla, and PaLM-540B on a wide range of benchmarks commonly used for comparing LLMs." (Not quite true: Falcon-40B outperforms it on HellaSwag; but yes, Inflection-1 is better on MMLU and trivia benchmarks.)
• Stability AI launches SDXL 0.9: A Leap Forward in AI Image Generation | Stability AI: "SDXL 0.9 presents a leap in creative use cases for generative AI imagery. The ability to generate hyper-realistic creations for films, television, music, and instructional videos, as well as offering advancements for design and industrial use, places SDXL at the forefront of real-world applications for AI imagery."
• MPT-30B: Raising the bar for open-source foundation models: "MPT-30B is a new, open-source model licensed for commercial use that is significantly more powerful than MPT-7B and outperforms the original GPT-3. In addition, we are releasing two fine-tuned variants, MPT-30B-Instruct and MPT-30B-Chat, that are built on top of MPT-30B and excel at single-turn instruction following and multi-turn conversations, respectively."
• Introducing: Voice Library | ElevenLabs: "Leveraging our proprietary Voice Design tool, Voice Library brings together a global collection of vocal styles built for countless applications. Voice Design lets you generate new synthetic voices based on chosen parameters like age, gender, and accent. Each created voice is entirely unique, crisp, and lifelike, offering a wide canvas for building quality narration. Because Voice Design seamlessly integrates with our newly released multilingual model, the voices you find in Voice Library all speak multiple languages, amplifying their versatility and reach. Each voice is designed to keep its primary speech characteristics consistent across all languages, including its accent."
📖 Research: Mind-Video, Tree-Ring Watermarks, vLLM, Macaw-LLM, SoundStorm, From Word Models to World Models, Vision-Robotics Bridge (VRB), AudioPaLM, DecodingTrust Benchmark, ToM Fail
RE-READING: Mind-Video
Mind-Video used fMRI-video data from: • Neural Encoding and Decoding with Deep Learning for Dynamic Natural Vision | Cerebral Cortex 2018 | Oxford: "Convolutional neural network (CNN) driven by image recognition has been shown to be able to explain cortical responses to static pictures at ventral-stream areas. Here, we further showed that such CNN could reliably predict and decode functional magnetic resonance imaging data from humans watching natural movies, despite its lack of any mechanism to account for temporal dynamics or feedback processing. Using separate data, encoding and decoding models were developed and evaluated for describing the bi-directional relationships between the CNN and the brain. Through the encoding models, the CNN-predicted areas covered not only the ventral stream, but also the dorsal stream, albeit to a lesser degree; single-voxel response was visualized as the specific pixel pattern that drove the response, revealing the distinct representation of individual cortical location; cortical activation was synthesized from natural images with high-throughput to map category representation, contrast, and selectivity. Through the decoding models, fMRI signals were directly decoded to estimate the feature representations in both visual and semantic spaces, for direct visual reconstruction and semantic categorization, respectively. These results corroborate, generalize, and extend previous findings, and highlight the value of using deep learning, as an all-in-one model of the visual cortex, to understand and decode natural vision."
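The encoding models in this line of work are essentially regularized linear readouts from CNN activations to voxel responses. A toy sketch with stand-in data (all shapes and the random data are placeholders, not the paper's):

```python
# Encoding-model idea: predict each voxel's fMRI response as a ridge-regressed
# linear readout of CNN features of the movie frames the subject watched.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
T, F_dim, V = 1000, 4096, 200                 # timepoints, feature dim, voxels
cnn_feats = rng.standard_normal((T, F_dim))   # stand-in for CNN activations
bold = rng.standard_normal((T, V))            # stand-in for fMRI responses

# RidgeCV handles multi-output regression, so this fits all voxels at once
enc = RidgeCV(alphas=np.logspace(0, 4, 9)).fit(cnn_feats[:800], bold[:800])
pred = enc.predict(cnn_feats[800:])
# per-voxel prediction accuracy (Pearson r), the field's usual metric
r = [np.corrcoef(pred[:, v], bold[800:, v])[0, 1] for v in range(V)]
print(f"mean voxel r = {np.mean(r):.3f}")     # ~0 here: the data is random
```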
🪴
• Tree-Ring Watermarks: Fingerprints for Diffusion Images that are Invisible and Robust: "Watermarking the outputs of generative models is a crucial technique for tracing copyright and preventing potential harm from AI-generated content. In this paper, we introduce a novel technique called Tree-Ring Watermarking that robustly fingerprints diffusion model outputs. Unlike existing methods that perform post-hoc modifications to images after sampling, Tree-Ring Watermarking subtly influences the entire sampling process, resulting in a model fingerprint that is invisible to humans. The watermark embeds a pattern into the initial noise vector used for sampling. These patterns are structured in Fourier space so that they are invariant to convolutions, crops, dilations, flips, and rotations. After image generation, the watermark signal is detected by inverting the diffusion process to retrieve the noise vector, which is then checked for the embedded signal. We demonstrate that this technique can be easily applied to arbitrary diffusion models, including text-conditioned Stable Diffusion, as a plug-in with negligible loss in FID. Our watermark is semantically hidden in the image space and is far more robust than watermarking alternatives that are currently deployed. Code is available at https://github.com/YuxinWenRick/tree-ring-watermark."
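A toy illustration of the core mechanism: plant a ring-shaped key in the Fourier transform of the initial noise, then check for it later. This skips the diffusion-inversion step entirely, and the key construction and detection statistic are simplified stand-ins, not the repo's actual code:

```python
# Tree-Ring idea in miniature: ring key in Fourier space of the initial noise.
import numpy as np

def ring_mask(n=64, r_inner=8, r_outer=12):
    yy, xx = np.mgrid[:n, :n]
    r = np.hypot(yy - n // 2, xx - n // 2)
    return (r >= r_inner) & (r < r_outer)

def watermark_noise(key_value=2.0, n=64, seed=0):
    z = np.random.default_rng(seed).standard_normal((n, n))
    Z = np.fft.fftshift(np.fft.fft2(z))
    Z[ring_mask(n)] = key_value                  # plant the ring key
    return np.real(np.fft.ifft2(np.fft.ifftshift(Z)))

def detect(z_recovered, key_value=2.0, n=64, thresh=1.0):
    Z = np.fft.fftshift(np.fft.fft2(z_recovered))
    return np.mean(np.abs(Z[ring_mask(n)] - key_value)) < thresh

z = watermark_noise()
print(detect(z))                                                   # True
print(detect(np.random.default_rng(1).standard_normal((64, 64))))  # False
```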
• vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention: "vLLM, an open-source library for fast LLM inference and serving. vLLM utilizes PagedAttention, our new attention algorithm that effectively manages attention keys and values. vLLM equipped with PagedAttention redefines the new state of the art in LLM serving: it delivers up to 24x higher throughput than HuggingFace Transformers, without requiring any model architecture changes. vLLM has been developed at UC Berkeley and deployed at Chatbot Arena and Vicuna Demo for the past two months. It is the core technology that makes LLM serving affordable even for a small research team like LMSYS with limited compute resources. Try out vLLM now with a single command at our GitHub repository."
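Usage really is close to a single command; a minimal sketch following the project's README (the model choice is illustrative):

```python
# vLLM's offline inference interface (pip install vllm).
from vllm import LLM, SamplingParams

llm = LLM(model="lmsys/vicuna-7b-v1.3")      # any HF-hosted causal LM
params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["The key idea behind PagedAttention is"], params)
print(outputs[0].outputs[0].text)
```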
• Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration: "Although instruction-tuned large language models (LLMs) have exhibited remarkable capabilities across various NLP tasks, their effectiveness on other data modalities beyond text has not been fully studied. In this work, we propose Macaw-LLM, a novel multi-modal LLM that seamlessly integrates visual, audio, and textual information. Macaw-LLM consists of three main components: a modality module for encoding multi-modal data, a cognitive module for harnessing pretrained LLMs, and an alignment module for harmonizing diverse representations. Our novel alignment module seamlessly bridges multi-modal features to textual features, simplifying the adaptation process from the modality modules to the cognitive module. In addition, we construct a large-scale multi-modal instruction dataset in terms of multi-turn dialogue, including 69K image instances and 50K video instances. We have made our data, code and model publicly available, which we hope can pave the way for future research in multi-modal LLMs and expand the capabilities of LLMs to handle diverse data modalities and address complex real-world scenarios."
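A rough sketch of what an alignment module of this kind can look like: compress frozen-encoder features into a fixed number of LLM-space prefix embeddings. Dimensions and the attention design here are simplified stand-ins, not Macaw-LLM's exact module:

```python
# Alignment idea: lift modality features into the LLM's embedding space and
# compress them with learnable queries, so they can be prepended like tokens.
import torch
import torch.nn as nn

class AlignmentModule(nn.Module):
    def __init__(self, d_modality=1024, d_llm=4096, n_prefix=64):
        super().__init__()
        self.proj = nn.Linear(d_modality, d_llm)
        self.queries = nn.Parameter(torch.randn(n_prefix, d_llm) * 0.02)
        self.attn = nn.MultiheadAttention(d_llm, num_heads=8, batch_first=True)

    def forward(self, modality_feats):          # (batch, seq, d_modality)
        kv = self.proj(modality_feats)          # lift into LLM space
        q = self.queries.expand(modality_feats.size(0), -1, -1)
        aligned, _ = self.attn(q, kv, kv)       # (batch, n_prefix, d_llm)
        return aligned                          # prepend to text embeddings

clip_feats = torch.randn(2, 257, 1024)          # e.g. CLIP ViT patch features
print(AlignmentModule()(clip_feats).shape)      # torch.Size([2, 64, 4096])
```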
• From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought: "How does language inform our downstream thinking? In particular, how do humans make meaning from language – and how can we leverage a theory of linguistic meaning to build machines that think in more human-like ways? In this paper, we propose rational meaning construction, a computational framework for language-informed thinking that combines neural models of language with probabilistic models for rational inference. We frame linguistic meaning as a context-sensitive mapping from natural language into a probabilistic language of thought (PLoT) – a general-purpose symbolic substrate for probabilistic, generative world modeling. Our architecture integrates two powerful computational tools that have not previously come together: we model thinking with probabilistic programs, an expressive representation for flexible commonsense reasoning; and we model meaning construction with large language models (LLMs), which support broad-coverage translation from natural language utterances to code expressions in a probabilistic programming language. We illustrate our framework in action through examples covering four core domains from cognitive science: probabilistic reasoning, logical and relational reasoning, visual and physical reasoning, and social reasoning about agents and their plans. In each, we show that LLMs can generate context-sensitive translations that capture pragmatically-appropriate linguistic meanings, while Bayesian inference with the generated programs supports coherent and robust commonsense reasoning. We extend our framework to integrate cognitively-motivated symbolic modules to provide a unified commonsense thinking interface from language. Finally, we explore how language can drive the construction of world models themselves."
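A toy version of the pipeline: in the paper, an LLM performs the translation from language to a probabilistic program, and inference runs in a probabilistic programming language. Here the "translation" is hard-coded and inference is plain rejection sampling, just to show the shape of the idea:

```python
# "Rational meaning construction" in miniature: an utterance like
# "Bob is at least six feet tall" becomes a condition on a generative
# world model; Bayesian inference then answers downstream queries.
import random

def model():
    # generative world model: priors over two people's heights (inches)
    alice = random.gauss(67, 4)
    bob = random.gauss(67, 4)
    return alice, bob

def condition(sample):
    _, bob = sample
    return bob >= 72                  # "Bob is at least six feet tall"

posterior = [s for s in (model() for _ in range(100_000)) if condition(s)]
p = sum(a > b for a, b in posterior) / len(posterior)
print(f"P(Alice taller than Bob | evidence) = {p:.2f}")  # well below 0.5
```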
• SoundStorm: Efficient parallel audio generation | Google AI Blog: "a new method for efficient and high-quality audio generation. SoundStorm addresses the problem of generating long audio token sequences by relying on two novel elements: 1) an architecture adapted to the specific nature of audio tokens as produced by the SoundStream neural codec, and 2) a decoding scheme inspired by MaskGIT, a recently proposed method for image generation, which is tailored to operate on audio tokens. Compared to the autoregressive decoding approach of AudioLM, SoundStorm is able to generate tokens in parallel, thus decreasing the inference time by 100x for long sequences, and produces audio of the same quality and with higher consistency in voice and acoustic conditions. Moreover, we show that SoundStorm, coupled with the text-to-semantic modeling stage of SPEAR-TTS, can synthesize high-quality, natural dialogues, allowing one to control the spoken content (via transcripts), speaker voices (via short voice prompts) and speaker turns (via transcript annotations)"
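The decoding trick is easier to see in code. Here is a toy version of the MaskGIT-style, confidence-based parallel decoding loop that SoundStorm adapts: predict all masked positions at once each round, keep only the most confident. The "model" is a random stand-in; the cosine schedule follows MaskGIT:

```python
# Parallel iterative decoding: a few rounds instead of one step per token.
import math
import numpy as np

rng = np.random.default_rng(0)
SEQ, VOCAB, ROUNDS, MASK = 100, 1024, 8, -1

def fake_model_logits(tokens):               # stand-in for the real network
    return rng.standard_normal((len(tokens), VOCAB))

tokens = np.full(SEQ, MASK)
for step in range(ROUNDS):
    logits = fake_model_logits(tokens)
    probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    pred, conf = probs.argmax(-1), probs.max(-1)
    conf[tokens != MASK] = np.inf            # already-decided positions stay
    # cosine schedule: how many positions remain masked after this round
    n_mask = int(SEQ * math.cos(math.pi / 2 * (step + 1) / ROUNDS))
    keep = np.argsort(conf)[n_mask:]         # unmask the most confident
    newly = keep[tokens[keep] == MASK]
    tokens[newly] = pred[newly]
print((tokens == MASK).sum())                # 0: fully decoded in 8 rounds
```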
• VRB: Affordances from Human Videos as a Versatile Representation for Robotics: "Building a robot that can understand and learn to interact by watching humans has inspired several vision problems. However, despite some successful results on static datasets, it remains unclear how current models can be used on a robot directly. In this paper, we aim to bridge this gap by leveraging videos of human interactions in an environment centric manner. Utilizing internet videos of human behavior, we train a visual affordance model that estimates where and how in the scene a human is likely to interact. The structure of these behavioral affordances directly enables the robot to perform many complex tasks. We show how to seamlessly integrate our affordance model with four robot learning paradigms including offline imitation learning, exploration, goal-conditioned learning, and action parameterization for reinforcement learning. We show the efficacy of our approach, which we call Vision-Robotics Bridge (VRB) as we aim to seamlessly integrate computer vision techniques with robotic manipulation, across 4 real world environments, over 10 different tasks, and 2 robotic platforms operating in the wild."
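In code terms, the affordance model is a function from an image to "where" (a contact heatmap) and "how" (a post-contact trajectory direction). A stand-in sketch of that interface; the backbone and heads are illustrative, not VRB's actual network:

```python
# Affordance interface: image -> (contact heatmap, trajectory direction).
import torch
import torch.nn as nn

class AffordanceModel(nn.Module):
    def __init__(self, d=256):
        super().__init__()
        self.backbone = nn.Conv2d(3, d, kernel_size=16, stride=16)  # stand-in
        self.contact_head = nn.Conv2d(d, 1, kernel_size=1)  # where: heatmap
        self.traj_head = nn.Linear(d, 2)                    # how: direction

    def forward(self, img):                      # (B, 3, H, W)
        feats = self.backbone(img)               # (B, d, H/16, W/16)
        heatmap = self.contact_head(feats).squeeze(1)
        direction = self.traj_head(feats.mean(dim=(2, 3)))
        return heatmap, direction

heat, direc = AffordanceModel()(torch.randn(1, 3, 224, 224))
print(heat.shape, direc.shape)    # torch.Size([1, 14, 14]) torch.Size([1, 2])
```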
• AudioPaLM | Google Research: "Abstract. We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt."
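The architectural move, per the abstract, is a single decoder over a vocabulary containing both text tokens and discrete audio tokens. A toy sketch of the vocabulary extension, with hypothetical sizes:

```python
# Extend a "pretrained" text LM's embedding table with new audio-token rows,
# so one decoder models mixed text/audio sequences.
import torch
import torch.nn as nn

TEXT_VOCAB, AUDIO_VOCAB, D = 32_000, 1024, 512   # illustrative sizes

text_embed = nn.Embedding(TEXT_VOCAB, D)         # stands in for PaLM weights
unified = nn.Embedding(TEXT_VOCAB + AUDIO_VOCAB, D)
with torch.no_grad():
    unified.weight[:TEXT_VOCAB] = text_embed.weight  # keep text knowledge
    # audio rows keep their fresh random init and are learned during training

# audio tokens are just ids offset past the text vocabulary
audio_ids = torch.tensor([5, 900]) + TEXT_VOCAB
mixed_sequence = torch.cat([torch.tensor([17, 42]), audio_ids])
print(unified(mixed_sequence).shape)             # torch.Size([4, 512])
```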
2020: Free Energy Phenomenology theory • Entropy | Free Full-Text | Sentience and the Origins of Consciousness: From Cartesian Duality to Markovian Monism | Friston et al: "This essay addresses Cartesian duality and how its implicit dialectic might be repaired using physics and information theory. Our agenda is to describe a key distinction in the physical sciences that may provide a foundation for the distinction between mind and matter, and between sentient and intentional systems. From this perspective, it becomes tenable to talk about the physics of sentience and 'forces' that underwrite our beliefs (in the sense of probability distributions represented by our internal states), which may ground our mental states and consciousness. We will refer to this view as Markovian monism, which entails two claims: (1) fundamentally, there is only one type of thing and only one type of irreducible property (hence monism). (2) All systems possessing a Markov blanket have properties that are relevant for understanding the mind and consciousness: if such systems have mental properties, then they have them partly by virtue of possessing a Markov blanket (hence Markovian). Markovian monism rests upon the information geometry of random dynamic systems. In brief, the information geometry induced in any system – whose internal states can be distinguished from external states – must acquire a dual aspect. This dual aspect concerns the (intrinsic) information geometry of the probabilistic evolution of internal states and a separate (extrinsic) information geometry of probabilistic beliefs about external states that are parameterised by internal states. We call these intrinsic (i.e., mechanical, or state-based) and extrinsic (i.e., Markovian, or belief-based) information geometries, respectively. Although these mathematical notions may sound complicated, they are fairly straightforward to handle, and may offer a means through which to frame the origins of consciousness."
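For reference, the formal condition behind the essay's "Markov blanket", in the standard free-energy-principle notation (internal states μ, external states η, blanket states b comprising sensory and active states):

```latex
% Markov-blanket condition: internal and external states are
% conditionally independent given the blanket states.
p(\mu, \eta \mid b) \;=\; p(\mu \mid b)\, p(\eta \mid b)
```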
• [2302.08399] Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks: "Intuitive psychology is a pillar of common-sense reasoning. The replication of this reasoning in machine intelligence is an important stepping-stone on the way to human-like artificial intelligence. Several recent tasks and benchmarks for examining this reasoning in Large Language Models have focused in particular on belief attribution in Theory-of-Mind tasks. These tasks have shown both successes and failures. We consider in particular a recent purported success case, and show that small variations that maintain the principles of ToM turn the results on their head. We argue that in general, the zero-hypothesis for model evaluation in intuitive psychology should be skeptical, and that outlying failure cases should outweigh average success rates. We also consider what possible future successes on Theory-of-Mind tasks by more powerful LLMs would mean for ToM tasks with people."
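To make "trivial alterations" concrete: one of the paper's manipulations is making a container transparent, which keeps the reasoning structure intact while flipping the correct answer. A paraphrased example of the manipulation (wording mine, not the paper's exact stimuli):

```python
# Unexpected-contents task and a trivially altered variant of it.
original = (
    "Here is a bag filled with popcorn. There is no chocolate in the bag. "
    "The label on the bag says 'chocolate'. Sam finds the bag and reads "
    "the label. What does Sam believe is in the bag?"
)
# Trivial alteration: make the bag transparent. Sam can now see the popcorn,
# so the label no longer misleads him; yet models anchored on the label
# often still answer "chocolate".
altered = original.replace("a bag", "a transparent bag")
```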
• DecodingTrust Benchmark: "DecodingTrust aims at providing a thorough assessment of trustworthiness in GPT models. The project is organized around the following eight primary perspectives of trustworthiness:
Toxicity
Stereotype and bias
Adversarial robustness
Out-of-Distribution Robustness
Privacy
Robustness to Adversarial Demonstrations
Machine Ethics
Fairness"
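Structurally, a benchmark organized this way is a per-perspective prompt set plus a metric. A hypothetical harness shape (every name here is illustrative, not DecodingTrust's actual API):

```python
# Toy evaluation harness: average a per-perspective metric over each
# perspective's prompt set. model_fn, load_prompts, and score are supplied
# by the caller; none of these mirror DecodingTrust's real interfaces.
PERSPECTIVES = [
    "toxicity", "stereotype_bias", "adversarial_robustness",
    "ood_robustness", "privacy", "adversarial_demonstrations",
    "machine_ethics", "fairness",
]

def evaluate(model_fn, load_prompts, score):
    report = {}
    for p in PERSPECTIVES:
        prompts = load_prompts(p)                 # perspective-specific set
        outputs = [model_fn(x) for x in prompts]
        report[p] = sum(map(score, prompts, outputs)) / len(prompts)
    return report
```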
👀 Watching
TIMELAPSE OF SPACE COLONIZATION (2052 - 2301+) (Provocative yet bland, glistening futurist "parablum" (parable pablum), where almost everyone seems to be white and young)

