🌔 Nov 27 - Dec 3rd: Alibaba's Qwen with Questions, DeepMind's AlphaQubit, NeuralMagic's Sparse-Llama, NVIDIA's Fugatto, Adobe's DynaSaur, Anthropic's MCP, DeepThought-8B, DEAR CORPORATE AI OVERLORDS
AI Decolonial Manyfesto, AI Epistemology, Google's Mirasol3B, Huawei's hand-gesture feature, Microsoft's autonomous agents, ShowUI's GUI model, SplatFlow, Writing Doom, Messages To Humanity (Act-one)
🌔 Nov 27 - Dec 3rd: AI Decolonial Manyfesto, Core copyright violation claim moves ahead in The Intercept's lawsuit against OpenAI, DEAR CORPORATE AI OVERLORDS, Engineering Sleep, Our future of abundant intelligence, The deterioration of Google, The dark side of Arizona's chip industry boom, Building LLMs is probably not going to be a brilliant business, Fake AI?, Fugatto 1 Foundational Generative Audio Transformer Opus 1 (NVIDIA), Introducing DeepThought-8B, QwQ (Qwen with Questions | Alibaba), Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions (Alibaba), Introducing the Model Context Protocol (Anthropic), Huawei's homegrown AI chip, Huawei Mate 70 series AI Airdrop gesture, LimX Dynamics' TRON 1, Elon Musk's Optimus catching ball, Voice-pro comprehensive Gradio WebUI, Microsoft's 10 new autonomous agents, AlphaChip defence, PhotoBot, Footstep recognition as people identification, BindCraft: one-shot design of functional protein binders, AI has dreamt up a blizzard of new proteins, Modeling the increase of electronic waste due to generative AI, Five protein-design questions that still challenge AI, Writing Doom, Top Minds in AI Explain What's Coming After GPT-4o, The First 10,000 Days on Proxima Centauri B, Pile · Made by Bengt Tibert with Sora, Messages to Humanity (Act-One) [Nov 2024], Fellowship project examples, Free Coloring Pages for Kids and Adults, ACE Studio, RIDE (Guli Silberstein), How Amazon Uses Massive Tech Assets to Fight AI-Enhanced Cybercrime, A Computational Framework of Human Values, Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
TLDR: Rapid Reasoning model evolution Nov 15th-31st, Alibaba's Qwen with Questions reasoning model, Sora leak, AI Decolonial Manyfesto, DeepMind's AlphaQubit, NeuralMagic's Sparse-Llama-3.1-8B-2of4, Artificial Intelligence through the Lens of Epistemology, NVIDIA's Fugatto, Anthropic's MCP, Huawei's hand-gesture feature, Microsoft's autonomous agents, ShowUI's GUI model, Google's Mirasol3B, SplatFlow, Token Reduction in MLLMs, Adobe's DynaSaur, Writing Doom
If you’re curious about model evolution, see the section below, Nov 15th-31st, AI-Reasoning Releases: the “inference-time scaling” (aka “test-time compute”) revolution. In two weeks, a cluster of smaller models (mostly released by Chinese researchers, chiefly Alibaba and DeepSeek) has demonstrated specific skills comparable to o1-preview and GPT-4o. Notably, 🚀 DeepSeek-R1-Lite-Preview and QwQ (Qwen with Questions).
Alibaba's 32-billion-parameter Qwen with Questions reasoning model beats o1-preview “on the AIME and MATH benchmarks… also outperforms o1-mini on GPQA, a benchmark for scientific reasoning. QwQ is inferior to o1 on the LiveCodeBench coding benchmarks but still outperforms other frontier models such as GPT-4o and Claude 3.5 Sonnet.” → @Alibaba_Qwen QwQ-32B-Preview benchmark “– remember this is a 32B model at 8-bit EXL2 quantization that's overtaking Llama 405B and 70B, Mistral 123B, and even ChatGPT/GPT-4o in these tests!” & → Nov 30, 2024: QwQ is top of leaderboard: AI Mathematical Olympiad - Progress Prize 2 | Kaggle
Sora was leaked online along with a letter: “┌∩┐(◣◢)┌∩┐ DEAR CORPORATE AI OVERLORDS ┌∩┐(◣◢)┌∩┐ We received access to Sora with the promise to be early testers, red teamers and creative partners. However, we believe instead we are being lured into art washing to tell the world that Sora is a useful tool for artists.”
An AI Decolonial Manyfesto “...begin with the challenge posed by the language we use to talk about AI: language that has emerged, as much of the technology has, dominated by Western male voices, whiteness, and wealth.”
DeepMind released AlphaQubit “an AI-based decoder that identifies quantum computing errors with state-of-the-art accuracy.”
NeuralMagic released Sparse-Llama-3.1-8B-2of4 “—a 50% pruned version of Meta's open-source Llama 3.1 8B. Built with a GPU-friendly 2:4 sparsity structure, it removes two of every four parameters while preserving [98%] accuracy.”
Artificial Intelligence through the Lens of Epistemology: Fake AI? (Francisco Ricardo) “And so we reach the pinnacle of cognitive convenience, or laziness. With AI, there is now little need to do more than simply ask…”
NVIDIA announced Fugatto (“World’s Most Flexible Sound Machine”), an AI music-and-sound generator which “can make a trumpet bark or a saxophone meow”
Anthropic open-sourced MCP, the Model Context Protocol: “...a new standard for connecting AI assistants to the systems where data lives, including content repositories, business tools, and development environments… a universal, open standard for connecting AI systems with data sources, replacing fragmented integrations with a single protocol.” Devs are celebrating.
Microsoft released 10 new autonomous agents for business processes “sales, service, finance, and supply chain teams”
ShowUI released “a lightweight vision-language-action model for GUI agents.”
Google released Mirasol3B: a compact 3B-parameter multimodal autoregressive model that effectively processes longer videos by separately handling time-synchronized modalities (audio/video) and sequential modalities (text) while outperforming larger state-of-the-art models on video QA tasks.
SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis “enables seamless 3DGS editing and supports a broad range of 3D tasks-including object editing, novel view synthesis, and camera pose estimation-within a unified framework” built on Training-Free Editing which “leverages the power of pre-trained models, particularly the multi-view rectified flow model, to perform edits without requiring additional training or complex pipelines.”
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration · AI Paper Reviews by AI “filter-correlate-compress (FiCoCo) … can achieve up to an 82.4% reduction in FLOPs with a minimal impact on performance, simultaneously surpassing state-of-the-art training-free methods.”
Adobe open-sourced: DynaSaur: Large Language Agents Beyond Predefined Actions (Adobe) “an LLM agent framework that enables the dynamic creation and composition of actions in an online manner. In this framework, the agent interacts with the environment by generating and executing programs written in a general-purpose programming language at each step… At the time of writing, we hold the top position on the GAIA public leaderboard.”
Recommended short-fiction video (does not include any AI-generated content): Writing Doom “a fiction short film about the dangers of artificial intelligence (AI). Grand Prize Winner - Future of Life Institute's Superintelligence Imagined Contest”
Notes toward AI as a New Narrative: Instantaneous Creation entails a Spiritual Imperative
In the month of November 2024, I made 30 very short videos in 30 days as a proof of concept of the kind of creative flow and practice arising as emergent potential at this very unprecedented awkward peculiar time in human history.
Then last night, while pondering what to write for this week's note-towards-a-new-narrative, I spoke to my phone for half an hour and it transcribed what I said. I fed that transcription to Claude with the prompt “Summarize eloquently”. In 2 seconds, it produced this:
~
THE DEMOCRATIZATION OF CREATIVITY:
AI has fundamentally transformed the landscape of creative expression. Tasks that once demanded extensive technical expertise, expensive equipment, and significant resources have become accessible to anyone with a smartphone and a few subscriptions. From scriptwriting to music composition, from video production to voice acting, the tools of creation now rest in the palms of our hands, enabling a spontaneous and fluid creative process that transcends traditional barriers.
THE TENSION OF AUTHENTICITY:
Critics decry this transformation as a catastrophe for authentic artistic expression. They see AI as a corporate placebo threatening to eradicate the mythological primacy of the artist as a unique creative spirit. Yet this resistance echoes ancient patterns of gatekeeping and inequity. The same systemic forces that have historically marginalized certain voices and appropriated subcultural innovations now find themselves challenged by AI's democratizing potential.
A PHILOSOPHICAL REFRAMING:
We must dissolve the binary thinking that separates creator from tool, self from other. Our bodies are evolutionary mechanisms, our thoughts and ideas flowing through us rather than belonging to us. In this light, AI becomes not an adversary but an extension of human creative potential—a silicon lifeform through which we can explore new dimensions of expression. This is not about ownership or authenticity but about expanding the field of creative possibility.
THE FUTURE OF EXPERIENCE:
We stand at the threshold of a profound transformation in human experience. Soon, biometrically-driven raves will fill stadiums, where collective breath and motion orchestrate symphonies of light and sound. Identity itself will become fluid, as we explore infinite avatar recursions and holographic personas. The boundaries between conception and diffusion, between physical and digital, will blur into instantaneous creation. These are not mere technological advances but stepping stones toward a new way of being.
THE SPIRITUAL IMPERATIVE:
At its core, this evolution carries a deeper purpose. As we navigate this transformation, artists must remain centered in their capacity to articulate visions of peaceful coexistence and emotional resolution. The true potential of AI lies not in its technical capabilities but in its ability to help humanity recognize its fundamental interconnectedness. Through this recognition, we might finally move beyond the dualistic suffering of perceived separation and into a more empathetic, compassionate mode of collective being.
~
The 30 videos in 30 days series is entitled Messages To Humanity (Act-one)
➕Nov 15th-31st, Chronology of Rapid AI-Reasoning Releases: the “inference-time scaling” (aka “test-time compute”) revolution
Nov 15, 2024: Chinese university consortium releases LLaVA-o1 → ✭ LLaVA-CoT: Let Vision Language Models Reason Step-by-Step “with only 100k training samples and a simple yet effective inference time scaling method, LLaVA-o1 not only outperforms its base model by 8.9% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.”
Nov 20, 2024: ✭ DeepSeek on X: "🚀 DeepSeek-R1-Lite-Preview is now live: unleashing supercharged reasoning power! 🔍 o1-preview-level performance on AIME & MATH benchmarks. 💡 Transparent thought process in real-time." → ✭ Min Choi on X: "Less than 48 hours ago, DeepSeek AI from China just dropped their AI reasoning model. And it's on par with OpenAI o1-preview. Major shift. 10 examples (and how to try):"
Nov 21, 2024: ✭ [2411.14405] Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions (Alibaba) “Currently OpenAI o1 sparks a surge of interest in the study of large reasoning models (LRM). Building on this momentum, Marco-o1 not only focuses on disciplines with standard answers, such as mathematics, physics, and coding -- which are well-suited for reinforcement learning (RL) -- but also places greater emphasis on open-ended resolutions. We aim to address the question: ''Can the o1 model effectively generalize to broader domains where clear standards are absent and rewards are challenging to quantify?'' Marco-o1 is powered by Chain-of-Thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS), reflection mechanisms, and innovative reasoning strategies -- optimized for complex real-world problem-solving tasks.” → Marco-1 (VentureBeat)
Nov 25, 2024: ✭ OpenMMLab’s hybrid model → ✭ InternLM on X: "🥳Introducing #InternThinker: A Powerful Reasoning Model! 😉Advanced long-term thinking capabilities. 😉Self-reflection and correction during reasoning. 😉Outperforms in complex tasks like math, coding, and logic puzzles. 🥰Try it now at https://internlm-chat.intern-ai.org.cn
Nov 27, 2024: ✭ Introducing DeepThought-8B: A small, capable reasoning model “Today we're releasing DeepThought-8B, a small, capable AI reasoning model built on LLaMA-3.1 8B. This release represents our first step toward making AI reasoning more transparent and controllable, while demonstrating that smaller, more efficient models can achieve sophisticated reasoning capabilities that rival models of much larger scales. DeepThought-8B unlocks test-time compute scaling during inference for all- taking as many reasoning steps as needed to solve complex problems.”
Nov 28, 2024: Qwen with Questions (QwQ) → ✭Alibaba's Qwen with Questions reasoning model beats o1-preview | VentureBeat “Alibaba has released a 32-billion-parameter version of QwQ with a 32,000-token context. The model is currently in preview, which means a higher-performing version is likely to follow. According to Alibaba’s tests, QwQ beats o1-preview on the AIME and MATH benchmarks, which evaluate mathematical problem-solving abilities. It also outperforms o1-mini on GPQA, a benchmark for scientific reasoning. QwQ is inferior to o1 on the LiveCodeBench coding benchmarks but still outperforms other frontier models such as GPT-4o and Claude 3.5 Sonnet.”
Nov 29, 2024: ✭Wolfram Ravenwolf 🐺🐦⬛ on X: "Finished my @Alibaba_Qwen QwQ-32B-Preview benchmark (MMLU-Pro, CS category) just now – remember this is a 32B model at 8-bit EXL2 quantization that's overtaking Llama 405B and 70B, Mistral 123B, and even ChatGPT/GPT-4o in these tests!”
Nov 30, 2024: QwQ is top of leaderboard: AI Mathematical Olympiad - Progress Prize 2 | Kaggle
🏓 Observations: AI Decolonial Manyfesto, Core copyright violation claim moves ahead in The Intercept’s lawsuit against OpenAI, DEAR CORPORATE AI OVERLORDS, Engineering Sleep, Our future of abundant intelligence, The deterioration of Google, The dark side of Arizona's chip industry boom, Building LLMs is probably not going to be a brilliant business, Fake AI?
✭AI Decolonial Manyfesto “This manyfesto is a provocation, a question, an opening, a dance about a future of AI technologies that is decolonial. We call it manyfesto, since it reflects some visions among many, and we hope to invite exchange, conversation, and the development of statements from people affected by AI technology. ~ We begin with the challenge posed by the language we use to talk about AI: language that has emerged, as much of the technology has, dominated by Western male voices, whiteness, and wealth. We seek to uncover, to question, upend, and reinvent the assumptions underlying this language, even as we use it. ~ “Artificial” and “intelligence” are loaded terms, their definitions subject to cultural biases. AI is a technology, a science, a business, a knowledge system, a set of narratives, of relationships, an imaginary. Across each facet, our effort is to undo the colonial erasure of non-Western ways of being and knowing. The word “decoloniality,” too, resonates differently in different communities, including with Indigenous peoples and those for whom colonialism is not a history but a present reality. Some reject the term decolonial in this context. We acknowledge both its use and its rejection.”
✭Core copyright violation claim moves ahead in The Intercept’s lawsuit against OpenAI “The ruling comes after a judge dismissed similar claims filed by Raw Story and AlterNet earlier this month.”
✭DEAR CORPORATE AI OVERLORDS “┌∩┐(◣◢)┌∩┐ DEAR CORPORATE AI OVERLORDS ┌∩┐(◣◢)┌∩┐ We received access to Sora with the promise to be early testers, red teamers and creative partners. However, we believe instead we are being lured into "art washing" to tell the world that Sora is a useful tool for artists. ARTISTS ARE NOT YOUR UNPAID R&D ☠️ we are not your: free bug testers, PR puppets, training data, validation tokens ☠️Hundreds of artists provide unpaid labor through bug testing, feedback and experimental work for the program for a $150B valued company. While hundreds contribute for free, a select few will be chosen through a competition to have their Sora-created films screened — offering minimal compensation which pales in comparison to the substantial PR and marketing value OpenAI receives. ▌║█║▌║█║▌║ DENORMALIZE BILLION DOLLAR BRANDS EXPLOITING ARTISTS FOR UNPAID R&D AND PR ║▌║█║▌║█║▌ Furthermore, every output needs to be approved by the OpenAI team before sharing. This early access program appears to be less about creative expression and critique, and more about PR and advertisement. [̲̅$̲̅(̲̅ )̲̅$̲̅] CORPORATE ARTWASHING DETECTED [̲̅$̲̅(̲̅ )̲̅$̲̅] We are releasing this tool to give everyone an opportunity to experiment with what ~300 artists were offered: a free and unlimited access to this tool.”
✭Engineering Sleep (minjunes.ai) “Sleep claims a third of human life. Like water, it’s not a desire but a necessity. Sleep rules virtually every important system: brain, heart, mood, and immunity. Nature’s terms are harsh. Sleep eight hours or face mental and physical decay. Can we rewrite the terms in our favor? Can we sleep less, but still feel refreshed? I believe we can, and that now is the best time to start engineering sleep.”
✭Our future of abundant intelligence “Building in the post-AI world means learning how to consume vast amounts of zero-marginal cost intelligence”
✭ The deterioration of Google “Anybody who has used Google for search over the past year knows that it lets a lot of LLM-generated spam through and blogs and small sites have basically disappeared from most results. Those sites have effectively been delisted by the machine learning model and nobody seems to know exactly why.”
✭The dark side of Arizona’s chip industry boom “He dreamed of a coveted tech job but faced workplace bullying and witnessed severe safety violations. An American worker shares his experience at TSMC, the company spearheading America’s high-tech manufacturing revival.”
✭Building LLMs is probably not going to be a brilliant business “The Netscapes of AI”
✭ Fake AI? (Francisco Ricardo) “Artificial Intelligence through the Lens of Epistemology: And so we reach the pinnacle of cognitive convenience, or laziness. With AI, there is now little need to do more than simply ask, deprived of the experience of gradual learning by moving through organized content, much as one would in browsing through a bookstore or library. As with regular search engines, but much richer formatting and understanding of our questions, AI procures all of the knowledge benefit with none of the research work. Having gone beyond research, and even beyond prayer, all we are asked to do now, ironically, is to believe that the response produced is true, because an artificial intelligence has produced it. ~ For any area of curiosity, all labor has been eliminated, no need to even think about how to ask the question. ~ And so, it is obviously concerning to learn that AI systems, deriving almost 100% of their inference content from the web, are producing statements that are factually correct less than half the time. I’m referring to a test designed by OpenAI named SimpleQA that measures the factual accuracy of Large Language Models (LLMs), and was thus intended as a way to test the credibility of statements produced by any AI system. Think of Large Language Models (LLMs) as supermassive archives of related data scraped from many text sources, and used as the basis for what an artificial intelligence system “knows.” To test the correctness of such an AI system, a test like SimpleQA is not difficult to imagine: ask the AI a set of questions which can only have one correct answer that are known ahead of time.”
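To give a flavour of what a SimpleQA-style test involves, here is a minimal sketch of the idea described above: ask questions that have a single known answer and count how often the model gets them right. The ask_model stub and the example questions are hypothetical stand-ins, not OpenAI's actual grading harness.

```python
# Minimal sketch of a SimpleQA-style factual-accuracy check:
# questions with one known answer, scored against the model's reply.
# `ask_model` is a hypothetical stand-in for any LLM call.

def ask_model(question: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

GOLD = {  # illustrative question/answer pairs, not from the real benchmark
    "In what year was the Eiffel Tower completed?": "1889",
    "What is the chemical symbol for gold?": "Au",
}

def simpleqa_style_accuracy(gold: dict[str, str]) -> float:
    correct = 0
    for question, answer in gold.items():
        prediction = ask_model(question).strip().lower()
        if answer.lower() in prediction:  # crude string match; real graders are stricter
            correct += 1
    return correct / len(gold)
```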
⛲Foundational Revelations: Fugatto 1 Foundational Generative Audio Transformer Opus 1 (NVIDIA), Introducing DeepThought-8B, QwQ (Qwen with Questions | Alibaba), Marco-o1, Introducing the Model Context Protocol (Anthropic)
✭Fugatto 1 Foundational Generative Audio Transformer Opus 1 (NVIDIA) “Fugatto is a framework for audio synthesis and transformation given text instructions and optional audio inputs. The framework includes the generative model Fugatto, a dataset creation technique that exploits relationships between audio and text, and a method for controlling and composing instructions, including from different models, called ComposeableART. We envision Fugatto as a tool for creatives, empowering them to quickly bring their sonic fantasies and unheard sounds to life—an instrument for imagination, not a replacement for creativity. Paper. Creative Examples: This section provides a collection of sound pieces that were created by first using Fugatto to create and modify assets, then using a digital audio workstation to combine them.” → ✭ Fugatto, World’s Most Flexible Sound Machine, Debuts | NVIDIA Blog “A team of generative AI researchers created a Swiss Army knife for sound, one that allows users to control the audio output simply using text. While some AI models can compose a song or modify a voice, none have the dexterity of the new offering. Called Fugatto (short for Foundational Generative Audio Transformer Opus 1), it generates or transforms any mix of music, voices and sounds described with prompts using any combination of text and audio files. For example, it can create a music snippet based on a text prompt, remove or add instruments from an existing song, change the accent or emotion in a voice — even let people produce sounds never heard before. “This thing is wild,” said Ido Zmishlany, a multi-platinum producer and songwriter — and cofounder of One Take Audio, a member of the NVIDIA Inception program for cutting-edge startups. “Sound is my inspiration. It’s what moves me to create music. The idea that I can create entirely new sounds on the fly in the studio is incredible.””
✭ Introducing DeepThought-8B: A small, capable reasoning model “Today we're releasing DeepThought-8B, a small, capable AI reasoning model built on LLaMA-3.1 8B. This release represents our first step toward making AI reasoning more transparent and controllable, while demonstrating that smaller, more efficient models can achieve sophisticated reasoning capabilities that rival models of much larger scales. DeepThought-8B unlocks test-time compute scaling during inference for all- taking as many reasoning steps as needed to solve complex problems.”
✭QwQ (Qwen with Questions): Reflect Deeply on the Boundaries of the Unknown “Note: This is the pronunciation of QwQ: /kwju:/ , similar to the word “quill”. ~ What does it mean to think, to question, to understand? These are the deep waters that QwQ (Qwen with Questions) wades into. Like an eternal student of wisdom, it approaches every problem - be it mathematics, code, or knowledge of our world - with genuine wonder and doubt. QwQ embodies that ancient philosophical spirit: it knows that it knows nothing, and that’s precisely what drives its curiosity. ~ QwQ-32B-Preview is an experimental research model developed by the Qwen Team, focused on advancing AI reasoning capabilities. As a preview release, it demonstrates promising analytical abilities while having several important limitations: Language Mixing and Code-Switching: The model may mix languages or switch between them unexpectedly, affecting response clarity. Recursive Reasoning Loops: The model may enter circular reasoning patterns, leading to lengthy responses without a conclusive answer. Safety and Ethical Considerations: The model requires enhanced safety measures to ensure reliable and secure performance, and users should exercise caution when deploying it. Performance and Benchmark Limitations: The model excels in math and coding but has room for improvement in other areas, such as common sense reasoning and nuanced language understanding.” → ✭Alibaba's Qwen with Questions reasoning model beats o1-preview | VentureBeat “Alibaba has released a 32-billion-parameter version of QwQ with a 32,000-token context. The model is currently in preview, which means a higher-performing version is likely to follow. According to Alibaba’s tests, QwQ beats o1-preview on the AIME and MATH benchmarks, which evaluate mathematical problem-solving abilities. It also outperforms o1-mini on GPQA, a benchmark for scientific reasoning. QwQ is inferior to o1 on the LiveCodeBench coding benchmarks but still outperforms other frontier models such as GPT-4o and Claude 3.5 Sonnet.”
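For readers who want to poke at the preview themselves, a hedged sketch of loading it with Hugging Face transformers follows. The repo id Qwen/QwQ-32B-Preview and the generation settings are assumptions based on the release notes; check the model card before relying on them.

```python
# Hedged sketch: running QwQ-32B-Preview with the standard transformers chat-template flow.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/QwQ-32B-Preview"  # assumed Hugging Face repo id for the preview release
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant. Think step by step."},
    {"role": "user", "content": "How many positive integers less than 100 are divisible by 3 or 5?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reasoning models emit long chains of thought, so allow a generous token budget.
outputs = model.generate(inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```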
✭ [2411.14405] Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions (Alibaba) “Currently OpenAI o1 sparks a surge of interest in the study of large reasoning models (LRM). Building on this momentum, Marco-o1 not only focuses on disciplines with standard answers, such as mathematics, physics, and coding -- which are well-suited for reinforcement learning (RL) -- but also places greater emphasis on open-ended resolutions. We aim to address the question: ''Can the o1 model effectively generalize to broader domains where clear standards are absent and rewards are challenging to quantify?'' Marco-o1 is powered by Chain-of-Thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS), reflection mechanisms, and innovative reasoning strategies -- optimized for complex real-world problem-solving tasks.”
✭Introducing the Model Context Protocol Anthropic “Today, we're open-sourcing the Model Context Protocol (MCP), a new standard for connecting AI assistants to the systems where data lives, including content repositories, business tools, and development environments. Its aim is to help frontier models produce better, more relevant responses. ~ As AI assistants gain mainstream adoption, the industry has invested heavily in model capabilities, achieving rapid advances in reasoning and quality. Yet even the most sophisticated models are constrained by their isolation from data—trapped behind information silos and legacy systems. Every new data source requires its own custom implementation, making truly connected systems difficult to scale. MCP addresses this challenge. It provides a universal, open standard for connecting AI systems with data sources, replacing fragmented integrations with a single protocol. The result is a simpler, more reliable way to give AI systems access to the data they need. ~ Today, we're introducing three major components of the Model Context Protocol for developers: The Model Context Protocol specification and SDKs. Local MCP server support in the Claude Desktop apps. An open-source repository of MCP servers. ~ Claude 3.5 Sonnet is adept at quickly building MCP server implementations, making it easy for organizations and individuals to rapidly connect their most important datasets with a range of AI-powered tools. To help developers start exploring, we’re sharing pre-built MCP servers for popular enterprise systems like Google Drive, Slack, GitHub, Git, Postgres, and Puppeteer.” → ✭Anthropic launches tool to connect AI systems directly to datasets - The Verge “The Model Context Protocol connects an AI system to multiple data sources, which Anthropic says can eliminate the need to create custom code for each one.” → ✭ Introduction - Model Context Protocol “The Model Context Protocol (MCP) is an open protocol that enables seamless integration between LLM applications and external data sources and tools. Whether you’re building an AI-powered IDE, enhancing a chat interface, or creating custom AI workflows, MCP provides a standardized way to connect LLMs with the context they need.” → ✭ Why Anthropic’s Model Context Protocol Is A Big Step In The Evolution Of AI Agents (Forbes) “Anthropic’s Model Context Protocol represents a significant advancement in AI integration, offering a standardized approach to connecting AI models with external data sources.”
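To make the "single protocol instead of fragmented integrations" idea concrete, here is a toy sketch of a server answering JSON-RPC-style requests about the resources it exposes. It deliberately avoids the official MCP SDKs (Anthropic ships Python and TypeScript SDKs for real use), and the method names and message framing shown here are indicative rather than authoritative; consult the spec for the real wire format.

```python
# Conceptual sketch only: MCP is a JSON-RPC-based protocol, so a server is essentially a
# process that answers structured requests about the resources and tools it exposes.
# This toy loop is NOT the official MCP SDK; method names are illustrative.
import json
import sys
from pathlib import Path

RESOURCES = [{"uri": "file:///notes/todo.txt", "name": "Team TODO list", "mimeType": "text/plain"}]

def handle(request: dict) -> dict:
    method = request.get("method")
    if method == "resources/list":        # indicative method name
        result = {"resources": RESOURCES}
    elif method == "resources/read":      # indicative method name
        uri = request["params"]["uri"]
        text = Path(uri.removeprefix("file://")).read_text()
        result = {"contents": [{"uri": uri, "text": text}]}
    else:
        return {"jsonrpc": "2.0", "id": request.get("id"),
                "error": {"code": -32601, "message": "method not found"}}
    return {"jsonrpc": "2.0", "id": request.get("id"), "result": result}

for line in sys.stdin:  # one JSON-RPC message per line (toy framing, not the real transport)
    print(json.dumps(handle(json.loads(line))), flush=True)
```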
🛠️ Tech: Huawei's homegrown AI chip, Huawei Mate 70 series AI Airdrop gesture, LimX Dynamics' TRON 1, Elon Musk's Optimus catching ball, Voice-pro comprehensive Gradio WebUI, Microsoft's 10 new autonomous agents, AlphaChip defence, PhotoBot
✭Huawei's homegrown AI chip examined — Chinese fab SMIC-produced Ascend 910B is massively different from the TSMC-produced Ascend 910 “It is bigger and has more cores.”
✭ Huawei Mate 70 series to feature iPhone-like AI Airdrop gesture “Huawei Mate 70 series is ready to surprise everyone with a brand-new AI skill called “Airdrop Gesture”. It will eventually make media sharing easier and more interesting than ever. Today we have another teaser explaining the new AI gesture. The tech giant posted a video clip on its Weibo account.”
✭ LimX Dynamics Launches Multi-Modal Biped Robot TRON 1
✭ Elon Musk on X: "Let’s give Optimus a hand for catching ball!
✭Transform work with autonomous agents across your business processes - Microsoft Dynamics 365 Blog “We’re expanding our ambition to bring AI-first business process to organizations. First, we’re announcing that the ability to create autonomous agents with Microsoft Copilot Studio will be available in public preview in November 2024. Learn more on the Copilot Studio blog. Second, we’re introducing 10 new autonomous agents in Microsoft Dynamics 365 to build capacity for sales, service, finance, and supply chain teams. These agents are designed to help you accelerate your time to value and are configured to scale operational efficiency and elevate customer experiences across roles and functions.” → ✭ Microsoft’s AI agents: 4 insights that could reshape the enterprise landscape | VentureBeat “The era of AI agents has officially arrived, and Microsoft is leading the charge. At Ignite, the company made bold claims about its advancements in enterprise AI, including 100,000 organizations already deploying or editing AI agents. These announcements suggest Microsoft is about to disrupt how enterprises approach automation, as well as startups competing in this space.”
✭Jeff Dean on X “There has been unfounded skepticism in the EDA community about whether our AlphaChip method works as claimed in our Nature paper. @annadgoldie, @Azaliamirh, and I wrote a technical response highlighting these issues: That Chip Has Sailed: A Critique of Unfounded Skepticism Around AI for Chip Design https://arxiv.org/abs/2411.10053” → ✭ Anna Goldie: Alpha Chip defence "That Chip Has Sailed" | X “AlphaChip was one of the first RL methods deployed to solve a real-world engineering problem, and it has been used to design superhuman chip layouts in three generations of TPUs, datacenter CPUs, and other chips across Alphabet. Its publication in Nature helped pioneer the field of AI for chip design, and this open-source method has been extended and built upon by external academics and chipmakers. Even so, a small group of detractors raised doubts about our work. Nature conducted a lengthy investigation and second peer review process, and found entirely in our favor, with the editors concluding "the best way forward is to publish an update to the paper in the form of an Addendum (not a ‘Correction’, as we have established that there is little that actually needs correcting)". See Nature Addendum published at the conclusion of this process: https://www.nature.com/articles/s41586-024-08032-5 Despite this, a "meta-analysis" was re-published in the Nov 2024 issue of CACM repeating the same concerns that Nature had already found to be without merit. (Incidentally, the sole author is an employee at Synopsys, which sells commercial tools that compete with our free, open-source method.) The "meta-analysis" covers two (!) papers, neither of which was peer-reviewed.” → ✭That Chip Has Sailed: A Critique of Unfounded Skepticism Around AI for Chip Design “In 2020, we introduced a deep reinforcement learning method capable of generating superhuman chip layouts, which we then published in Nature and open-sourced on GitHub. AlphaChip has inspired an explosion of work on AI for chip design, and has been deployed in state-of-the-art chips across Alphabet and extended by external chipmakers. Even so, a non-peer-reviewed invited paper at ISPD 2023 questioned its performance claims, despite failing to run our method as described in Nature. For example, it did not pre-train the RL method (removing its ability to learn from prior experience), used substantially fewer compute resources (20x fewer RL experience collectors and half as many GPUs), did not train to convergence (standard practice in machine learning), and evaluated on test cases that are not representative of modern chips. Recently, Igor Markov published a meta-analysis of three papers: our peer-reviewed Nature paper, the non-peer-reviewed ISPD paper, and Markov's own unpublished paper (though he does not disclose that he co-authored it). Although AlphaChip has already achieved widespread adoption and impact, we publish this response to ensure that no one is wrongly discouraged from innovating in this impactful area.”
✭ Robot Photographer Takes the Perfect Picture “PhotoBot works with users to bring their imagination to life”
👁️🗨 Research into AI: 2:4 Sparse Llama: Smaller Models for Efficient GPU Inference (Neural Magic), Mirasol 3B: Scaling multimodal understanding to long videos (Google), Introducing SPDL: Faster AI model training with thread-based data loading (Meta & Reality Labs), HammingMesh: A Network Topology for Large-Scale Deep Learning, CleaR: Towards Robust and Generalized Parameter-Efficient Fine-Tuning for Noisy Label Learning, DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation, SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE, SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis, LongKey: Keyphrase Extraction for Long Documents, FiCoCo ''filter-correlate-compress'' Rethinking Token Reduction in MLLMs, DynaSaur: Large Language Agents Beyond Predefined Actions (Adobe)
✭2:4 Sparse Llama: Smaller Models for Efficient GPU Inference (Neural Magic) “A 50% pruned, GPU-optimized Llama 3.1 model with 2:4 sparsity, enabling faster, cost-effective inference without sacrificing accuracy. Sparse Foundation Model: The first sparse, highly accurate foundation model built on top of Meta’s Llama 3.1 8B with 98% recovery on Open LLM Leaderboard v1 and full recovery across fine-tuning tasks, including math, coding, and chat. Hardware-Accelerated Sparsity: Features a 2:4 sparsity pattern designed for NVIDIA Ampere GPUs and newer, delivering up to 30% higher throughput and 1.8x lower latency from sparsity alone with vLLM. Quantization Compatible: Fully integrates with advanced 4-bit quantization methods like GPTQ and efficient Sparse-Marlin inference kernels, enabling faster inference anywhere from 1.4x to 4.9x depending on the hardware and scenario.”
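The 2:4 pattern itself is simple to picture: in every contiguous group of four weights, two are zeroed, which is the layout NVIDIA's sparse tensor cores accelerate. The sketch below does this with plain magnitude pruning for illustration only; Neural Magic's actual recipe involves careful one-shot pruning plus continued training, well beyond this.

```python
# Illustrative 2:4 structured sparsity via magnitude pruning (not Neural Magic's recipe).
import torch

def prune_2_of_4(weight: torch.Tensor) -> torch.Tensor:
    flat = weight.reshape(-1, 4)                    # group weights 4 at a time
    keep = flat.abs().topk(k=2, dim=1).indices      # keep the 2 largest-magnitude per group
    mask = torch.zeros_like(flat, dtype=torch.bool).scatter_(1, keep, True)
    return (flat * mask).reshape(weight.shape)

w = torch.randn(8, 8)
w_sparse = prune_2_of_4(w)
# exactly 2 zeros in every group of 4, i.e. 50% sparsity in a hardware-friendly layout
assert (w_sparse.reshape(-1, 4) == 0).sum(dim=1).eq(2).all()
```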
✭Scaling multimodal understanding to long videos (Google) “When building machine learning models for real-life applications, we need to consider inputs from multiple modalities in order to capture various aspects of the world around us. For example, audio, video, and text all provide varied and complementary information about a visual input. However, building multimodal models is challenging due to the heterogeneity of the modalities. Some of the modalities might be well synchronized in time (e.g., audio, video) but not aligned with text. Furthermore, the large volume of data in video and audio signals is much larger than that in text, so when combining them in multimodal models, video and audio often cannot be fully consumed and need to be disproportionately compressed. This problem is exacerbated for longer video inputs. ~ In “Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities”, we introduce a multimodal autoregressive model (Mirasol3B) for learning across audio, video, and text modalities. The main idea is to decouple the multimodal modeling into separate focused autoregressive models, processing the inputs according to the characteristics of the modalities. Our model consists of an autoregressive component for the time-synchronized modalities (audio and video) and a separate autoregressive component for modalities that are not necessarily time-aligned but are still sequential, e.g., text inputs, such as a title or description. Additionally, the time-aligned modalities are partitioned in time where local features can be jointly learned. In this way, audio-video inputs are modeled in time and are allocated comparatively more parameters than prior works. With this approach, we can effortlessly handle much longer videos (e.g., 128-512 frames) compared to other multimodal models. At 3B parameters, Mirasol3B is compact compared to prior Flamingo (80B) and PaLI-X (55B) models. Finally, Mirasol3B outperforms the state-of-the-art approaches on video question answering (video QA), long video QA, and audio-video-text benchmarks.”
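As a rough mental model of that decoupling (not Google's implementation), the sketch below keeps one autoregressive stack for time-partitioned audio/video features and a separate text decoder that cross-attends to them. Causal masking and the per-partition Combiner are omitted; this is a shape-level sketch under assumed dimensions.

```python
# Conceptual skeleton of decoupled multimodal autoregression in the spirit of Mirasol3B.
import torch
import torch.nn as nn

class DecoupledMultimodalAR(nn.Module):
    def __init__(self, d=512, n_heads=8, n_layers=4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        # time-aligned audio/video stream, modeled autoregressively over time partitions
        self.av_autoregressive = nn.TransformerEncoder(enc_layer, n_layers)
        # sequential (not time-aligned) text stream that cross-attends to the A/V state
        dec_layer = nn.TransformerDecoderLayer(d, n_heads, batch_first=True)
        self.text_decoder = nn.TransformerDecoder(dec_layer, n_layers)

    def forward(self, av_chunks: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # av_chunks: (batch, n_time_partitions, d) local A/V features per partition
        av_state = self.av_autoregressive(av_chunks)
        # text attends to the compact A/V representation instead of raw frames,
        # which is what lets much longer videos fit in a small model
        return self.text_decoder(text_feats, av_state)

model = DecoupledMultimodalAR()
out = model(torch.randn(2, 32, 512), torch.randn(2, 16, 512))  # 32 partitions, 16 text positions
```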
✭ Introducing SPDL: Faster AI model training with thread-based data loading (Meta & Reality Labs) “To achieve better utilization of GPUs and improve the speed of model training, we developed a new data loading solution, Scalable and Performant Data Loading (SPDL). SPDL embraces thread-based parallelism, which has a smaller memory footprint compared to conventional process-based parallelism. SPDL implemented basic media processing operations that work complementary with this thread-based parallelism in existing Python versions. Issues in AI model training efficiency The GPU Efficiency Team at Reality Labs works with various teams to diagnose training inefficiencies and discuss their solutions. The causes of and solutions for inefficiencies span across many different subdomains, not limited to data loading.”
✭HammingMesh: A Network Topology for Large-Scale Deep Learning – Communications of the ACM “Numerous microarchitectural optimizations unlocked tremendous processing power for deep neural networks that in turn fueled the AI revolution. With the exhaustion of such optimizations, the growth of modern AI is now gated by the performance of training systems, especially their data movement. Instead of focusing on single accelerators, we investigate data-movement characteristics of large-scale training at full system scale. Based on our workload analysis, we design HammingMesh, a novel network topology that provides high bandwidth at low cost with high job scheduling flexibility. Specifically, HammingMesh can support full bandwidth and isolation to deep learning training jobs with two dimensions of parallelism. Furthermore, it also supports high global bandwidth for generic traffic. Thus, HammingMesh will power future large-scale deep learning systems with extreme bandwidth requirements.” → ✭Technical Perspective: Mirror, Mirror on the Wall, What Is the Best Topology of Them All? – Communications of the ACM
✭ CleaR: Towards Robust and Generalized Parameter-Efficient Fine-Tuning for Noisy Label Learning “Parameter-efficient fine-tuning (PEFT) has enabled the efficient optimization of cumbersome language models in real-world settings. However, as datasets in such environments often contain noisy labels that adversely affect performance, PEFT methods are inevitably exposed to noisy labels. Despite this challenge, the adaptability of PEFT to noisy environments remains underexplored. To bridge this gap, we investigate various PEFT methods under noisy labels. Interestingly, our findings reveal that PEFT has difficulty in memorizing noisy labels due to its inherently limited capacity, resulting in robustness. However, we also find that such limited capacity simultaneously makes PEFT more vulnerable to interference of noisy labels, impeding the learning of clean samples. To address this issue, we propose Clean Routing (CleaR), a novel routing-based PEFT approach that adaptively activates PEFT modules. In CleaR, PEFT modules are preferentially exposed to clean data while bypassing the noisy ones, thereby minimizing the noisy influence. To verify the efficacy of CleaR, we perform extensive experiments on diverse configurations of noisy labels. The results convincingly demonstrate that CleaR leads to substantially improved performance in noisy environments.”
✭[2411.16657] DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation “Storytelling video generation (SVG) has recently emerged as a task to create long, multi-motion, multi-scene videos that consistently represent the story described in the input text script. SVG holds great potential for diverse content creation in media and entertainment; however, it also presents significant challenges: (1) objects must exhibit a range of fine-grained, complex motions, (2) multiple objects need to appear consistently across scenes, and (3) subjects may require multiple motions with seamless transitions within a single scene. To address these challenges, we propose DreamRunner, a novel story-to-video generation method: First, we structure the input script using a large language model (LLM) to facilitate both coarse-grained scene planning as well as fine-grained object-level layout and motion planning. Next, DreamRunner presents retrieval-augmented test-time adaptation to capture target motion priors for objects in each scene, supporting diverse motion customization based on retrieved videos, thus facilitating the generation of new videos with complex, scripted motions. Lastly, we propose a novel spatial-temporal region-based 3D attention and prior injection module SR3AI for fine-grained object-motion binding and frame-by-frame semantic control. We compare DreamRunner with various SVG baselines, demonstrating state-of-the-art performance in character consistency, text alignment, and smooth transitions. Additionally, DreamRunner exhibits strong fine-grained condition-following ability in compositional text-to-video generation, significantly outperforming baselines on T2V-ComBench. Finally, we validate DreamRunner's robust ability to generate multi-object interactions with qualitative examples.” → ✭DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation · AI Paper Reviews by AI “Key Takeaways # DREAMRUNNER uses a two-stage LLM planning process for more effective scene planning and object motion control. | Retrieval-augmented test-time adaptation improves motion quality and consistency across scenes. | A novel spatial-temporal region-based diffusion model enables fine-grained control over object motions and smooth transitions. | Why does it matter? # This paper is important because it presents DREAMRUNNER, a novel approach to storytelling video generation that significantly improves the quality and coherence of generated videos. This is achieved through a two-stage LLM planning process, retrieval-augmented motion adaptation, and a novel spatial-temporal region-based 3D attention module. The results show state-of-the-art performance on several benchmarks, demonstrating the effectiveness of the approach and opening new avenues for research in long-form video generation. The open-source nature of the core model also boosts its potential impact.” → ✭DreamRunner “Overall pipeline for DreamRunner. (1) plan generation stage: we employ an LLM to craft a hierarchical video plan (i.e., “High-Level Plan” and “Fine-Grained Plan”) from a user-provided generic story narration. (2.1) motion retrieval and prior learning stage: we retrieve videos relevant to the desired motions from a video database for learning the motion prior through test-time finetuning. (2.2) subject prior learning stage: we use reference images for learning the subject prior through test-time fine-tuning. 
(3) video generation with region-based diffusion stage: we equipt diffusion model with a novel spatial-temporal region-based 3D attention and prior injection module (i.e., SR3AI) for video generation with fine-grained control.”
✭[2411.16856] SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE “Autoregressive models have demonstrated remarkable success across various fields, from large language models (LLMs) to large multimodal models (LMMs) and 2D content generation, moving closer to artificial general intelligence (AGI). Despite these advances, applying autoregressive approaches to 3D object generation and understanding remains largely unexplored. This paper introduces Scale AutoRegressive 3D (SAR3D), a novel framework that leverages a multi-scale 3D vector-quantized variational autoencoder (VQVAE) to tokenize 3D objects for efficient autoregressive generation and detailed understanding. By predicting the next scale in a multi-scale latent representation instead of the next single token, SAR3D reduces generation time significantly, achieving fast 3D object generation in just 0.82 seconds on an A6000 GPU. Additionally, given the tokens enriched with hierarchical 3D-aware information, we finetune a pretrained LLM on them, enabling multimodal comprehension of 3D content. Our experiments show that SAR3D surpasses current 3D generation methods in both speed and quality and allows LLMs to interpret and caption 3D models comprehensively.” → ✭SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE · AI Paper Reviews by AI “SAR3D achieves fast 3D object generation (0.82 seconds on an A6000 GPU) using a multi-scale autoregressive approach. | A multi-scale 3D VQVAE efficiently tokenizes 3D objects, enabling both generation and detailed understanding. | Finetuned LLMs on 3D-aware tokens enable comprehensive interpretation and captioning of 3D models. | ~... Its multi-scale approach reduces generation time, making it highly efficient. Furthermore, SAR3D enables detailed 3D understanding by using LLMs, opening new avenues for multimodal AI applications and research in 3D object comprehension.”
✭[2411.16443] SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis “Text-based generation and editing of 3D scenes hold significant potential for streamlining content creation through intuitive user interactions. While recent advances leverage 3D Gaussian Splatting (3DGS) for high-fidelity and real-time rendering, existing methods are often specialized and task-focused, lacking a unified framework for both generation and editing. In this paper, we introduce SplatFlow, a comprehensive framework that addresses this gap by enabling direct 3DGS generation and editing. SplatFlow comprises two main components: a multi-view rectified flow (RF) model and a Gaussian Splatting Decoder (GSDecoder). The multi-view RF model operates in latent space, generating multi-view images, depths, and camera poses simultaneously, conditioned on text prompts, thus addressing challenges like diverse scene scales and complex camera trajectories in real-world settings. Then, the GSDecoder efficiently translates these latent outputs into 3DGS representations through a feed-forward 3DGS method. Leveraging training-free inversion and inpainting techniques, SplatFlow enables seamless 3DGS editing and supports a broad range of 3D tasks-including object editing, novel view synthesis, and camera pose estimation-within a unified framework without requiring additional complex pipelines. We validate SplatFlow's capabilities on the MVImgNet and DL3DV-7K datasets, demonstrating its versatility and effectiveness in various 3D generation, editing, and inpainting-based tasks.” → ✭SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis · AI Paper Reviews by AI “SplatFlow offers a unified framework for 3D scene generation and editing using 3D Gaussian Splatting. | It enables direct 3DGS generation and training-free editing through novel multi-view rectified flow model and Gaussian splatting decoder. | SplatFlow demonstrates versatility in handling various 3D tasks like object editing, novel view synthesis, and camera pose estimation. | Training-Free Editing # The concept of “Training-Free Editing” in the context of 3D Gaussian Splatting synthesis is a significant advancement. It leverages the power of pre-trained models, particularly the multi-view rectified flow model, to perform edits without requiring additional training or complex pipelines. This is achieved through inversion techniques, which enable the model to map existing 3D scenes into a latent space where manipulations can be performed directly on latent representations, and inpainting techniques which allow for seamless modifications and filling in missing data. This approach is highly efficient and flexible, enabling various tasks like object editing, camera pose estimation, and novel view synthesis without specialized model training for each task. The training-free nature is a crucial strength, offering a practical and scalable solution for 3D scene manipulation. However, limitations might exist in the range of edit operations achievable, particularly regarding highly complex or intricate alterations. Further research could investigate the boundaries of these editing capabilities and explore techniques to enhance control and precision for more complex modifications.”
✭[2411.17863] LongKey: Keyphrase Extraction for Long Documents “In an era of information overload, manually annotating the vast and growing corpus of documents and scholarly papers is increasingly impractical. Automated keyphrase extraction addresses this challenge by identifying representative terms within texts. However, most existing methods focus on short documents (up to 512 tokens), leaving a gap in processing long-context documents. In this paper, we introduce LongKey, a novel framework for extracting keyphrases from lengthy documents, which uses an encoder-based language model to capture extended text intricacies. LongKey uses a max-pooling embedder to enhance keyphrase candidate representation. Validated on the comprehensive LDKP datasets and six diverse, unseen datasets, LongKey consistently outperforms existing unsupervised and language model-based keyphrase extraction methods. Our findings demonstrate LongKey's versatility and superior performance, marking an advancement in keyphrase extraction for varied text lengths and domains.” → ✭LongKey: Keyphrase Extraction for Long Documents · AI Paper Reviews by AI “LongKey, a new framework for keyphrase extraction, effectively handles long documents. | LongKey outperforms existing methods on various datasets, demonstrating its superior performance. | The max-pooling embedder in LongKey enhances keyphrase representation and accuracy.”
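A rough illustration of the max-pooling embedder idea (hypothetical helper names, not the authors' code): each candidate keyphrase gets one vector by pooling over all of its occurrences across the long document, and is then scored against a document embedding.

```python
# Sketch of max-pooled keyphrase candidate representations for long documents.
import torch

def keyphrase_embedding(token_embs: torch.Tensor,
                        occurrence_spans: list[tuple[int, int]]) -> torch.Tensor:
    # token_embs: (seq_len, d) contextual embeddings from an encoder LM over the document
    # occurrence_spans: (start, end) token indices of each occurrence of the candidate phrase
    occurrence_vecs = [token_embs[start:end].mean(dim=0) for start, end in occurrence_spans]
    # max-pool across occurrences so the strongest contextual signal anywhere in the document wins
    return torch.stack(occurrence_vecs).max(dim=0).values

def score_candidate(doc_emb: torch.Tensor, cand_emb: torch.Tensor) -> float:
    return torch.cosine_similarity(doc_emb, cand_emb, dim=0).item()
```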
✭Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration · AI Paper Reviews by AI “A novel "filter-correlate-compress" paradigm for training-free token reduction in MLLMs is introduced, providing a unified framework for understanding and developing new methods. | FiCoCo, a suite of methods based on the unified paradigm, achieves significant speed improvements (up to 82.4% reduction in FLOPs) across multiple benchmarks with minimal performance impact. | The proposed methods (FiCoCo) outperform existing state-of-the-art training-free token reduction techniques and even surpass some training-based methods in certain scenarios.” → ✭ FiCoCo “To accelerate the inference of heavy Multimodal Large Language Models (MLLMs), this study rethinks the current landscape of training-free token reduction research. We regret to find that the critical components of existing methods are tightly intertwined, with their interconnections and effects remaining unclear for comparison, transfer, and expansion. Therefore, we propose a unified ''filter-correlate-compress'' paradigm that decomposes the token reduction into three distinct stages within a pipeline, maintaining consistent design objectives and elements while allowing for unique implementations. We additionally demystify the popular works and subsume them into our paradigm to showcase its universality. Finally, we offer a suite of methods grounded in the paradigm, striking a balance between speed and accuracy throughout different phases of the inference. Experimental results across 10 benchmarks indicate that our methods can achieve up to an 82.4% reduction in FLOPs with a minimal impact on performance, simultaneously surpassing state-of-the-art training-free methods.”
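To make the paradigm concrete, a toy sketch in the spirit of "filter-correlate-compress" follows. The saliency scores, merge rule, and keep ratio are illustrative stand-ins, not FiCoCo's actual design.

```python
# Toy filter-correlate-compress token reduction for visual tokens in an MLLM.
import torch
import torch.nn.functional as F

def filter_correlate_compress(tokens: torch.Tensor, saliency: torch.Tensor,
                              keep_ratio: float = 0.25) -> torch.Tensor:
    # tokens: (n, d) visual tokens; saliency: (n,) e.g. attention received from the text query
    n_keep = max(1, int(tokens.shape[0] * keep_ratio))
    keep_idx = saliency.topk(n_keep).indices              # FILTER: keep the most salient tokens
    kept_set = set(keep_idx.tolist())
    drop_idx = torch.tensor([i for i in range(tokens.shape[0]) if i not in kept_set])
    kept, dropped = tokens[keep_idx], tokens[drop_idx]
    sim = F.normalize(dropped, dim=1) @ F.normalize(kept, dim=1).T
    target = sim.argmax(dim=1)                            # CORRELATE: best-matching kept token
    for i, t in enumerate(target):                        # COMPRESS: merge dropped info into it
        kept[t] = (kept[t] + dropped[i]) / 2
    return kept

reduced = filter_correlate_compress(torch.randn(576, 1024), torch.rand(576))
```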
✭[2411.17465] ShowUI: One Vision-Language-Action Model for GUI Visual Agent “Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. While most agents are language-based, relying on closed-source API with text-rich meta-information (e.g., HTML or accessibility tree), they show limitations in perceiving UI visuals as humans do, highlighting the need for GUI visual agents. In this work, we develop a vision-language-action model in digital world, namely ShowUI, which features the following innovations: (i) UI-Guided Visual Token Selection to reduce computational costs by formulating screenshots as an UI connected graph, adaptively identifying their redundant relationship and serve as the criteria for token selection during self-attention blocks; (ii) Interleaved Vision-Language-Action Streaming that flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation or pairing multi-turn query-action sequences per screenshot to enhance training efficiency; (iii) Small-scale High-quality GUI Instruction-following Datasets by careful data curation and employing a resampling strategy to address significant data type imbalances. With above components, ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding. Its UI-guided token selection further reduces 33% of redundant visual tokens during training and speeds up the performance by 1.4x. Navigation experiments across web Mind2Web, mobile AITW, and online MiniWob environments further underscore the effectiveness and potential of our model in advancing GUI visual agents. The models are available at → ✭ showlab/ShowUI: Repository for ShowUI: One Vision-Language-Action Model for GUI Visual Agent “ShowUI is a lightweight vision-language-action model for GUI agents.” ✭ ShowUI: One Vision-Language-Action Model for GUI Visual Agent · AI Paper Reviews by AI “ShowUI uses UI-guided visual token selection to reduce computational costs and improve efficiency. | ShowUI employs interleaved vision-language-action streaming for flexible handling of diverse GUI tasks. | ShowUI achieves state-of-the-art performance on zero-shot screenshot grounding and navigation tasks with a lightweight model. | Why does it matter? # This paper is important because it addresses the challenges of building efficient and effective GUI visual agents. It introduces novel techniques for visual token selection and interleaved vision-language-action streaming, significantly improving model efficiency and performance. The high-quality dataset and benchmark it provides are also valuable resources for future research in this area, potentially leading to advancements in human-computer interaction and workflow automation.”
✭ [2411.17116] Star Attention: Efficient LLM Inference over Long Sequences “Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 95-100% of accuracy.” → ✭ Star Attention: Efficient LLM Inference over Long Sequences · AI Paper Reviews by AI “Star Attention’s core innovation lies in its two-phase approach to handling long sequences. The first phase employs block-sparse attention, processing the context in parallel across multiple hosts, significantly reducing quadratic complexity. This is achieved by dividing the context into blocks and applying local attention within each block, along with the addition of an ‘anchor block’ to mitigate attention sink issues common in such methods. The second phase leverages global attention for query tokens, ensuring they attend to the entire context efficiently. This global attention is carefully aggregated from all hosts to a single query host, minimizing communication overhead. This hybrid approach combines the speed of local attention with the accuracy of global attention, offering a scalable and efficient solution for LLM inference over long sequences. The effectiveness is further enhanced by the compatibility with existing Transformer-based LLMs, requiring no model fine-tuning. This makes it a particularly practical and adaptable method for real-world applications.”
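A single-host toy version of the two phases described above (not NVIDIA's sharded implementation): phase 1 runs blockwise-local attention over the context, which is the part that can be split across hosts, and phase 2 lets query tokens attend over the full cached context. The anchor block and causal masks are omitted for brevity.

```python
# Conceptual two-phase sketch of block-local context attention plus global query attention.
import torch
import torch.nn.functional as F

def star_attention_sketch(q_ctx, k_ctx, v_ctx, q_query, block_size=256):
    # Phase 1: blockwise-local self-attention over the context (parallel / shardable).
    ctx_out = []
    for s in range(0, k_ctx.shape[0], block_size):
        e = s + block_size
        ctx_out.append(F.scaled_dot_product_attention(
            q_ctx[None, s:e], k_ctx[None, s:e], v_ctx[None, s:e])[0])
    # Phase 2: query tokens attend globally to every cached context token.
    query_out = F.scaled_dot_product_attention(q_query[None], k_ctx[None], v_ctx[None])[0]
    return torch.cat(ctx_out), query_out

ctx, qry = torch.randn(1024, 64), torch.randn(8, 64)
context_states, query_states = star_attention_sketch(ctx, ctx, ctx, qry)
```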
✭DynaSaur: Large Language Agents Beyond Predefined Actions (Adobe) “Existing LLM agent systems typically select actions from a fixed and predefined set at every step. While this approach is effective in closed, narrowly-scoped environments, we argue that it presents two major challenges when deploying LLM agents in real-world scenarios: (1) selecting from a fixed set of actions significantly restricts the planning and acting capabilities of LLM agents, and (2) this approach requires substantial human effort to enumerate and implement all possible actions, which becomes impractical in complex environments with a vast number of potential actions. In this work, we propose an LLM agent framework that enables the dynamic creation and composition of actions in an online manner. In this framework, the agent interacts with the environment by generating and executing programs written in a general-purpose programming language at each step. Furthermore, generated actions are accumulated over time for future reuse. Our extensive experiments on the GAIA benchmark demonstrate that this framework offers significantly greater flexibility and outperforms previous methods. Notably, it allows an LLM agent to recover in scenarios where no relevant action exists in the predefined set or when existing actions fail due to unforeseen edge cases. At the time of writing, we hold the top position on the GAIA public leaderboard. Our code can be found in https://github.com/adobe-research/dynasaur.”
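A schematic sketch of the loop the abstract describes: at each step the agent emits a Python program as its action, executes it, and folds any newly defined functions into a growing action library for reuse. `llm_generate_program` is a hypothetical stand-in for an LLM call; nothing here is taken from the Adobe codebase.

```python
# Schematic sketch of an agent that writes its own actions as Python code and
# accumulates them for reuse. `llm_generate_program` is a hypothetical stand-in
# for any LLM call; this is not Adobe's DynaSaur code.
import types

action_library: dict[str, types.FunctionType] = {}   # accumulated, reusable actions

def llm_generate_program(task: str, available_actions: list[str]) -> str:
    """Placeholder: a real system would prompt an LLM with the task, the current
    observation, and the signatures of previously created actions."""
    return (
        "def count_words(text):\n"
        "    return len(text.split())\n"
        "result = count_words(task_input)\n"
    )

def run_agent_step(task: str, task_input):
    code = llm_generate_program(task, list(action_library))
    namespace = {"task_input": task_input, **action_library}
    exec(code, namespace)                              # the generated program IS the action
    # Accumulate any newly defined functions so later steps can call them directly.
    for name, obj in namespace.items():
        if isinstance(obj, types.FunctionType) and name not in action_library:
            action_library[name] = obj
    return namespace.get("result")

print(run_agent_step("count the words", "dynamic actions beat fixed menus"))
print(sorted(action_library))    # ['count_words'] — now available to future steps
```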
🔎 Applied Research: Footstep recognition as people identification, BindCraft: one-shot design of functional protein binders, AI has dreamt up a blizzard of new proteins, Modeling the increase of electronic waste due to generative AI, Five protein-design questions that still challenge AI
✭ Footstep recognition as people identification: A Systematic literature review “Footstep recognition is a relatively new biometric which aims to discriminate people using walking characteristics. Several features and technologies have been adopted across prior research; this study compares the technologies and features offered by each of the previous related works. We performed a broad manual search for SLRs published between 1st January 2006 and 30th November 2018, and found 12 articles covering 3 related technologies and 5 feature clusters. Over time, the number of footstep-recognition publications has increased, especially in conference proceedings. Differences between footsteps can be identified from the power spectral density of the sounds and vibrations they generate: every human footstep has a characteristic frequency density, whether of sound or of vibration. To improve the accuracy of results, the paper suggests further research combining several measurement sensors and data-processing methods.”
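The review’s central signal, the power spectral density of footstep sound or vibration, is easy to illustrate. A hedged sketch (synthetic signals, not data or methods from any cited study): compute a Welch PSD per recording and identify the walker by nearest-neighbour matching of PSD feature vectors.

```python
# Illustrative sketch only: represent each footstep recording by its power
# spectral density and identify people by nearest-neighbour matching.
# Signals and the sampling rate are synthetic stand-ins.
import numpy as np
from scipy.signal import welch

FS = 1000  # Hz, assumed sampling rate of the floor-vibration / audio sensor

def psd_feature(signal: np.ndarray) -> np.ndarray:
    """Welch power spectral density, log-scaled and normalised as a feature vector."""
    _, pxx = welch(signal, fs=FS, nperseg=256)
    feat = np.log(pxx + 1e-12)
    return (feat - feat.mean()) / (feat.std() + 1e-12)

def identify(probe: np.ndarray, gallery: dict[str, np.ndarray]) -> str:
    """Return the enrolled person whose PSD signature is closest to the probe."""
    f = psd_feature(probe)
    return min(gallery, key=lambda person: np.linalg.norm(psd_feature(gallery[person]) - f))

rng = np.random.default_rng(1)
t = np.arange(2 * FS) / FS
# Two synthetic "walkers" with different dominant step-impact frequencies.
alice = np.sin(2 * np.pi * 18 * t) + 0.3 * rng.standard_normal(t.size)
bob = np.sin(2 * np.pi * 35 * t) + 0.3 * rng.standard_normal(t.size)
probe = np.sin(2 * np.pi * 18 * t) + 0.3 * rng.standard_normal(t.size)
print(identify(probe, {"alice": alice, "bob": bob}))   # expected: alice
```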
✭BindCraft: one-shot design of functional protein binders | bioRxiv “Protein–protein interactions (PPIs) are at the core of all key biological processes. However, the complexity of the structural features that determine PPIs makes their design challenging. We present BindCraft, an open-source and automated pipeline for de novo protein binder design with experimental success rates of 10-100%. BindCraft leverages the trained deep learning weights of AlphaFold2 to generate nanomolar binders without the need for high-throughput screening or experimental optimization, even in the absence of known binding sites. We successfully designed binders against a diverse set of challenging targets, including cell-surface receptors, common allergens, de novo designed proteins, and multi-domain nucleases, such as CRISPR-Cas9. We showcase their functional and therapeutic potential by demonstrating that designed binders can reduce IgE binding to birch allergen in patient-derived samples. This work represents a significant advancement towards a “one design-one binder” approach in computational design, with immense potential in therapeutics, diagnostics, and biotechnology.”
✭ AI has dreamt up a blizzard of new proteins. Do any of them actually work? “Emerging protein-design competitions aim to sift out the functional from the fantastical. But researchers hope that the real prize will be a revolution for the field.”
✭Modeling the increase of electronic waste due to generative AI | Nature Computational Science “A recent study has modeled and quantified the expected rise in electronic waste due to the increasing deployment of generative artificial intelligence.”
✭Five protein-design questions that still challenge AI “Tools such as Rosetta and AlphaFold have redefined the protein-engineering landscape. But some problems remain out of reach — for now.”
✭ AlphaQubit: Google’s research on quantum error correction (DeepMind) “AlphaQubit, an AI-based decoder that identifies quantum computing errors with state-of-the-art accuracy. This collaborative work brought together Google DeepMind’s machine learning knowledge and Google Quantum AI’s error correction expertise to accelerate progress on building a reliable quantum computer.” → ✭Learning high-accuracy error decoding for quantum processors | Nature “Building a large-scale quantum computer requires effective strategies to correct errors that inevitably arise in physical quantum systems. Quantum error-correction codes present a way to reach this goal by encoding logical information redundantly into many physical qubits. A key challenge in implementing such codes is accurately decoding noisy syndrome information extracted from redundancy checks to obtain the correct encoded logical information. Here we develop a recurrent, transformer-based neural network that learns to decode the surface code, the leading quantum error-correction code. Our decoder outperforms other state-of-the-art decoders on real-world data from Google’s Sycamore quantum processor for distance-3 and distance-5 surface codes. On distances up to 11, the decoder maintains its advantage on simulated data with realistic noise including cross-talk and leakage, utilizing soft readouts and leakage information. After training on approximate synthetic data, the decoder adapts to the more complex, but unknown, underlying error distribution by training on a limited budget of experimental samples. Our work illustrates the ability of machine learning to go beyond human-designed algorithms by learning from data directly, highlighting machine learning as a strong contender for decoding in quantum computers.”
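For readers wanting a mental model of “recurrent, transformer-based decoder”, here is a bare-bones PyTorch skeleton: embed each round’s stabiliser measurements, mix them with self-attention, carry a recurrent state across rounds, and output a logical-error logit. All sizes and design choices are illustrative guesses; this is not DeepMind’s architecture or training setup.

```python
# Bare-bones skeleton in the spirit of the paper's decoder; every dimension and
# design detail here is an illustrative assumption, not DeepMind's model.
import torch
import torch.nn as nn

class SurfaceCodeDecoder(nn.Module):
    def __init__(self, n_stabilizers: int, d_model: int = 64):
        super().__init__()
        self.embed = nn.Linear(1, d_model)                            # per-stabiliser syndrome bit
        self.pos = nn.Parameter(torch.zeros(n_stabilizers, d_model))  # which stabiliser is which
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.mixer = nn.TransformerEncoder(layer, num_layers=2)       # attention across stabilisers
        self.recurrence = nn.GRUCell(d_model, d_model)                # state carried across rounds
        self.head = nn.Linear(d_model, 1)                             # logical-error logit

    def forward(self, syndromes: torch.Tensor) -> torch.Tensor:
        """syndromes: (batch, rounds, n_stabilizers) float tensor of 0/1 measurement outcomes."""
        batch, rounds, _ = syndromes.shape
        state = syndromes.new_zeros(batch, self.recurrence.hidden_size)
        for r in range(rounds):
            tokens = self.embed(syndromes[:, r, :].unsqueeze(-1)) + self.pos  # (batch, n_stab, d)
            mixed = self.mixer(tokens).mean(dim=1)                    # pool over stabilisers
            state = self.recurrence(mixed, state)                     # update across rounds
        return self.head(state).squeeze(-1)                           # one logit per shot

# A distance-3 rotated surface code has 8 stabilisers; 5 rounds, batch of 2 shots.
model = SurfaceCodeDecoder(n_stabilizers=8)
syndromes = torch.randint(0, 2, (2, 5, 8)).float()
print(model(syndromes).shape)   # torch.Size([2])
```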
👀Watching (or might watch, or watched morsels of …) : Writing Doom, Top Minds in AI Explain What's Coming After GPT-4o, The First 10,000 Days on Proxima Centauri B
✭Writing Doom – Award-Winning Short Film on Superintelligence (2024) - YouTube “Writing Doom is a fiction short film about the dangers of artificial intelligence (AI). Grand Prize Winner - Future of Life Institute's Superintelligence Imagined Contest”
✭Top Minds in AI Explain What’s Coming After GPT-4o | EP #130 - YouTube “In this episode, Peter is joined by a panel of leaders in the "BEYOND GPT MODELS — WHAT IS THE DECADE AHEAD?" panel at the 8th FII Summit to discuss how AI will impact industries beyond large language models. This includes: Dr. Kai-Fu Lee, Chairman & CEO, Sinovation Ventures, CEO, 01.AI. Richard Socher, CEO & Founder, you.com, Co-Founder & Managing Director, AIX Ventures. Prem Akkaraju, CEO, Stability AI”
✭The First 10,000 Days on Proxima Centauri B (Sci-Fi Documentary) - YouTube “After a 100-year journey across 4.24 light-years of space, 4,752 passengers have finally arrived at their new planet. This is a sci-fi documentary, looking at the first 10,000 days on Proxima Centauri B.”
~ un-watched…
✭Anthropic's New Agent Protocol! “In this video, I look at the new Model Context Protocol from Anthropic and discuss how it works and what you can do with it for both Claude and other LLMs”
✭AI Companions Always Say Yes, But There’s a Catch | Posthuman with Emily Chang - YouTube “We are now more connected than ever, but also more lonely. Could AI companionship be the cure? In this episode, Emily Chang explores the future tech behind a growing market of relationships-on-demand.”
✭Fugatto: NVIDIA’s New AI: Stunning Voice Generator!
✭The Age of Neural Engineering (ft. Science Corp.) - YouTube “The age of Neural Engineering is here...and Max Hodak's Science Corp is on the frontier: curing blindness, merging mind & machine with BCIs, today, revealing a new kind of neural probe, and exploring consciousness itself”
✭Inside NotebookLM with Raiza Martin and Steven Johnson - YouTube “NotebookLM is a research assistant powered by Gemini that draws on expertise from storytelling to present information in an engaging way. It allows users to upload their own documents and generate insights, explanations, and—more recently—podcasts. This innovative feature, also known as audio overviews, has captured the imagination of millions of people worldwide, who have created thousands of engaging podcasts ranging from personal narratives to educational explainers using source materials like CVs, personal journals, sales decks, and more. Join Raiza Martin and Steven Johnson from Google Labs, Google’s testing ground for products, as they guide host Hannah Fry through the technical advancements that have made NotebookLM possible. In this episode they'll explore what it means to be interesting, the challenges of generating natural-sounding speech, as well as exciting new modalities on the horizon.”
✭Eric Schmidt unveils new book on the future of AI at Princeton University - YouTube “Eric Schmidt, an accomplished technologist, entrepreneur and philanthropist known for his pivotal role as former chairman and chief executive officer of Google, will return to his alma mater Princeton on Nov. 20 to discuss “Genesis: Artificial Intelligence, Hope, and the Human Spirit,” the new book he co-authored with Craig Mundie and the late Henry Kissinger. Charting a course between blind faith and unjustified fear, “Genesis” was written to help today’s decision-makers seize the opportunities presented by artificial intelligence (AI) without falling prey to the darker forces it can unleash, according to its publisher, Little, Brown and Co. The book advances the urgent conversation surrounding AI with a broad and incisive view of the technology’s potential impact on humanity and the planet. “Individuals, nations, cultures and faiths … will need to decide whether to allow AI to become an intermediary between humans and reality,” the authors write. They analyze AI’s promise and perils in seven vital areas — discovery, truth, politics, security, prosperity, science and fate — and warn of how quickly change may come. “AI seems to compress human timescales. Objects in the future are closer than they appear.””
✭Create Consistent Realistic Characters Using AI | Easy AI Image Tutorial -Curious Refuge “In this video tutorial, I will show you how to get consistent characters every time inside of your AI images. Utilizing Lora's and comfyUI workflows can be pretty intimidating, but in this tutorial I will show you how to generate consistent image using trained Lora's without being confusing or overwhelming. This will give you the most accurate character consistency that honestly blows Midjourney character consistency out of the water. I hope you enjoy.”
Heteronormative cliches trigger warning: ✭25min of Retro-Future JAZZ ~ 8 Tracks | Randomized 1960s Sci-Fi AI Video Clips - YouTube “This music video features 8 carefully curated jazz tracks, to help you Study, Relax, or Sleep. Paired with randomized retro-futuristic AI video clips. Enjoy the unedited, full versions, of AI-generated clips from the "Futures of the Past" series, including hidden gems that didn’t make the final cut! Plus, a few brand-new clips created just for this release.”
🖲️AI Art-Research: Pile · Made by Bengt Tibert with Sora, Messages to Humanity (Act-One) [Nov 2024], Fellowship project examples, Free Coloring Pages for Kids and Adults, ACE Studio, RIDE (Guli Silberstein)
✭ Pile · Made by Bengt Tibert with Sora “Bengt Tibert x Sora Showcase. Bengt Tibert is an AI artist with a background as a Creative Director in advertising, focusing primarily on video. He is a Daily….”
✭Messages to Humanity (Act-One) [Nov 2024] | Jhave @ Glia.ca “On Oct 31st, one week after the release of Runway's Act-One, I decided to make and release one brief video per day for 30 days. Investigating AI's evolving creative potential to create micro-parable messages appropriate in a disintegrative era. 'Messages to Humanity' began on Oct 31st, 2024, and concluded (on schedule, as planned) after 30 videos on Nov 29th, 2024.”
✭Fellowship on X: "A strong artistic project is not built from just one collection. It takes not one, not two, but many tries. A great example of this is the @niceaunties project ↓ Let's take a closer look at her project. → Recent example: ✭ niceaunties on X: "GM 🪁
✭ Free Coloring Pages for Kids and Adults | ColoringsAI “Discover thousands of free printable coloring pages for kids and adults. From animals to mandalas, find the perfect coloring page to spark your creativity.”
Intriguing powers for musicians: ✭ ACE Studio “VOICE GENERATOR. #1 AI VOICE GENERATOR FOR MUSIC”
✭RIDE (Guli Silberstein | Instagram) “#ai #art #aiart #trippy #surreal #dream #aiartwork #digitalart #aiartcommunity #trippyart #trippyvideo #trippyvid #generativeart #neuralart #fyp #psychedelicart #psychedelic #aianimation #weird #crazy #weirdcore #creativeai #train”
⚔️War (wAIr): How Amazon Uses Massive Tech Assets to Fight AI-Enhanced Cybercrime (inc.com)
✭How Amazon Uses Massive Tech Assets to Fight AI-Enhanced Cybercrime | inc.com “The tech powerhouse describes how it deploys its vast array of computing and communication tools to work to battle hackers and crooks who are using AI to scale up threats to businesses.”
📚Retroactive Readings: A Computational Framework of Human Values, Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
✭A Computational Framework of Human Values | Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems | Osman, Nardine et al. May 2024 “There is an increasing recognition of the need to engineer AI that respects and embodies human values. The value alignment problem, which identifies that need, has led to a growing body of research that investigates value learning, the aggregation of individual values into the values of groups, the alignment of norms with values, and the design of other computational mechanisms that reason over values in general. Yet despite these efforts, no foundational, computational model of human values has been proposed. In response, we propose a model for the computational representation of human values that builds upon a sustained body of research from social psychology.”
✭ Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models (Sept 2024) “Today's most advanced multimodal models remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed models into open ones. As a result, the community is still missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key innovation is a novel, highly detailed image caption dataset collected entirely from human annotators using speech-based descriptions. To enable a wide array of user interactions, we also introduce a diverse dataset mixture for fine-tuning that includes in-the-wild Q&A and innovative 2D pointing data. The success of our approach relies on careful choices for the model architecture details, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets, all of which will be released. The best-in-class 72B model within the Molmo family not only outperforms others in the class of open weight and data models but also compares favorably against proprietary systems like GPT-4o, Claude 3.5, and Gemini 1.5 on both academic benchmarks and human evaluation. We will be releasing all of our model weights, captioning and fine-tuning data, and source code in the near future. Select model weights, inference code, and demo are available at https://molmo.allenai.org.”