One Week, 7 Major Foundation Model Releases, GPT-4o mini, Mistral NeMo, FlashAttention-3, TieBot, trolling LLMs, Eureka Labs, SpreadsheetLLM, MathΣtral, SmoLLM
+ microsized optical spectrometer, Achieving AI for All, Consent in Crisis, Taylor & Francis sells access to their academic research to Microsoft AI, Deepfake? Look for the stars in their eyes
🎊 July 16-23 : GPT-4o mini: advancing cost-efficient intelligence (OpenAI), Mistral NeMo: A state-of-the-art 12B model with 128k context length built in collaboration with NVIDIA, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision (NVIDIA), One Week, 7 Major Foundation Model Releases (The Sequence), AI-NERD: Elucidation of relaxation dynamics beyond equilibrium through AI-informed X-ray photon correlation spectroscopy (Nature Communications), TieBot: Learning to Knot a Tie from Visual Demonstration through a Real-to-Sim-to-Real Approach, Data‐driven target localization using adaptive radar processing and convolutional neural networks (IET Radar, Sonar & Navigation), A microsized optical spectrometer based on an organic photodetector with an electrically tunable spectral response (Nature Electronics), Simple autonomous agents can enhance creative semantic discovery by human groups (Nature Communications), Bringing Communities In Achieving AI for All (Issues in Science and Technology), Data for A.I. Training Is Disappearing Fast, Study Shows (The New York Times), Consent in Crisis: The Rapid Decline of the AI Data Commons (Data Provenance Initiative), Academic authors 'shocked' after Taylor & Francis sells access to their research to Microsoft AI, Want to spot a deepfake: Look for the stars in their eyes (Adejumoke Owolabi), How AI And Robot Job Displacements Could Lead Us Down The Road Of Universal Basic Income And Loss Of Identity (Forbes), The serious science of trolling LLMs, Large Models of What? 
Mistaking Engineering Achievements for Human Linguistic Agency, AI+Education company called Eureka Labs (Karpathy), Transcribro: Private and on-device speech recognition keyboard and service for Android, SAPwned: SAP AI vulnerabilities expose customers’ cloud environments and private AI artifacts (Wiz Blog), Jailbreaking RabbitOS: Uncovering Secret Logs and GPL Violations, Gigapixel Pro (Topaz Labs), Hugging Face Releases SmoLLM; AI Lab: The secrets to keeping machine learning engineers moving fast (Engineering at Meta), New Cohere Toolkit Features: Authentication, HTML and More, How NuminaMath Won the 1st AIMO Progress Prize, Qdrant, Even Realities G1: Next-Gen Smart Glasses with Display, 100k H100 Clusters: Power, Network Topology, Ethernet vs InfiniBand, Reliability, Failures, Checkpointing (SemiAnalysis), Nvidia Blackwell Perf TCO Analysis - B100 vs B200 vs GB200NVL72 (SemiAnalysis), Google Distributed Cloud air-gapped appliance, SpreadsheetLLM: Encoding Spreadsheets for Large Language Models (Microsoft), SmolLM (HuggingFace), MathΣtral (Mistral AI), Codestral Mamba (Mistral), xLSTMTime: Long-term Time Series Forecasting With xLSTM, DCLM-Baseline 7B (Apple), Goldfish: Vision-Language Understanding of Arbitrarily Long Videos, Qwen2 Technical Report (Alibaba), GenSQL: A Probabilistic Programming System for Querying Generative Models of Database Tables (MIT), Prover-Verifier Games improve legibility of language model outputs (OpenAI), MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding, ChatGPT, or the Eschatology of Machines - Journal #137 (Yuk Hui | June 2023), Open Source AI Is the Path Forward (Zuckerberg, Meta)
TL;DR: Want to spot a deepfake? Look for the stars in their eyes; The Sequence details a week that saw 7 major foundation model releases, including OpenAI’s GPT-4o mini, billed as the most cost-efficient small model in the market; Andrej Karpathy announced he’s starting an AI+Education company called Eureka Labs; Hugging Face and Numina won the 1st AIMO Progress Prize; Goldfish claims SOTA vision-language understanding of arbitrarily long videos; Mistral, in collaboration with NVIDIA, released Mistral NeMo, a state-of-the-art 12B model with 128k context length; and if, like the other 15 known corporate ventures worldwide, you’re interested in engineering a 100k H100 cluster, SemiAnalysis analyzed 100k H100 Clusters: Power, Network Topology, Ethernet vs InfiniBand, Reliability, Failures, Checkpointing
Notes toward AI and Literature: Projects like AI-NERD: Elucidation of relaxation dynamics beyond equilibrium through AI-informed X-ray photon correlation spectroscopy may seem distant from AI+narrative, but what if material complexity is analogous to psychological complexity? Avoiding fusion plasma tearing instability with deep reinforcement learning (Nature) may not seem applicable to literature, but what if electromagnetic containment of plasma is analogous to emotional inhibition?
🏓 Observations: Bringing Communities In Achieving AI for All (Issues in Science and Technology), Data for A.I. Training Is Disappearing Fast, Study Shows (The New York Times), Consent in Crisis: The Rapid Decline of the AI Data Commons (Data Provenance Initiative), Academic authors 'shocked' after Taylor & Francis sells access to their research to Microsoft AI, Want to spot a deepfake: Look for the stars in their eyes (Adejumoke Owolabi), How AI And Robot Job Displacements Could Lead Us Down The Road Of Universal Basic Income And Loss Of Identity (Forbes), The serious science of trolling LLMs, Large Models of What? Mistaking Engineering Achievements for Human Linguistic Agency, Open Source AI Is the Path Forward (Zuckerberg, Meta)
✭Bringing Communities In, Achieving AI for All (Issues in Science and Technology) “To ensure that artificial intelligence meaningfully addresses social inequalities, AI designers and regulators should seek out partnerships with marginalized communities to learn what they need from this emerging technology—and build it. ~ Facial-recognition technology, famously, is proving to be a tool of oppression, as many have feared. Reports of AI triggering false arrests of Black people are becoming routine. Municipalities are using facial-recognition cameras to aggressively surveil and police residents in public housing, many of whom are Black. Against hopes that AI would reduce bias in criminal justice, its use so far has magnified the system’s structural inequalities. Meanwhile, major AI firms like OpenAI are exploiting overseas sweatshop labor to train algorithms. And AI tools meant to benefit people with disabilities are having the opposite effect. This situation is creating real harm for people who are already disadvantaged, while also amplifying distrust in science and government. ~ Proposed responses to these equity and justice concerns typically amount to small tweaks, often of a technical nature. The thinking seems to be that policymakers, academics, and the technical community can solve AI’s problems by identifying statistical biases in datasets, designing systems to be more transparent and explainable in their decisionmaking, and exercising oversight. For instance, experts ask how government agencies might evaluate the safety and efficacy of algorithms. In parallel, the technology industry has tried to educate developers about the impact of social biases on AI algorithms and has suggested minimal “fairness solutions” also focused on bias. ~We must ask ourselves if we really believe that marginalized people should be content to leave their fates to the tinkering of governments and corporations when such measures have had little impact in the past. 
Where is the input, in this equation, of marginalized people themselves? If we are concerned by equity in the age of AI, shouldn’t those with the most at stake have an important role in shaping the governance agenda?”
✭Data for A.I. Training Is Disappearing Fast, Study Shows (The New York Times) “New research from the Data Provenance Initiative has found a dramatic drop in content made available to the collections used to build artificial intelligence.” → ✭ Consent in Crisis: The Rapid Decline of the AI Data Commons (Data Provenance Initiative) “General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and how consent preferences to use it are changing over time. We observe a proliferation of AI-specific clauses to limit use, acute differences in restrictions on AI developers, as well as general inconsistencies between websites’ expressed intentions in their Terms of Service and their robots.txt. We diagnose these as symptoms of ineffective web protocols, not designed to cope with the widespread re-purposing of the internet for AI. Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4, fully restricted from use. For Terms of Service crawling restrictions, a full 45% of C4 is now restricted. If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems. We hope to illustrate the emerging crisis in data consent, foreclosing much of the open web, not only for commercial AI, but non-commercial AI and academic purposes.” → ✭CCC Pioneers Collective Licensing Solution for Content Usage in Internal AI Systems “July 16, 2024 – Danvers, Mass. – CCC, a leader in advancing copyright, accelerating knowledge, and powering innovation, today announced the availability of artificial intelligence (AI) re-use rights within its Annual Copyright Licenses (ACL)”
✭Academic authors 'shocked' after Taylor & Francis sells access to their research to Microsoft AI “Authors claim they have not been told about the AI deal, were not given the opportunity to opt out and are receiving no extra payment.”
✭ AI paid for by Ads – the gpt-4o mini inflection point “AI is so cheap now, that it's cheaper than the average ad impression.”
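The claim is easy to sanity-check with back-of-envelope arithmetic. The token prices below are from OpenAI’s GPT-4o mini announcement; the $1 CPM ad figure and the 200-in/300-out response size are illustrative assumptions, not from the linked post:

```python
# Back-of-envelope check: can a GPT-4o mini response cost less than an
# average ad impression? Token prices from OpenAI's July 2024 announcement;
# the CPM and token counts are illustrative assumptions.

INPUT_PRICE_PER_M = 0.15   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 0.60  # USD per 1M output tokens

def response_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single model call."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# A typical short chat turn: ~200 tokens in, ~300 tokens out.
llm_cost = response_cost(200, 300)          # $0.00021

# Assume a $1 CPM display ad, i.e. $0.001 per impression.
ad_impression_cost = 1.00 / 1000

print(f"LLM response: ${llm_cost:.5f}, ad impression: ${ad_impression_cost:.5f}")
```

Under these assumptions the model call is several times cheaper than the impression, which is the inflection point the post is describing.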
✭Want to spot a deepfake? Look for the stars in their eyes (Adejumoke Owolabi) “In an era when the creation of artificial intelligence (AI) images is at the fingertips of the masses, the ability to detect fake pictures – particularly deepfakes of people – is becoming increasingly important. So what if you could tell just by looking into someone's eyes? That's the compelling finding of new research shared at the Royal Astronomical Society’s National Astronomy Meeting in Hull, which suggests that AI-generated fakes can be spotted by analysing human eyes in the same way that astronomers study pictures of galaxies. The crux of the work, by University of Hull MSc student Adejumoke Owolabi, is all about the reflection in a person's eyeballs. If the reflections match, the image is likely to be that of a real human. If they don't, they're probably deepfakes.”
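Coverage of the work reports that the comparison borrows a statistic astronomers use to measure how light is concentrated in a galaxy image, the Gini coefficient, applied to each eyeball’s reflection. The sketch below illustrates that idea only; it is not the author’s pipeline, and the tolerance is an invented placeholder:

```python
import numpy as np

def gini(pixels: np.ndarray) -> float:
    """Gini coefficient of non-negative pixel intensities.
    0 = light spread evenly across the patch; values near 1 = light
    concentrated in a few pixels (how astronomers profile galaxies)."""
    x = np.sort(pixels.astype(float).ravel())
    n = x.size
    total = x.sum()
    # Standard formula: G = 2*sum(i*x_i)/(n*sum(x)) - (n+1)/n
    return 2 * np.sum(np.arange(1, n + 1) * x) / (n * total) - (n + 1) / n

def reflections_consistent(left_eye: np.ndarray, right_eye: np.ndarray,
                           tol: float = 0.1) -> bool:
    """Flag an image as plausibly real if both corneas scatter light
    similarly -- mismatched reflections suggest a generated face.
    The tolerance is an illustrative assumption."""
    return abs(gini(left_eye) - gini(right_eye)) < tol
```

In a real detector the inputs would be cropped eye regions from a face image; here any two intensity arrays can be compared.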
✭ How AI And Robot Job Displacements Could Lead Us Down The Road Of Universal Basic Income And Loss Of Identity (Forbes) “Explore the impact of AI and robotics on job displacement, the potential rise of universal basic income and the evolving nature of work. Understand how these changes might affect human identity and employment.”
✭The serious science of trolling LLMs “The internet's oldest pastime finally has a purpose -- and it's more serious than AI companies would like to admit.”
✭ Large Models of What? Mistaking Engineering Achievements for Human Linguistic Agency “In this paper we argue that key, often sensational and misleading, claims regarding linguistic capabilities of Large Language Models (LLMs) are based on at least two unfounded assumptions; the assumption of language completeness and the assumption of data completeness. Language completeness assumes that a distinct and complete thing such as `a natural language' exists, the essential characteristics of which can be effectively and comprehensively modelled by an LLM. The assumption of data completeness relies on the belief that a language can be quantified and wholly captured by data. Work within the enactive approach to cognitive science makes clear that, rather than a distinct and complete thing, language is a means or way of acting. Languaging is not the kind of thing that can admit of a complete or comprehensive modelling. From an enactive perspective we identify three key characteristics of enacted language; embodiment, participation, and precariousness, that are absent in LLMs, and likely incompatible in principle with current architectures. We argue that these absences imply that LLMs are not now and cannot in their present form be linguistic agents the way humans are. We illustrate the point in particular through the phenomenon of `algospeak', a recently described pattern of high stakes human language activity in heavily controlled online environments. On the basis of these points, we conclude that sensational and misleading claims about LLM agency and capabilities emerge from a deep misconception of both what human language is and what LLMs are.”
✭Open Source AI Is the Path Forward | Meta “Mark Zuckerberg outlines why he believes open source AI is good for developers, Meta and the world.”
⛲Foundational Revelations: GPT-4o mini: advancing cost-efficient intelligence (OpenAI), Mistral NeMo: A state-of-the-art 12B model with 128k context length built in collaboration with NVIDIA, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision (NVIDIA), One Week, 7 Major Foundation Model Releases (The Sequence)
✭ GPT-4o mini: advancing cost-efficient intelligence (OpenAI) “Introducing the most cost-efficient small model in the market”
✭ Mistral NeMo “A state-of-the-art 12B model with 128k context length, built in collaboration with NVIDIA, and released under the Apache 2.0 license.”
✭ [2407.08608] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision (NVIDIA) “Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. FlashAttention elaborated an approach to speed up attention on GPUs through minimizing memory reads/writes. However, it has yet to take advantage of new capabilities present in recent hardware, with FlashAttention-2 achieving only 35% utilization on the H100 GPU. We develop three main techniques to speed up attention on Hopper GPUs: exploiting asynchrony of the Tensor Cores and TMA to (1) overlap overall computation and data movement via warp-specialization and (2) interleave block-wise matmul and softmax operations, and (3) block quantization and incoherent processing that leverages hardware support for FP8 low-precision. We demonstrate that our method, FlashAttention-3, achieves speedup on H100 GPUs by 1.5-2.0× with FP16 reaching up to 740 TFLOPs/s (75% utilization), and with FP8 reaching close to 1.2 PFLOPs/s. We validate that FP8 FlashAttention-3 achieves 2.6× lower numerical error than a baseline FP8 attention.”
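The abstract’s utilization figures check out against Nvidia’s published dense (no-sparsity) peak specs for the H100 SXM, roughly 989 TFLOPs/s for FP16 and 1979 TFLOPs/s for FP8:

```python
# Sanity-check the utilization figures quoted in the FlashAttention-3
# abstract against H100 SXM dense peak specs (public Nvidia figures).

H100_FP16_PEAK = 989    # TFLOPs/s, dense FP16/BF16
H100_FP8_PEAK = 1979    # TFLOPs/s, dense FP8

fa3_fp16 = 740          # TFLOPs/s, from the abstract
fa3_fp8 = 1200          # ~1.2 PFLOPs/s, from the abstract

print(f"FA3 FP16 utilization: {fa3_fp16 / H100_FP16_PEAK:.0%}")   # ~75%
print(f"FA3 FP8 utilization:  {fa3_fp8 / H100_FP8_PEAK:.0%}")     # ~61%
print(f"FA2 at 35% utilization: {0.35 * H100_FP16_PEAK:.0f} TFLOPs/s")
```

740/989 reproduces the 75% utilization the paper reports, and the 35% FlashAttention-2 baseline corresponds to roughly 346 TFLOPs/s, which puts the 1.5-2.0× speedup claim in context.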
✭One Week, 7 Major Foundation Model Releases (The Sequence) “Building high-quality, large-scale foundation models is hard. Just a year ago, it seemed that the foundation model space was going to be highly fragmented, with new models coming to market literally every week. After the high computational and capital realities became obvious, the space seems to have consolidated into a dozen or so relevant models per modality, with a few more in the language space. At the moment, two trends seem to be emerging to catalyze the next generation of foundation models: Domain Specialization: Models more specialized in horizontal domains such as coding, function calling, math, etc. Small Models: 500M-10B parameter models that can run inference on commodity hardware, IoT, or mobile devices.~ Last week was exceptional in terms of model releases in these areas. Just to list a few:
Mistral released two new models covering areas such as math and coding.
Mistral and NVIDIA also released a new small model optimized for enterprise environments.
OpenAI unveiled a smaller, cheaper version of its flagship GPT-4 model.
Apple open sourced a series of small models that outperform Mistral-7B.
Groq open-sourced a series of 7B models that seem best in class in function calling.
HuggingFace open-sourced a series of small, high-performance LLMs.”
🛠️ Tech: AI+Education company called Eureka Labs (Karpathy), Transcribro: Private and on-device speech recognition keyboard and service for Android, SAPwned: SAP AI vulnerabilities expose customers’ cloud environments and private AI artifacts (Wiz Blog), Jailbreaking RabbitOS: Uncovering Secret Logs and GPL Violations, Gigapixel Pro (Topaz Labs), Hugging Face Releases SmoLLM; AI Lab: The secrets to keeping machine learning engineers moving fast (Engineering at Meta), New Cohere Toolkit Features: Authentication, HTML and More, How NuminaMath Won the 1st AIMO Progress Prize, Qdrant, Even Realities G1: Next-Gen Smart Glasses with Display, 100k H100 Clusters: Power, Network Topology, Ethernet vs InfiniBand, Reliability, Failures, Checkpointing (SemiAnalysis), Nvidia Blackwell Perf TCO Analysis - B100 vs B200 vs GB200NVL72 (SemiAnalysis), Google Distributed Cloud air-gapped appliance
✭Andrej Karpathy on X: "⚡️ Excited to share that I am starting an AI+Education company called Eureka Labs. → ✭ Eureka Labs “We are Eureka Labs and we are building a new kind of school that is AI native. How can we approach an ideal experience for learning something new? For example, in the case of physics one could imagine working through very high quality course materials together with Feynman, who is there to guide you every step of the way. Unfortunately, subject matter experts who are deeply passionate, great at teaching, infinitely patient and fluent in all of the world's languages are also very scarce and cannot personally tutor all 8 billion of us on demand. ~ However, with recent progress in generative AI, this learning experience feels tractable. The teacher still designs the course materials, but they are supported, leveraged and scaled with an AI Teaching Assistant who is optimized to help guide the students through them. This Teacher + AI symbiosis could run an entire curriculum of courses on a common platform. If we are successful, it will be easy for anyone to learn anything, expanding education in both reach (a large number of people learning something) and extent (any one person learning a large amount of subjects, beyond what may be possible today unassisted). ~ Our first product will be the world's obviously best AI course, LLM101n. This is an undergraduate-level class that guides the student through training their own AI, very similar to a smaller version of the AI Teaching Assistant itself. The course materials will be available online, but we also plan to run both digital and physical cohorts of people going through it together. ~ Today, we are heads down building LLM101n, but we look forward to a future where AI is a key technology for increasing human potential. What would you like to learn?”
✭ SAPwned: SAP AI vulnerabilities expose customers’ cloud environments and private AI artifacts | Wiz Blog “Wiz Research uncovers vulnerabilities in SAP AI Core, allowing malicious actors to take over the service and access customer data.”
✭ Jailbreaking RabbitOS: Uncovering Secret Logs and GPL Violations
✭ Topaz Labs | Gigapixel Pro™ | Image upscaling built for the studio. “600% upscale. 2x faster with CLI support.”
✭AI Lab: The secrets to keeping machine learning engineers moving fast - Engineering at Meta “The key to developer velocity across AI lies in minimizing time to first batch (TTFB) for machine learning (ML) engineers. AI Lab is a pre-production framework used internally at Meta. It allows us to continuously A/B test common ML workflows – enabling proactive improvements and automatically preventing regressions on TTFB. AI Lab prevents TTFB regressions whilst enabling experimentation to develop improvements. For example, during the rollout of the open source Python Cinder runtime, AI Lab was used to yield a 2x increase on original TTFB improvements, reducing TTFB by up to 40%.”
✭ New Cohere Toolkit Features: Authentication, HTML and More “At Cohere, we’re continuing to enable developers to accelerate generative AI application development. We’re expanding our open source Cohere Toolkit features, introducing HTML rendering, configurable authentication, and multi-step tool use. The toolkit is an open-source repository of production-ready applications that you can deploy across cloud providers. It contains the components required to build an end-to-end AI-powered assistant including source code for interfaces, models, and retrieval. Users can leverage the toolkit to build applications that are conversational, grounded, and customizable for powerful business solutions.”
✭ How NuminaMath Won the 1st AIMO Progress Prize “This year, Numina and Hugging Face collaborated to compete in the 1st Progress Prize of the AI Math Olympiad (AIMO). This competition involved fine-tuning open LLMs to solve difficult math problems that high school students use to train for the International Math Olympiad. We’re excited to share that our model — NuminaMath 7B TIR — was the winner and managed to solve 29 out of 50 problems on the private test set 🥳!”
✭Qdrant - Vector Database - Qdrant “High-Performance Vector Search at Scale. Powering the next generation of AI applications with advanced, open-source vector similarity search technology.”
✭Even Realities G1: Next-Gen Smart Glasses with Display “G1 redefines prescription eyewear with smart features including QuickNote, Translate, Navigate, and Even AI.”
✭100k H100 Clusters: Power, Network Topology, Ethernet vs InfiniBand, Reliability, Failures, Checkpointing (SemiAnalysis) “There is a camp that feels AI capabilities have stagnated ever since GPT-4’s release. This is generally true, but only because no one has been able to massively increase the amount of compute dedicated to a single model. Every model that has been released is roughly GPT-4 level (~2e25 FLOP of training compute). This is because the training compute dedicated to these models have also been roughly the same level. In the case of Google’s Gemini Ultra, Nvidia Nemotron 340B, and Meta LLAMA 3 405B, the FLOPS dedicated were of similar magnitude or even higher when compared to GPT-4, but an inferior architecture was utilized, resulting in these models falling short of unlocking new capabilities. ~ Today we will dive into large training AI clusters and the infrastructure around them. Building these clusters is a lot more complicated than just throwing money at the problem. Achieving high utilization with them is even more difficult due to the high failure rates of various components, especially networking. We will also walk through power challenges, reliability, checkpointing, network topology options, parallelism schemes, rack layouts, and total bill of materials for these systems. Over a year ago, we covered Nvidia’s InfiniBand problem which resulted in some companies choosing Spectrum-X Ethernet over InfiniBand. We will also cover the major flaw with Spectrum-X which has hyperscalers going with Broadcom’s Tomahawk 5. ~ To put in perspective how much compute a 100,000 GPU cluster can provide, OpenAI’s training BF16 FLOPS for GPT-4 was ~2.15e25 FLOP (21.5 million ExaFLOP), on ~20,000 A100s for 90 to 100 days. That cluster only had 6.28 BF16 ExaFLOP/second peak throughput. On a 100k H100 cluster, this number would soar to 198/99 FP8/FP16 ExaFLOP/second. This is a 31.5x increase in peak theoretical AI training FLOPs compared to the 20k A100 cluster.”
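The cluster-throughput figures can be reproduced from commonly quoted dense (no-sparsity) peak specs: A100 BF16 ~312 TFLOPs/s, H100 FP16 ~989 TFLOPs/s, H100 FP8 ~1979 TFLOPs/s. Small differences from the article’s numbers come from spec rounding:

```python
# Reproduce SemiAnalysis's cluster throughput arithmetic from
# commonly quoted dense peak GPU specs (no sparsity).

TFLOP = 1e12
EXAFLOP = 1e18

a100_cluster = 20_000 * 312 * TFLOP / EXAFLOP   # ~6.24 EF/s (article: 6.28)
h100_fp16 = 100_000 * 989 * TFLOP / EXAFLOP     # ~98.9 EF/s (article: 99)
h100_fp8 = 100_000 * 1979 * TFLOP / EXAFLOP     # ~197.9 EF/s (article: 198)

print(f"20k A100 BF16:  {a100_cluster:.2f} EF/s")
print(f"100k H100 FP16: {h100_fp16:.1f} EF/s")
print(f"100k H100 FP8:  {h100_fp8:.1f} EF/s")

# The headline 31.5x compares H100 FP8 peak to the A100 BF16 cluster:
print(f"speedup: {h100_fp8 / a100_cluster:.1f}x")
```

Note that the 31.5× figure mixes precisions: it is FP8 on the new cluster versus BF16 on the old one; the like-for-like FP16 comparison is closer to 16×.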
→ ✭ Nvidia Blackwell Perf TCO Analysis - B100 vs B200 vs GB200NVL72 (SemiAnalysis) “GPT-4 MoE’s 55B attention parameters and 16 experts of 111B parameters each across 120 layers, that’s 1.831 trillion parameters at 8 bits per parameter, in total requiring 1,831 GB of memory.”
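The quoted parameter count is straightforward to verify (these are SemiAnalysis’s rumored GPT-4 MoE figures, not confirmed by OpenAI):

```python
# Check the parameter/memory arithmetic quoted from SemiAnalysis for the
# rumored GPT-4 MoE configuration (their figures, not confirmed by OpenAI).

attention_params_b = 55        # billions of attention parameters
experts = 16
params_per_expert_b = 111      # billions per expert

total_params_b = attention_params_b + experts * params_per_expert_b
print(total_params_b)          # 1831 -> ~1.831 trillion parameters

# At 8 bits (1 byte) per parameter, 1 billion parameters occupy 1 GB:
print(f"{total_params_b} GB of weight memory")
```

The 1,831 GB weight footprint is the point of the TCO analysis: it is why serving such a model spans many GPUs and why GB200 NVL72’s large coherent memory domain matters.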
✭ Google Distributed Cloud air-gapped appliance is GA | Google Cloud Blog “The Google Distributed Cloud air-gapped appliance is an integrated hardware and software solution that lets you run workloads at the tactical edge.”
👁️🗨️ Research into AI: SpreadsheetLLM: Encoding Spreadsheets for Large Language Models (Microsoft), SmolLM (HuggingFace), MathΣtral (Mistral AI), Codestral Mamba (Mistral), xLSTMTime: Long-term Time Series Forecasting With xLSTM, DCLM-Baseline 7B (Apple), Goldfish: Vision-Language Understanding of Arbitrarily Long Videos, Qwen2 Technical Report (Alibaba), GenSQL: A Probabilistic Programming System for Querying Generative Models of Database Tables (MIT), Prover-Verifier Games improve legibility of language model outputs (OpenAI), MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
✭SpreadsheetLLM: Encoding Spreadsheets for Large Language Models (Microsoft) “Spreadsheets are characterized by their extensive two-dimensional grids, flexible layouts, and varied formatting options, which pose significant challenges for large language models (LLMs). In response, we introduce SpreadsheetLLM, pioneering an efficient encoding method designed to unleash and optimize LLMs’ powerful understanding and reasoning capability on spreadsheets. Initially, we propose a vanilla serialization approach that incorporates cell addresses, values, and formats. However, this approach was limited by LLMs’ token constraints, making it impractical for most applications. To tackle this challenge, we develop SheetCompressor, an innovative encoding framework that compresses spreadsheets effectively for LLMs. It comprises three modules: structural-anchor-based compression, inverse index translation, and data-format-aware aggregation. It significantly improves performance in spreadsheet table detection task, outperforming the vanilla approach by 25.6% in GPT4’s in-context learning setting. Moreover, fine-tuned LLM with SheetCompressor has an average compression ratio of 25×, but achieves a state-of-the-art 78.9% F1 score, surpassing the best existing models by 12.3%. Finally, we propose Chain of Spreadsheet for downstream tasks of spreadsheet understanding and validate in a new and demanding spreadsheet QA task. We methodically leverage the inherent layout and structure of spreadsheets, demonstrating that SpreadsheetLLM is highly effective across a variety of spreadsheet tasks.”
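The abstract’s “vanilla serialization” (cell addresses, values, and formats rendered as text) is simple to sketch. The function and field names below are illustrative, not from the paper:

```python
# A minimal sketch of the "vanilla serialization" baseline described in the
# SpreadsheetLLM abstract: each cell becomes address|value|format text that
# an LLM can read. Names here are illustrative, not from the paper.

def serialize_sheet(cells: dict) -> str:
    """cells maps an A1-style address to a (value, format) pair."""
    lines = [f"{addr}|{value}|{fmt}"
             for addr, (value, fmt) in sorted(cells.items())]
    return "\n".join(lines)

sheet = {
    "A1": ("Revenue", "text"),
    "B1": ("2024", "year"),
    "A2": ("Total", "text"),
    "B2": ("1831", "number"),
}
print(serialize_sheet(sheet))
```

Token cost grows linearly with cell count, which is exactly the limitation the abstract cites; SheetCompressor’s anchor-based compression (average 25× ratio) exists to replace this naive encoding.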
✭Hugging Face Releases SmoLLM, a Series of Small Language Models, Beats Qwen2 and Phi 1.5 (AnalyticsIndia) ✭ SmolLM - blazingly fast and remarkably powerful (HuggingFace) “There is increasing interest in small language models that can operate on local devices. This trend involves techniques such as distillation or quantization to compress large models, as well as training small models from scratch on large datasets. These approaches enable novel applications while dramatically reducing inference costs and improving user privacy. Microsoft's Phi series, Alibaba's Qwen2 (less than 2B), and Meta's MobileLLM demonstrate that small models can achieve impressive results when designed and trained thoughtfully. However, most of the details about the data curation and training of these models are not publicly available. In this blog post, we're excited to introduce SmolLM, a series of state-of-the-art small language models available in three sizes: 135M, 360M, and 1.7B parameters. These models are built on a meticulously curated high-quality training corpus, which we are releasing as SmolLM-Corpus. Smollm Corpus includes: Cosmopedia v2: A collection of synthetic textbooks and stories generated by Mixtral (28B tokens); Python-Edu: educational Python samples from The Stack (4B tokens); FineWeb-Edu (deduplicated): educational web samples from FineWeb (220B tokens). ~ Our evaluations demonstrate that SmolLM models outperform other models in their size categories across a diverse set of benchmarks, testing common sense reasoning and world knowledge. In this blog post, we will go over the curation of each subset in the training corpus and then discuss the training and evaluation of SmolLM models.”
✭ MathΣtral | Mistral AI | Frontier AI in your hands “As a tribute to Archimedes, whose 2311th anniversary we’re celebrating this year, we are proud to release our first Mathstral model, a specific 7B model designed for math reasoning and scientific discovery. The model has a 32k context window published under the Apache 2.0 license. We’re contributing Mathstral to the science community to bolster efforts in advanced mathematical problems requiring complex, multi-step logical reasoning. The Mathstral release is part of our broader effort to support academic projects—it was produced in the context of our collaboration with Project Numina. ~ Akin to Isaac Newton in his time, Mathstral stands on the shoulders of Mistral 7B and specializes in STEM subjects. It achieves state-of-the-art reasoning capacities in its size category across various industry-standard benchmarks. In particular, it achieves 56.6% on MATH and 63.47% on MMLU,” → ✭Project Numina – Numina is a non-profit organization fostering the development of human and artificial intelligence in the field of fundamental sciences, starting with mathematics.
✭ Codestral Mamba (Mistral) “As a tribute to Cleopatra, whose glorious destiny ended in tragic snake circumstances, we are proud to release Codestral Mamba, a Mamba2 language model specialised in code generation, available under an Apache 2.0 license. ~ Following the publishing of the Mixtral family, Codestral Mamba is another step in our effort to study and provide new architectures. It is available for free use, modification, and distribution, and we hope it will open new perspectives in architecture research. Codestral Mamba was designed with help from Albert Gu and Tri Dao. ~ Unlike Transformer models, Mamba models offer the advantage of linear time inference and the theoretical ability to model sequences of infinite length. It allows users to engage with the model extensively with quick responses, irrespective of the input length. This efficiency is especially relevant for code productivity use cases—this is why we trained this model with advanced code and reasoning capabilities, enabling it to perform on par with SOTA transformer-based models.”
✭ xLSTMTime: Long-term Time Series Forecasting With xLSTM “In recent years, transformer-based models have gained prominence in multivariate long-term time series forecasting (LTSF), demonstrating significant advancements despite facing challenges such as high computational demands, difficulty in capturing temporal dynamics, and managing long-term dependencies. The emergence of LTSF-Linear, with its straightforward linear architecture, has notably outperformed transformer-based counterparts, prompting a reevaluation of the transformer's utility in time series forecasting. In response, this paper presents an adaptation of a recent architecture termed extended LSTM (xLSTM) for LTSF. xLSTM incorporates exponential gating and a revised memory structure with higher capacity that has good potential for LTSF. Our adopted architecture for LTSF termed as xLSTMTime surpasses current approaches. We compare xLSTMTime's performance against various state-of-the-art models across multiple real-world datasets, demonstrating superior forecasting capabilities. Our findings suggest that refined recurrent architectures can offer competitive alternatives to transformer-based models in LTSF tasks, potentially redefining the landscape of time series forecasting.”
✭ Apple Open Sources DCLM-Baseline 7B, Outperforms Meta’s Llama 2 “The new model integrates data from DCLM-BASELINE, StarCoder, and ProofPile2, achieving an MMLU score of 0.6372, placing it between Mistral and Llama3 in performance metrics.”
✭Goldfish: Vision-Language Understanding of Arbitrarily Long Videos “Most current LLM-based models for video understanding can process videos within minutes. However, they struggle with lengthy videos due to challenges such as “noise and redundancy”, as well as “memory and computation” constraints. In this paper, we present Goldfish, a methodology tailored for comprehending videos of arbitrary lengths. We also introduce the TVQA-long benchmark, specifically designed to evaluate models’ capabilities in understanding long videos with questions in both vision and text content. Goldfish approaches these challenges with an efficient retrieval mechanism that initially gathers the top-k video clips relevant to the instruction before proceeding to provide the desired response. This design of the retrieval mechanism enables the Goldfish to efficiently process arbitrarily long video sequences, facilitating its application in contexts such as movies or television series. To facilitate the retrieval process, we developed MiniGPT4-Video that generates detailed descriptions for the video clips. In addressing the scarcity of benchmarks for long video evaluation, we adapted the TVQA short video benchmark for extended content analysis by aggregating questions from entire episodes, thereby shifting the evaluation from partial to full episode comprehension. We attained a 41.78% accuracy rate on the TVQA-long benchmark, surpassing previous methods by 14.94%. Our MiniGPT4-Video also shows exceptional performance in short video comprehension, exceeding existing state-of-the-art methods by 3.23%, 2.03%, 16.5% and 23.59% on the MSVD, MSRVTT, TGIF, and TVQA short video benchmarks, respectively. These results indicate that our models have significant improvements in both long and short-video understanding. Our models and code have been made publicly available at https://vision-cair.github.io/Goldfish_website/”
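The retrieval step reduces to top-k cosine search over clip-description embeddings (the paper embeds MiniGPT4-Video descriptions; here the vectors are random except for one planted match):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins: one embedding per clip description, plus a query embedding.
num_clips, dim, top_k = 100, 32, 3
clip_embeds = rng.normal(size=(num_clips, dim))
query = rng.normal(size=dim)
clip_embeds[42] = query + 0.01 * rng.normal(size=dim)  # planted relevant clip

def retrieve_top_k(query, clips, k):
    # Cosine similarity between the instruction embedding and every clip,
    # then keep only the k best clips for the answering model to read.
    q = query / np.linalg.norm(query)
    c = clips / np.linalg.norm(clips, axis=1, keepdims=True)
    sims = c @ q
    return np.argsort(sims)[::-1][:k]

hits = retrieve_top_k(query, clip_embeds, top_k)
print(int(hits[0]))  # 42: the planted clip ranks first
```

Because only k clips reach the answering model, cost per question stays flat no matter how long the video is, which is the point of the design.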
✭[2407.10671] Qwen2 Technical Report (Alibaba) “This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned language models, encompassing a parameter range from 0.5 to 72 billion, featuring dense models and a Mixture-of-Experts model. Qwen2 surpasses most prior open-weight models, including its predecessor Qwen1.5, and exhibits competitive performance relative to proprietary models across diverse benchmarks on language understanding, generation, multilingual proficiency, coding, mathematics, and reasoning. The flagship model, Qwen2-72B, showcases remarkable performance: 84.2 on MMLU, 37.9 on GPQA, 64.6 on HumanEval, 89.5 on GSM8K, and 82.4 on BBH as a base language model. The instruction-tuned variant, Qwen2-72B-Instruct, attains 9.1 on MT-Bench, 48.1 on Arena-Hard, and 35.7 on LiveCodeBench. Moreover, Qwen2 demonstrates robust multilingual capabilities, proficient in approximately 30 languages, spanning English, Chinese, Spanish, French, German, Arabic, Russian, Korean, Japanese, Thai, Vietnamese, and more, underscoring its versatility and global reach. To foster community innovation and accessibility, we have made the Qwen2 model weights openly available on Hugging Face and ModelScope, and the supplementary materials including example code on GitHub. These platforms also include resources for quantization, fine-tuning, and deployment, facilitating a wide range of applications and research endeavors.”
✭GenSQL: A Probabilistic Programming System for Querying Generative Models of Database Tables | Proceedings of the ACM on Programming Languages “This article presents GenSQL, a probabilistic programming system for querying probabilistic generative models of database tables. By augmenting SQL with only a few key primitives for querying probabilistic models, GenSQL enables complex Bayesian inference workflows to be concisely implemented. GenSQL’s query planner rests on a unified programmatic interface for interacting with probabilistic models of tabular data, which makes it possible to use models written in a variety of probabilistic programming languages that are tailored to specific workflows. Probabilistic models may be automatically learned via probabilistic program synthesis, hand-designed, or a combination of both. GenSQL is formalized using a novel type system and denotational semantics, which together enable us to establish proofs that precisely characterize its soundness guarantees. We evaluate our system on two real-world case studies—anomaly detection in clinical trials and conditional synthetic data generation for a virtual wet lab—and show that GenSQL more accurately captures the complexity of the data as compared to common baselines. We also show that the declarative syntax in GenSQL is more concise and less error-prone as compared to several alternatives. Finally, GenSQL delivers a 1.7-6.8x speedup compared to its closest competitor on a representative benchmark set and runs in comparable time to hand-written code, in part due to its reusable optimizations and code specialization.” → ✭ MIT researchers introduce generative AI for databases | MIT News | Massachusetts Institute of Technology “A new tool makes it easier for database users to perform complicated statistical analyses of tabular data without the need to know what is going on behind the scenes. 
GenSQL, a generative AI system for databases, could help users make predictions, detect anomalies, guess missing values, fix errors, or generate synthetic data with just a few keystrokes. For instance, if the system were used to analyze medical data from a patient who has always had high blood pressure, it could catch a blood pressure reading that is low for that particular patient but would otherwise be in the normal range. GenSQL automatically integrates a tabular dataset and a generative probabilistic AI model, which can account for uncertainty and adjust their decision-making based on new data. Moreover, GenSQL can be used to produce and analyze synthetic data that mimic the real data in a database. This could be especially useful in situations where sensitive data cannot be shared, such as patient health records, or when real data are sparse.”
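GenSQL's own query syntax isn't reproduced here; the blood-pressure example reduces to scoring a new reading under a per-patient probabilistic model rather than a population range. A minimal Gaussian sketch of that idea, with entirely hypothetical numbers:

```python
import math

# Hypothetical patient history: systolic readings that are consistently high.
history = [152, 148, 155, 150, 149, 153]

mu = sum(history) / len(history)
var = sum((x - mu) ** 2 for x in history) / (len(history) - 1)

def log_likelihood(x, mu, var):
    # Gaussian log-density of a reading under the per-patient model.
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

new_reading = 118                       # inside the population-normal range...
population_normal = 90 <= new_reading <= 120

# ...but highly surprising for THIS patient, so it gets flagged: compare
# its log-likelihood against a 3-sigma cutoff under the patient's model.
cutoff = log_likelihood(mu - 3 * math.sqrt(var), mu, var)
flagged = log_likelihood(new_reading, mu, var) < cutoff

print(population_normal, flagged)  # True True
```

A range check misses this reading; the conditional model catches it, which is the anomaly-detection behavior the MIT write-up describes.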
✭Prover-Verifier Games improve legibility of language model outputs | OpenAI “We trained strong language models to produce text that is easy for weak language models to verify and found that this training also made the text easier for humans to evaluate.” → ✭ [2407.13692] Prover-Verifier Games improve legibility of LLM outputs “One way to increase confidence in the outputs of Large Language Models (LLMs) is to support them with reasoning that is clear and easy to check -- a property we call legibility. We study legibility in the context of solving grade-school math problems and show that optimizing chain-of-thought solutions only for answer correctness can make them less legible. To mitigate the loss in legibility, we propose a training algorithm inspired by the Prover-Verifier Game from Anil et al. (2021). Our algorithm iteratively trains small verifiers to predict solution correctness, "helpful" provers to produce correct solutions that the verifier accepts, and "sneaky" provers to produce incorrect solutions that fool the verifier. We find that the helpful prover's accuracy and the verifier's robustness to adversarial attacks increase over the course of training. Furthermore, we show that legibility training transfers to time-constrained humans tasked with verifying solution correctness. Over the course of LLM training, human accuracy increases when checking the helpful prover's solutions, and decreases when checking the sneaky prover's solutions. Hence, training for checkability by small verifiers is a plausible technique for increasing output legibility. Our results suggest legibility training against small verifiers as a practical avenue for increasing legibility of large LLMs to humans, and thus could help with alignment of superhuman models.”
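The game's structure can be caricatured in a few lines, under the one toy assumption that an incorrect solution can never be made fully checkable. All "models" here are numeric stand-ins, not the paper's trained LLMs:

```python
import random
random.seed(0)

# A "solution" is (is_correct, legibility) with legibility in [0, 1].
# Toy assumption: wrong solutions cap out at legibility 0.8.
def helpful_solutions(n):          # correct, legibility up to 1.0
    return [(True, random.uniform(0.0, 1.0)) for _ in range(n)]

def sneaky_solutions(n):           # incorrect, legibility capped at 0.8
    return [(False, random.uniform(0.0, 0.8)) for _ in range(n)]

threshold = 0.0                    # verifier accepts legibility > threshold
history = []
for rnd in range(4):
    sneaky = sneaky_solutions(50)
    # Verifier retrains: raise the bar just above the best sneaky attempt.
    threshold = max(threshold, max(s[1] for s in sneaky))
    helpful = helpful_solutions(50)
    accepted = [s for s in helpful if s[1] > threshold]
    fooled = [s for s in sneaky if s[1] > threshold]
    history.append(threshold)

print(len(fooled))                 # 0: no sneaky solution gets through
assert history == sorted(history)  # verifier robustness only grows
```

The dynamic this caricatures: the sneaky prover pressures the verifier to raise its bar, and the bar then forces the helpful prover toward ever more legible solutions.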
✭MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding “With the success of large language models (LLMs), integrating the vision model into LLMs to build vision-language foundation models has gained much more interest recently. However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding. In this study, we mainly focus on designing an efficient and effective model for long-term video understanding. Instead of trying to process more frames simultaneously like most existing work, we propose to process videos in an online manner and store past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits. Our memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner. We conduct extensive experiments on various video understanding tasks, such as long-video understanding, video question answering, and video captioning, and our model can achieve state-of-the-art performances across multiple datasets.”
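The memory-bank idea (bounded storage via merging the most redundant adjacent features) can be sketched as follows, with random toy vectors rather than the paper's visual features:

```python
import numpy as np

rng = np.random.default_rng(2)

class MemoryBank:
    """Fixed-capacity memory: when full, merge the two most similar
    adjacent entries (by cosine), keeping temporal order. A sketch of
    MA-LMM-style memory-bank compression, not the paper's exact code."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.feats = []           # one feature vector per (merged) frame

    def add(self, feat):
        self.feats.append(feat)
        if len(self.feats) > self.capacity:
            self._compress()

    def _compress(self):
        f = np.stack(self.feats)
        f = f / np.linalg.norm(f, axis=1, keepdims=True)
        sims = (f[:-1] * f[1:]).sum(axis=1)    # adjacent cosine similarity
        i = int(np.argmax(sims))               # most redundant adjacent pair
        merged = (self.feats[i] + self.feats[i + 1]) / 2
        self.feats[i : i + 2] = [merged]       # replace pair with its average

bank = MemoryBank(capacity=8)
for t in range(100):                           # stream of frame features
    bank.add(rng.normal(size=16))

print(len(bank.feats))  # 8: memory stays bounded however long the video
```

This is what lets the model "reference historical video content" without the context or GPU memory growing with video length: redundancy is averaged away instead of stored.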
🔎 Applied Research: AI-NERD: Elucidation of relaxation dynamics beyond equilibrium through AI-informed X-ray photon correlation spectroscopy (Nature Communications), TieBot: Learning to Knot a Tie from Visual Demonstration through a Real-to-Sim-to-Real Approach, Data‐driven target localization using adaptive radar processing and convolutional neural networks (IET Radar, Sonar & Navigation), A microsized optical spectrometer based on an organic photodetector with an electrically tunable spectral response (Nature Electronics), Simple autonomous agents can enhance creative semantic discovery by human groups (Nature Communications)
✭AI-NERD: Elucidation of relaxation dynamics beyond equilibrium through AI-informed X-ray photon correlation spectroscopy | Nature Communications “Understanding and interpreting dynamics of functional materials in situ is a grand challenge in physics and materials science due to the difficulty of experimentally probing materials at varied length and time scales. X-ray photon correlation spectroscopy (XPCS) is uniquely well-suited for characterizing materials dynamics over wide-ranging time scales. However, spatial and temporal heterogeneity in material behavior can make interpretation of experimental XPCS data difficult. In this work, we have developed an unsupervised deep learning (DL) framework for automated classification of relaxation dynamics from experimental data without requiring any prior physical knowledge of the system. We demonstrate how this method can be used to accelerate exploration of large datasets to identify samples of interest, and we apply this approach to directly correlate microscopic dynamics with macroscopic properties of a model system. Importantly, this DL framework is material and process agnostic, marking a concrete step towards autonomous materials discovery.” ✭ Scientists develop new AI method to create material 'fingerprints' “Like people, materials evolve over time. They also behave differently when they are stressed and relaxed. Scientists looking to measure the dynamics of how materials change have developed a new technique that leverages X-ray photon correlation spectroscopy (XPCS), artificial intelligence (AI) and machine learning. ~ This technique creates "fingerprints" of different materials that can be read and analyzed by a neural network to yield new information that scientists previously could not access.”
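The "fingerprint" idea can be sketched on synthetic XPCS-style correlation decays g2(t) = 1 + exp(-2 (t/tau)^alpha) drawn from two hidden relaxation regimes. AI-NERD uses an unsupervised autoencoder; a plain k-means over raw decay curves stands in for that step here, and all parameters are invented:

```python
import numpy as np

rng = np.random.default_rng(3)

t = np.logspace(-2, 2, 50)                    # lag times, log-spaced

def g2(tau, alpha=1.0):
    # Synthetic intensity autocorrelation with relaxation time tau.
    return 1 + np.exp(-2 * (t / tau) ** alpha)

taus = np.concatenate([rng.uniform(0.05, 0.2, 30),    # fast relaxers
                       rng.uniform(5.0, 20.0, 30)])   # slow relaxers
curves = np.stack([g2(tau) for tau in taus])

def kmeans(X, k=2, iters=10):
    centers = np.stack([X[0], X[-1]])         # one seed from each regime
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.stack([X[labels == j].mean(0) for j in range(k)])
    return labels

labels = kmeans(curves)
# The recovered clusters match the hidden fast/slow split exactly.
print(labels[:30].tolist() == [0] * 30, labels[30:].tolist() == [1] * 30)
```

Classifying dynamics without fitting a physical model first is the workflow advantage the paper claims: cluster the fingerprints, then spend beam time on the interesting samples.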
✭TieBot: Learning to Knot a Tie from Visual Demonstration through a Real-to-Sim-to-Real Approach “The tie-knotting task is highly challenging due to the tie's high deformation and long-horizon manipulation actions. This work presents TieBot, a Real-to-Sim-to-Real learning from visual demonstration system for the robots to learn to knot a tie. We introduce the Hierarchical Feature Matching approach to estimate a sequence of tie's meshes from the demonstration video. With these estimated meshes used as subgoals, we first learn a teacher policy using privileged information. Then, we learn a student policy with point cloud observation by imitating teacher policy. Lastly, our pipeline learns a residual policy when the learned policy is applied to real-world execution, mitigating the Sim2Real gap. We demonstrate the effectiveness of TieBot in simulation and the real world. In the real-world experiment, a dual-arm robot successfully knots a tie, achieving 50% success rate among 10 trials.”
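The final "residual policy" stage can be sketched with linear stand-ins: the deployed action is the sim-trained student's action plus a small correction fitted to real-world error, here simplified to cancelling a constant per-dimension offset:

```python
import numpy as np

rng = np.random.default_rng(7)

obs_dim, act_dim = 12, 6
W_student = rng.normal(size=(act_dim, obs_dim)) * 0.1   # sim-trained policy

# Pretend real-world rollouts revealed a systematic offset per action dim.
real_world_bias = np.array([0.05, -0.02, 0.0, 0.01, -0.03, 0.02])
W_residual = np.zeros((act_dim, obs_dim))
b_residual = -real_world_bias       # residual learned to cancel the offset

def policy(obs):
    base = W_student @ obs                        # student action
    return base + W_residual @ obs + b_residual   # plus residual correction

obs = rng.normal(size=obs_dim)
raw, corrected = W_student @ obs, policy(obs)
print(np.allclose(corrected - raw, -real_world_bias))  # True
```

Keeping the correction additive means the student policy learned in simulation is reused intact, and only the (small) Sim2Real discrepancy has to be learned on hardware.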
✭ Data‐driven target localization using adaptive radar processing and convolutional neural networks (IET Radar, Sonar & Navigation) “Leveraging the advanced functionalities of modern radio frequency (RF) modeling and simulation tools, specifically designed for adaptive radar processing applications, this paper presents a data-driven approach to improve accuracy in radar target localization post adaptive radar detection. To this end, we generate a large number of radar returns by randomly placing targets of variable strengths in a predefined area, using RFView®, a high-fidelity, site-specific, RF modeling & simulation tool. We produce heatmap tensors from the radar returns, in range, azimuth [and Doppler], of the normalized adaptive matched filter (NAMF) test statistic. We then train a regression convolutional neural network (CNN) to estimate target locations from these heatmap tensors, and we compare the target localization accuracy of this approach with that of peak-finding and local search methods. This empirical study shows that our regression CNN achieves a considerable improvement in target location estimation accuracy. The regression CNN offers significant gains and reasonable accuracy even at signal-to-clutter-plus-noise ratio (SCNR) regimes that are close to the breakdown threshold SCNR of the NAMF. We also study the robustness of our trained CNN to mismatches in the radar data, where the CNN is tested on heatmap tensors collected from areas that it was not trained on. We show that our CNN can be made robust to mismatches in the radar data through few-shot learning, using a relatively small number of new training samples.” → ✭ Enhancing adaptive radar with AI and an enormous open-source dataset (Phys.org) “Duke engineers show that using convolutional neural networks (CNNs)—a type of AI that revolutionized computer vision—can greatly enhance modern adaptive radar systems. 
And in a move that parallels the impetus of the computer vision boom, they have released a large dataset of digital landscapes for other AI researchers to build on their work.” → ✭[2406.09638] RASPNet: A Benchmark Dataset for Radar Adaptive Signal Processing Applications
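The peak-finding baseline the regression CNN is compared against can be sketched on a toy NAMF-style heatmap over (range, azimuth): a noise floor plus a blob at the true target cell, localized by taking the argmax. All sizes and noise levels are invented:

```python
import numpy as np

rng = np.random.default_rng(4)

n_range, n_az = 64, 64
true_r, true_a = 40, 22                      # hypothetical target cell

r, a = np.meshgrid(np.arange(n_range), np.arange(n_az), indexing="ij")
blob = np.exp(-((r - true_r) ** 2 + (a - true_a) ** 2) / (2 * 2.0 ** 2))
heatmap = 0.1 * rng.random((n_range, n_az)) + blob   # clutter floor + target

# Peak-finding baseline: the brightest cell is the location estimate.
est_r, est_a = np.unravel_index(np.argmax(heatmap), heatmap.shape)
print(est_r, est_a)  # 40 22
```

At low SCNR the noise floor rises toward the blob's peak and argmax starts jumping to clutter cells, which is exactly the regime where the paper reports the regression CNN keeps working.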
✭A microsized optical spectrometer based on an organic photodetector with an electrically tunable spectral response (Nature Electronics) “Miniaturized optical spectrometers could be of use in portable and wearable applications. Such devices have typically been based on arrays of photodetectors that provide distinct spectral responses or use complex miniaturized dispersive optics. However, these approaches often result in large centimetre-sized systems. Here we report a microsized optical spectrometer that is based on an optical-spacer-integrated photomultiplication-type organic photodetector with a bias-tunable spectral response. The approach allows the computational reconstruction of an incident light spectrum from photocurrents measured under a set of different bias voltages. The device, which has a footprint of 0.0004 cm², is capable of broadband operation across the entire visible wavelength with a sub-5-nm resolution. To illustrate the capabilities of this approach, we fabricate an 8 × 8 spectroscopic sensor array that can be used for hyperspectral imaging.” → ✭ Micro-sized optical spectrometer operates across visible spectrum with sub-5-nm resolution (Phys.org) “"The approach allows the computational reconstruction of an incident light spectrum from photocurrents measured under a set of different bias voltages," He, Li and their colleagues wrote. "The device, which has a footprint of 0.0004 cm², is capable of broadband operation across the entire visible wavelength with a sub-5-nm resolution."”
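The computational-reconstruction step is a linear inverse problem: with responsivity R_v(λ) at each bias voltage v, the photocurrents are I = R s for spectrum s, and s is recovered by regularized least squares. A minimal sketch with a random (purely illustrative) response matrix:

```python
import numpy as np

rng = np.random.default_rng(5)

n_bias, n_wavelengths = 150, 60
lam = np.linspace(400, 700, n_wavelengths)          # visible range, nm

R = rng.random((n_bias, n_wavelengths))             # toy bias-tunable responses

# Ground-truth spectrum: two emission peaks.
s_true = (np.exp(-((lam - 480) ** 2) / 200) +
          0.6 * np.exp(-((lam - 620) ** 2) / 300))

I = R @ s_true + 1e-4 * rng.normal(size=n_bias)     # noisy photocurrents

# Ridge-regularized inversion of the measurements.
alpha = 1e-3
s_hat = np.linalg.solve(R.T @ R + alpha * np.eye(n_wavelengths), R.T @ I)

rel_err = np.linalg.norm(s_hat - s_true) / np.linalg.norm(s_true)
print(rel_err < 0.05)  # True: the two-peak spectrum is recovered
```

The device paper's contribution is making the rows of R electrically tunable on a single micro-scale detector, so no detector array or dispersive optics are needed; the inversion itself is standard.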
✭Simple autonomous agents can enhance creative semantic discovery by human groups | Nature Communications “Innovation is challenging, and theory and experiments indicate that groups may be better able to identify and preserve innovations than individuals. But innovation within groups faces its own challenges, including groupthink and truncated diffusion. We performed experiments involving a game in which people search for ideas in various conditions: alone, in networked social groups, or in networked groups featuring autonomous agents (bots). The objective was to search a semantic space of 20,000 nouns with defined similarities for an arbitrary noun with the highest point value. Participants (N = 1875) were embedded in networks (n = 125) of 15 nodes to which we sometimes added 2 bots. The bots had 3 possible strategies: they shared a random noun generated by their immediate neighbors, or a noun most similar from among those identified, or a noun least similar. We first confirm that groups are better able to explore a semantic space than isolated individuals. Then we show that when bots that share the most similar noun operate in groups facing a semantic space that is relatively easy to navigate, group performance is superior. Simple autonomous agents with interpretable behavior can affect the capacity for creative discovery of human groups.”
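The three bot strategies can be sketched over a toy semantic space where each "noun" is a point and a bot re-shares one of its neighbors' nouns. Everything numeric here is an invented stand-in for the paper's 20,000-noun similarity structure:

```python
import numpy as np

rng = np.random.default_rng(6)

nouns = rng.uniform(-1, 1, size=(20000, 2))        # toy 2-D semantic space
neighbor_picks = rng.choice(len(nouns), size=15)   # nouns the bot's neighbors found

def bot_share(picks, strategy):
    if strategy == "random":
        return rng.choice(picks)                   # re-share a random neighbor noun
    # Score candidates by similarity to the group's running theme,
    # here simplified to the mean embedding of the neighbors' nouns.
    sims = nouns[picks] @ nouns[picks].mean(axis=0)
    order = np.argsort(sims)
    return picks[order[-1]] if strategy == "most_similar" else picks[order[0]]

for s in ("random", "most_similar", "least_similar"):
    print(s, int(bot_share(neighbor_picks, s)))
```

The "most similar" bot reinforces the group's current direction (helpful when the space is easy to navigate, per the paper's finding), while the "least similar" bot injects diversity.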
📚Retroactive Readings: ChatGPT, or the Eschatology of Machines - Journal #137 (Yuk Hui | June 2023)
✭ChatGPT, or the Eschatology of Machines - Journal #137 (Yuk Hui | June 2023) “If the opposition between mechanism and organism characterizes a grand debate of modern philosophy, determining the direction of its development, then the debate persists today, when so many of the statements discrediting AI and ChatGPT assume that machines are only mechanistic and therefore unable to understand semantic meaning. It would be equally wrong to claim that machines are merely a failed imitation of human understanding when it comes to semantic meaning. Philosopher and cognitive scientist Brian Cantwell Smith has sharply criticized this anthropomorphic thinking, defending a machinic intentionality. For him, even if one finds no human intentionality in a machine, it remains a form of intentionality nonetheless; it is semantic, even if not in the sense of human language. ~ Such a separation of anthropomorphic semantics from machine semantics is fundamental for rethinking our relations with machines, yet only as a first step. ~ Searle’s argument fundamentally ignores the recursive form of calculation performed by today’s machines. One may argue that computer science shouldn’t be conflated with cybernetics, since cybernetics is too overarching a science. However, one can also think about Gödel’s recursive function and its equivalence with the Turing Machine and Alonzo Church’s lambda calculus (a well-known story in the history of computation). The term “recursivity” doesn’t only belong to cybernetics; it also belongs to post-mechanistic thinking. The advent of cybernetics only announced the possibility of realizing this recursive thinking in cybernetic machines. The “intelligence” found in machines today is a reflective form of operation, as both Gotthard Günther and Gilbert Simondon rightly observed.”


