Artificial Intelligence: 5 Technologies That Will Change the Future

How ChatGPT is revolutionizing robot programming

Modern artificial intelligence technologies, including ChatGPT, are radically changing approaches to programming. Researchers from Microsoft have developed advanced methods that make it possible to effectively use this powerful tool to automate tasks related to the control of robots, drones, and manipulators. By integrating AI into the coding process, developers can significantly speed up software creation, improve code quality, and increase its performance. The use of artificial intelligence in programming opens up new possibilities for creating complex automation and robotics systems, which makes this field especially relevant in the context of rapid technological progress.

Developing software for robot control previously required significant time investments and in-depth knowledge of the internal architecture of devices. With the advent of ChatGPT, this process has become significantly easier. Now users can simply formulate commands in natural language, and the neural network converts them into program code. This simplifies the creation of robot applications and makes the technology more accessible to a wider audience, including those with no programming experience. ChatGPT's capabilities open up new horizons in automation and robot control, accelerating their implementation in various areas of life.

ChatGPT creates a new approach to robotics, enabling high-level interaction between the user and AI. Infographics: Olya Ezhak for Skillbox Media

Previously, controlling robots required specialized commands and libraries for each individual device. Microsoft's researchers instead built a universal library of high-level functions with simple commands, written in a general-purpose language such as Python, which greatly simplifies robot control. Developers can now build automated systems faster, using one unified approach and spending less time integrating different technologies.

The researchers note that the goal is to create conditions for people to interact with robots without the need to master complex programming languages and technical aspects. This will make the technology more accessible and convenient for a wider audience, facilitating the integration of robotics into everyday life.

Initially, ChatGPT did not have knowledge of the new library, but the development team provided examples and instructions. This allowed the neural network to efficiently generate code, adapting to new requirements and tasks. Developers leveraged ChatGPT's capabilities to create high-quality software, significantly simplifying the development process. Using ChatGPT, drones were programmed to find objects indoors, navigate along specified routes, and perform tasks such as photography. For example, when asked to "take a selfie using a reflective surface," the AI generated code that allowed the quadcopter to detect a mirror and capture its reflection. This demonstrates the potential of AI in drone control, opening up new horizons for their use in a variety of fields, including entertainment, security, and research.
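To make the idea of a simple function library plus instructions more concrete, here is a minimal sketch. The `DroneAPI` class, its method names, and the prompt wording are hypothetical and are not Microsoft's actual library; the model's reply is replaced by a hard-coded string so the example runs on its own.

```python
class DroneAPI:
    """Hypothetical high-level wrapper; a real one would call the drone's SDK."""

    def __init__(self):
        self.position = [0.0, 0.0, 0.0]
        self.log = []

    def takeoff(self):
        """Take off and hover at a height of 1 m."""
        self.position[2] = 1.0
        self.log.append("takeoff")

    def fly_to(self, x, y, z):
        """Fly to the point (x, y, z), given in metres."""
        self.position = [float(x), float(y), float(z)]
        self.log.append(f"fly_to {x} {y} {z}")

    def take_photo(self):
        """Take a photo with the onboard camera."""
        self.log.append("take_photo")


def build_prompt(api_cls, task):
    """Describe the library to the language model, then state the task."""
    lines = ["You control a drone using only these Python methods:"]
    for name, member in vars(api_cls).items():
        if callable(member) and not name.startswith("_"):
            lines.append(f"- drone.{name}: {member.__doc__}")
    lines.append(f"Task: {task}")
    lines.append("Reply with Python code only.")
    return "\n".join(lines)


prompt = build_prompt(DroneAPI, "Photograph the shelf two metres ahead.")

# Stand-in for the model's reply; a real system would send `prompt` to the model.
generated_code = "drone.takeoff()\ndrone.fly_to(2, 0, 1)\ndrone.take_photo()"

drone = DroneAPI()
exec(generated_code, {"drone": drone})  # generated code should be reviewed before running
print(drone.log)
```

The point of the design is that the model only ever sees a handful of documented, high-level calls, so the same prompt structure could be reused for a different robot by swapping the wrapper class.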

The approach was also tested on a robotic arm, which assembled the Microsoft logo from blocks without being shown an image of it: ChatGPT recalled the company's corporate colors from its training data. The experiment shows how much a language model's background knowledge can contribute to solving robotics tasks.

Using ChatGPT in robotics has its limitations, one of which is the model's inability to perceive visual information. To overcome this problem, the researchers integrated the YOLO neural network, designed for object recognition. This integration allows ChatGPT to receive data about the environment, making it possible to control robots in real time. Thus, the combination of ChatGPT and YOLO opens new horizons in automation and interaction of robots with the surrounding world, significantly expanding their functionality.
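One way to picture this integration is as a small adapter that turns detector output into text a language model can read. The sketch below is an assumption about how that could look, not the researchers' code: the detection format, the 0.5 confidence threshold, and the left/center/right wording are all made up for illustration.

```python
def describe_detections(detections, image_width):
    """Turn detector output into one sentence per object.

    `detections` is a list of (label, confidence, (x_min, y_min, x_max, y_max))
    tuples in pixels, a hypothetical format chosen for this sketch.
    """
    sentences = []
    for label, confidence, (x_min, _, x_max, _) in detections:
        if confidence < 0.5:  # assumed threshold: skip uncertain detections
            continue
        center = (x_min + x_max) / 2 / image_width
        if center < 1 / 3:
            side = "on the left"
        elif center > 2 / 3:
            side = "on the right"
        else:
            side = "in the center"
        sentences.append(f"A {label} is {side} of the camera view.")
    return " ".join(sentences) or "No objects detected."


detections = [
    ("basketball", 0.9, (500, 200, 600, 300)),
    ("chair", 0.3, (0, 0, 50, 50)),  # below the threshold, so it is ignored
]
scene_text = describe_detections(detections, image_width=640)
print(scene_text)
```

The resulting sentence can be appended to the prompt on every camera frame, which is how a text-only model could be kept informed about a changing scene.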

Using data from YOLO, ChatGPT was able to effectively control the robot's movements to successfully catch a basketball. This example illustrates how a language model can create spatial representations based on text information. The use of cutting-edge technologies like YOLO and ChatGPT opens new horizons in robotics and human-machine interaction. This also highlights the importance of integrating artificial intelligence into tasks requiring visual perception and spatial analysis.

Researchers have developed a virtual simulator integrated with ChatGPT. As part of this project, the PromptCraft-Robotics community was created on the GitHub platform. Users can test the new simulation method and share examples, thereby promoting the development and improvement of robotics technologies.

Competitors at Google introduced a similar concept, known as Code as Policies (CaP), releasing the source code on GitHub three months before Microsoft. However, their method has not attracted the same level of interest as ChatGPT.

Google's Code as Policies method uses a language model to convert natural language into robot control code. Image: googleblog.com (translated by Skillbox Media)

Authors from Google focused on the potential risks of using artificial intelligence in programming robots. A key aspect is the possibility of unpredictable behavior of devices if the generated programs do not pass human checks. This highlights the importance of control and validation of AI solutions to ensure the safety and reliability of robotic systems.

PaLM-SayCan and PaLM-E: Innovations in Embodied AI from Google

Since the launch of the PaLM transformer neural network, Google researchers have been actively working on integrating large language models (LLMs) into robot control. With 540 billion trainable parameters, PaLM has roughly three times as many as GPT-3.5 (GPT-3, on which it is based, has 175 billion). Integrating LLMs into robotics can significantly improve the functionality and adaptability of robots in a variety of fields, from industry to everyday life.

Large language models like PaLM offer impressive capabilities for describing task execution processes. However, their knowledge remains largely theoretical, as language neural networks lack a physical body and are unable to interact with the real world. This aspect has long been considered their main limitation. However, as technology advances and language models are integrated with robotics and other systems, their practical applications are expanding significantly.

In April 2022, the Google team announced the next stage of technological development by combining their PaLM language model with the Everyday Robot, a robotic assistant. This robot is designed to perform routine tasks in both office and home environments. This convergence of technologies opens up new possibilities for automation and increased efficiency in everyday life, highlighting Google's commitment to creating innovative solutions for users.

Everyday Robot is an ambitious Google project to develop robots that can perform household tasks such as cleaning and cooking. First demonstrated in 2019, these robotic assistants serve as a test platform for neural network control systems. Each robot is equipped with ultrasonic sensors, multiple cameras, inertial measurement units (IMUs), and lidar, which allow it to navigate and interact with its surroundings. The project aims to make everyday life easier through advanced technology and automation.

PaLM-SayCan combines the PaLM language model with the Everyday Robot. In this system, the language model functions as the "brain," while the robot acts as the "eyes and hands." Google describes it as the first system to use a large language model to plan for a real robot. PaLM-SayCan demonstrates how the synergy between language models and robots can improve human-machine interaction and expand automation capabilities across a variety of domains.

When a user gives the PaLM-SayCan robot a natural language command such as "I spilled a drink, help me clean it up," the robot proposes a set of actions to solve the problem. PaLM understands context well and can adapt to different situations, which makes the robot convenient for everyday tasks.

The internal interface of PaLM-SayCan resembles the interface of the T-800 robot from the movie "Terminator". Image: googleblog.com (translated by Skillbox Media)

PaLM can generate action plans, but not all of them can be carried out in practice. For example, it may recommend that the robot use a vacuum cleaner when there is none in the house. To address this, the researchers developed a method, SayCan, that helps the language model take the robot's context and environment into account.

The SayCan control method includes two key components. The first part is responsible for defining possible actions suggested by the language model, such as "use a vacuum cleaner" or "mop". The second part focuses on assessing the probability of successful execution of each of the suggested actions. This approach enables efficient selection of the most appropriate solutions in the context of user interactions and task execution.

The robot analyzes commands and selects the most appropriate ones for execution, dividing complex tasks into simpler and more manageable steps. This approach ensures efficiency and accuracy in task execution, helping to optimize the workflow.
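The selection step can be illustrated with a few lines of code. This is a sketch, assuming that the two scores are simply multiplied and the highest product wins; the action names and all numbers are invented for illustration and are not measurements from PaLM-SayCan.

```python
def choose_action(llm_scores, affordances):
    """Pick the action with the highest combined score.

    llm_scores:  how relevant the language model finds each action (0..1)
    affordances: estimated chance the robot can do it right now (0..1)
    """
    combined = {a: llm_scores[a] * affordances.get(a, 0.0) for a in llm_scores}
    best = max(combined, key=combined.get)
    return best, combined


# Illustrative values only: the house has no vacuum cleaner, so its affordance is 0.
llm_scores = {"use a vacuum cleaner": 0.5, "find a sponge": 0.4, "go to the table": 0.1}
affordances = {"use a vacuum cleaner": 0.0, "find a sponge": 0.8, "go to the table": 0.9}

best, combined = choose_action(llm_scores, affordances)
print(best)
```

Even though the language model on its own prefers the vacuum cleaner in this toy example, the zero affordance removes it from consideration, which is exactly the kind of correction the second component is meant to provide.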

PaLM-SayCan demonstrates the ability of robots to execute complex natural language commands by combining the intelligence of large language models with already mastered actions. This opens new horizons for human-machine interaction, allowing robots to better understand and respond to user needs. The development of such technologies is an important step in creating more intuitive and versatile automation systems, significantly simplifying the performance of everyday tasks.

Google researchers described the system in a blog post titled "Towards Helpful Robots: Grounding Language in Robotic Affordances." It examines how robots can better understand natural language by grounding it in their own capabilities, that is, in the physical actions they are actually able to perform.

Given the task "I spilled cola, can you bring me something to clean it up?", PaLM-SayCan planned and executed the following steps: 1. Find a sponge. 2. Take the sponge. 3. Bring it. 4. Done. The options the AI considered at each step are highlighted in color: language score (blue), affordance score (red), and the combination of the two (green). Image: say-can.github.io (translated by Skillbox Media)

In practical tests, PaLM-SayCan was found to effectively process commands in various languages, such as Chinese, French, and Spanish, while maintaining high performance. This confirms its multilingual capabilities and versatility in use.

The system can interpret vague commands, for example, “I’m home from training, bring me a snack.” In response to such a command, the robot analyzes the available products in the kitchen and selects the most appropriate option, for example, a nutrition bar. This feature makes interaction with the device more natural, allowing the user to get what they want without having to formulate precise requests.

The longest task consisted of 16 sequential steps, all planned and executed by the AI. With PaLM-SayCan, robots chose the correct sequence of actions 84% of the time and carried it out successfully 74% of the time. In addition, results on tasks consisting of eight or more steps improved by 26%. This confirms significant progress in robotics and artificial intelligence.

The Google researchers behind PaLM-SayCan say they are impressed with its results: according to them, the research confirmed its ability to plan and carry out long-horizon, abstract natural language instructions.

In March 2023, the team extended PaLM by integrating the state-of-the-art ViT-22B transformer network, designed for processing visual data. The updated system was named PaLM-E, where the letter "E" stands for "embodied." This significantly expanded the model's functionality, allowing it to handle a variety of visual tasks.

ViT-22B turned PaLM-E into a multimodal visual-language model (VLM): the model can now "see" and associate images with textual information. The system has 562 billion parameters in total: 540 billion from PaLM plus 22 billion from ViT-22B.

Multimodal capabilities of PaLM-E. Image: palm-e.github.io (translated by Skillbox Media)

The core architectural concept of PaLM-E is to integrate continuous observations, such as images and sensor data, into a pre-trained language model. This architectural idea enables a deeper connection between visual information and text data, which contributes to improved understanding and processing of information. Integrating such observations into a language model opens up new possibilities for applications in various fields, including natural language processing and computer vision.
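The idea of feeding continuous observations into a language model can be sketched in a few lines. This toy example assumes one common approach: image features are passed through a linear projection so they have the same size as word embeddings, and are then placed in the input sequence alongside the words. The dimensions, the `<img>` placeholder, and the random weights are illustrative assumptions, not PaLM-E's actual configuration.

```python
import random

random.seed(0)
EMBED_DIM = 4   # toy size of the language model's token embeddings
VISION_DIM = 3  # toy size of the image encoder's feature vectors

# Toy word-embedding table; a real model learns these values.
vocab = {w: [random.random() for _ in range(EMBED_DIM)]
         for w in ["what", "is", "in", "?"]}

# Projection from vision features into the word-embedding space (assumed linear).
projection = [[random.random() for _ in range(VISION_DIM)]
              for _ in range(EMBED_DIM)]


def project(feature):
    """Map one vision feature vector to the language model's embedding size."""
    return [sum(w * x for w, x in zip(row, feature)) for row in projection]


def build_input(prompt, image_features):
    """Replace each "<img>" placeholder with the projected image vectors."""
    sequence = []
    for token in prompt:
        if token == "<img>":
            sequence.extend(project(f) for f in image_features)
        else:
            sequence.append(vocab[token])
    return sequence


image_features = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # two made-up image patches
sequence = build_input(["what", "is", "in", "<img>", "?"], image_features)
print(len(sequence), "vectors of size", len(sequence[0]))
```

From the language model's point of view, the image patches are then just more vectors in the sequence, which is what lets a pre-trained text model be reused.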

Google researchers presented a paper titled "PaLM-E: An Embodied Multimodal Language Model", which was published on the arXiv platform. The paper describes a multimodal language model capable of processing and integrating various types of data, including text and visual information. The importance of this work lies in its potential to create systems that better understand context and perform complex tasks based on multimodal data.

PaLM-E is a versatile model. Like other language models, it can answer questions and generate natural language text, but it also takes visual context into account. In particular, PaLM-E can:

Provide a multimodal logical chain of reasoning containing linguistic and visual data.
Respond promptly to changes in the situation during task completion.
Transfer knowledge and skills gained from previous tasks to new ones.

Robotics Innovations: Transformers and Visual "Hallucinations"

Models such as PaLM-SayCan handle high-level planning of a robot's actions. PaLM-SayCan can be thought of as the "mind" that makes strategic decisions, while the Robotics Transformer (RT-1) model, unveiled by Google in December 2022, is responsible for more instinctive, reflexive reactions. Together they help robots adapt quickly to changing conditions.

RT-1, with 35 million parameters, accepts images and text commands, such as "pick up an object," and generates control commands for the robot from them. The model learns to perform tasks by imitating actions recorded in a kitchen.

To train RT-1, researchers collected over 130,000 annotated videos of 13 Everyday Robot robots performing 700 standard tasks in conditions as close as possible to a typical kitchen. Data collection took 17 months.

The data obtained showed that RT-1 successfully completed 97% of 700 tasks, a 25% improvement over previous algorithms. Furthermore, thanks to its ability to generalize, RT-1 demonstrates high performance when solving new tasks for which examples were not provided in the training data, achieving a 76% success rate. These results highlight RT-1's advantages in the field of artificial intelligence and expand its capabilities in solving complex problems.

Google researchers realized that more data was needed to further develop the RT-1 model, despite the extensive dataset already available. As a result, they added 209,000 new examples collected using a KUKA robotic arm. This solution significantly improved the robot's skills and increased its efficiency in performing various tasks.

RT-1 turned out to be able to learn new skills from the experience of other robots. For example, after data from the KUKA arm was added, its success rate on bin-picking tasks (picking objects out of a container) nearly doubled. This shows the value of pooling data across different robots.

Vincent Vanhoucke, head of robotics at Google Research, emphasized that despite the lack of direct communication between robots, it is possible to effectively combine different data sets from different types of robots. This allows knowledge to be transferred between them, similar to the process of information exchange between people. This approach opens new horizons in the development of artificial intelligence and robotics, paving the way for more complex interactions and improving robot functionality.

Researchers aim to reduce the time it takes to collect new data by using generative models such as DALL-E 2 and Stable Diffusion. These models not only generate new images but also modify existing ones, adding elements that can be described as "hallucinations" of artificial intelligence. Applied to robot training footage, this makes it possible to produce varied training examples without recording each one on a real robot.

In February 2023, a new methodology called ROSIE, which stands for Scaling Robot Learning with Semantically Imagined Experience, was presented. This innovative approach involves the use of three neural networks: OWL-ViT, designed for image segmentation, GPT-3, which is responsible for generating text prompts, and Imagen, which creates synthetic images. The ROSIE methodology opens new horizons in the field of robotics and machine learning, improving the quality of robot training and their interaction with the environment.

ROSIE takes text instructions and modifies the original training videos accordingly. For example, if a video shows a blue sponge, ROSIE can replace it with a red one or insert an entirely new object. This produces new training examples quickly, without the need to reshoot the material. To do this, ROSIE:

Analyzes text instructions to identify areas of the original video that require changes.
Uses inpainting to modify specific parts of the image while keeping other elements intact.
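The two steps above can be mimicked on a toy "frame" made of labels instead of pixels. This is only a sketch of the mask-then-inpaint idea: `mask_for` stands in for the detection step and a constant value stands in for the generative model, so neither function reflects ROSIE's real components.

```python
def mask_for(frame, target):
    """Stand-in for the detector: mark every pixel equal to `target`."""
    return [[pixel == target for pixel in row] for row in frame]


def inpaint(frame, mask, new_value):
    """Return a copy of `frame` where only the masked pixels are replaced.

    A real system would ask a generative model for the new pixels;
    here a constant stands in for them.
    """
    return [
        [new_value if masked else pixel for pixel, masked in zip(row, mask_row)]
        for row, mask_row in zip(frame, mask)
    ]


frame = [
    ["table", "table", "table"],
    ["table", "blue sponge", "table"],
]
edited = inpaint(frame, mask_for(frame, "blue sponge"), "red sponge")
print(edited)
```

Because everything outside the mask is copied unchanged, the robot's arm and the recorded actions remain valid for the edited frame, which is what makes the edited video usable as a new training example.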

This method not only helps RT-1 learn to handle new objects but also makes it more robust to visual distractions. For example, ROSIE can "imagine" objects that are not present in the original videos, which broadens the range of situations the model can be trained on.

An evaluation of 243 AI-enhanced examples demonstrated that the ROSIE method significantly improves the model's generalization ability and its resilience to distractions. This allows the RT-1 system to effectively solve more complex problems, increasing its efficiency by 75%.

Robots learn from synthetic videos and develop internal dialogue

Researchers from Google Brain, the University of California at Berkeley, MIT, and the University of Alberta have presented a new approach that eliminates the need for real data to train robots. Instead of traditional methods, the scientists propose using artificial intelligence to generate training videos. This approach not only improves training efficiency but also opens up new horizons in the development of robotics, allowing for the creation of more adaptive and intelligent systems.

In January 2023, the Universal Policy (UniPi) model was announced. It combines a T5-XXL language neural network with 4.6 billion parameters and a generative model that creates video frames from text descriptions. The generated video serves as a plan of the robot's actions rather than as an end product.

UniPi uses images as a universal interface and text as the task specification, while the planning module works independently of the type of action performed. Because plans are expressed as images and tasks as text, they are also easy for people to inspect and understand.

The UniPi algorithm consists of the following steps:

The neural network receives a photograph showing the initial position of the manipulator and the surrounding environment as input.
A text task formulated by a person is added to the photograph.
Using the photograph as the first frame, the neural network generates subsequent frames, imagining how the manipulator should move to complete the task.
Each frame of the generated video is converted into a set of commands for the real manipulator.
Following these commands, the robot performs the actions shown in the synthetic video.
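The steps above can be sketched as a three-stage pipeline. Everything here is a toy stand-in: a "frame" is reduced to the gripper's (x, y) position, the "video model" just moves it one cell to the right per frame, and the function names are hypothetical rather than taken from the UniPi code.

```python
def generate_video(first_frame, task, steps=3):
    """Stand-in for the video model: move the gripper one cell right per frame.

    A frame here is just the gripper's (x, y) position; `task` is ignored
    by this toy generator.
    """
    frames = [first_frame]
    for _ in range(steps):
        x, y = frames[-1]
        frames.append((x + 1, y))
    return frames


def frames_to_commands(frames):
    """Convert each pair of consecutive frames into the move that links them."""
    return [(b[0] - a[0], b[1] - a[1]) for a, b in zip(frames, frames[1:])]


def execute(start, commands):
    """Apply the commands to the (simulated) manipulator."""
    x, y = start
    for dx, dy in commands:
        x, y = x + dx, y + dy
    return (x, y)


frames = generate_video((0, 0), "move the block to the right")
commands = frames_to_commands(frames)
final_position = execute((0, 0), commands)
print(commands, final_position)
```

The separation matters: only the middle function depends on the specific robot, so in principle the same generated video could drive different manipulators.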

The process can be compared to how a person, looking at a pile of dirty dishes, first imagines how they will wash each one and only then gets to work: the robot likewise "imagines" the task being done before acting.

The project's official website features examples of tasks effectively performed by a robot using UniPi. The robot can rearrange blocks, clean dishes with a sponge, carefully place spoons in a tray, open a faucet, and carry groceries. These demonstrations highlight the versatility and practical applicability of UniPi technology in robotics.

Modern robots can now learn not only from synthetic data but also from real-world videos available online. Using the UniPi platform, robots can simply watch a training video on YouTube, allowing them to effectively master new tasks. This opens up new horizons in machine learning and automation, significantly simplifying the training process and expanding the capabilities of robots in various fields.

Illustration of the UniPi method: on the left is the original frame, on the right are frames generated by the neural network demonstrating imaginary actions. Image: universal-policy.github.io (translated by Skillbox Media)

Research in artificial intelligence continues to advance. Google scientists have presented a system called Inner Monologue, which allows robots to conduct internal dialogues. This technology lets robots not only interact with people but also independently analyze and discuss their own actions, which can significantly increase their autonomy and effectiveness in various tasks.

Inner Monologue gives robots the ability to interact with an integrated language model, which allows them to evaluate the effectiveness of their actions and make necessary adjustments to their plans in case of unforeseen situations. This approach helps to increase the adaptability and effectiveness of robotic systems, allowing them to better respond to changes in the environment.

Researchers distinguish three categories of internal conversation within Inner Monologue: passive description of the environment, active description of the environment, and success detection.

Passive description lets the AI put recognized objects into words, for example, "In front of me is a table with an apple, a chocolate bar, and a bag of chips."

Active description involves asking questions about the current situation, for example, "Should I choose an apple, a chocolate bar, or chips?" Answers to such questions can come from built-in language models or from real people.

Success detection lets the AI analyze the results of its actions and decide what to do next.

Detecting success is a key aspect for robots, allowing them to determine when to complete a task or continue working. Artificial intelligence periodically evaluates its performance by asking itself, "Did I achieve the desired result?" and forming an answer. This self-monitoring process helps improve performance and achieve goals more effectively.

During one test, the researcher asked the robot to bring a soda. When the machine detected a can of Coke and attempted to pick it up, the human discreetly removed the drink from the table. This triggered an internal dialogue, prompting the robot to ask clarifying questions. It assessed the changing situation and adjusted its actions. Ultimately, the robot found another can of Coke and successfully served it. This experiment demonstrates the ability of robots to adapt to changing conditions and make decisions based on situational analysis, which is an important step toward creating smarter and more autonomous machines.
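A test like this one can be imitated with a small loop that describes the scene, acts, checks for success, and retries. This is a simplified sketch under stated assumptions: the world is a dictionary of item counts, the human's interference is simulated on the first attempt only, and the function names are hypothetical, not part of Inner Monologue itself.

```python
def describe(world):
    """Passive scene description: list what is currently visible."""
    visible = [f"{count} x {item}" for item, count in world.items() if count > 0]
    return ", ".join(visible) or "nothing"


def try_pick(item, world, interference):
    """Try to pick up an item; `interference` simulates a human removing it first."""
    if interference:
        world[item] -= 1  # the human takes one away just before the grasp
        return False
    if world.get(item, 0) > 0:
        world[item] -= 1
        return True
    return False


def fetch(item, world, max_attempts=3):
    """Describe, act, check success, and retry, recording each thought as text."""
    monologue = []
    for attempt in range(max_attempts):
        monologue.append(f"Scene: {describe(world)}")
        success = try_pick(item, world, interference=(attempt == 0))
        monologue.append(f"Did I pick up the {item}? {'yes' if success else 'no'}")
        if success:
            monologue.append("Task complete.")
            return True, monologue
        monologue.append(f"Plan update: look for another {item}.")
    return False, monologue


done, monologue = fetch("cola can", {"cola can": 2, "apple": 1})
print("\n".join(monologue))
```

In a real system each line of the monologue would be fed back to the language model as text, so that the next planning step takes the failed attempt into account.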

We were impressed that "Inner Monologue," when presented with new information about the situation, demonstrates an intelligent approach, going beyond the initial textual instructions. Rather than simply following instructions, it actively seeks solutions, suggesting alternative goals if previous ones become unavailable. This demonstrates its ability to adapt to change and find creative solutions to complex situations.

Google researchers presented a paper titled "Inner Monologue: Embodied Reasoning through Planning with Language Models." This study examines the potential of language models for performing complex tasks using internal dialogue. The authors emphasize how combining language models and planning can improve decision-making and task performance. The work contains important findings that can influence the development of natural language processing technologies and their application in various fields.

The future of technology... Still: film "Terminator" / Orion Pictures

The Future of Robotics: Threat or Opportunity?

With the recent closure of Google's Everyday Robots division in February 2023, the world of robotics faces a major turning point. This decision was part of a broader cost optimization strategy in which the company laid off 12,000 employees and closed unprofitable divisions. The closure of Everyday Robots highlights the complex challenges faced by high-tech companies seeking to improve efficiency and reduce costs. Amid increasing competition and the need for innovation, the future of robotics remains uncertain, calling into question further development and research in this field.

According to a former Everyday Robots employee, the team was close to making important discoveries in the field of robotics. "We were just beginning to understand the potential of robots to perform meaningful tasks. Given the opportunity, we could develop a truly valuable product in five years," he noted. This perspective underscores the importance of investing in robotics research and development to achieve breakthrough results.

In today's technological world, concerns about artificial intelligence are growing. Elon Musk, Steve Wozniak, and over a thousand experts have signed an open letter calling for a temporary halt to the development of advanced AI systems. The letter emphasizes the need for government intervention to impose a moratorium on such developments. This initiative reflects growing concern about the potential risks associated with the use of artificial intelligence and emphasizes the importance of a responsible approach to its development.

The letter addresses a key question: should all jobs be automated and is it worth risking our civilization by creating artificial intelligence that could surpass us in both numbers and intelligence? These issues require deep analysis and meaningful discussion. Workplace automation has the potential to significantly change the economy and society, so it's important to weigh the potential risks and benefits to ensure a stable and secure future.

Some experts highlight the potential risks associated with the development of robotics, while others emphasize the importance of innovation in this field. The market for anthropomorphic robots is rapidly growing, leading to increased interest in their functionality and applications in various fields, such as medicine, education, and industry. It's important to consider that with the advancement of technology comes new challenges that require a careful approach to safety and ethics. Innovations in robotics are opening up new horizons, enabling the creation of effective solutions to improve quality of life and optimize production processes.
