M.A.V.I.S: My Average Very Intelligent System

Ryan, Kellen, Charlie, Biruk, Anthony

Introduction

Artificial intelligence (AI) is advancing quickly, reshaping the way people interact with technology. M.A.V.I.S (My Average Very Intelligent System) is an AI-powered voice assistant designed to provide an engaging and interactive user experience. It is inspired by the fictional AI J.A.R.V.I.S from the Iron Man comics and movies: like traditional assistants such as Siri and Google Assistant it responds to spoken requests, but it offers a more conversational and witty personality. In contrast to the majority of existing voice assistants, which rely on rigid command-based systems, M.A.V.I.S is designed around contextual awareness and user engagement. The model is offered as a mobile app on the iOS platform, developed in the Xcode editor. By leveraging the open-source tools VOSK for speech recognition, RASA for natural language understanding, and Mozilla TTS for speech synthesis, the system balances performance, flexibility, and ethical AI principles.

M.A.V.I.S is an AI system that not only understands user intent but also communicates in a human-like manner. Our approach uses a pipeline that transforms voice input into engaging interactions, from recording audio waveforms to generating intelligent responses and rendering speech output; it is a complete redesign of the typical AI assistant. By emphasizing user interaction, accessibility, and ethics, we aim to make interactions with technology more responsive and engaging. M.A.V.I.S also extends beyond conversation by integrating seamlessly with everyday apps. With direct access to your calendar, it can schedule, reschedule, or remind you of appointments, ensuring you never miss an important date; by linking to weather services, it provides real-time weather updates and forecasts to help you plan your day. These features are planned for implementation in a minimum viable product, and a sketch of the calendar integration appears below.
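To make the calendar integration concrete, here is a minimal sketch of how M.A.V.I.S could create an appointment on the user’s behalf using the Google Calendar API. It assumes OAuth credentials have already been obtained (see Methods); the function name, event details, and time zone are illustrative placeholders rather than the final implementation.

```python
# Minimal sketch: creating a calendar event on the user's behalf.
# Assumes OAuth credentials (`creds`) were obtained earlier via Google OAuth 2.0;
# the event details and time zone below are illustrative placeholders.
from googleapiclient.discovery import build

def schedule_appointment(creds, summary, start_iso, end_iso, tz="America/Los_Angeles"):
    """Insert an event into the user's primary Google Calendar."""
    service = build("calendar", "v3", credentials=creds)
    event = {
        "summary": summary,
        "start": {"dateTime": start_iso, "timeZone": tz},
        "end": {"dateTime": end_iso, "timeZone": tz},
        "reminders": {"useDefault": True},
    }
    created = service.events().insert(calendarId="primary", body=event).execute()
    return created.get("htmlLink")  # link the assistant can read back to the user
```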

When building M.A.V.I.S, we looked at a lot of other projects and research to understand what’s already out there and where we could do something different. One of the biggest areas we focused on was the ethics of advanced AI assistants. A paper by Gabriel et al. (2024) raises important questions about trust, misuse, and value alignment [1]. That stuck with us, and it’s why we’re putting a lot of emphasis on transparency, feedback loops, and diverse data from the start. We don’t want to just build something that works; we want it to be responsible and respectful of the people using it.

On the technical side, M.A.V.I.S uses a combination of VOSK for speech-to-text, RASA for natural language understanding, and Mozilla TTS for generating speech. This setup is closely aligned with a few other projects, like My Assistant SRSTC [2] and another voice assistant built using GPT-3.5 and the Web Speech API [3]. Those projects show how to stitch together the components of a speech pipeline, and they’ve been helpful in figuring out how to structure ours; a sketch of such a pipeline appears below. The difference is that we’re going beyond just making the assistant functional: we’re adding personality, visual waveform feedback, and a more dynamic way of responding to users.
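As a rough illustration of how these three stages fit together, here is a minimal sketch. The API details are assumptions rather than our final code: Vosk’s KaldiRecognizer, Rasa’s REST webhook (which must be running as a separate server), and the high-level API exposed by the Coqui successor to Mozilla TTS.

```python
# A rough sketch of the three-stage voice pipeline: speech-to-text (Vosk),
# dialogue/NLU (a running Rasa server's REST webhook), and text-to-speech
# (the Coqui successor to Mozilla TTS). Model paths/names are placeholders.
import json
import wave
import requests
from vosk import Model, KaldiRecognizer
from TTS.api import TTS

def transcribe(wav_path, vosk_model):
    """Speech-to-text: feed 16 kHz mono PCM audio through Vosk."""
    wf = wave.open(wav_path, "rb")
    rec = KaldiRecognizer(vosk_model, wf.getframerate())
    while True:
        data = wf.readframes(4000)
        if not data:
            break
        rec.AcceptWaveform(data)
    return json.loads(rec.FinalResult())["text"]

def respond(user_text, rasa_url="http://localhost:5005/webhooks/rest/webhook"):
    """NLU + dialogue: send the transcript to a running Rasa server."""
    replies = requests.post(rasa_url, json={"sender": "user", "message": user_text}).json()
    return " ".join(r.get("text", "") for r in replies)

def speak(text, tts, out_path="reply.wav"):
    """Text-to-speech: render the assistant's reply to a WAV file."""
    tts.tts_to_file(text=text, file_path=out_path)
    return out_path

vosk_model = Model("model")  # path to a downloaded Vosk model directory
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
speak(respond(transcribe("input.wav", vosk_model)), tts)
```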

There’s also some cool research like LLaSM (Large Language and Speech Model) that tries to merge speech and text into a single training process [4]. We’re not doing exactly that, but it definitely influenced how we think about context in speech. While LLaSM is more focused on large-scale, instruction-following models, we’re focused on real-time interaction and creating an assistant that feels alive, more like J.A.R.V.I.S than a basic Q&A bot.

We also took inspiration from projects that focus on specific user groups. One study looked at how older adults interact with voice assistants, and it pointed out how rule-based models don’t always understand different speech patterns, especially as people age [5]. This made us think more about inclusivity and how we can make M.A.V.I.S flexible enough to work well for everyone, not just the tech-savvy.

Another interesting idea came from a paper about real-time emotion recognition in smart home assistants. They used features like MFCCs fed into CNNs to detect emotional tone in speech [6]; the sketch below shows what extracting those features looks like. We’re not fully there yet, but it got us thinking about how emotional awareness could be a future direction for M.A.V.I.S, making it not just smart, but empathetic and reactive in more human ways.
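For reference, this is roughly what extracting MFCC features of the kind that paper feeds into a CNN might look like. This is illustrative only: librosa is our assumed library, the file name is a placeholder, and M.A.V.I.S does not do emotion recognition yet.

```python
# Illustrative only: extracting MFCC features of the kind the emotion-recognition
# paper feeds into a CNN. The file name is a placeholder.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)      # mono audio at 16 kHz
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)
# A fixed-size summary (mean per coefficient) is a simple CNN-free baseline.
features = np.mean(mfccs, axis=1)
print(features.shape)  # (13,)
```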

And finally, to help with some of the nuts and bolts of building this, we leaned on resources like Natural Language Processing: An Introduction [7]. It’s more of a general guide, but it helped us troubleshoot and understand how to better structure our NLP models, especially when users go off-script or say something unexpected.

Overall, our approach to M.A.V.I.S blends what’s been done before with some fresh takes: giving it a real personality, making it visually and auditorily engaging, and thinking through the ethical side from the beginning. It’s not just about building a tool; it’s about reimagining what interacting with AI could feel like.

Methods

Our project is developed entirely in Python, which serves as the language in which we integrate various open-source tools and libraries. The first step in our pipeline is speech recognition: we use the Whisper model developed by OpenAI to efficiently convert spoken input into text. Audio files are uploaded through a web interface built with Flask, then stored and transcribed with Whisper, and the resulting text is returned to the user in real time via a JSON API response. Google OAuth 2.0 is also integrated for authentication and Calendar API access, which allows the model to act as an assistant. The application maintains session state using Flask’s session mechanism and stores OAuth tokens securely for continued use. Audio files are timestamped and saved in a local directory to ensure traceability and to support later analysis, enhancement, and personalization. Potential pitfalls include handling varying audio qualities and file formats, and maintaining data security and privacy throughout the system. A condensed sketch of the transcription endpoint is shown below.
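The following is a condensed sketch of the transcription endpoint described above, assuming the open-source openai-whisper package. The route name, storage directory, and model size are placeholders, and the OAuth and session-handling pieces are omitted for brevity.

```python
# Condensed sketch of the Flask + Whisper transcription endpoint.
# Route name, directory, and model size are assumptions; OAuth is omitted.
import os
from datetime import datetime
from flask import Flask, request, jsonify
import whisper  # pip install openai-whisper

app = Flask(__name__)
model = whisper.load_model("base")  # loaded once at startup
AUDIO_DIR = "recordings"
os.makedirs(AUDIO_DIR, exist_ok=True)

@app.route("/transcribe", methods=["POST"])
def transcribe():
    audio = request.files["audio"]
    # Timestamped filename gives traceability for later analysis.
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    path = os.path.join(AUDIO_DIR, f"{stamp}.wav")
    audio.save(path)
    result = model.transcribe(path)
    return jsonify({"text": result["text"]})  # real-time JSON API response

if __name__ == "__main__":
    app.run(debug=True)
```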

Ethical Sweep

General Questions:

- Should we even be doing this? Yes; if built responsibly, M.A.V.I.S can enhance user interaction and accessibility, but it requires careful oversight to ensure ethical use.
- What might be the accuracy of a simple non-ML alternative? A non-ML approach would likely have lower accuracy and flexibility in understanding natural language compared to ML models.
- What processes will we use to handle appeals/mistakes? We plan to implement user feedback systems, clear error reporting channels, and regular reviews to correct issues and update the system.
- How diverse is our team? Our team includes members from various technical and creative backgrounds.
- How will we maintain transparency for the model? We will document model decisions, communicate its limitations, and provide explanations for its generated responses when possible.

Data Questions:

- Is our data valid for its intended use? The data will require ongoing validation and updates to maintain its relevance and accuracy.
- What bias could be in our data? The data may contain language, cultural, or demographic biases favoring certain perspectives over others, which we must identify and address.
- How could we minimize bias in our data and model? We can minimize bias by sourcing diverse and balanced datasets and applying bias detection techniques during training. This requires comprehensive sampling strategies across different demographics, languages, and cultural contexts. Regular testing against known bias benchmarks and feedback loops from diverse user groups will help identify and mitigate biases as they emerge. Maintaining transparent documentation about known limitations and potential bias areas will also allow for ongoing improvement of the system.
- How should we “audit” our code and data? We will implement automated testing, clear documentation, and human auditing of the system.
- How can we ensure data security? If M.A.V.I.S were to be released at large scale, encryption of stored and transmitted data would be of utmost importance for user privacy. We will also limit data retention and anonymize sensitive information.

Discussion/Results

We will evaluate M.A.V.I.S based on its responsiveness, conversational accuracy, and user engagement in various testing scenarios. Initial results will focus on system latency, speech recognition quality, and how naturally the assistant interacts with users. Metrics like response time and word error rate will be used to assess the performance of the speech recognition and text-to-speech systems; a sketch of the word-error-rate computation appears below. We also plan to log and analyze how well RASA classifies intents and handles follow-up questions. The evaluation will include both objective data and subjective user feedback to determine whether interactions feel smooth and intuitive. We would also like to compare M.A.V.I.S with existing voice assistants in terms of transparency, customization, and open-source accessibility, and touch on the trade-offs between customizability and ease of use relative to proprietary systems.
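Word error rate can be computed with an off-the-shelf library; jiwer is one common choice, and the transcripts below are illustrative rather than real evaluation data.

```python
# Sketch of the word-error-rate metric we plan to report.
# jiwer is one common choice; the transcripts here are illustrative.
from jiwer import wer

reference  = "schedule a meeting with charlie at noon tomorrow"
hypothesis = "schedule a meeting with charlie at new tomorrow"
print(f"WER: {wer(reference, hypothesis):.2f}")  # fraction of words in error
```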

User Feedback / Evaluation of Model

We will collect feedback through a short survey given to users after they interact with M.A.V.I.S, consisting of the following questions:

How would you rate the quality/accuracy/relevancy of responses?*

How would you rate the model’s “personality”?*

How would you rate the model’s weather/calendar functionality?*

*Each rated on a scale from 1 to 5

Reflection

The project explored speech recognition, generative artificial intelligence, and text-to-speech technologies and frameworks. We demonstrated the feasibility of our approach with our final product, one that differentiates itself from preexisting projects. We investigated the ethical implications of our research and evaluated model performance. Artificial intelligence is a rapidly progressing and evolving field, so the tool can be retrofitted with more advanced models in the future. We can improve the user experience by refining the front-end to make the tool more intuitive, and we can expand the current functionality of the web application to include other useful features. Moving forward, we can direct attention to developing the iOS application with similar practical features.

What’s Left

Here is what is left for us to do:

- Make using the tool more streamlined
- Make the UI more visually appealing
- Improve consistency when editing the calendar
- Make the system more secure (avoid hardcoding API tokens)
- Clean up the paper
- Revise sections
- Add more detail in Methods
- Evaluate the model and include the results in the paper
- Get sources for the Methods section and update the references

Conclusion/Future Work

Our goal is to create a voice assistant that feels more human and less robotic. Future work includes improving emotional intelligence, expanding accessibility for diverse user groups, and optimizing performance for mobile deployment. We also want the AI to pick up on different speech patterns more accurately, so it can handle conversations with many different people.

Impact Questions

- Do we expect different error rates for different sub-groups in the data? Yes; variations in demographics can lead to different error rates, and subgroup testing (sketched below) could help.
- What are likely misinterpretations of the results, and what can be done to prevent them? Users may interpret the AI’s answers as definitive, which can be mitigated with clear disclaimers.
- How might we impinge on individuals’ privacy and/or anonymity? Handling voice-to-text data carefully is important, and storage can be made more secure with encryption.
- What is the environmental impact of training and running M.A.V.I.S? All AI models require a significant amount of computational power, which in turn has an energy cost. We will consider optimizing efficiency and minimizing energy use where possible.
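As a hypothetical illustration of subgroup testing, the snippet below compares word error rate across demographic groups in a small labeled evaluation set. The column names and data are placeholders, not real results.

```python
# Hypothetical subgroup audit: comparing word error rate across demographic
# groups in a labeled evaluation set. Column names and data are placeholders.
import pandas as pd
from jiwer import wer

df = pd.DataFrame({
    "group":      ["18-30", "18-30", "65+", "65+"],
    "reference":  ["turn on the lights", "what is the weather",
                   "call my daughter", "read my calendar"],
    "hypothesis": ["turn on the lights", "what is the weather",
                   "call my doctor", "red my calendar"],
})
per_group = df.groupby("group").apply(
    lambda g: wer(list(g["reference"]), list(g["hypothesis"]))
)
print(per_group)  # a higher WER for one subgroup flags a fairness gap to address
```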

References

1. The Ethics of Advanced AI Assistants. https://arxiv.org/abs/2404.16244

2. My Assistant SRSTC: Speech Recognition and Speech to Text Conversion. https://ieeexplore.ieee.org/abstract/document/10593324

3. Artificial Intelligence-Based Chatbot with Voice Assistance. https://ieeexplore.ieee.org/abstract/document/10545197

4. LLaSM: Large Language and Speech Model. https://arxiv.org/pdf/2308.15930

5. Understanding Older People’s Voice Interactions with Smart Voice Assistants: A New Modified Rule-Based Natural Language Processing Model with Human Input. https://pmc.ncbi.nlm.nih.gov/articles/PMC11135128/

6. Real-Time Speech Emotion Analysis for Smart Home Assistants. https://ieeexplore.ieee.org/abstract/document/9352018

7. Natural Language Processing: An Introduction. https://www.researchgate.net/publication/51576224_Natural_language_processing_An_introduction