We’ve created GPT-4, the latest milestone in OpenAI’s effort in scaling up deep learning. GPT-4 is a large multimodal model (accepting image and text inputs, emitting text outputs) that, while less capable than humans in many real-world scenarios, exhibits human-level performance on various professional and academic benchmarks. For example, it passes a simulated bar exam with a score around the top 10% of test takers; in contrast, GPT-3.5’s score was around the bottom 10%. We’ve spent 6 months iteratively aligning GPT-4 using lessons from our adversarial testing program as well as ChatGPT, resulting in our best-ever results (though far from perfect) on factuality, steerability, and refusing to go outside of guardrails.

Over the past two years, we rebuilt our entire deep learning stack and, together with Azure, co-designed a supercomputer from the ground up for our workload. A year ago, we trained GPT-3.5 as a first “test run” of the system. We found and fixed some bugs and improved our theoretical foundations. As a result, our GPT-4 training run was (for us at least!) unprecedentedly stable, becoming our first large model whose training performance we were able to accurately predict ahead of time. As we continue to focus on reliable scaling, we aim to hone our methodology to help us predict and prepare for future capabilities increasingly far in advance, something we view as critical for safety.

We are releasing GPT-4’s text input capability via ChatGPT and the API (with a waitlist). To prepare the image input capability for wider availability, we’re collaborating closely with a single partner to start. We’re also open-sourcing OpenAI Evals, our framework for automated evaluation of AI model performance, to allow anyone to report shortcomings in our models to help guide further improvements.

[Exam results table; surviving row labels: Medical Knowledge Self-Assessment Program; Graduate Record Examination (GRE) Writing; Graduate Record Examination (GRE) Quantitative]

The model can have various biases in its outputs; we have made progress on these, but there’s still more to do. Per our recent blog post, we aim to make AI systems we build have reasonable default behaviors that reflect a wide swathe of users’ values, allow those systems to be customized within broad bounds, and get public input on what those bounds should be.