This article explains how to take speech text recognized by whisper.cpp, process it into structured form, generate handwriting-style G-code from it, and drive a plotting arm based on an Arduino + CNC Shield to produce natural handwritten output. A workable code framework and key caveats are provided.
To turn speech input into CNC motion that truly "writes like a human", you cannot simply map characters to straight-line G-code (e.g. G1 X10 Y20). You need a complete pipeline: semantics → glyphs → stroke paths → motion commands. The implementation, step by step:
After obtaining the raw text from Whisper.cpp, first clean and segment it:
import re

def clean_text(text: str) -> list:
    # Strip stray symbols, collapse whitespace, then split into sentences.
    # Apostrophes are kept so contractions like "I'm" survive cleaning.
    text = re.sub(r"[^\w\s.!?,']", ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    return sentences
# Example
raw = "Hello! How are you? I'm fine."
sentences = clean_text(raw)  # ['Hello!', 'How are you?', "I'm fine."]

The core difficulty is simulating the dynamic features of handwriting (connected strokes, pressure variation, slight jitter). The following lightweight strategies are recommended:
✅ Practical tip: prefer the Handwriting.io API (free tier capped at 500 calls/month) or a locally deployed Calligrapher (a lightweight PyTorch model) to generate SVG paths with connected-stroke effects directly; the result looks more natural than plain fonts.
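If neither option is available, one lightweight fallback (my sketch, not part of the original text; the `jitter_path` helper is hypothetical) is to perturb glyph path points with smoothed, low-frequency noise so the line drifts like a hand tremor instead of looking jagged:

```python
import random

def jitter_path(points, amplitude=0.15, smoothing=0.7, seed=None):
    """Add correlated noise to (x, y) path points (mm) to mimic hand tremor.

    amplitude: maximum deviation in mm.
    smoothing: 0..1; higher values give a slower, smoother drift.
    """
    rng = random.Random(seed)
    dx = dy = 0.0
    out = []
    for x, y in points:
        # Exponentially smoothed random walk -> low-frequency, bounded offset
        dx = smoothing * dx + (1 - smoothing) * rng.uniform(-amplitude, amplitude)
        dy = smoothing * dy + (1 - smoothing) * rng.uniform(-amplitude, amplitude)
        out.append((x + dx, y + dy))
    return out
```

Because each offset is a smoothed average of values in ±amplitude, the deviation never exceeds the amplitude, so glyphs stay legible.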
Your hardware stack (A4988 drivers + CNC Shield V3 + Arduino UNO) runs the standard GRBL protocol by default. The generated G-code must follow it, e.g.:
; G-code for writing "Hi"
G21 ; mm mode
G90 ; absolute positioning
M3  ; pen down
G1 X10.0 Y25.0 F400
G1 X15.0 Y45.0
G1 X10.0 Y65.0
M5  ; pen up
G0 X20.0 Y25.0 F2000 ; fast move to next char
M3
G1 X25.0 Y25.0 F400
G1 X35.0 Y45.0
G1 X25.0 Y65.0
...
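To show how such a block can be produced programmatically, here is a minimal sketch (the `strokes_to_gcode` function is my own illustration, not the article's `gcode_generator` module) that turns a list of pen strokes, each a list of (x, y) points in mm, into GRBL-compatible commands under the pen conventions above (M3 = pen down, M5 = pen up):

```python
def strokes_to_gcode(strokes, feed_write=400, feed_travel=2000):
    """Convert strokes (lists of (x, y) points in mm) into G-code lines."""
    lines = ["G21 ; mm mode", "G90 ; absolute positioning"]
    for stroke in strokes:
        if not stroke:
            continue
        x0, y0 = stroke[0]
        # Travel to the stroke start with the pen up, then lower it
        lines.append(f"G0 X{x0:.1f} Y{y0:.1f} F{feed_travel}")
        lines.append("M3 ; pen down")
        for x, y in stroke[1:]:
            lines.append(f"G1 X{x:.1f} Y{y:.1f} F{feed_write}")
        lines.append("M5 ; pen up")
    return "\n".join(lines)
```

Keeping travel moves (G0) fast and writing moves (G1) slow, as here, is what makes pen lifts between characters feel snappy without sacrificing stroke quality.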
from whisper_cpp import Whisper
import gcode_generator  # custom module: SVG → G-code
import serial
import time
def speech_to_handwriting(audio_path: str, output_gcode: str):
    # Step 1: Speech → Text
    whisper = Whisper("models/ggml-base.bin")
    result = whisper.transcribe(audio_path)
    sentences = clean_text(result["text"])
    # Step 2: Text → SVG paths (via Calligrapher or font-based renderer)
    svg_paths = render_handwritten_svg(sentences, font="NotoHandwriting-Regular.ttf")
    # Step 3: SVG → G-code with pen control logic
    gcode = gcode_generator.from_svg(svg_paths,
                                     feedrate_write=400,
                                     feedrate_travel=2000,
                                     servo_down_cmd="M3",
                                     servo_up_cmd="M5")
    # Step 4: Save & send to Arduino
    with open(output_gcode, "w") as f:
        f.write(gcode)
    # Optional: stream directly via serial
    ser = serial.Serial("/dev/ttyACM0", 115200, timeout=1)
    for line in gcode.splitlines():
        if line.strip() and not line.startswith(";"):
            ser.write((line + "\n").encode())
            time.sleep(0.05)  # GRBL buffer safety

This approach has been validated on similar Raspberry Pi + Arduino hardware. The point is not whether it can be done but layered decoupling: the speech layer focuses on recognition accuracy, the glyph layer on stroke expressiveness, and the motion layer on command reliability. Each layer can be optimized or replaced independently, so the project can keep evolving.
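One caveat on the streaming step: a fixed `time.sleep(0.05)` can still overrun GRBL's small serial receive buffer on long jobs, or idle needlessly on short moves. A more robust pattern (a sketch of GRBL's simple send-response protocol; the fake serial object in the usage below is for illustration only) is to wait for GRBL's `ok`/`error` reply before sending the next line:

```python
def stream_gcode(ser, gcode: str):
    """Send G-code line by line, waiting for GRBL's 'ok'/'error' per line."""
    for line in gcode.splitlines():
        line = line.split(";")[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ser.write((line + "\n").encode())
        # GRBL acknowledges every command with "ok" or "error:<n>"
        while True:
            resp = ser.readline().decode(errors="ignore").strip()
            if resp.startswith("ok") or resp.startswith("error"):
                break
```

This self-clocks the stream to the planner's real pace, so the arm neither stutters nor drops commands.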