Smart Phone Sensor Data Fusion: A Joint Learning Approach to Activity Recognition
A deep learning system for smartphone-based human activity recognition using accelerometer and gyroscope data, comparing a Full Transformer model with a proposed Joint Learning fusion architecture.
Overview
This project developed a deep learning-based system to classify human activities using smartphone sensor data from accelerometers and gyroscopes. Two architectures were designed and tested: a Full Transformer model and a Joint Learning model that fuses CNN-LSTM and Transformer components.
The goal was to combine spatial and temporal modeling techniques so the system could recognize activity patterns from noisy, multi-dimensional sensor signals while handling subtle transitions between similar movements.
Motivation
Smartphones and wearable devices have made activity recognition useful for fitness tracking, elderly care, and mobile health. Real-world sensor data is noisy and high-dimensional, which makes it difficult for traditional models to capture both local patterns and long-range dependencies.
This research used a hybrid architecture that integrates CNNs for spatial pattern extraction, Transformers for long-range dependencies, and LSTMs for sequential modeling.
Dataset
The work used the UCI-HAR (Human Activity Recognition) dataset, a benchmark with 7,352 records across six labeled activities: walking, walking upstairs, walking downstairs, sitting, standing, and laying.
- Each record includes 561 sensor-derived features.
- Features were scaled using StandardScaler.
- The dataset was split into 70% training data and 30% test data.
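The preprocessing steps above can be sketched as follows. This is a minimal NumPy illustration, not the project's actual code: it uses synthetic data in place of UCI-HAR, the helper names `train_test_split_70_30` and `standardize` are hypothetical, and it assumes the scaling statistics are fitted on the training portion only (the write-up does not say whether scaling happened before or after the split). The standardization mirrors what scikit-learn's StandardScaler computes: subtract the per-feature mean and divide by the per-feature population standard deviation.

```python
import numpy as np

def train_test_split_70_30(X, y, test_fraction=0.3, seed=0):
    """Shuffle the rows once, then hold out `test_fraction` of them for testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

def standardize(X_train, X_test):
    """Scale features to zero mean and unit variance using training statistics only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)   # population std (ddof=0), as StandardScaler uses
    std[std == 0.0] = 1.0       # leave constant features unscaled instead of dividing by zero
    return (X_train - mean) / std, (X_test - mean) / std

# Synthetic stand-in for the real data (assumption): 1,000 rows, 561 features, 6 classes.
rng = np.random.default_rng(42)
X = rng.normal(loc=3.0, scale=2.0, size=(1000, 561))
y = rng.integers(0, 6, size=1000)

X_train, X_test, y_train, y_test = train_test_split_70_30(X, y)
X_train_s, X_test_s = standardize(X_train, X_test)
```

Fitting the mean and standard deviation on the training rows alone keeps test-set statistics out of the model's inputs, which is the usual way to avoid an optimistic accuracy estimate.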
Architectures
The Full Transformer architecture used positional encoding, six parallel Conv1D branches, multi-head self-attention, LSTM layers, global average pooling, dropout regularization, and a final softmax classifier over the six activities.
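Two of the components listed above, positional encoding and self-attention, can be illustrated in a few lines of NumPy. This is a sketch under assumptions rather than the project's implementation: it assumes the sinusoidal encoding from the original Transformer design (the write-up does not say which encoding was used), shows a single attention head instead of multiple heads, and uses arbitrary sizes (`seq_len = 8`, `d_model = 16`) with random weights. How the 561 features were arranged into a sequence is not stated, so the input here is random.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sine/cosine position signals."""
    positions = np.arange(seq_len)[:, None]   # shape (seq_len, 1)
    dims = np.arange(d_model)[None, :]        # shape (1, d_model)
    # Each pair of dimensions (2i, 2i+1) shares one frequency.
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])   # even dimensions
    encoding[:, 1::2] = np.cos(angles[:, 1::2])   # odd dimensions
    return encoding

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (seq_len, d_model) input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row now sums to 1
    return weights @ v, weights

# Hypothetical sizes and random inputs, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model = 8, 16
pe = sinusoidal_positional_encoding(seq_len, d_model)
x = rng.normal(size=(seq_len, d_model)) + pe
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))
attended, attention_weights = self_attention(x, w_q, w_k, w_v)
```

Adding the encoding to the input gives the attention layer access to order information, since attention on its own treats the sequence as an unordered set of positions.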
The proposed Joint Learning architecture combined a CNN-LSTM stream for local temporal feature learning with a Transformer stream for global attention-based modeling. The two streams were merged into a single classifier and trained jointly with the Adam optimizer for 50 epochs.
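The merge step can be sketched as a forward pass in NumPy. The write-up does not say how the two branches were merged, so this example assumes simple concatenation of each stream's feature vector followed by one dense softmax layer; the feature widths (64 and 32), the random features, and the function name `fuse_and_classify` are all hypothetical. In Keras, the same idea would typically be expressed with a `Concatenate` layer followed by `Dense(6, activation="softmax")`.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def fuse_and_classify(cnn_lstm_features, transformer_features, weights, bias):
    """Concatenate the two streams' features and apply one dense softmax layer."""
    fused = np.concatenate([cnn_lstm_features, transformer_features], axis=-1)
    return softmax(fused @ weights + bias)

# Random stand-ins for the outputs of the two streams (assumption).
rng = np.random.default_rng(0)
batch, d_cnn_lstm, d_transformer, n_classes = 4, 64, 32, 6
cnn_lstm_features = rng.normal(size=(batch, d_cnn_lstm))
transformer_features = rng.normal(size=(batch, d_transformer))
weights = rng.normal(scale=0.1, size=(d_cnn_lstm + d_transformer, n_classes))
bias = np.zeros(n_classes)

probs = fuse_and_classify(cnn_lstm_features, transformer_features, weights, bias)
predicted = probs.argmax(axis=1)   # one activity index per sample
```

Because a single loss is computed on the fused output, gradients flow back through both streams during training, which is what makes the learning joint rather than an ensemble of separately trained models.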
Results
- Full Transformer accuracy: 96%.
- Joint Learning accuracy: 98%.
- Joint Learning F1 score: 98%.
- The model was evaluated using a confusion matrix, a classification report, and ROC curves.
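For reference, accuracy and F1 can both be derived from the confusion matrix. The sketch below uses a tiny made-up set of labels with three classes, not the project's predictions, and it assumes macro averaging for the overall F1 score, since the write-up does not state which averaging was used. The function names are illustrative.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def per_class_f1(cm):
    """F1 = 2*TP / (2*TP + FP + FN), computed for each class."""
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)   # TP + FP for each class
    actual = cm.sum(axis=1)      # TP + FN for each class
    denom = predicted + actual
    return np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)

# Toy labels (assumption): three classes, six samples.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, n_classes=3)
accuracy = np.trace(cm) / cm.sum()   # 4 correct out of 6
f1 = per_class_f1(cm)                # [0.5, 0.8, 0.667] to three decimals
macro_f1 = f1.mean()
```

With balanced classes and few errors, accuracy and macro F1 tend to be close, which is consistent with both being reported as 98% for the Joint Learning model.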
My Contributions
- Designed, implemented, and trained both architectures using TensorFlow and Keras.
- Preprocessed the UCI-HAR dataset, standardized features, and performed data analysis.
- Benchmarked the results against state-of-the-art models and visualized evaluation metrics.
- Drafted model comparisons and handled performance tuning.
Future Work
- Experiment with GRU-Transformer hybrid architectures.
- Study performance scaling with more CNN heads.
- Collect larger and more diverse activity datasets.
- Deploy the model on mobile or edge environments for real-time activity tracking.
Final Takeaway
This work shows how deep fusion architectures can extract both short-term features and long-term patterns from multi-sensor data. Combining CNNs, LSTMs, and Transformers in the Joint Learning model reached 98% accuracy on UCI-HAR, compared with 96% for the Full Transformer, which makes it a promising candidate for real-world activity recognition.