June 2025 | Research Work

Smart Phone Sensor Data Fusion: A Joint Learning Approach to Activity Recognition

A deep learning system for smartphone-based human activity recognition using accelerometer and gyroscope data, comparing a Full Transformer model with a proposed Joint Learning fusion architecture.


Overview

This project developed a deep learning-based system to classify human activities using smartphone sensor data from accelerometers and gyroscopes. Two architectures were designed and tested: a Full Transformer model and a Joint Learning model that fuses CNN-LSTM and Transformer components.

The goal was to combine spatial and temporal modeling techniques so the system could recognize activity patterns from noisy, multi-dimensional sensor signals while handling subtle transitions between similar movements.

Motivation

Smartphones and wearable devices have made activity recognition useful for fitness tracking, elderly care, and mobile health. Real-world sensor data is noisy and high-dimensional, which makes it difficult for traditional models to capture both local patterns and long-range dependencies.

This research used a hybrid architecture that integrates CNNs for spatial pattern extraction, Transformers for long-range dependencies, and LSTMs for sequential modeling.

Dataset

The work used the UCI-HAR (Human Activity Recognition) dataset, a benchmark with 7,352 records across six labeled activities: walking, walking upstairs, walking downstairs, sitting, standing, and laying.

  • Each record includes 561 sensor-derived features.
  • Features were scaled using StandardScaler.
  • The dataset was split into 70% training data and 30% test data.
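
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the function name `prepare_data`, the stratified split, the random seed, and fitting the scaler on the training portion only are all assumptions, and the demo arrays are synthetic stand-ins for the real 561-feature records.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def prepare_data(X, y, test_size=0.30, seed=42):
    """Split into 70% train / 30% test, then standardize the features.

    Assumption: the split is stratified by activity label and the scaler
    is fitted on the training rows only, so test statistics do not leak.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    scaler = StandardScaler().fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test), y_train, y_test


# Synthetic stand-in data: 600 records, 561 features, 6 activity classes.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=3.0, scale=2.0, size=(600, 561))
y_demo = np.arange(600) % 6

X_tr, X_te, y_tr, y_te = prepare_data(X_demo, y_demo)
```

After scaling, each training feature has approximately zero mean and unit variance, while the test features are transformed with the same training statistics.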

Architectures

The Full Transformer architecture used positional encoding, six parallel Conv1D branches, multi-head self-attention, LSTM layers, global average pooling, dropout regularization, and a final Softmax classifier over six activities.
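
The write-up does not say which positional encoding was used. As one common option, the fixed sinusoidal encoding can be sketched in NumPy as below; the function name and the choice of `d_model=32` are assumptions made for illustration only.

```python
import numpy as np


def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of fixed sinusoidal encodings.

    Even columns hold sin(pos / 10000^(2i/d_model)) and odd columns hold
    the matching cos term, so each position gets a distinct pattern.
    """
    positions = np.arange(seq_len)[:, None]   # shape (seq_len, 1)
    dims = np.arange(d_model)[None, :]        # shape (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates          # shape (seq_len, d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe


# Hypothetical usage: one encoding row per element of a 561-step sequence.
pe = sinusoidal_positional_encoding(seq_len=561, d_model=32)
```

The resulting matrix is typically added to the projected input sequence before the self-attention layers, so that attention can distinguish positions.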

The proposed Joint Learning architecture combined a CNN-LSTM stream for local temporal feature learning with a Transformer stream for global attention-based modeling. The branches were merged and trained jointly with Adam for 50 epochs.
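
A two-stream model of this kind could be sketched in Keras as shown below. This is an illustrative sketch under stated assumptions rather than the architecture from the paper: the function name `build_joint_model`, every layer size, the reshaping of the 561 features into a length-561 sequence with one channel, the concatenation merge, the dropout rate, and the loss function are all hypothetical choices, and positional encoding is omitted for brevity.

```python
import numpy as np
from tensorflow.keras import Model, layers


def build_joint_model(n_features=561, n_classes=6):
    """Two-stream sketch: a CNN-LSTM branch and a Transformer branch, merged."""
    inputs = layers.Input(shape=(n_features, 1))

    # CNN-LSTM stream: local patterns via convolution, then sequence modeling.
    c = layers.Conv1D(32, kernel_size=5, padding="same", activation="relu")(inputs)
    c = layers.MaxPooling1D(pool_size=4)(c)
    c = layers.LSTM(32)(c)

    # Transformer stream: one encoder block with self-attention.
    t = layers.Dense(32)(inputs)  # project each step to 32 dimensions
    attn = layers.MultiHeadAttention(num_heads=2, key_dim=16)(t, t)
    t = layers.LayerNormalization()(layers.Add()([t, attn]))
    ff = layers.Dense(64, activation="relu")(t)
    ff = layers.Dense(32)(ff)
    t = layers.LayerNormalization()(layers.Add()([t, ff]))
    t = layers.GlobalAveragePooling1D()(t)

    # Merge both streams and classify; the whole graph trains jointly.
    merged = layers.Concatenate()([c, t])
    merged = layers.Dropout(0.3)(merged)
    outputs = layers.Dense(n_classes, activation="softmax")(merged)

    model = Model(inputs, outputs)
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",  # assumes integer labels 0..5
        metrics=["accuracy"],
    )
    return model


model = build_joint_model()
```

Training would then look something like `model.fit(X_train[..., None], y_train, epochs=50)`, where `X_train` is a hypothetical array of standardized features and the trailing `None` index adds the channel axis the model expects.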

Results

  • Full Transformer accuracy: 96%.
  • Joint Learning accuracy: 98%.
  • Joint Learning F1 score: 98%.
  • The model was evaluated using a confusion matrix, a classification report, and ROC curves.
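
The evaluation listed above could be reproduced with scikit-learn along these lines. This is a sketch only: the helper name `evaluate`, the class ordering in `ACTIVITIES`, and the one-vs-rest macro-averaged AUC (used here as a single-number summary of the per-class ROC curves) are assumptions, and the demo inputs are made-up perfect predictions, not the project's results.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Assumed label order; the real encoding may differ.
ACTIVITIES = [
    "walking", "walking upstairs", "walking downstairs",
    "sitting", "standing", "laying",
]


def evaluate(y_true, y_prob):
    """Return confusion matrix, per-class report, and one-vs-rest ROC AUC.

    y_true: integer labels, shape (n,)
    y_prob: softmax outputs, shape (n, 6), each row summing to 1
    """
    labels = np.arange(len(ACTIVITIES))
    y_pred = np.argmax(y_prob, axis=1)
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    report = classification_report(
        y_true, y_pred, labels=labels, target_names=ACTIVITIES,
        output_dict=True, zero_division=0,
    )
    auc = roc_auc_score(y_true, y_prob, multi_class="ovr", labels=labels)
    return cm, report, auc


# Made-up demo: every class appears and every prediction is correct.
y_demo_true = np.array([0, 1, 2, 3, 4, 5, 0, 1])
y_demo_prob = np.eye(6)[y_demo_true]
cm, report, auc = evaluate(y_demo_true, y_demo_prob)
```

Rows of the confusion matrix correspond to true activities and columns to predicted ones, which makes confusions between similar classes such as sitting and standing easy to spot.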

My Contributions

  • Designed, implemented, and trained both architectures using TensorFlow and Keras.
  • Preprocessed the UCI-HAR dataset, standardized features, and performed data analysis.
  • Benchmarked the results against state-of-the-art models and visualized evaluation metrics.
  • Drafted model comparisons and handled performance tuning.

Future Work

  • Experiment with GRU-Transformer hybrid architectures.
  • Study performance scaling with more CNN heads.
  • Collect larger and more diverse activity datasets.
  • Deploy the model on mobile or edge environments for real-time activity tracking.

Final Takeaway

This work shows how deep fusion architectures can extract both short-term features and long-term patterns from multi-sensor data. Combining CNNs, LSTMs, and Transformers in the Joint Learning model reached 98% accuracy on UCI-HAR, compared with 96% for the Full Transformer, a promising result for real-world activity recognition.