We investigate option hedging in an incomplete market using a reinforcement learning algorithm, the double deep Q-network (DDQN). The DDQN agent learns an optimal policy that generates replicating portfolios without prior knowledge of the stochastic process driving the underlying asset price. First, we interpret the mean-variance approach to quadratic hedging within a reinforcement learning framework. We then conduct three simulation studies with different underlying asset price processes: geometric Brownian motion (GBM), the Heston model, and GBM with compound Poisson jumps. In each study, a DDQN agent learns the optimal policy, and we compare its performance with that of delta hedging. Second, we discuss limitations that stem from the structure of reinforcement learning when applied to finance.
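
For readers unfamiliar with DDQN, the step that distinguishes it from a standard deep Q-network is how the learning target is formed: the online network selects the greedy next action, and a separate target network evaluates it. The following is a minimal, generic sketch of that target computation in NumPy. The function name `ddqn_targets`, its array arguments, and the discount factor `gamma` are assumptions made for illustration only; they do not reproduce the network architecture, state and action design, or reward used in this study.

```python
import numpy as np


def ddqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    """Compute double-DQN targets for a batch of transitions (illustrative).

    rewards, dones:               arrays of shape (batch,)
    q_online_next, q_target_next: arrays of shape (batch, n_actions) holding
                                  Q-values of the next state under the online
                                  and target networks (hypothetical inputs).
    """
    rewards = np.asarray(rewards, dtype=float)
    dones = np.asarray(dones, dtype=float)
    q_online_next = np.asarray(q_online_next, dtype=float)
    q_target_next = np.asarray(q_target_next, dtype=float)

    # The online network selects the greedy action in the next state.
    best_actions = np.argmax(q_online_next, axis=1)

    # The target network evaluates that selected action.
    batch_index = np.arange(rewards.shape[0])
    next_values = q_target_next[batch_index, best_actions]

    # Terminal transitions (dones == 1) do not bootstrap from the next state.
    return rewards + gamma * (1.0 - dones) * next_values
```

In a hedging setting one might, for example, let each action index correspond to a discretized position in the underlying asset; that mapping is likewise an assumption of this sketch rather than a description of the method studied here.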